# Part 2: Sentiment analysis


_Exercise: Creating Word Shifts_
>    1. Pick a day of your choice in 2020. We call it $d$. It is more interesting if you pick a day where you expect something relevant to occur (e.g. Christmas, New Year, Corona starting, the market crashes...).
>    2. Build two lists $l$ and $l_{ref}$ containing all tokens for submissions posted on r/wallstreebets on day $d$, and in the 7 days preceding day $d$, respectively. 
>    3. For each token $i$, compute the relative frequency in the two lists $l$ and $l_{ref}$. We call them $p(i,l)$ and $p(i,l_{ref})$, respectively. The relative frequency is computed as the number of times a token occurs over the total length of the document. Store the result in a dictionary.
>    4. For each token $i$, compute the difference in relative frequency $\delta p(i) = p(i,l) - p(i,l_{ref})$. Store the values in a dictionary. Print the top 10 tokens (those with largest relative frequency). Do you notice anything interesting?
>    5. Now, for each token, compute the happiness $h(i) = labMT(i) - 5$, using the labMT dictionary. Here, we subtract $5$, so that positive tokens will have a positive value and negative tokens will have a negative value. Then, compute the product $\delta \Phi = h(i)\cdot \delta p(i)$. Store the results in a dictionary. 
>    6. Print the top 10 tokens, ordered by the absolute value of $|\delta \Phi|$. Explain in your own words the meaning of $\delta \Phi$. If that is unclear, have a look at [this page](https://shifterator.readthedocs.io/en/latest/cookbook/weighted_avg_shifts.html).
>    7. Now install the [``shifterator``](https://shifterator.readthedocs.io/en/latest/installation.html) Python package. We will use it for plotting Word Shifts. 
>    8. Use the function ``shifterator.WeightedAvgShift`` to plot the WordShift, showing which words contributed the most to make your day of choice _d_ happier or more sad then days in the preceding 7 days. Comment on the figure. 
>    9. How do words that you printed in step 6 relate to those shown by the WordShift? 

In [300]:
import nltk, re, pprint
from nltk import word_tokenize
import numpy as np
import pandas as pd
import matplotlib.dates as mdate
from scipy.stats import pearsonr
from scipy import stats
import matplotlib.pyplot as plt
import re
import datetime as datetime

data = pd.read_csv("wallstreet_subs.csv", parse_dates = True)

In [301]:
data.head()

Unnamed: 0,created_utc,title,selftext,score
0,1586173811,What is the Fed actually buying?,"Okay, I may actually just be retarded. On my d...",1
1,1586173320,I didn’t learn about puts because I was lazy,"Beginning of the this virus shit, everyone was...",1
2,1586173268,HOT TAKE,Literally everyone has free time on their hand...,1
3,1586172639,Fuck you Gordon,"Gordon I believed in you, I can't even begin t...",1
4,1586171822,Can’t find a picture,Someone uploaded a ohoto of the stock market h...,1


Combining titles and main text

In [302]:
empty = []
for i in range(len(data['title'])):
    empty.append(data['title'][i]+' '+data['selftext'][i])
data['text'] = empty

>    1. Pick a day of your choice in 2020. We call it $d$. It is more interesting if you pick a day where you expect something relevant to occur (e.g. Christmas, New Year, Corona starting, the market crashes...).

We pick November 4rd as d, which is the day after election day of the US presidential election.

OR

We pick March 13th as d, which is the day the president delcared a state of emergency in the country over the Corona virus

Converting utc to datetime

In [303]:
data['date'] = [datetime.datetime.fromtimestamp(ts) for ts in data['created_utc']]
data['date'] = data['date'].dt.date
data['date'] = pd.to_datetime(data['date'])


In [304]:
data.set_index('date', inplace = True)

In [305]:
data.loc['2020-11-03']

Unnamed: 0_level_0,created_utc,title,selftext,score,text
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-11-03,1604382582,Calling it now,Trump wins...\nMarket crashes as the party of ...,1,Calling it now Trump wins...\nMarket crashes a...
2020-11-03,1604381964,Turnaround day for COLM,Fellow Autists:\nCOLM had a trash earnings cal...,1,Turnaround day for COLM Fellow Autists:\nCOLM ...
2020-11-03,1604381864,"Guaranteed 100% gains, don't miss out you Ret....",Many of you R words have provably not even hea...,1,"Guaranteed 100% gains, don't miss out you Ret...."
2020-11-03,1604381141,BREAKING ELECTION NEWS,BREAKING: Less than 24 hours before Election D...,1,BREAKING ELECTION NEWS BREAKING: Less than 24 ...
2020-11-03,1604380951,"SHORT SELLERS HAVE TO COVER, PUMP INCOMING GNCA","hey fellow cigarettes,\n\nHere's a good tip fo...",1,"SHORT SELLERS HAVE TO COVER, PUMP INCOMING GNC..."
...,...,...,...,...,...
2020-11-03,1604426628,"since there's a lot of dead bears out today,I ...",https://youtu.be/wMxKbkWrvDc,1,"since there's a lot of dead bears out today,I ..."
2020-11-03,1604426482,Is anyone buying $NEE here?!,A big IF since no one knows who wins tonight e...,1,Is anyone buying $NEE here?! A big IF since no...
2020-11-03,1604425864,$BHC Earnings Call Jitters?,BHC was up 6% after it beat revenue and profit...,1,$BHC Earnings Call Jitters? BHC was up 6% afte...
2020-11-03,1604425400,A warning ⚠️ for my beloved autist’s,I know I know take a look at this guy is what ...,1,A warning ⚠️ for my beloved autist’s I know I ...


>    2. Build two lists $l$ and $l_{ref}$ containing all tokens for submissions posted on r/wallstreebets on day $d$, and in the 7 days preceding day $d$, respectively.

In [412]:
d = data.loc['2020-03-13'].index[0]
d_min = d - datetime.timedelta(days=7)

In [356]:
import re, string, unicodedata
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

def remove_URL(sample):
    """Remove URLs from a sample string"""
    return re.sub(r"http\S+", "", sample)

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all interger occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_numbers(words):
    """Replace all interger occurrences in list of tokenized words with textual representation"""
    new_words = []
    for word in words:
        if word.isdigit():
            continue
        else:
            new_words.append(word)
    return new_words


def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_numbers(words)
    words = remove_stopwords(words)
    return words

def preprocess(sample):
    sample = remove_URL(sample)
    sample = replace_contractions(sample)
    # Tokenize
    words = nltk.word_tokenize(sample)

    # Normalize
    return normalize(words)

In [357]:
#data['tokens'] = [preprocess(data['text'][i]) for i in range(len(data['text']))]

#^takes a long time to run

In [413]:
l = np.ndarray.tolist(np.concatenate(data.loc[d]['tokens']))

In [414]:
l_ref = []
date = d_min
while date <= d:
    l_ref.append(np.concatenate(data.loc[date]['tokens']))
    date += datetime.timedelta(days=1)
l_ref = [item for sublist in l_ref for item in sublist]
    

>    3. For each token $i$, compute the relative frequency in the two lists $l$ and $l_{ref}$. We call them $p(i,l)$ and $p(i,l_{ref})$, respectively. The relative frequency is computed as the number of times a token occurs over the total length of the document. Store the result in a dictionary.

In [415]:
from collections import Counter

In [419]:
p = dict([(item[0],item[1]/len(l)) for item in Counter(l).items()])
p_ref = dict([(item[0],item[1]/len(l_ref)) for item in Counter(l_ref).items()])

In [420]:
sorted(p.items(), key = lambda x:x[1],reverse=True)[:10]

[('puts', 0.011876339149066422),
 ('going', 0.00945821854912764),
 ('market', 0.008172635445362718),
 ('people', 0.00655035200489746),
 ('get', 0.006213651668197122),
 ('like', 0.006091215182124273),
 ('buy', 0.005693296602387512),
 ('go', 0.004897459442913988),
 ('spy', 0.004897459442913988),
 ('would', 0.0047444138353229266)]

In [421]:
sorted(p_ref.items(), key = lambda x:x[1],reverse=True)[:10]

[('puts', 0.008769190874114216),
 ('going', 0.007578443772921882),
 ('market', 0.007266203421942559),
 ('like', 0.005837306900511757),
 ('get', 0.005609741898950555),
 ('people', 0.005302793757309864),
 ('amp', 0.005011722243685071),
 ('buy', 0.004731235148737543),
 ('would', 0.004387241541726424),
 ('time', 0.004291981773631038)]

>    4. For each token $i$, compute the difference in relative frequency $\delta p(i) = p(i,l) - p(i,l_{ref})$. Store the values in a dictionary. Print the top 10 tokens (those with largest relative frequency). Do you notice anything interesting?

In [422]:
all_tokens = set(p.keys()).union(set(p_ref.keys()))

In [423]:
dp = dict([(token, p.get(token,0) - p_ref.get(token,0)) for token in all_tokens])

In [424]:
sorted(dp.items(), key = lambda x:x[1], reverse = True)[:10]

[('puts', 0.0031071482749522056),
 ('going', 0.0018797747762057584),
 ('trump', 0.0017760494170471043),
 ('today', 0.0013283633348227172),
 ('fucking', 0.0013010406639645039),
 ('next', 0.0012511225108896443),
 ('people', 0.0012475582475875956),
 ('weekend', 0.0009784937069499647),
 ('even', 0.0009724952217399995),
 ('buy', 0.0009620614536499685)]

Notice anything interesting??

Not really...?

>    5. Now, for each token, compute the happiness $h(i) = labMT(i) - 5$, using the labMT dictionary. Here, we subtract $5$, so that positive tokens will have a positive value and negative tokens will have a negative value. Then, compute the product $\delta \Phi = h(i)\cdot \delta p(i)$. Store the results in a dictionary. 

In [453]:
labMt_dict = pd.read_csv("Hedonometer.csv")

In [465]:
labMt_dict = labMt_dict.set_index("Word")

In [466]:
labMt_dict.loc["love"]["Happiness Score"]

8.42

In [477]:
print(labMt_dict["Happiness Score"].get("love",np.nan))

8.42


In [479]:
h = dict([(token, labMt_dict["Happiness Score"].get(token,np.nan)-5) for token in all_tokens])

In [482]:
h[""]

nan

In [438]:
max(h, key = h.get)

'tops'

In [439]:
h["tops"]

nan

In [441]:
labMt_dict["happy"]

KeyError: 'happy'