# Sentiment analysis feature

We use 5 different sentiment measures:

__2 from packages__
- TextBlob
- VaderSentiment

__3 word sentiment libraries__
- AFINN-111
- SentStrength
- ANEW

Before writing these functions, we created dictionaries (stored as JSON files) from the word sentiment libraries. Because AFINN and SentStrength both scored words on a -2 to +2 scale, I could combine them into a single dictionary. If a word was scored in both libraries, I used SentStrength's score rather than AFINN's.
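The merge rule above can be sketched as follows. This is a minimal illustration, not the script that produced `sentiment_dict.json`: the function name `merge_lexicons` and the toy word scores are hypothetical, and both inputs are assumed to be plain word-to-score dictionaries already on the same scale.

```python
def merge_lexicons(afinn, sentstrength):
    """Combine two word -> score dictionaries, preferring SentStrength on overlap."""
    merged = dict(afinn)          # start from the AFINN scores
    merged.update(sentstrength)   # SentStrength overwrites any shared word
    return merged


# Toy scores, made up for illustration only
afinn = {"good": 2, "bad": -2, "awful": -2}
sentstrength = {"good": 1, "happy": 2}

merged = merge_lexicons(afinn, sentstrength)
print(merged)
```

Here "good" appears in both inputs, so the merged dictionary keeps the SentStrength value for it, while words found in only one library are carried over unchanged.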

ANEW scores each word on 3 dimensions: Valence, Dominance and Arousal:
- Valence: how positive or negative a word is (values range from 1.25 to 8.82)
- Dominance: how important the word is to the sentence
- Arousal: how much impact a word has on the person reading it

To get a positive/negative sentiment, I transformed the valence to a value on a -2 to +2 scale. I then multiplied that valence by the sum of arousal and dominance (the effect of the word on the sentence): $Valence \times (Arousal + Dominance)$.
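As a rough sketch of this calculation: the text does not say how the valence was rescaled, so the example below assumes a linear mapping from the observed range (1.25 to 8.82) onto -2 to +2. The function names and the sample arousal and dominance values are hypothetical.

```python
VALENCE_MIN = 1.25  # lowest valence listed above
VALENCE_MAX = 8.82  # highest valence listed above


def rescale_valence(valence, low=VALENCE_MIN, high=VALENCE_MAX):
    """Linearly map a valence in [low, high] onto [-2, +2] (assumed mapping)."""
    return (valence - low) / (high - low) * 4 - 2


def anew_word_score(valence, arousal, dominance):
    """Valence * (Arousal + Dominance), with valence first rescaled to [-2, +2]."""
    return rescale_valence(valence) * (arousal + dominance)


# Made-up arousal and dominance values, for illustration only
most_positive = anew_word_score(8.82, arousal=5.0, dominance=6.0)
most_negative = anew_word_score(1.25, arousal=5.0, dominance=6.0)
print(most_positive, most_negative)
```

Under this assumption a word with the midpoint valence contributes nothing, and the sign of the score depends only on the valence, while arousal and dominance scale its size.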

The functions clean each word in a tweet, convert it to lowercase and look it up in the dictionary. The sum of the values of all words found in a tweet is the tweet's sentiment score.
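The scoring step can be tried on a toy example without the JSON files. The sketch below is a simplified stand-in for the notebook's functions: `toy_lexicon`, `clean_word` and `score_tweet` are hypothetical names, and the word scores are made up.

```python
PUNCTUATION = ",.:?#()!';&"  # characters stripped from every word


def clean_word(word):
    """Lowercase a word and remove punctuation and hashtag characters."""
    return word.lower().translate(str.maketrans("", "", PUNCTUATION))


def score_tweet(tweet, lexicon):
    """Sum the lexicon scores of all words in the tweet; unknown words count as 0."""
    return sum(lexicon.get(clean_word(w), 0) for w in tweet.split())


toy_lexicon = {"great": 2, "terrible": -2}  # made-up scores

positive = score_tweet("What a GREAT day, great weather!", toy_lexicon)
negative = score_tweet("#Terrible traffic.", toy_lexicon)
print(positive, negative)
```

Because a word is counted every time it occurs, repeated words add up, so longer tweets can reach larger absolute scores than short ones.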

After computing the sentiment, we can also flag extreme sentiment. We do this based on extreme values within the DataFrame.
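The text does not say how the cutoffs for "extreme" were chosen. One possible approach, shown here purely as an assumption, is to take cutoffs from the tails of the score distribution. The helper names, the choice of 20 groups (roughly 5% per tail) and the sample scores are all hypothetical.

```python
from statistics import quantiles


def extreme_cutoffs(scores, n_groups=20):
    """Return (low, high) cutoffs that leave about 1/n_groups of the scores in each tail."""
    cuts = quantiles(scores, n=n_groups)
    return cuts[0], cuts[-1]


def flag_extreme(scores, low, high):
    """Return 1 for scores outside [low, high], 0 otherwise."""
    return [1 if (s < low or s > high) else 0 for s in scores]


sample_scores = list(range(-10, 11))  # made-up sentiment scores
low, high = extreme_cutoffs(sample_scores)
flags = flag_extreme(sample_scores, low, high)
print(low, high, sum(flags))
```

Cutoffs derived this way move with the data, so they would need to be recomputed if the set of tweets changes.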

## Import packages

In [9]:
import pandas as pd
import numpy as np
import nltk
import json

from sklearn.model_selection import train_test_split #We need this to split the data
from sklearn.preprocessing import normalize #get the function needed to normalize our data.
from sklearn.neighbors import KNeighborsClassifier #the object class we need
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

from textblob import TextBlob, Word, Blobber
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## Define functions

In [10]:
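from functools import lru_cache

@lru_cache(maxsize=None)  #cache the parsed dictionary so each file is read only once
def open_file(filename):
    with open(filename) as json_file:
        return json.load(json_file)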

def RT(tweet):
    #True when the tweet text contains the retweet marker
    return 'RT @' in tweet

def clean_words(word):
    #make the word lower case and strip punctuation and hashtag characters
    word = word.lower()
    for char in ",.:?#()!';&":
        word = word.replace(char, "")
    return word

def add_sentiment(tweet):
    #open the dictionary with the sentiment and set default sentiment to 0 (neutral)
    words = open_file('sentiment_dict.json')
    sentiment = 0

    #for each word in the tweet
    for w in tweet.split():
        #clean the word: remove hashtags, punctuation, etc. and make it lower case
        w = clean_words(w)

        #look the word up directly; words that are not in the dictionary add 0
        sentiment += words.get(w, 0)

    #after all words are matched return total sum of sentiment
    return sentiment

def add_anew_sentiment(tweet):
    #open the dictionary with the sentiment and set default sentiment to 0 (neutral)
    words = open_file('anew_sentiment_dict.json')
    sentiment = 0

    #for each word in the tweet
    for w in tweet.split():
        #clean the word: remove hashtags, punctuation, etc. and make it lower case
        w = clean_words(w)

        #look the word up directly; words that are not in the dictionary add 0
        sentiment += words.get(w, 0)

    #after all words are matched return total sum of sentiment
    return sentiment

def add_extreme_sentiment(df):
    df['extreme_AFINN_SentStrength'] = np.where((df['sentiment_AFINN_SentStrength'] > 3) | (df['sentiment_AFINN_SentStrength'] < -3), 1, 0)
    df['extreme_textblob'] = np.where((df['sentiment_textblob'] > 0.25) | (df['sentiment_textblob'] < -0.25), 1, 0)
    df['extreme_ANEW'] = np.where((df['sentiment_ANEW'] > 20) | (df['sentiment_ANEW'] < -19), 1, 0)
    return df

def add_vader_sentiment(df):
    analyzer = SentimentIntensityAnalyzer()

    #score each tweet once and reuse the result for all four columns
    scores = [analyzer.polarity_scores(x) for x in df['text']]
    for key in ['compound', 'neg', 'neu', 'pos']:
        df[key] = [s[key] for s in scores]
    return df

## Import DataFrame

In [11]:
df = pd.read_csv('tweets_labeled.csv', index_col=0)
df.head()

| tweet_id | text | label |
|---|---|---|
| 1161040537207463936 | 'RT @SenJeffMerkley: The Endangered Species Ac... | 1 |
| 1176360756239118342 | 'RT @LindseyGrahamSC: Interesting concept -- i... | 1 |
| 1099036648573145088 | 'RT @RealJamesWoods: #BuildTheWall #DeportThem... | 0 |
| 1092915693203480577 | 'RT @PatriotJackiB: Why would the MEXICAN GOV’... | 0 |
| 1149038450668187654 | 'RT @TheOnion: Sweden Announces Plan To Get 10... | 0 |


## No Retweets
Apply this step only if you don't want retweets (RTs) in your dataset.

In [12]:
df['RT'] = df['text'].apply(RT)
df = df[~df['RT']].copy()  #copy so the columns added below go into an independent DataFrame

## Apply functions to add sentiment

In [13]:
df['sentiment_textblob'] = df['text'].apply(lambda x: TextBlob(x).sentiment[0])
df['sentiment_AFINN_SentStrength'] = df['text'].apply(add_sentiment)
df['sentiment_ANEW'] = df['text'].apply(add_anew_sentiment)
df = add_vader_sentiment(df)

## Apply functions for extreme sentiment 
This is not yet implemented for Vader.

In [14]:
df = add_extreme_sentiment(df)
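An extreme flag for Vader could follow the same pattern as the other three. The sketch below is only a suggestion: the function name `add_extreme_vader`, the column name `extreme_vader` and the cutoff of 0.75 on the `compound` score are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

VADER_EXTREME_THRESHOLD = 0.75  # hypothetical cutoff, would need tuning on the real data


def add_extreme_vader(df, threshold=VADER_EXTREME_THRESHOLD):
    """Flag rows whose Vader compound score lies beyond +/- threshold."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df['extreme_vader'] = np.where(df['compound'].abs() > threshold, 1, 0)
    return df


# Small made-up frame with only the column the function needs
demo = pd.DataFrame({'compound': [0.0, -0.6808, 0.9, -0.85]})
demo = add_extreme_vader(demo)
print(demo)
```

The compound score is bounded between -1 and +1, so a single symmetric cutoff is enough here, unlike the unbounded summed scores of the dictionary-based measures.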

## New DataFrame

In [15]:
df.head()

| tweet_id | text | label | RT | sentiment_textblob | sentiment_AFINN_SentStrength | sentiment_ANEW | compound | neg | neu | pos | extreme_AFINN_SentStrength | extreme_textblob | extreme_ANEW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1081722778125062144 | 'Planned Parenthood Erects Billboards Urging W... | 0 | False | 0.0 | 0 | -9.98 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0 |
| 1158761795739217921 | 'https://t.co/MvrznF1fWVWhoever obstructing th... | 1 | False | 0.0 | -2 | 0.0 | -0.6808 | 0.318 | 0.682 | 0.0 | 0 | 0 | 0 |
| 1095142095621365760 | 'CAIR Backs Ilhan Omar's 'Legitimate Criticism... | 0 | False | 0.0 | -1 | 0.0 | -0.4767 | 0.291 | 0.709 | 0.0 | 0 | 0 | 0 |
| 1137856356818595841 | '@nopasa @cathmckenna https://t.co/ldEruis5Js' | 0 | False | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0 |
| 1090272871958695936 | 'Not suprised! ! https://t.co/PHc6lTQ0wl' | 0 | False | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0 |


## Check out correlations

In [16]:
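corr = df.corr(numeric_only=True)  #skip the text column; needed in pandas 2.0+, where corr() no longer drops it silently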
corr

| | label | RT | sentiment_textblob | sentiment_AFINN_SentStrength | sentiment_ANEW | compound | neg | neu | pos | extreme_AFINN_SentStrength | extreme_textblob | extreme_ANEW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| label | 1.0 | NaN | 0.039046 | -0.047524 | 0.035705 | -0.010043 | 0.016187 | -0.012116 | -0.001418 | 0.020305 | -0.037341 | 0.023894 |
| RT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sentiment_textblob | 0.039046 | NaN | 1.0 | 0.386611 | 0.205084 | 0.394417 | -0.300692 | -0.016245 | 0.35603 | -0.086154 | 0.098845 | 0.066466 |
| sentiment_AFINN_SentStrength | -0.047524 | NaN | 0.386611 | 1.0 | 0.390974 | 0.7625 | -0.60912 | 0.169768 | 0.444164 | -0.456032 | -0.022957 | -0.04813 |
| sentiment_ANEW | 0.035705 | NaN | 0.205084 | 0.390974 | 1.0 | 0.350701 | -0.236137 | 0.022178 | 0.231832 | -0.149907 | 0.036263 | 0.443821 |
| compound | -0.010043 | NaN | 0.394417 | 0.7625 | 0.350701 | 1.0 | -0.746319 | 0.148831 | 0.625106 | -0.305776 | -0.018894 | 0.038304 |
| neg | 0.016187 | NaN | -0.300692 | -0.60912 | -0.236137 | -0.746319 | 1.0 | -0.692309 | -0.163752 | 0.413045 | 0.205121 | 0.088974 |
| neu | -0.012116 | NaN | -0.016245 | 0.169768 | 0.022178 | 0.148831 | -0.692309 | 1.0 | -0.598493 | -0.37627 | -0.335124 | -0.198797 |
| pos | -0.001418 | NaN | 0.35603 | 0.444164 | 0.231832 | 0.625106 | -0.163752 | -0.598493 | 1.0 | 0.05586 | 0.230416 | 0.173016 |
| extreme_AFINN_SentStrength | 0.020305 | NaN | -0.086154 | -0.456032 | -0.149907 | -0.305776 | 0.413045 | -0.37627 | 0.05586 | 1.0 | 0.196126 | 0.193516 |
