## Sentiment Analysis

In this notebook, I perform sentiment analysis on the MT-GINCO dataset and prepare a model that can be used on the deployed site, in addition to the objectivity classifier.

I use the VADER lexicon, following the tutorial here: https://constellate.org/tutorials/sentiment-analysis-with-vader

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based algorithm that is "specifically attuned to sentiments expressed in social media." It relies on a specialized lexicon of words, phrases, and emojis. Each token in the lexicon is assigned a "mean-sentiment rating" between -4 (extremely negative) and 4 (extremely positive).
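
To make the lexicon idea concrete, here is a minimal sketch of a lexicon lookup. The three emoticon ratings are copied from the `sa.lexicon` preview in this notebook; the names `toy_lexicon` and `raw_valence` are hypothetical, and the plain summation ignores VADER's additional rules (negation, intensifiers, punctuation emphasis), so it only illustrates the data structure.

```python
# Toy subset of a valence lexicon; ratings run from -4 to +4.
toy_lexicon = {"(:": 2.2, "):": -1.8, "*)": 0.6}


def raw_valence(text, lexicon):
    """Sum the ratings of whitespace-separated tokens found in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in text.split())


# Tokens missing from the lexicon contribute nothing to the sum.
print(raw_valence("great day (: *)", toy_lexicon))
```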

In [36]:
import pandas as pd
from tqdm import tqdm
import pickle

In [None]:
!pip install vaderSentiment

In [3]:
# Import the SentimentIntensityAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Create the variable sa to hold the VADER analyzer (which loads the lexicon)
sa = SentimentIntensityAnalyzer()

In [37]:
# Save the analyzer so it can be loaded on the deployed site
with open("Vader-sentiment.pkl", "wb") as f:
    pickle.dump(sa, f)
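
On the deployment side, the pickled analyzer can be restored with `pickle.load`. The sketch below assumes the file name used above; `load_analyzer` is a hypothetical helper. Unpickling requires `vaderSentiment` to be installed in the loading environment, because pickle stores a reference to the class rather than its code, and pickle files should only be loaded from trusted sources.

```python
import pickle


def load_analyzer(path="Vader-sentiment.pkl"):
    """Load a previously pickled object (here, the VADER analyzer) from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage (assumes the pickle file exists and vaderSentiment is installed):
# sa = load_analyzer()
# sa.polarity_scores("This is lovely.")
```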

In [7]:
# Preview the lexicon contents
# There are over 7500 tokens in the lexicon
sa.lexicon

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 ...

In [10]:
# Import the dataset, prepared in "1-Data-Preparation.ipynb"
dataset = pd.read_csv("data&results/MT-GINCO-split-objectivity-dataset.csv")
dataset.describe()

Unnamed: 0,text,label,split
count,632,632,632
unique,632,2,2
top,"For the first time since 2008 <p/> Dallas, 12....",subjective,train
freq,1,378,505


In [11]:
dataset.head()

Unnamed: 0,text,label,split
0,"For the first time since 2008 <p/> Dallas, 12....",subjective,train
1,"17 replies to "" Even in the municipality of Ra...",subjective,train
2,Esimit Europa <p/> Vasili's main sponsor is Es...,subjective,train
3,Beekeepers' successes <p/> The 37th National M...,objective,train
4,Kundalini Yoga <p/> GUIDE <p/> Kundalini Yoga ...,subjective,train


Now we will analyze each text and assign it a "normalized, weighted composite score" (the compound score), computed by summing the valence scores of the words in the text that appear in the lexicon (with some adjustments based on word order and other rules) and then normalizing the sum. VADER also reports the proportions of the text that fall into positive, negative, and neutral sentiment. The compound score falls between -1 (the most negative) and +1 (the most positive). (This is different from the lexicon ratings, which range from -4 to +4!)
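
The mapping from a summed valence to the compound score can be sketched as follows. To my understanding, the reference implementation normalizes with x / sqrt(x² + α) using α = 15; treat the function name and the default value here as assumptions and check them against the installed `vaderSentiment` source.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# A summed valence of 2.8 (assumed here for illustration) maps to about 0.59.
print(round(normalize(2.8), 4))  # 0.5859
```

Because the summed valence tends to grow with text length, long documents can saturate near -1 or +1, which may explain the many compound scores above 0.9 in this dataset.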

In [20]:
text_list = list(dataset["text"])
len(text_list)

632

In [19]:
# polarity_scores expects a single string, not a list
example = sa.polarity_scores("This is lovely.")

example

{'neg': 0.0, 'neu': 0.345, 'pos': 0.655, 'compound': 0.5859}

In [24]:
scores_list = []

# polarity_scores expects a single string, so pass each text directly
for text in tqdm(text_list):
    scores_list.append(sa.polarity_scores(text))

100%|██████████| 632/632 [00:11<00:00, 55.49it/s] 


In [25]:
scores_list

[{'neg': 0.027, 'neu': 0.756, 'pos': 0.217, 'compound': 0.9887},
 {'neg': 0.081, 'neu': 0.825, 'pos': 0.094, 'compound': 0.952},
 {'neg': 0.0, 'neu': 0.867, 'pos': 0.133, 'compound': 0.9771},
 {'neg': 0.0, 'neu': 0.893, 'pos': 0.107, 'compound': 0.8442},
 {'neg': 0.05, 'neu': 0.77, 'pos': 0.18, 'compound': 0.9904},
 {'neg': 0.025, 'neu': 0.857, 'pos': 0.118, 'compound': 0.9562},
 {'neg': 0.031, 'neu': 0.835, 'pos': 0.134, 'compound': 0.9738},
 {'neg': 0.024, 'neu': 0.854, 'pos': 0.123, 'compound': 0.9538},
 {'neg': 0.027, 'neu': 0.924, 'pos': 0.049, 'compound': 0.2716},
 {'neg': 0.069, 'neu': 0.847, 'pos': 0.084, 'compound': 0.8308},
 {'neg': 0.011, 'neu': 0.961, 'pos': 0.028, 'compound': 0.1901},
 {'neg': 0.0, 'neu': 0.956, 'pos': 0.044, 'compound': 0.3382},
 {'neg': 0.008, 'neu': 0.912, 'pos': 0.079, 'compound': 0.9911},
 {'neg': 0.014, 'neu': 0.806, 'pos': 0.18, 'compound': 0.9777},
 {'neg': 0.021, 'neu': 0.848, 'pos': 0.132, 'compound': 0.9527},
 {'neg': 0.155, 'neu': 0.817, 'pos':

In [29]:
# Keep only the compound score for each text
compound_scores_list = [scores["compound"] for scores in scores_list]

compound_scores_list

[0.9887,
 0.952,
 0.9771,
 0.8442,
 0.9904,
 0.9562,
 0.9738,
 0.9538,
 0.2716,
 0.8308,
 0.1901,
 0.3382,
 0.9911,
 0.9777,
 0.9527,
 -0.9897,
 -0.861,
 0.687,
 -0.9353,
 0.9707,
 0.8553,
 0.998,
 0.9914,
 0.9976,
 0.7579,
 0.8515,
 0.9274,
 0.1724,
 0.3612,
 0.9506,
 0.9182,
 0.9559,
 0.7964,
 0.5423,
 -0.9917,
 0.8775,
 0.9647,
 0.8735,
 0.9746,
 0.9871,
 0.8442,
 0.9117,
 0.886,
 0.2716,
 -0.9108,
 0.8979,
 0.7351,
 -0.7736,
 -0.785,
 0.9933,
 -0.8402,
 0.9971,
 -0.3107,
 0.9482,
 0.998,
 0.7964,
 0.9493,
 0.4753,
 0.0,
 0.9802,
 -0.9966,
 0.9169,
 0.9268,
 0.9127,
 0.962,
 0.1779,
 -0.7591,
 0.8885,
 0.6486,
 0.9845,
 0.8481,
 0.9022,
 -0.5574,
 0.9423,
 0.9927,
 0.7351,
 0.7747,
 0.0,
 0.8641,
 0.9324,
 0.8816,
 0.3182,
 0.6705,
 0.8402,
 0.7356,
 0.7935,
 0.6482,
 0.9936,
 0.9981,
 0.9813,
 -0.2529,
 0.9822,
 0.967,
 0.9993,
 0.4767,
 0.9699,
 0.9865,
 0.278,
 0.7906,
 0.3612,
 0.9693,
 0.9634,
 0.6908,
 0.8976,
 0.8316,
 0.9972,
 -0.085,
 0.0,
 0.5106,
 0.4675,
 0.8986,
 0.718,

In [30]:
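# Store the full score dictionaries and the compound scores as new columns
dataset["polarity_scores"] = scores_list
dataset["sentiment_scores"] = compound_scores_list
dataset.tail()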

Unnamed: 0,text,label,split,polarity_scores,sentiment_scores
627,Ducks: nature outdoors <p/> Zofka did her bit ...,subjective,test,"{'neg': 0.0, 'neu': 0.901, 'pos': 0.099, 'comp...",0.7579
628,Most items have a quality letter described in ...,objective,test,"{'neg': 0.0, 'neu': 0.956, 'pos': 0.044, 'comp...",0.8402
629,The authors Jadranka Boban Pejić and Zlatko Pe...,subjective,test,"{'neg': 0.045, 'neu': 0.904, 'pos': 0.051, 'co...",0.2263
630,Description. The type and quantity of pollutan...,objective,test,"{'neg': 0.0, 'neu': 0.952, 'pos': 0.048, 'comp...",0.8316
631,Contributions <p/> We ask all parents to come ...,objective,test,"{'neg': 0.0, 'neu': 0.951, 'pos': 0.049, 'comp...",0.9246


In [32]:
def score_to_sentiment(score):
    """Binarize a compound score: 0.5 or higher is positive, anything lower is negative."""
    if score >= 0.5:
        return "positive"
    return "negative"

dataset["sentiment"] = dataset["sentiment_scores"].apply(score_to_sentiment)

dataset.head()

Unnamed: 0,text,label,split,polarity_scores,sentiment_scores,sentiment
0,"For the first time since 2008 <p/> Dallas, 12....",subjective,train,"{'neg': 0.027, 'neu': 0.756, 'pos': 0.217, 'co...",0.9887,positive
1,"17 replies to "" Even in the municipality of Ra...",subjective,train,"{'neg': 0.081, 'neu': 0.825, 'pos': 0.094, 'co...",0.952,positive
2,Esimit Europa <p/> Vasili's main sponsor is Es...,subjective,train,"{'neg': 0.0, 'neu': 0.867, 'pos': 0.133, 'comp...",0.9771,positive
3,Beekeepers' successes <p/> The 37th National M...,objective,train,"{'neg': 0.0, 'neu': 0.893, 'pos': 0.107, 'comp...",0.8442,positive
4,Kundalini Yoga <p/> GUIDE <p/> Kundalini Yoga ...,subjective,train,"{'neg': 0.05, 'neu': 0.77, 'pos': 0.18, 'compo...",0.9904,positive
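
The function above uses a single cutoff of 0.5 and two classes. For comparison, the VADER documentation suggests a three-way split with thresholds at ±0.05; a minimal sketch of that alternative follows. The function name `score_to_sentiment_3way` is hypothetical and is not used elsewhere in this notebook.

```python
def score_to_sentiment_3way(score, threshold=0.05):
    """Map a compound score to positive / neutral / negative."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


print([score_to_sentiment_3way(s) for s in (0.9887, 0.0, -0.9897)])
# ['positive', 'neutral', 'negative']
```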


In [34]:
dataset.sentiment.value_counts(normalize=True)

positive    0.718354
negative    0.281646
Name: sentiment, dtype: float64