# VADER - Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion.

In other words, it not only classifies a sentiment as positive or negative, but also measures how strong that sentiment is.

VADER also handles negation, i.e., it can differentiate between 'love' and 'do not love'.

It is available in NLTK. First we need to download the lexicon.

In [1]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/ohtar10-kudu/nltk_data...


True

Now let's load a sentiment intensity analyzer

In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
a = "This is a good movie"
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

As we can see, sentence 'a' contains no negative words, so the negative score is zero. On the contrary, the positive score is non-zero because we are saying that this is a 'good' movie.
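The compound value of 0.4404 can be reproduced by hand. The sketch below rests on two assumptions that come from the reference VADER implementation rather than from this notebook: the summed valence is normalized as `x / sqrt(x**2 + alpha)` with `alpha = 15`, and 'good' is the only scored word in the sentence, with a lexicon valence of 1.9. The function name `normalize_valence` is chosen here for illustration.

```python
import math

def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into the (-1, 1) range.

    Assumed to mirror the normalization used for VADER's compound score.
    """
    return total_valence / math.sqrt(total_valence ** 2 + alpha)

# Assumption: 'good' carries a valence of 1.9 and is the only scored word.
compound = normalize_valence(1.9)
print(round(compound, 4))  # 0.4404
```

Because of this normalization, the compound score approaches but never reaches -1 or 1, no matter how many strongly worded terms a text contains.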

VADER also recognizes capitalization and exclamation marks, both of which alter the intensity score.

In [7]:
b = "This was the best, most awesome movie EVER MADE!!!"
sid.polarity_scores(b)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

Notice how both the positive score and the compound score are now higher than they were for sentence 'a'.

Now, let's try with the opposite:

In [8]:
c = "This was the WORST movie that has ever disgraced the screen."
sid.polarity_scores(c)

{'neg': 0.465, 'neu': 0.535, 'pos': 0.0, 'compound': -0.8331}

The balance is skewed to the negative side now.
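To make the relationship between the four scores easier to see, here is a small check over the three outputs above. It relies only on the printed dictionaries: `neg`, `neu` and `pos` behave as proportions of the text and should add up to roughly 1, while the sign of `compound` gives the overall direction. The variable names are illustrative.

```python
# Scores copied from the outputs of sentences a, b and c above.
examples = {
    "a": {"neg": 0.0, "neu": 0.508, "pos": 0.492, "compound": 0.4404},
    "b": {"neg": 0.0, "neu": 0.425, "pos": 0.575, "compound": 0.8877},
    "c": {"neg": 0.465, "neu": 0.535, "pos": 0.0, "compound": -0.8331},
}

summary = {}
for name, scores in examples.items():
    # The three proportions should sum to (approximately) 1.
    proportion_total = scores["neg"] + scores["neu"] + scores["pos"]
    direction = "positive" if scores["compound"] > 0 else "negative"
    summary[name] = (round(proportion_total, 3), direction)
    print(f"{name}: proportions sum to {proportion_total:.3f}, overall {direction}")
```

In practice the `compound` value is the one most often used for classification, since it condenses the whole sentence into a single number.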

## Explore a real dataset
Let's use an Amazon review dataset.

In [9]:
import pandas as pd

df = pd.read_csv('../UPDATED_NLP_COURSE/TextFiles/amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [11]:
df.label.value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

Ensure there are no missing values for evaluation.

In [12]:
df.dropna(inplace=True)

Let's check whether any reviews are blank (empty or whitespace-only strings).

In [18]:
df.review = df.review.apply(lambda r: r.strip())
df[df.review == '']

Unnamed: 0,label,review


We do not have empty records, so we can continue. If we had any, we would just need to drop those rows by index.
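For completeness, dropping blank rows by index could look like the sketch below. It uses a hypothetical three-row frame named `toy` instead of the real dataset, so the values are made up purely to show the mechanics.

```python
import pandas as pd

# Hypothetical toy frame; the second review is whitespace-only.
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos"],
    "review": ["Great sound", "   ", "Lovely"],
})

# Index labels of rows whose review is empty once whitespace is stripped.
blank_idx = toy[toy["review"].str.strip() == ""].index

# Drop those rows by index label; the original frame is left untouched.
cleaned = toy.drop(blank_idx)
print(cleaned)
```

The same two steps would apply to `df`, with `df.drop(blank_idx, inplace=True)` if an in-place change is preferred.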

Let's invoke the polarity score for a couple of reviews.

In [22]:
# Take the first review and its ground-truth label
review = df.iloc[0].review
label = df.iloc[0].label
# Score the review text with VADER
scores = sid.polarity_scores(review)

print(f"Review:\n {review}\n")
print(f"Label: {label}\n")
print(f"Scores: {scores}")


Review:
 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

Label: pos

Scores: {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}


As we can see, there is a small negative score, but the overall (compound) score is strongly positive, which matches the label. In this particular case, the prediction is correct.

Now let's try to perform the scoring over all the reviews in the data set.

In [23]:
df['scores'] = df.review.apply(lambda r: sid.polarity_scores(r))
df['compound'] = df.scores.apply(lambda s: s['compound'])
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


Now let's convert the compound score into labels: positive if it is above zero, negative otherwise.

In [27]:
df['prediction'] = df['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')
df.head()

Unnamed: 0,label,review,scores,compound,prediction
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
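The rule used above sends a compound score of exactly 0 to 'neg', which is a consequence of the dataset having only two labels. When a neutral class is acceptable, a commonly cited convention for VADER is a band of ±0.05 around zero. The sketch below assumes that convention; the function name `label_from_compound` and the 'neu' label are illustrative and not part of this dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg' or 'neu'.

    Assumes the conventional VADER cut-offs of +/-0.05.
    """
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Compound scores taken from the earlier examples, plus a neutral case.
for value in (0.4404, -0.8331, 0.0):
    print(value, "->", label_from_compound(value))
```

With a binary dataset such as this one, any 'neu' predictions would still need to be assigned to one side or excluded before evaluation.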


Now, let's measure the predictive power of this model.

In [31]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_true = df.label.values
y_pred = df.prediction.values

cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(f"Accuracy score:\n{acc}")
print(f"Confusion Matrix:\n{cm}")
print(f"Classification Report:\n{cr}")

Accuracy score:
0.7122
Confusion Matrix:
[[2709 2388]
 [ 490 4413]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.85      0.53      0.65      5097
         pos       0.65      0.90      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.72      0.70     10000
weighted avg       0.75      0.71      0.70     10000



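These results tell us that VADER does not perform badly overall (about 71% accuracy), but it struggles with the negative reviews: recall for the 'neg' class is only 0.53, meaning almost half of the negative reviews are predicted as positive. VADER is not able to detect sarcasm, which might be one cause.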
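The per-class figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python, assuming scikit-learn's layout for `confusion_matrix`: rows are true labels, columns are predicted labels, both ordered 'neg' then 'pos'.

```python
# Confusion matrix copied from the output above.
# Rows: true neg, true pos. Columns: predicted neg, predicted pos.
cm = [[2709, 2388],
      [490, 4413]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for i, name in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_class = cm[0][i] + cm[1][i]  # column sum
    actually_class = sum(cm[i])               # row sum
    precision = true_positives / predicted_as_class
    recall = true_positives / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(f"accuracy: {accuracy}")
for name, (precision, recall, f1) in metrics.items():
    print(f"{name}: precision={precision}, recall={recall}, f1={f1}")
```

The 2388 negative reviews predicted as positive are what pull 'neg' recall down to 0.53 and 'pos' precision down to 0.65.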