# VADER - Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of the emotion.

In other words, we can not only classify a sentiment as positive or negative, but also measure how strong it is.

VADER also handles negation, i.e., it is able to differentiate between 'love' and 'do not love'.

It is available in NLTK. First, we need to download the lexicon.

In [1]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/ohtar10/nltk_data...


True

Now let's load a sentiment intensity analyzer and score a simple sentence.

In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
a = "This is a good movie"
sid.polarity_scores(a)



{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

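As we can see, sentence `a` contains no negative words, so the negative score is zero. On the contrary, the positive score is nonzero because we are saying that this is a 'good' movie. The `compound` value summarizes the overall sentiment as a single number between -1 (most negative) and +1 (most positive).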
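
As a rough check on where the `compound` value comes from: as far as we know, the reference VADER implementation sums the valences of the scored words and squashes that sum into the range [-1, 1] with `x / sqrt(x^2 + alpha)`, using `alpha = 15`. The sketch below assumes that formula, and it also assumes that 'good' has a lexicon valence of 1.9 and is the only scored word in sentence `a`. The helper name `normalize_valence` is made up for this illustration.

```python
import math


def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    # Clip defensively so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, score))


good_valence = 1.9  # assumption: lexicon valence of the word 'good'
compound = round(normalize_valence(good_valence), 4)
print(compound)  # 0.4404
```

Under these assumptions the result agrees with the `compound` value reported for sentence `a` above.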

VADER also recognizes capitalization and exclamation marks, both of which alter the intensity score.

In [3]:
b = "This was the best, most awesome movie EVER MADE!!!"
sid.polarity_scores(b)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

Notice how both the positive score and the compound score increased.

Now, let's try with the opposite:

In [4]:
c = "This was the WORST movie that has ever disgraced the screen."
sid.polarity_scores(c)

{'neg': 0.465, 'neu': 0.535, 'pos': 0.0, 'compound': -0.8331}

The balance is skewed to the negative side now.

## Explore a real dataset
Let's use an Amazon review dataset.

In [5]:
import pandas as pd

df = pd.read_csv('../UPDATED_NLP_COURSE/TextFiles/amazonreviews.tsv', sep='\t')
df.head()

FileNotFoundError: File b'../UPDATED_NLP_COURSE/TextFiles/amazonreviews.tsv' does not exist

In [None]:
df.label.value_counts()

Ensure there are no missing values for evaluation.

In [None]:
df.dropna(inplace=True)

Let's check if we have blanks

In [None]:
df.review = df.review.apply(lambda r: r.strip())
df[df.review == '']

We do not have empty records, so we can continue. If we had any, we would simply drop them by index.
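
If blank reviews did show up, dropping them by index could look like the sketch below. It runs on a small made-up frame (`toy`) rather than on the real data, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up).
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos"],
    "review": ["Great product", "   ", "Works fine"],
})

# Strip whitespace so that whitespace-only reviews become empty strings.
toy["review"] = toy["review"].str.strip()

# Collect the index labels of the blank rows and drop them.
blank_index = toy[toy["review"] == ""].index
cleaned = toy.drop(blank_index)
print(cleaned)
```

The same pattern applied to `df` would be `df.drop(df[df.review == ''].index, inplace=True)`.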

Let's invoke the polarity score for a couple of reviews.

In [None]:
review = df.iloc[0].review
label = df.iloc[0].label
# Score the first review with the analyzer created earlier
scores = sid.polarity_scores(review)

print(f"Review:\n {review}\n")
print(f"Label: {label}\n")
print(f"Scores: {scores}")


As we can see, the review received a small negative score, but the overall score was positive, and so was the label. In this particular case, the prediction was correct.

Now let's try to perform the scoring over all the reviews in the data set.

In [None]:
df['scores'] = df.review.apply(lambda r: sid.polarity_scores(r))
df['compound'] = df.scores.apply(lambda s: s['compound'])
df.head()

Now let's convert the compound score into labels, treating any score above zero as positive and everything else as negative.

In [None]:
df['prediction'] = df['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')
df.head()
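
The cut-off at zero is one choice among several. The VADER authors' documentation suggests treating compound scores of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. A minimal sketch of that three-way rule follows. The function name `label_from_compound` and the `band` parameter are hypothetical, and since this dataset only has 'pos' and 'neg' labels, a neutral class would need separate handling before evaluation.

```python
def label_from_compound(compound, band=0.05):
    """Map a compound score to 'pos', 'neg' or 'neu' using a symmetric neutral band."""
    if compound >= band:
        return "pos"
    if compound <= -band:
        return "neg"
    return "neu"


# Compound scores taken from the examples above, plus a perfectly neutral one.
examples = [0.4404, -0.8331, 0.0]
labels = [label_from_compound(c) for c in examples]
print(labels)  # ['pos', 'neg', 'neu']
```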

Now, let's measure the predictive power of this model.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_true = df.label.values
y_pred = df.prediction.values

cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(f"Accuracy score:\n{acc}")
print(f"Confusion Matrix:\n{cm}")
print(f"Classification Report:\n{cr}")

These results tell us that VADER does not perform badly, but it seems to struggle with the negative reviews. VADER is not able to detect sarcasm, which might be one cause.
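
One way to check where the errors concentrate is per-class recall: the fraction of reviews of each true class that the model labels correctly. The sketch below computes it by hand on made-up labels (`y_true_toy`, `y_pred_toy`), not on the real results. The helper `recall_per_class` is hypothetical, and the `recall` column of `classification_report` reports the same quantity.

```python
from collections import Counter

# Made-up labels for illustration only; these are not results from the dataset.
y_true_toy = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_pred_toy = ["pos", "pos", "pos", "pos", "neg", "pos", "neg", "pos"]


def recall_per_class(y_true, y_pred):
    """Return {label: correctly predicted / total} for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


recalls = recall_per_class(y_true_toy, y_pred_toy)
print(recalls)  # {'pos': 1.0, 'neg': 0.5}
```

In this toy case every positive review is recovered but only half of the negative ones, which is the kind of imbalance to look for in the real report.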