**Sentiment Analysis Project**

For this project, we'll run NLTK's VADER sentiment analyzer on a dataset of movie reviews and compare its predictions with the labels that come with the data.

**Load the Data**

In [24]:
import pandas as pd
import numpy as np

In [25]:
df = pd.read_csv("moviereviews.tsv",sep="\t")

In [26]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [27]:
df["label"].value_counts()

pos    1000
neg    1000
Name: label, dtype: int64

**Remove Blank Records**

In [28]:
# Remove rows with NaN values; whitespace-only reviews are dropped in the next cells.
df.dropna(inplace=True)

In [29]:
blanks = []

# Collect the index label of every review that consists only of whitespace.
for i, lb, rv in df.itertuples():
    if rv.isspace():
        blanks.append(i)

In [7]:
df.drop(blanks,inplace=True)

In [8]:
df["label"].value_counts()

pos    969
neg    969
Name: label, dtype: int64
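
The two cleaning steps above can also be written without an explicit loop. The sketch below is one possible vectorized version, run on a small made-up frame (`toy`) rather than the real dataset; the helper name `drop_blank_reviews` is ours and not part of pandas. Unlike `isspace()`, stripping and checking the length also catches zero-length strings.

```python
import pandas as pd

# Hypothetical stand-in for the movie review data.
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["great film", "   ", None, "dull plot"],
})


def drop_blank_reviews(frame, column="review"):
    """Return a copy of frame without missing or whitespace-only reviews."""
    cleaned = frame.dropna(subset=[column])
    # True for rows whose text still has characters after stripping whitespace.
    has_text = cleaned[column].str.strip().str.len() > 0
    return cleaned[has_text]


cleaned_toy = drop_blank_reviews(toy)
print(cleaned_toy)
```

Because the helper returns a new frame instead of modifying `toy` in place, the original data stays available for comparison.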

**Import SentimentIntensityAnalyzer and create an sid object**

The analyzer needs the VADER lexicon, which the cell below downloads if it is not already present.

In [9]:
import nltk
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...




In [10]:
sid = SentimentIntensityAnalyzer()

In [11]:
sid.polarity_scores(df.iloc[0]["review"])

{'compound': -0.9125, 'neg': 0.121, 'neu': 0.778, 'pos': 0.101}
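
The `neg`, `neu` and `pos` values are the proportions of the text that fall in each category (here they sum to 1.0), while `compound` is a single summary score squeezed into the range -1 to 1. As far as we know, VADER gets there by normalizing the summed word valences with x / sqrt(x² + α), using α = 15. Treat the sketch below as an illustration of that formula under this assumption, not as NLTK's actual code; the function name `normalize_valence` is ours.

```python
import math


def normalize_valence(total_valence, alpha=15):
    """Map an unbounded valence sum into [-1, 1] (assumed VADER-style formula)."""
    norm = total_valence / math.sqrt(total_valence * total_valence + alpha)
    # Guard against tiny floating point overshoots.
    return max(-1.0, min(1.0, norm))


for raw in (0, 2, 4, 20, 100):
    print(raw, round(normalize_valence(raw), 4))
```

The curve flattens quickly as the raw sum grows, which may explain why long reviews with many sentiment-bearing words tend to end up with compound scores close to -1 or 1.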

**Use sid to append a compound_score column to the dataset**

In [12]:
df["scores"] = df["review"].apply(lambda review: sid.polarity_scores(review))

In [13]:
df.head()

Unnamed: 0,label,review,scores
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co..."
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com..."
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com..."
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co..."
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com..."


In [14]:
df["compound"] = df["scores"].apply(lambda d: d["compound"])

In [15]:
df.head()

Unnamed: 0,label,review,scores,compound
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...",0.9953
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...",0.9972
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...",-0.7264


In [16]:
# Label a review "pos" when its compound score is non-negative, otherwise "neg".
df["compound_score"] = df["compound"].apply(lambda score: "pos" if score >= 0 else "neg")

In [17]:
df.head()

Unnamed: 0,label,review,scores,compound,compound_score
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...",0.9953,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...",-0.7264,neg
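
The labelling step above splits at zero because the dataset has only two labels. VADER's authors suggest a three-way convention instead: a compound score of 0.05 or more is positive, -0.05 or less is negative, and anything in between is neutral. A minimal sketch of that rule, with a hypothetical helper name `label_from_compound`:

```python
def label_from_compound(compound, threshold=0.05):
    """Three-way label using a symmetric neutral band around zero."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Compound scores taken from the first rows shown above, plus one borderline value.
for score in (-0.9125, 0.9953, 0.01):
    print(score, label_from_compound(score))
```

For this dataset a `neu` label would always count as an error, so the binary split used in the notebook is a reasonable simplification.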


**Compare the original label with compound_score**

In [18]:
from sklearn import metrics

In [19]:
metrics.accuracy_score(df["label"],df["compound_score"])

0.6367389060887513

In [20]:
print(metrics.confusion_matrix(df["label"],df["compound_score"]))

[[427 542]
 [162 807]]


In [21]:
print(metrics.classification_report(df["label"],df["compound_score"]))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938
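
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, both in sorted order (`neg`, then `pos`). The sketch below copies the matrix printed above and uses plain Python only.

```python
# Confusion matrix copied from the output above.
# Rows: true labels, columns: predicted labels, both ordered [neg, pos].
matrix = [[427, 542],
          [162, 807]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    predicted_as = sum(row[i] for row in matrix)  # column total
    actual = sum(matrix[i])                       # row total
    precision[name] = matrix[i][i] / predicted_as
    recall[name] = matrix[i][i] / actual
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")

print(f"accuracy={accuracy:.4f}")
```

The top-right cell is the one to watch: 542 of the 969 negative reviews were predicted as positive.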



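So, VADER classified only about 64% of the movie reviews correctly. Most of the errors are negative reviews scored as positive: 542 of the 969 negative reviews were labeled pos, which gives a recall of just 0.44 for that class. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie and reserved the final judgement for the last sentence.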
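
To make the "final judgement in the last sentence" effect concrete, the sketch below compares a plain average of per-sentence compound scores with an average that counts the last sentence more heavily. This is a hypothetical heuristic for illustration, not something VADER does and not a method used in this notebook. The scores are made up; in practice they could come from splitting a review with `nltk.sent_tokenize` and calling `sid.polarity_scores` on each sentence.

```python
def tail_weighted_score(sentence_scores, tail_weight=3.0):
    """Weighted mean of sentence scores; the last sentence counts tail_weight times."""
    if not sentence_scores:
        return 0.0
    weights = [1.0] * (len(sentence_scores) - 1) + [tail_weight]
    total = sum(w * s for w, s in zip(weights, sentence_scores))
    return total / sum(weights)


# Made-up review: three mildly positive sentences, then a negative verdict.
scores = [0.6, 0.4, 0.3, -0.8]

plain_mean = sum(scores) / len(scores)
weighted = tail_weighted_score(scores)

print(f"plain mean:    {plain_mean:.3f}")
print(f"tail weighted: {weighted:.3f}")
```

With these numbers the plain mean stays positive while the tail-weighted value turns negative. Whether such a weighting would help on the real dataset is untested here.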