## NLTK VADER sentiment analysis on the movie reviews dataset

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv',sep='\t')
df.head()


Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


### Data Cleaning

In [3]:
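# Remove NaN values and whitespace-only reviews
df.dropna(inplace=True)

blanks = []  # index labels of reviews that contain only whitespace

# itertuples() yields (index, label, review) for each row
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

df.drop(blanks, inplace=True)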
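
The same cleaning step can also be written as a boolean mask instead of an explicit loop. The following is a minimal sketch on a toy DataFrame; the `toy` data and the variable names are made up for illustration, and it assumes the `review` column holds only strings once the NaN rows are dropped.

```python
import pandas as pd

# Hypothetical toy data standing in for moviereviews.tsv
toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["bad film", "   ", None, "great film"],
})

# Drop rows with missing values first, as in the cell above
toy = toy.dropna()

# True for reviews made up of whitespace only
blank_mask = toy["review"].str.isspace()

# Keep every row that is not blank
cleaned = toy[~blank_mask]
print(cleaned)
```

Only the rows with index 0 and 3 survive: index 2 is removed by `dropna()` and index 1 by the whitespace mask.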


In [4]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

### Importing the sentiment analyzer and creating a sid object

In [7]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

### Using sid to append a comp_score to the dataset

In [10]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df['compound'] = df['scores'].apply(lambda score_dict:score_dict['compound'])

df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c>=0 else 'neg')

df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...",0.9953,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...",-0.7264,neg
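
The cell above labels every review with a compound score of 0 or higher as `pos`, so a perfectly neutral score also counts as positive. A commonly cited convention for VADER is to treat scores between -0.05 and 0.05 as neutral. The sketch below shows that variant; the function name `label_from_compound` and the `band` parameter are illustrative and not part of NLTK. Because this dataset has only `pos` and `neg` labels, a three-way output would need extra handling before it could be compared with `df['label']`.

```python
def label_from_compound(compound, band=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'.

    `band` is the half-width of the neutral zone (assumed value: 0.05).
    """
    if compound >= band:
        return "pos"
    if compound <= -band:
        return "neg"
    return "neu"


# Two compound scores from the table above plus two near-zero values
examples = [-0.9125, 0.9953, 0.0, 0.03]
labels = [label_from_compound(c) for c in examples]
print(labels)
```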


### Comparing VADER predictions with the true labels

In [11]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [12]:
accuracy_score(df['label'],df['comp_score'])

0.6367389060887513

In [13]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [14]:
print(confusion_matrix(df['label'],df['comp_score']))

[[427 542]
 [162 807]]
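
The accuracy, precision, and recall reported above can be reproduced by hand from the confusion matrix. This is a minimal sketch in plain Python with the counts copied from the output above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`neg`, `pos`); the variable names are illustrative.

```python
# Rows are true labels, columns are predicted labels, ordered (neg, pos)
matrix = [[427, 542],
          [162, 807]]

(neg_as_neg, neg_as_pos), (pos_as_neg, pos_as_pos) = matrix
total = sum(sum(row) for row in matrix)

# Share of all reviews that were labeled correctly
accuracy = (neg_as_neg + pos_as_pos) / total

# Recall: share of truly negative (positive) reviews that were found
neg_recall = neg_as_neg / (neg_as_neg + neg_as_pos)
pos_recall = pos_as_pos / (pos_as_neg + pos_as_pos)

# Precision: share of predicted negatives (positives) that were correct
neg_precision = neg_as_neg / (neg_as_neg + pos_as_neg)
pos_precision = pos_as_pos / (neg_as_pos + pos_as_pos)

print(f"accuracy      {accuracy:.4f}")
print(f"neg precision {neg_precision:.2f}  recall {neg_recall:.2f}")
print(f"pos precision {pos_precision:.2f}  recall {pos_recall:.2f}")
```

The low recall for `neg` comes from the 542 negative reviews that were predicted as `pos`.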


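VADER struggled to classify these movie reviews accurately. Overall accuracy was about 64%, and recall on negative reviews was only 0.44: 542 of the 969 negative reviews were labeled positive. This highlights a major challenge in sentiment analysis, which is grasping the nuances of human semantics. Some reviews praised aspects of the film but deferred the overall judgement until the final sentence.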

## END