
# Sentiment Analysis of Movie Reviews Using NLTK

In this project we work with an IMDB movie review dataset of 2,000 records and perform sentiment analysis directly with NLTK.
We use the NLTK lexicon called "VADER".

Note: This classification is rule-based. It considers only the words in the text and how strongly each one expresses emotion in the sentence, including emphasis such as punctuation; for example, "Lucky" and "Lucky!!!" get very different scores.
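To illustrate why punctuation alone can shift a score, here is a toy model of a lexicon-based scorer. It is not NLTK's implementation: the function names, the valence assigned to "lucky", and the boost and normalization constants are all assumptions chosen for illustration.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

def toy_compound(text, lexicon, boost_per_bang=0.292, max_bangs=4):
    """Toy scorer: sum word valences, then let '!' push the sum away from zero."""
    # Look up each word after stripping surrounding punctuation.
    words = [w.strip("!?.,").lower() for w in text.split()]
    valence = sum(lexicon.get(w, 0.0) for w in words)
    # Count exclamation marks, capped so emphasis cannot grow without bound.
    bangs = min(text.count("!"), max_bangs)
    if valence > 0:
        valence += bangs * boost_per_bang
    elif valence < 0:
        valence -= bangs * boost_per_bang
    return normalize(valence)

# Hypothetical one-word lexicon; the valence value is an assumption.
toy_lexicon = {"lucky": 1.8}

print(toy_compound("Lucky", toy_lexicon))
print(toy_compound("Lucky!!!", toy_lexicon))
```

In this sketch the second call returns a larger value than the first, because the three exclamation marks add to the word's valence before normalization.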

# Load Data

In [59]:
import numpy as np
import pandas as pd

df=pd.read_csv("../TextFiles/moviereviews.tsv", sep='\t') #loading dataset

df.head() #display first 5 records from dataset

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


# Remove blank and null records from dataset

In [60]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():             # iterate over (index, label, review)
    if isinstance(rv, str) and rv.isspace():  # review is a string of whitespace only
        blanks.append(i)                      # remember its index for removal

df.drop(blanks, inplace=True)
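The same cleanup can be written without an explicit loop. The following is a minimal sketch on a made-up four-row frame (the rows and the names `toy` and `cleaned` are assumptions, not part of the dataset), using pandas string methods to filter out blank reviews.

```python
import pandas as pd

# Tiny stand-in for the real dataset; these rows are invented for illustration.
toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull and predictable", "   ", None, "a real delight"],
})

# Drop rows with missing values, then keep reviews that still have text
# after surrounding whitespace is stripped.
cleaned = toy.dropna()
cleaned = cleaned[cleaned["review"].str.strip() != ""]

print(cleaned)
```

Unlike `isspace()`, which is false for a zero-length string, the `str.strip() != ""` test also removes reviews that are completely empty.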

In [61]:
df['label'].value_counts() # check the number of records per label

neg    969
pos    969
Name: label, dtype: int64

# Import `SentimentIntensityAnalyzer` and create an sid object

In [62]:
# First, download the VADER lexicon for NLTK; this is a one-time step.

import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to C:\Users\Nihal
[nltk_data]     Singh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [63]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# Adding Scores and Labels to the DataFrame

In [64]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review)) 
# Creates a 'scores' column holding the polarity scores: negative, neutral, positive and compound (a normalized score; here >= 0 is treated as positive and < 0 as negative)

df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
# Extracting compound score from polarity scores

df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
#Converting compound to 'neg' and 'pos' to match with the label

df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg
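The cutoff at 0 is the choice made in this notebook. A commonly cited convention for VADER instead treats compound scores between -0.05 and 0.05 as neutral. The sketch below compares the two rules on the five compound values shown above; the function names are hypothetical, and a neutral class has no counterpart in this dataset's binary labels.

```python
def label_from_compound(compound, threshold=0.0):
    """Binary rule used in this notebook: 'pos' at or above the threshold, else 'neg'."""
    return 'pos' if compound >= threshold else 'neg'

def label_with_neutral_band(compound, band=0.05):
    """Three-way rule: scores inside (-band, band) are treated as neutral."""
    if compound >= band:
        return 'pos'
    if compound <= -band:
        return 'neg'
    return 'neu'

# Compound scores of the first five reviews, copied from the output above.
compounds = [-0.9125, -0.8618, 0.9951, 0.9972, -0.2484]

print([label_from_compound(c) for c in compounds])
# -> ['neg', 'neg', 'pos', 'pos', 'neg']
print([label_with_neutral_band(c) for c in compounds])
# -> ['neg', 'neg', 'pos', 'pos', 'neg']
```

Both rules agree on these five rows; they differ only for scores close to zero, where the binary rule labels a score of exactly 0 as 'pos'.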


# Evaluation Metrics

In [66]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix #Importing libraries

In [67]:
accuracy_score(df['label'],df['comp_score'])

0.6357069143446853

In [68]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [69]:
print(confusion_matrix(df['label'],df['comp_score']))

[[427 542]
 [164 805]]
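The accuracy and the per-class figures in the classification report can be recomputed by hand from this confusion matrix. The sketch below uses plain Python and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, both in the sorted order `['neg', 'pos']`. The helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the output above.
# Rows: true label, columns: predicted label, both ordered ['neg', 'pos'].
cm = [[427, 542],
      [164, 805]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for the class at index k."""
    true_positive = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"accuracy: {accuracy:.4f}")
for k, name in enumerate(labels):
    p, r, f = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# accuracy: 0.6357
# neg: precision=0.72 recall=0.44 f1=0.55
# pos: precision=0.60 recall=0.83 f1=0.70
```

The low recall for `neg` comes from the 542 negative reviews that were predicted as positive.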


This completes our analysis.

As the metrics show, VADER does not identify the sentiment of these reviews very accurately: overall accuracy is about 64%, and 542 of the 969 negative reviews are classified as positive. This is one of the biggest challenges of any sentiment analysis: understanding human semantics.