# Movie Review Sentiment Analysis Project

In this project, we classify IMDB movie reviews as positive or negative using NLTK's VADER sentiment analyzer, then compare the predictions against the dataset's own labels.

## Import Libraries & Load the Data

In [10]:
import numpy as np
import pandas as pd
import nltk
nltk.download('vader_lexicon')

df = pd.read_csv('IMDB Dataset.csv')
df.head()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Sounak\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. The filming te... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## Remove Blank Records (optional)

In [2]:
# REMOVE NaN VALUES AND BLANK (EMPTY OR WHITESPACE-ONLY) STRINGS:
print(df.isnull().sum())
df.dropna(inplace=True)

blanks = []  # index labels of rows whose review is blank

for i, rv, st in df.itertuples():  # yields (index, review, sentiment)
    if isinstance(rv, str):        # skip any non-string values
        if not rv.strip():         # empty or whitespace-only review
            blanks.append(i)       # record the row's index label
print(blanks)
df.drop(blanks, inplace=True)

review       0
sentiment    0
dtype: int64
[]


In [3]:
df['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

## Import `SentimentIntensityAnalyzer` and create an sid object
This assumes that the VADER lexicon has been downloaded.

In [4]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

## Use sid to append a `comp_score` to the dataset

In [5]:
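# Full VADER output per review: a dict with 'neg', 'neu', 'pos' and 'compound'
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

# Keep only the compound score
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

# Map the compound score to a label: zero or above counts as positive
df['comp_score'] = df['compound'].apply(lambda c: 'positive' if c >= 0 else 'negative')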

df.head()

| | review | sentiment | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | {'neg': 0.205, 'neu': 0.746, 'pos': 0.049, 'co... | -0.9951 | negative |
| 1 | A wonderful little production. The filming te... | positive | {'neg': 0.055, 'neu': 0.768, 'pos': 0.177, 'co... | 0.9641 | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive | {'neg': 0.093, 'neu': 0.689, 'pos': 0.218, 'co... | 0.978 | positive |
| 3 | Basically there's a family where a little boy ... | negative | {'neg': 0.14, 'neu': 0.773, 'pos': 0.088, 'com... | -0.8819 | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | {'neg': 0.052, 'neu': 0.791, 'pos': 0.157, 'co... | 0.9766 | positive |
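
The labelling rule used here treats every compound score of 0 or more as positive, which suits a dataset with only two labels. For comparison, the convention commonly cited by VADER's authors adds a neutral band: a compound score of 0.05 or more is positive, -0.05 or less is negative, and anything in between is neutral. A minimal sketch of both rules follows; the function names `label_binary` and `label_with_neutral` are hypothetical helpers, not part of NLTK.

```python
def label_binary(compound):
    """Two-way rule used in this notebook: scores of 0 or more count as positive."""
    return 'positive' if compound >= 0 else 'negative'


def label_with_neutral(compound, threshold=0.05):
    """Three-way rule with a neutral band of width 2 * threshold around zero.

    The default threshold of 0.05 follows the commonly cited VADER convention;
    treat it as an assumption to tune, not a fixed constant.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores taken from the first rows of the table above, plus a near-zero case
for score in (-0.9951, 0.9641, 0.01):
    print(score, label_binary(score), label_with_neutral(score))
```

Because the IMDB labels are strictly positive or negative, a neutral prediction can never match a true label, which is one reason a two-way rule is a reasonable choice for this comparison.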


## Perform a comparison analysis between the original `sentiment` label and `comp_score`

In [6]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [7]:
accuracy_score(df['sentiment'],df['comp_score'])

0.69702

In [8]:
print(classification_report(df['sentiment'],df['comp_score']))

              precision    recall  f1-score   support

    negative       0.79      0.54      0.64     25000
    positive       0.65      0.86      0.74     25000

    accuracy                           0.70     50000
   macro avg       0.72      0.70      0.69     50000
weighted avg       0.72      0.70      0.69     50000



In [9]:
print(confusion_matrix(df['sentiment'],df['comp_score']))

[[13412 11588]
 [ 3561 21439]]
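
To see how the figures in the classification report follow from this matrix, the arithmetic can be reproduced by hand. The sketch below hard-codes the four counts printed above and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels, both in sorted order (`negative`, then `positive`). The variable names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Rows: true label, columns: predicted label, order: (negative, positive).
cm = [[13412, 11588],
      [3561, 21439]]

true_neg, false_pos = cm[0]   # true negatives, negatives predicted as positive
false_neg, true_pos = cm[1]   # positives predicted as negative, true positives

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total

# Recall: share of each true class that was predicted correctly
neg_recall = true_neg / (true_neg + false_pos)
pos_recall = true_pos / (true_pos + false_neg)

# Precision: share of each predicted class that was actually correct
neg_precision = true_neg / (true_neg + false_neg)
pos_precision = true_pos / (true_pos + false_pos)

print(f"accuracy: {accuracy:.5f}")
print(f"negative: precision {neg_precision:.2f}, recall {neg_recall:.2f}")
print(f"positive: precision {pos_precision:.2f}, recall {pos_recall:.2f}")
```

The low negative recall comes from the 11588 negative reviews that VADER scored as positive, which is by far the larger of the two error cells.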


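VADER classified only about 70% of the reviews correctly, and it struggled most with negative reviews: recall for the negative class was 0.54, meaning 11,588 of the 25,000 negative reviews were scored as positive. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie and reserved their final judgement for the last sentence.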
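
One way to probe the last-sentence idea is to score only the final sentence of each review and compare the result with the full-text score. The following is a rough sketch under stated assumptions, not a tested improvement: `last_sentence` and `label_from_last_sentence` are hypothetical helpers, the sentence split is a naive regular expression rather than a proper tokenizer, and `score_fn` is assumed to behave like `sid.polarity_scores` (a function that takes a string and returns a dict with a `'compound'` key).

```python
import re


def last_sentence(review):
    """Return the final sentence of a review using a naive punctuation split."""
    # Replace HTML line breaks, if the text contains any, with spaces
    text = re.sub(r'<br\s*/?>', ' ', review)
    # Split after '.', '!' or '?' when followed by whitespace
    parts = [p.strip() for p in re.split(r'(?<=[.!?])\s+', text) if p.strip()]
    return parts[-1] if parts else ''


def label_from_last_sentence(review, score_fn):
    """Label a review from the compound score of its last sentence only.

    score_fn is assumed to mimic sid.polarity_scores: str -> dict with 'compound'.
    """
    compound = score_fn(last_sentence(review))['compound']
    return 'positive' if compound >= 0 else 'negative'


# Stand-in scorer for demonstration only; in the notebook, pass sid.polarity_scores
def toy_scorer(text):
    return {'compound': -0.5 if 'bad' in text.lower() else 0.5}


sample = "Great cast. Nice music! Still, it was bad."
print(last_sentence(sample))
print(label_from_last_sentence(sample, toy_scorer))
```

In the notebook this could be applied with `df['review'].apply(lambda r: label_from_last_sentence(r, sid.polarity_scores))` and passed to `accuracy_score` in the same way as `comp_score`. Whether it actually scores better than the full-text approach has not been measured here.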