# Movie Review Sentiment Analysis Project

For this project, we'll perform movie review sentiment analysis on the IMDB dataset using NLTK VADER.

## Import Libraries & Load the Data

In [5]:
import numpy as np
import pandas as pd

df = pd.read_csv('IMDB Dataset.csv')
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1   A wonderful little production. The filming te...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Remove Blank Records (optional)

In [6]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
print(df.isnull().sum())
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, rv, st in df.itertuples():  # iterate over rows as (index, review, sentiment)
    if isinstance(rv, str):        # skip any non-string values
        if not rv.strip():         # test 'review' for empty or whitespace-only text
            blanks.append(i)       # add matching index numbers to the list
print(blanks)
df.drop(blanks, inplace=True)

review       0
sentiment    0
dtype: int64
[]
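
The same cleanup can also be done without an explicit loop. The following is a minimal sketch on a small made-up frame (`toy` and its rows are hypothetical, not taken from the IMDB data), assuming the same `review` and `sentiment` columns as above. It drops missing reviews first, then any review that is empty or whitespace-only after stripping.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the IMDB frame.
toy = pd.DataFrame({
    'review': ['Great film.', '   ', None, 'Dull.'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
})

# Drop rows whose review is missing (NaN/None).
cleaned = toy.dropna(subset=['review'])

# Keep only rows whose review still has text once whitespace is stripped.
cleaned = cleaned[cleaned['review'].str.strip() != '']

print(cleaned)
```

On a frame of this size the loop and the vectorized form behave the same; the vectorized form mainly saves time on larger data.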


In [7]:
df['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

## Import `SentimentIntensityAnalyzer` and create an sid object
This assumes that the VADER lexicon has been downloaded.
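
If the lexicon is not yet available locally, it can be fetched once with NLTK's downloader. The helper below is a minimal sketch: `ensure_vader_lexicon` is a hypothetical name, and the resource path assumes NLTK's usual `nltk_data` layout for the `vader_lexicon` package.

```python
def ensure_vader_lexicon():
    """Fetch the VADER lexicon only when NLTK cannot find a local copy."""
    import nltk  # deferred import: NLTK is only needed when the helper runs

    try:
        # Resource path under nltk_data where the lexicon archive is stored.
        nltk.data.find('sentiment/vader_lexicon.zip')
    except LookupError:
        # Not found locally: download it (requires network access).
        nltk.download('vader_lexicon')
```

Calling `ensure_vader_lexicon()` once before creating the `SentimentIntensityAnalyzer` should be enough; later calls find the local copy and skip the download.
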

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

## Use sid to append a `comp_score` to the dataset

In [None]:
df['scores'] = df['review'].apply(sid.polarity_scores)

df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

df['comp_score'] = df['compound'].apply(lambda c: 'positive' if c >= 0 else 'negative')

df.head()
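
The last step above turns the compound score, which ranges from -1 to 1, into a binary label. As a minimal sketch, that rule can be written as a standalone function; `label_from_compound` and its `threshold` parameter are hypothetical names, not part of NLTK. With the default threshold of 0, a score of exactly 0 (which can happen when no word in the review matches the lexicon) is labelled positive.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a VADER compound score in [-1, 1] to a binary label.

    Scores at or above `threshold` count as positive; everything else
    counts as negative. There is no neutral class in this sketch.
    """
    return 'positive' if compound >= threshold else 'negative'


# Same rule as the notebook cell: zero and above is positive.
print(label_from_compound(0.0))
print(label_from_compound(-0.3))

# A stricter, illustrative threshold pushes weak scores to negative.
print(label_from_compound(0.04, threshold=0.05))
```

Raising the threshold trades recall on positive reviews for recall on negative ones, which is one way to probe how sensitive the accuracy is to the cut-off.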

## Perform a comparison analysis between the original `sentiment` label and `comp_score`

In [10]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [11]:
accuracy_score(df['sentiment'],df['comp_score'])

0.0

In [11]:
print(classification_report(df['sentiment'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [12]:
print(confusion_matrix(df['sentiment'],df['comp_score']))

[[427 542]
 [162 807]]
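
The figures in the classification report follow directly from this matrix. The sketch below recomputes them by hand, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, with the negative class first.

```python
# Confusion matrix as printed above:
# rows = true label, columns = predicted label, order = (negative, positive).
cm = [[427, 542],
      [162, 807]]

tn, fp = cm[0]  # true negatives, negatives predicted as positive
fn, tp = cm[1]  # positives predicted as negative, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
neg_precision = tn / (tn + fn)  # of all predicted negatives, how many were negative
neg_recall = tn / (tn + fp)     # of all true negatives, how many were found
pos_precision = tp / (tp + fp)
pos_recall = tp / (tp + fn)

print(f"accuracy       {accuracy:.2f}")
print(f"neg precision  {neg_precision:.2f}   neg recall  {neg_recall:.2f}")
print(f"pos precision  {pos_precision:.2f}   pos recall  {pos_recall:.2f}")
```

Rounded to two decimals, these values match the report: 542 of the 969 negative reviews land in the positive column, which is what drives the low negative recall.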


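So, it looks like VADER couldn't judge the movie reviews very accurately. In the classification report above, overall accuracy is 0.64 and recall on negative reviews is only 0.44, so more than half of the negative reviews were scored as positive. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving final judgement for the last sentence, and a lexicon-based scorer that aggregates word-level sentiment over the whole text tends to miss that.

## Great Job!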