
# Sentiment Analysis Project

## Load the Data

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')
df.head()

|   | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


## Remove Blank Records (optional)

In [3]:
# REMOVE NaN VALUES AND BLANK (EMPTY OR WHITESPACE-ONLY) STRINGS:
df.dropna(inplace=True)

blanks = []  # index labels of rows whose review is blank

for i, lb, rv in df.itertuples():  # yields (index, label, review)
    # isinstance() guards against any non-string value;
    # strip() leaves '' for both empty and whitespace-only text
    if isinstance(rv, str) and not rv.strip():
        blanks.append(i)

df.drop(blanks, inplace=True)
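
One detail worth checking when filtering blank reviews: `str.isspace()` returns `False` for the empty string, so a test based on it alone skips `''`. The sketch below is a plain-Python illustration, assuming a hypothetical helper named `is_blank` (not part of pandas) and a few made-up sample rows.

```python
import math

def is_blank(value):
    """Return True for None, NaN, empty, or whitespace-only values.

    `is_blank` is a hypothetical helper name used for illustration.
    """
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return not value.strip()  # '' and '   ' both strip to ''
    return False

# Made-up sample rows in (label, review) form
rows = [
    ("pos", "a great film"),
    ("neg", "   "),
    ("neg", ""),
    ("pos", float("nan")),
    ("neg", "dull and overlong"),
]

kept = [(label, review) for label, review in rows if not is_blank(review)]
print(kept)

# isspace() alone treats the empty string as "not whitespace"
print("".isspace(), "   ".isspace())
```

With a DataFrame, the same predicate could presumably be applied as `df = df[~df['review'].apply(is_blank)]`, which avoids collecting index labels by hand.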

In [7]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

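## Import `SentimentIntensityAnalyzer` and create a `sid` object
The `vaderSentiment` package installed below bundles the VADER lexicon, so no separate download is required. (A separate `nltk.download('vader_lexicon')` step applies only when using NLTK's `nltk.sentiment.vader` implementation instead.)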

In [8]:
! pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [10]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

## Use sid to append a `comp_score` to the dataset

In [11]:
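# Score every review; polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'
df['scores'] = df['review'].apply(sid.polarity_scores)

# Pull out the compound score, a single normalized value between -1 and 1
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

# Binary label: a compound score of 0 or higher counts as positive
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')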

df.head()

|   | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.109, 'neu': 0.803, 'pos': 0.089, 'co... | -0.925 | neg |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.103, 'neu': 0.81, 'pos': 0.087, 'com... | -0.9087 | neg |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.059, 'neu': 0.802, 'pos': 0.139, 'co... | 0.9968 | pos |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.06, 'neu': 0.805, 'pos': 0.135, 'com... | 0.9976 | pos |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.078, 'neu': 0.843, 'pos': 0.079, 'co... | -0.6399 | neg |
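
The cell above maps the compound score to two classes with a cutoff at 0. As a point of comparison, the VADER documentation suggests a three-way split with a neutral band between -0.05 and 0.05. The following is a minimal sketch of both rules; the function names `binary_label` and `three_way_label` are illustrative and do not appear in the notebook or in the library.

```python
def binary_label(compound):
    """Two-class rule used in this notebook: 0 or higher is positive."""
    return 'pos' if compound >= 0 else 'neg'

def three_way_label(compound, band=0.05):
    """Three-class rule with a neutral band of +/- `band` around zero."""
    if compound >= band:
        return 'pos'
    if compound <= -band:
        return 'neg'
    return 'neu'

# Two compound scores from the table above, plus two values near zero
for c in (-0.925, 0.0, 0.03, 0.9968):
    print(c, binary_label(c), three_way_label(c))
```

Note that under the binary rule a compound score of exactly 0.0 is labeled `pos`, so any review that ends up with a zero score counts toward the positive class.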


## Perform a comparison analysis between the original `label` and `comp_score`

In [12]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [13]:
accuracy_score(df['label'],df['comp_score'])

0.6388028895768834

In [14]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.74      0.43      0.54       969
         pos       0.60      0.85      0.70       969

    accuracy                           0.64      1938
   macro avg       0.67      0.64      0.62      1938
weighted avg       0.67      0.64      0.62      1938



In [15]:
print(confusion_matrix(df['label'],df['comp_score']))

[[417 552]
 [148 821]]
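
The figures in the classification report can be recovered by hand from the confusion matrix, which helps show where the errors fall. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted order (`neg`, `pos`); the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label, order = ['neg', 'pos']
cm = [[417, 552],
      [148, 821]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.4f}")

metrics = {}
for i, name in enumerate(labels):
    true_count = sum(cm[i])                   # reviews whose true label is `name`
    predicted_count = sum(row[i] for row in cm)  # reviews predicted as `name`
    precision = cm[i][i] / predicted_count
    recall = cm[i][i] / true_count
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = {'precision': precision, 'recall': recall, 'f1': f1}
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The low recall for `neg` comes from the top row: 552 of the 969 negative reviews were predicted as `pos`.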


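VADER did not classify these movie reviews very accurately: on a balanced dataset, where always guessing one class would score 50%, it reached only about 64% accuracy. The confusion matrix shows that most of the errors are negative reviews scored as positive (552 of 969). This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving final judgement for the last sentence.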
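
To illustrate how a verdict held back until the final sentence can be outweighed, here is a toy sketch. It is not VADER's algorithm: it simply sums word scores from a tiny made-up lexicon, and `TOY_LEXICON`, `toy_score`, and the sample review are all invented for this example.

```python
# Tiny made-up lexicon for illustration only; these are not VADER's scores.
TOY_LEXICON = {"charming": 2.0, "beautiful": 2.0, "lovely": 2.0, "boring": -2.0}

def toy_score(text):
    """Sum the lexicon scores of the words in `text` (unknown words count as 0)."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

# Invented review: praise for the details, then a negative verdict at the end
review = ("The cast is charming, the sets are beautiful and the score is lovely. "
          "Still, the film is boring.")

sentences = [s.strip() for s in review.split(".") if s.strip()]

print(toy_score(review))         # score over the whole review
print(toy_score(sentences[-1]))  # score of the final sentence only
```

The whole-review score is positive even though the closing sentence, which carries the reviewer's verdict, scores negative on its own.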
## Great Job!