## Sentiment Analysis using Naive Bayes

In the following we will predict whether a movie review is positive or not based on the text of the review. This is a supervised learning problem: each review carries a label, fresh (good) or rotten (bad). We use Naive Bayes for the classification because it is fast to train and often performs well on text.
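
To make the idea concrete, here is a minimal sketch of Bernoulli Naive Bayes written by hand in Python 3. The four toy reviews, the vocabulary, and the helper names `fit` and `predict_proba` are all made up for illustration and are not part of the analysis below. Each class gets a prior and a smoothed probability that each word is present, and a review is scored by multiplying those probabilities together as if the words were independent.

```python
import math

# Toy training set: each review is the set of vocabulary words it contains.
# Labels: 1 = fresh, 0 = rotten. All data here is made up for illustration.
vocab = ["great", "fun", "bad", "plot"]
reviews = [
    ({"great", "fun"}, 1),
    ({"great", "plot"}, 1),
    ({"bad", "plot"}, 0),
    ({"bad"}, 0),
]


def fit(reviews, vocab, alpha=1.0):
    """Estimate class priors and smoothed P(word present | class)."""
    priors, word_probs = {}, {}
    for label in (0, 1):
        docs = [words for words, y in reviews if y == label]
        priors[label] = len(docs) / len(reviews)
        # Laplace smoothing: add alpha to the count, 2 * alpha to the total.
        word_probs[label] = {
            w: (sum(w in d for d in docs) + alpha) / (len(docs) + 2 * alpha)
            for w in vocab
        }
    return priors, word_probs


def predict_proba(words, priors, word_probs, vocab):
    """Return P(class | words) under the naive independence assumption."""
    log_scores = {}
    for label in priors:
        score = math.log(priors[label])
        for w in vocab:
            p = word_probs[label][w]
            # The Bernoulli model also accounts for words that are absent.
            score += math.log(p) if w in words else math.log(1.0 - p)
        log_scores[label] = score
    # Normalise the scores so they sum to 1 (subtract the max for stability).
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


priors, word_probs = fit(reviews, vocab)
posterior = predict_proba({"great", "fun"}, priors, word_probs, vocab)
print(posterior)
```

On this toy data a review containing "great" and "fun" is assigned to the fresh class with probability 18/19, or about 0.95.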

In [43]:
# Load libraries

import pandas as pd
import numpy as np

from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

In [4]:
# Load data

df = pd.read_csv('./Data/rt_critics.csv')
df.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


In [9]:
# Remove reviews where classification label was unavailable

# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned later
df = df[df['fresh'] != 'none'].copy()

In [11]:
# Re-label classification to binary

df['fresh'] = df['fresh'].map(lambda x: 1 if x == 'fresh' else 0)

### Modeling

In [12]:
# Create predictor matrix using words from critic quotes

# binary = True sets every non-zero count to 1, which is useful for discrete probability
# models such as Bernoulli Naive Bayes
cv = CountVectorizer(stop_words = 'english', binary = True, ngram_range = (1, 2), max_features = 3000)
X = cv.fit_transform(df['quote'])
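
As a small illustration of what these `CountVectorizer` settings do, the sketch below applies the same options (without `max_features`) to two made-up quotes. The names `toy_quotes`, `toy_cv` and `toy_X` are hypothetical and used only here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up quotes; "great" appears twice in the first one.
toy_quotes = [
    "A great, great film",
    "The film is fun",
]

# Same options as above: English stop words removed, presence/absence
# instead of counts, and both single words and two-word phrases as features.
toy_cv = CountVectorizer(stop_words="english", binary=True, ngram_range=(1, 2))
toy_X = toy_cv.fit_transform(toy_quotes)

# Features learned from the toy quotes, in column order.
print(sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get))

# One row per quote, one column per feature; every entry is 0 or 1.
print(toy_X.toarray())
```

Because `binary = True`, the repeated "great" in the first quote still yields a 1 rather than a 2, and the bigram features are built from the tokens that remain after stop words are removed.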

In [14]:
# Create dataframe from words

# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
X_df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())

In [16]:
# Train test split

X_train, X_test, y_train, y_test = train_test_split(X_df.values, df['fresh'].values, 
                                                    test_size = 0.3, random_state = 42)

In [19]:
# Baseline accuracy: predicting good (fresh) for every single review would be correct about 61% of the time

df['fresh'].value_counts() / len(df['fresh'])

1    0.613069
0    0.386931
Name: fresh, dtype: float64
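
The same majority-class baseline can also be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with roughly the class balance shown above; `X_toy`, `y_toy` and `baseline` are hypothetical names, not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the same class balance as the reviews (61% fresh).
y_toy = np.array([1] * 61 + [0] * 39)
# The "most_frequent" strategy ignores the features, so a placeholder is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most common class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any real classifier should beat this number to be considered useful.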

In [20]:
# Here we use Bernoulli Naive Bayes because the features are binary: each word or
# bigram is either present in a quote or not (the label, good or bad, is binary as well)

nb = BernoulliNB()
scores = cross_val_score(nb, X_train, y_train, cv = 5, scoring = 'accuracy')
# Mean 5-fold cross-validated accuracy on the training split
print('Model Training Accuracy: ', scores.mean())

Model Training Accuracy:  0.741203406564


In [44]:
# Test set accuracy

model = nb.fit(X_train, y_train)
print('Model Test Accuracy: ', accuracy_score(y_test, model.predict(X_test)))

Model Test Accuracy:  0.746856465006


In [26]:
# Probability of each word appearing given the class (rotten / fresh), i.e. P(word | class)

log_proba = model.feature_log_prob_
proba = np.exp(log_proba)

In [29]:
# Fresh / Rotten probabilities (rows follow model.classes_: 0 = rotten, 1 = fresh)

rotten_proba = proba[0]
fresh_proba = proba[1]
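
To show what `feature_log_prob_` holds, here is a minimal sketch on a made-up four-word vocabulary. The arrays `X_toy` and `y_toy` and the name `toy_model` are hypothetical. Each row corresponds to a class in the order given by `classes_`, and each entry is the smoothed probability that the word is present in a review of that class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up presence/absence matrix. Columns: great, fun, bad, plot.
X_toy = np.array([
    [1, 1, 0, 0],  # fresh
    [1, 0, 0, 1],  # fresh
    [0, 0, 1, 1],  # rotten
    [0, 0, 1, 0],  # rotten
])
y_toy = np.array([1, 1, 0, 0])

toy_model = BernoulliNB().fit(X_toy, y_toy)

# Convert log-probabilities back to probabilities P(word present | class).
word_given_class = np.exp(toy_model.feature_log_prob_)

print(toy_model.classes_)   # row order of word_given_class
print(word_given_class)
```

With the default smoothing (`alpha = 1`), "great" appears in both fresh reviews and gets (2 + 1) / (2 + 2) = 0.75 for the fresh class. The values in a row do not sum to 1, because each word has its own present-or-absent probability.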

In [36]:
# Create data frame of words with corresponding probabilities

proba_df = pd.DataFrame({ 'Words':X_df.columns.values, 'Fresh_proba':fresh_proba, 'Rotten_proba':rotten_proba})
proba_df.tail(2)

Unnamed: 0,Fresh_proba,Rotten_proba,Words
2998,0.000994,0.000789,zemeckis
2999,0.000331,0.002105,zone


In [40]:
# Create column with the difference between each word's probability of appearing in a fresh review and in a rotten one

proba_df['Diff'] = proba_df['Fresh_proba'] - proba_df['Rotten_proba']

#### Most likely words for fresh and rotten reviews

In [41]:
# Fresh

proba_df.sort_values(by = ['Diff'], ascending= False).head(10)

Unnamed: 0,Fresh_proba,Rotten_proba,Words,Diff
996,0.162498,0.112865,film,0.049633
240,0.042902,0.017101,best,0.025801
1178,0.02816,0.00763,great,0.02053
848,0.02319,0.005788,entertaining,0.017402
1929,0.0217,0.006577,performance,0.015122
849,0.018884,0.004473,entertainment,0.014411
104,0.021368,0.008156,american,0.013212
1094,0.024681,0.011839,fun,0.012842
1930,0.021037,0.008682,performances,0.012355
1099,0.034454,0.022363,funny,0.012092


In [42]:
# Rotten

proba_df.sort_values(by = ['Diff'], ascending= False).tail(10)

Unnamed: 0,Fresh_proba,Rotten_proba,Words,Diff
1554,0.019546,0.02894,little,-0.009394
2286,0.010104,0.021836,script,-0.011732
741,0.016565,0.028677,doesn,-0.012112
2124,0.007288,0.019732,really,-0.012443
1975,0.01408,0.026572,plot,-0.012492
1453,0.027166,0.041042,just,-0.013876
1417,0.010436,0.025783,isn,-0.015347
1748,0.128375,0.145751,movie,-0.017376
198,0.007288,0.026309,bad,-0.01902
1541,0.043896,0.069718,like,-0.025823
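
Ranking by the raw difference favours frequent words, which is why "film" and "movie" sit at the extremes of the two tables. As a possible alternative, and not something the original analysis does, the sketch below ranks a few words by the log of the ratio of the two probabilities instead. The probabilities are copied from the tables above; the `Log_ratio` column and the `toy` and `ranked` names are hypothetical.

```python
import numpy as np
import pandas as pd

# A handful of rows copied from the probability tables above.
toy = pd.DataFrame({
    "Words": ["film", "great", "bad", "zone"],
    "Fresh_proba": [0.162498, 0.028160, 0.007288, 0.000331],
    "Rotten_proba": [0.112865, 0.007630, 0.026309, 0.002105],
})

# Absolute difference, as used above.
toy["Diff"] = toy["Fresh_proba"] - toy["Rotten_proba"]

# Relative measure: positive means more typical of fresh reviews,
# negative means more typical of rotten reviews.
toy["Log_ratio"] = np.log(toy["Fresh_proba"] / toy["Rotten_proba"])

ranked = toy.sort_values(by="Log_ratio", ascending=False)
print(ranked)
```

On these four words the difference puts "film" first, while the log ratio puts "great" first and moves the rare word "zone" to the rotten end.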


#### Top movies likely to be rotten and fresh

In [46]:
# Fit model on entire data set

nb = BernoulliNB()
model = nb.fit(X_df.values, df['fresh'])

In [47]:
# Predict using model

proba = model.predict_proba(X_df.values)
rotten_proba = proba[:,0]
fresh_proba = proba[:,1]

In [49]:
# Add probabilities to original data frame

df['Rotten_proba'] = rotten_proba
df['Fresh_proba'] = fresh_proba
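
The same two calls can score a quote the model has not seen, as long as it is transformed with the already fitted vectorizer. The sketch below is a self-contained toy version: the quotes, labels and the names `toy_cv` and `toy_model` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up training quotes and labels (1 = fresh, 0 = rotten).
toy_quotes = [
    "great fun film",
    "great performances",
    "bad plot",
    "bad script, bad film",
]
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer(binary=True)
toy_model = BernoulliNB().fit(toy_cv.fit_transform(toy_quotes), toy_labels)

# Use transform (not fit_transform) so the new quote is mapped onto the
# vocabulary learned from the training quotes; unknown words are ignored.
new_X = toy_cv.transform(["a great film with great performances"])

# Columns of predict_proba follow toy_model.classes_, here [0, 1].
rotten_p, fresh_p = toy_model.predict_proba(new_X)[0]
print(rotten_p, fresh_p)
```

The two probabilities sum to 1, and on this toy data the new quote leans strongly towards fresh.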

In [53]:
# Titles of the 10 reviews most likely to be Fresh

sorted_ = df.sort_values(by = 'Fresh_proba', ascending= False)
for i in sorted_['title'].head(10):
    print(i)

Kundun
Frozen River
2001: A Space Odyssey
Sophie's Choice
American Beauty
The Wild Bunch
Repo Man
Where the Wild Things Are
City Hall
Wolf


In [54]:
# Titles of the 10 reviews most likely to be Rotten

sorted_ = df.sort_values(by = 'Rotten_proba', ascending= False)
for i in sorted_['title'].head(10):
    print(i)

The Beverly Hillbillies
Pokémon: The First Movie
Kazaam
Tank Girl
Joe's Apartment
Wing Commander
House Arrest
Gung Ho
Snow Day
Prêt-à-Porter
