# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Case Study: Sentiment Analysis with Naive Bayes

Week 9 | Lab 4.1


This lab combines tools you have already seen to find which words are most likely to appear in positive or negative reviews, and to predict from its text whether a review is positive. This is a supervised learning problem, so we need labelled reviews to train the classifier. You could explore other classifiers for this problem too. As we have discussed before, Naive Bayes has been found empirically to perform particularly well on text (at least in a bag-of-words setting, where word order is ignored), and it is fast, which matters because text datasets can get very large very quickly.

### Load packages and movie data

Do any cleaning you deem necessary.

In [3]:
from __future__ import print_function
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [4]:
rt = pd.read_csv('../assets/datasets/rt_critics.csv')

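# Keep only reviews labelled fresh or rotten; .copy() avoids a SettingWithCopyWarning below
rt = rt[rt.fresh.isin(['fresh','rotten'])].copy()
# Encode the target: 1 = fresh, 0 = rotten
rt["fresh"] = rt["fresh"].map(lambda x: 1 if x == 'fresh' else 0)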
rt.head()

| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | 1 | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
| 1 | Richard Corliss | 1 | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
| 2 | David Ansen | 1 | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
| 3 | Leonard Klady | 1 | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
| 4 | Jonathan Rosenbaum | 1 | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |


### Create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**

In [6]:
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt["quote"])

In [7]:
words.shape

(14049, 2500)
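
To see what `binary=True` changes, here is a minimal sketch on a made-up two-quote corpus (the quotes are hypothetical, not taken from `rt_critics.csv`). With `binary=False` the matrix holds word counts; with `binary=True` it only records whether a word appears, which is the encoding `BernoulliNB` assumes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration
toy_quotes = [
    "good good film",
    "dull film",
]

counts = CountVectorizer(binary=False).fit(toy_quotes)
binary = CountVectorizer(binary=True).fit(toy_quotes)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(binary.vocabulary_, key=binary.vocabulary_.get)

count_matrix = counts.transform(toy_quotes).toarray()
binary_matrix = binary.transform(toy_quotes).toarray()

print(vocab)          # column order of both matrices
print(count_matrix)   # "good" is counted twice in the first quote
print(binary_matrix)  # "good" is only marked as present
```

The only cell that differs between the two matrices is the repeated word in the first quote.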

In [8]:
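# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names()
words = pd.DataFrame(words.toarray(), columns=cv.get_feature_names_out())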

In [9]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
words.shape

(14049, 2500)

### Split data into training and testing splits

You should keep 25% of the data in the test set.

In [12]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt["fresh"].values, test_size=0.25)

In [14]:
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)
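
The split above is random, so the exact rows in each set change from run to run. As an optional variation (not what the lab cell does), the sketch below uses the `random_state` and `stratify` arguments of `train_test_split` on made-up data to get a reproducible split that keeps the class balance the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 20 rows, 12 labelled 1 and 8 labelled 0
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 12 + [0] * 8)

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # keep 25% of rows for testing
    stratify=y_toy,    # preserve the 60/40 class balance in both sets
    random_state=42,   # fixed seed so the split is reproducible
)

print(Xtr.shape, Xte.shape)
print(ytr.mean(), yte.mean())
```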


### Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [15]:
nb = BernoulliNB()

In [16]:
nb.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [17]:
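# 5-fold cross-validated accuracy, computed on the training data only
nb_scores = cross_val_score(BernoulliNB(), Xtrain, ytrain, cv=5)
print(nb_scores)
print(np.mean(nb_scores))
# Baseline: proportion of fresh reviews, i.e. the accuracy of always predicting "fresh"
print(np.mean(ytrain))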

[ 0.73624288  0.73908918  0.741813    0.72567632  0.71699905]
0.731964087991
0.616837509491
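
The baseline here is the accuracy of always predicting the majority class. One way to check that reading is scikit-learn's `DummyClassifier`; the sketch below runs it on made-up labels with roughly the class balance printed above (62% fresh). The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: 620 fresh (1) and 380 rotten (0)
y_toy = np.array([1] * 620 + [0] * 380)
X_toy = np.zeros((1000, 1))  # the dummy model ignores the features

# Majority-class baseline computed by hand
baseline = max(y_toy.mean(), 1 - y_toy.mean())

# The same baseline obtained by cross-validating a majority-class predictor
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=5
)

print(baseline)
print(dummy_scores.mean())
```

A real model is only useful to the extent that its cross-validated accuracy exceeds this number.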


### Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the fitted Naive Bayes model contains the log probability of each feature appearing given a target class.

The rows correspond to the target classes and the columns correspond to the features. The first row is the 0 ("rotten") class and the second is the 1 ("fresh") class.
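
To make the contents of `.feature_log_prob_` concrete, the sketch below fits `BernoulliNB` on a tiny made-up matrix and rebuilds the same probabilities by hand using additive (Laplace) smoothing with `alpha=1.0`: the number of documents in the class that contain the word, plus `alpha`, divided by the number of documents in the class, plus `2 * alpha`. The toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy word-presence matrix, made up: 4 documents, 2 words
X_toy = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [0, 0]])
y_toy = np.array([1, 1, 0, 0])  # 1 = fresh, 0 = rotten

alpha = 1.0
model = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)

# Smoothed P(word present | class), one row per class in model.classes_ order
manual = np.vstack([
    (X_toy[y_toy == c].sum(axis=0) + alpha) / ((y_toy == c).sum() + 2 * alpha)
    for c in model.classes_
])

fitted = np.exp(model.feature_log_prob_)
print(fitted)
print(manual)
```

Each entry is a separate presence probability for one word, so a row does not need to sum to 1.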

#### 1. Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [28]:
feat_lp = nb.feature_log_prob_
fresh_p = np.exp(feat_lp[1])
rotten_p = np.exp(feat_lp[0])
print(fresh_p)
print(rotten_p)


[ 0.00215351  0.0013844   0.0013844  ...,  0.0013844   0.00092293
  0.00061529]
[ 0.00445655  0.00099034  0.00049517 ...,  0.00123793  0.00099034
  0.00222827]


#### 2. Make a dataframe with the probabilities and features

In [30]:
feat_probs = pd.DataFrame({'fresh_p':fresh_p, 'rotten_p':rotten_p, 'feature':words.columns.values})
feat_probs.tail()

| | feature | fresh_p | rotten_p |
|---|---|---|---|
| 2495 | york | 0.002461 | 0.002476 |
| 2496 | young | 0.008922 | 0.00619 |
| 2497 | younger | 0.001384 | 0.001238 |
| 2498 | youth | 0.000923 | 0.00099 |
| 2499 | zone | 0.000615 | 0.002228 |


#### 3. Create a column that is the difference between a word's probability of appearing in fresh reviews and in rotten reviews

In [31]:
feat_probs['fresh_diff'] = feat_probs.fresh_p - feat_probs.rotten_p
feat_probs.tail()

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 2495 | york | 0.002461 | 0.002476 | -1.5e-05 |
| 2496 | young | 0.008922 | 0.00619 | 0.002732 |
| 2497 | younger | 0.001384 | 0.001238 | 0.000146 |
| 2498 | youth | 0.000923 | 0.00099 | -6.7e-05 |
| 2499 | zone | 0.000615 | 0.002228 | -0.001613 |


#### 4. Look at the most likely words for fresh and rotten reviews

In [32]:
feat_probs.sort_values('fresh_diff', ascending=False, inplace=True)
feat_probs.head(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 825 | film | 0.153207 | 0.112156 | 0.041051 |
| 193 | best | 0.040763 | 0.018817 | 0.021946 |
| 965 | great | 0.028303 | 0.009656 | 0.018647 |
| 693 | entertaining | 0.023996 | 0.005694 | 0.018302 |
| 1584 | performance | 0.021381 | 0.00718 | 0.014201 |
| 87 | american | 0.021381 | 0.007923 | 0.013459 |
| 837 | films | 0.025073 | 0.011884 | 0.013189 |
| 948 | good | 0.045531 | 0.032434 | 0.013098 |
| 694 | entertainment | 0.017074 | 0.004952 | 0.012123 |
| 1585 | performances | 0.020458 | 0.008666 | 0.011793 |


In [33]:
feat_probs.tail(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 851 | flat | 0.001538 | 0.008666 | -0.007127 |
| 612 | don | 0.010768 | 0.018074 | -0.007306 |
| 2473 | worst | 0.001846 | 0.009161 | -0.007315 |
| 634 | dull | 0.001077 | 0.008418 | -0.007341 |
| 2308 | tv | 0.002769 | 0.010151 | -0.007382 |
| 768 | fails | 0.002 | 0.009408 | -0.007409 |
| 1670 | predictable | 0.001846 | 0.009408 | -0.007562 |
| 1221 | lacks | 0.002307 | 0.009903 | -0.007596 |
| 1182 | jokes | 0.003384 | 0.011141 | -0.007757 |
| 810 | feels | 0.003538 | 0.011637 | -0.008099 |
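
Ranking by the raw difference tends to favor words that are common in both classes, which is why "film" tops the fresh list. A possible alternative, not part of the lab's steps, is the log ratio of the two probabilities (equivalently, the difference of the two rows of `feature_log_prob_`). The sketch below compares both rankings on three rows copied from the tables above.

```python
import numpy as np
import pandas as pd

# Three rows copied from the feat_probs output above
toy = pd.DataFrame({
    "feature": ["film", "entertaining", "dull"],
    "fresh_p": [0.153207, 0.023996, 0.001077],
    "rotten_p": [0.112156, 0.005694, 0.008418],
})

# Absolute difference, as used in the lab
toy["fresh_diff"] = toy["fresh_p"] - toy["rotten_p"]
# Relative measure: how many times more likely the word is in fresh reviews (log scale)
toy["log_ratio"] = np.log(toy["fresh_p"] / toy["rotten_p"])

by_diff = toy.sort_values("fresh_diff", ascending=False)["feature"].tolist()
by_ratio = toy.sort_values("log_ratio", ascending=False)["feature"].tolist()

print(by_diff)
print(by_ratio)
```

Under the log ratio, "entertaining" moves ahead of "film" because it is about four times more likely in fresh reviews, while "film" is common in both classes.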


### Examine how your model performs on the test set

In [34]:
print(nb.score(Xtest, ytest))
print(np.mean(ytest))

0.734130372901
0.601764873328


### Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

Note: Naive Bayes, while good at classifying, is known to give poorly calibrated predicted probabilities. They are reliable mainly for which side of 50% they fall on, so treat the exact values below with caution.
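
One reason for the poor calibration is that Naive Bayes treats features as independent, so correlated features (for example a unigram and a bigram that contains it, such as "year" and "year old") are counted as separate pieces of evidence. The sketch below illustrates this on made-up data by duplicating a single feature five times; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data, made up: the word appears in 3 of 4 fresh and 1 of 4 rotten reviews
x = np.array([1, 1, 1, 0, 1, 0, 0, 0]).reshape(-1, 1)
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Model with the feature seen once
single = BernoulliNB().fit(x, y_toy)
p_single = single.predict_proba([[1]])[0, 1]

# Model with five identical copies of the same feature
x_dup = np.repeat(x, 5, axis=1)
dup = BernoulliNB().fit(x_dup, y_toy)
p_dup = dup.predict_proba([[1, 1, 1, 1, 1]])[0, 1]

print(p_single)  # predicted P(fresh | word present) with one copy
print(p_dup)     # same evidence counted five times
```

The duplicated model adds no information, yet its predicted probability is pushed much closer to 1, while the predicted class stays the same.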

In [35]:
X = words.values
y = rt.fresh

In [36]:
nbfull = BernoulliNB().fit(X,y)

In [37]:
pp = pd.DataFrame({
        'prob_fresh':nbfull.predict_proba(X)[:,1],
        'movie':rt.title,
        'quote':rt.quote
    })

In [41]:
pp.sort_values('prob_fresh', ascending=False, inplace=True)
for movie, quote in zip(pp["movie"][0:10], pp["quote"][0:10]):
    print(movie,'\t', quote)
    print('--------------------------------------------------\n')

Kundun 	 Stunning, odd, glorious, calm and sensationally absorbing, director Martin Scorsese's Kundun is a remarkable piece of work with vital colors and a wrenching message.
--------------------------------------------------

The Wild Bunch 	 The Wild Bunch is Peckinpah's most complex inquiry into the metamorphosis of man into myth. Not incidentally, it is also a raucous, violent, powerful feat of American film making.
--------------------------------------------------

Witness 	 Powerful, assured, full of beautiful imagery and thankfully devoid of easy moralising, it also offers a performance of surprising skill and sensitivity from Ford.
--------------------------------------------------

The English Patient 	 This is one of the year's most unabashed and powerful love stories, using flawless performances, intelligent dialogue, crisp camera work, and loaded glances to attain a level of eroticism and emotional connection that many similar films miss.
----------------------------------

In [42]:
pp.sort_values('prob_fresh', ascending=True, inplace=True)
for movie, quote in zip(pp["movie"][0:10], pp["quote"][0:10]):
    print(movie,'\t', quote)
    print('--------------------------------------------------\n')

Pokémon: The First Movie 	 With intentionally stilted animation, uninspired music and lame jokes, Pokemon is basically an ultralong version of the phenomenon's own boring TV 'toon.
--------------------------------------------------

Joe's Apartment 	 There's not enough story here for something half that length, so we're subjected to numerous pointless and irritating song-and-dance numbers designed to nudge the lame plot towards its conclusion.
--------------------------------------------------

Kazaam 	 As fairy tale, buddy comedy, family drama, thriller or rap revue, Kazaam is simply uninspired and unconvincing, and Mr. O'Neal, who can carry a basketball team, lacks the charisma to rescue this misguided effort.
--------------------------------------------------

Gung Ho 	 A disappointment, a movie in which the Japanese are mostly used for the mechanical requirements of the plot, and the Americans are constructed from durable but boring stereotypes.
------------------------------------

In [43]:
# subset to movies with at least 10 reviews:
movie_counts = pp["movie"].value_counts().reset_index()
movie_counts.columns = ['movie','counts']
movie_counts.head()

| | movie | counts |
|---|---|---|
| 0 | The Hurricane | 20 |
| 1 | Fever Pitch | 20 |
| 2 | The Truman Show | 20 |
| 3 | The Green Mile | 20 |
| 4 | The Sixth Sense | 20 |


In [44]:
pp_movies = pp[['movie','prob_fresh']].groupby('movie').mean().reset_index()
pp_movies = pp_movies[pp_movies["movie"].isin(movie_counts[movie_counts.counts >= 10]["movie"])]

In [45]:
pp_movies.sort_values('prob_fresh', ascending=False, inplace=True)
pp_movies.head(20)

| | movie | prob_fresh |
|---|---|---|
| 1417 | The Iron Giant | 0.979857 |
| 862 | Midnight Run | 0.938485 |
| 209 | Boogie Nights | 0.933018 |
| 830 | Manhattan | 0.923313 |
| 1447 | The Little Mermaid | 0.91311 |
| 1058 | Raging Bull | 0.909063 |
| 652 | Il conformista | 0.899021 |
| 1055 | Quiz Show | 0.897883 |
| 1615 | Toy story | 0.894362 |
| 298 | Cookie's Fortune | 0.889969 |


In [46]:
pp_movies.tail(20)

| | movie | prob_fresh |
|---|---|---|
| 932 | Next Friday | 0.265943 |
| 140 | Bad Company | 0.261892 |
| 867 | Milk Money | 0.259989 |
| 361 | Deuce Bigalow: Male Gigolo | 0.250894 |
| 527 | Georgia Rule | 0.250723 |
| 1175 | Soldier | 0.245642 |
| 1164 | Sleepover | 0.240277 |
| 916 | My Fellow Americans | 0.239758 |
| 811 | Lost & Found | 0.227661 |
| 1632 | Twisted | 0.227563 |
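
The per-movie ranking needs both the mean predicted probability and the number of reviews per movie. As a compact alternative sketch, a single `groupby(...).agg(["mean", "count"])` call produces both at once. The toy frame and the `min_reviews` threshold below are made up for illustration (the lab uses a threshold of 10).

```python
import pandas as pd

# Toy predictions, made up: one row per review
toy = pd.DataFrame({
    "movie": ["A", "A", "A", "B", "B", "C"],
    "prob_fresh": [0.9, 0.8, 0.7, 0.2, 0.4, 0.99],
})

# Mean predicted probability and review count per movie in one pass
summary = toy.groupby("movie")["prob_fresh"].agg(["mean", "count"]).reset_index()

min_reviews = 2  # hypothetical threshold for this toy example
ranked = summary[summary["count"] >= min_reviews].sort_values("mean", ascending=False)

print(ranked)
```

Movie "C" has the highest mean but is dropped because a single review is too little evidence to rank on.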


---

## [Bonus] Take a look at some other classifiers for this problem