<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis and Naive Bayes

_Instructor: Aymeric Flaisler_

---

In the sentiment analysis lesson we used a predefined dictionary of positive and negative valences for words. This  lab has invert the process: you'll find which words are most likely to appear in positive or negative reviews by using the rotten vs. fresh binary label.

### Naive Bayes

A practical and common way to do this is with the Naive Bayes algorithm. For this lab you'll  be leveraging the sklearn implementation.

Given a feature $x_i$ and target $y_i$, Naive Bayes classifiers solve for $P(x_i \;|\; y_i)$. In other words, the probability of a feature/predictor _given_ that the target is 1.

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").

---

### 1. Load packages and movie data

Do any cleaning you deem necessary.

In [99]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB,MultinomialNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [100]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [101]:
rt = rt[rt.fresh.isin(['fresh','rotten'])]
rt.fresh = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)

In [102]:
rt.head(2)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,1,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story


---

### 2. We need to create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**

In [103]:
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt.quote)

In [104]:
words.shape

(14049, 2500)

In [105]:
words = pd.DataFrame(words.todense(), columns=cv.get_feature_names())

In [106]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [107]:
print(words.shape)

(14049, 2500)


---

### 3. Let's split data into training and testing splits

You should keep 25% of the data in the test set.

In [108]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt.fresh.values, test_size=0.25)
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)


---

### 4. Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [109]:
# A:
bernd=BernoulliNB()
print('cross-validation:',cross_val_score(bernd,Xtrain,ytrain,cv=5).mean())
print('baseline        :',ytrain.mean())

cross-validation: 0.734336340914456
baseline        : 0.6105732725892179


---

### 5. Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the naive bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the class of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.

#### 5.1 Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [110]:
# A:
bernd.fit(Xtrain,ytrain)
np.exp(bernd.feature_log_prob_)

array([[0.00365408, 0.00097442, 0.00048721, ..., 0.00121803, 0.00073082,
        0.00146163],
       [0.0020202 , 0.0013986 , 0.0010878 , ..., 0.0009324 , 0.0010878 ,
        0.0004662 ]])

#### 5.2 Make a dataframe with the probabilities and features

In [111]:
# A:
prop_df=pd.DataFrame(np.exp(bernd.feature_log_prob_),columns=cv.get_feature_names())
prop_df

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0.003654,0.000974,0.000487,0.000244,0.000731,0.001462,0.000244,0.000974,0.001705,0.001705,...,0.008283,0.001462,0.004141,0.000974,0.001218,0.00268,0.005847,0.001218,0.000731,0.001462
1,0.00202,0.001399,0.001088,0.001399,0.001088,0.002331,0.001088,0.001709,0.000777,0.001554,...,0.014763,0.002176,0.013675,0.001399,0.002331,0.001709,0.011189,0.000932,0.001088,0.000466


#### 5.3 Create a column that is the difference between fresh probability of appearance and rotten

In [112]:
# A:
prop_df['prop_of_fresh_rotten']=np.exp(bernd.feature_log_prob_).sum(axis=1)

#### 5.4 Look at the most likely words for fresh and rotten reviews

In [135]:
# A:
print('Rotten:','\n',prop_df.iloc[0,:].nlargest(10)[1:])

Rotten: 
 movie       0.139586
film        0.111084
like        0.067722
story       0.041900
comedy      0.039464
just        0.038002
good        0.032643
director    0.030451
little      0.030207
Name: 0, dtype: float64


In [136]:
# A:
print('Fresh:','\n',prop_df.iloc[1,:].nlargest(10)[1:])

Fresh: 
 film        0.159596
movie       0.130536
good        0.043357
best        0.043201
story       0.041492
like        0.039782
time        0.036208
director    0.036053
comedy      0.035742
Name: 1, dtype: float64


---

### 6. Examine how your model performs on the test set

In [139]:
# A:
print('Model   :',bernd.score(Xtest,ytest))
print('Baseline:',ytrain.mean())

Model   : 0.7455166524338173
Baseline: 0.6105732725892179


---

### 7. Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

> **Note:** Naive Bayes, while good at classifying, is known to be somewhat bad at giving accurate predicted probabilities (beyond getting it on the correct side of 50%). It is a good classifier but a bad estimator. 

In [248]:
# A:
# top 10 likely to be rotten.
bernd.fit(words,rt.fresh)
final_df=pd.DataFrame(bernd.predict_proba(words))
final_df['title']=rt.title
final_df['quote']=rt.quote
final_df.nlargest(10,0).drop(1,axis=1)

Unnamed: 0,0,title,quote
12548,0.999988,Dogma,"Dogma is a raucous, profane but surprisingly e..."
3542,0.999987,The Adventures of Pinocchio,It's the woodsy children's movie Walt Disney d...
3517,0.999959,Kazaam,Are you bored yet?
9893,0.99995,Desperately Seeking Susan,All of this is cause for consistent smiling an...
2108,0.999938,The Beverly Hillbillies,"It's thin stuff, but the ingratiating naivete ..."
13772,0.999936,The Beach,All the beauty in the world can't camouflage t...
1576,0.999932,The Quick and the Dead,"Although Stone may be pleasing to some eyes, s..."
896,0.999932,Basic Instinct 2,"Like many sequels this is actually a remake, a..."
5674,0.999877,Down by Law,After the initial establishment of character a...
6828,0.999863,Batman & Robin,"[Schumacher's] storytelling is limp, and the c..."


In [249]:
# top ten likely to be fresh.
final_df.nlargest(10,1).drop(0,axis=1)

Unnamed: 0,1,title,quote
7539,0.999996,Jackie Brown,[A] surprisingly sluggish Tarantino piece.
2914,0.999992,The Wild Bunch,In an era when body-count films mirror the mou...
7342,0.999989,Boogie Nights,"The film is bemused and entertained, as we are..."
5064,0.999988,The English Patient,"Torrid, witty, passionate and intelligent."
12844,0.999987,North Country,You cannot help being stirred by the reach and...
7178,0.999986,Chasing Amy,The script moves beyond Smith's customary cata...
7653,0.999984,Where the Wild Things Are,"Intellectually interesting, visually arresting..."
5604,0.999978,Diva,"Made with wit and humor, this French stunner a..."
3937,0.999977,Sunset Blvd.,Remains the best drama ever made about the mov...
1610,0.999977,The Secret of Roan Inish,The rhythms are placid and the camerawork (by ...


---

### 8. Find the most likely to be fresh and rotten for movies with at least 10 reviews.

In [268]:
# rotten
wanted_columns=list(rt.groupby('title').count()['rtid'][rt.groupby('title').count()['rtid']>=10].index)
final_df[final_df['title'].isin(wanted_columns)].nlargest(10,0).drop(1,axis=1)

Unnamed: 0,0,title,quote
12548,0.999988,Dogma,"Dogma is a raucous, profane but surprisingly e..."
3517,0.999959,Kazaam,Are you bored yet?
13772,0.999936,The Beach,All the beauty in the world can't camouflage t...
896,0.999932,Basic Instinct 2,"Like many sequels this is actually a remake, a..."
6828,0.999863,Batman & Robin,"[Schumacher's] storytelling is limp, and the c..."
215,0.999858,Ninja Assassin,"Ninja Assassin lives in the moment, a visceral..."
1659,0.999809,Tank Girl,Lori Petty does a nice job in the title role o...
10658,0.999767,Hideous Kinky,Thanks in part to Kate Winslet's adventurous p...
1440,0.99968,Man of the House,Straight-to- video-quality mess.
796,0.999674,Man of the Year,"It's a comedy, a political thriller, a love st..."


In [269]:
# fresh
final_df[final_df['title'].isin(wanted_columns)].nlargest(10,1).drop(0,axis=1)

Unnamed: 0,1,title,quote
7539,0.999996,Jackie Brown,[A] surprisingly sluggish Tarantino piece.
2914,0.999992,The Wild Bunch,In an era when body-count films mirror the mou...
7342,0.999989,Boogie Nights,"The film is bemused and entertained, as we are..."
5064,0.999988,The English Patient,"Torrid, witty, passionate and intelligent."
12844,0.999987,North Country,You cannot help being stirred by the reach and...
7178,0.999986,Chasing Amy,The script moves beyond Smith's customary cata...
7653,0.999984,Where the Wild Things Are,"Intellectually interesting, visually arresting..."
5604,0.999978,Diva,"Made with wit and humor, this French stunner a..."
1610,0.999977,The Secret of Roan Inish,The rhythms are placid and the camerawork (by ...
260,0.999977,La cité des enfants perdus,Watching the film is like leafing through a gi...
