# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Case Study: Sentiment Analysis with Naive Bayes

Week 9 | Lab 4.1


This lab asks you to combine tools you have already seen to find which words are most likely to appear in positive or negative reviews, and to predict from its text whether a review is positive. This is a supervised learning problem, so we need labelled reviews to train the classifier. You could explore other classifiers for this problem too. As we have discussed before, Naive Bayes has been found empirically to perform particularly well on text (at least in a bag-of-words setting, where word order is ignored), and it is fast, which matters because text datasets can get very large very quickly.

### Load packages and movie data

Do any cleaning you deem necessary.

In [4]:
from __future__ import division, print_function, unicode_literals
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
rt = pd.read_csv('./assets/datasets/rt_critics.csv')

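# Keep only labelled reviews; .copy() avoids a SettingWithCopyWarning on the assignment below
rt = rt[rt.fresh.isin(['fresh','rotten'])].copy()
# Encode the target: 1 = fresh, 0 = rotten
rt["fresh"] = rt["fresh"].map(lambda x: 1 if x == 'fresh' else 0)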
rt.head()

| Unnamed: 0 | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | 1 | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
| 1 | Richard Corliss | 1 | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
| 2 | David Ansen | 1 | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
| 3 | Leonard Klady | 1 | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
| 4 | Jonathan Rosenbaum | 1 | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |


### Create a predictor matrix of words from the quotes with CountVectorizer

Choose whichever n-gram range you like. **Make sure that `binary=True`**, so that each feature records whether a term appears in a quote rather than how many times.

In [30]:
cvec = CountVectorizer(binary=True, ngram_range=(1,3), stop_words='english', max_features=5000)
X = cvec.fit_transform(rt["quote"])
X.shape

(14049, 5000)
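
To see what `binary=True` changes, here is a minimal sketch on two made-up quotes (`toy_quotes` is a hypothetical stand-in, not taken from `rt_critics.csv`). It contrasts raw counts with the presence/absence indicators that `BernoulliNB` assumes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy quotes, used only to illustrate the encoding
toy_quotes = ["good good movie", "bad movie"]

counts = CountVectorizer(binary=False).fit(toy_quotes)
flags = CountVectorizer(binary=True).fit(toy_quotes)

# Columns follow the alphabetically sorted vocabulary: bad, good, movie
print(sorted(counts.vocabulary_))
print(counts.transform(toy_quotes).toarray())  # raw counts: "good" counted twice
print(flags.transform(toy_quotes).toarray())   # presence/absence only
```

With `binary=True`, the repeated "good" in the first quote is recorded as 1 instead of 2.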

In [31]:
y = rt["fresh"]
y.shape

(14049,)

### Split data into training and testing splits

You should keep 25% of the data in the test set.

In [32]:
# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [33]:
from sklearn.metrics import accuracy_score

# BernoulliNB was imported above; fit on the training split only
nb = BernoulliNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
# accuracy_score expects (y_true, y_pred)
print("Accuracy:", accuracy_score(y_test, predictions))

Accuracy: 0.762311414745


In [34]:
# Baseline: accuracy of always predicting the most frequent class (computed on the full dataset)
max(rt["fresh"].value_counts())/len(rt["fresh"])

0.61306854580397185

### Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probability of each feature appearing given a target class.

The rows correspond to the classes of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.

#### 1. Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [35]:
log_probs = nb.feature_log_prob_
log_probs

array([[-6.92510392, -7.6182511 , -5.74644892, ..., -7.21278599,
        -6.70196037, -6.23195674],
       [-7.6763191 , -6.98317192, -6.20998203, ..., -6.82902124,
        -8.08178421, -8.08178421]])

In [43]:
log_probs_rotten = log_probs[0]
print(log_probs_rotten.shape)
log_probs_fresh = log_probs[1]
print(log_probs_fresh.shape)

(5000,)
(5000,)
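
The cells above stop at the log probabilities. Converting them to probabilities is an exponentiation, sketched here on the first three columns of the array printed earlier (`log_probs_sample` is a hypothetical name; in the lab you would apply `np.exp` to `nb.feature_log_prob_` directly).

```python
import numpy as np

# First three columns of the feature_log_prob_ output shown earlier
log_probs_sample = np.array([
    [-6.92510392, -7.6182511, -5.74644892],   # row 0: rotten
    [-7.6763191, -6.98317192, -6.20998203],   # row 1: fresh
])

# exp undoes the natural log, giving P(feature present | class)
probs_sample = np.exp(log_probs_sample)
probs_rotten, probs_fresh = probs_sample[0], probs_sample[1]

print(probs_rotten)
print(probs_fresh)
```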


#### 2. Make a dataframe with the probabilities and features
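
A minimal sketch of one way to build this table, using a tiny made-up corpus so the block runs on its own (`toy_quotes`, `toy_cvec`, `toy_nb` and `word_probs` are hypothetical names). In the lab, the fitted `cvec` and `nb` would take the place of the toy objects.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 1 = fresh, 0 = rotten
toy_quotes = ["great fun", "great film", "dull film", "dull mess"]
toy_labels = [1, 1, 0, 0]

toy_cvec = CountVectorizer(binary=True)
toy_nb = BernoulliNB().fit(toy_cvec.fit_transform(toy_quotes), toy_labels)

# vocabulary_ maps term -> column index; sorting by index aligns terms with columns
features = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
probs = np.exp(toy_nb.feature_log_prob_)

word_probs = pd.DataFrame({
    "feature": features,
    "rotten_prob": probs[0],  # row 0 is the rotten class
    "fresh_prob": probs[1],   # row 1 is the fresh class
})
print(word_probs)
```

Reading the feature names from `vocabulary_` is a choice made here so the sketch does not depend on which feature-name accessor your scikit-learn version provides.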

#### 3. Create a column that is the difference between fresh probability of appearance and rotten
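
The difference is a plain column subtraction. The sketch below uses a small hand-written table with hypothetical probabilities; the column names `rotten_prob`, `fresh_prob` and `fresh_minus_rotten` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical probabilities standing in for the real word table
word_probs = pd.DataFrame({
    "feature": ["dull", "film", "great"],
    "rotten_prob": [0.75, 0.50, 0.25],
    "fresh_prob": [0.25, 0.50, 0.75],
})

# Positive values lean fresh, negative values lean rotten
word_probs["fresh_minus_rotten"] = word_probs["fresh_prob"] - word_probs["rotten_prob"]
print(word_probs)
```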

#### 4. Look at the most likely words for fresh and rotten reviews
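
One way to list the most telling words is to sort on the difference column. This sketch assumes a table with a `fresh_minus_rotten` column (a hypothetical name) and uses made-up values.

```python
import pandas as pd

# Hypothetical table of words and their fresh-minus-rotten differences
word_probs = pd.DataFrame({
    "feature": ["dull", "film", "great", "mess", "fun"],
    "fresh_minus_rotten": [-0.50, 0.00, 0.50, -0.25, 0.25],
})

top_n = 2  # kept small for the toy table; the lab would use a larger number
most_fresh = word_probs.nlargest(top_n, "fresh_minus_rotten")
most_rotten = word_probs.nsmallest(top_n, "fresh_minus_rotten")

print(most_fresh)
print(most_rotten)
```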

### Examine how your model performs on the test set
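
Accuracy alone hides which class the errors fall on. A confusion matrix and a per-class report show that breakdown. The sketch below uses hypothetical labels and predictions so it runs on its own; in the lab you would pass `y_test` and `nb.predict(X_test)`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = fresh, 0 = rotten)
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)
print("Accuracy:", acc)

# Rows are true classes (0 then 1), columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_true, y_pred, target_names=["rotten", "fresh"]))
```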

### Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

Note that Naive Bayes, while a good classifier, is known to give poorly calibrated predicted probabilities: it usually lands on the correct side of 50%, but the values themselves tend to be pushed towards 0 or 1.
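
A minimal sketch of the ranking, assuming a frame shaped like `rt` with `title`, `quote` and `fresh` columns. The data here is made up (`toy_rt`, `model` and `prob_fresh` are hypothetical names), and `top_n` is kept small to suit it.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in for rt
toy_rt = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C", "C"],
    "quote": ["great fun", "great film", "dull film", "dull mess", "great mess", "dull fun"],
    "fresh": [1, 1, 0, 0, 1, 0],
})

# Fit on the full data, as the text above allows for this step
X_all = CountVectorizer(binary=True).fit_transform(toy_rt["quote"])
model = BernoulliNB().fit(X_all, toy_rt["fresh"])

# Column 1 of predict_proba is P(fresh), because model.classes_ is [0, 1]
toy_rt["prob_fresh"] = model.predict_proba(X_all)[:, 1]

top_n = 2  # use 10 in the lab
cols = ["title", "quote", "prob_fresh"]
print(toy_rt.nlargest(top_n, "prob_fresh")[cols])   # reviews most likely fresh
print(toy_rt.nsmallest(top_n, "prob_fresh")[cols])  # reviews most likely rotten

# Averaging per title gives a movie-level ranking
print(toy_rt.groupby("title")["prob_fresh"].mean().sort_values(ascending=False))
```

Averaging the review probabilities per title is only one possible way to rank movies; given the calibration caveat above, treat the ordering as more trustworthy than the values.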

---

## [Bonus] Take a look at some other classifiers for this problem
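
As a starting point, the sketch below compares a few scikit-learn classifiers by cross-validated accuracy on a made-up corpus. The choice of `MultinomialNB` and `LogisticRegression` is an assumption about what is worth trying, not something the lab prescribes; swap in `X_train` and `y_train` for the toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus: 1 = fresh, 0 = rotten
toy_quotes = ["great fun", "great film", "funny and moving", "a moving film",
              "dull film", "dull mess", "boring and lifeless", "a boring mess"] * 3
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 3)

X_toy = CountVectorizer(binary=True).fit_transform(toy_quotes)

candidates = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Mean 3-fold cross-validated accuracy for each candidate
results = {name: cross_val_score(model, X_toy, toy_labels, cv=3).mean()
           for name, model in candidates.items()}

for name, score in results.items():
    print("{:<20s} {:.3f}".format(name, score))
```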