# Introduction to Data Science
## Homework 5

Student Name: Zhe Huang

Student Netid: zh1087
***

### Part 1: Naive Bayes (5 Points)

1\. From your reading you know that the naive Bayes classifier works by calculating the conditional probabilities of each feature, $e_i$, occurring with each class $c$ and treating them independently. This results in the probability of a certain class occurring given a set of features, or a piece of evidence, $E$, as

$$P(c \mid E) = \frac{P(e_1 \mid c) \cdot P(e_2 \mid c) \cdots P(e_k \mid c) \cdot P(c)}{P(E)}.$$

The conditional probability of each piece of evidence occurring with a given class is given by

$$P(e_i \mid c) = \frac{\text{count}(e_i, c)}{\text{count}(c)}.$$

In the above equation $\text{count}(e_i, c)$ is the number of documents in a given class that contain feature $e_i$ and $\text{count}(c)$ is the number of documents that belong to class $c$. 

A common variation of the above is to use Laplace (sometimes called +1) smoothing. Recall the use of Laplace smoothing introduced toward the end of Chapter 3 in the section Probability Estimation. This is done in sklearn by setting `alpha=1` in the `BernoulliNB()` function (this is also the default behavior). The result of Laplace smoothing will slightly change the conditional probabilities,

$$P(e_i \mid c) = \frac{\text{count}(e_i, c) + 1}{\text{count}(c) + 2}.$$
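To make the two estimates concrete, here is a minimal sketch that evaluates both formulas for a feature that never co-occurs with a class. The function name `conditional_prob` and the counts are hypothetical, chosen only for illustration.

```python
def conditional_prob(count_feature_and_class, count_class, alpha=0.0):
    """Estimate P(e_i | c) for a binary feature with additive smoothing alpha.

    alpha=0 gives the raw frequency estimate; alpha=1 gives Laplace smoothing,
    matching (count(e_i, c) + 1) / (count(c) + 2).
    """
    return (count_feature_and_class + alpha) / (count_class + 2 * alpha)


# Hypothetical counts: the feature appears in 0 of the 50 documents of class c.
unsmoothed = conditional_prob(0, 50)
smoothed = conditional_prob(0, 50, alpha=1.0)

print(unsmoothed)  # exactly zero, which zeroes out the whole product for class c
print(smoothed)    # small but non-zero (1/52)
```

With the raw estimate, a single unseen feature forces the product in the numerator to zero no matter what the other features say; the smoothed estimate avoids that.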

In no more than **one paragraph**, describe why this is useful, and use the bias-variance tradeoff to justify its use. Try to think of a case when not using Laplace smoothing would result in "bad" models. Try to give an example. Be precise.

Answer here!

### Part 2: Text classification for sentiment analysis (20 Points)
For this part of the assignment, we are going to use a data set of movie ratings from IMDB.com. The data consists of the text of a movie review and a target variable which tells us whether the reviewer had a positive feeling towards the movie (equivalent to rating the movie between 7 and 10) or a negative feeling (rating the movie between 1 and 4). Neutral reactions are not included in the data.

The data are located in "`data/imdb.csv`". The first column is the review text; the second is the text label 'P' for positive or 'N' for negative.

1 (1 Point)\. Load the data into a pandas `DataFrame()`.

In [1]:
import pandas as pd
data = pd.read_csv('imdb.csv')

In [2]:
data.head(10)

Unnamed: 0,Text,Class
0,'One of the first of the best musicals Anchors...,P
1,'Visually disjointed and full of itself the di...,N
2,'These type of movies about young teenagers st...,P
3,'I would rather of had my eyes gouged out with...,N
4,'The title says it all. Tail Gunner Joe was a ...,N
5,'There is no greater disservice to do to histo...,P
6,'National Lampoon Goes to the Movies (1981) is...,N
7,'I rented this on DVD yesterday and did not re...,N
8,'Midnight Cowboy is one of those films thats b...,P
9,'Its not a big film. The acting is not amazing...,P


2 (1 Point)\. Code the target variable to be numeric: use the value `1` to represent 'P' and `0` to represent 'N'.

In [4]:
# Map the text labels to numbers: 'P' -> 1, 'N' -> 0
data['Class'] = data['Class'].map({'P': 1, 'N': 0})

In [5]:
data.head(10)

Unnamed: 0,Text,Class
0,'One of the first of the best musicals Anchors...,1.0
1,'Visually disjointed and full of itself the di...,0.0
2,'These type of movies about young teenagers st...,1.0
3,'I would rather of had my eyes gouged out with...,0.0
4,'The title says it all. Tail Gunner Joe was a ...,0.0
5,'There is no greater disservice to do to histo...,1.0
6,'National Lampoon Goes to the Movies (1981) is...,0.0
7,'I rented this on DVD yesterday and did not re...,0.0
8,'Midnight Cowboy is one of those films thats b...,1.0
9,'Its not a big film. The acting is not amazing...,1.0


In [7]:
type(data.Class[1])

numpy.float64
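A float result after mapping labels to integers is usually a sign that the column contains at least one missing value, because pandas upcasts an integer column to float as soon as it has to hold `NaN`. The sketch below reproduces this on a small made-up frame (the `toy` data is an assumption, not the IMDB file) and shows one possible way to drop unlabeled rows before coding the target.

```python
import pandas as pd

# Toy stand-in for the review data; the last row has a missing label.
toy = pd.DataFrame({
    'Text': ['loved it', 'hated it', 'no label given'],
    'Class': ['P', 'N', None],
})

# Mapping leaves the missing label as NaN, so the result is float64.
coded = toy['Class'].map({'P': 1, 'N': 0})
print(coded.dtype)
print(coded.isna().sum())

# One option: drop rows without a label, then cast to a true integer column.
clean = toy.dropna(subset=['Class']).copy()
clean['Class'] = clean['Class'].map({'P': 1, 'N': 0}).astype(int)
print(clean['Class'].tolist())
```

Whether dropping the row is appropriate depends on the data; inspecting the unlabeled rows first is a reasonable alternative.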

3 (2 Points)\. Put all of the text into a data frame called `X` and the target variable in a data frame called `Y`. Make a train/test split where you give 75% of the data to training. Feel free to use any function from sklearn.

In [10]:
%%time
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X = pd.DataFrame(data['Text'])
Y = pd.DataFrame(data['Class'])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.25)

CPU times: user 18.2 ms, sys: 3.6 ms, total: 21.8 ms
Wall time: 18.8 ms
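The split above is random, so the rows that land in the test set change on every run. As a hedged sketch on made-up data, passing `random_state` makes the split reproducible and `stratify` keeps the class balance the same in both parts. The `toy` frame and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 8 reviews, 4 positive and 4 negative.
toy = pd.DataFrame({
    'Text': ['review %d' % i for i in range(8)],
    'Class': [1, 0, 1, 0, 1, 0, 1, 0],
})
X_toy = toy[['Text']]
Y_toy = toy[['Class']]

# 75% train / 25% test, reproducible and stratified on the label.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=0, stratify=Y_toy['Class']
)

print(len(X_tr), len(X_te))
print(sorted(Y_te['Class'].tolist()))
```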


In [43]:
Y_test.iloc[760:765,]

Unnamed: 0,Class
4063,0.0
1025,0.0
3544,
4808,0.0
920,1.0


4 (5 Points)\. Create a binary `CountVectorizer()` and a binary `TfidfVectorizer()`. Use the original single words as well as bigrams (in the same model). Also, use an "english" stop word list. Fit these to the training data to extract a vocabulary and then transform both the train and test data. Hint - look at the API documentation for both vectorizers to see what we mean by "binary."

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Both vectorizers are binary, use unigrams and bigrams, and drop English stop words
ct_vectorizer=CountVectorizer(binary=True,stop_words='english',ngram_range=(1,2))
tfidf_vectorizer=TfidfVectorizer(binary=True,stop_words='english',ngram_range=(1,2))

#CountVectorizer:
X_train_ct=ct_vectorizer.fit_transform(X_train['Text'])
X_test_ct=ct_vectorizer.transform(X_test['Text'])

#TfidfVectorizer:
X_train_tfidf=tfidf_vectorizer.fit_transform(X_train['Text'])
X_test_tfidf=tfidf_vectorizer.transform(X_test['Text'])

In [12]:
X_train_ct.shape

(6375, 570998)
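To see what "binary" and the bigram setting do, here is a small sketch on two made-up documents (the `docs` list is hypothetical). With `binary=True`, every non-zero count is clipped to 1, and `ngram_range=(1, 2)` adds two-word phrases to the vocabulary alongside single words.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great great movie", "terrible movie"]

# Plain counts: "great" is counted twice in the first document.
counts = CountVectorizer().fit_transform(docs).toarray()

# Binary counts: presence/absence only.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Unigrams and bigrams in the same vocabulary.
bigram_vec = CountVectorizer(binary=True, ngram_range=(1, 2)).fit(docs)
vocab = sorted(bigram_vec.vocabulary_)

print(counts)
print(binary)
print(vocab)
```

Adding bigrams is also why the real feature matrix above is so wide: most two-word combinations occur in only a handful of reviews.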

5 (6 Points)\. Create `LogisticRegression()` and `BernoulliNB()` models. For all settings, keep the default values. In a single plot, show the ROC curve for both classifiers and both vectorizers defined above. In the legend, include the area under the ROC curve (AUC). Do not forget to label your axes. Your final plot will be a single window with 4 curves.

Which model do you think does a better job? Why? Explain in no more than a paragraph.

Extra credit (2 points): Do any of the options perform identically? If so, can you explain why?
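One thing worth checking for the extra-credit question is that `BernoulliNB` has a `binarize` parameter, which defaults to `0.0`: any feature value above the threshold is treated as 1 before fitting. The sketch below uses a made-up matrix and an arbitrary rescaling (both are assumptions, standing in for count versus tf-idf weights) to show that rescaling positive values does not change the fitted probabilities.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy features: raw counts, and the same matrix with different positive weights.
X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 1, 0],
                     [0, 0, 4]])
X_scaled = X_counts * 0.37
y = np.array([1, 0, 1, 0])

# Default binarize=0.0 maps every value > 0 to 1 in both cases.
p_counts = BernoulliNB().fit(X_counts, y).predict_proba(X_counts)
p_scaled = BernoulliNB().fit(X_scaled, y).predict_proba(X_scaled)

print(np.allclose(p_counts, p_scaled))
```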

In [16]:
# Run this so your plots show properly
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 12

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

def plotROC(truth, pred_score, label_string, c):
    # Plot one ROC curve and put its AUC in the legend entry
    fpr, tpr, thresholds = roc_curve(truth, pred_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=c, label=label_string + ' (AUC = %0.3f)' % roc_auc)
    return roc_auc

plt.figure()

# CountVectorizer features
clf_lr_ct = LogisticRegression()
clf_BNB_ct = BernoulliNB()

clf_lr_ct.fit(X_train_ct, Y_train['Class'])
preds_lr = clf_lr_ct.predict_proba(X_test_ct)[:, 1]
plotROC(Y_test['Class'], preds_lr, 'CountVectorizer_LR', 'blue')

clf_BNB_ct.fit(X_train_ct, Y_train['Class'])
preds_BNB = clf_BNB_ct.predict_proba(X_test_ct)[:, 1]
plotROC(Y_test['Class'], preds_BNB, 'CountVectorizer_BNB', 'red')

# TfidfVectorizer features
clf_lr_tfidf = LogisticRegression()
clf_BNB_tfidf = BernoulliNB()

clf_lr_tfidf.fit(X_train_tfidf, Y_train['Class'])
preds_lr = clf_lr_tfidf.predict_proba(X_test_tfidf)[:, 1]
plotROC(Y_test['Class'], preds_lr, 'TfidfVectorizer_LR', 'black')

clf_BNB_tfidf.fit(X_train_tfidf, Y_train['Class'])
preds_BNB = clf_BNB_tfidf.predict_proba(X_test_tfidf)[:, 1]
plotROC(Y_test['Class'], preds_BNB, 'TfidfVectorizer_BNB', 'green')

# Label the axes and show all four curves in a single figure
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

<matplotlib.figure.Figure at 0x11e343518>

In [37]:
import numpy as np
print(np.isfinite(preds_lr).sum())
print(np.isfinite(Y_test['Class']).sum())

2125
2124


In [28]:
np.where(np.isfinite(Y_test['Class']) == False)

(array([762]),)

In [40]:
Y_test.iloc[762,0]

nan
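Since the error traces back to a single missing label, one possible workaround is to mask out non-finite labels before computing any ROC metric. This is a minimal sketch on made-up arrays (`y_true` and `scores` are assumptions, not the notebook's variables); removing the unlabeled row from the data before the train/test split would address the problem at its source.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels with one missing entry, and matching predicted scores.
y_true = np.array([0.0, 1.0, np.nan, 1.0, 0.0])
scores = np.array([0.2, 0.9, 0.5, 0.7, 0.1])

# Keep only the rows whose label is a finite number.
mask = np.isfinite(y_true)
auc_value = roc_auc_score(y_true[mask], scores[mask])

print(mask.sum())
print(auc_value)
```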

In [None]:
# Note: some sklearn estimators (for example, the tree-based models) convert the
# input matrix to np.float32 for efficiency (X = X.astype(np.float32)) when they can.
# In that conversion from float64, a very large number (e.g., 2.9e+200) becomes inf.
# That is not the cause here: the error comes from the NaN label found in Y_test above.

In [58]:
np.isfinite([-np.inf, 0., np.inf]).any()

True
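The result above is `True` because `.any()` only asks whether at least one element is finite, and `0.` is. To check that an array contains no `NaN` or infinite values at all, `.all()` is the relevant test, as this short sketch shows.

```python
import numpy as np

values = np.array([-np.inf, 0.0, np.inf])

any_finite = np.isfinite(values).any()   # at least one finite element?
all_finite = np.isfinite(values).all()   # every element finite?

print(any_finite)
print(all_finite)
```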

Explanation here!

6\. Use the model from question 5 that you think did the best job and predict the rating of the test data. Find 5 examples that should have been positive, but were incorrectly classified as negative. Print out the reviews below and include an explanation as to why you think each may have been incorrectly classified. You can pick any 5. They do not have to be chosen at random.
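A minimal sketch of how such false negatives could be selected, assuming the true labels and predicted labels are available as aligned arrays. The `reviews`, `y_true`, and `y_pred` values below are made up for illustration and are not output from the models in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test reviews, their true labels, and model predictions.
reviews = pd.Series([
    'Slow start but a wonderful ending',
    'A classic, loved every minute',
    'Dull and far too long',
    'Not bad at all, surprisingly moving',
    'Awful acting',
])
y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1])

# False negatives: truly positive (1) but predicted negative (0).
false_negative = (y_true == 1) & (y_pred == 0)
missed = reviews[false_negative]

# Print up to five of them in full.
for text in missed.head(5):
    print(text)
```

With a fitted classifier, `y_pred` would come from its `predict` method applied to the vectorized test text.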

In [None]:
# Code here to display 5 incorrect reviews.

Explanation for the 5 reviews chosen here!