# IMT574 Problem Set 5: Naïve Bayes

    
### Introduction

This problem set has two aims: a) learn to understand Naive Bayes, and b) learn to handle text, a form of data that does not come as a table of numbers. You will implement your own Naive Bayes classifier and use this for categorizing Rotten Tomatoes reviews into rotten and fresh ones. Finally, you also find the optimal smoothing parameter. Please submit a) your code (notebooks, rmd, whatever) and b) the results in a final output form (html or pdf). Note that this is groupwork, you should find 2-3 student groups (or wait until we assign you to one).

Rotten Tomatoes
Our first task is to load, clean and explore the Rotten Tomatoes movie reviews data. Please familiarize yourself a little bit with the webpage. Briey, approved critics can write reviews for movies, and evaluate the movie as fresh or rotten. The webpage normally shows a short quote from each critic, and whether it evaluates the movie as fresh or rotten. You will work on these quotes below. The central variables for our purpose in rotten-tomatoes.csv are the following:
fresh evaluation: 'fresh' or 'rotten'
quote short version of the review
There are more variables like links to IMDB.

In [1]:
#IPython is what you are using now to run the notebook
import IPython
print( "IPython version:      %6.6s (need at least 1.0)" % IPython.__version__)

# Numpy is a library for working with arrays and matrices
import numpy as np
print( "Numpy version:        %6.6s (need at least 1.7.1)" % np.__version__)

# SciPy implements many different numerical algorithms
import scipy as sp
print( "SciPy version:        %6.6s (need at least 0.12.0)" % sp.__version__)

# Pandas makes working with data tables easier
import pandas as pd
print( "Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__)

# Module for plotting
import matplotlib.pyplot as plt  
from pylab import *
print( "Mapltolib version:    %6.6s (need at least 1.2.1)" %
       matplotlib.__version__)
%matplotlib inline
# necessary for in-line graphics

# SciKit Learn implements several Machine Learning algorithms
import sklearn
print( "Scikit-Learn version: %6.6s (need at least 0.13.1)" %
       sklearn.__version__)
import os


# for certain system-related functions

from scipy import stats

import statsmodels.formula.api as smf


IPython version:       7.8.0 (need at least 1.0)
Numpy version:        1.16.5 (need at least 1.7.1)
SciPy version:         1.3.1 (need at least 0.12.0)
Pandas version:       0.25.1 (need at least 0.11.0)
Mapltolib version:     3.1.1 (need at least 1.2.1)
Scikit-Learn version: 0.21.3 (need at least 0.13.1)


**1. Explore and clean the data**

***Q1.1 Take a look at a few lines of data (you may use pd.sample for this).***

In [2]:
data = pd.read_csv("rotten-tomatoes.csv",sep = ",")

In [3]:
data.shape

(13442, 9)

In [4]:
data.head(5)

Unnamed: 0,critic,fresh,imdb,link,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709,http://www.timeout.com/film/reviews/87745/toy-...,Time Out,"So ingenious in concept, design and execution ...",2009-10-04 00:00:00,9559,Toy Story
1,Richard Corliss,fresh,114709,"http://www.time.com/time/magazine/article/0,91...",TIME Magazine,The year's most inventive comedy.,2008-08-31 00:00:00,9559,Toy Story
2,David Ansen,fresh,114709,http://www.newsweek.com/id/104199,Newsweek,A winning animated feature that has something ...,2008-08-18 00:00:00,9559,Toy Story
3,Leonard Klady,fresh,114709,http://www.variety.com/review/VE1117941294.htm...,Variety,The film sports a provocative and appealing st...,2008-06-09 00:00:00,9559,Toy Story
4,Jonathan Rosenbaum,fresh,114709,http://onfilm.chicagoreader.com/movies/capsule...,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10 00:00:00,9559,Toy Story


***Q1.2 print out all variable names.***

In [5]:
variables = data.columns
print("The list of variables is as following: \n{}".format(variables))

The list of variables is as following: 
Index(['critic', 'fresh', 'imdb', 'link', 'publication', 'quote',
       'review_date', 'rtid', 'title'],
      dtype='object')


In [6]:
data.fresh.unique()

array(['fresh', 'rotten', 'none'], dtype=object)

***Q1.3 create a summary table (maybe more like a bullet list) where you print out the most important
summary statistics for the most interesting variables. The most interesting facts you should present
should include: a) number of missings for fresh and quote; b) all different values for fresh/rotten
evaluations; c) counts or percentages of these values; d) number of zero-length or only whitespace
quote-s; e) minimum-maximum-average length of quotes (either in words, or in characters). (Can
you do this as an one-liner?); f) how many reviews are in data multiple times. Feel free to add more
figures you consider relevant.***

In [7]:
data.fresh.value_counts()

fresh     8389
rotten    5030
none        23
Name: fresh, dtype: int64


**There are 23 rows having none(missing value) in fresh column. There are 1833 fresh evaluations and 1260 rotten evaluations.**


In [8]:
data.quote.isnull().sum()

0

There is 0 missing value in the quotes column.

In [9]:
data.fresh.isnull().sum()

0

In [10]:
fresh_percent = (8389/(8389+5030))*100
fresh_percent

62.51583575527238

In [11]:
rotten_percent = (5030/(8389+5030))*100
rotten_percent

37.48416424472762

In [12]:
(data.quote == ' ').sum()

0

There are 0 number of only white spaces in quotes

In [13]:
(data.quote == '').sum()

0

There are 0 number of rows having zero-length in quotes

In [14]:
#We are finding the mean length of the quotes column
mean_length = data.quote.str.len().mean()
mean_length

121.23128998660914

In [15]:
#We are finding the min length of the quotes column
min_length = data.quote.str.len().min()
min_length

4

In [16]:
#We are finding the max length of the quotes column
max_length = data.quote.str.len().max()
max_length

256

In [17]:
#we are making another dataframe for only those titles which are duplicate or more than one entry
df_duplicate_title = data[data.duplicated(['title'])]

In [18]:
duplicates=df_duplicate_title['title'].nunique() 
duplicates

1555

There are 1555 titles having duplicates or more than one entry

In [19]:
df2 = pd.DataFrame(np.array([["number of rows having none (missing value) in fresh column",23],["percent of fresh reviews",fresh_percent],["percent of rotten reviews",rotten_percent],
                           ["number of rows having missing values in quote column",0],["number of titles having duplicate entries",duplicates]]),
                  columns = ['heading','value'])

In [20]:
df2

Unnamed: 0,heading,value
0,number of rows having none (missing value) in ...,23.0
1,percent of fresh reviews,62.51583575527238
2,percent of rotten reviews,37.48416424472762
3,number of rows having missing values in quote ...,0.0
4,number of titles having duplicate entries,1555.0


***Q1.4-Now when you have an overview what you have in data, clean it by removing all the inconsistencies
the table reveals. We have to ensure that the central variables, quote and fresh are not missing,
quote is not an empty string (or just contain spaces and such), and all rows are unique.***

In [21]:
data.shape

(13442, 9)

The shape of uncleaned data frame is 13442,9

In [22]:
def clean_data(df):
    cleaned_df = df.copy()
    cleaned_df = cleaned_df.dropna(how = 'any', subset = ['fresh','quote']) #removing missing values from fresh and quote columns 
    cleaned_df = cleaned_df[cleaned_df['fresh'] != 'none'] #removing rows having none in the fresh column
    
    return cleaned_df

In [23]:
cleaned_df = clean_data(data)
cleaned_df.shape 

(13419, 9)

We deleted the rows having missing values in quote and fresh columns. We also deleted the rows having none in the fresh column.
The shape of cleaned_df is 13419, 9

### 2 Naïve Bayes


**Now where you are familiar with the data, it's time to get serious and implement the Naive Bayes classifier from scratch. But first things first.**

**1. Ensure you are familiar with Naive Bayes. Consult the readings, available on canvas. Schutt & O'Neill is an easy and accessible (and long) introduction, Whitten & Frank is a lot shorter but still accessible introduction. The Lecture notes contains examples how to create baf-of-words (BOW), and how to compute Naive Bayes classifier using BOW-s.**

**2. Convert your data (quotes) into bag-of-words. Your code should look something like this:**


In [24]:
from sklearn.feature_extraction.text import CountVectorizer

#if vectorizer == None:
vectorizer = CountVectorizer(min_df = 0,binary=True,stop_words = "english") 
        
text = cleaned_df.quote.values
vectorizer.fit(text)
X = vectorizer.transform(text)
feature_names = vectorizer.vocabulary_
X = X.toarray()
Y = cleaned_df.fresh.apply(lambda x: 1 if x == 'fresh' else 0)


In [25]:
vectorizer.get_feature_names()[0:5]

['000', '0014', '007', '044', '07']

In [26]:
#vectorizer.vocabulary_

In [27]:
X.shape

(13419, 20584)

In [28]:
Y.shape

(13419,)

In [29]:
#feature_names

We would convert X(Bag of Words) to dataframe and add Y as a column to it.

In [30]:
X_BOW = X.copy()

In [31]:
X_BOW_df = pd.DataFrame(X_BOW, columns=vectorizer.get_feature_names())

In [32]:
X_BOW_df.head(5)

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zoom,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
X_BOW_df.shape

(13419, 20584)

In [34]:
XY_BOW_df = X_BOW_df.copy()

In [35]:
XY_BOW_df['Y'] = Y

In [36]:
Y.unique()

array([1, 0], dtype=int64)

In [37]:
XY_BOW_df.Y.value_counts()

1.0    8379
0.0    5017
Name: Y, dtype: int64

In [38]:
XY_BOW_df.Y.unique()

array([ 1.,  0., nan])

We notice that in some of Y we have got nan, we would remove all such rows having Y as nan.

In [39]:
XY_BOW_df1 = XY_BOW_df.dropna(subset = ['Y'])

In [40]:
XY_BOW_df1.shape

(13396, 20585)

In [41]:
XY_BOW_df1.Y.unique()

array([1., 0.])

In [42]:
XY_BOW_df1.head(5)

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0


binary=True specifies that we don't want BOW-s that contains counts of words but just 1/0 for the presence/non-presence of the words.

***Q2.3- Split your work data and target (i.e. the variable fresh) into training and validation chunks (80/20 or so).***

In [43]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(XY_BOW_df1, test_size=0.2)

In [44]:
train.shape

(10716, 20585)

In [45]:
test.shape

(2680, 20585)

In [46]:
"""
from sklearn.model_selection import train_test_split
# split data into training and test set 
X_train, X_test, y_train, y_test = train_test_split(XY_df.iloc[:,:-1],XY_df.iloc[:,-1], test_size = 0.2)"""

'\nfrom sklearn.model_selection import train_test_split\n# split data into training and test set \nX_train, X_test, y_train, y_test = train_test_split(XY_df.iloc[:,:-1],XY_df.iloc[:,-1], test_size = 0.2)'

***Q2.4- Compute the unconditional (log) probability that the tomato is fresh/rotten, log Pr(F), and log Pr(R).
These probabilities are based on the values of fresh alone, not on the words the quotes contain.***

We are finding the value counts of Y = 1 and Y = 0 in training data.

In [47]:
train.Y.value_counts()

1.0    6715
0.0    4001
Name: Y, dtype: int64

We have denoted fresh as 1 and rotten as 0.

In [48]:
pd.DataFrame(train.Y.value_counts())

Unnamed: 0,Y
1.0,6715
0.0,4001


In [49]:
pd.DataFrame(train.Y.value_counts())['Y'][0]

4001

In [50]:
total = 6667 + 4049
Pr_fresh = 6667/total
Pr_rotten = 4049/total

In [51]:
log_Pr_fresh = np.log(Pr_fresh)
log_Pr_rotten = np.log(Pr_rotten)

In [52]:
#log probability of fresh
log_Pr_fresh

-0.4745679680464393

In [53]:
#log probability of rotten
log_Pr_rotten

-0.9732680146323383

***Q2.5- For each word w, compute log Pr(w|F) and log Pr(w|R), the (log) probability that the word is present
in a fresh/rotten review. These probabilities can easily be calculated from counts of how many times
these words are present for each class.
Hint: these computations are based on your BOW-s X. Look at ways to sum along columns in this
matrix.***

Let us take our training data and subset it for Y = 1(fresh) and Y = 0(rotten)

In [54]:
train_fresh = train[train['Y'] == 1]

In [55]:
train_rotten = train[train['Y'] == 0]

In [56]:
train_fresh.shape

(6715, 20585)

In [57]:
train_fresh.head(5)

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
10974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
11364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
8260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
5968,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
7779,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0


In [58]:
columns_fresh = train_fresh.columns
columns_fresh

Index(['000', '0014', '007', '044', '07', '10', '100', '101', '104', '105',
       ...
       'zooming', 'zooms', 'zorro', 'zorros', 'zowie', 'zucker', 'zweibel',
       'zwick', 'zzzzzzzzz', 'Y'],
      dtype='object', length=20585)

In [59]:
words_fresh_count = {}

In [60]:
#Calculating the log likelihood for word given fresh. We assign 10e-5 for all words that have count 0.
for col in columns_fresh:
    count = train_fresh[col].sum()
    if count == 0:
        words_fresh_count[col] = np.log(0.00001)
        
    else:
        
        words_fresh_count[col] = np.log(count/ train_rotten.shape[0]) 

In [61]:
df_fresh_wordprob = pd.DataFrame(words_fresh_count, index = [0])

In [62]:
df_fresh_wordprob

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,-6.908005,-11.512925,-6.684862,-11.512925,-11.512925,-5.403928,-6.348389,-8.2943,-8.2943,-8.2943,...,-11.512925,-7.601152,-7.601152,-7.601152,-8.2943,-7.601152,-11.512925,-7.601152,-11.512925,0.5178


The words_fresh_count dictionary has all probability of every word given the word appeared in a fresh review.

In [63]:
columns_rotten = train_rotten.columns
columns_rotten

Index(['000', '0014', '007', '044', '07', '10', '100', '101', '104', '105',
       ...
       'zooming', 'zooms', 'zorro', 'zorros', 'zowie', 'zucker', 'zweibel',
       'zwick', 'zzzzzzzzz', 'Y'],
      dtype='object', length=20585)

In [64]:
words_rotten_count = {}

In [65]:
#Calculating the log likelihood for word given rotten. We assign 10e-5 for all words that have count 0.
for col in columns_rotten:
    count = train_rotten[col].sum()
    if count == 0:
        words_rotten_count[col] = np.log(0.00001)
        
    else:
        
        words_rotten_count[col] = np.log(count/ train_rotten.shape[0])

The words_rotten_count dictionary has all probability of every word given the word appeared in a rotten review.

In [66]:
df_rotten_wordprob = pd.DataFrame(words_rotten_count, index = [0])

In [67]:
df_rotten_wordprob

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,-8.2943,-11.512925,-7.601152,-11.512925,-11.512925,-5.809393,-6.684862,-11.512925,-11.512925,-11.512925,...,-8.2943,-11.512925,-7.601152,-11.512925,-11.512925,-8.2943,-11.512925,-8.2943,-11.512925,-11.512925


***Q2.6- For both destination classes, F and R, compute the log-likelihood that the quote belongs to this class.
log-likelihood is what is given inside the brackets in equation (1) on slide 28, and the equations on
Schutt Doing Data Science, page 102. In lecture notes it is explained before the email classication
example (and in the example too). On the slides we have the log-likelihood essentially as (although
we do not write it out):***

We have got the Pr(word/ fresh) from the words_fresh_count dictionary for every word.

We have got the Pr(word/ fresh) from the words_rotten_count dictionary for every word.

We would use these probabilities on the test set to check if the quote is fresh or rotten.

We are making a copy of the test set.

In [68]:
test_df = test.copy()

The last column of test_df, df_fresh_wordprob, df_rotten_wordprob is Y, we do not need this column for calculating log liklihoods. So we would get rid of this column.

In [69]:
test_df_withoutY = test_df[test_df.columns.difference(['Y'])]
df_fresh_wordprob_withoutY = df_fresh_wordprob[df_fresh_wordprob.columns.difference(['Y'])]
df_rotten_wordprob_withoutY = df_rotten_wordprob[df_rotten_wordprob.columns.difference(['Y'])]


In [70]:
test_df_withoutY.shape

(2680, 20584)

In [71]:
 df_fresh_wordprob_withoutY.shape

(1, 20584)

We would add another columns for the liklihood of each quote being fresh or rotten.

In [72]:
#We get a series for every quote's log liklihood of being fresh

fresh_liklihood = test_df_withoutY @ df_fresh_wordprob_withoutY.transpose() + log_Pr_fresh

In [73]:
#We get a series for every quote's log liklihood of being rotten

rotten_liklihood= test_df_withoutY @ df_rotten_wordprob_withoutY.transpose() + log_Pr_rotten

In [74]:
test_df['fresh_liklihood'] = fresh_liklihood #We add a column for every quote's log liklihood of being fresh

In [75]:
test_df['rotten_liklihood'] = rotten_liklihood #We add a column for every quote's log liklihood of being rotten

In [76]:
#We make a prediction column based on liklihood of being fresh or rotten

test_df['prediction'] = np.where(test_df['fresh_liklihood'] >= test_df['rotten_liklihood'], 1, 0)

In [77]:
test_df.Y.unique()

array([0., 1.])

In [78]:
test_df.head(5)

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y,fresh_liklihood,rotten_liklihood
13126,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,-40.156332,-45.798893
978,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1.0,-91.461301,-94.663536
10904,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1.0,-16.168146,-21.971471
11621,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,-87.891281,-91.913357
2807,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1.0,-32.760565,-36.053072


In [79]:
test_df.prediction.unique()

array([1, 0], dtype=int64)

***Q2.7 Print the resulting confusion matrix and accuracy (feel free to use existing libraries).***

We would make a copy of the data frame test_df; because we have to change data types on Y and prediction columns to print confusion matrix. Currently one is int and the other is float.

In [80]:
test_df1 = test_df.copy()

In [81]:
test_df1.Y.value_counts()

1.0    1664
0.0    1016
Name: Y, dtype: int64

In [82]:
test_df1.Y.unique()

array([0., 1.])

In [83]:
test_df1.prediction.value_counts()

1    2546
0     134
Name: prediction, dtype: int64

In [84]:
test_df1.prediction.unique()

array([1, 0], dtype=int64)

In [85]:
test_df1.Y = test_df1.Y.astype(int)

In [86]:
test_df1.Y.value_counts()

1    1664
0    1016
Name: Y, dtype: int64

In [87]:
test_df1.prediction.value_counts()

1    2546
0     134
Name: prediction, dtype: int64

In [88]:
#Printing the confusion matrix
c = pd.DataFrame(sklearn.metrics.confusion_matrix(test_df.Y, test_df.prediction, labels = [1,0]))
c

Unnamed: 0,0,1
0,1592,72
1,954,62


In [89]:
#Calculating accuracy
accuracy = (c[0][0] + c[1][1])/ (c[0][0] + c[1][1] + c[0][1]+c[1][0])
accuracy

0.6171641791044776

**Q3- Interpretation**

***Q3.1- Extract from your conditional probability vectors log Pr(w|F) and log Pr(w|R) the probabilities that
correspond to frequent words only.***

In [90]:
df_fresh_wordprob

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,-6.908005,-11.512925,-6.684862,-11.512925,-11.512925,-5.403928,-6.348389,-8.2943,-8.2943,-8.2943,...,-11.512925,-7.601152,-7.601152,-7.601152,-8.2943,-7.601152,-11.512925,-7.601152,-11.512925,0.5178


The df_fresh_wordprob has the log(Pr(word|fresh)) for every word. We would make its copy and sort in descending order.

In [91]:
df_frequent_words_fresh = df_fresh_wordprob.copy()

In [92]:
df_frequent_words_fresh

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,-6.908005,-11.512925,-6.684862,-11.512925,-11.512925,-5.403928,-6.348389,-8.2943,-8.2943,-8.2943,...,-11.512925,-7.601152,-7.601152,-7.601152,-8.2943,-7.601152,-11.512925,-7.601152,-11.512925,0.5178


In [93]:
fresh_words_frequent = df_frequent_words_fresh.sort_values(by = 0,axis=1,ascending = False)

In [94]:
fresh_words_frequent

Unnamed: 0,Y,film,movie,like,story,good,best,time,comedy,just,...,antipollution,singers,antiquated,persuasions,gaggle,sinewy,sinecures,persuasiveness,sin,sh
0,0.5178,-1.374616,-1.475376,-2.380797,-2.510474,-2.551296,-2.753036,-2.822029,-2.834714,-2.843261,...,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925,-11.512925


In [95]:
fresh_words_frequent.columns[0:11]

Index(['Y', 'film', 'movie', 'like', 'story', 'good', 'best', 'time', 'comedy',
       'just', 'director'],
      dtype='object')

The 10 most frequent words in the movies reviewed as fresh are: 

'film', 'movie', 'like', 'story', 'good', 'best', 'just','director', 'comedy','time'

The df_rotten_wordprob has the log(Pr(word/ rotten)) for every word. We would make its copy and sort in descending order.

In [96]:
df_frequent_words_rotten = df_rotten_wordprob.copy()
rotten_words_frequent = df_frequent_words_rotten.sort_values(by = 0,axis=1,ascending = False)
rotten_words_frequent.columns[0:10]

Index(['film', 'movie', 'like', 'comedy', 'good', 'story', 'director', 'time',
       'movies', 'just'],
      dtype='object')

The 10 most frequent words in the movies reviewed as rotten are: 

'film', 'movie', 'like', 'comedy', 'good', 'story', 'time', 'director', 'funny', 'make'

***Q3.1- Extract from your conditional probability vectors log Pr(w|F) and log Pr(w|R) the probabilities that
correspond to frequent words only.***

We would consider those words who appeared at leaset 30 times in the fresh reviews; and those words which appeared more than 30 times in the rotten reviews.

We would calculate the log of 30/total number of rows; and find the words whose log liklihood are grtaer than this for fresh and rotten seprately.

In [97]:
threshold = log(30/test_df.shape[0])

In [98]:
threshold

-4.492374691842747

In [99]:
words_fresh_morethan30 = {} #creating an empty dictionary to store the frequent words and theirlog liklihoods from fresh reviews.
for col in df_fresh_wordprob.columns:
    if df_fresh_wordprob[col][0] >= threshold:
        words_fresh_morethan30[col] = df_fresh_wordprob[col][0]
    

In [100]:
df_words_fresh_morethan30 = pd.DataFrame(words_fresh_morethan30, index = [0]) #converting dictionary to dataframe

In [101]:
words_rotten_morethan30 = {} #creating an empty dictionary to store the frequent words and theirlog liklihoods from rotten reviews.
for col in df_rotten_wordprob.columns:
    if df_rotten_wordprob[col][0] >= threshold:
        words_rotten_morethan30[col] = df_rotten_wordprob[col][0] 

In [102]:
df_words_rotten_morethan30 = pd.DataFrame(words_rotten_morethan30, index = [0]) #converting dictionary to dataframe

***Q3.2 Find 10 best words to predict F and 10 best words to predict R. Hint: imagine we have a review that
contains just a single word. Which word will give the highest weight to the probability the review is
fresh? Which one to the likelihood it is rotten?
Comment your results.***

In [103]:
df_words_fresh_morethan30 = df_words_fresh_morethan30.sort_values(by = 0,axis=1,ascending = False)

In [104]:
df_words_fresh_morethan30.columns[0:10] 

Index(['Y', 'film', 'movie', 'like', 'story', 'good', 'best', 'time', 'comedy',
       'just'],
      dtype='object')

In [105]:
df_words_rotten_morethan30 = df_words_rotten_morethan30.sort_values(by = 0,axis=1,ascending = False)

In [106]:
df_words_rotten_morethan30.columns[0:10]

Index(['film', 'movie', 'like', 'comedy', 'good', 'story', 'director', 'time',
       'movies', 'just'],
      dtype='object')

We calculated the log probabilities of the word given it is in fresh and rotten reviews respectively.  Then we took the difference of log probabilities of the same word occuring in fresh and rotten reviews. If the difference is positive then it means the word is a good predictor that the review is fresh. If the difference is negative then we conclude that the word is a good predictor of a rotten review.

In [107]:
difference = df_fresh_wordprob - df_rotten_wordprob

In [108]:
difference

Unnamed: 0,000,0014,007,044,07,10,100,101,104,105,...,zooming,zooms,zorro,zorros,zowie,zucker,zweibel,zwick,zzzzzzzzz,Y
0,1.386294,0.0,0.916291,0.0,0.0,0.405465,0.336472,3.218626,3.218626,3.218626,...,-3.218626,3.911773,0.0,3.911773,3.218626,0.693147,0.0,0.693147,0.0,12.030725


In [109]:
difference_descending = difference.sort_values(by = 0,axis=1,ascending = False)

In [110]:
difference_descending

Unnamed: 0,Y,capra,sturges,scare,balanced,confused,preston,convoluted,mendes,destined,...,segments,lie,wives,wayans,bertino,mcadams,dangerfield,bana,georgia,excruciating
0,12.030725,5.857683,5.703533,5.703533,5.703533,5.616521,5.41585,5.41585,5.298067,5.298067,...,-4.828064,-4.828064,-4.828064,-4.828064,-5.010385,-5.010385,-5.010385,-5.010385,-5.010385,-5.010385


In [111]:
difference_descending.columns[0:10]

Index(['Y', 'capra', 'sturges', 'scare', 'balanced', 'confused', 'preston',
       'convoluted', 'mendes', 'destined'],
      dtype='object')

We have words like nostalgic, confused, balanced and craftsmanship that occur the most as good indicators of fresh review. Kindly ignore Y which is a target variable. 

In [112]:
difference_ascending = difference.sort_values(by = 0,axis=1,ascending = True)

In [113]:
difference_ascending

Unnamed: 0,dangerfield,bana,georgia,excruciating,bertino,mcadams,wayans,lie,pile,segments,...,heads,freudian,convoluted,preston,confused,scare,sturges,balanced,capra,Y
0,-5.010385,-5.010385,-5.010385,-5.010385,-5.010385,-5.010385,-4.828064,-4.828064,-4.828064,-4.828064,...,5.298067,5.298067,5.41585,5.41585,5.616521,5.703533,5.703533,5.703533,5.857683,12.030725


In [114]:
difference_ascending.columns[0:10]

Index(['dangerfield', 'bana', 'georgia', 'excruciating', 'bertino', 'mcadams',
       'wayans', 'lie', 'pile', 'segments'],
      dtype='object')

We have words like shocker, burns, dangerfield that occur the most as good indicators of rotten review.

***3.3  Print out a few missclassified quotes. Can you understand why these are misclassified?***

In [115]:
#Finding the quotes which are misclassified as rotten instead of fresh and store in list Quote_labelFtoR
Quote_labelFtoR = test_df1[test_df1['Y']>test_df1['prediction']].index.tolist() 

In [116]:
len(Quote_labelFtoR)

72

In [117]:
#some sample indices we got above
Quote_labelFtoR[0:5]

[6339, 4959, 10337, 9891, 12657]

In [118]:
#Finding the quotes which are misclassified as fresh instead of rotten and store in list Quote_labelRtoF
Quote_labelRtoF = test_df1[test_df1['Y']<test_df1['prediction']].index.tolist() 

In [119]:
#This loop is used to print 5 quotes that are misclassified from rotten to fresh.
for index in Quote_labelRtoF[20:25]:
    print(cleaned_df.quote.values[index])    

Alfred Hitchcock's first indisputable masterpiece.
Made with fluid skill and a passion for storytelling, its tale of how the Vietnam War and American society affect a black Marine remains accessible while confounding expectations.
There are no laughs to be found in writer-director Michael Traeger's would-be comedy The Amateurs, but there is one big mystery: how actors of this caliber could have been convinced to take part.
This setup isn't exactly what you'd call plausible, but the follow-through is consistent and clever.
Coppola adapted the novel himself, and he's done a good job of paring it down.


In [120]:
#This loop is used to print 5 quotes that are misclassified from fresh to rotten.
for index in Quote_labelFtoR[0:5]:
    print(cleaned_df.quote.values[index])

A likable but wan romance.
Besson fatally misjudges the cinematic interest of his theme.
You sometimes have to giggle at this movie the way you do when you catch 4-year-olds playing dress-up in front of Mom's closet.
Those in sore need of busting a gut had better look elsewhere for comic relief.
There is beauty in Kagemusha but it is impersonal, distant and ghostly. The old master has never been more rigorous.


**Lets take one sentence from above i.e "A movie so in love with itself it hardly needs us at all." The overall tone of this review is negative by taking all words together and understanding the context of the review quote. However if we see the individual log probabilities of these words when taken independently weighs more in favor of a fresh review hence our model has classified as fresh instead of rotten.We will demonstrate this below:**

In [121]:
quote = "A movie so in love with itself it hardly needs us at all"

In [122]:
quote_words = quote.split()

In [123]:
words_prob_infresh = {}
for word in quote_words:
    if word in df_fresh_wordprob.columns:
        words_prob_infresh[word]=df_fresh_wordprob[word][0]

In [124]:
words_prob_infresh

{'movie': -1.4753755435817137,
 'love': -3.5669117901448946,
 'hardly': -5.40392785096107,
 'needs': -5.158805392928086}

In [125]:
words_prob_inrotten = {}
for word in quote_words:
    if word in df_rotten_wordprob.columns:
        words_prob_inrotten[word]=df_rotten_wordprob[word][0]

In [126]:
words_prob_inrotten

{'movie': -2.004584037948238,
 'love': -4.119912338961598,
 'hardly': -5.896404336058865,
 'needs': -5.461086264801019}

**So as we can see for the words in example sentence have higher log probability for fresh when compared to rotten. As a result of which our model classifies this quote/review as fresh instead of rotten as we see through human eyes when taken together in the context instead of independent word log probabilities.**

The Pr(word/fresh) is higher than Pr(word/rotten) for each of these words in the quote. That is why we feel our model misclassified this quote as fresh although it should actually be rotten review.

**Q4- NB with smoothing**

***Q4.1 and Q4.2***

In [127]:
def model_fit(train, alpha): #train= training data, alpha = smoothehing parameter
    
    number_fresh = pd.DataFrame(train.Y.value_counts())['Y'][1] #the fresh reviews have Y =1
    number_rotten = pd.DataFrame(train.Y.value_counts())['Y'][0] #the rotten reviews have Y = 0
    total = number_fresh + number_rotten
    
    Pr_fresh = number_fresh/total #finding probability of a review being fresh
    Pr_rotten = number_rotten/total #finding probability of a review being rotten
    
    log_Pr_fresh = np.log(Pr_fresh)
    log_Pr_rotten = np.log(Pr_rotten)
    
    train_fresh = train[train['Y'] == 1] #subsetting the data for Y =1
    train_rotten = train[train['Y'] == 0] #subsetting the data for Y =0
    
    #finding log(Pr(word/ fresh)) for every word
    columns_fresh = train_fresh.columns
    words_fresh_count = {}
    
    for col in columns_fresh:
        
        count = train_fresh[col].sum()
        words_fresh_count[col] = np.log((count+alpha)/ (train_rotten.shape[0]+alpha)) 
        
    df_fresh_wordprob = pd.DataFrame(words_fresh_count, index = [0])
        
    #finding log(Pr(word/ rotten)) for every word
    columns_rotten = train_rotten.columns
    words_rotten_count = {}
    
    for col in columns_rotten:
        
        count = train_rotten[col].sum()
        words_rotten_count[col] = np.log((count+alpha)/ (train_rotten.shape[0]+alpha)) #We are adding alpha in numerator and denominator for smoothening
        
    df_rotten_wordprob = pd.DataFrame(words_rotten_count, index = [0])
    
    return log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob
    
#We are returning 4 things: 1.the probability of a quote being fresh, 2.the probability of a quote being fresh,
# 3.the data frame having words as column; and a single row having log(Pr(Word/ fresh)) for each word,
# 4.the data frame having words as column; and a single row having log(Pr(Word/ rotten)) for each word,

In [128]:
log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob = model_fit(train, 0.5)

In [129]:
#note that we are passing the trst data set and not the training dataset.

In [130]:
def model_predict(log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob, test): #test means the validation dataset
    
    test_df = test.copy()
    
    #The last column of test_df, df_fresh_wordprob, df_rotten_wordprob is Y, we do not need this column for calculating log liklihoods. 
    #So we would get rid of this Y column.
    
    test_df_withoutY = test_df[test_df.columns.difference(['Y'])]
    df_fresh_wordprob_withoutY = df_fresh_wordprob[df_fresh_wordprob.columns.difference(['Y'])]
    df_rotten_wordprob_withoutY = df_rotten_wordprob[df_rotten_wordprob.columns.difference(['Y'])]
    
    #doing matrix multiplications
    fresh_liklihood = test_df_withoutY @ df_fresh_wordprob_withoutY.transpose() + log_Pr_fresh
    rotten_liklihood= test_df_withoutY @ df_rotten_wordprob_withoutY.transpose() + log_Pr_rotten
    
    test_df['fresh_liklihood'] = fresh_liklihood #We add a column for every quote's log liklihood of being fresh
    test_df['rotten_liklihood'] = rotten_liklihood #We add a column for every quote's log liklihood of being rotten
    
    #We make a prediction column based on liklihood of being fresh or rotten
    test_df['prediction'] = np.where(test_df['fresh_liklihood'] >= test_df['rotten_liklihood'], 1, 0)
    
    #creating confusion matrix
    c = pd.DataFrame(sklearn.metrics.confusion_matrix(test_df.Y, test_df.prediction, labels = [1,0]))
    accuracy = (c[0][0] + c[1][1])/ (c[0][0] + c[1][1] + c[0][1]+c[1][0])
    
    return accuracy
    

In [131]:
accuracy = model_predict(log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob, test)
accuracy

0.6179104477611941

***Q4.3- Cross-validate the accuracy (on the validation data) on a number of  values and nd the  that
gives you the best result. You can use your own CV algorithm you created for PS4, or an existing
library.***

XY_BOW_df1 is our cleaned data set. We would be using this for k fold cross validation. 

We want to run for different alpha values.

In [134]:

    
    from sklearn.model_selection import KFold
    kfold = KFold(n_splits=5, random_state=100, shuffle=False) #writing code for k fold

    def cross_validate(XY_BOW_df1):
        alpha_list = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
        

        for alpha in alpha_list:
            accuracy_list = []
            for train_index, test_index  in kfold.split(XY_BOW_df1): #get train and test indices for k fold cross validation
                train, test = XY_BOW_df1.iloc[train_index], XY_BOW_df1.iloc[test_index]

                
                log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob = model_fit(train, alpha)

                accuracy = model_predict(log_Pr_fresh, log_Pr_rotten, df_fresh_wordprob, df_rotten_wordprob, test)
                
                accuracy_list.append(accuracy)
            avg_accuracy = sum(accuracy_list)/len(accuracy_list)

            print("The alpha is {} and the average accuracy is {}".format(alpha, avg_accuracy))



In [135]:
cross_validate(XY_BOW_df1)

The alpha is 0.1 and the average accuracy is 0.6157094148518327
The alpha is 0.2 and the average accuracy is 0.6172771361557274
The alpha is 0.3 and the average accuracy is 0.6189941669034448
The alpha is 0.4 and the average accuracy is 0.6204871220604703
The alpha is 0.5 and the average accuracy is 0.6213829787234043
The alpha is 0.6 and the average accuracy is 0.6216068036079401
The alpha is 0.7 and the average accuracy is 0.6228758503117113
The alpha is 0.8 and the average accuracy is 0.6232491796337463
The alpha is 0.9 and the average accuracy is 0.6236224532433019


When I was running the cross validation, the highest accuracy was for alpha = 0.9, and the accuracy was 62.3%.