# Using the Multinomial Naive Bayes algorithm for sentiment analysis

In this assignment, you will learn to classify a movie review as 'positive' or 'negative' using Multinomial Naive Bayes.

The dataset used in this assignment is a version of a dataset available at http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/, which was previously used in a Stanford project and a Kaggle competition.  (See the page referenced in the URL for further details.)

The data for this assignment is divided into four files: trainfilex.txt, trainfiley.txt, testfilex.txt, testfiley.txt

Each line of trainfilex.txt contains a review of a film.  The labels of these reviews are in trainfiley.txt.
Line i of trainfiley.txt contains the label of the review in line i of trainfilex.txt.
The test reviews are in testfilex.txt and their corresponding labels are in testfiley.txt.

The reviews in these files have been pre-processed to replace punctuation with whitespace and to convert capital letters into lower case.

You need to write a program that uses the Multinomial Naive Bayes algorithm to train on the training files
and then predict the labels of the reviews in the test file.  Use smoothing with m=0.2 when estimating the P(w|C) quantities.  Use log likelihood to avoid underflow.  Don't try to be selective with your vocabulary.  Include all tokens from the training set in your vocabulary.
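
For reference, the decision rule described above can be written out as follows. The notation is an assumption chosen for this sketch: $n_{w,C}$ is the number of occurrences of token $w$ in the training reviews of class $C$, $N_C$ is the total number of tokens in those reviews, $|V|$ is the vocabulary size, and $d$ is the review being classified (the sum runs over every token occurrence in $d$).

$$P(w \mid C) = \frac{n_{w,C} + m}{N_C + m\,|V|}, \qquad \hat{C}(d) = \arg\max_{C} \Big[ \log P(C) + \sum_{w \in d} \log P(w \mid C) \Big]$$

Summing logarithms instead of multiplying probabilities keeps the score in a representable range even for long reviews.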

Your program should calculate
the prediction accuracy (percentage of correct predictions) achieved on the test reviews, by comparing
the predictions made by your algorithm to the labels in testfiley.txt.  (Your program should NOT
access testfiley.txt until it needs to calculate the prediction accuracy.)
Do not use sklearn in your program.


## Step 1:  Reading in the data files


In [1]:
import numpy as np

# read in train and test files, removing newlines
# (each file is opened in a with-block so that it is closed after reading)

with open("trainfilex.txt", "r") as f:
    trainrevs = [line.rstrip('\n') for line in f]

with open("trainfiley.txt", "r") as f:
    trainlabels = [line.rstrip('\n') for line in f]

with open("testfilex.txt", "r") as f:
    testrevs = [line.rstrip('\n') for line in f]

with open("testfiley.txt", "r") as f:
    testlabels = [line.rstrip('\n') for line in f]

# labels are '1' (positive) or '0' (negative); as the output below shows,
# the first few training reviews are a mix of both classes
#
# Check the first few lines of the trainrevs file
print(trainrevs[0:5])
# Check the first few lines of the trainlabels file
print(trainlabels[0:5])

print("The number of training examples: ", len(trainrevs))
print("The number of test examples: ", len(testrevs))



[' serious and thoughtful  ', ' with a completely predictable plot  you  ll swear that you  ve seen it all before  even if you  ve never come within a mile of the longest yard  ', ' if there was any doubt that peter o fallon did n t have an original bone in his body  a rumor of angels should dispel it  ', ' i like my christmas movies with more elves and snow and less pimps and ho  s  ', ' a terrifically entertaining specimen of spielbergian sci-fi  ']
['1', '0', '0', '0', '1']
The number of training examples:  1349
The number of test examples:  151


We'll write two initial helper functions.

###  Compute the vocabulary from the reviews in the training set

In [2]:
# Input: a list of strings. each string consists of tokens separated by whitespace.
# Output: a list of all distinct tokens found in the strings
def build_vocab(x):
########## TO DO ##########
    
    vocab = []
    seen = set()   # fast membership test; the list keeps first-seen order

    for line in x:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
            
##########
    return vocab


### Compute smoothed estimate of P(w|C)

In [3]:
# Input: number n of occurrences of w in C, total length N (in tokens) of the docs in C, smoothing parameter m,
# size of vocabulary vsize
# Output: Smoothed estimate of P(w|C)
def smooth_estimate(n,N,m,vsize):
############# TO DO ###########
    estimate = (n + m) / (N + (vsize * m))

########
    return estimate
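
As a quick sanity check of the formula, the sketch below applies the same estimate to a made-up class (the `toy_counts` dictionary is hypothetical and not taken from the dataset). It illustrates two properties of the smoothed estimate: a token with a raw count of 0 still receives a small positive probability, and the estimates over the whole vocabulary sum to 1 when N is the total number of tokens in the class.

```python
def smooth_estimate(n, N, m, vsize):
    # Same formula as the helper above: (n + m) / (N + m * vsize)
    return (n + m) / (N + (vsize * m))

# Hypothetical toy class: 10 tokens in total, vocabulary of 5 distinct tokens
toy_counts = {"good": 4, "movie": 3, "plot": 2, "fun": 1, "dull": 0}
N = sum(toy_counts.values())   # total number of tokens in the class
vsize = len(toy_counts)        # vocabulary size
m = 0.2

estimates = {w: smooth_estimate(n, N, m, vsize) for w, n in toy_counts.items()}
total = sum(estimates.values())

print(estimates["dull"])   # small but strictly positive
print(total)               # approximately 1
```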

### Write the rest of your code here

In [4]:
#Training x and y

#set m
m = .2

pos_count = 0
for x in trainlabels:
    if x == '1':
        pos_count += 1
num_pos = pos_count
num_neg = len(trainlabels) - pos_count

#priors of classes 
c_pos = num_pos / len(trainlabels)
c_neg = num_neg / len(trainlabels)

#building main vocab
vocab_all = build_vocab(trainrevs)
vocab_all_size = int(len(vocab_all))  

#building pos/neg vocab
pos_revs = []
neg_revs = []
for i in range(len(trainlabels)):
    if trainlabels[i] == '1':
        pos_revs.append(trainrevs[i])
    else:
        neg_revs.append(trainrevs[i])

vocab_pos = build_vocab(pos_revs)
vocab_neg = build_vocab(neg_revs)

#build dic of word count for each class
pos_dic = {}
neg_dic = {}
for i in range(len(trainlabels)):
    if trainlabels[i] == '1':
        for word in trainrevs[i].split():
            if word in pos_dic.keys():
                pos_dic[word] += 1
            else:
                pos_dic[word] = 1
    else:   
        for word in trainrevs[i].split():
            if word in neg_dic.keys():
                neg_dic[word] += 1
            else:
                neg_dic[word] = 1
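
The counting loop above can also be expressed with `collections.Counter`. The sketch below runs on made-up data (`toy_revs` and `toy_labels` are hypothetical stand-ins for `trainrevs` and `trainlabels`). It also shows the difference between two quantities that are easy to confuse: the total number of token occurrences in a class, which is what the header comment of `smooth_estimate` describes as N, and the number of distinct tokens in that class, which is what `len(build_vocab(...))` returns.

```python
from collections import Counter

# Hypothetical toy data in the same format as trainrevs / trainlabels
toy_revs = [" a good movie ", " dull plot ", " good fun "]
toy_labels = ["1", "0", "1"]

# One Counter of token occurrences per class label
counts = {"1": Counter(), "0": Counter()}
for review, label in zip(toy_revs, toy_labels):
    counts[label].update(review.split())

# Total number of token occurrences per class
total_tokens = {label: sum(c.values()) for label, c in counts.items()}
# Number of distinct tokens per class
distinct_tokens = {label: len(c) for label, c in counts.items()}

print(total_tokens)
print(distinct_tokens)
```

In this toy example the positive class has 5 token occurrences but only 4 distinct tokens, because "good" appears twice.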


In [5]:
print(vocab_all_size)

5456


In [10]:
#Questions:

#a
print('a: P(y=0)',c_neg,'\n')

#b
print('b: P(y=1)',c_pos,'\n')

#c
print('c:')
print('P(intelligence|y=1) =',smooth_estimate(pos_dic['intelligence'],len(vocab_pos),m,vocab_all_size) )
print('P(intelligence|y=0) = 0\n') # 'intelligence' has no entry in neg_dic (raw count 0)

print('d:')
print('P(movie|y=1) = ',smooth_estimate(pos_dic['movie'],len(vocab_pos),m,vocab_all_size) )
print('P(movie|y=0) = ',smooth_estimate(neg_dic['movie'],len(vocab_neg),m,vocab_all_size) ,'\n')

#e
correct_count = 0
wrong_count = 0

for line in range(len(testrevs)):

    prob_pos = 0
    prob_neg = 0
    
    for word in testrevs[line].split():
        
        #positive
        if word in pos_dic.keys(): #P(d|C)
            prob_pos = prob_pos + np.log(smooth_estimate(pos_dic[word],len(vocab_pos),.2,vocab_all_size))
        elif word in neg_dic.keys():
            prob_pos = prob_pos + np.log(smooth_estimate(0,len(vocab_pos),.2,vocab_all_size))
        #negative
        if word in neg_dic.keys(): #P(d|C)
            prob_neg = prob_neg + np.log(smooth_estimate(neg_dic[word],len(vocab_neg),.2,vocab_all_size))
        elif word in pos_dic.keys():
            prob_neg = prob_neg + np.log(smooth_estimate(0,len(vocab_neg),.2,vocab_all_size))
    
    prob_neg = prob_neg + np.log(c_neg) # log-likelihood + log prior
    prob_pos = prob_pos + np.log(c_pos) # log-likelihood + log prior

    if prob_pos >= prob_neg:
        if testlabels[line] == '1':
            correct_count += 1
        else:
            wrong_count += 1
    elif prob_pos < prob_neg:
        if testlabels[line] == '0':
            correct_count += 1
        else:
            wrong_count += 1
accuracy = correct_count / len(testrevs)
print('e: accuracy (m = .2) = ',accuracy)
print('Correct Count, Wrong_count: ',correct_count,wrong_count,'\n')      


#f
correct_count = 0
wrong_count = 0
for line in range(len(testrevs)):

    prob_pos = 0
    prob_neg = 0
    
    for word in testrevs[line].split():
        
        #positive
        if word in pos_dic.keys(): #P(d|C)
            prob_pos = prob_pos + np.log(smooth_estimate(pos_dic[word],len(vocab_pos),1,vocab_all_size))
        elif word in neg_dic.keys():
            prob_pos = prob_pos + np.log(smooth_estimate(0,len(vocab_pos),1,vocab_all_size))
        #negative
        if word in neg_dic.keys(): #P(d|C)
            prob_neg = prob_neg + np.log(smooth_estimate(neg_dic[word],len(vocab_neg),1,vocab_all_size))
        elif word in pos_dic.keys():
            prob_neg = prob_neg + np.log(smooth_estimate(0,len(vocab_neg),1,vocab_all_size))
            
    prob_neg = prob_neg + np.log(c_neg) # log-likelihood + log prior
    prob_pos = prob_pos + np.log(c_pos) # log-likelihood + log prior

    if prob_pos >= prob_neg:
        if testlabels[line] == '1':
            correct_count += 1
        else:
            wrong_count += 1
    elif prob_pos < prob_neg:
        if testlabels[line] == '0':
            correct_count += 1
        else:
            wrong_count += 1
accuracy1 = correct_count / len(testrevs)

print('f: accuracy (m = 1) =',accuracy1)
print('Correct Count, Wrong_count: ',correct_count,wrong_count,'\n') 


#g
pos = 0
for i in range(len(testlabels)):
    if testlabels[i] == '1':
        pos += 1
neg = len(testlabels) - pos
if pos > neg:
    zero_r = pos / len(testlabels)
else:
    zero_r = neg / len(testlabels)
print('g: ')
print('positive count: ', pos)
print('negative count: ', neg)
print('zero_r accuracy: ', zero_r,'\n')


#h
print('h: sklearn accuracy: 0.847682119205298')
print('This is better than our multinomial naive bayes accuracy of 84.1%')
    

a: P(y=0) 0.45959970348406226 

b: P(y=1) 0.5404002965159377 

c:
P(intelligence|y=1) = 0.001148612829121753
P(intelligence|y=0) = 0

d:
P(movie|y=1) =  0.01837780526594805
P(movie|y=0) =  0.02614968440036069 

e: accuracy (m = .2) =  0.8410596026490066
Correct Count, Wrong_count:  127 24 

f: accuracy (m = 1) = 0.8278145695364238
Correct Count, Wrong_count:  125 26 

g: 
positive count:  82
negative count:  69
zero_r accuracy:  0.543046357615894 

h: sklearn accuracy: 0.847682119205298
This is better than our multinomial naive bayes accuracy of 84.1%
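
The loops for parts e and f differ only in the value of m, so they could be folded into one function. The following is a minimal sketch of that idea, not a drop-in replacement for the cell above: the names `train_counts`, `predict` and `accuracy` and the toy data are hypothetical, and the sketch takes N to be the total number of tokens in a class (as the header comment of `smooth_estimate` describes), whereas the cell above passes `len(vocab_pos)` and `len(vocab_neg)`. Its numbers therefore need not match the output above. Ties go to the class seen first in training.

```python
import math
from collections import Counter

def train_counts(reviews, labels):
    """Return per-class token counts and per-class document counts."""
    token_counts, doc_counts = {}, Counter(labels)
    for review, label in zip(reviews, labels):
        token_counts.setdefault(label, Counter()).update(review.split())
    return token_counts, doc_counts

def predict(review, token_counts, doc_counts, m):
    """Return the class with the highest log prior + log likelihood."""
    vocab = set().union(*token_counts.values())
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in token_counts.items():
        N = sum(counts.values())            # total tokens in this class (assumption)
        score = math.log(doc_counts[label] / n_docs)
        for word in review.split():
            if word in vocab:               # tokens never seen in training are skipped
                score += math.log((counts[word] + m) / (N + m * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(reviews, labels, token_counts, doc_counts, m):
    """Fraction of reviews whose predicted label equals the true label."""
    correct = sum(predict(r, token_counts, doc_counts, m) == y
                  for r, y in zip(reviews, labels))
    return correct / len(labels)

# Hypothetical toy data
toy_train = [" good fun movie ", " dull boring plot ", " good plot "]
toy_train_labels = ["1", "0", "1"]
toy_test = [" good movie ", " dull boring "]
toy_test_labels = ["1", "0"]

tc, dc = train_counts(toy_train, toy_train_labels)
for m in (0.2, 1):
    print(m, accuracy(toy_test, toy_test_labels, tc, dc, m))
```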


###  Using sklearn on this dataset
Sklearn has sophisticated tools that can be used to run Multinomial Naive Bayes
on this dataset.  Let's explore those tools.



### Creating the feature vector from the text (feature extraction)

Each review will have its own feature vector.  The features will be the tokens in the vocabulary.
The $j$th feature corresponds to the $j$th token in the vocabulary, and the value of $x_j$ for a review is the number of times
that token appears in the review.  In each review, most of the features $x_j$ will be set to 0.
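
To make the representation concrete, here is a small pure-Python sketch that builds such a count vector by hand. The vocabulary `toy_vocab` and the helper `to_feature_vector` are hypothetical names used only for illustration.

```python
toy_vocab = ["good", "movie", "plot", "dull"]   # hypothetical vocabulary
index = {token: j for j, token in enumerate(toy_vocab)}

def to_feature_vector(review, index):
    # x[j] counts how often the j-th vocabulary token occurs in the review
    x = [0] * len(index)
    for token in review.split():
        j = index.get(token)
        if j is not None:      # tokens outside the vocabulary are ignored
            x[j] += 1
    return x

x = to_feature_vector(" good good plot twist ", index)
print(x)  # [2, 0, 1, 0]
```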

We will use the sklearn class CountVectorizer to create the feature vectors for every review.
It builds the vocabulary and then creates the feature vectors for the reviews.
In contrast to the approach we used above, of placing all tokens from the training set into the vocabulary, 
CountVectorizer can be more selective.  

CountVectorizer can do the following (and more):
* remove capitalization (already done for our files)
* remove punctuation (already done for our files)
* tokenize (i.e. split the document into individual words)
* count frequencies of each token 
* remove 'stop words' (these are words that will not help us predict since they occur in most documents, e.g. 'a', 'and', 'the', 'him', 'is' ...)

In [157]:
# importing the libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# creating an instance of CountVectorizer
# Note there are issues with the way CountVectorizer removes stop words.  To learn more: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
#vectorizer = CountVectorizer(stop_words='english')
vectorizer = CountVectorizer()

In [158]:
# To see the 'stop words' 
#print(vectorizer.get_stop_words())

In [159]:
# Create the vocabulary for our feature transformation
vectorizer.fit(trainrevs)

# Next we create the feature vectors for the training data
X_train = vectorizer.transform(trainrevs).toarray() # code to turn the training reviews into a feature vector
X_test = vectorizer.transform(testrevs).toarray() # code to turn the test reviews into a feature vector

# create the multinomial naive bayes classifier and fit it to the training data
mnb = MultinomialNB()
mnb.fit(X_train,trainlabels)

# compute the accuracy of the classifier on the test set
mnb.score(X_test,testlabels)

0.847682119205298
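
Note that `MultinomialNB()` is created above with its default settings. In scikit-learn the smoothing strength is set by the `alpha` parameter, whose default is 1.0, so the score above corresponds more closely to part f (m = 1) than to part e (m = .2), and the two vocabularies may also differ. The sketch below, on hypothetical toy data, shows how `alpha=0.2` could be passed instead. It also fits directly on the sparse matrix returned by CountVectorizer, which MultinomialNB accepts without a call to `.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data
toy_train = ["good fun movie", "dull boring plot", "good plot"]
toy_labels = ["1", "0", "1"]
toy_test = ["good movie", "dull boring"]

vec = CountVectorizer()
X_tr = vec.fit_transform(toy_train)   # sparse matrix, used as-is
X_te = vec.transform(toy_test)

clf = MultinomialNB(alpha=0.2)        # alpha plays the role of the smoothing parameter m
clf.fit(X_tr, toy_labels)
pred = list(clf.predict(X_te))

print(pred)
```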