# Lab: Naive Bayes

## Part 5: Bernoulli Naive Bayes on real-world data

In this notebook you will apply the Bernoulli Naive Bayes algorithm to a dataset of movie reviews.

Optimize your code so that it does not take hours to run.

Your name: Benjamin Fraeyman

#### Import(s)

In [1]:
from __future__ import print_function
import numpy as np
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score

In [2]:
reviews = load_files('reviews',encoding='us-ascii')

In [3]:
sentences = [review.split() for review in reviews.data]
X_train = sentences[:150]
X_test = sentences[150:]
class_vec = reviews.target
y_train = class_vec[:150]
y_test = class_vec[150:]

In [8]:
# List which will contain all words in the training set (duplicates included)
all_words = []

# Transform the data set (which is a list of lists) to a single list
for sentence in X_train:
    all_words.extend(sentence)

# Use the numpy function "unique" to get all unique elements from a list
vocab = np.unique(all_words)

def encode_binary(vocab,sentence):
    vocab_list = vocab.tolist()
    binary_sentence=np.zeros(len(vocab_list),)
    for word in sentence:
        if word in vocab:
            binary_sentence[vocab_list.index(word)]=1.
    return binary_sentence

# apply the function defined above to every sentence in the data set
data_set = []
for sentence in X_train:
    binary_sentence = encode_binary(vocab, sentence)
    data_set.append(binary_sentence)
    
data_set = np.array(data_set)
print(data_set)

[[1. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 ...
 [1. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
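
The `vocab_list.index(word)` call in `encode_binary` scans the vocabulary for every word of every sentence, which is probably where most of the encoding time goes. A minimal sketch of a faster variant follows, assuming the vocabulary fits comfortably in a dictionary. The names `build_word_index` and `encode_binary_fast` and the toy data are hypothetical and not part of the lab.

```python
import numpy as np

def build_word_index(vocab):
    # Hypothetical helper: map each vocabulary word to its column index once
    return {word: i for i, word in enumerate(vocab)}

def encode_binary_fast(word_index, sentence):
    # Same output as encode_binary, but each lookup is a constant-time dict access
    binary_sentence = np.zeros(len(word_index))
    for word in sentence:
        idx = word_index.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            binary_sentence[idx] = 1.
    return binary_sentence

# Toy data standing in for the tokenized reviews
toy_sentences = [["good", "movie"], ["bad", "movie", "bad"]]
toy_vocab = np.unique([w for s in toy_sentences for w in s])
toy_index = build_word_index(toy_vocab.tolist())
toy_encoded = np.array([encode_binary_fast(toy_index, s) for s in toy_sentences])
print(toy_encoded)
```

Building the dictionary once and reusing it for both the training and the test sentences keeps the cost per sentence proportional to its length rather than to the vocabulary size.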


### 2. Prior calculation

In [9]:
# Calculate the priors
# Total number of sentences
N = float(len(X_train))

prior_0 = len(y_train[y_train==0])/N
prior_1 = len(y_train[y_train==1])/N

print("Prior for 0: ", prior_0)
print("Prior for 1: ", prior_1)

Prior for 0:  0.513333333333
Prior for 1:  0.486666666667
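
The same priors can be obtained in one step by counting the labels. This is a small sketch on toy labels (a stand-in for `y_train`), assuming the class labels are the non-negative integers 0 and 1.

```python
import numpy as np

# Toy labels standing in for y_train
toy_labels = np.array([0, 1, 0, 0, 1])

# np.bincount counts how often each integer label occurs;
# dividing by the number of documents gives the class priors
toy_priors = np.bincount(toy_labels) / float(len(toy_labels))
print(toy_priors)
```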


### 3. Likelihood calculation

In [10]:
# Calculate the likelihood of each word given a class: P(wt|C)
# For each word, we want to know in how many documents of a certain class it occurred
# +1 for Laplace smoothing
word_count_class_0 = np.sum(data_set[y_train==0],axis=0) + 1
word_count_class_1 = np.sum(data_set[y_train==1],axis=0) + 1

# For each class we want to know how many documents belong to it 
# +2 for Laplace smoothing
doc_count_class_0 = len(y_train[y_train==0]) + 2
doc_count_class_1 = len(y_train[y_train==1]) + 2

# Multiply by 1. to force conversion to floating number
words_likelihood_0 = 1. * word_count_class_0 / doc_count_class_0
words_likelihood_1 = 1. * word_count_class_1 / doc_count_class_1

print("words_likelihood_0:", words_likelihood_0)
print("words_likelihood_1:", words_likelihood_1)

words_likelihood_0: [0.35443038 0.82278481 0.03797468 ... 0.01265823 0.02531646 0.02531646]
words_likelihood_1: [0.26666667 0.70666667 0.01333333 ... 0.02666667 0.01333333 0.01333333]
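
The smoothed estimate computed in the cell above can be written as follows, where $n_k(w_t)$ is the number of training reviews of class $k$ that contain word $w_t$ and $N_k$ is the number of training reviews of class $k$. These symbols are notation introduced here, not taken from the lab text.

$$
\hat{P}(w_t \mid C = k) = \frac{n_k(w_t) + 1}{N_k + 2}
$$

With the priors printed earlier and 150 training reviews, $N_0 = 77$ and $N_1 = 73$, so the denominators are 79 and 75. The first entry of `words_likelihood_0`, 0.35443038, therefore corresponds to $28/79$, and the first entry of `words_likelihood_1`, 0.26666667, to $20/75$.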


### 4. Classification (posterior calculation)

In [11]:
def classify(sentence,vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1):
    coded_sentence = encode_binary(vocab,sentence)
    log_likelihood_0 = np.sum((coded_sentence*np.log(words_likelihood_0))+((1-coded_sentence)*np.log(1-words_likelihood_0))) # equation 4 where C=0 
    log_likelihood_1 = np.sum((coded_sentence*np.log(words_likelihood_1))+((1-coded_sentence)*np.log(1-words_likelihood_1))) # equation 4 where C=1 
    posterior_0 = np.log(prior_0) + log_likelihood_0
    posterior_1 = np.log(prior_1) + log_likelihood_1
    if posterior_0 > posterior_1:
        return 0
    else:
        return 1
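
The `classify` function scores one sentence at a time. Because the log-likelihood is linear in the binary vector, a whole encoded matrix can be scored with two matrix-vector products per class. The sketch below assumes the sentences have already been encoded into a matrix; `classify_batch` and the toy values are hypothetical.

```python
import numpy as np

def classify_batch(encoded, likelihood_0, likelihood_1, prior_0, prior_1):
    # Hypothetical batch version of classify: one row of `encoded` per sentence
    def log_posterior(likelihood, prior):
        # Present words contribute log(p), absent words contribute log(1 - p)
        return (np.log(prior)
                + encoded.dot(np.log(likelihood))
                + (1 - encoded).dot(np.log(1 - likelihood)))
    post_0 = log_posterior(likelihood_0, prior_0)
    post_1 = log_posterior(likelihood_1, prior_1)
    # Same tie-breaking as classify: class 1 unless class 0 is strictly larger
    return np.where(post_0 > post_1, 0, 1)

# Toy encoded sentences and per-word likelihoods for a three-word vocabulary
toy_encoded = np.array([[1., 0., 1.],
                        [0., 1., 0.]])
toy_likelihood_0 = np.array([0.8, 0.2, 0.6])
toy_likelihood_1 = np.array([0.3, 0.7, 0.4])
toy_pred = classify_batch(toy_encoded, toy_likelihood_0, toy_likelihood_1, 0.5, 0.5)
print(toy_pred)
```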

### 5. Test

Apply your classification function to every sentence in the training set (X_train) and store the results in a list.

In [12]:
results = []
for i in range(0,len(X_train)):
    results.append(classify(X_train[i],vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1))

To know how well the algorithm works, the accuracy score has to be calculated. For more information on the accuracy_score function visit: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html.

In [13]:
def accuracy_score(class_vec,results):
    # Note: this definition shadows the accuracy_score imported from sklearn.metrics
    correct = np.asarray(class_vec) == np.asarray(results)
    correct_count = np.count_nonzero(correct)
    return 1.0*correct_count / len(results)
    
print(accuracy_score(y_train, results))

1.0


Now apply your classification function to every sentence in the test set (X_test).

In [14]:
y_pred = []
for i in range(0,len(X_test)):
    y_pred.append(classify(X_test[i],vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1))

Calculate the accuracy score here too.

In [16]:
print(accuracy_score(y_test,y_pred))

0.718253968254
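
One way to sanity-check the hand-written estimates is to compare them with scikit-learn's `BernoulliNB`, which with `alpha=1.0` applies the same add-one smoothing as the likelihood cell above. The sketch below runs on a toy matrix (a stand-in for `data_set` and `y_train`), not on the review data, so it makes no claim about the accuracy figures reported here.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary document-term matrix and labels
toy_X = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
toy_y = np.array([0, 0, 1, 1])

# alpha=1.0 corresponds to add-one (Laplace) smoothing
toy_model = BernoulliNB(alpha=1.0)
toy_model.fit(toy_X, toy_y)

# Manual estimate for class 0: (document count per word + 1) / (documents in class + 2)
toy_manual_0 = (toy_X[toy_y == 0].sum(axis=0) + 1.) / (np.sum(toy_y == 0) + 2)
# feature_log_prob_ stores log P(word present | class), one row per class
toy_sklearn_0 = np.exp(toy_model.feature_log_prob_[0])
print(toy_manual_0, toy_sklearn_0)
```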


### 6. Questions

**You also calculated the accuracy on the test set of the movie reviews when Multinomial Naive Bayes was used (exercise 3). Which algorithm performed best on the test set: Multinomial Naive Bayes or Bernoulli Naive Bayes?**


Multinomial Naive Bayes

**Why did one perform better than the other?**

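BNB: only checks whether a term is present in a review or not, so the number of times it occurs is discarded. Every vocabulary word that is absent from the review still contributes a factor (1 - P(w|C)) to the likelihood.

MNB: models how often each term occurs. When a word is present in the vocabulary but not in the sentence, the likelihood calculation for Multinomial Naive Bayes does not take it into account.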
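
The difference between the two likelihoods can be made concrete on a single toy document. The parameter values below are invented for illustration, and the multinomial coefficient is left out because it is the same for every class.

```python
import numpy as np

# Hypothetical per-word parameters for one class and a three-word vocabulary
toy_bernoulli_p = np.array([0.8, 0.5, 0.2])    # P(word present | class)
toy_multinomial_p = np.array([0.6, 0.3, 0.1])  # P(word | class), sums to 1

# Toy document: word 0 occurs twice, words 1 and 2 are absent
toy_counts = np.array([2, 0, 0])
toy_binary = (toy_counts > 0).astype(float)

# Bernoulli: present words add log(p), absent words add log(1 - p)
toy_bnb = np.sum(toy_binary * np.log(toy_bernoulli_p)
                 + (1 - toy_binary) * np.log(1 - toy_bernoulli_p))

# Multinomial: each word adds count * log(p), so absent words add nothing
toy_mnb = np.sum(toy_counts * np.log(toy_multinomial_p))

print(toy_bnb, toy_mnb)
```

In the Bernoulli case the two absent words change the score, and the repeated occurrence of word 0 does not. In the multinomial case it is the other way around.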