# Lab: Naive Bayes

## Part 3: Multinomial Naive Bayes on real-world data

The goal of this exercise is to perform **sentiment analysis** on IMDb movie reviews. In other words, your machine learning algorithm (multinomial naive Bayes) has to determine whether a movie review is positive or negative.

You will have to optimize your code so that it does not take hours to run.


<img src="https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg" alt="Drawing" style="height: 150px;"/>


Your name: Benjamin Fraeyman

### 1. Imports and data set creation

In [3]:
from __future__ import print_function
import numpy as np
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score

In [4]:
reviews = load_files('reviews',encoding='us-ascii')

In [5]:
sentences = [review.split() for review in reviews.data]
X_train = sentences[:150]
X_test = sentences[150:]

In [6]:
class_vec = reviews.target
y_train = class_vec[:150]
y_test = class_vec[150:]

In [10]:
#Create the vocabulary
# List which will contain all unique words contained in the data set
all_words = []

# Transform the data set (which is a list of lists) to a single list
for sentence in X_train:
    all_words.extend(sentence)

# Use numpy's unique function to get all unique elements (in sorted order) from a list
vocab = np.unique(all_words)
print(vocab)

[u'!' u'"' u'$1000' ... u'zweibel' u"zwick's" u"zwigoff's"]


In [14]:
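# Encode a sentence as a vector of word counts (multinomial bag-of-words)
def encode_multinomial(vocab,sentence):
    vocab_list = vocab.tolist()
    # One counter per vocabulary word
    count_vector = np.zeros(len(vocab_list),)
    for word in sentence:
        # Words that are not in the vocabulary are ignored
        if word in vocab:
            count_vector[vocab_list.index(word)] += 1
    return count_vector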


data_set = []
for sentence in X_train:
    binary_sentence = encode_multinomial(vocab, sentence)
    data_set.append(binary_sentence)
    
data_set = np.array(data_set)

print(data_set)

[[ 7.  6.  0. ...  0.  0.  0.]
 [ 0. 24.  0. ...  0.  0.  0.]
 [ 1.  2.  1. ...  0.  0.  0.]
 ...
 [ 7.  8.  0. ...  0.  0.  0.]
 [ 0.  2.  0. ...  0.  0.  0.]
 [ 0.  4.  0. ...  0.  0.  0.]]
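
The encoding function above searches the vocabulary linearly for every word, which is the main reason the cell is slow. One possible speed-up, sketched here on a toy vocabulary, is to build a word-to-index dictionary once and reuse it. The names `encode_multinomial_fast`, `word_to_index` and `toy_vocab` are illustrative and not part of the lab.

```python
import numpy as np

def encode_multinomial_fast(word_to_index, sentence):
    """Count how often each vocabulary word occurs in `sentence`."""
    counts = np.zeros(len(word_to_index))
    for word in sentence:
        idx = word_to_index.get(word)  # None for out-of-vocabulary words
        if idx is not None:
            counts[idx] += 1
    return counts

# Toy vocabulary, sorted in the same way np.unique sorts the real one
toy_vocab = np.unique(["good", "bad", "movie", "good"])

# Build the lookup table once; each lookup is then a constant-time dict access
word_to_index = {word: i for i, word in enumerate(toy_vocab.tolist())}

toy_counts = encode_multinomial_fast(word_to_index, ["good", "good", "film", "movie"])
print(toy_counts)
```

The word "film" is not in the toy vocabulary, so it is skipped, just as unknown words are skipped by `encode_multinomial`.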


### 2. Prior calculation

In [17]:
# Calculate the priors
N = float(len(X_train))

prior_0 = len(y_train[y_train==0])/N
prior_1 = len(y_train[y_train==1])/N

print("Prior for 0: ", prior_0)
print("Prior for 1: ", prior_1)

Prior for 0:  0.513333333333
Prior for 1:  0.486666666667
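
The priors are simply the class frequencies in the training labels. As a small sketch on made-up labels (`toy_labels` is hypothetical), `np.bincount` computes the same quantity for all classes at once:

```python
import numpy as np

# Hypothetical labels: three documents of class 0, two of class 1
toy_labels = np.array([0, 1, 1, 0, 0])

# bincount counts the occurrences of each class label;
# dividing by the number of documents gives the priors
toy_priors = np.bincount(toy_labels) / float(len(toy_labels))
print(toy_priors)
```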


### 3. Likelihood calculation

In [19]:
# Calculate P(w_t|C) so that it can be used in the next step to compute the likelihood of a document given a class.
# For each word, count how many times it occurs across all documents of a given class.
# Add 1 to every count for Laplace (add-one) smoothing.
word_count_class_0 = np.sum(data_set[y_train==0],axis=0) + 1
word_count_class_1 = np.sum(data_set[y_train==1],axis=0) + 1

# Total number of word occurrences per class (before smoothing)
total_count_class_0 = np.sum(data_set[y_train==0])
total_count_class_1 = np.sum(data_set[y_train==1])


# Multiply by 1. to force floating-point division
words_likelihood_0 = 1. * word_count_class_0 / (total_count_class_0 + len(vocab))
words_likelihood_1 = 1. * word_count_class_1 / (total_count_class_1 + len(vocab))

print("words_likelihood_0:", words_likelihood_0)
print("words_likelihood_1:", words_likelihood_1)

words_likelihood_0: [1.37448457e-03 9.63701112e-03 4.68574285e-05 ... 1.56191428e-05
 3.12382856e-05 3.12382856e-05]
words_likelihood_1: [7.33836089e-04 8.27517292e-03 1.56135338e-05 ... 3.12270676e-05
 1.56135338e-05 1.56135338e-05]
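
To see what the smoothing does, here is a minimal sketch on a made-up count matrix for a single class (`toy_class_counts` is hypothetical). Adding 1 to every word count and the vocabulary size to the denominator keeps every likelihood strictly positive while the likelihoods still sum to 1.

```python
import numpy as np

# Toy count matrix: 3 documents of one class, vocabulary of 4 words
toy_class_counts = np.array([[2., 0., 1., 0.],
                             [1., 0., 0., 0.],
                             [0., 0., 3., 0.]])

vocab_size = toy_class_counts.shape[1]
word_totals = toy_class_counts.sum(axis=0)  # occurrences of each word in this class

# Add-one smoothing: (count + 1) / (total count + vocabulary size)
smoothed = (word_totals + 1) / (word_totals.sum() + vocab_size)

print(smoothed)        # words 2 and 4 never occur but still get a nonzero likelihood
print(smoothed.sum())  # the smoothed likelihoods form a probability distribution
```

Without the smoothing, a word that never occurs in a class would get likelihood 0, and its logarithm would be minus infinity in the classification step.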


### 4. Classification

In [20]:
# Create a classification function, as in the previous notebook, that can classify a new sentence.
# The function uses the sentence, the vocabulary, the likelihoods for the two classes and the priors for the two classes.
# It returns the class label for the new sentence.
def classify(sentence,vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1):
    # Create a BOW representation of the new sentence
    coded_sentence = encode_multinomial(vocab,sentence)

    # Apply equation (4) to get the log-likelihood of the sentence under each class
    log_likelihood_0 = np.sum(coded_sentence*np.log(words_likelihood_0)) # equation 4 where C=0
    log_likelihood_1 = np.sum(coded_sentence*np.log(words_likelihood_1)) # equation 4 where C=1

    # Apply equation (5): add the log prior to get the (unnormalized) log posterior
    posterior_0 = np.log(prior_0) + log_likelihood_0
    posterior_1 = np.log(prior_1) + log_likelihood_1
    
    # Classify according to equation (6)
    if posterior_0 > posterior_1:
        return 0
    else:
        return 1
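
The function works with sums of logarithms rather than products of probabilities. A short sketch with made-up numbers shows why: multiplying many small likelihoods underflows to zero in floating point, whereas the sum of their logs stays finite and can still be compared between classes.

```python
import numpy as np

# Hypothetical review: 1000 word occurrences, each with likelihood 1e-4
word_probs = np.full(1000, 1e-4)

naive_product = np.prod(word_probs)   # 1e-4000 is far below the float64 range
log_sum = np.sum(np.log(word_probs))  # 1000 * log(1e-4), a perfectly ordinary number

print(naive_product)
print(log_sum)
```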

### 5. Test


Apply your classification function to every sentence in the training set (X_train) and store the results in a list.

In [35]:
results = np.zeros(len(X_train))
for i in range(0,len(X_train)):
    results[i] = classify(X_train[i],vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1)


To know how well the algorithm works, the accuracy score has to be calculated. For more information on the accuracy_score function visit: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html.

In [40]:
# Note: this definition shadows the accuracy_score imported from sklearn.metrics
def accuracy_score(class_vec,results):
    # Fraction of predictions that match the true labels
    correct = class_vec == results
    correct_count = np.count_nonzero(correct)
    return 1.0*correct_count / len(results)

print(accuracy_score(y_train, results))

1.0


Now apply your classification function to every sentence in the test set (X_test).

In [41]:
y_pred = np.zeros(len(X_test))
for i in range(0,len(X_test)):
    y_pred[i] = classify(X_test[i],vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1)
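
If the sentences are first encoded into a count matrix (one row per review), the per-sentence loop can be replaced by two matrix-vector products. The following is a sketch under that assumption, with a hypothetical helper `classify_batch` and made-up toy inputs; it is not part of the lab's required solution.

```python
import numpy as np

def classify_batch(counts, words_likelihood_0, words_likelihood_1, prior_0, prior_1):
    """Classify every row of a (documents x vocabulary) count matrix at once."""
    score_0 = np.log(prior_0) + counts.dot(np.log(words_likelihood_0))
    score_1 = np.log(prior_1) + counts.dot(np.log(words_likelihood_1))
    # Same tie-breaking as the per-sentence function: class 0 only if strictly larger
    return np.where(score_0 > score_1, 0, 1)

# Toy data: 2 documents, vocabulary of 3 words
toy_counts = np.array([[2., 0., 1.],
                       [0., 3., 0.]])
toy_lik_0 = np.array([0.6, 0.1, 0.3])
toy_lik_1 = np.array([0.1, 0.7, 0.2])

toy_pred = classify_batch(toy_counts, toy_lik_0, toy_lik_1, 0.5, 0.5)
print(toy_pred)
```

The first toy document is dominated by a word that is likely under class 0, the second by a word that is likely under class 1, so the predictions differ.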



Calculate the accuracy score here too.

In [42]:
print(accuracy_score(y_test,y_pred))

0.72619047619


### 6. Questions

**Which accuracy is better, the one calculated for the training set or the one for the test set?**

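The training set accuracy (1.0) is better than the test set accuracy (about 0.73).

**Why is the one better than the other?**

The training set is what we used to estimate the priors and the word likelihoods, so the model has already seen each of those reviews and every word in them.

The test set is harder to get right because it contains reviews the model has never seen, including words that are not in the vocabulary and are therefore ignored by the encoding. The test accuracy is the more realistic estimate of how the classifier performs on new reviews.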


Additionally, have a look at the confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html.

In [50]:
from sklearn.metrics import confusion_matrix
labels = reviews.target_names
print(confusion_matrix(y_test,y_pred, labels =[0, 1]))
print("0 are the %s reviews" % labels[0])
print("and 1 are the %s reviews" % labels[1])

tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
(tn, fp, fn, tp)

[[96 29]
 [40 87]]
0 are the neg reviews
and 1 are the pos reviews


(96, 29, 40, 87)

**How many positive reviews are classified as positive?**

In [1]:
TP = 87

**How many positive reviews are classified as negative?**

In [2]:
FN = 40
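
The four confusion-matrix entries printed above are enough to derive other common metrics. As a sketch that treats the pos reviews (class 1) as the positive class, accuracy, precision and recall follow directly from the counts:

```python
# Values taken from the confusion matrix printed above
tn, fp, fn, tp = 96, 29, 40, 87

accuracy = (tp + tn) / float(tp + tn + fp + fn)  # fraction of all reviews classified correctly
precision = tp / float(tp + fp)                  # of the reviews predicted positive, how many are positive
recall = tp / float(tp + fn)                     # of the truly positive reviews, how many were found

print(accuracy, precision, recall)
```

The accuracy computed this way matches the test set accuracy reported earlier.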