# Lab: Naive Bayes

## Part 2: Multinomial naive Bayes

As with Bernoulli naive Bayes, Bayes' rule is applied to calculate the probability of a class given the data. For multinomial naive Bayes, however, sentences are not encoded as binary vectors but as vectors of word counts.

Suppose that $\textbf{x}_{i}$ is the feature vector describing the $i$-th document; then $x_{it}$ is the number of times the $t$-th vocabulary term occurs in that document.
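As a minimal sketch of this encoding, the snippet below builds a count vector with `collections.Counter`. The names `count_vector`, `toy_vocab` and `toy_document` are illustrative assumptions and not part of the lab, which performs the same step with NumPy further down.

```python
from collections import Counter

def count_vector(vocabulary, document):
    """Return how often each vocabulary term occurs in the document."""
    counts = Counter(document)
    # A Counter returns 0 for terms that never occur in the document
    return [counts[term] for term in vocabulary]

# Hypothetical toy vocabulary and document, for illustration only
toy_vocab = ['dog', 'help', 'my', 'stupid']
toy_document = ['my', 'dog', 'help', 'please', 'help']

x = count_vector(toy_vocab, toy_document)
print(x)  # [1, 2, 1, 0]
```

Words that are not in the vocabulary (here `'please'`) are simply ignored in this sketch.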

Name: Benjamin Fraeyman

### 1. Imports and data set creation

In [1]:
from __future__ import print_function
import numpy as np

In [2]:
class_vec = np.array([0,1,0,1,0,1])
# The sentences have different lengths, so dtype=object is needed:
# recent NumPy versions refuse to build a ragged array implicitly.
sentences = np.array([['my', 'dog', 'has', 'flea', 'problems', 'help', 'please','help'],
             ['maybe', 'stop', 'taking', 'him', 'to', 'dog', 'park', 'stupid'],
             ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
             ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
             ['mr', 'licks', 'ate', 'my', 'steak', 'how','to', 'stop', 'him'],
             ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']], dtype=object)

In [3]:
# List that will collect every word in the data set (duplicates included)
all_words = []

# Transform the data set (which is a list of lists) to a single list
for sentence in sentences:
    all_words.extend(sentence)

# Use the numpy function `unique` to get the sorted unique elements of the list
vocab = np.unique(all_words)


Instead of indicating whether a term of the vocabulary is present in a sentence, count how many times it occurs in the sentence.


In [4]:
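def encode_multinomial(vocab,sentence):
    # Encode a sentence as a vector of word counts, one entry per vocabulary term
    vocab_list = vocab.tolist()
    count_vector = np.zeros(len(vocab_list),)
    for word in sentence:
        # Words that are not in the vocabulary are ignored
        if word in vocab_list:
            count_vector[vocab_list.index(word)] += 1
    return count_vector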

In [5]:
# Apply the function defined above to every sentence to create a new data set containing the count-encoded sentences
data_set = []
for sentence in sentences:
    encoded_sentence = encode_multinomial(vocab, sentence)
    data_set.append(encoded_sentence)
    
data_set = np.array(data_set)
print(data_set)

print("The first sentence:")
print(sentences[0])
print("A check to see if the sentence was properly encoded")
print(vocab[data_set[0]!=0.0])

[[0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 2. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0.
  0. 0. 1. 1. 1. 1. 0.]
 [1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 1. 1. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0.
  0. 1. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 1. 0. 0. 1.]]
The first sentence:
['my', 'dog', 'has', 'flea', 'problems', 'help', 'please', 'help']
A check to see if the sentence was properly encoded
['dog' 'flea' 'has' 'help' 'my' 'please' 'problems']


### 2. Prior calculation

In [6]:
# Calculate the priors
# Total number of sentences (np.float was removed from NumPy; the built-in float is equivalent)
N = float(len(sentences))

prior_0 = len(class_vec[class_vec==0])/N
prior_1 = len(class_vec[class_vec==1])/N

print("Prior for 0: ", prior_0)
print("Prior for 1: ", prior_1)

Prior for 0:  0.5
Prior for 1:  0.5
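
The same prior estimate (class frequency divided by the number of documents) extends to any number of classes. The sketch below is one possible way to write it with the standard library; the function name `class_priors` is an assumption made for illustration, and the labels are the ones from `class_vec` above.

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(C=k) as the fraction of documents labelled k."""
    counts = Counter(labels)
    n = float(len(labels))
    return {label: count / n for label, count in counts.items()}

priors = class_priors([0, 1, 0, 1, 0, 1])
print(priors[0], priors[1])  # 0.5 0.5
```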


### 3. Likelihood calculation

The likelihood for multinomial naive Bayes is calculated as follows:

\begin{equation*}
P(D_i|C)\approx P(\textbf{x}_i|C) = \prod_{t=1}^{|V|} P(w_t|C)^{N_{it}}
\quad\quad\text{(1)}
\end{equation*}

- $N_{it}$: the number of times word $w_t$ occurs in $D_i$

$P(w_t|C=k)$ can be determined by calculating how many times word $w_t$ occurs in documents belonging to class $k$ divided by the total number of words that are present in documents belonging to class $k$.

To apply smoothing, add 1 to the numerator and $|V|$ to the denominator, so that a word that never occurs in a class still gets a non-zero probability.

Formally:

\begin{equation*}
P(w_t|C=k) = \frac{1 + \text{tf}(w_t, D \in k)}{|V| + \sum_{s=1}^{|V|} \text{tf}(w_s, D \in k)}
\quad\quad\text{(2)}
\end{equation*}

- tf($w_t$,$D \in k$): term frequency, i.e. how many times $w_t$ occurs in documents of class $k$
- $|V|$: the size of your vocabulary
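
Equation (2) can be sketched for a single class as follows. This is an illustration under assumed names (`smoothed_word_likelihoods`, `toy_counts`), not the lab's own implementation; the input is the list of term frequencies of every vocabulary word within one class.

```python
def smoothed_word_likelihoods(class_counts):
    """Apply add-one smoothing to the per-term counts of one class (equation 2)."""
    vocab_size = len(class_counts)   # |V|
    total = sum(class_counts)        # total number of words seen in the class
    return [(1.0 + count) / (vocab_size + total) for count in class_counts]

# Hypothetical term frequencies for a three-word vocabulary
toy_counts = [3, 0, 1]
likelihoods = smoothed_word_likelihoods(toy_counts)
print(likelihoods)  # the values 4/7, 1/7 and 2/7
```

The word with a count of 0 still receives probability $1/7$, and the smoothed probabilities sum to 1 over the vocabulary.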

In [7]:
# Calculate P(wt|C) so that it can be used in the next step to calculate the likelihood of a document given a class.
# For each word, we want to know how many times it occurred in the documents of a certain class
# +1 for the smoothing
word_count_class_0 = np.sum(data_set[class_vec==0],axis=0) + 1
word_count_class_1 = np.sum(data_set[class_vec==1],axis=0) + 1

# sum of word freq
total_count_class_0 = np.sum(data_set[class_vec==0])
total_count_class_1 = np.sum(data_set[class_vec==1])


# Multiply by 1. to force conversion to floating number
words_likelihood_0 = 1. * word_count_class_0 / (total_count_class_0 + len(vocab))
words_likelihood_1 = 1. * word_count_class_1 / (total_count_class_1 + len(vocab))

print("words_likelihood_0:", words_likelihood_0)
print("words_likelihood_1:", words_likelihood_1)

words_likelihood_0: [0.03571429 0.03571429 0.01785714 0.03571429 0.03571429 0.03571429
 0.03571429 0.01785714 0.01785714 0.03571429 0.05357143 0.05357143
 0.03571429 0.03571429 0.03571429 0.03571429 0.01785714 0.03571429
 0.07142857 0.01785714 0.03571429 0.01785714 0.03571429 0.01785714
 0.03571429 0.03571429 0.03571429 0.01785714 0.01785714 0.03571429
 0.01785714]
words_likelihood_1: [0.02 0.02 0.04 0.02 0.02 0.06 0.02 0.04 0.04 0.02 0.02 0.04 0.02 0.02
 0.02 0.02 0.04 0.02 0.02 0.04 0.02 0.04 0.02 0.04 0.02 0.02 0.06 0.08
 0.04 0.04 0.06]


### 4. Classification (posterior calculation)

To classify a (new) sentence we need the priors, $P(C=0)$ and $P(C=1)$, and the likelihoods, $P(D|C=0)$ and $P(D|C=1)$. The likelihoods are calculated next from the word likelihoods $P(w_t|C)$.

Because each probability is a small number, multiplying many of them can underflow floating-point precision. To avoid this, the following logarithmic rule is applied:

\begin{equation*}
\log(uv) = \log(u) + \log(v)
\end{equation*}
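
The precision problem is easy to reproduce. In the sketch below the per-word probabilities are made-up values chosen only to trigger underflow: their direct product collapses to zero in double precision, while the sum of their logarithms stays representable.

```python
import math

# Hypothetical per-word probabilities, chosen only to illustrate underflow
probabilities = [1e-5] * 100

product = 1.0
for p in probabilities:
    product *= p
print(product)  # 0.0, because 1e-500 is below the smallest positive double

log_sum = sum(math.log(p) for p in probabilities)
print(log_sum)  # roughly -1151.29, i.e. 100 * log(1e-5)
```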

We apply this rule, together with $\log(u^n) = n\log(u)$, to equation *(1)*, resulting in:

\begin{align}
P(D_i|C)\approx P(\textbf{x}_i|C) & = \prod_{t=1}^{|V|} P(w_t|C)^{N_{it}} \\
\log(P(D_i|C)) \approx \log(P(\textbf{x}_i|C)) & = \sum_{t=1}^{|V|} N_{it} \log(P(w_t|C)) \quad\quad\text{(3)}
\end{align}

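and subsequently to Bayes' rule itself to calculate the posterior. The evidence term $-\log(P(D))$ is the same for every class, so it is treated as a constant and dropped:

\begin{equation*}
\log(P(C|D)) = \log(P(D|C))+\log(P(C)) + \text{const}
\quad\quad\text{(4)}
\end{equation*}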

Note that when a word is present in the vocabulary but not in the sentence, its count $N_{it}$ is 0, so it contributes nothing to the multinomial likelihood. This differs from Bernoulli naive Bayes, where an absent word contributes a factor $1 - P(w_t|C)$.
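
A small check of equation (3) and of the remark about absent words is sketched below. The word likelihoods and counts are made-up values and `log_likelihood` is an illustrative name; the point is that summing over the whole vocabulary gives the same result as summing only over the words that actually occur.

```python
import math

def log_likelihood(counts, word_likelihoods):
    """Equation (3): sum over the vocabulary of N_it * log P(w_t|C)."""
    return sum(n * math.log(p) for n, p in zip(counts, word_likelihoods))

# Hypothetical values for a three-word vocabulary
word_likelihoods = [0.5, 0.3, 0.2]
counts = [2, 0, 1]  # the second word does not occur in the sentence

full = log_likelihood(counts, word_likelihoods)
present_only = 2 * math.log(0.5) + 1 * math.log(0.2)
print(abs(full - present_only) < 1e-12)  # True
```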

Finally, the sentence is classified as follows:

\begin{equation*}
\hat{C}(D_i) = \begin{cases}
      0, & \text{if}\ \log(P(C=0|D_i))> \log(P(C=1|D_i)) \\
      1, & \text{otherwise}
\end{cases}\quad\text{(5)}
\end{equation*}
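
With more than two classes the same decision amounts to picking the class with the largest log-posterior. The helper below is a hypothetical sketch (`pick_class` is not part of the lab) and the log-posterior values are made up for illustration.

```python
def pick_class(log_posteriors):
    """Return the index of the class with the largest log-posterior."""
    # max() returns the first maximal index, so ties resolve to the lowest
    # class index; equation (5) instead assigns class 1 on a tie.
    return max(range(len(log_posteriors)), key=lambda k: log_posteriors[k])

print(pick_class([-12.3, -10.8]))       # 1
print(pick_class([-4.0, -9.5, -7.25]))  # 0
```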

In [8]:
# Create a function, as in the previous notebook that can classify a new sentence.
# The function uses the sentence, the vocabulary, the likelihoods for the two classes and the priors for the two classes.
# The function should return the class label for the new sentence
def classify(sentence,vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1):
    # Create a BOW representation of the new sentence
    coded_sentence = encode_multinomial(vocab,sentence)

    # Apply equation (3) to get the log-likelihoods
    log_likelihood_0 = np.sum((coded_sentence*np.log(words_likelihood_0))) # equation 3 where C=0
    log_likelihood_1 = np.sum((coded_sentence*np.log(words_likelihood_1))) # equation 3 where C=1
    
    # Apply equation (4) to get the (unnormalized) log-posteriors
    posterior_0 = np.log(prior_0) + log_likelihood_0
    posterior_1 = np.log(prior_1) + log_likelihood_1
    
    # Classify according to equation (5)
    if posterior_0 > posterior_1:
        return 0
    else:
        return 1

### Try to classify the two sentences

In [9]:
sentence1 = ['my','dog','is','cute','he','licks','me']
# Use your classify function
classify(sentence1,vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1)

0

In [10]:
sentence2 = ['my','dog','is','stupid','and','worthless',"real"]
classify(sentence2,vocab,words_likelihood_0,words_likelihood_1,prior_0,prior_1)

1