# PS2-6: Naive Bayes Text Classification

## Introduction

Here we recall the Naive Bayes model and set notation. 

A Naive Bayes text model seeks to predict binary class labels $y \in \{0,1\}$ for blocks of text. The example in this problem is classifying a text message as spam or non-spam. We fix a **vocabulary** of words, which we track and use to decide whether a message is spam ($y = 1$) or non-spam ($y = 0$).

We create a vocabulary by taking all individual words (text separated by white-space on either side) which appear in at least 5 messages in the training set. Each element of the vocabulary is called a token.
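The vocabulary rule above can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: the function names carry a `_sketch` suffix because the real signatures in `/src/p06_spam.py` are not shown here, and lower-casing the words is an assumption that the text does not state.

```python
from collections import Counter


def get_words_sketch(message):
    """Split a message on whitespace (lower-casing is an assumption)."""
    return message.lower().split()


def create_dictionary_sketch(messages, min_messages=5):
    """Map every word that occurs in at least `min_messages` messages to an index."""
    message_counts = Counter()
    for message in messages:
        # set() ensures a word repeated inside one message is counted once
        message_counts.update(set(get_words_sketch(message)))
    vocabulary = sorted(w for w, c in message_counts.items() if c >= min_messages)
    return {word: index for index, word in enumerate(vocabulary)}


# Toy data: "free" occurs in 7 messages, "prize" and "now" in 5, the rest in 2.
messages = ["free prize now"] * 5 + ["free lunch today"] * 2
dictionary = create_dictionary_sketch(messages)
```

Here `dictionary` keeps only `free`, `now`, and `prize`; `lunch` and `today` fall below the 5-message threshold and are dropped.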

From the point of view of the model, a message $m$ is represented as a sequence of tokens 

$$m = t_1 t_2 \dots t_{k_m}$$

where $k_m$ is the number of tokens appearing in $m$, counted with multiplicity. The "naive Bayes" assumption is that, given the label of the text, the appearance of a given token is independent of the appearance of any other token. In other words, in a spam email, the event that token $j$ is equal to 'drugs' is independent of the event that token $k$ is equal to 'buy' for any $j \neq k$. We also assume that the distribution of token $j$ does not depend on its position $j$, i.e. the chance that the first token is 'drugs' is the same as the chance that the last token is 'drugs'. It follows from these two assumptions that 

$$p(m | y ) = p(t_1 ,\dots, t_{k_m} | y) = \prod_{j=1}^{k_m} p(t_j | y) $$

If $N$ is the size of the vocabulary, each token $t_j$ is a multinomial random variable valued in $\{0,\dots,N-1\}$, and we let 

$$\phi_{k,1} = p(t_j = k | y = 1)$$
$$\phi_{k,0} = p(t_j = k | y = 0)$$

Note that $\phi_{k,-}$ is independent of $j$ by the second assumption. We also have the additional Bernoulli parameter

$$\phi_y = p(y = 1)$$

which is the probability that a message is spam, without knowing its contents. By Bayes' rule,

\begin{equation}
p(y = 1 | t_1 , \dots , t_{k_m}) = \frac{p(t_1, \dots, t_{k_m} | y = 1)\,p(y=1)}{p(t_1,\dots,t_{k_m})}
\end{equation}

The maximum likelihood estimates for the parameters are 

$$\phi_y = \frac{\text{\# of spam emails}}{\text{\# of total emails}}$$
$$\phi_{k,1} = \frac{\text{\# of times token\,}k\,\text{appears in spam emails}}{\text{total \# of tokens in spam emails}}$$

and $\phi_{k,0}$ is defined in the same way using the non-spam emails.

Laplace smoothing modifies $\phi_{k,0}$ and $\phi_{k,1}$ by adding 1 to the numerator and $N$, the vocabulary size, to the denominator, so that a token never seen in one class does not receive probability zero. It follows from equation (1) that, to predict the labels, we can calculate the ratio

$$\frac{p(y = 1 | t_1 , \dots , t_{k_m})}{p(y=0 | t_1 , \dots , t_{k_m})} = \frac{\phi_y \prod_{j=1}^{k_m} \phi_{t_j,1}}{(1-\phi_y) \prod_{j=1}^{k_m} \phi_{t_j,0}}$$

Or, if we take the log of this quantity, we get 

$$ \sum_{j=1}^{k_m} \left( \log \phi_{t_j,1} - \log \phi_{t_j,0} \right) + \log \phi_y - \log (1-\phi_y) $$

The sign of this quantity determines the label prediction: $y = 1$ if it is non-negative, $y = 0$ if it is negative. Therefore, to implement the code, we need only keep track of the differences 

$$\log \phi_{k,1} - \log \phi_{k,0}$$

for every token $k$ in the vocabulary, together with the prior term $\log \phi_y - \log (1-\phi_y)$.
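As a small numerical illustration of this decision rule, the sketch below evaluates the log-ratio for a message given as a list of token indices. The function names and the toy parameter values are made up for the example and do not come from the assignment code.

```python
import math


def log_odds(tokens, phi_spam, phi_ham, phi_y):
    """Log of p(y=1 | tokens) / p(y=0 | tokens) under the naive Bayes model."""
    # Prior term: log(phi_y) - log(1 - phi_y)
    total = math.log(phi_y) - math.log(1.0 - phi_y)
    # One log-difference per token occurrence in the message
    for k in tokens:
        total += math.log(phi_spam[k]) - math.log(phi_ham[k])
    return total


def predict(tokens, phi_spam, phi_ham, phi_y):
    """Predict 1 (spam) when the log-ratio is non-negative, else 0."""
    return 1 if log_odds(tokens, phi_spam, phi_ham, phi_y) >= 0 else 0


# Hypothetical parameters for a vocabulary of N = 3 tokens.
phi_spam = [0.6, 0.3, 0.1]   # phi_{k,1}
phi_ham = [0.1, 0.3, 0.6]    # phi_{k,0}
phi_y = 0.5

spam_like = predict([0, 0, 1], phi_spam, phi_ham, phi_y)
ham_like = predict([2, 1], phi_spam, phi_ham, phi_y)
```

With these values the message `[0, 0, 1]` has log-ratio $2\log 6 > 0$ and is labelled spam, while `[2, 1]` has log-ratio $-\log 6 < 0$ and is labelled non-spam. Working with sums of logs rather than products of probabilities also avoids numerical underflow on long messages.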

## (A)

Write code to process the text dataset into numpy arrays that can be fed into machine learning models. Complete the `get_words`, `create_dictionary`, and `transform_text` functions in `/src/p06_spam.py`.

We use the collections package to vectorize the text.
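One possible shape for the vectorization step is sketched below, assuming a dictionary that maps each vocabulary word to a column index. The name `transform_text_sketch` and its signature are hypothetical, and lower-casing plus whitespace splitting is an assumed tokenization rather than the one required by the starter code.

```python
from collections import Counter

import numpy as np


def transform_text_sketch(messages, word_dictionary):
    """Return an (n_messages, vocab_size) array of per-message token counts."""
    matrix = np.zeros((len(messages), len(word_dictionary)), dtype=np.int64)
    for row, message in enumerate(messages):
        counts = Counter(message.lower().split())
        for word, count in counts.items():
            column = word_dictionary.get(word)
            if column is not None:  # words outside the vocabulary are ignored
                matrix[row, column] = count
    return matrix


word_dictionary = {"free": 0, "now": 1, "prize": 2}
matrix = transform_text_sketch(["Free free prize", "lunch now"], word_dictionary)
```

Entry `matrix[i, k]` is the number of times token $k$ appears in message $i$, so the first row here is `[2, 0, 1]` and the second is `[0, 1, 0]`.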

## (B) 
#### Implement a Naive Bayes model with Laplace smoothing by completing the `fit_naive_bayes_model` and `predict_from_naive_bayes_model` functions in `/src/p06_spam.py`.
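A minimal sketch of the fit and predict steps is given below, assuming the input is a count matrix with one row per message and one column per vocabulary token, and that both classes occur in the training labels. The names `fit_sketch` and `predict_sketch`, and the tuple used to hold the model, are illustrative choices rather than the interface expected by `/src/p06_spam.py`.

```python
import numpy as np


def fit_sketch(matrix, labels):
    """Estimate phi_y and the Laplace-smoothed log phi_{k,1}, log phi_{k,0}."""
    matrix = np.asarray(matrix, dtype=float)
    labels = np.asarray(labels)
    vocab_size = matrix.shape[1]
    phi_y = labels.mean()  # fraction of spam messages
    spam_counts = matrix[labels == 1].sum(axis=0)
    ham_counts = matrix[labels == 0].sum(axis=0)
    # Laplace smoothing: +1 in the numerator, +N in the denominator
    log_phi_spam = np.log((spam_counts + 1) / (spam_counts.sum() + vocab_size))
    log_phi_ham = np.log((ham_counts + 1) / (ham_counts.sum() + vocab_size))
    return phi_y, log_phi_spam, log_phi_ham


def predict_sketch(model, matrix):
    """Label a message 1 when its log-ratio is non-negative, else 0."""
    phi_y, log_phi_spam, log_phi_ham = model
    log_ratio = (
        np.asarray(matrix) @ (log_phi_spam - log_phi_ham)
        + np.log(phi_y)
        - np.log(1 - phi_y)
    )
    return (log_ratio >= 0).astype(int)


# Toy training set: two spam rows, two non-spam rows, N = 3 tokens.
train_matrix = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 2], [0, 1, 3]])
train_labels = np.array([1, 1, 0, 0])
model = fit_sketch(train_matrix, train_labels)
predictions = predict_sketch(model, np.array([[1, 0, 0], [0, 0, 2]]))
```

On this toy data the smoothed spam parameters are $(0.6, 0.2, 0.2)$, and the two test messages are labelled 1 and 0 respectively. The matrix product computes the sum of per-token log-differences weighted by token counts, which is the log-ratio from the introduction.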

## (C)
#### 