# POC - AI Pool 2022 - Day 01 - Multinomial Naive Bayes

## Introduction

### Multinomial Naive Bayes

The Multinomial Naive Bayes algorithm is a probabilistic learning method that is widely used in Natural Language Processing (NLP). It is based on Bayes' theorem and predicts the tag of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and outputs the tag with the highest probability.

Naive Bayes classifiers are a family of algorithms that share one common principle: every feature is assumed to be independent of the other features, given the class. In other words, the presence or absence of one feature is assumed not to affect the presence or absence of any other.

Multinomial Naive Bayes is explained [here](https://www.youtube.com/watch?v=O2L2Uv9pdDA).

### What are you going to do?

We are going to build a Multinomial Naive Bayes classifier that labels an SMS as spam or ham (a ham is a message that is not spam).

To do this, you will first process the dataset ``./data/SMSSpamCollection`` so that it fits our application. Then we will apply the classifier to test data to evaluate the performance of our program.

## I) Data preprocessing

### Import

In the first cell, we are going to import pandas, which you already used during the data science activity.\
pandas will allow us to manipulate the data in order to implement our algorithm.

 * [Pandas documentation](https://pandas.pydata.org/)

**Exercise:**

 * Import ``pandas`` and declare an alias as ``pd``.

In [None]:
#import pandas here

### Load dataset

We are going to load the file ``data/SMSspamCollection`` into a dataframe with ``pandas``.\
If you look at the file, you will see that the columns are separated by a tab (``\t``), so we have to tell pandas about it.\
Datasets usually start with a header, a first line that holds the column names instead of values. This file has no header, so when loading it we must say so and provide the column names ourselves.

 * [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
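As a rough illustration of these three parameters, here is a minimal sketch on two made-up rows held in memory. The variable names ``raw`` and ``toy`` are hypothetical and only serve this example; the real file is loaded the same way from its path.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the dataset:
# label, tab, message, and no header line.
raw = "ham\tsee you at noon\nspam\tfree prize, call now\n"

toy = pd.read_csv(
    io.StringIO(raw),        # a file path works here as well
    sep="\t",                # columns are separated by tabs
    header=None,             # the first line is data, not column names
    names=["Label", "SMS"],  # so we supply the names ourselves
)
print(toy)
```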
 
**Exercise:**

 * 1. Load the file ``data/SMSspamCollection`` and name the columns ``['Label', 'SMS']``.
 * 2. Check the ``sep``, ``header`` and ``names`` parameters.

In [None]:
sms_spam = None # <- Code here

assert sms_spam['SMS'][1] == "Ok lar... Joking wif u oni...", "You failed to load the dataset, check the read function from pandas"

### Data analysis

**Here is a description of our dataset:**

|Value|Meaning|
|:---:|:-----:|
|Label| **``ham``** marks an SMS that is not spam, **``spam``** marks a spam|
|SMS|Content of the message.|

The ``read_csv`` function returned a dataframe. A dataframe is made of columns, each of which is a ``Series`` with its own methods. For example, you will use the ``value_counts`` method to see the proportion of spam and ham SMS in the dataset.

 * docs : [value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
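To show what ``value_counts`` returns, here is a small sketch on a made-up column of four labels. The names ``labels`` and ``shares`` are hypothetical and are not part of the exercise.

```python
import pandas as pd

# A made-up label column: three ham messages and one spam message.
labels = pd.Series(["ham", "ham", "ham", "spam"], name="Label")

# normalize=True returns proportions instead of raw counts,
# indexed by label and sorted from most to least frequent.
shares = labels.value_counts(normalize=True)
print(shares)
```

The result is itself a ``Series``, so a single proportion can be read by label, for example ``shares["spam"]``.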

**Exercise:**

 * Use the ``value_counts`` method with ``normalize=True`` on the ``Label`` column.

In [None]:
distribution = None # <- Code here

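# Look the proportions up by label: integer positions on a label-indexed Series are deprecated in recent pandas
assert distribution['ham'] == 0.8659368269921034 and distribution['spam'] == 0.13406317300789664, "You should check the method value_counts."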

### Split data

When working on machine learning algorithms, it is necessary to split the data into two parts: the training data and the test data.

We need more training data than test data. A common choice is to keep 80% of the data for training and the rest for testing.

The goal is to test the algorithm on data it has never seen. This way, the algorithm cannot simply have learned the data by heart, and a good test score confirms that it works in a general way.
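The idea can be sketched on a made-up dataframe of ten rows. Everything prefixed with ``toy_`` is hypothetical, and resetting the index after the split is an assumption of this sketch rather than a stated requirement of the exercise.

```python
import pandas as pd

# A made-up dataset of ten messages.
toy = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "SMS": [f"message {i}" for i in range(10)],
})

# Shuffle the rows, then keep the first 80% for training and the rest for testing.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)

# reset_index(drop=True) renumbers the rows from 0 after the shuffle.
toy_train = shuffled[:cut].reset_index(drop=True)
toy_test = shuffled[cut:].reset_index(drop=True)
print(toy_train.shape, toy_test.shape)
```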

**Exercise:**

 * 1. Take the first ``4458`` samples of ``data_randomized`` for training and the rest for testing. You can use the slice operator.
 * 2. The shapes should be ``(4458, 2)`` for training and ``(1114, 2)`` for test.

In [None]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Split into training and test sets
training_set = None # <- Code here
test_set = None # <- Code here

print(training_set.shape)
print(test_set.shape)

assert training_set.shape == (4458, 2) and test_set.shape == (1114, 2), "You should have these shapes, check the instructions."

### Verify data

When splitting the data for classification, it is important to ensure that the training and test data respect the overall distribution of the data. We have previously observed that about 13.4% of messages are spam and 86.6% are not. If your data does not respect this distribution, rework the previous part to have approximately the same distribution as the base dataset.

**Exercise:**

 * 1. Reuse the method from the Data analysis part on ``training_set``, with ``value_counts``.
 * 2. Reuse the method from the Data analysis part on ``test_set``, with ``value_counts``.
 
If a check fails, reload the dataset and rerun all the previous cells.

In [None]:
train_distrib = None # <- Code here
test_distrib = None # <- Code here

# Read the ham proportion by label rather than by position
assert train_distrib['ham'] > 0.8 and test_distrib['ham'] > 0.8

### Data cleaning

If you look at the dataset using ``training_set.head()``, you will see that the sentences contain punctuation, which our algorithm does not use. That's why we're going to clean up the dataset by replacing the punctuation with spaces using ``str.replace(r'\W', ' ', regex=True)`` on the ``SMS`` column, then applying ``str.lower()``, which turns all the letters into lower case. Passing ``regex=True`` matters: since pandas 2.0, ``str.replace`` treats the pattern as a literal string by default.

 * [Regular expression -> \W](https://www.geeksforgeeks.org/javascript-regexp-w-metacharacter/)
 * [pd.Series.str.replace()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)
 * [pd.Series.str.lower()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html)
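Here is a minimal sketch of the two cleaning steps on two made-up messages. The names ``toy_sms`` and ``cleaned`` are hypothetical.

```python
import pandas as pd

# Two made-up messages with punctuation and mixed case.
toy_sms = pd.Series(["WINNER!! Call now: 0800-123", "Ok lar... Joking"])

# \W matches any character that is not a letter, a digit or an underscore.
# regex=True is needed for the pattern to be read as a regular expression.
cleaned = toy_sms.str.replace(r"\W", " ", regex=True).str.lower()
print(cleaned.tolist())
```

Each punctuation character becomes one space, so runs of spaces remain. They disappear later when the messages are split into words.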

**Exercise:**

 * 1. Replace non-word characters with spaces, by applying ``str.replace(r'\W', ' ', regex=True)`` to ``training_set['SMS']``.
 * 2. Apply the ``str.lower()`` method to ``training_set['SMS']``.

In [None]:
training_set['SMS'] = None # <- Code here
training_set['SMS'] = None # <- Code here
training_set.head(3)


### Creating the Vocabulary

The first step is to create a vocabulary: the list of all unique words in the training set.

We want to obtain the probability that a word appears in an SMS.\
To do this, we need to count the number of occurrences of each word in our training dataset.

* [Remove duplicates from a Python list](https://www.w3schools.com/python/python_howto_remove_duplicates.asp)
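One possible way to build such a list is sketched below on three made-up, already split messages. The ``toy_`` names are hypothetical, and ``dict.fromkeys`` is only one option: ``list(set(...))`` also removes duplicates, but does not keep the original order.

```python
# Three made-up messages, already split into words.
toy_messages = [["free", "prize", "call"], ["call", "me", "later"], ["free", "call"]]

# Collect every word, duplicates included.
toy_vocabulary = []
for sms in toy_messages:
    for word in sms:
        toy_vocabulary.append(word)

# dict.fromkeys keeps the first occurrence of each word, in order.
toy_vocabulary = list(dict.fromkeys(toy_vocabulary))
print(toy_vocabulary)
```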

**Exercise:**
 * 1. Add each word of the training_set to the ``vocabulary`` list.
 * 2. Remove duplicates from the list.

In [None]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        pass # <- Code here : Add each word to the vocabulary list
        
# <- Code here : Remove duplicates from the vocabulary list


assert len(vocabulary) == 7783, "The vocabulary should have a length of 7783"

### Fill vocabulary

Now we will count the number of occurrences of each word in each sentence of the training dataset.

The final goal is still to obtain the probability that a word appears in a sentence.\
Read through the code, then move on to the next part.

In [None]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
assert word_counts_per_sms['and'][-2] == 1, "You did not reach the expected behavior."

In [None]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

### Adapt dataset shape

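We are going to concatenate the training dataframe with the one created in the previous cell to gather all our information in a single dataframe. This gives us access to the number of occurrences of each word for each sentence. Note that ``pd.concat`` with ``axis=1`` aligns rows on their index: ``training_set`` must be indexed from 0 to 4457 like ``word_counts`` (use ``reset_index(drop=True)`` after the split if needed), otherwise the rows will not line up.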
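The small sketch below illustrates how ``pd.concat`` behaves when the two indexes differ, using two made-up dataframes. The names ``left``, ``right``, ``misaligned`` and ``aligned`` are hypothetical.

```python
import pandas as pd

# "left" mimics a shuffled dataframe whose index is no longer 0, 1.
left = pd.DataFrame({"Label": ["ham", "spam"]}, index=[7, 3])
# "right" mimics a freshly built dataframe with the default index 0, 1.
right = pd.DataFrame({"free": [0, 1]})

# With axis=1, rows are matched on the index: nothing matches here,
# so the result has four rows padded with missing values.
misaligned = pd.concat([left, right], axis=1)

# After renumbering the rows, both frames share the index 0, 1.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(misaligned.shape, aligned.shape)
```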

In [None]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

## II) Multinomial Naive Bayes

Now we will apply the Multinomial Naive Bayes algorithm.

### Calculating Constants First

Now that we're done with cleaning the training set, we can begin coding the spam filter.\
The multinomial Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:


$\boldsymbol{\mathbf{P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)}}$

$\boldsymbol{\mathbf{P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)}}$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

$\boldsymbol{\mathbf{P(w_i|Spam) = \frac{N_{w_i|Spam}  + \alpha}{N_{Spam} +  \alpha \cdot N_{Vocabulary}}}}$

$\boldsymbol{\mathbf{P(w_i|Ham) = \frac{N_{w_i|Ham}  + \alpha}{N_{Ham} +  \alpha \cdot N_{Vocabulary}}}}$

Some of the terms in the four equations above will have the same value for every new message. We can calculate these terms once and avoid repeating the computation when a new message comes in. As a start, let's first calculate:

 * $P(Spam)$ and $P(Ham)$
 
 * $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$
 
It's important to note that:

 * $N_{Spam}$ is equal to the number of words in all the spam messages. It's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.

 * $N_{Ham}$ is equal to the number of words in all the non-spam messages. It's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.

We'll also use Laplace smoothing and set $\alpha=1$. Without it, a word that never appears in one class would get a probability of 0 for that class and would zero out the whole product.
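The effect of $\alpha$ can be checked on made-up counts. The helper ``smoothed_probability`` below is hypothetical and simply mirrors the formula for $P(w_i|Spam)$ given above; the counts (10 words in the class, 5 words in the vocabulary) are invented for the example.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# A word seen 3 times in the class: (3 + 1) / (10 + 1 * 5) = 4/15
p_seen = smoothed_probability(3, 10, 5)

# A word never seen in the class still gets a small probability: 1/15
p_unseen = smoothed_probability(0, 10, 5)

# Without smoothing (alpha = 0) the same word would get exactly 0
p_unsmoothed = smoothed_probability(0, 10, 5, alpha=0)

print(p_seen, p_unseen, p_unsmoothed)
```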

**Exercise:**

 - 1. Calculate P(Spam) : $$\frac{len(spam\_messages)}{len(training\_set\_clean)}$$
 - 2. Calculate P(Ham) : $$\frac{len(ham\_messages)}{len(training\_set\_clean)}$$
 - 3. Set ``n_vocabulary`` equal to vocabulary's length.
 - 4. Set ``alpha`` to 1

In [None]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = None # <- Code here
p_ham = None # <- Code here

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary length
n_vocabulary = None # <- Code here

# Laplace smoothing
alpha = None # <- Code here

assert alpha == 1 and p_spam == 0.13458950201884254 and p_ham == 0.8654104979811574 and n_vocabulary == 7783, "One of these values is wrong."

### Calculating Parameters

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$.

$P(w_i|Spam)$ and $P(w_i|Ham)$ will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

Therefore, each parameter will be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using these two equations:

$$\boldsymbol{\mathbf{P(w_i|Spam) = \frac{N_{w_i|Spam} +  \alpha}{N_{Spam} +  \alpha \cdot N_{Vocabulary}}}}$$

$$\boldsymbol{\mathbf{P(w_i|Ham) = \frac{N_{w_i|Ham} +  \alpha}{N_{Ham}  + \alpha \cdot N_{Vocabulary}}}}$$

**Exercise:**

 - 1. Calculate ``p_word_given_spam``: $$\frac{n\_word\_given\_spam + \alpha}{n\_spam + \alpha * n\_vocabulary}$$
 - 2. Calculate ``p_word_given_ham``: $$\frac{n\_word\_given\_ham + \alpha}{n\_ham + \alpha * n\_vocabulary}$$

In [None]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum() # spam_messages already defined
    p_word_given_spam = None # <- Code here
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_messages[word].sum() # ham_messages already defined
    p_word_given_ham = None # <- Code here
    parameters_ham[word] = p_word_given_ham

### Classifying A New Message

Now that we have all our parameters calculated, we can start creating the spam filter. \
The spam filter is understood as a function that:

 * Takes in as input a new message ($w_1$, $w_2$, ..., $w_n$).
 * Calculates P(Spam|$w_1$, $w_2$, ..., $w_n$) and P(Ham|$w_1$, $w_2$, ..., $w_n$).
 * Compares the values of P(Spam|$w_1$, $w_2$, ..., $w_n$) and P(Ham|$w_1$, $w_2$, ..., $w_n$), and:
   * If P(Ham|$w_1$, $w_2$, ..., $w_n$) > P(Spam|$w_1$, $w_2$, ..., $w_n$), then the message is classified as ham.
   * If P(Ham|$w_1$, $w_2$, ..., $w_n$) <= P(Spam|$w_1$, $w_2$, ..., $w_n$), then the message is classified as spam.
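The comparison described above can be sketched on made-up parameters. This sketch is a variant, not the notebook's method: it adds logarithms instead of multiplying probabilities, a common way to avoid very small products on long messages, and it gives the same decision because the logarithm is increasing. All ``toy_`` names, the function ``classify_log`` and the probability values are hypothetical.

```python
import math

# Made-up priors and word probabilities, for illustration only.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_parameters_spam = {"free": 0.30, "prize": 0.30, "see": 0.05, "you": 0.05}
toy_parameters_ham = {"free": 0.05, "prize": 0.05, "see": 0.40, "you": 0.40}

def classify_log(words):
    """Return 'ham' or 'spam' for a list of already cleaned words."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in the rules above.
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])
    # Ham only wins when it is strictly more likely; ties go to spam.
    return "ham" if log_ham > log_spam else "spam"

print(classify_log(["free", "prize"]))
print(classify_log(["see", "you"]))
```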
   
**Exercise:**

 * Complete the condition so that it checks whether ``p_ham_given_message`` is greater than ``p_spam_given_message``.

In [None]:
import re

def classify(message) -> str:
    '''
    message: a string
    '''

    message = re.sub(r'\W', ' ', message)  # raw string: '\W' alone is an invalid escape sequence
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham: 
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    # complete the instruction
    if None: # <- Code here
        print('Label: Ham')
        return "ham"
    else:
        print('Label: Spam')
        return "spam"

The ``message`` is ``WINNER!! This is the secret code to unlock the money: C3421.`` and should be classified as ``spam``.

**Exercise:**

 * Use the ``classify()`` function you developed to detect whether this ``message`` is **spam** or **ham**.

In [None]:
message = "WINNER!! This is the secret code to unlock the money: C3421."

label = None # <- Code here
assert label == "spam", "Your algorithm should classify this as spam message."

The ``message`` is ``Sounds good, Tom, then see u there`` and should be classified as ``ham``.

**Exercise:**

 * Use the ``classify()`` function you developed to detect whether this ``message`` is **spam** or **ham**.

In [None]:
message = "Sounds good, Tom, then see u there"

label = None # <- Code here
assert label == "ham", "Your algorithm should classify this as ham message."

### Measuring the Spam Filter's Accuracy

The two results look promising, but let's see how well the filter does on our test set, which has **1,114** messages.

Check this function that returns classification labels instead of printing them.

In [None]:
def classify_test_set(message):
    '''
    message: a string
    '''

    message = re.sub(r'\W', ' ', message)  # raw string: '\W' alone is an invalid escape sequence
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    else:
        return 'spam'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [None]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

We can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages.\
To make the measurement, we'll use accuracy as a metric: the number of correctly classified messages divided by the total number of messages.
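Accuracy can be illustrated on four made-up predictions. The ``toy_`` names are hypothetical and plain lists stand in for the dataframe columns.

```python
# Made-up true labels and predictions for four messages.
toy_labels = ["ham", "spam", "ham", "ham"]
toy_predicted = ["ham", "spam", "spam", "ham"]

# Count the positions where the prediction matches the true label.
toy_correct = sum(1 for truth, guess in zip(toy_labels, toy_predicted) if truth == guess)

# Accuracy is the share of correct predictions: 3 out of 4 here.
toy_accuracy = toy_correct / len(toy_labels)
print(toy_accuracy)
```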

**Exercise:**

 - 1. Get the number of correct predictions.
 - 2. Calculate ``accuracy`` : $\frac{correct}{total}$

In [None]:
accuracy = None
total = test_set.shape[0]
correct = 0

for row in test_set.iterrows():
    row = row[1]
    print("Prediction : " + row['predicted'] + " -> Label : " + row['Label'])



assert accuracy > 0.98, "Your accuracy should be greater than 98%."

The accuracy is close to 99%, which is really good: our filter classified more than 98 percent of the messages correctly.

You just created a filter able to classify messages and detect spam using the Multinomial Naive Bayes algorithm.

And you also just finished this day. Rest well for the next one: you will learn how to create neural networks. Exciting, no? :D