# Protect your phone from spam messages with Naive Bayes
## Introduction and motivation
In this project, we are going to build a spam filter using Naive Bayes classifiers along with a dataset of 5,572 SMS messages that are already classified by humans.
The dataset is collected by Tiago A. Almeida and José María Gómez Hidalgo from various source and can be downloaded from here:
https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

## Dataset overview
The dataset contains 2 columns and 5,572 rows:
- Label for sms
- Sms content

In [1]:
# Import dataset and necessary libraries
import pandas as pd
import matplotlib as plt
%matplotlib inline
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
# What percentage of the messages is spam and what percentage is ham(non-spam)
sms.Label.value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

We have looked in the dataset and saw that about 87% of the messages are ham (non-spam), and the remaining 13% are spam.

## Train set and test set
In order to avoid biased result, we need to divide the dataset into a train set (80%) and test set (20%). We will leave the test set until the very end to prevent our algorithm from trying to tweak and fit into the dataset.

In [3]:
# Randomizing and split the dataset
randomized_data = sms.sample(frac=1, random_state=1)
train_index = round(len(randomized_data)*0.8)
train_set = randomized_data[:train_index].reset_index(drop=True)
test_set = randomized_data[train_index:].reset_index(drop=True)

In [4]:
# Check the spam and non-spam percentage in train set
train_set.Label.value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [5]:
# Check the spam and non-spam percentage in test set
test_set.Label.value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

We conclude that both training set and test set have similar percentages compared to the full dataset. Now let's move on.

## Data cleaning
In order to make Naive Bayes classifiers for each word in a new message, we want to:
- Normalize all the letters to lowercase
- Remove special characters such as !*#@^&
- Bring data to this format

<center><img src="https://dq-content.s3.amazonaws.com/433/cpgp_dataset_2.png"></center>

Each row describes a single message instead of 'SMS' column. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values spam, 2, 2, 1, 1, 0, 0, 0, 0, 0. These values tell us that:
- The message is spam.
- The word "secret" occurs two times inside the message.
- The word "prize" occurs two times inside the message.
- etc

In [6]:
# Normalize the text and remove special characters
import re
train_set['SMS'] = train_set['SMS'].str.replace('\W', ' ')
train_set['SMS'] = train_set['SMS'].str.lower()

### Create the vocabulary
Make a list of unique words from the whole training set in order to bring our dataset to the desired format.

In [7]:
train_set['SMS'] = train_set['SMS'].str.split()
vocabulary = []
for r in train_set['SMS']:
    for w in r:
        vocabulary.append(w)
vocabulary = list(set(vocabulary))

Create a dictionary word_counts_per_txt where each key is a unique word from the vocabulary, and each value is a list of the length of training set, where each element in the list is the count of that word in the sentence.

In [8]:
word_counts_per_txt = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}
for index, text in enumerate(train_set['SMS']):
    for word in text:
        word_counts_per_txt[word][index] += 1
# Transform word_counts_per_txt into a Dataframe
word_counts = pd.DataFrame(word_counts_per_txt)
word_counts[:3]

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Notice how the label is missing? Let's concatenate word_counts dataframe with training set.

In [9]:
train_set = pd.concat([train_set, word_counts], axis=1)

### Implementing Naive Bayes classification
The formula below is the Naive Bayes for spam and ham probability for any new messages

<center><img src="https://render.githubusercontent.com/render/math?math=P%28Spam%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Spam%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CSpam%29&mode=display"></center>

<center><img src="https://render.githubusercontent.com/render/math?math=P%28Ham%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Ham%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CHam%29&mode=display"></center>

<center><img src="https://render.githubusercontent.com/render/math?math=P%28w_i%7CSpam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CSpam%7D%20%2B%20%5Calpha%7D%7BN_%7BSpam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display"></center>

<center><img src="https://render.githubusercontent.com/render/math?math=P%28w_i%7CHam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CHam%7D%20%2B%20%5Calpha%7D%7BN_%7BHam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display"></center>

First let's caculate all the necessary variables such as:
- p_ham: probability of ham sms
- p_spam: probability of spam sms
- n_ham: the number of words in all the ham messages, not to be confused with the number of unique words in ham messages
- n_spam: the number of words in all the spam messages, not to be confused with the number of unique words in spam messages
- alpha: Laplace smoothing
- p_word_ham: a dictionary for the probability of each unique word that appear in a ham message
- p_word_ham: a dictionary for the probability of each unique word that appear in a spam message

In [74]:
# Caculate p_ham, p_spam, n_ham and n_spam
ham_sms = train_set[train_set.Label == 'ham']
spam_sms = train_set[train_set.Label == 'spam']
p_ham = len(spam_sms) / len(train_set)
p_spam = len(spam_sms) / len(train_set)
n_ham = ham_sms['SMS'].apply(len).sum()
n_spam = spam_sms['SMS'].apply(len).sum()
alpha = 0.1

In [75]:
# Initialize two dictionaries p_word_ham and p_word_spam
p_word_ham = {unique_word: 0 for unique_word in vocabulary}
p_word_spam = {unique_word: 0 for unique_word in vocabulary}
for word in vocabulary:
    n_word_given_ham = ham_sms[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) /(n_ham + alpha * len(vocabulary))
    p_word_ham[word] = p_word_given_ham
    n_word_given_spam = spam_sms[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) /(n_spam + alpha * len(vocabulary))
    p_word_spam[word] = p_word_given_spam

Now we have all the constants and variable we needed.Let's create our spam filter. We can start buiding a function that:
- Take in a new meassage as input
- Caculate P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn) to decide on label of new message

In [76]:
def classification(mes):
    sms = re.sub('\W', ' ', mes)
    sms = sms.lower()
    sms = sms.split()
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    for w in sms:
        if w in p_word_ham:
            p_ham_given_message *= p_word_ham[w]
        if w in p_word_spam:
            p_spam_given_message *= p_word_spam[w]
    print('P Ham: ', p_ham_given_message)
    print('P Spam: ', p_spam_given_message)
    if p_ham_given_message > p_spam_given_message:
        label = 'Ham'
    elif p_ham_given_message < p_spam_given_message:
        label = 'Spam'
    else:
        label = 'This needs human evaluation'
    print('Label: ', label)

In [77]:
# Test on message 1
classification('WINNER!! This is the secret code to unlock the money: C3421.')

P Ham:  3.4799505962427803e-29
P Spam:  2.258443385393428e-24
Label:  Spam


In [78]:
# Test on message 2
classification('Sounds good, Tom, then see u there')

P Ham:  8.054242025684071e-22
P Spam:  1.2514868586939768e-25
Label:  Ham


Now we can apply the function onto our test set and get our accuracy for this classifier:

In [79]:
def classification_test_set(mes):
    sms = re.sub('\W', ' ', mes)
    sms = sms.lower()
    sms = sms.split()
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    for w in sms:
        if w in p_word_ham:
            p_ham_given_message *= p_word_ham[w]
        if w in p_word_spam:
            p_spam_given_message *= p_word_spam[w]
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'This needs human evaluation'
test_set['prediction'] = test_set['SMS'].apply(classification_test_set)
correct =0
total = len(test_set)
for i in test_set.iterrows():
    i = i[1]
    if i['Label'] == i['prediction']:
        correct += 1
correct/total * 100

98.29443447037703

## Conclusion and next step
So the accuracy is at 97,6% percent. That is quite impressive for such a small training set. Let's have a look at the inaccurately marked data to see what we can improve:

In [61]:
test_set[test_set.Label != test_set.prediction]

Unnamed: 0,Label,SMS,prediction
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
182,ham,Surely result will offer:),spam
247,ham,Which channel:-):-):):-).,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,This needs human evaluation
302,ham,No calls..messages..missed calls,spam
304,ham,This phone has the weirdest auto correct.,spam


Let's try out different alpha values to see how the accuracy changes:

In [80]:
alpha_value = pd.DataFrame()
alpha_value['alpha'] = [0.1, 0.3, 0.6, 1, 1.3, 1.6, 2]
alpha_value['Accuracy'] = [98.3, 98.3, 97.9, 97.6, 97.8, 97.7, 97.7]
alpha_value

Unnamed: 0,alpha,Accuracy
0,0.1,98.3
1,0.3,98.3
2,0.6,97.9
3,1.0,97.6
4,1.3,97.8
5,1.6,97.7
6,2.0,97.7


It seems our accuracy reach its peak with a smaller value of alpha. It proves that our classifier only see a minority of words it never seen before in the train