# <center>Project #15 - Creating a Spam Filter with a Naive Bayes Algorithm</center>

![GettyImages-122143117-5c64996246e0fb0001f256b1.jpg](attachment:GettyImages-122143117-5c64996246e0fb0001f256b1.jpg)

In this project, we're going to study the practical side of the naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we can use a naive Bayes algorithm that:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
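
The decision rule in step 3 can be sketched in a few lines. This is a minimal illustration rather than the function built later in this project: the name `decide` and the label `'needs human classification'` are assumptions made for this example.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Return a label by comparing the two probabilities for one message."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    # Equal values: defer to a human, as suggested in step 3.
    return 'needs human classification'


print(decide(0.002, 0.0005))
print(decide(0.0001, 0.0004))
print(decide(0.001, 0.001))
```

Only the relative size of the two values matters, which is why the formulas used later can drop the shared normalising constant.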

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

import warnings
warnings.filterwarnings('ignore')

# Reading in the dataset

In [2]:
# Reading in the spam message data.
# The data is a tab delimited file with no headers - so we are accounting for this in the read_csv function.
spam_data = pd.read_csv('./Data/SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Exploring the dataset

In [3]:
# Exploring the dataset 
display('First five entries:')
display(spam_data.head(5))
display('Last five entries:')
display(spam_data.tail(5))

# Figuring out the dimensions of the dataset
display('There are ' + str(spam_data.shape[0]) + ' rows in the dataset')
display('There are ' + str(spam_data.shape[1]) + ' columns in the dataset')

'First five entries:'

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


'Last five entries:'

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


'There are 5572 rows in the dataset'

'There are 2 columns in the dataset'

Spam messages are appropriately labelled <mark>spam</mark>, while non-spam messages are labelled <mark>ham</mark>.

In [4]:
# Calculating the proportion of spam to non-spam messages.

spam_data['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

We can see that 86.6% of the messages are genuine, while the remaining 13.4% are spam.

## Creating Training and Test Datasets

Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

However, before creating it, it's very helpful to first think of a way of testing how well it works. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

To test the spam filter, we're first going to split our dataset into two categories:

- A __training__ set, which we'll use to "train" the computer how to classify messages.
- A __test__ set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The __training__ set will have 4,458 messages (about 80% of the dataset).
- The __test__ set will have 1,114 messages (about 20% of the dataset).
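
As a quick check of the arithmetic above, the split sizes follow directly from the dataset size. The variable names here are illustrative only.

```python
n_messages = 5572

# 80% of 5,572 is 4,457.6, which rounds to 4,458 training messages.
n_train = int(round(n_messages * 0.8))
# Whatever is left over forms the test set.
n_test = n_messages - n_train

print(n_train, n_test)
```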

We're going to randomise the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset. 

In [5]:
# Randomising the order of the dataset, in case there is any inherent ordering of the messages.
spam_data_randomised = spam_data.sample(frac=1, random_state=1)

# Setting the length of the training dataset for splitting the randomised dataset.
length_training_ds = int(round(len(spam_data_randomised) * .8, 0))

# Resetting the index as we have randomised all the entries.
spam_data_randomised = spam_data_randomised.reset_index(drop=True)

# Splitting based on index:
training_ds = spam_data_randomised[:length_training_ds]
test_ds = spam_data_randomised[length_training_ds:].reset_index(drop=True)

# Figuring out the dimensions of the dataset
display('There are ' + str(training_ds.shape[0]) + ' rows in the training dataset')

display('There are ' + str(test_ds.shape[0]) + ' rows in the test dataset')


'There are 4458 rows in the training dataset'

'There are 1114 rows in the test dataset'

We can see that we have successfully randomised and divided the dataset into a training and test set. Next we want to check that these datasets have proportions of spam and non-spam messages that are consistent with the original dataset we received.

In [6]:
# Calculating the proportion of spam and non-spam messages using pd.value_counts

print('The original dataset has the following proportions of spam and non-spam:')
display(spam_data['Label'].value_counts(normalize=True)*100)

print('The sampled test dataset has the following proportions of spam and non-spam:')
display(test_ds['Label'].value_counts(normalize=True)*100)

print('The sampled training dataset has the following proportions of spam and non-spam:')
display(training_ds['Label'].value_counts(normalize=True)*100)

The original dataset has the following proportions of spam and non-spam:


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The sampled test dataset has the following proportions of spam and non-spam:


ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The sampled training dataset has the following proportions of spam and non-spam:


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

## Training Our Algorithm to Identify Spam SMS

We have split our dataset into a training set and a test set. The next big step is to use the training set to teach the algorithm to classify new messages.


Our Naive Bayes algorithm will make the classification by comparing the results of these two equations:
$$P(Spam|w_1,w_2,...,w_n)\propto P(Spam)\cdot\prod_{i=1}^n P(w_i|Spam)$$

$$P(Ham|w_1,w_2,...,w_n)\propto P(Ham)\cdot\prod_{i=1}^n P(w_i|Ham)$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we need to use these equations:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha\cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha\cdot N_{Vocabulary}}$$

Summarising the terms in the equations above:

$N_{w_i|Spam}$ = the number of times the word $w_i$ occurs in spam messages.

$N_{w_i|Ham}$ = the number of times the word $w_i$ occurs in non-spam messages.

$N_{Spam}$ = total number of words in spam messages.

$N_{Ham}$ = total number of words in non-spam messages.

$N_{Vocabulary}$ = total number of unique words in the vocabulary.

$\alpha = 1$ ($\alpha$ is a smoothing parameter).
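
To make the terms above concrete, here is a small worked example on four invented messages. The messages, the helper `p_word_given_class` and all variable names are assumptions made for illustration; they are not part of the dataset or of the code used later.

```python
from collections import Counter

# Toy, already-cleaned messages (invented for illustration only).
spam_messages = [['win', 'money', 'now'], ['win', 'prize']]
ham_messages = [['see', 'you', 'now'], ['call', 'me', 'now', 'please']]

alpha = 1  # Laplace smoothing

spam_counts = Counter(word for message in spam_messages for word in message)
ham_counts = Counter(word for message in ham_messages for word in message)

n_spam = sum(spam_counts.values())                     # total words in spam
n_ham = sum(ham_counts.values())                       # total words in ham
n_vocabulary = len(set(spam_counts) | set(ham_counts)) # unique words overall


def p_word_given_class(word, counts, n_class):
    # (N_word|class + alpha) / (N_class + alpha * N_vocabulary)
    return (counts[word] + alpha) / (n_class + alpha * n_vocabulary)


p_win_spam = p_word_given_class('win', spam_counts, n_spam)  # (2 + 1) / (5 + 9)
p_win_ham = p_word_given_class('win', ham_counts, n_ham)     # (0 + 1) / (7 + 9)
print(p_win_spam, p_win_ham)
```

Because of the smoothing term, "win" still gets a small non-zero probability for ham even though it never appears in a ham message, so a single unseen word cannot force the whole product to zero.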

To calculate all these probabilities, we'll first need to perform a bit of data cleaning to bring the data into a format that will allow us to easily extract all the information we need.



## Data Cleaning

### Punctuation

We first need to strip out the punctuation from the SMS messages. Otherwise punctuation would effectively create different words: money, money!, money? and money!!! would all be treated as different words.
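
The effect of punctuation on word identity can be seen with a short standalone sketch. It uses Python's built-in `re` module on a few invented strings; the variable names are illustrative.

```python
import re

variants = ['money', 'money!', 'money?', 'money!!!']

# \W matches any character that is not a letter, digit or underscore,
# so every punctuation mark is replaced by a space before splitting.
cleaned = [re.sub(r'\W', ' ', text).split() for text in variants]
unique_words = {word for words in cleaned for word in words}

print(cleaned)
print(unique_words)
```

After cleaning, all four variants collapse to the single word "money".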

In [7]:
# Looking at the data before the punctuation is removed.
display(training_ds.head(10))

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...
5,ham,Ok i thk i got it. Then u wan me 2 come now or...
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...
7,ham,No dear i was sleeping :-P
8,ham,Ok pa. Nothing problem:-)
9,ham,Ill be there on &lt;#&gt; ok.


In [8]:
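# Using Series.str.replace() combined with the regex pattern \W to replace all punctuation with spaces.
# regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

training_ds['SMS'] = training_ds['SMS'].str.replace(r'\W', ' ', regex=True)
test_ds['SMS'] = test_ds['SMS'].str.replace(r'\W', ' ', regex=True)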

In [9]:
# Checking that the data has had the punctuation removed.
display(training_ds.tail(10))

Unnamed: 0,Label,SMS
4448,ham,I donno its in your genes or something
4449,spam,YOUR CHANCE TO BE ON A REALITY FANTASY SHOW ca...
4450,ham,Prakesh is there know
4451,ham,The beauty of life is in next second which h...
4452,ham,How about clothes jewelry and trips
4453,ham,Sorry I ll call later in meeting any thing re...
4454,ham,Babe I fucking love you too You know Fuck...
4455,spam,U ve been selected to stay in 1 of 250 top Bri...
4456,ham,Hello my boytoy Geeee I miss you already a...
4457,ham,Wherre s my boytoy


### Capitalisation / Case

In the same vein, we want to make sure that all the words are in a consistent case. Otherwise capitalisation would effectively create different words: Money, money, mOney and MONEY would all be treated as different words. We are going to convert all the words to lower case.

In [10]:
# Using Series.str.lower() to convert our SMS words to lower case.

test_ds['SMS'] = test_ds['SMS'].str.lower()
training_ds['SMS'] = training_ds['SMS'].str.lower()

In [11]:
# Checking that the data has been converted to lower case.
display(training_ds.tail(10))

Unnamed: 0,Label,SMS
4448,ham,i donno its in your genes or something
4449,spam,your chance to be on a reality fantasy show ca...
4450,ham,prakesh is there know
4451,ham,the beauty of life is in next second which h...
4452,ham,how about clothes jewelry and trips
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...
4457,ham,wherre s my boytoy


## Creating a Vocabulary

Now that we have standardised the SMS messages, we want to create a Vocabulary for our messages by splitting the SMS strings into individual words.

In [12]:
# Splitting the SMS messages into a list of words and assigning to a new column called 'Word List'
training_ds['Word List'] = training_ds['SMS'].str.split()

training_ds.head()

Unnamed: 0,Label,SMS,Word List
0,ham,yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]"
1,ham,yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,..."
2,ham,welp apparently he retired,"[welp, apparently, he, retired]"
3,ham,havent,[havent]
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [13]:
# Writing a loop and nested loop to go through each SMS and add the individual words to a vocabulary list.

vocab = []

for sms in training_ds['Word List']:
    for word in sms:
        vocab.append(word)

# Converting this list to a set to remove duplicate words, and returning to a list.

vocab = list(set(vocab))

In [14]:
# Sampling five random entries from the vocabulary
import random

random.sample(vocab,5)

['idk', 'listened2the', 'doke', 'lengths', 'weird']
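
The vocabulary construction above can also be written as a single set comprehension. This is a sketch on three toy word lists (the third list is invented), not a replacement for the cells above.

```python
# Toy word lists standing in for training_ds['Word List'].
word_lists = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['welp', 'apparently', 'he', 'retired'],
    ['the', 'pretty', 'one'],
]

# A set keeps each word once; sorting gives a reproducible order.
vocabulary = sorted({word for words in word_lists for word in words})

print(len(vocabulary))
print(vocabulary)
```

Sorting is optional, but it makes the column order of any DataFrame built from the vocabulary the same on every run, whereas the iteration order of a set of strings can vary between Python sessions.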

## Creating a Word Count Dictionary for All SMS Messages

Now we're going to use the vocabulary to calculate how often each word occurs in each SMS message. Eventually, we're going to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need.

To create the dictionary we need for our training set, we can use the code below, where:

- We start by initializing a dictionary named word_counts_per_sms, where each key is a unique word (a string) from the vocabulary, and each value is a list as long as the training set, where each element in the list is a 0.
    - The code [0] * len(training_ds['Word List']) outputs a list of the length of training_ds['Word List'], where each element in the list will be a 0.
- We loop over training_ds['Word List'], using the enumerate() function to get both the index and the SMS message (index and sms).
    - Using a nested loop, we loop over sms (a list of strings, where each string represents a word in a message).
        - We increment word_counts_per_sms[word][index] by 1.
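
The steps above can be traced on a two-message toy example. The messages and the names `word_lists`, `vocabulary` and `word_counts` are assumptions made for this sketch.

```python
# Two toy messages, already cleaned and split into words.
word_lists = [['ha', 'ha', 'only', 'joking'], ['only', 'fish']]
vocabulary = sorted({word for words in word_lists for word in words})

# One list of zeros per vocabulary word, one slot per message.
word_counts = {word: [0] * len(word_lists) for word in vocabulary}

for index, words in enumerate(word_lists):
    for word in words:
        word_counts[word][index] += 1

for word, counts in word_counts.items():
    print(word, counts)
```

Each key ends up with one count per message, so "ha" maps to `[2, 0]` and "fish" to `[0, 1]`.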


In [15]:
# We start by initialising a dictionary with keys being the vocabulary words.
# The entries under different keys (vocab words) are lists of numbers, corresponding to how often that word occurs
# in the respective SMS

word_counts_per_sms = {unique_word: [0] * len(training_ds['Word List']) for unique_word in vocab}

# We now iterate through all the SMS messages to tally up the occurrences of words in each message
for index, sms in enumerate(training_ds['Word List']):
    
    # Iterating through the words in each SMS and increasing the value in the respective list object within the word key.
    for word in sms:
        word_counts_per_sms[word][index] += 1

Looking at the dictionary entry for the word __the__, we see the corresponding tallies for each different SMS message in the training dataset:

In [16]:
word_counts_per_sms['the']

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...]


In [17]:
# Converting the dictionary to a more usable DataFrame

word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)

Concatenating the DataFrame we just built above with the DataFrame containing the training set:

In [18]:
# Concatenating the original dataset with the individual word counts in each message

training_ds_wc = pd.concat([training_ds,word_counts_per_sms_df],axis=1)

We are going to look at a specific message to highlight what this new DataFrame contains.

In [19]:
# Looking at a message with the index number 1,000 (arbitrarily chosen)
training_ds_wc.iloc[1000,1]

'i m going to try for 2 months ha ha only joking'

The columns we concatenated to the original training DataFrame hold tallies of how often each vocabulary word occurs in this message:

In [20]:
# Looking at the occurrence of 'ha' in the message above (index number 1,000)
print('The word ha occurs: ' + str(training_ds_wc['ha'].iloc[1000]) + ' times')

# Looking at the occurrence of 'joking' in the message above (index number 1,000)
print('The word joking occurs: ' + str(training_ds_wc['joking'].iloc[1000]) + ' times')

# Looking at the occurrence of 'fish' in the message above (index number 1,000)
print('The word fish occurs: ' + str(training_ds_wc['fish'].iloc[1000]) + ' times')


The word ha occurs: 2 times
The word joking occurs: 1 times
The word fish occurs: 0 times


## Creating the Spam Filter

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

$$P(Spam|w_1,w_2,...,w_n)\propto P(Spam)\cdot\prod_{i=1}^n P(w_i|Spam)$$

$$P(Ham|w_1,w_2,...,w_n)\propto P(Ham)\cdot\prod_{i=1}^n P(w_i|Ham)$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we need to use these equations:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha\cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha\cdot N_{Vocabulary}}$$

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:
- $P(Spam)$ and $P(Ham)$
- $N_{Ham}, N_{Spam}, N_{Vocabulary}$

We'll also use Laplace smoothing and set $\alpha = 1$.

## Calculating Parameters

In [21]:
# Calculating P(Spam) and P(Ham) using the training dataset. Passing normalize=True to value_counts gives the
# proportion of each label, which is also the probability of getting either type of message within the training dataset.

display(training_ds_wc['Label'].value_counts(normalize=True))

# Selecting the probabilities for ham and spam by label rather than by position

label_proportions = training_ds_wc['Label'].value_counts(normalize=True)
p_ham = label_proportions['ham']
p_spam = label_proportions['spam']

print('The probability of a non-spam SMS is ' + str(p_ham) + ', and a spam SMS is ' + str(p_spam) + '.')


ham     0.86541
spam    0.13459
Name: Label, dtype: float64

The probability of a non-spam SMS is 0.8654104979811574, and a spam SMS is 0.13458950201884254.


In [22]:
# Calculating N_spam, N_ham and N_vocab

# First of all we are going to divide the training dataset into spam and ham categories
training_ds_wc_spam = training_ds_wc[training_ds_wc['Label'] == 'spam']
training_ds_wc_ham = training_ds_wc[training_ds_wc['Label'] == 'ham']


# N_total is equal to the number of words in all the messages - not the number of messages
# Setting a counter for the total number of words in all the messages as n_total
n_total = 0

# Looping through all unique words and adding their total mentions to the n_total counter
for word in vocab:
    n_total += sum(training_ds_wc[word])

print('n_total (The total number of words in the training texts) is: ' + str(n_total))

# N_Spam is equal to the number of words in all the spam messages - not the number of spam messages
# Setting a counter for the total number of words in all the spam messages as n_spam

n_spam = 0

# Looping through all unique words and adding their total mentions in spam messages to the n_spam counter
for word in vocab:
    n_spam += sum(training_ds_wc_spam[word])
    
# Alternative way to calculate:
#spam_words = training_ds_wc_spam['Word List'].apply(len)
#n_spam = spam_words.sum()
    
print('n_spam (The total number of words in spam SMS within the training dataset) is: ' + str(n_spam))

# N_Ham is equal to the number of words in all the non-spam messages - not the number of non-spam messages
# Setting a counter for the total number of words in all the non-spam messages as n_ham

n_ham = 0

# Looping through all unique words and adding their total mentions in non-spam messages to the n_ham counter
for word in vocab:
    n_ham += sum(training_ds_wc_ham[word])
    
# Alternative way to calculate:
#ham_words = training_ds_wc_ham['Word List'].apply(len)
#n_ham = ham_words.sum()

print('N_ham (The total number of words in non-spam SMS within the training dataset) is: ' + str(n_ham))

# We want to calculate the number of unique words in the vocabulary, n_vocab
n_vocab = len(vocab)

print('N_vocab (The total number of words in the training dataset vocabulary) is: ' + str(n_vocab))

# Lastly we want to set a Laplace smoothing coefficient
alpha = 1

n_total (The total number of words in the training texts) is: 72427
n_spam (The total number of words in spam SMS within the training dataset) is: 15190
N_ham (The total number of words in non-spam SMS within the training dataset) is: 57237
N_vocab (The total number of words in the training dataset vocabulary) is: 7783
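
The per-class word totals can also be computed from the word lists directly, as the commented-out alternative in the cell above suggests. The following sketch shows the idea on a three-row toy DataFrame; the data and the `_toy` names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the training set (invented rows).
toy = pd.DataFrame({
    'Label': ['spam', 'ham', 'ham'],
    'Word List': [['win', 'money', 'now'], ['see', 'you', 'now'], ['ok']],
})

# Number of words in each message, then summed within each label.
words_per_message = toy['Word List'].apply(len)
n_words_by_label = words_per_message.groupby(toy['Label']).sum()

n_spam_toy = int(n_words_by_label['spam'])
n_ham_toy = int(n_words_by_label['ham'])
print(n_spam_toy, n_ham_toy)
```

This avoids looping over every vocabulary column, which may be noticeably faster on a vocabulary of several thousand words.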


## Calculating Conditional Probability Parameters

Let's now calculate $P(w_i|Spam)$ (the probability of a given word occurring, given the message is spam) and $P(w_i|Ham)$ (the probability of a given word occurring, given the message is not spam) for all the words in the training vocabulary, using the equations below:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha\cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha\cdot N_{Vocabulary}}$$

In [23]:
# Initialise two dictionaries, where each key is a unique word (a string) from our vocabulary and each value starts at 0.
# We'll need one dictionary to store the parameters for P(wi|Spam), and the other for P(wi|Ham).

p_word_given_spam = {word: 0 for word in vocab}
p_word_given_ham = {word: 0 for word in vocab}

# Looping through all unique words in the vocabulary, calculating the probability that the word will occur given the
# nature of the SMS (spam or non-spam), and assigning that probability to the dictionary
for word in vocab:
    p_word_given_spam[word] = (training_ds_wc_spam[word].sum() + alpha) / (n_spam + alpha * n_vocab)
    p_word_given_ham[word] = (training_ds_wc_ham[word].sum() + alpha) / (n_ham + alpha * n_vocab)

## Testing Word Probabilities
Looking at a few words and their probabilities of occurring in either spam or non-spam SMS:

In [24]:
# Each value is the Laplace-smoothed probability P(word|class), i.e. the word's share of all word
# occurrences in that class, not the share of messages that contain the word.
print('The word "offer" has a smoothed probability of ' + str(round(p_word_given_ham['offer'] * 100, 3)) + '% in non-spam messages.')
print('The word "offer" has a smoothed probability of ' + str(round(p_word_given_spam['offer'] * 100, 3)) + '% in spam messages.')
print('')
print('The word "urgent" has a smoothed probability of ' + str(round(p_word_given_ham['urgent'] * 100, 3)) + '% in non-spam messages.')
print('The word "urgent" has a smoothed probability of ' + str(round(p_word_given_spam['urgent'] * 100, 3)) + '% in spam messages.')
print('')
print('The word "important" has a smoothed probability of ' + str(round(p_word_given_ham['important'] * 100, 3)) + '% in non-spam messages.')
print('The word "important" has a smoothed probability of ' + str(round(p_word_given_spam['important'] * 100, 3)) + '% in spam messages.')
print('')
print('The word "thanks" has a smoothed probability of ' + str(round(p_word_given_ham['thanks'] * 100, 3)) + '% in non-spam messages.')
print('The word "thanks" has a smoothed probability of ' + str(round(p_word_given_spam['thanks'] * 100, 3)) + '% in spam messages.')
print('')
print('The word "love" has a smoothed probability of ' + str(round(p_word_given_ham['love'] * 100, 3)) + '% in non-spam messages.')
print('The word "love" has a smoothed probability of ' + str(round(p_word_given_spam['love'] * 100, 3)) + '% in spam messages.')
print('')
print('The word "careful" has a smoothed probability of ' + str(round(p_word_given_ham['careful'] * 100, 3)) + '% in non-spam messages.')
print('The word "careful" has a smoothed probability of ' + str(round(p_word_given_spam['careful'] * 100, 3)) + '% in spam messages.')

The word "offer" has a smoothed probability of 0.011% in non-spam messages.
The word "offer" has a smoothed probability of 0.096% in spam messages.

The word "urgent" has a smoothed probability of 0.009% in non-spam messages.
The word "urgent" has a smoothed probability of 0.209% in spam messages.

The word "important" has a smoothed probability of 0.017% in non-spam messages.
The word "important" has a smoothed probability of 0.048% in spam messages.

The word "thanks" has a smoothed probability of 0.082% in non-spam messages.
The word "thanks" has a smoothed probability of 0.061% in spam messages.

The word "love" has a smoothed probability of 0.252% in non-spam messages.
The word "love" has a smoothed probability of 0.044% in spam messages.


## Observations

We can see that attention-grabbing words such as _important_, _urgent_ and _offer_ are much more likely to occur in spam messages (roughly 3, 23 and 9 times more likely, respectively). Spam messages often use this tactic to try to encourage engagement by creating a sense of urgency. Often this is in conjunction with scamming and monetary pursuit.

In contrast, more personable words such as _love_ and, to a lesser extent, _thanks_ are more likely to occur in non-spam messages. This is likely because these words represent much more genuine interactions between people who are already familiar with one another.

## Creating a Function To Calculate If a SMS is Spam or Not

We now have all the parameters required to determine whether a new SMS message is more likely to be genuine or spam, using the equations below:

$$P(Spam|w_1,w_2,...,w_n)\propto P(Spam)\cdot\prod_{i=1}^n P(w_i|Spam)$$

$$P(Ham|w_1,w_2,...,w_n)\propto P(Ham)\cdot\prod_{i=1}^n P(w_i|Ham)$$

In [25]:
# Writing a function to determine whether a message is spam or not.
# This function takes in a text message (SMS) and uses the product probability equations above

def determine_spam(SMS):
    # First we clean the message the same way as the training data, replacing punctuation with spaces using regex
    SMS_np = re.sub(r'\W', ' ', SMS)
    # We then make all the words lowercase to match the casing of our dictionaries
    SMS_np_lc = SMS_np.lower()
    # Then we split the SMS string into a list, which allows us to iterate over each word in the SMS
    SMS_word_list = SMS_np_lc.split()
    
    # We start from the prior probabilities P(Ham) and P(Spam), as we will multiply these by the
    # conditional probabilities of each word.
    non_spam_probability = p_ham
    spam_probability = p_spam
    
    # Now we cycle through each word in the input SMS and multiply in its probability for the scenarios
    # that the message is spam or non-spam. Words that are not in the training vocabulary are skipped.
    for word in SMS_word_list:
        if word in p_word_given_ham:
            non_spam_probability *= p_word_given_ham[word]
        if word in p_word_given_spam:
            spam_probability *= p_word_given_spam[word]
    
    # The two values are proportional to the posterior probabilities rather than true percentages,
    # so only their relative size matters.
    print('P(Spam|text) is ' + str(spam_probability*100) + '%')
    print('P(Ham|text) is ' + str(non_spam_probability*100) + '%')
    
    # Once we have iterated over all the words in the message, we compare the two values to tell the user
    # what the message is more likely to be.
    if non_spam_probability < spam_probability:
        print('"' + str(SMS) + '" is a spam message')
    else:
        print('"' + str(SMS) + '" is not a spam message')

## Testing the Spam Classification Function

With the function written, we can try it on two test messages: one that is typical of spam, and one that is a genuine message between two people who know one another.

In [26]:
# Non-spam message
SMS_message = 'hey, should we grab a beer after work?'

# We can now run the function on the sample SMS messages
determine_spam(SMS_message)

P(Spam|text) is 1.2824234293423733e-27%
P(Ham|text) is 3.503014781000854e-23%
"hey, should we grab a beer after work?" is not a spam message


In [27]:
# Spam message
SMS_message = 'WINNER!! This is the secret code to unlock the money: C3421'

# We can now run the function on the sample SMS messages
determine_spam(SMS_message)

P(Spam|text) is 1.3481290211300842e-23%
P(Ham|text) is 1.9368049028589874e-25%
"WINNER!! This is the secret code to unlock the money: C3421" is a spam message


## Observations

We can see that the function distinguishes between these two messages by comparing the two values calculated for each message from the parameters estimated on the training dataset.
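
The values printed above are on the order of 1e-25, and they shrink further with every extra word, so a very long message could in principle underflow to zero. A common way around this is to add logarithms instead of multiplying probabilities; because the logarithm is monotonic, the comparison gives the same answer. The sketch below uses invented parameters (all `_toy` names and numbers are assumptions, not values from the trained filter).

```python
import math

# Hypothetical toy parameters, invented for illustration only.
p_spam_toy, p_ham_toy = 0.13, 0.87
p_word_given_spam_toy = {'secret': 0.004, 'money': 0.006, 'beer': 0.0001}
p_word_given_ham_toy = {'secret': 0.0003, 'money': 0.0008, 'beer': 0.0009}


def log_scores(words):
    """Return (log score for spam, log score for ham) for a list of words."""
    log_spam = math.log(p_spam_toy)
    log_ham = math.log(p_ham_toy)
    for word in words:
        if word in p_word_given_spam_toy:  # skip words outside the vocabulary
            log_spam += math.log(p_word_given_spam_toy[word])
            log_ham += math.log(p_word_given_ham_toy[word])
    return log_spam, log_ham


log_spam, log_ham = log_scores(['secret', 'money', 'unknownword'])
print('spam' if log_spam > log_ham else 'ham')

log_spam_2, log_ham_2 = log_scores(['beer'])
print('spam' if log_spam_2 > log_ham_2 else 'ham')
```

Exponentiating a log score recovers the corresponding product, so nothing is lost by working in log space.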

## Application to the Test Dataset

We can now modify this function to determine whether the messages in the test dataset are spam or not. Instead of printing a message, it will return a label, <mark>ham</mark> or <mark>spam</mark>.

In [28]:
def classify_spam(SMS):
    # First we clean the message the same way as the training data, replacing punctuation with spaces using regex
    SMS_np = re.sub(r'\W', ' ', SMS)
    # We then make all the words lowercase to match the casing of our dictionaries
    SMS_np_lc = SMS_np.lower()
    # Then we split the SMS string into a list, which allows us to iterate over each word in the SMS
    SMS_word_list = SMS_np_lc.split()
    
    # Both values start at 1, so the priors p_ham and p_spam are not applied in this version
    # (determine_spam above starts from them). Only the word probabilities are multiplied in.
    non_spam_probability = 1
    spam_probability = 1
    
    # Now we cycle through each word in the input SMS and multiply in its probability for the scenarios
    # that the message is spam or non-spam. Words that are not in the training vocabulary are skipped.
    for word in SMS_word_list:
        if word in p_word_given_ham:
            non_spam_probability *= p_word_given_ham[word]
        if word in p_word_given_spam:
            spam_probability *= p_word_given_spam[word]
    
    # Once we have iterated over all the words in the message, we compare the two values
    # to return a label of 'ham' or 'spam'.
    
    if non_spam_probability < spam_probability:
        return 'spam'
    else:
        return 'ham'

In [29]:
test_ds['New Classification'] = test_ds['SMS'].apply(classify_spam)

In [30]:
test_ds['New Classification'].value_counts(normalize=False)

ham     952
spam    162
Name: New Classification, dtype: int64

In [31]:
test_ds['Label'].value_counts(normalize=False)

ham     967
spam    147
Name: Label, dtype: int64

In [32]:
# Looping through our test dataset to count how many are correct and how many are incorrect

correct = 0
incorrect = 0
total = test_ds.shape[0]


for row in test_ds.iterrows():
    row=row[1]
    if row['Label'] == row['New Classification']:
        correct += 1
    else:
        incorrect += 1
        
print('Our algorithm correctly identified the nature of ' + str(correct) + ' messages.')
print('Our algorithm incorrectly identified the nature of ' + str(incorrect) + ' messages.')
print('This is an accuracy of ' + str(round(correct / total * 100, 1)) + '%.')

Our algorithm correctly identified the nature of 1089 messages.
Our algorithm incorrectly identified the nature of 25 messages.
This is an accuracy of 97.8%.


## Results

We can see that our algorithm has correctly labelled 1,089 of the 1,114 messages in the test dataset (about 97.8%) as either spam or non-spam.
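
Accuracy on its own does not show which kind of mistake the filter makes, and because about 86.6% of the messages are ham, a filter that always answered "ham" would already score about 86.6%. Counting each (actual, predicted) pair separates the two error types. The sketch below uses six invented labels; the lists `actual` and `predicted` are assumptions made for illustration.

```python
from collections import Counter

# Invented labels standing in for test_ds['Label'] and test_ds['New Classification'].
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']

# Count every (actual, predicted) combination.
confusion = Counter(zip(actual, predicted))

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(confusion)
print(accuracy)
```

Here `confusion[('ham', 'spam')]` counts genuine messages flagged as spam, which is usually the more costly error for a spam filter, while `confusion[('spam', 'ham')]` counts spam that slipped through.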

## Investigating the Incorrect Labelling

To finish, we are going to look at the mislabelled text messages to figure out why they were incorrectly identified.

In [33]:
test_ds[test_ds['Label'] != test_ds['New Classification']]

Unnamed: 0,Label,SMS,New Classification
9,ham,i liked the new mobile,spam
114,spam,not heard from u4 a while call me now am here...,ham
152,ham,unlimited texts limited minutes,spam
159,ham,26th of july,spam
182,ham,surely result will offer,spam
247,ham,which channel,spam
284,ham,nokia phone is lovly,spam
302,ham,no calls messages missed calls,spam
304,ham,this phone has the weirdest auto correct,spam
319,ham,we have sent jd for customer service cum accou...,spam
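
One way to probe a mislabelled message is to rank its words by the ratio $P(w_i|Spam) / P(w_i|Ham)$: words with a ratio well above 1 pull the message towards spam. The sketch below shows the idea with invented parameters and an invented message; the numbers, the `_toy` dictionaries and the helper `spam_ratios` are hypothetical and do not come from the trained filter.

```python
# Hypothetical toy parameters, invented for illustration only.
p_word_given_spam_toy = {'new': 0.003, 'mobile': 0.008, 'liked': 0.0001, 'my': 0.01}
p_word_given_ham_toy = {'new': 0.001, 'mobile': 0.0004, 'liked': 0.0003, 'my': 0.02}


def spam_ratios(words):
    """Return (word, ratio) pairs sorted from most to least spam-leaning."""
    ratios = {
        word: p_word_given_spam_toy[word] / p_word_given_ham_toy[word]
        for word in words
        if word in p_word_given_spam_toy  # skip words outside the vocabulary
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranked = spam_ratios('i liked my new mobile'.split())
for word, ratio in ranked:
    print(word, round(ratio, 2))
```

Running the same ranking with the real p_word_given_spam and p_word_given_ham dictionaries on the messages listed above would show which words contributed most to each wrong label.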
