In [1]:
# import libraries
from IPython.display import Image  # for displaying images in markdown cells
import pandas as pd  # Dataframe manipulation
import numpy as np  # Arrays manipulation

# Dataquest - Conditional Probabilities <br/> <br/> Project Title: Building A Spam Filter With Naive Bayes

## 1) Introduction and Exploring the Dataset



#### Background:

Provided by: [Dataquest.io](https://www.dataquest.io/)

We're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, the computer:

1) Learns how humans classify messages.

2) Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.

3) Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

Note that due to the nature of spam messages, the dataset contains content that may be offensive to some users.

In [2]:
# read and load file, review for familiarity
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)


df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
rename_map = {0:'Label', 1:'SMS'}

df.rename(columns=rename_map, inplace=True)
df

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [4]:
# review number of rows and columns
df.shape

(5572, 2)

In [5]:
# review any null values (non-null values should match 5,572 rows)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
# review number and percentage of spam and ham
print(df['Label'].value_counts())
print('')
print(df['Label'].value_counts() / df['Label'].count() * 100)

ham     4825
spam     747
Name: Label, dtype: int64

ham     86.593683
spam    13.406317
Name: Label, dtype: float64
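
The same percentages can also be obtained in a single call with `value_counts(normalize=True)`. A minimal sketch on a hypothetical toy frame (`toy` is invented for illustration and is not the project data):

```python
import pandas as pd

# Hypothetical toy labels, used only to illustrate the call
toy = pd.DataFrame({'Label': ['ham', 'ham', 'ham', 'spam']})

# normalize=True returns relative frequencies instead of raw counts
label_pct = toy['Label'].value_counts(normalize=True) * 100
print(label_pct)
```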


## 2) Training And Test Set

Provided by: [Dataquest.io](https://www.dataquest.io/)

We read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

Before creating the spam filter, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is at classifying new messages. To test the spam filter, we're first going to split our dataset into two sets:

- A **training set**, which we'll use to "train" the computer to classify messages.
- A **test set**, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
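
The two sizes follow from rounding 80% of 5,572 to the nearest whole message. A small sketch of that arithmetic (the variable names are illustrative, not taken from the notebook):

```python
n_messages = 5572      # size of the full dataset
train_fraction = 0.8   # share of messages kept for training

# 80% of 5,572 is 4,457.6, which rounds to 4,458; the rest goes to the test set
n_train = round(n_messages * train_fraction)
n_test = n_messages - n_train
print(n_train, n_test)
```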

All 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

For now, let's create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

In [7]:
# Start by randomizing the entire dataset by using the DataFrame.sample() method.


# frac=1 samples 100% of the rows, i.e. it shuffles the entire dataset.
# random_state=1 fixes the random seed so that the shuffle is reproducible.
df_sample = df.sample(frac=1, random_state=1)


# Split the randomized dataset into a training and a test set.
# training set = 4,458 rows (80%), test set = 1,114 rows (20%)
# reset_index(drop=True) returns a new DataFrame instead of modifying the slice
# in place, which avoids pandas' SettingWithCopyWarning on later column assignments.
training_set = df_sample.iloc[:4458].reset_index(drop=True)
test_set = df_sample.iloc[4458:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [8]:
training_set.head(2)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"


In [9]:
test_set.head(2)

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...


In [10]:
# review number and percentage of spam and ham in each set (training set and test set)
print(training_set['Label'].value_counts())
print('')
print(training_set['Label'].value_counts() / training_set['Label'].count() * 100)
print('')
print(test_set['Label'].value_counts())
print('')
print(test_set['Label'].value_counts() / test_set['Label'].count() * 100)

ham     3858
spam     600
Name: Label, dtype: int64

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

ham     967
spam    147
Name: Label, dtype: int64

ham     86.804309
spam    13.195691
Name: Label, dtype: float64


#### Findings (Training And Test Set):

After splitting the dataset into training and test sets, the percentages of ham and spam messages in each set are quite similar and close to those of the initial dataset (87:13 split).

## 3) Letter Case And Punctuation

Provided by: [Dataquest.io](https://www.dataquest.io/)

The next big step is to use the training set to teach the algorithm to classify new messages.

When a new message comes in, our Naive Bayes algorithm will make the classification by comparing the results of these two equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

To calculate P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) inside the formulas above, we use the equations below. They apply additive smoothing, so a word that never occurs in one class does not force the whole product to zero:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Where:

\begin{aligned}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Ham} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Ham} = \text{total number of words in non-spam messages} \\
\\
&N_{Vocabulary} = \text{number of unique words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{aligned}


#### Data cleaning
To implement the algorithm above, we want to transform the existing data sets into new tables in which:
- each column represents one unique word
- each value in such a column is the number of times that word occurs in a message
- each row still represents one message
- punctuation and letter case are ignored

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.
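
As a quick illustration of the cleaning step on a single string, here is a minimal sketch. The helper name `clean_message` is hypothetical; it keeps letters, digits, underscores and whitespace, and lower-cases the rest:

```python
import re

def clean_message(text):
    """Strip punctuation and lower-case a message (illustrative helper)."""
    # Drop every character that is not a letter, digit, underscore or whitespace
    no_punct = re.sub(r'[^A-Za-z0-9_\s]', '', text)
    return no_punct.lower()

cleaned = clean_message('Yep, by the pretty sculpture')
print(cleaned)  # yep by the pretty sculpture
```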

In [11]:
# Remove all punctuation from the SMS column and convert the text to lower case.
# The character class below keeps letters, digits, underscores and whitespace.

import re  # to work with regex

# regex=True is passed explicitly because recent pandas versions treat the
# pattern as a literal string by default
training_set['SMS'] = training_set['SMS'].str.replace(r'[^A-Za-z0-9_\s]', '', regex=True).str.lower()
training_set


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask all smth theres a card on da p...
...,...,...
4453,ham,sorry ill call later in meeting any thing rela...
4454,ham,babe i fucking love you too you know fuck it ...
4455,spam,uve been selected to stay in 1 of 250 top brit...
4456,ham,hello my boytoy geeee i miss you already and ...


## 4) Creating The Vocabulary

Provided by: [Dataquest.io](https://www.dataquest.io/)

We want a transformed table in which each column represents one unique word in our vocabulary (more specifically, each column shows the frequency of that word in any given message).

We call the set of unique words a **vocabulary**.

Let's first build the list of vocabulary that we need.
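
The idea can be sketched on a few hypothetical cleaned messages (`toy_messages` is invented for illustration). This sketch uses a set comprehension and sorts the result so the order is reproducible:

```python
# Hypothetical cleaned messages, standing in for the SMS column
toy_messages = [
    'yep by the pretty sculpture',
    'welp apparently he retired',
    'he retired',
]

# str.split() with no argument splits on runs of whitespace and never yields ''
toy_vocabulary = sorted({word for sms in toy_messages for word in sms.split()})
print(toy_vocabulary)
```

Sorting is optional; it only makes the vocabulary order stable between runs, since set ordering is arbitrary.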

In [12]:
# Create a vocabulary for the messages in the training set. The vocabulary should be a Python list containing all the unique words across all messages, where each word is represented as a string.

# split string from sms messages
training_str_list = training_set['SMS'].str.split(r'\s', expand=False)
vocabulary = []  # initiate empty list

# review transformation
print(training_str_list.head(),'\n')

# collect every word from every message
for row in training_str_list:
    vocabulary.extend(row)

# transform list to set to remove duplicates
vocabulary = set(vocabulary)

# splitting on single whitespace characters yields '' between consecutive spaces;
# discard it explicitly, because set ordering is arbitrary and the blank is not
# guaranteed to be the first element
vocabulary.discard('')

# then transform back to list
vocabulary = list(vocabulary)

# review transformation
print(vocabulary[:5])
print(vocabulary[-5:])

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, , all, smth, theres, a, ca...
Name: SMS, dtype: object 

['dajst', 'ms', '2bold', 'mom', 'sorts']
['favor', 'itxt', 'super', 'ive', 'leading']


## 5) The Final Training Set

Provided by: [Dataquest.io](https://www.dataquest.io/)

Now we're going to use the vocabulary to build a dictionary of per-message word counts, which we can later turn into a DataFrame.
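
As a compact illustration of the target table, the sketch below builds one on hypothetical toy messages using `collections.Counter`. This is an alternative formulation for illustration only, not the approach used in the cells that follow:

```python
from collections import Counter
import pandas as pd

# Hypothetical cleaned messages, invented for illustration
toy_messages = ['free entry free prize', 'see you there']
toy_vocabulary = sorted({w for sms in toy_messages for w in sms.split()})

# One Counter per message; a Counter returns 0 for words it has not seen
counters = [Counter(sms.split()) for sms in toy_messages]
toy_counts = pd.DataFrame(
    [[c[w] for w in toy_vocabulary] for c in counters],
    columns=toy_vocabulary,
)
print(toy_counts)
```

Each row is one message and each column one vocabulary word, which is the layout the final training set uses.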

In [13]:
# initiate dictionary template, which can be turned into a DataFrame later
# each key is a unique word; each value is a list with one count per message
# (first element = count of that word in the 1st message, second = 2nd message, etc.)
# len(training_set['SMS']): number of rows (messages) in the training set
word_counts_per_sms = {}

for key in vocabulary:
    word_counts_per_sms[key] = [0] * len(training_set['SMS'])

# review transformation (large dictionary, so print the first entry only)
first_key = next(iter(word_counts_per_sms))
print(first_key, word_counts_per_sms[first_key])
print('\n')


dajst [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]  (output truncated; one 0 per training message)

In [14]:
# split each message into words again; each row represents 1 message, iterated below to count words
training_str_list_new = training_set['SMS'].str.split(r'\s', expand=False)

# increment the per-message word counts in the dictionary
for index, sms in enumerate(training_str_list_new):
    for word in sms:
        # skip the empty strings produced by consecutive whitespace characters
        if word:
            word_counts_per_sms[word][index] += 1

# review transformation (large dictionary, so check the first entry only)
first_key = next(iter(word_counts_per_sms))
print(sum(word_counts_per_sms[first_key]))

1


In [15]:
# Transform word_counts_per_sms into a DataFrame
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms

Unnamed: 0,dajst,ms,2bold,mom,sorts,mi,wrongly,bedrm900,poo,sortedbut,...,blu,wild,wwwsmsacuhmmross,311004,haunt,favor,itxt,super,ive,leading
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# concatenate the 2 df so it includes label and sms as well
training_set_new = pd.concat([training_set, word_counts_per_sms], axis=1)


# review transformation
training_set_new

Unnamed: 0,Label,SMS,dajst,ms,2bold,mom,sorts,mi,wrongly,bedrm900,...,blu,wild,wwwsmsacuhmmross,311004,haunt,favor,itxt,super,ive,leading
0,ham,yep by the pretty sculpture,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,yes princess are you going to make me moan,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,welp apparently he retired,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,havent,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,i forgot 2 ask all smth theres a card on da p...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,ham,sorry ill call later in meeting any thing rela...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,babe i fucking love you too you know fuck it ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,uve been selected to stay in 1 of 250 top brit...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,hello my boytoy geeee i miss you already and ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 6) Calculating Constants First

Provided by: [Dataquest.io](https://www.dataquest.io/)

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter using the Naive Bayes algorithm described above in Part 3.

As a start, let's first calculate the constants:

- P(Spam) and P(Ham)
- N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>

Where:

\begin{aligned}
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Ham} = \text{total number of words in non-spam messages} \\
&N_{Vocabulary} = \text{number of unique words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{aligned}

In [17]:
# using the new training set: Calculate P(Spam) and P(Ham).

num_spam = (training_set_new['Label'] == 'spam').sum()
num_total = len(training_set_new)
p_spam = num_spam / num_total

num_ham = (training_set_new['Label'] == 'ham').sum()
p_ham = num_ham / num_total

# review transformation
print(p_spam, p_ham, p_spam+p_ham)

0.13458950201884254 0.8654104979811574 1.0


In [18]:
# Calculate N_Spam, N_Ham, N_Vocabulary, alpha


# create bool masks
mask_spam = training_set_new['Label'] == 'spam'
mask_ham = training_set_new['Label'] == 'ham'

# apply masks and get number of words in spam/non-spam messages
n_spam = training_set_new[mask_spam].iloc[:,2:].sum().sum()
n_ham = training_set_new[mask_ham].iloc[:,2:].sum().sum()

# vocab: set of unique words
n_vocab = len(training_set_new.columns[2:])

# initiate smoothing parameter 'alpha'
alpha = 1

# review transformation
print(n_spam)
print(n_ham)
print(n_vocab)
print(alpha)

14081
54405
8443
1


## 7) Calculating Parameters

Provided by: [Dataquest.io](https://www.dataquest.io/)

P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) vary from word to word. For any given word, however, both probabilities are fixed once the training set is known, so they stay the same for every new message.

We can use the training set to get the values we need to evaluate the equations below:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Where: 
\begin{aligned}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Ham} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\end{aligned}


For each word, we need to calculate both P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham).

The fact that we calculate so many values before even beginning the classification of new messages makes the Naive Bayes algorithm very fast (especially compared to other algorithms). When a new message comes in, most of the needed computations are already done, which enables the algorithm to almost instantly classify the new message.

In [19]:
# Initialize two dictionaries that map each unique word (from our vocabulary) to 0. They will store P(wi|Spam) and P(wi|Ham) once those are calculated.

spam_counter = {}
ham_counter = {}
for key in vocabulary:
    spam_counter[key] = 0
    ham_counter[key] = 0

# review transformation
for i, key in zip(range(0,5,1), vocabulary):
    print(key, spam_counter[key])
    print(key, ham_counter[key])

dajst 0
dajst 0
ms 0
ms 0
2bold 0
2bold 0
mom 0
mom 0
sorts 0
sorts 0


In [20]:
# Isolate the spam and the ham messages in the training set into two different DataFrames.
train_spam = training_set_new[mask_spam]
train_ham = training_set_new[mask_ham]

# review transformation
train_spam.tail(1)

Unnamed: 0,Label,SMS,dajst,ms,2bold,mom,sorts,mi,wrongly,bedrm900,...,blu,wild,wwwsmsacuhmmross,311004,haunt,favor,itxt,super,ive,leading
4455,spam,uve been selected to stay in 1 of 250 top brit...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Iterate over the vocabulary, and, for each word, calculate P(wi|Spam) and P(wi|Ham) using the formulas we mentioned above.

for key in vocabulary:
    n_word_in_spam = train_spam[key].sum()
    spam_counter[key] = (n_word_in_spam + alpha) / (n_spam + alpha * n_vocab)
    
    n_word_in_ham = train_ham[key].sum()
    ham_counter[key] = (n_word_in_ham + alpha) / (n_ham + alpha * n_vocab)

# review transformation
for i, key in zip(range(0,5,1), vocabulary):
    print(key, spam_counter[key])
    print(key, ham_counter[key])

dajst 4.439708755105665e-05
dajst 3.182281059063136e-05
ms 4.439708755105665e-05
ms 3.182281059063136e-05
2bold 4.439708755105665e-05
2bold 3.182281059063136e-05
mom 4.439708755105665e-05
mom 0.00015911405295315683
sorts 4.439708755105665e-05
sorts 3.182281059063136e-05
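
The first pair of values can be reproduced by hand from the constants printed in Part 6. The sketch below assumes that `dajst` occurs zero times in spam and once in ham. The single occurrence matches the count of 1 printed in Part 5, but the spam/ham breakdown is inferred from the probabilities above rather than shown directly in the notebook. The helper name `smoothed_probability` is hypothetical:

```python
alpha = 1
n_spam, n_ham, n_vocab = 14081, 54405, 8443  # constants printed in Part 6

def smoothed_probability(word_count, n_class_words):
    # Additive smoothing: even a word unseen in a class gets a non-zero probability
    return (word_count + alpha) / (n_class_words + alpha * n_vocab)

p_dajst_spam = smoothed_probability(0, n_spam)  # assumed count: 0 in spam
p_dajst_ham = smoothed_probability(1, n_ham)    # assumed count: 1 in ham
print(p_dajst_spam, p_dajst_ham)
```

That is, P(dajst|Spam) = 1 / 22,524 and P(dajst|Ham) = 2 / 62,848.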


## 8) Classifying A New Message

Provided by: [Dataquest.io](https://www.dataquest.io/)

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter.

The relations below let us compare the two probabilities and reach a classification decision.

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}


The spam filter can be understood as a function that:

- Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
- Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
- Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:
    - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.
    - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) < P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as spam.
    - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help.
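
Multiplying many small probabilities can underflow to zero for long messages. A common remedy, which this notebook does not use, is to compare sums of logarithms instead of raw products; the ordering of the two scores is unchanged because the logarithm is monotonic. A minimal sketch, in which the function name `classify_log` and the toy probabilities are invented for illustration:

```python
import math

def classify_log(words, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log-scores instead of raw products (hypothetical helper)."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words missing from a dictionary are ignored for that class
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Toy parameters, invented purely for illustration
toy_spam_probs = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_ham_probs = {'win': 0.001, 'money': 0.002, 'see': 0.03}

label = classify_log(['win', 'money', 'now'], 0.13, 0.87, toy_spam_probs, toy_ham_probs)
print(label)
```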

Now we'll write the code for calculating p_spam_given_message and p_ham_given_message, and then we'll use them in the function 'classify' to classify two new messages.

In [22]:
# write the code needed for calculating p_spam_given_message and p_ham_given_message, and also the function to classify a new message and label it spam or ham.

def classify(message):

    # raw string so that \W reaches the regex engine unchanged
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # start from the prior probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # multiply each running product by the matching word probability
    # ignore a word from the new message if it doesn't exist in the dictionaries
    for word in message:
        if word in spam_counter:
            p_spam_given_message *= spam_counter[word]
        if word in ham_counter:
            p_ham_given_message *= ham_counter[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    # define decision rule and output message
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [23]:
# Use the classify() function to classify two sample new messages (ad hoc suggested messages).
# classify() prints its result and returns None, so call it directly instead of wrapping it in print().

classify('WINNER!! This is the secret code to unlock the money: C3421.')
print()
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.478015493585903e-25
P(Ham|message): 2.3039424453264186e-27
Label: Spam

P(Spam|message): 1.6913150682744822e-25
P(Ham|message): 3.1866047825851744e-21
Label: Ham


## 9) Measuring the Spam Filter's Accuracy

Provided by: [Dataquest.io](https://www.dataquest.io/)

We managed to create a spam filter from our training set, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human - already labelled in the test set).

First, we'll write a variant of the classify() function, classify_test_set(), that returns the labels instead of printing them. Note that the code below uses return statements instead of print() calls.

Now that we have a function that returns labels, we can use it to create a new column in our test set.

Then, we can compare the predicted values with the actual values to measure how good our spam filter is with classifying new messages. To make the measurement, we'll use accuracy as a metric:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
\end{equation}
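
On a DataFrame, this ratio is the mean of an element-wise comparison between the actual and predicted labels. A minimal sketch on hypothetical data (`toy` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical labels and predictions, invented for illustration
toy = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# The comparison yields booleans; their mean is the share of matching rows
toy_accuracy = (toy['Label'] == toy['predicted']).mean()
print(toy_accuracy)
```

Here three of the five predictions match, so the accuracy is 3/5.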

In [24]:
# define classify_test_set(), a variant of classify() that returns the label
def classify_test_set(message):

    # raw string so that \W reaches the regex engine unchanged
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_counter:
            p_spam_given_message *= spam_counter[word]
        if word in ham_counter:
            p_ham_given_message *= ham_counter[word]

    # define decision rule and return value
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
    
# use redefined function to create new columns for predicted nature of message
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)

# review transformation
print(test_set['predicted'].value_counts(dropna=False))
test_set.head()

ham                           970
spam                          143
needs human classification      1
Name: predicted, dtype: int64



Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [25]:
# Measure the accuracy of the spam filter


total = len(test_set)

# Compare actual and predicted labels element-wise and count the matches
correct = (test_set['Label'] == test_set['predicted']).sum()

accuracy = correct / total
print(accuracy)

0.9793536804308797


#### Findings (Measuring the Spam Filter's Accuracy):

From the findings above, the accuracy is approximately 98% (1,091 of the 1,114 test messages classified correctly), well above our 80% goal. We obtained it by applying the multinomial Naive Bayes algorithm with additive smoothing to classify messages into two classes (spam or non-spam).

That seems pretty effective!

## 10) Conclusion:

Key skills applied:
- Apply the Naive Bayes algorithm with additive smoothing to classify messages into two classes (spam or non-spam).
    - Assign probabilities to events based on certain conditions by using conditional probability rules.
    - Assign probabilities to events based on whether or not they are statistically independent of other events.
    - Assign probabilities to events based on prior knowledge by using Bayes' theorem.
    - Create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

Potential next steps:
- Isolate messages that were classified incorrectly (False negatives / False positives) and try to figure out why the algorithm reached the wrong conclusions.

- Make the filtering process more complex by making the algorithm sensitive to letter case.
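
The first next step (isolating misclassified messages) could start from a pair of boolean filters. A minimal sketch on hypothetical data, treating spam as the positive class; the frame `toy` and its messages are invented for illustration:

```python
import pandas as pd

# Hypothetical test results, invented for illustration
toy = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam'],
    'SMS':       ['see u there', 'call me now', 'win cash now', 'hi its me'],
    'predicted': ['ham', 'spam', 'spam', 'ham'],
})

# False positive: a ham message flagged as spam
false_positives = toy[(toy['Label'] == 'ham') & (toy['predicted'] == 'spam')]
# False negative: a spam message that slipped through as ham
false_negatives = toy[(toy['Label'] == 'spam') & (toy['predicted'] == 'ham')]

print(false_positives['SMS'].tolist())
print(false_negatives['SMS'].tolist())
```

Reading the messages in each group is usually the quickest way to spot patterns, such as words that never appeared in the training set.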