# Guided Project 12: Building a Spam Filter with Naive Bayes

In [2]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Slide 1
---
We've come a long way in this course — we've learned to:


- Assign probabilities to events based on certain conditions by using conditional probability rules.


- Assign probabilities to events based on whether they are in relationship of statistical independence or not with other events.


- Assign probabilities to events based on prior knowledge by using Bayes' theorem.


- Create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

In our last lesson, we focused extensively on learning how the Naive Bayes algorithm works from a theoretical standpoint (more specifically, we learned about the multinomial Naive Bayes algorithm). In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous lesson that the computer:


1. Learns how humans classify messages.


2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.


3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) Repository.

Let's start by reading in the dataset. You'll be able to find the [solutions to this project at this link](https://github.com/dataquestio/solutions/blob/master/Mission433Solutions.ipynb) or by clicking the key icon at the top right of the interface.

Note that due to the nature of spam messages, the dataset contains content that may be offensive to some users.



#### Instructions
---
1. To help readers gain context into your project, use the first Markdown cell of the notebook to add a title and a short introduction where you concisely explain what the project is about and what your goal is in this project (the title and the introduction are tentative at this point, so don't spend too much time here — you can come back at the end of your work to refine them).


2. Open the `SMSSpamCollection` file using the `read_csv()` function from the pandas package.
    - The data points are tab separated, so we'll need to use the `sep='\t'` parameter for our `read_csv()` function.
    - The dataset doesn't have a header row, which means we need to use the `header=None` parameter, otherwise the first row will be wrongly used as the header row.
    - Use the `names=['Label', 'SMS']` parameter to name the columns as `Label` and `SMS`.


3. Explore the dataset a little.
    - Find how many rows and columns it has.
    - Find what percentage of the messages is spam and what percentage is ham ("ham" means non-spam).

## Introduction

This project aims at creating a spam filter that resorts to a multinomial Naive Bayes algorithm in order to distinguish spam messages from regular ones. The filter will be tested on a data set comprised of 5572 messages that have been previously determined by humans, if they are spam or not.

The spam filter is considered to be successful if it can filter out 80% of the spam from a test set.

The spam data set to be worked with will be named `sms_spam`. 

In [3]:
sms_spam_full = pd.read_csv('SMSSpamCollection.txt',
                   sep='\t',
                   names=['Label', 'SMS'])

Basic info:

In [4]:
sms_spam_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 43.6+ KB


The `Label` column only has two values:

- `spam`.
- `ham` (not spam).

In [5]:
sms_spam_full.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


Below, we can see that only aprox. 13.4% of the messages are spam.

In [6]:
count_label = sms_spam_full.Label.value_counts(normalize=True).round(3)*100

count_label = count_label.rename('ham vs spam (%)')

count_label

ham     86.6
spam    13.4
Name: ham vs spam (%), dtype: float64

## Slide 2
---
On the previous screen, we read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

However, before creating it, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.


Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:


- A **training set**, which we'll use to "train" the computer how to classify messages.


- A **test set**, which we'll use to test how good the spam filter is with classifying new messages.


We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:


- The training set will have 4,458 messages (about 80% of the dataset).


- The test set will have 1,114 messages (about 20% of the dataset).

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

**For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%** — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

We'll come back to testing toward the end of this guided project, but for now, let's create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset. 



#### Instructions
---
1. Start by randomizing the entire dataset by using the `DataFrame.sample()` method.
    - Use the `frac=1` parameter to randomize the entire dataset.
    - Use the `random_state=1` parameter to make sure your results are reproducible.


2. Split the randomized dataset into a training and a test set.
    - The training set should account for 80% of the dataset, and the remaining 20% of the data should be the test set.
    - Reset the index labels for both data sets — the index labels remained unordered after randomization. You can use the `DataFrame.reset_index()` method.


3. Find the percentage of spam and ham in both the training and the test set. Are the percentages similar to what we have in the full dataset?

The next step is to create a training set and a test set out of `sms_spam_full`:

1. Randomizing the `sms_spam_full`.


2. Split randomized DF into a training set (20%) and a testing set (remaining 80%).


3. Compare the 'Label' value distribution of the entire randomized DF with the previously made subsets.

In [7]:
# 1.
random_sms_spam = sms_spam_full.sample(n=None, frac=1, random_state=1).reset_index(drop=True)

From earlier on we know that `random_sms_spam` has 5572 entries (0 to 5571).

In [8]:
eighty_perc = random_sms_spam.shape[0] * 0.8

print(f'Total number of observations/rows in random_sms_spam: {random_sms_spam.shape[0]}', 
      f'\n20% of total observations/rows in random_sms_spam: {round(eighty_perc, 0)}')

Total number of observations/rows in random_sms_spam: 5572 
20% of total observations/rows in random_sms_spam: 4458.0


Based on the information above we set the training set as the rows 0 to 4458, 80% of the total rows in `random_sms_spam`. The remaining rows, will form the testing set. 

In [9]:
# 2.
training_set = random_sms_spam.copy().iloc[:4458+1, :]

testing_set = random_sms_spam.copy().iloc[4458:, :]

Finally, comparing the distribution of values in the `Label` column across DataFrames.

In [10]:
# 3.

# Training set.
count_label_training = training_set.Label.value_counts(normalize=True).round(3)*100

count_label_training = count_label_training.rename('ham vs spam (%)')

# Testing set.
count_label_testing = testing_set.Label.value_counts(normalize=True).round(3)*100

count_label_testing = count_label_testing.rename('ham vs spam (%)')


# Combining all label percentage counts for comparison.
compare_label = pd.DataFrame({'random_sms_spam': count_label,
                              'count_label_training': count_label_training,
                              'count_label_testing': count_label_testing})

compare_label

Unnamed: 0,random_sms_spam,count_label_training,count_label_testing
ham,86.6,86.5,86.8
spam,13.4,13.5,13.2


As we can see in the table above, the value distribution is very similar across the panel, meaning that we can infer the conclusions produced from the training set to the testing set, since both sets resemble the original series.

In [11]:
# Saving RAM 1
del sms_spam_full
del random_sms_spam

## Slide 3
---
On the previous screen, we split our dataset into a training set and a test set. The next big step is to use the training set to teach the algorithm to classify new messages.

Recall from the previous lesson that when a new message comes in, our Naive Bayes algorithm will make the classification based on the results it gets to these two equations (note that we replaced "$Spam^C$" with "$Ham$" inside the second equation below):

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, recall that we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Let's also summarize what the terms in the equations above mean:

\begin{align}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Spam^C} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Spam^C} = \text{total number of words in non-spam messages} \\
\\
&N_{Vocabulary} = \text{total number of words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{align}

To calculate all these probabilities, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need. Right now, our training and test sets have this format (the messages are fictitious to make the example easier to understand):

![img_1](1.jpg)


To make the calculations easier, we want bring the data to this format (the table below is a transformation of the table you see above):


![img_2](2.jpg)


About the transformation above, notice that:

- The `SMS` column doesn't exist anymore.


- Instead, the `SMS` column is replaced by a series of new columns, where each column represents a unique word from the vocabulary.


- Each row describes a single message. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values `spam, 2, 2, 1, 1, 0, 0, 0, 0, 0`. These values tell us that:
    - The message is spam.
    - The word "secret" occurs two times inside the message.
    - The word "prize" occurs two times inside the message.
    - The word "claim" occurs one time inside the message.
    - The word "now" occurs one time inside the message.
    - The words "coming", "to", "my", "party", and "winner" occur zero times inside the message.
    
    
- All words in the vocabulary are in lower case, so "SECRET" and "secret" come to be considered to be the same word.


- Punctuation is not taken into account anymore (for instance, we can't look at the table and conclude that the first message initially had three exclamation marks).

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

#### Instructions
---
1. Remove all the punctuation from the `SMS` column. You can use the regex `'\W'` to detect any character that is not from a-z, A-Z or 0-9.
    - For instance, the function `re.sub('\W', ' ', 'Secret!! Money, goods.' )` strips the punctuation marks and outputs the string 'Secret Money goods '.
    - For simplicity, you can use the `Series.str.replace()` method.
    
    
2. For each message, transform every letter in every word to lower case. You may want to use the `Series.str.lower()` method.

For the next stage we focus on working with the training set. In order to apply the Naive Bayes Theorem, we can build a table that, for each message, classifies it as spam or not spam and counts the number of times a word, from the vocabulary, appears in the message. Whenever a message does not contain a word from the vocabulary, each word in the vocabulary having its own column, it registers `0`. Remember that the vocabulary is the group of all unique words gathered from all the messages within the training set.

Prior to building this DataFrame, two cleaning steps applied to the `SMS` column will be undertaken:

- eliminate punctuation.
- lower case every word.

In [12]:
all_messages_str = ''

for index, value in enumerate(training_set.SMS):
    all_messages_str = ' '.join([all_messages_str, value])
    

non_words_set = set(re.findall('\W', all_messages_str))


non_words_list = list(non_words_set)

non_words_list_w_spaces = [' ' + i + ' ' for i in non_words_list]

In [13]:
non_words_list

['»',
 '"',
 '[',
 '(',
 '\x91',
 '<',
 '\x93',
 '=',
 '¡',
 '.',
 ',',
 '“',
 '^',
 '┾',
 '&',
 '\x92',
 ' ',
 '?',
 '|',
 ')',
 '~',
 '-',
 '\n',
 '\x96',
 '%',
 '@',
 ':',
 '…',
 "'",
 '*',
 '\\',
 '$',
 '’',
 '>',
 ']',
 '!',
 ';',
 '–',
 '+',
 '\x94',
 '\t',
 '/',
 '—',
 '#',
 '£',
 '‘']

In [14]:
# Training set `SMS` cleaned series

ts_cleaned_SMS = training_set.SMS.copy().str.replace('[^A-Za-z0-9\s£€\$]', ' ', regex=True)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('€', ' € ', regex=False)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('£', ' £ ', regex=True)
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('\$', ' $ ', regex=True)
 
#`\s+` ensures that if there are two or more joined whitespaces they are converted to just one.
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('\s+', ' ', regex=True) 

ts_cleaned_SMS.head(30)

0                           Yep by the pretty sculpture
1           Yes princess Are you going to make me moan 
2                            Welp apparently he retired
3                                               Havent 
4     I forgot 2 ask all smth There s a card on da p...
5     Ok i thk i got it Then u wan me 2 come now or ...
6     I want kfc its Tuesday Only buy 2 meals ONLY 2...
7                              No dear i was sleeping P
8                                Ok pa Nothing problem 
9                             Ill be there on lt gt ok 
10    My uncles in Atlanta Wish you guys a great sem...
11                                             My phone
12                         Ok which your another number
13    The greatest test of courage on earth is to be...
14    Dai what this da Can i send my resume to this id 
15                         I am late I will be there at
16    FreeMsg Why haven t you replied to my text I m...
17                     K text me when you re on 

The last version of `ts_cleaned_SMS` still has rows which have whitespaces at the beginning or end of the message which can be removed.

In [15]:
ts_cleaned_SMS = ts_cleaned_SMS.str.replace('(\A +| +\Z)', '', regex=True)

Below we confirm that there is no whitespaces at the beginning or end of the row/string. 

In [16]:
cond1 = ts_cleaned_SMS.str.contains(pat='(?:\A +| +\Z)')

ts_cleaned_SMS[cond1]

Series([], Name: SMS, dtype: bool)
Series([], Name: SMS, dtype: bool)


In [17]:
# Lower case for every string.
ts_cleaned_SMS = ts_cleaned_SMS.str.lower()

Checking if the changes in `training_set.SMS` were successful:

In [18]:
ts_cleaned_SMS.head()

0                          yep by the pretty sculpture
1           yes princess are you going to make me moan
2                           welp apparently he retired
3                                               havent
4    i forgot 2 ask all smth there s a card on da p...
Name: SMS, dtype: object

## Slide 4
---
On the previous screen, we removed the punctuation and changed all letters to lowercase. Recall that our end goal with this data cleaning process is to bring our training set to this format:

![img_3](3.jpg)

With the exception of the "Label" column, every other column in the transformed table above represents a unique word in our vocabulary (more specifically, each column shows the frequency of that unique word for any given message). Recall from the previous lesson that we call the set of unique words a **vocabulary**.

We'll eventually bring the training set to that format ourselves, but first, let's create a list with all of the unique words that occur in the messages of our training set.



#### Instructions
---
1. Create a vocabulary for the messages in the training set. The vocabulary should be a Python list containing all the unique words across all messages, where each word is represented as a string.

- Begin by transforming each message from the `SMS` column into a list by splitting the string at the space character — use the `Series.str.split()` method.


- Initiate an empty list named `vocabulary`.


- Iterate over the the `SMS` column (each message in this column should be a list of strings by the time you start this loop).
    - Using a nested loop, iterate each message in the `SMS` column (each message should be a list of strings) and append each string (word) to the vocabulary list.
    
    
- Transform the `vocabulary` list into a set using the `set()` function. This will remove the duplicates from the `vocabulary` list.


- Transform the `vocabulary` set back into a list using the `list()` function.

The next stage is producing the vocabulary (set of unique words), with the following steps:

1. Split messages into columns - one unique string per column (removing most of the whitespaces as well).
2. Concatenating all the columns of strings into a single Series.
3. Dropping Nan-values.
3. Convert Series into a list.
4. Converting list into a set, thus excluding duplicated values/words.
5. Convert set into back into a list.

In [19]:
# 1.
ts_cleaned_SMS_split = ts_cleaned_SMS.str.split(' ', expand=True) 

# 2.
ts_cleaned_SMS_split_cat = pd.concat([ts_cleaned_SMS_split.iloc[i, :] for i in range(0, ts_cleaned_SMS_split.shape[0])],
                                     ignore_index=True)
# 3.
ts_cleaned_SMS_split_cat = ts_cleaned_SMS_split_cat.dropna()

# 4.
ts_cleaned_SMS_split_cat_to_list = ts_cleaned_SMS_split_cat.to_list()

#5.
vocabulary_set = set(ts_cleaned_SMS_split_cat_to_list)

#6.
vocabulary = list(vocabulary_set)

When checking `vocabulary` below, we see that whitespace still appears as a value, thus it can be removed.

In [20]:
vocabulary[:5]

['', 'cookies', 'uncles', 'shaping', 'shoranur']

In [21]:
for index, el in enumerate(vocabulary):
    if el == '':
        del vocabulary[index]
        
vocabulary[:5]

['cookies', 'uncles', 'shaping', 'shoranur', 'frank']

In [22]:
# Free RAM 2

del ts_cleaned_SMS_split_cat 
del ts_cleaned_SMS_split_cat_to_list
del vocabulary_set

## Slide 5
---
On the previous screen, we managed to create the vocabulary for our messages in the training set. Now we're going to use the vocabulary to make the data transformation we need:

![img_4](4.jpg)

Eventually, we're going to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need.

For instance, to create the table we see above, we could use this dictionary and then convert it to a DataFrame:

    word_counts_per_sms = {'secret': [2,1,1],
                           'prize': [2,0,1],
                           'claim': [1,0,1],
                           'now': [1,0,1],
                           'coming': [0,1,0],
                           'to': [0,1,0],
                           'my': [0,1,0],
                           'party': [0,1,0],
                           'winner': [0,0,1]
                          }


    word_counts = pd.DataFrame(word_counts_per_sms)
    word_counts.head()

Output:

|   |secret|prize|claim|now|coming|to |my |party|winner|
|---|------|-----|-----|---|------|---|---|-----|------|
|0  |2     |2    |1    |1  |0     |0  |0  |0    |0     |
|1  |1     |0    |0    |0  |1     |1  |1  |1    |0     |
|2  |1     |1    |1    |1  |0     |0  |0  |0    |1     |

(As you may have noticed from the output above, the `Label` column is missing, but we'll get to that in the next exercise.)

To create the dictionary we need for our training set, we can use the code below, where:

- We start by initializing a dictionary named `word_counts_per_sms`, where each key is a unique word (a string) from the vocabulary, and each value is a list of the length of training set, where each element in the list is a `0`.
    - The code `[0] * 5` outputs `[0, 0, 0, 0, 0]`. So the code `[0] * len(training_set['SMS']`) outputs a list of the length of `training_set['SMS']`, where each element in the list will be a `0`.


- We loop over `training_set['SMS']` using at the same time the `enumerate()` function to get both the `index` and the SMS message (index and `sms`).
    - Using a nested loop, we loop over `sms `(where `sms` is a list of strings, where each string represents a word in a message).
        - We increment `word_counts_per_sms[word][index]` by `1`.

---

    word_counts_per_sms = {unique_word: [0] * 
        len(training_set['SMS']) for unique_word in vocabulary}

    for index, sms in enumerate(training_set['SMS']):
        for word in sms:
            word_counts_per_sms[word][index] += 1

Now that we have the dictionary we need, let's do the final transformations to our training set and then move forward with creating the spam filter.

#### Instructions
---
1. Run the code you see above to get the `word_counts_per_sms` dictionary. In case you want to do a bit of exploration, note that this is a large dictionary, and printing it all is not recommended (you should rather use a for loop and print only the first five or so key-value pairs).


2. Transform `word_counts_per_sms` into a DataFrame using `pd.DataFrame()`.


3. Concatenate the DataFrame we just built above with the DataFrame containing the training set (this way, we'll also have the `Label` and the` SMS` columns). Use the `pd.concat()` function.

Prior to what is in the instructions we create a splitted version of `ts_cleaned_SMS` that does not expand each string into a single column but compiles the message into a list with strings.

Similarly to `vocabulary`, I delete empty strings, `''`; this time resorting to a custom function - `remove_elements` that is applied to `ts_cleaned_SMS_split_listed`.

In [23]:
def remove_elements(list_x, list_strings):
    """Strings in list_strings are removed from list_x if this later 
    list contains any of those strings.
    """
    
    for index, el in enumerate(list_x):
        if el in list_strings:
            del list_x[index]
    
    return list_x

In [24]:
ts_cleaned_SMS_split_listed = ts_cleaned_SMS.str.split(' ') # `expand=False` by default

strings_to_remove = ['']

ts_cleaned_SMS_split_listed_1 = ts_cleaned_SMS_split_listed.copy().apply(lambda x: remove_elements(x, strings_to_remove))

As suggested by the tutorial, first a dictionary is created and then I go through every message and count each word repetition (each word is a key), filling out the dictionary.

In [25]:
# From the tutorial.

# 1.
word_counts_per_sms = {unique_word: [0] * len(ts_cleaned_SMS_split_listed_1) for unique_word in vocabulary}

for index, sms in enumerate(ts_cleaned_SMS_split_listed_1):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [26]:
# 2. Convert dictionary into DataFrame

word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)

Assessing if the conversion was successful. 

In [27]:
word_counts_per_sms_df.iloc[:2, 10:]

Unnamed: 0,postal,pleassssssseeeeee,titles,couple,deficient,vegetables,settings,finding,aiya,walkin,...,missin,max6,quiet,bishan,comedy,sterling,braved,token,into,weight
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Finally, concatenating `training_set` with `word_counts_per_sms_df` into a new DataFrame - `training_set_2`.

In [28]:
# 3. 

# `sort=False` is required to preserve the order of the columns in a 'first in' fashion.
training_set_2 = pd.concat([training_set, word_counts_per_sms_df], axis=1, sort=False)

In [29]:
training_set_2.iloc[:2, :10]

Unnamed: 0,Label,SMS,cookies,uncles,shaping,shoranur,frank,icmb3cktz8r7,150pm,winning
0,ham,"Yep, by the pretty sculpture",0,0,0,0,0,0,0,0
1,ham,"Yes, princess. Are you going to make me moan?",0,0,0,0,0,0,0,0


Assessing whether the rows in `training_set` were well aligned with the correspondent rows in `word_counts_per_sms_df`, two examples:

Example 1: row 0

In [30]:
cond_row_0 = training_set_2.iloc[0, 2:] == 1 

print('training_set_2["SMS"], row 0: {}'.format(training_set_2.iloc[0, 1]))
print('\n')
print('training_set_2, row 0 - columns that are "1 or more":\n\n{}'.format(training_set_2.iloc[0, 2:][cond_row_0]))

training_set_2["SMS"], row 0: Yep, by the pretty sculpture


training_set_2, row 0 - columns that are "1 or more":

by           1
the          1
pretty       1
sculpture    1
yep          1
Name: 0, dtype: object


Example 2: row 1

In [31]:
cond_row_1 = training_set_2.iloc[1, 2:] == 1 

print(training_set_2.iloc[1, 1])
print('\n')
print(training_set_2.iloc[1, 2:][cond_row_1])

Yes, princess. Are you going to make me moan?


princess    1
going       1
make        1
you         1
to          1
yes         1
are         1
moan        1
me          1
Name: 1, dtype: object


In [32]:
# Free RAM 3

del ts_cleaned_SMS_split_listed
del word_counts_per_sms

## Slide 6
---
Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, recall that we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}


Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:

- $P(Spam)$ and $P(Ham)$.


- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$.


Recall from the previous lesson that:

- $N_{Spam}$ is equal to the number of words in all the spam messages — it's not equal to the number of spam messages, and it's _not_ equal to the total number of _unique_ words in spam messages.


- $N_{Ham}$ is equal to the number of words in all the non-spam messages — it's _not_ equal to the number of non-spam messages, and it's not equal to the total number of _unique_ words in non-spam messages.

- We'll also use Laplace smoothing and set $α=1$.


#### Instructions
---
1. Calculate P(Spam) and P(Ham). There's more than one way to write the code that can calculate this — feel free to choose any solution you want.


2. Calculate $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$. Feel free to choose any programming solution you like.


3. Initiate a variable named alpha with a value of 1.

Calculating the elements within the Naive Bayes algorithm. Starting with:

- $P(Spam)$ and $P(Ham)$.
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

The Laplace smoothing parameter is set to 1: $α=1$.

Probability of Spam and Ham. 

In [33]:
label_counts_ts2 = training_set_2.Label.value_counts(normalize=True)

p_spam = label_counts_ts2.spam

p_ham = label_counts_ts2.ham

In [34]:
p_spam

0.13455931823278763

In [35]:
p_ham

0.8654406817672123

Counting $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$.

In [36]:
# n_vocabulary.

n_vocabulary = len(vocabulary)

n_vocabulary

7780

To sum up the total number of words for spam messages and for non-spam messages - $N_{Spam}$, $N_{Ham}$ respectively, the procedure will be the following:

1. add a column summing up the number of words per message/row.

2. calculate sum of words if row if messages are spam; same for non-spam.

In [37]:
# 1.
training_set_2['sum_words_sms'] = training_set_2.iloc[:, 2:].copy().sum(axis=1)

In [38]:
# 2.

n_spam = training_set_2.loc[training_set_2.Label=='spam', 'sum_words_sms'].sum()

n_ham = training_set_2.loc[training_set_2.Label=='ham', 'sum_words_sms'].sum()

In [39]:
print(f'n_spam: {n_spam}',
     f'\nn_ham: {n_ham}')

n_spam: 15456 
n_ham: 57129


Finally setting $α=1$.

In [40]:
alpha = 1 

## Slide 7
---
On the previous screen, we managed to calculate a few terms for our equations:

- $P(Spam)$ and $P(Ham)$.


- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$.


As we've already mentioned, all these terms will have constant values in our equations for every new message (regardless of the message or each individual word in the message).

However, $P(w_i|Spam)$ and $P(w_i|Ham)$ will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

Although both $P(w_i|Spam)$ and $P(w_i|Ham)$ vary depending on the word, the probability for each individual word is constant for every new message.



For instance, let's say we receive two new messages:


- "secret code".


- "secret party 2night".


We'll need to calculate P("secret"|Spam) for both these messages, and we can use the training set to get the values we need to find a result for the equation below:


\begin{equation}
P(\text{"secret"}|Spam) = \frac{N_{"secret"|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

The steps we take to calculate P("secret"|Spam) will be identical for both of our new messages above, or for any other new message that contains the word "secret". The key detail here is that calculating P("secret"|Spam) only depends on the training set, and as long as we don't make changes to the training set, P("secret"|Spam) stays constant. The same reasoning also applies to P("secret"|Ham).

This means that we can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:


- P("lost"|Spam) and P("lost"|Ham)


- P("navigate"|Spam) and P("navigate"|Ham)


- P("sea"|Spam) and P("sea"|Ham)


We have 7,783 words in our vocabulary, which means we'll need to calculate a total of 15,566 probabilities. For each word, we need to calculate both $P(w_i|Spam)$ and $P(w_i|Ham)$.

In more technical language, the probability values that $P(w_i|Spam)$ and $P(w_i|Ham)$ will take are called **parameters**.

The fact that we calculate so many values before even beginning the classification of new messages makes the Naive Bayes algorithm very fast (especially compared to other algorithms). When a new message comes in, most of the needed computations are already done, which enables the algorithm to almost instantly classify the new message.

If we didn't calculate all these values beforehand, then all these calculations would need to be done every time a new message comes in. Imagine the algorithm will be used to classify 1,000,000 new messages. Why repeat all these calculations 1,000,000 times when we could just do them once at the beginning?

Let's now calculate all the parameters using the equations below:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} = \frac{\text{(number of times $w_i$ appears in spam messages)} + 1}{\text{(total number of spam words)} + (1 * \text{total number of unique words})} \\
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} = \frac{\text{(number of times $w_i$ appears in ham messages)} + 1}{\text{(total number of ham words)} + (1 * \text{total number of unique words})} \\
\end{equation}

Recall that $P(w_i|Spam)$ and $P(w_i|Ham)$ are key parts of the equations that we need to classify the new messages:


\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}



#### Instructions
---


1. Initialize two dictionaries, where each key-value pair is a unique word (from our vocabulary) represented as a string, and the value is `0`. We'll need one dictionary to store the parameters for $P(w_i|Spam)$, and the other for $P(w_i|Ham)$.
    - If the entire vocabulary were `['sea', 'navigate']`, we'd need to initialize two dictionaries, one for spam and one for ham, and both should look like this: `{'sea': 0, 'navigate': 0}`.


2. Isolate the spam and the ham messages in the training set into two different DataFrames. The `Label` column will help you isolate the messages.

3. Iterate over the vocabulary, and, for each word, calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ using the formulas we mentioned above.
    - Recall that $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$, and $α$ are already calculated from the last screen.
    
    - Recall from the previous lesson that $N_{w_i|Spam}$ is equal to the number of times the word $w_i$ occurs in all the spam messages, while $N_{w_i|Ham}$ is equal to the number of times the word $w_i$ occurs in all the ham messages.
    
- Once you're done with calculating an individual parameter, update the probability value in the two dictionaries you created initially.




To arrive to the dictionaries that contain ever $P(w_i|Spam)$ and $P(w_i|Ham)$, I will do the following:

First, we create two Series that gives us $N_{w_i|Spam}$ and $N_{w_i|Ham}$:

1. create two DataFrames that only contain either spam or non-spam messages.


2. out of those two DataFrames create two correspondent Series - `sms_spam_sum` and `sms_ham_sum`; analogously for both Series, the row index is the column index of the former DataFrame, and each value of the Series corresponds to the sum of the values in each column. For a more intuitive understanding, see below the transformations made in `sms_spam`.


3. the last step is to fill out two dictionaries, one for $P(w_i|Spam)$ and other for $P(w_i|Ham)$, based on the parameters parameters already calculated:
    - `n_spam` and `n_ham`.
    - `n_vocabulary`.
    - $N_{w_i|Spam}$ and $N_{w_i|Ham}$.
    - `alpha`.

In [41]:
# 1. the last colum 'sum_words_sms' is not included in the calculations
# of these Series, so I set `.iloc[:, 2:-1]`.
sms_spam = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='spam']

sms_ham = training_set_2.iloc[:, 2:-1].copy()[training_set_2.Label=='ham']

In [42]:
sms_spam.head(3)

Unnamed: 0,cookies,uncles,shaping,shoranur,frank,icmb3cktz8r7,150pm,winning,heater,6669,...,missin,max6,quiet,bishan,comedy,sterling,braved,token,into,weight
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# 2.
sms_spam_sum = sms_spam.sum().transpose()

sms_ham_sum = sms_ham.sum().transpose()

In [44]:
sms_spam_sum.head()

cookies     0
uncles      0
shaping     0
shoranur    0
frank       0
dtype: int64

In [45]:
# 3.

p_wi_given_spam_dict = {}

p_wi_given_ham_dict = {}

# P(w_i|Spam)
for i in range(0, sms_spam_sum.size):
    index = sms_spam_sum.index[i]
    dividend = sms_spam_sum[index] + alpha
    divisor = n_spam + (alpha*n_vocabulary)
    p_wi_given_spam_dict[index] =  dividend / divisor

    
# P(w_i|Ham)
for i in range(0, sms_ham_sum.size):
    index = sms_ham_sum.index[i]
    dividend = sms_ham_sum[index] + alpha
    divisor = n_ham + (alpha*n_vocabulary)
    p_wi_given_ham_dict[index] =  dividend / divisor

Checking first 20 items in `p_wi_given_spam_dict` and `p_wi_given_ham_dict`:

In [46]:
list(p_wi_given_spam_dict.items())[:10] 

[('cookies', 4.3036667240488894e-05),
 ('uncles', 4.3036667240488894e-05),
 ('shaping', 4.3036667240488894e-05),
 ('shoranur', 4.3036667240488894e-05),
 ('frank', 4.3036667240488894e-05),
 ('icmb3cktz8r7', 8.607333448097779e-05),
 ('150pm', 0.0003012566706834223),
 ('winning', 4.3036667240488894e-05),
 ('heater', 4.3036667240488894e-05),
 ('6669', 8.607333448097779e-05)]

In [47]:
list(p_wi_given_ham_dict.items())[:10] 

[('cookies', 4.621855212682371e-05),
 ('uncles', 7.703092021137284e-05),
 ('shaping', 3.081236808454914e-05),
 ('shoranur', 3.081236808454914e-05),
 ('frank', 3.081236808454914e-05),
 ('icmb3cktz8r7', 1.540618404227457e-05),
 ('150pm', 1.540618404227457e-05),
 ('winning', 3.081236808454914e-05),
 ('heater', 4.621855212682371e-05),
 ('6669', 1.540618404227457e-05)]

In [48]:
# Free RAM 4

del sms_spam
del sms_ham

## Slide 8
---
Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message ($w_1$, $w_2$, ..., $w_n$).


- Calculates 
.
  
  
- Compares the values of $P(Spam|w_1, w_2, ..., w_n)$ and $P(Ham|w_1, w_2, ..., w_n)$, and:
    - If $P(Ham|w_1, w_2, ..., w_n) > P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as ham.
    
    - If $P(Ham|w_1, w_2, ..., w_n) < P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as spam.
    
    - If $P(Ham|w_1, w_2, ..., w_n) = P(Spam|w_1, w_2, ..., w_n)$, then the algorithm may request human help.

Below, we see a rough sketch of how the spam filter function might look like:

    import re

    def classify(message):

        message = re.sub('\W', ' ', message)
        message = message.lower()
        message = message.split()

        '''    
        This is where we calculate:

        p_spam_given_message = ?
        p_ham_given_message = ?
        '''    

        print('P(Spam|message):', p_spam_given_message)
        print('P(Ham|message):', p_ham_given_message)

        if p_ham_given_message > p_spam_given_message:
            print('Label: Ham')
        elif p_ham_given_message < p_spam_given_message:
            print('Label: Spam')
        else:
            print('Equal proabilities, have a human classify this!')



For the `classify()` function above, note that:


- The input variable `message` is assumed to be a string.


- We perform a bit of data cleaning on the string `message`:
    - We remove the punctuation using the `re.sub()` function.
    - We bring all letters to lower case using the `str.lower()` method.
    - We split the string at the space character and transform it into a Python list using the `str.split()` method.
    
    
- There's some placeholder code for calculating `p_spam_given_message` and `p_ham_given_message` — we'll write this code in the exercise below.


- We compare `p_spam_given_message` with p_ham_given_message and then print a classification label.

To write the code we need for calculating `p_spam_given_message` and `p_ham_given_message`, we need to use these two equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}



\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Note that some new messages will contain words that are not part of the vocabulary. Recall from the previous lesson that we simply ignore these words when we're calculating the probabilities.

Now we'll write the code for calculating `p_spam_given_message` and `p_ham_given_message`, and then we'll use the function to classify two new messages. On the next screen, we'll classify all the 1,114 messages in our test set.

#### Instructions
---
1. Copy the `classify()` function you see above and write the code needed for calculating `p_spam_given_message` and `p_ham_given_message`.
    - Initiate p_spam_given_message and p_ham_given_message with an initial value. We recommend initiating the variables as p_spam_given_message = p_spam and p_ham_given_message = p_ham (p_spam and p_ham are P(Spam) and P(Ham), and they were calculated on the previous steps).
    - Iterate over each word in `message` (the input of the `classify()` function), which should be a list of strings by the time you start this loop. For each word:
        - If the word is present in the dictionary containing the spam parameters, then update the value of `p_spam_given_message` by multiplying with the parameter value specific to that word. You'll need to code something similar to `p_spam_given_message *= parameters_spam[word]`.
        - If the word is present in the dictionary containing the ham parameters, then update the value of `p_ham_given_message` by multiplying with the parameter value specific to that word. You'll need to do something like `p_ham_given_message *= parameters_spam[word]`.
        - If the word is not present in any of the two dictionaries, then don't do anything. Recall that we ignore words that are not part of the vocabulary.



3. Use the `classify()` function to classify two new messages. You can use any messages you want, but we suggest that one message is obviously spam, and the other is obviously ham. For instance, you can use these two messages:
    - 'WINNER!! This is the secret code to unlock the money: C3421.'
    - "Sounds good, Tom, then see u there".

Now that $P(w_i|Spam)$ and $P(w_i|Ham)$ is calculated throughout the entire span of messages contained in the training set, it is possible to finally build the spam filter, by calculating and comparing $P(Spam|w_1, w_2, ..., w_n)$ with $P(Ham|w_1, w_2, ..., w_n)$.

In [49]:
def classify(message):
    """Takes in a string - a cellphone message (SMS), and returns the probability of Spam given the input message,
    the probability of non-spam (ham) given the input message and classifies whether
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub('\W', ' ', message) # still a string
    message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) with P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is not in the spam or in the non-spam DataFrames the loop does nothing by default.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]
        

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
        
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
        
    else:
        print('Equal proabilities, have a human classify this!')

Testing the filter:

In [51]:
# classify('WINNER!! This is the secret code to unlock the money: C3421.')

In [52]:
# classify('Sounds good, Tom, then see u there')

In [50]:
classify("Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50")

P(Spam|message): 2.075032030372528e-73
P(Ham|message): 1.693000959967756e-66
Label: Ham


## Slide 9
---
On the previous screen, we managed to create a spam filter, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

First off, we'll change the `classify()` function that we wrote previously to return the labels instead of printing them. Below, note that we now have `return` statements instead of `print()` functions:

    def classify_test_set(message):

        message = re.sub('\W', ' ', message)
        message = message.lower()
        message = message.split()

        p_spam_given_message = p_spam
        p_ham_given_message = p_ham

        for word in message:
            if word in parameters_spam:
                p_spam_given_message *= parameters_spam[word]

            if word in parameters_ham:
                p_ham_given_message *= parameters_ham[word]

        if p_ham_given_message > p_spam_given_message:
            return 'ham'
        elif p_spam_given_message > p_ham_given_message:
            return 'spam'
        else:
            return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

    test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
    test_set.head()

Output:

|   |Label|SMS                                               |predicted|
|---|-----|--------------------------------------------------|---------|
|0  |ham  |Later i guess. I needa do mcat study too.         |ham      |
|1  |ham  |But i haf enuff space got like 4 mb...            |ham      |
|2  |spam |Had your mobile 10 mths? Update to latest Oran... |spam     |
|3  |ham  |All sounds good. Fingers . Makes it difficult ... |ham      |
|4  |ham  |All done, all handed in. Don't know if mega sh... |ham      |


Now we can compare the predicted values with the actual values to measure how good our spam filter is with classifying new messages. To make the measurement, we'll use **accuracy** as a metric:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
\end{equation}


#### Instructions
---
1. Measure the accuracy of the spam filter.
- Initialize a variable named correct with a value of `0`.
- Initialize a variable named `total` with the number of messages in the test set.
- Iterate over the test set DataFrame (you can use the `DataFrame.iterrows() method`). For each row:
    - If the actual label is the same as the predicted label, then increment `correct` by `1`.
    - Use `correct` and `total` in combination with the above formula to calculate the accuracy of the spam filter.

2. What do you think about the accuracy value? Is it better or worse than you expected?

Instead of going for the suggested method in the instructions we'll add two new columns to `testing_set`: one, `Test`, displays the output returned by `classify_test_set()` for each row/message; the last, `Correct` is '1' if the filter returns the right classification, i.e. the same classification in the `Label` column (which previously classifies the message has 'spam' or 'ham'), or '0' otherwise.

In [53]:
# input_message = 'Sounds+good'


# for index, value in enumerate(non_words_list) and enumerate(non_words_list_w_spaces):
#     if non_words_list[index] in input_message:
#         expression = non_words_list[index]
#         expression = expression.encode('unicode_escape')
#         input_message = re.sub(expression, non_words_list_w_spaces[index], input_message)

# input_message 

In [51]:
def classify_test_set(message):
    """Takes in a string - a cellphone message (SMS), and returns a classification of whether 
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub('[^A-Za-z0-9\s£€\$]', ' ', message) # still a string
    message = re.sub('£', ' £ ', message)
    message = re.sub('€', ' € ', message)
    message = re.sub('\$', ' $ ', message)
    message = re.sub('\s+', ' ', message)
    message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) with P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is not in the spam or in the non-spam DataFrames the loop does nothing by default.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Requires human classification.'

In [52]:
testing_set['Test'] = testing_set['SMS'].apply(classify_test_set)


# Assigns True if condition is met and multplying by one converts True in '1' and False in '0'.
testing_set['Correct'] = (testing_set['Label'] == testing_set['Test'])*1

In [56]:
testing_set.head()

Unnamed: 0,Label,SMS,Test,Correct
4458,ham,Later i guess. I needa do mcat study too.,ham,1
4459,ham,But i haf enuff space got like 4 mb...,ham,1
4460,spam,Had your mobile 10 mths? Update to latest Oran...,spam,1
4461,ham,All sounds good. Fingers . Makes it difficult ...,ham,1
4462,ham,"All done, all handed in. Don't know if mega sh...",ham,1


Percentage of well classified messages when using the `classify_test_set()` function/filter:

In [53]:
test_accuracy = (testing_set['Correct'].sum() / testing_set.shape[0])*100

test_accuracy = test_accuracy.round(2)

print(f'When applied to the messages in the `testing_set` ({testing_set.shape[0]} entries) the test accuracy was aprox. {test_accuracy}%.')

When applied to the messages in the `training_set` (1114 entries) the test accuracy was aprox. 98.83%.


Given the value of the test accuracy of 98.74%, we can classify it as a success, since it exceeded the approval threshold of 80% by a substantial margin. 

## Slide 10 (end of project)
---
In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

If you want to keep working on this project, here's a few next steps you can take:


1. Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.


2. Make the filtering process more complex by making the algorithm sensitive to letter case.


3. Get the project portfolio-ready by using a few tips from our style guide for data science projects.


Congratulations, this is the end of the Conditional Probability course! We've come a long way and learned how to:

- Assign probabilities to events based on certain conditions by using conditional probability rules.


- Assign probabilities to events based on whether they are in relationship of statistical independence or not with other events.


- Assign probabilities to events based on prior knowledge by using Bayes' theorem.


- Create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

Curious to see what other students have done on this project? [Head over to our Community to check them out](https://community.dataquest.io/tags/c/social/share/49/433). While you are there, please remember to show some love and give your own feedback!

And of course, we welcome you to share your own project and show off your hard work. Head over to our Community to [share your finished Guided Project](https://community.dataquest.io/tags/c/social/share/49/433)!

## Extra Tasks

### 1. Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusion.

In [58]:
cond_incorrect = testing_set.Correct == 0

incorrect = testing_set[cond_incorrect].copy().reset_index(drop=True)

In [59]:
incorrect.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    13 non-null     object
 1   SMS      13 non-null     object
 2   Test     13 non-null     object
 3   Correct  13 non-null     int32 
dtypes: int32(1), object(3)
memory usage: 272.0+ bytes


In [60]:
 pd.options.display.max_colwidth = 500

incorrect[['Label', 'SMS']]

Unnamed: 0,Label,SMS
0,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
1,ham,Unlimited texts. Limited minutes.
2,ham,26th OF JULY
3,ham,Nokia phone is lovly..
4,ham,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied ""Boost is d secret of my energy"" n instantly d girl shouted ""our energy"" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8"
5,ham,No calls..messages..missed calls
6,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us"
7,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50"
8,spam,"Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text"
9,spam,"0A$NETWORKS allow companies to bill for SMS, so they are responsible for their ""suppliers"", just as a shop has to give a guarantee on what they sell. B. G."


In [None]:
string = 'Oh my god! I ve found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50'



In [61]:
incorrect_split = incorrect.SMS.str.replace('\W', ' ', regex=True)

In [62]:
incorrect_split 

0                                                                                                                                                                                                                                                                                                          Not heard from U4 a while  Call me now am here all night with just my knickers on  Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4 net
1                                                                                                                                                                                                                                                                                                                                                                                                                                  Unlimited texts  Limited minutes 
2                                                                                             

In [63]:
find_currency_1 = training_set.SMS.copy().apply(lambda x: re.match('$', x))



In [64]:
containsd_gbp = training_set.SMS.str.contains('150p')

find_currency_2 = training_set.SMS[containsd_gbp]

find_currency_2.value_counts()

FREE for 1st week! No1 Nokia tone 4 ur mob every week just txt NOKIA to 8007 Get txting and tell ur mates www.getzed.co.uk POBox 36504 W45WQ norm150p/tone 16+       3
I don't know u and u don't know me. Send CHAT to 86688 now and let's find each other! Only 150p/Msg rcvd. HG/Suite342/2Lands/Row/W1J6HL LDN. 18 years or over.       3
YES! The only place in town to meet exciting adult singles is now in the UK. Txt CHAT to 86688 now! 150p/Msg.                                                        2
Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066 TnCs www.Ldew.com1win150ppmx3age16               2
Congrats! Nokia 3650 video camera phone is your Call 09066382422 Calls cost 150ppm Ave call 3mins vary from mobiles 16+ Close 300603 post BCM4284 Ldn WC1N3XX        2
                                                                                                                                                                    .

In [65]:
find_currency_1[find_currency_1==True].head()

Series([], Name: SMS, dtype: object)

In [66]:
 re.sub('\W', ' ', '123 @ maria 12caes £ £150 loly$ loly. loly+pussy.net')    

'123   maria 12caes    150 loly  loly  loly pussy net'

In [67]:
pattern = '(?:!|,)'

re.sub(pattern, f' {pattern} ', 'Secret!! Money, goods.' )



'Secret (?:!|,)  (?:!|,)  Money (?:!|,)  goods.'

In [68]:
one_line = incorrect[3]#.str.replace('\W', ' ', regex=True)

KeyError: 3

In [None]:
def classify_test_set_v2(message):
    """Takes in a string - a cellphone message (SMS), and returns a classification of whether 
    the message is spam, not spam (ham), or if a human is required to classify the message.
    """

    message = re.sub('\W', ' ', message) # still a string
    message = message.lower() # still a string
    message = message.split() # now a list of strings

    
    # Calculating P(Spam|w_1, w_2, ..., w_n) with P(Ham|w_1, w_2, ..., w_n).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # Note: if `word` is not in the spam or in the non-spam DataFrames the loop does nothing by default.
    for word in message:
        
        if word in p_wi_given_spam_dict.keys():
            p_spam_given_message *= p_wi_given_spam_dict[word]
            
        if word in p_wi_given_ham_dict.keys():
            p_ham_given_message *= p_wi_given_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Requires human classification.'