In [1]:
# import libraries
from IPython.display import Image  # for displaying images in markdown cells
import pandas as pd  # DataFrame manipulation
import numpy as np  # array manipulation

# Dataquest - Conditional Probabilities <br/> <br/> Project Title: Building A Spam Filter With Naive Bayes

## 1) Introduction and Exploring the Dataset

#### Key skills applied in project:
- Assign probabilities to events based on certain conditions by using conditional probability rules.
- Assign probabilities to events based on whether or not they are statistically independent of other events.
- Assign probabilities to events based on prior knowledge by using Bayes' theorem.
- Create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

#### Background:

Provided by: [Dataquest.io](https://www.dataquest.io/)

We're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, the computer:

1) Learns how humans classify messages.

2) Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.

3) Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
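
The decision rule in step 3 can be sketched as a small function. The name `classify_from_probabilities` and the returned label strings are illustrative assumptions, not part of the original project:

```python
def classify_from_probabilities(p_spam, p_ham):
    """Return a label given two already-computed probability values."""
    if p_spam > p_ham:
        return 'spam'
    if p_ham > p_spam:
        return 'ham'
    # Equal values: the rule above defers to a human.
    return 'needs human classification'


print(classify_from_probabilities(0.7, 0.3))
print(classify_from_probabilities(0.2, 0.8))
print(classify_from_probabilities(0.5, 0.5))
```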

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

Note that due to the nature of spam messages, the dataset contains content that may be offensive to some users.

In [2]:
# read and load file, review for familiarity
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)


df.head()

|  | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
rename_map = {0:'Label', 1:'SMS'}

df.rename(columns=rename_map, inplace=True)
df

|  | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [4]:
# review number of rows and columns
df.shape

(5572, 2)

In [5]:
# review any null values (non-null values should match 5,572 rows)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
# review number and percentage of spam and ham
print(df['Label'].value_counts())
print('')
print(df['Label'].value_counts() / df['Label'].count() * 100)

ham     4825
spam     747
Name: Label, dtype: int64

ham     86.593683
spam    13.406317
Name: Label, dtype: float64


## 2) Training And Test Set

Provided by: [Dataquest.io](https://www.dataquest.io/)

We read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

Before creating the spam filter, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we're first going to split our dataset into two sets:

- A **training set**, which we'll use to "train" the computer how to classify messages.
- A **test set**, which we'll use to test how well the spam filter classifies new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
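
These sizes follow from rounding 80% of 5,572 to the nearest whole message. A quick check, with variable names chosen here purely for illustration:

```python
total_messages = 5572

# 80% of 5,572 is 4,457.6, which rounds to 4,458.
training_size = round(total_messages * 0.8)
# The test set gets whatever remains.
test_size = total_messages - training_size

print(training_size, test_size)
```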

All 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).
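
As a minimal sketch of how this accuracy target could be checked, the snippet below compares predicted labels with human-assigned labels on a tiny made-up example. The `accuracy` helper and the sample labels are hypothetical and are not taken from the dataset:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the human-assigned labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels for illustration only.
human = ['ham', 'spam', 'ham', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'spam']

# 4 of the 5 predictions match the human labels.
print(accuracy(human, predicted))
```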

For now, let's create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

In [24]:
# Shuffle the entire dataset with the DataFrame.sample() method.
# frac=1 returns all rows, in random order.
# random_state=1 fixes the random seed so the shuffle is reproducible.
df_sample = df.sample(frac=1, random_state=1)

# Split the shuffled dataset into a training and a test set.
# training set = 4,458 rows (80%), test set = 1,114 rows (20%)
# reset_index(drop=True) renumbers each set from 0 and discards the shuffled index.
training_set = df_sample.iloc[:4458].reset_index(drop=True)
test_set = df_sample.iloc[4458:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [25]:
training_set.head(2)

|  | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |


In [26]:
test_set.head(2)

|  | Label | SMS |
|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. |
| 1 | ham | But i haf enuff space got like 4 mb... |


In [28]:
# review number and percentage of spam and ham in each set (training set and test set)
print(training_set['Label'].value_counts())
print('')
print(training_set['Label'].value_counts() / training_set['Label'].count() * 100)
print('')
print(test_set['Label'].value_counts())
print('')
print(test_set['Label'].value_counts() / test_set['Label'].count() * 100)

ham     3858
spam     600
Name: Label, dtype: int64

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

ham     967
spam    147
Name: Label, dtype: int64

ham     86.804309
spam    13.195691
Name: Label, dtype: float64


#### Findings (Training And Test Set):

After splitting the dataset into training and test sets, the percentages of ham and spam messages in each set are quite similar and close to those of the full dataset (an 87:13 split).

## 3) Letter Case And Punctuation

Provided by: [Dataquest.io](https://www.dataquest.io/)

The next big step is to use the training set to teach the algorithm to classify new messages.

When a new message comes in, our Naive Bayes algorithm will make the classification based on the results it gets from these two equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

To calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ in the formulas above, we use the following equations, which apply additive smoothing:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Where:

\begin{aligned}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Ham} = \text{the number of times the word } w_i \text{ occurs in non-spam messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Ham} = \text{total number of words in non-spam messages} \\
\\
&N_{Vocabulary} = \text{total number of words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{aligned}
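
To make the smoothed estimate concrete, here is a small sketch with made-up counts. The function names `smoothed_word_probability` and `class_score`, and all the numbers, are illustrative assumptions rather than values from the SMS dataset:

```python
def smoothed_word_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w_i | class) with additive smoothing, following the formulas above."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)


def class_score(prior, word_probabilities):
    """Prior times the product of the per-word conditional probabilities."""
    score = prior
    for probability in word_probabilities:
        score *= probability
    return score


# Made-up counts: 20 words across all spam messages, 10 words in the vocabulary.
p_seen = smoothed_word_probability(n_word_given_class=3, n_class=20, n_vocabulary=10)
p_unseen = smoothed_word_probability(n_word_given_class=0, n_class=20, n_vocabulary=10)

print(p_seen)    # (3 + 1) / (20 + 1 * 10)
print(p_unseen)  # (0 + 1) / (20 + 1 * 10), non-zero thanks to smoothing
print(class_score(prior=0.5, word_probabilities=[p_seen, p_unseen]))
```

Without smoothing, a word that never appears in spam would have probability zero and would force the whole product to zero, regardless of the other words in the message.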


#### Data cleaning
To implement the algorithm above, we want to transform the existing datasets into new tables in which:
- each unique word has its own column
- the value in each such column is the number of times that word occurs in the message
- each row still represents one message
- punctuation is ignored, and upper and lower case are treated as the same
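
As a rough sketch of the target table, the snippet below builds word-count columns for a two-message toy DataFrame. The toy messages (already lower-case and free of punctuation) and the variable names are assumptions made for illustration; this is not necessarily how the project itself builds the table:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'Label': ['spam', 'ham'],
    'SMS': ['win money win', 'see you soon'],
})

# Split each message into a list of words.
tokens = toy['SMS'].str.split()

# The vocabulary is the sorted set of unique words across all messages.
vocabulary = sorted({word for message in tokens for word in message})

# One column per unique word; each value counts that word in the message.
word_counts = pd.DataFrame(
    [[message.count(word) for word in vocabulary] for message in tokens],
    columns=vocabulary,
)

# One row still represents one message.
result = pd.concat([toy, word_counts], axis=1)
print(result)
```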

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [None]:
# Remove all punctuation from the SMS column. The regex '\W' matches any character that is not a letter, a digit, or an underscore.
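
One possible way to carry out this step is shown below on a toy DataFrame. The sample messages and the names `toy` and `cleaned` are assumptions for illustration; in the project, the same idea would be applied to the `SMS` column of the training set:

```python
import pandas as pd

# Toy messages, made up for illustration.
toy = pd.DataFrame({'SMS': ['Free entry!! Win a PRIZE...', 'Ok lar... Joking wif u oni...']})

# Replace every non-word character with a space, then lower-case the text.
# regex=True is passed explicitly so the pattern is treated as a regular expression.
cleaned = toy['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()

# Splitting on whitespace discards the extra spaces left behind.
print(cleaned.str.split().tolist())
```

Replacing with a space rather than an empty string keeps words apart when punctuation is the only thing separating them.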

