## Spam Filter with Naive Bayes

![Image](https://ampjar.com/wp-content/uploads/2019/07/HERO-avoid-spam-filters@2x.png)

### Introduction

To classify messages as spam or non-spam using the naive bayes algorithm, we basically need follow these steps:

* Learns how humans classify messages
* Uses that human knowledge to estimate probabilities for new messages (spam and non-spam)
* Classifies a new message based on these probability values

In this project we'll be working with the data set provided by [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can download data set and readme download from this link and also get data set information.

There are 5,572 SMS messages that are already classified by humans. Let's read it and take a quick look on the first five rows.

In [1]:
import pandas as pd

classified_sms = pd.read_csv('SMSSpamCollection',
                            sep='\t',
                            header=None,
                            names=['Label', 'SMS'])

classified_sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Let's find now what percentage of the messages is spam and what isn't ("ham" means non-spam).

In [2]:
total_persentages = classified_sms['Label'].value_counts(normalize=True)

print('''There are {:.1%} spam messages and {:.1%} non-spam messages in the data set'''
      .format(total_persentages[1], total_persentages[0]))

There are 13.4% spam messages and 86.6% non-spam messages in the data set


### Building the filter

#### Test preparation

When creating software, it's a good rule to think over tests in advance. In this case we should split our data set into two parts:
* training set (about 80% of the dataset)
* test set (about 20% of the dataset)

We'll create the spam filter using training set only. Once we have it, we'll apply the algorithm on testing set and compare  results with the human classificatioan. Our goal here is accuracy greater than **80%**.

To ensure that spam and ham messages are spread properly throughout the datasets we'll randomize all the data.

In [3]:
#Randomize whole data set
classified_sms.sample(frac=1, random_state=1)

#Index to split with
split_on = round(len(classified_sms) * 0.8)
train_set = classified_sms.iloc[:split_on, :].copy()
test_set = classified_sms.iloc[split_on:, :].copy()

#Reset index
train_set.reset_index(inplace=True, drop=True)
test_set.reset_index(inplace=True, drop=True)

Let's check the spam and ham ratio in these new sets. It should be similar to what we have in the full data set.

In [4]:
train_persentages = train_set['Label'].value_counts(normalize=True)
test_persentages = test_set['Label'].value_counts(normalize=True)

print('''Full data set: {:.1%} spam messages and {:.1%} non-spam messages
Train set: {:.1%} spam messages and {:.1%} non-spam messages
Test set: {:.1%} spam messages and {:.1%} non-spam messages'''
      .format(total_persentages[1], total_persentages[0],
             train_persentages[1], train_persentages[0],
             test_persentages[1], test_persentages[0]))

Full data set: 13.4% spam messages and 86.6% non-spam messages
Train set: 13.5% spam messages and 86.5% non-spam messages
Test set: 13.0% spam messages and 87.0% non-spam messages


Everything looks fine.

#### Data cleaning

We want to classify messages by the words in them so we need to convert them into sets of words. It means:
* Remove any punctuation
* Reduce everything to the lowercase
* Split the message

In [5]:
train_set['SMS'] = train_set['SMS'].str.replace(
    r'\W', ' ',regex=True).str.lower().str.split()

train_set

Unnamed: 0,Label,SMS
0,ham,"[go, until, jurong, point, crazy, available, o..."
1,ham,"[ok, lar, joking, wif, u, oni]"
2,spam,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,..."
...,...,...
4453,ham,"[i, ve, told, you, everything, will, stop, jus..."
4454,ham,"[or, i, guess, lt, gt, min]"
4455,ham,"[i, m, home, ard, wat, time, will, u, reach]"
4456,ham,"[storming, msg, wen, u, lift, d, phne, u, say,..."


Now we should create a list of unique words - the **vocabulary**.

In [7]:
vocabulary = []

for message in train_set['SMS']:
    for word in message:
        vocabulary.append(word)
        
#Remove duplicates and return the list
vocabulary_set = set(vocabulary)
vocabulary = list(vocabulary_set)

len(vocabulary)

7813

There are 7813 words in the vocabulary!

Now 