# Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. You can also download the dataset directly from this link. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.

In [1]:
import pandas as pd

In [2]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
sms.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
print("Dataset has", sms.shape[0], "rows and", sms.shape[1], "columns.")

Dataset has 5572 rows and 2 columns.


In [5]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

We read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. 

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
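
The two sizes above follow from rounding 80% of 5,572 to a whole number of messages. A quick check, with variable names chosen here purely for illustration:

```python
# Sizes of an 80/20 split of the 5,572-message dataset.
n_messages = 5572
train_fraction = 0.8

n_train = round(n_messages * train_fraction)  # 80% of 5572 is 4457.6, rounded to a whole message
n_test = n_messages - n_train                 # everything that is left goes to the test set

print(n_train, n_test)
```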

For now, let's create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

In [6]:
random = sms.sample(frac=1, random_state=1)

In [7]:
random = random.reset_index(drop=True)

In [8]:
random.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
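
To see what `sample(frac=1, random_state=1)` followed by `reset_index(drop=True)` does, here is a small sketch on a toy table. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the SMS table (hypothetical rows).
toy = pd.DataFrame({
    'Label': ['ham', 'ham', 'spam', 'ham', 'spam'],
    'SMS': ['a', 'b', 'c', 'd', 'e'],
})

# frac=1 samples every row without replacement, which amounts to a shuffle;
# random_state fixes the seed, so the same shuffle is produced on every run.
shuffled_once = toy.sample(frac=1, random_state=1).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=1).reset_index(drop=True)

print(shuffled_once.equals(shuffled_again))               # same seed, same order
print(sorted(shuffled_once['SMS']) == sorted(toy['SMS']))  # no rows lost or duplicated
print(list(shuffled_once.index))                           # index renumbered from 0
```

Resetting the index matters later in the notebook: positional slices and the column-wise `pd.concat` both rely on the rows being numbered 0, 1, 2, ... after the shuffle.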


In [9]:
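# .copy() makes each split an independent DataFrame, so later column assignments don't raise SettingWithCopyWarning.
train = random.iloc[0:4458].copy()
# Starting at 4459 leaves row 4458 in neither set, so the test set has 1,113 rows; random.iloc[4458:] would give the intended 1,114.
test = random.iloc[4459:].copy()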

In [10]:
print(train['Label'].value_counts(normalize=True) * 100)
print(test['Label'].value_counts(normalize=True) * 100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.792453
spam    13.207547
Name: Label, dtype: float64


To calculate the probabilities the algorithm needs, we'll first perform a bit of data cleaning to bring the data into a format from which we can easily extract all the information we need. Let's begin by removing the punctuation and converting all the words to lower case.

In [11]:
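import re

# Replace every non-word character with a space. regex=True is required in pandas 2.0 and later, where the pattern is treated as a literal string by default.
train.loc[:, 'SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train.loc[:, 'SMS'] = train['SMS'].str.lower()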


In [12]:
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep  by the pretty sculpture |
| 1 | ham | yes  princess  are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth   there s a card on ... |
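
The same cleaning steps can be tried on a single string. The helper name `clean` below is hypothetical. It also splits the result into words, which is what the classifier later does with each incoming message.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lower-case, and split into words."""
    message = re.sub(r'\W', ' ', message)
    return message.lower().split()

print(clean('Yep, by the pretty sculpture'))  # ['yep', 'by', 'the', 'pretty', 'sculpture']
print(clean("There's a card"))                # ['there', 's', 'a', 'card']
```

Note that the apostrophe in "There's" is also a non-word character, so the contraction breaks into two tokens, "there" and "s", exactly as in the last row of the table above.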


Our goal is a table in which, apart from the "Label" and "SMS" columns, every column represents a unique word in our vocabulary (more specifically, each column shows how many times that word occurs in a given message). Recall from the previous mission that we call the set of unique words a vocabulary.

We'll eventually bring the training set to that format ourselves, but first, let's create a list with all of the unique words that occur in the messages of our training set.

In [13]:
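# Split each cleaned message into a list of words.
train['SMS'] = train['SMS'].str.split()

# Collect every word; the loop variable is named `message` so it doesn't rebind the `sms` DataFrame.
vocabulary = []
for message in train['SMS']:
    for word in message:
        vocabulary.append(word)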


In [14]:
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
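
The effect of collecting every word and then passing the list through `set` is easiest to see on a few made-up messages. The `toy_` names below are illustrative only.

```python
# Three hypothetical, already-cleaned and split messages.
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'prize', 'now'],
]

# Gather every word, duplicates included.
toy_words = []
for message in toy_messages:
    for word in message:
        toy_words.append(word)

# A set keeps one copy of each word; sorting just makes the printout stable.
toy_vocabulary = sorted(set(toy_words))

print(len(toy_words), len(toy_vocabulary))
print(toy_vocabulary)
```

The 13 words across the three messages collapse to 9 unique vocabulary entries.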

Now we're going to use the vocabulary to make the data transformation we need:

In [15]:
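# One list of zeros per vocabulary word, with one slot per training message.
word_count = {word: [0] * len(train['SMS']) for word in vocabulary}

# Increment slot i of a word each time that word occurs in message i.
for i, message in enumerate(train['SMS']):
    for word in message:
        word_count[word][i] += 1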
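
Here is the same counting scheme on two made-up messages, which makes the shape of the dictionary easier to inspect. The `toy_` names are illustrative.

```python
# Two hypothetical, already-cleaned and split messages.
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['secret', 'party', 'secret'],
]
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, with one slot per message, all starting at zero.
toy_word_count = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for i, message in enumerate(toy_messages):
    for word in message:
        toy_word_count[word][i] += 1

for word, counts in toy_word_count.items():
    print(word, counts)
```

Each key becomes a column and each list position becomes a row once the dictionary is handed to `pd.DataFrame`, so "secret" ends up with a 1 in the first row and a 2 in the second.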

Now that we have the dictionary we need, let's do the final transformations to our training set and then move forward with creating the spam filter.

In [16]:
count = pd.DataFrame(word_count)
count.head()

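| | dying | lmao | sheets | 09063458130 | bday | hows | bitching | 8pm | drinkin | yah | ... | potential | 4goten | donyt | online | rite | nichols | mentionned | naseeb | idew | anyway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |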


In [17]:
final = pd.concat([train, count], axis=1)

final.head()

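| | Label | SMS | dying | lmao | sheets | 09063458130 | bday | hows | bitching | 8pm | ... | potential | 4goten | donyt | online | rite | nichols | mentionned | naseeb | idew | anyway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |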


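Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. We start with the constants that every parameter shares: the priors P(Spam) and P(Ham), the total number of words in spam messages (N_Spam) and in ham messages (N_Ham), the vocabulary size (N_Vocabulary), and the Laplace smoothing constant α = 1.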


In [18]:
spam_set = final[final['Label'] == 'spam']
ham_set = final[final['Label'] == 'ham']

In [19]:
p_spam = len(spam_set) / len(final)
p_spam

0.13458950201884254

In [20]:
p_ham = len(ham_set) / len(final)
p_ham

0.8654104979811574

In [21]:
n_spam = spam_set['SMS'].apply(len).sum()

In [22]:
n_spam

15190

In [23]:
n_ham = ham_set['SMS'].apply(len).sum()

In [24]:
n_ham

57237

In [25]:
n_vocabulary = len(vocabulary)

In [26]:
n_vocabulary

7783

In [27]:
alpha = 1

We can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

- P("lost"|Spam) and P("lost"|Ham)
- P("navigate"|Spam) and P("navigate"|Ham)
- P("sea"|Spam) and P("sea"|Ham)

In more technical language, the probability values that P(wi|Spam) and P(wi|Ham) will take are called parameters.

Let's now calculate all the parameters using the equations below:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}
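
To get a feel for these equations, here is a small sketch that plugs numbers into them. The function name `smoothed_probability` and the toy counts are assumptions made for illustration. The values 15190 and 7783 are the N_Spam and N_Vocabulary computed above.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class), following the equations above."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# A vocabulary word that never occurs in a spam message still gets a small,
# non-zero probability: (0 + 1) / (15190 + 1 * 7783) = 1 / 22973.
p_unseen = smoothed_probability(0, 15190, 7783)
print(p_unseen)

# Hypothetical counts for a three-word vocabulary in a class with 4 words in total.
toy_counts = {'lost': 3, 'navigate': 0, 'sea': 1}
toy_params = {word: smoothed_probability(count, 4, 3) for word, count in toy_counts.items()}
print(toy_params)
print(sum(toy_params.values()))  # the parameters of one class add up to 1, within rounding
```

Without the α term, a word that is absent from one class would get probability 0 and wipe out the whole product for that class, no matter what the other words in the message say.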

In [28]:
spam_dict = {}
ham_dict = {}

In [29]:
for i in vocabulary:
    spam_dict[i] = 0
    ham_dict[i] = 0

In [30]:
for word in vocabulary:
    prob_w_spam = (spam_set[word].sum() + alpha) / (n_spam + (alpha * n_vocabulary))
    spam_dict[word] = prob_w_spam

In [31]:
for word in vocabulary:
    prob_w_ham = (ham_set[word].sum() + alpha) / (n_ham + (alpha * n_vocabulary))
    ham_dict[word] = prob_w_ham
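
The two loops above take one column sum per word. As a possible alternative, pandas can sum all word columns in one call. The sketch below shows the idea on a toy table with made-up counts and checks it against the word-by-word loop. All `toy_` names are illustrative.

```python
import pandas as pd

# Toy version of the `final` table: one row per message, one column per word.
toy = pd.DataFrame({
    'Label': ['spam', 'ham', 'spam'],
    'secret': [1, 1, 0],
    'prize': [1, 0, 2],
    'party': [0, 1, 0],
})
toy_vocabulary = ['secret', 'prize', 'party']
toy_alpha = 1

toy_spam = toy[toy['Label'] == 'spam']
# Total number of words in spam messages (every word here is in the vocabulary).
toy_n_spam = int(toy_spam[toy_vocabulary].to_numpy().sum())
denominator = toy_n_spam + toy_alpha * len(toy_vocabulary)

# One column-wise sum gives the count of every word in spam at once.
spam_counts = toy_spam[toy_vocabulary].sum()
toy_spam_params = ((spam_counts + toy_alpha) / denominator).to_dict()

# The same parameters computed word by word, as in the loops above.
loop_params = {w: (toy_spam[w].sum() + toy_alpha) / denominator for w in toy_vocabulary}

print(toy_spam_params)
print(loop_params)
```

Both approaches give the same parameters. The column-wise sum mainly saves time when the vocabulary has thousands of words.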

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn) with the equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.


In [32]:
def classify(message):
    # Clean the message the same way as the training data.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors, then multiply in each word's parameter.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Words that are not in the vocabulary are skipped.
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let's try it on two obvious examples:

In [33]:
classify('WINNER!! money, secret, code, winner.')

P(Spam|message): 3.957764308966044e-18
P(Ham|message): 2.6809569121070658e-22
Label: Spam


In [34]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
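
The scores printed for these short messages are already on the order of 1e-18 to 1e-25. For much longer messages, a product of many small factors can underflow to 0.0 for both classes. A common workaround, which this project does not use, is to add logarithms instead of multiplying probabilities. The sketch below uses made-up parameter values and hypothetical function names to show the difference.

```python
import math

# Hypothetical priors and parameters for a three-word vocabulary (illustrative values only).
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_params = {'winner': 0.02, 'money': 0.01, 'see': 0.001}
toy_ham_params = {'winner': 0.0005, 'money': 0.002, 'see': 0.01}

def product_scores(words):
    """Multiply probabilities, as classify() does."""
    spam, ham = toy_p_spam, toy_p_ham
    for word in words:
        if word in toy_spam_params:
            spam *= toy_spam_params[word]
        if word in toy_ham_params:
            ham *= toy_ham_params[word]
    return spam, ham

def log_scores(words):
    """Add log-probabilities; the larger sum marks the more probable class."""
    log_spam, log_ham = math.log(toy_p_spam), math.log(toy_p_ham)
    for word in words:
        if word in toy_spam_params:
            log_spam += math.log(toy_spam_params[word])
        if word in toy_ham_params:
            log_ham += math.log(toy_ham_params[word])
    return log_spam, log_ham

long_message = ['winner', 'money'] * 200   # an artificially long, spammy message
print(product_scores(long_message))        # both products underflow to 0.0
print(log_scores(long_message))            # the log scores still separate the classes
```

Because the logarithm is an increasing function, comparing the two sums gives the same label as comparing the two products whenever the products are representable.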


We'll now determine how well the spam filter does on our test set. Because the slice `random.iloc[4459:]` starts at index 4459, the test set holds 1,113 messages rather than the planned 1,114.

The algorithm will output a classification label for every message in our test set, which we can compare with the actual label (given by a human). The algorithm didn't see these messages during training, so every message in the test set is new from its perspective.

First off, we'll change the classify() function that we wrote previously to return the labels instead of printing them. Below, note that we now have return statements instead of print() calls:

In [35]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [36]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()


| | Label | SMS | predicted |
|---|---|---|---|
| 4459 | ham | But i haf enuff space got like 4 mb... | ham |
| 4460 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 4461 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4462 | ham | All done, all handed in. Don't know if mega sh... | ham |
| 4463 | ham | But my family not responding for anything. Now... | ham |


Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages.

In [37]:
correct = 0
total = len(test)

# Count the rows where the prediction matches the human label.
for _, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

accuracy = correct / total
print(accuracy)

0.9874213836477987
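
Since roughly 87% of the messages are ham, a filter that labelled everything as ham would already reach about 87% accuracy. It can therefore help to break the result down by the kind of mistake made. The sketch below does this on made-up labels, not on the notebook's results, with illustrative `toy_` names.

```python
from collections import Counter

# Hypothetical actual and predicted labels for six messages.
toy_actual = ['ham', 'spam', 'ham', 'ham', 'spam', 'ham']
toy_predicted = ['ham', 'spam', 'ham', 'spam', 'ham', 'ham']

# Count each (actual, predicted) pair: matching pairs are correct, the rest are errors.
outcomes = Counter(zip(toy_actual, toy_predicted))
toy_accuracy = sum(a == p for a, p in zip(toy_actual, toy_predicted)) / len(toy_actual)

print('ham kept as ham:    ', outcomes[('ham', 'ham')])
print('spam caught as spam:', outcomes[('spam', 'spam')])
print('ham flagged as spam:', outcomes[('ham', 'spam')])
print('spam let through:   ', outcomes[('spam', 'ham')])
print('accuracy:', toy_accuracy)
```

On the real test set, the same counts could be obtained by passing `test['Label']` and `test['predicted']` in place of the toy lists.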
