# Building a Spam Filter with Naive Bayes

In this project, we want to classify messages as spam or non-spam. To reach this goal, the computer has to do the following steps for us:
1. Learn how humans classify messages
2. Use that human knowledge to estimate the probabilities that a new message is spam or non-spam
3. Classify the new message based on these probability values: if the probability for spam is greater, the message should be classified as spam (and vice versa).

For this project we will use the multinomial Naive Bayes algorithm in combination with a dataset of 5,572 SMS messages that have already been classified by humans. The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). <br><br>
First, let's start by reading in the dataset.

## Exploring the Dataset

In [1]:
# Reading in the csv file
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['Label','SMS'])

In [2]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [4]:
spam_prob = sms['Label'].value_counts(normalize=True)
spam_prob

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In the dataset, 13.4% of all SMS messages are labeled as spam while 86.6% of the messages are labeled as ham (non-spam).

In [5]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


## Training and Test Set
In order to validate the effectiveness of the spam filter we are about to create, we need to split the dataset into a training set and a test set. We will assign 80% of the messages to the training set to give our algorithm as much input as possible. The other 20% will be assigned to the test dataset. Splitting the dataset means that:
- 4,458 messages will be assigned to the training set (80%)
- 1,114 messages will be assigned to the testing set (20%)
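
A minimal sketch of the split described above, using a small made-up dataframe (the `toy` data and the `split_point` name are assumptions for illustration, not part of the project). Resetting the index after shuffling is an optional extra step that keeps the row labels running from 0, which can matter when the training set is later combined with another dataframe by index.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (hypothetical data, 10 rows)
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle all rows reproducibly
toy_rand = toy.sample(frac=1, random_state=1)

# 80% of the rows go to the training set, the rest to the test set
split_point = round(len(toy_rand) * 0.8)
toy_train = toy_rand[:split_point].reset_index(drop=True)
toy_test = toy_rand[split_point:].reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

Applied to the full dataset, the same rounding gives `round(5572 * 0.8)`, which is the 4,458 used in this project.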

In [6]:
# Randomizing the dataset
sms_rand = sms.sample(frac=1, random_state=1)
# Splitting the dataset
train = sms_rand[:4458]
test = sms_rand[4458:]

In [7]:
len(train)

4458

In [8]:
len(test)

1114

In [9]:
train.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [10]:
test.head()

Unnamed: 0,Label,SMS
2131,ham,Later i guess. I needa do mcat study too.
3418,ham,But i haf enuff space got like 4 mb...
3424,spam,Had your mobile 10 mths? Update to latest Oran...
1538,ham,All sounds good. Fingers . Makes it difficult ...
5393,ham,"All done, all handed in. Don't know if mega sh..."


Calculating the percentage of spam in the test set and the training set:

In [11]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [12]:
train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

We can see that the percentage of spam messages is fairly similar in both datasets and close to the 13.4% in the full dataset. Hence, we can apply the Naive Bayes algorithm to the training set.

## Letter Case and Punctuation

Before we can continue, we have to clean up the data so that it is usable for the algorithm. We have to transform the SMS column so that every unique word gets its own column, and each row displays the number of times that word occurs in the message. In addition, we have to transform all words to lower case so they are easily comparable. Furthermore, all punctuation should be removed.
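
The cleaning steps described above can be sketched on a single made-up message using only the standard library (the `clean` helper and the example text are assumptions for illustration):

```python
import re
from collections import Counter

def clean(message):
    # Replace every non-word character with a space, then lower-case
    return re.sub(r'\W', ' ', message).lower()

raw = "SECRET!! Secret code, here."
words = clean(raw).split()

# Count how often each word occurs in this one message
counts = Counter(words)
print(words)
print(counts['secret'])
```

After cleaning, "SECRET" and "Secret" are counted as the same word, which is the reason for lower-casing before counting.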

In [13]:
sms['SMS'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: SMS, dtype: object

In [14]:
# Replacing every non-word character (punctuation) with a space, then lower-casing
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [15]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Creating the Vocabulary
Next we want to create a set of unique words as a vocabulary. 

In [16]:
# Splitting each message into a list of words
train['SMS'] = train['SMS'].str.split()

vocabulary = []
# Appending all words to vocabulary
# (the loop variable is called message so it does not overwrite the sms dataframe)
for message in train['SMS']:
    for word in message:
        vocabulary.append(word)
print(len(vocabulary))
# Transforming the list into a set to remove duplicates
vocabulary = set(vocabulary)

vocabulary = list(vocabulary)

72427


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [17]:
len(vocabulary)

7783

It appears that there are 7,783 unique words in all sms messages in our training set.

## The Final Training Set

Now we have to create a dataframe that displays how many times every word occurs in a given message. This can be achieved with a dictionary that stores the word counts for every SMS in the dataset. In the following code we will reuse previously written code that creates the dictionary we need for the training set.
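
To make the shape of that dictionary concrete, here is a small sketch on two made-up messages that are already cleaned and split (the names `messages`, `vocab` and `counts_per_message` are assumptions for illustration):

```python
# Two hypothetical, already cleaned and split messages
messages = [['secret', 'prize', 'secret'], ['see', 'u', 'there']]
vocab = sorted({word for msg in messages for word in msg})

# One list of counts per word, with one slot per message
counts_per_message = {word: [0] * len(messages) for word in vocab}
for index, msg in enumerate(messages):
    for word in msg:
        counts_per_message[word][index] += 1

# Counts of 'secret' in message 0 and message 1
print(counts_per_message['secret'])
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.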

In [18]:
# Initializing the dictionary
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# Looping over all messages of the train dataset
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

Now that we have the dictionary we can create the new dataframe.

In [19]:
# Transforming the dictionary into a new dataframe
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

In [20]:
word_counts_per_sms.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Next, we will concatenate the dataframe we just built with the training set, which contains the Label and SMS columns.

In [21]:
train_clean = pd.concat([train, word_counts_per_sms], axis=1)

In [22]:
train_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[go, until, jurong, point, crazy, available, o...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,"[ok, lar, joking, wif, u, oni]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,"[u, dun, say, so, early, hor, u, c, already, t...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
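
In the output above, row 2 has an empty Label and SMS, and the counts appear as floats. One possible explanation (an assumption, not verified against this notebook's data) is that `pd.concat(..., axis=1)` aligns rows by index label: the shuffled `train` still carries its original row labels, while `word_counts_per_sms` is numbered from 0. A minimal sketch with made-up data shows the effect and how resetting the index first avoids it:

```python
import pandas as pd

# Hypothetical shuffled training rows that kept their original index labels
toy_train = pd.DataFrame({'Label': ['ham', 'spam', 'ham']}, index=[7, 2, 9])
# Word counts built with enumerate(), so the index runs 0, 1, 2
toy_counts = pd.DataFrame({'free': [0, 2, 0]})

# Aligning on index labels: only label 2 exists on both sides
misaligned = pd.concat([toy_train, toy_counts], axis=1)
print(misaligned.shape)                   # more rows than either input
print(misaligned['Label'].isna().sum())   # rows without a label

# Renumbering the training rows first pairs each message with its own counts
aligned = pd.concat([toy_train.reset_index(drop=True), toy_counts], axis=1)
print(aligned)
```

If this is what happens here, the word counts in `train_clean` would belong to different messages than the labels next to them, and the dataframe would contain extra rows with missing values.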


## Calculating Constants First

Now that the dataset is clean, we can start working on our spam filter using the Naive Bayes algorithm. First, let's calculate the important constants for our algorithm:
- P(Spam)
- P(Ham)
- N Spam
- N Ham
- N Vocabulary
<br><br>(N Spam and N Ham refer to the total number of words in all spam and all ham messages; N Vocabulary is the number of unique words in the vocabulary)

More information on how Naive Bayes is calculated can be found [here](https://www.geeksforgeeks.org/naive-bayes-classifiers/).

In [23]:
# Isolating spam messages and non spam messages
spam = train_clean[train_clean['Label']=='spam']
ham = train_clean[train_clean['Label']=='ham']

In [24]:
len(spam)

600

In [25]:
len(ham)

3858

In [26]:
p_spam = len(spam)/len(train_clean)
p_spam

0.11233851338700618

In [27]:
p_ham = len(ham)/len(train_clean)
p_ham

0.7223366410784497
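
Since every training message is either spam or ham, the two priors should add up to 1. A quick check using only the values printed above (the variable names are for illustration) suggests that they do not, which hints that `train_clean` may contain more rows than the 4,458 training messages:

```python
# Values copied from the outputs above
p_spam = 0.11233851338700618
p_ham = 0.7223366410784497
n_spam_messages = 600
n_ham_messages = 3858

# Would be 1.0 if the two priors covered every row
print(p_spam + p_ham)
# Row count implied by P(Spam) = 600 / number of rows
print(round(n_spam_messages / p_spam))
# Number of labeled training messages
print(n_spam_messages + n_ham_messages)
```

Dividing by the 4,458 labeled messages instead would give 600/4458 and 3858/4458, which matches the 13.5% / 86.5% split of the training set seen earlier.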

In [28]:
n_words_per_spam = spam['SMS'].apply(len)
n_spam = n_words_per_spam.sum()
n_spam

15190

In [29]:
n_words_per_ham = ham['SMS'].apply(len)
n_ham = n_words_per_ham.sum()
n_ham

57237

In [30]:
n_vocabulary = len(vocabulary)
n_vocabulary

7783

In [31]:
# Initiating alpha for Laplace smoothing
alpha = 1

## Calculating Parameters

In order to calculate the probability that a message is spam, we have to calculate the probability of every single word given each class. If the message is, for example, "secret code here", we would have to calculate the following probabilities:
- P('secret'|Spam) and P('secret'|Ham)
- P('code'|Spam) and P('code'|Ham)
- P('here'|Spam) and P('here'|Ham)

There are 7,783 words in our vocabulary, and for each word we need both the probability given spam and the probability given ham, which means we'll need to calculate a total of 15,566 probabilities.
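
With Laplace smoothing, each of these probabilities is estimated as P(w|Spam) = (N_w|Spam + alpha) / (N_Spam + alpha * N_Vocabulary), and likewise for ham. As a worked example using the constants computed above, here is the value for a word that never occurs in the given class (count 0); the `smoothed` helper is a name introduced for illustration:

```python
# Constants taken from the outputs above
n_spam = 15190
n_ham = 57237
n_vocabulary = 7783
alpha = 1

def smoothed(count, n_class):
    # Laplace-smoothed estimate of P(word | class)
    return (count + alpha) / (n_class + alpha * n_vocabulary)

print(smoothed(0, n_spam))  # word with a count of 0 in spam
print(smoothed(0, n_ham))   # word with a count of 0 in ham
```

Thanks to the smoothing term, a word with a count of zero still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.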

In [32]:
# Initializing the parameter dictionaries
param_spam = {unique_word:0 for unique_word in vocabulary}
param_ham = {unique_word:0 for unique_word in vocabulary}

# Calculating parameters
for word in vocabulary:
    n_word_given_spam = spam[word].sum()
    # Laplace-smoothed estimate of P(word|Spam)
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    param_spam[word] = p_word_given_spam

    n_word_given_ham = ham[word].sum()
    # Laplace-smoothed estimate of P(word|Ham)
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    param_ham[word] = p_word_given_ham

In [33]:
param_spam

{'gudnite': 4.3529360553693465e-05,
 'reltnship': 8.705872110738693e-05,
 'mfl': 4.3529360553693465e-05,
 'continue': 4.3529360553693465e-05,
 'scary': 4.3529360553693465e-05,
 'cares': 4.3529360553693465e-05,
 'sitll': 4.3529360553693465e-05,
 'figure': 8.705872110738693e-05,
 'atten': 4.3529360553693465e-05,
 'decking': 4.3529360553693465e-05,
 'bye': 8.705872110738693e-05,
 'beendropping': 4.3529360553693465e-05,
 'c': 0.0006094110477517085,
 'tirupur': 4.3529360553693465e-05,
 'cu': 4.3529360553693465e-05,
 'footprints': 4.3529360553693465e-05,
 'getstop': 4.3529360553693465e-05,
 'madurai': 4.3529360553693465e-05,
 'shipping': 4.3529360553693465e-05,
 'medicine': 4.3529360553693465e-05,
 'sol': 4.3529360553693465e-05,
 'lined': 4.3529360553693465e-05,
 'beers': 4.3529360553693465e-05,
 'gastroenteritis': 4.3529360553693465e-05,
 'support': 8.705872110738693e-05,
 '10ppm': 8.705872110738693e-05,
 'best1': 8.705872110738693e-05,
 'act': 4.3529360553693465e-05,
 'princes': 4.35293605

In [34]:
param_ham

{'gudnite': 6.151953245155337e-05,
 'reltnship': 1.537988311288834e-05,
 'mfl': 3.075976622577668e-05,
 'continue': 4.6139649338665025e-05,
 'scary': 6.151953245155337e-05,
 'cares': 4.6139649338665025e-05,
 'sitll': 3.075976622577668e-05,
 'figure': 0.00015379883112888343,
 'atten': 3.075976622577668e-05,
 'decking': 3.075976622577668e-05,
 'bye': 4.6139649338665025e-05,
 'beendropping': 3.075976622577668e-05,
 'c': 0.0010150722854506305,
 'tirupur': 6.151953245155337e-05,
 'cu': 1.537988311288834e-05,
 'footprints': 6.151953245155337e-05,
 'getstop': 3.075976622577668e-05,
 'madurai': 1.537988311288834e-05,
 'shipping': 3.075976622577668e-05,
 'medicine': 4.6139649338665025e-05,
 'sol': 9.227929867733005e-05,
 'lined': 3.075976622577668e-05,
 'beers': 3.075976622577668e-05,
 'gastroenteritis': 4.6139649338665025e-05,
 'support': 0.00012303906490310673,
 '10ppm': 1.537988311288834e-05,
 'best1': 4.6139649338665025e-05,
 'act': 3.075976622577668e-05,
 'princes': 1.537988311288834e-05,


We now have, for every word in the vocabulary, the Laplace-smoothed probabilities P(word|Spam) and P(word|Ham).

## Classifying a New Message

Now that we have calculated the parameters and constants we need, we can start creating the spam filter. The spam filter will need to execute the following steps:
1. Take a new message (w1, w2, ..., wn) as input
2. Calculate P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
3. Compare the two values calculated above <br>
3.1 if P(Spam|w1, w2, ..., wn) > P(Ham|w1, w2, ..., wn) the message should be classified as spam <br>
3.2 if P(Spam|w1, w2, ..., wn) < P(Ham|w1, w2, ..., wn) the message should be classified as ham <br>
3.3 if P(Spam|w1, w2, ..., wn) = P(Ham|w1, w2, ..., wn) the algorithm may request human help
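
Under the naive assumption that the words are independent given the class, the two quantities in step 2 only need to be computed up to a shared normalizing constant, which is enough for the comparison in step 3:

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$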

Next, let's write a function that classifies the messages. The framework of the following code has been provided by Dataquest, while we implement the calculation logic ourselves.

In [35]:
import re

def classify(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
            
        if word in param_ham:
            p_ham_given_message *= param_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Testing the new function with two messages:

In [36]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 4.1578268320785084e-29
P(Ham|message): 1.284446514392249e-25
Label: Ham


In [37]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4449942502228983e-24
P(Ham|message): 4.287484325865619e-22
Label: Ham


The results are disappointing: the first message is obviously spam, but it is labeled as ham.

In [38]:
train.head()

Unnamed: 0,Label,SMS
1078,ham,"[yep, by, the, pretty, sculpture]"
4028,ham,"[yes, princess, are, you, going, to, make, me,..."
958,ham,"[welp, apparently, he, retired]"
4642,ham,[havent]
4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [39]:
spam_examples = train[train['Label']=='spam']
spam_examples.head()

Unnamed: 0,Label,SMS
147,spam,"[freemsg, why, haven, t, you, replied, to, my,..."
4517,spam,"[congrats, 2, mobile, 3g, videophones, r, your..."
3316,spam,"[free, message, activate, your, 500, free, tex..."
1929,spam,"[call, from, 08702490080, tells, u, 2, call, 0..."
1745,spam,"[someone, has, conacted, our, dating, service,..."


In [40]:
classify('freemsg, why, haven, t, you, replied, to')

P(Spam|message): 9.069855630098013e-24
P(Ham|message): 1.4886102995316914e-21
Label: Ham


In [41]:
classify('congrats, 2, mobile, 3g, videophones')

P(Spam|message): 3.2444574013540275e-19
P(Ham|message): 1.6771026385462929e-18
Label: Ham


In [42]:
classify('free, message, activate, your, 500, free')

P(Spam|message): 1.160848638176028e-21
P(Ham|message): 2.8625610401999425e-19
Label: Ham


It seems that the classify function has a bias towards classifying messages as non-spam. We have to analyze our previous code for mistakes.

In [43]:
len(param_spam)

7783

In [44]:
len(param_ham)

7783

In [45]:
param_spam['secret']

4.3529360553693465e-05

In [46]:
param_ham['secret']

0.0001076591817902184

In [47]:
param_spam['winner']

0.0001305880816610804

In [48]:
param_ham['winner']

0.00016917871424177176

In [51]:
param_spam['free']

0.0007399991294127889

In [52]:
param_ham['free']

0.002476161181175023

There seems to be something off with the values in both dictionaries. We looked up three keywords that should have a higher probability of occurring in a spam message than in a non-spam message. To our surprise, all three words have a higher probability for non-spam messages. Even replacing the original code for initializing the dictionaries with Dataquest's solution code did not change the results. The cause of this problem is unclear up to this point.

## Measuring the Spam Filter's Accuracy

Even though our function seems to be inaccurate and biased towards classifying messages as non-spam, we can still calculate its accuracy.

First, we need a function that is similar to our previous function classify, but just returns either the string 'ham' or 'spam'.

In [58]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Using the param_spam and param_ham dictionaries calculated above
    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]

        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now we can use this function to create a new column in our test set.

In [59]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unnamed: 0,Label,SMS,predicted
2131,ham,Later i guess. I needa do mcat study too.,ham
3418,ham,But i haf enuff space got like 4 mb...,ham
3424,spam,Had your mobile 10 mths? Update to latest Oran...,ham
1538,ham,All sounds good. Fingers . Makes it difficult ...,ham
5393,ham,"All done, all handed in. Don't know if mega sh...",ham


The output shows us the bias towards ham. Let's calculate the percentage of correctly classified messages.

In [62]:
correct = 0
total = test.shape[0]
    
for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 965
Incorrect: 149
Accuracy: 0.8662477558348295


In [67]:
spam_test = test[test['Label']=='spam']
spam_test.shape[0]

147

At 86.6%, the accuracy of our spam filter is better than expected. However, one reason for this high number is the large share of non-spam messages in the test dataset (86.8%). Since the classifying function has a bias towards non-spam messages, it is very likely to get them right, while almost all of the spam messages are labeled wrong. Let's calculate the number of correctly classified spam messages.

In [68]:
spam_test.head()

Unnamed: 0,Label,SMS,predicted
3424,spam,Had your mobile 10 mths? Update to latest Oran...,ham
3895,spam,Dear Dave this is your final notice to collect...,ham
5285,spam,URGENT! You have won a 1 week FREE membership ...,ham
5147,spam,Get your garden ready for summer with a FREE s...,ham
824,spam,25p 4 alfie Moon's Children in need song on ur...,ham


In [70]:
correct = 0
total = spam_test.shape[0]
    
for row in spam_test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 0
Incorrect: 147
Accuracy: 0.0
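
To put the two accuracy figures side by side, the counts printed above are enough to compute the accuracy of a trivial baseline that labels every message as ham. A small sketch (the variable names are for illustration only):

```python
# Counts taken from the outputs above
total_test = 1114
spam_in_test = 147
correct_overall = 965
correct_spam = 0

# Accuracy of always predicting 'ham'
baseline_accuracy = (total_test - spam_in_test) / total_test
# Accuracy of the filter, and the share of spam it actually catches
filter_accuracy = correct_overall / total_test
spam_recall = correct_spam / spam_in_test

print(baseline_accuracy, filter_accuracy, spam_recall)
```

The filter scores slightly below this baseline, so overall accuracy alone says little on a dataset this imbalanced; the share of spam that is caught is the more telling number.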


## Conclusion

As expected, the accuracy of classifying spam messages correctly is 0%. The reason for this is the high probability values in the ham dictionary. Even after rechecking the code for initializing the dictionaries and replacing our code with the solution code from Dataquest, the coding mistakes remain unclear and the results don't change. The spam filter is neither effective nor accurate, since it classifies every spam message in the test set as non-spam. This project is not a success and should be reviewed again in the future.