# Building a Spam Filter with Naive Bayes

This project builds a system that automatically detects spam SMS messages. We teach the computer to do so using a set of over five thousand SMS messages that humans have already classified as spam or non-spam ("ham").

The dataset from The UCI Machine Learning Repository can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). We begin by reading in the dataset as a pandas dataframe.


In [1]:
import pandas as pd
SMS_spam = pd.read_csv('SMSSpamCollection', sep='\t', header = None, names = ['Label', 'SMS'])
print(SMS_spam.shape)
print(SMS_spam.head())
print(SMS_spam['Label'].value_counts(normalize=True)*100)

(5572, 2)
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
ham     86.593683
spam    13.406317
Name: Label, dtype: float64


We can see the general structure of the dataframe. It is relatively simple, with only two columns, one indicating whether or not it is spam, and another containing the string of the text. 

There's a healthy 5,572 rows. Around 13.4% of the rows are spam, while the remaining 86.6% are non-spam ("ham").

## Create Training Set and Test Set

We need to check whether our model is effective, but the only human-classified data we can test it against is the same data we will use to train it. So we set aside a random 20% of the data to test the model after the fact, and train the model on the remaining 80%.

In [2]:
SMS_full_sample = SMS_spam.sample(frac=1, random_state=1) 
#random-order of dataset

print('80% of the dataset is {} rows.'.format(len(SMS_full_sample)*0.8))


80% of the dataset is 4457.6 rows.


We shuffled the dataset and found that 80% of it is 4,457.6 rows, which we round to 4,458. The training set therefore takes the first 4,458 rows of the shuffled data (because the order is already random, taking the first 80% of rows is not a biased selection). The remaining 1,114 rows form the test set.

In [3]:
train_set = SMS_full_sample[:4458].reset_index()
test_set = SMS_full_sample[4458:].reset_index()
# reset_index() renumbers the rows from 0; the shuffled original index is kept in a new 'index' column.

print(train_set['Label'].value_counts(normalize=True)*100)
print(test_set['Label'].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


We can see that the value counts for both the training set and test set are about the same, still 86 or 87% versus 13%. This resembles the full dataset, so we can proceed.
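The split point does not have to be typed in by hand. The following is a minimal sketch, on a hypothetical ten-row toy dataframe rather than the real dataset, of deriving the cut-off from the data size. The names `toy`, `shuffled`, `split_at`, `toy_train` and `toy_test` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the dataset (hypothetical data, 10 rows).
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Shuffle all rows reproducibly, as the notebook does.
shuffled = toy.sample(frac=1, random_state=1)

# Derive the cut-off from the data size instead of hard-coding it.
split_at = round(len(shuffled) * 0.8)

# drop=True discards the old index instead of keeping it as an 'index' column
# (the notebook keeps it; either choice works for the later steps).
toy_train = shuffled[:split_at].reset_index(drop=True)
toy_test = shuffled[split_at:].reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

Applied to the 5,572 rows of the real dataset, the same `round()` call gives the 4,458 used above.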

## Data Cleaning the Training Set

We first need to clean and standardize the format of the SMS texts so we can easily interpret them in our analysis. We start by removing the punctuation and converting all words to lower case.

In [4]:
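# Raw string avoids an invalid-escape warning; regex=True is required on newer pandas versions.
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()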

print(train_set.head())

   index Label                                                SMS
0   1078   ham                       yep  by the pretty sculpture
1   4028   ham      yes  princess  are you going to make me moan 
2    958   ham                         welp apparently he retired
3   4642   ham                                            havent 
4   4674   ham  i forgot 2 ask ü all smth   there s a card on ...


We can see in the first few rows that the punctuation is gone and all text is lower case.
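To make the effect of the `\W` pattern concrete, here is a small sketch on a single made-up message (the text in `sample` is hypothetical, not taken from the dataset). Every character that is not a letter, digit or underscore becomes a space, so apostrophes and hyphens also split words.

```python
import re

sample = "WINNER!! Call 0800-123 now, it's FREE."  # hypothetical message

# Replace every non-word character with a space, then lower-case.
cleaned = re.sub(r'\W', ' ', sample).lower()

# Splitting on whitespace yields the individual tokens.
tokens = cleaned.split()

print(tokens)
```

Note that "it's" ends up as the two tokens `it` and `s`, which matches the `there s a card` fragment visible in the cleaned training set.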

Next we need to create a vocabulary of all words found within the SMS data.

In [5]:
train_set['SMS'] = train_set['SMS'].str.split()
# Try not to run this more than once

In [6]:
vocabulary = []
for text in train_set['SMS']:
    for word in text:
        vocabulary.append(word)
        
vocabulary = set(vocabulary)
#Turns the list of words into a set, removing duplicates.

vocabulary = list(vocabulary)
#Turns the set of words back into a list, the desired format but still no duplicates.

len(vocabulary)

7783

We can see there are 7,783 unique words within our training set. For each of these words we want a list with one entry per SMS message in the training set (4,458 entries), so that each value in the list is the count of that word in one message.

We create a blank version of this by making a dictionary, where each word is a key and each value is a list of 4,458 zeroes, and then fill in the counts message by message.

In [7]:
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS']) for unique_word in vocabulary}

for index, text in enumerate(train_set['SMS']):
    for word in text:
        word_counts_per_sms[word][index] += 1

word_counts_df = pd.DataFrame(word_counts_per_sms)

In [8]:
train_clean = pd.concat([train_set, word_counts_df], axis=1)
print(train_clean.head())

   index Label                                                SMS  0  00  000  \
0   1078   ham                  [yep, by, the, pretty, sculpture]  0   0    0   
1   4028   ham  [yes, princess, are, you, going, to, make, me,...  0   0    0   
2    958   ham                    [welp, apparently, he, retired]  0   0    0   
3   4642   ham                                           [havent]  0   0    0   
4   4674   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...  0   0    0   

   000pes  008704050406  0089  01223585334 ...  zindgi  zoe  zogtorius  zouk  \
0       0             0     0            0 ...       0    0          0     0   
1       0             0     0            0 ...       0    0          0     0   
2       0             0     0            0 ...       0    0          0     0   
3       0             0     0            0 ...       0    0          0     0   
4       0             0     0            0 ...       0    0          0     0   

   zyada  é  ú1  ü  〨ud  鈥  
0  

Now each row consists of the label, the words within it, and a word count of every single unique word in the dataset.
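The same construction is easier to follow on a tiny example. This sketch uses three hypothetical, already tokenized messages; `toy_messages`, `toy_vocab`, `counts` and `toy_counts_df` are illustrative names only.

```python
import pandas as pd

# Three hypothetical, already cleaned and split messages.
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'soon'], ['free', 'you']]

# Unique words across all messages (sorted only to make the column order predictable).
toy_vocab = sorted({word for message in toy_messages for word in message})

# One list per word, one slot per message, all starting at zero.
counts = {word: [0] * len(toy_messages) for word in toy_vocab}

# Fill in the counts message by message.
for i, message in enumerate(toy_messages):
    for word in message:
        counts[word][i] += 1

# Each column is a word, each row is a message.
toy_counts_df = pd.DataFrame(counts)
print(toy_counts_df)
```

The resulting frame has one row per message and one column per vocabulary word, which is the shape of `word_counts_df` above, only much smaller.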

## Calculating Constant Variables

Some of the values we need for our model vary from word to word, but others, such as the overall probability of a message being spam versus non-spam, are constant. We calculate the constants first.

In [9]:
ct_spam = train_clean[train_clean['Label'] == 'spam']['Label'].count()
ct_ham = train_clean[train_clean['Label'] == 'ham']['Label'].count()
#count the number of rows that are spam and ham

p_spam = ct_spam/len(train_clean)
p_ham = ct_ham/len(train_clean)
# divide by total for probabilities

n_vocab = len(vocabulary)
 
n_words_per_spam_message = train_clean[train_clean['Label'] == 'spam']['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = train_clean[train_clean['Label'] == 'ham']['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

print(p_spam)
print(p_ham)
print(n_vocab)
print(n_spam)
print(n_ham)

alpha = 1
# For the Laplace smoothing

0.13458950201884254
0.8654104979811574
7783
15190
57237


## Calculating Parameters

For each of the 7,783 words in our vocabulary, we need to calculate two probabilities, P(w|Spam) and P(w|Ham), where w stands for the word.

We can create two dictionaries, one for 'Spam' and one for 'Ham' that will hold all of these parameters. We initialize these as blank, with each key being a unique vocab word and each value being zero.

In [11]:
train_spam = train_clean[train_clean['Label'] == 'spam']
train_ham = train_clean[train_clean['Label'] == 'ham']

parameters_spam ={word: 0 for word in vocabulary}
parameters_ham ={word: 0 for word in vocabulary}
#Initialized parameters, all zero values.

for word in vocabulary:
    n_wi_spam = train_spam[word].sum()
    n_wi_ham = train_ham[word].sum()
    
    parameters_spam[word] = (n_wi_spam + alpha)/(n_spam + (alpha * n_vocab))
    parameters_ham[word] = (n_wi_ham + alpha)/(n_ham + (alpha * n_vocab))

We looped through each word in the vocabulary and calculated its parameter for both 'Spam' and 'Ham' (non-spam), using Laplace smoothing with alpha = 1.
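Written out as a formula, the value stored for each word corresponds to the expressions in the loop above. The symbols are chosen here to mirror the code variables (`n_wi_spam`, `n_spam`, `n_vocab`, `alpha`) and are a notation assumption, not something the notebook defines.

```latex
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
```

With the constants printed earlier (15,190 words in spam messages, 7,783 vocabulary words, alpha = 1), a word that never occurs in a spam message still gets P(w|Spam) = 1 / (15190 + 7783) = 1/22973, roughly 4.35e-5, rather than zero. This is the purpose of the smoothing term: a single unseen word cannot force the whole product to zero.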

## Classifying New Messages

Now that we have our parameters set up, we need a function to take in a string input and output whether it is more likely 'spam' or 'ham', based on our current probabilities. 
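The comparison this function relies on can be written as follows, where w_1 to w_n are the words of the message that appear in the vocabulary. The notation is an assumption made for illustration.

```latex
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
\qquad
P(Ham \mid w_1, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
```

Both sides omit the same normalizing constant, so the two products are proportional to the probabilities rather than equal to them. That is enough for classification, because only the larger of the two matters.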

In [12]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

We have created a function, classify(), that takes a string and reports whether it is more likely spam or non-spam. Let's test it with two new strings.

In [13]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


Correct! For these two examples, the first seems to have been properly identified as spam and the second as non-spam. We can now apply this to our test set to see our success rate.
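The values printed above are already around 1e-25 for short messages, and multiplying many small probabilities can eventually underflow to 0.0 in floating point for much longer texts. A common variant, not used in this notebook, is to add logarithms instead of multiplying. The sketch below uses hypothetical toy parameters (`toy_p_spam`, `toy_params_spam`, and so on) rather than the values learned above.

```python
import math

# Hypothetical toy parameters, not the values learned in this notebook.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {'free': 0.02, 'prize': 0.01, 'you': 0.005}
toy_params_ham = {'free': 0.001, 'prize': 0.0005, 'you': 0.02}

def log_scores(words):
    """Return (log-score for spam, log-score for ham) for a list of tokens."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_params_spam:
            log_spam += math.log(toy_params_spam[word])
        if word in toy_params_ham:
            log_ham += math.log(toy_params_ham[word])
    return log_spam, log_ham

log_spam, log_ham = log_scores(['free', 'prize', 'unknownword'])
print('spam' if log_spam > log_ham else 'ham')
```

Because the logarithm is monotonic, comparing the two sums gives the same label as comparing the two products.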

## Testing Accuracy on Test Set

We need to revise our classify function to return the results rather than print them, so we can calculate overall statistics.

In [16]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
    

Now that we have our function, we can use apply() to run it over the entire test set and store the results in a new column.

In [17]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

| | index | Label | SMS | predicted |
|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | ham |


Now that we have the new column, we just have to count the rows in which our predicted value, spam or ham, matches the human-determined label.

In [25]:
correct = 0
total = len(test_set)

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
print(correct)
print(100*correct/total)

1100
98.74326750448833


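We find our program to be 98.74% accurate on the test set, correctly classifying all but 14 of the 1,114 test SMS messages. For comparison, labelling every message as ham would score about 86.8% on this test set. This is a great success rate, and much higher than expected!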
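The same count can be obtained without an explicit loop, and a cross-tabulation shows which kind of mistake is being made. This is a sketch on a hypothetical six-row frame (`toy_results`), not the notebook's test set.

```python
import pandas as pd

# Hypothetical labels and predictions for six messages.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham', 'spam'],
})

# The mean of a boolean column is the fraction of True values, i.e. the accuracy.
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Rows are the human labels, columns are the predictions.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

print(toy_accuracy)
print(confusion)
```

The off-diagonal cells separate ham wrongly flagged as spam from spam that slipped through, which a single accuracy figure hides.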

Let's quickly take a look at the 14 messages incorrectly identified.

In [26]:
wrong = test_set[test_set['Label'] != test_set['predicted']]
wrong.head(14)

| | index | Label | SMS | predicted |
|---|---|---|---|---|
| 114 | 3460 | spam | Not heard from U4 a while. Call me now am here... | ham |
| 135 | 1940 | spam | More people are dogging in your area now. Call... | ham |
| 152 | 3890 | ham | Unlimited texts. Limited minutes. | spam |
| 159 | 991 | ham | 26th OF JULY | spam |
| 284 | 4862 | ham | Nokia phone is lovly.. | spam |
| 293 | 2370 | ham | A Boy loved a gal. He propsd bt she didnt mind... | needs human classification |
| 302 | 326 | ham | No calls..messages..missed calls | spam |
| 319 | 5046 | ham | We have sent JD for Customer Service cum Accou... | spam |
| 504 | 3864 | spam | Oh my god! I've found your number again! I'm s... | ham |
| 546 | 4676 | spam | Hi babe its Chloe, how r u? I was smashed on s... | ham |


## Conclusions

By looking at the 14 messages our system failed to classify correctly, we can see some possible reasons why.

* It sometimes flags ham messages that mention phones and calls as spam, likely because this subject matter is common in spam.
* Non-spam messages with spelling errors are also occasionally misidentified as spam.
* Spam messages that try to come off as a friend speaking (e.g. 'Oh my god! I've found your number again!...') and seem more human slide past the filter.

This project was a great look at how probabilities can be used in a practical way: simple arithmetic has produced a highly accurate text filter. Some of the principles used here are central to machine learning.