# Building a Spam Filter with the Naive Bayes Algorithm
In this project, we'll be using the Naive Bayes Algorithm to build a spam filter for SMS messages. The spam filter will work in three main steps.

1. The filter learns how humans classify a message as spam or non-spam.
2. The filter uses those classification criteria to estimate the probability that a new message is spam and the probability that it is not.
3. The filter classifies the message according to the larger probability. If the two probabilities are equal, the filter asks a human to classify the message.

We are going to teach our filter by training it on 5,572 SMS text messages that have already been classified by humans. This data set was put together by Tiago A. Almeida and José María Gómez Hidalgo and can be accessed via the UCI Machine Learning Repository.

In [1]:
import pandas as pd
sms = pd.read_csv("smsspamcollection/SMSSpamCollection", sep="\t", header=None, names=['Label', 'SMS'])
sms

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [2]:
vc = sms['Label'].value_counts()
ham = round(vc['ham']/sms.shape[0]*100,2)
spam = round(vc['spam']/sms.shape[0]*100,2)
print("Out of 5572 SMS messages {}% are real or 'ham' and {}% are spam.".format(ham, spam))

Out of 5572 SMS messages 86.59% are real or 'ham' and 13.41% are spam.


## Training and Testing

Before we develop our spam filter we need to devise a strategy for testing its efficacy. We also need to make sure we give the spam filter enough data to be ready for the test. As a result, we will split our data into two sets: training and testing. The training set will be used to teach the spam filter to tell spam from ham. The test set will measure our spam filter's efficacy.

The data will be split 80%/20% between training and testing. There will be 4,458 training messages and 1,114 testing messages. The spam filter's goal is to classify new messages with an accuracy of 80% or greater. We will be able to verify the accuracy because the testing data, the remaining 20%, has already been classified as spam or ham.

To get this going, we need to shuffle the dataset so that spam and ham messages are spread randomly across both sets.

In [3]:
sample = sms.sample(frac=1, random_state=1)
sample

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...
...,...,...
905,ham,"We're all getting worried over here, derek and..."
5192,ham,Oh oh... Den muz change plan liao... Go back h...
3980,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
235,spam,Text & meet someone sexy today. U can find a d...


In [4]:
eighty_percent = int((5572*.8))
training_set = sample[:eighty_percent+1].copy()
testing_set = sample[eighty_percent:-1].copy()
print(training_set.shape)
print(testing_set.shape)
print(training_set.shape[0] + testing_set.shape[0])
print(eighty_percent)

(4458, 2)
(1114, 2)
5572
4457
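
One detail in the cell above is worth a second look. `sample[:eighty_percent+1]` covers positions 0 through 4457, while `sample[eighty_percent:-1]` starts at position 4457 and stops before the last row. The two sets therefore share one row and the final row is left out, even though the row counts still add up to 5,572. Below is a minimal sketch of a split around a single cut point. The helper name `split_at` is hypothetical, and a plain list stands in for the shuffled dataframe (the same two slices work on a dataframe).

```python
def split_at(rows, train_fraction=0.8):
    """Split rows into two non-overlapping parts at a single cut point."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Stand-in for the 5,572 shuffled messages (row positions only).
rows = list(range(5572))
train_rows, test_rows = split_at(rows)

print(len(train_rows), len(test_rows))    # 4458 1114
print(set(train_rows) & set(test_rows))   # set() -> no shared rows
```

Because both slices use the same cut point, every row lands in exactly one of the two sets.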


In [5]:
training_set.reset_index(inplace=True, drop=True)
testing_set.reset_index(inplace=True, drop=True)

In [6]:
training_v_counts = training_set['Label'].value_counts()
training_ham_count = training_v_counts['ham']
training_spam_count = training_v_counts['spam']
training_ham_percent = round(training_ham_count/training_set.shape[0]*100,2)
training_spam_percent = round(training_spam_count/training_set.shape[0]*100,2)
print("For the Training Set: out of {} rows there are {} counts of ham representing {}% and {} counts of spam representing {}%."
      .format(training_set.shape[0],training_ham_count, training_ham_percent, training_spam_count, training_spam_percent))

For the Training Set: out of 4458 rows there are 3858 counts of ham representing 86.54% and 600 counts of spam representing 13.46%.


In [7]:
testing_v_counts = testing_set['Label'].value_counts()
testing_ham_count = testing_v_counts['ham']
testing_spam_count = testing_v_counts['spam']
testing_ham_percent = round(testing_ham_count/testing_set.shape[0]*100,2)
testing_spam_percent = round(testing_spam_count/testing_set.shape[0]*100,2)
print("For the Testing Set: out of {} rows there are {} counts of ham representing {}% and {} counts of spam representing {}%."
      .format(testing_set.shape[0],testing_ham_count, testing_ham_percent, testing_spam_count, testing_spam_percent))

For the Testing Set: out of 1114 rows there are 967 counts of ham representing 86.8% and 147 counts of spam representing 13.2%.


In the code above, the `sms` dataset was shuffled and then split into two datasets, one for training and one for testing. The row counts confirm the intended proportions: 4,458 rows (80%) went to the training set and 1,114 rows (20%) to the testing set. We also measured the share of ham and spam messages in each set. The original dataset is 86.59% ham and 13.41% spam; `training_set` (86.54%/13.46%) and `testing_set` (86.8%/13.2%) have very similar ratios.
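
The ratio check can be reproduced without a dataframe. This is a small sketch using only the standard library; the helper name `label_shares` is hypothetical, and the label list is rebuilt from the training-set counts printed above rather than read from the data.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total as a percentage, rounded to 2 places."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(count / total * 100, 2) for label, count in counts.items()}

# Counts taken from the training-set output above: 3,858 ham and 600 spam.
training_labels = ["ham"] * 3858 + ["spam"] * 600
print(label_shares(training_labels))   # {'ham': 86.54, 'spam': 13.46}
```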

## Cleaning Our Data
With our training and test data sorted, we need to look at how to extract the probabilities that a given message is spam or ham. To do this, we first clean the messages so that only the words remain: we replace punctuation with spaces and convert every letter to lower case.

In [8]:
import re
def cleaner(text):
    # Replace every non-word character with a space, then convert to lower case
    return re.sub(r"\W", " ", text).lower()

sms['SMS'] = sms['SMS'].apply(cleaner)
sms

Unnamed: 0,Label,SMS
0,ham,go until jurong point crazy available only ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i don t think he goes to usf he lives aro...
...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...
5568,ham,will ü b going to esplanade fr home
5569,ham,pity was in mood for that so any other s...
5570,ham,the guy did some bitching but i acted like i d...
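
To see what the cleaning step does to a single string, here is a small self-contained sketch. `clean_text` is a hypothetical stand-in that mirrors `cleaner`. Note that `\W` keeps letters (including ones like `ü`), digits, and underscores, so an apostrophe splits a word in two, which is why "don't" shows up as "don t" in the table above.

```python
import re

def clean_text(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r"\W", " ", text).lower()

print(clean_text("I don't think so"))               # i don t think so
print(clean_text("WINNER!! Text 2 claim").split())  # ['winner', 'text', '2', 'claim']
```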


## Creating the Vocabulary and Training Set
Now that the SMS messages have been cleaned so that they contain only lower-case words, we can begin to compile the vocabulary for the dataset. As a refresher, a vocabulary is a list of all the unique words in the dataset. Once we have created our vocabulary, we can count how often each word appears in each message. By comparing how frequently a word appears in spam messages with how frequently it appears in ham messages, we'll be able to predict whether a given message is ham or spam. This is the fundamental logic of how we'll train the algorithm.

Let's get started by building our vocabulary.

In [9]:
training_set['SMS'] = training_set['SMS'].apply(cleaner)
testing_set['SMS'] = testing_set['SMS'].apply(cleaner)

In [10]:
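vocab = []
seen = set()  # set lookups are fast; `word not in vocab` would rescan the whole list

training_set['word_list'] = training_set['SMS'].str.split()
for words in training_set['word_list']:  # `words`, so the `sms` dataframe is not overwritten
    for word in words:
        if word not in seen:
            seen.add(word)
            vocab.append(word)
len(vocab)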

7783
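
An equivalent way to build an ordered list of unique words, shown here on a toy list of messages rather than the training set, is `dict.fromkeys`. It is offered as an alternative sketch, and the names `messages` and `toy_vocab` are illustrative.

```python
messages = [
    "yep by the pretty sculpture",
    "yes princess",
    "by the way yes",
]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so the result matches an append-if-unseen loop.
toy_vocab = list(dict.fromkeys(word for message in messages for word in message.split()))

print(toy_vocab)
print(len(toy_vocab))   # 8
```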

Now that our vocabulary is set, we can build a dictionary of word frequencies for every message. This part is a little tricky.

In [11]:
# Create a dictionary with one key per vocabulary word. Each value is a list holding
# one 0 per training message. We then loop through the messages and, for every word,
# increment that word's count at the index of the message it appears in.
words_freq_sms = {unique_word: [0]*len(training_set['word_list']) for unique_word in vocab}

for index, words in enumerate(training_set['word_list']):
    for word in words:
        words_freq_sms[word][index] += 1

df_words_per_sms = pd.DataFrame(words_freq_sms)
final_training_set = pd.concat([training_set, df_words_per_sms], axis=1)
final_training_set.head()

Unnamed: 0,Label,SMS,word_list,yep,by,the,pretty,sculpture,yes,princess,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ham,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,havent,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
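
The same counting idea on a toy example, built row by row with `collections.Counter` instead of column by column. This is a sketch with made-up messages; `toy_messages`, `toy_vocab`, and `count_rows` are illustrative names.

```python
from collections import Counter

toy_messages = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]
toy_vocab = ["free", "entry", "ok", "lar"]

# One row of counts per message, one entry per vocabulary word.
count_rows = []
for words in toy_messages:
    counts = Counter(words)
    count_rows.append([counts[word] for word in toy_vocab])

for row in count_rows:
    print(row)
# [2, 1, 0, 0]
# [0, 0, 1, 1]
# [1, 0, 1, 0]
```

Each row sums to the number of words in its message, which is the property the `total_words_per_sms` column relies on later.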


## Calculating Probability Constants
In the code above we developed two crucial components for our spam filter: the vocabulary and the per-message word counts. To calculate the probability that a message is spam or ham, we need both.

Besides P(Spam) and P(Ham), we also need three counts: NSpam, NHam, and NVocabulary. NSpam is the total number of words in all the spam messages. NHam is the total number of words in all the non-spam messages. NVocabulary is the number of unique words in the vocabulary, which covers both spam and ham messages.
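
As a quick illustration of these three counts, here is a sketch on three made-up labelled messages. The variable names ending in `_toy` are illustrative and are not used elsewhere in the notebook.

```python
labelled = [
    ("spam", ["free", "entry", "free"]),
    ("ham", ["ok", "lar"]),
    ("ham", ["free", "ok"]),
]

# Total words per class (repeats count), and unique words overall.
n_spam_toy = sum(len(words) for label, words in labelled if label == "spam")
n_ham_toy = sum(len(words) for label, words in labelled if label == "ham")
n_vocab_toy = len({word for _, words in labelled for word in words})

print(n_spam_toy, n_ham_toy, n_vocab_toy)   # 3 4 4
```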

In [12]:
count_spam = len(final_training_set[final_training_set['Label']=="spam"])
p_spam = count_spam/len(final_training_set)
p_ham = 1-p_spam
print(p_spam, p_ham)

0.13458950201884254 0.8654104979811574


In [13]:
final_training_set['total_words_per_sms'] = final_training_set.iloc[:,3:].sum(axis=1)
final_training_set.head()

Unnamed: 0,Label,SMS,word_list,yep,by,the,pretty,sculpture,yes,princess,...,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre,total_words_per_sms
0,ham,yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,5
1,ham,yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,9
2,ham,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
3,ham,havent,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,26


In [14]:
nspam = final_training_set[final_training_set['Label']=="spam"]['total_words_per_sms'].sum()
nham = final_training_set[final_training_set['Label']=="ham"]['total_words_per_sms'].sum()
nvocab = len(vocab)
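# Sanity check: the per-class word totals should add up to the overall word total
i = final_training_set['total_words_per_sms'].sum()
v = nspam + nham
alpha = 1  # Laplace smoothing parameter
print(v,i)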
final_training_set.head()

72427 72427


Unnamed: 0,Label,SMS,word_list,yep,by,the,pretty,sculpture,yes,princess,...,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre,total_words_per_sms
0,ham,yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,5
1,ham,yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,9
2,ham,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
3,ham,havent,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,26


Because the values above are constants in our program, it makes sense to name them as such. The code below reassigns them to upper-case names, following Python's naming convention for constants.

In [15]:
N_SPAM = nspam
N_HAM = nham
N_VOCAB = nvocab
P_SPAM = p_spam
P_HAM = p_ham

## Calculating the Parameters
Although we have calculated the constants for the training set, P(word|Spam) and P(word|Ham) vary from word to word. Using the training data we can calculate these parameters once and treat them as constants as well, provided we make no changes to the training dataset. The vocabulary contains 7,783 unique words, and each word needs both P(word|Spam) and P(word|Ham), so we have 15,566 probabilities to calculate.
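
Written out, the two parameters take the form below, using additive (Laplace) smoothing with α = 1. The symbols N_{w|Spam} and N_{w|Ham} are notation assumed here for the number of times the word w occurs across all spam messages and all ham messages, respectively.

```latex
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```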

In [16]:
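def p_w_given_spam(wc_given_spam):
    # Laplace smoothing with alpha = 1: (count + 1) / (N_SPAM + 1 * N_VOCAB)
    return (wc_given_spam+1)/(N_SPAM+1*N_VOCAB)

def p_w_given_ham(wc_given_ham):
    # Same formula, using the ham word total
    return (wc_given_ham+1)/(N_HAM+1*N_VOCAB)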

p_word_spam ={unique_word: 0 for unique_word in vocab}

spam_set = final_training_set[final_training_set['Label']=="spam"]

for word in vocab:
    wc = spam_set[word].sum()
    p_word_spam[word] = p_w_given_spam(wc)


In [17]:
p_word_ham = {unique_word: 0 for unique_word in vocab}

ham_set = final_training_set[final_training_set['Label']=="ham"]

for word in vocab:
    wc = ham_set[word].sum()
    p_word_ham[word] = p_w_given_ham(wc)
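
The `+1` in the functions above matters most for words that never appear in one of the classes. Here is a small worked sketch with toy numbers (not taken from the dataset); `smoothed_probability` is a hypothetical generalization of the two functions above with the smoothing constant exposed as `alpha`.

```python
def smoothed_probability(word_count, n_class_words, n_vocab, alpha=1):
    """Additive (Laplace) smoothing: (count + alpha) / (N_class + alpha * N_vocab)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocab)

# Toy numbers: a class with 3 words in total and a vocabulary of 4 words.
print(smoothed_probability(2, 3, 4))   # (2 + 1) / (3 + 4) = 3/7
print(smoothed_probability(0, 3, 4))   # (0 + 1) / (3 + 4) = 1/7, not zero
```

Without smoothing, a single unseen word would multiply the whole product for that class by zero.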

## Classifying a New Message
Now that we have our constants and the parameters of our training data, we can work on building the function that allows the filter to classify a given message as spam or ham.

In [18]:
def classify(message):
    # Clean the message the same way as the training data: replace non-word
    # characters with spaces, convert to lower case, then split into words
    message = re.sub(r"\W", " ", message)
    message = message.lower()
    s_message = message.split()
    
    #calculate P(Spam|Message)
    p_given_word = 1
    for word in s_message:
        if word in p_word_spam:            
            p_given_word *= p_word_spam[word]
    p_spam_given_message = P_SPAM * p_given_word
    
    #calculate P(Ham|Message)
    p_g_w = 1
    for word in s_message:
        if word in p_word_ham:
            p_g_w *= p_word_ham[word]
    p_ham_given_message = P_HAM * p_g_w
    
    #Compare P(Spam|Message) to P(Ham|Message) and return a classification label
    if p_spam_given_message > p_ham_given_message:
        return "spam"
    elif p_spam_given_message < p_ham_given_message:
        return "ham"
    else:
        return "Undetermined. Need some human help, beep bop."
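
Multiplying many small probabilities can underflow to `0.0` when a message is long, in which case both scores tie. A common workaround is to add logarithms instead of multiplying. The sketch below illustrates the idea with toy parameters; `log_score` and the `toy_` dictionaries are hypothetical and are not part of the filter above.

```python
import math

def log_score(words, prior, word_probs):
    """Sum of logs instead of a product of probabilities; unknown words are skipped."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:
            score += math.log(word_probs[word])
    return score

# Toy parameters (hypothetical values, not the ones learned above).
toy_p_word_spam = {"free": 0.30, "win": 0.20, "ok": 0.01}
toy_p_word_ham = {"free": 0.02, "win": 0.01, "ok": 0.25}

words = ["free", "win", "prize"] * 400   # a very long message; "prize" is unseen

# The plain product collapses to zero for a message this long.
product = 0.5
for word in words:
    if word in toy_p_word_spam:
        product *= toy_p_word_spam[word]
print(product)   # 0.0

# The log scores stay finite and can still be compared.
spam_score = log_score(words, 0.5, toy_p_word_spam)
ham_score = log_score(words, 0.5, toy_p_word_ham)
print("spam" if spam_score > ham_score else "ham")   # spam
```

Because the logarithm is increasing, comparing the two log scores gives the same label as comparing the two products whenever the products do not underflow.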

In [19]:
message_1 = "WINNER!! This is the secret code to unlock the money: C3421."
message_2 = "Sounds good, Tom, then see u there"

classify(message_1)

'spam'

In [20]:
classify(message_2)

'ham'

## Testing the Algorithm and Measuring the Accuracy
Above, we built a function that takes a message, estimates the probability that it is spam or ham, and labels it accordingly. Below, we run the filter on our `testing_set` and compare its labels with the human-provided ones to see whether we reach our goal of 80% or greater accuracy.

In [21]:
testing_set['machine_classification'] = testing_set['SMS'].apply(classify)
testing_set.head()

Unnamed: 0,Label,SMS,machine_classification
0,ham,wherre s my boytoy,ham
1,ham,later i guess i needa do mcat study too,ham
2,ham,but i haf enuff space got like 4 mb,ham
3,spam,had your mobile 10 mths update to latest oran...,spam
4,ham,all sounds good fingers makes it difficult ...,ham


In [22]:
total_messages = len(testing_set)
# Count the rows where the human label matches the filter's label
correct = int((testing_set['Label'] == testing_set['machine_classification']).sum())

accuracy = correct/total_messages
accuracy

0.9874326750448833

## Conclusions
The Naive Bayes filter built from the training data classified 98.7% of the test messages correctly, well above our 80% goal. This is a great start, but the filter needs additional testing. This dataset has the luxury of already being labeled as spam or ham, and so far we have only evaluated the filter on one slice of it. To really validate our model, we need to test it on additional data. Further testing could include running the filter against a list made up entirely of known spam messages or known ham messages. With this, we can analyze the frequency of false positives (ham messages classified as spam) and false negatives (spam messages classified as ham).
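
As a starting point for that analysis, false positives and false negatives can be counted directly from two lists of labels. This is a minimal sketch on made-up labels, not results from the test set; `confusion_counts` is a hypothetical helper, and any prediction other than "spam" (including an undetermined one) is treated as "not flagged as spam".

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the positive class."""
    counts = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["TN" if truth != positive else "FN"] += 1
    return counts

# Toy labels for illustration only.
truth = ["ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham"]
result = confusion_counts(truth, predicted)
print(result["TP"], result["FP"], result["TN"], result["FN"])   # 1 1 2 1
```

In the notebook, the same function could be called with `testing_set['Label']` and `testing_set['machine_classification']` as its two arguments.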