# Building a Spam Filter with Naive Bayes

We will build an SMS spam filter for this project, using a premade dataset put together by Tiago A. Almeida and José María Gómez Hidalgo. It can be downloaded from the UCI Machine Learning Repository. You can also download the dataset directly from this link. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.

## Exploring the Dataset

In [1]:
import pandas as pd

SMS_data = pd.read_csv('SMSSpamCollection', header=None, sep='\t', names=['Label', 'SMS'])

In [2]:
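# Note: describe is referenced here without parentheses, so this prints the
# bound method's repr (which includes the DataFrame rows) rather than summary statistics.
print(SMS_data.describe)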

<bound method DataFrame.describe of      Label                                                SMS
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6      ham  Even my brother is not like to speak with me. ...
7      ham  As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...
10     ham  I'm gonna be home soon and i don't want to tal...
11    spam  SIX chances to win CASH! From 100 to 20,000 po...
12    spam  URGENT! You have won a 1 week FREE membership ...
13     ham  I've been searching for the right words to tha...
14     ham                I HAVE A

In [3]:
# value_counts() shows that 747 of the messages are spam
SMS_data['Label'].value_counts()

# Proportion of messages that are spam
print(747/5572)

0.13406317300789664


As a quick look at the dataset shows, we have 5,572 text messages, 747 of which are spam. That is roughly 13.4% of the dataset.
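The spam share can also be read straight from pandas instead of being typed in by hand. The sketch below uses a small made-up DataFrame (the rows are hypothetical, not taken from the real dataset) to show `value_counts(normalize=True)`, which returns proportions rather than raw counts.

```python
import pandas as pd

# Toy stand-in for SMS_data (hypothetical rows, not from the real dataset)
toy = pd.DataFrame({
    'Label': ['ham', 'ham', 'spam', 'ham'],
    'SMS': ['ok lar', 'see you soon', 'WIN cash now', 'on my way'],
})

# normalize=True returns each label's share of the rows instead of a raw count
label_share = toy['Label'].value_counts(normalize=True)
print(label_share)
```

Applied to the real `SMS_data['Label']` column, the same call should report the spam proportion computed above.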

## Training and Test Set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).


For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

In [4]:
SMS_data_randomized = SMS_data.sample(frac=1, random_state=1)

In [5]:
SMS_data_training = SMS_data_randomized[0:4458]
# Training data will be roughly 80% of the data

SMS_data_testing = SMS_data_randomized[4458:]
# Testing data will be roughly 20% of the data

SMS_data_training['Label'].value_counts()
print(600/4458)
# The proportion of spam in the training sample is nearly the same as in the entire dataset

SMS_data_testing['Label'].value_counts()
print(147/1114)

0.13458950201884254
0.1319569120287253


In [6]:
# Resetting the index labels
SMS_data_training.reset_index(inplace=True)


In [7]:
SMS_data_testing.reset_index(inplace=True)

After randomizing and splitting the dataset, we can see that the training and test samples both come out to roughly 13% spam.
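Slicing a DataFrame and then assigning to a column of the slice can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it is to take an explicit `.copy()` of each slice. The sketch below shows this on a made-up ten-row frame; the names `toy`, `toy_train`, and `toy_test` are hypothetical. It also uses `reset_index(drop=True)`, which discards the old index rather than keeping it as an extra `index` column, so it differs slightly from the cells above.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for SMS_data
toy = pd.DataFrame({
    'Label': ['ham', 'spam'] * 5,
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle all rows reproducibly, then compute the 80% cut point
shuffled = toy.sample(frac=1, random_state=1)
split_at = round(len(shuffled) * 0.8)

# .copy() yields independent frames; drop=True discards the shuffled index
toy_train = shuffled.iloc[:split_at].copy().reset_index(drop=True)
toy_test = shuffled.iloc[split_at:].copy().reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

For the real dataset, `round(5572 * 0.8)` gives the same 4,458 cut point used above.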

## Letter Case and Punctuation

To make the data easier to work with and to simplify the probability calculations, we'll create new columns based on the vocabulary of the SMS messages. Before we can do this, though, we need to strip out punctuation and convert all words to lower case.

In [8]:
SMS_data_training.head(5)

Unnamed: 0,index,Label,SMS
0,1078,ham,"Yep, by the pretty sculpture"
1,4028,ham,"Yes, princess. Are you going to make me moan?"
2,958,ham,Welp apparently he retired
3,4642,ham,Havent.
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [9]:
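# Raw string plus regex=True so that \W is treated as a regular expression
# (recent pandas versions treat the pattern as a literal string by default)
SMS_data_training['SMS'] = SMS_data_training['SMS'].str.replace(r'\W', ' ', regex=True)
SMS_data_training['SMS'] = SMS_data_training['SMS'].str.lower()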




In [10]:
SMS_data_training.head(5)

Unnamed: 0,index,Label,SMS
0,1078,ham,yep by the pretty sculpture
1,4028,ham,yes princess are you going to make me moan
2,958,ham,welp apparently he retired
3,4642,ham,havent
4,4674,ham,i forgot 2 ask ü all smth there s a card on ...


Comparing the `head` output before and after applying the replace and lower functions, we can see that the SMS messages have been simplified successfully.
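To see what the cleaning step does to a single message, here is a minimal sketch using Python's `re` module on a plain string. The helper name `clean_message` is hypothetical. Note that `\W` matches anything that is not a letter, digit, or underscore, and in Python 3 it is Unicode-aware, so characters such as `ü` are kept.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

cleaned = clean_message("Yep, by the pretty sculpture")
print(cleaned)          # the comma becomes a space, leaving a double space
print(cleaned.split())  # split() collapses runs of whitespace
```

The leftover double spaces are harmless because the later `split()` call ignores them.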

## Creating the Vocabulary

Next, we need a list of all the unique words across the SMS messages. We need this before we can transform our dataset so that each vocabulary word becomes a column whose cells hold that word's frequency in each message.

In [11]:
SMS_data_training['SMS'] = SMS_data_training['SMS'].str.split()



In [12]:
SMS_data_training.head(3)

Unnamed: 0,index,Label,SMS
0,1078,ham,"[yep, by, the, pretty, sculpture]"
1,4028,ham,"[yes, princess, are, you, going, to, make, me,..."
2,958,ham,"[welp, apparently, he, retired]"


In [13]:
# Initializing list and looping to add each SMS message word to it
vocabulary = []
for sms in SMS_data_training['SMS']:
    for word in sms:
        vocabulary.append(word)
        

In [14]:
vocabulary = set(vocabulary) # Remove duplicates

In [15]:
vocabulary = list(vocabulary)

In [16]:
len(vocabulary)

7783

It appears we have 7,783 unique words across all of the SMS messages in our training dataset.
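The loop, `set`, and `list` steps can also be written as a single set comprehension. This is only an alternative sketch on made-up tokenized messages (the `tokenized` list is hypothetical); sorting the result is optional but makes the column order reproducible.

```python
# Hypothetical tokenized messages standing in for SMS_data_training['SMS']
tokenized = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['welp', 'apparently', 'he', 'retired'],
    ['by', 'the', 'way'],
]

# The set comprehension removes duplicates; sorted() returns a list
toy_vocabulary = sorted({word for sms in tokenized for word in sms})
print(len(toy_vocabulary))
```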

## The Final Training Set

We will now build a dictionary that maps each word in the vocabulary to a list of counts, one per message, recording how often that word appears in each message.

In [17]:
word_counts_per_sms = {unique_word: [0] * len(SMS_data_training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(SMS_data_training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [18]:
# Turning our dictionary into a DataFrame
word_counts = pd.DataFrame(word_counts_per_sms)

In [19]:
training_set_clean = pd.concat([SMS_data_training, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
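The word-count structure is easier to inspect on a tiny example. The sketch below builds the same word-to-list-of-counts mapping for two made-up messages, using `collections.Counter` for the per-message tally. `Counter` is a substitution chosen for illustration, not what the cells above use.

```python
from collections import Counter

# Two hypothetical tokenized messages
tokenized = [['free', 'entry', 'free'], ['see', 'you']]
vocab = sorted({w for sms in tokenized for w in sms})

# word -> list with one count per message, same shape as word_counts_per_sms
counts = {w: [0] * len(tokenized) for w in vocab}
for i, sms in enumerate(tokenized):
    for w, n in Counter(sms).items():
        counts[w][i] = n

print(counts)
```

Each list has one slot per message, so passing this dictionary to `pd.DataFrame` would yield one row per message and one column per word.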


## Calculating Constants

We will need to calculate some constants in order to complete our spam filter. These initial values are:

- P(Spam) and P(Ham)
- Nspam, Nham, and Nvocabulary
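For reference, these quantities fit together as follows in a multinomial Naive Bayes classifier with additive smoothing. The notation is assumed here for illustration: $w_i$ is the $i$-th word of a message, $N_{w_i \mid \text{Spam}}$ is the number of times $w_i$ occurs in spam messages, and $\alpha$ is the smoothing constant. The Ham versions are analogous.

$$
P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
$$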

In [20]:
# Separating out the 'Spam' and 'Ham' messages
spam_messages = training_set_clean[training_set_clean["Label"] == 'spam']
ham_messages = training_set_clean[training_set_clean["Label"] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages)/len(training_set_clean)
p_ham = len(ham_messages)/len(training_set_clean)

# Nspam, Nham, Nvocabulary
n_spam = spam_messages['SMS'].apply(len).sum()
n_ham = ham_messages['SMS'].apply(len).sum()

n_vocabulary = len(vocabulary)
alpha = 1  # Laplace (add-one) smoothing



## Calculating Parameters

Now we need to calculate the parameters. Here these are the probability of a word given spam and the probability of a word given ham, for every word in our vocabulary.

In [21]:
# Initialize dictionaries for 'spam' and 'ham'
spam_parameters = {unique_word: 0 for unique_word in vocabulary}
ham_parameters = {unique_word: 0 for unique_word in vocabulary}


In [22]:
for word in vocabulary:
    # Spam Parameter
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + (alpha * n_vocabulary))
    spam_parameters[word] = p_word_given_spam
    
    # Ham Parameter
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + (alpha * n_vocabulary))
    ham_parameters[word] = p_word_given_ham
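A small worked example may help show why the smoothing term matters. The helper `smoothed_probability` and the counts below are hypothetical, picked for easy arithmetic. The formula is the same one used in the loop above.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Additive smoothing: (count + alpha) / (class size + alpha * vocabulary size)."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts: 90 words in the class, 10 words in the vocabulary
p_seen = smoothed_probability(3, 90, 10)    # (3 + 1) / (90 + 10)
p_unseen = smoothed_probability(0, 90, 10)  # (0 + 1) / (90 + 10)

print(p_seen, p_unseen)
```

Without smoothing, a word that never appeared in a class would get probability zero and wipe out the whole product for that class.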
    

## Classifying a New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter.

In [23]:
import re

def classify(message):
    # Apply the same cleaning used on the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors, then multiply in P(word|class) for each known word
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]

        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [24]:
# Test messages for our classify() function

spam_test = 'WINNER!! This is the secret code to unlock the money: C3421.'
ham_test = "Sounds good, Tom, then see u there"

classify(spam_test)
classify(ham_test)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


Testing with one message that is obviously spam, and one that is obviously ham, we can see that our `classify()` function has worked successfully. 
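The values printed above are already around 1e-25 for short messages, so for much longer messages the product of many small probabilities could underflow to zero. A common workaround is to compare sums of logarithms instead. The sketch below is not part of the original filter; `log_score` and the toy parameter dictionaries are hypothetical.

```python
import math

def log_score(words, prior, parameters):
    """Sum of log-probabilities instead of a product of probabilities."""
    score = math.log(prior)
    for word in words:
        if word in parameters:  # unknown words are skipped, as in classify()
            score += math.log(parameters[word])
    return score

# Hypothetical parameters, for illustration only
toy_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_ham = {'win': 0.002, 'money': 0.003, 'see': 0.03}

words = ['win', 'money', 'now']
spam_score = log_score(words, 0.13, toy_spam)
ham_score = log_score(words, 0.87, toy_ham)
label = 'spam' if spam_score > ham_score else 'ham'
print(label)

# Why this matters: multiplying 100 small probabilities underflows to 0.0
product = 1.0
for _ in range(100):
    product *= 1e-5
print(product)
```

Because the logarithm is monotonic, comparing log scores picks the same label as comparing the raw products whenever those products are still representable.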

## Measuring the Spam Filter's Accuracy

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm. 

Below we will construct a repurposed classify function, one that can evaluate the test set. 

In [29]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]

        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [30]:
SMS_data_testing['predicted'] = SMS_data_testing['SMS'].apply(classify_test_set)
SMS_data_testing.head()



Unnamed: 0,index,Label,SMS,predicted
0,2131,ham,Later i guess. I needa do mcat study too.,ham
1,3418,ham,But i haf enuff space got like 4 mb...,ham
2,3424,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,1538,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,5393,ham,"All done, all handed in. Don't know if mega sh...",ham


With a predicted label for every test message, we can measure how accurate the classifier is.

In [32]:
correct = 0
total = len(SMS_data_testing)  # 1,114 messages

# Count the rows where the predicted label matches the human-assigned label
for _, row in SMS_data_testing.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


The classifier reaches an accuracy of about 98.7% (1,100 of 1,114 previously unseen messages classified correctly), well above our 80% goal.
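Accuracy alone can hide which kind of mistake the filter makes, especially when most messages are ham. A possible follow-up, sketched here on a made-up results frame (`toy_results` is hypothetical), computes accuracy with a vectorized comparison and breaks the errors down with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical true and predicted labels
toy_results = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# The mean of a boolean Series is the fraction of True values
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

print(accuracy)
print(confusion)
```

On the real test set, the same two lines applied to `SMS_data_testing` would show whether the 14 errors are mostly spam that slipped through or ham that was flagged.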