Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages. So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans

GOAL OF THE PROJECT

our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam)

1. Exploring the Dataset

In [None]:
import pandas  as pd
smsspam = pd.read_csv('SMSSpamCollection',sep='\t', header=None,names=['Label', 'SMS'])


In [None]:
#Lets find how many rows and columns
smsspam.info()

In [None]:
#Lets find what percentage of the messages is spam and what percentage is ham
percentages = smsspam['Label'].value_counts()
spam_nonspam = percentages / len(smsspam['Label']) * 100
print(spam_nonspam)

The dataset has 5572 rows and two columns
86% of the smses are not spam
13% of the smses are spam

2· Training and Test Set

To test the spam filter, we're first going to split our dataset into two categories:

A training set, which we'll use to "train" the computer how to classify messages.

A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

The training set will have 4,458 messages (about 80% of the dataset).
The test set will have 1,114 messages (about 20% of the dataset)

We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset

In [None]:
#Lets randomize the entire dataset
rand_dataset = smsspam.sample(random_state = 1, frac =1)

In [None]:
#Lets split the randimized dataset into training and testing sets
training_set = rand_dataset.sample(frac=.8)


In [None]:
#Test set
test_set = rand_dataset.drop(training_set.index)


In [None]:
#Lets reset the indexes 
training_set.reset_index()


In [None]:
test_set.reset_index()

Now Lets check the percentages of both the spam and non spam in both the training and test set

In [None]:
test_set_perc = test_set['Label'].value_counts()
percentage = test_set_perc/ len(test_set['Label']) * 100
print(percentage)

In [None]:
train_set_perc = training_set['Label'].value_counts()
percentage = train_set_perc/ len(training_set['Label']) * 100
print(percentage)

The percentages of spam and ham in both the training and test set are different from what we have in the full dataset

3. Letter Case and Punctuation

To calculate all these probabilities, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need.

Remove all the punctuation from the SMS column

In [40]:
# Get rid of all symbols outside of Aa-Z and 0-9
training_set['SMS'] = training_set['SMS'].str.replace('\W',' ').str.lower()

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

4. Creating the Vocabulary

we removed the punctuation and changed all letters to lowercase. Recall that our end goal with this data cleaning process is to bring our training set to the format shown below.

In [None]:
from IPython.display import Image

# Replace the file path with the actual path to your image
Image(url="https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png")

With the exception of the "Label" column, every other column in the transformed table above represents a unique word in our vocabulary (more specifically, each column shows the frequency of that unique word for any given message)

Lets Create a vocabulary for the messages in the training set. The vocabulary should be a Python list containing all the unique words across all messages, where each word is represented as a string

In [39]:
training_set['SMS'] = training_set['SMS'].str.split()
# Initiate a vocabulary list
     
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

In [38]:
#tranform the vocabulary list into a set using the set() function to remove duplicates
vocabulary = list(set(vocabulary))

5· The Final Training Set

we managed to create the vocabulary for our messages in the training set. Now we're going to use the vocabulary to make the data transformation we need

Eventually, we're going to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need

In [14]:
# Create a dictionary of how many unique words are in an SMS message
# using vocabulary list
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1


In [15]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,...,zoe,zoom,zouk,zyada,èn,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,0121,01223585236,...,zoe,zoom,zouk,zyada,èn,é,ú1,ü,〨ud,鈥
0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,"[ok, lar, joking, wif, u, oni]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,spam,"[free, entry, in, 2, a, wkly, comp, to, win, f...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
training_set_clean['Label'].value_counts()

ham     3856
spam     602
Name: Label, dtype: int64

In [None]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

6. Calculating Constants First

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter.

In [34]:

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)


In [35]:
print(p_spam)
print(p_ham)

0.1126497005988024
0.7215568862275449


In [36]:
# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

#N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocab
n_vocab = len(vocabulary)

# alpha
alpha = 1

print(n_spam)
print(n_ham)
print(n_vocab)


15313
57155
7752


7  Calculating Parameters

In [20]:
#Initialize 2 dictionaries
spam_dict = {unique_word:0 for unique_word in vocabulary}
ham_dict = {unique_word:0 for unique_word in vocabulary}

1. Isolate the spam and the ham messages in the training set into two different DataFrames.
2. The Label column will help you isolate the messages.

In [21]:
spam_training_set = training_set_clean[training_set_clean['Label']  == 'spam']
ham_training_set = training_set_clean[training_set_clean['Label'] == 'ham']

In [22]:
for word in vocabulary:
    # Caculate number of times word occurs in relevant set and 
    # calculate the probability of word occuring in relevant set
    # then assign value to relevant dictionary
    
    n_word_given_spam = spam_training_set[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocab)
    spam_dict[word] = p_word_given_spam
     
    n_word_given_ham = ham_training_set[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocab)
    ham_dict[word] = p_word_given_ham

In [23]:
print(spam_dict)



8· Classifying A New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter

In [24]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

        

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [25]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.4265238295315167e-29
P(Ham|message): 1.7243013792619865e-25
Label: Ham


In [26]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.625539642489444e-24
P(Ham|message): 4.630320989711132e-22
Label: Ham


9. Measuring the Spam Filter's Accuracy

we managed to create a spam filter, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm

In [27]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [28]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
4028,ham,"Yes, princess. Are you going to make me moan?",ham
5461,ham,Ok i thk i got it. Then u wan me 2 come now or...,ham
1603,ham,Ok pa. Nothing problem:-),ham
4259,ham,I am late. I will be there at,ham
5392,ham,Ooooooh I forgot to tell u I can get on yovill...,ham


In [29]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 968
Incorrect: 146
Accuracy: 0.8689407540394973


THe Accuracy is better than expected