# Building a Spam Filter with Naive Bayes

In this project, we will use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages to build a spam filter. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo and can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Our goal in this project is to build a spam filter that classifies new messages with an accuracy of 80% or more.

# Exploring the Dataset

In [1]:
# Read in the SMSSpamCollection file
import pandas as pd
data = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS'])
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
# Number of rows and columns
print('The number of rows is:', data.shape[0])
print('The number of columns is:', data.shape[1])
print('')

# Percentage of ham and spam
p_ham = (data['Label'].value_counts(normalize = True)*100)['ham']

print('The percentage of ham messages is:', p_ham)
print('The percentage of spam messages is:', 100 - p_ham)

The number of rows is: 5572
The number of columns is: 2

The percentage of ham messages is: 86.59368269921033
The percentage of spam messages is: 13.406317300789667


# Training and Test Set

We'll split the dataset into a training set and a test set, keeping 80% of the messages for training and 20% for testing. The dataset is shuffled first so that both sets are random samples of the original data.
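
The shuffle-then-cut idea can be sketched on a toy list without pandas. The sample rows, the `split_dataset` helper, and the seed are illustrative assumptions and are not part of the project code.

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data, assumed for illustration only
rows = [('ham', f'message {i}') for i in range(10)]
train, test = split_dataset(rows)
print(len(train), len(test))  # 8 2
```

Fixing the seed plays the same role as `random_state = 1` in the pandas version: rerunning the code gives the same split.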

In [3]:
# randomize the entire dataset
randomized_dataset = data.sample(frac = 1, random_state = 1)
randomized_dataset.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [4]:
# number of rows in training set and test set
training_set_rows = round(randomized_dataset.shape[0] * 0.8,0)
test_set_rows = randomized_dataset.shape[0] - training_set_rows

print('The number of rows in the training set is:', training_set_rows)
print('The number of rows in the test set is:', test_set_rows)

The number of rows in the training set is: 4458.0
The number of rows in the test set is: 1114.0


In [5]:
# split the randomized dataset into training set and test set
training_set = randomized_dataset.iloc[:4458,:].reset_index(drop = True)
test_set = randomized_dataset.iloc[4458:,:].reset_index(drop = True)

In [6]:
# Percentage of spam and ham in training set
training_set['Label'].value_counts(normalize = True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [7]:
# Percentage of spam and ham in test set
test_set['Label'].value_counts(normalize = True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

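The training set and the test set both contain roughly 87% ham and 13% spam, which is close to the proportions in the full dataset, so both sets are representative of the original data.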

# Letter Case and Punctuation

We'll remove punctuation and convert all words in the SMS column in the training set to lowercase.
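
As a small illustration of this cleaning step on a single string, here is a sketch using Python's `re` module. The `normalize` helper name is an assumption made for this example. Note that `\W` matches any character that is not a letter, digit, or underscore, so digits are kept.

```python
import re

def normalize(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

sample = "Yep, by the pretty sculpture"
# Splitting on whitespace afterwards discards the extra spaces
print(normalize(sample).split())  # ['yep', 'by', 'the', 'pretty', 'sculpture']
```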

In [8]:
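# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)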
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


# Creating the Vocabulary

We'll create a list of all unique words in the training set.
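
A minimal sketch of the same idea on two toy messages, using a set comprehension to drop duplicates. The messages are made up for illustration.

```python
# Toy tokenized messages, assumed for illustration only
messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# A set keeps each word once; sorting gives a stable, readable order
vocabulary = sorted({word for sms in messages for word in sms})
print(len(vocabulary))  # 8
```

The word `secret` appears in both messages but only once in the vocabulary.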

In [9]:
# split words in each row
training_set['SMS'] = training_set['SMS'].str.split()
vocabulary = []

In [10]:
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))

# The Final Training Set

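We'll transform the training set so that every unique word in the vocabulary becomes a column, and each row records how many times that word occurs in the corresponding message.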
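
To make the target structure concrete, here is a sketch on two toy messages: a dictionary that maps each word to a list of per-message counts. The toy messages are assumptions made for this example.

```python
# Toy tokenized messages, assumed for illustration only
messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party', 'secret'],
]
vocabulary = sorted({word for sms in messages for word in sms})

# One list of counts per word; position i holds the count for message i
word_counts_per_sms = {word: [0] * len(messages) for word in vocabulary}
for index, sms in enumerate(messages):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms['secret'])  # [1, 2]
print(word_counts_per_sms['party'])   # [0, 1]
```

Passing a dictionary of this shape to `pd.DataFrame` yields one column per word and one row per message.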

In [11]:
# create a dictionary in which each key is a unique word in the vocabulary and
# each value is a list of zeros with one entry per message in the training set
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# loop over the messages and increment the count of each word at the message's index
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [12]:
# create a dataframe
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [13]:
# concatenate the training set with the data frame above to get the desired data frame
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Calculating Constants First

We'll calculate the values that stay the same for every word:
- the total number of words in all the spam messages and in all the non-spam messages
- the number of unique words in the vocabulary
- the probability of a spam message and the probability of a non-spam message
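
The constants listed above can be sketched on a toy training set of three messages. The data is made up for illustration; the variable names mirror the ones used in this notebook.

```python
# Toy labeled, tokenized messages, assumed for illustration only
training = [
    ('spam', ['secret', 'prize', 'claim', 'now']),
    ('ham',  ['coming', 'to', 'my', 'secret', 'party']),
    ('ham',  ['see', 'you', 'now']),
]

# Prior probabilities: the share of messages in each class
p_spam = sum(1 for label, _ in training if label == 'spam') / len(training)
p_ham = 1 - p_spam

# Total word counts per class (repeated words are counted every time)
num_spam = sum(len(words) for label, words in training if label == 'spam')
num_ham = sum(len(words) for label, words in training if label == 'ham')

# Vocabulary size: unique words across both classes
num_vocabulary = len({word for _, words in training for word in words})

print(num_spam, num_ham, num_vocabulary)  # 4 8 10
```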

In [14]:
# probability of a spam message and a non-spam message
p_spam = training_set_clean['Label'].value_counts(normalize = True)['spam']
p_ham = 1 - p_spam

In [15]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [16]:
# filter spam messages in the training set
spam_training_set = training_set[training_set['Label'] == 'spam']
spam_training_set.head()

Unnamed: 0,Label,SMS
16,spam,"[freemsg, why, haven, t, you, replied, to, my,..."
18,spam,"[congrats, 2, mobile, 3g, videophones, r, your..."
56,spam,"[free, message, activate, your, 500, free, tex..."
60,spam,"[call, from, 08702490080, tells, u, 2, call, 0..."
61,spam,"[someone, has, conacted, our, dating, service,..."


In [17]:
# calculate the number of words in spam messages
num_spam = 0
for sms in spam_training_set['SMS']:
    for word in sms:
        num_spam += 1

In [18]:
# filter ham messages in the training set
ham_training_set = training_set[training_set['Label'] == 'ham']
ham_training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [19]:
# calculate the number of words in ham messages
num_ham = 0
for sms in ham_training_set['SMS']:
    for word in sms:
        num_ham += 1

In [20]:
# calculate the number of unique words in the vocabulary
num_vocabulary = len(vocabulary)

In [21]:
# set the smoothing parameter alpha to 1 (Laplace smoothing)
alpha = 1

# Calculating Parameters

For each word in the vocabulary list, we'll calculate:
- the probability of that word given the message is spam
- the probability of that word given the message is non-spam
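
Written out, the two quantities are the additively smoothed estimates that the code in this section computes. The symbol names are chosen here for readability: $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ corresponds to `num_spam`, $N_{\text{Ham}}$ to `num_ham`, $N_{\text{Vocabulary}}$ to `num_vocabulary`, and $\alpha$ to `alpha`.

```latex
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

With $\alpha = 1$, a word that never occurs in one class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.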

In [22]:
# initialize two dictionaries
p_word_given_spam = {unique_word: 0 for unique_word in vocabulary}
p_word_given_ham = {unique_word: 0 for unique_word in vocabulary}

In [23]:
# calculate the smoothed probability of each vocabulary word given a spam message and given a non-spam message
for word in vocabulary:
    num = 0
    for sms in spam_training_set['SMS']:
        for w in sms:
            if word == w:
                num += 1
    p_word_given_spam[word] = (num+alpha)/(num_spam + alpha*num_vocabulary)
    
    num_1 = 0
    for sms_1 in ham_training_set['SMS']:
        for w_1 in sms_1:
            if word == w_1:
                num_1 += 1
    p_word_given_ham[word] = (num_1+alpha)/(num_ham + alpha*num_vocabulary)
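
The nested loops above rescan every message once per vocabulary word, which can be slow on large vocabularies. One possible alternative, sketched here on toy data, counts all words in a single pass with `collections.Counter` and then applies the same smoothing formula. The `smoothed_probabilities` helper and the sample messages are assumptions made for this example.

```python
from collections import Counter

def smoothed_probabilities(messages, vocabulary, alpha=1):
    """Return P(word | class) for every vocabulary word, with additive smoothing."""
    counts = Counter(word for sms in messages for word in sms)  # one pass over the class
    total_words = sum(counts.values())
    denominator = total_words + alpha * len(vocabulary)
    # Counter returns 0 for words that never occur in this class
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

# Toy tokenized messages, assumed for illustration only
spam_messages = [['win', 'prize', 'now'], ['win', 'cash']]
ham_messages = [['see', 'you', 'now']]
vocabulary = {'win', 'prize', 'now', 'cash', 'see', 'you'}

p_word_given_spam = smoothed_probabilities(spam_messages, vocabulary)
p_word_given_ham = smoothed_probabilities(ham_messages, vocabulary)
print(p_word_given_spam['win'])  # (2 + 1) / (5 + 6)
```

Because every word in the messages belongs to the vocabulary, the smoothed probabilities for each class sum to 1.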


# Classifying A New Message

In [24]:
import re 

# create a function to classify new messages

def classify(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[word]
        if word in p_word_given_ham:
            p_ham_given_message *= p_word_given_ham[word]
    
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_spam_given_message < p_ham_given_message:
        return 'ham'
    else:
        return 'We might need human help'
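
Multiplying many small probabilities can underflow to zero for very long messages. A common workaround, sketched below, is to add logarithms instead of multiplying; the comparison gives the same result because the logarithm is monotonic. The `classify_log` name, its explicit parameters, and the toy probabilities are assumptions made for this example, not the notebook's fitted values.

```python
import math
import re

def classify_log(message, p_spam, p_ham, p_word_given_spam, p_word_given_ham):
    """Classify a message by comparing log-probabilities instead of products."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in the original function
        if word in p_word_given_spam:
            log_spam += math.log(p_word_given_spam[word])
        if word in p_word_given_ham:
            log_ham += math.log(p_word_given_ham[word])
    if log_spam > log_ham:
        return 'spam'
    if log_spam < log_ham:
        return 'ham'
    return 'needs human review'

# Toy parameters, assumed for illustration only
toy_spam = {'win': 0.3, 'prize': 0.3, 'see': 0.05, 'you': 0.1}
toy_ham = {'win': 0.02, 'prize': 0.02, 'see': 0.3, 'you': 0.3}

print(classify_log('Win a PRIZE!', 0.4, 0.6, toy_spam, toy_ham))  # spam
print(classify_log('See you', 0.4, 0.6, toy_spam, toy_ham))       # ham
```

Smoothed probabilities are always greater than zero, so `math.log` never receives a zero argument here.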

In [25]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [26]:
classify("Sounds good, Tom, then see u there")

'ham'

# Measuring the Spam Filter's Accuracy

We'll evaluate the spam filter on the test set using accuracy: the number of correctly classified messages divided by the total number of classified messages.
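
As a quick sketch of the metric itself, independent of pandas, accuracy can be computed from two lists of labels. The `accuracy` helper and the toy labels are assumptions made for this example.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Toy labels, assumed for illustration only
actual = ['ham', 'ham', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'spam', 'spam']
print(accuracy(actual, predicted))  # 0.8
```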

In [27]:
# create a column that holds the predictions of our program
test_set['predicted'] = test_set['SMS'].apply(classify)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [29]:
# create a column that shows whether the predicted label matches the actual label
test_set['comparison'] = test_set['Label'] == test_set['predicted']
test_set.head()

Unnamed: 0,Label,SMS,predicted,comparison
0,ham,Later i guess. I needa do mcat study too.,ham,True
1,ham,But i haf enuff space got like 4 mb...,ham,True
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,True
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,True
4,ham,"All done, all handed in. Don't know if mega sh...",ham,True


In [30]:
test_set['comparison'].value_counts(normalize = True)*100

True     98.743268
False     1.256732
Name: comparison, dtype: float64

In [32]:
# count how many messages the program classified correctly and incorrectly
# (summing the boolean column avoids the truncation error of multiplying by a rounded percentage)
correct = int(test_set['comparison'].sum())
incorrect = test_set.shape[0] - correct

print('The number of messages that the program correctly predicted is:', correct)
print('The number of messages that the program incorrectly predicted is:', incorrect)

The number of messages that the program correctly predicted is: 1100
The number of messages that the program incorrectly predicted is: 14


# Conclusion

Our goal was to build a spam filter with an accuracy of at least 80%. The filter reached an accuracy of about 98.74% on the test set, well above that target. At this rate, we would expect it to misclassify roughly one out of every 100 new messages.