Our first task is to "teach" the computer how to classify messages.
To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

Open the SMSSpamCollection file using the read_csv() function from the pandas package.
The data points are tab separated, so we'll need to use the sep='\t' parameter for our read_csv() function.
The dataset doesn't have a header row, which means we need to use the header=None parameter, otherwise the first row will be wrongly used as the header row.
Use the names=['Label', 'SMS'] parameter to name the columns as Label and SMS.

In [240]:
import pandas as pd

messages = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS'])

#print(messages.info())
#print(messages.shape)
print(messages['Label'].value_counts(normalize = True))
messages.head()

ham     0.865937
spam    0.134063
Name: Label, dtype: float64


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Once the spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter,  first split the dataset into two categories:
A training set, which we'll use to "train" the computer how to classify messages.
A test set, which we'll use to test how good the spam filter is with classifying new messages.
We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

The training set will have 4,458 messages (about 80% of the dataset).
The test set will have 1,114 messages (about 20% of the dataset).

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.
The goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

To  create a training and a test set, We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset

In [241]:
#randomize the entire dataset using dataframe.sample()
#Use the frac=1 parameter to randomize the entire dataset.
#Use the random_state=1 parameter to make sure your results are reproducible.

random_message = messages.sample(frac=1, random_state = 1)

# Calculate index for split
training_test_index = round(len(random_message) * 0.8)

# Training/Test split
#Reset the index labels for both data sets — the index labels remained unordered after randomization. 
#You can use the DataFrame.reset_index() method.
training_set = random_message[:training_test_index].reset_index(drop=True)
test_set = random_message[training_test_index:].reset_index(drop=True)

#Find the percentage of spam and ham in both the training and the test set.
training_set['Label'].value_counts(normalize = True)


ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [242]:
test_set['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

Data Cleaning
The SMS column doesn't exist anymore.
Instead, the SMS column is replaced by a series of new columns, where each column represents a unique word from the vocabulary.
Each row describes a single message. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values spam, 2, 2, 1, 1, 0, 0, 0, 0, 0. These values tell us that:
The message is spam.
The word "secret" occurs two times inside the message.
The word "prize" occurs two times inside the message.
The word "claim" occurs one time inside the message.
The word "now" occurs one time inside the message.
The words "coming", "to", "my", "party", and "winner" occur zero times inside the message.
All words in the vocabulary are in lower case, so "SECRET" and "secret" come to be considered to be the same word.
Punctuation is not taken into account anymore (for instance, we can't look at the table and conclude that the first message initially had three exclamation marks).

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [243]:
training_set['SMS'] = training_set['SMS'].str.replace('\W', ' ')  # remove all non word character
training_set['SMS'] = training_set['SMS'].str.lower()

let's create a list with all of the unique words that occur in the messages of our training set.

In [244]:
training_set['SMS'] = training_set['SMS'].str.split()
vocabulary = []
for row in training_set['SMS']:
    for word in row:
        vocabulary.append(word)
vocabulary = set(vocabulary)
vocabulary = list(vocabulary) #This is the list of unique words
print(len(vocabulary))

#Now create a frequency table in a dictinoary for the unique words
word_count_per_sms = { unique_word : [0] * len(training_set['SMS']) for unique_word in vocabulary}
for index,sms in enumerate(training_set['SMS']):
        for word in sms:
            word_count_per_sms[word][index] += 1

# i = 0
# for key, value in word_count_per_sms.items():
#     print(key,value)
#     print('\n')
#     i += 1
#     if i == 20:
#         break

#convert dictionary to a dtaframe
word_count_per_sms = pd.DataFrame(word_count_per_sms)
# word_count_per_sms.info()
# word_count_per_sms.head()

#join the original dataframe (the one with the label and sms columns) with this new word_count dataframe
final_training_set = pd.concat([training_set,word_count_per_sms], axis = 1)
#final_training_set.head()

7783


As a start, let's first calculate:

P(Spam) - probability of spam
P(Ham) - probability of non spam
NSpam - Total no of words in all the spam messages
NHam - Total no of words in all the non spam messages
NVocabulary - Total no of words in the vocabulary

In [245]:
#the final training set data will be divided into two different data sets, spam and ham
training_set_spam = final_training_set[final_training_set['Label'] == 'spam'].reset_index(drop = True)
training_set_ham = final_training_set[final_training_set['Label'] == 'ham'].reset_index(drop=True)

#calculate P(spam) & P(non_spam)
p_spam, p_ham = final_training_set['Label'].value_counts(normalize=True)
#print(p_spam, p_ham)

#calculate n_spam
total_word_count_per_spam_message = training_set_spam['SMS'].apply(len)
n_spam = total_word_count_per_spam_message.sum()
print(n_spam)

#calculate n_ham
total_word_count_per_ham_message = training_set_ham['SMS'].apply(len)
n_ham = total_word_count_per_ham_message.sum()
print(n_ham)

# N_Vocabulary
n_vocabulary = len(vocabulary)
print(n_vocabulary)

# Laplace smoothing
alpha = 1

15190
57237
7783


Now that we have the constant terms calculated above, we can move on with calculating the parameters P(w/spam) and P(w/ham)
Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:
P(w/spam) = (N(w/spam) + alpha)/(n_spam + alpha*n_vocab))
P(w/ham) = (N(w/ham) + alpha)/(n_ham + alpha*n_vocab))

we can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

P("lost"|Spam) and P("lost"|Ham)
P("navigate"|Spam) and P("navigate"|Ham)
P("sea"|Spam) and P("sea"|Ham)

We have 7,783 words in our vocabulary, which means we'll need to calculate a total of 15,566 probabilities. For each word, we need to calculate both P(wi|Spam) and P(wi|Ham).
 
 


In [246]:
#now lets calculate P(w/spam) & P(w/ham) for the unique words in the vocabulary
#we will have two dictionaries, one for spam probalities and the other for ham

spam_prob_dict ={}
for word in vocabulary:
    n_word_spam = training_set_spam[word].sum()
    spam_prob_dict[word] = (n_word_spam + alpha)/(n_spam + (alpha * n_vocabulary))
    
ham_prob_dict = {}
for word in vocabulary:
    n_word_ham = training_set_ham[word].sum()
    ham_prob_dict[word] = (n_word_ham + alpha)/(n_ham + (alpha * n_vocabulary))
    
# i = 0
# for key, value in spam_prob_dict.items():
#     print(key,value)
#     print('\n')
#     i += 1
#     if i == 20:
#         break


Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:
-- Takes a new message (w1, w2, ..., wn) as input. The message would be a string
-- Cleans the message (removes all non word characters and also changes all the words to lower case using re.sub() and str.lower())
-- splits all the words in the message into a string using str.split()
-- Any new words that are not in the vocabulary of unique words will be ignored and no probability will calculated for them.
-- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
-- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
----If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
----If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
----If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.


In [247]:
import re
import numpy as np

def classify(message):
    msg = re.sub('\W', ' ', message)
    msg = msg.lower()
    msg_list = msg.split()
    p_spam_list = []  #list to store the probability of all the words for spam in the message 
    p_ham_list = []   #list to store the probability of all the words for ham in the message
    for word in msg_list:
        if word in vocabulary:  #confirm the word is in our vocab
            p_spam_list.append(spam_prob_dict[word]) #put the probability for the word in the spam list 
            p_ham_list.append(ham_prob_dict[word])  #put the probability for the word in the ham list
    #find p(spam given message
    p_spam_message = p_spam * np.prod(p_spam_list)
    p_ham_message = p_ham * np.prod(p_ham_list)
    classification = None
    if p_spam_message == p_ham_message:
        classification = 'Same probability, needs human to classify'
    elif p_spam_message > p_ham_message:
        classification = 'spam'
    else:
        classification = 'ham'
    return classification


#print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
#print(classify("Sounds good, Tom, then see u there"))

#now we will use the test data set to find the accuracy of the spam filter
test_set['predicted'] = test_set['SMS'].apply(classify)

#print(test_set.head())

#now we will create another column for the accuracy of the prediction (i.e whether true or false)

test_set['Accuracy'] = test_set['Label'] == test_set['predicted']

#now we will check the accuracy
#print(test_set['Accuracy'].value_counts(normalize = True) * 100)

print (test_set[test_set['Accuracy'] == False])
    
        

     Label                                                SMS  \
9      ham                             I liked the new mobile   
18     ham           and  picking them up from various points   
56     ham                       what is your account number?   
114   spam  Not heard from U4 a while. Call me now am here...   
125    ham                           Your brother is a genius   
152    ham                  Unlimited texts. Limited minutes.   
159    ham                                       26th OF JULY   
176    ham                            Your dad is back in ph?   
182    ham                         Surely result will offer:)   
186    ham     They r giving a second chance to rahul dengra.   
195    ham  Hi.:)technical support.providing assistance to...   
247    ham                          Which channel:-):-):):-).   
249    ham                                            Okie...   
251    ham                        Hahaha..use your brain dear   
271    ham               