# Building a Spam Filter Using the Naive Bayes Algorithm

Using the SMS Spam Collection dataset, which contains 5,572 SMS messages of which about 13% are spam, we're going to build a spam filter using our knowledge of probabilities and the multinomial Naive Bayes algorithm. The dataset can be downloaded directly from [here](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection), or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')

In [2]:
sms = pd.read_csv('/Users/Tejas/csv_files/SMSSpamCollection', 
                  sep='\t', 
                  header=None, 
                  names=['Label', 'SMS'])
print(sms.shape)
sms.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In our dataset's 'Label' column, 'ham' means the message is not spam and 'spam' means it is.

In [3]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About 87% of our messages are not spam and 13% of our messages are spam.

In [4]:
sms_random = sms.sample(frac=1, random_state=1)
sms_random

| | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| ... | ... | ... |
| 905 | ham | We're all getting worried over here, derek and... |
| 5192 | ham | Oh oh... Den muz change plan liao... Go back h... |
| 3980 | ham | CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C... |
| 235 | spam | Text & meet someone sexy today. U can find a d... |


Now we're going to split the data into a training set and a test set, so that we can check how the algorithm performs on new data it has not seen while still learning from as much data as possible. We are going to split the data as follows:

In [5]:
# 80% for the training data
print(round(0.8 * len(sms_random)))

# 20% for the test data
print(round(0.2 * len(sms_random)))

4458
1114


In [6]:
train = sms_random.iloc[:4458, ]
test = sms_random.iloc[4458:, ]

In [7]:
print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)
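
The cut-off of 4458 rows is hard-coded in the cells above. As a minimal sketch, the same cut-off can be derived from the split fraction instead. The example below uses a plain Python list as a stand-in for the shuffled rows; the helper name `split_by_fraction` and its `seed` argument are hypothetical and not part of the original notebook, and the shuffle it produces is unrelated to the one from `random_state=1` in pandas.

```python
import random

def split_by_fraction(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Same rounding rule as the cell above: round(0.8 * 5572) == 4458
    cutoff = round(train_fraction * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]

rows = list(range(5572))  # stand-in for the 5572 messages
train_rows, test_rows = split_by_fraction(rows)
print(len(train_rows), len(test_rows))  # 4458 1114
```

Computing the cut-off from the length of the data keeps the split correct if the dataset ever changes size.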


In [8]:
# reset_index returns a new DataFrame, so later column assignments
# do not operate on a slice of sms_random
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [9]:
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [10]:
test.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. |
| 1 | ham | But i haf enuff space got like 4 mb... |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... |
| 4 | ham | All done, all handed in. Don't know if mega sh... |


Now we find the percentage of 'ham' and 'spam' in both the training and test set.

In [11]:
train.Label.value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [12]:
test.Label.value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The percentages are close to those in the full dataset, so both sets are representative of it. Next, after a bit of data cleaning, we're going to make the computer classify messages as spam or not spam by looking at the individual words in each SMS.

In [13]:
# \W matches any character that is not a word character, i.e. anything
# other than letters, digits, or the underscore.
re.sub(r'\W', ' ', 'Secret!! Money, goods.')

'Secret   Money  goods '
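
The cleaning steps used throughout this notebook (replace non-word characters, lowercase, split) can be sketched as one small helper. The name `tokenize` is hypothetical and does not appear in the original notebook. The second call illustrates that, for Python strings, `\W` leaves underscores and non-ASCII letters such as `ü` in place.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()

print(tokenize('Secret!! Money, goods.'))  # ['secret', 'money', 'goods']
print(tokenize('snake_case ü ok'))         # ['snake_case', 'ü', 'ok']
```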

We remove all the punctuation from the SMS column and convert it to lowercase:

In [14]:
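# regex=True is required: recent pandas versions treat the pattern
# as a literal string by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()
train.head()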

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


We then split the words into a list using the str.split method.

In [15]:
train['SMS'] = train['SMS'].str.split()
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
| 3 | ham | [havent] |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |


Using a nested loop, we create a vocabulary that contains every unique word in this column.

In [16]:
vocabulary = []
for word_lists in train.SMS:
    for word in word_lists:
        vocabulary.append(word)

In [17]:
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
vocabulary

['taunton',
 'teaches',
 '1pm',
 'grow',
 'salon',
 'filling',
 'lips',
 'nat27081980',
 'waz',
 'lubly',
 'sucks',
 '09064017305',
 'sufficient',
 'realised',
 'stuffed',
 'thk',
 'exposed',
 '09701213186',
 'terry',
 'reslove',
 'jen',
 'shijas',
 'greet',
 '08002986030',
 'drug',
 'mis',
 'contract',
 'gimme',
 'vomiting',
 'gibbs',
 'wings',
 'pixels',
 'this',
 'confused',
 'till',
 'ringtones',
 'txting',
 'msging',
 'thia',
 'dog',
 'tessy',
 'meanwhile',
 '7pm',
 'guesses',
 'young',
 'begun',
 'toughest',
 'oli',
 'wedlunch',
 'exams',
 'hence',
 'helloooo',
 'monthlysubscription',
 'yck',
 'sundayish',
 'dooms',
 'cc',
 'lays',
 'ques',
 'cake',
 'singapore',
 'buffet',
 'coach',
 'unknown',
 'answer',
 'termsapply',
 'calls',
 'rofl',
 'gumby',
 '82277',
 'real1',
 'supposed',
 'peach',
 'egg',
 'thinks',
 'fulfil',
 'morning',
 'chosen',
 '09065069154',
 '08715203652',
 'pretty',
 'trusting',
 'box403',
 'pray',
 'thkin',
 '20',
 'bettersn',
 'conveying',
 'oh',
 '3650',
 ...]

In [18]:
len(vocabulary)

7783
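
As a side note, the nested loop followed by `set` and `list` can also be written as a single set comprehension. This is a minimal sketch on toy data; the names `toy_messages` and `toy_vocabulary` are made up for illustration. Sorting is optional, but it makes the order reproducible, whereas the order of a plain set of strings can differ between runs.

```python
# Toy stand-in for the tokenized SMS column: one list of words per message
toy_messages = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['the', 'pretty', 'card'],
]

# Collect every unique word across all messages, then sort for a stable order
toy_vocabulary = sorted({word for words in toy_messages for word in words})
print(toy_vocabulary)  # ['by', 'card', 'pretty', 'sculpture', 'the', 'yep']
```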

Next we build a dataframe with one column per unique word and one row per message, where each cell holds the number of times that word occurs in that message. We first create a dictionary that maps each word to a list of per-message counts, then turn that dictionary into a dataframe and join it with the training set so that the Label and SMS columns are kept for reference.

In [19]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# the loop variable is named `message` so it does not overwrite the `sms` dataframe
for index, message in enumerate(train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1


In [20]:
word_counts = pd.DataFrame(word_counts_per_sms)
train_clean = pd.concat([train, word_counts], axis=1)
train_clean.head()

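| | Label | SMS | taunton | teaches | 1pm | grow | salon | filling | lips | nat27081980 | ... | dnt | cheap | 63miles | hotmail | plum | purity | bid | disastrous | tom | oja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |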


In [21]:
word_counts.head()

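| | taunton | teaches | 1pm | grow | salon | filling | lips | nat27081980 | waz | lubly | ... | dnt | cheap | 63miles | hotmail | plum | purity | bid | disastrous | tom | oja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |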
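
The word-count table has 4458 rows and 7783 columns, roughly 34.7 million cells, almost all of them zero. Since the later steps only need the total count of each word per label, a lighter alternative is to keep one `collections.Counter` per label. The sketch below uses toy data and hypothetical names (`toy_train`, `counts`); it is not the approach this notebook takes, only an illustration of the same bookkeeping.

```python
from collections import Counter

# Toy stand-in for the tokenized training set: (label, list of words)
toy_train = [
    ('spam', ['win', 'money', 'now']),
    ('ham', ['see', 'you', 'now']),
    ('spam', ['win', 'win', 'prize']),
]

# One counter per label, updated with the words of every message
counts = {'spam': Counter(), 'ham': Counter()}
for label, words in toy_train:
    counts[label].update(words)

print(counts['spam']['win'])          # 3
print(counts['ham']['win'])           # 0 (missing keys count as zero)
print(sum(counts['spam'].values()))   # 6 words in all spam messages
```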


Now we're going to calculate the probabilities and word counts that the Naive Bayes algorithm needs.

In [22]:
spam_msgs = train_clean[train_clean['Label'] == 'spam']
ham_msgs = train_clean[train_clean['Label'] == 'ham']

In [23]:
# prior probabilities P(Spam) and P(Ham)
p_spam = len(spam_msgs) / len(train_clean)
p_ham = len(ham_msgs) / len(train_clean)

# total number of words in all spam messages
n_words_per_spam_message = spam_msgs['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# total number of words in all ham messages
n_words_per_ham_message = ham_msgs['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

In [24]:
n_vocabulary = len(vocabulary)  # number of unique words
alpha = 1  # additive (Laplace) smoothing

We initialize two dictionaries that will hold, for every word in the vocabulary, its probability given spam and its probability given ham.

In [25]:
word_dict_spam = {unique_word:0 for unique_word in vocabulary}
word_dict_ham = {unique_word:0 for unique_word in vocabulary}

Then, for each word, we calculate the probability of seeing that word given that a message is spam and given that it is ham, using additive smoothing.

In [26]:
for word in vocabulary:
    n_word_given_spam = spam_msgs[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    
    n_word_given_ham = ham_msgs[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    
    word_dict_spam[word] = p_word_given_spam
    word_dict_ham[word] = p_word_given_ham
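
The loop above applies additive (Laplace) smoothing: each word count is increased by `alpha` and the denominator by `alpha * n_vocabulary`, so a word never seen in a class still gets a small non-zero probability. A minimal worked example with made-up numbers follows; the helper `smoothed_probability` is hypothetical, and exact fractions are used only to make the arithmetic easy to check.

```python
from fractions import Fraction

def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(word | class) with additive smoothing, returned as an exact fraction."""
    return Fraction(n_word_given_class + alpha, n_class + alpha * n_vocabulary)

# Toy numbers: 6 words in all spam messages, a vocabulary of 7 unique words
print(smoothed_probability(3, 6, 7))  # 4/13 for a word seen 3 times in spam
print(smoothed_probability(0, 6, 7))  # 1/13 for a word never seen in spam
```

Without smoothing, the second value would be 0 and a single unseen word would zero out the whole product for that class.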

We then create a function that computes values proportional to P(Spam|message) and P(Ham|message), compares them, and labels the message as spam or ham accordingly.
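
In symbols, for a message made of the words $w_1, \dots, w_n$, the two quantities being compared are as follows. The shared denominator $P(w_1, \dots, w_n)$ is dropped, so the values the function prints are proportional to the posterior probabilities rather than normalized probabilities. Words that are not in the vocabulary are skipped.

```latex
\begin{align}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{align}
```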

In [27]:
def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in word_dict_spam:
            p_spam_given_message *= word_dict_spam[word]
        if word in word_dict_ham:
            p_ham_given_message *= word_dict_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [28]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [29]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
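
The values printed above are already as small as 1e-25 for short messages. For much longer texts, a product of many small probabilities can underflow to 0.0 in floating point, at which point both classes tie. A common remedy is to add logarithms instead of multiplying. The sketch below is not part of the original notebook; `log_score` and the toy parameters are hypothetical and only illustrate the idea.

```python
import math

def log_score(words, prior, word_probs):
    """Sum of log prior and log word probabilities; unknown words are skipped."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:
            score += math.log(word_probs[word])
    return score

# Hypothetical toy parameters, not values from the trained filter
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_word_spam = {'win': 0.30, 'money': 0.20, 'see': 0.01}
toy_word_ham = {'win': 0.01, 'money': 0.02, 'see': 0.20}

words = ['win', 'money', 'unknownword']
spam_score = log_score(words, toy_p_spam, toy_word_spam)
ham_score = log_score(words, toy_p_ham, toy_word_ham)
print('spam' if spam_score > ham_score else 'ham')  # spam

# Why this matters: a long product underflows, the equivalent log sum does not
print(0.01 ** 200)            # 0.0
print(200 * math.log(0.01))   # a finite negative number
```

Because the logarithm is increasing, comparing log scores gives the same label as comparing the products whenever the products are representable.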


Now we use this function on our test set. Since we want to store each prediction in a new column rather than print it, we replace the print statements with return statements.

In [30]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in word_dict_spam:
            p_spam_given_message *= word_dict_spam[word]

        if word in word_dict_ham:
            p_ham_given_message *= word_dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [31]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


To see how well the function performs, we measure its accuracy on the test set: the number of correct predictions divided by the total number of predictions.

In [32]:
# compare the actual and predicted labels row by row
correct = (test['Label'] == test['predicted']).sum()
total = len(test)

accuracy = correct / total
accuracy

0.9874326750448833

In [33]:
accuracy * 100

98.74326750448833
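
Because about 87% of the messages are ham, accuracy alone can hide how the filter does on the spam class. Precision and recall for spam are a common complement. The sketch below is an assumption about how one might compute them, not something the original notebook does; `spam_metrics` is a hypothetical helper and the two label lists are toy data.

```python
def spam_metrics(actual, predicted):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
    # Anything other than 'spam' (including a tie label) counts as a miss
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

toy_actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
toy_predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']
precision, recall = spam_metrics(toy_actual, toy_predicted)
print(precision, recall)  # both equal 2/3 on this toy data
```

On the notebook's data, the call would presumably be `spam_metrics(test['Label'], test['predicted'])`.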

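We got an accuracy of 98.74%, meaning 1,100 of the 1,114 test messages were classified correctly. For comparison, labelling every message as ham would be right only about 86.8% of the time on this test set, so the filter is clearly learning something useful and can be used to check other messages. This concludes the project, thank you for reading.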