# Spam filter for SMS messages

In this project, I will use a dataset of already classified SMS messages from the UCI machine learning repository to construct a spam filter that can detect if a given SMS message is likely to be spam.

In [1]:
#Import the dataset and name the columns as 'Label' and 'SMS'

import pandas as pd
dataset = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [45]:
dataset.head(10)

Unnamed: 0,Label,SMS
1078,ham,"[yep, by, the, pretty, sculpture]"
4028,ham,"[yes, princess, are, you, going, to, make, me, moan]"
958,ham,"[welp, apparently, he, retired]"
4642,ham,[havent]
4674,ham,"[I, forgot, 2, ask, ü, all, smth, there, s, a, card, on, da, present, lei, how, Ü, all, want, 2, write, smth, or, sign, on, it]"
5461,ham,"[ok, i, thk, i, got, it, then, u, wan, me, 2, come, now, or, wat]"
4210,ham,"[I, want, kfc, its, tuesday, only, buy, 2, meals, ONLY, 2, no, gravy, only, 2, mark, 2, !]"
4216,ham,"[no, dear, i, was, sleeping, P]"
1603,ham,"[ok, pa, nothing, problem]"
1504,ham,"[ill, be, there, on, lt, gt, ok]"


All of the messages have been labelled by a human as either 'spam' or 'ham' (non-spam). The Label column holds that classification for each SMS.

In [3]:
dataset['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [4]:
# Randomise the dataset. Find the value equivalent to 80% of the dataset. 80% of the classified messages will be used
# to train the algorithm. The other 20% will be used to test it.
dataset = dataset.sample(frac=1, random_state=1)
cutoff = round(0.8*(len(dataset.index)))
print(cutoff)

4458


In [5]:
# Divide the dataset into a training dataset and a test dataset.
training = dataset.iloc[:cutoff]
test = dataset.iloc[cutoff:]

In [6]:
# Reset the index for the newly created dataframes
training = training.reset_index(drop=True)
test = test.reset_index(drop=True)
print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [7]:
training['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [8]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The proportions of spam and non-spam messages in the two samples are very close to those in the full dataset (about 87% ham and 13% spam in each). This suggests that both samples are representative of the original dataset.
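
The shuffle-and-split step above can also be wrapped in a small helper. The following is only a sketch on a toy DataFrame: the rows, the `split_train_test` name and its parameters are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (hypothetical rows, 20% spam).
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

def split_train_test(df, train_frac=0.8, seed=1):
    """Shuffle the rows, then cut at train_frac of the length."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    cutoff = round(train_frac * len(shuffled))
    train_part = shuffled.iloc[:cutoff]
    test_part = shuffled.iloc[cutoff:].reset_index(drop=True)
    return train_part, test_part

toy_train, toy_test = split_train_test(toy)
print(len(toy_train), len(toy_test))  # 8 2

# Compare the class shares of each part with the full data.
print(toy['Label'].value_counts(normalize=True))
print(toy_train['Label'].value_counts(normalize=True))
```

On a sample this small the class shares can drift a lot between the two parts. With thousands of messages, as here, a plain random shuffle tends to keep them close.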

In [9]:
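#Clean the SMS data so it's easier to analyse: replace every non-word character with a space and lowercase the text.
#regex=True is passed explicitly because recent pandas versions treat the pattern as literal text by default.
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()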


In [10]:
training.head(10)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


In [11]:
training['SMS'] = training['SMS'].str.split()
training['SMS'].head(20)

0                     [yep, by, the, pretty, sculpture]
1     [yes, princess, are, you, going, to, make, me,...
2                       [welp, apparently, he, retired]
3                                              [havent]
4     [i, forgot, 2, ask, ü, all, smth, there, s, a,...
5     [ok, i, thk, i, got, it, then, u, wan, me, 2, ...
6     [i, want, kfc, its, tuesday, only, buy, 2, mea...
7                       [no, dear, i, was, sleeping, p]
8                            [ok, pa, nothing, problem]
9                      [ill, be, there, on, lt, gt, ok]
10    [my, uncles, in, atlanta, wish, you, guys, a, ...
11                                          [my, phone]
12                   [ok, which, your, another, number]
13    [the, greatest, test, of, courage, on, earth, ...
14    [dai, what, this, da, can, i, send, my, resume...
15                [i, am, late, i, will, be, there, at]
16    [freemsg, why, haven, t, you, replied, to, my,...
17           [k, text, me, when, you, re, on, th

### Detecting spam using a multinomial Naive Bayes algorithm

I will use the following Naive Bayes' algorithm to determine whether a message is more likely to be spam or ham (non-spam). It will take as its input the words in a message (denoted below as $w_{1}$, $w_{2}$ etc.), and output the likelihood of the message being spam or ham (non-spam). 

$P(\text{Spam} \mid w_1, w_2, \ldots) \propto P(\text{Spam}) \cdot P(w_1 \mid \text{Spam}) \cdot P(w_2 \mid \text{Spam}) \cdots$

$P(\text{Ham} \mid w_1, w_2, \ldots) \propto P(\text{Ham}) \cdot P(w_1 \mid \text{Ham}) \cdot P(w_2 \mid \text{Ham}) \cdots$

To estimate the probability of each word appearing in a message, given that the message is spam or given that it is ham, we will use Laplace smoothing. This prevents multiplication by 0 for words that appear in only one of the two classes. It can be represented as follows:

$P(w_i \mid \text{Spam}) = \dfrac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$

$P(w_i \mid \text{Ham}) = \dfrac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}$

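Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ appears in spam messages, $N_{\text{Spam}}$ is the total number of words in all spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training data (and likewise for ham). We will use α = 1. The algorithm will output either 'Ham' or 'Spam' depending on which probability is higher.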
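
To make the formulas concrete, here is a small worked example. The four toy messages, the equal priors of 0.5 and the helper name `p_word_given` are all made up for illustration. It is a sketch of the calculation, not the notebook's actual data.

```python
# Toy training data (hypothetical messages, already tokenized).
spam_msgs = [['win', 'money', 'now'], ['win', 'prize']]
ham_msgs = [['see', 'you', 'now'], ['call', 'me', 'now', 'please']]

alpha = 1
vocab = {w for msg in spam_msgs + ham_msgs for w in msg}
n_vocab = len(vocab)                             # 9 unique words
n_spam_words = sum(len(m) for m in spam_msgs)    # 5 words in spam
n_ham_words = sum(len(m) for m in ham_msgs)      # 7 words in ham

def p_word_given(word, msgs, n_words):
    """Laplace-smoothed P(word | class) for one class's messages."""
    n_word = sum(m.count(word) for m in msgs)
    return (n_word + alpha) / (n_words + alpha * n_vocab)

# 'win' never appears in ham, yet its smoothed probability is not zero.
print(p_word_given('win', spam_msgs, n_spam_words))  # (2 + 1) / (5 + 9)
print(p_word_given('win', ham_msgs, n_ham_words))    # (0 + 1) / (7 + 9)

# Score the message "win now" with assumed equal priors of 0.5.
spam_score = 0.5
ham_score = 0.5
for w in ['win', 'now']:
    spam_score *= p_word_given(w, spam_msgs, n_spam_words)
    ham_score *= p_word_given(w, ham_msgs, n_ham_words)

print('spam' if spam_score > ham_score else 'ham')   # spam
```

Without the α terms, the ham score would be multiplied by 0 as soon as the message contained 'win', regardless of the other words.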

In [12]:
#First we calculate the prior probabilities that a message is spam or ham (i.e. P(Spam) and P(Ham))
#Indexing by label avoids relying on the order in which value_counts sorts the classes

label_shares = training['Label'].value_counts(normalize=True)
p_spam = label_shares['spam']
p_ham = label_shares['ham']

In [13]:
#Then we create a list of all unique words across every SMS

vocabulary = list({word for sms in training['SMS'] for word in sms})
len(vocabulary)

7783

In [14]:
#Next we extend the dataframe so each word in the vocabulary has its own column. Then the number of times each word 
#appears in each SMS can be entered into the dataframe for analysis.

word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)
updated_training = pd.concat([training, word_counts], axis=1)


In [15]:
updated_training.head(5)

Unnamed: 0,Label,SMS,karo,planning,less,piss,bw,bottle,oredi,walsall,...,scold,footprints,hadn,4msgs,jurong,clean,wld,mins,wen,pride
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
#Next we find the number of words (total words, not unique words) in all spam messages

spam = updated_training[updated_training['Label'] == 'spam']
n_spam = spam['SMS'].apply(len).sum()
print(n_spam)

15190


In [17]:
#And the same for the ham messages

ham = updated_training[updated_training['Label'] == 'ham']
n_ham = ham['SMS'].apply(len).sum()
print(n_ham)

57237


In [18]:
#We will use 1 as the alpha variable in the Laplace smoothing equation

alpha = 1

Now we apply Laplace smoothing to find the probability of each word in the vocabulary appearing, given that the message is either spam or ham. We will save these probabilities in two dictionaries, one for spam and one for ham.

In [19]:
ham_dict = {}
spam_dict = {}
for w in vocabulary:
    ham_dict[w] = 0
    spam_dict[w] = 0

In [20]:
#Below we take each word from the vocabulary, count how many times it appears in spam and in ham messages, and apply
#Laplace smoothing to determine the likelihood of the word appearing given the message is spam and the
#likelihood of it appearing given the message is ham.

n_vocabulary = len(vocabulary)
for w in vocabulary:
    nw_spam = spam['SMS'].apply(lambda x: x.count(w)).sum()
    nw_ham = ham['SMS'].apply(lambda x: x.count(w)).sum()
    spam_dict[w] = (nw_spam + alpha)/(n_spam + alpha*n_vocabulary)
    ham_dict[w] = (nw_ham + alpha)/(n_ham + alpha*n_vocabulary)
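
The loop above rescans every message once per vocabulary word, which can be slow with thousands of words. One possible alternative is to count all words in a single pass with `collections.Counter`. The helper name `word_likelihoods` and the toy inputs below are assumptions for illustration only.

```python
from collections import Counter

def word_likelihoods(messages, vocabulary, alpha=1):
    """Return {word: P(word | class)} with Laplace smoothing.

    `messages` is an iterable of token lists that all belong to one class.
    """
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)          # one pass over the messages
    n_words = sum(counts.values())     # total words in this class
    denom = n_words + alpha * len(vocabulary)
    # Counter returns 0 for words that never occur in this class.
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Toy check (hypothetical data): 3 words in total, 3 vocabulary entries.
toy_spam = [['win', 'win', 'now']]
toy_vocab = ['win', 'now', 'hello']
probs = word_likelihoods(toy_spam, toy_vocab)
print(probs)
```

Under these assumptions, something like `word_likelihoods(spam['SMS'], vocabulary, alpha)` should give the same values as the loop for `spam_dict`, since the total word count matches `n_spam`.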



Now that we have the variables we need, we can construct the spam filter. We will create a function that takes a message as its input, cleans the text and applies the multinomial Naive Bayes algorithm, combining the prior probabilities of a message being spam or ham with the probabilities we have saved for each word in the vocabulary. The function will return either 'Message is spam' or 'Message is ham' depending on which probability is higher, and it will also print the probabilities as determined by the algorithm.

In [21]:
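import re

def classify_message(message):
    # Strip punctuation with a regex (plain str.replace would search for the literal text '\W'),
    # then lowercase the message and split it into words.
    message = re.sub(r'\W', ' ', message).lower().split()

    # Start from the priors and multiply in each word's smoothed likelihood.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Words that never appeared in the training vocabulary are skipped.
    for w in message:
        if w in spam_dict:
            p_spam_given_message *= spam_dict[w]
        if w in ham_dict:
            p_ham_given_message *= ham_dict[w]

    print('P(Spam):', p_spam_given_message)
    print('P(Ham):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'Message is ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'Message is spam'
    else:
        return 'Message needs human classification'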

    

In [22]:
#Now we can test the algorithm on a Ham message.
classify_message('Sounds good, Tom, then see u there')

P(Spam): 5.359472501724851e-18
P(Ham): 2.8089018273976984e-14


'Message is ham'

In [23]:
#And the same for a Spam message
classify_message('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam): 1.0164097981708963e-18
P(Ham): 1.8195638182330266e-19


'Message is spam'
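
The scores printed above are already around 1e-18 for short messages. For much longer messages, a product of many small probabilities can underflow to 0.0 for both classes, which would end in a tie. A common workaround is to add logarithms instead of multiplying. The sketch below uses made-up likelihoods and hypothetical helper names (`product_score`, `log_score`) and is not part of the original filter.

```python
import math

def product_score(tokens, prior, likelihoods):
    """Direct product of the prior and per-word likelihoods."""
    score = prior
    for w in tokens:
        if w in likelihoods:
            score *= likelihoods[w]
    return score

def log_score(tokens, prior, likelihoods):
    """Same quantity in log space; unknown words are skipped."""
    score = math.log(prior)
    for w in tokens:
        if w in likelihoods:
            score += math.log(likelihoods[w])
    return score

# Hypothetical likelihoods and an artificially long message of 200 tokens.
spam_probs = {'free': 1e-3}
ham_probs = {'free': 1e-4}
tokens = ['free'] * 200

# Both products underflow to zero, so they cannot be compared.
print(product_score(tokens, 0.13, spam_probs), product_score(tokens, 0.87, ham_probs))  # 0.0 0.0

# The log scores stay finite and still rank the two classes.
print(log_score(tokens, 0.13, spam_probs) > log_score(tokens, 0.87, ham_probs))  # True
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the products whenever the products are representable.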

The algorithm works. We now need to alter it slightly so that it outputs only the label ('spam' or 'ham'). This way we can test its accuracy by comparing its output to each message's label in the test dataset.

In [24]:
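def classify_test(message):
    # Strip punctuation with a regex (plain str.replace would search for the literal text '\W'),
    # then lowercase the message and split it into words.
    message = re.sub(r'\W', ' ', message).lower().split()

    # Start from the priors and multiply in each word's smoothed likelihood.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Words that never appeared in the training vocabulary are skipped.
    for w in message:
        if w in spam_dict:
            p_spam_given_message *= spam_dict[w]
        if w in ham_dict:
            p_ham_given_message *= ham_dict[w]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Needs human classification'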

In [25]:
test['filter_result'] = test['SMS'].apply(classify_test)

In [26]:
match = test['Label'] == test['filter_result']
match.value_counts(normalize=True)

True     0.982944
False    0.017056
dtype: float64

The algorithm classifies 98.3% of the test messages correctly, well above the 86.8% that would result from labelling every message as ham. This is a great success.
To see if we could make it even more accurate, we can examine instances where it failed.
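
Because most messages are ham, overall accuracy does not show how the errors split between the two classes. A cross-tabulation of true labels against predictions is one way to see this. The sketch below uses a tiny made-up results table rather than the notebook's `test` dataframe, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical labels and predictions for six messages.
results = pd.DataFrame({
    'Label':         ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'filter_result': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Rows are the true labels, columns are the filter's predictions.
confusion = pd.crosstab(results['Label'], results['filter_result'])
print(confusion)

# Share of real spam that was caught, and share of flagged messages that really were spam.
spam_recall = confusion.loc['spam', 'spam'] / confusion.loc['spam'].sum()
spam_precision = confusion.loc['spam', 'spam'] / confusion['spam'].sum()
accuracy = (results['Label'] == results['filter_result']).mean()
print(spam_recall, spam_precision, accuracy)
```

For a spam filter, ham wrongly flagged as spam is usually the more costly mistake, so the precision figure is worth watching alongside accuracy.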

In [27]:
test[test['Label'] != test['filter_result']].shape

(19, 3)

In [28]:
# Show the full text of each message instead of truncating it
pd.set_option("display.max_colwidth", None)
test[test['Label'] != test['filter_result']].head(19)

Unnamed: 0,Label,SMS,filter_result
114,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
287,spam,Cashbin.co.uk (Get lots of cash this weekend!) www.cashbin.co.uk Dear Welcome to the weekend We have got our biggest and best EVER cash give away!! These..,ham
319,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us",spam
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123,ham
466,spam,You won't believe it but it's true. It's Incredible Txts! Reply G now to learn truly amazing things that will blow your mind. From O2FWD only 18p/txt,ham
492,ham,"Madam,regret disturbance.might receive a reference check from DLF Premarica.kindly be informed.Rgds,Rakhesh,Kerala.",spam
504,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50",ham


A few things can be deduced from the messages that the algorithm failed to classify correctly. Several spam messages that were incorrectly labelled ham contain exclamation marks and words in capitals, both of which our algorithm ignores. We can try to improve the algorithm by making it sensitive to exclamation marks and to words written entirely in capitals.
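
As a quick illustration of what such case-sensitive cleaning produces for a single message, here is a minimal sketch. The function name `tokenize_keep_caps` and the sample text are hypothetical.

```python
import re

def tokenize_keep_caps(message):
    # Replace everything except word characters and '!' with a space.
    cleaned = re.sub(r'[^\w!]', ' ', message)
    # Pad each '!' with spaces so it becomes its own token.
    tokens = cleaned.replace('!', ' ! ').split()
    # Lowercase every token that is not entirely in capitals.
    return [t if t.isupper() else t.lower() for t in tokens]

print(tokenize_keep_caps('WINNER!! Call now, it is FREE'))
# ['WINNER', '!', '!', 'call', 'now', 'it', 'is', 'FREE']
```

With this scheme 'FREE' and 'free' become two separate vocabulary entries, so each one is seen less often in training than the single lowercase word was.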

In [29]:
#Write a function that turns words to lowercase unless they are entirely in capitals.

def keep_uppercase(message):
    # `message` is a list of words; words written entirely in capitals are kept as they are.
    return [w if w.isupper() else w.lower() for w in message]

In [30]:
#I will make a new training dataset which preserves words in all capital letters and also exclamation marks.
#.copy() gives training2 its own data, so the assignments below do not raise SettingWithCopyWarning.

training2 = dataset.iloc[:cutoff].copy()
training2['SMS'] = training2['SMS'].str.replace(r'[^\w!]', ' ', regex=True)
training2['SMS'] = training2['SMS'].str.replace('!', ' ! ', regex=False).str.split()
training2['SMS'] = training2['SMS'].apply(keep_uppercase)


In [31]:
training2.reset_index(inplace=True, drop=True)
training2.head(10)

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me, moan]"
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[I, forgot, 2, ask, ü, all, smth, there, s, a, card, on, da, present, lei, how, Ü, all, want, 2, write, smth, or, sign, on, it]"
5,ham,"[ok, i, thk, i, got, it, then, u, wan, me, 2, come, now, or, wat]"
6,ham,"[I, want, kfc, its, tuesday, only, buy, 2, meals, ONLY, 2, no, gravy, only, 2, mark, 2, !]"
7,ham,"[no, dear, i, was, sleeping, P]"
8,ham,"[ok, pa, nothing, problem]"
9,ham,"[ill, be, there, on, lt, gt, ok]"


In [32]:
#I will rename the SMS column because SMS may be a word in new vocabulary, and hence have its own column name.

training2 = training2.rename(columns={'SMS': 'S.M.S.'})

In [33]:
#I will repeat the steps from making the first filter. Here I will make a list of all the words in the new vocabulary.

vocabulary2 = []
for l in training2['S.M.S.']:
    for w in l:
        vocabulary2.append(w)
vocabulary2 = set(vocabulary2)
vocabulary2 = list(vocabulary2)
len(vocabulary2)

8509

In [34]:
#Here I will set some new variables for the Laplace smoothing equation.

n_vocabulary2 = len(vocabulary2)
alpha = 1

In [35]:
#Create a new training dataframe with the word counts from the new vocabulary

new_word_counts_per_sms = {unique_word: [0] * len(training2['S.M.S.']) for unique_word in vocabulary2}
for index, sms in enumerate(training2['S.M.S.']):
    for word in sms:
        new_word_counts_per_sms[word][index] += 1
new_word_counts = pd.DataFrame(new_word_counts_per_sms)
updated_training2 = pd.concat([training2, new_word_counts], axis=1)

In [36]:
#Find number of words from new vocabulary in all of the spam messages.

spam2 = updated_training2[updated_training2['Label'] == 'spam']
n_spam2 = spam2['S.M.S.'].apply(len).sum()
print(n_spam2)

15631


In [37]:
#Find number of words from new vocabulary in all of the ham messages.

ham2 = updated_training2[updated_training2['Label'] == 'ham']
n_ham2 = ham2['S.M.S.'].apply(len).sum()
print(n_ham2)

57944


In [38]:
#Create 2 new dictionaries (1 for spam and 1 for ham), to contain the probabilities of each word from the new vocabulary
#appearing in either a spam or ham message.

ham_dict2 = {}
spam_dict2 = {}
for w in vocabulary2:
    ham_dict2[w] = 0
    spam_dict2[w] = 0   

In [39]:
#Calculate the values for the dictionaries

n_vocabulary2 = len(vocabulary2)
for w in vocabulary2:
    nw_spam2 = spam2['S.M.S.'].apply(lambda x: x.count(w)).sum()
    nw_ham2 = ham2['S.M.S.'].apply(lambda x: x.count(w)).sum()
    spam_dict2[w] = (nw_spam2 + alpha)/(n_spam2 + alpha*n_vocabulary2)
    ham_dict2[w] = (nw_ham2 + alpha)/(n_ham2 + alpha*n_vocabulary2)

In [40]:
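#Create new filter using new variables.

def classify_test2(message):
    # Same cleaning as for training2: keep word characters and '!', make each '!' its own
    # token, and lowercase every word that is not entirely in capitals.
    message = re.sub(r'[^\w!]', ' ', message)
    message = keep_uppercase(message.replace('!', ' ! ').split())

    # Start from the priors and multiply in each word's smoothed likelihood.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Words that never appeared in the new vocabulary are skipped.
    for w in message:
        if w in spam_dict2:
            p_spam_given_message *= spam_dict2[w]
        if w in ham_dict2:
            p_ham_given_message *= ham_dict2[w]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Needs human classification'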

In [41]:
#Apply new filter to the test dataframe

test['filter_result2'] = test['SMS'].apply(classify_test2)

In [42]:
#Determine success rate of filter

match = test['Label'] == test['filter_result2']
match.value_counts(normalize=True)

True     0.979354
False    0.020646
dtype: float64

The updated algorithm performed slightly worse than the first one, though it is still highly effective, with a success rate of 97.9% (compared to 98.3% for the first filter).

In [43]:
#Find out how many messages were incorrectly classified

test[test['Label'] != test['filter_result2']].shape

(23, 4)

In [44]:
#Display incorrectly classified messages

pd.set_option("display.max_colwidth", None)
test[test['Label'] != test['filter_result2']].head(23)

Unnamed: 0,Label,SMS,filter_result,filter_result2
29,ham,THING R GOOD THANX GOT EXAMS IN MARCH IVE DONE NO REVISION? IS FRAN STILL WITH BOYF? IVE GOTTA INTERVIW 4 EXETER BIT WORRIED!x,ham,spam
114,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,ham,ham
135,spam,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB,spam,ham
207,ham,Miss call miss call khelate kintu opponenter miss call dhorte lage. Thats d rule. One with great phone receiving quality wins.,ham,spam
244,ham,ARE YOU IN TOWN? THIS IS V. IMPORTANT,ham,spam
287,spam,Cashbin.co.uk (Get lots of cash this weekend!) www.cashbin.co.uk Dear Welcome to the weekend We have got our biggest and best EVER cash give away!! These..,ham,ham
323,ham,CHEERS U TEX MECAUSE U WEREBORED! YEAH OKDEN HUNNY R UIN WK SAT?SOUNDS LIKEYOUR HAVIN GR8FUN J! KEEP UPDAT COUNTINLOTS OF LOVEME XXXXX.,ham,spam
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123,ham,ham
443,spam,"Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!",spam,ham
466,spam,You won't believe it but it's true. It's Incredible Txts! Reply G now to learn truly amazing things that will blow your mind. From O2FWD only 18p/txt,ham,ham


It appears that making the algorithm sensitive to case may have caused it to inadvertently flag some ham messages with lots of capital letters as spam.

Overall, it appears the first spam filter performed slightly better.