# Building a Spam Filter with Naive Bayes

In this project, we build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to classify SMS messages as accurately as possible.

We will use a dataset from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) to train the algorithm. The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

In [10]:
import pandas as pd

sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
print(sms.shape)
sms.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [11]:
sms.Label.value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

# Training and Test Set

We will split the dataset into two parts: a training set (80% of the original dataset) and a test set (20%).

In [12]:
data_randomized = sms.sample(frac=1, random_state=1)

split_index = round(len(data_randomized) * 0.8)

# Training/Test split
training = data_randomized[:split_index].reset_index(drop=True)
test = data_randomized[split_index:].reset_index(drop=True)

print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [13]:
training.Label.value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [14]:
test.Label.value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The percentages of spam and non-spam messages in the training and test sets are roughly the same as in the original dataset.
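
As a sanity check on the shuffle-and-slice approach, the same steps can be run on a small toy DataFrame. This is a minimal sketch: the frame `toy` and its contents are made up for illustration and are not part of the SMS dataset.

```python
import pandas as pd

# Toy stand-in for the SMS data: 10 labelled messages (hypothetical content).
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Shuffle all rows reproducibly, then cut at the 80% mark.
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)

toy_training = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

print(len(toy_training), len(toy_test))

# Every message should land in exactly one of the two sets.
all_messages = set(toy_training.SMS) | set(toy_test.SMS)
print(all_messages == set(toy.SMS))
```

Because the rows are shuffled before slicing, both parts are random samples, which is why the class proportions stay close to those of the full dataset.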

# Letter Case and Punctuation

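We will replace every non-word character in each message with a space and convert the text to lowercase.

In [15]:
# regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default
training.SMS = training.SMS.str.replace(r'\W', ' ', regex=True).str.lower()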
training.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
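
To see what the `\W` pattern does to a single message, here is a small sketch using Python's `re` module. The helper name `clean` and the sample text are assumptions made for illustration.

```python
import re

def clean(message):
    # Replace every non-word character (anything other than letters,
    # digits and underscore) with a space, then lowercase the result.
    return re.sub(r'\W', ' ', message).lower()

raw = 'WINNER!! Call 0800-123 now.'
cleaned = clean(raw)

# Splitting on whitespace drops the runs of spaces left by the punctuation.
tokens = cleaned.split()
print(tokens)  # ['winner', 'call', '0800', '123', 'now']
```

In Python 3, `\W` is Unicode-aware for `str` patterns, so letters such as `ü` count as word characters and survive the cleaning, as the output above shows.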


# Creating the Vocabulary

We will transform the DataFrame as shown below, so that each column counts how many times a word appears in each SMS message.

![](cleaning.PNG)

In [16]:
training.SMS = training.SMS.str.split()

# Collect every distinct word that appears in the training messages
unique_words = list({word for sms in training.SMS for word in sms})
print(len(unique_words))

7783


In [17]:
word_counts_per_sms = {word: [0] * len(training.SMS) for word in unique_words}

for index, sms in enumerate(training.SMS):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [18]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

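| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |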


In [19]:
training_clean = pd.concat([training, word_counts], axis=1)
training_clean.head()

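| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |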
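
The vocabulary and count-table construction in this section can be followed by hand on a few short messages. The sketch below uses plain Python lists and made-up messages (`toy_messages` is a hypothetical name), with the same dictionary-of-lists layout as the cells above.

```python
# Three toy messages, already cleaned and split into words (made-up data).
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'prize', 'prize'],
]

# Vocabulary: every distinct word seen in the messages (sorted for a stable order).
vocabulary = sorted({word for message in toy_messages for word in message})

# One list of counts per word, with one slot per message.
word_counts_per_message = {word: [0] * len(toy_messages) for word in vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        word_counts_per_message[word][index] += 1

print(len(vocabulary))                    # 9
print(word_counts_per_message['prize'])   # [1, 0, 2]
print(word_counts_per_message['secret'])  # [1, 1, 0]
```

Each key becomes one column of the final table and each list position one row, which is why `pd.DataFrame` can consume the dictionary directly.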


# Calculating Constants First

Now that the data is clean, we can move on to training the algorithm.
Below are the formulas we use to decide whether a message is spam or ham.

![](formula.PNG)

![](formula_detail.PNG)
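
For readers who cannot see the images, the formulas can be written out as follows. This is a reconstruction from the code in this notebook, assuming the images show the standard multinomial Naive Bayes expressions with additive (Laplace) smoothing; the symbol names mirror the variable names used in the cells below.

```latex
P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}

P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, $N_{\text{Vocabulary}}$ is the number of distinct words in the training set, and $\alpha$ is the smoothing parameter.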

In [20]:
spam_messages = training_clean[training_clean.Label == 'spam']
ham_messages = training_clean[training_clean.Label == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_clean)
p_ham = len(ham_messages) / len(training_clean)

# N_Spam
n_words_per_spam_message = spam_messages.SMS.apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages.SMS.apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(unique_words)

# Laplace smoothing
alpha = 1

# Calculating Parameters

In [21]:
parameters_spam = {unique_word:0 for unique_word in unique_words}
parameters_ham = {unique_word:0 for unique_word in unique_words}

for word in unique_words:
    n_word_given_spam = spam_messages[word].sum()  
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
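
To make the smoothing arithmetic concrete, here is a small worked example on made-up data. The names `toy_spam`, `toy_ham` and `smoothed_probability` are hypothetical and only serve the illustration.

```python
import math

# Toy training data (made up): each message is a list of words.
toy_spam = [['win', 'prize', 'now'], ['claim', 'prize']]
toy_ham = [['see', 'you', 'now'], ['ok', 'see', 'you', 'soon']]

vocabulary = sorted({w for m in toy_spam + toy_ham for w in m})
n_vocabulary = len(vocabulary)             # 8 distinct words
n_spam = sum(len(m) for m in toy_spam)     # 5 words in spam messages
n_ham = sum(len(m) for m in toy_ham)       # 7 words in ham messages
alpha = 1

def smoothed_probability(word, messages, n_class):
    # Count of the word within one class, with Laplace smoothing applied.
    n_word = sum(m.count(word) for m in messages)
    return (n_word + alpha) / (n_class + alpha * n_vocabulary)

p_prize_spam = smoothed_probability('prize', toy_spam, n_spam)  # (2 + 1) / (5 + 8)
p_prize_ham = smoothed_probability('prize', toy_ham, n_ham)     # (0 + 1) / (7 + 8)
print(p_prize_spam, p_prize_ham)

# Within one class, the smoothed probabilities over the vocabulary sum to 1.
total_spam = sum(smoothed_probability(w, toy_spam, n_spam) for w in vocabulary)
print(math.isclose(total_spam, 1.0))
```

Note that `prize` never occurs in the toy ham messages, yet its ham probability is 1/15 rather than 0. This is the purpose of the smoothing term: a single unseen word cannot zero out the whole product.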

# Classifying A New Message

We have calculated all the constants and parameters needed to create the spam filter. The filter takes a new message and computes the probabilities of the message being spam and ham. Words that do not appear in the training vocabulary are ignored.

In [22]:
import re

def classify(message):   
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'

In [28]:
# Test it...
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))

print(classify("Sounds good, Tom, then see u there"))

spam
ham
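
The `classify` function above multiplies many small probabilities together. For long messages such a product can underflow to zero, so a common alternative is to add logarithms instead. The sketch below is one possible variant, not the method used in this notebook: `classify_log` is a hypothetical function that takes the priors and parameter dictionaries as arguments, and the toy parameter values are made up.

```python
import math

def classify_log(words, p_spam, p_ham, parameters_spam, parameters_ham):
    """Hypothetical log-space variant of classify(); expects cleaned, split words."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the training vocabulary are skipped, as in classify().
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human review'

# Toy parameters (made-up values, not taken from the SMS data).
toy_spam = {'prize': 3 / 13, 'now': 2 / 13, 'see': 1 / 13}
toy_ham = {'prize': 1 / 15, 'now': 2 / 15, 'see': 3 / 15}

first = classify_log(['prize', 'now', 'unseen'], 0.5, 0.5, toy_spam, toy_ham)
second = classify_log(['see', 'now'], 0.5, 0.5, toy_spam, toy_ham)
print(first, second)  # spam ham

# Why logs help: a product of 100 factors of 1e-4 underflows to 0.0,
# while the equivalent sum of logs stays finite.
product = 1.0
for _ in range(100):
    product *= 1e-4
log_sum = 100 * math.log(1e-4)
print(product, log_sum)
```

Because the logarithm is monotonic, comparing the two sums gives the same decision as comparing the two products whenever the products do not underflow.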


# Test Accuracy

We will now try our filter and test its accuracy by using the test dataset.

In [29]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult to type | ham |
| 4 | ham | All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing! | ham |


In [26]:
# Count the test messages whose predicted label matches the actual label
correct = (test.Label == test.predicted).sum()

# Accuracy is reported as a fraction between 0 and 1, not as a percentage
print('Accuracy: {:.3f}'.format(correct / len(test)))

print('Incorrectly predicted messages...')
pd.set_option('max_colwidth', 999999)
test[test.Label != test.predicted]

Accuracy: 0.987
Incorrectly predicted messages...


| | Label | SMS | predicted |
|---|---|---|---|
| 114 | spam | Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net | ham |
| 135 | spam | More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB | ham |
| 152 | ham | Unlimited texts. Limited minutes. | spam |
| 159 | ham | 26th OF JULY | spam |
| 284 | ham | Nokia phone is lovly.. | spam |
| 293 | ham | A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8 | Equal proabilities, have a human classify this! |
| 302 | ham | No calls..messages..missed calls | spam |
| 319 | ham | We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us | spam |
| 504 | spam | Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50 | ham |
| 546 | spam | Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text | ham |
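
The rows listed above can be grouped by error type, which separates spam that slipped through from ham that was wrongly flagged. This is a minimal sketch that copies the (actual, predicted) pairs from the table by hand; `UNDECIDED` is a stand-in label for the classifier's "equal probabilities" message.

```python
from collections import Counter

# Stand-in for the "equal probabilities" result returned by the classifier.
UNDECIDED = 'undecided'

# (actual label, predicted label) for each row listed in the table above.
errors = [
    ('spam', 'ham'),      # row 114
    ('spam', 'ham'),      # row 135
    ('ham', 'spam'),      # row 152
    ('ham', 'spam'),      # row 159
    ('ham', 'spam'),      # row 284
    ('ham', UNDECIDED),   # row 293
    ('ham', 'spam'),      # row 302
    ('ham', 'spam'),      # row 319
    ('spam', 'ham'),      # row 504
    ('spam', 'ham'),      # row 546
]

error_counts = Counter(errors)
missed_spam = error_counts[('spam', 'ham')]    # spam delivered as ham
flagged_ham = error_counts[('ham', 'spam')]    # ham wrongly marked as spam
undecided = error_counts[('ham', UNDECIDED)]   # left for a human to decide
print(missed_spam, flagged_ham, undecided)  # 4 5 1
```

For a spam filter, the two error types usually have different costs: a missed spam message is an annoyance, while a legitimate message marked as spam may never be read.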


# Conclusion

The filter achieved an accuracy of 0.987 (about 98.7%) on the test set.
Some of the misclassified messages are difficult even for a human to label correctly. The next step is to examine these edge cases and work out how to classify them correctly.