# Building a Spam Filter with Naive Bayes

The goal of this project is to build a spam filter using the Naive Bayes algorithm. To do that, we'll use a dataset of 5572 SMS messages that have already been classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).



In [1]:
import pandas as pd
pd.options.display.max_colwidth=1000

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['Label', 'SMS'])

df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [3]:
print('Number of rows:', df.shape[0])
print('Number of columns:', df.shape[1])

Number of rows: 5572
Number of columns: 2


In [4]:
# Share of each label, computed from the data instead of a hard-coded row count
ham_perc = round((df['Label'] == 'ham').mean() * 100, 0)
spam_perc = round((df['Label'] == 'spam').mean() * 100, 0)

print('% of ham messages: ', ham_perc)
print('% of spam messages:', spam_perc)

% of ham messages:  87.0
% of spam messages: 13.0


Of the 5572 SMS in the dataframe, 87% are non-spam (ham) and only 13% are spam.

To make the testing process more realistic, I'm keeping 80% of the dataset for training and 20% for testing, which means that:

* The training set will have 4458 messages
* The test set will have 1114 messages

In [5]:
# Randomize the dataset
data_randomized = df.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
train = data_randomized[:training_test_index].reset_index(drop=True)
test = data_randomized[training_test_index:].reset_index(drop=True)

print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


After splitting the data, I checked the percentages of spam and ham in the training and test sets. As expected, they stayed close to the original 87% and 13%.
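The check on the label proportions is not shown in the notebook, so here is a minimal sketch of how it could be done. It uses a small made-up dataframe (`toy`) and a hypothetical helper `label_percentages`. The same helper could be applied to `df`, `train`, and `test`.

```python
import pandas as pd

def label_percentages(frame, label_col='Label'):
    """Return the share of each label in percent, rounded to one decimal."""
    return (frame[label_col].value_counts(normalize=True) * 100).round(1)

# Toy data standing in for the real SMS collection (8 ham, 2 spam).
toy = pd.DataFrame({'Label': ['ham'] * 8 + ['spam'] * 2,
                    'SMS': [f'message {i}' for i in range(10)]})

# Same shuffle-and-slice split as in the notebook, on the toy data.
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

print(label_percentages(toy))
print(label_percentages(toy_train))
```

On a dataset as small as the toy one the proportions can drift a lot between the two parts. With several thousand messages, a random split usually keeps them close to the original.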

Now we need to do some cleaning so that we can calculate the probabilities the algorithm needs. To do that, we remove the punctuation and convert all the words to lower case.

## Cleaning letter case and punctuation

In [6]:
#Before cleaning
train.head()

#P.S.: Remember to use the private features of your texting app when
#you're going to send dirty SMS. Yes, I'm talking about index 4028

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on da present lei... How? Ü all want 2 write smth or sign on it?


In [7]:
#After cleaning
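# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)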
train['SMS'] = train['SMS'].str.lower()
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on da present lei how ü all want 2 write smth or sign on it
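
To see what the cleaning step does to a single message, here is a minimal sketch using the standard `re` module. The helper name `clean_sms` is hypothetical and is not part of the notebook. It combines the replace, lower-case, and split steps in one function.

```python
import re

def clean_sms(text):
    """Replace non-word characters with spaces, lower-case, and split into words."""
    return re.sub(r'\W', ' ', text).lower().split()

print(clean_sms("Yep, by the pretty sculpture"))
# ['yep', 'by', 'the', 'pretty', 'sculpture']

print(clean_sms("There's a card on da present lei... How?"))
# ['there', 's', 'a', 'card', 'on', 'da', 'present', 'lei', 'how']
```

Note that `\W` also splits on apostrophes, so "There's" becomes the two tokens "there" and "s", while digits and letters such as "ü" are kept.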


After cleaning, we need to find the unique words in the SMS column of the training set.

## Creating vocabulary


In [8]:
train['SMS'] = train['SMS'].str.split()
vocabulary = []
for row in train['SMS']:
    for word in row:
        vocabulary.append(word)
            
vocabulary = list(set(vocabulary))

len(vocabulary)

7783

There are 7783 unique words in the training set.

Now we need to create a dictionary that counts how many times each vocabulary word occurs in each SMS.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
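
To make the layout of the word-count table concrete, here is the same construction on two made-up tokenized messages. All `toy_` names are illustrative. Each row is one message, each column is one vocabulary word, and each cell is the number of times that word occurs in that message.

```python
import pandas as pd

# Two made-up, already tokenized messages.
toy_sms = [['free', 'prize', 'free'], ['see', 'you', 'soon']]

# Sorted only to get a stable column order for display.
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of per-message counts for every vocabulary word.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```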


In [10]:
train_clean = pd.concat([train, word_counts], axis=1)
train_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me, moan]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a, card, on, da, present, lei, how, ü, all, want, 2, write, smth, or, sign, on, it]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating constants first

We're now done with cleaning the training set, and we can begin creating the spam filter. To classify a new message, the Naive Bayes algorithm needs to compute these two quantities:

$$ P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) $$

$$ P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham) $$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

$$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$

$$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$

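Some of the terms in the four equations above will have the same value for every new message. We can calculate the value of these terms once and avoid repeating the computation when a new message comes in. Below, we'll use our training set to calculate:

* $P(Spam)$ and $P(Ham)$
* $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$ (the total number of words in the spam messages, the total number of words in the ham messages, and the number of unique words in the vocabulary)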

We'll also use Laplace smoothing and set $\alpha = 1$.
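
To illustrate why the smoothing term matters, here is a small sketch of the formula with made-up counts. The function name `smoothed_probability` and the numbers are hypothetical. Without smoothing, a word that never appears in a class gets probability 0, which would wipe out the whole product for that class.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """P(w|class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts: a class with 10 words in total and a vocabulary of 5 words.
unseen_unsmoothed = smoothed_probability(0, 10, 5, alpha=0)  # 0 / 10
unseen_smoothed = smoothed_probability(0, 10, 5, alpha=1)    # 1 / 15
seen_smoothed = smoothed_probability(3, 10, 5, alpha=1)      # 4 / 15

print(unseen_unsmoothed, unseen_smoothed, seen_smoothed)
```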

In [11]:
spam_sms = train_clean[train_clean['Label']=='spam']
ham_sms = train_clean[train_clean['Label']=='ham']

p_spam = len(spam_sms)/len(train_clean)
p_ham = len(ham_sms)/len(train_clean)

print('p(spam):', p_spam)
print('p(ham):', p_ham)

p(spam): 0.13458950201884254
p(ham): 0.8654104979811574


In [12]:
# N_Spam
n_words_per_spam_message = spam_sms['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_sms['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

print('Nspam:', n_spam)
print('Nham:', n_ham)
print('Nvocabulary:', n_vocabulary)

Nspam: 15190
Nham: 57237
Nvocabulary: 7783


## Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

$$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$

$$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$

In [13]:
#Parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}


for word in vocabulary:
    n_word_spam = spam_sms[word].sum()
    p_word_spam = (n_word_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_spam
    
    n_word_ham = ham_sms[word].sum()
    p_word_ham = (n_word_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_ham    
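
The loop above computes one column sum per word. As an alternative sketch (not the notebook's method), the same parameters can be obtained with a single column-wise sum in pandas. The helper `class_parameters` and the toy dataframe are hypothetical. The sketch also assumes that every word in the class's messages is in the vocabulary, so the total of the column sums equals the number of words in that class.

```python
import pandas as pd

# Made-up training table: a label plus one count column per vocabulary word.
toy_train_clean = pd.DataFrame({
    'Label': ['spam', 'ham', 'ham'],
    'free':  [2, 0, 0],
    'prize': [1, 0, 0],
    'see':   [0, 1, 1],
    'you':   [0, 1, 0],
})
toy_vocabulary = ['free', 'prize', 'see', 'you']

def class_parameters(frame, vocab, alpha=1):
    """Smoothed P(word|class) for every vocabulary word, as a dictionary."""
    word_totals = frame[vocab].sum()   # N_{w_i|class} for all words at once
    n_class = word_totals.sum()        # N_class: total words in this class
    probabilities = (word_totals + alpha) / (n_class + alpha * len(vocab))
    return probabilities.to_dict()

toy_parameters_spam = class_parameters(
    toy_train_clean[toy_train_clean['Label'] == 'spam'], toy_vocabulary)
toy_parameters_ham = class_parameters(
    toy_train_clean[toy_train_clean['Label'] == 'ham'], toy_vocabulary)

print(toy_parameters_spam)
print(toy_parameters_ham)
```

A useful sanity check is that the smoothed probabilities of all vocabulary words add up to 1 within each class.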


## Classifying A New Message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

* Takes in as input a new message (w1, w2, ..., wn).
* Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
* Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    * If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    * If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    * If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.



In [14]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
   
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]     

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [16]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [17]:
def classify_test(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
   
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]     
            
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [18]:
test['predicted'] = test['SMS'].apply(classify_test)
test

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out,spam
3,ham,All sounds good. Fingers . Makes it difficult to type,ham
4,ham,"All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing!",ham
5,ham,But my family not responding for anything. Now am in room not went to home for diwali but no one called me and why not coming. It makes me feel like died.,ham
6,ham,U too...,ham
7,ham,Boo what time u get out? U were supposed to take me shopping today. :(,ham
8,ham,Genius what's up. How your brother. Pls send his number to my skype.,ham
9,ham,I liked the new mobile,ham


In [19]:
print('Test labels:')
print(test['Label'].value_counts())
print('-' * 15)
print('Classified labels:')
print(test['predicted'].value_counts())

Test labels:
ham     967
spam    147
Name: Label, dtype: int64
---------------
Classified labels:
ham                           969
spam                          144
needs human classification      1
Name: predicted, dtype: int64


The predicted counts are close to the true label counts: 969 vs. 967 for ham and 144 vs. 147 for spam, plus one SMS in the test set that was classified as needs human classification. That suggests that our spam filter works fine.

Judging by the counts alone, it looks like two spam SMS were classified as ham.

Let's see the SMS that needs human classification:

In [20]:
test[test['predicted']=='needs human classification']

Unnamed: 0,Label,SMS,predicted
293,ham,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied ""Boost is d secret of my energy"" n instantly d girl shouted ""our energy"" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8",needs human classification


According to the label, this SMS isn't spam. The filter wasn't able to classify it as spam or ham, possibly because many of its words are misspelled or abbreviated. After reading it, it looks like someone got free SMS and decided to start joking around.
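
Another possible explanation, which the notebook does not verify, is floating-point underflow. This message is very long, and multiplying that many small probabilities can round both products down to exactly 0.0, which then compare as equal. A common way around this is to add logarithms instead of multiplying probabilities. The sketch below shows the idea with made-up parameters. `log_scores` and the `toy_` names are hypothetical.

```python
import math

def log_scores(words, params_spam, params_ham, p_spam, p_ham):
    """Return (log-score for spam, log-score for ham), skipping unknown words."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in params_spam:
            log_spam += math.log(params_spam[word])
        if word in params_ham:
            log_ham += math.log(params_ham[word])
    return log_spam, log_ham

# Made-up parameters and a very long made-up message.
toy_params_spam = {'boost': 1e-3}
toy_params_ham = {'boost': 2e-3}
long_message = ['boost'] * 150

# Plain products, as in the notebook's classify function.
plain_spam, plain_ham = 0.13, 0.87
for word in long_message:
    plain_spam *= toy_params_spam[word]
    plain_ham *= toy_params_ham[word]

log_spam, log_ham = log_scores(long_message, toy_params_spam,
                               toy_params_ham, 0.13, 0.87)

print(plain_spam, plain_ham)  # both underflow to 0.0, so they tie
print(log_spam, log_ham)      # finite values that can still be compared
```

Since the logarithm is increasing, comparing the log-scores gives the same decision as comparing the products whenever the products do not underflow.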

Now let's look at the SMS that were spam but were classified as ham:

In [21]:
test[(test['Label'] == 'spam') & (test['predicted'] == 'ham')]

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,ham
135,spam,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB,ham
504,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50",ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text",ham
741,spam,"0A$NETWORKS allow companies to bill for SMS, so they are responsible for their ""suppliers"", just as a shop has to give a guarantee on what they sell. B. G.",ham
876,spam,RCT' THNQ Adrian for U text. Rgds Vatian,ham
885,spam,2/2 146tf150p,ham
953,spam,Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r,ham


At first it looked like only 2 SMS were affected, but in reality 8 SMS that were clearly spam were classified as ham. However, these 8 SMS represent only 0.72% of the test set.

Now, let's see if the algorithm classifies ham SMS as spam:

In [22]:
test[(test['Label'] == 'ham') & (test['predicted'] == 'spam')]

Unnamed: 0,Label,SMS,predicted
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
302,ham,No calls..messages..missed calls,spam
319,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us",spam


I found that 5 ham SMS were misclassified as spam, but again they represent only 0.45% of the test set, and some of them actually look like spam.
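
The two filters above can also be summarized in a single table of true labels against predicted labels. Here is a minimal sketch with `pandas.crosstab` on a small made-up dataframe. `toy_test` is illustrative and its counts are not the notebook's results.

```python
import pandas as pd

# Made-up labels and predictions, using the same three predicted values.
toy_test = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham',
                  'needs human classification'],
})

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])
print(confusion)
```

Applied to the real `test` dataframe, the off-diagonal cells would show the misclassified messages at a glance.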

Now, let's measure the accuracy of the spam filter

In [24]:
# Count the rows where the predicted label matches the true label
correct = (test['Label'] == test['predicted']).sum()
total = test.shape[0]
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


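The filter reaches an accuracy of 98.74%, which is really good. As we saw, 14 SMS were misclassified (8 spam classified as ham, 5 ham classified as spam, and 1 left for human classification), which is a little more than 1% of the test set, and some of the SMS labeled as ham actually look like spam.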
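
Because only about 13% of the messages are spam, accuracy alone can look flattering: a filter that labels everything as ham would already be right for 967 of the 1114 test messages. As a rough complement, the sketch below derives spam recall and precision from the counts reported in the outputs above. The variable names are illustrative. Treating the one undecided message as an error is an assumption that matches how the accuracy was computed.

```python
# Counts taken from the outputs above (test set of 1114 SMS).
n_spam_total = 147    # true spam messages
n_ham_total = 967     # true ham messages
spam_as_ham = 8       # spam classified as ham
ham_as_spam = 5       # ham classified as spam
ham_undecided = 1     # ham left for human classification

spam_correct = n_spam_total - spam_as_ham
ham_correct = n_ham_total - ham_as_spam - ham_undecided
total = n_spam_total + n_ham_total

accuracy = (spam_correct + ham_correct) / total
spam_recall = spam_correct / n_spam_total                    # share of spam that was caught
spam_precision = spam_correct / (spam_correct + ham_as_spam) # share of spam predictions that were right
baseline_accuracy = n_ham_total / total                      # always predicting ham

print('Accuracy:', round(accuracy, 4))
print('Spam recall:', round(spam_recall, 4))
print('Spam precision:', round(spam_precision, 4))
print('Always-ham baseline accuracy:', round(baseline_accuracy, 4))
```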