# Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

Our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

In [2]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
print(sms.shape[0])
sms.head()


5572


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Look the share up by label rather than by position, so we know it is the ham percentage.
ham_pct = round(sms['Label'].value_counts(normalize=True)['ham'] * 100, 2)
print(ham_pct, "% of the SMS are ham!")

86.59 % of the SMS are ham!


### Training and Test sets
Before creating our algorithm, let's think about the testing procedure!
We will create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

Then we will take a training set that corresponds to 80% of the original dataset and a test set that corresponds to 20%.

In [4]:
sms = sms.sample(frac = 1, random_state =1)

In [5]:
sms_training = sms[:4458].reset_index(drop = True)
sms_test = sms[4458:].reset_index(drop = True)

print(sms_training.shape)
print(sms_test.shape)


(4458, 2)
(1114, 2)
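
The cut-off of 4458 rows is 80% of 5,572, rounded. As a minimal sketch, the shuffle and the cut could be wrapped in one helper so the index does not have to be hard-coded. The function name `split_train_test` and its parameters are made up for this illustration, and the toy DataFrame only mimics the size of our dataset.

```python
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=1):
    """Shuffle df, then cut it into (train, test) at round(len(df) * train_frac)."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = round(len(shuffled) * train_frac)
    train = shuffled.iloc[:cut].reset_index(drop=True)
    test = shuffled.iloc[cut:].reset_index(drop=True)
    return train, test

# Toy frame with the same number of rows as the SMS dataset (the content is a placeholder).
toy = pd.DataFrame({'Label': ['ham'] * 5572, 'SMS': ['hi'] * 5572})
train, test = split_train_test(toy)
print(train.shape, test.shape)
```

With 5,572 rows this gives the same 4,458 / 1,114 split as the cells above.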


Let's find the percentage of spam and ham in both the training and the test set.

In [6]:
# Index by label ('ham') instead of position so the printed text matches the value.
ham_pct_training = round(sms_training['Label'].value_counts(normalize=True)['ham'] * 100, 2)
print(ham_pct_training, "% of the SMS are ham in the training dataset.")
ham_pct_test = round(sms_test['Label'].value_counts(normalize=True)['ham'] * 100, 2)
print(ham_pct_test, "% of the SMS are ham in the test dataset.")

86.54 % of the SMS are ham in the training dataset.
86.8 % of the SMS are ham in the test dataset.


Both percentages are close to the 86.59% share of ham in the entire dataset.

### Data Cleaning

Now, in order to use our Naive Bayes algorithm, we need to clean the data a bit, especially the SMS column. We will turn it into many columns (one column per word) so that we can count the occurrences of each word in every SMS.

First, let's remove the punctuation and convert everything to lower case.

In [7]:
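# regex=True is required in recent pandas versions, where str.replace defaults to literal matching.
sms_training['SMS'] = sms_training['SMS'].str.replace(r'\W', ' ', regex=True)
sms_training['SMS'] = sms_training['SMS'].str.lower()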

In [8]:
sms_training.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
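
To see what this cleaning does to a single message, here is a small sketch using the standard `re` module. The helper name `clean_message` and the sample text are made up for illustration. Note that `\W` only removes non-word characters, so digits, underscores, and accented letters such as `ü` are kept.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r'\W', ' ', text).lower().split()

# Hypothetical message, not taken from the dataset.
print(clean_message("Free entry!! Call 0800-123 NOW..."))
# ['free', 'entry', 'call', '0800', '123', 'now']
```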


Now let's create a list called *vocabulary* containing all the unique words that appear in our SMS messages.

In [9]:
sms_training['SMS'] = sms_training['SMS'].str.split()   

In [10]:
vocabulary = []

# Use `message` as the loop variable so we don't overwrite the `sms` DataFrame.
for message in sms_training['SMS']:
    for word in message:
        vocabulary.append(word)
            
vocabulary = list(set(vocabulary))
len(vocabulary)

7783

### The Final Training Set

Now we're going to use the vocabulary to make the data transformation we need. We're going to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need.
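
To make the target shape concrete, here is a toy version of that transformation on three made-up, already tokenized messages. All names prefixed with `toy_` are hypothetical. Each column is one vocabulary word and each row holds the word counts of one message.

```python
import pandas as pd

# Three made-up messages, already cleaned and split into words.
toy_messages = [['secret', 'prize', 'claim', 'now'],
                ['coming', 'to', 'my', 'secret', 'party'],
                ['winner', 'claim', 'prize', 'prize']]

toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for idx, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][idx] += 1

toy_counts_df = pd.DataFrame(toy_counts)
print(toy_counts_df)
```

In the third row, the `prize` column holds 2 because the word appears twice in that message.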

In [11]:
word_counts = {unique_word: [0] * len(sms_training['SMS']) for unique_word in vocabulary}

# Use `message` as the loop variable so we don't overwrite the `sms` DataFrame.
for idx, message in enumerate(sms_training['SMS']):
    for word in message:
        word_counts[word][idx] += 1

In [12]:
word_counts_df = pd.DataFrame(word_counts)
word_counts_df.head(1)

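| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |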


In [13]:
sms_training_clean = pd.concat([sms_training, word_counts_df], axis = 1)
sms_training_clean.head()

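| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |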


### Creating the Spam Filter

The Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

$$P(Spam | w_1, w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i|Spam)$$

$$P(Ham | w_1, w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i|Ham)$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Some of the terms in the four equations above have the same value for every new message. We can calculate these terms once and reuse them whenever a new message comes in. Below, we'll use our training set to calculate:

- $P(Spam)$ and $P(Ham)$
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

We'll also use Laplace smoothing and set $\alpha = 1$.

Let's start with $P(Spam)$ and $P(Ham)$.

In [14]:
pham = len(sms_training_clean[sms_training_clean["Label"] == 'ham'])/len(sms_training_clean)
pspam = len(sms_training_clean[sms_training_clean["Label"] == 'spam'])/len(sms_training_clean)

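Now let's compute $N_{Spam}$ (the total number of words in all spam messages), $N_{Ham}$ (the total number of words in all ham messages) and $N_{Vocabulary}$ (the number of unique words), and initialize a variable named alpha with a value of 1.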

In [15]:
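# Drop the tokenized SMS column so that only the numeric word counts are summed per label.
training_sum = sms_training_clean.drop(columns="SMS").groupby("Label").sum()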
training_sum.head()

| Label | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ham | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 1 | 1 | 0 | 1 | 4 | 0 | 128 | 1 | 1 |
| spam | 3 | 9 | 25 | 0 | 1 | 1 | 2 | 7 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |


In [16]:
count = training_sum.sum(axis = 1)
count

Label
ham     57237
spam    15190
dtype: int64

In [17]:
# Index by label rather than by position so the values cannot be swapped by accident.
NSpam = count['spam']
NHam = count['ham']
NVocabulary = len(vocabulary)
alpha = 1

### Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

In [18]:
spam_dic = {unique_word: 0  for unique_word in vocabulary}
ham_dic = {unique_word: 0  for unique_word in vocabulary}


In [19]:
for word in vocabulary:
    PWHam = ((training_sum.loc['ham',word] + alpha)/(NHam + (alpha*NVocabulary)))
    ham_dic[word] = PWHam

    PWSpam = ((training_sum.loc['spam',word] + alpha)/(NSpam + (alpha*NVocabulary)))
    spam_dic[word] = PWSpam
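
As a worked example of these formulas, we can plug in the numbers printed earlier for the word `000`, which appears 25 times in spam and 0 times in ham. The snippet below is only a sketch: the lower-case variable names are introduced for this illustration and the values are copied by hand from the outputs above.

```python
# Constants copied from the outputs above.
alpha = 1
n_spam = 15190        # total words in spam messages
n_ham = 57237         # total words in ham messages
n_vocabulary = 7783   # unique words in the training set

# Counts for the word '000' from the grouped table.
n_000_spam = 25
n_000_ham = 0

p_000_given_spam = (n_000_spam + alpha) / (n_spam + alpha * n_vocabulary)
p_000_given_ham = (n_000_ham + alpha) / (n_ham + alpha * n_vocabulary)

print(p_000_given_spam)  # 26 / 22973, roughly 0.0011
print(p_000_given_ham)   # 1 / 65020, roughly 0.000015
```

Thanks to the smoothing term, the ham probability is small but not zero, so a single unseen word cannot wipe out the whole product for a class.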

We have calculated all our parameters, so let's move on to the next step!

### Classifying A New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [20]:
def classify(message):
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()    
    
    PSpamMessage = pspam 
    PHamMessage = pham
    
    for word in message:
        if word in spam_dic:
            PSpamMessage  *= spam_dic[word]
            
        if word in ham_dic:
            PHamMessage *= ham_dic[word]
      
        
    print('P(Spam|message):', PSpamMessage)
    print('P(Ham|message):', PHamMessage)

    if PHamMessage > PSpamMessage:
        print('Label: Ham')
    elif PHamMessage < PSpamMessage:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [21]:
classify("WINNER!! This is the secret code to unlock the money: C3421.")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [22]:
classify("Sounds good, Tom, then see u there")


P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
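
The values printed above are already around 1e-21 to 1e-27 for short messages. For a very long message, a product of many small probabilities can underflow to 0.0 for both classes, which would look like a tie. A common workaround, which is not part of the function above, is to add log-probabilities instead of multiplying probabilities. Here is a minimal sketch; `classify_log` and the toy parameter dictionaries are made up for illustration.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_params, ham_params):
    """Return 'spam', 'ham', or 'needs human classification' using log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

# Made-up parameters, for illustration only.
toy_spam = {'prize': 0.02, 'see': 0.001, 'you': 0.005}
toy_ham = {'prize': 0.0005, 'see': 0.01, 'you': 0.02}

print(classify_log('Claim your PRIZE', 0.13, 0.87, toy_spam, toy_ham))   # spam
print(classify_log('see you ' * 500, 0.13, 0.87, toy_spam, toy_ham))     # ham
```

Because the logarithm is increasing, comparing the sums gives the same decision as comparing the products. The second call has 1,000 words, and with plain multiplication both products would underflow to 0.0.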


On these two fairly obvious messages, our function does well. Now let's run it on the test set and compare its predictions with the real labels!

### Measuring the Spam Filter's Accuracy

We need to modify our function so that it returns only the label.

We are going to use the testing set by comparing the prediction we are making to the labels we already have!

In [25]:
def classify_test(message):
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()    
    
    PSpamMessage = pspam 
    PHamMessage = pham
    
    for word in message:
        if word in spam_dic:
            PSpamMessage  *= spam_dic[word]
            
        if word in ham_dic:
            PHamMessage *= ham_dic[word]

    if PHamMessage > PSpamMessage:
        return 'ham'
    elif PHamMessage < PSpamMessage:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
sms_test["prediction"] = sms_test["SMS"].apply(classify_test)
sms_test.head()

| | Label | SMS | prediction |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


Now let's compare the 'Label' and 'prediction' columns and calculate the accuracy.

In [40]:
# Compare the two columns element-wise and count the matches.
correct = (sms_test["Label"] == sms_test["prediction"]).sum()

accuracy = correct / len(sms_test) * 100

print("Our Spam Filter is accurate at {}%".format(round(accuracy,2)))
print(correct)
print(len(sms_test)-correct)

Our Spam Filter is accurate at 98.74%
1100
14


The accuracy is 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hadn't seen in training, and classified 1,100 of them correctly.
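
One caveat: 86.8% of our test messages are ham, so a filter that labelled everything as ham would already reach 86.8% accuracy. It can therefore help to look at the errors per class as well. The sketch below uses `pd.crosstab` on a small made-up DataFrame (`toy_results` is hypothetical, not our real predictions) to build a confusion table and derive precision and recall for the spam class.

```python
import pandas as pd

# Made-up labels and predictions, for illustration only.
toy_results = pd.DataFrame({
    'Label':      ['ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham'],
    'prediction': ['ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham'],
})

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy_results['Label'], toy_results['prediction'])
print(confusion)

true_spam = confusion.loc['spam', 'spam']
spam_precision = true_spam / confusion['spam'].sum()    # of predicted spam, how much is spam
spam_recall = true_spam / confusion.loc['spam'].sum()   # of real spam, how much was caught
print(spam_precision, spam_recall)
```

On the real test set the same call would be `pd.crosstab(sms_test['Label'], sms_test['prediction'])`. A third column would appear if any message came back as 'needs human classification'.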

### Next Steps

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, well above our initial goal of over 80%.

We can take a quick look at the 14 messages that were not labelled correctly and see whether any patterns stand out.

In [42]:
for idx,x in sms_test.iterrows():
    if sms_test.loc[idx,"Label"] != sms_test.loc[idx,"prediction"]:
        print(x['SMS'])
        print('----------------')

Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
----------------
More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
----------------
Unlimited texts. Limited minutes.
----------------
26th OF JULY
----------------
Nokia phone is lovly..
----------------
A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8
----------------
No calls..messages..missed calls
----------------
We have sent JD f

There is no single clear pattern across these messages, but we can see that some of them:

- are not written in standard English but in SMS shorthand, which creates new unique words that affect the label
- contain phone numbers or other figures, which may resemble the fake codes found in spam messages

Those are ideas that we could work on to enhance our Spam Filter in the future!
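
As one possible starting point for the second idea, every run of digits could be mapped to a single placeholder word before tokenizing, so that all phone numbers and codes share one vocabulary entry instead of each being a unique word. This is only a sketch of an untested idea: `normalize_numbers` and the `numtoken` placeholder are made up here, and we have not measured whether it improves the filter.

```python
import re

def normalize_numbers(text, placeholder='numtoken'):
    """Replace every run of digits with a single placeholder word."""
    return re.sub(r'\d+', f' {placeholder} ', text)

print(normalize_numbers('Call 09090204448 and join').lower().split())
# ['call', 'numtoken', 'and', 'join']
```

The same step would have to be applied to the training messages before building the vocabulary, and to new messages inside the classification function.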

