# Building a Spam Filter with Naive Bayes

We're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

First, we need to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
sms = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])

In [3]:
sms.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
sms.shape

(5572, 2)

In [5]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam.

## Training and test datasets

We'll first randomize the dataset, then split it into:
- a training set with 80% of the messages
- a test set with the remaining 20%

Objective: **test accuracy of 80%**

In [6]:
sms = sms.sample(frac=1,random_state=1)

In [7]:
sms.head()

|  | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [8]:
training_set_size = int(sms.shape[0]*0.8)
test_set_size = sms.shape[0] - training_set_size

In [9]:
training = sms.iloc[:training_set_size].reset_index(drop=True)

In [10]:
test = sms.iloc[training_set_size:].reset_index(drop=True)

In [11]:
training['Label'].value_counts(normalize=True)*100

ham     86.53803
spam    13.46197
Name: Label, dtype: float64

In [12]:
test['Label'].value_counts(normalize=True)*100

ham     86.816143
spam    13.183857
Name: Label, dtype: float64

The percentages of ham and spam in both `training` and `test` are similar to those in the full dataset.

## Lower case and cleaning punctuation

The classifier ignores punctuation, so we'll replace every non-word character with a space. We'll also convert every letter to lowercase.

In [13]:
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()

In [14]:
training.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
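
To see what this cleaning step does to a single message, here is a minimal sketch that uses Python's `re` module on a made-up string. The helper name `clean_message` and the sample text are hypothetical and not part of the notebook.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

sample = "SECRET prize! Claim it NOW!!"  # made-up message for illustration
cleaned = clean_message(sample)
tokens = cleaned.split()  # split() also collapses the repeated spaces
print(tokens)
```

Note that `\W` leaves digits and underscores in place, which is why tokens such as `81010` show up in the vocabulary.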


## Creating the vocabulary

Let's create a list with all of the unique words that occur in the messages of our training set.

In [15]:
training['SMS'] = training['SMS'].str.split()

In [16]:
vocabulary = []
# Use `message` as the loop variable so the `sms` DataFrame isn't overwritten
for message in training['SMS']:
    vocabulary.extend(message)
vocabulary_set = set(vocabulary)
vocabulary = list(vocabulary_set)

In [17]:
vocabulary

['goodfriend',
 '81010',
 'infections',
 'firsg',
 'silly',
 'thanksgiving',
 'fat',
 'doing',
 'checking',
 'sonyericsson',
 '75max',
 '31p',
 'sleeps',
 'callcost150ppmmobilesvary',
 '24th',
 'broth',
 'prsn',
 'erutupalam',
 'switch',
 'w8in',
 'flowing',
 'enuff',
 'grief',
 'accenture',
 'alaipayuthe',
 'matured',
 'lock',
 'groovying',
 'thinkthis',
 'lifpartnr',
 'plan',
 'atm',
 'roommate',
 'cake',
 'however',
 'symbol',
 'yalru',
 'bulbs',
 'billy',
 'take',
 'east',
 '08001950382',
 'earth',
 'these',
 'coping',
 'meal',
 '2go',
 'royal',
 'now',
 'presnts',
 'eventually',
 'gek1510',
 'f',
 'reformat',
 'jeevithathile',
 'discreet',
 'rencontre',
 'politicians',
 '3mins',
 '08704050406',
 'considering',
 'replacement',
 'tm',
 'fuelled',
 'elama',
 'customers',
 '3qxj9',
 'shexy',
 'landline',
 'route',
 'pretend',
 'her',
 'faber',
 'spice',
 'suman',
 'god',
 'especially',
 'kath',
 'othrs',
 'this',
 'bluetoothhdset',
 'timi',
 '89555',
 '20',
 '69866',
 'no',
 'network'

## The final training set

In [18]:
word_counts_per_sms = {unique_word:[0] * len(training['SMS']) for unique_word in vocabulary}

In [19]:
# Count how many times each word occurs in each message
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [20]:
word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)

In [21]:
training = pd.concat([training,word_counts_per_sms_df],axis=1)

In [22]:
training.head()

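|  | Label | SMS | goodfriend | 81010 | infections | firsg | silly | thanksgiving | fat | doing | ... | 15 | tonght | treasure | phones | scarcasim | dt | reverse | sf | draws | august |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |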
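
The transformation above turns each message into a row of word counts, with one column per vocabulary word. A minimal sketch of the same idea on two made-up messages, using only built-in types (the names `messages`, `vocab` and `counts` are illustrative and not the notebook's):

```python
# Two made-up, already-tokenized messages
messages = [['win', 'money', 'win'], ['see', 'you', 'soon']]

# Unique words, sorted only to make the printed order predictable
vocab = sorted({word for message in messages for word in message})

# One list of counts per vocabulary word, with one slot per message
counts = {word: [0] * len(messages) for word in vocab}
for index, message in enumerate(messages):
    for word in message:
        counts[word][index] += 1

for word in vocab:
    print(word, counts[word])
```

Each list in `counts` becomes one column once the dictionary is passed to `pd.DataFrame`, which is what the notebook does with `word_counts_per_sms`.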


## Calculating probabilities

### Calculating P(spam) and P(ham)

In [23]:
p_spam = sum(training['Label'] == 'spam') / len(training['Label'])
p_spam

0.13461969934933812

In [24]:
p_ham = sum(training['Label'] == 'ham') / len(training['Label'])
p_ham

0.8653803006506618

### Calculating N<sub>spam</sub>, N<sub>ham</sub> and N<sub>Vocabulary</sub>

In [25]:
n_spam = training[training['Label'] == 'spam']['SMS'].apply(len).sum()
n_spam

15190

In [26]:
n_ham = training[training['Label'] == 'ham']['SMS'].apply(len).sum()
n_ham

57233

In [27]:
n_vocabulary = len(vocabulary)
n_vocabulary

7782

In [28]:
alpha = 1
alpha

1
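
These constants plug into the smoothed conditional probabilities estimated in the next step. In the notation below, which matches the parameter loop that follows, $N_{w_i \mid Spam}$ and $N_{w_i \mid Ham}$ (symbols introduced here for illustration) denote how many times the word $w_i$ occurs in spam and in ham messages, and $\alpha$ is the smoothing parameter:

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$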

### Calculating Parameters

Initialize two dictionaries in which each key is a unique word from our vocabulary (as a string) and each value is 0. We'll need one dictionary to store the parameters for P(w<sub>i</sub>|Spam) and the other for P(w<sub>i</sub>|Ham).

In [35]:
p_spam_dict = {}
p_ham_dict = {}
for word in vocabulary:
    p_spam_dict[word] = 0
    p_ham_dict[word] = 0

Isolate the spam and the ham messages in the training set into two different DataFrames.

In [36]:
training_spam = training[training['Label'] == 'spam']
training_ham = training[training['Label'] == 'ham']

Compute the smoothed probability of each word for both classes and store it in the two dictionaries created earlier.

In [37]:
for word in vocabulary:
    p_spam_dict[word] = (training_spam[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    p_ham_dict[word] = (training_ham[word].sum() + alpha) / (n_ham + alpha * n_vocabulary)
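
To make the effect of smoothing concrete, here is a small worked example that plugs the constants computed above (`n_spam = 15190`, `n_vocabulary = 7782`, `alpha = 1`) into the same formula as the loop. The helper `smoothed_probability` and the word counts passed to it are hypothetical, chosen only for illustration.

```python
def smoothed_probability(word_count, n_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (n_class + alpha * n_vocabulary)

n_spam = 15190        # total number of words in spam messages (from the notebook)
n_vocabulary = 7782   # number of unique words in the training set (from the notebook)

# A word that never occurs in spam still gets a small, nonzero probability
p_unseen = smoothed_probability(0, n_spam, n_vocabulary)

# A hypothetical word that occurs 100 times in spam messages
p_frequent = smoothed_probability(100, n_spam, n_vocabulary)

print(p_unseen, p_frequent)
```

Without smoothing, a vocabulary word that never appears in one class would get probability 0 for that class and zero out the whole product for any message containing it.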

## Classifying a new message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
- Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)
- Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:
  - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.
  - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) < P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as spam.
  - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help.
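
In symbols, the two quantities being compared are the following. The proportionality sign reflects that the shared normalizing constant is dropped, as it is in the `classify` function below, which does not change which of the two values is larger:

$$
P(Spam \mid w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(Ham \mid w_1, w_2, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$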


In [56]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_spam_dict:
            p_spam_given_message *= p_spam_dict[word]
        if word in p_ham_dict:
            p_ham_given_message *= p_ham_dict[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [51]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3489598779101096e-25
P(Ham|message): 1.9380782419077522e-27
Label: Spam


In [52]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4385273359614485e-25
P(Ham|message): 3.6893872875947e-21
Label: Ham
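
The products above shrink quickly as messages get longer, and for very long messages they could underflow to 0.0 in floating point. A common workaround, which is not what this notebook does, is to add log-probabilities instead of multiplying probabilities. The sketch below uses tiny made-up parameters (`toy_p_spam`, `toy_p_ham`, `toy_p_spam_dict`, `toy_p_ham_dict`) and a hypothetical function name `classify_log`; it illustrates the idea rather than replacing `classify`.

```python
import math
import re

# Made-up parameters for illustration only
toy_p_spam = 0.13
toy_p_ham = 0.87
toy_p_spam_dict = {'win': 0.05, 'money': 0.04, 'see': 0.001, 'you': 0.01}
toy_p_ham_dict = {'win': 0.001, 'money': 0.002, 'see': 0.02, 'you': 0.03}

def classify_log(message):
    """Compare sums of log-probabilities instead of raw probability products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_p_spam_dict:
            log_spam += math.log(toy_p_spam_dict[word])
        if word in toy_p_ham_dict:
            log_ham += math.log(toy_p_ham_dict[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win money!!'))
print(classify_log('See you'))
```

With these toy values, a message that repeats `win` 500 times makes both raw products underflow to 0.0, which would look like a tie, while the log-scores still separate the two classes.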


## Measuring the Spam Filter's Accuracy

We'll now try to determine how well the spam filter does on our test set of 1,115 messages.

In [57]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_spam_dict:
            p_spam_given_message *= p_spam_dict[word]
        if word in p_ham_dict:
            p_ham_given_message *= p_ham_dict[word]

    # Return the label instead of printing, so it can be stored in a new column
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [60]:
test['prediction'] = test['SMS'].apply(classify_test_set)

In [61]:
test

|  | Label | SMS | prediction |
|---|---|---|---|
| 0 | ham | Wherre's my boytoy ? :-( | ham |
| 1 | ham | Later i guess. I needa do mcat study too. | ham |
| 2 | ham | But i haf enuff space got like 4 mb... | ham |
| 3 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 4 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 5 | ham | All done, all handed in. Don't know if mega sh... | ham |
| 6 | ham | But my family not responding for anything. Now... | ham |
| 7 | ham | U too... | ham |
| 8 | ham | Boo what time u get out? U were supposed to ta... | ham |
| 9 | ham | Genius what's up. How your brother. Pls send h... | ham |


In [83]:
correct = 0
total = test.shape[0]

# Count the rows where the prediction matches the human label
for _, row in test.iterrows():
    if row['Label'] == row['prediction']:
        correct += 1

In [86]:
accuracy = correct / total
accuracy

0.9874439461883409

The filter reaches an accuracy of about 98.74% on the test set, well above our 80% objective.
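
Because roughly 87% of the messages are ham, overall accuracy can hide how the filter does on spam specifically. A per-class breakdown is one way to check. The sketch below uses made-up labels and predictions, and the names `toy_pairs` and `per_class_counts` are hypothetical.

```python
from collections import Counter

# Made-up (actual label, predicted label) pairs for illustration
toy_pairs = [
    ('ham', 'ham'), ('ham', 'ham'), ('ham', 'spam'),
    ('spam', 'spam'), ('spam', 'ham'), ('ham', 'ham'),
]

def per_class_counts(pairs):
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)

counts = per_class_counts(toy_pairs)

# Overall accuracy: share of pairs where the prediction matches the label
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
accuracy = correct / len(toy_pairs)

# Share of actual spam messages that were caught (recall for the spam class)
spam_total = sum(n for (actual, _), n in counts.items() if actual == 'spam')
spam_recall = counts[('spam', 'spam')] / spam_total

print(accuracy, spam_recall)
```

On the real data, the same counts could be built from the `Label` and `prediction` columns of `test`.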

In [89]:
# Print every test message whose prediction differs from its label
for message in test[test['Label'] != test['prediction']]['SMS']:
    print(message)

Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
Unlimited texts. Limited minutes.
26th OF JULY
Nokia phone is lovly..
A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8
No calls..messages..missed calls
We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us
Oh my god! I've found your number agai

In [90]:
test[test['Label'] != test['prediction']]

|  | Label | SMS | prediction |
|---|---|---|---|
| 115 | spam | Not heard from U4 a while. Call me now am here... | ham |
| 136 | spam | More people are dogging in your area now. Call... | ham |
| 153 | ham | Unlimited texts. Limited minutes. | spam |
| 160 | ham | 26th OF JULY | spam |
| 285 | ham | Nokia phone is lovly.. | spam |
| 294 | ham | A Boy loved a gal. He propsd bt she didnt mind... | needs human classification |
| 303 | ham | No calls..messages..missed calls | spam |
| 320 | ham | We have sent JD for Customer Service cum Accou... | spam |
| 505 | spam | Oh my god! I've found your number again! I'm s... | ham |
| 547 | spam | Hi babe its Chloe, how r u? I was smashed on s... | ham |
