# Building a Spam Filter with Naive Bayes

To classify messages as spam or non-spam, the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate the probabilities that a new message is spam or non-spam.
- Classifies a new message based on these probability values: if the probability for spam is greater, the message is classified as spam; otherwise, it is classified as non-spam. If the two probabilities are equal, a human may need to classify the message.

So the first task is to "teach" the computer how to classify messages. To do that, I'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.

## Exploring the Dataset

In [1]:
import pandas as pd
spam_database=pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
print(spam_database.head())

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


__Ham__ means non-spam. 

In [3]:
print(spam_database.shape)

(5572, 2)


In [4]:
spam_sms=spam_database[spam_database["Label"]=="spam"]
print(spam_sms.shape[0])

747


In [5]:
ham_sms=spam_database[spam_database["Label"]=="ham"]
print(ham_sms.shape[0])

4825


In [6]:
# Derive the counts from the data instead of hard-coding them
n_sms_tot=len(spam_database)
n_spam=len(spam_sms)
n_ham=len(ham_sms)
p_spam=n_spam/n_sms_tot
p_ham=n_ham/n_sms_tot
print(p_spam*100)
print(p_ham*100)

13.406317300789663
86.59368269921033


## Training and Test Set

About 87% of the messages are ham (non-spam), and the remaining 13% are spam.

Once the spam filter is done, we'll need to test how good it is at classifying new messages. To do that, I'll first split the dataset into two parts:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is at classifying new messages.

I'll keep 80% of the dataset for training and 20% for testing, so that as much data as possible goes into training while still leaving enough messages for a meaningful test.

In [7]:
rand_spam_database=spam_database.sample(frac=1, random_state=1)
print(rand_spam_database.head(10))

     Label                                                SMS
1078   ham                       Yep, by the pretty sculpture
4028   ham      Yes, princess. Are you going to make me moan?
958    ham                         Welp apparently he retired
4642   ham                                            Havent.
4674   ham  I forgot 2 ask ü all smth.. There's a card on ...
5461   ham  Ok i thk i got it. Then u wan me 2 come now or...
4210   ham  I want kfc its Tuesday. Only buy 2 meals ONLY ...
4216   ham                         No dear i was sleeping :-P
1603   ham                          Ok pa. Nothing problem:-)
1504   ham                    Ill be there on  &lt;#&gt;  ok.


In [8]:
training=rand_spam_database.sample(frac=0.8, random_state=1)
test=rand_spam_database.sample(frac=0.2, random_state=1)
training_ham=training[training["Label"]=="ham"].shape[0]
training_spam=training[training["Label"]=="spam"].shape[0]
test_ham=test[test["Label"]=="ham"].shape[0]
test_spam=test[test["Label"]=="spam"].shape[0]
p_training_ham=training_ham/(training_ham+training_spam)
p_training_spam=1-p_training_ham
p_test_ham=test_ham/(test_ham+test_spam)
p_test_spam=1-p_test_ham
print(100*p_training_ham, " ", 100*p_training_spam)
print(100*p_test_ham, " ", 100*p_test_spam)

86.67563930013459   13.324360699865412
86.17594254937163   13.824057450628368


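Both samples have class proportions close to those of the whole dataset (about 87% ham and 13% spam). Note, however, that `training` and `test` are both drawn from the same shuffled frame with `random_state=1`, so the two sets are not disjoint: the test messages also appear in the training set.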
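
One way to get training and test sets that share no rows is to shuffle once and then slice the shuffled frame. This is a minimal sketch on a toy frame, not the split used in this project; the `toy` data and the names `train_part` and `test_part` are hypothetical:

```python
import pandas as pd

# Toy stand-in for the SMS dataset (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Label": ["ham", "spam"] * 5,
    "SMS": [f"message {i}" for i in range(10)],
})

# Shuffle once, then cut the shuffled frame at the 80% mark
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
train_part = shuffled.iloc[:cut]
test_part = shuffled.iloc[cut:]

# The two parts share no rows and together cover the whole frame
overlap = train_part.index.intersection(test_part.index)
print(len(train_part), len(test_part), len(overlap))
```

Because the slices come from a single shuffle, a row can land in only one of the two parts.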

# Data Cleaning

## Letter Case and Punctuation

I'll begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [9]:
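# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training["SMS"]=training["SMS"].str.replace(r"\W", " ", regex=True).str.lower()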
print(training["SMS"].head(10))

3404         good night my dear   sleepwell amp take care
4781    sen told that he is going to join his uncle fi...
484     thank you baby  i cant wait to taste the real ...
502                                  when can ü come out 
3898                 no  thank you  you ve been wonderful
96                      watching telugu movie  wat abt u 
2177                      get ready to moan and scream   
2841    babe     i miiiiiiissssssssss you   i need you...
993     up to ü    ü wan come then come lor    but i d...
3590    i ve sent my wife your text  after we buy them...
Name: SMS, dtype: object
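
To see what the `\W` pattern does to a single message, here is a small sketch that uses Python's `re` module on one message from the dataset. The names `raw`, `cleaned` and `tokens` are illustrative only:

```python
import re

# One raw message from the dataset
raw = "Good night my dear.. Sleepwell&amp;Take care"

# Replace every non-word character with a space, then lower-case the text
cleaned = re.sub(r"\W", " ", raw).lower()

# Splitting on whitespace collapses the runs of spaces left behind
tokens = cleaned.split()
print(tokens)
```

Digits and letters such as `ü` count as word characters, so they are kept. The HTML entity `&amp;` loses its `&` and `;`, which is why `amp` ends up as a word of its own.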


## Creating the Vocabulary

I'll now move to creating the vocabulary, which in this context means a list with all the unique words in our training set.

In [10]:
vocabulary=[]
training_2=training.copy()
training_2["SMS"]=training_2["SMS"].str.split()
training_2.reset_index(inplace=True)
for sms in training_2["SMS"]:
    for word in sms:
        vocabulary.append(word)
set_of_words=set(vocabulary)
list_of_words=list(set_of_words)

## The Final Training Set

I'll now use the vocabulary to transform the training set: each message becomes a row of word counts, with one column per unique word in the vocabulary.

In [11]:
word_counts_per_sms = {unique_word: [0] * len(training_2['SMS']) for unique_word in list_of_words}

for index, sms in enumerate(training_2['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
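
The same counting scheme can be followed by hand on a tiny example. The sketch below uses two made-up tokenized messages; `toy_messages`, `toy_vocab` and `toy_counts` are hypothetical names:

```python
# Two toy tokenized messages (hypothetical data, for illustration only)
toy_messages = [["free", "win", "free"], ["see", "you"]]
toy_vocab = sorted({word for sms in toy_messages for word in sms})

# One list per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocab}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts)
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.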

In [12]:
wcps_df = pd.DataFrame(word_counts_per_sms)
print(wcps_df.shape)

(4458, 7712)


In [13]:
training_3=pd.concat([training_2, wcps_df], axis=1)
print(training_3.shape)

(4458, 7715)


In [14]:
print(wcps_df.columns)

Index(['0', '00', '000', '000pes', '008704050406', '0089', '0121',
       '01223585236', '01223585334', '0125698789',
       ...
       'zoe', 'zogtorius', 'zoom', 'zouk', 'èn', 'é', 'ú1', 'ü', '〨ud', '鈥'],
      dtype='object', length=7712)


In [15]:
print(training_3.iloc[:,:5].tail())

      index Label                                                SMS  0  00
4453    657   ham  [sun, cant, come, to, earth, but, send, luv, a...  0   0
4454   4753   ham  [well, boy, am, i, glad, g, wasted, all, night...  0   0
4455   1442   ham                       [ya, going, for, restaurant]  0   0
4456   2105   ham  [anyway, seriously, hit, me, up, when, you, re...  0   0
4457   4585   ham  [noooooooo, please, last, thing, i, need, is, ...  0   0


In [16]:
print(training_3[training_3["Label"].isnull()].shape)

(0, 7715)


In [17]:
print(training_3[training_3["Label"].isnull()].head(3))

Empty DataFrame
Columns: [index, Label, SMS, 0, 00, 000, 000pes, 008704050406, 0089, 0121, 01223585236, 01223585334, 0125698789, 02, 0207, 02073162414, 02085076972, 021, 03, 04, 05, 050703, 0578, 06, 07, 07008009200, 07046744435, 07090201529, 07090298926, 07099833605, 07123456789, 0721072, 07732584351, 07734396839, 07742676969, 07753741225, 0776xxxxxxx, 07781482378, 07786200117, 077xxx, 07801543489, 07808, 07808247860, 07808726822, 07821230901, 07880867867, 0789xxxxxxx, 07946746291, 0796xxxxxx, 07973788240, 07xxxxxxxxx, 08, 0800, 08000407165, 08000776320, 08000839402, 08000930705, 08000938767, 08001950382, 08002888812, 08002986030, 08002986906, 08002988890, 08006344447, 0808, 08081263000, 08081560665, 0825, 083, 0844, 08448350055, 0845, 08450542832, 08452810073, 08452810075over18, 0870, 08700435505150p, 08700621170150p, 08701213186, 08701417012, 08701417012150p, 0870141701216, 087016248, 087018728737, 0870241182716, 08702490080, 08702840625, 08704050406, 08704439680, 08704439680ts, 087

In [18]:
print(training_2.shape)

(4458, 3)


## Calculating Constants First

Now that I'm done with data cleaning and have a training set to work with, I can begin creating the spam filter.

In [19]:
alpha=1
N_spam=0
N_ham=0
training_2_spam=training_2[training_2["Label"]=="spam"]
training_2_ham=training_2[training_2["Label"]=="ham"]
for li in training_2_spam["SMS"]:
    N_spam+=len(li)
for li in training_2_ham["SMS"]:
    N_ham+=len(li)
print(N_spam)
print(N_ham)

15142
57140


In [20]:
N_vocabulary=len(list_of_words)
print(N_vocabulary)

7712


## Calculating Parameters

The probability values that P(wi|Spam) and P(wi|Ham) take are called parameters.
Because all of these values are calculated once, before any new message is classified, classifying a message only requires looking them up and multiplying them, which makes the Naive Bayes algorithm very fast.
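
With additive (Laplace) smoothing, the parameters computed in the cells that follow can be written as below. The symbol names are shorthand chosen here to mirror the code's variables (`N_word_spam`, `N_spam`, `N_vocabulary`, `alpha`):

```latex
\[
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\]
```

For instance, with alpha = 1, N_Spam = 15142 and N_Vocabulary = 7712, a word that never occurs in a spam message gets P(wi|Spam) = 1/22854, roughly 4.4e-5, rather than 0.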

In [21]:
dict_spam={}
for word in list_of_words:
    dict_spam[word]=0
dict_ham={}
for word in list_of_words:
    dict_ham[word]=0

In [22]:
training_3_spam=training_3[training_3["Label"]=="spam"]
training_3_ham=training_3[training_3["Label"]=="ham"]

In [23]:
for word in list_of_words:
    N_word_spam=training_3_spam[word].sum()
    parameter_spam=(N_word_spam+alpha)/(N_spam+alpha*N_vocabulary)
    dict_spam[word]=parameter_spam
    N_word_ham=training_3_ham[word].sum()
    parameter_ham=(N_word_ham+alpha)/(N_ham+alpha*N_vocabulary)
    dict_ham[word]=parameter_ham

In [24]:
p_spam = len(training_3_spam) / len(training_3)
p_ham = len(training_3_ham) / len(training_3)

In [25]:
print(p_spam)
print(p_ham)

0.13324360699865412
0.8667563930013459


## Classifying a New Message

Now that all the parameters are calculated, I can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
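
Under the Naive Bayes assumption that words are conditionally independent given the class, the two quantities compared above are proportional to the following products. The shared denominator is dropped because it does not affect the comparison:

```latex
\[
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
\]
\[
P(Ham \mid w_1, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
\]
```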

In [26]:
import re

def classify(message):

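    # Clean the message the same way as the training data, then tokenize it
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors and multiply in the parameter of each known word;
    # words that are not in the vocabulary are skipped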
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in dict_spam:
            p_spam_given_message*=dict_spam[word]
        if word in dict_ham:
            p_ham_given_message*=dict_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')


In [27]:
example_spam='WINNER!! This is the secret code to unlock the money: C3421.'
example_ham="Sounds good, Tom, then see u there"
classify(example_spam)
classify(example_ham)

P(Spam|message): 1.273700039190484e-25
P(Ham|message): 2.6479653658243408e-27
Label: Spam
P(Spam|message): 1.0782622513413257e-25
P(Ham|message): 4.248744927807854e-21
Label: Ham
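
Multiplying many probabilities below 1 can underflow to 0.0 for long messages, in which case both scores tie. A common remedy is to compare sums of logarithms instead. The sketch below is not the classifier used in this project; `classify_log` and the `toy_*` values are hypothetical and only illustrate the idea:

```python
import math

# Hypothetical toy priors and per-word parameters (not the values learned above)
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_dict_spam = {"win": 0.02, "money": 0.03, "see": 0.001}
toy_dict_ham = {"win": 0.001, "money": 0.002, "see": 0.02}

def classify_log(words):
    """Compare log-scores; sums of logs do not underflow the way long products can."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped
        if word in toy_dict_spam:
            log_spam += math.log(toy_dict_spam[word])
        if word in toy_dict_ham:
            log_ham += math.log(toy_dict_ham[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

print(classify_log(["win", "money"]))
print(classify_log(["see"] * 500))
```

With plain multiplication, both scores for the 500-word message would round to 0.0 and the message would be sent for human classification, whereas the log-scores still separate the two classes.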


## Measuring the Spam Filter's Accuracy

I'll now try to determine how well the spam filter does on the test set of 1,114 messages.

In [28]:
def classify_test_set(message):

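    # Clean the message the same way as the training data, then tokenize it
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors and multiply in the parameter of each known word;
    # words that are not in the vocabulary are skipped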
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in dict_spam:
            p_spam_given_message*=dict_spam[word]
        if word in dict_ham:
            p_ham_given_message*=dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [29]:
correct=0
test["Predicted"]=test["SMS"].apply(classify_test_set)
print(test.head())

     Label                                                SMS Predicted
3404   ham       Good night my dear.. Sleepwell&amp;Take care       ham
4781   ham  Sen told that he is going to join his uncle fi...       ham
484    ham  Thank you baby! I cant wait to taste the real ...       ham
502    ham                               When can ü come out?       ham
3898   ham               No. Thank you. You've been wonderful       ham


In [30]:
total=len(test)
for index, row in test.iterrows():
    if row["Label"]==row["Predicted"]:
        correct+=1
accuracy=correct/total
print(accuracy)

0.992818671454219


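The accuracy is 99.3%, which is extraordinarily high. This figure should be read with caution, though: `training` and `test` were both sampled from the same frame with `random_state=1`, so the test messages were also seen during training, and accuracy on truly unseen messages is likely lower.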
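
Because about 87% of the messages are ham, it can also help to see which kinds of errors occur. The sketch below uses a toy frame with hypothetical labels, not this project's results. It computes accuracy without an explicit loop and tabulates labels against predictions:

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only
toy_results = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "spam", "ham"],
    "Predicted": ["ham", "spam", "spam", "ham", "ham"],
})

# Fraction of rows where the prediction matches the label
toy_accuracy = (toy_results["Label"] == toy_results["Predicted"]).mean()

# Rows are true labels, columns are predictions
toy_table = pd.crosstab(toy_results["Label"], toy_results["Predicted"])

print(toy_accuracy)
print(toy_table)
```

The off-diagonal cells of the table separate ham flagged as spam from spam that slipped through as ham, which a single accuracy number hides.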

## Where Was the Spam Filter Wrong?

Let's read the incorrectly classified messages.

In [31]:
test_incorrect=test[test["Label"]!=test["Predicted"]]
print(test_incorrect["SMS"])

4703                                           Anytime...
5       FreeMsg Hey there darling it's been 3 week's n...
1863    The last thing i ever wanted to do was hurt yo...
1988                     No calls..messages..missed calls
2663    Hello darling how are you today? I would love ...
3460    Not heard from U4 a while. Call me now am here...
4213    Missed call alert. These numbers called but le...
1875    Would you like to see my XXX pics they are so ...
Name: SMS, dtype: object


In [32]:
print(test_incorrect)

     Label                                                SMS  \
4703   ham                                         Anytime...   
5     spam  FreeMsg Hey there darling it's been 3 week's n...   
1863   ham  The last thing i ever wanted to do was hurt yo...   
1988   ham                   No calls..messages..missed calls   
2663  spam  Hello darling how are you today? I would love ...   
3460  spam  Not heard from U4 a while. Call me now am here...   
4213  spam  Missed call alert. These numbers called but le...   
1875  spam  Would you like to see my XXX pics they are so ...   

                       Predicted  
4703                        spam  
5                            ham  
1863  needs human classification  
1988                        spam  
2663                         ham  
3460                         ham  
4213                         ham  
1875                         ham  


In [33]:
for index, row in test_incorrect.iterrows():
    print(row["SMS"], " ", "Label:", row["Label"])
    print("<---------------------------->")

Anytime...   Label: ham
<---------------------------->
FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv   Label: spam
<---------------------------->
The last thing i ever wanted to do was hurt you. And i didn't think it would have. You'd laugh, be embarassed, delete the tag and keep going. But as far as i knew, it wasn't even up. The fact that you even felt like i would do it to hurt you shows you really don't know me at all. It was messy wednesday, but it wasn't bad. The problem i have with it is you HAVE the time to clean it, but you choose not to. You skype, you take pictures, you sleep, you want to go out. I don't mind a few things here and there, but when you don't make the bed, when you throw laundry on top of it, when i can't have a friend in the house because i'm embarassed that there's underwear and bras strewn on the bed, pillows on the floor, that's something else. You used to 