# Guided Project: [Building a Spam Filter with Naive Bayes](https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes)

---

## Background

Have you ever been bothered by spam messages? Almost everybody has. Fortunately, spam filters spare us the hassle of repeatedly cleaning our inboxes. The goal of this project is to create a basic SMS spam filter using the Naive Bayes algorithm.

We'll achieve this goal using a dataset of 5,572 SMS messages put together by Tiago A. Almeida and José María Gómez Hidalgo and made available in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The messages have already been classified as spam or ham (not spam) and can be downloaded from [here](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). This allows us to "train" the algorithm on labelled messages and later label a new message as spam or ham using the following principle:


If $ P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) $ is greater than $ P(Spam^C) \cdot \prod_{i=1}^{n}P(w_i|Spam^C) $, then the message is classified as $\text{spam}$.

Otherwise, if $ P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) $ is smaller than $ P(Spam^C) \cdot \prod_{i=1}^{n}P(w_i|Spam^C) $, then the message is classified as $\text{ham}$.

Here $w_1, \dots, w_n$ are the words of the message and $Spam^C$ (the complement of spam) means ham.
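
To make the rule concrete, here is a minimal sketch with made-up numbers. The priors and per-word probabilities below are hypothetical values chosen for illustration, not figures from the dataset, and `score` is a helper name introduced only for this example.

```python
from math import prod

# Hypothetical priors and per-word conditional probabilities (illustration only)
p_spam, p_ham = 0.13, 0.87
p_word_given_spam = {"free": 0.02, "prize": 0.01, "call": 0.015}
p_word_given_ham = {"free": 0.002, "prize": 0.0001, "call": 0.004}

def score(words, prior, word_probs):
    """Return prior * product of P(w_i | class) over the given words."""
    return prior * prod(word_probs[w] for w in words)

message = ["free", "prize", "call"]
spam_score = score(message, p_spam, p_word_given_spam)
ham_score = score(message, p_ham, p_word_given_ham)

# The class with the larger score wins
label = "spam" if spam_score > ham_score else "ham"
print(label)
```

With these made-up numbers the spam score is larger even though the spam prior is much smaller, because the three words are far more likely under spam than under ham.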

Now that we've covered the basics, let's get to work.

## Content
> - [Background](#Background)
> - [Loading the dataset](#Loading-the-dataset)
> - [Separating the dataset](#Separating-the-dataset)
> - [Cleaning the dataset](#Cleaning-the-dataset)
> - [Creating word count dataset](#Creating-word-count-dataset)
> - [Calculating constants](#Calculating-constants)
> - [Calculating parameters](#Calculating-parameters)
> - [Creating spam filter function](#Creating-spam-filter-function)
> - [Measuring filter accuracy](#Measuring-filter-accuracy)
> - [Conclusion](#Conclusion)

---

## Loading the dataset

First we will import the packages that we'll be using and load the dataset. We'll also take a quick look at the dataset to familiarize ourselves with the data.

In [1]:
# Importing packages for data management 
import pandas as pd    # Importing pandas
import numpy as np     # Importing numpy
import datetime as dt  # Importing datetime
import re              # Importing regular expression
import warnings        # To suppress warning alert
warnings.filterwarnings('ignore')
# Changing display settings so dataframes are not truncated
pd.options.display.max_rows = 500
pd.options.display.width = 500
pd.options.display.max_colwidth = 500
pd.options.display.max_columns = 500
# Displaying all output from code cell. Default value = 'last_expr'.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# loading the dataset 
sms = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label','spam_sms'])

In [3]:
# Shape and proportion of the dataset 
print("\nSMSSpamCollection dataset has {} rows and {} columns".format(sms.shape[0],sms.shape[1]))
sms['Label'].value_counts(normalize=True)*100


SMSSpamCollection dataset has 5572 rows and 2 columns


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [4]:
# Quick look at the dataset 
sms.head(10)

Unnamed: 0,Label,spam_sms
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030


[Back to top](#Background)

---

## Separating the dataset 

To create the spam filter, we need to split the dataset in two: one part for training and the other for testing the filter. We'll split the 5,572 messages in an 80:20 proportion, with the larger part being the training set that will "teach" our algorithm to recognize spam messages.

Here's where the pre-labelled messages come into play. We'll have the algorithm calculate and store a probability for every word in the training set, so it knows which words are more likely to appear in spam and which are not.

In [5]:
# Randomizing the dataset 
random_set = sms.sample(frac=1,random_state=1)

# Calculating the index for split
random_index = round(len(random_set)*0.8)

# Splitting the dataset
training = random_set[:random_index].reset_index(drop=True)
test = random_set[random_index:].reset_index(drop=True)

# Checking shape and label proportion for training dataset
print("\nTraining dataset has {} rows.".format(training.shape[0]))
training['Label'].value_counts(normalize=True)*100

# Checking shape and label proportion for test dataset
print("Test dataset has {} rows.".format(test.shape[0]))
test['Label'].value_counts(normalize=True)*100


Training dataset has 4458 rows.


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

Test dataset has 1114 rows.


ham     86.804309
spam    13.195691
Name: Label, dtype: float64

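The split worked as intended: the ham/spam proportions in the training set (86.5% / 13.5%) and in the test set (86.8% / 13.2%) are both close to those of the full dataset (86.6% / 13.4%), which is good.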
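
The shuffle-then-cut approach can be sketched on a tiny, made-up frame. The helper name `split_frame` and the toy data are assumptions for illustration. A plain shuffle does not guarantee matching label proportions, which is why checking them afterwards is worthwhile.

```python
import pandas as pd

def split_frame(df, train_frac=0.8, seed=1):
    """Shuffle df, then cut it into a training part and a test part."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = round(len(shuffled) * train_frac)
    train_part = shuffled[:cut].reset_index(drop=True)
    test_part = shuffled[cut:].reset_index(drop=True)
    return train_part, test_part

# Toy data: 10 labelled messages (hypothetical, for illustration only)
toy = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "spam_sms": [f"message {i}" for i in range(10)],
})

toy_train, toy_test = split_frame(toy)
print(len(toy_train), len(toy_test))
```

Every row ends up in exactly one of the two parts, so the test part stays unseen during training.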

---

## Cleaning the dataset

Next, we tidy the dataset. We'll clean the messages by replacing all punctuation with spaces and converting every word to lower case.

In [6]:
# Changing to lower case and replacing all punctuation 
def clean_punctuation(x):
    result = re.sub(r'\W', ' ', x)  # raw string avoids an invalid-escape warning
    result = result.lower()
    return result

training['spam_sms'] = training['spam_sms'].apply(clean_punctuation)
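
As a quick illustration of what this cleaning step does, here is a small self-contained sketch on a made-up message (the sample text is hypothetical). Note that apostrophes count as punctuation too, so a word like "it's" ends up as two tokens once the message is split.

```python
import re

def clean_punctuation(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r'\W', ' ', text).lower()

sample = "WINNER!! Call 0906 now, it's FREE."
cleaned = clean_punctuation(sample)

# Splitting on whitespace drops the extra spaces left by the substitution
print(cleaned.split())
# ['winner', 'call', '0906', 'now', 'it', 's', 'free']
```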

---

## Creating word count dataset

To count the words, we first need to split each message into individual words. Splitting turns the column into a series of lists that we can iterate over to build the vocabulary and count word frequencies.

In [7]:
# Transforming messages in SMS column into list 
training['spam_sms'] = training['spam_sms'].str.split()

In [8]:
# Creating vocabulary from the SMS column
vocabulary = []

for row in training['spam_sms']:
    for word in row:
        # Only add words we have not seen before
        if word not in vocabulary:
            vocabulary.append(word)

The membership check in the loop above should prevent any duplicates in the vocabulary list.
A quick check confirms that there are none:

In [9]:
len(vocabulary) == len(set(vocabulary))
len(vocabulary)

True

7783

The check returns ```True```, meaning there are no duplicated words, and the vocabulary has 7,783 unique words in it.
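
Checking membership in a list gets slower as the list grows. As an alternative sketch (not the approach used in this notebook), `dict.fromkeys` removes duplicates while keeping the order of first appearance. The toy messages below are hypothetical.

```python
# Toy tokenised messages (hypothetical data, for illustration only)
tokenised = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["free", "free", "entry"],
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so each word appears once, in order of first appearance.
toy_vocabulary = list(dict.fromkeys(word for row in tokenised for word in row))
print(toy_vocabulary)
# ['free', 'prize', 'call', 'now', 'me', 'entry']
```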

Now we can move on to the next step, which is counting the number of times a certain word is used and storing them into a dataframe. This dataframe will later be concatenated with training dataframe to complete the process.

In [10]:
# Creating word count per sms dictionary 
word_counts_per_sms = {unique_word: [0] * len(training['spam_sms']) for unique_word in vocabulary}

# For loop to fill the dictionary
# (the loop variable is named `message` so it does not overwrite the `sms` dataframe)
for index, message in enumerate(training['spam_sms']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [11]:
# Creating word count dataframe and concatenating it with training dataframe 
word_count_data = pd.DataFrame(word_counts_per_sms)
training_set = pd.concat([training,word_count_data],axis=1)

In [12]:
# Quick look at the dataset 
training_set.sample(3, random_state=0)

Unnamed: 0,Label,spam_sms,yep,by,the,pretty,sculpture,yes,princess,are,you,going,...,jewelry,related,trade,arul,bx526,wherre
2715,ham,"[wait, i, will, come, out, lt, gt, min]",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0
775,ham,"[there, generally, isn, t, one, it, s, an, uncountable, noun, u, in, the, dictionary, pieces, of, research]",0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0
117,ham,"[of, course, don, t, tease, me, you, know, i, simply, must, see, grins, do, keep, me, posted, my, prey, loving, smile, devouring, kiss]",0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0


This looks great. With this table we can obtain the frequency of each word under each label and subsequently calculate the probability of that word given spam or ham.
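
The same construction is easier to follow on a tiny example. The sketch below uses three made-up messages and a four-word vocabulary (all hypothetical) and then sums the counts per label, which is the quantity the later probability calculations rely on.

```python
import pandas as pd

# Toy tokenised messages (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "Label": ["spam", "ham", "spam"],
    "spam_sms": [["free", "prize", "free"], ["call", "me"], ["free", "call"]],
})
toy_vocab = ["free", "prize", "call", "me"]

# One list of per-message counts for every vocabulary word
counts = {word: [0] * len(toy) for word in toy_vocab}
for index, message in enumerate(toy["spam_sms"]):
    for word in message:
        counts[word][index] += 1

toy_set = pd.concat([toy, pd.DataFrame(counts)], axis=1)

# Frequency of each word within each label
per_label = toy_set.groupby("Label")[toy_vocab].sum()
print(per_label)
```

Here "free" is counted three times under spam (twice in the first message, once in the third) and never under ham.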

[Back to top](#Background)

---

## Calculating constants

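Next we calculate the constants: the prior probabilities $P(Spam)$ and $P(Ham)$, the total number of words in spam and ham messages ($N_{Spam}$ and $N_{Ham}$), and the vocabulary size $N_{Vocabulary}$. We also set the Laplace smoothing parameter $\alpha = 1$.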

In [13]:
# Calculating the constants 
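# Look the shares up by label rather than by position, so the result
# does not depend on the order returned by value_counts
label_share = training_set['Label'].value_counts(normalize=True)
p_ham = label_share['ham']
p_spam = label_share['spam']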

# Function to count n of words
def word_counter(x):
    '''Take a series of word lists and
    return the total number of words
    across all of its rows.
    '''
    return sum(len(row) for row in x)

# Number of words in vocabulary, ham and spam
n_vocab = len(vocabulary)
n_ham = word_counter(training_set.query("Label=='ham'")['spam_sms'])
n_spam = word_counter(training_set.query("Label=='spam'")['spam_sms'])

# Laplace smoothing
alpha = 1

In [14]:
# Displaying the constants 
pd.DataFrame([p_ham,p_spam],["$P(Ham)$","$P(Spam)$"]).rename(columns={0:"Probability"})
pd.DataFrame([n_vocab,n_ham,n_spam],["$N_{Vocabulary}$","$N_{Ham}$","$N_{Spam}$"]).rename(columns={0:"Number of words"})

Unnamed: 0,Probability
$P(Ham)$,0.86541
$P(Spam)$,0.13459


Unnamed: 0,Number of words
$N_{Vocabulary}$,7783
$N_{Ham}$,57237
$N_{Spam}$,15190


[Back to top](#Background)

---

## Calculating parameters 

$ P(w_{i} | Spam)$ and $ P(w_{i} | Ham)$, which will be used to classify incoming messages, are called _parameters_.
They are calculated using the following equations, where $N_{w_{i}|Spam}$ and $N_{w_{i}|Ham}$ are the number of times the word $w_{i}$ occurs in spam and ham messages:

$$ P(w_{i} | Spam) = \frac{N_{w_{i}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$


$$ P(w_{i} | Ham) = \frac{N_{w_{i}|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$
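
As a worked example of the smoothing, the sketch below plugs in the constants reported earlier in this notebook. The helper name `smoothed` and the count of 50 are assumptions made for illustration.

```python
# Constants reported earlier in this notebook
n_spam, n_ham, n_vocab = 15190, 57237, 7783
alpha = 1  # Laplace smoothing

def smoothed(count, n_class):
    """Laplace-smoothed estimate of P(word | class)."""
    return (count + alpha) / (n_class + alpha * n_vocab)

# A word never seen in spam still gets a small, non-zero probability
p_unseen_spam = smoothed(0, n_spam)   # 1 / 22973

# Hypothetical count: a word seen 50 times in spam messages
p_seen_spam = smoothed(50, n_spam)    # 51 / 22973

print(p_unseen_spam, p_seen_spam)
```

Without the $\alpha$ term, a single word that never appeared in spam would make the whole product for spam equal to zero.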

In [15]:
# Separating spam/ham messages 
spam_msg = training_set[training_set['Label']=='spam']
ham_msg = training_set[training_set['Label']=='ham']

# Creating 2 dictionaries for parameters 
parameters_spam = {word: 0 for word in vocabulary}
parameters_ham = {word: 0 for word in vocabulary}

In [16]:
# Calculating parameters: 

for word in vocabulary:
    n_word_given_spam = spam_msg[word].sum()   
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocab)
    parameters_spam[word] = p_word_given_spam 
    
    n_word_given_ham = ham_msg[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocab)
    parameters_ham[word] = p_word_given_ham 

[Back to top](#Background)

---

## Creating spam filter function

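The spam filter works by comparing $ P(Spam | Message)$ to $ P(Ham | Message)$ and assigning the spam or ham label depending on which value is larger. Strictly speaking, the code computes quantities proportional to these probabilities (the prior multiplied by the word probabilities), which is enough for the comparison. Words that are not in the vocabulary are skipped. To do this we'll use the following functions: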

In [17]:
# Spam filter function
import re

def prob(message, dictionary, init_value):
    p = init_value
    for n in message:
        if n in dictionary:
            p *= dictionary[n]
        else:
            continue
    return p

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = prob(message, parameters_spam, p_spam)
    p_ham_given_message = prob(message, parameters_ham, p_ham)

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let's test the ```classify``` function using a message that's clearly spam:

In [18]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


And now using a message that's clearly ham:

In [19]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


This shows that the function is working properly and we can apply it to the test set that we haven't touched yet. We'll do that while measuring the filter accuracy.
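
The scores printed above are already very small (around 1e-25). SMS messages are short, so this is not a problem here, but for much longer texts the product of many small probabilities can underflow to zero. A common workaround is to add log-probabilities instead of multiplying probabilities. The sketch below is not part of the original filter: `log_score` and the toy parameters are assumptions for illustration.

```python
import math

def log_score(words, word_probs, prior):
    """Sum log-probabilities instead of multiplying raw probabilities.
    Words missing from word_probs are skipped, as in prob() above."""
    total = math.log(prior)
    for word in words:
        if word in word_probs:
            total += math.log(word_probs[word])
    return total

# Hypothetical parameters, for illustration only
toy_spam = {"free": 0.02, "call": 0.015}
toy_ham = {"free": 0.002, "call": 0.004}

# A very long message: the plain product underflows to 0.0 for both classes
long_message = ["free", "call"] * 200
raw_spam = 0.13 * math.prod(toy_spam[w] for w in long_message)
raw_ham = 0.87 * math.prod(toy_ham[w] for w in long_message)

log_spam = log_score(long_message, toy_spam, 0.13)
log_ham = log_score(long_message, toy_ham, 0.87)

print(raw_spam, raw_ham)   # both products underflow, so they cannot be compared
print(log_spam > log_ham)  # the log scores still can
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the raw products whenever those products are representable.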

[Back to top](#Background)

--- 

## Measuring filter accuracy

To measure accuracy, we'll run the filter on the test set, which was not used in training, and compare the label assigned by the filter with the human-assigned label that comes with the dataset.
We'll adapt the ```classify``` function above to return a value instead of printing it, so we can apply it to create a new column.

In [49]:
# Function to test the filter function
def classify_test_set(message):
    # Same preprocessing as in classify: strip punctuation, lower-case, tokenise
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = prob(message, parameters_spam, p_spam)
    p_ham_given_message = prob(message, parameters_ham, p_ham)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilities, have a human classify this!'

In [50]:
# Applying the function 
test['predicted'] = test['spam_sms'].apply(classify_test_set)
test.head()

Unnamed: 0,Label,spam_sms,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out,spam
3,ham,All sounds good. Fingers . Makes it difficult to type,ham
4,ham,"All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing!",ham



The applied function works as expected. To measure how well the filter performs, we'll use accuracy, defined as:

$$
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
$$

The for loop below computes this on the test set and prints the results:

In [51]:
# Measuring accuracy value 
correct = 0
total = len(test['Label'])
for val_1, val_2 in zip(test['Label'],test['predicted']):
    if val_1 == val_2:
        correct += 1
    else:
        continue
print("Accuracy value of the spam filter is {}%.".format(correct/total * 100))
print("Number of correct label: {}".format(correct))
print("Number of incorrect label: {}".format(total-correct))

Accuracy value of the spam filter is 98.74326750448833%.
Number of correct label: 1100
Number of incorrect label: 14
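
The same measurement can also be written without an explicit loop. The sketch below uses a small made-up results table (hypothetical data, not the notebook's test set) and adds a cross-tabulation, which shows what kind of mistakes were made rather than only how many.

```python
import pandas as pd

# Toy labels and predictions (hypothetical data, for illustration only)
results = pd.DataFrame({
    "Label":     ["ham", "ham", "spam", "ham", "spam"],
    "predicted": ["ham", "ham", "spam", "spam", "spam"],
})

# Element-wise comparison gives booleans; their mean is the accuracy
matches = results["Label"] == results["predicted"]
accuracy = matches.mean()
print(f"{accuracy:.1%}", matches.sum(), (~matches).sum())

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(results["Label"], results["predicted"])
print(confusion)
```

In this toy table the single error is a ham message flagged as spam, which for a real filter is usually the more costly kind of mistake.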


---

## Conclusion

In conclusion, this was a really worthwhile project that helped me understand how Bayes' theorem is applied and how a simple statistical spam filter works. The filter reached 98.74% accuracy on the test set (1,100 of 1,114 messages classified correctly). It would be interesting to see whether modifying the algorithm could increase the accuracy further.

[Back to top](#Background)

---