# Filtering Spam Messages Using the Naive Bayes Algorithm
This project uses the Naive Bayes algorithm to estimate the probability that a message is spam or not spam (ham). The filter is trained on a dataset of labelled messages and then classifies each message as one of the two options.

In [1]:
## Imports
import pandas as pd
import numpy as np 

In [2]:
## Reading in dataset
spam_collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
spam_collection.shape

(5572, 2)

In [4]:
spam_collection.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
spam_collection.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [6]:
spam_collection['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

We can see that the dataset contains 5,572 messages. Of these, 86.59% are non-spam (ham) and 13.41% are spam.

## Training and Test Set

In [7]:
## Randomizing set
samples = spam_collection.sample(frac=1,random_state=1)

In [8]:
## Training set will consist of 80% of the data while the test set will account for 20%
spam_train = samples[:round(samples.shape[0]*0.8)].reset_index(drop=True)
spam_train.shape

(4458, 2)

In [9]:
spam_test = samples[round(samples.shape[0]*0.8):].reset_index(drop=True)
spam_test.shape

(1114, 2)

In [10]:
## Find percentage of spam and ham in both datasets
spam_test['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [11]:
spam_train['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

The ratio of non-spam to spam messages is consistent across the two sets (roughly 87% to 13%), and it matches the full dataset.
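
The shuffle-then-slice split above can be sketched on plain Python lists. This is only an illustration on toy data: `split_rows` and `spam_share` are hypothetical helpers that do not appear in the notebook.

```python
import random

def split_rows(rows, train_frac=0.8, seed=1):
    """Shuffle a copy of rows and cut it into train and test parts (hypothetical helper)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def spam_share(rows):
    """Fraction of rows whose label is 'spam'."""
    return sum(1 for label, _ in rows if label == 'spam') / len(rows)

# Toy data: 20 spam and 80 ham messages
rows = [('spam', f'offer {i}') for i in range(20)] + [('ham', f'hello {i}') for i in range(80)]

train, test = split_rows(rows)
print(len(train), len(test))
print(round(spam_share(train), 3), round(spam_share(test), 3))
```

A purely random split does not guarantee identical class ratios in both parts, which is why the notebook checks them explicitly.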

## Some data cleaning
To count words consistently, we should clean the data so that every letter is lowercase and punctuation such as "!" is removed. Let's do this for both sets.
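
As a small sketch of this cleaning step on a single string, using only the standard `re` module: `clean_message` is a hypothetical helper name, and the sample text is made up. Note that `\W` keeps letters, digits, and underscores and replaces everything else.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase (hypothetical helper)."""
    return re.sub(r'\W', ' ', text).lower()

raw = "Had your mobile 10 mths? Update NOW!"
cleaned = clean_message(raw)

# Splitting on whitespace collapses the extra spaces left by the replacement
print(cleaned.split())
```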

In [12]:
## BEFORE (TEST)
spam_test.head(3)

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...


In [13]:
spam_test['SMS'] = spam_test['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()


In [14]:
## AFTER (TEST)
spam_test.head(3)

Unnamed: 0,Label,SMS
0,ham,later i guess i needa do mcat study too
1,ham,but i haf enuff space got like 4 mb
2,spam,had your mobile 10 mths update to latest oran...


In [15]:
## BEFORE (TRAIN)
spam_train.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


In [16]:
spam_train['SMS'] = spam_train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()


In [17]:
## AFTER (TRAIN)
spam_train.head(3)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired


## Vocabulary
We need a list of every unique word in the SMS column so that we can later count how often each one appears in spam and ham messages. First, let's build that list.
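
A minimal sketch of the idea on toy data, assuming the messages are already split into word lists: a set comprehension collects each word once. The sample messages are invented for illustration.

```python
# Toy tokenized messages (made up for illustration)
messages = [
    ['free', 'entry', 'win', 'free'],
    ['ok', 'see', 'you'],
    ['win', 'cash', 'now'],
]

# A set keeps each word once; sorting only makes the output reproducible
vocabulary = sorted({word for message in messages for word in message})
print(vocabulary)
print(len(vocabulary))
```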

In [18]:
## Create lists of each SMS message
spam_train['SMS'] = spam_train['SMS'].str.split()

In [19]:
spam_train.head(2)

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."


In [20]:
## Nested loop to iterate over each word of each list in the SMS column
vocabulary = []
for mes in spam_train['SMS']:
    for word in mes:
        ## Add word to list
        vocabulary.append(word)

## Turn the list into a set once, after the loops, so duplicates are dropped, then back into a list
vocabulary = list(set(vocabulary))

print(vocabulary)



In [21]:
## Create dictionary to store occurrence of word
word_counts_per_sms = {unique_word: [0] * len(spam_train['SMS']) for unique_word in vocabulary}

## Use loop to increment how much each word appears in a message
for index, sms in enumerate(spam_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
## Create new dataframe to store how many times each word occurs in a given sms message
words = pd.DataFrame(word_counts_per_sms)

words.head(10)

Unnamed: 0,bec,bottom,soil,wif,george,via,charity,starts,refilled,cleared,...,78,comfort,full,equally,2moro,requires,aeronautics,shy,whose,explain
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
## Concatenate the label so we can see if the message is spam or not
train_word = pd.concat([spam_train,words],axis=1)

train_word.head(10)

Unnamed: 0,Label,SMS,bec,bottom,soil,wif,george,via,charity,starts,...,78,comfort,full,equally,2moro,requires,aeronautics,shy,whose,explain
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,ham,"[ok, i, thk, i, got, it, then, u, wan, me, 2, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ham,"[i, want, kfc, its, tuesday, only, buy, 2, mea...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,ham,"[no, dear, i, was, sleeping, p]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,ham,"[ok, pa, nothing, problem]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,ham,"[ill, be, there, on, lt, gt, ok]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
train_word.shape

(4458, 7785)

The large number of columns comes from the vocabulary: there is one column for each of the 7,783 unique words in the training set, plus the Label and SMS columns.
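
Most entries in this table are zero. As an alternative sketch, not the notebook's approach, per-class word counts can be kept in `collections.Counter` objects, which store only the words that actually occur. The toy data below is invented.

```python
from collections import Counter

# Toy training data: (label, tokenized message)
train = [
    ('spam', ['win', 'cash', 'now', 'win']),
    ('ham', ['see', 'you', 'now']),
    ('ham', ['ok', 'see', 'you']),
]

# One counter per class; a word missing from a counter reads as 0
counts = {'spam': Counter(), 'ham': Counter()}
for label, words in train:
    counts[label].update(words)

n_spam = sum(counts['spam'].values())  # total words in spam messages
n_ham = sum(counts['ham'].values())    # total words in ham messages
print(counts['spam']['win'], counts['ham']['win'], n_spam, n_ham)
```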

## Calculating constants

![Example Image](NB%20formula.png)

To classify which messages are spam we have to use this Naive Bayes formula where P(Wi | Spam) is the probability of a unique word appearing given the condition of it being a spam message.

![Example Image](NB%202.png)

These are the formulas for the conditional probabilities for spam and ham. Let's first find:

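 - P(spam): the probability that a message is spam
 - P(ham): the probability that a message is non-spam
 - Nspam / Nham: the total number of words in spam / ham messages
 - Nvocabulary: the number of unique words in the vocabulary. Note that the code below computes it as the total number of words in all messages instead, which makes the smoothing term larger than under the usual definition
 
These constants do not change from one message to the next.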
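
The formula images are not reproduced in this text. Assuming they show the usual multinomial Naive Bayes model with additive smoothing, the quantities above fit together as follows, with symbols chosen to match the variable names in the code cells:

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, and $\alpha$ is the smoothing parameter.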

In [24]:
## Find Probability of spam messages
p_spam = ((train_word[train_word['Label']=='spam']).shape[0]) / train_word.shape[0]
print('Probability of getting a spam message is: ' + str(p_spam))

## Find Probability of non-spam messages (ham)
p_ham = ((train_word[train_word['Label']=='ham']).shape[0]) / train_word.shape[0]
print('Probability of getting a non-spam message is: ' + str(p_ham))

Probability of getting a spam message is: 0.13458950201884254
Probability of getting a non-spam message is: 0.8654104979811574


In [25]:
## Find Nspam
s = train_word[train_word['Label']=='spam'].iloc[:,2:]
Nspam = s.sum()
Nspam = Nspam.sum()
print(Nspam)

15190


In [26]:
## Find Nham
h = train_word[train_word['Label']=='ham'].iloc[:,2:]
Nham = h.sum()
Nham = Nham.sum()
print(Nham)

57237


In [27]:
## Find Nvocabulary (computed here as the total word count over all training messages)
v = train_word[(train_word['Label']=='ham')|(train_word['Label']=='spam')].iloc[:,2:]
Nvoc = v.sum()
Nvoc = Nvoc.sum()
print(Nvoc)

## Check to see if Nvoc is the sum of Nham and Nspam
print(Nvoc==(Nspam+Nham))

72427
True


We have now found the total number of words in spam messages, in non-spam messages, and in the whole training set, which we will use in our filter. Lastly, let's initialize the alpha variable used in the equation to 1 for Laplace smoothing.

In [28]:
alpha = 1

## Calculating Parameters
Now that we have our constants, let's find the parameters we need: P(Wi|spam) and P(Wi|ham). These probabilities are calculated for every word in the vocabulary. We build two dictionaries: spam_dict stores the conditional probability of each word given spam, and ham_dict does the same given ham.
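
To see what the smoothing does, here is a small worked sketch on made-up counts. `smoothed_probability` is a hypothetical helper, and `vocab_size` is taken as the number of distinct words, which is an assumption of this example rather than what the notebook computes.

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha=1):
    """Additive (Laplace) smoothing for one word in one class (hypothetical helper)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Toy class: 10 words in total, spread over a 5-word vocabulary
counts = {'win': 3, 'cash': 2, 'now': 5, 'see': 0, 'you': 0}
class_total = sum(counts.values())
vocab_size = len(counts)

probs = {w: smoothed_probability(c, class_total, vocab_size) for w, c in counts.items()}

# A word never seen in this class still gets a small nonzero probability
print(probs['see'])
# With vocab_size equal to the number of distinct words, the probabilities sum to 1
print(sum(probs.values()))
```

Without smoothing, a single unseen word would set the whole product for that class to zero.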

In [29]:
## Initialize two dictionaries, one for spam and one for ham
spam_dict = {}
ham_dict = {}

for word in s.columns:
    spam_dict[word] = 0
    
for word in h.columns:
    ham_dict[word] = 0

In [30]:
## Find conditional probabilities of each word given spam or ham
for word in vocabulary:
    n_wi_given_spam = s[word].sum()
    p_wi_spam = (n_wi_given_spam + alpha) / (Nspam + (alpha*Nvoc))
    
    n_wi_given_ham = h[word].sum()
    p_wi_ham = (n_wi_given_ham + alpha) / (Nham + (alpha*Nvoc))
    
    ## Add each probability to the corresponding word in the dictionaries
    spam_dict[word] = p_wi_spam
    ham_dict[word] = p_wi_ham

In [31]:
spam_dict

{'bec': 1.141331020235799e-05,
 'bottom': 1.141331020235799e-05,
 'soil': 1.141331020235799e-05,
 'wif': 1.141331020235799e-05,
 'george': 2.282662040471598e-05,
 'via': 3.423993060707397e-05,
 'charity': 7.989317141650592e-05,
 'starts': 5.706655101178995e-05,
 'refilled': 1.141331020235799e-05,
 'cleared': 1.141331020235799e-05,
 'bless': 1.141331020235799e-05,
 'btw': 1.141331020235799e-05,
 'talking': 1.141331020235799e-05,
 '6wu': 3.423993060707397e-05,
 'puzzeles': 1.141331020235799e-05,
 '81303': 4.565324080943196e-05,
 'sachin': 1.141331020235799e-05,
 '08001950382': 3.423993060707397e-05,
 'reception': 1.141331020235799e-05,
 'games': 0.00017119965303536984,
 'entitled': 5.706655101178995e-05,
 '150': 0.00015978634283301185,
 'actor': 1.141331020235799e-05,
 'may': 6.847986121414794e-05,
 'dirtiest': 2.282662040471598e-05,
 'applyed': 1.141331020235799e-05,
 'aiyo': 1.141331020235799e-05,
 'screen': 1.141331020235799e-05,
 'apo': 1.141331020235799e-05,
 '4txt': 3.4239930607073

In [32]:
ham_dict

{'bec': 2.3136722606120434e-05,
 'bottom': 1.5424481737413622e-05,
 'soil': 1.5424481737413622e-05,
 'wif': 0.0002005182625863771,
 'george': 7.712240868706811e-06,
 'via': 6.94101678183613e-05,
 'charity': 7.712240868706811e-06,
 'starts': 3.0848963474827244e-05,
 'refilled': 2.3136722606120434e-05,
 'cleared': 3.0848963474827244e-05,
 'bless': 2.3136722606120434e-05,
 'btw': 3.0848963474827244e-05,
 'talking': 5.398568608094768e-05,
 '6wu': 7.712240868706811e-06,
 'puzzeles': 1.5424481737413622e-05,
 '81303': 7.712240868706811e-06,
 'sachin': 2.3136722606120434e-05,
 '08001950382': 7.712240868706811e-06,
 'reception': 1.5424481737413622e-05,
 'games': 7.712240868706811e-06,
 'entitled': 7.712240868706811e-06,
 '150': 7.712240868706811e-06,
 'actor': 2.3136722606120434e-05,
 'may': 0.0002622161895360316,
 'dirtiest': 7.712240868706811e-06,
 'applyed': 1.5424481737413622e-05,
 'aiyo': 2.3136722606120434e-05,
 'screen': 3.0848963474827244e-05,
 'apo': 2.3136722606120434e-05,
 '4txt': 7.

As we can see, we have updated each probability of a word given either spam or ham.

## Classification filter
Now, we are able to write a function which takes in a message, and can tell us whether it is spam or not. Note that these two equations are going to be used:

![Example Image](NB%20formula.png)

We have already found P(Wi|spam/ham) so now to classify if it is a spam message or not we find both P(Spam|w1,w2,..,wn) and P(Ham|w1,w2,..,wn).

- If P(Spam|w1,w2,..,wn) > P(Ham|w1,w2,..,wn) then we classify the message as Spam
- If P(Spam|w1,w2,..,wn) < P(Ham|w1,w2,..,wn) then we classify the message as Not-spam
- If P(Spam|w1,w2,..,wn) = P(Ham|w1,w2,..,wn) then we state that human interaction needs to be involved for determination
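
The comparison above multiplies many small probabilities, which can underflow to zero for long messages. One common workaround, sketched here as an assumption and not as the notebook's method, is to compare sums of logarithms instead. `classify_log` and the toy probabilities are hypothetical.

```python
import math

def classify_log(words, prior_spam, prior_ham, spam_probs, ham_probs):
    """Compare log-scores instead of raw products (hypothetical helper)."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(prior_ham)
    for word in words:
        # Words missing from the training vocabulary are skipped
        if word in spam_probs and word in ham_probs:
            log_spam += math.log(spam_probs[word])
            log_ham += math.log(ham_probs[word])
    if log_spam > log_ham:
        return 'spam'
    if log_spam < log_ham:
        return 'ham'
    return 'Equal'

# Toy probabilities, made up for illustration
spam_probs = {'win': 0.3, 'see': 0.05}
ham_probs = {'win': 0.02, 'see': 0.3}

print(classify_log(['win', 'win'], 0.2, 0.8, spam_probs, ham_probs))
print(classify_log(['see', 'unknown'], 0.2, 0.8, spam_probs, ham_probs))

# Multiplying 100 small probabilities underflows to 0.0 in floating point
naive_product = 1.0
for _ in range(100):
    naive_product *= 1e-5
# The equivalent sum of logs stays finite
log_sum = 100 * math.log(1e-5)
print(naive_product, log_sum)
```

Because the logarithm is monotonic, comparing log-scores gives the same decision as comparing the products whenever the products do not underflow.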

In [33]:
import re

def spam_or_not(message):
    
    ## Clean up initial message like we did before 
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    ## Initialize the probabilities of the conditional variables with the probability of a spam or ham message (equation above)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
   
    ## Loop over each word in the message list
    for word in message:
        ## Check if the word is in the dictionaries
        if word in spam_dict:
            ## Update the probability with the conditional probability of word in either spam/ ham dictionary
            p_spam_given_message *= spam_dict[word]
            
        if word in ham_dict:     
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal'

In [34]:
## Testing function
spam_or_not('had your mobile 10 mths  update to latest orange camera video phones for free  save  s with free texts weekend calls  text yes for a callback orno to opt out')

'spam'

In [35]:
spam_or_not("Sounds good, Tom, then see u there")

'ham'

## Measuring the Filter's accuracy
It is time to run the filter on the test set and compare its output with the known labels. We already know whether each test message is spam or ham, so we can check how often the algorithm gets it right.

In [36]:
spam_test['Classification'] = spam_test['SMS'].apply(spam_or_not)
spam_test.head()

Unnamed: 0,Label,SMS,Classification
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest oran...,spam
3,ham,all sounds good fingers makes it difficult ...,ham
4,ham,all done all handed in don t know if mega sh...,ham


In [37]:
## We will see how many rows matched in the 'Label' and 'Classification' columns and measure accuracy by dividing correct by total
correct = 0
total = spam_test.shape[0]
wrong = []
for row in spam_test.iterrows():
    row = row[1]
    if row['Classification'] == row['Label']:
        correct += 1
    else:
        wrong.append(row['SMS'])
        
accuracy = correct / total

In [38]:
print('Correct messages', correct)
print('Incorrect messages', total - correct)
print('Accuracy of Function:', round(accuracy*100,2),'%')

Correct messages 1076
Incorrect messages 38
Accuracy of Function: 96.59 %


The filter works well: with an accuracy of 96.59%, it correctly classifies 1,076 of the 1,114 test messages as spam or ham.
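
Since only about 13% of messages are spam, accuracy alone can hide how many spam messages slip through. As a hedged sketch on invented labels, not the notebook's results, precision and recall for the spam class can be computed from the confusion counts. `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive='spam'):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

# Toy labels, made up for illustration
actual = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
predicted = ['spam', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham']

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught
print(accuracy, round(precision, 3), round(recall, 3))
```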

In [39]:
## Check wrongly classified messages
wrong_df = spam_test[spam_test['SMS'].isin(wrong)]
wrong_df

Unnamed: 0,Label,SMS,Classification
51,spam,freemsg hey i m buffy 25 and love to satis...,ham
89,spam,goldviking 29 m is inviting you to be his fr...,ham
114,spam,not heard from u4 a while call me now am here...,ham
135,spam,more people are dogging in your area now call...,ham
141,spam,dear voucher holder 2 claim your 1st class air...,ham
152,ham,unlimited texts limited minutes,spam
169,spam,hottest pics straight to your phone see me g...,ham
180,spam,win the newest harry potter and the order of ...,ham
263,spam,themob yo yo yo here comes a new selection of ...,ham
284,ham,nokia phone is lovly,spam


## Conclusion
In conclusion, we have created a strong filter to classify whether a message is spam or not. A possible future improvement is to keep punctuation and letter case as features, instead of stripping them during cleaning, and see whether they help the classification.