## Spam filter using Naive Bayes Algorithm

<p style="font-size:17px;text-align:justify">In this work I will build a multinomial Naive Bayes classifier to sort the messages from the <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection">SMSSpamCollection</a> dataset according to whether each message is spam or not.</p>

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('max_colwidth', 75)

In [2]:
#reading in the tab-separated dataset, naming the columns 'Label' and 'SMS'.
sms = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Tex...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [3]:
sms.shape

(5572, 2)

In [4]:
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

<p style="text-align:justify;font-size:15px">There are 5572 messages in the dataset, of which 13.4% are spam.</p>
<hr style="border-width:1px;border-color:black">

<p style="text-align:justify;font-size:15px">A filter like this needs to be evaluated to see how accurate it is. To avoid biasing the evaluation, I'm going to define the test before creating the algorithm.
<br>
The dataset will be split in an 80:20 ratio. The larger part will be used to "teach" the algorithm and the smaller part will serve as the test set. The final algorithm should reach an accuracy of at least 80%.</p>

In [5]:
#creating a randomized dataset to ensure the two chunks for feeding and testing are representative.
sms_randomized = sms.sample(frac=1, random_state=1)

In [6]:
per_80 = round(len(sms_randomized) * 0.8)
sms_feed = sms_randomized.iloc[0:per_80,:].reset_index(drop=True)
sms_test = sms_randomized.iloc[per_80:,:].reset_index(drop=True)

In [7]:
sms_feed['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [8]:
sms_test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

<p style="text-align:justify;font-size:15px">Now there are two datasets, both with roughly the same ham/spam ratio as the original.</p>
<hr style="border-width:1.5px;border-color:black">

<p style="text-align:justify;font-size:15px">To make the calculations easier, the dataset needs to be cleaned and reshaped. The approach taken here creates a separate column for each unique word in the vocabulary; each row then holds the number of times that word occurs in the given message.</p>
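
<p style="text-align:justify;font-size:15px">As a small illustration of this layout, the sketch below builds the word counts for two made-up messages. The messages and the names <code>toy_messages</code>, <code>toy_vocabulary</code> and <code>toy_counts</code> are hypothetical and are not part of the dataset or the analysis.</p>

```python
# Two made-up, already tokenized messages (hypothetical, not from the dataset).
toy_messages = [['win', 'money', 'now'], ['see', 'you', 'now', 'now']]

# Sorted list of unique words, so the column order is deterministic.
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per unique word (a future column), one entry per message (a row).
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

# Print each word with its count in the first and second message.
for word in toy_vocabulary:
    print(word, toy_counts[word])
```

<p style="text-align:justify;font-size:15px">Passing a dictionary of this shape to <code>pd.DataFrame</code> gives one column per word and one row per message.</p>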

In [9]:
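#replacing every non-word character (punctuation) with a space and converting the strings to lowercase.
#regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.
sms_feed['SMS'] = sms_feed['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()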


In [10]:
#splitting each message into a list of words.
sms_feed['SMS'] = sms_feed['SMS'].str.split()
#creating a vocabulary with each unique word.
vocabulary = []
for i in sms_feed['SMS']:
    for j in i:
        vocabulary.append(j)
vocabulary = list(set(vocabulary))


In [11]:
#creating a dictionary with a word count per sms
word_counts_per_sms = {unique_word: [0] * len(sms_feed['SMS']) for unique_word in vocabulary}
#the loop variable is named 'message' so that it does not overwrite the 'sms' DataFrame
for index, message in enumerate(sms_feed['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

#converting the dictionary to a pandas DataFrame.
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.shape

(4458, 7783)

In [12]:
#concatenating the new dataframe with the training one, so the labels and messages are present.
sms_feed_clean = pd.concat([sms_feed, word_counts], axis=1)
sms_feed_clean

Unnamed: 0,Label,SMS,sinco,foundurself,useless,enemy,lock,7ish,car,tv,...,table,worry,89123,freek,toll,pictures,yup,oil,package,original
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me, moan]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a, card, on, da, present, l...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,ham,"[sorry, i, ll, call, later, in, meeting, any, thing, related, to, trade...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,"[babe, i, fucking, love, you, too, you, know, fuck, it, was, so, good, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,"[u, ve, been, selected, to, stay, in, 1, of, 250, top, british, hotels,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,"[hello, my, boytoy, geeee, i, miss, you, already, and, i, just, woke, u...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<hr style="border-width:1.5px;border-color:black">
<p style="text-align:justify;font-size:15px">Now that the dataset is clean, it is time to start creating the spam filter.
<br>
First, the prior probabilities of a message being spam or non-spam need to be calculated, along with the total number of words in all spam messages, the total number of words in all non-spam messages, and the size of the vocabulary.
<br>
When calculating P(Wi|Spam) and P(Wi|Ham), the equations need an additional smoothing parameter, alpha. Here alpha is set to 1 (Laplace smoothing).</p>
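
<p style="text-align:justify;font-size:15px">For reference, the quantities described above can be written as follows. The symbols are my own notation, chosen to mirror the variables in the next cell: N<sub>Wi|Spam</sub> is the number of times the word Wi occurs in spam messages, N<sub>Spam</sub> and N<sub>Ham</sub> are the total word counts per class (<code>n_spam</code>, <code>n_ham</code>), and N<sub>Vocabulary</sub> is the vocabulary size (<code>n_vocabulary</code>).</p>

$$P(W_i \mid Spam) = \frac{N_{W_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(W_i \mid Ham) = \frac{N_{W_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

$$P(Spam \mid W_1, \ldots, W_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(W_i \mid Spam)$$

$$P(Ham \mid W_1, \ldots, W_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(W_i \mid Ham)$$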

In [13]:
#creating constants that will be used in calculating final probabilities
p_spam = len(sms_feed_clean[sms_feed_clean['Label']=='spam']) / sms_feed_clean.shape[0]
p_ham = 1 - p_spam
n_spam = 0
for i in sms_feed_clean.loc[sms_feed_clean['Label']=='spam', 'SMS']:
    n_spam += len(i)

n_ham = 0
for i in sms_feed_clean.loc[sms_feed_clean['Label']=='ham', 'SMS']:
    n_ham += len(i)

n_vocabulary = len(vocabulary)
alpha = 1

In [14]:
# creating dictionaries with probabilities for each word in both spam and ham messages.
p_wi_given_spam = {}
p_wi_given_non_spam = {}
for i in vocabulary:
    p_wi_given_spam[i] = 0
    p_wi_given_non_spam[i] = 0

spam_training = sms_feed_clean[sms_feed_clean['Label']=='spam']
ham_training = sms_feed_clean[sms_feed_clean['Label']=='ham']

for word in vocabulary:
    p_wi_given_spam[word] = (spam_training[word].sum() + alpha)/(n_spam + alpha * n_vocabulary)
    p_wi_given_non_spam[word] = (ham_training[word].sum() + alpha)/(n_ham + alpha * n_vocabulary)


In [15]:
#defining function that calculates probabilities for given message in spam and ham messages and assigns the message to either one of the labels.
import re

def classify(message):

    #cleaning the message, so it is ready for use
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #calculating probabilities for message being a spam or non-spam.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for i in message:
        if i in p_wi_given_spam:
            p_spam_given_message *= p_wi_given_spam[i]
        if i in p_wi_given_non_spam:
            p_ham_given_message *= p_wi_given_non_spam[i]
            
    #comparing the probabilities and returning proper labels.
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
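
<p style="text-align:justify;font-size:15px">Multiplying many probabilities smaller than 1 can underflow to zero for very long messages. A common variant of the same comparison adds logarithms instead of multiplying. The sketch below is not the function used in this notebook: <code>classify_log</code> and all the <code>toy_</code> values are hypothetical and made up for illustration.</p>

```python
import math
import re

# Hypothetical toy parameters standing in for p_spam, p_ham and the two
# per-word probability dictionaries built above.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_p_wi_given_spam = {'win': 0.02, 'money': 0.03, 'see': 0.001, 'you': 0.01}
toy_p_wi_given_ham = {'win': 0.001, 'money': 0.002, 'see': 0.02, 'you': 0.03}

def classify_log(message):
    # Same cleaning steps as classify().
    words = re.sub(r'\W', ' ', message).lower().split()

    # Start from the log of the priors and add the log of each word probability.
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_p_wi_given_spam:
            log_spam += math.log(toy_p_wi_given_spam[word])
        if word in toy_p_wi_given_ham:
            log_ham += math.log(toy_p_wi_given_ham[word])

    # The logarithm is monotonic, so the comparison is unchanged.
    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win money!!'))
print(classify_log('See you'))
```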

In [16]:
#testing with a message that is obviously spam.
classify('WINNER!! This is the secret code to unlock the money: C3421')

'spam'

In [17]:
#testing with a message that is obviously not spam.
classify('Sounds good, Tom, then see u there')

'ham'

In [18]:
sms_test['predicted'] = sms_test['SMS'].apply(classify)
sms_test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Orange camera/video phones fo...,spam
3,ham,All sounds good. Fingers . Makes it difficult to type,ham
4,ham,"All done, all handed in. Don't know if mega shop in asda counts as cele...",ham


In [19]:
#counting the test messages whose predicted label matches the true label
correct = int((sms_test['Label'] == sms_test['predicted']).sum())
total = sms_test.shape[0]

accuracy = correct / total
accuracy

0.9874326750448833

#### As can be seen above, the accuracy of the model is almost 99%, well above the 80% target.
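
<p style="text-align:justify;font-size:15px">Since about 87% of the test messages are ham, it can also be useful to see how the correct predictions are distributed between the two classes. The sketch below shows one way to count this with made-up labels; <code>toy_labels</code> and <code>toy_predicted</code> are hypothetical stand-ins for <code>sms_test['Label']</code> and <code>sms_test['predicted']</code>, not real results.</p>

```python
from collections import Counter

# Hypothetical true and predicted labels (made up, not taken from the test set).
toy_labels = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
toy_predicted = ['ham', 'ham', 'spam', 'ham', 'ham', 'ham']

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(toy_labels, toy_predicted))

# For each true class, report how many of its messages were labelled correctly.
for label in ('ham', 'spam'):
    n_label = toy_labels.count(label)
    n_correct = pair_counts[(label, label)]
    print(label, n_correct, 'of', n_label, 'correct')
```

<p style="text-align:justify;font-size:15px">On the real data, <code>pd.crosstab(sms_test['Label'], sms_test['predicted'])</code> should give the same kind of breakdown as a table.</p>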