# Building Spam Filter

In this project, we are going to build a spam filter using Supervised, multinomial Naive Bayes Algorithm. Our goal is to design a program that can classify spam and non-spam messages that have accuracy of more than 90%. 

We shall use the dataset that contained 5572 classified messages. You can download from <a hruf='https://archive.ics.uci.edu/ml/datasets/sms+spam+collection'>here</a>

<img src="Text Classifier.png">

## Exploring the dataset

We first shall import our libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image 
%matplotlib inline

Reading our dataset

In [3]:
df = pd.read_csv('SMSSpamCollection',sep='\t',header=None)
df.columns =['Label', 'SMS']

print('Our dataset has',df.shape[0], 'rows and', df.shape[1],'columns')
print('\n')

print('percentages of spam and ham messages')
print(df['Label'].value_counts(normalize=True))

print('\n')
print('first 5 rows of the dataset')
df.head()

Our dataset has 5572 rows and 2 columns


percentages of spam and ham messages
ham     0.865937
spam    0.134063
Name: Label, dtype: float64


first 5 rows of the dataset


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Seperating our dataset in 2, training set and test set both respectively containing 80% and 20% of the original dataset.

In [4]:
data_random = df.sample(frac=1,random_state=1)
split_index = round(len(data_random)*0.8)

training_set = data_random[:split_index].reset_index(drop=True)
test_set = data_random[split_index:].reset_index(drop=True)

Checking percentages of spam and ham messages in both datasets.

In [5]:
print('training_set')
print(training_set['Label'].value_counts(normalize=True))
print('\n')
print('test_set')
print(test_set['Label'].value_counts(normalize=True))

training_set
ham     0.86541
spam    0.13459
Name: Label, dtype: float64


test_set
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


## Data Cleaning

First step to train our algorithm is to clean our training set, our first step is to getting individual words in messages into lists.

In [6]:
pattern = '\W+'
training_set['SMS'] = training_set['SMS'].str.replace(pattern, ' ').str.lower()
training_set['SMS'] = training_set['SMS'].str.split()
training_set

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."
...,...,...
4453,ham,"[sorry, i, ll, call, later, in, meeting, any, ..."
4454,ham,"[babe, i, fucking, love, you, too, you, know, ..."
4455,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2..."
4456,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea..."


Creating a list that contains unique words from all of the messages.

In [7]:
vocabulary = []
for row in training_set['SMS']:
    for word in row:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

Creating a dictionary that count number of occurence of each words in each messages. Then create a dataframe from this dictionary

In [8]:
word_counts_per_sms = {unique_word: [0]*len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
wcps = pd.DataFrame(word_counts_per_sms)
wcps.head()

Unnamed: 0,arts,easy,soup,burger,l8rs,lot,roads,juswoke,drastic,goto,...,lst,ctxt,meg,arun,stomach,maretare,5wb,ennal,ki,7
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Merging the new dataframe to the training set dataframe

In [9]:
training_set_clean = pd.concat([training_set,wcps],axis=1)

Seperating dataset into 2 individual datasets, one contains spam messages while the other contains non-spam messages.

In [10]:
spam_sms = training_set_clean[training_set_clean['Label']=='spam']
ham_sms = training_set_clean[training_set_clean['Label']=='ham']

## Calculating Parameters

<img src='Formula.png'>

Lets first calculate parameters for the algorithms:
- p_spam: probability of spam messages
- p_ham: probability of ham messages
- n_spam: total number of words in spam messages
- n_ham: total number of words in ham messages
- n_vocabulary: total_number of words in all messages
- alpha: 1 

In [36]:
p_spam = len(spam_sms) / len(training_set_clean)
p_ham = len(ham_sms) / len(training_set_clean)

n_spam = spam_sms['SMS'].apply(len).sum()
n_ham = ham_sms['SMS'].apply(len).sum()
n_vocabulary = len(vocabulary)
alpha = 1

Generating probabilities of each word occurences in both ham and spam messages.  P(wi|Spam) and P(wi|Ham) 

In [37]:
spam_dict = {}
ham_dict = {}
for word in vocabulary:
    spam_dict[word] = 0
    ham_dict[word] = 0

for word in vocabulary:
    n_word_in_spam = spam_sms[word].sum()
    p_word_in_spam = (n_word_in_spam + alpha) / (n_spam + alpha * n_vocabulary)
    spam_dict[word] = p_word_in_spam
    
    n_word_in_ham = ham_sms[word].sum()
    p_word_in_ham = (n_word_in_ham*alpha) / (n_ham + alpha*n_vocabulary)
    ham_dict[word] = p_word_in_ham

## Classifying Messages

<img src='Formula.png'>

We shall write a function that can classify messages to spam and ham category. Here is a summary of each stages of our function.

- Take in a message
- Get each word of the message into a list of words
- Calculating P(wi|Spam) and P(wi|Ham) 
- Comparing P(Spam| w1,w2,w3...) and P(Ham| w1,w2,w3...)


- If P(Spam| w1,w2,w3...) > P(Ham| w1,w2,w3...) then return 'Spam'
- If P(Spam| w1,w2,w3...) < P(Ham| w1,w2,w3...) then return 'Ham'
- If P(Spam| w1,w2,w3...) = P(Ham| w1,w2,w3...) then return 'needs human classification'

In [39]:
import re
def classify(message):
    message = re.sub('\W',' ',message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= spam_dict[word]
            p_ham_given_message *= ham_dict[word]
            
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

## Spam Filter's Accuracy

It is never a bad idea to test our function. We shall test it on our test_set data.

In [42]:
test_set['predicted'] = test_set['SMS'].apply(classify)
test_set.head(10)

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",spam
5,ham,But my family not responding for anything. Now...,ham
6,ham,U too...,ham
7,ham,Boo what time u get out? U were supposed to ta...,ham
8,ham,Genius what's up. How your brother. Pls send h...,ham
9,ham,I liked the new mobile,ham


For our first 10 messages, we got 9 corrects out of 10. Which satisfied our goal for this project.
However, it is better to have a better view to the accuracy of our test.

In [52]:
total_correct = (test_set['predicted'] == test_set['Label']).sum()
total = len(test_set)
percentage_correct = total_correct / total
print('Our Spam Filter\'s accuracy is:', percentage_correct)
print('Total number of corrects is:',total_correct)
print('Total number of incorrects is:',total-total_correct)
print('Number of messages tested is:', total)

Our Spam Filter's accuracy is: 0.9488330341113106
Total number of corrects is: 1057
Total number of incorrects is: 57
Number of messages tested is: 1114


## Result

Our result suggested that the accuracy of our filter had an accuracy of 95% on the test set we used. This has exceeded our initial goal, which is to get 90% accuracy.