- name : Bonaventure Osuide
- course : Msc Applied A.I and Data Science
- school : Solent University
- project : A spam filter using the Naive Bayes Theorem


# INTRODUCTION


In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd

In [2]:
sms_spam = pd.read_csv('C://Users//MY PC//Desktop//DATASETS//smsspamcollection//SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
sms_spam

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


# Exploring The Dataset

In [4]:
sms_spam.shape

(5572, 2)

There are 5,572 messages, which we are going to classify as either spam or ham (non-spam).

In [5]:
sms_spam[sms_spam.Label=='ham'].shape

(4825, 2)

In [6]:
sms_spam[sms_spam.Label=='spam'].shape

(747, 2)

In [7]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

From the above analysis, we can conclude that about 87% of the messages are ham and 13% are spam, which makes sense because in reality most messages are ham rather than spam.

# Train and Test Data

To test the spam filter, we're first going to split our dataset into two parts:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that the training set will have 4,457 messages (about 80% of the dataset) and the test set will have 1,115 messages (about 20% of the dataset).

Scikit-learn's train_test_split is a quick and convenient way to split our dataset. It also shuffles the dataframe by default, which ensures that the spam and ham messages are spread throughout both sets.

In [8]:
from sklearn.model_selection import train_test_split
# Split and shuffle the dataset
train, test = train_test_split(sms_spam, test_size = 0.2, random_state=0)

In [9]:
train.shape

(4457, 2)

In [10]:
test.shape

(1115, 2)

From the above, it is confirmed that the dataframe was split into 80% train and 20% test.
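
The sizes are 4,457 and 1,115 rather than an exact 80/20 split because 20% of 5,572 is not a whole number. The short sketch below reproduces the arithmetic, assuming (as we understand scikit-learn's behaviour when only test_size is given) that the test size is rounded up and the training set gets the remainder. The variable names are ours and are not part of the notebook.

```python
import math

n_samples = 5572   # rows in the full dataset
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(round(n_train / n_samples, 4), round(n_test / n_samples, 4))
```

This matches the shapes printed above, so the split is as close to 80/20 as whole rows allow.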

In [11]:
train.head()

Unnamed: 0,Label,SMS
1114,ham,"No I'm good for the movie, is it ok if I leave..."
3589,ham,If you were/are free i can give. Otherwise nal...
3095,ham,Have you emigrated or something? Ok maybe 5.30...
1012,ham,"I just got home babe, are you still awake ?"
3320,ham,Kay... Since we are out already


In [12]:
test.head()

Unnamed: 0,Label,SMS
4456,ham,"Storming msg: Wen u lift d phne, u say ""HELLO""..."
690,spam,<Forwarded from 448712404000>Please CALL 08712...
944,ham,And also I've sorta blown him off a couple tim...
3768,ham,"Sir Goodmorning, Once free call me."
1189,ham,All will come alive.better correct any good lo...


In [13]:
# reset the index and drop the old index column
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [14]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"No I'm good for the movie, is it ok if I leave..."
1,ham,If you were/are free i can give. Otherwise nal...
2,ham,Have you emigrated or something? Ok maybe 5.30...
3,ham,"I just got home babe, are you still awake ?"
4,ham,Kay... Since we are out already


In [15]:
test.head()

Unnamed: 0,Label,SMS
0,ham,"Storming msg: Wen u lift d phne, u say ""HELLO""..."
1,spam,<Forwarded from 448712404000>Please CALL 08712...
2,ham,And also I've sorta blown him off a couple tim...
3,ham,"Sir Goodmorning, Once free call me."
4,ham,All will come alive.better correct any good lo...


In [16]:
train.Label.value_counts(normalize=True)

ham     0.868297
spam    0.131703
Name: Label, dtype: float64

In [17]:
test.Label.value_counts(normalize=True)

ham     0.856502
spam    0.143498
Name: Label, dtype: float64

We can see the percentages are close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

# Data Cleaning

In [18]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"No I'm good for the movie, is it ok if I leave..."
1,ham,If you were/are free i can give. Otherwise nal...
2,ham,Have you emigrated or something? Ok maybe 5.30...
3,ham,"I just got home babe, are you still awake ?"
4,ham,Kay... Since we are out already


We are going to remove all the punctuation from the SMS column by replacing every character that is not a letter, a digit or an underscore with a space, and then convert the text to lowercase.

In [19]:
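# regex=True is needed so that '\W' is treated as a pattern (the default is False in pandas >= 2.0)
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()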

In [20]:
train['SMS'] = train.SMS.str.split()
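
To see what these two cleaning steps do to a single message, here is a small sketch that applies the same pattern with the standard re module to one of the messages from the training set. The helper name clean_and_tokenize is hypothetical and is used only for this illustration.

```python
import re

def clean_and_tokenize(text):
    """Replace non-word characters with spaces, lowercase, then split on whitespace."""
    return re.sub(r'\W', ' ', text).lower().split()

sample = "Kay... Since we are out already"
tokens = clean_and_tokenize(sample)
print(tokens)  # ['kay', 'since', 'we', 'are', 'out', 'already']
```

Note that an apostrophe is also replaced by a space, so a word like "I'm" ends up as the two tokens "i" and "m".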

In [21]:
vocabulary = []
for sms in train.SMS:
    for word in sms:
        vocabulary.append(word)
# convert vocabulary to set to remove duplicate words and back to list 
vocabulary = list(set(vocabulary))
print(len(vocabulary))


7833


It looks like there are 7,833 unique words in our training set.

# The Final Training Set

In [22]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        

In [25]:
# convert the dictionary to a dataframe
word_counts = pd.DataFrame(word_counts_per_sms)

In [30]:
word_counts.head()

Unnamed: 0,liver,six,mix,gloucesterroad,t,ws,grave,skip,log,jos,...,gifted,issue,wad,lccltd,feelin,dnot,watch,points,way,joking
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# concatenate the new dataframe with the train dataframe
clean_train = pd.concat([train, word_counts], axis=1)

In [34]:
clean_train.head()

Unnamed: 0,Label,SMS,liver,six,mix,gloucesterroad,t,ws,grave,skip,...,gifted,issue,wad,lccltd,feelin,dnot,watch,points,way,joking
0,ham,"[no, i, m, good, for, the, movie, is, it, ok, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[if, you, were, are, free, i, can, give, other...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[have, you, emigrated, or, something, ok, mayb...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[i, just, got, home, babe, are, you, still, aw...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[kay, since, we, are, out, already]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
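
Because the real table has 7,833 columns that are almost all zeros, it can be easier to follow the counting logic on a tiny example. The sketch below repeats the same dictionary-of-lists approach on two made-up, already tokenized messages. The names messages, vocab and counts are ours and the data is not taken from the dataset.

```python
# Two made-up, already tokenized messages (not taken from the dataset)
messages = [['free', 'prize', 'free'], ['see', 'you', 'soon']]

# Unique words, sorted only so that the printed order is stable
vocab = sorted({word for sms in messages for word in sms})

# One list per word, with one counter per message
counts = {word: [0] * len(messages) for word in vocab}
for index, sms in enumerate(messages):
    for word in sms:
        counts[word][index] += 1

for word in vocab:
    print(word, counts[word])
```

Each key becomes a column and each list position becomes a row, which is the layout pd.DataFrame builds from such a dictionary.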


# Building the Spam Filter

We need to calculate the probability of getting a Spam message and Ham message.

In [42]:
# separate the dataframe into spam and ham
spam = clean_train[clean_train.Label=='spam']
ham = clean_train[clean_train.Label=='ham']

In [39]:
alpha = 1

p_spam = len(spam) / len(clean_train)
p_ham = len(ham) / len(clean_train)

In [53]:
spam_words = []
for words in spam.SMS:
    for word in words:
        spam_words.append(word)
# Number of spam words
Nspam = len(spam_words)
# We could also use the comment below
# n_words_per_spam_message = spam['SMS'].apply(len)
# n_spam = n_words_per_spam_message.sum()
ham_words = []
for words in ham.SMS:
    for word in words:
        ham_words.append(word)
# Number of ham words
Nham = len(ham_words)

# Number of words in Vocabulary
Nvocabulary = len(vocabulary)

In [60]:
print(Nspam, Nham, Nvocabulary)

14970 56846 7833


Now that we have calculated the constant terms, we can move on to calculating the parameters: the probability of each word given spam, P(wi|Spam), and the probability of each word given ham, P(wi|Ham).
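
For reference, the smoothed estimates used for these parameters can be written as follows. The subscripted symbols are our notation for the notebook's variables: the numerator count is the number of times the word appears in spam (or ham) messages, and the other terms correspond to Nspam, Nham, Nvocabulary and alpha.

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

With alpha = 1 this is Laplace smoothing. As a worked example using the constants printed above, a word that never appears in a spam message still gets P(wi|Spam) = (0 + 1) / (14970 + 7833) = 1 / 22803, which is about 0.000044, instead of zero.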

# Calculating the Parameters

In [66]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

#calculating the parameters
for word in vocabulary:
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (Nspam + alpha * Nvocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (Nham + alpha * Nvocabulary)
    parameters_ham[word] = p_word_given_ham

# Classifying a New Message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
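
The steps above can be sketched on a toy example. All priors and word probabilities below are made up for illustration and do not come from the dataset, and the names prefixed with toy_ are hypothetical. The sketch also sums logarithms instead of multiplying probabilities, a common variant that avoids very small numbers on long messages. This is our assumption about a sensible alternative, not what this notebook does.

```python
import math

# Made-up priors and word probabilities, for illustration only
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {'free': 0.02, 'prize': 0.01, 'see': 0.001}
toy_params_ham = {'free': 0.002, 'prize': 0.0005, 'see': 0.01}

def toy_classify(words):
    """Compare log P(Spam) + sum(log P(w|Spam)) with the ham equivalent."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words that are not in the vocabulary are simply skipped
        if word in toy_params_spam:
            log_spam += math.log(toy_params_spam[word])
        if word in toy_params_ham:
            log_ham += math.log(toy_params_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(toy_classify(['free', 'prize']))  # spam
print(toy_classify(['see', 'you']))     # ham
```

Since the logarithm is increasing, comparing the sums gives the same label as comparing the products.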

In [67]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [68]:
classify('Sounds good Tom, see you there!')

P(Spam|message): 1.7828141226810212e-21
P(Ham|message): 2.2082119422264364e-18
Label: Ham


In [69]:
classify('My name is Bonaventure')

P(Spam|message): 1.443985584753648e-10
P(Ham|message): 3.094228730984524e-08
Label: Ham



# Measuring the Spam Filter's Accuracy

The two results above look promising, but let's see how well the filter does on our test set, which has 1,115 messages.

We'll start by writing a function that returns classification labels instead of printing them.

In [70]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [71]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,"Storming msg: Wen u lift d phne, u say ""HELLO""...",ham
1,spam,<Forwarded from 448712404000>Please CALL 08712...,spam
2,ham,And also I've sorta blown him off a couple tim...,ham
3,ham,"Sir Goodmorning, Once free call me.",ham
4,ham,All will come alive.better correct any good lo...,ham


Now we'll measure the accuracy of our spam filter to find out how well it does.

In [73]:
correct = 0
total = test.shape[0]
    
for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1099
Incorrect: 16
Accuracy: 0.9856502242152466


The accuracy is about 98.57%, which is really good. Our spam filter looked at 1,115 messages that it hadn't seen in training, and classified 1,099 correctly.
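
Since about 86% of the test messages are ham, a filter that always answered "ham" would already be right roughly 86% of the time, so it can also help to count the errors per class. The sketch below shows one way to do that with the standard library, on made-up labels rather than on the notebook's test set. The names actual, predicted and outcomes are ours.

```python
from collections import Counter

# Made-up labels, for illustration only (not the notebook's test set)
actual = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
predicted = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Count each (actual, predicted) pair
outcomes = Counter(zip(actual, predicted))
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print('Accuracy:', accuracy)
print('Spam caught:', outcomes[('spam', 'spam')])
print('Spam missed:', outcomes[('spam', 'ham')])
print('Ham flagged as spam:', outcomes[('ham', 'spam')])
```

The same counting could be applied to the Label and predicted columns of the test dataframe.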

# Conclusion

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.57% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.

Next steps include:

- Analyze the 16 messages that were classified incorrectly and try to figure out why the algorithm got them wrong
- Try to make the filtering process more complex by making the algorithm sensitive to letter case