# Email Spam Filter with Multinomial Naive Bayes
In this project, I am going to build an email spam filter with the multinomial Naive Bayes algorithm. My goal is to write a program that classifies new emails with an accuracy greater than 90%, so I expect more than 90% of new messages to be classified correctly as spam or non-spam.
To train the algorithm, we'll use a dataset of 5,728 emails that have already been classified by humans. The dataset was obtained from Kaggle.

In [0]:
import pandas as pd

# Data Exploration

In [0]:
from google.colab import files

def f(fname):
    x = files.upload()
    return x[fname]
f('emails.csv')

In [0]:
emails = pd.read_csv('emails.csv')
emails.head()

In [0]:
len(emails)

5728

The dataframe contains just 2 columns: the <b>text</b> column, which holds the email text, and the <b>spam</b> column, which holds 1s and 0s, where 1 means the message is spam and 0 means it is non-spam.

In [0]:
emails['spam'].value_counts(normalize=True) * 100

0    76.117318
1    23.882682
Name: spam, dtype: float64

Above, about 76% of all the messages are not spam and the remaining 24% are spam. This sample looks representative, since in practice most messages that people receive are non-spam.

# Data Cleaning
To calculate all the probabilities required by the algorithm, we'll first need to perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.

Remove Punctuation and special characters

We'll begin by removing all the punctuation.
The text column also contains text that isn't part of the main message: the word 'Subject'. I'm going to remove it from the messages so that the algorithm has only the necessary data.

In [0]:
emails['text'].head(3)

0    Subject: naturally irresistible your corporate...
1    Subject: the stock trading gunslinger  fanny i...
2    Subject: unbelievable new homes made easy  im ...
Name: text, dtype: object

In [0]:
emails['text'] = emails['text'].str.replace('Subject:', '').str.strip()

In [0]:
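# regex=False treats '.' and '*' as literal characters rather than regular-expression syntax
emails['text'] = emails['text'].str.replace('.', '', regex=False).str.replace('*', '', regex=False).str.replace(',', '', regex=False).str.replace(':', '', regex=False).str.replace('-', '', regex=False).str.strip()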
emails['text']

0       naturally irresistible your corporate identity...
1       the stock trading gunslinger  fanny is merrill...
2       unbelievable new homes made easy  im wanting t...
3       4 color printing special  request additional i...
4       do not have money  get software cds from here ...
                              ...                        
5723    re  research and development charges to gpg  h...
5724    re  receipts from visit  jim   thanks again fo...
5725    re  enron case study update  wow ! all on the ...
5726    re  interest  david   please  call shirley cre...
5727    news  aurora 5  2 update  aurora version 5  2 ...
Name: text, Length: 5728, dtype: object
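
As a small, self-contained illustration of the cleaning step, the sketch below applies the same character removals to two made-up strings. The `clean_text` helper and the sample strings are hypothetical and not part of the notebook. `regex=False` is passed so that `.` and `*` are treated as literal characters rather than regular-expression syntax.

```python
import pandas as pd

def clean_text(series):
    """Drop the 'Subject:' prefix and the characters . * , : - from each string."""
    cleaned = series.str.replace('Subject:', '', regex=False)
    for char in ['.', '*', ',', ':', '-']:
        cleaned = cleaned.str.replace(char, '', regex=False)
    return cleaned.str.strip()

# Made-up examples, for illustration only
sample = pd.Series(['Subject: re : hello, world.', 'Subject: 50% off - today only *'])
cleaned_sample = clean_text(sample)
print(cleaned_sample.tolist())
```

Note that the characters are deleted rather than replaced with a space, so gaps such as "re  hello" keep the surrounding whitespace.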

The messages that have been classified as spam are labelled 1, whereas the non-spam messages are labelled 0. About 76% of the messages in the dataset are non-spam, while only 24% are spam.

The next step is to create our spam filter. The first thing to do is to split the dataset into training and test sets:

- A training set, which we'll use to "train" the computer to classify messages.
- A test set, which we'll use to test how good the spam filter is at classifying new messages.

I will split the dataset in an 80:20 ratio for the training set and test set respectively, so that the algorithm is trained on as much data as possible while still leaving enough data to test.

In [0]:
# Randomize the dataset
emails_randomized = emails.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(emails_randomized) * 0.8)

# Training/Test split
training_set = emails_randomized[:training_test_index].reset_index(drop=True)
test_set = emails_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4582, 2)
(1146, 2)


In [0]:
# Check the class proportions in the training set to confirm they match the full dataset
training_set['spam'].value_counts(normalize=True)

0    0.75993
1    0.24007
Name: spam, dtype: float64

In [0]:
test_set['spam'].value_counts(normalize=True)

0    0.766143
1    0.233857
Name: spam, dtype: float64

# Creating the Vocabulary

Let's now move to creating the vocabulary, which in this context means a list with all the unique words in our training set.

In [0]:
training_set['text'] = training_set['text'].str.split()

vocabulary = []
for text in training_set['text']:
    for word in text:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [0]:
len(vocabulary)

34188

It looks like there are 34,188 unique words in all the messages of the training set.

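I'm now going to use the vocabulary just created to make the data transformation we want: a table with one column per vocabulary word and one row per message, holding how many times each word occurs in each message.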

In [0]:
import numpy as np

len_training_set = len(training_set['text'])

word_counts_per_text = {unique_word: np.zeros(len_training_set, dtype=np.int32) for unique_word in vocabulary}

for index, text in enumerate(training_set['text']):
    for word in text:
        word_counts_per_text[word][index] += 1

In [0]:
word_counts = pd.DataFrame(word_counts_per_text)
word_counts.head()

Unnamed: 0,calpx,pete,nominating,ability,marshall,lumber,referee,adds,upper,label,therfore,georganne,refuses,screechy,012501,continues,slowed,specifiers,mathews,trademark,does,redoubtable,karine,puhca,amiry,oom,356,grumulaitis,feasiblity,496,ukrpi,bette,insiders,lengthens,boland,appropriately,502373,308,zaprzeczenie,dorn,...,155,computable,caboose,entity,goup,curso,godzinie,customs,councils,keep,haigh,1917,talkington,mataya,347,deem,sparkling,ebrochure,hedland,intel,ondarza,neural,marking,whitebook,ord,spacial,wives,3652,markswv,jewels,kmart,objected,driven,muzzy,31212,parcel,toronto,116,ridge,heuristically
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
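
To make the transformation concrete, here is a toy version on two made-up messages. The names `toy_messages`, `toy_vocabulary`, and `toy_counts` are illustrative only. Each column is a vocabulary word and each row holds how many times that word occurs in the corresponding message.

```python
import numpy as np
import pandas as pd

# Two made-up, already tokenized messages
toy_messages = [['free', 'money', 'free'], ['meeting', 'money']]

# Sorted so the column order is deterministic
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One zero-filled array per word, with one slot per message
toy_counts_per_word = {word: np.zeros(len(toy_messages), dtype=np.int32)
                       for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts_per_word[word][index] += 1

toy_counts = pd.DataFrame(toy_counts_per_word)
print(toy_counts)
```

On the real training set this table has 4,582 rows and 34,188 columns, so with 32-bit counts it needs roughly 0.6 GB of memory.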


In [0]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head(3)

NameError: ignored

# Calculating Constants First
I am now done with cleaning the training set, and can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages.

- Probability that the message is spam given the words it contains: P(Spam|w1,w2,w3...wn)
- Probability that the message is non-spam given the words it contains: P(non_spam|w1,w2,w3...wn)
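
Under the naive assumption that words are conditionally independent given the class, these two quantities are proportional to the products that the classification code later in this notebook computes:

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{non\_spam} \mid w_1, \dots, w_n) &\propto P(\text{non\_spam}) \prod_{i=1}^{n} P(w_i \mid \text{non\_spam})
\end{aligned}
```

Since both expressions share the same normalizing constant, comparing the two products is enough to pick a label.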

In [0]:
# Isolating spam and non-spam messages first
spam_messages = training_set_clean[training_set_clean['spam'] == 1]
non_spam_messages = training_set_clean[training_set_clean['spam'] == 0]

# P(Spam) and P(non_spam)
p_spam = len(spam_messages) / len(training_set_clean)
p_non_spam = len(non_spam_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['text'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Non_spam
n_words_per_non_spam_message = non_spam_messages['text'].apply(len)
n_non_spam = n_words_per_non_spam_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

# Calculating Parameters
Now that the constant terms have been calculated above, I can move on to calculating the parameters P(wi|Spam) and P(wi|non_spam). Each parameter is a conditional probability value associated with one word in the vocabulary.
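
Written out, the Laplace-smoothed estimates computed in the next cell take the form below. Here N with a word subscript corresponds to `n_word_given_spam` / `n_word_given_non_spam`, the class totals correspond to `n_spam` / `n_non_spam`, the vocabulary size to `n_vocabulary`, and the smoothing term to `alpha`:

```latex
\begin{aligned}
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\
P(w_i \mid \text{non\_spam}) &= \frac{N_{w_i \mid \text{non\_spam}} + \alpha}{N_{\text{non\_spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{aligned}
```

With alpha set to 1, a word that never appears in one class still gets a small non-zero probability instead of zeroing out the whole product.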

In [0]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_non_spam = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_non_spam = non_spam_messages[word].sum()   # non_spam_messages already defined in a cell above
    p_word_given_non_spam = (n_word_given_non_spam + alpha) / (n_non_spam + alpha*n_vocabulary)
    parameters_non_spam[word] = p_word_given_non_spam
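
The loop above calls `.sum()` once per word, which can be slow with tens of thousands of columns. As a possible alternative, the sketch below computes all smoothed probabilities for one class in a single vectorized step. It runs on a made-up two-word table, and every `toy_` name is a hypothetical stand-in for the notebook's variables.

```python
import pandas as pd

# Toy stand-in for spam_messages: one row per spam message, one column per word
toy_spam = pd.DataFrame({'free': [2, 1], 'meeting': [0, 1]})
toy_vocabulary = ['free', 'meeting']

alpha = 1
n_toy_spam = int(toy_spam[toy_vocabulary].to_numpy().sum())   # total words in spam messages
n_toy_vocabulary = len(toy_vocabulary)

# Column sums give the count of every word in the spam class at once
word_totals = toy_spam[toy_vocabulary].sum()
toy_parameters_spam = ((word_totals + alpha) / (n_toy_spam + alpha * n_toy_vocabulary)).to_dict()
print(toy_parameters_spam)
```

The smoothed probabilities for a class sum to 1 over the vocabulary, which is a quick sanity check on the result.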

# Classifying A New Message
Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(non_spam|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(non_spam|w1, w2, ..., wn), and:
  - If P(non_spam|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as non-spam.
  - If P(non_spam|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
  - If P(non_spam|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [0]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    # Remove the same characters that were stripped from the training data
    message = re.sub(r'[.*,:-]', '', message).strip()
    message = message.split()
    
    # Start from the class priors
    p_spam_given_message = p_spam
    p_non_spam_given_message = p_non_spam

    # Multiply in the probability of each known word; unknown words are skipped
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_non_spam:
            p_non_spam_given_message *= parameters_non_spam[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(non_spam|message):', p_non_spam_given_message)
    
    if p_non_spam_given_message > p_spam_given_message:
        print('Label: Non_spam')
    elif p_non_spam_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [0]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

# Measuring the Spam Filter's Accuracy
A single hand-written message is not much of a test, so let's see how well the filter does on our test set, which has 1,146 messages.

We'll start by writing a function that returns classification labels instead of printing them.

In [0]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    # The test set text was already cleaned above, so it only needs splitting into words
    message = message.split()
    
    p_spam_given_message = p_spam
    p_non_spam_given_message = p_non_spam

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_non_spam:
            p_non_spam_given_message *= parameters_non_spam[word]
    
    if p_non_spam_given_message > p_spam_given_message:
        return 'non_spam'
    elif p_spam_given_message > p_non_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
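
One caveat with multiplying raw probabilities: a full email can contain hundreds of words, each with a probability far below 1, so the product can underflow to 0.0 for both classes and produce a tie. A common workaround is to add log-probabilities instead. The sketch below shows the idea with a hypothetical `classify_log` helper and made-up toy parameters. It is not the notebook's function, only an illustration of the technique.

```python
import math

def classify_log(words, p_spam, p_non_spam, parameters_spam, parameters_non_spam):
    """Return a label by comparing summed log-probabilities instead of products."""
    log_spam = math.log(p_spam)
    log_non_spam = math.log(p_non_spam)
    for word in words:
        # Words outside the vocabulary are skipped, as in the notebook's version
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_non_spam:
            log_non_spam += math.log(parameters_non_spam[word])
    if log_non_spam > log_spam:
        return 'non_spam'
    if log_spam > log_non_spam:
        return 'spam'
    return 'needs human classification'

# Toy parameters, invented for illustration only
toy_spam = {'free': 0.05, 'meeting': 0.001}
toy_non_spam = {'free': 0.002, 'meeting': 0.03}

long_message = ['free'] * 400
raw_product = 0.24 * (0.05 ** 400)   # too small for a float, so it becomes 0.0
log_label = classify_log(long_message, 0.24, 0.76, toy_spam, toy_non_spam)
print(raw_product, log_label)
```

Because the logarithm is monotonic, comparing the two sums picks the same class as comparing the two products would, as long as the products have not underflowed.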

Now that there is a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [0]:
test_set['predicted'] = test_set['text'].apply(classify_test_set)
test_set.head()

Now, I'll measure the accuracy of the spam filter to find out how well it does.

In [0]:
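correct = 0
total = test_set.shape[0]

# The 'spam' column holds 1/0 while 'predicted' holds strings, so translate before comparing
label_names = {1: 'spam', 0: 'non_spam'}

for _, row in test_set.iterrows():
    if label_names[row['spam']] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)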
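
Because roughly three quarters of the messages are non-spam, accuracy alone can hide how the errors are distributed between the two classes. As an optional extra, the sketch below computes accuracy with a vectorized comparison and breaks the predictions down with `pd.crosstab`. The `toy_results` frame is made up for illustration and does not reflect the filter's actual results.

```python
import pandas as pd

# Made-up labels and predictions, for illustration only
toy_results = pd.DataFrame({
    'spam':      [1, 0, 0, 1, 0],
    'predicted': ['spam', 'non_spam', 'spam', 'non_spam', 'non_spam'],
})

# Translate the 0/1 labels into the strings the classifier returns
toy_results['actual'] = toy_results['spam'].map({1: 'spam', 0: 'non_spam'})

# Fraction of rows where the prediction matches the true label
toy_accuracy = (toy_results['actual'] == toy_results['predicted']).mean()

# Rows are true labels, columns are predicted labels
toy_confusion = pd.crosstab(toy_results['actual'], toy_results['predicted'])
print('Accuracy:', toy_accuracy)
print(toy_confusion)
```

The off-diagonal cells of the table separate spam that slipped through from legitimate mail that was wrongly flagged, which a single accuracy figure cannot show.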