In [1]:
%cd C:\\Users\\debie\\Documents\\anaconda_space

C:\Users\debie\Documents\anaconda_space


# Building a Spam Filter with Naive Bayes

In this project we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, a computer:

    Learns how humans classify messages.
    Uses that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
    Classifies the new message based on these two values: if the probability for spam is greater, the message is classified as spam; if the probability for non-spam is greater, it is classified as non-spam. If the two values are equal, we may need a human to classify the message.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans. Each message in the dataset carries one of two labels: spam (unwanted message) or ham (normal message).

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Let's look into the SMS dataset we'll use.

In [3]:
sms = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=["Label", "SMS"])

sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
sms.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
sms['Label'].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 13% of the messages are spam and 87% are ham (not spam).

To test the spam filter we are about to build, we will split the dataset into two parts:

    A training set, used to train the computer to classify the messages. It will have 4,458 messages (about 80% of the dataset).

    A test set, used to check how well the spam filter classifies messages it has not seen before. It will have 1,114 messages (about 20% of the dataset).

First we randomize the entire dataset, then we split it:

In [6]:
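# Shuffle every row; the fixed random_state makes the shuffle reproducible
sms_r = sms.sample(frac = 1, random_state = 1)

# reset_index returns a new DataFrame, so columns added or modified later
# do not act on a slice of sms_r
training = sms_r.iloc[:4458, :].reset_index(drop = True)
test = sms_r.iloc[4458:, :].reset_index(drop = True)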

print('Training dataset:', training['Label'].value_counts(normalize = True))

print('Test dataset:', test['Label'].value_counts(normalize = True))

Training dataset: ham     0.86541
spam    0.13459
Name: Label, dtype: float64
Test dataset: ham     0.868043
spam    0.131957
Name: Label, dtype: float64


In [7]:
training.shape

(4458, 2)

The two sets have approximately the same proportions of spam and ham as the full dataset.

To make it easier to calculate the probability that a message is spam, we need to transform the dataset as follows:


    - The SMS column doesn't exist anymore.
    - Instead, the SMS column is replaced by a series of new columns, where each column represents a unique word from the vocabulary.
    - Each row describes a single message. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values spam, 2, 2, 1, 1, 0, 0, 0, 0, 0. These values tell us that:
        The message is spam.
        The word "secret" occurs two times inside the message.
        The word "prize" occurs two times inside the message.
        The word "claim" occurs one time inside the message.
        The word "now" occurs one time inside the message.
        The words "coming", "to", "my", "party", and "winner" occur zero times inside the message.
    - All words in the vocabulary are in lower case, so "SECRET" and "secret" are considered the same word.
    - Punctuation is not taken into account anymore (for instance, we can't look at the table and conclude that the first message initially had three exclamation marks).
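
The transformation described in the list above can be tried on a tiny example before applying it to the real data. The sketch below is only an illustration: the `tokenize` helper and the second and third messages are hypothetical, and it uses just the standard library so it runs on its own.

```python
import re
from collections import Counter

# Hypothetical toy corpus; only the first message comes from the text above.
messages = [
    ("spam", "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!"),
    ("ham", "Coming to my secret party?"),
    ("spam", "Winner! Claim secret prize now!"),
]

def tokenize(text):
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()

# Vocabulary in order of first appearance, so the column order is reproducible.
vocab = []
for _, text in messages:
    for word in tokenize(text):
        if word not in vocab:
            vocab.append(word)

# One row per message: the label followed by one count per vocabulary word.
rows = []
for label, text in messages:
    counts = Counter(tokenize(text))
    rows.append([label] + [counts[word] for word in vocab])

print(["Label"] + vocab)
for row in rows:
    print(row)
```

The first printed row is `['spam', 2, 2, 1, 1, 0, 0, 0, 0, 0]`, matching the values listed above for "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!".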


In [8]:
training.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [9]:
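# \W matches any non-word character; regex=True is needed in pandas 2.0+,
# where str.replace treats the pattern as a literal string by default
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower()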

In [10]:
training.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [11]:
training["SMS"] = training["SMS"].str.split()
vocabulary = []
for words in training['SMS']:
    for word in words:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
    

In [12]:
len(vocabulary)

7783

The list "vocabulary" now holds every word that appears in the messages of the training set: 7,783 unique words in total.

Next, we will count how many times each word in "vocabulary" appears in each SMS and add these counts to the dataframe.

In [13]:
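# One list of zeros per vocabulary word, with one entry per message
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

# The loop variable is named words so it does not overwrite the sms DataFrame
for index, words in enumerate(training['SMS']):
    for word in words:
        word_counts_per_sms[word][index] += 1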
        
word_counts = pd.DataFrame(word_counts_per_sms)
training_words = pd.concat([training, word_counts], axis = 1)


In [14]:
word_counts.head()

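| | lobby | asleep | forgive | hunks | asp | astoundingly | edukkukayee | missing | disturb | 09053750005 | ... | gauti | double | apo | belly | macedonia | bettersn | moral | rolled | 3d | en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |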


In [15]:
training_words.head()

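| | Label | SMS | lobby | asleep | forgive | hunks | asp | astoundingly | edukkukayee | missing | ... | gauti | double | apo | belly | macedonia | bettersn | moral | rolled | 3d | en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |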


Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter.

To calculate the probability that a message is spam given the words it contains, we first need to calculate P(Spam), P(Ham), Nspam, Nham and Nvocabulary.

Nspam is the total number of words in all the spam messages.

Nham is the total number of words in all the non-spam messages.

Nvocabulary is the number of unique words in the vocabulary.

In [16]:
training_words['Label'].value_counts()

ham     3858
spam     600
Name: Label, dtype: int64

From the counts above, we can compute P(Spam) and P(Ham).

In [17]:
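# Derive the priors from the label counts instead of hard-coding 600 and 3858
label_counts = training_words['Label'].value_counts()
p_spam = label_counts['spam'] / label_counts.sum()
p_ham = 1 - p_spam
print(p_spam,p_ham)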

0.13458950201884254 0.8654104979811574


Now we'll determine Nspam, Nham and Nvocabulary:

In [18]:
n_vocabulary = len(vocabulary)

n_spam = 0
for words in training_words[training_words['Label'] == 'spam']['SMS']:
    n_spam += len(words)
    
n_ham = 0
for words in training_words[training_words['Label'] == 'ham']['SMS']:
    n_ham += len(words)
    
print('Nspam = ', n_spam)
print('Nham = ', n_ham)
print('Nvocabulary = ', n_vocabulary)

Nspam =  15190
Nham =  57237
Nvocabulary =  7783


We'll also use Laplace smoothing and set α = 1.

In [19]:
alpha = 1

Now we can calculate the parameters P(wi|spam) and P(wi|ham): the probabilities that a word wi appears in a message given that the message is spam or ham, respectively.
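
Written out with the constants defined so far, the smoothed estimates take the form below. The symbols N_{w_i|Spam} and N_{w_i|Ham} are notation assumed here for the number of times w_i occurs across all spam messages and all ham messages, respectively.

```latex
\begin{aligned}
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\[4pt]
P(w_i \mid Ham)  &= \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
```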

In [20]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

training_spam = training_words[training_words['Label'] == 'spam']
training_ham = training_words[training_words['Label'] == 'ham']

for word in vocabulary:
    parameters_spam[word] = (training_spam[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_ham[word] = (training_ham[word].sum() + alpha) / (n_ham + alpha * n_vocabulary)
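
To see what the smoothing term α contributes, it can help to evaluate the same ratio on a toy corpus, with and without smoothing. The sketch below is an illustration under stated assumptions: the `toy_` names, the two messages, and the vocabulary are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical tokenized spam messages, for illustration only.
toy_spam = [
    ["secret", "prize", "claim", "secret", "prize", "now"],
    ["winner", "claim", "secret", "prize", "now"],
]
# Hypothetical vocabulary; it also holds words that never occur in toy_spam.
toy_vocabulary = ["secret", "prize", "claim", "now", "coming",
                  "to", "my", "party", "winner"]

toy_counts = Counter(word for message in toy_spam for word in message)
toy_n_spam = sum(toy_counts.values())    # total number of words in spam: 11
toy_n_vocabulary = len(toy_vocabulary)   # number of unique words: 9

def p_word_given_spam(word, alpha):
    """Estimate P(word|Spam); alpha = 0 turns smoothing off."""
    return (toy_counts[word] + alpha) / (toy_n_spam + alpha * toy_n_vocabulary)

print(p_word_given_spam("party", alpha=0))   # 0.0, which would zero out any product
print(p_word_given_spam("party", alpha=1))   # (0 + 1) / (11 + 9) = 0.05
print(p_word_given_spam("secret", alpha=1))  # (3 + 1) / (11 + 9) = 0.2
```

With α = 1, a word that never appears in spam still gets a small non-zero probability, and the smoothed probabilities over the whole vocabulary still sum to 1.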


Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

    Takes in as input a new message (w1, w2, ..., wn)
    Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
    Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
        If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
        If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
        If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
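
Under the Naive Bayes assumption that words occur independently of each other given the class, the two quantities compared in the steps above are proportional to the following products. The shared normalizing constant is dropped because it does not affect the comparison, so the computed values are scores rather than probabilities that sum to 1.

```latex
\begin{aligned}
P(Spam \mid w_1, w_2, \ldots, w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam) \\[4pt]
P(Ham \mid w_1, w_2, \ldots, w_n)  &\propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
\end{aligned}
```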


In [21]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let's try the function on 2 messages:

In [22]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [23]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
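
The scores printed above are already around 1e-25 for messages of about ten words, so the product for a much longer message could underflow to 0.0 for both classes. One common workaround, which is not used in this notebook, is to compare sums of logarithms instead of products. The sketch below assumes a hypothetical `log_score` helper and made-up probabilities.

```python
import math

def log_score(prior, word_probabilities):
    """Return log(prior * product of word_probabilities), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in word_probabilities)

# Hypothetical numbers: a 200-word message whose words each have probability 1e-4.
probabilities = [1e-4] * 200

product = 0.5
for p in probabilities:
    product *= p
print(product)                        # 0.0: the running product underflowed

print(log_score(0.5, probabilities))  # finite, about -1842.76
```

Because the logarithm is increasing, the class with the larger log score is also the class with the larger product, so the classification rule stays the same.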


The spam filter is now complete. We'll make one small modification so that it returns the label instead of printing it.

Then we'll measure its accuracy on the test set.

In [25]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [30]:
correct = 0
total = test.shape[0]  # 1,114 messages in the test set

# iterrows yields (index, row) pairs; only the row is needed here
for _, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


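The accuracy is about 98.7%: the filter correctly classified 1,100 of the 1,114 test messages. For comparison, labeling every message as ham would be right only about 86.8% of the time on this test set, since that is the share of ham messages in it.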
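
Since ham makes up most of the data, an accuracy figure is easier to judge next to a trivial baseline that never flags anything. The sketch below shows one way to compute both numbers; the `accuracy` helper and the toy label lists are hypothetical and are not taken from the test set.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label matches the actual one."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Hypothetical labels, for illustration only.
actual = ["ham", "ham", "spam", "ham", "spam", "ham", "ham", "ham"]
predicted = ["ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]
always_ham = ["ham"] * len(actual)   # baseline that labels everything as ham

print(accuracy(actual, predicted))   # 7 of 8 correct: 0.875
print(accuracy(actual, always_ham))  # 6 of 8 correct: 0.75
```

In the notebook, the same helper could be called as `accuracy(test['Label'], test['predicted'])`, with a list of `'ham'` labels of the same length serving as the baseline.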