## Spam message filtering with Naive Bayes

Classifying messages as spam or non-spam is one of the classic ways to introduce machine learning techniques. Here we build a simple Naive Bayes classifier from scratch that decides whether a message is spam. To train the model, we use a dataset of 5,572 SMS messages that have been labeled by humans. The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [4]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [5]:
sms.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [6]:
print(sms.shape)
sms['Label'].value_counts(normalize=True)*100

(5572, 2)


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The dataset has two columns: `Label` says whether the message is spam or ham (not spam), and `SMS` holds the message text. There are 5,572 messages in total, of which about 13.4% are labeled as spam. This gives us a baseline: a model that labels every message as ham would already be right 86.6% of the time, so our Naive Bayes model is only useful if its accuracy is higher than that.

To train the model and then check how well it predicts, we split the dataset into two parts: 80% for training (4,458 messages) and 20% for testing (1,114 messages).
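
The baseline argument can be checked with a few lines of pandas. This is a minimal sketch on made-up labels (the `labels` series is a hypothetical stand-in for `sms['Label']`), showing that always predicting the majority class is correct exactly as often as that class occurs:

```python
import pandas as pd

# Toy labels standing in for sms['Label'] (hypothetical data, not the real dataset).
labels = pd.Series(['ham'] * 9 + ['spam'] * 1)

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = labels.value_counts().idxmax()
baseline_accuracy = (labels == majority_class).mean()

print(majority_class, baseline_accuracy)
```

Run on the real `sms['Label']` column, the same two lines should reproduce the 86.6% figure quoted above.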

In [7]:
# Shuffle all rows; a fixed random_state makes the split reproducible
random_sms = sms.sample(frac=1, random_state=1)

In [8]:
# First 4458 rows (80% of 5572, rounded) for training, the remaining 1114 for testing
train = random_sms[:4458].copy().reset_index(drop=True)
test = random_sms[4458:].copy().reset_index(drop=True)

In [9]:
print('train categories: \n\n', train['Label'].value_counts(normalize=True)*100)
print('\n')
print('test categories: \n\n', test['Label'].value_counts(normalize=True)*100)

train categories: 

 ham     86.54105
spam    13.45895
Name: Label, dtype: float64


test categories: 

 ham     86.804309
spam    13.195691
Name: Label, dtype: float64


As we can see above, the training and test sets contain nearly the same proportions of spam and ham messages, so the test set is representative of the data the model is trained on.

To train the model, we first need to build a vocabulary of the unique words in the training set. For this purpose, we remove punctuation and convert all words to lowercase.
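
To see what this cleaning does to a single message, here is a small sketch. The helper name `tokenize` and the sample sentence are made up for illustration; the regular expression is the same `\W` pattern used in the cells below:

```python
import re

def tokenize(text):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r'\W', ' ', text).lower().split()

tokens = tokenize("Free entry!! Text WIN to 80086, don't miss out.")
print(tokens)
# ['free', 'entry', 'text', 'win', 'to', '80086', 'don', 't', 'miss', 'out']
```

Note that `\W` also matches the apostrophe, so contractions such as "don't" are split into two tokens.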

In [10]:
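# Replace every non-word character with a space, then lowercase.
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching.
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()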

In [11]:
vocabulary = []
train['SMS']=train['SMS'].str.split()

In [12]:
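# Collect every word from every training message (duplicates are removed in the next cell).
# The loop variable is named `message` so it does not overwrite the `sms` DataFrame.
for message in train['SMS']:
    for word in message:
        vocabulary.append(word)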

In [13]:
vocabulary = list(set(vocabulary))
vocabulary[:5]

['fulfil', 'environment', 'star', 'swoop', 'como']

In [14]:
len(vocabulary)

7783

In [15]:
train.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
| 3 | ham | [havent] |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |


The training set contains 7,783 unique words.

Next, we create a dictionary that holds the count of every vocabulary word in each SMS.

In [16]:
# For each word, keep one counter per training message (initialised to 0)
word_counts_per_sms = {unique_word: [0]*len(train['SMS']) for unique_word in vocabulary}
for i, message in enumerate(train['SMS']):
    for word in message:
        word_counts_per_sms[word][i] += 1

In [17]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [18]:
word_counts.head()

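|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |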


Now we have a table with one column for each of the 7,783 words and one row per training message, holding how often each word occurs in that message.
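
Because the real table is almost entirely zeros, its structure is easier to see on a tiny example. The sketch below repeats the dictionary-building step on two made-up tokenized messages (`messages`, `vocab`, `counts`, and `toy_counts` are hypothetical names used only here):

```python
import pandas as pd

# Toy tokenized messages (hypothetical), standing in for train['SMS'].
messages = [['win', 'cash', 'now'], ['see', 'you', 'now', 'now']]
vocab = sorted({word for msg in messages for word in msg})

# One list per word, one slot per message, as in the cell above.
counts = {word: [0] * len(messages) for word in vocab}
for i, msg in enumerate(messages):
    for word in msg:
        counts[word][i] += 1

toy_counts = pd.DataFrame(counts)
print(toy_counts)
```

Each row sums to the length of the corresponding message, and the `now` column reads 1 and 2 because the word appears once in the first message and twice in the second.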

In [19]:
train_counted = pd.concat([train,word_counts], axis=1)

In [20]:
train_counted.head()

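|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |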


The dataset is now ready to use. Next, we estimate the parameters of the model from it.
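
For reference, the quantities computed in the next cells correspond to the usual multinomial Naive Bayes formulas with additive (Laplace) smoothing. The symbols are shorthand introduced here to mirror the variable names in the code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); they are not notation from the dataset or from a library:

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
$$

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in all spam messages, and $N_{\text{Vocabulary}}$ is the size of the vocabulary. The same two formulas, with Ham in place of Spam, give the ham side.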

In [59]:
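# Class priors: the share of spam and ham messages in the training set
p_spam = train['Label'].value_counts(normalize=True)['spam']
p_ham = train['Label'].value_counts(normalize=True)['ham']
# Smoothing parameter for additive (Laplace) smoothing
alpha = 1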

In [60]:
# Total number of words across all spam messages and across all ham messages
n_spam = train[train['Label']=='spam']['SMS'].apply(len).sum()
n_ham = train[train['Label']=='ham']['SMS'].apply(len).sum()
n_vocabulary = len(vocabulary)

In [61]:
p_w_ham = {unique_words: 0 for unique_words in vocabulary}
p_w_spam = {unique_words: 0 for unique_words in vocabulary}

In [62]:
spam_train = train_counted[train_counted['Label']=='spam']
ham_train =  train_counted[train_counted['Label']=='ham']

In [63]:
for word in vocabulary:
    n_word_for_spam = spam_train[word].sum() # count of one word in the spam dataset
    p_word_given_spam = (n_word_for_spam + alpha)/(n_spam + alpha * n_vocabulary) # calculate the probability
    p_w_spam[word] = p_word_given_spam # assign this word's probability to its key
    
    n_word_for_ham = ham_train[word].sum()
    p_word_given_ham = (n_word_for_ham + alpha)/(n_ham + alpha * n_vocabulary)
    p_w_ham[word] = p_word_given_ham

We have now computed, for every word in the vocabulary, its conditional probability given each class (ham or spam). With these parameters we can classify new messages.
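
The role of `alpha` is easiest to see with a word that never occurs in one of the classes. The sketch below uses made-up counts and a hypothetical helper, `smoothed_probability`, that applies the same expression as the loop above:

```python
def smoothed_probability(word_count, n_class_words, n_vocabulary, alpha=1):
    """Additive smoothing: (count + alpha) / (class words + alpha * vocabulary size)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocabulary)

# Hypothetical numbers: a word never seen in spam, 10 spam words in total, 5 vocabulary words.
unsmoothed = smoothed_probability(0, 10, 5, alpha=0)
smoothed = smoothed_probability(0, 10, 5, alpha=1)
print(unsmoothed, smoothed)
```

Without smoothing the probability is exactly 0, which would force the whole product for that class to 0 no matter what the other words say. With `alpha = 1` the word gets a small but non-zero probability (1/15 in this toy case).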

In [67]:
import re

def classify(message):

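    # Apply the same cleaning as for the training data: strip punctuation, lowercase, split
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors and multiply in P(word|class) for each known word.
    # The results are proportional to P(class|message), which is enough to compare them.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
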
    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]
        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]
            
    print('P(Spam|message):', p_spam_given_message ) 
    print('P(Ham|message):', p_ham_given_message )
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [68]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [69]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
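
The scores printed above are already around 1e-25 for messages of about ten words. For much longer texts, a product of many small factors can underflow to 0.0 in floating point. A common workaround, shown here only as a sketch and not as part of the model above, is to add logarithms instead of multiplying probabilities. The function `log_score` and the toy parameter dictionaries are hypothetical names and values chosen for illustration:

```python
import math

def log_score(tokens, prior, word_probs):
    """Sum of log-probabilities; ranks classes the same way as the product."""
    score = math.log(prior)
    for word in tokens:
        if word in word_probs:  # skip words outside the vocabulary, as classify() does
            score += math.log(word_probs[word])
    return score

# Hypothetical parameters for illustration only.
toy_p_w_spam = {'win': 0.05, 'cash': 0.04, 'hello': 0.001}
toy_p_w_ham = {'win': 0.002, 'cash': 0.003, 'hello': 0.03}

tokens = ['win', 'cash', 'unknownword']
spam_score = log_score(tokens, 0.13, toy_p_w_spam)
ham_score = log_score(tokens, 0.87, toy_p_w_ham)
label = 'spam' if spam_score > ham_score else 'ham'
print(label)
```

Because the logarithm is monotonic, comparing the two sums gives the same label as comparing the two products. SMS messages are short, so the multiplication used in this notebook is unlikely to underflow in practice.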


Our Naive Bayes model is working! We can use the test set to check for accuracy.

In [81]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]

        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [82]:
test['predicted'] = test['SMS'].apply(classify_test_set)

In [85]:
test.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [84]:
# Count the rows where the prediction matches the true label
correct = (test['Label'] == test['predicted']).sum()
total = len(test)
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
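
A single accuracy figure does not show which kind of mistake the model makes (ham flagged as spam, or spam let through). One way to break the errors down is a cross-tabulation of true and predicted labels. The sketch below uses a small made-up `results` frame as a stand-in for `test`; the numbers are not from the real test set:

```python
import pandas as pd

# Toy results (hypothetical), standing in for the test DataFrame above.
results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Rows are true labels, columns are predictions.
confusion = pd.crosstab(results['Label'], results['predicted'])
accuracy = (results['Label'] == results['predicted']).mean()
print(confusion)
print('Accuracy:', accuracy)
```

Applied to `test['Label']` and `test['predicted']`, the same call would also give any 'needs human classification' results their own column.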


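Our custom-built model reaches about 98.7% accuracy on the test set (1,100 of 1,114 messages classified correctly), well above the roughly 87% we would get by labeling every message as ham. On this dataset, even a simple Naive Bayes classifier is very effective at spam filtering.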