Spam Filter Based on Multinomial Naive Bayes

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail on this page, where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.

In [1]:
import pandas as pd
import numpy as np

In [2]:
sms = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])

sms.shape

(5572, 2)

In [3]:
sms.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Above, we read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam.

Of the 5,572 messages, we'll use 80% for training and the remaining 20% for testing. The rows are shuffled first so that both sets are random samples of the whole dataset.

In [5]:
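# Shuffle the rows, then use the first 80% for training and the rest for testing.
randomized = sms.sample(frac=1, random_state=1)
training_size = round(len(randomized) * 0.8)
# .copy() makes independent DataFrames, so later column assignments
# do not trigger SettingWithCopyWarning.
training_set = randomized[:training_size].copy()
testing_set = randomized[training_size:].copy()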

print(training_set['Label'].value_counts(normalize=True))
print(testing_set['Label'].value_counts(normalize=True))


ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The percentages of spam and ham are nearly the same in the training and testing sets (about 87% ham and 13% spam), which matches the full dataset.

The next step is data cleaning.
To calculate all the probabilities required by the algorithm, we first need to bring the data into a format that makes it easy to extract the information we need: lowercase words with punctuation removed, and then one count column per unique word.
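
To make the target format concrete, here is a minimal sketch on two made-up messages. The toy data and the names `toy`, `toy_vocab`, and `toy_counts` are illustrative assumptions, not part of the dataset or the cells that follow. It lowercases the text, replaces non-word characters with spaces, splits each message into words, and builds one count column per vocabulary word.

```python
import re
import pandas as pd

# Two made-up messages, for illustration only.
toy = pd.DataFrame({
    'Label': ['spam', 'ham'],
    'SMS': ['WIN money, win NOW!', 'See you at home.'],
})

# Replace non-word characters with spaces, lowercase, split into words.
toy['SMS'] = toy['SMS'].apply(lambda s: re.sub(r'\W', ' ', s).lower().split())

# Sorted list of the unique words across all messages.
toy_vocab = sorted({word for words in toy['SMS'] for word in words})

# One column per word, one row per message.
toy_counts = pd.DataFrame(
    [[words.count(word) for word in toy_vocab] for words in toy['SMS']],
    columns=toy_vocab,
)
print(toy_counts)
```

Each row of `toy_counts` describes one message, and each column holds how often one word appears in it. The real training set is brought into the same shape in the cells below.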

In [6]:
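# Replace non-word characters with spaces, then lowercase.
# regex=True is required in recent pandas versions, where str.replace is literal by default.
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
testing_set['SMS'] = testing_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()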



In [7]:
training_set.head()

Unnamed: 0,Label,SMS
1078,ham,yep by the pretty sculpture
4028,ham,yes princess are you going to make me moan
958,ham,welp apparently he retired
4642,ham,havent
4674,ham,i forgot 2 ask ü all smth there s a card on ...


In [8]:
# Split each message on whitespace into a list of words.
training_set['SMS'] = training_set['SMS'].str.split()

# Collect every word, then deduplicate to get the vocabulary.
vocabulary = []
for message in training_set['SMS']:
    for word in message:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))



In [9]:
# Create a dictionary keyed by the unique words; each value is a list
# holding one count per message, initialised to 0.
word_counts_per_sms = {word: [0] * len(training_set['SMS']) for word in vocabulary}

for index, message in enumerate(training_set['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

word_count = pd.DataFrame(word_counts_per_sms)
word_count.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [10]:
# training_set still carries the shuffled index, while word_count is indexed 0..n-1.
# Reset the index first so that each message lines up with its own word counts.
combined = pd.concat([training_set.reset_index(drop=True), word_count], axis=1)

In [11]:
combined.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [None]:
# Now calculate the prior probabilities and the per-class word totals.
msg_spam = combined[combined['Label'] == 'spam']
msg_non_spam = combined[combined['Label'] == 'ham']

p_spam = len(msg_spam) / len(combined)
p_non_spam = len(msg_non_spam) / len(combined)

# Total number of words in all spam messages and in all ham messages.
n_spam = msg_spam['SMS'].apply(len).sum()
n_non_spam = msg_non_spam['SMS'].apply(len).sum()

n_vocabulary = len(vocabulary)
alpha = 1  # Laplace smoothing parameter



In [None]:
# Placeholders for P(word|Spam) and P(word|Ham): one number per vocabulary word.
p_word_spam = {word: 0 for word in vocabulary}
p_word_ham = {word: 0 for word in vocabulary}

In [None]:
msg_spam = combined[combined['Label'] == 'spam']
msg_non_spam = combined[combined['Label'] == 'ham']

for word in vocabulary:
    p_w_spam = (msg_spam[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    p_w_ham = (msg_non_spam[word].sum() + alpha) / (n_non_spam + alpha * n_vocabulary)
    p_word_spam[word] = p_w_spam
    p_word_ham[word] = p_w_ham
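
The loop above applies additive (Laplace) smoothing: `alpha` is added to each word count and `alpha * n_vocabulary` to the denominator, so a word that never appears in one class still gets a small nonzero probability instead of zeroing out the whole product. A minimal sketch with made-up counts follows; the function name `smoothed_likelihood` and all numbers are hypothetical and only illustrate the arithmetic.

```python
def smoothed_likelihood(word_count_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """Return P(word | class) with additive smoothing."""
    return (word_count_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Hypothetical class containing 7 words in total, with a 9-word vocabulary.
p_seen = smoothed_likelihood(2, 7, 9)    # (2 + 1) / (7 + 9) = 3/16
p_unseen = smoothed_likelihood(0, 7, 9)  # (0 + 1) / (7 + 9) = 1/16

print(p_seen, p_unseen)
```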


In [None]:
import re

def classify(message):
    """Print the spam and ham scores for a message and the resulting label."""
    # Clean the message the same way as the training data.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors.
    p_spam_given_message = p_spam
    p_ham_given_message = p_non_spam

    # Multiply in P(word|class) for every known word; unknown words are skipped.
    for word in message:
        if word in p_word_spam:
            p_spam_given_message *= p_word_spam[word]
        if word in p_word_ham:
            p_ham_given_message *= p_word_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [None]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

In [None]:
classify("Sounds good, Tom, then see u there")
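
Multiplying many probabilities below 1 can underflow to 0.0 for long messages, at which point both scores tie. A common workaround is to add logarithms instead of multiplying. The sketch below shows one way to do that; `classify_log` and its parameters are hypothetical names, and the tiny probability tables in the usage lines are made up for illustration. It assumes every stored probability is greater than zero, which smoothing with `alpha = 1` guarantees.

```python
import math
import re

def classify_log(message, p_spam, p_ham, p_word_spam, p_word_ham):
    """Compare log-scores instead of raw probability products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words absent from the vocabulary are skipped, as in classify().
        if word in p_word_spam:
            log_spam += math.log(p_word_spam[word])
        if word in p_word_ham:
            log_ham += math.log(p_word_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Made-up probabilities for two words, for illustration only.
toy_spam = {'win': 0.3, 'home': 0.05}
toy_ham = {'win': 0.02, 'home': 0.2}
print(classify_log('Win win WIN!', 0.5, 0.5, toy_spam, toy_ham))
print(classify_log('home', 0.5, 0.5, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing log-scores picks the same label as comparing the products whenever the products do not underflow.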

In [1]:
def classify_test_set(message):
    """Return 'spam' or 'ham' for a message instead of printing the result."""
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors.
    p_spam_given_message = p_spam
    p_ham_given_message = p_non_spam

    # Multiply in P(word|class) for every known word; unknown words are skipped.
    for word in message:
        if word in p_word_spam:
            p_spam_given_message *= p_word_spam[word]
        if word in p_word_ham:
            p_ham_given_message *= p_word_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [None]:
# Series.apply calls the function once per message; it takes no axis argument.
testing_set['predicted'] = testing_set['SMS'].apply(classify_test_set)

In [None]:
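# Count how many predictions match the human-assigned labels.
correct = 0
total = len(testing_set)
for _, row in testing_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

accuracy = correct / total
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', accuracy)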
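
Accuracy alone can hide how the errors are distributed, since about 87% of the messages are ham. A confusion table shows which class the mistakes fall in. The sketch below uses `pd.crosstab` on a small made-up frame (the `toy_results` data is hypothetical and not taken from the test set); the same call could be applied to the `Label` and `predicted` columns of `testing_set`.

```python
import pandas as pd

# Made-up labels and predictions, for illustration only.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)

# Vectorised accuracy: the share of rows where the prediction matches the label.
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print('Accuracy:', toy_accuracy)
```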