## Building a spam filter with multinomial Naive Bayes

In this project I'm going to build a Naive Bayes classifier that labels SMS messages as spam or non-spam.

I use the SMS Spam Collection dataset from the UCI Machine Learning Repository. The dataset contains 5572 messages and can be downloaded [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

__The Algorithm__
Naive Bayes starts from prior probabilities estimated on the training set. When the training set is small, these estimates can be unreliable. After calculating the priors, I calculate how likely a message is to be spam or non-spam. In this example, the likelihoods come from counting how often each word occurs in spam messages and in non-spam messages. For classification the two probabilities are compared and the larger one determines the final label.
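Written out, the comparison described above takes roughly the following form under the usual naive independence assumption. The notation $w_1, \dots, w_n$ for the words of a message is chosen here for illustration only:

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{aligned}
```

The message gets the label whose right-hand side is larger. Both sides omit the same normalizing constant, so the values are scores to compare rather than true probabilities.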

__Load and inspect the dataset__

In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names=['Label', 'SMS'])
sms.shape

(5572, 2)

In [2]:
sms.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
sms['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [4]:
sum(sms['Label']== 'spam')/len(sms['Label'])

0.13406317300789664

13.4% of all messages are labeled 'spam'.

# Splitting the dataset into train and test data 80/20

First I shuffle the dataset so that the split does not depend on the original ordering. Then I split the data and check that both parts contain approximately the same share of spam messages.

In [5]:
data_randomized = sms.sample(frac=1, random_state=1)

training_test_index = round(len(data_randomized) * 0.8)

training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [6]:
training_set['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [7]:
test_set['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

# Cleaning the datasets

First I replace non-word characters with spaces, then I make all words lowercase. The downside of removing non-word characters is that I could miss some spam messages. I then convert every message into a list of words. Afterwards I count all words and find their probabilities of appearing in a spam and a non-spam message.
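As a small illustration of these cleaning steps on a single string, here is a minimal sketch using only the standard library. The helper name `tokenize` is hypothetical and not part of the notebook:

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()

print(tokenize("WINNER!! This is the secret code: C3421."))
# ['winner', 'this', 'is', 'the', 'secret', 'code', 'c3421']
```

Note that the exclamation marks and capitals, which may themselves hint at spam, are gone after this step.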

In [8]:
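# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)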
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

|  | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [9]:
training_set['SMS']= training_set["SMS"].str.split()

In [10]:
vocabulary = []
for message in training_set['SMS']:  # `message` avoids overwriting the `sms` DataFrame
    for word in message:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
    

In [11]:
len(vocabulary)

7783

In [12]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, message in enumerate(training_set['SMS']):  # `message` is one list of words
    for word in message:
        word_counts_per_sms[word][index] += 1
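To make the shape of this dictionary easier to see, here is the same counting idea on two made-up, already tokenized messages. All `toy_` names are hypothetical and exist only for this sketch:

```python
# Toy messages (made up for illustration), already split into words.
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'there']]

toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of per-message counts for every vocabulary word.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts)
# {'free': [2, 0], 'prize': [1, 0], 'see': [0, 1], 'there': [0, 1], 'you': [0, 1]}
```

Each key becomes a column and each list position a row once the dictionary is passed to `pd.DataFrame`.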

In [13]:
words = pd.DataFrame(word_counts_per_sms)

In [14]:
words.head()

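|  | 07801543489 | finance | ipads | tue | ticket | 2price | dinner | instructions | ericson | borderline | ... | 087104711148 | morphine | aeronautics | gail | unhappiness | dinero | kicks | mornin | benefits | land |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |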


In [15]:
training_set_clean = pd.concat([training_set, words], axis = 1)
training_set_clean.head()

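|  | Label | SMS | 07801543489 | finance | ipads | tue | ticket | 2price | dinner | instructions | ... | 087104711148 | morphine | aeronautics | gail | unhappiness | dinero | kicks | mornin | benefits | land |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |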


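## Training set complete: building the filter

To calculate the word probabilities I use additive (Laplace) smoothing: `alpha` is added to each word count in the numerator, and `alpha` times the vocabulary size is added to the denominator. This prevents a word that never occurs in one class from getting a probability of zero, which would wipe out the whole product. I also calculate the prior probabilities of a message being spam or not.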
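The effect of the smoothing term can be checked on small numbers. The sketch below uses made-up counts and a hypothetical helper `smoothed_probability`. It uses `Fraction` only to keep the arithmetic exact and readable:

```python
from fractions import Fraction

def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """P(word | class) with additive smoothing."""
    return Fraction(word_count + alpha, total_words + alpha * vocabulary_size)

# Made-up numbers: a class with 10 words in total and a vocabulary of 5 words.
p_unseen = smoothed_probability(0, 10, 5)  # word never seen in this class
p_seen = smoothed_probability(3, 10, 5)    # word seen 3 times in this class

print(p_unseen, p_seen)
# 1/15 4/15
```

Without the `alpha` terms the unseen word would get probability 0, and any message containing it would score 0 for that class.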

In [16]:
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)

alpha = 1

Create dictionaries with the smoothed probability of each word for spam and non-spam.

In [17]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}


for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

In [18]:
import re

def classify(message):
    '''
    message: a string
    '''    
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    # A label is only assigned when one probability exceeds the other by a factor of 1.2.
    if p_ham_given_message > p_spam_given_message*1.2:
        print('Label: Ham')
    elif p_ham_given_message*1.2 < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [19]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [20]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
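The values printed above are already around 1e-25 for short messages. For long messages, a product of many small probabilities can underflow to 0.0. A common workaround is to add logarithms instead of multiplying. The following is a minimal sketch of that idea with made-up toy parameters, not the notebook's actual implementation. All `toy_` names and both helper functions are hypothetical:

```python
import math

# Made-up toy parameters, standing in for p_spam, p_ham,
# parameters_spam and parameters_ham from the cells above.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_parameters_spam = {'free': 0.05, 'prize': 0.04, 'see': 0.001}
toy_parameters_ham = {'free': 0.002, 'prize': 0.001, 'see': 0.03}

def log_score(words, prior, parameters):
    """Log of the prior plus the log probability of every known word."""
    score = math.log(prior)
    for word in words:
        if word in parameters:  # unknown words are skipped, as in classify()
            score += math.log(parameters[word])
    return score

def classify_log(words):
    """Return 'spam' or 'ham' by comparing log scores (ties go to 'ham')."""
    spam_score = log_score(words, toy_p_spam, toy_parameters_spam)
    ham_score = log_score(words, toy_p_ham, toy_parameters_ham)
    return 'spam' if spam_score > ham_score else 'ham'

print(classify_log(['free', 'prize']))  # spam
print(classify_log(['see']))            # ham
print(0.05 ** 1000)                     # 0.0, the plain product underflows
print(classify_log(['free'] * 1000))    # spam, the log scores stay finite
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products do not underflow.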


__Now adjust the old function to take in new messages and return just the label__

In [21]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]        
       
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [22]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

|  | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [23]:
sum(test_set['Label']== test_set['predicted'])

1100

Calculate the fraction of messages that were predicted correctly.

In [24]:
print("Accuracy:" , sum(test_set['Label']== test_set['predicted'])/len(test_set))

Accuracy: 0.9874326750448833
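Because only about 13% of the messages are spam, accuracy alone does not show how the errors are split between the two classes. One possible complement is precision and recall for the spam label. The sketch below uses a hypothetical helper `precision_recall` and made-up labels, not results from the test set:

```python
def precision_recall(actual, predicted, positive='spam'):
    """Precision and recall for the `positive` label, from two equal-length lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Made-up labels for illustration only.
toy_actual = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']
toy_predicted = ['spam', 'ham', 'ham', 'spam', 'ham', 'spam']

print(precision_recall(toy_actual, toy_predicted))
```

In this notebook it could be called as `precision_recall(list(test_set['Label']), list(test_set['predicted']))`. Any prediction other than 'spam', including 'needs human classification', counts as not spam in this sketch.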


In [25]:
bad_label  = test_set[test_set['Label'] != test_set['predicted']]

In [26]:
bad_label

|  | Label | SMS | predicted |
|---|---|---|---|
| 114 | spam | Not heard from U4 a while. Call me now am here... | ham |
| 135 | spam | More people are dogging in your area now. Call... | ham |
| 152 | ham | Unlimited texts. Limited minutes. | spam |
| 159 | ham | 26th OF JULY | spam |
| 284 | ham | Nokia phone is lovly.. | spam |
| 293 | ham | A Boy loved a gal. He propsd bt she didnt mind... | needs human classification |
| 302 | ham | No calls..messages..missed calls | spam |
| 319 | ham | We have sent JD for Customer Service cum Accou... | spam |
| 504 | spam | Oh my god! I've found your number again! I'm s... | ham |
| 546 | spam | Hi babe its Chloe, how r u? I was smashed on s... | ham |

To improve the spam filter I could make it case sensitive, or keep both a case-sensitive and a case-insensitive version of each word. Keeping some non-word characters like !! and ? might also be an option. A regex could also help capture words written in CAPS.
Implementing a threshold to tag messages for human classification is also an option.
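As a rough sketch of the first two ideas, a tokenizer could keep runs of `!` and `?` as their own tokens and add a marker after words written entirely in capitals. The function name `tokenize_with_signals` and the `<ALLCAPS>` marker are assumptions made for this example. Whether these extra tokens actually improve accuracy would need to be tested:

```python
import re

def tokenize_with_signals(message):
    """Lowercased words, plus punctuation runs and a marker for all-caps words."""
    tokens = []
    # Match either a word or a run of '!' and '?' characters.
    for piece in re.findall(r"\w+|[!?]+", message):
        if piece[0] in '!?':
            tokens.append(piece)
        else:
            tokens.append(piece.lower())
            # Flag purely alphabetic words of two or more letters in capitals.
            if len(piece) > 1 and piece.isalpha() and piece.isupper():
                tokens.append('<ALLCAPS>')
    return tokens

print(tokenize_with_signals('WINNER!! Call now?'))
# ['winner', '<ALLCAPS>', '!!', 'call', 'now', '?']
```

The vocabulary and the probability dictionaries would have to be rebuilt with the same tokenizer for these tokens to have any effect.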