## Building a Spam Filter with the Multinomial Naive Bayes Algorithm

The goal is to build a spam filter for SMS messages. The filter learns from messages that humans have already labeled as spam or ham (non-spam) and uses the word frequencies in each class to estimate the probability that a new message is spam.

The dataset being used is sourced from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). 

In [1]:
import pandas as pd
mc = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS']) #mc: message collection
mc.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
mc.shape

(5572, 2)

In [3]:
mc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
mc.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
mc['Label'].value_counts(normalize = True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [6]:
mcr = mc.sample(frac = 1, random_state = 1)
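# Replace every non-word character with a space, then lowercase.
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default.
mcr['SMS'] = mcr['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()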
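To see what the `\W` substitution does to a single message, here is a minimal sketch that uses Python's `re` module as a stand-in for the pandas string methods. The sample text and the names `raw` and `cleaned` are made up for illustration.

```python
import re

# Hypothetical message, not taken from the dataset
raw = "WINNER!! Call 0800-123 now."

# Every character that is not a letter, digit or underscore becomes a space
cleaned = re.sub(r'\W', ' ', raw).lower()

# split() collapses the runs of spaces left behind by the punctuation
print(cleaned.split())
```

This prints `['winner', 'call', '0800', '123', 'now']`. Digits survive the cleaning, so phone numbers and prize amounts remain available as features.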

In [7]:
# 80% of 5572 rows is 4457.6; rounding gives 4458 training rows
eighty = round(0.80*mcr.shape[0])

In [8]:
mcr_training = mcr[:eighty].reset_index()
mcr_test = mcr[eighty:].reset_index()

In [9]:
mcr_training.head()

Unnamed: 0,index,Label,SMS
0,1078,ham,yep by the pretty sculpture
1,4028,ham,yes princess are you going to make me moan
2,958,ham,welp apparently he retired
3,4642,ham,havent
4,4674,ham,i forgot 2 ask ü all smth there s a card on ...


In [10]:
mcr_test.head()

Unnamed: 0,index,Label,SMS
0,2131,ham,later i guess i needa do mcat study too
1,3418,ham,but i haf enuff space got like 4 mb
2,3424,spam,had your mobile 10 mths update to latest oran...
3,1538,ham,all sounds good fingers makes it difficult ...
4,5393,ham,all done all handed in don t know if mega sh...


In [11]:
mcr_training.shape

(4458, 3)

In [12]:
mcr_test.shape

(1114, 3)

In [13]:
mcr_training['Label'].value_counts(normalize = True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [14]:
mcr_test['Label'].value_counts(normalize = True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

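The training and test sets have similar proportions of spam and ham (about 13% spam in each), so the random split preserved the class balance of the full dataset.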

In [15]:
# Tokenize each training message into a list of words
mcr_training['SMS'] = mcr_training['SMS'].str.split()
mcr_training.head()

Unnamed: 0,index,Label,SMS
0,1078,ham,"[yep, by, the, pretty, sculpture]"
1,4028,ham,"[yes, princess, are, you, going, to, make, me,..."
2,958,ham,"[welp, apparently, he, retired]"
3,4642,ham,[havent]
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [16]:
vocabulary = []
for e_sms in mcr_training['SMS']:
    for e_w in e_sms:
        vocabulary.append(e_w)
        
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

In [17]:
vocabulary[:20] # just checking
len(vocabulary)

7783

In [18]:
# Map each unique word to a list of zeros, one slot per training message
word_counts_per_sms = {unique_word: [0] * len(mcr_training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(mcr_training['SMS']):
    # index, sms loops through each row in mcr_training
    for word in sms:
        # each word in sms is looped through and dictionary is updated
        word_counts_per_sms[word][index] += 1

In [19]:
mcr_training['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

In [20]:
word_counts_dataframe = pd.DataFrame(word_counts_per_sms)

In [21]:
word_counts_dataframe.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [22]:
word_counts_dataframe.shape

(4458, 7783)
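The dense table above has 4458 × 7783 cells (about 34.7 million), nearly all of them zero. If only the per-class word totals are needed, a lighter alternative is to accumulate them directly. The following is a minimal sketch of that idea, not the approach this notebook uses; the toy data and the names `toy_training` and `class_word_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs, made up for illustration
toy_training = [
    ('spam', ['free', 'prize', 'free']),
    ('ham', ['see', 'you', 'soon']),
    ('spam', ['free', 'soon']),
]

# One Counter per class stores only the words that actually occur
class_word_counts = {'spam': Counter(), 'ham': Counter()}
for label, tokens in toy_training:
    class_word_counts[label].update(tokens)

# Occurrences of 'free' across all spam messages
print(class_word_counts['spam']['free'])
# A Counter returns 0 for words it has never seen
print(class_word_counts['ham']['free'])
```

The trade-off is that the per-message counts are lost, which matters only if a later step needs them.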

In [23]:
mcr_training_expanded = pd.concat([mcr_training, word_counts_dataframe], axis = 1)

In [24]:
mcr_training_expanded.head()

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [25]:
# Index by label rather than by position so the result does not depend on row order
p_spam = mcr_training_expanded['Label'].value_counts(normalize = True)['spam']
p_ham = mcr_training_expanded['Label'].value_counts(normalize = True)['ham']

In [26]:
p_spam

0.13458950201884254

In [27]:
p_ham

0.8654104979811574

In [28]:
n_spam = mcr_training_expanded[mcr_training_expanded['Label'] == 'spam']['SMS'].apply(len).sum()

In [29]:
n_spam

15190

In [30]:
n_ham = mcr_training_expanded[mcr_training_expanded['Label'] == 'ham']['SMS'].apply(len).sum()

In [31]:
n_ham

57237

In [32]:
alpha = 1

In [33]:
n_vocabulary = len(vocabulary)

In [34]:
n_vocabulary

7783

In [35]:
spam_dict = {unique_word: 0 for unique_word in vocabulary}
ham_dict = {unique_word: 0 for unique_word in vocabulary}

In [36]:
spam_messages = mcr_training_expanded[mcr_training_expanded['Label'] == 'spam']
ham_messages = mcr_training_expanded[mcr_training_expanded['Label'] == 'ham']

In [37]:
for uw in vocabulary:
    spam_dict[uw] = (spam_messages[uw].sum() + alpha)/(n_spam + alpha*n_vocabulary)
    ham_dict[uw] = (ham_messages[uw].sum() + alpha)/(n_ham + alpha*n_vocabulary)
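Written out, the loop above computes additive (Laplace) smoothed estimates. The symbols are introduced here for illustration: $N_{w \mid \text{Spam}}$ is the number of times word $w$ occurs in spam messages (`spam_messages[uw].sum()`), $N_{\text{Spam}}$ is `n_spam`, $N_{\text{Ham}}$ is `n_ham`, $N_{\text{Vocabulary}}$ is `n_vocabulary`, and $\alpha$ is `alpha`.

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

With the values computed earlier ($\alpha = 1$, $N_{\text{Spam}} = 15190$, $N_{\text{Ham}} = 57237$, $N_{\text{Vocabulary}} = 7783$), the denominators are $22973$ and $65020$. A vocabulary word that never appears in spam therefore gets $1/22973 \approx 4.35 \times 10^{-5}$ instead of zero, so a single unseen word cannot zero out the product used for classification:

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$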

In [38]:
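import re

def classify(message):
    """Label a raw SMS string as 'spam' or 'ham' using the trained parameters."""
    # Clean and tokenize the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors, then multiply in each word's conditional probability
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for each_word in message:
        # Words outside the training vocabulary are ignored
        if each_word in spam_dict:
            p_spam_given_message *= spam_dict[each_word]
        if each_word in ham_dict:
            p_ham_given_message *= ham_dict[each_word]

    # Pick the class with the larger (unnormalized) posterior
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'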
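Multiplying many small probabilities can underflow to `0.0` for long messages, at which point both classes tie. A common workaround is to add logarithms instead of multiplying. The sketch below shows that variant under stated assumptions: `classify_log` and the `toy_*` parameters are hypothetical stand-ins for `classify`, `p_spam`, `p_ham`, `spam_dict` and `ham_dict`, with made-up values.

```python
import math
import re

# Hypothetical parameters, made up for illustration
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_dict = {'free': 0.05, 'prize': 0.04, 'see': 0.001}
toy_ham_dict = {'free': 0.002, 'prize': 0.001, 'see': 0.03}

def classify_log(message):
    """Same decision rule as classify, but summing logs instead of multiplying."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify
        if word in toy_spam_dict:
            log_spam += math.log(toy_spam_dict[word])
        if word in toy_ham_dict:
            log_ham += math.log(toy_ham_dict[word])
    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log('FREE prize!!'))
print(classify_log('See you'))
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products whenever the products do not underflow.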

In [39]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [40]:
classify("Sounds good, Tom, then see u there")

'ham'

In [41]:
mcr_test['predicted'] = mcr_test['SMS'].apply(classify)
mcr_test.head()

Unnamed: 0,index,Label,SMS,predicted
0,2131,ham,later i guess i needa do mcat study too,ham
1,3418,ham,but i haf enuff space got like 4 mb,ham
2,3424,spam,had your mobile 10 mths update to latest oran...,spam
3,1538,ham,all sounds good fingers makes it difficult ...,ham
4,5393,ham,all done all handed in don t know if mega sh...,ham


In [42]:
# Count the rows where the prediction matches the true label
correct = int((mcr_test['Label'] == mcr_test['predicted']).sum())
total = len(mcr_test)
accuracy = correct/total

In [43]:
accuracy 

0.9874326750448833

The filter classifies about 98.7% of the test messages correctly (1,100 of 1,114).
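Since roughly 87% of the messages are ham, a filter that always answered 'ham' would already reach about 87% accuracy, so per-class counts can be more informative than accuracy alone. The following is a minimal sketch of computing precision and recall for the spam class; the label lists are made up for illustration and are not results from this notebook.

```python
from collections import Counter

# Hypothetical true and predicted labels, made up for illustration
toy_true = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
toy_pred = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Count each (true label, predicted label) pair
confusion = Counter(zip(toy_true, toy_pred))

true_positive = confusion[('spam', 'spam')]    # spam caught
false_positive = confusion[('ham', 'spam')]    # ham wrongly flagged
false_negative = confusion[('spam', 'ham')]    # spam missed

# Precision: share of flagged messages that really are spam
precision = true_positive / (true_positive + false_positive)
# Recall: share of spam messages that were flagged
recall = true_positive / (true_positive + false_negative)
print(precision, recall)
```

The same counts could be taken from `mcr_test['Label']` and `mcr_test['predicted']` by passing those two columns to `zip`.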