# Building a spam SMS filter using the multinomial Naive Bayes algorithm

To classify messages as spam or non-spam, the computer:
- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that humans have already classified.
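For reference, the decision rule behind the algorithm can be written out as below. The notation is an assumption chosen to line up with the variable names used later in this notebook (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); the same expressions apply to ham with Spam replaced by Ham.

$$P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

$$P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

Here $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, $N_{\text{Vocabulary}}$ is the number of unique words in the training set, and $\alpha$ is the smoothing constant.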

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
sms.shape

(5572, 2)

In [3]:
100 * sms['Label'].value_counts(normalize=True)

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [4]:
sms = sms.sample(frac=1, random_state=1)

In [5]:
training = sms[:4458].reset_index(drop=True)
test = sms[-1114:].reset_index(drop=True)
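
The hard-coded sizes 4458 and 1114 appear to correspond to an 80/20 split of the 5,572 rows. That ratio is an inference and is not stated in the notebook. A minimal sketch of the arithmetic, using a hypothetical helper `split_sizes`:

```python
def split_sizes(n_rows, train_frac=0.8):
    """Return (n_train, n_test) for a given row count and training fraction."""
    # train_frac=0.8 is an assumption inferred from the sizes used above.
    n_train = round(n_rows * train_frac)
    return n_train, n_rows - n_train

print(split_sizes(5572))  # (4458, 1114)
```

Deriving the sizes from the row count avoids having to edit two magic numbers if the dataset changes.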

In [6]:
100 * training['Label'].value_counts(normalize=True)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [7]:
100 * test['Label'].value_counts(normalize=True)

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [8]:
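# regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default.
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()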

In [9]:
training['SMS'].head()

0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object
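
To see what the cleaning step does to a single message, here is a minimal sketch using Python's standard `re` module instead of pandas. The helper name `clean_and_tokenize` is hypothetical and not part of the notebook. It also splits the text into words, which the notebook does separately in a later cell.

```python
import re

def clean_and_tokenize(message):
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    # \W matches any character that is not a letter, digit, or underscore.
    no_punctuation = re.sub(r'\W', ' ', message)
    return no_punctuation.lower().split()

print(clean_and_tokenize('Yep, by the pretty sculpture'))
# ['yep', 'by', 'the', 'pretty', 'sculpture']
```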

In [10]:
vocabulary = []
training['SMS'] = training['SMS'].str.split()
# Use `message` as the loop variable so the `sms` DataFrame is not overwritten.
for message in training['SMS']:
    for word in message:
        vocabulary.append(word)

In [11]:
print(vocabulary[:5])

['yep', 'by', 'the', 'pretty', 'sculpture']


In [12]:
vocabulary_set = set(vocabulary)
vocabulary = list(vocabulary_set)

In [13]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

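# Count how often each word occurs in each message; `message` avoids shadowing the `sms` DataFrame.
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1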

In [14]:
word_per_sms = pd.DataFrame(word_counts_per_sms)
word_per_sms.head()

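|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |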
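
The dense table above stores one column per vocabulary word, so almost every entry is zero. As an alternative sketch (not what this notebook does), the same counts can be kept sparsely with `collections.Counter`. The two toy messages and all `toy_` names below are made up for illustration.

```python
from collections import Counter

# Toy tokenized messages (made up for illustration, not from the dataset).
toy_messages = [
    ['secret', 'prize', 'claim', 'secret'],
    ['see', 'you', 'there'],
]

# One Counter per message: only words that actually occur are stored.
toy_counts_per_message = [Counter(tokens) for tokens in toy_messages]

# Corpus-wide totals are the sum of the per-message counters.
toy_total_counts = sum(toy_counts_per_message, Counter())

print(toy_counts_per_message[0]['secret'])  # 2
print(toy_counts_per_message[1]['secret'])  # 0 (missing keys count as zero)
print(toy_total_counts['secret'])           # 2
```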


In [15]:
merge_sms = pd.concat([training, word_per_sms], axis=1)

In [16]:
merge_sms.shape

(4458, 7785)

In [17]:
merge_sms.head()

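|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |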


In [19]:
spam_sms = merge_sms[merge_sms['Label'] == 'spam']
ham_sms = merge_sms[merge_sms['Label'] == 'ham']
p_spam = len(spam_sms) / len(merge_sms)
p_ham = len(ham_sms) / len(merge_sms)

In [20]:
n_spam = sum(spam_sms['SMS'].apply(len))

In [21]:
n_ham = sum(ham_sms['SMS'].apply(len))

In [22]:
n_vocabulary = len(vocabulary)

In [23]:
alpha = 1

In [24]:
spam_prob = {unique_word: 0 for unique_word in vocabulary}

In [25]:
ham_prob = {unique_word: 0 for unique_word in vocabulary}

In [26]:
for word in vocabulary:
    n_w_spam = spam_sms[word].sum()
    n_w_ham = ham_sms[word].sum()
    p_w_spam = (n_w_spam + alpha) / (n_spam + alpha * n_vocabulary)
    p_w_ham = (n_w_ham + alpha) / (n_ham + alpha * n_vocabulary)
    spam_prob[word] = p_w_spam
    ham_prob[word] = p_w_ham
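
To make the smoothing formula concrete, here is a minimal self-contained sketch on toy data. The messages, the `toy_` names, and the helper `smoothed_probs` are made up for illustration and are not part of the notebook. Only the formula itself mirrors the loop above.

```python
from collections import Counter

# Toy tokenized training data (made up for illustration).
toy_spam = [['win', 'money', 'now'], ['win', 'prize']]
toy_ham = [['see', 'you', 'now']]

toy_alpha = 1  # Laplace smoothing constant, same value as in the notebook
toy_vocabulary = sorted({w for msg in toy_spam + toy_ham for w in msg})
toy_n_vocabulary = len(toy_vocabulary)  # 6 unique words

def smoothed_probs(messages):
    """Return P(word | class) for every vocabulary word, with additive smoothing."""
    counts = Counter(w for msg in messages for w in msg)
    n_words = sum(counts.values())
    return {w: (counts[w] + toy_alpha) / (n_words + toy_alpha * toy_n_vocabulary)
            for w in toy_vocabulary}

toy_spam_prob = smoothed_probs(toy_spam)
toy_ham_prob = smoothed_probs(toy_ham)

print(toy_spam_prob['win'])  # (2 + 1) / (5 + 6) = 3/11
print(toy_ham_prob['win'])   # (0 + 1) / (3 + 6) = 1/9, nonzero thanks to smoothing
```

Without smoothing, a word that never appears in ham would get probability zero and wipe out the whole product for any message containing it.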

In [32]:
import re

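def classify(message):
    # Apply the same cleaning that was used on the training set.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors P(Spam) and P(Ham).
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply in P(word | class); words outside the vocabulary are ignored.
    for word in message: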
        if word in spam_prob:
            p_spam_given_message *= spam_prob[word]
        if word in ham_prob:
            p_ham_given_message *= ham_prob[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Need human classification'
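
Multiplying many small probabilities can underflow to 0.0 for long messages, in which case both products tie and `classify` returns 'Need human classification'. A common workaround, sketched here as an assumption rather than as part of the original notebook, is to sum logarithms instead of multiplying. The function name `classify_log`, its explicit parameters, and the toy dictionaries are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_prob, ham_prob):
    """Compare log-posteriors instead of raw products to avoid underflow."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in spam_prob:
            log_spam += math.log(spam_prob[word])
        if word in ham_prob:
            log_ham += math.log(ham_prob[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'Need human classification'

# Toy parameters (made up for illustration).
toy_spam_prob = {'win': 0.3, 'you': 0.1}
toy_ham_prob = {'win': 0.05, 'you': 0.4}
print(classify_log('WIN win!', 0.5, 0.5, toy_spam_prob, toy_ham_prob))  # spam
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow.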

In [33]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [34]:
classify('Sounds good, Tom, then see u there')

'ham'

In [35]:
classify('money money win win secret money')

'spam'

In [36]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [37]:
correct = 0
total = len(test)

In [41]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1114 entries, 0 to 1113
Data columns (total 3 columns):
Label        1114 non-null object
SMS          1114 non-null object
predicted    1114 non-null object
dtypes: object(3)
memory usage: 26.2+ KB


In [42]:
# Avoid naming the mask `bool`, which would shadow the Python built-in.
is_correct = test['Label'] == test['predicted']

In [44]:
correct = is_correct.sum()

In [45]:
accuracy = correct / total

In [46]:
print(accuracy)

0.9874326750448833
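
Because about 87% of the messages are ham, accuracy alone can look high even for a weak filter. As a complement, here is a minimal sketch that computes precision and recall for the spam class from two sequences of labels. The helper name `spam_precision_recall` and the toy labels are hypothetical. The notebook itself does not report these figures.

```python
def spam_precision_recall(actual, predicted):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a != 'spam' and p == 'spam' for a, p in pairs)
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)

    # Guard against division by zero when spam is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Toy labels (made up for illustration).
toy_actual = ['spam', 'spam', 'ham', 'ham', 'spam']
toy_predicted = ['spam', 'ham', 'ham', 'spam', 'spam']
print(spam_precision_recall(toy_actual, toy_predicted))
```

On the notebook's data, the same helper could presumably be called as `spam_precision_recall(test['Label'], test['predicted'])`.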
