# Classify messages with Naive Bayes

* Explore the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.
* Dataset can be downloaded from: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
       

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Read SMS dataset and explore.

In [4]:
import pandas as pd

# The file is tab-separated and has no header row, so column names are supplied explicitly.
sms_data = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])
sms_data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
sms_data.shape


(5572, 2)

In [6]:
sms_data.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [7]:
sms_data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

## Split Training and Test Set
* Split the dataset into a training set and a test set, where the training set accounts for 80% of the data and the test set for the remaining 20%.

In [8]:
sms_rand = sms_data.sample(frac=1, random_state=1)

In [9]:
training_test_index = round(len(sms_rand) * 0.8)


training_set = sms_rand[:training_test_index].reset_index(drop=True)
test_set = sms_rand[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)



Analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

In [10]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [11]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64
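
The three sets of proportions can also be placed side by side, which makes the comparison easier to read. The snippet below is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `sms_data`, not the real dataset); it repeats the shuffle-and-cut split from the cells above and collects the class shares in one table.

```python
import pandas as pd

# Hypothetical toy data standing in for sms_data (not the real dataset).
toy = pd.DataFrame({
    'Label': ['ham'] * 80 + ['spam'] * 20,
    'SMS': [f'message {i}' for i in range(100)],
})

# Shuffle, then cut at the 80% mark, mirroring the cells above.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
train = shuffled[:cut].reset_index(drop=True)
test = shuffled[cut:].reset_index(drop=True)

# One column per subset, one row per class.
proportions = pd.DataFrame({
    'full': toy['Label'].value_counts(normalize=True),
    'train': train['Label'].value_counts(normalize=True),
    'test': test['Label'].value_counts(normalize=True),
})
print(proportions)
```

A plain random split like this does not guarantee identical proportions in each subset; it only makes them close on average, which is why the check is worth doing.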

In [12]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


## Data cleaning:
* Begin by removing all punctuation and converting every letter to lower case.

In [13]:
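# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)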
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
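
To see what the cleaning step does to individual messages, here is a small sketch on two sample strings (the Series `toy` is hypothetical). It passes `regex=True` explicitly, because from pandas 2.0 onward `str.replace` treats the pattern as a literal string by default. Note that `\W` is Unicode-aware in Python 3, so letters such as `ü` are kept.

```python
import pandas as pd

# Two sample messages (hypothetical stand-ins for the SMS column).
toy = pd.Series([
    "Yep, by the pretty sculpture",
    "I forgot 2 ask ü all smth.. There's a card",
])

# Replace every non-word character with a space, then lower-case.
cleaned = toy.str.replace(r'\W', ' ', regex=True).str.lower()

# Each punctuation mark becomes a space, so runs of spaces can appear;
# str.split() without arguments collapses them when tokenizing.
print(cleaned.tolist())
print(cleaned.str.split().tolist())
```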


* Create a vocabulary: a list of all unique words in the training set.

In [14]:
training_set['SMS']=training_set['SMS'].str.split()

vocab = []
for sms in training_set['SMS']:
    for word in sms:
        vocab.append(word)
vocab = list(set(vocab))
    

In [15]:
len(vocab)

7783
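
The same vocabulary can be built more compactly with a set comprehension. This is a sketch on a hypothetical tokenized list, not a replacement for the cell above. Sorting the result is an added assumption: it gives a stable word order between runs, whereas `list(set(...))` may order the words differently each time Python starts.

```python
# Hypothetical tokenized messages (each message is a list of words).
tokenized = [
    ['yep', 'by', 'the'],
    ['the', 'pretty', 'the'],
]

# Collect unique words across all messages, then sort for a stable order.
vocab_sorted = sorted({word for sms in tokenized for word in sms})
print(vocab_sorted)
```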

* Use the vocabulary to transform the data: count how many times each vocabulary word occurs in each message.

In [16]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocab}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [17]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
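
An alternative way to build the same kind of count table is `collections.Counter`, which counts the words of one message at a time. The sketch below uses hypothetical toy messages; words missing from a message come out as `NaN` and are filled with 0.

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized messages.
tokenized = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['secret', 'secret', 'code'],
]
vocab_sorted = sorted({word for sms in tokenized for word in sms})

# One Counter per message becomes one row; columns follow the vocabulary.
counts = (
    pd.DataFrame([Counter(sms) for sms in tokenized], columns=vocab_sorted)
    .fillna(0)
    .astype(int)
)
print(counts)
```

Each row corresponds to one message and each column to one vocabulary word, which is the same layout as `word_counts` above.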


* Concatenate the word-count table just created with the original training set.

In [18]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.tail()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
4453,ham,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4457,ham,"[wherre, s, my, boytoy]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


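## Calculate constants
 * Compute the constants required by the Naive Bayes algorithm: the class priors, the total word count per class, and the vocabulary size.
 * Use Laplace smoothing with $\alpha = 1$.
 * Calculate the parameters, i.e. the smoothed probability of each word given each class.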
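
For reference, the quantities computed in the next cells can be written as follows. The symbol names are chosen here for illustration: $N_{w|Spam}$ is the number of times word $w$ occurs in spam messages (`n_word_given_spam`), $N_{Spam}$ is the total number of words in spam messages (`n_spam`), and $N_{Vocabulary}$ is the vocabulary size (`n_vocab`). The ham versions are analogous.

$$
P(w \mid Spam) = \frac{N_{w|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}},
\qquad
P(w \mid Ham) = \frac{N_{w|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam),
\qquad
P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$

With $\alpha = 1$, a word that never appears in a class still gets a small non-zero probability, so a single unseen word cannot drive the whole product to zero.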

In [19]:
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label']== 'ham']

p_spam = len(spam_messages)/len(training_set_clean)
p_ham = len(ham_messages)/ len(training_set_clean)

n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()


n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()


n_vocab = len(vocab)


alpha = 1

In [20]:
parameters_spam = {unique_word:0 for unique_word in vocab}
parameters_ham = {unique_word:0 for unique_word in vocab}

for word in vocab:
    n_word_given_spam = spam_messages[word].sum()   
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocab)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocab)
    parameters_ham[word] = p_word_given_ham
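
A tiny worked example may help confirm the smoothing arithmetic. The counts below are made up for illustration; the formula is the one used in the loop above.

```python
alpha = 1

# Hypothetical spam word counts over a three-word vocabulary.
toy_vocab = ['secret', 'prize', 'party']
spam_counts = {'secret': 3, 'prize': 2, 'party': 0}

n_spam_toy = sum(spam_counts.values())   # 5 words in spam messages in total
n_vocab_toy = len(toy_vocab)             # 3 unique words

toy_parameters_spam = {
    word: (spam_counts[word] + alpha) / (n_spam_toy + alpha * n_vocab_toy)
    for word in toy_vocab
}

# secret: (3+1)/8 = 0.5, prize: (2+1)/8 = 0.375, party: (0+1)/8 = 0.125
print(toy_parameters_spam)
```

The smoothed probabilities over the vocabulary sum to 1, and `party` gets 0.125 rather than 0 even though it never occurs in the toy spam messages.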

## Classifying A New Message
* Create a classifier (a function to classify a new message).

In [21]:
import re

def classify(message):
       
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, human may classify this!')
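
The values printed by `classify` are products of many small numbers. They are proportional to the posterior probabilities rather than normalized, and for long messages they can underflow to 0.0. A common workaround, shown here as a sketch and not as part of the original notebook, is to add logarithms instead of multiplying probabilities. The priors and parameter dictionaries below are hypothetical toy values, and `classify_log` is a name introduced for illustration.

```python
import math
import re

# Hypothetical toy priors and parameters (not the values learned above).
toy_p_spam, toy_p_ham = 0.4, 0.6
toy_params_spam = {'secret': 0.5, 'prize': 0.375, 'party': 0.125}
toy_params_ham = {'secret': 0.2, 'prize': 0.1, 'party': 0.7}

def classify_log(message):
    """Classify a message by comparing sums of log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_params_spam:
            log_spam += math.log(toy_params_spam[word])
        if word in toy_params_ham:
            log_ham += math.log(toy_params_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Secret prize!'))
print(classify_log('party ' * 2000))
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products, but the second call would underflow if the probabilities were multiplied directly.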

* Test the classifier on new messages.

In [22]:
classify('WINNER!! This is the secret code to  money.')

P(Spam|message): 1.9601625317988245e-23
P(Ham|message): 1.3673295850585383e-25
Label: Spam


In [23]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


* Run the spam filter on test_set to see how well it works on messages it has not seen before.

In [24]:
def classify_test_set(message):    
        
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [25]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


* Measure the accuracy of the spam filter created.

In [26]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
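
The same accuracy figure can be computed without an explicit loop, and a cross-tabulation shows which kind of error occurs. The sketch below uses a small hypothetical results frame (`results` is made up; on the real data the columns would come from `test_set`).

```python
import pandas as pd

# Hypothetical labels and predictions.
results = pd.DataFrame({
    'Label':     ['ham', 'ham',  'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham',  'ham'],
})

# The mean of a boolean Series is the fraction of True values.
toy_accuracy = (results['Label'] == results['predicted']).mean()
print('Accuracy:', toy_accuracy)

# Rows are true labels, columns are predictions.
confusion = pd.crosstab(results['Label'], results['predicted'])
print(confusion)
```

With roughly 87% of messages being ham, a filter that always answers `ham` would already reach about 87% accuracy, so the split between false positives and false negatives is worth inspecting alongside the overall figure.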


# This filter classifies the messages as spam or ham with 98.7% accuracy.

### How to improve accuracy further?

* Analyze the 14 misclassified messages and try to work out why the algorithm got them wrong.
* Make the filtering process more complex by making the algorithm sensitive to letter case.


In [37]:
test_set1 = test_set.copy()
# copy the test set so it can be analyzed further without modifying the original

In [38]:
test_set1.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [51]:
test_set1['accuracy'] = test_set1['Label'] == test_set1['predicted']
# add an 'accuracy' column recording whether each prediction matches the true label

In [52]:
test_set1.head()

Unnamed: 0,Label,SMS,predicted,accuracy
0,ham,Later i guess. I needa do mcat study too.,ham,True
1,ham,But i haf enuff space got like 4 mb...,ham,True
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,True
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,True
4,ham,"All done, all handed in. Don't know if mega sh...",ham,True


In [53]:
test_set_wrong = test_set1[~test_set1['accuracy']]

In [54]:
test_set_wrong
# these messages were classified incorrectly

Unnamed: 0,Label,SMS,predicted,accuracy
114,spam,Not heard from U4 a while. Call me now am here...,ham,False
135,spam,More people are dogging in your area now. Call...,ham,False
152,ham,Unlimited texts. Limited minutes.,spam,False
159,ham,26th OF JULY,spam,False
284,ham,Nokia phone is lovly..,spam,False
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification,False
302,ham,No calls..messages..missed calls,spam,False
319,ham,We have sent JD for Customer Service cum Accou...,spam,False
504,spam,Oh my god! I've found your number again! I'm s...,ham,False
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham,False
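
One possible way to investigate a misclassified message is to look at how strongly each of its words pulls towards spam or ham. The sketch below computes a per-word log-ratio of the two conditional probabilities; the parameter dictionaries are hypothetical toy values and `word_evidence` is a helper introduced here for illustration, not part of the original notebook.

```python
import math
import re

# Hypothetical toy parameters (stand-ins for parameters_spam / parameters_ham).
toy_params_spam = {'secret': 0.5, 'prize': 0.375, 'party': 0.125}
toy_params_ham = {'secret': 0.2, 'prize': 0.1, 'party': 0.7}

def word_evidence(message):
    """Return log(P(word|spam) / P(word|ham)) for each known word.

    Positive values push the decision towards spam, negative towards ham.
    """
    words = re.sub(r'\W', ' ', message).lower().split()
    return {
        word: math.log(toy_params_spam[word] / toy_params_ham[word])
        for word in words
        if word in toy_params_spam and word in toy_params_ham
    }

evidence = word_evidence('Secret party tonight')
for word, score in sorted(evidence.items(), key=lambda item: item[1]):
    print(f'{word}: {score:+.3f}')
```

Applied to the rows of `test_set_wrong` with the real parameters, a listing like this could show which few words dominate a wrong decision, and words that are not in the vocabulary at all simply do not appear.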
