## CLASSIFICATION OF MESSAGES AS EITHER SPAM OR NON-SPAM

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).


In [68]:
#Import the libraries needed for the analysis
import pandas as pd
import numpy as np

In [69]:
#read the file
text = pd.read_csv("SMSSpamCollection",sep='\t',header=None,names=['Label', 'SMS'])
text.head(3)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


In [70]:
#Find the number of rows and columns
text.shape

(5572, 2)

In [71]:
text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [72]:
#Find the proportion of messages already labeled as spam or ham (non-spam)
text['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### SPLIT THE DATA INTO TRAINING AND TEST DATASET
We're now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%.

In [73]:
#Randomize the dataset and use random state to ensure results are reproducible

text2=text.sample(frac=1,random_state=1)
text2.head(2)

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"


In [74]:

# Calculate index for split
training_test_index = round(len(text2) * 0.8)

# Training/Test split
training_set = text2[:training_test_index].reset_index(drop=True)
test_set = text2[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)
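
A quick sanity check after a split like this is to confirm that both parts keep roughly the same spam/ham mix as the full dataset. The sketch below uses a small made-up DataFrame (the `toy` frame and its labels are assumptions for illustration, not the SMS data); the same `value_counts(normalize=True)` call could be run on `training_set` and `test_set`.

```python
import pandas as pd

# Hypothetical toy data: 8 ham and 2 spam messages (not the real dataset)
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle, then split 80/20 the same way as in the cells above
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

# Compare the class proportions in each part
train_share = toy_train['Label'].value_counts(normalize=True)
test_share = toy_test['Label'].value_counts(normalize=True)
print(train_share)
print(test_share)
```

On a dataset this small the proportions can drift a lot between the two parts; on thousands of rows they would be expected to stay close to the overall mix.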


## LETTER CASE AND PUNCTUATION
To calculate all the probabilities required by the algorithm, we first need to clean the data so that the information we need is easy to extract. We'll convert every message to lower case and strip the punctuation.




In [75]:
#Preview the training set before cleaning
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


In [76]:
#Convert all messages to lower case, then remove non-word characters
training_set['SMS'] = training_set['SMS'].str.lower()
#Note: '\W' also matches whitespace, so replacing it with '' removes the
#spaces between words as well as the punctuation
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', '', regex=True)
training_set.head(3)


Unnamed: 0,Label,SMS
0,ham,yepbytheprettysculpture
1,ham,yesprincessareyougoingtomakememoan
2,ham,welpapparentlyheretired
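
Because `\W` matches whitespace too, the replacement string decides whether word boundaries survive the cleaning step. The sketch below compares an empty replacement (as in the cell above) with a single-space replacement on two sample messages; the variable names are made up for this illustration.

```python
import pandas as pd

# Two sample messages, used only for this illustration
sms = pd.Series(["Yep, by the pretty sculpture", "Welp apparently he retired"])
lowered = sms.str.lower()

# Empty replacement: punctuation AND spaces are removed
no_space = lowered.str.replace(r'\W', '', regex=True)

# Single-space replacement: punctuation is removed, word boundaries are kept
with_space = lowered.str.replace(r'\W', ' ', regex=True)

# Splitting on whitespace shows the difference in the resulting tokens
tokens_no_space = no_space.str.split().tolist()
tokens_with_space = with_space.str.split().tolist()
print(tokens_no_space)
print(tokens_with_space)
```

With the empty replacement each message becomes one long token, while the space replacement yields one token per word.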


### CREATING THE VOCABULARY
Next we create a vocabulary for the messages in the training set: a Python list containing every unique word across all messages, with each word represented as a string. The steps are:

1. Transform each message in the SMS column into a list by splitting the string at whitespace, using the Series.str.split() method.
2. Initialize an empty list named vocabulary.
3. Iterate over the SMS column (by this point each message is a list of strings) and, in a nested loop, append each string (word) to the vocabulary list.
4. Transform the vocabulary list into a set using the set() function. This removes the duplicates.
5. Transform the vocabulary set back into a list using the list() function.

In [77]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary=[]
for i in training_set['SMS']:
    for word in i:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))


In [90]:
#Number of unique entries in the vocabulary
len(vocabulary)

4142

### The final Training Set
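We're now going to use the vocabulary we just created to transform the data: each message becomes a row of counts, with one column per vocabulary entry recording how many times that entry occurs in the message.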

In [79]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
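
To make the shape of this dictionary concrete, here is the same counting loop on two made-up tokenized messages (the `toy_` names and the messages are assumptions for illustration). Each key is a vocabulary entry and each list holds one count per message.

```python
import pandas as pd

# Hypothetical tokenized messages, not taken from the dataset
toy_sms = pd.Series([['free', 'prize', 'free'], ['see', 'you', 'later']])
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per vocabulary entry, one slot per message
toy_counts_per_sms = {word: [0] * len(toy_sms) for word in toy_vocabulary}

# Increment the slot of the current message for every word it contains
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts_per_sms[word][index] += 1

# Rows are messages, columns are vocabulary entries
toy_counts = pd.DataFrame(toy_counts_per_sms)
print(toy_counts)
```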

In [80]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(6)

Unnamed: 0,busyheretryingtofinishfornewyeariamlookingforwardtofinallymeetingyou,oiccosmenmysisgotnolunchtodaymydadwentoutsodunnowhether2eatinschorwat,congratulationsurawardedeither500ofcdgiftvouchersfreeentry2our100weeklydrawtxtmusicto87066tncswwwldewcom1win150ppmx3age16,uknowwewatchinatlido,lolyesbutitwilladdsomespicetoyourday,noprobiwillsendtoyouremail,actuallyimwaitingfor2weekswhentheystartputtingad,whenwheredoipickyouup,iabsolutelylovesouthparkionlyrecentlystartedwatchingtheoffice,youknowmyolddomitoldyouaboutyesterdayhisnameisrogerhegotintouchwithmelastnightandwantsmetomeethimtodayat2pm,...,anywayimgoingshoppingonmyownnowcosmysisnotdoneyetdundisturbuliao,thenwegottadoitafterthat,heysorryididntgiveyaaabellearlierhunnyjustbeeninbedbutmitego2thepubl8trifuwanamtuploadsaluvjenxxx,youhavebeenspeciallyselectedtoreceivea3000awardcall08712402050beforethelinesclosecost10ppm16tcsapplyagpromo,youvealreadygotaflakyparentitsnotsupposedtobethechildsjobtosupporttheparentnotuntiltheyretherideageanywayimsupposedtobetheretosupportyouandnowivehurtyouunintentionalbuthurtnonetheless,sorrywenttobedearlynightnight,plsdontforgettostudy,yeahimmacomeovercausejaywantstodosomedrugs,thatssignificantbutdontworry,noidonthavecancermomsmakingabigdealoutofaregularcheckupakapapsmear
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,busyheretryingtofinishfornewyeariamlookingforwardtofinallymeetingyou,oiccosmenmysisgotnolunchtodaymydadwentoutsodunnowhether2eatinschorwat,congratulationsurawardedeither500ofcdgiftvouchersfreeentry2our100weeklydrawtxtmusicto87066tncswwwldewcom1win150ppmx3age16,uknowwewatchinatlido,lolyesbutitwilladdsomespicetoyourday,noprobiwillsendtoyouremail,actuallyimwaitingfor2weekswhentheystartputtingad,whenwheredoipickyouup,...,anywayimgoingshoppingonmyownnowcosmysisnotdoneyetdundisturbuliao,thenwegottadoitafterthat,heysorryididntgiveyaaabellearlierhunnyjustbeeninbedbutmitego2thepubl8trifuwanamtuploadsaluvjenxxx,youhavebeenspeciallyselectedtoreceivea3000awardcall08712402050beforethelinesclosecost10ppm16tcsapplyagpromo,youvealreadygotaflakyparentitsnotsupposedtobethechildsjobtosupporttheparentnotuntiltheyretherideageanywayimsupposedtobetheretosupportyouandnowivehurtyouunintentionalbuthurtnonetheless,sorrywenttobedearlynightnight,plsdontforgettostudy,yeahimmacomeovercausejaywantstodosomedrugs,thatssignificantbutdontworry,noidonthavecancermomsmakingabigdealoutofaregularcheckupakapapsmear
0,ham,[yepbytheprettysculpture],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,[yesprincessareyougoingtomakememoan],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,[welpapparentlyheretired],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,[iforgot2asküallsmththeresacardondapresentleih...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


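### Calculating Constants First
We're now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:

$$P(Spam | w_1, w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i | Spam)$$

$$P(Ham | w_1, w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i | Ham)$$

To calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside these formulas, we use Laplace smoothing:

$$P(w_i | Spam) = \frac{N_{w_i | Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i | Ham) = \frac{N_{w_i | Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Some of these terms have the same value for every new message, so we calculate them once up front: $P(Spam)$, $P(Ham)$, $N_{Spam}$, $N_{Ham}$ and $N_{Vocabulary}$. We use $\alpha = 1$.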


In [82]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1
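
As a worked example of the smoothing these constants feed into, the sketch below evaluates (count + alpha) / (N_class + alpha * N_Vocabulary) on made-up counts. The function name `smoothed_probability` and all numbers are assumptions for illustration; exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction

def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return Fraction(n_word_given_class + alpha, n_class + alpha * n_vocabulary)

# Made-up class with 10 words in total and a 5-entry vocabulary
toy_counts = [3, 0, 2, 4, 1]   # occurrences of each vocabulary entry in the class
toy_n_class = sum(toy_counts)  # 10
toy_n_vocabulary = len(toy_counts)  # 5

p_seen = smoothed_probability(3, toy_n_class, toy_n_vocabulary)    # (3 + 1) / (10 + 5)
p_unseen = smoothed_probability(0, toy_n_class, toy_n_vocabulary)  # (0 + 1) / (10 + 5)

# The smoothed probabilities over the whole vocabulary still sum to 1
total = sum(smoothed_probability(c, toy_n_class, toy_n_vocabulary) for c in toy_counts)
print(p_seen, p_unseen, total)
```

The point of adding alpha is visible in `p_unseen`: an entry that never occurs in a class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.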

### Calculating Parameters
Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter is a conditional probability value associated with one word in the vocabulary.

In [83]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

### Classifying A New Message
Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [84]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [85]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 0.13458950201884254
P(Ham|message): 0.8654104979811574
Label: Ham


In [86]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 0.13458950201884254
P(Ham|message): 0.8654104979811574
Label: Ham
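
One practical note on the product used in `classify()`: multiplying many small probabilities can underflow to zero for long messages. A common variant, not used in this notebook, compares sums of logarithms instead, which gives the same ordering. The sketch below uses made-up priors and parameters (every `toy_` name and value is an assumption for illustration).

```python
import math

# Hypothetical priors and per-word parameters, not taken from the dataset
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_parameters_spam = {'free': 0.05, 'prize': 0.04, 'see': 0.001}
toy_parameters_ham = {'free': 0.002, 'prize': 0.001, 'see': 0.03}

def classify_log(words):
    """Compare log P(Spam) + sum(log P(w|Spam)) with the ham equivalent."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

label_a = classify_log(['free', 'prize'])
label_b = classify_log(['see'])

# Why logs help: a plain product of 200 small factors underflows to 0.0,
# while the corresponding sum of logs stays a finite number
underflowed = math.prod([0.001] * 200)
log_total = 200 * math.log(0.001)
print(label_a, label_b, underflowed, log_total)
```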


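### Measuring the Spam Filter's Accuracy
Both messages above were labeled ham, and the probabilities printed for each are simply the priors P(Spam) and P(Ham), so these two examples tell us little. Let's see how the filter does on our test set, which has 1,114 messages.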

We'll start by writing a function that returns classification labels instead of printing them.

In [87]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [88]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,ham
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [89]:
#Count the test messages whose predicted label matches the true label
total = test_set.shape[0]
correct = int((test_set['Label'] == test_set['predicted']).sum())
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 967
Incorrect: 147
Accuracy: 0.8680430879712747
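
Accuracy on its own does not show which kind of message is being misclassified. About 86.6% of the full dataset is ham and every prediction in the preview above is ham, so it may be worth comparing the result with an always-ham baseline and looking at a confusion table. The sketch below does this on made-up labels (the `toy_results` frame is an assumption for illustration); the same calls could be applied to `test_set`.

```python
import pandas as pd

# Hypothetical true and predicted labels, not taken from the test set
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam'],
    'predicted': ['ham', 'ham', 'ham', 'ham', 'spam'],
})

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

# Share of correct predictions
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Accuracy of always predicting the most frequent label
baseline = toy_results['Label'].value_counts(normalize=True).max()

print(confusion)
print('Accuracy:', accuracy)
print('Baseline:', baseline)
```

If the measured accuracy sits close to the baseline, the filter is adding little beyond the class proportions, and the spam row of the confusion table shows how many spam messages slip through.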
