# Spam Filter for SMS messages using Naive Bayes

The aim of this project is to classify SMS messages as spam or non-spam. As we saw in the previous mission, to do this the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate the probabilities that a new message is spam or non-spam.
3. Classifies the new message based on these probability values: if the probability for spam is greater, it classifies the message as spam; otherwise, it classifies it as non-spam. If the two probability values are equal, we may need a human to classify the message.
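
The decision rule in step 3 can be sketched on its own, before any probabilities are estimated. This is only an illustration: `decide` is a hypothetical helper name, and the probability values passed to it are made up.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Hypothetical helper: apply the three-way decision rule from step 3."""
    if p_spam_given_message > p_ham_given_message:
        return "spam"
    if p_spam_given_message < p_ham_given_message:
        return "ham"
    # Equal values: the filter cannot decide on its own.
    return "needs human classification"


print(decide(0.7, 0.3))  # spam
print(decide(0.2, 0.8))  # ham
print(decide(0.5, 0.5))  # needs human classification
```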

In [1]:
import pandas as pd
sms_spam = pd.read_csv("SMSSpamCollection", sep = '\t',header = None, names = ['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()


(5572, 2)


| | Label | SMS |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


"ham" means non-spam

In [2]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

- About 87% of the messages are ham (non-spam).
- About 13% of the messages are spam.

# Randomize the dataset

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of new messages to be classified correctly as spam or ham.

## Training and Test Dataset

- Randomize the dataset, then split it into a training set and a test set.
- Use 80% of the data for training and 20% for testing.


In [3]:
data = sms_spam.sample(random_state=1, frac=1)

# identify the index at which to split the data (80% of the rows)
split_index = round(len(data) * 0.8)

# training dataset: rows up to the split index
# test dataset: rows from the split index onward
training_set = data[:split_index].reset_index(drop=True)
test_set = data[split_index:].reset_index(drop=True)


In [4]:
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)
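
As a sanity check on the shuffle-and-slice approach, the same steps can be run on a tiny made-up DataFrame to confirm that every row lands in exactly one of the two sets. The toy data and the names `toy`, `toy_train` and `toy_test` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the SMS DataFrame (hypothetical data, 10 rows).
toy = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "SMS": [f"message {i}" for i in range(10)],
})

# Shuffle all rows reproducibly, then cut at the 80% mark.
shuffled = toy.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
toy_train = shuffled[:cut].reset_index(drop=True)
toy_test = shuffled[cut:].reset_index(drop=True)

print(toy_train.shape, toy_test.shape)  # (8, 2) (2, 2)

# No message is lost or duplicated by the split.
recombined = pd.concat([toy_train, toy_test])["SMS"]
print(sorted(recombined) == sorted(toy["SMS"]))  # True
```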


In [5]:
print("Training Dataset")
print(training_set['Label'].value_counts(normalize=True))
print("\n")
print("Test dataset")
print(test_set['Label'].value_counts(normalize=True))

Training Dataset
ham     0.86541
spam    0.13459
Name: Label, dtype: float64


Test dataset
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


------

# Data Cleaning
- Bring the data into a format that makes it easy to extract all the information we need.

Tasks to do:

1. Letter Case and Punctuation
    Remove all the punctuation and bring every letter to lower case.
2. Creating the Vocabulary
    Build a list of all the unique words in our training set.
3. Final Training Set
    Use the vocabulary to transform the training data into per-message word counts.

-----
### 1. Letter Case and Punctuation

In [6]:
# check the dataset
training_set.head()

| | Label | SMS |
| --- | --- | --- |
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [7]:
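# \W matches any character that is not a letter, a digit, or an underscore
# regex=True is required: recent pandas versions treat the pattern as a literal string by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)

# convert to lower case
training_set['SMS'] = training_set['SMS'].str.lower()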

training_set.head()



| | Label | SMS |
| --- | --- | --- |
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
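
To see exactly what the `\W` pattern keeps and removes, here is a small sketch on a single made-up string (the sample text is an assumption, not a message from the dataset). Note that underscores and accented letters count as word characters, so they survive the cleaning.

```python
import re

sample = "WINNER!! Call 0906-170 now, it's FREE_entry ü"

# Replace every non-word character with a space, then lower-case.
cleaned = re.sub(r"\W", " ", sample).lower()

# split() collapses the runs of spaces left behind by the punctuation.
tokens = cleaned.split()
print(tokens)
# ['winner', 'call', '0906', '170', 'now', 'it', 's', 'free_entry', 'ü']
```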


### 2. Creating the Vocabulary

In [8]:
training_set['SMS'] = training_set['SMS'].str.split()

vocab = []

for sms in training_set['SMS']:
    for word in sms:
        vocab.append(word)
    
vocabulary = list(set(vocab))
    

In [9]:
len(vocabulary)

7783

### 3. The Final training set

In [10]:
# create a dictionary named "word_counts_per_sms"
# where key --> unique word, value --> a list of zeros, one per SMS
# e.g. [0] * 5 gives [0, 0, 0, 0, 0]
# so the dictionary starts as {uw1: [0, 0, ...], uw2: [0, 0, ...], ...}

word_counts_per_sms = {unique_word:[0] * len(training_set['SMS']) for unique_word in vocabulary}

# loop over each index and SMS to count the occurrences of every word
# enumerate() gives both the index and the SMS message

for index,sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] +=1


In [11]:
# Transform word_counts_per_sms into a DataFrame 

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,sharing,republic,checkboxes,09061702893,line,47per,holla,arrival,tomorw,terrible,...,ga,away,jod,brisk,okay,hell,kids,09066358152,describe,darling
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
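
The real table is mostly zeros, which makes it hard to see what the transformation does. The sketch below repeats the same steps on three made-up tokenized messages; the names `toy_sms`, `toy_vocab` and `toy_counts` are assumptions for illustration.

```python
import pandas as pd

# Three hypothetical messages, already cleaned and split into words.
toy_sms = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]

# Vocabulary: every unique word (sorted only to make the output stable).
toy_vocab = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per word, with one slot per message.
counts = {word: [0] * len(toy_sms) for word in toy_vocab}
for index, sms in enumerate(toy_sms):
    for word in sms:
        counts[word][index] += 1

toy_counts = pd.DataFrame(counts)
print(toy_counts)
#    entry  free  lar  ok
# 0      1     2    0   0
# 1      0     0    1   1
# 2      0     1    0   1
```

Each row sums to the number of words in the corresponding message, which is a quick way to check the counts.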


In [12]:
# Concatenate the DataFrame we just built above with the DataFrame containing the training set. This way, 
# we'll also have the Label and the SMS columns.
# axis = 1 to concatenate as column

final_training_data = pd.concat([training_set,word_counts], axis =1)
final_training_data.head()

Unnamed: 0,Label,SMS,sharing,republic,checkboxes,09061702893,line,47per,holla,arrival,...,ga,away,jod,brisk,okay,hell,kids,09066358152,describe,darling
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- N_Spam is the total number of words in all the spam messages. It is not the number of spam messages, and it is not the number of unique words in spam messages.
- N_Ham is the total number of words in all the non-spam messages. It is not the number of non-spam messages, and it is not the number of unique words in non-spam messages.
- N_Vocabulary is the number of unique words in the vocabulary, len(vocabulary).
- alpha = 1 is the Laplace smoothing parameter.
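
The distinction between these counts is easy to get wrong, so here is a minimal sketch on made-up data. The toy DataFrame and the `*_toy` names are assumptions; the SMS column holds lists of words, as it does in the training set after splitting.

```python
import pandas as pd

toy = pd.DataFrame({
    "Label": ["spam", "ham", "spam"],
    "SMS": [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]],
})

# Total words per class: repeated words count every time they appear.
n_spam_toy = toy.loc[toy["Label"] == "spam", "SMS"].apply(len).sum()
n_ham_toy = toy.loc[toy["Label"] == "ham", "SMS"].apply(len).sum()

# Unique words across all messages.
n_vocabulary_toy = len({word for sms in toy["SMS"] for word in sms})

print(n_spam_toy, n_ham_toy, n_vocabulary_toy)  # 5 2 4
```

Here N_Spam is 5, even though there are only 2 spam messages and only 3 unique words in them.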

In [15]:
# P(Spam) and P(Ham)--> normalised value counts
#------------------------------------
p_spam = final_training_data['Label'].value_counts(normalize=True)['spam']
p_ham = final_training_data['Label'].value_counts(normalize=True)['ham']

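# N_Spam, N_Ham, and N_Vocabulary --> total word counts per class
# each SMS is a list of words at this point, so len() gives its word count
# (summing whole rows would also include the non-numeric Label and SMS columns)
#-----------------------------
n_spam = final_training_data.loc[final_training_data['Label'] == 'spam', 'SMS'].apply(len).sum()
n_ham = final_training_data.loc[final_training_data['Label'] == 'ham', 'SMS'].apply(len).sum()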
n_vocabulary = len(vocabulary)
alpha = 1


The probability values that P(wi|Spam) and P(wi|Ham) will take are called **parameters**.
\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}
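
A short worked example may help make the formula concrete. The numbers below are made up (N_Spam = 5, N_Vocabulary = 4), and `smoothed_probability` is a hypothetical helper name, not part of the project code.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Hypothetical helper: the smoothed estimate of P(w_i | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)


# "free" appears 3 times in spam: (3 + 1) / (5 + 1 * 4) = 4/9
p_free_given_spam = smoothed_probability(3, n_class=5, n_vocabulary=4)

# "lar" never appears in spam: (0 + 1) / (5 + 1 * 4) = 1/9, not zero
p_lar_given_spam = smoothed_probability(0, n_class=5, n_vocabulary=4)

print(p_free_given_spam, p_lar_given_spam)
```

The second case shows why smoothing is used: without alpha, a single unseen word would force the whole product for that class to zero.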

Step 1:

- Initialize two dictionaries, where each key is a unique word (from our vocabulary) represented as a string, and each value is 0.
- One dictionary stores the parameters for P(wi|Spam), and the other stores the parameters for P(wi|Ham).

In [14]:
# ['sea', 'navigate'] should look like this: {'sea': 0, 'navigate': 0}

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

Step 2:
- Isolate the spam and the ham messages in the training set into two different DataFrames. The Label column will help you isolate the messages.

In [18]:
spam_msgs = final_training_data[final_training_data['Label']=='spam']
ham_msgs = final_training_data[final_training_data['Label']== 'ham']                              


Step 3:
- Calculate the parameters using the formulas:
\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

In [19]:
# N(wi|Spam)--> n_word_given_spam
# P(wi|Spam)--> p_word_given_spam
n_vocabulary = len(vocabulary)

for word in vocabulary:
    n_word_given_spam = spam_msgs[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam+(alpha* n_vocabulary))
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_msgs[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham+(alpha* n_vocabulary))
    parameters_ham[word] = p_word_given_ham
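
Looping over several thousand words and summing one column at a time can be slow. As a possible alternative, the per-word totals can be computed in one pass by summing the whole word-count table. The sketch below uses a tiny made-up table; all `toy_*` names are assumptions, and it has not been timed against the loop above.

```python
import pandas as pd

# Hypothetical word-count table (one row per message) and its labels.
toy_counts = pd.DataFrame({
    "entry": [1, 0, 0],
    "free": [2, 0, 1],
    "lar": [0, 1, 0],
    "ok": [0, 1, 1],
})
toy_labels = pd.Series(["spam", "ham", "spam"])
toy_alpha = 1
toy_n_vocabulary = toy_counts.shape[1]

# Column-wise sum over the spam rows gives N(w_i|Spam) for every word at once.
spam_totals = toy_counts[toy_labels == "spam"].sum()
toy_n_spam = spam_totals.sum()

toy_parameters_spam = (
    (spam_totals + toy_alpha) / (toy_n_spam + toy_alpha * toy_n_vocabulary)
).to_dict()

print(toy_parameters_spam)

# Sanity check: the smoothed probabilities over the vocabulary sum to 1.
print(abs(sum(toy_parameters_spam.values()) - 1) < 1e-12)  # True
```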
    

## Classifying A New Message

The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [24]:
import re

def classify(message):
    # clean the message: strip punctuation, lower-case, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # calculate P(spam|message) and P(ham|message)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # words that are not in the vocabulary are ignored
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    # compare the spam and ham values
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
        
        

In [25]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
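
The values printed above are already around 1e-25 for a short message. For much longer messages, multiplying many small probabilities can underflow to 0.0 for both classes and produce a false tie. A common workaround is to add logarithms instead of multiplying. The sketch below is not the method used in this project: `classify_log` and the toy priors and parameters are assumptions for illustration.

```python
import math
import re

# Hypothetical toy priors and parameters (not the values learned above).
toy_p_spam, toy_p_ham = 0.4, 0.6
toy_parameters_spam = {"free": 0.4, "money": 0.3, "hello": 0.1, "friend": 0.2}
toy_parameters_ham = {"free": 0.1, "money": 0.1, "hello": 0.4, "friend": 0.4}


def classify_log(message):
    """Same decision rule as the product version, computed in log space."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Both dictionaries share the same keys (the vocabulary).
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
            log_ham += math.log(toy_parameters_ham[word])

    if log_spam > log_ham:
        return "spam"
    if log_spam < log_ham:
        return "ham"
    return "needs human classification"


print(classify_log("FREE money!!"))   # spam
print(classify_log("Hello, friend"))  # ham

# The plain product underflows for a very long message...
print(0.4 ** 2000)                    # 0.0
# ...but the log version still separates the two classes.
print(classify_log("free " * 2000))   # spam
```

Because the logarithm is monotonic, comparing the log values gives the same label as comparing the products whenever the products do not underflow.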


## Measuring Accuracy of the Spam Filter

We will now run the filter on the test set created above, which holds 20% of the data.

Here we modify classify() to return the classification label instead of printing it.


In [26]:
import re

def classify_test_set(message):
    # clean the message: strip punctuation, lower-case, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # calculate P(spam|message) and P(ham|message)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # words that are not in the vocabulary are ignored
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    # compare the spam and ham values
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        print('need human classification')
        # return an explicit label so the new column never holds None
        return 'needs human classification'
        
        

#### Use classify_test_set to create a new column in our test set.

In [27]:
test_set['filtered'] = test_set['SMS'].apply(classify_test_set)
test_set.head(10)

need human classification


| | Label | SMS | filtered |
| --- | --- | --- | --- |
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |
| 5 | ham | But my family not responding for anything. Now... | ham |
| 6 | ham | U too... | ham |
| 7 | ham | Boo what time u get out? U were supposed to ta... | ham |
| 8 | ham | Genius what's up. How your brother. Pls send h... | ham |
| 9 | ham | I liked the new mobile | ham |


### Function to measure the accuracy

In [29]:
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['filtered']:
        correct += 1
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


The accuracy is about 98.74%, well above our 80% goal. Our spam filter looked at 1,114 messages that it had not seen in training and classified 1,100 of them correctly.
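
The same accuracy figure can also be computed without an explicit loop, and a cross-tabulation shows which kind of mistake the filter makes. The sketch below runs on a small made-up results table; `toy_results` and its contents are assumptions, not the project's test set.

```python
import pandas as pd

# Hypothetical true labels and predictions for five messages.
toy_results = pd.DataFrame({
    "Label":    ["ham", "ham", "spam", "spam", "ham"],
    "filtered": ["ham", "ham", "spam", "ham", "ham"],
})

# The mean of a boolean Series is the fraction of True values.
accuracy = (toy_results["Label"] == toy_results["filtered"]).mean()
print(accuracy)  # 0.8

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy_results["Label"], toy_results["filtered"])
print(confusion)
```

In this toy table the only error is a spam message labeled as ham. Accuracy alone hides that kind of detail, which matters when the classes are imbalanced.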