# Spam SMS messages

In this guided project, we're going to study the practical side of the multinomial Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate the probabilities that a new message is spam or non-spam.
- Classifies a new message based on these probability values: if the probability for spam is greater, it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.


In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])

In [2]:
sms

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
sms.shape

(5572, 2)

In [4]:
sms['Label'].value_counts(normalize=True, dropna=False)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

We clearly have a lot more non-spam (ham) examples in our dataset: about 87% of the messages are ham and about 13% are spam.
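Because ham dominates the dataset, any accuracy figure has to be read against a trivial baseline. As a minimal sketch (the `labels` series below is an invented toy sample, not the real dataset), a classifier that always answers "ham" is right exactly as often as ham occurs:

```python
import pandas as pd

# Toy labels, invented for illustration: 13 ham and 2 spam messages.
labels = pd.Series(['ham'] * 13 + ['spam'] * 2)

# Share of each class, as a fraction of all messages.
class_share = labels.value_counts(normalize=True)

# A classifier that always predicts the majority class is correct
# exactly as often as the majority class occurs.
majority_label = class_share.idxmax()
baseline_accuracy = class_share.max()
print(majority_label, round(baseline_accuracy, 3))
```

Applied to the proportions shown above, the same reasoning gives a baseline of roughly 86.6%, so the filter is only useful if it scores clearly higher than that.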

In [5]:
sms = sms.sample(frac=1, random_state=1)

<img src="cpgp_dataset_3.png">

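At the moment our data has one row per message, with the raw text in the `SMS` column. We want to convert it into the format at the bottom of the image above: one column per vocabulary word, holding the number of times that word occurs in each message.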
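To make the target layout concrete, here is a small sketch on two invented messages. The `toy` frame, its contents, and the name `toy_vocabulary` are assumptions for illustration only, not rows or names from the dataset:

```python
import pandas as pd

# Two invented messages, used only to illustrate the target layout.
toy = pd.DataFrame({
    'Label': ['spam', 'ham'],
    'SMS': ['SECRET PRIZE! CLAIM SECRET PRIZE NOW!!', 'Coming to my secret party?'],
})

# Normalise: replace non-word characters with spaces, lower-case, split into words.
tokens = toy['SMS'].str.replace(r'\W', ' ', regex=True).str.lower().str.split()

# Every distinct word becomes one column.
toy_vocabulary = sorted({word for message in tokens for word in message})

# One row per message, one count per vocabulary word.
counts = pd.DataFrame(
    [[message.count(word) for word in toy_vocabulary] for message in tokens],
    columns=toy_vocabulary,
)
toy_counts = pd.concat([toy[['Label']], counts], axis=1)
print(toy_counts)
```

Each row keeps its label, and every other column counts one vocabulary word; for instance, the word "secret" is counted twice for the first toy message and once for the second.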

In [6]:
#split the data 
sms_copy = sms.copy()
train_set = sms_copy.sample(frac=0.8, random_state=1)
test_set = sms_copy.drop(train_set.index)

In [7]:
train_set.shape

(4458, 2)

In [8]:
test_set.shape

(1114, 2)

In [9]:
#check the sets are representative 
train_set['Label'].value_counts(normalize=True)*100

ham     86.675639
spam    13.324361
Name: Label, dtype: float64

In [10]:
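# replace every non-word character with a space (regex=True is required in recent pandas versions)
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)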
train_set['SMS'] = train_set['SMS'].str.lower()
train_set = train_set.reset_index(drop=True)

In [11]:
train_set

Unnamed: 0,Label,SMS
0,ham,good night my dear sleepwell amp take care
1,ham,sen told that he is going to join his uncle fi...
2,ham,thank you baby i cant wait to taste the real ...
3,ham,when can ü come out
4,ham,no thank you you ve been wonderful
5,ham,watching telugu movie wat abt u
6,ham,get ready to moan and scream
7,ham,babe i miiiiiiissssssssss you i need you...
8,ham,up to ü ü wan come then come lor but i d...
9,ham,i ve sent my wife your text after we buy them...


In [12]:
# let's get our vocabulary
# first we turn the messages into lists of words
vocabulary = []
train_set['SMS'] = train_set['SMS'].str.split()


In [13]:
train_set

Unnamed: 0,Label,SMS
0,ham,"[good, night, my, dear, sleepwell, amp, take, ..."
1,ham,"[sen, told, that, he, is, going, to, join, his..."
2,ham,"[thank, you, baby, i, cant, wait, to, taste, t..."
3,ham,"[when, can, ü, come, out]"
4,ham,"[no, thank, you, you, ve, been, wonderful]"
5,ham,"[watching, telugu, movie, wat, abt, u]"
6,ham,"[get, ready, to, moan, and, scream]"
7,ham,"[babe, i, miiiiiiissssssssss, you, i, need, yo..."
8,ham,"[up, to, ü, ü, wan, come, then, come, lor, but..."
9,ham,"[i, ve, sent, my, wife, your, text, after, we,..."


In [14]:
#then build up the vocabulary
def getWords(words):
    words = set(words)
    for word in words:
        vocabulary.append(word)
train_set['SMS'].apply(getWords)
vocabulary = list(set(vocabulary))

In [15]:
len(vocabulary)

7712

In [16]:
# now turn the dataset into a new dataframe with vocab words for columns 
# and the number of times they appear as rows
word_counts_per_sms = {key: [0]*len(train_set['SMS']) for key in vocabulary}

In [17]:
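# the loop variable is called `message` so that it does not overwrite
# the `sms` DataFrame loaded at the start of the notebook
for index, message in enumerate(train_set['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1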

In [18]:
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

In [19]:
word_counts_per_sms.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
train_set = pd.concat([train_set, word_counts_per_sms], axis=1)

In [21]:
train_set

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,0121,01223585236,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,ham,"[good, night, my, dear, sleepwell, amp, take, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[sen, told, that, he, is, going, to, join, his...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[thank, you, baby, i, cant, wait, to, taste, t...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[when, can, ü, come, out]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,ham,"[no, thank, you, you, ve, been, wonderful]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,ham,"[watching, telugu, movie, wat, abt, u]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ham,"[get, ready, to, moan, and, scream]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,ham,"[babe, i, miiiiiiissssssssss, you, i, need, yo...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,ham,"[up, to, ü, ü, wan, come, then, come, lor, but...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
9,ham,"[i, ve, sent, my, wife, your, text, after, we,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Naive Bayes

We will work with the following formulation of the Naive Bayes algorithm:

\begin{equation}
\begin{aligned}
P(Spam | w_1,w_2, ..., w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) &\propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{aligned}
\end{equation}


In order to calculate 
\begin{equation}
P(Spam | w_1,w_2, ..., w_n)
\end{equation}

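we will make use of the following definitions, where $N_{w_i|Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, $N_{Vocabulary}$ is the number of distinct words in the vocabulary, and $\alpha$ is the smoothing parameter (the ham quantities are defined in the same way):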



\begin{equation}
\begin{aligned}
P(w_i|Spam) &= \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) &= \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
\end{equation}
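A short worked example may help show why the smoothing term matters. The counts below are invented for illustration, and `smoothed_probability` is a hypothetical helper name. With no smoothing, a word that never occurs in spam gets probability zero, which would wipe out the whole product; with a smoothing value of 1 it gets a small non-zero probability instead:

```python
# Invented counts for a single word, purely for illustration.
n_word_given_spam = 0   # the word never occurs in spam training messages
n_spam = 20             # total number of words across spam messages
n_vocabulary = 10       # number of distinct words in the vocabulary

def smoothed_probability(n_word, n_class, n_vocab, alpha):
    """Additive smoothing: (n_word + alpha) / (n_class + alpha * n_vocab)."""
    return (n_word + alpha) / (n_class + alpha * n_vocab)

# Without smoothing the estimate is 0 / 20.
unsmoothed = smoothed_probability(n_word_given_spam, n_spam, n_vocabulary, alpha=0)

# With alpha = 1 the estimate is (0 + 1) / (20 + 1 * 10) = 1 / 30.
smoothed = smoothed_probability(n_word_given_spam, n_spam, n_vocabulary, alpha=1)
print(unsmoothed, smoothed)
```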

In [22]:
# Probability of spam and ham (not spam)
spam_train = train_set[train_set['Label'] == 'spam']
ham_train = train_set[train_set['Label'] == 'ham']

p_spam_train = len(spam_train)/len(train_set)
p_ham_train = len(ham_train)/len(train_set)

In [23]:
p_spam_train

0.13324360699865412

In [24]:
p_ham_train

0.8667563930013459

In [25]:
alpha = 1

In [26]:
# dictionaries to hold the values P(w_i|Spam) where w_i is a word
# in the vocabulary

spam_probs = {key : 0 for key in vocabulary}
ham_probs = {key : 0 for key in vocabulary}


In [27]:
counts_spam = spam_train.iloc[:,2:].sum(axis=0)
counts_ham = ham_train.iloc[:,2:].sum(axis=0)
N_spam = counts_spam.sum()
N_ham = counts_ham.sum()

In [28]:
for word in counts_spam.index.values:
    spam_probs[word] = (counts_spam[word] + alpha)/(N_spam+alpha*len(vocabulary))
    
for word in counts_ham.index.values:
    ham_probs[word] = (counts_ham[word] + alpha)/(N_ham+alpha*len(vocabulary))

# The Spam Filter

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
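One practical caveat with this comparison is that multiplying many small probabilities can underflow to zero for long messages. A common variation, which is not what this notebook does, is to add logarithms instead of multiplying; since the logarithm is increasing, the comparison gives the same answer. The sketch below assumes invented toy parameters (`toy_p_spam`, `toy_p_ham`, `toy_spam_probs`, `toy_ham_probs` are hypothetical, as is the function name `classify_log`):

```python
import math
import re

# Hypothetical parameters for illustration; real ones come from training.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_probs = {'win': 0.30, 'money': 0.30, 'see': 0.05, 'you': 0.35}
toy_ham_probs = {'win': 0.05, 'money': 0.05, 'see': 0.40, 'you': 0.50}

def classify_log(message):
    """Compare log P(Spam) + sum of log P(w|Spam) with the ham equivalent."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the toy vocabulary are ignored.
        if word in toy_spam_probs:
            log_spam += math.log(toy_spam_probs[word])
            log_ham += math.log(toy_ham_probs[word])
    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win money!!'))
print(classify_log('See you'))
```

With these toy numbers the first message scores higher for spam and the second higher for ham, matching what the plain product would give.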

In [29]:
# calculate probabilities of spam and ham given a new message 
# the classify function
import re
def classify(message):

    # raw string avoids the invalid-escape warning for \W
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    

    p_spam_given_message = p_spam_train
    p_ham_given_message = p_ham_train
    for word in message:
        if word in spam_probs:  # dict lookup; its keys are exactly the vocabulary words
            p_spam_given_message*=spam_probs[word]
            p_ham_given_message*=ham_probs[word]
        
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [30]:
test_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
test_2 = "Sounds good, Tom, then see u there"

In [31]:
classify(test_1)

P(Spam|message): 1.273700039190484e-25
P(Ham|message): 2.6479653658243408e-27
Label: Spam


# Testing
Let's now evaluate the classifier on our test set.


In [32]:
# the classifier modified for the test set: it returns the label instead of printing it
import re
def classify_test_set(message):

    # raw string avoids the invalid-escape warning for \W
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    

    p_spam_given_message = p_spam_train
    p_ham_given_message = p_ham_train
    for word in message:
        if word in spam_probs:  # dict lookup; its keys are exactly the vocabulary words
            p_spam_given_message*=spam_probs[word]
            p_ham_given_message*=ham_probs[word]
        

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [33]:
correct = 0
total = len(test_set)
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head(3)

Unnamed: 0,Label,SMS,predicted
958,ham,Welp apparently he retired,ham
2498,ham,Dai what this da.. Can i send my resume to thi...,ham
4259,ham,I am late. I will be there at,ham


In [34]:
#the results

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1

correct/total 

0.9874326750448833
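The accuracy can also be computed without an explicit loop, and a cross-tabulation shows which kind of error occurs. This is a minimal sketch on invented labels and predictions (`toy_results` is a hypothetical stand-in for `test_set`), not a re-run of the results above:

```python
import pandas as pd

# Invented labels and predictions, standing in for test_set.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham'],
})

# Element-wise comparison gives booleans; their mean is the accuracy.
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Cross-tabulate actual labels (rows) against predicted labels (columns).
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(accuracy)
print(confusion)
```

In the toy table, the single error is a spam message predicted as ham. On an imbalanced dataset like this one, that breakdown is more informative than accuracy alone.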

In [35]:
print(len(test_set))

1114
