# Building a Spam Filter with Naive Bayes

In this project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages: one probability for spam and one for non-spam.
3. Classifies a new message based on these probability values. If the probability for spam is greater, it classifies the message as spam; if the probability for non-spam is greater, it classifies it as non-spam. If the two values are equal, we may need a human to classify the message.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd 

spam_collection = pd.read_csv("SMSSpamCollection", sep = "\t",header = None, names = ["Label","SMS"])

In [2]:
spam_collection.shape

(5572, 2)

In [3]:
print(spam_collection.head())

spam_collection["Label"].value_counts(normalize = True)

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Let's start by shuffling the entire dataset and splitting it into a training set (80% of the messages) and a test set (the remaining 20%).

In [4]:
randomized = spam_collection.sample(frac = 1, random_state = 1)

randomized_index = round(len(randomized) * 0.8)

training_set = randomized[:randomized_index].reset_index(drop = True)
test_set = randomized[randomized_index:].reset_index(drop = True)

In [5]:
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [6]:
#proportion of spam and ham in both the training and the test set
print(training_set["Label"].value_counts(normalize=True))
print(test_set["Label"].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


# Data Cleaning

Let's begin the data cleaning process by removing punctuation and converting all words to lower case.

In [7]:
import re

# raw strings (r'...') keep \W from being read as a string escape sequence
training_set["SMS"] = training_set["SMS"].apply(lambda x: re.sub(r'\W', ' ', x)).str.lower()
test_set["SMS"] = test_set["SMS"].apply(lambda x: re.sub(r'\W', ' ', x)).str.lower()

In [8]:
print(training_set.head())
print(test_set.head())

  Label                                                SMS
0   ham                       yep  by the pretty sculpture
1   ham      yes  princess  are you going to make me moan 
2   ham                         welp apparently he retired
3   ham                                            havent 
4   ham  i forgot 2 ask ü all smth   there s a card on ...
  Label                                                SMS
0   ham          later i guess  i needa do mcat study too 
1   ham             but i haf enuff space got like 4 mb   
2  spam  had your mobile 10 mths  update to latest oran...
3   ham  all sounds good  fingers   makes it difficult ...
4   ham  all done  all handed in  don t know if mega sh...
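
To make the effect of the cleaning step concrete, here is a small standalone sketch. The helper name `clean_sms` and the sample text are made up for illustration; the regular expression is the same one used in the cell above. Note that `\W` also replaces apostrophes, which is why contractions show up split in the output above ("don t", "there s").

```python
import re

def clean_sms(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()

# Hypothetical sample message, not taken from the dataset.
sample = "WINNER!! Call 0800-123 now, it's FREE."
tokens = clean_sms(sample).split()
print(tokens)
```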


In [9]:
training_set['SMS'] = training_set['SMS'].str.split()

In [10]:
vocabulary = []

for sms in training_set["SMS"]:
    for word in sms: 
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [11]:
len(vocabulary)

7783

In [12]:
word_counts_per_sms = {unique_word: [0] * len(training_set["SMS"]) for unique_word in vocabulary}

for index, sms in enumerate(training_set["SMS"]):
    for word in sms: 
        word_counts_per_sms[word][index] += 1
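
The table built above has one column per vocabulary word and one row per message. Since the parameter calculation later in the notebook only sums each word's counts per class, a lighter alternative is to keep one `collections.Counter` per label. The sketch below uses a tiny made-up dataset (`toy_training` is hypothetical) and is not the approach this notebook takes.

```python
from collections import Counter

# Toy stand-in for the tokenized training set: (label, tokens) pairs.
toy_training = [
    ("spam", ["win", "money", "now"]),
    ("ham", ["see", "you", "now"]),
    ("spam", ["win", "prize"]),
]

# One counter per class accumulates how often each word occurs in that class.
counts_by_label = {"spam": Counter(), "ham": Counter()}
for label, tokens in toy_training:
    counts_by_label[label].update(tokens)

# The vocabulary is the union of all words seen in either class.
vocabulary_toy = sorted(set().union(*counts_by_label.values()))

print(counts_by_label["spam"]["win"])  # occurrences of "win" in spam
print(counts_by_label["ham"]["win"])   # a missing word counts as 0
print(len(vocabulary_toy))
```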

In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

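|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |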


In [14]:
training_set_clean = pd.concat([training_set,word_counts], axis =1)
training_set_clean.head()

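|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |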


In [15]:
#isolating messages
spam_messages = training_set_clean[training_set_clean["Label"] == "spam"]
ham_messages = training_set_clean[training_set_clean["Label"] == "ham"]

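# calculating the prior probabilities P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# total number of words across all spam messages
n_words_per_spam_messages = spam_messages["SMS"].apply(len)
n_spam = n_words_per_spam_messages.sum()

# total number of words across all ham messages
n_words_per_ham_messages = ham_messages["SMS"].apply(len)
n_ham = n_words_per_ham_messages.sum()

# number of unique words in the training set
n_vocabulary = len(vocabulary)

# Laplace smoothing parameter
alpha = 1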

## Calculating Parameters

In [16]:
parameters_spam = {unique_word: 0 for unique_word in vocabulary} 
parameters_ham = {unique_word: 0 for unique_word in vocabulary}

for word in vocabulary:
    # additive smoothing: alpha also scales the vocabulary size in the denominator
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
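
For reference, the loop above applies additive (Laplace) smoothing. Written out, with $N_{w \mid \text{Spam}}$ standing for `n_word_given_spam`, $N_{\text{Spam}}$ for `n_spam`, $N_{\text{Vocabulary}}$ for `n_vocabulary` and $\alpha$ for `alpha` (the symbol names are chosen here for illustration):

```latex
\[
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\]
```

With $\alpha = 1$ the denominators reduce to $N_{\text{Spam}} + N_{\text{Vocabulary}}$ and $N_{\text{Ham}} + N_{\text{Vocabulary}}$. The smoothing term means that a word which never occurs in one class still gets a small non-zero probability for that class, so a single unseen word cannot drive the whole product to zero.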

### Classifying A New Message

In [17]:
def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('HUMAN HELP!!!')

In [18]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [19]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
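
The probabilities printed above are already very small for short messages. For long messages, a product of many such factors can underflow to `0.0` in floating point, in which case both products would compare as equal. A common remedy is to add logarithms instead of multiplying probabilities. The sketch below illustrates the idea with made-up toy parameters (all `toy_*` names and values are hypothetical); it is not part of the original notebook.

```python
import math
import re

# Hypothetical toy values standing in for p_spam, p_ham,
# parameters_spam and parameters_ham from the notebook.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {"win": 0.02, "money": 0.015, "see": 0.001, "you": 0.01}
toy_parameters_ham = {"win": 0.001, "money": 0.002, "see": 0.01, "you": 0.03}

def classify_log(message):
    """Compare sums of log-probabilities instead of raw products."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return None  # exact tie: leave the decision to a human

print(classify_log("Win money!!"))
# A 1,000-word message: the raw products would both underflow to 0.0,
# while the log-scores remain comparable.
print(classify_log("see you " * 500))
```

Because the logarithm is monotonic, comparing log-scores gives the same decision as comparing the products whenever the products can be represented.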


### Measuring the Spam Filter's Accuracy

In [20]:
def classify_test_set(message):
    
    message = re.sub(r"\W", " ", message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    if p_ham_given_message > p_spam_given_message:
        return "ham"
    elif p_ham_given_message < p_spam_given_message:
        return "spam"
    else:
        # Tie: return no label so the message can be flagged for human review.
        return None
        

In [21]:
test_set["predicted"] = test_set["SMS"].apply(classify_test_set)
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | later i guess i needa do mcat study too | ham |
| 1 | ham | but i haf enuff space got like 4 mb | ham |
| 2 | spam | had your mobile 10 mths update to latest oran... | spam |
| 3 | ham | all sounds good fingers makes it difficult ... | ham |
| 4 | ham | all done all handed in don t know if mega sh... | ham |


In [23]:
correct = 0 
total = test_set.shape[0]
incorrect = []

for _, row in test_set.iterrows():
    if row["Label"] == row["predicted"]:
        correct += 1
    else: 
        incorrect.append(row)

print("correct: ", correct)
print("incorrect: ", total - correct)
print("accuracy: ", correct / total)

correct:  1100
incorrect:  14
accuracy:  0.9874326750448833


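The filter reaches an accuracy of about 98.7% on the test set: 1,100 of the 1,114 messages are classified correctly and only 14 are not. For comparison, labeling every message as ham would already score about 86.8% on this test set, since that is the share of ham messages in it.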
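
Because the classes are imbalanced, it can also help to look at how the errors split between the two classes. The sketch below counts (actual, predicted) pairs and derives precision and recall for the spam class. The `pairs` list is made-up illustration data; in this notebook the pairs could presumably be obtained with `zip(test_set["Label"], test_set["predicted"])`.

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs, not results from the notebook.
pairs = [
    ("ham", "ham"), ("ham", "ham"), ("ham", "spam"),
    ("spam", "spam"), ("spam", "ham"), ("spam", "spam"),
]

# Each key is an (actual, predicted) combination, i.e. one confusion-matrix cell.
confusion = Counter(pairs)
true_pos = confusion[("spam", "spam")]   # spam correctly flagged
false_pos = confusion[("ham", "spam")]   # ham wrongly flagged as spam
false_neg = confusion[("spam", "ham")]   # spam that slipped through

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(confusion)
print("precision:", precision, "recall:", recall)
```

For a spam filter, false positives (legitimate messages marked as spam) are often considered the more costly error, so precision is worth tracking alongside accuracy.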

In [28]:
incorrect_messages = pd.DataFrame(incorrect)

In [29]:
incorrect_messages

|   | Label | SMS | predicted |
|---|---|---|---|
| 114 | spam | not heard from u4 a while call me now am here... | ham |
| 135 | spam | more people are dogging in your area now call... | ham |
| 152 | ham | unlimited texts limited minutes | spam |
| 159 | ham | 26th of july | spam |
| 284 | ham | nokia phone is lovly | spam |
| 293 | ham | a boy loved a gal he propsd bt she didnt mind... |  |
| 302 | ham | no calls messages missed calls | spam |
| 319 | ham | we have sent jd for customer service cum accou... | spam |
| 504 | spam | oh my god i ve found your number again i m s... | ham |
| 546 | spam | hi babe its chloe how r u i was smashed on s... | ham |
