# Guided Project: Building a Spam Filter with Naive Bayes

## Exploring the Dataset 

In this project, I will build a spam filter with the multinomial Naive Bayes algorithm that classifies SMS messages as spam or non-spam (ham). To do this, the computer must be able to carry out three steps:

   1) Learn how people classify messages.
   
   2) Calculate the probabilities that a new message is spam and that it is non-spam.
   
   3) Classify the message by comparing the two probabilities. For example, if
      the spam probability is greater than the non-spam probability, then the
      message is classified as spam.
      
I will start with the first step, teaching the computer how to classify messages. The dataset I'll be using is the SMS Spam Collection from the UCI Machine Learning Repository.

In [1]:
import pandas as pd

# The file is tab-separated with no header row: the label comes first, then the message text.
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
print('There are {} rows and {} columns in this dataset'.format(df.shape[0], df.shape[1]))

# Look the counts up by label rather than by position.
label_counts = df['Label'].value_counts()
print('There are {} non-spam messages and {} spam messages'.format(label_counts['ham'], label_counts['spam']))

# Only the last expression in a cell is displayed: the share of each label.
df['Label'].value_counts(normalize=True)

There are 5572 rows and 2 columns in this dataset
There are 4825 non-spam messages and 747 spam messages


ham     0.865937
spam    0.134063
Name: Label, dtype: float64

## Training and Test Set

After exploring the data, I will move on to building the spam filter. The dataset has 5,572 rows and 2 columns, with 4,825 non-spam messages and 747 spam messages. To test how well the spam filter works, I will divide the dataset into two parts:
    
   1) Training Set: the set I will use to 'train' the computer on how to classify
      messages. 
    
   2) Test Set: the set I will use to test how good the spam filter is at 
      classifying new messages.

I will split the dataset into 80% training and 20% testing, so the training set will have 4,458 messages and the test set will have 1,114 messages. The 1,114 test messages have already been classified by a human, but the spam filter will treat them as new and try to classify them. Before splitting, I will randomize the dataset.
      


In [2]:
randomized_data = df.sample(frac = 1, random_state = 1) 
training_test_index = round(len(randomized_data) * 0.8)
training_set = randomized_data[:training_test_index].reset_index(drop = True)
test_set = randomized_data[training_test_index:].reset_index(drop = True)
print(training_set['Label'].value_counts())
print('\n')
print(training_set['Label'].value_counts(normalize = True))
print('\n')
print(test_set['Label'].value_counts())
print('\n')
print(test_set['Label'].value_counts(normalize = True))

ham     3858
spam     600
Name: Label, dtype: int64


ham     0.86541
spam    0.13459
Name: Label, dtype: float64


ham     967
spam    147
Name: Label, dtype: int64


ham     0.868043
spam    0.131957
Name: Label, dtype: float64


## Letter Case and Punctuation 

In the previous step, I split the data into a training set and a test set. Now I will use the training set to teach the algorithm to classify new messages. To do so, I need to calculate two probabilities:

   1) P(Spam | message) 
   
   2) P(Ham | message) 

To calculate these probabilities, I first need to clean the data into a format that makes the information I need easy to extract. I will replace the SMS column with a series of new columns, where each column represents a unique word from the vocabulary and holds the number of times that word occurs in the message. Each row represents a single message, all words are lowercase, and punctuation is no longer taken into account.
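
As a minimal illustration of the cleaning step on a single string, the sketch below replaces every non-word character with a space and lowercases the result, using only the standard `re` module. The helper name `clean_message` is hypothetical and is not part of the project code.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    # r'\W' matches any character that is not a letter, digit, or underscore.
    return re.sub(r'\W', ' ', text).lower()

cleaned = clean_message("Yep, by the pretty sculpture")
# split() with no argument collapses the repeated spaces left by punctuation.
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

Because the comma is replaced rather than deleted, the cleaned string keeps a double space where the punctuation used to be; this does not matter once the message is split into words.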


In [3]:
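# regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal by default.
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()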
training_set['SMS'].head()

0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object

## Creating the Vocabulary

In the previous step, I removed the punctuation and changed all letters to lowercase. In the final training set, every column other than 'Label' and the original 'SMS' text will correspond to a unique word in the vocabulary. I will now build that vocabulary: the list of unique words that occur in the training set.

In [4]:
training_set['SMS'] = training_set['SMS'].str.split()
vocabulary = []
for n in training_set['SMS']:
    for word in n: 
        vocabulary.append(word)

vocabulary = list(set(vocabulary))
len(vocabulary)

7783

## Final Training Set

After creating the vocabulary for the messages in the training set, I will now use it to transform the data. The end result will be a new dataframe, but to get there I first need to build a dictionary of word counts per message, which I can then convert to a dataframe.
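
To make the shape of that dictionary concrete, here is a small sketch on three made-up tokenized messages (they are not taken from the dataset). Each vocabulary word maps to a list with one counter per message.

```python
# Toy tokenized messages, invented for illustration only.
messages = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["secret", "secret", "money"],
]

# Vocabulary: every distinct word across all messages.
vocabulary = sorted({word for message in messages for word in message})

# One list per word, with one slot per message, initialised to zero.
word_counts = {word: [0] * len(messages) for word in vocabulary}
for index, message in enumerate(messages):
    for word in message:
        word_counts[word][index] += 1

print(word_counts["secret"])
print(word_counts["party"])
```

Passing a dictionary of this shape to `pd.DataFrame` gives one column per word and one row per message.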

In [5]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
training_set_clean = pd.concat([training_set, word_counts_per_sms], axis = 1)
training_set_clean.head()

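|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |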


## Calculating Constants First

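After cleaning and transforming the data, I finally have a training set that I can work with. I will now begin creating the spam filter by following the multinomial Naive Bayes algorithm, starting with the constants that stay the same for every message: P(Spam), P(Ham), the number of words in all spam messages, the number of words in all ham messages, the size of the vocabulary, and the smoothing parameter alpha.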
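
For reference, the quantities involved can be written out as follows. This is the usual multinomial Naive Bayes formulation with additive smoothing; the symbol names ($N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$, $N_{w_i \mid Spam}$) are notation assumed here rather than taken from the project, chosen to line up with the variables `n_spam`, `n_ham`, `number_of_vocabulary`, and `alpha` in the code.

```latex
% Scores compared for a message made of the words w_1, ..., w_n
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)

% Per-word parameters with additive smoothing
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}

P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
```

Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary. With $\alpha = 1$ the smoothing is known as Laplace smoothing.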

In [6]:
spam = training_set_clean[training_set_clean['Label'] == 'spam']
ham = training_set_clean[training_set_clean['Label'] == 'ham']

p_spam = len(spam) / len(training_set_clean)
p_ham = len(ham) / len(training_set_clean) 

number_of_spam_words = spam['SMS'].apply(len)
n_spam = number_of_spam_words.sum()

number_of_ham_words = ham['SMS'].apply(len)
n_ham = number_of_ham_words.sum()

number_of_vocabulary = len(vocabulary)

alpha = 1

## Calculating Parameters

In the previous step, I calculated the constants needed by the algorithm. Now I need to calculate the parameters: the probability of each word given spam and the probability of each word given non-spam.

In [7]:
spam_parameters = {unique_word : 0 for unique_word in vocabulary}
ham_parameters = {unique_word : 0 for unique_word in vocabulary}

for word in vocabulary: 
    number_of_words_given_spam = spam[word].sum()
    prob_word_given_spam = (number_of_words_given_spam + alpha) / (n_spam + (alpha * number_of_vocabulary))
    spam_parameters[word] = prob_word_given_spam
    
    number_of_words_given_ham = ham[word].sum()
    prob_word_given_ham = (number_of_words_given_ham + alpha) / (n_ham + (alpha * number_of_vocabulary))
    ham_parameters[word] = prob_word_given_ham
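
The effect of `alpha` is easiest to see on small numbers. The sketch below applies the same formula to hypothetical counts (a class containing 7 words in total and a 9-word vocabulary, both invented for illustration); the function name `smoothed_probability` is also an assumption, not part of the project code.

```python
def smoothed_probability(word_count, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with additive smoothing."""
    return (word_count + alpha) / (n_words_in_class + alpha * n_vocabulary)

# A word seen 3 times in the class, and a word never seen in the class.
seen = smoothed_probability(word_count=3, n_words_in_class=7, n_vocabulary=9)
unseen = smoothed_probability(word_count=0, n_words_in_class=7, n_vocabulary=9)

# With alpha=0 (no smoothing) the unseen word gets probability zero,
# which would wipe out the whole product for any message containing it.
unsmoothed = smoothed_probability(word_count=0, n_words_in_class=7, n_vocabulary=9, alpha=0)

# Hypothetical per-word counts for the 9 vocabulary words (they sum to 7).
counts = [3, 1, 1, 1, 1, 0, 0, 0, 0]
total = sum(smoothed_probability(c, 7, 9) for c in counts)

print(seen, unseen, unsmoothed, total)
```

The smoothed probabilities over the whole vocabulary still sum to 1, so smoothing redistributes probability toward unseen words rather than adding any.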

## Classifying A New Message

After calculating the constants and the parameters, I can start building the spam filter. The spam filter is a function that: 
    
   1) Takes a new message as input.
   
   2) Calculates P(Spam | w1, w2, ..., wn) and P(Ham | w1, w2, ..., wn).
   
   3) Compares the two values and labels the message as spam or ham according 
      to whichever is greater. If the two are equal, it asks for a human to 
      classify the message.
      

In [8]:
import re

def classify(message):
    # Clean the message the same way as the training data.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors, then multiply in the probability of each word.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words that are not in the vocabulary are ignored.
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [9]:
classify('Winner!! This is the secret code to unlock the money: C3421.')
print('\n')
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
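
The values printed above are already between 1e-21 and 1e-27 for two short messages, and each extra word multiplies in another small factor. A common alternative, which is not what the `classify` function does, is to add logarithms instead of multiplying probabilities; the comparison between the two classes comes out the same because the logarithm is increasing. The sketch below uses hypothetical toy parameters (the `toy_` names and their values are invented); in the project they would come from `p_spam`, `p_ham`, `spam_parameters`, and `ham_parameters`.

```python
import math
import re

# Hypothetical toy parameters, invented for illustration only.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_parameters = {"winner": 0.02, "money": 0.03, "see": 0.001}
toy_ham_parameters = {"winner": 0.0005, "money": 0.002, "see": 0.01}

def classify_log(message):
    """Return 'spam', 'ham', or 'needs human classification' using log scores."""
    words = re.sub(r'\W', ' ', message).lower().split()

    # Sums of logs replace products of probabilities.
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the product version.
        if word in toy_spam_parameters:
            log_spam += math.log(toy_spam_parameters[word])
        if word in toy_ham_parameters:
            log_ham += math.log(toy_ham_parameters[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

print(classify_log("Winner!! Money money money"))
print(classify_log("See you there"))
```

The log scores are no longer probabilities, so they are only useful for deciding which class scores higher.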


## Measuring the Spam Filter's Accuracy

Now that the filter can calculate P(Spam|message) and P(Ham|message) for a new message, I will measure how well it performs on the test set of 1,114 messages. The algorithm will output a classification label for every message in the test set, which I can then compare with the actual label given by a human. Accuracy is defined as:
       
      - Accuracy = number of correctly classified messages / total number of classified messages

In [10]:
def classify_test_set(message):
    # Same logic as classify(), but returns the label instead of printing it.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors, then multiply in the probability of each word.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words that are not in the vocabulary are ignored.
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    if p_ham_given_message > p_spam_given_message:
        return 'Ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'Spam'
    else:
        return 'Needs human classification'

In [11]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | Ham |
| 1 | ham | But i haf enuff space got like 4 mb... | Ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | Spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | Ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | Ham |


In [20]:
print(test_set['Label'].value_counts())
print('\n')
print(test_set['predicted'].value_counts())



ham     967
spam    147
Name: Label, dtype: int64


Ham                           969
Spam                          144
Needs human classification      1
Name: predicted, dtype: int64
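
Label totals show how many messages landed in each class, not whether each individual message was labeled correctly. A minimal sketch of the accuracy calculation as defined earlier, on a handful of hypothetical labels (not taken from the test set), could look like this:

```python
# Hypothetical actual and predicted labels, invented for illustration only.
actual = ["ham", "ham", "spam", "ham", "spam"]
predicted = ["Ham", "Spam", "Spam", "Ham", "Needs human classification"]

correct = 0
for true_label, guess in zip(actual, predicted):
    # The predictions are capitalised while the dataset labels are lowercase,
    # so the comparison lowercases the prediction first.
    if guess.lower() == true_label:
        correct += 1

accuracy = correct / len(actual)

print('Correct:', correct)
print('Total:', len(actual))
print('Accuracy:', accuracy)
```

On the real data, the same row-by-row comparison could be written with pandas as `(test_set['predicted'].str.lower() == test_set['Label']).mean()`, assuming the column names used in this notebook. A message marked for human classification counts as incorrect under this definition.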


## Next Steps

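Overall, I managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. On the test set, the predicted label counts are close to the actual ones: the filter labeled 969 messages as ham (2 more than the 967 actual ham messages) and 144 as spam (3 fewer than the 147 actual spam messages), and it left 1 message for human classification. Matching totals do not guarantee that each individual message was labeled correctly, so the next step is to compare the predicted and actual labels message by message and calculate the accuracy as defined above.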