# Building a Spam Filter Using Naive Bayes

In this project I build a classification model that uses the Naive Bayes algorithm to classify SMS messages as either spam or not spam. The project is designed to demonstrate knowledge of conditional probability, Bayes' theorem, and the Naive Bayes algorithm itself.

To classify messages as spam or non-spam using Naive Bayes, the computer does the following:

* Learns how humans classify messages.
* Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
* Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
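
The three steps above can be sketched end to end on a handful of made-up messages. This is only a toy illustration: the messages, the helper names `score` and `decide`, and the plain whitespace tokenization are assumptions for the example, not the data or code used in the rest of this project.

```python
from collections import Counter

# Hypothetical hand-labelled messages (not taken from the SMS dataset).
labelled = [
    ("win money now", "spam"),
    ("win a prize", "spam"),
    ("see you at lunch", "ham"),
    ("lunch at noon", "ham"),
    ("are you home now", "ham"),
]

# Step 1: learn from the human labels by counting words per class.
counts = {"spam": Counter(), "ham": Counter()}
n_messages = Counter()
for text, label in labelled:
    counts[label].update(text.split())
    n_messages[label] += 1
vocab = set(counts["spam"]) | set(counts["ham"])


def score(text, label, alpha=1):
    """Step 2: unnormalized probability of `label` given `text`."""
    total_words = sum(counts[label].values())
    p = n_messages[label] / len(labelled)  # class prior
    for word in text.split():
        if word in vocab:  # words never seen in training are skipped
            p *= (counts[label][word] + alpha) / (total_words + alpha * len(vocab))
    return p


def decide(text):
    """Step 3: pick the label with the larger value."""
    p_spam, p_ham = score(text, "spam"), score(text, "ham")
    if p_spam > p_ham:
        return "spam"
    if p_ham > p_spam:
        return "ham"
    return "needs human classification"


print(decide("win money"))      # spam
print(decide("lunch at noon"))  # ham
```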

For this project, my goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam).

In [1]:
import pandas as pd
import re

sms_spam = pd.read_csv('/Users/miesner.jacob/Desktop/DataQuest/SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [2]:
sms_spam.shape

(5572, 2)

In [3]:
sms_spam['Label'].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

## Splitting into Training and Test Sets

(Using 80/20 Training Test Split)

In [4]:
randomized = sms_spam.sample(frac=1, random_state=1)

training_test_split_point = round(len(randomized) * 0.8)

training_set = randomized[:training_test_split_point].reset_index(drop=True)
test_set = randomized[training_test_split_point:].reset_index(drop=True)


#Check shape and verify split

print(training_set.shape)
print("The Training Set has {:.2f}% of rows".format(training_set.shape[0] / len(randomized)*100))
print('\n')
print(test_set.shape)
print("The Test Set has {:.2f}% of rows".format(test_set.shape[0] / len(randomized)*100))

(4458, 2)
The Training Set has 80.01% of rows


(1114, 2)
The Test Set has 19.99% of rows
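
Beyond the row counts, it can be worth confirming that the two parts do not share rows and that each keeps a label mix similar to the full dataset. The sketch below does this on a small made-up frame; `toy`, `toy_train`, and `toy_test` are hypothetical names, and the same calls would presumably apply to `training_set` and `test_set`.

```python
import pandas as pd

# Hypothetical 10-row dataset with the same two columns as the SMS data.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Same shuffle-then-slice approach as above, keeping the original index
# so that overlap between the two parts can be checked.
shuffled = toy.sample(frac=1, random_state=1)
split_point = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_point]
toy_test = shuffled[split_point:]

# No row should appear in both parts.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), overlap)

# Label proportions in each part, for comparison with the full dataset.
print(toy_train['Label'].value_counts(normalize=True))
print(toy_test['Label'].value_counts(normalize=True))
```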


## Cleaning the Training Set

To calculate all the probabilities required by the algorithm, we'll first need to perform a bit of data cleaning to bring the data into a format that makes it easy to extract all the information we need.

In [5]:
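# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)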
training_set['SMS'] = training_set['SMS'].str.lower()
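
To see what these two cleaning steps do to a single message, here is a small sketch on a made-up string using the standard `re` module. The sample text is hypothetical. Note that `\W` removes punctuation and symbols but keeps digits and underscores, so contractions and numbers get split into several tokens.

```python
import re

# Hypothetical message, not taken from the dataset.
sample = "WINNER!! You've won £1,000... call 0800-123 NOW"

# Replace every non-word character with a space, then lowercase.
cleaned = re.sub(r'\W', ' ', sample).lower()

# Splitting on whitespace gives the list of words used later on.
tokens = cleaned.split()
print(tokens)
# ['winner', 'you', 've', 'won', '1', '000', 'call', '0800', '123', 'now']
```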

### Creating vocabulary of all unique words

In [6]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [7]:
len(vocabulary)

7783

### Creating a word-count dataframe

In [8]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS'])  for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [9]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [10]:
training_set = pd.concat([training_set,word_counts],axis = 1)

In [11]:
training_set.head()

Unnamed: 0,Label,SMS,maximize,objection,ab,dining,predictive,mind,easiest,3mins,...,heaven,accordin,babies,polyh,taxi,cosign,weird,bother,ternal,careless
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
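
The wide table above is hard to read because almost every cell is zero. The same transformation on two made-up messages is easier to inspect; `toy`, `vocab`, and `toy_counts` are hypothetical names used only for this sketch. It also shows that the cells hold counts, so a word repeated within a message gets a value above 1.

```python
import pandas as pd

# Two hypothetical, already tokenized messages.
toy = pd.DataFrame({'SMS': [['secret', 'prize', 'secret'], ['call', 'me']]})

# Vocabulary: every unique word, sorted so the column order is predictable.
vocab = sorted({word for sms in toy['SMS'] for word in sms})

# One list of per-message counts for each word in the vocabulary.
counts = {word: [0] * len(toy) for word in vocab}
for index, sms in enumerate(toy['SMS']):
    for word in sms:
        counts[word][index] += 1

toy_counts = pd.concat([toy, pd.DataFrame(counts)], axis=1)
print(toy_counts)
```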


## Calculating Constants First

I'm now done with cleaning the training set, and can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:

$$
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
$$

$$
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

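$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Here $N_{w_i|Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in all spam messages, $N_{Vocabulary}$ is the number of unique words in the vocabulary, and $\alpha$ is the smoothing parameter (the ham terms are defined the same way).

Some of the terms in the four equations above will have the same value for every new message. We can calculate the value of these terms once and avoid repeating the computation when a new message comes in. Below, I'll use our training set to calculate: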

* $P(Spam)$ and $P(Ham)$
* $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

I'll also use Laplace smoothing and set $\alpha = 1$.

In [12]:
spam_messages = training_set[training_set['Label'] == 'spam']
ham_messages = training_set[training_set['Label'] == 'ham']

p_spam = len(spam_messages) / len(training_set)
p_ham = len(ham_messages) / len(training_set)

print("Probability of ham: " + str(p_ham))
print("Probability of spam: " + str(p_spam))
print('\n')

n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

print("Number of ham words: " + str(n_ham))
print("Number of spam words: " + str(n_spam))
print('\n')

n_vocabulary = len(vocabulary)

print("Number of total words aka vocabulary: " + str(n_vocabulary))
print('\n')

alpha = 1

print('Smoothing parameter alpha:', alpha)

Probability of ham: 0.8654104979811574
Probability of spam: 0.13458950201884254


Number of ham words: 57237
Number of spam words: 15190


Number of total words aka vocabulary: 7783


Smoothing parameter alpha: 1
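
As a quick worked example of why smoothing matters, the sketch below plugs the constants printed above into the formula for a word that never occurs in a given class. The helper name `smoothed` is hypothetical and exists only for this illustration. Without smoothing ($\alpha = 0$) such a word would get probability 0 and wipe out the whole product.

```python
def smoothed(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Additively smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)


# Constants reported by the cell above.
n_spam, n_ham, n_vocabulary = 15190, 57237, 7783

# A vocabulary word that appears zero times in the class.
p_unseen_spam = smoothed(0, n_spam, n_vocabulary)  # 1 / 22973
p_unseen_ham = smoothed(0, n_ham, n_vocabulary)    # 1 / 65020

# The same word without smoothing.
p_unsmoothed = smoothed(0, n_spam, n_vocabulary, alpha=0)  # 0.0

print(p_unseen_spam, p_unseen_ham, p_unsmoothed)
```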


## Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

In [14]:
spam_params = {unique_word:0 for unique_word in vocabulary}
ham_params = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    spam_params[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    ham_params[word] = p_word_given_ham
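
One inexpensive sanity check on parameters like these: for a given class, the smoothed probabilities over the whole vocabulary should sum to 1, because the numerators add up to $N_{class} + \alpha \cdot N_{Vocabulary}$. This holds as long as every word counted in the class total is in the vocabulary, which is the case here since the vocabulary is built from all training messages. The sketch uses made-up counts; `toy_spam_counts` and `toy_vocabulary` are hypothetical, and the same check could presumably be run as `sum(spam_params.values())`.

```python
import math
from collections import Counter

# Hypothetical word counts for one class and a five-word vocabulary.
toy_spam_counts = Counter({'win': 3, 'prize': 2, 'call': 1})
toy_vocabulary = ['win', 'prize', 'call', 'lunch', 'home']
alpha = 1

n_class = sum(toy_spam_counts.values())  # total words in the class: 6

# Counter returns 0 for words that never occur in the class.
params = {
    word: (toy_spam_counts[word] + alpha) / (n_class + alpha * len(toy_vocabulary))
    for word in toy_vocabulary
}

total = sum(params.values())
print(math.isclose(total, 1.0))  # True
```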

## Building the Classifier

In [15]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]
            
        if word in ham_params:
            p_ham_given_message *= ham_params[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [16]:
classify("Yo, text me back. I heard a secret about you know what.")

P(Spam|message): 8.18779525209903e-38
P(Ham|message): 1.0819808094604452e-32
Label: Ham


Here we can see that the message was classified as ham, because the probability of ham given the message was higher than the probability of spam given the message. 

Keep in mind that we didn't divide either value by the total probability of the message (the normalizing constant), which saves some computation. This does not affect the outcome of the classification: both values would be divided by the same constant, so their ordering stays the same.
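
To make this concrete, the sketch below normalizes the two values printed above so that they sum to 1. The variable names are hypothetical; the numbers are copied from the output of `classify`. The winning label is the same before and after normalization.

```python
# Unnormalized values printed by classify() for the first message.
p_spam_unnorm = 8.18779525209903e-38
p_ham_unnorm = 1.0819808094604452e-32

# The shared normalizing constant is the sum over both classes.
evidence = p_spam_unnorm + p_ham_unnorm

p_spam_post = p_spam_unnorm / evidence
p_ham_post = p_ham_unnorm / evidence

print(p_spam_post, p_ham_post)

# Ordering is unchanged by the division.
print((p_ham_unnorm > p_spam_unnorm) == (p_ham_post > p_spam_post))  # True
```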

In [17]:
classify("Text back this super secret code to win money! :7845")

P(Spam|message): 8.670425836204817e-28
P(Ham|message): 9.204334989396223e-31
Label: Spam


Here we can see that the message was classified as spam, because the probability of spam given the message was higher than the probability of ham given the message. 

Once again, the two values are unnormalized: skipping the division by the total probability of the message does not affect which label wins.

## Measuring the model's accuracy using the Test Dataset

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

In [18]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]

        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [19]:
#Creating classification column in dataset to use for accuracy test

test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


It looks like the first five messages were classified correctly! Let's check to see how it did on the entire test dataset.

In [20]:
#Counting the number of correct classifications (vectorized comparison instead of a row loop)
correct = (test_set['Label'] == test_set['predicted']).sum()

In [21]:
#Finding Accuracy Score

total = test_set.shape[0]
accuracy = correct / total

print('Correct Classifications:', correct)
print('Incorrect Classifications:', (total - correct))
print('Accuracy Score:', "{:,.2f}%".format(accuracy *100))

Correct Classifications: 1100
Incorrect Classifications: 14
Accuracy Score: 98.74%
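
Because roughly 87% of the messages are ham, a filter that always answered "ham" would already score about 87% accuracy, so it may be useful to look at the two kinds of error separately. The sketch below counts them on made-up labels; the lists `actual` and `predicted` are hypothetical stand-ins for `test_set['Label']` and `test_set['predicted']`, and precision and recall are computed for the spam class.

```python
from collections import Counter

# Hypothetical true and predicted labels.
actual    = ['ham', 'ham',  'spam', 'ham', 'spam', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham',  'spam']

# Count each (actual, predicted) pair.
confusion = Counter(zip(actual, predicted))

true_pos = confusion[('spam', 'spam')]   # spam caught
false_pos = confusion[('ham', 'spam')]   # ham wrongly flagged as spam
false_neg = confusion[('spam', 'ham')]   # spam that slipped through

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(dict(confusion))
print(precision, recall)
```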


## Conclusion

Our Naive Bayes classifier turned out to be extremely accurate, with an accuracy score over 98%, which is well above the original goal of 80%! This shows the power of Naive Bayes in text classification, despite its pitfall of assuming independence among variables.

Next steps include:

* Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly.

In [24]:
# Flag each row as correct or incorrect. Assigning to the rows yielded by
# iterrows() is not guaranteed to write back to the dataframe, so compare columns directly.
test_set['Correct?'] = test_set['Label'] == test_set['predicted']

In [36]:
incorrect_classifications = test_set.loc[~test_set['Correct?']]

for _, message in incorrect_classifications.iterrows():
    print(message['SMS'])
    print(message['predicted'])
    print('\n')

Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
ham


More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
ham


Unlimited texts. Limited minutes.
spam


26th OF JULY
spam


Nokia phone is lovly..
spam


A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8
needs human classification


No calls..messages..missed calls
spam


We have sent JD for Customer Service cum Accounts Executive to ur m

It looks like some short messages were misclassified as spam, and some spam messages with URLs and unconventional words were classified as ham. However, the benefit of fixing these misclassifications through increased model complexity does not justify the labor cost; 98% accuracy is sufficient for this use case.
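
One of the misclassified messages above is very long and was labeled "needs human classification". An exact tie between two products of many small probabilities is consistent with both values underflowing to 0.0, although that is an assumption and was not verified here. A common way around this is to add logarithms instead of multiplying probabilities. The sketch below illustrates the idea with made-up parameters; `product_score`, `log_score`, and the toy dictionaries are hypothetical and not part of the classifier above.

```python
import math


def product_score(words, prior, params):
    """Multiply probabilities directly, as the classifier above does."""
    score = prior
    for word in words:
        if word in params:
            score *= params[word]
    return score


def log_score(words, prior, params):
    """Sum log-probabilities; the ordering of the classes is preserved."""
    score = math.log(prior)
    for word in words:
        if word in params:
            score += math.log(params[word])
    return score


# Hypothetical parameters and an artificially long message.
toy_spam_params = {'free': 1e-3}
toy_ham_params = {'free': 1e-4}
long_message = ['free'] * 120

# Both products underflow to exactly 0.0, which produces a tie.
print(product_score(long_message, 0.13, toy_spam_params))  # 0.0
print(product_score(long_message, 0.87, toy_ham_params))   # 0.0

# The log scores stay finite and can still be compared.
spam_log = log_score(long_message, 0.13, toy_spam_params)
ham_log = log_score(long_message, 0.87, toy_ham_params)
print(spam_log > ham_log)  # True
```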