# Naïve Bayes Spam Classifier

Probability is a powerful tool that lets us answer interesting questions about data, and it serves as the foundation of a commonly used machine learning technique for classification: Naïve Bayes. Here we'll build a Naïve Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
from sklearn.model_selection import train_test_split

## Building a Naïve Bayes Classifier from Scratch

### SMS Data

The data was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

In [2]:
data = pd.read_csv('../data/csv/sms.csv.gz', compression='gzip', sep='\t', header=None, names=['Label', 'SMS'])
data

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


### Basic Data Statistics

Let us see what fraction of the dataset is ham, and what fraction of the dataset is spam.

In [3]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 86.6% of messages are ham, and 13.4% of messages are labeled spam.

### Splitting Training and Testing Data

Now, let's split our data into training and testing sets. A common split is 80% of the data in the training set and 20% in the test set.

In [4]:
(X_train, X_test, y_train, y_test) = train_test_split(data['SMS'], data['Label'], test_size=0.2, random_state=42)

training_data = pd.DataFrame()
training_data['SMS'] = X_train
training_data['Label'] = y_train

testing_data = pd.DataFrame()
testing_data['SMS'] = X_test
testing_data['Label'] = y_test

As a little sanity check, let's verify that the percentages of spam and non-spam in the training and testing sets are close to those in the full dataset:

In [5]:
training_data['Label'].value_counts(normalize=True)

ham     0.865829
spam    0.134171
Name: Label, dtype: float64

In [6]:
testing_data['Label'].value_counts(normalize=True)

ham     0.866368
spam    0.133632
Name: Label, dtype: float64

### Data Cleaning

Let's do some data cleaning:
1. replace every non-word character (punctuation and other symbols) with a space
2. make all words lowercase

In [7]:
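# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_data['SMS'] = training_data['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()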
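
To see what this cleaning step does to a single message, here is a small sketch using Python's `re` module on a made-up string (the sample text is an assumption, not taken from the dataset). Note that `\W` matches every non-word character, so the apostrophe is replaced too and `It's` ends up as two tokens.

```python
import re

# Hypothetical sample message, invented for illustration
sample = "Hello, World!! It's 2 EASY"

# Replace each non-word character with a space, then lowercase
cleaned = re.sub(r'\W', ' ', sample).lower()

# Splitting on whitespace collapses the runs of spaces left behind
tokens = cleaned.split()
print(tokens)  # ['hello', 'world', 'it', 's', '2', 'easy']
```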

Let's now collect the unique set of words that occur in the training data, otherwise known as the **vocabulary**.

In [8]:
vocabulary = []

# make a word list from each SMS
training_data['SMS'] = training_data['SMS'].str.split()

# aggregate all word lists into one
for sms in training_data['SMS']:
    vocabulary += sms
    
# remove duplicates from the list
vocabulary = list(set(vocabulary))

We now compute how many times each word occurs in each SMS message.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts_df = pd.DataFrame(word_counts_per_sms)
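
The same per-message counts can also be gathered with `collections.Counter`, which stores only the words that actually occur in a message rather than a full column of zeros per word. This is a minimal sketch on two made-up tokenized messages (`toy_messages` is a hypothetical name), not a replacement for the cell above.

```python
from collections import Counter

# Hypothetical tokenized messages standing in for training_data['SMS']
toy_messages = [
    ['win', 'a', 'prize', 'win', 'now'],
    ['see', 'you', 'now'],
]

# One Counter per message: word -> number of occurrences in that message
counts_per_message = [Counter(message) for message in toy_messages]

# Words absent from a message count as 0 without being stored
print(counts_per_message[0]['win'])  # 2
print(counts_per_message[1]['win'])  # 0
```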

In [10]:
training_data = training_data.reset_index(drop=True)

In [11]:
training_data = pd.concat([training_data, word_counts_df], axis=1)
training_data

Unnamed: 0,SMS,Label,line,raglan,andrews,ipaditan,bruce,30apr,friendship,blackberry,...,wating,report,22,voila,ramaduth,08000776320,ummma,past,stayin,cc100p
0,"[reply, to, win, 100, weekly, where, will, the...",spam,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[hello, sort, of, out, in, town, already, that...",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[how, come, guoyang, go, n, tell, her, then, u...",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[hey, sathya, till, now, we, dint, meet, not, ...",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[orange, brings, you, ringtones, from, all, ti...",spam,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,"[hi, wlcome, back, did, wonder, if, you, got, ...",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,"[sorry, i, ll, call, later]",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,"[prabha, i, m, soryda, realy, frm, heart, i, m...",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,"[nt, joking, seriously, i, told]",ham,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Training the Model

In [12]:
# splitting the dataframe into spam and ham
spam = training_data[training_data['Label'] == 'spam']
ham = training_data[training_data['Label'] == 'ham']

In [13]:
# calculating the probability of spam or ham (our prior probabilities for Bayes' Rule)
p_spam = len(spam) / len(training_data)
p_ham = len(ham) / len(training_data)
p_spam

0.13417096701817366

In [14]:
# calculating the total number of words in spam messages
n_spam = 0

for sms in spam['SMS']:
    len_sms = len(sms)
    n_spam += len_sms

In [15]:
# calculating the total number of words in ham messages
n_ham = 0

for sms in ham['SMS']:
    len_sms = len(sms)
    n_ham += len_sms

In [16]:
# calculating number of unique words in training data
n_vocabulary = len(vocabulary)

In [17]:
# Laplace (additive) smoothing parameter, so that words unseen in a class do not get zero probability
alpha = 1

In [18]:
# initialize two dictionaries to store probabilities
spam_dict = {word : 0 for word in vocabulary}
ham_dict = {word : 0 for word in vocabulary}

In [19]:
# create list of all words in spam and ham messages

spam_words = []
for sms in spam['SMS']:
    spam_words += sms
    
ham_words = []
for sms in ham['SMS']:
    ham_words += sms

We now have everything we need to calculate $P(x_i\ |\ y = Spam)$ and $P(x_i\ |\ y = Ham)$ for all words $x_i$ in the vocabulary.
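
Written out, the smoothed estimate for the spam class is shown below. The $N$ symbols are notation introduced here for illustration: $N_{x_i | Spam}$ is the number of times word $x_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages (`n_spam`), $N_{Vocabulary}$ is the vocabulary size (`n_vocabulary`), and $\alpha$ is the smoothing parameter (`alpha`).

$$
P(x_i\ |\ y = Spam) = \frac{N_{x_i | Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

The ham case is analogous, with the ham counts and `n_ham` in place of the spam quantities. With $\alpha = 1$, a word that never occurs in spam still gets a small nonzero probability, so a single unseen word cannot zero out the whole product.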

In [20]:
for unique_word in vocabulary:
    # calculating how many times each word in vocabulary occurs in spam and ham messages
    spam_count = 0
    ham_count = 0
    for word in spam_words:
        if word == unique_word:
            spam_count += 1
    
    for word in ham_words:
        if word == unique_word:
            ham_count += 1
            
    # calculate the smoothed probability of the word given that the message is spam / ham
    p_word_spam = (spam_count + alpha) / (n_spam + alpha * n_vocabulary)
    p_word_ham = (ham_count + alpha) / (n_ham + alpha * n_vocabulary)
    
    # finally update dictionaries with their respective probabilities
    spam_dict[unique_word] = p_word_spam
    ham_dict[unique_word] = p_word_ham
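
The nested loops above rescan every spam and ham word once per vocabulary entry. As a possible alternative, the counts can be gathered in a single pass with `collections.Counter`. The sketch below uses tiny made-up word lists (all `toy_` names are hypothetical) and applies the same smoothing formula.

```python
from collections import Counter

# Hypothetical stand-ins for spam_words, ham_words and alpha
toy_spam_words = ['win', 'prize', 'win', 'call']
toy_ham_words = ['call', 'me', 'later', 'ok', 'later']
toy_alpha = 1

toy_vocabulary = set(toy_spam_words) | set(toy_ham_words)
n_toy_vocabulary = len(toy_vocabulary)

# Count every word once instead of rescanning the lists per vocabulary entry
spam_counter = Counter(toy_spam_words)
ham_counter = Counter(toy_ham_words)

def smoothed_likelihoods(counter, n_words):
    """Return P(word | class) with additive smoothing for each vocabulary word."""
    denominator = n_words + toy_alpha * n_toy_vocabulary
    return {word: (counter[word] + toy_alpha) / denominator for word in toy_vocabulary}

toy_spam_dict = smoothed_likelihoods(spam_counter, len(toy_spam_words))
toy_ham_dict = smoothed_likelihoods(ham_counter, len(toy_ham_words))

print(toy_spam_dict['win'])  # (2 + 1) / (4 + 1 * 6)
print(toy_ham_dict['win'])   # (0 + 1) / (5 + 1 * 6)
```

Within each class the smoothed probabilities sum to 1 over the vocabulary, which is a quick way to check the arithmetic.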
            

We now define our spam filter function:

In [21]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # calculate the spam score: P(Spam) times the product of P(word | Spam)
    p_spam_given_message = p_spam
    for word in message:
        if word in vocabulary:
            prob = spam_dict[word]
        else:
            prob = alpha / (n_spam + alpha * n_vocabulary)
            
        p_spam_given_message *= prob

    # calculate the ham score: P(Ham) times the product of P(word | Ham)
    p_ham_given_message = p_ham
    for word in message:
        if word in vocabulary:
            prob = ham_dict[word]
        else:
            prob = alpha / (n_ham + alpha * n_vocabulary)
            
        p_ham_given_message *= prob
    
    # these are not technically conditional probabilities.
    # these are the two quantities that Naive Bayes compares: prior times likelihood.
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
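
Multiplying many probabilities that are each far below 1 can underflow to `0.0` for long messages, at which point both scores tie. A common remedy is to compare sums of logarithms instead. The following is a minimal sketch under assumed toy values (the `toy_` names and their numbers are hypothetical, not the values learned above).

```python
import math

# Hypothetical prior and per-word likelihoods, invented for illustration
toy_p_spam = 0.13
toy_word_probs_spam = [1e-4] * 100  # a 100-word message

# Direct product: 0.13 * (1e-4)**100 is below the smallest positive float
product_score = toy_p_spam
for prob in toy_word_probs_spam:
    product_score *= prob

# Log-space score: log(prior) + sum of log(likelihoods) stays finite
log_score = math.log(toy_p_spam) + sum(math.log(p) for p in toy_word_probs_spam)

print(product_score)  # underflows to 0.0
print(log_score)      # a finite negative number
```

Because the logarithm is monotonic, comparing log scores picks the same label as comparing the products whenever the products do not underflow.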

## Testing the Model

### Example Messages
Let's test the spam classifier on a few messages.

In [22]:
classify('Sounds good, Alex, then see u there')

P(Spam|message): 1.7102832995008655e-25
P(Ham|message): 8.156894573341021e-21
Label: Ham


In [23]:
classify('YOU WIN THE PRIZE MONEY JACKPOT! CALL 14')

P(Spam|message): 2.090971907616407e-24
P(Ham|message): 2.7511284153561355e-28
Label: Spam
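
The two printed values are unnormalized scores (prior times likelihood). Dividing each by their sum gives values that add up to 1 and can be read as posterior probabilities under the model's assumptions. A small sketch using the two scores printed above (`normalize_scores` is a hypothetical helper, not part of the classifier):

```python
def normalize_scores(spam_score, ham_score):
    """Turn two unnormalized scores into probabilities that sum to 1."""
    total = spam_score + ham_score
    return spam_score / total, ham_score / total

# Scores printed for 'YOU WIN THE PRIZE MONEY JACKPOT! CALL 14'
posterior_spam, posterior_ham = normalize_scores(
    2.090971907616407e-24, 2.7511284153561355e-28
)

print(posterior_spam)
print(posterior_ham)
```

For this message the spam posterior comes out above 0.999, which matches the label the classifier printed.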


### Accuracy on Test Set

With obvious spam and non-spam, the classifier seems to be working the way we would expect. Let's now properly evaluate model performance using the test data. We just need to update our function to actually return a label first, rather than print it.

In [24]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # calculate the spam score: P(Spam) times the product of P(word | Spam)
    p_spam_given_message = p_spam
    for word in message:
        if word in vocabulary:
            prob = spam_dict[word]
        else:
            prob = alpha / (n_spam + alpha * n_vocabulary)
            
        p_spam_given_message *= prob
     
    p_ham_given_message = p_ham
    for word in message:
        if word in vocabulary:
            prob = ham_dict[word]
        else:
            prob = alpha / (n_ham + alpha * n_vocabulary)
            
        p_ham_given_message *= prob

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [25]:
testing_data['predicted'] = testing_data['SMS'].apply(classify_test_set)
testing_data

Unnamed: 0,SMS,Label,predicted
3245,Squeeeeeze!! This is christmas hug.. If u lik ...,ham,ham
944,And also I've sorta blown him off a couple tim...,ham,ham
1044,Mmm thats better now i got a roast down me! i...,ham,ham
2484,Mm have some kanji dont eat anything heavy ok,ham,ham
812,So there's a ring that comes with the guys cos...,ham,ham
...,...,...,...
4264,Den only weekdays got special price... Haiz......,ham,ham
2439,I not busy juz dun wan 2 go so early.. Hee..,ham,ham
5556,Yes i have. So that's why u texted. Pshew...mi...,ham,ham
4205,How are you enjoying this semester? Take care ...,ham,ham


In [26]:
total = len(testing_data)
correct = 0

In [27]:
# count the rows where the predicted label matches the actual label
for _, row in testing_data.iterrows():
    actual_classification = row['Label']
    predicted_classification = row['predicted']
    if actual_classification == predicted_classification:
        correct += 1

In [28]:
accuracy = correct / total

print(accuracy)

0.9829596412556054
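
Accuracy alone can flatter a classifier on imbalanced data: since about 86.6% of the test messages are ham, always predicting ham would already be about 86.6% accurate. Precision and recall on the spam class give a fuller picture. The sketch below computes them in plain Python on a handful of made-up labels (`toy_actual` and `toy_predicted` are hypothetical, not results from the test set).

```python
# Hypothetical labels, invented for illustration only
toy_actual = ['spam', 'ham', 'ham', 'spam', 'ham', 'spam']
toy_predicted = ['spam', 'ham', 'spam', 'ham', 'ham', 'spam']

pairs = list(zip(toy_actual, toy_predicted))

# Confusion-matrix counts with spam as the positive class
true_positives = sum(1 for a, p in pairs if a == 'spam' and p == 'spam')
false_positives = sum(1 for a, p in pairs if a == 'ham' and p == 'spam')
false_negatives = sum(1 for a, p in pairs if a == 'spam' and p != 'spam')

# Precision: share of predicted spam that really is spam
precision = true_positives / (true_positives + false_positives)
# Recall: share of actual spam that was caught
recall = true_positives / (true_positives + false_negatives)

print(precision, recall)
```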
