# Building a Naive Bayes Classifier from Scratch

I am going to build a Naive Bayes classifier from scratch for this project. This classifier will be used to classify text messages as either spam or ham (safe). To compare the performance of my classifier, I will use the MultinomialNB classifier from scikit-learn.

## Exploring the Dataset

We will start by importing the libraries needed for the data analysis.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_colwidth = None

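Next we will load the dataset with the read_csv function and explore it. The file is tab-separated and has no header row, so we supply the column names ourselves.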

In [2]:
messages = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label", "SMS"])

In [3]:
messages.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [4]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
messages.describe(include="all")

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [6]:
messages["Label"].value_counts(normalize=True) * 100

Label
ham     86.593683
spam    13.406317
Name: proportion, dtype: float64

The messages dataset consists of 5,572 messages of which nearly 87% are ham messages and the remaining 13% are spam messages. There are no missing values in the dataset. Each observation consists of a label (ham or spam) and the message itself, which is a string of words. According to the data distribution and statistics there are only 5,169 unique messages and the most frequent message is: "Sorry, I'll call later". This message occurs 30 times.

It could be interesting to verify the duplicate messages and understand if the labels vary. If the duplicate messages do not add additional information, it would be best to remove them.

## Understanding Duplicates

In [7]:
sum(messages.duplicated())

403

In [8]:
sum(messages.duplicated(subset=["SMS"]))

403

The two results above tell us that every duplicate message in the dataset is a duplicate of both the label and the message text. We know this because the duplicated method returns the same number of duplicates with and without the subset argument.
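
To see why equal counts imply consistent labels, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset) in which one message text carries two different labels. In that situation the two counts no longer agree.

```python
import pandas as pd

# Hypothetical toy data: the same text appears three times, once with a conflicting label.
toy = pd.DataFrame(
    {"Label": ["ham", "spam", "ham"], "SMS": ["call me", "call me", "call me"]}
)

# Row-level duplicates: only the third row repeats an earlier (Label, SMS) pair.
full_dupes = int(toy.duplicated().sum())

# Text-level duplicates: the second and third rows both repeat an earlier SMS.
text_dupes = int(toy.duplicated(subset=["SMS"]).sum())

# Number of distinct labels attached to each message text.
labels_per_text = toy.groupby("SMS")["Label"].nunique()

print(full_dupes, text_dupes, int(labels_per_text.max()))
```

The counts differ here (1 versus 2) because the text-level check also flags the row with the conflicting label. When both counts are equal, as with 403 and 403 in our data, no repeated text can have a conflicting label.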

In order to verify this claim, we are going to review all duplicate messages and verify how many labels they have.

In [9]:
for m in messages.loc[messages.duplicated(), "SMS"]:
    if messages.loc[messages["SMS"] == m, "Label"].unique().size != 1:
        print("More than 1 label found")
        break
else:
    print("All duplicate messages have the same label")

All duplicate messages have the same label


The above code verifies that each duplicate message has the same label as the original message. We can now confidently remove all duplicate messages from the dataset, since they do not add any additional information.

In [10]:
messages_clean = messages.drop_duplicates(ignore_index=True)
messages_clean.shape

(5169, 2)

In [11]:
messages_clean.describe(include="all")

Unnamed: 0,Label,SMS
count,5169,5169
unique,2,5169
top,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
freq,4516,1


We have now effectively removed all duplicate messages and stored the result in the messages_clean dataframe. Running the describe method now reveals 5,169 unique messages out of a total of 5,169 entries.

## Building a Training and Test Set

Next we are going to build a training and test set. The most straightforward and popular way of doing so is to use the train_test_split function from scikit-learn.

In [12]:
from sklearn.model_selection import train_test_split

We are going to work with the messages_clean dataset, which contains 5,169 entries. With an 80/20 split, roughly 1,000 messages will be used for testing. This should be enough to get a sense of the performance of the classifier.

In [13]:
mc_train, mc_test = train_test_split(messages_clean, test_size=0.2, random_state=2)

Next we will verify the distribution of ham and spam messages in the new dataframes.

In [14]:
print(f"Training distribution: \n{mc_train['Label'].value_counts(normalize=True) * 100}")
print()
print(f"Test distribution: \n{mc_test['Label'].value_counts(normalize=True) * 100}")

Training distribution: 
Label
ham     87.255139
spam    12.744861
Name: proportion, dtype: float64

Test distribution: 
Label
ham     87.814313
spam    12.185687
Name: proportion, dtype: float64


The distribution of ham and spam messages is very similar in the training and test datasets: roughly 87% to 88% ham and 12% to 13% spam in each.
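
The split above is purely random, so the class balance in the two sets only matches approximately. If exact proportions matter, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up labels (the `labels` and `texts` lists are hypothetical); it is an optional variation, not what this notebook ran.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 ham and 20 spam, paired with placeholder texts.
labels = ["ham"] * 80 + ["spam"] * 20
texts = [f"message {i}" for i in range(len(labels))]

# stratify=labels asks scikit-learn to preserve the class proportions in both splits.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=2, stratify=labels
)

train_counts = Counter(labels_train)
test_counts = Counter(labels_test)
print(train_counts, test_counts)
```

With stratification both sets keep the 80/20 class ratio of the toy data exactly.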

In [15]:
print(f"The size of the training dataset is: {mc_train.shape[0]}")
print(f"The size of the test dataset is: {mc_test.shape[0]}")

The size of the training dataset is: 4135
The size of the test dataset is: 1034


## Transforming the Training Dataset

We are now going to transform the dataframe in order to work with some key formulas that are at the heart of the Naive Bayes Classifier.

$ P(Spam\vert w_1, w_2,..., w_n) \propto P(Spam) \cdot \prod_{i = 1}^{n} P(w_i\vert Spam) $

$ P(Ham\vert w_1, w_2,..., w_n) \propto P(Ham) \cdot \prod_{i = 1}^{n} P(w_i\vert Ham) $

$ P(w_i\vert Spam) = \frac {N_{w_i\vert Spam} + \alpha} {N_{Spam} + \alpha \cdot N_{Vocabulary}} $

$ P(w_i\vert Ham) = \frac {N_{w_i\vert Ham} + \alpha} {N_{Ham} + \alpha \cdot N_{Vocabulary}} $

The formulas above represent the Naive Bayes classifier. The first two describe the relationship between the posterior and prior probabilities. Given a new message, we have two competing probabilities: if the spam probability is higher than the ham probability, the message is labeled spam. The "naive" part of the classifier is visible in these first two equations, because conditional independence is assumed among all predictors (the words in the message). The last two formulas estimate the conditional probability of each word given the class, where $N_{w_i\vert Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in all spam messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary (and likewise for ham). The smoothing parameter alpha ensures that these probabilities are never zero, even when a word does not appear in any spam message or in any ham message. Alpha is a hyperparameter of the model.
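
To make the formulas concrete, here is a small worked sketch in plain Python. The four training messages, the equal priors of 0.5, and the helper names `p_word` and `score` are all made up for illustration; they are not part of the notebook's pipeline.

```python
from collections import Counter

# Hypothetical toy training data, already cleaned and tokenized.
spam_msgs = [["win", "free", "prize"], ["free", "cash"]]
ham_msgs = [["call", "me", "later"], ["see", "you", "later"]]

alpha = 1
spam_counts = Counter(w for m in spam_msgs for w in m)
ham_counts = Counter(w for m in ham_msgs for w in m)
vocab = set(spam_counts) | set(ham_counts)

n_spam = sum(spam_counts.values())  # total words in spam messages
n_ham = sum(ham_counts.values())    # total words in ham messages
n_vocab = len(vocab)


def p_word(word, counts, n_class):
    """Laplace-smoothed P(word | class); a Counter returns 0 for unseen words."""
    return (counts[word] + alpha) / (n_class + alpha * n_vocab)


def score(message, prior, counts, n_class):
    """Unnormalized posterior: the prior times the product of word likelihoods."""
    result = prior
    for word in message:
        result *= p_word(word, counts, n_class)
    return result


new_message = ["free", "later"]
spam_score = score(new_message, 0.5, spam_counts, n_spam)
ham_score = score(new_message, 0.5, ham_counts, n_ham)
print(spam_score, ham_score)
```

The word "later" never occurs in the toy spam messages, yet its smoothed probability is 1/14 rather than zero, so the spam score stays positive and the two scores can still be compared.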

We are now going to remove punctuation from the training data and force everything to lower case.

In [16]:
mc_train.head()

Unnamed: 0,Label,SMS
3963,ham,Buy one egg for me da..please:)
2905,ham,K..k...from tomorrow onwards started ah?
3165,ham,Oh great. I.ll disturb him more so that we can talk.
3271,spam,Bloomberg -Message center +447797706009 Why wait? Apply for your future http://careers. bloomberg.com
2533,ham,"Hi, can i please get a &lt;#&gt; dollar loan from you. I.ll pay you back by mid february. Pls."


In [17]:
mc_train_copy = mc_train.copy()

In [18]:
mc_train_copy["SMS"] = mc_train_copy["SMS"].str.replace(r"\W", " ", regex=True)

In [19]:
mc_train_copy.head()

Unnamed: 0,Label,SMS
3963,ham,Buy one egg for me da please
2905,ham,K k from tomorrow onwards started ah
3165,ham,Oh great I ll disturb him more so that we can talk
3271,spam,Bloomberg Message center 447797706009 Why wait Apply for your future http careers bloomberg com
2533,ham,Hi can i please get a lt gt dollar loan from you I ll pay you back by mid february Pls


In [20]:
mc_train_copy["SMS"] = mc_train_copy["SMS"].str.lower()

In [21]:
mc_train_copy.head()

Unnamed: 0,Label,SMS
3963,ham,buy one egg for me da please
2905,ham,k k from tomorrow onwards started ah
3165,ham,oh great i ll disturb him more so that we can talk
3271,spam,bloomberg message center 447797706009 why wait apply for your future http careers bloomberg com
2533,ham,hi can i please get a lt gt dollar loan from you i ll pay you back by mid february pls


We worked on a copy of the training data so that the original mc_train dataframe stays unmodified; it is needed again later for the scikit-learn comparison. We will continue to work with the mc_train_copy data.
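
The effect of the two cleaning steps can be reproduced on a single string with the standard-library `re` module. This sketch uses the first training message from the preview above; the variable names are only illustrative.

```python
import re

raw = "Buy one egg for me da..please:)"

# \W matches any character that is not a letter, digit, or underscore,
# so each punctuation character is replaced by one space.
no_punct = re.sub(r"\W", " ", raw)
cleaned = no_punct.lower()

print(repr(cleaned))
print(cleaned.split())
```

Each punctuation character becomes its own space, which leaves runs of spaces in the cleaned string. Calling `split()` without arguments treats any run of whitespace as one separator, so those extra spaces do not produce empty tokens.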

## Creating the Vocabulary for the Naive Bayes Classifier

We are now going to create a vocabulary, so we are able to track the words and calculate the necessary probabilities.

We will first convert each string into a list of words. Calling str.split() without arguments splits on any run of whitespace, which also discards the extra spaces left behind by the punctuation removal.

In [22]:
mc_train_copy["SMS"] = mc_train_copy["SMS"].str.split()

In [23]:
mc_train_copy.head()

Unnamed: 0,Label,SMS
3963,ham,"[buy, one, egg, for, me, da, please]"
2905,ham,"[k, k, from, tomorrow, onwards, started, ah]"
3165,ham,"[oh, great, i, ll, disturb, him, more, so, that, we, can, talk]"
3271,spam,"[bloomberg, message, center, 447797706009, why, wait, apply, for, your, future, http, careers, bloomberg, com]"
2533,ham,"[hi, can, i, please, get, a, lt, gt, dollar, loan, from, you, i, ll, pay, you, back, by, mid, february, pls]"


In [24]:
vocabulary = []

In [25]:
for m in mc_train_copy["SMS"]:
    for w in m:
        vocabulary.append(w)

vocabulary = list(set(vocabulary))

In [26]:
len(vocabulary)

7738

We looped over every word in every message using a nested loop and added every word to the vocabulary list. In order to remove duplicates from this list, we used the set() function and the list() function to convert the vocabulary to a list again. The final vocabulary list contains 7,738 unique words.
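
The same result can be obtained more compactly with a set comprehension. This is a sketch of an alternative, shown on two tokenized messages from the training preview; `vocabulary_sketch` is a hypothetical name chosen to avoid clashing with the notebook's `vocabulary`.

```python
# Two tokenized messages taken from the training preview above.
tokenized = [
    ["buy", "one", "egg", "for", "me", "da", "please"],
    ["k", "k", "from", "tomorrow", "onwards", "started", "ah"],
]

# A set comprehension collects each distinct word once; sorted() fixes the order.
vocabulary_sketch = sorted({word for message in tokenized for word in message})

print(len(vocabulary_sketch), vocabulary_sketch[:3])
```

Sorting is optional, but the iteration order of a set of strings can change between Python sessions, so a sorted vocabulary keeps the column order of later tables reproducible.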

## Final Training Set

We now want to create our final training set by converting the dataframe into a unique word count structure using the vocabulary. Essentially every row in the training set will have a counter for each unique word as a column.

To start with this conversion, we are going to create a dictionary structure that will count the occurrence for each unique word in the vocabulary per row of the training set. The structure will be a key-value pair, the key will be the unique word and the value a list of integers that represent the number of times the word occurs in the message.

In [27]:
word_counts_per_sms = {word: [0] * mc_train_copy.shape[0] for word in vocabulary} # Initialize the dictionary to all zeros
for w in word_counts_per_sms:
    print(w, word_counts_per_sms[w][:5])
    break

81303 [0, 0, 0, 0, 0]


In [28]:
for idx, words in enumerate(mc_train_copy["SMS"]):
    for w in words:
        word_counts_per_sms[w][idx] += 1

In [29]:
print(word_counts_per_sms["buy"][0])

1


We now have a dictionary that keeps track of all the unique words in the messages per message. We now convert this dictionary into a dataframe and merge it together with the training set.

In [30]:
wcps = pd.DataFrame(data=word_counts_per_sms)

In [31]:
wcps.loc[:5, "buy"]

0    1
1    0
2    0
3    0
4    0
5    0
Name: buy, dtype: int64

In [32]:
wcps.iloc[:5, :5]

Unnamed: 0,81303,huh,xxx,aboutas,drive
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0


In [33]:
wcps.shape

(4135, 7738)

We just converted word_counts_per_sms to a dataframe. This dataframe has the same number of rows as our training dataset (4,135) and a number of columns equal to the number of unique words in the vocabulary (7,738).

Finally, we merge the two dataframes as follows:
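
The dense table has 4,135 × 7,738, or about 32 million, cells, and almost all of them are zero. As a sketch of a lighter alternative (not the approach used in this notebook), one `collections.Counter` per message stores only the words that actually occur. The name `counts_per_sms` is hypothetical.

```python
from collections import Counter

# Two tokenized messages taken from the training preview above.
tokenized = [
    ["buy", "one", "egg", "for", "me", "da", "please"],
    ["k", "k", "from", "tomorrow", "onwards", "started", "ah"],
]

# One Counter per message stores only the words that actually occur in it.
counts_per_sms = [Counter(message) for message in tokenized]

# Missing words read as 0, so lookups behave like the dense table.
print(counts_per_sms[0]["buy"], counts_per_sms[1]["k"], counts_per_sms[1]["buy"])
```

The dense dataframe remains convenient for column-wise sums, which is why the notebook keeps it; the sparse form mainly matters when the vocabulary or the number of messages grows.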

In [99]:
mc_train_conv = pd.concat([mc_train_copy.reset_index(drop=True), wcps], axis=1)

In [100]:
mc_train_conv.shape

(4135, 7740)

In [101]:
mc_train_conv[["SMS", "buy", "one", "egg", "for", "me", "da", "please"]].head()

Unnamed: 0,SMS,buy,one,egg,for,me,da,please
0,"[buy, one, egg, for, me, da, please]",1,1,1,1,1,1,1
1,"[k, k, from, tomorrow, onwards, started, ah]",0,0,0,0,0,0,0
2,"[oh, great, i, ll, disturb, him, more, so, that, we, can, talk]",0,0,0,0,0,0,0
3,"[bloomberg, message, center, 447797706009, why, wait, apply, for, your, future, http, careers, bloomberg, com]",0,0,0,1,0,0,0
4,"[hi, can, i, please, get, a, lt, gt, dollar, loan, from, you, i, ll, pay, you, back, by, mid, february, pls]",0,0,0,0,0,0,1


Since pd.concat aligns the dataframes on their index, the index of the training dataset had to be reset before merging. To confirm that the merge worked, we checked the shape (4,135 rows and 7,738 + 2 = 7,740 columns) and inspected the contents: the first row correctly shows a count of 1 for each of its words.
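
The following sketch shows what would go wrong without the index reset. The two small frames are made up; `left` mimics the shuffled row labels of the training split and `right` mimics the freshly built count table.

```python
import pandas as pd

# Hypothetical toy frames: `left` keeps shuffled row labels, `right` starts at 0.
left = pd.DataFrame({"SMS": ["first", "second"]}, index=[3963, 2905])
right = pd.DataFrame({"buy": [1, 0]})

# Without a reset, concat aligns on index labels and pads mismatches with NaN.
misaligned = pd.concat([left, right], axis=1)

# After the reset both frames share the labels 0 and 1, so the rows line up.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

The misaligned result has four rows instead of two, each with a missing value, which is why checking the shape after a concat is a cheap and effective sanity test.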

## Calculating the Constants of the Naive Bayes Classifier

There are a number of constants that were defined above in the Naive Bayes Classifier logic. These constants will now be determined. 

First we will calculate the prior probabilities as follows:

In [102]:
p_spam = float(mc_train_copy["Label"].value_counts(normalize=True)["spam"])
p_spam

0.12744860943168076

In [103]:
p_ham = float(mc_train_copy["Label"].value_counts(normalize=True)["ham"])
p_ham

0.8725513905683192

Next we will calculate the total number of words in the vocabulary as well as the total number of words in all spam training messages and the total number of words in all ham training messages.

In [104]:
n_vocabulary = len(vocabulary)
n_vocabulary

7738

In [105]:
n_spam = sum(mc_train_copy.loc[mc_train_copy["Label"] == "spam", "SMS"].apply(lambda x: len(x)))
n_spam

13348

In [106]:
n_ham = sum(mc_train_copy.loc[mc_train_copy["Label"] == "ham", "SMS"].apply(lambda x: len(x)))
n_ham

53049

Finally, the smoothing parameter alpha is set to 1 (Laplace smoothing):

In [107]:
alpha = 1

## Calculating the Parameters

Next we are going to calculate the conditional probabilities for each word and ham or spam message status. These probabilities are needed to calculate the posterior probabilities. We will use a dictionary data-structure for this exercise.

In [108]:
p_w_spams = {word: 0 for word in vocabulary}
p_w_hams = {word: 0 for word in vocabulary}

In [109]:
train_ham = mc_train_conv[mc_train_conv["Label"] == "ham"]
train_spam = mc_train_conv[mc_train_conv["Label"] == "spam"]

In [110]:
for w in vocabulary:
    n_w_spam = sum(train_spam[w])
    p_w_spam = (n_w_spam + alpha) / (n_spam + alpha * n_vocabulary)
    p_w_spams[w] = p_w_spam

    n_w_ham = sum(train_ham[w])
    p_w_ham = (n_w_ham + alpha) / (n_ham + alpha * n_vocabulary)
    p_w_hams[w] = p_w_ham

In [111]:
p_w_spams["egg"]

4.742483164184767e-05

In [112]:
p_w_hams["egg"]

6.58035435208186e-05

The above two lines of code represent the conditional probability of seeing the word "egg" in either a spam or ham message respectively based on the training data. Reviewing the probabilities, the word "egg" is more likely to appear in a ham message than a spam message.
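
As a sanity check, the two probabilities can be reproduced by hand from the constants computed earlier. The raw counts for "egg" are not printed in this notebook; the values 0 (spam) and 3 (ham) used below are an assumption obtained by back-solving the printed probabilities, and they are the only whole numbers consistent with them.

```python
import math

# Constants reported earlier in the notebook.
n_vocabulary = 7738
n_spam = 13348
n_ham = 53049
alpha = 1

# Assumed raw counts for "egg", inferred by back-solving the printed probabilities.
n_egg_spam = 0
n_egg_ham = 3

p_egg_spam = (n_egg_spam + alpha) / (n_spam + alpha * n_vocabulary)
p_egg_ham = (n_egg_ham + alpha) / (n_ham + alpha * n_vocabulary)

print(p_egg_spam, p_egg_ham)
```

Under this assumption, "egg" never occurs in a spam training message, so its spam probability comes entirely from the smoothing term alpha.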

## Classifying a New Message Using the NB Algorithm

Next we will implement the NB classifier as a custom function and apply it to the test dataset, which we have not touched yet. Based on the predictions for the test dataset, we can review the performance of the model with standard classification metrics (accuracy, confusion matrix).

In [126]:
import re

In [135]:
def NBClassifier(message):

    # Perform similar datacleaning on the message as before starting with removing punctuation
    message = re.sub(r"\W", " ", message)

    # Force every word to lowercase
    message = message.lower()

    # Split the message on each space to create a list of words
    message = message.split()

    # Next we will calculate the posterior probabilities using the data from the training set
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Looping over every word in the message, if the word is in the conditional probabilities dictionary, update the posterior probabilities
    for w in message:
        if w in p_w_spams:
            p_spam_given_message *= p_w_spams[w]
            p_ham_given_message *= p_w_hams[w]

    # If the ham probability is greater than spam, label ham, otherwise spam. If the probabilities are equal, human classification is required.
    if p_ham_given_message > p_spam_given_message:
        return "ham"
    elif p_ham_given_message < p_spam_given_message:
        return "spam"
    else:
        return "Equal probabilities, needs human classification"
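
One caveat of multiplying many small probabilities is floating-point underflow. SMS messages are short, so this is unlikely to matter here, but for longer texts both products can collapse to zero and the function would report equal probabilities. A common remedy is to sum logarithms instead. The sketch below uses made-up likelihoods and priors purely to demonstrate the effect.

```python
import math

# Hypothetical per-word likelihoods for a long message of 100 known words.
word_probs_spam = [1e-4] * 100
word_probs_ham = [1e-5] * 100
prior_spam, prior_ham = 0.13, 0.87

# Direct multiplication underflows: both products collapse to exactly 0.0.
prod_spam = prior_spam * math.prod(word_probs_spam)
prod_ham = prior_ham * math.prod(word_probs_ham)

# Summing logarithms keeps the two scores distinguishable.
log_spam = math.log(prior_spam) + sum(math.log(p) for p in word_probs_spam)
log_ham = math.log(prior_ham) + sum(math.log(p) for p in word_probs_ham)

print(prod_spam, prod_ham)
print(log_spam > log_ham)
```

Because the logarithm is monotonic, comparing the log scores gives the same decision as comparing the original products whenever those products are representable.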

We are now going to apply this function to the text messages in the test dataset and create a new column with the predicted labels.

In [128]:
mc_test.head()

Unnamed: 0,Label,SMS
2360,ham,i cant talk to you now.i will call when i can.dont keep calling.
4365,spam,"Customer service announcement. We recently tried to make a delivery to you but were unable to do so, please call 07099833605 to re-schedule. Ref:9280114"
4500,ham,Im good! I have been thinking about you...
3916,ham,Ok... But they said i've got wisdom teeth hidden inside n mayb need 2 remove.
4714,ham,Hey next sun 1030 there's a basic yoga course... at bugis... We can go for that... Pilates intro next sat.... Tell me what time you r free


We will first reset the index of this dataframe to avoid potential issues with the apply method.

In [129]:
mc_test_copy = mc_test.copy()

In [130]:
mc_test_copy = mc_test_copy.reset_index(drop=True)
mc_test_copy.head()

Unnamed: 0,Label,SMS
0,ham,i cant talk to you now.i will call when i can.dont keep calling.
1,spam,"Customer service announcement. We recently tried to make a delivery to you but were unable to do so, please call 07099833605 to re-schedule. Ref:9280114"
2,ham,Im good! I have been thinking about you...
3,ham,Ok... But they said i've got wisdom teeth hidden inside n mayb need 2 remove.
4,ham,Hey next sun 1030 there's a basic yoga course... at bugis... We can go for that... Pilates intro next sat.... Tell me what time you r free


In [131]:
mc_test_copy["Predicted"] = mc_test_copy["SMS"].apply(NBClassifier)

In [132]:
mc_test_copy.head()

Unnamed: 0,Label,SMS,Predicted
0,ham,i cant talk to you now.i will call when i can.dont keep calling.,ham
1,spam,"Customer service announcement. We recently tried to make a delivery to you but were unable to do so, please call 07099833605 to re-schedule. Ref:9280114",spam
2,ham,Im good! I have been thinking about you...,ham
3,ham,Ok... But they said i've got wisdom teeth hidden inside n mayb need 2 remove.,ham
4,ham,Hey next sun 1030 there's a basic yoga course... at bugis... We can go for that... Pilates intro next sat.... Tell me what time you r free,ham


We can clearly see that we have now generated predicted labels for unseen test data for each message. In order to understand how well our classifier performs, we can look at accuracy metrics. We are going to use the scikit-learn library for this assessment.

## Measuring the Classifier's Accuracy

In [134]:
from sklearn.metrics import accuracy_score, confusion_matrix

We are now going to store the actual labels of the messages in y_true and the predicted labels in y_pred and calculate all the performance metrics.

In [136]:
y_true = mc_test_copy["Label"]
y_pred = mc_test_copy["Predicted"]

In [138]:
accuracy = accuracy_score(y_true, y_pred)
print(f"The accuracy of our custom Naive Bayes Classifier is {accuracy * 100:.2f}%")

The accuracy of our custom Naive Bayes Classifier is 98.26%


This accuracy number means that on unseen data the classifier predicts the right label in just over 98% of the cases, which is quite good. Keep in mind that about 88% of the test messages are ham, so a classifier that always predicts ham would already reach roughly 88% accuracy. Several assumptions and decisions made along the way could lower the accuracy of the classifier. Words were simply split on whitespace after punctuation was removed; we could have done something more sophisticated, such as keeping website URLs and other formats intact. We also ignored any words in the test data that do not occur in the training vocabulary.

To get a better grasp on the performance, we are going to look at the confusion_matrix next.

In [139]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel().tolist()
(tn, fp, fn, tp)

(905, 3, 15, 111)

The confusion matrix reveals that 905 messages were correctly classified as ham and 111 messages were correctly classified as spam. 3 ham messages were incorrectly labeled spam (false positives) and 15 spam messages were incorrectly labeled ham (false negatives). There are more false negatives than false positives, so let's dive a little deeper.

In [143]:
precision = tp / (tp + fp)
print(f"The precision of our custom Naive Bayes Classifier is {precision * 100:.2f}%")

The precision of our custom Naive Bayes Classifier is 97.37%


In [141]:
recall = tp / (fn + tp)
print(f"The recall of our custom Naive Bayes Classifier is {recall * 100:.2f}%")

The recall of our custom Naive Bayes Classifier is 88.10%


In [142]:
f1_score = 2 * precision * recall / (precision + recall)
print(f"The F1-score of our custom Naive Bayes Classifier is {f1_score * 100:.2f}%")

The F1-score of our custom Naive Bayes Classifier is 92.50%


These metrics confirm that false negatives (Type II errors) are a bigger problem than false positives (Type I errors), since the recall (88.10%) is much lower than the precision (97.37%). The F1-score balances the two and comes out at 92.5%, which is not too bad.

To gain deeper insight, we can review the messages that were incorrectly classified in our test dataset.
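
The manual calculations can be cross-checked with scikit-learn's own metric functions. This sketch rebuilds label lists from the confusion-matrix counts reported above (the `*_sketch` and `*_sk` names are hypothetical) and accesses the functions through the `metrics` module so that the notebook's `f1_score` variable is not overwritten.

```python
from sklearn import metrics

# Rebuild label lists from the confusion-matrix counts reported above.
tn, fp, fn, tp = 905, 3, 15, 111
y_true_sketch = ["ham"] * (tn + fp) + ["spam"] * (fn + tp)
y_pred_sketch = ["ham"] * tn + ["spam"] * fp + ["ham"] * fn + ["spam"] * tp

# pos_label tells scikit-learn that "spam" is the positive class.
precision_sk = metrics.precision_score(y_true_sketch, y_pred_sketch, pos_label="spam")
recall_sk = metrics.recall_score(y_true_sketch, y_pred_sketch, pos_label="spam")
f1_sk = metrics.f1_score(y_true_sketch, y_pred_sketch, pos_label="spam")

print(f"{precision_sk:.4f} {recall_sk:.4f} {f1_sk:.4f}")
```

Passing `pos_label="spam"` matters because the labels are strings; without it, scikit-learn cannot know which class counts as positive.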

In [144]:
mc_test_copy[mc_test_copy["Label"] != mc_test_copy["Predicted"]]

Unnamed: 0,Label,SMS,Predicted
23,ham,Anytime...,spam
26,spam,"SMS. ac sun0819 posts HELLO:""You seem cool, wanted to say hi. HI!!!"" Stop? Send STOP to 62468",ham
33,spam,Filthy stories and GIRLS waiting for your,ham
123,spam,"Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!",ham
186,spam,thesmszone.com lets you send free anonymous and masked messages..im sending this message from there..do you see the potential for abuse???,ham
325,spam,"Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?",ham
380,ham,Finally the match heading towards draw as your prediction.,spam
403,spam,"Did you hear about the new ""Divorce Barbie""? It comes with all of Ken's stuff!",ham
407,spam,"Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy?",ham
425,ham,No calls..messages..missed calls,spam


One thing that stands out is the removal of punctuation and uppercase that is sometimes very distinct for spam messages. This removes some of the easy to recognize signals for spam messages and could contribute to the high Type II error. A smarter algorithm that leaves message syntax intact might perform better.

Next we will compare the performance of our custom classifier to the classifier in the scikit-learn package: MultinomialNB.

## Comparing Custom Classifier Performance to MultinomialNB from Scikit-learn

In [145]:
from sklearn.naive_bayes import MultinomialNB

In [146]:
NB = MultinomialNB(alpha=1, force_alpha=True, fit_prior=True, class_prior=None)

The NB classifier has now been instantiated with the same Laplace smoothing parameter (alpha = 1) as our custom classifier. With fit_prior=True the classifier learns the class prior probabilities from the training data instead of assuming a uniform distribution.

Next we need to generate X_train, X_test, y_train and y_test, which should contain the same messages our custom classifier was trained and tested on. We are going to use the unmodified mc_train and mc_test data for this, since we want the scikit-learn pipeline to do its own preprocessing rather than reuse our data manipulation steps.

In [147]:
mc_train.head()

Unnamed: 0,Label,SMS
3963,ham,Buy one egg for me da..please:)
2905,ham,K..k...from tomorrow onwards started ah?
3165,ham,Oh great. I.ll disturb him more so that we can talk.
3271,spam,Bloomberg -Message center +447797706009 Why wait? Apply for your future http://careers. bloomberg.com
2533,ham,"Hi, can i please get a &lt;#&gt; dollar loan from you. I.ll pay you back by mid february. Pls."


In [148]:
mc_test.head()

Unnamed: 0,Label,SMS
2360,ham,i cant talk to you now.i will call when i can.dont keep calling.
4365,spam,"Customer service announcement. We recently tried to make a delivery to you but were unable to do so, please call 07099833605 to re-schedule. Ref:9280114"
4500,ham,Im good! I have been thinking about you...
3916,ham,Ok... But they said i've got wisdom teeth hidden inside n mayb need 2 remove.
4714,ham,Hey next sun 1030 there's a basic yoga course... at bugis... We can go for that... Pilates intro next sat.... Tell me what time you r free


The data clearly still has punctuation and is unaltered. Let's split the data into X and y:

In [154]:
X_train = mc_train["SMS"]
X_test = mc_test["SMS"]
y_train = mc_train["Label"]
y_test = mc_test["Label"]

The NB classifier cannot work with strings directly, so X_train and X_test need to be vectorized first. We will use a scikit-learn pipeline to chain the vectorization and the model fitting together.

In [155]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

In [156]:
model = make_pipeline(TfidfVectorizer(stop_words="english"), NB)

Next we will fit our model. The pipeline first converts each message into a bag-of-words vector with TF-IDF weights and then fits the NB model on the training data.

In [157]:
model.fit(X_train, y_train)

Pipeline
Parameter,Value
steps,"[('tfidfvectorizer', ...), ('multinomialnb', ...)]"
transform_input,None
memory,None
verbose,False

TfidfVectorizer
Parameter,Value
input,'content'
encoding,'utf-8'
decode_error,'strict'
strip_accents,None
lowercase,True
preprocessor,None
tokenizer,None
analyzer,'word'
stop_words,'english'
token_pattern,'(?u)\\b\\w\\w+\\b'

MultinomialNB
Parameter,Value
alpha,1
force_alpha,True
fit_prior,True
class_prior,None


In [158]:
y_pred_NB = model.predict(X_test)

In [160]:
accuracy_NB = accuracy_score(y_test, y_pred_NB)
print(f"The accuracy of the scikit-learn NB classifier is {accuracy_NB * 100:.2f}%")

The accuracy of the scikit-learn NB classifier is 96.23%


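The accuracy of the scikit-learn NB classifier is actually about 2 percentage points lower than that of our custom one (96.23% versus 98.26%)! Note that the comparison is not strictly like-for-like: this pipeline uses TF-IDF weights and removes English stop words, while our custom classifier uses raw word counts. Next we will look at the confusion matrix to see how the Type I and Type II errors compare.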

In [165]:
tn_NB, fp_NB, fn_NB, tp_NB = confusion_matrix(y_test, y_pred_NB).ravel().tolist()
(tn_NB, fp_NB, fn_NB, tp_NB)

(908, 0, 39, 87)

This classifier produces no false positives (Type I errors) at all! It does produce more false negatives (Type II errors) than our custom version: 39 spam messages were misclassified as ham, compared with 15. That corresponds to a recall of about 69%.

We can review these cases again by selecting the rows of the test dataset where y_test and y_pred_NB are not equal.
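
Part of the difference between the two classifiers may come from the features rather than from the model itself. One way to get closer to a like-for-like comparison would be to feed MultinomialNB raw word counts without a stop-word list. The sketch below only shows the wiring on a made-up four-message corpus; whether it narrows the gap on the real data has not been tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only to show the pipeline wiring.
train_texts = ["win free prize now", "free cash prize", "call me later", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

# Raw counts, no stop-word list, and a token pattern that keeps one-character
# words, which is closer to the preprocessing of the custom classifier.
count_model = make_pipeline(
    CountVectorizer(token_pattern=r"(?u)\b\w+\b"),
    MultinomialNB(alpha=1),
)
count_model.fit(train_texts, train_labels)

predictions = count_model.predict(["free prize", "lunch later"])
print(predictions.tolist())
```

On the real data, the same pipeline would be fitted on X_train and y_train and evaluated on X_test in place of the TF-IDF version.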

In [166]:
mc_test[y_test != y_pred_NB]

Unnamed: 0,Label,SMS
661,spam,"SMS. ac sun0819 posts HELLO:""You seem cool, wanted to say hi. HI!!!"" Stop? Send STOP to 62468"
935,spam,Filthy stories and GIRLS waiting for your
1928,spam,TheMob>Yo yo yo-Here comes a new selection of hot downloads for our members to get for FREE! Just click & open the next link sent to ur fone...
862,spam,Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.
54,spam,SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice Hockey. Correct or Incorrect? End? Reply END SPTV
2177,spam,Not heard from U4 a while. Call 4 rude chat private line 01223585334 to cum. Wan 2C pics of me gettin shagged then text PIX to 8552. 2End send STOP 8552 SAM xxx
4437,spam,Cashbin.co.uk (Get lots of cash this weekend!) www.cashbin.co.uk Dear Welcome to the weekend We have got our biggest and best EVER cash give away!! These..
2160,spam,"Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!"
2563,spam,"New Tones This week include: 1)McFly-All Ab.., 2) Sara Jorge-Shock.. 3) Will Smith-Switch.. To order follow instructions on next message"
798,spam,"U were outbid by simonwatson5120 on the Shinco DVD Plyr. 2 bid again, visit sms. ac/smsrewards 2 end bid notifications, reply END OUT"


The classifier may be improved by determining the smoothing parameter through cross-validation. Changing the size of the training and test data may also change the overall accuracy of the classifier.
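
A cross-validated search over alpha could look like the following sketch. The twelve texts and the three candidate values are made up for illustration, and `GridSearchCV` is just one of several ways to run such a search.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six spam and six ham texts, enough for 3-fold CV.
texts = [
    "win free prize now", "free cash prize", "claim your free prize",
    "cash prize waiting", "win cash now", "free entry win prize",
    "call me later", "see you at lunch", "are you home yet",
    "lunch later today", "call you tonight", "see you at home",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name,
# hence the "multinomialnb__alpha" key.
param_grid = {"multinomialnb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

best_alpha = search.best_params_["multinomialnb__alpha"]
print(best_alpha, round(search.best_score_, 3))
```

Since the classes are imbalanced in the real data, a scoring metric such as F1 or recall for the spam class may be a more informative target than accuracy.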

The final step is to delete all of the large datasets from memory.

In [167]:
del (
    messages,
    messages_clean,
    mc_train,
    mc_test,
    mc_train_copy,
    vocabulary,
    word_counts_per_sms,
    wcps,
    mc_train_conv,
    p_w_spams,
    p_w_hams,
    train_ham,
    train_spam,
    mc_test_copy,
    y_true,
    y_pred,
    X_train,
    X_test,
    y_train,
    y_test,
    y_pred_NB
)