---
---
---

# 💠 **TUTORIAL: Applying Naive Bayesian Classifiers for Basic Text Classification**

---

This tutorial is designed to showcase Naïve Bayes classifiers for text classification purposes – by the end, you should be comfortable with adding Bayesian classifiers to your machine learning toolkit!

Let's start by importing the libraries we'll need.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

The two main classifier models we'll use today are the **`GaussianNB()`** and the **`MultinomialNB()`** algorithms.

---
---

First, let's get access to our dataset.

You can access this dataset either via the **[external download link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)**, the uploaded dataset in the Google Drive, or the shared dataset in the Slack.

In [3]:
PATH = "spam.csv"

dataset = pd.read_csv(PATH,
                      skiprows=1,
                      usecols=[0, 1],
                      encoding="latin-1",
                      names=["target", "content"])

In [4]:
dataset.head(3)

Unnamed: 0,target,content
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


This dataset consists of text messages and labels that identify whether each message is legitimate, meaningful content (`ham`) or spam-like junk (`spam`).

---

We'll have to perform some data cleaning and attribution in order to most effectively make use of the dataset for Naive Bayesian modeling.

First things first: let's break each text message up from a sentence into a cleaned list of words (we'll refer to these as _tokens_).

In [5]:
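# NOTE: without regex=True, pandas 2.x treats these patterns as literal strings,
# so punctuation stays attached to the tokens (as the output below shows).
# Pass regex=True to both replace calls to actually strip it.
if "content" in dataset.columns:  # guard so the cell can be safely re-run
    dataset["tokens"] = dataset["content"].str.replace(r"\W+", " ").str.replace(r"\s+", " ").str.strip()
    dataset["tokens"] = dataset["tokens"].str.lower()
    dataset["tokens"] = dataset["tokens"].str.split()
    dataset.drop(columns=["content"], inplace=True)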

In [6]:
dataset.head(3)

Unnamed: 0,target,tokens
0,ham,"[go, until, jurong, point,, crazy.., available..."
1,ham,"[ok, lar..., joking, wif, u, oni...]"
2,spam,"[free, entry, in, 2, a, wkly, comp, to, win, f..."


We do this for two reasons: to clean up any odd symbols and characters in our data, and to let us analyze messages word by word without having to worry about modeling constraints like the maximum length of a sentence.
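
As a standalone illustration of this cleaning step, here is a minimal sketch using Python's built-in `re` module. The `tokenize` helper is hypothetical (it is not part of the notebook) and assumes we want to drop all punctuation and lowercase everything.

```python
import re

def tokenize(text):
    """Replace runs of non-word characters with a space, lowercase, and split."""
    cleaned = re.sub(r"\W+", " ", text).strip().lower()
    return cleaned.split()

print(tokenize("Go until jurong point, crazy.."))
print(tokenize("Ok lar... Joking wif u oni..."))
```

Note that this strips punctuation entirely, so `point,` and `point` end up as the same token.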

---

We'll be performing some more advanced data preparation when working with text-based data.

In this case, we'll be working with a _label encoding algorithm_, which replaces the string classes in our target label with automatically assigned integer classes.

(We can map these "dummy labels" back to the original strings at any time using the fitted encoder's `inverse_transform` method.)
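
To make the idea concrete, here is a plain-Python sketch of what label encoding does. It is only an illustration under the assumption that classes are numbered in sorted order (which is how `LabelEncoder` behaves); the variable names are made up for this example.

```python
labels = ["ham", "ham", "spam", "ham"]

# Assign each distinct class an integer code, in sorted order.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]

# Decoding is just a lookup in the list of classes.
decoded = [classes[code] for code in encoded]

print(to_code)
print(encoded)
print(decoded)
```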

In [7]:
le = preprocessing.LabelEncoder()

In [8]:
# LabelEncoder assigns integer codes in sorted class order: ham -> 0, spam -> 1.
dataset["target"] = le.fit_transform(dataset["target"])

In [9]:
dataset.head(3)

Unnamed: 0,target,tokens
0,0,"[go, until, jurong, point,, crazy.., available..."
1,0,"[ok, lar..., joking, wif, u, oni...]"
2,1,"[free, entry, in, 2, a, wkly, comp, to, win, f..."


Now that our labels are encoded and our data is cleaned, we can move on to classification.

---

As always, let's start by segmenting our data into training and testing splits.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(dataset["tokens"],
                                                    dataset["target"],
                                                    train_size=0.8,
                                                    test_size=0.2,
                                                    random_state=42)

---

In order to train our Bayesian classification models, we first need to turn each message into numbers: we'll count how often every token from the training vocabulary appears in each message (a _bag-of-words_ representation).

The details are not critical to understand right now, since this step mainly deals with the variability of language data; however, if you are interested in natural language processing (NLP) as a future focus, it is recommended to play around with this step.

In [11]:
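# Build the vocabulary from the training messages only, so nothing from the
# test set leaks into the feature space. (Summing a Series of lists
# concatenates them into one long list of tokens.)
vocabulary = list(set(X_train.sum()))

# One row per message, one column per vocabulary token; each cell holds how
# often that token occurs in the message. Test tokens that never appeared in
# training have no column and are simply ignored.
X_train_vocab = pd.DataFrame(
    [[message.count(token) for token in vocabulary] for message in X_train],
    columns=vocabulary)
X_test_vocab = pd.DataFrame(
    [[message.count(token) for token in vocabulary] for message in X_test],
    columns=vocabulary)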
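
To see what these count vectors look like, here is a small self-contained sketch on toy data. The messages and the `to_counts` helper are invented for illustration; the vocabulary is sorted here only to make the output predictable.

```python
from collections import Counter

train_messages = [["free", "prize", "free"], ["see", "you", "soon"]]
test_messages = [["free", "lunch", "soon"]]

# Vocabulary comes from the training messages only.
vocabulary = sorted({token for message in train_messages for token in message})

def to_counts(message, vocabulary):
    """Return one count per vocabulary token; unknown tokens are dropped."""
    counts = Counter(message)
    return [counts[token] for token in vocabulary]

train_vectors = [to_counts(m, vocabulary) for m in train_messages]
test_vectors = [to_counts(m, vocabulary) for m in test_messages]

print(vocabulary)
print(train_vectors)
print(test_vectors)
```

The token `lunch` never appears in the training messages, so it has no column and contributes nothing to the test vector.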

In [12]:
X_train_vocab

Unnamed: 0,nok,configure,lying.,ansr,june.,hv,me..,lunch?,contract!!,jones!,...,press,play.,yijue...,str,happend?,lips,elaborating,thatû÷s,2667,hi..i
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### What is the problem with token data?
It is high-dimensional and sparse: every token in the vocabulary becomes its own column, and almost every entry is zero.

We're ready to create our models and perform some machine learning analysis.

---
---

As always, we'll design and develop our predictive pipeline here following our four-step process:

1. Create our model.
2. Fit our model to training data.
3. Produce predicted labels using the testing data.
4. Assess our model's accuracy by comparing predicted and true labels.

In this case, we'll make use of our **Gaussian Naïve Bayesian Model** first.

This assumes that, within each class (`spam` or `ham`), the value of every feature (here, each token count) follows a Gaussian distribution.
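
Written out, the Gaussian assumption models the likelihood of feature value $x_i$ given class $y$ as shown below. The symbols $\mu_{y,i}$ and $\sigma_{y,i}^2$ are notation introduced here for the mean and variance of feature $i$ among training examples of class $y$.

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
```

The "naive" part is that the features are treated as independent given the class, so the per-feature likelihoods are simply multiplied together with the class prior.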

In [13]:
GaussianNB_Classifier = GaussianNB()

Now that our model is created, let's fit it to our modified training data.

In [14]:
GaussianNB_Classifier.fit(X_train_vocab, y_train)

Once our model is appropriately fitted, we can go ahead and run some predictions using the similarly modified testing data (without true labels provided).

In [15]:
y_pred = GaussianNB_Classifier.predict(X_test_vocab)

And finally, we can access our accuracy score by comparing true labels to the model-predicted ones.

In [16]:
100 * accuracy_score(y_true=y_test, y_pred=y_pred)

92.28699551569507

Not too bad!

However, as we level up in predictive analytics, we want to become comfortable with using multiple evaluation metrics to ensure our prediction results are truly sound.

Plain accuracy counts every correct prediction equally, so it can hide how badly a model handles the less common class – one useful metric that highlights the impact of false positives and false negatives is the **F1 Score**, the harmonic mean of precision and recall.
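
Here is a small hand-computed sketch of why the two metrics can disagree. The labels are made up for illustration (eight `ham`, two `spam`, with `1` as the positive class), and `confusion_counts` is a hypothetical helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn = confusion_counts(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Accuracy comes out at 0.8 because most messages are `ham` and easy to get right, while the F1 score is only 0.5 because half of the spam predictions are wrong and half of the real spam is missed.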

Let's go ahead and use that to see how truly accurate our model is.

In [17]:
100 * f1_score(y_true=y_test, y_pred=y_pred)

75.56818181818183

In [18]:
# The gap between the accuracy score and the F1 score is large, which suggests accuracy alone is misleading here.
# Likely reasons for this mismatch are imbalanced data or a poor fit of the Gaussian assumption.

# Gaussian Naive Bayes assumes features follow a Gaussian (normal) distribution. Word frequencies in text data often don't fit this assumption well. This mismatch can reduce the model's effectiveness in capturing the underlying patterns of the data.

Hmm... not quite as good as we had hoped.

It seems our Gaussian model may not perform as well as we expected.

That's probably because it's a little _too naïve_ to expect sparse, discrete token counts to naturally fall into a normal distribution.

---

Instead, let's go ahead and repeat the modeling process but with a better expectation for our distribution.

In this case, we'll model the token counts within each class with a **multinomial distribution**, which is a natural fit for count-based features like ours.

We should expect improved accuracy scores after our process is complete.
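
For intuition, here is a minimal from-scratch sketch of multinomial Naive Bayes on toy data. It is an illustration only, not scikit-learn's implementation: the function names `train_multinomial_nb` and `predict_label` are hypothetical, the messages are invented, and tokens outside the vocabulary are simply skipped.

```python
import math
from collections import Counter

def train_multinomial_nb(messages, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log-probabilities."""
    vocabulary = sorted({tok for msg in messages for tok in msg})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_msgs = [m for m, y in zip(messages, labels) if y == c]
        log_prior[c] = math.log(len(class_msgs) / len(messages))
        counts = Counter(tok for m in class_msgs for tok in m)
        # Additive smoothing: add alpha to every token count.
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict_label(message, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in message if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

messages = [["free", "prize", "win"], ["win", "cash", "free"],
            ["lunch", "at", "noon"], ["see", "you", "at", "lunch"]]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(messages, labels)
print(predict_label(["free", "cash"], log_prior, log_likelihood))
print(predict_label(["lunch", "noon"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many tiny probabilities together, which would underflow on real messages.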

In [19]:
MultinomialNB_Classifier = MultinomialNB()

In [20]:
MultinomialNB_Classifier.fit(X_train_vocab, y_train)

In [21]:
y_pred = MultinomialNB_Classifier.predict(X_test_vocab)

In [22]:
100 * accuracy_score(y_true=y_test, y_pred=y_pred)

98.11659192825111

In [23]:
100 * f1_score(y_true=y_test, y_pred=y_pred)

92.57950530035335

### Make F1 Score higher than 92.579

In [26]:
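# Tune the alpha (smoothing) parameter of MultinomialNB.
# NOTE: alpha is selected on the test set here, so the best score is an
# optimistic estimate; a separate validation split would give a fairer one.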
best_f1_score = 0
best_alpha = 0
for alpha in np.arange(0.1, 1.1, 0.1):
    MultinomialNB_Classifier = MultinomialNB(alpha=alpha)
    MultinomialNB_Classifier.fit(X_train_vocab, y_train)
    y_pred = MultinomialNB_Classifier.predict(X_test_vocab)
    f1 = 100 * f1_score(y_true=y_test, y_pred=y_pred)
    if f1 > best_f1_score:
        best_f1_score = f1
        best_alpha = alpha

print(f"Best F1 Score: {best_f1_score} with alpha: {best_alpha}")


Best F1 Score: 93.7062937062937 with alpha: 0.1
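
For context on what `alpha` controls: it is the additive smoothing term in the estimate of how likely token $i$ is within class $y$. In the notation used here (introduced for this note), $N_{yi}$ is the number of times token $i$ occurs in training messages of class $y$, $N_y$ is the total token count for that class, and $n$ is the vocabulary size.

```latex
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
```

A smaller `alpha` keeps the estimates closer to the raw counts, while a larger one pulls every token toward the same probability; with `alpha = 0`, any token never seen in a class would get probability zero.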


And what do you know!

It turns out that with a more appropriate assumption about how our features are distributed, our scores are much, much higher!

This is why it's so important to understand the relationship between our data and our algorithms _before_ jumping into predictive modeling: it can save us time and a headache in terms of optimizing our results.

---
---
---