Name: Charles Ong

Section: MACHLRN X22

# Naive Bayes Exercise

In this notebook, you will learn to implement a Naive Bayes classifier using sklearn. We will create two classifiers: one that assumes a Gaussian distribution, and another that assumes a multinomial distribution.

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with "A: " on them. The answer must strictly consume one line only.
* You are expected to search how some functions work on the Internet or via the docs.
* There are commented markdown cells that have crumbs. Do not delete them or separate them from the cell originally directly below it.  
* You may add new cells for "scrap work" as long as the crumbs are not separated from the cell below it.
* The notebooks will undergo a "Restart and Run All" command, so make sure that your code is working properly.
* You are expected to understand the data set loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

## Import
Import **matplotlib**, **numpy**, and **pandas**.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (12.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'

# Fix the seed of the random number 
# generator so that your results will match ours
np.random.seed(1)

%load_ext autoreload
%autoreload 2

# Gaussian Naive Bayes

For our first dataset (iris dataset), we assume that our data follows a Gaussian distribution.
   
## Iris Dataset
We will use the Iris dataset as our dataset. Each instance represents an Iris flower using 4 distinct features:
- `sepal_length` - length of the sepal in centimeters
- `sepal_width` - width of the sepal in centimeters
- `petal_length` - length of the petal in centimeters
- `petal_width` - width of the petal in centimeters

Iris flowers can be divided into 3 different classes, which are:
- `Iris-setosa`
- `Iris-versicolor`
- `Iris-virginica`

## Preprocessing our data

Let's load the iris dataset.

In [2]:
iris = pd.read_csv('iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Right now, we have to convert our nominal labels (word labels: setosa, versicolor, virginica) into numerical labels (number labels: 0 for setosa, 1 for versicolor, 2 for virginica).

In [3]:
from sklearn import preprocessing

Let's use `preprocessing.LabelEncoder()` to convert the  nominal values in `iris['species']` to a unique number.

In [4]:
encoder = preprocessing.LabelEncoder()

Fit the `species` feature by calling the `fit()` function of the object.

In [5]:
encoder.fit(iris['species'])

Transform the `species` feature by calling the `transform()` function of the object. 

In [6]:
encoder.transform(iris['species'])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let's see the mapping of the original nominal labels and the numerical codes

In [7]:
print('Original labels:', encoder.classes_, '\n')

print('Mapping from nominal to numerical labels:')
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

Original labels: ['setosa' 'versicolor' 'virginica'] 

Mapping from nominal to numerical labels:
{'setosa': 0, 'versicolor': 1, 'virginica': 2}
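
The same encoder can also reverse the mapping, which is handy for turning numerical predictions back into species names. A minimal sketch, assuming a small hand-written label list (a stand-in for `iris['species']`) and a hypothetical `encoder_demo` object:

```python
from sklearn import preprocessing

# Hypothetical labels standing in for iris['species']
labels = ['setosa', 'virginica', 'versicolor', 'setosa']

encoder_demo = preprocessing.LabelEncoder()
codes = encoder_demo.fit_transform(labels)      # classes are sorted alphabetically
names = encoder_demo.inverse_transform(codes)   # back to the original strings

print(codes)   # [0 2 1 0]
print(names)   # ['setosa' 'virginica' 'versicolor' 'setosa']
```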


Now that we have the numerical encoding and the mapping, we can now change the `species` column to its numerical mapping

In [8]:
iris['species'] = encoder.transform(iris['species'])

Let's see the results now:

In [9]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Like in the previous notebooks, we will separate our `X` from our target `y` (species). 


__Note__: `iris.values[:, :-1]` will get all rows, and all columns except for the last column


__Note__: `iris.values[:, -1]` will get the last column only. We cast the labels to integers because their default data type is float.

In [10]:
X = iris.values[:, :-1]
y = iris.values[:, -1].astype(int)

print('X shape: ', X.shape)
print('y shape: ', y.shape)

X shape:  (150, 4)
y shape:  (150,)


Import `train_test_split()`.

In [11]:
from sklearn.model_selection import train_test_split

Divide the dataset into train and test sets, where 30% of the data will be placed in the test set.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.3, 
                                                    stratify=y, 
                                                    random_state=42)

X_train.shape

(105, 4)
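
The `stratify=y` argument keeps the class proportions the same in both splits. The sketch below illustrates this on a synthetic label array (an assumption; it does not load `iris.csv`) with the same balance of 50 instances per class:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the iris data: 150 rows, 3 balanced classes
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

print(np.bincount(y_tr))  # class counts in the train split
print(np.bincount(y_te))  # class counts in the test split
```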

## Building our model
Because our features `X` are continuous values, we will use `sklearn`'s `GaussianNB` model.

In [13]:
from sklearn.naive_bayes import GaussianNB

Instantiate a `GaussianNB` model. Assign the object to variable `model`.

In [14]:
# Write your code here
model = GaussianNB()

Train the model using the `fit()` function.

In [15]:
# Write your code here
model.fit(X_train, y_train)

## Try our trained model on the train data

Let's get the prediction results on the train data to see if our model does well. Store the predicted labels in the variable `predictions`.

In [16]:
# Write your code here
predictions = model.predict(X_train)

Print the predictions.

In [17]:
print(predictions)

[1 1 0 2 1 2 0 0 0 2 2 0 0 1 1 2 0 0 2 2 0 2 2 2 1 0 0 0 1 1 0 0 1 1 0 0 1
 2 2 0 2 0 1 0 2 1 0 2 1 2 1 0 1 2 1 2 0 1 0 1 1 1 2 1 1 2 2 0 2 1 1 2 0 2
 2 1 0 2 2 0 0 2 2 2 0 2 1 2 2 0 1 1 1 1 1 0 2 1 2 0 0 1 0 1 0]


Compare the ground truth labels with the predicted labels. Store the total number of correct predictions in the variable `num_correct`.

In [18]:
# Write your code here
num_correct = (y_train == predictions).sum()

Print the number of correct predictions.

In [19]:
print(num_correct)

103


Compute for the accuracy. Store the accuracy in the variable `accuracy`.

In [20]:
# Write your code here
accuracy = num_correct / len(y_train)

Print the accuracy.

In [21]:
print(accuracy)

0.9809523809523809


**Question #1:** What is the accuracy of the model when evaluated on the train set? Express your answer in a floating point number from 0 to 1. Limit to 4 decimal places.

In [22]:
print(f'A: {accuracy:.4f}')

A: 0.9810
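
The manual accuracy computation above should agree with scikit-learn's built-in helper. A small sketch with made-up label arrays (not the notebook's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions for illustration
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

manual = (y_true == y_pred).sum() / len(y_true)   # same formula as above
builtin = accuracy_score(y_true, y_pred)

print(manual, builtin)   # 6 of 8 correct in both cases
```

For a fitted classifier, `model.score(X, y)` returns the same quantity.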


## Try our trained model on the test data

Let's get the prediction results on the test data to see if our model does well. Store the predicted labels in the variable `predictions`.

In [23]:
# Write your code here
predictions = model.predict(X_test)

Print the predictions.

In [24]:
print(predictions)

[2 1 1 1 2 2 1 1 0 2 0 0 2 2 0 2 1 0 0 0 1 0 1 2 2 1 1 1 1 0 1 2 1 0 2 0 0
 0 0 2 1 0 1 2 1]


Compare the ground truth labels with the predicted labels. Store the total number of correct predictions in the variable `num_correct`.

In [25]:
# Write your code here
num_correct = (y_test == predictions).sum()

Print the number of correct predictions.

In [26]:
print(num_correct)

41


Compute for the accuracy. Store the accuracy in the variable `accuracy`.

In [27]:
# Write your code here
accuracy = num_correct / len(y_test)

Print the accuracy.

In [28]:
print(accuracy)

0.9111111111111111


**Question #2:** What is the accuracy of the model when evaluated on the test set? Express your answer in a floating point number from 0 to 1. Limit to 4 decimal places.

In [29]:
print(f'A: {accuracy:.4f}') 

A: 0.9111


## Checking the learned parameters
We can also peer into the parameters the model learned.

This is how you get the number of instances (of each class) the model received as the training set

In [30]:
model.class_count_

array([35., 35., 35.])

You can also get the priors the model learned

In [31]:
model.class_prior_

array([0.33333333, 0.33333333, 0.33333333])

**Question #3:** How are the priors calculated?

A: Naive Bayes' priors are calculated based on P(A) or probability of a class.

In the case of the iris training data, each class has the same number of instances, so each class gets an equal prior probability of about 0.33, from...

<center>35/Total of 35*3</center>

Gaussian Naive Bayes classifiers have **`k * d * 2`** number of parameters (not including the priors)

> where <br>
> **`k`** - number of classes <br>
> **`d`** - number of dimensions/features <br>
> **`2`** - because we calculate for the means and variances of each feature <br>

Get the computed means of the model

In [32]:
model.theta_

array([[4.98857143, 3.41142857, 1.48857143, 0.23714286],
       [5.94857143, 2.73142857, 4.23714286, 1.30857143],
       [6.68285714, 3.00857143, 5.63142857, 2.06857143]])

Get the computed variances of the model

In [33]:
model.var_

array([[0.10329796, 0.17586939, 0.02272653, 0.00976327],
       [0.24078368, 0.08558368, 0.21147755, 0.03564082],
       [0.42484898, 0.11735511, 0.32272653, 0.06386939]])

____________

# Multinomial Naive Bayes

For our second dataset (spam/not spam dataset), we assume that our data follows a multinomial distribution.

## Sample data

Before we go and train with the spam/ham dataset, we have to convert the `content` column into numbers we can crunch. In our case, our features will be the frequency of words in the data instance.

**Example:**


|                                                  | Never | gonna | give | you | up | let | down | make | cry | say | goodbye |
|--------------------------------------------------|-------|-------|------|-----|----|-----|------|------|-----|-----|---------|
|                          Never gonna give you up |   1   |   1   |   1  |  1  |  1 |  0  |   0  |   0  |  0  |  0  |    0    |
| Never gonna give you up Never gonna let you down |   2   |   2   |   1  |  2  |  1 |  1  |   1  |   0  |  0  |  0  |    0    |
|                         Never gonna make you cry |   1   |   1   |   0  |  1  |  0 |  0  |   0  |   1  |  1  |  0  |    0    |
|                          Never gonna say goodbye |   1   |   1   |   0  |  0  |  0 |  0  |   0  |   0  |  0  |  1  |    1    |

<div style="text-align: right"><sub>Reference: Never Gonna Give You Up by Rick Astley</sub></div>

In [34]:
data = ['Never gonna give you up',
        'Never gonna give you up Never gonna let you down',
        'Never gonna make you cry',
        'Never gonna say goodbye']

First, let's convert our words all to lower case. This is a common practice.

In [35]:
for i in range(len(data)):
    data[i] = data[i].lower()
    
data

['never gonna give you up',
 'never gonna give you up never gonna let you down',
 'never gonna make you cry',
 'never gonna say goodbye']

Now, we'll count for the frequency of each word of each sentence.

Import `CountVectorizer`.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

Instantiate a `CountVectorizer`. This will convert the text into a matrix of word/token counts.

In [37]:
vectorizer = CountVectorizer()

Use the `fit()` function to get the words in our dataset.

In [38]:
vectorizer.fit(data)

Display the words/tokens/features:

In [39]:
word_features = vectorizer.get_feature_names_out()
word_features

array(['cry', 'down', 'give', 'gonna', 'goodbye', 'let', 'make', 'never',
       'say', 'up', 'you'], dtype=object)

The following code computes for the counts of each word for each of our data sentences. 

It outputs a **sparse count matrix**. 

__Note:__ The sparse refers to the matrix having mostly 0 values for the columns (see table above). If we store this as a normal matrix, it will take up a lot of space. To save space, the following data is stored in this fashion:
> `(<sentence>, <word>)         <count>`

All combinations where the count is 0 will be ignored
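
To get a feel for the saving, the sketch below builds the count matrix for the same four sentences and compares the number of stored entries with the number of cells a dense matrix would hold. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['never gonna give you up',
             'never gonna give you up never gonna let you down',
             'never gonna make you cry',
             'never gonna say goodbye']

demo_vectorizer = CountVectorizer().fit(sentences)
sparse_counts = demo_vectorizer.transform(sentences)

n_rows, n_cols = sparse_counts.shape
print('dense cells:   ', n_rows * n_cols)     # 4 sentences * 11 words = 44
print('stored entries:', sparse_counts.nnz)   # only the non-zero counts

# Walk over the stored (sentence, word) -> count entries
coo = sparse_counts.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    print((int(row), int(col)), int(count))
```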

In [40]:
count_sparse_matrix = vectorizer.transform(data)
print(count_sparse_matrix)

  (0, 2)	1
  (0, 3)	1
  (0, 7)	1
  (0, 9)	1
  (0, 10)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	2
  (1, 5)	1
  (1, 7)	2
  (1, 9)	1
  (1, 10)	2
  (2, 0)	1
  (2, 3)	1
  (2, 6)	1
  (2, 7)	1
  (2, 10)	1
  (3, 3)	1
  (3, 4)	1
  (3, 7)	1
  (3, 8)	1


It may seem a lot of work to save little space, but as your data grows this will save a ton of memory.

To better understand the representation above, let's display it as a matrix. The first entry in `count_sparse_matrix` is `(0, 2) -> 1`. In the display below, this corresponds to the frequency of word 2 (`give`) in sentence 0, i.e., the first row of the table below.

In [41]:
n_sentences = count_sparse_matrix.shape[0]
n_word_features = count_sparse_matrix.shape[1]

# header
for i in range(n_word_features):
    print(word_features[i], end ='\t')
print('sentence', end='\n')
    
for i in range(n_sentences):
    for j in range(n_word_features):
        print(count_sparse_matrix[i, j], end='\t')
    print(data[i], end='\n')

cry	down	give	gonna	goodbye	let	make	never	say	up	you	sentence
0	0	1	1	0	0	0	1	0	1	1	never gonna give you up
0	1	1	2	0	1	0	2	0	1	2	never gonna give you up never gonna let you down
1	0	0	1	0	0	1	1	0	0	1	never gonna make you cry
0	0	0	1	1	0	0	1	1	0	0	never gonna say goodbye


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document/sentence is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document/sentence** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or 'Bag of n-grams' representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

For our training, we will convert count_sparse_matrix into a count_dense_matrix.

In [42]:
count_dense_matrix = count_sparse_matrix.toarray()
count_dense_matrix

array([[0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1],
       [0, 1, 1, 2, 0, 1, 0, 2, 0, 1, 2],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]], dtype=int64)

Here is our data in a `pandas` `DataFrame`.

In [43]:
pd.DataFrame(count_dense_matrix, columns=vectorizer.get_feature_names_out(), index=data)

Unnamed: 0,cry,down,give,gonna,goodbye,let,make,never,say,up,you
never gonna give you up,0,0,1,1,0,0,0,1,0,1,1
never gonna give you up never gonna let you down,0,1,1,2,0,1,0,2,0,1,2
never gonna make you cry,1,0,0,1,0,0,1,1,0,0,1
never gonna say goodbye,0,0,0,1,1,0,0,1,1,0,0


## Spam/Not Spam Dataset
We will use the Spam/Not Spam dataset as our dataset. Our goal with this dataset is to classify a sentence as either **spam** or **not spam** (ham). You can check out the `spam_ham.csv` for examples of spam and not spam messages. Check the file and see its body contents.

(This section is a slight modification from <a href="http://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/">Ritchie Ng's notebook</a>)

Load the text data from the csv file.

In [44]:
spam_ham = pd.read_csv('spam_ham.csv')
spam_ham.dropna(inplace=True)
spam_ham.head(10)

Unnamed: 0,type,location,body
0,spam,data/000/001,LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1,spam,data/000/002,Academic Qualifications available from prestig...
2,ham,data/000/003,Greetings all. This is to verify your subscrip...
3,spam,data/000/004,try chauncey may conferred the luscious not co...
4,ham,data/000/005,"It's quiet. Too quiet. Well, how about a straw..."
5,ham,data/000/006,It's working here. I have departed almost tota...
6,spam,data/000/008,The OIL sector is going crazy. This is our wee...
7,spam,data/000/009,Little magic. Perfect weekends.http://othxu.rz...
8,ham,data/000/010,Greetings all. This is a mass acknowledgement ...
9,spam,data/000/011,"Hi, L C P A X V V e I r m a A I v A o b n L A ..."


Before we proceed to vectorizing, let's change our label type from `spam` and `ham` to numerical values.

Let's use `preprocessing.LabelEncoder()` to convert the  nominal values in `spam_ham['type']` to a unique number.

In [45]:
encoder = preprocessing.LabelEncoder()

Fit the `type` feature by calling the `fit()` function of the object.

In [46]:
encoder.fit(spam_ham['type'])

Then, get the mapping so we know what the `0`s and `1`s mean later in the notebook

In [47]:
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

print('Mapping:', mapping)

Mapping: {'ham': 0, 'spam': 1}


Transform the `type` feature by calling the `transform()` function of the object. 

In [48]:
spam_ham['type'] = encoder.transform(spam_ham['type'])
spam_ham

Unnamed: 0,type,location,body
0,1,data/000/001,LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1,1,data/000/002,Academic Qualifications available from prestig...
2,0,data/000/003,Greetings all. This is to verify your subscrip...
3,1,data/000/004,try chauncey may conferred the luscious not co...
4,0,data/000/005,"It's quiet. Too quiet. Well, how about a straw..."
...,...,...,...
31156,1,data/126/016,bla bla blaeeeerererreerererre
31157,1,data/126/018,The OIL sector is going crazy. This is our wee...
31158,1,data/126/019,http://vdtobj.docscan.info/?23759301Suffering ...
31159,1,data/126/020,U N I V E R S I T Y D I P L O M A SDo you want...


**Sanity Check:** The type column should now be in 1's and 0's. Make sure that they are still properly labelled.

Now, we will separate our features `X` from our labels `y`. Disregard the `location` column (it points to the text file where the text `body` came from)

In [49]:
X = spam_ham['body']
y = spam_ham['type']

print('X shape: ', X.shape)
print('y shape: ', y.shape)

X shape:  (30974,)
y shape:  (30974,)


Show some rows from the dataset.

In [50]:
print(X[:5])
print(y[:5])

0    LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $...
1    Academic Qualifications available from prestig...
2    Greetings all. This is to verify your subscrip...
3    try chauncey may conferred the luscious not co...
4    It's quiet. Too quiet. Well, how about a straw...
Name: body, dtype: object
0    1
1    1
2    0
3    1
4    0
Name: type, dtype: int32


Split the dataset into train and test data sets. Set the test size to 30%, and `random_state` to 42. Make sure we also stratify based on the type (spam/ham).

In [51]:
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

Show the number of instances for class `0` and class `1` in the train set.

In [52]:
y_train.value_counts()

type
1    13496
0     8185
Name: count, dtype: int64

Show the number of instances for class `0` and class `1` in the test set.

In [53]:
y_test.value_counts()

type
1    5784
0    3509
Name: count, dtype: int64

You should see that the distribution of classes is maintained in both the train and test sets (about 1.648 spam messages for every ham message).

### Vectorization

Let's process the data as we did in the section before. Note that we will get a new dictionary based on the training data (we won't use the *Never gonna give you up* dataset anymore). Thus, let's instantiate a new `CountVectorizer`.

In [54]:
vectorizer = CountVectorizer()

Get the words from the training set using the `fit()` function. We should train without knowing the words from the test set.

In [55]:
vectorizer.fit(X_train)

**Sanity Check:** This is a large dataset, it may take a few seconds.

And, then get the frequency of each word in  each sentence using the `transform()` function.

In [56]:
X_train_count_sparse_matrix = vectorizer.transform(X_train)

__Note:__ A shorthand of these two lines is `vectorizer.fit_transform()`

In [57]:
X_train_count_sparse_matrix.shape

(21681, 147622)

Let's check out the fitted vocabulary

In [58]:
vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような'],
      dtype=object)

The vocabulary starts and ends with unusual tokens (numbers and non-Latin characters). Try viewing the words from the 45,000th onward to see more 'normal' words.

In [59]:
vectorizer.get_feature_names_out()[45000:]

array(['demütige', 'den', 'denardo', ..., 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような'],
      dtype=object)

Now, we also have to transform our test data using our fitted vocabulary. Note that we should not fit the vectorizer on the test data. We're going to use the word features we culled from the training dataset.

In [60]:
# Write your code here
X_test_count_sparse_matrix = vectorizer.transform(X_test)
print(f'Train: {X_train_count_sparse_matrix.shape}')
print(f'Test: {X_test_count_sparse_matrix.shape}')

Train: (21681, 147622)
Test: (9293, 147622)


Show the shape of the transformed test data.

In [61]:
X_test_count_sparse_matrix.shape

(9293, 147622)

**Sanity Check:** The number of features (dimensions, not instances) of the train and test should match.

Now we have two transformed sparse matrices: (1) `X_train_count_sparse_matrix` and (2) `X_test_count_sparse_matrix`.



## Modelling

Now that we've got preprocessing done, we can focus on building the model. Here, we will use sklearn's `MultinomialNB` because our assumption is that our data follows a multinomial distribution.

In [62]:
from sklearn.naive_bayes import MultinomialNB

Instantiate a `MultinomialNB` model. Assign the object to variable `model`.

In [63]:
# Write your code here
model = MultinomialNB()

Train the model using the `fit()` function. Use the transformed sparse matrix as input training data.

In [64]:
# Write your code here
model.fit(X_train_count_sparse_matrix, y_train)

## Try our trained model on the train data

Let's get the prediction results on the train data to see if our model does well. Store the predicted labels in the variable `predictions`.

In [65]:
# Write your code here
predictions = model.predict(X_train_count_sparse_matrix)

Compare the ground truth labels with the predicted labels. Store the total number of correct predictions in the variable `num_correct`.

In [66]:
# Write your code here
num_correct = (predictions == y_train).sum()

Compute for the accuracy. Store the accuracy in the variable `accuracy`.

In [67]:
# Write your code here
accuracy = num_correct / len(y_train)

Print the accuracy.

In [68]:
print(accuracy)

0.9904985932383192


**Question #4:** What is the accuracy of the model when evaluated on the train set? Express your answer in a floating point number from 0 to 1. Limit to 4 decimal places.

In [69]:
print(f'A: {accuracy:.4f}') 

A: 0.9905


## Try our trained model on the test data

Let's get the prediction results on the test data to see if our model does well. Store the predicted labels in the variable `predictions`.

In [70]:
# Write your code here
predictions = model.predict(X_test_count_sparse_matrix)

Compare the ground truth labels with the predicted labels. Store the total number of correct predictions in the variable `num_correct`.

In [71]:
# Write your code here
num_correct = (predictions == y_test).sum()

Compute for the accuracy. Store the accuracy in the variable `accuracy`.

In [72]:
# Write your code here
accuracy = num_correct / len(y_test)

Print the accuracy.

In [73]:
print(accuracy)

0.9821370924351662


**Question #5:** What is the accuracy of the model when evaluated on the test set? Express your answer in a floating point number from 0 to 1. Limit to 4 decimal places.

In [74]:
print(f'A: {accuracy:.4f}') 

A: 0.9821


We should also be able to call `classification_report()` to see how well our model performed with different metrics

In [75]:
from sklearn.metrics import classification_report

Print the test classification report of our model. Set the `target_names` to `mapping.keys()` so we can see what `0` and `1` refers to.

In [76]:
print(classification_report(y_test, predictions, digits=4, target_names=mapping.keys()))

              precision    recall  f1-score   support

         ham     0.9573    0.9972    0.9768      3509
        spam     0.9982    0.9730    0.9855      5784

    accuracy                         0.9821      9293
   macro avg     0.9778    0.9851    0.9811      9293
weighted avg     0.9828    0.9821    0.9822      9293



**Question #6:** Among the classes (`ham` or `spam`), which is more likely to get labelled its class?

A: We can reference the recall to know more since it's described as...

> The recall is intuitively the ability of the classifier to find all the positive samples

In this case, `ham` has the higher recall (0.9972 vs. 0.9730), so ham messages are more likely to be labelled correctly, while more spam messages are missed.
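
Recall for a class is the fraction of that class's true instances that the model labelled correctly. The sketch below computes it per class from a confusion matrix on made-up labels (0 = ham, 1 = spam, matching the mapping above) and compares it with `recall_score`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 ham (0) and 6 spam (1)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted
manual_recall = cm.diagonal() / cm.sum(axis=1)   # correct / all true instances

print(manual_recall)                               # ham: 4/4, spam: 4/6
print(recall_score(y_true, y_pred, average=None))
```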

## Checking the learned parameters
Let's see the parameters the `MultinomialNB` model learned.

Get the token counts the model computed

In [77]:
token_counts = model.feature_count_
token_counts.shape

(2, 147622)

**Question #7:** Why did we get a `(2, 147622)` matrix for the token counts? What does each value represent?

A: The 1st value represents the Classes or labels while the 2nd value represents the features (words).

Hence we have 2 possible classes (`spam`, `ham`) and 147,622 words; each entry is the count of one word within one class.

To get the token counts of `spam` or `ham`, we can use our `mapping`.

In [78]:
spam_token_counts = token_counts[mapping['spam']]
ham_token_counts = token_counts[mapping['ham']]

We can sort the token counts to see which words occur the least/most for that class.

In [79]:
np.sort(spam_token_counts)

array([    0.,     0.,     0., ..., 38981., 41424., 53076.])

While `np.sort()` returns the actual counts, `np.argsort()` returns the sorted indices.

In [80]:
np.argsort(spam_token_counts)

array([ 73810, 119826, 119827, ..., 128265,  21596, 124004], dtype=int64)

**Sanity check:** You should see the following:

`array([ 73810, 119826, 119827, ..., 128265,  21596, 124004])`

The two sorts show that the `73,810th` word occurred `0` times, while the `124,004th` occurred `53,076` times in the spam sentences. Note that these are raw counts that are skewed because there are significantly more spam sentences. The model normalizes the counts relative to the class.

To get the `ith` word/token, we can use `vectorizer.get_feature_names_out()`.

In [81]:
vectorizer.get_feature_names_out()[73810]

'jdk'

**Question #8:** What word occurred the most in the spam sentences?

In [82]:
print(f'A:\nMost occurring spam word: "{vectorizer.get_feature_names_out()[np.argsort(spam_token_counts)[-1]]}" for {np.sort(spam_token_counts)[-1]} occurrences')

A:
Most occurring spam word: "the" for 53076.0 occurrences


The following code lists the top occurring words per class:

In [83]:
top = 50

ham_idx = np.argsort(ham_token_counts)[::-1][:top]
spam_idx = np.argsort(spam_token_counts)[::-1][:top]

# The loop prints the ham word first, then the spam word
print('ham \t spam')
print('------------')

for i in range(top):
    print(vectorizer.get_feature_names_out()[ham_idx[i]], '\t', vectorizer.get_feature_names_out()[spam_idx[i]])

ham 	 spam
------------
the 	 the
to 	 and
of 	 to
and 	 of
in 	 in
is 	 you
for 	 is
it 	 font
that 	 http
you 	 for
on 	 this
this 	 padding
with 	 our
be 	 it
from 	 your
have 	 with
are 	 we
at 	 com
as 	 0px
not 	 that
if 	 price
or 	 border
by 	 product_table
can 	 on
but 	 are
edu 	 from
will 	 color
an 	 top
my 	 95
we 	 size
your 	 as
all 	 weight
one 	 at
would 	 be
there 	 will
was 	 info
any 	 by
so 	 my
http 	 or
use 	 all
has 	 not
do 	 none
com 	 left
they 	 have
what 	 right
which 	 more
about 	 company
some 	 no
www 	 www
list 	 out


The model does not depend on raw counts but instead uses the log probability. Get the model's log probabilities.

In [84]:
model.feature_log_prob_

array([[ -7.05463942,  -8.76178963,  -8.97965613, ..., -14.69668383,
        -14.69668383, -14.69668383],
       [ -6.45903662,  -6.08493524, -14.58437171, ..., -13.89122453,
        -12.97493379, -13.89122453]])

Let's sort the `feature_log_prob_` similar to the way we sorted the token counts

In [85]:
np.sort(model.feature_log_prob_[mapping['spam']])

array([-14.58437171, -14.58437171, -14.58437171, ...,  -4.01351643,
        -3.95273186,  -3.70487274])

In [86]:
np.argsort(model.feature_log_prob_[mapping['spam']])

array([ 73810, 119826, 119827, ..., 128265,  21596, 124004], dtype=int64)

We can see that the order is maintained, and the `124,004th` word is still the most occurring word in the `spam` sentences.

We can also see the class count and computed priors for each class

In [87]:
model.class_count_

array([ 8185., 13496.])

In [88]:
model.class_log_prior_

array([-0.97413309, -0.47404296])

Note that the priors are computed from the count of each class (spam or not spam) in the training set, and the model stores them as log probabilities.
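
As a quick check of that statement, the class counts shown above reproduce the log priors directly. This sketch only uses the two counts printed by `model.class_count_`:

```python
import numpy as np

class_count = np.array([8185., 13496.])   # ham, spam (from model.class_count_)

prior = class_count / class_count.sum()   # share of each class in the training set
log_prior = np.log(prior)

print(prior)
print(log_prior)   # compare with model.class_log_prior_
```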

# Tuning our Naive Bayes model

In this section we will reuse our spam/ham dataset. We will resplit our dataset in the following manner:
1. Allot 20% of the original dataset as our hold-out test set.
1. Allot 25% of our remaining data as our validation data set. The remaining 75% will serve as our training data.
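
Taking 25% of the remaining 80% means the final proportions are roughly 60% train, 20% validation, and 20% test. The sketch below reproduces the two-stage split on a placeholder array with the same number of rows as the dataset (30,974). It omits stratification for brevity, which is an assumption of the sketch and not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(30974)   # placeholder for the 30,974 messages

# Stage 1: hold out 20% as the test set
train_val, test = train_test_split(rows, test_size=0.2, random_state=42)

# Stage 2: take 25% of what is left as the validation set
train, validation = train_test_split(train_val, test_size=0.25, random_state=42)

for name, part in [('train', train), ('validation', validation), ('test', test)]:
    print(f'{name:<10} {len(part):>6}  {len(part) / len(rows):.2%}')
```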

We will use `sklearn`'s `ParameterGrid` to tune our hyperparameters

Let's separate our test set. Set the test set to `20%`, stratify based on the target class, and set the `random_state` to 42.

In [89]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, 
                                                            y, 
                                                            test_size=0.2, 
                                                            random_state=42, 
                                                            stratify=y)

print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)

X_test shape:  (6195,)
y_test shape:  (6195,)


We will do the same thing to separate our validation set. Set the validation set size to `25%`, stratify based on the target class, and set the `random_state` to 42. Don't forget that we are now splitting `X_train_val` and `y_train_val`. There should be no data leakage.

In [90]:
X_train, X_validation, y_train, y_validation = train_test_split(X_train_val, 
                                                                y_train_val, 
                                                                test_size=0.25, 
                                                                random_state=42, 
                                                                stratify=y_train_val)

print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_validation shape: ', X_validation.shape)
print('y_validation shape: ', y_validation.shape)

X_train shape:  (18584,)
y_train shape:  (18584,)
X_validation shape:  (6195,)
y_validation shape:  (6195,)


## Vectorization

Now that we have our data sets prepared, we can now start computing for the token counts. Remember that we will have to refit our vectorizer to our new train data.

Instantiate a `CountVectorizer`. Set it to remove English stop words. This should remove common words like `the`, `of`, and `and` that are unlikely to contribute anything meaningful for telling the two classes apart.

In [91]:
vectorizer = CountVectorizer(stop_words='english')

Fit the vectorizer on the training set by calling the `fit()` function. Then, get the count matrix for the training and validation set using the `transform()` function.

In [92]:
# Write your code here
vectorizer.fit(X_train)

X_train_count_sparse_matrix = vectorizer.transform(X_train)
X_validation_count_sparse_matrix = vectorizer.transform(X_validation)

In [93]:
print('X_train_count_sparse_matrix shape:', X_train_count_sparse_matrix.shape)
print('X_validation_count_sparse_matrix shape:', X_validation_count_sparse_matrix.shape)

X_train_count_sparse_matrix shape: (18584, 146302)
X_validation_count_sparse_matrix shape: (6195, 146302)


**Sanity check:** You should see the following values:
```
X_train_count_sparse_matrix shape:  (18584, 146302)
X_validation_count_sparse_matrix shape:  (6195, 146302)
```

**Question #9:** Why should we not get the count vectorizer to fit on the validation data set instead?

A: The vocabulary is part of what the model learns, so it must come from the training set only; fitting on the validation set would leak information, and because that set is smaller it would also miss many words.
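
The following sketch shows how a vectorizer fitted on training text treats words it has never seen. The documents in `train_docs` and `val_docs` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; "snacks" never appears in train_docs
train_docs = ["cheap tickets today", "meeting moved to friday"]
val_docs = ["cheap meeting snacks"]

vec = CountVectorizer()
vec.fit(train_docs)  # the vocabulary is learned from the training text only

train_counts = vec.transform(train_docs)
val_counts = vec.transform(val_docs)

# Both matrices share the same columns (one per training-vocabulary token)
print(train_counts.shape, val_counts.shape)

# Unseen tokens have no column, so they are silently ignored
print('snacks' in vec.vocabulary_)
print(val_counts.toarray())
```

Keeping a single fitted vocabulary is what makes the training and validation matrices line up column by column, which the classifier relies on.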

## GridSearch with `ParameterGrid`
In this section, we will use `ParameterGrid` to get the combinations of hyperparameters we will try on our model.

Import `ParameterGrid`.

In [94]:
from sklearn.model_selection import ParameterGrid

Instantiate a `MultinomialNB` model. Assign the object to variable `model`.

In [95]:
# Write your code here
model = MultinomialNB()

For this model, we can tweak `alpha` (the additive smoothing parameter) and `fit_prior` (whether the class priors are learned from the training data or assumed to be uniform). You can read more about this in the documentation.

In [96]:
model.get_params()

{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': True}
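
To make the two hyperparameters more concrete, here is a small sketch that recomputes the smoothed token probabilities by hand and compares them with what `MultinomialNB` stores. The count matrix `X_counts` and labels `y_labels` are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 3 documents, 3 tokens
X_counts = np.array([[2, 1, 0],
                     [1, 0, 0],
                     [0, 0, 3]])
y_labels = np.array([0, 0, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_counts, y_labels)

# Smoothed estimate for class 0:
# (token count + alpha) / (total count + alpha * number of tokens)
class0_counts = X_counts[y_labels == 0].sum(axis=0)
manual = (class0_counts + alpha) / (class0_counts.sum() + alpha * X_counts.shape[1])

print(manual)
print(np.exp(nb.feature_log_prob_[0]))

# fit_prior=False replaces the learned class priors with uniform ones
uniform = MultinomialNB(alpha=alpha, fit_prior=False).fit(X_counts, y_labels)
print(np.exp(nb.class_log_prior_), np.exp(uniform.class_log_prior_))
```

A larger `alpha` pulls every token probability toward a uniform value, which is one way to think about why very large values can hurt accuracy.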

For the following section, we will define our hyperparameters. For now, set the following hyperparameter choices:

**Hyperparameters**:
- alpha could be 1, 3, 5, 10, 15, 20, 50
- fit_prior could be true or false

In [97]:
hyperparameters = [{
    'alpha': [1, 3, 5, 10, 15, 20, 50],
    'fit_prior': [False, True]
}]

If we call `ParameterGrid`, it should list the following:

In [98]:
list(ParameterGrid(hyperparameters))

[{'alpha': 1, 'fit_prior': False},
 {'alpha': 1, 'fit_prior': True},
 {'alpha': 3, 'fit_prior': False},
 {'alpha': 3, 'fit_prior': True},
 {'alpha': 5, 'fit_prior': False},
 {'alpha': 5, 'fit_prior': True},
 {'alpha': 10, 'fit_prior': False},
 {'alpha': 10, 'fit_prior': True},
 {'alpha': 15, 'fit_prior': False},
 {'alpha': 15, 'fit_prior': True},
 {'alpha': 20, 'fit_prior': False},
 {'alpha': 20, 'fit_prior': True},
 {'alpha': 50, 'fit_prior': False},
 {'alpha': 50, 'fit_prior': True}]
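
`ParameterGrid` enumerates the Cartesian product of the value lists. As a rough sketch, the same combinations can be built with `itertools.product`; the names `grid_spec`, `combos`, and `by_hand` are only used for this illustration.

```python
from itertools import product
from sklearn.model_selection import ParameterGrid

grid_spec = {'alpha': [1, 3, 5, 10, 15, 20, 50], 'fit_prior': [False, True]}

combos = list(ParameterGrid(grid_spec))

# The same combinations, built by hand as a Cartesian product
by_hand = [{'alpha': a, 'fit_prior': f}
           for a, f in product(grid_spec['alpha'], grid_spec['fit_prior'])]

print(len(combos))        # 7 alpha values x 2 fit_prior values
print(combos == by_hand)
```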

Implement the code below. For every iteration, we will:
1. Set the parameters of our base model to the current hyperparameter combination
1. Fit our model to our training data
1. Compute our training accuracy
1. Run predictions on our validation data
1. Compute our validation accuracy
1. Keep track of the best validation accuracy and its associated hyperparameter combination.

In [99]:
best_score = 0
best_grid = None
for hyperparameter in ParameterGrid(hyperparameters):
    print(hyperparameter)
    
    model.set_params(**hyperparameter)
    
    # Write your code here
    # Fit on the training counts, then measure accuracy on the same data
    model.fit(X_train_count_sparse_matrix, y_train)
    predictions = model.predict(X_train_count_sparse_matrix)
    num_correct = (predictions == y_train).sum()
    train_accuracy = num_correct / len(y_train)
    
    # Write your code here
    predictions = model.predict(X_validation_count_sparse_matrix)
    num_correct = (predictions == y_validation).sum()
    val_accuracy = num_correct / len(y_validation)
    
    print(f'Train acc: {train_accuracy}% \t Val acc: {val_accuracy}%', end='\n\n')
    
    if val_accuracy > best_score:
        best_score = val_accuracy
        best_grid = hyperparameter

print('Best accuracy: ', best_score, '%')
print('Best grid: ', best_grid)

{'alpha': 1, 'fit_prior': False}
Train acc: 0.990206629358588% 	 Val acc: 0.9836965294592414%

{'alpha': 1, 'fit_prior': True}
Train acc: 0.9908523461041756% 	 Val acc: 0.9845036319612591%

{'alpha': 3, 'fit_prior': False}
Train acc: 0.9859018510546707% 	 Val acc: 0.9807909604519774%

{'alpha': 3, 'fit_prior': True}
Train acc: 0.9863861386138614% 	 Val acc: 0.9815980629539952%

{'alpha': 5, 'fit_prior': False}
Train acc: 0.9832651743435213% 	 Val acc: 0.9788539144471348%

{'alpha': 5, 'fit_prior': True}
Train acc: 0.9840723202755058% 	 Val acc: 0.9801452784503631%

{'alpha': 10, 'fit_prior': False}
Train acc: 0.9782070598364184% 	 Val acc: 0.9738498789346247%

{'alpha': 10, 'fit_prior': True}
Train acc: 0.9798213517003874% 	 Val acc: 0.9767554479418886%

{'alpha': 15, 'fit_prior': False}
Train acc: 0.9769156263452432% 	 Val acc: 0.973042776432607%

{'alpha': 15, 'fit_prior': True}
Train acc: 0.977238484718037% 	 Val acc: 0.9749798224374495%

{'alpha': 20, 'fit_prior': False}
Train acc:

**Question #10:** What is the best found value for `alpha`?

In [100]:
print(f'A: {best_grid["alpha"]}') 

A: 1


**Question #11:** What is the best found value for `fit_prior`?

In [101]:
print(f'A: {best_grid["fit_prior"]}') 

A: True


## Retraining our estimator with the best hyperparameters

Now that we know the best hyperparameters, we can create a new classifier and retrain it.

Instantiate a `MultinomialNB` model with the best values for the hyperparameters. Assign the object to variable `model`.

In [102]:
# Write your code here
model = MultinomialNB(**best_grid)

Instantiate a `CountVectorizer`. Set it to remove English stop words. 

In [103]:
# Write your code here
vectorizer = CountVectorizer(stop_words='english')

Since we are done using the validation set, we can combine it with the training set to train the final model. Use the `CountVectorizer` to get the sparse count matrix for `X_train_val`.

In [104]:
# Write your code here
X_train_val_count_sparse_matrix = vectorizer.fit_transform(X_train_val)

Train the model using the `fit()` function. Use the transformed sparse matrix containing both the training and validation sets as input data.

In [105]:
# Write your code here
model.fit(X_train_val_count_sparse_matrix, y_train_val)

## Testing phase

Let's get the prediction results on the test data to see if our model does well. Store the predicted labels in the variable `predictions`. Do not forget to transform your test data before predicting.

In [106]:
# Write your code here
predictions = model.predict(vectorizer.transform(X_test))
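
As an aside, scikit-learn's `Pipeline` can bundle the vectorizer and the classifier so that fitting and transforming always stay in sync. This is an alternative to the manual steps used in this notebook, not part of the exercise; `toy_texts`, `toy_labels`, and `toy_best` are hypothetical stand-ins for `X_train_val`, `y_train_val`, and `best_grid`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for X_train_val / y_train_val
toy_texts = ["win a free prize now", "free prize inside",
             "lunch meeting at noon", "notes from the meeting"]
toy_labels = [1, 1, 0, 0]

# Hypothetical stand-in for best_grid
toy_best = {'alpha': 1, 'fit_prior': True}

pipe = make_pipeline(CountVectorizer(stop_words='english'),
                     MultinomialNB(**toy_best))

# fit() fits the vectorizer and the classifier on the same text
pipe.fit(toy_texts, toy_labels)

# predict() applies the already-fitted vectorizer before classifying
toy_pred = pipe.predict(["free prize", "meeting notes"])
print(toy_pred)
```

With a pipeline there is no separate `transform()` call to forget at prediction time, which is the mistake the instruction above warns about.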

Compare the ground truth labels with the predicted labels. Store the total number of correct predictions in the variable `num_correct`.

In [107]:
# Write your code here
num_correct = (predictions == y_test).sum()

Compute the accuracy. Store the accuracy in the variable `accuracy`.

In [108]:
# Write your code here
accuracy = num_correct / len(y_test)

Print the accuracy.

In [109]:
print(accuracy)

0.9846650524616626
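
The manual computation above matches what `sklearn.metrics.accuracy_score` returns. A minimal sketch on made-up labels (`toy_truth` and `toy_preds` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct
toy_truth = np.array([1, 0, 1, 1, 0])
toy_preds = np.array([1, 0, 0, 1, 0])

# Same computation as in the cells above
manual_acc = (toy_preds == toy_truth).sum() / len(toy_truth)

# Library equivalent
library_acc = accuracy_score(toy_truth, toy_preds)

print(manual_acc, library_acc)
```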


**Question #12**: What is the accuracy of the model when evaluated on the test set? Express your answer in a floating point number from 0 to 1. Limit to 4 decimal places.

In [110]:
print(f'A: {accuracy:.4f}')

A: 0.9847


# Summary

In this notebook, we created two kinds of Naive Bayes models: Gaussian and Multinomial. 

We also saw the models' learned parameters. Gaussian NB models learn the mean and standard deviation of each feature per class, while multinomial NB models learn the log probability of each token per class.

We also got hands-on experience building a natural language processing (NLP) machine learning model. Unlike its deep learning counterpart, the features are more hand-crafted because we dictate what the model should look at. In this case, we specifically designed it to look at token counts (term frequency), but we could build more sophisticated features, such as inverse document frequency (IDF) weights or term frequency-inverse document frequency (TF-IDF).
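
For a first look at TF-IDF, the sketch below compares raw counts with TF-IDF weights using scikit-learn's `TfidfVectorizer` on its default settings. The sentences in `toy_corpus` are made up, and this is not something the notebook itself trains on.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus
toy_corpus = ["free prize free prize", "meeting notes", "free meeting"]

counts = CountVectorizer().fit_transform(toy_corpus)
tfidf = TfidfVectorizer().fit_transform(toy_corpus)

# Raw token counts per document
print(counts.toarray())

# TF-IDF weights: tokens that occur in fewer documents get a larger weight
print(tfidf.toarray().round(2))

# With the default settings, each row is scaled to unit L2 length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```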

## <center>fin</center>


<!-- DO NOT MODIFY OR DELETE THIS -->

<sup>made/compiled by daniel stanley tan & courtney anne ngo 🐰 & thomas james tiam-lee</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> danieltan07@gmail.com & courtneyngo@gmail.com & thomasjamestiamlee@gmail.com</sup><br>
<sup>please cc your instructor, too</sup>
<!-- DO NOT MODIFY OR DELETE THIS -->