**Junnan Shimizu**

Spring 2022

CS 251: Data Analysis and Visualization

# Lab 6B: Naive Bayes and K-Nearest Neighbor

In this lab we will use the scikit-learn library to train a Gaussian Naive Bayes classifier and a K-Nearest Neighbor classifier, compare their accuracy, and graph the results.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, preprocessing, model_selection, decomposition
from sklearn import neighbors, naive_bayes, metrics

plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2



## Task 1: Load the digits dataset

We're going to see whether Naive Bayes and K-Nearest Neighbor can classify handwriting data.  The [digits dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html) is a set of 1797 8x8 pixel images of handwritten digits 0-9.  It is a much smaller, lower-resolution relative of the [MNIST handwriting dataset](http://yann.lecun.com/exdb/mnist/); scikit-learn's copy comes from the UCI Optical Recognition of Handwritten Digits collection rather than from MNIST itself.

1. Load the [digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html). Use the `return_X_y` parameter so that it returns both the X data and y classifications.
2. Use [train test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the X data and y classifications into an X_training dataset, an X_testing dataset, and the corresponding y_training and y_testing labels.  Set the test size to 0.3 and shuffle to True.
3. Print the shape of X_training, X_testing, y_training, and y_testing.
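
To make the data layout concrete, here is a minimal sketch (not part of the required solution) showing that each row of X is one flattened 8x8 image. The names `first_image` and `first_label` are illustrative assumptions, not names the lab requires.

```python
import numpy as np
from sklearn import datasets

# Load the digits as a (samples, features) matrix plus a label vector
X, y = datasets.load_digits(return_X_y=True)

# Each row holds the 64 pixel intensities of one 8x8 image
first_image = X[0].reshape(8, 8)
first_label = y[0]

print("X shape:", X.shape)
print("Image shape:", first_image.shape)
print("Label of first image:", first_label)
print("Distinct labels:", np.unique(y))
```

Reshaping a row back to 8x8 is only needed for viewing; the classifiers in this lab work directly on the flattened 64-feature rows.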

In [5]:
np.random.seed(42)

# Your code here
X, y = datasets.load_digits(return_X_y=True)

X_training, X_testing, y_training, y_testing = model_selection.train_test_split(X, y, test_size=0.3, shuffle=True)

print("X_training shape:", X_training.shape)
print("X_testing shape:", X_testing.shape)
print("y_training shape:", y_training.shape)
print("y_testing shape:", y_testing.shape)

print( "Expected output")
print('''
X training data shape:  (1257, 64)
X testing data shape:   (540, 64)
y training labels shape:(1257,)
y testing labels shape: (540,)
''')

X_training shape: (1257, 64)
X_testing shape: (540, 64)
y_training shape: (1257,)
y_testing shape: (540,)
Expected output

X training data shape:  (1257, 64)
X testing data shape:   (540, 64)
y training labels shape:(1257,)
y testing labels shape: (540,)



## Task 2: Create Classifiers and Calculate Accuracy

### Create a Naive Bayes Classifier
1. Create a [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) [(More Info)](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes). 
2. Use the fit method with the training dataset as X and the y training dataset labels as the target.
3. Calculate the accuracy of the classifier with the test data and test dataset labels using the score method.
4. Print the accuracy of the Naive Bayes classifier.

### Create a K-NN Classifier
1. Using the lab from last week as reference, create a [K-Nearest Neighbors Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) [(More Info)](https://scikit-learn.org/stable/modules/neighbors.html#classification).  Set n_neighbors equal to 7.
2. Assign your classifier to a variable with a **different** name than your Naive Bayes classifier.
3. Use the fit method with the training dataset as X and the y training dataset labels as the target.
4. Calculate the accuracy of the classifier with the test data and test dataset labels using the score method.
5. Print the accuracy of the K-NN classifier.
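
For a scikit-learn classifier, the score method reports mean accuracy: the fraction of predicted labels that equal the true labels. The sketch below reproduces that calculation by hand. The two label arrays are made up for illustration and are unrelated to the digits data.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Accuracy is the fraction of positions where the prediction equals the truth
accuracy = np.mean(y_pred == y_true)
print("Accuracy:", accuracy)
```

Six of the eight hypothetical predictions match, so the accuracy is 0.75.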


In [10]:
# Your code here
gnb = naive_bayes.GaussianNB()
gnb.fit(X_training, y_training)
gnb_accuracy = gnb.score(X_testing, y_testing)
knn = neighbors.KNeighborsClassifier(n_neighbors=7)

knn.fit(X_training, y_training)
knn_accuracy = knn.score(X_testing, y_testing)

print("GNB Accuracy:", gnb_accuracy)

print("KNN Accuracy:", knn_accuracy)

print( "Expected output")
print('''
Gaussian Naive Bayes Classifier Accuracy: 0.85185...
K-Nearest Neighbor Classifier Accuracy:   0.99074...
''')

GNB Accuracy: 0.8518518518518519
KNN Accuracy: 0.9907407407407407
Expected output

Gaussian Naive Bayes Classifier Accuracy: 0.85185...
K-Nearest Neighbor Classifier Accuracy:   0.99074...



## Task 3: Create a confusion matrix for each classifier

1. Find the predicted labels for the X test data using the predict method of the Naive Bayes classifier and the K-NN classifier.
2. Create a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for each classifier, using the predicted labels and actual labels.
3. Print the confusion matrices, along with some indication that the rows indicate the number of points that truly have a given label and that the columns indicate the number of points predicted to have that label.
4. Visualize the confusion matrices using imshow. For reference, use Lab 4a and this [matplotlib example](https://matplotlib.org/3.1.1/gallery/images_contours_and_fields/image_annotated_heatmap.html) of an annotated heatmap.
    1. Set x_ticks and y_ticks to align with the list of digits.
    2. Use imshow to draw the matrix.
    3. Choose a perceptually uniform [colormap](https://matplotlib.org/tutorials/colors/colormaps.html).
    4. Use a colorbar to label the matrix.
    5. Remember to call `plt.show()` at the end, or other plots later might not work.
    6. Give your plot a meaningful title.
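
The visualization steps above can be sketched as follows. This is one possible approach, not the required solution: `plot_confusion_matrix` is a hypothetical helper name, `viridis` is just one of matplotlib's perceptually uniform colormaps, and the 3-class matrix is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, title):
    """Draw a confusion matrix as a heatmap (hypothetical helper)."""
    labels = np.arange(cm.shape[0])
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(cm, cmap="viridis")  # viridis is perceptually uniform
    fig.colorbar(im, ax=ax, label="Number of samples")
    ax.set_xticks(labels)  # one tick per class on each axis
    ax.set_yticks(labels)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title(title)
    ax.grid(False)  # a grid style would otherwise draw lines over the cells
    return fig, ax

# Small made-up 3-class matrix, for illustration only
demo_cm = np.array([[5, 0, 1],
                    [0, 4, 0],
                    [2, 0, 6]])
fig, ax = plot_confusion_matrix(demo_cm, "Demo confusion matrix")
plt.show()
```

With a 10x10 matrix from this task stored in a variable such as `knn_cm`, the call would be `plot_confusion_matrix(knn_cm, "K-NN confusion matrix")`.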
    
#### Review Question: Which digits are most likely to be misclassified and what are they most likely to be misclassified as?
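
One way to approach this question is to look only at the off-diagonal entries of a confusion matrix. The sketch below does this for the expected Gaussian Naive Bayes matrix given in this lab; the variable names are illustrative assumptions.

```python
import numpy as np

# Expected Gaussian Naive Bayes confusion matrix, copied from this lab
cm = np.array([
    [52,  0,  0,  0,  0,  0,  0,  1,  0,  0],
    [ 0, 37,  2,  0,  0,  0,  0,  2,  6,  3],
    [ 0,  3, 31,  0,  0,  0,  1,  0, 12,  0],
    [ 0,  0,  2, 41,  0,  0,  1,  0,  8,  2],
    [ 0,  0,  0,  0, 51,  0,  2,  7,  0,  0],
    [ 0,  0,  0,  1,  0, 62,  1,  2,  0,  0],
    [ 0,  0,  0,  0,  1,  1, 51,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  1,  0, 54,  0,  0],
    [ 0,  2,  0,  0,  0,  0,  0,  2, 39,  0],
    [ 0,  1,  1,  1,  0,  2,  1,  7,  4, 42],
])

# Zero the diagonal so only misclassifications remain
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Misclassified count per true digit (row sums of the off-diagonal entries)
errors_per_digit = errors.sum(axis=1)

# Location of the single largest off-diagonal entry: (true digit, predicted digit)
true_digit, predicted_digit = np.unravel_index(np.argmax(errors), errors.shape)

print("Misclassified per true digit:", errors_per_digit)
print("Most common confusion: true", true_digit, "predicted as", predicted_digit)
```

Row sums show which true digits are misclassified most often, and the largest off-diagonal entry shows which single confusion is most common.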

In [22]:
# Your code goes here
# Naive Bayes classifier: predict on the held-out test data
nb_pred = gnb.predict(X_testing)
nb_cm = metrics.confusion_matrix(y_testing, nb_pred)
print("Confusion Matrix (Naive Bayes Classifier):\n", nb_cm)
print("Rows indicate the true labels, columns indicate predicted labels.")

# K-NN classifier: predict on the held-out test data
knn_pred = knn.predict(X_testing)
knn_cm = metrics.confusion_matrix(y_testing, knn_pred)
print("\nConfusion Matrix (K-NN Classifier):\n", knn_cm)
print("Rows indicate the true labels, columns indicate predicted labels.")

print("Expected output (rows indicate true class count, columns indicate predicted class count)")
print('''
K-Nearest Neighbor Confusion Matrix
 [[53  0  0  0  0  0  0  0  0  0]
 [ 0 50  0  0  0  0  0  0  0  0]
 [ 0  0 47  0  0  0  0  0  0  0]
 [ 0  0  0 54  0  0  0  0  0  0]
 [ 0  0  0  0 60  0  0  0  0  0]
 [ 0  0  0  0  0 64  1  0  0  1]
 [ 0  0  0  0  0  0 53  0  0  0]
 [ 0  0  0  0  0  0  0 55  0  0]
 [ 0  0  0  0  0  0  0  0 43  0]
 [ 0  0  0  1  1  1  0  0  0 56]]
 Gaussian Naive Bayes Confusion Matrix
[[52  0  0  0  0  0  0  1  0  0]
 [ 0 37  2  0  0  0  0  2  6  3]
 [ 0  3 31  0  0  0  1  0 12  0]
 [ 0  0  2 41  0  0  1  0  8  2]
 [ 0  0  0  0 51  0  2  7  0  0]
 [ 0  0  0  1  0 62  1  2  0  0]
 [ 0  0  0  0  1  1 51  0  0  0]
 [ 0  0  0  0  0  1  0 54  0  0]
 [ 0  2  0  0  0  0  0  2 39  0]
 [ 0  1  1  1  0  2  1  7  4 42]]
''')