# Naive Bayes and SVC 

**A note on this document**
This document is a Jupyter notebook, a format in which text and executable code coexist in an easy-to-read layout. Each block contains either text or executable code. To run a code block, press `Shift+Enter` or `Ctrl+Enter`, or click the arrow on the block. Earlier code blocks must be run before later ones will work.

## Iris Flowers

In classification problems, the output space consists of a set of $C$ labels, which are referred to as `classes`. This set is denoted as $\mathcal{Y} = \{1, 2, \ldots, C\}$. The task of predicting the class label based on an input is commonly known as `pattern recognition`. When there are only two classes, they are often represented as $y \in \{0, 1\}$ or $y \in \{-1, +1\}$, and this specific scenario is called `binary classification`.

For instance, let's consider the task of classifying Iris flowers into their three subspecies: Setosa, Versicolor, and Virginica. The figure below showcases an example from each of these classes.

<div>
<img src="./figures/iris.png" width="600"/>
</div>
Three types of Iris flowers: Setosa (L), Versicolor (C), and Virginica (R). 



The features of the Iris dataset are: sepal length, sepal width, petal length, and petal width.

In [None]:
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

print(iris.feature_names)
print(iris.target_names)

The Iris dataset is a collection of 150 labeled examples of Iris flowers, 50 of each type, described by these 4 features.

In [None]:
import pandas as pd
import numpy as np

X = iris.data
y = iris.target

# Convert to pandas dataframe
df = pd.DataFrame(data=X, columns=iris.feature_names)
df["label"] = pd.Series(iris.target_names[y], dtype="category")

# Show the first five rows
df.head()

For tabular data with a small number of features, it is common to make a `pair plot`, in which panel $(i, j)$ shows a scatter plot of variables $i$ and $j$, and the diagonal entries $(i,i)$ show the marginal density of variable $i$.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# we pick a color map to match that used by decision tree graphviz
palette = {"setosa": "orange", "versicolor": "green", "virginica": "purple"}

g = sns.pairplot(df, vars=df.columns[0:4], hue="label", palette=palette)
plt.show()

The figure above clearly shows that Iris setosa can be readily classified, whereas distinguishing between Iris versicolor and Iris virginica is more challenging.
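
As a quick check of the claim about Iris setosa, the sketch below tests whether a single cut on petal length separates setosa from the other two classes. The threshold of 2.5 cm is an illustrative value read off the pair plot, not part of the original analysis.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()
petal_length = iris_demo.data[:, 2]   # column 2 is petal length (cm)
is_setosa = iris_demo.target == 0     # class 0 is setosa

# Hypothetical threshold chosen by eye from the pair plot
threshold = 2.5
predicted_setosa = petal_length < threshold

# Count flowers on the wrong side of the threshold
n_errors = int((predicted_setosa != is_setosa).sum())
print(f"Setosa vs. rest errors with petal length < {threshold} cm: {n_errors}")
```

No such single threshold cleanly separates versicolor from virginica, which is why a classifier that uses all four features is worth trying.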

Let's use the `train_test_split` function in scikit-learn to split the dataset into a **training set** and a **testing (or validation) set**. 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
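
To see what the split produces, the sketch below repeats it on a fresh copy of the data and prints the resulting sizes. The `stratify` argument is an optional extra that the cell above does not use; it keeps the class proportions equal in both subsets. The variable names here are illustrative and chosen so they do not overwrite the notebook's own `X_train` and `X_test`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same settings as the split above, plus stratify (an assumption, not in the original cell)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Training set shape:", Xtr.shape)
print("Testing set shape:", Xte.shape)
print("Test examples per class:", np.bincount(yte))
```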

### Naive Bayes Classifier

Let's try the Naive Bayes method to classify the Iris flowers. We will use `Gaussian` Naive Bayes because the features are continuous, real-valued measurements, which this variant assumes are normally distributed within each class.

In [None]:
# Train a Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = nb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
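
To make the Gaussian assumption concrete, here is a minimal sketch of the same idea written with NumPy: estimate a prior, a mean, and a variance per class and feature, then pick the class with the largest log joint probability. It is an illustration under the stated assumptions, not scikit-learn's actual implementation; the small variance floor mimics the default `var_smoothing` of `GaussianNB`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

classes = np.unique(ytr)
# Per-class prior, feature means, and feature variances from the training set
priors = np.array([np.mean(ytr == c) for c in classes])
means = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
variances = np.array([Xtr[ytr == c].var(axis=0) for c in classes])
# Small variance floor for numerical stability (assumed to mirror var_smoothing=1e-9)
variances += 1e-9 * Xtr.var(axis=0).max()


def log_joint(X):
    """Return log p(y=c) + sum_j log N(x_j | mean_cj, var_cj) per sample and class."""
    diff = X[:, None, :] - means[None, :, :]  # shape (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    return np.log(priors)[None, :] + log_lik.sum(axis=2)


manual_pred = classes[np.argmax(log_joint(Xte), axis=1)]
sklearn_pred = GaussianNB().fit(Xtr, ytr).predict(Xte)
agreement = np.mean(manual_pred == sklearn_pred)
print(f"Agreement with GaussianNB: {agreement:.2%}")
```

The "naive" part is the sum over features inside `log_joint`: each feature contributes independently given the class.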

We want to use 5-fold cross-validation to find accuracy scores and their average.

In [None]:
# Reuse the Gaussian Naive Bayes classifier created above
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation
cv_scores = cross_val_score(nb_classifier, X, y, cv=5)

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean accuracy
mean_accuracy = np.mean(cv_scores)
print("Mean accuracy:", mean_accuracy)
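
For reference, the sketch below spells out what 5-fold cross-validation does: the data is divided into five folds, and each fold serves once as the validation set while the model is trained on the other four. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, so `StratifiedKFold` is used here as well. The variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
model = GaussianNB()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns the accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print("Manual fold scores:", manual_scores)
print("Matches cross_val_score:", np.allclose(manual_scores, library_scores))
```

Note that cross-validation here runs on the full dataset `X, y`, so its scores are not directly comparable to the single train/test accuracy printed earlier.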

### Support Vector Machine (SVM) Classifier

Let's try the SVM method to classify the Iris flowers.

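Kernels in SVMs are functions that measure the similarity between two inputs; using a kernel is equivalent to mapping the original features into a (typically higher-dimensional) space and fitting a linear boundary there. In their simplest form, SVMs are binary classifiers, and kernels help them handle data that is not linearly separable. For multiclass problems such as Iris, scikit-learn's `SVC` combines several binary classifiers internally. There are several types of kernels we can use in SVMs: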

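- Linear Kernel: It is suitable for linearly separable data and assumes that the classes can be separated by a hyperplane (a straight line in two dimensions). Note that scikit-learn's `SVC` defaults to the RBF kernel, so the linear kernel must be requested with `kernel="linear"`.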
- Polynomial Kernel: This kernel is useful when data is not linearly separable and can be separated by a polynomial boundary.
- Radial Basis Function (RBF) Kernel: The RBF kernel is often a good choice when the decision boundary is not expected to be linear or polynomial. It is a versatile kernel for handling non-linear data.
- Sigmoid Kernel: The sigmoid kernel can be used for data that follows sigmoid-like patterns.

Note that we did not discuss the RBF and sigmoid kernels in class.  
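
The four kernels listed above can be written as short formulas. The sketch below evaluates each one by hand for two small example vectors and compares the result with the helpers in `sklearn.metrics.pairwise`. The vectors and the values of `gamma`, `coef0`, and `degree` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two arbitrary example points and arbitrary kernel parameters
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
gamma, coef0, degree = 0.5, 1.0, 3

dot = x @ z                      # inner product <x, z>
sq_dist = np.sum((x - z) ** 2)   # squared Euclidean distance ||x - z||^2

manual = {
    "linear": dot,
    "poly": (gamma * dot + coef0) ** degree,
    "rbf": np.exp(-gamma * sq_dist),
    "sigmoid": np.tanh(gamma * dot + coef0),
}

# The pairwise helpers expect 2D arrays of shape (n_samples, n_features)
X1, Z1 = x[None, :], z[None, :]
library = {
    "linear": linear_kernel(X1, Z1)[0, 0],
    "poly": polynomial_kernel(X1, Z1, degree=degree, gamma=gamma, coef0=coef0)[0, 0],
    "rbf": rbf_kernel(X1, Z1, gamma=gamma)[0, 0],
    "sigmoid": sigmoid_kernel(X1, Z1, gamma=gamma, coef0=coef0)[0, 0],
}

for name in manual:
    print(f"{name:>8}: manual = {manual[name]:.6f}, library = {library[name]:.6f}")
```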

The choice of kernel depends on the nature of the data and the problem we are trying to solve. It's often a good idea to experiment with different kernels to find the one that works best for a specific dataset.

We will use the linear kernel for this example.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train an SVM classifier
svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

We want to use 5-fold cross-validation to find the accuracy scores and their average.

In [None]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(svm_classifier, X, y, cv=5)

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean accuracy
mean_accuracy = np.mean(cv_scores)
print("Mean accuracy:", mean_accuracy)
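
The earlier suggestion to experiment with different kernels can be sketched as a short loop. Every kernel is run with scikit-learn's default hyperparameters, which is an assumption made for brevity; a fair comparison would also tune parameters such as `C` and `gamma`. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # 5-fold cross-validated accuracy for an SVC with this kernel
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel:>8}: mean accuracy = {scores.mean():.3f}")
```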

## MNIST-784

MNIST-784 is a widely used dataset in machine learning and computer vision. MNIST stands for the "Modified National Institute of Standards and Technology" database. It contains a large collection of handwritten digits (0 through 9) that are commonly used for training and testing machine learning models, especially for image classification tasks. Each image is a 28x28-pixel grayscale image, giving 784 pixel values per image (hence the suffix 784). These images are typically used to develop and test algorithms for digit recognition.

In [None]:
# Import necessary libraries
from sklearn import datasets
import joblib
from pathlib import Path

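# Cache the MNIST dataset locally so that it is downloaded only once
mnist_filename = "data/mnist_784.pkl"
path = Path(mnist_filename)

if not path.is_file():
    # Create the data/ directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    # Download the dataset as pandas objects. It will take about a minute.
    mnist_dataset = datasets.fetch_openml("mnist_784", as_frame=True)
    joblib.dump(mnist_dataset, mnist_filename)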

# Load the MNIST dataset
mnist = joblib.load(mnist_filename)

# Split the data into features and labels
features = mnist.data
labels = mnist.target

# features is a pandas DataFrame; info() prints its summary and returns None
features.info()

# labels is a pandas Series
labels.info()

As indicated in the above information, the dataset comprises 70,000 entries. Let's display the first few entries.

In [None]:
import matplotlib.pyplot as plt

# Reshape and display the first few images
rows = 5  # Number of rows in the image grid; increase to display more images
cols = 10  # Number of images per row; increase to display more images
for i in range(rows):
    for j in range(cols):
        data = features.iloc[i * cols + j]
        image = data.to_numpy().reshape(28, 28)
        plt.subplot(rows, cols, i * cols + j + 1)
        plt.imshow(image, cmap="gray")
        plt.title(f"{labels.iloc[i*cols+j]}")
        plt.axis("off")

In [None]:
# features is a pandas.core.frame.DataFrame object
print(features.shape)

print(features.head())

Let's print labels.

In [None]:
print(labels.head())

### Deliverable 1

Use the Naive Bayes method to classify the handwritten digits. 

The very first step is to normalize the pixel values to the range [0, 1]. For the MNIST dataset, which consists of grayscale pixel values ranging from 0 to 255, normalizing by dividing all pixel values by 255 is a straightforward way to achieve this. This normalization ensures that each pixel value is between 0 and 1, making the dataset more suitable for training neural networks and other machine learning models.


In [None]:
# Load the MNIST dataset
features = mnist.data
labels = mnist.target

# Preprocess the data
features_normalized = features / 255.0  # Normalize the pixel values to the range [0, 1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features_normalized, labels, test_size=0.2, random_state=42
)

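We will use the `Multinomial` Naive Bayes, which models non-negative, count-like features as draws from a multinomial distribution. The raw pixel intensities are discrete values from 0 to 255; after dividing by 255 they become fractions, which scikit-learn's `MultinomialNB` still accepts in practice because it only requires non-negative inputs.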
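
As a warm-up, here is a minimal sketch of the same workflow on scikit-learn's much smaller built-in digits dataset (8x8 images loaded with `load_digits`, not MNIST). It only illustrates the fit, predict, and score pattern; the variable names are illustrative and chosen so they do not overwrite the MNIST variables used in this deliverable.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Small 8x8 digit images with non-negative integer pixel intensities
digits = load_digits()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Fit on the training split, then evaluate on the held-out split
digits_nb = MultinomialNB()
digits_nb.fit(Xd_train, yd_train)
yd_pred = digits_nb.predict(Xd_test)

digits_accuracy = accuracy_score(yd_test, yd_pred)
print(f"Accuracy on the 8x8 digits dataset: {digits_accuracy * 100:.2f}%")
```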

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Write your code for multinomial Naive Bayes that classifies the handwritten digits.
# Ensure you print out the accuracy score.

# Create a Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

Use the following indices of the misclassified test samples, computed from your Naive Bayes predictions `y_pred`, to display the misclassified digits.

In [None]:
mis_indices_nb = np.where(y_pred != y_test.values)[0]  # index of misclassification

# Write your code to display the first 10 misclassified digits.
# You should add the predicted values on top of the digits.
# To do this, use plt.title.

Use 5-fold cross-validation to find the accuracy scores and their average.

In [None]:
# Perform 5-fold cross-validation

### Deliverable 2

Use the SVM method to classify the handwritten digits. 

We will use the `linear` kernel to classify this dataset. The linear kernel is suitable for linearly separable data and assumes that the classes can be separated by a hyperplane (a straight line in two dimensions).

Warning: It will take about 3-7 minutes.

In [None]:
# Write your code for SVM with linear kernel that classifies the handwritten digits.
# Ensure you print out the accuracy score.

Now, let's experiment with the `Radial Basis Function (RBF) kernel`, which is often a good choice when the decision boundary is not expected to be linear or polynomial. It is a versatile kernel for handling non-linear data.

Warning: It will take about 3-7 minutes.

In [None]:
# Write your code for SVM with RBF kernel that classifies the handwritten digits.
# Ensure you print out the accuracy score.

Let's confirm whether the digits misclassified by the Naive Bayes classifier are now correctly classified by SVM. Display these misclassified digits along with their predicted values from SVM. To do this, use the same `mis_indices_nb` in conjunction with the `y_pred` from the SVM classifier.

In [None]:
# Write your code to display the first 10 misclassified digits by Naive Bayes.
# You should add the predicted values by SVM on top of the digits.

Use 5-fold cross-validation to find the accuracy scores and their average.

Warning: It will take about 20 minutes.

In [None]:
# Perform 5-fold cross-validation