# Naive Bayes and SVM 

- Author: Stan Baek  
- Department of Electrical & Computer Engineering
- United States Air Force Academy
- Date: Aug 08, 2023  

*2024-11-24: Comments heavily revised by Stan Baek*

## Simplified BSD License (FreeBSD License)
**Copyright (c) 2023, Stan Baek, All rights reserved.**

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.

**THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.**

The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.


**A note on this document**
This document is a Jupyter notebook, which lets text and executable code coexist in an easy-to-read format. Each block (cell) contains either text or executable code. To run a code block, press `Shift+Enter` or `Ctrl+Enter`, or click the arrow on the block. Earlier code blocks must be run before later ones will work.


## Iris Flowers

In classification problems, the output space is a set of $C$ labels, referred to as `classes`, denoted $\mathcal{Y} = \{1, 2, \ldots, C\}$. The goal in such problems is to predict the correct label for a given input, a task widely known as `pattern recognition`.

In cases where there are only two possible classes, the labels are typically represented as $y \in \{0, 1\}$ or  $y \in \{-1, +1\}$. This specific type of classification is called binary classification.

As an example of a classification task, consider classifying Iris flowers into one of three subspecies: Setosa, Versicolor, and Virginica. The image below illustrates an example from each class.

<div> <img src="./data/iris.png" width="600"/> </div> 
Three types of Iris flowers: Setosa (left), Versicolor (center), and Virginica (right).

Each flower in the Iris dataset is described by four features: sepal length, sepal width, petal length, and petal width.

The following code demonstrates how to load the Iris dataset using the sklearn library:

In [None]:
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

print(type(iris))  # Print the type of the dataset object
print(iris.feature_names)  # Print the names of the dataset's features
print(iris.target_names)  # Print the names of the target classes

The Iris dataset is a collection of 150 labeled examples of Iris flowers, 50 of each type, described by these 4 features.
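As a quick check of the class balance stated above, the following sketch counts the samples per class. It assumes only `load_iris` and NumPy's `bincount`; the names `iris_check` and `class_counts` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load a separate copy of the dataset so the notebook's variables are untouched
iris_check = load_iris()

# Shape of the feature matrix: (number of samples, number of features)
print(iris_check.data.shape)

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(iris_check.target)
for name, count in zip(iris_check.target_names, class_counts):
    print(f"{name}: {count} samples")
```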

In [None]:
import pandas as pd
import numpy as np

# Extract feature data (X) and target labels (y) from the Iris dataset
# Features: Sepal length, sepal width, petal length, petal width
X = iris.data
# Target labels: Encoded as integers (0 = Setosa, 1 = Versicolor, 2 = Virginica)
y = iris.target

# Convert the feature data and target labels into a Pandas DataFrame
df = pd.DataFrame(
    data=X, columns=iris.feature_names
)  # Create a DataFrame with feature names as column headers

# Display the first few rows of the DataFrame to verify its structure and content
df.head()  # Returns the first 5 rows, including features

In [None]:
# Add a new column for human-readable class labels using the target names
# Map the numerical target labels (0, 1, 2) to their corresponding class names (Setosa, Versicolor, Virginica)
df["label"] = pd.Series(iris.target_names[y], dtype="category")

# Display the first few rows of the DataFrame to verify its structure and content
df.head()  # Returns the first 5 rows, including features and their corresponding class labels

For tabular data with a small number of features, it is common to make a `pair plot`, in which panel $(i, j)$ shows a scatter plot of variables $i$ and $j$, and the diagonal entries $(i,i)$ show the marginal density of variable $i$.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Define a custom color palette to match the colors used in decision tree visualizations
# The keys are the class names (labels), and the values are the colors assigned to each class
palette = {
    "setosa": "orange",  # Setosa class will be represented in orange
    "versicolor": "green",  # Versicolor class will be represented in green
    "virginica": "purple",  # Virginica class will be represented in purple
}

# Create a pair plot using Seaborn to visualize pairwise relationships between features
# - `df`: The DataFrame containing the Iris dataset
# - `vars`: Specifies the columns to use for the pair plot; in this case, the first 4 feature columns
# - `hue`: Groups data points by the "label" column, which corresponds to the class labels
# - `palette`: Applies the custom color mapping defined above for the classes
g = sns.pairplot(df, vars=df.columns[0:4], hue="label", palette=palette)

# Display the resulting plot
plt.show()

The figure above demonstrates that Iris setosa is easily distinguishable due to its unique feature patterns. However, classifying Iris versicolor and Iris virginica is more difficult because their feature spaces overlap.
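The separation visible in the pair plot can also be checked numerically. The sketch below compares the range of petal length (column index 2 of the feature matrix) across the three classes; the names `petal` and `petal_range` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()

# Build a small table with petal length and the class name of each sample
petal = pd.DataFrame(
    {
        "petal_length": iris_data.data[:, 2],
        "species": iris_data.target_names[iris_data.target],
    }
)

# Minimum and maximum petal length within each class
petal_range = petal.groupby("species")["petal_length"].agg(["min", "max"])
print(petal_range)
```

The longest Setosa petal is shorter than the shortest Versicolor petal, so a single threshold on this feature isolates Setosa. The Versicolor and Virginica ranges overlap, which is why those two classes are harder to tell apart.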

To evaluate a classification model effectively, we divide the dataset into two subsets:

- **Training Set**: Used to train the model.
- **Testing (or Validation) Set**: Used to evaluate the model's performance on unseen data.

Evaluating on held-out data gives an honest estimate of how well the model generalizes and exposes overfitting, which accuracy on the training data alone would hide. We’ll use the `train_test_split` function from `scikit-learn` to achieve this.

In [None]:
from sklearn.model_selection import train_test_split  # For splitting the dataset
from sklearn.naive_bayes import GaussianNB  # A classification algorithm (Naive Bayes)
from sklearn.metrics import accuracy_score  # To measure the model's performance

# Split the data into training and testing sets
# - X: Feature matrix (sepal/petal dimensions for each Iris sample)
# - y: Target labels (numerical representation of Iris species)
# - test_size=0.2: 20% of the data is reserved for testing, and 80% for training
# - random_state=42: Ensures reproducibility by using a fixed seed for randomness
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The train_test_split function performs:
# - Random shuffling of the dataset.
# - Division of data into two parts: training set (X_train, y_train) and testing set (X_test, y_test).
# - A specified proportion for the split (e.g., 80% training, 20% testing).

# Print the shapes of the resulting datasets for verification
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
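A plain random split does not guarantee that each class appears in the same proportion in both subsets. As an optional variation (not used in the rest of this notebook), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The variable names below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris keeps the class proportions equal in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# With 50 samples per class and a 20% test split, each class
# contributes 10 test samples and 40 training samples
print("Test-set class counts: ", np.bincount(ys_test))
print("Train-set class counts:", np.bincount(ys_train))
```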

## Naive Bayes Classifier

The Naive Bayes method is a probabilistic classifier based on Bayes' Theorem. It assumes that features are independent given the class label. Since the Iris dataset consists of continuous, real-valued features, we use the `Gaussian` Naive Bayes classifier, which assumes the feature values are normally distributed.
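
Written out, the classifier picks the class with the largest posterior score. The following is a sketch of the standard formulation, where $d$ is the number of features and $\mu_{ci}$, $\sigma_{ci}^2$ (symbols introduced here for illustration) are the mean and variance of feature $i$ within class $c$:

$$
\hat{y} = \arg\max_{c \in \mathcal{Y}} \; P(y = c) \prod_{i=1}^{d} p(x_i \mid y = c),
\qquad
p(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{ci}^2}} \exp\!\left(-\frac{(x_i - \mu_{ci})^2}{2\sigma_{ci}^2}\right)
$$

The product over features is where the independence ("naive") assumption enters: each feature contributes its own one-dimensional Gaussian likelihood, and no covariance between features is modeled.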


In [None]:
# Import necessary libraries
from sklearn.naive_bayes import GaussianNB  # Gaussian Naive Bayes classifier
from sklearn.metrics import accuracy_score  # To evaluate model accuracy

# Train a Gaussian Naive Bayes classifier
nb_classifier = GaussianNB()  # Initialize the Gaussian Naive Bayes classifier

# Fit the classifier to the training data
# - X_train: Feature matrix for training
# - y_train: Target labels for training
nb_classifier.fit(X_train, y_train)

# Evaluate the model on the testing set
# - X_test: Feature matrix for testing
# - y_test: Target labels for testing
y_pred = nb_classifier.predict(X_test)  # Predict the class labels for the test set

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)  # Proportion of correct predictions
print(f"Accuracy: {accuracy * 100:.2f}%")  # Print accuracy as a percentage
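To make the mechanics concrete, the following sketch re-implements the Gaussian Naive Bayes decision rule with NumPy and compares its predictions with `GaussianNB`. It is a minimal illustration, not scikit-learn's actual code: it fits on the full Iris dataset, and the small variance floor is an assumption chosen to mirror the library's default `var_smoothing=1e-9`. All variable and function names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X_nb, y_nb = load_iris(return_X_y=True)
classes = np.unique(y_nb)

# Per-class statistics: prior P(y=c), feature means, and feature variances
priors = np.array([np.mean(y_nb == c) for c in classes])
means = np.array([X_nb[y_nb == c].mean(axis=0) for c in classes])
variances = np.array([X_nb[y_nb == c].var(axis=0) for c in classes])

# Small variance floor for numerical stability (assumption: mirrors the
# default var_smoothing=1e-9, scaled by the largest feature variance)
variances += 1e-9 * X_nb.var(axis=0).max()


def log_posterior(X):
    """Unnormalized log P(y=c | x); rows are samples, columns are classes."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :]
        + diff**2 / variances[None, :, :],
        axis=2,
    )
    return np.log(priors)[None, :] + log_likelihood


# Pick the class with the largest log-posterior for each sample
manual_pred = classes[np.argmax(log_posterior(X_nb), axis=1)]
sklearn_pred = GaussianNB().fit(X_nb, y_nb).predict(X_nb)

agreement = np.mean(manual_pred == sklearn_pred)
print(f"Agreement with GaussianNB: {agreement:.3f}")
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when many small probabilities are multiplied.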

The single train/test split above yields only one accuracy estimate, which depends on how the data happened to be divided. Next, we use 5-fold cross-validation to assess the Gaussian Naive Bayes classifier's accuracy across different splits of the dataset.

In [None]:
# Import the cross-validation helper; `nb_classifier` (Gaussian Naive Bayes) was created in the previous code block
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation to evaluate the model's performance
# - `cv=5`: Specifies 5-fold cross-validation
cv_scores = cross_val_score(nb_classifier, X, y, cv=5)

# Print cross-validation scores for each fold
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean accuracy from cross-validation
mean_accuracy = np.mean(cv_scores)  # Average accuracy across all folds
print("Mean accuracy:", mean_accuracy)
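To show what `cross_val_score` does internally, the sketch below reproduces it with an explicit loop. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the loop uses `StratifiedKFold`. The names `X_cv`, `y_cv`, `manual_scores`, and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_cv, y_cv = load_iris(return_X_y=True)
model = GaussianNB()

# Stratified folds keep the class proportions similar in every fold
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X_cv, y_cv):
    # Train a fresh copy of the model on four folds
    fold_model = clone(model)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    # Score (accuracy) on the held-out fold
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_cv, y_cv, cv=5)
print("Manual loop:    ", manual_scores)
print("cross_val_score:", library_scores)
```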

## Support Vector Machine (SVM) Classifier

Support Vector Machines (SVMs) are powerful classification models that find the **hyperplane** separating the classes in feature space with the largest margin. SVMs are particularly effective for binary classification, but they can also handle multi-class problems (like the Iris dataset) using extensions such as one-vs-one or one-vs-rest strategies. Let's try the SVM method to classify the Iris flowers.

### Understanding Kernels in SVMs
Kernels are functions that measure the similarity between two data points. They let an SVM behave as if the data had been mapped into a higher-dimensional space, where a linear hyperplane can separate the classes, without computing that mapping explicitly. The choice of kernel plays a critical role in the SVM's ability to handle different types of data:

- Linear Kernel: Suitable for linearly separable data. It finds a flat hyperplane in the original feature space to separate the classes.
- Polynomial Kernel: Useful when the data requires a polynomial decision boundary.
- Radial Basis Function (RBF) Kernel (default): Often a good choice for non-linear data, as it corresponds to a mapping into an infinite-dimensional space.
- Sigmoid Kernel: Based on the hyperbolic tangent function, similar to a neural-network activation; it is less commonly used.

The choice of kernel depends on the nature of the data and the problem we are trying to solve. It's often a good idea to experiment with different kernels to find the one that works best for your specific dataset.
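
As a concrete instance of a kernel, the RBF kernel between two points is $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2)$. The sketch below evaluates it by hand on three made-up points and compares the result with scikit-learn's `rbf_kernel`. The points and the value `gamma = 0.5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three made-up 2D points
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5  # illustrative kernel width parameter

# Squared Euclidean distance between every pair of points
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)

# RBF kernel: similarity decays exponentially with squared distance
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(points, gamma=gamma)

print(np.round(K_manual, 4))
```

Each point has similarity 1 with itself, and the similarity falls toward 0 as points move apart. A larger `gamma` makes the decay faster, so each training point influences a smaller neighborhood.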

For this example, we use the linear kernel, as it is computationally efficient and works well with the Iris dataset.

In [None]:
from sklearn.svm import SVC  # Support Vector Classifier (SVM implementation)
from sklearn.metrics import accuracy_score  # For evaluating the model's accuracy

# Initialize the SVM classifier
# - `kernel="linear"`: Specifies that we are using a linear kernel for this example
svm_classifier_linear = SVC(kernel="linear")

# Train the SVM model on the training data
# - `X_train`: Feature matrix for training
# - `y_train`: Target labels for training
svm_classifier_linear.fit(X_train, y_train)

# Evaluate the trained model on the testing set
# - `X_test`: Feature matrix for testing
# - `y_test`: True target labels for testing
y_pred = svm_classifier_linear.predict(X_test)  # Predict class labels for the test set

# Calculate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)  # Proportion of correct predictions
print(f"Accuracy: {accuracy * 100:.2f}%")  # Print accuracy as a percentage
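After fitting, an `SVC` exposes the training points that determine the separating hyperplanes, the so-called support vectors. The sketch below refits a linear SVM on the full Iris dataset (an assumption made to keep the example self-contained; the names `X_sv`, `y_sv`, and `svm_demo` are illustrative) and inspects them through the `n_support_` and `support_vectors_` attributes.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_sv, y_sv = load_iris(return_X_y=True)
svm_demo = SVC(kernel="linear").fit(X_sv, y_sv)

# Number of support vectors contributed by each class
print("Support vectors per class:", svm_demo.n_support_)

# The support vectors themselves: one row per vector, one column per feature
print("Support vector matrix shape:", svm_demo.support_vectors_.shape)

# Only a subset of the training points ends up defining the decision boundary
fraction = svm_demo.support_vectors_.shape[0] / X_sv.shape[0]
print(f"Fraction of training points that are support vectors: {fraction:.2f}")
```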

### Using 5-Fold Cross-Validation

We will use 5-fold cross-validation to evaluate the accuracy of our SVM classifier. This approach divides our dataset into five subsets, trains the model on four subsets, and tests it on the remaining subset, repeating this process five times. This method helps us get a reliable estimate of the model's performance.

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Define the cross-validation method. We are using 5-fold cross-validation.
# - `n_splits=5` specifies the number of folds.
# - `shuffle=True` shuffles the data before splitting into folds. The Iris samples are ordered by class, so unshuffled folds would have very unbalanced class proportions.
# - `random_state=42` sets a fixed random seed for reproducibility, so that we get the same data splits every time we run the code.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform 5-fold cross-validation
# - `svm_classifier_linear` is our model (defined in the previous code block).
# - `X` is the feature matrix (defined earlier).
# - `y` is the target vector (defined earlier).
# - `cv=kf` uses the KFold object we defined as the cross-validation strategy.
cv_scores = cross_val_score(svm_classifier_linear, X, y, cv=kf)

# Print the cross-validation scores for each fold
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean accuracy
# - We use `np.mean` to compute the average of the cross-validation scores.
mean_accuracy = np.mean(cv_scores)
print("Mean accuracy:", mean_accuracy)
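The effect of `shuffle=True` is easy to see on the Iris dataset, whose samples are stored in class order. The sketch below looks at which classes appear in the first test fold with and without shuffling; the helper `first_fold_test_labels` and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

_, y_sorted = load_iris(return_X_y=True)

# KFold only needs the number of samples, so a placeholder array is enough
placeholder = np.zeros((len(y_sorted), 1))


def first_fold_test_labels(splitter):
    """Return the distinct class labels found in the first test fold."""
    _, test_idx = next(splitter.split(placeholder))
    return np.unique(y_sorted[test_idx])


labels_unshuffled = first_fold_test_labels(KFold(n_splits=5))
labels_shuffled = first_fold_test_labels(
    KFold(n_splits=5, shuffle=True, random_state=42)
)

print("Classes in first test fold, no shuffle:", labels_unshuffled)
print("Classes in first test fold, shuffled:  ", labels_shuffled)
```

Without shuffling, the first test fold holds samples 0 through 29, all of which are Setosa, so that fold says nothing about the other two classes.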

We want to evaluate the performance of the default kernel, Radial Basis Function (RBF), for a Support Vector Classifier (SVC) using 5-fold cross-validation on the Iris dataset and compare it with the linear kernel.

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold

# Define the cross-validation method using 5-fold cross-validation
# - Same settings as `kf` in the previous code block, so both kernels are evaluated on identical folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Create an instance of the classifier
# - `SVC()` creates a Support Vector Classifier with the default RBF kernel.
svm_classifier_rbf = SVC()

# Perform cross-validation and calculate the scores
# - `svm_classifier_rbf` is the model we want to evaluate.
# - `X` is the feature matrix (defined earlier).
# - `y` is the target vector (defined earlier).
# - `cv=kf` uses the KFold object we defined as the cross-validation strategy.
# - `scoring="accuracy"` specifies that we want to evaluate the model based on accuracy.
scores = cross_val_score(svm_classifier_rbf, X, y, cv=kf, scoring="accuracy")

# Print the cross-validation scores for each fold
print("Cross-validation scores:", scores)

# Calculate and print the mean accuracy
# - We use `np.mean` to compute the average of the cross-validation scores.
mean_accuracy = np.mean(scores)
print("Mean accuracy:", mean_accuracy)

On the Iris dataset with these folds, the linear kernel achieves a higher mean accuracy than the RBF kernel.
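
Since the best kernel depends on the data, it can help to compare several kernels in one loop on identical folds. The following is a minimal sketch using default hyperparameters for each kernel; the names `X_k`, `y_k`, `folds`, and `kernel_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X_k, y_k = load_iris(return_X_y=True)

# A fixed random_state gives every kernel the same five folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf"]:
    fold_scores = cross_val_score(SVC(kernel=kernel), X_k, y_k, cv=folds)
    kernel_scores[kernel] = fold_scores.mean()
    print(f"{kernel:>6}: mean accuracy = {kernel_scores[kernel]:.4f}")
```

With only 150 samples, each fold contains 30 test points, so a difference of one or two misclassified flowers shifts the mean noticeably. Small gaps between kernels should be read with that in mind.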

## MNIST-784

**MNIST-784** is a widely used dataset in machine learning and computer vision. MNIST stands for the "Modified National Institute of Standards and Technology" database, and 784 is the number of pixels in each image. The dataset contains a large collection of handwritten digits (0 through 9), which are commonly used for training and testing machine learning models, especially for image classification tasks. Each image is grayscale with a resolution of 28x28 pixels, giving 28 × 28 = 784 pixels in total, stored as one flat row of 784 feature values.
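
The relationship between the flat 784-value row and the 28x28 image can be sketched with NumPy alone. The array below is a stand-in filled with its own indices rather than real pixel data, and the pixel index `k = 100` is an arbitrary choice.

```python
import numpy as np

# Stand-in for one row of pixel values: entry k simply holds the value k
flat_image = np.arange(784)

# Row-major reshape: flat pixel k lands at row k // 28, column k % 28
image_2d = flat_image.reshape(28, 28)

k = 100
row, col = divmod(k, 28)
print(f"Flat index {k} maps to row {row}, column {col}")
print("Value at that position:", image_2d[row, col])
```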


In [None]:
# Import necessary libraries
# - `datasets` from `sklearn`: To fetch the MNIST dataset from OpenML.
from sklearn import datasets

# - `joblib`: To save and load the dataset locally for faster access in future runs.
import joblib

# - `Path` from `pathlib`: To handle file paths in a way that is compatible across different operating systems.
from pathlib import Path

# Define the filename where the MNIST dataset will be saved locally.
mnist_filename = "./data/mnist_784.pkl"

# Create a Path object for the filename. This makes it easier to check if the file exists.
path = Path(mnist_filename)

# Check if the MNIST dataset file already exists locally.
if not path.is_file():

    # Create a directory named 'data'
    # - `Path("data")`: Creates a Path object for the 'data' directory.
    # - `mkdir(parents=True, exist_ok=True)`: Creates the directory with specific conditions.
    #   - `parents=True`: Ensures that any missing parent directories are also created.
    #   - `exist_ok=True`: Prevents raising an error if the directory already exists.
    Path("data").mkdir(parents=True, exist_ok=True)

    # If the file does not exist, download the MNIST dataset from OpenML.
    # - `fetch_openml("mnist_784")`: Downloads the MNIST dataset.
    # Note: This process will take about a minute.
    mnist_dataset = datasets.fetch_openml("mnist_784")

    # Save the downloaded dataset to a local file using joblib.
    joblib.dump(mnist_dataset, mnist_filename)

# Load the MNIST dataset from the local file.
mnist = joblib.load(mnist_filename)

# Split the dataset into features (X) and labels (y).
# - `mnist.data`: Contains the pixel values of the images (features).
# - `mnist.target`: Contains the corresponding digit labels (targets).
features = mnist.data
labels = mnist.target

# Print information about the features.
# - `features.info()`: Prints a summary of the DataFrame, such as the number of entries, column data types, and memory usage.
# - `info()` prints its summary directly and returns None, so it is not wrapped in `print`.
features.info()

# Print information about the labels.
# - `labels.info()`: Prints a summary of the Series, such as the number of entries and memory usage.
labels.info()

The MNIST dataset includes a total of 70,000 handwritten digit images, spanning digits from 0 to 9. To get a better understanding of the dataset, we will visualize a subset of these entries.

In [None]:
import matplotlib.pyplot as plt

# Display the first few images from the dataset
# - The dataset comprises 70,000 entries.
# - We will reshape and display the first few images in a grid format.

# Define the number of rows and columns in the grid
# - `rows`: The number of rows in the grid.
# - `cols`: The number of columns in the grid.
# You can change these values to display more images.
rows = 5  # Change this value to display more images
cols = 10  # Change this value to display more images

# Loop through each position in the grid
for i in range(rows):
    for j in range(cols):
        # Calculate the index of the current image
        index = i * cols + j

        # Extract the image data from the features DataFrame
        # - `features.iloc[index]`: Selects the row at the specified index.
        # - `to_numpy()`: Converts the row to a numpy array.
        # - `reshape(28, 28)`: Reshapes the 1D array into a 2D array (28x28 pixels) to represent the image.
        data = features.iloc[index]
        image = data.to_numpy().reshape(28, 28)

        # Create a subplot in the specified position
        # - `plt.subplot(rows, cols, position)`: Creates a subplot in the specified grid position.
        # The position is calculated as `i * cols + j + 1` because subplot positions start from 1.
        plt.subplot(rows, cols, index + 1)

        # Display the image
        # - `plt.imshow(image, cmap="gray")`: Displays the image in grayscale.
        plt.imshow(image, cmap="gray")

        # Set the title of the subplot to the corresponding label
        # - `labels.iloc[index]`: Selects the label at the specified index.
        plt.title(f"{labels.iloc[index]}")

        # Turn off the axis
        # - `plt.axis("off")`: Hides the axis for a cleaner look.
        plt.axis("off")

# Show the grid of images
plt.show()

In [None]:
# Print the shape of the features DataFrame
# - `features.shape`: Returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).
# - This helps us understand the size of the dataset.
print(features.shape)

# Print the first few rows of the features DataFrame
# - `features.head()`: Displays the first 5 rows of the DataFrame by default.
# - This allows us to get a quick look at the structure and content of the dataset.
print(features.head())

Next, let's print the labels.

In [None]:
# Print the first few rows of the labels Series
# - `labels.head()`: Displays the first 5 rows of the Series by default.
# - This allows us to quickly see the first few labels (target values) in the dataset.
print(labels.head())

## Deliverable 1

We will use the Naive Bayes method to classify handwritten digits from the MNIST dataset. The first step is to normalize the pixel values to the range [0, 1]. For the MNIST dataset, which consists of grayscale pixel values ranging from 0 to 255, normalizing by dividing all pixel values by 255 is a straightforward way to achieve this. This normalization ensures that each pixel value is between 0 and 1, making the dataset more suitable for training machine learning models.


In [None]:
from sklearn.model_selection import train_test_split

# Load the MNIST dataset
# - `mnist.data`: Contains the pixel values of the images.
# - `mnist.target`: Contains the corresponding digit labels.
features = mnist.data
labels = mnist.target

# Preprocess the data
# Normalize the pixel values to the range [0, 1]
# - `features / 255.0`: Divides all pixel values by 255 to scale them to the range [0, 1].
features_normalized = features / 255.0

# Split the data into training and testing sets
# - `train_test_split`: Splits the dataset into training and testing subsets.
# - `features_normalized`: The normalized pixel values.
# - `labels`: The corresponding digit labels.
# - `test_size=0.2`: Specifies that 20% of the data should be used for testing, and 80% for training.
# - `random_state=42`: Ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(
    features_normalized, labels, test_size=0.2, random_state=42
)

# Output shapes for verification
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

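We will use the `Multinomial` Naive Bayes classifier, which is designed for non-negative, count-like features. Pixel intensities are non-negative and take discrete levels (0 to 255 before normalization), so they can be treated as approximately following a multinomial distribution. Scaling to [0, 1] keeps the values non-negative, so the classifier still applies.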
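
The non-negativity requirement can be seen on a toy problem. The sketch below uses made-up intensity features in [0, 1] (all data and names are hypothetical, unrelated to MNIST) and shows that `MultinomialNB` accepts them but rejects zero-centered features, which contain negative values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative features for two classes:
# class 0 is "bright" in the first feature, class 1 in the last two
X_toy = np.array(
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.7, 0.9], [0.0, 0.9, 0.8]]
)
y_toy = np.array([0, 0, 1, 1])

toy_model = MultinomialNB().fit(X_toy, y_toy)
toy_pred = toy_model.predict(np.array([[0.7, 0.1, 0.1], [0.1, 0.8, 0.7]]))
print("Predictions:", toy_pred)

# Subtracting the mean produces negative values, which MultinomialNB rejects
X_centered = X_toy - X_toy.mean(axis=0)
try:
    MultinomialNB().fit(X_centered, y_toy)
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```

This is why dividing by 255 is a safe preprocessing step for this classifier, whereas zero-mean standardization would not be.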

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create a Multinomial Naive Bayes classifier
# - `MultinomialNB()`: Initializes the classifier.
nb_classifier = MultinomialNB()

# TODO: Write your code for multinomial Naive Bayes that classifies the handwritten digits.
# Ensure you print out the accuracy score.








### Displaying Misclassified Digits
We will use the index of misclassifications to display the first 10 misclassified digits from our model's predictions. Each displayed image will have the predicted label shown on top.

**Steps**:
1. Identify Misclassifications: Find the indices where the predicted values differ from the true labels.
2. Set Up Visualization: Use matplotlib to create a grid for displaying the misclassified images.
3. Loop Through Misclassifications: Extract and display the first 10 misclassified images, adding the predicted values as titles, as shown below.

<div> <img src="./mnist_misclassifications.png" width="600"/> </div> 
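
Step 1 relies on an element-wise comparison of two arrays followed by `np.where`. Here is a minimal sketch with made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical and not taken from MNIST):

```python
import numpy as np

# Made-up true and predicted labels for six samples
y_true_toy = np.array([3, 1, 4, 1, 5, 9])
y_pred_toy = np.array([3, 7, 4, 1, 6, 9])

# The comparison yields a boolean array; np.where returns a tuple of
# index arrays, so [0] selects the indices along the first axis
wrong_idx = np.where(y_pred_toy != y_true_toy)[0]
print("Misclassified positions:", wrong_idx)
print("Predicted there:", y_pred_toy[wrong_idx])
print("Actual there:   ", y_true_toy[wrong_idx])
```

The returned values are positions within the compared arrays, so they should be used with positional indexing (for example `.iloc` on a pandas object) rather than with the original row labels.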

In [None]:
# Identify misclassified indices
# - `np.where(y_pred != y_test.values)[0]`: Finds the indices where the predicted labels differ from the true labels.
# - `mis_indices_nb`: Stores the indices of misclassifications.
mis_indices_nb = np.where(y_pred != y_test.values)[0]

# Set how many misclassified images to display
# - `num_images`: Number of misclassified images to display.
num_images = 10  # Change this value to display more images

# TODO: Write your code to display the first 10 misclassified digits.
# You should add the predicted values on top of the digits.
# To do this, use plt.title.





# Show the grid of misclassified images
plt.show()

### Using 5-Fold Cross-Validation

We will use 5-fold cross-validation to evaluate the accuracy of our Multinomial Naive Bayes classifier. This technique splits the dataset into five parts, trains the model on four parts, and tests it on the remaining part, repeating this process five times to get a robust estimate of model performance.

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np

# TODO: Perform 5-fold cross-validation




## Deliverable 2

We will use the Support Vector Machine (SVM) method with a linear kernel to classify the handwritten digits from the MNIST dataset. A linear kernel works well when the classes can be separated, at least approximately, by a hyperplane in the original feature space.

Warning: It will take about 3-7 minutes.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# TODO: Write your code for SVM with linear kernel that classifies the handwritten digits.
# Ensure you print out the accuracy score.




Now, let's experiment with the `Radial Basis Function (RBF) kernel`, which is a powerful choice for classification tasks, especially when the decision boundary is not expected to be linear or polynomial. It is highly versatile and can handle non-linear data effectively. Let's implement this and evaluate its performance.

Warning: It will take about 3-7 minutes.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# TODO: Write your code for SVM with RBF kernel that classifies the handwritten digits.


# Ensure you print out the accuracy score.




Let's confirm whether the digits misclassified by the Naive Bayes classifier are now correctly classified by SVM. Display these misclassified digits along with their predicted values from SVM. To do this, use the same `mis_indices_nb` in conjunction with the `y_pred` from the SVM classifier.

In [None]:
# TODO: Write your code to display the first 10 misclassified digits by Naive Bayes.
# You should add the predicted values by SVM on top of the digits.

# Reshape and display the first 10 misclassified digits by Naive Bayes
# - `num_images`: Number of misclassified images to display.
num_images = 10  # Change this value to display more images






We will use 5-fold cross-validation to evaluate the accuracy of our Support Vector Machine (SVM) classifier with the Radial Basis Function (RBF) kernel. Report the accuracy score for each fold and the average across folds. This process provides a more reliable estimate of the model's performance than a single train/test split.

Warning: It will take about 20 minutes.

In [None]:
# TODO: Perform 5-fold cross-validation











