# KNN Classification on MNIST Dataset (Small Sample)

This notebook demonstrates how to apply K-Nearest Neighbors (KNN) classification using a small subset of the MNIST dataset. We will follow the typical machine learning pipeline, including data loading, preprocessing, model training, and evaluation.


In [None]:
# 1. Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import utils  # Custom utility functions
import os


## 2. Load the MNIST Dataset

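We load the training and test data with `utils.load_data` from `utils.py`. The training data is a small subset of the MNIST dataset, while the test data is the full MNIST test set. The function returns training features, training labels, and test features; it returns no test labels, so the test set is used only for prediction.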
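
`utils.py` is not shown in this notebook, so the exact behaviour of `utils.load_data` is unknown. The following is only a minimal sketch of a loader with the same return signature. It assumes the training CSV has a `label` column followed by pixel columns and that the test CSV holds pixel columns only. The function name `load_data_sketch`, the `label_col` parameter, and the tiny synthetic CSV files are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_data_sketch(train_path, test_path, label_col="label"):
    """Hypothetical stand-in for utils.load_data; the real helper may differ."""
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    # Assumption: labels live in one named column, all other columns are pixels.
    y_train = train_df[label_col].to_numpy()
    X_train = train_df.drop(columns=[label_col]).to_numpy(dtype=np.float32)
    # Assumption: the test CSV contains pixel columns only.
    X_test = test_df.to_numpy(dtype=np.float32)
    return X_train, y_train, X_test


# Tiny synthetic CSVs (4 "pixels" per image) so the sketch runs without the real files.
with tempfile.TemporaryDirectory() as tmp:
    train_csv = os.path.join(tmp, "train.csv")
    test_csv = os.path.join(tmp, "test.csv")
    pd.DataFrame({"label": [3, 7], "p0": [0, 255], "p1": [10, 20],
                  "p2": [0, 0], "p3": [128, 64]}).to_csv(train_csv, index=False)
    pd.DataFrame({"p0": [5], "p1": [6], "p2": [7], "p3": [8]}).to_csv(test_csv, index=False)
    X_tr, y_tr, X_te = load_data_sketch(train_csv, test_csv)

print(X_tr.shape, y_tr.shape, X_te.shape)
```

If the real CSV files use a different layout (for example, a label column in the test file), the column handling would need to change accordingly.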


In [None]:
# Paths to the dataset CSV files
train_path = 'data/mnist_train_small.csv'
test_path = 'data/mnist_test.csv'

# Load data
X_train, y_train, X_test = utils.load_data(train_path, test_path)

# Check the shape of the datasets
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")


## 3. Data Preprocessing

Next, we normalize the pixel values to the range [0, 1]. KNN classifies a sample by its distance to the training samples, so all features should share a common scale. We then hold out 20% of the training data as a validation set.
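
To make the effect of this step concrete, here is a small sketch on synthetic data. It assumes that normalization means dividing every pixel by 255, which may or may not be what `utils.normalize_data` does. Under that assumption every Euclidean distance shrinks by the same factor, so the values land in [0, 1] while the ranking of nearest neighbors stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": 20 rows of 784 intensities in [0, 255] (not real MNIST data).
raw = rng.uniform(0, 255, size=(20, 784))
query = rng.uniform(0, 255, size=784)

# Assumed normalization: divide every pixel by the same constant.
scaled = raw / 255.0
query_scaled = query / 255.0

# Rank the rows by Euclidean distance to the query, before and after scaling.
order_raw = np.argsort(np.linalg.norm(raw - query, axis=1))
order_scaled = np.argsort(np.linalg.norm(scaled - query_scaled, axis=1))

in_unit_range = bool(scaled.min() >= 0.0 and scaled.max() <= 1.0)
same_ranking = bool(np.array_equal(order_raw, order_scaled))
print("Values in [0, 1]:", in_unit_range)
print("Neighbor ranking unchanged:", same_ranking)
```

Scaling matters more when features have different ranges, because a feature with a large range would otherwise dominate the distance.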


In [None]:
# Normalize the data
X_train = utils.normalize_data(X_train)
X_test = utils.normalize_data(X_test)

# Split the training data into training and validation sets
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train_split.shape}, Validation set shape: {X_val.shape}")


## 4. Visualize Sample Data

Before training, let's visualize some sample images from the dataset.


In [None]:
# Plot a few sample images from the training set
sample_indices = [0, 1, 2, 3, 4]
utils.plot_multiple_samples(X_train_split, y_train_split, sample_indices)


## 5. Train the K-Nearest Neighbors Model

We train the KNN model with `k=3` and evaluate its performance on the validation set. The number of neighbors `k` is a hyperparameter; 3 is a reasonable starting point, and it can be tuned later.
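
One common way to tune `k` is cross-validation over a handful of candidate values. The sketch below illustrates the idea on scikit-learn's bundled 8x8 digits dataset as a stand-in, not on the MNIST subset used in this notebook. The candidate list and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's small digits dataset, NOT the notebook's MNIST subset.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel values in this dataset range over 0..16

candidate_ks = [1, 3, 5, 7, 9]  # arbitrary candidates for illustration
mean_scores = {}
for k in candidate_ks:
    # 5-fold cross-validated accuracy for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k}: mean CV accuracy = {score:.4f}")
print(f"Best k on the stand-in data: {best_k}")
```

In this notebook, the same loop could be run on the normalized training data in place of `X_demo` and `y_demo`; the best `k` found on the stand-in data does not necessarily carry over.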


In [None]:
# Initialize the KNN model with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train_split, y_train_split)

# Predict on the validation set
y_val_pred = knn.predict(X_val)

# Evaluate accuracy
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy:.4f}")


## 6. Evaluate the Model

We generate and display a confusion matrix and classification report to better understand the model's performance.
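
A confusion matrix can also be summarized numerically. The sketch below uses a made-up 3-class matrix (not output from this notebook) and follows scikit-learn's convention that rows are true labels and columns are predicted labels. It computes per-class recall and finds the most frequent misclassification.

```python
import numpy as np

# Toy 3-class confusion matrix (rows = true label, columns = predicted label).
# The numbers are invented for illustration.
toy_cm = np.array([
    [50,  2,  1],
    [ 4, 40,  6],
    [ 0,  3, 47],
])

# Per-class recall: correct predictions divided by the row total.
recall = np.diag(toy_cm) / toy_cm.sum(axis=1)

# Zero the diagonal to leave only the errors, then locate the largest one.
errors = toy_cm.copy()
np.fill_diagonal(errors, 0)
true_label, pred_label = np.unravel_index(np.argmax(errors), errors.shape)

print("Per-class recall:", np.round(recall, 3))
print(f"Most common confusion: true {true_label} predicted as {pred_label} "
      f"({errors[true_label, pred_label]} samples)")
```

The same steps should apply to the 10x10 `conf_matrix` computed in the next cell, where the largest off-diagonal entries point to the digit pairs the model mixes up most often.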


In [None]:
# Generate confusion matrix and classification report
conf_matrix = confusion_matrix(y_val, y_val_pred)
class_report = classification_report(y_val, y_val_pred)

print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

# Plot and save the confusion matrix
os.makedirs('results', exist_ok=True)
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.savefig('results/confusion_matrix.png')
plt.show()

# Save the classification report to a file
with open('results/classification_report.txt', 'w') as f:
    f.write(class_report)


## 7. Predict on the Test Set

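Next, we use the trained KNN model to predict labels for the test set and save the predictions to a CSV file. The model is the one fit on the 80% training split above. No test labels are loaded in this notebook, so these predictions must be scored elsewhere.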
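
For reference, here is a small sketch of the CSV layout that `DataFrame.to_csv` with `index_label='ImageId'` produces. It uses made-up predictions and writes to an in-memory buffer instead of the `results` folder. The default index starts at 0; the second half shows one way to shift it to start at 1, in case a downstream consumer expects 1-based ids (whether that is needed here is an assumption).

```python
import io

import numpy as np
import pandas as pd

toy_pred = np.array([7, 2, 1])  # made-up predictions, not real model output
df = pd.DataFrame(toy_pred, columns=["Label"])

# Default behaviour: the row index (starting at 0) is written as the ImageId column.
buf = io.StringIO()
df.to_csv(buf, index_label="ImageId")
csv_zero_based = buf.getvalue()

# Optional variant: shift the index so that ImageId starts at 1.
df_one_based = df.copy()
df_one_based.index = df_one_based.index + 1
buf = io.StringIO()
df_one_based.to_csv(buf, index_label="ImageId")
csv_one_based = buf.getvalue()

print(csv_zero_based)
print(csv_one_based)
```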


In [None]:
# Predict on the test set
y_test_pred = knn.predict(X_test)

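# Save the test set predictions (ImageId is the 0-based row index)
os.makedirs('results', exist_ok=True)  # in case this cell runs before the evaluation cell
pd.DataFrame(y_test_pred, columns=['Label']).to_csv('results/test_predictions.csv', index_label='ImageId')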


## 8. Save the Validation Accuracy

We also save the validation accuracy to a text file in the `results` folder.


In [None]:
# Save the validation accuracy
with open('results/validation_accuracy.txt', 'w') as f:
    f.write(f"Validation Accuracy: {val_accuracy:.4f}\n")


## 9. Conclusion

This notebook demonstrated how to implement and evaluate a K-Nearest Neighbors classifier on a small subset of the MNIST dataset. The results, including the validation accuracy, confusion matrix, classification report, and test predictions, have been saved to the `results` folder for further analysis.
