# Problem 1: Finding the Optimal k and Bootstrap Iterations

In this task, you will use the **K-Nearest Neighbors (KNN)** classifier to classify handwritten digits from the **MNIST dataset**. The goal is to determine:
1. The optimal number of neighbors (k) for the KNN classifier.
2. The smallest number of bootstrap iterations that gives a stable estimate of the model’s accuracy.

## Task Breakdown

1. **Download and preprocess the MNIST dataset** (use 10% of the dataset for faster experimentation).
2. **Implement bootstrap resampling** to evaluate the KNN classifier for various values of k.
3. Experiment with **different numbers of bootstrap iterations**.
4. **Determine the optimal value of k** and the minimum number of bootstrap iterations required for a confident result.

## Steps to Follow

1. **Preprocess the MNIST dataset**:
   - Normalize the pixel values (e.g., divide by 255 to scale between 0 and 1).
   - Select a random 10% subset of the dataset.

2. **KNN Classifier**:
   - Use the `KNeighborsClassifier` from `sklearn.neighbors`.
   - Iterate over different values of k, specifically $k = 1, 2, \dots, 10$.

3. **Bootstrap Resampling**:
   - For each value of k, perform several bootstrap iterations:
     - Draw n samples with replacement from the n-point dataset to create a training set.
     - Train the KNN model on the resampled data.
     - Test the model on the remaining data points (out-of-bag data).
     - Compute and store the accuracy on the out-of-bag data for each iteration.

4. **Determine Optimal k and Bootstrap Iterations**:
   - Experiment with different numbers of bootstrap iterations (e.g., 1, 2, 11, 22, 44, 88).
   - Calculate the mean accuracy across bootstrap iterations for each k.
   - Determine the optimal k and the smallest number of iterations beyond which the mean accuracy no longer changes appreciably.

5. **Plot**:
   - Plot the mean accuracy for each value of k.
   - Include another plot showing how the mean accuracy stabilizes with increasing bootstrap iterations.

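The bootstrap procedure in step 3 can be sketched as follows. This is a minimal illustration, not the reference solution: the function name `bootstrap_oob_accuracy`, its `seed` parameter, and the use of scikit-learn's small bundled digits dataset as a stand-in for MNIST are all assumptions made to keep the example fast and offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier


def bootstrap_oob_accuracy(X, y, k, n_iterations, seed=0):
    """Return an array of out-of-bag accuracies, one per bootstrap iteration.

    Hypothetical helper written for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    accuracies = []
    for _ in range(n_iterations):
        # Draw n indices with replacement: the in-bag (training) sample.
        in_bag = rng.integers(0, n, size=n)
        # Out-of-bag points are those never drawn in this iteration.
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[in_bag] = False
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X[in_bag], y[in_bag])
        accuracies.append(knn.score(X[oob_mask], y[oob_mask]))
    return np.array(accuracies)


# Stand-in data (assumption): 500 of scikit-learn's 8x8 digits, scaled to [0, 1].
digits = load_digits()
X_demo = digits.data[:500] / 16.0
y_demo = digits.target[:500]

# Mean out-of-bag accuracy per k, then the k with the highest mean.
mean_acc = {k: bootstrap_oob_accuracy(X_demo, y_demo, k, n_iterations=5).mean()
            for k in (1, 3, 5)}
best_k = max(mean_acc, key=mean_acc.get)
print(mean_acc, "best k:", best_k)
```

For large n, roughly 36.8% of the points (a fraction of about 1/e) are out-of-bag in each iteration, so every iteration is tested on a different, sizeable held-out set. For the actual task, `X_demo` and `y_demo` would be replaced by the 10% MNIST subset and the loop over k extended to 1 through 10.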

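One possible way to judge when the mean accuracy has stabilized (steps 4 and 5) is to track the running mean together with its standard error as iterations accumulate. The sketch below is an assumption about how "stable" might be defined, not a prescribed criterion: the helpers `running_mean_and_sem` and `iterations_needed`, the tolerance of 0.002, and the synthetic accuracies are all hypothetical.

```python
import numpy as np


def running_mean_and_sem(accuracies):
    """Running mean and standard error of the mean after 1, 2, ..., B iterations."""
    acc = np.asarray(accuracies, dtype=float)
    counts = np.arange(1, len(acc) + 1)
    running_mean = np.cumsum(acc) / counts
    # The standard error is undefined for a single iteration, so it starts as NaN.
    sem = np.full(len(acc), np.nan)
    for b in range(2, len(acc) + 1):
        sem[b - 1] = acc[:b].std(ddof=1) / np.sqrt(b)
    return running_mean, sem


def iterations_needed(accuracies, tolerance=0.002):
    """Smallest B from which the standard error stays below `tolerance`, else None."""
    _, sem = running_mean_and_sem(accuracies)
    ok = sem < tolerance  # NaN compares as False
    if ok.size == 0 or not ok[-1]:
        return None
    last_bad = np.flatnonzero(~ok)[-1]  # index 0 is always "bad" (NaN)
    return int(last_bad) + 2


# Synthetic accuracies for illustration only (not results from MNIST).
rng = np.random.default_rng(0)
fake_acc = 0.91 + 0.01 * rng.standard_normal(88)
means, sems = running_mean_and_sem(fake_acc)
print("iterations needed:", iterations_needed(fake_acc, tolerance=0.002))
```

Plotting `means` against the iteration count, with a band of one or two `sems` on either side, gives the stabilization plot described in step 5. Note that this standard error only measures the Monte Carlo noise from the resampling itself; it does not account for the uncertainty due to the dataset being a finite sample.
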
In [2]:
import numpy as np
import matplotlib.pyplot as plt
import keras
import sklearn.model_selection  # a bare "import sklearn" is not guaranteed to load this submodule

In [9]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

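# Keep a random 10% subset, flatten each 28x28 image to a 784-vector, and scale to [0, 1].
# train_size=0.1 makes the first returned split the 10% subset
# (with test_size=0.1 the first split would be the remaining 90%).
x_train, _, y_train, _ = sklearn.model_selection.train_test_split(x_train, y_train, train_size=0.1, random_state=101)
x_test, _, y_test, _ = sklearn.model_selection.train_test_split(x_test, y_test, train_size=0.1, random_state=101)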

x_train = np.reshape(x_train, (6000, 28*28))
x_train = x_train.astype("float32") / 255

x_test = np.reshape(x_test, (1000, 28*28))
x_test = x_test.astype("float32") / 255

In [12]:
from sklearn.neighbors import KNeighborsClassifier

# Baseline: accuracy on the held-out test subset for k = 1..10 (single split, no bootstrap)
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print("k = ", k, "accuracy = ", knn.score(x_test, y_test))

k =  1 accuracy =  0.904
k =  2 accuracy =  0.9
k =  3 accuracy =  0.913
k =  4 accuracy =  0.913
k =  5 accuracy =  0.916
k =  6 accuracy =  0.913
k =  7 accuracy =  0.914
k =  8 accuracy =  0.913
k =  9 accuracy =  0.909
k =  10 accuracy =  0.906

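As a rough check on how much these single-split accuracies can be trusted, the standard error of an accuracy measured on 1000 test points can be estimated with the binomial formula $\sqrt{p(1-p)/n}$. This is a back-of-the-envelope sketch that assumes the test examples are independent draws; the variable names are illustrative.

```python
import math

n_test = 1000      # size of the test subset used above
accuracy = 0.916   # highest single-split accuracy printed above (k = 5)

# Binomial standard error of a proportion estimated from n_test samples.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(f"standard error ~ {standard_error:.4f}")
```

The result is a little under 0.009, which is larger than most of the gaps between neighboring values of k in the output above. Under this approximation a single split cannot reliably separate, say, k = 3 from k = 5, which is the motivation for averaging over bootstrap iterations.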