# Linnaeus University
## Introduction to Machine learning, 25VT-2DV516
## Assignment 1

**Name:** Martim Oliveira

**Email:** mo223tz@student.lnu.se

## Introduction

In this assignment you will handle four exercises related to the k-Nearest Neighbors algorithm.
The main purpose is to get you up and running using Python, NumPy and Matplotlib. 
The library Scipy will be used specifically in Exercise 3, part 2.

## Submission Instructions

All exercises are individual. We expect you to submit a zip file with this notebook with your solutions and the MachineLearning.py with the models implemented. 
You must normalize your data before doing anything with your data.
When grading your assignments we will in addition to functionality also take into account code quality. 
We expect well structured and efficient solutions. 
Finally, keep all your files in a single folder named as username_A1 and submit a zipped version of this folder.

### Exercise 1: Models implementation and testing (All Mandatory)

1. Implement all the methods in the abstract classes **KNNRegressionModel** and **KNNClassificationModel** in the MachineLearningModel.py file. 
As the names suggest, you must implement the Regression (slide 30) and Classification (slide 24) versions of the KNN algorithm and you must follow the algorithms stated in the slides. 
* Both models must use the Euclidean distance as the distance function (*Tip: Code smart by implementing an auxiliary method _euclidian_distance() in the MachineLearningModel.py file*).
* The evaluate() function for the **KNNRegressionModel** must implement the Mean Squared Error (MSE)
* The evaluate() function for the **KNNClassificationModel** must count the number of correct predictions.

2. Use the *Polynomial200.csv* dataset to show that all your methods for the **KNNRegressionModel** are working as expected. You must produce a similar figure to the one in slide 31. Instructions to produce the figure are present in the slide. You must show the effects of using k = 3, 5, 7 and 9 and discuss your findings on the figure produced.

**Discuss your findings for this question below**

Note: Please read the instructions and further information in the README file.

After implementing the methods inherited from the abstract class, I added one method that creates the KNN regression plot.


![KNNRegression_K=3](KNNRegression_Polynomial_3.png)

- **When k = 3** we can observe that the line is quite sensitive to variations in the data
- It follows the points closely, which may indicate overfitting
- We can observe sharp peaks

![KNNRegression_K=5](KNNRegression_Polynomial_5.png)

- **When k = 5** we see a smoother line than with k = 3
- The line has fewer fluctuations
- Good overall balance: it follows the data while smoothing out noise
- Potential overfitting is reduced compared with k = 3

![KNNRegression_K=7](KNNRegression_Polynomial_7.png)

- **When k = 7** the line is smoother than with k = 5
- Individual points have a reduced impact
- More stable predictions

![KNNRegression_K=9](KNNRegression_Polynomial_9.png)

- **When k = 9** we have the smoothest line of all observed k values
- Most stable predictions
- Least sensitive to individual points

**General Observations**
1. Plots show the same overall trend:
    - Sharp increase between (0.0-0.2)
    - Relatively stable between (0.2-0.8)
    - Last increase at (0.8-1.0)
2. Larger k-values produce smoother results.
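
The smoothing effect of larger k can also be checked numerically. The sketch below is only an illustration under assumptions: it uses synthetic noisy data rather than Polynomial200.csv, and the helper `knn_regress_1d` and the "roughness" measure (mean absolute jump between adjacent predictions) are hypothetical names introduced here, not part of the submitted classes.

```python
import numpy as np

def knn_regress_1d(x_train, y_train, x_query, k):
    """Average the targets of the k nearest training points (1-D inputs)."""
    preds = []
    for x in x_query:
        nearest = np.argsort(np.abs(x_train - x))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): a noisy sine curve, not the assignment dataset
x_train = np.sort(rng.uniform(0.0, 1.0, 200))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 200)
x_query = np.linspace(0.0, 1.0, 300)

roughness = {}
for k in (3, 5, 7, 9):
    curve = knn_regress_1d(x_train, y_train, x_query, k)
    # Mean absolute jump between adjacent predictions: smaller means smoother
    roughness[k] = np.mean(np.abs(np.diff(curve)))
    print(f"k={k}: roughness={roughness[k]:.4f}")
```

Each prediction averages k targets, so swapping one neighbour changes the prediction by roughly 1/k of the difference between two targets. This is one way to see why the curve has smaller jumps as k grows.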

**Below we can see the implementation of the main methods: Euclidean distance calculation, predict and evaluate.**

In [None]:
def euclidean_distance(self, point, data):
        """
        Calculate the Euclidean distance between a point and the dataset points.
        Euclidean equation: sqrt((X₂-X₁)²+(Y₂-Y₁)²) where:
        X₂ = New entry's data.
        X₁= Existing entry's data.
        Y₂ = New entry's data.
        Y₁ = Existing entry's data.
        """
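        point = np.atleast_1d(np.asarray(point, dtype=float))
        data = np.asarray(data, dtype=float)
        if data.ndim == 1:
            # Treat 1-D training data as a single feature column
            data = data.reshape(-1, 1)
        # Sum the squared differences over the feature axis before taking the root,
        # so the result is one distance per training point
        return np.sqrt(np.sum((data - point) ** 2, axis=1))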

def predict(self, X):
        """
        Make predictions on new data.
        The predictions are made by averaging the target variable of the k nearest neighbors.

        Parameters:
        X (array-like): Features of the new data.

        Returns:
        predictions (array-like): Predicted values.
        """
        predictions = []
        for row in X:
            distances = self.euclidean_distance(row, self.X_train)
            sorted_distances = np.argsort(distances)
            top_k_rows = sorted_distances[:self.k]
            mean_value = np.mean(self.y_train[top_k_rows])
            predictions.append(mean_value)
        return predictions

def evaluate(self, y_true, y_predicted):
        """
        Evaluate the model on the given data.
        You must implement this method to calculate the Mean Squared Error (MSE) between the true and predicted values.
        The MSE is calculated as the average of the squared differences between the true and predicted values.        

        Parameters:
        y_true (array-like): True target variable of the data.
        y_predicted (array-like): Predicted target variable of the data.

        Returns:
        score (float): Evaluation score.
        """
        return np.mean((np.array(y_true) - np.array(y_predicted)) ** 2)



3. Use the *IrisDataset.csv* dataset to show that all your methods for the **KNNClassificationModel** are working as expected. You must produce a similar figure to the one in slide 28. Instructions on how to produce the figure are given in the slide. You must choose 2 input variables only to produce the figure (they do not need to match the figure in the slide). You must show the effects of using k = 3, 5, 7, and 9 and discuss the figure produced.

**Tips**

* Check the function *np.meshgrid* from numpy to create the samples.
* Check the function *plt.contourf* for generating the contours.
* There are many tutorials online to produce this figure. Find one that most suits you.
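
As a minimal sketch of how the two tips fit together: `np.meshgrid` builds the grid, the grid is flattened into one row per point for prediction, and the labels are reshaped back to the grid shape that `plt.contourf` expects. The feature ranges and the stand-in "classifier" rule below are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-D feature ranges; real code would take them from the data
xs = np.linspace(0.0, 1.0, 4)   # 4 values along the first feature
ys = np.linspace(0.0, 2.0, 3)   # 3 values along the second feature

xx, yy = np.meshgrid(xs, ys)                 # both have shape (3, 4)
grid_points = np.c_[xx.ravel(), yy.ravel()]  # shape (12, 2): one row per grid point

# Stand-in for model.predict(grid_points): label 1 where x + y > 1.5, else 0
labels = (grid_points[:, 0] + grid_points[:, 1] > 1.5).astype(int)

# Reshape back to the grid so that plt.contourf(xx, yy, Z) can draw the regions
Z = labels.reshape(xx.shape)
print(xx.shape, grid_points.shape, Z.shape)
```

A denser grid gives smoother-looking boundaries, but it also multiplies the number of predictions the classifier has to make.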

**Discuss your findings for this question below**

**Key Points**

- It is clear that Iris-Setosa (blue points) is separated from the other species, clustered in the lower left area
- Iris-Versicolor and Virginica show some overlap in the middle region
- The model appears to perform well for Iris-Setosa with any k-value
- When k = 3, I would say that the model is overfitting. For k = 5 and k = 7 I am less sure; they behave similarly, but k = 5 seems to be the best model.

![Mesh Decision Boundary k = 3](MeshGrid_Iris_3.png)

![Mesh Decision Boundary k = 5](MeshGrid_Iris_5.png)

![Mesh Decision Boundary k = 7](MeshGrid_Iris_7.png)

- With k = 9, the model is clearly underfitting.

![Mesh Decision Boundary k = 9](MeshGrid_Iris_9.png)

**Below we can see the implementation of the main methods: Euclidean distance calculation, predict, evaluate, and the method that creates the mesh decision boundary visualization.**

In [None]:
def most_common_rows(self, lst):
        """
        Get most frequent class from a list of labels.
        """
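        # Convert to a plain list once, then pick the label with the highest count.
        # Ties are broken arbitrarily because set ordering is not guaranteed.
        labels = list(lst)
        return max(set(labels), key=labels.count)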

def euclidean_distance(self, point, data):
        """
        Calculate the Euclidean distance between a point and the dataset points.
        Euclidean equation: sqrt((X₂-X₁)²+(Y₂-Y₁)²) where:
        X₂ = New entry's data.
        X₁= Existing entry's data.
        Y₂ = New entry's data.
        Y₁ = Existing entry's data."""
        
        point = np.array(point)
        data = np.array(data)
        return np.sqrt(np.sum((point - data) ** 2, axis=1))
    
def fit(self, X, y):
        """
        Train the model using the given training data.
        In this case, the training data is stored for later use in the prediction step.
        The model does not need to learn anything from the training data, as KNN is a lazy learner.
        The training data is stored in the class instance for later use in the prediction step.

        Parameters:
        X (array-like): Features of the training data.
        y (array-like): Target variable of the training data.

        Returns:
        None
        """
        self.X_train = np.array(X)
        self.y_train = np.array(y)

def predict(self, X):
        """
        Make predictions on new data.
        The predictions are made by taking the mode (majority) of the target variable of the k nearest neighbors.
        
        Parameters:
        X (array-like): Features of the new data.

        Returns:
        predictions (array-like): Predicted values.
        """
        predictions = []
        X = np.array(X)
        for row in X:
            distances = self.euclidean_distance(row, self.X_train)
            sorted_distances = np.argsort(distances)
            top_k_rows = sorted_distances[:self.k]
            neighbors = self.y_train[top_k_rows]
            predictions.append(self.most_common_rows(neighbors))
        return predictions
        

def evaluate(self, y_true, y_predicted):
        """
        Evaluate the model on the given data.
        You must implement this method to calculate the total number of correct predictions only.
        Do not use any other evaluation metric.

        Parameters:
        y_true (array-like): True target variable of the data.
        y_predicted (array-like): Predicted target variable of the data.

        Returns:
        score (float): Evaluation score.
        """
        correct_predictions = 0
        for i in range(len(y_true)):
            # Handle nested predictions
            pred = y_predicted[i]
            if isinstance(pred, (list, np.ndarray)):
                # For NumPy arrays, take the first element and convert to string
                if isinstance(pred, np.ndarray):
                    pred = str(pred[0])
                else:
                    pred = pred[0]
            if y_true[i] == pred:
                correct_predictions += 1
        return correct_predictions
    
def knn_mesh_decision_boundary(self, X, y, ax):
        """
        Create a visualization of the KNN classification model
        """
        label_encoder = LabelEncoder()
        y_encoded = label_encoder.fit_transform(y)
        x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
        y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
        xx, yy = np.meshgrid(
            np.linspace(x_min, x_max, 200),
            np.linspace(y_min, y_max, 200)
        )
        grid_points = np.c_[xx.ravel(), yy.ravel()]
        Z = self.predict(grid_points)
        Z_encoded = label_encoder.transform(Z)
        Z_encoded = Z_encoded.reshape(xx.shape)
    
        # Plot
        plt.figure(figsize=(8, 6))
        plt.contourf(xx, yy, Z_encoded, alpha=0.3, cmap=plt.cm.coolwarm)
        plt.scatter(X[:, 0], X[:, 1], c=y_encoded, edgecolor='k', s=30, cmap=plt.cm.coolwarm)
        plt.title(f"Mesh decision boundary for the Iris2D dataset using {self.k}-NN")
        plt.xlabel("Sepal length (cm)")
        plt.ylabel("Sepal width (cm)")

        # Legend
        from matplotlib.patches import Patch
        legend_elements = [Patch(facecolor=plt.cm.coolwarm(i / 2), label=cls) for i, cls in enumerate(label_encoder.classes_)]
        plt.legend(handles=legend_elements, loc='upper right')
        plt.grid(True)
        plt.show()

### Exercise 2: KNN Regression (Mandatory)

1. (Mandatory) Create a procedure to repeat 10 times the following strategy.
* Use the values for k = 3, 5, 7, 9, 11, 13 and 15.
* Split your dataset randomly into 80% for training, and 20% testing. Use 10 different seeds for splitting the data.
* Evaluate (MSE implemented in your class) your **KNNRegressionModel** for each k in the **test set** and store the result. 
* Plot a barchart with these results.
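
The seeded 80/20 split described above can be sketched with NumPy alone. This is an illustrative alternative under assumptions, not the approach used in the solution: `split_80_20` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def split_80_20(X, y, seed):
    """Shuffle the indices with a seeded generator and cut at 80%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in data with 200 samples (assumption)
X = np.linspace(0.0, 1.0, 200)
y = X ** 2

for seed in range(3):
    X_train, X_test, y_train, y_test = split_80_20(X, y, seed)
    print(seed, X_train.shape, X_test.shape)
```

Using a different seed per repetition changes which samples land in the test set, while reusing a seed reproduces the same split exactly.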

Which k gives the best regression? Motivate your answer!

**Discuss your findings for this question below**

After analyzing the bar chart, we can draw some key conclusions:
- k = 3 gives the highest MSE, which suggests it produces the poorest results.
- The analysis suggests that k = 11 is the best choice for this task.
- Performance degrades when k is very small or very large: too few neighbours lead to overfitting and too many lead to underfitting.

![KNN Regression Bar Chart](KNNRegressionBarChart.png)



**Below you can observe the function created for the problem.**

In [None]:
def run_regression_experiment():
    """
    Run the KNN regression with:
    - 10 repetitions
    - k values: 3, 5, 7, 9, 11, 13, 15
    - 80-20 train-test split
    - Different random seeds for each repetition
    """
    print("\n=== Running KNN Regression Problem 2 ===")
    
    # Load and prepare the polynomial dataset
    polynomial = pd.read_csv('Polynomial200_normalized.csv')
    X = polynomial['x'].values
    y = polynomial['y'].values
    
    # Requirements 
    k_values = [3, 5, 7, 9, 11, 13, 15]
    n_repetitions = 10
    test_size = 0.2
    random_seeds = range(42, 42 + n_repetitions)  # 10 different seeds
    
    results = np.zeros((n_repetitions, len(k_values)))
    
    for rep_idx, seed in enumerate(random_seeds):
        print(f"\nRepetition {rep_idx + 1}/{n_repetitions}")
        
        # Split data with current seed
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        
        # Test each k value
        for k_idx, k in enumerate(k_values):
            print(f"Testing k={k}")
            model = KNNRegressionModel(k=k)
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            mse = model.evaluate(y_test, predictions)
            results[rep_idx, k_idx] = mse
            print(f"MSE for k={k}: {mse:.6f}")
    
    # Calculate mean and std of MSE for each k
    mean_mse = np.mean(results, axis=0)
    std_mse = np.std(results, axis=0)
    
    # Create Bar Chart
    plt.figure(figsize=(12, 6))
    x = np.arange(len(k_values))
    plt.bar(x, mean_mse, yerr=std_mse, capsize=5)
    plt.xlabel('k value')
    plt.ylabel('Mean Squared Error (MSE)')
    plt.title('KNN Regression BarChart')
    plt.xticks(x, k_values)
    plt.grid(True, alpha=0.3)
    
    # Add value labels on top of bars
    for i, v in enumerate(mean_mse):
        plt.text(i, v + std_mse[i], f'{v:.6f}', ha='center', va='bottom')
    
    plt.savefig('KNNRegressionBarChart.png')
    plt.close()
    
    print("\n===== Results =====")
    print("\nMean MSE for each k value:")
    for k, mean, std in zip(k_values, mean_mse, std_mse):
        print(f"k={k}: {mean:.6f} ± {std:.6f}")
    
    best_k_idx = np.argmin(mean_mse)
    best_k = k_values[best_k_idx]
    print(f"\nBest k value: {best_k} (MSE: {mean_mse[best_k_idx]:.6f} ± {std_mse[best_k_idx]:.6f})")
    print(f"- Worst k value: {k_values[np.argmax(mean_mse)]} (MSE: {np.max(mean_mse):.6f})")
   


### Exercise 3: KNN Classification (1 Mandatory , 1 Non-Mandatory)

1. **(Mandatory)** Using the **IrisDataset.csv**, find the best combination of two features that produces the best model using **KNNClassificationModel**.
* You must try all combinations of two features, and for k = 3, 5, 7, and 9.
* You must use plots to support your answer.
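
To make the size of this search concrete, the sketch below enumerates every feature pair and every k with `itertools.combinations`. The column names are the ones used in the solution code; the variable `configurations` is a hypothetical name introduced for illustration.

```python
from itertools import combinations

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
k_values = [3, 5, 7, 9]

# All unordered pairs of the four features
feature_pairs = list(combinations(features, 2))

# Every (pair, k) setting that has to be trained and evaluated
configurations = [(pair, k) for pair in feature_pairs for k in k_values]

print(len(feature_pairs))   # 6 pairs
print(len(configurations))  # 24 (pair, k) settings
```

With only 24 settings, an exhaustive search is cheap, so there is no need for a smarter feature selection strategy here.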

**Discuss your findings for this question below**

- It is clear that the pair {PetalLengthCm, PetalWidthCm} is the best combination.
- The scatter plot is quite similar to the original one: we can observe distinct clusters separating Setosa from the other two species.
- Iris Versicolor and Virginica overlap, having more similar dimensions.

![Scatter Plot](scatter_PetalLengthCm_PetalWidthCm.png)

**Created a specific function inside Run_models.py to solve this problem**

The function returns the results for each possible pair of features and for different k-values.

Best combination:

- Features: ('PetalLengthCm', 'PetalWidthCm')
- k value: 5
- Accuracy: 0.9667

In [None]:
def find_best_features_classification():
    """
    Function to find the best combination of two features for the Iris dataset classification.
    Tests all possible pairs of features with different k values.
    Creates visualization for each combination.
    """
    print("\n=== Finding Best Feature Combination for KNN Classification ===")

    # Load and prepare data
    df = pd.read_csv('IrisDataset_normalized.csv')
    features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
    y = np.array(df['species'])  # Convert to numpy array
    
    # Automatically generate all possible pairs of features
    feature_pairs = list(combinations(features, 2))
    
    k_values = [3, 5, 7, 9]
    n_splits = 5  
    results = {}  
    
    for pair in feature_pairs:
        print(f"\nTesting feature pair: {pair}")
        X = df[list(pair)].values  # Convert to numpy array
        pair_results = np.zeros((len(k_values), n_splits))
        
        for split_idx in range(n_splits):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42 + split_idx
            )
            
            for k_idx, k in enumerate(k_values):
                model = KNNClassificationModel(k=k)
                model.fit(X_train, y_train)
                predictions = model.predict(X_test)
                accuracy = model.evaluate(y_test, predictions) / len(y_test)
                pair_results[k_idx, split_idx] = accuracy
        
        results[pair] = np.mean(pair_results, axis=1)
        
        # Create scatter plot for this feature pair
        plt.figure(figsize=(10, 6))
        unique_species = np.unique(y)
        colors = ['blue', 'red', 'green']
        
        for species, color in zip(unique_species, colors):
            mask = df['species'] == species
            plt.scatter(
                df[pair[0]][mask], 
                df[pair[1]][mask],
                c=color,
                label=species,
                alpha=0.6
            )
        
        plt.xlabel(pair[0])
        plt.ylabel(pair[1])
        plt.title(f'Scatter Plot: {pair[0]} vs {pair[1]}\nBest Accuracy: {np.max(results[pair]):.4f}')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.savefig(f'scatter_{pair[0]}_{pair[1]}.png')
        plt.close()
    
    # Find best combination
    best_pair = max(results.items(), key=lambda x: np.max(x[1]))
    best_k_idx = np.argmax(best_pair[1])
    best_k = k_values[best_k_idx]
    
    print("\n===== Results =====")
    print("\nAccuracy for each feature combination:")
    for pair, accuracies in results.items():
        print(f"\n{pair}:")
        for k, acc in zip(k_values, accuracies):
            print(f"k={k}: {acc:.4f}")
    
    print(f"\nBest combination:")
    print(f"Features: {best_pair[0]}")
    print(f"k value: {best_k}")
    print(f"Accuracy: {best_pair[1][best_k_idx]:.4f}")
    
    # Summary Bar Chart
    plt.figure(figsize=(12, 6))
    x = np.arange(len(feature_pairs))
    
    # Plot the best mean accuracy (maximum over k) for each pair
    avg_accuracies = [np.max(results[pair]) for pair in feature_pairs]
    bars = plt.bar(x, avg_accuracies)
    
    # Customize the plot
    plt.xlabel('Feature Pairs')
    plt.ylabel('Best Accuracy')
    plt.title('Best Accuracy for Different Feature Pairs')
    plt.xticks(x, [f'{p[0]}\n{p[1]}' for p in feature_pairs], rotation=45)
    plt.grid(True, alpha=0.3)
    
    for bar, acc in zip(bars, avg_accuracies):
        plt.text(
            bar.get_x() + bar.get_width()/2,
            bar.get_height(),
            f'{acc:.4f}',
            ha='center',
            va='bottom'
        )
    
    plt.tight_layout()
    plt.savefig('feature_pairs_comparison.png')
    plt.close()

2. **(Non-mandatory)** Implement a new Class called **FastKNNClassificationModel**. This method should be faster than your regular implementation. This can be done by using a faster data structure to look for the closest neighbors faster. In this assignment, you must build the KDTree with the training data and then search for the neighbors using it.

* You must use this implementation of KDTree from Scipy. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html
* The methods needed for your implementation are only the *constructor* (to build the KDTree) and the method *query* to find the k-neighbors.
* You must design an experiment using the **IrisDataset.csv** with **all features** to show that your new implementation is faster than your implementation of **KNNClassificationModel**.
* For example, you can measure the time of each prediction, for each classifier, and plot the average time needed to classify an entry. Also, measure how this time increases or decreases as the input parameter *k* grows.
* Use a plot(s) from matplotlib to support your answer.

**Discuss your findings for this question below**

To solve this exercise, I checked and followed this online information: [Geeks for Geeks](https://www.geeksforgeeks.org/how-to-reduce-knn-computation-time-using-kd-tree-or-ball-tree/)


KNN has a high computational cost, especially with large datasets. There are a few ways to reduce this cost, such as using efficient data structures; KD-trees can significantly reduce query times by partitioning the dataset.
"A KD tree recursively divides the space along alternating dimensions, allowing faster queries with a time complexity of O(logN), compared to the O(N) of brute force." 

I created a class in FastKNNClassificationModel.py that builds a KDTree instance from the training data and calls its query() method. Then, in the class RunFastKNNClassificationModel, I created a method to compare both models across k values. The models are compared based on time and accuracy, although the plot only shows the time differences.
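
As a sanity check on the idea, the sketch below compares `scipy.spatial.KDTree.query` with a brute-force search on the same data. It is an illustration under assumptions: the data is random rather than the Iris dataset, and the sizes and `k` are arbitrary.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
train = rng.random((500, 4))   # synthetic stand-in for a 4-feature training set
queries = rng.random((20, 4))
k = 5

# Brute force: distance from every query to every training point, then sort
diffs = queries[:, None, :] - train[None, :, :]
all_dists = np.sqrt((diffs ** 2).sum(axis=2))      # shape (20, 500)
brute_idx = np.argsort(all_dists, axis=1)[:, :k]
brute_dists = np.take_along_axis(all_dists, brute_idx, axis=1)

# KD-tree: build once, then query the k nearest neighbours of all points at once
tree = KDTree(train)
tree_dists, tree_idx = tree.query(queries, k=k)    # both have shape (20, 5)

print(np.allclose(brute_dists, tree_dists))
```

Both approaches return the same neighbour distances. The difference is that the tree avoids computing the distance to every training point for each query.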

Based on the output printed in the terminal, the regular implementation's time grows, especially after k = 7, while the fast implementation stays nearly constant, with only small variations.

The generated plot shows an increasing trend for the Regular KNN, with a spike after k = 5. In contrast, Fast KNN maintains a consistent time across k values, with minimal increase, so it is more stable than Regular KNN.

This comparison shows that using the KDTree data structure can make the KNN model more efficient than the regular implementation.
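
A timing comparison like this can be sketched with `time.perf_counter`. The helper `average_seconds` and the two stand-in workloads are hypothetical; in the actual experiment the timed callables would be the `predict` calls of the two classifiers.

```python
import time

def average_seconds(fn, repeats=5):
    """Return the mean wall-clock time of calling fn() `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Stand-in workloads (assumption); replace with lambda: model.predict(X_test)
def sort_task():
    return sorted(range(200_000), reverse=True)

def max_task():
    return max(range(200_000))

timings = {
    "sort_task": average_seconds(sort_task),
    "max_task": average_seconds(max_task),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Averaging over several repetitions reduces the influence of one-off delays. This matters on a dataset as small as Iris, where a single prediction run takes very little time.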

![KNN Models Comparison](KNN_Models_Comparison.png)

**Please see part of the implementation below**



In [None]:
class FastKNNClassificationModel(MachineLearningModel):
    """
    Fast KNN Classification Model.
    KDTree data structure for efficient nearest neighbor search.
    """
    
    def __init__(self, k):
        """
        Initialize the model with k neighbors.
        """
        self.k = k
        self.kdtree = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Build the KDTree with training data.
        """
        # Convert inputs to numpy arrays and ensure correct shape
        self.X_train = np.array(X, dtype=float)
        if len(self.X_train.shape) == 1:
            self.X_train = self.X_train.reshape(-1, 1)
            
        self.y_train = np.array(y)
        
        # Here we pass the training data to the KDTree data structure
        self.kdtree = KDTree(self.X_train)
    
    def predict(self, X):
        """
        Predict method for test data using KDTree's efficient nearest neighbor search.
        """
        X = np.array(X, dtype=float)
        if len(X.shape) == 1:
            X = X.reshape(-1, 1)
            
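        # Use KDTree to find k nearest neighbors
        distances, indices = self.kdtree.query(X, k=self.k)
        # query() drops the neighbor axis when k == 1, so restore a 2-D (n_samples, k) shape
        indices = np.asarray(indices).reshape(len(X), -1)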
        
        predictions = []
        for neighbor in indices:
            neighbor_labels = self.y_train[neighbor]
            most_common = Counter(neighbor_labels).most_common(1)[0][0]
            predictions.append(most_common)       
        return np.array(predictions)
    
    def evaluate(self, y_true, y_predicted):
        """
        Evaluate method by counting correct predictions.
        """
        y_true = np.array(y_true)
        y_predicted = np.array(y_predicted)
        return np.sum(y_true == y_predicted)


## Exercise 4: MNIST k-NN classification (Non-mandatory)

In this final exercise, we will use k-NN for classifying handwritten digits using the very famous MNIST dataset. Input to the algorithm is an image (28x28 pixel) with a handwritten digit (0-9) and the output should be a classification 0-9. The dataset and a description of it are available at http://yann.lecun.com/exdb/mnist/. Google MNIST Python to learn how to access it. The objective is to use your k-NN classifier to perform as well as possible at recognizing handwritten images. Describe your effort and what you found out to be the best k to lower the test error. The complete dataset has 60,000 digits for training and 10,000 digits for testing. Hence the computations might be heavy, so start off with a smaller subset rather than using the entire dataset. The final testing should (if possible) be done for the full test set but we will accept solutions that use "only" 10,000 digits for training and 1,000 digits for testing.
The description of this exercise is deliberately vague as you are supposed to, on your own, find a suitable way to solve this problem in detail. This is why it is important that you document your effort and progress in your report. **You must use your implementations of KNN for classification. If you successfully finished Exercise 3, part 2, it is advisable to use your FastKNNClassificationModel**

For this problem I started by getting the dataset using keras.datasets. I also used as reference this link: [MNIST in Keras](https://colab.research.google.com/github/AviatorMoser/keras-mnist-tutorial/blob/master/MNIST%20in%20Keras.ipynb) 

First, I visualized the dataset with the prints on lines 8-11. The for loop was just a plot test, which is why it is commented out.

Then, with the data normalized, I created the KNNClassifier_MNIST class, which calls the fit, predict, and evaluate methods of FastKNNClassificationModel.
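
The preprocessing step is not shown in this report, so the following is only a sketch of one common way to prepare MNIST images for a k-NN model: flatten each 28x28 image into a 784-long vector and scale the pixels to [0, 1]. The division by 255 and the synthetic image batch are assumptions, not necessarily what was used for the results below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a batch of MNIST images: 8 images, 28x28, uint8 pixels in 0..255
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

# Flatten each image to one feature vector and scale the pixels to [0, 1]
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

print(X.shape)           # (8, 784)
print(X.min(), X.max())
```

Scaling all pixels by the same constant does not change which neighbours are nearest under the Euclidean distance, but it keeps the feature values in a small, consistent range.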

When I tried with a train subset of 5000 digits and 1000 for test, these were the results:

    Training on subset with  5000  samples
    Testing on  1000  samples
    Evaluation:  0.912

For such a small subset, I think the accuracy is quite high (this run used only k = 3).

Then, I ran the same code with different values of k:

**Testing with k=5**

    Training on subset with  5000  samples
    Testing on  1000  samples
    Accuracy with k=5: 0.9110

**Testing with k=7**

    Training on subset with  5000  samples
    Testing on  1000  samples
    Accuracy with k=7: 0.9090

**Testing with k=9**

    Training on subset with  5000  samples
    Testing on  1000  samples
    Accuracy with k=9: 0.9120

**Testing with k=11**

    Training on subset with  5000  samples
    Testing on  1000  samples
    Accuracy with k=11: 0.8990

Based on the previous outputs, k = 9 and k = 3 share the highest accuracy (0.912); I believe k = 9 would be the best choice.

Then, I tried to run with 10,000 digits for training and 1000 for testing, for different k's (took some time to run):

**Testing with k=3**

    Training on subset with  10000  samples
    Testing on  1000  samples
    Accuracy with k=3: 0.9220

**Testing with k=5**

    Training on subset with  10000  samples
    Testing on  1000  samples
    Accuracy with k=5: 0.9200

**Testing with k=7**

    Training on subset with  10000  samples
    Testing on  1000  samples
    Accuracy with k=7: 0.9200

**Testing with k=9**

    Training on subset with  10000  samples
    Testing on  1000  samples
    Accuracy with k=9: 0.9160

**Testing with k=11**

    Training on subset with  10000  samples
    Testing on  1000  samples
    Accuracy with k=11: 0.9160

For this last subset, my conclusion is that the model performs best with the smallest value tested, k = 3.

In [None]:
class KNNClassifier_MNIST:
    def __init__(self, k=3):
        self.k = k
        self.model = FastKNNClassificationModel(k=self.k)

    def fit(self, X, y):
        self.model.fit(X, y)

    def predict(self, X_test):
        return self.model.predict(X_test)
    
    def evaluate(self, y_true, y_predicted):
        # Arguments are forwarded unchanged: true labels first, predictions second
        return self.model.evaluate(y_true, y_predicted)
    
    def main(self):
        # Smaller subset
        train = 10000
        test = 1000
        
        k_values = [3, 5, 7, 9, 11]
        # find best k value
        for k in k_values:
            knn = KNNClassifier_MNIST(k=k)
            print(f"\nTesting with k={k}")
            print("Training on subset with ", train, " samples")
            knn.fit(X_train_normalized[:train], train_y[:train])
            
            print("Testing on ", test, " samples")
            predictions = knn.predict(X_test_normalized[:test])
            
            accuracy = knn.evaluate(test_y[:test], predictions) / test
            print(f"Accuracy with k={k}: {accuracy:.4f}")
            print("-" * 50)