### K-Nearest Neighbors (KNN) Classification
1. Initialize the value of K. K is the number of nearest neighbors to consider, and it controls the balance between overfitting (small K) and underfitting (large K). [1 point]

2. Prepare the training data. [1 point]

3. For each example in the test set:

    3.1 Calculate the Euclidean distance between the current test example and every training example in X_train. [1 point]

    3.2 Sort the distances in ascending order and get the K nearest neighbors based on the calculated distances. [1 point]

    3.3 Get the labels of the K nearest neighbors. [1 point]

    3.4 Get the most common label using [np.unique with return_counts=True and np.argmax] or scipy.stats.mode. [2 points]

    3.5 Append the predicted label to the output list. [1 point]

4. Verify your classifier and fine-tune it (vary K and observe the change in accuracy) using the Breast Cancer dataset. [1 point for fine-tuning and discussion, and 1 point for successfully running the model]
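
Steps 3.1 to 3.4 can be traced by hand on a tiny example before running them on real data. The sketch below uses made-up training points and a single made-up test sample; the values and variable names are illustrative assumptions, not part of the assignment data.

```python
import numpy as np

# Toy training set (made-up values): two features, two classes.
X_train = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [2.5, 2.0],
                    [6.0, 6.0],
                    [7.0, 6.0]])
y_train = np.array([0, 0, 1, 1, 1])
sample = np.array([2.0, 2.0])  # one test example
k = 3

# 3.1 Euclidean distance from the sample to every training row.
distances = np.linalg.norm(X_train - sample, axis=1)

# 3.2 Indices of the k smallest distances.
nearest_idx = np.argsort(distances)[:k]

# 3.3 Labels of those neighbors.
nearest_labels = y_train[nearest_idx]

# 3.4 Majority vote over the neighbor labels.
labels, counts = np.unique(nearest_labels, return_counts=True)
predicted_label = labels[np.argmax(counts)]

print(distances.round(3))   # [1.414 1.    0.5   5.657 6.403]
print(nearest_idx)          # [2 1 0]
print(nearest_labels)       # [1 0 0]
print(predicted_label)      # 0
```

The closest point belongs to class 1, but two of the three nearest neighbors belong to class 0, so the vote returns 0.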

In [56]:
import numpy as np

class KNNClassifier:
    def __init__(self, k=3):
        # 1. Initialize the number of neighbors K. [1 point]
        "YOUR CODE"
        self.k=k

    def fit(self, X_train, y_train):
        # 2. Prepare the training data [1 point]
        "YOUR CODE"
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        # batch prediction
        predictions = []

        # 3. loop over all samples in the test set
        for sample in X_test:
            
            # 3.1 Compute the distance between the test sample and all training samples in X_train, using np.linalg.norm [1 point]
            "YOUR CODE"
            distances = np.linalg.norm(sample - self.X_train, axis=1)

            # 3.2 Sort the distances and return the indices of K nearest neighbors using np.argsort [1 point]
            "YOUR CODE"
            nearest_neighbors_indices = np.argsort(distances)[:self.k]

            # 3.3 Get the labels of the K nearest neighbors [1 point]
            "YOUR CODE"
            nearest_labels = self.y_train[nearest_neighbors_indices]

            # 3.4 Get the most common label using [np.unique with return_counts=True and np.argmax] or scipy.stats.mode [2 points]
            "YOUR CODE"
            unique_vals = np.unique(nearest_labels, return_counts=True)
            predicted_label = unique_vals[0][np.argmax(unique_vals[1])]
            
            # 3.5 Append the predicted label to the output [1 point]
            "YOUR CODE"
            predictions.append(predicted_label)

        # Return the predictions for all test samples
        return np.array(predictions)
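
One property of step 3.4 is worth noting: `np.unique` returns the labels in sorted order and `np.argmax` returns the first maximum, so when two labels receive the same number of votes, the smaller label wins. A minimal sketch with made-up neighbor labels follows; `majority_vote` is a hypothetical helper name used only for illustration, not part of the class above.

```python
import numpy as np

def majority_vote(labels):
    """Return the most frequent label; ties go to the smallest label."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

print(majority_vote(np.array([1, 0, 1, 0])))  # tie between 0 and 1 -> 0
print(majority_vote(np.array([1, 1, 0])))     # clear majority -> 1
```

For a two-class problem such as the Breast Cancer dataset, an odd K cannot produce a tie.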


In [37]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the KNN classifier
knn_classifier = KNNClassifier(k=3)
knn_classifier.fit(X_train, y_train)

# Make predictions
predictions = knn_classifier.predict(X_test)

# Calculate accuracy [1 point for good accuracy]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Remember you need to fine tune your classifier (change K values to see the change in accuracy). 
# 1 point will be given for fine-tuning and a brief discussion.

Accuracy: 0.9298245614035088
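
Because the prediction rests on Euclidean distance, features with large numeric ranges can dominate the neighbor search, and the Breast Cancer features sit on very different scales. One option to try alongside tuning K is z-score standardization using statistics from the training split only. The sketch below runs on made-up numbers; `standardize` is a hypothetical helper, and no claim is made here about how it would change the accuracy reported above.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both splits with the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std

# Made-up data: the second feature is several orders of magnitude larger.
X_train_toy = np.array([[0.1, 1000.0], [0.2, 1500.0], [0.3, 2000.0]])
X_test_toy = np.array([[0.2, 1200.0]])

X_train_scaled, X_test_scaled = standardize(X_train_toy, X_test_toy)
print(X_train_scaled.mean(axis=0))  # close to 0 for each feature
print(X_train_scaled.std(axis=0))   # close to 1 for each feature
```

The scaled arrays can then be passed to `fit` and `predict` in place of the raw ones.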


**Fine-tuning K**

I fine-tune K by testing a range of values and comparing the resulting accuracy on the test set. The loop runs K from 1 up to one less than the number of training observations (`range(1, len(X_train))`), and the results are plotted below. The run reports a best K of 10, with an accuracy of about 98.2%.

In [55]:
# fine-tuning K

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import altair as alt
import pandas as pd

breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

a = []
index = []
for i in range(1,len(X_train)):
    index.append(i)
    # Create and train the KNN classifier
    knn_classifier = KNNClassifier(k=i)
    knn_classifier.fit(X_train, y_train)
    
    # Make predictions
    predictions = knn_classifier.predict(X_test)
    
    # Calculate accuracy [1 point for good accuracy]
    accuracy = accuracy_score(y_test, predictions)
    a.append(accuracy)

df = pd.DataFrame({'index': index, 'accuracy': a})

chart = alt.Chart(df).mark_line().encode(
    x=alt.X('index', title='Choice of K'),
    y=alt.Y('accuracy', title='Accuracy')
).properties(
    width=800,
    height=500
)
display(chart)

optimal_k = df[df['accuracy'] == df['accuracy'].max()].index.values[0]
optimal_accuracy = df['accuracy'].max()
print("Best K to pick is", optimal_k, "which gives an accuracy of", optimal_accuracy)

Best K to pick is 10 which gives an accuracy of 0.9824561403508771
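
One caveat about the selection step: `df[...].index.values[0]` returns the DataFrame's row label, which starts at 0, rather than the value stored in the `index` column, which starts at 1. The printed number therefore appears to be one less than the K that was actually evaluated in that row (row label 10 holds K = 11). A minimal sketch of reading K from the column instead, using made-up accuracy values:

```python
import pandas as pd

# Made-up accuracies for K = 1..5; the best one is at K = 3.
df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
                   'accuracy': [0.90, 0.92, 0.95, 0.93, 0.91]})

# Row label of the first best row (0-based in the default RangeIndex).
row_label = df['accuracy'].idxmax()

# K stored in that row, which is the value that was actually evaluated.
best_k = df.loc[row_label, 'index']

print("row label:", row_label, "K:", best_k)  # row label: 2 K: 3
```

Like the original filter, `idxmax` picks the first row when several values of K share the maximum accuracy.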
