## Supplement 4: Classification

In [40]:
%matplotlib inline
import numpy as np
import pandas as pd

### 4.2 Programming Task: K-Nearest Neighbor
The files __train-knn.csv__ and __test-knn.csv__ contain samples from a synthetic dataset for training a K-Nearest Neighbor classifier.
The dataset has 7 columns: the first six, denoted x1, x2, ..., x6, are the input features of each sample,
 and the last column is the class label, either 0 or 1.
There are 200 samples in __train-knn.csv__ and 100 samples in __test-knn.csv__.

i\. Implement the K-Nearest Neighbor classification algorithm using NumPy and SciPy.



In [43]:
# Load the training set and split it into features and labels
training_data = pd.read_csv('./train-knn.csv')
X_train = training_data.drop(columns=['class']).to_numpy()
y_train = training_data['class'].to_numpy()
num_neighbors = 3

class kNearestNeighbor(object):
    def __init__(self, k):
        self.k = k  # number of neighbors that vote

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self._X = X
        self._y = y

    def predict(self, X_test):
        predictions = np.zeros(X_test.shape[0], dtype=np.int32)
        for idx, x in enumerate(X_test):
            # Euclidean distance from x to every training sample
            distances = np.sqrt(np.sum((self._X - x)**2, axis=-1))
            # Labels of the k closest training samples
            k_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self._y[k_indices]
            # Majority vote; ties are broken uniformly at random
            unique, counts = np.unique(k_nearest_labels, return_counts=True)
            candidates = unique[counts == np.max(counts)]
            predictions[idx] = np.random.choice(candidates)
        return predictions

knn = kNearestNeighbor(num_neighbors)
knn.fit(X_train, y_train)

# Load the test set and report the fraction of correct predictions
test_data = pd.read_csv('./test-knn.csv')
X_test = test_data.drop(columns=['class']).to_numpy()
y_test = test_data['class'].to_numpy()
predictions = knn.predict(X_test)
print(np.sum(predictions == y_test) / y_test.size)

0.8
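
The per-sample loop in `predict` can also be written with a single distance matrix. The following is a minimal sketch under some assumptions, not the implementation used above: `pairwise_euclidean` and `knn_predict_vectorized` are hypothetical helper names, the toy data is made up, labels are assumed to be non-negative integers, and ties go to the smaller label instead of being broken at random. `scipy.spatial.distance.cdist` would produce the same distance matrix as the broadcasting step.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Return the (len(A), len(B)) matrix of Euclidean distances."""
    # A[:, None, :] has shape (m, 1, d); B[None, :, :] has shape (1, n, d)
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def knn_predict_vectorized(X_train, y_train, X_test, k):
    """Hypothetical helper: majority-vote KNN for non-negative integer labels."""
    distances = pairwise_euclidean(X_test, X_train)
    # Indices of the k nearest training samples for every test sample
    nearest = np.argsort(distances, axis=1)[:, :k]
    votes = y_train[nearest]
    # bincount counts each label; argmax picks the most frequent one
    # and resolves ties in favor of the smaller label
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data (made up for illustration): two clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.05, 0.05], [1.0, 0.95]])
toy_predictions = knn_predict_vectorized(X_toy, y_toy, queries, k=3)
print(toy_predictions)  # [0 1]
```

The distance matrix needs memory proportional to the number of test samples times the number of training samples, which is negligible for a 100 by 200 problem but may matter for larger datasets.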


ii\. Perform cross-validation (with 5 folds) on the train dataset __train-knn.csv__ to determine a suitable value of K.


In [44]:
def kfold_indices(data, k):
    # Split sample indices into k contiguous folds, without shuffling.
    # If len(data) is not divisible by k, the leftover samples at the end
    # stay in every training split and are never used for validation.
    fold_size = len(data) // k
    indices = np.arange(len(data))
    folds = []
    for i in range(k):
        test_indices = indices[i * fold_size: (i + 1) * fold_size]
        train_indices = np.concatenate([indices[:i * fold_size], indices[(i + 1) * fold_size:]])
        folds.append((train_indices, test_indices))
    return folds

# Number of cross-validation folds (not the number of neighbors K)
k = 5
# Candidate numbers of neighbors
K_values = np.arange(1, 21)
# Get the fold indices
fold_indices = kfold_indices(X_train, k)
mean_accuracies = []
for K in K_values:
    model = kNearestNeighbor(K)
    fold_accuracies = []
    for train_indices, test_indices in fold_indices:
        X_train_cv, y_train_cv = X_train[train_indices], y_train[train_indices]
        X_test_cv, y_test_cv = X_train[test_indices], y_train[test_indices]
        model.fit(X_train_cv, y_train_cv)
        y_pred_cv = model.predict(X_test_cv)
        fold_accuracy = np.sum(y_pred_cv == y_test_cv) / y_test_cv.size
        fold_accuracies.append(fold_accuracy)
    mean_accuracies.append(np.mean(fold_accuracies))
# argmax returns the first maximum, i.e. the smallest K with the best mean accuracy
optimal_k = K_values[np.argmax(mean_accuracies)]
print(f"The suitable value of K is {optimal_k}")

The suitable value of K is 11
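
The folds above are contiguous blocks of the file, which works when the rows are in no particular order. If the rows were sorted (for example by class), shuffling before splitting would be safer. The sketch below is one possible variant, not the method used above: `shuffled_kfold_indices` and its `seed` parameter are hypothetical, and `np.array_split` is used so that sample counts not divisible by the number of folds are handled.

```python
import numpy as np

def shuffled_kfold_indices(n_samples, n_folds, seed=0):
    """Hypothetical helper: shuffled (train, validation) index pairs."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    # array_split tolerates n_samples not being divisible by n_folds
    chunks = np.array_split(permuted, n_folds)
    folds = []
    for i, val_idx in enumerate(chunks):
        # Training indices are all chunks except the i-th one
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != i])
        folds.append((train_idx, val_idx))
    return folds

folds = shuffled_kfold_indices(n_samples=23, n_folds=5)
print([len(val) for _, val in folds])  # [5, 5, 5, 4, 4]
```

Every sample lands in exactly one validation fold, so no data is left out of the evaluation even when the fold sizes differ by one.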


iii\. Using the optimal value of K from the cross-validation, obtain the accuracy of your model on the test dataset __test-knn.csv__.


In [46]:
# Refit on the full training set with the K selected by cross-validation
knn_opt = kNearestNeighbor(optimal_k)
knn_opt.fit(X_train, y_train)
predictions_opt = knn_opt.predict(X_test)
accuracy = np.sum(predictions_opt == y_test) / y_test.size
print(f"The accuracy of the model is {accuracy*100}%")

The accuracy of the model is 77.0%


iv\. Compare your result with the KNeighborsClassifier model from the scikit-learn library.

In [48]:
from sklearn.neighbors import KNeighborsClassifier

# Defaults: uniform weights and the Minkowski metric with p=2 (Euclidean distance)
knn_sklearn = KNeighborsClassifier(n_neighbors=11)
knn_sklearn.fit(X_train, y_train)
predictions_sklearn = knn_sklearn.predict(X_test)
Accuracy_sk = np.sum(predictions_sklearn == y_test) / y_test.size
print(f"The accuracy of the model is {Accuracy_sk*100}%")


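The accuracy of the model is 77.0%

Both models reach the same test accuracy of 77.0% with K = 11, which is expected because both use Euclidean distance with an unweighted majority vote.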


v\. How do the bias and variance of each model vary as K increases?

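Bias increases and variance decreases as K increases. With K = 1 each prediction depends on a single training sample, so the decision boundary follows the noise in the training set (low bias, high variance). With larger K the vote averages over more neighbors, which smooths the boundary (lower variance) but can blur real structure in the data (higher bias). This holds for both the custom implementation and the scikit-learn model, since they implement the same algorithm.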
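
The two extremes of this trade-off can be illustrated on made-up data. The sketch below is only an illustration under stated assumptions: the two Gaussian classes, the 60/40 class split, and the helper names `knn_predict` and `training_accuracy` are all hypothetical and unrelated to the CSV files. It evaluates the model on its own training set with K = 1 and with K equal to the number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: two overlapping Gaussian classes in 2D, 60 vs 40 samples
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(1.5, 1.0, size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)

def knn_predict(X_train, y_train, X_query, k):
    # Distance matrix between queries and training samples via broadcasting
    diff = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diff**2).sum(axis=-1))
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the k nearest labels (ties go to the smaller label)
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

def training_accuracy(k):
    return np.mean(knn_predict(X, y, X, k) == y)

acc_k1 = training_accuracy(1)       # every sample is its own nearest neighbor
acc_kn = training_accuracy(len(X))  # every query sees all labels
print(acc_k1, acc_kn)
```

With K = 1 the training accuracy is 1.0 because the model memorizes the training set, noise included, which is the high-variance extreme. With K equal to the number of samples the model predicts the majority class for every input and reaches only 0.6 here, which is the high-bias extreme.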