# Assignment 1

This assignment requires you to implement a K-Nearest Neighbors classifier and test it using the `column_2C_weka` data from the **Vertebral Column dataset**, which you can download from [this page on the UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/212/vertebral+column). It walks you through the steps of a typical machine learning workflow.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Fetching and preparing the data

* **(5 points)** Read the `column_2C_weka` data of the above Vertebral Column dataset into a Pandas dataframe. The dataset contains six input features plus a binary target column (the output).

In [2]:
vert = pd.read_csv('vertebral+column/column_2C.DAT', sep=' ', header=None)
vert.columns = ['feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'output']
vert

Unnamed: 0,feat1,feat2,feat3,feat4,feat5,feat6,output
0,63.03,22.55,39.61,40.48,98.67,-0.25,AB
1,39.06,10.06,25.02,29.00,114.41,4.56,AB
2,68.83,22.22,50.09,46.61,105.99,-3.53,AB
3,69.30,24.65,44.31,44.64,101.87,11.21,AB
4,49.71,9.65,28.32,40.06,108.17,7.92,AB
...,...,...,...,...,...,...,...
305,47.90,13.62,36.00,34.29,117.45,-4.25,NO
306,53.94,20.72,29.22,33.22,114.37,-0.42,NO
307,61.45,22.69,46.17,38.75,125.67,-2.71,NO
308,45.25,8.69,41.58,36.56,118.55,0.21,NO


* **(5 points)** Re-code the values of the output column to 0 for “Normal” and 1 for “Abnormal”.

In [3]:
vert['output'] = vert['output'].map({'AB': 1, 'NO': 0})
vert

Unnamed: 0,feat1,feat2,feat3,feat4,feat5,feat6,output
0,63.03,22.55,39.61,40.48,98.67,-0.25,1
1,39.06,10.06,25.02,29.00,114.41,4.56,1
2,68.83,22.22,50.09,46.61,105.99,-3.53,1
3,69.30,24.65,44.31,44.64,101.87,11.21,1
4,49.71,9.65,28.32,40.06,108.17,7.92,1
...,...,...,...,...,...,...,...
305,47.90,13.62,36.00,34.29,117.45,-4.25,0
306,53.94,20.72,29.22,33.22,114.37,-0.42,0
307,61.45,22.69,46.17,38.75,125.67,-2.71,0
308,45.25,8.69,41.58,36.56,118.55,0.21,0


* **(5 points)** Perform feature scaling by normalizing the values of each feature column. Feature scaling keeps features with large value ranges from dominating the distance computation and drowning out features with small ranges. Use the following min-max formula for normalization: $$ x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

In [4]:
features = ['feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6']
vert[features] = vert[features].apply(lambda col: (col - col.min()) / (col.max() - col.min()))
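
As a quick sanity check of the normalization formula, the sketch below applies it to a tiny frame whose values are made up for illustration (the names `toy` and `scaled` are not part of the assignment). After scaling, every column should have a minimum of 0 and a maximum of 1.

```python
import pandas as pd

# Toy frame with hypothetical values, used only to illustrate the formula.
toy = pd.DataFrame({"a": [2.0, 4.0, 10.0], "b": [-1.0, 0.0, 1.0]})

# Column-wise min-max normalization: (x - x_min) / (x_max - x_min)
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Note that a column whose values are all identical would give a zero denominator and produce `NaN`, so it may be worth checking for constant columns before scaling.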

* **(5 points)** Split the dataset into two sets: 25% for testing and the rest for training.

In [5]:
# shuffle all rows (no seed is set, so the split differs between runs)
vert = vert.sample(len(vert))

#get split idx
split_idx = int(.75 * len(vert))

#split
training, testing = vert.iloc[:split_idx,:], vert.iloc[split_idx:,:]

print(training.shape)
print(testing.shape)

(232, 7)
(78, 7)
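
Because the shuffle above is unseeded, the rows that land in each set change on every run. One possible way to make the split repeatable is sketched below; the helper name `shuffle_split`, the `seed` parameter, and the synthetic `demo` frame are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, test_fraction=0.25, seed=0):
    """Shuffle the rows of df, then hold out the last test_fraction for testing."""
    shuffled = df.sample(frac=1, random_state=seed)  # seeded, hence repeatable
    split_idx = int((1 - test_fraction) * len(shuffled))
    return shuffled.iloc[:split_idx], shuffled.iloc[split_idx:]

# Synthetic stand-in with the same number of rows as the dataset (310).
demo = pd.DataFrame({"feat1": np.arange(310), "output": np.arange(310) % 2})
train_demo, test_demo = shuffle_split(demo)
print(train_demo.shape, test_demo.shape)
```

Calling the helper twice with the same `seed` returns the same partition, which makes later error estimates easier to compare.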


In [10]:
training.iloc[:,:-1]

Unnamed: 0,feat1,feat2,feat3,feat4,feat5,feat6
217,0.115548,0.234191,0.326204,0.168425,0.667061,0.040503
148,0.382041,0.292605,0.329515,0.393763,0.368964,0.117249
81,0.461613,0.494284,0.388223,0.365630,0.539090,0.199302
149,0.512153,0.544659,0.239842,0.388025,0.306915,0.111196
211,0.277488,0.455877,0.336495,0.208958,0.599742,0.030400
...,...,...,...,...,...,...
72,0.567323,0.706860,0.419366,0.357024,0.597699,0.198766
107,0.504244,0.714184,0.557365,0.292708,0.736316,0.205354
275,0.396798,0.415684,0.331126,0.344253,0.725992,0.037291
263,0.073688,0.182744,0.102917,0.154914,0.626411,0.021601


In [14]:
training

Unnamed: 0,feat1,feat2,feat3,feat4,feat5,feat6,output
217,0.115548,0.234191,0.326204,0.168425,0.667061,0.040503,0
148,0.382041,0.292605,0.329515,0.393763,0.368964,0.117249,1
81,0.461613,0.494284,0.388223,0.365630,0.539090,0.199302,1
149,0.512153,0.544659,0.239842,0.388025,0.306915,0.111196,1
211,0.277488,0.455877,0.336495,0.208958,0.599742,0.030400,0
...,...,...,...,...,...,...,...
72,0.567323,0.706860,0.419366,0.357024,0.597699,0.198766,1
107,0.504244,0.714184,0.557365,0.292708,0.736316,0.205354,1
275,0.396798,0.415684,0.331126,0.344253,0.725992,0.037291,0
263,0.073688,0.182744,0.102917,0.154914,0.626411,0.021601,0


## Implementing the classifier

* **(30 points)** Define a class named `KNearestNeighborClassifier` that generalizes the `NearestNeighborClassifier` of the `05.nearest_neighbors.ipynb` handout to `k` neighbors instead of just 1. The class must define `fit()` and `predict()` methods, just like `NearestNeighborClassifier` does.

In [25]:
class KNearestNeighborClassifier:
    def fit(self, X, y):
        self.X = X
        self.y = y
        
        return self
    
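    def predict(self, X_unseen, k):
        if X_unseen.ndim == 1:
            # Euclidean distance from the query point to every training point
            distances = np.sqrt(np.sum((self.X - X_unseen)**2, axis=1))
            # partition the indices so that the k nearest come first (unsorted)
            k_min_idx = distances.argpartition(k)
            # labels of the k nearest neighbors
            k_min_vals = self.y[k_min_idx[:k]]
            # TODO: return the majority label; this branch only prints, so it yields None
            print(X_unseen, k_min_vals)
        else:
            return np.array([self.predict(ex, k) for ex in X_unseen])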

In [31]:
np.array([1, 2, 3, 4, 7, 7]).index(7)

AttributeError: 'numpy.ndarray' object has no attribute 'index'
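
The error above occurs because `index()` is a method of Python lists, not of NumPy arrays. A short sketch of two equivalents, using the same example values:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 7, 7])

# Positions of every element equal to 7
all_matches = np.flatnonzero(arr == 7)

# Position of the first match only
first_match = int(all_matches[0])

# Alternatively, convert to a list and use list.index()
list_index = arr.tolist().index(7)

print(all_matches, first_match, list_index)
```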

In [27]:
nn = KNearestNeighborClassifier().fit(training.iloc[:,:-1].values, training.iloc[:,-1].values)
nn.predict(testing.iloc[:,:-1].values, 4)

[0.16946373 0.29224723 0.34007517 0.19007959 0.19733305 0.12090317] [1 1 1 1]
[0.21392747 0.51411218 0.19849651 0.11780492 0.50865684 0.04082868] [1 0 0 1]
[1.         0.26705966 0.30767854 1.         0.40445209 1.        ] [1 1 1 1]
[0.19560185 0.23526259 0.30517272 0.24467888 0.64813421 0.03144786] [0 0 0 1]
[0.44762731 0.42765273 0.34007517 0.38691468 0.5281213  0.10048883] [1 1 0 1]
[0.20746528 0.35423365 0.20297118 0.19442902 0.3029358  0.04033985] [1 1 1 1]
[0.32262731 0.6886388  0.29139073 0.1316861  0.52962684 0.02916667] [1 0 1 1]
[0.61101466 0.99053233 0.51906211 0.25198964 0.69416066 0.30123371] [1 1 1 1]
[0.07417052 0.20757413 0.20261321 0.14251342 0.5793096  0.02527933] [0 0 1 0]
[0.37104552 0.38906752 0.47995346 0.33342587 0.21744274 0.09867318] [1 1 1 1]
[0.44299769 0.45551983 0.33112583 0.36794373 0.47456716 0.0280959 ] [0 1 1 1]
[0.43489583 0.21114684 0.64435296 0.48676661 0.43682116 0.11480447] [1 1 1 1]
[0.35706019 0.23008217 0.43851799 0.40227651 0.43617593 0.124930

array([None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None, None,
       None], dtype=object)
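
The array of `None` values above comes from the single-example branch of `predict()`, which prints the neighbor labels but never returns a value. The sketch below shows one way to add the missing majority vote. It is an illustration under stated assumptions, not the handout's reference solution: the class name `KNearestNeighborSketch` is hypothetical, `k` is moved into the constructor so that `predict()` takes only the data, and ties are broken in favor of the smallest label.

```python
import numpy as np

class KNearestNeighborSketch:
    """Illustrative k-NN classifier with majority voting (hypothetical name)."""

    def __init__(self, k=1):
        self.k = k  # assumption: k is fixed at construction time

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_unseen):
        X_unseen = np.asarray(X_unseen, dtype=float)
        if X_unseen.ndim == 1:
            # Euclidean distance from the query point to every training point
            distances = np.sqrt(np.sum((self.X - X_unseen) ** 2, axis=1))
            # kth=k-1 places the k smallest distances in the first k slots
            nearest = np.argpartition(distances, self.k - 1)[: self.k]
            labels, counts = np.unique(self.y[nearest], return_counts=True)
            # argmax picks the first maximum, so ties go to the smallest label
            return labels[np.argmax(counts)]
        return np.array([self.predict(ex) for ex in X_unseen])

# Tiny made-up example: two well-separated groups of three points each.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
model = KNearestNeighborSketch(k=3).fit(X_toy, y_toy)
toy_predictions = model.predict([[0.2, 0.1], [5.5, 5.5]])
print(toy_predictions)
```

With an even `k` and two classes, ties are possible; choosing an odd `k` avoids them for binary problems.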

## Hyperparameter tuning using 10-fold cross-validation
* **(20 points)** Define a function that implements **10-fold cross-validation**. See the lecture notes for how such a function works. Use $1 - accuracy$ as an error measure. The learner in this case is your `KNearestNeighborClassifier` class and $k$ (the number of neighbors) is the hyperparameter.

In [7]:
def fold10_cv(X, y):
    batch_size = len(X) // 10
    err = 0
    for i in range(10):
        

IndentationError: expected an indented block (2898989707.py, line 5)

In [None]:
tmp = vert.feat1.to_numpy()
print(len(tmp))
batch_size = len(tmp) // 10
print(batch_size)

In [None]:
for i in range(10):
    test_start_idx = i * batch_size
    test_end_idx = (i + 1) * batch_size
    print(test_start_idx, test_end_idx)

In [None]:
# size of the current validation fold (the last one from the loop above)
len(tmp[test_start_idx:test_end_idx])

In [None]:
#for curr train set
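
The fold bookkeeping explored in the cells above can be assembled into a complete cross-validation routine. The following is a minimal sketch, not the lecture notes' version. It makes several assumptions: `knn_predict` is a compact stand-in for the classifier class, `k` is passed as an extra argument, the rows are already shuffled, and `np.array_split` is used so that leftover rows are spread across folds rather than dropped (which is what `len(X) // 10` would do when the row count is not a multiple of 10).

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Compact k-NN stand-in: majority vote among the k nearest points."""
    preds = []
    for x in X_test:
        d = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(d)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def fold10_cv(X, y, k, n_folds=10):
    """Return the mean of (1 - accuracy) over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    errors = []
    for val_idx in folds:
        # every row outside the current validation fold is used for training
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False
        preds = knn_predict(X[train_mask], y[train_mask], X[val_idx], k)
        errors.append(1.0 - np.mean(preds == y[val_idx]))
    return float(np.mean(errors))

# Synthetic, well-separated clusters (not the assignment data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)
order = rng.permutation(len(X_demo))
cv_error = fold10_cv(X_demo[order], y_demo[order], k=3)
print(cv_error)
```

On such cleanly separated synthetic clusters the error should be zero; real data will generally give a nonzero error that varies with `k`.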

* **(10 points)** Use the above **10-fold cross-validation** function to plot the values of `k` (1 to 20) against their cross-validation errors. Using this plot, what is the best value for `k`?

In [None]:
# TODO
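
One possible shape for the plotting step is sketched below. The helper name `plot_cv_errors` is hypothetical, and the error values are placeholders generated from a formula purely to exercise the plot. They are NOT results from the Vertebral Column data; in practice they would come from the cross-validation function evaluated at each `k`.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

def plot_cv_errors(ks, errors):
    """Plot CV error against k and return the figure and the k with the lowest error."""
    ks = np.asarray(ks)
    errors = np.asarray(errors)
    best = int(ks[np.argmin(errors)])  # first k that attains the minimum error
    fig, ax = plt.subplots()
    ax.plot(ks, errors, marker="o")
    ax.axvline(best, linestyle="--", color="gray")
    ax.set_xlabel("k (number of neighbors)")
    ax.set_ylabel("10-fold CV error (1 - accuracy)")
    ax.set_xticks(ks)
    return fig, best

# Placeholder errors, NOT measured on the dataset.
ks = np.arange(1, 21)
placeholder_errors = 0.15 + 0.001 * (ks - 7) ** 2
fig, best_k = plot_cv_errors(ks, placeholder_errors)
print(best_k)
```

If several values of `k` share the minimum error, `np.argmin` returns the smallest of them.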

## Testing the classifier
* **(10 points)** Train a model of the above `KNearestNeighborClassifier` with the “found-out” best `k` value. Test the trained model using the testing set. Show its confusion matrix and accuracy.

In [None]:
# TODO
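
For the confusion matrix and accuracy, a minimal NumPy-only sketch follows. The function names and the label vectors are made up for illustration; the layout assumed here puts true labels on the rows and predicted labels on the columns, with 0 for "Normal" and 1 for "Abnormal" as in the re-coding step.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of examples on the diagonal (correct predictions)."""
    return np.trace(cm) / cm.sum()

# Hypothetical labels, not model output.
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]
cm_demo = confusion_matrix_2x2(y_true_demo, y_pred_demo)
print(cm_demo)
print(accuracy(cm_demo))
```

In this layout `cm[0, 0]` counts true negatives, `cm[0, 1]` false positives, `cm[1, 0]` false negatives, and `cm[1, 1]` true positives.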

* **(10 points)** Plot a point corresponding to the above confusion matrix on the ROC (Receiver Operating Characteristic) plot; no curve, just a point. Make sure to also plot the line from (0,0) to (1,1). 

In [None]:
# TODO
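
A single classifier at a fixed `k` yields one point in ROC space: the false positive rate on the x-axis and the true positive rate on the y-axis. The sketch below assumes a 2x2 confusion matrix with true labels on the rows and predicted labels on the columns; the matrix values are hypothetical and the helper name `roc_point` is not from the assignment.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

def roc_point(cm):
    """Return (FPR, TPR) for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    fpr = fp / (fp + tn)  # false positive rate
    tpr = tp / (tp + fn)  # true positive rate (recall)
    return fpr, tpr

# Hypothetical confusion matrix, not a result from the dataset.
cm_example = np.array([[2, 1], [1, 4]])
fpr, tpr = roc_point(cm_example)

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="chance line")
ax.scatter([fpr], [tpr], color="red", label="classifier")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.legend()
```

A point above the diagonal indicates better-than-chance performance; the closer it lies to the top-left corner (0, 1), the better.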