##### AML class exercise/mandatory assignment 1

In the course of this lab, you used the NearestCentroid class from the sklearn library to make predictions.
You now know enough to write a similar NearestCentroid classifier from scratch.

Your custom NearestCentroid implementation should accept any dataset as input, with any number of labels, and make predictions.
But first, code it specifically for the Iris dataset with 3 labels, and then generalize it to n labels.

1. Separate the dataset into n groups, one per label, using boolean-mask indexing.
2. Calculate the centroid of each class (the feature-wise mean of its training points).
3. For each incoming test data point, compute its distance to every centroid. Each test point is assigned to the class whose centroid is closest.
4. For the given train/test split, verify that your predictions match those of sklearn's NearestCentroid.
5. **<font color="red">Write the code as reusable Python classes along the lines of sklearn classes (but don't aim for it at the outset)</font>**
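
Steps 1 and 2 can be sketched on a small example before moving to Iris. The arrays `X_toy` and `y_toy` below are hypothetical toy data, not part of the assignment; the sketch only illustrates boolean-mask indexing and the per-class mean.

```python
import numpy as np

# Hypothetical toy data: 6 points, 2 features, 3 labels
X_toy = np.array([[1.0, 2.0], [3.0, 4.0],
                  [10.0, 10.0], [12.0, 14.0],
                  [-1.0, 0.0], [-3.0, 2.0]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

labels = np.unique(y_toy)

# (y_toy == lbl) is a boolean mask that selects the rows of one class;
# the mean over axis 0 of those rows is that class's centroid
centroids = np.array([X_toy[y_toy == lbl].mean(axis=0) for lbl in labels])

print(centroids)
```

Each row of `centroids` corresponds to the label at the same position in `labels`.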

Hints:
1. To calculate the distance between two data points a and b, use np.linalg.norm(a - b). Here, the distance between every test point and every centroid must be calculated.
2. You can implement this with two traditional nested for loops or, if you can, with vectorization.
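
As a rough illustration of both hints, the sketch below computes the same distance matrix twice, once with nested loops and once with NumPy broadcasting. The arrays `points` and `centers` are made-up values chosen so the distances are easy to check by hand.

```python
import numpy as np

# Hypothetical toy inputs: 3 points and 2 centroids in 2 dimensions
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centers = np.array([[0.0, 0.0], [6.0, 8.0]])

# Nested loops: one np.linalg.norm call per (point, centroid) pair
dist_loop = np.empty((len(points), len(centers)))
for i in range(len(points)):
    for j in range(len(centers)):
        dist_loop[i, j] = np.linalg.norm(points[i] - centers[j])

# Broadcasting: (3, 1, 2) - (2, 2) gives (3, 2, 2);
# taking the norm over the last axis leaves a (3, 2) distance matrix
dist_vec = np.linalg.norm(points[:, np.newaxis, :] - centers, axis=2)

print(dist_loop)
print(np.allclose(dist_loop, dist_vec))
```

Row i of either matrix holds the distances from point i to each centroid, so `np.argmin(..., axis=1)` picks the closest centroid per point.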

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestCentroid

In [3]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

y = df.pop("target").to_numpy()
X = df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, random_state = 0)

In [4]:
model_sk = NearestCentroid()
model_sk.fit(X_train, y_train)
y_pred = model_sk.predict(X_test)
accuracy_score(y_test, y_pred)

0.9

In [5]:
model_sk.centroids_

array([[5.02051282, 3.4025641 , 1.46153846, 0.24102564],
       [5.88648649, 2.76216216, 4.21621622, 1.32432432],
       [6.63863636, 2.98863636, 5.56590909, 2.03181818]])

In [11]:
class MyNearestCentroid:
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = []
        for lbl in self.labels:
            # Boolean mask selects the training rows that belong to this label
            X_lbl = X[y==lbl]
            centroid = np.mean(X_lbl, axis=0)
            self.centroids.append(centroid)


    """
        prediction with traditional nested loop
        This function has a sneaky bug that prevents it from working as expected.
        Identify & fix the bug & ping me your answers. This is first part of AML assignment 1
    """
    def predict(self, X):
        num_records = X.shape[0]
        y_pred_distances = np.empty((num_records, len(self.labels)))
        for i in np.arange(0,num_records):
            for j in range(len(self.centroids)):
                y_pred_distances[i, j] = np.linalg.norm(X[i] - self.centroids[j], axis=0)

        # Map each argmin column index back to its actual class label
        y_pred = self.labels[np.argmin(y_pred_distances, axis=1)]
        return y_pred
    
    """
        TODO: Add vectorized code to do prediction
        This is second part of AML assignment 1
    """
    def predict_vectorized(self, X):
        centroid_matrix = np.array(self.centroids)
        X_broadcasted = X[:, np.newaxis, :]
        distance = np.linalg.norm(X_broadcasted - centroid_matrix, axis = 2)
        # Map each argmin column index back to its actual class label
        y_pred = self.labels[np.argmin(distance, axis=1)]
        return y_pred

In [12]:
mymodel = MyNearestCentroid()
mymodel.fit(X_train, y_train)

print(mymodel.labels)
mymodel.centroids

[0 1 2]


[array([5.02051282, 3.4025641 , 1.46153846, 0.24102564]),
 array([5.88648649, 2.76216216, 4.21621622, 1.32432432]),
 array([6.63863636, 2.98863636, 5.56590909, 2.03181818])]

In [13]:
y_test

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0])

In [14]:
y_pred = mymodel.predict(X_test)
accuracy_score(y_test, y_pred)

0.9

In [15]:
y_pred = mymodel.predict_vectorized(X_test)
accuracy_score(y_test, y_pred)

0.9
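
On Iris the labels are 0, 1 and 2, so the column index returned by np.argmin happens to coincide with the label itself. For a dataset with other labels, the index has to be looked up in the array of unique labels. A minimal sketch, assuming hypothetical string labels and a made-up distance matrix:

```python
import numpy as np

# Hypothetical labels, in the sorted order np.unique(y) would return them
labels = np.array(["cat", "dog", "owl"])

# Made-up distances: one row per test point, one column per centroid
distances = np.array([[3.0, 1.5, 0.2],
                      [0.1, 2.0, 0.9],
                      [4.0, 0.3, 2.5],
                      [0.5, 0.7, 0.6]])

nearest = np.argmin(distances, axis=1)  # column index of the closest centroid
y_pred = labels[nearest]                # index -> actual class label

print(nearest)
print(y_pred)
```

Comparing accuracy scores alone does not show that two models agree on every point; `np.array_equal` on the two prediction arrays is a stricter check for step 4.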