# Lab #4
## KNN design and implementation

As you might remember from Lab. 1, the Iris dataset collects the measurements of different Iris flowers, and each data point is associated with a Iris species (Setosa, Versicolor, or Virginica). In this exercise, you will implement your own version of the the K-Nearest Neighbors algorithm, and you will use it to assign a Iris species (i.e. a label) to flowers whose species is unknown.

The KNN algorithm is straightforward. Suppose that some measurements (i.e. records) and their relative species are known in advance. Then, whenever we want to label a new flower, we look at the K most similar points (a.k.a. neighbors) and assign a label accordingly. The simplest solution is using a
majority voting scheme: if the majority of the neighbors votes for a label, we will go for it. This approach is naive only at first sight: the local similarity assumed by KNN happens to be roughly true. Even though this reasoning does not generalize well1, the KNN provides a valid baseline for your tasks.

In [1]:
import pandas as pd
import numpy as np
import math

**1.** Loading the Iris dataset.

In [2]:
cols = ['sepalLength','sepalWidth','petalLength','petalWidth','class']
df = pd.read_csv("iris.data",names = cols , header = None) 
df

Unnamed: 0,sepalLength,sepalWidth,petalLength,petalWidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


**2.** Let’s identify a portion of our data for which we will try to guess the species. Randomly select 20% of the records and store the first four columns (i.e. the features representing each flower) into a two-dimensional numpy array of shape N × C, you can call it X_test.

For the same records, store the last column (i.e. the one with the species values) into another array, namely y_test. This is the data that will be used to test the accuracy of your KNN implementation and its correct functioning (i.e. the testing data).

In [3]:
sampleDf = df.sample(frac = 0.2)
X_test = sampleDf.iloc[:,:-1]
y_test = sampleDf.iloc[:,-1]
print(X_test)
print(y_test)

     sepalLength  sepalWidth  petalLength  petalWidth
56           6.3         3.3          4.7         1.6
38           4.4         3.0          1.3         0.2
83           6.0         2.7          5.1         1.6
87           6.3         2.3          4.4         1.3
130          7.4         2.8          6.1         1.9
141          6.9         3.1          5.1         2.3
43           5.0         3.5          1.6         0.6
70           5.9         3.2          4.8         1.8
32           5.2         4.1          1.5         0.1
33           5.5         4.2          1.4         0.2
20           5.4         3.4          1.7         0.2
79           5.7         2.6          3.5         1.0
76           6.8         2.8          4.8         1.4
144          6.7         3.3          5.7         2.5
67           5.8         2.7          4.1         1.0
42           4.4         3.2          1.3         0.2
73           6.1         2.8          4.7         1.2
106          4.9         2.5

**3.** Store the remaining 80% of the records in the same way. In this case, use the names X_train and y_train for the arrays. This is the data that your model will use as ground-truth knowledge (i.e. the training data).

In [4]:
X_train = df.loc[~df.index.isin(X_test.index)].iloc[:,:-1]
y_train = df.loc[~df.index.isin(y_test.index)].iloc[:,-1]

In [5]:
#pd.concat([X_test, y_test],axis = 1)

**4.** Focus now on the KNN technique. Starting from the next laboratory, you will use the scikit-learn package. Many of its functionalities
are exposed via an object-oriented interface. With this paradigm in mind, implement now the KNN
algorithm and expose it as a Python class.

**5.** Your implementation must support three different distance definitions:
* Euclidean Distance (distance_metric = "euclidean")
* Cosine Distance (distance_metric = "cosine")
* Manhattan Distance (distance_metric = "manhattan")

**6.** Implement the predict method. The function receives as input a numpy array with N rows and C columns, corresponding to N flowers. The method assigns one of the three Iris species to each row using the KNN algorithm, and returns them as a numpy array. For the actual implementation, apply the identify K neighbors using the distance specified by the parameters k and distance passed to the constructor.
Then, assign the label using a majority voting scheme

In [45]:
#input: p, q are two np.arrays of the same lenght
#output: the Euclidean distnace as a float

def EuclideanDist(p,q):
    if len(p) != len(q):
        print('Error: input arrays have different lenghts')
    else:
        d = p - q
        d = np.square(d)
        return np.round(math.sqrt(sum(d)), decimals = 10)

In [46]:
#input: p, q are two np.arrays of the same lenght
#output: the cosine distnace as a float

def CosineDist(p,q):
    if len(p) != len(q):
        print('Error: input arrays have different lenghts')
    else:
        def cs(p,q):
            pSq = [x**2 for x in p]
            qSq = [x**2 for x in q]
            numToBeSummed = [x*y for x,y in zip(p,q)] 
            return sum(numToBeSummed)/(math.sqrt(sum(pSq))*math.sqrt(sum(qSq)))
            
        return np.round(1 - abs(cs(p,q)), decimals = 10)

In [47]:
#input: p, q are two np.arrays of the same lenght
#output: the Manhattan distnace as a float

def ManhattanDist(p,q):
    if len(p) != len(q):
        print('Error: input arrays have different lenghts')
    else:
        return sum(abs(p-q))

In [65]:
class KNearestNeighbors:
    #distance_metric can be "euclidean" or "cosine" or "manhattan"
    def __init__(self, k, distance_metric="euclidean"):
        self.k = k
        self.distance_metric = distance_metric
        
    def fit(self, X, y):    
        """
        Store the 'prior knowledge' of you model that will be used
        to predict new labels.
        :param X : input data points, ndarray, shape = (R,C).
        :param y : input labels, ndarray, shape = (R,).
        """
        #self.model = pd.concat([X_test, y_test],axis = 1)
        self.X_train = X
        self.y_train = y
        
    def predict(self, X):
        """Run the KNN classification on X.
        :param X: input data points, ndarray, shape = (N,C).
        :return: labels : ndarray, shape = (N,).
        """
        self.X = X
        
        #Implement the function to apply to each row
        def singlePrediction(row):
            distType={'euclidean':'Euclidean distance',
                      'cosine':'Cosine distance',
                     'manhattan':'Manhattan distance'}
            sampleRow = row # se non funziona prova a trasformarlo in lista
            rowOutputDf = pd.DataFrame()
            if self.distance_metric == "euclidean":
                rowOutputDf['Euclidean distance'] = self.X_train.apply(lambda row : EuclideanDist(row,sampleRow),axis = 1)
            
            elif self.distance_metric == "cosine":
                rowOutputDf['Cosine distance'] = self.X_train.apply(lambda row : CosineDist(row,sampleRow),axis = 1)
                
            elif self.distance_metric == "manhattan":
                rowOutputDf['Manhattan distance'] = self.X_train.apply(lambda row : ManhattanDist(row,sampleRow),axis = 1)
                
            #print(rowOutputDf['EucliDist'])
            rowOutputDf['label'] = self.y_train
            #print(rowOutputDf['label'])
            neighbors = rowOutputDf.nsmallest(self.k+1, distType[self.distance_metric]).loc[:,'label']
            
            #return the most frequent element in a list
            prediction = max(set(neighbors.tolist()), key=neighbors.tolist().count)
            #print(prediction)
            return prediction
    
        self.y = pd.DataFrame()
        self.y['label'] = self.X.apply(lambda row : singlePrediction(row), axis = 1)
        return self.y['label']

In [80]:
knn = KNearestNeighbors(5,distance_metric="manhattan")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test).to_numpy()
y_test = y_test.to_numpy()
print(y_pred)
print()
print(y_test)

comparison = []
for el1,el2 in zip (y_pred, y_test):
    print(f'{el1}, {el2}')
    if el1 == el2:
        comparison.append(True)
    else:
        comparison.append(False)
#print(comparison)

AttributeError: 'function' object has no attribute 'to_numpy'

**ToDo**: Consegna 7

In [72]:
# Importing Libraries
import pandas as pd
import numpy as np

# data's stored in dictionary
details = {
	'Column1': [1, 2, 30, 4],
	'Column2': [7, 4, 25, 9],
	'Column3': [3, 8, 10, 30]
}

# creating a Dataframe object
df = pd.DataFrame(details)

# Where method to compare the values
# The values were stored in the new column
df['new'] = np.where((df['Column1'] <= df['Column2']) & (
	df['Column1'] <= df['Column3']), df['Column1'], np.nan)

# printing the dataframe
print(df)


   Column1  Column2  Column3  new
0        1        7        3  1.0
1        2        4        8  2.0
2       30       25       10  NaN
3        4        9       30  4.0
