# Nearest Neighbour Algorithm (NNA)

NNA is a supervised learning algorithm that assigns a new data point to a target class based on its nearest neighbours in feature space. It computes the distance from each test observation to every observation in the training dataset and then picks the nearest neighbours. Repeating this for every test observation is how the algorithm finds similarities in the data. The algorithm is non-parametric because it does not assume any particular distribution or functional relationship between the data points. NNA is also considered a lazy algorithm compared to other machine learning algorithms, because it has no explicit training step: all of the computation is deferred until prediction time.
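To make the idea concrete, here is a minimal sketch of a 1-nearest-neighbour classifier in plain Python. The toy data and the helper name `nearest_neighbour_label` are assumptions made for illustration only; they are not part of the notebook below, which works on the sonar dataset with NumPy.

```python
import math

# Hypothetical toy training set: (features, label) pairs
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

def nearest_neighbour_label(query, training_data):
    """Return the label of the training point closest to `query`.

    Uses the Euclidean distance (math.dist, Python 3.8+).
    """
    _, closest_label = min(
        training_data, key=lambda pair: math.dist(pair[0], query)
    )
    return closest_label

print(nearest_neighbour_label((1.2, 1.4), train))  # A
print(nearest_neighbour_label((5.5, 5.0), train))  # B
```

There is no fitting step here: the "model" is just the stored training data, which is what makes the algorithm lazy.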

## Calculating Distance 

Distance measurements are a key component of NNA. The distance function determines which training point counts as the nearest neighbour, so different distance functions can produce different classifiers and different performance.
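As a small illustration of that last point, the sketch below uses two common distance functions (Manhattan and Euclidean, both defined later in this section) on made-up points. The points and helper names are assumptions chosen so that the two distances disagree about which candidate is nearest.

```python
def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

query = (0, 0)
candidates = {"A": (3, 3), "B": (0, 5)}  # hypothetical points

def nearest(distance):
    """Name of the candidate closest to `query` under `distance`."""
    return min(candidates, key=lambda name: distance(candidates[name], query))

print(nearest(manhattan))  # B  (A is 6 away, B is 5 away)
print(nearest(euclidean))  # A  (A is about 4.24 away, B is 5 away)
```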

**Minkowski Distance**

Minkowski Distance is the generalized form of Euclidean and Manhattan distance.

$d(i,j) = \sqrt[p]{|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \dots + |x_{il} - x_{jl}|^p}$

where $i = (x_{i1}, x_{i2}, \dots, x_{il})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jl})$ are two $l$-dimensional data objects, and $p$ is the order of the distance.

Minkowski distance is typically used with p being 1 or 2.

If p = 2, it is the Euclidean distance.

If p = 1, it is the Manhattan distance.
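A short worked example may help here. The sketch below evaluates the Minkowski formula for the two points (0, 0) and (3, 4); the function name `minkowski` is a stand-in for illustration and is separate from the `minkowski_distance` function defined in the notebook cells.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

i = (0, 0)
j = (3, 4)

print(minkowski(i, j, p=1))  # 7.0, Manhattan: |3| + |4|
print(minkowski(i, j, p=2))  # 5.0, Euclidean: sqrt(9 + 16)
```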



## Table of Contents
**PART A:**
  * Importing Data
  * Calculating Nearest Neighbour
  * Manhattan Distance (P=1): Accuracy, Sensitivity and Specificity
  * Euclidean Distance (P=2): Accuracy, Sensitivity and Specificity

**PART B:**
  * Minkowski Distance (P=15): Accuracy, Sensitivity and Specificity


# PART A

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# import train and test datasets
X_train = pd.read_csv("/Users/goundosidibe/Downloads/sonar_train.csv")
X_test = pd.read_csv("/Users/goundosidibe/Downloads/sonar_test.csv")

In [3]:
#Converting the target values to numbers
X_train['Class'] = X_train['Class'].map(lambda x : 0 if x=='M' else 1)

X_test['Class'] = X_test['Class'].map(lambda x : 0 if x=='M' else 1)

In [4]:
# training target variable
y_train = X_train.Class
# removing the target variable from the dataset
X_train = X_train[X_train.columns[:-1]]

In [5]:
#test target variable
y_test = X_test.Class
# removing the target variable from the dataset
X_test = X_test[X_test.columns[:-1]]

In [6]:
print("X_train: " ,X_train.shape, "\n")
print("y_train: " ,y_train.shape, "\n")
print("X_test: " ,X_test.shape, "\n")
print("y_test: " ,y_test.shape)

X_train:  (139, 60) 

y_train:  (139,) 

X_test:  (69, 60) 

y_test:  (69,)


In [7]:
X_test = np.array(X_test)
X_train = np.array(X_train)
y_train = np.array(y_train)
y_test = np.array(y_test)

In [8]:
# The minkowski_distance function takes two data points (in1 and in2) and a Minkowski power parameter p.
# Adjusting p gives the Manhattan distance (p=1) or the Euclidean distance (p=2).

def minkowski_distance(in1, in2, p=1):
    # Set initial distance to 0
    dist = 0
    # Accumulate |in1[j] - in2[j]|**p over all features
    for j in range(len(in1)):
        dist = dist + abs(in1[j] - in2[j])**p
    # Take the p-th root of the sum
    dist = dist**(1/p)
    return dist
 

In [9]:

# The predict function takes the training data, the training labels, the data to classify, k, and p.
# It returns an array of label predictions containing only 0's and 1's.

def predict(x_train, y, x_input, k, p):
    op_labels = []

    # Loop through the data points to be classified
    for item in x_input:

        # List to store the distance from item to each training point
        point_dist = []

        # Loop through each training data point
        for j in range(len(x_train)):
            # Calculate the distance, passing p through to the distance function
            distances = minkowski_distance(np.array(x_train[j, :]), item, p)
            point_dist.append(distances)
        point_dist = np.array(point_dist)

        # Sort the distances while preserving the index,
        # keeping the indices of the k nearest data points
        dist = np.argsort(point_dist)[:k]
        k_labels = [y[n] for n in dist]
        # Count neighbour labels and take the label with the highest count
        labels_counts = np.bincount(k_labels)
        op_labels.append(np.argmax(labels_counts))
    return np.array(op_labels)
         

l = predict(X_train, y_train, X_test, k = 1 , p=1)

def calculateAccuracy(y_test, Y_pred):
    # Returns the fraction of predictions that match the target values
    accuracy = len(y_test[np.where(y_test == Y_pred)]) / len(Y_pred)
    return accuracy
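With k = 1 the vote is trivial, but for larger k the neighbour labels have to be combined into a single prediction. The following is a rough sketch of that majority-vote step in plain Python. The helper name `majority_label` is hypothetical and not part of the notebook, and it assumes the labels are listed nearest-first.

```python
from collections import Counter

def majority_label(neighbour_labels):
    """Return the most common label among the k nearest neighbours.

    On a tie, Counter.most_common keeps first-seen order (Python 3.7+),
    so the label of the nearer neighbour wins.
    """
    return Counter(neighbour_labels).most_common(1)[0][0]

print(majority_label([1, 0, 1]))  # 1
print(majority_label([0, 1]))     # 0 (tie, nearer neighbour wins)
```

By comparison, `np.argmax` over `np.bincount` resolves ties in favour of the smaller label value. Either behaviour is a design choice, and picking an odd k avoids ties altogether in a two-class problem.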



In [10]:
print("predicted values: ", np.array(l), "\n")

print("target values:", y_test, "\n")

print("accuracy:", calculateAccuracy(y_test, l))


predicted values:  [1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 1] 

target values: [1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1] 

accuracy: 0.8840579710144928


# Manhattan Distance (P=1)

In [11]:

y_ =  predict(X_train, y_train, X_test, k = 1 , p=1)




def classification_metrics(y_test, y_):
    '''
    True positive  - actual = 1, predicted = 1
    False positive - actual = 0, predicted = 1
    False negative - actual = 1, predicted = 0
    True negative  - actual = 0, predicted = 0
    '''
    tp = sum((y_test == 1) & (y_ == 1))
    tn = sum((y_test == 0) & (y_ == 0))
    fn = sum((y_test == 1) & (y_ == 0))
    fp = sum((y_test == 0) & (y_ == 1))
    return tp, tn, fp, fn


# tp_cl = true positive count, tn_cl = true negative count,
# fp_cl = false positive count, fn_cl = false negative count
    
tp_cl, tn_cl, fp_cl, fn_cl = classification_metrics(y_test, y_)



def accuracy(tp, tn, fn, fp):
    
    '''Accuracy = (TP+TN)/(TP+FP+FN+TN)'''
    
    return ((tp + tn) * 100)/ float( tp + tn + fn + fp)



def sensitivity(tp, fn):
    
    '''sensitivity = TP/(TP+FN) '''
    
    return (tp  * 100)/ float( tp + fn)


def specification(tn, fp):
    
    '''specificity = TN/(TN+FP)'''
    
    return (tn  * 100)/ float( tn + fp)



print("predicted values: ", y_, "\n")
print("target values: ", (y_test),  "\n")
print("accuracy: %.2f" %(accuracy(tp_cl, tn_cl, fn_cl, fp_cl)),"%", "\n")
print("sensitivity: %.2f" %sensitivity(tp_cl, fn_cl),"%", "\n")
print("specificity: %.2f" %specification(tn_cl, fp_cl),"%", "\n")


predicted values:  [1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 1] 

target values:  [1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1] 

accuracy: 88.41 % 

sensitivity: 81.25 % 

specificity: 94.59 % 
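The three percentages can be checked by hand from the confusion-matrix counts. The sketch below is only a sanity check: the counts are not printed by the notebook but were derived by comparing the predicted and target arrays above position by position (32 actual positives, 37 actual negatives).

```python
# Counts derived from the p=1 predicted and target arrays printed above
tp, tn, fp, fn = 26, 35, 2, 6

accuracy_pct = 100 * (tp + tn) / (tp + tn + fp + fn)
sensitivity_pct = 100 * tp / (tp + fn)   # true positive rate
specificity_pct = 100 * tn / (tn + fp)   # true negative rate

print(f"accuracy: {accuracy_pct:.2f} %")        # accuracy: 88.41 %
print(f"sensitivity: {sensitivity_pct:.2f} %")  # sensitivity: 81.25 %
print(f"specificity: {specificity_pct:.2f} %")  # specificity: 94.59 %
```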



# Euclidean Distance (P=2)

In [12]:
def minkowski_distance(in1, in2, p=2):
    # Set initial distance to 0
    dist = 0
    # Accumulate |in1[j] - in2[j]|**p over all features
    for j in range(len(in1)):
        dist = dist + abs(in1[j] - in2[j])**p
    # Take the p-th root of the sum
    dist = dist**(1/p)
    return dist


m =  predict(X_train, y_train, X_test, k = 1 , p=2)

    
tp_cl, tn_cl, fp_cl, fn_cl = classification_metrics(y_test, m)

print("predicted values: ", np.array(m), "\n")
print("target values: ", (y_test), "\n")
print("accuracy: %.2f" %(accuracy(tp_cl, tn_cl, fn_cl, fp_cl)),"%", "\n")
print("sensitivity: %.2f" %sensitivity(tp_cl, fn_cl),"%", "\n")
print("specificity: %.2f" %specification(tn_cl, fp_cl),"%","\n")



predicted values:  [1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0] 

target values:  [1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1] 

accuracy: 89.86 % 

sensitivity: 81.25 % 

specificity: 97.30 % 



# PART B

In [13]:
def minkowski_distance(in1, in2, p=15):
    # Set initial distance to 0
    dist = 0
    # Accumulate |in1[j] - in2[j]|**p over all features
    for j in range(len(in1)):
        dist = dist + abs(in1[j] - in2[j])**p
    # Take the p-th root of the sum
    dist = dist**(1/p)
    return dist
 

r =  predict(X_train, y_train, X_test, k = 1 , p=15)

# Recompute the confusion-matrix counts for the p=15 predictions
tp_cl, tn_cl, fp_cl, fn_cl = classification_metrics(y_test, r)

print("predicted values: ", np.array(r), "\n")
print("target values: ", (y_test),"\n")
print("accuracy: %.2f" %(accuracy(tp_cl, tn_cl, fn_cl, fp_cl)),"%","\n")
print("sensitivity: %.2f" %sensitivity(tp_cl, fn_cl),"%","\n")
print("specificity: %.2f" %specification(tn_cl, fp_cl),"%","\n")


predicted values:  [1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1
 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0] 

target values:  [1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1] 

accuracy: 78.26 % 

sensitivity: 68.75 % 

specificity: 86.49 % 
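One way to build intuition for a large order such as p = 15 is that the Minkowski distance approaches the Chebyshev distance (the largest single coordinate difference) as p grows, so a few features with large differences dominate the result. The sketch below illustrates this on made-up points; the function names and data are assumptions for illustration, not part of the notebook.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Largest absolute coordinate difference (the limit as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))

a = (0.0, 0.0, 0.0)  # hypothetical points
b = (1.0, 2.0, 3.0)

for p in (1, 2, 15):
    print(p, round(minkowski(a, b, p), 4))
print("chebyshev", chebyshev(a, b))
```

For these points the distance shrinks from 6.0 at p = 1 towards the Chebyshev value of 3.0, and at p = 15 it is already within a fraction of a percent of it.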

