# K-NEAREST NEIGHBORS

* Supervised machine learning algorithm used for both classification and regression tasks
* Non-parametric: makes no assumptions about the underlying data distribution
* Predicts from the `k` training points closest to the query point, as measured by a distance metric
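
To make the idea concrete, here is a minimal sketch of a KNN classifier in plain Python. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are made up for illustration and are not part of the notebook's diabetes workflow.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's distance to the query with its label
    distances = [(dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labelled 0 and 1
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(2, 2), k=3))  # → 0
print(knn_predict(points, labels, query=(7, 9), k=3))  # → 1
```

There is no training step beyond storing the data: all the work happens at prediction time, when the distances to the query are computed.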
  

## Distance Metrics


### Manhattan distance

Measures the distance between two points when travel is restricted to the axes of a grid: the sum of the absolute differences of their coordinates.

* Works with continuous (numeric) features
* Always non-negative

Manhattan distance = `|x1 - x2| + |y1 - y2| + |z1 - z2|`


_Example_

In [171]:
# Locations of two points C and D
C = (1, 7, 12)
D = (-1, 0, -5)

manhattan_distance = 0

# Use a for loop to iterate over each element
for i in range(3):
    manhattan_distance += abs(C[i] - D[i])

manhattan_distance

26

### Euclidean distance

The most common distance in mathematics: the straight-line distance between two points. It follows from the _`Pythagorean Theorem`_: take the square root of the sum of the squared differences between corresponding coordinates of the two points.

Formula

Euclidean distance = `sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 + ...)`

_Example_

In [172]:
from math import sqrt

# Locations of two points C and D
C = (1, 7, 12)
D = (-1, 0, -5)

euclidean_distance = 0

# Use a for loop to iterate over each element
for i in range(3):
    euclidean_distance += ((C[i] - D[i])**2)
    
# Square root of the final result
euclidean_distance = sqrt(euclidean_distance)

euclidean_distance

18.49324200890693

### Minkowski distance

The Minkowski distance is a generalization of both the Manhattan and Euclidean distances: each is a special case obtained by choosing the exponent `p`.

Formula

Minkowski distance = `(abs(x1 - x2)^p + abs(y1 - y2)^p + abs(z1 - z2)^p + ... )^(1/p)`

When `p = 1`, the Minkowski distance reduces to the Manhattan distance, and when `p = 2`, it becomes the Euclidean distance.
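
As a quick check of the two special cases, the formula can be sketched as a small function and applied to the same points used in the Manhattan and Euclidean examples. The name `minkowski_distance` is chosen here for illustration.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Same points as in the Manhattan and Euclidean examples
C = (1, 7, 12)
D = (-1, 0, -5)

print(minkowski_distance(C, D, p=1))  # → 26.0 (Manhattan)
print(minkowski_distance(C, D, p=2))  # Euclidean, approximately 18.493
```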

## KNN- Predict Diabetes

The dataset contains records of patients who were or were not diagnosed with diabetes; the `Outcome` column holds the diagnosis.

Importing the necessary libraries:


In [181]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
 



In [159]:
# loading the dataset

df = pd.read_csv('diabetes.csv')

df.head()

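| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |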


In [160]:
# checking for the shape

df.shape

(768, 9)

Several columns contain zeros where zero is not a plausible measurement (for example `Glucose` or `BMI`). These zeros stand in for missing values and would distort the distances KNN relies on, so we replace them with the mean of the respective column.

In [161]:
zero_values = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for column in zero_values:
    # replacing the 0 with NaN so it is ignored when computing the mean
    df[column] = df[column].replace(0, np.nan)

    # filling it with the mean of that column (truncated to an integer)
    mean_value = int(df[column].mean())
    df[column] = df[column].fillna(mean_value)
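
To see what this loop does to a single column, here is a small sketch without pandas. The helper `impute_zeros` and the sample values are hypothetical; it assumes the column has at least one non-zero entry and mirrors the integer truncation of the mean used above.

```python
def impute_zeros(values):
    """Replace zeros with the integer-truncated mean of the non-zero values."""
    non_zero = [v for v in values if v != 0]
    fill_value = int(sum(non_zero) / len(non_zero))
    return [fill_value if v == 0 else v for v in values]


# Hypothetical column with two missing readings recorded as 0
skin_thickness = [35, 29, 0, 23, 35, 0]

print(impute_zeros(skin_thickness))  # → [35, 29, 30, 23, 35, 30]
```

The mean is taken over the non-zero values only (122 / 4 = 30.5, truncated to 30), which is why the zeros are set aside before averaging.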

In [162]:
# setting x and y

X = df.drop('Outcome', axis = 1)
y = df.Outcome

In [163]:
# doing a train and test split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

In [164]:
# Feature scaling
scaler = StandardScaler()

# fit on the training data only, then apply the same transformation to both sets

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# "fit" computes and stores the per-feature mean and standard deviation of the training data;
# "transform" applies the scaling using those stored parameters.
# The test data is transformed with the scaler fitted on the training data, so both sets are
# scaled consistently and no information from the test set leaks into training.
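
The fit/transform split can be sketched by hand for a single feature. This is an illustration only, not scikit-learn's implementation: the helper names and data are hypothetical, and it assumes the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values using previously learned parameters."""
    return [(value - mean) / std for value in column]


# Hypothetical single-feature train/test split
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)               # statistics come from the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)    # reuse the training mean and std

print(mean, std)
print(test_scaled)
```

After scaling, the training values have mean 0, while the test values generally do not: they are expressed relative to the training distribution, which is the intended behaviour.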

### Choosing the value of k
* A common rule of thumb is `k ≈ sqrt(n)`, where `n` is the number of data points (the cell below uses the size of the test set)
* An odd `k` is preferred so that a vote between the two classes cannot end in a tie
* There is no definitive "correct" way to select `k`; this is just one of several common guidelines

In [165]:
len(y_test) ** 0.5

# about 12.4: round down to 12, which is even, then subtract 1 to get an odd value, k = 11

12.409673645990857
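
The rule of thumb above can be wrapped in a small helper. This is one possible reading of the heuristic (round down, then step down to an odd number); the function name `odd_k_from_sqrt` is hypothetical, and 154 is the value whose square root is shown above.

```python
from math import sqrt


def odd_k_from_sqrt(n):
    """Round sqrt(n) down to an integer and step down to the nearest odd value."""
    k = int(sqrt(n))
    if k % 2 == 0:
        k -= 1
    return max(k, 1)  # never return less than one neighbour


print(odd_k_from_sqrt(154))  # → 11
```

Stepping up to 13 instead would be an equally defensible choice; the heuristic only gives a starting point.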

In [166]:
# defining our model
# p=2 is the Minkowski exponent; with metric='euclidean' the distance is Euclidean either way

classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')

# fitting the model

classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11)

In [167]:
y_pred = classifier.predict(X_test)

### Evaluating the Model

* Confusion Matrix

In [194]:
cm = confusion_matrix(y_test, y_pred)

print(cm, '\n')

# sklearn metrics take the true labels first: metric(y_true, y_pred)
print('recall_score:', round(recall_score(y_test, y_pred), 3))
print('f1_score:', round(f1_score(y_test, y_pred), 3))
print('accuracy_score:', round(accuracy_score(y_test, y_pred), 3))
print('precision_score:', round(precision_score(y_test, y_pred), 3))

[[94 13]
 [15 32]] 

recall_score: 0.681
f1_score: 0.696
accuracy_score: 0.818
precision_score: 0.711
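
The same metrics can be derived by hand from the confusion matrix, which is a useful sanity check on the argument order. The sketch below assumes scikit-learn's layout for `confusion_matrix(y_test, y_pred)`: rows are true labels, columns are predicted labels, and class 1 is the positive class.

```python
# Confusion matrix from the cell above; rows are true labels, columns are predictions
cm = [[94, 13],
      [15, 32]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.818 0.711 0.681 0.696
```

Swapping `y_test` and `y_pred` in the metric calls transposes the matrix, which exchanges precision and recall while leaving accuracy and F1 unchanged.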
