<h1>K-NN Classification</h1>


K-NN classification is a supervised learning method used for classification tasks.

K-NN classification uses the proximity of data points to group items that possess similar characteristics. A value k is chosen that dictates how many nearest neighbors are considered when predicting a new data point: the new point is assigned the class held by the majority of its k nearest neighbors.
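
The voting idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `knn_predict` and the toy points are made up, distance is Euclidean, and it is not how scikit-learn implements the classifier internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and take a majority vote on their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two clusters labelled 0 and 1
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # 0
print(knn_predict(points, labels, (6.0, 6.5), k=3))  # 1
```

An odd k is convenient with two classes because the vote cannot end in a tie.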

In [90]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

# Assign data values held in "diabetes.csv" to data variable
data = pd.read_csv("diabetes.csv")
print(data.head()) # Print first five rows of "data"

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [91]:
# Columns in which a zero is treated as a missing measurement
replace_zero = ['Glucose', 'BloodPressure', 'Age', 'BMI', 'Insulin']

# Loop through these columns and replace zeros with the column mean (truncated to an integer)
for column in replace_zero:
    data[column] = data[column].replace(0, np.nan)
    mean = int(data[column].mean(skipna=True))
    data[column] = data[column].fillna(mean)

# Inspect one cleaned column; `mean` holds the value for the last column processed (Insulin)
print(data['BloodPressure'])
print(mean)

0      72.0
1      66.0
2      64.0
3      66.0
4      40.0
       ... 
763    76.0
764    70.0
765    72.0
766    60.0
767    70.0
Name: BloodPressure, Length: 768, dtype: float64
155
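
The replacement step in the cell above can be illustrated without pandas. The helper name `replace_zeros_with_mean` is hypothetical, and the input is just the first five Insulin values printed earlier, so the mean here differs from the full-column value of 155.

```python
def replace_zeros_with_mean(values):
    """Replace zeros with the truncated mean of the non-zero entries."""
    non_zero = [v for v in values if v != 0]
    if not non_zero:
        return list(values)  # nothing to estimate a mean from
    mean = int(sum(non_zero) / len(non_zero))  # int() truncates, as in the cell above
    return [mean if v == 0 else v for v in values]

insulin = [0, 0, 0, 94, 168]  # first five Insulin values shown earlier
print(replace_zeros_with_mean(insulin))  # [131, 131, 131, 94, 168]
```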


In [92]:
# Split the dataset into a training set (used to fit the model) and a test set (used for validation)
x = data.iloc[:, 0:8] # All rows, columns 0-7: the eight feature columns
y = data.iloc[:, 8] # All rows, column 8 only: Outcome, showing whether or not a person has diabetes
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.2)

In [93]:
# Attribute scaling
sc_x = StandardScaler() # Standardizes each feature to zero mean and unit variance
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
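
As a rough illustration of what standardization does to a single column, the sketch below subtracts the mean and divides by the population standard deviation (the same convention `StandardScaler` uses). The helper name `standardize` is an assumption, and the input is only the first five Glucose values printed earlier. Note that the scaled values are not confined to the range -1 to 1.

```python
import statistics

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (divides by n)
    return [(v - mean) / std for v in column]

glucose = [148, 85, 183, 89, 137]  # first five Glucose values shown earlier
scaled = standardize(glucose)

print([round(v, 2) for v in scaled])
print(min(scaled) < -1, max(scaled) > 1)  # some values fall outside [-1, 1]
```

In the notebook the scaler is fitted on `x_train` only and then applied to `x_test`, so the test data does not influence the mean and standard deviation used for scaling.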

In [94]:
# Determine the hyperparameter value (the number of nearest neighbors to consider)
import math
print(len(y_test))
math.sqrt(len(y_test))

154


12.409673645990857
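
One possible way to turn this square-root rule of thumb into a value of k is to round down to the nearest odd integer. The function name `odd_k_from_sqrt` is hypothetical, and the notebook does not state why 11 was preferred over 13, so treat this as one reading of the heuristic rather than the method actually used.

```python
import math

def odd_k_from_sqrt(n_samples):
    """Largest odd integer not exceeding sqrt(n_samples), with a minimum of 1."""
    k = math.isqrt(n_samples)  # floor of the square root
    if k % 2 == 0:
        k -= 1  # step down to an odd number so a two-class vote cannot tie
    return max(k, 1)

print(odd_k_from_sqrt(154))  # 11
```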

In [95]:
# Define the model: initialise K-NN with k = 11, an odd value close to sqrt(154) ≈ 12.4
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier.fit(x_train, y_train)

In [96]:
# Predict whether or not someone has diabetes: 1 means the person has it, 0 means the person does not
y_pred = classifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [97]:
# Evaluate the model
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f1_score(y_test, y_pred))

[[89 18]
 [16 31]]
0.6458333333333333


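## Interpretation of results
[[89, 18], [16, 31]]

scikit-learn's `confusion_matrix` puts the actual classes in the rows and the predicted classes in the columns, ordered by label (0 = negative first, then 1 = positive):

<table>
<tr>
<td> </td>
<td> Predicted negative </td>
<td> Predicted positive </td>
</tr>
<tr>
<td> Actual negative </td>
<td> 89</td>
<td> 18</td>
</tr>
<tr>
<td> Actual positive </td>
<td> 16</td>
<td> 31</td>
</tr>
</table>

TN - 89  
FP - 18  
FN - 16  
TP - 31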
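
To make the layout of the matrix concrete, here is a small sketch that counts the four cells by hand for binary labels, with 1 as the positive class. The helper name `confusion_counts` and the label lists are made up for illustration. The result is arranged as [[TN, FP], [FN, TP]], the same layout scikit-learn's `confusion_matrix` uses for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```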

**Accuracy:** overall correctness of the model's predictions 

formula = (TP + TN) / (TP + FP + TN + FN).  
model accuracy = (31 + 89) / (31 + 18 + 89 + 16) = 120 / 154 ≈ 0.7792 or 77.92%.


**Precision:** measures the accuracy of the positive predictions

formula = TP / (TP + FP).  
model precision = 31 / (31 + 18) ≈ 0.6327 or 63.27%.


**Recall** (Sensitivity): measures the ability of the model to correctly identify positive instances 

formula = TP / (TP + FN).  
model recall = 31 / (31 + 16) ≈ 0.6596 or 65.96%.


**F1-Score**: the harmonic mean of precision and recall 

formula = 2 * (Precision * Recall) / (Precision + Recall).  
model F1-Score = 2 * (0.6327 * 0.6596) / (0.6327 + 0.6596) ≈ 0.646 or 64.6%, in line with the 0.6458 printed by `f1_score`.
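
The four formulas can be checked together with a short sketch. It takes the counts from the printed confusion matrix, read in scikit-learn's [[TN, FP], [FN, TP]] layout; the function name `classification_metrics` is an assumption, and the code assumes there is at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts read from the printed matrix [[89, 18], [16, 31]]
accuracy, precision, recall, f1 = classification_metrics(tn=89, fp=18, fn=16, tp=31)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.7792 precision=0.6327 recall=0.6596 f1=0.6458
```

The F1 value agrees with the 0.6458 that `f1_score(y_test, y_pred)` printed in the evaluation cell.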

From this interpretation, the model's accuracy (about 78%) is reasonable, but its precision, recall and F1-score (all roughly 63-66%) are weaker, so it is less reliable at identifying the positive (diabetic) cases.