KNN - Predicting whether or not a person has diabetes mellitus

Preparation of data (reading the dataset)

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # to standardize the features (zero mean, unit variance)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix # for evaluating the model
from sklearn.metrics import f1_score # for evaluating the model
from sklearn.metrics import accuracy_score # for evaluating the model

In [2]:
dataset = pd.read_csv('diabetes.csv')
print( len(dataset) )
print( dataset.head() )

768
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


Preparation of data (replacing invalid 0's with the column mean)

In [3]:
# Replace zeroes
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin'] # columns where 0 means "missing"

for column in zero_not_accepted:
    dataset[column] = dataset[column].replace(0, np.nan)    # mark every 0 in this column as missing (NaN);
                                                            # np.nan is used because np.NaN was removed in NumPy 2.0
    mean = int(dataset[column].mean(skipna=True))           # mean of the remaining values, truncated to an integer
    dataset[column] = dataset[column].replace(np.nan, mean) # fill the NaN values with that mean
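
The effect of this loop can be checked on a small made-up table. The following is a minimal sketch: the toy DataFrame and its values are invented for illustration and are not taken from diabetes.csv, and fillna is used here as an equivalent way to fill the NaN values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from diabetes.csv); 0 stands for "missing"
toy = pd.DataFrame({'Glucose': [100, 0, 151], 'BMI': [30.0, 20.5, 0.0]})

for column in ['Glucose', 'BMI']:
    toy[column] = toy[column].replace(0, np.nan)  # mark zeroes as missing
    mean = int(toy[column].mean(skipna=True))     # mean of the non-missing values, truncated
    toy[column] = toy[column].fillna(mean)        # fill the gaps with that mean

print(toy)
```

Note that int() truncates: the Glucose mean of 125.5 is inserted as 125, which is also what the loop above does on the real data.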

Preparation of data (splitting the dataset into training data and testing data)

In [4]:
# split dataset
X = dataset.iloc[:, 0:8] # all rows, columns 0 through 7 (the features)
y = dataset.iloc[:, 8] # all rows, column 8 only (the Outcome label)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2) # putting aside 20% of 
                                                                                             # dataset into the test sample

In [5]:
# Feature scaling
sc_X = StandardScaler()               # standardize each feature to zero mean
                                      # and unit variance, so that no single
                                      # feature dominates the distance calculation

X_train = sc_X.fit_transform(X_train) # fit the scaler on the training data and
                                      # replace X_train with its scaled version

X_test = sc_X.transform(X_test)       # scale X_test with the mean and variance
                                      # learned from the training data
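
What fit_transform and transform do here can be sketched by hand with NumPy. This is only an illustration with made-up numbers, not the scaler's actual implementation: the mean and standard deviation are computed from the training rows and then reused, unchanged, for the test rows.

```python
import numpy as np

# Made-up single-feature data for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

mean = train.mean(axis=0)  # learned from the training data only
std = train.std(axis=0)    # population standard deviation (ddof=0)

train_scaled = (train - mean) / std  # roughly what fit_transform does to X_train
test_scaled = (test - mean) / std    # roughly what transform does to X_test

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.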

In [6]:
import math
math.sqrt(len(y_test))

12.409673645990857

In [7]:
# Define the model: Init K-NN
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')

Why 11? A common rule of thumb is to pick K close to the square root of the number of samples. The square root of len(y_test) is 12.40967, which rounds down to 12. We don't want that because it's an even number, so we take 1 away to make it odd.
Why odd? With two classes, an odd value of K ensures that the vote among the neighbors can never end in a tie.
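
The rule of thumb can be written as a small function. This is a sketch only: choose_k is a hypothetical helper name, not part of scikit-learn, and 154 is the number of test samples here (12.40967 squared).

```python
import math

def choose_k(n_samples):
    """Hypothetical helper: an odd K near sqrt(n_samples)."""
    k = int(math.sqrt(n_samples))  # round the square root down
    if k % 2 == 0:                 # an even K can produce tied votes
        k -= 1
    return max(k, 1)               # never go below one neighbor

print(choose_k(154))
```

This is a starting point rather than a guarantee; comparing a few values of K on held-out data is a more reliable way to choose.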

In [9]:
# Fit Model
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11)

In [10]:
# Predict the test set results
y_pred = classifier.predict(X_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [11]:
# Evaluate the Model
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[94 13]
 [15 32]]


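The way you read it (rows are the true classes, columns are the predicted classes):
                     Class 0: the model correctly classified 94 (true negatives), with 13 false positives
                     Class 1: the model correctly classified 32 (true positives), with 15 false negatives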
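
To avoid misreading the cells, the four counts can be unpacked by name. A minimal sketch, assuming the matrix printed above is re-entered by hand as a NumPy array:

```python
import numpy as np

cm = np.array([[94, 13],
               [15, 32]])  # the matrix printed above

# ravel() flattens row by row: [0,0], [0,1], [1,0], [1,1]
tn, fp, fn, tp = cm.ravel()

print(f"true negatives={tn}, false positives={fp}")
print(f"false negatives={fn}, true positives={tp}")
```

The same unpacking works directly on the result of confusion_matrix(y_test, y_pred) for a two-class problem.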

In [12]:
print(f1_score(y_test, y_pred)) # harmonic mean of precision and recall; more informative than accuracy when the classes are imbalanced

0.6956521739130436


In [13]:
print(accuracy_score(y_test, y_pred)) # fraction of correct predictions; doesn't distinguish false positives from false negatives

0.8181818181818182
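
Both scores can be reproduced by hand from the confusion matrix, which shows where they come from. A short sketch using the four counts read off the matrix above:

```python
# Counts read from the confusion matrix above
tn, fp, fn, tp = 94, 13, 15, 32

precision = tp / (tp + fp)                  # 32 / 45
recall = tp / (tp + fn)                     # 32 / 47
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 126 / 154

print(round(f1, 4), round(accuracy, 4))
```

Accuracy comes out higher than F1 because the 94 true negatives count toward accuracy but play no part in precision or recall.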
