# Diabetes Prediction Using K-Nearest Neighbors (KNN)
**Author:** Giovanna Cardenas  
**Description:** This notebook implements a K-Nearest Neighbors (KNN) classification model that predicts whether a patient has diabetes from health indicators. The model is evaluated for k = 1 to 14 to find the number of neighbors that gives the highest accuracy on the test set.

In [5]:
# Import packages
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [6]:
# Load data
data = pd.read_csv('diabetes.csv')
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [7]:
# Create KNN classification model
KNN_cls = KNeighborsClassifier(n_neighbors=3)

# Select the eight feature columns and the target; passing the target as a 1-D Series
# (data['Outcome']) avoids scikit-learn's column-vector warning for y
x_data, y_data = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                       'BMI', 'DiabetesPedigreeFunction', 'Age']], data['Outcome']

# Partition the data (default split: 75% train, 25% test) and fit the model on the training set
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, random_state=17)
KNN_cls.fit(x_train, y_train)

# Measure the model's accuracy on the held-out test set
KNN_cls.score(x_test, y_test)

0.7604166666666666
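
KNN classifies by distance, so features with large numeric ranges can dominate the neighbor search (in the rows shown above, Insulin spans 0 to 168 while DiabetesPedigreeFunction spans 0.167 to 2.288). The following is a minimal sketch of standardizing features before KNN. It uses synthetic data from `make_classification` as a stand-in for `diabetes.csv`, and the names `scaled_knn` and `scaled_accuracy` are illustrative, not part of the original notebook. No claim is made about how scaling would change the accuracy reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the diabetes data (768 rows, 8 numeric features, binary target)
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=17)

# Inflate one feature's scale to mimic columns measured in very different units
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

# The pipeline learns the scaling from the training data only, then applies it before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)

scaled_accuracy = scaled_knn.score(X_test, y_test)
print(f"Accuracy with standardized features: {scaled_accuracy:.3f}")
```

Wrapping the scaler and classifier in one pipeline keeps test-set statistics out of the scaling step, since `fit` only ever sees the training rows.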

In [8]:
# Loop over several values of 'k' and record the test accuracy for each
result = []
for k in range(1, 15):
    KNN_cls_k = KNeighborsClassifier(n_neighbors=k)
    KNN_cls_k.fit(x_train, y_train)
    result.append({'k': k, 'Accuracy': KNN_cls_k.score(x_test, y_test)})

result = pd.DataFrame(result)
print(result)

     k  Accuracy
0    1  0.697917
1    2  0.729167
2    3  0.760417
3    4  0.739583
4    5  0.776042
5    6  0.744792
6    7  0.760417
7    8  0.744792
8    9  0.739583
9   10  0.729167
10  11  0.755208
11  12  0.755208
12  13  0.765625
13  14  0.755208
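
The loop above scores every k on the test set, so the same test data is used both to pick k and to report accuracy, which can make the final figure optimistic. One common alternative, sketched here under assumptions, is to choose k by cross-validation on the training set and touch the test set only once at the end. The data is a synthetic stand-in for `diabetes.csv`, and `cv_result`, `best_k`, and `test_accuracy` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for the diabetes data
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

# Mean 5-fold cross-validated accuracy on the training set for each k
cv_rows = []
for k in range(1, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_rows.append({'k': k, 'CV Accuracy': scores.mean()})
cv_result = pd.DataFrame(cv_rows)

# Pick the k with the highest cross-validated accuracy
best_k = int(cv_result.loc[cv_result['CV Accuracy'].idxmax(), 'k'])

# Refit on the full training set and evaluate once on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Selected k: {best_k}, test accuracy: {test_accuracy:.3f}")
```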


In [13]:
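# Find the row index of the highest accuracy score.
# idxmax() reports, for each column, the row index of that column's maximum, so only the
# 'Accuracy' entry is of interest; the matching k is the 'k' value stored at that row.
max_k = result.idxmax()
print(max_k)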

k           13
Accuracy     4
dtype: int64
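
In the output above, the `Accuracy` entry (4) is a row index rather than a value of k, and the `k` entry (13) only says that the largest k sits in the last row. A small sketch of reading the best k directly, using the accuracy table copied from the loop output; `best_row` and `best_k` are illustrative names.

```python
import pandas as pd

# Accuracy table copied from the loop output above
result = pd.DataFrame({
    'k': list(range(1, 15)),
    'Accuracy': [0.697917, 0.729167, 0.760417, 0.739583, 0.776042, 0.744792, 0.760417,
                 0.744792, 0.739583, 0.729167, 0.755208, 0.755208, 0.765625, 0.755208],
})

# Row index of the highest accuracy, then the k stored in that row
best_row = result['Accuracy'].idxmax()
best_k = int(result.loc[best_row, 'k'])
print(f"Best k: {best_k} (accuracy {result.loc[best_row, 'Accuracy']:.4f})")
# prints: Best k: 5 (accuracy 0.7760)
```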


In [15]:
# Refit the KNN classification model with the best-scoring number of neighbors (k=5)
KNN_cls = KNeighborsClassifier(n_neighbors=5)
x_data, y_data = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                       'BMI', 'DiabetesPedigreeFunction', 'Age']], data['Outcome']
# Same random_state as before, so the train/test partition is identical
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, random_state=17)
KNN_cls.fit(x_train, y_train)
KNN_cls.score(x_test, y_test)

0.7760416666666666

In [17]:
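# Predict the outcome of a new observation using the model.
# Values follow the training column order: Pregnancies, Glucose, BloodPressure,
# SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
prediction = KNN_cls.predict([[3, 150, 80, 22, 10, 40, 2.3, 66]])
print(f"Predicted Outcome: {prediction}")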

Predicted Outcome: [1]
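
`predict` returns only the winning class label. With the default uniform weights, `predict_proba` additionally reports the fraction of the k nearest neighbors that belong to each class, which gives a rough sense of how close the vote was. The sketch below uses a hypothetical one-feature toy dataset (not the diabetes data) so the vote can be checked by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three class-0 points on the left, three class-1 points on the right
X_toy = np.array([[0], [1], [4], [9], [10], [11]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# For x = 6 the three nearest points are 4 (class 0), 9 (class 1), and 10 (class 1)
new_point = np.array([[6]])
toy_label = toy_knn.predict(new_point)
toy_proba = toy_knn.predict_proba(new_point)

print(f"Predicted class: {toy_label[0]}")
print(f"Share of neighbors per class: {toy_proba[0]}")
```

Two of the three neighbors are class 1, so the model predicts class 1 with a neighbor share of 2/3.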


In [19]:
# This observation is predicted to be class 1, which suggests the patient likely has diabetes.