## Disease Risk Prediction using Machine Learning

This notebook develops a binary classification model that predicts disease risk from health indicators such as glucose level, blood pressure, BMI, and age.

We will compare three algorithms:
- Logistic Regression
- Decision Trees
- K-Nearest Neighbors (KNN)

We will also tune the KNN hyperparameter `n_neighbors` with a cross-validated grid search, and discuss the results in terms of overfitting and the bias-variance tradeoff.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [4]:
df = pd.read_csv('diabetes.csv')
df.head()

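| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |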


- Features include glucose, blood pressure, BMI, age, etc.
- The target variable is `Outcome` (1 = disease, 0 = no disease).
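The preview above contains zeros in `SkinThickness` and `Insulin`. A zero is not reported by `isnull()`, so it can be worth counting zeros per column before modelling. The sketch below rebuilds only the five preview rows as a stand-in for the full file. Treating zeros in these columns as suspect is an assumption here, not something the notebook states, and `suspect_cols` is a hypothetical name.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() above, not the full dataset.
preview = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Columns where a zero reading is assumed to be implausible (hypothetical list).
suspect_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per suspect column; isnull() would report none of these.
zero_counts = (preview[suspect_cols] == 0).sum()
print(zero_counts)

# Class balance of the target, as a fraction of rows.
class_balance = preview["Outcome"].value_counts(normalize=True)
print(class_balance)
```

Running the same two checks on the full `df` would show whether zero-valued readings are common enough to need imputation, and whether accuracy alone is a fair metric for the class ratio.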

In [7]:
# Check for nulls (printed explicitly: a notebook only displays a cell's last expression)
print(df.isnull().sum())

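# Separate features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Feature scaling
# Note: the scaler is fit on all rows before the split, so test-set statistics
# influence the transform. Fitting on the training split only avoids this leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)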

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
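The cell above scales the full feature matrix before splitting it. One way to keep test rows out of the scaler's statistics is to split first and wrap the scaler and model in a pipeline. This is a minimal sketch on synthetic data from `make_classification` (a stand-in for `X` and `y`, so the accuracy it prints says nothing about the diabetes data); the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (8 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)

# Split first so the test rows never influence the scaler.
# stratify keeps the class ratio similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those statistics when transforming the test rows inside score()/predict().
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy (synthetic data): {test_accuracy:.4f}")
```

A pipeline also behaves correctly inside `GridSearchCV`, where the scaler is refit on each training fold rather than on the whole training set.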

In [9]:
# Logistic Regression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

# Decision Tree
# No random_state is set, so the fitted tree and its score can vary between runs.
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)

# KNN
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
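The models above are fit with default settings, and an unconstrained decision tree can memorise its training data. Comparing training and test accuracy is a simple way to see that gap. The sketch below uses synthetic data as a stand-in, and the `max_depth=3` limit is an arbitrary illustrative value, not a recommendation from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

# Unconstrained tree: grows until every training leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Depth-limited tree: higher bias, lower variance (depth 3 is illustrative).
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    print(f"{name:12s} train={train_acc:.3f}  test={test_acc:.3f}")
```

A large difference between the two columns for the full tree is the usual sign of overfitting; the depth-limited tree trades some training accuracy for a smaller gap.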


In [11]:
models = ['Logistic Regression', 'Decision Tree', 'KNN']
scores = [
    accuracy_score(y_test, y_pred_log),
    accuracy_score(y_test, y_pred_tree),
    accuracy_score(y_test, y_pred_knn)
]

for m, s in zip(models, scores):
    print(f'{m}: {s:.4f}')


Logistic Regression: 0.7532
Decision Tree: 0.7403
KNN: 0.6883
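The notebook imports `confusion_matrix` and `classification_report` but reports only accuracy. For a disease classifier, missed positives (false negatives) usually matter more than a single accuracy figure shows. The sketch below uses ten hypothetical labels, not the notebook's predictions, to show how the counts, precision, and recall are obtained.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = disease, 0 = no disease).
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_demo, y_pred_demo).ravel())
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In the notebook itself, the same calls would take `y_test` and each of `y_pred_log`, `y_pred_tree`, and `y_pred_knn`.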


In [13]:
param_grid = {'n_neighbors': range(1, 20)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

Best Parameters: {'n_neighbors': 11}
Best Score: 0.7573637211781954
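`best_score_` is a mean cross-validation accuracy on the training data, so it is not directly comparable to the test accuracies printed earlier. A fair comparison scores the refit best estimator on the held-out test set, and `return_train_score=True` exposes the training/validation gap for each `n_neighbors`. This is a sketch on synthetic stand-in data; the `_demo` and `search` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's scaled features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same grid as the notebook, but also recording training-fold scores.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 20))},
    cv=5,
    return_train_score=True,
)
search.fit(X_tr, y_tr)

# A large train/validation gap at small k indicates overfitting.
results = search.cv_results_
for params, train_acc, val_acc in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(f"k={params['n_neighbors']:2d}  train={train_acc:.3f}  validation={val_acc:.3f}")

# score() uses the best estimator refit on all training rows (refit=True by default).
tuned_test_accuracy = search.score(X_te, y_te)
print("Best k:", search.best_params_["n_neighbors"])
print(f"Tuned KNN test accuracy (synthetic data): {tuned_test_accuracy:.4f}")
```

In the notebook, the equivalent final step would be `grid.score(X_test, y_test)`, which yields a number that can be set beside the three test accuracies.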


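- On the held-out test set, Logistic Regression scored highest (0.7532 accuracy), followed by the Decision Tree (0.7403) and untuned KNN (0.6883).
- Grid search selected `n_neighbors = 11` for KNN, with a mean 5-fold cross-validation accuracy of 0.7574 on the training data. This score is not directly comparable to the test accuracies, and the tuned model was not re-evaluated on the test set, so it does not by itself show that KNN outperforms Logistic Regression.
- A larger `n_neighbors` averages over more neighbors, which generally lowers variance relative to the default of 5. The notebook does not measure the gap between training and test accuracy, so a reduction in overfitting is plausible but not demonstrated.
- Future work can include feature engineering and ensemble methods.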
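As a starting point for the ensemble idea, a random forest can be evaluated with the same 5-fold cross-validation used for KNN. This is only a sketch on synthetic stand-in data; the choice of `RandomForestClassifier` and `n_estimators=200` is an illustrative assumption, not a method from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=1
)

# Trees do not need feature scaling, so no scaler is used here.
forest = RandomForestClassifier(n_estimators=200, random_state=42)

# One accuracy value per fold; the mean is comparable to GridSearchCV's best_score_.
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy (synthetic data): {cv_scores.mean():.4f}")
```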