# Diabetes Prediction using Machine Learning

**Author:** Shivam Kannojia  
**Institute:** IIT Patna (Batch of 2027)  
**Domain:** Data Science | Machine Learning | Healthcare Analytics  

---

## 📌 Problem Statement

Diabetes is a chronic medical condition that affects millions of people worldwide. Early detection plays a crucial role in preventing severe complications and improving patient outcomes.

The objective of this project is to build a machine learning model that can predict whether an individual is likely to have diabetes based on clinical and lifestyle features. This project follows a complete end-to-end data science workflow.


In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, classification_report

from sklearn.linear_model import LogisticRegression
import xgboost as xgb

import warnings
warnings.filterwarnings("ignore")


## 📊 Dataset Description

The dataset consists of medical and lifestyle-related features such as:
- Age
- BMI
- Blood glucose levels
- Blood pressure
- Other health indicators

⚠️ **Note:**  
Due to competition rules, the dataset is not included in this repository. Please download the dataset directly from the competition source.


In [None]:

df = pd.read_csv("data/train.csv")
df.head()


## 🔍 Exploratory Data Analysis (EDA)

In this section, we explore:
- Feature distributions
- Missing values
- Class imbalance
- Statistical properties

EDA helps in understanding the dataset and guiding preprocessing and modeling decisions.


In [None]:
df.info()
print(df.describe())  # printed explicitly: a notebook only auto-displays a cell's last expression

# Target distribution
sns.countplot(x='target', data=df)
plt.title("Target Class Distribution")
plt.show()
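
The missing-value and class-imbalance checks listed above can be sketched as follows. This is a minimal illustration on a small synthetic frame: the helper name `summarize_frame` and the toy columns and values are hypothetical, and the label column is assumed to be called `target`, as in the plotting cell.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame: pd.DataFrame, target_col: str = "target"):
    """Return (missing-value table, class-share series) for a DataFrame."""
    missing = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    }).sort_values("n_missing", ascending=False)
    # Share of each class, ordered by class label
    class_share = frame[target_col].value_counts(normalize=True).sort_index()
    return missing, class_share


# Toy data standing in for the real training set (hypothetical values)
toy = pd.DataFrame({
    "age": [34, 51, np.nan, 45, 62, 29],
    "bmi": [22.1, 30.5, 27.3, np.nan, 33.0, np.nan],
    "target": [0, 1, 0, 0, 1, 0],
})

missing_table, class_share = summarize_frame(toy)
print(missing_table)
print(class_share)
```

On the real data the same call would be `summarize_frame(df)`. Columns with a high missing percentage and a strongly skewed class share are the two findings most likely to change the preprocessing and validation choices later on.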

## 🧪 Feature Engineering

Feature engineering includes:
- Handling missing values
- Scaling numerical features
- Encoding categorical variables (if any)
- Feature selection (if required)


In [None]:
X = df.drop(columns=['target'])
y = df['target']
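
The preprocessing steps listed above (missing values, scaling, categorical encoding) could be combined in a single scikit-learn `ColumnTransformer`. The sketch below is an illustration only: the toy column names and values are hypothetical, and median and most-frequent imputation are assumed choices rather than the ones used for the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features (hypothetical column names and values)
X_toy = pd.DataFrame({
    "age": [34, 51, np.nan, 45],
    "bmi": [22.1, 30.5, 27.3, np.nan],
    "smoker": ["no", "yes", "no", "no"],
})

# Split columns by dtype so each group gets its own treatment
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # assumed strategy
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X_prepared = preprocessor.fit_transform(X_toy)
print(X_prepared.shape)
```

Placing a preprocessor like this as the first step of a model `Pipeline` means the imputation and scaling statistics are learned from the training folds only during cross-validation, which avoids leaking information from the validation folds.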


## 🤖 Model Building

We train and compare multiple models:
- Logistic Regression (baseline)
- XGBoost Classifier
- Ensemble / Blended models

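Stratified k-fold cross-validation is used so that each fold preserves the class ratio and the reported ROC-AUC is averaged over several splits rather than depending on a single one.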


In [None]:
log_reg_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores_lr = cross_val_score(log_reg_pipeline, X, y, cv=cv, scoring="roc_auc")
print("Logistic Regression CV ROC-AUC:", scores_lr.mean())

In [None]:
xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)

scores_xgb = cross_val_score(xgb_model, X, y, cv=cv, scoring="roc_auc")
print("XGBoost CV ROC-AUC:", scores_xgb.mean())


## 🔗 Model Ensembling

To improve performance, predictions from multiple models are combined using a weighted blend of their predicted probabilities (0.7 for XGBoost and 0.3 for logistic regression in the cell below). Blending tends to help when the individual models make partly different errors.


In [None]:
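# Fit on a stratified hold-out split (the 20% test size is an assumption) before predicting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
xgb_model.fit(X_train, y_train)
log_reg_pipeline.fit(X_train, y_train)
pred_xgb = xgb_model.predict_proba(X_test)[:, 1]
pred_lr = log_reg_pipeline.predict_proba(X_test)[:, 1]

blended_pred = 0.7 * pred_xgb + 0.3 * pred_lr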
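
The blend weights above are fixed by hand. One possible way to choose them, sketched here under stated assumptions, is a small grid search on a validation set. The helper names `blend` and `best_blend_weight` are hypothetical, and the labels and model outputs are synthetic stand-ins rather than real predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def blend(pred_a, pred_b, weight_a):
    """Weighted average of two probability vectors; the weights sum to 1."""
    return weight_a * pred_a + (1.0 - weight_a) * pred_b


def best_blend_weight(y_true, pred_a, pred_b, grid=None):
    """Grid-search the weight on pred_a that maximises ROC-AUC."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)  # 0.0, 0.1, ..., 1.0
    scores = [roc_auc_score(y_true, blend(pred_a, pred_b, w)) for w in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Synthetic validation labels and two noisy model outputs (hypothetical)
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, size=500)
pred_model_a = np.clip(y_val * 0.3 + rng.normal(0.4, 0.25, size=500), 0, 1)
pred_model_b = np.clip(y_val * 0.2 + rng.normal(0.4, 0.25, size=500), 0, 1)

weight, auc = best_blend_weight(y_val, pred_model_a, pred_model_b)
print(f"best weight on model A: {weight:.1f}, validation ROC-AUC: {auc:.3f}")
```

Because the grid includes 0 and 1, the selected blend never scores below either model alone on the data it was tuned on. For that reason the weight should be tuned on a validation split that is separate from the final test set, otherwise the reported score is optimistic.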


## 📈 Model Evaluation

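The primary evaluation metric used in this competition is **ROC-AUC**. It measures how well the predicted probabilities rank positive cases above negative ones, independent of any decision threshold, which makes it less sensitive to class imbalance than plain accuracy.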


In [None]:
roc_auc = roc_auc_score(y_test, blended_pred)
print("Blended Model ROC-AUC:", roc_auc)
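
To make the metric concrete, ROC-AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way with NumPy only. The helper name `pairwise_auc` is hypothetical and the approach is for illustration; in practice `roc_auc_score`, as used above, is the better choice.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Builds an n_pos x n_neg matrix,
    so it is meant for small illustrative inputs only.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive minus every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


y_example = [0, 0, 1, 1]
scores_example = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_example, scores_example))  # 0.75: 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric depends only on the order of the blended scores and not on their scale.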


## ✅ Conclusion

In this project, we built an end-to-end machine learning pipeline for diabetes prediction:
- Performed structured EDA and preprocessing
- Trained baseline and advanced models
- Used ensembling techniques to improve performance
- Evaluated models using ROC-AUC

This project demonstrates practical application of machine learning in healthcare analytics.
