
# Diabetes Prediction: Preprocessing & Modeling (Beginner-friendly)

This notebook uses the **Pima Indians Diabetes** dataset (OpenML name: `diabetes`).

**What you'll get in this assignment notebook (ready-to-run):**
- Load the dataset from OpenML
- Inspect & handle missing values (zeros -> NaN for some medical features)
- Impute missing values (median)
- Scale features with StandardScaler
- Train 5 classification models: Logistic Regression, Decision Tree, Random Forest, KNN, SVC
- Evaluate models (accuracy, classification report, confusion matrix, ROC AUC)
- Visualize model accuracies and include short discussion points for the assignment write-up

> Run the notebook cells in order. The first call to `fetch_openml('diabetes', version=1)` downloads the dataset, so it needs internet access; scikit-learn caches the download for later runs.


## 1. Setup and imports

In [None]:
# Standard imports
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt

print('Libraries imported')

## 2. Load the Pima Indians Diabetes dataset from OpenML
This dataset on OpenML has the name `'diabetes'` (UCI Pima Indians Diabetes). We'll fetch it and prepare X, y.

In [None]:
pima = fetch_openml('diabetes', version=1, as_frame=True)
X = pima.data.copy()
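# The OpenML target holds string labels ('tested_positive' / 'tested_negative'),
# so map them to 1/0 rather than casting with astype(int), which would fail.
y = (pima.target == 'tested_positive').astype(int)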
print('Loaded dataset shape:', X.shape)
print('Target distribution:\n', y.value_counts())
X.head()

## 3. Inspect and preprocess
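Replace zeros with NaN for features where zero is not medically valid. In the OpenML version of the table these columns have short names: `plas` (glucose), `pres` (blood pressure), `skin` (skin thickness), `insu` (insulin), `mass` (BMI). Then impute with the median and scale.

In [None]:
# OpenML column names; check X.columns if your copy of the dataset uses different ones
cols_with_zero_invalid = ['plas', 'pres', 'skin', 'insu', 'mass']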
X_clean = X.copy()
X_clean[cols_with_zero_invalid] = X_clean[cols_with_zero_invalid].replace(0, np.nan)
print('Missing values after replacing zeros:')
print(X_clean[cols_with_zero_invalid].isna().sum())

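# Impute and scale
# Note: fitting on all rows (before the train/test split) lets test rows influence
# the medians and scaling statistics; acceptable for a first pass, but worth
# mentioning in the write-up.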
imputer = SimpleImputer(strategy='median')
X_imputed = pd.DataFrame(imputer.fit_transform(X_clean), columns=X_clean.columns)
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=X_imputed.columns)

print('\nAfter imputation and scaling:')
print(X_scaled.describe().T)
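
To see what the zero-to-NaN replacement and median imputation do, here is a minimal sketch on a made-up four-row table. The column names `glucose` and `pregnancies` and all values are illustrative assumptions, not rows from the dataset. A zero glucose reading is treated as missing, while zero pregnancies is a valid value and is left alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'glucose': [148, 0, 183, 89],   # 0 is not a plausible glucose reading
    'pregnancies': [6, 0, 8, 1],    # 0 is a legitimate count, so keep it
})

# Mark the implausible zero as missing in the glucose column only
toy_clean = toy.copy()
toy_clean['glucose'] = toy_clean['glucose'].replace(0, np.nan)

# Fill the missing value with the median of the observed glucose values
toy_imputer = SimpleImputer(strategy='median')
toy_imputed = pd.DataFrame(
    toy_imputer.fit_transform(toy_clean), columns=toy_clean.columns
)
print(toy_imputed)
```

The observed glucose values are 148, 183 and 89, so the missing entry is filled with their median, 148. The median is less sensitive to extreme readings than the mean, which is one reason to prefer it for skewed medical measurements.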

## 4. Train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.30, random_state=42, stratify=y)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
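
One thing to be aware of: in section 3 the imputer and scaler were fitted on every row before this split. A common alternative is to split first and put the preprocessing inside a scikit-learn `Pipeline`, so that medians and scaling statistics are learned from the training rows only. The sketch below shows the idea on synthetic stand-in data (generated with `make_classification`, with some values knocked out at random); the variable names are hypothetical and it is not a drop-in replacement for the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # make roughly 5% of values missing

# Split first, so the test rows stay unseen during preprocessing
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Imputer and scaler are fitted on the training rows only when pipe.fit is called
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(Xd_train, yd_train)

demo_acc = pipe.score(Xd_test, yd_test)
print('Held-out accuracy:', round(demo_acc, 4))
```

When `pipe.score` or `pipe.predict` is called on the test rows, the pipeline reuses the training medians and scaling statistics rather than recomputing them.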

## 5. Train five models
We'll train Logistic Regression, Decision Tree, Random Forest, KNN, and SVC. Then evaluate each model on the test set.

In [None]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(probability=True, random_state=42)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    try:
        roc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    except Exception:
        roc = None
    results.append({'model': name, 'accuracy': acc, 'roc_auc': roc})
    print(f'--- {name} ---')
    print('Accuracy:', round(acc, 4))
    print('Classification report:\n', classification_report(y_test, y_pred))
    print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
    print('\n')

results_df = pd.DataFrame(results).sort_values('accuracy', ascending=False).reset_index(drop=True)
print('Summary results:')
print(results_df)
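
Accuracy alone can hide how a model treats the positive class, which matters here because the two classes are not equally frequent. As a small worked example with made-up labels (not model output from this notebook), the following sketch shows how accuracy, precision and recall follow from the four cells of a binary confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for ten patients
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

acc_by_hand = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision_by_hand = tp / (tp + fp)              # of predicted positives, how many are real
recall_by_hand = tp / (tp + fn)                 # of real positives, how many were found

print('tn, fp, fn, tp:', tn, fp, fn, tp)
print('accuracy :', acc_by_hand, accuracy_score(y_true_demo, y_pred_demo))
print('precision:', precision_by_hand, precision_score(y_true_demo, y_pred_demo))
print('recall   :', recall_by_hand, recall_score(y_true_demo, y_pred_demo))
```

In a screening setting, recall for the positive class is often the number to look at first, since a false negative means a missed case.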

## 6. Visualize accuracy comparison

In [None]:
plt.figure(figsize=(8,4))
plt.bar(results_df['model'], results_df['accuracy'])
plt.title('Model accuracy comparison')
plt.ylabel('Accuracy')
plt.ylim(0,1)
for i, v in enumerate(results_df['accuracy']):
    plt.text(i, v+0.01, f'{v:.3f}', ha='center')
plt.show()

## 7. Assignment write-up suggestions (short)
- Explain preprocessing choices (why zeros -> NaN, why median imputation).
- Explain scaling and why it's needed for KNN and SVM.
- Discuss model comparison and how to improve (cross-validation, hyperparameter tuning, feature selection).
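
As a starting point for the cross-validation and tuning ideas above, here is a minimal sketch using synthetic stand-in data, so it runs without downloading anything. The data, the variable names and the grid of `n_neighbors` values are assumptions chosen for illustration. KNN is used because it relies on distances between rows, so a feature with a large numeric range would dominate the result if it were not scaled.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, binary target)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling lives inside the pipeline, so it is refitted on each training fold
knn_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 5-fold cross-validation: five accuracy estimates instead of a single split
cv_scores = cross_val_score(knn_pipe, X_cv, y_cv, cv=cv, scoring='accuracy')
print('Fold accuracies:', cv_scores.round(3))
print('Mean:', round(cv_scores.mean(), 3), 'Std:', round(cv_scores.std(), 3))

# Small grid search over the number of neighbours (hypothetical grid)
grid = GridSearchCV(
    knn_pipe, {'knn__n_neighbors': [3, 5, 7, 9, 11]}, cv=cv, scoring='accuracy'
)
grid.fit(X_cv, y_cv)
print('Best n_neighbors:', grid.best_params_['knn__n_neighbors'])
print('Best CV accuracy:', round(grid.best_score_, 3))
```

Reporting the mean and spread across folds gives a more stable comparison between models than a single 70/30 split, which can shift by a few points depending on the random seed.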