
# Diabetes Prediction Using a Local CSV (Beginner-Friendly)

This notebook uses a local CSV file, `pima_diabetes_dataset_offline.csv`, which must be in the same folder as the notebook.
It performs a full ML workflow with explanations:
- Load local CSV
- Inspect data and handle missing values
- Preprocess (replace invalid zeros, impute using median, scale features)
- Train 5 classifiers (Logistic Regression, Decision Tree, Random Forest, KNN, SVC)
- Evaluate models (accuracy, ROC-AUC, classification report, confusion matrix)
- Plot model comparison and give short assignment notes

Run the notebook cells from top to bottom. The notebook is written for beginners, with comments and brief explanations throughout.


## 1) Imports and helper functions

In [None]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt

print('Libraries imported')

## 2) Load the local CSV dataset

In [None]:
# Load local CSV (ensure the file is in the same folder as this notebook)
csv_path = 'pima_diabetes_dataset_offline.csv'  # change path if needed
df = pd.read_csv(csv_path)
print('Loaded shape:', df.shape)
df.head()

## 3) Basic inspection

In [None]:
df.info()  # prints its own summary and returns None, so no print() wrapper is needed
print('\nSummary statistics:\n', df.describe().T)
print('\nTarget value counts:\n', df['Outcome'].value_counts())

## 4) Robust target handling (ensure y is numeric 0/1)

In [None]:
# Robust target mapping (handles string labels if present)
y = df['Outcome']
if not np.issubdtype(pd.Series(y).dtype, np.number):
    print('Non-numeric target detected. Examples:', pd.Series(y).unique()[:20])
    mapping = {
        'tested_negative': 0, 'tested_positive': 1,
        'negative': 0, 'positive': 1,
        'No': 0, 'Yes': 1,
        'no': 0, 'yes': 1,
        'NEGATIVE': 0, 'POSITIVE': 1,
        '0': 0, '1': 1
    }
    y_mapped = pd.Series(y).map(mapping)
    if y_mapped.isna().any():
        coerced = pd.to_numeric(pd.Series(y), errors='coerce')
        if coerced.isna().any():
            missing = pd.Series(y)[y_mapped.isna()].unique()
            raise ValueError(f'Cannot automatically map these target labels: {missing}. Add them to mapping.')
        else:
            y = coerced.astype(int).values  # NumPy array, consistent with the other branches
    else:
        y = y_mapped.astype(int).values
else:
    y = pd.Series(y).astype(int).values

print('Final target unique values:', np.unique(y))

## 5) Preprocessing: replace invalid zeros, impute, and scale

In [None]:
# Features
X = df.drop(columns=['Outcome']).copy()

# Columns where 0 is medically invalid and likely represents missing values
cols_with_zero_invalid = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with NaN in those columns
X[cols_with_zero_invalid] = X[cols_with_zero_invalid].replace(0, np.nan)
print('Missing counts after replacing zeros:')
print(X[cols_with_zero_invalid].isna().sum())

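# Impute missing values with the column median.
# Note: for simplicity the imputer and scaler are fitted on all rows, so test rows
# influence the medians, means, and standard deviations (a mild form of data leakage).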
imputer = SimpleImputer(strategy='median')
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Scale features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=X_imputed.columns)

print('\nAfter imputation and scaling — feature summary:')
print(X_scaled.describe().T)
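
To make the zero-replacement and median-imputation steps concrete, here is a minimal sketch on a four-row toy table. The values are made up for illustration and are not taken from the CSV; only the column names `Glucose` and `BMI` match the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a Glucose or BMI of 0 is not physiologically plausible
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Step 1: mark the implausible zeros as missing
toy_nan = toy.replace(0, np.nan)

# Step 2: fill each missing value with the median of the remaining values in its column
toy_filled = pd.DataFrame(
    SimpleImputer(strategy='median').fit_transform(toy_nan),
    columns=toy.columns,
)
print(toy_filled)
```

The missing Glucose value becomes 148 (the median of 148, 183, 89) and the missing BMI becomes 28.1 (the median of 33.6, 26.6, 28.1). The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed medical measurements.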

## 6) Train/Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.30, random_state=42, stratify=y)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
print('Train class distribution:\n', pd.Series(y_train).value_counts())
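
Section 5 fits the imputer and scaler on the full dataset before this split. A common alternative is to split first and wrap the preprocessing in a scikit-learn `Pipeline`, so that medians and scaling statistics are learned from the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the CSV; the names `X_demo`, `Xd_train`, and `pipe` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (assumption: not the real CSV)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Split first, so the test rows stay unseen during preprocessing
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# fit() learns medians, means, and standard deviations from the training rows only;
# score() then applies those same learned values to the test rows
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(Xd_train, yd_train)
print('Test accuracy:', round(pipe.score(Xd_test, yd_test), 4))
```

On a dataset of this size the difference in scores is often small, but the pipeline version gives an estimate that does not depend on information from the test set.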

## 7) Train 5 models and evaluate

In [None]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(probability=True, random_state=42)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    try:
        roc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    except Exception:
        roc = None
    results.append({'model': name, 'accuracy': acc, 'roc_auc': roc})
    print(f'--- {name} ---')
    print('Accuracy:', round(acc, 4))
    print('ROC-AUC:', 'n/a' if roc is None else round(roc, 4))
    print('Classification report:\n', classification_report(y_test, y_pred))
    print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
    print('\n')

results_df = pd.DataFrame(results).sort_values('accuracy', ascending=False).reset_index(drop=True)
print('Summary results:')
results_df
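
The confusion matrices printed above can be read cell by cell. As a small sketch with made-up labels (not results from this dataset), the four counts can be unpacked and turned into precision, recall, and accuracy by hand:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to illustrate the matrix layout
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary 0/1 labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                  # of the predicted positives, how many were right
recall = tp / (tp + fn)                     # of the actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were correct

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}')
```

In a medical screening setting, false negatives (missed positive cases) are usually the more costly error, so recall for the positive class deserves attention alongside accuracy.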

## 8) Plot model accuracies

In [None]:
plt.figure(figsize=(8, 4))
plt.bar(results_df['model'], results_df['accuracy'])
plt.title('Model accuracy comparison')
plt.ylabel('Accuracy')
plt.ylim(0,1)
for i, v in enumerate(results_df['accuracy']):
    plt.text(i, v+0.01, f'{v:.3f}', ha='center')
plt.show()

## 9) Assignment notes and suggestions
- Explain why zeros were replaced with NaN for certain columns.
- Explain median imputation.
- Explain why scaling is needed for KNN and SVM.
- Suggest next steps: cross-validation, hyperparameter tuning, feature selection, ensemble methods.
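
As a starting point for the cross-validation and hyperparameter-tuning suggestions, the following is a minimal sketch rather than a recommended final setup. It assumes synthetic data from `make_classification` in place of the CSV, and the names `X_cv`, `knn_pipe`, and the candidate `n_neighbors` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (assumption: not the real CSV)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling lives inside the pipeline, so each fold is scaled using its own training part
knn_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 5-fold cross-validation: five accuracy scores instead of one train/test estimate
scores = cross_val_score(knn_pipe, X_cv, y_cv, cv=cv, scoring='accuracy')
print('CV accuracy: mean', round(scores.mean(), 4), 'std', round(scores.std(), 4))

# Grid search: try several values of k and keep the one with the best mean CV accuracy
grid = GridSearchCV(knn_pipe, {'knn__n_neighbors': [3, 5, 7, 9, 11]}, cv=cv, scoring='accuracy')
grid.fit(X_cv, y_cv)
print('Best n_neighbors:', grid.best_params_['knn__n_neighbors'])
```

KNN and SVM both rely on distances between points, so a feature measured on a large numeric range would dominate the distance if left unscaled. Keeping `StandardScaler` inside the pipeline applies the same scaling consistently in every fold.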