# Diabetes Health Indicators Classification Project

**Objective:** Build and compare multiple classification models to predict diabetes using clinical indicators. Handle class imbalance, perform EDA, preprocessing, model training with GridSearchCV, and interpret results.

This notebook is structured to guide you through each step with explanations, reasoning, and results.

## Project Requirements
- Load the Kaggle dataset programmatically using `kaggle.json` credentials.
- Perform exploratory data analysis (EDA) to understand distributions and missing data.
- Preprocess data: imputation, encoding, scaling, and class balancing with SMOTE.
- Train binary classifiers: KNN, Logistic Regression, SVM, Decision Tree (covered in class).
- Train two additional classifiers: XGBoost and LightGBM.
- Use `GridSearchCV` for hyperparameter tuning of each model.
- Evaluate using F1 score, confusion matrix, and ROC curve.
- Compare models and select the best by F1.
- Provide clear explanations and visualizations.

## ML Utility Library Analysis
We leverage `generic_ml_utils.py` which provides:
- **Data Processing**: loading, imputation (`fit_imputer`), encoding (`encode_categorical`), balancing (`balance_classes`).
- **Feature Engineering**: datetime encoding, wind direction encoding (not used here), combined and polynomial features.
- **Model Evaluation**: `rmse`, `accuracy`, `model_performance_df`, `compare_models`, `select_best_model`, plotting functions.
- **Modeling**: `get_model`, `train_model`, `tune_model` for GridSearchCV, and `predict`.

Each of these maps directly to our workflow steps.

In [None]:
# Install necessary libraries into the active kernel's environment
%pip install kaggle xgboost lightgbm imbalanced-learn matplotlib seaborn


In [None]:
# Set Kaggle config and download dataset
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.path.expanduser('~/.kaggle')
!kaggle datasets download -d plamen2/diabetes-health-indicators-dataset -p data --unzip
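
The download above only works if a valid `kaggle.json` sits in the config directory. A minimal pre-flight check is sketched below, assuming the standard Kaggle credential file with `username` and `key` fields. The helper name `kaggle_credentials_ok` is hypothetical and not part of `generic_ml_utils.py`.

```python
import json
from pathlib import Path


def kaggle_credentials_ok(config_dir):
    """Return True if config_dir holds a kaggle.json with username and key."""
    cred_path = Path(config_dir).expanduser() / "kaggle.json"
    if not cred_path.is_file():
        return False
    try:
        creds = json.loads(cred_path.read_text())
    except (json.JSONDecodeError, OSError):
        return False
    # Both fields must be present and non-empty.
    return (
        isinstance(creds, dict)
        and bool(creds.get("username"))
        and bool(creds.get("key"))
    )
```

Calling `kaggle_credentials_ok('~/.kaggle')` before the download cell gives a clearer failure than a CLI authentication error.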


## Exploratory Data Analysis (EDA)
Load the data and inspect basic statistics, distributions, and missing values.

In [None]:
import pandas as pd
df = pd.read_csv('data/diabetes_health_indicators.csv')
df.shape, df.columns


In [None]:
# Display first rows
df.head()


In [None]:
# Summary statistics
df.describe()


In [None]:
# Check missing values
df.isnull().sum()
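
Because the project has to handle class imbalance, it helps to inspect the target distribution during EDA as well. The sketch below uses a made-up toy series in place of `df['Diabetes_binary']`, and the helper name `class_balance` is hypothetical.

```python
import pandas as pd


def class_balance(y):
    """Return counts and proportions per class, most frequent first."""
    counts = y.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy target standing in for df['Diabetes_binary'] (values are made up).
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Diabetes_binary")
balance = class_balance(toy_target)
print(balance)
```

On the real data, `class_balance(df['Diabetes_binary'])` would show how skewed the classes are before any resampling.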


## Data Preprocessing
1. **Imputation:** Fill missing values in numerical features using the mean strategy.
2. **Encoding:** One-hot encode categorical variables.
3. **Balancing:** Apply SMOTE to handle class imbalance.
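
To make the balancing step less of a black box, the core idea of SMOTE is sketched below: each synthetic row lies on the segment between a minority sample and one of its nearest minority neighbors. This is a simplified NumPy illustration, not the imbalanced-learn implementation, and the internals of `balance_classes` are not shown in this notebook. The function name `smote_like_oversample` is hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward neighbors."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # minority sample to start from
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))  # random position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: the four corners of the unit square.
X_minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
X_synth = smote_like_oversample(X_minority, n_new=6, k=2)
print(X_synth.shape)
```

Every synthetic point stays between two real minority points, which is why SMOTE fills in the minority region rather than duplicating rows.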

In [None]:
import generic_ml_utils as ml_utils
# 1. Impute missing data
df_imputed, imputer = ml_utils.fit_imputer(df)
# 2. Encode categoricals
df_encoded = ml_utils.encode_categorical(df_imputed)
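# 3. Separate features and target (balancing happens after the split)
X = df_encoded.drop('Diabetes_binary', axis=1)
y = df_encoded['Diabetes_binary']
X.shape, y.value_counts()


## Train-Test Split
Split the data into training and testing sets (80/20) before balancing, then apply SMOTE to the training set only. Oversampling before the split would let synthetic points derived from test rows leak into training, and the test set would no longer reflect the real class distribution.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Balance the training set only; the test set keeps its original class ratio
X_train, y_train = ml_utils.balance_classes(X_train, y_train, method='smote')
X_train.shape, X_test.shape, y_train.value_counts()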


## Model Training with GridSearchCV
Define hyperparameter grids and tune each classifier to maximize F1 score.

In [None]:
from generic_ml_utils import tune_model, model_performance_df

# Hyperparameter grid per model; each key is the model name passed to tune_model
param_grids = {
    'logistic_regression': {'C': [0.1, 1, 10]},
    'knn': {'n_neighbors': [3, 5, 7]},
    'svm': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'decision_tree': {'max_depth': [None, 5, 10]},
    'xgboost_classifier': {'n_estimators': [100, 200], 'max_depth': [3, 5]},
    'lightgbm_classifier': {'n_estimators': [100, 200], 'num_leaves': [31, 50]}
}
search_results = {}  # fitted search object per model
estimators = {}      # best estimator per model
results = {}         # test-set metrics per model
for name, grid in param_grids.items():
    print(f'Tuning {name}...')
    search = tune_model(
        X_train.values, y_train.values, name, grid, cv=5, scoring='f1'
    )
    search_results[name] = search
    best_model = search.best_estimator_
    estimators[name] = best_model
    y_pred = best_model.predict(X_test.values)
    results[name] = model_performance_df(
        y_test.values, y_pred, model_type='classification', model_name=name
    )
print('Tuning complete.')
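
The requirements list scaling, and KNN, SVM, and logistic regression are all sensitive to feature scale. Whether `tune_model` scales internally is not shown in this notebook. A minimal sketch on synthetic data follows, showing how scaling can sit inside the tuned estimator using scikit-learn's `Pipeline` and `GridSearchCV`. The variable names and the data here are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="f1")
demo_search.fit(X_tr, y_tr)
print(demo_search.best_params_, round(demo_search.best_score_, 4))
```

Keeping the scaler inside the pipeline avoids fitting it on validation folds, which would otherwise inflate the cross-validated F1.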

## Results Comparison
Compare model performances and select the best by F1 score.

In [None]:
from generic_ml_utils import compare_models, select_best_model
comparison_df = compare_models(results)
comparison_df


In [None]:
best_model_name, best_f1 = select_best_model(results, metric='F1')
print(f'Best model: {best_model_name} with F1 = {best_f1:.4f}')
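
Since models are ranked by F1, it is worth recalling how the score follows from confusion-matrix counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A small worked sketch with made-up counts follows. The helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall; 0.0 if undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration: 80 true positives, 20 false positives,
# 40 false negatives. Precision is 0.8 and recall is 80/120.
demo_f1 = f1_from_counts(80, 20, 40)
print(round(demo_f1, 4))  # 0.7273
```

True negatives do not appear in the formula, which is why F1 is preferred over accuracy when the negative class dominates.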

## Visualization of the Best Model
- **ROC Curve**
- **Confusion Matrix**

In [None]:
from generic_ml_utils import plot_roc_curve, plot_confusion_matrix
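best_est = estimators[best_model_name]
# Fall back to decision scores if the model lacks predict_proba (e.g. a default SVC)
if hasattr(best_est, 'predict_proba'):
    y_score = best_est.predict_proba(X_test.values)[:, 1]
else:
    y_score = best_est.decision_function(X_test.values)
fig1 = plot_roc_curve(y_test.values, y_score)
fig2 = plot_confusion_matrix(y_test.values, best_est.predict(X_test.values))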
fig1, fig2
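
For reference, the quantities behind a ROC plot can be computed directly with scikit-learn. The sketch below uses a tiny made-up example rather than the notebook's predictions, and it does not reproduce the internals of `plot_roc_curve` from `generic_ml_utils.py`, which are not shown here.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Tiny made-up example: true labels and predicted scores.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print(auc_demo)  # 0.75
```

The AUC of 0.75 means that three of the four positive/negative pairs are ranked correctly by the scores.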

## Conclusion
- The best-performing model and its F1 score are reported by the `select_best_model` cell above (`best_model_name` and `best_f1`).
- Potential improvements:
  - Perform advanced feature selection or dimensionality reduction.
  - Explore ensemble stacking or voting classifiers.
  - Consider calibration of probabilities and threshold optimization.
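
As an illustration of the threshold optimization mentioned above, the sketch below sweeps candidate thresholds over predicted scores and keeps the one with the highest F1. The data is made up and the helper name `best_f1_threshold` is hypothetical. In practice the threshold should be chosen on a validation split, not on the test set.

```python
import numpy as np


def best_f1_threshold(y_true, y_score, thresholds=None):
    """Return (threshold, f1) maximizing F1 over candidate thresholds."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        y_pred = y_score >= t
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Strict comparison keeps the lowest threshold among ties.
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Made-up validation labels and scores.
y_val_demo = [0, 0, 0, 1, 0, 1, 1]
score_val_demo = [0.12, 0.22, 0.31, 0.37, 0.58, 0.72, 0.91]
thr_demo, f1_demo = best_f1_threshold(y_val_demo, score_val_demo)
print(round(thr_demo, 2), round(f1_demo, 4))
```

Here the default cutoff of 0.5 would miss the positive scored at 0.37, while a lower threshold recovers it at the cost of one false positive.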

### Next Steps for Presentation
- Export key result tables and plots.
- Prepare slides summarizing methodology, results, and interpretation.
- Keep the code notebook available for Q&A during the defense.