# Diabetes Prediction on CDC Dataset: Project Overview

## Summary
Build a machine learning pipeline to predict diabetes status using a CDC health survey dataset (e.g., BRFSS / CDC diabetes data). The goal is a reproducible notebook that performs data cleaning, exploratory analysis, model training, evaluation, and basic explainability.

## Objectives
- Load and understand the CDC diabetes dataset.
- Clean and preprocess features (missing values, categorical encoding, scaling).
- Train baseline and advanced classifiers to predict diabetes.
- Evaluate models with appropriate metrics and cross-validation.
- Provide model interpretation and deployment-ready artifacts.

## Data (high level)
- Source: CDC public survey data (BRFSS or similar) containing demographics, lifestyle, clinical indicators and a diabetes label.
- Typical columns: age, sex, BMI, physical activity, smoking, blood pressure, cholesterol, survey responses, and a binary diabetes indicator.
- Expected issues: missing data, class imbalance, mixed data types.
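
The expected issues listed above can be checked with a short inspection helper. The following is a minimal sketch, not the project's actual code: the function name `summarize_quality`, the column names (`BMI`, `Smoker`, `Diabetes_binary`), and the toy values are all assumptions made for illustration, since the exact CDC file and target column are not yet confirmed.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame, target: str) -> dict:
    """Report per-column missing share, dtypes, and class balance of the target."""
    return {
        "missing_share": df.isna().mean().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Tiny illustrative frame; column names and values are made up, not CDC data.
demo = pd.DataFrame(
    {
        "BMI": [27.1, np.nan, 33.4, 24.8, 30.2],
        "Smoker": ["yes", "no", "no", None, "yes"],
        "Diabetes_binary": [0, 0, 1, 0, 0],
    }
)

report = summarize_quality(demo, target="Diabetes_binary")
print(report["missing_share"])  # share of missing values per column
print(report["class_share"])    # share of each label value
```

On the real dataset, the same three summaries would indicate which columns need imputation, which need encoding, and how skewed the diabetes label is.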

## Pipeline / Tasks
1. Data ingestion and schema inspection
2. Data cleaning and imputation
3. Exploratory Data Analysis (distributions, correlations, class balance)
4. Feature engineering and encoding
5. Train/test split and cross-validation
6. Baseline model (Logistic Regression) and stronger models (Random Forest, XGBoost)
7. Hyperparameter tuning (scikit-learn `GridSearchCV` / `RandomizedSearchCV`)
8. Evaluation: ROC AUC, precision, recall, F1, confusion matrix, calibration
9. Model interpretation (feature importance, SHAP)
10. Save best model and provide sample inference code
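
Steps 2 through 6 could be wired together roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the column names (`BMI`, `Age`, `Smoker`) are placeholders for whatever the chosen CDC file contains, and only the Logistic Regression baseline is shown. Keeping imputation, scaling, and encoding inside a scikit-learn `Pipeline` means they are refit on each cross-validation fold, which avoids leaking validation data into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; real column names depend on the CDC file used.
X = pd.DataFrame(
    {
        "BMI": rng.normal(28, 5, n),
        "Age": rng.integers(18, 80, n).astype(float),
        "Smoker": rng.choice(["yes", "no"], n),
    }
)
# Label loosely driven by BMI so the model has some signal to learn.
logit = 0.25 * (X["BMI"] - 28) - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X.loc[rng.choice(n, 20, replace=False), "BMI"] = np.nan  # inject missing values

numeric = ["BMI", "Age"]
categorical = ["Smoker"]

numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric), ("cat", categorical_steps, categorical)]
)

# class_weight="balanced" is one simple way to account for class imbalance.
model = Pipeline(
    [
        ("preprocess", preprocess),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Stratified split keeps the positive-class share similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
```

A stronger model such as Random Forest could be swapped in by replacing the `clf` step, and the whole pipeline object could be passed to a hyperparameter search in the same way it is passed to `cross_val_score` here.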

## Evaluation criteria
- Robustness (cross-validated performance)
- Metrics suited to imbalanced classes (ROC AUC, recall for the positive class) rather than accuracy alone
- Interpretability and reproducibility
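
A small worked example may help show why accuracy alone is a weak criterion when the positive class is rare. The labels below are made up (90 negatives, 10 positives), and `binary_metrics` is a hypothetical helper written for this illustration. In practice the equivalent functions in `sklearn.metrics` would likely be used instead.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 90 negatives followed by 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# Predictor A: always predicts "no diabetes".
always_negative = np.zeros(100, dtype=int)
metrics_a = binary_metrics(y_true, always_negative)

# Predictor B: 15 false positives, 8 of the 10 true positives caught.
screening = np.array([1] * 15 + [0] * 75 + [1] * 8 + [0] * 2)
metrics_b = binary_metrics(y_true, screening)

print(metrics_a)  # accuracy 0.9, recall 0.0
print(metrics_b)  # accuracy 0.83, recall 0.8
```

Predictor A has the higher accuracy but finds no positive cases, while predictor B gives up some accuracy to recover 8 of the 10 positives. That is the trade-off the recall criterion is meant to surface.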

## Deliverables
- Jupyter notebook with EDA, modeling, and interpretation
- Trained model artifact and inference snippet
- Short README summarizing findings and next steps
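
The model artifact and inference snippet could look something like the sketch below. It assumes `joblib` is used for serialization, which is a common choice for scikit-learn models but is not specified by this plan. The file name `diabetes_model.joblib`, the three unnamed features, and the training data are placeholders.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data standing in for the preprocessed CDC features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Save to a temporary directory here; a real project would use a stable path.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "diabetes_model.joblib"  # hypothetical artifact name
    joblib.dump(model, path)
    restored = joblib.load(path)

# Inference on new rows with the same feature order as in training.
new_rows = np.array([[1.2, -0.3, 0.1], [-0.8, 0.4, 0.0]])
proba = restored.predict_proba(new_rows)[:, 1]  # predicted probability of class 1
print(proba)
```

Saving the full preprocessing-plus-classifier pipeline, rather than the bare classifier, would let the inference code accept raw survey rows directly.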

## Next steps
- Obtain/confirm exact CDC dataset and target definition.
- Implement preprocessing and baseline model.
- Iterate with feature selection and tuning based on validation results.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import warnings