# Project 01: Diabetes Prediction - Goals & Data

## 🎯 Concept Primer

In this capstone project, we're building a **portfolio-ready** diabetes prediction model using tabular health data. This project demonstrates the **full ML pipeline** from data loading through model evaluation.

### Tabular ML Flow
1. **Define the problem**: What are we predicting? What metrics matter?
2. **Load & inspect**: Understand data structure, dtypes, missingness
3. **Clean**: Handle missing values, outliers, rename columns
4. **Explore**: Visualize distributions, correlations, class balance
5. **Preprocess**: Encode categorical features, scale continuous features, split train/val/test
6. **Baseline**: Simple models to sanity-check data quality
7. **Build PyTorch model**: Neural network for tabular data
8. **Evaluate**: Honest out-of-sample metrics

### Why Start with Goals?
Defining success upfront prevents scope creep and tells you when you're done. Healthcare ML needs a clear **operating context**: are we screening (where high recall matters most) or diagnosing (where high precision matters most)?
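Step 5 of the flow above mentions a train/val/test split. As a rough illustration only, here is a minimal standard-library sketch of a stratified split that keeps the class ratio similar in each subset. The function name `stratified_split`, the 70/15/15 proportions, and the toy labels are assumptions made for this example, not part of the project spec.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx), splitting each class separately.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val_idx.extend(indices[:n_val])
        test_idx.extend(indices[n_val:n_val + n_test])
        train_idx.extend(indices[n_val + n_test:])
    return sorted(train_idx), sorted(val_idx), sorted(test_idx)


# Toy labels with a 20% positive rate (made up for illustration).
labels = [1] * 20 + [0] * 80
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class matters when the positive class is a minority, because a purely random split can leave the validation or test set with too few positives to evaluate reliably.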

## 📋 Objectives

By the end of this notebook, you will:
1. State the **concrete problem** you're solving
2. Define **success metrics** (ROC-AUC, F1, etc.)
3. Document the **dataset source** and key fields
4. List the **variables** available for prediction
5. Consider **ethical implications** of the task

## ✅ Acceptance Criteria

You'll know you're done when:
- [ ] A clear **problem statement** is written in a markdown cell
- [ ] **Metrics** are defined (primary: ROC-AUC, secondary: F1 for positive class)
- [ ] **Dataset variables** are listed in a markdown table
- [ ] **Success thresholds** are set (e.g., ROC-AUC ≥ 0.75)
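One way to keep the success thresholds honest is to write them down as data and check later results against them. The sketch below assumes a hypothetical `SUCCESS_THRESHOLDS` dictionary and `check_success` helper. The 0.75 ROC-AUC floor is the example from the criteria above, while the F1 floor and the example results are placeholders.

```python
# Hypothetical thresholds: 0.75 comes from the example in the acceptance
# criteria; the F1 floor is a placeholder to replace with your own target.
SUCCESS_THRESHOLDS = {"roc_auc": 0.75, "f1_positive": 0.40}


def check_success(results, thresholds=SUCCESS_THRESHOLDS):
    """Return {metric: passed?}. A metric missing from results counts as failed."""
    return {
        name: results.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    }


# Made-up results, only to show the shape of the check.
example_results = {"roc_auc": 0.81, "f1_positive": 0.35}
report = check_success(example_results)
print(report)
```

Fixing the thresholds before training makes it harder to move the goalposts after seeing the results.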

## 🔧 Setup

```python
# TODO 1: Import required libraries
# Hint: You'll need pandas, numpy, and matplotlib for basic exploration
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# %matplotlib inline
```

## 📝 Problem Statement

### TODO 2: Write your problem statement

In 2-3 sentences, describe:
- What are you predicting? (e.g., "diabetes presence")
- Why does it matter? (e.g., "early screening")
- What is the expected outcome? (e.g., "binary classification model")

**Your problem statement here:**

*Replace this with your own statement*

## 🎯 Success Metrics

### TODO 3: Define your metrics

For binary classification, choose **one primary metric** and **1-2 secondary metrics**.

**Common choices:**
- **ROC-AUC:** Threshold-independent ranking quality (area under the ROC curve); 0.5 is chance level, 1.0 is perfect
- **F1-score:** Harmonic mean of precision and recall at a chosen threshold
- **Precision:** Of predicted positives, how many are correct?
- **Recall:** Of actual positives, how many did we catch?

For **screening** (e.g., identifying at-risk patients), prioritize **Recall** (avoid false negatives).  
For **diagnostics** (e.g., confirming diagnosis), prioritize **Precision** (avoid false positives).
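To make the metric definitions concrete, here is a minimal standard-library sketch that computes precision, recall, F1, and ROC-AUC on a tiny made-up example. The function names and the toy labels and scores are assumptions for illustration. ROC-AUC is computed from its pairwise-ranking interpretation (the probability that a random positive scores above a random negative), which is simple but quadratic in the number of samples.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and model scores (made up for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 is an arbitrary cutoff

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Note that precision, recall, and F1 depend on the 0.5 cutoff, while ROC-AUC uses the raw scores only. On a dataset of this project's size, a library implementation such as scikit-learn's `roc_auc_score` and `f1_score` would be the practical choice; the pure-Python version here is only meant to show what the numbers mean.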

**Your chosen metrics:**

| Metric | Target | Why? |
|--------|--------|------|
| Primary | | |
| Secondary | | |

## 📊 Dataset Information

### TODO 4: Document dataset source and fields

**Dataset:** BRFSS 2015 (Behavioral Risk Factor Surveillance System)  
**Location:** `../../data/diabetes_BRFSS2015.csv`  
**Size:** roughly 250,000 records (confirm the exact count in the next notebook)  
**Target:** `Diabetes_binary` (binary: diabetes yes/no)
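The size and target claims above are easy to verify once the file is available. The sketch below uses only the standard library and a tiny made-up CSV in place of the real file, so the column subset, the `0.0`/`1.0` value encoding, and the `summarize_target` helper are all assumptions for illustration. The next notebook would more naturally do this with pandas.

```python
import csv
import io
from collections import Counter


def summarize_target(csv_file, target="Diabetes_binary"):
    """Return (row_count, {target_value: share}) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[target] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    return total, {value: n / total for value, n in counts.items()}


# Made-up sample standing in for the real file; values are NOT real data.
SAMPLE_CSV = """Diabetes_binary,HighBP,BMI
0.0,1.0,28.0
1.0,1.0,33.0
0.0,0.0,24.0
0.0,0.0,22.0
"""

n_rows, balance = summarize_target(io.StringIO(SAMPLE_CSV))
print(n_rows, balance)

# With the real dataset, the call would look like this:
# with open("../../data/diabetes_BRFSS2015.csv", newline="") as f:
#     n_rows, balance = summarize_target(f)
```

Checking class balance this early is useful because a heavily imbalanced target changes which metrics and thresholds are meaningful.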

### TODO 5: List key variables

Create a table of the main features you expect to find. You'll verify this in the next notebook.

| Variable | Type | Description |
|----------|------|-------------|
| Diabetes_binary | binary | Target: diabetes diagnosis |
| HighBP | binary | High blood pressure |
| HighChol | binary | High cholesterol |
| CholCheck | binary | Cholesterol check in past 5 years |
| BMI | numeric | Body Mass Index |
| Smoker | binary | Has smoked at least 100 cigarettes in lifetime |
| Stroke | binary | History of stroke |
| HeartDiseaseorAttack | binary | History of coronary heart disease or heart attack |
| PhysActivity | binary | Physical activity in past 30 days |
| Fruits | binary | Consumes fruit daily |
| Veggies | binary | Consumes vegetables daily |
| HvyAlcoholConsump | binary | Heavy alcohol consumption |
| AnyHealthcare | binary | Has any health care coverage |
| NoDocbcCost | binary | Couldn't see a doctor due to cost (past 12 months) |
| GenHlth | ordinal | General health (1-5 scale) |
| MentHlth | numeric | Days of poor mental health (past 30 days) |
| PhysHlth | numeric | Days of poor physical health (past 30 days) |
| DiffWalk | binary | Difficulty walking |
| Sex | binary | Sex (Male/Female) |
| Age | ordinal | Age bands |
| Education | ordinal | Education level |
| Income | ordinal | Income brackets |
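The table above can also be kept as a small machine-readable schema, which makes the "verify this in the next notebook" step a one-line check. This is a sketch under assumptions: `EXPECTED_COLUMNS`, `compare_columns`, and `features_by_type` are hypothetical names, and the types simply mirror the table rather than anything confirmed from the file.

```python
# Expected columns and types, copied from the variables table.
EXPECTED_COLUMNS = {
    "Diabetes_binary": "binary",
    "HighBP": "binary",
    "HighChol": "binary",
    "CholCheck": "binary",
    "BMI": "numeric",
    "Smoker": "binary",
    "Stroke": "binary",
    "HeartDiseaseorAttack": "binary",
    "PhysActivity": "binary",
    "Fruits": "binary",
    "Veggies": "binary",
    "HvyAlcoholConsump": "binary",
    "AnyHealthcare": "binary",
    "NoDocbcCost": "binary",
    "GenHlth": "ordinal",
    "MentHlth": "numeric",
    "PhysHlth": "numeric",
    "DiffWalk": "binary",
    "Sex": "binary",
    "Age": "ordinal",
    "Education": "ordinal",
    "Income": "ordinal",
}


def compare_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the schema."""
    actual = set(actual_columns)
    missing = sorted(set(expected) - actual)
    unexpected = sorted(actual - set(expected))
    return missing, unexpected


def features_by_type(expected=EXPECTED_COLUMNS, target="Diabetes_binary"):
    """Group feature names (excluding the target) by their declared type."""
    groups = {}
    for name, kind in expected.items():
        if name != target:
            groups.setdefault(kind, []).append(name)
    return groups


groups = features_by_type()
print({kind: len(names) for kind, names in groups.items()})

# Example check against a header that lacks Income and has a stray column.
header = [c for c in EXPECTED_COLUMNS if c != "Income"] + ["Unnamed: 0"]
missing, unexpected = compare_columns(header)
print(missing, unexpected)
```

Grouping by type is handy later in preprocessing, where binary, ordinal, and numeric features typically get different treatment.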

## 🤔 Reflection

Answer these questions:

1. **Why these metrics?** Did you choose ROC-AUC or F1 as primary? Why?
2. **Screening vs. Diagnostics:** Is this for early screening (high recall) or final diagnosis (high precision)?
3. **Data limitations:** What concerns do you have about self-reported survey data?
4. **Success criteria:** What ROC-AUC would make you feel this model is "good enough"?
5. **Ethical implications:** Who could be harmed by a false negative or a false positive, and could performance differ across groups defined by fields such as Sex, Age, Education, or Income?
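The screening-versus-diagnostics question comes down to where you place the decision threshold. The sketch below, using made-up labels and scores, sweeps a few candidate thresholds and picks the highest one that still meets a recall target. The helper names, the candidate list, and the recall targets are assumptions for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def highest_threshold_meeting_recall(y_true, scores, min_recall, candidates):
    """Return (threshold, precision, recall) for the largest candidate
    threshold whose recall is at least min_recall, or None if none qualifies."""
    for threshold in sorted(candidates, reverse=True):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if recall >= min_recall:
            return threshold, precision, recall
    return None


# Toy labels and scores (made up for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Screening: insist on catching nearly every positive, accept lower precision.
screening = highest_threshold_meeting_recall(y_true, scores, 0.9, candidates)
# Diagnostic-style: a lower recall target allows a stricter threshold.
diagnostic = highest_threshold_meeting_recall(y_true, scores, 0.5, candidates)
print("screening:", screening)
print("diagnostic:", diagnostic)
```

On this toy data the screening setting lands on a low threshold with perfect recall but more false positives, while the stricter setting trades recall for precision. The same trade-off is what the reflection question asks you to decide deliberately.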

---

**Your reflection:**

*Write your answers here*

## 📌 Summary

✅ **Problem defined:** Binary classification of diabetes from health factors  
✅ **Metrics chosen:** ROC-AUC (primary), F1 (secondary)  
✅ **Dataset documented:** BRFSS 2015, ~250K records  
✅ **Ready for next step:** Load and inspect the data

**Next notebook:** `02_load_and_inspect.ipynb`