# Clean Data

## 🎯 Concept Primer

Cleaning transforms raw data into ML-ready data. **Decisions here affect everything downstream.**

### Cleaning Steps

1. **Rename columns:** snake_case for readability
2. **Fix dtypes:** strings → numbers, floats → ints where appropriate
3. **Handle missing:** drop rows/columns OR impute (mean/median/mode)
4. **Remove outliers:** boxplots or z-scores to find extreme values
5. **Deduplicate:** remove exact duplicate rows

**Document every decision.** Future you will thank you!
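Deduplication (step 5) can be sketched in a few lines. This is a minimal illustration on a toy DataFrame with made-up values, not the course dataset; the names `raw` and `deduped` are assumptions chosen for the example.

```python
import pandas as pd

# Toy data (hypothetical values) containing one exact duplicate row.
raw = pd.DataFrame({
    "bmi": [22.0, 31.5, 22.0, 27.3],
    "age": [4, 9, 4, 7],
})

n_before = len(raw)
# duplicated() flags rows that are identical to an earlier row.
n_dupes = int(raw.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = raw.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed; {n_before} -> {len(deduped)} rows")
```

Counting duplicates before dropping them gives you a number to record in the cleaning log. Whether exact duplicates are errors or legitimate repeated observations depends on the dataset, so treat the drop as a decision to document rather than a default.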

## 📋 Objectives

By the end of this notebook, you will:

1. Rename columns to concise snake_case
2. Cast dtypes correctly (int, float, bool)
3. Handle missing values (drop or impute)
4. Scan for outliers using visualizations
5. Document cleaning decisions

## ✅ Acceptance Criteria

You'll know you're done when:

- [ ] All columns renamed to snake_case
- [ ] Dtypes are correct (no object columns that should be numeric)
- [ ] Missing values handled (dropped or imputed)
- [ ] Cleaning decisions logged in a markdown cell
- [ ] Clean DataFrame saved in memory

## 🔧 Setup

In [ ]:
# TODO 1: Import libraries
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# %matplotlib inline

## 📝 Rename Columns

### TODO 2: Convert column names to snake_case

**Expected:** Original columns like `Diabetes_binary` → `diabetes_binary`

**Hints:**

- Use `df.columns = df.columns.str.lower().str.replace(' ', '_')`
- Or manually rename specific columns with `df.rename()`

In [ ]:
# TODO 2: Rename columns
# df.columns = df.columns.str.lower().str.replace(' ', '_')
# df.head()
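As a rough sketch of the first hint, the snippet below applies the lowercase-and-replace approach to a toy DataFrame. The column names `Diabetes_binary`, `BMI` and `MentHlth` come from this notebook; `Some Column` and all values are hypothetical, added only to show the space replacement.

```python
import pandas as pd

# Toy frame: values are made up; "Some Column" is a hypothetical name.
df_demo = pd.DataFrame({
    "Diabetes_binary": [0, 1],
    "BMI": [22.0, 31.5],
    "MentHlth": [0, 5],
    "Some Column": [2, 30],
})

# Strip stray whitespace, lowercase, then replace spaces with underscores.
# regex=False treats " " as a literal string rather than a pattern.
df_demo.columns = (
    df_demo.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df_demo.columns))
```

This approach only lowercases and replaces spaces, so `MentHlth` becomes `menthlth` rather than `ment_hlth`. If you want word boundaries inside CamelCase names, `df.rename()` with an explicit mapping is the more predictable option.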

## 🔢 Fix Data Types

### TODO 3: Cast columns to correct dtypes

**Check previous notebook:** Which columns have wrong dtypes?

**Common fixes:**

- Binary columns (0/1) → `int` or `bool`
- Numeric columns stored as strings → `float` or `int`
- Ordinal columns (1-5 scale) → `int`

**Use:** `df['column'] = df['column'].astype(dtype)`

In [ ]:
# TODO 3: Fix dtypes
# Example: df['column'] = df['column'].astype(int)
# Use .info() to verify changes
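The common fixes above might look like the following on a toy DataFrame. The values are invented, and the use of `pd.to_numeric` with `errors="coerce"` is one option assumed here for string columns that may contain unparseable entries; plain `astype` raises on such values.

```python
import pandas as pd

# Toy data (hypothetical values) with typical dtype problems.
df_demo = pd.DataFrame({
    "bmi": ["22.0", "31.5", "n/a"],       # numeric stored as strings
    "diabetes_binary": [0.0, 1.0, 0.0],   # binary flag stored as float
    "age": [4.0, 9.0, 7.0],               # whole numbers stored as float
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
df_demo["bmi"] = pd.to_numeric(df_demo["bmi"], errors="coerce")

# Cast float -> int only when every value is present and a whole number;
# astype(int) fails on NaN and would silently truncate fractions.
for col in ["diabetes_binary", "age"]:
    is_whole = (df_demo[col] % 1 == 0).all()
    if df_demo[col].notna().all() and is_whole:
        df_demo[col] = df_demo[col].astype(int)

print(df_demo.dtypes)
```

Note that coercion creates a new missing value in `bmi`, so dtype fixes should run before the missing-value step, not after it.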

## 🚫 Handle Missing Values

### TODO 4: Decide on missing value strategy

**Options:**

1. **Drop rows:** If missing < 1% of total
2. **Drop columns:** If > 50% missing
3. **Impute:** Fill with mean (numeric) or mode (categorical)

**Decision:** Document your choice for each column in the reflection.

**Use:** `df.dropna()` or `df.fillna()`

In [ ]:
# TODO 4: Handle missing values
# Example: df = df.dropna(subset=['column1'])
# Or: df['column2'] = df['column2'].fillna(df['column2'].mean())
# Verify: df.isnull().sum()
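One way to apply the options above is to measure the fraction missing per column first and then act on it. The sketch below uses a toy DataFrame with made-up values; the `smoker` and `mostly_empty` columns are hypothetical, and the 50% threshold is the rule of thumb from this section rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values and column names).
df_demo = pd.DataFrame({
    "bmi": [22.0, np.nan, 27.3, 31.5],
    "smoker": ["no", "yes", None, "no"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_frac = df_demo.isna().mean()

# Option 2: drop columns with more than 50% missing.
cols_to_drop = missing_frac[missing_frac > 0.5].index
df_demo = df_demo.drop(columns=cols_to_drop)

# Option 3: impute with the mean (numeric) or the mode (categorical).
df_demo["bmi"] = df_demo["bmi"].fillna(df_demo["bmi"].mean())
df_demo["smoker"] = df_demo["smoker"].fillna(df_demo["smoker"].mode().iloc[0])

# Verify: every remaining column should report 0 missing values.
print(df_demo.isna().sum())
```

Mean imputation shrinks the variance of a column and mode imputation inflates the most common category, which is worth noting in the reflection when the missing fraction is large.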

## 📊 Outlier Detection

### TODO 5: Scan for outliers in numeric columns

**Methods:**

- **Boxplots:** Visual inspection
- **Z-scores:** Count values beyond ±3 standard deviations

**Action:** Decide whether to clip/cap outliers or leave them

**Create:** Boxplots for key numeric features (BMI, MentHlth, PhysHlth)

In [ ]:
# TODO 5: Outlier detection
# plt.figure(figsize=(10, 6))
# sns.boxplot(data=df[['bmi', 'menthlth', 'physhlth']])
# plt.title('Outlier Detection')
# plt.show()
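The z-score method and the clip/cap action can be sketched as follows. The `bmi` values are invented for illustration, and capping at the mean ± 3 standard deviations is just one possible choice of bounds, assumed here to match the ±3 threshold above.

```python
import pandas as pd

# Toy series (hypothetical values): 30 typical readings plus one extreme.
bmi = pd.Series([25.0] * 10 + [27.0] * 10 + [29.0] * 10 + [90.0], name="bmi")

# Z-score: distance from the mean in units of standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z = (bmi - bmi.mean()) / bmi.std()
n_outliers = int((z.abs() > 3).sum())

# Optional action: cap values at mean ± 3 standard deviations.
lower = bmi.mean() - 3 * bmi.std()
upper = bmi.mean() + 3 * bmi.std()
capped = bmi.clip(lower=lower, upper=upper)

print(f"{n_outliers} value(s) beyond ±3 standard deviations")
```

Extreme values inflate both the mean and the standard deviation, so z-scores can understate how unusual a point is, especially in small samples. Comparing the count against the boxplot is a useful cross-check before deciding to cap anything.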

## 📋 Cleaning Decisions Log

### TODO 6: Document your cleaning choices

Create a markdown table summarizing decisions:

| Column | Issue | Decision | Rationale |
|--------|-------|----------|-----------|
| | | | |

**Example entries:**

| Column | Issue | Decision | Rationale |
|--------|-------|----------|-----------|
| bmi | No missing values | Keep as-is (confirm all values > 0) | Values fall in a valid range |
| age | Stored as float | Convert to int | Age should be integer |

## 🤔 Reflection

Answer these questions:

1. **Dtype changes:** Did you convert any strings to numbers? Why?
2. **Missing strategy:** Drop or impute? Could your choice bias results?
3. **Outliers:** Did you find any extreme values? How did you handle them?
4. **Bias check:** Could any cleaning steps introduce bias?

---

**Your reflection:**

*Write your answers here*

## 📌 Summary

✅ **Columns renamed:** snake_case convention  
✅ **Dtypes fixed:** Correct numeric types  
✅ **Missing handled:** Dropped or imputed  
✅ **Outliers scanned:** Decisions documented  
✅ **Ready for next step:** Explore relationships

**Next notebook:** `04_eda_visualization.ipynb`