# Load and Inspect Data

## 🎯 Concept Primer

Before building any model, you must **understand your data**. Wrong dtypes cause silent bugs. Unhandled missing values can distort statistics and break models. This notebook ensures your data is ready for cleaning.

### Data Inspection Checklist

- **Schema:** Column names, dtypes, non-null counts
- **Target distribution:** Is it balanced? How many classes?
- **Missing values:** Where are they? How many?
- **Basic stats:** min, max, mean for numeric columns

**Expected shapes:** You'll load a CSV and inspect rows × columns.

## 📋 Objectives

By the end of this notebook, you will:

1. Load the diabetes dataset from CSV
2. Display schema (`.info()`) and basic statistics (`.describe()`)
3. Visualize target distribution (diabetes yes/no)
4. Check for missing values and unusual patterns
5. Document initial observations

## ✅ Acceptance Criteria

You'll know you're done when:

- [ ] DataFrame loaded successfully from CSV
- [ ] Schema printed (dtypes, non-null counts)
- [ ] Target distribution chart created
- [ ] Missing value counts documented
- [ ] Initial issues identified in reflection

## 🔧 Setup

In [ ]:
# TODO 1: Import libraries
# Hint: You'll need pandas, numpy, matplotlib, and seaborn
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# %matplotlib inline

## 📂 Load Data

### TODO 2: Read the CSV file

**Expected:** DataFrame with ~250K rows, 22 columns

**Hints:**

- Use `pd.read_csv()` with the path `../../data/diabetes_BRFSS2015.csv`
- Check the shape: `df.shape`
- Display first 5 rows: `df.head()`

In [ ]:
# TODO 2: Load CSV
# df = pd.read_csv('../../data/diabetes_BRFSS2015.csv')
# print(f"Shape: {df.shape}")
# df.head()
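
The loading hints above can be tried on a tiny in-memory CSV before touching the real file. This is a minimal sketch, not the notebook's solution: the three rows are made up, and `Age` is a hypothetical column added only for illustration (`Diabetes_binary` and `BMI` are names this notebook mentions).

```python
import io

import pandas as pd

# Toy CSV standing in for the real file. The values are invented and
# 'Age' is a hypothetical column used only for this illustration.
csv_text = """Diabetes_binary,BMI,Age
0,27.5,45
1,33.1,60
0,22.0,30
"""

# pd.read_csv accepts a file path or any file-like object, so the same call
# shape applies to '../../data/diabetes_BRFSS2015.csv' in the real notebook.
sample_df = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(f"Shape: {n_rows} rows x {n_cols} columns")

# .head() returns the first 5 rows (here, all 3)
print(sample_df.head())
```

If the shape you get from the real file is far from the expected ~250K rows and 22 columns, that is worth noting in your reflection before moving on.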

## 📊 Inspect Schema

### TODO 3: Display data types and missing counts

**Use:** `df.info()`, which shows column names, dtypes, and non-null counts

**Expected:** Mixed dtypes (int64, float64, object for some columns)

**Watch for:** Strings that look numeric (may need parsing)

In [ ]:
# TODO 3: Check schema
# df.info()
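
One way to act on the "strings that look numeric" warning is to try parsing every non-numeric column and count how many entries fail. The sketch below uses a made-up frame in which `BMI` was read as text because of one bad entry. The data and the `"unknown"` placeholder are assumptions for illustration, not something observed in the real file.

```python
import pandas as pd

# Toy frame: 'BMI' holds text because one entry is not a number.
# The values are invented for illustration only.
toy = pd.DataFrame({
    "Diabetes_binary": [0, 1, 0, 1],
    "BMI": ["27.5", "33.1", "unknown", "22.0"],
})

# Prints dtypes and non-null counts; BMI is reported as a text column
toy.info()

# Collect every column that pandas does not treat as numeric
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

for col in text_cols:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    parsed = pd.to_numeric(toy[col], errors="coerce")
    # Entries that became NaN only because parsing failed
    n_bad = int(parsed.isna().sum() - toy[col].isna().sum())
    print(f"{col}: {n_bad} value(s) could not be parsed as numbers")
```

A column where most entries parse cleanly and only a few fail is usually a numeric column with stray placeholders, which is a cleaning task rather than a genuine text feature.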

## 📈 Basic Statistics

### TODO 4: Get summary statistics

**Use:** `df.describe()`, which shows count, mean, std, min, quartiles, and max for numeric columns

**Questions to answer:**

- Are there any obviously wrong values (e.g., negative BMI)?
- What's the range of each numeric feature?

In [ ]:
# TODO 4: Summary statistics
# df.describe()
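
The two questions above can be answered by reading the `min` and `max` rows of the summary and then filtering for impossible values. A minimal sketch on invented data follows. The `Age` column and the "BMI must be positive" rule are assumptions chosen for illustration, so adjust the check to whatever is plausible for each real feature.

```python
import pandas as pd

# Toy numeric frame with one deliberately impossible BMI value.
# Values and the 'Age' column are invented for illustration.
toy = pd.DataFrame({
    "BMI": [27.5, 33.1, -4.0, 22.0, 98.0],
    "Age": [45, 60, 30, 52, 41],
})

# describe() returns a DataFrame indexed by statistic name
summary = toy.describe()

# The min and max rows give the range of each numeric feature
print(summary.loc[["min", "max"]])

# Assumed sanity rule: a BMI of zero or below cannot be a real measurement
invalid_bmi = toy[toy["BMI"] <= 0]
print(f"Rows with non-positive BMI: {len(invalid_bmi)}")
```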

## 🎯 Target Distribution

### TODO 5: Visualize target variable

**Expected:** `Diabetes_binary` column with values 0 (No) and 1 (Yes)

**Questions:**

- Is the dataset balanced? (50/50 split)
- Or is it imbalanced? (e.g., 10% positive)

**Create:** Bar chart showing counts for each class

In [ ]:
# TODO 5: Visualize target
# Hint: Use df['Diabetes_binary'].value_counts().plot(kind='bar')
# Label axes and add title
# plt.xlabel('Diabetes Status')
# plt.ylabel('Count')
# plt.title('Distribution of Diabetes Cases')
# plt.show()
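
Alongside the bar chart, it can help to put numbers on the balance question. The sketch below computes counts and class shares for a made-up target series. The ten labels are invented and say nothing about the real split.

```python
import pandas as pd

# Invented labels for illustration: 0 = No, 1 = Yes
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Diabetes_binary")

# Absolute counts per class, ordered by class label
counts = target.value_counts().sort_index()

# normalize=True returns fractions that sum to 1 instead of counts
shares = target.value_counts(normalize=True).sort_index()

balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)

# Share of the rarest class: close to 0.5 means balanced for a binary target
minority_share = shares.min()
print(f"Minority class share: {minority_share:.0%}")
```

The `counts` series here is the same object that `.plot(kind='bar')` would draw, so the table and the chart stay consistent.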

## 🔍 Missing Values

### TODO 6: Check for missing data

**Use:** `df.isnull().sum()`, which counts missing values per column

**Questions:**

- Which columns have missing values?
- How much is missing? (< 1% vs. > 50%)

In [ ]:
# TODO 6: Missing values
# missing = df.isnull().sum()
# print(missing[missing > 0])
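
To answer "how much is missing?", the raw counts can be paired with percentages. This is a small sketch on an invented frame. The columns and the placement of the gaps are assumptions for illustration, and the real dataset may turn out to have no missing values at all.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps placed by hand. Values are invented for illustration.
toy = pd.DataFrame({
    "BMI": [27.5, np.nan, 22.0, 30.2],
    "Age": [45, 60, np.nan, np.nan],
    "Diabetes_binary": [0, 1, 0, 1],
})

# isnull() gives a boolean frame; sum() counts the True values per column
missing_count = toy.isnull().sum()

# The mean of a boolean column is the fraction of True values
missing_pct = toy.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})

# Keep only columns that actually have gaps, worst first
report = report[report["missing"] > 0].sort_values("percent", ascending=False)
print(report)
```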

## 🤔 Reflection

Answer these questions:

1. **Schema issues:** Any columns with wrong dtypes? (e.g., integers stored as strings)
2. **Missing data:** Which columns have missing values? Should we drop or impute?
3. **Class balance:** Is the diabetes class balanced or imbalanced? How will this affect training?
4. **Initial concerns:** What stands out as problematic?

---

**Your reflection:**

*Write your answers here*

## 📌 Summary

✅ **Data loaded:** CSV → DataFrame  
✅ **Schema inspected:** Dtypes documented  
✅ **Target visualized:** Distribution chart created  
✅ **Missing identified:** Columns with NaN documented  
✅ **Ready for next step:** Clean the data

**Next notebook:** `03_cleaning.ipynb`