# Data Cleaning & Exploratory Data Analysis Workshop

**Audience:** Intro–Intermediate data science / statistics students
**Tools:** Python, pandas, numpy, matplotlib, seaborn
**Datasets:**

* **Example (instructor-led):** Austin Animal Center Cat Outcomes
* **Student practice:** Obesity Levels Dataset

---

## Learning Goals

By the end of this lab, you should be able to:

* Diagnose common data quality problems using inspection and plots
* Apply basic **imputation**, **outlier detection**, and **consistency checks**
* Perform **univariate, bivariate, and simple multivariate EDA**
* Recognize issues that affect **linear regression assumptions**
* Avoid **data leakage** during exploratory analysis

---

## Dataset Overview & Cleaning Opportunities

### Example Dataset: Animal Shelter (Cats)

This dataset contains **mixed data types** and many *realistic messiness issues*:

* Ages stored as **strings** (e.g., "2 weeks", "1 year") alongside numeric age fields
* Multiple date/time columns with different formats
* Many **categorical variables** with missing values
* Potential **duplicates** (same animal ID appearing multiple times)
* Derived variables (age in days/years) that allow **consistency checks**

**Best topics to cover here:**

* Data type correction
* Missing value handling (deletion vs simple imputation)
* Duplicate detection
* Univariate and bivariate EDA
* Regression diagnostics using numeric age

---

### Student Dataset: Obesity Levels

This dataset includes **health and behavior variables**:

* Continuous variables (Age, Height, Weight)
* Ordinal / numeric scales (food consumption, physical activity)
* Binary indicators stored as integers
* Categorical lifestyle variables
* A **target variable** (obesity level)

**Best topics to cover here:**

* Outlier detection (Weight, Height, Age)
* Distribution shape (skewness, heavy tails)
* Correlations and multivariate structure
* Detecting potential **data leakage**
* Checking assumptions for linear regression

---

## Part 1: Example Code (Animal Shelter Dataset)

We will walk through the following steps together. Your independent lab will mirror this structure.

### 1. Load and Inspect the Data

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cats = pd.read_excel("aac_shelter_cat_outcome_eng.csv.xlsx")

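# A notebook cell displays only its last expression, so print each result explicitly.
print(cats.head())
cats.info()  # info() prints its own summary
print(cats.describe(include='all'))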
```

**Discussion:**

* Which variables should be numeric but are not?
* Which columns have many missing values?
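
One quick way to answer the second question is to rank columns by their share of missing values. The sketch below uses a small made-up table rather than the real shelter file; the column names are borrowed from this lab, but every value is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shelter data (all values are invented).
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3", "A4"],
    "name": ["Tom", None, None, "Mia"],
    "outcome_subtype": [None, None, None, "Foster"],
    "outcome_age_(years)": [1.0, 0.5, np.nan, 3.0],
})

# isna() gives True/False per cell; the column mean is the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same two calls should work on `cats` directly; columns near the top of the list are the ones to discuss first.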

---

### 2. Data Type Corrections

```python
# Convert date columns stored as text into pandas datetime values
cats['datetime'] = pd.to_datetime(cats['datetime'])
cats['date_of_birth'] = pd.to_datetime(cats['date_of_birth'])
```
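
Ages stored as strings (e.g., "2 weeks", "1 year") need a similar correction before they can be analyzed numerically. The following is a minimal sketch, not the dataset's official conversion: the helper name `age_text_to_days` is hypothetical, and the day counts per unit (30-day months, 365-day years) are simplifying assumptions.

```python
import pandas as pd

# Approximate days per unit (assumption: 30-day months, 365-day years).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_text_to_days(text):
    """Convert strings such as '2 weeks' or '1 year' to approximate days."""
    if pd.isna(text):
        return float("nan")
    parts = str(text).strip().lower().split()
    if len(parts) != 2:
        return float("nan")  # unexpected layout: leave as missing
    number, unit = parts
    unit = unit.rstrip("s")  # 'weeks' -> 'week'
    if unit not in DAYS_PER_UNIT or not number.isdigit():
        return float("nan")
    return float(int(number) * DAYS_PER_UNIT[unit])

# Made-up examples, including a missing value and an unparseable entry.
ages = pd.Series(["2 weeks", "1 year", "3 months", None, "unknown"])
age_days = ages.apply(age_text_to_days)
print(age_days.tolist())
```

Returning a missing value for anything unparseable keeps the bad entries visible, so they can be counted and reviewed instead of silently guessed.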

Check whether age variables are consistent:

```python
cats[['outcome_age_(days)', 'outcome_age_(years)']].describe()
```
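
Summary statistics only hint at agreement between the two age fields. A row-level check is more direct: recompute one field from the other and flag disagreements. This sketch uses invented values and assumes 365 days per year and a tolerance of 0.01 years; both are choices to adjust, not properties of the data.

```python
import numpy as np
import pandas as pd

# Toy data: the last row is deliberately inconsistent.
toy = pd.DataFrame({
    "outcome_age_(days)": [14, 365, 730, 1095],
    "outcome_age_(years)": [14 / 365, 1.0, 2.0, 5.0],
})

# Recompute years from days (assumption: 365 days per year).
years_from_days = toy["outcome_age_(days)"] / 365

# Flag rows where the two fields disagree beyond a small tolerance.
inconsistent = ~np.isclose(years_from_days, toy["outcome_age_(years)"], atol=0.01)
print(toy[inconsistent])
```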

---

### 3. Missing Values & Simple Imputation

```python
# Share of rows with a missing name (0 = none missing, 1 = all missing)
cats['name'].isna().mean()
```

Decision example:

* Drop rows missing **name** (non-essential)
* Keep rows missing **outcome_subtype**

```python
cats_clean = cats.dropna(subset=['name'])
```
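
Deletion is only one option. As a sketch of simple imputation, the example below fills a categorical column with an explicit "Unknown" label and a numeric column with its median, while recording which rows were filled. The data are invented, and the indicator column name `age_was_missing` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in one categorical and one numeric column.
toy = pd.DataFrame({
    "outcome_subtype": ["Foster", None, "Partner", None],
    "outcome_age_(years)": [1.0, np.nan, 3.0, 10.0],
})

imputed = toy.copy()

# Categorical: make missingness its own category instead of dropping rows.
imputed["outcome_subtype"] = imputed["outcome_subtype"].fillna("Unknown")

# Numeric: remember which rows were missing, then fill with the median.
imputed["age_was_missing"] = imputed["outcome_age_(years)"].isna()
median_age = imputed["outcome_age_(years)"].median()
imputed["outcome_age_(years)"] = imputed["outcome_age_(years)"].fillna(median_age)

print(imputed)
```

The median is less sensitive to extreme ages than the mean, which is why it is used here. If the data will later be split for modeling, computing the fill value on the training portion only is one way to avoid leakage.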

---

### 4. Duplicate Detection

```python
# Count rows whose animal_id already appeared earlier in the data
cats['animal_id'].duplicated().sum()
```

```python
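# Keep only the first row per animal_id.
# Note: this starts from `cats`, not `cats_clean`, so rows dropped in step 3 are still present.
cats_unique = cats.drop_duplicates(subset='animal_id')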
```

**Discussion:** When would duplicates be meaningful instead of errors?
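
To explore that question, it can help to separate fully identical rows from rows that merely share an ID. The sketch below uses invented records; it shows one possible approach, not the only defensible one.

```python
import pandas as pd

# Toy data: A2 appears twice with identical values (likely an entry error),
# A3 appears twice on different dates (plausibly two separate shelter visits).
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A2", "A3", "A3"],
    "datetime": pd.to_datetime(
        ["2020-01-05", "2020-02-01", "2020-02-01", "2020-03-10", "2021-06-15"]
    ),
    "outcome_subtype": ["Foster", "Partner", "Partner", "Foster", "Foster"],
})

# Rows that repeat an earlier row in every column.
exact_dupes = toy.duplicated(keep="first")

# keep=False marks every row sharing an animal_id, so all copies can be inspected.
repeated_ids = toy[toy["animal_id"].duplicated(keep=False)]
print(repeated_ids)

# Drop exact copies only; repeat visits by the same animal remain.
deduped = toy.drop_duplicates()
```

Dropping on `animal_id` alone would also discard the second A3 visit, which may be real information depending on the question being asked.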

---

### 5. Univariate Analysis

```python
sns.histplot(cats_unique['outcome_age_(years)'], bins=30)
plt.show()

sns.boxplot(x=cats_unique['outcome_age_(years)'])
plt.show()
```

**Discussion:**

* Is the distribution skewed?
* Are there extreme values?
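
A numeric summary can back up what the plots suggest. This sketch computes sample skewness on made-up ages before and after a log transform; `np.log1p` is one common choice for right-skewed, non-negative data, not a required step.

```python
import numpy as np
import pandas as pd

# Invented ages in years: many young animals and a few old ones.
ages = pd.Series([0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 8.0, 15.0])

# Positive skewness indicates a longer right tail.
raw_skew = ages.skew()

# log1p(x) = log(1 + x) compresses large values and is defined at zero.
log_skew = np.log1p(ages).skew()

print(f"mean={ages.mean():.2f}, median={ages.median():.2f}")
print(f"skew (raw)={raw_skew:.2f}, skew (log1p)={log_skew:.2f}")
```

A mean well above the median is another sign of right skew that can be read directly from `describe()`.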

---

### 6. Bivariate Analysis

```python
sns.scatterplot(x='outcome_age_(years)', y='outcome_hour', data=cats_unique)
plt.show()
```

```python
cats_unique[['outcome_age_(years)', 'outcome_hour']].corr()
```
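
`corr()` reports Pearson correlation by default, which measures linear association and can be pulled around by a single extreme point. A rank-based (Spearman) correlation is one alternative for skewed data. The sketch below computes it as the Pearson correlation of the ranks, using invented numbers.

```python
import pandas as pd

# Toy data: y always increases with x, but the last point is extreme.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 2, 3, 4, 5, 100],
})

# Pearson: sensitive to the extreme value.
pearson = toy["x"].corr(toy["y"])

# Spearman: Pearson correlation computed on the ranks of each variable.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson={pearson:.2f}, Spearman={spearman:.2f}")
```

A large gap between the two values suggests a relationship that is monotonic but not linear, or one dominated by outliers.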

---

### 7. Regression Diagnostics (Preview)

```python
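# residplot fits a simple linear regression of y on x and plots the residuals.
# lowess=True adds a smoothed trend line; it requires the statsmodels package.
sns.residplot(
    x=cats_unique['outcome_age_(years)'],
    y=cats_unique['outcome_hour'],
    lowess=True
)
plt.show()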
```

**Discussion:**

* Linearity
* Constant variance
* Influence of outliers
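
These three points can also be checked numerically. The sketch below fits a straight line with `np.polyfit` on simulated data (not the shelter data), computes residuals, compares their spread across the range of `x`, and shows how one extreme point shifts the slope. The split at `x = 5` and the size of the added outlier are arbitrary choices for illustration.

```python
import numpy as np

# Simulated data that satisfy the linear model by construction.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Least-squares line; polyfit returns coefficients from highest power down.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Constant variance: residual spread should be similar in both halves of x.
low_spread = residuals[x < 5].std()
high_spread = residuals[x >= 5].std()
print(f"spread low x={low_spread:.2f}, high x={high_spread:.2f}")

# Influence: push the last point far upward and refit.
y_out = y.copy()
y_out[-1] += 30
slope_out, _ = np.polyfit(x, y_out, deg=1)
print(f"slope={slope:.2f}, slope with outlier={slope_out:.2f}")
```

Points at the edge of the `x` range have the most leverage, which is why a single extreme value there can change the fitted slope noticeably.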

---

## Part 2: Student Practice (Obesity Dataset)

Load the data:

```python
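# Adjust the path and reader to match your copy of the file
# (for example, pd.read_csv if you downloaded a .csv version).
obesity = pd.read_excel("obesity_level.xlsx")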
```

### Your Tasks

Answer the following questions. Show code and brief interpretation.

1. **Initial inspection**
   Identify at least **three variables** that may need cleaning or recoding. Explain why.

2. **Missing values**
   Are there any missing values? If so, choose **one variable** and justify whether you would delete or impute.

3. **Outlier detection**
   Use either the **IQR rule or z-scores** to identify potential outliers in **Weight**. How many do you find?

4. **Distribution shape**
   Create histograms and boxplots for **Weight** and **Height**. Describe skewness or heavy tails.

5. **Bivariate relationships**
   Create a scatterplot of **Height vs Weight**. What does this suggest about their relationship?

6. **Correlation analysis**
   Compute correlations among numeric variables. Which variables are most strongly associated?

7. **Multivariate patterns**
   Create either a **pair plot** or **correlation heatmap** for numeric variables. Describe one notable pattern.

8. **Regression assumptions**
   Suppose we model **Weight** as a function of **Height**. Which regression assumptions appear questionable based on plots?

9. **Data leakage check**
   Identify one variable that should **not** be used as a predictor if the goal is to predict obesity level. Explain why.
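
As a reference for the outlier task, here is a minimal sketch of both rules on a small invented sample (not the obesity data). The 1.5 × IQR multiplier and the |z| > 3 cutoff are common conventions, not fixed requirements.

```python
import pandas as pd

# Toy weights in kg with one extreme value.
weights = pd.Series([50, 55, 57, 60, 62, 65, 68, 70, 72, 160], name="Weight")

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = weights[(weights < lower) | (weights > upper)]

# z-score rule: flag values more than 3 standard deviations from the mean.
z = (weights - weights.mean()) / weights.std()
z_outliers = weights[z.abs() > 3]

print("IQR rule:", iqr_outliers.tolist())
print("z-score rule:", z_outliers.tolist())
```

On this toy sample the IQR rule flags 160 while the z-score rule flags nothing, because the extreme value inflates the standard deviation it is measured against. The two rules can disagree, so report which one you used and why.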

---

## Wrap-Up Reflection

* Which cleaning decisions required subjective judgment?
* How did EDA change your understanding of the data?
* What additional data would improve your analysis?

---

*This lab intentionally mirrors real-world data: imperfect, ambiguous, and requiring careful reasoning—not just code.*


```python
import pandas as pd

obesity = pd.read_csv(r"C:\Users\sydne\Downloads\obesity\obesity_level.csv")
intake = pd.read_excel(r"C:\Users\sydne\OneDrive\Documents\aac_shelter_cat_outcome_eng.xlsx")
```


| Unnamed: 0 | id | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | 0be1dad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 24.443011 | 1.699998 | 81.669950 | 1 | 1 | 2.000000 | 2.983297 | Sometimes | 0 | 2.763573 | 0 | 0.000000 | 0.976473 | Sometimes | Public_Transportation | Overweight_Level_II |
| 1 | 1 | Female | 18.000000 | 1.560000 | 57.000000 | 1 | 1 | 2.000000 | 3.000000 | Frequently | 0 | 2.000000 | 0 | 1.000000 | 1.000000 | 0 | Automobile | 0rmal_Weight |
| 2 | 2 | Female | 18.000000 | 1.711460 | 50.165754 | 1 | 1 | 1.880534 | 1.411685 | Sometimes | 0 | 1.910378 | 0 | 0.866045 | 1.673584 | 0 | Public_Transportation | Insufficient_Weight |
| 3 | 3 | Female | 20.952737 | 1.710730 | 131.274851 | 1 | 1 | 3.000000 | 3.000000 | Sometimes | 0 | 1.674061 | 0 | 1.467863 | 0.780199 | Sometimes | Public_Transportation | Obesity_Type_III |
| 4 | 4 | Male | 31.641081 | 1.914186 | 93.798055 | 1 | 1 | 2.679664 | 1.971472 | Sometimes | 0 | 1.979848 | 0 | 1.967973 | 0.931721 | Sometimes | Public_Transportation | Overweight_Level_II |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20753 | 20753 | Male | 25.137087 | 1.766626 | 114.187096 | 1 | 1 | 2.919584 | 3.000000 | Sometimes | 0 | 2.151809 | 0 | 1.330519 | 0.196680 | Sometimes | Public_Transportation | Obesity_Type_II |
| 20754 | 20754 | Male | 18.000000 | 1.710000 | 50.000000 | 0 | 1 | 3.000000 | 4.000000 | Frequently | 0 | 1.000000 | 0 | 2.000000 | 1.000000 | Sometimes | Public_Transportation | Insufficient_Weight |
| 20755 | 20755 | Male | 20.101026 | 1.819557 | 105.580491 | 1 | 1 | 2.407817 | 3.000000 | Sometimes | 0 | 2.000000 | 0 | 1.158040 | 1.198439 | 0 | Public_Transportation | Obesity_Type_II |
| 20756 | 20756 | Male | 33.852953 | 1.700000 | 83.520113 | 1 | 1 | 2.671238 | 1.971472 | Sometimes | 0 | 2.144838 | 0 | 0.000000 | 0.973834 | 0 | Automobile | Overweight_Level_II |
