# KAN-CDSCO2004U  Machine Learning and Deep Learning

## Lab 2: Data Cleaning and Pipelines with the Penguins Dataset
**Estimated time: 2 hours**

### Learning Objectives
By the end of this exercise, you will be able to:
- Perform Exploratory Data Analysis (EDA) including skewness and correlation checks
- Identify and fix data inconsistencies (e.g., typos in categorical variables)
- Handle missing values using robust imputation strategies
- Manually encode categorical features (One-Hot vs Label Encoding)
- **Build production-ready Preprocessing Pipelines** using `ColumnTransformer`
- **Split data** into training and test sets

In this exercise, you will practice the full data preprocessing workflow using the **Palmer Penguins** dataset.

**How to work through this notebook:**
- 🏃 **RUN** cells = Just execute the code to see the output
- ✏️ **TODO** cells = Write your own code or answer questions
- 📖 **READ** cells = Explanations to help you understand the concepts

---
## Setup

🏃 **RUN** the cells below to import libraries and load the data.

In [None]:
# Import needed libraries
# Author: Luca Gudi (lgg.digi@cbs.dk)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# SKLearn preprocessing tools
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Display options
pd.set_option('display.max_columns', None)

In [None]:
# Load the Penguins dataset
url = "https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv"
df_raw = pd.read_csv(url)

# Working copy
df = df_raw.copy()
df.head()

---
## 1. Exploring the Data (EDA)

🏃 **RUN** these cells to get an overview of the dataset.

In [None]:
df.info()

In [None]:
df.describe(include='all')

### ✏️ TODO: Answer the following questions (based on the outputs above)

**Q1: How many features (columns) are in the dataset?**

Your answer: ___

**Q2: Which feature appears to be the target (the species we want to predict)?**

Your answer: ___

**Q3: Are there any missing values? If so, in which columns?**

Your answer: ___

### Visualizing the Relationships

🏃 **RUN** the pairplot and heatmap below.
The lecture slides recommend checking **histograms** (distributions), **skewness**, and **correlations**. The diagonal of the pairplot shows each feature's distribution per species.

In [None]:
sns.pairplot(df, hue="species_short")
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
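
The cells above cover distributions and correlations but not skewness. A minimal sketch of a skewness check follows, using pandas' `.skew()` on a small hypothetical frame (the `toy` data and its column names are made up for illustration, not taken from the penguins file). Values near 0 suggest a roughly symmetric distribution, while clearly positive values suggest a long right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one symmetric column and one right-skewed column
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})

# Sample skewness per numeric column
skewness = toy.skew(numeric_only=True)
print(skewness)

# A log transform is one common way to reduce right skew
log_skew = np.log1p(toy["right_skewed"]).skew()
print(log_skew)
```

On the lab data, the same call would be `df.skew(numeric_only=True)`.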

---
## 2. Data Cleaning

### Investigating Inconsistencies

🏃 **RUN** the following cell to inspect the unique values in the `sex` column.

In [None]:
print("Unique Sex values:", df['sex'].unique())

### ✏️ TODO: Handling the Error

You likely noticed a `.` among the values of the `sex` column. It is not a valid category; it most likely marks an unknown value or a data entry error.

**Task**: Replace the `.` with `np.nan` so Python treats it as a proper missing value.

In [None]:
# Write your code here to replace '.' with np.nan
# Hint: df['sex'] = df['sex'].replace(..., ...)

df['sex'] = df['sex'].replace('.', np.nan)
print("Fixed Sex values:", df['sex'].unique())
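
As an alternative sketch, the placeholder can be converted while the file is being read, using the `na_values` parameter of `pd.read_csv`. The three-row CSV text below is hypothetical and only stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real penguins file
csv_text = "species_short,sex\nAdelie,MALE\nGentoo,.\nChinstrap,FEMALE\n"

# Treat '.' as a missing value at load time
toy = pd.read_csv(io.StringIO(csv_text), na_values=["."])
print(toy["sex"].isna().sum())
```

This keeps the cleaning rule next to the loading step, at the cost of applying it to every column in the file.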

### Handling Missing Values

📖 **READ**:
We have missing values in `sex` and some numerical columns. There are two common ways to handle them:
- **Dropping**: Quick, but discards every row that has at least one missing value.
- **Imputing**: Keeps all rows by filling the gaps with a summary statistic (mean or median for numbers, mode for categories).

For this *manual* section, we will **drop** the rows with NAs to keep things simple. Later, in the **Pipelines** section, we will impute them instead.

In [None]:
print(f"Rows before dropping: {len(df)}")
df = df.dropna().reset_index(drop=True)
print(f"Rows after dropping: {len(df)}")
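
To make the trade-off between dropping and imputing concrete, here is a small sketch on a hypothetical four-row frame (the values are invented for illustration). It uses the median for the numeric column and the mode for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing number and one missing category
toy = pd.DataFrame({
    "body_mass_g": [3500.0, np.nan, 4200.0, 5000.0],
    "sex": ["MALE", "FEMALE", np.nan, "FEMALE"],
})

# Dropping keeps only the fully observed rows
dropped = toy.dropna()

# Imputing keeps every row and fills the gaps instead
imputed = toy.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].median())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping loses half of the toy rows here, while imputing keeps all four.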

---
## 3. Manual Preprocessing (Understanding the Mechanics)

Before we automate with pipelines, let's understand how to transform data manually.

### 1. Feature Engineering
🏃 **RUN** below to create a new feature `culmen_ratio`.

In [None]:
df['culmen_ratio'] = df['culmen_length_mm'] / df['culmen_depth_mm']
df[['culmen_length_mm', 'culmen_depth_mm', 'culmen_ratio']].head()

### 2. Encoding Categorical Variables

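📖 **READ**:
Most models need numbers, not strings. We have two main ways to convert them:

1.  **Label Encoding**: Maps each category to an integer (`0, 1, 2`). Suitable for binary features (Male, Female) and for ordinal data (Low, Med, High), provided the integers follow the intended order. Note that `LabelEncoder` assigns codes in alphabetical order and is designed for target labels; for ordinal features, `OrdinalEncoder` with an explicit category order is the safer choice.
2.  **One-Hot Encoding**: Creates one new binary column per category (`Is_Island_A`, `Is_Island_B`). Good for nominal data (Island, Color) where an ordering such as `0 < 1 < 2` doesn't make sense.

✏️ **TODO**: Analyze the code below and run it.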

In [None]:
# Example 1: Label Encoding for 'sex' (classes are sorted alphabetically: FEMALE -> 0, MALE -> 1)
le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex'])

# Example 2: One-Hot Encoding for 'island'
df = pd.get_dummies(df, columns=['island'], prefix='island', drop_first=False)

df.head()
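
The following sketch illustrates why the order of the integer codes matters for ordinal data. It uses a hypothetical `sizes` array (not part of the penguins data) and compares `LabelEncoder`, which sorts classes alphabetically, with `OrdinalEncoder` given an explicit category order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature; not a column of the penguins dataset
sizes = np.array(["Low", "High", "Med", "Low"], dtype=object)

# LabelEncoder sorts the classes alphabetically: High, Low, Med
le = LabelEncoder()
label_codes = le.fit_transform(sizes)
print(list(le.classes_), list(label_codes))

# OrdinalEncoder with explicit categories keeps the intended order Low < Med < High
oe = OrdinalEncoder(categories=[["Low", "Med", "High"]])
ordinal_codes = oe.fit_transform(sizes.reshape(-1, 1)).ravel()
print(list(ordinal_codes))
```

With the alphabetical codes, "High" ends up below "Low", which would mislead any model that treats the codes as magnitudes.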

### 3. Scaling Numerical Data

📖 **READ**:
Features with large values (like `body_mass_g`, around 4000) can dominate features with small values (like `culmen_ratio`, around 2.5) in models that are sensitive to feature magnitude. We use `StandardScaler` to put them on the same scale (mean=0, std=1).

🏃 **RUN** the Scaling step.

In [None]:
scaler = StandardScaler()
numeric_cols = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'culmen_ratio']

df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
df.head()
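
As a quick check of what `StandardScaler` computes, the sketch below reproduces it by hand with the z-score formula `z = (x - mean) / std` on a few hypothetical body-mass values. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical body-mass values in grams, shaped as one feature column
mass = np.array([[3000.0], [4000.0], [5000.0], [6000.0]])

scaled = StandardScaler().fit_transform(mass)

# Same result by hand: subtract the column mean, divide by the population std
manual = (mass - mass.mean(axis=0)) / mass.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so a hand calculation with pandas gives slightly different numbers unless `ddof=0` is passed.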

---
## 4. The Professional Way: Pipelines & ColumnTransformer

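📖 **READ**:
Doing all the steps above manually (filling NAs, encoding, scaling) is messy and error-prone. In particular, it is easy to cause **data leakage**, where statistics computed on the test data (such as a mean or median) influence the preprocessing of the training data.

In the real world, we use **Pipelines**.
A `Pipeline` bundles the steps together so they are always applied in the same order. A `ColumnTransformer` lets us apply different pipelines to different columns (e.g., median imputation and scaling for numbers, one-hot encoding for text).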

### Reloading the "Dirty" Data
Let's start fresh with the raw data to see the pipeline magic.

In [None]:
# Reload Raw Data
df_full = pd.read_csv(url)
X = df_full.drop('species_short', axis=1)
y = df_full['species_short']

# Fix the '.' error (This is usually done before the pipeline as a data cleaning step)
X['sex'] = X['sex'].replace('.', np.nan)

### ✏️ TODO: Build the Transformers

We need two pipelines:
1.  **Numeric**: Impute Missing (Median) -> Scale (StandardScaler)
2.  **Categorical**: Impute Missing (Most Frequent) -> One-Hot Encode

🏃 **RUN** the cell below to define them.

In [None]:
# 1. Select Columns
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_features = ['island', 'sex']

# 2. Create Numeric Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# 3. Create Categorical Pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 4. Combine in ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
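
The `handle_unknown='ignore'` option used above decides what happens when the encoder meets a category it did not see during fitting. A minimal sketch, assuming a training set in which the island `Torgersen` is deliberately left out:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two islands only; 'Torgersen' is deliberately absent from the training data
train = np.array([["Biscoe"], ["Dream"], ["Biscoe"]], dtype=object)
new = np.array([["Dream"], ["Torgersen"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The unseen category is encoded as a row of zeros instead of raising an error
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an exception.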


### Running the Full Pipeline

Now we can process the entire dataset in effectively one line of code. This `fit_transform` handles the NA imputation, scaling, and encoding all at once.

In [None]:
# Execute the pipeline
X_processed = preprocessor.fit_transform(X)

# Convert back to DataFrame for viewing (Optional step for visualization)
cat_names = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
all_names = numeric_features + list(cat_names)
df_final = pd.DataFrame(X_processed, columns=all_names)

print("Shape of processed data:", df_final.shape)
df_final.head()

---
## 5. Splitting the Data

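📖 **READ**:
To evaluate a model fairly, we must split our data into **Training** and **Test** sets. We train on the training set and evaluate on the held-out test set.

For simplicity, this lab splits data that has already been processed. In a real project, split first and fit the preprocessor on the training set only; otherwise the imputation and scaling statistics are computed partly from test rows, which is a form of data leakage.

🏃 **RUN** below to split the processed data.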

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
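
A leakage-free ordering can be sketched as follows: split first, call `fit_transform` on the training rows, and only `transform` the test rows. The example uses hypothetical toy data and a plain `StandardScaler` to stay short; the same pattern should apply to a full `ColumnTransformer`. The `stratify` argument is an optional extra that keeps class proportions equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (20 rows) and balanced labels
X_toy = pd.DataFrame({"body_mass_g": np.arange(3000.0, 5000.0, 100.0)})
y_toy = pd.Series(["A", "B"] * 10)

# 1. Split before any statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# 2. Learn the scaling statistics from the training rows only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# 3. Reuse those statistics on the test rows (no refitting)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Because the scaler never sees the test rows during fitting, the test set remains a fair stand-in for unseen data.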

---

## Summary

In this lab, you learned how to:

| Section | Technique | sklearn Class |
| :--- | :--- | :--- |
| **1. Exploration** | `.info()`, `.describe()`, pairplot, correlation heatmap | - |
| **2. Cleaning** | Handling inconsistencies (replacing '.'), dropping missing values | - |
| **3. Feature Engineering** | Creating derived features (`culmen_ratio`) | - |
| **3. Encoding** | Label Encoding, One-Hot Encoding | `LabelEncoder`, `OneHotEncoder` |
| **3. Scaling** | Standard scaling (z-score) | `StandardScaler` |
| **4. Pipelines** | Chaining imputation, scaling, and encoding | `Pipeline`, `ColumnTransformer`, `SimpleImputer` |
| **5. Splitting** | Creating Train/Test sets | `train_test_split` |

**Next steps:** In upcoming labs, you'll use these preprocessing techniques as part of full ML workflows!