# Module 05 Lab - Data Preparation

**Objective:** To learn and apply the most common data preparation techniques. Raw data is rarely ready for a machine learning model. This process, also called preprocessing, is one of the most critical steps in the entire ML workflow.

**In this lab, you will write more of the code.** Read the explanations and then complete the tasks in the code cells.

## Part 1: Setup and Initial Look

We will continue using the Titanic dataset because it has the exact problems we need to solve: missing values and non-numeric data.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Let's look at the missing values
print("--- Missing Values Before Cleaning ---")
print(df.isnull().sum())

## Part 2: Handling Missing Values (Imputation)

**Concept:** Most machine learning models cannot handle missing values (`NaN`). We must deal with them. Dropping the rows is an option, but you lose data. A better way is **imputation**, which means filling in the missing values with a calculated guess.

Common imputation strategies:

*   **Mean:** Fill with the average value. Good for normally distributed data.
*   **Median:** Fill with the middle value. Better for skewed data or data with outliers (like `Fare`).
*   **Mode:** Fill with the most frequent value. Used for categorical data.
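To see how the three strategies differ, here is a minimal sketch on a made-up table. The DataFrame `toy` and its `price` and `color` columns are hypothetical and are not part of the Titanic data; the point is only that a single outlier pulls the mean far from the typical values while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column with one outlier (200.0)
# and a categorical column, each with one missing value.
toy = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0, 200.0],
    "color": ["red", "blue", "red", None, "red"],
})

mean_price = toy["price"].mean()      # pulled upward by the outlier
median_price = toy["price"].median()  # stays close to the typical values
mode_color = toy["color"].mode()[0]   # most frequent category

# Impute on a copy: median for the skewed numeric column,
# mode for the categorical column.
filled = toy.copy()
filled["price"] = filled["price"].fillna(median_price)
filled["color"] = filled["color"].fillna(mode_color)

print("mean:", mean_price, "median:", median_price, "mode:", mode_color)
print(filled)
```

Missing values are skipped when pandas computes the mean, median, and mode, so the statistics come only from the observed rows.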

### Task 1: Impute the 'Age' Column

The 'Age' column is missing many values. Since age can be skewed (e.g., by a few very old passengers), using the **median** is a robust choice.

**Your Task:** Calculate the median of the 'Age' column and use the `.fillna()` method to replace the missing values.

In [None]:
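# --- ENTER YOUR CODE HERE ---

# 1. Calculate the median of the 'Age' column
# median_age = ...

# 2. Fill the missing values in 'Age' with the median
#    Assign the result back to the column. Calling fillna(..., inplace=True)
#    on a selected column raises a warning in recent pandas versions and
#    does not update df when copy-on-write is enabled.
# df['Age'] = df['Age'].fillna(...)

# 3. Verify that there are no more missing values in 'Age'
# print("Missing values in 'Age' after imputation:")
# print(df['Age'].isnull().sum())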

## Part 3: Encoding Categorical Features

**Concept:** Machine learning models are mathematical, so they need numbers, not text. We need to convert categorical columns (like 'Sex' and 'Embarked') into a numerical format. The most common method is **One-Hot Encoding**.

One-Hot Encoding takes a column with `N` categories and turns it into `N` new columns, each with a `1` or `0`. For example, the 'Sex' column (`male`, `female`) becomes two new columns: `Sex_male` and `Sex_female`.

Pandas has a convenient function called `pd.get_dummies()` that does this for us.
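As an illustration, the sketch below one-hot encodes a made-up `size` column with three categories. The DataFrame `toy` and its values are hypothetical; `dtype=int` is passed only so the output shows `1`/`0` rather than booleans. The second call uses `drop_first=True`, which keeps `N - 1` columns because the dropped category is implied when all the others are `0`.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with N = 3 categories.
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Full one-hot encoding: one new column per category.
full = pd.get_dummies(toy, columns=["size"], dtype=int)

# Reduced encoding: the first category (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=["size"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the full encoding every row contains exactly one `1`, which is where the name "one-hot" comes from.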

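### Task 2: One-Hot Encode Categorical Columns

**Your Task:** Use `pd.get_dummies()` to encode the 'Sex' and 'Embarked' columns. When you pass the `columns` argument, `pd.get_dummies()` replaces the original columns with the encoded ones, so check that 'Sex' and 'Embarked' no longer appear in the result.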

In [None]:
# --- ENTER YOUR CODE HERE ---

# 1. Use get_dummies to create new columns for 'Sex' and 'Embarked'
#    Set `drop_first=True`, which drops one of the new columns for each
#    feature, to avoid multicollinearity (a statistical issue). For example,
#    having just `Sex_male` is enough to know whether someone is female.
# encoded_df = pd.get_dummies(df, columns=[... , ...], drop_first=True)

# 2. Display the first few rows of the new DataFrame to see the new columns
# print(encoded_df.head())

## Part 4: Feature Scaling

**Concept:** Many models are sensitive to the scale of the features. For example, `Age` (from 0-80) and `Fare` (from 0-512) are on very different scales. This can cause the model to incorrectly believe that `Fare` is a more important feature simply because its values are larger.

**Feature Scaling** solves this by putting all features on a similar scale. A common method is **Standardization** (`StandardScaler` in scikit-learn), which rescales the data to have a mean of 0 and a standard deviation of 1.

**Important:** You only scale your numerical features, not your target variable or your newly encoded categorical columns.
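The following sketch shows what standardization does to two made-up features on different scales. The array `X` is hypothetical. It also reproduces the result by hand with `z = (x - mean) / std`, using the population standard deviation, which is what `StandardScaler` computes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features on very different scales.
X = np.array([
    [20.0, 100.0],
    [30.0, 200.0],
    [40.0, 300.0],
])

# Fit the scaler (learn each column's mean and std) and transform in one step.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same calculation by hand, column by column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print("Matches manual calculation:", np.allclose(X_scaled, X_manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.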

### Task 3: Scale the 'Age' and 'Fare' Columns

**Your Task:** Use `StandardScaler` from `sklearn.preprocessing` to scale the 'Age' and 'Fare' columns.

In [None]:
from sklearn.preprocessing import StandardScaler

# --- ENTER YOUR CODE HERE ---

# 1. Create an instance of the StandardScaler
# scaler = ...

# 2. Select the columns to scale
# columns_to_scale = ['Age', 'Fare']

# 3. Fit the scaler to the data and transform it
#    Note: We are using the `encoded_df` from the previous step if you created it.
# encoded_df[columns_to_scale] = scaler.fit_transform(encoded_df[columns_to_scale])

# 4. Display the first few rows to see the scaled data
# print(encoded_df.head())

## 📝 Knowledge Check

**Instructions:** Answer the following questions in this markdown cell.

1.  **Why is it often better to impute missing values with the median instead of the mean?**
2.  **Explain in your own words what One-Hot Encoding does and why it is necessary.**
3.  **Would you need to apply Feature Scaling to a Decision Tree model?** Why or why not? (Hint: Think about how a Decision Tree makes its splits).

**[ENTER YOUR ANSWERS HERE]**