# Day 3: Data Preprocessing - Missing Values and Encoding

**Course**: Machine Learning with Python: From Basics to Applications  
**Objective**: Learn to handle missing data and encode categorical variables to prepare datasets for machine learning models.  
**Prerequisites**: Basic Python knowledge, familiarity with NumPy and Pandas (from Day 2), and the raw Titanic dataset (`titanic.csv`).  
**Tools**: Pandas, scikit-learn (install with `pip install pandas scikit-learn`).  
**Dataset**: Titanic dataset (raw, available from Kaggle or provided).  

In this notebook, we will:  
1. Load the raw Titanic dataset.  
2. Check for and handle missing values (impute `Age` with median, drop rows with missing `Embarked`).  
3. Encode categorical variables (`Sex` with `LabelEncoder`, `Embarked` with one-hot encoding).  
4. Save the preprocessed dataset as a CSV.  
5. Verify the preprocessing steps.  

Let’s get started!

## Step 1: Import Libraries

We need Pandas for data handling and scikit-learn’s `LabelEncoder` for encoding categorical variables.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Step 2: Load the Titanic Dataset

Load the raw Titanic dataset (`titanic.csv`) and inspect it to understand its structure and identify missing values.

In [None]:
# Load the dataset
df = pd.read_csv('titanic.csv')

# Display first 5 rows
print("First 5 rows of raw data:")
print(df.head())  # a bare df.head() is only displayed when it is the last line of a cell

# Check data info and missing values
print("\nDataset info:")
df.info()

print("\nMissing values:")
print(df.isnull().sum())

**Expected Output**:  
- `df.head()` shows columns like `PassengerId`, `Survived`, `Pclass`, `Sex`, `Age`, `Fare`, `Embarked`, etc.  
- `df.info()` shows data types (e.g., `Age` as float, `Sex` as object).  
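- `df.isnull().sum()` shows the missing count per column. For the Kaggle training file (891 rows), expect 177 missing values for `Age`, 2 for `Embarked`, and 687 for `Cabin`; other copies of the dataset may differ slightly.  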

**Note**: We’ll focus on `Age` and `Embarked` for this exercise; `Cabin` has too many missing values and can be ignored.
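
To decide which columns are worth repairing, it can help to look at the share of missing values as well as the raw counts. The following is a minimal sketch on a made-up five-row frame (`toy` and its values are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, None, None],
})

# Count and percentage of missing values per column, largest share first
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)

print(missing)
```

A column that is mostly empty (like `Cabin` here) is usually a candidate for dropping or ignoring, while a column with a small share of gaps can be imputed or have its incomplete rows removed.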

## Step 3: Handle Missing Values

Impute missing `Age` values with the median so that we keep those rows; the median is less sensitive to extreme ages than the mean. Drop the rows with missing `Embarked` values, since there are only a few of them.

In [None]:
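# Impute missing Age with median
# (assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df)
df['Age'] = df['Age'].fillna(df['Age'].median())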

# Drop rows with missing Embarked
df.dropna(subset=['Embarked'], inplace=True)

# Verify missing values
print("Missing values after handling:")
print(df.isnull().sum())

**Expected Output**:  
- No missing values for `Age` or `Embarked`.  
- `Cabin` may still have missing values, but we’re ignoring it for now.
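
The choice of median over mean can be illustrated on a tiny example. This is a sketch with made-up ages (the `ages` Series is an assumption for illustration), where a single large value pulls the mean upward but leaves the median close to the typical value:

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier (80) and two gaps
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0, np.nan], name="Age")

print("mean:  ", ages.mean())    # pulled upward by the outlier
print("median:", ages.median())  # middle of the observed values

# fillna returns a new Series, so keep the result instead of relying on inplace=True
filled = ages.fillna(ages.median())
print(filled.tolist())
```

Both statistics skip the missing entries by default, so they are computed from the four observed ages only.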

## Step 4: Encode Categorical Variables

Encode `Sex` using `LabelEncoder`, which assigns integer codes in sorted label order (female=0, male=1). Encode `Embarked` using one-hot encoding with `pd.get_dummies`, and pass `drop_first=True` to drop the first dummy column (`Embarked_C`) and avoid multicollinearity.

In [None]:
# Encode Sex with LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Encode Embarked with one-hot encoding
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Display first 5 rows after encoding
print("\nFirst 5 rows after encoding:")
df.head()

**Expected Output**:  
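- `Sex` is now 0 or 1 (0=female, 1=male).  
- `Embarked` is replaced by `Embarked_Q` and `Embarked_S`, with `Embarked_C` dropped as the first category. In pandas 2.0 and later these columns are boolean (`True`/`False`) unless you pass `dtype=int` to `pd.get_dummies`.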
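
A quick way to confirm which integer each label received, and which dummy column was dropped, is to inspect the fitted encoder and the resulting column names. The sketch below uses a made-up four-row frame (`toy` is an illustrative assumption, not the Titanic data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

le = LabelEncoder()
toy["Sex"] = le.fit_transform(toy["Sex"])

# classes_ is sorted, and each label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# drop_first=True removes the dummy for the first category in sorted order
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(encoded.columns.tolist())

# dtype=int yields 0/1 integers instead of the default dummy dtype
encoded_int = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)
print(encoded_int.head())
```

If you ever need the original labels back, `le.inverse_transform` converts the integer codes to strings again.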

## Step 5: Save Preprocessed Dataset

Save the preprocessed dataset as `titanic_preprocessed.csv` for use in future sessions.

In [None]:
# Save preprocessed data
df.to_csv('titanic_preprocessed.csv', index=False)

print("Preprocessed dataset saved as titanic_preprocessed.csv")
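
The `index=False` argument matters when the file is read back later. As a small sketch (using an in-memory buffer and a made-up two-row frame rather than a file on disk), compare the columns you get with and without it:

```python
import io
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"Sex": [1, 0], "Age": [22.0, 38.0]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```

Without `index=False`, the reloaded frame gains an extra `Unnamed: 0` column that would otherwise have to be dropped before modeling.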

## Step 6: Verification

Verify that `Age` and `Embarked` have no missing values left, that `Sex` is encoded as 0/1, and that `Embarked` is one-hot encoded.

In [None]:
# Verify saved file
print("\nVerify titanic_preprocessed.csv:")
print(pd.read_csv('titanic_preprocessed.csv').head())

# Confirm no missing values
print("\nMissing values in preprocessed data:")
print(df.isnull().sum())

# Check unique values for Sex
print("\nUnique values in Sex:", df['Sex'].unique())

# Check columns for Embarked encoding
print("\nColumns after encoding:", df.columns.tolist())

**Expected Output**:  
- No missing values (except possibly `Cabin`, which we’re ignoring).  
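- `Sex` contains only the values 0 and 1 (the printed array lists them in order of first appearance).  
- Columns include `Embarked_Q` and `Embarked_S` (but not `Embarked_C`, which `drop_first=True` removed).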
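
Instead of checking the printed output by eye, the same checks can be written as assertions that fail loudly. This is a minimal sketch, assuming a hypothetical helper named `check_preprocessed` and a made-up three-row frame in place of the real data:

```python
import pandas as pd

def check_preprocessed(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame does not look preprocessed."""
    assert frame["Age"].notnull().all(), "Age still has missing values"
    assert set(frame["Sex"].unique()) <= {0, 1}, "Sex is not encoded as 0/1"
    assert "Embarked" not in frame.columns, "Embarked was not one-hot encoded"
    assert any(col.startswith("Embarked_") for col in frame.columns), \
        "no Embarked dummy columns found"

# Made-up rows shaped like the preprocessed dataset
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 28.0],
    "Sex": [1, 0, 0],
    "Embarked_Q": [False, False, True],
    "Embarked_S": [True, False, False],
})

check_preprocessed(toy)
print("All checks passed")
```

In the notebook you could call the helper on `df`, or on the frame returned by `pd.read_csv('titanic_preprocessed.csv')`.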

## Assignment

1. Run this notebook to preprocess the Titanic dataset.  
2. Verify no missing values for `Age` and `Embarked`.  
3. Confirm `Sex` is encoded as 0/1 and `Embarked` is one-hot encoded (`Embarked_Q`, `Embarked_S`).  
4. Ensure the saved `titanic_preprocessed.csv` contains the expected columns and data.  
5. Submit a screenshot of the notebook output showing the verification steps (missing values, unique values for `Sex`, and column names).  

**Next Steps**: On Day 4, we’ll split this preprocessed dataset into training and test sets and apply feature scaling.