# Day 3: Data Preprocessing – Missing Values, Label Encoding, One-Hot Encoding

Welcome to Day 3 of the Machine Learning with Python course. Today, we'll explore how to clean and prepare data by handling missing values and converting categorical data into numerical form.

## 1. Why Preprocessing Matters
Most machine learning models expect complete, numerical, consistently formatted input. Raw data often contains:
- Missing values
- Non-numeric (categorical) columns
- Inconsistent formatting
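
As a rough illustration, the toy frame below (hypothetical data, not the Titanic file) exhibits all three problems and shows one way to spot each of them with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data exhibiting the three problems listed above
raw = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "sex": ["male", " Female", "FEMALE "],
})

# 1. Missing values per column
missing_counts = raw.isnull().sum()

# 2. Non-numeric columns
text_columns = raw.select_dtypes(exclude="number").columns.tolist()

# 3. Inconsistent formatting: trim whitespace and lowercase the text
raw["sex"] = raw["sex"].str.strip().str.lower()

print(missing_counts)
print(text_columns)
print(raw["sex"].unique())
```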

## 2. Load Dataset
We'll use the Titanic dataset for today's preprocessing tasks.

In [None]:
import pandas as pd

# Load Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head()

## 3. Identifying Missing Data

In [None]:
df.isnull().sum()
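
Raw counts are easier to judge as a share of all rows. A minimal sketch on a hypothetical toy frame (the values are made up for illustration); the same expression should work on `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() yields booleans, so mean() is the fraction missing per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this list are candidates for dropping; columns with only a few gaps are usually better filled.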

## 4. Handling Missing Values
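Based on the counts above, we will:
- Drop the 'Cabin' column (most of its values are missing, so filling it would be mostly guesswork)
- Fill missing 'Age' with the median (less sensitive to outliers than the mean)
- Fill missing 'Embarked' with the mode (the most frequent category)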

In [None]:
# Drop Cabin if it exists
if 'Cabin' in df.columns:
    df.drop(columns='Cabin', inplace=True)

# Fill Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill Embarked with mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
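
To see why the median is often preferred over the mean for a skewed column like age, here is a small sketch on hypothetical values containing one outlier. It also shows why `mode()[0]` is needed: `mode()` returns a Series because several values can tie.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one outlier in Age, one gap in each column
toy = pd.DataFrame({
    "Age": [20.0, 22.0, 24.0, 80.0, np.nan],
    "Embarked": ["S", "S", "C", None, "Q"],
})

age_mean = toy["Age"].mean()      # pulled upward by the outlier 80
age_median = toy["Age"].median()  # middle of the observed values

toy["Age"] = toy["Age"].fillna(age_median)
# mode() returns a Series (ties are possible); [0] takes the first entry
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(age_mean, age_median)
print(toy)
```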

## 5. Handling Categorical Data
We'll use two methods:
- Label Encoding (for binary categories, or ordinal ones when the assigned integers match the category order)
- One-Hot Encoding (for nominal)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label Encode 'Sex'
le = LabelEncoder()
df['Sex_encoded'] = le.fit_transform(df['Sex'])
df[['Sex', 'Sex_encoded']].head()

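Label Encoding assigns an integer to each class. `LabelEncoder` numbers the classes in sorted order, so here `female` becomes 0 and `male` becomes 1. That is harmless for a binary column, but for a nominal column with three or more categories the integers imply an order and a distance that do not exist, which can mislead linear and distance-based models.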
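
The sorted-order behavior also matters for ordinal data. The sketch below uses a hypothetical `sizes` column to show that `LabelEncoder` does not know the intended order, and one way to state it explicitly with an ordered pandas `Categorical`. scikit-learn's `OrdinalEncoder` with an explicit `categories` argument is another option for feature columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal feature; the intended order is low < medium < high
sizes = pd.Series(["low", "medium", "high", "low"])

le = LabelEncoder()
alphabetical_codes = le.fit_transform(sizes)
print(list(le.classes_))   # classes are stored in sorted (alphabetical) order
print(alphabetical_codes)  # codes follow that order, not low < medium < high

# State the order explicitly with an ordered Categorical
ordered = pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)
ordered_codes = ordered.codes
print(ordered_codes)
```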

In [None]:
# One-Hot Encode 'Embarked'
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
df.head()

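One-Hot Encoding creates a binary column for each category and avoids implying any order. Because the cell above passes `drop_first=True`, the first category (`C`) is dropped and only `Embarked_Q` and `Embarked_S` remain; a row where both are 0 (or False) means `C`. Dropping one column removes redundant information, which helps linear models avoid perfectly correlated features.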
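
The effect of `drop_first` is easiest to see side by side. A minimal sketch on a hypothetical toy column; `dtype=int` is passed so the result shows 0/1 rather than booleans:

```python
import pandas as pd

# Hypothetical toy column with the three Titanic ports
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One column per category
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# First category (in sorted order) dropped; it is implied by all zeros
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```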

## 6. Final Cleanup
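Remove the columns we won't use as features, including the original text `Sex` column now that `Sex_encoded` exists, and inspect the final structure.

In [None]:
# Drop identifier/free-text columns and the already-encoded 'Sex' column
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Sex'], inplace=True)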
df.head()
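
Before modeling, it can help to confirm that no text columns or missing values remain. A minimal sketch on a hypothetical toy frame; `non_numeric_columns` is an illustrative helper, not a pandas function.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame resembling cleaned Titanic data, with one text column left in
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Sex_encoded": [1, 0],
    "Age": [22.0, 38.0],
})

def non_numeric_columns(frame):
    """Return the names of columns that are not numeric (or boolean)."""
    return [col for col in frame.columns if not is_numeric_dtype(frame[col])]

print(non_numeric_columns(toy))   # text columns still to encode or drop
print(toy.isnull().sum().sum())   # total remaining missing values
```

An empty list and a zero total suggest the frame is ready for a numeric-only model.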

## 7. Summary
- Handled missing data using drop, median, and mode
- Encoded categorical features using Label and One-Hot Encoding
- Prepared the dataset for modeling

## 📝 Assignment
Use the **Iris dataset**:
- Check for missing values (introduce some manually if needed)
- Encode the `species` column using:
  - Label Encoding
  - One-Hot Encoding
- Visualize and compare the transformed columns
- Save both versions for future modeling
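
One possible starting point, assuming you load Iris from scikit-learn (which ships the data as an integer `target` rather than a `species` column, so the sketch builds `species` itself). The variable names and the choice of 10 blanked cells are arbitrary; the encoding steps are left to you.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df_iris = iris.frame.copy()

# Map the integer target to species names to get a categorical column
df_iris["species"] = df_iris["target"].map(dict(enumerate(iris.target_names)))
df_iris = df_iris.drop(columns="target")

# Iris has no missing values, so blank out 10 random cells for practice
rows_to_blank = df_iris.sample(n=10, random_state=0).index
df_iris.loc[rows_to_blank, "sepal length (cm)"] = np.nan

print(df_iris.isnull().sum())
```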