# Titanic Dataset Preprocessing 🚢

This notebook prepares the Titanic dataset for modeling:

- Loads raw CSVs from `data/raw/`
- Cleans and imputes missing values
- Encodes categorical variables
- Applies feature engineering
- Saves the processed dataset to `data/processed/processed_titanic.csv`

In [1]:
import pandas as pd
import numpy as np
import os

# Ensure processed folder exists
os.makedirs("../data/processed", exist_ok=True)

In [2]:
# Load raw train and test datasets
train = pd.read_csv("../data/raw/train.csv")
test = pd.read_csv("../data/raw/test.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)
train.head()

Train shape: (891, 12)
Test shape: (418, 11)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 1. Handle Missing Values
We will:
- Fill missing `Age` with the column median
- Fill missing `Embarked` with the most frequent value (the mode)
- Drop `Cabin` (most of its values are missing)
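
Before choosing between imputing and dropping, it can help to look at how much of each column is actually missing. The snippet below is a minimal sketch of that check; the `sample` frame and its values are made up for illustration and are not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the real dataset)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `train` instead of `sample` would show which columns are candidates for dropping rather than imputing.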

In [3]:
df = train.copy()

# Drop Cabin
df = df.drop("Cabin", axis=1)

# Fill missing Age with median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill missing Embarked with mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

## 2. Feature Engineering
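- Extract `Title` from `Name`, merge spelling variants, and group infrequent titles as `Rare`
- Encode `Sex` (0=female, 1=male)
- Map `Embarked` to integers (C=0, Q=1, S=2)
- Create `FamilySize` (`SibSp + Parch + 1`) and `IsAlone` (1 when `FamilySize` is 1)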
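
To make the title and family features concrete, here is a small worked example on three rows copied from the `train.head()` output above. It is only a sketch: the `sample` frame is built by hand so the snippet runs on its own.

```python
import pandas as pd

# Three passengers taken from the train.head() output shown earlier
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# Title: first run of letters that follows a space and ends with a period
titles = sample["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# FamilySize counts the passenger plus siblings/spouses and parents/children
family_size = sample["SibSp"] + sample["Parch"] + 1
is_alone = (family_size == 1).astype(int)

print(titles.tolist())       # ['Mr', 'Miss', 'Mrs']
print(family_size.tolist())  # [2, 1, 2]
print(is_alone.tolist())     # [0, 1, 0]
```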

In [4]:
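# Extract Title: first run of letters that follows a space and ends with a period
# (raw string so that "\." is not treated as an invalid escape sequence)
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
# Merge spelling variants into the common titles
df["Title"] = df["Title"].replace(["Mlle", "Ms"], "Miss")
df["Title"] = df["Title"].replace("Mme", "Mrs")
# Group titles that occur fewer than 10 times under "Rare"
title_counts = df["Title"].value_counts()
rare_titles = title_counts[title_counts < 10].index
df["Title"] = df["Title"].replace(rare_titles, "Rare")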

# Encode categorical variables
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
df["Embarked"] = df["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# Family size & IsAlone
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = np.where(df["FamilySize"] > 1, 0, 1)
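
One thing to watch with the `map` calls above: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected category silently becomes a missing value. The sketch below shows one way to surface such values; the `"X"` port code is invented for the example.

```python
import pandas as pd

# "X" is a made-up port code used to trigger the unmapped case
embarked = pd.Series(["S", "C", "Q", "X"])
encoded = embarked.map({"C": 0, "Q": 1, "S": 2})

# Values that had no entry in the mapping end up as NaN after map()
unmapped = embarked[encoded.isna()].unique()
print(unmapped)
```

An empty result here would mean every category was covered by the mapping.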

## 3. Drop Unused Columns
We remove:
- `PassengerId` (a row identifier with no predictive value)
- `Name`, `Ticket` (near-unique text; the useful part of `Name` is already captured in `Title`)

In [5]:
df.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)
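
Steps 1 to 3 could also be bundled into a single function so that another split, such as the `test` frame loaded earlier, receives the same treatment. The following is a rough sketch under several assumptions: `preprocess`, `age_median`, and `embarked_mode` are hypothetical names that do not appear in this notebook, the `Title` feature is left out for brevity, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

def preprocess(raw: pd.DataFrame, age_median: float, embarked_mode: str) -> pd.DataFrame:
    """Hypothetical helper bundling steps 1-3 (Title omitted for brevity)."""
    out = raw.drop(columns=["Cabin"]).copy()
    # Impute with statistics supplied by the caller (e.g. computed on the training set)
    out["Age"] = out["Age"].fillna(age_median)
    out["Embarked"] = out["Embarked"].fillna(embarked_mode)
    # Encode categorical variables with the same mappings as the notebook
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    out["Embarked"] = out["Embarked"].map({"C": 0, "Q": 1, "S": 2})
    # Family features
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out.drop(columns=["PassengerId", "Name", "Ticket"])

# Made-up rows with the same columns the function touches
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A, Mr. X", "B, Mrs. Y", "C, Miss. Z"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "C"],
})

clean = preprocess(
    sample,
    age_median=sample["Age"].median(),
    embarked_mode=sample["Embarked"].mode()[0],
)
print(clean)
```

Passing the imputation values in as arguments, rather than recomputing them inside the function, is a design choice that lets the training-set median and mode be reused on other data.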

## 4. Save Processed Dataset
We now save the cleaned training data (`df`) to `data/processed/processed_titanic.csv`.

In [6]:
df.to_csv("../data/processed/processed_titanic.csv", index=False)
print("✅ Processed dataset saved to data/processed/processed_titanic.csv")

✅ Processed dataset saved to data/processed/processed_titanic.csv
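
After saving, a quick round trip can confirm that the file reloads with the expected shape and columns. This is a minimal sketch that writes a made-up frame to a temporary directory so it runs standalone; in the notebook the same check would read `../data/processed/processed_titanic.csv` and compare against `df`.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the processed frame
frame = pd.DataFrame({"Survived": [0, 1], "Sex": [1, 0], "Age": [22.0, 38.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_titanic.csv")
    frame.to_csv(path, index=False)  # index=False avoids an extra unnamed column on reload
    reloaded = pd.read_csv(path)

print(reloaded.shape == frame.shape)
print(list(reloaded.columns) == list(frame.columns))
```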
