# Titanic Dataset — Preprocessing

## Objective
Prepare the raw Titanic dataset for machine learning by handling missing values, encoding categorical variables, and splitting into train/test sets.

## Output
At the end of this notebook, we will have:
- `X_train_processed`, `X_test_processed`: feature matrices ready for modeling (the imputed and encoded versions of `X_train`, `X_test`)
- `y_train`, `y_test`: target vectors for training and evaluation

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('data/train.csv')
print(df.shape)
df.head(2)

## 2. Define Label and Features

In [None]:
# Label (target variable)
y = df['Survived']

# Features (all columns except the label)
X = df.drop(columns=['Survived'])

print(f"Label (y): {y.shape}")
print(f"Features (X): {X.shape}")
print(f"\nFeature columns:\n{list(X.columns)}")
X.head()

## 3. Drop Non-Useful Columns

In [None]:
columns_to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']

X = X.drop(columns=columns_to_drop)

print(f"Dropped: {columns_to_drop}")
print(f"Remaining features: {list(X.columns)}")
print(f"Shape: {X.shape}")
X.head()

## 4. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set:  X_train {X_train.shape}, y_train {y_train.shape}")
print(f"Test set:      X_test  {X_test.shape},  y_test  {y_test.shape}")
print(f"\nSurvival ratio in full data:     {y.mean():.4f}")
print(f"Survival ratio in training set: {y_train.mean():.4f}")
print(f"Survival ratio in test set:     {y_test.mean():.4f}")

## 5. Preprocessing Strategy

### Imputation Plan (Handling Missing Values)
| Column | Strategy | Why |
|---|---|---|
| **Age** | Fill with **median** (from training set) | The median is robust to outliers, unlike the mean. Computing it on the training set only avoids data leakage from the test set. |
| **Embarked** | Fill with **mode** (from training set) | Only 2 values are missing in the full dataset. The mode (most frequent value) is a simple, low-risk default for a categorical column with so few gaps. |
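
The imputation plan above can be sketched on toy data. The values below are hypothetical and not taken from the Titanic file; the point is only that the imputer learns its fill value from the training frame and reuses it on the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, for illustration only)
age_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 54.0]})
age_test = pd.DataFrame({"Age": [np.nan, 2.0]})

# Median imputation: the statistic is computed from the training frame only
median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(age_train)
age_train_filled = median_imputer.transform(age_train)
age_test_filled = median_imputer.transform(age_test)  # uses the training median

train_median = median_imputer.statistics_[0]
print(f"Training median used for both sets: {train_median}")

# Mode imputation for a categorical column
# (dtype=object keeps the strings as plain Python objects)
emb_train = pd.DataFrame(
    {"Embarked": pd.Series(["S", "C", "S", np.nan, "Q"], dtype=object)}
)
mode_imputer = SimpleImputer(strategy="most_frequent")
emb_train_filled = mode_imputer.fit_transform(emb_train)
print(f"Missing Embarked filled with: {emb_train_filled[3, 0]}")
```

The missing test-set age is filled with the median of the four known training ages, not with anything computed from the test rows.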

### Encoding Plan (Converting Text to Numbers)
| Column | Strategy | Why |
|---|---|---|
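| **Sex** | **Binary encoding** (female=0, male=1) | Only 2 categories, so a single 0/1 column is enough. With `OneHotEncoder(drop='first')`, the alphabetically first category (`female`) is dropped, leaving one `Sex_male` column. |
| **Embarked** | **One-hot encoding** (C, Q, S → 2 columns) | 3 categories with no natural order, so one-hot avoids implying a ranking between them. With `drop='first'`, `C` becomes the baseline (both columns 0) and only the `Q` and `S` columns are kept. |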

## 6. Implement Preprocessing (Pipeline Approach)

### 6A) Identify Column Types

In [None]:
numeric_cols = ['Age', 'SibSp', 'Parch', 'Fare', 'Pclass']
categorical_cols = ['Sex', 'Embarked']

print(f"Numeric columns:     {numeric_cols}")
print(f"Categorical columns: {categorical_cols}")
print(f"Total: {len(numeric_cols) + len(categorical_cols)} features")

### 6B) Build Preprocessing Transformer

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numeric pipeline: fill missing values with median
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# Categorical pipeline: fill missing values with most frequent, then one-hot encode
# (drop='first' removes the first category of each column, so Sex -> 1 column, Embarked -> 2)
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine both pipelines into one preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

print("Preprocessor created successfully!")
print(preprocessor)

## 7. Fit and Transform

In [8]:
# Step 1: Learn from training data + transform it
X_train_processed = preprocessor.fit_transform(X_train)

# Step 2: Transform test data using what was learned from training
X_test_processed = preprocessor.transform(X_test)

print(f"X_train_processed shape: {X_train_processed.shape}")
print(f"X_test_processed shape:  {X_test_processed.shape}")
print(f"\nColumns match: {X_train_processed.shape[1] == X_test_processed.shape[1]}")
print(f"NaNs in train: {np.isnan(X_train_processed).sum()}")
print(f"NaNs in test:  {np.isnan(X_test_processed).sum()}")

X_train_processed shape: (712, 8)
X_test_processed shape:  (179, 8)

Columns match: True
NaNs in train: 0
NaNs in test:  0
