# Preprocessing & Feature Engineering

## Phase 2 — Building the ML Input Pipeline

**Objective:**
Transform raw data into a clean, consistent and leakage-free feature set
ready for machine learning models.

All transformations are designed to:
- Be learned only from the training data
- Be reproducible
- Preserve the semantic meaning of each feature (for example, the order of ordinal levels)


In [2]:
import sys
import os

# Add the project root (one level above this notebook) to the Python path
# so that the `src` package can be imported below
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [3]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer

In [4]:
train_path = "../data/raw/train.csv"
test_path = "../data/raw/test.csv"

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [5]:
from src.features import (
    NUMERIC_FEATURES,
    ORDINAL_FEATURES,
    NOMINAL_FEATURES,
    TEMPORAL_FEATURES
)
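
Before wiring these lists into a pipeline, it can help to confirm that every listed column exists in the data and that no column is assigned to two groups. The sketch below is an assumption, not part of `src.features`: the helper `check_feature_groups`, the toy frame, and its column names are all hypothetical.

```python
from collections import Counter

import pandas as pd


def check_feature_groups(df, groups):
    """Return (missing, duplicated) column names for a dict of feature groups.

    Hypothetical helper: `groups` maps a group name to a list of column names.
    """
    all_cols = [col for cols in groups.values() for col in cols]
    # Columns named in a group but absent from the frame
    missing = sorted(set(all_cols) - set(df.columns))
    # Columns that appear in more than one group (or twice in the same group)
    duplicated = sorted(col for col, n in Counter(all_cols).items() if n > 1)
    return missing, duplicated


# Toy data with made-up column names, for illustration only
toy_df = pd.DataFrame({
    "area": [8450, 9600],
    "quality": ["Gd", "TA"],
    "street": ["Pave", "Pave"],
})
toy_groups = {
    "numeric": ["area"],
    "ordinal": ["quality"],
    "nominal": ["street", "quality"],  # deliberate overlap with "ordinal"
    "temporal": ["year_sold"],         # deliberately absent from toy_df
}

missing, duplicated = check_feature_groups(toy_df, toy_groups)
print("missing:", missing)
print("duplicated:", duplicated)
```

In the notebook, the same check could be run with `train_df` and the four imported lists.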

## Step 3 — Preprocessing Pipelines

Preprocessing pipelines ensure that all transformations are:
- Applied consistently
- Learned only from training data
- Reproducible across experiments

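Each transformer estimates its parameters (medians, most frequent values, category lists) in `fit` and only applies them in `transform`. As long as the pipeline is fitted on the training set alone, no statistic from the test set can reach the features, which is what prevents data leakage here.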
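
The effect of fitting on the training data only can be seen on a toy example. This is a minimal sketch with made-up values and a hypothetical `area` column: the median learned from the training rows is compared with the median that would result from (incorrectly) fitting on train and test together.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, for illustration only
toy_train = pd.DataFrame({"area": [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"area": [1000.0, np.nan]})

# Correct: the median is learned from the training rows only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)
train_median = imputer.statistics_[0]

# The missing test value is filled with the training median
test_filled = imputer.transform(toy_test)

# Leaky: fitting on train + test lets the test outlier shift the median
leaky_imputer = SimpleImputer(strategy="median")
leaky_imputer.fit(pd.concat([toy_train, toy_test], ignore_index=True))
leaky_median = leaky_imputer.statistics_[0]

print("train-only median:", train_median)
print("train+test median:", leaky_median)
```

The two medians differ because the second one has seen the test rows, which is the kind of leakage the pipelines are meant to rule out.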


In [6]:
# Numeric features: fill missing values with the per-column median learned during fit
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

In [7]:
# Shared quality scale, ordered from lowest ("Po") to highest ("Ex")
QUALITY_LEVELS = ["Po", "Fa", "TA", "Gd", "Ex"]

# OrdinalEncoder expects one category list per ordinal column;
# every ordinal feature here uses the same scale
ordinal_categories = [QUALITY_LEVELS] * len(ORDINAL_FEATURES)

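# Ordinal features: fill missing values with the most frequent level,
# then map each level to its position in the ordered category list
ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ordinal_categories))
])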
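
To make the encoding concrete, here is a small sketch of the same imputer and encoder combination on toy data. The column names `qual_a` and `qual_b` are hypothetical; only the quality scale is taken from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

quality_levels = ["Po", "Fa", "TA", "Gd", "Ex"]
toy_cols = ["qual_a", "qual_b"]  # hypothetical column names

# Toy data with one missing value; object dtype keeps NaN next to strings
toy_train = pd.DataFrame(
    {
        "qual_a": ["Gd", "TA", np.nan, "Gd"],
        "qual_b": ["Ex", "Po", "Fa", "Ex"],
    },
    dtype=object,
)

toy_ordinal = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # One category list per column, in the same order as the columns
    ("encoder", OrdinalEncoder(categories=[quality_levels] * len(toy_cols))),
])

# The missing qual_a value becomes "Gd" (most frequent), then each level
# is replaced by its index in quality_levels: Po=0, Fa=1, TA=2, Gd=3, Ex=4
encoded = toy_ordinal.fit_transform(toy_train)
print(encoded)
```

Because the category order is passed explicitly, the integer codes follow the quality scale rather than alphabetical order.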

In [8]:
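# Nominal features: fill missing values with the most frequent category,
# then one-hot encode; a category unseen during fit is encoded as all zeros
# for that feature instead of raising an error
nominal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])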
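
The behaviour of `handle_unknown="ignore"` is easiest to see on a category that appears only at transform time. A minimal sketch, assuming a hypothetical `street` column and made-up values:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Dirt" never appears in the training rows
toy_train = pd.DataFrame({"street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"street": ["Grvl", "Dirt"]})

toy_nominal = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
toy_nominal.fit(toy_train)

# Categories learned from the training data, in sorted order
learned = toy_nominal.named_steps["encoder"].categories_[0].tolist()

# OneHotEncoder returns a sparse matrix by default; densify for printing
encoded_test = toy_nominal.transform(toy_test).toarray()

print(learned)
print(encoded_test)
```

The unseen category produces a row of zeros, so the output width stays fixed at the number of categories seen during fit.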

In [9]:
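# Route each feature group to its pipeline. Columns not listed in any of the
# three groups are discarded (remainder="drop"); this includes the columns in
# TEMPORAL_FEATURES unless they also appear in one of the groups.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, NUMERIC_FEATURES),
        ("ord", ordinal_pipeline, ORDINAL_FEATURES),
        ("nom", nominal_pipeline, NOMINAL_FEATURES)
    ],
    remainder="drop"
)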
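
A fitted `ColumnTransformer` is used like any other scikit-learn transformer: `fit_transform` on the training frame, then `transform` on the test frame. The sketch below rebuilds a miniature version of the preprocessor on toy data so that it runs on its own; the column names and values are hypothetical, and the notebook's `preprocessor`, `train_df` and `test_df` would take their place in practice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature groups for the toy frame
num_cols, ord_cols, nom_cols = ["area"], ["quality"], ["street"]

toy_train = pd.DataFrame({
    "area": [10.0, np.nan, 30.0],
    "quality": ["Gd", "TA", "Ex"],
    "street": ["Pave", "Grvl", "Pave"],
    "year_sold": [2008, 2009, 2010],  # in no group, so it is dropped
})
toy_test = pd.DataFrame({
    "area": [np.nan],
    "quality": ["Po"],
    "street": ["Pave"],
    "year_sold": [2010],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("ord", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])),
        ]), ord_cols),
        ("nom", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), nom_cols),
    ],
    remainder="drop",
)

# Learn medians, modes and categories from the training rows only
X_train = toy_preprocessor.fit_transform(toy_train)
# Apply the already-learned parameters to the test rows
X_test = toy_preprocessor.transform(toy_test)

print(X_train.shape, X_test.shape)
print(X_test)
```

Depending on how sparse the one-hot block is, `ColumnTransformer` may return a SciPy sparse matrix instead of a NumPy array; the toy output here is dense because most entries are non-zero.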