# **Preprocessing Pipeline**

This notebook implements the final preprocessing pipeline for the real estate price prediction system.

The pipeline is built directly on top of the conclusions from:
- the Exploratory Data Analysis (EDA), which revealed strong skewness of the target variable, heavy noise, missing values, redundant area-related features and high-cardinality categorical variables,
- and the Data Cleaning step, which removed duplicated records, invalid target values and physically impossible numerical values.

After the cleaning stage, the dataset is logically consistent but still contains:
- a large number of missing values,
- heterogeneous numerical scales,
- high-cardinality categorical variables,
- redundant and noisy features.

The goal of this notebook is to build a fully reproducible, production-style preprocessing pipeline that:
- performs feature engineering based on domain and EDA insights,
- unifies and simplifies redundant features (especially area-related features),
- handles missing values in a robust and statistically sound way,
- reduces the impact of outliers,
- encodes categorical variables in a scalable way,
- scales numerical features where appropriate,
- and produces a final model-ready feature matrix.

All transformations are implemented using scikit-learn `Pipeline` and `ColumnTransformer` objects, ensuring that exactly the same preprocessing logic can be reused during:
- model training,
- model evaluation,
- batch inference,
- and real-time inference in the API.


## 1. Imports and Configuration

In this section we import all required libraries, scikit-learn components and custom transformers.
This notebook uses the custom preprocessing logic implemented in `src/custom_transformers.py`.


In [2]:
import sys
import os

if os.getcwd().endswith("notebooks"):
    sys.path.append("..")


In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Custom transformers
from src.custom_transformers import (
    AreaFeatureSelector,
    RareCategoryGrouper,
    BooleanNormalizer,
    DateFeatureExtractor,
    OutlierClipper,
    ColumnDropper,
)

import joblib

# Display settings
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)


## 2. Load Cleaned Data

In [4]:
import os
import pandas as pd

# Ensure working directory is project root
print("CWD:", os.getcwd())
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("Changed CWD to:", os.getcwd())

DATA_PATH = "data/portugal_cleaned.csv"

df = pd.read_csv(DATA_PATH)
print(df.shape)
df.sample(5)


CWD: c:\projetcs\real-estate-ml-system\notebooks
Changed CWD to: c:\projetcs\real-estate-ml-system
(121184, 26)



Unnamed: 0,Price,District,City,Town,Type,EnergyCertificate,GrossArea,TotalArea,Parking,HasParking,Floor,ConstructionYear,EnergyEfficiencyLevel,PublishDate,Garage,Elevator,ElectricCarsCharging,TotalRooms,NumberOfBedrooms,NumberOfWC,ConservationStatus,LivingArea,LotSize,BuiltArea,NumberOfBathrooms,LogPrice
63654,75000.0,Aveiro,Espinho,Silvalde,Land,NC,,288.0,0.0,,,2003.0,NC,,False,False,False,0.0,0.0,0.0,,288.0,288.0,,0.0,11.225257
10075,119500.0,Leiria,Castanheira de Pêra,Castanheira de Pêra e Coentral,House,NC,,1050.0,0.0,False,,1990.0,,,,False,,9.0,,,,298.0,,,5.0,11.69108
31054,125000.0,Setúbal,Barreiro,Santo António da Charneca,Apartment,D,,62.0,0.0,False,Ground Floor,1981.0,,,,True,,1.0,,,,62.0,,,1.0,11.736077
116432,15000.0,Coimbra,Oliveira do Hospital,Seixo da Beira,House,NC,,70.0,0.0,,,2019.0,NC,,False,False,False,5.0,5.0,2.0,Needs renovation,65.0,1354.0,824.0,2.0,9.615872
55242,195000.0,Braga,Braga,São Victor,Apartment,NC,,125.0,1.0,True,7th Floor,1994.0,,,,True,,3.0,,,,125.0,,,2.0,12.18076


## 3. Basic sanity check

Before building the preprocessing pipeline, we perform a quick sanity check to confirm:
- data types of all columns,
- presence of missing values,
- overall consistency of the dataset after cleaning.


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121184 entries, 0 to 121183
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Price                  121184 non-null  float64
 1   District               121184 non-null  object 
 2   City                   121184 non-null  object 
 3   Town                   121184 non-null  object 
 4   Type                   121182 non-null  object 
 5   EnergyCertificate      121184 non-null  object 
 6   GrossArea              25208 non-null   float64
 7   TotalArea              112412 non-null  float64
 8   Parking                121071 non-null  float64
 9   HasParking             61025 non-null   object 
 10  Floor                  25992 non-null   object 
 11  ConstructionYear       82141 non-null   float64
 12  EnergyEfficiencyLevel  60159 non-null   object 
 13  PublishDate            27196 non-null   object 
 14  Garage                 60159 non-nul

In [6]:
df.isna().mean().sort_values(ascending=False).head(20)

ConservationStatus       0.853009
BuiltArea                0.801079
GrossArea                0.791986
Floor                    0.785516
PublishDate              0.775581
LotSize                  0.732407
NumberOfBedrooms         0.643559
NumberOfWC               0.578162
EnergyEfficiencyLevel    0.503573
Garage                   0.503573
ElectricCarsCharging     0.503573
HasParking               0.496427
TotalRooms               0.441923
ConstructionYear         0.322179
LivingArea               0.209013
TotalArea                0.072386
NumberOfBathrooms        0.047647
Parking                  0.000932
Type                     0.000017
Town                     0.000000
dtype: float64

## 4. Define target and feature matrix

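We use the log-transformed price (`LogPrice`) as the modeling target, as motivated in the EDA.

All remaining columns are treated as input features. Note that this still includes the raw `Price` column, from which `LogPrice` is derived; it must be excluded before model training, otherwise the model can read the target directly from its inputs.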


In [7]:
TARGET = "LogPrice"

X = df.drop(columns=[TARGET])
y = df[TARGET]

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (121184, 25)
y shape: (121184,)
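
The sampled rows shown earlier are consistent with `LogPrice = log1p(Price)` (for example, `log1p(75000)` is approximately 11.225257), although the exact transform is defined in the earlier notebooks and is treated as an assumption here. Under that assumption, the following minimal sketch on a toy frame separates the target from the features, leaves out the column the target is computed from, and maps log-scale values back to prices. The names `toy`, `LEAKY_COLS`, `X_toy` and `y_toy` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are illustrative).
toy = pd.DataFrame(
    {
        "Price": [75000.0, 119500.0, 125000.0],
        "TotalArea": [288.0, 1050.0, 62.0],
    }
)
# Assumption: the target was built as log1p(Price) in an earlier notebook.
toy["LogPrice"] = np.log1p(toy["Price"])

TARGET = "LogPrice"
LEAKY_COLS = ["Price"]  # columns the target is computed from

# Drop the target and its source column from the feature matrix.
X_toy = toy.drop(columns=[TARGET] + LEAKY_COLS)
y_toy = toy[TARGET]

# Values on the log scale are mapped back with the inverse transform.
price_back = np.expm1(y_toy)

print(X_toy.columns.tolist())
print(np.allclose(price_back, toy["Price"]))
```

If the target was in fact built with a different transform, the inverse in the last step would have to change accordingly.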


## 5. Train / test split

We split the dataset into training and test sets before fitting any preprocessing steps.

This ensures that:
- all imputations,
- all encodings,
- and all scaling parameters

are learned strictly from the training data, preventing data leakage.
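
A small sketch of what this means in practice, using `SimpleImputer` on synthetic data: the imputation value is learned from the training rows only, and the same value is then reused on the test rows. The data and variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic single-feature data with missing values (illustrative only).
rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=20.0, size=(200, 1))
values[::10] = np.nan

train, test = train_test_split(values, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # statistics come from the training rows only

train_median = np.nanmedian(train)
test_filled = imputer.transform(test)

# The fitted statistic equals the training median.
print(np.isclose(imputer.statistics_[0], train_median))
# Missing test entries are filled with that same training median.
print(np.all(test_filled[np.isnan(test)] == imputer.statistics_[0]))
```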


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (96947, 25)
Test shape: (24237, 25)


## 6. Feature groups and transformation plan

Based on the EDA and the data cleaning step, we explicitly define groups of features and
their intended transformations.

The dataset contains:
- multiple redundant area-related features,
- several binary categorical features,
- high-cardinality categorical features (especially location-related),
- numerical features with missing values and strong skewness,
- and a date feature that needs to be decomposed into numerical components.

We therefore split the features into the following logical groups:
- area-related features (to be unified into a single feature),
- binary categorical features (to be normalized to 0/1),
- high-cardinality categorical features (to be grouped and one-hot encoded),
- numerical features (to be imputed, clipped and scaled).
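
The `OutlierClipper` imported from `src/custom_transformers.py` is not shown in this notebook, so its interface is unknown here. As a rough illustration of how a clipping step could sit between imputation and scaling, the sketch below uses a hypothetical stand-in, `QuantileClipper`, that learns per-column quantile bounds during `fit`. The class name, its parameters and the chosen quantiles are assumptions, not the project's implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: clip each column to quantile bounds learned in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # nanquantile ignores missing values when learning the bounds.
        self.lower_bounds_ = np.nanquantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


numeric_sketch = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("clipper", QuantileClipper(lower=0.05, upper=0.95)),
        ("scaler", StandardScaler()),
    ]
)

# One extreme value among otherwise moderate ones, plus a missing entry.
X_demo = np.array([[50.0], [60.0], [70.0], [80.0], [np.nan], [10_000.0]])

# Run only the first two steps to inspect the clipped (unscaled) values.
clipped = numeric_sketch[:2].fit_transform(X_demo)
print(clipped.max() < 10_000.0)
```

Because the bounds are learned in `fit`, they come from the training data only and are reused unchanged on the test set.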


In [9]:
# Area-related features (will be merged into MainArea)
area_cols = ["LivingArea", "TotalArea", "BuiltArea", "LotSize", "GrossArea"]

# Binary-like categorical features
binary_cols = ["HasParking", "Garage", "Elevator", "ElectricCarsCharging"]

# Categorical features (City and Town are high-cardinality)
categorical_cols = [
    "District",
    "City",
    "Town",
    "Type",
    "EnergyCertificate",
    "ConservationStatus",
]

# Numerical features (after feature engineering)
numeric_cols = [
    "MainArea",
    "TotalRooms",
    "NumberOfBedrooms",
    "NumberOfBathrooms",
    "NumberOfWC",
    "ConstructionYear",
    "PublishYear",
    "PublishMonth",
]

print("Area columns:", area_cols)
print("Binary columns:", binary_cols)
print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)


Area columns: ['LivingArea', 'TotalArea', 'BuiltArea', 'LotSize', 'GrossArea']
Binary columns: ['HasParking', 'Garage', 'Elevator', 'ElectricCarsCharging']
Categorical columns: ['District', 'City', 'Town', 'Type', 'EnergyCertificate', 'ConservationStatus']
Numeric columns: ['MainArea', 'TotalRooms', 'NumberOfBedrooms', 'NumberOfBathrooms', 'NumberOfWC', 'ConstructionYear', 'PublishYear', 'PublishMonth']


## 7. Feature engineering pipeline (structural transformations)

In this section we build the first stage of the preprocessing pipeline, responsible for
structural feature transformations and feature engineering.

This stage:
- unifies multiple area-related features into a single `MainArea` feature,
- extracts numerical features from the `PublishDate` column,
- normalizes binary categorical features to 0/1 values,
- groups rare categories in high-cardinality categorical features.

These transformations change the structure of the dataset and must therefore be applied
before the classical numerical and categorical preprocessing steps.
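
The custom transformers live in `src/custom_transformers.py` and are not reproduced in this notebook. To illustrate the general idea behind frequency-based grouping, the sketch below defines a hypothetical stand-in, `RareGrouperSketch`. Its name, the `other_label` parameter and the choice to leave missing values untouched are assumptions; the project's `RareCategoryGrouper` may behave differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RareGrouperSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: replace categories rarer than min_frequency with a label."""

    def __init__(self, columns, min_frequency=0.001, other_label="Other"):
        self.columns = columns
        self.min_frequency = min_frequency
        self.other_label = other_label

    def fit(self, X, y=None):
        # Learn, per column, which categories are frequent enough to keep.
        self.frequent_ = {}
        for col in self.columns:
            shares = X[col].value_counts(normalize=True)
            self.frequent_[col] = set(shares[shares >= self.min_frequency].index)
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            keep = X[col].isin(self.frequent_[col]) | X[col].isna()
            # Rare or unseen categories collapse into one label;
            # missing values are left as they are for a later imputer.
            X[col] = X[col].where(keep, self.other_label)
        return X


train = pd.DataFrame({"City": ["Porto"] * 6 + ["Lisboa"] * 3 + ["Seia"]})
test = pd.DataFrame({"City": ["Porto", "Seia", "Faro", None]})

grouper = RareGrouperSketch(columns=["City"], min_frequency=0.2).fit(train)
grouped = grouper.transform(test)["City"].tolist()
print(grouped)
```

The set of frequent categories is learned in `fit`, so a category that is rare in training, or never seen there, maps to the same label at inference time.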


In [10]:
# Pipeline responsible for structural feature engineering
feature_engineering_pipeline = Pipeline(
    steps=[
        ("area_selector", AreaFeatureSelector(area_cols)),
        ("date_features", DateFeatureExtractor(column="PublishDate")),
        ("boolean_normalizer", BooleanNormalizer(binary_cols)),
        ("rare_grouper", RareCategoryGrouper(columns=["City", "Town"], min_frequency=0.001)),
    ]
)

feature_engineering_pipeline


In [11]:
X_train_fe = feature_engineering_pipeline.fit_transform(X_train)
X_test_fe = feature_engineering_pipeline.transform(X_test)

print("Shape before:", X_train.shape)
print("Shape after feature engineering:", X_train_fe.shape)

X_train_fe.sample(5)

Shape before: (96947, 25)
Shape after feature engineering: (96947, 22)


Unnamed: 0,Price,District,City,Town,Type,EnergyCertificate,Parking,HasParking,Floor,ConstructionYear,EnergyEfficiencyLevel,Garage,Elevator,ElectricCarsCharging,TotalRooms,NumberOfBedrooms,NumberOfWC,ConservationStatus,NumberOfBathrooms,MainArea,PublishYear,PublishMonth
79500,140000.0,Lisboa,Sintra,Agualva e Mira-Sintra,Apartment,C,0.0,,,1965.0,C,0.0,0,0.0,3.0,2.0,,,2.0,49.0,2024.0,10.0
83876,550000.0,Lisboa,Mafra,Other,House,NC,1.0,,,2024.0,NC,0.0,0,0.0,5.0,5.0,0.0,Needs renovation,6.0,414.0,2024.0,4.0
72263,37500.0,Coimbra,Coimbra,Other,Land,NC,0.0,,,1950.0,NC,0.0,0,0.0,0.0,0.0,0.0,New,0.0,680.0,,
93568,240000.0,Porto,Porto,Paranhos,Apartment,NC,0.0,,,2023.0,NC,0.0,0,0.0,4.0,2.0,0.0,,1.0,60.0,,
76137,25000.0,Guarda,Seia,Other,Land,NC,0.0,,,,NC,0.0,0,0.0,,,,,,1492.0,2024.0,4.0
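
The output above shows the five area columns replaced by `MainArea` and `PublishDate` replaced by `PublishYear` and `PublishMonth`, which accounts for the change from 25 to 22 columns (25 - 5 + 1 - 1 + 2). The exact rules inside `AreaFeatureSelector` and `DateFeatureExtractor` are not shown in this notebook. The sketch below is one plausible reading, assuming that `MainArea` is the first non-missing area in a fixed priority order and that dates are parseable by `pd.to_datetime`; the helper names and the priority order are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed priority order; the real transformer may use a different rule.
AREA_PRIORITY = ["LivingArea", "TotalArea", "BuiltArea", "LotSize", "GrossArea"]


def add_main_area(df, area_cols=AREA_PRIORITY):
    """Hypothetical rule: take the first non-missing area in priority order."""
    out = df.copy()
    # bfill along columns pulls the first non-null value into the first column.
    out["MainArea"] = out[area_cols].bfill(axis=1).iloc[:, 0]
    return out.drop(columns=area_cols)


def add_publish_parts(df, column="PublishDate"):
    """Split a date column into year and month; unparseable dates become NaN."""
    out = df.copy()
    dates = pd.to_datetime(out[column], errors="coerce")
    out["PublishYear"] = dates.dt.year
    out["PublishMonth"] = dates.dt.month
    return out.drop(columns=[column])


toy = pd.DataFrame(
    {
        "LivingArea": [62.0, np.nan, np.nan],
        "TotalArea": [62.0, 288.0, np.nan],
        "BuiltArea": [np.nan, np.nan, np.nan],
        "LotSize": [np.nan, 288.0, 1492.0],
        "GrossArea": [np.nan, np.nan, np.nan],
        "PublishDate": ["2024-10-03", None, "2024-04-17"],
    }
)

result = add_publish_parts(add_main_area(toy))
print(result)
```

Both helpers are stateless, so unlike the rare-category grouping they learn nothing from the training data.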


## 8. Numerical and categorical preprocessing pipelines

In this section we build the second stage of the preprocessing pipeline, responsible for:

- imputing missing values,
- scaling numerical features,
- encoding categorical features using One-Hot Encoding.

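This stage prepares the data in a fully numerical form suitable for machine learning models. Column groups are selected by dtype from the engineered frame, so the normalized 0/1 binary columns are handled by the numerical pipeline.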


In [None]:
num_features = X_train_fe.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_features = X_train_fe.select_dtypes(include=["object"]).columns.tolist()

print("Numerical features:", num_features)
print("Categorical features:", cat_features)

Numerical features: ['Price', 'Parking', 'HasParking', 'ConstructionYear', 'Garage', 'Elevator', 'ElectricCarsCharging', 'TotalRooms', 'NumberOfBedrooms', 'NumberOfWC', 'NumberOfBathrooms', 'MainArea', 'PublishYear', 'PublishMonth']
Categorical features: ['District', 'City', 'Town', 'Type', 'EnergyCertificate', 'Floor', 'EnergyEfficiencyLevel', 'ConservationStatus']


In [15]:
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

In [16]:
preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, num_features),
        ("cat", categorical_pipeline, cat_features),
    ]
)

preprocessing_pipeline

In [17]:
X_train_processed = preprocessing_pipeline.fit_transform(X_train_fe)
X_test_processed = preprocessing_pipeline.transform(X_test_fe)

print("Final train shape:", X_train_processed.shape)
print("Final test shape:", X_test_processed.shape)

Final train shape: (96947, 509)
Final test shape: (24237, 509)


In [18]:
# Check for NaN values

print("Any NaNs in train:", np.isnan(X_train_processed).any())
print("Any NaNs in test:", np.isnan(X_test_processed).any())


Any NaNs in train: False
Any NaNs in test: False


In [None]:
# Save both pipelines; create the output directory first if it does not exist yet
os.makedirs("artifacts", exist_ok=True)
joblib.dump(preprocessing_pipeline, "artifacts/preprocessing_pipeline.joblib")
joblib.dump(feature_engineering_pipeline, "artifacts/feature_engineering_pipeline.joblib")


['artifacts/feature_engineering_pipeline.joblib']

## Summary

In this notebook we designed and implemented a complete, production-ready preprocessing pipeline for the real estate price prediction system. The pipeline builds directly on the findings and recommendations of the Exploratory Data Analysis (EDA) and the Data Cleaning stage.

The main objective of this stage was to transform the cleaned dataset into a fully numerical, consistent, and model-ready representation while preserving as much useful information as possible and avoiding any form of data leakage.

---

### Overall pipeline structure

The preprocessing process was deliberately split into two logically separate stages that are applied in sequence:

1. **Feature Engineering Pipeline** – responsible for structural and semantic transformations of the dataset.
2. **Preprocessing Pipeline** – responsible for numerical preparation of features for machine learning models.

This separation makes the system:
- easier to maintain,
- easier to debug,
- and easier to reuse in both training and inference environments.

---

### Feature Engineering Pipeline

This first stage performs transformations that change the structure or meaning of the data based on domain knowledge and EDA insights:

- **Area feature unification**  
  Multiple redundant and partially inconsistent area-related columns (`LivingArea`, `TotalArea`, `BuiltArea`, `LotSize`, `GrossArea`) are consolidated into a single `MainArea` feature.

- **Date feature extraction**  
  The `PublishDate` column is decomposed into numerical time-based components (`PublishYear`, `PublishMonth`) and the original column is removed.

- **Binary feature normalization**  
  Binary features (`Garage`, `Elevator`, `HasParking`, `ElectricCarsCharging`) are normalized into consistent 0/1 representations.

- **Rare category grouping**  
  High-cardinality categorical features such as `City` and `Town` are processed using frequency-based grouping to replace very rare categories with a common `"Other"` category.  
  This limits the growth in dimensionality after one-hot encoding and is intended to reduce noise and improve generalization.

These transformations are implemented using custom scikit-learn compatible transformers and combined into a single reusable feature engineering pipeline.

---

### Preprocessing Pipeline

The second stage transforms the engineered dataset into a fully numerical matrix suitable for machine learning models:

- **Numerical features**
  - Missing values are imputed using the median.
  - Features are standardized using `StandardScaler`.

- **Categorical features**
  - Missing values are imputed using the most frequent value.
  - Features are encoded using `OneHotEncoder`.

This stage is implemented using a `ColumnTransformer` to ensure that each group of features is processed with the correct strategy.

---

### Train / Test safety and data leakage prevention

The dataset is split into training and test sets **before** fitting any preprocessing steps.

All transformers:
- are fitted only on the training data,
- and then applied to the test data.

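This ensures that:
- no statistics computed from the test set influence the fitted transformers,
- evaluation reflects how the pipelines behave on unseen data,
- and the same fitted transformations are applied at inference time as during training.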

---

### Final result

After applying the full preprocessing pipeline:

- The dataset contains **no missing values**.
- All features are **fully numerical**.
- The original feature space is transformed into **509 final model-ready features**.
- The entire preprocessing logic is encapsulated in **reusable, serializable pipelines**.

Both:
- the **feature engineering pipeline**  
- and the **full preprocessing pipeline**  

are saved as artifacts and can be reused:
- during model training,
- during validation,
- and in production inference inside the API.

---

### Role in the full system

This notebook defines the **single source of truth** for all preprocessing logic in the project.

All future modeling notebooks and the production system will:
- load the same saved pipelines,
- apply identical transformations,
- and therefore remain fully consistent and reproducible across the entire ML lifecycle.

This completes the data preparation stage and provides a solid, reliable foundation for model training and evaluation.

---
