## Step 1: Preprocessing Strategies ##

* Define features (X) and target variable (y).

* Split the dataset into train, validation, and test sets (60/20/20 split).

**Linear Model Preprocessor (5 steps, Python object)**

*The target `price` is not imputed, scaled, capped, or encoded.*

TRAIN SET:
- Handle missing values: missingness is meaningful here; impute NaN with the median and add a missing_flag indicator;
- Handle skewness: log-transform skewed features and the target; cap outliers at the 1st and 99th percentiles;
- Encode categorical variables (TargetEncoder and OneHotEncoder) and scale numeric features (StandardScaler);

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.
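
As a minimal sketch of what "parameters learned from the training set" means, the example below fits a median imputer on a toy training column and re-uses that median on a toy test column. The arrays and their values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one numeric column with a missing entry.
X_train_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_toy = np.array([[np.nan], [10.0]])

# Fit on the training split only: the median of the observed values is learned here.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train_toy)

# Transform the test split with the training median. The second output
# column is the missing-value flag (1.0 where the input was NaN).
X_test_imputed = imputer.transform(X_test_toy)
print(imputer.statistics_)
print(X_test_imputed)
```

The test split never contributes to the median, so no information leaks from it into the fitted preprocessor.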

**Random Forest & Boosting Models Preprocessor (2 steps)**

TRAIN SET:
- Handle missing values: impute NaN with -1 and add a missing_flag indicator;
- Handle skewness: not necessary;
- Encode categorical variables (TargetEncoder); scaling (StandardScaler or MinMaxScaler) is still an open question, although tree-based models are generally insensitive to feature scaling;

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.
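
The missing-value step for tree models could be sketched as follows. This is an illustration on made-up arrays, not the notebook's final preprocessor, and it assumes that -1 never occurs as a real value in the numeric columns (areas, counts, and years are non-negative).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (hypothetical values) with gaps in both columns.
X_train_toy = np.array([[120.0, 2.0], [np.nan, 3.0], [85.0, np.nan]])
X_val_toy = np.array([[np.nan, 1.0]])

# Replace NaN with the sentinel -1 and append one 0/1 flag per column
# that had missing values in the training data.
tree_imputer = SimpleImputer(strategy="constant", fill_value=-1, add_indicator=True)
X_train_out = tree_imputer.fit_transform(X_train_toy)
X_val_out = tree_imputer.transform(X_val_toy)

print(X_train_out)
print(X_val_out)
```

The output columns are the two imputed features followed by their two missing flags, in that order.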


In [35]:
import pandas as pd

filename = "cleaned_properties.csv"
df = pd.read_csv(filename)
df.columns
df

Unnamed: 0,id,price,property_type,subproperty_type,region,province,locality,zip_code,latitude,longitude,...,fl_garden,garden_sqm,fl_swimming_pool,fl_floodzone,state_building,primary_energy_consumption_sqm,epc,heating_type,fl_double_glazing,cadastral_income
0,34221000,225000.0,APARTMENT,APARTMENT,Flanders,Antwerp,Antwerp,2050,51.217172,4.379982,...,0,0.0,0,0,MISSING,231.0,poor,GAS,1,922.0
1,2104000,449000.0,HOUSE,HOUSE,Flanders,East Flanders,Gent,9185,51.174944,3.845248,...,0,0.0,0,0,MISSING,221.0,poor,MISSING,1,406.0
2,34036000,335000.0,APARTMENT,APARTMENT,Brussels-Capital,Brussels,Brussels,1070,50.842043,4.334543,...,0,0.0,0,1,AS_NEW,,MISSING,GAS,0,
3,58496000,501000.0,HOUSE,HOUSE,Flanders,Antwerp,Turnhout,2275,51.238312,4.817192,...,0,0.0,0,1,MISSING,99.0,excellent,MISSING,0,
4,48727000,982700.0,APARTMENT,DUPLEX,Wallonia,Walloon Brabant,Nivelles,1410,,,...,1,142.0,0,0,AS_NEW,19.0,excellent,GAS,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75503,30785000,210000.0,APARTMENT,APARTMENT,Wallonia,Hainaut,Tournai,7640,,,...,0,0.0,0,1,AS_NEW,,MISSING,MISSING,1,
75504,13524000,780000.0,APARTMENT,PENTHOUSE,Brussels-Capital,Brussels,Brussels,1200,50.840183,4.435570,...,0,0.0,0,0,AS_NEW,95.0,good,GAS,1,
75505,43812000,798000.0,HOUSE,MIXED_USE_BUILDING,Brussels-Capital,Brussels,Brussels,1080,,,...,0,0.0,0,1,TO_RENOVATE,351.0,bad,GAS,0,
75506,49707000,575000.0,HOUSE,VILLA,Flanders,West Flanders,Veurne,8670,,,...,1,,0,1,AS_NEW,269.0,poor,GAS,1,795.0


In [36]:
#Define features (X) and target variable (y)

from sklearn.model_selection import train_test_split

X = df.drop(columns = ["price","id","zip_code","latitude","longitude"])
y = df["price"]
type(y)

pandas.core.series.Series

In [37]:
# Split into train/validation/test sets (60/20/20):
# hold out 20% for test first, then take 25% of the remaining 80% (= 20% overall) for validation
from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(X,y, test_size=0.2, random_state = 86)

X_train, X_val, y_train, y_val = train_test_split(X_temp,y_temp, test_size = 0.25, random_state = 86)
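
A quick way to confirm that the two-stage split yields 60/20/20 is to run the same calls on a toy array and count rows. The `*_toy` names and the 1000-row array are assumptions made for this check only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows so the proportions are easy to read off.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000)

# Same two-stage split as above: 20% test, then 25% of the rest for validation.
X_tmp_toy, X_test_toy, y_tmp_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=86
)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_tmp_toy, y_tmp_toy, test_size=0.25, random_state=86
)

n = len(X_toy)
print(len(X_train_toy) / n, len(X_val_toy) / n, len(X_test_toy) / n)
```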


In [38]:
print(type(X_train))
print(type(X_test))
print(type(X_val))
print(type(y_train))
print(type(y_test))
print(type(y_val))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


## Linear Model Preprocessor 5 steps ##
impute → cap → log → scale → encode
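
To make the order of the numeric steps concrete, here is a NumPy-only sketch of impute → cap → log → scale on a single made-up column. The values and the 5%/95% thresholds are illustrative assumptions, and the encode step is left out because it only applies to categorical columns.

```python
import numpy as np

# Hypothetical right-skewed column with one gap and one extreme value per split.
train_col = np.array([50.0, 80.0, 100.0, np.nan, 120.0, 5000.0])
test_col = np.array([np.nan, 90.0, 20000.0])

# 1) impute: the median is learned from the training split only
median = np.nanmedian(train_col)
train_imp = np.where(np.isnan(train_col), median, train_col)
test_imp = np.where(np.isnan(test_col), median, test_col)

# 2) cap: quantile thresholds are learned from the training split only
lower, upper = np.quantile(train_imp, [0.05, 0.95])
train_cap = np.clip(train_imp, lower, upper)
test_cap = np.clip(test_imp, lower, upper)

# 3) log: log1p is defined at zero, unlike a plain log
train_log = np.log1p(train_cap)
test_log = np.log1p(test_cap)

# 4) scale: mean and standard deviation come from the training split
mean, std = train_log.mean(), train_log.std()
train_scaled = (train_log - mean) / std
test_scaled = (test_log - mean) / std

print(train_scaled)
print(test_scaled)
```

Every statistic (median, thresholds, mean, standard deviation) is computed on the training column and then applied unchanged to the test column.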

In [39]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders as ce
from category_encoders import TargetEncoder

In [40]:
# Numeric features: NaN is replaced with the median and a missing-flag column is added
numeric_features = ["cadastral_income", "surface_land_sqm", "construction_year",
                    "primary_energy_consumption_sqm","nbr_bedrooms","nbr_frontages",
                    "terrace_sqm", "total_area_sqm","garden_sqm"]

skewed_features = ["surface_land_sqm","total_area_sqm", "garden_sqm","terrace_sqm"]

# Categorical features contain no NaN; binary flags are also one-hot encoded so that "MISSING" can act as its own category
categorical_onehot = ["property_type","region","province","heating_type","equipped_kitchen",
                        "fl_floodzone", "fl_double_glazing", "fl_open_fire","fl_terrace", 
                        "fl_garden", "fl_swimming_pool", "fl_furnished","epc"
                        ]
categorical_target = ["subproperty_type","locality","state_building"]
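
Because a `ColumnTransformer` fails when a listed column is absent from the DataFrame, and a column placed in two groups is encoded twice, it can help to check the lists before building the pipelines. The helper below, `check_feature_lists`, is a hypothetical utility written for this sketch, and the column names in the usage example are a toy subset.

```python
def check_feature_lists(columns, groups):
    """Return columns missing from the data (per group) and columns listed in more than one group."""
    available = set(columns)
    missing = {
        name: [c for c in cols if c not in available]
        for name, cols in groups.items()
    }
    seen, duplicated = set(), set()
    for cols in groups.values():
        for c in cols:
            if c in seen:
                duplicated.add(c)
            seen.add(c)
    return missing, sorted(duplicated)


# Toy usage: "terrace_sqm" is absent and "region" appears in two groups.
toy_columns = ["region", "epc", "garden_sqm"]
toy_groups = {
    "numeric": ["garden_sqm", "terrace_sqm"],
    "onehot": ["region", "epc"],
    "target": ["region"],
}
missing, duplicated = check_feature_lists(toy_columns, toy_groups)
print(missing)
print(duplicated)
```

In the notebook, `columns` would be `X_train.columns` and `groups` would hold the lists exactly as they are passed to the `ColumnTransformer` (with the skewed features removed from the plain numeric group).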

In [41]:

# Function for log-transformation
log_transformer = FunctionTransformer(np.log1p, validate=True)

# Class Outlier Capper
class OutlierCapper(BaseEstimator, TransformerMixin):
    def __init__(self, lower_quantile=0.01, upper_quantile=0.99):
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile
    
    def fit(self, X, y=None):
        # Compute thresholds for each column based on training data
        self.lower_ = np.quantile(X, self.lower_quantile, axis=0)
        self.upper_ = np.quantile(X, self.upper_quantile, axis=0)
        return self
    
    def transform(self, X):
        # Clip values to the learned thresholds
        return np.clip(X, self.lower_, self.upper_)
    
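# Note: this instance caps at the 5th/95th percentiles, overriding the class defaults (1st/99th)
capper = OutlierCapper(lower_quantile=0.05, upper_quantile=0.95)

# Pipeline for numeric columns: impute → cap → scale.
# Capping thresholds are learned from the training data only, which avoids leakage.
# With add_indicator=True the imputer appends 0/1 missing-flag columns, and the later steps are applied to those flags as well.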
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap',capper),
    ('scaler', StandardScaler())
])

# Pipeline for numeric features that need log-transform
log_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap', capper),
    ('log', log_transformer),
    ('scaler', StandardScaler())
])

# Pipeline for one-hot categorical features - binary features are included because "MISSING" is treated as a category
onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Pipeline for target-encoded categorical features - "MISSING" is treated as a category
target_pipeline = Pipeline([
    ('target_enc', TargetEncoder(smoothing=1.0))
])

# Putting all pipelines together; TargetEncoder is SUPERVISED (it needs y_train)

preprocessor_linear = ColumnTransformer([
    ('num', numeric_pipeline, [f for f in numeric_features if f not in skewed_features]),
    ('log', log_pipeline, skewed_features),
    ('onehot', onehot_pipeline, categorical_onehot),
    ('label', target_pipeline, categorical_target)
])
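
To illustrate why target encoding is supervised, the pandas sketch below replaces each category with the mean target observed in the training rows and falls back to the global training mean for unseen categories. It is a simplified, unsmoothed illustration on made-up data, not the `category_encoders` implementation, which additionally blends each category mean with the global mean according to its `smoothing` setting.

```python
import pandas as pd

# Toy training and test frames (hypothetical localities and prices).
train = pd.DataFrame({
    "locality": ["Gent", "Gent", "Antwerp", "Antwerp", "Antwerp"],
    "price": [300.0, 500.0, 200.0, 250.0, 300.0],
})
test = pd.DataFrame({"locality": ["Gent", "Namur"]})

# Statistics are learned from the training split only.
global_mean = train["price"].mean()
category_means = train.groupby("locality")["price"].mean()

# Encode: known categories get their training mean; unseen ones
# ("Namur" here) fall back to the global training mean.
train_encoded = train["locality"].map(category_means)
test_encoded = test["locality"].map(category_means).fillna(global_mean)

print(category_means)
print(test_encoded)
```

The encoding needs the target values, which is why `y_train` has to be passed when the preprocessor is fitted.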

In [42]:
# Fit on training data
X_train_processed = preprocessor_linear.fit_transform(X_train, y_train)  # y_train is required because TargetEncoder is supervised

# Transform validation and test data (re-use the transformers fitted on the training set)
X_val_processed = preprocessor_linear.transform(X_val)
X_test_processed = preprocessor_linear.transform(X_test)

print("X_train_processed shape:", X_train_processed.shape)
print("X_test_processed shape:", X_test_processed.shape)

# For X_train
print("First 5 rows of processed X_train:")
print(X_train_processed[:5, :5])  # first 5 rows & first 5 columns

# For X_test
print("First 5 rows of processed X_test:")
print(X_test_processed[:5, :5])

X_train_processed shape: (45304, 74)
X_test_processed shape: (15102, 74)
First 5 rows of processed X_train:
[[-1.7221717   1.18843323 -0.14593329  0.27205378 -1.27501337]
 [-0.12257671  1.26611278 -0.14593329 -1.53171094 -1.27501337]
 [-0.69457326  0.13975939 -0.86666715 -0.62982858  0.18560831]
 [-0.12257671  1.26611278 -0.14593329 -0.62982858  0.18560831]
 [-0.12257671  0.13975939 -0.14593329  0.27205378  0.18560831]]
First 5 rows of processed X_test:
[[-1.51931246  0.13975939  0.67976182  0.27205378 -1.27501337]
 [-0.12257671 -1.29731216 -0.41883252  1.17393613 -1.27501337]
 [-0.27222697  0.99423437 -1.11157672 -0.62982858 -1.27501337]
 [-0.12257671  1.26611278 -0.14593329  0.27205378  0.18560831]
 [ 0.23658391 -1.06427353  1.41449051  0.27205378 -1.27501337]]
