## Linear and Tree-Based Models: Preprocessing Pipelines, Training and Testing ##


* Define features (X) and target variable (y).

* Split the dataset into train, validation, and test sets (60/20/20 split).

**1. Linear Model Preprocessor: 5 steps (Python object)**

*No imputation, no scaling, no capping, and no encoding for price: the target is only log-transformed before fitting the model, and predictions are transformed back to the price scale.*

TRAIN SET:
- Handle missing values: missingness is meaningful here; impute NaN with the median and add a missing_flag indicator;
- Handle skewness: cap outliers at the 5th and 95th percentiles, then log-transform the skewed features (the target is log-transformed as well);
- Encode categorical variables (TargetEncoder and OneHotEncoder) and scale numeric features (StandardScaler);

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.
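
The "parameters learned from the training set" rule can be sketched on a toy column. This is only an illustration, not the notebook's pipeline: `train_col` and `test_col` are made-up arrays.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column with missing values
train_col = np.array([[1.0], [np.nan], [3.0]])
test_col = np.array([[np.nan], [10.0]])

# The median is learned from the training column only
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(train_col)

# The test column is filled with the training median, not its own;
# the second output column is the 0/1 missing flag
test_out = imputer.transform(test_col)
print(imputer.statistics_)
print(test_out)
```

The missing test value is filled with 2.0 (the training median), even though the only observed test value is 10.0.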


**2. Random Forest & XGBoost Models: Preprocessing Strategy**

TRAIN SET:
- Handle missing values: impute NaN with -1 and add a missing_flag indicator;
- Handle skewness: not necessary;
- Encode categorical variables (OneHotEncoder, TargetEncoder); scaling is not necessary;

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.
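
Why skewness handling and scaling can be skipped for tree models: splits depend only on the ordering of feature values, which a monotone transform such as log1p preserves. A minimal sketch on made-up data (the arrays and the single decision tree are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical right-skewed feature and a noisy target
x = np.arange(1.0, 201.0) ** 2
rng = np.random.default_rng(0)
y = np.sqrt(x) + rng.normal(scale=5.0, size=x.size)

X_raw = x.reshape(-1, 1)
X_log = np.log1p(X_raw)  # monotone transform of the same feature

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_log, y)

# Both trees partition the training rows in the same way, so the fitted values agree
same = np.allclose(tree_raw.predict(X_raw), tree_log.predict(X_log))
print(same)
```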

In [2]:
import pandas as pd

filename = "cleaned_properties.csv"
df = pd.read_csv(filename)
df.columns
df

| | id | price | property_type | subproperty_type | region | province | locality | zip_code | latitude | longitude | ... | fl_garden | garden_sqm | fl_swimming_pool | fl_floodzone | state_building | primary_energy_consumption_sqm | epc | heating_type | fl_double_glazing | cadastral_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34221000 | 225000.0 | APARTMENT | APARTMENT | Flanders | Antwerp | Antwerp | 2050 | 51.217172 | 4.379982 | ... | 0 | 0.0 | 0 | 0 | MISSING | 231.0 | poor | GAS | 1 | 922.0 |
| 1 | 2104000 | 449000.0 | HOUSE | HOUSE | Flanders | East Flanders | Gent | 9185 | 51.174944 | 3.845248 | ... | 0 | 0.0 | 0 | 0 | MISSING | 221.0 | poor | MISSING | 1 | 406.0 |
| 2 | 34036000 | 335000.0 | APARTMENT | APARTMENT | Brussels-Capital | Brussels | Brussels | 1070 | 50.842043 | 4.334543 | ... | 0 | 0.0 | 0 | 1 | AS_NEW | NaN | MISSING | GAS | 0 | NaN |
| 3 | 58496000 | 501000.0 | HOUSE | HOUSE | Flanders | Antwerp | Turnhout | 2275 | 51.238312 | 4.817192 | ... | 0 | 0.0 | 0 | 1 | MISSING | 99.0 | excellent | MISSING | 0 | NaN |
| 4 | 48727000 | 982700.0 | APARTMENT | DUPLEX | Wallonia | Walloon Brabant | Nivelles | 1410 | NaN | NaN | ... | 1 | 142.0 | 0 | 0 | AS_NEW | 19.0 | excellent | GAS | 0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75503 | 30785000 | 210000.0 | APARTMENT | APARTMENT | Wallonia | Hainaut | Tournai | 7640 | NaN | NaN | ... | 0 | 0.0 | 0 | 1 | AS_NEW | NaN | MISSING | MISSING | 1 | NaN |
| 75504 | 13524000 | 780000.0 | APARTMENT | PENTHOUSE | Brussels-Capital | Brussels | Brussels | 1200 | 50.840183 | 4.435570 | ... | 0 | 0.0 | 0 | 0 | AS_NEW | 95.0 | good | GAS | 1 | NaN |
| 75505 | 43812000 | 798000.0 | HOUSE | MIXED_USE_BUILDING | Brussels-Capital | Brussels | Brussels | 1080 | NaN | NaN | ... | 0 | 0.0 | 0 | 1 | TO_RENOVATE | 351.0 | bad | GAS | 0 | NaN |
| 75506 | 49707000 | 575000.0 | HOUSE | VILLA | Flanders | West Flanders | Veurne | 8670 | NaN | NaN | ... | 1 | NaN | 0 | 1 | AS_NEW | 269.0 | poor | GAS | 1 | 795.0 |


In [3]:
#Define features (X) and target variable (y)

from sklearn.model_selection import train_test_split

X = df.drop(columns = ["price","id","zip_code","latitude","longitude"])
y = df["price"]
type(y)

pandas.core.series.Series

**Splitting the data into train, test, validation sets**

In [4]:
# Split the dataset into train, validation, and test sets (60/20/20 split)
from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=86)

# Then take 25% of the remaining 80% (= 20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=86)


In [5]:
print(type(X_train))
print(type(X_test))
print(type(X_val))
print(type(y_train))
print(type(y_test))
print(type(y_val))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
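
A quick check that the two-step split gives 60/20/20. The sketch uses hypothetical stand-in data of 100 rows (`X_demo`, `y_demo`), so the counts read directly as percentages:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Same two-step scheme as above: 20% test, then 25% of the remainder as validation
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=86)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=86)

print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```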


## Preprocessors ##
impute → cap → log → scale → encode

In [6]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders as ce
from category_encoders import TargetEncoder

*Following the multicollinearity analysis, some features are candidates for dropping. One feature needs to be chosen from each of the identified groups.*



In [7]:
# Numeric features: NaN is replaced with the median and a missing-flag column is added
numeric_features = ["cadastral_income", "surface_land_sqm", "construction_year",
                    "primary_energy_consumption_sqm", "nbr_bedrooms", "nbr_frontages",
                    "terrace_sqm", "total_area_sqm", "garden_sqm"]

# Subset of the numeric features that is also log-transformed
skewed_features = ["surface_land_sqm", "total_area_sqm", "garden_sqm", "terrace_sqm"]

# Categorical features: no NaN to handle, only encoding
categorical_onehot = ["property_type", "region", "province", "heating_type", "equipped_kitchen", "epc"]
categorical_target = ["subproperty_type", "locality", "state_building"]

# Binary features: no NaN to handle and no encoding
binary_features = ["fl_floodzone", "fl_double_glazing", "fl_open_fire", "fl_terrace",
                   "fl_garden", "fl_swimming_pool", "fl_furnished"]


**1. Linear Model Preprocessor**

In [8]:

# Function for log-transformation of skewed_features
log_transformer = FunctionTransformer(np.log1p, validate=True)

# Class Outlier Capper
class OutlierCapper(BaseEstimator, TransformerMixin):
    def __init__(self, lower_quantile=0.01, upper_quantile=0.99):
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile
    
    def fit(self, X, y=None):
        # Compute thresholds for each column based on training data
        self.lower_ = np.quantile(X, self.lower_quantile, axis=0)
        self.upper_ = np.quantile(X, self.upper_quantile, axis=0)
        return self
    
    def transform(self, X):
        # Clip values to the learned thresholds
        return np.clip(X, self.lower_, self.upper_)
    
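# Capping at the 5th/95th percentiles; the thresholds are learned in fit(), i.e. from the training data only (avoids leakage)
capper = OutlierCapper(lower_quantile=0.05, upper_quantile=0.95)

# Pipeline for numeric columns: impute -> cap -> scale
# Note: add_indicator=True appends 0/1 missing-flag columns, which also pass through the capper and scaler;
# a flag set in fewer than about 5% of training rows would be clipped to all zeros by the 95th-percentile cap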
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap',capper),
    ('scaler', StandardScaler())
])

# Pipeline for skewed numeric features: impute -> cap -> log1p -> scale (the order matters)
log_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap', capper),
    ('log', log_transformer),
    ('scaler', StandardScaler())
])

# Pipeline for one-hot categorical features
onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant')), 
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Pipeline for target-encoded categorical features - "MISSING" is treated as a category
target_pipeline = Pipeline([
    ('target_enc', TargetEncoder(smoothing=1.0))
])

# Putting all pipelines together; TargetEncoder is SUPERVISED (it needs y_train)

preprocessor_linear = ColumnTransformer([
    ('num', numeric_pipeline, [f for f in numeric_features if f not in skewed_features]),
    ('log', log_pipeline, skewed_features),
    ('onehot', onehot_pipeline, categorical_onehot),
    ('target', target_pipeline, categorical_target),
    ('binary', 'passthrough', binary_features) # Just passing them as-is
])
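
What the quantile capping does can be sketched with plain NumPy. The arrays below are made up; the second part illustrates a caveat for 0/1 missing flags that pass through the same capper, as they do when `add_indicator=True` precedes the `cap` step.

```python
import numpy as np

# Hypothetical training and test values for one feature
train_vals = np.arange(101, dtype=float)  # 0, 1, ..., 100
test_vals = np.array([-10.0, 50.0, 500.0])

# Thresholds come from the training data only (about 5 and 95 here)
lower = np.quantile(train_vals, 0.05)
upper = np.quantile(train_vals, 0.95)

# Test values outside the learned range are pulled back to the thresholds
capped_test = np.clip(test_vals, lower, upper)
print(lower, upper, capped_test)

# Caveat: a 0/1 flag that is set in only 3% of rows has a 95th percentile of 0,
# so capping wipes the flag out entirely
flag = np.r_[np.ones(3), np.zeros(97)]
capped_flag = np.clip(flag, np.quantile(flag, 0.05), np.quantile(flag, 0.95))
print(capped_flag.sum())
```

One way around the caveat would be to route indicator columns past the capper, at the cost of a slightly more involved ColumnTransformer layout.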


*Wrapping up the model pipeline*

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# Preprocess X, log-transform y before fitting, and back-transform predictions to the price scale
linear_model = Pipeline([
    ("preprocessor", preprocessor_linear),
    ("reg", TransformedTargetRegressor(
        regressor=LinearRegression(),
        func=np.log1p,
        inverse_func=np.expm1
    ))
])
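
How `TransformedTargetRegressor` handles the target can be checked on toy data. This is a sketch under made-up assumptions: `X_demo` and `y_demo` are constructed so that log1p(y) is exactly linear in X.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical data where the target grows exponentially with the feature
X_demo = np.arange(1.0, 11.0).reshape(-1, 1)
y_demo = np.expm1(0.5 * X_demo.ravel())

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions afterwards
)
ttr.fit(X_demo, y_demo)

# The inner regressor is fitted on log1p(y), so its slope is on the log scale
slope = ttr.regressor_.coef_[0]
# predict() returns values on the original scale of y
roundtrip_ok = np.allclose(ttr.predict(X_demo), y_demo)
print(slope, roundtrip_ok)
```

The coefficients read later from `regressor_` therefore describe changes in log1p(price), not in price itself.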

*Fit the Linear Model Pipeline once*



In [10]:
linear_model.fit(X_train,y_train)

Pipeline(steps=[('preprocessor', ColumnTransformer(...)), ('reg', TransformedTargetRegressor(...))])


*Get the column names back (lost after ColumnTransformer)*

In [11]:

def get_column_names(ct):
    """
    Return list of output column names produced by a fitted ColumnTransformer `ct`.
    Handles Pipelines, SimpleImputer(add_indicator=True) inside pipelines,
    and transformers that implement get_feature_names_out.
    """
    feature_names = []

    for name, transformer, cols in ct.transformers_:
        # Skip dropped transformers
        if transformer == 'drop':
            continue

        # passthrough: keep original names
        if transformer == 'passthrough':
            feature_names.extend(list(cols))
            continue

        # Some ColumnTransformer entries may be (name, transformer, slice) where
        # transformer is a Pipeline or transformer instance.
        # We'll treat Pipeline specially.
        if isinstance(transformer, Pipeline):
            
            last_step = transformer.steps[-1][1]
            if hasattr(last_step, 'get_feature_names_out'):
                try:
                    names = last_step.get_feature_names_out(cols)
                    feature_names.extend(list(names))
                    continue
                except Exception:
                    # if it fails for any reason, fall through to other checks
                    pass

            imputer_with_indicator = None
            for step_name, step_obj in transformer.steps:
                if isinstance(step_obj, SimpleImputer) and getattr(step_obj, "add_indicator", False):
                    imputer_with_indicator = step_obj
                    break

            if imputer_with_indicator is not None:
                # Imputer keeps original number of columns + indicator cols (one per input col with NaNs seen during fit)
                feature_names.extend(list(cols))
                if hasattr(imputer_with_indicator, 'indicator_'):
                    indicator_names = [f"{cols[i]}_missing_flag" for i in imputer_with_indicator.indicator_.features_]
                    feature_names.extend(indicator_names)
                continue

            feature_names.extend(list(cols))
            continue

        # If transformer is not a Pipeline
        # Try to use get_feature_names_out if present
        if hasattr(transformer, 'get_feature_names_out'):
            try:
                names = transformer.get_feature_names_out(cols)
                feature_names.extend(list(names))
                continue
            except Exception:
                pass

        # Check if this transformer itself is a SimpleImputer with add_indicator=True
        if isinstance(transformer, SimpleImputer) and getattr(transformer, "add_indicator", False):
            feature_names.extend(list(cols))
            if hasattr(transformer, 'indicator_'):  # This piece resolves the issue where a missing_fl column name is created but not needed, which breaks the conversion to a DataFrame
                indicator_names = [f"{cols[i]}_missing_flag" for i in transformer.indicator_.features_]
                feature_names.extend(indicator_names)
            continue

        # final fallback: original column names
        feature_names.extend(list(cols))

    return feature_names

column_names = get_column_names(preprocessor_linear)
print(column_names)


['cadastral_income', 'construction_year', 'primary_energy_consumption_sqm', 'nbr_bedrooms', 'nbr_frontages', 'cadastral_income_missing_flag', 'construction_year_missing_flag', 'primary_energy_consumption_sqm_missing_flag', 'nbr_frontages_missing_flag', 'surface_land_sqm', 'total_area_sqm', 'garden_sqm', 'terrace_sqm', 'surface_land_sqm_missing_flag', 'total_area_sqm_missing_flag', 'garden_sqm_missing_flag', 'terrace_sqm_missing_flag', 'property_type_APARTMENT', 'property_type_HOUSE', 'region_Brussels-Capital', 'region_Flanders', 'region_Wallonia', 'region_missing_value', 'province_Antwerp', 'province_Brussels', 'province_East Flanders', 'province_Flemish Brabant', 'province_Hainaut', 'province_Limburg', 'province_Liège', 'province_Luxembourg', 'province_Namur', 'province_Walloon Brabant', 'province_West Flanders', 'province_missing_value', 'heating_type_CARBON', 'heating_type_ELECTRIC', 'heating_type_FUELOIL', 'heating_type_GAS', 'heating_type_MISSING', 'heating_type_PELLET', 'heating_

*Extract features with coefficients*

In [None]:

linreg = linear_model.named_steps["reg"].regressor_ # This is where the LR is stored

preprocessor_linear = linear_model.named_steps["preprocessor"]

feature_names = get_column_names(preprocessor_linear)

coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": linreg.coef_
}).sort_values(by="coefficient", key=abs, ascending=False)

print(coef_df.head(25))  # top 25 strongest features

# Example interpretation: increasing total area by 1 standard deviation (on the log1p scale used for this feature)
# increases log(price) by 0.23, which corresponds to an increase of around 26% in price.

                                feature  coefficient
10                       total_area_sqm     0.231797
65                     fl_swimming_pool     0.220506
35                  heating_type_CARBON    -0.141772
63                           fl_terrace    -0.137180
48  equipped_kitchen_USA_HYPER_EQUIPPED     0.126990
27                     province_Hainaut    -0.106291
46       equipped_kitchen_NOT_INSTALLED    -0.105935
12                          terrace_sqm     0.093480
21                      region_Wallonia    -0.092023
43      equipped_kitchen_HYPER_EQUIPPED     0.090743
47       equipped_kitchen_SEMI_EQUIPPED    -0.088347
53                              epc_bad    -0.084062
3                          nbr_bedrooms     0.083469
0                      cadastral_income     0.078632
54                        epc_excellent     0.077334
56                             epc_poor    -0.076411
41                   heating_type_SOLAR     0.069558
42                    heating_type_WOOD    -0.
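
The step from a log-scale coefficient to a percentage change in price can be written out as arithmetic. The coefficients are copied from the table above; treating the log1p(price) target as log(price) is an approximation that is harmless at these price levels.

```python
import math

# Coefficients copied from the table above (log1p(price) scale)
coefs = {
    "total_area_sqm": 0.231797,
    "fl_swimming_pool": 0.220506,
    "heating_type_CARBON": -0.141772,
}

# exp(beta) - 1 converts a log-scale coefficient into an approximate relative price change
pct_change = {name: math.expm1(beta) * 100 for name, beta in coefs.items()}
for name, pct in pct_change.items():
    print(f"{name}: {pct:+.1f}%")
```

For scaled features the change is per standard deviation of the transformed feature; for one-hot and binary columns it is the change relative to the column being 0.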

**2. Random Forest & XGBoost Preprocessing Pipeline**

In [13]:
# Pipeline for numeric columns: constant imputation (-1) with a missing flag; no capping or scaling for tree models
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1, add_indicator=True)),
])

# No log-transformation is done here; the skewed features get their own pipeline because they are passed to the ColumnTransformer as a separate column list
skew_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1, add_indicator=True)),
])

onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant')), 
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Pipeline for target-encoded categorical features - "MISSING" is treated as a category
target_pipeline = Pipeline([
    ('target_enc', TargetEncoder(smoothing=1.0))
])

preprocessor_forest_boost = ColumnTransformer([
    ('num', numeric_pipeline, [f for f in numeric_features if f not in skewed_features]),
    ('skewed',skew_pipeline, skewed_features),
    ('onehot', onehot_pipeline, categorical_onehot),
    ('target', target_pipeline, categorical_target),
    ('binary', 'passthrough', binary_features) # Just passing them as-is
])
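
What target encoding does can be sketched with pandas. This is a simplified, unsmoothed version on made-up data; the `TargetEncoder` used above additionally blends each category mean with the global mean (controlled by `smoothing` and `min_samples_leaf`), so its numbers will differ.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "locality": ["Gent", "Gent", "Antwerp", "Antwerp", "Antwerp"],
    "price": [300.0, 500.0, 200.0, 250.0, 300.0],
})
test = pd.DataFrame({"locality": ["Gent", "Antwerp", "Namur"]})

# Learn one number per category from the training target only
category_means = train.groupby("locality")["price"].mean()
global_mean = train["price"].mean()

# Apply the learned mapping; unseen categories fall back to the global training mean
test["locality_encoded"] = test["locality"].map(category_means).fillna(global_mean)
print(test)
```

Because the mapping uses the target, it has to be learned on the training split only, which the pipeline takes care of when `fit` receives `y_train`.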

Wrapping up the Random Forest and XGBoost Models

In [14]:
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from xgboost import XGBRegressor

forest_model = Pipeline(steps=[
    ("preprocess", preprocessor_forest_boost),
    ("model", RandomForestRegressor(
        n_estimators=100,    # number of trees
        max_depth=None,      
        random_state=42
    ))
])

xgboost_model = Pipeline(steps=[
    ("preprocess", preprocessor_forest_boost),
    ("model", XGBRegressor(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=6,
        subsample=0.9,
        colsample_bytree=0.9
    ))
])

Fit the Random Forest Model

In [15]:
forest_model.fit(X_train,y_train)


Pipeline(steps=[('preprocess', ColumnTransformer(...)), ('model', RandomForestRegressor(...))])


In [16]:
# Quick prediction checks with Random Forest

y_pred = forest_model.predict(X_test)
print(y_pred[:30]) 

[ 136741.98        456619.04666667  223955.22        531606.25
  291719.          296866.5         204050.77        338768.485
  592944.30666667  325337.32        370410.          125055.98
  485230.52666667  384763.9         141098.97        207979.
  402531.31666667  342804.          313360.          303260.25916667
 1225125.18        345872.56        271378.          433813.61666667
  322389.          220646.59        210486.98        277425.
  342540.          367048.5       ]


In [None]:
# Checking feature importance (RF uses Mean Decrease in Impurity, MDI):
# every split in every tree reduces impurity (variance, for regression)
# -> importance(feature) = total impurity reduction contributed by that feature across the entire forest, normalized to sum to 1

rf = forest_model.named_steps['model'] # This is where RF model is stored
#rf contains the built-in attribute rf.feature_importances_ (used below)

feature_names = []

for name, transformer, cols in preprocessor_forest_boost.transformers_:
    if name != 'remainder':  
        if hasattr(transformer, 'get_feature_names_out'):
            feature_names.extend(transformer.get_feature_names_out(cols))
        else:
            feature_names.extend(cols)

importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf.feature_importances_ # the most important line here
}).sort_values(by='Importance', ascending=False)

print(importances.head(10))  # top 10 features

# Interpretation: the share of the total impurity (variance) reduction on the training data attributable to each feature across all trees.
# Importance shows the magnitude of a feature's contribution, not the direction of its effect.


                           Feature  Importance
10                  total_area_sqm    0.301120
58                        locality    0.145369
9                 surface_land_sqm    0.092224
3                     nbr_bedrooms    0.082658
1                construction_year    0.053645
0                 cadastral_income    0.049156
2   primary_energy_consumption_sqm    0.044337
57                subproperty_type    0.031305
59                  state_building    0.026939
12                     terrace_sqm    0.022323
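
That MDI importances are shares summing to 1 can be checked on a small forest. The data below is made up so that only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: the target depends on the first feature only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances are normalized, so they read as shares of the total impurity reduction
imp = rf_demo.feature_importances_
print(imp, imp.sum())
```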


Fit XGBoost Model

In [18]:
xgboost_model.fit(X_train,y_train)

Pipeline(steps=[('preprocess', ColumnTransformer(...)), ('model', XGBRegressor(...))])


In [21]:
# Quick prediction checks with XGBoost

y_pred = xgboost_model.predict(X_test)
print(y_pred[:30]) 

[1.1870538e+05 5.2575962e+05 2.3082364e+05 5.2681256e+05 2.9261059e+05
 2.6447238e+05 2.1715977e+05 3.1440556e+05 5.8278475e+05 2.7607925e+05
 3.9555638e+05 1.2718918e+05 5.1533094e+05 3.1904222e+05 1.2091988e+05
 2.2025398e+05 3.8953012e+05 3.4426816e+05 3.2271006e+05 3.2967812e+05
 1.0888211e+06 3.5839338e+05 2.5790200e+05 3.8026594e+05 3.0524409e+05
 2.4296780e+05 2.3334591e+05 2.7530947e+05 3.2605575e+05 3.5671578e+05]


In [None]:
# Checking feature importance
# weight (split count) - how many times a feature is used to split across all trees; can be biased towards features with many distinct values
# gain - average loss reduction of the splits that use the feature
# cover - average number of training samples affected by the splits that use the feature

# Getting the feature names back
preprocessor = xgboost_model.named_steps["preprocess"]
feature_names = get_column_names(preprocessor)

booster = xgboost_model.named_steps["model"].get_booster() # This is where the booster is stored

# Extracting importance metrics (keys are positional: "f0", "f1", ...)
importance_gain = booster.get_score(importance_type='gain')
importance_weight = booster.get_score(importance_type='weight')
importance_cover = booster.get_score(importance_type='cover')

# Map "f0", "f1" etc to actual feature names
importance_gain_named = {feature_names[int(k[1:])]: v for k, v in importance_gain.items()}
importance_weight_named = {feature_names[int(k[1:])]: v for k, v in importance_weight.items()}
importance_cover_named = {feature_names[int(k[1:])]: v for k, v in importance_cover.items()}

all_features = feature_names

# Features that were never used in a split are missing from get_score and get 0
df_importance = pd.DataFrame({
    'Feature': all_features,
    'Gain': [importance_gain_named.get(f, 0) for f in all_features],
    'Weight': [importance_weight_named.get(f, 0) for f in all_features],
    'Cover': [importance_cover_named.get(f, 0) for f in all_features]
})

# Optional: sort by Gain descending
df_importance = df_importance.sort_values(by='Gain', ascending=False)

df_importance

| | Feature | Gain | Weight | Cover |
|---|---|---|---|---|
| 24 | province_Brussels | 8.426063e+12 | 17.0 | 3306.588135 |
| 21 | region_Wallonia | 7.153247e+12 | 48.0 | 3893.541748 |
| 10 | total_area_sqm | 6.145581e+12 | 1868.0 | 6584.284180 |
| 13 | missingindicator_surface_land_sqm | 4.608659e+12 | 26.0 | 3692.346191 |
| 58 | locality | 4.093225e+12 | 1360.0 | 4877.377441 |
| ... | ... | ... | ... | ... |
| 15 | missingindicator_garden_sqm | 4.022559e+10 | 5.0 | 6675.399902 |
| 35 | heating_type_CARBON | 3.783036e+10 | 2.0 | 40746.500000 |
| 34 | province_missing_value | 0.000000e+00 | 0.0 | 0.000000 |
| 22 | region_missing_value | 0.000000e+00 | 0.0 | 0.000000 |
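
The mapping from positional keys to feature names can be sketched without a fitted booster. The `scores` dictionary and the short `names` list are made-up stand-ins for the `get_score` output and the preprocessor's column names:

```python
# Hypothetical scores keyed by positional names "f<i>"
scores = {"f0": 12.5, "f2": 3.0}
names = ["total_area_sqm", "nbr_bedrooms", "locality"]  # illustrative subset

# "f<i>" -> names[i]
named_scores = {names[int(key[1:])]: value for key, value in scores.items()}

# Features that never appear in a split are absent from the scores and default to 0
full_scores = {name: named_scores.get(name, 0.0) for name in names}
print(full_scores)
```

The mapping is only valid if `names` is in exactly the column order the model was trained on.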


## Metrics for the Models ##

In [19]:
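# R-squared and RMSE on the original and log scales (adjusted R-squared is not computed here)
# Assumes y_test_pred holds test-set predictions on the price scale and y_test_pred_log
# the same predictions on the log1p(price) scale; neither is defined in the cells shown.

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# r2_score expects (y_true, y_pred) and is not symmetric in its arguments.
# The log-scale R2 printed below was produced with the arguments reversed and needs a re-run.
r2_log = r2_score(np.log1p(y_test), y_test_pred_log)
mse_log = mean_squared_error(np.log1p(y_test), y_test_pred_log)

r2 = r2_score(y_test, y_test_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE is the square root of the MSE

print("R2 (original scale):", r2)
print("RMSE:", rmse)
print("R2 (log scale):", r2_log)
print("RMSE (log scale):", np.sqrt(mse_log))


R2 (original scale): 0.33954826665613613
RMSE: 378784.6106455484
R2 (log scale): 0.4645265851207946
RMSE (log scale): 0.3419016547868019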
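
Two details of these metrics are easy to get wrong: `r2_score` depends on the argument order, and `mean_squared_error` returns the MSE, not the RMSE. A small check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0, 3.5])

# r2_score expects (y_true, y_pred); swapping them changes the result
r2_correct = r2_score(y_true, y_hat)
r2_swapped = r2_score(y_hat, y_true)

# mean_squared_error returns the MSE; the RMSE is its square root
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
print(r2_correct, r2_swapped, mse, rmse)
```

Here the correct order gives an R2 of about 0.9 and the swapped order about 0.8, because the denominator uses the variance of whichever array is passed first.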


In [20]:
# Over/under-fitting check
# train error much lower than test error (train R2 much higher) - overfitting (model memorizes training data)
# high error / low R2 on both train and test - underfitting (model is too simple or wrong features)

def evaluate(y_true, y_pred): # temporarily defined here, not globally
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred)
    }

train_metrics = evaluate(y_train, y_train_pred)
test_metrics  = evaluate(y_test,  y_test_pred)

print("TRAIN:", train_metrics)
print("TEST: ", test_metrics)

# Result interpretation: train and test metrics are close, so there is no sign of overfitting;
# R2 of about 0.34-0.40 on both sets suggests that the model underfits rather than fits well.

TRAIN: {'MAE': 122895.95500355208, 'RMSE': np.float64(332277.1726435241), 'R2': 0.40316051400669195}
TEST:  {'MAE': 124093.27323195233, 'RMSE': np.float64(378784.6106455484), 'R2': 0.33954826665613613}
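
An adjusted R2 can be derived from the plain R2 once the number of rows and features is known. The sketch below uses the standard formula; the sample and feature counts are placeholders, not values taken from this dataset.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Test-set R2 from the output above; n_samples and n_features are hypothetical
r2_test = 0.33954826665613613
adj = adjusted_r2(r2_test, n_samples=1000, n_features=10)
print(adj)
```

With real values, `n_samples` would be `len(y_test)` and `n_features` the number of columns produced by the preprocessor.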
