## Linear and Tree-Based Models: Preprocessing Pipelines, Training and Testing ##


* Define features (X) and target variable (y).

* Split the dataset into train and test sets (75/25 in the code below); a three-way train/validation/test split (e.g., 60/20/20) is kept as a commented-out alternative.

**1. Linear Model Preprocessor: 5 steps (Python object)**

*The target (price) gets no imputation, scaling, capping, or encoding; it is only log-transformed before the model is fitted, and predictions are transformed back to the price scale.*

TRAIN SET:
- Handle missing values: missingness is meaningful here; impute NaN with the median and add a missing_flag indicator;
- Handle skewness: cap outliers at the 5th and 95th percentiles, then log-transform skewed features (the target is log-transformed separately);
- Encode categorical variables (TargetEncoder and OneHotEncoder) and scale numeric features (StandardScaler);

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.
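
The "parameters learned from the training set" rule can be shown on a single imputer. This is a minimal sketch on toy arrays (`X_train_toy` and `X_test_toy` are hypothetical and not part of the dataset): the median is computed in `fit` on the training rows only and then reused unchanged in `transform` on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one numeric column with missing values
X_train_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_toy = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train_toy)                        # the median is learned from the train rows only
X_test_imputed = imputer.transform(X_test_toy)  # the test rows reuse the train median

print(imputer.statistics_)  # median learned during fit
print(X_test_imputed)       # first column: imputed values, second column: 0/1 missing flag
```

Wrapping every step in a `Pipeline` gives the same guarantee automatically, because `fit` is only ever called on the training data.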


**2. Random Forest & XGBoost Models: Preprocessing Strategy**

TRAIN SET:
- Handle missing values: impute NaN with -1 and add a missing_flag indicator;
- Handle skewness: not necessary;
- Encode categorical variables (OneHotEncoder, TargetEncoder); scaling is not necessary;

TEST, VALIDATION SETS:
Apply the same sub-steps from above but with the parameters learned from the **training set**.

In [120]:
import pandas as pd

filename = "cleaned_properties.csv"
df = pd.read_csv(filename)
df.columns
df

Unnamed: 0,id,price,property_type,subproperty_type,region,province,locality,zip_code,latitude,longitude,...,fl_garden,garden_sqm,fl_swimming_pool,fl_floodzone,state_building,primary_energy_consumption_sqm,epc,heating_type,fl_double_glazing,cadastral_income
0,34221000,225000.0,APARTMENT,APARTMENT,Flanders,Antwerp,Antwerp,2050,51.217172,4.379982,...,0,0.0,0,0,MISSING,231.0,poor,GAS,1,922.0
1,2104000,449000.0,HOUSE,HOUSE,Flanders,East Flanders,Gent,9185,51.174944,3.845248,...,0,0.0,0,0,MISSING,221.0,poor,MISSING,1,406.0
2,34036000,335000.0,APARTMENT,APARTMENT,Brussels-Capital,Brussels,Brussels,1070,50.842043,4.334543,...,0,0.0,0,1,AS_NEW,,MISSING,GAS,0,
3,58496000,501000.0,HOUSE,HOUSE,Flanders,Antwerp,Turnhout,2275,51.238312,4.817192,...,0,0.0,0,1,MISSING,99.0,excellent,MISSING,0,
4,48727000,982700.0,APARTMENT,DUPLEX,Wallonia,Walloon Brabant,Nivelles,1410,,,...,1,142.0,0,0,AS_NEW,19.0,excellent,GAS,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75503,30785000,210000.0,APARTMENT,APARTMENT,Wallonia,Hainaut,Tournai,7640,,,...,0,0.0,0,1,AS_NEW,,MISSING,MISSING,1,
75504,13524000,780000.0,APARTMENT,PENTHOUSE,Brussels-Capital,Brussels,Brussels,1200,50.840183,4.435570,...,0,0.0,0,0,AS_NEW,95.0,good,GAS,1,
75505,43812000,798000.0,HOUSE,MIXED_USE_BUILDING,Brussels-Capital,Brussels,Brussels,1080,,,...,0,0.0,0,1,TO_RENOVATE,351.0,bad,GAS,0,
75506,49707000,575000.0,HOUSE,VILLA,Flanders,West Flanders,Veurne,8670,,,...,1,,0,1,AS_NEW,269.0,poor,GAS,1,795.0


In [121]:
#Define features (X) and target variable (y)

from sklearn.model_selection import train_test_split

X = df.drop(columns = ["price","id","zip_code","latitude","longitude"])
y = df["price"]
type(y)

pandas.core.series.Series

**Splitting the data into train, test, validation sets**

In [None]:
# Split the dataset: the active line below is a 75/25 train/test split.
# The two commented-out lines give a 60/20/20 train/validation/test split instead.
from sklearn.model_selection import train_test_split

#X_temp, X_test, y_temp, y_test = train_test_split(X,y, test_size=0.2, random_state = 86)
#X_train, X_val, y_train, y_val = train_test_split(X_temp,y_temp, test_size = 0.25, random_state = 86)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=36)
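
The two-stage split in the commented-out lines can be checked on toy data. A minimal sketch, assuming hypothetical arrays `X_toy` and `y_toy` with 100 rows: holding out 20% first and then 25% of the remainder yields a 60/20/20 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Stage 1: hold out 20% of all rows as the test set
X_temp, X_te, y_temp, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
# Stage 2: 25% of the remaining 80% becomes the validation set (20% of the total)
X_tr, X_va, y_tr, y_va = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_tr), len(X_va), len(X_te))
```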


In [123]:
# X_val and y_val exist only when the three-way split above is uncommented
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


## Preprocessors ##
impute → cap → log → scale → encode

In [124]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders as ce
from category_encoders import TargetEncoder

*Following the multicollinearity analysis, several features are candidates for dropping: one feature should be kept from each identified group of correlated features.*



In [None]:
# Numeric features: NaN is replaced with the median and a missing-flag column is added
numeric_features = ["cadastral_income","primary_energy_consumption_sqm","nbr_bedrooms","nbr_frontages","total_area_sqm"]

# Subset of the numeric features that is also log-transformed
skewed_features = ["total_area_sqm"]

# Categorical features: no NaN to handle ("MISSING" is already a category), only encoding
categorical_onehot = ["heating_type","equipped_kitchen", "epc"]
categorical_target = ["subproperty_type","province"]

# Binary flags: no NaN to handle and no encoding
binary_features = ["fl_terrace", "fl_garden", "fl_swimming_pool", "fl_furnished"]
# Candidates to drop due to multicollinearity:
# "construction_year", "surface_land_sqm", "property_type", "state_building", "fl_double_glazing", "fl_open_fire", "region", "locality", "fl_floodzone", "terrace_sqm", "garden_sqm"

**1. Linear Model Preprocessor**

In [126]:

# Function for log-transformation of skewed_features
log_transformer = FunctionTransformer(np.log1p, validate=True)

# Class Outlier Capper
class OutlierCapper(BaseEstimator, TransformerMixin):
    def __init__(self, lower_quantile=0.01, upper_quantile=0.99):
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile
    
    def fit(self, X, y=None):
        # Compute thresholds for each column based on training data
        self.lower_ = np.quantile(X, self.lower_quantile, axis=0)
        self.upper_ = np.quantile(X, self.upper_quantile, axis=0)
        return self
    
    def transform(self, X):
        # Clip values to the learned thresholds
        return np.clip(X, self.lower_, self.upper_)
    
capper = OutlierCapper(lower_quantile=0.05, upper_quantile=0.95)

# Pipeline for numeric columns: impute, cap, scale. The cap thresholds are learned in fit(), i.e. from the training data only, which avoids leakage
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap',capper),
    ('scaler', StandardScaler())
])

# Pipeline for numeric features that need log-transform (specific order for cap,log,scaler)
log_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median', add_indicator=True)),
    ('cap', capper),
    ('log', log_transformer),
    ('scaler', StandardScaler())
])

# Pipeline for one-hot categorical features
onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant')), 
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Pipeline for target-encoded categorical features; "MISSING" is treated as a category
target_pipeline = Pipeline([
    ('target_enc', TargetEncoder(smoothing=1.0))
])

# Putting all pipelines together; TargetEncoder is SUPERVISED (it needs y_train)

preprocessor_linear = ColumnTransformer([
    ('num', numeric_pipeline, [f for f in numeric_features if f not in skewed_features]),
    ('log', log_pipeline, skewed_features),
    ('onehot', onehot_pipeline, categorical_onehot),
    ('target', target_pipeline, categorical_target),
    ('binary', 'passthrough', binary_features) # Just passing them as-is
])
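
The capping step can be looked at in isolation. A minimal sketch, assuming a hypothetical `QuantileClipper` that mirrors the `OutlierCapper` above and toy arrays that are not part of the dataset: the 5% and 95% thresholds come from the training column, and unseen values outside that range are clipped to them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for OutlierCapper: clip columns to train quantiles."""

    def __init__(self, lower_quantile=0.05, upper_quantile=0.95):
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column thresholds, learned from the data passed to fit only
        self.lower_ = np.quantile(X, self.lower_quantile, axis=0)
        self.upper_ = np.quantile(X, self.upper_quantile, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)

train_col = np.arange(0, 101, dtype=float).reshape(-1, 1)  # toy values 0..100
new_col = np.array([[-50.0], [50.0], [1000.0]])            # unseen rows

clipper = QuantileClipper().fit(train_col)
clipped = clipper.transform(new_col)
print(clipper.lower_, clipper.upper_)
print(clipped.ravel())
```

Values inside the learned range pass through unchanged; only the two extremes are pulled back to the thresholds.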


*Wrapping up the model pipeline*

In [127]:
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# Preprocess X, log-transform y before fitting, and back-transform predictions to the price scale
linear_model = Pipeline([
    ("preprocessor", preprocessor_linear),
    ("reg", TransformedTargetRegressor(
        regressor=LinearRegression(),
        func=np.log1p,
        inverse_func=np.expm1
    ))
])
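
What `TransformedTargetRegressor` does with `func` and `inverse_func` can be verified against a manual version. A minimal sketch on synthetic data (`X_toy` and `y_toy` are hypothetical): the wrapper fits on `log1p(y)` and applies `expm1` to the predictions, so it should agree with doing both steps by hand.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
# Positive, right-skewed target (hypothetical stand-in for price)
y_toy = np.exp(1.0 + 0.5 * X_toy[:, 0] + rng.normal(scale=0.1, size=200))

# Wrapper: transforms y before fit, inverts predictions after predict
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
ttr.fit(X_toy, y_toy)

# Manual equivalent: fit on log1p(y), then undo the transform on predictions
manual = LinearRegression().fit(X_toy, np.log1p(y_toy))
manual_pred = np.expm1(manual.predict(X_toy))

same = np.allclose(ttr.predict(X_toy), manual_pred)
print(same)
```

The fitted inner model is available as `ttr.regressor_`, and its coefficients are on the log scale of the target.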

*Fit the Linear Model Pipeline once*



In [128]:
linear_model.fit(X_train,y_train)

Pipeline(steps=[('preprocessor', ...), ('reg', ...)], verbose=False)
  preprocessor: ColumnTransformer(remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True)
    num:    SimpleImputer(strategy='median', copy=True, add_indicator=True, keep_empty_features=False)
            -> OutlierCapper(lower_quantile=0.05, upper_quantile=0.95)
            -> StandardScaler(copy=True, with_mean=True, with_std=True)
    log:    SimpleImputer(strategy='median', copy=True, add_indicator=True, keep_empty_features=False)
            -> OutlierCapper(lower_quantile=0.05, upper_quantile=0.95)
            -> FunctionTransformer(func=<ufunc 'log1p'>, validate=True, accept_sparse=False, check_inverse=True)
            -> StandardScaler(copy=True, with_mean=True, with_std=True)
    onehot: SimpleImputer(strategy='constant', copy=True, add_indicator=False, keep_empty_features=False)
            -> OneHotEncoder(categories='auto', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='ignore', feature_name_combiner='concat')
    target: TargetEncoder(verbose=0, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=1.0)
    binary: passthrough
  reg: TransformedTargetRegressor(func=<ufunc 'log1p'>, inverse_func=<ufunc 'expm1'>, check_inverse=True)
    regressor: LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, positive=False)


*Get the column names back (lost after ColumnTransformer)*

In [129]:

def get_column_names(ct):
    """
    Return list of output column names produced by a fitted ColumnTransformer `ct`.
    Handles Pipelines, SimpleImputer(add_indicator=True) inside pipelines,
    and transformers that implement get_feature_names_out.
    """
    feature_names = []

    for name, transformer, cols in ct.transformers_:
        # Skip dropped transformers
        if transformer == 'drop':
            continue

        # passthrough: keep original names
        if transformer == 'passthrough':
            feature_names.extend(list(cols))
            continue

        # Some ColumnTransformer entries may be (name, transformer, slice) where
        # transformer is a Pipeline or transformer instance.
        # We'll treat Pipeline specially.
        if isinstance(transformer, Pipeline):
            
            last_step = transformer.steps[-1][1]
            if hasattr(last_step, 'get_feature_names_out'):
                try:
                    names = last_step.get_feature_names_out(cols)
                    feature_names.extend(list(names))
                    continue
                except Exception:
                    # if it fails for any reason, fall through to other checks
                    pass

            imputer_with_indicator = None
            for step_name, step_obj in transformer.steps:
                if isinstance(step_obj, SimpleImputer) and getattr(step_obj, "add_indicator", False):
                    imputer_with_indicator = step_obj
                    break

            if imputer_with_indicator is not None:
                # Imputer keeps original number of columns + indicator cols (one per input col with NaNs seen during fit)
                feature_names.extend(list(cols))
                if hasattr(imputer_with_indicator, 'indicator_'):
                    indicator_names = [f"{cols[i]}_missing_flag" for i in imputer_with_indicator.indicator_.features_]
                    feature_names.extend(indicator_names)
                continue

            feature_names.extend(list(cols))
            continue

        # If transformer is not a Pipeline
        # Try to use get_feature_names_out if present
        if hasattr(transformer, 'get_feature_names_out'):
            try:
                names = transformer.get_feature_names_out(cols)
                feature_names.extend(list(names))
                continue
            except Exception:
                pass

        # Check if this transformer itself is a SimpleImputer with add_indicator=True
        if isinstance(transformer, SimpleImputer) and getattr(transformer, "add_indicator", False):
            feature_names.extend(list(cols))
            # Flags exist only for columns that contained NaN during fit; indicator_.features_ lists them,
            # which keeps the number of names equal to the number of output columns when building a DataFrame
            if hasattr(transformer, 'indicator_'):
                indicator_names = [f"{cols[i]}_missing_flag" for i in transformer.indicator_.features_]
                feature_names.extend(indicator_names)
            continue

        # final fallback: original column names
        feature_names.extend(list(cols))

    return feature_names

column_names = get_column_names(preprocessor_linear)
print(column_names)


['cadastral_income', 'primary_energy_consumption_sqm', 'nbr_bedrooms', 'nbr_frontages', 'cadastral_income_missing_flag', 'primary_energy_consumption_sqm_missing_flag', 'nbr_frontages_missing_flag', 'total_area_sqm', 'total_area_sqm_missing_flag', 'heating_type_CARBON', 'heating_type_ELECTRIC', 'heating_type_FUELOIL', 'heating_type_GAS', 'heating_type_MISSING', 'heating_type_PELLET', 'heating_type_SOLAR', 'heating_type_WOOD', 'equipped_kitchen_HYPER_EQUIPPED', 'equipped_kitchen_INSTALLED', 'equipped_kitchen_MISSING', 'equipped_kitchen_NOT_INSTALLED', 'equipped_kitchen_SEMI_EQUIPPED', 'equipped_kitchen_USA_HYPER_EQUIPPED', 'equipped_kitchen_USA_INSTALLED', 'equipped_kitchen_USA_SEMI_EQUIPPED', 'equipped_kitchen_USA_UNINSTALLED', 'epc_MISSING', 'epc_bad', 'epc_excellent', 'epc_good', 'epc_poor', 'subproperty_type', 'province', 'fl_terrace', 'fl_garden', 'fl_swimming_pool', 'fl_furnished']


*Extract features with coefficients*

In [130]:

linreg = linear_model.named_steps["reg"].regressor_ # This is where the LR is stored

preprocessor_linear = linear_model.named_steps["preprocessor"]

feature_names = get_column_names(preprocessor_linear)

coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": linreg.coef_
}).sort_values(by="coefficient", key=abs, ascending=False)

print(coef_df.head(25))  # top 25 features by absolute coefficient

# Example interpretation: total_area_sqm is log-transformed and then standardized, so increasing
# log(total area) by 1 standard deviation increases log(price) by 0.23, i.e. a price increase of around 26%.

                                feature  coefficient
35                     fl_swimming_pool     0.241705
7                        total_area_sqm     0.228613
9                   heating_type_CARBON    -0.193116
22  equipped_kitchen_USA_HYPER_EQUIPPED     0.164977
20       equipped_kitchen_NOT_INSTALLED    -0.124611
17      equipped_kitchen_HYPER_EQUIPPED     0.122518
27                              epc_bad    -0.119684
36                         fl_furnished     0.101360
28                        epc_excellent     0.100117
21       equipped_kitchen_SEMI_EQUIPPED    -0.094656
0                      cadastral_income     0.088241
30                             epc_poor    -0.087760
15                   heating_type_SOLAR     0.083792
2                          nbr_bedrooms     0.072994
10                heating_type_ELECTRIC     0.062872
16                    heating_type_WOOD    -0.058522
29                             epc_good     0.056854
13                 heating_type_MISSING     0.
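
The step from a log-scale coefficient to a percentage change is exp(coefficient) - 1. A short worked sketch using two coefficients from the table above (the variable names are illustrative only); because the target is `log1p(price)`, the result is strictly the relative change in price + 1, which is practically the same at these price levels.

```python
import numpy as np

# Coefficients copied from the table above (log scale of the target)
coef_total_area = 0.228613     # per 1 std of the log-transformed, standardized area
coef_swimming_pool = 0.241705  # binary flag: 0 -> 1

# exp(b) - 1 converts an additive effect on log(price) into a relative price change
pct_area = np.expm1(coef_total_area)
pct_pool = np.expm1(coef_swimming_pool)

print(f"total_area_sqm (+1 std): {pct_area:.1%}")
print(f"fl_swimming_pool (0 -> 1): {pct_pool:.1%}")
```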

**2. Random Forest & XGBoost Preprocessing Pipeline**

In [131]:
# Pipeline for numeric columns: impute NaN with -1 and add a missing flag; trees need no capping or scaling
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1, add_indicator=True)),
])

# No log-transformation is applied; skewed_features are listed separately from the other numeric columns,
# so they get their own (identical) pipeline
skew_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1, add_indicator=True)),
])

onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant')), 
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Pipeline for target-encoded categorical features; "MISSING" is treated as a category
target_pipeline = Pipeline([
    ('target_enc', TargetEncoder(smoothing=1.0))
])

preprocessor_forest_boost = ColumnTransformer([
    ('num', numeric_pipeline, [f for f in numeric_features if f not in skewed_features]),
    ('skewed',skew_pipeline, skewed_features),
    ('onehot', onehot_pipeline, categorical_onehot),
    ('target', target_pipeline, categorical_target),
    ('binary', 'passthrough', binary_features) # Just passing them as-is
])

Wrapping up Random Forest and XGBoost Models

In [132]:
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from xgboost import XGBRegressor

forest_model = Pipeline(steps=[
    ("preprocess", preprocessor_forest_boost),
    ("model", RandomForestRegressor(
        n_estimators=100,    # number of trees
        max_depth=None,      
        random_state=42
    ))
])

xgboost_model = Pipeline(steps=[
    ("preprocess", preprocessor_forest_boost),
    ("model", XGBRegressor(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=6,
        subsample=0.9,
        colsample_bytree=0.9
    ))
])

Fit the Random Forest Model

In [133]:
forest_model.fit(X_train,y_train)


Pipeline(steps=[('preprocess', ...), ('model', ...)], verbose=False)
  preprocess: ColumnTransformer(remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True)
    num:    SimpleImputer(strategy='constant', fill_value=-1, copy=True, add_indicator=True, keep_empty_features=False)
    skewed: SimpleImputer(strategy='constant', fill_value=-1, copy=True, add_indicator=True, keep_empty_features=False)
    onehot: SimpleImputer(strategy='constant', copy=True, add_indicator=False, keep_empty_features=False)
            -> OneHotEncoder(categories='auto', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='ignore', feature_name_combiner='concat')
    target: TargetEncoder(verbose=0, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=1.0)
    binary: passthrough
  model: RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, ...)


In [134]:
# Quick prediction checks with Random Forest

y_pred = forest_model.predict(X_test)
print(y_pred[:30]) 

[ 515943.47192857  328427.11666667  411102.01333333  177585.
  525281.37        200720.          200748.         1312570.
  421789.6         242081.79        318950.          261066.5
  434217.3884127   343257.70559956  334422.49        168240.
  625810.          300272.88        138650.          301210.58059154
  279357.5         317645.8         380523.69333333  256240.
  420854.4255744   319642.89        364074.         1765785.67
  347005.          473797.        ]


In [135]:
# Checking feature importance (RF uses Mean Decrease in Impurity (MDI)):
# every tree splits, every split reduces impurity (e.g. variance for regression)
# -> importance(feature) = total impurity reduction contributed by that feature across the entire forest

rf = forest_model.named_steps['model'] # This is where RF model is stored
#rf contains the built-in attribute rf.feature_importances_ (used below)

feature_names = []

for name, transformer, cols in preprocessor_forest_boost.transformers_:
    if name != 'remainder':  
        if hasattr(transformer, 'get_feature_names_out'):
            feature_names.extend(transformer.get_feature_names_out(cols))
        else:
            feature_names.extend(cols)

importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf.feature_importances_ # the most important line here
}).sort_values(by='Importance', ascending=False)

print(importances.head(10))  # top 10 features

# Interpretation: each value is the feature's share of the total impurity (variance) reduction across all trees,
# measured on the training data; it shows the size of the effect, not its direction.


                            Feature  Importance
7                    total_area_sqm    0.345325
32                         province    0.129655
31                 subproperty_type    0.083042
1    primary_energy_consumption_sqm    0.083042
0                  cadastral_income    0.078925
2                      nbr_bedrooms    0.069593
3                     nbr_frontages    0.028224
35                 fl_swimming_pool    0.017600
17  equipped_kitchen_HYPER_EQUIPPED    0.014698
33                       fl_terrace    0.013508
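
MDI importances are computed on the training data, so a common cross-check is permutation importance on held-out rows. A minimal sketch on synthetic data (all names here are hypothetical, and this is a complement to the method above, not a replacement for it): each feature is shuffled in turn and the resulting drop in test R-squared is recorded.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
# Feature 0 is strong, feature 1 is weak, feature 2 is pure noise
y_toy = 5.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
rf_toy = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the score drop
result = permutation_importance(rf_toy, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

print("features ranked by permutation importance:", ranking)
print("mean importance per feature:", result.importances_mean.round(3))
```

The same call accepts a whole pipeline, e.g. `permutation_importance(forest_model, X_test, y_test)`, in which case the importances refer to the raw input columns rather than the encoded ones.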


Fit XGBoost Model

In [136]:
xgboost_model.fit(X_train,y_train)

Pipeline(steps=[('preprocess', ...), ('model', ...)], verbose=False)
  preprocess: ColumnTransformer(remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True)
    num:    SimpleImputer(strategy='constant', fill_value=-1, copy=True, add_indicator=True, keep_empty_features=False)
    skewed: SimpleImputer(strategy='constant', fill_value=-1, copy=True, add_indicator=True, keep_empty_features=False)
    onehot: SimpleImputer(strategy='constant', copy=True, add_indicator=False, keep_empty_features=False)
            -> OneHotEncoder(categories='auto', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='ignore', feature_name_combiner='concat')
    target: TargetEncoder(verbose=0, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=20, smoothing=1.0)
    binary: passthrough
  model: XGBRegressor(objective='reg:squarederror', colsample_bytree=0.9, enable_categorical=False, ...)


In [137]:
# Quick prediction checks with XGBoost

y_pred = xgboost_model.predict(X_test)
print(y_pred[:30]) 

[6.1048819e+05 3.8414553e+05 4.1342119e+05 1.5365978e+05 6.8060450e+05
 1.8799012e+05 3.4701681e+05 1.4487374e+06 3.9391634e+05 2.8008447e+05
 2.2488791e+05 2.4734533e+05 4.0906447e+05 3.3622738e+05 3.1712125e+05
 1.5983608e+05 6.6382944e+05 3.3295938e+05 1.0140403e+05 3.1160544e+05
 2.2977578e+05 2.9652591e+05 3.9247553e+05 2.3496625e+05 3.9715534e+05
 3.4780275e+05 3.3674969e+05 1.1070250e+06 3.3970728e+05 4.5697991e+05]


In [138]:
# Checking feature importance (three XGBoost importance types):
# weight - number of times a feature is used to split, across all trees; can be biased towards features with many distinct values
# gain - average improvement in the training loss over the splits that use the feature
# cover - average number of training samples affected by the splits that use the feature (for squared-error loss)

# Getting the feature names back
preprocessor = xgboost_model.named_steps["preprocess"]
feature_names = get_column_names(preprocessor)

booster = xgboost_model.named_steps["model"].get_booster() # This is where the booster is stored

# Extracting importance metrics
importance_gain = booster.get_score(importance_type='gain')
importance_weight = booster.get_score(importance_type='weight')
importance_cover = booster.get_score(importance_type='cover')

# Map "f0", "f1" etc to actual feature names
importance_gain_named = {feature_names[int(k[1:])]: v for k, v in importance_gain.items()}
importance_weight_named = {feature_names[int(k[1:])]: v for k, v in importance_weight.items()}
importance_cover_named = {feature_names[int(k[1:])]: v for k, v in importance_cover.items()}

all_features = feature_names

df_importance = pd.DataFrame({
    'Feature': all_features,
    'Gain': [importance_gain_named.get(f, 0) for f in all_features],
    'Weight': [importance_weight_named.get(f, 0) for f in all_features],
    'Cover': [importance_cover_named.get(f, 0) for f in all_features]
})

# Optional: sort by Gain descending
df_importance = df_importance.sort_values(by='Gain', ascending=False)

df_importance

                                            Feature             Gain  Weight        Cover
7                                    total_area_sqm  5821614000000.0  2451.0   8482.21582
20                   equipped_kitchen_NOT_INSTALLED  4493871000000.0    97.0   6201.27832
32                                         province  4045266000000.0  1893.0  3637.886475
35                                 fl_swimming_pool  3684798000000.0   227.0   5628.52002
26                                      epc_MISSING  3117794000000.0   174.0  2148.695312
22              equipped_kitchen_USA_HYPER_EQUIPPED  2952699000000.0   200.0  5599.060059
31                                 subproperty_type  2925858000000.0  1660.0  7444.703125
17                  equipped_kitchen_HYPER_EQUIPPED  2870666000000.0   296.0  4587.905273
5   missingindicator_primary_energy_consumption_sqm  2300981000000.0    43.0  4530.767578
0                                  cadastral_income  1883369000000.0  2003.0   6127.98584


## Improvements of the Models ##

Cross-validation

CV ≈ test -> the model generalizes fine

CV > test -> the current train/test split is unlucky (the test set is harder than average)

CV < test -> the current train/test split is lucky; the single test score overstates performance

In [None]:
from sklearn.model_selection import cross_val_score

pipeline = xgboost_model  # swap in forest_model or linear_model to compare
cv_r2 = cross_val_score(pipeline, X, y, scoring='r2', cv=10)

print("CV R-sqr per fold:", cv_r2)
print("Mean CV R-sqr:", cv_r2.mean())

# Interpretation - the model generalizes reasonably well after dropping certain multicollinear features,
# adding a validation split, and choosing a new random state: the mean CV R-sqr (0.59) is close to the
# test R-sqr (0.62), although the folds vary widely (0.43 to 0.68).

CV R-sqr per fold: [0.67637552 0.58698878 0.58840938 0.6282246  0.56862229 0.54640019
 0.64970632 0.66507007 0.53190033 0.42871616]
Mean CV R-sqr: 0.587041364400964
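
With an integer `cv`, `cross_val_score` uses consecutive, unshuffled folds for a regressor, so any ordering in the file can make individual folds unrepresentative. A minimal sketch on synthetic data, assuming a `Ridge` model as a stand-in (the data and the model choice are hypothetical): an explicit shuffled `KFold` plus `cross_validate` reports both train and test scores, which makes the train/test gap visible per fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Shuffled folds with a fixed seed, so results are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    Ridge(), X_toy, y_toy, cv=cv, scoring="r2", return_train_score=True
)

print("test R2 : %.3f +/- %.3f" % (scores["test_score"].mean(), scores["test_score"].std()))
print("train R2: %.3f" % scores["train_score"].mean())
```

Passing one of the pipelines above in place of `Ridge()` keeps the preprocessing inside each fold, so the encoders and imputers are refitted on the training part of every split.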


## Metrics for the Models: ##

Train-test comparisons of R-squared, MAE, and RMSE across models

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(model, X_train, y_train, X_test, y_test): # Creates a dictionary with results

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    results = {
        'R2_train': r2_score(y_train, y_pred_train),
        'R2_test': r2_score(y_test, y_pred_test),
        'MAE_train': mean_absolute_error(y_train, y_pred_train),
        'MAE_test': mean_absolute_error(y_test, y_pred_test),
        'RMSE_train': np.sqrt(mean_squared_error(y_train, y_pred_train)),
        'RMSE_test': np.sqrt(mean_squared_error(y_test, y_pred_test))
    }
    
    return results

# Using the pipelines

results_lr = evaluate_model(linear_model, X_train, y_train, X_test, y_test)
results_rf = evaluate_model(forest_model, X_train, y_train, X_test, y_test)
results_xgb = evaluate_model(xgboost_model, X_train, y_train, X_test, y_test)

df_results = pd.DataFrame([results_lr, results_rf, results_xgb],index=['LinearRegression', 'RandomForest', 'XGBoost'])
df_results


# Large gap between train and test R-sqr for RF (0.91 vs 0.62) and XGBoost (0.83 vs 0.62): overfitting
# Possible strategies:
# - Hyperparameter tuning with cross-validation (GridSearchCV or RandomizedSearchCV)
# - Feature selection: remove uninformative features (e.g. by gain importance) to reduce overfitting

                  R2_train   R2_test      MAE_train       MAE_test     RMSE_train      RMSE_test
LinearRegression  0.333305  0.328588  129401.702211  131042.642411   354869.54447  368252.874352
RandomForest      0.913615  0.622803   41770.481965  104093.948465  127739.429377  276016.839684
XGBoost           0.826203  0.617946   88724.966511  110107.261864  181186.608079  277788.307111
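
The tuning strategy named above can be sketched with `RandomizedSearchCV`. This is a minimal example on synthetic data, with an illustrative search space that is an assumption and not a recommendation for this dataset; it constrains tree depth and leaf size, the usual levers against random forest overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

# Illustrative search space (hypothetical values)
param_distributions = {
    "max_depth": [4, 8, 12, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": [0.5, 0.8, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation per combination
    scoring="r2",
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

When the estimator is a pipeline such as `forest_model`, the parameter names take the step prefix, e.g. `"model__max_depth"` instead of `"max_depth"`.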
