# Notebook: 03_train_model.ipynb

This notebook trains a regression model to predict property values (`valuation_k`) from the synthetic real estate dataset. It includes preprocessing, data validation, Optuna hyperparameter tuning, LightGBM training, model persistence, and feature importance analysis.

## System Architecture Summary

This notebook is the modeling and training phase of the pipeline. It combines a held-out test split, Optuna hyperparameter optimization, and versioned metadata to produce a real estate valuation model.

**Data Validation & Splitting:**
- Train/test and train/valid splits
- Overfitting risk management

**Preprocessing:**
- One-hot encoding of categorical features inside a scikit-learn `Pipeline`; numeric features pass through unscaled

**Model Training:**
- LightGBM + Optuna for tree-based regression on a `log1p`-transformed target
- Optuna study minimizes RMSE; MAE is used for cross-validation and reporting

**Evaluation & Interpretability:**
- Test set evaluation
- Feature importance diagnostics

**Persistence & Audit:**
- Trained model and metadata saved
- Enables reproducible and explainable predictions

This notebook turns clean tabular data into a saved, reloadable predictive model with tuned hyperparameters and basic quality checks.

## 01. Imports & Dataset Upload

### Technical Overview
Initializes the environment by importing key libraries and defining the run configuration.

### Implementation Details
- Libraries: `numpy`, `pandas`, `lightgbm`, `optuna`, `sklearn`, `joblib`, `hashlib`, `json`, `os`
- Sets `RANDOM_STATE`, `DATA_PATH` (`property_dataset_v1.csv`), and `MODEL_BASE_DIR`
- Defines `ASSET_CONFIG` with the target, categorical, numeric, and excluded columns per asset type

### Purpose
Ensures required packages are available and that the feature schema is declared in one place before the data is loaded.

### Output
None; the cell only defines constants and fails fast if `ASSET_TYPE` is not configured.

In [None]:
import os
import json
import hashlib
import joblib

from datetime import datetime

import numpy as np
import pandas as pd

import optuna
import optuna.integration.lightgbm as lgb_optuna
import lightgbm as lgb
from lightgbm import Dataset as lgbDataset
from lightgbm import early_stopping, log_evaluation

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    make_scorer,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

import warnings

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

ASSET_TYPE = "property"
DATA_PATH = "../data/property_dataset_v1.csv"
MODEL_BASE_DIR = "../models"

ASSET_CONFIG = {
    "property": {
        "target": "valuation_k",
        "categorical": [
            "location",
            "energy_class",
            "has_elevator",
            "has_garden",
            "has_balcony",
            "garage",
        ],
        "numeric": [
            "size_m2",
            "rooms",
            "bathrooms",
            "year_built",
            "floor",
            "building_floors",
            "humidity_level",
            "temperature_avg",
            "noise_level",
            "air_quality_index",  # base environment
            # "age_years" is appended later if it exists in the data or is derived
        ],
        "exclude": [
            "asset_id",
            "asset_type",
            "condition_score",
            "risk_score",
            "last_verified_ts",
        ],
    },
    # Placeholder for future assets
    "art": {"target": "valuation_k", "categorical": [], "numeric": [], "exclude": []},
}

assert ASSET_TYPE in ASSET_CONFIG, f"Unknown asset_type: {ASSET_TYPE}"
cfg = ASSET_CONFIG[ASSET_TYPE]

## 02. Load Dataset

### Technical Overview
Loads the full dataset into a DataFrame.

### Implementation Details
- Reads `property_dataset_v1.csv` with `pandas.read_csv`
- Prints the dataset path and shape as a load confirmation
- Sample rows and required columns (e.g., `valuation_k`, `size_m2`, `location`, `energy_class`) are inspected in steps 03 and 04

### Purpose
Confirms that the file is readable and has the expected size before any derivation or validation.

### Output
Printed path and shape of the loaded dataset.

In [132]:
df = pd.read_csv(DATA_PATH)
print("Loaded dataset:", DATA_PATH, "| shape:", df.shape)

Loaded dataset: ../data/property_dataset_v1.csv | shape: (150, 23)


## 03. Normalization / Derivations

### Technical Overview
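Normalizes a column name and derives the property age used as a numeric feature.

### Implementation Details
- Renames `year_build` to `year_built` when only the misspelled column exists
- Derives `age_years` as the current UTC year minus `year_built` when the column is missing
- Appends `age_years` to the numeric feature list in `cfg`

### Purpose
Makes sure the age feature is available and registered before validation and training.

### Output
Dataset shape and a preview of the first five rows.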

In [133]:
current_year = datetime.utcnow().year
if "year_build" in df.columns and "year_built" not in df.columns:
    df = df.rename(columns={"year_build": "year_built"})

if "age_years" not in df.columns and "year_built" in df.columns:
    df["age_years"] = current_year - df["year_built"]

# Ensure age_years in numeric list if present
if "age_years" in df.columns and "age_years" not in cfg["numeric"]:
    cfg["numeric"].append("age_years")

print("Dataset shape:", df.shape)
df.head()

Dataset shape: (150, 23)


| | asset_id | asset_type | location | size_m2 | rooms | bathrooms | year_built | age_years | floor | building_floors | ... | garage | energy_class | humidity_level | temperature_avg | noise_level | air_quality_index | valuation_k | condition_score | risk_score | last_verified_ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | asset_0000 | property | Naples | 142 | 5 | 1 | 1964 | 61 | 2 | 7 | ... | 1 | B | 53.9 | 17.8 | 42 | 104 | 348.41 | 0.852 | 0.14 | 2025-06-03T09:40:42Z |
| 1 | asset_0001 | property | Milan | 170 | 6 | 2 | 1979 | 46 | 1 | 9 | ... | 0 | A | 69.7 | 20.0 | 77 | 51 | 222.1 | 0.73 | 0.261 | 2025-07-15T07:09:42Z |
| 2 | asset_0002 | property | Palermo | 54 | 4 | 3 | 2013 | 12 | 0 | 3 | ... | 1 | F | 64.4 | 20.8 | 28 | 68 | 78.45 | 0.742 | 0.271 | 2025-07-05T22:46:42Z |
| 3 | asset_0003 | property | Palermo | 48 | 3 | 1 | 1951 | 74 | 3 | 7 | ... | 0 | B | 47.6 | 13.6 | 27 | 76 | 90.58 | 0.776 | 0.216 | 2025-06-29T05:14:42Z |
| 4 | asset_0004 | property | Rome | 171 | 3 | 2 | 1955 | 70 | 1 | 5 | ... | 1 | D | 37.4 | 24.6 | 45 | 73 | 591.7 | 0.764 | 0.254 | 2025-06-28T01:45:42Z |
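Because `current_year` is read from the clock, `age_years` shifts by one whenever the notebook is rerun in a later year. A minimal sketch of pinning the reference year is shown here; `REFERENCE_YEAR` and `add_age_years` are hypothetical names that do not appear in the notebook, and 2025 is assumed from the ages in the preview (1964 maps to 61).

```python
import pandas as pd

REFERENCE_YEAR = 2025  # hypothetical constant, assumed from the recorded preview


def add_age_years(frame: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Return a copy with `age_years` derived from `year_built` when it is missing."""
    out = frame.copy()
    if "age_years" not in out.columns and "year_built" in out.columns:
        out["age_years"] = reference_year - out["year_built"]
    return out


# Build years taken from the first three preview rows
demo_age = add_age_years(pd.DataFrame({"year_built": [1964, 1979, 2013]}))
print(demo_age["age_years"].tolist())
```

Storing the reference year alongside the model metadata would let inference code derive the same feature values.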


## 04. Sanity checks

### Technical Overview
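Checks that every configured column exists and lists the candidate features.

### Implementation Details
- Raises `ValueError` if the target or any configured categorical or numeric column is missing
- Builds `feature_candidates` by dropping the excluded columns and the target

### Purpose
Ensures the dataset schema matches `ASSET_CONFIG` before modeling.

### Output
Printed target, feature groups, excluded columns, and feature candidates.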

In [134]:
required_base = [cfg["target"]] + cfg["categorical"] + cfg["numeric"]
missing = [c for c in required_base if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns in dataset: {missing}")

# Remove excluded & target from feature candidates
excluded = set(cfg["exclude"] + [cfg["target"]])
feature_candidates = [c for c in df.columns if c not in excluded]

print("Target:", cfg["target"])
print("Categorical:", cfg["categorical"])
print("Numeric:", cfg["numeric"])
print("Excluded:", cfg["exclude"])
print("Feature candidates (pre-filter):", feature_candidates)

Target: valuation_k
Categorical: ['location', 'energy_class', 'has_elevator', 'has_garden', 'has_balcony', 'garage']
Numeric: ['size_m2', 'rooms', 'bathrooms', 'year_built', 'floor', 'building_floors', 'humidity_level', 'temperature_avg', 'noise_level', 'air_quality_index', 'age_years']
Excluded: ['asset_id', 'asset_type', 'condition_score', 'risk_score', 'last_verified_ts']
Feature candidates (pre-filter): ['location', 'size_m2', 'rooms', 'bathrooms', 'year_built', 'age_years', 'floor', 'building_floors', 'has_elevator', 'has_garden', 'has_balcony', 'garage', 'energy_class', 'humidity_level', 'temperature_avg', 'noise_level', 'air_quality_index']
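The cell above only checks column presence. A sketch of a slightly broader report, covering null counts and negative targets as well, could look like the following; `basic_quality_report` is a hypothetical helper and the demo frame is made up for illustration.

```python
import pandas as pd


def basic_quality_report(frame: pd.DataFrame, required: list, target: str) -> dict:
    """Summarize missing columns, null counts, and negative target values."""
    missing_cols = [c for c in required if c not in frame.columns]
    present = [c for c in required if c in frame.columns]
    null_counts = frame[present].isnull().sum()
    negative = int((frame[target] < 0).sum()) if target in frame.columns else None
    return {
        "missing_columns": missing_cols,
        "columns_with_nulls": {col: int(n) for col, n in null_counts.items() if n > 0},
        "negative_targets": negative,
    }


# Made-up rows: one null in size_m2, one negative valuation, no energy_class column
demo_frame = pd.DataFrame(
    {
        "valuation_k": [348.41, -5.0],
        "size_m2": [142.0, None],
        "location": ["Naples", "Milan"],
    }
)
report = basic_quality_report(
    demo_frame,
    required=["valuation_k", "size_m2", "location", "energy_class"],
    target="valuation_k",
)
print(report)
```

Negative targets matter here because a later step applies `np.log1p` to `valuation_k`, which is undefined for values at or below -1.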


## 05. Overfitting check

### Technical Overview
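Defines a helper that compares training, cross-validation, and test error for a fitted pipeline.

### Implementation Details
- Computes MAE on the training and test sets and 5-fold CV MAE on the training set
- Flags severe overfitting when training MAE is below half the CV MAE
- Flags poor generalization when test MAE is more than twice the CV MAE

### Purpose
Provides a quick overfitting diagnostic that is called after the final fit in step 12.

### Output
None at definition time; when called, it prints the three MAE values and returns them as a dict.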

In [135]:
def comprehensive_overfitting_check(pipeline, X_train, X_test, y_train, y_test):
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)

    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)

    cv_scores = cross_val_score(
        pipeline, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae = -cv_scores.mean()

    print(f"\n🔍 OVERFITTING ANALYSIS")
    print(f"Training MAE: {train_mae:.2f}")
    print(f"CV MAE:       {cv_mae:.2f}")
    print(f"Test MAE:     {test_mae:.2f}")

    if train_mae < cv_mae * 0.5:
        print("❌ SEVERE OVERFITTING: Train << CV")
    if test_mae > cv_mae * 2:
        print("❌ POOR GENERALIZATION: Test >> CV")

    return {"train": train_mae, "cv": cv_mae, "test": test_mae}
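To illustrate what the first threshold catches, here is a small self-contained sketch on synthetic data. An unconstrained `DecisionTreeRegressor` is used as a stand-in model (an assumption for the demo, not the notebook's model) because it memorizes the training rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: one informative feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A fully grown tree fits the training rows exactly
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
cv_mae = -cross_val_score(
    tree, X_tr, y_tr, cv=5, scoring="neg_mean_absolute_error"
).mean()
test_mae = mean_absolute_error(y_te, tree.predict(X_te))

# Same rule as the helper: training error far below CV error
severe_overfit = train_mae < 0.5 * cv_mae
print(f"train={train_mae:.2f} cv={cv_mae:.2f} test={test_mae:.2f} flagged={severe_overfit}")
```

The gap between training and CV error, rather than either value alone, is what the rule looks at.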

## 06. Final feature list = categorical + numeric

### Technical Overview
Defines lists of input features by type for preprocessing pipeline.

### Implementation Details
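- Categorical: `location`, `energy_class`, `has_elevator`, `has_garden`, `has_balcony`, `garage`
- Numeric: `size_m2`, `rooms`, `bathrooms`, `year_built`, `age_years`, etc. (`condition_score` and `risk_score` are excluded)

### Purpose
Prepares clear schema for modeling stages.

### Output
Printed `feature_list`; feature matrix `X` and target `y` stored in variables.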

In [136]:
feature_list = cfg["categorical"] + cfg["numeric"]
print("Final feature_list used:", feature_list)

X = df[feature_list].copy()
y = df[cfg["target"]].copy()

Final feature_list used: ['location', 'energy_class', 'has_elevator', 'has_garden', 'has_balcony', 'garage', 'size_m2', 'rooms', 'bathrooms', 'year_built', 'floor', 'building_floors', 'humidity_level', 'temperature_avg', 'noise_level', 'air_quality_index', 'age_years']


## 07. Train/test split

### Technical Overview
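Log-transforms the target and splits the dataset into training and test sets.

### Implementation Details
- Replaces `y` with `np.log1p(valuation_k)`, so the model is trained and tuned on the log scale
- Uses `train_test_split` from `sklearn.model_selection` with an 80/20 ratio and `random_state=RANDOM_STATE`
- Aligns the category levels of any object-typed columns across train and test

### Purpose
Holds out data for generalization testing and keeps categorical levels consistent between the two sets.

### Output
Four objects: `X_train`, `X_test`, `y_train`, `y_test`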
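The first cell of this step overwrites `y` with `np.log1p(df["valuation_k"])`, so every metric computed before the final back-transform is on the log scale. A short sketch of the round trip, using the five valuations from the preview in step 03:

```python
import numpy as np

# valuation_k values from the five preview rows (thousands of euros)
valuations_k = np.array([348.41, 222.1, 78.45, 90.58, 591.7])

y_log = np.log1p(valuations_k)   # what the model is trained on
restored = np.expm1(y_log)       # inverse transform used at evaluation time

print("log-scale range:", round(float(y_log.min()), 2), "to", round(float(y_log.max()), 2))
print("round trip ok:", bool(np.allclose(restored, valuations_k)))
```

The transform compresses the long right tail of prices, which tends to make squared-error objectives less dominated by a few expensive properties.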

In [137]:
y = np.log1p(df["valuation_k"])

In [138]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)


for col in X_train.select_dtypes(include="object").columns:
    all_categories = (
        pd.Series(pd.concat([X_train[col], X_test[col]]))
        .astype("category")
        .cat.categories
    )
    X_train[col] = X_train[col].astype("category").cat.set_categories(all_categories)
    X_test[col] = X_test[col].astype("category").cat.set_categories(all_categories)
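The loop above matters because pandas assigns integer codes per column independently. A small sketch with made-up city values shows how codes collide when train and test are converted separately, and how a shared category list fixes it:

```python
import pandas as pd

train_col = pd.Series(["Milan", "Rome", "Rome"])
test_col = pd.Series(["Naples", "Milan"])

# Converted independently: the same code means different cities in each set
naive_train_codes = train_col.astype("category").cat.codes.tolist()
naive_test_codes = test_col.astype("category").cat.codes.tolist()

# Converted against the union of levels, as in the cell above
all_categories = pd.concat([train_col, test_col]).astype("category").cat.categories
train_cat = train_col.astype("category").cat.set_categories(all_categories)
test_cat = test_col.astype("category").cat.set_categories(all_categories)

print("naive:  ", naive_train_codes, naive_test_codes)
print("aligned:", train_cat.cat.codes.tolist(), test_cat.cat.codes.tolist())
```

In the naive conversion, code 1 stands for Rome in the training column and for Naples in the test column; after alignment each city keeps one code in both.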

## 08. Tune/valid split

### Technical Overview
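Further splits the training set into tuning and validation subsets and runs Optuna's LightGBM tuner on them.

### Implementation Details
- Creates `X_tune`, `X_valid`, `y_tune`, `y_valid` with an 80/20 split of the training set
- Wraps both subsets in `lgb.Dataset` objects with `location` and `energy_class` declared as categorical
- Calls `optuna.integration.lightgbm.train` with MAE as the metric and early stopping after 10 rounds

### Purpose
Supports parameter tuning without leaking into test data.

### Output
Tuning-specific train and validation datasets and the tuner result in `optuna_result`. The recorded execution (`In [1]`) ended with a `NameError`, which indicates the cell was run before the imports cell.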
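With 150 rows, the two nested 80/20 splits leave fairly small subsets. The sketch below reproduces only the split arithmetic on row indices; it assumes the same `test_size` and `random_state` values as the notebook and does not touch the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(150)  # one id per dataset row

# Outer split: training vs. held-out test
train_ids, test_ids = train_test_split(row_ids, test_size=0.20, random_state=42)

# Inner split: tuning vs. validation, taken from the training rows only
tune_ids, valid_ids = train_test_split(train_ids, test_size=0.2, random_state=42)

print(len(tune_ids), len(valid_ids), len(test_ids))
```

A validation set of a couple dozen rows gives noisy scores, which is one reason to cross-check tuned parameters with k-fold CV on the training set.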

In [1]:
X_tune, X_valid, y_tune, y_valid = train_test_split(
    X_train, y_train, test_size=0.2, random_state=RANDOM_STATE
)

dtrain = lgb.Dataset(
    X_tune, label=y_tune, categorical_feature=["location", "energy_class"]
)
dvalid = lgb.Dataset(
    X_valid, label=y_valid, categorical_feature=["location", "energy_class"]
)

# Baseline starting parameters (not tuned)
base_params = {
    "objective": "regression",
    "metric": "mae",
    "boosting_type": "gbdt",
    "verbosity": -1,
    "force_col_wise": True,
    "seed": RANDOM_STATE,
}

# Optuna tuning
optuna_result = lgb_optuna.train(
    params=base_params,
    train_set=dtrain,
    valid_sets=[dvalid],
    num_boost_round=1000,
    callbacks=[
        early_stopping(stopping_rounds=10),  # ← more aggressive
        log_evaluation(period=25),
    ],
)

NameError: name 'train_test_split' is not defined

## 09. Validate training data

### Technical Overview
Confirms training data integrity post-split.

### Implementation Details
- Checks for missing feature values and a zero-variance target; raises `ValueError` on failure

### Purpose
Final check before training pipeline starts.

### Output
Confirmation message when validation passes.

In [140]:
def validate_training_data(X, y):
    """Validate data quality before training"""
    print("🔍 VALIDATING TRAINING DATA")
    missing_X = X.isnull().sum()
    if missing_X.any():
        raise ValueError(f"Missing values in features:\n{missing_X[missing_X > 0]}")
    if y.std() == 0:
        raise ValueError("Target variable has zero variance")
    print("✅ Training data validation passed")


validate_training_data(X_train, y_train)

🔍 VALIDATING TRAINING DATA
✅ Training data validation passed


## 10. Preprocessor

### Technical Overview
Builds preprocessing pipeline for categorical encoding and numeric processing.

### Implementation Details
- One-Hot Encoding for categoricals using `ColumnTransformer`
- No scaling for tree-based model

### Purpose
Creates clean feature matrix aligned with LightGBM input expectations.

### Output
Unfitted `preprocessor` object; it is fitted later as the first step of the pipeline.

In [141]:
categorical_cols = cfg["categorical"]
numeric_cols = cfg["numeric"]
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numeric_cols),
    ]
)

## 11. Optuna & LightGBM

### Technical Overview
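Tunes LightGBM hyperparameters with an Optuna study, then cross-validates the resulting pipeline.

### Implementation Details
- Defines an Optuna `objective()` that returns RMSE on the log-transformed target
- Each trial fits on `X_train` and is scored, with early stopping, on `X_test`, so the held-out test set also drives hyperparameter selection
- Explores search space for:
 - `num_leaves`, `max_depth`, `learning_rate`, `min_child_samples`, `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda`
- Runs up to 30 trials (10-minute timeout) and stores the best parameters in `best_params`
- Builds the final `Pipeline` (preprocessor + `LGBMRegressor`) and runs 5-fold CV on the training set

### Purpose
Optimizes model for generalization and performance.

### Output
Best parameter set with its RMSE, followed by the cross-validation MAE (both on the log scale).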

In [142]:
# Convert the data to a LightGBM Dataset
dtrain = lgb.Dataset(
    X_train, label=y_train, categorical_feature=["location", "energy_class"]
)


def objective(trial):
    params = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "seed": RANDOM_STATE,
        "num_leaves": trial.suggest_int("num_leaves", 20, 100),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 50),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.01, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.01, 1.0),
    }

    model = lgb.LGBMRegressor(**params)

    # For maximum compatibility, avoid passing 'early_stopping_rounds' directly
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)],
        eval_metric="rmse",
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )

    preds = model.predict(X_test)

    # Compute RMSE manually for maximum compatibility
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)

    return rmse


print("🔍 Avvio hyperparameter tuning con Optuna + LightGBM...")
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30, timeout=600)

# Best parameters
best_params = study.best_params
best_params.update(
    {
        "objective": "regression",
        "metric": "rmse",
        "boosting_type": "gbdt",
        "verbosity": -1,
        "force_col_wise": True,
        "n_estimators": 1000,
    }
)
print(f"✅ Migliori parametri trovati: {best_params}")

regressor = lgb.LGBMRegressor(**best_params)

# Rebuild the pipeline with the new parameters
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("regressor", regressor)])

[I 2025-07-22 05:53:42,936] A new study created in memory with name: no-name-20ac2aea-52f3-4230-bd03-cba1871c2d64
[I 2025-07-22 05:53:42,971] Trial 0 finished with value: 0.3044620806960473 and parameters: {'num_leaves': 37, 'max_depth': 9, 'learning_rate': 0.039558317138941854, 'min_child_samples': 26, 'subsample': 0.7286599421150959, 'colsample_bytree': 0.8206448784448872, 'reg_alpha': 0.7157032275682446, 'reg_lambda': 0.8645868451268238}. Best is trial 0 with value: 0.3044620806960473.
[I 2025-07-22 05:53:43,013] Trial 1 finished with value: 0.33009361749179966 and parameters: {'num_leaves': 90, 'max_depth': 8, 'learning_rate': 0.05106812741396067, 'min_child_samples': 10, 'subsample': 0.8342468101552619, 'colsample_bytree': 0.8675839388890634, 'reg_alpha': 0.5737954898523215, 'reg_lambda': 0.20679955838987457}. Best is trial 0 with value: 0.3044620806960473.
[I 2025-07-22 05:53:43,036] Trial 2 finished with value: 0.31380220137403453 and parameters: {'num_leaves': 78, 'max_depth': 

🔍 Avvio hyperparameter tuning con Optuna + LightGBM...
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[68]	valid_0's rmse: 0.304462
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[39]	valid_0's rmse: 0.330094
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[18]	valid_0's rmse: 0.313802
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[53]	valid_0's rmse: 0.326655
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[98]	valid_0's rmse: 0.295522
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 0.320831


[I 2025-07-22 05:53:43,146] Trial 6 finished with value: 0.30400302332084145 and parameters: {'num_leaves': 38, 'max_depth': 6, 'learning_rate': 0.03061147593427014, 'min_child_samples': 23, 'subsample': 0.7963779834958085, 'colsample_bytree': 0.8918898416100669, 'reg_alpha': 0.732629157507701, 'reg_lambda': 0.16322067309872546}. Best is trial 4 with value: 0.29552201057418404.
[I 2025-07-22 05:53:43,169] Trial 7 finished with value: 0.30587877886114434 and parameters: {'num_leaves': 64, 'max_depth': 4, 'learning_rate': 0.0912226301164973, 'min_child_samples': 20, 'subsample': 0.8179450382698401, 'colsample_bytree': 0.7982770061759245, 'reg_alpha': 0.5661992524613584, 'reg_lambda': 0.7928795906794547}. Best is trial 4 with value: 0.29552201057418404.
[I 2025-07-22 05:53:43,191] Trial 8 finished with value: 0.31222157348002477 and parameters: {'num_leaves': 82, 'max_depth': 8, 'learning_rate': 0.09131294571808514, 'min_child_samples': 38, 'subsample': 0.6576193436731063, 'colsample_bytr

Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[99]	valid_0's rmse: 0.304003
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[28]	valid_0's rmse: 0.305879
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[17]	valid_0's rmse: 0.312222
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[22]	valid_0's rmse: 0.30631
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[78]	valid_0's rmse: 0.331091


[I 2025-07-22 05:53:43,352] Trial 11 finished with value: 0.31167942401414855 and parameters: {'num_leaves': 45, 'max_depth': 6, 'learning_rate': 0.014213244205784233, 'min_child_samples': 36, 'subsample': 0.7550464691506262, 'colsample_bytree': 0.9399604268151396, 'reg_alpha': 0.9553543963558491, 'reg_lambda': 0.05465311189276878}. Best is trial 4 with value: 0.29552201057418404.
[I 2025-07-22 05:53:43,419] Trial 12 finished with value: 0.3435256859728021 and parameters: {'num_leaves': 20, 'max_depth': 6, 'learning_rate': 0.02137082560896484, 'min_child_samples': 49, 'subsample': 0.9881973010073375, 'colsample_bytree': 0.6152941815466668, 'reg_alpha': 0.23121282353033387, 'reg_lambda': 0.2732333026242829}. Best is trial 4 with value: 0.29552201057418404.
[I 2025-07-22 05:53:43,487] Trial 13 finished with value: 0.27997424868973225 and parameters: {'num_leaves': 38, 'max_depth': 3, 'learning_rate': 0.0653573978764749, 'min_child_samples': 31, 'subsample': 0.8824758177258434, 'colsample

Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[96]	valid_0's rmse: 0.311679
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[99]	valid_0's rmse: 0.343526
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[33]	valid_0's rmse: 0.279974


[I 2025-07-22 05:53:43,555] Trial 14 finished with value: 0.3069714203908374 and parameters: {'num_leaves': 49, 'max_depth': 3, 'learning_rate': 0.06836965083987845, 'min_child_samples': 33, 'subsample': 0.9104614305183403, 'colsample_bytree': 0.7439845285610837, 'reg_alpha': 0.9369074010374681, 'reg_lambda': 0.6298173783742429}. Best is trial 13 with value: 0.27997424868973225.
[I 2025-07-22 05:53:43,624] Trial 15 finished with value: 0.29315065431920895 and parameters: {'num_leaves': 26, 'max_depth': 3, 'learning_rate': 0.06656033906264314, 'min_child_samples': 43, 'subsample': 0.9425340843406955, 'colsample_bytree': 0.7592992837509881, 'reg_alpha': 0.23805844916382912, 'reg_lambda': 0.5987122517679684}. Best is trial 13 with value: 0.27997424868973225.
[I 2025-07-22 05:53:43,692] Trial 16 finished with value: 0.29471221343432974 and parameters: {'num_leaves': 27, 'max_depth': 3, 'learning_rate': 0.06142329066967155, 'min_child_samples': 44, 'subsample': 0.8671557920177685, 'colsampl

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[33]	valid_0's rmse: 0.306971
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[93]	valid_0's rmse: 0.293151
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 0.294712
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[38]	valid_0's rmse: 0.280462


[I 2025-07-22 05:53:43,759] Trial 17 finished with value: 0.2804620843391413 and parameters: {'num_leaves': 42, 'max_depth': 4, 'learning_rate': 0.07992248276147704, 'min_child_samples': 30, 'subsample': 0.9469915905019404, 'colsample_bytree': 0.7822572406523027, 'reg_alpha': 0.2574230315501454, 'reg_lambda': 0.5712949393421348}. Best is trial 13 with value: 0.27997424868973225.
[I 2025-07-22 05:53:43,825] Trial 18 finished with value: 0.2787179013501493 and parameters: {'num_leaves': 56, 'max_depth': 4, 'learning_rate': 0.08102412204849659, 'min_child_samples': 31, 'subsample': 0.8635293516890639, 'colsample_bytree': 0.9005239377704137, 'reg_alpha': 0.9975168592264588, 'reg_lambda': 0.6998781813653674}. Best is trial 18 with value: 0.2787179013501493.
[I 2025-07-22 05:53:43,890] Trial 19 finished with value: 0.2780178256976923 and parameters: {'num_leaves': 57, 'max_depth': 5, 'learning_rate': 0.08206228004287347, 'min_child_samples': 31, 'subsample': 0.8407413590440636, 'colsample_by

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[24]	valid_0's rmse: 0.278718
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[19]	valid_0's rmse: 0.278018
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[25]	valid_0's rmse: 0.300609
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[17]	valid_0's rmse: 0.306672


[I 2025-07-22 05:53:44,022] Trial 21 finished with value: 0.306672166160071 and parameters: {'num_leaves': 54, 'max_depth': 5, 'learning_rate': 0.08001522872511428, 'min_child_samples': 33, 'subsample': 0.8479281458614257, 'colsample_bytree': 0.9177860974485732, 'reg_alpha': 0.8612379825710968, 'reg_lambda': 0.7411205600483921}. Best is trial 19 with value: 0.2780178256976923.
[I 2025-07-22 05:53:44,096] Trial 22 finished with value: 0.2782675880124469 and parameters: {'num_leaves': 61, 'max_depth': 4, 'learning_rate': 0.058375528814417356, 'min_child_samples': 32, 'subsample': 0.8820176649101189, 'colsample_bytree': 0.9544180031354192, 'reg_alpha': 0.9767548826429602, 'reg_lambda': 0.7038838260603468}. Best is trial 19 with value: 0.2780178256976923.
[I 2025-07-22 05:53:44,163] Trial 23 finished with value: 0.3092747775513648 and parameters: {'num_leaves': 58, 'max_depth': 4, 'learning_rate': 0.056894040442701514, 'min_child_samples': 35, 'subsample': 0.8603431148649733, 'colsample_by

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[40]	valid_0's rmse: 0.278268
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[25]	valid_0's rmse: 0.309275
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[23]	valid_0's rmse: 0.301261
Training until validation scores don't improve for 50 rounds


[I 2025-07-22 05:53:44,302] Trial 25 finished with value: 0.30275871253175884 and parameters: {'num_leaves': 62, 'max_depth': 4, 'learning_rate': 0.07555476240090538, 'min_child_samples': 22, 'subsample': 0.8151656458366023, 'colsample_bytree': 0.8852137320355099, 'reg_alpha': 0.6709236174631036, 'reg_lambda': 0.7043163608296983}. Best is trial 19 with value: 0.2780178256976923.
[I 2025-07-22 05:53:44,370] Trial 26 finished with value: 0.3048479345672865 and parameters: {'num_leaves': 77, 'max_depth': 5, 'learning_rate': 0.05784849514149448, 'min_child_samples': 33, 'subsample': 0.7750913143854956, 'colsample_bytree': 0.9988485383055329, 'reg_alpha': 0.8818481597314975, 'reg_lambda': 0.47523692853241783}. Best is trial 19 with value: 0.2780178256976923.
[I 2025-07-22 05:53:44,441] Trial 27 finished with value: 0.296679819982339 and parameters: {'num_leaves': 69, 'max_depth': 7, 'learning_rate': 0.08641055334317813, 'min_child_samples': 24, 'subsample': 0.8402917197946864, 'colsample_by

Early stopping, best iteration is:
[37]	valid_0's rmse: 0.302759
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[24]	valid_0's rmse: 0.304848
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[43]	valid_0's rmse: 0.29668
Training until validation scores don't improve for 50 rounds


[I 2025-07-22 05:53:44,516] Trial 28 finished with value: 0.3042254110191455 and parameters: {'num_leaves': 56, 'max_depth': 4, 'learning_rate': 0.03851143865459168, 'min_child_samples': 20, 'subsample': 0.9140696672294315, 'colsample_bytree': 0.8676735484105902, 'reg_alpha': 0.8023992526392951, 'reg_lambda': 0.3773720196791857}. Best is trial 19 with value: 0.2780178256976923.
[I 2025-07-22 05:53:44,586] Trial 29 finished with value: 0.3000926297099919 and parameters: {'num_leaves': 47, 'max_depth': 6, 'learning_rate': 0.07532739342614014, 'min_child_samples': 26, 'subsample': 0.7044549253796116, 'colsample_bytree': 0.9466165139783278, 'reg_alpha': 0.6889166003605507, 'reg_lambda': 0.5360776869115752}. Best is trial 19 with value: 0.2780178256976923.


Did not meet early stopping. Best iteration is:
[74]	valid_0's rmse: 0.304225
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[49]	valid_0's rmse: 0.300093
✅ Migliori parametri trovati: {'num_leaves': 57, 'max_depth': 5, 'learning_rate': 0.08206228004287347, 'min_child_samples': 31, 'subsample': 0.8407413590440636, 'colsample_bytree': 0.9980301459539828, 'reg_alpha': 0.8567122136068024, 'reg_lambda': 0.712788998286795, 'objective': 'regression', 'metric': 'rmse', 'boosting_type': 'gbdt', 'verbosity': -1, 'force_col_wise': True, 'n_estimators': 1000}


In [143]:
warnings.filterwarnings(
    "ignore", category=UserWarning, module="sklearn.utils.validation"
)

# Cross-validation on the full training set (excluding the test set!)
cv_scores = cross_val_score(
    pipeline,
    X_train,
    y_train,
    cv=5,
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
)

cv_mae = -cv_scores.mean()
cv_std = cv_scores.std()

print(f"✅ Cross-Validation MAE: {cv_mae:.2f} ± {cv_std:.2f}")

✅ Cross-Validation MAE: 0.29 ± 0.03
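The sign flip in `cv_mae = -cv_scores.mean()` follows from how scikit-learn scorers work: a scorer built with `greater_is_better=False` returns the negated loss so that larger is always better. The sketch below, on synthetic data with a `LinearRegression` stand-in, shows that this custom scorer matches the built-in `"neg_mean_absolute_error"` string.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

custom_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
scores_custom = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring=custom_scorer)
scores_builtin = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

print("all scores negative:", bool((scores_custom < 0).all()))
print("same as built-in:", bool(np.allclose(scores_custom, scores_builtin)))
print(f"CV MAE: {-scores_custom.mean():.3f} ± {scores_custom.std():.3f}")
```

The standard deviation is unaffected by the sign, so only the mean needs negating.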


## 12. Fit, Eval, Save

### Technical Overview
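Fits the pipeline on the full training data using the best Optuna parameters and evaluates it on the test set.

### Implementation Details
- Fits the `Pipeline` (one-hot preprocessor + `LGBMRegressor`) on `X_train`, `y_train`
- Runs `comprehensive_overfitting_check`, whose MAE values are on the log scale
- Back-transforms predictions and targets with `np.expm1` and reports MAE, RMSE, and R² in k€ on the test set (`X_test`, `y_test`)
- Persistence with `joblib.dump` happens in step 14

### Purpose
Builds final predictive model for deployment.

### Output
Overfitting diagnostics and test-set performance metrics.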

In [144]:
warnings.filterwarnings(
    "ignore", category=UserWarning, module="sklearn.utils.validation"
)

pipeline.fit(X_train, y_train)
comprehensive_overfitting_check(pipeline, X_train, X_test, y_train, y_test)

y_pred_log = pipeline.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_test_true = np.expm1(y_test)

mae = mean_absolute_error(y_test_true, y_pred)
mse = mean_squared_error(y_test_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_true, y_pred)

print(f"📊 MAE:  {mae:.2f} k€")
print(f"📊 RMSE: {rmse:.2f} k€")
print(f"📊 R²:   {r2:.2f}")
print(f"\n✅ Fine training e valutazione modello ottimizzato LightGBM con Optuna\n")


🔍 OVERFITTING ANALYSIS
Training MAE: 0.20
CV MAE:       0.29
Test MAE:     0.23
📊 MAE:  66.12 k€
📊 RMSE: 86.81 k€
📊 R²:   0.54

✅ Fine training e valutazione modello ottimizzato LightGBM con Optuna
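The overfitting analysis and the k€ metrics are on different scales: the first three MAE values are differences of `log1p` targets, the rest are back-transformed. As a rough reading aid (a heuristic, assuming valuations large enough that `log1p(x)` is close to `log(x)`), a log-scale MAE of `m` corresponds to a typical multiplicative error of about `exp(m)`:

```python
import math

# Log-scale MAE values copied from the overfitting analysis output above
log_mae = {"train": 0.20, "cv": 0.29, "test": 0.23}

# exp(m) approximates the typical ratio between prediction and truth
factors = {name: math.exp(value) for name, value in log_mae.items()}

for name, factor in factors.items():
    print(f"{name}: about x{factor:.2f}, i.e. roughly {100 * (factor - 1):.0f}% off")
```

This is only an approximation of relative error; the k€ figures printed by the cell remain the authoritative test metrics.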



## 13. Feature importance (only for tree model)

### Technical Overview
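Computes and tabulates feature importances from the trained model.

### Implementation Details
- Uses `booster_.feature_importance(importance_type="gain")` from the fitted `LGBMRegressor`
- Maps importances to the one-hot encoded feature names and sorts them
- Sums one-hot importances back to their original categorical columns

### Purpose
Interpret model behavior and highlight predictive features.

### Output
Table of the top 10 features by gain and a table of aggregated categorical importances.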

In [147]:
# Extract the encoded (OHE) feature names plus the numeric ones
ohe = pipeline.named_steps["preprocessor"].named_transformers_["cat"]
encoded_cat_features = list(ohe.get_feature_names_out(categorical_cols))
encoded_feature_names = encoded_cat_features + numeric_cols

# Extract the final model (LightGBM inside the Pipeline)
lgb_model = pipeline.named_steps["regressor"]

# Compute feature importance with 'gain' (more robust than 'split')
importances = lgb_model.booster_.feature_importance(importance_type="gain")

# Build a sorted dataframe
feat_importance = (
    pd.DataFrame({"feature": encoded_feature_names, "importance": importances})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

print("\n📊 Top 10 Feature Importances (by gain):")
display(feat_importance.head(10))


# Aggregate the importance of OHE features per original categorical column
def get_categorical_importance_summary(feat_importance, categorical_cols):
    cat_importance = {}
    for cat_col in categorical_cols:
        cat_features = [
            f for f in feat_importance["feature"] if f.startswith(f"{cat_col}_")
        ]
        total = feat_importance[feat_importance["feature"].isin(cat_features)][
            "importance"
        ].sum()
        cat_importance[cat_col] = total
    return cat_importance


# Compute the aggregated importance of the original categorical variables
cat_importance_summary = get_categorical_importance_summary(
    feat_importance, categorical_cols
)

# Sort by descending importance and display
cat_importance_df = pd.DataFrame.from_dict(
    cat_importance_summary, orient="index", columns=["aggregated_importance"]
).sort_values("aggregated_importance", ascending=False)

print("\n📊 Aggregated Importance of Categorical Features:")
display(cat_importance_df)


📊 Top 10 Feature Importances (by gain):


| | feature | importance |
|---|---|---|
| 0 | size_m2 | 161.7736 |
| 1 | air_quality_index | 2.760065 |
| 2 | temperature_avg | 2.713822 |
| 3 | bathrooms | 1.978242 |
| 4 | humidity_level | 1.675291 |
| 5 | floor | 0.785118 |
| 6 | building_floors | 0.464668 |
| 7 | noise_level | 0.390045 |
| 8 | garage_0 | 0.147675 |
| 9 | age_years | 0.130961 |



📊 Aggregated Importance of Categorical Features:


| | aggregated_importance |
|---|---|
| garage | 0.147675 |
| location | 0.0 |
| energy_class | 0.0 |
| has_elevator | 0.0 |
| has_garden | 0.0 |
| has_balcony | 0.0 |
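The aggregation step can also be written as a single `groupby`, which keeps numeric features in the same ranking as the collapsed categoricals. The sketch below uses made-up importance values; `original_column` is a hypothetical helper that relies on the same `"<column>_"` prefix convention as the cell above.

```python
import pandas as pd

# Made-up importances for illustration only
importance_demo = pd.DataFrame(
    {
        "feature": ["size_m2", "location_Milan", "location_Rome", "garage_0", "garage_1"],
        "importance": [10.0, 2.0, 3.0, 0.5, 0.25],
    }
)
categorical_demo = ["location", "garage"]


def original_column(feature: str) -> str:
    """Map an encoded feature name back to its source column."""
    for col in categorical_demo:
        if feature.startswith(f"{col}_"):
            return col
    return feature  # numeric features map to themselves


aggregated = (
    importance_demo.assign(source=importance_demo["feature"].map(original_column))
    .groupby("source")["importance"]
    .sum()
    .sort_values(ascending=False)
)
print(aggregated)
```

Prefix matching would misassign features if one categorical column name were a prefix of another followed by an underscore, so distinct column names are assumed.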


## 14. Save model & metadata

### Technical Overview
Saves final model and associated metadata for inference usage.

### Implementation Details
- Exports model with `joblib`
- Saves JSON metadata with model version, feature list, and training config

### Purpose
Enables downstream inference and auditability.

### Output
`value_regressor_v1.joblib`, `value_regressor_v1_meta.json`

In [146]:
# Create model directory
os.makedirs(f"{MODEL_BASE_DIR}/{ASSET_TYPE}", exist_ok=True)
model_version = "v1"

# Save pipeline
pipeline_filename = (
    f"{MODEL_BASE_DIR}/{ASSET_TYPE}/value_regressor_{model_version}.joblib"
)
joblib.dump(pipeline, pipeline_filename)

# Calculate dataset hash
with open(DATA_PATH, "rb") as f:
    dataset_hash = hashlib.sha256(f.read()).hexdigest()

# Feature names, in ColumnTransformer output order (one-hot categoricals first, then numerics)
ohe = pipeline.named_steps["preprocessor"].named_transformers_["cat"]
encoded_cat_features = list(ohe.get_feature_names_out(categorical_cols))
encoded_feature_names = encoded_cat_features + numeric_cols

# Build metadata
metadata = {
    "asset_type": ASSET_TYPE,
    "model_task": "valuation_regression",
    "model_version": model_version,
    "model_class": type(pipeline.named_steps["regressor"]).__name__,
    "random_state": RANDOM_STATE,
    "dataset_file": DATA_PATH,
    "dataset_hash_sha256": dataset_hash,
    "n_rows_total": int(len(df)),
    "n_rows_train": int(len(X_train)),
    "n_rows_test": int(len(X_test)),
    "features_categorical": categorical_cols,
    "features_numeric": numeric_cols,
    "feature_list_ordered": feature_list,
    "features_encoded": encoded_feature_names,
    "encoded_feature_count": len(encoded_feature_names),
    "metrics": {
        "mae_k": float(round(mae, 4)),
        "rmse_k": float(round(rmse, 4)),
        "r2": float(round(r2, 4)),
    },
    "feature_importance_top10": feat_importance.head(10).to_dict(orient="records"),
    "best_params": best_params,
    "generated_at": datetime.utcnow().isoformat() + "Z",
}

# Save metadata to JSON
meta_filename = (
    f"{MODEL_BASE_DIR}/{ASSET_TYPE}/value_regressor_{model_version}_meta.json"
)
with open(meta_filename, "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)

# Logs
print(f"✅ Saved pipeline: {pipeline_filename}")
print(f"✅ Saved metadata: {meta_filename}")

✅ Saved pipeline: ../models/property/value_regressor_v1.joblib
✅ Saved metadata: ../models/property/value_regressor_v1_meta.json
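The `dataset_hash_sha256` field lets a later run confirm that a model was trained on the file it claims. A minimal sketch of that check follows, using temporary files so it is self-contained; `sha256_of_file` is a hypothetical helper and the CSV contents are made up.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path) -> str:
    """Hash a file's bytes the same way the save cell does."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "dataset.csv"
    meta_path = Path(tmp) / "meta.json"

    # Simulate training time: write a dataset and record its hash in metadata
    data_path.write_text("asset_id,valuation_k\nasset_0000,348.41\n", encoding="utf-8")
    meta_path.write_text(
        json.dumps({"dataset_hash_sha256": sha256_of_file(data_path)}), encoding="utf-8"
    )

    # Later: reload the metadata and compare against the file on disk
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    matches_before_edit = meta["dataset_hash_sha256"] == sha256_of_file(data_path)

    # Any change to the file, however small, breaks the match
    data_path.write_text("asset_id,valuation_k\nasset_0000,999.99\n", encoding="utf-8")
    matches_after_edit = meta["dataset_hash_sha256"] == sha256_of_file(data_path)

print(matches_before_edit, matches_after_edit)
```

The same comparison could run at inference start-up to warn when the dataset and the saved model have drifted apart.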
