# Tune XGB Regressor Model

This notebook tunes the hyperparameters of an XGBoost regressor model.

A two-step hyperparameter tuning approach was used for the XGBoost regressor. First, the influence of each hyperparameter was evaluated by varying it alone while keeping the others fixed, which identified promising value ranges and each parameter's relative impact on model performance. Second, a focused random search sampled hyperparameter combinations within the refined ranges around the good base values found in the first step. Narrowing the ranges first keeps the random search from spending trials on regions already known to perform poorly.
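The two steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual search: the base values, candidate lists, and refined ranges below are hypothetical placeholders, and the notebook's import of RandomizedSearchCV suggests that the real second step relies on scikit-learn instead.

```python
import random

# Hypothetical base values and candidate lists, chosen only for illustration.
BASE = {"n_estimators": 300, "learning_rate": 0.3, "max_depth": 6}
CANDIDATES = {
    "n_estimators": [100, 200, 300, 500, 1000],
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "max_depth": [4, 6, 10, 25],
}


def one_at_a_time(base, candidates):
    """Step 1: vary a single hyperparameter while the others stay at base."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(base)
            config[name] = value
            configs.append(config)
    return configs


def random_search(ranges, n_iter, seed=42):
    """Step 2: sample combinations uniformly from the refined ranges."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_iter):
        config = {}
        for name, (low, high) in ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                config[name] = rng.randint(low, high)  # inclusive integer range
            else:
                config[name] = rng.uniform(low, high)
        samples.append(config)
    return samples


step_one = one_at_a_time(BASE, CANDIDATES)

# Hypothetical refined ranges; in practice they come from the step-one results.
REFINED = {
    "n_estimators": (300, 1000),
    "learning_rate": (0.2, 0.4),
    "max_depth": (4, 10),
}
step_two = random_search(REFINED, n_iter=5)
print(len(step_one), len(step_two))
```

Each configuration in step_one differs from BASE in at most one entry, which is what makes the effect of a single hyperparameter readable from the results.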

In [1]:
import os
import joblib
import polars as pl
import numpy as np
from sklearn.preprocessing import (
    StandardScaler,
    PolynomialFeatures,
)
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import r2_score, root_mean_squared_error as rmse
import mlflow
import mlflow.xgboost
from xgboost import XGBRegressor

In [2]:
def get_data(poly_features: int = 1):
    # Get Data
    data = pl.read_parquet("data.parquet")
    data = data.drop(["Step", "Light_ID", "Lane", "Intersection_u", "Sim_ID"])
    data = data.with_columns(pl.col("Is_Entrypoint").cast(pl.Int8))
    print(f"Data: {data.shape}")
    print(f"{data.collect_schema()}")

    # Split Data
    X = data.drop("Num_Cars").to_numpy()
    y = data.select(pl.col("Num_Cars")).to_numpy()
    y = y.ravel()
    print("")
    print(f"X: {X.shape}")
    print(f"y: {y.shape}")

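    # Scale
    # NOTE: the scaler is fit on all rows before the split, so statistics from
    # the test and validation rows leak into the transform. Tree-based models are
    # largely insensitive to feature scaling, so the effect should be small here.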
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Polynomial Features
    if poly_features > 1:
        poly = PolynomialFeatures(degree=poly_features)
        X = poly.fit_transform(X)

    # Train Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.6, test_size=0.4, random_state=42
    )
    print("")
    print(f"X_train: {X_train.shape}")
    print(f"X_test: {X_test.shape}")
    print(f"y_train: {y_train.shape}")
    print(f"y_test {y_test.shape}")

    # Train Test Validation Split
    X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, train_size=0.5, test_size=0.5, random_state=42
    )
    print("")
    print(f"X_test: {X_test.shape}")
    print(f"X_val: {X_val.shape}")
    print(f"y_test: {y_test.shape}")
    print(f"y_val: {y_val.shape}")

    return X_train, X_test, X_val, y_train, y_test, y_val

In [3]:
X_train, X_test, X_val, y_train, y_test, y_val = get_data()

Data: (6610000, 5)
Schema({'Time': Int16, 'Num_Cars': Int16, 'Centrality': Float32, 'Is_Entrypoint': Int8, 'Distance': Int16})

X: (6610000, 4)
y: (6610000,)

X_train: (3966000, 4)
X_test: (2644000, 4)
y_train: (3966000,)
y_test (2644000,)

X_test: (1322000, 4)
X_val: (1322000, 4)
y_test: (1322000,)
y_val: (1322000,)
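The shapes above follow from a 60/20/20 split: 60% of the rows go to training, and the remaining 40% is halved into test and validation sets. A small sketch of that arithmetic, using integer percentages to avoid floating-point rounding (an assumption of this sketch, not necessarily how train_test_split rounds internally):

```python
def split_sizes(n_rows, train_pct=60, test_pct_of_rest=50):
    """Return (train, test, val) row counts for the two successive splits."""
    n_train = n_rows * train_pct // 100
    n_rest = n_rows - n_train
    n_test = n_rest * test_pct_of_rest // 100
    n_val = n_rest - n_test
    return n_train, n_test, n_val


n_train, n_test, n_val = split_sizes(6_610_000)
print(n_train, n_test, n_val)  # 3966000 1322000 1322000
```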


## Base

Establish a performance baseline with all hyperparameters left at their defaults.

In [28]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)

    model = XGBRegressor(n_jobs=N_JOBS, random_state=RANDOM_STATE)
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.21278494596481323
RMSE (Training Data): 1.3919494152069092
R2 Score (Test Data): 0.20956403017044067
RMSE (Test Data): 1.3935534954071045
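For reading these numbers, it helps to recall what the two metrics measure. The sketch below re-implements them in plain Python (the notebook itself uses scikit-learn's r2_score and root_mean_squared_error; the function names here are illustrative) and then uses the relationship between them to back out the spread of the target.

```python
import math


def rmse_score(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.0, 1.0, 2.0, 3.0]
mean_pred = [1.5] * 4  # always predict the mean of y_true
print(r2(y_true, mean_pred))  # 0.0: no better than predicting the mean
print(r2(y_true, y_true))     # 1.0: perfect prediction

# Since R2 = 1 - MSE / Var(y), the standard deviation of the test target is
# RMSE / sqrt(1 - R2). The inputs are the baseline test metrics printed above.
baseline_test_rmse = 1.3935534954071045
baseline_test_r2 = 0.20956403017044067
implied_std = baseline_test_rmse / math.sqrt(1.0 - baseline_test_r2)
print(round(implied_std, 3))
```

The implied standard deviation of Num_Cars on the test set is about 1.57, which is the RMSE a constant mean prediction would score. The baseline RMSE of roughly 1.39 is therefore a modest improvement over that trivial predictor.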




## Estimators

The n_estimators parameter determines the number of trees (estimators) in the XGBoost model.

Increasing n_estimators can improve the model’s performance by allowing it to learn more complex relationships in the data. However, a higher number of estimators also increases the model’s training time and computational resources required.

The default is 100, which is the value the baseline above uses implicitly.

https://xgboosting.com/configure-xgboost-n_estimators-parameter/
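To make the role of the number of trees concrete, here is a toy gradient-boosting loop with depth-1 trees (stumps) on synthetic one-dimensional data. It is a simplified sketch: XGBoost additionally uses regularization and second-order gradient information, and the data and function names below are invented for illustration.

```python
def fit_stump(x, residual):
    """Find the threshold on x that minimises squared error of the residual."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def boost(x, y, n_estimators, learning_rate):
    """Plain gradient boosting for squared error; returns training predictions."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    for _ in range(n_estimators):
        # Each new tree is fit to what the current ensemble still gets wrong.
        residual = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residual)
        pred = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, pred)
        ]
    return pred


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Synthetic data: a parabola that a single stump cannot represent.
x = [i / 10 for i in range(40)]
y = [(xi - 2.0) ** 2 for xi in x]
for n in (1, 10, 100):
    print(n, mse(y, boost(x, y, n_estimators=n, learning_rate=0.3)))
```

The training error shrinks as trees are added because each tree corrects the residual left by the previous ones. The sketch only measures training error, so it says nothing about when additional trees stop helping on unseen data.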

In [64]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 200

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)

    model = XGBRegressor(
        n_jobs=N_JOBS, random_state=RANDOM_STATE, n_estimators=N_ESTIMATORS
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=200,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2222939133644104
RMSE (Training Data): 1.3835170269012451
R2 Score (Test Data): 0.21727150678634644
RMSE (Test Data): 1.3867425918579102




In [30]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)

    model = XGBRegressor(
        n_jobs=N_JOBS, random_state=RANDOM_STATE, n_estimators=N_ESTIMATORS
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.22840863466262817
RMSE (Training Data): 1.3780673742294312
R2 Score (Test Data): 0.22170007228851318
RMSE (Test Data): 1.3828141689300537




In [31]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 500

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)

    model = XGBRegressor(
        n_jobs=N_JOBS, random_state=RANDOM_STATE, n_estimators=N_ESTIMATORS
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=500,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2362128496170044
RMSE (Training Data): 1.3710803985595703
R2 Score (Test Data): 0.2262347936630249
RMSE (Test Data): 1.3787797689437866




In [32]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 1000

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)

    model = XGBRegressor(
        n_jobs=N_JOBS, random_state=RANDOM_STATE, n_estimators=N_ESTIMATORS
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=1000,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2459743618965149
RMSE (Training Data): 1.362290859222412
R2 Score (Test Data): 0.22950363159179688
RMSE (Test Data): 1.3758642673492432
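The runs in this section can be compared side by side. The snippet below takes the R2 values printed above (rounded to four decimals, with the default baseline as the 100-tree row) and computes the test gain per additional 100 trees together with the train/test gap.

```python
# (n_estimators, train R2, test R2) as printed by the runs in this section.
estimator_results = [
    (100, 0.2128, 0.2096),
    (200, 0.2223, 0.2173),
    (300, 0.2284, 0.2217),
    (500, 0.2362, 0.2262),
    (1000, 0.2460, 0.2295),
]

gains = []  # test R2 gained per additional 100 trees
for (n_prev, _, test_prev), (n, _, test) in zip(
    estimator_results, estimator_results[1:]
):
    gain_per_100 = (test - test_prev) / (n - n_prev) * 100
    gains.append(gain_per_100)
    print(f"{n_prev:>4} -> {n:<4} test R2 gain per 100 trees: {gain_per_100:.4f}")

gaps = [train - test for _, train, test in estimator_results]
for (n, _, _), gap in zip(estimator_results, gaps):
    print(f"n_estimators={n:<4} train-test R2 gap: {gap:.4f}")
```

On these numbers the gain per 100 trees keeps falling while the gap between training and test R2 keeps widening, so most of the benefit of extra trees is already captured by a few hundred of them.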




## Learning Rate

The learning_rate parameter in XGBoost is an alias for the eta parameter, which controls the step size at each boosting iteration.

The learning rate determines the contribution of each tree to the final prediction by scaling the leaf weights of every newly added tree (shrinkage).

A lower learning rate can lead to better generalization and reduced overfitting but usually needs more trees to reach the same fit, while a higher learning rate learns faster but may settle on a suboptimal solution.

The default is 0.3.

https://xgboosting.com/configure-xgboost-learning_rate-parameter/
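The interaction between the learning rate and the number of trees can be illustrated with an idealized calculation. Assume, optimistically, that every tree fits the current residual perfectly; each round then removes a fraction equal to the learning rate, so after n rounds a fraction (1 - learning_rate)^n of the initial residual is left. Real trees fit less than that, so this is only a best-case sketch and the function name is illustrative.

```python
def remaining_residual(learning_rate, n_estimators):
    """Fraction of the initial residual left if every tree fit it perfectly."""
    return (1.0 - learning_rate) ** n_estimators


for lr in (0.001, 0.01, 0.1, 0.3):
    fraction = remaining_residual(lr, 300)
    print(f"learning_rate={lr}: {fraction:.4f} of the residual remains after 300 trees")
```

With 300 trees, a learning rate of 0.001 leaves roughly three quarters of the residual untouched even in this best case, so very small learning rates can only pay off when n_estimators is raised substantially.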

In [33]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.01

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.01, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.17912393808364868
RMSE (Training Data): 1.4213975667953491
R2 Score (Test Data): 0.17810559272766113
RMSE (Test Data): 1.4210138320922852




In [34]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.001

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.001, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.07236886024475098
RMSE (Training Data): 1.5109999179840088
R2 Score (Test Data): 0.07216638326644897
RMSE (Test Data): 1.5098206996917725




In [35]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.1

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.1, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.21185463666915894
RMSE (Training Data): 1.392771601676941
R2 Score (Test Data): 0.20865070819854736
RMSE (Test Data): 1.3943583965301514




In [36]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.15

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.15, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.21798813343048096
RMSE (Training Data): 1.3873417377471924
R2 Score (Test Data): 0.21378618478775024
RMSE (Test Data): 1.3898266553878784




In [37]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.2

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.2, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.22191280126571655
RMSE (Training Data): 1.383855938911438
R2 Score (Test Data): 0.2166287899017334
RMSE (Test Data): 1.3873118162155151




In [38]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.25

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.25, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.22642111778259277
RMSE (Training Data): 1.3798410892486572
R2 Score (Test Data): 0.22013872861862183
RMSE (Test Data): 1.3842004537582397




In [39]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.3

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.3, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.22840863466262817
RMSE (Training Data): 1.3780673742294312
R2 Score (Test Data): 0.22170007228851318
RMSE (Test Data): 1.3828141689300537




In [40]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2309439778327942
RMSE (Training Data): 1.37580144405365
R2 Score (Test Data): 0.22321468591690063
RMSE (Test Data): 1.3814679384231567




## Max Depth

The max_depth parameter determines the maximum depth of each tree in the XGBoost model. It is a regularization parameter that can help control overfitting by limiting the model’s complexity. max_depth accepts positive integer values, and the default value in XGBoost is 6.

https://xgboosting.com/configure-xgboost-max_depth-parameter/
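
To make the complexity argument concrete, the sketch below computes an upper bound on the number of leaves per tree for the depths tried in this section. It relies only on the fact that a binary tree of depth d has at most 2^d leaves. The helper name `max_leaves_for_depth` is ours for illustration and is not part of the XGBoost API.

```python
def max_leaves_for_depth(depth: int) -> int:
    """Upper bound on the leaf count of a binary tree with the given depth."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 2 ** depth


# Depths explored in this section, plus the XGBoost default of 6.
for depth in (5, 6, 7, 10, 25):
    leaves = max_leaves_for_depth(depth)
    print(f"max_depth={depth:>2}: at most {leaves:,} leaves per tree")
```

The bound grows exponentially, so a deep setting such as 25 lets each tree carve the training data into far more regions than a shallow one, which is one plausible reason it tends to overfit.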

In [41]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 25

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=25,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.320204496383667
RMSE (Training Data): 1.2934983968734741
R2 Score (Test Data): 0.10901790857315063
RMSE (Test Data): 1.4795334339141846




In [42]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 10

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=10,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2809116840362549
RMSE (Training Data): 1.3303560018539429
R2 Score (Test Data): 0.22061705589294434
RMSE (Test Data): 1.3837758302688599




In [43]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24233931303024292
RMSE (Training Data): 1.3655705451965332
R2 Score (Test Data): 0.2281225323677063
RMSE (Test Data): 1.3770968914031982




In [44]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 5

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=5,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2194962501525879
RMSE (Training Data): 1.3860032558441162
R2 Score (Test Data): 0.21521472930908203
RMSE (Test Data): 1.3885635137557983
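
The four max_depth runs above can be compared side by side. The sketch below is only a convenience summary: the R2 values are copied from the printed outputs and rounded to four decimals, and the variable names (`r2_by_depth`, `gaps`, `best_depth`) are ours.

```python
# R2 scores copied (rounded to 4 decimals) from the max_depth runs in this
# section, all trained with learning_rate=0.35 and n_estimators=300.
r2_by_depth = {
    25: {"train": 0.3202, "test": 0.1090},
    10: {"train": 0.2809, "test": 0.2206},
    7: {"train": 0.2423, "test": 0.2281},
    5: {"train": 0.2195, "test": 0.2152},
}

# Train-minus-test gap: a larger gap suggests stronger overfitting.
gaps = {
    depth: round(scores["train"] - scores["test"], 4)
    for depth, scores in r2_by_depth.items()
}

# Depth with the highest test R2 among the values tried.
best_depth = max(r2_by_depth, key=lambda d: r2_by_depth[d]["test"])

for depth, scores in r2_by_depth.items():
    print(
        f"max_depth={depth:>2}  train R2={scores['train']:.4f}  "
        f"test R2={scores['test']:.4f}  gap={gaps[depth]:.4f}"
    )
print(f"Highest test R2 at max_depth={best_depth}")
```

Among the depths tried, 7 gives the highest test R2 with a small train-test gap, which matches the MAX_DEPTH = 7 used in the cells that follow.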




## Min Child Weight

The min_child_weight parameter determines the minimum sum of instance weight (hessian) needed in a child node for a split to be made.

It is a regularization parameter that can help control overfitting by preventing the creation of overly complex trees. min_child_weight accepts non-negative values, and the default value in XGBoost is 1.

https://xgboosting.com/configure-xgboost-min_child_weight-parameter/
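
As a rough illustration of the rule described above, the sketch below checks whether a candidate split leaves enough hessian mass in both children. It is a simplified stand-in rather than XGBoost's actual split-finding code, and `split_allowed` is a hypothetical helper. It assumes squared-error loss with unit sample weights, where every row contributes a hessian of 1.0, so the hessian sum of a child equals its row count.

```python
def split_allowed(left_hessians, right_hessians, min_child_weight=1.0):
    """Return True if both children carry at least min_child_weight hessian mass."""
    return (
        sum(left_hessians) >= min_child_weight
        and sum(right_hessians) >= min_child_weight
    )


# Assumption: squared-error loss and unit sample weights, so hessian = 1.0 per row.
left = [1.0] * 2    # child node holding 2 rows
right = [1.0] * 50  # child node holding 50 rows

print(split_allowed(left, right, min_child_weight=1))  # True
print(split_allowed(left, right, min_child_weight=3))  # False
```

Under this assumption, small values such as 1 to 3 only block splits that isolate a handful of rows, so they may have little effect on a large training set.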

In [45]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 1

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)

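    # NOTE: min_child_weight is logged above but not passed to the model.
    # The logged value (1) equals the XGBoost default, so the run is unaffected.
    model = XGBRegressor(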
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24233931303024292
RMSE (Training Data): 1.3655705451965332
R2 Score (Test Data): 0.2281225323677063
RMSE (Test Data): 1.3770968914031982




In [46]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 2

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)

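    # NOTE: min_child_weight is logged as 2 above but is not passed to the
    # model, so this run trains with the XGBoost default of 1 and reproduces
    # the metrics of the previous cell.
    model = XGBRegressor(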
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24233931303024292
RMSE (Training Data): 1.3655705451965332
R2 Score (Test Data): 0.2281225323677063
RMSE (Test Data): 1.3770968914031982




In [47]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2421284317970276
RMSE (Training Data): 1.3657605648040771
R2 Score (Test Data): 0.2276458740234375
RMSE (Test Data): 1.3775219917297363




## Subsample

The subsample parameter determines the fraction of observations to be randomly sampled for each tree during the model’s training process. It is a regularization technique that can help prevent overfitting by introducing randomness into the training data. subsample accepts values in the range (0, 1], with 1 meaning that all observations are used for each tree. The default value of subsample in XGBoost is 1.

https://xgboosting.com/configure-xgboost-subsample-parameter/
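
The idea of row subsampling can be sketched in a few lines of plain Python. This is a simplified illustration that draws a fixed-size sample without replacement for each tree; XGBoost's internal sampler may work differently. The helper `sample_rows` and the toy row count are hypothetical.

```python
import random


def sample_rows(n_rows, subsample, rng):
    """Pick the row indices one tree would train on (simplified sketch)."""
    if not 0.0 < subsample <= 1.0:
        raise ValueError("subsample must be in (0, 1]")
    k = max(1, round(n_rows * subsample))
    return sorted(rng.sample(range(n_rows), k))


rng = random.Random(42)  # seeded so the sketch is reproducible
n_rows = 10              # toy dataset size, not the notebook's data

# Each tree sees a different random subset of the rows.
for tree_idx in range(3):
    rows = sample_rows(n_rows, subsample=0.8, rng=rng)
    print(f"tree {tree_idx}: {len(rows)} of {n_rows} rows -> {rows}")
```

With subsample=1.0 the helper returns every row, which corresponds to the default behaviour described above.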

In [50]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2421284317970276
RMSE (Training Data): 1.3657605648040771
R2 Score (Test Data): 0.2276458740234375
RMSE (Test Data): 1.3775219917297363




In [51]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 0.9

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24120813608169556
RMSE (Training Data): 1.3665895462036133
R2 Score (Test Data): 0.2252511978149414
RMSE (Test Data): 1.3796558380126953




In [52]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 0.8

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2399202585220337
RMSE (Training Data): 1.367748737335205
R2 Score (Test Data): 0.2237309217453003
RMSE (Test Data): 1.3810087442398071




## Col Sample by Tree

The colsample_bytree parameter determines the fraction of features (columns) to be randomly sampled for each tree during the model’s training process. It is a regularization technique that can help prevent overfitting by reducing the number of features each tree can access, thus encouraging the model to rely on different subsets of features. colsample_bytree accepts values in the range (0, 1], with 1 meaning that all features are available for each tree. The default value of colsample_bytree in XGBoost is 1.

https://xgboosting.com/configure-xgboost-colsample_bytree-parameter/
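
To get a feel for what a given fraction means in practice, the sketch below converts colsample_bytree into a feature count per tree. It assumes the count is truncated to an integer with a floor of one feature; XGBoost's internal rounding may differ. The helper `n_features_per_tree` and the feature count of 12 are hypothetical and are not taken from this notebook's data.

```python
def n_features_per_tree(n_features: int, colsample_bytree: float) -> int:
    """Approximate number of features available to each tree (simplified sketch)."""
    if not 0.0 < colsample_bytree <= 1.0:
        raise ValueError("colsample_bytree must be in (0, 1]")
    # Assumption: truncate to an integer, but always keep at least one feature.
    return max(1, int(n_features * colsample_bytree))


N_FEATURES = 12  # hypothetical feature count for illustration only

# Fractions tried in this section.
for fraction in (1.0, 0.9, 0.8):
    count = n_features_per_tree(N_FEATURES, fraction)
    print(f"colsample_bytree={fraction}: {count} of {N_FEATURES} features per tree")
```

With only a few features, each step down in colsample_bytree removes a sizeable share of the information available to a tree, so the effect on accuracy can be larger than for row subsampling.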

In [53]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2421284317970276
RMSE (Training Data): 1.3657605648040771
R2 Score (Test Data): 0.2276458740234375
RMSE (Test Data): 1.3775219917297363




In [54]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 0.9

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.9, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2324995994567871
RMSE (Training Data): 1.3744091987609863
R2 Score (Test Data): 0.22353607416152954
RMSE (Test Data): 1.3811821937561035




In [55]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 0.8

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.8, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2324995994567871
RMSE (Training Data): 1.3744091987609863
R2 Score (Test Data): 0.22353607416152954
RMSE (Test Data): 1.3811821937561035




## Gamma

The gamma parameter (alias min_split_loss) is a regularization term that sets the minimum loss reduction required to split a leaf node further.

In other words, a candidate split is kept only if it improves the model’s objective function by at least gamma. The value must be non-negative, and higher values make the model more conservative.

The default value of gamma in XGBoost is 0.

https://xgboosting.com/configure-xgboost-gamma-parameter/
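
To make the role of gamma concrete, here is a minimal sketch of the regularized split gain, assuming the standard second-order formula from the XGBoost documentation. The function name `split_gain` and the gradient statistics are hypothetical and chosen only for illustration; they are not taken from this notebook's data.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized gain of splitting one leaf into a left and a right child.

    g_* and h_* are the sums of first- and second-order gradients of the loss
    over the rows that fall into each child.
    """

    def score(g, h):
        # Structure score of a single leaf with L2 regularization.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    # gamma is subtracted as a fixed cost for adding one more leaf.
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for one candidate split (illustrative only).
stats = dict(g_left=-3.0, h_left=10.0, g_right=3.0, h_right=10.0)

for gamma in (0.0, 0.1, 0.3, 1.0):  # the gamma values tried in this section
    gain = split_gain(**stats, gamma=gamma)
    decision = "keep" if gain > 0 else "prune"
    print(f"gamma={gamma}: gain={gain:.3f} -> {decision}")
```

With these made-up statistics the split survives for gamma up to 0.3 and is pruned at gamma=1.0, which is the sense in which larger values make the model more conservative.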

In [57]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2421284317970276
RMSE (Training Data): 1.3657605648040771
R2 Score (Test Data): 0.2276458740234375
RMSE (Test Data): 1.3775219917297363




In [56]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 1.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=1.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.22192102670669556
RMSE (Training Data): 1.3838486671447754
R2 Score (Test Data): 0.21626973152160645
RMSE (Test Data): 1.3876298666000366




In [58]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.3

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.3, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2324044108390808
RMSE (Training Data): 1.3744945526123047
R2 Score (Test Data): 0.22302556037902832
RMSE (Test Data): 1.3816360235214233




In [59]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.1

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.1, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.23941928148269653
RMSE (Training Data): 1.3681994676589966
R2 Score (Test Data): 0.22676622867584229
RMSE (Test Data): 1.378306269645691
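
The four gamma runs can be compared side by side. The sketch below only re-uses the test RMSE values printed in the cell outputs above; the dictionary and variable names are illustrative.

```python
# Test RMSE printed by the four gamma runs above (copied from the cell outputs).
test_rmse_by_gamma = {
    0.0: 1.3775219917297363,
    0.1: 1.378306269645691,
    0.3: 1.3816360235214233,
    1.0: 1.3876298666000366,
}

# Pick the gamma value with the lowest test RMSE.
best_gamma = min(test_rmse_by_gamma, key=test_rmse_by_gamma.get)
baseline = test_rmse_by_gamma[0.0]

for gamma, score in sorted(test_rmse_by_gamma.items()):
    print(f"gamma={gamma}: test RMSE={score:.4f} ({score - baseline:+.4f} vs. gamma=0)")
print(f"Lowest test RMSE at gamma={best_gamma}")
```

In this comparison none of the non-zero gamma values beats the default of 0 on the test split, although the differences are small (about 0.01 RMSE between gamma=0 and gamma=1).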




## Reg Alpha

The reg_alpha parameter in XGBoost is an alias for the alpha parameter, which controls the L1 regularization term on leaf weights. Increasing reg_alpha makes the model more conservative and can push small leaf weights to exactly zero, which reduces model complexity. The default value is 0.

https://xgboosting.com/configure-xgboost-reg_alpha-parameter/
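
The effect of the L1 term can be sketched with the closed-form leaf weight, assuming the usual soft-thresholding form for tree boosters and ignoring other constraints such as max_delta_step. The helper names `soft_threshold` and `leaf_weight`, and the gradient sums used below, are hypothetical.

```python
def soft_threshold(g, reg_alpha):
    """Shrink the gradient sum toward zero by reg_alpha (L1 soft-thresholding)."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0


def leaf_weight(g, h, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight for gradient sum g and hessian sum h (simplified sketch)."""
    shrunk = soft_threshold(g, reg_alpha)
    if shrunk == 0.0:
        return 0.0  # the L1 penalty removed this leaf's contribution entirely
    return -shrunk / (h + reg_lambda)


# Hypothetical leaves: one with a weak signal, one with a strong signal.
for g in (-0.15, -3.0):
    for reg_alpha in (0.0, 0.2):  # the reg_alpha values tried in this section
        w = leaf_weight(g, h=5.0, reg_alpha=reg_alpha)
        print(f"g={g}, reg_alpha={reg_alpha}: weight={w:.4f}")
```

In this sketch the weak leaf is zeroed out at reg_alpha=0.2, while the strong leaf is only shrunk slightly. That is the sparsity effect the parameter is meant to provide.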

In [65]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.0
REG_ALPHA = 0.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)
    mlflow.log_param("reg_alpha", REG_ALPHA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
        reg_alpha=REG_ALPHA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.2421284317970276
RMSE (Training Data): 1.3657605648040771
R2 Score (Test Data): 0.2276458740234375
RMSE (Test Data): 1.3775219917297363




In [61]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.0
REG_ALPHA = 0.2

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)
    mlflow.log_param("reg_alpha", REG_ALPHA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
        reg_alpha=REG_ALPHA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24177467823028564
RMSE (Training Data): 1.3660792112350464
R2 Score (Test Data): 0.22749465703964233
RMSE (Test Data): 1.3776568174362183




## Reg Lambda

The reg_lambda parameter in XGBoost is an alias for the lambda parameter, which controls the L2 regularization term on leaf weights. Increasing reg_lambda shrinks leaf weights toward zero, which makes the model more conservative and can improve its ability to generalize. The default value is 1.

https://xgboosting.com/configure-xgboost-reg_lambda-parameter/
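
How strongly reg_lambda acts depends on how much data sits in a leaf. The sketch below assumes reg_alpha=0 and a squared-error objective, where each row contributes a hessian of 1, so the hessian sum of a leaf is simply its row count. Under those assumptions the regularized weight is the unregularized weight times H / (H + lambda). The function name `shrinkage_factor` is hypothetical.

```python
def shrinkage_factor(hessian_sum, reg_lambda):
    """Ratio of the L2-regularized leaf weight to the unregularized one.

    Follows from w = -G / (H + lambda) versus w = -G / H, assuming reg_alpha = 0.
    """
    return hessian_sum / (hessian_sum + reg_lambda)


# Leaves of increasing size; 3 matches min_child_weight in this notebook.
for hessian_sum in (3, 30, 300):
    for reg_lambda in (0.0, 0.2, 1.0):
        factor = shrinkage_factor(hessian_sum, reg_lambda)
        print(f"H={hessian_sum}, reg_lambda={reg_lambda}: weight scaled by {factor:.4f}")
```

With min_child_weight=3, even the smallest allowed leaf is scaled by only 3 / 3.2 at reg_lambda=0.2, and larger leaves are affected far less. This may be one reason why small changes to reg_lambda move the metrics only marginally, though the sketch does not prove that for this dataset.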

In [62]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.0
REG_ALPHA = 0.2
REG_LAMBDA = 0.0

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)
    mlflow.log_param("reg_alpha", REG_ALPHA)
    mlflow.log_param("reg_lambda", REG_LAMBDA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
        reg_alpha=REG_ALPHA,
        reg_lambda=REG_LAMBDA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.242112398147583
RMSE (Training Data): 1.3657749891281128
R2 Score (Test Data): 0.22756677865982056
RMSE (Test Data): 1.3775924444198608




In [63]:
N_JOBS = os.cpu_count()
RANDOM_STATE = 42
N_ESTIMATORS = 300
LR = 0.35
MAX_DEPTH = 7
MIN_CHILD_WEIGHT = 3
SUBSAMPLE = 1.0
COLSAMPLE_BYTREE = 1.0
GAMMA = 0.0
REG_ALPHA = 0.2
REG_LAMBDA = 0.2

with mlflow.start_run():
    mlflow.log_param("model_type", "XGBRegressor")
    mlflow.log_param("n_jobs", N_JOBS)
    mlflow.log_param("random_state", RANDOM_STATE)
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("learning_rate", LR)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_param("min_child_weight", MIN_CHILD_WEIGHT)
    mlflow.log_param("subsample", SUBSAMPLE)
    mlflow.log_param("colsample_bytree", COLSAMPLE_BYTREE)
    mlflow.log_param("gamma", GAMMA)
    mlflow.log_param("reg_alpha", REG_ALPHA)
    mlflow.log_param("reg_lambda", REG_LAMBDA)

    model = XGBRegressor(
        n_jobs=N_JOBS,
        random_state=RANDOM_STATE,
        n_estimators=N_ESTIMATORS,
        learning_rate=LR,
        max_depth=MAX_DEPTH,
        min_child_weight=MIN_CHILD_WEIGHT,
        subsample=SUBSAMPLE,
        colsample_bytree=COLSAMPLE_BYTREE,
        gamma=GAMMA,
        reg_alpha=REG_ALPHA,
        reg_lambda=REG_LAMBDA,
    )
    model.fit(X_train, y_train)

    print(model)
    print("")

    y_pred_train = model.predict(X_train)
    train_r2 = r2_score(y_true=y_train, y_pred=y_pred_train)
    train_rmse = rmse(y_true=y_train, y_pred=y_pred_train)

    print(f"R2 Score (Training Data): {train_r2}")
    print(f"RMSE (Training Data): {train_rmse}")

    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)

    y_pred_test = model.predict(X_test)
    test_r2 = r2_score(y_true=y_test, y_pred=y_pred_test)
    test_rmse = rmse(y_true=y_test, y_pred=y_pred_test)

    print(f"R2 Score (Test Data): {test_r2}")
    print(f"RMSE (Test Data): {test_rmse}")

    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)

    mlflow.xgboost.log_model(model, "model")

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=0.0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.35, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
             max_leaves=None, min_child_weight=3, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=300,
             n_jobs=8, num_parallel_tree=None, ...)

R2 Score (Training Data): 0.24194109439849854
RMSE (Training Data): 1.365929365158081
R2 Score (Test Data): 0.22797441482543945
RMSE (Test Data): 1.3772289752960205




## Random Search Hyperparameter Tuning

This is the second step of the two-step tuning approach described at the top of the notebook. The single-parameter runs above identified promising values for each hyperparameter and showed how much each one affects model performance. The random search below samples combinations from narrow ranges around those values (for example, learning_rate from 0.3 to 0.45 and max_depth from 6 to 9) and scores each candidate with 3-fold cross-validation on the training data. subsample and colsample_bytree stay fixed at 1.0.
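
To put the search budget in context, the sketch below counts how many distinct combinations the param_dist grid in the following cell contains. The per-parameter counts are transcribed by hand from that cell, so they are an assumption that breaks if the grid is edited. The variable names are illustrative.

```python
import math

# Number of candidate values per hyperparameter, mirroring param_dist.
grid_sizes = {
    "n_estimators": len(range(250, 450, 50)),  # 250, 300, 350, 400
    "learning_rate": 4,                        # 0.30, 0.35, 0.40, 0.45
    "max_depth": len(range(6, 10)),            # 6, 7, 8, 9
    "min_child_weight": len(range(1, 10)),     # 1 through 9
    "subsample": 1,
    "colsample_bytree": 1,
    "gamma": 2,
    "reg_alpha": 2,
    "reg_lambda": 3,
}

total_combinations = math.prod(grid_sizes.values())
n_iter = 150  # same value as passed to RandomizedSearchCV

print(f"Grid size: {total_combinations} combinations")
print(f"Sampled:   {n_iter} ({n_iter / total_combinations:.1%} of the grid)")
```

The random search therefore evaluates only about 2% of the grid, which is why narrowing the ranges in the first step matters.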

In [None]:
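# Search ranges centered on the best values from the single-parameter runs above.
param_dist = {
    "n_estimators": np.arange(250, 450, 50),  # 250, 300, 350, 400
    # 0.30 to 0.45; float steps are inexact (0.45 is logged as 0.44999999999999996)
    "learning_rate": np.arange(0.3, 0.5, 0.05),
    "max_depth": np.arange(6, 10, 1),  # 6 to 9
    "min_child_weight": np.arange(1, 10, 1),  # 1 to 9
    "subsample": [1.0],  # fixed
    "colsample_bytree": [1.0],  # fixed
    "gamma": [0, 0.1],
    "reg_alpha": [0, 0.1],
    "reg_lambda": [0, 0.1, 0.2],
}

model = XGBRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=150,  # number of sampled parameter combinations
    cv=3,  # 3-fold cross-validation on the training data
    verbose=3,
    random_state=42,
    n_jobs=-1,
    # scikit-learn maximizes the score, so MSE is negated (closer to 0 is better)
    scoring="neg_mean_squared_error",
)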

random_search.fit(X_train, y_train)

best_model: XGBRegressor = random_search.best_estimator_

# Evaluate model
y_pred = best_model.predict(X_train)
print(f"R2 Score (Training Data): {r2_score(y_true=y_train, y_pred=y_pred)}")
print(f"RMSE (Training Data): {rmse(y_true=y_train, y_pred=y_pred)}")
print("")
y_pred = best_model.predict(X_test)
print(f"R2 Score (Test Data): {r2_score(y_true=y_test, y_pred=y_pred)}")
print(f"RMSE (Test Data): {rmse(y_true=y_test, y_pred=y_pred)}")

Fitting 3 folds for each of 150 candidates, totalling 450 fits
[CV 2/3] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.44999999999999996, max_depth=7, min_child_weight=6, n_estimators=400, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0;, score=-1.905 total time=  48.8s
[CV 1/3] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.44999999999999996, max_depth=7, min_child_weight=6, n_estimators=400, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0;, score=-1.909 total time=  52.5s
[CV 3/3] END colsample_bytree=1.0, gamma=0, learning_rate=0.35, max_depth=6, min_child_weight=6, n_estimators=300, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0;, score=-1.915 total time=  55.4s
[CV 1/3] END colsample_bytree=1.0, gamma=0, learning_rate=0.35, max_depth=6, min_child_weight=6, n_estimators=300, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0;, score=-1.913 total time=  55.8s
[CV 2/3] END colsample_bytree=1.0, gamma=0, learning_rate=0.35, max_depth=6, min_child_weight=6, n_estimators=300, reg_alpha=0.
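
Because the search uses scoring="neg_mean_squared_error", the score values in the log above are negated mean squared errors, not RMSE. The sketch below converts them so they can be read on the same scale as the RMSE values printed by the earlier cells. `neg_mse_to_rmse` is a hypothetical helper, and the scores are the rounded values copied from the complete log lines.

```python
import math


def neg_mse_to_rmse(score):
    """Convert a scikit-learn neg_mean_squared_error score to RMSE."""
    return math.sqrt(-score)


# Fold scores copied from the complete log lines above.
fold_scores = [-1.905, -1.909, -1.915, -1.913]

for score in fold_scores:
    print(f"score={score}: RMSE={neg_mse_to_rmse(score):.4f}")
```

The converted fold scores come out at roughly 1.38, the same range as the RMSE values of the single-parameter runs. They are measured on cross-validation folds of the training data, however, so they are not directly interchangeable with the test-split numbers.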

The hyperparameter search slightly improved model performance, with the best identified model being the following:

In [60]:
best_model

In [61]:
joblib.dump(best_model, "model.pkl")

['model.pkl']