# Gradient Boosting Models Exercise: Advanced Ensemble Methods

**ML2 Course - Extra Points Assignment (5 points)**

**Objective:**
The goal of this exercise is to explore gradient boosting algorithms for panel data modeling. You will implement and compare seven boosting models for regression, ranging from the original AdaBoost to more recent libraries such as XGBoost, LightGBM, and CatBoost.

**Models to Implement:**

1. **AdaBoost** ([AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)) - Adaptive Boosting, the pioneering boosting algorithm
2. **GBM** ([GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)) - Classic Gradient Boosting Machine from scikit-learn
3. **GBM Histogram** ([HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)) - Histogram-based Gradient Boosting (faster, inspired by LightGBM)
4. **XGBoost** ([XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) - Extreme Gradient Boosting, industry standard
5. **LightGBM** ([LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html)) - Light Gradient Boosting Machine, optimized for speed and memory
6. **CatBoost** ([CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor)) - Categorical Boosting, handles categorical features natively
7. **XGBoostLSS** ([XGBoostLSS](https://github.com/StatMixedML/XGBoostLSS)) - XGBoost for Location, Scale and Shape, probabilistic predictions

**Tasks Workflow:**

Follow a process similar to the SVM / KNN notebook (`notebooks/07.knn-model.ipynb`):

1. **Load the prepared training data** from the preprocessing step
2. **Feature Engineering** (if necessary):
   - Note: Tree-based models do NOT require standardization/normalization
   - Their splits depend only on the ordering of feature values, so they are invariant to strictly monotonic transformations of features
3. **Feature Selection**:
   - Use existing feature rankings from `feature_ranking.xlsx` for initial feature selection
   - Consider feature importance from tree-based models
   - Test multiple feature sets (top 20, 30, 50 features, etc.) and use the feature importances reported by the fitted models themselves
4. **Hyperparameter Tuning**: (2 points)
   - Use GridSearchCV or RandomizedSearchCV, or Optuna
   - Focus on key parameters: learning rate, number of boosting iterations, maximum tree depth, and regularization (if applicable)
   - Use rolling window cross-validation to avoid data leakage
5. **Identify Local Champions**: (1 point)
   - Select the best model for each algorithm class
   - Compare based on RMSE on validation sets
6. **Save Models**:
   - Pickle the best models for each algorithm
   - Save to `../models/` directory
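
The rolling-window cross-validation asked for in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function name `rolling_year_splits` and its parameters `min_train_years` and `window` are hypothetical, and the toy year labels stand in for the `rok` column of the panel.

```python
def rolling_year_splits(years, min_train_years=3, window=None):
    """Yield (train_idx, valid_idx) pairs; validation is always a single later year.

    years:  sequence of year labels, one per row (rows may be in any order).
    window: if given, train only on the last `window` years (rolling window);
            if None, train on all earlier years (expanding window).
    """
    unique_years = sorted(set(years))
    for pos in range(min_train_years, len(unique_years)):
        valid_year = unique_years[pos]
        start = 0 if window is None else max(0, pos - window)
        train_years = set(unique_years[start:pos])
        train_idx = [i for i, y in enumerate(years) if y in train_years]
        valid_idx = [i for i, y in enumerate(years) if y == valid_year]
        yield train_idx, valid_idx


# Toy panel (hypothetical): two firms observed from 2015 to 2020, 12 rows in total.
years = [2015, 2016, 2017, 2018, 2019, 2020] * 2
splits = list(rolling_year_splits(years, min_train_years=3, window=3))
for train_idx, valid_idx in splits:
    train_years = sorted({years[i] for i in train_idx})
    valid_years = sorted({years[i] for i in valid_idx})
    print(train_years, "->", valid_years)
```

Because every validation year lies strictly after all training years, no future observation leaks into the fit. All rows of one year stay in the same fold, which matters for panel data where several firms share a year.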

**Important Notes:**

- Gradient boosting models are powerful but prone to overfitting - pay attention to regularization
- Learning rate and number of estimators trade off against each other: a smaller learning rate usually needs more boosting iterations to reach the same fit
- Early stopping can be used to prevent overfitting
- XGBoostLSS provides distributional forecasts (not just point estimates)
- Use time-series aware cross-validation (rolling window) for final model selection
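
The trade-off between learning rate and number of estimators can be seen in the simplest possible setting. The sketch below assumes squared loss and a weak learner that is just a constant (the mean of the current residuals), so each round removes a fraction `learning_rate` of the remaining error. The function `rounds_to_fit` and the toy targets are hypothetical; real tree-based boosting behaves less regularly, but the direction of the effect is the same.

```python
def rounds_to_fit(y, learning_rate, tol=1e-3, max_rounds=10_000):
    """Boost constant learners on squared loss.

    Returns the number of rounds until the absolute mean residual drops below tol.
    """
    prediction = 0.0
    for m in range(1, max_rounds + 1):
        residual_mean = sum(v - prediction for v in y) / len(y)
        prediction += learning_rate * residual_mean  # shrunken update
        remaining = abs(sum(v - prediction for v in y) / len(y))
        if remaining < tol:
            return m
    return max_rounds


y = [0.20, 0.25, 0.30, 0.25]  # toy targets with mean 0.25
for lr in (0.5, 0.1, 0.02):
    print(lr, rounds_to_fit(y, lr))
```

In this setting the remaining mean residual after `m` rounds is the initial one multiplied by `(1 - learning_rate) ** m`, so dividing the learning rate by some factor multiplies the required rounds by roughly the same factor.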

**Model Evaluation:** (2 points)

After completing this notebook:
- Load your models in `notebooks/09.final-comparison-and-summary.ipynb`
- Compare them against existing models (OLS, ARMA, ARDL, KNN, SVR)
- Check if any gradient boosting model becomes the new champion!

---

## Submission Requirements

- Complete this notebook with code and outputs
- Save best model(s) as pickle files in `models/` directory
- Commit and push to your GitHub repository
- Send repository link to: **mj.wozniak9@uw.edu.pl**

**Deadline:** [To be announced by instructor]

Note: cross-validation uses `grid_CV()`, a self-written helper function imported from `cv_utils`.
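
The source of `grid_CV()` is not shown in this notebook. The sketch below is a hypothetical stand-in that only mirrors the return shape used later (`train_score, valid_score, view, best_params`): an exhaustive grid search over fixed, time-ordered splits, scored by RMSE. The names `grid_cv_sketch`, `MeanRegressor`, and `rmse` are invented for illustration, and the stand-in model is deliberately trivial so the example runs with the standard library only.

```python
from itertools import product
from math import sqrt


class MeanRegressor:
    """Stand-in model: predicts a scaled training mean; `shrink` is the tunable parameter."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink

    def fit(self, X, y):
        self.mean_ = self.shrink * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def rmse(y_true, y_pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def grid_cv_sketch(X, y, make_model, param_grid, splits):
    """Hypothetical stand-in for cv_utils.grid_CV (the real helper may differ)."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        train_scores, valid_scores = [], []
        for train_idx, valid_idx in splits:
            X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
            X_va, y_va = [X[i] for i in valid_idx], [y[i] for i in valid_idx]
            model = make_model(**params).fit(X_tr, y_tr)
            train_scores.append(rmse(y_tr, model.predict(X_tr)))
            valid_scores.append(rmse(y_va, model.predict(X_va)))
        mean_valid = sum(valid_scores) / len(valid_scores)
        # Keep the parameter combination with the lowest mean validation RMSE.
        if best is None or mean_valid < best[0]:
            best = (mean_valid, train_scores, valid_scores, params)
    _, train_scores, valid_scores, params = best
    view = [{"cv_train": t, "cv_val": v} for t, v in zip(train_scores, valid_scores)]
    return train_scores, valid_scores, view, params


# Toy data, already sorted by time; each split trains on the past only.
X = [[i] for i in range(8)]
y = [1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.0]
splits = [([0, 1, 2, 3], [4, 5]), ([0, 1, 2, 3, 4, 5], [6, 7])]
train_score, valid_score, view, best_params = grid_cv_sketch(
    X, y, MeanRegressor, {"shrink": [0.5, 1.0]}, splits
)
print(best_params)
```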

# Models

In [2]:
# --- imports ---
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor

from xgboost import XGBRegressor, DMatrix
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboostlss.model import XGBoostLSS
from xgboostlss.distributions.Gaussian import Gaussian

import joblib
import pickle

DATA_PATH   = Path("../data/processed/train_panel.parquet")   
RANKING_XLS = Path("../data/feature_ranking.xlsx")           
MODELS_DIR  = Path("../models")
MODELS_DIR.mkdir(exist_ok=True, parents=True)

preprocessed_output_data_path = "../data/output"

df = pd.read_csv(f"{preprocessed_output_data_path}/train_fe.csv", index_col=0)

fr = pd.read_excel(f"{preprocessed_output_data_path}/feature_ranking.xlsx", index_col=0)


In [3]:
fr.columns

Index(['mi_score', 'sign_fscore', 'sign_fscore_0_1', 'corr', 'EN_coef',
       'boruta_rank'],
      dtype='object')

In [4]:
print(df.columns.tolist())

['Ticker', 'Nazwa2', 'rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc', 'dta', 'capex2_scaled', 'y_v2x_polyarchy', 'y_e_p_polity', 'y_BR_Democracy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'sektor_consumer discretionary', 'sektor_consumer staples', 'sektor_energy', 'sektor_health care', 'sektor_industrials', 'sektor_materials', 'sektor_real estate', 'sektor_technology', 'sektor_utilities', 'gielda_2', 'gielda_3', 'gielda_4', 'gielda_5', 'ta_log', 'txt_cat_(-63.011, -34.811]', 'txt_cat_(-34.811, 0.488]', 'txt_cat_(0.488, 24.415]', 'txt_cat_(24.415, 25.05]', 'txt_cat_(25.05, 308.55]', 'txt_cat_(308.55, 327.531]', 'txt_cat_(327.531, inf]', 'pi_cat_(-8975.0, -1.523]', 'pi_cat_(-1.523, 157.119]', 'pi_cat_(157.119, 465.9]', 'pi_cat_(465.9, 7875.5]', 'pi_cat_(7875.5, 8108.5]', 'pi_c

In [4]:
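# Split columns by cardinality: columns with more than two unique values are
# candidates for scaling; binary (dummy) columns are left unchanged.
standardization = []
not_standardization = []
for col in df.columns:
    if df[col].nunique() > 2:
        standardization.append(col)
    else:
        not_standardization.append(col)
print(standardization)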


['Ticker', 'Nazwa2', 'rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'capex2_scaled', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'ta_log', 'ppent_sqrt', 'intant_sqrt', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'diff_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past']


In [5]:
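# Exclude the target (etr) and the identifier columns from scaling,
# and add y_e_p_polity explicitly.
standardization.remove("etr")
standardization.remove("Ticker")
standardization.remove("Nazwa2")
standardization.append("y_e_p_polity")
print(standardization)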

['rok', 'ta', 'txt', 'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex', 'revenue', 'cce', 'adv', 'diff', 'roa', 'lev', 'intan', 'rd', 'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'capex2_scaled', 'y_v2x_polyarchy', 'WB_GDPgrowth', 'WB_GDPpc', 'WB_Inflation', 'rr_per_country', 'rr_per_sector', 'ta_log', 'ppent_sqrt', 'intant_sqrt', 'roa_clip', 'lev_sqrt', 'intan_pow2', 'rd_sqrt', 'ppe_clip', 'cash_holdings_sqrt', 'diff_dta', 'etr_y_past', 'etr_y_ma', 'diff_ma', 'roa_ma', 'lev_ma', 'intan_ma', 'ppe_ma', 'sale_ma', 'cash_holdings_ma', 'roa_past', 'lev_past', 'intan_past', 'ppe_past', 'sale_past', 'cash_holdings_past', 'y_e_p_polity']


In [6]:
# Load the previously fitted scaler (min-max, per the file name) and apply it.
with open("../models_shah/minmaxscaler.sav", "rb") as f:
    scaler = pickle.load(f)
df[standardization] = scaler.transform(df[standardization])
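
The assignment notes that tree-based models are invariant to monotonic feature transformations, so the min-max scaling applied above should not change which rows a tree split separates. The sketch below checks this for a single regression stump on toy data. The helper `best_stump_partition` and the numbers are hypothetical; libraries that bin features into histograms may show small numerical differences.

```python
def best_stump_partition(x, y):
    """Return the set of row indices sent left by the SSE-minimizing threshold split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        if x[order[cut - 1]] == x[order[cut]]:
            continue  # cannot split between equal feature values
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for side in (left, right):
            mean = sum(y[i] for i in side) / len(side)
            sse += sum((y[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


x = [120.0, 3.5, 47.0, 900.0, 15.0, 60.0]  # toy raw feature
y = [0.30, 0.10, 0.12, 0.35, 0.11, 0.28]   # toy target
lo, hi = min(x), max(x)
x_scaled = [(v - lo) / (hi - lo) for v in x]  # min-max scaling, a monotonic map

raw_partition = best_stump_partition(x, y)
scaled_partition = best_stump_partition(x_scaled, y)
print(raw_partition == scaled_partition)
```

The threshold value itself changes with the scaling, but the groups of rows on each side of the split, and therefore the fitted leaf values, stay the same.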

## Feature Selection


In [8]:
fr = pd.read_excel(f"{preprocessed_output_data_path}/feature_ranking.xlsx", index_col=0)

In [9]:
fr.columns

Index(['mi_score', 'sign_fscore', 'sign_fscore_0_1', 'corr', 'EN_coef',
       'boruta_rank'],
      dtype='object')

In [4]:
# mi features 
fr.sort_values("mi_score", ascending=False, inplace=True)
# top 20 features
mi_features = fr.iloc[0:20].index.tolist()
# top 25 features
mi_features_25 = fr.iloc[0:25].index.tolist()
# top 35 features
mi_features_35 = fr.iloc[0:35].index.tolist()

# top 50 features
mi_features_50 = fr.iloc[0:50].index.tolist()

In [5]:
# only for testing this feature set 
fr["corr_abs"] = np.abs(fr["corr"])
fr.sort_values("corr_abs", ascending=False, inplace=True)
corr_features = fr.iloc[0:20].index.tolist()

In [6]:
feature_sets = {
    "mi20": mi_features,
    "mi25": mi_features_25,
    "mi35": mi_features_35,
    "mi50": mi_features_50,
    "corr": corr_features,
}

## Hyperparameter Tuning

In [20]:
# --- AdaBoost ---
param_grid_ada = {
    "n_estimators": [50, 75, 100, 150, 200, 250],
    "learning_rate": [0.1, 0.3, 0.5, 0.8, 1.0],
    "loss": ["linear", "square", "exponential"],
}

# --- GradientBoostingRegressor ---
param_grid_gbm = {
    "n_estimators": [200, 300, 400, 600, 800, 1000, 1200],
    "learning_rate": [0.10, 0.07, 0.05, 0.03, 0.02],
    "max_depth": [2, 3, 4],
    "subsample": [1.0, 0.9, 0.8, 0.7, 0.6],
}

# --- HistGradientBoostingRegressor ---
param_grid_histgbm = {
    "max_iter": [200, 300, 400, 600, 800, 1000, 1200],
    "learning_rate": [0.10, 0.07, 0.05, 0.03, 0.02],
    "max_depth": [None, 3, 4, 6],
    "l2_regularization": [0.0, 0.1, 1.0, 5.0],
    "min_samples_leaf": [10, 20, 30, 50],
}


# --- XGBRegressor ---
param_grid_xgb = {
    "n_estimators": [500, 1000],  # Let early stopping decide
    "learning_rate": [0.05, 0.02, 0.01],
    "max_depth": [1, 2, 3, 4, 5],
    "subsample": [0.9, 0.8, 0.7],
    "colsample_bytree": [0.9, 0.8, 0.7],  # We might have noisy features, so let's keep more than one option
    "reg_lambda": [1.0, 5.0]
}

# --- LightGBM ---
param_grid_lgbm = {
    "learning_rate": [0.05, 0.02],
    "n_estimators": [500, 1500],        # ideally use early_stopping and stop tuning this
    "num_leaves": [31, 127],
    "max_depth": [-1, 8],               # -1 = no limit
    "subsample": [0.8],
    "colsample_bytree": [0.8],
    "min_child_samples": [20, 50],
    "reg_alpha": [0.0],                 # drop unless you know you need L1
    "reg_lambda": [0.0, 1.0],           # keep small range
}

# --- CatBoost ---
param_grid_cat = {
    "depth": [4, 6, 8],
    "learning_rate": [0.05, 0.02],
    "iterations": [800, 2000],     # or just [5000] with early stopping
    "l2_leaf_reg": [3, 8],
}


# --- XGBoostLSS ---
param_grid_xgblss = {
    "n_estimators": [200, 300, 400, 600, 800, 1000],
    "learning_rate": [0.10, 0.07, 0.05, 0.03, 0.02, 0.01],
    "max_depth": [2, 3, 4],
    "subsample": [0.9, 0.8, 0.7],
    "colsample_bytree": [0.9, 0.8, 0.7],
    "reg_alpha": [0.0, 0.1],
    "reg_lambda": [1.0, 5.0, 10.0],
}
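
Several comments in the grids above mention early stopping, while the grids themselves still tune the number of iterations directly. The rule behind early stopping can be sketched as follows: training halts once the validation loss has not improved for `patience` consecutive rounds, and the best round so far is kept. The function `best_iteration` and the loss values are hypothetical; the boosting libraries expose their own options for this (for example `early_stopping_rounds` in XGBoost), whose details may differ from this sketch.

```python
def best_iteration(valid_losses, patience):
    """Patience-based early stopping over a sequence of validation losses.

    Returns (best_round, stop_round), both 1-indexed.
    """
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no  # no improvement for `patience` rounds
    return best_round, len(valid_losses)


# Hypothetical validation RMSE per boosting round.
valid_rmse = [0.150, 0.144, 0.140, 0.138, 0.139, 0.141, 0.140, 0.143]
print(best_iteration(valid_rmse, patience=3))
```

With this rule, a single large value for the number of iterations is enough, since the validation curve decides where to stop.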


## AdaBoost

In [11]:
from cv_utils import grid_CV


In [39]:
df = df.sort_values(by="rok").reset_index(drop=True)
feature_sets = {
    "mi20": mi_features,
    "mi25": mi_features_25,
    "mi35": mi_features_35,
    "mi50": mi_features_50,
    "corr": corr_features,
}


best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items():
    var = feats
    train_score, valid_score, view, best_params = grid_CV(
        df[var], df["etr"],
        AdaBoostRegressor(random_state=11),
        param_grid_ada,
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.131372  0.140228
1  0.133966  0.139969
2  0.135680  0.153276
3  0.137746  0.148347
4  0.139233  0.137565
5  0.138767  0.137669
0.14284241484010193 {'learning_rate': 0.1, 'loss': 'exponential', 'n_estimators': 50} mi50


### Final fit and Save

In [40]:
model = AdaBoostRegressor(learning_rate=0.1, loss="exponential", n_estimators=50)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

Parameter      Value
estimator
n_estimators   50
learning_rate  0.1
loss           'exponential'
random_state


In [41]:
filename = "../models_shah/ada.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)
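
A quick way to confirm that a saved model can be restored is a save/load round trip. The sketch below uses a plain dictionary as a stand-in for the fitted estimator and a temporary directory instead of the real model folder; the helpers `save_model` and `load_model` are hypothetical names, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Pickle `model` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a pickled object; only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in for a fitted estimator (hypothetical); any picklable object works.
stand_in = {
    "name": "ada",
    "params": {"learning_rate": 0.1, "loss": "exponential", "n_estimators": 50},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "ada.sav"
    save_model(stand_in, target)
    restored = load_model(target)

print(restored == stand_in)
```

Using `with` blocks closes the file handle even if pickling fails, which avoids partially written files staying open.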

## GBM


In [16]:
df = df.sort_values(by="rok").reset_index(drop=True)

feature_sets = {
    "mi50": mi_features_50,
}

best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items(): 
    var = feats
    train_score, valid_score, view, best_params = grid_CV(
        df[var], df["etr"],
        GradientBoostingRegressor(random_state=11),  
        param_grid_gbm,                              
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name  
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.120903  0.146328
1  0.124747  0.135304
2  0.125807  0.148229
3  0.129163  0.143214
4  0.129723  0.127318
5  0.129960  0.132965
0.13889288682082654 {'learning_rate': 0.02, 'max_depth': 2, 'n_estimators': 200, 'subsample': 0.8} mi50


### Final fit and Save

In [18]:
model = GradientBoostingRegressor(
    learning_rate=0.02, loss="squared_error", max_depth=2, n_estimators=200, subsample=0.8
)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

Parameter                 Value
loss                      'squared_error'
learning_rate             0.02
n_estimators              200
subsample                 0.8
criterion                 'friedman_mse'
min_samples_split         2
min_samples_leaf          1
min_weight_fraction_leaf  0.0
max_depth                 2
min_impurity_decrease     0.0


In [19]:
filename = "../models_shah/GBM.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)

## Hist GBM

In [22]:
df = df.sort_values(by="rok").reset_index(drop=True)
# We are keeping only one feature set to reduce computation time
feature_sets = {
    "mi50": mi_features_50,
}

best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items():
    train_score, valid_score, view, best_params = grid_CV(
        df[feats], df["etr"],
        HistGradientBoostingRegressor(random_state=11),
        param_grid_histgbm,          
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.123301  0.146607
1  0.124874  0.134228
2  0.125171  0.145158
3  0.126999  0.139068
4  0.128170  0.126791
5  0.128039  0.130372
0.13703737802393193 {'l2_regularization': 0.0, 'learning_rate': 0.02, 'max_depth': 3, 'max_iter': 200, 'min_samples_leaf': 50} mi50


### Final fit and Save

In [None]:
model = HistGradientBoostingRegressor(
    l2_regularization=0.0, learning_rate=0.02, max_depth=3, max_iter=200, min_samples_leaf=50
)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

Parameter          Value
loss               'squared_error'
quantile
learning_rate      0.02
max_iter           200
max_leaf_nodes     31
max_depth          3
min_samples_leaf   50
l2_regularization  0.0
max_features       1.0
max_bins           255


In [24]:
filename = "../models_shah/hist_GBM.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)

## XGB

In [12]:
df = df.sort_values(by="rok").reset_index(drop=True)

feature_sets = {
    "mi50": mi_features_50
}

best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items():
    train_score, valid_score, view, best_params = grid_CV(
        df[feats], df["etr"],
        XGBRegressor(
            objective="reg:squarederror",
            tree_method="hist",
            random_state=11,
            n_jobs=-1
        ),
        param_grid_xgb,
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.134078  0.138952
1  0.135103  0.133730
2  0.134907  0.147500
3  0.136924  0.142800
4  0.137708  0.128259
5  0.136651  0.132209
0.13724168980208656 {'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500, 'reg_lambda': 1.0, 'subsample': 0.9} mi50


### Final fit and Save

In [13]:
model = XGBRegressor(
    colsample_bytree=0.7, learning_rate=0.01, max_depth=1, n_estimators=500, reg_lambda=1.0, subsample=0.9
)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

Parameter              Value
objective              'reg:squarederror'
base_score
booster
callbacks
colsample_bylevel
colsample_bynode
colsample_bytree       0.7
device
early_stopping_rounds
enable_categorical     False


In [14]:
filename = "../models_shah/XGB.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)

## CatBoost


In [16]:
df = df.sort_values(by="rok").reset_index(drop=True)

feature_sets = {
    "mi20": mi_features,
    "mi50": mi_features_50,
}

best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items():
    train_score, valid_score, view, best_params = grid_CV(
        df[feats], df["etr"],
        CatBoostRegressor(
            loss_function="RMSE",
            random_state=11,
            verbose=False
        ),
        param_grid_cat,
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.073256  0.143505
1  0.079926  0.133550
2  0.086449  0.149094
3  0.090038  0.140331
4  0.092430  0.126979
5  0.093258  0.133145
0.1377673430142593 {'depth': 8, 'iterations': 800, 'l2_leaf_reg': 8, 'learning_rate': 0.02} mi50
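
The fold scores reported above allow a rough overfitting check: the gap between mean validation RMSE and mean training RMSE. The sketch below uses the XGBoost and CatBoost fold values printed in this notebook; the helper name `mean` and the dictionary layout are illustrative choices.

```python
# Fold-level RMSE values copied from the XGBoost and CatBoost outputs in this notebook.
cv_folds = {
    "XGBoost": {
        "cv_train": [0.134078, 0.135103, 0.134907, 0.136924, 0.137708, 0.136651],
        "cv_val": [0.138952, 0.133730, 0.147500, 0.142800, 0.128259, 0.132209],
    },
    "CatBoost": {
        "cv_train": [0.073256, 0.079926, 0.086449, 0.090038, 0.092430, 0.093258],
        "cv_val": [0.143505, 0.133550, 0.149094, 0.140331, 0.126979, 0.133145],
    },
}


def mean(values):
    return sum(values) / len(values)


# Generalization gap: mean validation RMSE minus mean training RMSE.
gaps = {
    name: mean(folds["cv_val"]) - mean(folds["cv_train"])
    for name, folds in cv_folds.items()
}
for name, gap in gaps.items():
    print(f"{name}: mean val RMSE = {mean(cv_folds[name]['cv_val']):.4f}, gap = {gap:.4f}")
```

Both models reach a similar validation RMSE, but CatBoost fits the training folds much more closely, so stronger regularization (for example a shallower `depth` or a larger `l2_leaf_reg`) may be worth testing.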


### Final fit and Save

In [17]:
model = CatBoostRegressor(depth=8, iterations=800, l2_leaf_reg=8, learning_rate=0.02)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

0:	learn: 0.1535189	total: 22.7ms	remaining: 18.2s
1:	learn: 0.1530405	total: 29.8ms	remaining: 11.9s
2:	learn: 0.1525756	total: 35.5ms	remaining: 9.42s
3:	learn: 0.1521582	total: 41.2ms	remaining: 8.2s
4:	learn: 0.1516976	total: 46.9ms	remaining: 7.46s
5:	learn: 0.1512879	total: 52.3ms	remaining: 6.91s
6:	learn: 0.1508621	total: 57.8ms	remaining: 6.55s
7:	learn: 0.1504273	total: 63ms	remaining: 6.24s
8:	learn: 0.1500016	total: 68.2ms	remaining: 6s
9:	learn: 0.1495683	total: 74.3ms	remaining: 5.87s
10:	learn: 0.1492298	total: 80.4ms	remaining: 5.76s
11:	learn: 0.1488790	total: 85.9ms	remaining: 5.64s
12:	learn: 0.1484770	total: 91.5ms	remaining: 5.54s
13:	learn: 0.1480824	total: 97ms	remaining: 5.45s
14:	learn: 0.1477422	total: 102ms	remaining: 5.35s
15:	learn: 0.1473519	total: 108ms	remaining: 5.29s
16:	learn: 0.1470081	total: 113ms	remaining: 5.22s
17:	learn: 0.1466410	total: 119ms	remaining: 5.19s
18:	learn: 0.1463596	total: 125ms	remaining: 5.13s
19:	learn: 0.1460929	total: 130ms	r

<catboost.core.CatBoostRegressor at 0x11d6185f0>

In [18]:
filename = "../models_shah/catB.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)

## LightGBM

In [23]:
import warnings

warnings.filterwarnings("ignore", message="X does not have valid feature names")

df = df.sort_values(by="rok").reset_index(drop=True)

feature_sets = {
    "mi50": mi_features_50,
}

best = {"rmse": np.inf, "params": None, "features": None, "view": None}

for name, feats in feature_sets.items():
    train_score, valid_score, view, best_params = grid_CV(
        df[feats], df["etr"],
        LGBMRegressor(
            objective="regression",
            random_state=11,
            n_jobs=-1, 
            verbosity=-1
        ),
        param_grid_lgbm,
        display_res=True
    )

    avg_rmse = np.mean(valid_score)

    if avg_rmse < best["rmse"]:
        best["rmse"] = avg_rmse
        best["params"] = best_params
        best["features"] = name
        best["view"] = view

print(best["view"])
print(best["rmse"], best["params"], best["features"])


   cv_train    cv_val
0  0.089468  0.147414
1  0.089041  0.140751
2  0.089159  0.150173
3  0.093429  0.142980
4  0.094295  0.126330
5  0.094304  0.131945
0.1399321146264909 {'colsample_bytree': 0.8, 'learning_rate': 0.02, 'max_depth': 8, 'min_child_samples': 50, 'n_estimators': 500, 'num_leaves': 127, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 0.8} mi50


### Final fit and Save

In [24]:
model = LGBMRegressor(
    colsample_bytree=0.8, learning_rate=0.02, max_depth=8, min_child_samples=50, n_estimators=500,
    num_leaves=127, reg_alpha=0.0, reg_lambda=0.0, subsample=0.8
)
model.fit(df.loc[:, mi_features_50].values, df.loc[:, "etr"].values.ravel())

Parameter          Value
boosting_type      'gbdt'
num_leaves         127
max_depth          8
learning_rate      0.02
n_estimators       500
subsample_for_bin  200000
objective
class_weight
min_split_gain     0.0
min_child_weight   0.001


In [25]:
filename = "../models_shah/lgbm.sav"
with open(filename, "wb") as f:
    pickle.dump(model, f)
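
To compare the local champions, the mean validation RMSE printed for each algorithm in this notebook can be collected and ranked. The sketch below copies those six values; the dictionary keys are shorthand labels chosen for illustration. No score for XGBoostLSS is reported in this section, so it is not included.

```python
# Mean validation RMSE per algorithm, copied from the outputs in this notebook.
cv_rmse = {
    "AdaBoost": 0.14284241484010193,
    "GBM": 0.13889288682082654,
    "HistGBM": 0.13703737802393193,
    "XGBoost": 0.13724168980208656,
    "CatBoost": 0.1377673430142593,
    "LightGBM": 0.1399321146264909,
}

# Rank from lowest (best) to highest mean validation RMSE.
ranking = sorted(cv_rmse.items(), key=lambda item: item[1])
champion, champion_rmse = ranking[0]

for name, score in ranking:
    print(f"{name:<9} {score:.5f}")
print("Lowest mean validation RMSE:", champion)
```

The differences among the top models are small relative to the fold-to-fold variation seen in the outputs, so this ranking is only a first indication and the comparison on held-out data in the final notebook remains the deciding step.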