# Boosted trees versus random forest


Compare LightGBM and CatBoost with scikit-learn's random forest.<br>
LightGBM is Microsoft's gradient-boosted tree library.<br>
CatBoost is Yandex's gradient-boosted tree library.<br>

See <a href="https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db">CatBoost vs. Light GBM vs. XGBoost</a> for relative comparisons<br>
See <a href="https://towardsdatascience.com/boosting-showdown-scikit-learn-vs-xgboost-vs-lightgbm-vs-catboost-in-sentiment-classification-f7c7f46fd956">Scikit-Learn vs XGBoost vs LightGBM vs CatBoost in Sentiment Classification</a> for another relative comparison.


In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

## Install LightGBM
conda and pip both offer the same version, and the last upload was 3 months ago, so the package is actively maintained. Prefer conda so that it can coordinate LightGBM's dependencies with all other conda packages.

In [2]:
# !conda install -c conda-forge lightgbm -y
import lightgbm
lightgbm.__version__

'4.3.0'

## Install catboost
conda and pip both offer the same version, and the last upload was 2 months ago, so the package is actively maintained. Prefer conda so that it can coordinate CatBoost's dependencies with all other conda packages.

In [3]:
# !conda install -c conda-forge catboost -y
import catboost
catboost.__version__

'1.2.3'

## Regression

### Data

In [4]:
from sklearn.datasets import fetch_california_housing
calif_housing = fetch_california_housing()

# for line in calif_housing.DESCR.split("\n")[5:22]:
#     print(line)

calif_housing_df = pd.DataFrame(data=calif_housing.data, columns=calif_housing.feature_names)
calif_housing_df["Price($)"] = calif_housing.target

# calif_housing_df.head()

In [9]:
calif_housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price($)
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [6]:
#get train/test split
from sklearn.model_selection import train_test_split
X, y = calif_housing.data, calif_housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,test_size=0.2,random_state=123)

### R squared - a way to quantify a model's predictions

The `score` method of each of the following regressors returns R squared by default. See <a href="https://www.youtube.com/watch?v=2AQKmw14mHM">Statquest: R-squared, Clearly Explained!!!</a> for a great explanation plus examples.

Usually 0 <= R squared <= 1, and it is interpreted as how well the model fits the data. (In statistics it is the proportion of the variance in the target that the model explains.)

If R squared = 0, the model's predictions are no more accurate than always predicting the mean of the data.<br>
If R squared = 1, the model's predictions match the data perfectly.<br>
If R squared is negative, the model's predictions fit the data worse than always predicting the mean of the data.
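
The three cases above can be checked on a toy example. This is a minimal plain-Python sketch; the target values and the three sets of predictions are made up purely for illustration.

```python
def r_squared(preds, y):
    """R squared = 1 - RSS / TSS."""
    y_mean = sum(y) / len(y)
    rss = sum((p - t) ** 2 for p, t in zip(preds, y))  # residual sum of squares
    tss = sum((t - y_mean) ** 2 for t in y)            # total sum of squares
    return 1 - rss / tss

y = [1.0, 2.0, 3.0, 4.0]                      # toy targets, mean is 2.5

perfect = r_squared(y, y)                     # predictions equal the targets
mean_only = r_squared([2.5] * len(y), y)      # always predict the mean
reversed_fit = r_squared([4.0, 3.0, 2.0, 1.0], y)  # worse than the mean

print(perfect, mean_only, reversed_fit)       # 1.0 0.0 -3.0
```

The last case shows that nothing bounds R squared from below: predictions that are systematically wrong can push it far under zero.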

In [7]:
#It was not clear which metric LightGBM's score() reports,
#so I implemented R squared below to compare against it
def rsquared(preds, y):
    RSS=np.sum(np.square(preds-y))
    ymean=np.sum(y)/len(y)
    TSS=np.sum(np.square(y-ymean))
    return 1-RSS/TSS

def scoremodel(clf,X_test, y_test):
    print("Score on test set: {:.2f}".format(clf.score(X_test, y_test)))
    #run score using rsquared function above
    preds=clf.predict(X_test)
    rsq=rsquared(preds,y_test)
    print("Score on test set using rsquared: {:.2f}".format(rsq))
    

### random forest - default hyperparameters

In [10]:
%%time
from sklearn.ensemble import RandomForestRegressor

#random forest can be done in parallel, set n_jobs=-1 to use all processors
clf = RandomForestRegressor(random_state=42, n_jobs=-1)
_=clf.fit(X_train, y_train,)

CPU times: user 15.4 s, sys: 93.4 ms, total: 15.5 s
Wall time: 1.22 s


In [11]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.81
Score on test set using rsquared: 0.81


### lightgbm - default hyperparameters

In [12]:
%%time
from lightgbm import LGBMRegressor
clf = LGBMRegressor(random_state=42, n_jobs=-1)
_=clf.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000553 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.069687
CPU times: user 3.97 s, sys: 55.1 ms, total: 4.03 s
Wall time: 282 ms


In [13]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.84
Score on test set using rsquared: 0.84


### catboost - default hyperparameters

In [14]:
%%time
from catboost import CatBoostRegressor
clf = CatBoostRegressor(silent=True, random_state=42)

_=clf.fit(X_train, y_train)

CPU times: user 15.2 s, sys: 1.38 s, total: 16.6 s
Wall time: 1.58 s
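
The `%%time` outputs report both CPU time and wall time, and their ratio gives a rough idea of how many cores each library kept busy. The sketch below only re-uses the numbers printed above; the interpretation as "cores busy on average" is an approximation, not a profiler measurement.

```python
# (total CPU seconds, wall-clock seconds) copied from the %%time outputs above
timings = {
    "RandomForestRegressor": (15.5, 1.22),
    "LGBMRegressor": (4.03, 0.282),
    "CatBoostRegressor": (16.6, 1.58),
}

# CPU time divided by wall time approximates the average number of busy cores
parallelism = {name: cpu / wall for name, (cpu, wall) in timings.items()}

for name, ratio in parallelism.items():
    print(f"{name}: about {ratio:.1f} cores busy on average")
```

All three fits were heavily parallel on this machine, so LightGBM's short wall time comes from doing less total work (about 4 CPU seconds versus 15 to 17), not from using more cores.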


In [15]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.86
Score on test set using rsquared: 0.86


In [16]:
clf.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'RMSE',
 'iterations': 1000,
 'sampling_frequency': 'PerTree',
 'leaf_estimation_method': 'Newton',
 'random_score_type': 'NormalWithModelSizeDecrease',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'bayesian_matrix_reg': 0.10000000149011612,
 'eval_fraction': 0,
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 3,
 'random_strength': 1,
 'rsm': 1,
 'boost_from_average': True,
 'model_size_reg': 0.5,
 'pool_metainfo_options': {'tags': {}},
 'subsample': 0.800000011920929,
 'use_best_model': False,
 'random_seed': 42,
 'depth': 6,
 'posterior_sampling': False,
 'border_count': 254,
 'classes_count': 0,
 'auto_class_weights': 'None',
 'sparse_features_conflict_fraction': 0,
 'leaf_estimation_backtracking': 'AnyImprovement',
 'best_model_min_trees': 1,
 'model_shrink_rate': 0,
 'min_data_in_leaf': 1,
 'loss_function': 'RMSE',
 'lea

### Optimize hyperparameters for catboost

In [17]:
%%time
from sklearn.metrics import mean_squared_error
import optuna
import sklearn
def objective(trial):
    #these are the parameters I want to optimize
    #suggest_float(..., step=...) replaces the deprecated suggest_discrete_uniform
    params = {
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.09, step=0.001),
        'depth': trial.suggest_int("depth", 1, 12),
        'l2_leaf_reg':trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5),
        'iterations':1000,
        'silent':True,
        'random_state':42
    }

    # Define the model. Pass in params to be tuned
    clf = CatBoostRegressor(**params)

    #NOTE: the test set doubles as the early-stopping and tuning set here, so the
    #reported scores may be optimistic; a separate validation split would avoid this
    clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0, early_stopping_rounds=100)

    return mean_squared_error(y_test, clf.predict(X_test))
 
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=5)

trial = study.best_trial

print('mean_squared_error: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2024-04-05 09:34:03,094] A new study created in memory with name: no-name-b0b11b09-f849-4970-8e4f-25693bde084e
[I 2024-04-05 09:34:07,617] Trial 0 finished with value: 0.1796842384273674 and parameters: {'learning_rate': 0.069, 'depth': 9, 'l2_leaf_reg': 2.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:34:16,821] Trial 1 finished with value: 0.1817989673582218 and parameters: {'learning_rate': 0.065, 'depth': 10, 'l2_leaf_reg': 3.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:34:17,678] Trial 2 finished with value: 0.21164926847760404 and parameters: {'learning_rate': 0.079, 'depth': 3, 'l2_leaf_reg': 4.0}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:34:18,799] Trial 3 finished with value: 0.21784232823365285 and parameters: {'learning_rate': 0.024, 'depth': 5, 'l2_leaf_reg': 2.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:34:28,166] Trial 4 finished with value: 0.1843707089728872 and parameters:

mean_squared_error: 0.1796842384273674
Best hyperparameters: {'learning_rate': 0.069, 'depth': 9, 'l2_leaf_reg': 2.5}
CPU times: user 3min 6s, sys: 9.19 s, total: 3min 15s
Wall time: 25.4 s
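
The objective above passes `early_stopping_rounds=100` to `fit`. The idea behind that setting can be sketched in plain Python: training stops once the validation loss has gone a fixed number of rounds without beating its best value. The function name, the loss sequence, and the patience of 3 below are hypothetical, and CatBoost's own overfitting detector may differ in its details.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Stops at the first round that is `patience` rounds past the best one.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i                      # new best round, reset the wait
        elif i - best_idx >= patience:
            return best_idx, i                # waited long enough, stop here
    return best_idx, len(val_losses) - 1      # never triggered, ran to the end

# made-up validation losses, one per boosting round
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.35, 0.36, 0.37, 0.38]
best_idx, stop_idx = best_iteration_with_patience(losses, patience=3)
print(best_idx, stop_idx)                     # 5 8
```

Round 5 holds the lowest loss, and training would stop at round 8 after three rounds without improvement.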


In [21]:
#not satisfied?  Keep on optimizing from where you left off above
study.optimize(objective, n_trials=10)

[I 2024-04-05 09:37:21,543] Trial 5 finished with value: 0.18432432973332566 and parameters: {'learning_rate': 0.09, 'depth': 10, 'l2_leaf_reg': 1.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:37:24,066] Trial 6 finished with value: 0.2675895080176103 and parameters: {'learning_rate': 0.005, 'depth': 8, 'l2_leaf_reg': 5.0}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:37:24,810] Trial 7 finished with value: 0.2648986302875752 and parameters: {'learning_rate': 0.034, 'depth': 2, 'l2_leaf_reg': 5.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:37:41,713] Trial 8 finished with value: 0.18221500724058384 and parameters: {'learning_rate': 0.075, 'depth': 11, 'l2_leaf_reg': 5.0}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-04-05 09:37:42,692] Trial 9 finished with value: 0.204328976613987 and parameters: {'learning_rate': 0.057, 'depth': 4, 'l2_leaf_reg': 2.5}. Best is trial 0 with value: 0.1796842384273674.
[I 2024-

### catboost - using best hyperparameters

In [18]:
trial.params

{'learning_rate': 0.069, 'depth': 9, 'l2_leaf_reg': 2.5}

In [23]:
%%time
clf = CatBoostRegressor(**trial.params,silent=True, random_state=42)
_=clf.fit(X_train, y_train)

scoremodel(clf,X_test, y_test)

Score on test set: 0.86
Score on test set using rsquared: 0.86
CPU times: user 33.5 s, sys: 2.36 s, total: 35.9 s
Wall time: 4.43 s


### Optimize hyperparameters for random forest

In [24]:
%%time
import optuna
import sklearn
def objective(trial,X=X,y=y):
    #NOTE: X and y default to the full dataset, so the rows in X_test take part in the cross validation

    #these are the parameters I want to optimize
    n_estimators = trial.suggest_int('n_estimators', 10, 130)
    max_depth = trial.suggest_int('max_depth', 1, 55)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

    # Define the model. Pass in params to be tuned
    clf = RandomForestRegressor(random_state=42, n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)

    #get the cross validation score (scikit-learn negates MAE so that greater is better)
    return sklearn.model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=3, scoring='neg_mean_absolute_error').mean()
 
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

trial = study.best_trial

print('neg_mean_absolute_error: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2024-04-05 09:40:04,127] A new study created in memory with name: no-name-0cf18b05-6257-4f74-baca-ae1a076b2c0d
[I 2024-04-05 09:40:13,661] Trial 0 finished with value: -0.4794500280689147 and parameters: {'n_estimators': 121, 'max_depth': 39, 'min_samples_split': 6, 'min_samples_leaf': 2}. Best is trial 0 with value: -0.4794500280689147.
[I 2024-04-05 09:40:17,004] Trial 1 finished with value: -0.4850436666806058 and parameters: {'n_estimators': 51, 'max_depth': 26, 'min_samples_split': 4, 'min_samples_leaf': 8}. Best is trial 0 with value: -0.4794500280689147.
[I 2024-04-05 09:40:21,592] Trial 2 finished with value: -0.4815268531917261 and parameters: {'n_estimators': 64, 'max_depth': 31, 'min_samples_split': 4, 'min_samples_leaf': 4}. Best is trial 0 with value: -0.4794500280689147.
[I 2024-04-05 09:40:28,285] Trial 3 finished with value: -0.4803078428411087 and parameters: {'n_estimators': 98, 'max_depth': 55, 'min_samples_split': 10, 'min_samples_leaf': 5}. Best is trial 0 with 

neg_mean_absolute_error: -0.4794500280689147
Best hyperparameters: {'n_estimators': 121, 'max_depth': 39, 'min_samples_split': 6, 'min_samples_leaf': 2}
CPU times: user 118 ms, sys: 318 ms, total: 436 ms
Wall time: 46 s
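
This study uses `direction='maximize'` because the `neg_mean_absolute_error` scorer returns the negated error, so the largest score belongs to the smallest error. A small plain-Python sketch of that sign convention follows; the two candidate "models" and their predictions are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 3.0, 5.0]
candidates = {
    "model_a": [2.5, 3.5, 4.5],   # off by 0.5 everywhere, MAE 0.5
    "model_b": [2.0, 3.0, 8.0],   # off by 3.0 once, MAE 1.0
}

# negate the error so that "greater is better", as the scorer does
neg_mae = {name: -mean_absolute_error(y_true, preds)
           for name, preds in candidates.items()}

best = max(neg_mae, key=neg_mae.get)
print(neg_mae, best)
```

Maximizing the negated error picks `model_a`, the candidate with the lower MAE. In the same way, the best trial above has the score closest to zero.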


In [26]:
%%time
clf = RandomForestRegressor(**trial.params,random_state=42)
_=clf.fit(X_train, y_train)

scoremodel(clf,X_test, y_test)

Score on test set: 0.81
Score on test set using rsquared: 0.81
CPU times: user 10.4 s, sys: 26.5 ms, total: 10.4 s
Wall time: 10.4 s


## Classification

### Data

In [27]:
#took 10% of the original dataset, dropped a bunch of columns
data = pd.read_csv("../datasets/kaggle/flights/flights_small.csv")
data.head()

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,DESTINATION_AIRPORT,ORIGIN_AIRPORT,AIR_TIME,DEPARTURE_TIME,DISTANCE,ARRIVAL_DELAY
0,1,28,3,WN,103,MKE,DCA,102.0,713.0,634,1.0
1,8,11,2,B6,153,PBI,JFK,134.0,111.0,1028,337.0
2,2,4,3,DL,1187,DCA,MSP,111.0,1734.0,931,-19.0
3,3,27,5,WN,171,RDU,DEN,173.0,1807.0,1436,-7.0
4,8,1,6,WN,4330,RIC,ATL,63.0,2151.0,481,13.0


In [32]:
data.dtypes

MONTH                    int64
DAY                      int64
DAY_OF_WEEK              int64
AIRLINE                 object
FLIGHT_NUMBER            int64
DESTINATION_AIRPORT     object
ORIGIN_AIRPORT          object
AIR_TIME               float64
DEPARTURE_TIME         float64
DISTANCE                 int64
ARRIVAL_DELAY          float64
dtype: object

In [28]:
len(data)

571339

In [29]:
data.nunique()

MONTH                    12
DAY                      31
DAY_OF_WEEK               7
AIRLINE                  14
FLIGHT_NUMBER          6688
DESTINATION_AIRPORT     624
ORIGIN_AIRPORT          623
AIR_TIME                613
DEPARTURE_TIME         1414
DISTANCE               1324
ARRIVAL_DELAY           737
dtype: int64

In [31]:
(data["ARRIVAL_DELAY"]>10)*1

0         0
1         1
2         0
3         0
4         1
         ..
571334    0
571335    0
571336    0
571337    0
571338    0
Name: ARRIVAL_DELAY, Length: 571339, dtype: int64

In [34]:
data["ARRIVAL_DELAY"].value_counts()

ARRIVAL_DELAY
0    446235
1    125104
Name: count, dtype: int64

In [33]:
#only later than 10 minutes is considered late
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

#convert to ordinal (even though they are categorical)
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1

#unbalanced dataset, make sure you get a stratified sample
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], stratify=data["ARRIVAL_DELAY"],
                                                random_state=10, test_size=0.25)
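
The `astype("category").cat.codes + 1` step in the cell above replaces each string with an integer. As a rough plain-Python sketch of that mapping (not pandas' implementation): each value gets its position in the sorted list of distinct values, and the notebook then adds one. The helper name `category_codes` is hypothetical, and the five airline values are the ones shown by `data.head()`.

```python
def category_codes(values):
    """Map each value to its rank among the sorted distinct values."""
    categories = sorted(set(values))
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup[v] for v in values]

airlines = ["WN", "B6", "DL", "WN", "WN"]     # AIRLINE column of data.head()

# mirror the notebook's "+1" shift
codes_plus_one = [code + 1 for code in category_codes(airlines)]
print(codes_plus_one)                         # [3, 1, 2, 3, 3]
```

On the full dataset the numbers differ, because all 14 airlines take part in the sort. pandas assigns the code -1 to missing values, so the `+1` shift moves them to 0. The resulting integers carry an alphabetical order that has no real meaning, which is what the "even though they are categorical" comment points at.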

In [35]:
y_train.head()

34965     1
409860    0
255071    0
260261    0
201888    0
Name: ARRIVAL_DELAY, dtype: int64

In [36]:
# data.head()
# X_test.shape
y_test.value_counts()

ARRIVAL_DELAY
0    111559
1     31276
Name: count, dtype: int64
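
The effect of `stratify` can be checked from the counts printed in this notebook: the share of delayed flights should be nearly identical in the full data, the training set, and the test set. The training counts below are taken from the LightGBM training log further down ("Number of positive: 93828" out of 428504 rows); the rest come from the `value_counts()` outputs above.

```python
# (delayed flights, total flights) as reported in the notebook outputs
counts = {
    "full":  (125104, 571339),
    "train": (93828, 428504),
    "test":  (31276, 142835),
}

positive_rate = {name: pos / total for name, (pos, total) in counts.items()}

for name, rate in positive_rate.items():
    print(f"{name}: {rate:.4f} of flights are delayed")
```

All three rates come out at about 0.219, so the roughly 78/22 class imbalance is preserved in both splits.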

In [37]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

def score_classifier_model(clf,X_test, y_test):
    res = clf.predict(X_test)
    print (classification_report(y_test, res))
    print("And the confusion matrix")
    print(confusion_matrix(y_test,res))
        
    

### random forest - default hyperparameters

In [38]:
%%time
from sklearn.ensemble import RandomForestClassifier
clfc = RandomForestClassifier(random_state=42, n_jobs=-1)
_=clfc.fit(X_train, y_train)

CPU times: user 3min 39s, sys: 1.09 s, total: 3min 41s
Wall time: 15 s


In [39]:
score_classifier_model(clfc,X_test, y_test)

              precision    recall  f1-score   support

           0       0.81      0.98      0.89    111559
           1       0.71      0.17      0.28     31276

    accuracy                           0.80    142835
   macro avg       0.76      0.58      0.58    142835
weighted avg       0.79      0.80      0.75    142835

And the confusion matrix
[[109344   2215]
 [ 25809   5467]]
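
The class 1 row of the report can be reproduced from the confusion matrix by hand. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, so the sketch below only needs the four numbers printed above.

```python
# confusion matrix printed above: rows are true labels, columns are predictions
cm = [[109344, 2215],
      [25809, 5467]]

tn, fp = cm[0]    # on-time flights: predicted on time / predicted delayed
fn, tp = cm[1]    # delayed flights: predicted on time / predicted delayed

precision = tp / (tp + fp)                    # of predicted delays, how many were real
recall = tp / (tp + fn)                       # of real delays, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

This prints `precision=0.71 recall=0.17 f1=0.28 accuracy=0.80`, matching the report. The 0.80 accuracy hides the fact that the model misses 25809 of the 31276 delayed flights.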


In [None]:
# macro avg (precision)   =(.81+.71)/2
# weighted avg (precision)= (111559/142835)*.81 + ((31276/142835)*.71)
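
The two formulas in the comments above can be evaluated directly. This short sketch uses the rounded per-class precisions and the supports from the random forest report, so the results agree with the report only to two decimals.

```python
# per-class values copied from the random forest classification report
support = {0: 111559, 1: 31276}
precision = {0: 0.81, 1: 0.71}

# macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted = sum(precision[c] * support[c] / total for c in support)

print(f"macro avg = {macro:.2f}, weighted avg = {weighted:.2f}")
```

This prints `macro avg = 0.76, weighted avg = 0.79`. The weighted average sits closer to the class 0 value because class 0 makes up about 78% of the test set.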

### lightgbm - default hyperparameters

In [40]:
%%time
from lightgbm import LGBMClassifier
clf_lgbm = LGBMClassifier(random_state=42, n_jobs=-1)
_=clf_lgbm.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 93828, number of negative: 334676
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002858 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1526
[LightGBM] [Info] Number of data points in the train set: 428504, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.218966 -> initscore=-1.271700
[LightGBM] [Info] Start training from score -1.271700
CPU times: user 10.6 s, sys: 108 ms, total: 10.7 s
Wall time: 734 ms


In [41]:
score_classifier_model(clf_lgbm,X_test, y_test)

              precision    recall  f1-score   support

           0       0.80      0.99      0.88    111559
           1       0.74      0.12      0.20     31276

    accuracy                           0.80    142835
   macro avg       0.77      0.55      0.54    142835
weighted avg       0.79      0.80      0.73    142835

And the confusion matrix
[[110268   1291]
 [ 27610   3666]]


### catboost - default hyperparameters

In [42]:
%%time
from catboost import CatBoostClassifier
clf_catboost = CatBoostClassifier(silent=True, random_state=42)
_=clf_catboost.fit(X_train, y_train)

CPU times: user 3min 56s, sys: 6.16 s, total: 4min 2s
Wall time: 17.8 s


In [43]:
score_classifier_model(clf_catboost,X_test, y_test)

              precision    recall  f1-score   support

           0       0.81      0.98      0.89    111559
           1       0.72      0.19      0.31     31276

    accuracy                           0.81    142835
   macro avg       0.77      0.59      0.60    142835
weighted avg       0.79      0.81      0.76    142835

And the confusion matrix
[[109215   2344]
 [ 25226   6050]]
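
Since accuracy is nearly the same for all three classifiers, the delayed class (class 1) is the more informative place to compare them. The sketch below recomputes class 1 precision and recall from the three confusion matrices printed in this notebook; the dictionary keys are just labels chosen here.

```python
# confusion matrices printed above: rows are true labels, columns are predictions
confusion = {
    "RandomForest": [[109344, 2215], [25809, 5467]],
    "LightGBM":     [[110268, 1291], [27610, 3666]],
    "CatBoost":     [[109215, 2344], [25226, 6050]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in confusion.items():
    summary[name] = {
        "precision": tp / (tp + fp),   # predicted delays that were real
        "recall": tp / (tp + fn),      # real delays that were caught
    }

for name, m in summary.items():
    print(f"{name}: precision={m['precision']:.2f} recall={m['recall']:.2f}")

best_recall = max(summary, key=lambda name: summary[name]["recall"])
```

CatBoost catches the most delayed flights (recall 0.19) and LightGBM the fewest (0.12), although LightGBM's few delay predictions are slightly more precise (0.74). With default settings, every model still misses more than 80% of the delays.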


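## Summary
CatBoost scores highest on both tasks (R squared of 0.86 versus 0.84 for LightGBM and 0.81 for random forest; classification accuracy of 0.81 versus 0.80 for the other two), while LightGBM trains fastest by a wide margin (282 ms and 734 ms wall time).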