# Boosted trees versus random forest


Compare LightGBM and CatBoost with scikit-learn's random forest.<br>
LightGBM is Microsoft's gradient-boosted tree library.<br>
CatBoost is Yandex's gradient-boosted tree library.<br>

See <a href="https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db">CatBoost vs. Light GBM vs. XGBoost</a> for relative comparisons<br>
See <a href="https://towardsdatascience.com/boosting-showdown-scikit-learn-vs-xgboost-vs-lightgbm-vs-catboost-in-sentiment-classification-f7c7f46fd956">Scikit-Learn vs XGBoost vs LightGBM vs CatBoost in Sentiment Classification</a> for another relative comparison.


In [39]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

## Install LightGBM
At the time of writing, conda and pip both offered the same version (3.3.2), last uploaded 3 months earlier, so the package is actively maintained. Prefer conda so that it can coordinate LightGBM's dependencies with all other conda packages. Note that the environment used here reports 3.2.1 as the installed version.

In [40]:
# !conda install -c conda-forge lightgbm -y
import lightgbm
lightgbm.__version__

'3.2.1'

## Install CatBoost
At the time of writing, conda and pip both offered the same version (1.0.4), last uploaded 2 months earlier, so the package is actively maintained. Prefer conda so that it can coordinate CatBoost's dependencies with all other conda packages. Note that the environment used here reports 1.0.6 as the installed version.

In [41]:
# !conda install -c conda-forge catboost -y
import catboost
catboost.__version__

'1.0.6'
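
The installed versions printed above differ from the versions quoted in the text, so it can help to record what is actually installed before comparing results. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Report the versions of the two boosting libraries used in this notebook
for name in ("lightgbm", "catboost"):
    print(name, installed_version(name))
```

Unlike `import lightgbm`, this does not fail when a package is missing, which makes it convenient at the top of a notebook.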

## Regression

### Data

In [42]:
from sklearn.datasets import fetch_california_housing
calif_housing = fetch_california_housing()

# for line in calif_housing.DESCR.split("\n")[5:22]:
#     print(line)

calif_housing_df = pd.DataFrame(data=calif_housing.data, columns=calif_housing.feature_names)
calif_housing_df["Price($)"] = calif_housing.target

# calif_housing_df.head()

In [43]:
#get train/test split
from sklearn.model_selection import train_test_split
X, y = calif_housing.data, calif_housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,test_size=0.2,random_state=123)

### R squared - a way to quantify the quality of a model's predictions

The `score` method of each of the following regressors returns R squared by default (the loss minimized during training is a squared-error loss, not R squared itself).  See <a href="https://www.youtube.com/watch?v=2AQKmw14mHM">Statquest: R-squared, Clearly Explained!!!</a> for a great explanation plus examples.

Usually 0 ≤ R squared ≤ 1, and the value is interpreted as how well the model fits the data. (In statistics it is the fraction of the variance in the target that the model explains.)

If R squared = 0, the fitted model is no more accurate than always predicting the mean of the data.<br>
If R squared = 1, the fitted model matches the data perfectly.<br>
If R squared is negative, the fitted model is a worse fit than always predicting the mean of the data.
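
The three cases above can be checked on a tiny example. This is a sketch on made-up numbers in plain Python (the function name `r_squared` and the data are assumptions for illustration, not part of the notebook):

```python
def r_squared(preds, y):
    """R squared = 1 - RSS/TSS, computed on plain Python lists."""
    y_mean = sum(y) / len(y)
    rss = sum((p - t) ** 2 for p, t in zip(preds, y))  # residual sum of squares
    tss = sum((t - y_mean) ** 2 for t in y)            # total sum of squares
    return 1 - rss / tss

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, y_true)             # predictions equal the targets
mean_only = r_squared([3.0] * 5, y_true)        # always predict the mean (3.0)
reversed_fit = r_squared(y_true[::-1], y_true)  # predictions in reverse order

print(perfect, mean_only, reversed_fit)  # 1.0 0.0 -3.0
```

The reversed predictions are further from the targets than the mean is, so RSS exceeds TSS and the score drops below zero.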

In [44]:
# It was not clear which metric each library's score() reports,
# so R squared is implemented below as a cross-check
def rsquared(preds, y):
    RSS = np.sum(np.square(preds - y))  # residual sum of squares
    ymean = np.mean(y)
    TSS = np.sum(np.square(y - ymean))  # total sum of squares
    return 1 - RSS / TSS

def scoremodel(clf, X_test, y_test):
    # score() is the estimator's built-in metric
    print("Score on test set: {:.2f}".format(clf.score(X_test, y_test)))
    # recompute the score with the rsquared function above
    preds = clf.predict(X_test)
    rsq = rsquared(preds, y_test)
    print("Score on test set using rsquared: {:.2f}".format(rsq))

### Random forest - default hyperparameters

In [45]:
%%time
from sklearn.ensemble import RandomForestRegressor

# random forests can be trained in parallel; n_jobs=-1 uses all processors
clf = RandomForestRegressor(random_state=42, n_jobs=-1)
_ = clf.fit(X_train, y_train)

CPU times: user 11.8 s, sys: 36.8 ms, total: 11.8 s
Wall time: 908 ms


In [46]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.81
Score on test set using rsquared: 0.81


### LightGBM - default hyperparameters

In [47]:
%%time
from lightgbm import LGBMRegressor
clf = LGBMRegressor(random_state=42, n_jobs=-1)
_=clf.fit(X_train, y_train)

CPU times: user 3.06 s, sys: 44.3 ms, total: 3.1 s
Wall time: 214 ms


In [48]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.84
Score on test set using rsquared: 0.84


### CatBoost - default hyperparameters

In [49]:
%%time
from catboost import CatBoostRegressor
clf = CatBoostRegressor(silent=True, random_state=42)

_=clf.fit(X_train, y_train)

CPU times: user 14.5 s, sys: 1.07 s, total: 15.5 s
Wall time: 1.34 s


In [50]:
scoremodel(clf,X_test, y_test)

Score on test set: 0.86
Score on test set using rsquared: 0.86


In [53]:
clf.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'RMSE',
 'iterations': 1000,
 'sampling_frequency': 'PerTree',
 'leaf_estimation_method': 'Newton',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'bayesian_matrix_reg': 0.10000000149011612,
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 3,
 'random_strength': 1,
 'rsm': 1,
 'boost_from_average': True,
 'model_size_reg': 0.5,
 'pool_metainfo_options': {'tags': {}},
 'subsample': 0.800000011920929,
 'use_best_model': False,
 'random_seed': 42,
 'depth': 6,
 'posterior_sampling': False,
 'border_count': 254,
 'classes_count': 0,
 'auto_class_weights': 'None',
 'sparse_features_conflict_fraction': 0,
 'leaf_estimation_backtracking': 'AnyImprovement',
 'best_model_min_trees': 1,
 'model_shrink_rate': 0,
 'min_data_in_leaf': 1,
 'loss_function': 'RMSE',
 'learning_rate': 0.0637660026550293,
 'score_function': 'Cosine',
 'task_type'

### Optimize hyperparameters for CatBoost

In [54]:
%%time
from sklearn.metrics import mean_squared_error
import optuna

def objective(trial):
    # these are the hyperparameters to optimize
    params = {
        # suggest_float with step replaces the deprecated suggest_discrete_uniform
        'learning_rate': trial.suggest_float("learning_rate", 0.001, 0.09, step=0.001),
        'depth': trial.suggest_int("depth", 1, 12),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5),
        'iterations': 1000,
        'silent': True,
        'random_state': 42
    }

    # Define the model. Pass in params to be tuned
    clf = CatBoostRegressor(**params)

    # NOTE: the test set doubles as the early-stopping and tuning set here,
    # so the reported error is optimistic; a separate validation split avoids this
    clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0, early_stopping_rounds=100)

    return mean_squared_error(y_test, clf.predict(X_test))
 
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=5)

trial = study.best_trial

print('mean_squared_error: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2023-04-07 01:45:16,775] A new study created in memory with name: no-name-5f5321d5-3b5c-4cc9-a169-86cea1f11ad9
[I 2023-04-07 01:45:18,569] Trial 0 finished with value: 0.18354207895951047 and parameters: {'learning_rate': 0.081, 'depth': 7, 'l2_leaf_reg': 2.5}. Best is trial 0 with value: 0.18354207895951047.
[I 2023-04-07 01:45:19,933] Trial 1 finished with value: 0.23510875876300527 and parameters: {'learning_rate': 0.012, 'depth': 6, 'l2_leaf_reg': 3.5}. Best is trial 0 with value: 0.18354207895951047.
[I 2023-04-07 01:45:20,850] Trial 2 finished with value: 0.2117854546006257 and parameters: {'learning_rate': 0.081, 'depth': 3, 'l2_leaf_reg': 1.5}. Best is trial 0 with value: 0.18354207895951047.
[I 2023-04-07 01:45:38,672] Trial 3 finished with value: 0.20630553313630487 and parameters: {'learning_rate': 0.013000000000000001, 'depth': 11, 'l2_leaf_reg': 3.0}. Best is trial 0 with value: 0.18354207895951047.

mean_squared_error: 0.18354207895951047
Best hyperparameters: {'learning_rate': 0.081, 'depth': 7, 'l2_leaf_reg': 2.5}
CPU times: user 2min 53s, sys: 7.74 s, total: 3min 1s
Wall time: 23.1 s
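
The objective above scores every trial on the test set, which is also the early-stopping set, so the best value found is likely to be an optimistic estimate. One common alternative is a three-way split: tune on a validation set and touch the test set only once at the end. The sketch below shows just the splitting step on synthetic data; the array names, shapes, and split ratios are assumptions for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical shapes and values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

# First carve off the test set, which is never used during tuning
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)

# Then split the remainder into training and validation sets for tuning
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=123)

print(X_tr.shape, X_val.shape, X_test_demo.shape)  # (600, 8) (200, 8) (200, 8)
```

With such a split, the objective would pass `(X_val, y_val)` as the `eval_set` and return the error on `X_val`, leaving the test set for the final score only.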


In [56]:
#not satisfied?  Keep on optimizing from where you left off above
study.optimize(objective, n_trials=10)

[I 2023-04-07 01:47:28,959] Trial 5 finished with value: 0.1808923330294024 and parameters: {'learning_rate': 0.081, 'depth': 9, 'l2_leaf_reg': 1.5}. Best is trial 5 with value: 0.1808923330294024.
[I 2023-04-07 01:47:30,083] Trial 6 finished with value: 0.3294701737957197 and parameters: {'learning_rate': 0.004, 'depth': 5, 'l2_leaf_reg': 1.0}. Best is trial 5 with value: 0.1808923330294024.
[I 2023-04-07 01:47:47,962] Trial 7 finished with value: 0.1927326417397994 and parameters: {'learning_rate': 0.025, 'depth': 11, 'l2_leaf_reg': 4.0}. Best is trial 5 with value: 0.1808923330294024.
[I 2023-04-07 01:47:48,650] Trial 8 finished with value: 0.33344312711302915 and parameters: {'learning_rate': 0.061, 'depth': 1, 'l2_leaf_reg': 1.5}. Best is trial 5 with value: 0.1808923330294024.
[I 2023-04-07 01:47:50,415] Trial 9 finished with value: 0.18579053458364106 and parameters: {'learning_rate': 0.055, 'depth': 7, 'l2_leaf_reg': 

### CatBoost - using best hyperparameters

In [57]:
%%time
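# use study.best_params: `trial` was captured before the extra trials above and may be stale
clf = CatBoostRegressor(**study.best_params, silent=True, random_state=42)
_ = clf.fit(X_train, y_train)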

scoremodel(clf,X_test, y_test)

Score on test set: 0.86
Score on test set using rsquared: 0.86
CPU times: user 18.2 s, sys: 1.36 s, total: 19.5 s
Wall time: 1.77 s


### Optimize hyperparameters for random forest

In [15]:
%%time
import optuna
import sklearn.model_selection

def objective(trial, X=X, y=y):
    # these are the hyperparameters to optimize
    n_estimators = trial.suggest_int('n_estimators', 10, 130)
    max_depth = trial.suggest_int('max_depth', 1, 55)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

    # Define the model. Pass in params to be tuned
    clf = RandomForestRegressor(random_state=42, n_estimators=n_estimators, max_depth=max_depth,
                                min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)

    # mean 3-fold cross-validation score (negated MAE, so higher is better)
    return sklearn.model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=3, scoring='neg_mean_absolute_error').mean()
 
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

trial = study.best_trial

print('neg_mean_absolute_error: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs


[I 2023-04-07 01:03:33,650] A new study created in memory with name: no-name-0fd993d5-ede0-4f03-a89a-67e6451e3d7d
[I 2023-04-07 01:03:38,455] Trial 0 finished with value: -0.4862874292542784 and parameters: {'n_estimators': 92, 'max_depth': 54, 'min_samples_split': 6, 'min_samples_leaf': 10}. Best is trial 0 with value: -0.4862874292542784.
[I 2023-04-07 01:03:39,897] Trial 1 finished with value: -0.4847104603070296 and parameters: {'n_estimators': 24, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 6}. Best is trial 1 with value: -0.4847104603070296.
[I 2023-04-07 01:03:45,479] Trial 2 finished with value: -0.4854579401085435 and parameters: {'n_estimators': 122, 'max_depth': 33, 'min_samples_split': 6, 'min_samples_leaf': 10}. Best is trial 1 with value: -0.4847104603070296.
[I 2023-04-07 01:03:46,505] Trial 3 finished with value: -0.48425659389633574 and parameters: {'n_estimators': 16, 'max_depth': 13, 'min_s

neg_mean_absolute_error: -0.4799994981807118
Best hyperparameters: {'n_estimators': 67, 'max_depth': 22, 'min_samples_split': 9, 'min_samples_leaf': 2}
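
The study above uses `direction='maximize'` because scikit-learn's `neg_mean_absolute_error` scorer reports the mean absolute error with its sign flipped, so that a larger score always means a better model. The following sketch illustrates that convention in plain Python on made-up predictions (all names and numbers are hypothetical):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
candidates = {
    "a": [2.5, 5.0, 2.5],  # close to the targets
    "b": [1.0, 7.0, 4.0],  # off by 2.0 everywhere
}

# Negate the error so that "higher is better", as the scorer does
neg_mae = {name: -mean_absolute_error(y_true, preds)
           for name, preds in candidates.items()}

# Maximizing the negated error selects the model with the smallest error
best = max(neg_mae, key=neg_mae.get)
print(neg_mae, best)
```

Here candidate "a" has the smaller error, so it has the larger (less negative) score and is the one a maximizing study would keep.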


## Classification

### Data

In [60]:
# took 10% of the original dataset and dropped a bunch of columns
data = pd.read_csv("../datasets/kaggle/flights/flights_small.csv")
data.head()

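|   | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | DESTINATION_AIRPORT | ORIGIN_AIRPORT | AIR_TIME | DEPARTURE_TIME | DISTANCE | ARRIVAL_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28 | 3 | WN | 103 | MKE | DCA | 102.0 | 713.0 | 634 | 1.0 |
| 1 | 8 | 11 | 2 | B6 | 153 | PBI | JFK | 134.0 | 111.0 | 1028 | 337.0 |
| 2 | 2 | 4 | 3 | DL | 1187 | DCA | MSP | 111.0 | 1734.0 | 931 | -19.0 |
| 3 | 3 | 27 | 5 | WN | 171 | RDU | DEN | 173.0 | 1807.0 | 1436 | -7.0 |
| 4 | 8 | 1 | 6 | WN | 4330 | RIC | ATL | 63.0 | 2151.0 | 481 | 13.0 |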


In [61]:
data.nunique()

MONTH                    12
DAY                      31
DAY_OF_WEEK               7
AIRLINE                  14
FLIGHT_NUMBER          6688
DESTINATION_AIRPORT     624
ORIGIN_AIRPORT          623
AIR_TIME                613
DEPARTURE_TIME         1414
DISTANCE               1324
ARRIVAL_DELAY           737
dtype: int64

In [62]:
# only arrivals more than 10 minutes late are considered late
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"] > 10).astype(int)

# convert to integer codes (this treats the categorical columns as ordinal)
cols = ["AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes + 1

# unbalanced dataset, so take a stratified sample
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], stratify=data["ARRIVAL_DELAY"],
                                                random_state=10, test_size=0.25)

In [63]:
# data.head()
# X_test.shape
y_test.value_counts()

0    111559
1     31276
Name: ARRIVAL_DELAY, dtype: int64
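
Because the classes are unbalanced, accuracy needs a reference point. A quick sketch, using only the two counts printed above, of the accuracy a trivial model would get by always predicting the majority class (on time):

```python
# Class counts of the test labels, copied from the value_counts() output above
counts = {0: 111559, 1: 31276}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total
late_fraction = counts[1] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")   # 0.781
print(f"fraction of late flights: {late_fraction:.3f}")  # 0.219
```

Any classifier's accuracy on this test set should be read against that 0.781 baseline, which is why per-class precision and recall are more informative here.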

In [67]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

def score_classifier_model(clf, X_test, y_test):
    res = clf.predict(X_test)
    print(classification_report(y_test, res))
    print("And the confusion matrix")
    print(confusion_matrix(y_test, res))

### Random forest - default hyperparameters

In [65]:
%%time
from sklearn.ensemble import RandomForestClassifier
clfc = RandomForestClassifier(random_state=42, n_jobs=-1)
_=clfc.fit(X_train, y_train)

CPU times: user 3min 31s, sys: 637 ms, total: 3min 31s
Wall time: 14.4 s


In [68]:
score_classifier_model(clfc,X_test, y_test)

              precision    recall  f1-score   support

           0       0.81      0.98      0.89    111559
           1       0.71      0.17      0.28     31276

    accuracy                           0.80    142835
   macro avg       0.76      0.58      0.58    142835
weighted avg       0.79      0.80      0.75    142835

And the confusion matrix
[[109344   2215]
 [ 25809   5467]]
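
The figures in the report can be recovered from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A small sketch in plain Python using the numbers printed above (the variable names are chosen for illustration):

```python
# Confusion matrix from the random forest output above:
# rows = true class (0 on time, 1 late), columns = predicted class
cm = [[109344, 2215],
      [25809, 5467]]

tn, fp = cm[0]  # on-time flights: predicted on time / predicted late
fn, tp = cm[1]  # late flights:    predicted on time / predicted late

precision_late = tp / (tp + fp)  # of flights predicted late, how many were late
recall_late = tp / (tp + fn)     # of flights that were late, how many were caught
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"{precision_late:.2f} {recall_late:.2f} {accuracy:.2f}")  # 0.71 0.17 0.80
```

The low recall shows that the model misses most late flights even though its overall accuracy looks reasonable.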


In [19]:
# macro avg (precision)   =(.81+.71)/2
# weighted avg (precision)= (111559/142835)*.81 + ((31276/142835)*.71)
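
The two averages in the comments above can be evaluated directly. This sketch plugs in the rounded precision values and supports from the random forest report, so the results are approximate:

```python
# Per-class precision (rounded) and support from the random forest report
precision = {0: 0.81, 1: 0.71}
support = {0: 111559, 1: 31276}

# Macro average: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
weighted_avg = (sum(precision[c] * support[c] for c in precision)
                / sum(support.values()))

print(f"macro avg: {macro_avg:.2f}")        # 0.76
print(f"weighted avg: {weighted_avg:.2f}")  # 0.79
```

On an unbalanced dataset the weighted average is pulled toward the majority class, so the macro average is the stricter summary of the two.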

### LightGBM - default hyperparameters

In [69]:
%%time
from lightgbm import LGBMClassifier
clf_lgbm = LGBMClassifier(random_state=42, n_jobs=-1)
_=clf_lgbm.fit(X_train, y_train)

CPU times: user 10 s, sys: 58.1 ms, total: 10.1 s
Wall time: 699 ms


In [70]:
score_classifier_model(clf_lgbm,X_test, y_test)

              precision    recall  f1-score   support

           0       0.80      0.99      0.88    111559
           1       0.74      0.12      0.20     31276

    accuracy                           0.80    142835
   macro avg       0.77      0.55      0.54    142835
weighted avg       0.79      0.80      0.73    142835

And the confusion matrix
[[110268   1291]
 [ 27610   3666]]


### CatBoost - default hyperparameters

In [71]:
%%time
from catboost import CatBoostClassifier
clf_catboost = CatBoostClassifier(silent=True, random_state=42)
_=clf_catboost.fit(X_train, y_train)

CPU times: user 3min 37s, sys: 4.48 s, total: 3min 41s
Wall time: 16.4 s


In [72]:
score_classifier_model(clf_catboost,X_test, y_test)

              precision    recall  f1-score   support

           0       0.81      0.98      0.89    111559
           1       0.72      0.19      0.31     31276

    accuracy                           0.81    142835
   macro avg       0.77      0.59      0.60    142835
weighted avg       0.79      0.81      0.76    142835

And the confusion matrix
[[109215   2344]
 [ 25226   6050]]


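## Notice that CatBoost outperforms random forest and LightGBM on these 2 tasks, though only by a small margin (regression R squared 0.86 vs 0.81 and 0.84; classification accuracy 0.81 vs 0.80 and 0.80). See the articles mentioned in the first cell for further evidence.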