# CatBoost Regressor (Category Boosting)
The goal of this script is to document the generic steps in hyper parameter tuning and training a CatBoost Regression model.
This can be used to quickly produce a baseline model to compare to, but in practice, more modifications will be necessary for fine tuning and creating the best possible model.
#### Useful Resources:
 - Source Documentation: https://catboost.ai/en/docs/
 - Informative Article on CatBoost: https://towardsdatascience.com/why-you-should-learn-catboost-now-390fb3895f76
 - Deep dive into important metrics and methods: https://coderzcolumn.com/tutorials/machine-learning/catboost-an-in-depth-guide-python#3

In [1]:
##### Imports #####
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import catboost as cb
import time
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
# reduce_memory_usage optimizes the amount of memory required for each column. Pandas defaults to 64 bit data types, but this is wasteful.
# This function looks at the range of values and assigns the most optimal type that keeps the data in tact

# Source for this code: https://www.mikulskibartosz.name/how-to-reduce-memory-usage-in-pandas/
def reduce_memory_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype
    if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)                    
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

In [3]:
# Read in data and run reduce memory function
# SIMPLE TEST EXAMPLE USING CALIFORNIA HOUSING DATASET
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
data = pd.DataFrame(housing.data)
data.columns = housing.feature_names
data['MedHouseVal'] = housing.target
reduced_df = reduce_memory_usage(data)
X, y = reduced_df.loc[:,reduced_df.columns != 'MedHouseVal'], reduced_df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)

Memory usage of dataframe is 1.42 MB
Memory usage after optimization is: 1.30 MB
Decreased by 8.3%


In [8]:
def CatBoost_Regressor_Training (cbr_params, scoring_param, X_train, X_test, y_train, y_test):
    # Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
    
    # CatBoost specific data
    train_dataset = cb.Pool(X_train, y_train) 
    test_dataset = cb.Pool(X_test, y_test)
    
    # Perform GridSearch
    start_time = time.time()
    cbr_tuned = cb.CatBoostRegressor(loss_function=scoring_param, verbose = False)
    cbr_grid_result = cbr_tuned.grid_search(cbr_params, train_dataset, cv = 2)
    
    # Report Results
    print('\nGrid Search Completed in:', round(time.time() - start_time,0),'seconds')
    print('Grid Search Best Parameters:',cbr_grid_result['params'],'\n')
    
    # Create predictions on test dataset
    cbr_preds = cbr_tuned.predict(X_test)
    
    # Store error metrics
    cbr_error_metrics = {'mae':0,'rmse':0,'mse':0,'r2':0,'adjusted_r2':0}
    cbr_error_metrics['mse'] = mean_squared_error(y_test, cbr_preds)
    cbr_error_metrics['rmse'] = np.sqrt(cbr_error_metrics['mse'])
    cbr_error_metrics['mae'] = mean_absolute_error(y_test, cbr_preds)
    cbr_error_metrics['r2'] = r2_score(y_test,cbr_preds)
    n = y_test.shape[0] # Number of rows
    k = len(X_test.columns) # Number of independent variables
    cbr_error_metrics['adjusted_r2'] = 1 - ((1-cbr_error_metrics['r2'])*(n-1)/(n-k-1)) # Adjusted R^2 calculation

    # Print error metrics
    print("\n----------------- FINAL MODEL ERROR METRICS -----------------")
    print("MSE: %f" % (cbr_error_metrics['mse']))
    print("RMSE: %f" % (cbr_error_metrics['rmse']))
    print("MAE: %f" % (cbr_error_metrics['mae']))
    print("R Squared: %f" % (cbr_error_metrics['r2']))
    print("Adjusted R Squared: %f" % (cbr_error_metrics['adjusted_r2']))
    
    # Returns the model, the best parameter list, predicted values, and common error metrics
    return cbr_tuned, cbr_grid_result['params'], cbr_preds, cbr_error_metrics, cbr_tuned.feature_importances_

In [9]:
cbr_params = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}
scoring_param = 'RMSE' # Options: https://catboost.ai/en/docs/concepts/loss-functions-regression#objectives-and-metrics

# Call the function to run hyperparameter optimization and training of final model
cbr_model, cbr_best_params, cbr_preds, cbr_error_metrics, cbr_feature_importances = CatBoost_Regressor_Training(cbr_params, scoring_param, X_train, X_test, y_train, y_test)



bestTest = 0.7579154321
bestIteration = 99

0:	loss: 0.7579154	best: 0.7579154 (0)	total: 154ms	remaining: 14.6s

bestTest = 0.6129901106
bestIteration = 99

1:	loss: 0.6129901	best: 0.6129901 (1)	total: 289ms	remaining: 13.6s

bestTest = 0.7570627111
bestIteration = 99

2:	loss: 0.7570627	best: 0.6129901 (1)	total: 439ms	remaining: 13.6s

bestTest = 0.616525197
bestIteration = 99

3:	loss: 0.6165252	best: 0.6129901 (1)	total: 577ms	remaining: 13.3s

bestTest = 0.7574878695
bestIteration = 99

4:	loss: 0.7574879	best: 0.6129901 (1)	total: 713ms	remaining: 13s

bestTest = 0.6165594365
bestIteration = 99

5:	loss: 0.6165594	best: 0.6129901 (1)	total: 853ms	remaining: 12.8s

bestTest = 0.759740925
bestIteration = 99

6:	loss: 0.7597409	best: 0.6129901 (1)	total: 999ms	remaining: 12.7s

bestTest = 0.6169458889
bestIteration = 99

7:	loss: 0.6169459	best: 0.6129901 (1)	total: 1.16s	remaining: 12.7s

bestTest = 0.695351979
bestIteration = 149

8:	loss: 0.6953520	best: 0.6129901 (1)	total: 1


bestTest = 0.4880766751
bestIteration = 199

71:	loss: 0.4880767	best: 0.4880767 (71)	total: 23.9s	remaining: 7.97s

bestTest = 0.6036860859
bestIteration = 99

72:	loss: 0.6036861	best: 0.4880767 (71)	total: 24.6s	remaining: 7.76s

bestTest = 0.5122511032
bestIteration = 99

73:	loss: 0.5122511	best: 0.4880767 (71)	total: 25.3s	remaining: 7.51s

bestTest = 0.6016786048
bestIteration = 99

74:	loss: 0.6016786	best: 0.4880767 (71)	total: 25.8s	remaining: 7.23s

bestTest = 0.5131354756
bestIteration = 99

75:	loss: 0.5131355	best: 0.4880767 (71)	total: 26.3s	remaining: 6.91s

bestTest = 0.6023994411
bestIteration = 99

76:	loss: 0.6023994	best: 0.4880767 (71)	total: 26.8s	remaining: 6.62s

bestTest = 0.5116647047
bestIteration = 99

77:	loss: 0.5116647	best: 0.4880767 (71)	total: 27.3s	remaining: 6.29s

bestTest = 0.6072736828
bestIteration = 99

78:	loss: 0.6072737	best: 0.4880767 (71)	total: 27.8s	remaining: 5.97s

bestTest = 0.5117886894
bestIteration = 99

79:	loss: 0.5117887	best: 

In [11]:
print("Best Score                : ",cbr_model.best_score_)
print("\nList of Target Classes  : ",cbr_model.classes_)
print("\nData Feature Names      : ",cbr_model.feature_names_)
print("\nFeature Importance      : ",cbr_model.feature_importances_)
print("\nLearning Rate           : ",cbr_model.learning_rate_)
print("\nRandom Seed             : ",cbr_model.random_seed_)
print("\nNumber of Trees         : ",cbr_model.tree_count_)
print("\nNumber of Features      : ",cbr_model.n_features_in_)

Best Score                :  {'learn': {'RMSE': 0.39319646862672164}}

List of Target Classes  :  []

Data Feature Names      :  ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Feature Importance      :  [29.71196356  5.97091003  3.60099854  1.95843617  1.68856913 14.13978212
 22.52625609 20.40308435]

Learning Rate           :  0.10000000149011612

Random Seed             :  0

Number of Trees         :  200

Number of Features      :  8


#### Useful Methods
**get_best_score()** - It returns best score of the estimator.

**get_params()** - It returns parameters which were given as dictionary when creating CatBoost estimator and their values as dictionary.

**get_all_params()** - It returns list of all parameters of CatBoost estimator and their values as dictionary.

**get_cat_feature_indices()** - It returns list of indices which has categorical features.

**get_feature_importance()** - It returns feature importance of individual feature according to trained model.

**shrink(ntree_end, ntree_start=0)** - It accepts two arguments which are end tree and starts tree to shrink ensemble to include only trees that come in that index range discarding all other trees.

**set_params()** - It can be used to set parameters of the estimator. Please make a note that this method will only work before the training model.

**calc_leaf_indexes(data, ntree_start=0,ntree_end=0)** - It takes as input data and returns index of leaf in each tree which was used to make prediction for sample. The output of this function will be n_samples x n_trees. It'll return all trees' leaf index for a sample.

**get_leaf_values()** - It returns actual leaf values of the trees in ensemble.

**get_leaf_weights()** - It returns leaf weights for each leaf of the trees in the ensemble.