# CatBoost

CatBoost is based on gradient boosted decision trees.During training, a set of decision trees is built consecutively. Each successive tree is built with reduced loss compared to the previous ones ( It learn from the previous one "boosting" ).
The main idea of boosting is to sequentially combine many weak models (a model performing slightly better than random chance) and thus through greedy search create a strong competitive predictive model.
Because gradient boosting fits the decision trees sequentially, the fitted trees will learn from the mistakes of former trees and hence reduce the errors.
CatBoost grows oblivious trees, which means that the trees are grown by imposing the rule that all nodes at the same level, test the same predictor with the same condition, so leaf can be calculated with bitwise operations.
This improve performance in comparison to other boosting algorithm !

The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, trees stop being built.

The algorithm have the following steps:

- 1) Preliminary calculations of splits
    - Before learning, the possible values of entry are divided into buckets delimited by the threshold values (splits). (Quantization)
      Quantization is also used to split label values for categorical features

- 2) Transforming categorical features to numerical features
    - Different method from one hot encoding ( specific formula )

        ![](images/cat1.png)
        Example of data

        ![](images/cat2.png)
        Generate a new combinations of entry beetween multiple aggregate features

        ![](images/cat2.png)
        Apply formula to convert them into numerical entry
- 3) Convert text features into numerical
- 4) Choosing tree structure !
    - This is a greedy method. Features are selected in order along with their splits for substitution in each leaf. Candidates are selected based on data from the preliminary calculation of splits and the transformation of categorical features to numerical features.
    - The tree depth and other rules for choosing the structure are set in the starting parameters.

    How a "feature-split" pair is chosen for a leaf:
        A list is formed of possible candidates ("feature-split pairs") to be assigned to a leaf as the split.
        A number of penalty functions are calculated for each object.
        The split with the smallest penalty is selected.
        The resulting value is assigned to the leaf.
        This procedure is repeated for all following leaves.

    Before building each new tree, a random permutation of classification objects is performed.
    A metric, which determines the direction for further improving the function,is used to select the structure of the next tree.
- 5) Optimization
    - Cat boost use Regularization to prevent overfitting ,   the weight of each training example is varied over steps of choosing different splits
    - CatBoost implements an algorithm that allows to fight usual gradient boosting biases. link to paper: https://arxiv.org/abs/1706.09516


In [29]:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


X_train = pd.read_csv('x_train_cat.csv')
X_test = pd.read_csv('x_test_cat.csv')
y_train = pd.read_csv('y_train_cat.csv')
y_test = pd.read_csv('y_test_cat.csv')

categorical_features_indices = np.where(X_train.dtypes != np.float)[0]

display(X_train.columns[categorical_features_indices])

train_dataset = cb.Pool(X_train, y_train,cat_features=categorical_features_indices) #pass to cat the indices of categorical features for conversion
test_dataset = cb.Pool(X_test, y_test,cat_features=categorical_features_indices)

Index(['MS_SubClass', 'MS_Zoning', 'Lot_Shape', 'Land_Contour', 'Lot_Config', 'Neighborhood', 'Condition_1', 'Bldg_Type', 'House_Style', 'Overall_Qual', 'Overall_Cond', 'Roof_Style', 'Exterior_1st', 'Exterior_2nd', 'Mas_Vnr_Type', 'Exter_Qual', 'Exter_Cond', 'Foundation', 'Bsmt_Qual', 'Bsmt_Cond', 'Bsmt_Exposure', 'BsmtFin_Type_1', 'BsmtFin_Type_2', 'Heating_QC', 'Kitchen_Qual', 'Fireplace_Qu', 'Garage_Type', 'Garage_Finish', 'Garage_Qual', 'Fence', 'Sale_Type', 'Sale_Condition', 'All_Quality'], dtype='object')

# Number of trees | Overfitting detector

It is recommended to check that there is no obvious underfitting or overfitting before tuning any other parameters. In order to do this it is necessary to analyze the metric value on the validation dataset and select the appropriate number of iterations.

This can be done by setting the number of iterations to a large value, using the overfitting detector parameters.
In this case the resulting model contains only the first k-best iterations, where k is the iteration with the best loss value on the validation dataset.

Overfitting detector will interrupt the training when needed.

In [68]:
model = cb.CatBoostRegressor(loss_function='RMSE', task_type='GPU', devices='0:1', #Enable GPU training
                             use_best_model=True, #Necessary to try overfitting detector
                             iterations = 5000, #High number of iterations to identify an appropriate number of iteration (for grid search )
                             eval_metric='RMSE', # RMSE lower value are better ( we use RMSE instead of R2 because it is supported for GPU training )
                             #od_type='IncToDec', #Type of overfitting detector
                             #od_pval=.01, #Threshold to stop overfitting detector range bettwen 10^-10 and 10^-2 (bigger value interrupt overfitting faster)
                             border_count=128,#The value of this parameter significantly impacts the speed of training on GPU.
                             # The smaller the value, the faster the training is performed (refer to the Number of splits for numerical),higher value equal to more quality.
                             # The number of splits for numerical features.
                             random_state=6,
                             early_stopping_rounds = 12
                             #The value of this parameter significantly impacts the speed of training on GPU. The smaller the value, the faster the training is performed (refer to the Number of splits for numerical features section for details).
                             )

model.fit(train_dataset,eval_set= test_dataset) #Train model !

Learning rate set to 0.04273
0:	learn: 77630.8864256	test: 76976.0494921	best: 76976.0494921 (0)	total: 47.6ms	remaining: 3m 57s
1:	learn: 75177.9281581	test: 74480.3372010	best: 74480.3372010 (1)	total: 96.4ms	remaining: 4m
2:	learn: 72913.0137518	test: 72230.1151141	best: 72230.1151141 (2)	total: 142ms	remaining: 3m 56s
3:	learn: 70684.7617267	test: 69930.1177640	best: 69930.1177640 (3)	total: 188ms	remaining: 3m 54s
4:	learn: 68489.5458084	test: 67721.4140066	best: 67721.4140066 (4)	total: 233ms	remaining: 3m 52s
5:	learn: 66410.9258092	test: 65594.4499951	best: 65594.4499951 (5)	total: 281ms	remaining: 3m 53s
6:	learn: 64604.0734680	test: 63743.5759549	best: 63743.5759549 (6)	total: 330ms	remaining: 3m 55s
7:	learn: 62714.1255687	test: 61756.6566747	best: 61756.6566747 (7)	total: 376ms	remaining: 3m 54s
8:	learn: 61042.9658238	test: 60074.4451349	best: 60074.4451349 (8)	total: 423ms	remaining: 3m 54s
9:	learn: 59425.2722501	test: 58465.9997606	best: 58465.9997606 (9)	total: 472ms	r

<catboost.core.CatBoostRegressor at 0x16d18d3a6a0>

In [72]:
predictions = model.predict(X_test)

tr_pr = model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(y_test, predictions)))
r2 = r2_score(y_test, predictions)

print("train performance : {:.2f}".format(r2_score(y_train, tr_pr)))
print("Model performance with RMSE and R2")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))

train performance : 0.95
Model performance with RMSE and R2
RMSE: 21374.26
R2: 0.93


- Note that the overfitting detector have stopped at 256 iterations instead of do all 5000.

# Train with CV Grid and hyperparameter explanation

In [30]:
model = cb.CatBoostRegressor(loss_function='RMSE', task_type='GPU', devices='0:1',
                             eval_metric='RMSE',
                             od_type='IncToDec',
                             od_pval=.01,
                             border_count=254, #The number of splits for numerical features.
                             one_hot_max_size=256,
                             )

grid = {'iterations': [100,200,300], #The maximum number of trees that can be built
        'learning_rate': [0.04273], #Used for reducing the gradient step smaller value equal to more iterations
        'depth': [6, 7, 8, 9, 10], #Tree depth optimal range beetween 6 and 10
        'l2_leaf_reg': [0.2, 0.5, 1, 3], #Coefficent for regularization (penalty similar to linear regression l2 )
        'random_strength': [0.2,0.5,0.8] #Weights assigned to random objects (used for bagging , higher value correspond to more aggressive bagging )
        }


model.randomized_search(grid,
                      train_dataset,
                      y=None,
                      cv=5,
                      n_iter=20,
                      partition_random_seed=0,
                      calc_cv_statistics= False, #Estimate the quality by using cross-validation with the best of the found parameters
                      search_by_train_test_split = False, # Similar to cross validation for choose parameter
                      refit=True, # Refit an estimator using the best-found parameters on the whole dataset.
                      shuffle=True, #Shuffle before split into fold
                      stratified= False, # Not indicated for regression , usable for Classification task instead
                      train_size=0.8, #Indicate the dimension of each split
                      verbose= False,
)


predictions = model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, predictions)))
r2 = r2_score(y_test, predictions)

print("Model performance with RMSE and R2")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))

Training on fold [0/5]
0:	learn: 190003.3139221	test: 189267.3108119	best: 189267.3108119 (0)	total: 10.5ms	remaining: 1.04s
1:	learn: 182375.1474765	test: 181755.2970100	best: 181755.2970100 (1)	total: 24.3ms	remaining: 1.19s
2:	learn: 175024.8743721	test: 174403.1457634	best: 174403.1457634 (2)	total: 37.1ms	remaining: 1.2s
3:	learn: 168042.9168236	test: 167527.4519363	best: 167527.4519363 (3)	total: 48.3ms	remaining: 1.16s
4:	learn: 161383.1796935	test: 160914.4471768	best: 160914.4471768 (4)	total: 60.8ms	remaining: 1.15s
5:	learn: 155034.1902855	test: 154708.6832326	best: 154708.6832326 (5)	total: 72.1ms	remaining: 1.13s
6:	learn: 148890.4459176	test: 148655.4988906	best: 148655.4988906 (6)	total: 84ms	remaining: 1.12s
7:	learn: 143051.2247951	test: 142916.6836455	best: 142916.6836455 (7)	total: 96.8ms	remaining: 1.11s
8:	learn: 137469.6264231	test: 137372.4891490	best: 137372.4891490 (8)	total: 109ms	remaining: 1.1s
9:	learn: 132126.8567816	test: 132173.1874572	best: 132173.18745

# Display parameters of best iteration

In [31]:
display(model.get_all_params())
predictions = model.predict(X_train)
print(r2_score(predictions,y_train))

{'nan_mode': 'Min',
 'gpu_ram_part': 0.95,
 'eval_metric': 'RMSE',
 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1',
  'FeatureFreq:CtrBorderCount=15:CtrBorderType=Median:Prior=0/1'],
 'iterations': 300,
 'fold_permutation_block': 64,
 'leaf_estimation_method': 'Newton',
 'observations_to_bootstrap': 'TestOnly',
 'od_pval': 0.009999999776482582,
 'counter_calc_method': 'SkipTest',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'ctr_history_unit': 'Sample',
 'feature_border_type': 'GreedyLogSum',
 'bayesian_matrix_reg': 0.10000000149011612,
 'one_hot_max_size': 256,
 'devices': '0:1',
 'eval_fraction': 0,
 'pinned_memory_bytes': '104857600',
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 0.5,
 'random_strength': 0.800000011920929,
 'od_type': 'IncToDec',
 'rsm': 1,
 'boost_from_average': True,
 'gpu_cat_features_storage': 'GpuRam',
 '

0.9817184154611672


# Drop unused features and display data of used features

In [64]:
#model.drop_unused_features()
display(model.feature_names_)
for i in range(0,6):
    model.calc_feature_statistics(feature = i,data = X_test,target = y_test, max_cat_features_on_plot=100)

['MS_SubClass',
 'MS_Zoning',
 'Lot_Frontage',
 'Lot_Area',
 'Lot_Shape',
 'Land_Contour',
 'Lot_Config',
 'Neighborhood',
 'Condition_1',
 'Bldg_Type',
 'House_Style',
 'Overall_Qual',
 'Overall_Cond',
 'Year_Built',
 'Year_Remod_Add',
 'Roof_Style',
 'Exterior_1st',
 'Exterior_2nd',
 'Mas_Vnr_Type',
 'Mas_Vnr_Area',
 'Exter_Qual',
 'Exter_Cond',
 'Foundation',
 'Bsmt_Qual',
 'Bsmt_Cond',
 'Bsmt_Exposure',
 'BsmtFin_Type_1',
 'BsmtFin_SF_1',
 'BsmtFin_Type_2',
 'BsmtFin_SF_2',
 'Bsmt_Unf_SF',
 'Total_Bsmt_SF',
 'Heating_QC',
 'First_Flr_SF',
 'Second_Flr_SF',
 'Low_Qual_Fin_SF',
 'Gr_Liv_Area',
 'Bsmt_Full_Bath',
 'Bsmt_Half_Bath',
 'Full_Bath',
 'Half_Bath',
 'Bedroom_AbvGr',
 'Kitchen_AbvGr',
 'Kitchen_Qual',
 'Fireplaces',
 'Fireplace_Qu',
 'Garage_Type',
 'Garage_Finish',
 'Garage_Area',
 'Garage_Qual',
 'Wood_Deck_SF',
 'Open_Porch_SF',
 'Enclosed_Porch',
 'Three_season_porch',
 'Screen_Porch',
 'Pool_Area',
 'Fence',
 'Misc_Val',
 'Year_Sold',
 'Sale_Type',
 'Sale_Condition',
 '