# CatBoost

CatBoost is based on gradient-boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built to reduce the loss left by the previous ones (it learns from them, hence "boosting").
The main idea of boosting is to sequentially combine many weak models (models that perform only slightly better than random chance) and, through greedy search, build a strong, competitive predictive model.
Because gradient boosting fits the decision trees sequentially, each new tree learns from the mistakes of the earlier trees and so reduces the remaining error.
CatBoost grows oblivious trees: all nodes at the same level test the same predictor with the same condition, so the index of a leaf can be calculated with bitwise operations.
This can make evaluation faster than in boosting algorithms that grow ordinary, non-symmetric trees.
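
The bitwise leaf lookup mentioned above can be illustrated with a small sketch. It is not CatBoost's internal code: the tree, thresholds and leaf values are made up for illustration, and `leaf_index` is a hypothetical helper.

```python
# Minimal sketch of oblivious-tree evaluation (illustrative, not CatBoost internals).
# Each level holds ONE (feature_index, threshold) test shared by every node on that level.

def leaf_index(row, levels):
    """Return the leaf index of `row` for an oblivious tree described by `levels`."""
    index = 0
    for depth, (feature, threshold) in enumerate(levels):
        bit = 1 if row[feature] > threshold else 0  # outcome of this level's test
        index |= bit << depth                       # pack the outcome into bit `depth`
    return index

# Hypothetical depth-3 tree: three tests, hence 2**3 = 8 leaves.
levels = [(0, 0.5), (1, 10.0), (0, 2.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # one value per leaf

row = [1.0, 12.0]              # feature 0 = 1.0, feature 1 = 12.0
idx = leaf_index(row, levels)  # tests give bits 1, 1, 0, so idx = 0b011 = 3
prediction = leaf_values[idx]
print(idx, prediction)         # prints: 3 0.4
```

Because every level asks the same question of every row, a tree of depth d needs exactly d comparisons and one table lookup per prediction.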

The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, trees stop being built.
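
To make the sequential fitting concrete, here is a minimal sketch of gradient boosting for squared loss on a single feature, using one-split "stumps" instead of CatBoost's oblivious trees. The helper names (`fit_stump`, `boost`) and the toy data are assumptions for illustration only; `n_trees` plays the role of the number-of-trees parameter.

```python
# Minimal sketch of gradient boosting for squared loss with one-split stumps.
# Illustrative only: CatBoost uses oblivious trees, ordered boosting and more.

def fit_stump(xs, residuals):
    """Find the threshold on a single feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    base = sum(ys) / len(ys)          # start from the mean target
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
base, stumps, losses = boost(xs, ys, n_trees=20, learning_rate=0.3)
print(losses[0], losses[-1])  # training loss after the first and the last tree
```

Each tree is fitted to what the previous trees got wrong, so the training loss can only go down as trees are added. That is why a separate validation set is needed to decide when to stop.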

The algorithm has the following steps:

- 1) Preliminary calculation of splits
    - Before training, the possible values of each numerical feature are divided into buckets delimited by threshold values (splits). This is called quantization.
      Quantization is also used to split the label values when working with categorical features.

- 2) Transforming categorical features to numerical features
    - Uses a specific formula instead of one-hot encoding

        ![](images/cat1.png)
        Example of data

        ![](images/cat2.png)
        Generate new combinations of values from multiple features

        ![](images/cat2.png)
        Apply the formula to convert them into numerical values
- 3) Converting text features to numerical features
- 4) Choosing the tree structure
    - This is a greedy method. Features are selected in order, along with their splits, for substitution in each leaf. Candidates are selected based on data from the preliminary calculation of splits and the transformation of categorical features to numerical features.
    - The tree depth and other rules for choosing the structure are set in the starting parameters.

    - How a "feature-split" pair is chosen for a leaf:
        - A list of possible candidates ("feature-split" pairs) to be assigned to the leaf as the split is formed.
        - A number of penalty functions are calculated for each object.
        - The split with the smallest penalty is selected.
        - The resulting value is assigned to the leaf.
        - This procedure is repeated for all following leaves.

    - Before building each new tree, a random permutation of the training objects is performed.
    - A metric that determines the direction for further improving the function is used to select the structure of the next tree.
- 5) Optimization
    - CatBoost uses regularization to prevent overfitting; the weight of each training example is varied across the steps of choosing different splits.
    - CatBoost implements an algorithm that counters the usual biases of gradient boosting. Link to the paper: https://arxiv.org/abs/1706.09516
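
Step 2 and the random permutation in step 4 can be tied together with a sketch of an ordered target statistic, in the spirit of the linked paper. This is an assumption-laden illustration, not CatBoost's exact formula: the prior (mean target), the prior weight of 1 and the function name `ordered_target_statistic` are all chosen here for the example.

```python
import random

# Assumption: prior = mean target, prior_weight = 1.
# CatBoost's exact formula and defaults may differ.

def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each category using only the targets of rows that come
    earlier in a random permutation, so a row never sees its own target."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)   # random permutation of the rows
    sums, counts = {}, {}                # running target sum / count per category
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[cat] = s + targets[i]       # update only AFTER encoding row i
        counts[cat] = n + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1.0, 0.0, 1.0, 0.0, 1.0]
prior = sum(targets) / len(targets)
encoded = ordered_target_statistic(categories, targets, prior)
print(encoded)
```

The first row of each category in the permutation has no history, so it receives the prior. Later rows blend the prior with the mean target of the rows seen before them, which is what keeps the target from leaking into the feature.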

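The quantization in step 1 can be sketched as follows. For simplicity the thresholds are spaced uniformly between the minimum and maximum; this is only one possible border-selection strategy, and the helper names are hypothetical.

```python
from bisect import bisect_left

# Simplifying assumption: borders are spaced uniformly between min and max.

def uniform_borders(values, border_count):
    """Return `border_count` evenly spaced thresholds inside the value range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (border_count + 1)
    return [lo + step * k for k in range(1, border_count + 1)]

def quantize(value, borders):
    """Bucket index = number of borders strictly below `value`."""
    return bisect_left(borders, value)

values = [0.0, 1.0, 3.0, 5.0, 8.0]
borders = uniform_borders(values, border_count=3)   # [2.0, 4.0, 6.0]
buckets = [quantize(v, borders) for v in values]    # [0, 0, 1, 2, 3]
print(borders, buckets)
```

After this step a split candidate is just "bucket index greater than k", so the number of borders bounds the number of candidate splits per feature.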

In [51]:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


X_train = pd.read_csv('x_train_cat.csv')
X_test = pd.read_csv('x_test_cat.csv')
y_train = pd.read_csv('y_train_cat.csv')
y_test = pd.read_csv('y_test_cat.csv')

# Treat every non-float column as categorical (np.float was removed in NumPy 1.24; use np.float64)
categorical_features_indices = np.where(X_train.dtypes != np.float64)[0]
display(categorical_features_indices)

train_dataset = cb.Pool(X_train, y_train, cat_features=categorical_features_indices)  # Tell CatBoost which columns are categorical
test_dataset = cb.Pool(X_test, y_test, cat_features=categorical_features_indices)

array([ 0,  1,  4,  5,  6,  7,  8,  9, 10, 11, 12, 15, 16, 17, 18, 20, 21,
       22, 23, 24, 25, 26, 28, 32, 43, 45, 46, 47, 49, 56, 59, 60, 61],
      dtype=int64)

Unnamed: 0,MS_SubClass,MS_Zoning,Lot_Frontage,Lot_Area,Lot_Shape,Land_Contour,Lot_Config,Neighborhood,Condition_1,Bldg_Type,House_Style,Overall_Qual,Overall_Cond,Year_Built,Year_Remod_Add,Roof_Style,Exterior_1st,Exterior_2nd,Mas_Vnr_Type,Mas_Vnr_Area,Exter_Qual,Exter_Cond,Foundation,Bsmt_Qual,Bsmt_Cond,Bsmt_Exposure,BsmtFin_Type_1,BsmtFin_SF_1,BsmtFin_Type_2,BsmtFin_SF_2,Bsmt_Unf_SF,Total_Bsmt_SF,Heating_QC,First_Flr_SF,Second_Flr_SF,Low_Qual_Fin_SF,Gr_Liv_Area,Bsmt_Full_Bath,Bsmt_Half_Bath,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Kitchen_Qual,Fireplaces,Fireplace_Qu,Garage_Type,Garage_Finish,Garage_Area,Garage_Qual,Wood_Deck_SF,Open_Porch_SF,Enclosed_Porch,Three_season_porch,Screen_Porch,Pool_Area,Fence,Misc_Val,Year_Sold,Sale_Type,Sale_Condition,All_Quality,Neighborhood_Score,Total_External_SF,Total_Finished_Bsmt_SF,Total_SF,Total_Baths,Year_To_Sell
0,b'One_Story_1946_and_Newer_All_Styles',b'Residential_Low_Density',141.0,31770.0,b'Slightly_Irregular',b'Lvl',b'Corner',b'North_Ames',b'Norm',b'OneFam',b'One_Story',5,4,1960.0,1960.0,b'Hip',b'BrkFace',b'Plywood',b'Stone',112.0,1,2,b'CBlock',3,4,4,b'BLQ',2.0,b'Unf',0.0,441.0,1080.0,1,1656.0,0.0,0.0,1656.0,1.0,0.0,1.0,0.0,3.0,1.0,2,2.0,4,b'Attchd',b'Fin',528.0,3,210.0,62.0,0.0,0.0,0.0,0.0,b'No_Fence',0.0,2010.0,b'WD ',b'Normal',33,28.787879,272.0,639.0,2823.0,2.0,50.0
1,b'One_Story_1946_and_Newer_All_Styles',b'Residential_High_Density',80.0,11622.0,b'Regular',b'Lvl',b'Inside',b'North_Ames',b'Feedr',b'OneFam',b'One_Story',4,5,1961.0,1961.0,b'Gable',b'VinylSd',b'VinylSd',b'None',0.0,1,2,b'CBlock',3,3,1,b'Rec',6.0,b'LwQ',144.0,270.0,882.0,2,896.0,0.0,0.0,896.0,0.0,0.0,1.0,0.0,2.0,1.0,2,0.0,0,b'Attchd',b'Unf',730.0,3,140.0,0.0,0.0,0.0,120.0,0.0,b'Minimum_Privacy',0.0,2010.0,b'WD ',b'Normal',26,28.787879,260.0,612.0,2238.0,1.0,49.0
2,b'One_Story_1946_and_Newer_All_Styles',b'Residential_Low_Density',81.0,14267.0,b'Slightly_Irregular',b'Lvl',b'Corner',b'North_Ames',b'Norm',b'OneFam',b'One_Story',5,5,1958.0,1958.0,b'Hip',b'Wd Sdng',b'Wd Sdng',b'BrkFace',108.0,1,2,b'CBlock',3,3,1,b'ALQ',1.0,b'Unf',0.0,406.0,1329.0,2,1329.0,0.0,0.0,1329.0,0.0,0.0,1.0,1.0,3.0,1.0,3,0.0,0,b'Attchd',b'Unf',312.0,3,393.0,36.0,0.0,0.0,0.0,0.0,b'No_Fence',12500.0,2010.0,b'WD ',b'Normal',28,28.787879,429.0,923.0,2564.0,2.0,52.0
3,b'Two_Story_1946_and_Newer',b'Residential_Low_Density',78.0,9978.0,b'Slightly_Irregular',b'Lvl',b'Inside',b'Gilbert',b'Norm',b'OneFam',b'Two_Story',5,5,1998.0,1998.0,b'Gable',b'VinylSd',b'VinylSd',b'BrkFace',20.0,1,2,b'PConc',3,3,1,b'GLQ',3.0,b'Unf',0.0,324.0,926.0,4,926.0,678.0,0.0,1604.0,0.0,0.0,2.0,1.0,3.0,1.0,3,1.0,4,b'Attchd',b'Fin',470.0,3,360.0,36.0,0.0,0.0,0.0,0.0,b'No_Fence',0.0,2010.0,b'WD ',b'Normal',34,33.620968,396.0,602.0,2676.0,3.0,12.0
4,b'One_Story_PUD_1946_and_Newer',b'Residential_Low_Density',41.0,4920.0,b'Regular',b'Lvl',b'Inside',b'Stone_Brook',b'Norm',b'TwnhsE',b'One_Story',7,4,2001.0,2001.0,b'Gable',b'CemntBd',b'CmentBd',b'None',0.0,2,2,b'PConc',4,3,2,b'GLQ',3.0,b'Unf',0.0,722.0,1338.0,4,1338.0,0.0,0.0,1338.0,1.0,0.0,2.0,0.0,2.0,1.0,3,0.0,0,b'Attchd',b'Fin',582.0,3,0.0,0.0,170.0,0.0,0.0,0.0,b'No_Fence',0.0,2010.0,b'WD ',b'Normal',34,38.850000,170.0,616.0,2536.0,3.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2188,b'Duplex_All_Styles_and_Ages',b'Residential_Low_Density',63.0,9297.0,b'Regular',b'Lvl',b'Inside',b'Mitchell',b'Norm',b'Duplex',b'One_Story',4,4,1976.0,1976.0,b'Gable',b'Plywood',b'Plywood',b'None',0.0,1,2,b'CBlock',3,3,1,b'ALQ',1.0,b'Unf',0.0,122.0,1728.0,2,1728.0,0.0,0.0,1728.0,2.0,0.0,2.0,0.0,4.0,2.0,2,0.0,0,b'Detchd',b'Unf',560.0,3,0.0,0.0,0.0,0.0,0.0,0.0,b'No_Fence',0.0,2006.0,b'WD ',b'Family',25,29.734940,0.0,1606.0,3894.0,4.0,30.0
2189,b'One_Story_1946_and_Newer_All_Styles',b'Residential_Low_Density',80.0,17400.0,b'Regular',b'Low',b'Inside',b'Mitchell',b'Norm',b'OneFam',b'One_Story',4,4,1977.0,1977.0,b'Gable',b'BrkFace',b'BrkFace',b'None',0.0,1,2,b'CBlock',3,3,1,b'ALQ',1.0,b'Unf',0.0,190.0,1126.0,1,1126.0,0.0,0.0,1126.0,1.0,0.0,2.0,0.0,3.0,1.0,2,1.0,4,b'Attchd',b'RFn',484.0,3,295.0,41.0,0.0,0.0,0.0,0.0,b'No_Fence',0.0,2006.0,b'WD ',b'Normal',28,29.734940,336.0,936.0,2546.0,3.0,29.0
2190,b'Split_or_Multilevel',b'Residential_Low_Density',37.0,7937.0,b'Slightly_Irregular',b'Lvl',b'CulDSac',b'Mitchell',b'Norm',b'OneFam',b'SLvl',5,5,1984.0,1984.0,b'Gable',b'HdBoard',b'HdBoard',b'None',0.0,1,2,b'CBlock',3,3,3,b'GLQ',3.0,b'Unf',0.0,184.0,1003.0,2,1003.0,0.0,0.0,1003.0,1.0,0.0,1.0,0.0,3.0,1.0,2,0.0,0,b'Detchd',b'Unf',588.0,3,120.0,0.0,0.0,0.0,0.0,0.0,b'Good_Privacy',0.0,2006.0,b'WD ',b'Normal',29,29.734940,120.0,819.0,2410.0,2.0,22.0
2191,b'One_Story_1946_and_Newer_All_Styles',b'Residential_Low_Density',0.0,8885.0,b'Slightly_Irregular',b'Low',b'Inside',b'Mitchell',b'Norm',b'OneFam',b'One_Story',4,4,1983.0,1983.0,b'Gable',b'HdBoard',b'HdBoard',b'None',0.0,1,2,b'CBlock',4,3,3,b'BLQ',2.0,b'ALQ',324.0,239.0,864.0,2,902.0,0.0,0.0,902.0,1.0,0.0,1.0,0.0,2.0,1.0,2,0.0,0,b'Attchd',b'Unf',484.0,3,164.0,0.0,0.0,0.0,0.0,0.0,b'Minimum_Privacy',0.0,2006.0,b'WD ',b'Normal',28,29.734940,164.0,625.0,2011.0,2.0,23.0


MS_SubClass                object
MS_Zoning                  object
Lot_Frontage              float64
Lot_Area                  float64
Lot_Shape                  object
Land_Contour               object
Lot_Config                 object
Neighborhood               object
Condition_1                object
Bldg_Type                  object
House_Style                object
Overall_Qual                int64
Overall_Cond                int64
Year_Built                float64
Year_Remod_Add            float64
Roof_Style                 object
Exterior_1st               object
Exterior_2nd               object
Mas_Vnr_Type               object
Mas_Vnr_Area              float64
Exter_Qual                  int64
Exter_Cond                  int64
Foundation                 object
Bsmt_Qual                   int64
Bsmt_Cond                   int64
Bsmt_Exposure               int64
BsmtFin_Type_1             object
BsmtFin_SF_1              float64
BsmtFin_Type_2             object
BsmtFin_SF_2  

# Number of trees | Overfitting detector

It is recommended to check that there is no obvious underfitting or overfitting before tuning any other parameters. In order to do this it is necessary to analyze the metric value on the validation dataset and select the appropriate number of iterations.

This can be done by setting the number of iterations to a large value and enabling the overfitting detector.
In this case the resulting model contains only the first k iterations, where k is the iteration with the best loss value on the validation dataset.

The overfitting detector interrupts training when needed.
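
The stopping rule can be sketched in a few lines. This mimics a patience-based detector (stop after a fixed number of iterations without improvement on the validation set). The function name and the loss values are made up for illustration.

```python
# Minimal sketch of patience-based early stopping on a validation metric.
# The loss curve below is invented; lower is better.

def run_with_early_stopping(validation_losses, patience):
    """Return (best_iteration, best_loss, stopped_at) for a stream of losses."""
    best_iteration, best_loss = 0, float("inf")
    stopped_at = None
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_iteration, best_loss = iteration, loss
        elif iteration - best_iteration >= patience:
            stopped_at = iteration   # no improvement for `patience` iterations
            break
    return best_iteration, best_loss, stopped_at

losses = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.9, 7.2, 7.5]
best_iteration, best_loss, stopped_at = run_with_early_stopping(losses, patience=3)
print(best_iteration, best_loss, stopped_at)  # prints: 3 6.5 6
```

Training stops at iteration 6, but the model that is kept ends at iteration 3, the best one. This corresponds to the combination of an overfitting detector and keeping only the best iterations described above.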

In [68]:
model = cb.CatBoostRegressor(loss_function='RMSE', task_type='GPU', devices='0:1',  # Enable GPU training
                             use_best_model=True,  # Keep only the trees up to the best iteration on the validation set
                             iterations=5000,  # Deliberately high, to find an appropriate number of iterations (for the grid search)
                             eval_metric='RMSE',  # Lower is better (RMSE is used instead of R2 because it is supported for GPU training)
                             # od_type='IncToDec',  # Type of overfitting detector
                             # od_pval=.01,  # Detector threshold, between 10^-10 and 10^-2 (larger values stop training sooner)
                             border_count=128,  # Number of splits for numerical features. On GPU this strongly affects speed:
                             # smaller values train faster, larger values can give better quality.
                             random_state=6,
                             early_stopping_rounds=12,  # Stop after 12 iterations without improvement on the validation set
                             )

model.fit(train_dataset,eval_set= test_dataset) #Train model !

Learning rate set to 0.04273
0:	learn: 77630.8864256	test: 76976.0494921	best: 76976.0494921 (0)	total: 47.6ms	remaining: 3m 57s
1:	learn: 75177.9281581	test: 74480.3372010	best: 74480.3372010 (1)	total: 96.4ms	remaining: 4m
2:	learn: 72913.0137518	test: 72230.1151141	best: 72230.1151141 (2)	total: 142ms	remaining: 3m 56s
3:	learn: 70684.7617267	test: 69930.1177640	best: 69930.1177640 (3)	total: 188ms	remaining: 3m 54s
4:	learn: 68489.5458084	test: 67721.4140066	best: 67721.4140066 (4)	total: 233ms	remaining: 3m 52s
5:	learn: 66410.9258092	test: 65594.4499951	best: 65594.4499951 (5)	total: 281ms	remaining: 3m 53s
6:	learn: 64604.0734680	test: 63743.5759549	best: 63743.5759549 (6)	total: 330ms	remaining: 3m 55s
7:	learn: 62714.1255687	test: 61756.6566747	best: 61756.6566747 (7)	total: 376ms	remaining: 3m 54s
8:	learn: 61042.9658238	test: 60074.4451349	best: 60074.4451349 (8)	total: 423ms	remaining: 3m 54s
9:	learn: 59425.2722501	test: 58465.9997606	best: 58465.9997606 (9)	total: 472ms	r

<catboost.core.CatBoostRegressor at 0x16d18d3a6a0>

In [72]:
predictions = model.predict(X_test)

tr_pr = model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(y_test, predictions)))
r2 = r2_score(y_test, predictions)

print("train performance : {:.2f}".format(r2_score(y_train, tr_pr)))
print("Model performance with RMSE and R2")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))

train performance : 0.95
Model performance with RMSE and R2
RMSE: 21374.26
R2: 0.93
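
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up numbers rather than the notebook's data and follows the standard definitions of RMSE and R2.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of an error, in target units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus the share of variance left unexplained."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented example values, not taken from the housing data.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
print(rmse(y_true, y_pred))           # 10.0
print(round(r2(y_true, y_pred), 3))   # 0.992
```

RMSE is expressed in the units of the target (here, sale price), while R2 is unitless, which is why the two are usually reported together.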


- Note that the overfitting detector stopped training after 256 iterations instead of running all 5000.

# Train with CV Grid and hyperparameter explanation

In [77]:
model = cb.CatBoostRegressor(loss_function='RMSE', task_type='GPU', devices='0:1',
                             eval_metric='RMSE',
                             od_type='IncToDec',
                             od_pval=.01,
                             border_count=254, #The number of splits for numerical features.
                             )

grid = {'iterations': [100, 200, 300],  # Maximum number of trees that can be built
        'learning_rate': [0.04273],  # Scales the gradient step; smaller values need more iterations
        'depth': [6, 7, 8, 9, 10],  # Tree depth; a good range is usually between 6 and 10
        'l2_leaf_reg': [0.2, 0.5, 1, 3],  # Coefficient of the L2 regularization term (a penalty similar to ridge regression)
        'random_strength': [0.2, 0.5, 0.8],  # Amount of randomness added when scoring splits (helps against overfitting)
        }


model.randomized_search(grid,
                        train_dataset,
                        y=None,
                        cv=5,
                        n_iter=20,  # Number of parameter combinations sampled from the grid
                        partition_random_seed=0,
                        calc_cv_statistics=False,  # If True, estimate the quality by cross-validation with the best parameters found
                        search_by_train_test_split=False,  # False: evaluate each candidate with cross-validation instead of a single train/test split
                        refit=True,  # Refit the estimator on the whole dataset using the best parameters found
                        shuffle=True,  # Shuffle before splitting into folds
                        stratified=False,  # Not suited to regression; useful for classification tasks
                        train_size=0.8,  # Proportion of the dataset used for the train split
                        verbose=False,
                        )


predictions = model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, predictions)))
r2 = r2_score(y_test, predictions)

print("Model performance with RMSE and R2")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))


Training on fold [0/5]
0:	learn: 189961.8654770	test: 189317.7336648	best: 189317.7336648 (0)	total: 12.5ms	remaining: 1.24s
1:	learn: 182278.1659624	test: 181682.4910935	best: 181682.4910935 (1)	total: 26.2ms	remaining: 1.29s
2:	learn: 174959.9859078	test: 174361.6911017	best: 174361.6911017 (2)	total: 40.4ms	remaining: 1.31s
3:	learn: 167945.7749470	test: 167356.5361419	best: 167356.5361419 (3)	total: 54.6ms	remaining: 1.31s
4:	learn: 161309.1572058	test: 160835.7269527	best: 160835.7269527 (4)	total: 69.6ms	remaining: 1.32s
5:	learn: 154899.2854659	test: 154485.4123455	best: 154485.4123455 (5)	total: 83.7ms	remaining: 1.31s
6:	learn: 148755.1697375	test: 148431.3216569	best: 148431.3216569 (6)	total: 98ms	remaining: 1.3s
7:	learn: 142930.2232271	test: 142690.0856562	best: 142690.0856562 (7)	total: 111ms	remaining: 1.27s
8:	learn: 137326.4739509	test: 137146.1355476	best: 137146.1355476 (8)	total: 126ms	remaining: 1.27s
9:	learn: 131992.7794499	test: 131876.3203084	best: 131876.32030

KeyboardInterrupt: 
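
To see why a randomized search is used here instead of an exhaustive one, the sketch below counts the combinations in the grid above and draws 20 of them. Sampling without replacement and the helper name `sample_grid` are assumptions for illustration; this is not how CatBoost implements `randomized_search` internally.

```python
import itertools
import random

# Same grid as in the cell above.
grid = {'iterations': [100, 200, 300],
        'learning_rate': [0.04273],
        'depth': [6, 7, 8, 9, 10],
        'l2_leaf_reg': [0.2, 0.5, 1, 3],
        'random_strength': [0.2, 0.5, 0.8]}

def sample_grid(grid, n_iter, seed=0):
    """Enumerate every combination, then draw `n_iter` of them at random.
    Assumption: candidates are drawn without replacement."""
    keys = sorted(grid)
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(grid[k] for k in keys))]
    rng = random.Random(seed)
    return rng.sample(combos, min(n_iter, len(combos))), len(combos)

candidates, total = sample_grid(grid, n_iter=20)
print(total)            # 180 combinations in the full grid
print(len(candidates))  # 20 are actually evaluated
print(candidates[0])
```

With 5-fold cross-validation, the full grid would require 180 x 5 = 900 training runs, while 20 sampled candidates need 100.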

# Display parameters of best iteration

In [None]:
display(model.get_all_params())

# Drop unused features and display data of used features

In [29]:
model.drop_unused_features()
display(model.feature_names_)
model.calc_feature_statistics(feature = model.feature_names_,data = X_train,target = y_train)

122