<a name="0"></a>
# **Table of Contents**
## [How CatBoost works](#1)
### [1. Handling Categorical Features](#2)
### [2. Ordered Boosting](#3)
### [3. Symmetric trees](#4)
## [Practical Implementation](#5)
## [CatBoost Hyperparameters](#6)
### [1. Learning Rate](#7)
### [2. L2 leaf regularization](#8)
### [3. Grow policy](#9)
### [4. Bootstrap_type](#10)
### [5. Boosting type](#11)
### [6. max_leaves](#12)
### [7. min_data_in_leaf](#13)
### [8. nan_mode](#14)
## [Limitations](#15)
## [Optimal Use Cases](#16)
## [Parallelization Capabilities](#17)
## [References](#18)

<center><font color='navyblue' size='6'><b>CatBoost</b></font></center>


CatBoost, short for **`Categorical Boosting`**, is an open-source gradient boosting library developed by Yandex. It is designed to work with categorical variables out of the box. Many other machine learning algorithms require categorical variables to be converted to a numerical format through one-hot encoding or similar techniques; CatBoost processes them natively, which simplifies data preparation and can improve model performance.

<a name="1"></a>

## **How CatBoost works**
Compared to CatBoost, other state-of-the-art gradient boosting implementations suffer from two main problems: target leakage and prediction shift.

Target Leakage:

Most algorithms handle a categorical feature with a target statistic (TS): an estimate of the expected target y conditioned on the category. LightGBM, for example, groups tail categories together to save memory and computation, which loses some information.

The problem with target statistics is that they introduce leakage into the dataset, because the target is used to encode the input features. CatBoost addresses this by encoding each example using only its "history", that is, the examples that come before it.
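
To make the leakage concrete, here is a minimal pure-Python sketch. The data and the helper name `greedy_target_mean` are invented for illustration; this is not CatBoost's internal code, only the plain "mean target per category" encoding computed on all rows.

```python
from collections import defaultdict

def greedy_target_mean(categories, targets):
    """Encode each row with the mean target of its category, computed on ALL rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]

# Hypothetical worst case: every row has its own category (for example a user ID).
user_id = ["u1", "u2", "u3", "u4"]
target = [1, 0, 0, 1]

encoded = greedy_target_mean(user_id, target)
print(encoded)  # each encoded value is simply the row's own target
```

With one row per category the encoded feature is a copy of the target, so a model can fit the training data perfectly while learning nothing that transfers to new rows.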

Prediction Shift:

Because of this leakage, the distribution of predictions on training examples is shifted relative to the distribution on unseen examples. In theory, the fix would be to train on a fresh, independent dataset at every iteration, which is infeasible in real life. Instead, CatBoost uses random permutations of the training examples, both for the target statistics calculation and for ordered boosting, and combines the two in one algorithm to avoid prediction shift.


**Catboost introduces the following algorithmic advances:**
- 1) An innovative algorithm for `Processing categorical features`. There is no need to preprocess these features yourself, and on data with categorical features the accuracy is often better than with other algorithms.


- 2) The implementation of ordered boosting, a permutation-driven alternative to the classic boosting algorithm


- 3) Fast and easy to use GPU training


<a name="2"></a>
## 1. Handling Categorical Features

CatBoost one-hot encodes every categorical feature whose number of distinct values is less than or equal to the **`one_hot_max_size`** parameter.

For features with a cardinality above that threshold, CatBoost uses a more effective strategy. It relies on the ordering principle and is called **`Ordered Target Encoding`**: the dataset is randomly permuted, and then a target encoding of some type (for example, the mean of the target for objects of this category) is computed for each example using only the objects placed before it.

In general, transforming categorical features to numerical features in CatBoost includes the following steps:

1.  We draw a random permutation order of the dataset


2. Quantization for regression problems: the continuous label value is converted into discrete bins, so that it can be counted like a class


3. Then we iterate sequentially throughout the observations respecting that new order. For every observation, we compute a statistic of interest using only the observations that we have already seen in the past.


4. The initial observations lack sufficient training data to generate reliable estimates, resulting in significant variability. To address this challenge, the creators of CatBoost suggested generating multiple random permutations and creating an encoding for each permutation. The ultimate outcome is obtained by averaging these distinct encodings.

$$ \text{ctr} = \frac{\text{countInClass} + \text{prior}}{\text{totalCount} + 1} $$

where:
* **`countInClass`** is how many times the label value exceeded the threshold *i* for objects with the current categorical feature value (for a binary label, how many times the label equals 1). It only counts objects that already have this value calculated (calculations are made in the order of the objects after shuffling).
* **`totalCount`** is the total number of objects, among those that precede the current one, that have a feature value matching the current one.
* **`prior`** is a number (constant) defined by the starting parameters.

[Table of Contents](#0)

In [1]:
import pandas as pd
import numpy as np

# Creating Dataset
df = pd.DataFrame({
    'color': ["blue", "red", "green", "blue", "green", "green", "blue"],
    'grade': [1, 0, 1, 0, 1, 0, 0],
})
prior = 0.05
CTR = []  # list to save the encoded values
for i in range(len(df)):
    data = df.iloc[:i + 1, :]  # rows up to and including the current one
    same_color = data['color'] == data.loc[i, 'color']
    # Count only the rows that precede row i (the current row is excluded)
    totalCount = same_color.sum() - 1
    countInClass = (same_color & (data['grade'] == 1) & (data.index != i)).sum()
    ctr = (countInClass + prior) / (totalCount + 1)
    CTR.append(ctr)
df['encoded_color'] = CTR
df.style.apply(
    lambda row: ['background-color: lightblue' if row['color'] == 'blue'
                 else 'background-color: lightcoral' if row['color'] == 'red'
                 else 'background-color: lightgreen' if row['color'] == 'green'
                 else 'background-color: darkgrey'] * len(row),
    axis=1)


| | color | grade | encoded_color |
|---|---|---|---|
| 0 | blue | 1 | 0.05 |
| 1 | red | 0 | 0.05 |
| 2 | green | 1 | 0.05 |
| 3 | blue | 0 | 0.525 |
| 4 | green | 1 | 0.525 |
| 5 | green | 0 | 0.683333 |
| 6 | blue | 0 | 0.35 |
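
The same encoding can be sketched without pandas, together with the averaging over several permutations described in step 4. The function names `ordered_ctr` and `averaged_ctr`, the number of permutations, and the seed are assumptions made for this illustration; CatBoost's actual implementation is more involved.

```python
import random
from collections import defaultdict

def ordered_ctr(categories, labels, prior=0.05, order=None):
    """Ordered target statistic: each row only sees rows that precede it in `order`."""
    order = list(range(len(categories))) if order is None else order
    seen_total = defaultdict(int)     # rows of this category seen so far
    seen_positive = defaultdict(int)  # ... of which label == 1
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        encoded[idx] = (seen_positive[cat] + prior) / (seen_total[cat] + 1)
        seen_total[cat] += 1
        seen_positive[cat] += labels[idx]
    return encoded

def averaged_ctr(categories, labels, n_permutations=4, prior=0.05, seed=0):
    """Average the ordered statistic over several random permutations (step 4)."""
    rng = random.Random(seed)
    n = len(categories)
    total = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        for i, value in enumerate(ordered_ctr(categories, labels, prior, order)):
            total[i] += value
    return [t / n_permutations for t in total]

color = ["blue", "red", "green", "blue", "green", "green", "blue"]
grade = [1, 0, 1, 0, 1, 0, 0]

single = ordered_ctr(color, grade)     # original row order, as in the table above
averaged = averaged_ctr(color, grade)  # less dependent on any one ordering
print([round(v, 6) for v in single])
print([round(v, 3) for v in averaged])
```

With the original row order, `single` reproduces the `encoded_color` column; the averaged version reduces the variance of rows that happen to come early in one particular permutation.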


<a name="3"></a>
## 2. Ordered Boosting
- Classical boosting is prone to overfitting. The leaf value (the "output") is an estimate of the gradient over all the objects that fall into the leaf, and in classical boosting this estimate is biased because it is computed on the same objects that the model was built on.
$$
\text{leafValue} = \frac{\sum_{i=1}^{n} g(\text{approx}(i), \text{target}(i))}{n}
$$

- CatBoost implements an algorithm that counters the usual gradient boosting biases. Existing implementations face a statistical issue called **`prediction shift`**: the distribution of predictions for a training example shifts away from the distribution for a test example. This problem is similar to the one that occurs in the preprocessing of categorical variables.

- The CatBoost team derived ordered boosting, a modification of the standard gradient boosting algorithm that avoids target leakage and prediction shift. It uses random permutations while building the tree structure, so the estimate for each object comes from a model that has never seen that object. This improves quality on small or noisy datasets, that is, whenever overfitting is a risk.

$$
\text{leafValue} = \frac{\sum_{i=1}^{\text{docs}} g(\text{approx}(i), \text{target}(i))}{\text{docs}}
$$

Here $\text{docs}$ counts only the objects that precede the current object in the permutation, whereas $n$ in the classical formula counts every object in the leaf.
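
The difference between the two leaf-value formulas can be illustrated with a small pure-Python sketch. It assumes a squared-loss gradient, a single leaf, and objects that are already in permutation order; the function names are hypothetical, and real ordered boosting maintains supporting models that this toy version leaves out.

```python
def gradient(approx, target):
    """Negative gradient of the squared loss 0.5 * (target - approx) ** 2."""
    return target - approx

def classical_leaf_value(approx, target):
    """One shared estimate that averages over every object in the leaf."""
    grads = [gradient(a, t) for a, t in zip(approx, target)]
    return sum(grads) / len(grads)

def ordered_leaf_values(approx, target):
    """Per-object estimates that use only the objects earlier in the permutation."""
    values, running_sum = [], 0.0
    for seen, (a, t) in enumerate(zip(approx, target)):
        # No history exists for the first object, so fall back to 0.0 (an assumption).
        values.append(running_sum / seen if seen else 0.0)
        running_sum += gradient(a, t)
    return values

approx = [0.0, 0.0, 0.0, 0.0]   # current predictions for four objects in one leaf
target = [1.0, 3.0, 2.0, 6.0]   # objects are already in permutation order

print(classical_leaf_value(approx, target))  # 3.0
print(ordered_leaf_values(approx, target))   # [0.0, 1.0, 2.0, 2.0]
```

In the classical version the last object (target 6.0) pulls its own estimate upward; in the ordered version its estimate depends only on the three objects before it.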


<a name="4"></a>
## 3. Symmetric trees
CatBoost builds balanced trees, also known as symmetric trees, as its base predictors. Unlike traditional gradient boosting methods that build trees leaf-wise or depth-wise, CatBoost's symmetric trees use the same splitting criterion for all nodes at the same level.

Such trees are balanced, less prone to overfitting, and allow speeding up prediction significantly at testing time.
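
One reason prediction is fast is that a symmetric tree can be evaluated without walking a node structure: each level contributes one bit of the leaf index. The following sketch shows the idea; the tree, thresholds, and function name are made up for illustration and do not reflect CatBoost's internal data layout.

```python
def oblivious_predict(row, splits, leaf_values):
    """Predict with one symmetric tree.

    splits      -- one (feature_index, threshold) pair per tree level
    leaf_values -- 2 ** len(splits) values, indexed by the bits of the answers
    """
    leaf_index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:
            leaf_index |= 1 << level   # each level contributes one bit
    return leaf_values[leaf_index]

# Hypothetical depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 30.0), (1, 40.0)]
leaf_values = [-0.4, 0.1, 0.2, 0.9]   # indices 0b00, 0b01, 0b10, 0b11

print(oblivious_predict([25.0, 20.0], splits, leaf_values))  # -0.4
print(oblivious_predict([45.0, 20.0], splits, leaf_values))  # 0.1
print(oblivious_predict([45.0, 50.0], splits, leaf_values))  # 0.9
```

Because every level asks the same question regardless of the path taken, a tree of depth d needs exactly d comparisons and one table lookup.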

<figure style="text-align: center;">
    <img src="https://i.ibb.co/zsFHjTD/Screenshot-2024-03-06-205057.png" style="width:75%; display: block; margin: auto;" />
    <figcaption>Symmetric tree of catboost</figcaption>
</figure>

[Table of Contents](#0)

<a name="5"></a>
## **Practical Implementation**

 **Classification Example**

Suppose we're working on a binary classification problem. Here's how you can implement a **`CatBoostClassifier`**:

In [2]:
import catboost
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import time
import warnings
warnings.filterwarnings('ignore')

We will use the Adult dataset, one of the datasets used in the original **CatBoost** paper to compare CatBoost with XGBoost and LightGBM.
* The data consists of 32561 observations and 14 features, plus the target variable
* It's a classification problem where we aim to predict whether the income is at most 50K or greater than 50K

 **Predictors are:**
* Age
* Workclass: private, state gov, without pay, etc.
* fnlwgt
* Education: bachelors, masters, etc.
* Education num
* Marital status
* Occupation
* Family status (the `relationship` column): wife, husband, own child, etc.
* Race: white, black, etc.
* Gender (the `sex` column)
* Capital gain
* Capital loss
* Hours per week
* Native country: United States, Cuba, China, etc.


In [3]:
#Loading data
from catboost.datasets import adult
adult_train , adult_test = adult()

In [4]:
adult_train.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | State-gov | 77516.0 | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 50.0 | Self-emp-not-inc | 83311.0 | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K |
| 2 | 38.0 | Private | 215646.0 | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 3 | 53.0 | Private | 234721.0 | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 4 | 28.0 | Private | 338409.0 | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Cuba | <=50K |


In [5]:
adult_train = adult_train.dropna()

In [6]:
# Mapping income to binary classification
adult_train['income']=adult_train['income'].map({'<=50K' : 0 , '>50K' : 1})

In [7]:
# Divide features and target
X = adult_train.drop('income' , axis=1)
y = adult_train.income

In [8]:
#Getting categorical features
cat_features = ['workclass' , 'education' , 'marital-status' , 'occupation' , 'relationship' ,
                'race' ,'sex' , 'native-country']

In [9]:
from catboost import Pool
pool = Pool(data=X , label=y , cat_features=cat_features)

In [10]:
from sklearn.model_selection import train_test_split
x_train , x_val , y_train , y_val = train_test_split(X, y , test_size=.2 , random_state=0)
train_pool = Pool(
        data=x_train,
        label=y_train,
        cat_features = cat_features
)


validation_pool = Pool(
        data = x_val,
        label=y_val,
        cat_features = cat_features)

In [11]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=1000,
                           loss_function='Logloss'
                          )
model.fit(train_pool , eval_set=validation_pool , verbose=100)
print('Model is Fitted {}'.format(model.is_fitted()))
print("Model parameters {}".format(model.get_params()))

Learning rate set to 0.069566
0:	learn: 0.6324237	test: 0.6332103	best: 0.6332103 (0)	total: 78ms	remaining: 1m 17s
100:	learn: 0.2870042	test: 0.2976030	best: 0.2976030 (100)	total: 3.04s	remaining: 27.1s
200:	learn: 0.2715236	test: 0.2886415	best: 0.2886415 (200)	total: 6.03s	remaining: 24s
300:	learn: 0.2629174	test: 0.2862884	best: 0.2862435 (285)	total: 9.06s	remaining: 21s
400:	learn: 0.2565195	test: 0.2854570	best: 0.2853909 (391)	total: 12.1s	remaining: 18.1s
500:	learn: 0.2497836	test: 0.2856751	best: 0.2850291 (423)	total: 15.3s	remaining: 15.2s
600:	learn: 0.2440189	test: 0.2852846	best: 0.2850291 (423)	total: 18.3s	remaining: 12.2s
700:	learn: 0.2391983	test: 0.2854882	best: 0.2850291 (423)	total: 21.4s	remaining: 9.15s
800:	learn: 0.2346871	test: 0.2859116	best: 0.2850291 (423)	total: 24.7s	remaining: 6.13s
900:	learn: 0.2303671	test: 0.2861294	best: 0.2850291 (423)	total: 27.7s	remaining: 3.05s
999:	learn: 0.2259424	test: 0.2864257	best: 0.2850291 (423)	total: 31.2s	remai

**This is the result without any hyperparameter tuning. Let's see how we can improve it by changing the hyperparameters.**

[Table of Contents](#0)

<a name="6"></a>
## **CatBoost Hyperparameters**

In [12]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/adult-data/adult.xlsx


In [13]:
df=pd.read_excel('/kaggle/input/adult-data/adult.xlsx')

In [14]:
df.shape

(32561, 15)

In [15]:
#Checking missing values
df.isna().sum()

age               0
workclass         0
fnlwgt            0
Education         0
education_num     0
Marital_status    0
occupation        0
Family_status     0
race              0
Gender            0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
Income            0
dtype: int64

In [16]:
# income to binary variable
df['Income']=df['Income'].replace('<=50K',0,regex=True)
df['Income']=df['Income'].replace('>50K',1,regex=True)

In [17]:
df.head()

| | age | workclass | fnlwgt | Education | education_num | Marital_status | occupation | Family_status | race | Gender | capital_gain | capital_loss | hours_per_week | native_country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 0 |


In [18]:
# Split numeric features from categorical ones (the target 'Income' is excluded)
df_numeric=pd.DataFrame()
for variable in df.columns:
    if df[variable].dtype != 'object' and variable != 'Income':
        df_numeric=pd.concat([df_numeric,df[variable]],axis=1)


In [19]:
df_cat=pd.DataFrame()
for variable in df.columns:
      if df[variable].dtype=='object' :
        df_cat=pd.concat([df_cat,df[variable]],axis=1)

In [20]:
cat=df_cat.to_numpy()
Full_data=np.concatenate((df_numeric,cat),axis=1)
y=df['Income'].to_numpy()
X=Full_data

In [21]:
X.shape

(32561, 14)
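
Because `X` is now a plain array, the categorical columns have to be identified by position rather than by name. The sketch below shows where the index list `[6, ..., 13]` used in the later `fit` calls comes from. It assumes that the six integer columns visible in `df.head()` are the numeric ones; the variable names are introduced here for illustration only.

```python
# Column names as reported by df.isna().sum(); 'Income' is the target and is excluded.
numeric_columns = ["age", "fnlwgt", "education_num",
                   "capital_gain", "capital_loss", "hours_per_week"]
categorical_columns = ["workclass", "Education", "Marital_status", "occupation",
                       "Family_status", "race", "Gender", "native_country"]

# X was built as [numeric block | categorical block], so the categorical
# columns occupy the trailing positions of the array.
feature_names = numeric_columns + categorical_columns
cat_feature_indices = [feature_names.index(name) for name in categorical_columns]

print(len(feature_names))      # 14
print(cat_feature_indices)     # [6, 7, 8, 9, 10, 11, 12, 13]
```

Deriving the indices this way is less error-prone than hard-coding them if the column set ever changes.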

In [22]:
# splitting data into train,val , test
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_temp, y_train_temp, test_size=0.2, random_state=42)
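
Applying `test_size=0.2` twice gives roughly a 64/16/20 split into train, validation, and test. The arithmetic can be checked with the sketch below, which assumes that the held-out part of each split is rounded up; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes produced by one hold-out split (held-out part rounded up)."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 32561
n_train_temp, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_temp, 0.2)

print(n_train, n_val, n_test)   # 20838 5210 6513
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```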

The CatBoost library is very rich, and the classifier has many parameters to tune. For simplicity, we have chosen a set of hyperparameters and test their effect on model performance and training time.

In [23]:
#function to evaluate model performance
def evaluate_model (y_test,y_pred):
      print(f'Accuracy = {round(accuracy_score(y_test,y_pred)*100,1)} %')
      print(f'Precision = {round(precision_score(y_test,y_pred)*100,1)} %')
      print(f'Recall = {round(recall_score(y_test,y_pred)*100,1)}%')
      print(f'F1 score = {round(f1_score(y_test,y_pred)*100,1)}%')
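
The cells that follow repeat the same start/fit/stop timing pattern. As a sketch, that pattern could be wrapped in a small helper. `timed_fit` and `DummyModel` are hypothetical names introduced here; the helper only assumes that the model exposes a `.fit()` method, which holds for `CatBoostClassifier`.

```python
import time

def timed_fit(model, *fit_args, **fit_kwargs):
    """Fit any estimator that exposes .fit() and return (model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(*fit_args, **fit_kwargs)
    return model, time.perf_counter() - start

class DummyModel:
    """Stand-in estimator so the sketch runs without training a real model."""
    def fit(self, X, y, **kwargs):
        self.n_rows_ = len(X)
        return self

model, seconds = timed_fit(DummyModel(), [[0], [1]], [0, 1])
print(f'Model fitted in {round(seconds, 2)} seconds')
```

With a real classifier the call would look like `timed_fit(clf, X_train, y_train, cat_features=[...], eval_set=(X_val, y_val))`.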

<a name="7"></a>
###  **1) Learning Rate**: the weight applied to each tree's leaf output before it is added to the prediction
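
To see what this weight does, consider a toy boosting loop in which every "tree" is a single leaf that predicts the mean residual. This is a deliberately simplified illustration under a squared-loss assumption, with a made-up function name; it is not how CatBoost builds trees.

```python
def boost_constant_trees(targets, learning_rate, iterations):
    """Toy boosting for squared loss where every 'tree' is a single leaf."""
    prediction = 0.0
    for _ in range(iterations):
        residuals = [t - prediction for t in targets]
        leaf_output = sum(residuals) / len(residuals)   # what one tree would predict
        prediction += learning_rate * leaf_output        # shrink before adding
    return prediction

targets = [2.0, 4.0, 6.0]   # mean is 4.0, the best constant prediction

for lr in (0.01, 0.1, 0.25):
    print(lr, round(boost_constant_trees(targets, lr, iterations=20), 3))
```

After the same number of iterations, a larger learning rate ends up closer to the optimum of 4.0, which is why small learning rates generally need more trees.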

In [24]:
# When learning_rate is not given, it is set automatically for the Logloss, MultiClass and RMSE
# loss functions depending on the number of iterations, provided that none of
# leaf_estimation_iterations, leaf_estimation_method and l2_leaf_reg is set.
clf=catboost.CatBoostClassifier(iterations=1000,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

Learning rate set to 0.067091
0:	learn: 0.6249048	test: 0.6251335	best: 0.6251335 (0)	total: 31ms	remaining: 31s
100:	learn: 0.2819463	test: 0.2934680	best: 0.2934680 (100)	total: 2.75s	remaining: 24.5s
200:	learn: 0.2660824	test: 0.2855544	best: 0.2855544 (200)	total: 5.32s	remaining: 21.2s
300:	learn: 0.2554257	test: 0.2828093	best: 0.2827943 (292)	total: 8.03s	remaining: 18.6s
400:	learn: 0.2477405	test: 0.2817471	best: 0.2817471 (400)	total: 10.7s	remaining: 16s
500:	learn: 0.2413467	test: 0.2811878	best: 0.2811878 (500)	total: 13.5s	remaining: 13.5s
600:	learn: 0.2354350	test: 0.2815630	best: 0.2811878 (500)	total: 16.3s	remaining: 10.8s
700:	learn: 0.2300930	test: 0.2819460	best: 0.2811878 (500)	total: 19.4s	remaining: 8.29s
800:	learn: 0.2246873	test: 0.2817773	best: 0.2811878 (500)	total: 22.2s	remaining: 5.51s
900:	learn: 0.2190815	test: 0.2818115	best: 0.2811878 (500)	total: 25s	remaining: 2.75s
999:	learn: 0.2143648	test: 0.2819574	best: 0.2811878 (500)	total: 27.8s	remainin

In [25]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.8 %
Recall = 67.9%
F1 score = 72.9%
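
As a quick sanity check, the reported F1 score can be recomputed from the printed precision and recall, since F1 is their harmonic mean. The inputs are the rounded values shown above, so agreement is only expected up to rounding.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values printed by evaluate_model for the automatic learning rate run.
precision, recall = 78.8, 67.9
f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 1))   # 72.9, matching the reported F1 score
```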


In [26]:
#learning rate=0.01
#default number of trees(iterations)=1000, verbose =100 , to show results every 100 iterations
clf=catboost.CatBoostClassifier(iterations=1000,learning_rate=0.01,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6823063	test: 0.6823443	best: 0.6823443 (0)	total: 31ms	remaining: 31s
100:	learn: 0.3481324	test: 0.3498848	best: 0.3498848 (100)	total: 2.65s	remaining: 23.6s
200:	learn: 0.3134754	test: 0.3162396	best: 0.3162396 (200)	total: 5.42s	remaining: 21.5s
300:	learn: 0.3015594	test: 0.3057822	best: 0.3057822 (300)	total: 8.18s	remaining: 19s
400:	learn: 0.2942198	test: 0.3004332	best: 0.3004332 (400)	total: 10.9s	remaining: 16.3s
500:	learn: 0.2889540	test: 0.2968523	best: 0.2968523 (500)	total: 13.6s	remaining: 13.6s
600:	learn: 0.2852732	test: 0.2948829	best: 0.2948817 (599)	total: 16.3s	remaining: 10.8s
700:	learn: 0.2823781	test: 0.2934764	best: 0.2934742 (699)	total: 18.9s	remaining: 8.08s
800:	learn: 0.2797573	test: 0.2920891	best: 0.2920891 (800)	total: 21.6s	remaining: 5.38s
900:	learn: 0.2774970	test: 0.2911237	best: 0.2911237 (900)	total: 24.5s	remaining: 2.69s
999:	learn: 0.2749761	test: 0.2896530	best: 0.2896530 (999)	total: 27s	remaining: 0us

bestTest = 0.289652980

In [27]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.4 %
Precision = 79.1 %
Recall = 64.7%
F1 score = 71.2%


In [28]:
#learning rate=0.1
clf=catboost.CatBoostClassifier(learning_rate=0.1,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5952369	test: 0.5955553	best: 0.5955553 (0)	total: 29.8ms	remaining: 29.8s
100:	learn: 0.2737917	test: 0.2891897	best: 0.2891880 (99)	total: 2.69s	remaining: 24s
200:	learn: 0.2554362	test: 0.2822633	best: 0.2821903 (193)	total: 5.33s	remaining: 21.2s
300:	learn: 0.2435123	test: 0.2818742	best: 0.2816393 (232)	total: 8.16s	remaining: 18.9s
400:	learn: 0.2341409	test: 0.2817917	best: 0.2811593 (362)	total: 11s	remaining: 16.4s
500:	learn: 0.2260160	test: 0.2812939	best: 0.2809208 (489)	total: 13.8s	remaining: 13.7s
600:	learn: 0.2185104	test: 0.2811669	best: 0.2809208 (489)	total: 16.6s	remaining: 11s
700:	learn: 0.2106777	test: 0.2820709	best: 0.2809208 (489)	total: 19.4s	remaining: 8.29s
800:	learn: 0.2040436	test: 0.2826772	best: 0.2809208 (489)	total: 22.2s	remaining: 5.52s
900:	learn: 0.1981026	test: 0.2834479	best: 0.2809208 (489)	total: 25s	remaining: 2.75s
999:	learn: 0.1925533	test: 0.2840588	best: 0.2809208 (489)	total: 28.4s	remaining: 0us

bestTest = 0.2809207649

In [29]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.7 %
Precision = 78.3 %
Recall = 68.0%
F1 score = 72.8%


In [30]:
#learning rate=0.2
clf=catboost.CatBoostClassifier(learning_rate=0.2,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5195499	test: 0.5200523	best: 0.5200523 (0)	total: 31.6ms	remaining: 31.5s
100:	learn: 0.2553256	test: 0.2842527	best: 0.2837815 (83)	total: 2.73s	remaining: 24.3s
200:	learn: 0.2350277	test: 0.2842300	best: 0.2830372 (155)	total: 5.46s	remaining: 21.7s
300:	learn: 0.2182575	test: 0.2849617	best: 0.2830372 (155)	total: 8.27s	remaining: 19.2s
400:	learn: 0.2041632	test: 0.2872916	best: 0.2830372 (155)	total: 11s	remaining: 16.5s
500:	learn: 0.1926095	test: 0.2893456	best: 0.2830372 (155)	total: 13.8s	remaining: 13.7s
600:	learn: 0.1832756	test: 0.2912935	best: 0.2830372 (155)	total: 16.6s	remaining: 11s
700:	learn: 0.1725922	test: 0.2927472	best: 0.2830372 (155)	total: 19.5s	remaining: 8.3s
800:	learn: 0.1636031	test: 0.2961164	best: 0.2830372 (155)	total: 22.3s	remaining: 5.53s
900:	learn: 0.1563993	test: 0.2983600	best: 0.2830372 (155)	total: 25.1s	remaining: 2.76s
999:	learn: 0.1483367	test: 0.3005922	best: 0.2830372 (155)	total: 27.9s	remaining: 0us

bestTest = 0.2830372

In [31]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.7 %
Precision = 78.3 %
Recall = 67.9%
F1 score = 72.8%


In [32]:
#learning rate=0.25
clf=catboost.CatBoostClassifier(learning_rate=0.25,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.4891696	test: 0.4897156	best: 0.4897156 (0)	total: 38.7ms	remaining: 38.7s
100:	learn: 0.2515604	test: 0.2838858	best: 0.2838529 (83)	total: 2.99s	remaining: 26.6s
200:	learn: 0.2293473	test: 0.2845151	best: 0.2837584 (105)	total: 5.67s	remaining: 22.5s
300:	learn: 0.2096712	test: 0.2861294	best: 0.2837584 (105)	total: 8.5s	remaining: 19.7s
400:	learn: 0.1949462	test: 0.2892737	best: 0.2837584 (105)	total: 11.3s	remaining: 16.9s
500:	learn: 0.1801893	test: 0.2930828	best: 0.2837584 (105)	total: 14.1s	remaining: 14s
600:	learn: 0.1684673	test: 0.2976701	best: 0.2837584 (105)	total: 16.9s	remaining: 11.2s
700:	learn: 0.1581764	test: 0.3020269	best: 0.2837584 (105)	total: 19.8s	remaining: 8.45s
800:	learn: 0.1478398	test: 0.3051241	best: 0.2837584 (105)	total: 22.7s	remaining: 5.63s
900:	learn: 0.1389224	test: 0.3089050	best: 0.2837584 (105)	total: 25.5s	remaining: 2.8s
999:	learn: 0.1309568	test: 0.3114515	best: 0.2837584 (105)	total: 28.3s	remaining: 0us

bestTest = 0.283758

In [33]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.6 %
Precision = 77.7 %
Recall = 68.0%
F1 score = 72.5%


A higher learning rate reaches its best iteration earlier than a small one (iteration 105 for 0.25 versus iteration 999 for 0.01). After the best iteration the validation loss rises again, which suggests that the model starts to overfit, and accuracy is slightly lower. Training time is only marginally higher for the larger learning rates (between 27 and 28.4 seconds across all runs).
* We will proceed with learning rate=0.1
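
Since the validation loss gets worse after the best iteration, training could be stopped early. CatBoost exposes an overfitting detector for this purpose (for example the `early_stopping_rounds` argument of `fit`). The helper below, `best_iteration_with_patience`, is a hypothetical pure-Python illustration of the idea, not CatBoost's own logic.

```python
def best_iteration_with_patience(eval_losses, patience):
    """Return (best_index, stop_index) for a list of validation losses.

    Training stops once the loss has not improved for `patience` evaluations.
    """
    best_index, best_loss = 0, float("inf")
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_index, best_loss = index, loss
        elif index - best_index >= patience:
            return best_index, index
    return best_index, len(eval_losses) - 1

# Validation Logloss for learning_rate=0.25, sampled every 100 iterations
# (values copied from the training log above).
losses = [0.4897156, 0.2838858, 0.2845151, 0.2861294, 0.2892737, 0.2930828]

best, stopped = best_iteration_with_patience(losses, patience=2)
print(best, stopped)   # 1 3
```

On this sampled log the best point is the second entry (iteration 100), and a patience of two evaluations would stop at the fourth entry (iteration 300) instead of running all 1000 iterations.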

[Table of Contents](#0)

<a name="8"></a>
###  **2) L2 leaf regularization**: the coefficient of the L2 penalty on leaf values, used to shrink them and avoid overfitting
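
The shrinking effect can be illustrated with a textbook form of a regularized leaf estimate, where the regularization coefficient is added to the object count in the denominator. This is a simplified sketch with a hypothetical function name and made-up gradients; CatBoost's actual leaf estimation differs in its details.

```python
def regularized_leaf_value(gradients, l2_leaf_reg):
    """Textbook L2-regularized leaf estimate: sum(g) / (n + lambda)."""
    return sum(gradients) / (len(gradients) + l2_leaf_reg)

# Hypothetical gradients of the objects that fall into one small leaf.
gradients = [0.9, 1.1, 1.0]

for reg in (0.001, 0.5, 1, 3):
    print(reg, round(regularized_leaf_value(gradients, reg), 4))
```

The larger the coefficient, the closer the leaf value is pulled toward zero, and the effect is strongest for leaves that contain few objects.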

In [34]:
#l2 leaf regularization=0.001
clf=catboost.CatBoostClassifier(learning_rate=0.1,l2_leaf_reg=0.001,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5932358	test: 0.5935547	best: 0.5935547 (0)	total: 31.5ms	remaining: 31.4s
100:	learn: 0.2636021	test: 0.2868684	best: 0.2868684 (100)	total: 2.74s	remaining: 24.4s
200:	learn: 0.2383880	test: 0.2813151	best: 0.2805174 (150)	total: 5.86s	remaining: 23.3s
300:	learn: 0.2215984	test: 0.2816939	best: 0.2805174 (150)	total: 8.65s	remaining: 20.1s
400:	learn: 0.2075431	test: 0.2822540	best: 0.2805174 (150)	total: 11.5s	remaining: 17.1s
500:	learn: 0.1948248	test: 0.2868218	best: 0.2805174 (150)	total: 14.3s	remaining: 14.2s
600:	learn: 0.1840303	test: 0.2900302	best: 0.2805174 (150)	total: 17.1s	remaining: 11.3s
700:	learn: 0.1739241	test: 0.2913301	best: 0.2805174 (150)	total: 20s	remaining: 8.53s
800:	learn: 0.1647212	test: 0.2952060	best: 0.2805174 (150)	total: 22.9s	remaining: 5.68s
900:	learn: 0.1566633	test: 0.2981615	best: 0.2805174 (150)	total: 25.7s	remaining: 2.83s
999:	learn: 0.1485252	test: 0.3013744	best: 0.2805174 (150)	total: 28.5s	remaining: 0us

bestTest = 0.280

In [35]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.9 %
Recall = 67.5%
F1 score = 72.8%


In [36]:
#l2 leaf regularization=0.5
clf=catboost.CatBoostClassifier(learning_rate=0.1,l2_leaf_reg=0.5,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5945399	test: 0.5948648	best: 0.5948648 (0)	total: 30.4ms	remaining: 30.3s
100:	learn: 0.2714608	test: 0.2885550	best: 0.2885550 (100)	total: 2.74s	remaining: 24.4s
200:	learn: 0.2498425	test: 0.2811125	best: 0.2810510 (196)	total: 5.46s	remaining: 21.7s
300:	learn: 0.2363654	test: 0.2817325	best: 0.2805849 (225)	total: 8.61s	remaining: 20s
400:	learn: 0.2250869	test: 0.2813015	best: 0.2805849 (225)	total: 11.4s	remaining: 17.1s
500:	learn: 0.2141780	test: 0.2818502	best: 0.2805849 (225)	total: 14.2s	remaining: 14.1s
600:	learn: 0.2056623	test: 0.2829574	best: 0.2805849 (225)	total: 16.9s	remaining: 11.2s
700:	learn: 0.1978667	test: 0.2841859	best: 0.2805849 (225)	total: 19.7s	remaining: 8.38s
800:	learn: 0.1897216	test: 0.2853209	best: 0.2805849 (225)	total: 22.5s	remaining: 5.59s
900:	learn: 0.1819720	test: 0.2864540	best: 0.2805849 (225)	total: 25.4s	remaining: 2.79s
999:	learn: 0.1753588	test: 0.2874456	best: 0.2805849 (225)	total: 28.2s	remaining: 0us

bestTest = 0.280

In [37]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.9 %
Recall = 68.7%
F1 score = 73.5%


In [38]:
#l2 leaf regularization=1
clf=catboost.CatBoostClassifier(learning_rate=0.1,l2_leaf_reg=1,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5947648	test: 0.5950885	best: 0.5950885 (0)	total: 30.6ms	remaining: 30.5s
100:	learn: 0.2733033	test: 0.2895401	best: 0.2894534 (95)	total: 2.75s	remaining: 24.5s
200:	learn: 0.2517651	test: 0.2822205	best: 0.2820003 (185)	total: 5.46s	remaining: 21.7s
300:	learn: 0.2392577	test: 0.2812038	best: 0.2807956 (273)	total: 8.2s	remaining: 19s
400:	learn: 0.2287838	test: 0.2818232	best: 0.2805515 (325)	total: 11.5s	remaining: 17.1s
500:	learn: 0.2186377	test: 0.2829661	best: 0.2805515 (325)	total: 14.3s	remaining: 14.2s
600:	learn: 0.2095064	test: 0.2839810	best: 0.2805515 (325)	total: 17.1s	remaining: 11.3s
700:	learn: 0.2010400	test: 0.2851927	best: 0.2805515 (325)	total: 19.8s	remaining: 8.46s
800:	learn: 0.1934301	test: 0.2860837	best: 0.2805515 (325)	total: 22.8s	remaining: 5.65s
900:	learn: 0.1868481	test: 0.2870715	best: 0.2805515 (325)	total: 25.6s	remaining: 2.81s
999:	learn: 0.1799973	test: 0.2886144	best: 0.2805515 (325)	total: 28.4s	remaining: 0us

bestTest = 0.28055

In [39]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.6 %
Recall = 69.1%
F1 score = 73.5%


`l2_leaf_reg=0.5` gives the shortest training time (28.2 s versus 28.5 s and 28.4 s), with only slight differences in accuracy (87.8% to 88.0%).
* We will proceed with l2_leaf_reg=0.5

[Table of Contents](#0)

<a name="9"></a>
###  **3) Grow policy**
* SymmetricTree: all leaves from the last tree level are split with the same condition
* Depthwise: a tree is built level by level until the specified depth is reached
* Lossguide: a tree is built leaf by leaf until the maximum number of leaves is reached

In [40]:
# SymmetricTree is the default grow policy
clf=catboost.CatBoostClassifier(learning_rate=0.1,l2_leaf_reg=0.5,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5945399	test: 0.5948648	best: 0.5948648 (0)	total: 30.2ms	remaining: 30.2s
100:	learn: 0.2714608	test: 0.2885550	best: 0.2885550 (100)	total: 2.73s	remaining: 24.4s
200:	learn: 0.2498425	test: 0.2811125	best: 0.2810510 (196)	total: 5.45s	remaining: 21.6s
300:	learn: 0.2363654	test: 0.2817325	best: 0.2805849 (225)	total: 8.2s	remaining: 19s
400:	learn: 0.2250869	test: 0.2813015	best: 0.2805849 (225)	total: 11s	remaining: 16.4s
500:	learn: 0.2141780	test: 0.2818502	best: 0.2805849 (225)	total: 14.2s	remaining: 14.2s
600:	learn: 0.2056623	test: 0.2829574	best: 0.2805849 (225)	total: 17s	remaining: 11.3s
700:	learn: 0.1978667	test: 0.2841859	best: 0.2805849 (225)	total: 19.8s	remaining: 8.44s
800:	learn: 0.1897216	test: 0.2853209	best: 0.2805849 (225)	total: 22.6s	remaining: 5.62s
900:	learn: 0.1819720	test: 0.2864540	best: 0.2805849 (225)	total: 25.5s	remaining: 2.8s
999:	learn: 0.1753588	test: 0.2874456	best: 0.2805849 (225)	total: 28.3s	remaining: 0us

bestTest = 0.280584868

In [41]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.9 %
Recall = 68.7%
F1 score = 73.5%


In [42]:
###Depthwise grow policy
clf=catboost.CatBoostClassifier(learning_rate=0.1,grow_policy='Depthwise',l2_leaf_reg=0.5,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.5998386	test: 0.6004418	best: 0.6004418 (0)	total: 23ms	remaining: 23s
100:	learn: 0.2580632	test: 0.2838573	best: 0.2838573 (100)	total: 1.83s	remaining: 16.3s
200:	learn: 0.2253977	test: 0.2833439	best: 0.2810162 (138)	total: 3.65s	remaining: 14.5s
300:	learn: 0.2000787	test: 0.2853718	best: 0.2810162 (138)	total: 5.41s	remaining: 12.6s
400:	learn: 0.1786780	test: 0.2893496	best: 0.2810162 (138)	total: 7.21s	remaining: 10.8s
500:	learn: 0.1580061	test: 0.2963995	best: 0.2810162 (138)	total: 9.03s	remaining: 9s
600:	learn: 0.1404394	test: 0.3036854	best: 0.2810162 (138)	total: 10.9s	remaining: 7.21s
700:	learn: 0.1245573	test: 0.3103634	best: 0.2810162 (138)	total: 12.7s	remaining: 5.43s
800:	learn: 0.1123588	test: 0.3154687	best: 0.2810162 (138)	total: 14.7s	remaining: 3.66s
900:	learn: 0.1000992	test: 0.3225391	best: 0.2810162 (138)	total: 16.9s	remaining: 1.85s
999:	learn: 0.0889745	test: 0.3281205	best: 0.2810162 (138)	total: 18.7s	remaining: 0us

bestTest = 0.28101622

In [43]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.7 %
Recall = 67.9%
F1 score = 72.9%


In [44]:
###Lossguide grow policy
clf=catboost.CatBoostClassifier(learning_rate=0.1,grow_policy='Lossguide',l2_leaf_reg=0.5,verbose=100,random_seed=42)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6030690	test: 0.6039207	best: 0.6039207 (0)	total: 25.4ms	remaining: 25.3s
100:	learn: 0.2657071	test: 0.2868104	best: 0.2868104 (100)	total: 2.19s	remaining: 19.5s
200:	learn: 0.2372165	test: 0.2801849	best: 0.2797425 (175)	total: 4.47s	remaining: 17.8s
300:	learn: 0.2172126	test: 0.2831768	best: 0.2797425 (175)	total: 6.63s	remaining: 15.4s
400:	learn: 0.2002546	test: 0.2849174	best: 0.2797425 (175)	total: 8.7s	remaining: 13s
500:	learn: 0.1849820	test: 0.2888861	best: 0.2797425 (175)	total: 10.8s	remaining: 10.7s
600:	learn: 0.1712273	test: 0.2918731	best: 0.2797425 (175)	total: 12.8s	remaining: 8.48s
700:	learn: 0.1587420	test: 0.2965670	best: 0.2797425 (175)	total: 14.8s	remaining: 6.3s
800:	learn: 0.1466736	test: 0.2999217	best: 0.2797425 (175)	total: 16.7s	remaining: 4.16s
900:	learn: 0.1358889	test: 0.3033201	best: 0.2797425 (175)	total: 18.7s	remaining: 2.06s
999:	learn: 0.1256977	test: 0.3074916	best: 0.2797425 (175)	total: 20.7s	remaining: 0us

bestTest = 0.27974

In [45]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.7 %
Precision = 77.9 %
Recall = 68.3%
F1 score = 72.8%


According to the CatBoost authors, symmetric trees are much faster than other tree types. In our runs, however, the symmetric policy did not yield the lowest training time, which may be because we are using a high learning rate.
We will proceed with the symmetric grow policy.
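
The start/stop timing pattern repeated in the cells above can be wrapped in a small helper so that every configuration is measured the same way. This is only a sketch: `timed_fit` and `DummyModel` are hypothetical names introduced for illustration, not part of CatBoost, and any object with a `fit` method could be passed in.

```python
import time


def timed_fit(model, *fit_args, **fit_kwargs):
    """Call model.fit(...) and return (model, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    model.fit(*fit_args, **fit_kwargs)
    elapsed = time.perf_counter() - start
    return model, elapsed


class DummyModel:
    """Hypothetical stand-in for a real estimator so the sketch runs without CatBoost."""

    def __init__(self):
        self.fitted = False

    def fit(self, X, y, **kwargs):
        self.fitted = True
        return self


model, seconds = timed_fit(DummyModel(), [[0], [1]], [0, 1])
print(f"Model fitted in {round(seconds, 2)} seconds")
```

With the notebook's data, the same helper could be called as `timed_fit(clf, X_train, y_train, cat_features=[6,7,8,9,10,11,12,13], eval_set=(X_val, y_val))`.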

[Table of Contents](#0)

<a name="10"></a>

### **4) Bootstrap_type**: it determines the weights of examples when splitting the tree. Supported bootstrap types are:
* Bayesian
* MVS: Minimum Variance Sampling
* Poisson `supported for GPU only`
* Bernoulli
* or no sampling: equal weights for examples
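
To make "weights of examples" concrete, here is a toy sketch of how per-example weights could be drawn for three of the options. It is an illustration under assumptions, not CatBoost's implementation: the Bayesian rule `(-log(u)) ** bagging_temperature` follows the description in the CatBoost documentation, MVS and Poisson are left out, and `bootstrap_weights` is a hypothetical helper name.

```python
import math
import random


def bootstrap_weights(n, bootstrap_type, subsample=0.8, bagging_temperature=1.0, seed=42):
    """Toy per-example weights for one boosting iteration (illustration only)."""
    rng = random.Random(seed)
    if bootstrap_type == "No":
        # No sampling: every example has the same weight.
        return [1.0] * n
    if bootstrap_type == "Bernoulli":
        # Each example is kept with probability `subsample`, otherwise dropped.
        return [1.0 if rng.random() < subsample else 0.0 for _ in range(n)]
    if bootstrap_type == "Bayesian":
        # Weight = (-log(u)) ** t with u uniform on (0, 1]; t = 0 gives all weights 1.
        return [(-math.log(1.0 - rng.random())) ** bagging_temperature for _ in range(n)]
    raise ValueError(f"unsupported bootstrap_type in this sketch: {bootstrap_type}")


for kind in ("No", "Bernoulli", "Bayesian"):
    print(kind, [round(w, 2) for w in bootstrap_weights(5, kind)])
```

With `bagging_temperature=0` the Bayesian weights collapse to 1, which matches the "no sampling" case.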

Bayesian

In [46]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,random_seed=42,l2_leaf_reg=0.5,bootstrap_type='Bayesian',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6644727	test: 0.6647705	best: 0.6647705 (0)	total: 24.3ms	remaining: 24.3s
100:	learn: 0.3001741	test: 0.3047258	best: 0.3047258 (100)	total: 2.73s	remaining: 24.3s
200:	learn: 0.2851763	test: 0.2956241	best: 0.2956241 (200)	total: 5.43s	remaining: 21.6s
300:	learn: 0.2760543	test: 0.2916553	best: 0.2916553 (300)	total: 8.57s	remaining: 19.9s
400:	learn: 0.2675631	test: 0.2882005	best: 0.2881894 (399)	total: 11.3s	remaining: 16.9s
500:	learn: 0.2602447	test: 0.2855096	best: 0.2854755 (485)	total: 14.1s	remaining: 14.1s
600:	learn: 0.2548030	test: 0.2849942	best: 0.2849369 (589)	total: 16.9s	remaining: 11.2s
700:	learn: 0.2495963	test: 0.2841204	best: 0.2840963 (699)	total: 19.7s	remaining: 8.38s
800:	learn: 0.2449276	test: 0.2838492	best: 0.2838052 (799)	total: 22.4s	remaining: 5.56s
900:	learn: 0.2407378	test: 0.2834190	best: 0.2833793 (891)	total: 25.2s	remaining: 2.77s
999:	learn: 0.2367194	test: 0.2837893	best: 0.2833793 (891)	total: 27.9s	remaining: 0us

bestTest = 0.2

In [47]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.9 %
Precision = 78.3 %
Recall = 68.8%
F1 score = 73.3%


MVS

In [48]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,random_seed=42,l2_leaf_reg=0.5,bootstrap_type='MVS',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6610685	test: 0.6611818	best: 0.6611818 (0)	total: 30.7ms	remaining: 30.7s
100:	learn: 0.3000568	test: 0.3057228	best: 0.3057228 (100)	total: 2.72s	remaining: 24.2s
200:	learn: 0.2833762	test: 0.2949945	best: 0.2949945 (200)	total: 5.5s	remaining: 21.9s
300:	learn: 0.2738627	test: 0.2898677	best: 0.2898622 (299)	total: 8.06s	remaining: 18.7s
400:	learn: 0.2647084	test: 0.2850429	best: 0.2850119 (399)	total: 11.1s	remaining: 16.6s
500:	learn: 0.2579450	test: 0.2825031	best: 0.2824672 (498)	total: 13.8s	remaining: 13.8s
600:	learn: 0.2527986	test: 0.2821027	best: 0.2821027 (600)	total: 16.7s	remaining: 11.1s
700:	learn: 0.2482473	test: 0.2819458	best: 0.2818844 (684)	total: 19.4s	remaining: 8.27s
800:	learn: 0.2441327	test: 0.2816757	best: 0.2816521 (796)	total: 22.1s	remaining: 5.49s
900:	learn: 0.2400126	test: 0.2815939	best: 0.2814782 (830)	total: 24.9s	remaining: 2.73s
999:	learn: 0.2365537	test: 0.2813364	best: 0.2813354 (998)	total: 27.6s	remaining: 0us

bestTest = 0.28

In [49]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.7 %
Recall = 68.9%
F1 score = 73.5%


Almost the same performance as Bayesian sampling with slightly less training time (27.6 s versus 27.9 s in the logs above), which is consistent with the CatBoost documentation's statement that MVS helps speed up training.

Bernoulli

In [50]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,random_seed=42,l2_leaf_reg=0.5,bootstrap_type='Bernoulli',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6646120	test: 0.6646486	best: 0.6646486 (0)	total: 24.8ms	remaining: 24.8s
100:	learn: 0.3010294	test: 0.3061794	best: 0.3061794 (100)	total: 2.55s	remaining: 22.7s
200:	learn: 0.2854213	test: 0.2961021	best: 0.2961021 (200)	total: 5.03s	remaining: 20s
300:	learn: 0.2756031	test: 0.2911594	best: 0.2911594 (300)	total: 7.53s	remaining: 17.5s
400:	learn: 0.2664717	test: 0.2868016	best: 0.2868016 (400)	total: 10.1s	remaining: 15.1s
500:	learn: 0.2590339	test: 0.2844889	best: 0.2844889 (500)	total: 13s	remaining: 13s
600:	learn: 0.2533698	test: 0.2834908	best: 0.2834264 (576)	total: 15.9s	remaining: 10.5s
700:	learn: 0.2481724	test: 0.2834829	best: 0.2832914 (674)	total: 18.5s	remaining: 7.91s
800:	learn: 0.2437009	test: 0.2830142	best: 0.2830142 (800)	total: 21.1s	remaining: 5.25s
900:	learn: 0.2393162	test: 0.2833470	best: 0.2828809 (811)	total: 23.7s	remaining: 2.6s
999:	learn: 0.2350456	test: 0.2831425	best: 0.2828809 (811)	total: 26.3s	remaining: 0us

bestTest = 0.28288089

In [51]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.9 %
Precision = 78.9 %
Recall = 68.1%
F1 score = 73.1%


Poisson

<font color='red'>**Warning**</font> These two cells won't run without changing runtime to `GPU`
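
One way to keep the notebook runnable on both runtimes is to pick the task type at run time. The sketch below assumes `catboost.utils.get_gpu_device_count` is available in the installed CatBoost version; `pick_task_type` is a hypothetical helper, and falling back to Bernoulli on CPU is an arbitrary choice for illustration.

```python
def pick_task_type():
    """Return "GPU" if CatBoost reports at least one GPU device, else "CPU"."""
    try:
        from catboost.utils import get_gpu_device_count
    except ImportError:
        # CatBoost is not installed; fall back to CPU settings.
        return "CPU"
    return "GPU" if get_gpu_device_count() > 0 else "CPU"


task_type = pick_task_type()
# Poisson bootstrap is GPU-only, so use Bernoulli as a CPU fallback (an arbitrary choice).
bootstrap_type = "Poisson" if task_type == "GPU" else "Bernoulli"
params = {"task_type": task_type, "bootstrap_type": bootstrap_type}
print(params)
```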


In [52]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,task_type='GPU',random_seed=42,l2_leaf_reg=0.5,bootstrap_type='Poisson',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6586441	test: 0.6589261	best: 0.6589261 (0)	total: 174ms	remaining: 2m 54s
100:	learn: 0.2970747	test: 0.3011471	best: 0.3011471 (100)	total: 4.87s	remaining: 43.4s
200:	learn: 0.2835160	test: 0.2913983	best: 0.2913983 (200)	total: 9.56s	remaining: 38s
300:	learn: 0.2776661	test: 0.2881637	best: 0.2881637 (300)	total: 14.2s	remaining: 32.9s
400:	learn: 0.2717731	test: 0.2847356	best: 0.2847348 (396)	total: 19s	remaining: 28.3s
500:	learn: 0.2665052	test: 0.2820009	best: 0.2820009 (500)	total: 23.5s	remaining: 23.4s
600:	learn: 0.2637146	test: 0.2810769	best: 0.2810769 (600)	total: 28s	remaining: 18.6s
700:	learn: 0.2616333	test: 0.2804653	best: 0.2804653 (700)	total: 32.6s	remaining: 13.9s
800:	learn: 0.2595100	test: 0.2797864	best: 0.2797847 (797)	total: 37.2s	remaining: 9.25s
900:	learn: 0.2573568	test: 0.2794165	best: 0.2794165 (900)	total: 42s	remaining: 4.61s
999:	learn: 0.2555900	test: 0.2790321	best: 0.2790152 (983)	total: 46.6s	remaining: 0us
bestTest = 0.2790151527

In [53]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.9 %
Recall = 68.6%
F1 score = 73.4%


No sampling

In [54]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,l2_leaf_reg=0.5,random_seed=42,bootstrap_type='No',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6620492	test: 0.6623045	best: 0.6623045 (0)	total: 58.5ms	remaining: 58.5s
100:	learn: 0.2990343	test: 0.3036509	best: 0.3036509 (100)	total: 3.28s	remaining: 29.2s
200:	learn: 0.2830532	test: 0.2937504	best: 0.2937504 (200)	total: 5.93s	remaining: 23.6s
300:	learn: 0.2738061	test: 0.2895126	best: 0.2895126 (300)	total: 8.55s	remaining: 19.9s
400:	learn: 0.2641470	test: 0.2846327	best: 0.2846327 (400)	total: 11.3s	remaining: 16.8s
500:	learn: 0.2574697	test: 0.2819358	best: 0.2819358 (500)	total: 14s	remaining: 13.9s
600:	learn: 0.2526584	test: 0.2816632	best: 0.2816566 (598)	total: 16.8s	remaining: 11.1s
700:	learn: 0.2475642	test: 0.2807430	best: 0.2807037 (691)	total: 19.5s	remaining: 8.33s
800:	learn: 0.2432037	test: 0.2807900	best: 0.2805409 (737)	total: 22.3s	remaining: 5.54s
900:	learn: 0.2392862	test: 0.2805608	best: 0.2805409 (737)	total: 25.1s	remaining: 2.75s
999:	learn: 0.2355294	test: 0.2805596	best: 0.2804093 (934)	total: 27.8s	remaining: 0us

bestTest = 0.280

In [55]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.9 %
Precision = 78.5 %
Recall = 68.7%
F1 score = 73.3%


We will proceed with learning rate = 0.03, as it gives less training time and allows for comparison of different parameters, and let the library choose the bootstrap type based on other parameters such as objective, task_type, bagging_temperature and sampling_unit.
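
To see why the choice of bootstrap type matters little here, the figures reported in the logs above can be put side by side. The numbers below are copied from those logs (reported total time, best validation logloss, and test F1). The Poisson run used the GPU, so its time is not directly comparable with the CPU runs.

```python
# (bootstrap_type, total_seconds, best_validation_logloss, test_f1_percent),
# copied from the training logs and evaluation outputs above.
runs = [
    ("Bayesian", 27.9, 0.2833793, 73.3),
    ("MVS", 27.6, 0.2813354, 73.5),
    ("Bernoulli", 26.3, 0.2828809, 73.1),
    ("Poisson (GPU)", 46.6, 0.2790152, 73.4),
    ("No", 27.8, 0.2804093, 73.3),
]

fastest = min(runs, key=lambda r: r[1])
best_f1 = max(runs, key=lambda r: r[3])
f1_spread = round(max(r[3] for r in runs) - min(r[3] for r in runs), 1)

# Print the runs from fastest to slowest.
for name, seconds, logloss, f1 in sorted(runs, key=lambda r: r[1]):
    print(f"{name:<14} {seconds:>5.1f}s  logloss={logloss:.4f}  F1={f1:.1f}%")
print(f"fastest: {fastest[0]}, best F1: {best_f1[0]}, F1 spread: {f1_spread} points")
```

The F1 scores differ by less than half a percentage point across all five options, which supports leaving the decision to the library.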

[Table of Contents](#0)

<a name="11"></a>
### **5) Boosting type:**
* Ordered : Usually provides better quality on small datasets, but it may be slower than the Plain scheme
* Plain : The classic gradient boosting scheme
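
The difference between the two schemes can be illustrated with a toy "model" that is just a mean of targets. This is a deliberately simplified sketch, not CatBoost's algorithm: `plain_estimates`, `ordered_estimates`, and the `prior` value are assumptions made for illustration.

```python
import random


def plain_estimates(targets):
    """Plain scheme: every example is scored by a model fitted on all examples."""
    overall_mean = sum(targets) / len(targets)
    return [overall_mean] * len(targets)


def ordered_estimates(targets, prior=0.5, seed=42):
    """Ordered scheme: example i is scored using only the examples that precede it
    in a random permutation, so its own target never leaks into its estimate."""
    order = list(range(len(targets)))
    random.Random(seed).shuffle(order)
    estimates = [0.0] * len(targets)
    running_sum, seen = 0.0, 0
    for idx in order:
        # Use the prior until at least one earlier example is available.
        estimates[idx] = running_sum / seen if seen else prior
        running_sum += targets[idx]
        seen += 1
    return estimates


y = [1, 0, 1, 1, 0, 1]
print(plain_estimates(y))
print(ordered_estimates(y))
```

Because each ordered estimate ignores the example's own target, changing that target leaves its estimate unchanged. The extra bookkeeping is also a hint at why the Ordered scheme tends to be slower.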

Ordered

In [56]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,l2_leaf_reg=0.5,random_seed=42,boosting_type='Ordered',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6644005	test: 0.6647124	best: 0.6647124 (0)	total: 39.9ms	remaining: 39.9s
100:	learn: 0.3026575	test: 0.3066320	best: 0.3066320 (100)	total: 5.32s	remaining: 47.3s
200:	learn: 0.2879803	test: 0.2952198	best: 0.2952198 (200)	total: 9.79s	remaining: 38.9s
300:	learn: 0.2793832	test: 0.2895103	best: 0.2895103 (300)	total: 14.2s	remaining: 32.9s
400:	learn: 0.2712666	test: 0.2853230	best: 0.2853230 (400)	total: 18.9s	remaining: 28.2s
500:	learn: 0.2647037	test: 0.2825007	best: 0.2824947 (499)	total: 24s	remaining: 23.9s
600:	learn: 0.2605388	test: 0.2808099	best: 0.2808011 (597)	total: 28.8s	remaining: 19.1s
700:	learn: 0.2565735	test: 0.2791247	best: 0.2791247 (700)	total: 33.9s	remaining: 14.5s
800:	learn: 0.2533100	test: 0.2787956	best: 0.2786192 (758)	total: 39.4s	remaining: 9.79s
900:	learn: 0.2507434	test: 0.2785507	best: 0.2784410 (889)	total: 44.3s	remaining: 4.87s
999:	learn: 0.2483175	test: 0.2784267	best: 0.2783479 (975)	total: 49.1s	remaining: 0us

bestTest = 0.278

In [57]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 79.2 %
Recall = 68.0%
F1 score = 73.2%


Plain

In [58]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6610685	test: 0.6611818	best: 0.6611818 (0)	total: 30.4ms	remaining: 30.4s
100:	learn: 0.3000568	test: 0.3057228	best: 0.3057228 (100)	total: 2.76s	remaining: 24.6s
200:	learn: 0.2833762	test: 0.2949945	best: 0.2949945 (200)	total: 5.49s	remaining: 21.8s
300:	learn: 0.2738627	test: 0.2898677	best: 0.2898622 (299)	total: 8.07s	remaining: 18.7s
400:	learn: 0.2647084	test: 0.2850429	best: 0.2850119 (399)	total: 10.7s	remaining: 16s
500:	learn: 0.2579450	test: 0.2825031	best: 0.2824672 (498)	total: 13.5s	remaining: 13.4s
600:	learn: 0.2527986	test: 0.2821027	best: 0.2821027 (600)	total: 16.2s	remaining: 10.8s
700:	learn: 0.2482473	test: 0.2819458	best: 0.2818844 (684)	total: 19.4s	remaining: 8.28s
800:	learn: 0.2441327	test: 0.2816757	best: 0.2816521 (796)	total: 22.2s	remaining: 5.52s
900:	learn: 0.2400126	test: 0.2815939	best: 0.2814782 (830)	total: 25s	remaining: 2.74s
999:	learn: 0.2365537	test: 0.2813364	best: 0.2813354 (998)	total: 27.7s	remaining: 0us

bestTest = 0.28133

In [59]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 88.0 %
Precision = 78.7 %
Recall = 68.9%
F1 score = 73.5%


Plain is much faster to train than Ordered (27.7 s versus 49.1 s in the logs above), with nearly identical metrics.
We will proceed with Plain boosting.

[Table of Contents](#0)

<a name="12"></a>
### **6) max_leaves:** The maximum number of leaves in the resulting tree. Can be used only with the `Lossguide` growing policy
`It is not recommended to use values greater than 64, since it can significantly slow down the training process`

max_leaves=31

In [60]:
# Default max leaves=31
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6642644	test: 0.6645452	best: 0.6645452 (0)	total: 26.1ms	remaining: 26.1s
100:	learn: 0.2957062	test: 0.3025316	best: 0.3025316 (100)	total: 2.15s	remaining: 19.2s
200:	learn: 0.2758330	test: 0.2891300	best: 0.2891300 (200)	total: 4.34s	remaining: 17.2s
300:	learn: 0.2668057	test: 0.2849635	best: 0.2849635 (300)	total: 6.52s	remaining: 15.1s
400:	learn: 0.2570405	test: 0.2812742	best: 0.2812742 (400)	total: 8.69s	remaining: 13s
500:	learn: 0.2473518	test: 0.2800287	best: 0.2798491 (457)	total: 10.9s	remaining: 10.9s
600:	learn: 0.2399192	test: 0.2797182	best: 0.2795666 (592)	total: 13.2s	remaining: 8.77s
700:	learn: 0.2333106	test: 0.2799538	best: 0.2795666 (592)	total: 15.4s	remaining: 6.58s
800:	learn: 0.2269905	test: 0.2800177	best: 0.2795666 (592)	total: 17.6s	remaining: 4.38s
900:	learn: 0.2211398	test: 0.2805204	best: 0.2795666 (592)	total: 19.7s	remaining: 2.17s
999:	learn: 0.2159195	test: 0.2812017	best: 0.2795666 (592)	total: 22.1s	remaining: 0us

bestTest = 0.279

In [61]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.0 %
Recall = 68.7%
F1 score = 73.1%


max_leaves=64

In [62]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',max_leaves=64,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6628212	test: 0.6632060	best: 0.6632060 (0)	total: 29.2ms	remaining: 29.1s
100:	learn: 0.2904373	test: 0.3011205	best: 0.3011205 (100)	total: 2.58s	remaining: 23s
200:	learn: 0.2711805	test: 0.2892533	best: 0.2892533 (200)	total: 5.03s	remaining: 20s
300:	learn: 0.2612391	test: 0.2845535	best: 0.2845535 (300)	total: 7.49s	remaining: 17.4s
400:	learn: 0.2491946	test: 0.2813289	best: 0.2813289 (400)	total: 10.2s	remaining: 15.3s
500:	learn: 0.2395089	test: 0.2811767	best: 0.2809971 (459)	total: 12.8s	remaining: 12.8s
600:	learn: 0.2312081	test: 0.2810906	best: 0.2809922 (595)	total: 15.2s	remaining: 10.1s
700:	learn: 0.2225224	test: 0.2817069	best: 0.2808388 (623)	total: 17.7s	remaining: 7.56s
800:	learn: 0.2147212	test: 0.2827520	best: 0.2808388 (623)	total: 20.3s	remaining: 5.04s
900:	learn: 0.2092678	test: 0.2837313	best: 0.2808388 (623)	total: 22.9s	remaining: 2.51s
999:	learn: 0.2024351	test: 0.2846939	best: 0.2808388 (623)	total: 25.6s	remaining: 0us

bestTest = 0.28083

In [63]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.3 %
Recall = 68.5%
F1 score = 73.0%


max_leaves=86

In [64]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',max_leaves=86,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6628212	test: 0.6632060	best: 0.6632060 (0)	total: 29.6ms	remaining: 29.6s
100:	learn: 0.2904373	test: 0.3011205	best: 0.3011205 (100)	total: 2.72s	remaining: 24.2s
200:	learn: 0.2711805	test: 0.2892533	best: 0.2892533 (200)	total: 5.81s	remaining: 23.1s
300:	learn: 0.2612391	test: 0.2845535	best: 0.2845535 (300)	total: 8.41s	remaining: 19.5s
400:	learn: 0.2491946	test: 0.2813289	best: 0.2813289 (400)	total: 11.1s	remaining: 16.5s
500:	learn: 0.2395089	test: 0.2811767	best: 0.2809971 (459)	total: 13.6s	remaining: 13.6s
600:	learn: 0.2312081	test: 0.2810906	best: 0.2809922 (595)	total: 16.3s	remaining: 10.8s
700:	learn: 0.2225224	test: 0.2817069	best: 0.2808388 (623)	total: 18.9s	remaining: 8.07s
800:	learn: 0.2147212	test: 0.2827520	best: 0.2808388 (623)	total: 21.6s	remaining: 5.36s
900:	learn: 0.2092678	test: 0.2837313	best: 0.2808388 (623)	total: 24.3s	remaining: 2.67s
999:	learn: 0.2024351	test: 0.2846939	best: 0.2808388 (623)	total: 27s	remaining: 0us

bestTest = 0.280

In [65]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.3 %
Recall = 68.5%
F1 score = 73.0%
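
The training logs for max_leaves=64 and max_leaves=86 are identical line by line. A likely explanation, offered here as an assumption rather than a verified fact, is that tree depth also limits the leaf count: a binary tree of depth `d` has at most `2 ** d` leaves, and `depth` was not set in the cells above. The sketch assumes the documented default depth of 6; `effective_max_leaves` is a hypothetical helper.

```python
def effective_max_leaves(max_leaves, depth):
    """Upper bound on leaves of a binary tree limited by both max_leaves and depth."""
    return min(max_leaves, 2 ** depth)


# Assumption: `depth` was left at its default, taken here to be 6.
ASSUMED_DEPTH = 6
for requested in (31, 64, 86):
    print(requested, "->", effective_max_leaves(requested, ASSUMED_DEPTH))
```

Under that assumption, raising `max_leaves` beyond 64 only has an effect if `depth` is raised as well.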


[Table of Contents](#0)

<a name="13"></a>
### **7) min_data_in_leaf:**
The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with samples count less than the specified value.
Can be used only with the `Lossguide` and `Depthwise` growing policies.
Default is 1
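
The rule above can be sketched with a toy tree that always splits a node into two halves and stops searching for splits once a node holds fewer than `min_data_in_leaf` samples. This is an illustration of the stated rule only, not CatBoost's split search; `count_leaves` is a hypothetical helper.

```python
def count_leaves(n_samples, min_data_in_leaf, max_depth, depth=0):
    """Count leaves of a toy tree that always splits a node into two halves."""
    too_deep = depth >= max_depth
    too_small = n_samples < min_data_in_leaf or n_samples < 2
    if too_deep or too_small:
        return 1  # no further split is searched: this node stays a leaf
    left = n_samples // 2
    right = n_samples - left
    return (count_leaves(left, min_data_in_leaf, max_depth, depth + 1)
            + count_leaves(right, min_data_in_leaf, max_depth, depth + 1))


counts = [count_leaves(100, min_data, max_depth=6) for min_data in (1, 10, 20, 40)]
print(counts)
```

Larger values stop the tree from growing earlier, which is why the parameter acts as a form of regularization.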

min_data_in_leaf=10

In [66]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',min_data_in_leaf=10,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6657995	test: 0.6661344	best: 0.6661344 (0)	total: 28ms	remaining: 28s
100:	learn: 0.2950147	test: 0.3028651	best: 0.3028651 (100)	total: 2.44s	remaining: 21.7s
200:	learn: 0.2751650	test: 0.2892169	best: 0.2892169 (200)	total: 4.81s	remaining: 19.1s
300:	learn: 0.2657880	test: 0.2848729	best: 0.2848729 (300)	total: 7.34s	remaining: 17s
400:	learn: 0.2565651	test: 0.2816344	best: 0.2816344 (400)	total: 10s	remaining: 14.9s
500:	learn: 0.2480669	test: 0.2809313	best: 0.2805178 (486)	total: 12.3s	remaining: 12.3s
600:	learn: 0.2407037	test: 0.2808478	best: 0.2805178 (486)	total: 14.8s	remaining: 9.79s
700:	learn: 0.2344547	test: 0.2813783	best: 0.2805178 (486)	total: 17.2s	remaining: 7.32s
800:	learn: 0.2286862	test: 0.2820163	best: 0.2805178 (486)	total: 19.5s	remaining: 4.84s
900:	learn: 0.2228070	test: 0.2819085	best: 0.2805178 (486)	total: 21.8s	remaining: 2.4s
999:	learn: 0.2170052	test: 0.2827175	best: 0.2805178 (486)	total: 24.1s	remaining: 0us

bestTest = 0.2805177836

In [67]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.5 %
Recall = 68.3%
F1 score = 73.0%


min_data_in_leaf=20

In [68]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',min_data_in_leaf=20,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6655069	test: 0.6658067	best: 0.6658067 (0)	total: 26.6ms	remaining: 26.6s
100:	learn: 0.2942144	test: 0.3023251	best: 0.3023251 (100)	total: 2.56s	remaining: 22.8s
200:	learn: 0.2745777	test: 0.2886031	best: 0.2886031 (200)	total: 4.99s	remaining: 19.8s
300:	learn: 0.2655649	test: 0.2847957	best: 0.2847681 (299)	total: 7.36s	remaining: 17.1s
400:	learn: 0.2559411	test: 0.2812536	best: 0.2812078 (394)	total: 9.71s	remaining: 14.5s
500:	learn: 0.2490497	test: 0.2807982	best: 0.2806173 (492)	total: 12.1s	remaining: 12s
600:	learn: 0.2418929	test: 0.2811279	best: 0.2806173 (492)	total: 14.5s	remaining: 9.65s
700:	learn: 0.2354370	test: 0.2814163	best: 0.2806173 (492)	total: 17.1s	remaining: 7.28s
800:	learn: 0.2301853	test: 0.2821071	best: 0.2806173 (492)	total: 19.3s	remaining: 4.8s
900:	learn: 0.2251260	test: 0.2818399	best: 0.2806173 (492)	total: 21.6s	remaining: 2.37s
999:	learn: 0.2197045	test: 0.2821172	best: 0.2806173 (492)	total: 23.8s	remaining: 0us

bestTest = 0.2806

In [69]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.8 %
Precision = 78.5 %
Recall = 68.4%
F1 score = 73.1%


min_data_in_leaf=40

In [70]:
clf=catboost.CatBoostClassifier(learning_rate=0.03,grow_policy='Lossguide',min_data_in_leaf=40,l2_leaf_reg=0.5,random_seed=42,boosting_type='Plain',verbose=100)
start=time.time()
clf.fit(X_train,y_train,cat_features=[6,7,8,9,10,11,12,13],eval_set=(X_val,y_val))
end=time.time()
training_time=end-start
print(f'Model fitted in {round(training_time,2)} seconds')

0:	learn: 0.6654499	test: 0.6657779	best: 0.6657779 (0)	total: 26.6ms	remaining: 26.5s
100:	learn: 0.2943867	test: 0.3027108	best: 0.3027108 (100)	total: 2.35s	remaining: 20.9s
200:	learn: 0.2744083	test: 0.2885313	best: 0.2885313 (200)	total: 4.78s	remaining: 19s
300:	learn: 0.2661861	test: 0.2846455	best: 0.2846455 (300)	total: 7.21s	remaining: 16.8s
400:	learn: 0.2569442	test: 0.2815728	best: 0.2815447 (398)	total: 9.69s	remaining: 14.5s
500:	learn: 0.2501471	test: 0.2806881	best: 0.2806707 (498)	total: 12.1s	remaining: 12s
600:	learn: 0.2444923	test: 0.2804404	best: 0.2803533 (585)	total: 14.2s	remaining: 9.45s
700:	learn: 0.2389646	test: 0.2804974	best: 0.2803533 (585)	total: 16.5s	remaining: 7.03s
800:	learn: 0.2338276	test: 0.2807935	best: 0.2803533 (585)	total: 18.7s	remaining: 4.66s
900:	learn: 0.2289832	test: 0.2811315	best: 0.2803533 (585)	total: 21s	remaining: 2.3s
999:	learn: 0.2248859	test: 0.2814338	best: 0.2803533 (585)	total: 23.7s	remaining: 0us

bestTest = 0.28035326

In [71]:
y_pred=clf.predict(X_test)
evaluate_model(y_test,y_pred)

Accuracy = 87.9 %
Precision = 78.6 %
Recall = 68.7%
F1 score = 73.3%


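Generally, larger values of `min_data_in_leaf` reduce training time slightly (24.1 s, 23.8 s and 23.7 s for 10, 20 and 40), although in these runs the differences are small and the default `Lossguide` run shown earlier (22.1 s) was faster still.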

[Table of Contents](#0)

<a name="14"></a>
### **8) nan_mode:**
* Forbidden : Missing values are not supported, their presence is interpreted as an error.

* Min (Default): Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

* Max : Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
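
Since the dataset used here offers no missing values to experiment with, the three modes can be illustrated on a tiny list. The sketch uses negative and positive infinity as stand-ins for "less than all other values" and "greater than all other values"; this is a simplification for illustration, and `apply_nan_mode` is a hypothetical helper, not a CatBoost function.

```python
import math


def apply_nan_mode(values, nan_mode="Min"):
    """Replace NaNs so they sort below (Min) or above (Max) every real value."""
    has_nan = any(math.isnan(v) for v in values)
    if nan_mode == "Forbidden":
        if has_nan:
            raise ValueError("missing values are not allowed with nan_mode='Forbidden'")
        return list(values)
    if nan_mode == "Min":
        fill = -math.inf
    elif nan_mode == "Max":
        fill = math.inf
    else:
        raise ValueError(f"unknown nan_mode: {nan_mode}")
    return [fill if math.isnan(v) else v for v in values]


ages = [39.0, math.nan, 50.0, 23.0]
print(sorted(apply_nan_mode(ages, "Min")))
print(sorted(apply_nan_mode(ages, "Max")))
```

In both modes a single threshold can separate the missing values from all the others, which is the guarantee described in the list above.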

____________________________________________________________________________________
The Adult dataset used here doesn't contain any missing values, so this parameter could not be tried.

**Note that** you can plot the change in logloss, accuracy, etc. during training for both the test (validation) and train sets, but this is available only if you are using `Jupyter Notebook` or `TensorBoard` and requires additional packages;
see [Catboost Visualization](https://catboost.ai/en/docs/features/visualization) for more details
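
When neither tool is available, a lightweight alternative is to parse the verbose training output and plot it with any tool. The sketch below assumes the log line format shown in this notebook, which may differ between CatBoost versions; `parse_log_line` is a hypothetical helper.

```python
import re

# Matches lines such as "200:  learn: 0.22  test: 0.28  best: 0.28 (138)  total: ...".
LOG_LINE = re.compile(
    r"^(?P<iteration>\d+):\s+learn:\s+(?P<learn>[\d.]+)\s+test:\s+(?P<test>[\d.]+)"
    r"\s+best:\s+(?P<best>[\d.]+)\s+\((?P<best_iteration>\d+)\)"
)


def parse_log_line(line):
    """Parse one verbose training line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    return {
        "iteration": int(row["iteration"]),
        "learn": float(row["learn"]),
        "test": float(row["test"]),
        "best": float(row["best"]),
        "best_iteration": int(row["best_iteration"]),
    }


sample = "200:\tlearn: 0.2253977\ttest: 0.2833439\tbest: 0.2810162 (138)\ttotal: 3.65s\tremaining: 14.5s"
print(parse_log_line(sample))
```

Collecting the `learn` and `test` values per iteration gives the two curves that the interactive plot would show.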

The **CatBoost** library has great capabilities and many hyperparameters to tune; see [Catboost training parameters](https://catboost.ai/en/docs/references/training-parameters/common) for more details

[Table of Contents](#0)

Generally speaking:
* With few iterations, use a higher learning rate than with many iterations (a lower learning rate needs more iterations to converge); note that CatBoost can determine a learning rate automatically
* In our case `l2_leaf_reg` was best at 0.5; we recommend trying different values to find the best one
* Symmetric trees are much faster than other grow policies in theory, but that is not always the case in practice
* The Plain boosting type is much faster than the Ordered one
* You can let CatBoost determine the bootstrap type automatically
* For non-symmetric trees (e.g. `Lossguide`) you can use `min_data_in_leaf` and `max_leaves` as a kind of regularization
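
The choices made throughout this section, together with the parameter restrictions noted along the way, can be collected in one place. This is a minimal sketch: `check_params` is a hypothetical helper that only encodes the restrictions stated in this notebook, not CatBoost's own validation.

```python
def check_params(params):
    """Raise ValueError if a parameter combination contradicts the notes in this notebook."""
    grow_policy = params.get("grow_policy", "SymmetricTree")
    task_type = params.get("task_type", "CPU")
    if "max_leaves" in params and grow_policy != "Lossguide":
        raise ValueError("max_leaves can be used only with grow_policy='Lossguide'")
    if "min_data_in_leaf" in params and grow_policy not in ("Lossguide", "Depthwise"):
        raise ValueError("min_data_in_leaf needs grow_policy 'Lossguide' or 'Depthwise'")
    if params.get("bootstrap_type") == "Poisson" and task_type != "GPU":
        raise ValueError("bootstrap_type='Poisson' is supported on GPU only")
    return params


# Settings the notebook converged on; bootstrap_type is left out so the library chooses it.
final_params = check_params({
    "learning_rate": 0.03,
    "l2_leaf_reg": 0.5,
    "grow_policy": "SymmetricTree",
    "boosting_type": "Plain",
    "random_seed": 42,
    "verbose": 100,
})
print(final_params)
```

The resulting dictionary could then be passed on as `catboost.CatBoostClassifier(**final_params)`.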

<a name="15"></a>
## **Limitations**

1.   Interpretability: Because categorical features are encoded internally (with target statistics rather than one column per level), it can be harder to interpret the importance of individual categorical levels.
2.   Performance with Homogeneous Data: Research has indicated that while CatBoost excels with heterogeneous data (data containing features of different types), it may not be the optimal choice for homogeneous data (data where all features are of the same type). Studies comparing gradient-boosted machine learning algorithms with deep learning algorithms on homogeneous data, such as digital image data, found that deep learning algorithms performed better in terms of accuracy and Area Under the Receiver Operating Characteristic Curve (AUC). This suggests that CatBoost, being a gradient-boosted decision tree algorithm, is better suited for problems involving heterogeneous data rather than homogeneous data.
3.  Speed: On CPU, CatBoost can be slow compared to other gradient-boosting algorithms on datasets that have a small number of features.


[Table of Contents](#0)

<a name="16"></a>
## **Optimal Use Cases**
CatBoost is best used with:
1. Categorical Data: CatBoost is particularly effective when dealing with datasets containing categorical features, as it can naturally handle them without the need for one-hot encoding.
2. Large Datasets: Due to its efficient implementation and parallelization capabilities, CatBoost performs well on large datasets.

CatBoost is a versatile machine-learning model with several popular use cases including:

1. In recommendation systems, it's employed to suggest products or content tailored to user preferences.
2. For fraud detection, it plays a crucial role in identifying fraudulent activities, especially in the financial sector.
3. In image classification, CatBoost can categorize images into different groups based on content, although, as noted under Limitations, deep learning models tend to perform better on such homogeneous data.
4. In text classification, the model is adept at sorting text into predefined categories, which is particularly useful in sentiment analysis.
5. In customer churn prediction, it predicts the likelihood of customers discontinuing the use of a service or product.
6. In the field of medical diagnoses, CatBoost assists healthcare professionals by helping to diagnose diseases from complex data sets.
7. In natural language processing, the model is utilized for understanding and processing human language, which has a wide range of applications.

[Table of Contents](#0)

<a name="17"></a>
## **Parallelization Capabilities**

1. Training and Prediction: CatBoost efficiently parallelizes both training and prediction processes, making it well-suited for multi-core processors and distributed computing environments.
2. Efficiency Impact: The parallelization capabilities contribute to faster training times, especially beneficial for large datasets and complex models.
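
On a single machine, the degree of parallelism is typically controlled through the `thread_count` training parameter. The sketch below only builds the parameter value from the number of available cores; leaving one core free is an arbitrary choice made for illustration.

```python
import os

# os.cpu_count() can return None on some platforms, so fall back to a single thread.
available_cores = os.cpu_count() or 1

# Leave one core free for the rest of the system when more than one is available
# (an arbitrary choice made for this sketch).
thread_count = max(1, available_cores - 1)

parallel_params = {"thread_count": thread_count}
print(parallel_params)
```

The dictionary could be merged into the other training parameters, for example `catboost.CatBoostClassifier(learning_rate=0.03, **parallel_params)`.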



[Table of Contents](#0)

<a name="18"></a>
## **References**
* Prokhorenkova, L. et al. (2019). CatBoost: unbiased boosting with categorical features. arXiv:1706.09516 [cs]. Available at: https://arxiv.org/abs/1706.09516.
* catboost.ai. Common parameters. [online] Available at: https://catboost.ai/en/docs/references/training-parameters/common. [Accessed 28 Feb. 2024].
* CatBoost Model Guide - Advanced Gradient Boosting ML Library by Yandex. [online] Available at: https://www.modelbit.com/model-hub/catboost-model-guide#:~:text=data%20is%20critical.- [Accessed 28 Feb. 2024].
* Anna Veronika Dorogush: Mastering gradient boosting with CatBoost | PyData London 2019. [online] Available at: https://www.youtube.com/watch?v=usdEWSDisS0 [Accessed 28 Feb. 2024].