# <p style='text-align: center;'> CatBoost Classifier </p>

## What is CatBoost Algorithm?
CatBoost is a recently open-sourced machine learning algorithm from Yandex.


The name **CatBoost** combines two words: **Category** and **Boosting**. **Category** reflects the library's focus on categorical features; it is also described as working well with many kinds of data, such as audio, text, and images, including historical data. **Boosting** refers to the gradient boosting machine learning algorithm on which the library is based.


CatBoost was developed by researchers and engineers at Yandex (a Russian tech company) and released in 2017. It serves a wide range of purposes, such as recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks.


According to the CatBoost documentation, CatBoost supports numerical, categorical, and text features, and it is particularly good at handling categorical data.


The CatBoost algorithm exposes a large number of parameters, including ones that control how features are processed.


The CatBoost algorithm is a **Supervised** Machine Learning algorithm.


CatBoost is based on gradient boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built with reduced loss compared to the previous trees. The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, trees stop being built.
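The stopping rule described above can be illustrated without the library. The following is a minimal sketch, assuming an iteration-based detector that stops once the validation loss has not improved for `wait` consecutive trees; the function name, the `wait` argument, and the loss values are hypothetical. In the CatBoost Python package, comparable behaviour is configured through training parameters such as `od_type`, `od_wait`, or `early_stopping_rounds`.

```python
def train_with_overfitting_detector(validation_losses, wait=3):
    """Simulate building trees one by one and stopping on stalled validation loss.

    validation_losses[i] is the (hypothetical) validation loss after tree i.
    Returns (number of trees built, index of the best tree).
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            # The new tree improved the validation loss: remember it.
            best_loss = loss
            best_iter = i
        elif i - best_iter >= wait:
            # No improvement for `wait` trees: the detector is triggered.
            return i + 1, best_iter
    return len(validation_losses), best_iter


# Illustrative losses: improvement stops after the fourth tree (index 3).
losses = [0.69, 0.60, 0.55, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58]
trees_built, best = train_with_overfitting_detector(losses, wait=3)
print(trees_built, best)
```

With these values, training stops three trees after the best one instead of building all nine.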


<b>Building stages for a single tree:</b>

1. Preliminary calculation of splits.
    
    
2. (Optional) Transforming categorical features to numerical features.
    
    
3. (Optional) Transforming text features to numerical features.
    
    
4. Choosing the tree structure. This stage is affected by the set Bootstrap options.
    
    
5. Calculating values in leaves.
    
    
<b>The main difference between CatBoost and other boosting algorithms is that CatBoost uses symmetric trees and builds them level by level.</b>
    
![image.png](attachment:image.png)
    
    
A symmetric tree is a tree in which all nodes on the same level use the same split, as shown in the picture above. This makes it possible to encode the path to a leaf as an index, which reduces prediction time and is especially important in low-latency environments.
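To make the index encoding concrete, here is a minimal sketch of prediction with one symmetric tree. It assumes one `(feature_index, threshold)` split per level and one bit per level in the leaf index; the splits, leaf values, and sample are made up for illustration and do not reflect CatBoost's internal data layout.

```python
def leaf_index(sample, splits):
    """Compute the leaf index for a symmetric (oblivious) tree.

    splits holds one (feature_index, threshold) pair per tree level,
    because every node on a level shares the same split.
    """
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        bit = 1 if sample[feature] > threshold else 0
        index |= bit << level  # each level contributes exactly one bit
    return index


# Hypothetical depth-3 tree: 3 splits, 2**3 = 8 leaves.
splits = [(0, 0.5), (1, 10.0), (2, -1.0)]
leaf_values = [0.1, -0.2, 0.3, 0.0, -0.4, 0.25, 0.05, -0.1]

sample = [0.9, 3.0, 2.0]
idx = leaf_index(sample, splits)
prediction = leaf_values[idx]
print(idx, prediction)
```

Prediction needs only one comparison per level followed by a single array lookup, with no pointer chasing through tree nodes.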

## How does CatBoost work?
Let’s look at a dataset that has 10 data points ordered in time.

![image.png](attachment:image.png)


Ideally, the residual for each data point would be calculated with a model trained only on the data points that precede it. For example, to calculate the residual of x5, we would train a model on x1, x2, x3, and x4. Training a separate model for every data point becomes computationally expensive on large datasets, so CatBoost instead trains only log(number_of_datapoints) models. A model that has been trained on n data points is used to calculate the residuals for the next n data points.


In the dataset above, the residuals of x5, x6, x7, and x8 are calculated using the model that has been trained on x1, x2, x3, and x4. This process is known as **ordered boosting**.
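The assignment of models to data points can be written down as a small schedule. The sketch below assumes that prefix sizes double (1, 2, 4, 8, ...), which is one way to arrive at roughly log2(number_of_datapoints) models; the function name is hypothetical and no actual model is trained.

```python
def ordered_boosting_schedule(n_points):
    """Map each 1-based position to the prefix size of the model that scores it.

    A model trained on the first n points computes residuals for
    points n+1 .. 2n. The first point has no predecessors, so it is absent.
    """
    schedule = {}
    trained_on = 1
    while trained_on < n_points:
        last = min(2 * trained_on, n_points)
        for position in range(trained_on + 1, last + 1):
            schedule[position] = trained_on
        trained_on *= 2  # assumed doubling of the training prefix
    return schedule


schedule = ordered_boosting_schedule(10)
for position, prefix in schedule.items():
    print(f"x{position}: residual from model trained on x1..x{prefix}")
```

For the 10-point dataset, x5 through x8 are scored by the model trained on x1 through x4, and only four distinct models are needed in total.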


CatBoost generates several random permutations of the dataset and applies ordered boosting to them. By default, CatBoost creates four random permutations. This randomness further reduces overfitting, and it can be controlled through tuning parameters.


## Features of CatBoost

- **Great quality without parameter tuning:** CatBoost's default parameters often give good results, so less time needs to be spent on tuning hyper-parameters.


- **Automatic handling of missing values:** Missing values are a common problem in real-world datasets, and many modeling workflows require imputing them before training. CatBoost can handle missing values in numerical features automatically: by default it treats a missing value as smaller than every other value of that feature, and it always considers a split that separates missing values from the rest.


- **Categorical features support:** CatBoost works directly with non-numeric features, which reduces the time spent on pre-processing the data and can also improve the training result.


- **Fast and scalable GPU version:** Training the model on a GPU is typically faster than training it on a CPU. CatBoost also efficiently supports multi-GPU configurations for large datasets.


- **Improved accuracy:** CatBoost reduces over-fitting while constructing models with a novel gradient-boosting scheme.


- **Fast prediction:** CatBoost can train on distributed GPUs, which speeds up learning, and its symmetric trees make applying a trained model fast. It is reported to make predictions 13–16 times faster than other algorithms.

## When to use CatBoost?

1. When a short training time on robust data is required.


2. When you are working with a dataset that has categorical features and you want to avoid converting them into a numerical format.


3. When you need a model that is considerably faster than many other algorithms.


## When not to use CatBoost?

1. When tuning the parameters that control categorical feature handling is necessary to optimize the model.

## Is tuning required in CatBoost?
The answer is not straightforward, because it depends on the type and features of the dataset. In many cases, CatBoost's default parameter settings do a good job.


CatBoost produces good results without extensive hyper-parameter tuning. However, some important parameters can be tuned in CatBoost to get a better result.
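If tuning does turn out to be worthwhile, a small grid over a few parameters is a common starting point. The sketch below only enumerates candidate configurations and does not train anything. The parameter names `iterations`, `learning_rate`, `depth`, and `l2_leaf_reg` are CatBoost training parameters, while the value ranges and the `candidate_configs` helper are illustrative assumptions rather than recommendations.

```python
from itertools import product

# Hypothetical search space; the values are examples, not recommendations.
search_space = {
    "iterations": [200, 500],       # number of trees
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 8],             # tree depth
    "l2_leaf_reg": [1, 3],          # L2 regularization on leaf values
}


def candidate_configs(space):
    """Yield one dict per combination of parameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(candidate_configs(search_space))
print(len(configs), "configurations, first:", configs[0])
```

Assuming the `catboost` package is installed, each dictionary could then be passed as `CatBoostClassifier(**config)` and scored on a validation set.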

## Advantages of CatBoost Library
- **Performance:** CatBoost provides state-of-the-art results and is competitive with leading machine learning algorithms on performance.


- **Handling categorical features automatically:** We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and on combinations of categorical and numerical features. The CatBoost documentation describes these statistics in more detail.


- **Robust:** It reduces the need for extensive hyper-parameter tuning and lowers the chance of overfitting, which leads to more generalized models. CatBoost nevertheless has multiple parameters that can be tuned, including the number of trees, learning rate, regularization, tree depth, fold size, bagging temperature, and others, all of which are described in the CatBoost documentation.


- **Easy-to-use:** You can use CatBoost from the command line or through user-friendly APIs for both Python and R.
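One statistic of the kind mentioned under categorical feature handling is a target statistic computed in row order, so that each row is encoded using only the rows before it. The following is a simplified sketch of that idea, not CatBoost's exact formula: the function name and the smoothing constants `prior` and `weight` are assumptions chosen for illustration, and feature combinations are left out.

```python
def ordered_target_statistic(categories, targets, prior=0.5, weight=1.0):
    """Encode each category value from the targets of earlier rows only.

    Using only preceding rows keeps a row's own target out of its
    encoding, which is what limits target leakage.
    """
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded.append((seen_sum + prior * weight) / (seen_count + weight))
        # Only now add the current row to the running statistics.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


cats = ["red", "blue", "red", "red", "blue"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_statistic(cats, ys)
print(enc)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the mean target observed so far for that category.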

## CatBoost – Comparison to other boosting libraries
There are multiple boosting libraries, such as XGBoost, H2O, and LightGBM, and all of them perform well on a variety of problems. The CatBoost developers have compared its performance with these competitors on standard ML datasets:

![image.png](attachment:image.png)


The comparison above shows the log-loss value on test data, which is lowest for CatBoost in most cases. This suggests that CatBoost mostly performs better, for both tuned and default models.


In addition, unlike XGBoost and LightGBM, CatBoost does not require the dataset to be converted to any specific format.

## CatBoost vs. LightGBM vs. XGBoost Comparison

LightGBM, XGBoost, and CatBoost can all run on either CPUs or GPUs for accelerated learning, but comparing them is more nuanced in practice. Each framework has an extensive list of tunable hyperparameters that affect learning and eventual performance.

First off, CatBoost is designed for categorical data and is known to perform very well on it; its official paper reports state-of-the-art performance over XGBoost and LightGBM on eight datasets. As of CatBoost version 0.6, a trained CatBoost model can also make predictions much faster than either XGBoost or LightGBM.

On the flip side, CatBoost's internal identification and processing of categorical data adds noticeable overhead to its training time, although it is still reported to train much faster than XGBoost. LightGBM also reports gains in accuracy and training speed over XGBoost in five of the benchmarks examined in its original publication.

To XGBoost’s credit, it has been around longer than either LightGBM or CatBoost, so it has better learning resources and a more active developer community. It is a distributed gradient boosting library that uses parallel tree boosting to solve numerous data science problems quickly and accurately. It also doesn’t hurt that XGBoost is reported to be substantially faster and more accurate than its predecessors and than alternatives such as the gradient boosting implementation in scikit-learn.

Each boosting technique and framework has a time and a place, and it is often not clear which will perform best until all of them have been tested. Fortunately, prior work has benchmarked the three choices fairly extensively, but ultimately it’s up to you, the engineer, to determine the best tool for the job.