# Machine Learning 32: CatBoost Algorithm in Machine Learning 
## 1. Introduction and Motivation

CatBoost (short for "Categorical Boosting") is a gradient boosting algorithm developed by Yandex in 2017. It was created to address two major challenges in existing gradient boosting methods like XGBoost and LightGBM:

1. Handling categorical features efficiently without manual preprocessing.
2. Reducing overfitting and prediction shift, which are common issues in boosting algorithms.

By introducing innovations like ordered boosting and oblivious decision trees, CatBoost provides strong performance across a wide variety of machine learning tasks.


## 2. Fundamental Concepts

* **Gradient Boosting**: An ensemble technique that builds models sequentially, where each new model corrects the errors made by previous models.
* **Decision Trees**: Weak learners that split data into rules and form the base models in boosting.
* **Ordered Boosting**: A technique introduced by CatBoost that prevents target leakage and reduces overfitting by ensuring the model does not use the same data point for both training and prediction.


## 3. Handling Categorical Variables

CatBoost’s most unique feature is its ability to handle categorical data directly, without needing manual encoding methods like One-Hot Encoding or Label Encoding.

It uses target statistics with an ordered approach:

* Categorical values are transformed into numeric values based on the statistics of the target variable.
* Ordered statistics ensure that the encoding does not use information from the future (avoiding data leakage).

This makes CatBoost especially effective for datasets with many categorical features.


## 4. Working Mechanism Step by Step

1. **Symmetric (Oblivious) Trees**: CatBoost builds balanced trees where all splits at the same depth are the same. This improves both training and prediction efficiency.

2. **Ordered Boosting**: Prevents prediction shift by constructing models in a way that avoids using the same data point for training and prediction simultaneously.

3. **Efficient Encoding of Categorical Features**: Categorical features are transformed internally into numerical representations using hashing and ordered statistics.

4. **Training Process**:

   * Start with simple baseline predictions.
   * Iteratively add oblivious trees that correct residual errors.
   * Ordered boosting ensures unbiased gradient estimation at each step.


## 5. Advantages of CatBoost

* Handles categorical features natively without preprocessing.
* Reduces overfitting with ordered boosting.
* Provides fast training with efficient use of resources.
* Requires less hyperparameter tuning compared to other boosting algorithms.
* Works well with both small and large datasets.


## 6. Disadvantages of CatBoost

* Higher memory usage compared to simpler algorithms.
* Less interpretable compared to linear models.
* On extremely large datasets, it may be slower than LightGBM (though often faster than XGBoost).


## 7. Real-World Applications of CatBoost

* **Finance**: Credit scoring, fraud detection.
* **Healthcare**: Disease risk prediction, patient outcome forecasting.
* **E-commerce**: Recommendation systems, customer segmentation.
* **Search Engines and Advertising**: Ranking results, click-through rate prediction.


## 8. CatBoost vs XGBoost vs LightGBM

| Feature              | CatBoost         | XGBoost           | LightGBM        |
| -------------------- | ---------------- | ----------------- | --------------- |
| Categorical Handling | Built-in support | Requires encoding | Limited support |
| Overfitting Control  | Strong (ordered) | Weaker            | Moderate        |
| Training Speed       | Fast             | Moderate          | Very Fast       |
| Accuracy             | High             | High              | High            |
| Memory Efficiency    | Medium           | Medium            | Best            |


In [1]:
# !pip install catboost

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
# data import titanic
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# pre-processing


In [4]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

## Q.Why usage of knn imputer for age column? 
Ans: Because age column has 177 missing values which is around 20% of the total values in the column.If we use mean or median imputer, it will not consider the similarity between rows and will impute the missing values with the mean or median of the column.

## Q.How does knn imputer work?
Ans: It finds the k nearest neighbors of the missing value and imputes the missing value with the mean of the k nearest neighbors.
Here we are using k=5.  

In [5]:
# impute missing values using knn imputers in age column 
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df['age'] = imputer.fit_transform(df[['age']])

In [6]:
# impute embarked missing values using pandas
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

In [7]:
# drop deck column
df.drop('deck', axis=1, inplace=True)

In [8]:
# df missing values
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  embark_town  891 non-null    object  
 12  alive        891 non-null    object  
 13  alone        891 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(5)
memory usage: 79.4+ KB


In [10]:
# convert each category column to category
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
# add this as a new column in the dataframe
df[categorical_cols] = df[categorical_cols].astype('category')

In [11]:
df[categorical_cols]

Unnamed: 0,sex,embarked,class,who,embark_town,alive
0,male,S,Third,man,Southampton,no
1,female,C,First,woman,Cherbourg,yes
2,female,S,Third,woman,Southampton,yes
3,female,S,First,woman,Southampton,yes
4,male,S,Third,man,Southampton,no
...,...,...,...,...,...,...
886,male,S,Second,man,Southampton,no
887,female,S,First,woman,Southampton,yes
888,female,S,Third,woman,Southampton,no
889,male,C,First,man,Cherbourg,yes


In [12]:
# split data into X and y
X = df.drop('survived', axis=1)
y = df['survived']

In [13]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# run the catboost classifier
model = CatBoostClassifier(iterations=100,
                           learning_rate=0.1,
                           depth=3,
                           loss_function='Logloss',
                           eval_metric='Accuracy',
                           random_seed=42,
                           verbose=False)

# train the model
model.fit(X_train, y_train, cat_features=categorical_cols.tolist())

<catboost.core.CatBoostClassifier at 0x2ed4dcea7b0>

In [15]:
# predictions
y_pred = model.predict(X_test)

In [16]:
# evaluate the model
print(f'Accuracy Score: {accuracy_score(y_test, y_pred)}')
print(f'Confusion Matrix: \n {confusion_matrix(y_test, y_pred)}')
print(f'Classification Report: \n {classification_report(y_test, y_pred)}')

Accuracy Score: 1.0
Confusion Matrix: 
 [[105   0]
 [  0  74]]
Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       105
           1       1.00      1.00      1.00        74

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179



## 9. Summary for Beginners

CatBoost is a gradient boosting algorithm that simplifies machine learning by automatically handling categorical features and reducing overfitting. It is an excellent choice when working with datasets containing many categorical variables. While it may use more memory and be less interpretable than simple models, CatBoost is powerful, accurate, and often requires minimal preprocessing and tuning.
