# CatBoost

**CatBoost** is a gradient boosting algorithm that is especially useful when working with **categorical data**.

---

## 🧠 What is CatBoost?

> **CatBoost** stands for **Categorical Boosting**.
> It’s an open-source, **gradient boosting library** developed by **Yandex**.

Like XGBoost and LightGBM, CatBoost builds an **ensemble of decision trees**, but it brings unique strengths, especially in **handling categorical features automatically**.

---

## 🌟 Key Features of CatBoost

| Feature                                                               | Description                                                                   |
| --------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| 🏷️ Categorical Handling                                              | **No need for one-hot or label encoding**; columns listed in `cat_features` are encoded internally |
| 🏃‍♂️ Fast and Accurate                                               | Uses **ordered boosting** to reduce overfitting                               |
| 🚀 Works Well with Defaults                                           | Very **little parameter tuning** needed to get strong results                 |
| 🧠 Built-in Cross Validation                                          | Has built-in support for **k-fold CV**                                        |
| 🔢 Multiple Task Types                                                | Supports **classification** (binary and multiclass), **regression**, and **ranking** |

---

## 📦 How CatBoost Works Differently

### ✅ **Automatic Categorical Encoding**

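* Encodes each categorical value with **target statistics** (a smoothed average of the label for that category) and also builds **combinations** of categorical features.
* Avoids target leakage by computing these statistics in an **ordered** way: each row's encoding uses only the rows that come before it in a random permutation of the data.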
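The idea of ordered target statistics can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's exact formula: the function name `ordered_target_stats` and the `prior` and `weight` smoothing parameters are assumptions made for the example, and the existing row order stands in for the random permutation.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums = defaultdict(float)   # running sum of targets seen so far, per category
    counts = defaultdict(int)   # running count of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean over *previous* rows only; the current target is not used.
        encoded.append((sums[cat] + weight * prior) / (counts[cat] + weight))
        # Only now does the current row become visible to later rows.
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_stats(cats, ys)
print(enc)
```

The first occurrence of every category falls back to the prior, because no earlier rows exist to average over. That is why a row's own label never leaks into its encoding.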

### ✅ **Ordered Boosting**

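* Addresses the **prediction shift** problem in standard boosting, where gradients are estimated on the same examples the model was trained on.
* Estimates the gradient for each example with a model trained only on the examples that **precede** it in a random permutation, so an example never influences its own residual.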
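A toy sketch can make the difference concrete. It assumes the simplest possible "model", a running mean of the targets, in place of a tree ensemble, so it only illustrates the principle and is not how CatBoost is implemented. The function names and the `prior` parameter are hypothetical.

```python
def plain_residuals(targets):
    """Residuals against a mean fitted on ALL examples, including each example itself."""
    mean = sum(targets) / len(targets)
    return [y - mean for y in targets]

def ordered_residuals(targets, prior=0.0):
    """Residuals against a mean fitted only on the examples that come earlier."""
    residuals = []
    running_sum = 0.0
    for i, y in enumerate(targets):
        # The prediction for example i never sees y itself.
        pred = running_sum / i if i > 0 else prior
        residuals.append(y - pred)
        running_sum += y
    return residuals

ys = [1.0, 0.0, 1.0, 1.0]
plain = plain_residuals(ys)
ordered = ordered_residuals(ys)
print(plain)
print(ordered)
```

In the plain version every residual is pulled toward zero because each target contributed to the mean it is compared against. The ordered version avoids that dependence, at the cost of noisier estimates for the earliest examples.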

---

## ✅ Code Example

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=0
)

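# X_train, y_train, and X_test are assumed to be defined already.
# cat_features lists the column indices to treat as categorical (here 0 and 2).
model.fit(X_train, y_train, cat_features=[0, 2])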

y_pred = model.predict(X_test)
```

---

## 🧪 Use Cases

| Domain     | Use Case                               |
| ---------- | -------------------------------------- |
| E-commerce | Product recommendation, CTR prediction |
| Banking    | Credit scoring                         |
| Healthcare | Diagnosis prediction                   |
| NLP        | Feature-rich text classification       |

---

## 🧠 When to Use CatBoost?

| Situation                                    | Go with CatBoost? |
| -------------------------------------------- | ----------------- |
| Lots of categorical data                     | ✅ YES             |
| Need fast, accurate results with less tuning | ✅ YES             |
| Tabular datasets with mixed data types       | ✅ YES             |
| Time series (with careful setup)             | ✅ Possible        |

---

## 🔍 CatBoost vs. XGBoost vs. LightGBM

| Feature              | CatBoost           | XGBoost     | LightGBM        |
| -------------------- | ------------------ | ----------- | --------------- |
| Categorical Handling | ✅ Built-in         | ⚠️ Opt-in (`enable_categorical`) | ⚠️ Native, but needs integer or `category` dtype |
| Performance          | ✅ Great out-of-box | ✅ High      | ✅ Very fast     |
| Training Time        | ⚠️ Slightly slower | ⚠️ Moderate | ✅ Fastest       |
| Interpretability     | ✅ Good             | ✅ Good      | ✅ Good          |

---


## Importing the libraries

In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [3]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

## Splitting the dataset into the Training set and Test set

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
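To see what an 80/20 split amounts to, here is a minimal pure-Python sketch. It is not scikit-learn's implementation and will not reproduce the same shuffle for `random_state = 0`; the helper name `simple_train_test_split` is hypothetical. It does follow the same sizing rule, rounding the test set up to `ceil(test_size * n)`.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(rows))   # test set size is rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed matters for the rest of the notebook: the confusion matrix and accuracy are only comparable between runs if the same rows land in the test set.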

## Training CatBoost on the Training set

In [5]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)

Learning rate set to 0.007956
0:	learn: 0.6778283	total: 57.9ms	remaining: 57.8s
1:	learn: 0.6642874	total: 59.5ms	remaining: 29.7s
2:	learn: 0.6510578	total: 61ms	remaining: 20.3s
3:	learn: 0.6351685	total: 68.7ms	remaining: 17.1s
4:	learn: 0.6203906	total: 70.2ms	remaining: 14s
5:	learn: 0.6053561	total: 71.7ms	remaining: 11.9s
6:	learn: 0.5913363	total: 78.3ms	remaining: 11.1s
7:	learn: 0.5773888	total: 80ms	remaining: 9.92s
8:	learn: 0.5638394	total: 81.5ms	remaining: 8.98s
9:	learn: 0.5507421	total: 88.2ms	remaining: 8.73s
10:	learn: 0.5377201	total: 89.7ms	remaining: 8.07s
11:	learn: 0.5243873	total: 91.2ms	remaining: 7.51s
12:	learn: 0.5129034	total: 97.8ms	remaining: 7.43s
13:	learn: 0.5047204	total: 99.4ms	remaining: 7s
14:	learn: 0.4942404	total: 101ms	remaining: 6.63s
15:	learn: 0.4836253	total: 108ms	remaining: 6.63s
16:	learn: 0.4733355	total: 110ms	remaining: 6.34s
17:	learn: 0.4629416	total: 121ms	remaining: 6.58s
18:	learn: 0.4527778	total: 122ms	remaining: 6.32s
...

<catboost.core.CatBoostClassifier at 0x7b24293489d0>

## Making the Confusion Matrix

In [6]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[84  3]
 [ 0 50]]


0.9781021897810219
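The accuracy can be checked by hand from the confusion matrix printed above. In scikit-learn's layout, rows are true classes and columns are predicted classes. Treating the second class as the "positive" one is an assumption made here only to name the cells; the notebook does not say what the labels mean.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = [[84, 3],
      [0, 50]]

tn, fp = cm[0]   # true class 0: predicted 0 / predicted 1
fn, tp = cm[1]   # true class 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # share of correct predictions
precision = tp / (tp + fp)      # of predicted positives, how many were right
recall = tp / (tp + fn)         # of actual positives, how many were found

print(total, accuracy, precision, recall)
```

The 3 errors are all class-0 samples predicted as class 1, so recall for class 1 is perfect while its precision is slightly lower.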

## Applying k-Fold Cross Validation

In [7]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Streaming output truncated to the last 5000 lines.
6:	learn: 0.6007221	total: 18.4ms	remaining: 2.61s
7:	learn: 0.5865261	total: 20.4ms	remaining: 2.53s
8:	learn: 0.5760173	total: 23.5ms	remaining: 2.58s
9:	learn: 0.5641784	total: 24.9ms	remaining: 2.46s
10:	learn: 0.5538549	total: 26.4ms	remaining: 2.38s
11:	learn: 0.5413434	total: 29.2ms	remaining: 2.4s
12:	learn: 0.5308262	total: 38ms	remaining: 2.89s
13:	learn: 0.5187893	total: 39.8ms	remaining: 2.8s
14:	learn: 0.5084890	total: 52.3ms	remaining: 3.43s
15:	learn: 0.4986254	total: 59.2ms	remaining: 3.64s
16:	learn: 0.4890714	total: 66.2ms	remaining: 3.83s
17:	learn: 0.4790883	total: 73.1ms	remaining: 3.98s
18:	learn: 0.4700108	total: 79.8ms	remaining: 4.12s
19:	learn: 0.4630325	total: 84.9ms	remaining: 4.16s
20:	learn: 0.4536134	total: 87.7ms	remaining: 4.09s
21:	learn: 0.4429695	total: 91.5ms	remaining: 4.07s
22:	learn: 0.4362340	total: 95.2ms	remaining: 4.04s
23:	learn: 0.4280061	total: 99ms	remaining: 4.02s
...
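The mechanics behind `cv = 10` can be sketched without any libraries. The version below uses plain contiguous folds, whereas `cross_val_score` uses stratified folds for a classifier when given an integer `cv`. The helper `k_fold_indices` and the list of per-fold scores are hypothetical and only show how the two summary numbers are computed.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; the first n_samples % k folds get one extra item."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)

# Hypothetical per-fold accuracies, NOT results from this notebook.
scores = [0.96, 0.98, 0.94, 1.00, 0.96, 0.98, 0.96, 0.94, 0.98, 0.96]
print("Accuracy: {:.2f} %".format(mean(scores) * 100))
# NumPy's .std() defaults to the population standard deviation, hence pstdev here.
print("Standard Deviation: {:.2f} %".format(pstdev(scores) * 100))
```

Each sample is used for validation exactly once and for training in the other nine folds, which is why the classifier is refit ten times and the training log above is so long.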