# CatBoost Regressor

**CatBoost** stands for **Cat**egory **Boost**ing. It is a gradient boosting algorithm that excels particularly with datasets containing many categorical features. It handles categorical features natively and efficiently without extensive preprocessing, while maintaining high prediction quality.

### Main Advantages
* **Superior handling** of categorical features
* **Reduced need** for hyperparameter tuning
* **High performance** with default settings
* **Robust** to overfitting

---

## How it Works

Like XGBoost and LightGBM, CatBoost uses the gradient boosting framework with the same fundamental objective function:

$$
Obj(\theta) = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)
$$

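In this objective, $L$ is the per-example loss, $\hat{y}_i$ is the ensemble's prediction for example $i$, and $\Omega(f_k)$ is a regularization penalty on the $k$-th tree. CatBoost's key innovations lie in how it encodes categorical features and how it computes gradients.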
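To make the objective concrete, here is a minimal sketch that evaluates it for a squared-error loss. The form of the penalty is an assumption made for illustration (an L2 penalty on leaf values, weighted by a hypothetical `l2_reg`); the formula above does not fix a particular $\Omega$.

```python
import numpy as np

def objective(y, y_hat, trees_leaf_values, l2_reg=1.0):
    """Sum of squared-error losses plus an assumed L2 penalty per tree."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    loss = np.sum((y - y_hat) ** 2)
    # Omega(f_k): assumed here to be l2_reg * sum of squared leaf values of tree k
    penalty = sum(l2_reg * np.sum(np.square(leaves)) for leaves in trees_leaf_values)
    return float(loss + penalty)

y = [3.0, 5.0]
y_hat = [2.0, 5.5]
leaf_values = [np.array([0.5, -0.5]), np.array([1.0])]  # two toy trees
total = objective(y, y_hat, leaf_values)
print(total)  # 1.25 (loss) + 1.5 (penalty) = 2.75
```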

### 1. Ordered Boosting (The Secret Sauce)
**Problem:** Traditional gradient boosting suffers from *prediction shift*, a form of target leakage. The gradient for a data point is computed with the current ensemble, but that ensemble was itself trained on the same data point, so the gradient is biased relative to what the model would produce on unseen data.

**CatBoost Solution: Ordered Boosting**
* Uses a **permutation (ordering)** of the training data.
* For each example, the model used to calculate gradients is trained **only on the examples that come before it** in the permutation.
* This eliminates target leakage and makes the model more robust.

> **Analogy:** It's like time-series cross-validation: you don't use future data to predict the past.
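The ordering idea can be sketched with a deliberately trivial model. Here the "model" is just the mean of the targets seen so far, which is an assumption made to keep the example short (CatBoost fits trees, not means), and `ordered_residuals` is a hypothetical helper name.

```python
import numpy as np

def ordered_residuals(y, perm, init=0.0):
    """Residual of each example under a model fit only on examples earlier in `perm`."""
    y = np.asarray(y, dtype=float)
    residuals = np.empty_like(y)
    running_sum, seen = 0.0, 0
    for idx in perm:
        # the prediction comes from the preceding examples only
        pred = running_sum / seen if seen else init
        residuals[idx] = y[idx] - pred
        # only now is y[idx] revealed to the model
        running_sum += y[idx]
        seen += 1
    return residuals

y = np.array([10.0, 20.0, 30.0])
perm = [2, 0, 1]  # a fixed permutation, chosen for reproducibility
res = ordered_residuals(y, perm)
print(res)  # [-20.   0.  30.]
```

No residual depends on its own target, which is the property ordered boosting is after.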

### 2. Innovative Categorical Feature Handling
**Problem:** Common methods like one-hot encoding can lead to high dimensionality, while label encoding can create false relationships.

**CatBoost Solution: Ordered Target Statistics**
For each categorical feature, it calculates:

$$
\text{avg\_target} = \frac{\text{sum of target for category} + \text{prior}}{\text{count of category} + 1}
$$

But crucially, it uses the same **ordered principle**: when calculating the statistic for a training example, it *only uses the examples that come before it in the permutation*.
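As a worked example of the formula above, the following sketch encodes a small column row by row. The function name and the toy data are illustrative, and the row order of the input stands in for the permutation.

```python
def ordered_target_stats(categories, targets, prior):
    """Encode each row with (sum_before + prior) / (count_before + 1)."""
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # statistic uses only rows of this category seen earlier
        encoded.append((s + prior) / (c + 1))
        # update the running statistics after encoding the row
        sums[cat], counts[cat] = s + target, c + 1
    return encoded

cats = ["red", "blue", "red", "red", "blue"]
y = [1.0, 0.0, 3.0, 2.0, 4.0]
enc = ordered_target_stats(cats, y, prior=2.0)
print(enc)  # [2.0, 2.0, 1.5, 2.0, 1.0]
```

The first occurrence of each category receives the prior itself, since no earlier rows exist for it.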

**Other Methods Supported:**
* **One-hot encoding:** For low-cardinality features.
* **Feature combinations:** Automatically creates interactions between categorical features.
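Both alternatives can be illustrated in a few lines of NumPy. This is a sketch with hypothetical helper names, not the library's internal implementation; in the library itself the one-hot cutoff is controlled by the `one_hot_max_size` parameter.

```python
import numpy as np

def combine_categories(col_a, col_b, sep="|"):
    """Build one combined categorical feature from two columns (hypothetical helper)."""
    return np.array([f"{a}{sep}{b}" for a, b in zip(col_a, col_b)])

def one_hot(col):
    """One-hot encode a low-cardinality column; returns (sorted categories, 0/1 matrix)."""
    cats, inverse = np.unique(col, return_inverse=True)
    return cats, np.eye(len(cats), dtype=int)[inverse]

color = np.array(["red", "blue", "red"])
size = np.array(["S", "S", "L"])

combo = combine_categories(color, size)
cats, matrix = one_hot(color)
print(combo)   # ['red|S' 'blue|S' 'red|L']
print(cats)    # ['blue' 'red']
print(matrix)  # rows: [0 1], [1 0], [0 1]
```

The combined column can then be encoded with target statistics like any other categorical feature.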

### 3. Symmetric Tree Structure
Unlike XGBoost and LightGBM, which typically build asymmetric trees, CatBoost builds **symmetric (oblivious) trees** by default:
* The same splitting condition is used by all nodes at the same level.
* **Faster prediction time:** the leaf is found by combining one comparison per level into an index, with no node-by-node traversal.
* More resistant to overfitting.
* Easier to implement in production.
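The index lookup mentioned above can be sketched as follows. The splits and leaf values are made-up numbers; the point is that each level contributes one bit to the leaf index.

```python
import numpy as np

def oblivious_leaf_index(X, splits):
    """Leaf index per row: one bit per level, bit = (feature value > threshold)."""
    index = np.zeros(X.shape[0], dtype=int)
    for level, (feature, threshold) in enumerate(splits):
        index += (X[:, feature] > threshold).astype(int) << level
    return index

splits = [(0, 0.5), (1, 2.0)]                  # (feature, threshold) for levels 0 and 1
leaf_values = np.array([-1.0, 0.0, 1.0, 2.0])  # 2 ** depth leaves
X = np.array([[0.2, 1.0], [0.9, 1.0], [0.2, 3.0], [0.9, 3.0]])

idx = oblivious_leaf_index(X, splits)
print(idx)               # [0 1 2 3]
print(leaf_values[idx])  # [-1.  0.  1.  2.]
```

Because every row evaluates the same `depth` comparisons, prediction reduces to a vectorized comparison plus one array lookup.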

---

## When to Choose CatBoost

| Choose CatBoost when... | Be Cautious when... |
| :--- | :--- |
| Your dataset has **many categorical features** | You have very few categorical features |
| You want good performance with **minimal tuning** | Training speed is the absolute priority (LightGBM is faster) |
| You need robust results without overfitting | You need extremely fine-grained control over tree structure |
| **Productization is important** (symmetric trees = faster prediction) | |

---

## Summary: Why is CatBoost Special?

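CatBoost addresses two related problems in gradient boosting on categorical data:

**1. Target Leakage**
* Other libraries (XGBoost, LightGBM) are commonly used with categorical features encoded beforehand (label, one-hot, or target encoding).
* Naive target encoding computes each row's statistic from that row's own label, which leaks the target into the feature.

**2. Prediction Shift**
* Each boosting step computes gradients with a model that was fit on the same examples, so training gradients are biased relative to those on unseen data, which encourages overfitting.

**$\rightarrow$ CatBoost addresses the first with Ordered Target Statistics and the second with Ordered Boosting; both rely on the same permutation idea.**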

# CatBoost Classifier vs. Regressor

While both algorithms share the same underlying engine (ordered boosting, symmetric trees), they are optimized for different types of predictive tasks.

## 1. Problem Type & Output

| Feature | CatBoost Classifier | CatBoost Regressor |
| :--- | :--- | :--- |
| **Problem Domain** | Classification problems | Regression problems |
| **Goal** | Predicts class labels or probabilities | Predicts continuous numerical values |
| **Example Output** | `0`/`1`, `"Spam"`/`"Ham"`, `[0.2, 0.8]` | `125.7`, `-2.45`, `0.893` |

## 2. Loss Functions

| **Classifier** (Optimizes Probability) | **Regressor** (Optimizes Error) |
| :--- | :--- |
| **Logloss** (Binary Classification) | **RMSE** (Root Mean Squared Error) |
| **MultiClass** | **MAE** (Mean Absolute Error) |
| **CrossEntropy** | **Quantile** |
| *Output:* Probabilities $\rightarrow$ Classes | *Output:* Direct numerical values |
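For reference, the losses in the table can be written out in NumPy. These are the textbook definitions; the library's exact normalization and weighting may differ, so treat this as a sketch rather than a reimplementation.

```python
import numpy as np

def rmse(y, pred):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(pred)) ** 2)))

def mae(y, pred):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(pred))))

def quantile_loss(y, pred, alpha=0.5):
    """Pinball loss: under-predictions weighted by alpha, over-predictions by 1 - alpha."""
    diff = np.asarray(y) - np.asarray(pred)
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

def logloss(y, prob, eps=1e-15):
    """Binary cross-entropy on predicted probabilities of class 1."""
    y = np.asarray(y, dtype=float)
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob)))

y_reg, pred_reg = [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
print(rmse(y_reg, pred_reg))                # sqrt(5/3)
print(mae(y_reg, pred_reg))                 # 1.0
print(quantile_loss(y_reg, pred_reg, 0.5))  # 0.5, i.e. half the MAE
print(logloss([1, 0], [0.5, 0.5]))          # ln(2)
```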

## 3. Evaluation Metrics

| Classifier Metrics | Regressor Metrics |
| :--- | :--- |
| Accuracy, AUC, F1-Score | RMSE, MAE, $R^2$ |
| Precision, Recall | MAPE (Mean Absolute Percentage Error) |
| Logloss | MSLE (Mean Squared Log Error) |

## 4. Key Differences

### Target Type
* **Regressor:** Numeric (continuous)
* **Classifier:** Categorical (discrete classes)

### Output Transformation
* **Regressor:** Raw number
* **Classifier:** Probability (via Sigmoid or Softmax) $\rightarrow$ Optional threshold to get class
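A short sketch of the classifier's output mapping, assuming raw scores are already available from the ensemble. The 0.5 threshold is an assumed default for illustration.

```python
import numpy as np

def sigmoid(raw):
    return 1.0 / (1.0 + np.exp(-np.asarray(raw, dtype=float)))

def softmax(raw):
    raw = np.asarray(raw, dtype=float)
    shifted = raw - raw.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Binary case: raw score -> probability -> thresholded label
raw_binary = np.array([-2.0, 0.0, 3.0])
prob = sigmoid(raw_binary)
labels = (prob >= 0.5).astype(int)

# Multiclass case: one raw score per class -> probabilities -> argmax
raw_multi = np.array([[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]])
prob_multi = softmax(raw_multi)
classes = prob_multi.argmax(axis=1)

print(labels)   # [0 1 1]
print(classes)  # [0 1]
```

A regressor skips this step entirely and returns the raw score.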

### Gradient / Hessian Calculation
* **Regressor:** For squared-error loss, the gradient is the residual (Prediction - Target) and the Hessian is constant
* **Classifier:** Derived from classification loss (e.g., Logloss)
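The two cases can be compared side by side. The sketch below assumes squared error written as $\tfrac{1}{2}(\hat{y} - y)^2$ for regression and Logloss on raw scores for binary classification; the function names are illustrative.

```python
import numpy as np

def l2_grad_hess(y, raw):
    """Squared error 0.5 * (raw - y)^2: gradient is the residual, Hessian is 1."""
    y, raw = np.asarray(y, dtype=float), np.asarray(raw, dtype=float)
    return raw - y, np.ones_like(raw)

def logloss_grad_hess(y, raw):
    """Logloss on raw scores: gradient is p - y, Hessian is p * (1 - p), p = sigmoid(raw)."""
    y, raw = np.asarray(y, dtype=float), np.asarray(raw, dtype=float)
    p = 1.0 / (1.0 + np.exp(-raw))
    return p - y, p * (1.0 - p)

g_reg, h_reg = l2_grad_hess([3.0, 1.0], [2.0, 2.0])
g_clf, h_clf = logloss_grad_hess([1.0, 0.0], [0.0, 0.0])
print(g_reg, h_reg)  # [-1.  1.] [1. 1.]
print(g_clf, h_clf)  # [-0.5  0.5] [0.25 0.25]
```

Everything downstream of these two arrays (tree fitting, leaf values) can stay the same for both tasks.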

### Loss Function
* **Regression:** Uses RMSE, MAE, Quantile
* **Classification:** Uses Logloss / Cross-Entropy

---

## Summary

* **CatBoost Regressor** $\rightarrow$ Numeric prediction, regression loss.
* **CatBoost Classifier** $\rightarrow$ Probability prediction, classification loss.
* **Shared Core:** The boosting logic, symmetric tree structure, and categorical handling remain the **same**.
* **Conversion:** To convert a regressor architecture to a classifier, you primarily change the **Loss Function**, **Gradient Calculation**, and **Output Mapping**.

In [None]:
import numpy as np


class OrderedTargetEncoder:
    """Ordered target statistics for one categorical column.

    Row i is encoded from rows 0..i-1 only, so its own target never leaks in.
    The row order of the input is used as the permutation.
    """

    def __init__(self):
        self.sums = {}    # category -> sum of targets seen so far
        self.counts = {}  # category -> number of rows seen so far
        self.prior = 0.0  # smoothing prior (global target mean)

    def _encode(self, cat):
        # (sum of target for category + prior) / (count of category + 1)
        return (self.sums.get(cat, 0.0) + self.prior) / (self.counts.get(cat, 0) + 1)

    def fit_transform(self, X_cat, y):
        y = np.asarray(y, dtype=float)
        X_new = np.zeros(len(X_cat), dtype=float)
        self.sums, self.counts = {}, {}
        self.prior = float(y.mean())

        for i, cat in enumerate(X_cat):
            # encode first, then reveal y[i] to later rows
            X_new[i] = self._encode(cat)
            self.sums[cat] = self.sums.get(cat, 0.0) + y[i]
            self.counts[cat] = self.counts.get(cat, 0) + 1

        return X_new.reshape(-1, 1)

    def transform(self, X_cat):
        # new data: use statistics from the full training set;
        # unseen categories fall back to the prior
        X_new = np.array([self._encode(cat) for cat in X_cat], dtype=float)
        return X_new.reshape(-1, 1)



class CatBoostTree:
    def __init__(self, depth=3):
        self.depth = depth
        self.splits = []
        self.leaf_values = None

    def fit(self, X, grad):
        n_samples, n_features = X.shape
        self.splits = []
        leaf_index = np.zeros(n_samples, dtype=int)

        # symmetric (oblivious) tree: one (feature, threshold) per level,
        # shared by every node at that level
        for level in range(self.depth):
            best_gain = -np.inf
            best_feature, best_thresh = 0, X[:, 0].max()  # fallback: no-op split
            n_nodes = 2 ** level

            for f in range(n_features):
                # skip the largest value: "> max" would send nothing right
                for t in np.unique(X[:, f])[:-1]:
                    # child node of every sample if this split were applied,
                    # given the splits already chosen at earlier levels
                    child = leaf_index + (X[:, f] > t) * n_nodes
                    sums = np.bincount(child, weights=grad, minlength=2 * n_nodes)
                    counts = np.bincount(child, minlength=2 * n_nodes)
                    used = counts > 0
                    # L2 gain summed over all children: (sum of grad)^2 / count
                    gain = np.sum(sums[used] ** 2 / counts[used])
                    if gain > best_gain:
                        best_gain, best_feature, best_thresh = gain, f, t

            self.splits.append((best_feature, best_thresh))
            leaf_index += (X[:, best_feature] > best_thresh) * n_nodes

        # leaf value = negative mean gradient of the samples in that leaf
        num_leaves = 2 ** self.depth
        sums = np.bincount(leaf_index, weights=grad, minlength=num_leaves)
        counts = np.bincount(leaf_index, minlength=num_leaves)
        used = counts > 0
        self.leaf_values = np.zeros(num_leaves)
        self.leaf_values[used] = -sums[used] / counts[used]

    def predict(self, X):
        # each level contributes one bit of the leaf index
        leaf_index = np.zeros(X.shape[0], dtype=int)
        for level, (f, t) in enumerate(self.splits):
            leaf_index += (X[:, f] > t) * (2 ** level)
        return self.leaf_values[leaf_index]



class CatBoostRegressorCore:
    def __init__(self, n_estimators=20, learning_rate=0.1, depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.depth = depth
        self.trees = []
        self.encoders = {}

    def fit(self, X_numeric, X_categorical, y):
        # Encode categorical features
        encoded_cats = []
        for j in range(X_categorical.shape[1]):
            enc = OrderedTargetEncoder()
            encoded_feature = enc.fit_transform(X_categorical[:, j], y)
            encoded_cats.append(encoded_feature)
            self.encoders[j] = enc

        # Combine numeric + encoded categorical
        X = np.hstack([X_numeric] + encoded_cats)
        y = np.asarray(y, dtype=float)
        pred = np.zeros_like(y)
        self.trees = []

        # Note: this loop is plain gradient boosting; gradients are computed
        # with the full ensemble, not with per-example ordered models.
        for _ in range(self.n_estimators):
            grad = pred - y  # gradient of L2 loss 0.5 * (pred - y)^2
            tree = CatBoostTree(depth=self.depth)
            tree.fit(X, grad)
            # leaves already hold the negative mean gradient, so add the update
            pred += self.learning_rate * tree.predict(X)
            self.trees.append(tree)

    def predict(self, X_numeric, X_categorical):
        encoded_cats = []
        for j in range(X_categorical.shape[1]):
            encoded_feature = self.encoders[j].transform(X_categorical[:, j])
            encoded_cats.append(encoded_feature)

        X = np.hstack([X_numeric] + encoded_cats)
        pred = np.zeros(X.shape[0])

        # same update rule as in fit
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)

        return pred


In [None]:
# If we can configure a classifier like this:
from catboost import CatBoostClassifier
model_clf = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    cat_features=['category_col'],
    eval_metric='Accuracy'
)

# ...then a regressor takes the same arguments:
from catboost import CatBoostRegressor
model_reg = CatBoostRegressor(
    iterations=1000,           # Same
    learning_rate=0.05,        # Same
    depth=6,                   # Same
    cat_features=['category_col'],  # Same
    eval_metric='RMSE'         # Regression metric; the default loss also switches from Logloss to RMSE
)