# In-Depth CatBoost

CatBoost is a gradient-boosted decision tree library developed by Yandex that shines when our data contains categorical features, which it handles natively. It includes algorithmic fixes (ordered boosting and permutation-based target statistics) that reduce target leakage and overfitting in categorical encodings. It uses symmetric (oblivious) trees, has strong default regularization, and supports CPU/GPU training, multi-class, regression, ranking, text features, and model interpretation (SHAP).

### Gradient boosting backbone

CatBoost is a gradient boosting framework: we sequentially add trees that fit the negative gradient (residual) of the loss function. Each new tree aims to reduce the ensemble's loss.
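The loop described above can be sketched in a few lines of plain Python. This is a toy version for squared loss with depth-1 trees (stumps); the helper names `fit_stump`, `predict_stump`, and `boost` are made up for illustration, and CatBoost's own tree construction and regularization are considerably more involved.

```python
# Toy gradient boosting for squared loss using depth-1 trees (stumps).
# Illustrative only: CatBoost's trees, loss handling, and regularization differ.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every distinct value except the largest
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    preds = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_trees):
        # For squared loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, losses

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
stumps, losses = boost(xs, ys)
print(f"MSE after 1 tree: {losses[0]:.3f}, after 20 trees: {losses[-1]:.3f}")
```

Each new stump is fitted to the current residuals, so the training loss cannot increase from one iteration to the next for a learning rate between 0 and 1.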

### Symmetric (oblivious) trees

Trees in CatBoost are symmetric: each tree level uses the same split condition for all nodes at that level. This makes trees balanced and fast at prediction (bitwise decisions), reduces overfitting, and often yields smaller, faster models.
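To make the "bitwise decisions" remark concrete, here is a minimal sketch of prediction with a single oblivious tree. The feature names, thresholds, and leaf values are invented for illustration; the point is only that one comparison per level yields one bit of the leaf index.

```python
# One oblivious tree of depth 3: a single (feature, threshold) pair per level,
# shared by every node on that level. All values here are made up.
splits = [("age", 30.0), ("income", 50_000.0), ("tenure", 2.0)]
leaf_values = [0.1, -0.2, 0.05, 0.3, -0.4, 0.25, 0.0, 0.6]  # 2**3 leaves

def leaf_index(row, splits):
    """Build the leaf index bit by bit: bit i is 1 when the level-i test passes."""
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:
            index |= 1 << level
    return index

def predict(row):
    return leaf_values[leaf_index(row, splits)]

row = {"age": 42, "income": 38_000, "tenure": 5}
print(leaf_index(row, splits), predict(row))
```

Because the tree is a flat table lookup rather than a chain of branches, evaluating many rows is easy to vectorize.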

### Ordered boosting and categorical handling

A classic problem with naive target statistics (mean target per category) is target leakage: if we compute mean target using all training labels, the model can learn to “peek” at the label.

CatBoost uses ordered target statistics and permutations: for each object it only uses preceding objects in a permutation to compute category statistics — this reduces leakage. The algorithm also supports multiple permutations (ensembling within training) to stabilize estimates.

CatBoost accepts categorical features directly (via cat_features), computes special statistics internally, and applies transformations that avoid leakage.
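The difference between naive and ordered target statistics can be sketched as follows. This is a simplified illustration, not CatBoost's exact formula: the smoothing weight `a`, the use of the global mean as `prior`, and both function names are assumptions made for the example.

```python
import random

def greedy_target_stat(categories, targets, prior, a=1.0):
    """Naive encoding: each row's statistic includes its own label (leaky)."""
    encoded = []
    for cat in categories:
        labels = [t for c, t in zip(categories, targets) if c == cat]
        encoded.append((sum(labels) + a * prior) / (len(labels) + a))
    return encoded

def ordered_target_stat(categories, targets, prior, a=1.0, seed=0):
    """Ordered encoding: each row only sees rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        encoded[i] = (sums.get(cat, 0.0) + a * prior) / (counts.get(cat, 0) + a)
        # Update the running totals only after row i has been encoded.
        sums[cat] = sums.get(cat, 0.0) + targets[i]
        counts[cat] = counts.get(cat, 0) + 1
    return encoded

categories = ["a", "b", "a", "c", "b", "a"]
targets = [1, 0, 1, 1, 0, 0]
prior = sum(targets) / len(targets)
greedy = greedy_target_stat(categories, targets, prior)
ordered = ordered_target_stat(categories, targets, prior)
print(greedy)
print(ordered)
```

Category `"c"` appears only once. The greedy encoding pulls its value toward that row's own label, while the ordered encoding falls back to the prior because no earlier row shares the category.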

### Leaf value computation

Rather than plain gradient-descent leaf updates, CatBoost can use a second-order (Newton-style) approximation, selected via `leaf_estimation_method`, together with L2 regularization to keep leaf values stable.
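As a rough illustration of the two update styles, the sketch below computes a gradient-style and a Newton-style value for one leaf under Logloss. These are the textbook forms with an L2 term in the denominator; the function name `leaf_values` is hypothetical and CatBoost's internal computation may differ in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_values(raw_scores, labels, l2_leaf_reg=3.0):
    """Gradient-style and Newton-style leaf updates for Logloss within one leaf."""
    probs = [sigmoid(s) for s in raw_scores]
    grads = [p - y for p, y in zip(probs, labels)]  # first derivative of Logloss w.r.t. the raw score
    hess = [p * (1.0 - p) for p in probs]           # second derivative
    gradient_step = -sum(grads) / (len(grads) + l2_leaf_reg)
    newton_step = -sum(grads) / (sum(hess) + l2_leaf_reg)
    return gradient_step, newton_step

# Four objects in the leaf, all currently at raw score 0 (probability 0.5).
gradient_step, newton_step = leaf_values([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 0])
print(gradient_step, newton_step)
```

The Newton step divides by the summed curvature instead of the object count, so it moves further when the current predictions are uncertain. A larger `l2_leaf_reg` shrinks both steps toward zero.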

### Regularization techniques

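CatBoost exposes several regularization controls: an L2 penalty on leaf values (`l2_leaf_reg`), `random_strength`, which adds noise to split scores, and `bagging_temperature` for Bayesian-bootstrap-style sampling. `border_count` controls how finely numeric features are binned, and the overfitting detector (`od_type` set to `IncToDec` or `Iter`, with `od_wait` as the patience) stops training early. The symmetric trees themselves also act as an implicit regularizer.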
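To give a feel for what `bagging_temperature` does, the sketch below draws Bayesian bootstrap weights as `(-log u) ** t` with `u` uniform on (0, 1]. This follows the formula as I understand it from the CatBoost documentation; treat it as an assumption, and note that the helper name is made up.

```python
import math
import random

def bayesian_bootstrap_weights(n, bagging_temperature, seed=0):
    """Sample one weight per object: (-log u) ** t with u uniform on (0, 1]."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n):
        u = 1.0 - rng.random()  # shifts [0, 1) to (0, 1] so log(0) cannot occur
        weights.append((-math.log(u)) ** bagging_temperature)
    return weights

no_bagging = bayesian_bootstrap_weights(5, bagging_temperature=0.0)
exponential = bayesian_bootstrap_weights(5, bagging_temperature=1.0)
print(no_bagging)
print(exponential)
```

With a temperature of 0 every object keeps weight 1, and with a temperature of 1 the weights follow an exponential distribution. Higher temperatures spread the weights further apart.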


### Missing values

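For numerical features, missing values need no imputation: by default (`nan_mode='Min'`) they are treated as smaller than every observed value, so a split can separate them from the rest. Missing values in categorical columns typically need to be converted to a string such as `'NA'` first.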
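One way to picture this, assuming behaviour along the lines of `nan_mode='Min'`, is a quantizer that reserves the lowest bin for missing values. The function name and the borders below are invented for the example.

```python
import math

def quantize_with_nan_min(values, borders):
    """Map values to bin indices; NaN goes to a reserved lowest bin (index 0)."""
    bins = []
    for v in values:
        if math.isnan(v):
            bins.append(0)
        else:
            # Bin 1 holds values at or below the first border, and so on upward.
            bins.append(1 + sum(v > b for b in borders))
    return bins

values = [3.0, float("nan"), 7.5, 1.0]
bins = quantize_with_nan_min(values, borders=[2.0, 5.0])
print(bins)
```

A split on "bin index greater than 0" then separates the missing rows from all observed ones without any imputation step.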

### Categorical cardinality

For high-cardinality features, CatBoost's target statistics and automatic feature combinations (categoricals can be combined with one another) are designed to work well, but extremely high-cardinality features such as near-unique IDs may still need special consideration.

### CatBoost Hyperparameters Reference

| Hyperparameter | Description | Typical Range / Values |
| :--- | :--- | :--- |
| **`iterations`** | The maximum number of trees to build (**n_estimators**). | `500` – `2000` (Use with early stopping) |
| **`learning_rate`** | The step size shrinkage used to prevent overfitting (**eta**). | `0.01` – `0.3` (Lower needs more iterations) |
| **`depth`** | Depth of the symmetric trees. | `4` – `10` |
| **`l2_leaf_reg`** | L2 regularization coefficient on leaf weights. | `1` – `10` |
| **`bagging_temperature`**| Controls Bayesian bootstrap. `0` is no bagging; higher is more aggressive. | `0` – `1` (Up to `10`) |
| **`random_strength`** | Score randomness used to diversify tree splits. | `1` – `20` |
| **`border_count`** | Number of splits for numerical feature binning (discretization). | `32` – `255` |
| **`od_type` / `od_wait`** | Overfitting detector type (`IncToDec`, `Iter`) and patience (iterations). | `Iter`, `20` – `50` |
| **`task_type`** | The processing unit to be used for training. | `'CPU'` or `'GPU'` |

---

### Handling Categorical & Imbalanced Data

* **`cat_features`**: List of indices or names for categorical columns. CatBoost handles these natively.
* **`one_hot_max_size`**: Categorical features with a number of distinct values $\le$ this value are one-hot encoded; features with more distinct values are encoded with target statistics (mean encoding).
* **`class_weights`**: Manual weights for imbalanced classes (e.g., `[1, 5]`).
* **`auto_class_weights`**: Automatically balance classes (`'Balanced'`, `'SqrtBalanced'`).
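The two `auto_class_weights` modes can be approximated by hand. The sketch below assumes the weighting rule as I recall it from the CatBoost documentation (largest class count divided by each class count, with a square root for `'SqrtBalanced'`) and unit object weights; verify against the documentation before relying on exact values.

```python
from collections import Counter
import math

def auto_class_weights(labels, mode="Balanced"):
    """Approximate class weights: largest class count / class count (sqrt for SqrtBalanced)."""
    counts = Counter(labels)
    largest = max(counts.values())
    weights = {}
    for cls, count in counts.items():
        ratio = largest / count
        weights[cls] = ratio if mode == "Balanced" else math.sqrt(ratio)
    return weights

labels = [0] * 90 + [1] * 10
balanced = auto_class_weights(labels, "Balanced")
sqrt_balanced = auto_class_weights(labels, "SqrtBalanced")
print(balanced)
print(sqrt_balanced)
```

`'SqrtBalanced'` is the gentler option: the minority class is up-weighted, but less aggressively than under full balancing.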

### Training & Objectives

* **`loss_function`**: The objective optimized during training (e.g., `Logloss`, `CrossEntropy`, `RMSE`).
* **`eval_metric`**: The metric used for validation and early stopping (e.g., `AUC`, `F1`, `MAE`).
* **`use_best_model`**: If `True`, the model is truncated to the iteration with the best `eval_metric`. This requires an `eval_set`.


### How the model learns — stepwise summary

1. Preprocessing: numeric features are binned (`border_count`); categorical features are encoded with ordered target statistics, or one-hot encoded when cardinality is small.

2. For each iteration:

   * Compute gradients (residuals) of the loss at the current predictions.

   * Train a symmetric tree to approximate the gradients; splits are chosen by loss reduction computed on binned features (and on the statistical transforms of categorical features).

   * Compute leaf values with regularization, using gradients and (optionally) second-order information.

3. Optionally apply bagging (Bayesian bootstrap via `bagging_temperature`) to reduce variance.

4. Throughout, ordered boosting over random permutations limits target leakage from categorical features.
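The binning mentioned in the preprocessing step can be sketched with simple quantile borders. This is an illustrative stand-in: CatBoost's own border-selection strategies differ, and the helper names here are hypothetical.

```python
from bisect import bisect_left

def quantile_borders(values, border_count):
    """Pick up to border_count borders at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    borders = []
    for k in range(1, border_count + 1):
        pos = k * len(ordered) // (border_count + 1)
        borders.append(ordered[pos])
    return sorted(set(borders))

def to_bin(value, borders):
    """Bin index = number of borders strictly below the value."""
    return bisect_left(borders, value)

values = [float(v) for v in range(1, 13)]  # 1.0 .. 12.0
borders = quantile_borders(values, border_count=3)
binned = [to_bin(v, borders) for v in values]
print(borders)
print(binned)
```

After this step the tree search only has to consider the borders as candidate thresholds, which is what makes split finding on binned features cheap.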


### Is Feature Scaling Required for CatBoost?

**Short Answer: No.** Unlike distance-based algorithms (like KNN or SVM) or gradient-based models (like Neural Networks), CatBoost is an ensemble of **decision trees**. Decision trees are **invariant to monotonic transformations**.

### Why Scaling Isn't Necessary
* **Split Invariance:** Trees make decisions based on thresholds (e.g., `feature > 100`). Whether that 100 is scaled to 0.5 or 1,000,000, the "split" point remains functionally identical.
* **Rank-Based:** CatBoost looks at the relative order of values rather than their absolute magnitude.
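A small experiment makes the invariance tangible. The sketch below finds the best squared-error split of a single feature and checks that the rows sent to the left stay the same after a log transform and after min-max scaling. The data and the helper `best_split_partition` are invented for illustration.

```python
import math

def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the best squared-error stump."""
    best_sse, best_left = None, None
    for threshold in sorted(set(xs))[:-1]:
        left = [i for i, x in enumerate(xs) if x <= threshold]
        right = [i for i, x in enumerate(xs) if x > threshold]
        sse = 0.0
        for side in (left, right):
            mean = sum(ys[i] for i in side) / len(side)
            sse += sum((ys[i] - mean) ** 2 for i in side)
        if best_sse is None or sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

xs = [10.0, 200.0, 35.0, 1000.0, 80.0, 500.0]
ys = [1.0, 3.0, 1.2, 3.4, 1.1, 3.1]

raw = best_split_partition(xs, ys)
logged = best_split_partition([math.log(x) for x in xs], ys)
scaled = best_split_partition([(x - 10.0) / 990.0 for x in xs], ys)
print(sorted(raw), sorted(logged), sorted(scaled))
```

Only the ordering of the feature values matters to the split search, so any strictly increasing transformation changes the threshold value but not which rows fall on each side.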

---

### ⚠️ Exceptions: When we *Should* Scale
While CatBoost doesn't need it, our **overall pipeline** might:

1.  **Stacked Ensembles:** If we are feeding CatBoost predictions into a "meta-learner" (like a Logistic Regression) or combining it with distance-based models (KNN), those secondary models will still require scaled input.
2.  **Downstream Linear Models:** If we reuse the same preprocessed data for a linear baseline, scaling still matters for that model whenever it is regularized (Ridge, Lasso) or fitted by gradient descent.
3.  **Manual Encoding:** If we perform manual one-hot encoding or other transformations that we intend to use with non-tree models downstream.

---

> **Conclusion for Standalone CatBoost:** We can skip `StandardScaler` or `MinMaxScaler` entirely. This saves computation time and keeps split thresholds interpretable (e.g., a rule like "Price > $500" is easier to read than "Price > 0.12").

# Practical demo 1 — classification

In [1]:
# Demo 1: Quick baseline with CatBoostClassifier on breast cancer dataset
!pip install catboost
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from catboost import CatBoostClassifier, Pool

# 1. Load
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Create CatBoost Pool (no categorical features in this dataset)
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

# 4. Define model with sensible defaults
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100,
    early_stopping_rounds=50,
    task_type='CPU',  # or 'GPU' if available
    use_best_model=True
)

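# 5. Fit (note: the test set doubles as the early-stopping eval set here, so the
#    scores below are optimistic; hold out a separate validation split for an unbiased estimate)
model.fit(train_pool, eval_set=test_pool)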

# 6. Evaluate
preds_proba = model.predict_proba(X_test)[:,1]
preds = (preds_proba > 0.5).astype(int)
print("Accuracy:", accuracy_score(y_test, preds))
print("AUC:", roc_auc_score(y_test, preds_proba))
print(classification_report(y_test, preds))

0:	test: 0.9742063	best: 0.9742063 (0)	total: 59.5ms	remaining: 59.4s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.9973544974
bestIteration = 5

Shrink model to first 6 iterations.
Accuracy: 0.9736842105263158
AUC: 0.9973544973544973
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        42
           1       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



# Practical demo 2 — pipeline + hyperparameter search

In [2]:
# Demo 2: RandomizedSearchCV using Iris (with a synthetic categorical feature)
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostClassifier

# 1. Load and create synthetic categorical column
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# create a categorical feature by binning petal length into 3 categories
X['petal_length_cat'] = pd.cut(X['petal length (cm)'], bins=3, labels=['short', 'medium', 'long'])

# our categorical column is last column index
cat_feature_name = 'petal_length_cat'
cat_features = [X.columns.get_loc(cat_feature_name)]  # pass indices to CatBoost

# 2. For CatBoost we can pass raw DataFrame (no need to one-hot). But scikit-learn's RandomizedSearchCV expects an estimator.
cbc = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    loss_function='MultiClass',
    random_seed=42,
    verbose=0,  # silence
    early_stopping_rounds=50,  # has no effect here: no eval_set is passed during the search
    use_best_model=False       # must be False when fitting without an eval_set
)

# 3. Parameter distributions for RandomizedSearch
from scipy.stats import randint, uniform
param_dist = {
    'learning_rate': uniform(0.01, 0.2),  # 0.01 - 0.21
    'depth': randint(3, 9),               # 3 - 8
    'l2_leaf_reg': uniform(1, 10),        # 1 - 11
    'bagging_temperature': uniform(0, 1.5),
    'iterations': [200, 500, 800]
}

# 4. Create RandomizedSearchCV and fit (note: we pass cat_features at fit time)
search = RandomizedSearchCV(
    estimator=cbc,
    param_distributions=param_dist,
    n_iter=20,
    cv=4,
    scoring='accuracy',
    random_state=42,
    n_jobs=1  # CatBoost can be multi-threaded internally; set n_jobs=1 to avoid conflicts
)

# 5. Fit with cat_features passed as fit param (scikit-learn will forward it to CatBoost.fit)
search.fit(X, y, cat_features=cat_features)

# 6. Best result
print("Best score:", search.best_score_)
print("Best params:", search.best_params_)

Best score: 0.9733285917496444
Best params: {'bagging_temperature': np.float64(0.5618101782710437), 'depth': 7, 'iterations': 800, 'l2_leaf_reg': np.float64(8.31993941811405), 'learning_rate': np.float64(0.12973169683940733)}


# CatBoost: Practical Implementation & Strategy Guide

### Implementation Notes
* **Kwargs Forwarding:** Passing `cat_features` into `search.fit(...)` works because `RandomizedSearchCV` forwards extra keyword arguments directly to the underlying estimator’s `.fit()` method.
* **No Scaling:** Numerical features are kept in their raw state. Scaling is only necessary if CatBoost is part of a pipeline with distance-based models (KNN, SVM, etc.).
* **Hardware:** For very large datasets, set `task_type='GPU'` and select the device with `devices='0'`.

---

### The Tuning Checklist
| Focus Area | Strategy |
| :--- | :--- |
| **Learning Rate** | Lower `learning_rate` generalizes better; increase `iterations` to compensate. Use early stopping. |
| **Tree Depth** | Small data: `4–6`. Complex data: `7–10`. Higher depth = higher overfitting risk. |
| **Regularization** | Use `l2_leaf_reg`, `random_strength`, and `bagging_temperature` to manage variance. |
| **Categoricals** | Use `one_hot_max_size` to toggle between OHE and mean encoding. |
| **Imbalance** | Set `auto_class_weights='Balanced'` for lopsided datasets (e.g., "Will they show up?" vs "No-shows"). |
| **Validation** | Use **Stratified CV** for classification; use **Time-Series CV** for event data with temporal trends. |

---

### Bias-Variance Tradeoff (The "Tuning Knobs")



* **To Fix Bias (Underfitting):**
    * ↑ `iterations`, ↑ `depth`, ↑ `border_count`.
    * ↓ `l2_leaf_reg`.
* **To Fix Variance (Overfitting):**
    * ↓ `depth`, ↓ `iterations` (Early Stopping).
    * ↑ `l2_leaf_reg`, ↑ `bagging_temperature`, ↑ `random_strength`.

---

### Limitations & Edge Cases
* **Sparse Data:** Linear models or Embeddings often outperform CatBoost on high-dimensional sparse data (e.g., large-scale TF-IDF).
* **Small Data:** High risk of overfitting; requires heavy regularization.
* **Temporal Leakage:** Naive CV fails on time-series data; use ordered splits to avoid "predicting the past using the future."
* **Calibration:** For critical probability tasks, we may need Isotonic Regression or Platt Scaling post-training.
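For the temporal leakage point above, a minimal expanding-window splitter looks like the sketch below. It assumes rows are already sorted by time, and the function name is made up. scikit-learn's `TimeSeriesSplit` provides a similar, more complete tool.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where every test row comes after all train rows."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last fold absorbs any leftover rows at the end.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each fold trains on everything up to a cut-off and validates on the block that follows, so no fold ever predicts the past using the future.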

---

### Advanced Interview Concepts
* **Ordered Boosting:** A permutation-based approach to compute categorical statistics that prevents target leakage.
* **Oblivious Trees:** Symmetric trees that use the same split across an entire level, leading to faster inference and better generalization.
* **Native Text Support:** Handles `text_features` directly through built-in tokenization and bag-of-words style feature estimators; precomputed embeddings can be supplied via `embedding_features`.
* **Interpretability:** Native SHAP values via `get_feature_importance(data=pool, type='ShapValues')`, where `pool` is the dataset to explain.

---

### Interview Q&A Quick-Fire
**Q: Why CatBoost over XGBoost/LightGBM?**
**A:** Native handling of categorical features (no manual encoding), fast inference via symmetric trees, and reduced target leakage through ordered boosting.

**Q: Does it need scaling?**
**A:** No. Decision trees are invariant to monotonic transformations.

**Q: What is Ordered Boosting?**
**A:** It computes categorical statistics using only "past" data points in a random permutation to avoid looking at the label of the current object.

---

### Pro-Tips & "Gotchas"
* **Reproducibility:** Always set `random_seed`.
* **Consistency:** Ensure `cat_features` are passed identically during CV and final `.fit()`.
* **Visualization:** Pass `plot=True` to `.fit()` in Jupyter to see real-time loss curves.
* **Pipelines:** It is often easier to keep CatBoost *outside* a standard Scikit-Learn `ColumnTransformer` if we want it to handle raw categories itself.