# XGBoost

XGBoost stands for Extreme Gradient Boosting. It's a powerful and scalable supervised machine learning algorithm based on the gradient boosting framework that uses decision trees as base learners.

Purpose: Primarily used for regression and classification tasks.

Fame: A frequent component of winning Kaggle solutions and widely used in industry for its performance on structured/tabular data.

XGBoost builds trees sequentially, where each new tree is trained to correct the prediction errors (residuals) made by all the previous trees combined, using a gradient descent approach to minimize a defined loss function.

Mathematical Intuition 
1. Prediction (Additive Model):
ŷ_i = Σ (from k=1 to K) f_k(x_i)

The final prediction ŷ_i for instance i is the sum of predictions from K sequential trees (f_k).

2. Regularized Objective Function (The XGBoost Genius):
Obj(θ) = Σ L(y_i, ŷ_i) + Σ Ω(f_k)

Loss Term (L): Measures how well the model fits the data (e.g., squared error for regression, log loss for classification).

Regularization Term (Ω): Penalizes model complexity to prevent overfitting. A typical form is: Ω(f) = γT + (1/2)λ||w||²

T: Number of leaves in the tree.

w: Vector of leaf weights (predictions).

γ (gamma) and λ (lambda): Hyperparameters controlling the penalty.

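XGBoost uses a second-order approximation of the loss function (both the gradient and the Hessian). This yields a closed-form optimal weight for each leaf and a closed-form gain for each candidate split, which typically makes optimization converge faster than gradient boosting that uses only first-order gradients.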
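To make the second-order idea concrete, here is a minimal sketch of the closed-form leaf weight and split gain that follow from the regularized objective above (with only the L2 term, no L1). The function names and the toy data are assumptions for illustration and are not part of the XGBoost API. For squared error L = ½(y − ŷ)², the gradient is g = ŷ − y and the Hessian is h = 1.

```python
import numpy as np

def leaf_score(g, h, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

def optimal_leaf_weight(g, h, lam):
    """Leaf weight that minimizes the second-order objective: -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Gain of splitting one leaf into two children, minus the gamma penalty."""
    children = (leaf_score(g[left_mask], h[left_mask], lam)
                + leaf_score(g[~left_mask], h[~left_mask], lam))
    return 0.5 * (children - leaf_score(g, h, lam)) - gamma

# Toy data (illustrative only): four targets, current prediction 0 everywhere
y = np.array([1.0, 1.2, 3.0, 3.2])
y_hat = np.zeros_like(y)
g = y_hat - y            # gradient of 1/2 * (y - y_hat)^2 with respect to y_hat
h = np.ones_like(y)      # Hessian of squared error is 1
left_mask = np.array([True, True, False, False])

print("leaf weight:", optimal_leaf_weight(g, h, lam=1.0))
print("split gain:", split_gain(g, h, left_mask, lam=1.0, gamma=0.0))
```

A larger λ shrinks the leaf weight toward zero, and a larger γ makes a split less likely to be accepted, since a split is only worth keeping when its gain is positive.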

Why XGBoost is So Powerful

Built-in Regularization: Controls overfitting directly via L1/L2 penalties on leaf weights and a per-leaf penalty (γ) on tree size.

Handles Sparse/Missing Data: Automatically learns the best direction to send missing values during tree construction.

Computational Speed & Scalability: Parallel processing, out-of-core computation, and a cache-aware block structure for large datasets.

High Predictive Accuracy: Often among the strongest performers on structured/tabular problems.


| Category | Parameter | Description | Tuning Notes |
| :--- | :--- | :--- | :--- |
| Boosting Process | `n_estimators` | Number of boosting rounds/trees. | Increase for a better fit, at the risk of overfitting. |
| Boosting Process | `learning_rate` (eta) | Shrinks the contribution of each tree. | A low value (0.01-0.3) with a high `n_estimators` often works best. |
| Tree Structure | `max_depth` | Maximum depth of a tree. | Controls complexity; typically 3-10. |
| Tree Structure | `min_child_weight` | Minimum sum of instance weight (Hessian) needed in a child node. | Higher values prevent overfitting. |
| Tree Structure | `gamma` | Minimum loss reduction required to make a further partition on a leaf node. | Acts as a complexity check. |
| Randomization | `subsample` | Fraction of training data sampled for each tree. | Introduces randomness (as in Random Forest). |
| Randomization | `colsample_bytree` | Fraction of features sampled for each tree. | Reduces overfitting and correlation between trees. |
| Regularization | `reg_alpha` (alpha) | L1 regularization on leaf weights. | Pushes small leaf weights toward zero (sparsity). |
| Regularization | `reg_lambda` (lambda) | L2 regularization on leaf weights. | More common; makes predictions smoother. |


XGBoost Regressor Workflow 

Initialize with a simple prediction (often the mean of the target variable).

For m = 1 to M (number of trees):
a. Compute the residuals (negative gradient) for all data points.
b. Fit a new weak learner (decision tree) to predict these residuals.
c. Update the model by adding the new tree's predictions (shrunken by the learning_rate).

Output the final model as the sum of all tree predictions.
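The workflow above can be sketched from scratch in a few lines. This is a simplified illustration, not XGBoost's implementation: it uses plain first-order gradient boosting with squared error, depth-1 trees (stumps) on a single feature, and no regularization. All function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the one-threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        sse = (((residuals[left] - left_value) ** 2).sum()
               + ((residuals[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]                               # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Step 1: start from the mean. Step 2: repeatedly fit a stump to the residuals."""
    base = y.mean()
    pred = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred                      # negative gradient of squared error
        stump = fit_stump(x, residuals)
        pred += learning_rate * predict_stump(stump, x)   # shrunken update
        stumps.append(stump)
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}

def predict_boosting(model, x):
    """Step 3: the final prediction is the base value plus the sum of shrunken trees."""
    pred = np.full(len(x), model["base"], dtype=float)
    for stump in model["stumps"]:
        pred += model["learning_rate"] * predict_stump(stump, x)
    return pred

# Synthetic data for illustration: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = fit_boosting(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict_boosting(model, x)) ** 2)
print(f"MSE with mean only: {mse_baseline:.4f}, after boosting: {mse_boosted:.4f}")
```

Each round only moves the prediction a fraction (`learning_rate`) of the way toward the fitted residuals, which is why a low learning rate needs more rounds.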

When to Use XGBoost Regressor?
You have structured/tabular data (not images/text).

Predictive accuracy is the primary goal over model interpretability.

You have sufficient computational resources for training and hyperparameter tuning.

The relationships in your data are complex and non-linear.



In [1]:
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Example data: a synthetic regression problem (swap in your own X, y)
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train
model = XGBRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

Q: Why is XGBoost often better than standard Gradient Boosting Machines (GBM)?

A: Regularization, efficient handling of missing data, use of second-order derivatives (Hessian) for faster convergence, and advanced tree pruning.

Q: How does XGBoost handle missing values?

A: During training, it learns the default direction (left or right child) for missing values at each split that minimizes loss.
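As a rough sketch of the idea in this answer: for a candidate split, compute the gain twice, once with the missing rows sent left and once sent right, and keep the better direction. The gain formula is the second-order one from the regularized objective; the helper names and toy data are assumptions for illustration, not XGBoost internals.

```python
import numpy as np

def leaf_score(g, h, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

def best_default_direction(x, g, h, threshold, lam=1.0):
    """Try sending missing values left, then right; return the direction with higher gain."""
    missing = np.isnan(x)
    goes_left = np.zeros(len(x), dtype=bool)
    goes_left[~missing] = x[~missing] <= threshold   # only compare non-missing values
    gains = {}
    for direction in ("left", "right"):
        left = goes_left | missing if direction == "left" else goes_left
        right = ~left
        gains[direction] = 0.5 * (leaf_score(g[left], h[left], lam)
                                  + leaf_score(g[right], h[right], lam)
                                  - leaf_score(g, h, lam))
    return max(gains, key=gains.get), gains

# Toy data: rows with a missing feature have targets that look like the right-hand group
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
g = 0.0 - y                 # squared-error gradient with current prediction 0
h = np.ones_like(y)

direction, gains = best_default_direction(x, g, h, threshold=5.0)
print(direction, gains)
```

On this toy data the missing rows are routed to the right child, because grouping them with the other high-target rows gives the larger gain.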

Q: How can you prevent overfitting in XGBoost?

A: Use a combination of: 1) Lower max_depth, 2) Increase min_child_weight and gamma, 3) Use subsample and colsample_bytree, 4) Apply stronger L1/L2 regularization (alpha, lambda), 5) Reduce learning_rate while increasing n_estimators.

In [None]:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the classifier
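# For binary classification, the default objective is 'binary:logistic'
# For multi-class, use 'multi:softmax' (labels) or 'multi:softprob' (probabilities);
# num_class is required by the native API, while XGBClassifier can infer it from y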
model = xgb.XGBClassifier(
    objective='multi:softmax',  # For multi-class classification
    num_class=3,                 # Number of classes in the Iris dataset
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)  # Predicts class labels
# y_pred_proba = model.predict_proba(X_test)  # Predicts class probabilities

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Important Parameters for Classification
While you already know many parameters (like max_depth, eta) from the regressor, a few are particularly important for classification:

scale_pos_weight : Crucial for imbalanced datasets. A common value is (number of negative class samples) / (number of positive class samples). This tells the model to pay more attention to the minority class.

eval_metric: While training, it's helpful to monitor metrics like 'logloss', 'error' (classification error), or 'auc' (for binary classification).

max_delta_step: Can sometimes help stabilize training in logistic regression for extremely imbalanced classes.
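A small sketch of the `scale_pos_weight` rule of thumb mentioned above (negatives divided by positives). The helper name and the toy labels are assumptions for illustration; the resulting value would be passed as `scale_pos_weight=` when constructing a binary classifier.

```python
import numpy as np

def suggest_scale_pos_weight(y):
    """Rule-of-thumb weight for the positive class: n_negative / n_positive."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("No positive samples; the ratio is undefined.")
    return n_neg / n_pos

# Toy imbalanced labels: 95 negatives, 5 positives
y_imbalanced = np.array([0] * 95 + [1] * 5)
weight = suggest_scale_pos_weight(y_imbalanced)
print("scale_pos_weight:", weight)   # 95 / 5 = 19.0
```

This is only a starting point; the value is usually tuned alongside the decision threshold and an imbalance-aware metric such as AUC.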

Q: How does XGBoost handle a multi-class classification problem?

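A: It optimizes a softmax (multinomial cross-entropy) objective. In each boosting round it builds one tree per class, so every class accumulates its own raw score; the scores are passed through softmax to get class probabilities. With objective='multi:softprob' the output is the probability vector, and with objective='multi:softmax' it is the class with the highest probability. This resembles one-vs-all in that there is one set of trees per class, but the classes are trained jointly through the shared softmax loss rather than as independent binary classifiers.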
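A minimal sketch of the final step of multi-class prediction: turning one raw score per class into probabilities and then picking the most probable class. The raw scores below are made-up numbers for illustration, not output from a trained model.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps the exponentials numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical summed tree outputs: 2 samples x 3 classes
raw_scores = np.array([
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 3.0],
])

proba = softmax(raw_scores)        # what a 'multi:softprob'-style output looks like
labels = proba.argmax(axis=1)      # what a 'multi:softmax'-style output looks like

print(proba.round(3))
print(labels)
```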

Q: When would you choose XGBoost Classifier over a Random Forest?

A: When your dataset is large, you need the highest possible accuracy and have the time/resources for careful tuning. Random Forest is excellent and more robust to overfitting with less tuning, but XGBoost's gradient boosting often achieves a slightly higher performance ceiling at the cost of complexity.

Q: The classifier is overfitting. Which parameters would you adjust first?

A: 1) Increase reg_alpha (L1) and reg_lambda (L2) for stronger regularization. 2) Reduce max_depth to make trees simpler. 3) Lower learning_rate and increase n_estimators. 4) Use subsample and colsample_bytree to introduce more randomness.

### XGBoost: Regressor vs. Classifier

| Aspect | XGBoost Regressor | XGBoost Classifier |
| :--- | :--- | :--- |
| **Primary Task** | Predicts continuous numeric values. | Predicts discrete class labels. |
| **Core Objective** | Minimizes a regression loss (e.g., Squared Error). | Minimizes Log Loss (cross-entropy) on class probabilities. |
| **Default Objective**| `reg:squarederror` | `binary:logistic` or `multi:softprob`. |
| **Output Type** | Real numbers ($y \in \mathbb{R}$). | Probabilities or Class labels. |
| **Unique Params** | Standard tuning. | `num_class` (required by the native API for multi-class; inferred by `XGBClassifier`), `scale_pos_weight`. |
| **Metrics** | RMSE, MAE, $R^2$. | Accuracy, F1, Log Loss, AUC. |


XGBoost = boosted decision trees that sequentially reduce errors.

Regressor → continuous predictions, Classifier → class labels.

Regularization + tree parameters = key to performance.

Handles missing/sparse data, fast, accurate, widely used in ML competitions.


# LightGBM

LightGBM = Gradient Boosting framework from Microsoft.

Similar to XGBoost but faster and more memory-efficient.

Uses Decision Trees as base learners.

Designed for large datasets and high-dimensional data.

Works for regression and classification.

Key Features

Gradient Boosting algorithm with leaf-wise growth (more aggressive than depth-wise growth in XGBoost).

Handles categorical features directly (no need for one-hot encoding).

Supports missing values automatically.

Efficient for sparse data.

Faster training than XGBoost for large datasets.

Regularization reduces overfitting.

Leaf-wise vs Level-wise Growth

Leaf-wise (LightGBM): Splits the leaf with the largest loss reduction.

More accurate but can overfit on small datasets.

Level-wise (XGBoost): Splits all nodes at the same depth.

Safer for small datasets.

Histogram-based Learning
Buckets continuous features into discrete bins

Uses histogram-based gradient calculation

Faster than the pre-sorted (exact) split-finding algorithm originally used by XGBoost (XGBoost now also offers a histogram-based tree method)

Reduces memory usage significantly
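The histogram idea above can be sketched with NumPy: bucket a continuous feature into a small number of bins, accumulate gradient sums per bin, and then scan bin boundaries instead of every distinct value. This is an illustration under simplifying assumptions (quantile bin edges, random gradients), not LightGBM's actual binning code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)          # one continuous feature
g = rng.normal(size=1000)          # per-sample gradients (random here, for illustration)
n_bins = 16

# Interior bin edges at evenly spaced quantiles (n_bins - 1 edges)
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, x)            # integer bin id in [0, n_bins - 1]

# Histograms: sum of gradients and number of samples per bin
grad_hist = np.bincount(bin_idx, weights=g, minlength=n_bins)
count_hist = np.bincount(bin_idx, minlength=n_bins)

# Candidate splits are the n_bins - 1 boundaries; left-side sums come from a running total
left_grad = np.cumsum(grad_hist)[:-1]
right_grad = grad_hist.sum() - left_grad

print("bins:", n_bins, "candidate splits:", len(left_grad))
```

After binning, the feature can be stored as small integers, and the number of split candidates per feature drops from up to 999 distinct boundaries to 15 in this example.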

Leaf-wise Tree Growth
Traditional (Level-wise): Expands all leaves at same depth (balanced but less accurate)

Leaf-wise (LightGBM): Expands leaf with max delta loss

Better accuracy (can capture more complex patterns)

Risk of overfitting on small datasets

Add max_depth constraint for control

Gradient-based One-Side Sampling (GOSS)
Keeps all instances with large gradients

Randomly samples instances with small gradients

Focuses computation where it matters most

Maintains accuracy while reducing data size
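A rough sketch of GOSS as described above: keep the samples with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled small-gradient rows by (1 − a) / b so the gradient sums stay approximately unbiased. The function name is an assumption; `top_rate` (a) and `other_rate` (b) mirror the names LightGBM uses for these fractions.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return selected row indices and their weights under one-side sampling."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))        # largest |gradient| first
    top_idx = order[:n_top]                        # always kept
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate   # compensate for undersampling
    return np.concatenate([top_idx, sampled_idx]), weights

# Illustration with random gradients
gradients = np.random.default_rng(1).normal(size=1000)
idx, weights = goss_sample(gradients)
print("rows used:", len(idx), "of", len(gradients))
print("weight on sampled small-gradient rows:", weights[-1])
```

Here 300 of 1000 rows are used per tree, with the 100 sampled small-gradient rows each counted eight times.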

Exclusive Feature Bundling (EFB)
Bundles mutually exclusive features (rarely non-zero simultaneously)

Reduces effective number of features

Crucial for high-dimensional sparse data
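A toy sketch of the bundling idea: two sparse features that are never non-zero on the same row can share one column if the second feature's values are shifted by an offset. This is a simplified illustration with non-negative integer features and a hypothetical helper name; LightGBM's real EFB works on bin indices and tolerates a small number of conflicts.

```python
import numpy as np

def bundle_exclusive(a, b):
    """Merge two mutually exclusive non-negative features into one column."""
    if np.any((a != 0) & (b != 0)):
        raise ValueError("Features conflict: both are non-zero on some row.")
    offset = a.max()                               # b's values are shifted past a's range
    return np.where(b != 0, b + offset, a), offset

a = np.array([1, 0, 3, 0, 0, 2])
b = np.array([0, 2, 0, 1, 0, 0])

bundle, offset = bundle_exclusive(a, b)
print(bundle)                                      # [1 5 3 4 0 2]

# The original features can still be told apart by the value range
a_back = np.where(bundle <= offset, bundle, 0)
b_back = np.where(bundle > offset, bundle - offset, 0)
```

Because the value ranges do not overlap, a split on the bundled column can still separate rows by either original feature.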

| Parameter          | Type   | Description                             |
| ------------------ | ------ | --------------------------------------- |
| `num_leaves`       | int    | Max number of leaves in one tree.       |
| `max_depth`        | int    | Max depth of each tree.                 |
| `learning_rate`    | float  | Shrinks weight of new trees.            |
| `n_estimators`     | int    | Number of boosting rounds.              |
| `min_data_in_leaf` | int    | Minimum number of samples per leaf.     |
| `feature_fraction` | float  | Fraction of features used per tree.     |
| `bagging_fraction` | float  | Fraction of data sampled for each tree. |
| `bagging_freq`     | int    | Frequency for bagging (0 = disabled).   |
| `lambda_l1`        | float  | L1 regularization.                      |
| `lambda_l2`        | float  | L2 regularization.                      |
| `objective`        | string | Task type (regression/classification).  |



Common objectives:

Regression: regression, huber, fair

Binary classification: binary

Multi-class classification: multiclass




| Feature           | LGBMRegressor                | LGBMClassifier                   |
| ----------------- | ---------------------------- | -------------------------------- |
| Task              | Regression (predict numbers) | Classification (predict classes) |
| Objective         | `regression` (default)       | `binary`, `multiclass`           |
| Output            | Continuous values            | Class labels / probabilities     |
| Evaluation Metric | RMSE, MAE, R²                | Accuracy, AUC, Log Loss          |


In [None]:
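from lightgbm import LGBMRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Example regression data (swap in your own X, y)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LGBMRegressor(n_estimators=100, learning_rate=0.1, num_leaves=31)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Take the square root manually: squared=False is not accepted by newer scikit-learn releases
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)


In [None]:
from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example classification data (the regression target above is not valid for a classifier)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LGBMClassifier(n_estimators=100, learning_rate=0.1, num_leaves=31)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))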


When to choose LightGBM?

Large datasets (>10K samples)

High-dimensional data

Need fast training

Categorical features present

Key Advantages

Speed: Histogram algorithm + GOSS

Memory: EFB + histogram binning

Accuracy: Leaf-wise growth

Convenience: Handles categorical/missing data

Explain Leaf-wise Growth
"LightGBM uses leaf-wise growth which expands the leaf with the highest loss reduction, creating deeper, more complex trees on one side while keeping other parts shallow. This is more efficient than level-wise growth but requires careful regularization."

Hyperparameter Tuning Priority

learning_rate and n_estimators (with early stopping)

num_leaves and max_depth

min_data_in_leaf and min_sum_hessian_in_leaf

Regularization parameters

Sampling parameters (feature_fraction, bagging_fraction)

Missing Value Handling
"LightGBM learns the best direction to assign missing values during training by evaluating which split direction yields the largest gain. No imputation needed."




LightGBM = Fast Gradient Boosting using leaf-wise trees.

Regressor → numeric, Classifier → classes.

Handles categorical & missing data efficiently.

Hyperparameters like num_leaves and learning_rate are key to performance.

In [None]:

import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Prepare data (example binary classification dataset; swap in your own X, y)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


model = LGBMClassifier(
    n_estimators=1000,           # Large number with early stopping
    learning_rate=0.05,
    num_leaves=31,
    max_depth=-1,
    min_child_samples=20,
    subsample=0.8,              # Bagging fraction
    subsample_freq=1,           # Bagging only takes effect when the frequency is > 0
    colsample_bytree=0.8,       # Feature fraction
    reg_alpha=0.1,              # L1
    reg_lambda=0.1,             # L2
    random_state=42,
    n_jobs=-1,
    importance_type='gain'
)

# Train with early stopping
# LightGBM 4.x takes early stopping and logging as callbacks rather than fit() keyword arguments
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='logloss',
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=10),
    ]
)

print(f"Best iteration: {model.best_iteration_}")
print(f"Best score: {model.best_score_}")

# CatBoost

CatBoost = Gradient Boosting algorithm developed by Yandex.

Designed to handle categorical features automatically.

Uses ordered boosting → reduces overfitting.

Can be used for regression and classification.

Handles large datasets efficiently.

Key Features

Categorical Feature Handling: No need for one-hot encoding.

Ordered Boosting: Reduces overfitting on small datasets.

Supports missing values automatically.

Fast training with GPU support.

Regularization to avoid overfitting.

Supports multiclass classification and regression.

Feature importance available (model.get_feature_importance()).
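To illustrate how categorical handling and ordered boosting fit together, here is a minimal sketch of an ordered target statistic: each row's category is encoded using only the targets of rows that come before it in a random permutation, so a row never sees its own label. The function name, the `prior` value, and the toy data are assumptions for illustration; CatBoost's actual implementation uses several permutations and further refinements.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, order=None, seed=0):
    """Encode each row as (sum of earlier targets in its category + prior) / (earlier count + 1)."""
    n = len(categories)
    if order is None:
        order = np.random.default_rng(seed).permutation(n)
    sums, counts = {}, {}
    encoded = np.empty(n, dtype=float)
    for i in order:
        c = categories[i]
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        # Only after encoding row i is its own target added to the running statistics
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 0, 1, 1]

# Use the natural row order here so the result is easy to check by hand
encoded = ordered_target_stats(categories, targets, prior=0.5, order=range(5))
print(encoded)   # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior, which is the price paid for avoiding target leakage.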

How it Works

Based on gradient boosting over decision trees.

Handles categorical features using “statistics on combinations of features”.

Uses symmetric trees → all leaves at same depth → faster predictions.

Ordered boosting prevents target leakage during training.
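A small sketch of why symmetric (oblivious) trees predict quickly: every level uses the same feature and threshold for all nodes, so a sample's leaf is just the binary number formed by its level-by-level comparison results. The tree below (features, thresholds, leaf values) is made up for illustration and is not produced by CatBoost.

```python
import numpy as np

def predict_oblivious(X, feature_idx, thresholds, leaf_values):
    """Evaluate one symmetric tree: one comparison per level, then a table lookup."""
    leaf = np.zeros(len(X), dtype=int)
    for f, t in zip(feature_idx, thresholds):
        bit = (X[:, f] > t).astype(int)
        leaf = leaf * 2 + bit                      # append this level's bit to the leaf index
    return leaf_values[leaf]

# Hypothetical depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 1.0
feature_idx = [0, 1]
thresholds = [0.5, 1.0]
leaf_values = np.array([10.0, 20.0, 30.0, 40.0])   # 2^depth leaves

X_demo = np.array([
    [0.0, 0.0],   # bits 0,0 -> leaf 0
    [0.0, 2.0],   # bits 0,1 -> leaf 1
    [1.0, 0.0],   # bits 1,0 -> leaf 2
    [1.0, 2.0],   # bits 1,1 -> leaf 3
])
pred = predict_oblivious(X_demo, feature_idx, thresholds, leaf_values)
print(pred)       # [10. 20. 30. 40.]
```

There is no per-node branching: prediction is `depth` vectorized comparisons followed by an array lookup.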


| Parameter             | Type   | Description                               |
| --------------------- | ------ | ----------------------------------------- |
| `iterations`          | int    | Number of trees / boosting rounds.        |
| `learning_rate`       | float  | Shrinks weight of new trees.              |
| `depth`               | int    | Depth of each tree.                       |
| `l2_leaf_reg`         | float  | L2 regularization coefficient.            |
| `border_count`        | int    | Number of splits for numeric features.    |
| `bagging_temperature` | float  | Controls randomness in selecting samples. |
| `random_seed`         | int    | Seed for reproducibility.                 |
| `task_type`           | string | `'CPU'` or `'GPU'`.                       |
| `loss_function`       | string | Objective function for task.              |
| `eval_metric`         | string | Metric for validation.                    |


Common loss_function values:

Regression: RMSE, MAE, Quantile

Binary classification: Logloss, CrossEntropy

Multi-class classification: MultiClass, MultiClassOneVsAll



| Feature              | CatBoostRegressor            | CatBoostClassifier               |
| -------------------- | ---------------------------- | -------------------------------- |
| Task                 | Regression (predict numbers) | Classification (predict classes) |
| Loss function        | RMSE, MAE                    | Logloss, CrossEntropy            |
| Output               | Continuous values            | Class labels / probabilities     |
| Categorical Features | Supported automatically      | Supported automatically          |







In [None]:
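from catboost import CatBoostRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Example regression data (swap in your own X, y)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6, verbose=0)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred = model.predict(X_test)
# Take the square root manually: squared=False is not accepted by newer scikit-learn releases
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)


In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example binary classification data (the regression target above is not valid for a classifier)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=0)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))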


Mention categorical feature handling without one-hot encoding.

Ordered boosting → reduces overfitting on small datasets.

Symmetric trees → faster prediction.

Hyperparameter tuning: iterations, learning_rate, depth, l2_leaf_reg.

Can use GPU for faster training.

Popular in Kaggle competitions for tabular data with categorical features.

CatBoost = Gradient Boosting with native categorical handling & ordered boosting.

Regressor → numeric output, Classifier → class labels.

Reduces overfitting → good for small & medium datasets.

Efficient, accurate, GPU-compatible.



| Feature                           | **XGBoost**                                                                                                                                                         | **LightGBM**                                                                                                                                                                                            | **CatBoost**                                                                                                                                                                        |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Definition**                    | eXtreme Gradient Boosting; gradient boosting using decision trees with regularization                                                                               | Light Gradient Boosting Machine; faster gradient boosting using **leaf-wise growth**                                                                                                                    | Categorical Boosting; gradient boosting with **native categorical handling** and **ordered boosting**                                                                               |
| **Key Idea**                      | Sequentially builds trees to reduce residual error                                                                                                                  | Leaf-wise tree growth → splits largest loss leaf → faster & more accurate                                                                                                                               | Ordered boosting → reduces overfitting, handles categorical features automatically                                                                                                  |
| **Why Use / Advantages**          | - Accurate, robust<br>- Handles missing values<br>- Regularization to reduce overfitting<br>- Widely used in competitions                                           | - Very fast & memory efficient<br>- Handles large datasets<br>- Native categorical support<br>- High accuracy due to leaf-wise trees                                                                    | - Handles categorical features **without encoding**<br>- Reduces overfitting on small datasets<br>- Symmetric trees → faster prediction<br>- GPU support                            |
| **Why Not / Disadvantages**       | - Slower than LightGBM on large datasets<br>- More memory usage<br>- Sensitive to hyperparameters                                                                   | - Can overfit on small datasets (leaf-wise growth)<br>- Slightly complex hyperparameter tuning                                                                                                          | - Slower than LightGBM on very large datasets<br>- Slightly higher memory usage<br>- Less flexible for some advanced tasks                                                          |
| **When to Use**                   | - Small/medium datasets<br>- Need highly accurate model<br>- Want **regularization control**                                                                        | - Very large datasets<br>- Need **fast training** & prediction<br>- Want high accuracy and memory efficiency                                                                                            | - Datasets with **categorical features**<br>- Small to medium datasets<br>- Want low overfitting on small samples                                                                   |
| **When Not to Use**               | - Extremely large datasets where speed is critical                                                                                                                  | - Small datasets prone to overfitting                                                                                                                                                                   | - Extremely large datasets where memory & speed matter more than categorical handling                                                                                               |
| **Key Hyperparameters**           | - `n_estimators` (trees)<br>- `learning_rate`<br>- `max_depth`<br>- `min_child_weight`<br>- `subsample`, `colsample_bytree`<br>- `gamma`, `reg_alpha`, `reg_lambda` | - `num_leaves` (leaf nodes)<br>- `max_depth`<br>- `learning_rate`<br>- `n_estimators`<br>- `min_data_in_leaf`<br>- `feature_fraction`, `bagging_fraction`, `bagging_freq`<br>- `lambda_l1`, `lambda_l2` | - `iterations`<br>- `depth`<br>- `learning_rate`<br>- `l2_leaf_reg`<br>- `border_count` (numeric splits)<br>- `bagging_temperature`<br>- `loss_function`<br>- `task_type` (CPU/GPU) |
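| **Handling Categorical Features** | Traditionally encoded manually (one-hot, label encoding); newer versions offer native support via `enable_categorical` | Native support (mark columns as categorical or pass categorical indices) | Native support via `cat_features`; no manual encoding needed |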
| **Training Speed**                | Moderate                                                                                                                                                            | Very fast                                                                                                                                                                                               | Moderate                                                                                                                                                                            |
| **Prediction Speed**              | Fast                                                                                                                                                                | Fastest                                                                                                                                                                                                 | Fast                                                                                                                                                                                |
| **Memory Usage**                  | High                                                                                                                                                                | Low                                                                                                                                                                                                     | Moderate                                                                                                                                                                            |
