<a href="https://colab.research.google.com/github/RM-RAMASAMY/decision_trees/blob/main/GBM_Regressor_%26_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
!pip install catboost



In [15]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from catboost import CatBoostRegressor
import lightgbm as lgb

# Generate sample data (replace with your actual data)
np.random.seed(42)
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) # Example target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient Boosting Regressor Algorithms

Gradient boosting regressors predict continuous values by minimizing a regression loss function. They build an ensemble of decision trees sequentially: each new tree is fit to the errors (more precisely, the negative gradient of the loss) of the ensemble built so far, and its scaled prediction is added to the running total. These models are commonly used in finance, healthcare, and other domains that require accurate numeric predictions.
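The sequential fitting described above can be sketched in plain Python. This is a minimal illustration, assuming a single feature, squared-error loss, and depth-1 trees (stumps); the function names are hypothetical, and none of the three libraries is implemented this way internally.

```python
# Minimal gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: real libraries add regularization, subsampling, and faster split search.

def fit_stump(xs, residuals):
    """Find the threshold on one feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_boosting(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [2 * x for x in xs]
few_rounds = fit_boosting(xs, ys, n_rounds=1)
many_rounds = fit_boosting(xs, ys, n_rounds=50)
print(mse(few_rounds, xs, ys), mse(many_rounds, xs, ys))
```

On the training data, the error after 50 rounds is lower than after one round, which is the "reduce errors sequentially" behavior in miniature.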

Below are explanations of **XGBoost**, **LightGBM**, and **CatBoost** regression algorithms:

---

## **1. XGBoost (Extreme Gradient Boosting)**  
XGBoost is a high-performance gradient boosting library designed for speed and efficiency. It provides robust support for regression tasks by optimizing various loss functions.

### Features:
- **Supports Regression Tasks**: Includes objectives like `reg:squarederror` (squared error), `reg:pseudohubererror` (Pseudo-Huber loss, a smooth approximation of Huber loss), and `reg:logistic` (logistic regression, for targets in the range [0, 1]).
- **Regularization**: L1 and L2 regularization are available for better generalization.
- **Parallelization**: Optimized for multi-core processing.
- **Custom Loss Functions**: Allows users to define custom regression loss functions.
- **Handling Missing Data**: Efficiently manages missing values.
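To make the loss-function bullets concrete, here is a small sketch of the two derivatives a boosting objective supplies per sample: the gradient and the Hessian of the loss with respect to the prediction. The function names and the scalar form are assumptions for illustration; this is not XGBoost's own code or its callback signature.

```python
import math


def squared_error_grad_hess(y_true, y_pred):
    """Derivatives of 0.5 * (y_pred - y_true)**2 with respect to y_pred."""
    residual = y_pred - y_true
    return residual, 1.0


def pseudo_huber_grad_hess(y_true, y_pred, delta=1.0):
    """Derivatives of delta**2 * (sqrt(1 + (r / delta)**2) - 1), with r = y_pred - y_true."""
    residual = y_pred - y_true
    scale = 1.0 + (residual / delta) ** 2
    grad = residual / math.sqrt(scale)
    hess = 1.0 / (scale * math.sqrt(scale))
    return grad, hess


# An outlier with residual 10 pulls hard under squared error,
# but its Pseudo-Huber gradient stays below delta in magnitude.
print(squared_error_grad_hess(0.0, 10.0))
print(pseudo_huber_grad_hess(0.0, 10.0, delta=1.0))
```

The bounded gradient is what makes Huber-style objectives less sensitive to outliers than squared error.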

### Common Use Case:
Predicting housing prices or stock market trends using numerical features.

---

## **2. LightGBM (Light Gradient Boosting Machine)**  
LightGBM is a fast and efficient gradient boosting library designed for large-scale regression tasks with high-dimensional data.

### Features:
- **Objective Functions for Regression**:
  - `regression` (aliases `l2`, `mean_squared_error`): Squared-error (L2) loss, the default.
  - `huber`: For robust regression using Huber loss.
  - `quantile`: For quantile regression.
- **Histogram-based Learning**: Splits data into bins, improving training speed and memory usage.
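- **Leaf-wise Growth**: Splits the leaf with the largest loss reduction first. For the same number of leaves this tends to reach a lower loss than level-wise growth, but it can produce deep trees that overfit small datasets unless `num_leaves` or `max_depth` is limited.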
- **Support for Large Datasets**: Efficiently handles datasets with millions of rows and features.
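The histogram idea can be sketched as follows: bucket a feature into a few bins, accumulate gradient sums per bin, and evaluate split candidates only at bin boundaries. This is a simplified illustration that assumes equal-width bins and a squared-error gain; the function names are hypothetical, and LightGBM's actual binning and gain formula are more elaborate.

```python
def bin_feature(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def best_histogram_split(values, gradients, n_bins=4):
    """Return (bin index, gain) of the best boundary; rows with bin <= index go left."""
    bins = bin_feature(values, n_bins)
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bins, gradients):  # one pass over the data builds the histogram
        grad_sum[b] += g
        count[b] += 1
    total_grad, total_count = sum(grad_sum), sum(count)
    best_bin, best_gain = None, 0.0
    left_grad, left_count = 0.0, 0
    for b in range(n_bins - 1):  # split search touches bins, not individual rows
        left_grad += grad_sum[b]
        left_count += count[b]
        right_grad = total_grad - left_grad
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        gain = (left_grad ** 2 / left_count
                + right_grad ** 2 / right_count
                - total_grad ** 2 / total_count)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [1, 2, 3, 4, 11, 12, 13, 14]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
split_bin, split_gain = best_histogram_split(values, gradients, n_bins=4)
print(split_bin, split_gain)
```

The split search loops over `n_bins - 1` candidates regardless of how many rows there are, which is where the speed and memory savings come from.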

### Advantages in Regression:
- Faster training for large-scale datasets.
- Lower memory use, because continuous features are stored as discrete bins.

---

## **3. CatBoost (Categorical Boosting)**  
CatBoost is a gradient boosting library tailored for datasets with categorical features and offers robust support for regression tasks.

### Features:
- **Native Support for Categorical Data**: Processes categorical features natively without manual preprocessing (like one-hot encoding).
- **Regression Modes**:
  - `RMSE`: Optimizes for root mean squared error.
  - `MAE`: Optimizes for mean absolute error.
  - `Quantile`: For quantile regression.
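- **Ordered Boosting**: Computes each sample's residual with a model trained only on the samples that precede it in a random permutation, which limits target leakage and the overfitting it causes.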
- **GPU Acceleration**: Supports GPU training for faster computation.
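The "ordered" principle also underlies how CatBoost encodes categorical features: a row's category is replaced by a target statistic computed only from earlier rows, so the row's own target never leaks into its encoding. The sketch below is a simplified illustration; the function name, the `prior` and `prior_weight` parameters, and the use of the given row order as the permutation are assumptions, not CatBoost's API.

```python
def ordered_target_statistics(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the targets of rows that came before it."""
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; falls back to the prior for unseen categories.
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now does the current row become visible to later rows.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1.0, 0.0, 1.0, 0.0, 1.0]
encoded = ordered_target_statistics(categories, targets, prior=0.5)
print(encoded)
```

The first occurrence of each category receives the prior, and later occurrences move toward the mean of the targets seen so far for that category.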

### Benefits in Regression:
- Excels in datasets with categorical features.
- Ordered boosting is designed to reduce overfitting, which matters most on small datasets.

---

## **Comparison of XGBoost, LightGBM, and CatBoost for Regression**:

| Feature                | XGBoost                | LightGBM                  | CatBoost               |
|------------------------|------------------------|---------------------------|------------------------|
| **Speed**              | Moderate              | Fast                      | Moderate              |
| **Ease of Use**        | Moderate              | Moderate                  | High                  |
| **Categorical Features** | Preprocessing by default; native support via `enable_categorical` | Native support for integer-encoded or `category` columns | Fully supported       |
| **Regression Objectives** | MSE, Pseudo-Huber, Custom | MSE, Huber, Quantile       | RMSE, MAE, Quantile   |
| **Scalability**        | High                  | Very High                 | Moderate              |

---

## **Use Case Scenarios**:

- **XGBoost**: If you need a robust, general-purpose regression model with extensive community support.
- **LightGBM**: If speed and scalability for large datasets are critical, especially for regression tasks.
- **CatBoost**: If your data contains many categorical features and overfitting is a concern.

Each algorithm has its strengths, and the choice often depends on your specific dataset and computational requirements.


In [16]:
# @title Regression techniques XGBoost, CatBoost, LightGBM

# 1. XGBoost
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
print(f"XGBoost RMSE: {xgb_rmse}")

# 2. CatBoost
cat_model = CatBoostRegressor(iterations=100, random_seed=42, verbose=0)  # verbose=0 suppresses training output
cat_model.fit(X_train, y_train)
cat_pred = cat_model.predict(X_test)
cat_rmse = np.sqrt(mean_squared_error(y_test, cat_pred))
print(f"CatBoost RMSE: {cat_rmse}")


# 3. LightGBM
lgb_model = lgb.LGBMRegressor(random_state=42, force_col_wise=True, verbose=-1)
lgb_model.fit(X_train, y_train)
lgb_pred = lgb_model.predict(X_test)
lgb_rmse = np.sqrt(mean_squared_error(y_test, lgb_pred))
print(f"LightGBM RMSE: {lgb_rmse}")

XGBoost RMSE: 1.2797247448286968
CatBoost RMSE: 1.116058972571808
LightGBM RMSE: 1.0285364458917656


# Gradient Boosting Ranking Algorithms

Gradient boosting ranking algorithms learn a scoring function that orders the items within each query (group) by optimizing a ranking loss. Like the regressors, they build an ensemble of decision trees sequentially, with each tree reducing the remaining loss. These models are widely used in search engines, recommendation systems, and other ranking applications.

Below are explanations of **XGBoost**, **LightGBM**, and **CatBoost** ranking algorithms:

---

## **1. XGBoost (Extreme Gradient Boosting)**  
XGBoost is a high-performance gradient boosting library designed for speed and efficiency. It provides a specialized **ranking mode** that optimizes learning-to-rank objectives like **pairwise rank loss** or **LambdaRank**.

### Features:
- **Supports Ranking Tasks**: XGBoost can be configured for ranking by setting the objective to `rank:pairwise` or `rank:ndcg` (Normalized Discounted Cumulative Gain).
- **Regularization**: XGBoost includes L1 and L2 regularization for better generalization.
- **Parallelization**: Highly optimized for multi-core processing.
- **Custom Loss Functions**: Allows users to define custom ranking loss functions.
- **Handling Missing Data**: Efficiently manages missing values.
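The idea behind a pairwise objective can be sketched with a RankNet-style logistic loss: for every pair of items in a query where one is more relevant than the other, the loss is small when the more relevant item scores higher. The function name is hypothetical and this is only the general principle, not XGBoost's exact `rank:pairwise` implementation.

```python
import math


def pairwise_logistic_loss(scores, labels):
    """Mean logistic loss over all pairs (i, j) in one query where item i is more relevant than j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                margin = scores[i] - scores[j]  # positive when the pair is ordered correctly
                losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0


labels = [2, 1, 0]  # relevance of three documents for one query
good_loss = pairwise_logistic_loss([3.0, 2.0, 1.0], labels)  # scores agree with the labels
bad_loss = pairwise_logistic_loss([1.0, 2.0, 3.0], labels)   # scores reverse the order
print(good_loss, bad_loss)
```

Only score differences within a query enter the loss, which is why the absolute scale of a ranker's output carries no meaning.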

### Common Use Case:
Ranking documents for a search query by predicted relevance.

---

## **2. LightGBM (Light Gradient Boosting Machine)**  
LightGBM is a fast and efficient gradient boosting library optimized for large-scale data and high-dimensional features. It provides support for ranking through its **learning-to-rank (LTR)** objectives.

### Features:
- **Objective Functions for Ranking**:
  - `lambdarank`: Implements LambdaRank, a technique that directly optimizes metrics like NDCG.
  - `rank_xendcg`: Implements XE-NDCG, a cross-entropy-based surrogate loss for NDCG.
- **Histogram-based Learning**: Splits data into bins, improving training speed and memory usage.
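- **Leaf-wise Growth**: Splits the leaf with the largest loss reduction first. For the same number of leaves this tends to reach a lower loss than level-wise growth, but it can produce deep trees that overfit small datasets unless `num_leaves` or `max_depth` is limited.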
- **Support for Large Datasets**: Handles datasets with millions of rows and features efficiently.

### Advantages in Ranking:
- Faster training for large-scale datasets.
- Ranking objectives target NDCG directly; Mean Average Precision (MAP) is available as an evaluation metric.
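Since NDCG is the metric these objectives target, a short worked computation may help. The sketch below assumes the exponential gain `2**rel - 1` and a `log2` position discount; other libraries may use a linear gain instead, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a relevance list that is already in ranked order."""
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k):
    """NDCG@k for one query: DCG of the predicted order divided by DCG of the ideal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevances = [3, 2, 0]
perfect = ndcg_at_k([0.9, 0.5, 0.1], relevances, k=3)   # predicted order matches the ideal order
reversed_ = ndcg_at_k([0.1, 0.5, 0.9], relevances, k=3)  # most relevant item ranked last
print(perfect, reversed_)
```

NDCG is computed per query and then averaged across queries, so it needs the same grouping information as the ranking objectives themselves.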

---

## **3. CatBoost (Categorical Boosting)**  
CatBoost is a gradient boosting library tailored for datasets with categorical features. It also supports ranking tasks with specialized ranking loss functions.

### Features:
- **Native Support for Categorical Data**: Efficiently processes categorical features without the need for manual preprocessing (like one-hot encoding).
- **Ranking Modes**:
  - `YetiRank`: Optimizes ranking metrics using pairwise comparisons.
  - `QueryCrossEntropy`: For cross-entropy loss in ranking contexts.
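- **Ordered Boosting**: Computes each sample's residual with a model trained only on the samples that precede it in a random permutation, which limits target leakage and the overfitting it causes.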
- **GPU Acceleration**: Supports GPU training for faster computation.

### Benefits in Ranking:
- Excels in datasets with categorical features.
- Ordered boosting is designed to reduce overfitting, which matters most on small datasets.

---

## **Comparison of XGBoost, LightGBM, and CatBoost for Ranking**:

| Feature                | XGBoost                | LightGBM                  | CatBoost               |
|------------------------|------------------------|---------------------------|------------------------|
| **Speed**              | Moderate              | Fast                      | Moderate              |
| **Ease of Use**        | Moderate              | Moderate                  | High                  |
| **Categorical Features** | Preprocessing by default; native support via `enable_categorical` | Native support for integer-encoded or `category` columns | Fully supported       |
| **Ranking Objectives** | Pairwise, NDCG         | LambdaRank, XENDCG         | YetiRank, QueryCrossEntropy |
| **Scalability**        | High                  | Very High                 | Moderate              |

---

## **Use Case Scenarios**:

- **XGBoost**: If you need a robust, general-purpose ranking model with extensive community support.
- **LightGBM**: If speed and scalability for large datasets are critical, especially for ranking tasks.
- **CatBoost**: If your data contains many categorical features and overfitting is a concern.

Each algorithm has its strengths, and the choice often depends on your specific dataset and computational requirements.


In [17]:
# @title Ranking techniques XGBoost, CatBoost, LightGBM
# Create query IDs for ranking
qids = np.random.randint(0, 10, size=100) # Example query ids (10 queries)

# Combine features, target, and query ids into a DataFrame
train_data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])
train_data['target'] = y
train_data['qid'] = qids

# Split data into training and testing sets
X_train, X_test, y_train, y_test, qid_train, qid_test = train_test_split(
    train_data.drop(['target','qid'], axis=1),
    train_data['target'],
    train_data['qid'],
    test_size=0.2,
    random_state=42
)


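# 1. XGBoost
# `group` lists consecutive group sizes, so it assumes rows of the same query are adjacent.
# The shuffled split above does not guarantee this; sort the training rows by qid when
# query membership matters.
xgb_model = xgb.XGBRanker(objective='rank:pairwise', random_state=42)
xgb_model.fit(X_train, y_train, group=qid_train.value_counts(sort=False).values)
xgb_pred = xgb_model.predict(X_test)
# Ranker outputs are relative scores, not estimates of the target, so this RMSE does not measure ranking quality
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
print(f"XGBoost RMSE: {xgb_rmse}")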


# 2. CatBoost
cat_model = CatBoostRegressor(iterations=100, random_seed=42, verbose=0, loss_function='RMSE')
cat_model.fit(X_train, y_train)  # Regression baseline with RMSE; CatBoost's ranking losses (e.g. YetiRank) are not used here
cat_pred = cat_model.predict(X_test)
cat_rmse = np.sqrt(mean_squared_error(y_test, cat_pred))
print(f"CatBoost RMSE: {cat_rmse}")


# 3. LightGBM
# Convert y_train to integer labels for ranking
# Here, we rank based on the order of the original target values within each group
y_train_ranked = y_train.groupby(qid_train).rank(method='first').astype(int)

lgb_model = lgb.LGBMRanker(random_state=42, force_col_wise=True, verbose=-1, objective='lambdarank')
lgb_model.fit(X_train, y_train_ranked, group=qid_train.value_counts(sort=False).values)  # Ranked labels; group sizes assume rows are ordered by query
lgb_pred = lgb_model.predict(X_test)
lgb_rmse = np.sqrt(mean_squared_error(y_test, lgb_pred))  # Scores are not on the target's scale; a ranking metric such as NDCG is more appropriate
print(f"LightGBM RMSE: {lgb_rmse}")

XGBoost RMSE: 3.7391786532001214
CatBoost RMSE: 1.116058972571808
LightGBM RMSE: 3.5333778669703766
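The `group` argument used in the ranking cell is a list of consecutive group sizes, so rows belonging to the same query have to sit next to each other. The sketch below shows one way to reorder parallel lists by query ID and derive matching group sizes. It uses plain Python with hypothetical helper names and toy data, not the notebook's DataFrames.

```python
from itertools import groupby


def sort_by_query(qids, *columns):
    """Reorder parallel lists so that rows sharing a query ID are adjacent."""
    # sorted() is stable, so rows keep their original order within each query.
    order = sorted(range(len(qids)), key=lambda i: qids[i])
    sorted_qids = [qids[i] for i in order]
    sorted_columns = [[column[i] for i in order] for column in columns]
    return sorted_qids, sorted_columns


def group_sizes(sorted_qids):
    """Count consecutive runs of equal query IDs."""
    return [len(list(run)) for _, run in groupby(sorted_qids)]


qids = [2, 0, 1, 0, 2, 2]
labels = [0.3, 1.2, 0.7, 0.1, 2.0, 0.9]
sorted_qids, (sorted_labels,) = sort_by_query(qids, labels)
sizes = group_sizes(sorted_qids)
print(sorted_qids, sorted_labels, sizes)
```

With pandas data the same idea applies: sort the training rows by the query column first, then compute the group sizes from the sorted column.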
