## 1. XGBoost parameters

XGBoost has three types of parameters: **general parameters**, **booster parameters** and **task parameters**.

### 1.1 Global configuration:
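- `verbosity` (default=1): Verbosity of printed messages. Valid values are 0 (silent), 1 (warning), 2 (info) and 3 (debug).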

- `use_rmm` (True or False): Whether to use RAPIDS Memory Manager (RMM) to allocate cache GPU memory.

### 1.2 General parameters:
- `booster`: can be `gbtree`, `dart` or `gblinear`.

- `device` (default=`cpu`): The device XGBoost runs on, either `cpu` or a GPU (`cuda` or `gpu`).

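### 1.3 Parameters for tree booster:
- `learning_rate` (alias `eta`; default=0.3, range: [0,1]): Step-size shrinkage applied to each new tree's contribution. Smaller values make boosting more conservative.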

- `gamma` : [default=0] Gamma is a threshold that controls whether a tree is allowed to split a leaf node. When the model considers a split, it calculates how much the split reduces the loss (error). If the loss reduction is smaller than gamma, the split is NOT allowed.

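- `max_depth` (default=6): Maximum depth of a tree. Deeper trees capture more complex interactions but overfit more easily. The table below gives rule-of-thumb starting points, not hard limits.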

| Dataset size (rows) | Overfitting risk | Recommended max_depth | Notes |
|--------------------|------------------|------------------------|-------|
| < 1,000            | Very high        | 2 – 3                  | Only simple interactions; rely on boosting |
| 1k – 10k           | High             | 3 – 4                  | Strong regularization needed |
| 10k – 50k          | Medium-high      | 4 – 5                  | Common tabular ML range |
| 50k – 100k         | Medium           | 5 – 6                  | Default sweet spot |
| 100k – 300k        | Medium-low       | 6 – 7                  | Slightly deeper trees allowed |
| 300k – 1M          | Low              | 6 – 8                  | Increase depth carefully |
| > 1M               | Very low         | 7 – 10                 | Only if features justify it |

- `min_child_weight` (default=1) : min_child_weight controls how small a leaf node is allowed to be by requiring a minimum amount of data (or evidence) in each child after a split. A small value allows many tiny, specific splits (higher overfitting risk), while a larger value forces splits to be supported by more data, making the model more conservative.

- `max_delta_step` (default=0) : Limits how big a correction a tree is allowed to make in one step. Think: “Don’t change your answer too much at once.” It prevents one tree from making a huge confidence jump based on very little data. Typical values: 1..10.
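The interplay between `gamma` and `min_child_weight` can be sketched in a few lines. This is a simplified illustration, not XGBoost's actual code: it assumes the standard second-order gain formula, treats `min_child_weight` as a minimum Hessian sum per child, and uses the hypothetical helper names `leaf_score` and `split_allowed`.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a leaf with gradient sum G and Hessian sum H."""
    return G * G / (H + reg_lambda)


def split_allowed(G_left, H_left, G_right, H_right,
                  gamma=0.0, min_child_weight=1.0, reg_lambda=1.0):
    """Return (allowed, loss_reduction) for one candidate split."""
    # min_child_weight: each child must carry enough Hessian mass.
    if H_left < min_child_weight or H_right < min_child_weight:
        return False, 0.0
    # Loss reduction: children scores minus the score of the unsplit parent.
    loss_reduction = 0.5 * (
        leaf_score(G_left, H_left, reg_lambda)
        + leaf_score(G_right, H_right, reg_lambda)
        - leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    )
    # gamma: a reduction smaller than gamma is rejected.
    return loss_reduction >= gamma, loss_reduction


print(split_allowed(-4.0, 2.0, 3.0, 2.0))                        # accepted
print(split_allowed(-4.0, 2.0, 3.0, 2.0, gamma=5.0))             # rejected by gamma
print(split_allowed(-4.0, 2.0, 3.0, 2.0, min_child_weight=3.0))  # rejected: children too small
```

The same candidate split is accepted with the defaults but rejected once `gamma` exceeds its loss reduction or `min_child_weight` exceeds the Hessian sum of a child.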

**Instances sampling:**

- `subsample` (default=1, range: (0,1]): Fraction of the training rows sampled for each boosting iteration. Values below 1 add randomness and help against overfitting. Typical values: 0.4 to 0.8.

- `sampling_method` (default=uniform): The method used to sample data.

    1. uniform
    2. gradient_based :  Note that this sampling method is only supported when tree_method is set to hist and the device is cuda; other tree methods only support uniform sampling.

**Features sampling:**

- `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` (default=1, range: (0,1]): Fraction of features (columns) sampled for each tree, each depth level and each node (split), respectively.
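The three `colsample_by*` parameters work cumulatively: each one subsamples from the features left by the previous one. The sketch below only illustrates the arithmetic; the helper name `features_per_split` is hypothetical, and the real library samples in stages rather than multiplying fractions.

```python
def features_per_split(n_features, colsample_bytree=1.0,
                       colsample_bylevel=1.0, colsample_bynode=1.0):
    """Approximate number of features available at each split."""
    fraction = colsample_bytree * colsample_bylevel * colsample_bynode
    # Always keep at least one feature to split on.
    return max(1, int(n_features * fraction))


# 64 features, each stage keeps half: 64 * 0.5 * 0.5 * 0.5 = 8
print(features_per_split(64, 0.5, 0.5, 0.5))
```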

**Regularization:**

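- `lambda` (default=1): L2 regularization term on leaf weights. It adds a penalty on the leaf weights $w_j$ to the loss function:

$$
\text{Penalty} = \frac{1}{2} \lambda \sum_j w_j^2
$$

With this penalty, the optimal weight of a leaf whose gradient and Hessian sums are $G$ and $H$ is

$$
w = -\frac{G}{H + \lambda}
$$

so a larger `lambda` makes the denominator bigger and shrinks the leaf weight toward zero.

| Dataset               | Suggested `lambda` |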
|-----------------------|--------------------|
| Small dataset         | 5–20               |
| Noisy data            | 10–50              |
| Large dataset         | 1–5                |
| High-dimensional data | 5–30               |

- `alpha` (default=0): L1 regularization term on leaf weights. Increasing alpha makes the model more conservative by pushing leaf weights toward zero. It encourages exactly-zero weights, produces sparser models and acts like feature selection at the leaf level. Signs you need a higher alpha: too many leaves with tiny effects, a model that reacts to noise, or high variance across folds.

| Situation                | Suggested `alpha` |
|--------------------------|------------------|
| Clean data               | 0–0.5            |
| Some noise               | 0.5–2            |
| High-dimensional data    | 1–10             |
| Very noisy / sparse data | 5–50             |
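To make the effect of `lambda` and `alpha` concrete, the sketch below evaluates the leaf-weight formula above for a single leaf. The L1 part is an assumption on my side: it applies soft-thresholding to the gradient sum, which is how I understand XGBoost to handle `alpha`. The function name `leaf_weight` is hypothetical.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight for gradient sum G and Hessian sum H."""
    # L1 (alpha): shrink G toward zero, and clamp to zero if |G| <= alpha.
    if G > reg_alpha:
        g = G - reg_alpha
    elif G < -reg_alpha:
        g = G + reg_alpha
    else:
        g = 0.0
    # L2 (lambda): a larger denominator gives a smaller weight.
    return -g / (H + reg_lambda)


G, H = -6.0, 2.0
print(leaf_weight(G, H, reg_lambda=1.0))                  # 2.0
print(leaf_weight(G, H, reg_lambda=10.0))                 # 0.5
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=2.0))   # smaller than 2.0
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=6.0))   # zero: the leaf has no effect
```

`lambda` shrinks every weight smoothly, while a large enough `alpha` sets weak leaves exactly to zero.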

**Tree methods:**

- `tree_method` (default=`auto`): XGBoost builds decision trees, and `tree_method` controls how it finds the best split at each node.

  1. exact — Exact Greedy Algorithm: Tries every possible split point for every feature. Computes gain exactly.
  2. approx — Approximate Greedy (Quantile Sketch): Uses quantile sketches to estimate good split points. Builds gradient histograms instead of scanning all values.
  3. hist — Histogram-Based Algorithm (Recommended): Pre-buckets feature values into fixed bins (histograms). Finds splits using histogram statistics. Uses highly optimized C++ code.
  4. auto — Automatic Choice (Same as hist): Currently behaves the same as hist. Keeps backward compatibility.



- `max_leaves` [default=0] : Maximum number of nodes to be added. Not used by exact tree method.

- `max_bin` [default=256]: Only used if tree_method is set to hist or approx. Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.
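The difference between `exact` and `hist` comes down to how many candidate thresholds are examined per feature. The sketch below is a deliberately simplified illustration with hypothetical helper names: it uses plain, unweighted quantiles, whereas XGBoost builds its bins with a weighted quantile sketch.

```python
def exact_candidates(values):
    """exact: one candidate between every pair of distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def hist_candidates(values, max_bin):
    """hist: at most max_bin - 1 cut points at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, max_bin):
        cut = ordered[i * n // max_bin]
        # Skip repeated cut points (happens with few distinct values).
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


feature = list(range(1000))
print(len(exact_candidates(feature)))        # 999 thresholds to evaluate
print(hist_candidates(feature, max_bin=4))   # [250, 500, 750]
```

A larger `max_bin` moves the histogram method closer to the exact one, at a higher computational cost.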

**Categorical features:**

- `max_cat_to_onehot`: A threshold for deciding whether XGBoost should use a one-hot-encoding-based split for categorical data. When the number of categories is lower than the threshold, one-hot encoding is chosen; otherwise the categories are partitioned into children nodes.

- `max_cat_threshold`: Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting.

**Unbalanced data**:

- `scale_pos_weight`: Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
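The suggested ratio is easy to compute from the training labels. A minimal sketch, assuming binary labels encoded as 0 and 1; the helper name and the toy label list are made up for illustration.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive instances for binary 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive instances in labels")
    return n_neg / n_pos


# Toy example: 950 negatives and 50 positives.
labels = [0] * 950 + [1] * 50
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": suggested_scale_pos_weight(labels),
}
print(params["scale_pos_weight"])  # 19.0
```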

### 1.4 Additional parameters for Dart Booster:

How DART Works in XGBoost:

1. **Train a few trees normally.**

2. **When adding a new tree:**
   - Randomly **drop some existing trees**.
   - Compute the **gradient** using only the remaining trees.
   - **Add the new tree** to the model.

3. **Repeat** the process for all boosting rounds.

This process helps make the model **more robust** and **reduces overfitting** by preventing the model from relying too heavily on any single tree.

#### DART Booster Prediction in XGBoost:

When using a **DART booster**, `predict()` **also applies dropouts by default**.  
This is fine for training, but **on test/validation data**, it can produce **incorrect results** because not all trees are used.

#### Problem:

- DART randomly drops trees during **training** to reduce overfitting.
- By default, `predict()` also applies dropout.
- This makes predictions on new data **inaccurate**.

#### Solution:

Use the `iteration_range` parameter to **use all trained trees**:

`preds = bst.predict(dtest, iteration_range=(0, num_round))`

#### Parameters:

- `sample_type` (default = `uniform`): Determines **how the trees to drop are selected** during dropout.

| Option   | Meaning                                       | Example                                                                 |
|----------|-----------------------------------------------|-------------------------------------------------------------------------|
| uniform  | All existing trees have **equal chance** of being dropped | 10 trees → randomly pick 2 trees to drop, each tree equally likely     |
| weighted | Trees with **higher weight** are more likely to be dropped | If some trees have higher weight (more impact), they are more likely to be dropped |

- `normalize_type` (default = `tree`): Adjusts the **weight of new and dropped trees** so predictions stay consistent after dropout.

| Option  | How weights are scaled                                                                 |
|---------|----------------------------------------------------------------------------------------|
| tree    | New tree weight = 1 / (k + learning_rate); dropped trees scaled by k / (k + learning_rate)  (k = number of dropped trees) |
| forest  | New tree weight = 1 / (1 + learning_rate); dropped trees scaled by 1 / (1 + learning_rate) |


- `rate_drop` [default=0.0]: Dropout rate (a fraction of previous trees to drop during the dropout). Range: [0.0, 1.0]

- `skip_drop` [default=0.0]: Probability of skipping the dropout procedure during a boosting iteration.
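The `normalize_type` table above can be turned directly into arithmetic. The sketch below only evaluates those formulas; the function name `dart_scaling` is hypothetical.

```python
def dart_scaling(k, learning_rate, normalize_type="tree"):
    """Return (new_tree_weight, dropped_tree_factor) for k dropped trees."""
    if normalize_type == "tree":
        denom = k + learning_rate
        return 1.0 / denom, k / denom
    if normalize_type == "forest":
        factor = 1.0 / (1.0 + learning_rate)
        return factor, factor
    raise ValueError(f"unknown normalize_type: {normalize_type!r}")


# Three trees dropped in this round, learning_rate = 0.3
print(dart_scaling(3, 0.3, "tree"))
print(dart_scaling(3, 0.3, "forest"))
```

With `tree`, the new tree gets roughly the same weight as each dropped tree; with `forest`, it gets the same weight as the dropped trees taken together.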

### 1.5 Learning task parameters:

The learning task is the type of prediction you want (regression, classification, ranking, survival analysis), and the objective is the loss function XGBoost optimizes during training. The table below lists each objective with an example.

- `objective` [default=reg:squarederror]:

| Task               | Objective            | Output          | Example                        |
|-------------------|-------------------|----------------|--------------------------------|
| Regression         | reg:squarederror    | Continuous     | Predict house price            |
| Regression         | reg:squaredlogerror | Continuous, >0 | Predict sales revenue          |
| Regression         | reg:absoluteerror   | Continuous     | Predict delivery time          |
| Regression         | reg:pseudohubererror| Continuous     | Predict stock prices with outliers |
| Regression         | reg:quantileerror   | Continuous     | Predict 90th percentile delivery time |
| Regression         | reg:gamma           | Continuous >0  | Insurance claim severity       |
| Regression         | reg:tweedie         | Continuous ≥0  | Total insurance loss           |
| Binary Classification | binary:logistic  | Prob 0–1       | Predict if customer buys a product |
| Binary Classification | binary:logitraw   | Score (real)   | Predict likelihood of purchase |
| Binary Classification | binary:hinge      | 0 or 1         | Spam detection                 |
| Multi-class Classification | multi:softmax | Class label    | Predict digit (0–9)           |
| Multi-class Classification | multi:softprob | Prob vector    | Image classification (cat/dog/rabbit) |
| Ranking             | rank:pairwise      | Rank score     | Rank search results            |
| Ranking             | rank:ndcg          | Rank score     | Search engine ranking (NDCG)  |
| Ranking             | rank:map           | Rank score     | Recommendation ranking (MAP)  |
| Survival Analysis   | survival:cox       | Hazard ratio   | Predict patient survival       |
| Survival Analysis   | survival:aft       | Survival time  | Predict machine failure time   |
| Count Data          | count:poisson      | Mean count     | Predict daily website clicks   |


- `eval_metric` [default according to objective]: Evaluation metrics for validation data. A default metric is assigned according to the objective (`rmse` for regression and `logloss` for classification).

- `seed` [default=0]: Random number seed.
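Several rows of the objective table above differ only in the transformation applied to the raw model score (the margin). The sketch below shows those transformations in plain Python; it illustrates the math and does not call XGBoost.

```python
import math


def sigmoid(margin):
    """binary:logistic turns the raw margin into a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """multi:softprob turns one margin per class into a probability vector."""
    top = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


raw_margin = 0.0
print(raw_margin)           # binary:logitraw returns the margin itself
print(sigmoid(raw_margin))  # binary:logistic returns 0.5 for a margin of 0

class_margins = [2.0, 1.0, 0.1]
probs = softmax(class_margins)     # multi:softprob output
label = probs.index(max(probs))    # multi:softmax returns only the winning class
print(probs, label)
```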




## 2. Data Interface

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np

In [None]:
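# toy data: 4 rows, 3 features, random binary labels
train_data = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=['a', 'b', 'c'])
label = pd.DataFrame(np.random.randint(2, size=4))
# the test set reuses the same values as the training set (demo only)
test_data = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=['a', 'b', 'c'])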

In [None]:
# prepare data
dtrain = xgb.DMatrix(train_data, label=label)
dtest = xgb.DMatrix(test_data, label=label)

## 3. Setting parameters

In [None]:
# set params for the model:
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': ['auc', 'ams@0']}

When you train an XGBoost model, you can provide a list of datasets on which to evaluate performance during training. This is done via the `evals` parameter of `xgb.train`.

In [None]:
evallist = [(dtrain, 'train'), (dtest, 'eval')]

## 4. Training

In [None]:
num_round = 10
bst = xgb.train(param, dtrain, num_round, evals=evallist, verbose_eval=2)

[0]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000
[2]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000
[4]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000
[6]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000
[8]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000
[9]	train-auc:0.50000	train-ams@0:0.00000	eval-auc:0.50000	eval-ams@0:0.00000


## 5. Model Saving

In [None]:
bst.save_model('model.ubj')

## 6. Load Model

In [None]:
bst = xgb.Booster()  # init model
bst.load_model('model.ubj')  # load model data

## Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there’s more than one, it will use the last.

`train(..., evals=evals, early_stopping_rounds=10)`

The model trains until the validation score stops improving: the validation error must decrease at least once every `early_stopping_rounds` rounds for training to continue.

To predict with the best iteration found by early stopping:

`ypred = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))`
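The stopping rule itself can be simulated without training anything. The sketch below is a simplified model of the rule described above, assuming a metric where lower is better unless `maximize=True`; the function name `early_stop` and the score list are made up for illustration.

```python
def early_stop(scores, early_stopping_rounds, maximize=False):
    """Return (best_iteration, last_iteration_run) for a list of validation scores."""
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best_score if maximize else score < best_score
        if i == 0 or improved:
            best_iter, best_score = i, score
        elif i - best_iter >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop here.
            return best_iter, i
    return best_iter, len(scores) - 1


# Validation error per boosting round (hypothetical numbers).
val_error = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
best_iteration, stopped_at = early_stop(val_error, early_stopping_rounds=3)
print(best_iteration, stopped_at)  # 2 5
```

Training stops at round 5, so the later improvement at round 6 is never reached, and prediction would use `iteration_range=(0, best_iteration + 1)`, i.e. the first three trees.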

## Callbacks

See the [callbacks documentation](https://xgboost.readthedocs.io/en/stable/python/callbacks.html) for more information about callbacks.

XGBoost provides a callback interface class, `TrainingCallback`. User-defined callbacks should inherit from this class and override the corresponding methods. There is a working example in [Demo for using and defining callback functions](https://xgboost.readthedocs.io/en/stable/python/examples/callbacks.html#sphx-glr-python-examples-callbacks-py).

# Using xgboost on GPU devices

This demo shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU
acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it
time-consuming to process. We compare the run time and accuracy of the GPU and CPU
histogram algorithms.
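A minimal sketch of how the two configurations being compared might be set up, assuming an XGBoost version that supports the `device` parameter. The objective and `num_class` values are assumptions based on the cover type dataset having seven classes, and `timed` is a hypothetical helper.

```python
import time

# Shared settings: histogram algorithm on a 7-class problem (assumed).
common = {"objective": "multi:softprob", "num_class": 7, "tree_method": "hist"}
cpu_params = {**common, "device": "cpu"}
gpu_params = {**common, "device": "cuda"}


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage, given a DMatrix `dtrain` and a round count `num_round`:
# booster, seconds = timed(xgb.train, gpu_params, dtrain, num_round)
```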

In addition, the demo showcases using the GPU with other GPU-related libraries, including
cupy and cuml. These libraries are not strictly required.