# What is XGBoost (Extreme Gradient Boosting)?

A high‑performance gradient boosting library for **tree‑based** models (classification/regression). Strong default for **tabular** data.

Rule of thumb: “Add small decision trees one‑by‑one; each new tree corrects errors of the previous ones using gradients.”

---

# Core idea and notation

- At iteration t, the model is $$\hat{y}^{(t)}(x) = \hat{y}^{(t-1)}(x) + f_t(x)$$, where $$f_t$$ is a new regression tree (in practice its output is scaled by the learning rate $$\eta$$, omitted here for simplicity).
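- Objective to minimize:
$$
\mathcal{L}^{(t)} = \sum_{i=1}^N l\big(y_i, \hat{y}^{(t-1)}_i + f_t(x_i)\big) \, + \, \Omega(f_t)
$$
  where $$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$ penalizes a tree with $$T$$ leaves and leaf weights $$w_j$$.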
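- Using a second-order Taylor expansion of $$l$$ around $$\hat{y}^{(t-1)}$$, the objective becomes $$\mathcal{L}^{(t)} \approx \sum_{i=1}^N \big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \big] + \Omega(f_t)$$ (constant terms dropped), where: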
  - First derivative (gradient): $$g_i = \partial_{\hat{y}} l(y_i, \hat{y}^{(t-1)}_i)$$
  - Second derivative (Hessian): $$h_i = \partial^2_{\hat{y}} l(y_i, \hat{y}^{(t-1)}_i)$$
- For each leaf j with index set $$I_j$$, define $$G_j = \sum_{i\in I_j} g_i$$, $$H_j = \sum_{i\in I_j} h_i$$.
- The optimal leaf weight under this second-order approximation is:
$$
w_j^* = -\frac{G_j}{H_j + \lambda}
$$
- The split **gain** when splitting a node into left/right:
$$
\text{Gain} = \frac{1}{2}\left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right) - \gamma
$$
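Here $$G = G_L + G_R$$ and $$H = H_L + H_R$$ are the sums for the node being split; $$\lambda$$ is the L2 regularization strength on leaf weights, and $$\gamma$$ is the minimum loss reduction required to make a split.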
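
To make the leaf-weight and gain formulas concrete, the sketch below evaluates them by hand for one candidate split. It assumes squared-error loss $$l = \tfrac{1}{2}(y - \hat{y})^2$$, for which $$g_i = \hat{y}_i - y_i$$ and $$h_i = 1$$. The toy targets, the starting prediction of 0, and the helper names `leaf_weight` and `split_gain` are made up for illustration.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def _score(G, H, lam):
    """Structure score term G^2 / (H + lambda) for one node."""
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into left/right children."""
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (_score(G_L, H_L, lam) + _score(G_R, H_R, lam)
                  - _score(G, H, lam)) - gamma


# Toy data (hypothetical): four targets, current prediction 0 for every row.
y = [1.0, 1.0, 3.0, 3.0]
y_hat = [0.0, 0.0, 0.0, 0.0]

# Squared-error loss 0.5 * (y - y_hat)^2  ->  g = y_hat - y, h = 1.
g = [p - t for p, t in zip(y_hat, y)]
h = [1.0] * len(y)

# Candidate split: the first two rows go left, the last two go right.
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])

lam, gamma = 1.0, 0.0
w_left = leaf_weight(G_L, H_L, lam)
w_right = leaf_weight(G_R, H_R, lam)
gain = split_gain(G_L, H_L, G_R, H_R, lam, gamma)
print(w_left, w_right, gain)
```

With $$\lambda = 1$$ the leaf weights (2/3 and 2) are shrunk toward zero relative to the unregularized residual means (1 and 3), and the split is kept because its gain (4/15) is positive.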

---

# Important hyperparameters (and roles)

- `eta` (`learning_rate` in the sklearn API): typically 0.01–0.3; smaller values need more trees but often generalize better.
- `n_estimators` (`num_boost_round` in the native API): number of trees; pair with early stopping.
- `max_depth` / `max_leaves`: tree complexity; deeper trees can overfit.
- `min_child_weight`: minimum sum of Hessians required in a child node; larger → more conservative splits.
- `subsample`, `colsample_bytree`: row/feature subsampling for regularization and speed.
- `reg_alpha` (L1), `reg_lambda` (L2): regularize leaf weights.
- `gamma` (`min_split_loss`): minimum loss reduction required to make a split; larger → fewer splits.
- Imbalance: `scale_pos_weight` ≈ (number of negatives) / (number of positives) for binary classification.
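
As a rough illustration of how these settings fit together, the sketch below derives `scale_pos_weight` from label counts and collects a starting configuration in a plain dict. The labels are hypothetical, the helper `neg_pos_ratio` is not part of any library, and the values are illustrative starting points rather than recommendations.

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Return (#negatives / #positives) for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = [0] * 90 + [1] * 10

params = {
    "learning_rate": 0.05,   # eta
    "n_estimators": 1000,    # upper bound; early stopping picks the actual count
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_alpha": 0.0,
    "reg_lambda": 1.0,
    "gamma": 0.0,
    "scale_pos_weight": neg_pos_ratio(y),
}
print(params["scale_pos_weight"])
```

The keys use the sklearn-API parameter names, so a dict like this could be unpacked into the estimator as `XGBClassifier(**params)`.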

---

# Step‑by‑step (paper‑and‑pencil)

1) Initialize $$\hat{y}^{(0)}$$ (e.g., constant prediction).
2) For t = 1..T:
   - Compute gradients $$g_i$$ and Hessians $$h_i$$ for current predictions.
   - Build a new tree by choosing splits that maximize **Gain**.
   - Add the scaled tree to the model: $$\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta f_t(x)$$.
   - Optionally stop early when the validation metric has not improved for a set number of rounds.

---

# Pseudocode

```
y_hat = init_constant()
for t in 1..T:
  g, h = grad_hess(loss, y, y_hat)
  tree = build_tree_by_gain(X, g, h, lambda, gamma, max_depth, min_child_weight)
  y_hat = y_hat + eta * tree.predict(X)
  if early_stopping and no_improvement: break
```
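
The pseudocode can be turned into a small runnable sketch. This is not how the XGBoost library is implemented: it assumes squared-error loss, a single 1-D feature, depth-1 trees (stumps), and no early stopping or `min_child_weight`; the function names and toy data are made up for illustration.

```python
def best_stump(x, g, h, lam, gamma):
    """Pick the threshold on a 1-D feature that maximizes Gain.

    Returns (threshold, w_left, w_right). threshold is None when no split
    has positive gain; both weights then equal the single-leaf weight.
    """
    G, H = sum(g), sum(h)
    w_all = -G / (H + lam)
    best, best_gain = (None, w_all, w_all), 0.0
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [i for i, v in enumerate(x) if v < thr]
        G_L = sum(g[i] for i in left)
        H_L = sum(h[i] for i in left)
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best = (thr, -G_L / (H_L + lam), -G_R / (H_R + lam))
    return best


def predict_stump(stump, v):
    thr, w_left, w_right = stump
    return w_left if thr is None or v < thr else w_right


def fit_boosted_stumps(x, y, rounds=50, eta=0.3, lam=1.0, gamma=0.0):
    base = sum(y) / len(y)                      # constant initial prediction
    y_hat = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        g = [p - t for p, t in zip(y_hat, y)]   # gradient of 0.5 * (y - y_hat)^2
        h = [1.0] * len(y)                      # Hessian of the same loss
        stump = best_stump(x, g, h, lam, gamma)
        stumps.append(stump)
        y_hat = [p + eta * predict_stump(stump, v) for p, v in zip(y_hat, x)]
    return base, stumps, y_hat


def mse(y, y_hat):
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)


# Toy 1-D regression data (hypothetical).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, y_hat = fit_boosted_stumps(x, y)
print("MSE of constant model:", mse(y, [base] * len(y)))
print("MSE after boosting:  ", mse(y, y_hat))
```

Each round fits the stump to the current gradients and adds only a fraction `eta` of its output, which is why many small steps are needed when `eta` is small.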

---

# Minimal Python (sklearn API)

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# X, y are assumed to be already loaded: a feature matrix and binary labels (0/1).
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

xgb = XGBClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8,
    reg_lambda=1.0, reg_alpha=0.0, gamma=0.0,
    eval_metric="auc", n_jobs=-1, random_state=42,
    # Constructor argument in xgboost >= 1.6; passing it to fit() was removed in 2.0.
    early_stopping_rounds=50,
)
# Early stopping monitors the validation set passed via eval_set.
xgb.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
print("Best iteration:", xgb.best_iteration)
print("Val AUC:", xgb.best_score)
```

---

# Practical tips, pitfalls, and variants

- Always monitor a **validation** metric with **early stopping** to avoid overfitting.
- Tune `eta` jointly with `n_estimators` (smaller eta → more trees).
- Keep trees shallow to start (max_depth 3–6).
- Tune with cross-validation (`GridSearchCV` or `RandomizedSearchCV`); hyperparameters interact, so tune related ones together.
- Avoid leakage: fit all preprocessing on the training split only, then apply it to validation/test data.
- Interpretability: built-in feature importances can be biased; consider permutation importance or SHAP.
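
The idea behind permutation importance can be sketched without any library: shuffle one feature column, re-score the model, and record how much the metric drops. The stand-in `predict` function, the toy data, and the helper names below are hypothetical; in practice the scoring should use held-out data, and scikit-learn provides `sklearn.inspection.permutation_importance` for fitted estimators.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    `predict` maps a list of rows to a list of predicted labels.
    """
    rng = random.Random(seed)
    baseline = accuracy(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Stand-in "model" (hypothetical): uses feature 0 only and ignores feature 1.
def predict(rows):
    return [1 if row[0] > 0.5 else 0 for row in rows]


X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]

print(permutation_importance(predict, X, y))
```

Shuffling the ignored feature leaves the score unchanged (importance 0), while shuffling the feature the model relies on lowers accuracy.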

---

# Quick cheat sheet

- Additive trees: each corrects previous errors using gradients
- Gain formula drives splits; regularize with $$\lambda,\gamma$$
- Use early stopping; shallow trees; tune subsampling


# XGBoost

## Importing the libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

## Importing the dataset

```python
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # all columns except the last are features
y = dataset.iloc[:, -1].values   # last column is the target
```

## Splitting the dataset into the Training set and Test set

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

## Training XGBoost on the Training set

```python
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
```

## Making the Confusion Matrix

```python
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(accuracy_score(y_test, y_pred))
```
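
For reference, the accuracy printed above can be read directly off the confusion matrix. The sketch below uses hypothetical counts, not results from `Data.csv`, and assumes the binary layout scikit-learn uses (rows are true classes, columns are predicted classes).

```python
# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows = true class, columns = predicted class -> [[TN, FP], [FN, TP]].
cm = [[50, 5],
      [10, 35]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total     # share of correct predictions
precision = tp / (tp + fp)       # of predicted positives, how many are right
recall = tp / (tp + fn)          # of actual positives, how many are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On imbalanced data, precision and recall are usually more informative than accuracy alone.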

## Applying k-Fold Cross Validation

```python
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
```