# Theory — XGBoost (Extreme Gradient Boosting)

What it solves
- High‑performance tree‑based gradient boosting for classification/regression; strong tabular baseline.

Plain intuition
- Build trees sequentially; each new tree fits the negative gradient of the loss (roughly, the errors) left by the trees built so far.
- Many small improvements (weak learners) add up to a strong model.

Core math (sketch)
- Objective at iteration t: L = Σ_i l(y_i, ŷ_i^{(t-1)} + f_t(x_i)) + Ω(f_t)
  • l: loss (e.g., logistic), f_t: new tree, Ω: regularization (L1/L2 penalty on leaf weights + penalty on the number of leaves)
- Add tree f_t that best reduces L using a second‑order Taylor approximation (gradient + Hessian).
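
The second-order step can be written out as below. This is a sketch of the standard derivation, under these assumed symbols: g_i and h_i are the first and second derivatives of l with respect to the previous prediction, T is the number of leaves, w_j are the leaf weights, I_j is the set of rows falling in leaf j, and L/R denote the left and right child of a candidate split. The L1 term is left out for brevity.

```latex
\begin{aligned}
L &\approx \sum_i \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}, \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i, \\
\text{Gain} &= \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma .
\end{aligned}
```

A split is worth making only when Gain is positive, which is how γ and λ act as regularizers.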

Important hyperparameters
- Learning rate (eta): typically 0.01–0.3 (library default 0.3); smaller eta → more trees needed, often better generalization.
- n_estimators (num_boost_round): total boosting rounds; use early stopping with a validation set.
- max_depth / max_leaves: tree complexity; deeper trees can overfit.
- min_child_weight: minimum sum of instance weights (Hessians) required in a child node; larger → more conservative.
- subsample / colsample_bytree: row/feature subsampling for regularization and speed.
- reg_alpha (L1), reg_lambda (L2): regularize leaf weights.
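- gamma: minimum loss reduction required to make a split; larger → more conservative.
- scale_pos_weight: weight on the positive class for imbalanced binary problems; a common starting value is (number of negatives) / (number of positives).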
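
To see how reg_lambda, gamma, and min_child_weight interact, here is a minimal NumPy sketch of the textbook split-gain formula for logistic loss. The function names `leaf_weight` and `split_gain` are illustrative, not part of the XGBoost API, and the library's internal scaling of the gain may differ from this formula.

```python
import numpy as np

def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight -G / (H + lambda) from the second-order objective."""
    return -G / (H + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting one node into left/right children (L1 term omitted)."""
    def score(G, H):
        return G ** 2 / (H + reg_lambda)

    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Logistic loss at a starting probability of 0.5 for six rows.
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)
p = np.full_like(y, 0.5)
g = p - y            # first derivative of log loss w.r.t. the raw score
h = p * (1.0 - p)    # second derivative; min_child_weight bounds its sum per child

left = np.array([True, True, True, False, False, False])  # split separating the classes
gain_no_penalty = split_gain(g, h, left, reg_lambda=1.0, gamma=0.0)
gain_with_gamma = split_gain(g, h, left, reg_lambda=1.0, gamma=1.0)
left_leaf = leaf_weight(g[left].sum(), h[left].sum(), reg_lambda=1.0)

print(f"gain (gamma=0): {gain_no_penalty:.4f}")
print(f"gain (gamma=1): {gain_with_gamma:.4f}")
print(f"left leaf weight: {left_leaf:.4f}")
```

Raising gamma subtracts directly from the gain, so marginal splits drop out first. Each child here has a Hessian sum of 0.75, which is below the default min_child_weight of 1, so on a node this small the library would refuse the split regardless of its gain.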

Training tips
- Monitor a validation metric and set early_stopping_rounds (e.g., 50) so training stops once the metric stops improving.
- Tune eta with n_estimators jointly; use smaller eta with more rounds.
- Start with shallow trees (max_depth 3–6) and moderate subsampling.

Tiny example (concept)
- Begin with a constant prediction (e.g., the log‑odds of the positive‑class rate).
- Tree 1 fits the residuals (negative gradients) of that prediction; add its output scaled by eta; repeat, and the decision function improves with each round.
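
The loop just described can be sketched in plain NumPy with one-split trees (stumps) on a toy 1-D dataset. This is a first-order simplification for illustration, not XGBoost's second-order update, and the helper names and data are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, raw):
    p = np.clip(sigmoid(raw), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for thr in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= thr
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    thr, lv, rv = stump
    return np.where(x <= thr, lv, rv)

# Toy 1-D data: larger x makes the positive class more likely.
x = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.9, 2.3])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

eta = 0.3
base_rate = y.mean()
raw = np.full_like(y, np.log(base_rate / (1 - base_rate)))  # constant start: log-odds
losses = [log_loss(y, raw)]

for _ in range(20):
    residual = y - sigmoid(raw)                 # negative gradient of log loss
    stump = fit_stump(x, residual)              # weak learner fitted to the residuals
    raw = raw + eta * predict_stump(stump, x)   # add the tree, shrunk by eta
    losses.append(log_loss(y, raw))

print(f"log loss: start {losses[0]:.3f}, after 20 rounds {losses[-1]:.3f}")
```

Each round nudges the raw scores a little, and the training loss falls step by step rather than in one jump.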

Pros, cons, pitfalls
- Pros: Excellent accuracy on tabular data; handles missing values; many regularization knobs.
- Cons: Many hyperparameters; can overfit if unchecked; longer training on very large datasets.
- Pitfalls: skipping early stopping; overly deep trees; leakage from preprocessing fitted on the full dataset; judging imbalanced problems by accuracy alone.
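
A small illustration of the imbalance pitfall, using hypothetical labels (95 negatives, 5 positives) and a model that always predicts the majority class. It also shows the negative/positive ratio commonly used as a starting scale_pos_weight.

```python
import numpy as np

# Hypothetical labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = float((y_true == y_pred).mean())

tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
recall = tp / (tp + fn)  # share of actual positives that were found

# Starting point for scale_pos_weight: negatives / positives.
scale_pos_weight = float((y_true == 0).sum() / (y_true == 1).sum())

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, scale_pos_weight={scale_pos_weight:.1f}")
# prints: accuracy=0.95, recall=0.00, scale_pos_weight=19.0
```

Accuracy looks strong while recall is zero, which is why recall, precision, or AUC are safer choices when classes are skewed.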

How this notebook implements it
- Dataset: Data.csv / Churn_Modelling.csv depending on variant.
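- Steps in the cells below: split → fit XGBClassifier (sklearn API, default settings) → confusion matrix and accuracy → 10‑fold cross‑validation. The cells do not pass eval_set or use early stopping; add both when tuning.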
- Tip: Use sklearn API (XGBClassifier/XGBRegressor) for easy Pipeline/CV integration.

Quick checklist
- Use validation set + early stopping.
- Tune depth, learning rate, rounds, and subsampling.
- Check feature importance cautiously; validate with CV.


# XGBoost

## Importing the libraries

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [0]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # all columns except the last are features
y = dataset.iloc[:, -1].values   # the last column is the target

## Splitting the dataset into the Training set and Test set

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training XGBoost on the Training set

In [0]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
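
One caveat for the training cell above: recent XGBoost releases expect the class labels passed to XGBClassifier to be consecutive integers starting at 0, and raise an error otherwise. Whether Data.csv needs this is an assumption, so inspect `np.unique(y)` first. If the labels are coded differently, a sketch of the fix with scikit-learn's LabelEncoder (the label values here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column coded as 2 and 4 instead of 0 and 1.
y_raw = np.array([2, 4, 4, 2, 4])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)        # array([0, 1, 1, 0, 1])
y_back = encoder.inverse_transform(y_encoded)   # original codes restored

print(y_encoded, y_back)
```

Encode `y` before the train/test split, and use `inverse_transform` on predictions when the original codes are needed for reporting.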

## Making the Confusion Matrix

In [0]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

## Applying k-Fold Cross Validation

In [0]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))