<a href="https://colab.research.google.com/github/wingated/cs473/blob/main/mini_labs/week_12_xgboost.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BYU CS 473 — XGBoost

In this assignment, you will learn the core ideas behind XGBoost and apply the method to a dataset of your choice.
We’ll connect the math from the textbook to hands-on modeling.

---

## Learning Goals
- Explain the XGBoost objective function and its components.
- Define and use key terms: regularizer, second-order Taylor expansion, leaf weights, gain, and split criterion.
- Apply XGBoost to a dataset, tune hyperparameters, and evaluate results.
- Understand how XGBoost improves upon traditional boosting methods.

## Part 1 — Key Concepts from the Textbook  

Read through the definitions below. For each one, write a **1–2 sentence explanation in your own words**.  

### 1. Regularizer  
Equation (18.47):  
$\Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^J w_j^2$  

**Question:** Why does XGBoost penalize both the **number of leaves** and the **magnitude of leaf weights**?  


XGBoost penalizes the number of leaves (the $\gamma J$ term) because a tree with many leaves can overfit by carving the data into tiny, overly specific regions. It also penalizes large leaf weights (the $\frac{1}{2} \lambda \sum_j w_j^2$ term) because extreme leaf predictions usually reflect fitting noise rather than a stable pattern.
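
As a small illustration of Equation (18.47), the sketch below evaluates $\Omega(f)$ for two hand-made trees. The function name `regularizer` and the leaf weights are made up for this example; they do not come from the textbook or from a fitted model.

```python
def regularizer(leaf_weights, gamma, lam):
    """Omega(f) = gamma * J + 0.5 * lam * sum_j w_j^2 for a single tree."""
    num_leaves = len(leaf_weights)  # J in the equation
    l2_penalty = 0.5 * lam * sum(w * w for w in leaf_weights)
    return gamma * num_leaves + l2_penalty


# Hypothetical leaf weights: a small tree with modest weights,
# and a larger tree with more leaves and more extreme weights.
small_tree = [0.5, -0.5]
large_tree = [2.0, -1.5, 0.3, -0.1]

penalty_small = regularizer(small_tree, gamma=1.0, lam=1.0)
penalty_large = regularizer(large_tree, gamma=1.0, lam=1.0)

print("small tree penalty:", penalty_small)
print("large tree penalty:", penalty_large)
```

The larger tree is penalized on both counts: it pays $\gamma$ for each extra leaf and pays more through the squared weights.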

### 2. Second-order Taylor Expansion of the Loss  
Equation (18.49):  
$L_m(F_m) \approx \sum_{i=1}^N \Big[ \ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \tfrac{1}{2} h_{im} F_m(x_i)^2 \Big] + \Omega(F_m)$  

**Question:** How does including the **Hessian term** (curvature) make boosting more accurate compared to using only gradients?  


The gradient gives the slope of the loss at the current prediction, while the Hessian gives its curvature, that is, how quickly that slope changes. Using both lets each boosting step behave like a Newton step: the update shrinks where the loss curves sharply and grows where it is nearly flat, instead of relying on slope alone. This typically leads to faster, more accurate convergence.
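
To see what the Hessian term adds, the sketch below compares a first-order and a second-order approximation of the loss for a single example. It assumes binary log loss on a raw score (logit), for which $g = p - y$ and $h = p(1 - p)$; the label, previous score, and step size are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, score):
    """Binary log loss for a label y in {0, 1} and a raw score (logit)."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, score):
    """First and second derivative of the log loss with respect to the score."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)


# Illustrative values: true label 1, previous score 0, proposed update 0.5.
y, prev_score, step = 1, 0.0, 0.5
g, h = grad_hess(y, prev_score)

exact = log_loss(y, prev_score + step)
first_order = log_loss(y, prev_score) + g * step
second_order = first_order + 0.5 * h * step ** 2

print("exact loss after step:  ", exact)
print("first-order estimate:   ", first_order)
print("second-order estimate:  ", second_order)
```

For this example the second-order estimate lands much closer to the exact loss than the first-order one, which is the sense in which curvature makes each boosting step better informed.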

### 3. Optimal Leaf Weights  
Equation (18.54):  
$w_j^* = - \frac{G_{jm}}{H_{jm} + \lambda}$  

**Question:** What does this formula mean about how leaf weights are chosen?  


The optimal weight of a leaf is the negative sum of the gradients of the examples in that leaf, $G_{jm}$, divided by the sum of their Hessians, $H_{jm}$, plus the regularization strength $\lambda$. Leaves whose examples have large gradients pointing in the same direction get larger weights. The $\lambda$ term shrinks every weight toward zero and prevents huge updates when the total curvature is small, which stabilizes the model.
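
A worked instance of Equation (18.54) may help. The sketch below assumes squared-error loss $\tfrac{1}{2}(y - f)^2$, for which each example has gradient $f - y$ and Hessian 1; the targets and the helper name `optimal_leaf_weight` are made up for illustration.

```python
def optimal_leaf_weight(gradients, hessians, lam):
    """w* = -G / (H + lambda), with G and H summed over the examples in the leaf."""
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + lam)


# Three hypothetical examples in one leaf, with the previous prediction at 0.
targets = [1.0, 2.0, 3.0]
previous_prediction = 0.0
gradients = [previous_prediction - y for y in targets]  # f - y
hessians = [1.0 for _ in targets]                       # constant for squared error

weight_no_reg = optimal_leaf_weight(gradients, hessians, lam=0.0)
weight_reg = optimal_leaf_weight(gradients, hessians, lam=1.0)

print("lambda = 0:", weight_no_reg)  # equals the mean residual
print("lambda = 1:", weight_reg)     # shrunk toward zero
```

With $\lambda = 0$ the weight is just the mean residual in the leaf; increasing $\lambda$ pulls it toward zero.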

### 4. Gain of a Split  
Equation (18.56):  
$\text{gain} = \tfrac{1}{2}\Bigg( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Bigg) - \gamma$  

**Question:** Why does XGBoost reject splits with **negative gain**?  


A negative gain means the loss reduction from the split is smaller than the penalty $\gamma$ for adding a leaf, so the split would raise the regularized objective rather than lower it. XGBoost therefore keeps only splits whose loss reduction outweighs their complexity cost.
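
The gain formula in Equation (18.56) can be evaluated directly. The sketch below uses made-up gradient and Hessian sums for two candidate splits; the helper names are hypothetical and the numbers are chosen only to show one split that passes and one that fails.

```python
def leaf_score(G, H, lam):
    """G^2 / (H + lambda): the loss reduction a single leaf can achieve."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain of splitting a node into left and right children (Equation 18.56)."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


# Strong split: the two children have gradients of opposite sign.
strong_gain = split_gain(G_left=-4.0, H_left=2.0, G_right=4.0, H_right=2.0,
                         lam=1.0, gamma=1.0)

# Weak split: both children want to move in the same direction,
# so separating them barely reduces the loss.
weak_gain = split_gain(G_left=-1.0, H_left=2.0, G_right=-3.0, H_right=2.0,
                       lam=1.0, gamma=1.0)

print("strong split gain:", strong_gain)
print("weak split gain:  ", weak_gain)
print("keep weak split?  ", weak_gain > 0)
```

The weak split does reduce the loss slightly, but by less than $\gamma$, so its gain is negative and it would be rejected.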

## Part 2 — Visualizing Boosting  

### 2.1 Bagging vs Boosting (Recap)  
Describe in words how **bagging** and **boosting** differ in how they:  
- Use data sampling  
- Build models sequentially or in parallel  
- Reduce bias vs variance  



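Bagging trains each model on a different bootstrap sample of the training data, drawn at random with replacement. Boosting trains each new model on the full dataset but changes what the model is asked to fit: AdaBoost reweights the data so that previously mispredicted points get more emphasis, while gradient boosting methods such as XGBoost fit each new tree to the gradients (residuals) of the current ensemble.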


Bagging's models are independent of each other, so they can all be trained in parallel. Boosting trains models sequentially, with each new model correcting the mistakes of the previous ones.


Bagging mainly reduces variance, while boosting primarily reduces bias.
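
The contrast above can be sketched in plain Python. This is a toy illustration, not XGBoost: it assumes one-split regression stumps as the base learner, squared-error loss, and a noiseless curve as training data. All function names are made up for this example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to a constant prediction
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagging(xs, ys, n_models, rng):
    """Independent stumps, each fit to a bootstrap sample, then averaged."""
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


def boosting(xs, ys, n_models, learning_rate):
    """Sequential stumps, each fit to the residuals left by the earlier ones."""
    models = []
    residuals = list(ys)  # start from a prediction of zero
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: learning_rate * sum(m(x) for m in models)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a smooth curve that a single stump is too simple to fit.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

bagged = bagging(xs, ys, n_models=50, rng=random.Random(0))
boosted = boosting(xs, ys, n_models=50, learning_rate=0.5)

bagged_mse = mse(bagged, xs, ys)
boosted_mse = mse(boosted, xs, ys)
print("bagged stumps training MSE: ", bagged_mse)
print("boosted stumps training MSE:", boosted_mse)
```

Averaging many stumps cannot fix the fact that each stump is too simple, so the bagged ensemble stays biased. The boosted ensemble keeps correcting its own residuals and fits the curve far more closely.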

## Part 3 — Implementing XGBoost on 2 Datasets  

### Step 1 — Look at the example dataset



In [1]:
# Example: load a dataset, train an XGBoost classifier, and evaluate it
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    eta=0.1,        # learning rate (eta is XGBoost's alias for learning_rate)
    max_depth=3,    # tree depth
    n_estimators=100,
    reg_lambda=1.0, # L2 regularization
    reg_alpha=0.0   # L1 regularization
)

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.956140350877193


### Step 2 — Implement XGBoost on a dataset of your choice  
- Example locations to find a dataset:  
  - A built-in dataset (e.g. `load_digits`)  
  - A Kaggle dataset  


In [2]:
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params_list_digits = [
    {"max_depth": 3, "eta": 0.1, "n_estimators": 100, "reg_lambda": 1.0, "reg_alpha": 0.0},
    {"max_depth": 4, "eta": 0.05, "n_estimators": 200, "reg_lambda": 1.5, "reg_alpha": 0.5},
    {"max_depth": 2, "eta": 0.2, "n_estimators": 250, "reg_lambda": 2.0, "reg_alpha": 1.0},
    {"max_depth": 6, "eta": 0.3, "n_estimators": 60, "reg_lambda": 0.5, "reg_alpha": 0.0},
]

print("\n=== Digits Dataset ===")
for i, params in enumerate(params_list_digits, 1):
    model = xgb.XGBClassifier(
        objective="multi:softprob",
        num_class=10,
        eval_metric="mlogloss",
        **params
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Model {i} Params: {params} -> Accuracy: {acc:.4f}")



=== Digits Dataset ===
Model 1 Params: {'max_depth': 3, 'eta': 0.1, 'n_estimators': 100, 'reg_lambda': 1.0, 'reg_alpha': 0.0} -> Accuracy: 0.9639
Model 2 Params: {'max_depth': 4, 'eta': 0.05, 'n_estimators': 200, 'reg_lambda': 1.5, 'reg_alpha': 0.5} -> Accuracy: 0.9611
Model 3 Params: {'max_depth': 2, 'eta': 0.2, 'n_estimators': 250, 'reg_lambda': 2.0, 'reg_alpha': 1.0} -> Accuracy: 0.9611
Model 4 Params: {'max_depth': 6, 'eta': 0.3, 'n_estimators': 60, 'reg_lambda': 0.5, 'reg_alpha': 0.0} -> Accuracy: 0.9694


### Step 3 — Experiment with Hyperparameters on your dataset and the Cancer dataset
- Change `max_depth`, `eta`, or `n_estimators`.  
- Add regularization with `reg_lambda` and `reg_alpha`.  
- **Question:** How do these changes affect performance?  


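The deepest configuration (Model 4, `max_depth=6`) scored highest at 0.9694, against 0.9611 to 0.9639 for the shallower ones. That gap is only about three test images out of 360, so deeper trees helped slightly at most, which suggests the digits classes already separate well with fairly simple boundaries.

The highest learning rate (`eta=0.3`) gave the best result here, possibly because digits is a low-noise dataset and the model does not need very small, careful steps.

More trees did not help: the best model used only 60 trees, while the runs with 200 and 250 trees scored slightly lower. Too many trees can overfit or add unnecessary complexity.

Strong regularization guards against overfitting, but on a clean, well-separated dataset like digits it can cost accuracy by restricting how closely the model fits the data. The two most heavily regularized runs (Models 2 and 3) had the lowest accuracy.

Because each run changed several hyperparameters at once, these comparisons are suggestive rather than conclusive.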
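
The four configurations above each change several hyperparameters at once, which makes it hard to say which change caused which effect. One way to separate the effects is to vary a single hyperparameter at a time around a fixed baseline. The sketch below only generates such configurations; the helper name `one_at_a_time_configs` is made up, and the baseline and candidate values are taken from the parameter list used earlier.

```python
def one_at_a_time_configs(baseline, sweeps):
    """Yield (name, config) pairs that differ from the baseline in exactly one hyperparameter."""
    for name, values in sweeps.items():
        for value in values:
            if value == baseline[name]:
                continue  # the baseline itself is evaluated separately
            config = dict(baseline)
            config[name] = value
            yield name, config


# Baseline: Model 1 from the experiment above.
baseline = {"max_depth": 3, "eta": 0.1, "n_estimators": 100,
            "reg_lambda": 1.0, "reg_alpha": 0.0}

# Candidate values drawn from the four configurations already tried.
sweeps = {
    "max_depth": [2, 3, 4, 6],
    "eta": [0.05, 0.1, 0.2, 0.3],
    "n_estimators": [60, 100, 200, 250],
}

configs = list(one_at_a_time_configs(baseline, sweeps))
for name, config in configs:
    print(f"vary {name}: {config}")
```

Each generated `config` can be passed to `xgb.XGBClassifier(**config)` in the same way as the dictionaries in `params_list_digits`, and its accuracy compared against the baseline alone.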

## Part 4 — Reflection  

Answer the following in complete sentences:  
1. What role does the **regularizer** play in preventing overfitting?  
2. How does using the **second-order Taylor expansion** help optimize the trees?  
3. What surprised you most when experimenting with hyperparameters?  
4. Why is XGBoost considered both a **statistical innovation** (Taylor expansion, regularization) and a **computer science innovation** (scalability, out-of-core learning)?  


1. The regularizer keeps the model from getting too complex by penalizing too many leaves or extreme leaf weights, which helps prevent overfitting.

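2. Using both the gradient and the curvature of the loss gives XGBoost a closed-form optimal weight for every leaf and a gain score for every candidate split, so each tree is fit with Newton-style steps rather than plain gradient steps. In practice this tends to reach a good fit in fewer boosting rounds.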

3. I was surprised that deeper trees didn't improve performance more.

4. It is a statistical innovation because it combines a second-order approximation of the loss with explicit regularization of tree complexity. It is a computer science innovation because its implementation supports parallel computation and out-of-core learning, so it scales to very large datasets, including ones that do not fit in memory.