# **Chapter 22: Tree-Based Models**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the fundamentals of decision trees and how they partition the feature space
- Build and interpret decision trees for regression and classification on NEPSE data
- Explain how random forests reduce overfitting through bagging and feature randomness
- Train random forest models and analyze feature importance
- Grasp the concept of gradient boosting and how it sequentially corrects errors
- Implement gradient boosting machines (GBM) using popular libraries
- Compare and contrast XGBoost, LightGBM, and CatBoost for time‑series forecasting
- Tune hyperparameters effectively to avoid overfitting
- Apply tree‑based models to predict NEPSE stock returns and direction
- Understand the strengths and limitations of tree‑based models in financial time‑series

---

## **22.1 Decision Trees Fundamentals**

Decision trees are intuitive, interpretable models that learn a series of if‑then‑else rules from the data. They partition the feature space into rectangular regions and assign a prediction (mean for regression, majority class for classification) to each region. In the context of the NEPSE prediction system, a decision tree might learn rules like: "If the 5‑day moving average is above the 20‑day moving average **and** yesterday's return was positive, then predict an up move."

### **22.1.1 Tree Construction**

A decision tree is built recursively:

1. Start with all training samples at the root node.
2. For each feature, evaluate possible split points and choose the one that best separates the target according to a criterion (e.g., Gini impurity for classification, mean squared error for regression).
3. Split the node into two child nodes.
4. Repeat recursively on each child node until a stopping condition is met (e.g., maximum depth, minimum samples per leaf).

The splitting criterion for regression is typically the reduction in variance (or MSE). For classification, common criteria are Gini impurity or entropy.
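
As a rough illustration of these criteria, the sketch below computes Gini impurity for a node and scans a single feature for the threshold that most reduces variance. The helper names `gini_impurity` and `best_variance_split` and the toy arrays are made up for this example; scikit‑learn's own split search is far more optimized.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_variance_split(x, y):
    """Return (threshold, variance_reduction) for the best split on one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    parent_var = y.var()
    best_threshold, best_gain = None, 0.0
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        weighted_var = (left.size * left.var() + right.size * right.var()) / y.size
        gain = parent_var - weighted_var
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Toy data: the target jumps when the feature crosses roughly 6.5
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([-1.0, -1.0, -1.0, 2.0, 2.0, 2.0])
threshold, gain = best_variance_split(x_demo, y_demo)

print(f"Gini of a 50/50 node: {gini_impurity([0, 0, 1, 1]):.2f}")
print(f"Best threshold: {threshold}, variance reduction: {gain:.2f}")
```

A real tree repeats this search over every feature at every node and keeps the split with the largest reduction.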

### **22.1.2 Building a Decision Tree for NEPSE Data**

Let's start by preparing a feature set for a single NEPSE stock. We'll create lag features, rolling statistics, and technical indicators as we did in earlier chapters.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

# Load NEPSE data
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Use a single symbol for simplicity
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Create basic features (as in previous chapters)
df_stock['Return'] = df_stock['Close'].pct_change() * 100

# Lag features
for lag in [1, 2, 3, 5]:
    df_stock[f'Return_Lag{lag}'] = df_stock['Return'].shift(lag)
    df_stock[f'Volume_Lag{lag}'] = df_stock['Vol'].shift(lag)

# Rolling features
for window in [5, 10, 20]:
    df_stock[f'MA_{window}'] = df_stock['Close'].rolling(window).mean()
    df_stock[f'Volatility_{window}'] = df_stock['Return'].rolling(window).std()

# Technical indicators (simplified RSI)
delta = df_stock['Close'].diff()
gain = delta.where(delta > 0, 0)
loss = -delta.where(delta < 0, 0)
avg_gain = gain.rolling(14).mean()
avg_loss = loss.rolling(14).mean()
rs = avg_gain / avg_loss
df_stock['RSI'] = 100 - (100 / (1 + rs))

# Target: next day's return (regression) or direction (classification)
df_stock['Target_Return'] = df_stock['Return'].shift(-1)
df_stock['Target_Direction'] = (df_stock['Target_Return'] > 0).astype(int)

# Drop NaN rows
df_stock = df_stock.dropna()

# Define features (exclude target and metadata)
feature_cols = [col for col in df_stock.columns if col not in 
                ['Date', 'Symbol', 'S.No', 'Conf.', 'Open', 'High', 'Low', 'Close', 'LTP',
                 'VWAP', 'Prev. Close', 'Turnover', 'Trans.', 'Diff', 'Range', 'Diff %',
                 'Range %', 'VWAP %', '52 Weeks High', '52 Weeks Low',
                 'Return', 'Target_Return', 'Target_Direction']]

X = df_stock[feature_cols]
y_reg = df_stock['Target_Return']
y_clf = df_stock['Target_Direction']

# Train/test split (temporal)
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train_reg, y_test_reg = y_reg.iloc[:split_idx], y_reg.iloc[split_idx:]
y_train_clf, y_test_clf = y_clf.iloc[:split_idx], y_clf.iloc[split_idx:]

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
print(f"Feature list: {feature_cols[:5]}...")
```

**Explanation:**

- We engineer a set of features typical for stock prediction: lagged returns, lagged volume, moving averages, volatility, and RSI.
- The target for regression is the next day's return; for classification, a binary indicator of positive return.
- A temporal split (80% train, 20% test) respects time order. We'll use this throughout the chapter.

Now let's fit a decision tree regressor.

```python
# Decision Tree for regression
dt_reg = DecisionTreeRegressor(max_depth=5, min_samples_split=10, random_state=42)
dt_reg.fit(X_train, y_train_reg)

# Predictions
y_pred_train = dt_reg.predict(X_train)
y_pred_test = dt_reg.predict(X_test)

print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train_reg, y_pred_train)):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_test)):.4f}")

# Visualize a small part of the tree
plt.figure(figsize=(20,10))
plot_tree(dt_reg, max_depth=2, feature_names=feature_cols, filled=True, fontsize=10)
plt.show()
```

**Explanation:**

- We set `max_depth=5` to keep the tree relatively simple and avoid overfitting. `min_samples_split=10` ensures that internal nodes have at least 10 samples before they can be split.
- The train RMSE is typically lower than the test RMSE. If the gap is large, the tree is overfitting; we could reduce `max_depth` or increase `min_samples_split`.
- The tree visualization (limited to depth 2) shows the first few splits. For example, the root might split on `RSI` or `MA_20`; features chosen near the root are the ones that reduce training error the most.

For classification, we use `DecisionTreeClassifier`.

```python
dt_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42)
dt_clf.fit(X_train, y_train_clf)

y_pred_clf = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test_clf, y_pred_clf)
print(f"Classification accuracy: {accuracy:.4f}")
```

**Explanation:**

- The classifier aims to predict direction. Accuracy above 0.5 does not by itself demonstrate predictive power; the model must beat a naive baseline such as always predicting the majority class (e.g., always "up" in a rising market). With a temporal split, evaluate that baseline on the same test set as the model.
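
A minimal sketch of such a baseline follows. The function name `majority_baseline_accuracy` and the toy labels are assumptions for illustration; in the chapter's setting you would pass `y_train_clf` and `y_test_clf` instead.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent class seen in training."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    classes, counts = np.unique(y_train, return_counts=True)
    majority_class = classes[np.argmax(counts)]
    return float(np.mean(y_test == majority_class))

# Toy labels (1 = up day, 0 = down day)
y_train_demo = [1, 1, 0, 1, 0, 1]
y_test_demo = [1, 0, 1, 1]

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

The majority class is taken from the training labels so that the baseline, like the model, uses no information from the test period.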

### **22.1.3 Interpreting Decision Trees**

Decision trees are highly interpretable. Besides reading the rules directly from the plotted tree, we can rank features by how much they contribute to the splits:

```python
# Get feature importances
importances = dt_reg.feature_importances_
feature_imp = pd.DataFrame({'feature': feature_cols, 'importance': importances})
feature_imp = feature_imp.sort_values('importance', ascending=False).head(10)
print(feature_imp)

# Plot
plt.figure(figsize=(10,6))
plt.barh(feature_imp['feature'], feature_imp['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances from Decision Tree')
plt.gca().invert_yaxis()
plt.show()
```

**Explanation:**

- Feature importance is calculated based on how much each feature reduces the splitting criterion (e.g., MSE) across all nodes. The top features give insight into what drives predictions.
- For NEPSE, we might see that recent returns (e.g., `Return_Lag1`) and volatility measures dominate.
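
The learned rules themselves can also be printed as text with scikit‑learn's `export_text`. The sketch below uses synthetic data so that it runs on its own; the column names are borrowed from this chapter purely as labels, and the target is constructed to depend only on the first column.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target depends only on the sign of the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] > 0, 1.0, -1.0) + rng.normal(scale=0.1, size=200)

tree_demo = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Print the if-then-else rules of the fitted tree
rules = export_text(tree_demo, feature_names=['Return_Lag1', 'RSI'])
print(rules)
```

For the chapter's model, the equivalent call would be `export_text(dt_reg, feature_names=list(feature_cols))`.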

### **22.1.4 Limitations of a Single Tree**

- **High variance:** Small changes in data can lead to completely different trees.
- **Instability:** Trees are sensitive to the specific training sample.
- **Overfitting:** Without pruning, trees can grow deep and memorize noise.
- **Piecewise constant predictions:** They produce step functions, not smooth approximations.

These limitations motivate ensemble methods like random forests and gradient boosting.

---

## **22.2 Random Forest**

Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data (bagging) and with a random subset of features considered at each split. This decorrelates the trees and reduces variance while maintaining low bias.

### **22.2.1 Bootstrap Aggregating (Bagging)**

- **Bootstrap:** Create B random samples (with replacement) from the training set, each of the same size as the original.
- **Aggregating:** Train a tree on each bootstrap sample, then average predictions (regression) or take majority vote (classification).
- Bagging reduces variance without increasing bias much.

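Random forest adds an extra layer of randomness: at each split, only a random subset of features is considered, which makes the trees even less correlated. Common recommendations are sqrt(n_features) for classification and n_features/3 for regression. Note that recent scikit‑learn versions default to `max_features='sqrt'` for `RandomForestClassifier` but to all features (`max_features=1.0`) for `RandomForestRegressor`, so set `max_features` explicitly if you want feature subsampling in regression.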
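
Why decorrelation matters can be seen from a standard result: the average of B identically distributed predictors with variance σ² and pairwise correlation ρ has variance ρσ² + (1 − ρ)σ²/B, so adding trees only shrinks the second term. The simulation below is a sketch under that idealized assumption (unit‑variance "trees" sharing one common component); the function names are hypothetical.

```python
import numpy as np

def variance_of_average(rho, sigma2, n_trees):
    """Theoretical variance of the mean of n_trees equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

def simulate_variance(rho, n_trees, n_draws=20_000, seed=0):
    """Simulate unit-variance predictors with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))         # component common to all trees
    own = rng.normal(size=(n_draws, n_trees))      # tree-specific component
    preds = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    return float(preds.mean(axis=1).var())

for rho in (0.8, 0.2):
    theory = variance_of_average(rho, 1.0, 100)
    simulated = simulate_variance(rho, 100)
    print(f"rho={rho}: theory={theory:.3f}, simulated={simulated:.3f}")
```

With 100 trees, lowering the correlation from 0.8 to 0.2 cuts the ensemble variance far more than adding further trees could.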

### **22.2.2 Building a Random Forest for NEPSE**

We'll use `RandomForestRegressor` and `RandomForestClassifier` from scikit‑learn.

```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_split=10,
                               random_state=42, n_jobs=-1)
rf_reg.fit(X_train, y_train_reg)

# Predict
y_pred_rf_train = rf_reg.predict(X_train)
y_pred_rf_test = rf_reg.predict(X_test)

print(f"Random Forest Train RMSE: {np.sqrt(mean_squared_error(y_train_reg, y_pred_rf_train)):.4f}")
print(f"Random Forest Test RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_rf_test)):.4f}")

# Compare with single tree
print(f"Single Tree Test RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_test)):.4f}")
```

**Explanation:**

- `n_estimators=100` creates 100 trees. More trees generally improve performance but increase computation.
- `max_depth=5` limits tree depth, controlling complexity. Without depth limit, trees can grow fully, potentially overfitting.
- The test RMSE is often lower than a single tree, demonstrating the benefit of ensembling.

For classification:

```python
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_clf.fit(X_train, y_train_clf)
y_pred_rf_clf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
```

### **22.2.3 Feature Importance from Random Forest**

Random forest provides a more stable feature importance measure than a single tree, averaged over many trees.

```python
importances = rf_reg.feature_importances_
feature_imp_rf = pd.DataFrame({'feature': feature_cols, 'importance': importances})
feature_imp_rf = feature_imp_rf.sort_values('importance', ascending=False).head(10)
print(feature_imp_rf)

# Plot
plt.figure(figsize=(10,6))
plt.barh(feature_imp_rf['feature'], feature_imp_rf['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances from Random Forest')
plt.gca().invert_yaxis()
plt.show()
```

**Explanation:**

- Importance is computed as the average reduction in impurity (e.g., MSE) across all trees for each feature.
- This ranking helps identify which features are most predictive. For NEPSE, we might see that lagged returns and volatility dominate.

### **22.2.4 Out‑of‑Bag (OOB) Score**

Since each tree is trained on a bootstrap sample, about one‑third of the data is left out (out‑of‑bag). These OOB samples can be used as a validation set without needing a separate validation split.

```python
rf_reg_oob = RandomForestRegressor(n_estimators=100, max_depth=5, oob_score=True, random_state=42)
rf_reg_oob.fit(X_train, y_train_reg)
print(f"OOB R² score: {rf_reg_oob.oob_score_:.4f}")
```

**Explanation:**

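- `oob_score=True` computes the R² score on the OOB samples. For independent observations this approximates test performance without a separate split. With time‑series data, however, the OOB rows are interleaved in time with the rows each tree was trained on, so the OOB score tends to be optimistic. Treat it as a quick sanity check and rely on temporal validation for hyperparameter tuning.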
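
The "about one‑third" figure can be checked directly: the chance that a given row is never drawn in n draws with replacement is (1 − 1/n)ⁿ, which approaches e⁻¹ ≈ 0.368. A small sketch with an arbitrary sample size of 1,000:

```python
import numpy as np

n_samples = 1000
rng = np.random.default_rng(42)

# Draw one bootstrap sample: n row indices chosen with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are out-of-bag for this tree
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()

expected = (1 - 1 / n_samples) ** n_samples
print(f"Observed OOB fraction: {oob_fraction:.3f}")
print(f"Expected (1 - 1/n)^n:  {expected:.3f}")
```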

---

## **22.3 Gradient Boosting Machines**

Gradient boosting builds an ensemble of trees sequentially, where each new tree tries to correct the errors of the previous ones. Instead of bagging, it uses boosting: trees are added one at a time, and each new tree is trained on the residual errors of the current ensemble.

### **22.3.1 Boosting Concept**

1. Start with a simple model (e.g., a constant prediction).
2. Compute the residuals (errors) of the current model.
3. Fit a new tree to predict the residuals.
4. Add the new tree to the ensemble with a small learning rate (shrinkage).
5. Repeat steps 2‑4 for M iterations.

The final prediction is the sum of all tree predictions (times the learning rate). This sequential approach can achieve high accuracy but is prone to overfitting if not carefully regularized.
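
The five steps above can be sketched from scratch for squared‑error loss, where the residuals coincide with the negative gradient. This is an illustration only: it uses one feature, depth‑1 trees (stumps), and synthetic data, and the function names are hypothetical rather than part of any library.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single-threshold split on x that minimizes squared error."""
    best = None
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak learners."""
    base = y.mean()                                   # step 1: constant model
    prediction = np.full_like(y, base, dtype=float)
    stumps, mse_history = [], []
    for _ in range(n_rounds):                         # step 5: repeat
        residuals = y - prediction                    # step 2: current errors
        stump = fit_stump(x, residuals)               # step 3: fit tree to residuals
        prediction = prediction + learning_rate * predict_stump(stump, x)  # step 4
        stumps.append(stump)
        mse_history.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, mse_history

# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(-3, 3, size=200))
y_demo = np.sin(x_demo) + rng.normal(scale=0.1, size=200)

base, stumps, mse_history = boost(x_demo, y_demo)
print(f"Train MSE after 1 round: {mse_history[0]:.4f}")
print(f"Train MSE after 50 rounds: {mse_history[-1]:.4f}")
```

Training error can only fall as rounds are added, which is exactly why a held‑out set is needed to decide when to stop.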

### **22.3.2 Gradient Boosting in Scikit‑Learn**

Scikit‑learn provides `GradientBoostingRegressor` and `GradientBoostingClassifier`.

```python
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                                    random_state=42)
gb_reg.fit(X_train, y_train_reg)

y_pred_gb = gb_reg.predict(X_test)
rmse_gb = np.sqrt(mean_squared_error(y_test_reg, y_pred_gb))
print(f"Gradient Boosting Test RMSE: {rmse_gb:.4f}")

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                     random_state=42)
gb_clf.fit(X_train, y_train_clf)
y_pred_gb_clf = gb_clf.predict(X_test)
acc_gb = accuracy_score(y_test_clf, y_pred_gb_clf)
print(f"Gradient Boosting Accuracy: {acc_gb:.4f}")
```

**Explanation:**

- `n_estimators`: number of boosting stages.
- `learning_rate`: shrinks the contribution of each tree; a lower rate requires more trees but often improves generalization.
- `max_depth`: typically kept small (3‑5) to keep trees weak, which is beneficial for boosting.

### **22.3.3 Early Stopping**

To avoid overfitting, we can use early stopping: monitor validation performance and stop adding trees when it no longer improves.

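```python
# Temporal hold-out: the last 20% of the training period, never shown to the model
val_split = int(len(X_train) * 0.8)
X_train_gb, X_val_gb = X_train.iloc[:val_split], X_train.iloc[val_split:]
y_train_gb, y_val_gb = y_train_reg.iloc[:val_split], y_train_reg.iloc[val_split:]

# Built-in early stopping: scikit-learn sets aside a random 20% of the rows passed to fit()
gb_early = GradientBoostingRegressor(n_estimators=500, validation_fraction=0.2,
                                      n_iter_no_change=10, tol=0.001, random_state=42)
gb_early.fit(X_train_gb, y_train_gb)

print(f"Trees fitted before stopping: {gb_early.n_estimators_}")

# Check the stopped model on the temporal hold-out
rmse_val = np.sqrt(mean_squared_error(y_val_gb, gb_early.predict(X_val_gb)))
print(f"Temporal validation RMSE: {rmse_val:.4f}")
```

**Explanation:**

- `validation_fraction` reserves part of the rows passed to `fit` for internal validation. scikit‑learn draws this subset at random rather than from the end of the series, so the internal split does not respect time order.
- `n_iter_no_change` stops training when the validation loss fails to improve by at least `tol` for that many consecutive iterations.
- `n_estimators_` reports how many trees were fitted before training stopped.
- `X_val_gb` and `y_val_gb` form a separate temporal hold‑out, used here to check the stopped model on data that comes strictly after its training rows.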

### **22.3.4 Feature Importance**

Like random forest, gradient boosting also provides feature importances.

```python
importances_gb = gb_reg.feature_importances_
feature_imp_gb = pd.DataFrame({'feature': feature_cols, 'importance': importances_gb})
feature_imp_gb = feature_imp_gb.sort_values('importance', ascending=False).head(10)
print(feature_imp_gb)
```

---

## **22.4 XGBoost**

XGBoost (Extreme Gradient Boosting) is an optimized implementation of gradient boosting that has become the go‑to algorithm for many machine learning competitions. It includes enhancements like regularization, handling of missing values, and efficient tree pruning.

### **22.4.1 System Overview**

XGBoost features:

- **Regularization:** L1 and L2 penalties on leaf weights to reduce overfitting.
- **Sparsity awareness:** Handles missing values automatically by learning the best direction to go when a value is missing.
- **Weighted quantile sketch:** Efficiently finds optimal split points.
- **Cross‑validation and early stopping** built in.
- **Parallel processing** (though boosting is sequential, tree construction can be parallelized).

### **22.4.2 Installing and Using XGBoost**

```bash
pip install xgboost
```

```python
import xgboost as xgb

# Prepare data in DMatrix format (optional but efficient).
# Early stopping monitors the temporal validation split from Section 22.3.3,
# so the test set is used only for the final evaluation.
dtrain = xgb.DMatrix(X_train_gb, label=y_train_gb)
dval = xgb.DMatrix(X_val_gb, label=y_val_gb)
dtest = xgb.DMatrix(X_test, label=y_test_reg)

# Set parameters
params = {
    'objective': 'reg:squarederror',
    'max_depth': 3,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'seed': 42
}

# Train with early stopping on the validation set
model_xgb = xgb.train(params, dtrain, num_boost_round=1000,
                      evals=[(dval, 'validation')],
                      early_stopping_rounds=10,
                      verbose_eval=False)

# Predict with the trees up to and including the best round
y_pred_xgb = model_xgb.predict(dtest, iteration_range=(0, model_xgb.best_iteration + 1))
rmse_xgb = np.sqrt(mean_squared_error(y_test_reg, y_pred_xgb))
print(f"XGBoost Test RMSE: {rmse_xgb:.4f}")
```

**Explanation:**

- `subsample`: fraction of samples used per tree (stochastic gradient boosting).
- `colsample_bytree`: fraction of features used per tree.
- `reg_alpha` and `reg_lambda` are L1 and L2 regularization on leaf weights.
- `early_stopping_rounds` stops training if the validation error doesn't improve for 10 rounds. Monitoring the test set instead would make the reported test RMSE optimistic.
- `best_iteration` records the best round. Passing `iteration_range` to `predict` uses exactly those trees, whatever the default behavior of the installed XGBoost version.

For classification:

```python
params_clf = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'eval_metric': 'logloss',
    'seed': 42
}

# Same temporal train/validation split as for regression, with direction labels
dtrain_clf = xgb.DMatrix(X_train_gb, label=y_train_clf.iloc[:val_split])
dval_clf = xgb.DMatrix(X_val_gb, label=y_train_clf.iloc[val_split:])
dtest_clf = xgb.DMatrix(X_test, label=y_test_clf)

model_xgb_clf = xgb.train(params_clf, dtrain_clf, num_boost_round=1000,
                           evals=[(dval_clf, 'validation')],
                           early_stopping_rounds=10,
                           verbose_eval=False)

# Predicted probabilities of an up day, thresholded at 0.5
y_pred_xgb_clf_prob = model_xgb_clf.predict(
    dtest_clf, iteration_range=(0, model_xgb_clf.best_iteration + 1))
y_pred_xgb_clf = (y_pred_xgb_clf_prob > 0.5).astype(int)
acc_xgb = accuracy_score(y_test_clf, y_pred_xgb_clf)
print(f"XGBoost Accuracy: {acc_xgb:.4f}")
```

### **22.4.3 Feature Importance in XGBoost**

XGBoost provides several types of importance: weight (number of times a feature is used), gain (average gain when the feature is used), and cover (average coverage).

```python
importance = model_xgb.get_score(importance_type='gain')
# Convert to DataFrame
imp_df = pd.DataFrame(list(importance.items()), columns=['feature', 'importance'])
imp_df = imp_df.sort_values('importance', ascending=False).head(10)
print(imp_df)
```

**Explanation:**

- `importance_type='gain'` is often more informative than 'weight' because it reflects how much the feature improves the model.
- The results can be plotted similarly.

### **22.4.4 Hyperparameter Tuning with XGBoost**

We can use `GridSearchCV` or `RandomizedSearchCV` with XGBoost's scikit‑learn API.

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200],
    'subsample': [0.8, 1.0]
}

# Each validation fold comes strictly after its training fold
tscv = TimeSeriesSplit(n_splits=3)
grid = GridSearchCV(xgb_model, param_grid, cv=tscv, scoring='neg_mean_squared_error', verbose=1)
grid.fit(X_train, y_train_reg)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score (negative MSE): {grid.best_score_:.4f}")
```

**Note:** With time‑series data, every validation fold must come after the data used to fit the model. Passing an integer such as `cv=3` would use ordinary K‑fold, in which some models are trained on observations that follow their validation block; `TimeSeriesSplit` avoids this.

---

## **22.5 LightGBM**

LightGBM (Light Gradient Boosting Machine) is another gradient boosting framework developed by Microsoft. It is designed for efficiency and speed, especially with large datasets. It introduces two novel techniques: **Gradient‑based One‑Side Sampling (GOSS)** and **Exclusive Feature Bundling (EFB)**.

### **22.5.1 Leaf‑Wise Growth**

Unlike level‑wise growth (XGBoost's default), LightGBM grows trees leaf‑wise: it splits the leaf with the highest loss reduction, leading to deeper, asymmetric trees. This can converge faster but may overfit on small datasets.

### **22.5.2 Installing and Using LightGBM**

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# Prepare datasets (LightGBM has its own Dataset format).
# Early stopping uses the temporal validation split from Section 22.3.3, not the test set.
train_data = lgb.Dataset(X_train_gb, label=y_train_gb)
val_data = lgb.Dataset(X_val_gb, label=y_val_gb, reference=train_data)

# Set parameters
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
    'seed': 42
}

# Train with early stopping on the validation set
model_lgb = lgb.train(params,
                      train_data,
                      valid_sets=[val_data],
                      num_boost_round=1000,
                      callbacks=[lgb.early_stopping(10), lgb.log_evaluation(0)])

# Predict on the untouched test set using the best iteration
y_pred_lgb = model_lgb.predict(X_test, num_iteration=model_lgb.best_iteration)
rmse_lgb = np.sqrt(mean_squared_error(y_test_reg, y_pred_lgb))
print(f"LightGBM Test RMSE: {rmse_lgb:.4f}")
```

**Explanation:**

- `num_leaves` controls the complexity (similar to `max_depth` in level‑wise trees). Typical values 31, 63, etc.
- `feature_fraction` is analogous to `colsample_bytree`.
- `bagging_fraction` and `bagging_freq` implement bagging (subsampling) to reduce overfitting.
- Early stopping monitors the validation metric (here RMSE) and stops if no improvement.

For classification:

```python
params_clf = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,   # bagging_fraction has no effect unless bagging_freq > 0
    'verbose': -1,
    'seed': 42
}

# Same temporal train/validation split as for regression, with direction labels
train_clf = lgb.Dataset(X_train_gb, label=y_train_clf.iloc[:val_split])
val_clf = lgb.Dataset(X_val_gb, label=y_train_clf.iloc[val_split:], reference=train_clf)

model_lgb_clf = lgb.train(params_clf, train_clf, valid_sets=[val_clf],
                          num_boost_round=1000,
                          callbacks=[lgb.early_stopping(10), lgb.log_evaluation(0)])

# Predicted probabilities of an up day, thresholded at 0.5
y_pred_lgb_prob = model_lgb_clf.predict(X_test)
y_pred_lgb_clf = (y_pred_lgb_prob > 0.5).astype(int)
acc_lgb = accuracy_score(y_test_clf, y_pred_lgb_clf)
print(f"LightGBM Accuracy: {acc_lgb:.4f}")
```

### **22.5.3 Feature Importance in LightGBM**

LightGBM provides importance via `feature_importance()`.

```python
importance_lgb = model_lgb.feature_importance(importance_type='gain')
feature_imp_lgb = pd.DataFrame({'feature': feature_cols, 'importance': importance_lgb})
feature_imp_lgb = feature_imp_lgb.sort_values('importance', ascending=False).head(10)
print(feature_imp_lgb)
```

---

## **22.6 CatBoost**

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex. It excels at handling categorical features automatically (hence the name) and is known for its robustness and reduced need for hyperparameter tuning.

### **22.6.1 Ordered Boosting and Categorical Handling**

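- **Ordered boosting:** A permutation‑driven scheme in which the statistics used for each sample (residuals and categorical target encodings) are computed only from samples that precede it in a random ordering, which limits target leakage.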
- **Automatic categorical feature processing:** You can specify which columns are categorical, and CatBoost applies various encoding schemes (e.g., one‑hot, mean encoding) with counter‑measures against overfitting.
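
The idea behind leakage‑free target encoding can be sketched in a few lines: each row's category is encoded from the targets of *earlier* rows only, smoothed toward a prior. This is a simplified illustration, not CatBoost's implementation (which, among other things, averages over several random permutations); the function name, the smoothing parameters, and the toy symbols are assumptions for the example.

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i, (cat, target) in enumerate(zip(categories, targets)):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded[i] = (s + prior * prior_weight) / (c + prior_weight)
        # Update the running statistics only after encoding the current row
        sums[cat] = s + target
        counts[cat] = c + 1
    return encoded

# Toy example: stock symbol as the categorical feature, next-day direction as target
symbols = ['NABIL', 'NABIL', 'NTC', 'NABIL', 'NTC']
up_next_day = [1, 0, 1, 1, 0]

encoded = ordered_target_encoding(symbols, up_next_day)
print(encoded)
```

Because a row's own target never enters its encoding, the encoded feature cannot simply memorize the label.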

### **22.6.2 Installing and Using CatBoost**

```bash
pip install catboost
```

```python
from catboost import CatBoostRegressor, CatBoostClassifier

# CatBoost Regressor
cb_reg = CatBoostRegressor(iterations=1000,
                           learning_rate=0.1,
                           depth=3,
                           loss_function='RMSE',
                           verbose=100,
                           early_stopping_rounds=10,
                           random_seed=42)

# Early stopping uses the temporal validation split from Section 22.3.3, not the test set
cb_reg.fit(X_train_gb, y_train_gb,
           eval_set=(X_val_gb, y_val_gb),
           plot=False)  # plot=True for interactive visualization

y_pred_cb = cb_reg.predict(X_test)
rmse_cb = np.sqrt(mean_squared_error(y_test_reg, y_pred_cb))
print(f"CatBoost Test RMSE: {rmse_cb:.4f}")

# CatBoost Classifier
cb_clf = CatBoostClassifier(iterations=1000,
                            learning_rate=0.1,
                            depth=3,
                            loss_function='Logloss',
                            verbose=100,
                            early_stopping_rounds=10,
                            random_seed=42)

# Same temporal split, with direction labels
cb_clf.fit(X_train_gb, y_train_clf.iloc[:val_split],
           eval_set=(X_val_gb, y_train_clf.iloc[val_split:]),
           plot=False)

y_pred_cb_clf = cb_clf.predict(X_test)
acc_cb = accuracy_score(y_test_clf, y_pred_cb_clf)
print(f"CatBoost Accuracy: {acc_cb:.4f}")
```

**Explanation:**

- `iterations` is the number of trees (boosting rounds).
- `depth` is the tree depth (similar to `max_depth`).
- CatBoost automatically handles missing values and categorical features. If we had categorical columns (like stock symbol), we would pass them via the `cat_features` parameter.
- The `verbose` parameter controls output; `early_stopping_rounds` stops if validation metric doesn't improve.

### **22.6.3 Feature Importance in CatBoost**

```python
importance_cb = cb_reg.get_feature_importance()
feature_imp_cb = pd.DataFrame({'feature': feature_cols, 'importance': importance_cb})
feature_imp_cb = feature_imp_cb.sort_values('importance', ascending=False).head(10)
print(feature_imp_cb)
```

---

## **22.7 Comparison of Tree Models**

Let's compare the performance of the different tree‑based models on the NEPSE regression task.

```python
models = {
    'Decision Tree': dt_reg,
    'Random Forest': rf_reg,
    'Gradient Boosting': gb_reg,
    'XGBoost': model_xgb,
    'LightGBM': model_lgb,
    'CatBoost': cb_reg
}

results = []
for name, model in models.items():
    if name in ['XGBoost', 'LightGBM', 'CatBoost']:
        # These models have their own predict methods
        if name == 'XGBoost':
            y_pred = model.predict(xgb.DMatrix(X_test))
        elif name == 'LightGBM':
            y_pred = model.predict(X_test, num_iteration=model.best_iteration)
        else:
            y_pred = model.predict(X_test)
    else:
        y_pred = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred))
    results.append({'Model': name, 'Test RMSE': rmse})

results_df = pd.DataFrame(results).sort_values('Test RMSE')
print(results_df)
```

**Explanation:**

- The comparison shows which model performs best on this particular dataset. Typically, boosting methods (XGBoost, LightGBM, CatBoost) outperform bagging (Random Forest) and single trees, but the ranking can vary with data.
- Note that we haven't tuned hyperparameters extensively; a fair comparison would involve tuning each model.

**Key Differences:**

| Feature | Random Forest | Gradient Boosting | XGBoost | LightGBM | CatBoost |
|---------|---------------|-------------------|---------|----------|----------|
| Tree growth | Level-wise | Level-wise | Level-wise (histogram) | Leaf-wise | Symmetric |
| Handling categoricals | One-hot encoding needed | One-hot encoding needed | One-hot encoding, or native support in recent versions | One-hot encoding or native | Native, excellent |
| Speed on large data | Moderate | Moderate | Fast | Very fast | Fast |
| Overfitting tendency | Low | High (needs tuning) | Moderate | Moderate | Low |
| Default performance | Good | Good | Very good | Very good | Very good |
| Interpretability | High (importances) | Medium | Medium | Medium | Medium |

---

## **22.8 Best Practices for Tree‑Based Models in Time‑Series**

### **22.8.1 Feature Engineering**

Tree models can handle non‑linear relationships and interactions automatically, but they still benefit from good feature engineering. For NEPSE, include:

- Lagged returns and volumes.
- Rolling statistics (mean, std, min, max, skew).
- Technical indicators (RSI, MACD, Bollinger Bands).
- Calendar features (day of week, month, fiscal quarter).
- Domain‑specific features (circuit breaker proximity).

### **22.8.2 Preventing Look‑Ahead Bias**

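All features must be computable at prediction time, so rolling windows may use only data available when the prediction is made. In this chapter the target is the *next* day's return, so features built from today's close (such as `rolling(5).mean()`) are legitimate once the market has closed. If the target were today's return instead, the same feature would leak today's price and would have to be lagged with `rolling(5).mean().shift(1)`.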
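
The effect of the extra `shift(1)` is easiest to see on a few made‑up closing prices. The sketch below compares a 3‑day moving average that includes the current row with one that uses strictly earlier rows:

```python
import pandas as pd

# Hypothetical closing prices for six consecutive trading days
close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])

ma3_including_today = close.rolling(3).mean()        # window ends at the current row
ma3_past_only = close.rolling(3).mean().shift(1)     # window ends at the previous row

demo = pd.DataFrame({
    'Close': close,
    'MA3_incl_today': ma3_including_today,
    'MA3_past_only': ma3_past_only,
})
print(demo)
```

On the fourth day, the past‑only average is 101.0 (days one to three), whereas the unshifted average already contains that day's close of 105.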

### **22.8.3 Handling Missing Values**

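The boosting libraries handle missing values internally: XGBoost and LightGBM learn a default direction for missing values at each split, while CatBoost by default treats missing numeric values as smaller than all observed values. Even so, it's good practice to either forward‑fill (within each stock, never across stocks) or drop rows with too many missing values.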

### **22.8.4 Temporal Cross‑Validation**

Use time‑series cross‑validation (e.g., `TimeSeriesSplit`) to tune hyperparameters. Random shuffling will leak future information.

```python
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

tscv = TimeSeriesSplit(n_splits=3)
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [100, 200]}
rf = RandomForestRegressor(random_state=42)
grid = GridSearchCV(rf, param_grid, cv=tscv, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train_reg)  # tune on the training period only; the test set stays held out
print(grid.best_params_)
```
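
To see what such folds look like, the following sketch generates expanding‑window splits by hand. The helper `expanding_window_splits` is hypothetical and only mirrors the idea behind `TimeSeriesSplit`: every test block lies after all of its training indices.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); each test block follows its training block."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Ten observations in time order, three folds
folds = list(expanding_window_splits(10, 3))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

The training window grows with each fold, so later folds resemble the final model most closely, while no fold ever trains on data from after its test block.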

### **22.8.5 Regularization and Early Stopping**

- For boosting models, always use early stopping on a validation set to prevent overfitting.
- Use `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda` to add regularization.
- For random forests, `max_depth`, `min_samples_split`, and `min_samples_leaf` control complexity.

### **22.8.6 Interpretability**

Use feature importances to understand which features drive predictions. This can also help in feature selection and debugging.

### **22.8.7 Scalability**

For large datasets (many stocks and long history), LightGBM and XGBoost are faster. Consider using GPU acceleration if available.

---

## **22.9 Application to NEPSE: Predicting Next‑Day Direction**

Let's put it all together: we'll build a classifier to predict whether the NEPSE index (or a specific stock) will go up the next day. We'll use CatBoost for its ease of use and good default performance.

```python
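# Assume X_train, y_train_clf, X_test, y_test_clf are already defined from temporal split

# Hold out the last 20% of the training period for early stopping,
# so the test set is used only for the final evaluation
val_split = int(len(X_train) * 0.8)
X_fit, X_val = X_train.iloc[:val_split], X_train.iloc[val_split:]
y_fit, y_val = y_train_clf.iloc[:val_split], y_train_clf.iloc[val_split:]

# CatBoost classifier with some tuning
model = CatBoostClassifier(iterations=500,
                           learning_rate=0.05,
                           depth=4,
                           loss_function='Logloss',
                           eval_metric='Accuracy',
                           verbose=200,
                           early_stopping_rounds=20,
                           random_seed=42)

model.fit(X_fit, y_fit,
          eval_set=(X_val, y_val),
          plot=False)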

# Predict on test
y_pred = model.predict(X_test)
acc = accuracy_score(y_test_clf, y_pred)
print(f"Test accuracy: {acc:.4f}")

# Feature importance
feat_imp = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.get_feature_importance()
}).sort_values('importance', ascending=False).head(10)
print(feat_imp)

# Plot
plt.figure(figsize=(10,6))
plt.barh(feat_imp['feature'], feat_imp['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances - CatBoost Classifier')
plt.gca().invert_yaxis()
plt.show()
```

**Explanation:**

- An accuracy above 0.5 is not, by itself, evidence of predictive power. Compare it with a naive baseline: if, say, 55% of the test days are up days, always predicting "up" already scores 0.55, and the model is useful only if it beats that figure.
- Feature importance reveals which features the model relies on. For NEPSE, we might see that recent returns, volatility, and RSI are top.
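
Whether a small edge over the baseline is distinguishable from luck also depends on the number of test days. A rough check, assuming daily outcomes are independent (which real returns only approximate), is a z‑score under the normal approximation to the binomial. The function name and the figures below are made up for illustration.

```python
import math

def accuracy_z_score(accuracy, baseline, n_test):
    """z-score of an observed accuracy against a baseline rate (normal approximation)."""
    standard_error = math.sqrt(baseline * (1.0 - baseline) / n_test)
    return (accuracy - baseline) / standard_error

# Hypothetical figures: 54% accuracy on 250 test days against a 50% baseline
z = accuracy_z_score(0.54, 0.50, 250)
print(f"z-score: {z:.2f}")
```

A z‑score of about 1.26 falls short of the conventional 1.96 threshold, so a four‑point edge on 250 days would not be convincing on its own.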

---

## **22.10 Chapter Summary**

In this chapter, we explored tree‑based models for time‑series prediction, using the NEPSE dataset as a concrete example.

- **Decision trees** provide interpretable rules but are prone to overfitting.
- **Random forests** reduce variance by bagging and feature subsampling, yielding robust models.
- **Gradient boosting** builds trees sequentially to correct errors, often achieving state‑of‑the‑art performance.
- **XGBoost, LightGBM, and CatBoost** are optimized implementations with additional features like regularization, native categorical support, and speed.
- We compared their strengths and weaknesses and provided best practices for time‑series applications.

### **Practical Takeaways for the NEPSE System:**

- Tree models are excellent for capturing non‑linear relationships and interactions among features like lagged returns, volume, and technical indicators.
- Always use temporal cross‑validation to avoid look‑ahead bias.
- Regularization and early stopping are crucial to prevent overfitting, especially in boosting.
- Feature importance helps interpret what drives predictions, which is valuable for understanding market dynamics.
- CatBoost and LightGBM are good starting points due to their ease of use and performance.

In the next chapter, **Chapter 23: Linear Models for Time‑Series**, we will revisit linear models but with a focus on regularization and their application to financial forecasting.

---

**End of Chapter 22**

