# Synthetic Data for Heterogeneous Hedged Random Forest

This notebook compares the **Heterogeneous Hedged Random Forest** (HH-RF) with the **Standard Random Forest** (RF) and the **Hedged Random Forest** (H-RF) on synthetic data with heterogeneous regimes. The regimes are determined by a conditioning variable $z$, so the experiment shows how each model performs when both the signal and the noise level change with $z$.

## Data Generation and Mathematical Setup

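The dataset consists of $n = 1000$ samples with $d = 5$ predictor variables, $X = [X_1, X_2, \dots, X_5]$, each drawn uniformly from $[-1, 1)$, and a conditioning variable $z$, drawn uniformly from $[0, 1)$, which determines the underlying regime. Only $X_1$, $X_2$, and $X_3$ enter the signal; $X_4$ and $X_5$ are unused distractors. The target variable $y$ is a piecewise function of $X$ and $z$ plus regime-dependent noise: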

$$
y = f(X, z) + \epsilon(z)
$$

Where:

- $f(X, z)$ is a piecewise function that describes different relationships between $X$ and $y$ depending on $z$:

$$
f(X, z) =
\begin{cases}
    2X_1 - X_2 + 3, & \text{if } z < 0.3 \quad \text{(Regime 1: Linear)} \\
    \sin(2\pi X_1) + X_2^2 - 1, & \text{if } 0.3 \leq z < 0.6 \quad \text{(Regime 2: Non-linear)} \\
    0.5X_1 - 2X_3 + \eta, & \text{if } z \geq 0.6 \quad \text{(Regime 3: Noisy)}
\end{cases}
$$

Here, $\eta \sim \mathcal{N}(0, 2^2)$ is an additional high-noise term that appears only in Regime 3 and is drawn independently of $\epsilon(z)$.

- The noise $\epsilon(z)$ is also heterogeneous and is defined as:

$$
\epsilon(z) \sim
\begin{cases}
    \mathcal{N}(0, 0.5^2), & \text{if } z < 0.3 \quad \text{(Low Noise)} \\
    \mathcal{N}(0, 1^2), & \text{if } 0.3 \leq z < 0.6 \quad \text{(Medium Noise)} \\
    \mathcal{N}(0, 2^2), & \text{if } z \geq 0.6 \quad \text{(High Noise)}
\end{cases}
$$

Thus, the target $y$ is a noisy and heterogeneous function of $X$ that varies depending on the conditioning variable $z$.
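
Because the regime boundaries and noise scales are fixed, the irreducible noise level of this setup can be worked out by hand. The short derivation below is a back-of-the-envelope sketch, not part of the original experiment; it assumes $z$ is uniform on $[0, 1)$, so the three regimes have probabilities $0.3$, $0.3$, and $0.4$, and that $\eta$ and $\epsilon(z)$ are independent.

$$
\operatorname{Var}(\eta + \epsilon(z) \mid z \geq 0.6) = 2^2 + 2^2 = 8
$$

$$
\mathbb{E}_z\big[\operatorname{Var}(y - f_{\text{signal}} \mid z)\big] = 0.3 \cdot 0.5^2 + 0.3 \cdot 1^2 + 0.4 \cdot 8 = 0.075 + 0.3 + 3.2 = 3.575
$$

Here $f_{\text{signal}}$ denotes the deterministic part of $f(X, z)$, that is, $f$ without $\eta$. Under these assumptions, even a predictor that knows the signal and $z$ exactly has an expected MSE of about $3.6$, and close to 90% of that floor comes from Regime 3.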

### Visualizing the Synthetic Data

The scatter plot below shows the target $y$ against the conditioning variable $z$, with each point color-coded by the regime it belongs to.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Number of samples and dimensions
n_samples = 1000
n_features = 5

# Generate random features (X)
X = np.random.uniform(-1, 1, size=(n_samples, n_features))

# Generate conditioning variable (z)
z = np.random.uniform(0, 1, size=n_samples)

# Define piecewise function f(X, z)
def f(X, z):
    # Regime 1: z < 0.3 -> Linear relationship
    linear = X[:, 0] * 2 + X[:, 1] * (-1) + 3
    # Regime 2: 0.3 <= z < 0.6 -> Non-linear relationship
    nonlinear = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]**2 - 1
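    # Regime 3: z >= 0.6 -> Linear in X_1 and X_3 plus the extra noise term eta ~ N(0, 2^2)
    # (eta is drawn for every row, but np.where below only keeps it for Regime 3)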
    noisy = X[:, 0] * 0.5 + X[:, 2] * (-2) + np.random.normal(0, 2, size=X.shape[0])

    # Combine regimes based on z
    y = np.where(z < 0.3, linear,
                 np.where(z < 0.6, nonlinear, noisy))
    return y

# Generate target variable y
y = f(X, z)

# Add heterogeneous noise
noise = np.where(z < 0.3, np.random.normal(0, 0.5, size=n_samples),  # Low noise
                 np.where(z < 0.6, np.random.normal(0, 1, size=n_samples),  # Medium noise
                          np.random.normal(0, 2, size=n_samples)))  # High noise
y += noise

# Visualize data
# Define clusters based on z values
cluster_1 = z < 0.3  # Cluster 1: z < 0.3
cluster_2 = (z >= 0.3) & (z < 0.6)  # Cluster 2: 0.3 <= z < 0.6
cluster_3 = z >= 0.6  # Cluster 3: z >= 0.6

# Assign colors to the clusters
colors = np.zeros_like(z, dtype=int)
colors[cluster_1] = 0  # Cluster 1: Color 0 (red end of 'RdYlBu')
colors[cluster_2] = 1  # Cluster 2: Color 1 (yellow, middle of 'RdYlBu')
colors[cluster_3] = 2  # Cluster 3: Color 2 (blue end of 'RdYlBu')

# Define the color map
cmap = plt.get_cmap('RdYlBu')  # Or any other color map

# Plot the data with colors representing the clusters
plt.scatter(z, y, c=colors, cmap=cmap, alpha=0.7, edgecolors='k', label="Target Variable (y)")
plt.xlabel("Conditioning Variable (z)")
plt.ylabel("Target (y)")
plt.title("Synthetic Data: Heterogeneous Regimes with Clusters")
plt.show()

### Explanation of the Plot

The scatter plot above shows the synthetic data, with each data point colored according to the regime it belongs to based on the value of $z$. The three distinct colors represent the different regimes with varying relationships and noise levels:

- Regime 1: Linear relationship with low noise ($z < 0.3$).
- Regime 2: Non-linear relationship with medium noise ($0.3 \leq z < 0.6$).
- Regime 3: Linear relationship in $X_1$ and $X_3$ with high noise ($z \geq 0.6$).
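
The regime differences listed above can also be checked numerically. The sketch below regenerates data following the piecewise definition and prints the count, mean, and standard deviation of $y$ per regime. The helpers `make_regime_data` and `summarize_by_regime` are hypothetical names introduced for illustration, and the sketch uses its own random generator and seed, so its numbers will not match the notebook's arrays exactly.

```python
import numpy as np

def make_regime_data(n_samples=1000, n_features=5, seed=0):
    """Regenerate data following the piecewise definition (illustrative helper)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n_samples, n_features))
    z = rng.uniform(0, 1, size=n_samples)
    # Regime index: 0 for z < 0.3, 1 for 0.3 <= z < 0.6, 2 for z >= 0.6
    regime = np.digitize(z, [0.3, 0.6])
    signal = np.choose(regime, [
        2 * X[:, 0] - X[:, 1] + 3,                        # Regime 1: linear
        np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 - 1,   # Regime 2: non-linear
        0.5 * X[:, 0] - 2 * X[:, 2],                      # Regime 3: linear part
    ])
    # eta only contributes in Regime 3; epsilon's scale depends on the regime
    eta = np.where(regime == 2, rng.normal(0, 2, size=n_samples), 0.0)
    eps = rng.normal(0, np.array([0.5, 1.0, 2.0])[regime])
    return X, z, signal + eta + eps, regime

def summarize_by_regime(y, regime):
    """Return {regime index: (count, mean of y, std of y)}."""
    return {
        int(r): (int(np.sum(regime == r)),
                 float(y[regime == r].mean()),
                 float(y[regime == r].std()))
        for r in np.unique(regime)
    }

X_demo, z_demo, y_demo, regime_demo = make_regime_data()
stats_demo = summarize_by_regime(y_demo, regime_demo)
for r, (count, mean, std) in stats_demo.items():
    print(f"Regime {r + 1}: n={count}, mean(y)={mean:.2f}, std(y)={std:.2f}")
```

With data generated this way, Regime 1 should be centered near 3 (the intercept of the linear rule), while Regime 3 should be centered near 0 and show by far the widest spread.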

## Model Training and Evaluation

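We now evaluate the performance of the three models: Standard Random Forest (RF), Hedged Random Forest (H-RF), and Heterogeneous Hedged Random Forest (HH-RF). The data is split 70/30 into training and test sets. RF and H-RF are trained on the features $X$ and target $y$ only, while HH-RF additionally receives the conditioning variable $z$ at both fit and predict time. All three are evaluated on the same held-out test set.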

### MSE, MAE, and $R^2$ Evaluation

We use three performance metrics: **Mean Squared Error (MSE)**, **Mean Absolute Error (MAE)**, and **$R^2$ Score**.

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
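
The three formulas above translate directly into a few lines of NumPy. The following is a minimal sketch with a small made-up example; `regression_metrics` is a hypothetical helper name, not part of the notebook or of any library used here.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R^2 exactly as defined in the formulas above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                   # numerator of the R^2 ratio
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # denominator of the R^2 ratio
    return {"MSE": float(mse), "MAE": float(mae), "R2": float(1.0 - ss_res / ss_tot)}

# Made-up toy values, chosen so the arithmetic is easy to follow by hand
toy_true = [3.0, -0.5, 2.0, 7.0]
toy_pred = [2.5, 0.0, 2.0, 8.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)  # MSE = 0.375, MAE = 0.5, R2 ≈ 0.9486

# Predicting the mean of y_true for every sample gives R^2 = 0 by construction
mean_baseline = regression_metrics(toy_true, np.full(4, np.mean(toy_true)))
print(mean_baseline["R2"])
```

The second call is a useful reference point when reading the results: an $R^2$ near zero means the model does little better than always predicting the mean of the targets it is scored against.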

Let's train and evaluate the models:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from hedged_random_forests import HedgedRandomForestRegressor
from heterogeneous_hedged_random_forests import HeterogeneousHedgedRandomForestRegressor

# Split data into training and testing
X_train, X_test, y_train, y_test, z_train, z_test = train_test_split(X, y, z, test_size=0.3, random_state=42)

# Fit and evaluate Vanilla RF
vanilla_rf = RandomForestRegressor(n_estimators=100, random_state=42)
vanilla_rf.fit(X_train, y_train)
rf_preds = vanilla_rf.predict(X_test)
rf_mse = np.mean((y_test - rf_preds) ** 2)

# Fit and evaluate H-RF
h_rf = HedgedRandomForestRegressor(n_estimators=100, random_state=42)
h_rf.fit(X_train, y_train)
h_rf_preds = h_rf.predict(X_test)
h_rf_mse = np.mean((y_test - h_rf_preds) ** 2)

# Fit and evaluate HH-RF
hh_rf = HeterogeneousHedgedRandomForestRegressor(n_estimators=100, n_partition=3, random_state=42)
hh_rf.fit(X_train, y_train, z_train.reshape(-1, 1))
hh_rf_preds = hh_rf.predict(X_test, z_test.reshape(-1, 1))
hh_rf_mse = np.mean((y_test - hh_rf_preds) ** 2)

# Print results
print(f"Vanilla RF MSE: {rf_mse}")
print(f"H-RF MSE: {h_rf_mse}")
print(f"HH-RF MSE: {hh_rf_mse}")

### Performance Metrics for Each Model

The following cell computes the **MSE**, **MAE**, and **$R^2$** scores on the test set for the three models and collects them in a table.

In [None]:
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

metrics = {
    'Model': ['Standard RF', 'Hedged RF', 'Heterogeneous Hedged RF'],
    'MSE': [rf_mse, h_rf_mse, hh_rf_mse],
    'MAE': [
        mean_absolute_error(y_test, rf_preds),
        mean_absolute_error(y_test, h_rf_preds),
        mean_absolute_error(y_test, hh_rf_preds)
    ],
    'R²': [
        r2_score(y_test, rf_preds),
        r2_score(y_test, h_rf_preds),
        r2_score(y_test, hh_rf_preds)
    ]
}

metrics_df = pd.DataFrame(metrics)
print(metrics_df)

The table below summarizes the performance of each model:

| Model                     | MSE        | MAE        | $R^2$  |
|---------------------------|------------|------------|------------|
| **Standard RF**            | 7.0339     | 2.1511     | 0.0041     |
| **Hedged RF**              | 6.9409     | 2.1070     | 0.0172     |
| **Heterogeneous Hedged RF**| 6.8365     | 2.0866     | 0.0320     |

## Conclusion

On this train/test split, the **Heterogeneous Hedged Random Forest** (HH-RF) achieves the lowest MSE (6.8365) and MAE (2.0866) and the highest $R^2$ (0.0320) of the three models, ahead of **Hedged RF** and **Standard RF**. The margins are small and every $R^2$ is close to zero, which is consistent with the heavy noise in Regime 3 and with the Standard RF and Hedged RF never seeing $z$. Within those limits, the ordering suggests that conditional partitioning and adaptive hedging help HH-RF adapt to the regime-specific relationships and noise structures of the synthetic dataset.
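
Aggregate metrics cannot show which regime a model handles well, so a per-regime breakdown is one way to probe the regime-specific claim. The sketch below is an illustrative addition: `mse_by_regime` is a hypothetical helper, the regime edges are taken from the data definition, and the toy arrays are made up purely to show the arithmetic.

```python
import numpy as np

def mse_by_regime(y_true, y_pred, z, edges=(0.3, 0.6)):
    """Mean squared error within each z-regime, split at the given edges."""
    y_true, y_pred, z = (np.asarray(a, dtype=float) for a in (y_true, y_pred, z))
    # 0 for z < 0.3, 1 for 0.3 <= z < 0.6, 2 for z >= 0.6 (with the default edges)
    regime = np.digitize(z, edges)
    squared_error = (y_true - y_pred) ** 2
    return {
        f"Regime {r + 1}": float(squared_error[regime == r].mean())
        for r in range(len(edges) + 1)
        if np.any(regime == r)  # skip regimes with no samples
    }

# Made-up toy values: two points per regime
toy_z = np.array([0.1, 0.2, 0.4, 0.5, 0.7, 0.9])
toy_y = np.array([3.0, 2.0, 0.0, -1.0, 1.0, -2.0])
toy_pred = np.array([3.0, 3.0, 0.5, -0.5, 3.0, -2.0])
toy_breakdown = mse_by_regime(toy_y, toy_pred, toy_z)
print(toy_breakdown)  # {'Regime 1': 0.5, 'Regime 2': 0.25, 'Regime 3': 2.0}
```

In the notebook, the same helper could be applied to each model's test predictions, for example `mse_by_regime(y_test, hh_rf_preds, z_test)`, and the three resulting dictionaries compared regime by regime.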