# ⭐ Tutorial: Synthetic Data Generation with RiskLabAI

This notebook is a tutorial for the synthetic data generation functions in the `RiskLabAI` library. Synthetic data is crucial for backtesting, as it allows us to test strategies under known, controlled conditions that may not exist in historical data.

We will demonstrate:
1.  **Block-Diagonal Covariance:** How to generate a 'true' structured covariance matrix, simulate noisy observations from it, and compute the sample covariance of those observations.
2.  **Drift-Burst Hypothesis:** How to generate the drift (`μ`) and volatility (`σ`) parameters for a market bubble scenario.
3.  **Regime-Switching Prices:** The main event: generating a realistic Heston-Merton price path that switches between a 'calm' and a 'volatile' regime according to a Markov chain.
4.  **Parallel Generation:** We'll conclude by showing the function for scaling this up to generate thousands of paths for Monte Carlo analysis.

## 0. Setup and Imports

First, we import our libraries and the necessary modules from `RiskLabAI`. Thanks to the `__init__.py` file, we can import all synthetic data functions under one alias, `synth`.

In [None]:
# Standard Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

# RiskLabAI Imports
# We can import the entire module thanks to the __init__.py file
import RiskLabAI.data.synthetic_data as synth
import RiskLabAI.utils.publication_plots as pub_plots

# --- Notebook Configuration ---
warnings.filterwarnings('ignore')
np.set_printoptions(precision=4, suppress=True)
pub_plots.setup_publication_style() # Apply global publication style

## 1. Block-Diagonal Covariance (simulation.py)

A common task is to create a 'true' covariance matrix (`cov0`) with a known cluster structure, draw noisy samples from it, and then estimate the empirical covariance `cov1` from those samples. This is the same setup used in the Denoising tutorial to demonstrate its effectiveness.

We'll use `form_true_matrix` to create a ground-truth matrix and `simulates_cov_mu` to generate observations.
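
To make the idea concrete before calling the library, here is a minimal NumPy-only sketch of building a block-diagonal matrix and estimating a sample covariance from draws. It is not RiskLabAI's implementation: `block_diag_cov` is a hypothetical helper, unit variances are assumed, and the small sizes are chosen only to keep the example fast.

```python
import numpy as np

def block_diag_cov(n_blocks, block_size, block_corr):
    """Block-diagonal correlation matrix: block_corr inside a block, 0 across blocks."""
    block = np.full((block_size, block_size), block_corr)
    np.fill_diagonal(block, 1.0)
    # np.kron with an identity places one copy of `block` per cluster on the diagonal
    return np.kron(np.eye(n_blocks), block)

rng = np.random.default_rng(0)
true_cov = block_diag_cov(n_blocks=3, block_size=4, block_corr=0.5)

# Draw observations and compute the sample (empirical) covariance
n_assets = true_cov.shape[0]
samples = rng.multivariate_normal(np.zeros(n_assets), true_cov, size=1000)
sample_cov = np.cov(samples, rowvar=False)

# Largest entry-wise gap between the estimate and the truth
max_error = np.abs(sample_cov - true_cov).max()
```

Unlike this sketch, `form_true_matrix` also returns a mean vector (`mu0`), so treat the code above as an illustration of the concept rather than a drop-in replacement.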

In [None]:
# 1. Define the true, underlying market structure
n_blocks = 10         # 10 clusters (e.g., sectors)
b_size = 50           # 50 assets per cluster
b_corr = 0.5          # 50% correlation within a cluster

print("Forming 'true' (ground-truth) covariance matrix...")
mu0, cov0 = synth.form_true_matrix(n_blocks, b_size, b_corr)

# 2. Simulate n_obs observations from this true matrix
n_obs = 1000 # Number of observations (e.g., days)
print("Simulating observations from true matrix...")
mu1, cov1 = synth.simulates_cov_mu(mu0, cov0, n_obs, shrink=False)

print(f"\nTrue Covariance Matrix (cov0) shape: {cov0.shape}")
print(f"Empirical Covariance (cov1) shape: {cov1.shape}")

Let's visualize the 'true' and 'empirical' matrices. The true matrix has a sharp block structure (the blocks may not sit on the diagonal, because the asset order is shuffled). The empirical matrix is a noisy approximation of it.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

im1 = ax1.imshow(cov0, cmap='viridis')
pub_plots.apply_plot_style(ax1, 'True Covariance (cov0)', 'Assets', 'Assets')
fig.colorbar(im1, ax=ax1, fraction=0.046, pad=0.04)

im2 = ax2.imshow(cov1, cmap='viridis')
pub_plots.apply_plot_style(ax2, 'Empirical Covariance (cov1)', 'Assets', 'Assets')
fig.colorbar(im2, ax=ax2, fraction=0.046, pad=0.04)

plt.tight_layout()
plt.show()

## 2. Drift-Burst Hypothesis (drift_burst_hypothesis.py)

This module generates the parameters for a bubble. It models drift (`μ`) and volatility (`σ`) that explode as time approaches the midpoint (`t=0.5`) and collapse afterwards.

We can use the `drift_volatility_burst` function to generate these parameter lists, which can then be fed into our regime-switching model.
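
For intuition, the sketch below shows one common way to write such a burst: a power law in the distance to the burst time, `μ(t) = a / |t - 0.5|^α` and `σ(t) = b / |t - 0.5|^β`. This functional form and the roles given to the parameters are assumptions for illustration; the source does not spell out what `drift_volatility_burst` computes internally. `power_law_burst` and `min_distance` are hypothetical names.

```python
import numpy as np

def power_law_burst(n_steps, a_before, a_after, b_before, b_after,
                    alpha, beta, min_distance=1e-3):
    """Drift and volatility that blow up as a power law near t = 0.5."""
    t = np.linspace(0.0, 1.0, n_steps)
    # Distance to the burst time, floored so we never divide by zero
    distance = np.maximum(np.abs(t - 0.5), min_distance)
    # Use one coefficient before the burst and another after it
    a = np.where(t <= 0.5, a_before, a_after)
    b = np.where(t <= 0.5, b_before, b_after)
    drift = a / distance ** alpha
    volatility = b / distance ** beta
    return t, drift, volatility

t, drift, volatility = power_law_burst(500, 0.1, -0.1, 0.1, 0.2, 0.5, 0.5)
```

The `min_distance` floor is a design choice of this sketch: it caps the explosion at a finite value when a grid point falls on or very near `t = 0.5`.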

In [None]:
# Bubble parameters
bubble_length = 500 # 500 timesteps
a_before = 0.1      # Low positive drift before burst
a_after = -0.1      # Negative drift after burst
b_before = 0.1      # Low volatility before burst
b_after = 0.2       # Higher volatility after burst
alpha = 0.5         # Drift exponent
beta = 0.5          # Volatility exponent

drifts, volatilities = synth.drift_volatility_burst(
    bubble_length, a_before, a_after, b_before, b_after, alpha, beta
)

# --- Plotting ---
fig, ax = plt.subplots(figsize=(14, 7))
steps = np.linspace(0, 1, bubble_length)

# Plot Drift (Primary Y-axis)
ax.plot(steps, drifts, label='Drift (μ)', color='C0')

# Create secondary Y-axis for Volatility
ax2 = ax.twinx()
ax2.plot(steps, volatilities, label='Volatility (σ)', color='C1', linestyle='--')

# Add explosion line
ax.axvline(x=0.5, color='red', linestyle=':', linewidth=2, label='Burst (t=0.5)')

# Apply styling
pub_plots.apply_plot_style(
    ax, 
    'Drift-Burst Hypothesis (DBH) Parameters',
    'Time (t)', 
    'Drift (μ)'
)
ax2.set_ylabel('Volatility (σ)')

lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines + lines2, labels + labels2, loc='upper left')
ax2.grid(False)

plt.show()

## 3. Regime-Switching Prices (synthetic_controlled_environment.py)

This is the most powerful feature: generating a price path from a Heston-Merton jump-diffusion model with Markov-switching regimes. 
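
As background, here is a minimal Euler-scheme sketch of a single-regime Heston model with Merton-style jumps. It is an illustration under stated assumptions, not RiskLabAI's code: `heston_merton_path` is a hypothetical function, `theta` is treated as the long-run variance level, jumps are normally distributed in the log-price with mean `m` and standard deviation `v`, and no drift compensation for jumps is applied.

```python
import numpy as np

def heston_merton_path(s0, v0, mu, kappa, theta, xi, rho, lam, m, v,
                       total_time, n_steps, seed=None):
    """Euler scheme for a single-regime Heston model with Merton-style jumps."""
    rng = np.random.default_rng(seed)
    dt = total_time / n_steps
    log_s = np.empty(n_steps + 1)
    log_s[0] = np.log(s0)
    var = v0
    for i in range(n_steps):
        # Correlated Brownian increments for price and variance
        z1 = rng.standard_normal()
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
        var_pos = max(var, 0.0)  # truncate at zero so the square root stays real
        # Compound Poisson jump in the log-price (sum of zero or more normal jumps)
        n_jumps = rng.poisson(lam * dt)
        jump = rng.normal(m, v, size=n_jumps).sum()
        log_s[i + 1] = (log_s[i] + (mu - 0.5 * var_pos) * dt
                        + np.sqrt(var_pos * dt) * z1 + jump)
        # Mean-reverting variance update
        var = var + kappa * (theta - var_pos) * dt + xi * np.sqrt(var_pos * dt) * z2
    return np.exp(log_s)

calm_path = heston_merton_path(
    s0=100.0, v0=0.05, mu=0.05, kappa=2.0, theta=0.05, xi=0.2, rho=-0.6,
    lam=0.0, m=0.0, v=0.0, total_time=2.0, n_steps=504, seed=42,
)
```

A regime-switching version would swap the parameter set at each step according to the active regime.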

We'll define two simple regimes:
* **"Calm":** Low drift, low volatility, mean-reverting.
* **"Volatile":** Low drift, high volatility, and a chance of jumps.

We'll also define a **transition matrix** that says the market is 98% likely to stay in its current regime at each step.
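
The sketch below shows how a regime sequence can be sampled from such a transition matrix using plain NumPy. `simulate_regime_path` is a hypothetical helper written for this explanation; it is not part of RiskLabAI, and the library may sample regimes differently.

```python
import numpy as np

def simulate_regime_path(transition_matrix, n_steps, start_state=0, seed=None):
    """Sample a sequence of regime indices from a row-stochastic transition matrix."""
    P = np.asarray(transition_matrix, dtype=float)
    if not np.allclose(P.sum(axis=1), 1.0):
        raise ValueError("each row of the transition matrix must sum to 1")
    rng = np.random.default_rng(seed)
    states = np.empty(n_steps, dtype=int)
    states[0] = start_state
    for t in range(1, n_steps):
        # The next state depends only on the current one (Markov property)
        states[t] = rng.choice(P.shape[0], p=P[states[t - 1]])
    return states

P = np.array([[0.98, 0.02],
              [0.02, 0.98]])
path = simulate_regime_path(P, n_steps=504, seed=0)

# With a 98% chance of staying put, a regime lasts 1 / (1 - 0.98) = 50 steps on average
expected_duration = 1.0 / (1.0 - np.diag(P))
```

With 504 steps and an expected regime length of about 50 steps, a single path should typically contain only a handful of regime changes.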

In [None]:
# 1. Define the regimes
regimes = {
    "Calm": {
        "mu": 0.05,     # Drift
        "kappa": 2.0,   # Vol mean-reversion speed
        "theta": 0.05,  # Long-term vol
        "xi": 0.2,      # Vol of vol
        "rho": -0.6,    # Correlation (leverage effect)
        "lam": 0.0,     # Jump intensity (0 for calm)
        "m": 0.0,
        "v": 0.0
    },
    "Volatile": {
        "mu": 0.02,
        "kappa": 1.0,   # Slower mean-reversion
        "theta": 0.2,   # High long-term vol
        "xi": 0.5,      # High vol of vol
        "rho": -0.8,    # Stronger leverage effect
        "lam": 0.1,     # Jump intensity (non-zero)
        "m": -0.1,    # Mean jump size (negative)
        "v": 0.1      # Jump volatility
    }
}

# 2. Define the transition matrix (Calm, Volatile)
# P(i, j) = probability of moving from state i to state j
transition_matrix = np.array([
    [0.98, 0.02],  # From Calm -> (Calm, Volatile)
    [0.02, 0.98]   # From Volatile -> (Calm, Volatile)
])

# 3. Set simulation parameters
total_time = 2.0  # 2 years
n_steps = 252 * 2   # 2 years of business days

print("Generating price path...")
# Store the simulated regime sequence under a new name so that the
# `regimes` dict of regime definitions is not overwritten.
prices, regime_path = synth.generate_prices_from_regimes(
    regimes,
    transition_matrix,
    total_time,
    n_steps,
    random_state=42
)

print("Generation complete.")

### Visualizing the Regimes

Let's plot the resulting price path. We'll shade the background to show which regime was active at each point in time. This is a quick visual check that the model behaves as expected.

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

# Plot the price series
prices.plot(ax=ax, label='Synthetic Price', color='black', logy=True)

# --- Create regime shading ---
regime_map = {"Calm": 0, "Volatile": 1}
regime_values = [regime_map[r] for r in regime_path]

ax.fill_between(
    prices.index, 
    ax.get_ylim()[0], 
    ax.get_ylim()[1], 
    where=[r == 1 for r in regime_values], 
    color='red', 
    alpha=0.2, 
    label='Volatile Regime'
)
ax.fill_between(
    prices.index, 
    ax.get_ylim()[0], 
    ax.get_ylim()[1], 
    where=[r == 0 for r in regime_values], 
    color='green', 
    alpha=0.2, 
    label='Calm Regime'
)

# Apply styling
pub_plots.apply_plot_style(
    ax, 
    'Heston-Merton Price Path with Markov-Switching Regimes',
    'Date', 
    'Price (Log Scale)'
)
ax.legend(loc='upper left')
plt.show()

**Analysis:** If the model is working as intended, the price path should look smooth and steady during the 'Calm' (green) periods and become erratic and gappy (jumps) during the 'Volatile' (red) periods. Because the mean jump size is negative, jumps in the 'Volatile' regime tend to push the price down.
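
A numerical check can complement the visual one. The sketch below computes the standard deviation of log returns separately for each regime. It runs on stand-in data generated inline, not on RiskLabAI output. `realized_vol_by_regime` is a hypothetical helper that assumes the price series and the regime labels have the same length.

```python
import numpy as np

def realized_vol_by_regime(prices, regime_labels):
    """Standard deviation of log returns, grouped by the regime active at each step."""
    prices = np.asarray(prices, dtype=float)
    labels = np.asarray(regime_labels)[1:]  # drop the first label to align with returns
    log_returns = np.diff(np.log(prices))
    return {str(name): log_returns[labels == name].std() for name in np.unique(labels)}

# Stand-in data for illustration only (not output from RiskLabAI)
rng = np.random.default_rng(1)
labels = np.array(["Calm"] * 200 + ["Volatile"] * 200)
sigma = np.where(labels == "Calm", 0.005, 0.03)
toy_prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, sigma)))

vols = realized_vol_by_regime(toy_prices, labels)
```

On real simulation output, a clearly higher value for the 'Volatile' key than for the 'Calm' key would support the reading of the plot.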

## 4. Parallel Generation for Monte Carlo

Generating one path is useful, but the real power comes from generating thousands of paths for a Monte Carlo simulation. The module provides `parallel_generate_prices` for this purpose.

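It takes the same arguments but adds `number_of_paths` and `n_jobs` (the number of CPU cores to use). It returns the generated paths as DataFrames; the commented example below unpacks one for prices and one for regimes.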
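
When generating many paths in parallel, each path needs its own independent random stream, and results should not depend on the number of workers. The sketch below shows one way to do this with NumPy's `SeedSequence.spawn`. It says nothing about how `parallel_generate_prices` handles seeding internally. `simulate_one_path` and `simulate_many` are hypothetical, and a simple geometric random walk stands in for the real path generator.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def simulate_one_path(seed_seq, n_steps=504, sigma=0.01):
    """Placeholder path generator: a geometric random walk with its own RNG stream."""
    rng = np.random.default_rng(seed_seq)
    return 100.0 * np.exp(np.cumsum(rng.normal(0.0, sigma, size=n_steps)))

def simulate_many(number_of_paths, random_state=None, n_workers=1):
    """Generate paths with independent child seeds, one column per path."""
    # spawn() derives statistically independent streams from a single root seed
    child_seeds = np.random.SeedSequence(random_state).spawn(number_of_paths)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves input order, so column k always uses child seed k
        paths = list(pool.map(simulate_one_path, child_seeds))
    return np.column_stack(paths)

serial = simulate_many(8, random_state=42, n_workers=1)
threaded = simulate_many(8, random_state=42, n_workers=4)
```

Threads are used here only to keep the example self-contained. For CPU-bound simulations, a process pool is usually the better fit.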

In [None]:
print("Function signature for parallel generation:")
print("RiskLabAI.data.synthetic_data.parallel_generate_prices(")
print("    number_of_paths: int,")
print("    regimes: Dict,")
print("    transition_matrix: np.ndarray,")
print("    total_time: float,")
print("    n_steps: int,")
print("    random_state: Optional[int] = None,")
print("    n_jobs: int = 1")
print(")")

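# Example (commented out so the notebook runs quickly).
# Note: `regimes` must be the dict of regime definitions from Section 3,
# not the simulated regime sequence returned by generate_prices_from_regimes.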
# prices_df, regimes_df = synth.parallel_generate_prices(
#     number_of_paths=1000,
#     regimes=regimes,
#     transition_matrix=transition_matrix,
#     total_time=total_time,
#     n_steps=n_steps,
#     n_jobs=-1 # Use all cores
# )
# print(f"Generated {prices_df.shape[1]} paths.")

## 5. Conclusion

This notebook demonstrated the power of the `RiskLabAI.data.synthetic_data` module. We have shown how to:

1.  **Create Structured Covariance:** Generate `cov0` and `cov1` for testing portfolio methods (like Denoising or HRP).
2.  **Model Bubbles:** Generate drift and volatility parameters that mimic a financial bubble using the Drift-Burst Hypothesis.
3.  **Generate Realistic Prices:** Create high-fidelity, Numba-accelerated Heston-Merton price paths that switch between different market regimes.
4.  **Scale Up:** Use `parallel_generate_prices` to run large-scale Monte Carlo simulations.

This module provides a robust 