# Synthetic Time Series Generation for Benchmarking

This tutorial introduces `TimeSeriesSimulator`, a tool for generating synthetic time series data with customizable statistical distributions. Such data lets you test systematically how forecasting models perform on series with specific, known characteristics.

## Why Synthetic Data?

When developing forecasting solutions, you often need to answer questions like:

- **"How does my model handle sudden demand spikes?"** (e.g., from promotions or viral events)
- **"Which model is most robust to heavy-tailed distributions?"** (e.g., insurance claims, website traffic)
- **"How do different models behave with multiple seasonalities?"** (e.g., daily + weekly + yearly patterns)

Real-world data is messy and expensive to obtain, and you can't control its characteristics. With synthetic data:

1. **You know the ground truth** - You designed the data generation process
2. **You can isolate specific behaviors** - Test one characteristic at a time
3. **You can generate unlimited samples** - No data scarcity issues
4. **You get reproducibility** - The same seed produces the same data every time

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Development only: make the in-repo package importable from a source checkout
import sys
sys.path.insert(0, '../../../python')

from statsforecast.synthetic import TimeSeriesSimulator

## Basic Usage: Built-in Distributions

`TimeSeriesSimulator` comes with 7 built-in distributions: `normal`, `poisson`, `exponential`, `gamma`, `uniform`, `binomial`, and `lognormal`.
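
Each of these names also exists as a sampling method on NumPy's `np.random.Generator`. The sketch below draws from those methods directly so you can see what kind of values each one produces. It assumes the built-ins mirror the NumPy methods of the same name, which this tutorial does not confirm, and the parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumption: each built-in name mirrors the NumPy Generator method of the
# same name. Parameter values are illustrative, not library defaults.
examples = {
    "normal": {"loc": 100, "scale": 15},
    "poisson": {"lam": 20},
    "exponential": {"scale": 10},
    "gamma": {"shape": 5, "scale": 10},
    "uniform": {"low": 0, "high": 1},
    "binomial": {"n": 50, "p": 0.3},
    "lognormal": {"mean": 0.0, "sigma": 0.5},
}

# Look up the Generator method by name and draw 1000 values from it
samples = {
    name: getattr(rng, name)(size=1000, **params)
    for name, params in examples.items()
}

for name, values in samples.items():
    print(f"{name:<12} mean={float(values.mean()):8.2f}  min={float(values.min()):8.2f}")
```

Note that `poisson` and `binomial` produce integer counts, while the other five produce continuous values.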

In [None]:
# Generate normally distributed time series
sim = TimeSeriesSimulator(
    length=100,
    distribution="normal",
    dist_params={"loc": 100, "scale": 15},
    seed=42,
)

df = sim.simulate(n_series=3)
print(f"Generated {df['unique_id'].nunique()} series with {len(df)} total rows")
df.head(10)

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(12, 4))
for uid in df['unique_id'].unique():
    subset = df[df['unique_id'] == uid]
    ax.plot(subset['ds'], subset['y'], label=f'Series {uid}', alpha=0.7)
ax.set_title('Normal Distribution (loc=100, scale=15)')
ax.legend()
plt.tight_layout()
plt.show()

## Adding Trend and Seasonality

Real time series often have trend and seasonal components. `TimeSeriesSimulator` supports:

**Trends:** `linear`, `quadratic`, `exponential`, or custom callable

**Seasonality:** Single or multiple periods
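
To build intuition for how these pieces combine, here is a minimal NumPy sketch of an additive series: base distribution plus trend plus seasonality. It is not the simulator's implementation. The additive composition, the sinusoidal seasonal shape, and the helper names `linear_trend` and `sinusoidal_seasonality` are all assumptions made for illustration.

```python
import numpy as np

def linear_trend(t, slope=0.2, intercept=0.0):
    """Hypothetical helper: straight-line trend evaluated at time steps t."""
    return intercept + slope * t

def sinusoidal_seasonality(t, period=7, strength=15.0):
    """Hypothetical helper: sine wave with the given period and amplitude."""
    return strength * np.sin(2 * np.pi * t / period)

length = 180
t = np.arange(length)
rng = np.random.default_rng(42)

# Assumed additive model: y = base draw + trend + seasonal component
base = rng.gamma(shape=5, scale=10, size=length)
trend = linear_trend(t)
seasonal = sinusoidal_seasonality(t)
y = base + trend + seasonal

print(f"trend contributes {trend[0]:.1f} at the start and {trend[-1]:.1f} at the end")
print(f"seasonal component stays within +/-{np.abs(seasonal).max():.1f}")
```

Under these assumptions the strength acts as the amplitude of the seasonal wave, so a strength of 15 moves the series by at most 15 units in either direction.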

In [None]:
# Gamma distribution with linear trend and weekly seasonality
sim = TimeSeriesSimulator(
    length=180,  # ~6 months
    distribution="gamma",
    dist_params={"shape": 5, "scale": 10},  # mean = shape * scale = 50
    trend="linear",
    trend_params={"slope": 0.2, "intercept": 0},
    seasonality=7,  # weekly
    seasonality_strength=15.0,
    seed=42,
)

df = sim.simulate()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df['ds'], df['y'])
ax.set_title('Gamma + Linear Trend + Weekly Seasonality')
plt.tight_layout()
plt.show()

In [None]:
# Multiple seasonalities (weekly + monthly)
sim = TimeSeriesSimulator(
    length=365,
    distribution="normal",
    dist_params={"loc": 100, "scale": 5},
    seasonality=[7, 30],  # weekly and monthly
    seasonality_strength=[10.0, 20.0],
    seed=42,
)

df = sim.simulate()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df['ds'], df['y'])
ax.set_title('Multiple Seasonalities: Weekly (amplitude=10) + Monthly (amplitude=20)')
plt.tight_layout()
plt.show()
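
When several seasonal periods overlap, it can be hard to confirm by eye that both are present. One rough check is to look at the strongest peaks in the series' power spectrum. The sketch below uses a hand-built stand-in series (normal noise plus two sine waves, an assumed seasonal shape) rather than the simulator's output, and `dominant_periods` is a hypothetical helper, not part of any library.

```python
import numpy as np

def dominant_periods(y, k=2):
    """Return the k periods with the largest spectral power, strongest first."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0)  # cycles per time step
    # Skip the zero-frequency bin, then take the k strongest bins
    top = np.argsort(power[1:])[::-1][:k] + 1
    return 1.0 / freqs[top]

# Stand-in series: periods 7 and 30 with amplitudes 10 and 20
rng = np.random.default_rng(42)
t = np.arange(365)
y = (
    rng.normal(100, 5, size=t.size)
    + 10.0 * np.sin(2 * np.pi * t / 7)
    + 20.0 * np.sin(2 * np.pi * t / 30)
)

periods = dominant_periods(y, k=2)
print(np.round(periods, 1))
```

The recovered periods are only approximate, because a series of 365 points does not contain a whole number of 7-day or 30-day cycles. For larger `k`, neighboring frequency bins of a strong peak may be reported before a weaker genuine period.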

## The Key Feature: Custom Distributions

**This is why `TimeSeriesSimulator` exists.**

Built-in distributions are useful, but real-world data often has complex patterns that standard distributions don't capture. The `distribution` parameter accepts any callable with the following signature:

```python
import numpy as np

def my_distribution(size: int, rng: np.random.Generator) -> np.ndarray:
    # Generate `size` values using `rng` for reproducibility
    values = rng.normal(size=size)  # placeholder: swap in any sampling logic
    return values
```

This gives you **complete control** over the data generation process.
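
Before handing a custom callable to the simulator, it can help to check that it honors this signature. The sketch below is one way to do that. `check_distribution` is a hypothetical helper written for this tutorial, and the three checks (one-dimensional output of length `size`, finite values, same seed gives same output) are what the signature implies rather than documented requirements.

```python
import numpy as np

def check_distribution(dist, size=100, seed=0):
    """Raise ValueError if `dist` breaks the (size, rng) contract."""
    first = np.asarray(dist(size, np.random.default_rng(seed)))
    second = np.asarray(dist(size, np.random.default_rng(seed)))
    if first.shape != (size,):
        raise ValueError(f"expected shape ({size},), got {first.shape}")
    if not np.all(np.isfinite(first)):
        raise ValueError("output contains NaN or inf")
    # Same seed must give the same values. A failure here usually means the
    # callable draws from the global np.random state instead of `rng`.
    if not np.array_equal(first, second):
        raise ValueError("output is not reproducible for a fixed seed")
    return True

def heavy_tailed(size, rng):
    """Student-t noise (3 degrees of freedom) around a level of 100."""
    return 100 + 10 * rng.standard_t(df=3, size=size)

def leaky(size, rng):
    """Anti-example: ignores `rng` and uses the global random state."""
    return np.random.normal(size=size)

print(check_distribution(heavy_tailed))
```

Calling `check_distribution(leaky)` raises a `ValueError`, because two calls with the same seed return different values.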

### Example 1: Demand with Promotional Spikes

In retail forecasting, day-to-day demand is often well approximated by a right-skewed, gamma-like distribution, but promotional events cause sudden spikes. How do different models handle this?

In [None]:
def demand_with_spikes(size, rng):
    """Simulate retail demand with random promotional spikes."""
    # Base demand follows gamma distribution
    base_demand = rng.gamma(shape=5, scale=10, size=size)
    
    # Each day independently has a 5% chance of a promotional spike
    spike_mask = rng.random(size) < 0.05
    spike_multiplier = rng.uniform(2.5, 5.0, size=size)
    
    demand = base_demand.copy()
    demand[spike_mask] *= spike_multiplier[spike_mask]
    
    return demand


sim = TimeSeriesSimulator(
    length=365,
    distribution=demand_with_spikes,
    trend="linear",
    trend_params={"slope": 0.05},  # slight growth
    seasonality=7,
    seasonality_strength=10.0,
    seed=42,
)

df = sim.simulate(n_series=1)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df['ds'], df['y'], alpha=0.8)
ax.axhline(df['y'].mean(), color='red', linestyle='--', label=f'Mean: {df["y"].mean():.1f}')
ax.set_title('Retail Demand with Promotional Spikes')
ax.set_ylabel('Demand')
ax.legend()
plt.tight_layout()
plt.show()

print(f"Typical range (5th-95th percentile): {df['y'].quantile(0.05):.1f} - {df['y'].quantile(0.95):.1f}")
print(f"Max (spike): {df['y'].max():.1f}")
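
Because you wrote the generator, you can work out its ground truth by hand. The sketch below computes the expected value of the `demand_with_spikes` distribution from its parameters and compares it with a large simulated sample. It covers only the distribution component, before any trend or seasonality is added, and it re-implements the sampling inline so the block runs on its own.

```python
import numpy as np

# Parameters copied from demand_with_spikes
shape, scale = 5, 10            # gamma base demand
spike_prob = 0.05               # chance that a day is a spike day
mult_low, mult_high = 2.5, 5.0  # uniform range of the spike multiplier

base_mean = shape * scale                     # mean of a gamma distribution
mean_multiplier = (mult_low + mult_high) / 2  # mean of a uniform distribution
# Non-spike days contribute the base mean, spike days the multiplied mean
expected_mean = base_mean * ((1 - spike_prob) + spike_prob * mean_multiplier)

# Monte Carlo check with a large sample
rng = np.random.default_rng(0)
n = 200_000
base = rng.gamma(shape=shape, scale=scale, size=n)
spike = rng.random(n) < spike_prob
mult = rng.uniform(mult_low, mult_high, size=n)
demand = np.where(spike, base * mult, base)

print(f"expected mean:  {expected_mean:.3f}")
print(f"simulated mean: {demand.mean():.3f}")
```

The two numbers should agree closely. A model that forecasts the base mean of 50 would therefore be biased low on this data, even though 95% of days look like ordinary gamma demand.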

### Example 2: Bimodal Distribution (Two Customer Segments)

Imagine you have two customer segments with different spending patterns - some spend ~$20, others spend ~$80.

In [None]:
def bimodal_spending(size, rng):
    """Two customer segments with different spending patterns."""
    # 60% low spenders, 40% high spenders
    segment = rng.random(size) < 0.6
    
    values = np.zeros(size)
    values[segment] = rng.normal(20, 5, size=segment.sum())  # Low spenders
    values[~segment] = rng.normal(80, 10, size=(~segment).sum())  # High spenders
    
    return np.maximum(values, 0)  # No negative spending


sim = TimeSeriesSimulator(
    length=200,
    distribution=bimodal_spending,
    seed=42,
)

df = sim.simulate()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(df['ds'], df['y'], alpha=0.7)
axes[0].set_title('Bimodal Spending Over Time')
axes[0].set_ylabel('Spending ($)')

axes[1].hist(df['y'], bins=30, edgecolor='black', alpha=0.7)
axes[1].set_title('Spending Distribution (Two Segments)')
axes[1].set_xlabel('Spending ($)')

plt.tight_layout()
plt.show()
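
The two-segment pattern generalizes to any number of segments. The sketch below shows one possible way to build such a mixture from smaller callables. `make_mixture` is a hypothetical helper, not part of `TimeSeriesSimulator`, and unlike `bimodal_spending` it does not clip negative values.

```python
import numpy as np

def make_mixture(components, weights):
    """Build a (size, rng) callable that mixes several component samplers.

    components: list of callables with the same (size, rng) signature
    weights:    mixing probabilities, one per component, summing to 1
    """
    weights = np.asarray(weights, dtype=float)
    if len(components) != len(weights) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("need one weight per component, summing to 1")

    def mixture(size, rng):
        # Pick a component index for every time step, then fill each group
        labels = rng.choice(len(components), size=size, p=weights)
        values = np.empty(size)
        for k, component in enumerate(components):
            mask = labels == k
            values[mask] = component(int(mask.sum()), rng)
        return values

    return mixture

def low_spenders(size, rng):
    return rng.normal(20, 5, size=size)

def high_spenders(size, rng):
    return rng.normal(80, 10, size=size)

two_segments = make_mixture([low_spenders, high_spenders], weights=[0.6, 0.4])

sample = two_segments(10_000, np.random.default_rng(42))
print(f"sample mean: {sample.mean():.1f} (mixture mean is 0.6*20 + 0.4*80 = 44)")
```

Since the returned function has the `(size, rng)` signature, it could be passed as the `distribution` argument in the same way as `bimodal_spending`.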

### Example 3: Regime Changes (Market Conditions)

Financial or economic data often has regime changes - periods of low volatility followed by high volatility.

In [None]:
def regime_switching(size, rng):
    """Alternating regimes of low and high volatility."""
    values = np.zeros(size)
    regime_length = 50
    
    for i in range(0, size, regime_length):
        end = min(i + regime_length, size)
        segment_size = end - i
        
        # Alternate between calm and volatile regimes
        if (i // regime_length) % 2 == 0:
            # Calm regime: low volatility around 100
            values[i:end] = rng.normal(100, 5, size=segment_size)
        else:
            # Volatile regime: high volatility around 100
            values[i:end] = rng.normal(100, 25, size=segment_size)
    
    return values


sim = TimeSeriesSimulator(
    length=300,
    distribution=regime_switching,
    seed=42,
)

df = sim.simulate()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df['ds'], df['y'], alpha=0.8)

# Shade the volatile regimes
for i in range(1, 6, 2):
    start_idx = i * 50
    end_idx = min((i + 1) * 50, 300)
    if start_idx < 300:
        ax.axvspan(df['ds'].iloc[start_idx], df['ds'].iloc[min(end_idx-1, 299)], 
                   alpha=0.2, color='red', label='High volatility' if i == 1 else '')

ax.set_title('Regime Switching: Alternating Calm and Volatile Periods')
ax.legend()
plt.tight_layout()
plt.show()
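
The same regime pattern can be written without a loop by giving every time step its own standard deviation. The sketch below is an alternative formulation, with a per-block check that the regimes really differ. The function name and its extra keyword parameters are illustrative, and it is not guaranteed to reproduce the exact random draws of the loop version.

```python
import numpy as np

def regime_switching_vectorized(size, rng, regime_length=50,
                                calm_scale=5.0, volatile_scale=25.0):
    """Alternating calm and volatile regimes using a per-step scale."""
    regime_index = np.arange(size) // regime_length
    scale = np.where(regime_index % 2 == 0, calm_scale, volatile_scale)
    # `scale` is an array, so each step is drawn with its own std deviation
    return rng.normal(100, scale)

values = regime_switching_vectorized(300, np.random.default_rng(42))

# Standard deviation of each consecutive block of 50 steps
block_std = values.reshape(-1, 50).std(axis=1)
print(np.round(block_std, 1))
```

Even-numbered blocks (calm) should show a standard deviation near 5 and odd-numbered blocks (volatile) near 25, which is the ground truth a volatility-aware model would need to recover.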

## Integration with StatsForecast

The output is directly compatible with `StatsForecast`. Here's a complete example of generating synthetic data and evaluating models:

In [None]:
# Generate data for multiple series
sim = TimeSeriesSimulator(
    length=120,  # 120 days
    distribution=demand_with_spikes,
    trend="linear",
    trend_params={"slope": 0.1},
    seasonality=7,
    seasonality_strength=8.0,
    seed=42,
)

df = sim.simulate(n_series=5)
print(f"Generated {df['unique_id'].nunique()} series")
print(f"Date range: {df['ds'].min()} to {df['ds'].max()}")
df.head()

In [None]:
# This data is ready for StatsForecast
# from statsforecast import StatsForecast
# from statsforecast.models import AutoARIMA, SeasonalNaive
#
# sf = StatsForecast(
#     models=[AutoARIMA(season_length=7), SeasonalNaive(season_length=7)],
#     freq='D',
# )
#
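# # Split train/test: hold out the last `h` days of every series
# # (assumes all series share the same daily date range)
# h = 30
# cutoff = df['ds'].max() - pd.Timedelta(days=h)
# train = df[df['ds'] <= cutoff]
# test = df[df['ds'] > cutoff]
#
# # Fit and forecast
# sf.fit(train)
# forecasts = sf.predict(h=h)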
#
# # Evaluate
# from utilsforecast.evaluation import evaluate
# from utilsforecast.losses import mae, rmse
# results = evaluate(forecasts.merge(test, on=['unique_id', 'ds']), metrics=[mae, rmse])
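
If you want to see what this evaluation loop does without installing anything beyond NumPy, the sketch below re-creates the idea on a single stand-in series: hold out the last 30 points, forecast them with a seasonal naive rule, and score the result with mean absolute error. The `seasonal_naive` and `mae` functions here are simplified stand-ins written for illustration, not the `statsforecast` or `utilsforecast` implementations.

```python
import numpy as np

def seasonal_naive(train, h, season_length):
    """Forecast h steps by repeating the last observed season."""
    last_season = np.asarray(train, dtype=float)[-season_length:]
    reps = int(np.ceil(h / season_length))
    return np.tile(last_season, reps)[:h]

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Stand-in series: weekly sine pattern plus noise, 120 daily observations
rng = np.random.default_rng(42)
t = np.arange(120)
y = 50 + 8.0 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 2, size=t.size)

h = 30
train, test = y[:-h], y[-h:]

forecast = seasonal_naive(train, h, season_length=7)
flat = np.full(h, train.mean())  # baseline that ignores seasonality

print(f"seasonal naive MAE: {mae(test, forecast):.2f}")
print(f"mean baseline MAE:  {mae(test, flat):.2f}")
```

On a series with a strong weekly pattern, the seasonal naive forecast should score clearly better than the flat mean, which is the kind of comparison the synthetic data is meant to make easy.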

## Summary

`TimeSeriesSimulator` enables systematic model evaluation by generating synthetic data with:

| Feature | Options |
|---------|--------|
| **Distribution** | 7 built-in + any custom callable |
| **Trend** | linear, quadratic, exponential, custom |
| **Seasonality** | Single or multiple periods |
| **Noise** | Additional Gaussian noise |
| **Output** | StatsForecast-compatible DataFrame |

**Key insight:** The custom callable interface lets you model domain-specific behaviors (promotional spikes, regime changes, bimodal segments) that don't fit standard distributions. This makes your model benchmarking more realistic and actionable.