# DatasetsForecast Tutorial

DatasetsForecast provides one-line access to major time series forecasting datasets.

## Setup

Install required packages:

```bash
pip install datasetsforecast pandas numpy matplotlib statsforecast
```

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasetsforecast.m4 import M4
from datasetsforecast.hierarchical import HierarchicalData
```

## M4 Competition Dataset

Load one frequency group of the M4 forecasting competition data, here the Hourly group:

```python
# Load the M4 Hourly group (414 time series, 48-step forecast horizon)
data_df, _, meta_df = M4.load(directory='data', group='Hourly')

# data_df is in long format with columns unique_id, ds, and y
print(f"Data shape: {data_df.shape}")
print(f"Number of unique series: {data_df['unique_id'].nunique()}")

# Count the observations available for the first few series
print(data_df.groupby('unique_id')['y'].count().head())
```
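
The long format makes it easy to carve out a hold-out window per series. The sketch below uses a made-up `toy_df` in place of `data_df` and an illustrative `HORIZON` of 2 (M4 Hourly uses 48). It assumes each series is stored in full and in time order, which is worth checking on the real data.

```python
import pandas as pd

HORIZON = 2  # illustrative value; not the M4 Hourly horizon

# Toy long-format frame with the same three columns as data_df (made-up values)
toy_df = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 4,
    "ds": list(range(1, 6)) + list(range(1, 5)),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0],
})

# Hold out the last HORIZON rows of every series
holdout_df = toy_df.groupby("unique_id").tail(HORIZON)
history_df = toy_df.drop(holdout_df.index)

# Arrange the held-out values as an (n_series, HORIZON) array, one row per series
holdout_matrix = holdout_df["y"].to_numpy().reshape(-1, HORIZON)

print(history_df.groupby("unique_id")["y"].count())
print(holdout_matrix)
```

The reshape relies on every series contributing exactly `HORIZON` rows, so series shorter than the horizon would need to be filtered out first.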

## M4 Frequency Groups

Each M4 group differs in its number of series, forecast horizon, and seasonal period:

```python
# Available M4 frequency groups and their properties
frequency_info = {
    'Yearly': {'series': 23000, 'horizon': 6, 'seasonality': 1},
    'Quarterly': {'series': 24000, 'horizon': 8, 'seasonality': 4},
    'Monthly': {'series': 48000, 'horizon': 18, 'seasonality': 12},
    'Weekly': {'series': 359, 'horizon': 13, 'seasonality': 1},
    'Daily': {'series': 4227, 'horizon': 14, 'seasonality': 1},
    'Hourly': {'series': 414, 'horizon': 48, 'seasonality': 24}
}

for freq, info in frequency_info.items():
    print(f"{freq}: {info['series']} series, {info['horizon']}-step ahead forecasts")
```
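
For a side-by-side view, the same dictionary can be turned into a table. This is only a convenience sketch; the `share_pct` column is an illustrative addition and not something the library provides.

```python
import pandas as pd

# Same properties as listed in the tutorial's frequency_info dictionary
frequency_info = {
    "Yearly": {"series": 23000, "horizon": 6, "seasonality": 1},
    "Quarterly": {"series": 24000, "horizon": 8, "seasonality": 4},
    "Monthly": {"series": 48000, "horizon": 18, "seasonality": 12},
    "Weekly": {"series": 359, "horizon": 13, "seasonality": 1},
    "Daily": {"series": 4227, "horizon": 14, "seasonality": 1},
    "Hourly": {"series": 414, "horizon": 48, "seasonality": 24},
}

# One row per frequency group, one column per property
summary = pd.DataFrame.from_dict(frequency_info, orient="index")

# Share of the whole competition that each group represents
total_series = int(summary["series"].sum())
summary["share_pct"] = 100 * summary["series"] / total_series

print(summary)
print(f"Total series across groups: {total_series}")
```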

## Visualize Sample Series

Plot sample hourly series to understand data patterns:

```python
# Visualize sample hourly series
sample_series = ['H1', 'H2', 'H3']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, series_id in enumerate(sample_series):
    series_data = data_df[data_df['unique_id'] == series_id]
    axes[i].plot(series_data['ds'], series_data['y'], color='#98FE09', linewidth=2)
    axes[i].set_title(f'Series {series_id}')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
```

## Hierarchical Forecasting with Tourism Dataset

Load the hierarchical time series together with the summing matrix, which encodes how bottom-level series aggregate into the upper levels:

```python
# Load Tourism dataset with hierarchical structure
Y_df, S_df, _ = HierarchicalData.load(directory="data", group="TourismLarge")

print(f"Time series data shape: {Y_df.shape}")
print(f"Summing matrix shape: {S_df.shape}")

# Show hierarchical structure: S_df has one row per series at any level
# and one column per bottom-level series
print("\nHierarchical structure:")
print(f"  Total series: {len(Y_df['unique_id'].unique())}")
print(f"  Bottom level series: {S_df.shape[1]}")
print(f"  Aggregated levels: {S_df.shape[0] - S_df.shape[1]}")
print(f"  Total hierarchical nodes: {S_df.shape[0]}")
```
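
To make the role of the matrix concrete, here is a minimal sketch on a made-up three-leaf hierarchy (the node names and values are hypothetical and unrelated to the Tourism data). Multiplying the matrix by the bottom-level values yields a coherent value for every node.

```python
import numpy as np

# Hypothetical hierarchy: total -> {north, south}, north -> {n1, n2}, south -> {s1}
# Rows: total, north, south, n1, n2, s1. Columns: bottom series n1, n2, s1.
S = np.array([
    [1, 1, 1],  # total = n1 + n2 + s1
    [1, 1, 0],  # north = n1 + n2
    [0, 0, 1],  # south = s1
    [1, 0, 0],  # n1
    [0, 1, 0],  # n2
    [0, 0, 1],  # s1
])

# One observation for each bottom-level series (made-up values)
bottom = np.array([5.0, 7.0, 3.0])

# Values for every node in the hierarchy, in row order
all_levels = S @ bottom

n_total, n_bottom = S.shape
n_aggregated = n_total - n_bottom

print(all_levels)  # [15. 12.  3.  5.  7.  3.]
print(f"{n_aggregated} aggregated series on top of {n_bottom} bottom series")
```

The row and column counts here mirror the quantities printed from `S_df.shape` above.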

## Long Horizon Forecasting with ETT Dataset

Load the ETTh1 group of the ETT dataset for long-horizon forecasting:

```python
from datasetsforecast.long_horizon import LongHorizon

# LongHorizon.load returns the target series, exogenous variables, and static
# features (Y_df, X_df, S_df); it does not return train/validation/test splits
Y_df, X_df, S_df = LongHorizon.load(directory="data", group="ETTh1")

print(f"Target data shape: {Y_df.shape}")
print(f"Number of series: {Y_df['unique_id'].nunique()}")

# Show sample data
print("\nSample target data:")
print(Y_df.head())
```
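
Long-horizon benchmarks are typically evaluated on a chronological train/validation/test split. The sketch below shows one way to build such a split from a single long-format frame. `chronological_split` is a hypothetical helper, not part of datasetsforecast, and the toy frame and split sizes are made up.

```python
import pandas as pd

def chronological_split(df, val_size, test_size):
    """Split a long-format frame by time order (hypothetical helper).

    The last `test_size` timestamps form the test set, the `val_size`
    timestamps before them form the validation set, the rest is training.
    """
    timestamps = sorted(df["ds"].unique())
    n_train = len(timestamps) - val_size - test_size
    if n_train <= 0:
        raise ValueError("val_size + test_size must be smaller than the number of timestamps")
    train_end = timestamps[n_train - 1]
    val_end = timestamps[n_train + val_size - 1]
    train = df[df["ds"] <= train_end]
    val = df[(df["ds"] > train_end) & (df["ds"] <= val_end)]
    test = df[df["ds"] > val_end]
    return train, val, test

# Toy frame: two series sharing ten hourly timestamps (made-up values)
stamps = list(pd.date_range("2016-07-01", periods=10, freq=pd.Timedelta(hours=1)))
toy = pd.DataFrame({
    "unique_id": ["OT"] * 10 + ["HUFL"] * 10,
    "ds": stamps * 2,
    "y": [float(i) for i in range(20)],
})

train, val, test = chronological_split(toy, val_size=2, test_size=3)
print(len(train), len(val), len(test))
```

Splitting on shared timestamps instead of row counts keeps every series aligned on the same cut-off dates.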

## Predictive Maintenance with PHM2008

Load remaining useful life (RUL) data:

```python
from datasetsforecast.phm2008 import PHM2008

# Load PHM2008 dataset for remaining useful life prediction
Y_df, *_ = PHM2008.load(directory="data", group="FD001")

print(f"Dataset shape: {Y_df.shape}")
print(f"Unique engines: {Y_df['unique_id'].nunique()}")

# Analyze RUL distribution
rul_stats = Y_df.groupby('unique_id')['y'].agg(['min', 'max', 'count'])
print("\nRUL Statistics per engine:")
print(f"  Average cycles per engine: {rul_stats['count'].mean():.0f}")
print(f"  Min cycles: {rul_stats['count'].min()}")
print(f"  Max cycles: {rul_stats['count'].max()}")
print(f"  Average max RUL: {rul_stats['max'].mean():.0f}")

# Show sample data structure
print("\nSample data:")
print(Y_df.head(10))
```
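
As a rough illustration of what an RUL target looks like, the sketch below derives one for toy run-to-failure data. It assumes RUL is the number of cycles left before an engine's last recorded cycle, and the clipping cap of 2 is an arbitrary illustrative value. How the library builds its `y` column may differ.

```python
import pandas as pd

# Toy run-to-failure records: engine 1 runs 4 cycles, engine 2 runs 3 (made-up)
toy = pd.DataFrame({
    "unique_id": [1] * 4 + [2] * 3,
    "ds": [1, 2, 3, 4, 1, 2, 3],
})

# RUL under the stated assumption: cycles remaining until the last recorded cycle
last_cycle = toy.groupby("unique_id")["ds"].transform("max")
toy["y"] = last_cycle - toy["ds"]

# Optional cap on early-life RUL values (the cap of 2 is illustrative only)
toy["y_clipped"] = toy["y"].clip(upper=2)

print(toy)
```

Under this definition the maximum RUL of an engine equals its cycle count minus one, which is a quick consistency check to run against the per-engine statistics.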

## Model Evaluation

Evaluate forecasts with the M4 competition metrics (sMAPE, MASE, and OWA):

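```python
from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive
from datasetsforecast.m4 import M4Evaluation

horizon = 48  # M4 Hourly forecast horizon

# Hold out the last `horizon` observations of each series so the forecasts
# cover the M4 test window (assumes data_df holds each series in full)
test_df = data_df.groupby('unique_id').tail(horizon)
train_df = data_df.drop(test_df.index)

# Create simple benchmark forecasts
models = [Naive(), SeasonalNaive(season_length=24)]
# data_df['ds'] is expected to be an integer time index, hence freq=1
sf = StatsForecast(models=models, freq=1)

# Generate forecasts for evaluation
forecasts = sf.forecast(df=train_df, h=horizon)
# M4Evaluation expects an array of shape (n_series, horizon)
y_hat = forecasts['Naive'].values.reshape(-1, horizon)

# Evaluate using M4 methodology
evaluation = M4Evaluation.evaluate('data', 'Hourly', y_hat)
print("M4 Evaluation Results:")
print(evaluation)
```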
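
For intuition about two of the competition metrics, the sketch below computes sMAPE and MASE for a single made-up series, following the definitions published for M4. The function names are illustrative and this is not the library's implementation. The competition averages these per-series values over all series in a group.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE in percent for one series (illustrative helper)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(200.0 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))))

def mase(y, y_hat, y_insample, seasonality):
    """MAE scaled by the in-sample seasonal naive MAE (illustrative helper)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    y_insample = np.asarray(y_insample, dtype=float)
    scale = np.mean(np.abs(y_insample[seasonality:] - y_insample[:-seasonality]))
    return float(np.mean(np.abs(y - y_hat)) / scale)

# Made-up series: four in-sample points, two held-out points
y_insample = np.array([10.0, 12.0, 14.0, 16.0])
y_true = np.array([18.0, 20.0])
y_pred = np.array([16.0, 16.0])  # naive forecast: repeat the last in-sample value

smape_value = smape(y_true, y_pred)
mase_value = mase(y_true, y_pred, y_insample, seasonality=1)

print(f"sMAPE: {smape_value:.2f}")
print(f"MASE:  {mase_value:.2f}")
```

A MASE above 1 means the forecast's error is larger than the in-sample one-step error of the seasonal naive method, as is the case for this toy trend.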

## Benchmark Comparison

Compare against the competition's Naive2 benchmark:

```python
# Load the Naive2 benchmark forecasts (the M4 reference baseline, not a winning entry)
naive2_forecasts = M4Evaluation.load_benchmark('data', 'Hourly')
naive2_evaluation = M4Evaluation.evaluate('data', 'Hourly', naive2_forecasts)

# Put the metrics of both forecasts side by side
benchmark_comparison = pd.DataFrame({
    'Our Model': evaluation.iloc[0],
    'Naive2 Benchmark': naive2_evaluation.iloc[0]
})

print("Benchmark Comparison:")
print(benchmark_comparison)
```
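
The comparison is usually summarized with the overall weighted average (OWA), which M4 defines relative to Naive2. A minimal sketch, with a hypothetical `owa` helper and made-up metric values that are not results from this tutorial:

```python
def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    """Average of the model's sMAPE and MASE, each taken relative to Naive2."""
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)

# Made-up metric values for illustration only
model_owa = owa(smape_model=12.0, mase_model=1.8, smape_naive2=15.0, mase_naive2=2.0)

# Naive2 scored against itself is 1 by construction
naive2_owa = owa(15.0, 2.0, 15.0, 2.0)

print(f"Model OWA:  {model_owa:.2f}")
print(f"Naive2 OWA: {naive2_owa:.2f}")
```

An OWA below 1 indicates the model beats Naive2 on the two metrics combined, and a value above 1 indicates it does worse.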