# Chapter 29: Specialized Time‑Series Architectures

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the motivation behind specialized architectures designed for time‑series forecasting
- Describe the N‑BEATS architecture and how it uses backward and forward residual connections
- Implement a simplified version of N‑BEATS for univariate time‑series prediction
- Explain how DeepAR produces probabilistic forecasts using autoregressive RNNs with a parametric likelihood (e.g., Gaussian for real‑valued data, negative binomial for counts)
- Build a probabilistic forecasting model using DeepAR‑like principles with TensorFlow Probability
- Grasp the key components of the Temporal Fusion Transformer (TFT) – an attention‑based model with explainability
- Recognize when neural hierarchical interpolation (N-HiTS) is appropriate for long‑horizon forecasting
- Apply Gaussian Processes to time‑series and understand their uncertainty estimates
- Use state space models for interpretable trend‑seasonal decomposition
- Combine multiple architectures into hybrid models for improved performance
- Develop a systematic approach to architecture selection based on data characteristics and forecasting requirements

---

## **29.1 Introduction to Specialized Time‑Series Architectures**

While generic architectures like LSTMs, CNNs, and Transformers can be applied to time‑series, researchers have developed specialized models that incorporate inductive biases specific to temporal data. These models often achieve state‑of‑the‑art performance on forecasting benchmarks and offer additional benefits such as interpretability, probabilistic outputs, or efficient handling of long sequences.

For the NEPSE prediction system, understanding these architectures helps us choose the right tool for the job. For example, if we need not just point forecasts but also prediction intervals (uncertainty), DeepAR or Gaussian Processes are suitable. If we require interpretability to explain predictions to stakeholders, Temporal Fusion Transformers provide attention‑based explanations. If we are forecasting many related time series (e.g., all stocks in the NEPSE index), N‑BEATS or DeepAR can be adapted for multivariate or multi‑series settings.

---

## **29.2 N‑BEATS (Neural Basis Expansion Analysis for Time‑Series Forecasting)**

N‑BEATS, introduced by Oreshkin et al. (2019), is a pure deep learning architecture that does not rely on time‑specific components like RNNs or convolutions. Instead, it uses a stack of fully connected blocks with backward and forward residual connections. Each block produces two outputs: a **backcast** (the part of the input the block explains, which is removed so later blocks can focus on the residual) and a **forecast** (the contribution of that block to the final prediction). The forecasts of all blocks are summed to produce the final forecast.

### **29.2.1 Architecture Overview**

- The input is a window of past observations of length `L`.
- The network has multiple stacks, each containing several blocks.
- Each block has a fully connected network that outputs two vectors: `backcast` (same length as input) and `forecast` (length equal to forecast horizon `H`).
- The backcast is subtracted from the input before passing to the next block, allowing subsequent blocks to model the residual.
- The forecasts from all blocks are summed to obtain the overall prediction.

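This design builds the forecast as a sum of block contributions, each fitted to what the earlier blocks left unexplained. With the fixed trend and seasonality bases of the paper's interpretable configuration, the stack outputs can be read as a decomposition of the series. The generic configuration with learned projections is not interpretable in this sense, but it is often very accurate.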
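The residual bookkeeping described above can be sketched in plain NumPy. This is a minimal illustration, not the N‑BEATS implementation: `toy_block` and its random weight matrices are hypothetical stand‑ins for the trained fully connected blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, n_blocks = 20, 5, 4  # input length, forecast horizon, number of blocks

def toy_block(x, w_back, w_fore):
    """Hypothetical block: one linear map each for the backcast and the forecast."""
    return x @ w_back, x @ w_fore

x = rng.normal(size=(3, L))        # batch of 3 input windows
residual = x.copy()
total_forecast = np.zeros((3, H))
block_forecasts = []

for _ in range(n_blocks):
    w_back = rng.normal(scale=0.1, size=(L, L))
    w_fore = rng.normal(scale=0.1, size=(L, H))
    backcast, forecast = toy_block(residual, w_back, w_fore)
    residual = residual - backcast  # the next block sees what is left unexplained
    total_forecast += forecast      # forecasts accumulate across blocks
    block_forecasts.append(forecast)

print(total_forecast.shape)  # (3, 5)
```

Each block only ever receives the residual of the previous one, while its forecast is added to a running total.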

### **29.2.2 Implementing a Simplified N‑BEATS in Keras**

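We'll implement a basic N‑BEATS model for univariate forecasting (predicting the next 5 days from the past 20 days). This is a simplified, generic version: the projections from `theta` to backcast and forecast are learned, whereas the interpretable configuration in the original paper uses fixed trend and seasonality basis functions.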

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

class NBeatsBlock(layers.Layer):
    def __init__(self, units, theta_dim, backcast_length, forecast_length, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.theta_dim = theta_dim
        self.backcast_length = backcast_length
        self.forecast_length = forecast_length
        
        # Fully connected layers to produce theta
        self.fc1 = layers.Dense(units, activation='relu')
        self.fc2 = layers.Dense(units, activation='relu')
        self.fc3 = layers.Dense(units, activation='relu')
        self.theta = layers.Dense(theta_dim, activation='linear')
        
        # Basis layers: map theta to backcast and forecast
        # The interpretable configuration of the original uses fixed basis functions (trend, seasonality)
        # Here we use learnable linear projections (simple but less interpretable)
        self.backcast_basis = layers.Dense(backcast_length, use_bias=False)
        self.forecast_basis = layers.Dense(forecast_length, use_bias=False)
        
    def call(self, inputs):
        # inputs: (batch, backcast_length)
        x = self.fc1(inputs)
        x = self.fc2(x)
        x = self.fc3(x)
        theta = self.theta(x)  # (batch, theta_dim)
        backcast = self.backcast_basis(theta)
        forecast = self.forecast_basis(theta)
        return backcast, forecast

def build_nbeats(backcast_length, forecast_length, stack_sizes=(3, 3), units=64, theta_dim=16):
    """
    Build a simple N-BEATS model.
    stack_sizes: number of blocks per stack (sequence of ints)
    """
    input_layer = layers.Input(shape=(backcast_length,))
    x = input_layer
    forecast_sum = None  # no forecast yet; Add() needs tensors, not the int 0

    for n_blocks in stack_sizes:
        for _ in range(n_blocks):
            block = NBeatsBlock(units, theta_dim, backcast_length, forecast_length)
            backcast, forecast = block(x)
            x = layers.Subtract()([x, backcast])  # residual connection
            if forecast_sum is None:
                forecast_sum = forecast
            else:
                forecast_sum = layers.Add()([forecast_sum, forecast])

    model = keras.Model(inputs=input_layer, outputs=forecast_sum)
    return model

# Example usage
backcast_len = 20
forecast_len = 5
model_nbeats = build_nbeats(backcast_len, forecast_len, stack_sizes=[2, 2], units=32, theta_dim=8)
model_nbeats.compile(optimizer='adam', loss='mse')
model_nbeats.summary()
```

**Explanation:**

- `NBeatsBlock` is a custom layer that computes theta from the input using three dense layers, then projects theta to backcast and forecast via learnable linear layers. The interpretable configuration of the original replaces these projections with fixed basis functions such as trend polynomials or Fourier terms. Learnable projections are simpler and still effective, but they give up that interpretability.
- The main model loops through stacks and blocks, subtracting each block's backcast from the input, and summing the forecasts.
- The final output is the sum of all block forecasts.
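For comparison, here is a minimal sketch of what fixed basis functions look like. The helper names `trend_basis` and `seasonality_basis`, the time grid, and the number of harmonics are illustrative assumptions rather than the paper's exact settings. The point is that `theta` becomes a small vector of readable coefficients (intercept, slope, harmonic weights).

```python
import numpy as np

def trend_basis(degree, length):
    """Rows are t**p for p = 0..degree on a grid t in [0, 1)."""
    t = np.arange(length) / length
    return np.stack([t ** p for p in range(degree + 1)])  # (degree + 1, length)

def seasonality_basis(n_harmonics, length):
    """Rows are cosine and sine terms for harmonics 1..n_harmonics."""
    t = np.arange(length) / length
    cos = [np.cos(2 * np.pi * k * t) for k in range(1, n_harmonics + 1)]
    sin = [np.sin(2 * np.pi * k * t) for k in range(1, n_harmonics + 1)]
    return np.stack(cos + sin)  # (2 * n_harmonics, length)

forecast_len = 5
theta_trend = np.array([0.5, 2.0, 0.0])   # intercept, slope, curvature
trend_forecast = theta_trend @ trend_basis(2, forecast_len)

theta_seas = np.array([1.0, 0.0])         # weight on one cosine and one sine
seas_forecast = theta_seas @ seasonality_basis(1, forecast_len)

print(trend_forecast)  # a straight line: 0.5 + 2.0 * t
```

In a block built this way, the dense layers would still produce `theta`, but the final projection would be a fixed matrix product like the ones above instead of a trainable `Dense` layer.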

### **29.2.3 Training on NEPSE Data**

We'll prepare data for multi‑step forecasting (predict next 5 returns from past 20 returns).

```python
# Load and prepare NEPSE data (as before)
import pandas as pd

df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()
df_stock['Return'] = df_stock['Close'].pct_change() * 100
df_stock = df_stock.dropna(subset=['Return'])

# Create sequences for multi-step forecasting
def create_multi_output_sequences(data, backcast_len, forecast_len):
    X, y = [], []
    for i in range(len(data) - backcast_len - forecast_len + 1):
        X.append(data[i:i+backcast_len])
        y.append(data[i+backcast_len:i+backcast_len+forecast_len])
    return np.array(X), np.array(y)

backcast_len = 20
forecast_len = 5
data = df_stock['Return'].values.reshape(-1, 1)
X, y = create_multi_output_sequences(data, backcast_len, forecast_len)
X = X.squeeze(axis=-1)  # (samples, backcast_len)
y = y.squeeze(axis=-1)  # (samples, forecast_len)

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale (fit on training)
from sklearn.preprocessing import StandardScaler
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)

# Build and train model
model_nbeats = build_nbeats(backcast_len, forecast_len, stack_sizes=[2, 2], units=32, theta_dim=8)
model_nbeats.compile(optimizer='adam', loss='mse')

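early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# validation_split holds out the last 10% of the training windows (taken before shuffling)
model_nbeats.fit(X_train_scaled, y_train_scaled, epochs=50, batch_size=32, validation_split=0.1, callbacks=[early_stop], verbose=0)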

# Predict and inverse transform
y_pred_scaled = model_nbeats.predict(X_test_scaled)
y_pred = scaler_y.inverse_transform(y_pred_scaled)
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
print(f"N-BEATS test RMSE (averaged over 5 steps): {rmse:.4f}")
```

**Explanation:**

- We reshape the data to have multiple output steps.
- The model is trained to predict the next 5 returns. Scaling is applied to both input and output.
- The RMSE is pooled over all test windows and all 5 forecast steps, so it is a single number; we could also compute per‑step metrics to see how the error grows with the horizon.
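A per‑step breakdown takes only a few lines. The sketch below uses a hypothetical helper, `per_step_rmse`, and small made‑up arrays so that the result can be checked by hand; with the real data you would pass `y_test` and `y_pred`.

```python
import numpy as np

def per_step_rmse(y_true, y_pred):
    """RMSE for each forecast step; inputs have shape (samples, horizon)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# Toy data: 2 windows, 3 forecast steps
y_true_demo = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
y_pred_demo = np.array([[1.0, 2.5, 5.0], [2.0, 2.5, 2.0]])

print(per_step_rmse(y_true_demo, y_pred_demo))  # errors of 0.0, 0.5 and 2.0 per step
```

Errors that grow with the step index are expected in multi‑step forecasting, and a pooled RMSE hides that pattern.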

---

## **29.3 DeepAR**

DeepAR (Salinas et al., 2020) is a probabilistic forecasting model based on autoregressive RNNs. It uses a recurrent neural network (LSTM or GRU) to model the conditional distribution of each time step given past values and covariates. Instead of predicting a point value, DeepAR outputs parameters of a distribution (e.g., mean and variance for Gaussian, or mean and dispersion for negative binomial for count data). It is trained by maximizing the likelihood.

Key features:

- Handles multiple related time series (e.g., many stocks) by learning shared patterns.
- Produces probabilistic forecasts (prediction intervals).
- Incorporates covariates (e.g., day of week, month) easily.

For the NEPSE system, DeepAR could be used to forecast the entire distribution of future returns, which is valuable for risk management.

### **29.3.1 DeepAR Conceptual Overview**

- Input: at each time step `t`, the model receives the previous target value `z_{t-1}` (if available) and covariates `x_t`.
- The RNN updates its state.
- At prediction time, the model is used autoregressively: it samples from the predicted distribution at each step and feeds the sample back as input for the next step.
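The sampling loop in the last bullet can be sketched with NumPy. Here `toy_step` is a hypothetical stand‑in for the trained network: it maps the previous value to the mean and standard deviation of the next one. A real DeepAR model would also carry the RNN state from step to step.

```python
import numpy as np

def toy_step(prev_value):
    """Hypothetical one-step model: returns (mean, std) of the next value."""
    return 0.8 * prev_value, 0.5

def sample_paths(last_observed, horizon, n_paths, step_fn, seed=0):
    """Draw n_paths trajectories by feeding each sample back as the next input."""
    rng = np.random.default_rng(seed)
    paths = np.empty((n_paths, horizon))
    prev = np.full(n_paths, last_observed, dtype=float)
    for t in range(horizon):
        mean, std = step_fn(prev)
        prev = rng.normal(mean, std)  # one draw per path
        paths[:, t] = prev            # the draw becomes the next step's input
    return paths

paths = sample_paths(last_observed=1.0, horizon=5, n_paths=1000, step_fn=toy_step)
lower, median, upper = np.quantile(paths, [0.05, 0.5, 0.95], axis=0)
print(lower.round(2), median.round(2), upper.round(2))
```

The empirical quantiles of the sampled paths give the prediction intervals. Because each step's uncertainty feeds into the next, the intervals typically widen with the horizon.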

### **29.3.2 Implementing a Simplified DeepAR‑like Model in TensorFlow**

We'll build a model that predicts the parameters of a Gaussian distribution for each future step. We'll use a many‑to‑many RNN with a custom loss function (negative log‑likelihood).

```python
import tensorflow_probability as tfp

def gaussian_nll(y_true, y_pred):
    """
    Negative log-likelihood for Gaussian distribution.
    y_pred: (batch, horizon, 2) where last dim = [mean, log_std]
    """
    mean = y_pred[..., 0]
    log_std = y_pred[..., 1]
    std = tf.exp(log_std)
    return -tf.reduce_mean(tfp.distributions.Normal(mean, std).log_prob(y_true))

def build_deepar_model(backcast_len, forecast_len, n_features=1, lstm_units=32):
    """
    Simplified DeepAR: encoder-decoder with LSTM.
    During training, we use teacher forcing (true previous target).
    During inference, we would sample and feed back.
    """
    # Encoder: process backcast window
    encoder_inputs = layers.Input(shape=(backcast_len, n_features))
    encoder_lstm = layers.LSTM(lstm_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    encoder_states = [state_h, state_c]
    
    # Decoder: we'll use teacher forcing during training
    decoder_inputs = layers.Input(shape=(forecast_len, n_features))
    decoder_lstm = layers.LSTM(lstm_units, return_sequences=True)
    decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    
    # Output layer: predict mean and log_std for each time step
    outputs = layers.TimeDistributed(layers.Dense(2))(decoder_outputs)
    
    model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=outputs)
    model.compile(optimizer='adam', loss=gaussian_nll)
    return model

# Prepare data for teacher forcing: the decoder input is the target sequence shifted back by one step,
# so at step t the decoder sees the true value at t-1 and never the value it must predict
def prepare_deepar_data(data, backcast_len, forecast_len):
    X_encoder = []
    X_decoder = []
    y = []
    for i in range(len(data) - backcast_len - forecast_len + 1):
        X_encoder.append(data[i:i+backcast_len])
        # Decoder input at step t is the target at t-1, starting with the last encoder value
        start = i + backcast_len
        X_decoder.append(data[start-1:start+forecast_len-1])
        y.append(data[start:start+forecast_len])
    return np.array(X_encoder), np.array(X_decoder), np.array(y)

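# Prepare data
data_vals = df_stock['Return'].values.reshape(-1, 1)
X_enc, X_dec, y_multi = prepare_deepar_data(data_vals, backcast_len=20, forecast_len=5)
y_multi = y_multi.squeeze(axis=-1)  # (samples, forecast_len), matching the shape of the predicted mean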

# Temporal split
split = int(len(X_enc) * 0.8)
X_enc_train, X_enc_test = X_enc[:split], X_enc[split:]
X_dec_train, X_dec_test = X_dec[:split], X_dec[split:]
y_train, y_test = y_multi[:split], y_multi[split:]

# Build and train
model_deepar = build_deepar_model(backcast_len=20, forecast_len=5, lstm_units=32)
model_deepar.fit([X_enc_train, X_dec_train], y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=0)

# Prediction (during inference, we need to sample autoregressively; here we'll just use teacher forcing for evaluation)
y_pred_params = model_deepar.predict([X_enc_test, X_dec_test])
mean_pred = y_pred_params[..., 0]
std_pred = tf.exp(y_pred_params[..., 1]).numpy()
rmse = np.sqrt(np.mean((mean_pred - y_test)**2))
print(f"DeepAR-like RMSE: {rmse:.4f}")
```

**Explanation:**

- The model uses an LSTM encoder to process the backcast window, and an LSTM decoder that receives the true future values during training (teacher forcing). The decoder output is passed through a `TimeDistributed` dense layer with 2 units (mean and log standard deviation).
- The loss function is the negative log‑likelihood of a Gaussian distribution with predicted mean and standard deviation. This encourages the model to output both a point forecast and its uncertainty.
- During inference, we would need to run the model autoregressively: start with the last observed value, sample from the predicted distribution, feed the sample as next input, and repeat. Our evaluation above uses teacher forcing (feeding true values) which gives an optimistic estimate. A proper evaluation would implement autoregressive sampling.
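To make the loss concrete, the NumPy sketch below writes out the Gaussian negative log‑likelihood explicitly and turns a predicted mean and log standard deviation into a central 90% interval. The function names are illustrative and the interval assumes the Gaussian output is well calibrated.

```python
import numpy as np

def gaussian_nll_np(y, mean, log_std):
    """Average Gaussian negative log-likelihood, parameterized by log_std."""
    std = np.exp(log_std)
    return np.mean(0.5 * np.log(2 * np.pi) + log_std + 0.5 * ((y - mean) / std) ** 2)

def interval_90(mean, log_std):
    """Central 90% interval of a Gaussian (z = 1.645)."""
    half_width = 1.645 * np.exp(log_std)
    return mean - half_width, mean + half_width

y_demo = np.array([0.0, 1.0])
nll_perfect = gaussian_nll_np(y_demo, mean=y_demo, log_std=np.zeros(2))
print(nll_perfect)  # equals 0.5 * log(2 * pi) when mean == y and std == 1

lo_demo, hi_demo = interval_90(np.array([0.0]), np.array([0.0]))
```

The `log_std` term penalizes wide distributions and the squared term penalizes misses, so the model cannot lower the loss simply by inflating its uncertainty.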

---

## **29.4 Temporal Fusion Transformers (TFT)**

Temporal Fusion Transformers (Lim et al., 2021) combine LSTM layers with attention mechanisms to provide interpretable and accurate forecasts. Key components:

- **Variable selection network:** Chooses relevant input features at each time step.
- **LSTM encoder‑decoder:** Processes the sequence.
- **Multi‑head attention:** Captures long‑term dependencies.
- **Interpretable outputs:** Attention weights and variable selection provide insight into predictions.

TFT is designed for multi‑horizon forecasting with exogenous variables and can handle multiple time series. For NEPSE, we could use it to forecast returns using past returns, volume, and other covariates, while also obtaining explanations of which features matter most.
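The idea behind variable selection can be illustrated with a heavily simplified sketch. In TFT the weights and feature transformations come from gated residual networks; here they are replaced by hypothetical random linear maps (`embed`, `score_w`), so only the mechanism carries over: softmax weights, one per feature, applied to per‑feature vectors.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_features, d_model = 3, 4            # e.g. return, volume, day of week
x = rng.normal(size=(n_features,))    # raw inputs at one time step

# Hypothetical parameters; a real TFT learns these inside gated residual networks
embed = rng.normal(size=(n_features, d_model))      # per-feature embedding vectors
score_w = rng.normal(size=(n_features, n_features))

feature_vectors = x[:, None] * embed   # (n_features, d_model)
weights = softmax(score_w @ x)         # one weight per feature, summing to 1
selected = weights @ feature_vectors   # (d_model,) weighted combination

print(weights.round(3))
```

Because the weights are non‑negative and sum to one, averaging them over time steps gives a rough importance score per feature, which is the basis of TFT's feature‑level explanations.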

Implementing TFT from scratch is complex; we can use existing libraries like `pytorch-forecasting` (PyTorch) or `darts` (which has a TFT implementation). However, for demonstration, we'll outline the structure and show how to use a high‑level library.

```python
# Example using darts (if installed)
from darts.models import TFTModel
from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler

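# Convert to darts TimeSeries
# Note: darts expects a regularly spaced time index; if gaps on non-trading days prevent
# frequency inference, pass fill_missing_dates=True with an explicit freq and fill the resulting NaNs
series = TimeSeries.from_dataframe(df_stock, 'Date', 'Return')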

# Scale
scaler = Scaler()
series_scaled = scaler.fit_transform(series)

# Define TFT model
tft = TFTModel(
    input_chunk_length=20,
    output_chunk_length=5,
    hidden_size=32,
    lstm_layers=2,
    num_attention_heads=4,
    dropout=0.1,
    batch_size=32,
    n_epochs=50,
    add_relative_index=False,  # no need if we have datetime
    add_encoders={'cyclic': {'future': ['month', 'dayofweek']}},  # example covariates
    random_state=42
)

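# Train on all but the last 100 points and validate on the held-out tail
train_scaled, val_scaled = series_scaled[:-100], series_scaled[-100:]
tft.fit(train_scaled, val_series=val_scaled)

# Forecast the 5 steps after the full series; num_samples > 1 yields a sample-based distribution
forecast = tft.predict(n=5, series=series_scaled, num_samples=100)
forecast = scaler.inverse_transform(forecast)
print(forecast)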
```

**Explanation:**

- `TFTModel` from `darts` encapsulates the full architecture.
- We specify input length (20 days) and output length (5 days).
- We can add covariates (e.g., month, day of week) via `add_encoders`.
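- By default, the `darts` TFT is trained with a quantile likelihood, so `predict` returns sample‑based probabilistic forecasts when called with `num_samples` greater than 1; with the default of 1 it returns a single sampled path.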
- This is a high‑level example; in practice, we would tune hyperparameters and validate properly.
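TFT is typically trained with a quantile (pinball) loss, which is what makes its outputs usable as prediction intervals. The NumPy sketch below shows the asymmetry of that loss; `pinball_loss` is an illustrative helper, not a `darts` function.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a quantile level q in (0, 1)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_obs = np.array([1.0, 2.0, 3.0])
under = pinball_loss(y_obs, y_obs - 1.0, q=0.9)  # predictions too low by 1
over = pinball_loss(y_obs, y_obs + 1.0, q=0.9)   # predictions too high by 1
print(under, over)  # 0.9 and 0.1: under-prediction is penalized more at q = 0.9
```

Minimizing this loss for `q = 0.9` pushes the prediction up until roughly 90% of observations fall below it; training several quantile levels at once yields a full interval.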

---

## **29.5 Neural Hierarchical Interpolation (N‑HiTS)**

N‑HiTS (Challu et al., 2023) is an extension of N‑BEATS that introduces hierarchical interpolation to capture patterns at different scales. It uses:

- **Hierarchical downsampling:** The input is downsampled at multiple rates to create a pyramid of scales.
- **Interpolation:** Predictions at different scales are interpolated back to the original resolution and combined.
- **Efficiency:** It can handle very long sequences efficiently.

For NEPSE, if we wanted to forecast long horizons (e.g., 100 days), N‑HiTS would be more efficient than a standard Transformer.

Implementation is similar to N‑BEATS but with added downsampling blocks. We won't implement from scratch but note its existence.
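The two ingredients can still be shown in isolation. This sketch uses non‑overlapping max pooling for the downsampling and linear interpolation for the upsampling; the helper names and the toy numbers are assumptions for illustration, and a real N‑HiTS block would learn the coarse forecast coefficients.

```python
import numpy as np

def max_pool_1d(x, kernel):
    """Non-overlapping max pooling; len(x) must be divisible by kernel."""
    return x.reshape(-1, kernel).max(axis=1)

def interpolate_forecast(coarse, horizon):
    """Linearly interpolate a few coarse forecast points up to `horizon` steps."""
    coarse_grid = np.linspace(0, horizon - 1, num=len(coarse))
    return np.interp(np.arange(horizon), coarse_grid, coarse)

window = np.arange(20, dtype=float)
pooled = max_pool_1d(window, kernel=4)       # 20 inputs -> 5 inputs
coarse_forecast = np.array([0.0, 10.0])      # 2 coefficients for a slow component
fine_forecast = interpolate_forecast(coarse_forecast, horizon=5)

print(pooled)         # values 3, 7, 11, 15, 19
print(fine_forecast)  # values 0, 2.5, 5, 7.5, 10
```

A block that handles slow patterns needs only a few output coefficients regardless of the horizon, which is where the efficiency on long horizons comes from.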

---

## **29.6 Gaussian Processes**

Gaussian Processes (GPs) are non‑parametric probabilistic models that define a distribution over functions. They are well‑suited for time‑series with limited data because they provide uncertainty estimates and can incorporate prior knowledge through the kernel function. However, exact inference costs O(n³) in the number of training points, so GPs are rarely used on large datasets without approximations.

For a small subset of NEPSE data (e.g., one stock with 500 points), a GP could be a good baseline.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, Matern

# Prepare data (simple: regress the return on the time index)
X_gp = np.arange(len(df_stock)).reshape(-1, 1)  # time index
y_gp = df_stock['Return'].values

# Use the first 80% for training
train_size = int(0.8 * len(X_gp))
X_train_gp, X_test_gp = X_gp[:train_size], X_gp[train_size:]
y_train_gp, y_test_gp = y_gp[:train_size], y_gp[train_size:]

# Kernel: RBF + WhiteKernel for noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X_train_gp, y_train_gp)

# Predict and get uncertainty
y_pred_gp, sigma = gp.predict(X_test_gp, return_std=True)
rmse_gp = np.sqrt(np.mean((y_pred_gp - y_test_gp)**2))
print(f"GP RMSE: {rmse_gp:.4f}")
print(f"Uncertainty (avg std): {sigma.mean():.4f}")
```

**Explanation:**

- We use a simple RBF kernel to model smoothness, plus a white noise kernel to capture observation noise.
- The GP provides both point predictions and uncertainty (standard deviation).
- This model assumes the time index is the only input, so predictions more than a few length scales beyond the last training point revert to the prior mean with wide uncertainty. We could also include lagged features, but then the input dimension increases, making the GP slower.
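The closed‑form posterior behind `gp.predict` is short enough to write out. The sketch below assumes a zero prior mean, a unit‑variance RBF kernel, and fixed hyperparameters (scikit‑learn additionally optimizes them); the function names and toy data are illustrative.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, variance=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)
    post_mean = K_s.T @ np.linalg.solve(K, y_train)
    post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return post_mean, np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

x_tr = np.array([0.0, 1.0, 2.0])
y_tr = np.array([0.5, -0.3, 0.8])
gp_mean, gp_std = gp_posterior(x_tr, y_tr, np.array([1.0, 50.0]))
print(gp_mean, gp_std)
```

At the training input `1.0` the posterior mean is close to the observed `-0.3` with a near‑zero standard deviation, while at `50.0`, far from any data, it falls back to the prior (mean 0, standard deviation 1). The `np.linalg.solve` calls on the n×n matrix are the source of the cubic cost.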

---

## **29.7 State Space Models**

State space models (SSMs) are a classical approach that models a time series as the combination of a latent state evolving over time and an observation model. They are highly interpretable and can incorporate trends, seasonality, and regression effects. The `statsmodels` library provides tools like `UnobservedComponents` for this.

```python
from statsmodels.tsa.statespace.structural import UnobservedComponents

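# Fit a local linear trend model with a period-5 seasonal component (one trading week)
ssm = UnobservedComponents(y_train_gp, level='local linear trend', seasonal=5)
ssm_res = ssm.fit(disp=False)  # disp=False silences the optimizer output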

# Forecast next 5 steps
forecast_ssm = ssm_res.forecast(5)
print(forecast_ssm)
```

**Explanation:**

- The model decomposes the series into level, trend, and seasonal components.
- It is interpretable and often works well when the data has clear structure.
- For NEPSE returns, seasonality may be weak, but the trend component might capture local momentum.
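The simplest state space model, the local level model, shows what happens under the hood. The sketch below is a hand‑written Kalman filter with fixed, hypothetical variances (`statsmodels` estimates them by maximum likelihood), applied to synthetic data rather than NEPSE returns.

```python
import numpy as np

def local_level_filter(y, obs_var, level_var, init_level=0.0, init_var=1e6):
    """Kalman filter for y_t = level_t + noise, level_t = level_{t-1} + drift."""
    level, var = init_level, init_var
    filtered = np.empty(len(y))
    for t, obs in enumerate(y):
        var = var + level_var                 # predict: the level may have drifted
        gain = var / (var + obs_var)          # Kalman gain, between 0 and 1
        level = level + gain * (obs - level)  # update toward the new observation
        var = (1.0 - gain) * var              # uncertainty shrinks after the update
        filtered[t] = level
    return filtered

rng = np.random.default_rng(0)
noisy = 2.0 + rng.normal(scale=0.5, size=200)  # constant level plus noise
levels = local_level_filter(noisy, obs_var=0.25, level_var=1e-4)
print(levels[-1])  # close to the true level of 2.0
```

The ratio of `level_var` to `obs_var` controls how quickly the estimated level follows new observations. A local linear trend model adds a second state for the slope, and the seasonal component adds further states, but the predict and update steps keep the same form.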

---

## **29.8 Hybrid Models**

Hybrid models combine two or more architectures to leverage their strengths. For example:

- **ARIMA + Neural Network:** Use ARIMA to capture linear patterns and a neural network to model non‑linear residuals.
- **CNN + LSTM:** Use CNN to extract local features, LSTM to model long‑term dependencies.
- **Transformer + LSTM:** Use Transformer for long‑range attention and LSTM for sequential processing.

For NEPSE, a hybrid of a statistical model (e.g., ARIMA) and a neural network could be effective. We can implement a simple residual hybrid:

```python
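# Step 1: Fit ARIMA on training returns
from statsmodels.tsa.arima.model import ARIMA
from sklearn.neural_network import MLPRegressor

arima_fit = ARIMA(y_train_gp, order=(1, 0, 1)).fit()
arima_pred_train = arima_fit.predict()
# One-step-ahead test predictions: extend the data but keep the fitted parameters
arima_full = arima_fit.append(y_test_gp, refit=False)
arima_pred_test = arima_full.predict(start=len(y_train_gp))

# Step 2: Train a small neural network on lagged ARIMA residuals
n_lags = 5
resid_all = np.concatenate([y_train_gp - arima_pred_train, y_test_gp - arima_pred_test])
lagged = np.array([resid_all[i - n_lags:i] for i in range(n_lags, len(resid_all))])
target = resid_all[n_lags:]
n_train = len(y_train_gp) - n_lags  # rows whose target lies in the training period
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=42)
nn.fit(lagged[:n_train], target[:n_train])
# Each test row uses only residuals observed before the step being predicted
nn_pred_test = nn.predict(lagged[n_train:])

# Step 3: Combine forecasts
final_pred = arima_pred_test + nn_pred_test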
```

**Explanation:**

- The ARIMA captures the linear autocorrelation; the NN learns to correct its errors.
- This can improve accuracy over either model alone.

---

## **29.9 Architecture Selection**

Choosing the right architecture depends on:

- **Data size:** N‑BEATS, DeepAR, and Transformers need more data than ARIMA or GP. For a single stock, simple models may suffice.
- **Forecast horizon:** For long horizons, N‑HiTS or Informer are more efficient.
- **Interpretability:** TFT and state space models offer explanations; deep learning black boxes do not.
- **Uncertainty requirements:** DeepAR, GP, and TFT provide probabilistic forecasts.
- **Multiple series:** DeepAR and TFT naturally handle multiple related time series.

For the NEPSE system, a pragmatic approach is to start with simple models (ARIMA, ETS) and progress to more complex ones only if they significantly improve validation performance. We should maintain a suite of models and use time‑series cross‑validation, in which every training fold precedes its test fold, to select the best.
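A fair comparison needs splits in which training data always precedes test data. The generator below is a minimal sketch of expanding‑window splitting; the function name and parameters are illustrative, and scikit‑learn's `TimeSeriesSplit` offers a ready‑made alternative.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        yield np.arange(0, test_start), np.arange(test_start, test_end)

splits = list(expanding_window_splits(n_samples=100, n_splits=3, test_size=10))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
# 70 70 79
# 80 80 89
# 90 90 99
```

Each candidate architecture is then fitted on `train_idx` and scored on `test_idx` for every split, and the scores are averaged before choosing a model.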

---

## **29.10 Chapter Summary**

In this chapter, we surveyed specialized time‑series architectures, each with unique strengths:

- **N‑BEATS:** A pure deep learning model that decomposes the series via residual blocks; simple and accurate.
- **DeepAR:** Probabilistic forecasting with RNNs; handles multiple series and provides uncertainty.
- **Temporal Fusion Transformers:** Attention‑based model with interpretability; supports exogenous variables.
- **N‑HiTS:** Hierarchical version of N‑BEATS for long sequences.
- **Gaussian Processes:** Non‑parametric probabilistic model; good for small datasets.
- **State Space Models:** Classical interpretable models; useful as baselines.
- **Hybrid Models:** Combine statistical and ML components for potentially improved accuracy.

### **Practical Takeaways for the NEPSE System:**

- For point forecasts of a single stock, N‑BEATS or a simple LSTM may be sufficient.
- If you need prediction intervals, try DeepAR or a Gaussian Process.
- If interpretability is critical, consider TFT or state space models.
- Always start with a simple baseline; only add complexity if it demonstrably helps on out‑of‑sample validation.
- Use time‑series cross‑validation to compare architectures fairly.

In the next chapter, **Chapter 30: Model Training Best Practices**, we will consolidate all we've learned about training neural networks effectively, covering data preparation, batch size selection, learning rate scheduling, loss function selection, early stopping, checkpointing, mixed precision, distributed training, monitoring, and debugging.

---

**End of Chapter 29**