# Chapter 55: Probabilistic Forecasting

## Learning Objectives

By the end of this chapter, you will be able to:

- Distinguish between deterministic (point) forecasts and probabilistic forecasts that quantify uncertainty
- Understand why uncertainty quantification is critical for risk management in financial time‑series prediction
- Generate prediction intervals using quantile regression and conformal prediction
- Implement probabilistic models such as Bayesian linear regression and Gaussian processes
- Apply ensemble methods to estimate prediction uncertainty
- Evaluate probabilistic forecasts using proper scoring rules (CRPS, log‑likelihood, interval coverage)
- Communicate probabilistic predictions effectively to stakeholders
- Integrate uncertainty estimates into trading decisions and risk management for the NEPSE system

---

## Introduction

Throughout this handbook, we have focused on **point forecasts**: predictions that give a single value (e.g., tomorrow's closing price or the probability that the price will go up). However, in financial markets, uncertainty is pervasive. A point forecast tells you nothing about how confident the model is. A prediction of 1050 with a narrow prediction interval is very different from the same prediction with a wide interval. For risk management, position sizing, and decision making, understanding the uncertainty around a prediction is as important as the prediction itself.

**Probabilistic forecasting** addresses this by producing a full probability distribution over future outcomes. Instead of saying “tomorrow’s close will be 1050”, we say “there is a 90% chance that tomorrow’s close will be between 1020 and 1080”. This extra information allows traders to quantify risk, set stop‑losses, and allocate capital more intelligently.

In this chapter, we will explore methods for generating probabilistic forecasts, from simple techniques like quantile regression to advanced approaches like Gaussian processes and Bayesian neural networks. We will apply these methods to the NEPSE stock prediction problem and discuss how to evaluate and use probabilistic predictions in practice.

---

## 55.1 Deterministic vs. Probabilistic Forecasting

A **deterministic forecast** (or point forecast) is a single value: `ŷ`. A **probabilistic forecast** is a probability distribution over possible values of the target variable. For a continuous target, this could be a density function; for practical use, we often summarise it by **prediction intervals** at various confidence levels (e.g., 50%, 80%, 95%).

Why go probabilistic?

- **Risk assessment**: Know the likelihood of extreme outcomes (e.g., a price drop of more than 5%).
- **Decision making**: Optimal decisions under uncertainty require knowing the full distribution, not just the mean. For example, in portfolio optimisation, we need expected return and covariance.
- **Model diagnostics**: A well‑calibrated probabilistic model produces intervals that contain the true value the specified percentage of the time (e.g., 90% intervals contain the truth 90% of the time). This is a stronger check than just point forecast accuracy.
- **Communication**: Stakeholders (traders, risk managers) can understand the range of possible outcomes.

For the NEPSE system, a probabilistic forecast might answer: “Given the current market conditions, what is the distribution of tomorrow’s return for NABIL stock? What is the 5th percentile (worst case) and 95th percentile (best case)?”

---

## 55.2 Prediction Intervals from Residuals

A simple way to obtain prediction intervals is to assume that the errors of a point forecast model are normally distributed with constant variance. Then an approximate 95% interval is `ŷ ± 1.96 * σ`, where `σ` is the standard deviation of the residuals on a validation set.

This method is easy but has strong assumptions: homoscedasticity (constant variance) and normality. Financial returns often exhibit heteroscedasticity (volatility clustering) and heavy tails, so these intervals may be poorly calibrated.

**Example: Linear regression with residual‑based intervals**

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assume X_train, y_train, X_val, y_val, X_test, y_test are prepared
# as consecutive, time-ordered splits
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on validation set (time‑based)
y_val_pred = model.predict(X_val)
residuals = y_val - y_val_pred
residual_std = np.std(residuals, ddof=1)  # sample standard deviation

# Predict on test
y_test_pred = model.predict(X_test)
lower = y_test_pred - 1.96 * residual_std
upper = y_test_pred + 1.96 * residual_std

# Check coverage
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"95% prediction interval coverage: {coverage:.3f}")
```

**Explanation:**  
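We compute the standard deviation of residuals on validation data rather than training data, because in‑sample residuals are optimistically small. We then construct intervals assuming normality. The empirical coverage may differ from the nominal 95%: coverage well below 95% suggests that the normality or constant‑variance assumption is violated, or that the error distribution has shifted between the validation and test periods.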
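
When normality is doubtful, one option is to replace the `± 1.96 * σ` band with empirical quantiles of the validation residuals. The following is a minimal sketch on synthetic heavy‑tailed residuals (Student‑t with 3 degrees of freedom, an assumption chosen purely for illustration); the variable names are hypothetical and not part of the NEPSE pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed residuals (Student-t, 3 degrees of freedom) standing in
# for the validation and test errors of some point model.
val_residuals = rng.standard_t(df=3, size=5000)
test_residuals = rng.standard_t(df=3, size=5000)

# Gaussian rule: symmetric band of +/- 1.96 sample standard deviations
sigma = np.std(val_residuals, ddof=1)
gauss_lo, gauss_hi = -1.96 * sigma, 1.96 * sigma

# Empirical rule: 2.5th and 97.5th percentiles of the validation residuals
emp_lo, emp_hi = np.quantile(val_residuals, [0.025, 0.975])

gauss_cov = np.mean((test_residuals >= gauss_lo) & (test_residuals <= gauss_hi))
emp_cov = np.mean((test_residuals >= emp_lo) & (test_residuals <= emp_hi))
print(f"Gaussian band:  [{gauss_lo:.2f}, {gauss_hi:.2f}], coverage {gauss_cov:.3f}")
print(f"Empirical band: [{emp_lo:.2f}, {emp_hi:.2f}], coverage {emp_cov:.3f}")
```

With a real model, the interval for each test point would then be `[ŷ + emp_lo, ŷ + emp_hi]`. This removes the normality assumption but still assumes the error distribution does not depend on the inputs.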

---

## 55.3 Quantile Regression

Quantile regression directly models conditional quantiles of the target distribution, without assuming a parametric form. For a desired quantile `τ` (e.g., 0.1 for the 10th percentile), we minimise the pinball loss:

`L(y, ŷ) = (τ - 1) * (y - ŷ)  if y < ŷ else τ * (y - ŷ)`

This yields a model that predicts the `τ`‑quantile. By training separate models for different quantiles (e.g., 0.1, 0.5, 0.9), we can construct prediction intervals.
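
To see why minimising the pinball loss recovers a quantile, the sketch below scans a grid of constant predictions on synthetic data and compares the minimiser with the empirical `τ`‑quantile. The data and the helper name `pinball` are illustrative assumptions.

```python
import numpy as np

def pinball(y, q, tau):
    """Average pinball loss of predicting the constant q for every y."""
    diff = y - q
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff))

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=20000)  # synthetic targets
tau = 0.1

# Evaluate the loss on a grid of candidate constants and pick the minimiser
grid = np.linspace(-3, 3, 601)
losses = np.array([pinball(y, q, tau) for q in grid])
best_q = grid[np.argmin(losses)]

print(f"pinball minimiser:      {best_q:.2f}")
print(f"empirical 10% quantile: {np.quantile(y, tau):.2f}")
```

The two printed values should agree up to the grid resolution, which is the property a quantile regression model exploits when the constant is replaced by a function of the features.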

**Example: Quantile Regression with scikit‑learn**

```python
from sklearn.ensemble import GradientBoostingRegressor

# Model for median (τ=0.5)
model_median = GradientBoostingRegressor(loss='quantile', alpha=0.5, random_state=42)
model_median.fit(X_train, y_train)

# Model for lower quantile (τ=0.1)
model_lower = GradientBoostingRegressor(loss='quantile', alpha=0.1, random_state=42)
model_lower.fit(X_train, y_train)

# Model for upper quantile (τ=0.9)
model_upper = GradientBoostingRegressor(loss='quantile', alpha=0.9, random_state=42)
model_upper.fit(X_train, y_train)

# Predict
y_lower = model_lower.predict(X_test)
y_median = model_median.predict(X_test)
y_upper = model_upper.predict(X_test)

# 80% prediction interval: [y_lower, y_upper]
coverage = np.mean((y_test >= y_lower) & (y_test <= y_upper))
print(f"80% interval coverage: {coverage:.3f}")
```

**Explanation:**  
`GradientBoostingRegressor` with `loss='quantile'` implements quantile regression. We train three models for different quantiles. The intervals are formed by the lower and upper quantile predictions. This method is flexible and can capture heteroscedasticity because the quantile models can adapt to changing variance.

**Caveat:** The intervals from independent quantile models may cross (e.g., the predicted 90th percentile might be below the predicted 10th percentile for some inputs). This is undesirable but can be mitigated by post‑processing or using models that ensure monotonicity (e.g., quantile regression forests).
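
One simple post‑processing step is to sort the predicted quantiles for each test point (sometimes called rearrangement). The sketch below uses a small hypothetical array of quantile predictions rather than output from the models above.

```python
import numpy as np

# Hypothetical quantile predictions for four test points (columns: 10%, 50%, 90%).
# The third row crosses: its 10% prediction exceeds its 50% prediction.
q_pred = np.array([
    [-1.2, 0.1, 1.5],
    [-0.8, 0.0, 0.9],
    [0.4, 0.2, 1.1],
    [-2.0, -0.5, 0.3],
])

# Sorting each row restores monotonicity across quantile levels
q_fixed = np.sort(q_pred, axis=1)
y_lower, y_median, y_upper = q_fixed.T

print(q_fixed[2])
```

With the three fitted models, the same idea would apply to `np.column_stack([y_lower, y_median, y_upper])`. Sorting fixes the ordering but does not repair a poorly fitted quantile model, so frequent crossings are still worth investigating.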

---

## 55.4 Conformal Prediction

Conformal prediction is a framework that produces prediction sets with guaranteed coverage, regardless of the underlying model, under the assumption of exchangeability. It is model‑agnostic and works with any point predictor.

The basic idea for regression:

1. Split the training data into a proper training set and a calibration set.
2. Train a model on the proper training set.
3. On the calibration set, compute **non‑conformity scores** (e.g., absolute error `|y_i - ŷ_i|`).
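4. For a new test point, compute its prediction `ŷ_test` and then form the prediction interval as `ŷ_test ± q`, where `q` is the `⌈(n+1)(1-α)⌉`‑th smallest of the `n` calibration scores (for large `n`, approximately their `(1-α)`‑quantile).

This yields a prediction interval that, under exchangeability, covers the true value with probability at least `1-α`. The guarantee is marginal: it holds on average over calibration sets and test points, not conditionally on a particular input.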
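
The coverage claim can be checked by simulation. The sketch below draws i.i.d. absolute errors (a stand‑in for the non‑conformity scores of any fixed model), applies the finite‑sample quantile rule `⌈(n+1)(1-α)⌉`, and averages the coverage over many random calibration/test draws. The helper `conformal_quantile` and the Student‑t error distribution are assumptions made for illustration.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points for this alpha
        return np.inf
    return np.sort(scores)[k - 1]

rng = np.random.default_rng(2)
alpha, n_cal, n_test, n_trials = 0.1, 200, 200, 500
coverages = []
for _ in range(n_trials):
    # i.i.d. absolute errors stand in for |y - y_hat| of any fixed model
    cal_scores = np.abs(rng.standard_t(df=4, size=n_cal))
    test_scores = np.abs(rng.standard_t(df=4, size=n_test))
    q = conformal_quantile(cal_scores, alpha)
    coverages.append(np.mean(test_scores <= q))

print(f"average coverage over {n_trials} trials: {np.mean(coverages):.3f}")
```

Individual trials scatter around the target, which is what a marginal guarantee allows; only the average is pinned down. The simulation uses exchangeable data by construction, so it says nothing about coverage under distribution shift.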

**Example: Conformal prediction for NEPSE returns**

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Split data into training, calibration, test
# Assume X, y are sorted by time
n = len(X)
n_train = int(0.6 * n)
n_cal = int(0.2 * n)
X_train = X[:n_train]
y_train = y[:n_train]
X_cal = X[n_train:n_train+n_cal]
y_cal = y[n_train:n_train+n_cal]
X_test = X[n_train+n_cal:]
y_test = y[n_train+n_cal:]

# Train model on proper training set
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Compute non‑conformity scores on calibration set
y_cal_pred = model.predict(X_cal)
scores = np.abs(y_cal - y_cal_pred)

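# For a desired coverage of 90%, take the finite-sample conformal quantile:
# the ceil((n + 1) * (1 - alpha))-th smallest calibration score
alpha = 0.1
n_scores = len(scores)
k = int(np.ceil((n_scores + 1) * (1 - alpha)))
q = np.sort(scores)[min(k, n_scores) - 1]  # falls back to the max score if k > n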

# Predict on test
y_test_pred = model.predict(X_test)
lower = y_test_pred - q
upper = y_test_pred + q

# Check coverage
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Conformal 90% interval coverage: {coverage:.3f}")
```

**Explanation:**  
We reserve a calibration set that is not used for training. The scores (absolute errors) on this set represent the typical prediction error. The interval `ŷ ± q` will cover the true value for approximately 90% of test points, assuming exchangeability (roughly, that the calibration and test points come from the same distribution and their order carries no information). Financial time series with drift or volatility clustering can violate this assumption, so the empirical coverage on the test period should still be checked. This method is very simple and works with any model, but the intervals have constant width (`2q`) for all test points. Variants like **locally weighted conformal prediction** can produce adaptive widths.
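
A minimal sketch of the locally weighted idea follows: each absolute error is divided by an estimate of how difficult the input is, and the resulting quantile is scaled back by that estimate at prediction time. The functions `predict` and `difficulty` are hypothetical stand‑ins (here set to the true mean and noise scale of the synthetic data); in practice the difficulty estimate would typically be a second model fitted to absolute residuals.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Synthetic heteroscedastic data: noise scale grows with |x|."""
    x = rng.uniform(-2, 2, size=n)
    scale = 0.2 + 0.8 * np.abs(x)
    y = np.sin(x) + scale * rng.normal(size=n)
    return x, y

def predict(x):
    return np.sin(x)  # stand-in for a fitted point model

def difficulty(x):
    return 0.2 + 0.8 * np.abs(x)  # stand-in for a fitted error-scale model

x_cal, y_cal = simulate(1000)
x_test, y_test = simulate(1000)
alpha = 0.1

# Normalised non-conformity scores: absolute error divided by estimated scale
scores = np.abs(y_cal - predict(x_cal)) / difficulty(x_cal)
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

# Interval half-width now scales with the difficulty estimate
lower = predict(x_test) - q * difficulty(x_test)
upper = predict(x_test) + q * difficulty(x_test)

coverage = np.mean((y_test >= lower) & (y_test <= upper))
widths = upper - lower
print(f"coverage: {coverage:.3f}")
print(f"narrowest / widest interval: {widths.min():.2f} / {widths.max():.2f}")
```

The marginal coverage target is unchanged, but the intervals are narrow where the estimated scale is small and wide where it is large.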

---

## 55.5 Bayesian Methods

Bayesian approaches naturally produce probabilistic forecasts by treating model parameters as random variables and computing the posterior predictive distribution. Given training data `D`, the predictive distribution for a new input `x*` is:

`p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ`

This integral accounts for both aleatoric uncertainty (noise in the data) and epistemic uncertainty (uncertainty about model parameters).
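
In practice the integral is usually approximated by Monte Carlo: draw parameters from the posterior, then draw one outcome per parameter draw. The sketch below uses hypothetical posterior draws for a single test input (not the output of a fitted model) to show how the predictive variance splits into an aleatoric and an epistemic part via the law of total variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n_draws = 20000

# Hypothetical posterior draws for a single test input x*:
# theta = (mu, sigma), where mu is the predicted mean and sigma the noise scale
mu_draws = rng.normal(loc=0.5, scale=0.3, size=n_draws)  # parameter uncertainty
sigma_draws = np.abs(rng.normal(loc=1.0, scale=0.1, size=n_draws))

# Monte Carlo version of the integral: one y* per posterior draw
y_draws = rng.normal(loc=mu_draws, scale=sigma_draws)

aleatoric = np.mean(sigma_draws ** 2)  # E[ Var(y* | theta) ]
epistemic = np.var(mu_draws)           # Var( E[y* | theta] )
total = np.var(y_draws)

print(f"aleatoric {aleatoric:.3f} + epistemic {epistemic:.3f} = {aleatoric + epistemic:.3f}")
print(f"variance of predictive draws: {total:.3f}")
```

The two printed totals should agree up to Monte Carlo error. More data shrinks the epistemic term, while the aleatoric term remains.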

### 55.5.1 Bayesian Linear Regression

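In Bayesian linear regression, we place a prior on the coefficients, typically a Gaussian, and use a Gaussian likelihood. If the noise variance is known (or given a conjugate prior), the posterior and predictive distributions are available in closed form; with other priors on the variance, as in the PyMC3 example in this section, we sample from the posterior instead.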
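
For the conjugate case with known noise variance, the closed‑form posterior and predictive distribution can be written in a few lines of NumPy. This is a sketch on synthetic data; the prior scale, noise level, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: y = X w + noise, with a known noise standard deviation
n, d = 200, 3
w_true = np.array([0.5, -1.0, 2.0])
noise_sd = 0.5   # assumed known
prior_sd = 10.0  # prior: w ~ N(0, prior_sd^2 I)
X = rng.normal(size=(n, d))
y = X @ w_true + noise_sd * rng.normal(size=n)

# Posterior over weights: N(m, S)
S = np.linalg.inv(np.eye(d) / prior_sd**2 + X.T @ X / noise_sd**2)
m = S @ X.T @ y / noise_sd**2

# Posterior predictive at new inputs: N(x* . m, noise_sd^2 + x*^T S x*)
X_new = rng.normal(size=(1000, d))
y_new = X_new @ w_true + noise_sd * rng.normal(size=1000)
pred_mean = X_new @ m
pred_sd = np.sqrt(noise_sd**2 + np.einsum("ij,jk,ik->i", X_new, S, X_new))

lower = pred_mean - 1.645 * pred_sd
upper = pred_mean + 1.645 * pred_sd
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print("posterior mean:", np.round(m, 2))
print(f"90% interval coverage: {coverage:.3f}")
```

The predictive standard deviation always exceeds the noise level, because the term `x*ᵀ S x*` adds the uncertainty about the weights on top of the observation noise.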

**Example with PyMC3**

```python
import pymc3 as pm
import numpy as np

# Prepare data (add intercept)
X_train_with_intercept = np.c_[np.ones(X_train.shape[0]), X_train]
X_test_with_intercept = np.c_[np.ones(X_test.shape[0]), X_test]

with pm.Model() as linear_model:
    # Shared data containers, so the inputs can be swapped for prediction
    X_data = pm.Data('X', X_train_with_intercept)
    y_data = pm.Data('y', y_train)

    # Priors: noise scale and regression coefficients
    sigma = pm.HalfCauchy('sigma', beta=10)
    beta = pm.Normal('beta', mu=0, sigma=10, shape=X_train_with_intercept.shape[1])

    # Likelihood
    mu = pm.math.dot(X_data, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y_data)

    # Inference
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)

# Posterior predictive on the test inputs
with linear_model:
    # The 'y' values are placeholders; only their length matters here
    pm.set_data({'X': X_test_with_intercept,
                 'y': np.zeros(X_test_with_intercept.shape[0])})
    y_pred = pm.sample_posterior_predictive(trace, samples=500,
                                             var_names=['y_obs'])

# y_pred['y_obs'] has shape (500, n_test)
y_pred_samples = y_pred['y_obs']
# Compute mean and quantiles
y_mean = y_pred_samples.mean(axis=0)
y_lower = np.percentile(y_pred_samples, 5, axis=0)
y_upper = np.percentile(y_pred_samples, 95, axis=0)
```

**Explanation:**  
We define a probabilistic model with priors and likelihood. MCMC sampling gives us samples from the posterior distribution of parameters. Then we generate predictive samples, which give a full distribution for each test point. The 90% predictive interval (the Bayesian counterpart of a prediction interval) is the 5th to 95th percentiles of these samples.

### 55.5.2 Gaussian Processes

Gaussian processes (GPs) are a powerful non‑parametric Bayesian method for regression. They place a prior over functions directly, and the posterior predictive is Gaussian with mean and variance given by kernel computations. GPs naturally provide uncertainty estimates that grow in regions with few data points.

**Example with scikit‑learn**

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Define kernel: RBF + noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0, normalize_y=True, random_state=42)
gp.fit(X_train, y_train)

# Predict with uncertainty
y_mean, y_std = gp.predict(X_test, return_std=True)

# 90% interval
lower = y_mean - 1.645 * y_std
upper = y_mean + 1.645 * y_std
```

**Explanation:**  
`GaussianProcessRegressor` fits a GP model. The kernel captures similarity between points; the RBF kernel assumes that nearby points have correlated outputs. The `WhiteKernel` accounts for noise. The prediction returns mean and standard deviation, which can be used to form Gaussian intervals. Note that these intervals assume Gaussian predictive distribution, which may be reasonable but is an approximation.
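
The "kernel computations" behind the GP posterior are short enough to write out. The sketch below implements the standard posterior mean and variance for a 1‑D RBF kernel with fixed hyperparameters (no optimisation, unlike `GaussianProcessRegressor`); the data and the helper `rbf` are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, variance=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(6)
x_train = np.linspace(-3, 3, 25)
noise_var = 0.1**2
y_train = np.sin(x_train) + np.sqrt(noise_var) * rng.normal(size=x_train.size)

# One test point inside the training range, one far outside it
x_test = np.array([0.0, 8.0])

K = rbf(x_train, x_train) + noise_var * np.eye(x_train.size)
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

# Solve with a Cholesky factor rather than an explicit inverse
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
v = np.linalg.solve(L, K_s)

post_mean = K_s.T @ alpha
post_var = np.diag(K_ss) - np.sum(v**2, axis=0) + noise_var  # includes observation noise
post_sd = np.sqrt(post_var)

for x, mu, sd in zip(x_test, post_mean, post_sd):
    print(f"x = {x:4.1f}: mean {mu:+.3f}, sd {sd:.3f}")
```

The point far from the training data reverts to the prior: its predictive standard deviation is close to the prior scale, while the point inside the data range has a much tighter distribution.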

---

## 55.6 Ensemble Methods for Uncertainty

Ensemble methods, such as Random Forest or neural network ensembles, can provide uncertainty estimates by looking at the variance of predictions across ensemble members. This captures epistemic uncertainty (disagreement among models).

**Example: Random Forest prediction intervals**

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get predictions from all trees
all_preds = np.array([tree.predict(X_test) for tree in rf.estimators_])
# all_preds shape: (n_trees, n_test)

y_mean = all_preds.mean(axis=0)
y_std = all_preds.std(axis=0)

# 90% interval (assuming Gaussian)
lower = y_mean - 1.645 * y_std
upper = y_mean + 1.645 * y_std
```

**Explanation:**  
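Each tree in a Random Forest gives a prediction. The variance across trees reflects uncertainty due to different bootstrap samples and feature subsets. Because this spread measures disagreement about the mean prediction and ignores the noise in the target itself, intervals built from it are usually too narrow to serve as prediction intervals for `y`, and they shrink further if all trees are similar. The Gaussian assumption for the interval may also be poor.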
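
The gap between ensemble spread and a full prediction interval can be illustrated with a small bootstrap ensemble. In this sketch, straight‑line fits stand in for ensemble members, and a residual noise estimate is added in quadrature to the spread; the data and this particular combination rule are assumptions for illustration, not the handbook's method.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + 0.5 * rng.normal(size=n)  # linear signal plus noise
    return x, y

x_train, y_train = make_data(300)
x_test, y_test = make_data(2000)

# Bootstrap ensemble of straight-line fits (stand-in for ensemble members)
n_members = 100
member_preds = np.empty((n_members, x_test.size))
for b in range(n_members):
    idx = rng.integers(0, x_train.size, size=x_train.size)
    slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)
    member_preds[b] = slope * x_test + intercept

y_mean = member_preds.mean(axis=0)
spread_sd = member_preds.std(axis=0)  # disagreement between members

# Residual noise estimated from the full-sample fit (ideally on held-out data)
slope, intercept = np.polyfit(x_train, y_train, deg=1)
resid_sd = np.std(y_train - (slope * x_train + intercept), ddof=2)
total_sd = np.sqrt(spread_sd**2 + resid_sd**2)

def coverage(sd):
    return np.mean(np.abs(y_test - y_mean) <= 1.645 * sd)

print(f"coverage, ensemble spread only:    {coverage(spread_sd):.3f}")
print(f"coverage, spread + residual noise: {coverage(total_sd):.3f}")
```

Intervals based on spread alone cover far fewer than 90% of the test targets here, while adding the residual term brings coverage close to the nominal level.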

### 55.6.1 Quantile Regression Forests

Quantile Regression Forests (QRF) extend Random Forest to estimate conditional quantiles. Instead of averaging predictions, they keep all leaf values and compute the empirical distribution of training targets in each leaf.

**Example with `quantile-forest` library**

```python
from quantile_forest import RandomForestQuantileRegressor

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=42)
qrf.fit(X_train, y_train)

# Predict quantiles
y_lower = qrf.predict(X_test, quantiles=0.1)
y_median = qrf.predict(X_test, quantiles=0.5)
y_upper = qrf.predict(X_test, quantiles=0.9)
```

**Explanation:**  
`RandomForestQuantileRegressor` stores the training targets in each leaf. For a test point, it finds the leaf for each tree and collects all training targets in that leaf. The quantiles are then computed from the pooled set across trees. This gives non‑parametric intervals that can capture asymmetry and heteroscedasticity.

---

## 55.7 Probabilistic Forecasting with Neural Networks

Neural networks can also produce probabilistic forecasts by:

- **Mean + Variance networks**: Output both mean and variance, and train with negative log‑likelihood of a Gaussian (or other) distribution.
- **Quantile regression**: Output multiple quantiles simultaneously (using a multi‑head network).
- **Bayesian neural networks**: Place distributions over weights and use variational inference or Monte Carlo dropout.

### 55.7.1 Mean‑Variance Network

We can modify a neural network to output two values: mean `μ` and log‑variance `log(σ²)`. The loss is the negative log‑likelihood of a Gaussian:

`L = ½ log(σ²) + ½ (y - μ)² / σ² + constant`

This allows the network to learn heteroscedastic uncertainty.

**Example with PyTorch**

```python
import torch
import torch.nn as nn
import torch.optim as optim

class MeanVarianceNet(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.mean_out = nn.Linear(hidden_dim, 1)
        self.logvar_out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        mean = self.mean_out(h).squeeze(-1)      # squeeze only the output dim
        logvar = self.logvar_out(h).squeeze(-1)  # (safe for a batch of size 1)
        return mean, logvar

def nll_loss(mean, logvar, target):
    # Negative log‑likelihood of Gaussian
    return (0.5 * logvar + 0.5 * (target - mean)**2 / torch.exp(logvar)).mean()

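# Training loop. X_train_tensor, y_train_tensor and X_test_tensor are assumed to be
# float32 tensors built from the arrays, with y_train_tensor of shape (n_train,)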
model = MeanVarianceNet(input_dim=X_train.shape[1], hidden_dim=64)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    mean, logvar = model(X_train_tensor)
    loss = nll_loss(mean, logvar, y_train_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss {loss.item():.4f}")

# Predict (no gradients needed at inference time)
model.eval()
with torch.no_grad():
    mean, logvar = model(X_test_tensor)
std = torch.exp(0.5 * logvar).numpy()
mean = mean.numpy()
lower = mean - 1.645 * std
upper = mean + 1.645 * std
```

**Explanation:**  
The network outputs both mean and log‑variance. The loss is the negative log‑likelihood, which encourages the network to output high variance where errors are large and low variance where predictions are accurate. This yields input‑dependent uncertainty.
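
A quick numerical check of this behaviour: with the mean prediction held fixed, the log‑variance that minimises the Gaussian negative log‑likelihood is the log of the mean squared error. The sketch below verifies this on synthetic data using NumPy only; `gaussian_nll` mirrors the loss formula but is a hypothetical helper, not the PyTorch function.

```python
import numpy as np

def gaussian_nll(mean, logvar, target):
    """Average Gaussian negative log-likelihood, dropping the constant term."""
    return np.mean(0.5 * logvar + 0.5 * (target - mean) ** 2 / np.exp(logvar))

rng = np.random.default_rng(8)
target = rng.normal(loc=0.0, scale=2.0, size=10000)  # true variance is 4
mean = np.zeros_like(target)                         # mean prediction held fixed

# Scan candidate log-variances and locate the minimiser of the loss
logvar_grid = np.linspace(-2, 4, 601)
losses = np.array([gaussian_nll(mean, lv, target) for lv in logvar_grid])
best_logvar = logvar_grid[np.argmin(losses)]

mse = np.mean((target - mean) ** 2)
print(f"variance at the NLL minimum: {np.exp(best_logvar):.3f}")
print(f"mean squared error:          {mse:.3f}")
```

The two values should match up to the grid resolution. In a network the same mechanism operates per input, which is how the predicted variance comes to track the local size of the errors.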

### 55.7.2 Monte Carlo Dropout

Monte Carlo dropout approximates Bayesian inference in neural networks. By applying dropout at test time and running multiple forward passes, we obtain a distribution of predictions whose variance captures model uncertainty.

```python
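import numpy as np
import torch

def mc_dropout_predict(model, X, n_passes=100):
    model.train()  # enable dropout (note: this also puts batch norm in training mode)
    preds = []
    with torch.no_grad():
        for _ in range(n_passes):
            preds.append(model(X).numpy())
    model.eval()  # restore inference mode
    return np.array(preds)

# Assume `model` contains nn.Dropout layers, was trained with them,
# and returns a single tensor of predictions
pred_samples = mc_dropout_predict(model, X_test_tensor)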
mean = pred_samples.mean(axis=0)
std = pred_samples.std(axis=0)
lower = mean - 1.645 * std
upper = mean + 1.645 * std
```

**Explanation:**  
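Dropout is typically used only during training. Here we keep it on during inference, generating different predictions each pass. The variance across passes reflects the model's (epistemic) uncertainty only; it does not include observation noise, so the resulting intervals can be too narrow unless a noise term is added.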
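
To make the mechanism concrete without a trained model, the sketch below runs a tiny one‑hidden‑layer network in NumPy with random placeholder weights. It only demonstrates that dropout masks at inference turn one deterministic output into a distribution of outputs; the weights, shapes, and dropout rate are all assumptions, and the numbers carry no forecasting meaning.

```python
import numpy as np

rng = np.random.default_rng(9)

# Placeholder weights for a one-hidden-layer network (NOT trained)
W1 = rng.normal(size=(4, 32)) / np.sqrt(4)
b1 = np.zeros(32)
W2 = rng.normal(size=(32, 1)) / np.sqrt(32)
b2 = np.zeros(1)

def forward(x, drop_prob=0.0):
    """Forward pass; drop_prob > 0 applies inverted dropout to the hidden layer."""
    h = np.maximum(x @ W1 + b1, 0.0)
    if drop_prob > 0:
        mask = rng.random(h.shape) >= drop_prob
        h = h * mask / (1.0 - drop_prob)
    return (h @ W2 + b2).ravel()

x = rng.normal(size=(5, 4))  # five hypothetical inputs

deterministic = forward(x)   # dropout off: the same output on every call
passes = np.stack([forward(x, drop_prob=0.2) for _ in range(200)])

mc_mean = passes.mean(axis=0)
mc_std = passes.std(axis=0)  # spread across stochastic passes
print("deterministic:", np.round(deterministic, 3))
print("MC mean:      ", np.round(mc_mean, 3))
print("MC std:       ", np.round(mc_std, 3))
```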

---

## 55.8 Evaluating Probabilistic Forecasts

Evaluating probabilistic forecasts requires metrics that assess both calibration (the intervals contain the true value the right proportion of the time) and sharpness (the intervals are as narrow as possible while being well‑calibrated).

### 55.8.1 Coverage and Interval Width

The simplest check: for a nominal coverage `1-α`, compute the empirical coverage on a test set:

```python
coverage = np.mean((y_test >= lower) & (y_test <= upper))
```

If coverage is much lower than `1-α`, the model is overconfident; if much higher, it's underconfident.

Also compute the average interval width:

```python
width = np.mean(upper - lower)
```

Calibration alone is not enough: among models with approximately correct coverage, prefer the one with the narrowest intervals.
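
Coverage and width are best read together. The sketch below wraps both in a small helper (`interval_report`, a hypothetical name) and applies it to three candidate 90% intervals on synthetic standard‑normal outcomes.

```python
import numpy as np

def interval_report(y_true, lower, upper):
    """Return (empirical coverage, mean interval width)."""
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean(), np.mean(upper - lower)

rng = np.random.default_rng(10)
y_true = rng.normal(size=5000)  # synthetic outcomes, standard normal
centre = np.zeros_like(y_true)  # every forecast is centred on zero

# Three candidate 90% intervals with different half-widths
for label, half_width in [("too narrow", 1.0), ("about right", 1.645), ("too wide", 3.0)]:
    cov, width = interval_report(y_true, centre - half_width, centre + half_width)
    print(f"{label:12s} coverage {cov:.3f}, mean width {width:.2f}")
```

The widest interval has the highest coverage but is the least informative, which is why coverage should never be reported without width.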

### 55.8.2 Continuous Ranked Probability Score (CRPS)

CRPS measures the difference between the predicted cumulative distribution function (CDF) and the empirical CDF of the observation. It is a proper scoring rule that rewards both calibration and sharpness. For a forecast that gives a distribution, CRPS is:

`CRPS(F, y) = ∫ (F(z) - 𝟙(z ≥ y))² dz`

Lower CRPS is better.
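
When the forecast is available only as samples, the integral is commonly estimated with the identity `CRPS = E|X - y| - ½ E|X - X'|`, where `X` and `X'` are independent draws from the forecast. The sketch below implements this for a single observation and compares three synthetic forecasts; `crps_from_samples` is a hypothetical helper written for illustration.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS for one observation: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(11)
y_obs = 0.0
sharp = rng.normal(loc=0.0, scale=0.5, size=1000)   # centred and narrow
vague = rng.normal(loc=0.0, scale=3.0, size=1000)   # centred but wide
biased = rng.normal(loc=2.0, scale=0.5, size=1000)  # narrow but off target

for name, s in [("sharp", sharp), ("vague", vague), ("biased", biased)]:
    print(f"{name:7s} CRPS = {crps_from_samples(s, y_obs):.3f}")
```

The sharp, well‑centred forecast scores best, and both excess width and bias are penalised. For a forecast that collapses to a single value, the score reduces to the absolute error.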

If the forecast is a sample of predictions (e.g., from MCMC or MC dropout), we can compute CRPS using the `properscoring` library.

```python
import properscoring as ps

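# pred_samples shape: (n_samples, n_test); crps_ensemble expects the
# ensemble members on the last axis, so transpose to (n_test, n_samples)
crps = ps.crps_ensemble(y_test, pred_samples.T)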
mean_crps = crps.mean()
print(f"Mean CRPS: {mean_crps:.4f}")
```

### 55.8.3 Pinball Loss (Quantile Loss)

For quantile forecasts, the pinball loss at quantile `τ` is:

`L_τ(y, ŷ_τ) = (τ - 1)(y - ŷ_τ) if y < ŷ_τ else τ (y - ŷ_τ)`

Average pinball loss over all quantiles gives a measure of quantile forecast performance.

```python
def pinball_loss(y_true, y_pred, tau):
    error = y_true - y_pred
    return np.mean(np.where(error >= 0, tau * error, (tau - 1) * error))

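# For multiple quantiles: map each level to its own prediction array
# (here the quantile-regression outputs from Section 55.3)
quantile_preds = {0.1: y_lower, 0.5: y_median, 0.9: y_upper}
losses = {tau: pinball_loss(y_test, y_pred_tau, tau)
          for tau, y_pred_tau in quantile_preds.items()}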
```

---

## 55.9 Communicating Probabilistic Predictions

Presenting probabilistic forecasts to stakeholders (traders, managers) requires care. Instead of showing full distributions, we often summarise:

- **Fan charts**: Show prediction intervals with varying transparency over time.
- **Tables**: Provide median and 90% interval for key decisions.
- **Scenario analysis**: "There is a 10% chance that the price will drop below 1000."

For the NEPSE system, you might produce a daily report with:

```
Stock NABIL:
- Tomorrow's expected return: +0.5%
- 80% prediction interval: [-1.2%, +2.3%]
- Probability of positive return: 62%
```
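
A report in this style can be generated directly from predictive samples. The following is a minimal sketch: `summarise_forecast` is a hypothetical helper, and the samples are synthetic rather than the output of a fitted NEPSE model.

```python
import numpy as np

def summarise_forecast(name, return_samples):
    """Format a short text summary from predictive samples of next-day return (in %)."""
    samples = np.asarray(return_samples, dtype=float)
    lo, hi = np.quantile(samples, [0.10, 0.90])
    lines = [
        f"Stock {name}:",
        f"- Tomorrow's expected return: {samples.mean():+.1f}%",
        f"- 80% prediction interval: [{lo:+.1f}%, {hi:+.1f}%]",
        f"- Probability of positive return: {np.mean(samples > 0):.0%}",
    ]
    return "\n".join(lines)

# Hypothetical predictive samples; in practice these would come from a fitted model
rng = np.random.default_rng(12)
samples = rng.normal(loc=0.5, scale=1.4, size=5000)
print(summarise_forecast("NABIL", samples))
```

Computing every line from the same set of samples keeps the summary internally consistent, for example the stated probability of a positive return always agrees with the reported interval.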

---

## 55.10 Integrating Uncertainty into Trading Decisions

Probabilistic forecasts can directly inform trading:

- **Position sizing**: Allocate more capital to trades with higher confidence (narrow intervals).
- **Stop‑loss setting**: Set stop‑loss at the lower bound of a high‑confidence interval.
- **Risk limits**: Do not take positions if the potential loss (e.g., 5th percentile) exceeds a threshold.
- **Portfolio optimisation**: Use the full predicted distribution to compute expected utility, not just mean.

For example, using the predicted mean `μ` and variance `σ²` from a probabilistic model, you could apply a mean‑variance criterion:

`allocation ∝ (μ - r_f) / σ²`

This allocates more to assets with higher expected return and lower uncertainty.
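
As a sketch of this rule across several stocks, the function below computes weights proportional to `(μ - r_f) / σ²`. The long‑only clipping and the normalisation to sum to one are added assumptions, the forecasts are hypothetical numbers, and correlations between assets are ignored.

```python
import numpy as np

def mean_variance_weights(mu, sigma2, risk_free=0.0):
    """Long-only weights proportional to (mu - r_f) / sigma^2, normalised to sum to 1."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    raw = np.clip((mu - risk_free) / sigma2, 0.0, None)  # no position if excess return <= 0
    total = raw.sum()
    return raw / total if total > 0 else np.zeros_like(raw)

# Hypothetical daily forecasts for three stocks (returns in decimal units)
mu = np.array([0.004, 0.004, -0.001])        # predicted mean returns
sigma2 = np.array([0.0001, 0.0004, 0.0002])  # predicted variances
weights = mean_variance_weights(mu, sigma2, risk_free=0.0002)
print(np.round(weights, 3))
```

The first two stocks have the same expected return, but the one with a quarter of the predicted variance receives four times the weight; the stock with a negative expected excess return receives none.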

---

## Chapter Summary

In this chapter, we explored probabilistic forecasting and its importance for the NEPSE stock prediction system. We covered:

- The distinction between point forecasts and probabilistic forecasts.
- Simple residual‑based prediction intervals and their limitations.
- Quantile regression using gradient boosting to model conditional quantiles.
- Conformal prediction, a model‑agnostic method with a finite‑sample coverage guarantee under exchangeability.
- Bayesian methods, including Bayesian linear regression and Gaussian processes.
- Ensemble‑based uncertainty from random forests and quantile regression forests.
- Neural network approaches: mean‑variance networks and Monte Carlo dropout.
- Evaluation metrics: coverage, interval width, CRPS, pinball loss.
- Communicating probabilistic forecasts effectively.
- Using uncertainty estimates to inform trading decisions and risk management.

Probabilistic forecasting transforms a prediction from a single guess into a rich description of possible futures. For the NEPSE system, this means traders can better understand the risks they are taking and make more informed decisions. In the next chapter, we will discuss **Multi‑Variate Time‑Series**, where we model multiple related time series simultaneously.

