Separating what you can't know from what you just haven't seen yet.
A research codebase introducing Variance Landscape Mapping (VLM) — a family of exploration strategies that exploit heteroskedastic reward signals: the empirical fact that reward noise varies systematically across the state-action space.
- Motivation
- The Core Idea
- Uncertainty Decomposition
- Novel Algorithms
- Baselines
- Environments
- Estimators
- Real-World Dataset: Open Bandit Pipeline
- Project Structure
- Installation
- Quick Start
- Configuration
- Evaluation Protocol
- Results Layout
- Key References
Standard RL exploration algorithms — ε-greedy, UCB, Thompson Sampling — treat reward noise as a nuisance to average away. But not all noise is equal:
- Aleatoric uncertainty: baked into the environment. No matter how many times you visit state $s$ and take action $a$, the reward will still vary. This is irreducible.
- Epistemic uncertainty: you simply haven't visited $(s,a)$ enough yet. More data collapses this.
Conflating the two leads to concrete failure modes:
| Failure Mode | Cause | Effect |
|---|---|---|
| Over-exploration of noisy states | High variance mistaken for uncertainty | Agent burns budget re-sampling chaotic transitions |
| Premature exploitation | Low-variance states look "known" early | Q-estimate trusted before it has converged |
| Slow convergence | Exploration bonus never decays correctly | Agent never specialises |
The chess analogy: opening moves have enormous variance in eventual outcomes (the game branches explosively), but endgame positions with one piece remaining are nearly deterministic. An ideal agent should know this and behave accordingly.
Standard RL reward model:

$$r(s,a) = \mu(s,a) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where the noise scale $\sigma$ is a single global constant (homoskedastic).

VLM model:

$$r(s,a) = \mu(s,a) + \epsilon(s,a), \qquad \epsilon(s,a) \sim \mathcal{N}(0, \sigma^2(s,a))$$

where the noise scale $\sigma^2(s,a)$ varies across the state-action space (heteroskedastic).
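As a sanity check on the heteroskedastic model, a tiny standalone simulation (illustrative only — the state labels and noise scales below are invented for the example, not taken from the codebase) shows that per-state noise levels are recoverable from samples:

```python
import random
import statistics

# Hypothetical per-state noise scales: state 0 is quiet, state 1 is chaotic
TRUE_SIGMA = {0: 0.1, 1: 3.0}

def sample_reward(state, mean=-1.0):
    """Draw a reward with state-dependent (heteroskedastic) Gaussian noise."""
    return random.gauss(mean, TRUE_SIGMA[state])

random.seed(0)
for s in (0, 1):
    rewards = [sample_reward(s) for _ in range(5000)]
    print(f"state {s}: empirical sigma = {statistics.stdev(rewards):.2f}")
```

With 5000 samples per state, both empirical σ values land within a few percent of the true 0.1 and 3.0 — exactly the per-state signal the variance landscape map is built to capture.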
The variance landscape map is a data structure — either a tabular dictionary of Welford accumulators or a neural two-headed network — that maintains a running estimate of $\hat{\sigma}^2(s,a)$, the aleatoric noise level, for every visited state-action pair.
Once converged, the exploration bonus is no longer inflated by irreducible noise; it reflects only the remaining epistemic gap.
The VLM exploration bonus decomposes these two terms and scales them separately:

$$b(s,a) \;=\; c_1 \sqrt{\frac{\hat{\sigma}^2(s,a)}{N(s,a)}} \;+\; \frac{c_2}{\sqrt{N(s,a)}}$$

The first term is a UCB-style confidence interval whose width is modulated by the aleatoric noise level. In high-noise states, the agent demands many more visits before the Q-estimate is trusted; in low-noise states, fewer samples are needed. The second term is a pure visit-count bonus capturing the remaining epistemic gap.

Exploitation gate: even after the bonus decays, the agent refuses to exploit greedily until

$$N(s,a) \;\ge\; k \, \hat{\sigma}^2(s,a),$$

forcing proportionally more visits to noisier state-action pairs before they can be exploited.
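A minimal sketch of the two mechanisms described above — the exact bonus form in the codebase may differ; the `c1`, `c2`, and `k` defaults follow the hyperparameter table later in this README:

```python
import math

def vlm_bonus(sigma2, n, c1=1.0, c2=0.5):
    """Illustrative VLM-style bonus: an aleatoric-scaled confidence term
    plus a pure count-based epistemic term."""
    if n == 0:
        return float("inf")                 # unvisited: always explore
    aleatoric = c1 * math.sqrt(sigma2 / n)  # wide in noisy states
    epistemic = c2 / math.sqrt(n)           # decays with visits alone
    return aleatoric + epistemic

def can_exploit(n, sigma2, k=50):
    """Exploitation gate: noisier pairs need proportionally more visits."""
    return n >= k * sigma2

# After 30 visits, a quiet pair may be exploited but a noisy one may not
print(can_exploit(n=30, sigma2=0.01))  # True  (30 >= 0.5)
print(can_exploit(n=30, sigma2=3.0))   # False (30 < 150)
```

Note how the gate alone already encodes the chess intuition: near-deterministic positions unlock quickly, chaotic ones demand a large visit budget first.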
File: agents/vlm_q_agent.py
The tabular variant. Maintains a standard Q-table alongside a `VarianceLandscapeMap` — a dictionary mapping each visited $(s,a)$ pair to a Welford accumulator.

Action selection: greedy over the bonus-augmented value,

$$a = \arg\max_{a'} \bigl[\, Q(s,a') + b(s,a') \,\bigr],$$

where $b$ is the VLM exploration bonus above.
Variance freezing: once `is_variance_converged(s, a)` returns `True` (relative change over the last `variance_window` observations is below `stability_threshold`), the aleatoric estimate is frozen. The first bonus term stops tracking incoming noise — only the epistemic (second) term continues to decay.
Q-update: standard TD(0):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr]$$
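For concreteness, one TD(0) step on a dict-backed Q-table — a sketch, not the codebase's implementation; terminal-state handling is omitted:

```python
def td0_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD(0) step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return td_error

q = {}
td0_update(q, s=0, a=1, r=-1.0, s_next=2, actions=[0, 1])
print(q[(0, 1)])  # 0.1 * (-1.0) = -0.1
```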
Key hyperparameters:
| Parameter | Default | Role |
|---|---|---|
| `c1` | 1.0 | Aleatoric bonus coefficient |
| `c2` | 0.5 | Epistemic bonus coefficient |
| `k` | 50 | Exploitation gate multiplier |
| `variance_window` | 50 | Steps over which stability is checked |
| `variance_stability_threshold` | 0.05 | Max relative change to declare convergence |
File: agents/vlm_dqn_agent.py
The deep RL variant. Runs two networks in parallel:
- Q-network — standard DQN: $\text{state} \to Q(s, a_i)$ for all $i$. Trained with Huber TD loss against a frozen target network (updated every `target_update_freq` steps).
- Variance network (`HeteroskedasticRewardNet`) — a two-headed MLP: $(\text{state}, a) \to (\hat{\mu}, \log\hat{\sigma}^2)$. Trained on every transition in the replay buffer with the Gaussian NLL:

  $$\mathcal{L} = \tfrac{1}{2}\left[\log\hat{\sigma}^2(s,a) + \frac{\bigl(r - \hat{\mu}(s,a)\bigr)^2}{\hat{\sigma}^2(s,a)}\right]$$
Action selection: as in the tabular agent, greedy over the bonus-augmented value, where the aleatoric variance inside the bonus is read from the variance network's log-variance head (exponentiated) rather than from a Welford table.
The variance network is updated every `var_update_freq` gradient steps, decoupled from the Q-network update schedule.
Architecture (defaults):
- Q-network: Linear(state_dim → 64) → ReLU → Linear(64 → 64) → ReLU → Linear(64 → n_actions)
- Var-network: shared trunk + two linear heads (mean head, log-var head)
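The Gaussian NLL objective can be written out directly. A pure-Python sketch (matching `torch.nn.functional.gaussian_nll_loss` up to its `eps` clamp and a dropped constant term) shows why the loss elicits honest variance estimates:

```python
import math

def gaussian_nll(mu, log_var, r):
    """Heteroskedastic Gaussian NLL per sample (constant term dropped):
    0.5 * (log sigma^2 + (r - mu)^2 / sigma^2)."""
    return 0.5 * (log_var + (r - mu) ** 2 / math.exp(log_var))

# For a fixed prediction error, the loss is minimised when the predicted
# variance equals the squared error — grid-search over the log-variance:
best = min((lv / 10 for lv in range(-60, 20)),
           key=lambda lv: gaussian_nll(0.0, lv, r=0.5))
print(best, math.log(0.5 ** 2))  # the minimiser sits near log(0.25) ≈ -1.39
```

Under-predicting the variance blows up the squared-error term; over-predicting it pays the $\log\hat{\sigma}^2$ penalty — so the head is pushed toward the true noise level.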
| Agent | File | Variance-Aware? | Strategy |
|---|---|---|---|
| ε-greedy Q-Learning | agents/q_learning.py | No | Decaying random exploration |
| UCB-V | agents/ucbv_agent.py | Yes, but confounded | Empirical variance in UCB bound — cannot separate aleatoric from epistemic |
| C51 | agents/distributional_q.py | Implicit | Distributional RL; models full return distribution, not reward variance directly |
| VLM-Q (ours) | agents/vlm_q_agent.py | Separated | Explicit aleatoric map + convergence gate |
| VLM-DQN (ours) | agents/vlm_dqn_agent.py | Separated | Heteroskedastic reward network + bonus |
Why UCB-V falls short: UCB-V uses the empirical sample variance of returns (not rewards) in the bonus, which conflates aleatoric and epistemic components. It cannot freeze the bonus when aleatoric variance is known, and cannot apply per-state confidence gates.
Why C51 falls short: distributional RL models the distribution of discounted cumulative returns, not the per-step reward noise. Multi-step mixing and discounting blur state-level heteroskedasticity beyond recognition.
File: envs/noisy_gridworld.py
A 10×10 Gymnasium discrete environment with three hard-coded variance zones:
| Zone | Rows | Cols | Reward mean | Reward σ | Interpretation |
|---|---|---|---|---|---|
| Low-noise path | 4–6 | 0–9 | −1.0 | 0.1 | Reliable but longer route |
| High-noise shortcut | 1–3 | 3–7 | −0.5 | 3.0 | Faster expected return, high variance |
| Default region | 0–9 | 0–9 | −1.0 | 1.0 | Moderate noise background |
The ground-truth σ map (the `_reward_std` grid) is stored on the environment, enabling direct MSE evaluation of learned variance estimates.
Goal: reach cell (9,9). Episode terminates on goal or after `max_steps=200` steps.
File: envs/heteroskedastic_bandit.py
A K-arm bandit in which each arm has its own reward mean and noise scale (per-arm σ) — the minimal synthetic testbed for validating online variance recovery.
File: envs/noisy_frozen_lake.py
FrozenLake-v1 augmented with state-dependent reward noise — provides a sparse-reward transfer test to complement the dense-reward gridworld.
File: envs/obp_bandit.py
A Gymnasium bandit backed by real click logs from the Open Bandit Dataset (Saito et al., NeurIPS 2021), collected by ZOZO Technologies on a Japanese fashion e-commerce platform. Calling step(action) replays with replacement from historical click records for that arm — no simulation, real heteroskedasticity.
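A toy version of the replay mechanism (illustrative only — the real `envs/obp_bandit.py` interface and record format may differ):

```python
import random
from collections import defaultdict

class ReplayBandit:
    """Replay-with-replacement over logged (arm, click) records."""
    def __init__(self, logs, seed=0):
        self.rewards_by_arm = defaultdict(list)
        for arm, click in logs:
            self.rewards_by_arm[arm].append(click)
        self.rng = random.Random(seed)

    def step(self, action):
        # Resample a real historical click for this arm — no simulation
        return self.rng.choice(self.rewards_by_arm[action])

# Two arms with very different empirical CTRs
logs = [(0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (1, 0)]
env = ReplayBandit(logs)
print(sum(env.step(0) for _ in range(1000)) / 1000)  # ≈ arm 0's CTR of 0.25
```

Because draws are taken from the actual log, the per-arm reward distribution — and hence its Bernoulli variance — is inherited from production data rather than modelled.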
| Property | Value |
|---|---|
| Arms | 34 (men campaign), 46 (women), 80 (all) |
| Reward | Binary click: 0 or 1 |
| Logging policy | `random` (unbiased) or `bts` (Thompson Sampling) |
| Campaigns | `men`, `women`, `all` |
| True variance | Bernoulli: $p(1-p)$ |
Ground-truth per-arm CTR and Bernoulli variance are computed from the full dataset and exposed as `ground_truth_means` and `ground_truth_variances` properties for quantitative evaluation.
`WelfordAccumulator` — `estimators/welford.py`
Numerically stable, single-pass online algorithm for mean and variance. Each update is O(1) time and O(1) memory. Supports merging two accumulators via Chan's parallel formula, enabling future distributed extension.
Update rule: for sample $x_n$,

$$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}, \qquad M_{2,n} = M_{2,n-1} + (x_n - \mu_{n-1})(x_n - \mu_n), \qquad \hat{\sigma}^2_n = \frac{M_{2,n}}{n-1}$$
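A self-contained sketch of the accumulator — the interface is inferred from the description above, and the real `estimators/welford.py` may differ in details:

```python
class WelfordAccumulator:
    """Single-pass mean/variance (Welford 1962) with Chan-style merge."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the *updated* mean

    @property
    def variance(self):
        """Unbiased sample variance (0.0 until two samples are seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def merge(self, other):
        """Chan et al. (1979) parallel combination of two accumulators."""
        merged = WelfordAccumulator()
        merged.n = self.n + other.n
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = self.m2 + other.m2 + delta ** 2 * self.n * other.n / merged.n
        return merged

acc = WelfordAccumulator()
for x in [1.0, 2.0, 3.0, 4.0]:
    acc.update(x)
print(acc.mean, acc.variance)  # 2.5 and 5/3 ≈ 1.667
```

The merge is exact: combining two accumulators yields the same mean and variance as streaming all samples through a single one.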
`VarianceLandscapeMap` — `estimators/variance_map.py`

A `defaultdict` of `WelfordAccumulator`s keyed by `(state, action)`. Provides:

- `update(s, a, r)` — feed a new reward observation
- `get_variance(s, a)` — current $\hat{\sigma}^2$ estimate
- `is_variance_converged(s, a)` — `True` once the relative change in $\hat{\sigma}^2$ over the last `variance_window` observations falls below `stability_threshold`
- `get_aleatoric_uncertainty(s, a)` — returns the frozen estimate if converged, or the live estimate otherwise
- `snapshot()` — exports the full variance map as a 2-D array for heatmap visualisation
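The convergence test can be sketched as a sliding-window relative-change check — an illustration of the described behaviour, not the actual `is_variance_converged` logic, which may compare the window differently:

```python
from collections import deque

def make_convergence_checker(window=50, threshold=0.05):
    """Return a closure that flags convergence once the relative change in
    the variance estimate across a sliding window drops below threshold."""
    history = deque(maxlen=window)
    def check(var_estimate):
        history.append(var_estimate)
        if len(history) < window:
            return False                    # not enough evidence yet
        oldest, newest = history[0], history[-1]
        rel_change = abs(newest - oldest) / max(abs(oldest), 1e-12)
        return rel_change < threshold
    return check

check = make_convergence_checker(window=3, threshold=0.05)
print([check(v) for v in (1.0, 0.5, 0.51, 0.52, 0.52)])
# → [False, False, False, True, True]
```

Once the checker fires, the map can freeze the aleatoric estimate, which is what stops the exploration bonus from chasing irreducible noise.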
`HeteroskedasticRewardNet` — `estimators/hetero_net.py`

PyTorch `nn.Module` with a shared trunk and two linear output heads:

- Mean head: predicts $\hat{\mu}(s,a)$
- Log-variance head: predicts $\log\hat{\sigma}^2(s,a)$ (unbounded; exponentiated before use)

Trained end-to-end with `gaussian_nll_loss`, which is the heteroskedastic regression loss above.
In addition to synthetic environments, the codebase validates against real production data from the Open Bandit Dataset (OBD) (Saito et al., NeurIPS 2021 Datasets & Benchmarks Track).
OBD is the only publicly available logged bandit dataset from a real recommendation system with known logging propensities. The random policy logs make sampling unbiased, allowing the Welford estimator to recover true per-arm Bernoulli variances — a ground-truth check impossible on most real datasets.
The men campaign (34 fashion items, 10 000 interaction records) has per-arm CTR ranging from 0% to 1.5%, giving Bernoulli variances $p(1-p)$ in roughly $[0, 0.015]$ — a genuinely heteroskedastic but ultra-low-noise regime.
| Command | What it does |
|---|---|
| `python main.py obp` | Validate Welford variance recovery on OBD; reports Spearman ρ and variance MSE |
| `python main.py obp-agents` | Train Q-Learning, UCB-V, and VLM-Q on OBD; generate all three plots |
| File | Description |
|---|---|
| `results/obp_bandit/obp_cumulative_regret.png` | Cumulative regret curves with 95% CI across 20 seeds |
| `results/obp_bandit/obp_variance_recovery.png` | Ground-truth vs VLM-Q estimated per-arm Bernoulli variance |
| `results/obp_bandit/obp_spearman_convergence.png` | Spearman ρ of arm-variance ranking over 10 000 interaction steps |
- Spearman ρ = 0.70 by step 10 000 — VLM-Q correctly ranks 34 real fashion items by aleatoric risk.
- ρ crosses the 0.5 pass threshold around step 4 500, showing the Welford map converges well within a typical deployment window.
- UCB-V and VLM-Q incur higher regret than ε-greedy in this ultra-sparse binary regime (overall CTR < 0.5%), because exploration bonuses keep revisiting low-CTR arms that look uncertain. This is an expected limitation when aleatoric variance is uniformly near-zero and is documented as a boundary condition for VLM.
```
variance_landscape_rl/
├── envs/                          # Custom Gymnasium environments
│   ├── noisy_gridworld.py         # 10×10 grid, three σ zones, ground-truth map
│   ├── heteroskedastic_bandit.py  # K-arm bandit with per-arm σ (synthetic)
│   ├── noisy_frozen_lake.py       # FrozenLake + state-dependent noise
│   └── obp_bandit.py              # Real-world: Open Bandit Dataset (80 fashion items, binary clicks)
├── estimators/                    # Online variance estimation
│   ├── welford.py                 # Welford O(1) accumulator
│   ├── variance_map.py            # Tabular (s, a) → σ̂² map + convergence detection
│   └── hetero_net.py              # PyTorch μ + log-σ² two-headed network
├── agents/                        # Learning algorithms
│   ├── common.py                  # ReplayBuffer, EpsilonSchedule, BaseAgent
│   ├── q_learning.py              # Baseline: ε-greedy Q-learning
│   ├── ucbv_agent.py              # Baseline: UCB-V (Audibert 2007)
│   ├── distributional_q.py        # Baseline: C51 (Bellemare 2017)
│   ├── vlm_q_agent.py             # Novel: VLM-Q — tabular heteroskedastic exploration
│   └── vlm_dqn_agent.py           # Novel: VLM-DQN — neural heteroskedastic exploration
├── experiments/                   # Experiment runners
│   ├── bandit_experiment.py       # Phase 1: validate Welford on synthetic bandit
│   ├── obp_experiment.py          # Phase 1b: validate Welford on real OBP click data
│   ├── obp_agent_experiment.py    # Phase 1c: train all agents on OBP + generate plots
│   ├── tabular_experiment.py      # Phase 2: Q-Learning vs UCB-V vs VLM-Q (20 seeds)
│   ├── neural_experiment.py       # Phase 3: DQN vs C51 vs VLM-DQN (20 seeds)
│   └── statistical_tests.py       # Friedman, Nemenyi post-hoc, Welch's t + Cohen's d
├── visualization/                 # Plot generation
│   ├── variance_heatmap.py        # Ground truth vs estimated σ² heatmaps
│   ├── learning_curves.py         # Return curves with 95% CI bands
│   └── obp_plots.py               # OBP-specific: regret curves, variance recovery, Spearman ρ
├── configs/                       # YAML experiment configurations
│   ├── bandit.yaml                # Synthetic bandit arms and noise levels
│   ├── obp_bandit.yaml            # OBP dataset + agent hyperparameters
│   ├── tabular.yaml               # Gridworld zones, agent hyperparameters (20 seeds × 5000 ep)
│   └── neural.yaml                # Neural agent hyperparameters
├── results/                       # Auto-generated outputs
│   ├── bandit/
│   ├── obp_bandit/
│   ├── tabular/
│   └── neural/
├── main.py                        # CLI entry point
├── requirements.txt
└── README.md
```
```bash
# Clone
git clone <repo-url>
cd variance_landscape_rl

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0, Gymnasium ≥ 0.29.
```bash
# Full pipeline: bandit validation → tabular → neural → stats → plots
python main.py all

# Individual phases
python main.py bandit     # Phase 1: validate Welford on synthetic bandit
python main.py tabular    # Phase 2: tabular agent comparison
python main.py neural     # Phase 3: neural agent comparison

# Real-world OBP experiments
python main.py obp        # Validate Welford on real OBP click data
python main.py obp-agents # Train all agents on OBP, generate 3 plots
python main.py obp-agents --config configs/obp_bandit.yaml  # Custom config

# Statistical analysis on saved results
python main.py stats results/tabular/tabular_results.json
python main.py stats results/neural/neural_results.json

# Visualisation
python main.py plot results/tabular/learning_curves.json
python main.py plot results/neural/learning_curves.json

# Custom config
python main.py tabular --config configs/tabular.yaml
```

All experiments are driven by YAML files in `configs/`. The key knobs:
```yaml
# configs/tabular.yaml (excerpt)
environment:
  name: NoisyGridworld
  grid_size: 10
  zones:
    - name: low_noise_path
      rows: [4, 6]
      cols: [0, 10]
      reward_mean: -1.0
      reward_std: 0.1
    - name: high_noise_shortcut
      rows: [1, 3]
      cols: [3, 7]
      reward_mean: -0.5
      reward_std: 3.0
agents:
  vlm_q:
    c1: 1.0                # aleatoric bonus coefficient
    c2: 0.5                # epistemic bonus coefficient
    k: 50                  # exploitation gate multiplier
    variance_window: 50    # stability detection window
    variance_stability_threshold: 0.05
experiment:
  n_episodes: 5000
  n_seeds: 20
  checkpoint_steps: [100, 500, 2000, 5000]
```

```yaml
# configs/obp_bandit.yaml (excerpt)
bandit:
  behavior_policy: "random"  # unbiased logging policy
  campaign: "men"            # 34 arms — denser per-arm coverage
  n_steps: 10000
  n_seeds: 20
agents:
  vlm_q:
    gamma: 0.0               # bandit: no future state
    c1: 1.0
    c2: 0.5
    k: 20.0
    variance_window: 30
    variance_stability_threshold: 0.02
experiment:
  log_interval: 200
  output_dir: "results/obp_bandit"
```

To sweep hyperparameters, copy the YAML, modify it, and pass `--config your_config.yaml`.
- Seeds: 20 independent seeds per agent per environment.
- Checkpoints: Q-tables and metrics saved at episodes 100, 500, 2000, and 5000.
- Primary metric: mean episode return over the final 100 episodes.
- Secondary metric: area under the learning curve (sample efficiency proxy).
- Variance estimation metric (synthetic): MSE between $\hat{\sigma}^2(s)$ and the ground-truth `_reward_std` grid stored on `NoisyGridworld`.
- Variance estimation metric (real-world OBP): Spearman rank correlation ρ between estimated and true per-arm Bernoulli variances — measures whether VLM-Q correctly orders arms by aleatoric risk without requiring exact magnitude recovery.
- Regret metric (OBP): cumulative pseudo-regret $\sum_t (\mu^* - \mu_{a_t})$, where $\mu^*$ is the best arm's empirical CTR.
- Statistical tests:
  - Friedman test (non-parametric k-way comparison across all agents).
  - Nemenyi post-hoc (pairwise significance with familywise error rate correction).
  - Welch's t-test + Cohen's d for each pair, reported with 95% CI.
All results are serialised to JSON under results/ for reproducible analysis.
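The pseudo-regret metric is straightforward to compute from the chosen arms' true means; a sketch with hypothetical values (not OBD figures):

```python
def cumulative_pseudo_regret(chosen_means, best_mean):
    """Cumulative pseudo-regret: sum over t of (mu* - mu_{a_t})."""
    total, curve = 0.0, []
    for mu in chosen_means:
        total += best_mean - mu
        curve.append(total)
    return curve

# Best arm CTR 0.015; the agent picks a 0.010 arm twice and the best arm once
curve = cumulative_pseudo_regret([0.010, 0.015, 0.010], best_mean=0.015)
print([round(r, 3) for r in curve])  # [0.005, 0.005, 0.01]
```

Using true means rather than sampled rewards removes Bernoulli sampling noise from the metric, which matters in the sub-1% CTR regime where raw reward curves are almost all zeros.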
```
results/
├── bandit/
│   └── bandit_results.json          # synthetic arm-level variance MSE over time
├── obp_bandit/
│   ├── obp_results.json             # Welford validation: Spearman ρ + var MSE per seed
│   ├── obp_agent_results.json       # agent training: regret/reward curves + VLM-Q var snapshots
│   ├── obp_cumulative_regret.png    # Figure: cumulative regret curves with 95% CI
│   ├── obp_variance_recovery.png    # Figure: ground-truth vs VLM-Q per-arm variance bars
│   └── obp_spearman_convergence.png # Figure: Spearman ρ of arm-variance ranking over time
├── tabular/
│   ├── tabular_results.json         # per-seed final returns
│   ├── learning_curves.json         # episode × agent mean/std return
│   ├── statistical_analysis.json    # Friedman p-value, Nemenyi CD, Welch pairs
│   └── checkpoints/
│       ├── q_learning/
│       │   └── seed{i}_ep{n}.json   # Q-table snapshot
│       ├── ucbv/
│       └── vlm_q/
└── neural/
    ├── neural_results.json
    ├── learning_curves.json
    └── statistical_analysis.json
```
- Welford, B.P. (1962). Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3), 419–420.
- Chan, T.F., Golub, G.H., & LeVeque, R.J. (1979). Updating formulae and a pairwise algorithm for computing sample variances. COMPSTAT 1979.
- Audibert, J.Y., Munos, R., & Szepesvári, C. (2007). Tuning Bandit Algorithms in Stochastic Environments. ALT 2007. (UCB-V)
- Bellemare, M.G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML 2017. (C51)
- Kendall, A. & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS 2017. (Aleatoric vs epistemic uncertainty decomposition)
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. (DQN)
- Saito, Y. et al. (2021). Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. NeurIPS 2021 Datasets and Benchmarks Track. (OBD — real-world click data used in `envs/obp_bandit.py`)