Separating what you can't know from what you just haven't seen yet.
A research codebase introducing Variance Landscape Mapping (VLM) — a family of exploration strategies that exploit heteroskedastic reward signals: the empirical fact that reward noise varies systematically across the state-action space.
- Motivation
- The Core Idea
- Uncertainty Decomposition
- Novel Algorithms
- Baselines
- Environments
- Estimators
- Real-World Dataset: Open Bandit Pipeline
- Project Structure
- Installation
- Quick Start
- Configuration
- Evaluation Protocol
- Results Layout
- Key References
Standard RL exploration algorithms — ε-greedy, UCB, Thompson Sampling — treat reward noise as a nuisance to average away. But not all noise is equal:
- Aleatoric uncertainty: baked into the environment. No matter how many times you visit state $s$ and take action $a$, the reward will still vary. This is irreducible.
- Epistemic uncertainty: you simply haven't visited $(s,a)$ enough yet. More data collapses this.
Conflating the two leads to concrete failure modes:
| Failure Mode | Cause | Effect |
|---|---|---|
| Over-exploration of noisy states | High variance mistaken for uncertainty | Agent burns budget re-sampling chaotic transitions |
| Premature exploitation | Low-variance states look "known" early | Q-estimate trusted before it has converged |
| Slow convergence | Exploration bonus never decays correctly | Agent never specialises |
The chess analogy: opening moves have enormous variance in eventual outcomes (the game branches explosively), but endgame positions with one piece remaining are nearly deterministic. An ideal agent should know this and behave accordingly.
Standard RL reward model:

$$r(s,a) = \mu(s,a) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where the noise scale $\sigma$ is a single global constant (homoskedastic).

VLM model:

$$r(s,a) = \mu(s,a) + \epsilon(s,a), \qquad \epsilon(s,a) \sim \mathcal{N}(0, \sigma^2(s,a))$$

where the noise scale $\sigma^2(s,a)$ varies across the state-action space (heteroskedastic).
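As a sanity check on the heteroskedastic model, a tiny standalone simulation (illustrative only — the state labels and noise scales below are invented for the example, not taken from the codebase) shows that per-state noise levels are recoverable from samples:

```python
import random
import statistics

# Hypothetical per-state noise scales: state 0 is quiet, state 1 is chaotic
TRUE_SIGMA = {0: 0.1, 1: 3.0}

def sample_reward(state, mean=-1.0):
    """Draw a reward with state-dependent (heteroskedastic) Gaussian noise."""
    return random.gauss(mean, TRUE_SIGMA[state])

random.seed(0)
for s in (0, 1):
    rewards = [sample_reward(s) for _ in range(5000)]
    print(f"state {s}: empirical sigma = {statistics.stdev(rewards):.2f}")
```

With 5000 samples per state, both empirical σ values land within a few percent of the true 0.1 and 3.0 — exactly the per-state signal the variance landscape map is built to capture.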
The variance landscape map is a data structure — either a tabular dictionary of Welford accumulators or a neural two-headed network — that maintains a running estimate of $\hat{\sigma}^2(s,a)$, the aleatoric noise level, for every visited state-action pair.
Once converged, the exploration bonus is no longer inflated by irreducible noise; it reflects only the remaining epistemic gap.
The VLM exploration bonus decomposes these two terms and scales them separately:

$$b(s,a) \;=\; c_1 \sqrt{\frac{\hat{\sigma}^2(s,a)}{N(s,a)}} \;+\; \frac{c_2}{\sqrt{N(s,a)}}$$

The first term is a UCB-style confidence interval whose width is modulated by the aleatoric noise level. In high-noise states, the agent demands many more visits before the Q-estimate is trusted; in low-noise states, fewer samples are needed. The second term is a pure visit-count bonus capturing the remaining epistemic gap.

Exploitation gate: even after the bonus decays, the agent refuses to exploit greedily until

$$N(s,a) \;\ge\; k \, \hat{\sigma}^2(s,a),$$

forcing proportionally more visits to noisier state-action pairs before they can be exploited.
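A minimal sketch of the two mechanisms described above — the exact bonus form in the codebase may differ; the `c1`, `c2`, and `k` defaults follow the hyperparameter table later in this README:

```python
import math

def vlm_bonus(sigma2, n, c1=1.0, c2=0.5):
    """Illustrative VLM-style bonus: an aleatoric-scaled confidence term
    plus a pure count-based epistemic term."""
    if n == 0:
        return float("inf")                 # unvisited: always explore
    aleatoric = c1 * math.sqrt(sigma2 / n)  # wide in noisy states
    epistemic = c2 / math.sqrt(n)           # decays with visits alone
    return aleatoric + epistemic

def can_exploit(n, sigma2, k=50):
    """Exploitation gate: noisier pairs need proportionally more visits."""
    return n >= k * sigma2

# After 30 visits, a quiet pair may be exploited but a noisy one may not
print(can_exploit(n=30, sigma2=0.01))  # True  (30 >= 0.5)
print(can_exploit(n=30, sigma2=3.0))   # False (30 < 150)
```

Note how the gate alone already encodes the chess intuition: near-deterministic positions unlock quickly, chaotic ones demand a large visit budget first.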
File: agents/vlm_q_agent.py
The tabular variant. Maintains a standard Q-table alongside a `VarianceLandscapeMap` — a dictionary mapping each visited $(s,a)$ pair to a Welford accumulator.

Action selection: greedy over the bonus-augmented value,

$$a = \arg\max_{a'} \bigl[\, Q(s,a') + b(s,a') \,\bigr],$$

where $b$ is the VLM exploration bonus above.
Variance freezing: once `is_variance_converged(s, a)` returns `True` (relative change over the last `variance_window` observations is below `stability_threshold`), the aleatoric estimate is frozen. The first bonus term stops tracking incoming noise — only the epistemic (second) term continues to decay.
Q-update: standard TD(0):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr]$$
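For concreteness, one TD(0) step on a dict-backed Q-table — a sketch, not the codebase's implementation; terminal-state handling is omitted:

```python
def td0_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD(0) step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return td_error

q = {}
td0_update(q, s=0, a=1, r=-1.0, s_next=2, actions=[0, 1])
print(q[(0, 1)])  # 0.1 * (-1.0) = -0.1
```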
Key hyperparameters:
| Parameter | Default | Role |
|---|---|---|
| `c1` | 1.0 | Aleatoric bonus coefficient |
| `c2` | 0.5 | Epistemic bonus coefficient |
| `k` | 50 | Exploitation gate multiplier |
| `variance_window` | 50 | Steps over which stability is checked |
| `variance_stability_threshold` | 0.05 | Max relative change to declare convergence |
File: agents/vlm_dqn_agent.py
The deep RL variant. Runs two networks in parallel:
- Q-network — standard DQN: $\text{state} \to Q(s, a_i)$ for all $i$. Trained with Huber TD loss against a frozen target network (updated every `target_update_freq` steps).
- Variance network (`HeteroskedasticRewardNet`) — a two-headed MLP: $(\text{state}, a) \to (\hat{\mu}, \log\hat{\sigma}^2)$. Trained on every transition in the replay buffer with the Gaussian NLL:

  $$\mathcal{L} = \tfrac{1}{2}\left[\log\hat{\sigma}^2(s,a) + \frac{\bigl(r - \hat{\mu}(s,a)\bigr)^2}{\hat{\sigma}^2(s,a)}\right]$$
Action selection: as in the tabular agent, greedy over the bonus-augmented value, where the aleatoric variance inside the bonus is read from the variance network's log-variance head (exponentiated) rather than from a Welford table.
The variance network is updated every `var_update_freq` gradient steps, decoupled from the Q-network update schedule.
Architecture (defaults):
- Q-network: Linear(state_dim → 64) → ReLU → Linear(64 → 64) → ReLU → Linear(64 → n_actions)
- Var-network: shared trunk + two linear heads (mean head, log-var head)
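The Gaussian NLL objective can be written out directly. A pure-Python sketch (matching `torch.nn.functional.gaussian_nll_loss` up to its `eps` clamp and a dropped constant term) shows why the loss elicits honest variance estimates:

```python
import math

def gaussian_nll(mu, log_var, r):
    """Heteroskedastic Gaussian NLL per sample (constant term dropped):
    0.5 * (log sigma^2 + (r - mu)^2 / sigma^2)."""
    return 0.5 * (log_var + (r - mu) ** 2 / math.exp(log_var))

# For a fixed prediction error, the loss is minimised when the predicted
# variance equals the squared error — grid-search over the log-variance:
best = min((lv / 10 for lv in range(-60, 20)),
           key=lambda lv: gaussian_nll(0.0, lv, r=0.5))
print(best, math.log(0.5 ** 2))  # the minimiser sits near log(0.25) ≈ -1.39
```

Under-predicting the variance blows up the squared-error term; over-predicting it pays the $\log\hat{\sigma}^2$ penalty — so the head is pushed toward the true noise level.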
| Agent | File | Variance-Aware? | Strategy |
|---|---|---|---|
| ε-greedy Q-Learning | agents/q_learning.py | No | Decaying random exploration |
| UCB-V | agents/ucbv_agent.py | Yes, but confounded | Empirical variance in UCB bound — cannot separate aleatoric from epistemic |
| C51 | agents/distributional_q.py | Implicit | Distributional RL; models full return distribution, not reward variance directly |
| VLM-Q (ours) | agents/vlm_q_agent.py | Separated | Explicit aleatoric map + convergence gate |
| VLM-DQN (ours) | agents/vlm_dqn_agent.py | Separated | Heteroskedastic reward network + bonus |
Why UCB-V falls short: UCB-V uses the empirical sample variance of returns (not rewards) in the bonus, which conflates aleatoric and epistemic components. It cannot freeze the bonus when aleatoric variance is known, and cannot apply per-state confidence gates.
Why C51 falls short: distributional RL models the distribution of discounted cumulative returns, not the per-step reward noise. Multi-step mixing and discounting blur state-level heteroskedasticity beyond recognition.
File: envs/noisy_gridworld.py
A 10×10 Gymnasium discrete environment with three hard-coded variance zones:
| Zone | Rows | Cols | Reward mean | Reward σ | Interpretation |
|---|---|---|---|---|---|
| Low-noise path | 4–6 | 0–9 | −1.0 | 0.1 | Reliable but longer route |
| High-noise shortcut | 1–3 | 3–7 | −0.5 | 3.0 | Faster expected return, high variance |
| Default region | 0–9 | 0–9 | −1.0 | 1.0 | Moderate noise background |
The ground-truth σ map (the `_reward_std` grid) is stored on the environment, enabling direct MSE evaluation of learned variance estimates.
Goal: reach cell (9,9). Episode terminates on goal or after `max_steps=200` steps.
File: envs/heteroskedastic_bandit.py
A K-arm bandit in which each arm has its own reward mean and noise scale (per-arm σ) — the minimal synthetic testbed for validating online variance recovery.
File: envs/noisy_frozen_lake.py
FrozenLake-v1 augmented with state-dependent reward noise — provides a sparse-reward transfer test to complement the dense-reward gridworld.
File: envs/obp_bandit.py
A Gymnasium bandit backed by real click logs from the Open Bandit Dataset (Saito et al., NeurIPS 2021), collected by ZOZO Technologies on a Japanese fashion e-commerce platform. Calling step(action) replays with replacement from historical click records for that arm — no simulation, real heteroskedasticity.
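A toy version of the replay mechanism (illustrative only — the real `envs/obp_bandit.py` interface and record format may differ):

```python
import random
from collections import defaultdict

class ReplayBandit:
    """Replay-with-replacement over logged (arm, click) records."""
    def __init__(self, logs, seed=0):
        self.rewards_by_arm = defaultdict(list)
        for arm, click in logs:
            self.rewards_by_arm[arm].append(click)
        self.rng = random.Random(seed)

    def step(self, action):
        # Resample a real historical click for this arm — no simulation
        return self.rng.choice(self.rewards_by_arm[action])

# Two arms with very different empirical CTRs
logs = [(0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (1, 0)]
env = ReplayBandit(logs)
print(sum(env.step(0) for _ in range(1000)) / 1000)  # ≈ arm 0's CTR of 0.25
```

Because draws are taken from the actual log, the per-arm reward distribution — and hence its Bernoulli variance — is inherited from production data rather than modelled.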
| Property | Value |
|---|---|
| Arms | 34 (men campaign), 46 (women), 80 (all) |
| Reward | Binary click: 0 or 1 |
| Logging policy | `random` (unbiased) or `bts` (Thompson Sampling) |
| Campaigns | `men`, `women`, `all` |
| True variance | Bernoulli: $p(1-p)$ |
Ground-truth per-arm CTR and Bernoulli variance are computed from the full dataset and exposed as `ground_truth_means` and `ground_truth_variances` properties for quantitative evaluation.
`WelfordAccumulator` — `estimators/welford.py`
Numerically stable, single-pass online algorithm for mean and variance. Each update is O(1) time and O(1) memory. Supports merging two accumulators via Chan's parallel formula, enabling future distributed extension.
Update rule: for sample $x_n$,

$$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}, \qquad M_{2,n} = M_{2,n-1} + (x_n - \mu_{n-1})(x_n - \mu_n), \qquad \hat{\sigma}^2_n = \frac{M_{2,n}}{n-1}$$
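A self-contained sketch of the accumulator — the interface is inferred from the description above, and the real `estimators/welford.py` may differ in details:

```python
class WelfordAccumulator:
    """Single-pass mean/variance (Welford 1962) with Chan-style merge."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the *updated* mean

    @property
    def variance(self):
        """Unbiased sample variance (0.0 until two samples are seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def merge(self, other):
        """Chan et al. (1979) parallel combination of two accumulators."""
        merged = WelfordAccumulator()
        merged.n = self.n + other.n
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = self.m2 + other.m2 + delta ** 2 * self.n * other.n / merged.n
        return merged

acc = WelfordAccumulator()
for x in [1.0, 2.0, 3.0, 4.0]:
    acc.update(x)
print(acc.mean, acc.variance)  # 2.5 and 5/3 ≈ 1.667
```

The merge is exact: combining two accumulators yields the same mean and variance as streaming all samples through a single one.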
`VarianceLandscapeMap` — `estimators/variance_map.py`

A `defaultdict` of `WelfordAccumulator`s keyed by `(state, action)`. Provides:

- `update(s, a, r)` — feed a new reward observation
- `get_variance(s, a)` — current $\hat{\sigma}^2$ estimate
- `is_variance_converged(s, a)` — `True` once the relative change in $\hat{\sigma}^2$ over the last `variance_window` observations falls below `stability_threshold`
- `get_aleatoric_uncertainty(s, a)` — returns the frozen estimate if converged, or the live estimate otherwise
- `snapshot()` — exports the full variance map as a 2-D array for heatmap visualisation
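The convergence test can be sketched as a sliding-window relative-change check — an illustration of the described behaviour, not the actual `is_variance_converged` logic, which may compare the window differently:

```python
from collections import deque

def make_convergence_checker(window=50, threshold=0.05):
    """Return a closure that flags convergence once the relative change in
    the variance estimate across a sliding window drops below threshold."""
    history = deque(maxlen=window)
    def check(var_estimate):
        history.append(var_estimate)
        if len(history) < window:
            return False                    # not enough evidence yet
        oldest, newest = history[0], history[-1]
        rel_change = abs(newest - oldest) / max(abs(oldest), 1e-12)
        return rel_change < threshold
    return check

check = make_convergence_checker(window=3, threshold=0.05)
print([check(v) for v in (1.0, 0.5, 0.51, 0.52, 0.52)])
# → [False, False, False, True, True]
```

Once the checker fires, the map can freeze the aleatoric estimate, which is what stops the exploration bonus from chasing irreducible noise.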
`HeteroskedasticRewardNet` — `estimators/hetero_net.py`

PyTorch `nn.Module` with a shared trunk and two linear output heads:

- Mean head: predicts $\hat{\mu}(s,a)$
- Log-variance head: predicts $\log\hat{\sigma}^2(s,a)$ (unbounded; exponentiated before use)

Trained end-to-end with `gaussian_nll_loss`, which is the heteroskedastic regression loss above.
In addition to synthetic environments, the codebase validates against real production data from the Open Bandit Dataset (OBD) (Saito et al., NeurIPS 2021 Datasets & Benchmarks Track).
OBD is the only publicly available logged bandit dataset from a real recommendation system with known logging propensities. The random policy logs make sampling unbiased, allowing the Welford estimator to recover true per-arm Bernoulli variances — a ground-truth check impossible on most real datasets.
The men campaign (34 fashion items, 10 000 interaction records) has per-arm CTR ranging from 0% to 1.5%, giving Bernoulli variances $p(1-p)$ in roughly $[0, 0.015]$ — a genuinely heteroskedastic but ultra-low-noise regime.
| Command | What it does |
|---|---|
| `python main.py obp` | Validate Welford variance recovery on OBD; reports Spearman ρ and variance MSE |
| `python main.py obp-agents` | Train Q-Learning, UCB-V, and VLM-Q on OBD; generate all three plots |
| File | Description |
|---|---|
| `results/obp_bandit/obp_cumulative_regret.png` | Cumulative regret curves with 95% CI across 20 seeds |
| `results/obp_bandit/obp_variance_recovery.png` | Ground-truth vs VLM-Q estimated per-arm Bernoulli variance |
| `results/obp_bandit/obp_spearman_convergence.png` | Spearman ρ of arm-variance ranking over 10 000 interaction steps |
- Spearman ρ = 0.70 by step 10 000 — VLM-Q correctly ranks 34 real fashion items by aleatoric risk.
- ρ crosses the 0.5 pass threshold around step 4 500, showing the Welford map converges well within a typical deployment window.
- UCB-V and VLM-Q incur higher regret than ε-greedy in this ultra-sparse binary regime (overall CTR < 0.5%), because exploration bonuses keep revisiting low-CTR arms that look uncertain. This is an expected limitation when aleatoric variance is uniformly near-zero and is documented as a boundary condition for VLM.
```
variance_landscape_rl/
├── envs/                          # Custom Gymnasium environments
│   ├── noisy_gridworld.py         # 10×10 grid, three σ zones, ground-truth map
│   ├── heteroskedastic_bandit.py  # K-arm bandit with per-arm σ (synthetic)
│   ├── noisy_frozen_lake.py       # FrozenLake + state-dependent noise
│   └── obp_bandit.py              # Real-world: Open Bandit Dataset (80 fashion items, binary clicks)
├── estimators/                    # Online variance estimation
│   ├── welford.py                 # Welford O(1) accumulator
│   ├── variance_map.py            # Tabular (s, a) → σ̂² map + convergence detection
│   └── hetero_net.py              # PyTorch μ + log-σ² two-headed network
├── agents/                        # Learning algorithms
│   ├── common.py                  # ReplayBuffer, EpsilonSchedule, BaseAgent
│   ├── q_learning.py              # Baseline: ε-greedy Q-learning
│   ├── ucbv_agent.py              # Baseline: UCB-V (Audibert 2007)
│   ├── distributional_q.py        # Baseline: C51 (Bellemare 2017)
│   ├── vlm_q_agent.py             # Novel: VLM-Q — tabular heteroskedastic exploration
│   └── vlm_dqn_agent.py           # Novel: VLM-DQN — neural heteroskedastic exploration
├── experiments/                   # Experiment runners
│   ├── bandit_experiment.py       # Phase 1: validate Welford on synthetic bandit
│   ├── obp_experiment.py          # Phase 1b: validate Welford on real OBP click data
│   ├── obp_agent_experiment.py    # Phase 1c: train all agents on OBP + generate plots
│   ├── tabular_experiment.py      # Phase 2: Q-Learning vs UCB-V vs VLM-Q (20 seeds)
│   ├── neural_experiment.py       # Phase 3: DQN vs C51 vs VLM-DQN (20 seeds)
│   └── statistical_tests.py       # Friedman, Nemenyi post-hoc, Welch's t + Cohen's d
├── visualization/                 # Plot generation
│   ├── variance_heatmap.py        # Ground truth vs estimated σ² heatmaps
│   ├── learning_curves.py         # Return curves with 95% CI bands
│   └── obp_plots.py               # OBP-specific: regret curves, variance recovery, Spearman ρ
├── configs/                       # YAML experiment configurations
│   ├── bandit.yaml                # Synthetic bandit arms and noise levels
│   ├── obp_bandit.yaml            # OBP dataset + agent hyperparameters
│   ├── tabular.yaml               # Gridworld zones, agent hyperparameters (20 seeds × 5000 ep)
│   └── neural.yaml                # Neural agent hyperparameters
├── results/                       # Auto-generated outputs
│   ├── bandit/
│   ├── obp_bandit/
│   ├── tabular/
│   └── neural/
├── main.py                        # CLI entry point
├── requirements.txt
└── README.md
```
```bash
# Clone
git clone <repo-url>
cd variance_landscape_rl

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0, Gymnasium ≥ 0.29.
```bash
# Full pipeline: bandit validation → tabular → neural → stats → plots
python main.py all

# Individual phases
python main.py bandit     # Phase 1: validate Welford on synthetic bandit
python main.py tabular    # Phase 2: tabular agent comparison
python main.py neural     # Phase 3: neural agent comparison

# Real-world OBP experiments
python main.py obp        # Validate Welford on real OBP click data
python main.py obp-agents # Train all agents on OBP, generate 3 plots
python main.py obp-agents --config configs/obp_bandit.yaml  # Custom config

# Statistical analysis on saved results
python main.py stats results/tabular/tabular_results.json
python main.py stats results/neural/neural_results.json

# Visualisation
python main.py plot results/tabular/learning_curves.json
python main.py plot results/neural/learning_curves.json

# Custom config
python main.py tabular --config configs/tabular.yaml
```

All experiments are driven by YAML files in `configs/`. The key knobs:
```yaml
# configs/tabular.yaml (excerpt)
environment:
  name: NoisyGridworld
  grid_size: 10
  zones:
    - name: low_noise_path
      rows: [4, 6]
      cols: [0, 10]
      reward_mean: -1.0
      reward_std: 0.1
    - name: high_noise_shortcut
      rows: [1, 3]
      cols: [3, 7]
      reward_mean: -0.5
      reward_std: 3.0
agents:
  vlm_q:
    c1: 1.0                # aleatoric bonus coefficient
    c2: 0.5                # epistemic bonus coefficient
    k: 50                  # exploitation gate multiplier
    variance_window: 50    # stability detection window
    variance_stability_threshold: 0.05
experiment:
  n_episodes: 5000
  n_seeds: 20
  checkpoint_steps: [100, 500, 2000, 5000]
```

```yaml
# configs/obp_bandit.yaml (excerpt)
bandit:
  behavior_policy: "random"  # unbiased logging policy
  campaign: "men"            # 34 arms — denser per-arm coverage
  n_steps: 10000
  n_seeds: 20
agents:
  vlm_q:
    gamma: 0.0               # bandit: no future state
    c1: 1.0
    c2: 0.5
    k: 20.0
    variance_window: 30
    variance_stability_threshold: 0.02
experiment:
  log_interval: 200
  output_dir: "results/obp_bandit"
```

To sweep hyperparameters, copy the YAML, modify it, and pass `--config your_config.yaml`.
- Seeds: 20 independent seeds per agent per environment.
- Checkpoints: Q-tables and metrics saved at episodes 100, 500, 2000, and 5000.
- Primary metric: mean episode return over the final 100 episodes.
- Secondary metric: area under the learning curve (sample efficiency proxy).
- Variance estimation metric (synthetic): MSE between $\hat{\sigma}^2(s)$ and the ground-truth `_reward_std` grid stored on `NoisyGridworld`.
- Variance estimation metric (real-world OBP): Spearman rank correlation ρ between estimated and true per-arm Bernoulli variances — measures whether VLM-Q correctly orders arms by aleatoric risk without requiring exact magnitude recovery.
- Regret metric (OBP): cumulative pseudo-regret $\sum_t (\mu^* - \mu_{a_t})$, where $\mu^*$ is the best arm's empirical CTR.
- Statistical tests:
  - Friedman test (non-parametric k-way comparison across all agents).
  - Nemenyi post-hoc (pairwise significance with familywise error rate correction).
  - Welch's t-test + Cohen's d for each pair, reported with 95% CI.
All results are serialised to JSON under results/ for reproducible analysis.
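The pseudo-regret metric is straightforward to compute from the chosen arms' true means; a sketch with hypothetical values (not OBD figures):

```python
def cumulative_pseudo_regret(chosen_means, best_mean):
    """Cumulative pseudo-regret: sum over t of (mu* - mu_{a_t})."""
    total, curve = 0.0, []
    for mu in chosen_means:
        total += best_mean - mu
        curve.append(total)
    return curve

# Best arm CTR 0.015; the agent picks a 0.010 arm twice and the best arm once
curve = cumulative_pseudo_regret([0.010, 0.015, 0.010], best_mean=0.015)
print([round(r, 3) for r in curve])  # [0.005, 0.005, 0.01]
```

Using true means rather than sampled rewards removes Bernoulli sampling noise from the metric, which matters in the sub-1% CTR regime where raw reward curves are almost all zeros.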
```
results/
├── bandit/
│   └── bandit_results.json          # synthetic arm-level variance MSE over time
├── obp_bandit/
│   ├── obp_results.json             # Welford validation: Spearman ρ + var MSE per seed
│   ├── obp_agent_results.json       # agent training: regret/reward curves + VLM-Q var snapshots
│   ├── obp_cumulative_regret.png    # Figure: cumulative regret curves with 95% CI
│   ├── obp_variance_recovery.png    # Figure: ground-truth vs VLM-Q per-arm variance bars
│   └── obp_spearman_convergence.png # Figure: Spearman ρ of arm-variance ranking over time
├── tabular/
│   ├── tabular_results.json         # per-seed final returns
│   ├── learning_curves.json         # episode × agent mean/std return
│   ├── statistical_analysis.json    # Friedman p-value, Nemenyi CD, Welch pairs
│   └── checkpoints/
│       ├── q_learning/
│       │   └── seed{i}_ep{n}.json   # Q-table snapshot
│       ├── ucbv/
│       └── vlm_q/
└── neural/
    ├── neural_results.json
    ├── learning_curves.json
    └── statistical_analysis.json
```
- Welford, B.P. (1962). Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3), 419–420.
- Chan, T.F., Golub, G.H., & LeVeque, R.J. (1979). Updating formulae and a pairwise algorithm for computing sample variances. COMPSTAT 1979.
- Audibert, J.Y., Munos, R., & Szepesvári, C. (2007). Tuning Bandit Algorithms in Stochastic Environments. ALT 2007. (UCB-V)
- Bellemare, M.G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML 2017. (C51)
- Kendall, A. & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS 2017. (Aleatoric vs epistemic uncertainty decomposition)
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. (DQN)
- Saito, Y. et al. (2021). Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. NeurIPS 2021 Datasets and Benchmarks Track. (OBD — real-world click data used in `envs/obp_bandit.py`)