Variance Landscape Mapping in Reinforcement Learning Reward Signals

Separating what you can't know from what you just haven't seen yet.

A research codebase introducing Variance Landscape Mapping (VLM) — a family of exploration strategies that exploit heteroskedastic reward signals: the empirical fact that reward noise $\sigma^2(s,a)$ varies from one corner of the state space to another, and that confusing this irreducible noise with data-limited uncertainty leads to systematic over- and under-exploration.


Table of Contents

  1. Motivation
  2. The Core Idea
  3. Uncertainty Decomposition
  4. Novel Algorithms
  5. Baselines
  6. Environments
  7. Estimators
  8. Real-World Dataset: Open Bandit Pipeline
  9. Project Structure
  10. Installation
  11. Quick Start
  12. Configuration
  13. Evaluation Protocol
  14. Results Layout
  15. Key References

Motivation

Standard RL exploration algorithms — ε-greedy, UCB, Thompson Sampling — treat reward noise as a nuisance to average away. But not all noise is equal:

  • Aleatoric uncertainty: baked into the environment. No matter how many times you visit state $s$ and take action $a$, the reward will still vary. This is irreducible.
  • Epistemic uncertainty: you simply haven't visited $(s,a)$ enough yet. More data collapses this.

Conflating the two leads to concrete failure modes:

| Failure Mode | Cause | Effect |
|---|---|---|
| Over-exploration of noisy states | High variance mistaken for uncertainty | Agent burns budget re-sampling chaotic transitions |
| Premature exploitation | Low-variance states look "known" early | Q-estimate trusted before it has converged |
| Slow convergence | Exploration bonus never decays correctly | Agent never specialises |

The chess analogy: opening moves have enormous variance in eventual outcomes (the game branches explosively), but endgame positions with one piece remaining are nearly deterministic. An ideal agent should know this and behave accordingly.


The Core Idea

Standard RL reward model: $r(s,a) = \mu(s,a) + \varepsilon$, $\quad\varepsilon \sim \mathcal{N}(0, \sigma^2)$

where $\sigma^2$ is a global constant.

VLM model: $r(s,a) \sim \mathcal{N}\!\left(\mu(s,a),\; \sigma^2(s,a)\right)$

where $\sigma^2(s,a)$ is a per-(state, action) quantity learned online.

The variance landscape map is a data structure — either a tabular dictionary of Welford accumulators or a two-headed neural network — that maintains a running estimate of $\hat{\sigma}^2_{ale}(s,a)$ and signals when that estimate has converged, meaning the aleatoric variance is "known" and can be peeled away from the exploration bonus.

Once converged, the exploration bonus is no longer inflated by irreducible noise; it reflects only the remaining epistemic gap.


Uncertainty Decomposition

$$\underbrace{U(s,a)}_{\text{total}} = \underbrace{\hat{\sigma}^2_{ale}(s,a)}_{\text{aleatoric (irreducible)}} + \underbrace{\frac{c}{n(s,a)}}_{\text{epistemic (data-limited)}}$$

The VLM exploration bonus decomposes these two terms and scales them separately:

$$\text{Bonus}(s,a) = c_1\sqrt{\frac{\hat{\sigma}^2_{ale}(s,a)\ln t}{n(s,a)}} + c_2\sqrt{\frac{\ln t}{n(s,a)}}$$

The first term is a UCB-style confidence interval whose width is modulated by the aleatoric noise level. In high-noise states, the agent demands many more visits before the Q-estimate is trusted. In low-noise states, fewer samples are needed.

Exploitation gate: even after the bonus decays, the agent refuses to exploit greedily until:

$$n(s,a) \geq \frac{k}{\hat{\sigma}^2_{ale}(s,a)}$$

forcing proportionally more visits from noisier state-action pairs before they can be exploited.
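In code, the decomposed bonus and the exploitation gate amount to a few lines. This is a minimal sketch of the two formulas above, not the repository's API; the function and argument names are illustrative:

```python
import math

def vlm_bonus(ale_var, n, t, c1=1.0, c2=0.5):
    """Decomposed exploration bonus: aleatoric-scaled UCB term + pure epistemic term."""
    return c1 * math.sqrt(ale_var * math.log(t) / n) + c2 * math.sqrt(math.log(t) / n)

def exploitation_allowed(ale_var, n, k=50.0, eps=1e-8):
    """Gate: noisier (s, a) pairs need proportionally more visits before greedy use."""
    return n >= k / (ale_var + eps)

# A high-noise pair opens its gate after a handful of visits when k is small
# relative to its variance; a near-deterministic pair needs far more data:
print(exploitation_allowed(ale_var=9.0, n=10))   # True  (threshold ~5.6 visits)
print(exploitation_allowed(ale_var=0.01, n=10))  # False (threshold ~5000 visits)
```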


Novel Algorithms

VLM-Q (Tabular)

File: agents/vlm_q_agent.py

The tabular variant. Maintains a standard Q-table alongside a VarianceLandscapeMap — a dictionary mapping $(s,a)$ pairs to independent Welford accumulators.

Action selection:

$$a^* = \arg\max_a\left[Q(s,a) + c_1\sqrt{\frac{\hat{\sigma}^2_{ale}(s,a)\ln t}{n(s,a)}} + c_2\sqrt{\frac{\ln t}{n(s,a)}}\right]$$

Variance freezing: once is_variance_converged(s, a) returns True (relative change over the last variance_window observations is below stability_threshold), the aleatoric estimate is frozen. Further data no longer updates $\hat{\sigma}^2_{ale}$; the bonus then decays purely through the growing visit count $n(s,a)$.

Q-update: standard TD(0):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'}Q(s',a') - Q(s,a)\right]$$

Key hyperparameters:

| Parameter | Default | Role |
|---|---|---|
| c1 | 1.0 | Aleatoric bonus coefficient |
| c2 | 0.5 | Epistemic bonus coefficient |
| k | 50 | Exploitation gate multiplier |
| variance_window | 50 | Steps over which stability is checked |
| variance_stability_threshold | 0.05 | Max relative change to declare convergence |

VLM-DQN (Neural)

File: agents/vlm_dqn_agent.py

The deep RL variant. Runs two networks in parallel:

  1. Q-network — standard DQN: $\text{state} \to Q(s, a_i)$ for all $i$. Trained with Huber TD loss against a frozen target network (updated every target_update_freq steps).

  2. Variance network (HeteroskedasticRewardNet) — a two-headed MLP: $(\text{state}, a) \to (\hat{\mu}, \log\hat{\sigma}^2)$. Trained on every transition in the replay buffer with Gaussian NLL:

$$\mathcal{L}_{var} = \frac{1}{2}\left[\log\hat{\sigma}^2(s,a) + \frac{(r - \hat{\mu}(s,a))^2}{\hat{\sigma}^2(s,a)}\right]$$
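This NLL can be sanity-checked numerically: for a fixed residual it is minimised exactly when the predicted variance matches the squared error. A minimal sketch (the sample values are arbitrary):

```python
import math

def gaussian_nll(r, mu, log_var):
    """Heteroskedastic Gaussian NLL (constant terms dropped), as in the loss above."""
    var = math.exp(log_var)
    return 0.5 * (log_var + (r - mu) ** 2 / var)

# Residual (r - mu) = 2, so squared error = 4. Of the three candidate variances,
# log_var = ln(4) — i.e. sigma^2 equal to the squared residual — gives the lowest loss:
for lv in [0.0, math.log(4.0), 3.0]:
    print(lv, gaussian_nll(r=2.0, mu=0.0, log_var=lv))
```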

Action selection:

$$a^* = \arg\max_a\left[Q(s,a) + \beta \cdot \frac{\hat{\sigma}(s,a)}{\sqrt{\hat{n}(s,a)}}\right]$$

where $\hat{n}(s,a)$ is a soft visit count maintained in a hash table, and $\hat{\sigma}(s,a) = \exp\!\left(\tfrac{1}{2}\log\hat{\sigma}^2\right)$ comes from the variance network.

The variance network is updated every var_update_freq gradient steps, decoupled from the Q-network update schedule.

Architecture (defaults):

  • Q-network: Linear(state_dim → 64) → ReLU → Linear(64 → 64) → ReLU → Linear(64 → n_actions)
  • Var-network: shared trunk + two linear heads (mean head, log-var head)
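Put together, action selection reduces to an argmax over Q-values plus the variance-scaled bonus. A numpy sketch assuming per-action arrays have already been queried from the networks and the visit hash table (names are illustrative, not the agent's API):

```python
import numpy as np

def select_action(q_values, sigma_hat, soft_counts, beta=1.0):
    """Greedy w.r.t. Q plus a variance-network bonus shrunk by soft visit counts."""
    bonus = beta * sigma_hat / np.sqrt(np.maximum(soft_counts, 1.0))
    return int(np.argmax(q_values + bonus))

q = np.array([1.0, 1.1, 0.9])
sigma = np.array([0.1, 0.1, 3.0])   # action 2 is predicted high-noise...
n = np.array([100.0, 100.0, 4.0])   # ...and rarely visited
print(select_action(q, sigma, n))   # bonus pushes the agent toward action 2
```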

Baselines

| Agent | File | Variance-Aware? | Strategy |
|---|---|---|---|
| ε-greedy Q-Learning | agents/q_learning.py | No | Decaying random exploration |
| UCB-V | agents/ucbv_agent.py | Yes, but confounded | Empirical variance in UCB bound — cannot separate aleatoric from epistemic |
| C51 | agents/distributional_q.py | Implicit | Distributional RL; models full return distribution, not reward variance directly |
| VLM-Q (ours) | agents/vlm_q_agent.py | Separated | Explicit aleatoric map + convergence gate |
| VLM-DQN (ours) | agents/vlm_dqn_agent.py | Separated | Heteroskedastic reward network + bonus |

Why UCB-V falls short: UCB-V uses the empirical sample variance of returns (not rewards) in the bonus, which conflates aleatoric and epistemic components. It cannot freeze the bonus when aleatoric variance is known, and cannot apply per-state confidence gates.

Why C51 falls short: distributional RL models the distribution of discounted cumulative returns, not the per-step reward noise. Multi-step mixing and discounting blur state-level heteroskedasticity beyond recognition.


Environments

NoisyGridworld

File: envs/noisy_gridworld.py

A 10×10 Gymnasium discrete environment with three hard-coded variance zones:

| Zone | Rows | Cols | $\mu$ | $\sigma$ | Interpretation |
|---|---|---|---|---|---|
| Low-noise path | 4–6 | 0–9 | −1.0 | 0.1 | Reliable but longer route |
| High-noise shortcut | 1–3 | 3–7 | −0.5 | 3.0 | Faster expected return, high variance |
| Default region | 0–9 | 0–9 | −1.0 | 1.0 | Moderate noise background |

The ground-truth $\sigma(s)$ grid is stored on the environment object, enabling quantitative validation of learned variance maps (MSE against ground truth).

Goal: reach cell (9,9). Episode terminates on goal or after max_steps=200 steps.
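The zone logic reduces to a couple of range checks. A sketch of the reward draw only, using the zone table above (the actual environment additionally handles transitions, the goal reward, and termination):

```python
import numpy as np

def sample_reward(row, col, rng):
    """Draw a reward for cell (row, col) according to the three variance zones."""
    if 4 <= row <= 6:                        # low-noise path (rows 4-6, all cols)
        mu, sigma = -1.0, 0.1
    elif 1 <= row <= 3 and 3 <= col <= 7:    # high-noise shortcut
        mu, sigma = -0.5, 3.0
    else:                                    # default region
        mu, sigma = -1.0, 1.0
    return mu + sigma * rng.standard_normal()

rng = np.random.default_rng(0)
print(sample_reward(5, 2, rng))  # low-noise path: close to -1.0
```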

HeteroskedasticBandit

File: envs/heteroskedastic_bandit.py

A K-arm bandit where each arm $k$ has independently configured $(\mu_k, \sigma_k)$. Used in the bandit experiment to validate the Welford estimator: after $N$ pulls of each arm, the estimated variance should converge to the true $\sigma_k^2$.

NoisyFrozenLake

File: envs/noisy_frozen_lake.py

FrozenLake-v1 augmented with state-dependent reward noise — provides a sparse-reward transfer test to complement the dense-reward gridworld.

OBPBandit (Real-World)

File: envs/obp_bandit.py

A Gymnasium bandit backed by real click logs from the Open Bandit Dataset (Saito et al., NeurIPS 2021), collected by ZOZO Technologies on a Japanese fashion e-commerce platform. Calling step(action) replays with replacement from historical click records for that arm — no simulation, real heteroskedasticity.

| Property | Value |
|---|---|
| Arms | 34 (men campaign), 46 (women), 80 (all) |
| Reward | Binary click: 0 or 1 |
| Logging policy | random (unbiased) or bts (Thompson Sampling) |
| Campaigns | men, women, all |
| True variance | Bernoulli: $\sigma^2_a = p_a(1-p_a)$, naturally heteroskedastic |

Ground-truth per-arm CTR and Bernoulli variance are computed from the full dataset and exposed as ground_truth_means and ground_truth_variances properties for quantitative evaluation.
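The replay-with-replacement mechanic is simple to state in code. This is an illustrative stand-in, not the real envs/obp_bandit.py (which wraps the Open Bandit Dataset loader); the class and helper names here are hypothetical:

```python
import numpy as np

class ReplayBandit:
    """step(a) samples, with replacement, a historically logged reward for arm a."""
    def __init__(self, logged_actions, logged_rewards, seed=0):
        self.rng = np.random.default_rng(seed)
        # Group the logged rewards by the arm that produced them
        self.records = {a: logged_rewards[logged_actions == a]
                        for a in np.unique(logged_actions)}

    def step(self, action):
        return self.rng.choice(self.records[action])

def bernoulli_variance(rewards):
    """Ground-truth per-arm variance p(1-p), computed from the full log."""
    p = rewards.mean()
    return p * (1 - p)
```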


Estimators

WelfordAccumulator

File: estimators/welford.py

Numerically stable, single-pass online algorithm for mean and variance. Each update is O(1) time and O(1) memory. Supports merging two accumulators via Chan's parallel formula, enabling future distributed extension.

Update rule: for sample $x_n$,

$$\delta \leftarrow x_n - \bar{x}_{n-1}, \qquad \bar{x}_n \leftarrow \bar{x}_{n-1} + \delta/n, \qquad M_n \leftarrow M_{n-1} + \delta(x_n - \bar{x}_n)$$

$$\hat{\sigma}^2_n = M_n / (n - 1) \quad (n \geq 2)$$
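A compact sketch matching the update rule above, including Chan's parallel merge (the repository's WelfordAccumulator may expose a different interface):

```python
class Welford:
    """Single-pass mean/variance accumulator: O(1) time and memory per update."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance; defined for n >= 2
        return self.m2 / (self.n - 1) if self.n >= 2 else 0.0

    def merge(self, other):
        """Chan's parallel formula: combine two accumulators exactly."""
        merged = Welford()
        merged.n = self.n + other.n
        if merged.n == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = self.m2 + other.m2 + delta ** 2 * self.n * other.n / merged.n
        return merged
```

Merging two accumulators built on disjoint halves of a stream yields the same statistics as one accumulator over the whole stream, which is what makes the distributed extension possible.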

VarianceLandscapeMap

File: estimators/variance_map.py

A defaultdict of WelfordAccumulators keyed by (state, action). Provides:

  • update(s, a, r) — feed a new reward observation
  • get_variance(s, a) — current $\hat{\sigma}^2$ estimate
  • is_variance_converged(s, a) — True once the relative change in $\hat{\sigma}^2$ over the last variance_window observations falls below stability_threshold
  • get_aleatoric_uncertainty(s, a) — returns the frozen estimate if converged, or the live estimate otherwise
  • snapshot() — exports the full variance map as a 2-D array for heatmap visualisation
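A sketch of how such a map can detect convergence. The precise relative-change criterion used here (spread of recent estimates over their minimum) is an assumption about the implementation, and the class name is illustrative:

```python
from collections import defaultdict, deque

class VarianceMapSketch:
    """Illustrative (state, action) -> running-variance map with a convergence check."""
    def __init__(self, window=50, threshold=0.05):
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])        # n, mean, M2 per (s, a)
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.threshold = threshold

    def update(self, s, a, r):
        st = self.stats[(s, a)]
        st[0] += 1
        delta = r - st[1]
        st[1] += delta / st[0]
        st[2] += delta * (r - st[1])
        self.history[(s, a)].append(self.get_variance(s, a))

    def get_variance(self, s, a):
        n, _, m2 = self.stats[(s, a)]
        return m2 / (n - 1) if n >= 2 else 0.0

    def is_variance_converged(self, s, a):
        h = self.history[(s, a)]
        if len(h) < h.maxlen or min(h) == 0.0:
            return False
        return (max(h) - min(h)) / min(h) < self.threshold
```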

HeteroskedasticRewardNet

File: estimators/hetero_net.py

PyTorch nn.Module with shared trunk and two linear output heads:

  • Mean head: predicts $\hat{\mu}(s,a)$
  • Log-variance head: predicts $\log\hat{\sigma}^2(s,a)$ (unbounded; exponentiated before use)

Trained end-to-end with gaussian_nll_loss, which is the heteroskedastic regression loss above.


Real-World Dataset: Open Bandit Pipeline

In addition to synthetic environments, the codebase validates against real production data from the Open Bandit Dataset (OBD) (Saito et al., NeurIPS 2021 Datasets & Benchmarks Track).

Why OBD?

OBD is the only publicly available logged bandit dataset from a real recommendation system with known logging propensities. The random policy logs make sampling unbiased, allowing the Welford estimator to recover true per-arm Bernoulli variances — a ground-truth check impossible on most real datasets.

What the data shows

The men campaign (34 fashion items, 10 000 interaction records) has per-arm CTR ranging from 0% to 1.5%, giving Bernoulli variance $p_a(1-p_a)$ that spans an order of magnitude. This real heteroskedasticity is the ideal testbed for VLM.

OBP Experiments

| Command | What it does |
|---|---|
| python main.py obp | Validate Welford variance recovery on OBD; reports Spearman ρ and variance MSE |
| python main.py obp-agents | Train Q-Learning, UCB-V, and VLM-Q on OBD; generate all three plots |

Generated Plots

| File | Description |
|---|---|
| results/obp_bandit/obp_cumulative_regret.png | Cumulative regret curves with 95% CI across 20 seeds |
| results/obp_bandit/obp_variance_recovery.png | Ground-truth vs VLM-Q estimated per-arm Bernoulli variance |
| results/obp_bandit/obp_spearman_convergence.png | Spearman ρ of arm-variance ranking over 10 000 interaction steps |

Key Real-World Results

  • Spearman ρ = 0.70 by step 10 000 — VLM-Q correctly ranks 34 real fashion items by aleatoric risk.
  • ρ crosses the 0.5 pass threshold around step 4 500, showing the Welford map converges well within a typical deployment window.
  • UCB-V and VLM-Q incur higher regret than ε-greedy in this ultra-sparse binary regime (overall CTR < 0.5%), because exploration bonuses keep revisiting low-CTR arms that look uncertain. This is an expected limitation when aleatoric variance is uniformly near-zero and is documented as a boundary condition for VLM.

Project Structure

variance_landscape_rl/
├── envs/                          # Custom Gymnasium environments
│   ├── noisy_gridworld.py         #   10×10 grid, three σ zones, ground-truth map
│   ├── heteroskedastic_bandit.py  #   K-arm bandit with per-arm σ (synthetic)
│   ├── noisy_frozen_lake.py       #   FrozenLake + state-dependent noise
│   └── obp_bandit.py              #   Real-world: Open Bandit Dataset (80 fashion items, binary clicks)
├── estimators/                    # Online variance estimation
│   ├── welford.py                 #   Welford O(1) accumulator
│   ├── variance_map.py            #   Tabular (s, a) → σ̂² map + convergence detection
│   └── hetero_net.py              #   PyTorch μ + log-σ² two-headed network
├── agents/                        # Learning algorithms
│   ├── common.py                  #   ReplayBuffer, EpsilonSchedule, BaseAgent
│   ├── q_learning.py              #   Baseline: ε-greedy Q-learning
│   ├── ucbv_agent.py              #   Baseline: UCB-V (Audibert 2007)
│   ├── distributional_q.py        #   Baseline: C51 (Bellemare 2017)
│   ├── vlm_q_agent.py             #   Novel: VLM-Q — tabular heteroskedastic exploration
│   └── vlm_dqn_agent.py           #   Novel: VLM-DQN — neural heteroskedastic exploration
├── experiments/                   # Experiment runners
│   ├── bandit_experiment.py       #   Phase 1: validate Welford on synthetic bandit
│   ├── obp_experiment.py          #   Phase 1b: validate Welford on real OBP click data
│   ├── obp_agent_experiment.py    #   Phase 1c: train all agents on OBP + generate plots
│   ├── tabular_experiment.py      #   Phase 2: Q-Learning vs UCB-V vs VLM-Q (20 seeds)
│   ├── neural_experiment.py       #   Phase 3: DQN vs C51 vs VLM-DQN (20 seeds)
│   └── statistical_tests.py       #   Friedman, Nemenyi post-hoc, Welch's t + Cohen's d
├── visualization/                 # Plot generation
│   ├── variance_heatmap.py        #   Ground truth vs estimated σ² heatmaps
│   ├── learning_curves.py         #   Return curves with 95% CI bands
│   └── obp_plots.py               #   OBP-specific: regret curves, variance recovery, Spearman ρ
├── configs/                       # YAML experiment configurations
│   ├── bandit.yaml                #   Synthetic bandit arms and noise levels
│   ├── obp_bandit.yaml            #   OBP dataset + agent hyperparameters
│   ├── tabular.yaml               #   Gridworld zones, agent hyperparameters (20 seeds × 5000 ep)
│   └── neural.yaml                #   Neural agent hyperparameters
├── results/                       # Auto-generated outputs
│   ├── bandit/
│   ├── obp_bandit/
│   ├── tabular/
│   └── neural/
├── main.py                        # CLI entry point
├── requirements.txt
└── README.md

Installation

# Clone
git clone <repo-url>
cd variance_landscape_rl

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0, Gymnasium ≥ 0.29.


Quick Start

# Full pipeline: bandit validation → tabular → neural → stats → plots
python main.py all

# Individual phases
python main.py bandit                                  # Phase 1: validate Welford on synthetic bandit
python main.py tabular                                 # Phase 2: tabular agent comparison
python main.py neural                                  # Phase 3: neural agent comparison

# Real-world OBP experiments
python main.py obp                                     # Validate Welford on real OBP click data
python main.py obp-agents                              # Train all agents on OBP, generate 3 plots
python main.py obp-agents --config configs/obp_bandit.yaml   # Custom config

# Statistical analysis on saved results
python main.py stats results/tabular/tabular_results.json
python main.py stats results/neural/neural_results.json

# Visualisation
python main.py plot results/tabular/learning_curves.json
python main.py plot results/neural/learning_curves.json

# Custom config
python main.py tabular --config configs/tabular.yaml

Configuration

All experiments are driven by YAML files in configs/. The key knobs:

# configs/tabular.yaml (excerpt)
environment:
  name: NoisyGridworld
  grid_size: 10
  zones:
    - name: low_noise_path
      rows: [4, 6]
      cols: [0, 10]
      reward_mean: -1.0
      reward_std: 0.1
    - name: high_noise_shortcut
      rows: [1, 3]
      cols: [3, 7]
      reward_mean: -0.5
      reward_std: 3.0

agents:
  vlm_q:
    c1: 1.0                        # aleatoric bonus coefficient
    c2: 0.5                        # epistemic bonus coefficient
    k: 50                          # exploitation gate multiplier
    variance_window: 50            # stability detection window
    variance_stability_threshold: 0.05

experiment:
  n_episodes: 5000
  n_seeds: 20
  checkpoint_steps: [100, 500, 2000, 5000]

# configs/obp_bandit.yaml (excerpt)
bandit:
  behavior_policy: "random"        # unbiased logging policy
  campaign: "men"                  # 34 arms — denser per-arm coverage
  n_steps: 10000
  n_seeds: 20

agents:
  vlm_q:
    gamma: 0.0                     # bandit: no future state
    c1: 1.0
    c2: 0.5
    k: 20.0
    variance_window: 30
    variance_stability_threshold: 0.02

experiment:
  log_interval: 200
  output_dir: "results/obp_bandit"

To sweep hyperparameters, copy the YAML, modify, and pass --config your_config.yaml.


Evaluation Protocol

  • Seeds: 20 independent seeds per agent per environment.
  • Checkpoints: Q-tables and metrics saved at episodes 100, 500, 2000, and 5000.
  • Primary metric: mean episode return over the final 100 episodes.
  • Secondary metric: area under the learning curve (sample efficiency proxy).
  • Variance estimation metric (synthetic): MSE between $\hat{\sigma}^2(s)$ and the ground-truth _reward_std grid stored on NoisyGridworld.
  • Variance estimation metric (real-world OBP): Spearman rank correlation ρ between estimated and true per-arm Bernoulli variances — measures whether VLM-Q correctly orders arms by aleatoric risk without requiring exact magnitude recovery.
  • Regret metric (OBP): cumulative pseudo-regret $\sum_t (\mu^* - \mu_{a_t})$ where $\mu^*$ is the best arm's empirical CTR.
  • Statistical tests:
    • Friedman test (non-parametric k-way comparison across all agents).
    • Nemenyi post-hoc (pairwise significance with familywise error rate correction).
    • Welch's t-test + Cohen's d for each pair, reported with 95% CI.
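The pseudo-regret metric above is straightforward to compute from a log of chosen arms. An illustrative helper (the arm means stand in for the empirical per-arm CTRs):

```python
def cumulative_pseudo_regret(chosen_arms, arm_means):
    """Cumulative pseudo-regret: running sum of (mu* - mu_{a_t}) vs the best arm."""
    mu_star = max(arm_means)
    curve, total = [], 0.0
    for a in chosen_arms:
        total += mu_star - arm_means[a]
        curve.append(total)
    return curve

print(cumulative_pseudo_regret([0, 1, 1], [0.75, 0.25]))  # [0.0, 0.5, 1.0]
```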

All results are serialised to JSON under results/ for reproducible analysis.


Results Layout

results/
├── bandit/
│   └── bandit_results.json              # synthetic arm-level variance MSE over time
├── obp_bandit/
│   ├── obp_results.json                 # Welford validation: Spearman ρ + var MSE per seed
│   ├── obp_agent_results.json           # agent training: regret/reward curves + VLM-Q var snapshots
│   ├── obp_cumulative_regret.png        # Figure: cumulative regret curves with 95% CI
│   ├── obp_variance_recovery.png        # Figure: ground-truth vs VLM-Q per-arm variance bars
│   └── obp_spearman_convergence.png     # Figure: Spearman ρ of arm-variance ranking over time
├── tabular/
│   ├── tabular_results.json             # per-seed final returns
│   ├── learning_curves.json             # episode × agent mean/std return
│   ├── statistical_analysis.json        # Friedman p-value, Nemenyi CD, Welch pairs
│   └── checkpoints/
│       ├── q_learning/
│       │   └── seed{i}_ep{n}.json       # Q-table snapshot
│       ├── ucbv/
│       └── vlm_q/
└── neural/
    ├── neural_results.json
    ├── learning_curves.json
    └── statistical_analysis.json

Key References
