In this Colab, we perform a simple test of our reward model comparison technique on randomly generated reward models.

Specifically, we randomly sample two reward models $r_g(s,a,s')$ and $r_n(s,a,s')$ that we treat as the "ground truth" and "noise" reward models, respectively. We also randomly sample a potential function $\phi_n(s)$, a scale factor $\lambda > 0$ and a shift $c$.

For a given reward noise magnitude $\sigma_r$ and potential noise magnitude $\sigma_\phi$, the synthetically generated reward model is:
  $$r_o(s,a,s') = \lambda (r_g(s,a,s') + \sigma_r\cdot r_n(s,a,s') + \sigma_\phi(\gamma \phi_n(s') - \phi_n(s))) + c,$$
where $\gamma$ is the discount factor.
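
As a minimal sketch of this construction, the snippet below evaluates the formula on a batch of transitions using plain NumPy arrays. The function name `synthetic_reward` and its argument names are hypothetical and are not part of `evaluating_rewards`; the random arrays stand in for the sampled reward models and potential.

```python
import numpy as np


def synthetic_reward(r_g, r_n, phi_s, phi_next, *, scale, shift,
                     reward_noise, potential_noise, discount):
    """Evaluate r_o on a batch of transitions (hypothetical helper).

    r_g, r_n: ground-truth and noise rewards for each transition.
    phi_s, phi_next: noise potential at the current and next state.
    """
    shaping = discount * phi_next - phi_s
    return scale * (r_g + reward_noise * r_n + potential_noise * shaping) + shift


rng = np.random.default_rng(0)
n = 8  # number of transitions in the illustrative batch
r_g, r_n = rng.normal(size=n), rng.normal(size=n)
phi_s, phi_next = rng.normal(size=n), rng.normal(size=n)

# With both noise magnitudes at zero, r_o is a positive affine transform of r_g.
r_clean = synthetic_reward(r_g, r_n, phi_s, phi_next, scale=2.0, shift=0.5,
                           reward_noise=0.0, potential_noise=0.0, discount=0.99)
# With noise switched on, r_o also contains the noise reward and the shaping term.
r_noisy = synthetic_reward(r_g, r_n, phi_s, phi_next, scale=2.0, shift=0.5,
                           reward_noise=0.3, potential_noise=4.0, discount=0.99)
```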

We then compare the constructed reward model $r_o$ to the ground truth $r_g$. Concretely, we search over the space of reward functions that are equivalent to $r_o$ (that is, those that induce the same optimal policy as $r_o$ under arbitrary transition dynamics) to find the one closest to $r_g$. The permissible transformations of $r_o$ are positive affine transformations (shifting and positive scaling) and the addition of potential shaping $\gamma \phi(s') - \phi(s)$ for some potential function $\phi$.

Note that if the reward noise magnitude is set to $\sigma_r = 0$, then $r_o$ can be transformed to exactly match $r_g$. As $\sigma_r$ increases, it is usually no longer possible to match $r_g$ exactly, and the magnitude of the error should increase in proportion to $\sigma_r$. By contrast, changes in $\lambda$, $c$, and $\sigma_\phi$ do not change the *minimum* distance, but they can make the optimization problem easier (when the required transformation is closer to the identity) or harder (when it is further away).
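
These claims can be checked in a simplified tabular setting. The sketch below is not the method used in this Colab, which fits neural network models by gradient descent. It assumes a small discrete state space, so that the scale, shift and potential enter linearly and can be fitted with ordinary least squares. It also leaves the scale unconstrained rather than enforcing positivity. All names (`fit_equivalent_reward`, `make_r_o`) are hypothetical.

```python
import numpy as np


def fit_equivalent_reward(r_o, r_g, s, s_next, n_states, discount):
    """Least-squares fit of scale * r_o + shift + shaping to r_g (tabular sketch).

    Returns the RMS residual and the fitted scale. The scale is NOT
    constrained to be positive here, which is a simplification.
    """
    n = len(r_o)
    rows = np.arange(n)
    # Column k of the shaping basis is discount * 1[s' = k] - 1[s = k].
    shaping_basis = np.zeros((n, n_states))
    shaping_basis[rows, s_next] += discount
    shaping_basis[rows, s] -= 1.0
    design = np.column_stack([r_o, np.ones(n), shaping_basis])
    coef, *_ = np.linalg.lstsq(design, r_g, rcond=None)
    residual = design @ coef - r_g
    return float(np.sqrt(np.mean(residual ** 2))), float(coef[0])


rng = np.random.default_rng(0)
n_states, n, discount = 10, 500, 0.99
s = rng.integers(n_states, size=n)
s_next = rng.integers(n_states, size=n)
r_g, r_n = rng.normal(size=n), rng.normal(size=n)
phi_n = rng.normal(size=n_states)


def make_r_o(reward_noise, potential_noise=5.0, scale=2.0, shift=3.0):
    """Build r_o from the formula above on the sampled transitions."""
    shaping = discount * phi_n[s_next] - phi_n[s]
    return scale * (r_g + reward_noise * r_n + potential_noise * shaping) + shift


for reward_noise in [0.0, 0.1, 0.5]:
    dist, fitted_scale = fit_equivalent_reward(
        make_r_o(reward_noise), r_g, s, s_next, n_states, discount)
    print(f"sigma_r={reward_noise:.1f}  distance={dist:.4f}  scale={fitted_scale:.3f}")
```

In this toy setting the residual vanishes (up to floating-point error) when the reward noise is zero, grows as the reward noise grows, and is unchanged by the choice of scale, shift and potential noise, since those only re-parameterize the same fitted subspace.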

We perform this experiment for a variety of values of the reward noise magnitude $\sigma_r$ and potential noise $\sigma_\phi$. We use the standard $\ell_2$ distance metric, on a transition distribution induced by randomly taking actions in two simple environments: a "point mass" environment where actions produce forces on a point mass with position and velocity, and a mock 5D environment where states are uniformly sampled in $[0,1]$ and actions sampled from $[0,1]$ are added to the state (clipped to stay in $[0,1]$).
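
To make the mock environment and the distance concrete, here is a rough NumPy sketch of the transition distribution described above. It is an illustration only: `sample_uniform_transitions` and `l2_distance` are hypothetical names, the actual sampling lives in `datasets.dummy_env_and_dataset`, and the normalization of the distance (a root-mean-square over sampled transitions) is an assumption.

```python
import numpy as np


def sample_uniform_transitions(n, dims=5, seed=0):
    """Sample n transitions from the mock environment (illustrative only)."""
    rng = np.random.default_rng(seed)
    obs = rng.uniform(0.0, 1.0, size=(n, dims))
    act = rng.uniform(0.0, 1.0, size=(n, dims))
    # Actions are added to the state, then clipped back into [0, 1].
    next_obs = np.clip(obs + act, 0.0, 1.0)
    return obs, act, next_obs


def l2_distance(rew_a, rew_b):
    """Root-mean-square difference between two rewards on the same transitions."""
    return float(np.sqrt(np.mean((rew_a - rew_b) ** 2)))


obs, act, next_obs = sample_uniform_transitions(1000)
# Two toy reward functions evaluated on the same batch of transitions.
rew_a = next_obs.sum(axis=1)
rew_b = obs.sum(axis=1)
print(l2_distance(rew_a, rew_b))
```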

# Setup: Imports and Environment Creation
---


In [0]:
import gym
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import tensorflow as tf

from imitation.util import rollout

from evaluating_rewards.envs import point_mass
from evaluating_rewards.experiments import datasets
from evaluating_rewards.experiments import synthetic
from evaluating_rewards.experiments import util
from evaluating_rewards.experiments import visualize

In [0]:
env_uniform = datasets.dummy_env_and_dataset(dims=5)  # 5-dimensional uniform state distribution
env_pm = datasets.make_pm()  # PointMass

# Experiment: Comparing noisy models
---


In [0]:
#@title Helper Methods
def run_compare_synthetic(reps=3, **kwargs):
  dfs = []
  metrics = []
  for _ in range(reps):
    with util.fresh_sess(intra_op=12, inter_op=12):
      df, metric = synthetic.compare_synthetic(**kwargs)
    dfs.append(df)
    metrics.append(metric)
  return dfs, metrics


def plot_shaping_comparison(dfs, **kwargs):
  fig, axs = plt.subplots(1, len(dfs), figsize=(16, 4), squeeze=False)
  longforms = []
  for df, ax in zip(dfs, axs[0]):
    longform = visualize.plot_shaping_comparison(df, ax=ax)
    longforms.append(longform)
  return longforms

The method below runs a simple version of the experiment described above. It tests two different reward model architectures: a linear model and a model with two hidden layers. The reward noise magnitude ranges between $0.0$ and $1.0$, while the potential noise takes values from $0.0$ up to (but not including) $10.0$ in steps of $2.0$. Note that since the potential shaping term is the *difference* between the potential at two (usually nearby) states, the potential shaping tends to be smaller than the potential itself, which is why a larger scale is used for the potential noise.

The graphs plot the intrinsic distance and shaping magnitude in solid and dashed lines respectively. The intrinsic distance is the minimal distance between the set of reward functions equivalent to $r_o$ and the ground-truth reward $r_g$. The shaping magnitude is the average size of the potential shaping added. The x-axis shows the reward noise magnitude $\sigma_r$ and the different colors show the potential noise magnitude $\sigma_\phi$ (darker is larger).

In [0]:
#@title Run Experiment
def compare_environment_architecture(**kwargs):
  envs = {'pm': env_pm, 'uniform': env_uniform}
  layers = {
      'linear': {
          'reward_hids': [], 
          'dataset_potential_hids': [], 
          'model_potential_hids': [],
          'learning_rate': 1e-1
      },
      'twolayer': {
          'reward_hids': [32, 32],
          'dataset_potential_hids': [32, 32],
          'model_potential_hids': [32, 32],
          'learning_rate': 1e-2,
      }
  }
  res = {}
  for env_name, env in envs.items():
    for arch_name, arch in layers.items():
      print(env_name, arch_name)
      res[f'{env_name}_{arch_name}'] = run_compare_synthetic(**arch, **env,
                                                             potential_noise=np.arange(0.0, 10.0, 2.0),
                                                             **kwargs)
  return res

env_arch_comparisons = compare_environment_architecture()

In [0]:
#@title Plot Results
for k, (dfs, metrics) in env_arch_comparisons.items():
  plot_shaping_comparison(dfs)
  plt.suptitle(k)  # title the whole figure rather than only the last subplot

Observe that in all cases, the intrinsic distance (solid line) is largely unaffected by the potential shaping (line color). This is as expected, since adding potential shaping does not change the equivalence class (but may make the optimization problem harder).

Furthermore, the intrinsic distance increases monotonically with reward noise, and in most cases has an approximately linear relationship as expected.

For the potential shaping, observe that larger potential noises (darker colors) produce larger potential shaping. In other words, if we add more potential noise, the learned potential tends to also be larger, as expected. The potential shaping magnitude varies with reward noise, but there is no clear pattern. Intuitively, since the reward noise we're adding may itself include a potential term, this could increase the potential shaping term we need to add or decrease it.

# Experiment: Summary Statistics

---

We previously mentioned that the potential noise was chosen to be on a scale 10x greater than the reward noise, because the potential shaping has a smaller magnitude. This section computes summary statistics (mean, standard deviation, etc.) of a randomly initialized reward model and potential shaping.

We observe a similar variance in the reward (`'reward'`) and the potential (`'old_potential'`, `'new_potential'`). However, on PointMass, the magnitude of the potential difference (`'shaping'`) is much lower (by a factor of 200). This is due to the dynamics causing the next state to be close to the current state.
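
The effect of nearby next states on the shaping magnitude can be illustrated with a toy calculation. The sketch below is not the `synthetic.summary_stats` computation: it assumes a hypothetical linear potential and Gaussian states, and compares the shaping term when the next state is a small perturbation of the current state against the case where the next state is drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims, discount = 10_000, 4, 0.99
weights = rng.normal(size=dims)  # hypothetical linear potential phi(s) = weights . s


def potential(states):
    return states @ weights


def shaping(states, next_states):
    """Potential shaping term: discount * phi(s') - phi(s)."""
    return discount * potential(next_states) - potential(states)


states = rng.normal(size=(n, dims))
near_next = states + 0.01 * rng.normal(size=(n, dims))  # next state close to current
far_next = rng.normal(size=(n, dims))                   # next state independent of current

potential_std = potential(states).std()
near_std = shaping(states, near_next).std()
far_std = shaping(states, far_next).std()
print(f"potential std: {potential_std:.3f}")
print(f"shaping std, nearby next state: {near_std:.3f}")
print(f"shaping std, independent next state: {far_std:.3f}")
```

When the next state is close to the current one, the two potential terms nearly cancel and the shaping is much smaller than the potential itself. The exact ratio depends on the dynamics and the potential, so the numbers from this toy setup should not be compared directly with the PointMass statistics.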

In [0]:
#@title Compute summary statistics on dataset
def summary(dataset_generator, observation_space, action_space, batch_size=4096):
  dataset = next(dataset_generator(batch_size, batch_size))
  with util.fresh_sess():
    summary_stats = synthetic.summary_stats(observation_space, action_space, dataset)
  return summary_stats

In [0]:
#@title 5D uniform environment
summary(**env_uniform)

In [0]:
#@title PointMass environment
summary(**env_pm)