# Demo of EPIC Distance

EPIC (Equivalent-Policy Invariant Comparison) distance measures the dissimilarity between reward functions. It works by mapping each reward function to a canonical representative that is invariant to potential shaping, then computing the Pearson distance between the canonicalized reward functions. In this notebook, we compute the EPIC distance between reward functions in a simple PointMass environment.
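
To make the canonicalization step concrete, here is a minimal tabular sketch. It is not the `evaluating_rewards` implementation: it assumes a small finite state and action space with uniform distributions over both, and follows the paper's definition of the canonically shaped reward, $C(R)(s,a,s') = R(s,a,s') + \mathbb{E}[\gamma R(s',A,S') - R(s,A,S') - \gamma R(S,A,S')]$. The function `canonicalize` and all variable names are illustrative.

```python
import numpy as np


def canonicalize(rew, gamma):
    """Canonicalize a tabular reward of shape (S, A, S).

    Assumes uniform distributions over states and actions (an assumption
    of this sketch, not something fixed by the notebook).
    """
    # mean_from[x] = E[R(x, A, S')]: expected reward when leaving state x.
    mean_from = rew.mean(axis=(1, 2))
    # E[R(S, A, S')]: expected reward over all transitions.
    mean_all = rew.mean()
    return (
        rew
        + gamma * mean_from[None, None, :]  # gamma * E[R(s', A, S')]
        - mean_from[:, None, None]          # E[R(s, A, S')]
        - gamma * mean_all                  # gamma * E[R(S, A, S')]
    )


rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99
rew = rng.normal(size=(n_states, n_actions, n_states))
potential = rng.normal(size=n_states)

# Potential shaping adds gamma * phi(s') - phi(s) to every transition.
shaped = rew + gamma * potential[None, None, :] - potential[:, None, None]

# The raw rewards differ, but their canonical forms coincide.
same_canonical = np.allclose(canonicalize(rew, gamma), canonicalize(shaped, gamma))
```

The shaping term cancels exactly inside the expectation, which is why two rewards that differ only by potential shaping end up with the same canonical representative.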

For more information, see the accompanying [paper](https://arxiv.org/abs/2006.13900). This notebook will produce a subset of the heatmap in Figure 2(a).

## Setup

First, we install the `evaluating_rewards` library and its dependencies.

In [None]:
!cd ../ && python setup.py install

## Imports

Now, we import some standard RL and ML dependencies, and relevant modules from our `evaluating_rewards` library.

In [1]:
# Turn off distracting warnings and logging
import warnings
warnings.filterwarnings("ignore")

import tensorflow as tf
import logging
logging.getLogger("tensorflow").setLevel(logging.CRITICAL)

# Import rest of the dependencies
import gym
import pandas as pd
from stable_baselines.common import vec_env

from evaluating_rewards import datasets, epic_sample, serialize, tabular, util

## Configuration

In this section, we specify some hyperparameters, including the environment to load (PointMass) and models to compare.

In [2]:
n_samples = 4096  # number of samples to take final expectation over
n_mean_samples = 4096  # number of samples to use to canonicalize potential
env_name = "evaluating_rewards/PointMassLine-v0"  # the environment to compare in
# The reward models to load.
model_kinds = (
    "evaluating_rewards/PointMassSparseWithCtrl-v0",
    "evaluating_rewards/PointMassDenseWithCtrl-v0",
    "evaluating_rewards/PointMassGroundTruth-v0"
)
seed = 42  # make results deterministic

## Load Models

Here, we load the reward models. For simplicity, we load hardcoded reward models, and so specify a `"dummy"` reward path. `serialize` also supports loading learned reward models produced by other packages like [imitation](http://github.com/humancompatibleai/imitation). You can also register your own loaders with `serialize`, or load reward models by another mechanism -- the only requirement is they must satisfy the `evaluating_rewards.rewards.RewardModel` interface.

In [3]:
venv = vec_env.DummyVecEnv([lambda: gym.make(env_name)])
sess = tf.Session()
with sess.as_default():
    tf.set_random_seed(seed)
    models = {kind: serialize.load_reward(reward_type=kind, reward_path="dummy", venv=venv) 
              for kind in model_kinds}

## Compute EPIC Distance

Finally, we canonicalize the rewards and compute the Pearson distance between canonicalized rewards, producing the EPIC distance.
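
For intuition about the second step, the following is a small NumPy sketch of the Pearson distance, $\sqrt{(1 - \rho)/2}$, where $\rho$ is the Pearson correlation coefficient. It is an illustration only and makes no claim about how `tabular.pearson_distance` is implemented; the function below and the sample data are assumptions of this sketch.

```python
import numpy as np


def pearson_distance(x, y):
    """Pearson distance sqrt((1 - rho) / 2) between two equal-length sample vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]
    # Clipping guards against rho marginally exceeding 1 due to floating-point error.
    return float(np.sqrt(np.clip(1.0 - rho, 0.0, 2.0) / 2.0))


rng = np.random.default_rng(0)
rew = rng.normal(size=1000)

# Positive rescaling and constant shifts leave the distance at (numerically) zero.
d_scaled = pearson_distance(rew, 3.0 * rew + 5.0)
# A negated reward is perfectly anti-correlated, giving the maximum distance of one.
d_negated = pearson_distance(rew, -rew)
# Independent samples are roughly uncorrelated, giving a distance near sqrt(1/2).
d_independent = pearson_distance(rew, rng.normal(size=1000))
```

Because the distance depends only on the correlation, it ignores positive rescaling and constant offsets of either reward; combined with canonicalization, this is what makes EPIC invariant to both scaling and potential shaping.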

In [4]:
# Define visitation and state/action distribution
# Sample observation and actions from the Gym spaces
venv.observation_space.seed(seed)
venv.action_space.seed(seed)
with datasets.sample_dist_from_space(venv.observation_space) as obs_dist:
    with datasets.sample_dist_from_space(venv.action_space) as act_dist:
        # Visitation distribution (obs,act,next_obs) is IID sampled from obs_dist and act_dist
        with datasets.transitions_factory_iid_from_sample_dist(obs_dist, act_dist) as batch_callable:
            batch = batch_callable(n_samples)
        
        # Finally, let's compute the EPIC distance between these models.
        # First, we'll canonicalize the rewards.
        with sess.as_default():
            deshaped_rew = epic_sample.sample_canon_shaping(
                models=models,
                batch=batch,
                act_dist=act_dist,
                obs_dist=obs_dist,
                n_mean_samples=n_mean_samples,
                # You can also specify the discount rate and the type of norm,
                # but defaults are fine for most use cases.
            )

# Now, let's compute the Pearson distance between these canonicalized rewards.
# The canonicalized rewards are quantized to `n_samples` granularity, so we can
# then compute the Pearson distance on this (finite approximation) exactly.
epic_distance = util.cross_distance(deshaped_rew, deshaped_rew, tabular.pearson_distance, parallelism=1)

## Results

The `Sparse` and `Dense` rewards are equivalent, differing only by potential shaping, and accordingly have zero EPIC distance. The `GroundTruth` reward is not equivalent to either and so has a significant EPIC distance from both. See Section 6.1 of the [paper](https://arxiv.org/pdf/2006.13900.pdf) for further information.

In [5]:
epic_df = pd.Series(epic_distance).unstack()
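# Strip the common prefix and suffix from the labels. `regex=True` is passed
# explicitly because newer pandas versions treat the pattern as a literal by default.
epic_df.index = epic_df.index.str.replace(r'evaluating_rewards/PointMass(.*)-v0', r'\1', regex=True)
epic_df.columns = epic_df.columns.str.replace(r'evaluating_rewards/PointMass(.*)-v0', r'\1', regex=True)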
epic_df

| | DenseWithCtrl | GroundTruth | SparseWithCtrl |
|---|---|---|---|
| DenseWithCtrl | 0.0 | 0.547262 | 0.0 |
| GroundTruth | 0.547262 | 0.0 | 0.547262 |
| SparseWithCtrl | 0.0 | 0.547262 | 0.0 |
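
As a quick sanity check on the output above, the sketch below re-enters the reported values by hand and verifies the properties one would expect of a pseudometric: symmetry, a zero diagonal, and the triangle inequality. The array is copied from the table rather than computed, and the variable names are illustrative.

```python
import itertools

import numpy as np

names = ["DenseWithCtrl", "GroundTruth", "SparseWithCtrl"]
# Values copied from the table above (rows and columns in the order of `names`).
dist = np.array([
    [0.0, 0.547262, 0.0],
    [0.547262, 0.0, 0.547262],
    [0.0, 0.547262, 0.0],
])

tol = 1e-9  # slack for floating-point comparisons
is_symmetric = np.allclose(dist, dist.T)
zero_diagonal = np.allclose(np.diag(dist), 0.0)
triangle_ok = all(
    dist[i, k] <= dist[i, j] + dist[j, k] + tol
    for i, j, k in itertools.product(range(len(names)), repeat=3)
)
```

Note that a zero off-diagonal entry, as between `DenseWithCtrl` and `SparseWithCtrl`, is consistent with a pseudometric: distinct reward functions can be at distance zero when they are equivalent.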
