# Learn a Reward Function using Maximum Causal Entropy Inverse Reinforcement Learning

MCE IRL only supports tabular environments.

The CliffWorld environment that we use here is a TabularEnvironment.
Its observations consist of the POMDP's observations and the actual state.
Later, we also need VecEnv objects that expose just the internal POMDP state or just the POMDP observation as their observation.

In [None]:
from imitation.algorithms.mce_irl import (
    MCEIRL,
    mce_occupancy_measures,
    mce_partition_fh,
    TabularPolicy,
)
import gym
import imitation.envs.examples.model_envs
from imitation.algorithms import base

from imitation.data import rollout
from imitation.envs import resettable_env
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.rewards import reward_nets


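env_name = "imitation/CliffWorld15x6-v0"
env = gym.make(env_name)
# Four vectorized copies that expose only the internal POMDP state as the observation.
state_venv = resettable_env.DictExtractWrapper(
    DummyVecEnv([lambda: gym.make(env_name)] * 4), "state"
)
# Four vectorized copies that expose only the POMDP observation.
obs_venv = resettable_env.DictExtractWrapper(
    DummyVecEnv([lambda: gym.make(env_name)] * 4), "obs"
)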

Then we derive an expert policy using Bellman backups. We analytically compute the occupancy measures, and also sample some expert trajectories.

In [None]:
_, _, pi = mce_partition_fh(env)

_, om = mce_occupancy_measures(env, pi=pi)

expert = TabularPolicy(
    state_space=env.pomdp_state_space,
    action_space=env.action_space,
    pi=pi,
    rng=None,
)

expert_trajs = rollout.generate_trajectories(
    policy=expert,
    venv=state_venv,
    sample_until=rollout.make_min_timesteps(5000),
)

print("Expert stats: ", rollout.rollout_stats(expert_trajs))
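
To make the two computations above more concrete, here is a minimal pure-Python sketch of finite-horizon soft Bellman backups and of the resulting expected state-visitation counts. It is an illustration under stated assumptions, not the code behind `mce_partition_fh` or `mce_occupancy_measures`: the function names, the 3-state chain, the state-only reward, and the absence of discounting are all hypothetical choices made for this example.

```python
import math


def soft_bellman_backups(transition, reward, horizon):
    """Return pi[t][s][a] from finite-horizon soft Bellman backups.

    transition[s][a][s2] is P(s2 | s, a); reward[s] is a state-only reward.
    Illustrative sketch only (no discounting, value 0 after the last step).
    """
    n_states, n_actions = len(transition), len(transition[0])
    value = [0.0] * n_states
    policy = []
    for _ in range(horizon):
        q = [
            [
                reward[s] + sum(p * v for p, v in zip(transition[s][a], value))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # Soft maximum over actions: a numerically stable log-sum-exp.
        new_value = []
        for q_s in q:
            top = max(q_s)
            new_value.append(top + math.log(sum(math.exp(x - top) for x in q_s)))
        # The policy is the softmax of Q, i.e. exp(Q - V).
        policy.append(
            [[math.exp(x - v) for x in q_s] for q_s, v in zip(q, new_value)]
        )
        value = new_value
    policy.reverse()  # backups run from the last time step to the first
    return policy


def occupancy_measure(transition, policy, initial_dist):
    """Expected number of visits to each state over len(policy) time steps."""
    n_states = len(transition)
    state_dist = list(initial_dist)
    visits = [0.0] * n_states
    for policy_t in policy:
        next_dist = [0.0] * n_states
        for s, prob_s in enumerate(state_dist):
            visits[s] += prob_s
            for a, prob_a in enumerate(policy_t[s]):
                for s2, prob_next in enumerate(transition[s][a]):
                    next_dist[s2] += prob_s * prob_a * prob_next
        state_dist = next_dist
    return visits


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right,
# and only the rightmost state gives reward.
chain_transition = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
chain_reward = [0.0, 0.0, 1.0]
chain_pi = soft_bellman_backups(chain_transition, chain_reward, horizon=4)
chain_om = occupancy_measure(chain_transition, chain_pi, [1.0, 0.0, 0.0])
print("P(right | state 0, t=0):", chain_pi[0][0][1])
print("Expected visits per state:", chain_om)
```

In this sketch the policy stays stochastic everywhere: it prefers moving toward the rewarding state but never commits to it, and the expected visit counts sum to the horizon.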

Next, we define a helper that sets up the MCE IRL algorithm and trains it.

In [None]:
def train_mce_irl(demos, **kwargs):
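    # State-only reward: actions, next states and done flags are not used as inputs.
    # hid_sizes=[] means no hidden layers, so the reward is linear in the observation.
    reward_net = reward_nets.BasicRewardNet(
        env.pomdp_observation_space,
        env.action_space,
        use_action=False,
        use_next_state=False,
        use_done=False,
        hid_sizes=[],
    )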

    mce_irl = MCEIRL(demos, env, reward_net, linf_eps=1e-3)
    mce_irl.train(**kwargs)

    imitation_trajs = rollout.generate_trajectories(
        policy=mce_irl.policy,
        venv=state_venv,
        sample_until=rollout.make_min_timesteps(5000),
    )
    print("Imitation stats: ", rollout.rollout_stats(imitation_trajs))

    return mce_irl

First, we train it on the analytically computed occupancy measures. This should give a very precise result.

In [None]:
mce_irl_from_om = train_mce_irl(om)

Then we train it on trajectories sampled from the expert. This gives only a stochastic approximation to the occupancy measure, so performance is a little worse. Using more expert trajectories should improve performance. Try it!

In [None]:
mce_irl_from_trajs = train_mce_irl(expert_trajs[0:10])
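
Why sampled trajectories give only a stochastic approximation can be seen in a small standalone sketch. It compares exact expected visit counts with Monte Carlo estimates from a few and from many sampled trajectories. Everything here is hypothetical and independent of the `imitation` library: the helper names, the 3-state Markov chain (standing in for an environment with a fixed policy already applied), the horizon, and the seed.

```python
import random


def exact_visits(step_matrix, start_state, horizon):
    """Exact expected visits per state; step_matrix[s][s2] is P(s2 | s)."""
    n = len(step_matrix)
    dist = [0.0] * n
    dist[start_state] = 1.0
    visits = [0.0] * n
    for _ in range(horizon):
        visits = [v + d for v, d in zip(visits, dist)]
        dist = [sum(dist[s] * step_matrix[s][s2] for s in range(n)) for s2 in range(n)]
    return visits


def sampled_visits(step_matrix, start_state, horizon, n_trajs, rng):
    """Monte Carlo estimate: average visit counts over sampled trajectories."""
    n = len(step_matrix)
    counts = [0] * n
    for _ in range(n_trajs):
        state = start_state
        for _ in range(horizon):
            counts[state] += 1
            state = rng.choices(range(n), weights=step_matrix[state])[0]
    return [c / n_trajs for c in counts]


# Hypothetical chain induced by some fixed policy.
step_matrix = [
    [0.3, 0.7, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.1, 0.9],
]
rng = random.Random(0)
exact = exact_visits(step_matrix, start_state=0, horizon=5)
few = sampled_visits(step_matrix, 0, 5, n_trajs=10, rng=rng)
many = sampled_visits(step_matrix, 0, 5, n_trajs=5000, rng=rng)

err_few = max(abs(e - f) for e, f in zip(exact, few))
err_many = max(abs(e - m) for e, m in zip(exact, many))
print("max error with 10 trajectories:  ", err_few)
print("max error with 5000 trajectories:", err_many)
```

The estimate's error typically shrinks as the number of trajectories grows, which is the reason to expect better results from passing more than the first ten expert trajectories.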