# Reward Machines (RMs) for RL

$$
\newcommand{\tuple}[1]{\left\langle #1 \right\rangle}
\newcommand{\StateSpace}[0]{\mathcal{S}}
\newcommand{\ActionSpace}[0]{\mathcal{A}}
\newcommand{\SAS}[0]{\StateSpace\times\ActionSpace\times\StateSpace}
\newcommand{\ContextSpace}[0]{\mathcal{C}}
\newcommand{\MDPFunc}[0]{\mathcal{M}}
\newcommand{\CMDP}[0]{\tuple{\ContextSpace, \StateSpace, \ActionSpace, \MDPFunc}}
\newcommand{\MDPInContext}[1]{\tuple{\StateSpace, \ActionSpace, p^{#1}, r^{#1}, \gamma}}
\newcommand{\propsym}[0]{\mathcal{P}}
\newcommand{\RM}[0]{\tuple{\propsym, U, \delta_u, \delta_r}}
\newcommand{\RMsym}[0]{\mathcal{R}}
\newcommand{\MDPRM}[0]{\tuple{\StateSpace , \ActionSpace, p,\gamma,\propsym, L, U, \delta_u, \delta_r}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
$$

## Problem Formulation

Let $X = \CMDP$ be a contextual MDP (CMDP) and let $\Psi$ be a distribution over the context space $\ContextSpace$. Let $f_\theta$ be a parameterized function, called the _adaptation function_, that takes trajectories as input and outputs adapted parameters. Denote by $\tau_c^{1:K}$ a collection of $K$ trajectories collected within context $c\in\ContextSpace$. We would like to find meta-parameters $\theta^*$ such that sampling a few trajectories with the parameterized policy $\pi_\theta$ and adapting $\theta$ to $\phi = f_\theta(\tau_c^{1:K})$ maximizes the expected return of $\pi_\phi$ over contexts $c\sim\Psi$. More formally, we would like to find:
\begin{equation}
    \theta^*\in\argmax{\theta}\mathbb{E}_{c\sim\Psi}\left[J_c(\pi_\phi)\middle|\phi = f_\theta(\tau_c^{1:K}), \tau_c^{1:K}\sim\pi_\theta\right]
\end{equation}
with the smallest $K$ possible. We measure performance with the "time to threshold" metric: the number of samples (or trajectories) that must be collected before the accumulated reward reaches a given threshold.
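
The time-to-threshold metric can be sketched as a small helper. This is a minimal illustration and not the evaluation code used in this notebook: the function name `time_to_threshold`, the `window` smoothing parameter, and the sample returns are all assumptions made up for the example.

```python
from typing import Optional, Sequence


def time_to_threshold(episode_returns: Sequence[float],
                      threshold: float,
                      window: int = 1) -> Optional[int]:
    """Number of episodes collected until the mean return over the last
    `window` episodes first reaches `threshold` (None if it never does)."""
    if window < 1:
        raise ValueError('window must be a positive integer')
    for end in range(window, len(episode_returns) + 1):
        recent = episode_returns[end - window:end]
        if sum(recent) / window >= threshold:
            return end  # episodes collected so far
    return None


# made-up per-episode returns, for illustration only
returns = [-5000.0, -4000.0, -3200.0, -900.0, -450.0, -480.0]
print(time_to_threshold(returns, threshold=-1000.0))           # single-episode criterion
print(time_to_threshold(returns, threshold=-500.0, window=2))  # averaged over two episodes
```

Counting environment steps instead of episodes only requires multiplying by the episode length when it is fixed, or accumulating per-episode lengths otherwise.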

## Method
![GNN RM usage](images/GNN_RM_usage.png)

## Imports and notebook utils

In [1]:
%load_ext tensorboard
%load_ext autoreload
%autoreload 2

from pathlib import Path

from rmrl.reward_machines.rm_env import RMEnvWrapper
from rmrl.envs.multitask_env import MultiTaskWrapper
from rmrl.envs.mujoco.HalfCheetahV3 import velocity_env
from rmrl.envs.mujoco.reward_machines.HalfCheetahV3 import VelocityRM
from rmrl.reward_machines.potential_functions import ValueIteration
from rmrl.policies.rm_policy import RMPolicy
from rmrl.nn.models import RMFeatureExtractorSB

from stable_baselines3 import DDPG

MODELS_DIR = Path('./models')
LOGS_DIR = Path('./logs')
TB_DIR = LOGS_DIR / 'tensorboard'


RS_GAMMA = 0.9
MAX_ITERS = 1000



In [2]:
def get_ddpg_trained_model(name, policy, env, timesteps):
    """Load the DDPG model saved under `name`, or train and save a new one."""
    model = DDPG(policy=policy,
                 env=env,
                 verbose=1,
                 tensorboard_log=str(TB_DIR / name))
    # load the model if a saved copy exists
    try:
        print('loading pre-trained model')
        # `load` is a classmethod that returns a new model; reuse the wrapped env
        return DDPG.load(MODELS_DIR / name, env=model.env)
    except FileNotFoundError:
        print('pre-trained model not found. training model')
        train_model(model, timesteps)
        model.save(MODELS_DIR / name)
        return model

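def train_model(model, timesteps):
    """Train `model` for one or more consecutive runs.

    `timesteps` is either a single number or an iterable with one entry per run.
    """
    # wrap a single number in a list so both cases share the same loop
    try:
        iter(timesteps)
    except TypeError:
        timesteps = [timesteps]

    for i, ts in enumerate(timesteps, 1):
        print(f'run number {i}. {ts} timesteps')
        model.learn(total_timesteps=ts,
                    tb_log_name=f'run{i}',  # number the run logs
                    reset_num_timesteps=False)  # continue the same curve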

def animate_env(model, num_iters=MAX_ITERS):
    env = model.env
    try:
        obs = env.reset()
        for i in range(num_iters):
            action, _state = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            env.render()
            if done:
                obs = env.reset()
    except KeyboardInterrupt:
        print('Early stop by user')
    finally:
        env.close()

## Original cheetah env (move forward)

In [3]:
model_fw = get_ddpg_trained_model('ddpg_cheetah_fw',
                                  'MlpPolicy',
                                  'HalfCheetah-v3',
                                  timesteps=[1e4] * 4)  # 4 rounds of 10,000

Using cpu device
Creating environment from the given name 'HalfCheetah-v3'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
loading pre-trained model



In [4]:
%tensorboard --logdir ./logs/tensorboard/ddpg_cheetah_fw

In [None]:
animate_env(model_fw)

## Fixed velocity (5.0) cheetah env

In [5]:
fixed_vel_env = velocity_env(initial_goal_vel=5.0, change_task_on_reset=False)
fixed_vel_env.reset()
print(f'goal velocity is: {fixed_vel_env.task}')
print('goal changes on reset' if fixed_vel_env.change_task_on_reset else 'fixed goal')

goal velocity is: 5.0
fixed goal


In [9]:
model_vel5 = get_ddpg_trained_model('ddpg_cheetah_vel5',
                                    'MlpPolicy',
                                    fixed_vel_env,
                                    timesteps=[1e5] * 2)  # 2 rounds of 100,000

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
loading pre-trained model
pre-trained model not found. training model
run number 1. 100000.0 timesteps
Logging to logs/tensorboard/ddpg_cheetah_vel5/run1_0
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 1e+03     |
|    ep_rew_mean     | -5.51e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 123       |
|    time_elapsed    | 32        |
|    total timesteps | 4000      |
| train/             |           |
|    actor_loss      | 23.1      |
|    critic_loss     | 0.803     |
|    learning_rate   | 0.001     |
|    n_updates       | 3000      |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 1e+03     |
|    ep_rew_mean     | -5.39e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps       

In [10]:
%tensorboard --logdir ./logs/tensorboard/ddpg_cheetah_vel5

In [12]:
animate_env(model_vel5)

Creating window glfw


## RM cheetah env

In [None]:
# function to create RMs in env wrapper
potential_fn = ValueIteration()
def rm_fn(task_env: MultiTaskWrapper):
    rm = VelocityRM(task_env.task)
    rm.reshape_rewards(potential_fn(rm, RS_GAMMA), gamma=RS_GAMMA)
    return [rm]

In [None]:
fixed_vel_env_rs = RMEnvWrapper(fixed_vel_env, rm_fn,
                                rm_observations=False,
                                change_rms_on_reset=False)  # fixed task. no need to reset RM

A quick demo of the env's internal reward machine and of reward shaping:

In [None]:
print(f'goal velocity: {fixed_vel_env_rs.task}')
fixed_vel_env_rs.rms[0].draw()
fixed_vel_env_rs.rms[0].delta(8, ['75%'])
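
To make the `delta` call above more concrete, here is a minimal sketch of a reward machine $\RMsym = \RM$ as a lookup table. It is a toy stand-in, not the `VelocityRM` class: the class name `SimpleRM`, the return type of `delta`, the self-loop default, and the proposition labels `'50%'` and `'100%'` are assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Props = FrozenSet[str]


class SimpleRM:
    """Toy reward machine: a table standing in for both delta_u and delta_r."""

    def __init__(self, initial_state: int,
                 transitions: Dict[Tuple[int, Props], Tuple[int, float]]):
        self.initial_state = initial_state
        self.transitions = transitions

    def delta(self, u: int, props: Iterable[str]) -> Tuple[int, float]:
        """Return (next RM state, reward) when `props` hold in RM state `u`.

        Assumption: unlisted (state, props) pairs self-loop with zero reward.
        """
        return self.transitions.get((u, frozenset(props)), (u, 0.0))


# hypothetical three-state machine: 0 -> 1 -> 2
toy_rm = SimpleRM(
    initial_state=0,
    transitions={
        (0, frozenset({'50%'})): (1, 1.0),
        (1, frozenset({'100%'})): (2, 10.0),
    },
)

u, total = toy_rm.initial_state, 0.0
for props in (['50%'], [], ['100%']):  # propositions observed at each step
    u, r = toy_rm.delta(u, props)
    total += r
print(u, total)
```

The RM state `u` only advances when a listed proposition set is observed, so the reward depends on the history of the episode and not just on the current environment state.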

In [None]:
rm = fixed_vel_env_rs.rms[0]
rm.reset_rewards()

In [None]:
# print the RM graph entries of a few selected states (before reshaping)
print(*(f'{u} {rm.G[u]}' for u in (0, 1, 7, 8, 9, 10)), sep='\n\n')

In [None]:
pots = potential_fn(rm, RS_GAMMA)
pots

In [None]:
rm.reshape_rewards(pots, RS_GAMMA)

In [None]:
# print the same RM graph entries again (after reshaping)
print(*(f'{u} {rm.G[u]}' for u in (0, 1, 7, 8, 9, 10)), sep='\n\n')
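
The idea behind the cells above can be sketched on a toy graph: run value iteration over the RM's transition graph to obtain a potential $\Phi(u)$ for every RM state, then replace each edge reward $r$ with $r + \gamma\Phi(u') - \Phi(u)$. This is a generic potential-based shaping sketch and not the `rmrl` implementation; the internals and sign convention of `ValueIteration` and `reshape_rewards` are not shown in this notebook, and the graph encoding and function names below are assumptions.

```python
from typing import Dict, List, Tuple

# hypothetical toy RM graph: state -> list of (next_state, reward) edges
Edges = Dict[int, List[Tuple[int, float]]]


def value_iteration(edges: Edges, gamma: float,
                    tol: float = 1e-8, max_iters: int = 1000) -> Dict[int, float]:
    """Iterate v(u) = max over edges (u -> u', r) of r + gamma * v(u')."""
    values = {u: 0.0 for u in edges}
    for _ in range(max_iters):
        new_values = {
            u: max((r + gamma * values[v] for v, r in out), default=0.0)
            for u, out in edges.items()
        }
        converged = max(abs(new_values[u] - values[u]) for u in edges) < tol
        values = new_values
        if converged:
            break
    return values


def shape_rewards(edges: Edges, potentials: Dict[int, float], gamma: float) -> Edges:
    """Potential-based shaping: r' = r + gamma * phi(u') - phi(u)."""
    return {
        u: [(v, r + gamma * potentials[v] - potentials[u]) for v, r in out]
        for u, out in edges.items()
    }


# each state can stall (self-loop) or progress; only the edge 1 -> 2 pays off
toy_edges: Edges = {
    0: [(0, 0.0), (1, 0.0)],
    1: [(1, 0.0), (2, 1.0)],
    2: [(2, 0.0)],
}
toy_pots = value_iteration(toy_edges, gamma=0.9)
toy_shaped = shape_rewards(toy_edges, toy_pots, gamma=0.9)
print(toy_pots)
print(toy_shaped)
```

In this toy graph the self-loops in states 0 and 1 receive a negative shaped reward while the edges that make progress do not, so the agent gets feedback at every step instead of only on the final transition.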

Back to training models

In [None]:
model_vel5_rs = get_ddpg_trained_model('ddpg_cheetah_vel5_rs',
                                       'MlpPolicy',
                                       fixed_vel_env_rs,
                                       timesteps=[1e5] * 2)  # 2 rounds of 100,000

In [None]:
%tensorboard --logdir ./logs/tensorboard/ddpg_cheetah_vel5_rs

In [None]:
animate_env(model_vel5_rs)  # animate_env takes the model and reads the env from it

In [None]:
fixed_vel_env_rs_graph_input = RMEnvWrapper(fixed_vel_env, rm_fn,
                                            rm_observations=True,
                                            change_rms_on_reset=False)

In [None]:
## TODO support policy_kwargs in "get_ddpg_trained_model" function
policy_kwargs = dict(
    features_extractor_class=RMFeatureExtractorSB
)

model_vel_rs_gnn = DDPG('MultiInputPolicy',
                        fixed_vel_env_rs_graph_input,
                        verbose=1,
                        tensorboard_log="./ddpg_cheetah_vel5.0_rs_gnn_tensorboard/",
                        policy_kwargs=policy_kwargs,
                        batch_size=1)
model_vel_rs_gnn.learn(total_timesteps=100_000, tb_log_name="first_run", reset_num_timesteps=False)
model_vel_rs_gnn.learn(total_timesteps=100_000, tb_log_name="second_run", reset_num_timesteps=False)
model_vel_rs_gnn.learn(total_timesteps=100_000, tb_log_name="third_run", reset_num_timesteps=False)
model_vel_rs_gnn.learn(total_timesteps=100_000, tb_log_name="fourth_run", reset_num_timesteps=False)

In [None]:
%tensorboard --logdir ./ddpg_cheetah_vel5.0_rs_gnn_tensorboard/

In [None]:
animate_env(model_vel_rs_gnn)  # animate_env takes the model and reads the env from it