#### Note: This example is compatible with MBRL-Lib v0.1.5 or lower. For the most recent version, see [this notebook](https://github.com/facebookresearch/mbrl-lib/blob/main/notebooks/pets_example.ipynb).

In [8]:
%%bash

git clone https://github.com/ScorcaF/imitation # && git checkout 0861607f146457e3e086ee91c362c39aeac1d8c4
cd imitation
pip install -e . 
# the path is relative to the repository root, since we already cd'ed into it
mv src/imitation /usr/local/lib/python3.7/dist-packages/
 

# pip install mbrl
# pip install omegaconf
# apt-get install swig
# pip install matplotlib==3.1.1
# # install required system dependencies
# apt-get install -y xvfb x11-utils

# install required python dependencies (might need to install additional gym extras depending)
# pip install gym[box2d]==0.17.* pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*

# pip3 install box2d-py
# pip3 install gym[Box_2D]
# pip install stable_baselines3


In [13]:
%%bash
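git clone https://github.com/ScorcaF/mbrl-lib.git
cd mbrl-lib/
# check out the branch from inside the cloned repository
git checkout ScorcaF-mods
pip install -e ".[dev]"
# the path is relative to the repository root, since we already cd'ed into it
mv mbrl /usr/local/lib/python3.7/dist-packages/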

In [17]:
from IPython import display
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import torch
import omegaconf

import mbrl.env.cartpole_continuous as cartpole_env
import mbrl.env.reward_fns as reward_fns
import mbrl.env.termination_fns as termination_fns
import mbrl.models as models
import mbrl.planning as planning
import mbrl.util.common as common_util
import mbrl.util as util


%load_ext autoreload
%autoreload 2

mpl.rcParams.update({"font.size": 16})

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'



# Creating the environment

First we instantiate the environment and specify which reward function and termination function to use with the gym-like environment wrapper, along with some utility objects. The termination function tells the wrapper whether an observation should end the episode, and some algorithms, like [MBPO](https://github.com/JannerM/mbpo/blob/master/mbpo/static/halfcheetah.py), take it as an input. The reward function computes the reward for a given observation, and it is used by some algorithms, like [PETS](https://github.com/kchua/handful-of-trials/blob/77fd8802cc30b7683f0227c90527b5414c0df34c/dmbrl/controllers/MPC.py#L65).

In [18]:
seed = 0
env = cartpole_env.CartPoleEnv()
env.seed(seed)
rng = np.random.default_rng(seed=seed)
generator = torch.Generator(device=device)
generator.manual_seed(seed)
obs_shape = env.observation_space.shape
act_shape = env.action_space.shape

# This function allows the model to evaluate the true rewards given an observation
reward_fn = reward_fns.cartpole
# This function allows the model to know if an observation should make the episode end
term_fn = termination_fns.cartpole
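
To make the role of the termination function concrete, here is a minimal pure-Python sketch of a cart-pole-style termination check. The name `cartpole_done`, the thresholds, and the observation layout `[x, x_dot, theta, theta_dot]` are assumptions for illustration. The function in `mbrl.env.termination_fns` works on batched tensors and may differ in detail.

```python
import math
from typing import List, Sequence

# Assumed limits for illustration: cart position and pole angle (radians).
X_THRESHOLD = 2.4
THETA_THRESHOLD = 12 * 2 * math.pi / 360  # 12 degrees


def cartpole_done(next_obs_batch: Sequence[Sequence[float]]) -> List[bool]:
    """Return one done flag per observation in the batch."""
    dones = []
    for x, _x_dot, theta, _theta_dot in next_obs_batch:
        in_bounds = abs(x) < X_THRESHOLD and abs(theta) < THETA_THRESHOLD
        dones.append(not in_bounds)
    return dones


batch = [
    [0.0, 0.0, 0.0, 0.0],  # upright and centred: keep going
    [2.5, 0.0, 0.0, 0.0],  # cart left the track: terminate
    [0.0, 0.0, 0.3, 0.0],  # pole tilted about 17 degrees: terminate
]
print(cartpole_done(batch))  # [False, True, True]
```

A model-based algorithm can call such a function on predicted observations to decide when an imagined rollout should stop, without querying the real environment.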

# Hydra configuration

MBRL-Lib uses [Hydra](https://github.com/facebookresearch/hydra) to manage configurations. For the purpose of this example, you can think of the configuration object as a dictionary with key/value pairs--and equivalent attributes--that specify the model and algorithmic options. Our toolbox expects the configuration object to be organized as follows:

In [19]:
trial_length = 200
num_trials = 10
ensemble_size = 5

# Everything with "???" indicates an option with a missing value.
# Our utility functions will fill in these details using the 
# environment information
cfg_dict = {
    # dynamics model configuration
    "dynamics_model": {
        "model": 
        {
            "_target_": "mbrl.models.GaussianMLP",
            "device": device,
            "num_layers": 3,
            "ensemble_size": ensemble_size,
            "hid_size": 200,
            "in_size": "???",
            "out_size": "???",
            "deterministic": False,
            "propagation_method": "fixed_model",
            # can also configure activation function for GaussianMLP
            "activation_fn_cfg": {
                "_target_": "torch.nn.LeakyReLU",
                "negative_slope": 0.01
            }
        }
    },
    # algorithm options, including the SAC agent configuration
    "algorithm": {
        
        "agent" :{
          "_target_": "mbrl.third_party.pytorch_sac_pranz24.sac.SAC",
          "num_inputs": "???",
          "action_space" :
            {
              "_target_": "gym.env.Box",
              "low": "???",
              "high":"???",
              "shape": "???" },

          "args": {
            "gamma": 0.99,
            "tau": 0.005,
            "alpha": 0.2,
            "policy": "Gaussian",
            "target_update_interval": 4,
            "automatic_entropy_tuning": True,
            "target_entropy": -0.05,
            "hidden_size": 256,
            "lr": 0.0003,
            "batch_size": 256,
            "device": "cpu"
        }},

        "normalize": True,
        "normalize_double_precision": True,
        "target_is_delta": True,
        "learned_rewards": True,
        "freq_train_model": 200,
        "real_data_ratio": 0.0,
        "sac_samples_action": True,
        "initial_exploration_steps": 5000,
        "random_initial_explore": False,
        "num_eval_episodes": 1},


    # these are experiment specific options
    "overrides": {
        "num_steps": 500, #5000
        "epoch_length": 20, #200
        "num_elites": 5,
        "patience": 5,
        "model_lr": 0.001,
        "model_wd": 0.00005,
        "model_batch_size": 256,
        "validation_ratio": 0.2,
        "freq_train_model": 200,
        "effective_model_rollouts_per_step": 400,
        "rollout_schedule": [1, 15, 1, 1],
        "num_sac_updates_per_step": 20,
        "sac_updates_every_steps": 1,
        "num_epochs_to_retain_sac_buffer": 1,

        

        "sac_gamma": 0.99,
        "sac_tau": 0.005,
        "sac_alpha": 0.2,
        "sac_policy": "Gaussian",
        "sac_target_update_interval": 4,
        "sac_automatic_entropy_tuning": True,
        "sac_target_entropy": -0.05,
        "sac_hidden_size": 256,
        "sac_lr": 0.0003,
        "sac_batch_size": 256
}}
cfg = omegaconf.OmegaConf.create(cfg_dict)
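
As a rough illustration of how the `"???"` entries get resolved, the sketch below fills in the model input and output sizes from the environment shapes. The helper `fill_model_sizes` is hypothetical, and the sizing rule (observation plus action as input, observation plus an optional reward dimension as output) is an assumption based on how the model is used in this example. The library's own utilities do this for you.

```python
from typing import Dict, Tuple

MISSING = "???"  # marker used for missing values in the configuration


def fill_model_sizes(
    model_cfg: Dict[str, object],
    obs_shape: Tuple[int, ...],
    act_shape: Tuple[int, ...],
    learned_rewards: bool,
) -> Dict[str, object]:
    """Return a copy of model_cfg with missing in/out sizes filled in."""
    filled = dict(model_cfg)
    if filled.get("in_size") == MISSING:
        # The model consumes an observation concatenated with an action.
        filled["in_size"] = obs_shape[0] + act_shape[0]
    if filled.get("out_size") == MISSING:
        # It predicts the next observation, plus one reward value if learned.
        filled["out_size"] = obs_shape[0] + int(learned_rewards)
    return filled


model_cfg = {"hid_size": 200, "in_size": MISSING, "out_size": MISSING}
# Cart-pole-like shapes, assumed here: 4-D observation, 1-D action.
filled_cfg = fill_model_sizes(model_cfg, (4,), (1,), learned_rewards=True)
print(filled_cfg)  # {'hid_size': 200, 'in_size': 5, 'out_size': 5}
```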

<div class="alert alert-block alert-info"><b>Note: </b> This example uses a probabilistic ensemble. You can also use a fully deterministic model with class GaussianMLP by setting ensemble_size=1, and deterministic=True. </div>

# Creating a dynamics model

Given the configuration above, the following two lines of code create a wrapper for 1-D transition reward models, and a gym-like environment that wraps it, which we can use for simulating the real environment. The 1-D model wrapper takes care of creating input/output data tensors to the underlying NN model (by concatenating observations, actions and rewards appropriately), normalizing the input data to the model, and other data processing tasks (e.g., converting observation targets to deltas with respect to the input observation).

In [None]:
# # Create a 1-D dynamics model for this environment
# dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)

# # Create a gym-like environment to encapsulate the model
# model_env = models.ModelEnv(env, dynamics_model, term_fn, reward_fn, generator=generator)
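
The data processing described above can be sketched in plain Python. The helper names below are hypothetical, and the layout (observation followed by action as input, observation delta followed by reward as target) is an assumption used for illustration rather than the wrapper's actual code.

```python
from typing import List


def make_model_input(obs: List[float], action: List[float]) -> List[float]:
    """Concatenate observation and action into one flat input vector."""
    return list(obs) + list(action)


def make_model_target(
    obs: List[float], next_obs: List[float], reward: float, target_is_delta: bool
) -> List[float]:
    """Build the regression target: next observation (or delta) plus reward."""
    if target_is_delta:
        obs_target = [n - o for n, o in zip(next_obs, obs)]
    else:
        obs_target = list(next_obs)
    return obs_target + [reward]


def normalize(x: List[float], mean: List[float], std: List[float]) -> List[float]:
    """Standardize each input dimension with precomputed statistics."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]


obs, action = [0.0, 1.0, 0.5, -1.0], [0.25]
next_obs, reward = [0.5, 1.0, 0.25, -0.5], 1.0

model_in = make_model_input(obs, action)
target = make_model_target(obs, next_obs, reward, target_is_delta=True)
print(model_in)  # [0.0, 1.0, 0.5, -1.0, 0.25]
print(target)    # [0.5, 0.0, -0.25, 0.5, 1.0]
print(normalize(model_in, [0.0] * 5, [0.5] * 5))  # [0.0, 2.0, 1.0, -2.0, 0.5]
```

Predicting deltas instead of absolute next observations keeps the targets small and centred, which tends to make the regression problem easier.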

# Create a replay buffer

We can create a replay buffer for this environment and configuration using the following method:

In [None]:
# replay_buffer = common_util.create_replay_buffer(cfg, obs_shape, act_shape, rng=rng)

We can now populate the replay buffer with random trajectories of a desired length, using a single function call to `common_util.rollout_agent_trajectories`. Note that we pass an agent of type `planning.RandomAgent` to generate the actions; however, this method accepts any agent that is a subclass of `planning.Agent`, so exploration strategies can be swapped with minimal changes to the code.

In [None]:
# common_util.rollout_agent_trajectories(
#     env,
#     trial_length, # initial exploration steps
#     planning.RandomAgent(env),
#     {}, # keyword arguments to pass to agent.act()
#     replay_buffer=replay_buffer,
#     trial_length=trial_length
# )

# print("# samples stored", replay_buffer.num_stored)

# samples stored 200
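
The collection step can be pictured with a small self-contained sketch. Everything here (`ToyEnv`, `collect`, `random_policy`) is hypothetical and only mirrors the idea of stepping an environment with an interchangeable policy until a fixed number of transitions has been stored. It is not the library's implementation.

```python
import random
from typing import Callable, List, Tuple

# (obs, action, next_obs, reward, done)
Transition = Tuple[float, float, float, float, bool]


class ToyEnv:
    """Hypothetical 1-D environment: the state drifts by the action each step."""

    def reset(self) -> float:
        self.state = 0.0
        return self.state

    def step(self, action: float) -> Tuple[float, float, bool]:
        self.state += action
        done = abs(self.state) > 1.0  # episode ends when the state leaves [-1, 1]
        return self.state, 1.0, done


def collect(
    env: ToyEnv, policy: Callable[[float], float], num_steps: int, trial_length: int
) -> List[Transition]:
    """Store num_steps transitions, resetting on done or after trial_length steps."""
    buffer: List[Transition] = []
    obs, steps_in_trial = env.reset(), 0
    while len(buffer) < num_steps:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, next_obs, reward, done))
        steps_in_trial += 1
        if done or steps_in_trial >= trial_length:
            obs, steps_in_trial = env.reset(), 0
        else:
            obs = next_obs
    return buffer


rng = random.Random(0)


def random_policy(obs: float) -> float:
    """Ignore the observation and sample a uniform action, like a random agent."""
    return rng.uniform(-0.5, 0.5)


data = collect(ToyEnv(), random_policy, num_steps=200, trial_length=200)
print("# samples stored", len(data))  # samples stored 200
```

Because `collect` only needs a callable, replacing `random_policy` with a learned policy changes the exploration strategy without touching the collection loop.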


# SAC Agent


In [None]:
# agent_cfg = omegaconf.OmegaConf.create({
    
#   "_target_": "mbrl.third_party.pytorch_sac_pranz24.sac.SAC",
#   "num_inputs": "???",
#   "action_space" :
#    {
#     "_target_": "gym.env.Box",
#     "low": "???",
#     "high":"???",
#     "shape": "???" } ,

#   "args": {
#     "gamma": 0.99,
#     "tau": 0.005,
#     "alpha": 0.2,
#     "policy": "Gaussian",
#     "target_update_interval": 4,
#     "automatic_entropy_tuning": True,
#     "target_entropy": -0.05,
#     "hidden_size": 256,
#     "lr": 0.0003,
#     "batch_size": 256,
#     "device": "cpu"
# }})

# agent = planning.create_trajectory_optim_agent_for_model(
#     model_env,
#     agent_cfg,
#     num_particles=20
# )

# MBPO utilities

In [21]:
import os
from typing import Optional, Sequence, cast

import gym
import hydra.utils
import numpy as np
import omegaconf
import torch

import mbrl.constants
import mbrl.models
import mbrl.planning
import mbrl.third_party.pytorch_sac_pranz24 as pytorch_sac_pranz24 # Missing when install mbrl
import mbrl.types
import mbrl.util
import mbrl.util.common
import mbrl.util.math
from mbrl.planning.sac_wrapper import SACAgent
from mbrl.third_party.pytorch_sac import VideoRecorder

MBPO_LOG_FORMAT = mbrl.constants.EVAL_LOG_FORMAT + [
    ("epoch", "E", "int"),
    ("rollout_length", "RL", "int"),
]


# NOTE: these helpers are called by the training loop below, so they must be defined.
def rollout_model_and_populate_sac_buffer(
    model_env: mbrl.models.ModelEnv,
    replay_buffer: mbrl.util.ReplayBuffer,
    agent: SACAgent,
    sac_buffer: mbrl.util.ReplayBuffer,
    sac_samples_action: bool,
    rollout_horizon: int,
    batch_size: int,
):
    # Start imagined rollouts from observations sampled from the real-data buffer
    batch = replay_buffer.sample(batch_size)
    initial_obs, *_ = cast(mbrl.types.TransitionBatch, batch).astuple()
    model_state = model_env.reset(
        initial_obs_batch=cast(np.ndarray, initial_obs),
        return_as_np=True,
    )
    accum_dones = np.zeros(initial_obs.shape[0], dtype=bool)
    obs = initial_obs
    for i in range(rollout_horizon):
        action = agent.act(obs, sample=sac_samples_action, batched=True)
        pred_next_obs, pred_rewards, pred_dones, model_state = model_env.step(
            action, model_state, sample=True
        )
        # Only store transitions from rollouts that have not terminated yet
        sac_buffer.add_batch(
            obs[~accum_dones],
            action[~accum_dones],
            pred_next_obs[~accum_dones],
            pred_rewards[~accum_dones, 0],
            pred_dones[~accum_dones, 0],
        )
        obs = pred_next_obs
        accum_dones |= pred_dones.squeeze()


def evaluate(
    env: gym.Env,
    agent: SACAgent,
    num_episodes: int,
    video_recorder: VideoRecorder,
) -> float:
    # Average undiscounted return over num_episodes; only the first episode is recorded
    avg_episode_reward = 0
    for episode in range(num_episodes):
        obs = env.reset()
        video_recorder.init(enabled=(episode == 0))
        done = False
        episode_reward = 0
        while not done:
            action = agent.act(obs)
            obs, reward, done, _ = env.step(action)
            video_recorder.record(env)
            episode_reward += reward
        avg_episode_reward += episode_reward
    return avg_episode_reward / num_episodes


def maybe_replace_sac_buffer(
    sac_buffer: Optional[mbrl.util.ReplayBuffer],
    obs_shape: Sequence[int],
    act_shape: Sequence[int],
    new_capacity: int,
    seed: int,
) -> mbrl.util.ReplayBuffer:
    # Create the buffer on first use, or resize it (keeping its data) when the capacity changes
    if sac_buffer is None or new_capacity != sac_buffer.capacity:
        if sac_buffer is None:
            rng = np.random.default_rng(seed=seed)
        else:
            rng = sac_buffer.rng
        new_buffer = mbrl.util.ReplayBuffer(new_capacity, obs_shape, act_shape, rng=rng)
        if sac_buffer is None:
            return new_buffer
        obs, action, next_obs, reward, done = sac_buffer.get_all().astuple()
        new_buffer.add_batch(obs, action, next_obs, reward, done)
        return new_buffer
    return sac_buffer
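
The capacity handed to `maybe_replace_sac_buffer` depends on the rollout schedule and a few configuration values. The sketch below reproduces that arithmetic in plain Python. `truncated_linear` here is a stand-in written for illustration, assuming the argument order `(min_x, max_x, min_y, max_y, x)` implied by the call in the training loop, and `sac_buffer_capacity` is a hypothetical helper.

```python
import math


def truncated_linear(
    min_x: float, max_x: float, min_y: float, max_y: float, x: float
) -> float:
    """Interpolate linearly from (min_x, min_y) to (max_x, max_y), clamped outside."""
    if x <= min_x:
        return min_y
    if x >= max_x:
        return max_y
    return (x - min_x) / (max_x - min_x) * (max_y - min_y) + min_y


def sac_buffer_capacity(
    rollout_length: int,
    rollouts_per_step: int,
    freq_train_model: int,
    epoch_length: int,
    epochs_to_retain: int,
) -> int:
    """Number of imagined transitions the SAC buffer must be able to hold."""
    rollout_batch_size = rollouts_per_step * freq_train_model
    trains_per_epoch = math.ceil(epoch_length / freq_train_model)
    return rollout_length * rollout_batch_size * trains_per_epoch * epochs_to_retain


# Values taken from the configuration in this notebook.
schedule = [1, 15, 1, 1]
lengths = [int(truncated_linear(*schedule, epoch + 1)) for epoch in range(3)]
print(lengths)  # [1, 1, 1]
capacity = sac_buffer_capacity(1, 400, 200, 20, 1)
print(capacity)  # 80000
```

With the schedule `[1, 15, 1, 1]` the rollout length stays at 1 for every epoch, so the buffer is created once and never resized. A schedule whose last two entries differ would lengthen rollouts over time and trigger a resize.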

# Running MBPO

In [22]:
# ------------------- Initialization -------------------
debug_mode = cfg.get("debug_mode", False)

obs_shape = env.observation_space.shape
act_shape = env.action_space.shape

mbrl.planning.complete_agent_cfg(env, cfg.algorithm.agent)
agent = SACAgent(
    cast(pytorch_sac_pranz24.SAC, hydra.utils.instantiate(cfg.algorithm.agent))
)

work_dir = os.getcwd()
# enable_back_compatible to use pytorch_sac agent
logger = mbrl.util.Logger(work_dir, enable_back_compatible=True)
logger.register_group(
    mbrl.constants.RESULTS_LOG_NAME,
    MBPO_LOG_FORMAT,
    color="green",
    dump_frequency=1,
)
save_video = cfg.get("save_video", False)
video_recorder = VideoRecorder(work_dir if save_video else None)

# cfg defines no top-level "seed" or "device" keys, so reuse the values set earlier
rng = np.random.default_rng(seed=seed)
torch_generator = torch.Generator(device=device)
torch_generator.manual_seed(seed)

In [None]:
# -------------- Create initial dataset --------------
dynamics_model = mbrl.util.common.create_one_dim_tr_model(cfg, obs_shape, act_shape)
use_double_dtype = cfg.algorithm.get("normalize_double_precision", False)
dtype = np.double if use_double_dtype else np.float32
replay_buffer = mbrl.util.common.create_replay_buffer(
    cfg,
    obs_shape,
    act_shape,
    rng=rng,
    obs_type=dtype,
    action_type=dtype,
    reward_type=dtype,
)
random_explore = cfg.algorithm.random_initial_explore
mbrl.util.common.rollout_agent_trajectories(
    env,
    cfg.algorithm.initial_exploration_steps,
    mbrl.planning.RandomAgent(env) if random_explore else agent,
    {} if random_explore else {"sample": True, "batched": False},
    replay_buffer=replay_buffer,
)


In [None]:
silent = False
test_env = env
# --------------------- Training Loop ---------------------
rollout_batch_size = (
    cfg.overrides.effective_model_rollouts_per_step * cfg.algorithm.freq_train_model
)
trains_per_epoch = int(
    np.ceil(cfg.overrides.epoch_length / cfg.overrides.freq_train_model)
)
updates_made = 0
env_steps = 0
model_env = mbrl.models.ModelEnv(
    env, dynamics_model, term_fn, None, generator=torch_generator
)
model_trainer = mbrl.models.ModelTrainer(
    dynamics_model,
    optim_lr=cfg.overrides.model_lr,
    weight_decay=cfg.overrides.model_wd,
    logger=None if silent else logger,
)
best_eval_reward = -np.inf
epoch = 0
sac_buffer = None
while env_steps < cfg.overrides.num_steps:
    rollout_length = int(
        mbrl.util.math.truncated_linear(
            *(cfg.overrides.rollout_schedule + [epoch + 1])
        )
    )
    sac_buffer_capacity = rollout_length * rollout_batch_size * trains_per_epoch
    sac_buffer_capacity *= cfg.overrides.num_epochs_to_retain_sac_buffer
    sac_buffer = maybe_replace_sac_buffer(
        sac_buffer, obs_shape, act_shape, sac_buffer_capacity, seed
    )
    obs, done = None, False
    for steps_epoch in range(cfg.overrides.epoch_length):
        if steps_epoch == 0 or done:
            obs, done = env.reset(), False
        # --- Doing env step and adding to model dataset ---
        next_obs, reward, done, _ = mbrl.util.common.step_env_and_add_to_buffer(
            env, obs, agent, {}, replay_buffer
        )

        # --------------- Model Training -----------------
        if (env_steps + 1) % cfg.overrides.freq_train_model == 0:
            mbrl.util.common.train_model_and_save_model_and_data(
                dynamics_model,
                model_trainer,
                cfg.overrides,
                replay_buffer,
                work_dir=work_dir,
            )

            # --------- Rollout new model and store imagined trajectories --------
            # Batch all rollouts for the next freq_train_model steps together
            rollout_model_and_populate_sac_buffer(
                model_env,
                replay_buffer,
                agent,
                sac_buffer,
                cfg.algorithm.sac_samples_action,
                rollout_length,
                rollout_batch_size,
            )

            if debug_mode:
                print(
                    f"Epoch: {epoch}. "
                    f"SAC buffer size: {len(sac_buffer)}. "
                    f"Rollout length: {rollout_length}. "
                    f"Steps: {env_steps}"
                )

        # --------------- Agent Training -----------------
        for _ in range(cfg.overrides.num_sac_updates_per_step):
            use_real_data = rng.random() < cfg.algorithm.real_data_ratio
            which_buffer = replay_buffer if use_real_data else sac_buffer
            if (env_steps + 1) % cfg.overrides.sac_updates_every_steps != 0 or len(
                which_buffer
            ) < cfg.overrides.sac_batch_size:
                break  # only update every once in a while

            agent.sac_agent.update_parameters(
                which_buffer,
                cfg.overrides.sac_batch_size,
                updates_made,
                logger,
                reverse_mask=True,
            )
            updates_made += 1
            # if not silent and updates_made % cfg.log_frequency_agent == 0:
            #     logger.dump(updates_made, save=True)

        # ------ Epoch ended (evaluate and save model) ------
        if (env_steps + 1) % cfg.overrides.epoch_length == 0:
            avg_reward = evaluate(
                test_env, agent, cfg.algorithm.num_eval_episodes, video_recorder
            )
            logger.log_data(
                mbrl.constants.RESULTS_LOG_NAME,
                {
                    "epoch": epoch,
                    "env_step": env_steps,
                    "episode_reward": avg_reward,
                    "rollout_length": rollout_length,
                },
            )
            if avg_reward > best_eval_reward:
                video_recorder.save(f"{epoch}.mp4")
                best_eval_reward = avg_reward
                agent.sac_agent.save_checkpoint(
                    ckpt_path=os.path.join(work_dir, "sac.pth")
                )
            epoch += 1

        env_steps += 1
        obs = next_obs

Group model_train has already been registered.
| model_train | I: 0 | E: 0 | TD: 4000 | VD: 1000 | MLOSS: 237.9728 | MVSCORE: 0.0010 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 1 | TD: 4000 | VD: 1000 | MLOSS: -12.8112 | MVSCORE: 0.0005 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 2 | TD: 4000 | VD: 1000 | MLOSS: -34.4008 | MVSCORE: 0.0002 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 3 | TD: 4000 | VD: 1000 | MLOSS: -37.1748 | MVSCORE: 0.0001 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 4 | TD: 4000 | VD: 1000 | MLOSS: -38.8702 | MVSCORE: 0.0001 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 5 | TD: 4000 | VD: 1000 | MLOSS: -40.6386 | MVSCORE: 0.0001 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 6 | TD: 4000 | VD: 1000 | MLOSS: -42.3612 | MVSCORE: 0.0001 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 7 | TD: 4000 | VD: 1000 | MLOSS: -43.6501 | MVSCORE: 0.0000 | MBVSCORE: 0.0000
| model_train | I: 0 | E: 8 | TD

In [25]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import gym
import mbrl.env.cartpole_continuous as cartpole_env


env = cartpole_env.CartPoleEnv()

expert = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0)
expert.learn(1_000)  # Note: set to 100000 to train a proficient expert


from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv

rollouts = rollout.rollout(
    expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(env)] * 5),
    rollout.make_sample_until(min_timesteps=None, min_episodes=60),
)

In [57]:
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm


from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv


#Need to manage vec envs
venv = DummyVecEnv([lambda: env] * 1)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

model_trainer = models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)

# RuntimeError: BufferingWrapper reset() before samples were accessed, n_stored = 200*times_execution
# common_util.rollout_agent_trajectories(
#     env,
#     trial_length, # initial exploration steps
#     planning.RandomAgent(env),
#     {}, # keyword arguments to pass to agent.act()
#     replay_buffer=replay_buffer,
#     trial_length=trial_length)


gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=agent,
    reward_net=reward_net,
    cfg=cfg,
    model_trainer=model_trainer,
    dynamics_model=dynamics_model,
    replay_buffer=replay_buffer,
    gen_train_timesteps=2000,
    allow_variable_horizon=True,  # see https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html
    term_fn=term_fn,
    torch_generator=torch_generator
)




Running with `allow_variable_horizon` set to True. Some algorithms are biased towards shorter or longer episodes, which may significantly confound results. Additionally, even unbiased algorithms can exploit the information leak from the termination condition, producing spuriously high performance. See https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html for more information.


In [58]:
gail_trainer.train(20000)  # Note: set to 300000 for better results

round:   0%|          | 0/10 [00:00<?, ?it/s]

Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth
Saving models to /content/sac.pth


round:   0%|          | 0/10 [02:15<?, ?it/s]


AttributeError: ignored