# Tutorial: Begin to Develop Agents for Offline Reinforcement Learning

This notebook gives a short tutorial on how this offline RL project runs. Here is the outline:
   1. Dataset: D4RL
   2. Agent
   3. Behavior Cloning Example

## Offline Reinforcement Learning: a brief review

Let's think about playing a game, such as badminton:

A **reward** is positive or negative feedback, such as winning or losing.

A **policy** takes observations (such as your opponent's position, your own position, your stamina, ...) and produces **actions** (how to hit the shuttlecock, how to move) to maximize the expected reward.

Reinforcement learning (RL) aims to learn a **policy** that attains the maximal expected **reward**.

Online RL aims to learn the **policy** by playing badminton yourself. You iterate "play -> win/lose -> summarize why you won/lost -> play ...".

Offline RL aims to learn the **policy** from a **dataset** (such as watching videos of the world champion Lin Dan at home), and then uses the learned policy afterwards. You only do "learn -> play".

In this project, we only consider offline RL, which aims to learn a data-driven decision maker. For a more comprehensive review, you can read this [paper](https://arxiv.org/abs/2005.01643).

## 1. D4RL dataset

[D4RL](https://github.com/digital-brain-sh/d4rl) provides standardized environments and datasets for training and benchmarking **offline RL** algorithms.

D4RL can be installed by cloning the repository as follows:

```
git clone https://github.com/rail-berkeley/d4rl.git
cd d4rl
pip install -e .
```

The installation also installs the [MuJoCo](https://github.com/google-deepmind/mujoco) physics engine. If it does not, you need to install MuJoCo yourself for this project.
Suppose you have downloaded and installed d4rl in your root repository. You can then load the dataset using the **gym** and **d4rl** packages:

In [1]:
import gym
import d4rl # Import required to register environments
env_name = "hopper-medium-v2"
# Create the environment
env = gym.make(env_name)

# Use d4rl.qlearning_dataset which adds next_observations.
dataset = d4rl.qlearning_dataset(env)
print(dataset['observations'].shape)  # Number of instances x Observation dimension

pybullet build time: Aug 15 2022 11:36:51
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
load datafile: 100%|██████████| 21/21 [00:01<00:00, 10.68it/s]


(999998, 11)


In this project, the dataset loading is wrapped for you in ```dataset.make_env_and_dataset```. This function loads the dataset, normalizes the rewards as is done in most research papers, and creates a wrapped ```env``` object to interact with.

In [2]:
from dataset import make_env_and_dataset

env_name = "hopper-medium-v2"

env, dataset = make_env_and_dataset(env_name, seed = 520)

  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
  deprecation(
load datafile: 100%|██████████| 21/21 [00:01<00:00, 10.66it/s]
split to trajectories: 100%|██████████| 999998/999998 [00:01<00:00, 658216.07it/s]
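
The notebook does not show how the rewards are normalized. As a minimal sketch, assuming a scheme that is common in offline RL codebases (rescale rewards so that the gap between the best and worst trajectory return equals 1000), the idea looks roughly like this. The helper names ```split_returns``` and ```normalize_rewards``` are hypothetical, and the actual ```make_env_and_dataset``` may do something different.

```python
import numpy as np


def split_returns(rewards, dones):
    """Sum rewards per trajectory; a trajectory ends wherever dones[i] is True.

    A trailing trajectory that never reaches done=True is ignored.
    """
    returns, running = [], 0.0
    for r, d in zip(rewards, dones):
        running += float(r)
        if d:
            returns.append(running)
            running = 0.0
    return returns


def normalize_rewards(rewards, dones, scale=1000.0):
    """Rescale rewards so that (best return - worst return) equals `scale`.

    Assumes at least two trajectories with different returns.
    """
    returns = split_returns(rewards, dones)
    span = max(returns) - min(returns)
    return np.asarray(rewards, dtype=np.float64) * scale / span


# Toy data (not from D4RL): two trajectories with returns 3.0 and 1.0
rewards = np.array([1.0, 2.0, 0.5, 0.5])
dones = np.array([False, True, False, True])
normalized = normalize_rewards(rewards, dones)
print(normalized)
```

Rescaling of this kind changes only the magnitude of the rewards, not the ranking of trajectories, which tends to make learning rates and loss scales easier to share across datasets.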


The lowered-precision and deprecation warnings printed while loading can be safely ignored. In this project, we only consider the **"hopper-medium-v2"** dataset. Loading the data for the first time may take a while because the dataset has to be downloaded. You can sample a batch of transitions using ```dataset.sample(batch_size=batch_size)```. Each batch includes a binary **mask**: mask=0 if the transition ends the episode and mask=1 otherwise.

In [3]:
batch_size = 256
sample_batch = dataset.sample(batch_size=batch_size)
print("Number of transitions = ", dataset.size)
print("Batch size = ", batch_size)
print("observations shape = ", sample_batch.observations.shape)
print("actions shape = ", sample_batch.actions.shape)
print("rewards shape = ", sample_batch.rewards.shape)
print("next_observations shape = ", sample_batch.next_observations.shape)
print("masks (1-done) shape = ", sample_batch.masks.shape)

Number of transitions =  999998
Batch size =  256
observations shape =  (256, 11)
actions shape =  (256, 3)
rewards shape =  (256,)
next_observations shape =  (256, 11)
masks (1-done) shape =  (256,)
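
Behavior cloning does not need the masks, but value-based offline RL methods typically do. The sketch below shows the usual role of the mask in a bootstrapped target, where the future-value term is dropped for transitions that end an episode. The discount ```gamma``` and the ```next_values``` array are illustrative assumptions standing in for a learned value estimate; they are not part of this project's API.

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative value)

rewards = np.array([1.0, 0.5, 2.0])
masks = np.array([1.0, 1.0, 0.0])            # the last transition ends its episode
next_values = np.array([10.0, 20.0, 30.0])   # stand-in for a learned estimate of V(s')

# Bootstrapped target: the future-value term is zeroed wherever mask == 0
targets = rewards + gamma * masks * next_values
print(targets)
```

For the last transition the target equals the reward alone, because nothing follows a finished episode.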


## 2. Agent

An Agent object is the entity that learns from the dataset and produces actions when interacting with the environment. You need to write your own agent by inheriting from the ```Agent``` class in ```rlagents.agent```. All you need to do is override the ```__init__```, ```update```, and ```sample_actions``` methods.

In [4]:
import numpy as np
from dataset import Batch
from typing import Dict, Any
InfoDict = Dict[str, Any]


class Agent(object):
    name = 'agent'
    
    def __init__(self, *args, **kwargs):
        # TODO: initialize the agent here, e.g., networks, optimizers, ...
        pass

    def update(self, batch: Batch) -> InfoDict:
        # TODO: update the agent. The batch contains five fields:
        # 'observations', 'actions', 'rewards', 'masks', 'next_observations'.
        raise NotImplementedError

    def sample_actions(self, observations: np.ndarray) -> np.ndarray:
        # TODO: produce actions for environment interaction.
        # A typical action is a np.ndarray vector with entries in [-1, 1].
        raise NotImplementedError

    def __str__(self):
        return self.__class__.__name__


For example, a random agent that requires no training can be defined as:

In [5]:
class RandomAgent(Agent):
    name = 'random agent without training'

    def __init__(self, action_space: gym.spaces.box.Box):
        self.action_space = action_space

    def update(self, batch) -> InfoDict:
        # Nothing to learn; return an empty info dict to match the return type
        return {}

    def sample_actions(self, observations):
        return self.action_space.sample()
    
rand_agent = RandomAgent(env.action_space)

Then you can evaluate the agent by:

In [6]:
from eval import evaluate
res = evaluate(rand_agent, env=env, num_episodes=10, render=False)
print(res)

{'mean': 1.3120474228487389, 'median': 1.2027454683943803, 'std': 0.4506554193728289, 'min': 0.8491734081854676, 'max': 2.4467042480069217, 'length': 26.7}
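
The keys of the result dictionary suggest what an evaluation routine of this kind computes. The following is a minimal sketch, not the project's ```evaluate```: ```ToyEnv```, ```ZeroAgent```, and ```evaluate_sketch``` are hypothetical names, the toy environment uses the older gym step API that returns four values, and the real function may additionally normalize returns or handle rendering.

```python
import numpy as np


class ToyEnv:
    """Hypothetical stand-in environment: reward 1.0 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return np.zeros(2), 1.0, done, {}


class ZeroAgent:
    """Hypothetical agent that always outputs a zero action."""

    def sample_actions(self, observations):
        return np.zeros(1)


def evaluate_sketch(agent, env, num_episodes):
    """Roll out full episodes and summarize the episode returns and lengths."""
    returns, lengths = [], []
    for _ in range(num_episodes):
        observation, done = env.reset(), False
        episode_return, episode_length = 0.0, 0
        while not done:
            action = agent.sample_actions(observation)
            observation, reward, done, _ = env.step(np.clip(action, -1, 1))
            episode_return += reward
            episode_length += 1
        returns.append(episode_return)
        lengths.append(episode_length)
    return {
        'mean': float(np.mean(returns)),
        'median': float(np.median(returns)),
        'std': float(np.std(returns)),
        'min': float(np.min(returns)),
        'max': float(np.max(returns)),
        'length': float(np.mean(lengths)),
    }


stats = evaluate_sketch(ZeroAgent(), ToyEnv(horizon=5), num_episodes=3)
print(stats)
```

Averaging over several episodes matters because a single rollout can be unrepresentative, as the gap between the min and max values in the random-agent result above shows.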


## 3. Behavior Cloning Example

To make getting started easier, we provide behavior cloning (BC) agents as examples. A BC agent takes observations as input and is trained to predict the corresponding actions in the dataset. In our ```hopper-medium-v2``` example, the input observation has dimension 11 and the output action has dimension 3.
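
In other words, behavior cloning is supervised regression from observations to actions. The sketch below illustrates this with a linear policy trained by minibatch gradient descent on a squared-error loss, using synthetic data in place of D4RL. It is an illustration under these assumptions only: the provided BC learners use an MLP, and their exact loss is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, n = 11, 3, 512

# Synthetic "dataset": actions are a fixed linear function of the observations
W_true = rng.normal(size=(obs_dim, act_dim))
observations = rng.normal(size=(n, obs_dim))
actions = observations @ W_true

W = np.zeros((obs_dim, act_dim))  # parameters of the linear policy
lr = 0.1
for step in range(500):
    idx = rng.integers(0, n, size=64)          # sample a minibatch
    obs_b, act_b = observations[idx], actions[idx]
    pred = obs_b @ W                           # predicted actions
    # Gradient of the squared error, summed over action dims and averaged over the batch
    grad = 2.0 * obs_b.T @ (pred - act_b) / len(idx)
    W -= lr * grad

# Mean squared error per action entry over the whole dataset
final_loss = float(np.mean((observations @ W - actions) ** 2))
print(final_loss)
```

Note that rewards and next observations are never used: BC only imitates the actions in the data, so its performance is bounded by the quality of the behavior that generated the dataset.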

We provide two versions of the agent for you to refer to: one written in [PyTorch](https://pytorch.org/get-started/locally/) and one in [JAX](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html) & [Flax](https://flax.readthedocs.io/en/latest/getting_started.html). The JAX agent is slightly faster than the torch agent, while the torch agent is more readable. You can also use TensorFlow if you like; there is no constraint on which deep learning package you use. However, do not use ready-made learners written by other people. Try to write your own training/testing process.


Here we use the torch agent as the example.

First, some imports. If you get a package-not-found error, install the missing package.

In [7]:
import os
import numpy as np
import torch
import gym
from collections import deque
from tensorboardX import SummaryWriter
from dataset import make_env_and_dataset
from tqdm import trange
from agents import TorchBCLearner, JAXBCLearner
from eval import eval_agent, STATISTICS
from utils import prepare_output_dir, set_torch_seed
# for tensorboard visualization in jupyter notebook only
%load_ext tensorboard

Prepare for recording the results: set the seed, create the save folder, and create the summary writer.

In [8]:
# set the seed
seed = 520
set_torch_seed(seed)

# create a saving directory
save_dir = prepare_output_dir(suffix="Behavior-Cloning")
with open(os.path.join(save_dir, f"seed_{seed}.txt"), "w") as f:
    print("\t".join(["steps"] + STATISTICS), file=f)
summary_writer = SummaryWriter(os.path.join(save_dir, 'tensorboard', f'seed={seed}'))
print(f"Results are saved in '{save_dir}' ")

Results are saved in 'results/20240325-073144_Behavior-Cloning' 


Set the hyperparameters. Here we model the actions using an MLP with 3 hidden layers, each of dimension 256:

In [9]:
max_steps = 100000  # maximal number of training steps, 100000 is for this tutorial only and it's too short for most methods. You may need to try 1M~2M
eval_interval = 5000  # evaluate the agent every 'eval_interval' gradient steps
log_interval = 1000  # record the training statistics, such as loss every 'log_interval' gradient steps
num_eval_episodes = 10  # number of evaluation episodes for each evaluation. Should be >= 10 for stability
batch_size = 256
hidden_dims = (256, 256, 256)  # for MLP with 3 hidden layers, each of dim=256
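
To make the effect of ```hidden_dims``` concrete, the sketch below lists the weight shapes of a plain fully connected network for the 11-dimensional observations and 3-dimensional actions of ```hopper-medium-v2```, and counts its parameters. It assumes ordinary dense layers with biases and ignores any layer-norm parameters, so the provided learners may have a slightly different count.

```python
obs_dim, act_dim = 11, 3
hidden_dims = (256, 256, 256)

# Consecutive layer widths: input -> hidden layers -> output
sizes = (obs_dim, *hidden_dims, act_dim)
layer_shapes = list(zip(sizes[:-1], sizes[1:]))

# Each dense layer has n_in * n_out weights plus n_out biases
num_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

print(layer_shapes)
print(num_params)
```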

Fetch dataset and create corresponding env:

In [10]:
# fetch dataset and the corresponding environment
env_name = 'hopper-medium-v2'
env, dataset = make_env_and_dataset(env_name, seed=seed)

  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
  deprecation(
load datafile: 100%|██████████| 21/21 [00:01<00:00, 10.76it/s]
split to trajectories: 100%|██████████| 999998/999998 [00:01<00:00, 661684.08it/s]


Define the torch agent. You can use ```torch.device('cuda:0')``` if CUDA is available. For the JAX agent, you can use

```
agent = JAXBCLearner(seed=seed,
                     obs_dim=obs_dim,
                     act_dim=act_dim,
                     actor_lr=3e-4,
                     layer_norm=True,
                     hidden_dims=hidden_dims,
                     lr_decay_T=max_steps)
```

In [11]:
obs_dim = len(env.observation_space.sample())
act_dim = len(env.action_space.sample())
agent = TorchBCLearner(obs_dim=obs_dim,
                       act_dim=act_dim,
                       actor_lr=3e-4,
                       layer_norm=True,
                       hidden_dims=hidden_dims,
                       lr_decay_T=max_steps,
                       device=torch.device('cpu'))

Here is the main training and evaluation loop. The mean of the latest 5 evaluation returns is reported as the final performance.

In [12]:
# training
latest_mean_returns = deque(maxlen=5)  # track the performance of the latest 5 evaluation
for i in trange(max_steps):

    # evaluation
    if i % eval_interval == 0:
        eval_res = eval_agent(i, agent, env, summary_writer, save_dir, seed, num_eval_episodes)
        latest_mean_returns.append(eval_res['mean'])
        print(f"Step={i}, Eval Mean={eval_res['mean']}")

    # training process
    batch = dataset.sample(batch_size)
    update_info = agent.update(batch)

    # record the training information
    if i % log_interval == 0:
        for k, v in update_info.items():
            summary_writer.add_scalar(f'training/{k}', v, i)
        summary_writer.flush()

print(f"Final Mean Return={np.mean(latest_mean_returns)}")

  0%|          | 42/100000 [00:00<07:51, 211.78it/s]

Step=0, Eval Mean=0.868019780407904


  5%|▍         | 4994/100000 [00:21<06:45, 234.19it/s]

Step=5000, Eval Mean=45.02583104081061


 10%|█         | 10043/100000 [00:43<16:53, 88.75it/s]

Step=10000, Eval Mean=48.937190052511085


 15%|█▌        | 15030/100000 [01:06<14:31, 97.55it/s] 

Step=15000, Eval Mean=46.7221872956148


 20%|██        | 20030/100000 [01:28<16:53, 78.87it/s] 

Step=20000, Eval Mean=55.363668854270884


 25%|██▌       | 25049/100000 [01:50<14:23, 86.84it/s] 

Step=25000, Eval Mean=47.96478300354196


 30%|███       | 30048/100000 [02:12<12:23, 94.05it/s] 

Step=30000, Eval Mean=39.40810451253631


 35%|███▌      | 35038/100000 [02:31<09:50, 110.04it/s]

Step=35000, Eval Mean=48.99787741237678


 40%|████      | 40054/100000 [02:49<07:57, 125.54it/s]

Step=40000, Eval Mean=42.49380795069705


 45%|████▌     | 45044/100000 [03:07<07:17, 125.71it/s]

Step=45000, Eval Mean=41.77125900212759


 50%|█████     | 50050/100000 [03:24<06:31, 127.43it/s]

Step=50000, Eval Mean=45.46947967646627


 55%|█████▌    | 55039/100000 [03:42<06:13, 120.29it/s]

Step=55000, Eval Mean=48.07477505858807


 60%|██████    | 60033/100000 [04:00<05:12, 127.94it/s]

Step=60000, Eval Mean=42.96874110515497


 65%|██████▌   | 65029/100000 [04:17<05:40, 102.84it/s]

Step=65000, Eval Mean=44.6270015130715


 70%|███████   | 70044/100000 [04:35<04:27, 111.97it/s]

Step=70000, Eval Mean=48.215093696345875


 75%|███████▌  | 75035/100000 [04:53<03:56, 105.74it/s]

Step=75000, Eval Mean=50.62383224522444


 80%|████████  | 80046/100000 [05:11<02:43, 122.10it/s]

Step=80000, Eval Mean=44.46050057606199


 85%|████████▌ | 85032/100000 [05:29<01:59, 125.37it/s]

Step=85000, Eval Mean=45.93610000788423


 90%|█████████ | 90037/100000 [05:46<01:18, 127.50it/s]

Step=90000, Eval Mean=45.80896832047246


 95%|█████████▌| 95048/100000 [06:04<00:41, 119.32it/s]

Step=95000, Eval Mean=46.78829250340031


100%|██████████| 100000/100000 [06:21<00:00, 262.07it/s]

Final Mean Return=46.72353873060868
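
The final number is the average over a sliding window of the last five evaluations, which ```deque(maxlen=5)``` maintains automatically. The short sketch below replays the evaluation schedule of the loop and recomputes the final value from the last six evaluation means printed in the log above.

```python
from collections import deque

max_steps, eval_interval = 100000, 5000

# Steps at which `i % eval_interval == 0` holds inside `for i in range(max_steps)`
eval_steps = [i for i in range(max_steps) if i % eval_interval == 0]

# Evaluation means logged above for steps 70000, 75000, ..., 95000
logged = [
    48.215093696345875,
    50.62383224522444,
    44.46050057606199,
    45.93610000788423,
    45.80896832047246,
    46.78829250340031,
]

latest = deque(maxlen=5)
for value in logged:
    latest.append(value)  # once the deque is full, the oldest entry is dropped

final = sum(latest) / len(latest)
print(len(eval_steps), eval_steps[-1], round(final, 4))
```

The loop evaluates before each of steps 0, 5000, ..., 95000, so the agent obtained after the very last gradient step is not evaluated; the reported value reflects steps 75000 to 95000.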





After a few minutes of training, you should get a BC agent with a final mean return of around 50. To view the training curves, you can use ```tensorboard```.

In [13]:
%tensorboard --logdir {save_dir}

Launching TensorBoard...

You can also record a video of how the agent controls the hopper:

In [14]:
# create the video
env = gym.make("Hopper-v2")
env = gym.wrappers.RecordVideo(env, save_dir)
observation, done = env.reset(), False
while not done:
    action = agent.sample_actions(observation)  # eval takes argmax from actor net
    observation, _, done, info = env.step(np.clip(action, -1, 1))
env.close()

print(f"The video is saved in {save_dir} as a '.mp4' file!")

  logger.warn(
  logger.warn(
  logger.deprecation(
  self.pid = _posixsubprocess.fork_exec(


Creating offscreen glfw
The video is saved in results/20240325-073144_Behavior-Cloning as a '.mp4' file!


A compact version of the code can be found in ```main.py``` for reference. You can directly run

```
python main.py --agent torchBC --create_video
```

to produce the full training and evaluation results.
