# RL Baselines3 Zoo: Training in Colab



GitHub Repo: [https://github.com/DLR-RM/rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo)

Stable-Baselines3 Repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


## Install Dependencies



In [None]:
!apt-get install swig cmake ffmpeg freeglut3-dev xvfb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
freeglut3-dev is already the newest version (2.8.1-3).
freeglut3-dev set to manually installed.
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
The following additional packages will be installed:
  swig3.0
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0 xvfb
0 upgraded, 3 newly installed, 0 to remove and 37 not upgraded.
Need to get 1,885 kB of archives.
After this operation, 8,093 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,460 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 xvfb amd64 2:1.19.6-1ubuntu4.9 [784 kB]
Fetched 1,885 kB in 

## Clone RL Baselines3 Zoo Repo

In [None]:
!git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo

Cloning into 'rl-baselines3-zoo'...
remote: Enumerating objects: 3126, done.
remote: Counting objects: 100% (186/186), done.
remote: Compressing objects: 100% (122/122), done.
remote: Total 3126 (delta 116), reused 106 (delta 60), pack-reused 2940
Receiving objects: 100% (3126/3126), 2.12 MiB | 4.37 MiB/s, done.
Resolving deltas: 100% (2046/2046), done.
Submodule 'rl-trained-agents' (https://github.com/DLR-RM/rl-trained-agents) registered for path 'rl-trained-agents'
Cloning into '/content/rl-baselines3-zoo/rl-trained-agents'...
remote: Enumerating objects: 1567, done.        
remote: Counting objects: 100% (152/152), done.        
remote: Compressing objects: 100% (102/102), done.        
remote: Total 1567 (delta 55), reused 140 (delta 50), pack-reused 1415        
Receiving objects: 100% (1567/1567), 1.12 GiB | 27.86 MiB/s, done.
Resolving deltas: 100% (275/275), done.
Submodule path 'rl-trained-agents': checked out '3dd2af4cee930750016cf943dc6393bada57b89c'


In [None]:
%cd /content/rl-baselines3-zoo/

/content/rl-baselines3-zoo


### Install pip dependencies

In [None]:
!pip install -r requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


## Train an RL Agent


The trained agent is saved in the `logs/` folder.

Here we train QRDQN on the LunarLander-v2 environment for 100,000 steps.


In [None]:
!python train.py --algo qrdqn --env LunarLander-v2 --n-timesteps 100000

Seed: 598561340
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 128),
             ('buffer_size', 100000),
             ('exploration_final_eps', 0.18),
             ('exploration_fraction', 0.24),
             ('gamma', 0.995),
             ('gradient_steps', -1),
             ('learning_rate', 'lin_1.5e-3'),
             ('learning_starts', 10000),
             ('n_timesteps', 100000.0),
             ('policy', 'MlpPolicy'),
             ('policy_kwargs', 'dict(net_arch=[256, 256], n_quantiles=170)'),
             ('target_update_interval', 1),
             ('train_freq', 256)])
Using 1 environments
Overwriting n_timesteps with n=100000
Creating test environment
Using cuda device
Log path: logs/qrdqn/LunarLander-v2_1
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 88.8     |
|    ep_rew_mean      | -183     |
|    exploration rate | 0.988    |
| time/               |          |
|
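The `learning_rate` value `lin_1.5e-3` and the `exploration_*` entries in the log above describe schedules rather than constants. The sketch below is a minimal, standalone approximation of how such schedules behave. It assumes that the learning rate decays linearly from 1.5e-3 to 0 over training, and that exploration starts at 1.0 (an assumed initial epsilon) and anneals linearly to `exploration_final_eps` over the first `exploration_fraction` of training. The function names are illustrative and are not the zoo's API.

```python
def linear_schedule(initial_value: float):
    """Return a function mapping progress_remaining (1 -> 0) to a learning rate."""
    def schedule(progress_remaining: float) -> float:
        # Assumption: the rate shrinks in proportion to the remaining progress
        return progress_remaining * initial_value
    return schedule


def exploration_rate(step, total_steps, final_eps, fraction, initial_eps=1.0):
    """Linearly anneal epsilon during the first `fraction` of training."""
    progress = step / total_steps
    if progress >= fraction:
        # After the annealing window, epsilon stays at its final value
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / fraction


lr = linear_schedule(1.5e-3)
print(lr(1.0), lr(0.5), lr(0.0))

# Values taken from the log: 100000 steps, fraction 0.24, final epsilon 0.18
for step in (0, 12_000, 24_000, 100_000):
    print(step, round(exploration_rate(step, 100_000, 0.18, 0.24), 3))
```

Under these assumptions, epsilon reaches its floor of 0.18 after 24,000 of the 100,000 steps, and the agent explores at that rate for the remainder of the run.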

#### On PPO


We can now train on the same environment with the PPO algorithm.

In [None]:
!python train.py --algo ppo --env LunarLander-v2 --n-timesteps 100 

Seed: 2967772580
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 64),
             ('ent_coef', 0.01),
             ('gae_lambda', 0.98),
             ('gamma', 0.999),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 1024),
             ('n_timesteps', 1000000.0),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=100
Creating test environment
Using cuda device
Log path: logs/ppo/LunarLander-v2_1
Eval num_timesteps=10000, episode_reward=-765.83 +/- 448.75
Episode length: 97.40 +/- 54.21
---------------------------------
| eval/              |          |
|    mean_ep_length  | 97.4     |
|    mean_reward     | -766     |
| time/              |          |
|    total timesteps | 10000    |
---------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 90.1     |
|    ep_re
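Although the command asks for only 100 timesteps, the log shows an evaluation at 10000 timesteps. One likely explanation is that PPO collects experience in whole rollouts of `n_envs * n_steps` transitions and checks the budget only between rollouts, so the run cannot stop before the first rollout is complete. The sketch below illustrates that arithmetic with the values printed above. `effective_timesteps` is a hypothetical helper, not part of the zoo or Stable-Baselines3.

```python
import math


def effective_timesteps(requested: int, n_envs: int, n_steps: int) -> int:
    """Estimate how many environment steps are actually collected.

    Assumption: training proceeds in whole rollouts of n_envs * n_steps
    transitions, and at least one rollout is always collected.
    """
    rollout_size = n_envs * n_steps
    n_rollouts = max(1, math.ceil(requested / rollout_size))
    return n_rollouts * rollout_size


# Hyperparameters from the log above: 16 environments, 1024 steps each
print(effective_timesteps(100, n_envs=16, n_steps=1024))      # 16384
print(effective_timesteps(100_000, n_envs=16, n_steps=1024))  # 114688
```

With these settings, any budget below 16384 steps behaves the same way, so very small `--n-timesteps` values are mainly useful as a smoke test.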

#### Tune Hyperparameters

We use [Optuna](https://optuna.org/) for optimizing the hyperparameters.

Tune the hyperparameters for PPO using a TPE sampler and a median pruner, with 2 parallel jobs,
a budget of 1000 trials, and a maximum of 50000 steps.
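To make the pruning idea concrete, here is a minimal pure-Python sketch of a median stopping rule: a trial is stopped early when its intermediate score at some step falls below the median of the scores that earlier trials reported at the same step. This is a simplified illustration written for this notebook, not Optuna's implementation. The function name, the `n_startup_trials` default, and the sample numbers are all made up.

```python
from statistics import median


def should_prune(history, step, value, n_startup_trials=2):
    """Decide whether to stop a trial early (higher values are better).

    history: list of dicts mapping step -> intermediate value,
             one dict per earlier trial.
    """
    peers = [trial[step] for trial in history if step in trial]
    if len(peers) < n_startup_trials:
        # Too few earlier trials to compare against: never prune yet
        return False
    return value < median(peers)


# Hypothetical intermediate mean rewards from three earlier trials
history = [
    {10_000: -400.0, 20_000: -150.0},
    {10_000: -600.0, 20_000: -300.0},
    {10_000: -250.0, 20_000: -50.0},
]

print(should_prune(history, 10_000, -700.0))  # True: below the median of -400
print(should_prune(history, 10_000, -300.0))  # False: above the median
```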

In [None]:
!python train.py --algo ppo --env LunarLander-v2 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler tpe --pruner median

Seed: 335732516
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 64),
             ('ent_coef', 0.01),
             ('gae_lambda', 0.98),
             ('gamma', 0.999),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 1024),
             ('n_timesteps', 1000000.0),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=50000
Optimizing hyperparameters
Sampler: tpe - Pruner: median
[I 2021-10-21 09:54:08,346] A new study created in memory with name: no-name-d01fd325-e0e5-46a7-a68e-bacb3723ce46

`n_jobs` argument has been deprecated in v2.7.0. This feature will be removed in v4.0.0. See https://github.com/optuna/optuna/releases/tag/v2.7.0.

[I 2021-10-21 09:56:00,309] Trial 0 finished with value: -950.7362820000001 and parameters: {'batch_size': 512, 'n_steps': 512, 'gamma': 0.99, 'learning_rate': 4.34431713042486e-05, 'ent_coef': 2.33897

### Record a Video

In [None]:
# Set up display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [None]:
!python -m utils.record_video --algo ppo --env LunarLander-v2 --exp-id 0 -f logs/ -n 1000

Loading latest experiment, id=4
Saving video to /content/rl-baselines3-zoo/logs/ppo/LunarLander-v2_4/videos/final-model-ppo-LunarLander-v2-step-0-to-step-1000.mp4
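The log line `Loading latest experiment, id=4` appears even though the command passes `--exp-id 0`, which suggests that 0 is treated as "use the most recent run". Runs are stored in folders named `<env>_<id>` (for example `logs/ppo/LunarLander-v2_4`), so the latest one can be found by taking the largest numeric suffix. The helper below is a hypothetical sketch of that lookup, not the zoo's own function, and the folder names it creates are only for demonstration.

```python
import re
import tempfile
from pathlib import Path


def latest_run_id(algo_dir: Path, env_id: str) -> int:
    """Return the largest <id> among folders named '<env_id>_<id>', or 0 if none."""
    pattern = re.compile(re.escape(env_id) + r"_(\d+)$")
    ids = []
    for entry in algo_dir.iterdir():
        match = pattern.match(entry.name)
        if entry.is_dir() and match:
            ids.append(int(match.group(1)))
    return max(ids, default=0)


# Build a throwaway folder layout that mimics logs/ppo/
with tempfile.TemporaryDirectory() as tmp:
    algo_dir = Path(tmp) / "ppo"
    for name in ("LunarLander-v2_1", "LunarLander-v2_2",
                 "LunarLander-v2_4", "CartPole-v1_9"):
        (algo_dir / name).mkdir(parents=True)
    latest = latest_run_id(algo_dir, "LunarLander-v2")
    print(latest)
```

Runs for other environments, such as the `CartPole-v1_9` folder in this example, are ignored because the pattern is anchored on the environment id.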


### Display the video

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
    """
    Display every MP4 found in a folder inline in the notebook.

    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        # Embed the file as base64 so the video plays without a file server
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [None]:
# Point at the folder the recording step reported above
show_videos(video_path='logs/ppo/LunarLander-v2_4/videos/', prefix='final')

### Continue Training

Here we continue training a previously saved model.

The cells below retrain and evaluate the models. They are left without further commentary because the results are discussed in the associated report.

In [None]:
!python train.py --algo ppo --env LunarLander-v2 -n 1000000 -i logs/ppo/LunarLander-v2_5/LunarLander-v2#ppo#nom1_newTEST.zip

Seed: 46406515
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 64),
             ('ent_coef', 0.01),
             ('gae_lambda', 0.98),
             ('gamma', 0.999),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 1024),
             ('n_timesteps', 1000000.0),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=1000000
Creating test environment
Loading pretrained agent
Log path: logs/ppo/LunarLander-v2_6
Eval num_timesteps=10000, episode_reward=282.07 +/- 17.91
Episode length: 250.60 +/- 10.86
---------------------------------
| eval/              |          |
|    mean_ep_length  | 251      |
|    mean_reward     | 282      |
| time/              |          |
|    total timesteps | 10000    |
---------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 232      |
| 

In [None]:
!python scripts/all_plots.py -a ppo --env LunarLander-v2 -f logs/


Eval not found for logs/ppo/LunarLander-v2_5
# results_table
| Environments |   PPO   |
|--------------|---------|
|              |logs/    |
|LunarLander-v2|226 +/- 0|
<Figure size 640x480 with 1 Axes>


In [None]:
!python train.py --algo ppo --env LunarLander-v2 -n 100000 -i logs/ppo/LunarLander-v2_6/best_model.zip

Seed: 2316408788
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('activation_fn', 'relu'),
             ('batch_size', 64),
             ('clip_range', 0.1),
             ('ent_coef', 2.9352370524833907e-06),
             ('gae_lambda', 0.95),
             ('gamma', 0.999),
             ('learning_rate', 0.00011596933243196016),
             ("max_grad_norm'", 5),
             ('n_envs', 16),
             ('n_epochs', 10),
             ('n_steps', 32),
             ('n_timesteps', 1000000.0),
             ('net_arch', 'medium'),
             ('policy', 'MlpPolicy'),
             ('vf_coef', 0.4964941214692943)])
Using 16 environments
Overwriting n_timesteps with n=100000
Creating test environment
Loading pretrained agent
Log path: logs/ppo/LunarLander-v2_7
-----------------------------
| time/              |      |
|    fps             | 4369 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 512  |
--------

In [None]:
!python enjoy.py --algo ppo --env LunarLander-v2 --no-render --n-timesteps 1000 --folder logs/

Loading latest experiment, id=4
Loading logs/ppo/LunarLander-v2_4/LunarLander-v2.zip
Episode Reward: 218.93
Episode Length 329
Episode Reward: 281.72
Episode Length 335
Episode Reward: 215.52
Episode Length 325
3 Episodes
Mean reward: 238.72 +/- 30.43
Mean episode length: 329.67 +/- 4.11
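The summary lines can be checked by hand from the three episodes listed above. The sketch below assumes the `+/-` figure is a population standard deviation (dividing by the number of episodes, as NumPy's `np.std` does by default). That assumption is inferred from the numbers, not read from the script.

```python
from statistics import mean, pstdev

# Per-episode values copied from the log above
rewards = [218.93, 281.72, 215.52]
lengths = [329, 335, 325]

mean_reward, std_reward = mean(rewards), pstdev(rewards)
mean_length, std_length = mean(lengths), pstdev(lengths)

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"Mean episode length: {mean_length:.2f} +/- {std_length:.2f}")
```

The results agree with the log to within 0.01. Any difference in the last digit of the reward spread presumably comes from the per-episode rewards being printed with only two decimals. With three episodes the spread estimate is very rough, so larger evaluation runs give a more reliable comparison between models.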


In [None]:
!python sb3_evaluator.py --algo ppo --env LunarLander-v2 --folder logs/ppo/LunarLander-v2_4/

/content/rl-baselines3-zoo/data/policies/
['LunarLander-v2#ppo#nom1_nom2.zip']
LunarLander-v2#ppo#nom1_nom2.zip
Hall of fame
Environment : LunarLander-v2
team:  nom1_nom2  	 	 algo: ppo  	 	 mean score:  247.528189685 std:  30.342655527676392
Time : 1mn 46s 565ms
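The evaluator output above suggests that policy archives follow an `<env>#<algo>#<team>.zip` naming pattern, with the three fields echoed back in the "Hall of fame" line. Assuming that convention holds (it is inferred from the output, since `sb3_evaluator.py` itself is not shown here), the fields can be recovered as in the sketch below. `parse_policy_name` and `PolicyInfo` are hypothetical names.

```python
from pathlib import Path
from typing import NamedTuple


class PolicyInfo(NamedTuple):
    env: str
    algo: str
    team: str


def parse_policy_name(filename: str) -> PolicyInfo:
    """Split '<env>#<algo>#<team>.zip' into its three fields."""
    stem = Path(filename).stem  # drop the .zip extension
    parts = stem.split("#")
    if len(parts) != 3:
        raise ValueError(f"Expected '<env>#<algo>#<team>.zip', got {filename!r}")
    return PolicyInfo(*parts)


info = parse_policy_name("LunarLander-v2#ppo#nom1_nom2.zip")
print(info.env, info.algo, info.team)
```

Raising on malformed names keeps a misnamed archive from being silently scored under the wrong team or algorithm.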


In [None]:
!python sb3_evaluator.py --algo ppo --env LunarLander-v2 --folder logs/ppo/LunarLander-v2_5/

/content/rl-baselines3-zoo/data/policies/
['LunarLander-v2#ppo#TEST_nom2.zip']
LunarLander-v2#ppo#TEST_nom2.zip
Hall of fame
Environment : LunarLander-v2
team:  TEST_nom2  	 	 algo: ppo  	 	 mean score:  267.80452508499997 std:  22.044322989280555
Time : 1mn 1s 570ms


In [None]:
!python sb3_evaluator.py --algo ppo --env LunarLander-v2

/content/rl-baselines3-zoo/data/policies/
['LunarLander-v2#ppo#nom1_newTEST.zip']
LunarLander-v2#ppo#nom1_newTEST.zip
Hall of fame
Environment : LunarLander-v2
team:  nom1_newTEST  	 	 algo: ppo  	 	 mean score:  276.72351061 std:  34.85189371147843
Time : 54s 807ms


In [None]:
!python sb3_evaluator.py --algo ppo --env LunarLander-v2

LunarLander-v2#ppo#nom1_new.zip
Hall of fame
Environment : LunarLander-v2
team:  nom1_new  	 	 algo: ppo  	 	 mean score:  278.74926112500003 std:  27.85180712175887
Time : 48s 508ms


In [None]:
!python sb3_evaluator.py --algo ppo --env LunarLander-v2

LunarLander-v2#ppo#nom1_nom.zip
Hall of fame
Environment : LunarLander-v2
team:  nom1_nom  	 	 algo: ppo  	 	 mean score:  275.83186239 std:  29.82184076749405
Time : 49s 470ms


In [None]:
!python train.py --algo ppo --env LunarLander-v2 -n 1000000 --eval-freq 10000 --eval-episodes 10 --n-eval-envs 1

Seed: 293841494
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 64),
             ('ent_coef', 0.01),
             ('gae_lambda', 0.98),
             ('gamma', 0.999),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 1024),
             ('n_timesteps', 1000000.0),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=1000000
Creating test environment
Using cpu device
Log path: logs/ppo/LunarLander-v2_9
Eval num_timesteps=10000, episode_reward=-648.51 +/- 82.79
Episode length: 100.00 +/- 15.01
---------------------------------
| eval/              |          |
|    mean_ep_length  | 100      |
|    mean_reward     | -649     |
| time/              |          |
|    total timesteps | 10000    |
---------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 91.5     |
|    ep_

In [None]:
!python scripts/all_plots.py -a ppo --env LunarLander-v2 -l ppo -f logs/ppo/LunarLander-v2_9/


Eval not found for logs/ppo/LunarLander-v2_5
Eval not found for logs/ppo/LunarLander-v2_8
  merged_results = np.array(merged_results)
Traceback (most recent call last):
  File "scripts/all_plots.py", line 153, in <module>
    merged_results = np.array(merged_results)
ValueError: could not broadcast input array from shape (101,5) into shape (101)


In [None]:
!python scripts/plot_train.py -a ppo -e LunarLander-v2 -f logs/ -w 500 -x steps -y reward

Traceback (most recent call last):
  File "scripts/plot_train.py", line 46, in <module>
    for folder in os.listdir(log_path)
FileNotFoundError: [Errno 2] No such file or directory: 'logs/ppo/LunarLander-v2_9/ppo'


In [None]:
!python train.py --algo ppo --env LunarLander-v2 -n 1000000 --eval-freq 10000 --eval-episodes 10 --n-eval-envs 1 -i logs/ppo/LunarLander-v2_1/best_model.zip

Seed: 147388831
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 64),
             ('ent_coef', 0.01),
             ('gae_lambda', 0.98),
             ('gamma', 0.999),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 1024),
             ('n_timesteps', 1000000.0),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=1000000
Creating test environment
Loading pretrained agent
Log path: logs/ppo/LunarLander-v2_2
Eval num_timesteps=10000, episode_reward=254.81 +/- 21.45
Episode length: 315.30 +/- 10.84
---------------------------------
| eval/              |          |
|    mean_ep_length  | 315      |
|    mean_reward     | 255      |
| time/              |          |
|    total timesteps | 10000    |
---------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 300      |
|