# 1. Install and Import Dependencies

To install:

`pip install 'git+https://github.com/DLR-RM/stable-baselines3#egg=stable-baselines3[extra]'`

The version on PyPI is currently broken; installing directly from GitHub works around this.

You may also want to install PyTorch for CUDA, if available:

[PyTorch Install Guide](https://pytorch.org/get-started/locally/)

In [28]:
import os
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# 2. Load Environment

[CartPole](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

In [29]:
env_name = 'CartPole-v0'
env = gym.make(env_name)

## Environment Example

In [30]:
episodes = 5
for episode in range(episodes):
    state = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print(f'Episode {episode+1} finished with score {score}')
env.close()

Episode 1 finished with score 35.0
Episode 2 finished with score 13.0
Episode 3 finished with score 30.0
Episode 4 finished with score 17.0
Episode 5 finished with score 19.0
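The loop above assumes the classic Gym API, where `env.reset()` returns only the observation and `env.step()` returns four values. Gym 0.26+ and Gymnasium instead return `(obs, info)` from `reset()` and a five-tuple `(obs, reward, terminated, truncated, info)` from `step()`. If you run this notebook against a newer release, a small shim along these lines may help. `unpack_reset` and `unpack_step` are hypothetical helper names, not part of Gym.

```python
def unpack_reset(result):
    """Return just the observation from either reset() style."""
    # Newer API: reset() -> (obs, info), where info is a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: reset() -> obs
    return result


def unpack_step(result):
    """Normalise a step() result to (obs, reward, done, info)."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # An episode is over if it terminated or was cut off by a time limit.
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info
```

In the loop you would then write `state = unpack_reset(env.reset())` and `n_state, reward, done, info = unpack_step(env.step(action))`.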


## Environment Spaces

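The [Stable-Baselines3 algorithm guide](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html) lists which action and observation spaces each algorithm supports.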

![RL Algorithm Comparison](./RL_Alg_Comparison.png)

In [31]:
print(env.observation_space)
print(env.action_space)

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Discrete(2)
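The four entries of the `Box` are cart position, cart velocity, pole angle (in radians) and pole angular velocity; `Discrete(2)` is push left or push right. The odd-looking bounds have simple origins, which the sketch below checks with the standard library only: the velocity limits are the largest finite float32, and the position and angle limits are twice the episode-termination thresholds in the linked CartPole source (2.4 units and 12 degrees). The variable names are my own.

```python
import math

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

x_threshold = 2.4                   # cart position that ends an episode
theta_threshold = math.radians(12)  # pole angle that ends an episode

# The observation bounds are twice the termination thresholds.
position_bound = 2 * x_threshold
angle_bound = 2 * theta_threshold

print(f"position bound: {position_bound:.7f}")
print(f"angle bound:    {angle_bound:.8f}")
print(f"velocity bound: {FLOAT32_MAX:.7e}")
```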


# 3. Training

In [32]:
log_path = os.path.join('Training', 'Logs')

In [33]:
env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cuda device


At this point I hit a long error because I had CUDA 11.4 installed but a PyTorch build for CUDA 11.3. I downgraded to CUDA 11.3 using [this guide](https://denishartl.com/installing-cuda-11-3-cudnn-tensorflow-2-4-jupyter-on-a-headless-ubuntu-20-04-server/).
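A version mismatch like this is easier to diagnose if you first check which CUDA version your PyTorch build targets and whether it can see a GPU. The sketch below wraps the import so it also runs where PyTorch is missing; `cuda_report` is a hypothetical helper name.

```python
def cuda_report():
    """Summarise the installed PyTorch build and its CUDA support."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "built_for_cuda": None, "cuda_available": False}
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


print(cuda_report())
```

If `cuda_available` is `False`, Stable-Baselines3 falls back to the CPU under its default `device='auto'`; passing `device='cpu'` to `PPO(...)` forces that explicitly.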

In [34]:
model.learn(total_timesteps=20000)

Logging to Training/Logs/PPO_2
-----------------------------
| time/              |      |
|    fps             | 1068 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 841        |
|    iterations           | 2          |
|    time_elapsed         | 4          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00802889 |
|    clip_fraction        | 0.0949     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.686     |
|    explained_variance   | -0.0119    |
|    learning_rate        | 0.0003     |
|    loss                 | 7.64       |
|    n_updates            | 10         |
|    policy_gradient_loss | -0.0146    |
|    value_loss           | 47.5       |
----------------------------------------
---------------------

<stable_baselines3.ppo.ppo.PPO at 0x7f145208a890>
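The log shows `total_timesteps` advancing in steps of 2048. That matches PPO's default rollout length in Stable-Baselines3 (`n_steps=2048`) with a single environment, and the `n_updates` column is consistent with the default `n_epochs=10`. A quick back-of-the-envelope check, assuming those defaults (they are not set explicitly in this notebook):

```python
import math

total_timesteps = 20000   # value passed to model.learn()
n_steps = 2048            # assumed PPO default rollout length per environment
n_envs = 1                # DummyVecEnv wraps a single CartPole
n_epochs = 10             # assumed PPO default optimisation epochs per rollout

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(total_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(iterations, actual_timesteps)

# n_updates is logged before the current iteration's update, so it lags by one.
for iteration in (1, 2, 5):
    print(iteration, n_epochs * (iteration - 1))
```

Training therefore runs slightly past the requested 20000 steps, because PPO finishes the rollout in progress rather than stopping mid-rollout.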

# 4. Save and Reload Model

In [35]:
PPO_Path = os.path.join('Training', 'Saved_Models', 'PPO_Model_Cartpole')
model.save(PPO_Path)

In [36]:
del model
model = PPO.load(PPO_Path, env=env)

# 5. Evaluation

In [37]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(200.0, 0.0)
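`evaluate_policy` returns a `(mean_reward, std_reward)` tuple over the evaluation episodes. `CartPole-v0` caps episodes at 200 steps with a reward of 1 per step, so `(200.0, 0.0)` means all ten episodes ran to the cap. The sketch below reproduces that summary from hypothetical per-episode rewards using only the standard library; it mirrors the mean and population standard deviation, not the library's implementation.

```python
from statistics import fmean, pstdev


def summarize(episode_rewards):
    """Return (mean, population std) of a list of episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Hypothetical rewards: ten episodes that all reached the 200-step cap.
perfect = [200.0] * 10
mean_reward, std_reward = summarize(perfect)
print(mean_reward, std_reward)
```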

In [38]:
env.close()

# 6. Test

In [39]:
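episodes = 5
for episode in range(episodes):
    obs = env.reset()
    done = False
    score = 0

    # env is a DummyVecEnv, so reward and done are length-1 arrays;
    # that is why the scores below print as [200.] rather than 200.0.
    while not done:
        env.render()
        action, _ = model.predict(obs)  # second return value is the recurrent state, unused here
        obs, reward, done, info = env.step(action)
        score += reward
    print(f'Episode {episode+1} finished with score {score}')
env.close()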

Episode 1 finished with score [169.]
Episode 2 finished with score [200.]
Episode 3 finished with score [200.]
Episode 4 finished with score [200.]
Episode 5 finished with score [200.]


# 7. Viewing Logs in Tensorboard
Run this in a separate shell so it doesn't block your notebook.

If TensorFlow is not installed, follow [the TensorFlow install guide](https://www.tensorflow.org/install/).

In [40]:
training_log_path = os.path.join(log_path, 'PPO_2')
!tensorboard --logdir={training_log_path}

^C
Traceback (most recent call last):
  File "/home/stark/anaconda3/envs/RL/lib/python3.10/site-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/home/stark/anaconda3/envs/RL/lib/python3.10/site-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stark/anaconda3/envs/RL/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
  File "/home/stark/anaconda3/envs/RL/lib/python3.10/site-packages/tensorboard/main.py", line 39, in run_main
    main_lib.global_init()
  File "/home/stark/anaconda3/envs/RL/lib/python3.10/site-packages/tensorboard/main_lib.py", line 40, in global_init
    if getattr(tf, "__version__", "stub") == "stub":
  File "/home/stark/anaconda3/envs/RL/lib/python3.10/site-packages/tensorboard/lazy.py", line 65, in __getattr__
    return
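One way to avoid blocking the notebook without leaving Python is to start TensorBoard as a background process instead of with `!`, which holds the cell until it is interrupted. A minimal sketch, assuming the `tensorboard` executable is on your `PATH`; `tensorboard_command` and `launch_tensorboard` are hypothetical helper names.

```python
import os
import subprocess


def tensorboard_command(log_dir):
    """Build the argument list for the tensorboard CLI."""
    return ["tensorboard", "--logdir", log_dir]


def launch_tensorboard(log_dir):
    """Start TensorBoard in the background and return the process handle."""
    return subprocess.Popen(tensorboard_command(log_dir))


log_dir = os.path.join("Training", "Logs", "PPO_2")
print(" ".join(tensorboard_command(log_dir)))
# proc = launch_tensorboard(log_dir)   # uncomment to start it
# proc.terminate()                     # stop it when finished
```

Once running, TensorBoard prints the URL to open, which is `http://localhost:6006/` by default.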

# 8. Adding a Callback to the Training Stage

In [44]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [45]:
save_path = os.path.join('Training', 'Saved_Models')

In [47]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
eval_callback = EvalCallback(env,
                             callback_on_new_best=stop_callback,
                             eval_freq=10000,
                             best_model_save_path=save_path,
                             verbose=1)
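Together these two callbacks work as follows: every `eval_freq` steps `EvalCallback` evaluates the agent, saves it to `best_model_save_path` if the mean reward is a new best, and only then calls `stop_callback`, which ends training once that best mean reward reaches `reward_threshold`. Below is a simplified pure-Python sketch of that control flow, not the library's code; `run_evals` and the reward sequences are made up for illustration.

```python
def run_evals(mean_rewards, reward_threshold):
    """Replay a sequence of evaluation results and report when training stops.

    Returns the 1-based index of the evaluation that triggers the stop,
    or None if the threshold is never reached.
    """
    best = float("-inf")
    for i, mean_reward in enumerate(mean_rewards, start=1):
        if mean_reward > best:            # "New best mean reward!"
            best = mean_reward            # the best model would be saved here
            if best >= reward_threshold:  # the stop callback's check
                return i
    return None


print(run_evals([189.6, 200.0], reward_threshold=200))
print(run_evals([189.6, 195.8], reward_threshold=200))
```

Because `CartPole-v0` caps the episode reward at 200, a threshold of 200 only stops training after an evaluation in which every episode is perfect.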

In [48]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cuda device


In [49]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training/Logs/PPO_3
-----------------------------
| time/              |      |
|    fps             | 1276 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 924         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008078067 |
|    clip_fraction        | 0.104       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.00397     |
|    learning_rate        | 0.0003      |
|    loss                 | 8.22        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0169     |
|    value_loss           | 55.1        |
-----------------------------------------
---



Eval num_timesteps=10000, episode_reward=189.60 +/- 20.80
Episode length: 189.60 +/- 20.80
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 190         |
|    mean_reward          | 190         |
| time/                   |             |
|    total_timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.008963438 |
|    clip_fraction        | 0.075       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.613      |
|    explained_variance   | 0.337       |
|    learning_rate        | 0.0003      |
|    loss                 | 25.1        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.018      |
|    value_loss           | 58.7        |
-----------------------------------------
New best mean reward!
------------------------------
| time/              |       |
|    fps             | 753   |
|    iterations      | 5     |
|    ti

<stable_baselines3.ppo.ppo.PPO at 0x7f1575356170>

# 9. Changing Policies

In [50]:
net_arch = [dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])]

In [51]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path, policy_kwargs={'net_arch': net_arch})

Using cuda device
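This `net_arch` gives the policy (`pi`) and value (`vf`) networks four hidden layers of 128 units each, in place of what I understand to be the default of two layers of 64. To get a feel for how much larger that is, the sketch below counts weights and biases for a plain fully connected network on CartPole's 4 inputs, with 2 action outputs for the policy and 1 output for the value function. `mlp_param_count` is a hypothetical helper, and it ignores any other parameters the library may hold.

```python
def mlp_param_count(n_inputs, hidden_layers, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    # Each layer has (inputs * outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


custom = [128, 128, 128, 128]
default = [64, 64]   # assumed default for MlpPolicy

for name, layers in (("custom", custom), ("default", default)):
    pi = mlp_param_count(4, layers, 2)   # policy head: 2 discrete actions
    vf = mlp_param_count(4, layers, 1)   # value head: single scalar
    print(name, pi, vf)
```

As far as I know, recent Stable-Baselines3 releases expect the plain dict form, `net_arch=dict(pi=[...], vf=[...])`, rather than a dict wrapped in a list, so check the version you have installed.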


In [52]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training/Logs/PPO_4
-----------------------------
| time/              |      |
|    fps             | 930  |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 703        |
|    iterations           | 2          |
|    time_elapsed         | 5          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.01555875 |
|    clip_fraction        | 0.253      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.68      |
|    explained_variance   | -0.000114  |
|    learning_rate        | 0.0003     |
|    loss                 | 3          |
|    n_updates            | 10         |
|    policy_gradient_loss | -0.0267    |
|    value_loss           | 16.6       |
----------------------------------------
---------------------



Eval num_timesteps=10000, episode_reward=195.80 +/- 8.40
Episode length: 195.80 +/- 8.40
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 196         |
|    mean_reward          | 196         |
| time/                   |             |
|    total_timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.011880171 |
|    clip_fraction        | 0.139       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.578      |
|    explained_variance   | 0.474       |
|    learning_rate        | 0.0003      |
|    loss                 | 5.83        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0159     |
|    value_loss           | 35.1        |
-----------------------------------------
------------------------------
| time/              |       |
|    fps             | 560   |
|    iterations      | 5     |
|    time_elapsed    | 18    |


<stable_baselines3.ppo.ppo.PPO at 0x7f15755105e0>

# 10. Using an Alternate Algorithm

In [53]:
from stable_baselines3 import DQN

In [54]:
model = DQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cuda device


In [55]:
model.learn(20000, callback=eval_callback)

Logging to Training/Logs/DQN_1
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.958    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 6218     |
|    time_elapsed     | 0        |
|    total_timesteps  | 89       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.903    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 7166     |
|    time_elapsed     | 0        |
|    total_timesteps  | 204      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.858    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 7541     |
|    time_elapsed     | 0        |
|    total_timesteps  | 298      |
----------------------------------
------------------------



----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 576      |
|    fps              | 14567    |
|    time_elapsed     | 0        |
|    total_timesteps  | 12450    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 580      |
|    fps              | 14583    |
|    time_elapsed     | 0        |
|    total_timesteps  | 12579    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 584      |
|    fps              | 14584    |
|    time_elapsed     | 0        |
|    total_timesteps  | 12638    |
----------------------------------
----------------------------------
| rollout/          

<stable_baselines3.dqn.dqn.DQN at 0x7f1440571420>
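The `exploration_rate` column is DQN's epsilon: the probability of taking a random action. The logged values are consistent with a linear decay from 1.0 to 0.05 over the first 10% of `total_timesteps`, which I believe are the library defaults (`exploration_initial_eps`, `exploration_final_eps`, `exploration_fraction`); none of them is set in this notebook. A sketch of that schedule, with `exploration_rate` as a hypothetical helper name:

```python
def exploration_rate(timestep, total_timesteps=20000,
                     initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linearly decay epsilon over the first `fraction` of training."""
    progress = min(timestep / (fraction * total_timesteps), 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# Timesteps taken from the log above.
for t in (89, 204, 298, 12450):
    print(t, round(exploration_rate(t), 3))
```

After the first 2000 steps the agent acts randomly only 5% of the time, which matches the flat `0.05` in the later log entries.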