## Stable-Baselines3
https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html

In [1]:
!pip install gymnasium

Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.29.1


In [2]:
!pip install stable_baselines3

Collecting stable_baselines3
  Downloading stable_baselines3-2.2.1-py3-none-any.whl (181 kB)
Installing collected packages: stable_baselines3
Successfully installed stable_baselines3-2.2.1


In [3]:
import gymnasium as gym

from stable_baselines3 import A2C, DQN
import numpy as np
import matplotlib.pyplot as plt

## CartPole

In [4]:
env = gym.make("CartPole-v1", render_mode="rgb_array")


### CartPole DQN

In [5]:
model = DQN("MlpPolicy", env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


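The default `MlpPolicy` Q-network is a fully connected network with three linear layers:
* input: the state vector (4 values for CartPole)
* output: one Q-value per discrete action (2 for CartPole)
* two hidden layers of 64 units each, with ReLU activations

Several tricks are used to stabilize learning:
* a replay buffer that stores past transitions, from which training batches are sampled
* gradient clipping to avoid overly large update steps
* two networks with the same architecture: the online network `q_net` is trained and selects actions, while the target network `q_net_target` supplies the Q-values for the learning targets. <br>
Because the target network is only synchronized with the online network periodically, the targets change more slowly, which stabilizes training.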
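
To make the replay buffer and the role of the second network concrete, here is a minimal pure-Python sketch. It is not the Stable-Baselines3 implementation: the names `ReplayBuffer` and `td_target` are hypothetical, and the discount factor `gamma=0.99` is an assumed value chosen for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of (obs, action, reward, next_obs, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_target(reward, done, next_q_values_target, gamma=0.99):
    """Learning target for one transition.

    next_q_values_target: Q-values of the next state, computed by the
    (slowly updated) target network, one value per action.
    """
    if done:
        return reward  # no bootstrapping past the end of an episode
    return reward + gamma * max(next_q_values_target)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=2)
target = td_target(reward=1.0, done=False, next_q_values_target=[0.5, 2.0], gamma=0.9)
```

The online network would then be trained to move its Q-value for the taken action toward `target`, while the target network stays fixed between synchronizations.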

In [6]:
model.policy


DQNPolicy(
  (q_net): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=4, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Linear(in_features=64, out_features=2, bias=True)
    )
  )
  (q_net_target): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=4, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Linear(in_features=64, out_features=2, bias=True)
    )
  )
)

The architecture can be changed.
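
One way to do this is the `policy_kwargs` argument of the model constructor, whose `net_arch` entry lists the hidden layer sizes. The sketch below is an illustration under assumptions: the layer sizes `[128, 128]` and the helper name `make_wider_dqn` are made up for the example, and the import sits inside the function only so that the snippet can be read and run without building a model.

```python
# Hypothetical choice: two hidden layers of 128 units instead of the default 64.
policy_kwargs = dict(net_arch=[128, 128])


def make_wider_dqn(env):
    """Build a DQN model whose Q-network uses the hidden sizes from policy_kwargs."""
    from stable_baselines3 import DQN  # imported lazily, only when a model is built

    return DQN("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
```

Calling `make_wider_dqn(env).policy` should then show `Linear` layers with 128 units in both `q_net` and `q_net_target`.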

#### Training
100k timesteps: 69 sec <br>
150k timesteps: 128 sec <br>
200k timesteps: 182 sec

In [7]:
model.learn(total_timesteps=200_000)

The last 5000 lines of the streaming output were truncated.
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0239   |
|    n_updates        | 5694     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 10.4     |
|    ep_rew_mean      | 10.4     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 4512     |
|    fps              | 1820     |
|    time_elapsed     | 40       |
|    total_timesteps  | 72822    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0381   |
|    n_updates        | 5705     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 10.4     |
|    ep_rew_mean      | 10.4     |
|    exploration_rate | 0.05     |
| time/               |          |
|    epis

<stable_baselines3.dqn.dqn.DQN at 0x7d82b271d930>

In [8]:
vec_env = model.get_env()
obs = vec_env.reset()

In [9]:
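# Run the trained policy for 2000 steps and record the length of each episode.
iprev = 0       # step index at which the previous episode ended
sm = 0; nn = 0  # sum of episode lengths and number of finished episodes
# Note: because iprev starts at 0, the first episode is counted one step too short.
for i in range(2000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    #print(i,obs, reward, "done",done, info)
    if done:  # the vectorized env resets itself when an episode ends
        print(i, i - iprev)  # step index and episode length
        sm += i - iprev; nn += 1
        iprev = i
    #vec_env.render("human")
print("average", sm / nn)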

175 175
324 149
459 135
670 211
920 250
1042 122
1203 161
1341 138
1531 190
1665 134
1802 137
average 163.8181818181818
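
The episode lengths can also be recomputed from the step indices printed above. The sketch below uses a hypothetical helper, `lengths_from_done_steps`, that starts counting from -1 so that the first episode (steps 0 through 175) is counted as 176 steps rather than 175; the input list is copied from the first column of the output.

```python
# Step indices at which an episode ended, taken from the printed output.
done_steps = [175, 324, 459, 670, 920, 1042, 1203, 1341, 1531, 1665, 1802]


def lengths_from_done_steps(done_steps):
    """Convert end-of-episode step indices into episode lengths."""
    lengths = []
    prev = -1  # the first episode starts at step 0, so the "previous end" is -1
    for step in done_steps:
        lengths.append(step - prev)
        prev = step
    return lengths


lengths = lengths_from_done_steps(done_steps)
mean_length = sum(lengths) / len(lengths)
print(lengths[0], round(mean_length, 2))  # 176 163.91
```

The difference from the average printed by the notebook is small, since only the first episode is affected. Stable-Baselines3 also provides `evaluate_policy` in `stable_baselines3.common.evaluation`, which reports the mean and standard deviation of the episode reward over a chosen number of episodes.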


### PPO

**Proximal Policy Optimization** (PPO) is a family of model-free reinforcement learning algorithms developed at OpenAI in 2017. PPO algorithms are **policy gradient** methods: they search the space of policies directly rather than assigning values to state-action pairs. https://arxiv.org/pdf/1707.06347.pdf

PPO algorithms retain some of the benefits of trust region policy optimization (TRPO) algorithms, which use a KL divergence constraint to avoid parameter updates that change the policy too much in a single iteration.
Compared with TRPO, PPO algorithms are simpler to implement, more general, and empirically have better sample complexity.[1] PPO achieves this by optimizing a different objective function, a clipped surrogate objective, instead of enforcing the KL constraint explicitly.[2]
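
The clipped objective from the PPO paper can be sketched for a single sample as follows. This is a simplified illustration, not the Stable-Baselines3 code: the function name `clipped_surrogate` is hypothetical, `ratio` stands for the probability of the taken action under the new policy divided by its probability under the old policy, and `clip_range=0.2` is chosen to match the `clip_range` value in the training log of this notebook.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Restrict the probability ratio to the interval [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to move the ratio outside that interval.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (positive advantage): the gain is capped once the ratio exceeds 1.2.
gain_capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# Bad action (negative advantage): a ratio above 1.2 is still fully penalized.
penalty_kept = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

In training, this quantity is averaged over a batch and maximized, so a single update cannot profit from pushing the policy far away from the one that collected the data.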

In [10]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

In [11]:
# Parallel environments
n_envs = 4
vec_env = make_vec_env("CartPole-v1", n_envs=n_envs)

In [12]:
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=25000)

Using cuda device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22.4     |
|    ep_rew_mean     | 22.4     |
| time/              |          |
|    fps             | 2167     |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 8192     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 29.2        |
|    ep_rew_mean          | 29.2        |
| time/                   |             |
|    fps                  | 1239        |
|    iterations           | 2           |
|    time_elapsed         | 13          |
|    total_timesteps      | 16384       |
| train/                  |             |
|    approx_kl            | 0.014002295 |
|    clip_fraction        | 0.219       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.682      |
|    explained_variance   | -0.00328    |
|    learnin

<stable_baselines3.ppo.ppo.PPO at 0x7d82a82c73a0>
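
The log reports 8192 timesteps after the first iteration. The short calculation below shows where that number comes from, assuming the Stable-Baselines3 default of `n_steps=2048` rollout steps per environment (the notebook does not set this value explicitly). It also suggests that training runs past the requested 25000 timesteps, because PPO collects full rollouts.

```python
import math

n_envs = 4                # parallel environments, as in the notebook
n_steps = 2048            # assumed default rollout length per environment
total_timesteps = 25_000  # value passed to model.learn()

# Each iteration collects one rollout of n_steps from every environment.
steps_per_iteration = n_envs * n_steps
# Training continues until at least total_timesteps have been collected.
iterations = math.ceil(total_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(steps_per_iteration, iterations, actual_timesteps)  # 8192 4 32768
```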

In [13]:
model.save("ppo_cartpole")

In [14]:
del model # remove to demonstrate saving and loading

model = PPO.load("ppo_cartpole")

In [15]:
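obs = vec_env.reset()
sm = 0; nn = 0  # sum of episode lengths and number of finished episodes
# One pass of 2000 steps per environment index j; only env j is tracked in pass j.
# The envs are not reset between passes, so for j > 0 the first count starts mid-episode.
for j in range(n_envs):
    print("-----", j)
    iprev = 0  # step index at which env j last finished an episode
    for i in range(2000):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = vec_env.step(action)  # done holds one flag per env
        #print(i,obs, reward, "done",done, info)
        if done[j]:
            print(i, i - iprev)  # step index and episode length
            sm += i - iprev; nn += 1
            iprev = i
    #vec_env.render("human")
print("average", sm / nn)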

----- 0
397 397
897 500
1280 383
1663 383
1961 298
----- 1
422 422
737 315
1030 293
1384 354
1884 500
----- 2
254 254
754 500
1254 500
1601 347
1935 334
----- 3
290 290
705 415
1071 366
1316 245
1563 247
average 367.15
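
Since every call to `vec_env.step` already returns one `done` flag per environment, all environments can be tracked in a single pass instead of one pass per environment. The sketch below shows the bookkeeping only, on a small hand-written list of `done` flags; the function name `episode_lengths_per_env` and the toy data are hypothetical.

```python
def episode_lengths_per_env(done_per_step, n_envs):
    """Return one list of finished-episode lengths per environment.

    done_per_step: sequence of per-step done flags, one flag per environment.
    """
    steps_in_episode = [0] * n_envs
    lengths = [[] for _ in range(n_envs)]
    for done in done_per_step:
        for j in range(n_envs):
            steps_in_episode[j] += 1
            if done[j]:
                # Episode of env j just ended; store its length and start over.
                lengths[j].append(steps_in_episode[j])
                steps_in_episode[j] = 0
    return lengths


# Toy data: 5 steps of 2 environments.
toy_done = [
    [False, False],
    [True, False],
    [False, True],
    [False, False],
    [True, False],
]
toy_lengths = episode_lengths_per_env(toy_done, n_envs=2)
print(toy_lengths)  # [[2, 3], [3]]
```

In the notebook, the `done` arrays returned by `vec_env.step(action)` could be collected in a list during a single 2000-step loop and passed to such a helper. Episode lengths of 500 in the output above correspond to the step limit of `CartPole-v1`.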


### A2C

In [16]:
model = A2C("MlpPolicy", env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [17]:
model.learn(total_timesteps=20_000)

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 26.1     |
|    ep_rew_mean        | 26.1     |
| time/                 |          |
|    fps                | 449      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.665   |
|    explained_variance | 0.375    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -3.2     |
|    value_loss         | 34.7     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 33.3     |
|    ep_rew_mean        | 33.3     |
| time/                 |          |
|    fps                | 449      |
|    iterations         | 200      |
|    time_elapsed       | 2        |
|    total_timesteps    | 1000     |
| train/                |          |
|

<stable_baselines3.a2c.a2c.A2C at 0x7d82a811c130>

In [18]:
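vec_env = model.get_env()
obs = vec_env.reset()
iprev = 0  # step index at which the previous episode ended
# Run the trained A2C policy for 1000 steps and print each episode length.
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    #print(i,obs, reward, "done",done, info)
    if done:  # the vectorized env resets itself when an episode ends
        print(i, i - iprev)  # step index and episode length
        iprev = i
    #vec_env.render("human")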

93 93
195 102
284 89
365 81
424 59
559 135
642 83
717 75
800 83
892 92
985 93
