Performance Check (Discrete actions) #49

Closed · 3 tasks done · Fixed by #110
araffin opened this issue Jun 9, 2020 · 8 comments
araffin (Member) commented Jun 9, 2020

The discrete action counterpart of #48

Associated PR: #110

Test envs: Atari Games (Pong - easy, Breakout - medium, ...)

Miffyli (Collaborator) commented Jun 9, 2020

Initial results with PPO: it seems to mostly match the performance of SB PPO2, but with some glaring discrepancies (see the attached training runs on six games with somewhat different action spaces). It seems that at least a few games should be used for evaluation, because on some games the SB3 version gets similar performance (e.g. MsPacman, Q*bert), but on others it does not reach the same numbers (e.g. Breakout, Enduro). I still have to double-check that the parameters were right, etc.

atari_ppo_sb.pdf
atari_ppo_sb3.pdf

araffin (Member, Author) commented Jun 10, 2020

Are you using the zoo? And if so, which wrapper?
You should be using the dqn branch for SB3 and the zoo.

Miffyli (Collaborator) commented Jun 10, 2020

> Are you using the zoo? And if so, which wrapper?
> You should be using the dqn branch for SB3 and the zoo.

No zoo; it is based on this code. The wrappers are copied and modified from SB. The only thing that changes between the SB and SB3 runs is where the algorithm is imported from; the rest is handled by the other code (and is the same).
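
For context, roughly equivalent Atari preprocessing can be reproduced with SB3's built-in wrappers; the sketch below is a generic setup, not the actual script referenced above:

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_env():
    env = gym.make("BreakoutNoFrameskip-v4")
    # DeepMind-style preprocessing: noop reset, frame skip, episodic life,
    # 84x84 grayscale observations, reward clipping.
    return AtariWrapper(env)

# 8 parallel environments with 4-frame stacking, as is typical for Atari PPO.
venv = VecFrameStack(DummyVecEnv([make_env for _ in range(8)]), n_stack=4)
model = PPO("CnnPolicy", venv, verbose=1)
model.learn(total_timesteps=1_000_000)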

m-rph (Contributor) commented Jul 17, 2020

Cross-posting:

Relevant to this: I am getting some rather weird performance from DQN; it seems to reach 0 fps (this was with num_threads=1 and the old polyak update). When using an ensemble of 10 estimators I get much better performance, and I can't pinpoint the issue.

[image: training curves]

In the policy, instead of having a single QNetwork, I have n_estimators identical QNetworks whose estimates are averaged.
Note: this was running on a GPU and the environment was LunarLander.

n_estimators is a hyperparameter for a custom version of DQN that uses an ensemble of n_estimators networks, identical (except for the weights) to the QNetwork of DQN.

This is observed with the latest version of DQN.
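
For illustration, a minimal sketch of the averaging idea described above (hypothetical EnsembleQNetwork class, not the actual implementation):

import torch as th
import torch.nn as nn

class EnsembleQNetwork(nn.Module):
    # n_estimators identical MLP Q-networks; their Q-value estimates are averaged.
    def __init__(self, obs_dim, n_actions, n_estimators=10, hidden=256):
        super().__init__()
        self.estimators = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )
            for _ in range(n_estimators)
        ])

    def forward(self, obs):
        # Stack the per-estimator outputs and average over the ensemble dimension.
        return th.stack([q(obs) for q in self.estimators], dim=0).mean(dim=0)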

jarlva commented Jul 17, 2020

Hello all,

I've been an avid SB1 user for over a year. It is an amazing framework with thorough documentation and an active support community.
New RL developments have propelled RL to new highs, for example asynchronous PPO, which can scale 3x and more on the same hardware. In my humble opinion it may be a good time to start thinking seriously about async; I believe SB3 would greatly benefit from it, making it a strong, viable framework for the future!

Miffyli (Collaborator) commented Jul 17, 2020

@partiallytyped

I will work on DQN next. Could you share which envs/settings you used to get stuck like that with a "standard" setup?

@jarlva

This is on the suggestions list for v1.2, I believe. At the moment we are working on optimizing the performance of even the synchronous variants, and PyTorch is not making things too easy with its tendency to use too many threads at the same time, etc. :)
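
As an aside, the usual way to cap PyTorch's CPU thread usage is the standard torch.set_num_threads call (a generic workaround, not something specific to SB3):

import torch

# Limit intra-op parallelism; by default PyTorch may spawn one thread per core.
torch.set_num_threads(1)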

jarlva commented Jul 17, 2020

Completely understand, @Miffyli. Would it be helpful to review https://github.com/alex-petrenko/sample-factory?

m-rph (Contributor) commented Jul 17, 2020

@Miffyli

The script that runs the DQN agent:

from stable_baselines3 import DQN
import argparse


if __name__ == "__main__":
    # Arguments forwarded to the DQN constructor.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", "--learning-rate", type=float, default=1e-4, dest="learning_rate")
    parser.add_argument("env", type=str)
    parser.add_argument("--policy", default="MlpPolicy")
    parser.add_argument("--policy-kwargs", type=eval, default={})
    parser.add_argument("--buffer-size", type=int, default=int(1e5))
    parser.add_argument("--learning-starts", type=int, default=5000)
    parser.add_argument("--batch-size", default=32, type=int)
    parser.add_argument("--tau", type=float, default=1.0)
    parser.add_argument("--gamma", default=0.99, type=float)
    parser.add_argument("--train-freq", type=int, default=4)
    parser.add_argument("--gradient-steps", type=int, default=-1)
    parser.add_argument("--n-episodes-rollout", type=int, default=-1)
    parser.add_argument("--target-update-interval", type=int, default=5000)
    parser.add_argument("--exploration-fraction", type=float, default=0.2)
    parser.add_argument("--exploration-initial-eps", type=float, default=1.0)
    parser.add_argument("--exploration-final-eps", type=float, default=0.05)
    # Arguments forwarded to agent.learn(); anything the first parser does not
    # recognize is passed on to this one via parse_known_args().
    learn = argparse.ArgumentParser()
    learn.add_argument("--n-timesteps", default=int(5e5), type=int, dest="total_timesteps")
    learn.add_argument("--eval-freq", type=int, default=10)
    learn.add_argument("--n-eval-episodes", type=int, default=5)
    agent_args, learn_args = parser.parse_known_args()
    learn_args = learn.parse_args(learn_args)

    agent = DQN(**agent_args.__dict__, verbose=2, create_eval_env=True, tensorboard_log=f"tb/dqn_{agent_args.env}")
    agent.learn(**learn_args.__dict__)

The command I call the above script with:

python dqn.py "LunarLander-v2" --n-timesteps=50000 --learning-rate 1e-4 --batch-size 128 --buffer-size 50000 --learning-starts 0 --gamma 0.99 --target-update-interval 1000 --train-freq 4 --gradient-steps -1 --exploration-fraction 0.12 --exploration-final-eps 0.05 --policy-kwargs "dict(net_arch=[256, 256])"

The hyperparameters (except the learning rate) are taken from the zoo.
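
For reference, the command above corresponds roughly to constructing the agent directly as follows (a sketch using the same SB3 DQN keyword arguments as the script; the extra logging options are omitted):

from stable_baselines3 import DQN

agent = DQN(
    "MlpPolicy",
    "LunarLander-v2",
    learning_rate=1e-4,
    buffer_size=50000,
    learning_starts=0,
    batch_size=128,
    gamma=0.99,
    train_freq=4,
    gradient_steps=-1,
    target_update_interval=1000,
    exploration_fraction=0.12,
    exploration_final_eps=0.05,
    policy_kwargs=dict(net_arch=[256, 256]),
    verbose=2,
)
agent.learn(total_timesteps=50000)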
