# Solving RL problems with `ray[rllib]`

<img src="images/cartpole.jpg" width="500"></img>

## Step 1: Initialize `ray`

- `ray` is a package that provides distributed computing primitives. `rllib`, the reinforcement learning library used here, is built on top of `ray`.

In [1]:
import ray

ray.init()

{'node_ip_address': '192.168.0.98',
 'raylet_ip_address': '192.168.0.98',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2022-12-23_16-35-03_047201_3487/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-12-23_16-35-03_047201_3487/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2022-12-23_16-35-03_047201_3487',
 'metrics_export_port': 65412,
 'gcs_address': '192.168.0.98:63541',
 'address': '192.168.0.98:63541',
 'node_id': 'c654f33cd9d7ab46008a4f0889879874b4cab53d76b0e12fecbd93fe'}

## Step 2: Run an **experiment** to solve RL problems

An experiment involves four things:
- An **RL environment** (e.g. `CartPole-v1`)
- An **RL algorithm** to learn in that environment (e.g. Proximal Policy Optimization (PPO))
- **Configuration** (algorithm config, experiment config, environment config, etc.)
- An **experiment runner** (called `tune`)
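
The four ingredients above can be sketched as a single call. This is a minimal sketch under stated assumptions, not the notebook's exact cell: the helper names `build_experiment` and `run_experiment` and the `stop` thresholds are hypothetical, and `ray` is imported inside the function so the configuration can be inspected without launching a run.

```python
def build_experiment():
    """Collect the pieces of an experiment: algorithm, config, stopping rule."""
    algorithm = "PPO"                 # the RL algorithm
    config = {"env": "CartPole-v1"}   # the RL environment; unset keys keep their defaults
    # Hypothetical stopping rule (an assumption, not from the notebook):
    # Tune stops the trial as soon as any one of these thresholds is reached.
    stop = {"episode_reward_mean": 475, "training_iteration": 50}
    return algorithm, config, stop


def run_experiment():
    """Hand the experiment to the runner (`tune`). Requires `ray[rllib]`."""
    from ray import tune  # imported lazily; only needed when actually running

    algorithm, config, stop = build_experiment()
    return tune.run(algorithm, config=config, stop=stop)
```

Calling `run_experiment()` after `ray.init()` would launch the trial. When no `stop` argument is given, the trial appears to keep training until it is interrupted, which is why every result in the output of the next cell reports `done: false`.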

In [3]:
from ray import tune

tune.run("PPO",
         config={
             "env": "CartPole-v1",
             # Other configuration goes here; any option that is not provided keeps its default value.
         })

[2m[36m(PPOTrainer pid=3658)[0m 2022-12-23 16:35:56,890	INFO trainer.py:2140 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
[2m[36m(PPOTrainer pid=3658)[0m 2022-12-23 16:35:56,890	INFO ppo.py:249 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
[2m[36m(PPOTrainer pid=3658)[0m 2022-12-23 16:35:56,891	INFO trainer.py:779 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Trial name,status,loc
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658




Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2022-12-23_16-36-03
  done: false
  episode_len_mean: 22.666666666666668
  episode_media: {}
  episode_reward_max: 90.0
  episode_reward_mean: 22.666666666666668
  episode_reward_min: 9.0
  episodes_this_iter: 174
  episodes_total: 174
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6677238941192627
          entropy_coeff: 0.0
          kl: 0.0263458751142025
          model: {}
          policy_loss: -0.031717997044324875
          total_loss: 222.7870330810547
          vf_explained_var: 0.019372833892703056
          vf_loss: 222.8134765625
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
    num_ste

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,1,3.32443,4000,22.6667,90,9,22.6667


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2022-12-23_16-36-09
  done: false
  episode_len_mean: 67.2
  episode_media: {}
  episode_reward_max: 281.0
  episode_reward_mean: 67.2
  episode_reward_min: 10.0
  episodes_this_iter: 38
  episodes_total: 299
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5797845721244812
          entropy_coeff: 0.0
          kl: 0.010812313295900822
          model: {}
          policy_loss: -0.02183121256530285
          total_loss: 757.8624267578125
          vf_explained_var: 0.13413472473621368
          vf_loss: 757.8809814453125
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_ite

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,3,9.39606,12000,67.2,281,10,67.2


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2022-12-23_16-36-15
  done: false
  episode_len_mean: 128.16
  episode_media: {}
  episode_reward_max: 456.0
  episode_reward_mean: 128.16
  episode_reward_min: 10.0
  episodes_this_iter: 16
  episodes_total: 334
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5438970327377319
          entropy_coeff: 0.0
          kl: 0.005624917335808277
          model: {}
          policy_loss: -0.010234670713543892
          total_loss: 693.485107421875
          vf_explained_var: 0.2024756669998169
          vf_loss: 693.4935302734375
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,5,15.5475,20000,128.16,456,10,128.16


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,6,18.5533,24000,161.53,500,10,161.53


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2022-12-23_16-36-21
  done: false
  episode_len_mean: 200.43
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 200.43
  episode_reward_min: 19.0
  episodes_this_iter: 10
  episodes_total: 356
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 4.999999873689376e-05
          entropy: 0.5294512510299683
          entropy_coeff: 0.0
          kl: 0.00397925078868866
          model: {}
          policy_loss: -0.008955384604632854
          total_loss: 438.78582763671875
          vf_explained_var: 0.16676998138427734
          vf_loss: 438.7941589355469
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_thi

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,8,24.6207,32000,230.13,500,19,230.13
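
The `cur_kl_coeff` values logged so far (0.2, then 0.3, then 0.15) change in steps of ×1.5 and ×0.5. That pattern is consistent with an adaptive KL penalty driven by the logged `kl` value. The sketch below reproduces it under stated assumptions: the function name `update_kl_coeff` is hypothetical, and the target of `0.01` is assumed because the configuration is not shown in this output. It is not taken from the RLlib source.

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target=0.01):
    """Return the KL penalty coefficient to use in the next iteration.

    Assumed rule: grow the penalty when the policy moved too far
    (KL above twice the target), shrink it when the policy barely
    moved (KL below half the target), otherwise leave it unchanged.
    """
    if sampled_kl > 2.0 * kl_target:
        return kl_coeff * 1.5
    if sampled_kl < 0.5 * kl_target:
        return kl_coeff * 0.5
    return kl_coeff


# (cur_kl_coeff, kl) pairs copied from the results logged above.
logged = [
    (0.20000000298023224, 0.0263458751142025),    # 4000 timesteps
    (0.30000001192092896, 0.010812313295900822),  # 12000 timesteps
    (0.15000000596046448, 0.00397925078868866),   # 28000 timesteps
]

for coeff, kl in logged:
    print(f"coeff={coeff:.4f} kl={kl:.4f} -> next coeff={update_kl_coeff(coeff, kl):.4f}")
```

Under this rule the first result (KL above 0.02) raises the coefficient from 0.2 to 0.3, the second (KL between 0.005 and 0.02) leaves it at 0.3, and the third (KL below 0.005) halves it from 0.15 to 0.075, matching the direction of the changes in the log.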


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2022-12-23_16-36-27
  done: false
  episode_len_mean: 265.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 265.0
  episode_reward_min: 19.0
  episodes_this_iter: 10
  episodes_total: 374
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.487382709980011
          entropy_coeff: 0.0
          kl: 0.002882240107282996
          model: {}
          policy_loss: -0.0043022544123232365
          total_loss: 504.5002136230469
          vf_explained_var: 0.026964129880070686
          vf_loss: 504.5043029785156
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,10,30.7445,40000,290.94,500,19,290.94


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 44000
  custom_metrics: {}
  date: 2022-12-23_16-36-34
  done: false
  episode_len_mean: 324.9
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 324.9
  episode_reward_min: 19.0
  episodes_this_iter: 9
  episodes_total: 391
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.01875000074505806
          cur_lr: 4.999999873689376e-05
          entropy: 0.5414067506790161
          entropy_coeff: 0.0
          kl: 0.004195119719952345
          model: {}
          policy_loss: -0.0025419823359698057
          total_loss: 468.4158935546875
          vf_explained_var: 0.08807938545942307
          vf_loss: 468.4183349609375
    num_agent_steps_sampled: 44000
    num_agent_steps_trained: 44000
    num_steps_sampled: 44000
    num_steps_trained: 44000
    num_steps_trained_this_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,11,33.7701,44000,324.9,500,19,324.9


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 52000
  custom_metrics: {}
  date: 2022-12-23_16-36-40
  done: false
  episode_len_mean: 377.26
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 377.26
  episode_reward_min: 56.0
  episodes_this_iter: 8
  episodes_total: 407
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.00937500037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.5010088682174683
          entropy_coeff: 0.0
          kl: 0.006605195347219706
          model: {}
          policy_loss: -0.00037172867450863123
          total_loss: 484.509765625
          vf_explained_var: 0.030472297221422195
          vf_loss: 484.51007080078125
    num_agent_steps_sampled: 52000
    num_agent_steps_trained: 52000
    num_steps_sampled: 52000
    num_steps_trained: 52000
    num_steps_trained_this

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,13,39.8612,52000,377.26,500,56,377.26


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 60000
  custom_metrics: {}
  date: 2022-12-23_16-36-46
  done: false
  episode_len_mean: 426.26
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 426.26
  episode_reward_min: 145.0
  episodes_this_iter: 8
  episodes_total: 423
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.00937500037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.5119425654411316
          entropy_coeff: 0.0
          kl: 0.002727864310145378
          model: {}
          policy_loss: -0.0022586756385862827
          total_loss: 531.2691650390625
          vf_explained_var: -0.018852492794394493
          vf_loss: 531.2714233398438
    num_agent_steps_sampled: 60000
    num_agent_steps_trained: 60000
    num_steps_sampled: 60000
    num_steps_trained: 60000
    num_steps_trained_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,15,45.9458,60000,426.26,500,145,426.26


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,16,49.0046,64000,444.84,500,152,444.84


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 68000
  custom_metrics: {}
  date: 2022-12-23_16-36-52
  done: false
  episode_len_mean: 458.9
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 458.9
  episode_reward_min: 85.0
  episodes_this_iter: 9
  episodes_total: 440
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0023437500931322575
          cur_lr: 4.999999873689376e-05
          entropy: 0.45621252059936523
          entropy_coeff: 0.0
          kl: 0.0029501868411898613
          model: {}
          policy_loss: -0.0014081724220886827
          total_loss: 563.0380249023438
          vf_explained_var: -0.06301168352365494
          vf_loss: 563.0394287109375
    num_agent_steps_sampled: 68000
    num_agent_steps_trained: 68000
    num_steps_sampled: 68000
    num_steps_trained: 68000
    num_steps_trained_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,18,55.0431,72000,472.96,500,85,472.96


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 76000
  custom_metrics: {}
  date: 2022-12-23_16-36-58
  done: false
  episode_len_mean: 478.36
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 478.36
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 456
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0011718750465661287
          cur_lr: 4.999999873689376e-05
          entropy: 0.4239750802516937
          entropy_coeff: 0.0
          kl: 0.005886388476938009
          model: {}
          policy_loss: -0.0022193826735019684
          total_loss: 529.1740112304688
          vf_explained_var: -0.021180717274546623
          vf_loss: 529.1762084960938
    num_agent_steps_sampled: 76000
    num_agent_steps_trained: 76000
    num_steps_sampled: 76000
    num_steps_trained: 76000
    num_steps_trained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,20,61.1105,80000,481.37,500,85,481.37


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 84000
  custom_metrics: {}
  date: 2022-12-23_16-37-04
  done: false
  episode_len_mean: 488.19
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 488.19
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 472
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0005859375232830644
          cur_lr: 4.999999873689376e-05
          entropy: 0.3980620205402374
          entropy_coeff: 0.0
          kl: 0.0037860707379877567
          model: {}
          policy_loss: -0.0029902078676968813
          total_loss: 501.73126220703125
          vf_explained_var: 0.03084975853562355
          vf_loss: 501.73419189453125
    num_agent_steps_sampled: 84000
    num_agent_steps_trained: 84000
    num_steps_sampled: 84000
    num_steps_trained: 84000
    num_steps_traine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,21,64.1397,84000,488.19,500,85,488.19


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 92000
  custom_metrics: {}
  date: 2022-12-23_16-37-10
  done: false
  episode_len_mean: 491.45
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 491.45
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 488
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0001464843808207661
          cur_lr: 4.999999873689376e-05
          entropy: 0.39177200198173523
          entropy_coeff: 0.0
          kl: 0.006307192612439394
          model: {}
          policy_loss: -0.004075607750564814
          total_loss: 513.6663818359375
          vf_explained_var: 0.03715536370873451
          vf_loss: 513.67041015625
    num_agent_steps_sampled: 92000
    num_agent_steps_trained: 92000
    num_steps_sampled: 92000
    num_steps_trained: 92000
    num_steps_trained_thi

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,23,70.3336,92000,491.45,500,85,491.45


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 100000
  custom_metrics: {}
  date: 2022-12-23_16-37-16
  done: false
  episode_len_mean: 495.85
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 495.85
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 504
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0001464843808207661
          cur_lr: 4.999999873689376e-05
          entropy: 0.38255375623703003
          entropy_coeff: 0.0
          kl: 0.004798873793333769
          model: {}
          policy_loss: -0.002857438288629055
          total_loss: 499.1186828613281
          vf_explained_var: -0.08942034095525742
          vf_loss: 499.1215515136719
    num_agent_steps_sampled: 100000
    num_agent_steps_trained: 100000
    num_steps_sampled: 100000
    num_steps_trained: 100000
    num_steps_tra

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,25,76.3871,100000,495.85,500,85,495.85


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,26,79.4227,104000,495.85,500,85,495.85


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 108000
  custom_metrics: {}
  date: 2022-12-23_16-37-22
  done: false
  episode_len_mean: 495.85
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 495.85
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 520
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3.662109520519152e-05
          cur_lr: 4.999999873689376e-05
          entropy: 0.3938816487789154
          entropy_coeff: 0.0
          kl: 0.0015072141541168094
          model: {}
          policy_loss: 0.00038365216460078955
          total_loss: 539.7266235351562
          vf_explained_var: -0.0835423469543457
          vf_loss: 539.7262573242188
    num_agent_steps_sampled: 108000
    num_agent_steps_trained: 108000
    num_steps_sampled: 108000
    num_steps_trained: 108000
    num_steps_tra

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,28,85.4967,112000,495.85,500,85,495.85


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 116000
  custom_metrics: {}
  date: 2022-12-23_16-37-29
  done: false
  episode_len_mean: 495.85
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 495.85
  episode_reward_min: 85.0
  episodes_this_iter: 8
  episodes_total: 536
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 9.15527380129788e-06
          cur_lr: 4.999999873689376e-05
          entropy: 0.38424333930015564
          entropy_coeff: 0.0
          kl: 0.006487591657787561
          model: {}
          policy_loss: -0.0018671861616894603
          total_loss: 507.647216796875
          vf_explained_var: -0.022645175457000732
          vf_loss: 507.6490478515625
    num_agent_steps_sampled: 116000
    num_agent_steps_trained: 116000
    num_steps_sampled: 116000
    num_steps_trained: 116000
    num_steps_tra

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,30,91.5883,120000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 124000
  custom_metrics: {}
  date: 2022-12-23_16-37-35
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 552
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 9.15527380129788e-06
          cur_lr: 4.999999873689376e-05
          entropy: 0.4100276827812195
          entropy_coeff: 0.0
          kl: 0.002599272644147277
          model: {}
          policy_loss: -0.0005167347262613475
          total_loss: 534.5759887695312
          vf_explained_var: 0.02402246743440628
          vf_loss: 534.5764770507812
    num_agent_steps_sampled: 124000
    num_agent_steps_trained: 124000
    num_steps_sampled: 124000
    num_steps_trained: 124000
    num_steps_traine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,31,94.6975,124000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 132000
  custom_metrics: {}
  date: 2022-12-23_16-37-41
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 568
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.28881845032447e-06
          cur_lr: 4.999999873689376e-05
          entropy: 0.4069482982158661
          entropy_coeff: 0.0
          kl: 0.005019642412662506
          model: {}
          policy_loss: 8.927865565055981e-05
          total_loss: 497.6102600097656
          vf_explained_var: -0.0994281992316246
          vf_loss: 497.6101379394531
    num_agent_steps_sampled: 132000
    num_agent_steps_trained: 132000
    num_steps_sampled: 132000
    num_steps_trained: 132000
    num_steps_trained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,33,100.765,132000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 140000
  custom_metrics: {}
  date: 2022-12-23_16-37-47
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 584
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.144409225162235e-06
          cur_lr: 4.999999873689376e-05
          entropy: 0.3817431926727295
          entropy_coeff: 0.0
          kl: 0.0031131699215620756
          model: {}
          policy_loss: -0.0003612424770835787
          total_loss: 545.5170288085938
          vf_explained_var: -0.13232421875
          vf_loss: 545.5173950195312
    num_agent_steps_sampled: 140000
    num_agent_steps_trained: 140000
    num_steps_sampled: 140000
    num_steps_trained: 140000
    num_steps_trained_t

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,35,106.842,140000,500,500,500,500


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,36,109.881,144000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 148000
  custom_metrics: {}
  date: 2022-12-23_16-37-53
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 600
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.8610230629055877e-07
          cur_lr: 4.999999873689376e-05
          entropy: 0.3397999703884125
          entropy_coeff: 0.0
          kl: 0.002356970449909568
          model: {}
          policy_loss: 9.650844731368124e-05
          total_loss: 500.5468444824219
          vf_explained_var: -0.057527437806129456
          vf_loss: 500.54669189453125
    num_agent_steps_sampled: 148000
    num_agent_steps_trained: 148000
    num_steps_sampled: 148000
    num_steps_trained: 148000
    num_steps_tr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,38,116.032,152000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 156000
  custom_metrics: {}
  date: 2022-12-23_16-37-59
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 616
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 7.152557657263969e-08
          cur_lr: 4.999999873689376e-05
          entropy: 0.3305110037326813
          entropy_coeff: 0.0
          kl: 0.003707656404003501
          model: {}
          policy_loss: -0.0004925053799524903
          total_loss: 542.630126953125
          vf_explained_var: -0.0788637027144432
          vf_loss: 542.630615234375
    num_agent_steps_sampled: 156000
    num_agent_steps_trained: 156000
    num_steps_sampled: 156000
    num_steps_trained: 156000
    num_steps_trained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,40,122.091,160000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 164000
  custom_metrics: {}
  date: 2022-12-23_16-38-05
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 632
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.7881394143159923e-08
          cur_lr: 4.999999873689376e-05
          entropy: 0.31372615694999695
          entropy_coeff: 0.0
          kl: 0.0007560974918305874
          model: {}
          policy_loss: -0.001454669632948935
          total_loss: 440.6395263671875
          vf_explained_var: -0.10135575383901596
          vf_loss: 440.6409912109375
    num_agent_steps_sampled: 164000
    num_agent_steps_trained: 164000
    num_steps_sampled: 164000
    num_steps_trained: 164000
    num_steps_tr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,41,125.106,164000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 172000
  custom_metrics: {}
  date: 2022-12-23_16-38-12
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 648
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.470348535789981e-09
          cur_lr: 4.999999873689376e-05
          entropy: 0.31648343801498413
          entropy_coeff: 0.0
          kl: 0.001370706595480442
          model: {}
          policy_loss: 0.002188962185755372
          total_loss: 590.8506469726562
          vf_explained_var: -0.33437660336494446
          vf_loss: 590.8485107421875
    num_agent_steps_sampled: 172000
    num_agent_steps_trained: 172000
    num_steps_sampled: 172000
    num_steps_trained: 172000
    num_steps_train

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,43,131.225,172000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 180000
  custom_metrics: {}
  date: 2022-12-23_16-38-18
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 664
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.2351742678949904e-09
          cur_lr: 4.999999873689376e-05
          entropy: 0.30818021297454834
          entropy_coeff: 0.0
          kl: 0.004482596181333065
          model: {}
          policy_loss: -0.0021901451982557774
          total_loss: 534.6130981445312
          vf_explained_var: -0.13855034112930298
          vf_loss: 534.6152954101562
    num_agent_steps_sampled: 180000
    num_agent_steps_trained: 180000
    num_steps_sampled: 180000
    num_steps_trained: 180000
    num_steps_tr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,45,137.307,180000,500,500,500,500


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,46,140.364,184000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 188000
  custom_metrics: {}
  date: 2022-12-23_16-38-24
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 680
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5.587935669737476e-10
          cur_lr: 4.999999873689376e-05
          entropy: 0.3042510747909546
          entropy_coeff: 0.0
          kl: 0.00245378608815372
          model: {}
          policy_loss: -0.001417657476849854
          total_loss: 499.1313781738281
          vf_explained_var: -0.06330925971269608
          vf_loss: 499.1327819824219
    num_agent_steps_sampled: 188000
    num_agent_steps_trained: 188000
    num_steps_sampled: 188000
    num_steps_trained: 188000
    num_steps_traine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,48,146.509,192000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 196000
  custom_metrics: {}
  date: 2022-12-23_16-38-30
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 696
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.396983917434369e-10
          cur_lr: 4.999999873689376e-05
          entropy: 0.2733024060726166
          entropy_coeff: 0.0
          kl: 0.002296850783750415
          model: {}
          policy_loss: -0.0009159276378341019
          total_loss: 471.1797790527344
          vf_explained_var: 0.13280321657657623
          vf_loss: 471.18072509765625
    num_agent_steps_sampled: 196000
    num_agent_steps_trained: 196000
    num_steps_sampled: 196000
    num_steps_trained: 196000
    num_steps_trai

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,50,152.577,200000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 204000
  custom_metrics: {}
  date: 2022-12-23_16-38-36
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 712
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3.4924597935859225e-11
          cur_lr: 4.999999873689376e-05
          entropy: 0.27746322751045227
          entropy_coeff: 0.0
          kl: 0.006151467561721802
          model: {}
          policy_loss: -0.0030704454984515905
          total_loss: 481.7795715332031
          vf_explained_var: -0.29841360449790955
          vf_loss: 481.78265380859375
    num_agent_steps_sampled: 204000
    num_agent_steps_trained: 204000
    num_steps_sampled: 204000
    num_steps_trained: 204000
    num_steps_t

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,51,155.606,204000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 212000
  custom_metrics: {}
  date: 2022-12-23_16-38-42
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 728
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.7462298967929613e-11
          cur_lr: 4.999999873689376e-05
          entropy: 0.2986508011817932
          entropy_coeff: 0.0
          kl: 0.0013733146479353309
          model: {}
          policy_loss: 0.000683144957292825
          total_loss: 486.56427001953125
          vf_explained_var: -0.2059945911169052
          vf_loss: 486.5635986328125
    num_agent_steps_sampled: 212000
    num_agent_steps_trained: 212000
    num_steps_sampled: 212000
    num_steps_trained: 212000
    num_steps_trai

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,53,161.702,212000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 220000
  custom_metrics: {}
  date: 2022-12-23_16-38-48
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 744
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.365574741982403e-12
          cur_lr: 4.999999873689376e-05
          entropy: 0.2781405746936798
          entropy_coeff: 0.0
          kl: 0.003607149003073573
          model: {}
          policy_loss: -0.00031350748031400144
          total_loss: 477.109619140625
          vf_explained_var: -0.11387369781732559
          vf_loss: 477.1099548339844
    num_agent_steps_sampled: 220000
    num_agent_steps_trained: 220000
    num_steps_sampled: 220000
    num_steps_trained: 220000
    num_steps_trai

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,55,167.797,220000,500,500,500,500


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,56,170.827,224000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 228000
  custom_metrics: {}
  date: 2022-12-23_16-38-54
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 760
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.1827873709912016e-12
          cur_lr: 4.999999873689376e-05
          entropy: 0.2417421042919159
          entropy_coeff: 0.0
          kl: 0.0024954811669886112
          model: {}
          policy_loss: 0.00023529196914751083
          total_loss: 499.0150451660156
          vf_explained_var: -0.09930148720741272
          vf_loss: 499.01483154296875
    num_agent_steps_sampled: 228000
    num_agent_steps_trained: 228000
    num_steps_sampled: 228000
    num_steps_trained: 228000
    num_steps_t

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,58,176.883,232000,500,500,500,500


Result for PPO_CartPole-v1_de04c_00000:
  agent_timesteps_total: 236000
  custom_metrics: {}
  date: 2022-12-23_16-39-00
  done: false
  episode_len_mean: 500.0
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 500.0
  episode_reward_min: 500.0
  episodes_this_iter: 8
  episodes_total: 776
  experiment_id: 0edb2c66a3eb4d7484838f4f8775a744
  hostname: dl
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5.456968427478004e-13
          cur_lr: 4.999999873689376e-05
          entropy: 0.23795464634895325
          entropy_coeff: 0.0
          kl: 0.0027954927645623684
          model: {}
          policy_loss: 0.0004834141000173986
          total_loss: 495.6991271972656
          vf_explained_var: 0.05936768278479576
          vf_loss: 495.69866943359375
    num_agent_steps_sampled: 236000
    num_agent_steps_trained: 236000
    num_steps_sampled: 236000
    num_steps_trained: 236000
    num_steps_tra

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_de04c_00000,RUNNING,192.168.0.98:3658,60,182.991,240000,500,500,500,500




(PPOTrainer pid=3658) 2022-12-23 16:39:06,085 ERROR worker.py:430 -- SystemExit was raised from the worker.
(PPOTrainer pid=3658) Traceback (most recent call last):
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 774, in ray._raylet.task_execution_handler
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 595, in ray._raylet.execute_task
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 633, in ray._raylet.execute_task
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
(PPOTrainer pid=3658)   File "python/ray/_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
(PPOTrainer pid=3658)   File "/home/oscar/anaconda3/envs/fastdeeprl/lib/python3.9/site-packages/ray/_private/function_manager.py", l

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fd8dd018160>

### Configuration

These configurations are applied in sequence, so each later layer overrides the ones before it:

1. [Common config](https://docs.ray.io/en/master/rllib-training.html#common-parameters)
2. [Algorithm-specific config (overrides common config)](https://docs.ray.io/en/master/rllib-algorithms.html#ppo)
3. User-defined config (overrides both of the above)
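
The layering above can be sketched as a series of dictionary merges. This only illustrates the precedence order and is not RLlib's implementation: `resolve_config` is a hypothetical helper, the merge is shallow, and the values in the first two dictionaries are placeholders rather than RLlib's documented defaults.

```python
# Placeholder values for illustration only; NOT RLlib's actual defaults.
common_config = {"framework": "tf", "lr": 1e-4, "evaluation_interval": None}
algorithm_config = {"lr": 5e-5, "kl_coeff": 0.2}               # algorithm-specific layer
user_config = {"env": "CartPole-v1", "evaluation_interval": 2}  # what the user passes in


def resolve_config(*layers):
    """Hypothetical helper: merge config layers, later layers win on key clashes."""
    merged = {}
    for layer in layers:
        merged.update(layer)  # shallow merge; nested dicts are replaced, not merged
    return merged


final_config = resolve_config(common_config, algorithm_config, user_config)
print(final_config)
```

In this sketch `lr` ends up with the algorithm-level value because the user did not set it, while `evaluation_interval` ends up with the user's value because the user-defined layer is applied last.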

### Anatomy of an experiment

<img src="images/ex/2.png" width="750"></img>

In [4]:
tune.run("PPO",
         config={"env": "CartPole-v1",
                 "evaluation_interval": 2,       # number of training iterations between evaluations
                 "evaluation_num_episodes": 20,  # number of episodes to run per evaluation
                 "num_gpus": 2,                  # number of GPUs requested for the trainer
                 })

Trial name,status,loc
PPO_CartPole-v1_55144_00000,PENDING,


(PPOTrainer pid=3657) 2022-12-23 16:39:14,870 INFO trainer.py:2140 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(PPOTrainer pid=3657) 2022-12-23 16:39:14,871 INFO ppo.py:249 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPOTrainer pid=3657) 2022-12-23 16:39:14,871 INFO trainer.py:779 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Trial name,status,loc
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657




Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2022-12-23_16-39-28
  done: false
  episode_len_mean: 20.66839378238342
  episode_media: {}
  episode_reward_max: 54.0
  episode_reward_mean: 20.66839378238342
  episode_reward_min: 9.0
  episodes_this_iter: 193
  episodes_total: 193
  experiment_id: e5c692799fb2428096a6846fa76c1b24
  hostname: dl
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6658182740211487
          entropy_coeff: 0.0
          kl: 0.027578743174672127
          model: {}
          policy_loss: -0.03394927456974983
          total_loss: 144.83274841308594
          vf_explained_var: 0.02895354852080345
          vf_loss: 144.86117553710938
        train: null
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
    num_steps_t

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,1,5.69368,4000,20.6684,54,9,20.6684


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2022-12-23_16-39-35
  done: false
  episode_len_mean: 38.94174757281554
  episode_media: {}
  episode_reward_max: 142.0
  episode_reward_mean: 38.94174757281554
  episode_reward_min: 9.0
  episodes_this_iter: 103
  episodes_total: 296
  evaluation:
    custom_metrics: {}
    episode_len_mean: 108.85
    episode_media: {}
    episode_reward_max: 210.0
    episode_reward_mean: 108.85
    episode_reward_min: 26.0
    episodes_this_iter: 20
    hist_stats:
      episode_lengths:
      - 104
      - 142
      - 187
      - 142
      - 26
      - 92
      - 162
      - 40
      - 55
      - 177
      - 138
      - 121
      - 95
      - 210
      - 32
      - 79
      - 127
      - 70
      - 97
      - 81
      episode_reward:
      - 104.0
      - 142.0
      - 187.0
      - 142.0
      - 26.0
      - 92.0
      - 162.0
      - 40.0
      - 55.0
      - 177.0
      - 138.0
      - 121.0
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,2,13.2352,8000,38.9417,142,9,38.9417


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2022-12-23_16-39-40
  done: false
  episode_len_mean: 60.35
  episode_media: {}
  episode_reward_max: 255.0
  episode_reward_mean: 60.35
  episode_reward_min: 9.0
  episodes_this_iter: 38
  episodes_total: 334
  experiment_id: e5c692799fb2428096a6846fa76c1b24
  hostname: dl
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5733065009117126
          entropy_coeff: 0.0
          kl: 0.014462259598076344
          model: {}
          policy_loss: -0.02470884658396244
          total_loss: 605.0413818359375
          vf_explained_var: 0.16552267968654633
          vf_loss: 605.063232421875
        train: null
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 4000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,3,18.3743,12000,60.35,255,9,60.35


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2022-12-23_16-39-53
  done: false
  episode_len_mean: 93.66
  episode_media: {}
  episode_reward_max: 488.0
  episode_reward_mean: 93.66
  episode_reward_min: 9.0
  episodes_this_iter: 15
  episodes_total: 349
  evaluation:
    custom_metrics: {}
    episode_len_mean: 332.2
    episode_media: {}
    episode_reward_max: 500.0
    episode_reward_mean: 332.2
    episode_reward_min: 55.0
    episodes_this_iter: 20
    hist_stats:
      episode_lengths:
      - 375
      - 342
      - 182
      - 233
      - 500
      - 238
      - 500
      - 289
      - 391
      - 413
      - 449
      - 62
      - 289
      - 456
      - 196
      - 445
      - 299
      - 455
      - 475
      - 55
      episode_reward:
      - 375.0
      - 342.0
      - 182.0
      - 233.0
      - 500.0
      - 238.0
      - 500.0
      - 289.0
      - 391.0
      - 413.0
      - 449.0
      - 62.0
      - 289.0
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,4,30.5998,16000,93.66,488,9,93.66


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2022-12-23_16-39-58
  done: false
  episode_len_mean: 130.41
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 130.41
  episode_reward_min: 9.0
  episodes_this_iter: 13
  episodes_total: 362
  experiment_id: e5c692799fb2428096a6846fa76c1b24
  hostname: dl
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5274227261543274
          entropy_coeff: 0.0
          kl: 0.003228301415219903
          model: {}
          policy_loss: -0.014217283576726913
          total_loss: 646.3196411132812
          vf_explained_var: 0.12532773613929749
          vf_loss: 646.333251953125
        train: null
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 4

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,5,35.7388,20000,130.41,500,9,130.41


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2022-12-23_16-40-11
  done: false
  episode_len_mean: 164.04
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 164.04
  episode_reward_min: 9.0
  episodes_this_iter: 11
  episodes_total: 373
  evaluation:
    custom_metrics: {}
    episode_len_mean: 377.85
    episode_media: {}
    episode_reward_max: 500.0
    episode_reward_mean: 377.85
    episode_reward_min: 163.0
    episodes_this_iter: 20
    hist_stats:
      episode_lengths:
      - 249
      - 361
      - 282
      - 163
      - 384
      - 500
      - 500
      - 500
      - 398
      - 380
      - 337
      - 352
      - 419
      - 447
      - 354
      - 421
      - 284
      - 368
      - 500
      - 358
      episode_reward:
      - 249.0
      - 361.0
      - 282.0
      - 163.0
      - 384.0
      - 500.0
      - 500.0
      - 500.0
      - 398.0
      - 380.0
      - 337.0
      - 352.0
      - 419.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,6,48.9021,24000,164.04,500,9,164.04


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2022-12-23_16-40-16
  done: false
  episode_len_mean: 203.33
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 203.33
  episode_reward_min: 12.0
  episodes_this_iter: 11
  episodes_total: 384
  experiment_id: e5c692799fb2428096a6846fa76c1b24
  hostname: dl
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5009874105453491
          entropy_coeff: 0.0
          kl: 0.002488570287823677
          model: {}
          policy_loss: -0.009245635010302067
          total_loss: 364.897705078125
          vf_explained_var: 0.24103020131587982
          vf_loss: 364.9064636230469
        train: null
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,7,54.0168,28000,203.33,500,12,203.33


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2022-12-23_16-40-31
  done: false
  episode_len_mean: 237.33
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 237.33
  episode_reward_min: 12.0
  episodes_this_iter: 8
  episodes_total: 392
  evaluation:
    custom_metrics: {}
    episode_len_mean: 475.0
    episode_media: {}
    episode_reward_max: 500.0
    episode_reward_mean: 475.0
    episode_reward_min: 368.0
    episodes_this_iter: 20
    hist_stats:
      episode_lengths:
      - 483
      - 500
      - 500
      - 453
      - 500
      - 432
      - 500
      - 500
      - 500
      - 500
      - 500
      - 442
      - 500
      - 433
      - 368
      - 500
      - 495
      - 500
      - 394
      - 500
      episode_reward:
      - 483.0
      - 500.0
      - 500.0
      - 453.0
      - 500.0
      - 432.0
      - 500.0
      - 500.0
      - 500.0
      - 500.0
      - 500.0
      - 442.0
      - 500.0


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,8,69.2515,32000,237.33,500,12,237.33


Result for PPO_CartPole-v1_55144_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2022-12-23_16-40-36
  done: false
  episode_len_mean: 271.75
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 271.75
  episode_reward_min: 18.0
  episodes_this_iter: 8
  episodes_total: 400
  experiment_id: e5c692799fb2428096a6846fa76c1b24
  hostname: dl
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.4935498833656311
          entropy_coeff: 0.0
          kl: 0.003270569257438183
          model: {}
          policy_loss: -0.00635184021666646
          total_loss: 239.37002563476562
          vf_explained_var: 0.26556897163391113
          vf_loss: 239.375732421875
        train: null
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 4



Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v1_55144_00000,RUNNING,192.168.0.98:3657,9,74.3803,36000,271.75,500,18,271.75


(PPOTrainer pid=3657) 2022-12-23 16:40:41,084 ERROR worker.py:430 -- SystemExit was raised from the worker.
(PPOTrainer pid=3657) Traceback (most recent call last):
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 774, in ray._raylet.task_execution_handler
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 595, in ray._raylet.execute_task
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 633, in ray._raylet.execute_task
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
(PPOTrainer pid=3657)   File "python/ray/_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
(PPOTrainer pid=3657)   File "/home/oscar/anaconda3/envs/fastdeeprl/lib/python3.9/site-packages/ray/_private/function_manager.py", l

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fd8d7fc1160>
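
In the output above, an `evaluation:` block appears only in the results for training iterations 2, 4, 6 and 8, which matches `"evaluation_interval": 2`. The schedule can be sketched in plain Python as follows. `evaluation_iterations` is a hypothetical helper written for this illustration, not part of `ray`, and it assumes evaluation runs whenever the iteration number is a multiple of the interval.

```python
def evaluation_iterations(num_iterations, evaluation_interval):
    """Return the training iterations after which an evaluation round would run.

    Assumption: evaluation happens when the iteration number is a multiple of
    `evaluation_interval`; a falsy interval (None or 0) means no evaluation.
    """
    if not evaluation_interval:
        return []
    return [i for i in range(1, num_iterations + 1) if i % evaluation_interval == 0]


# The run above reported 9 training iterations before it was stopped.
print(evaluation_iterations(9, 2))  # → [2, 4, 6, 8]
```

Each evaluation round in that run reports `episodes_this_iter: 20`, which corresponds to `"evaluation_num_episodes": 20`.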