# RLlib Sample Application: CartPole-v1

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `CartPole-v1` environment:

  - <https://gym.openai.com/envs/CartPole-v1/> 

Although this example is relatively simple and quick to run, its results are easy to interpret visually.

For more background about this problem, see:

  - ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077)  
AG Barto, RS Sutton and CW Anderson  
*IEEE Transactions on Systems, Man, and Cybernetics* (1983)
  
  - ["Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)"](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288)  
[Greg Surma](https://twitter.com/GSurma)

---

First, make sure that Ray and RLlib are installed, as well as Gym…

In [None]:
!pip install "ray[rllib]"
!pip install gym

Then start Ray…

In [1]:
import ray
import ray.rllib.agents.ppo as ppo

ray.shutdown()
ray.init(ignore_reinit_error=True)

2020-07-06 13:40:12,792	INFO resource_spec.py:212 -- Starting Ray with 3.37 GiB memory available for workers and up to 1.7 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 13:40:13,347	INFO services.py:1165 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.65',
 'raylet_ip_address': '192.168.1.65',
 'redis_address': '192.168.1.65:6379',
 'object_store_address': '/tmp/ray/session_2020-07-06_13-40-12_777049_83989/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-07-06_13-40-12_777049_83989/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-07-06_13-40-12_777049_83989'}

After a successful launch, the Ray dashboard will be running on a local port:

In [7]:
print("Dashboard URL: http://{}".format(ray.get_webui_url()))

Dashboard URL: http://localhost:8265


Open that URL in another tab to view the Ray dashboard as the example runs. We'll also set up a checkpoint location to store the trained policy:

In [2]:
import os
import shutil

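# Start from a clean checkpoint directory (relative to the working directory).
CHECKPOINT_ROOT = "tmp/ppo/cart"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True)

# Also clear the results directory that TensorBoard reads later.
# Note: this deletes the results of any earlier runs stored there.
ray_results = os.path.join(os.path.expanduser("~"), "ray_results")
shutil.rmtree(ray_results, ignore_errors=True)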

Next, we'll train an RLlib policy with the [`CartPole-v1`](https://gym.openai.com/envs/CartPole-v1/) environment:

In [3]:
SELECT_ENV = "CartPole-v1"

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-07-06 13:40:32,685	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-07-06 13:40:32,685	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


In [4]:
N_ITER = 40
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"

for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)

    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name
        ))

  1 reward   9.00/ 22.65/ 63.00 len  22.65 saved tmp/ppo/cart/checkpoint_1/checkpoint-1
  2 reward  12.00/ 42.72/151.00 len  42.72 saved tmp/ppo/cart/checkpoint_2/checkpoint-2
  3 reward  12.00/ 68.17/322.00 len  68.17 saved tmp/ppo/cart/checkpoint_3/checkpoint-3
  4 reward  13.00/ 97.87/408.00 len  97.87 saved tmp/ppo/cart/checkpoint_4/checkpoint-4
  5 reward  13.00/131.53/500.00 len 131.53 saved tmp/ppo/cart/checkpoint_5/checkpoint-5
  6 reward  13.00/165.24/500.00 len 165.24 saved tmp/ppo/cart/checkpoint_6/checkpoint-6
  7 reward  13.00/202.48/500.00 len 202.48 saved tmp/ppo/cart/checkpoint_7/checkpoint-7
  8 reward  22.00/233.83/500.00 len 233.83 saved tmp/ppo/cart/checkpoint_8/checkpoint-8
  9 reward  22.00/271.82/500.00 len 271.82 saved tmp/ppo/cart/checkpoint_9/checkpoint-9
 10 reward  22.00/302.99/500.00 len 302.99 saved tmp/ppo/cart/checkpoint_10/checkpoint-10
 11 reward  29.00/333.17/500.00 len 333.17 saved tmp/ppo/cart/checkpoint_11/checkpoint-11
 12 reward  29.00/363.38/500

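Check whether the min/mean/max episode rewards increase over successive iterations.
Rising rewards indicate that the policy is improving. In `CartPole-v1` an episode is capped at 500 steps, so 500 is the highest reward an episode can reach.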
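As a quick sanity check, the trend can be tested programmatically. The sketch below is only an illustration: the helper names `reward_deltas` and `is_improving` are made up here, and the list holds the mean rewards copied from the first eleven iterations of the training output above.

```python
# Mean episode rewards copied from iterations 1-11 of the training output above.
mean_rewards = [22.65, 42.72, 68.17, 97.87, 131.53, 165.24,
                202.48, 233.83, 271.82, 302.99, 333.17]


def reward_deltas(rewards):
    """Return the change in mean reward between consecutive iterations."""
    return [later - earlier for earlier, later in zip(rewards, rewards[1:])]


def is_improving(rewards):
    """True if every iteration's mean reward is higher than the previous one."""
    return all(delta > 0 for delta in reward_deltas(rewards))


print(is_improving(mean_rewards))
```

In the notebook itself, the same list could be built inside the training loop by appending `result["episode_reward_mean"]` on each iteration. A strictly increasing mean is a strong condition; noisier runs may call for a comparison over a window of iterations instead.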

Also, print a summary of the policy's model to inspect the network that was trained…

In [5]:
policy = agent.get_policy()
model = policy.model
print(model.base_model.summary())

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
observations (InputLayer)       [(None, 4)]          0                                            
__________________________________________________________________________________________________
fc_1 (Dense)                    (None, 256)          1280        observations[0][0]               
__________________________________________________________________________________________________
fc_value_1 (Dense)              (None, 256)          1280        observations[0][0]               
__________________________________________________________________________________________________
fc_2 (Dense)                    (None, 256)          65792       fc_1[0][0]                       
______________________________________________________________________________________________
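The `Param #` column in the summary can be reproduced by hand. A fully connected (dense) layer has one weight per input per unit, plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a dense layer: a weight per (input, unit) pair plus a bias per unit."""
    return n_inputs * n_units + n_units


# fc_1 and fc_value_1: the 4 CartPole observations feed 256 units.
print(dense_params(4, 256))    # 1280

# fc_2: the 256 outputs of fc_1 feed another 256 units.
print(dense_params(256, 256))  # 65792
```

Both values match the summary: 4 × 256 + 256 = 1280 and 256 × 256 + 256 = 65792.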

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "cartpole" agent operating within the simulation: moving the cart left or right to avoid having the pole fall over.
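Conceptually, a rollout repeatedly resets the environment, asks the policy for an action at each step, and sums the rewards per episode until the step budget runs out. The sketch below shows that loop in a form that runs without Ray or Gym. `ToyTiltEnv`, `balance_policy`, and `rollout` are hypothetical stand-ins written for this illustration; the actual `rllib rollout` script restores the trained policy from the checkpoint and may differ in detail.

```python
import random


class ToyTiltEnv:
    """Hypothetical stand-in for CartPole with a classic Gym-style interface.

    The observation is a single integer "tilt". An episode ends when the
    tilt leaves [-limit, limit] or after max_steps steps.
    """

    def __init__(self, limit=3, max_steps=500, seed=0):
        self.limit = limit
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.tilt = 0
        self.steps = 0

    def reset(self):
        self.tilt = 0
        self.steps = 0
        return self.tilt

    def step(self, action):
        # Action 1 pushes the tilt up by one, action 0 pushes it down by one.
        self.tilt += 1 if action == 1 else -1
        # A random disturbance of -1, 0, or +1 stands in for the physics.
        self.tilt += self.rng.choice([-1, 0, 1])
        self.steps += 1
        done = abs(self.tilt) > self.limit or self.steps >= self.max_steps
        return self.tilt, 1.0, done, {}  # reward of 1.0 for every step taken


def balance_policy(tilt):
    """Hand-written policy: always push against the current tilt."""
    return 0 if tilt > 0 else 1


def rollout(env, choose_action, total_steps):
    """Run episodes until total_steps steps are used; return rewards of finished episodes."""
    episode_rewards = []
    steps = 0
    while steps < total_steps:
        obs = env.reset()
        done = False
        episode_reward = 0.0
        while not done and steps < total_steps:
            obs, reward, done, _ = env.step(choose_action(obs))
            episode_reward += reward
            steps += 1
        if done:  # an episode cut off by the step budget is not reported
            episode_rewards.append(episode_reward)
    return episode_rewards


print(rollout(ToyTiltEnv(), balance_policy, total_steps=2000))
```

With the corrective policy the toy tilt never leaves its limits, so each episode runs to the 500-step cap and the 2000-step budget yields four episodes of reward 500.0, the same shape of output as the `Episode #N: reward: 500.0` lines printed by the real rollout.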

In [6]:
! rllib rollout \
    tmp/ppo/cart/checkpoint_40/checkpoint-40 \
    --config "{\"env\": \"CartPole-v1\"}" \
    --run PPO \
    --steps 2000

2020-07-06 13:44:33,438	INFO resource_spec.py:212 -- Starting Ray with 4.0 GiB memory available for workers and up to 2.02 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 13:44:34,053	INFO services.py:1165 -- View the Ray dashboard at localhost:8266
2020-07-06 13:44:35,190	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-07-06 13:44:35,190	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-07-06 13:44:40,796	INFO trainable.py:423 -- Restored on 192.168.1.65 from checkpoint: tmp/ppo/cart/checkpoint_40/checkpoint-40
2020-07-06 13:44:40,796	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 40, '_timesteps_total': None, '_time_total': 195.11768531799316, '_episodes_total': 621}
Episode #0: reward: 500.0
Episode #1: reward: 500.0
Episode

The rollout uses the last saved checkpoint (`checkpoint_40`, since `N_ITER = 40`), evaluated for `2000` steps.
Modify the path to evaluate other checkpoints.
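To point the rollout at a different checkpoint, the path can be built from the iteration number. This is a small sketch under one assumption: the directory layout is the one shown in the `saved` column of the training output above (`checkpoint_<n>/checkpoint-<n>`), which may differ in other Ray versions. The helper name `checkpoint_path` is hypothetical.

```python
CHECKPOINT_ROOT = "tmp/ppo/cart"


def checkpoint_path(iteration, root=CHECKPOINT_ROOT):
    """Build the path of the checkpoint saved after the given training iteration.

    Assumes the layout seen in the training output: <root>/checkpoint_<n>/checkpoint-<n>.
    """
    return "{}/checkpoint_{}/checkpoint-{}".format(root, iteration, iteration)


print(checkpoint_path(40))  # tmp/ppo/cart/checkpoint_40/checkpoint-40
print(checkpoint_path(2))   # tmp/ppo/cart/checkpoint_2/checkpoint-2
```

Comparing an early checkpoint such as iteration 2 with the final one makes the effect of training visible in the rollout.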

---

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started), then follow its instructions (copy and paste the URL it generates) to visualize key metrics from training with RLlib…

In [None]:
!pip install tensorflow
!tensorboard --logdir=$HOME/ray_results