# RLlib Sample Application: CartPole

First, let's make sure that Ray and RLlib are installed…

In [1]:
!pip install ray[rllib]
!pip install ray[debug]
!pip install ray[tune]
!pip install pandas
!pip install requests
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.1.0-cp37-cp37m-macosx_10_11_x86_64.whl (120.8 MB)
Collecting keras-applications>=1.0.8
  Using cached Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Processing /Users/deanwampler/Library/Caches/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6/termcolor-1.1.0-cp37-none-any.whl
Collecting google-pasta>=0.1.6
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting wrapt>=1.11.1
  Downloading wrapt-1.12.1.tar.gz (27 kB)
Collecting opt-einsum>=2.3.2
  Downloading opt_einsum-3.2.0-py3-none-any.whl (63 kB)
Collecting tensorboard<2.2.0,>=2.1.0
  Downloading tensorboard-2.1.1-py3-none-any.whl (3.8 MB)

Then we start Ray…

In [8]:
import ray

In [9]:
import ray.rllib.agents.ppo as ppo

In [10]:
ray.shutdown()
ray.init(ignore_reinit_error=True)

2020-03-26 13:59:34,521	INFO resource_spec.py:212 -- Starting Ray with 4.25 GiB memory available for workers and up to 2.14 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-03-26 13:59:35,642	INFO services.py:498 -- Failed to connect to the redis server, retrying.
2020-03-26 13:59:36,225	INFO services.py:1078 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:64846',
 'object_store_address': '/tmp/ray/session_2020-03-26_13-59-34_507167_72537/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-03-26_13-59-34_507167_72537/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-03-26_13-59-34_507167_72537'}

After a successful launch, the log output should include a line similar to `View the Ray dashboard at localhost:8265`. If it does, open another browser tab for the Ray dashboard at <http://localhost:8265/>.

Next we'll train an RLlib policy with the `CartPole-v0` environment, which is a relatively simple and quick example. For more details about this problem, see the tutorial [*Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)*](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288) by [Greg Surma](https://twitter.com/GSurma).

In [11]:
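# Start from RLlib's default PPO configuration and reduce log noise.
config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

n_iter = 10
reward_history = []

agent = ppo.PPOTrainer(config, env="CartPole-v0")

for _ in range(n_iter):
    # Each call to train() runs one training iteration and returns its metrics.
    result = agent.train()
    print(result)

    # Track the best single-episode reward seen in this iteration.
    max_reward = result["episode_reward_max"]
    reward_history.append(max_reward)

    # Save a checkpoint under /tmp/ppo and print its path.
    file_name = agent.save("/tmp/ppo")
    print(f"\n{file_name}")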

2020-03-26 15:11:39,770	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


ImportError: Could not import tensorflow

In [4]:
print(reward_history)

[75.0, 145.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]


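The history of `max_reward` shows that this model reaches `200`, the maximum possible episode reward in `CartPole-v0`, by the third iteration. This is a good sign, since `CartPole-v0` is considered [*solved*](https://gym.openai.com/envs/CartPole-v0/) when the average reward is at least `195.0` over 100 consecutive trials. Keep in mind that `episode_reward_max` is the best single episode in each iteration, not that average.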
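The 195.0-over-100-trials criterion can be checked directly against a list of per-episode rewards. The following is a minimal sketch, not part of RLlib: the helper name `first_solved_episode` and the sample `rewards` list are hypothetical and exist only to illustrate the rolling-average check.

```python
from collections import deque


def first_solved_episode(episode_rewards, window=100, threshold=195.0):
    """Return the 1-based index of the first episode at which the mean reward
    over the last `window` episodes reaches `threshold`, or None otherwise."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for i, reward in enumerate(episode_rewards, start=1):
        # The oldest reward is about to be evicted, so drop it from the sum.
        if len(recent) == window:
            running_sum -= recent[0]
        recent.append(reward)
        running_sum += reward
        # Only evaluate the criterion once a full window is available.
        if len(recent) == window and running_sum / window >= threshold:
            return i
    return None


# Hypothetical data: 50 weak episodes followed by 150 perfect ones.
rewards = [20.0] * 50 + [200.0] * 150
print(first_solved_episode(rewards))
```

With real training runs, the per-episode rewards would have to come from the training results rather than from `reward_history`, which holds only one maximum per iteration.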

In [7]:
! rllib rollout \
    /tmp/ppo/checkpoint_10/checkpoint-10 \
    --config "{\"env\": \"CartPole-v0\"}" --run PPO \
    --steps 2000

2020-03-21 18:13:13,752	INFO resource_spec.py:212 -- Starting Ray with 4.15 GiB memory available for workers and up to 2.09 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-03-21 18:13:14,122	INFO services.py:1078 -- View the Ray dashboard at localhost:8266
2020-03-21 18:13:14,807	INFO trainer.py:420 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-03-21 18:13:14,840	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
  obj = yaml.load(type_)
2020-03-21 18:13:19,802	INFO trainable.py:416 -- Restored on 192.168.1.65 from checkpoint: /tmp/ppo/checkpoint_10/checkpoint-10
2020-03-21 18:13:19,802	INFO trainable.py:423 -- Current state after restoring: {'_iteration': 10, '_timesteps_total': 40000, '_time_total': 41.65511107444763, '_episodes_total': 448}
(pid=3040)   obj = yaml.

Now that we've trained a model, we can look at its resulting policy…

In [5]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(256, 2) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(2,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(256, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
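The variable shapes listed above determine how many trainable parameters each layer has: a fully connected layer with a kernel of shape `(n_inputs, n_units)` and a bias of shape `(n_units,)` holds `n_inputs * n_units + n_units` values. The sketch below works through that arithmetic in plain Python. The helper `dense_params` and the `layers` dictionary are illustrative names; the shapes are copied from the output above.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a dense layer: one kernel weight per input/unit pair
    plus one bias per unit."""
    return n_inputs * n_units + n_units


# Kernel shapes taken from the printed variables.
layers = {
    "fc_1": (4, 256),
    "fc_value_1": (4, 256),
    "fc_2": (256, 256),
    "fc_value_2": (256, 256),
    "fc_out": (256, 2),
    "value_out": (256, 1),
}

counts = {name: dense_params(*shape) for name, shape in layers.items()}
for name, count in counts.items():
    print(f"{name:12s} {count:>7d}")
print(f"{'total':12s} {sum(counts.values()):>7d}")
```

The values computed for `fc_1`, `fc_value_1`, and `fc_2` can be compared against the `Param #` column of the `model.base_model.summary()` output in the next cell.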


In [6]:
model.base_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
observations (InputLayer)       [(None, 4)]          0                                            
__________________________________________________________________________________________________
fc_1 (Dense)                    (None, 256)          1280        observations[0][0]               
__________________________________________________________________________________________________
fc_value_1 (Dense)              (None, 256)          1280        observations[0][0]               
__________________________________________________________________________________________________
fc_2 (Dense)                    (None, 256)          65792       fc_1[0][0]                       
______________________________________________________________________________________________