# RLlib Sample Application: trivial-v0

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `trivial-v0` environment:

  - <https://github.com/DerwenAI/gym_trivial> 

This example illustrates how to build a minimal `gym` environment that implements all of the required features. In effect, it simulates the operation of a TTL `AND` gate, which is about as trivial as a simulation can be.
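To make the idea concrete, here is a minimal sketch of what such an environment might look like. It is not the `gym_trivial` source: the class name `AndGateEnv`, the reward values, the starting state, and the step cap are all assumptions chosen for illustration, and the class only mirrors the classic `gym` `reset()`/`step()` protocol rather than subclassing `gym.Env`.

```python
class AndGateEnv:
    """Hypothetical stand-in for `Trivial`; mirrors the classic gym
    reset()/step() protocol without importing gym."""

    MAX_STEPS = 10  # assumed episode cap, not taken from gym_trivial

    def reset(self):
        self.inputs = [0, 0]  # both gate inputs start low (an assumption)
        self.steps = 0
        return list(self.inputs)

    def step(self, action):
        """`action` is 0 or 1: the index of the input to set high."""
        self.steps += 1
        if self.inputs[action] == 1:
            reward = -1.0  # "invalid step": that input was already set
        else:
            self.inputs[action] = 1
            reward = 0.0
        # The AND gate output goes high only when both inputs are high.
        won = self.inputs[0] == 1 and self.inputs[1] == 1
        if won:
            reward = 10.0  # "win": illustrative value only
        done = won or self.steps >= self.MAX_STEPS
        return list(self.inputs), reward, done, {}


env = AndGateEnv()
obs = env.reset()
obs, reward, done, info = env.step(0)  # set the first input
print(obs, reward, done)
```

A real `gym` environment would additionally declare `action_space` and `observation_space` so that RLlib can size the policy network.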

---

First, make sure that Ray and RLlib are installed…

In [19]:
!pip install ray[rllib]
!pip install ray[debug]
!pip install ray[tune]
!pip install pandas
!pip install requests
!pip install tensorflow



Then start Ray…

In [20]:
import ray
import ray.rllib.agents.ppo as ppo

ray.shutdown()
ray.init(ignore_reinit_error=True)

2020-04-07 00:15:49,753	INFO resource_spec.py:212 -- Starting Ray with 4.0 GiB memory available for workers and up to 2.01 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-07 00:15:50,035	INFO services.py:1078 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.244',
 'redis_address': '192.168.1.244:21270',
 'object_store_address': '/tmp/ray/session_2020-04-07_00-15-49_739753_32015/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-04-07_00-15-49_739753_32015/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-04-07_00-15-49_739753_32015'}

After a successful launch, the last log output line should read `View the Ray dashboard at localhost:8265`

Open <http://localhost:8265/> in another tab to view the Ray dashboard as the example runs.
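If the dashboard tab does not load, a quick TCP probe can tell whether anything is listening on that port. This is only a sketch using the standard library; the helper name `port_is_open` is hypothetical, and the host and port are taken from the log line above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


print(port_is_open("localhost", 8265))
```

A `True` result only shows that some process accepted the connection; it does not prove that the process is the Ray dashboard.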

---

Next, we'll train an RLlib policy with the `trivial-v0` environment, installed from <https://github.com/DerwenAI/gym_trivial>…

In [21]:
!pip install "git+https://github.com/DerwenAI/gym_trivial.git#egg=pkg&subdirectory=gym-trivial"

In [22]:
from gym_trivial.envs.trivial_env import Trivial
from ray.tune.registry import register_env

register_env("trivial-v0", lambda config: Trivial())

By default, training runs for only `1` iteration. Increase the `N_ITER` setting if you want to see the resulting rewards improve.
Also note that a *checkpoint* gets saved after each iteration into the `/tmp/ppo/triv` directory.

In [23]:
SELECT_ENV = "trivial-v0"
N_ITER = 1

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

reward_history = []

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

for _ in range(N_ITER):
    result = agent.train()
    print(result)

    max_reward = result["episode_reward_max"]
    reward_history.append(max_reward)

    file_name = agent.save("/tmp/ppo/triv")
    print(f"\n{file_name}")

  obj = yaml.load(type_)


[2m[36m(pid=32307)[0m   obj = yaml.load(type_)
[2m[36m(pid=32309)[0m   obj = yaml.load(type_)
[2m[36m(pid=32307)[0m [0, 1]
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32307)[0m [1, 0]
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32307)[0m [0, 1]
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32307)[0m [0, 1]
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32307)[0m [0, 1]
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32307)[0m [1, 0]
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m invalid step
[2m[36m(pid=32307)[0m [1, 1]
[2m[36m(pid=32307)[0m win
[2m[36m(pid=32309)[0m [1, 0]
[2m[36m(pid=32309)[0m [1, 1]
[2m[36m(pi

In [24]:
print(reward_history)

[100.0]


Do the episode rewards increase after multiple iterations?
If so, that shows the policy is improving.
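One way to answer that question without eyeballing the list is to look at the change between consecutive iterations. The sketch below assumes a `reward_history` list like the one built in the training loop; the helper names and the sample numbers are hypothetical, not results from this run.

```python
def reward_deltas(history):
    """Change in reward between each pair of consecutive iterations."""
    return [later - earlier for earlier, later in zip(history, history[1:])]


def is_improving(history):
    """True if the reward rose overall; needs at least two data points."""
    deltas = reward_deltas(history)
    return bool(deltas) and sum(deltas) > 0


# Made-up history for illustration only.
sample_history = [40.0, 65.0, 60.0, 90.0]
print(reward_deltas(sample_history))
print(is_improving(sample_history))
```

Because the loop records `episode_reward_max`, which can reach its ceiling in the first iteration, tracking `result["episode_reward_mean"]` as well may give a clearer picture of progress.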

Also, print out the policy and model to see the results of training in detail…

In [25]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(256, 2) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(2,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(256, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
__________
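The variable shapes printed above are enough to work out the size of the network by hand. As a small worked example, the following tallies the parameters in the policy branch and the value branch; the shapes are copied from that output, while the dictionary layout and the helper name `size` are just illustrative choices.

```python
from functools import reduce
from operator import mul

# Shapes copied from the `model.variables()` output above.
shapes = {
    "fc_1/kernel": (4, 256),
    "fc_1/bias": (256,),
    "fc_value_1/kernel": (4, 256),
    "fc_value_1/bias": (256,),
    "fc_2/kernel": (256, 256),
    "fc_2/bias": (256,),
    "fc_value_2/kernel": (256, 256),
    "fc_value_2/bias": (256,),
    "fc_out/kernel": (256, 2),
    "fc_out/bias": (2,),
    "value_out/kernel": (256, 1),
    "value_out/bias": (1,),
}


def size(shape):
    """Number of scalar entries in a tensor of the given shape."""
    return reduce(mul, shape, 1)


policy_params = sum(size(s) for name, s in shapes.items() if "value" not in name)
value_params = sum(size(s) for name, s in shapes.items() if "value" in name)
total_params = policy_params + value_params

print(policy_params, value_params, total_params)  # 67586 67329 134915
```

The two branches differ only in their output layers: two action logits for the policy versus a single scalar for the value estimate.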

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "trivial" agent operating within the simulation: setting either of two inputs until both are set.

**(NB: the `rollout` CLI script will not work with a custom env that is registered only inside this notebook.)**
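The `KeyError: 'trivial-v0'` in the traceback below suggests the CLI process never sees the `register_env` call made in this notebook. One possible workaround is to roll out the policy in-process instead. The following is a sketch under assumptions: `run_episode` and `TwoStepEnv` are hypothetical names, the environment is assumed to follow the classic `gym` API where `step()` returns a 4-tuple, and the stub environment exists only so the snippet runs without Ray.

```python
def run_episode(env, choose_action, max_steps=2000):
    """Step `env` until it reports done or `max_steps` is reached.

    Returns (total_reward, steps_taken).
    """
    obs = env.reset()
    total_reward = 0.0
    steps_taken = 0
    while steps_taken < max_steps:
        action = choose_action(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps_taken += 1
        if done:
            break
    return total_reward, steps_taken


class TwoStepEnv:
    """Hypothetical stub: pays 1.0 per step and ends after two steps."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 2, {}


print(run_episode(TwoStepEnv(), lambda obs: 0))  # (2.0, 2)
```

In this notebook the call would presumably look like `run_episode(Trivial(), agent.compute_action)`, assuming the installed RLlib version exposes `compute_action` on the trainer.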

In [26]:
! rllib rollout \
    /tmp/ppo/triv/checkpoint_2/checkpoint-2 \
    --config "{\"env\": \"trivial-v0\"}" --run PPO \
    --steps 2000

2020-04-07 00:16:07,743	INFO resource_spec.py:212 -- Starting Ray with 4.59 GiB memory available for workers and up to 2.3 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-07 00:16:08,135	INFO services.py:1078 -- View the Ray dashboard at localhost:8266
2020-04-07 00:16:08,874	INFO trainer.py:420 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-04-07 00:16:08,923	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 118, in spec
    return self.env_specs[id]
KeyError: 'trivial-v0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/bin/rllib", line 10, in <module>
    sys.exit(cli())
  File "/o

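The rollout command points at the second saved checkpoint and evaluates it for `2000` steps.
With the default `N_ITER = 1`, this run writes only the first checkpoint, so modify the path to match a checkpoint that actually exists, or to view other checkpoints.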
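To see which checkpoints are available before editing the path, the save directory can be scanned. This sketch assumes the `checkpoint_N/checkpoint-N` layout that appears in the rollout command above; the helper name `list_checkpoints` is hypothetical.

```python
import re
from pathlib import Path


def list_checkpoints(root):
    """Return (iteration, checkpoint_path) pairs sorted by iteration."""
    found = []
    for entry in Path(root).glob("checkpoint_*"):
        match = re.fullmatch(r"checkpoint_(\d+)", entry.name)
        if match and entry.is_dir():
            iteration = int(match.group(1))
            # Assumed layout: <root>/checkpoint_N/checkpoint-N
            found.append((iteration, entry / f"checkpoint-{iteration}"))
    return sorted(found)


for iteration, path in list_checkpoints("/tmp/ppo/triv"):
    print(iteration, path)
```

Sorting on the parsed integer avoids the lexical ordering problem where `checkpoint_10` would otherwise come before `checkpoint_2`.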

---

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started), then follow the instructions (copy/paste the URL it generates) to visualize key metrics from training with RLlib…

In [None]:
!tensorboard --logdir=~/ray_results/