# Ray RLlib - Extra Application Example - FrozenLake-v0

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `FrozenLake-v0` environment ([gym.openai.com/envs/FrozenLake-v0/](https://gym.openai.com/envs/FrozenLake-v0/)).

For more background about this problem, see:

* ["Introduction to Reinforcement Learning: the Frozen Lake Example"](https://reinforcementlearning4.fun/2019/06/09/introduction-reinforcement-learning-frozen-lake-example/), [Rodolfo Mendes](https://twitter.com/rodmsmendes)
* ["Gym Tutorial: The Frozen Lake"](https://reinforcementlearning4.fun/2019/06/16/gym-tutorial-frozen-lake/), [Rodolfo Mendes](https://twitter.com/rodmsmendes)

In [2]:
import pandas as pd
import json, os, shutil, sys
import ray
import ray.rllib.agents.ppo as ppo

Let's start up Ray as in the previous lesson:

In [3]:
!../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [4]:
ray.init(ignore_reinit_error=True)

2020-06-13 14:07:00,642	INFO resource_spec.py:212 -- Starting Ray with 3.56 GiB memory available for workers and up to 1.79 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-13 14:07:00,977	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:62162',
 'object_store_address': '/tmp/ray/session_2020-06-13_14-07-00_633039_11686/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-13_14-07-00_633039_11686/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-06-13_14-07-00_633039_11686'}

In [5]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8266


Set up the checkpoint location:

In [6]:
checkpoint_root = 'tmp/ppo/frozen-lake'
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)   # clean up old runs

Next we'll train an RLlib policy with the `FrozenLake-v0` environment.

By default, training runs for `10` iterations. Increase the `N_ITER` setting if you want to see the resulting rewards improve.
Also note that *checkpoints* get saved after each iteration into the `tmp/ppo/frozen-lake` directory.

> **Note:** If you prefer to use a different directory root than `tmp`, change the `checkpoint_root` definition above **and** the path in the `rllib rollout` command below.

In [7]:
SELECT_ENV = "FrozenLake-v0"
N_ITER = 10

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-06-13 14:07:21,109	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 14:07:21,131	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 14:07:23,577	INFO trainable.py:217 -- Getting current IP.




In [8]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'{n+1:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}, len mean: {result["episode_len_mean"]:8.4f}. Checkpoint saved to {file_name}')
reward_history = []

  1: Min/Mean/Max reward:   0.0000/  0.0178/  1.0000, len mean:   7.9109. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_1/checkpoint-1
  2: Min/Mean/Max reward:   0.0000/  0.0201/  1.0000, len mean:   8.0563. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_2/checkpoint-2
  3: Min/Mean/Max reward:   0.0000/  0.0274/  1.0000, len mean:   8.4262. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_3/checkpoint-3
  4: Min/Mean/Max reward:   0.0000/  0.0246/  1.0000, len mean:   8.2177. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_4/checkpoint-4
  5: Min/Mean/Max reward:   0.0000/  0.0439/  1.0000, len mean:   8.7632. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_5/checkpoint-5
  6: Min/Mean/Max reward:   0.0000/  0.0291/  1.0000, len mean:   8.9262. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_6/checkpoint-6
  7: Min/Mean/Max reward:   0.0000/  0.0375/  1.0000, len mean:   9.3934. Checkpoint saved to tmp/ppo/frozen-lake/checkpoint_7/checkpoint-7
  8: Min/Mean/Max re
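
The per-iteration dictionaries collected in `episode_data` are easy to post-process. The following is a minimal sketch that uses only the standard library and the mean rewards printed above for the first seven iterations. The helper name `best_iteration` is hypothetical and not part of RLlib.

```python
# Mean rewards copied from the training output above (iterations 1-7).
# In the notebook you would pass the real `episode_data` list instead.
episode_data = [
    {'n': 0, 'episode_reward_mean': 0.0178},
    {'n': 1, 'episode_reward_mean': 0.0201},
    {'n': 2, 'episode_reward_mean': 0.0274},
    {'n': 3, 'episode_reward_mean': 0.0246},
    {'n': 4, 'episode_reward_mean': 0.0439},
    {'n': 5, 'episode_reward_mean': 0.0291},
    {'n': 6, 'episode_reward_mean': 0.0375},
]

def best_iteration(episodes):
    """Return (1-based iteration, mean reward) for the best-scoring iteration."""
    best = max(episodes, key=lambda e: e['episode_reward_mean'])
    return best['n'] + 1, best['episode_reward_mean']

iteration, reward = best_iteration(episode_data)
print(f'Best mean reward so far: {reward:.4f} at iteration {iteration}')
```

The mean reward is noisy from one iteration to the next, so the best checkpoint is not necessarily the last one. For a tabular view, `pd.DataFrame(episode_data)` also accepts this list of dictionaries directly.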

In [9]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(16, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(16, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(256, 4) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(4,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(256, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
________
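
The variable shapes listed above are enough to work out the size of the network by hand. The `16` inputs are consistent with one input per lake tile (the 4x4 grid), the `4` outputs with one logit per action, and the value branch ends in a single output. The sketch below just redoes that arithmetic from the printed shapes; `count_parameters` is a hypothetical helper, not an RLlib function.

```python
from math import prod

# Shapes copied from the model.variables() output above
# (the 'default_policy/' prefix and ':0' suffix are dropped for brevity).
variable_shapes = {
    'fc_1/kernel': (16, 256),
    'fc_1/bias': (256,),
    'fc_value_1/kernel': (16, 256),
    'fc_value_1/bias': (256,),
    'fc_2/kernel': (256, 256),
    'fc_2/bias': (256,),
    'fc_value_2/kernel': (256, 256),
    'fc_value_2/bias': (256,),
    'fc_out/kernel': (256, 4),
    'fc_out/bias': (4,),
    'value_out/kernel': (256, 1),
    'value_out/bias': (1,),
}

def count_parameters(shapes):
    """Sum the number of scalar weights over all variable shapes."""
    return sum(prod(shape) for shape in shapes.values())

# Split the variables into the policy branch and the value branch by name.
policy_shapes = {k: v for k, v in variable_shapes.items() if 'value' not in k}
value_shapes = {k: v for k, v in variable_shapes.items() if 'value' in k}

total_params = count_parameters(variable_shapes)
policy_params = count_parameters(policy_shapes)
value_params = count_parameters(value_shapes)
print(f'policy: {policy_params}, value: {value_params}, total: {total_params}')
```

The two branches are nearly the same size because they share the same hidden-layer layout and differ only in their final layer.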

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "character" agent operating within the simulation: trying to find a walkable path to a goal tile.

In [12]:
!RAY_ADDRESS=auto rllib rollout \
    tmp/ppo/frozen-lake/checkpoint_10/checkpoint-10 \
    --config "{\"env\": \"FrozenLake-v0\"}" --run PPO \
    --steps 2000

2020-06-13 14:09:48,509	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 14:09:48,521	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 14:09:51,641	INFO trainable.py:217 -- Getting current IP.
2020-06-13 14:09:51,700	INFO trainable.py:217 -- Getting current IP.
2020-06-13 14:09:51,700	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: tmp/ppo/frozen-lake/checkpoint_10/checkpoint-10
2020-06-13 14:09:51,700	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 10, '_timesteps_total': 40000, '_time_total': 50.994168519973755, '_episodes_total': 4509}
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Down)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
SF[F]F
FHFH
FFFH
HFFG
  (Right)
SFFF
FH[F]H
FFFH
HFFG
  (Left)
SFFF
F[H]

The rollout uses the tenth (final) saved checkpoint, evaluated through `2000` steps.
Modify the path to view other checkpoints.
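
To avoid hand-editing the path and the escaped JSON each time, you could build the command string in Python. This is a minimal sketch that assumes the `checkpoint_<n>/checkpoint-<n>` layout seen in the training output and reuses only the flags from the command above. `rollout_command` is a hypothetical helper, not part of RLlib.

```python
import json
import shlex

def rollout_command(checkpoint_root, iteration, env="FrozenLake-v0", steps=2000):
    """Build an `rllib rollout` command line for the checkpoint of one iteration."""
    checkpoint = f"{checkpoint_root}/checkpoint_{iteration}/checkpoint-{iteration}"
    config = json.dumps({"env": env})  # JSON string passed to --config
    args = ["rllib", "rollout", checkpoint,
            "--config", config, "--run", "PPO", "--steps", str(steps)]
    # shlex.quote adds shell quoting only where an argument needs it.
    return " ".join(shlex.quote(arg) for arg in args)

cmd = rollout_command("tmp/ppo/frozen-lake", 5)
print(cmd)
```

In a notebook cell, the resulting string can then be run with `!{cmd}`, optionally prefixed with `RAY_ADDRESS=auto` as in the original command.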

## Exercise ("Homework")

In addition to _Taxi_ and _Frozen Lake_, there are other so-called ["toy text"](https://gym.openai.com/envs/#toy_text) problems you can try.