# RLlib Sample Application: Taxi-v3

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `Taxi-v3` environment:

  - <https://gym.openai.com/envs/Taxi-v3/>

For more background about this problem, see:

  - ["Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"](https://arxiv.org/abs/cs/9905014)  
[Thomas G. Dietterich](https://twitter.com/tdietterich)
  - ["Reinforcement Learning: let’s teach a taxi-cab how to drive"](https://towardsdatascience.com/reinforcement-learning-lets-teach-a-taxi-cab-how-to-drive-4fd1a0d00529)  
[Valentina Alto](https://twitter.com/AltoValentina)
  
---

First, make sure that Ray and RLlib are installed, along with Gym…

In [1]:
!pip install ray[rllib]
!pip install gym



Then start Ray…

In [2]:
import ray
import ray.rllib.agents.ppo as ppo

ray.shutdown()
ray.init(ignore_reinit_error=True)

2020-07-05 22:55:01,189	INFO resource_spec.py:212 -- Starting Ray with 3.47 GiB memory available for workers and up to 1.74 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-05 22:55:01,711	INFO services.py:1165 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.244',
 'raylet_ip_address': '192.168.1.244',
 'redis_address': '192.168.1.244:6379',
 'object_store_address': '/tmp/ray/session_2020-07-05_22-55-01_175825_77327/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-07-05_22-55-01_175825_77327/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-07-05_22-55-01_175825_77327'}

After a successful launch, the Ray dashboard will be running on a local port:

In [3]:
print("Dashboard URL: http://{}".format(ray.get_webui_url()))

Dashboard URL: http://localhost:8265


Open that URL in another tab to view the Ray dashboard as the example runs. We'll also set up a checkpoint location to store the trained policy:

In [2]:
import os
import shutil

# Relative directory where training checkpoints are written
CHECKPOINT_ROOT = "tmp/ppo/taxi"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True)

# Caution: this clears the results of all previous Ray runs
ray_results = os.path.expanduser("~/ray_results/")
shutil.rmtree(ray_results, ignore_errors=True)

Next we'll configure RLlib to train a policy with the `Taxi-v3` environment:

In [5]:
SELECT_ENV = "Taxi-v3"

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-07-05 22:55:11,288	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-07-05 22:55:11,289	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


The following training runs for `30` iterations. Increase the `N_ITER` value to train further and improve the rewards.

In [6]:
N_ITER = 30
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"

for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)
    
    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name
        ))

  1 reward -902.00/-751.75/-345.00 len 194.80 saved tmp/ppo/taxi/checkpoint_1/checkpoint-1
  2 reward -902.00/-751.85/-345.00 len 193.70 saved tmp/ppo/taxi/checkpoint_2/checkpoint-2
  3 reward -902.00/-725.72/-340.00 len 193.00 saved tmp/ppo/taxi/checkpoint_3/checkpoint-3
  4 reward -902.00/-705.04/-151.00 len 192.59 saved tmp/ppo/taxi/checkpoint_4/checkpoint-4
  5 reward -902.00/-682.85/-151.00 len 192.62 saved tmp/ppo/taxi/checkpoint_5/checkpoint-5
  6 reward -902.00/-643.69/-128.00 len 190.27 saved tmp/ppo/taxi/checkpoint_6/checkpoint-6
  7 reward -902.00/-585.58/-78.00 len 185.95 saved tmp/ppo/taxi/checkpoint_7/checkpoint-7
  8 reward -794.00/-524.43/-21.00 len 176.76 saved tmp/ppo/taxi/checkpoint_8/checkpoint-8
  9 reward -794.00/-482.32/-21.00 len 172.15 saved tmp/ppo/taxi/checkpoint_9/checkpoint-9
 10 reward -713.00/-443.42/-21.00 len 166.61 saved tmp/ppo/taxi/checkpoint_10/checkpoint-10
 11 reward -875.00/-422.87/-17.00 len 162.83 saved tmp/ppo/taxi/checkpoint_11/checkpoint-11


Do the min/mean/max rewards increase after multiple iterations?
Are the mean episode lengths decreasing?
Those metrics show whether the policy is improving with additional training.
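
As a rough illustration, the two questions can be answered numerically. The following is a minimal sketch that uses the mean rewards and mean episode lengths copied from the iterations printed above; the helper names `net_change` and `count_regressions` are hypothetical and not part of RLlib.

```python
# Mean episode rewards copied from the training output printed above.
mean_rewards = [-751.75, -751.85, -725.72, -705.04, -682.85, -643.69,
                -585.58, -524.43, -482.32, -443.42, -422.87]
# Mean episode lengths from the same iterations.
mean_lengths = [194.80, 193.70, 193.00, 192.59, 192.62, 190.27,
                185.95, 176.76, 172.15, 166.61, 162.83]


def net_change(values):
    """Return the last value minus the first value."""
    return values[-1] - values[0]


def count_regressions(values, higher_is_better=True):
    """Count iterations where the metric moved in the wrong direction."""
    pairs = list(zip(values, values[1:]))
    if higher_is_better:
        return sum(1 for prev, cur in pairs if cur < prev)
    return sum(1 for prev, cur in pairs if cur > prev)


reward_gain = net_change(mean_rewards)    # positive means rewards improved
length_drop = -net_change(mean_lengths)   # positive means episodes got shorter
reward_regressions = count_regressions(mean_rewards, higher_is_better=True)
length_regressions = count_regressions(mean_lengths, higher_is_better=False)

print("mean reward gain: {:.2f}".format(reward_gain))
print("mean length drop: {:.2f}".format(length_drop))
print("regressions (reward/length): {}/{}".format(
    reward_regressions, length_regressions))
```

A positive reward gain together with a positive length drop suggests the policy is improving, even if a few individual iterations regress slightly.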

Also, let's view the policy and model to see the results of training in detail…

In [8]:
policy = agent.get_policy()
model = policy.model
print(model.base_model.summary())

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
observations (InputLayer)       [(None, 500)]        0                                            
__________________________________________________________________________________________________
fc_1 (Dense)                    (None, 256)          128256      observations[0][0]               
__________________________________________________________________________________________________
fc_value_1 (Dense)              (None, 256)          128256      observations[0][0]               
__________________________________________________________________________________________________
fc_2 (Dense)                    (None, 256)          65792       fc_1[0][0]                       
______________________________________________________________________________________________

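Notice that the "InputLayer" takes 500 inputs: each observation is one-hot encoded, with one input per possible state. The summary shown here is cut off before the output layers; in RLlib's default fully connected model, the policy head emits one score per action (six for `Taxi-v3`), from which a single action is selected.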
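
To make the numbers in the summary concrete, here is a minimal sketch in plain Python. It assumes that each `Dense` layer has one weight per input/output pair plus one bias per output, and that the observation is a one-hot vector of length 500. The helpers `dense_params` and `one_hot` are hypothetical names used for illustration only.

```python
NUM_STATES = 500   # length of the one-hot observation vector
HIDDEN = 256       # units in each hidden Dense layer in the summary


def dense_params(n_in, n_out):
    """Parameter count of a dense layer: weights plus one bias per output."""
    return n_in * n_out + n_out


def one_hot(state, size=NUM_STATES):
    """Encode a discrete state index as a one-hot list of floats."""
    if not 0 <= state < size:
        raise ValueError("state index out of range")
    vec = [0.0] * size
    vec[state] = 1.0
    return vec


fc_1_params = dense_params(NUM_STATES, HIDDEN)   # also fc_value_1
fc_2_params = dense_params(HIDDEN, HIDDEN)

print(fc_1_params)   # compare with the Param # column for fc_1
print(fc_2_params)   # compare with the Param # column for fc_2
print(sum(one_hot(42)), len(one_hot(42)))
```

The computed counts match the `Param #` column for `fc_1`, `fc_value_1`, and `fc_2` in the summary above.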

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

The output from the following command visualizes the "taxi" agent operating within its simulation: picking up a passenger, driving, turning, dropping off a passenger ("put-down"), and so on. 

A 2-D map of the *observation space* is visualized as text, which needs some decoding instructions:

  * `R` -- R(ed) location in the Northwest corner
  * `G` -- G(reen) location in the Northeast corner
  * `Y` -- Y(ellow) location in the Southwest corner
  * `B` -- B(lue) location in the Southeast corner
  * `:` -- cells where the taxi can drive
  * `|` -- obstructions ("walls") which the taxi must avoid
  * blue letter represents the current passenger’s location for pick-up
  * purple letter represents the drop-off location
  * yellow rectangle is the location of taxi/agent (empty)
  * green rectangle is the location of taxi/agent (full)

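That allows for a total of 500 states: 25 taxi positions on the 5×5 grid, times 5 passenger locations (the four colored locations plus "in the taxi"), times 4 possible destinations. These states are numbered from 0 to 499.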
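
The state numbering can be sketched as a mixed-radix encoding. The following assumes the ordering used by Gym's Taxi implementation (taxi row, taxi column, passenger location, destination); treat the function names and the exact ordering as assumptions to check against your installed Gym version.

```python
GRID_SIZE = 5          # 5 x 5 grid of taxi positions
NUM_PASSENGER_LOCS = 5 # R, G, Y, B, or riding in the taxi
NUM_DESTINATIONS = 4   # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single index in 0..499."""
    index = taxi_row
    index = index * GRID_SIZE + taxi_col
    index = index * NUM_PASSENGER_LOCS + passenger_loc
    index = index * NUM_DESTINATIONS + destination
    return index


def decode(index):
    """Unpack a state index into (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, NUM_DESTINATIONS)
    index, passenger_loc = divmod(index, NUM_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, GRID_SIZE)
    return taxi_row, taxi_col, passenger_loc, destination


total_states = GRID_SIZE * GRID_SIZE * NUM_PASSENGER_LOCS * NUM_DESTINATIONS
print(total_states)
print(encode(4, 4, 4, 3))
print(decode(encode(2, 3, 1, 0)))
```

Every combination of the four components maps to a distinct index, which is why the observation space is a single discrete value rather than a tuple.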

The *action space* for the taxi/agent is defined as:

  * move the taxi one square North
  * move the taxi one square South
  * move the taxi one square East
  * move the taxi one square West
  * pick-up the passenger
  * put-down the passenger
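
For reference, the six actions are integers in the environment. The sketch below assumes the index order used by Gym's Taxi implementation (0 = south, 1 = north, 2 = east, 3 = west, 4 = pick-up, 5 = drop-off), which differs from the order of the list above. The `move` helper is hypothetical and deliberately ignores walls; it only clamps the taxi to the 5×5 grid.

```python
# Assumed action indices, following Gym's Taxi implementation.
ACTIONS = {
    0: "south",
    1: "north",
    2: "east",
    3: "west",
    4: "pickup",
    5: "dropoff",
}

# (row delta, column delta) for the four movement actions;
# row 0 is the northern edge, so moving south increases the row.
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}

GRID_SIZE = 5


def move(row, col, action):
    """Apply a movement action, clamping to the grid (walls are ignored here)."""
    d_row, d_col = MOVES.get(action, (0, 0))
    new_row = min(max(row + d_row, 0), GRID_SIZE - 1)
    new_col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return new_row, new_col


print(ACTIONS[4])
print(move(0, 0, 0))
```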

The *rewards* are structured as −1 for each step, except:

 * +20 points when the taxi performs a correct drop-off for the passenger
 * −10 points when the taxi attempts illegal pick-up/drop-off actions
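
A minimal sketch of this reward scheme follows. It assumes, as in Gym's Taxi implementation, that the +20 and −10 values replace the −1 step reward rather than add to it, and that episodes are capped at 200 steps. The function names are hypothetical.

```python
STEP_REWARD = -1       # default reward for every step
DROPOFF_REWARD = 20    # correct drop-off at the destination
ILLEGAL_REWARD = -10   # illegal pick-up or drop-off attempt


def step_reward(correct_dropoff=False, illegal_action=False):
    """Reward for one step under the assumed scheme."""
    if correct_dropoff:
        return DROPOFF_REWARD
    if illegal_action:
        return ILLEGAL_REWARD
    return STEP_REWARD


def episode_return(plain_steps, illegal_steps, delivered):
    """Total reward for an episode with the given step counts."""
    total = plain_steps * STEP_REWARD + illegal_steps * ILLEGAL_REWARD
    if delivered:
        total += DROPOFF_REWARD
    return total


# One way to reach the minimum reward of -902 seen in the training output:
# a 200-step episode with 122 ordinary steps and 78 illegal attempts.
worst_case = episode_return(plain_steps=122, illegal_steps=78, delivered=False)
print(worst_case)
```

Under these assumptions the minimum episode reward in the training log is consistent with a 200-step episode that never delivers the passenger.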

Admittedly it'd be better if these state visualizations showed the *reward* along with observations.

In [10]:
! rllib rollout \
    tmp/ppo/taxi/checkpoint_30/checkpoint-30 \
    --config "{\"env\": \"Taxi-v3\"}" \
    --run PPO \
    --steps 2000

2020-07-05 20:52:12,962	INFO resource_spec.py:212 -- Starting Ray with 3.66 GiB memory available for workers and up to 1.83 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-05 20:52:13,576	INFO services.py:1165 -- View the Ray dashboard at localhost:8266
2020-07-05 20:52:14,691	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-07-05 20:52:14,691	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-07-05 20:52:18,753	INFO trainable.py:423 -- Restored on 192.168.1.244 from checkpoint: tmp/ppo/taxi/checkpoint_30/checkpoint-30
2020-07-05 20:52:18,753	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 30, '_timesteps_total': None, '_time_total': 150.79380083084106, '_episodes_total': 832}
+---------+
|R: | : :G|
| : | : : |
| : : : : 

The rollout uses the last saved checkpoint, evaluated through `2000` steps.
Modify the path to view other checkpoints.
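
The checkpoint paths follow the pattern shown in the training output (`checkpoint_N/checkpoint-N` under the checkpoint root). As a small sketch, a hypothetical helper such as `checkpoint_path` can build the path for any iteration, along with the matching `rllib rollout` arguments used above:

```python
import json

CHECKPOINT_ROOT = "tmp/ppo/taxi"


def checkpoint_path(iteration, root=CHECKPOINT_ROOT):
    """Build the checkpoint path for a training iteration, per the log pattern."""
    return "{0}/checkpoint_{1}/checkpoint-{1}".format(root, iteration)


def rollout_args(iteration, env="Taxi-v3", steps=2000):
    """Assemble the argument list for the `rllib rollout` command shown above."""
    return [
        "rllib", "rollout",
        checkpoint_path(iteration),
        "--config", json.dumps({"env": env}),
        "--run", "PPO",
        "--steps", str(steps),
    ]


print(checkpoint_path(10))
print(" ".join(rollout_args(30)))
```

The argument list could be passed to `subprocess.run` outside a notebook; inside a notebook, the `!` shell syntax used above is simpler.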

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) then follow the instructions (copy/paste the URL it generates) to visualize key metrics from training with RLlib…

In [None]:
!pip install tensorflow
!tensorboard --logdir=$HOME/ray_results/