# Ray RLlib - Introduction to RLlib

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](01-Introduction-to-Reinforcement-Learning.ipynb), we learned the basic concepts of reinforcement learning, with a "taste" of [RLlib](https://rllib.io) and [OpenAI Gym](https://gym.openai.com). This lesson takes a step back to provide more information about RLlib and the features it provides. The subsequent lessons will continue our exploration of RL algorithms and tools.

For more detailed information about RLlib and its open source community, see the following:

* [rllib.io](http://rllib.io) (the documentation)
* [GitHub repo](https://github.com/ray-project/ray/tree/master/rllib#rllib-scalable-reinforcement-learning)

RLlib is structured conceptually like this:

![RLlib Stack](../images/rllib/RLlib-Stack-smaller.png)

The _(1) Application Support_ boxes are components used for particular RL algorithms. The _(2) Abstractions for RL_ provide building blocks used by the many algorithms that are implemented in RLlib (listed below). They also provide hooks for implementing your own algorithms. RLlib leverages Ray for efficient, cluster-wide, _(3) Distributed Execution_.

Let's start up Ray as in the previous lesson:

In [1]:
!../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


## RLlib in 60 Seconds (plus some...)

Here is a fast introduction to using RLlib from a command line, adapted from the [documentation](https://docs.ray.io/en/latest/rllib.html#rllib-in-60-seconds).

First you would install [PyTorch](http://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/), whichever you prefer.  Then install RLlib. (All are already installed in this tutorial environment.)

```shell
pip install ray[rllib]  # or consider using: ray[debug]
```

Then try training _CartPole_ using _PPO_ with the `rllib` CLI, where we'll stop at 20 iterations, saving checkpoints every 10 iterations and at the end:

```shell
rllib train --run PPO --env CartPole-v0 --stop='{"training_iteration": 20}' --ray-address auto --checkpoint-freq 10 --checkpoint-at-end
```

The `--ray-address auto` flag tells `rllib` to use the Ray cluster that's already running, and `--checkpoint-at-end` saves a final model checkpoint when training stops.
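Because the `--stop` value is a JSON string, shell quoting is easy to get wrong. As a minimal sketch using only the Python standard library (the variable names are illustrative, and `shlex.join` assumes Python 3.8 or later), the same command line can be assembled from a dictionary:

```python
import json
import shlex

# Stopping criteria, written as a normal Python dict.
stop = {"training_iteration": 20}

# The same arguments as the CLI example above, one list item per token.
cmd = [
    "rllib", "train",
    "--run", "PPO",
    "--env", "CartPole-v0",
    "--stop", json.dumps(stop),   # serialize the dict to the JSON the CLI expects
    "--ray-address", "auto",
    "--checkpoint-freq", "10",
    "--checkpoint-at-end",
]

# shlex.join adds whatever quoting the shell needs around the JSON argument.
command_line = shlex.join(cmd)
print(command_line)
```

The printed string can be pasted into a terminal, or `cmd` can be passed as a list to `subprocess.run`.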

The `rllib` CLI has a `--help` flag that prints details about the supported options:

```shell
rllib --help          # general help
rllib train --help    # specific help on the training options
rllib rollout --help  # specific help on the rollout options
```

_Rollout_ means running an episode with the trained model, which you specify by passing a checkpoint directory to the command.

You can execute the same training logic (but with different output) using the following Python code, which leverages [Ray Tune](http://tune.io), specifically the [tune.run](https://docs.ray.io/en/latest/tune/api_docs/execution.html#tune-run) method:

```python
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
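
# Train PPO on CartPole for 20 iterations, checkpointing as in the CLI example.
tune.run(
    PPOTrainer,
    config={"env": "CartPole-v0"},
    stop={"training_iteration": 20},
    checkpoint_freq=10,      # save a checkpoint every 10 iterations
    checkpoint_at_end=True,  # and one more when training stops
    verbose=2,               # 2 for INFO; change to 1 or 0 to reduce the output.
)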
```

The training results are written to the directory `$HOME/ray_results`. You can view them using TensorBoard:

```shell
tensorboard --logdir=~/ray_results
```

Try the CLI now if you like. The following `rllib train` command will take between one and two minutes on a "recent vintage" laptop.

You could also run this command in a separate terminal window.

> **Tip:** The output will be long. In each cell, right click and select _Enable scrolling for outputs_.

In [3]:
!rllib train --run PPO --env CartPole-v0 --stop='{"training_iteration": 20}' --ray-address auto --checkpoint-freq 10 --checkpoint-at-end

2020-06-12 09:50:03,878	INFO resource_spec.py:212 -- Starting Ray with 4.0 GiB memory available for workers and up to 2.01 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-12 09:50:04,221	INFO services.py:1170 -- View the Ray dashboard at localhost:8266
== Status ==
Memory usage on this node: 11.3/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 3/8 CPUs, 0/0 GPUs, 0.0/4.0 GiB heap, 0.0/1.37 GiB objects
Result logdir: /Users/deanwampler/ray_results/default
Number of trials: 1 (1 RUNNING)
+-----------------------+----------+-------+
| Trial name            | status   | loc   |
|-----------------------+----------+-------|
| PPO_CartPole-v0_00000 | RUNNING  |       |
+-----------------------+----------+-------+


[2m[36m(pid=42245)[0m 2020-06-12 09:50:09,149	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=42245)[0m 2020

To run TensorBoard, open a terminal (you can use the `+` item under the _Edit_ menu) and run this command:

```shell
tensorboard --logdir=~/ray_results
```

(This command can't be run in a notebook cell...)

TensorBoard lets you browse the training run data. Open the URL shown in the output.

Here is a [screenshot](../images/rllib/TensorBoard-CartPole-PPO.png).

Note that those results were written to the directory `~/ray_results`, the default location for training runs. In that directory you'll find results for all the RL training we'll do in this tutorial. You may wish to clean out old results periodically. For the run we just did, look for the results in `~/ray_results/default/PPO_CartPole-v0_0_*`.

Once trained, you can use `rllib rollout <checkpoint> --run PPO` to replay a session using the trained model in `<checkpoint>`, which in this case will be a directory with a name like this:

```
$HOME/ray_results/default/PPO_CartPole-v0_0_YYYY-MM-DD_HH-MM-SS.../checkpoint_20/checkpoint-20/
```
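If you'd rather not browse for the path by hand, a short script can list the candidates. This is a sketch that assumes the directory layout shown above (`checkpoint_N` directories inside a trial directory); the helper name `find_checkpoint_dirs` and the default glob pattern are illustrative, not part of RLlib:

```python
import re
from pathlib import Path


def find_checkpoint_dirs(results_dir, trial_glob="PPO_CartPole-v0_*"):
    """Return the checkpoint_N directories under results_dir, sorted by N."""
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return []
    found = []
    for path in root.glob(f"{trial_glob}/checkpoint_*"):
        match = re.fullmatch(r"checkpoint_(\d+)", path.name)
        if match and path.is_dir():
            found.append((int(match.group(1)), str(path), path))
    # Sort by iteration number first, then by path for a stable order.
    return [path for _, _, path in sorted(found, key=lambda item: item[:2])]


# List checkpoint directories for the default experiment, oldest iteration first.
for checkpoint_dir in find_checkpoint_dirs("~/ray_results/default"):
    print(checkpoint_dir)
```

The last directory printed for your trial holds the most recent checkpoint.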

Find that directory for your run and try the following command, replacing `checkpoint_dir` with the correct full path:

In [None]:
!rllib rollout checkpoint_dir --run PPO 

See [this RLlib page on training policies](https://docs.ray.io/en/master/rllib-training.html) for more examples.

## Key Concepts: Policies, Samples, and Trainers

### Policies

[Policies](https://docs.ray.io/en/latest/rllib-concepts.html#policies) are Python classes that define how an agent acts in an environment.

[Rollout workers](https://docs.ray.io/en/latest/rllib-concepts.html#policy-evaluation) query the policy to determine agent actions. 

In a [gym](https://docs.ray.io/en/latest/rllib-env.html#openai-gym) environment, there is a single agent and policy. In [vector environments](https://docs.ray.io/en/latest/rllib-env.html#vectorized), policy inference is for multiple agents at once, and in [multi-agent and hierarchical environments](https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical), there may be multiple policies, each controlling one or more agents.

![Environments and Policies in RLlib](../images/rllib/multi-flat.svg)

Policies can be implemented using any framework ([code](https://github.com/ray-project/ray/blob/master/rllib/policy/policy.py)). However, for TensorFlow and PyTorch, RLlib has [build_tf_policy](https://docs.ray.io/en/latest/rllib-concepts.html#building-policies-in-tensorflow) and [build_torch_policy](https://docs.ray.io/en/latest/rllib-concepts.html#building-policies-in-pytorch) helper functions, respectively, that let you define a trainable policy with a functional-style API. An example is in the [documentation](https://docs.ray.io/en/latest/rllib.html#policies).
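To make the idea concrete, here is a framework-free sketch of what "a policy maps observations to actions" means, including the batched inference used with vector environments. The class `ToyLinearPolicy` is hypothetical and much simpler than RLlib's `Policy` class; it only assumes CartPole-like inputs (4 observation values, 2 discrete actions):

```python
import numpy as np


class ToyLinearPolicy:
    """Hypothetical stand-in for a policy; not RLlib's Policy class."""

    def __init__(self, obs_dim=4, num_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        # One column of weights per action.
        self.weights = rng.normal(scale=0.1, size=(obs_dim, num_actions))

    def compute_actions(self, obs_batch):
        """Map observations of shape (N, obs_dim) to N greedy actions."""
        logits = np.asarray(obs_batch, dtype=np.float32) @ self.weights
        return logits.argmax(axis=1)


policy = ToyLinearPolicy()

# A single environment: a batch of one observation.
single_action = policy.compute_actions(np.zeros((1, 4)))

# A vector environment: one call returns actions for 8 observations at once.
obs_batch = np.random.default_rng(1).normal(size=(8, 4))
batch_actions = policy.compute_actions(obs_batch)
print(single_action.shape, batch_actions.shape)
```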

### Sample Batches

From single processes to [large clusters](https://docs.ray.io/en/latest/rllib-training.html#specifying-resources), all data interchange in RLlib uses [sample batches](https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py). Sample batches encode one or more fragments of a trajectory. Typically, RLlib collects batches of size `rollout_fragment_length` from rollout workers, and concatenates one or more of these batches into a batch of size `train_batch_size` that is the input to SGD.

A typical sample batch looks something like the following when summarized. Since all values are kept in arrays, this allows for efficient encoding and transmission across the network:

```python
 { 'action_logp': np.ndarray((200,), dtype=float32, min=-0.701, max=-0.685, mean=-0.694),
   'actions': np.ndarray((200,), dtype=int64, min=0.0, max=1.0, mean=0.495),
   'dones': np.ndarray((200,), dtype=bool, min=0.0, max=1.0, mean=0.055),
   'infos': np.ndarray((200,), dtype=object, head={}),
   'new_obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.018),
   'obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.016),
   'rewards': np.ndarray((200,), dtype=float32, min=1.0, max=1.0, mean=1.0),
   't': np.ndarray((200,), dtype=int64, min=0.0, max=34.0, mean=9.14)}
```
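The concatenation step described above can be sketched with plain NumPy. Here ordinary dictionaries of arrays stand in for RLlib's sample batch class, the data is random, and the helper names `make_fragment` and `concat_fragments` are illustrative:

```python
import numpy as np


def make_fragment(length, obs_dim=4, seed=0):
    """Build a fake trajectory fragment as a dict of equal-length arrays."""
    rng = np.random.default_rng(seed)
    return {
        "obs": rng.normal(size=(length, obs_dim)).astype(np.float32),
        "actions": rng.integers(0, 2, size=length),
        "rewards": np.ones(length, dtype=np.float32),
    }


def concat_fragments(fragments):
    """Concatenate fragments column by column along the first axis."""
    return {
        key: np.concatenate([fragment[key] for fragment in fragments])
        for key in fragments[0]
    }


rollout_fragment_length = 200
train_batch_size = 800

# Collect enough fragments to fill one training batch.
num_fragments = train_batch_size // rollout_fragment_length
fragments = [
    make_fragment(rollout_fragment_length, seed=i) for i in range(num_fragments)
]
train_batch = concat_fragments(fragments)
print({key: value.shape for key, value in train_batch.items()})
```

Every column of `train_batch` now has `train_batch_size` rows, which is the shape the training step consumes.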

In [multi-agent mode](https://docs.ray.io/en/latest/rllib-concepts.html#policies-in-multi-agent), sample batches are collected separately for each individual policy.

### Training

Policies each define a `learn_on_batch()` method that improves the policy given a sample batch of input. For TensorFlow and PyTorch policies, this is implemented using a _loss function_ that takes as input sample batch tensors and outputs a scalar loss value. Here are a few example loss functions:

* Simple [policy gradient loss](https://github.com/ray-project/ray/blob/master/rllib/agents/pg/pg_tf_policy.py)
* Simple [Q-function loss](https://github.com/ray-project/ray/blob/a1d2e1762325cd34e14dc411666d63bb15d6eaf0/rllib/agents/dqn/simple_q_policy.py#L136)
* Importance-weighted [APPO surrogate loss](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo_policy.py)
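As an illustration of "sample batch tensors in, scalar loss out", here is a NumPy sketch of a basic policy gradient loss: the negative mean of the log-probability of each taken action, weighted by its advantage. It is not RLlib's implementation (which uses TensorFlow or PyTorch tensors so that gradients can be computed), and the function names are illustrative:

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the action dimension."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def policy_gradient_loss(logits, actions, advantages):
    """Return -mean(log pi(a|s) * advantage) as a scalar."""
    logp_all = log_softmax(logits)
    # Pick out the log-probability of the action actually taken in each row.
    logp = logp_all[np.arange(len(actions)), actions]
    return -np.mean(logp * advantages)


# Toy batch: 4 steps, 2 actions, uniform action probabilities.
logits = np.zeros((4, 2))
actions = np.array([0, 1, 1, 0])
advantages = np.array([1.0, 2.0, 3.0, 4.0])

loss = policy_gradient_loss(logits, actions, advantages)
print(loss)
```

With uniform probabilities every log-probability is `log(0.5)`, so the loss here equals `log(2)` times the mean advantage.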

RLlib [trainer classes](https://docs.ray.io/en/latest/rllib-concepts.html#trainers) coordinate the distributed workflow of running rollouts and optimizing policies. They do this by leveraging Ray [parallel iterators](https://docs.ray.io/en/latest/iter.html) to implement the desired computation pattern. The following figure shows *synchronous sampling*, the simplest of [these patterns](https://docs.ray.io/en/latest/rllib-algorithms.html):

![Synchronous Sampling](../images/rllib/a2c-arch.svg)

    Synchronous Sampling (e.g., A2C, PG, PPO)
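The synchronous pattern can be sketched without Ray as a simple loop: broadcast the current weights, wait for every worker to return a fragment, concatenate the fragments, and take one learning step. Everything below is a toy under stated assumptions: `ToyRolloutWorker` is hypothetical, the "policy" is a single number that should move toward `TARGET`, and the workers run one after another rather than in parallel as Ray actors would:

```python
import numpy as np

TARGET = 3.0  # the value the toy "policy" weight should converge to


class ToyRolloutWorker:
    """Hypothetical worker that samples using the weights it was last given."""

    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.weight = 0.0

    def set_weights(self, weight):
        self.weight = weight

    def sample(self, fragment_length=50):
        # Noisy error signal, standing in for experience from an environment.
        noise = self.rng.normal(scale=0.1, size=fragment_length)
        return {
            "error": self.weight - TARGET + noise,
            "weight_used": np.full(fragment_length, self.weight),
        }


def train_synchronously(num_workers=3, iterations=30, learning_rate=0.5):
    workers = [ToyRolloutWorker(seed=i) for i in range(num_workers)]
    weight = 0.0
    for _ in range(iterations):
        # 1. Broadcast the latest weights to every worker.
        for worker in workers:
            worker.set_weights(weight)
        # 2. Collect one fragment from each worker before continuing.
        fragments = [worker.sample() for worker in workers]
        # 3. Concatenate the fragments into one training batch.
        batch = {
            key: np.concatenate([f[key] for f in fragments])
            for key in fragments[0]
        }
        # Synchronous sampling: every row came from the same weights.
        assert np.all(batch["weight_used"] == weight)
        # 4. One learning step on the combined batch.
        weight -= learning_rate * batch["error"].mean()
    return weight


final_weight = train_synchronously()
print(final_weight)
```

The key property is in step 2: no learning happens until all workers have reported, so every sample in a batch was generated by the same version of the policy.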

RLlib uses [Ray actors](https://docs.ray.io/en/latest/actors.html) to scale training from a single core to many thousands of cores in a cluster. You can [configure the parallelism](https://docs.ray.io/en/latest/rllib-training.html#specifying-resources) used for training by changing the `num_workers` parameter. Check out our [scaling guide](https://docs.ray.io/en/latest/rllib-training.html#scaling-guide) for more details.
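For example, a configuration dictionary along the following lines could be passed as the `config` argument of the `tune.run` call shown earlier in this lesson. The numeric values are illustrative assumptions, not tuned recommendations; choose `num_workers` to fit the CPUs you have available:

```python
# Illustrative settings only; adjust to your own hardware and algorithm.
config = {
    "env": "CartPole-v0",
    "num_workers": 2,                # number of rollout worker actors
    "rollout_fragment_length": 200,  # steps collected per worker per fragment
    "train_batch_size": 4000,        # steps combined into one SGD input batch
}

# How many fragments are needed to fill one training batch.
fragments_per_batch = (
    config["train_batch_size"] // config["rollout_fragment_length"]
)
print(fragments_per_batch)
```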

## Application Support

Beyond environments defined in Python, [RLlib supports](https://docs.ray.io/en/latest/rllib.html#application-support) batch training on [offline datasets](https://docs.ray.io/en/latest/rllib-offline.html), and also provides a variety of integration strategies for [external applications](https://docs.ray.io/en/latest/rllib-env.html#external-agents-and-applications).

### Customization

RLlib provides ways to customize almost all aspects of training, including the [environment](https://docs.ray.io/en/latest/rllib-env.html#configuring-environments), [neural network model](https://docs.ray.io/en/latest/rllib-models.html#tensorflow-models), [action distribution](https://docs.ray.io/en/latest/rllib-models.html#custom-action-distributions), and [policy definitions](https://docs.ray.io/en/latest/rllib-concepts.html#policies).

![RLlib components](../images/rllib/RLlib-components.svg)

## Algorithms Implemented in RLlib

Here is the current list of supported algorithms in RLlib. The links go to the corresponding RLlib documentation, which includes links to the original papers and other references.

See also the documentation's [Feature Compatibility Matrix](https://docs.ray.io/en/latest/rllib-algorithms.html#feature-compatibility-matrix), which lists the algorithms and useful properties for them.

### High-throughput Architectures

* [Distributed Prioritized Experience Replay (Ape-X)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#distributed-prioritized-experience-replay-ape-x)
* [Importance Weighted Actor-Learner Architecture (IMPALA)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#importance-weighted-actor-learner-architecture-impala)
* [Asynchronous Proximal Policy Optimization (APPO)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#asynchronous-proximal-policy-optimization-appo)
* [Decentralized Distributed Proximal Policy Optimization (DD-PPO)](https://docs.ray.io/en/latest/rllib-algorithms.html#decentralized-distributed-proximal-policy-optimization-dd-ppo)

### Gradient-based

* [Advantage Actor-Critic (A2C, A3C)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#advantage-actor-critic-a2c-a3c)
* [Deep Deterministic Policy Gradients (DDPG, TD3)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#deep-deterministic-policy-gradients-ddpg-td3)
* [Deep Q Networks (DQN, Rainbow, Parametric DQN)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#deep-q-networks-dqn-rainbow-parametric-dqn)
* [Policy Gradients](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#policy-gradients)
* [Proximal Policy Optimization (PPO)](https://docs.ray.io/en/latest/rllib-algorithms.html#proximal-policy-optimization-ppo)
* [Soft Actor-Critic (SAC)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#soft-actor-critic-sac)

### Gradient-free

* [Augmented Random Search (ARS)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#augmented-random-search-ars)
* [Evolution Strategies](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#evolution-strategies)

### Multi-agent Specific

* [QMIX Monotonic Value Factorisation (QMIX, VDN, IQN)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#qmix-monotonic-value-factorisation-qmix-vdn-iqn)
* [Multi-Agent Deep Deterministic Policy Gradient (contrib/MADDPG)](https://docs.ray.io/en/latest/rllib-algorithms.html#multi-agent-deep-deterministic-policy-gradient-contrib-maddpg)

### Offline

* [Advantage Re-Weighted Imitation Learning (MARWIL)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#advantage-re-weighted-imitation-learning-marwil)

### Contextual Bandits (contrib/bandits)

* [Linear Upper Confidence Bound (contrib/LinUCB)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-upper-confidence-bound-contrib-linucb)
* [Linear Thompson Sampling (contrib/LinTS)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-thompson-sampling-contrib-lints)

### Other

* [Single-Player Alpha Zero (contrib/AlphaZero)](https://docs.ray.io/en/latest/rllib-algorithms.html#single-player-alpha-zero-contrib-alphazero)

The next lesson, [03: Application: Cart Pole](03-Application-Cart-Pole.ipynb), returns to the _cart pole_ example, where we train a moving cart to balance a vertical pole. It is based on the `CartPole-v0` environment from OpenAI Gym.