# Ray RLlib - Online learning with Deep Q Networks

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson demonstrates how to set up a server that simultaneously serves and learns a policy. We'll use the [Ray implementation](https://docs.ray.io/en/latest/rllib-algorithms.html#dqn) of the _Deep Q Networks_ (DQN) algorithm, which is more sample-efficient than PPO: the algorithm improves more with each data sample, so it learns a good policy from fewer "experiences".

DQN was the first widely successful deep (neural network-based) RL algorithm, and is described in detail in the [original paper](https://arxiv.org/abs/1312.5602).

In DQN, instead of training a policy network to directly emit output actions from the observation, we learn a *Q* function that models the expected outcome of taking certain actions. This model is then used to compute the optimal actions at each step.
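
To make the idea concrete, here is a minimal sketch of how a learned *Q* function is turned into an action. The Q-values are hard-coded, hypothetical numbers (in DQN they would come from a neural network), and `greedy_action` and `td_target` are illustrative helper names, not RLlib functions. The target shown is the standard one-step Q-learning target, included only to show what the *Q* function is trained toward.

```python
# Hypothetical Q-values for a single observation: one estimate per action.
# In DQN a neural network would produce these; here they are hard-coded.
q_values = {0: 1.25, 1: 2.5}  # action -> estimated future return


def greedy_action(q):
    """Return the action with the highest estimated return."""
    return max(q, key=q.get)


def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    At the end of an episode there is no next state, so the target is just r.
    """
    if done:
        return reward
    return reward + gamma * max(next_q.values())


best = greedy_action(q_values)
target = td_target(reward=1.0, next_q=q_values, gamma=0.5)
print(best, target)  # prints: 1 2.25
```

The policy is therefore implicit: it is whatever action maximizes the current *Q* estimates.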

Unlike policy gradient algorithms such as PPO, DQN can learn from past experiences through *experience replay*. This allows DQN to use experiences multiple times over the course of training, improving its sample efficiency. In this lesson we are going to use a single-process configuration for DQN, but RLlib does provide a [distributed variant of DQN](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#distributed-prioritized-experience-replay-ape-x). 
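
As a rough illustration of experience replay, the sketch below stores transitions in a fixed-size buffer and samples them uniformly. The `ReplayBuffer` class is a hypothetical, simplified stand-in written for this lesson; it is not RLlib's replay buffer, which also supports features such as prioritization.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest entry when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self._storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch; the same
        # transition can still be drawn again in later batches.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)

print(len(buffer))  # prints: 3 (only the most recent transitions are kept)
batch = buffer.sample(2)  # two stored transitions, reusable across updates
```

Because sampled transitions stay in the buffer, each one can contribute to many gradient updates, which is where the improved sample efficiency comes from.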

![dqn](../images/rllib/dqn.png)

## Running a Simple Policy Server

Open a new terminal in Jupyter Lab using the "+" button under the _File_ and _Edit_ menus. Change to the `ray-rllib` directory first. Then run the following command:

```shell
python serving/simple_policy_server.py --action-size=2 --observation-size=4 --run=DQN --checkpoint-file=cartpole
```

> **Note:** If you are working on a laptop and using Anaconda, you may need to activate your `anyscale-academy` environment first in the terminal, as described in the [README](../README.md).

Take a look at the source file, [serving/simple_policy_server.py](serving/simple_policy_server.py), to see how the policy server is implemented. In policy serving, there is no `step()` function that RLlib can run.
Intuitively, this is because there is no simulator: the policy must interact with the real world, and you can't call `step()` on that.
To support this use case, RLlib provides a special [ExternalEnv environment type](https://github.com/ray-project/ray/blob/master/rllib/env/external_env.py) in which the environment executes in its own thread of control. When a decision needs to be made, the policy is queried via `ExternalEnv`'s `self.get_action(episode_id, obs)` and rewards are reported via `self.log_returns(episode_id, reward)`.

In `simple_policy_server.py`, you'll notice that the environment makes use of the built-in [PolicyServer class](https://github.com/ray-project/ray/blob/master/rllib/utils/policy_server.py). All this class does is handle incoming HTTP requests and make the following calls for each episode:
1. Call `self.start_episode()` to get a new episode_id.
2. Call `self.get_action(episode_id, obs)` or `self.log_action(episode_id, obs, action)`
3. Call `self.log_returns(episode_id, reward)`
4. Call `self.end_episode(episode_id, obs)`

Note that `PolicyServer` is a very basic server. However, `ExternalEnv` can work with any kind of Python server implementation.
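
The order of the four calls listed above can be sketched without a running server. `RecordingEpisodeAPI` below is a hypothetical in-memory stand-in that only mimics the method names and records the calls; it is not RLlib's `ExternalEnv` or `PolicyServer`, and its fixed "policy" is a placeholder.

```python
import uuid


class RecordingEpisodeAPI:
    """Hypothetical stand-in that mimics the four per-episode calls.

    It only records which methods were called, so the order is visible.
    """

    def __init__(self):
        self.calls = []

    def start_episode(self):
        self.calls.append("start_episode")
        return uuid.uuid4().hex  # a fresh, unique episode id

    def get_action(self, episode_id, obs):
        self.calls.append("get_action")
        return 0  # placeholder policy: always choose action 0

    def log_returns(self, episode_id, reward):
        self.calls.append("log_returns")

    def end_episode(self, episode_id, obs):
        self.calls.append("end_episode")


def run_episode(api, observations, rewards, final_obs):
    """Drive one episode from pre-recorded observations and rewards."""
    episode_id = api.start_episode()
    total = 0.0
    for obs, reward in zip(observations, rewards):
        api.get_action(episode_id, obs)      # ask the policy what to do
        api.log_returns(episode_id, reward)  # report the resulting reward
        total += reward
    api.end_episode(episode_id, final_obs)   # close with the last observation
    return total


api = RecordingEpisodeAPI()
total = run_episode(
    api,
    observations=[[0.0] * 4, [0.1] * 4],
    rewards=[1.0, 1.0],
    final_obs=[0.2] * 4,
)
print(api.calls)
print(total)  # prints: 2.0
```

Every call after `start_episode()` carries the `episode_id`, which is what lets one server keep several concurrent episodes apart.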

Let's start up Ray as in the previous lesson:

In [None]:
!../tools/start-ray.sh --check --verbose

## Connect to the policy server you started and initialize an environment

In [1]:
from ray.rllib.utils.policy_client import PolicyClient
client = PolicyClient("http://localhost:8900")



We'll start by solving a problem you've already seen: CartPole. Note, however, that RLlib does not interact directly with the CartPole-v0 environment instance here. All policy interactions and learning go through the HTTP policy client:

![client](../images/rllib/client.png)


In [2]:
import gym
env = gym.make("CartPole-v0")

**EXERCISE**: Implement `run_one_episode(env)` below to roll out a CartPole episode using the client and env created above. You'll find the [policy client documentation](https://ray.readthedocs.io/en/latest/rllib-package-ref.html#ray.rllib.utils.PolicyClient) helpful here.

You should see a mean episode reward of about 20, which corresponds to that of an untrained policy.

In [3]:
def run_one_episode(env):
    obs = env.reset()
    done = False
    episode_id = client.start_episode() # We start by requesting a new episode id from the server
    total_reward = 0
    while not done:
        action = ??? # TODO call get_action to get the action for the current observation
        obs, rew, done, info = env.step(action)
        ??? # TODO tell the server about the recent returns of the action
        if done:
            ??? # TODO tell the server the episode ended
        total_reward += rew
    print("Episode reward", total_reward)

run_one_episode(env)

**EXERCISE**: Run episodes until the server reaches a peak reward of ~200 (this might take a few hundred episodes). In another terminal, open TensorBoard to monitor the learning curve of the policy.

If you run into problems, you can stop the server with Ctrl-C and start it again; it will load its last saved state from the file specified by `--checkpoint-file`.

In [4]:
for _ in range(10):
    run_one_episode(env)

## Serving a Pong AI

Next, we'll use the same policy server to train a Pong AI. Interrupt your CartPole server with Ctrl-C, and use the following configuration to start a policy server suitable for learning a Pong policy. Note that the observation size only increases from 4 to 8. This is because the policy operates on a minimal state description of the Pong game instead of raw images, which would take much longer to train on and would require a GPU.

**EXERCISE**: Set up the pong policy server, replacing the cartpole server.

`$ python serving/simple_policy_server.py --action-size=3 --observation-size=8 --run=DQN --checkpoint-file=pong`

Next, we'll set up a web server so that you can play Pong against the policy in your browser.

**EXERCISE**: Set up the web server in another terminal and play a game of Pong against the AI.

`$ python serving/pong_web_server.py`

The web server will listen on port 8900 for HTTP connections. It will internally connect to the policy server already running on port 3000.

#### **In a new browser tab, you should go to a URL like `https://hub.mybinder.org/user/ray-project-tutorial-YOUR_NOTEBOOK_ID/proxy/8900/serving/javascript-pong/static/index.html`**

![web](../images/rllib/web.png)

As you play the game of Pong, notice the policy decisions being made by the policy server. It will look something like this:

![log](../images/rllib/log.png)

Also note that the policy is probably not very good. If you play enough games, it will eventually get better, but we have a faster way of fixing this...

**EXERCISE**: To further improve your policy, run the `python serving/do_rollouts.py` script to automatically run many more games against the live policy. This will eventually train the agent to play competitively, after 60k or so steps. Try playing against the server while these rollouts are happening in the background! Do you notice the policy improving? The learning curve in TensorBoard will look something like this:

![learning](../images/rllib/learning.png)

## Optional exercise: Learning from log data

We've included two datasets, `data_small.gz` and `data_large.gz`, that you can use to help your training. The following cell loads the small dataset:

In [None]:
import gzip
import json

episodes = []  # list of episodes
# Each line of the gzipped file is one JSON-encoded episode.
with gzip.open("serving/data_small.gz", "rt", encoding="utf-8") as f:
    for line in f:
        episodes.append(json.loads(line))

# Each episode is a list of dicts representing steps; show up to ten of them.
for step in episodes[0][:10]:
    print(step)

**EXERCISE**: Using the `client.log_action(episode_id, obs, action)` and `client.log_returns(episode_id, reward)` calls, replay the log data to the policy server. Try both the small and the large dataset.
Loading the large dataset can take a while because of the naive approach of sending steps one by one. While it is in progress, try playing against the Pong AI and see if you notice any improvement (you might want to change the `--checkpoint-file` flag to start over from a fresh policy).

Note that in a production setting, you wouldn't want to send logs to the server over the network like this. Instead, you can write an environment that reads from log shards in parallel and use RLlib to distribute the computation.

In [None]:
steps = 0  # step counter
for episode in episodes:
    episode_id = client.start_episode()
    print("Replaying episode", episode_id, "steps total", steps)
    for step in episode:
        ??? # TODO: log the action
        ??? # TODO: log the returns
        steps += 1
    client.end_episode(episode_id, [0]*8)  # assume last observation is all zeros

## More optional exercises

**EXERCISE**: Repeat the above exercises, but start the server with `--run=PG` to use a policy gradient algorithm instead of DQN. How does its learning performance compare? What about its ability to leverage offline data?

**EXERCISE**: Come up with a simple "toy environment" with a small action and observation space. Use one of the RLlib [algorithms](https://ray.readthedocs.io/en/latest/rllib.html#algorithms) to solve that environment.
