# Ray RLlib - Introduction to Reinforcement Learning

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

_Reinforcement Learning_ is the category of machine learning that focuses on training one or more _agents_ to achieve maximal _rewards_ while operating in an environment. This lesson discusses the core concepts of RL, while subsequent lessons explore RLlib in depth. We'll use two examples with exercises to give you a taste of RL. If you already understand RL concepts, you can either skim this lesson or skip to the [next lesson](02-About-RLlib.ipynb).

## What Is Reinforcement Learning?

Let's explore the basic concepts of RL, specifically the _Markov Decision Process_ abstraction, and show its use in Python.

Consider the following image:

![RL Concepts](../images/rllib/RL-concepts.png)

In RL, one or more **agents** interact with an **environment** to maximize a **reward**. The agents make **observations** about the **state** of the environment and take **actions** that they believe will maximize the long-term reward. However, at any particular moment, the agents can only observe the immediate reward. So, the training process usually involves many repeated trials, for example replaying a game or having a robot simulator traverse a virtual space, so the agents can learn which decisions and actions work best to maximize the long-term, cumulative reward.

Trial-and-error search and delayed reward are the distinguishing characteristics of RL vs. other ML methods ([Sutton 2018](06-RL-References.ipynb#Books)).

The way to formalize trial and error is the **exploitation vs. exploration tradeoff**. When an agent finds what appears to be a "rewarding" sequence of actions, the agent may naturally want to continue to **exploit** these actions. However, even better actions may exist. An agent won't know whether alternatives are better or not unless some percentage of actions taken **explore** the alternatives. So, all RL algorithms include a strategy for exploitation and exploration.
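
One simple strategy of this kind is an _epsilon-greedy_ rule: explore with a small probability and exploit otherwise. The following is a minimal sketch and is not specific to RLlib; the function name `epsilon_greedy`, the `estimates` list, and the value of `epsilon` are illustrative assumptions.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability `epsilon`, explore by picking uniformly at random;
    otherwise exploit the action with the highest current estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit

# Hypothetical value estimates for three actions.
estimates = [0.2, 0.8, 0.5]
rng = random.Random(0)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(f"share of greedy picks: {greedy_share:.2f}")
```

With `epsilon=0.1`, most picks go to the action with the highest estimate, while the other actions are still tried occasionally.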

## RL Applications

RL has many potential applications. It became "famous" through successes such as achieving expert game play and training robots, autonomous vehicles, and other simulated agents:

![AlphaGo](../images/rllib/alpha-go.jpg)
![Game](../images/rllib/breakout.png)

![Stacking Legos with Sawyer](../images/rllib/stacking-legos-with-sawyer.gif)
![Walking Man](../images/rllib/walking-man.gif)

![Autonomous Vehicle](../images/rllib/daimler-autonomous-car.jpg)
![Four-legged Robot](../images/rllib/four-legged-robot.jpeg)

Credits:
* [AlphaGo](https://www.youtube.com/watch?v=l7ngy56GY6k)
* [Breakout](https://towardsdatascience.com/tutorial-double-deep-q-learning-with-dueling-network-architectures-4c1b3fb7f756) ([paper](https://arxiv.org/abs/1312.5602))
* [Stacking Legos with Sawyer](https://robohub.org/soft-actor-critic-deep-reinforcement-learning-with-real-world-robots/)
* [Walking Man](https://openai.com/blog/openai-baselines-ppo/)
* [Autonomous Vehicle](https://www.daimler.com/innovation/case/autonomous/intelligent-drive-2.html)
* [Four-legged Robot](https://robohub.org/four-legged-robot-that-efficiently-handles-challenging-terrain/)

Recently, other industry applications have emerged, including the following:

* **Process optimization:** industrial processes (factories, pipelines) and other business processes, routing problems, cluster optimization.
* **Ad serving and recommendations:** Some of the traditional methods, including _collaborative filtering_, are hard to scale for very large data sets. RL systems are being developed to do an effective job more efficiently than traditional methods.
* **Finance:** Markets are time-oriented _environments_ where automated trading systems are the _agents_. 

## Markov Decision Processes

At its core, reinforcement learning builds on the concept of a [Markov Decision Process (MDP)](https://en.wikipedia.org/wiki/Markov_decision_process), where the current state, the possible actions that can be taken, and the overall goal are the building blocks.

An MDP models sequential interactions with an external environment. It consists of the following:

- a **state space** where the current state of the system is sometimes called the **context**.
- a set of **actions** that can be taken at a particular state $s$ (or sometimes the same set for all states).
- a **transition function** that describes the probability of being in a state $s'$ at time $t+1$ given that the MDP was in state $s$ at time $t$ and action $a$ was taken. The next state is selected stochastically based on these probabilities.
- a **reward function**, which determines the reward received at time $t$ following action $a$, based on the decision of **policy** $\pi$.
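
To make these four pieces concrete, here is a minimal sketch of a tiny MDP written as plain Python data. The state names, action names, probabilities, and rewards are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-state MDP. transitions[state][action] is a list of
# (probability, next_state, reward) triples.
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "warm", -10.0)],
    },
}

def step(state, action, rng):
    """Sample (next_state, reward) from the transition function."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
state = "cool"
for t in range(5):
    action = rng.choice(sorted(transitions[state]))  # a random policy
    next_state, reward = step(state, action, rng)
    print(t, state, action, reward, next_state)
    state = next_state
```

The keys of `transitions` are the state space, the inner keys are the actions available in each state, and each triple combines the transition probability with the reward for that outcome.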

The goal of an MDP is to develop a **policy** $\pi$ that specifies what action $a$ should be chosen for a given state $s$ so that the cumulative reward is maximized. Since the policy fixes a single action for each state, the transition probabilities reduce to the probability of transitioning to state $s'$ given the current state is $s$, independent of actions. Various algorithms can be used to compute this policy. 

Often this cumulative reward is computed using the **discounted sum** over all rewards observed:

\begin{equation}
\arg\max_{\pi} \sum_{t=1}^T \gamma^t R_t(\pi),
\end{equation}

where $T$ is the number of steps taken in the MDP (this is a random variable and may depend on $\pi$), $R_t$ is the reward received at time $t$ (also a random variable which depends on $\pi$), and $\gamma$ is the **discount factor**. The value of $\gamma$ is between 0 and 1, so the weight $\gamma^t$ shrinks as $t$ grows, which has the effect of "discounting" later rewards relative to earlier ones.
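
As a small worked example of the sum inside the formula above, the sketch below computes $\sum_{t=1}^T \gamma^t R_t$ for one fixed sequence of rewards. The helper name `discounted_return` and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum gamma**t * r_t for t = 1..T, using the same indexing as the formula."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

# Hypothetical rewards received at t = 1, 2, 3.
rewards = [1.0, 1.0, 1.0]

undiscounted = discounted_return(rewards, gamma=1.0)  # 1 + 1 + 1
discounted = discounted_return(rewards, gamma=0.5)    # 0.5 + 0.25 + 0.125
weights = [0.5 ** t for t in range(1, len(rewards) + 1)]

print(undiscounted, discounted, weights)
```

The `weights` list shows how much each time step contributes when $\gamma = 0.5$: the same reward counts for less the later it arrives.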

More details about MDP are available [here](https://en.wikipedia.org/wiki/Markov_decision_process). Note what we said in the third bullet, that the new state only depends on the previous state and the action taken. The assumption is that we can simplify our effort by ignoring all the previous states except the last one and still achieve good results. This is known as the [Markov property](https://en.wikipedia.org/wiki/Markov_property).

## The Elements of RL

Here are the elements of RL that expand on MDP concepts ([Sutton 2018](http://incompleteideas.net/book/bookdraft2018jan1.pdf)):

#### Policies

Unlike MDP, the **transition function** probabilities are often not known in advance, but must be learned. Learning is done through repeated "play", where the agent interacts with the environment.

This makes the **policy** $\pi$ harder to determine. Also, specifying one action per state, as in MDP, is usually relaxed, making the choice of action stochastic.
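
One common way to make the choice of action stochastic is to turn per-action preferences into probabilities with a softmax and then sample from them. This is a minimal sketch under that assumption; the function names and the preference values are illustrative only.

```python
import math
import random

def softmax(preferences):
    """Turn arbitrary real-valued preferences into probabilities."""
    peak = max(preferences)
    exps = [math.exp(p - peak) for p in preferences]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(preferences, rng):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(preferences)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical preferences for two actions in some state.
preferences = [0.0, 1.0]
probs = softmax(preferences)
rng = random.Random(0)
actions = [sample_action(preferences, rng) for _ in range(1000)]
share_of_action_1 = actions.count(1) / len(actions)
print(probs, share_of_action_1)
```

The preferred action is chosen most of the time, but the other action keeps a nonzero probability, so the policy maps a state to a distribution over actions rather than to a single action.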

#### Reward Signal

The idea of a **reward signal** encapsulates the desired goal for the system and provides feedback for updating the policy based on how well particular events or actions contribute rewards towards the goal.

#### Value Function

The **value function** encapsulates the maximum cumulative reward likely to be achieved starting from a given state. This is harder to determine than the simple reward returned after taking an action. In fact, much of the research in RL over the decades has focused on finding better and more efficient implementations of value functions. To illustrate the challenge, repeatedly taking one sequence of actions may yield low rewards for a while, but eventually provide large rewards, or vice-versa. 
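
The sketch below illustrates the difference between immediate reward and value on a hypothetical three-state chain. It assumes, unlike most RL problems, that the transitions and rewards are fully known and deterministic, and it uses a basic value-iteration loop with an assumed discount factor of 0.9.

```python
# Hypothetical deterministic chain: model[state][action] = (next_state, reward).
# A next_state of None marks the end of the episode.
model = {
    0: {"quit": (None, 1.0), "advance": (1, 0.0)},
    1: {"quit": (None, 1.0), "advance": (2, 0.0)},
    2: {"quit": (None, 1.0), "advance": (None, 10.0)},
}

def value_iteration(model, gamma, sweeps=50):
    """Repeatedly apply V(s) = max_a [r + gamma * V(s')] until values settle."""
    values = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s, actions in model.items():
            values[s] = max(
                r + (gamma * values[nxt] if nxt is not None else 0.0)
                for nxt, r in actions.values()
            )
    return values

values = value_iteration(model, gamma=0.9)
print(values)
```

In state 0, "quit" pays 1 immediately and "advance" pays 0, yet the value of state 0 comes out near 8.1 because advancing twice more leads to the large final reward.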

#### Model

Optionally, some RL algorithms develop a **model** of the environment to anticipate the resulting states and rewards for future actions. Such models are useful for _planning_ scenarios. Methods for solving RL problems that use models are called _model-based methods_, while methods that learn by trial and error are called _model-free methods_.

## Reinforcement Learning Examples

Let's finish this introduction with two examples. One is a popular "hello world" (1) example for RL, balancing a pole vertically on a moving cart, called _CartPole_. The second example provides a "taste" of RLlib and a popular RL algorithm, _Proximal Policy Optimization_, again using _CartPole_.

(1) In books and tutorials on particular programming languages, it is a tradition that the very first program shown prints the message "Hello World!".

### CartPole and OpenAI

The [OpenAI "gym" environment](https://gym.openai.com/) provides MDP interfaces to a variety of simulators. For example, the _CartPole_ environment interfaces with a simple simulator that simulates the physics of balancing a pole on a cart. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0. Here is an image from that website:

![Cart Pole](../images/rllib/Cart-Pole.png)

This example fits into the MDP framework as follows.
- The **state** consists of the position and velocity of the cart (moving in one dimension from left to right) as well as the angle and angular velocity of the pole that is balancing on the cart.
- The **actions** are to push the cart to the left or to the right.
- The **transition function** is deterministic and is determined by simulating physical laws.
- The **reward function** is a constant 1 as long as the pole is upright, and 0 once the pole has fallen over. Therefore, maximizing the reward means balancing the pole for as long as possible.
- The **discount factor** in this case can be taken to be 1.

More information about the `gym` Python module is available at https://gym.openai.com/. The list of all the available Gym environments is in [this wiki page](https://github.com/openai/gym/wiki/Table-of-environments). We'll even create our own in subsequent lessons.

In [1]:
import gym
import numpy as np
import pandas as pd
import json

The code below illustrates how to create and manipulate MDPs in Python. An MDP can be created by calling `gym.make`. Gym environments are identified by names like `CartPole-v0`. A **catalog of built-in environments** can be found at https://gym.openai.com/envs.

In [2]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


Reset the state of the MDP by calling `env.reset()`. This call returns the initial state of the MDP.

In [3]:
state = env.reset()
print('The starting state is:', state)

The starting state is: [ 0.0077461  -0.00489211  0.00790247  0.04630035]


Recall that the state is the position of the cart, its velocity, the angle of the pole, and the angular velocity of the pole.

The `env.step` method takes an action. In the case of the CartPole environment, the appropriate actions are 0 or 1, for pushing the cart to the left or right, respectively. `env.step()` returns a tuple of four things:
1. the new state of the environment
2. a reward
3. a boolean indicating whether the simulation has finished
4. a dictionary of miscellaneous extra information

Let's show what happens if we take one step with an action of 0.

In [4]:
action = 0
state, reward, done, info = env.step(action)
print(state, reward, done, info)

[ 0.00764826 -0.20012648  0.00882847  0.34146606] 1.0 False {}


A **rollout** is a simulation of a policy (trained or "hand-coded") in an environment. It alternates between choosing actions using some policy and taking those actions in the environment.

The code below performs a rollout in a given environment. It takes **random actions** until the simulation has finished and returns the cumulative reward.

In [5]:
def random_rollout(env):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose a random action (either 0 or 1).
        action = np.random.choice([0, 1])
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
    
    # Return the cumulative reward.
    return cumulative_reward    

Try rerunning the following cell a few times. How much do the answers change? Note that the maximum possible reward for _CartPole_ is 200. You'll probably get numbers under 50.

In [6]:
reward = random_rollout(env)
print(reward)
reward = random_rollout(env)
print(reward)

17.0
13.0


### Exercise 1

Choosing actions at random in `random_rollout` is not a very effective policy, as the previous results showed. Finish implementing the `rollout_policy` function below, which takes an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

> **Note:** Exercise solutions for this tutorial can be found [here](solutions/Ray-RLlib-Solutions.ipynb).

In [8]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # EXERCISE: Fill out this function by copying the appropriate part of 'random_rollout'
    # and modifying it to choose the action using the policy.
    raise NotImplementedError

    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

NotImplementedError: 

We'll return to _CartPole_ in lesson [03: Application Cart Pole](03-Application-Cart-Pole.ipynb).

### Reinforcement Learning Example: Cart Pole with Proximal Policy Optimization

This section demonstrates how to use the _proximal policy optimization_ (PPO) algorithm implemented by [RLlib](http://rllib.io). PPO is a popular way to develop a policy. RLlib also uses [Ray Tune](http://tune.io), the Ray Hyperparameter Tuning framework, which is covered in the [Ray Tune Tutorial](../ray-tune/00-Ray-Tune-Overview.ipynb).

We'll provide relatively little explanation of **RLlib** concepts for now, but explore them in greater depth in subsequent lessons. For more on RLlib, see the documentation at http://rllib.io.

PPO is described in detail in [this paper](https://arxiv.org/abs/1707.06347). It is a variant of _Trust Region Policy Optimization_ (TRPO) described in [this earlier paper](https://arxiv.org/abs/1502.05477). [This OpenAI post](https://openai.com/blog/openai-baselines-ppo/) provides a more accessible introduction to PPO.

PPO works in two phases. In the first phase, a large number of rollouts are performed in parallel. The rollouts are then aggregated on the driver and a surrogate optimization objective is defined based on those rollouts. In the second phase, we use SGD (_stochastic gradient descent_) to find the policy that maximizes that objective with a penalty term for diverging too much from the current policy.
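
To give a feel for what such an objective looks like, here is a simplified sketch of a surrogate with a KL-divergence penalty for discrete actions. It is an illustration only and not RLlib's actual loss; the function name, the argument layout, and all numbers are assumptions, and every probability is assumed to be strictly positive. The PPO paper also describes a clipped variant, which is not shown here.

```python
import math

def penalized_surrogate(old_probs, new_probs, actions, advantages, kl_coeff):
    """Mean of ratio * advantage minus kl_coeff * mean KL(old || new).

    old_probs / new_probs: per-sample action-probability lists from the
    current policy and a candidate policy. actions: sampled action indices.
    advantages: how much better each action did than expected.
    """
    n = len(actions)
    gain = sum(
        new_probs[i][a] / old_probs[i][a] * advantages[i]
        for i, a in enumerate(actions)
    ) / n
    kl = sum(
        sum(p * math.log(p / q) for p, q in zip(old_probs[i], new_probs[i]))
        for i in range(n)
    ) / n
    return gain - kl_coeff * kl

# Hypothetical rollout data: two samples, two possible actions.
old = [[0.5, 0.5], [0.5, 0.5]]
actions = [0, 1]
advantages = [1.0, -1.0]  # action 0 did better than expected, action 1 worse

unchanged = penalized_surrogate(old, old, actions, advantages, kl_coeff=1.0)
small_step = penalized_surrogate(old, [[0.6, 0.4]] * 2, actions, advantages, kl_coeff=1.0)
big_step = penalized_surrogate(old, [[0.99, 0.01]] * 2, actions, advantages, kl_coeff=1.0)
print(unchanged, small_step, big_step)
```

With this penalty coefficient, a modest shift toward the better action scores higher than leaving the policy unchanged, while a very large shift is scored lower because the penalty outweighs the gain.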

![PPO](../images/rllib/ppo.png)

> **NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `num_gpus` field of the `config` dictionary. Hence, for normal usage, one or more GPUs is recommended.

(The original version of this example can be found [here](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/)).

In [9]:
# import gym  # imported above already, but listed here for completeness
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

The following script checks if the Ray cluster is already running. If not, it tells you what to do to start Ray.

In [10]:
!../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


Now start Ray in this "driver" process. This must be done before we instantiate any RL agents.

In [11]:
ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15832',
 'object_store_address': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764'}

> **Tip:** Having trouble starting Ray? See the [Troubleshooting](../reference/Troubleshooting-Tips-Tricks.ipynb) tips.

The next cell prints the URL for the Ray Dashboard. **This is only correct if you are running this tutorial on a laptop.** Click the link to open the dashboard.

If you are running on the Anyscale platform, use the URL provided by your instructor to open the Dashboard.

In [12]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Instantiate a PPOTrainer object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used. In a cluster, these actors will be spread over the available nodes.
- `num_sgd_iter` is the number of epochs of SGD (stochastic gradient descent, i.e., passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO, for each _minibatch_ ("chunk") of training data. Using minibatches is more efficient than training with one record at a time.
- `sgd_minibatch_size` is the SGD minibatch size (batches of data) that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers. Here, we have two hidden layers of size 100, each.
- `num_cpus_per_worker`, when set to 0, prevents Ray from pinning a CPU core to each worker. With pinning, we could run out of CPU cores for workers in a constrained environment like a laptop or a cloud VM.
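
As a rough worked example of how `num_sgd_iter` and `sgd_minibatch_size` interact, the sketch below counts minibatch updates per training iteration. The batch size of 4000 is not set in this lesson; it is inferred from the result printed later, which reports `num_steps_sampled: 40000` after 10 iterations. The helper name and the assumption that leftover samples are dropped are ours, chosen because they are consistent with the reported `num_steps_trained: 39680`.

```python
def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Count minibatch gradient updates in one training iteration.

    Assumes each epoch walks the batch in full minibatches and drops any
    leftover samples that do not fill a minibatch.
    """
    minibatches_per_epoch = train_batch_size // sgd_minibatch_size
    samples_per_epoch = minibatches_per_epoch * sgd_minibatch_size
    return minibatches_per_epoch * num_sgd_iter, samples_per_epoch

# 4000 is inferred from the output, not a value configured in this lesson.
updates, samples = sgd_updates_per_iteration(
    train_batch_size=4000, sgd_minibatch_size=128, num_sgd_iter=30
)
print(updates, samples)
```

Under these assumptions each epoch uses 31 minibatches covering 3968 samples, so 30 epochs give 930 minibatch updates per iteration. Raising `sgd_minibatch_size` or lowering `num_sgd_iter` reduces that count.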

In [13]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 1
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0 

In [16]:
agent = PPOTrainer(config, 'CartPole-v0')

2020-06-13 10:39:13,753	INFO trainable.py:217 -- Getting current IP.


Now let's train the policy on the `CartPole-v0` environment for `N` training iterations. The JSON object returned by each call to `agent.train()` contains a lot of information we'll inspect below. For now, we'll extract information we'll graph, such as `episode_reward_mean`. The _mean_ values are more useful for determining successful training.

In [17]:
N=10
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'],  
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}')

0: Min/Mean/Max reward:  8.0000/21.5892/62.0000
1: Min/Mean/Max reward: 11.0000/34.7807/132.0000
2: Min/Mean/Max reward: 11.0000/57.6300/200.0000
3: Min/Mean/Max reward: 11.0000/89.7900/200.0000
4: Min/Mean/Max reward: 16.0000/119.8200/200.0000
5: Min/Mean/Max reward: 16.0000/140.3400/200.0000
6: Min/Mean/Max reward: 29.0000/162.1400/200.0000
7: Min/Mean/Max reward: 29.0000/175.8800/200.0000
8: Min/Mean/Max reward: 29.0000/182.2700/200.0000
9: Min/Mean/Max reward: 29.0000/191.4800/200.0000


Now let's convert the episode data to a Pandas `DataFrame` for easy manipulation. The results indicate how much reward the policy is receiving (`episode_reward_*`) and how many time steps of the environment the policy ran (`episode_len_mean`). The maximum possible reward for this problem is `200`. The reward mean and trajectory length are very close because the agent receives a reward of one for every time step that it survives. However, this is specific to this environment and not true in general.

In [18]:
df = pd.DataFrame(data=episode_data)
df

|   | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | 8.0 | 21.589189 | 62.0 | 21.589189 |
| 1 | 1 | 11.0 | 34.780702 | 132.0 | 34.780702 |
| 2 | 2 | 11.0 | 57.63 | 200.0 | 57.63 |
| 3 | 3 | 11.0 | 89.79 | 200.0 | 89.79 |
| 4 | 4 | 16.0 | 119.82 | 200.0 | 119.82 |
| 5 | 5 | 16.0 | 140.34 | 200.0 | 140.34 |
| 6 | 6 | 29.0 | 162.14 | 200.0 | 162.14 |
| 7 | 7 | 29.0 | 175.88 | 200.0 | 175.88 |
| 8 | 8 | 29.0 | 182.27 | 200.0 | 182.27 |
| 9 | 9 | 29.0 | 191.48 | 200.0 | 191.48 |


In [19]:
df.columns.tolist()

['n',
 'episode_reward_min',
 'episode_reward_mean',
 'episode_reward_max',
 'episode_len_mean']

Let's plot the data.

In [20]:
import sys
sys.path.append("..")
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

In [21]:
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

Since the length and reward means are equal, we'll only plot one line:

In [22]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                      title='Episode Rewards', x_axis_label='n', y_axis_label='reward')

([image](../images/rllib/Cart-Pole-Episode-Rewards.png))

The model is quickly able to hit the maximum value of 200, but the mean is what's most valuable. After 10 training iterations, we're close to the maximum for the mean.

FYI, here are two views of the whole value for one result. First, a "pretty print" output.

> **Tip:** The output will be long. When this happens for a cell, right click and select _Enable scrolling for outputs_.

In [23]:
print(pretty_print(results[-1]))

custom_metrics: {}
date: 2020-06-13_10-40-56
done: false
episode_len_mean: 191.48
episode_reward_max: 200.0
episode_reward_mean: 191.48
episode_reward_min: 29.0
episodes_this_iter: 21
episodes_total: 503
experiment_id: bc9476c18325499a89dde1579b8818e1
hostname: DWAnyscaleMBP.local
info:
  grad_time_ms: 1345.964
  learner:
    default_policy:
      cur_kl_coeff: 0.03750000149011612
      cur_lr: 4.999999873689376e-05
      entropy: 0.5548126697540283
      entropy_coeff: 0.0
      kl: 0.009755834005773067
      model: {}
      policy_loss: -0.007021887693554163
      total_loss: 605.2608642578125
      vf_explained_var: 0.29809436202049255
      vf_loss: 605.2674560546875
  load_time_ms: 5.997
  num_steps_sampled: 40000
  num_steps_trained: 39680
  sample_time_ms: 2737.097
  update_time_ms: 46.446
iterations_since_restore: 10
node_ip: 192.168.1.149
num_healthy_workers: 1
off_policy_estimator: {}
optimizer_steps_this_iter: 1
perf:
  cpu_util_percent: 41.78333333333333
  ram_util_percent:

We'll learn about more of these values as we continue the tutorial.

The whole, long JSON blob, which includes the historical stats about episode rewards and lengths:

In [24]:
results[-1]

{'episode_reward_max': 200.0,
 'episode_reward_min': 29.0,
 'episode_reward_mean': 191.48,
 'episode_len_mean': 191.48,
 'episodes_this_iter': 21,
 'policy_reward_min': {},
 'policy_reward_max': {},
 'policy_reward_mean': {},
 'custom_metrics': {},
 'hist_stats': {'episode_reward': [200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   175.0,
   200.0,
   200.0,
   200.0,
   200.0,
   29.0,
   200.0,
   174.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   195.0,
   200.0,
   200.0,
   134.0,
   200.0,
   159.0,
   192.0,
   200.0,
   200.0,
   200.0,
   194.0,
   200.0,
   191.0,
   200.0,
   115.0,
   200.0,
   200.0,
   200.0,
   200.0,
   197.0,
   200.0,
   144.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   200.0,
   199.0,
   200.0,
   179.0,
   200.0,
   187.0,
   200.0,
   200.0,
   200.0,
   200.0,
   140.0,
   200.0,
   155.0,


Let's plot the `episode_reward` values:

In [25]:
episode_rewards = results[-1]['hist_stats']['episode_reward']
df_episode_rewards = pd.DataFrame(data={'episode':range(len(episode_rewards)), 'reward':episode_rewards})
plot_line(df_episode_rewards, x_col='episode', y_col='reward', title='Episode Rewards', x_axis_label='episode', y_axis_label='reward')

([image](../images/rllib/Cart-Pole-Episode-Rewards2.png))

For a well-trained model, most runs do very well while occasional runs do poorly.
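
A quick way to quantify this without a plot is to compare the mean, the median, and the share of episodes that reach the maximum. The sketch below uses only the first 22 rewards visible in the `hist_stats` output above rather than the full list, so the numbers describe that excerpt and not the whole run.

```python
import statistics

# The first 22 episode rewards from the `hist_stats` output shown above.
episode_rewards = [200.0] * 16 + [175.0] + [200.0] * 4 + [29.0]

mean_reward = statistics.mean(episode_rewards)
median_reward = statistics.median(episode_rewards)
share_at_max = sum(r == 200.0 for r in episode_rewards) / len(episode_rewards)

print(f"mean={mean_reward:.2f} median={median_reward} share at 200={share_at_max:.2f}")
```

A median at the maximum combined with a lower mean is the signature of a few poor episodes pulling the average down.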

### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network (the `config['model']['fcnet_hiddens']` setting) and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [26]:
# Make edits here:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0

agent = PPOTrainer(config, 'CartPole-v0')

2020-06-13 10:45:49,464	INFO trainable.py:217 -- Getting current IP.
2020-06-13 11:30:59,475	ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-06-13 11:30:59,451	ERROR worker.py:1092 -- listen_error_messages_raylet: Connection closed by server.


Train the agent and try to get a reward of 200. If it's training too slowly you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around `N` = 20 or 30 training iterations.

In [27]:
N=5
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'Max reward: {episode["episode_reward_max"]}')

Max reward: 66.0
Max reward: 103.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0


### Using Model Checkpoints

Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model file, which can be used later to restore the model.

In [28]:
checkpoint_path = agent.save()
print(checkpoint_path)

/Users/deanwampler/ray_results/PPO_CartPole-v0_2020-06-12_09-38-52v04r6vuz/checkpoint_5/checkpoint-5


Now let's use the trained policy to make predictions.

> **Note:** Here we are loading the trained policy in the same process, but in practice, this would normally be done in a different process, for example on a production cluster separate from the training cluster.

In [29]:
trained_config = config.copy()

test_agent = PPOTrainer(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

2020-06-12 09:47:03,656	INFO trainable.py:217 -- Getting current IP.
2020-06-12 09:47:03,720	INFO trainable.py:217 -- Getting current IP.
2020-06-12 09:47:03,721	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/PPO_CartPole-v0_2020-06-12_09-38-52v04r6vuz/checkpoint_5/checkpoint-5
2020-06-12 09:47:03,722	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': 20000, '_time_total': 15.386901617050171, '_episodes_total': 400}


Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)` which uses the trained policy to choose an action.

Verify that the cumulative reward received roughly matches up with the reward printed above. It will be at or near 200.

In [30]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)  # key line; get the next action
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

200.0


The next lesson, [02: Introduction to RLlib](02-Introduction-to-RLlib.ipynb), steps back to introduce RLlib, its goals, and the capabilities it provides.