# Ray RLlib Tutorial - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This notebook contains the solutions for all the exercises in the RLlib tutorial.

First, we have to set up everything needed from the other notebooks.

In [8]:
import gym
import numpy as np
import pandas as pd
import json

## 01 Introduction to Reinforcement Learning

### Exercise 1

Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

In [2]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


In [3]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose the action by applying the policy to the current state.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
        
    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

The first sample policy got an average reward of 9.31.
The second sample policy got an average reward of 29.52.
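The idea that a policy is just a function from state to action can be checked without CartPole. The following is a minimal sketch under stated assumptions: `ToyBalanceEnv` is a hypothetical stand-in environment (it is not part of `gym`) that only mimics the `reset()`/`step()` shape used above, and `rollout` mirrors `rollout_policy`. The two policies, `push_away` and `push_toward`, are also invented for illustration.

```python
import random


class ToyBalanceEnv:
    """Hypothetical 1-D environment: the episode ends when the position
    drifts to +/- limit, or after max_steps steps."""

    def __init__(self, limit=5, max_steps=50, seed=0):
        self.limit = limit
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        # Start one step to the left or right of the center.
        self.position = self.rng.choice([-1, 1])
        self.steps = 0
        return [self.position]

    def step(self, action):
        # Action 1 moves right (+1); any other action moves left (-1).
        self.position += 1 if action == 1 else -1
        self.steps += 1
        done = abs(self.position) >= self.limit or self.steps >= self.max_steps
        # Same 4-tuple shape as used above: state, reward, done, info.
        return [self.position], 1.0, done, {}


def rollout(env, policy):
    """Same structure as rollout_policy: the policy picks every action."""
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:
        action = policy(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward


def push_away(state):
    # Moves further from the center, so the episode ends quickly.
    return 1 if state[0] > 0 else 0


def push_toward(state):
    # Moves back toward the center, so the episode runs to max_steps.
    return 0 if state[0] > 0 else 1


print('push_away reward:  ', rollout(ToyBalanceEnv(), push_away))
print('push_toward reward:', rollout(ToyBalanceEnv(), push_toward))
```

As with the two sample policies above, the only difference between the runs is the function passed in as the policy; the rollout loop itself is unchanged.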


### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [2]:
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [4]:
ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:42830',
 'object_store_address': '/tmp/ray/session_2020-06-01_17-56-50_285894_82926/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-01_17-56-50_285894_82926/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-01_17-56-50_285894_82926'}

Here's one possible set of changes. It takes longer for the max reward to reach 200, so I increased the number of training iterations `N` from 5 to 20.

In [6]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10                       # was 30
config['sgd_minibatch_size'] = 256                # was 128
config['model']['fcnet_hiddens'] = [20, 20]       # was [100, 100]
config['num_cpus_per_worker'] = 0
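One caveat when editing nested settings such as `config['model']`: Python's `dict.copy()` is a shallow copy, so a nested dictionary is shared between the copy and the original. The sketch below illustrates this with a hypothetical stand-in dictionary named `defaults` (it is not the real `DEFAULT_CONFIG`, and the assumption is that the real one behaves like a plain nested `dict`). If that assumption holds, `copy.deepcopy` is a safer way to get an independent configuration.

```python
import copy

# Hypothetical stand-in for a nested default configuration.
defaults = {'num_sgd_iter': 30, 'model': {'fcnet_hiddens': [100, 100]}}

# Shallow copy: top-level keys are independent, nested dicts are shared.
shallow = defaults.copy()
shallow['num_sgd_iter'] = 10                  # does not touch defaults
shallow['model']['fcnet_hiddens'] = [20, 20]  # also changes defaults['model']
shared_after_shallow = defaults['model']['fcnet_hiddens']
print('defaults after shallow edit:', defaults)

# Restore the stand-in defaults, then repeat with a deep copy.
defaults['model']['fcnet_hiddens'] = [100, 100]
deep = copy.deepcopy(defaults)
deep['model']['fcnet_hiddens'] = [20, 20]     # defaults stay untouched
print('defaults after deep edit:   ', defaults)
```

This matters mainly when several configurations are derived from the same defaults in one session, for example when comparing network sizes.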

In [17]:
agent = PPOTrainer(config, 'CartPole-v0')

2020-06-02 06:22:14,226	INFO trainable.py:217 -- Getting current IP.


In [18]:
N = 20              # was 5
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'Max reward: {episode["episode_reward_max"]}')    

Max reward: 81.0
Max reward: 70.0
Max reward: 80.0
Max reward: 119.0
Max reward: 130.0
Max reward: 140.0
Max reward: 128.0
Max reward: 155.0
Max reward: 146.0
Max reward: 175.0
Max reward: 160.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0
Max reward: 200.0


In [19]:
df = pd.DataFrame(data=episode_data)
df

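|   | n | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---------------------|--------------------|------------------|
| 0 | 0 | 22.907514 | 81.0 | 22.907514 |
| 1 | 1 | 24.83125 | 70.0 | 24.83125 |
| 2 | 2 | 27.438356 | 80.0 | 27.438356 |
| 3 | 3 | 33.2 | 119.0 | 33.2 |
| 4 | 4 | 36.862385 | 130.0 | 36.862385 |
| 5 | 5 | 40.58 | 140.0 | 40.58 |
| 6 | 6 | 41.42 | 128.0 | 41.42 |
| 7 | 7 | 52.59 | 155.0 | 52.59 |
| 8 | 8 | 59.3 | 146.0 | 59.3 |
| 9 | 9 | 67.91 | 175.0 | 67.91 |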


In [20]:
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [21]:
source = ColumnDataSource(df)

plot = figure(title='Episode reward and length means/maxes')
plot.grid.grid_line_alpha=0.2
plot.xaxis.axis_label = 'n'
plot.yaxis.axis_label = 'value'

plot.line(x='n', y='episode_reward_mean', source=source, color='blue', legend_label='Episode reward mean', name='Episode reward mean')
plot.circle(x='n', y='episode_reward_mean', source=source, color='blue', size=8)
plot.line(x='n', y='episode_reward_max', source=source, color='green', legend_label='Episode reward max', name='Episode reward max')
plot.circle(x='n', y='episode_reward_max', source=source, color='green', size=8)
plot.legend.location = "top_left"

hover = HoverTool()
hover.tooltips = [
    ("n", "$x"),
    ("mean", "$y")]
plot.add_tools(hover)

show(plot)

![Episode reward means and maxes](images/rllib/episode-rewards-means-maxes-exercise.png)

Try it again with slightly larger neural network layers.
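When comparing runs with different layer sizes, one simple measure is the first training iteration at which the max reward reaches 200. The helper below is a small sketch; `first_iteration_reaching` is a name introduced here for illustration, and the list reproduces the `Max reward` values printed by the run above. In the notebook, the same list could be built with `[r['episode_reward_max'] for r in results]`.

```python
def first_iteration_reaching(max_rewards, target=200.0):
    """Return the 0-based index of the first value >= target, or None."""
    for n, reward in enumerate(max_rewards):
        if reward >= target:
            return n
    return None


# Max rewards printed by the 20-iteration run above.
max_rewards = [81.0, 70.0, 80.0, 119.0, 130.0, 140.0, 128.0, 155.0, 146.0,
               175.0, 160.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0,
               200.0, 200.0]

print('First iteration reaching 200:', first_iteration_reaching(max_rewards))
```

For the run above this gives iteration 11 (counting from 0), which is why 5 iterations were not enough with the smaller network.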