# Ray RLlib Tutorial - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This notebook contains the solutions for all the exercises in the RLlib tutorial.

First, we have to set up everything needed from the other notebooks.

In [1]:
import gym
import numpy as np
import pandas as pd
import json

## 01 Introduction to Reinforcement Learning

### Exercise 1

Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

In [2]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


In [3]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose an action by applying the policy to the current state.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
        
    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

The first sample policy got an average reward of 9.45.
The second sample policy got an average reward of 30.08.
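
To check that `rollout_policy` really consults the policy on every step, it can help to run it against an environment whose rewards are easy to predict. The sketch below uses a hypothetical stand-in, `CountdownEnv`, which is not part of Gym. It only mimics the `reset()`/`step()` interface used above, with the same 4-tuple return value from `step()`, and pays a reward of 1 whenever the action is 1.

```python
class CountdownEnv:
    """Hypothetical stand-in for a Gym environment (illustration only).

    Each episode lasts `horizon` steps. Action 1 earns a reward of 1.0,
    any other action earns 0.0. The state is the current step count.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return [float(self.t)], reward, done, {}


def rollout_policy(env, policy):
    state = env.reset()
    done = False
    cumulative_reward = 0
    while not done:
        # The action is a function of the state, not a random choice.
        action = policy(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward


def always_one(state):
    return 1


def always_zero(state):
    return 0


def alternate(state):
    # Action 1 on odd step counts, action 0 on even step counts.
    return int(state[0]) % 2


print(rollout_policy(CountdownEnv(), always_one))   # 5.0
print(rollout_policy(CountdownEnv(), always_zero))  # 0.0
print(rollout_policy(CountdownEnv(), alternate))    # 2.0
```

Because the three policies produce three different totals, a version of `rollout_policy` that ignored its `policy` argument would be easy to spot.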


### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [4]:
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [5]:
ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15832',
 'object_store_address': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764'}

Here's one possible set of changes. It takes longer for the max reward to reach 200, so I increased the number of training iterations `N` to 20.

In [6]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10                       # was 30
config['sgd_minibatch_size'] = 256                # was 128
config['model']['fcnet_hiddens'] = [20, 20]       # was [100, 100]
config['num_cpus_per_worker'] = 0
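
To get a rough feel for how much lighter this configuration is, here is a back-of-the-envelope sketch. It counts the weights and biases of a plain fully connected network with CartPole's 4 observation values and 2 actions, and it estimates the number of minibatch SGD updates per call to `train()`. Both helper functions are hypothetical, the `TRAIN_BATCH_SIZE` of 4000 is an assumption about the PPO default, and the model RLlib actually builds may contain additional layers (for example, for the value function), so treat the numbers as estimates only.

```python
import math


def mlp_param_count(n_inputs, hidden_sizes, n_outputs):
    """Count weights and biases in a plain fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def sgd_updates_per_train_call(train_batch_size, minibatch_size, num_sgd_iter):
    """Rough estimate: minibatches per pass times the number of passes."""
    return num_sgd_iter * math.ceil(train_batch_size / minibatch_size)


OBS_DIM, N_ACTIONS = 4, 2   # CartPole: 4 observation values, 2 discrete actions
TRAIN_BATCH_SIZE = 4000     # assumption, not taken from this notebook

print(mlp_param_count(OBS_DIM, [100, 100], N_ACTIONS))            # 10802
print(mlp_param_count(OBS_DIM, [20, 20], N_ACTIONS))              # 562
print(sgd_updates_per_train_call(TRAIN_BATCH_SIZE, 128, 30))      # 960
print(sgd_updates_per_train_call(TRAIN_BATCH_SIZE, 256, 10))      # 160
```

Under these assumptions, the smaller network has roughly 20 times fewer parameters, and each training iteration performs about 6 times fewer SGD updates.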

In [7]:
agent = PPOTrainer(config, 'CartPole-v0')

2020-06-13 09:09:06,050	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 09:09:06,151	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 09:09:09,143	INFO trainable.py:217 -- Getting current IP.


In [8]:
N=20                # was 5
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'],  
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'{n}: Min/Mean/Max reward: {result["episode_reward_min"]:7.4f}/{result["episode_reward_mean"]:7.4f}/{result["episode_reward_max"]:7.4f}')

0: Min/Mean/Max reward: 10.0000/21.6033/60.0000
1: Min/Mean/Max reward:  9.0000/23.2035/82.0000
2: Min/Mean/Max reward:  9.0000/26.6779/96.0000
3: Min/Mean/Max reward: 10.0000/31.4453/107.0000
4: Min/Mean/Max reward: 11.0000/32.5410/109.0000
5: Min/Mean/Max reward: 11.0000/39.5644/112.0000
6: Min/Mean/Max reward: 10.0000/42.1700/108.0000
7: Min/Mean/Max reward: 13.0000/45.3700/115.0000
8: Min/Mean/Max reward: 16.0000/49.3800/111.0000
9: Min/Mean/Max reward: 15.0000/52.7400/115.0000
10: Min/Mean/Max reward: 14.0000/61.7400/141.0000
11: Min/Mean/Max reward: 14.0000/65.7400/141.0000
12: Min/Mean/Max reward: 15.0000/66.6500/200.0000
13: Min/Mean/Max reward: 15.0000/82.0900/200.0000
14: Min/Mean/Max reward: 14.0000/96.3500/200.0000
15: Min/Mean/Max reward: 14.0000/114.2600/200.0000
16: Min/Mean/Max reward: 14.0000/126.1600/200.0000
17: Min/Mean/Max reward: 14.0000/141.5400/200.0000
18: Min/Mean/Max reward: 20.0000/150.1000/200.0000
19: Min/Mean/Max reward: 53.0000/157.5700/200.0000
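
Rather than scanning the log by eye, you can find the first iteration at which a metric reaches a target. The sketch below copies the max and mean rewards from the output above into plain lists. The helper `first_iteration_reaching` is a hypothetical name, not part of RLlib.

```python
# Max and mean rewards per training iteration, copied from the output above.
max_rewards = [60.0, 82.0, 96.0, 107.0, 109.0, 112.0, 108.0, 115.0, 111.0, 115.0,
               141.0, 141.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]
mean_rewards = [21.6033, 23.2035, 26.6779, 31.4453, 32.5410, 39.5644, 42.1700,
                45.3700, 49.3800, 52.7400, 61.7400, 65.7400, 66.6500, 82.0900,
                96.3500, 114.2600, 126.1600, 141.5400, 150.1000, 157.5700]


def first_iteration_reaching(values, target):
    """Return the index of the first value >= target, or None if never reached."""
    return next((n for n, v in enumerate(values) if v >= target), None)


print(first_iteration_reaching(max_rewards, 200))   # 12
print(first_iteration_reaching(mean_rewards, 100))  # 15
print(first_iteration_reaching(mean_rewards, 200))  # None
```

In this run, the max reward first reaches 200 at `n=12`, well past the 5 iterations used in the lesson.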


In [9]:
df = pd.DataFrame(data=episode_data)
df

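|    | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---:|--:|-------------------:|--------------------:|-------------------:|-----------------:|
|  0 | 0 |               10.0 |           21.603261 |               60.0 |        21.603261 |
|  1 | 1 |                9.0 |           23.203488 |               82.0 |        23.203488 |
|  2 | 2 |                9.0 |           26.677852 |               96.0 |        26.677852 |
|  3 | 3 |               10.0 |           31.445312 |              107.0 |        31.445312 |
|  4 | 4 |               11.0 |           32.540984 |              109.0 |        32.540984 |
|  5 | 5 |               11.0 |           39.564356 |              112.0 |        39.564356 |
|  6 | 6 |               10.0 |               42.17 |              108.0 |            42.17 |
|  7 | 7 |               13.0 |               45.37 |              115.0 |            45.37 |
|  8 | 8 |               16.0 |               49.38 |              111.0 |            49.38 |
|  9 | 9 |               15.0 |               52.74 |              115.0 |            52.74 |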


In [11]:
import sys
sys.path.append("../..")
from util.line_plots import plot_line, plot_line_with_min_max

import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [14]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                      title='Episode Rewards', x_axis_label='n', y_axis_label='reward')

![](../../images/rllib/Cart-Pole-Episode-Rewards-Exercise.png)

Compare this graph with the graph in the lesson, where we used a stronger network:

![](../../images/rllib/Cart-Pole-Episode-Rewards.png)

Note that we only used 5 training iterations before. If you compare the graphs at n=4, you can see that this exercise solution trains more slowly at first, but after n=10 the mean reward grows quickly.
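
The change in pace can also be put into numbers. This sketch compares the average gain in mean reward per iteration over the first ten iterations with the gain over the last ten, using the mean rewards printed by the training loop in this notebook. The helper `average_gain` is a hypothetical name introduced for illustration.

```python
# Mean reward per training iteration, copied from the training output.
mean_rewards = [21.6033, 23.2035, 26.6779, 31.4453, 32.5410, 39.5644, 42.1700,
                45.3700, 49.3800, 52.7400, 61.7400, 65.7400, 66.6500, 82.0900,
                96.3500, 114.2600, 126.1600, 141.5400, 150.1000, 157.5700]


def average_gain(values, start, end):
    """Average increase per iteration between two iteration indices."""
    return (values[end] - values[start]) / (end - start)


early_gain = average_gain(mean_rewards, 0, 9)
late_gain = average_gain(mean_rewards, 10, 19)

print(f'n=0..9:   {early_gain:.2f} reward per iteration')   # 3.46
print(f'n=10..19: {late_gain:.2f} reward per iteration')    # 10.65
```

For this particular run, the mean reward climbs about three times faster in the second half than in the first.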

Try it again with slightly larger and/or smaller neural network layers.
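
If you try several layer sizes in a row, be careful with how each configuration is copied. Python's `dict.copy()` is shallow, so the nested `model` dictionary would be shared between copies, and a change made through one copy shows up in the others. The sketch below uses a hypothetical stand-in dictionary, `BASE_CONFIG`, instead of RLlib's real `DEFAULT_CONFIG`, and a hypothetical helper, `make_config`, to show one way of building independent configurations with `copy.deepcopy`. The candidate layer sizes are arbitrary examples.

```python
import copy

# Hypothetical stand-in for a trainer config, reduced to the keys used here.
BASE_CONFIG = {
    'num_sgd_iter': 30,
    'model': {'fcnet_hiddens': [100, 100]},
}


def make_config(base, hiddens):
    """Return an independent copy of `base` with new hidden layer sizes."""
    config = copy.deepcopy(base)
    config['model']['fcnet_hiddens'] = list(hiddens)
    return config


# Arbitrary example sizes around the [20, 20] used in the solution.
candidates = [[10, 10], [20, 20], [40, 40]]
configs = [make_config(BASE_CONFIG, hiddens) for hiddens in candidates]

# A shallow copy shares the nested 'model' dict with the original.
shallow = BASE_CONFIG.copy()
shallow_shares_model = shallow['model'] is BASE_CONFIG['model']

print([c['model']['fcnet_hiddens'] for c in configs])
print(BASE_CONFIG['model']['fcnet_hiddens'])  # still [100, 100]
print(shallow_shares_model)                   # True
```

Each entry in `configs` could then be passed to its own trainer, so the runs do not influence each other through a shared dictionary.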