# Ray RLlib Tutorial - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This notebook contains the solutions for all the exercises in the RLlib tutorial.

First, we have to set up everything needed from the other notebooks.

In [1]:
import gym
import numpy as np
import pandas as pd
import json, sys, os

## 01 Introduction to Reinforcement Learning

### Exercise 1

Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

In [2]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


In [3]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose an action by applying the policy to the current state.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
        
    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

The first sample policy got an average reward of 9.47.
The second sample policy got an average reward of 28.99.
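
The contract that `rollout_policy` relies on (reset once, then step until `done`, choosing each action from the current state) can be checked without gym. The following is a minimal sketch that uses a hypothetical `ParityEnv`, which is not part of the tutorial: the episode survives only while the action matches the parity of the state, so a policy that ignores the state fails quickly.

```python
class ParityEnv:
    """Hypothetical toy environment (not from the tutorial).

    The state is a step counter. An action is 'correct' when it equals
    state % 2; the episode ends on the first wrong action or after `limit` steps.
    """

    def __init__(self, limit=10):
        self.limit = limit
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        correct = action == self.state % 2
        self.state += 1
        reward = 1.0 if correct else 0.0
        done = (not correct) or self.state >= self.limit
        return self.state, reward, done, {}


def rollout_policy(env, policy):
    # Same loop as the exercise solution: the action comes from the policy.
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:
        action = policy(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward


def parity_policy(state):
    return state % 2   # depends on the state


def constant_policy(state):
    return 0           # ignores the state


print('parity policy:  ', rollout_policy(ParityEnv(), parity_policy))
print('constant policy:', rollout_policy(ParityEnv(), constant_policy))
```

If `rollout_policy` accidentally ignored the policy and sampled random actions, the two policies would score about the same, which is what the asserts in the exercise are designed to catch.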


### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)
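
To get a feel for what "smaller network" means here, the sketch below counts the weights and biases of a plain fully connected network for CartPole (4 observation values, 2 actions). The helper `mlp_param_count` is hypothetical and counts a single MLP only; the model RLlib actually builds may add more parameters (for example a separate value-function branch), so treat the numbers as a rough comparison rather than the exact model size.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim):
    """Weights plus biases of a plain fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


OBS_DIM = 4       # CartPole observation: 4 floats
NUM_ACTIONS = 2   # CartPole actions: push left or push right

for hiddens in ([100, 100], [20, 20]):
    count = mlp_param_count(OBS_DIM, hiddens, NUM_ACTIONS)
    print(f'fcnet_hiddens={hiddens}: about {count} parameters')
```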

In [4]:
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [5]:
!../../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [6]:
ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False)

{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:6379',
 'object_store_address': '/tmp/ray/session_2020-06-26_05-43-51_233970_89279/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-26_05-43-51_233970_89279/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-26_05-43-51_233970_89279'}

Here's one possible set of configuration changes. With them, it takes longer for the max reward to reach 200, so the number of training iterations `N` is increased (to 20 in the training cell below).

In [7]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10                       # was 30
config['sgd_minibatch_size'] = 256                # was 128
config['model']['fcnet_hiddens'] = [20, 20]       # was [100, 100]
config['num_cpus_per_worker'] = 0
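
As a rough illustration of why these two settings speed up the optimization, the sketch below estimates the number of minibatch SGD updates per training iteration. It assumes a `train_batch_size` of 4000 (believed to be the PPO default in this RLlib version, but check your `DEFAULT_CONFIG`) and approximates the minibatch count with integer division; the helper name `approx_sgd_updates` is hypothetical.

```python
def approx_sgd_updates(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of minibatch updates in one training iteration."""
    minibatches_per_pass = train_batch_size // sgd_minibatch_size
    return minibatches_per_pass * num_sgd_iter


TRAIN_BATCH_SIZE = 4000  # assumption: PPO default, not set explicitly above

before = approx_sgd_updates(TRAIN_BATCH_SIZE, sgd_minibatch_size=128, num_sgd_iter=30)
after = approx_sgd_updates(TRAIN_BATCH_SIZE, sgd_minibatch_size=256, num_sgd_iter=10)
print(f'before: ~{before} updates per iteration, after: ~{after}')
```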

In [8]:
agent = PPOTrainer(config, 'CartPole-v0')

2020-06-26 12:12:15,197	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-06-26 12:12:15,198	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


In [9]:
N=20                # was 10
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'],  
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}')

  0: Min/Mean/Max reward:   8.0000/ 22.4324/122.0000
  1: Min/Mean/Max reward:   8.0000/ 25.0898/ 96.0000
  2: Min/Mean/Max reward:  10.0000/ 28.5608/112.0000
  3: Min/Mean/Max reward:  10.0000/ 30.7687/107.0000
  4: Min/Mean/Max reward:   9.0000/ 40.3846/113.0000
  5: Min/Mean/Max reward:  11.0000/ 41.8235/125.0000
  6: Min/Mean/Max reward:  11.0000/ 46.9400/144.0000
  7: Min/Mean/Max reward:  14.0000/ 52.3300/149.0000
  8: Min/Mean/Max reward:  14.0000/ 56.0000/138.0000
  9: Min/Mean/Max reward:  15.0000/ 66.2300/162.0000
 10: Min/Mean/Max reward:  13.0000/ 75.2900/193.0000
 11: Min/Mean/Max reward:  13.0000/ 80.9700/193.0000
 12: Min/Mean/Max reward:  20.0000/ 86.6300/178.0000
 13: Min/Mean/Max reward:  20.0000/ 88.3700/164.0000
 14: Min/Mean/Max reward:  20.0000/ 94.1800/200.0000
 15: Min/Mean/Max reward:  22.0000/103.9200/200.0000
 16: Min/Mean/Max reward:  28.0000/113.0500/200.0000
 17: Min/Mean/Max reward:  28.0000/112.7100/200.0000
 18: Min/Mean/Max reward:  24.0000/113.8200/20

In [10]:
df = pd.DataFrame(data=episode_data)
df

| | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | 8.0 | 22.432432 | 122.0 | 22.432432 |
| 1 | 1 | 8.0 | 25.08982 | 96.0 | 25.08982 |
| 2 | 2 | 10.0 | 28.560811 | 112.0 | 28.560811 |
| 3 | 3 | 10.0 | 30.768657 | 107.0 | 30.768657 |
| 4 | 4 | 9.0 | 40.384615 | 113.0 | 40.384615 |
| 5 | 5 | 11.0 | 41.823529 | 125.0 | 41.823529 |
| 6 | 6 | 11.0 | 46.94 | 144.0 | 46.94 |
| 7 | 7 | 14.0 | 52.33 | 149.0 | 52.33 |
| 8 | 8 | 14.0 | 56.0 | 138.0 | 56.0 |
| 9 | 9 | 15.0 | 66.23 | 162.0 | 66.23 |
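
To pin down when the max reward first reaches 200, the list of per-iteration dictionaries can be scanned directly. The sketch below uses a hypothetical helper, `first_iteration_reaching`, and a few rows copied from the training log printed above; in the notebook you would pass `episode_data` instead.

```python
def first_iteration_reaching(episodes, key, threshold):
    """Return the 'n' of the first episode dict whose `key` is >= threshold, or None."""
    for episode in episodes:
        if episode[key] >= threshold:
            return episode['n']
    return None


# Rows 12 through 15 of the training log above (max reward only).
sample = [
    {'n': 12, 'episode_reward_max': 178.0},
    {'n': 13, 'episode_reward_max': 164.0},
    {'n': 14, 'episode_reward_max': 200.0},
    {'n': 15, 'episode_reward_max': 200.0},
]

print(first_iteration_reaching(sample, 'episode_reward_max', 200))
```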


In [11]:
import sys
sys.path.append("../../..")
from util.line_plots import plot_line, plot_line_with_min_max

import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [12]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                      title='Episode Rewards', x_axis_label='n', y_axis_label='reward')

![](../../../images/rllib/Cart-Pole-Episode-Rewards-Exercise.png)

Compare this graph with the graph in the lesson, where we used a stronger network:

![](../../../images/rllib/Cart-Pole-Episode-Rewards.png)

Note that we only used 5 episodes before. If you compare the graphs at n=4, you see that this exercise solution is training more slowly, but after n=10, the mean reward grows quickly.

Try it again with slightly larger and/or smaller neural network layers.

## 05: Custom Environments and Reward Shaping

### Exercise 1: A Custom Environment with Rewards

Now we'll create an `n-Chain` environment, which represents moves along a linear chain of states, with two actions:

- `0` (**forward**): moves along the chain but returns no reward
- `1` (**backward**): returns to the beginning and gives a small reward

The end of the chain, however, provides a large reward, and by moving **forward** at the end of the chain, this large reward can be repeated.

#### Step 1: Implement `ChainEnv._setup_spaces`

Use a `spaces.Discrete` action space and observation space. Implement `ChainEnv._setup_spaces` in `ChainEnv` so that `self.action_space` and `self.observation_space` are proper gym spaces.

1. The observation space is an integer in the range `[0, n-1]`.
2. The action space is an integer in `[0, 1]`.

For example:

```python
self.action_space = spaces.Discrete(2)
self.observation_space = ...
```

You should see a message indicating tests passing when done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of `(state, reward, done, info)`. Right now, the reward is always 0. Modify `step()` so that the following rewards are returned for the given actions: 

1. `action == 1` will return `self.small_reward`.
2. `action == 0` will return 0 if `self.state < self.n - 1`.
3. `action == 0` will return `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating tests passing when done correctly. 

In [13]:
sys.path.append('..')
from test_exercises import test_chain_env_spaces, test_chain_env_reward, test_chain_env_behavior
from gym import spaces

In [14]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO: Implement this so that it passes tests
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.small_reward
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = 0
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.large_reward
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_chain_env_spaces(ChainEnv)
test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...
Success! You've setup the spaces correctly.
Testing if reward has been setup correctly...
Success! You've setup the rewards correctly.
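
With these rewards in place, it is worth checking by hand why a learned policy tends to stick to state 0. The sketch below replays the `ChainEnv` dynamics in plain Python with the default settings from the cell above (`n=20`, small reward 2, large reward 10, horizon `n`) and compares two fixed policies. The helper `chain_return` is hypothetical and only mirrors the logic of `step()`.

```python
def chain_return(policy, n=20, small=2, large=10, horizon=None):
    """Total reward of one episode of the chain under a fixed policy."""
    horizon = n if horizon is None else horizon
    state, total = 0, 0
    for _ in range(horizon):
        action = policy(state)
        if action == 1:          # backward: small reward, return to the start
            total += small
            state = 0
        elif state < n - 1:      # forward: no reward, move one state up
            state += 1
        else:                    # forward at the end: large reward, stay put
            total += large
    return total


def always_backward(state):
    return 1


def always_forward(state):
    return 0


print('always backward:', chain_return(always_backward))
print('always forward: ', chain_return(always_forward))
```

With a horizon equal to `n`, walking the whole chain leaves time for only one large reward, so repeatedly taking the small reward pays more. This is the incentive that the next exercise tries to change.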


### Exercise 2: Improve the Policy

Modify `ShapedChainEnv.step()` in the next cell to provide a reward that encourages the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment (the action -> state behavior should be the same).

You can change the reward to be whatever you wish. We'll test it in the next section.

### Evaluate `ShapedChainEnv` by Running the Cell(s) Below

This trains PPO on the new env and counts the number of states seen.

First, we'll set up things we need from the lesson notebook.

In [15]:
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

In [16]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

Now here's one solution, where the reward calculations are the only difference from the previous implementation of `step`. This problem is actually difficult to solve, because it's hard to encourage exploration with just the reward alone. 

The key is to penalize action 1 (go back to the beginning), because you always get a small reward if you stay there, so there's a temptation to exploit that action and keep accruing the small reward until you hit the goal. Hence, this solution sets the reward for action 1 to zero and instead rewards action 0 when `self.state < self.n - 1` (i.e., not at the right-hand end). The cell below uses a small random reward, `np.random.choice(range(0, self.small_reward))`. One variation is `self.small_reward * self.state`, so that the reward grows as we move to the right. We tried many variations and most returned similar results that were not especially satisfactory.
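
As a rough check of the position-scaled idea mentioned above (a reward of `small_reward * state` for a forward move, which is not the random reward used in the `ShapedChainEnv` cell), the sketch below replays the chain in plain Python with a horizon of `2 * n`, as in that cell. The helper `shaped_return` is hypothetical and assumes the default `n=20`, small reward 2, and large reward 10.

```python
def shaped_return(policy, n=20, small=2, large=10):
    """Episode return under a position-scaled forward reward (illustrative only)."""
    state, total = 0, 0
    for _ in range(2 * n):       # horizon of 2 * n steps
        if policy(state) == 1:   # backward: no reward, return to the start
            state = 0
        elif state < n - 1:      # forward: reward grows with the position
            total += small * state
            state += 1
        else:                    # forward at the end: large reward
            total += large
    return total


def always_backward(state):
    return 1


def always_forward(state):
    return 0


print('always backward:', shaped_return(always_backward))
print('always forward: ', shaped_return(always_forward))
```

Under this shaping, the fixed "always forward" policy is clearly better than "always backward". Whether PPO actually finds it within 20 training iterations is a separate question, which is why the rollout check below can still fail.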

In [163]:
[np.random.choice(range(10,20)) for _ in range(10)]

[17, 11, 18, 15, 18, 14, 10, 11, 11, 15]

In [174]:
class ShapedChainEnv(ChainEnv):
    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning
            reward = 0
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = np.random.choice(range(0, self.small_reward))  
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.large_reward
        self._counter += 1
        done = self._counter >= 2*self._horizon
        return self.state, reward, done, {}

test_chain_env_behavior(ShapedChainEnv)

Testing if behavior has been changed...
Success! Behavior of environment is correct.


In [143]:
def do_training(chainEnvClass, iterations=20):
    trainer = PPOTrainer(trainer_config, chainEnvClass)
    print('Training iterations: ', end='')
    for i in range(iterations):
        print('.', end='')
        trainer.train()
    print('')
    # Return the trained agent so that later cells (e.g., do_rollout) evaluate this one.
    return trainer

In [175]:
trainer = do_training(ShapedChainEnv)



Training iterations: ....................


In [125]:
def do_rollout(chainEnvClass):
    env = chainEnvClass({})
    max_states = []
    cumulative_rewards = []
    for i in range(5):
        state = env.reset()
        done = False
        max_states.append(-1)
        cumulative_rewards.append(0)
        while not done:
            action = trainer.compute_action(state)
            state, reward, done, results = env.step(action)
            max_states[i] = max(max_states[i], state)
            cumulative_rewards[i] += reward
            # The "results" returned by env.step are empty.
            #print(f'state = {state:3d}, reward = {reward:6.3f}, cumulative_reward = {cumulative_rewards[i]:6.3f}, done = {str(done):5s}, max_state = {max_states[i]}') 

    print(f'Cumulative rewards: {cumulative_rewards}')
    print(f'Max states you visited are: {max_states}. The mean of this list is {np.mean(max_states)}. (There are {env.n} states.)')
    actual = np.mean(max_states) / env.n
    desired = 0.7
    print(f'This policy traversed on average {actual*100:4.1f}% of the available states.')
    assert actual > desired, f"{actual*100:4.1f}% is less than the desired percentage of {desired*100:4.1f}%."

In [181]:
do_rollout(ShapedChainEnv)

Cumulative rewards: [159, 107, 137, 141, 123]
Max states you visited are: [8, 5, 9, 4, 5]. The mean of this list is 6.2. (There are 20 states.)
This policy traversed on average 31.0% of the available states.


AssertionError: 31.0% is less than the desired percentage of 70.0%.

2020-06-27 11:29:03,585	ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-06-27 11:29:03,615	ERROR import_thread.py:93 -- ImportThread: Connection closed by server.


(If it fails to pass, rerun the previous cell.)
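
The pass criterion can be reproduced by hand from the printed output. The sketch below recomputes the traversal percentage for the `max_states` list shown above; `traversal_fraction` is a hypothetical helper that mirrors the calculation in `do_rollout`.

```python
from statistics import mean


def traversal_fraction(max_states, n):
    """Mean of the highest state index reached per episode, divided by n."""
    return mean(max_states) / n


observed = traversal_fraction([8, 5, 9, 4, 5], n=20)
best_possible = traversal_fraction([19] * 5, n=20)

print(f'observed:      {observed * 100:4.1f}%')
print(f'best possible: {best_possible * 100:4.1f}%')
```

Because states are indexed from 0, the highest reachable index is `n - 1`, so this metric tops out just below 100% even for a policy that always reaches the end of the chain.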

This may have taken you several tries. The key is to penalize action 1 (go back to the beginning), because you always get a small reward if you stay there, so there's a temptation to exploit that action and keep accruing the small reward until you hit the goal. Hence, this solution sets the reward for action 1 to zero and instead gives a small random reward, below `self.small_reward`, for action 0 when `self.state < self.n - 1` (i.e., not at the right-hand end).

What does it take to visit all states? This variant only stops when it has successfully visited all states.

In [105]:
class ShapedChainEnvVisited(ChainEnv):

    def __init__(self, env_config = None):
        super().__init__(env_config)
        self.visited = set()
        
    def step(self, action):
        assert self.action_space.contains(action)
        self.visited.add(self.state)
        if action == 1:  # 'backwards': go back to the beginning
            reward = 0
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = self.calc_reward_for_action_0()
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.large_reward
        self._counter += 1
        done = len(self.visited) == self.n
        return self.state, reward, done, {}

    def calc_reward_for_action_0(self):
        return self.state * (self.large_reward - self.small_reward) / self.n


test_chain_env_behavior(ShapedChainEnv)

Testing if behavior has been changed...
Success! Behavior of environment is correct.
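
To see what `calc_reward_for_action_0` pays out, the sketch below evaluates the same formula in plain Python, assuming the default `n=20`, small reward 2, and large reward 10. The standalone function `forward_reward` is hypothetical; it only restates the expression from the class above.

```python
def forward_reward(state, n=20, small=2, large=10):
    """Reward for a forward move taken from `state` (before the move)."""
    return state * (large - small) / n


for state in (0, 1, 9, 18):
    print(f'forward from state {state:2d}: reward {forward_reward(state):.1f}')

# Total shaped reward for one uninterrupted walk from state 0 to state 19.
full_walk = sum(forward_reward(s) for s in range(19))
print(f'one full walk: {full_walk:.1f}')
```

The reward rises linearly with the position and stays below `large_reward`, so collecting the large reward at the end of the chain remains the best single step.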


In [107]:
trainer = PPOTrainer(trainer_config, ShapedChainEnvVisited);
print(f'Training iterations: ', end='')
for i in range(20):
    print('.', end='')
    trainer.train()
print('')



Training iterations: ....................


It can take a while for the following to run.

In [111]:
env = ShapedChainEnvVisited({})

state = env.reset()
max_state = -1
cumulative_reward = 0
step_count = 0
done = False
while not done:
    step_count += 1
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward
    # The "results" returned by env.step are empty.
    #print(f'state = {state:3d}, reward = {reward:6.3f}, cumulative_reward = {cumulative_reward:6.3f}, done = {str(done):5s}, max_state = {max_state}') 

print(f'Cumulative reward: {cumulative_reward}')
print(f'Step count:        {step_count}')
print(f'Max state:         {max_state}')

Cumulative reward: 14211.199999998242
Step count:        45516
Max state:         19


This is a large number, but it is a bit misleading. Because actions are selected probabilistically, it's possible that some states "avoid" getting occupied for an unusually long time.
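
A simplified model gives a sense of scale. Reaching the last state requires 19 forward moves in a row, since any backward move resets the walk. If we assume, purely for illustration, that the policy moves forward with a fixed probability `p` regardless of the state (the trained policy does not behave this way), the expected number of steps until the first such run follows the standard formula for runs of successes in independent trials, `(p**-k - 1) / (1 - p)`. The helper name below is hypothetical.

```python
def expected_steps_for_run(p_forward, run_length):
    """Expected number of independent steps until `run_length` forward moves in a row."""
    if not 0.0 < p_forward < 1.0:
        raise ValueError('p_forward must be strictly between 0 and 1')
    return (p_forward ** -run_length - 1.0) / (1.0 - p_forward)


# 19 consecutive forward moves are needed to go from state 0 to state 19.
for p in (0.6, 0.7, 0.8, 0.9):
    steps = expected_steps_for_run(p, 19)
    print(f'p_forward = {p:.1f}: about {steps:,.0f} steps expected')
```

The expected step count is very sensitive to `p`, so a policy that is only moderately biased toward moving forward can need a very large number of steps before it happens to visit every state.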