# Ray RLlib - Explore RLlib - Custom Environments and Reward Shaping

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson demonstrates how to adapt your own problem to use [Ray RLlib](http://rllib.io).

We cover two important concepts: 

1. How to create your own _Markov Decision Process_ abstraction.
2. How to shape the reward of your environment to make your agent more effective. 

In [1]:
import numpy as np
import pandas as pd
import json, os, shutil, sys
import gym

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

In [2]:
sys.path.append('../..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

In [3]:
!../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [4]:
ray.init(address='auto', ignore_reinit_error=True)

{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:6379',
 'object_store_address': '/tmp/ray/session_2020-06-28_07-02-18_715649_66267/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-28_07-02-18_715649_66267/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-28_07-02-18_715649_66267'}

In [5]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


## Different Spaces

The first thing to do when formulating an RL problem is to specify the dimensions of your observation space and action space. Abstractions for these are provided in Gym. 

### Matching Different Actions to Their Corresponding Space

Let's familiarize ourselves with different Gym spaces. For example:

    discrete = spaces.Discrete(10)
    print("Random sample of this space: ", [discrete.sample() for i in range(4)])

Use `help(gym.spaces)` or `help([specific space])` (e.g., `help(gym.spaces.Discrete)`) for more info.

In [6]:
help(gym.spaces)

Help on package gym.spaces in gym:

NAME
    gym.spaces

PACKAGE CONTENTS
    box
    dict
    discrete
    multi_binary
    multi_discrete
    space
    tests (package)
    tuple
    utils

CLASSES
    builtins.object
        gym.spaces.space.Space
            gym.spaces.box.Box
            gym.spaces.dict.Dict
            gym.spaces.discrete.Discrete
            gym.spaces.multi_binary.MultiBinary
            gym.spaces.multi_discrete.MultiDiscrete
            gym.spaces.tuple.Tuple
    
    class Box(gym.spaces.space.Space)
     |  Box(low, high, shape=None, dtype=<class 'numpy.float32'>)
     |  
     |  A (possibly unbounded) box in R^n. Specifically, a Box represents the
     |  Cartesian product of n closed intervals. Each interval has the form of one
     |  of [a, b], (-oo, b], [a, oo), or (-oo, oo).
     |  
     |  There are two common use cases:
     |  
     |  * Identical bound for each dimension::
     |      >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
  

In [7]:
help(gym.spaces.Discrete)

Help on class Discrete in module gym.spaces.discrete:

class Discrete(gym.spaces.space.Space)
 |  Discrete(n)
 |  
 |  A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`. 
 |  
 |  Example::
 |  
 |      >>> Discrete(2)
 |  
 |  Method resolution order:
 |      Discrete
 |      gym.spaces.space.Space
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __init__(self, n)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  contains(self, x)
 |      Return boolean specifying if x is a valid
 |      member of this space
 |  
 |  sample(self)
 |      Randomly sample an element of this space. Can be 
 |      uniform or non-uniform sampling based on boundedness of space.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None
 |  
 |  ---------------------

Note the following example values in `action_space_examples` that correspond to the spaces declared in `action_space_map`.

In [8]:
from gym import spaces

action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,), dtype=np.float64),  # the dtype can be omitted.
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1), dtype=np.float64),
    "multi_discrete": spaces.MultiDiscrete([ 5, 2, 2, 4 ])
}

action_space_examples = {
    "discrete_10": 1,
    "box_1": np.array([0.89089584]),
    "box_3x1": np.array([[-1.2657754], [-1.6528835], [ 0.5982418]]),
    "multi_discrete": np.array([0, 0, 0, 2]),
}

for space_id, state in action_space_examples.items():
    assert action_space_map[space_id].contains(state), (f'Looks like {space_id} to {state} is matched incorrectly.')



Here's a space with 10 discrete values, 0 through 9, from which we sample and then update a counts map.

In [9]:
counts = {key:0 for key in range(10)}
counts

for i in range(200):
    key = spaces.Discrete(10).sample()
    counts[key] = counts[key] + 1
counts

{0: 15, 1: 24, 2: 16, 3: 19, 4: 21, 5: 24, 6: 22, 7: 11, 8: 27, 9: 21}

You can have more than one dimension of discrete (or continuous) values.

In [10]:
md = spaces.MultiDiscrete([ 5, 2, 2, 4 ])
[md.sample() for _ in range(20)]

[array([4, 1, 0, 2]),
 array([4, 0, 0, 0]),
 array([1, 1, 0, 1]),
 array([2, 0, 0, 0]),
 array([2, 0, 1, 0]),
 array([1, 0, 0, 2]),
 array([3, 0, 0, 0]),
 array([1, 0, 1, 2]),
 array([1, 0, 0, 2]),
 array([3, 0, 0, 0]),
 array([3, 1, 0, 0]),
 array([0, 0, 1, 3]),
 array([2, 1, 0, 0]),
 array([1, 0, 0, 1]),
 array([2, 0, 0, 0]),
 array([2, 0, 1, 1]),
 array([0, 0, 1, 3]),
 array([2, 1, 0, 2]),
 array([2, 0, 0, 2]),
 array([1, 1, 1, 0])]

Note that the values in each dimension of the discrete space are zero-based: a dimension declared with size `k` yields integers from 0 through `k-1`, inclusive. For example, in the samples shown, the first integer in each array is between 0 and 4.
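A quick way to confirm these per-dimension ranges is to draw many samples and look at the observed minimum and maximum of each column. The following is a minimal sketch, assuming the same `gym` version used in this lesson; the sample count of 1000 is arbitrary.

```python
import numpy as np
from gym import spaces

md = spaces.MultiDiscrete([5, 2, 2, 4])
samples = np.array([md.sample() for _ in range(1000)])

# Each column i only ever holds integers in 0 .. nvec[i] - 1.
print("per-dimension sizes:", md.nvec)
print("observed minimum:   ", samples.min(axis=0))
print("observed maximum:   ", samples.max(axis=0))

# The declared size itself is not a member of the space.
largest_valid = np.array([4, 1, 1, 3])
out_of_range = np.array([5, 1, 1, 3])   # first entry is one too large
print(md.contains(largest_valid))
print(md.contains(out_of_range))
```

The same `contains` check is what the `assert` in the earlier `action_space_map` cell relies on.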

In [11]:
box = spaces.Box(-2, 2, shape=(3,2), dtype=np.float64)
[box.sample() for _ in range(20)]

[array([[-0.21508895, -0.95937773],
        [-0.84855983,  0.24701504],
        [ 0.62692108,  1.8701334 ]]),
 array([[-0.65042381, -1.14259923],
        [-0.20224788, -0.72529317],
        [-1.95928477, -1.65749218]]),
 array([[-1.79506632,  1.06640788],
        [-0.13193177,  1.39794078],
        [ 0.31178606,  1.97095426]]),
 array([[-0.97017207,  0.29517695],
        [-1.07587172,  1.34458648],
        [-0.55446192, -0.80038273]]),
 array([[-1.55730893,  1.73353339],
        [ 0.83212828, -0.5899059 ],
        [ 0.21135169,  0.5654232 ]]),
 array([[ 0.7636343 , -1.87491873],
        [ 0.58766069,  0.69793298],
        [-1.44319028, -0.32817514]]),
 array([[ 1.57448007,  0.21163613],
        [ 0.83482313, -0.9489593 ],
        [-0.23429545,  1.58978036]]),
 array([[ 1.71631837, -1.38578722],
        [ 1.73922473,  1.36287033],
        [-1.69934275, -0.1863998 ]]),
 array([[ 0.93703627,  1.85544766],
        [ 0.98367903, -1.5482977 ],
        [-1.23829554, -0.72698416]]),
 array([[ 

### Exercise 1: A Custom Environment with Rewards

Now we'll create an `n-Chain` environment, which represents moves along a linear chain of states, with two actions:

* (0) **forward**: moves one step along the chain but returns no reward
* (1) **backward**: returns to the beginning and earns a small reward

The end of the chain, however, provides a large reward, and by moving **forward** at the end of the chain, this large reward can be repeated.

#### Step 1: Implement `ChainEnv._setup_spaces`

Use a `spaces.Discrete` action space and observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper gym spaces.
  
1. The observation space is an integer in the range `[0, n-1]`.
2. The action space is an integer in `[0, 1]`.

For example:

```python
self.action_space = spaces.Discrete(2)
self.observation_space = ...
```

You should see a message indicating tests passing when done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of `(state, reward, done, info)`. Right now, the reward is always the placeholder value `-1`. Modify `step()` so that the following rewards are returned for the given actions: 

1. `action == 1` will return `self.small_reward`.
2. `action == 0` will return 0 if `self.state < self.n - 1`.
3. `action == 0` will return `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating tests passing when done correctly. 

In [12]:
from test_exercises import test_chain_env_spaces, test_chain_env_reward, test_chain_env_behavior
from gym import spaces

In [13]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO 1: Implement this so that it passes tests
        self.action_space = None
        self.observation_space = None
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_chain_env_spaces(ChainEnv)
test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...


AssertionError: Action Space not implemented!

### Train a Policy on the Environment 

Now we'll train a policy on the environment and evaluate the policy. You'll see that despite an extremely high reward, the policy has barely explored the state space. 

To proceed, we'll import a reference implementation of the previous exercise. Once you have completed the exercise yourself, comment out the next cell so that your own `ChainEnv` is used instead!

In [14]:
from chain_env import ChainEnv

In [15]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

In [16]:
def do_training(chainEnvClass, config = trainer_config, iterations=20):
    trainer = PPOTrainer(config, chainEnvClass)
    print('Training iterations: ', end='')
    for i in range(iterations):
        print('.', end='')
        trainer.train()
    print('')
    return trainer

In [17]:
trainer = do_training(ChainEnv, config=trainer_config, iterations=20)

2020-06-28 07:57:58,270	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-06-28 07:57:58,271	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


...................


In [18]:
env = ChainEnv({})
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print(f'Cumulative reward you received is: {cumulative_reward}. Congratulations!')
print(f'Max state you visited is: {max_state}. This is out of {env.n} states.')

Cumulative reward you received is: 40. Congratulations!
Max state you visited is: 0. This is out of 20 states.


The policy visited only a small fraction of the states. In this run it never left state 0 (`max_state == 0`); in other runs it may reach state 1.
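One way to see why the trained policy stays at state 0 is to compare the return of two fixed policies by hand. The sketch below re-implements the chain dynamics in plain Python, assuming the defaults shown in the exercise cell (`n=20`, `small=2`, `large=10`, and an episode length equal to `n`); the imported `chain_env.ChainEnv` may differ. `episode_return` is a hypothetical helper, not part of RLlib or Gym.

```python
def episode_return(policy, n=20, small=2, large=10):
    """Roll out one episode of the chain and return the total reward.

    `policy` maps a state (int) to an action: 0 = forward, 1 = backward.
    The episode lasts `n` steps, mirroring `self._horizon = self.n`.
    """
    state, total = 0, 0
    for _ in range(n):
        action = policy(state)
        if action == 1:        # backward: small reward, return to start
            total += small
            state = 0
        elif state < n - 1:    # forward: no reward, advance one state
            state += 1
        else:                  # forward at the end: large reward
            total += large
    return total

always_backward = episode_return(lambda s: 1)
always_forward = episode_return(lambda s: 0)
print(always_backward)  # 40
print(always_forward)   # 10
```

Under these assumptions, always moving backward collects 40, which matches the cumulative reward printed above, while always moving forward reaches the end of the chain only on the last step and collects 10. Within a single episode the large reward is too far away to pay off.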

## Shaping the Reward to Encourage Desired Behavior

We see that despite an extremely high reward, the policy has barely explored the state space. This situation is common: a reward designed to encourage a particular solution can instead make some other, unintended behavior more attractive.
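One common shaping idea is to pay a small bonus the first time each state is reached, so that forward moves are rewarded before the end of the chain. The following is a minimal sketch of that idea in plain Python, not the intended solution to the exercise; `shaped_return` and the `bonus` value are hypothetical, and the defaults (`n=20`, `small=2`, `large=10`, 20 steps per rollout) are assumed from the earlier exercise cell.

```python
def shaped_return(policy, n=20, small=2, large=10, bonus=5, steps=20):
    """Total reward for one rollout with a first-visit bonus added.

    The state transitions are unchanged; only the reward differs.
    `policy` maps a state to an action: 0 = forward, 1 = backward.
    """
    state, total = 0, 0
    visited = {0}                 # the start state counts as already seen
    for _ in range(steps):
        if policy(state) == 1:    # backward: small reward, return to start
            reward, state = small, 0
        elif state < n - 1:       # forward: advance one state
            reward, state = 0, state + 1
        else:                     # forward at the end: large reward
            reward = large
        if state not in visited:  # one-time bonus for a newly reached state
            reward += bonus
            visited.add(state)
        total += reward
    return total

print(shaped_return(lambda s: 1))  # 40: no new states, so no bonus
print(shaped_return(lambda s: 0))  # 105: 19 bonuses of 5, plus 10 at the end
```

With this bonus the always-forward rollout now earns more than the always-backward one. Note that a bonus based on visit history is not a function of the current observation alone, so treat this as an illustration of the trade-off rather than a recommendation.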

### Exercise 2: Improve the Policy

Modify `ShapedChainEnvVisited.step()` in the next cell to return rewards that encourage the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment. That is, the action -> state behavior should be the same. You can change the reward to be whatever you wish. We'll test it in the next section.

This implementation also adds an attribute, `done_percentage` (hard-coded to `0.5` in the constructor), which specifies what fraction of the states, between `0.0` and `1.0`, must be visited before `done` is reached. Play with this number when you modify the rewards to gain a sense of how long it takes to explore the state space. Note that there is a safety limit: the episode stops once the step counter exceeds `10*env.n`, even if the required fraction of states hasn't been visited. As the code exists in the following cell, it will always hit this limit!
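To get a feel for these two stopping conditions, the arithmetic can be worked through in plain Python. This sketch assumes the defaults used in this lesson (`n=20`, a threshold of `0.5`) and a policy that always moves forward; `steps_until_done` is a hypothetical helper that only mirrors the bookkeeping, not the environment itself.

```python
def steps_until_done(n=20, done_percentage=0.5):
    """Count steps until enough distinct states have been visited,
    for a policy that always moves forward. Returns None if the
    safety limit is exceeded first."""
    done_n = done_percentage * n       # number of distinct states required
    safety_limit = 10 * n              # step budget before giving up
    visited, state, counter = set(), 0, 0
    while True:
        visited.add(state)             # the current state is recorded first
        state = min(state + 1, n - 1)  # then the agent moves forward
        counter += 1
        if len(visited) >= done_n:
            return counter
        if counter > safety_limit:
            return None

print(steps_until_done())                     # 10
print(steps_until_done(done_percentage=1.0))  # 20
```

By contrast, a policy that always moves backward only ever records state 0, which is 1/20 = 5% of the states, so it can never satisfy a 50% threshold and runs into the safety limit.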

In [34]:
class ShapedChainEnvVisited(ChainEnv):

    def __init__(self, env_config = None):
        super().__init__(env_config)
        self.visited = set()
        self.done_percentage = 0.5
        self.done_n = self.done_percentage * self.n
        
    def step(self, action):
        assert self.action_space.contains(action)
        self.visited.add(self.state)
        if action == 1:  # 'backwards': go back to the beginning
            reward = self.small_reward
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = 0
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.large_reward
        self._counter += 1
        done = len(self.visited) >= self.done_n
        if not done and self._counter > (self.n*10):
            done = True
            visited_per = (len(self.visited)*100.0)/self.n
            print(f'Stopping after {self.n*10} iterations. Visited {visited_per:6.2f}% of the states.')
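        return self.state, reward, done, {}

    def reset(self):
        # Clear the visited set so each episode tracks its own coverage.
        self.visited = set()
        return super().reset()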

test_chain_env_behavior(ShapedChainEnvVisited)

Testing if behavior has been changed...
Success! Behavior of environment is correct.


### Evaluate `ShapedChainEnvVisited` by Running the Cell(s) Below

This trains PPO on the new env and counts the number of states seen.

In [35]:
trainer = do_training(ShapedChainEnvVisited, config=trainer_config, iterations=20)



[2m[36m(pid=67715)[0m Stopping after 200 iterations. Visited  40.00% of the states.
.[2m[36m(pid=67715)[0m Stopping after 200 iterations. Visited  40.00% of the states.
..................


Let's see how long it takes to get to 50% (the value hard-coded for `done_percentage`). 

In [39]:
env = ShapedChainEnvVisited({})

state = env.reset()
done = False
max_state = -1
cumulative_reward = 0
while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print(f'Cumulative reward you received is: {cumulative_reward}!')
print(f'Max state you visited is: {max_state}. (There are {env.n} states.)')
desired = env.done_percentage
actual = (max_state+1)/env.n  # add one because of zero indexing
print(f"This policy traversed {actual*100:4.1f}% of the available states.")
assert actual > desired, f"{actual*100:4.1f}% is less than the desired percentage of {desired*100:4.1f}%."

Stopping after 200 iterations. Visited   5.00% of the states.
Cumulative reward you received is: 402!
Max state you visited is: 0. (There are 20 states.)
This policy traversed  5.0% of the available states.


AssertionError:  5.0% is less than the desired percentage of 50.0%.