# Ray RLlib - Custom Environments and Reward Shaping

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson demonstrates how to adapt your own problem to use [Ray RLlib](http://rllib.io).

We cover two important concepts: 

1. How to create your own _Markov Decision Process_ abstraction.
2. How to shape the reward of your environment to make your agent more effective.

In [None]:
import numpy as np
import pandas as pd
import json, os, shutil, sys
import gym
from gym import spaces  # `spaces` is used throughout the cells below

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

In [None]:
sys.path.append('..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

In [None]:
!../tools/start-ray.sh --check --verbose

In [None]:
ray.init(address='auto', ignore_reinit_error=True)

In [1]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

2020-05-05 08:33:11,667	INFO resource_spec.py:212 -- Starting Ray with 4.25 GiB memory available for workers and up to 2.13 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-05 08:33:12,001	INFO services.py:1148 -- View the Ray dashboard at localhost:8267


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:37100',
 'object_store_address': '/tmp/ray/session_2020-05-05_08-33-11_657093_49830/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-05_08-33-11_657093_49830/sockets/raylet',
 'webui_url': 'localhost:8267',
 'session_dir': '/tmp/ray/session_2020-05-05_08-33-11_657093_49830'}

## 1. Different Spaces

The first thing to do when formulating an RL problem is to specify the dimensions of your observation space and action space. Abstractions for these are provided in ``gym``. 

### **Exercise 1:** Match different actions to their corresponding space.

The purpose of this exercise is to familiarize you with different Gym spaces. For example:

    discrete = spaces.Discrete(10)
    print("Random sample of this space: ", [discrete.sample() for i in range(4)])

Use `help(spaces)` or `help([specific space])` (e.g., `help(spaces.Discrete)`) for more info.

In [2]:
help(spaces.Discrete)

Help on class Discrete in module gym.spaces.discrete:

class Discrete(gym.spaces.space.Space)
 |  Discrete(n)
 |  
 |  A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`. 
 |  
 |  Example::
 |  
 |      >>> Discrete(2)
 |  
 |  Method resolution order:
 |      Discrete
 |      gym.spaces.space.Space
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __init__(self, n)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  contains(self, x)
 |      Return boolean specifying if x is a valid
 |      member of this space
 |  
 |  sample(self)
 |      Randomly sample an element of this space. Can be 
 |      uniform or non-uniform sampling based on boundedness of space.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None
 |  
 |  ---------------------

Fix the keys of `action_space_jumble` so that each value belongs to the space stored under the same key in `action_space_map`.

In [5]:
action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,)),
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1)),
    "multi_discrete": spaces.MultiDiscrete([ 5, 2, 2, 4 ])
}

action_space_jumble = {
    "discrete_10": 1,
    "multi_discrete": np.array([0, 0, 0, 2]),
    "box_3x1": np.array([[-1.2657754], [-1.6528835], [ 0.5982418]]),
    "box_1": np.array([0.89089584]),
}


for space_id, state in action_space_jumble.items():
    assert action_space_map[space_id].contains(state), (
        "Looks like {} to {} is matched incorrectly.".format(space_id, state))
    
print("Success!")

Success!
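
To see why each value belongs to exactly one of these spaces, the membership test can be approximated in plain NumPy. This is only a sketch: the helper names `in_discrete`, `in_box`, and `in_multi_discrete` are hypothetical, and gym's own `contains` methods apply additional checks (for example on dtype) that are omitted here.

```python
import numpy as np

def in_discrete(x, n):
    """True if x is an integer in {0, ..., n-1}."""
    return isinstance(x, (int, np.integer)) and 0 <= x < n

def in_box(x, low, high, shape):
    """True if x has the given shape and every entry lies in [low, high]."""
    x = np.asarray(x)
    return x.shape == shape and bool(np.all((x >= low) & (x <= high)))

def in_multi_discrete(x, nvec):
    """True if x is an integer vector with 0 <= x[i] < nvec[i] for every i."""
    x = np.asarray(x)
    nvec = np.asarray(nvec)
    return (x.shape == nvec.shape
            and np.issubdtype(x.dtype, np.integer)
            and bool(np.all((x >= 0) & (x < nvec))))

sample = np.array([0, 0, 0, 2])
print(in_multi_discrete(sample, [5, 2, 2, 4]))  # four integers, each below its bound
print(in_box(sample, -2, 2, shape=(3, 1)))      # shape (4,) does not match (3, 1)
print(in_discrete(1, 10))                       # a single integer in 0..9
```

Shape is usually the quickest discriminator: a scalar can only fit `Discrete`, a length-4 integer vector only `MultiDiscrete`, and the two float arrays are told apart by their shapes `(1,)` and `(3, 1)`.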


In [15]:
# Tally 200 samples from Discrete(10); each value should appear roughly 20 times.
space = spaces.Discrete(10)
counts = {key: 0 for key in range(10)}
for _ in range(200):
    counts[space.sample()] += 1
counts

In [18]:
[spaces.MultiDiscrete([ 5, 2, 2, 4 ]).sample() for _ in range(20)]

[array([3, 0, 0, 1]),
 array([4, 0, 0, 3]),
 array([0, 1, 1, 0]),
 array([2, 1, 0, 1]),
 array([0, 0, 0, 0]),
 array([2, 1, 1, 2]),
 array([0, 0, 0, 3]),
 array([2, 1, 0, 2]),
 array([2, 0, 1, 0]),
 array([1, 1, 1, 3]),
 array([2, 0, 0, 3]),
 array([2, 0, 0, 1]),
 array([4, 0, 0, 2]),
 array([2, 1, 1, 1]),
 array([3, 1, 0, 0]),
 array([3, 0, 1, 3]),
 array([1, 0, 1, 0]),
 array([1, 1, 0, 2]),
 array([2, 1, 1, 1]),
 array([2, 0, 0, 1])]

## **Exercise 2**: Setting up a custom environment with rewards

We'll set up an `n-Chain` environment, which models moves along a linear chain of states, with two actions:

     (0) forward, which moves along the chain but returns no reward
     (1) backward, which returns to the beginning and has a small reward

The end of the chain, however, yields a large reward, and choosing 'forward' again while at the end of the chain collects this large reward repeatedly.

#### Step 1: Implement ``ChainEnv._setup_spaces``

We'll use `spaces.Discrete` for both the action space and the observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper gym spaces.

1. The observation space is an integer in ``[0, n-1]``.
2. The action space is an integer in ``[0, 1]``.

For example:

```python
    self.action_space = spaces.Discrete(2)
    self.observation_space = ...
```

You should see a message indicating tests passing when done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of ``(state, reward, done, info)``. Right now, every branch returns a placeholder reward of -1.

Implement it so that 

1. ``action == 1`` will return `self.small_reward`.
2. ``action == 0`` will return 0 if `self.state < self.n - 1`.
3. ``action == 0`` will return `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating tests passing when done correctly. 
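
The three rules above can be sketched as a standalone function, independent of gym, to check what they imply for a whole episode. The names `chain_step` and `episode_return` are hypothetical; the default values (`n=20`, small reward 2, large reward 10, episode length equal to `n`) are taken from the `ChainEnv` defaults in this lesson.

```python
def chain_step(state, action, n=20, small=2, large=10):
    """Return (next_state, reward) for one move, following the three rules."""
    if action == 1:        # backward: reset to the start, small reward
        return 0, small
    if state < n - 1:      # forward before the end: advance, no reward
        return state + 1, 0
    return state, large    # forward at the end: stay put, large reward

def episode_return(policy, n=20, horizon=20):
    """Sum the rewards of one episode of `horizon` steps under a fixed policy."""
    state, total = 0, 0
    for _ in range(horizon):
        state, reward = chain_step(state, policy(state), n=n)
        total += reward
    return total

always_backward = episode_return(lambda state: 1)
always_forward = episode_return(lambda state: 0)
print(always_backward, always_forward)  # prints: 40 10
```

With these defaults, always moving backward earns 20 × 2 = 40, while always moving forward spends 19 steps reaching the end and collects the large reward only once, for a total of 10. Under this reward, staying near state 0 is the better strategy, which is worth keeping in mind when evaluating the trained policy.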

In [None]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO: Implement this so that it passes tests
        self.action_space = None
        self.observation_space = None
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = -1
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_exercises.test_chain_env_spaces(ChainEnv)
test_exercises.test_chain_env_reward(ChainEnv)

### Let's now train a policy on the environment and evaluate this policy on our environment.

You'll see that despite an extremely high reward, the policy has barely explored the state space.

In [None]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

In [None]:
trainer = PPOTrainer(config=trainer_config, env=ChainEnv)
for i in range(20):
    print("Training iteration {}...".format(i))
    trainer.train()

In [None]:
env = ChainEnv({})
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print("Cumulative reward you've received is: {}. Congratulations!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(max_state, env.n))

## **Exercise 3**: Shaping the reward to encourage proper behavior

Despite a high reward, the trained policy barely explores the state space. This is a common situation: a reward designed to encourage a particular solution turns out to be suboptimal, and the behavior it produces is unintended.

#### Modify `ShapedChainEnv.step` to provide a reward that encourages the policy to traverse the chain (not just stick to 0). Do not change the behavior of the environment (the action -> state behavior should be the same).

You can change the reward to be whatever you wish.
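
As a starting point, here is one possible shaping, sketched in plain Python without gym. It is not the lesson's reference solution: the names `shaped_step` and `rollout` are hypothetical, and the bonus of +1 per forward step and penalty of -1 for moving backward are assumptions chosen only for illustration. The state transitions are unchanged; only the reward differs.

```python
def shaped_step(state, action, n=20, large=10.0):
    """Return (next_state, reward) with an assumed progress-based reward."""
    if action == 1:        # backward: reset to the start, penalized
        return 0, -1.0
    if state < n - 1:      # forward before the end: advance, progress bonus
        return state + 1, 1.0
    return state, large    # forward at the end: stay put, large reward

def rollout(policy, n=20, horizon=20):
    """Run one episode; return (total reward, highest state reached)."""
    state, total, max_state = 0, 0.0, 0
    for _ in range(horizon):
        state, reward = shaped_step(state, policy(state), n=n)
        total += reward
        max_state = max(max_state, state)
    return total, max_state

forward_return, forward_max = rollout(lambda state: 0)
backward_return, backward_max = rollout(lambda state: 1)
print(forward_return, forward_max)    # prints: 29.0 19
print(backward_return, backward_max)  # prints: -20.0 0
```

Under this assumed reward, always moving forward earns 19 × 1 + 10 = 29 and reaches the last state, while always moving backward earns -20, so traversing the chain becomes the more rewarding behavior. Whether PPO actually learns it within 20 training iterations still has to be checked by running the evaluation cell.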

In [None]:
class ShapedChainEnv(ChainEnv):
    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning
            reward = -1
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = -1
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = -1
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}
    
test_exercises.test_chain_env_behavior(ShapedChainEnv)

### Evaluate `ShapedChainEnv` by running the cell below.

This trains PPO on the new env and counts the number of states seen.

In [None]:
trainer = PPOTrainer(config=trainer_config, env=ShapedChainEnv)
for i in range(20):
    print("Training iteration {}...".format(i))
    trainer.train()

env = ShapedChainEnv({})

max_states = []

for i in range(5):
    state = env.reset()
    done = False
    max_state = -1
    cumulative_reward = 0
    while not done:
        action = trainer.compute_action(state)
        state, reward, done, results = env.step(action)
        max_state = max(max_state, state)
        cumulative_reward += reward
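    max_states.append(max_state)

# cumulative_reward is reset every episode, so this reports the final episode only.
print("Cumulative reward in the final episode is: {}!".format(cumulative_reward))
print("Mean max state visited over {} episodes is: {}. This is out of {} states.".format(len(max_states), np.mean(max_states), env.n))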
assert (env.n - np.mean(max_states)) / env.n < 0.2, "This policy did not traverse many states."