# Ray RLlib - Explore RLlib - Custom Environments and Reward Shaping

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson demonstrates how to adapt your own problem for use with [Ray RLlib](http://rllib.io).

We cover two important concepts:

1. How to create your own _Markov Decision Process_ abstraction.
2. How to shape the reward of your environment to make your agent more effective.

In [1]:
import numpy as np
import pandas as pd
import json, os, shutil, sys
import gym

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

In [2]:
sys.path.append('../..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

In [3]:
ray.init(ignore_reinit_error=True)

2020-09-27 12:30:12,207	INFO resource_spec.py:212 -- Starting Ray with 85.16 GiB memory available for workers and up to 40.5 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-27 12:30:12,656	INFO services.py:1165 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '203.237.46.206',
 'raylet_ip_address': '203.237.46.206',
 'redis_address': '203.237.46.206:6379',
 'object_store_address': '/tmp/ray/session_2020-09-27_12-30-12_205959_280453/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-09-27_12-30-12_205959_280453/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-09-27_12-30-12_205959_280453'}

In [4]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


## Different Spaces

The first step in formulating an RL problem is to specify the dimensions of the observation space and the action space. Gym provides abstractions for both, and the examples below use them.

### Matching Different Actions to Their Corresponding Space

Let's familiarize ourselves with the different Gym spaces. For example:

    from gym import spaces

    discrete = spaces.Discrete(10)
    print("Random sample of this space: ", [discrete.sample() for i in range(4)])

Use `help(gym.spaces)` or `help([specific space])` (e.g., `help(gym.spaces.Discrete)`) for more information.

In [5]:
help(gym.spaces)

Help on package gym.spaces in gym:

NAME
    gym.spaces

PACKAGE CONTENTS
    box
    dict
    discrete
    multi_binary
    multi_discrete
    space
    tests (package)
    tuple
    utils

CLASSES
    builtins.object
        gym.spaces.space.Space
            gym.spaces.box.Box
            gym.spaces.dict.Dict
            gym.spaces.discrete.Discrete
            gym.spaces.multi_binary.MultiBinary
            gym.spaces.multi_discrete.MultiDiscrete
            gym.spaces.tuple.Tuple
    
    class Box(gym.spaces.space.Space)
     |  A (possibly unbounded) box in R^n. Specifically, a Box represents the
     |  Cartesian product of n closed intervals. Each interval has the form of one
     |  of [a, b], (-oo, b], [a, oo), or (-oo, oo).
     |  
     |  There are two common use cases:
     |  
     |  * Identical bound for each dimension::
     |      >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
     |      Box(3, 4)
     |      
     |  * Independent bound for each dimen

In [6]:
help(gym.spaces.Discrete)

Help on class Discrete in module gym.spaces.discrete:

class Discrete(gym.spaces.space.Space)
 |  A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`. 
 |  
 |  Example::
 |  
 |      >>> Discrete(2)
 |  
 |  Method resolution order:
 |      Discrete
 |      gym.spaces.space.Space
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __init__(self, n)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  contains(self, x)
 |      Return boolean specifying if x is a valid
 |      member of this space
 |  
 |  sample(self)
 |      Randomly sample an element of this space. Can be 
 |      uniform or non-uniform sampling based on boundedness of space.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None
 |  
 |  ------------------------------------------

Note how each example value in `action_space_examples` belongs to the space declared under the same key in `action_space_map`. The assertion at the end of the cell checks this with `contains()`.

In [7]:
from gym import spaces

action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,), dtype=np.float64),  # the dtype can be omitted.
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1), dtype=np.float64),
    "multi_discrete": spaces.MultiDiscrete([ 5, 2, 2, 4 ])
}

action_space_examples = {
    "discrete_10": 1,
    "box_1": np.array([0.89089584]),
    "box_3x1": np.array([[-1.2657754], [-1.6528835], [ 0.5982418]]),
    "multi_discrete": np.array([0, 0, 0, 2]),
}

for space_id, state in action_space_examples.items():
    assert action_space_map[space_id].contains(state), (f'Looks like {space_id} to {state} is matched incorrectly.')



Here is a space with 10 discrete values, 0 through 9. We sample from it 200 times and count how often each value appears.

In [8]:
counts = {key:0 for key in range(10)}
counts

for i in range(200):
    key = spaces.Discrete(10).sample()
    counts[key] = counts[key] + 1
counts

{0: 19, 1: 27, 2: 13, 3: 24, 4: 15, 5: 23, 6: 22, 7: 25, 8: 17, 9: 15}

A space can also have more than one dimension of discrete (or continuous) values.

In [9]:
md = spaces.MultiDiscrete([ 5, 2, 2, 4 ])
[md.sample() for _ in range(20)]

[array([4, 0, 1, 0]),
 array([1, 0, 1, 2]),
 array([3, 0, 1, 3]),
 array([3, 0, 0, 0]),
 array([2, 1, 1, 0]),
 array([1, 1, 1, 2]),
 array([2, 1, 1, 1]),
 array([1, 0, 1, 0]),
 array([2, 1, 0, 0]),
 array([4, 0, 1, 1]),
 array([3, 1, 1, 2]),
 array([1, 0, 0, 0]),
 array([3, 1, 0, 1]),
 array([0, 0, 0, 3]),
 array([0, 1, 1, 3]),
 array([2, 1, 1, 2]),
 array([2, 1, 1, 0]),
 array([2, 1, 0, 0]),
 array([2, 1, 0, 3]),
 array([3, 0, 0, 3])]

Note that the values in each dimension of a `MultiDiscrete` space are zero-offset: a dimension of size `k` yields integers from 0 to `k-1`, inclusive. For example, in the samples shown, the first integer in each array is between 0 and 4, inclusive. Each dimension is sampled independently of the others.
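
To make the zero-offset ranges concrete, the following sketch lists the valid values for each dimension of the `[5, 2, 2, 4]` example and draws one sample, dimension by dimension. It uses only the standard library and does not call Gym. The helper names `valid_ranges` and `sample_multi_discrete` are hypothetical, and uniform sampling within each dimension is an assumption of the sketch rather than a statement about Gym's internals.

```python
import math
import random

nvec = [5, 2, 2, 4]  # dimension sizes from the MultiDiscrete example


def valid_ranges(nvec):
    """Return the inclusive (low, high) integer range for each dimension."""
    return [(0, n - 1) for n in nvec]


def sample_multi_discrete(nvec, rng=random):
    """Draw one value per dimension, independently and uniformly (assumed)."""
    return [rng.randrange(n) for n in nvec]


ranges = valid_ranges(nvec)
total_combinations = math.prod(nvec)  # number of distinct value vectors
sample = sample_multi_discrete(nvec)

print("Ranges per dimension:", ranges)
print("Distinct combinations:", total_combinations)
print("One sample:", sample)
```

For these sizes the ranges are 0-4, 0-1, 0-1, and 0-3, which gives 5 × 2 × 2 × 4 = 80 distinct vectors.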

In [10]:
box = spaces.Box(-2, 2, shape=(3,2), dtype=np.float64)
[box.sample() for _ in range(20)]

[array([[-0.79400374, -0.89244377],
        [ 0.65044125, -1.92038612],
        [ 0.73991394, -1.60240738]]),
 array([[-0.89296331, -0.96551173],
        [ 0.55588666, -0.83695289],
        [-1.19497815,  0.36724385]]),
 array([[-1.19887293,  1.75579537],
        [-1.77803333, -1.87905783],
        [-0.32348429, -1.93672769]]),
 array([[ 1.9031015 ,  0.19564132],
        [-1.55658858, -0.91911957],
        [-1.16232201, -0.84083103]]),
 array([[ 0.87684356,  1.70554173],
        [ 1.34991938,  0.16133106],
        [-0.14777836, -0.76922947]]),
 array([[0.54981688, 1.67526495],
        [0.630599  , 0.56716344],
        [0.9435345 , 0.79983371]]),
 array([[-0.3118908 , -0.50034357],
        [-1.29654765, -1.0501355 ],
        [-0.32083383,  1.69562941]]),
 array([[-1.1776331 ,  0.83123156],
        [ 1.75838583,  0.3906876 ],
        [-0.36098672,  1.79583711]]),
 array([[ 0.9835056 , -1.55206052],
        [-0.24646835, -0.96972703],
        [-0.77998458, -0.58894662]]),
 array([[ 0.6818

### Exercise 1: A Custom Environment with Rewards

Now we'll create an `n-Chain` environment, which represents moves along a linear chain of states, with two actions:

* (0) **forward**: moves one state along the chain, but returns no reward
* (1) **backward**: returns to the beginning of the chain and yields a small reward

The end of the chain, however, provides a large reward. Taking the **forward** action at the end of the chain collects this large reward again on every step.

#### Step 1: Implement `ChainEnv._setup_spaces`

Use `spaces.Discrete` for both the action space and the observation space. Implement `ChainEnv._setup_spaces` so that `self.action_space` and `self.observation_space` are proper Gym spaces.

1. The observation space is an integer in the range `[0, n-1]`.
2. The action space is an integer in `[0, 1]`.

For example:

```python
self.action_space = spaces.Discrete(2)
self.observation_space = ...
```

You should see a message indicating that the tests pass when this is done correctly!

#### Step 2: Implement a reward function.

When `env.step` is called, it returns a tuple of `(state, reward, done, info)`. In the starting code, the reward is always 0. Modify `step()` so that the following rewards are returned for the given actions:

1. `action == 1` returns `self.small_reward`.
2. `action == 0` returns 0 if `self.state < self.n - 1`.
3. `action == 0` returns `self.large_reward` if `self.state == self.n - 1`.

You should see a message indicating that the tests pass when this is done correctly.
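
The three reward rules can also be written as a standalone function, which makes them easy to check by hand before wiring them into `step()`. This is only a sketch: `chain_reward` is a hypothetical helper, not part of the exercise code, and its default values (`n=20`, small reward 2, large reward 10) are assumed to match the defaults in the `ChainEnv` constructor used in this lesson.

```python
def chain_reward(state, action, n=20, small_reward=2, large_reward=10):
    """Reward for taking `action` in `state`, following the three rules."""
    if action == 1:        # backward: always pays the small reward
        return small_reward
    if state < n - 1:      # forward before the end of the chain: no reward
        return 0
    return large_reward    # forward at the last state: large reward


r_backward = chain_reward(state=5, action=1)
r_forward_mid = chain_reward(state=5, action=0)
r_forward_end = chain_reward(state=19, action=0)

print(r_backward, r_forward_mid, r_forward_end)
```

With the assumed defaults, the three calls return 2, 0, and 10 respectively.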


In [11]:
from test_exercises import test_chain_env_spaces, test_chain_env_reward, test_chain_env_behavior
from gym import spaces

In [12]:
class ChainEnv(gym.Env):
    
    def __init__(self, env_config = None):
        env_config = env_config or {}
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)  # payout for 'backwards' action
        self.large_reward = env_config.get("large", 10)  # payout at end of chain for 'forwards' action
        self.state = 0  # Start at beginning of the chain
        self._horizon = self.n
        self._counter = 0  # For terminating the episode
        self._setup_spaces()
    
    def _setup_spaces(self):
        ##############
        # TODO: Implement this so that it passes tests
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)
        ##############

    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:  # 'backwards': go back to the beginning, get small reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.small_reward
            ##############
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = 0
            ##############
            self.state += 1
        else:  # 'forwards': stay at the end of the chain, collect large reward
            ##############
            # TODO 2: Implement this so that it passes tests
            reward = self.large_reward
            ##############
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
# Tests here:
test_chain_env_spaces(ChainEnv)
test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...
Success! You've setup the spaces correctly.
Testing if reward has been setup correctly...
Success! You've setup the rewards correctly.


### Train a Policy on the Environment 

Now we'll train a policy on the environment and evaluate it. You'll see that despite earning a high cumulative reward, the policy has barely explored the state space.

In order to proceed, the next cell imports a reference implementation of the previous exercise. Once you have completed the exercise yourself, comment out that cell so that your own `ChainEnv` is used instead.



In [13]:
from chain_env import ChainEnv

In [14]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config['num_workers'] = 1
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10
trainer_config["framework"] = 'torch'

In [15]:
def do_training(chainEnvClass, config = trainer_config, iterations=20):
    """Train PPO on the given environment class, printing one dot per iteration."""
    trainer = PPOTrainer(config, chainEnvClass)
    print('Training iterations: ', end='')
    for i in range(iterations):
        print('.', end='', flush=True)
        trainer.train()
    print('')
    return trainer

In [16]:
trainer = do_training(ChainEnv, config=trainer_config, iterations=20)

2020-09-27 12:30:18,348	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


...................


In [17]:
env = ChainEnv({})
state = env.reset()

done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print(f'Cumulative reward you received is: {cumulative_reward}. Congratulations!')
print(f'Max state you visited is: {max_state}. This is out of {env.n} states.')

Cumulative reward you received is: 40. Congratulations!
Max state you visited is: 0. This is out of 20 states.


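In this run the policy never left state 0: the maximum state visited is 0, and the cumulative reward of 40 is what 20 consecutive **backward** actions at 2 points each would yield. Other runs may reach state 1 or 2, but the policy still covers only a small part of the chain.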
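
The reward structure itself helps explain this outcome. The sketch below replays the chain dynamics in plain Python (no Gym or RLlib) for two fixed policies, one that always moves backward and one that always moves forward. It assumes the defaults seen in the `ChainEnv` cell (`n=20`, small reward 2, large reward 10, and an episode horizon of `n` steps) and that the imported implementation uses the same values. `run_fixed_policy` is a hypothetical helper written for this illustration.

```python
def run_fixed_policy(action, n=20, small_reward=2, large_reward=10):
    """Total reward from repeating one action for a full episode of n steps."""
    state, total = 0, 0
    for _ in range(n):          # the episode horizon is assumed to equal n
        if action == 1:         # backward: small reward, return to the start
            total += small_reward
            state = 0
        elif state < n - 1:     # forward: advance one state, no reward
            state += 1
        else:                   # forward at the end: large reward
            total += large_reward
    return total


always_backward = run_fixed_policy(action=1)
always_forward = run_fixed_policy(action=0)

print("Always backward:", always_backward)
print("Always forward: ", always_forward)
```

Under these assumptions, always moving backward earns 40, while always moving forward earns only 10, because 19 of the 20 steps are spent walking to the end of the chain. Within a single short episode, staying at state 0 is the better-paying behavior, so a policy that maximizes reward has little incentive to explore.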

## Shaping the Reward to Encourage Desired Behavior

We see that despite earning a high cumulative reward, the policy has barely explored the state space. This situation is common: a reward designed to encourage a particular solution turns out to be suboptimal, and the resulting behavior is not what was intended. In such cases, the reward can be reshaped to steer the agent toward the desired behavior.

### Exercise 2: Improve the Policy

Modify `ShapedChainEnvVisited.step()` in the next cell to return rewards that encourage the policy to traverse the chain (not just stay at state 0). Do not change the behavior of the environment. That is, the action -> state transitions should stay the same. You can change the reward to be whatever you wish. We'll test it in the next section.

This implementation also adds an attribute, `done_percentage`, which specifies the fraction of states, between `0.0` and `1.0`, that must be visited before `done` is reached. It is hard-coded to `0.5` in the constructor. Play with this number when you modify the rewards to gain a sense of how long it takes to explore the state space. Note that there is a "safety": the episode stops once more than `10*env.n` steps have been taken, even if the required fraction of states has not been visited. Until the rewards are reshaped, expect episodes to hit this safety!
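
To get a feel for how the stopping rule and a reshaped reward interact, the following plain-Python sketch replays two fixed policies under the visited-states rule described above. The shaping choice here, a penalty of `-0.1 * small_reward` for each backward action, is only one possibility and is assumed for illustration. `run_shaped_policy` is a hypothetical helper, and the defaults (`n=20`, small reward 2, large reward 10, `done_percentage=0.5`) are assumed to match this lesson's environment.

```python
def run_shaped_policy(action, n=20, small_reward=2, large_reward=10,
                      done_percentage=0.5):
    """Replay one fixed action; return (total reward, steps, states visited)."""
    state, total, steps = 0, 0.0, 0
    visited = set()
    done_n = done_percentage * n          # number of states needed for `done`
    while True:
        visited.add(state)                # record the state before moving
        if action == 1:                   # backward: assumed small penalty
            total += -0.1 * small_reward
            state = 0
        elif state < n - 1:               # forward: advance, no reward
            state += 1
        else:                             # forward at the end: large reward
            total += large_reward
        steps += 1
        # Stop when enough states were visited, or when the safety cap is hit.
        if len(visited) >= done_n or steps > 10 * n:
            return total, steps, len(visited)


forward_result = run_shaped_policy(action=0)
backward_result = run_shaped_policy(action=1)

print("Always forward: ", forward_result)
print("Always backward:", backward_result)
```

Under these assumptions, always moving forward ends the episode after 10 steps with a total reward of 0, while always moving backward visits a single state, runs into the safety cap after 201 steps, and accumulates a negative total. The penalty therefore makes staying at state 0 the worse choice, whereas in the unshaped environment that behavior earned 40.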

In [18]:
class ShapedChainEnvVisited(ChainEnv):

    def __init__(self, env_config = None):
        super().__init__(env_config)
        self.visited = set()
        self.done_percentage = 0.5
        self.done_n = self.done_percentage * self.n
        
    def step(self, action):
        assert self.action_space.contains(action)
        self.visited.add(self.state)
        if action == 1:  # 'backwards': go back to the beginning
            reward = self.small_reward*-0.1
            self.state = 0
        elif self.state < self.n - 1:  # 'forwards': go up along the chain
            reward = 0
            self.state += 1
        else:  # 'forwards': stay at the end of the chain
            reward = self.large_reward
        self._counter += 1
        done = len(self.visited) >= self.done_n
        if not done and self._counter > (self.n*10):
            done = True
            visited_per = (len(self.visited)*100.0)/self.n
            print(f'Stopping after {self.n*10} iterations. Visited {visited_per:6.2f}% of the states.')
        return self.state, reward, done, {}

test_chain_env_behavior(ShapedChainEnvVisited)

Testing if behavior has been changed...
Success! Behavior of environment is correct.


### Evaluate `ShapedChainEnv` by Running the Cell(s) Below

This trains PPO on the new environment and reports the percentage of states visited whenever an episode hits the safety limit.

In [19]:
trainer = do_training(ShapedChainEnvVisited, config=trainer_config, iterations=20)



(pid=280845) Stopping after 200 iterations. Visited  30.00% of the states.
.(pid=280845) Stopping after 200 iterations. Visited  40.00% of the states.
..................


Let's see how long it takes to get to 50% (the value hard-coded for `done_percentage`).

In [21]:
env = ShapedChainEnvVisited({})

state = env.reset()
done = False
max_state = -1
cumulative_reward = 0
while not done:
    action = trainer.compute_action(state)
    state, reward, done, results = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward += reward

print(f'Cumulative reward you received is: {cumulative_reward}!')
print(f'Max state you visited is: {max_state}. (There are {env.n} states.)')
desired = env.done_percentage
actual = (max_state+1)/env.n  # add one because of zero indexing
print(f"This policy traversed {actual*100:4.1f}% of the available states.")
assert actual > desired, f"{actual*100:4.1f}% is less than the desired percentage of {desired*100:4.1f}%."

Cumulative reward you received is: -3.0000000000000004!
Max state you visited is: 10. (There are 20 states.)
This policy traversed 55.0% of the available states.


In [22]:
ray.shutdown()  # "Undo ray.init()".