# Ray RLlib - Explore RLlib - Sample Application: BipedalWalker-v3 (Optional)
© 2019-2020, Anyscale. All Rights Reserved

This example uses a harder problem, the _Bipedal Walker_, a two-legged "robot" in two dimensions (see [here](https://gym.openai.com/envs/BipedalWalker-v2/) and [here](https://github.com/openai/gym/wiki/BipedalWalker-v2); we'll actually use version 3, not 2). 
![Bipedal Walker](../../images/rllib/Bipedal-Walker.png)

([source](https://gym.openai.com/envs/BipedalWalker-v2/))

Reward is given for moving forward, totaling 300+ points if the walker reaches the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so an agent that minimizes torque will get a better score. The state consists of the hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints, joints angular speed, legs contact with ground, and 10 LIDAR rangefinder measurements. There are no coordinates (such as x or y) in the state vector.

This notebook requires more computation than the other lessons to achieve a well-trained policy. To speed things up, we provide a checkpoint from a previous training run. Even starting from the provided checkpoint, you'll see good results. However, consider iterating on the neural network structure and running more training iterations. How well can you train the walker?
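To make the state description concrete, here is a minimal sketch that labels the entries of the observation vector. The label names and their ordering are assumptions for illustration (they are not defined by Gym or RLlib); the only thing taken from this notebook is that the components listed above add up to the 24 inputs seen in the model summary later on.

```python
# Hypothetical labels for the observation vector; names and ordering are
# illustrative assumptions, not identifiers defined by Gym or RLlib.
HULL = ["hull_angle", "hull_angular_velocity", "velocity_x", "velocity_y"]


def leg_fields(leg):
    """Per-leg entries: two joints (angle and speed each) plus ground contact."""
    return [
        f"{leg}_hip_angle", f"{leg}_hip_speed",
        f"{leg}_knee_angle", f"{leg}_knee_speed",
        f"{leg}_ground_contact",
    ]


LIDAR = [f"lidar_{i}" for i in range(10)]

OBS_LABELS = HULL + leg_fields("leg1") + leg_fields("leg2") + LIDAR


def describe(observation):
    """Pair each observation value with its (assumed) label."""
    if len(observation) != len(OBS_LABELS):
        raise ValueError(f"expected {len(OBS_LABELS)} values, got {len(observation)}")
    return dict(zip(OBS_LABELS, observation))


print(len(OBS_LABELS))  # 4 hull + 2 * 5 leg + 10 lidar entries
```

A helper like `describe` can make it easier to inspect individual observations while debugging, since no entry of the vector is an absolute position.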

First, import Ray and the PPO module in RLlib.

In [2]:
import ray
import ray.rllib.agents.ppo as ppo

Instructions for updating:
non-resource variables are not supported in the long term


In [3]:
import pandas as pd
import json, os, shutil, sys

In [4]:
sys.path.append('../..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

Model *checkpoints* will be saved after each iteration into directories under `tmp/ppo/bipedal-walker`, i.e., relative to this directory.
The default directories for checkpoints are `$HOME/ray_results/<algo_env>/.../checkpoint_N`.

> **Note:** If you prefer to use a different directory root, change it in the next cell _and_ in the `rllib rollout` command below.
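The training output later in this notebook shows checkpoints written as `checkpoint_N/checkpoint-N` under the root directory. Assuming that layout, the following sketch builds a checkpoint path and finds the most recent iteration on disk. The helper names `checkpoint_path` and `latest_iteration` are hypothetical and not part of RLlib.

```python
import re
from pathlib import Path


def checkpoint_path(root, iteration):
    """Build <root>/checkpoint_<N>/checkpoint-<N>, the layout seen in the training output."""
    return Path(root) / f"checkpoint_{iteration}" / f"checkpoint-{iteration}"


def latest_iteration(root):
    """Return the highest N among `checkpoint_N` subdirectories, or None if there are none."""
    root = Path(root)
    if not root.is_dir():
        return None
    numbers = []
    for entry in root.iterdir():
        match = re.fullmatch(r"checkpoint_(\d+)", entry.name)
        if entry.is_dir() and match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=None)


print(checkpoint_path("tmp/ppo/bipedal-walker", 101))
```

Comparing iteration numbers as integers avoids the trap of sorting directory names as strings, where `checkpoint_9` would sort after `checkpoint_120`.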

In [5]:
checkpoint_root = 'tmp/ppo/bipedal-walker'

Clean up output of previous lessons (optional):

In [6]:
# Where checkpoints are written:
#shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)

# Where some data will be written and used by Tensorboard below:
ray_results = os.path.join(os.path.expanduser('~'), 'ray_results')  # expanduser also works when HOME is unset (e.g., on Windows)
#shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

Start Ray:

In [7]:
ray.init(ignore_reinit_error=True)

2020-09-27 11:51:30,667	INFO services.py:1086 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.43.117',
 'raylet_ip_address': '192.168.43.117',
 'redis_address': '192.168.43.117:6379',
 'object_store_address': 'tcp://127.0.0.1:60395',
 'raylet_socket_name': 'tcp://127.0.0.1:54372',
 'webui_url': '127.0.0.1:8265',
 'session_dir': 'C:\\Users\\LG\\AppData\\Local\\Temp\\ray\\session_2020-09-27_11-51-27_924905_23740',
 'metrics_export_port': 62175,
 'node_id': '45f37eb9ef3f9328b9c49e174e6188354900dc2c'}

Traceback (most recent call last):
  File "D:\Installation\Anaconda3\envs\signal\lib\site-packages\ray\dashboard\dashboard.py", line 960, in <module>
    metrics_export_address=metrics_export_address)
  File "D:\Installation\Anaconda3\envs\signal\lib\site-packages\ray\dashboard\dashboard.py", line 512, in __init__
    build_dir = setup_static_dir(self.app)
  File "D:\Installation\Anaconda3\envs\signal\lib\site-packages\ray\dashboard\dashboard.py", line 411, in setup_static_dir
    "&& npm run build)", build_dir)
FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/dashboard/client && npm ci && npm run build): 'D:\\Installation\\Anaconda3\\envs\\signal\\lib\\site-packages\\ray\\dashboard\\client/build'



The Ray Dashboard is useful for monitoring Ray. In this run, however, it did not work on Windows (see the traceback above); it worked only on Linux.

In [1]:
#print(f'Dashboard URL: http://{ray.get_webui_url()}')

Next we'll train a policy for the [Bipedal Walker](https://gym.openai.com/envs/BipedalWalker-v2/) environment.

> **Note:** If you change the values shown for `config['model']['fcnet_hiddens']`, make the same change in the `rllib rollout` command below!

In [8]:
SELECT_ENV = "BipedalWalker-v3"                 # Specifies the OpenAI Gym environment
N_ITER = 20                                     # Number of training iterations. We'll only do 20 because this is compute intensive.
                                                # If you have a powerful machine or cluster or more time, try a bigger number like 50 or 100!

config = ppo.DEFAULT_CONFIG.copy()              # PPO's default configuration. See the next code cell.
config["log_level"] = "WARN"                    # Suppress too many messages, but try "INFO" to see what can be printed.


# Other settings we might adjust: 
config['num_workers'] = 4                       # Use > 1 for using more CPU cores, including over a cluster
config['num_sgd_iter'] = 50                     # Number of SGD (stochastic gradient descent) passes (epochs) over each training batch.
                                                # I.e., each training batch is split into minibatches and traversed this many times.
config['sgd_minibatch_size'] = 250              # The number of data records per minibatch
config['model']['fcnet_hiddens'] = [512, 512]   # Larger network than we used for CartPole.
config['num_cpus_per_worker'] = 0               # This avoids running out of resources in the notebook environment when this cell is re-executed
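To get a feel for how these settings interact, here is a rough back-of-the-envelope sketch. It assumes the PPO default `train_batch_size` of 4000 shown in `ppo.DEFAULT_CONFIG`, and it assumes each pass splits the training batch into non-overlapping minibatches; the exact bookkeeping inside RLlib may differ.

```python
# Values taken from the configuration in this notebook.
train_batch_size = 4000      # PPO default, see ppo.DEFAULT_CONFIG
sgd_minibatch_size = 250     # set in the cell above
num_sgd_iter = 50            # set in the cell above

# Assumption: each pass (epoch) visits every minibatch once.
minibatches_per_epoch = train_batch_size // sgd_minibatch_size
sgd_steps_per_iteration = minibatches_per_epoch * num_sgd_iter

print(f"{minibatches_per_epoch} minibatches per pass, "
      f"{sgd_steps_per_iteration} SGD steps per training iteration")
```

Under these assumptions, halving `num_sgd_iter` halves the gradient steps per training iteration, which is one reason smaller values run faster per iteration.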

Recall that you can see what configuration settings are defined for PPO. Note in particular the parameters for the deep learning `model`. Because `ppo.DEFAULT_CONFIG.copy()` is a shallow copy, the nested `model` dictionary is shared with `config`, so the `fcnet_hiddens` value printed below reflects the `[512, 512]` setting from the previous cell. As you try to make the performance better and better, what else might you modify here?
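One detail worth keeping in mind when editing nested settings such as `config['model']`: `dict.copy()` only copies the top level. The sketch below uses a small stand-in dictionary (the values are made up for illustration, not RLlib's real defaults) to show the difference between a shallow copy and `copy.deepcopy`.

```python
import copy

# Stand-in for a nested default configuration; values are illustrative only.
defaults = {"lr": 5e-05, "model": {"fcnet_hiddens": [64, 64]}}

# Shallow copy: the nested "model" dict is shared with the original.
shallow = defaults.copy()
shallow["model"]["fcnet_hiddens"] = [512, 512]
shallow_leaked = defaults["model"]["fcnet_hiddens"] == [512, 512]

# Deep copy: nested dicts are duplicated, so the original stays untouched.
pristine = {"lr": 5e-05, "model": {"fcnet_hiddens": [64, 64]}}
deep = copy.deepcopy(pristine)
deep["model"]["fcnet_hiddens"] = [512, 512]
deep_leaked = pristine["model"]["fcnet_hiddens"] == [512, 512]

print("shallow copy changed the defaults:", shallow_leaked)
print("deep copy changed the defaults:", deep_leaked)
```

If you want the defaults to stay unmodified across experiments, `copy.deepcopy(ppo.DEFAULT_CONFIG)` is one way to get an independent configuration.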

In [9]:
ppo.DEFAULT_CONFIG

{'num_workers': 2,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 200,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 4000,
 'model': {'fcnet_hiddens': [512, 512],
  'fcnet_activation': 'tanh',
  'conv_filters': None,
  'conv_activation': 'relu',
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  '_time_major': False,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_model_config': {},
  'custom_action_dist': None,
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': None,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr': 5e-05,
 'monitor': False,
 'log_level': 'WARN',
 'callbacks': ra

If the environment fails to run, install the following packages in your Anaconda virtual environment:

```
conda install swig # needed to build Box2D in the pip install
pip install box2d-py # a repackaged version of pybox2d
```

https://stackoverflow.com/questions/44198228/install-pybox2d-for-python-3-6-with-conda-4-3-21

> **Note:** If you get warnings like _WARN: Box bound precision lowered by casting to float32_, you can safely ignore them. They come from the Bipedal Walker environment's definitions of the state and action spaces, which use 32-bit floats instead of 64-bit.

In [10]:
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-09-27 12:00:35,984	ERROR syncer.py:63 -- Log sync requires rsync to be installed.
2020-09-27 12:00:35,991	INFO trainer.py:588 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-09-27 12:00:35,995	INFO trainer.py:615 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=1404) Instructions for updating:
(pid=1404) non-resource variables are not supported in the long term
(pid=88) Instructions for updating:
(pid=88) non-resource variables are not supported in the long term
(pid=11212) Instructions for updating:
(pid=11212) non-resource variables are not supported in the long term
(pid=9444) Instructions for updating:
(pid=9444) non-resource variables are not supported in the long term
2020-09-27 12:00:49,286	INFO trainable.py:255 -- Trainable.setup took 13.298 seconds. If your tra

Restore from a previously captured checkpoint, saved after 100 training iterations:

In [11]:
agent.restore('bipedal-walker-checkpoint/checkpoint-100')

2020-09-27 12:01:14,811	INFO trainable.py:482 -- Restored on 192.168.43.117 from checkpoint: bipedal-walker-checkpoint/checkpoint-100
2020-09-27 12:01:14,813	INFO trainable.py:489 -- Current state after restoring: {'_iteration': 100, '_timesteps_total': None, '_time_total': 791.705001115799, '_episodes_total': 318}


Train for an additional `N_ITER` iterations.

> **Note:** Depending on the machine or cluster you are running on, this can take a long time. If you are on a powerful laptop or running in a cluster, or you don't mind waiting, try using a larger value for `N_ITER`.

In [12]:
results = []
episode_data = []
episode_json = []
for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')

Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.


(pid=11212) Instructions for updating:
(pid=11212) Prefer Variable.assign which has equivalent behavior in 2.X.
(pid=9444) Instructions for updating:
(pid=9444) Prefer Variable.assign which has equivalent behavior in 2.X.
(pid=88) Instructions for updating:
(pid=88) Prefer Variable.assign which has equivalent behavior in 2.X.
(pid=1404) Instructions for updating:
(pid=1404) Prefer Variable.assign which has equivalent behavior in 2.X.


  0: Min/Mean/Max reward:      nan/     nan/     nan. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_101\checkpoint-101
  1: Min/Mean/Max reward: -119.4667/132.6253/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_102\checkpoint-102
  2: Min/Mean/Max reward: -119.4667/132.6253/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_103\checkpoint-103
  3: Min/Mean/Max reward: -119.4667/158.6643/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_104\checkpoint-104
  4: Min/Mean/Max reward: -119.4667/167.0379/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_105\checkpoint-105
  5: Min/Mean/Max reward: -119.4667/167.0379/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_106\checkpoint-106
  6: Min/Mean/Max reward: -119.4667/160.7977/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpoint_107\checkpoint-107
  7: Min/Mean/Max reward: -119.4667/164.5419/209.4024. Checkpoint saved to tmp/ppo/bipedal-walker\checkpo

The episode rewards should increase after multiple iterations. The `nan` values in the first row most likely mean that no episode finished during that iteration, so there were no episode rewards to aggregate. Try tweaking the config parameters. Smaller values for `num_sgd_iter`, `sgd_minibatch_size`, or the `model`'s `fcnet_hiddens` will make each iteration run faster, but it will take more iterations to improve the policy.

In [13]:
df = pd.DataFrame(data=episode_data)
df

|   | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | NaN | NaN | NaN | NaN |
| 1 | 1 | -119.466701 | 132.625259 | 209.402426 | 1299.8 |
| 2 | 2 | -119.466701 | 132.625259 | 209.402426 | 1299.8 |
| 3 | 3 | -119.466701 | 158.664303 | 209.402426 | 1433.222222 |
| 4 | 4 | -119.466701 | 167.037946 | 209.402426 | 1484.538462 |
| 5 | 5 | -119.466701 | 167.037946 | 209.402426 | 1484.538462 |
| 6 | 6 | -119.466701 | 160.797722 | 209.402426 | 1447.166667 |
| 7 | 7 | -119.466701 | 164.541854 | 209.402426 | 1462.45 |
| 8 | 8 | -119.466701 | 166.290014 | 209.402426 | 1474.954545 |
| 9 | 9 | -120.506855 | 157.534498 | 209.402426 | 1441.444444 |
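When scanning results like these, it can help to pick out the best iteration programmatically while skipping rows where no statistics were reported. The sketch below does this in plain Python on a few of the mean rewards shown above; the helper name `best_iteration` is hypothetical.

```python
import math

# A subset of the per-iteration results shown above (n, episode_reward_mean).
episode_data = [
    {"n": 0, "episode_reward_mean": float("nan")},
    {"n": 1, "episode_reward_mean": 132.625259},
    {"n": 3, "episode_reward_mean": 158.664303},
    {"n": 4, "episode_reward_mean": 167.037946},
    {"n": 6, "episode_reward_mean": 160.797722},
    {"n": 9, "episode_reward_mean": 157.534498},
]


def best_iteration(rows, key="episode_reward_mean"):
    """Return the row with the highest value for `key`, ignoring NaN rows."""
    valid = [row for row in rows if not math.isnan(row[key])]
    if not valid:
        return None
    return max(valid, key=lambda row: row[key])


best = best_iteration(episode_data)
print(best)
```

Filtering NaN values first matters because comparisons against NaN are always false, so a plain `max` over the raw list can return misleading results.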


In [14]:
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

Here are the results of starting from the iteration-100 checkpoint and training for an additional `N_ITER` iterations:

In [15]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                       title='Bipedal Walker Episode Rewards', x_axis_label='n', y_axis_label='reward')

![image](../../images/rllib/Bipedal-Walker-Rewards-120.png)

Compare with these graphs of the rewards after 50 and 100 iterations. Note the sign of the `reward` in all graphs!

After 100 iterations, starting from a checkpoint at 50 (so 50 _new_ iterations):

![image](../../images/rllib/Bipedal-Walker-Rewards-100.png)

After the first 50 iterations:

![image](../../images/rllib/Bipedal-Walker-Rewards-50.png)

By 100 iterations, the reward has mostly leveled off.

Let's print out the policy and model to see the results of training in detail.

In [16]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(24, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(24, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(512, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(512, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(512, 8) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(8,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(512, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "functional_1"
_
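As a quick sanity check on the variable shapes printed above, the following sketch counts the trainable parameters implied by those shapes. The shapes are copied from the output; the interpretation that the 24 inputs correspond to the observation size and that the 8 outputs hold a mean and a log standard deviation for each of 4 action dimensions is an assumption, not something this notebook states.

```python
from math import prod

# Variable shapes copied from the model.variables() output above.
shapes = {
    "fc_1/kernel": (24, 512),       "fc_1/bias": (512,),
    "fc_value_1/kernel": (24, 512), "fc_value_1/bias": (512,),
    "fc_2/kernel": (512, 512),      "fc_2/bias": (512,),
    "fc_value_2/kernel": (512, 512), "fc_value_2/bias": (512,),
    "fc_out/kernel": (512, 8),      "fc_out/bias": (8,),
    "value_out/kernel": (512, 1),   "value_out/bias": (1,),
}

# Split the count between the value branch and the policy branch.
value_params = sum(prod(s) for name, s in shapes.items() if "value" in name)
policy_params = sum(prod(s) for name, s in shapes.items() if "value" not in name)
total_params = policy_params + value_params

print(f"policy: {policy_params:,}  value: {value_params:,}  total: {total_params:,}")
```

Almost all of the parameters sit in the two 512x512 hidden layers, so changing `fcnet_hiddens` is the setting that most directly affects model size.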

## Rollout

Next we'll use the [RLlib rollout CLI](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

We'll use the last saved checkpoint for the rollout, `checkpoint_100` (or a different number you might have; see the output from the training above), evaluated through `2000` steps.

> **Notes:** 
>
> 1. If you changed the `checkpoint_root` value above, change it here, too. Because of bugs in variable substitution in Jupyter notebooks, we unfortunately can't use variables in the next cell.
> 2. If you changed the model parameters, specifically the `fcnet_hiddens` array in the `config` object above, make the same change here.

You may need to make one more modification, depending on how you are running this tutorial:

1. Running on your laptop? Remove the line `--no-render`.
2. Running on the Anyscale Service? The popup windows that the rollout would normally create can't be viewed in this case, so the `--no-render` flag suppresses them. The code cell afterwards provides a sample video. You can try adding `--video-dir tmp/ppo/bipedal-walker`, which will generate MP4 videos, then download them to view them. Or copy the `Video` cell below and use it to view the movies.

In [19]:
!RAY_ADDRESS=auto rllib rollout tmp/ppo/bipedal-walker/checkpoint_100/checkpoint-100 \
    --config "{\"env\": \"BipedalWalker-v3\", \"model\": {\"fcnet_hiddens\": [512, 512]}}" \
    --run PPO \
    --no-render \
    --steps 2000

'RAY_ADDRESS' is not recognized as an internal or external command,
operable program or batch file.
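The `VAR=value command` prefix used above is shell syntax that the Windows command prompt does not understand, which explains the error. As an alternative that avoids the shell, here is a minimal sketch of an in-process evaluation loop. It assumes the older Gym API (`reset()` returns an observation, `step()` returns a 4-tuple) and uses a tiny stand-in environment so it runs without Gym; `run_episode` and `CountdownEnv` are hypothetical names, not part of RLlib.

```python
def run_episode(env, compute_action, max_steps=2000):
    """Run one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = compute_action(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


class CountdownEnv:
    """Stand-in environment: pays reward 1.0 per step and ends after `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


total, steps = run_episode(CountdownEnv(), lambda obs: 0)
print(total, steps)
```

With the objects from this notebook, one might call `run_episode(gym.make("BipedalWalker-v3"), agent.compute_action)` before `ray.shutdown()`, assuming the Ray and Gym versions in use provide `compute_action` and the 4-tuple `step()` API.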


Here is a sample episode video after 100 training iterations.

> **Note:** This video was created by running the previous `rllib rollout` command with the additional argument `--video-dir tmp/ppo/bipedal-walker` (then the video was copied to the location below). It creates one video per episode.

In [19]:
from IPython.display import Video

sample_video='../../images/rllib/Bipedal-Walker-Example-100.mp4'
Video(sample_video, embed=True)

Finally, use [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) to visualize the results.

In [20]:
ray.shutdown()  # "Undo ray.init()".

## Exercise 1 ("Homework")

Try a long training run while you do other work. Increase `N_ITER` above to some large number. When it finishes, change the `rllib rollout` command to use the last checkpoint. How well does it run? 

Redo the experiment a few times. You might increase `N_ITER`. For each run, load the last checkpoint that was saved in the previous run. How well can you train the walker?

## Exercise 2 ("Homework")

In addition to _Cart Pole_, _Bipedal Walker_, and _Mountain Car_ (see the `extras` folder), there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try. Make a copy of this notebook and edit as required.