# Ray RLlib - Explore RLlib - Sample Application: CartPole

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

We were briefly introduced to the `CartPole` example and the OpenAI gym `CartPole-v1` environment ([gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)) in the [reinforcement learning introduction](../01-Introduction-to-Reinforcement-Learning.ipynb). This lesson uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy for `CartPole`.

Recall that the `gym` Python module provides MDP interfaces to a variety of simulators, like the simple simulator for the physics of balancing a pole on a cart that is used by the CartPole environment. The `CartPole` problem is described at https://gym.openai.com/envs/CartPole-v1.

![Cart Pole](../../images/rllib/Cart-Pole.png)

([source](https://gym.openai.com/envs/CartPole-v1/))

Although this example is simple and quick to run, its results are easy to understand visually. `CartPole` is one of OpenAI Gym's ["classic control"](https://gym.openai.com/envs/#classic_control) examples.
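To make the "MDP interface" idea concrete, here is a minimal pure-Python sketch of a CartPole-like simulator with `reset` and `step` functions. It is not Gym's code: the constants and the Euler update are assumed to mirror Gym's classic implementation, and the function names and the two toy policies are hypothetical.

```python
import math
import random

# Constants assumed to match Gym's classic CartPole implementation.
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
HALF_POLE_LENGTH = 0.5
FORCE_MAG = 10.0
TAU = 0.02                               # seconds between state updates
X_LIMIT = 2.4                            # cart position limit
THETA_LIMIT = 12 * 2 * math.pi / 360     # pole angle limit (12 degrees, in radians)
MAX_STEPS = 500                          # episode cap assumed for CartPole-v1


def reset(rng):
    """Return a small random initial state (x, x_dot, theta, theta_dot)."""
    return tuple(rng.uniform(-0.05, 0.05) for _ in range(4))


def step(state, action):
    """Apply action 0 (push left) or 1 (push right); return (state, reward, done)."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_POLE_LENGTH
    temp = (force + pole_mass_length * theta_dot ** 2 * math.sin(theta)) / total_mass
    theta_acc = (GRAVITY * math.sin(theta) - math.cos(theta) * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * math.cos(theta) ** 2 / total_mass))
    x_acc = temp - pole_mass_length * theta_acc * math.cos(theta) / total_mass
    # Euler integration of the equations of motion.
    x += TAU * x_dot
    x_dot += TAU * x_acc
    theta += TAU * theta_dot
    theta_dot += TAU * theta_acc
    done = abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
    return (x, x_dot, theta, theta_dot), 1.0, done   # reward of 1 per step survived


def run_episode(policy, seed=0):
    """Run one episode with policy(state, rng) -> action; return the total reward."""
    rng = random.Random(seed)
    state = reset(rng)
    total = 0.0
    for _ in range(MAX_STEPS):
        state, reward, done = step(state, policy(state, rng))
        total += reward
        if done:
            break
    return total


# Two toy policies: random pushes, and "push toward the side the pole leans".
random_total = run_episode(lambda state, rng: rng.randrange(2))
lean_total = run_episode(lambda state, rng: 1 if state[2] > 0 else 0)
print(random_total, lean_total)
```

The reward is simply the number of steps the pole stays up, which is why episode reward and episode length coincide for this problem.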

For more background about this problem, see:

* ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077), AG Barto, RS Sutton, and CW Anderson, *IEEE Transactions on Systems, Man, and Cybernetics* (1983). These are the same Sutton and Barto who wrote [*Reinforcement Learning: An Introduction*](https://mitpress.mit.edu/books/reinforcement-learning-second-edition).
* ["Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)"](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288), [Greg Surma](https://twitter.com/GSurma).

First, import Ray and the PPO module in RLlib, then start Ray.

In [4]:
import ray
import ray.rllib.agents.ppo as ppo

In [5]:
import pandas as pd
import json, os, shutil, sys

In [6]:
sys.path.append('../..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

ModuleNotFoundError: No module named 'util'

Model *checkpoints* will get saved after each iteration into directories under `tmp/ppo/cart`, i.e., relative to this directory. 
The default directories for checkpoints are `$HOME/ray_results/<algo_env>/.../checkpoint_N`.

> **Note:** If you prefer to use a different directory root, change it in the next cell _and_ in the `rllib rollout` command below.
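As a small illustration of the directory layout, the sketch below builds the path of the checkpoint written after a given iteration. The `checkpoint_N/checkpoint-N` pattern is taken from the sample training output in this lesson (for example `tmp/ppo/cart/checkpoint_1/checkpoint-1`); other RLlib versions may lay checkpoints out differently, and the helper name is hypothetical.

```python
import posixpath


def checkpoint_path(root, iteration):
    """Build the checkpoint file path for a 1-based training iteration."""
    return posixpath.join(root, f"checkpoint_{iteration}", f"checkpoint-{iteration}")


print(checkpoint_path("tmp/ppo/cart", 10))   # tmp/ppo/cart/checkpoint_10/checkpoint-10
```

If you change the checkpoint root, only the first argument changes; the trailing two path components stay the same.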

In [5]:
checkpoint_root = 'tmp/ppo/cart'

Clean up output of previous lessons (optional):

In [6]:
# Where checkpoints are written:
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)

# Where some data will be written and used by Tensorboard below:
ray_results = f'{os.getenv("HOME")}/ray_results/'
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

Start Ray:

In [7]:
ray.init(ignore_reinit_error=True)

2020-09-20 18:49:51,993	INFO resource_spec.py:231 -- Starting Ray with 3.37 GiB memory available for workers and up to 1.7 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-20 18:49:52,614	INFO services.py:1193 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '127.0.0.2',
 'raylet_ip_address': '127.0.0.2',
 'redis_address': '127.0.0.2:42977',
 'object_store_address': '/tmp/ray/session_2020-09-20_18-49-51_988458_6022/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-09-20_18-49-51_988458_6022/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-09-20_18-49-51_988458_6022'}

The Ray Dashboard is useful for monitoring Ray:

In [8]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Next we'll train an RLlib policy with the [`CartPole-v1` environment](https://gym.openai.com/envs/CartPole-v1/).

If you've gone through the _Multi-Armed Bandits_ lessons, you may recall that we used [Ray Tune](http://tune.io), the Ray Hyperparameter Tuning system, to drive training. Here we'll do it ourselves.

By default, training runs for `10` iterations. Increase the `N_ITER` setting if you want to train longer and see the resulting rewards improve. However, if the maximum score of `500` for `CartPole-v1` is reached early, you can use a smaller number of iterations.


- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used. In a cluster, these actors will be spread over the available nodes.
- `num_sgd_iter` is the number of epochs of SGD (stochastic gradient descent, i.e., passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO, for each _minibatch_ ("chunk") of training data. Using minibatches is more efficient than training with one record at a time.
- `sgd_minibatch_size` is the SGD minibatch size (batches of data) that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers. Here, we have two hidden layers, of sizes 100 and 50.
- `num_cpus_per_worker` when set to 0 prevents Ray from pinning a CPU core to each worker. Without this setting, we could run out of CPU resources in a constrained environment like a laptop or a cloud VM.

> **Note:** If you change the values shown for `config['model']['fcnet_hiddens']`, make the same change in the `rllib rollout` command below!
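To see how `num_sgd_iter` and `sgd_minibatch_size` interact, here is a rough back-of-the-envelope sketch. It assumes that each SGD epoch splits the whole training batch (`train_batch_size`, `4000` in the PPO defaults printed in this lesson) into minibatches and performs one gradient update per minibatch; how RLlib treats a leftover partial minibatch is not covered by this lesson, so the rounding here is an assumption.

```python
import math


def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Estimate the number of gradient updates in one training iteration."""
    minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
    return num_sgd_iter * minibatches_per_epoch


# 4000 / 250 = 16 minibatches per epoch, times 10 epochs.
print(sgd_updates_per_iteration(4000, 250, 10))   # 160
```

Under these assumptions, halving `num_sgd_iter` halves the optimization work per iteration, while halving `sgd_minibatch_size` doubles the number of (smaller) updates.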

In [9]:
SELECT_ENV = "CartPole-v1"                      # Specifies the OpenAI Gym environment for Cart Pole
N_ITER = 10                                     # Number of training iterations.

config = ppo.DEFAULT_CONFIG.copy()              # PPO's default configuration. See the next code cell.
                                                # Note: .copy() is shallow, so nested dicts like config['model'] are shared with DEFAULT_CONFIG.
config["log_level"] = "WARN"                    # Suppress most log messages; try "INFO" to see what can be printed.
# Other settings we might adjust:
config['num_workers'] = 1                       # Use > 1 for using more CPU cores, including over a cluster
config['num_sgd_iter'] = 10                     # Number of SGD (stochastic gradient descent) iterations per training minibatch.
                                                # I.e., for each minibatch of data, do this many passes over it to train.
config['sgd_minibatch_size'] = 250              # The number of data records per minibatch
config['model']['fcnet_hiddens'] = [100, 50]    # Two hidden layers, with 100 and 50 units
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed
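One Python detail is worth knowing about the cell above: `dict.copy()` makes a shallow copy, so nested dictionaries such as `config['model']` are still shared with the original. This may be why the `ppo.DEFAULT_CONFIG` printout in this lesson reports `fcnet_hiddens` as `[100, 50]`. The sketch below demonstrates the effect with a small stand-in dictionary; its values are hypothetical and only the nesting matters.

```python
import copy

# Stand-in for ppo.DEFAULT_CONFIG (hypothetical values).
DEFAULTS = {"num_workers": 2, "model": {"fcnet_hiddens": [256, 256]}}

shallow = DEFAULTS.copy()
shallow["num_workers"] = 1                       # top-level key: DEFAULTS is unaffected
shallow["model"]["fcnet_hiddens"] = [100, 50]    # nested dict is shared: DEFAULTS changes too

print(DEFAULTS["num_workers"])                   # 2
print(DEFAULTS["model"]["fcnet_hiddens"])        # [100, 50]

# copy.deepcopy() duplicates the nested dictionaries as well.
DEFAULTS_2 = {"num_workers": 2, "model": {"fcnet_hiddens": [256, 256]}}
deep = copy.deepcopy(DEFAULTS_2)
deep["model"]["fcnet_hiddens"] = [100, 50]
print(DEFAULTS_2["model"]["fcnet_hiddens"])      # [256, 256]
```

For this lesson the shared dictionary is harmless, but `copy.deepcopy(ppo.DEFAULT_CONFIG)` is the safer choice if you want the defaults left untouched.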

Out of curiosity, let's see what configuration settings are defined for PPO. Note in particular the parameters for the deep learning `model`:

In [10]:
ppo.DEFAULT_CONFIG

{'num_workers': 2,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 200,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 4000,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [100, 50],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_model_config': {},
  'custom_action_dist': None,
  'custom_preprocessor': None,
  'custom_options': -1},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': None,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr': 5e-05,
 'monitor

In [11]:
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

results = []
episode_data = []
episode_json = []
for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')

2020-09-20 18:49:55,633	INFO trainer.py:605 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-09-20 18:49:55,636	INFO trainer.py:632 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


  0: Min/Mean/Max reward:   9.0000/ 22.7200/ 59.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_1/checkpoint-1
  1: Min/Mean/Max reward:   9.0000/ 30.3636/ 99.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_2/checkpoint-2
  2: Min/Mean/Max reward:   9.0000/ 40.5300/115.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_3/checkpoint-3
  3: Min/Mean/Max reward:  12.0000/ 56.5100/174.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_4/checkpoint-4
  4: Min/Mean/Max reward:  14.0000/ 79.7000/241.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_5/checkpoint-5
  5: Min/Mean/Max reward:  21.0000/104.4500/380.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_6/checkpoint-6
  6: Min/Mean/Max reward:  21.0000/128.8300/380.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_7/checkpoint-7
  7: Min/Mean/Max reward:  21.0000/160.3900/500.0000. Checkpoint saved to tmp/ppo/cart/checkpoint_8/checkpoint-8
  8: Min/Mean/Max reward:  21.0000/186.2100/500.0000. Checkpoint saved to tmp/ppo/cart/checkpoin

The episode rewards should increase after multiple iterations. Try tweaking the config parameters. Smaller values for `num_sgd_iter`, `sgd_minibatch_size`, or the `model`'s `fcnet_hiddens` will make each iteration run faster, but the policy will take more iterations to improve.
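If the rewards reach your target before `N_ITER` iterations, you can stop early. The following is a minimal sketch of such a loop. To keep it runnable without Ray, it replays the mean rewards from the sample run in this lesson instead of calling `agent.train()`; the `train_until` helper and the target of `150` are hypothetical.

```python
# Mean episode rewards per iteration, copied from the sample run in this lesson.
MEAN_REWARDS = [22.72, 30.3636, 40.53, 56.51, 79.70, 104.45, 128.83, 160.39, 186.21, 205.35]


def train_until(train_once, target_mean, max_iters):
    """Call train_once() until its mean reward reaches target_mean or max_iters is hit."""
    history = []
    for _ in range(max_iters):
        mean_reward = train_once()
        history.append(mean_reward)
        if mean_reward >= target_mean:
            break
    return history


# Stand-in for agent.train()["episode_reward_mean"]: replay the recorded means.
replay = iter(MEAN_REWARDS)
history = train_until(lambda: next(replay), target_mean=150.0, max_iters=10)
print(len(history), history[-1])   # 8 160.39
```

With a real trainer, `train_once` would wrap `agent.train()` and return `result['episode_reward_mean']`.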

In [12]:
df = pd.DataFrame(data=episode_data)
df

|   | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | 9.0 | 22.72 | 59.0 | 22.72 |
| 1 | 1 | 9.0 | 30.363636 | 99.0 | 30.363636 |
| 2 | 2 | 9.0 | 40.53 | 115.0 | 40.53 |
| 3 | 3 | 12.0 | 56.51 | 174.0 | 56.51 |
| 4 | 4 | 14.0 | 79.7 | 241.0 | 79.7 |
| 5 | 5 | 21.0 | 104.45 | 380.0 | 104.45 |
| 6 | 6 | 21.0 | 128.83 | 380.0 | 128.83 |
| 7 | 7 | 21.0 | 160.39 | 500.0 | 160.39 |
| 8 | 8 | 21.0 | 186.21 | 500.0 | 186.21 |
| 9 | 9 | 21.0 | 205.35 | 500.0 | 205.35 |
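The training loop also collects each episode summary as a JSON string in `episode_json`, which the lesson does not use further. One possible use, sketched here under the assumption that you want to keep the metrics after the notebook is closed, is to write them to a JSON Lines file and read them back. The file name and both helper functions are hypothetical, and the two records are abbreviated copies of the first rows above.

```python
import json
import os
import tempfile

# Abbreviated stand-in for the episode_json list built during training.
episode_json = [
    json.dumps({"n": 0, "episode_reward_mean": 22.72}),
    json.dumps({"n": 1, "episode_reward_mean": 30.363636}),
]


def save_json_lines(lines, path):
    """Write one JSON document per line."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")


def load_json_lines(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "episodes.jsonl")
save_json_lines(episode_json, path)
records = load_json_lines(path)
print(records[1]["episode_reward_mean"])   # 30.363636
```

The reloaded list of dictionaries can be passed to `pd.DataFrame(data=records)` in the same way as `episode_data`.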


In [13]:
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [14]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                       title='Cart Pole Episode Rewards', x_axis_label = 'n', y_axis_label='reward')

NameError: name 'plot_line_with_min_max' is not defined

([image](../../images/rllib/Cart-Pole-Episode-Rewards3.png))

Also, print out the policy and model to see the results of training in detail…

In [15]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 100) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(100,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(4, 100) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(100,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(100, 50) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(50,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(100, 50) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(50,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(50, 2) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(2,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(50, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
________________
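The variable shapes printed above are enough to work out the size of the network by hand. The sketch below copies those shapes (with the `default_policy/` prefix and `:0` suffix dropped) and counts the trainable parameters; the helper name is hypothetical and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Shapes copied from the model.variables() output in this lesson.
VARIABLE_SHAPES = {
    "fc_1/kernel": (4, 100), "fc_1/bias": (100,),
    "fc_value_1/kernel": (4, 100), "fc_value_1/bias": (100,),
    "fc_2/kernel": (100, 50), "fc_2/bias": (50,),
    "fc_value_2/kernel": (100, 50), "fc_value_2/bias": (50,),
    "fc_out/kernel": (50, 2), "fc_out/bias": (2,),
    "value_out/kernel": (50, 1), "value_out/bias": (1,),
}


def count_parameters(shapes):
    """Sum the number of elements over all variable shapes."""
    return sum(prod(shape) for shape in shapes.values())


value_branch = {k: v for k, v in VARIABLE_SHAPES.items() if "value" in k}
policy_branch = {k: v for k, v in VARIABLE_SHAPES.items() if "value" not in k}

print(count_parameters(policy_branch))     # 5652
print(count_parameters(value_branch))      # 5601
print(count_parameters(VARIABLE_SHAPES))   # 11253
```

The input width of 4 matches the four CartPole observations, and the policy output width of 2 matches the two actions (push left or push right).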

## Rollout

Next we'll use the [RLlib rollout CLI](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the `CartPole` agent operating within the simulation: moving the cart left or right to avoid having the pole fall over.

We'll use the last saved checkpoint, `checkpoint_10` (or whatever you set for `N_ITER` above) for the rollout, evaluated through `2000` steps.

> **Notes:** 
>
> 1. If you changed `checkpoint_root` above to something other than `tmp/ppo/cart`, then change it here, too. Unfortunately, because of bugs in variable substitution in Jupyter notebooks, we can't use variables in the next cell.
> 2. If you changed the model parameters, specifically the `fcnet_hiddens` array in the `config` object above, make the same change here.

You may need to make one more modification, depending on how you are running this tutorial:

1. Running on your laptop? - Remove the line `--no-render`. 
2. Running on the Anyscale Service? The popup windows that would normally be created by the rollout can't be viewed in this case, so the `--no-render` flag suppresses them. The code cell afterward provides a sample video. You can try adding `--video-dir tmp/ppo/cart`, which will generate MP4 videos, then download them to view them. Or copy the `Video` cell below and use it to view the movies.

In [7]:
!rllib rollout tmp/ppo/cart/checkpoint_10/checkpoint-10 \
    --config "{\"env\": \"CartPole-v1\", \"model\": {\"fcnet_hiddens\": [100, 50]}}" \
    --run PPO \
    --no-render \
    --steps 2000 --video-dir video

2020-09-20 20:55:42,155	INFO resource_spec.py:231 -- Starting Ray with 3.56 GiB memory available for workers and up to 1.8 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-20 20:55:42,668	INFO services.py:1193 -- View the Ray dashb
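Because the rollout cell hard-codes the checkpoint path and a hand-escaped JSON string, it is easy for it to drift out of sync with the training settings. One alternative, sketched here, is to build the argument list in Python and let `json.dumps` handle the quoting. The `rollout_command` helper is hypothetical; it only reuses the flags that appear in the cell above and assumes the checkpoint layout seen in this lesson's training output.

```python
import json
import shlex


def rollout_command(checkpoint_root, iteration, env, fcnet_hiddens, steps,
                    render=False, video_dir=None):
    """Build the argument list for the `rllib rollout` call used in this lesson."""
    checkpoint = f"{checkpoint_root}/checkpoint_{iteration}/checkpoint-{iteration}"
    config = json.dumps({"env": env, "model": {"fcnet_hiddens": fcnet_hiddens}})
    args = ["rllib", "rollout", checkpoint,
            "--config", config,
            "--run", "PPO",
            "--steps", str(steps)]
    if not render:
        args.append("--no-render")
    if video_dir:
        args += ["--video-dir", video_dir]
    return args


args = rollout_command("tmp/ppo/cart", 10, "CartPole-v1", [100, 50], 2000,
                       video_dir="video")
# Shell-quoted form, suitable for pasting into a terminal.
print(" ".join(shlex.quote(a) for a in args))
```

The list form can be passed directly to `subprocess.run(args)`, which avoids shell escaping altogether.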

Here is a sample episode. 

> **Note:** This video was created by running the previous `rllib rollout` command with the argument `--video-dir some_directory`. It creates one video per episode.

In [9]:
from IPython.display import Video

cart_pole_sample_video='video/Cart-Pole-Example-Video.mp4'
Video(cart_pole_sample_video)

ValueError: To embed videos, you must pass embed=True (this may make your notebook files huge)
Consider passing Video(url='...')
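The `ValueError` above can appear when the path given to `Video` does not exist relative to the notebook's working directory (this is an assumption about IPython's behavior, so verify it for your version). A quick way to check is to list the MP4 files that are actually present in the video directory. The `find_videos` helper and the placeholder file names below are hypothetical.

```python
import tempfile
from pathlib import Path


def find_videos(video_dir):
    """Return the MP4 files in video_dir sorted by name, or [] if the directory is missing."""
    directory = Path(video_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.mp4"))


# Demonstration with a throwaway directory and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("episode-1.mp4", "episode-0.mp4", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_videos(tmp)]
    print(found)                            # ['episode-0.mp4', 'episode-1.mp4']

print(find_videos("no/such/directory"))     # []
```

In the notebook, `find_videos('video')` would show whether the rollout wrote any episodes and what they are called.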

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) as discussed in [02 Introduction to RLlib](../02-Introduction-to-RLlib.ipynb). Select the Cart Pole runs and visualize the key metrics from training with RLlib.

```shell
tensorboard --logdir=$HOME/ray_results
```

For more examples of working with Gym environments, go through the next lesson, [Bipedal Walker](02-Bipedal-Walker.ipynb), then any of the "extra" lessons:

* [Extras: Application - Mountain Car](extras/Extra-Application-Mountain-Car.ipynb) -- Based on the `MountainCar-v0` environment from OpenAI Gym.
* [Extras: Application - Taxi](extras/Extra-Application-Taxi.ipynb) -- Based on the `Taxi-v3` environment from OpenAI Gym.
* [Extras: Application - Frozen Lake](extras/Extra-Application-Frozen-Lake.ipynb) -- Based on the `FrozenLake-v0` environment from OpenAI Gym.

Use TensorBoard to visualize their training runs, metrics, etc., as well. (These notebooks won't mention this suggestion.)


## Exercise ("Homework")

In addition to _Cart Pole_, _Bipedal Walker_, and _Mountain Car_, there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try. Make a copy of this notebook and edit as required.

In [10]:
ray.shutdown()  # "Undo ray.init()".