# Ray RLlib - Introduction to Reinforcement Learning

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

_Reinforcement Learning_ is the category of machine learning that focuses on training one or more _agents_ to achieve maximal _rewards_ while operating in an environment. This lesson discusses the core concepts of RL, while subsequent lessons explore RLlib in depth. We'll use two examples with exercises to give you a taste of RL. If you already understand RL concepts, you can either skim this lesson or skip to the [next lesson](02-Introduction-to-RLlib.ipynb).


## What Is Reinforcement Learning?

Let's explore the basic concepts of RL, specifically the _Markov Decision Process_ abstraction, and see how to use it in Python.

Consider the following image:

![RL Concepts](../images/rllib/RL-concepts.png)

In RL, one or more **agents** interact with an **environment** to maximize a **reward**. The agents make **observations** about the **state** of the environment and take **actions** that they believe will maximize the long-term reward. However, at any particular moment, the agents can only observe the immediate reward. So, the training process usually involves lots and lots of repetition (replays of a game, a robot simulator traversing a virtual space, etc.), so the agents can learn from repeated trials which decisions and actions work best to maximize the long-term, cumulative reward.

---

Trial-and-error search and delayed reward are the distinguishing characteristics of RL vs. other ML methods ([Sutton 2018](06-RL-References.ipynb#Books)).

The way to formalize trial and error is the **exploitation vs. exploration tradeoff**. When an agent finds what appears to be a "rewarding" sequence of actions, the agent may naturally want to continue to **exploit** these actions. However, even better actions may exist. An agent won't know whether alternatives are better or not unless some percentage of actions taken **explore** the alternatives. So, all RL algorithms include a strategy for exploitation and exploration.
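One common way to balance the two is an _epsilon-greedy_ strategy: with a small probability the agent explores a random action, and otherwise it exploits the action with the best estimated reward. The following is a minimal sketch on a toy multi-armed bandit, not anything taken from RLlib. The function name `epsilon_greedy_bandit`, the fixed per-arm rewards, and the parameter values are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_rewards, epsilon, steps, seed=0):
    """Run an epsilon-greedy agent on a toy bandit with fixed per-arm rewards.

    Hypothetical helper for illustration only. Returns the agent's reward
    estimates and how often each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_rewards)
    estimates = [0.0] * n_arms   # current reward estimate for each arm
    counts = [0] * n_arms        # number of times each arm was pulled

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm chosen uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far
            # (ties go to the lowest-numbered arm).
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = true_rewards[arm]
        counts[arm] += 1
        # Incremental average of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


for eps in (0.0, 0.1):
    _, counts = epsilon_greedy_bandit([0.2, 0.5, 0.9], eps, 1000)
    print(f"epsilon={eps}: pulls per arm = {counts}")
```

With `epsilon=0.0` the agent never explores, so it keeps exploiting the first arm it tried even though a better arm exists. With a small positive `epsilon` it eventually discovers the best arm and spends most of its pulls there.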


## RL Applications

RL has many potential applications. It became "famous" through successes such as expert-level game play and the training of robots, autonomous vehicles, and other simulated agents:

![AlphaGo](../images/rllib/alpha-go.jpg)
![Game](../images/rllib/breakout.png)

![Stacking Legos with Sawyer](../images/rllib/stacking-legos-with-sawyer.gif)
![Walking Man](../images/rllib/walking-man.gif)

![Autonomous Vehicle](../images/rllib/daimler-autonomous-car.jpg)
!["Cassie": Two-legged Robot](../images/rllib/cassie-crouched.png)

Credits:
* [AlphaGo](https://www.youtube.com/watch?v=l7ngy56GY6k)
* [Breakout](https://towardsdatascience.com/tutorial-double-deep-q-learning-with-dueling-network-architectures-4c1b3fb7f756) ([paper](https://arxiv.org/abs/1312.5602))
* [Stacking Legos with Sawyer](https://robohub.org/soft-actor-critic-deep-reinforcement-learning-with-real-world-robots/)
* [Walking Man](https://openai.com/blog/openai-baselines-ppo/)
* [Autonomous Vehicle](https://www.daimler.com/innovation/case/autonomous/intelligent-drive-2.html)
* ["Cassie": Two-legged Robot](https://mime.oregonstate.edu/research/drl/robots/cassie/) (Uses Ray!)

More recently, other industry applications have emerged, including the following:

* **Process optimization:** industrial processes (factories, pipelines) and other business processes, routing problems, cluster optimization.
* **Ad serving and recommendations:** Some of the traditional methods, including _collaborative filtering_, are hard to scale for very large data sets. RL systems are being developed to do an effective job more efficiently than traditional methods.
* **Finance:** Markets are time-oriented _environments_ where automated trading systems are the _agents_. 


## Markov Decision Processes

At its core, reinforcement learning builds on the concept of a [Markov Decision Process (MDP)](https://en.wikipedia.org/wiki/Markov_decision_process), where the current state, the possible actions that can be taken, and the overall goal are the building blocks.

An MDP models sequential interactions with an external environment. It consists of the following:

- a **state space** where the current state of the system is sometimes called the **context**.
- a set of **actions** that can be taken at a particular state $s$ (or sometimes the same set for all states).
- a **transition function** that describes the probability of being in a state $s'$ at time $t+1$ given that the MDP was in state $s$ at time $t$ and action $a$ was taken. The next state is selected stochastically based on these probabilities.
- a **reward function**, which determines the reward received at time $t$ following action $a$, based on the decision of **policy** $\pi$.
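To make these four pieces concrete, here is a minimal sketch of a tiny MDP written as plain Python dictionaries. The two states (`"cool"`, `"warm"`), the two actions (`"slow"`, `"fast"`), and all probabilities and rewards are invented for illustration. Only the structure (state space, actions, transition function, reward function) follows the list above.

```python
import random

# State space and action set of a hypothetical two-state MDP.
states = ["cool", "warm"]
actions = ["slow", "fast"]

# Transition function: (state, action) -> {next_state: probability}.
transitions = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# Reward function: (state, action) -> reward received for taking that action.
rewards = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("warm", "slow"): 1.0,
    ("warm", "fast"): -1.0,
}


def sample_next_state(state, action, rng):
    """Select the next state stochastically from the transition probabilities."""
    distribution = transitions[(state, action)]
    next_states = list(distribution)
    weights = [distribution[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
state, action = "cool", "fast"
print(state, action, rewards[(state, action)], sample_next_state(state, action, rng))
```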

---




---
The goal of an MDP is to develop a **policy** $\pi$ that specifies what action $a$ should be chosen for a given state $s$ so that the cumulative reward is maximized. When it is possible for the policy "trainer" to fully observe all the possible states, actions, and rewards, it can define a deterministic policy, fixing a single action choice for each state. In this scenario, the transition probabilities reduce to the probability of transitioning to state $s'$ given the current state is $s$, independent of actions, because the state now leads to a deterministic action choice. Various algorithms can be used to compute this policy.

---
Put another way, if the policy isn't deterministic, then the transition probability to state $s'$ at a time $t+1$ when action $a$ is taken for state $s$ at time $t$, is given by:

\begin{equation}
P_a(s',s) = P(s_{t+1} = s'|s_t=s,a)
\end{equation}

When the policy is deterministic, this transition probability reduces to the following, independent of $a$:

\begin{equation}
P(s',s) = P(s_{t+1} = s'|s_t=s)
\end{equation}

To be clear, a deterministic policy means that one and only one action will always be selected for a given state $s$, but the next state $s'$ will still be selected stochastically.
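
One way to connect the two expressions, assuming we write a deterministic policy as a function $\pi(s)$ that returns the single action chosen in state $s$ (this notation is an assumption made here for illustration), is to substitute that action into the general form:

\begin{equation}
P(s',s) = P_{\pi(s)}(s',s) = P(s_{t+1} = s'|s_t=s,a=\pi(s))
\end{equation}

The action no longer appears as a free variable because it is fixed by the state, while the outcome $s'$ remains random.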

In the general case of RL, it isn't possible to fully know all this information, some of which might be hidden and evolving, so it isn't possible to specify a fully-deterministic policy.


---
Often this cumulative reward is computed using the **discounted sum** over all rewards observed:

\begin{equation}
\arg\max_{\pi} \sum_{t=1}^T \gamma^t R_t(\pi),
\end{equation}

where $T$ is the number of steps taken in the MDP (this is a random variable and may depend on $\pi$), $R_t$ is the reward received at time $t$ (also a random variable which depends on $\pi$), and $\gamma$ is the **discount factor**. The value of $\gamma$ is between 0 and 1, meaning it has the effect of "discounting" rewards received later relative to rewards received sooner.
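
As a small worked example of the sum inside the expression above, the sketch below computes $\sum_{t=1}^T \gamma^t R_t$ for one fixed list of observed rewards. The helper name `discounted_return` and the sample reward values are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t = 1..T of gamma**t * R_t for one observed reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards, start=1):
        total += (gamma ** t) * reward
    return total


observed_rewards = [1.0, 1.0, 1.0]
print(discounted_return(observed_rewards, gamma=1.0))  # 3.0: no discounting
print(discounted_return(observed_rewards, gamma=0.5))  # 0.875 = 0.5 + 0.25 + 0.125
```

With `gamma=0.5`, the third reward contributes only 0.125 to the total, while the first contributes 0.5.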

The [Wikipedia page on MDP](https://en.wikipedia.org/wiki/Markov_decision_process) provides more details. Note what we said in the third bullet, that the new state only depends on the previous state and the action taken. The assumption is that we can simplify our effort by ignoring all the previous states except the last one and still achieve good results. This is known as the [Markov property](https://en.wikipedia.org/wiki/Markov_property). This assumption often works well and it greatly reduces the resources required.


## The Elements of RL

Here are the elements of RL that expand on MDP concepts (see [Sutton 2018](https://mitpress.mit.edu/books/reinforcement-learning-second-edition) for more details):


#### Policies

Unlike in a fully specified MDP, the **transition function** probabilities are often not known in advance and must be learned. Learning is done through repeated "play", where the agent interacts with the environment.

This makes the **policy** $\pi$ harder to determine. Because the full state space usually can't be completely known, the choice of action $a$ for a given state $s$ almost always remains a stochastic choice rather than a deterministic one, unlike in an MDP.
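
The difference between a stochastic and a deterministic choice of action can be sketched as follows. Here a policy's output for one state is represented as a dictionary of action probabilities. That representation, the function names, and the numbers are assumptions made for illustration, not RLlib's API.

```python
import random


def stochastic_choice(action_probs, rng):
    """Sample an action according to the policy's probabilities for this state."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def deterministic_choice(action_probs):
    """Always return the single most probable action for this state."""
    return max(action_probs, key=action_probs.get)


rng = random.Random(0)
probs_for_state = {"left": 0.3, "right": 0.7}  # hypothetical policy output

print([stochastic_choice(probs_for_state, rng) for _ in range(5)])
print(deterministic_choice(probs_for_state))
```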

#### Reward Signal

The idea of a **reward signal** encapsulates the desired goal for the system and provides feedback for updating the policy based on how well particular events or actions contribute rewards towards the goal.


#### Value Function

The **value function** encapsulates the maximum cumulative reward likely to be achieved starting from a given state for an **episode**. This is harder to determine than the simple reward returned after taking an action. In fact, much of the research in RL over the decades has focused on finding better and more efficient implementations of value functions. To illustrate the challenge, repeatedly taking one sequence of actions may yield low rewards for a while, but eventually provide large rewards. Conversely, always choosing a different sequence of actions may yield a good reward at each step, but be suboptimal for the cumulative reward.
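
The trade-off described above can be illustrated with a tiny, fully known, deterministic problem. The sketch below uses value iteration, a classic technique for small MDPs whose transitions are known. It is not how RLlib estimates values, and the states, actions, and rewards are invented for illustration.

```python
def value_iteration(mdp, gamma=1.0, sweeps=50):
    """Estimate the best achievable cumulative reward from each state.

    `mdp[state][action]` is a `(next_state, reward)` pair. States with no
    actions are terminal and keep a value of 0.
    """
    values = {state: 0.0 for state in mdp}
    for _ in range(sweeps):
        for state, actions in mdp.items():
            if actions:
                values[state] = max(
                    reward + gamma * values[next_state]
                    for next_state, reward in actions.values()
                )
    return values


# "greedy" pays 2 right away and ends the episode, while "patient" pays
# nothing now but leads to a state from which a reward of 10 is available.
toy_mdp = {
    "start": {"greedy": ("end", 2.0), "patient": ("mid", 0.0)},
    "mid": {"continue": ("end", 10.0)},
    "end": {},
}

print(value_iteration(toy_mdp))
```

The immediate reward favors `"greedy"`, but the value of `"start"` is 10 because the best long-term choice is `"patient"`.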


#### Episode

An **episode** is a sequence of steps taken by the agent, starting in an initial state. At each step, the agent observes the current state, chooses the next action, and receives the new reward. Episodes are used both for training policies and for replaying with an existing policy (called a _rollout_).

#### Model

As an optional feature, some RL algorithms develop or use a **model** of the environment to anticipate the resulting states and rewards for future actions. Hence, models are useful for _planning_ scenarios. Methods for solving RL problems that use models are called _model-based methods_, while methods that learn purely by trial and error are called _model-free methods_.

## Reinforcement Learning Example

Let's finish this introduction by learning about the popular "hello world" (1) example environment for RL, called `CartPole`: balancing a pole vertically on a moving cart. Then we'll see how to use RLlib to train a policy using a popular RL algorithm, _Proximal Policy Optimization_, again using `CartPole`.

(1) In books and tutorials on programming languages, it is a tradition that the very first program shown prints the message "Hello World!".

### CartPole and OpenAI

The popular [OpenAI "gym" environment](https://gym.openai.com/) provides MDP interfaces to a variety of simulated environments. Perhaps the most popular for learning RL is `CartPole`, a simple environment that simulates the physics of balancing a pole on a moving cart. The `CartPole` problem is described at https://gym.openai.com/envs/CartPole-v1. Here is an image from that website, where the pole is currently falling to the right, which means the cart will need to move to the right to restore balance:


![Cart Pole](../images/rllib/Cart-Pole.png)

This example fits into the MDP framework as follows:
- The **state** consists of the position and velocity of the cart (moving in one dimension from left to right) as well as the angle and angular velocity of the pole that is balancing on the cart.
- The **actions** are to push the cart to the left or to the right, which decreases or increases the cart's velocity. A negative velocity means it is moving to the left.
- The **transition function** is deterministic and is determined by simulating physical laws: given a **state** and an action, the simulator computes the next state. What is not given is which action to choose for a given state. In the RL context, that choice has to be learned. Hence, we learn a _policy_ that approximates the optimal choice of action that could, in principle, be calculated from the laws of physics.
- The **reward function** is a constant 1 as long as the pole is upright, and 0 once the pole has fallen over. Therefore, maximizing the reward means balancing the pole for as long as possible.
- The **discount factor** in this case can be taken to be 1, meaning we treat the rewards at all time steps equally and don't discount any of them.

More information about the `gym` Python module is available at https://gym.openai.com/. The list of all the available Gym environments is in [this wiki page](https://github.com/openai/gym/wiki/Table-of-environments). We'll use a few more of them and even create our own in subsequent lessons.


In [1]:
import gym
import numpy as np
import pandas as pd
import json

The code below illustrates how to create and manipulate MDPs in Python. An MDP can be created by calling `gym.make`. Gym environments are identified by names like `CartPole-v1`. A **catalog of built-in environments** can be found at https://gym.openai.com/envs.


In [2]:
env = gym.make('CartPole-v1')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v1>>>


Reset the state of the MDP by calling `env.reset()`. This call returns the initial state of the MDP.


In [3]:
state = env.reset()
print('The starting state is:', state)

The starting state is: [ 0.00862049 -0.02136743 -0.03003316  0.04897816]



Recall that the state is the position of the cart, its velocity, the angle of the pole, and the angular velocity of the pole.
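
Because the state is a bare array, it can help to give the four entries names while exploring. The sketch below wraps the starting state printed above in a `namedtuple`. The type name `CartPoleState` and its field names are our own labels, not part of `gym`.

```python
from collections import namedtuple

# Hypothetical labels for the four entries of a CartPole state, in order.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# The starting state printed above, copied by hand.
raw_state = [0.00862049, -0.02136743, -0.03003316, 0.04897816]
named_state = CartPoleState(*raw_state)

print(named_state)
print("Pole angle:", named_state.pole_angle)
```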


The `env.step` method takes an action. In the case of the `CartPole` environment, the appropriate actions are 0 or 1, for pushing the cart to the left or right, respectively. `env.step()` returns a tuple of four things:
1. the new state of the environment
2. a reward
3. a boolean indicating whether the simulation has finished
4. a dictionary of miscellaneous extra information

Let's show what happens if we take one step with an action of 0.


In [4]:
action = 0
state, reward, done, info = env.step(action)
print(state, reward, done, info)

[ 0.00819315 -0.21604615 -0.0290536   0.33203612] 1.0 False {}


A **rollout** is a simulation of a policy in an environment. It is used both during training and when running simulations with a trained policy. 

The code below performs a rollout in a given environment. It takes **random actions** until the simulation has finished and returns the cumulative reward.


In [5]:
def random_rollout(env):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose a random action (either 0 or 1).
        action = np.random.choice([0, 1])
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
    
    # Return the cumulative reward.
    return cumulative_reward    

Try rerunning the following cell a few times. How much do the answers change? Note that the maximum possible reward for `CartPole` is 500. You'll probably get numbers under 200.


In [27]:
reward = random_rollout(env)
print(reward)
reward = random_rollout(env)
print(reward)

47.0
13.0


### Exercise 1

Choosing actions at random in `random_rollout` is not a very effective policy, as the previous results showed. Finish implementing the `rollout_policy` function below, which takes an environment *and* a policy. Recall that the *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, like we just did (with poor results), the action should be chosen **with the policy** (as a function of the state).

> **Note:** Exercise solutions for this tutorial can be found [here](solutions/Ray-RLlib-Solutions.ipynb).


In [28]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # EXERCISE: Fill out this function by copying the appropriate part of 'random_rollout'
    # and modifying it to choose the action using the policy.
    # --------------------------------------------------------------------------
    while not done:
        # Choose an action using the policy.
        action = policy(state)
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward

    # raise NotImplementedError
    # --------------------------------------------------------------------------

    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

The first sample policy got an average reward of 9.37.
The second sample policy got an average reward of 28.74.
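
Both sample policies look only at the cart position, `state[0]`. As a further experiment, here is a sketch of a hand-written policy that looks at the pole angle instead and pushes the cart toward the side the pole is leaning. The name `angle_policy` is hypothetical, and the sketch assumes that a positive `state[2]` means the pole leans to the right and that action 1 pushes the cart to the right.

```python
def angle_policy(state):
    """Push the cart toward the side the pole is leaning.

    Assumes state[2] is the pole angle, with positive values meaning
    the pole leans to the right, and that action 1 pushes right.
    """
    return 1 if state[2] > 0 else 0


# Example states: only the third entry (the pole angle) matters here.
print(angle_policy([0.0, 0.0, 0.05, 0.0]))   # pole leaning right -> push right (1)
print(angle_policy([0.0, 0.0, -0.05, 0.0]))  # pole leaning left  -> push left (0)
```

It can be passed to the function from the exercise in the same way as the sample policies, for example `rollout_policy(env, angle_policy)`, to compare its average reward with theirs.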


We'll return to `CartPole` in lesson [01: Application Cart Pole](explore-rllib/01-Application-Cart-Pole.ipynb) in the `explore-rllib` section.

### RLlib Reinforcement Learning Example: Cart Pole with Proximal Policy Optimization

This section demonstrates how to use the _proximal policy optimization_ (PPO) algorithm implemented by [RLlib](http://rllib.io). PPO is a popular way to develop a policy. RLlib also uses [Ray Tune](http://tune.io), the Ray Hyperparameter Tuning framework, which is covered in the [Ray Tune Tutorial](../ray-tune/00-Ray-Tune-Overview.ipynb).

We'll provide relatively little explanation of **RLlib** concepts for now, but explore them in greater depth in subsequent lessons. For more on RLlib, see the documentation at http://rllib.io.

PPO is described in detail in [this paper](https://arxiv.org/abs/1707.06347). It is a variant of _Trust Region Policy Optimization_ (TRPO) described in [this earlier paper](https://arxiv.org/abs/1502.05477). [This OpenAI post](https://openai.com/blog/openai-baselines-ppo/) provides a more accessible introduction to PPO.

PPO works in two phases. In the first phase, a large number of rollouts are performed in parallel. The rollouts are then aggregated on the driver and a surrogate optimization objective is defined based on those rollouts. In the second phase, we use SGD (_stochastic gradient descent_) to find the policy that maximizes that objective with a penalty term for diverging too much from the current policy.
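
To give a feel for the "objective with a penalty term", here is a heavily simplified sketch for a single state with a discrete set of actions. It follows the KL-penalty form described in the PPO paper, but it is not RLlib's implementation. The function names, the penalty weight `beta`, and the sample probabilities are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_surrogate(old_probs, new_probs, action, advantage, beta):
    """Surrogate objective for one sampled action, minus a divergence penalty.

    `advantage` says how much better the sampled action was than expected.
    The ratio rewards raising the probability of good actions, while the
    KL term penalizes moving too far from the current (old) policy.
    """
    ratio = new_probs[action] / old_probs[action]
    return ratio * advantage - beta * kl_divergence(old_probs, new_probs)


old = [0.5, 0.5]
unchanged = penalized_surrogate(old, [0.5, 0.5], action=0, advantage=2.0, beta=0.2)
shifted = penalized_surrogate(old, [0.8, 0.2], action=0, advantage=2.0, beta=0.2)
print(unchanged, shifted)
```

Raising the probability of an action with a positive advantage increases the objective, while the penalty grows as the new policy moves away from the old one.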

![PPO](../images/rllib/ppo.png)

> **NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `num_gpus` field of the `config` dictionary. Hence, for normal usage, one or more GPUs are recommended.

(The original version of this example can be found [here](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/)).


In [8]:
# import gym  # imported above already, but listed here for completeness
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

The following script checks if the Ray cluster is already running. If not, it tells you what to do to start Ray.

In [20]:
#!../tools/start-ray.sh --check --verbose



Now start Ray in this "driver" process. This must be done before we instantiate any RL agents.

In [11]:
# ray.init(address='auto', ignore_reinit_error=True, log_to_driver=False)



In [10]:
ray.init()

2020-08-16 12:52:32,789	INFO resource_spec.py:212 -- Starting Ray with 4.79 GiB memory available for workers and up to 2.4 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-16 12:52:34,188	INFO services.py:1165 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.0.4',
 'raylet_ip_address': '192.168.0.4',
 'redis_address': '192.168.0.4:6379',
 'object_store_address': 'tcp://127.0.0.1:64523',
 'raylet_socket_name': 'tcp://127.0.0.1:63464',
 'webui_url': 'localhost:8265',
 'session_dir': 'C:\\Users\\LG\\AppData\\Local\\Temp\\ray\\session_2020-08-16_12-52-32_779317_31072'}




> **Tip:** Having trouble starting Ray? See the [Troubleshooting](../reference/Troubleshooting-Tips-Tricks.ipynb) tips.

The next cell prints the URL for the Ray Dashboard. **This is only correct if you are running this tutorial on a laptop.** Click the link to open the dashboard.

If you are running on the Anyscale platform, use the URL provided by your instructor to open the Dashboard.

In [12]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Instantiate a PPOTrainer object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used. In a cluster, these actors will be spread over the available nodes.
- `num_sgd_iter` is the number of epochs of SGD (stochastic gradient descent, i.e., passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO, for each _minibatch_ ("chunk") of training data. Using minibatches is more efficient than training with one record at a time.
- `sgd_minibatch_size` is the SGD minibatch size (batches of data) that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers. Here, we have two hidden layers of size 100, each.
- `num_cpus_per_worker`, when set to 0, prevents Ray from pinning a CPU core to each worker. With pinning, we could run out of CPU cores for workers in a constrained environment like a laptop or a cloud VM.
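
As rough arithmetic for how `num_sgd_iter` and `sgd_minibatch_size` interact, the sketch below estimates how many minibatch gradient updates one training iteration performs. The helper name is hypothetical, the batch size of 4000 samples is an assumed value for illustration (check `config['train_batch_size']` for the value actually in use), and RLlib's exact handling of a partial final minibatch may differ.

```python
import math


def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Estimate the number of minibatch updates in one training iteration.

    Each epoch passes over the whole training batch in minibatches, and
    `num_sgd_iter` epochs are run per iteration.
    """
    minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
    return minibatches_per_epoch * num_sgd_iter


# Assumed batch of 4000 samples, with the minibatch size and epochs used in this lesson.
print(sgd_updates_per_iteration(4000, 128, 30))  # 32 minibatches * 30 epochs = 960
```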


In [13]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 1
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0 

In [14]:
agent = PPOTrainer(config, 'CartPole-v1')

2020-08-16 12:53:12,282	ERROR syncer.py:46 -- Log sync requires rsync to be installed.
2020-08-16 12:53:12,286	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-08-16 12:53:12,287	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Now let's train the policy on the `CartPole-v1` environment for `N` training iterations. The result dictionary returned by each call to `agent.train()` contains a lot of information we'll inspect below. For now, we'll extract the information we'll graph, such as `episode_reward_mean`. The _mean_ values are more useful for determining successful training.

In [15]:
N=10
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min':  result['episode_reward_min'],  
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max':  result['episode_reward_max'],  
               'episode_len_mean':    result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}')

  0: Min/Mean/Max reward:   9.0000/ 22.0389/ 71.0000
  1: Min/Mean/Max reward:  11.0000/ 37.2037/120.0000
  2: Min/Mean/Max reward:  11.0000/ 53.9600/153.0000
  3: Min/Mean/Max reward:  14.0000/ 83.5700/323.0000
  4: Min/Mean/Max reward:  14.0000/112.4800/342.0000
  5: Min/Mean/Max reward:  14.0000/144.4800/500.0000
  6: Min/Mean/Max reward:  14.0000/178.7600/500.0000
  7: Min/Mean/Max reward:  14.0000/210.0000/500.0000
  8: Min/Mean/Max reward:  14.0000/241.3100/500.0000
  9: Min/Mean/Max reward:  21.0000/277.6800/500.0000


Now let's convert the episode data to a Pandas `DataFrame` for easy manipulation. The results indicate how much reward the policy is receiving (`episode_reward_*`) and how many time steps of the environment the policy ran (`episode_len_mean`). The maximum possible reward for this problem is `500`. The reward mean and the mean episode length match because the agent receives a reward of one for every time step that it survives. However, this is specific to this environment and not true in general.
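
The link between reward and episode length can be checked with a tiny worked example. This sketch uses made-up reward sequences, not output from CartPole: with a reward of one per step, the episode return equals the episode length, while a different reward scheme breaks that equality.

```python
def episode_return(rewards):
    """Undiscounted sum of the per-step rewards of one episode."""
    return sum(rewards)

# CartPole-style episode: reward of 1.0 for each of 22 surviving steps.
survival_rewards = [1.0] * 22
# Hypothetical environment that only pays out on the final step.
sparse_rewards = [0.0] * 21 + [10.0]

print(episode_return(survival_rewards), len(survival_rewards))  # return equals length
print(episode_return(sparse_rewards), len(sparse_rewards))      # return differs from length
```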

In [16]:
df = pd.DataFrame(data=episode_data)
df

| | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | 9.0 | 22.038889 | 71.0 | 22.038889 |
| 1 | 1 | 11.0 | 37.203704 | 120.0 | 37.203704 |
| 2 | 2 | 11.0 | 53.96 | 153.0 | 53.96 |
| 3 | 3 | 14.0 | 83.57 | 323.0 | 83.57 |
| 4 | 4 | 14.0 | 112.48 | 342.0 | 112.48 |
| 5 | 5 | 14.0 | 144.48 | 500.0 | 144.48 |
| 6 | 6 | 14.0 | 178.76 | 500.0 | 178.76 |
| 7 | 7 | 14.0 | 210.0 | 500.0 | 210.0 |
| 8 | 8 | 14.0 | 241.31 | 500.0 | 241.31 |
| 9 | 9 | 21.0 | 277.68 | 500.0 | 277.68 |


In [17]:
df.columns.tolist()

['n',
 'episode_reward_min',
 'episode_reward_mean',
 'episode_reward_max',
 'episode_len_mean']

Let's plot the data.

In [18]:
import sys
sys.path.append("..")
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

In [19]:
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

Since the length and reward means are equal, we'll only plot one line:

In [20]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                      title='Episode Rewards', x_axis_label='n', y_axis_label='reward')

![Cart-Pole Episode Rewards](../images/rllib/Cart-Pole-Episode-Rewards.png)

The model quickly reaches the maximum episode reward of 500, but the mean is the more valuable measure. After 10 training iterations, the mean reward is more than halfway to the maximum.
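
The "more than halfway" observation is easy to check against the mean rewards printed by the training loop above. This small sketch copies those ten values by hand and expresses the last one as a fraction of the maximum reward of 500.

```python
# Mean episode rewards copied from the training output above (iterations 0-9).
mean_rewards = [22.0389, 37.2037, 53.96, 83.57, 112.48,
                144.48, 178.76, 210.0, 241.31, 277.68]
MAX_REWARD = 500.0

# Fraction of the maximum reached by the final iteration.
progress = mean_rewards[-1] / MAX_REWARD
# Improvement of the mean from one iteration to the next.
gains = [later - earlier for earlier, later in zip(mean_rewards, mean_rewards[1:])]

print(f'{progress:.1%} of the maximum reward')
print(all(gain > 0 for gain in gains))
```

In this run the mean also improved on every iteration, which is not guaranteed in general.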

For reference, here are two views of the complete result returned by one training iteration. First, a "pretty print" output.

> **Tip:** The output will be long. When this happens for a cell, right click and select _Enable scrolling for outputs_.

In [21]:
print(pretty_print(results[-1]))

custom_metrics: {}
date: 2020-08-16_12-54-45
done: false
episode_len_mean: 277.68
episode_reward_max: 500.0
episode_reward_mean: 277.68
episode_reward_min: 21.0
episodes_this_iter: 9
episodes_total: 449
experiment_id: 257cac17466e46329227635856c5b9c8
hostname: Jungyeon
info:
  learner:
    default_policy:
      cur_kl_coeff: 0.07500000298023224
      cur_lr: 4.999999873689376e-05
      entropy: 0.5292267203330994
      entropy_coeff: 0.0
      kl: 0.005547081585973501
      model: {}
      policy_loss: -0.004298275336623192
      total_loss: 658.5515747070312
      vf_explained_var: 0.00033683545188978314
      vf_loss: 658.5554809570312
  num_steps_sampled: 40000
  num_steps_trained: 40000
iterations_since_restore: 10
node_ip: 192.168.0.4
num_healthy_workers: 1
off_policy_estimator: {}
perf:
  cpu_util_percent: 34.78181818181818
  ram_util_percent: 52.57272727272728
pid: 31072
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
  mean_env_wait_ms: 0.081441

We'll learn about more of these values as we continue the tutorial.

Second, here is the whole, long result dictionary, which includes the historical stats about episode rewards and lengths:

In [22]:
results[-1]

{'episode_reward_max': 500.0,
 'episode_reward_min': 21.0,
 'episode_reward_mean': 277.68,
 'episode_len_mean': 277.68,
 'episodes_this_iter': 9,
 'policy_reward_min': {},
 'policy_reward_max': {},
 'policy_reward_mean': {},
 'custom_metrics': {},
 'hist_stats': {'episode_reward': [500.0,
   500.0,
   500.0,
   500.0,
   388.0,
   345.0,
   500.0,
   500.0,
   500.0,
   323.0,
   156.0,
   96.0,
   21.0,
   186.0,
   78.0,
   155.0,
   163.0,
   219.0,
   56.0,
   144.0,
   161.0,
   177.0,
   135.0,
   130.0,
   153.0,
   130.0,
   79.0,
   110.0,
   114.0,
   31.0,
   166.0,
   124.0,
   181.0,
   117.0,
   42.0,
   145.0,
   159.0,
   202.0,
   103.0,
   206.0,
   293.0,
   188.0,
   313.0,
   86.0,
   158.0,
   220.0,
   228.0,
   193.0,
   224.0,
   183.0,
   177.0,
   39.0,
   264.0,
   144.0,
   342.0,
   124.0,
   200.0,
   199.0,
   206.0,
   117.0,
   494.0,
   294.0,
   500.0,
   281.0,
   500.0,
   294.0,
   353.0,
   394.0,
   220.0,
   256.0,
   164.0,
   22.0,
   306.0,


Let's plot the `episode_reward` values:

In [23]:
episode_rewards = results[-1]['hist_stats']['episode_reward']
df_episode_rewards = pd.DataFrame(data={'episode':range(len(episode_rewards)), 'reward':episode_rewards})
plot_line(df_episode_rewards, x_col='episode', y_col='reward', title='Episode Rewards', x_axis_label='episode', y_axis_label='reward')

![Cart-Pole Episode Rewards](../images/rllib/Cart-Pole-Episode-Rewards2.png)

For a well-trained model, most episodes do very well, while occasional episodes do poorly. Try plotting the episode rewards from other training iterations by changing the index in `results[-1]` to another number between `0` and `9`. (The length of `results` is `10`.)
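
Besides plotting, a few summary statistics make the "mostly good, occasionally poor" pattern concrete. The sketch below uses only the first ten `episode_reward` values from the output above as a small sample; in the notebook you would pass the full `episode_rewards` list instead. The helper name `summarize_rewards` is an illustration, not an RLlib function.

```python
from statistics import mean, median

def summarize_rewards(rewards, max_reward=500.0):
    """Return basic statistics for a list of per-episode rewards."""
    return {
        'mean': mean(rewards),
        'median': median(rewards),
        'min': min(rewards),
        # Share of episodes that reached the maximum possible reward.
        'fraction_at_max': sum(r >= max_reward for r in rewards) / len(rewards),
    }

# First ten values of hist_stats['episode_reward'] shown above.
sample = [500.0, 500.0, 500.0, 500.0, 388.0, 345.0, 500.0, 500.0, 500.0, 323.0]
summary = summarize_rewards(sample)
print(summary)
```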

### Exercise 2

The current network and training configuration are too large and heavy-duty for a simple problem like `CartPole`. Modify the configuration to use a smaller network (the `config['model']['fcnet_hiddens']` setting) and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [24]:
# Make edits here:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0

agent = PPOTrainer(config, 'CartPole-v1')

2020-08-16 12:54:51,234	ERROR syncer.py:46 -- Log sync requires rsync to be installed.


Train the agent and try to get a reward of 200. If it's training too slowly, you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around `N` = 20 or 30 training iterations.
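
Instead of guessing `N`, one option is to stop as soon as the mean reward reaches the target. The sketch below shows that loop shape with a hypothetical `FakeTrainer` whose `train()` returns a dictionary with an `episode_reward_mean` key, as `agent.train()` does above. The fake's reward growth is invented purely so the example runs without Ray, and `train_until` is a made-up helper name.

```python
class FakeTrainer:
    """Hypothetical stand-in for a trainer; the reward curve is made up."""
    def __init__(self):
        self.iteration = 0

    def train(self):
        self.iteration += 1
        # Invented curve: the mean reward grows by 30 per iteration.
        return {'episode_reward_mean': 30.0 * self.iteration}

def train_until(trainer, target_reward, max_iterations):
    """Call train() until the mean reward reaches the target or the budget runs out."""
    history = []
    for _ in range(max_iterations):
        result = trainer.train()
        history.append(result['episode_reward_mean'])
        if history[-1] >= target_reward:
            break
    return history

history = train_until(FakeTrainer(), target_reward=200.0, max_iterations=30)
print(len(history), history[-1])
```

In the notebook, passing `agent` in place of `FakeTrainer()` would apply the same stopping rule to the real trainer.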

In [25]:
N=5
results = []
episode_data = []
episode_json = []
for n in range(N):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    print(f'Max reward: {episode["episode_reward_max"]}')

Max reward: 78.0
Max reward: 101.0
Max reward: 217.0
Max reward: 364.0
Max reward: 364.0


## Using Checkpoints

Checkpointing a trainer saves its current state, that is, what it has learned so far. Checkpoints are used for subsequent _rollouts_ and also to continue training later from a known-good state. Calling `agent.save()` creates the checkpoint and returns the path to the checkpoint file, which can be used later to restore that state in a new trainer. Here we'll load the trained policy in the same process, but often it would be loaded in a new process, for example on a production serving cluster that is separate from the training cluster.
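
The save-then-restore round trip can be illustrated without RLlib. The following toy sketch uses `pickle` and a hypothetical `ToyTrainer` class. It is not how RLlib stores checkpoints internally; it only illustrates the idea that `save()` returns a path and `restore(path)` brings a fresh object back to the saved state.

```python
import os
import pickle
import tempfile

class ToyTrainer:
    """Hypothetical trainer whose entire state is an iteration count and one weight."""
    def __init__(self):
        self.iteration = 0
        self.weight = 0.0

    def train(self):
        self.iteration += 1
        self.weight += 0.5  # pretend learning step

    def save(self, directory):
        """Write the state to a file and return its path."""
        path = os.path.join(directory, f'checkpoint-{self.iteration}')
        with open(path, 'wb') as f:
            pickle.dump({'iteration': self.iteration, 'weight': self.weight}, f)
        return path

    def restore(self, path):
        """Load state previously written by save()."""
        with open(path, 'rb') as f:
            state = pickle.load(f)
        self.iteration = state['iteration']
        self.weight = state['weight']

with tempfile.TemporaryDirectory() as tmp_dir:
    trainer = ToyTrainer()
    for _ in range(5):
        trainer.train()
    checkpoint = trainer.save(tmp_dir)

    # A brand-new object picks up where the first one left off.
    restored = ToyTrainer()
    restored.restore(checkpoint)

print(restored.iteration, restored.weight)
```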

In [26]:
checkpoint_path = agent.save()
print(checkpoint_path)

C:\Users\LG/ray_results\PPO_CartPole-v1_2020-08-16_12-54-51qk_ig89p\checkpoint_5\checkpoint-5


Now load the checkpoint in a new trainer:

In [22]:
trained_config = config.copy()
test_agent = PPOTrainer(trained_config, 'CartPole-v1')
test_agent.restore(checkpoint_path)

2020-08-12 21:55:01,143	ERROR syncer.py:46 -- Log sync requires rsync to be installed.
2020-08-12 21:55:06,454	INFO trainable.py:423 -- Restored on 192.168.0.4 from checkpoint: C:\Users\LG/ray_results\PPO_CartPole-v1_2020-08-12_21-54-259jnj5anj\checkpoint_5\checkpoint-5
2020-08-12 21:55:06,456	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 29.93517565727234, '_episodes_total': 400}


Use the previously trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)`, which uses the trained policy to choose an action. This is an example of a _rollout_, which we'll study in a subsequent lesson.

Verify that the cumulative reward roughly matches the rewards printed during training above. If training reached the target, it will be around 200 or higher, although a single episode can vary considerably.

In [23]:
env = gym.make('CartPole-v1')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)  # key line; get the next action
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

367.0
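
A single episode is a noisy estimate (here 367.0 against a training target of about 200), so averaging several rollouts gives a steadier comparison. The sketch below wraps the loop above in a function and runs it on a hypothetical `ToyEnv` with a constant dummy policy, so that it is runnable without Gym or Ray. `ToyEnv` mimics only the `reset()`/`step()` shape used in this lesson, and its termination rule is invented.

```python
import random

class ToyEnv:
    """Hypothetical environment: reward 1.0 per step, ends with probability 0.1 per step."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # dummy state

    def step(self, action):
        done = self.rng.random() < 0.1
        return 0, 1.0, done, {}

def rollout(env, choose_action):
    """Run one episode and return its cumulative reward."""
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:
        action = choose_action(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward

env = ToyEnv(seed=0)
rewards = [rollout(env, lambda state: 0) for _ in range(20)]
print(sum(rewards) / len(rewards))
```

In the notebook, the analogous call would pass the `CartPole-v1` environment and `test_agent.compute_action` as the two arguments.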


In [24]:
ray.shutdown()  # "Undo ray.init()".

The next lesson, [02: Introduction to RLlib](02-Introduction-to-RLlib.ipynb), steps back to introduce RLlib, its goals, and the capabilities it provides.