# Ray RLlib Multi-Armed Bandits - A Simple Bandit Example

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with three arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [1]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import random, time
import ray

We define the bandit as a subclass of an OpenAI Gym environment. We set the action space to three discrete actions, one for each arm, and the observation space (the context) to a pair of values, each in the range -1.0 to 1.0, inclusive. (See the [configuring environments](https://docs.ray.io/en/latest/rllib-env.html#configuring-environments) documentation for more details about creating custom environments.)

There are two contexts defined. Note that we'll randomly pick one of them to use when `reset` is called, but it stays fixed (static) throughout the episode (the set of steps between calls to `reset`).

In [2]:
class SimpleContextualBandit(gym.Env):
    def __init__(self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        # The observation is the 2-vector [-context, context], so each entry is -1. or 1.
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)
        self.current_context = None
        self.rewards_for_context = {        # 2 contexts: -1. and 1., one reward per arm
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset(self):
        # Pick the context at random; the observation then stays fixed for the episode.
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step(self, action):
        reward = self.rewards_for_context[self.current_context][action]
        # Every step returns done=True; regret is measured against the best reward, 10.
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBandit(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

Look at the definition of `self.rewards_for_context`. For context `-1.`, choosing the **third** arm (index 2 in the array) maximizes the reward, yielding `10.0` for each pull. Similarly, for context `1.`, choosing the **first** arm (index 0 in the array) maximizes the reward. It is never advantageous to choose the second arm.

We'll see if our training results agree ;)
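
The optimal arm for each context can also be read off the reward table programmatically. The following is a small standalone sketch; it copies the table out of the class, and the name `best_arm` is introduced here only for illustration.

```python
# Copy of the reward table from SimpleContextualBandit.
rewards_for_context = {
    -1.0: [-10, 0, 10],
    1.0: [10, 0, -10],
}

# For each context, find the index of the arm with the largest reward.
best_arm = {
    context: max(range(len(rewards)), key=rewards.__getitem__)
    for context, rewards in rewards_for_context.items()
}

print(best_arm)  # {-1.0: 2, 1.0: 0}
```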

Rerun the next two code cells several times, until you have seen `current_context` take both values, `1.0` and `-1.0`. The value is chosen at random in `reset()`.

In [3]:
bandit = SimpleContextualBandit()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [-1.  1.], bandit = SimpleContextualBandit(action_space=Discrete(3), observation_space=Box(2,), current_context=1.0, rewards per context={-1.0: [-10, 0, 10], 1.0: [10, 0, -10]})'

The `bandit.current_context` and the observation of the current environment will remain fixed through the episode.

In [4]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

current_context = 1.0
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], action = 2, reward =  -10, done = True , info = {'regret': 20}
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], action = 0, reward =   10, done = True , info = {'regret': 0}
observation = [-1.  1.], action = 2, reward =  -10, done = True , info = {'regret': 20}
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], action = 0, reward =   10, done = True , info = {'regret': 0}
observation = [-1.  1.], action = 1, reward =    0, done = True , info = {'regret': 10}


Look at the `current_context`. If it's `1.0`, does the `0` (first) action yield the highest reward and lowest regret? If it's `-1.0`, does the `2` (third) action yield the highest reward and lowest regret? The `1` (second) action always returns `0` reward, so it's never optimal. 
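
To check these answers without rerunning the cells, the regret of every arm can be tabulated from the reward table. This is a minimal sketch: `regret_for_context` and `random_policy_regret` are names made up for this example, and regret is computed as the best reward in a context minus the reward received, which matches the `10 - reward` used in `step` because the best reward is `10` in both contexts.

```python
# Copy of the reward table from SimpleContextualBandit.
rewards_for_context = {
    -1.0: [-10, 0, 10],
    1.0: [10, 0, -10],
}

# Regret of each arm: best achievable reward in the context minus the arm's reward.
regret_for_context = {
    context: [max(rewards) - reward for reward in rewards]
    for context, rewards in rewards_for_context.items()
}

# Average regret per pull when arms are sampled uniformly at random,
# as action_space.sample() does in the cell above.
random_policy_regret = {
    context: sum(regrets) / len(regrets)
    for context, regrets in regret_for_context.items()
}

print(regret_for_context)    # {-1.0: [20, 10, 0], 1.0: [0, 10, 20]}
print(random_policy_regret)  # {-1.0: 10.0, 1.0: 10.0}
```

A trained policy should do much better than the random policy's average regret of `10` per pull.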

## Using LinUCB

For this simple example, we can easily determine the best actions to take. Let's see how well our system does. We'll train with [LinUCB](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb), a linear version of _Upper Confidence Bound_, for the exploration-exploitation strategy. _LinUCB_ assumes a linear dependency between the expected reward of an action and its context. Recall that a linear function is of the form $z = ax + by + c$, for example, where $x$, $y$, and $z$ are variables and $a$, $b$, and $c$ are constants. _LinUCB_ models the representation space using a set of linear predictors. Hence, the $Q_t(a)$ _value_ function discussed for UCB in the [previous lesson](02-Exploration-vs-Exploitation-Strategies.ipynb) is assumed to be a linear function here.
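
To make the idea concrete, here is a minimal sketch of the textbook "disjoint" LinUCB rule, run against the same reward table as our bandit. It is an illustration only and should not be read as RLlib's `contrib/LinUCB` implementation. The names `scores`, `A`, `b`, and `ALPHA`, the value `ALPHA = 1.0`, and the fixed random seed are all assumptions made for this example.

```python
import random
import numpy as np

REWARDS = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}  # same table as the bandit
N_ARMS, DIM = 3, 2
ALPHA = 1.0  # exploration weight (assumed value, not taken from RLlib)

# Per-arm ridge-regression statistics: A = I + sum(x x^T), b = sum(reward * x).
A = [np.eye(DIM) for _ in range(N_ARMS)]
b = [np.zeros(DIM) for _ in range(N_ARMS)]

def scores(x, alpha):
    """Linear reward estimate plus alpha times the confidence width, per arm."""
    result = []
    for arm in range(N_ARMS):
        A_inv = np.linalg.inv(A[arm])
        theta = A_inv @ b[arm]  # this arm's linear predictor
        result.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
    return np.array(result)

rng = random.Random(0)  # fixed seed so the run is repeatable
total_regret = 0
for _ in range(200):
    context = rng.choice([-1.0, 1.0])
    x = np.array([-context, context])  # same observation encoding as the bandit
    arm = int(np.argmax(scores(x, ALPHA)))
    reward = REWARDS[context][arm]
    total_regret += 10 - reward
    # Update only the statistics of the arm that was pulled.
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# With alpha=0 the scores are the pure reward estimates (no exploration bonus).
greedy_arm = {
    c: int(np.argmax(scores(np.array([-c, c]), alpha=0.0))) for c in (-1.0, 1.0)
}
print(greedy_arm, total_regret)
```

Each arm keeps its own linear predictor `theta`, and the square-root term shrinks as an arm is pulled more often, so exploration fades once the estimates are reliable.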


---------------

Look again at how we defined `rewards_for_context`. Is it linear as expected for _LinUCB_?

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

Yes, for each arm, the reward is linear in the context. For example, the first arm has a reward of `-10` for context `-1.0` and `10` for context `1.0`, that is, `10` times the context. Crucially, the _same_ linear function that works for the first arm also works for the other two arms if you multiply its constants by `0` and `-1`, respectively. Hence, we expect _LinUCB_ to work well for this example.
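
This linearity claim can be verified by hand. The sketch below assumes an intercept-free linear model over the observation `[-context, context]`. The weights `(-5, 5)` are hand-picked for illustration (other choices reproduce the same rewards), and `base_weights`, `arm_multipliers`, and `predicted_reward` are hypothetical names, not part of RLlib.

```python
# Copy of the reward table from SimpleContextualBandit.
rewards_for_context = {
    -1.0: [-10, 0, 10],
    1.0: [10, 0, -10],
}

base_weights = (-5.0, 5.0)    # hand-picked weights that reproduce the first arm
arm_multipliers = (1, 0, -1)  # scale factors for the first, second, and third arm

def predicted_reward(context, arm):
    """Dot product of the scaled weights with the observation [-context, context]."""
    observation = (-context, context)
    dot = sum(w * o for w, o in zip(base_weights, observation))
    return arm_multipliers[arm] * dot

# Compare the linear predictions with the table for every context and arm.
matches = all(
    predicted_reward(context, arm) == reward
    for context, rewards in rewards_for_context.items()
    for arm, reward in enumerate(rewards)
)
print(matches)  # True
```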

Now use Tune to train the policy for this bandit. But first, start Ray on your laptop, or connect to the running Ray cluster if you are working on the Anyscale platform.

In [5]:
ray.init(ignore_reinit_error=True)

2020-08-29 15:10:09,704	INFO resource_spec.py:231 -- Starting Ray with 3.37 GiB memory available for workers and up to 1.71 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-29 15:10:10,406	INFO services.py:1193 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.2.60',
 'raylet_ip_address': '192.168.2.60',
 'redis_address': '192.168.2.60:29414',
 'object_store_address': '/tmp/ray/session_2020-08-29_15-10-09_702756_6627/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-08-29_15-10-09_702756_6627/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-08-29_15-10-09_702756_6627'}

In [6]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit,
}

In [7]:
from ray.tune.progress_reporter import JupyterNotebookReporter

Calling `ray.tune.run` below would handle Ray initialization for us, if Ray isn't already running. If you want to prevent this and have Tune exit with an error when Ray isn't already initialized, then pass `ray_auto_init=False`.

In [8]:
analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop,
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=2,  # Change to 0 or 1 to reduce the output.
    # ray_auto_init=False,  # Uncomment to make Tune fail if Ray isn't already initialized.
    )



| Trial name | status | loc |
|---|---|---|
| contrib_LinUCB_SimpleContextualBandit_4e4b4_00000 | RUNNING | |


[2m[36m(pid=6699)[0m 2020-08-29 15:10:17,109	INFO trainer.py:632 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for contrib_LinUCB_SimpleContextualBandit_4e4b4_00000:
  custom_metrics: {}
  date: 2020-08-29_15-10-17
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: fd90e5ceb3874e6f94eba2ef888dd7ae
  experiment_tag: '0'
  hostname: k-14z970-gr30k
  info:
    learner:
      default_policy:
        cumulative_regret: 10
        update_latency: 0.00042057037353515625
    num_steps_sampled: 100
    num_steps_trained: 100
  iterations_since_restore: 1
  node_ip: 192.168.2.60
  num_healthy_workers: 0
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 46.8
    ram_util_percent: 42.1
  pid: 6699
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 0.04692596964316795
    mean_inference_ms: 0.8593526217016847
    mean_processing_ms: 0.5114409002927269
  time_since_restore: 0.20624709129333496
  time_this_

| Trial name | status | loc | iter | total time (s) | ts | reward |
|---|---|---|---|---|---|---|
| contrib_LinUCB_SimpleContextualBandit_4e4b4_00000 | TERMINATED | | 2 | 0.436876 | 200 | 10 |


(A lot of output is printed with `verbose` set to `2`. Use `0` for no output and `1` for short summaries.)

How long did it take?

In [12]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

   2.18 seconds,    0.04 minutes


We can see some of the final data as a dataframe:

In [14]:
df = analysis.dataframe()
df

| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | num_healthy_workers | timesteps_total | done | episodes_total | training_iteration | ... | timers/learn_time_ms | timers/learn_throughput | info/num_steps_sampled | info/num_steps_trained | perf/cpu_util_percent | perf/ram_util_percent | info/learner/default_policy/cumulative_regret | info/learner/default_policy/update_latency | config/env | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.0 | 10.0 | 10.0 | 1.0 | 100 | 0 | 200 | True | 200 | 2 | ... | 0.827 | 1208.907 | 200 | 200 | | | 10 | 0.000618 | `<class '__main__.SimpleContextualBandit'>` | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |


The easiest way to inspect the progression of training is to use TensorBoard.

1. If you are running on the Anyscale Platform, click the _TensorBoard_ link.
2. If you are running this notebook on a laptop, open a terminal window using the `+` under the _Edit_ menu, run the following command, then open the URL shown.

```
tensorboard --logdir ~/ray_results 
```

You may have many data sets plotted from previous tutorial lessons. In the _Runs_ on the left, look for one named something like this:

```
contrib/LinUCB/contrib_LinUCB_SimpleContextualBandit_0_YYYY-MM-DD_HH-MM-SSxxxxxxxx  
```

If you have several of them, you want the one with the latest timestamp. To select just that one, click _toggle all runs_ below the list of runs, then select the one you want. You should see something like [this image](../../images/rllib/TensorBoard1.png).

The graph for the metric we were optimizing, the mean reward, is shown with a rectangle surrounding it. It improved steadily during training. For this simple example, the mean reward reaches its maximum of `10` within 200 steps.


## Exercise 1

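Change the `step` method to randomly change the `current_context` on each invocation:

```python
def step(self, action):
    # Compute the reward for the context that was in effect when the arm was pulled.
    reward = self.rewards_for_context[self.current_context][action]
    # Then pick a new context at random for the next observation.
    self.current_context = random.choice([-1., 1.])
    return (np.array([-self.current_context, self.current_context]), reward, True,
            {
                "regret": 10 - reward
            })
```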

Repeat the training and analysis. Does the training behavior change in any appreciable way? Why or why not?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.

## Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

## Exercise 3 (Optional)

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using its linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop, 
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=1)
```

Change `contrib/LinUCB` to `contrib/LinTS`.  

We'll continue exploring usage of _LinUCB_ in the next lesson, [04 Linear Upper Confidence Bound](04-Linear-Upper-Confidence-Bound.ipynb) and _LinTS_ in the following lesson, [05 Thompson Sampling](05-Linear-Thompson-Sampling.ipynb).

In [11]:
ray.shutdown()  # "Undo ray.init()".