# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) as the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)). _LinUCB_ assumes a linear dependency between the expected reward of an action and its context.

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.

See the earlier discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).
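To make the linear assumption concrete, here is a minimal, textbook-style sketch of a LinUCB agent with a single shared linear model, run on synthetic data. The class `LinUCBSketch`, the helper `run_demo`, and all parameter values are hypothetical names chosen for illustration. This is not RLlib's implementation, which may differ in detail.

```python
import numpy as np


class LinUCBSketch:
    """Textbook-style LinUCB with one shared linear model (not RLlib's code)."""

    def __init__(self, feature_dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha                    # scales the confidence bonus
        self.A = ridge * np.eye(feature_dim)  # regularized sum of x x^T
        self.b = np.zeros(feature_dim)        # sum of reward * x

    def select(self, contexts):
        """Pick the row of `contexts` with the highest upper confidence bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                # ridge-regression estimate
        mean = contexts @ theta               # linear reward estimate per arm
        # Uncertainty term sqrt(x^T A^-1 x) for every candidate row x.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        return int(np.argmax(mean + self.alpha * bonus))

    def update(self, x, reward):
        """Fold one observed (context, reward) pair into the model."""
        self.A += np.outer(x, x)
        self.b += reward * x


def run_demo(policy, steps=500, num_arms=25, feature_dim=16, seed=0):
    """Return the per-step regret of `policy` ("linucb" or "random")."""
    rng = np.random.default_rng(seed)
    true_theta = rng.random(feature_dim)      # hidden linear weights (synthetic)
    agent = LinUCBSketch(feature_dim)
    regrets = []
    for _ in range(steps):
        contexts = rng.random((num_arms, feature_dim))
        expected = contexts @ true_theta      # expected reward of each arm
        if policy == "linucb":
            arm = agent.select(contexts)
        else:
            arm = int(rng.integers(num_arms))
        reward = expected[arm] + rng.normal(scale=0.1)
        agent.update(contexts[arm], reward)
        regrets.append(expected.max() - expected[arm])
    return np.array(regrets)


linucb_regret = run_demo("linucb")
random_regret = run_demo("random")
print(f"total regret: LinUCB = {linucb_regret.sum():.3f}, "
      f"random = {random_regret.sum():.3f}")
```

The score for each arm is the estimated reward plus a bonus that is large for contexts the model has seen little of, so under-explored arms get tried before the agent commits to the ones that look best.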



In [1]:
import os
import time
import pandas as pd
import numpy as np

import ray
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

lz4 not available, disabling sample compression. This will significantly impact RLlib performance. To install lz4, run `pip install lz4`.


Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):



```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```

This environment is deliberately nontrivial, which makes it confusing at first, so let's look at its behavior. We'll create one using the default settings:


In [2]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)


In [3]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [4]:
take_step()
take_step()


    action = 13, 
    obs:
        'item': [[0.02184632 0.38962268 0.27080361 0.06139751 0.16461075 0.24289644
  0.27257102 0.21258848 0.00558852 0.22869068 0.12475031 0.31299057
  0.24970903 0.38632188 0.39342137 0.18108697]] (25 items), 
        'item_id': [85 46 88 35 76 22  8 44 49 92 31 41 28 33 84 61  2 43 47 39 81 38 29 34
 64],
        'response': [0.69418285042524], 
    reward = 0.69418285042524, 
    finished? = True, 
    info = {'regret': 0.17517313859435446}
    

    action = 8, 
    obs:
        'item': [[0.28213117 0.26317668 0.21036557 0.37617669 0.00369184 0.04024035
  0.05232339 0.08387468 0.32555471 0.05242273 0.35259738 0.34257478
  0.04837945 0.37538231 0.23187539 0.32639976]] (25 items), 
        'item_id': [24 65 44 30 59 95 16 86 78 22 48 28 71 35 46 90 92 31 10  0 18 82 12 43
 62],
        'response': [0.7414915799974681], 
    reward = 0.7414915799974681, 
    finished? = True, 
    info = {'regret': 0.12786440902212637}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by matrix multiplication of the environment's randomly generated data matrices, followed by selecting the response (reward) indexed by the action passed to `step`. As constructed, the rewards fall in a fairly narrow band, roughly 0.5 to 0.9 in the runs shown here. The regret is the maximum reward over all possible actions minus the reward for the chosen action.

The `item` shown is a subset of all the _items_ in the environment, and `item_id` holds the indices of those items in the full collection. This list of 25 items is randomly chosen _for each step_, as you can see from these two steps.
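The mechanics described above (a fixed pool of items, a fresh candidate subset on each step, a reward read from a matrix product, and regret measured against the best candidate) can be sketched in a few lines of NumPy. `ToyRecoEnv` is a hypothetical stand-in written for this illustration. It is not the source of `ParametricItemRecoEnv`, and the single hidden `user` vector is an assumption of the sketch.

```python
import numpy as np


class ToyRecoEnv:
    """Simplified stand-in for a parametric recommendation environment."""

    def __init__(self, num_items=100, feature_dim=16, num_candidates=25, seed=1):
        self.rng = np.random.default_rng(seed)
        self.num_candidates = num_candidates
        # Fixed pool of items with random features.
        self.items = self.rng.random((num_items, feature_dim))
        # Hidden vector that determines rewards (an assumption of this sketch).
        self.user = self.rng.random(feature_dim) / feature_dim
        self._resample()

    def _resample(self):
        # Draw a fresh candidate subset; item_ids index into the full pool.
        self.item_ids = self.rng.choice(
            len(self.items), size=self.num_candidates, replace=False)
        # One expected response per candidate, via a matrix-vector product.
        self.responses = self.items[self.item_ids] @ self.user

    def observation(self):
        return {"item": self.items[self.item_ids], "item_id": self.item_ids}

    def step(self, action):
        reward = float(self.responses[action])
        regret = float(self.responses.max() - reward)
        self._resample()  # the next step sees a different candidate list
        return self.observation(), reward, True, {"regret": regret}


env = ToyRecoEnv()
best_action = int(np.argmax(env.responses))
_, reward, _, info = env.step(best_action)
print(f"best action: reward = {reward:.5f}, regret = {info['regret']:.5f}")
```

Choosing the best candidate gives a regret of exactly zero, and any other choice gives a positive regret equal to the reward that was left on the table.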


Over the following `num_candidates` steps (25 by default), you may see a regret of 0.0, which occurs when the sampled action happens to be the one with the maximum possible reward. This won't happen in every run. Which step has the lowest regret?


In [5]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward:7.5f}, regret = {info["regret"]:7.5f}')

  0: reward = 0.79091, regret = 0.07845
  1: reward = 0.77886, regret = 0.08691
  2: reward = 0.72152, regret = 0.14784
  3: reward = 0.64857, regret = 0.21899
  4: reward = 0.70884, regret = 0.12224
  5: reward = 0.59567, regret = 0.21234
  6: reward = 0.82330, regret = 0.03029
  7: reward = 0.65356, regret = 0.21221
  8: reward = 0.72152, regret = 0.14425
  9: reward = 0.84557, regret = 0.02378
 10: reward = 0.76898, regret = 0.10037
 11: reward = 0.76763, regret = 0.04319
 12: reward = 0.70214, regret = 0.12894
 13: reward = 0.70325, regret = 0.16966
 14: reward = 0.68384, regret = 0.16173
 15: reward = 0.82330, regret = 0.04962
 16: reward = 0.68384, regret = 0.18193
 17: reward = 0.78811, regret = 0.06548
 18: reward = 0.68217, regret = 0.17142
 19: reward = 0.53511, regret = 0.31848
 20: reward = 0.69312, regret = 0.16047
 21: reward = 0.75453, regret = 0.09906
 22: reward = 0.83108, regret = 0.00000
 23: reward = 0.78036, regret = 0.08719
 24: reward = 0.78497, regret = 0.07180


The upshot is that training to find the optimal mean reward will be more challenging than it was for our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

Note that we imported `UCB_CONFIG` above, which defines the properties that _LinUCB_ expects. We'll add one more property to it for the environment. (Subsequent lessons will show other ways to work with the configuration.)


In [6]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# The total number of time steps will be 20 * timesteps_per_iteration (100 by default) = 2,000
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations
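The relationship between iterations and time steps is simple arithmetic. The sketch below hard-codes `timesteps_per_iteration = 100`, the default mentioned in the comment above, as an assumption rather than reading it from `UCB_CONFIG`.

```python
training_iterations = 20
# Default noted in the comment above; assumed here, not read from UCB_CONFIG.
timesteps_per_iteration = 100

total_timesteps = training_iterations * timesteps_per_iteration
print(f"{training_iterations} iterations x {timesteps_per_iteration} "
      f"steps per iteration = {total_timesteps} time steps")
```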


Now let's use [Ray Tune](http://tune.io) to train. First start Ray or connect to a running cluster.

In [7]:
ray.init(ignore_reinit_error=True)

2020-09-12 16:49:27,974	INFO resource_spec.py:231 -- Starting Ray with 3.42 GiB memory available for workers and up to 1.71 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-12 16:49:28,535	INFO services.py:1193 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.5',
 'raylet_ip_address': '192.168.1.5',
 'redis_address': '192.168.1.5:6379',
 'object_store_address': '/tmp/ray/session_2020-09-12_16-49-27_972797_31350/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-09-12_16-49-27_972797_31350/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-09-12_16-49-27_972797_31350'}

Now run Tune:

In [8]:
analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    verbose=1
)

| Trial name | status | loc | iter | total time (s) | ts | reward |
|---|---|---|---|---|---|---|
| contrib_LinUCB_ParametricItemRecoEnv_7dc80_00000 | TERMINATED | | 20 | 10.1914 | 2000 | 0.87834 |
| contrib_LinUCB_ParametricItemRecoEnv_7dc80_00001 | TERMINATED | | 20 | 9.97493 | 2000 | 0.889897 |
| contrib_LinUCB_ParametricItemRecoEnv_7dc80_00002 | TERMINATED | | 20 | 10.1432 | 2000 | 0.84754 |
| contrib_LinUCB_ParametricItemRecoEnv_7dc80_00003 | TERMINATED | | 20 | 10.352 | 2000 | 0.893462 |
| contrib_LinUCB_ParametricItemRecoEnv_7dc80_00004 | TERMINATED | | 20 | 5.20884 | 2000 | 0.878567 |


How long did it take?

In [9]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

  23.19 seconds,    0.39 minutes


Let's look at the data:

In [10]:
df = analysis.dataframe()
df

| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | num_healthy_workers | timesteps_total | done | episodes_total | training_iteration | ... | config/sample_batch_size | config/seed | config/shuffle_buffer_size | config/soft_horizon | config/synchronize_filters | config/tf_session_args | config/timesteps_per_iteration | config/train_batch_size | config/use_pytorch | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.914414 | 0.808219 | 0.87834 | 1.0 | 100 | 0 | 2000 | True | 2000 | 20 | ... | -1 | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | -1 | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |
| 1 | 0.898686 | 0.848535 | 0.889897 | 1.0 | 100 | 0 | 2000 | True | 2000 | 20 | ... | -1 | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | -1 | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |
| 2 | 0.901036 | 0.777299 | 0.84754 | 1.0 | 100 | 0 | 2000 | True | 2000 | 20 | ... | -1 | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | -1 | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |
| 3 | 0.904757 | 0.825116 | 0.893462 | 1.0 | 100 | 0 | 2000 | True | 2000 | 20 | ... | -1 | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | -1 | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |
| 4 | 0.892011 | 0.819377 | 0.878567 | 1.0 | 100 | 0 | 2000 | True | 2000 | 20 | ... | -1 | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | -1 | /home/jhmbabo/ray_results/contrib/LinUCB/contr... |


Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?


One of the columns in the trial DataFrames is `info/learner/default_policy/cumulative_regret`. Let's combine the trial DataFrames into a single DataFrame, then group by `info/num_steps_trained` and select the `info/learner/default_policy/cumulative_regret` column. Finally, aggregate over each `info/num_steps_trained` value to compute the `mean`, `max`, `min`, and `std` (standard deviation) of the cumulative regret.
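To see what this combine, group, and aggregate pattern produces, here is the same sequence on a tiny made-up input. The two DataFrames `trial_a` and `trial_b` and their values are invented for illustration and only stand in for real trial results.

```python
import pandas as pd

STEPS = "info/num_steps_trained"
REGRET = "info/learner/default_policy/cumulative_regret"

# Two made-up "trial" DataFrames standing in for real trial results.
trial_a = pd.DataFrame({STEPS: [100, 200], REGRET: [3.0, 4.0]})
trial_b = pd.DataFrame({STEPS: [100, 200], REGRET: [3.5, 4.5]})

# Stack the trials, then summarize the regret across trials per step count.
combined = pd.concat([trial_a, trial_b], ignore_index=True)
summary = combined.groupby(STEPS)[REGRET].aggregate(["mean", "max", "min", "std"])
print(summary)
```

Each row of `summary` describes one step count, and each statistic is taken across the trials that reached that step count.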



In [11]:
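# Combine the per-trial DataFrames into a single DataFrame.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

# Summarize the cumulative regret across trials at each training step count.
df = frame.groupby("info/num_steps_trained")[
    "info/learner/default_policy/cumulative_regret"].aggregate(["mean", "max", "min", "std"])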

In [12]:
df

| info/num_steps_trained | mean | max | min | std |
|---|---|---|---|---|
| 100 | 3.222065 | 3.40944 | 3.060718 | 0.142948 |
| 200 | 3.917847 | 4.288206 | 3.731046 | 0.226175 |
| 300 | 4.280946 | 4.724587 | 3.977982 | 0.312146 |
| 400 | 4.551197 | 5.16677 | 4.098278 | 0.442226 |
| 500 | 4.72489 | 5.374023 | 4.209122 | 0.4839 |
| 600 | 4.902917 | 5.538157 | 4.335885 | 0.523566 |
| 700 | 5.002488 | 5.671459 | 4.364087 | 0.597292 |
| 800 | 5.115244 | 5.792358 | 4.39214 | 0.628825 |
| 900 | 5.208474 | 5.977497 | 4.431327 | 0.67848 |
| 1000 | 5.275504 | 6.086007 | 4.468954 | 0.703333 |


It will be easier to understand these results with a graph:

In [15]:
from bokeh_util import plot_cumulative_regret
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

ModuleNotFoundError: No module named 'bokeh_util'

In [16]:
plot_cumulative_regret(df)

NameError: name 'plot_cumulative_regret' is not defined

![LinUCB cumulative regret](../../images/rllib/LinUCB-Cumulative-Regret.png)

So the _cumulative_ regret increases over the entire training run for all five trials, but the amount of regret added per step shrinks as the agent learns, so the curve begins to level off as the system gets better at optimizing the mean reward.
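The leveling off can also be read directly from the numbers. The sketch below takes the `mean` column of the cumulative regret table shown earlier and computes how much regret was added in each 100-step interval. The variable names are chosen for this illustration.

```python
import numpy as np

# Mean cumulative regret at 100, 200, ..., 1000 steps, copied from the table.
steps = np.arange(100, 1001, 100)
mean_cumulative_regret = np.array([
    3.222065, 3.917847, 4.280946, 4.551197, 4.72489,
    4.902917, 5.002488, 5.115244, 5.208474, 5.275504,
])

# Regret added during each 100-step interval after the first.
added_regret = np.diff(mean_cumulative_regret)

for start, end, delta in zip(steps[:-1], steps[1:], added_regret):
    print(f"steps {start:4d}-{end:4d}: regret added = {delta:.6f}")
```

The first 100 steps account for about 3.22 of mean cumulative regret, while the following 900 steps add only about 2.05 in total. The increments are not strictly decreasing from one interval to the next, but the downward trend is clear.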


The environment we're using randomly generates data on every step, so there will always be some regret even if we train for a longer period of time.


## Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change at higher steps?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.

In [None]:
ray.shutdown()  # "Undo ray.init()".