# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) as the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)). LinUCB assumes a linear dependency between the expected reward of an action and its context.

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.

See the previous discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).
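As a reminder of what the linear assumption means in practice, here is a minimal NumPy sketch of textbook LinUCB with a single shared weight vector. It is not RLlib's implementation: the class name `LinUCBSketch`, the `alpha` and `ridge` values, and the toy reward model are all assumptions made for illustration. Each candidate's score is its predicted reward (linear in its features) plus an uncertainty bonus, and the agent picks the candidate with the highest score.

```python
import numpy as np


class LinUCBSketch:
    """Textbook LinUCB with one shared weight vector (not RLlib's code)."""

    def __init__(self, feature_dim, alpha=0.5, ridge=1.0):
        self.alpha = alpha                     # exploration strength (assumed value)
        self.A = ridge * np.eye(feature_dim)   # ridge term plus sum of x x^T
        self.b = np.zeros(feature_dim)         # sum of reward * x

    def scores(self, candidates):
        """Return one upper confidence bound per candidate feature row."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                 # ridge-regression weight estimate
        mean = candidates @ theta              # predicted reward, linear in the context
        # Uncertainty bonus sqrt(x^T A^-1 x): larger for rarely seen directions.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
        return mean + self.alpha * bonus

    def select(self, candidates):
        return int(np.argmax(self.scores(candidates)))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


# Toy loop: 25 random candidates with 16 features per step, hidden linear reward.
rng = np.random.default_rng(0)
true_theta = rng.uniform(0.0, 1.0, size=16)
agent = LinUCBSketch(feature_dim=16)
regrets = []
for _ in range(500):
    candidates = rng.uniform(0.0, 1.0, size=(25, 16))
    expected = candidates @ true_theta
    action = agent.select(candidates)
    agent.update(candidates[action], expected[action] + rng.normal(0.0, 0.05))
    regrets.append(expected.max() - expected[action])

print(f"mean regret, first 100 steps: {np.mean(regrets[:100]):.4f}")
print(f"mean regret, last 100 steps:  {np.mean(regrets[-100:]):.4f}")
```

The candidate matrix here plays the role of the parametric actions described above: the action is discrete (a row index), but each row carries continuous features that the linear model scores.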

In [1]:
import os
import time
import pandas as pd
import numpy as np

import ray
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv



Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```

This environment is deliberately nontrivial, which makes it confusing at first, so let's look at its behavior. We'll create one using the default settings:

In [2]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)


In [3]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [4]:
take_step()
take_step()


    action = 10, 
    obs:
        'item': [[0.05048774 0.47194573 0.03921339 0.30052419 0.20338461 0.04726611
  0.12836221 0.07173992 0.22054862 0.1293311  0.10709322 0.30443624
  0.41910891 0.14934622 0.16536158 0.47204157]] (25 items), 
        'item_id': [50 53 39 90 40 54 70 11 85 21 42 35 48 55 74 49 51 60 92 68 62 29 81 72
 93],
        'response': [0.8220774924228555], 
    reward = 0.8220774924228555, 
    finished? = True, 
    info = {'regret': 0.10491187669598778}
    

    action = 6, 
    obs:
        'item': [[0.31548275 0.11021091 0.353385   0.23614226 0.39811143 0.42167002
  0.27204124 0.15484602 0.19972821 0.14739303 0.10179081 0.13443356
  0.22163083 0.3115957  0.18870421 0.03907462]] (25 items), 
        'item_id': [16 91  8 59 41 64 92  0 57 99 66 26 84 35 19 49 72 83 27 88  1  4 50 65
 85],
        'response': [0.7968325805543849], 
    reward = 0.7968325805543849, 
    finished? = True, 
    info = {'regret': 0.13711410502985222}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by multiplying several randomly generated matrices of data, then selecting the response (reward) indexed by the action passed to `step`. As constructed, the reward always comes out between about 0.6 and 0.9. The regret is the maximum reward over all possible actions minus the reward for the chosen action.

The `item` shown is the subset of all the _items_ in the environment, with the `item_id` being the corresponding indices of the items shown in the larger collection of items. This list of 25 items is randomly chosen _for each step_, as you should be able to see from these two steps.
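To make the relationship between candidates, reward, and regret concrete, here is a simplified stand-in for the environment. It is not the code in `parametric.py`: the helper names (`sample_candidates`, `responses_for`, `reward_and_regret`), the uniform feature ranges, and the sigmoid squashing are assumptions chosen only to illustrate the ideas described above.

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, feature_dim, num_candidates = 100, 16, 25

# Hypothetical item pool and a single user vector, both randomly generated.
item_pool = rng.uniform(0.0, 0.5, size=(num_items, feature_dim))
user = rng.uniform(0.0, 0.5, size=feature_dim)


def sample_candidates():
    """Draw a fresh subset of the pool, as the environment does for each step."""
    item_ids = rng.choice(num_items, size=num_candidates, replace=False)
    return item_ids, item_pool[item_ids]


def responses_for(candidates):
    """One response per candidate: a matrix product squashed into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(candidates @ user)))


def reward_and_regret(candidates, action):
    """Reward is the chosen response; regret is the gap to the best response."""
    responses = responses_for(candidates)
    reward = responses[action]
    return reward, responses.max() - reward


item_ids, candidates = sample_candidates()
action = int(rng.integers(num_candidates))
reward, regret = reward_and_regret(candidates, action)
print(f"item_ids[:5] = {item_ids[:5]}, action = {action}")
print(f"reward = {reward:.5f}, regret = {regret:.5f}")
```

In this sketch the regret is zero exactly when the chosen action indexes the candidate with the largest response, and positive otherwise.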

The next cell takes `num_candidates` steps (25 by default). You may see a regret of 0.0 in the output, which means the sampled action happened to be the one with the maximum possible reward, but this won't happen in every run. Which step has the lowest regret?

In [5]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward:7.5f}, regret = {info["regret"]:7.5f}')

  0: reward = 0.83539, regret = 0.06087
  1: reward = 0.77759, regret = 0.13725
  2: reward = 0.83032, regret = 0.02600
  3: reward = 0.83455, regret = 0.10204
  4: reward = 0.82078, regret = 0.11317
  5: reward = 0.83310, regret = 0.10349
  6: reward = 0.79832, regret = 0.13563
  7: reward = 0.81241, regret = 0.09064
  8: reward = 0.79307, regret = 0.14352
  9: reward = 0.79832, regret = 0.13827
 10: reward = 0.81913, regret = 0.10786
 11: reward = 0.70258, regret = 0.22441
 12: reward = 0.76002, regret = 0.17657
 13: reward = 0.82647, regret = 0.07658
 14: reward = 0.79832, regret = 0.11652
 15: reward = 0.73610, regret = 0.17874
 16: reward = 0.83455, regret = 0.10204
 17: reward = 0.83831, regret = 0.08868
 18: reward = 0.83032, regret = 0.06594
 19: reward = 0.85926, regret = 0.04379
 20: reward = 0.85031, regret = 0.07668
 21: reward = 0.73631, regret = 0.20028
 22: reward = 0.91484, regret = 0.00000
 23: reward = 0.79832, regret = 0.09794
 24: reward = 0.78102, regret = 0.13381
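How often should a loop like this show a regret of exactly 0.0? A rough back-of-the-envelope estimate follows, assuming that each step has a single best action among the 25 candidates and that `action_space.sample()` picks uniformly at random. Under those assumptions, the chance of at least one zero-regret step in 25 steps is about 64%, which is consistent with seeing one in some runs but not in others.

```python
import numpy as np

num_candidates = 25   # actions available at each step
steps = 25            # steps taken in the loop

# Chance that one uniformly random action is the best of the candidates.
p_hit = 1.0 / num_candidates
# Chance of at least one such hit over independent steps.
p_at_least_one = 1.0 - (1.0 - p_hit) ** steps
print(f"analytic:  {p_at_least_one:.3f}")

# Monte Carlo check. By symmetry we can pretend the best action is always index 0.
rng = np.random.default_rng(0)
draws = rng.integers(0, num_candidates, size=(20_000, steps))
simulated = (draws == 0).any(axis=1).mean()
print(f"simulated: {simulated:.3f}")
```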


The upshot is that training to find the optimal mean reward will be more challenging than it was for our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

Note that we imported `UCB_CONFIG` above, which defines the properties that _LinUCB_ expects. We'll add another property to it for the environment. (Subsequent lessons will show other ways to work with the configuration.)

In [6]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Each training iteration runs timesteps_per_iteration (100 by default) time steps,
# so 20 iterations = 2,000 time steps in total.
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


Now let's use [Ray Tune](http://tune.io) to train. First start Ray or connect to a running cluster.

In [7]:
ray.init(ignore_reinit_error=True)

2020-09-01 04:04:52,133	INFO resource_spec.py:231 -- Starting Ray with 8.06 GiB memory available for workers and up to 4.05 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 04:04:52,652	INFO services.py:1193 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '127.0.1.1',
 'raylet_ip_address': '127.0.1.1',
 'redis_address': '127.0.1.1:6379',
 'object_store_address': '/tmp/ray/session_2020-09-01_04-04-52_131638_4976/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-09-01_04-04-52_131638_4976/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-09-01_04-04-52_131638_4976'}

Now run Tune:

In [8]:
analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    verbose=1
)

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_da6f0_00000,TERMINATED,,20,5.79818,2000,0.899049
contrib_LinUCB_ParametricItemRecoEnv_da6f0_00001,TERMINATED,,20,5.72762,2000,0.868243
contrib_LinUCB_ParametricItemRecoEnv_da6f0_00002,TERMINATED,,20,5.74713,2000,0.898301
contrib_LinUCB_ParametricItemRecoEnv_da6f0_00003,TERMINATED,,20,5.61002,2000,0.89894
contrib_LinUCB_ParametricItemRecoEnv_da6f0_00004,TERMINATED,,20,5.40985,2000,0.862724


How long did it take?

In [9]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

   9.74 seconds,    0.16 minutes


Let's look at the data:

In [10]:
df = analysis.dataframe()
df

trial,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_healthy_workers,timesteps_total,done,episodes_total,training_iteration,...,config/sample_batch_size,config/seed,config/shuffle_buffer_size,config/soft_horizon,config/synchronize_filters,config/tf_session_args,config/timesteps_per_iteration,config/train_batch_size,config/use_pytorch,logdir
0,0.909727,0.841158,0.899049,1.0,100,0,2000,True,2000,20,...,-1,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,-1,/home/jungyeon/ray_results/contrib/LinUCB/cont...
1,0.910887,0.804417,0.868243,1.0,100,0,2000,True,2000,20,...,-1,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,-1,/home/jungyeon/ray_results/contrib/LinUCB/cont...
2,0.929446,0.836176,0.898301,1.0,100,0,2000,True,2000,20,...,-1,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,-1,/home/jungyeon/ray_results/contrib/LinUCB/cont...
3,0.923361,0.852523,0.89894,1.0,100,0,2000,True,2000,20,...,-1,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,-1,/home/jungyeon/ray_results/contrib/LinUCB/cont...
4,0.886507,0.803067,0.862724,1.0,100,0,2000,True,2000,20,...,-1,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,-1,/home/jungyeon/ray_results/contrib/LinUCB/cont...


Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?

One of the columns in the trial DataFrames is `info/learner/default_policy/cumulative_regret`. Let's combine the trial DataFrames into a single DataFrame, then group by `info/num_steps_trained` and project out `info/learner/default_policy/cumulative_regret`. Finally, aggregate for each `info/num_steps_trained` value to compute the `mean`, `max`, `min`, and `std` (standard deviation) of the cumulative regret.

In [11]:
# Stack the per-trial DataFrames into one (DataFrame.append was removed in pandas 2.0).
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

# For each step count, summarize the cumulative regret across the trials.
df = frame.groupby("info/num_steps_trained")[
    "info/learner/default_policy/cumulative_regret"].aggregate(["mean", "max", "min", "std"])

In [12]:
df

info/num_steps_trained,mean,max,min,std
100,3.387118,3.837414,3.186883,0.257097
200,3.99762,4.355994,3.793556,0.224908
300,4.336677,4.524341,4.104159,0.207311
400,4.526428,4.832583,4.236853,0.276685
500,4.776828,5.174206,4.431874,0.353029
600,4.981517,5.533187,4.572223,0.446839
700,5.079321,5.690763,4.668314,0.482136
800,5.214527,5.876961,4.755728,0.515497
900,5.328502,5.967244,4.857674,0.542761
1000,5.417954,6.05287,4.922895,0.565076


It will be easier to understand these results with a graph:

In [13]:
from bokeh_util import plot_cumulative_regret
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()


In [None]:
plot_cumulative_regret(df)

![LinUCB Cumulative Regret](../../images/rllib/LinUCB-Cumulative-Regret.png)

The _cumulative_ regret keeps increasing over the whole training run for all five trials. However, the amount of regret added per step shrinks as the agent learns, so the curve begins to level off as the system gets better at optimizing the mean reward.
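One way to check the leveling off numerically is to difference the mean cumulative regret from the aggregated table printed earlier. The sketch below copies the `mean` column for steps 100 through 1000 by hand; the variable names are ours, and only the rows shown in that table are used.

```python
import numpy as np

# Mean cumulative regret across the five trials, copied from the table above.
steps = np.arange(100, 1001, 100)
mean_cum_regret = np.array([
    3.387118, 3.99762, 4.336677, 4.526428, 4.776828,
    4.981517, 5.079321, 5.214527, 5.328502, 5.417954,
])

# Regret added during each block of 100 steps, and the per-step average.
added = np.diff(mean_cum_regret)
per_step = added / np.diff(steps)

for end, total, rate in zip(steps[1:], added, per_step):
    print(f"steps {end - 100:4d}-{end:4d}: added {total:.4f} ({rate:.5f} per step)")
```

The increments are not strictly monotonic, but the regret added in the last block of 100 steps is a small fraction of what was added in the first blocks, which is the flattening visible in the graph.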

The environment we're using randomly generates data on every step, so there will always be some regret even if we train for a longer period of time.

## Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change at higher steps?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.

In [None]:
ray.shutdown()  # "Undo ray.init()".