# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

In the [previous lesson](02-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) as the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)). LinUCB assumes a linear dependency between the expected reward of an action and its context.

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.
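To make the linear assumption concrete, here is a minimal NumPy sketch of a LinUCB-style agent that scores candidate items from their feature vectors. It is an illustration only, not RLlib's implementation: the class name `SharedLinUCB`, the single shared weight vector, the `alpha` default, and the noiseless toy rewards are all assumptions made for this example.

```python
import numpy as np


class SharedLinUCB:
    """Toy LinUCB with one weight vector shared by all candidate items (hypothetical)."""

    def __init__(self, feature_dim, alpha=1.0):
        self.alpha = alpha                  # exploration strength (assumed default)
        self.A = np.eye(feature_dim)        # identity plus the sum of x x^T over chosen items
        self.b = np.zeros(feature_dim)      # sum of reward * x over chosen items

    def select(self, candidates):
        """Return the row index with the highest upper confidence bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b              # current estimate of the linear weights
        mean = candidates @ theta           # predicted reward for each candidate
        # Confidence width sqrt(x^T A^-1 x), computed for every candidate row x.
        width = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
        return int(np.argmax(mean + self.alpha * width))

    def update(self, x, reward):
        """Fold the observed (features, reward) pair into the estimate."""
        self.A += np.outer(x, x)
        self.b += reward * x


# Toy loop: rewards are exactly linear in the features of the chosen item.
rng = np.random.default_rng(0)
true_theta = rng.random(4)
agent = SharedLinUCB(feature_dim=4)
regrets = []
for _ in range(200):
    candidates = rng.random((5, 4))         # 5 candidate items, 4 features each
    scores = candidates @ true_theta
    action = agent.select(candidates)
    agent.update(candidates[action], scores[action])
    regrets.append(scores.max() - scores[action])
```

The `alpha * width` term is what drives exploration: items whose features point in directions the agent has rarely observed get a larger bonus.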

See the previous discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).

In [1]:
import os
import time
import pandas as pd
import numpy as np

import ray
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv



Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also defined in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```
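As a rough illustration of how these sizes relate, the sketch below builds a pool of `num_items` feature vectors and draws `num_candidates` of them. The `config` dict is a stand-in that mirrors the defaults above rather than the RLlib object, and sampling without replacement from a uniform random pool is an assumption, not a statement about how `parametric.py` generates its data.

```python
import numpy as np

# Stand-in mirroring the defaults above; not imported from RLlib.
config = {"num_items": 100, "feature_dim": 16, "num_candidates": 25, "seed": 1}

rng = np.random.default_rng(config["seed"])

# One row of randomly generated features per item in the pool.
item_pool = rng.random((config["num_items"], config["feature_dim"]))

# Pick the candidate items offered at one step (assumed: without replacement).
item_ids = rng.choice(config["num_items"], size=config["num_candidates"], replace=False)
candidates = item_pool[item_ids]

print(candidates.shape)  # (25, 16)
print(item_ids.shape)    # (25,)
```

With these defaults an action is an index from 0 to 24 into the candidate list, which matches a `Discrete(25)` action space.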

This environment is deliberately nontrivial, which makes it confusing at first, so let's look at its behavior. We'll create one using the default settings:

In [2]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)


In [3]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [4]:
take_step()
take_step()


    action = 4, 
    obs:
        'item': [[0.23366057 0.37518009 0.31429835 0.27246916 0.14240137 0.31069752
  0.3140174  0.04663713 0.24377303 0.06174414 0.11274397 0.36073582
  0.04471896 0.04390381 0.36584711 0.26490772]] (25 items), 
        'item_id': [24 20 80 16 28 31 68 18 19 97  9 53 99  6 87 89 50 37  4 98 23 40 86 12
 76],
        'response': [0.7848750835230395], 
    reward = 0.7848750835230395, 
    finished? = True, 
    info = {'regret': 0.02414592687053685}
    

    action = 1, 
    obs:
        'item': [[0.44621964 0.37957646 0.31492843 0.00722326 0.35564001 0.15423948
  0.04823754 0.05859373 0.30385808 0.3296814  0.15090086 0.29550319
  0.05990986 0.19801203 0.21779258 0.01426956]] (25 items), 
        'item_id': [22 76 47 92 16 30 73 10 57 69 93 51 56 78 11 50 38 44 90 95 41 20 46 15
 25],
        'response': [0.6776993826258662], 
    reward = 0.6776993826258662, 
    finished? = True, 
    info = {'regret': 0.19877127019669705}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by multiplying the randomly generated matrices of data, then selecting the response (reward) indexed by the action passed to `step`. As constructed, the reward falls in a narrow band (roughly 0.55 to 0.9 in the runs shown here). The regret is the maximum response over all possible actions minus the reward for the chosen action.
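The regret definition can be written out in a few lines. This is a sketch of the arithmetic only: the helper name `reward_and_regret` and the three scores are made up for illustration and are not taken from the environment.

```python
def reward_and_regret(scores, action):
    """Return (reward, regret) for choosing `action` among per-action scores."""
    reward = scores[action]
    regret = max(scores) - reward   # gap to the best available action
    return reward, regret


scores = [0.62, 0.85, 0.71]         # made-up responses for three actions

print(reward_and_regret(scores, 1))  # (0.85, 0.0)
best_reward, best_regret = reward_and_regret(scores, 1)
other_reward, other_regret = reward_and_regret(scores, 0)
```

Choosing the best action gives a regret of exactly 0.0, and any other choice gives a positive regret equal to the gap to the best score.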

The `item` array holds the features of a subset of all the _items_ in the environment, and `item_id` holds the indices of those items in the larger collection. This list of 25 items is randomly chosen _for each step_, as the two steps above show.

In the following `num_candidates` steps (25 by default), you may see a regret of 0.0, which happens when the sampled action is the one with the maximum possible reward. This will not happen in every run. Which step has the lowest regret?

In [5]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward:7.5f}, regret = {info["regret"]:7.5f}')

  0: reward = 0.77695, regret = 0.10522
  1: reward = 0.72438, regret = 0.14366
  2: reward = 0.72781, regret = 0.15436
  3: reward = 0.77396, regret = 0.10251
  4: reward = 0.71434, regret = 0.12907
  5: reward = 0.75761, regret = 0.03997
  6: reward = 0.64429, regret = 0.17014
  7: reward = 0.70396, regret = 0.14844
  8: reward = 0.78488, regret = 0.09160
  9: reward = 0.56153, regret = 0.28187
 10: reward = 0.84340, regret = 0.00000
 11: reward = 0.69977, regret = 0.15263
 12: reward = 0.79489, regret = 0.08728
 13: reward = 0.74880, regret = 0.05608
 14: reward = 0.77695, regret = 0.04431
 15: reward = 0.78488, regret = 0.05853
 16: reward = 0.88217, regret = 0.00000
 17: reward = 0.70910, regret = 0.17307
 18: reward = 0.69524, regret = 0.11378
 19: reward = 0.62825, regret = 0.22415
 20: reward = 0.70396, regret = 0.16408
 21: reward = 0.66099, regret = 0.16028
 22: reward = 0.70687, regret = 0.16116
 23: reward = 0.84340, regret = 0.03877
 24: reward = 0.68019, regret = 0.16321
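
A quick calculation suggests how often a zero regret should appear in a loop like this one. It assumes that each step has exactly one best candidate and that the action is sampled uniformly from the 25 candidates. Both are simplifying assumptions for this sketch.

```python
num_candidates = 25
steps = 25

# Chance that a uniformly random action is the single best one at a given step.
p_hit = 1 / num_candidates

# Chance of at least one zero-regret step across all steps.
p_at_least_one = 1 - (1 - p_hit) ** steps

# Average number of zero-regret steps per loop.
expected_hits = steps * p_hit

print(f"P(at least one zero regret) = {p_at_least_one:.3f}")
print(f"Expected zero-regret steps  = {expected_hits:.2f}")
```

Under these assumptions roughly 64% of loops contain at least one zero-regret step, and the average is one per loop, so seeing none, one, or two in a given run is unsurprising.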


The upshot is that training to find the optimal mean reward will be more challenging than with our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

Note that we imported `UCB_CONFIG` above, which defines the properties that _LinUCB_ expects. We'll add another property to it for the environment. (Subsequent lessons will show other ways to work with the configuration.)

In [6]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be 20 * timesteps_per_iteration (100 by default) = 2,000
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


Now let's use [Ray Tune](http://tune.io) to train. First start Ray or connect to a running cluster.

In [7]:
info = ray.init(ignore_reinit_error=True, num_cpus=2)

2021-05-22 22:15:37,974 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265


Now run Tune:

In [8]:
analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    verbose=1
)

Trial name,# failures,error file
contrib_LinUCB_ParametricItemRecoEnv_3ff48_00000,1,/home/ceteri/ray_results/contrib/LinUCB/contrib_LinUCB_ParametricItemRecoEnv_3ff48_00000_0_2021-05-22_22-15-43/error.txt
contrib_LinUCB_ParametricItemRecoEnv_3ff48_00001,1,/home/ceteri/ray_results/contrib/LinUCB/contrib_LinUCB_ParametricItemRecoEnv_3ff48_00001_1_2021-05-22_22-15-43/error.txt
contrib_LinUCB_ParametricItemRecoEnv_3ff48_00002,1,/home/ceteri/ray_results/contrib/LinUCB/contrib_LinUCB_ParametricItemRecoEnv_3ff48_00002_2_2021-05-22_22-15-47/error.txt
contrib_LinUCB_ParametricItemRecoEnv_3ff48_00003,1,/home/ceteri/ray_results/contrib/LinUCB/contrib_LinUCB_ParametricItemRecoEnv_3ff48_00003_3_2021-05-22_22-15-47/error.txt
contrib_LinUCB_ParametricItemRecoEnv_3ff48_00004,1,/home/ceteri/ray_results/contrib/LinUCB/contrib_LinUCB_ParametricItemRecoEnv_3ff48_00004_4_2021-05-22_22-15-52/error.txt


TuneError: ('Trials did not complete', [contrib_LinUCB_ParametricItemRecoEnv_3ff48_00000, contrib_LinUCB_ParametricItemRecoEnv_3ff48_00001, contrib_LinUCB_ParametricItemRecoEnv_3ff48_00002, contrib_LinUCB_ParametricItemRecoEnv_3ff48_00003, contrib_LinUCB_ParametricItemRecoEnv_3ff48_00004])

How long did it take?

In [None]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

Let's look at the data:

In [None]:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
df

Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?

One of the columns in the trial DataFrames is `info/learner/default_policy/cumulative_regret`. Let's combine the trial DataFrames into a single DataFrame, then group by `info/num_steps_trained` and select `info/learner/default_policy/cumulative_regret`. Finally, aggregate for each `info/num_steps_trained` value to compute the `mean`, `max`, `min`, and `std` (standard deviation) of the cumulative regret.

In [None]:
# Combine the per-trial DataFrames into one.
# pd.concat replaces DataFrame.append, which newer pandas releases no longer provide.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df = frame.groupby("info/num_steps_trained")[
    "info/learner/default_policy/cumulative_regret"].aggregate(["mean", "max", "min", "std"])

df
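
To see what the group-and-aggregate step computes, here is the same operation on a tiny DataFrame. The numbers are invented for two hypothetical trials and are not results from the run above. Only the column names match the ones used in this lesson.

```python
import pandas as pd

# Invented values: two hypothetical trials, each reporting at 100 and 200 steps.
toy = pd.DataFrame({
    "info/num_steps_trained": [100, 200, 100, 200],
    "info/learner/default_policy/cumulative_regret": [4.0, 7.0, 6.0, 9.0],
})

# One row per step count, with statistics taken across the trials.
summary = toy.groupby("info/num_steps_trained")[
    "info/learner/default_policy/cumulative_regret"
].aggregate(["mean", "max", "min", "std"])

print(summary)
```

Each row of `summary` describes how the trials compare at one step count. For example, the mean at 100 steps is 5.0 because the two toy trials reported 4.0 and 6.0.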

It will be easier to understand these results with a graph:

In [None]:
df.plot(y="mean", yerr="std", title="Cumulative Regret")

So the _cumulative_ regret increases throughout training for all five trials. At larger step numbers, however, less regret is added per step as the agent learns, so the graph begins to level off as the system gets better at optimizing the mean reward.
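"Leveling off" means that the regret added between consecutive reports shrinks. The sketch below checks this on a short, invented cumulative-regret series. The values are illustrative only and do not come from a training run.

```python
# Invented cumulative regret at five consecutive reporting points.
cumulative = [5.0, 9.0, 12.0, 14.0, 15.0]

# Regret added between consecutive points.
added = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]
print(added)  # [4.0, 3.0, 2.0, 1.0]

# The curve levels off when each increment is no larger than the one before it.
leveling_off = all(b <= a for a, b in zip(added, added[1:]))
print(leveling_off)  # True
```

The same differences can be taken on the aggregated `mean` column to check whether the per-step regret is really shrinking.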

The environment we're using randomly generates data on every step, so there will always be some regret even if we train for a longer period of time.

## Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change at higher steps?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.

In [None]:
ray.shutdown()