# sMAB Simulation

This notebook shows a simulation framework for the stochastic multi-armed bandit (sMAB). It lets you study the behaviour of the bandit algorithm, evaluate results, and run experiments on simulated data under different reward and action settings.

In [1]:
import pandas as pd

from pybandits.model import Beta
from pybandits.smab import SmabBernoulli
from pybandits.smab_simulator import SmabSimulator

First, we define the simulation parameters:
- Number of update rounds (`n_updates`)
- Number of samples in the batch processed at each update round (`batch_size`)
- Seed for reproducibility (`random_seed`)
- Verbosity flag (`verbose`)
- Visualization flag (`visualize`)

Data are processed in batches of size n>=1. For each batch of simulated samples, the sMAB selects one action per sample and collects the corresponding simulated reward. The prior parameters are then updated based on the rewards returned for the recommended actions.

In [2]:
# general simulator parameters
n_updates = 10
batch_size = 100
random_seed = None
verbose = True
visualize = True

Next, we initialize the action model and the sMAB. We define three actions, each with a Beta model. The Beta model is a conjugate prior for the Bernoulli likelihood function. The Beta distribution is defined by two parameters: alpha and beta. The action model is defined as a dictionary with the action name as key and the Beta model as value.

In [None]:
# define action model
actions = {
    "a1": Beta(),
    "a2": Beta(),
    "a3": Beta(),
}
# init stochastic Multi-Armed Bandit model
smab = SmabBernoulli(actions=actions)
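
Because the Beta distribution is conjugate to the Bernoulli likelihood, updating an action's model after a batch reduces to adding counts. The following is a minimal standalone sketch of that update, not the pybandits implementation: `BetaPosterior` and its field names are hypothetical and chosen for illustration only, and a uniform Beta(1, 1) prior is assumed.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Hypothetical stand-in for a Beta model (not the pybandits class)."""

    n_successes: int = 1  # alpha parameter; 1 assumes a uniform prior
    n_failures: int = 1  # beta parameter; 1 assumes a uniform prior

    def update(self, rewards: list[int]) -> None:
        """Add observed binary rewards (1 = positive, 0 = negative) to the counts."""
        positives = sum(rewards)
        self.n_successes += positives
        self.n_failures += len(rewards) - positives

    @property
    def mean(self) -> float:
        """Posterior mean of the reward probability: alpha / (alpha + beta)."""
        return self.n_successes / (self.n_successes + self.n_failures)


posterior = BetaPosterior()
posterior.update([1, 1, 0, 1])  # three positive rewards, one negative
print(posterior)  # BetaPosterior(n_successes=4, n_failures=2)
print(posterior.mean)  # 4 / 6
```

After three positive rewards and one negative reward, the posterior is Beta(4, 2), so the estimated reward probability moves from 0.5 towards the observed rate.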

Finally, we need to define the probability of a positive reward for each action, i.e. the ground truth. For example, a value of 0.8 for 'a2' means that if the bandit selects 'a2', the environment returns a positive reward with 80% probability.


In [4]:
# init probability of rewards from the environment
prob_rewards = pd.DataFrame(
    [[0.05, 0.80, 0.05]],
    columns=actions.keys(),
)
print("Probability of positive reward for each action:")
prob_rewards

Probability of positive reward for each action:

     a1   a2    a3
0  0.05  0.8  0.05
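
To make the role of the ground truth concrete, the sketch below draws binary rewards for a chosen action from a Bernoulli distribution with the probabilities defined in the cell above. It uses plain NumPy rather than the simulator itself; `simulate_rewards` is a hypothetical helper and the seed value is arbitrary.

```python
import numpy as np

# Ground-truth probability of a positive reward per action (same values as in the cell above).
true_prob = {"a1": 0.05, "a2": 0.80, "a3": 0.05}


def simulate_rewards(action: str, n_samples: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical helper: draw n_samples binary rewards for the given action."""
    return rng.binomial(n=1, p=true_prob[action], size=n_samples)


rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only
rewards = simulate_rewards("a2", n_samples=10_000, rng=rng)
print(rewards.mean())  # empirical reward rate, close to 0.80 for a large sample
```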


Now, we initialize the SmabSimulator with the sMAB defined above and the parameters set earlier.

In [6]:
# init simulation
smab_simulator = SmabSimulator(
    smab=smab,
    batch_size=batch_size,
    n_updates=n_updates,
    prob_rewards=prob_rewards,
    verbose=verbose,
    visualize=visualize,
)

Setup simulation  completed.
Simulated input probability rewards:
        action A  action B  action C
group                              
0      0.041176  0.835294  0.052941
1      0.819277  0.036145  0.054217
2      0.786585  0.042683  0.817073 



Now we can start the simulation by executing run(), which performs the following steps:
```
For i = 1 to n_updates:
    Draw batch[i] of batch_size simulated samples
    For each sample in batch[i], the model recommends the action with the highest reward probability and collects the corresponding simulated reward
    Model priors are updated using the recommended actions and the returned rewards
```
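
The steps above can be sketched end to end with NumPy. This is an illustrative Thompson sampling simulation under stated assumptions (uniform Beta(1, 1) priors, one sampled probability per action and sample, an arbitrary seed). It is not the code that `run()` executes, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

action_names = ["a1", "a2", "a3"]
true_prob = np.array([0.05, 0.80, 0.05])  # ground truth, as defined earlier
n_updates, batch_size = 10, 100
n_actions = len(action_names)

# Beta(1, 1) prior for every action (assumption).
alpha = np.ones(n_actions)
beta = np.ones(n_actions)
selected_count = np.zeros(n_actions, dtype=np.int64)

for _ in range(n_updates):
    # Predict: sample a reward probability per (sample, action) and pick the argmax.
    sampled_prob = rng.beta(alpha, beta, size=(batch_size, n_actions))
    chosen = sampled_prob.argmax(axis=1)

    # Environment: draw a binary reward for each chosen action.
    rewards = rng.binomial(n=1, p=true_prob[chosen])

    # Update: add positive and negative rewards to each action's Beta parameters.
    positives = np.bincount(chosen, weights=rewards, minlength=n_actions)
    pulls = np.bincount(chosen, minlength=n_actions)
    alpha += positives
    beta += pulls - positives
    selected_count += pulls

print({name: int(count) for name, count in zip(action_names, selected_count)})
```

The priors are updated once per batch rather than once per sample, so all samples within a batch are scored against the same posterior.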
Finally, we can visualize the results of the simulation. As expected from the ground truth, 'a2' is the action recommended most often.

In [7]:
smab_simulator.run()

Iteration #1
Start predict batch 1 ...
Start update batch 1 ...

Iteration #2
Start predict batch 2 ...
Start update batch 2 ...

Iteration #3
Start predict batch 3 ...
Start update batch 3 ...

Iteration #4
Start predict batch 4 ...
Start update batch 4 ...

Iteration #5
Start predict batch 5 ...
Start update batch 5 ...

Simulation results (first 10 observations):
      action  reward  group  selected_prob_reward  max_prob_reward  regret  \
0  action C     0.0      1                  0.05              0.8    0.75   
1  action C     1.0      2                  0.80              0.8    0.00   
2  action B     1.0      0                  0.80              0.8    0.00   
3  action C     0.0      1                  0.05              0.8    0.75   
4  action C     0.0      1                  0.05              0.8    0.75   
5  action B     1.0      0                  0.80              0.8    0.00   
6  action A     0.0      0                  0.05              0.8    0.75   
7  action C     0.0      2                  0.80              0.8    0.00   
8  action C     0.0      1                  0.05              0.8    0.75   
9  action C     1.0      2                  0.80              0.8    0.00   

   cum_regret  
0        0.75  
1        0.75  
2        0.75  
3        1.50  
4        2.25  
5        2.
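
The regret columns in the results can be derived from the selected actions and the ground-truth probabilities. The regret of a sample is the best available reward probability minus the reward probability of the selected action, and the cumulative regret is its running sum. Below is a small pandas sketch with a made-up sequence of selections (illustrative, not simulator output); the column names follow the printed results.

```python
import pandas as pd

true_prob = {"a1": 0.05, "a2": 0.80, "a3": 0.05}  # ground truth, as defined earlier

# Made-up sequence of selected actions, for illustration only.
results = pd.DataFrame({"action": ["a3", "a2", "a1", "a2", "a2"]})

# Reward probability of the action that was actually selected.
results["selected_prob_reward"] = results["action"].map(true_prob)
# Reward probability of the best action available.
results["max_prob_reward"] = max(true_prob.values())
# Per-sample regret and its running sum.
results["regret"] = results["max_prob_reward"] - results["selected_prob_reward"]
results["cum_regret"] = results["regret"].cumsum()
print(results)
```

A flat cumulative regret curve indicates that the bandit has converged to the best action.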

Furthermore, we can examine the number of times each action was selected and the proportion of positive rewards for each action.

In [None]:
smab_simulator.selected_actions_count

In [None]:
smab_simulator.positive_reward_proportion
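
For intuition, these two summaries amount to a count and a mean per action. The sketch below computes them with pandas on a made-up results table. It does not call the simulator; the column names `action` and `reward` are assumed from the results printed earlier, and the values are illustrative.

```python
import pandas as pd

# Made-up simulation results, for illustration only.
results = pd.DataFrame(
    {
        "action": ["a2", "a2", "a1", "a2", "a3", "a2"],
        "reward": [1, 1, 0, 0, 0, 1],
    }
)

# Number of times each action was selected.
selected_actions_count = results["action"].value_counts().to_dict()

# Share of positive rewards among the samples where each action was selected.
positive_reward_proportion = results.groupby("action")["reward"].mean().to_dict()

print(selected_actions_count)
print(positive_reward_proportion)
```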