# Stochastic Bernoulli Bandit

In [1]:
import numpy as np
from rich import print

from pybandits.model import Beta
from pybandits.smab import SmabBernoulli

In [2]:
# print 2 decimal places in the notebook
%precision %.2f

'%.2f'

## 1. Initialization
The bandit can be initialized in either of the following two ways.

### 1.1 Initialize via class constructor

You can initialize the bandit via the class constructor `SmabBernoulli()`. This is useful when you want to encode prior knowledge in the Beta distribution of each action.

In [3]:
mab = SmabBernoulli(
    actions={
        "a1": Beta(n_successes=1, n_failures=1),
        "a2": Beta(n_successes=1, n_failures=1),
        "a3": Beta(n_successes=1, n_failures=1),
    }
)

In [4]:
print(mab)
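
To make the idea of prior knowledge concrete, the following minimal sketch computes the mean and variance of a Beta distribution from its two counts using the standard closed-form formulas. It does not use pybandits; the helper names `beta_mean` and `beta_variance` and the `(8, 2)` prior are hypothetical and chosen only for illustration.

```python
def beta_mean(n_successes: int, n_failures: int) -> float:
    """Mean of a Beta(n_successes, n_failures) distribution."""
    return n_successes / (n_successes + n_failures)


def beta_variance(n_successes: int, n_failures: int) -> float:
    """Variance of a Beta(n_successes, n_failures) distribution."""
    total = n_successes + n_failures
    return (n_successes * n_failures) / (total**2 * (total + 1))


# Uniform prior, as used in the cell above: no preference between outcomes.
uniform_mean = beta_mean(1, 1)
uniform_var = beta_variance(1, 1)

# Hypothetical informative prior: as if 8 successes and 2 failures were already observed.
informed_mean = beta_mean(8, 2)
informed_var = beta_variance(8, 2)
```

Larger counts give a smaller variance, so a prior such as `Beta(n_successes=8, n_failures=2)` expresses both a higher expected reward probability and more confidence in it than `Beta(n_successes=1, n_failures=1)`.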

### 1.2 Initialize via utility function (for cold start)

You can initialize the bandit via the utility function `SmabBernoulli.cold_start()`. This is particularly useful in a cold start setting, when there is no prior knowledge on the Beta distributions. In this case, `n_successes` and `n_failures` are set to `1` for all Betas.

In [5]:
# generate a smab bernoulli in cold start settings
mab = SmabBernoulli.cold_start(action_ids=["a1", "a2", "a3"])

In [6]:
print(mab)

## 2. Function `predict()`

In [7]:
help(mab.predict)

Help on method predict in module pybandits.smab:

predict(n_samples: pydantic.types.PositiveInt = 1, forbidden_actions: Optional[Set[pybandits.base.ActionId]] = None) -> Tuple[List[pybandits.base.ActionId], List[Dict[pybandits.base.ActionId, pybandits.base.Probability]]] method of pybandits.smab.SmabBernoulli instance
    Predict actions.
    
    Parameters
    ----------
    n_samples : int > 0, default=1
        Number of samples to predict.
    forbidden_actions : Optional[Set[ActionId]], default=None
        Set of forbidden actions. If specified, the model will discard the forbidden_actions and it will only
        consider the remaining allowed_actions. By default, the model considers all actions as allowed_actions.
        Note that: actions = allowed_actions U forbidden_actions.
    
    Returns
    -------
    actions: List[ActionId] of shape (n_samples,)
        The actions selected by the multi-armed bandit model.
    probs: List[Dict[ActionId, Probability]] of shape (n_sam

In [8]:
# predict for 5 samples
actions, probs = mab.predict(n_samples=5)

In [9]:
actions

['a3', 'a1', 'a3', 'a1', 'a3']

In [10]:
probs

[{'a1': 0.68, 'a3': 0.77, 'a2': 0.51},
 {'a1': 0.85, 'a3': 0.18, 'a2': 0.82},
 {'a1': 0.68, 'a3': 0.82, 'a2': 0.42},
 {'a1': 0.98, 'a3': 0.72, 'a2': 0.22},
 {'a1': 0.72, 'a3': 0.83, 'a2': 0.13}]

In [11]:
# predict for 5 samples with forbidden actions; here `a1` will never be predicted
actions, probs = mab.predict(n_samples=5, forbidden_actions={"a1"})

In [12]:
actions

['a2', 'a2', 'a2', 'a3', 'a2']

In [13]:
probs

[{'a3': 0.71, 'a2': 0.86},
 {'a3': 0.51, 'a2': 0.55},
 {'a3': 0.42, 'a2': 0.87},
 {'a3': 0.89, 'a2': 0.52},
 {'a3': 0.41, 'a2': 0.42}]
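
In the outputs above, the selected action in each row is the one with the highest value in `probs`, which is consistent with Thompson sampling: draw one sample from each allowed action's Beta distribution and pick the maximum. The sketch below reproduces that logic with the standard library only. It is an assumption about the default behavior, not the pybandits implementation, and `thompson_predict` is a hypothetical helper.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def thompson_predict(
    betas: Dict[str, Tuple[int, int]],
    n_samples: int = 1,
    forbidden_actions: Optional[Set[str]] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Dict[str, float]]]:
    """Thompson sampling over Beta(n_successes, n_failures) pairs (illustrative only)."""
    rng = rng or random.Random()
    forbidden = forbidden_actions or set()
    allowed = [a for a in betas if a not in forbidden]

    actions: List[str] = []
    probs: List[Dict[str, float]] = []
    for _ in range(n_samples):
        # Draw one sample from the Beta distribution of each allowed action.
        draw = {a: rng.betavariate(*betas[a]) for a in allowed}
        probs.append(draw)
        # Select the action with the highest sampled probability.
        actions.append(max(draw, key=draw.get))
    return actions, probs


betas = {"a1": (1, 1), "a2": (1, 1), "a3": (1, 1)}
sketch_actions, sketch_probs = thompson_predict(
    betas, n_samples=5, forbidden_actions={"a1"}, rng=random.Random(0)
)
```

Because forbidden actions are filtered out before sampling, they appear neither in the selected actions nor in the per-sample probability dictionaries, matching the shape of the output shown above.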

## 3. Function `update()`

In [14]:
help(mab.update)

Help on method update in module pybandits.smab:

update(actions: List[pybandits.base.ActionId], rewards: List[pybandits.base.BinaryReward]) method of pybandits.smab.SmabBernoulli instance
    Update the stochastic Bernoulli bandit given the list of selected actions and their corresponding binary
    rewards.
    
    Parameters
    ----------
    actions : List[ActionId] of shape (n_samples,), e.g. ['a1', 'a2', 'a3', 'a4', 'a5']
        The selected action for each sample.
    rewards : List[Union[BinaryReward, List[BinaryReward]]] of shape (n_samples, n_objectives)
        The binary reward for each sample.
            If strategy is not MultiObjectiveBandit, rewards should be a list, e.g.
                rewards = [1, 0, 1, 1, 1, ...]
            If strategy is MultiObjectiveBandit, rewards should be a list of list, e.g. (with n_objectives=2):
                rewards = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], ...]



In [15]:
# simulate rewards from the environment
rewards = [1, 0, 1, 1, 0]

In [16]:
# update
mab.update(actions=actions, rewards=rewards)
print(mab)
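
For a Beta prior with Bernoulli rewards, the standard conjugate update adds each reward of `1` to the success count and each reward of `0` to the failure count of the selected action. The sketch below applies that rule to the actions and rewards used above, assuming the bandit follows this textbook update. `beta_update` is a hypothetical helper and does not call pybandits.

```python
from typing import Dict, List, Tuple


def beta_update(
    betas: Dict[str, Tuple[int, int]],
    actions: List[str],
    rewards: List[int],
) -> Dict[str, Tuple[int, int]]:
    """Return updated (n_successes, n_failures) counts after observing binary rewards."""
    if len(actions) != len(rewards):
        raise ValueError("actions and rewards must have the same length")
    updated = dict(betas)
    for action, reward in zip(actions, rewards):
        n_successes, n_failures = updated[action]
        # A reward of 1 counts as a success, a reward of 0 as a failure.
        updated[action] = (n_successes + reward, n_failures + 1 - reward)
    return updated


prior = {"a1": (1, 1), "a2": (1, 1), "a3": (1, 1)}
posterior = beta_update(
    prior,
    actions=["a2", "a2", "a2", "a3", "a2"],
    rewards=[1, 0, 1, 1, 0],
)
# a2 was selected four times (2 successes, 2 failures), a3 once (1 success), a1 never.
```

Under this rule, an action that was never selected keeps its prior counts unchanged.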

## 4. Example of usage

Simulate 10 updates. For each update, we predict actions for a batch of 1000 samples and then update the bandit with the observed rewards.

In [17]:
n_updates = 10
batch_size = 1000

for _ in range(n_updates):
    # predict
    actions, _ = mab.predict(n_samples=batch_size)

    # simulate rewards from the environment
    rewards = np.random.choice([0, 1], size=batch_size).tolist()

    # update
    mab.update(actions=actions, rewards=rewards)

In [18]:
print(mab)
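
In the loop above, rewards are drawn uniformly from `[0, 1]` regardless of the selected action, so no action is actually better than another. A simulation in which the bandit has something to learn needs rewards that depend on the action. The helper below is a minimal sketch of such an environment using only the standard library; `simulate_rewards` and the values in `true_probs` are hypothetical.

```python
import random
from typing import Dict, List, Optional


def simulate_rewards(
    actions: List[str],
    true_probs: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw one Bernoulli reward per selected action from its true success probability."""
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so the comparison is True with probability p.
    return [int(rng.random() < true_probs[action]) for action in actions]


# Hypothetical environment in which a3 is the best action.
true_probs = {"a1": 0.2, "a2": 0.5, "a3": 0.8}
example_rewards = simulate_rewards(["a1", "a3", "a3"], true_probs, rng=random.Random(42))
```

To use it in the loop, the `np.random.choice` line could be replaced with `rewards = simulate_rewards(actions, true_probs)`. Over successive updates, one would then expect the bandit to select `a3` more and more often.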