# Contextual Multi-Armed Bandit

For the contextual multi-armed bandit (cMAB), where user information (context) is available, we implemented a generalisation of the Thompson sampling algorithm ([Agrawal and Goyal, 2014](https://arxiv.org/pdf/1209.3352.pdf)) based on PyMC3.

![title](img/cmab.png)

This notebook shows how to use the `Cmab` class, which implements the algorithm above.

In [None]:
import numpy as np

from pybandits.core.cmab import Cmab

First, we define the input context matrix $X$ of size ($n\_samples, n\_features$) and the list of possible actions $a_i \in A$.

In [None]:
# context
n_samples = 1000
n_features = 5
X = 2 * np.random.random_sample((n_samples, n_features)) - 1  # random floats in the half-open interval [-1, 1)
print("X: context matrix of shape (n_samples, n_features)")
print(X[:10])

In [None]:
# define actions
actions_ids = ["action A", "action B", "action C"]

We can now initialise the bandit, given the number of features and the list of actions $a_i$.

In [None]:
# init contextual Multi-Armed Bandit model
cmab = Cmab(n_features=n_features, actions_ids=actions_ids)

The `predict` function below returns the action selected by the bandit at time $t$: $a_t = \operatorname{argmax}_k P(r=1|\beta_k, x_t)$. The bandit selects one action for each sample (row) of the context matrix $X$.

In [None]:
# predict action
pred_actions, _ = cmab.predict(X)
print("Recommended action: {}".format(pred_actions[:10]))
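The selection rule $a_t = \operatorname{argmax}_k P(r=1|\beta_k, x_t)$ can be illustrated with a small NumPy sketch. It is not the `Cmab` implementation: it assumes a logistic reward model without intercept and a hypothetical independent Gaussian posterior over each $\beta_k$, and the names `thompson_select`, `post_mean` and `post_std` are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def thompson_select(X, post_mean, post_std, rng):
    """Pick one action index per row of X (illustrative sketch only).

    post_mean, post_std: arrays of shape (n_actions, n_features) describing an
    ASSUMED independent Gaussian posterior over the coefficients beta_k.
    """
    n_samples = X.shape[0]
    # One posterior draw of beta_k for every (sample, action) pair:
    # shape (n_samples, n_actions, n_features).
    betas = rng.normal(post_mean, post_std, size=(n_samples,) + post_mean.shape)
    # P(r=1 | beta_k, x_t) under the assumed logistic model, shape (n_samples, n_actions).
    probs = sigmoid(np.einsum("sf,saf->sa", X, betas))
    # Thompson sampling: act greedily with respect to the sampled parameters.
    return probs.argmax(axis=1), probs


X_demo = 2 * rng.random((4, 5)) - 1           # 4 samples, 5 features
post_mean = np.zeros((3, 5))                  # 3 actions, uninformative posterior
post_std = np.ones((3, 5))
chosen_idx, sampled_probs = thompson_select(X_demo, post_mean, post_std, rng)
print(chosen_idx)
```

Because the coefficients are sampled rather than fixed at their posterior mean, actions with uncertain estimates are still selected occasionally, which is how Thompson sampling explores.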

Now, we observe the rewards from the environment. In this example, rewards are simulated at random.

In [None]:
# simulate reward from environment
simulated_rewards = np.random.randint(2, size=n_samples)
print("Simulated rewards: {}".format(simulated_rewards[:10]))
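The rewards above are drawn independently of both context and action, so the bandit has nothing to learn from them. As a possible alternative, the sketch below draws Bernoulli rewards whose probability depends on the context and on the chosen action. The logistic form, the coefficients `true_betas` and the random stand-in `chosen` for the bandit's choices are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_actions = 1000, 5, 3

X_sim = 2 * rng.random((n_samples, n_features)) - 1

# Hypothetical ground-truth coefficients, one vector per action.
true_betas = rng.normal(size=(n_actions, n_features))

# Stand-in for the bandit's choices: one action index per sample.
chosen = rng.integers(n_actions, size=n_samples)

# Reward probability of the chosen action under the assumed logistic model.
logits = np.einsum("sf,sf->s", X_sim, true_betas[chosen])
p_reward = 1.0 / (1.0 + np.exp(-logits))

# Binary rewards r_t ~ Bernoulli(p_reward).
rewards_sim = rng.binomial(1, p_reward)
print(rewards_sim[:10])
```

In the notebook, `chosen` would presumably be replaced by the position of each entry of `pred_actions` within `actions_ids`.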

Finally, we update the model by providing, for each sample: (i) its context $x_t$, (ii) the action $a_t$ selected by the bandit, and (iii) the corresponding reward $r_t$.

In [None]:
# update model
cmab.update(X, actions=pred_actions, rewards=simulated_rewards)
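To make the shape of the update data concrete, the sketch below groups the triplets $(x_t, a_t, r_t)$ by action, so that each action's model would only see the contexts in which it was selected. Whether `Cmab.update` partitions the data this way internally is an assumption; `split_by_action` is a hypothetical helper, not part of pybandits.

```python
import numpy as np


def split_by_action(X, actions, rewards):
    """Group contexts and rewards by the action that was played (hypothetical helper)."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    batches = {}
    for a in np.unique(actions):
        mask = actions == a                      # samples where action `a` was chosen
        batches[str(a)] = (X[mask], rewards[mask])
    return batches


X_demo = np.arange(12, dtype=float).reshape(6, 2)   # 6 samples, 2 features
actions_demo = ["action A", "action B", "action A", "action C", "action B", "action A"]
rewards_demo = [1, 0, 0, 1, 1, 1]

batches = split_by_action(X_demo, actions_demo, rewards_demo)
for name, (X_a, r_a) in batches.items():
    print(name, X_a.shape, r_a.tolist())
```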