For an introduction to Multi-Armed Bandits, refer to bandit_overview.
This method differs from the others. In the other methods we explicitly estimate the 'value' of taking an action (its quality), whereas here we only care about an action's preference of being selected relative to the other actions. We denote this preference by Ht(a). The larger the preference of an action 'a', the higher its chance of being selected, but the preference has no interpretation in terms of the reward for that action: only the relative preferences matter.
The action probabilities are related to these action preferences Ht(a) by a softmax function. The probability of taking action aj is given by:

    πt(aj) = exp(Ht(aj)) / Σ_{b=1..A} exp(Ht(b))

where A is the total number of actions and πt(a) is the probability of taking action 'a' at timestep 't'.
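As a quick illustration, the softmax above can be computed directly with NumPy (the function name `action_probs` is ours, not part of any library):

```python
import numpy as np

def action_probs(H):
    """Softmax over the preference vector H (one entry per action)."""
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result because softmax is shift-invariant.
    e = np.exp(H - H.max())
    return e / e.sum()

H = np.zeros(5)       # all preferences equal ...
p = action_probs(H)   # ... so every action gets probability 1/A = 0.2
```

Note that adding the same constant to every preference leaves the probabilities unchanged, which is why only relative preferences matter.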
We initialise the preferences for all the actions to 0, meaning

    H1(a) = 0, for all actions a

so initially every action is equally likely to be selected.
At each timestep we compute πt(a) for all actions, sample an action from this distribution, perform it, and update the preferences based on the reward we receive.
The update rule basically performs stochastic gradient ascent:

    Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At))    for the selected action At
    Ht+1(a)  = Ht(a)  − α (Rt − R̄t) πt(a)          for all other actions a ≠ At

where α is the step size, Rt is the reward obtained at time 't', and R̄t is the average of the rewards received so far, which serves as a baseline against which Rt is compared.
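The sample-then-update cycle can be sketched as follows. This is a minimal self-contained sketch, not the library's implementation; the helper name `gradient_bandit_step` and the incremental-mean baseline are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_bandit_step(H, baseline, t, true_probs, alpha=0.1):
    """One step of the gradient bandit algorithm (hypothetical helper).

    H          : preference vector, one entry per arm
    baseline   : running average of past rewards (R-bar)
    t          : 1-based timestep, used for the incremental mean
    true_probs : Bernoulli reward probability of each arm (the environment)
    """
    # Softmax turns preferences into action probabilities.
    e = np.exp(H - H.max())
    pi = e / e.sum()

    # Sample an action from pi and observe a Bernoulli reward.
    a = rng.choice(len(H), p=pi)
    r = float(rng.random() < true_probs[a])

    # Stochastic gradient ascent: every action is pushed down by
    # alpha*(r - baseline)*pi, then the chosen action gets +alpha*(r - baseline),
    # for a net update of +alpha*(r - baseline)*(1 - pi[a]).
    H = H - alpha * (r - baseline) * pi
    H[a] += alpha * (r - baseline)

    # Incremental mean of rewards as the baseline (one common choice).
    baseline += (r - baseline) / t
    return H, baseline

H, baseline = np.zeros(3), 0.0
for t in range(1, 2001):
    H, baseline = gradient_bandit_step(H, baseline, t,
                                       true_probs=[0.1, 0.5, 0.9])
```

After enough steps the preference of the best arm dominates, so the softmax concentrates its probability mass there.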
For a more detailed mathematical analysis and derivation of the update rule, refer to chapter 2 of Sutton & Barto.
Code to use the Gradient method on a Bernoulli Multi-Armed Bandit:
import numpy as np

from genrl.bandit import BernoulliMAB, GradientMABAgent, MABTrainer

# 10 independent Bernoulli bandits, each with 5 arms
bandits = 10
arms = 5

# Random reward probability for every (bandit, arm) pair
reward_probs = np.random.random(size=(bandits, arms))

bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")

# alpha is the step size, temp the softmax temperature
agent = GradientMABAgent(bandit, alpha=0.1, temp=0.01)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, GradientMABAgent and MABTrainer.