FIXME: change the plot, and add more up-to-date explanations!
Remark: I am finishing an article on this topic; it will be a better introduction, as a small self-contained document explaining this idea and the algorithms.
Initially, every child algorithm A_i has the same "trust" probability p_i. At every step, the aggregated bandit first listens to the decisions of all its children A_i (a_{i,t} in {1, ..., K}), and then decides which arm to select by a probabilistic vote: the probability of selecting arm k is the sum of the trust probabilities of the children who voted for arm k.
It could also be done the other way around: the aggregated bandit could first decide which child to listen to, and then trust it. But we want to update the trust probabilities of all the children algorithms, not only one, whenever it was wise to trust them.
Mathematically, when the aggregated bandit chooses to pull arm k at step t and this yields a positive reward r_{k,t}, the trust probability of every child algorithm A_i that (independently) decided to choose k (i.e., a_{i,t} = k) is increased multiplicatively: p_i <- p_i * exp(+ beta * r_{k,t}), where beta is a positive learning rate, e.g., beta = 0.1.
It is also possible to multiplicatively decrease the trust of all the children algorithms that did not choose arm k at step t: if a_{i,t} != k, then p_i <- p_i * exp(- beta * r_{k,t}). I did not observe any difference in behavior between these two options (implemented with the Boolean parameter updateAllChildren).
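To make the vote and the trust update concrete, here is a minimal, self-contained sketch of one aggregation step. The function and variable names are only for illustration; the actual Aggregator class in this project works with policy objects and is organized differently.

import numpy as np

def aggregation_step(choices, trusts, nb_arms, draw_reward, beta=0.1, update_all_children=False):
    """One step of the probabilistic-vote aggregation (illustrative sketch).

    choices     : array of length N, decisions a_{i,t} of the N children (values in 0 .. nb_arms-1)
    trusts      : array of length N, trust probabilities p_i (non-negative, summing to 1)
    draw_reward : function arm -> reward r_{k,t}
    """
    # Probabilistic vote: P(select arm k) = sum of the trusts p_i of the children with a_{i,t} = k
    vote = np.zeros(nb_arms)
    for a, p in zip(choices, trusts):
        vote[a] += p
    k = np.random.choice(nb_arms, p=vote)
    # Pull arm k and observe the reward r_{k,t}
    r = draw_reward(k)
    # Multiplicative update of the trusts of the children who voted for k ...
    agreed = (choices == k)
    new_trusts = np.where(agreed, trusts * np.exp(+beta * r), trusts)
    # ... and, if update_all_children, a multiplicative penalty for the children who did not
    if update_all_children:
        new_trusts = np.where(agreed, new_trusts, new_trusts * np.exp(-beta * r))
    # Renormalize so the trusts remain a probability distribution (assumed here)
    return k, r, new_trusts / new_trusts.sum()

# Example: 3 children voting among 2 Bernoulli arms (illustrative values only)
choices = np.array([0, 1, 1])   # a_{i,t} for the 3 children
trusts = np.full(3, 1.0 / 3)    # initial uniform trusts p_i
k, r, trusts = aggregation_step(choices, trusts, nb_arms=2,
                                draw_reward=lambda arm: np.random.binomial(1, [0.1, 0.9][arm]))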
This algorithm can be seen as the Multi-Armed Bandits (i.e., sequential reinforcement learning) counterpart of an ensemble voting technique, as used for classifiers or regression algorithms in usual supervised machine learning (see, e.g., sklearn.ensemble.VotingClassifier in scikit-learn).
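For comparison, a soft-voting ensemble for supervised classification can be built in a few lines with scikit-learn; the estimators and the synthetic data below are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
# "Soft" voting averages the predicted class probabilities of the estimators with fixed
# (here equal) weights; the Aggregator instead learns its weights (trusts) online.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier())],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))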
Another approach could be to do some sort of grid search.
A simple Python file, configuration.py, is used to import the arm classes and the policy classes, and to define the problems and the experiments.
For example, this will compare the classical MAB algorithms UCB, Thompson, BayesUCB, klUCB, and the less classical AdBandit algorithm.
configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # Number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPUs
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,  # Only Bernoulli is available for now
            "probabilities": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.1]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {}},
        {"archtype": UCBV, "params": {}},
        {"archtype": UCBTuned, "params": {}},
        {"archtype": MOSS, "params": {}},
        {"archtype": Thompson, "params": {}},
        {"archtype": klUCB, "params": {}},
        {"archtype": klUCBPlus, "params": {}},
        {"archtype": klUCBHPlus, "params": {}},
        {"archtype": BayesUCB, "params": {}},
        {"archtype": AdBandit, "params": {
            "alpha": 0.5, "horizon": 10000  # AdBandit requires to know the horizon
        }}
    ]
}
To add an aggregated bandit algorithm (the Aggregator class), you can use this piece of code to aggregate all the algorithms defined before and dynamically add it to configuration:
current_policies = configuration["policies"]
configuration["policies"] = current_policies + [{
    # Add one Aggregator policy, from all the policies defined above
    "archtype": Aggregator,
    "params": {
        "learningRate": 0.05,  # Tweak this if needed
        "updateAllChildren": True,
        "children": current_policies,
    },
}]
First, install the requirements:
pip install -r requirements.txt
Then it should be very straightforward to run some experiments.
This will run the simulations, average them (over the repetitions) and plot the results:
python main.py
In a virtualenv ?
If you prefer not to install the requirements globally in your system-wide Python setup, you can (and should) use virtualenv.
$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/AlgoBandits/bin/python3
Also creating executable in /your/path/to/AlgoBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate # in bash, use activate.csh or activate.fish if needed
$ type pip # just to check
pip is /your/path/to/AlgoBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2
And then be sure to use the virtualenv binary for Python, bin/python, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate was executed).
Or with a Makefile ?
You can also use the provided Makefile to do this simply:
make install # install the requirements
make single # run and log the main.py script
Or within a Jupyter notebook ?
I am writing some Jupyter notebooks, in this folder (notebooks/), so if you want to do the same for your small experiments, you can be inspired by the few notebooks already written.
Here are some plots illustrating the performance of the different policies implemented in this project, against various problems (with Bernoulli arms only):

- 4 different Aggregator on 6 policies
- 1 Aggregator performing very well
- 3 different UCB, with alpha values lower than 0.5 (nothing is known theoretically for alpha < 1/2)
Some examples where Aggregator performs well:

- Aggregator is the best on this example, and it performed well here also.
- The Aggregator can have a fixed learning rate, whose value has a great effect on its performance, as illustrated here.
- This last example shows the aggregation of various policies, all being order-optimal and performing similarly.
- Arms are defined in this folder (Arms/), see for example Arms.Bernoulli.
- MAB algorithms (also called policies) are defined in this folder (Policies/), see for example Policies.Dummy for a fully random policy, Policies.EpsilonGreedy for the epsilon-greedy random policy, Policies.UCB for the "simple" UCB algorithm, or also Policies.BayesUCB and Policies.klUCB for two UCB-like algorithms, Policies.AdBandits for the AdBandits algorithm, and Policies.Aggregator for my aggregated bandits algorithms.
- Environments to encapsulate data are defined in this folder (Environment/): a MAB problem uses the class Environment.MAB, simulation results are stored in an Environment.Result, and the class to evaluate multi-policy, single-player, multi-environment experiments is Environment.Evaluator.
- configuration.py imports all the classes and defines the simulation parameters as a dictionary (JSON-like).
- main.py runs the simulations, then displays the final ranking of the different policies and plots the results (saved to this folder (plots/)).
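For instance, a single policy can be run against a single problem without the Evaluator, with a loop like the following rough sketch. It assumes the usual interface of this project's classes (an arm exposing draw(), a policy exposing startGame(), choice() and getReward(arm, reward)); check the modules above for the exact import paths and signatures.

from Arms.Bernoulli import Bernoulli   # import path assumed, see Arms/
from Policies.UCB import UCB           # import path assumed, see Policies/

arms = [Bernoulli(0.1), Bernoulli(0.5), Bernoulli(0.9)]
policy = UCB(nbArms=len(arms))
policy.startGame()                      # reset the internal statistics
total_reward = 0.0
horizon = 1000
for t in range(horizon):
    arm = policy.choice()               # the policy picks an arm
    reward = arms[arm].draw()           # the environment draws a reward from that arm
    policy.getReward(arm, reward)       # the policy updates its estimates
    total_reward += reward
print("Mean reward:", total_reward / horizon)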
For more details, see these UML diagrams:

- Packages: organization of the different files.
- Classes: inheritance diagrams of the different classes.
MIT Licensed (file LICENSE).
© 2012 Olivier Cappé, Aurélien Garivier, Émilie Kaufmann for the initial pymaBandits v1.0 project, and © 2016-2017 Lilian Besson for the rest.