
1st contribution: The policy aggregation algorithm

  • FIXME change the plot, and add more up-to-date explanations!

  • Remark: I am finishing an article on this topic; as a small self-contained document, it will be a better introduction to this idea and to the algorithms.

More mathematical explanations

Initially, every child algorithm A_i has the same "trust" probability p_i. At every step, the aggregated bandit first listens to the decisions of all its children A_i (a_{i,t} in 1 .. K), and then decides which arm to select by a probabilistic vote: the probability of selecting arm k is the sum of the trust probabilities of the children who voted for arm k. It could also be done the other way around: the aggregated bandit could first decide which child to listen to, and then trust it.
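As a minimal sketch of this voting step (illustrative names only, not the actual Aggregator implementation), assuming each child exposes a choice() method returning an arm index:

import numpy as np

def choose_arm(children, trusts, nb_arms, rng=np.random):
    """Ask every child for its vote, then select an arm with probability
    equal to the total trust of the children who voted for it."""
    votes = [child.choice() for child in children]  # a_{i,t} for each child
    arm_probabilities = np.zeros(nb_arms)
    for vote, trust in zip(votes, trusts):
        arm_probabilities[vote] += trust  # P(select k) = sum of trusts of children voting for k
    # Normalize, in case the trusts do not sum exactly to 1
    chosen_arm = rng.choice(nb_arms, p=arm_probabilities / arm_probabilities.sum())
    return chosen_arm, votes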

But we want to update the trust probabilities of all the children algorithms, not only one, whenever it was wise to trust them. Mathematically, when the aggregated bandit chooses to pull arm k at step t and this yields a positive reward r_{k,t}, the trust probability of every child algorithm A_i that (independently) decided to choose k (i.e., a_{i,t} = k) is increased multiplicatively: p_i <- p_i * exp(+ beta * r_{k,t}), where beta is a positive learning rate, e.g., beta = 0.1.

It is also possible to multiplicatively decrease the trust of all the children algorithms that did not decide to choose arm k at step t: if a_{i,t} != k, then p_i <- p_i * exp(- beta * r_{k,t}). I did not observe any difference in behavior between these two options (implemented with the Boolean parameter updateAllChildren).
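Continuing the sketch above (still with illustrative names), the multiplicative trust update could look like this, with beta the learning rate and update_all_children mirroring the updateAllChildren option; the final renormalization is only there to keep the trusts a probability distribution:

import numpy as np

def update_trusts(trusts, votes, chosen_arm, reward, beta=0.1, update_all_children=False):
    trusts = np.asarray(trusts, dtype=float)
    for i, vote in enumerate(votes):
        if vote == chosen_arm:
            trusts[i] *= np.exp(+ beta * reward)   # p_i <- p_i * exp(+ beta * r_{k,t})
        elif update_all_children:
            trusts[i] *= np.exp(- beta * reward)   # p_i <- p_i * exp(- beta * r_{k,t})
    return trusts / np.sum(trusts)                 # renormalize the trust probabilities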

Ensemble voting for MAB algorithms

This algorithm can be seen as the Multi-Armed Bandits (i.e., sequential reinforcement learning) counterpart of an ensemble voting technique, as used for classifiers or regression algorithms in usual supervised machine learning (see, e.g., sklearn.ensemble.VotingClassifier in scikit-learn).
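For comparison only, this is what the supervised-learning analogue looks like with scikit-learn (standard VotingClassifier usage, unrelated to the code of this project):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Soft voting averages the predicted class probabilities of the base estimators,
# much like the Aggregator mixes the decisions of its children with trust weights.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier())],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.score(X, y))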

Another approach could be to do some sort of grid search.


Configuration:

A simple Python file, configuration.py, is used to import the arm classes and the policy classes, and to define the problems and the experiments.

For example, this will compare the classical MAB algorithms UCB, Thompson, BayesUCB, klUCB (and some variants), and the less classical AdBandit algorithm.

configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,  # Only Bernoulli is available as far as now
            "probabilities": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.1]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": UCBV, "params": {} },
        {"archtype": UCBTuned, "params": {} },
        {"archtype": MOSS, "params": {} },
        {"archtype": Thompson, "params": {} },
        {"archtype": klUCB, "params": {} },
        {"archtype": klUCBPlus, "params": {} },
        {"archtype": klUCBHPlus, "params": {} },
        {"archtype": BayesUCB, "params": {} },
        {"archtype": AdBandit, "params": {
                "alpha": 0.5, "horizon": 10000  # AdBandit require to know the horizon
        } }
    ]
}

To add an aggregated bandit algorithm (the Aggregator class), you can use this piece of code to aggregate all the algorithms defined above and dynamically add the aggregated policy to the configuration:

current_policies = configuration["policies"]
configuration["policies"] = current_policies + [{
    # Add one Aggregator policy, from all the policies defined above
    "archtype": Aggregator,
    "params": {
        "learningRate": 0.05,  # Tweak this if needed
        "updateAllChildren": True,
        "children": current_policies,
    },
}]

First, install the requirements:

pip install -r requirements.txt

Then it should be very straightforward to run an experiment. This will run the simulations, average them (over the repetitions) and plot the results:

python main.py

In a virtualenv ?

If you prefer not to install the requirements globally in your system-wide Python setup, you can (and should) use virtualenv.

$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/AlgoBandits/bin/python3
Also creating executable in /your/path/to/AlgoBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate  # in bash, use activate.csh or activate.fish if needed
$ type pip  # just to check
pip is /your/path/to/AlgoBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2

Then be sure to use the virtualenv Python binary, bin/python, instead of the system-wide one, to launch the experiments (the Makefile should use it by default if source bin/activate was executed).

Or with a Makefile ?

You can also use the provided Makefile to do this simply:

make install  # install the requirements
make single   # run and log the main.py script
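The actual Makefile is not reproduced here, but these two targets essentially boil down to something like the following sketch (the log file name is just an example):

install:
	pip install -r requirements.txt

single:
	python main.py | tee main.log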

Or within a Jupyter notebook ?

I am writing some Jupyter notebooks, in the notebooks/ folder, so if you want to do the same for your own small experiments, you can take inspiration from the few notebooks already written.


Some illustrations

Here are some plots illustrating the performance of the different policies implemented in this project, on various problems (with Bernoulli arms only):

Small tests

5 tests with AdBandit and Aggregator: 2000 steps - 100 repetitions

Larger tests

  • 4 different Aggregators on 6 policies: 10000 steps - 50 repetitions - 6 policies - with 4 Aggregators
  • 1 Aggregator performing very well: 10000 steps - 50 repetitions - 6 policies - with Softmax and 1 Aggregator
  • 3 different UCB algorithms, with alpha values lower than 0.5 (nothing is known theoretically for alpha < 1/2): 10000 steps - 50 repetitions - 3 UCB and Aggregator

Some examples where Aggregator performs well

  • Aggregator is the best on this example: Aggregator is the best here
  • And it performed well here also: one Aggregator does very well

Another example

The Aggregator can use a fixed learning rate, whose value has a great effect on its performance, as illustrated here: 20000 steps - 100 repetitions - 6 policies - with 5 Aggregators
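To observe this effect yourself, one simple option (following the same configuration pattern as above; the learning-rate values below are only examples) is to add several Aggregator instances, one per learning rate:

current_policies = configuration["policies"]
configuration["policies"] = current_policies + [
    {
        "archtype": Aggregator,
        "params": {
            "learningRate": learning_rate,  # one Aggregator per learning rate
            "updateAllChildren": True,
            "children": current_policies,
        },
    }
    for learning_rate in [0.01, 0.05, 0.1, 0.5, 1.0]
]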

On a harder problem

Example on a harder problem

Aggregation of order-optimal policies

This last example shows the aggregation of various policies, all of them order-optimal and performing similarly (20000 steps - 20 repetitions - 7 policies: kl-UCB variants, MOSS, UCBTuned):

  • cumulative regret
  • frequency of best-arm pulls
  • normalized cumulative regret


Code organization

Layout of the code:

UML diagrams

For more details, see these UML diagrams:

  • Packages: organization of the different files: UML Diagram - Packages of AlgoBandits.git
  • Classes: inheritance diagrams of the different classes: UML Diagram - Classes of AlgoBandits.git

📜 License ?

MIT Licensed (file LICENSE).

© 2012 Olivier Cappé, Aurélien Garivier, Émilie Kaufmann for the initial pymaBandits v1.0 project, and © 2016-2017 Lilian Besson for the rest.
