FIXME: change the plot, and add more up-to-date explanations!
Remark: I am finishing an article on this topic; it will be a better introduction, as a small self-contained document explaining this idea and the algorithms.
Initially, every child algorithm A_i has the same "trust" probability p_i. At every step, the aggregated bandit first listens to the decisions of all its children A_i (a_{i,t} in {1, ..., K}), and then decides which arm to select by a probabilistic vote: the probability of selecting arm k is the sum of the trust probabilities of the children who voted for arm k.
It could also be done the other way around: the aggregated bandit could first decide which child to listen to, and then trust it. But we want to update the trust probabilities of all the children algorithms, not only one, whenever it was wise to trust them.
Mathematically, when the aggregated bandit chooses to pull arm k at step t and this yields a positive reward r_{k,t}, the trust probability of every child algorithm A_i that (independently) decided to choose k (i.e., a_{i,t} = k) is increased multiplicatively: p_i <- p_i * exp(+ beta * r_{k,t}), where beta is a positive learning rate, e.g., beta = 0.1.
It is also possible to multiplicatively decrease the trust of all the children algorithms that did not choose arm k at step t: if a_{i,t} != k, then p_i <- p_i * exp(- beta * r_{k,t}). I did not observe any difference in behavior between these two options (implemented with the Boolean parameter updateAllChildren).
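To make the vote and the trust update concrete, here is a minimal, self-contained sketch of one aggregation step. The function and variable names are only for illustration; the actual Aggregator class in this project works with policy objects and is organized differently.

import numpy as np

def aggregation_step(choices, trusts, nb_arms, draw_reward, beta=0.1, update_all_children=False):
    """One step of the probabilistic-vote aggregation (illustrative sketch).

    choices     : array of length N, decisions a_{i,t} of the N children (values in 0 .. nb_arms-1)
    trusts      : array of length N, trust probabilities p_i (non-negative, summing to 1)
    draw_reward : function arm -> reward r_{k,t}
    """
    # Probabilistic vote: P(select arm k) = sum of the trusts p_i of the children with a_{i,t} = k
    vote = np.zeros(nb_arms)
    for a, p in zip(choices, trusts):
        vote[a] += p
    k = np.random.choice(nb_arms, p=vote)
    # Pull arm k and observe the reward r_{k,t}
    r = draw_reward(k)
    # Multiplicative update of the trusts of the children who voted for k ...
    agreed = (choices == k)
    new_trusts = np.where(agreed, trusts * np.exp(+beta * r), trusts)
    # ... and, if update_all_children, a multiplicative penalty for the children who did not
    if update_all_children:
        new_trusts = np.where(agreed, new_trusts, new_trusts * np.exp(-beta * r))
    # Renormalize so the trusts remain a probability distribution (assumed here)
    return k, r, new_trusts / new_trusts.sum()

# Example: 3 children voting among 2 Bernoulli arms (illustrative values only)
choices = np.array([0, 1, 1])   # a_{i,t} for the 3 children
trusts = np.full(3, 1.0 / 3)    # initial uniform trusts p_i
k, r, trusts = aggregation_step(choices, trusts, nb_arms=2,
                                draw_reward=lambda arm: np.random.binomial(1, [0.1, 0.9][arm]))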
This algorithm can be seen as the Multi-Armed Bandits (i.e., sequential reinforcement learning) counterpart of an ensemble voting technique, as used for classifiers or regression algorithms in usual supervised machine learning (see, e.g., sklearn.ensemble.VotingClassifier in scikit-learn).
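For comparison, a soft-voting ensemble for supervised classification can be built in a few lines with scikit-learn; the estimators and the synthetic data below are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
# "Soft" voting averages the predicted class probabilities of the estimators with fixed
# (here equal) weights; the Aggregator instead learns its weights (trusts) online.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier())],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))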
Another approach could be to do some sort of grid search.
A simple Python file, configuration.py, is used to import the arm classes and the policy classes, and to define the problems and the experiments.
For example, this will compare the classical MAB algorithms UCB, Thompson, BayesUCB, klUCB, and the less classical AdBandit algorithm.
configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # Number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPUs
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,  # Only Bernoulli is available for now
            "probabilities": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.1]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {}},
        {"archtype": UCBV, "params": {}},
        {"archtype": UCBTuned, "params": {}},
        {"archtype": MOSS, "params": {}},
        {"archtype": Thompson, "params": {}},
        {"archtype": klUCB, "params": {}},
        {"archtype": klUCBPlus, "params": {}},
        {"archtype": klUCBHPlus, "params": {}},
        {"archtype": BayesUCB, "params": {}},
        {"archtype": AdBandit, "params": {
            "alpha": 0.5, "horizon": 10000  # AdBandit requires to know the horizon
        }}
    ]
}
To add an aggregated bandit algorithm (the Aggregator class), you can use this piece of code to aggregate all the algorithms defined before and dynamically add it to configuration:
current_policies = configuration["policies"]
configuration["policies"] = current_policies + [{
    # Add one Aggregator policy, from all the policies defined above
    "archtype": Aggregator,
    "params": {
        "learningRate": 0.05,  # Tweak this if needed
        "updateAllChildren": True,
        "children": current_policies,
    },
}]
First, install the requirements:
pip install -r requirements.txt
Then it should be very straightforward to run some experiments.
This will run the simulations, average them (over the repetitions) and plot the results:
python main.py
In a virtualenv ?
If you prefer not to install the requirements globally in your system-wide Python setup, you can (and should) use virtualenv.
$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/AlgoBandits/bin/python3
Also creating executable in /your/path/to/AlgoBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate # in bash, use activate.csh or activate.fish if needed
$ type pip # just to check
pip is /your/path/to/AlgoBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2
And then be sure to use the virtualenv binary for Python, bin/python, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate was executed).
Or with a Makefile ?
You can also use the provided Makefile to do this simply:
make install # install the requirements
make single # run and log the main.py script
Or within a Jupyter notebook ?
I am writing some Jupyter notebooks, in this folder (notebooks/), so if you want to do the same for your small experiments, you can be inspired by the few notebooks already written.
Here are some plots illustrating the performance of the different policies implemented in this project, against various problems (with Bernoulli arms only):

- 4 different Aggregator on 6 policies
- 1 Aggregator performing very well
- 3 different UCB, with alpha values lower than 0.5 (nothing is known theoretically for alpha < 1/2)
Some examples where Aggregator performs well:

- Aggregator is the best on this example, and it performed well here also.
- The Aggregator can have a fixed learning rate, whose value has a great effect on its performance, as illustrated here.
- This last example shows the aggregation of various policies, all being order-optimal and performing similarly.
- Arms are defined in this folder (Arms/), see for example Arms.Bernoulli.
- MAB algorithms (also called policies) are defined in this folder (Policies/), see for example Policies.Dummy for a fully random policy, Policies.EpsilonGreedy for the epsilon-greedy random policy, Policies.UCB for the "simple" UCB algorithm, or also Policies.BayesUCB and Policies.klUCB for two UCB-like algorithms, Policies.AdBandits for the AdBandits algorithm, and Policies.Aggregator for my aggregated bandits algorithms.
- Environments to encapsulate data are defined in this folder (Environment/): a MAB problem uses the class Environment.MAB, simulation results are stored in an Environment.Result, and the class to evaluate multi-policy, single-player, multi-environment experiments is Environment.Evaluator.
- configuration.py imports all the classes and defines the simulation parameters as a dictionary (JSON-like).
- main.py runs the simulations, then displays the final ranking of the different policies and plots the results (saved to this folder (plots/)).
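For instance, a single policy can be run against a single problem without the Evaluator, with a loop like the following rough sketch. It assumes the usual interface of this project's classes (an arm exposing draw(), a policy exposing startGame(), choice() and getReward(arm, reward)); check the modules above for the exact import paths and signatures.

from Arms.Bernoulli import Bernoulli   # import path assumed, see Arms/
from Policies.UCB import UCB           # import path assumed, see Policies/

arms = [Bernoulli(0.1), Bernoulli(0.5), Bernoulli(0.9)]
policy = UCB(nbArms=len(arms))
policy.startGame()                      # reset the internal statistics
total_reward = 0.0
horizon = 1000
for t in range(horizon):
    arm = policy.choice()               # the policy picks an arm
    reward = arms[arm].draw()           # the environment draws a reward from that arm
    policy.getReward(arm, reward)       # the policy updates its estimates
    total_reward += reward
print("Mean reward:", total_reward / horizon)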
For more details, see these UML diagrams:

- Packages: organization of the different files.
- Classes: inheritance diagrams of the different classes.
MIT Licensed (file LICENSE).
© 2012 Olivier Cappé, Aurélien Garivier, Émilie Kaufmann for the initial pymaBandits v1.0 project, and © 2016-2017 Lilian Besson for the rest.