2nd contribution: Multi-players simulation environment
-
FIXME change the plot, and add more up-to-date explanations!
-
Remark: I am finishing an article on that topic, it will be a better introduction as a small self-contained document to explain this idea and the algorithms, the models etc.
There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting.
The basic difference is about collisions : at each time t
, if two or more user chose to sense the same channel, there is a collision. Collisions can be handled in different way from the base station point of view, and from each player point of view.
For example, I implemented these different collision models, in CollisionModels.py
:
noCollision
is a limited model where all players can sample an arm with collision. It corresponds to the single-player simulation: each player is a policy, compared without collision. This is for testing only, not so interesting.onlyUniqUserGetsReward
is a simple collision model where only the players alone on one arm sample it and receive the reward. This is the default collision model in the literature, for instance cf. [Shamir et al., 2015] collision model 1 or cf [Liu & Zhao, 2009].rewardIsSharedUniformly
is similar: the players alone on one arm sample it and receive the reward, and in case of more than one player on one arm, only one player (uniform choice, chosen by the base station) can sample it and receive the reward.closerUserGetsReward
is similar but uses another approach to chose who can emit. Instead of randomly choosing the lucky player, it uses a given (or random) vector indicating the distance of each player to the base station (it can also indicate the quality of the communication), and when two (or more) players are colliding, only the one who is closer to the base station can transmit. It is the more physically plausible.
Have a look to:
main_multiplayers.py
andconfiguration_multiplayers.py
to run and configure the simulation,- the
EvaluatorMultiPlayers
class that performs the simulation, - the
ResultMultiPlayers
class to store the results, - and some naive policies are implemented in the
PoliciesMultiPlayers/
folder. As far as now, there is theSelfish
,CentralizedFixed
,CentralizedCycling
,OracleNotFair
,OracleFair
multi-players policy.
- The first one I implemented is the "Musical Chair" policy, from [Shamir et al., 2015], in
MusicalChair
. - Then I implemented the "MEGA" policy from [Avner & Mannor, 2014], in
MEGA
. But it has too much parameter, the question is how to chose them?
A simple python file, configuration.py
, is used to import the arm classes, the policy classes and define the problems and the experiments.
See the explanations given for the simple-player case.
nbArms = len(configuration['environment'][0]['params'])
# WARNING do not use this hack if you try to use more than one environment
configuration.update({
# Uncomment the lines you don't want, keep ONLY one line
# --- Using multi-player Selfish policy
"players": Selfish(NB_PLAYERS, Uniform, nbArms).children # Stupid one
"players": Selfish(NB_PLAYERS, TakeRandomFixedArm, nbArms).children # Also stupid one
"players": Selfish(NB_PLAYERS, Softmax, nbArms, temperature=TEMPERATURE).children # Based on empirical means
"players": Selfish(NB_PLAYERS, UCBalpha, nbArms, alpha=1./4).children # This one is efficient!
# --- Using multi-player Centralized policy, un-fair or fair
"players": CentralizedFixed(NB_PLAYERS, nbArms).children
"players": CentralizedCycling(NB_PLAYERS, nbArms).children
# --- Using multi-player Oracle policy, un-fair or fair
# Note: they need a perfect knowledge on the arms, even this is not physically plausible
"players": OracleNotFair(NB_PLAYERS, MAB(configuration['environment'][0])).children
"players": OracleFair(NB_PLAYERS, MAB(configuration['environment'][0])).children
# --- Using single-player Musical Chair policy
"players": Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.1, Time1=HORIZON).children
})
- The multi-players policies are added by giving a list of their children (
CentralizedCycling(*args).children
), who are instances of the proxy classChildPointer
. Each child methods is just passed back to the mother class (the multi-players policy, e.g.,CentralizedCycling
), who can then handle the calls as it wants (can be centralized or not). - I know, it's not perfectly clear yet and not simple to use. Just read the code. I will improve the documentation!
Here are one example of simulation: M = 6
players independently and selfishly play according to the Thompson
policy, against K = 17
unknown Bernoulli arms, in a collision model where only the player alone on an arm can sample it and gets some reward.
-
First, their own personal rewards along the time, showing that all the
M = 6
players achieve similar average reward, about0.82 脳 Horizon
. This is a satisfactory result, as the average availability of the bestM
channels is exactly0.82
: it means that, in average in these 20 simulations, every player achieves a linear reward with the good slope, indicating that they all select their arms among the bestM
channels. -
Then, the centralized regret, which in this case seems to be converging to a constant (or growing very slowly). That is a very good performance: the regret almost stops growing after some 700 steps or so:
-
The same is obtained for other order-optimal single-player policies (eg. KL-UCB, BayesUCB, UCB1 with good choice of alpha etc), in a purely selfish setting where every
M
player runs locally, without knowing the number of players and the collision model, they are able to learn and find an orthogonal affectation among the bestM
channels: -
Of course, this is not perfect, as the limit value of the regret is quite high, when compared to an ideal setting with communication and full knowledge of the arm (
OracleFair
, the best possible policy):
- It compared a
Selfish
policy, where each players runs a UCBalpha or Thompson algorithm, against aMusicalChair
policy.
For a multi-player policy, being fair means that on every simulation with M
players, each player access any of the M
best arms (about) the same amount of time.
It is important to highlight that it has to be verified on each run of the MP policy, having this property in average is NOT enough.
-
For instance, the oracle policy
OracleNotFair
affects each of theM
players to one of theM
best arms, orthogonally, but once they are affected they always pull this arm. It's unfair because one player will be lucky and affected to the best arm, the others are unlucky. The centralized regret is optimal (null, in average), but it is not fair. -
And the other oracle policy
OracleFair
affects an offset to each of theM
players corresponding to one of theM
best arms, orthogonally, and once they are affected they will cycle among the bestM
arms. It's fair because every player will pull theM
best arms an equal number of time. And the centralized regret is also optimal (null, in average). -
Usually, the
Selfish
policy is not fair: as each player is selfish and tries to maximize her personal regret, there is no reason for them to share the time on theM
best arms. -
Conversely, the
MusicalChair
policy is not fair either, and cannot be: when each player has attained the last step, ie. they are all choosing the same arm, orthogonally, and they are not sharing theM
best arms. -
The
MEGA
policy is designed to be fair: when players collide, they all have the same chance of leaving or staying on the arm, and they all sample from theM
best arms equally. -
The
rhoRand
policy is not designed to be fair: when players collide, they all take a random rank from the same distributionrank_k ~ Uniform(U)
. Note that it can be fair, but not with high probability.
First, install the requirements:
pip install -r requirements.txt
Then, it should be very straight forward to run some experiment.
This will run the simulation, average them (by repetitions
) and plot the results:
python main_multiplayers.py
In a virtualenv
?
If you prefer to not install the requirements globally on your system-wide Python setup, you can (and should) use virtualenv
.
$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/AlgoBandits/bin/python3
Also creating executable in /your/path/to/AlgoBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate # in bash, use activate.csh or activate.fish if needed
$ type pip # just to check
pip is /your/path/to/AlgoBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2
And then be sure to use the virtualenv binary for Python, bin/python
, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate
was executed).
Or with a Makefile
?
You can also use the provided Makefile
file to do this simply:
make install # install the requirements
make multiplayers # run and log the main.py script
Or within a Jupyter notebook ?
I am writing some Jupyter notebooks, in this folder (
notebooks/
), so if you want to do the same for your small experiments, you can be inspired by the few notebooks already written.
- Arms are defined in this folder (
Arms/
), see for exampleArms.Bernoulli
main_multiplayers.py
andconfiguration_multiplayers.py
to run and configure the simulation,- the
EvaluatorMultiPlayers
class that performs the simulation, - the
ResultMultiPlayers
class to store the results, - and some naive policies are implemented in the
PoliciesMultiPlayers/
folder. As far as now, there is theSelfish
,CentralizedFixed
,CentralizedCycling
,OracleNotFair
,OracleFair
multi-players policy.
For more details, see these UML diagrams:
- Packages: organization of the different files:
- Classes: inheritance diagrams of the different classes:
MIT Licensed (file LICENSE).
漏 2012 Olivier Capp茅, Aur茅lien Garivier, 脡milie Kaufmann and for the initial pymaBandits v1.0 project, and 漏 2016-2017 Lilian Besson for the rest.