  • FIXME change the plot, and add more up-to-date explanations!

  • Remark: I am finishing an article on this topic; it will be a better introduction, as a small self-contained document explaining this idea, the algorithms, the models, etc.

There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting. The basic difference is about collisions: at each time t, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station's point of view and from each player's point of view.

Collision models

For example, I implemented these different collision models in CollisionModels.py:

  • noCollision is a limited model where all players can sample an arm even in case of collision. It corresponds to the single-player simulation: each player is a policy, compared without collisions. This is for testing only, and not very interesting.
  • onlyUniqUserGetsReward is a simple collision model where only the players alone on an arm sample it and receive the reward. This is the default collision model in the literature, cf. [Shamir et al., 2015] (collision model 1) or cf. [Liu & Zhao, 2009]. A sketch of this rule is given right after this list.
  • rewardIsSharedUniformly is similar: the players alone on an arm sample it and receive the reward, and in case of more than one player on an arm, only one of them (chosen uniformly at random by the base station) can sample it and receive the reward.
  • closerUserGetsReward is similar, but uses another approach to choose who can transmit. Instead of randomly choosing the lucky player, it uses a given (or random) vector indicating the distance of each player to the base station (it can also indicate the quality of the communication), and when two (or more) players collide, only the one closest to the base station can transmit. It is the most physically plausible model.
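
To make the default model concrete, here is a minimal sketch of the "only the unique user gets the reward" rule. It is only an illustration of the idea: the function name, signature and details do not necessarily match the actual code in CollisionModels.py.

```python
import numpy as np

def only_uniq_user_gets_reward(means, choices, rng=np.random):
    """Illustrative rule: a player gets a (Bernoulli) reward only if she is
    the *only* one sensing her chosen channel.

    - means: the K mean availabilities of the channels,
    - choices: array of M integers, choices[j] = channel sensed by player j.
    Returns an array of M rewards (0 for every colliding player).
    """
    choices = np.asarray(choices)
    rewards = np.zeros(len(choices))
    for j, arm in enumerate(choices):
        if np.sum(choices == arm) == 1:  # player j is alone on her channel
            rewards[j] = float(rng.random_sample() < means[arm])  # Bernoulli draw
    return rewards

# For instance, if players 0 and 1 both sense channel 2 and player 2 is alone
# on channel 0, only_uniq_user_gets_reward([0.1, 0.5, 0.9], [2, 2, 0])
# gives [0., 0., r] with r ~ Bernoulli(0.1).
```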

More details on the code

Have a look at:

  • Policies designed to be used in the multi-players setting


Configuration:

A simple Python file, configuration.py, is used to import the arm classes and the policy classes, and to define the problems and the experiments. See the explanations given for the single-player case.

```python
nbArms = len(configuration['environment'][0]['params'])
# WARNING do not use this hack if you try to use more than one environment
configuration.update({
    # Comment out the lines you don't want; keep ONLY one "players" line uncommented
    # --- Using multi-player Selfish policy
    # "players": Selfish(NB_PLAYERS, Uniform, nbArms).children,   # Stupid one
    # "players": Selfish(NB_PLAYERS, TakeRandomFixedArm, nbArms).children,  # Also a stupid one
    # "players": Selfish(NB_PLAYERS, Softmax, nbArms, temperature=TEMPERATURE).children,  # Based on empirical means
    "players": Selfish(NB_PLAYERS, UCBalpha, nbArms, alpha=1./4).children,  # This one is efficient!
    # --- Using multi-player Centralized policy, un-fair or fair
    # "players": CentralizedFixed(NB_PLAYERS, nbArms).children,
    # "players": CentralizedCycling(NB_PLAYERS, nbArms).children,
    # --- Using multi-player Oracle policy, un-fair or fair
    # Note: they need a perfect knowledge of the arms, even if this is not physically plausible
    # "players": OracleNotFair(NB_PLAYERS, MAB(configuration['environment'][0])).children,
    # "players": OracleFair(NB_PLAYERS, MAB(configuration['environment'][0])).children,
    # --- Using the single-player MusicalChair policy for every player
    # "players": Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.1, Time1=HORIZON).children,
})
```
  • The multi-players policies are added by giving a list of their children (e.g., CentralizedCycling(*args).children), which are instances of the proxy class ChildPointer. Each method call on a child is simply passed back to the mother class (the multi-players policy, e.g., CentralizedCycling), which can then handle the calls as it wants (in a centralized way or not). A rough illustration of this design is sketched after this list.
  • I know, it's not perfectly clear yet and not simple to use. Just read the code. I will improve the documentation!
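
To give a rough picture of this proxy design, here is a minimal, self-contained sketch of the idea (my own toy classes, not the actual ChildPointer or CentralizedCycling from the code): the children store no state of their own, and every call is forwarded to the mother policy.

```python
class ChildPointerSketch(object):
    """Toy proxy: each child only knows its mother policy and its own index."""

    def __init__(self, mother, playerId):
        self.mother = mother      # the multi-player (mother) policy
        self.playerId = playerId  # index of this player among the M players

    def choice(self):
        # Simply ask the mother which arm this particular player should pull
        return self.mother._choice_one_child(self.playerId)


class CyclingMotherSketch(object):
    """Toy centralized mother policy: at round t, player j pulls arm
    (j + t) % nbArms, so the players never collide."""

    def __init__(self, nbPlayers, nbArms):
        self.nbArms = nbArms
        self.t = -1
        self.children = [ChildPointerSketch(self, j) for j in range(nbPlayers)]

    def _choice_one_child(self, playerId):
        if playerId == 0:   # the first child of each round advances the clock
            self.t += 1
        return (playerId + self.t) % self.nbArms
```

The "players" entry of the configuration then simply receives mother.children, exactly as in the snippet above, and the simulation loop never needs to know whether the decisions are taken in a centralized way or not.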

Some illustrations of multi-players simulations

Here is one example of simulation: M = 6 players independently and selfishly play according to the Thompson policy, against K = 17 unknown Bernoulli arms, in a collision model where only a player alone on an arm can sample it and get some reward.
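
In terms of the configuration file shown above, this experiment roughly corresponds to a choice like the following. This is only a sketch: it assumes a Thompson policy class is importable, as in the single-player documentation, and the exact numbers are the ones described in the text.

```python
NB_PLAYERS = 6    # M = 6 players
# the K = 17 Bernoulli arms are defined in configuration['environment'][0]
nbArms = len(configuration['environment'][0]['params'])   # should be 17 here

configuration.update({
    # Each player independently and selfishly runs a Thompson sampling algorithm
    "players": Selfish(NB_PLAYERS, Thompson, nbArms).children,
})
```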

  • First, their own personal rewards over time, showing that all M = 6 players achieve a similar average reward, about 0.82 × Horizon. This is a satisfactory result, as the average availability of the M best channels is exactly 0.82: it means that, on average over these 20 simulations, every player achieves a linear reward with the right slope, indicating that they all select their arms among the M best channels. (Figure: 10000 steps, 20 repetitions, Thompson, personal rewards.)

  • Then, the centralized regret (the definition used is recalled after this list), which in this case seems to converge to a constant (or to grow very slowly). That is a very good performance: the regret almost stops growing after some 700 steps or so. (Figure: 10000 steps, 20 repetitions, Thompson, selfish setting.)

  • The same is obtained for other order-optimal single-player policies (e.g., KL-UCB, BayesUCB, UCB1 with a good choice of alpha, etc.), in a purely selfish setting where each of the M players runs locally, without knowing the number of players or the collision model: they are able to learn and find an orthogonal assignment among the M best channels. (Figure: 10000 steps, 20 repetitions, BayesUCB, selfish setting.)

  • Of course, this is not perfect, as the limit value of the regret is quite high when compared to an ideal setting with communication and full knowledge of the arms (OracleFair, the best possible policy). (Figure: 10000 steps, 20 repetitions, OracleFair.)
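
For reference, the centralized regret plotted in these figures follows the usual definition from the multi-player bandit literature (the notation below is mine, used only to recall the idea): it compares the best possible total reward, obtained by M non-colliding players always pulling the M best arms, to the total reward actually collected by the M players,

$$ R_T \;=\; T \sum_{k=1}^{M} \mu^{*}_{k} \;-\; \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{j=1}^{M} r_{j}(t) \right], $$

where $\mu^{*}_{1} \geq \dots \geq \mu^{*}_{K}$ are the sorted means of the K arms, and $r_{j}(t)$ is the reward actually received by player j at time t (which is 0 in case of collision, under the default collision model).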

Comparison of different multi-players policies

  • This compares a Selfish policy, where each player runs a UCBalpha or Thompson algorithm, against a MusicalChair policy.

(Figures: 10000 steps, 20 repetitions, MusicalChair against other selfish policies, two plots.)


Fairness vs. unfairness

For a multi-player policy, being fair means that on every simulation with M players, each player accesses each of the M best arms (about) the same amount of time. It is important to highlight that this has to be verified on each run of the MP policy: having this property on average is NOT enough.
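
As a simple illustration of what "on each run" means, one can look, for a single run, at how many times each player pulled each of the M best arms, and measure how balanced these counts are. This is a sketch with my own notations, not a function from the code:

```python
import numpy as np

def fairness_gap(pulls, means, M):
    """Illustrative per-run fairness measure.

    - pulls: (nbPlayers, nbArms) array, pulls[j, k] = number of pulls of arm k by player j,
    - means: the nbArms true means of the arms,
    - M: number of players.
    Returns the gap between the most and least visited (player, best arm) pair:
    0 means this run was perfectly fair on the M best arms.
    """
    best_arms = np.argsort(means)[-M:]   # indices of the M best arms
    counts = pulls[:, best_arms]         # time spent by each player on each best arm
    return counts.max() - counts.min()
```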

  • For instance, the oracle policy OracleNotFair assigns each of the M players to one of the M best arms, orthogonally, but once they are assigned they always pull this arm. It's unfair because one player will be lucky and get the best arm, while the others are less lucky. The centralized regret is optimal (null, on average), but the policy is not fair.

  • The other oracle policy, OracleFair, assigns to each of the M players an offset corresponding to one of the M best arms, orthogonally, and once they are assigned they cycle among the M best arms. It's fair because every player pulls each of the M best arms an equal number of times. And the centralized regret is also optimal (null, on average).

  • Usually, the Selfish policy is not fair: as each player is selfish and tries to maximize her personal reward, there is no reason for the players to share their time on the M best arms.

  • Conversely, the MusicalChair policy is not fair either, and cannot be: once each player has reached the last phase, each of them keeps choosing the same arm forever, orthogonally to the others, so they never share the M best arms.

  • The MEGA policy is designed to be fair: when players collide, they all have the same chance of leaving or staying on the arm, and they all sample from the M best arms equally.

  • The rhoRand policy is not designed to be fair: when players collide, they all draw a new random rank from the same distribution, rank_k ~ Uniform(U). Note that it can turn out to be fair, but not with high probability. A sketch of this rank-based mechanism is given below.
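
As an illustration of this last rank-based mechanism, here is a minimal sketch of a rhoRand-style player: it keeps a rank, targets the arm it currently estimates to be its rank-th best, and redraws a uniform rank whenever it collides. The class and method names are mine, and it assumes a single-player policy object exposing a vector of indexes (one per arm) and a getReward method; it is not the class implemented in the repository.

```python
import numpy as np

class RhoRandSketch(object):
    """Illustrative rhoRand-style player (NOT the class from the repository)."""

    def __init__(self, single_player_policy, nbPlayers, rng=np.random):
        self.policy = single_player_policy   # assumed: has .indexes and .getReward()
        self.M = nbPlayers
        self.rng = rng
        self.rank = 1 + rng.randint(self.M)  # initial rank in {1, ..., M}

    def choice(self):
        indexes = self.policy.indexes        # assumed: one index per arm
        order = np.argsort(indexes)          # arms sorted by increasing index
        return order[-self.rank]             # pull the rank-th best arm

    def getReward(self, arm, reward, collision):
        self.policy.getReward(arm, reward)   # update the single-player policy
        if collision:                        # redraw a new uniform rank after a collision
            self.rank = 1 + self.rng.randint(self.M)
```

The idea is that ranks only change after collisions, so after enough collisions the players end up on distinct ranks, hence on distinct arms among the M best ones, with high probability.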


How to run the experiments ?

First, install the requirements:

pip install -r requirements.txt

Then it should be very straightforward to run some experiments. This will run the simulations, average them (over the repetitions) and plot the results:

python main_multiplayers.py

In a virtualenv ?

If you prefer not to install the requirements globally in your system-wide Python setup, you can (and should) use virtualenv.

$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/AlgoBandits/bin/python3
Also creating executable in /your/path/to/AlgoBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate  # in bash, use activate.csh or activate.fish if needed
$ type pip  # just to check
pip is /your/path/to/AlgoBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2

And then be sure to use the virtualenv binary for Python, bin/python, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate was executed).

Or with a Makefile ?

You can also use the provided Makefile to do this simply:

make install       # install the requirements
make multiplayers  # run and log the main_multiplayers.py script

Or within a Jupyter notebook ?

I am writing some Jupyter notebooks in this folder (notebooks/), so if you want to do the same for your small experiments, you can take inspiration from the few notebooks already written.


Code organization

Layout of the code:

UML diagrams

For more details, see these UML diagrams:

  • Packages: organization of the different files. (UML diagram: Packages of AlgoBandits.git.)
  • Classes: inheritance diagrams of the different classes. (UML diagram: Classes of AlgoBandits.git.)

📜 License ?

MIT Licensed (file LICENSE).

© 2012 Olivier Cappé, Aurélien Garivier and Émilie Kaufmann for the initial pymaBandits v1.0 project, and © 2016-2017 Lilian Besson for the rest.
