---
# An example of a small Single-Player simulation

First, make sure you are in the main folder, then import `Evaluator` from the `Environment` package:

In [1]:
from sys import path
path.insert(0, '..')

In [2]:
# Local imports
from Environment import Evaluator

 - Setting dpi of all figures to 110 ...
 - Setting 'figsize' of all figures to (19.8, 10.8) ...
Info: Using the Jupyter notebook version of the tqdm() decorator, tqdm_notebook() ...


We also need arms, for instance `Bernoulli`-distributed arms:

In [3]:
# Import arms
from Arms import makeMeans, Bernoulli

Info: numba.jit seems to be available.


And finally we need some single-player Reinforcement Learning algorithms:

In [4]:
# Import algorithms
from Policies import *

Info: numba.jit seems to be available.


For instance, this imported the `UCB` algorithm:

In [5]:
help(UCB)

Help on class UCB in module Policies.UCB:

class UCB(Policies.IndexPolicy.IndexPolicy)
 |  The UCB policy for bounded bandits.
 |  Reference: [Lai & Robbins, 1985].
 |  
 |  Method resolution order:
 |      UCB
 |      Policies.IndexPolicy.IndexPolicy
 |      Policies.BasePolicy.BasePolicy
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  computeIndex(self, arm)
 |      Compute the current index of arm 'arm'.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from Policies.IndexPolicy.IndexPolicy:
 |  
 |  __init__(self, nbArms, lower=0.0, amplitude=1.0)
 |      New generic index policy.
 |      
 |      - nbArms: the number of arms,
 |      - lower, amplitude: lower value and known amplitude of the rewards.
 |  
 |  choice(self)
 |      In an index policy, choose an arm with maximal index (uniformly at random).
 |  
 |  choiceFromSubSet(self, availableArms='all')
 |      In an index policy, choose the best arm from sub-s
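
The docstrings above describe an index policy: compute one index per arm, then play an arm with maximal index, breaking ties uniformly at random. A minimal standalone sketch of that idea follows. It assumes the classical UCB1 index (empirical mean plus $\sqrt{2 \log t / N_a}$); the function names `ucb1_index` and `choose_arm` are hypothetical and are not part of the `Policies` package, whose own `computeIndex` may differ.

```python
import math
import random


def ucb1_index(reward_sum, pulls, t):
    """Classical UCB1 index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # an arm never pulled is tried first
    return reward_sum / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def choose_arm(indexes, rng=random):
    """Return an arm with maximal index, ties broken uniformly at random."""
    best = max(indexes)
    candidates = [arm for arm, value in enumerate(indexes) if value == best]
    return rng.choice(candidates)


# Three arms after 30 steps: arms 1 and 2 have identical statistics
rewards = [3.0, 7.0, 7.0]
pulls = [10, 10, 10]
indexes = [ucb1_index(r, n, t=30) for r, n in zip(rewards, pulls)]
chosen = choose_arm(indexes)  # either arm 1 or arm 2
```

Returning an infinite index for an unpulled arm is one common way to force an initial round of exploration; other conventions exist.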

---
## Creating the problem

### Parameters for the simulation
- $T = 10000$ is the time horizon,
- $N = 100$ is the number of repetitions,
- `N_JOBS = 4` is the number of cores used to parallelize the code.

In [6]:
HORIZON = 10000
REPETITIONS = 100
N_JOBS = 4

### Some MAB problem with Bernoulli arms
In this example we consider $3$ problems, each with `Bernoulli` arms of different means.

In [7]:
ENVIRONMENTS = [  # 1)  Bernoulli arms
        {   # A very easy problem, but it is used in a lot of articles
            "arm_type": Bernoulli,
            "params": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
        },
        {   # Another problem, best arm = last, with three groups: very bad arms (0.01, 0.02), middle arms (0.3 - 0.6) and very good arms (0.795, 0.8, 0.805)
            "arm_type": Bernoulli,
            "params": [0.01, 0.02, 0.3, 0.4, 0.5, 0.6, 0.795, 0.8, 0.805]
        },
        {   # A very hard problem, as used in [Cappé et al, 2012]
            "arm_type": Bernoulli,
            "params": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.1]
        },
    ]
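
How hard such a Bernoulli problem is can be summarized by the Lai & Robbins complexity constant, $C(\mu) = \sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a) / \mathrm{kl}(\mu_a, \mu^*)$, where $\mathrm{kl}$ is the Kullback-Leibler divergence between two Bernoulli distributions. The sketch below computes it for the first problem; the helper names `kl_bernoulli` and `lai_robbins_constant` are hypothetical and not taken from the package.

```python
import math


def kl_bernoulli(p, q, eps=1e-15):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    # Clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum of gap / kl(mu_a, mu_best) over all suboptimal arms."""
    best = max(means)
    return sum((best - mu) / kl_bernoulli(mu, best) for mu in means if mu < best)


easy_problem = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
complexity = lai_robbins_constant(easy_problem)
print(round(complexity, 2))  # → 7.52
```

A larger constant means that any uniformly efficient algorithm must accumulate more regret asymptotically, roughly $C(\mu) \log T$.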

### Some RL algorithms
We compare Thompson Sampling against $\mathrm{UCB}_1$, $\mathrm{kl}$-$\mathrm{UCB}$, and Bayes-UCB.

In [8]:
POLICIES = [
        # --- UCB1 algorithm
        {
            "archtype": UCB,
            "params": {}
        },
        # --- Thompson algorithm
        {
            "archtype": Thompson,
            "params": {}
        },
        # --- KL algorithms, here only klUCB
        {
            "archtype": klUCB,
            "params": {}
        },
        # --- BayesUCB algorithm
        {
            "archtype": BayesUCB,
            "params": {}
        },
    ]
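
To give an idea of what one of these algorithms does, here is a minimal standalone sketch of Thompson Sampling on Bernoulli arms, using only the standard library. It assumes a uniform Beta(1, 1) prior on each mean and tracks the cumulative pseudo-regret; the function `thompson_regret` is hypothetical and is not the `Thompson` class of the `Policies` package.

```python
import random


def thompson_regret(means, horizon, seed=0):
    """Run Thompson Sampling once and return its cumulative pseudo-regret."""
    rng = random.Random(seed)
    successes = [0] * len(means)
    failures = [0] * len(means)
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta(1 + S, 1 + F) posterior
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = max(range(len(means)), key=samples.__getitem__)
        # Simulate a Bernoulli reward from the chosen arm
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        # Pseudo-regret: gap between the best mean and the chosen arm's mean
        regret += best - means[arm]
    return regret


means = [0.1, 0.5, 0.9]
regret = thompson_regret(means, horizon=2000)
```

A single run is noisy, which is why the experiment below averages over many repetitions.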

Complete configuration for the problem:

In [9]:
configuration = {
    # --- Duration of the experiment
    "horizon": HORIZON,
    # --- Number of repetitions of the experiment (to have an average)
    "repetitions": REPETITIONS,
    # --- Parameters for the use of joblib.Parallel
    "n_jobs": N_JOBS,    # = nb of CPU cores
    "verbosity": 6,      # Max joblib verbosity
    # --- Arms
    "environment": ENVIRONMENTS,
    # --- Algorithms
    "policies": POLICIES,
}
configuration

{'environment': [{'arm_type': Arms.Bernoulli.Bernoulli,
   'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},
  {'arm_type': Arms.Bernoulli.Bernoulli,
   'params': [0.01, 0.02, 0.3, 0.4, 0.5, 0.6, 0.795, 0.8, 0.805]},
  {'arm_type': Arms.Bernoulli.Bernoulli,
   'params': [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.1]}],
 'horizon': 10000,
 'n_jobs': 4,
 'policies': [{'archtype': Policies.UCB.UCB, 'params': {}},
  {'archtype': Policies.Thompson.Thompson, 'params': {}},
  {'archtype': Policies.klUCB.klUCB, 'params': {}},
  {'archtype': Policies.BayesUCB.BayesUCB, 'params': {}}],
 'repetitions': 100,
 'verbosity': 6}
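
Each entry of `"policies"` pairs a class (`"archtype"`) with keyword arguments (`"params"`). One plausible way to turn such entries into policy objects is sketched below, assuming the constructor shape `__init__(self, nbArms, lower=0.0, amplitude=1.0)` shown by `help(UCB)`. `DummyPolicy` and `build_policies` are hypothetical names used for illustration only; the `Evaluator` may proceed differently.

```python
class DummyPolicy:
    """Hypothetical stand-in that only mimics the constructor of a policy."""

    def __init__(self, nbArms, lower=0.0, amplitude=1.0):
        self.nbArms = nbArms
        self.lower = lower
        self.amplitude = amplitude


def build_policies(policy_configs, nb_arms):
    """Instantiate every configured policy as archtype(nb_arms, **params)."""
    return [cfg["archtype"](nb_arms, **cfg["params"]) for cfg in policy_configs]


demo_policies = [
    {"archtype": DummyPolicy, "params": {}},
    {"archtype": DummyPolicy, "params": {"amplitude": 2.0}},
]
built = build_policies(demo_policies, nb_arms=9)
```

Keeping the class and its parameters separate makes it easy to compare several variants of the same algorithm by listing it more than once with different `"params"`.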

---
## Solving the problem

In [10]:
evaluation = Evaluator(configuration)

Number of policies in this comparison: 4
Time horizon: 10000
Number of repetitions: 100
Sampling rate DELTA_T_SAVE: 1
Creating a new MAB problem ...
  Reading arms of this MAB problem from a dictionnary 'configuration' = {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]} ...
 - with 'arm_type' = <class 'Arms.Bernoulli.Bernoulli'>
 - with 'params' = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
 - with 'arms' = [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)]
 - with 'nbArms' = 9
 - with 'maxArm' = 0.9
 - with 'minArm' = 0.1

This MAB problem has: 
 - a [Lai & Robbins] complexity constant C(mu) = 7.52 ... 
 - a Optimal Arm Identification factor H_OI(mu) = 48.89% ...
Creating a new MAB problem ...
  Reading arms of this MAB problem from a dictionnary 'configuration' = {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.01, 0.02, 0.3, 0.4, 0.5, 0.6, 0.795, 0.8, 0.805]} ...
 - with 'arm_type' = <class 

In [11]:
for envId, env in enumerate(evaluation.envs):
    # Evaluate just that env
    evaluation.startOneEnv(envId, env)


Evaluating environment: <MAB{'nbArms': 9, 'maxArm': 0.90000000000000002, 'minArm': 0.10000000000000001, 'arms': [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)]}>
- Adding policy #1 = {'params': {}, 'archtype': <class 'Policies.UCB.UCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][0]' = {'params': {}, 'archtype': <class 'Policies.UCB.UCB'>} ...
- Adding policy #2 = {'params': {}, 'archtype': <class 'Policies.Thompson.Thompson'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][1]' = {'params': {}, 'archtype': <class 'Policies.Thompson.Thompson'>} ...
- Adding policy #3 = {'params': {}, 'archtype': <class 'Policies.klUCB.klUCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][2]' = {'params': {}, 'archtype': <class 'Policies.klUCB.klUCB'>} ...
- Adding policy #4 = {'params': {}, 'archtype': <class 'Policies.BayesUCB.BayesUCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][3]'

[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.9s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   10.7s




- Evaluating policy #2/4: Thompson ...


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   23.7s finished



Estimated order by the policy Thompson after 10000 steps: [0 3 5 2 1 4 6 7 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 75.31% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 98.77% (relative success)...
  ==> Spearman    distance from optimal ordering: 99.04% (relative success)...
  ==> Gestalt     distance from optimal ordering: 66.67% (relative success)...
  ==> Mean distance from optimal ordering: 84.94% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.8s




- Evaluating policy #3/4: KL-UCB(Bern) ...


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   16.3s finished



Estimated order by the policy KL-UCB(Bern) after 10000 steps: [2 1 0 6 4 5 3 7 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 75.31% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 96.29% (relative success)...
  ==> Spearman    distance from optimal ordering: 98.75% (relative success)...
  ==> Gestalt     distance from optimal ordering: 55.56% (relative success)...
  ==> Mean distance from optimal ordering: 81.48% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    2.4s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   14.7s





[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   33.1s finished



- Evaluating policy #4/4: BayesUCB ...

Estimated order by the policy BayesUCB after 10000 steps: [2 4 0 7 6 5 1 3 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 45.68% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 59.58% (relative success)...
  ==> Spearman    distance from optimal ordering: 64.42% (relative success)...
  ==> Gestalt     distance from optimal ordering: 44.44% (relative success)...
  ==> Mean distance from optimal ordering: 53.53% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    3.7s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   18.6s




Evaluating environment: <MAB{'nbArms': 9, 'maxArm': 0.80500000000000005, 'minArm': 0.01, 'arms': [B(0.01), B(0.02), B(0.3), B(0.4), B(0.5), B(0.6), B(0.795), B(0.8), B(0.805)]}>
- Adding policy #1 = {'params': {}, 'archtype': <class 'Policies.UCB.UCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][0]' = {'params': {}, 'archtype': <class 'Policies.UCB.UCB'>} ...
- Adding policy #2 = {'params': {}, 'archtype': <class 'Policies.Thompson.Thompson'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][1]' = {'params': {}, 'archtype': <class 'Policies.Thompson.Thompson'>} ...
- Adding policy #3 = {'params': {}, 'archtype': <class 'Policies.klUCB.klUCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][2]' = {'params': {}, 'archtype': <class 'Policies.klUCB.klUCB'>} ...
- Adding policy #4 = {'params': {}, 'archtype': <class 'Policies.BayesUCB.BayesUCB'>} ...
  Creating this policy from a dictionnary 'self.cfg['policies'][3]' = {'par

[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   37.8s finished



Estimated order by the policy UCB after 10000 steps: [0 1 3 2 5 4 7 6 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 85.19% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 99.82% (relative success)...
  ==> Spearman    distance from optimal ordering: 99.99% (relative success)...
  ==> Gestalt     distance from optimal ordering: 66.67% (relative success)...
  ==> Mean distance from optimal ordering: 87.92% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.9s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   12.1s




- Evaluating policy #2/4: Thompson ...


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   28.7s finished



Estimated order by the policy Thompson after 10000 steps: [0 4 3 2 1 5 6 7 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 80.25% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 98.77% (relative success)...
  ==> Spearman    distance from optimal ordering: 99.47% (relative success)...
  ==> Gestalt     distance from optimal ordering: 66.67% (relative success)...
  ==> Mean distance from optimal ordering: 86.29% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.6s




- Evaluating policy #3/4: KL-UCB(Bern) ...


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   15.6s finished



Estimated order by the policy KL-UCB(Bern) after 10000 steps: [0 3 1 2 4 5 6 8 7] ...
  ==> Optimal arm identification: 99.38% (relative success)...
  ==> Manhattan   distance from optimal ordering: 85.19% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 99.82% (relative success)...
  ==> Spearman    distance from optimal ordering: 99.98% (relative success)...
  ==> Gestalt     distance from optimal ordering: 77.78% (relative success)...
  ==> Mean distance from optimal ordering: 90.69% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    2.6s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   12.2s




- Evaluating policy #4/4: BayesUCB ...


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   27.3s finished



Estimated order by the policy BayesUCB after 10000 steps: [0 1 3 2 4 5 7 6 8] ...
  ==> Optimal arm identification: 100.00% (relative success)...
  ==> Manhattan   distance from optimal ordering: 90.12% (relative success)...
  ==> Kendell Tau distance from optimal ordering: 99.92% (relative success)...
  ==> Spearman    distance from optimal ordering: 100.00% (relative success)...
  ==> Gestalt     distance from optimal ordering: 77.78% (relative success)...
  ==> Mean distance from optimal ordering: 91.95% (relative success)...


[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    2.5s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   14.8s


In [12]:
for envId, env in enumerate(evaluation.envs):
    evaluation.printFinalRanking(envId)

    # Regret curves, on a linear and then a logarithmic time axis
    evaluation.plotRegrets(envId, semilogx=False, plotSTD=False)
    evaluation.plotRegrets(envId, semilogx=True, plotSTD=False)

    # Same plot with meanRegret=True, on a linear time axis
    evaluation.plotRegrets(envId, semilogx=False, meanRegret=True, plotSTD=False)

    evaluation.plotBestArmPulls(envId)


Final ranking for this environment #0 :
- Policy 'BayesUCB'	was ranked	1 / 4 for this simulation (last regret = 41.29).
- Policy 'Thompson'	was ranked	2 / 4 for this simulation (last regret = 44.4).
- Policy 'KL-UCB(Bern)'	was ranked	3 / 4 for this simulation (last regret = 57.43).
- Policy 'UCB'	was ranked	4 / 4 for this simulation (last regret = 327.56).


(array([ 327.56,   44.4 ,   57.43,   41.29]), array([3, 1, 2, 0]))
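
The pair of arrays above holds the last regret of each policy and a second array that is consistent with the 0-based rank of each policy (lowest regret first). A small sketch of such a ranking, using the values printed above, is given here; `rank_policies` is a hypothetical helper, not a function of the package.

```python
def rank_policies(last_regrets):
    """Return the 0-based rank of each policy, lowest regret ranked first."""
    order = sorted(range(len(last_regrets)), key=last_regrets.__getitem__)
    ranks = [0] * len(last_regrets)
    for rank, policy in enumerate(order):
        ranks[policy] = rank
    return ranks


names = ["UCB", "Thompson", "KL-UCB(Bern)", "BayesUCB"]
last_regrets = [327.56, 44.4, 57.43, 41.29]
ranks = rank_policies(last_regrets)
print(ranks)  # → [3, 1, 2, 0]
```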


This MAB problem has: 
 - a [Lai & Robbins] complexity constant C(mu) = 7.52 for 1-player problem... 
 - a Optimal Arm Identification factor H_OI(mu) = 48.89% ...


KeyError: 'showplot'