# Bootstrap multiple comparisons tutorial

This Jupyter _Python 3_ notebook has been written to accompany the WSC18 paper:

**PRACTICAL CONSIDERATIONS IN SELECTING THE BEST SET OF SIMULATED SYSTEMS**  _by Christine Currie and Tom Monks_.

The notebook provides a worked example of using BootComp to conduct a two-stage screening and search of a simulation model.  

## 1. Preamble

### 1.1. Detail of the simulation model

The simulation model was used in a 2017 project in the UK to help a hospital, a community healthcare provider and a clinical commissioning group design and plan a new community rehabilitation ward.  In the UK, patients who require rehabilitation are often stuck in a queuing system where they must wait (inappropriately) in an acute hospital bed for a space in the rehabilitation ward.  The model investigated the sizing of the new ward in order to minimise patient waiting time whilst meeting probabilistic constraints regarding ward occupancy (bed utilization) and the number of transfers between single-sex bays.

### 1.2. Output data

The output data for the example analysis are bundled with the git repository.  There are three .csv files in the data/ directory for 'waiting times', 'utilization' and 'transfers'.  

The model itself is not needed.  There are 50 replications of 1051 competing design points.  Users can vary the number of replications used in the two-stage procedure.  

The experimental design is also included for reference.

## 2. Prerequisites

### 2.1. BootComp Modules

In [3]:
import Bootstrap as bs
import BootIO as io
import ConvFuncs as cf

In [4]:
#DEV
import Bootstrap_crn as crn

### 2.2. Python Data Science Modules

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 3. Procedure: Stage 1

### Step 1: Read in initial $ n_1 $  replications

In [6]:
N_BOOTS = 2000
n_1 = 5
INPUT_DATA1 = "data/replications_wait_times.csv"
INPUT_DATA2 = "data/replications_util.csv"
INPUT_DATA3 = "data/replications_transfers.csv"
DESIGN = "data/doe.csv"

In [7]:
system_data_wait = crn.load_scenarios(INPUT_DATA1, exclude_reps = 50-n_1)
system_data_util = crn.load_scenarios(INPUT_DATA2, exclude_reps = 50-n_1)
system_data_tran = crn.load_scenarios(INPUT_DATA3, exclude_reps = 50-n_1)

N_SCENARIOS = system_data_wait.shape[1]
N_REPS = system_data_wait.shape[0]

print("Loaded waiting time data. {0} systems; {1} replications".format(system_data_wait.shape[1], system_data_wait.shape[0]))
print("Loaded utilization data. {0} systems; {1} replications".format(system_data_util.shape[1], system_data_util.shape[0]))
print("Loaded transfers data. {0} systems; {1} replications".format(system_data_tran.shape[1], system_data_tran.shape[0]))

Loaded waiting time data. 1051 systems; 5 replications
Loaded utilization data. 1051 systems; 5 replications
Loaded transfers data. 1051 systems; 5 replications


In [8]:
df_tran = pd.DataFrame(system_data_tran)
df_util = pd.DataFrame(system_data_util)
df_wait = pd.DataFrame(system_data_wait)

### Step 2: Limit to systems that satisfy chance constraints

In [9]:
min_util = 80

In [10]:
p = 0.05

Bootstrap function arguments

In [11]:
N_BOOTS = 1000
args =  bs.BootstrapArguments()

args.nboots = N_BOOTS
args.nscenarios = N_SCENARIOS
args.point_estimate_func = bs.bootstrap_mean
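
The internals of the BootComp modules are not shown in this notebook. As a rough illustration of what bootstrapping the mean involves, the sketch below resamples replications with replacement using NumPy alone. The function name `bootstrap_means`, the seed and the toy data are assumptions made for illustration; they are not part of BootComp, and BootComp may resample differently (for example, independently per system).

```python
import numpy as np

def bootstrap_means(data, n_boots, rng):
    """Return an (n_boots, n_systems) array of bootstrap means.

    data: array of shape (n_reps, n_systems), one column per system.
    """
    data = np.asarray(data, dtype=float)
    n_reps = data.shape[0]
    # Draw replication indices with replacement. The same rows are reused
    # for every system within a resample (an assumption of this sketch).
    idx = rng.integers(0, n_reps, size=(n_boots, n_reps))
    # data[idx] has shape (n_boots, n_reps, n_systems); average over the reps.
    return data[idx].mean(axis=1)

rng = np.random.default_rng(42)
# Hypothetical toy data: 5 replications of 3 systems.
toy = rng.normal(loc=[80.0, 85.0, 90.0], scale=2.0, size=(5, 3))
boot = bootstrap_means(toy, n_boots=1000, rng=rng)
print(boot.shape)  # (1000, 3)
```

Each row of `boot` is one bootstrap estimate of the mean for every system, which is the shape that the later filtering and ranking steps work on.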


#### Chance constraint 1:  Utilisation Threshold (value for money)

In [12]:
def bootstrap_chance_constraint(data, threshold, boot_args, p=0.05, kind='lower'):
    """
    Bootstrap a chance constraint for k systems and filter out systems
    where more than a proportion p of the resamples violate a threshold t.
    
    Example 1. A lower limit.  If the chance constraint relates to utilization it could be stated as:
    filter out any systems where more than 5% of the distribution is less than 80%.
    
    Example 2. An upper limit.  If the chance constraint relates to unwanted ward transfers it could be stated
    as: filter out any systems where more than 5% of the distribution is greater than 50 transfers per annum.
    
    Returns a pandas.Index containing the feasible systems, i.e. those that do not violate the chance constraint.
    
    @data - a numpy array of the data to bootstrap
    @threshold - the threshold of the chance constraint
    @boot_args - the bootstrap setup class
    @p - the probability cut-off for the chance constraint (default p = 0.05)
    @kind - 'lower' = a lower limit threshold; 'upper' = an upper limit threshold (default = 'lower')
    
    """
    
    valid_operations = ['upper', 'lower']
    
    if kind.lower() not in valid_operations:
        raise ValueError('Parameter @kind must be either set to lower or upper')
    
    resample_list = bs.resample_all_scenarios(data.tolist(), boot_args)
    df_boots = cf.resamples_to_df(resample_list, boot_args.nboots)
    
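    if kind.lower() == 'lower':
        # count, per system, the resamples strictly above the lower limit
        df_counts = df_boots[df_boots > threshold].count().to_frame(name='count')
    else:
        # count, per system, the resamples strictly below the upper limit
        df_counts = df_boots[df_boots < threshold].count().to_frame(name='count')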
        
    df_counts['prop'] = df_counts['count'] / boot_args.nboots
    df_counts['pass'] = np.where(df_counts['prop'] >= (1- p), 1, 0)

    return df_counts.loc[df_counts['pass'] == 1].index
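
To make the filtering logic easier to follow in isolation, here is a minimal NumPy-only sketch of the same idea, applied to a matrix of bootstrap means that has already been computed. The function name `feasible_systems_sketch` and the toy data are hypothetical; the sketch does not call BootComp.

```python
import numpy as np

def feasible_systems_sketch(boot_means, threshold, p=0.05, kind="lower"):
    """Return 1-based indices of systems that satisfy a chance constraint.

    boot_means: (n_boots, n_systems) array of bootstrap point estimates.
    """
    if kind not in ("lower", "upper"):
        raise ValueError("kind must be 'lower' or 'upper'")
    if kind == "lower":
        ok = boot_means > threshold   # resample respects the lower limit
    else:
        ok = boot_means < threshold   # resample respects the upper limit
    prop = ok.mean(axis=0)            # proportion of acceptable resamples
    return np.flatnonzero(prop >= 1 - p) + 1

# Hypothetical bootstrap means for three systems (threshold of 80).
rng = np.random.default_rng(0)
boot = np.column_stack([
    rng.normal(85.0, 1.0, 1000),   # comfortably above 80
    rng.normal(80.0, 1.0, 1000),   # roughly half the resamples below 80
    rng.normal(75.0, 1.0, 1000),   # almost entirely below 80
])
print(feasible_systems_sketch(boot, threshold=80.0))
```

With these toy inputs only the first system should pass the lower-limit constraint; switching to `kind="upper"` should instead keep only the third.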
    
    
    

In [13]:
resample_util = bs.resample_all_scenarios(system_data_util.T.tolist(), args)


In [14]:
df_boots_util = cf.resamples_to_df(resample_util, N_BOOTS)
df_boots_util.shape

(1000, 1051)

In [15]:
df_counts = df_boots_util[df_boots_util > min_util].count().to_frame(name='count')
df_counts['prop'] = df_counts['count'] / N_BOOTS
df_counts['pass'] = np.where(df_counts['prop'] >= (1- p), 1, 0)

passed_1 = df_counts.loc[df_counts['pass'] == 1]
passed_1

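| | count | prop | pass |
|---|---|---|---|
| 1 | 1000 | 1.000 | 1 |
| 2 | 1000 | 1.000 | 1 |
| 3 | 1000 | 1.000 | 1 |
| 4 | 1000 | 1.000 | 1 |
| 5 | 1000 | 1.000 | 1 |
| 6 | 1000 | 1.000 | 1 |
| 7 | 1000 | 1.000 | 1 |
| 8 | 1000 | 1.000 | 1 |
| 9 | 1000 | 1.000 | 1 |
| 10 | 1000 | 1.000 | 1 |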


#### Chance constraint 2: Upper bound on transfers between bays

In [16]:
max_tran = 50

In [17]:
resample_tran = bs.resample_all_scenarios(system_data_tran.T.tolist(), args)

In [18]:
df_boots_tran = cf.resamples_to_df(resample_tran, N_BOOTS)
df_boots_tran.shape

(1000, 1051)

In [19]:
df_boots_tran.head(5)

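| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 28.2 | 37.0 | 59.6 | 70.8 | 112.2 | 123.6 | 154.4 | 182.4 | 142.2 | ... | 0.4 | 4.8 | 8.2 | 2.8 | 3.0 | 2.0 | 1.0 | 1.6 | 0.0 | 0.0 |
| 2 | 0.2 | 30.2 | 40.0 | 63.2 | 81.6 | 98.6 | 132.4 | 136.0 | 176.2 | 115.0 | ... | 3.6 | 6.0 | 5.8 | 3.8 | 1.8 | 2.4 | 1.0 | 0.8 | 0.0 | 0.0 |
| 3 | 0.4 | 26.0 | 37.4 | 65.4 | 76.2 | 85.2 | 101.4 | 115.4 | 161.0 | 140.6 | ... | 2.4 | 2.8 | 3.4 | 3.0 | 2.4 | 2.0 | 2.0 | 0.8 | 1.6 | 0.0 |
| 4 | 0.2 | 33.0 | 36.8 | 65.8 | 76.0 | 118.2 | 106.0 | 126.0 | 154.0 | 165.8 | ... | 0.8 | 5.4 | 7.2 | 4.8 | 1.8 | 1.4 | 3.0 | 0.0 | 0.4 | 0.0 |
| 5 | 0.0 | 29.6 | 33.2 | 47.8 | 84.8 | 96.6 | 112.8 | 153.6 | 143.2 | 170.0 | ... | 0.0 | 2.2 | 4.6 | 2.8 | 1.6 | 1.8 | 2.0 | 0.0 | 0.4 | 0.0 |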


In [20]:
df_counts = df_boots_tran[df_boots_tran < max_tran].count().to_frame(name='count')
df_counts['prop'] = df_counts['count'] / N_BOOTS
df_counts['pass'] = np.where(df_counts['prop'] >= (1-p), 1, 0)

passed_2 = df_counts.loc[df_counts['pass'] == 1]
passed_2

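| | count | prop | pass |
|---|---|---|---|
| 1 | 1000 | 1.000 | 1 |
| 2 | 1000 | 1.000 | 1 |
| 3 | 987 | 0.987 | 1 |
| 15 | 1000 | 1.000 | 1 |
| 36 | 1000 | 1.000 | 1 |
| 51 | 981 | 0.981 | 1 |
| 63 | 1000 | 1.000 | 1 |
| 64 | 1000 | 1.000 | 1 |
| 65 | 1000 | 1.000 | 1 |
| 66 | 1000 | 1.000 | 1 |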


In [21]:
np.array(passed_1.index)

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

In [22]:
passed_1 = bootstrap_chance_constraint(data = system_data_util.T, threshold=min_util, boot_args=args)

In [23]:
print(pd.DataFrame(passed_1))

       0
0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
20    21
21    22
22    23
23    24
24    25
25    26
26    27
27    28
28    29
29    30
..   ...
302  311
303  312
304  315
305  316
306  317
307  319
308  321
309  322
310  324
311  325
312  326
313  328
314  329
315  330
316  331
317  332
318  333
319  335
320  337
321  338
322  341
323  342
324  343
325  344
326  346
327  347
328  348
329  349
330  350
331  351

[332 rows x 1 columns]


In [24]:
np.array(passed_2.index)

array([   1,    2,    3,   15,   36,   51,   63,   64,   65,   66,   67,
         68,   80,   89,   90,  102,  120,  131,  132,  133,  134,  135,
        136,  148,  149,  158,  159,  166,  185,  191,  199,  203,  204,
        205,  206,  207,  208,  209,  210,  220,  221,  222,  223,  230,
        231,  232,  238,  239,  245,  257,  263,  271,  274,  275,  276,
        277,  278,  279,  280,  281,  282,  283,  284,  285,  293,  294,
        295,  296,  297,  303,  304,  305,  306,  311,  312,  313,  318,
        319,  323,  327,  330,  331,  333,  336,  344,  347,  348,  349,
        350,  351,  352,  353,  354,  355,  356,  357,  358,  359,  367,
        368,  369,  370,  371,  372,  377,  378,  379,  380,  381,  382,
        386,  387,  388,  393,  394,  395,  399,  400,  403,  404,  406,
        407,  408,  410,  411,  413,  416,  422,  424,  425,  426,  427,
        428,  429,  430,  431,  432,  433,  434,  435,  436,  437,  438,
        439,  440,  441,  446,  447,  448,  449,  4

In [25]:
passed_2 = bootstrap_chance_constraint(data = system_data_tran.T, threshold=max_tran, boot_args=args, kind='upper')

In [26]:
passed_2

Int64Index([   1,    2,    3,   15,   36,   51,   63,   64,   65,   66,
            ...
            1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051],
           dtype='int64', length=738)

In [27]:
#subset = np.intersect1d(np.array(passed_1.index), np.array(passed_2.index))
subset = np.intersect1d(passed_1, passed_2)

In [28]:
#need to zero index. 
subset_zero = [x - 1 for x in subset]
#subset_zero = subset
subset_waits = df_wait[subset_zero].mean()
subset_waits.rename('wait', inplace=True)
subset_utils = df_util[subset_zero].mean()
subset_utils.rename('util', inplace=True)
subset_tran = df_tran[subset_zero].mean()
subset_tran.rename('tran', inplace=True)

0       0.2
1      28.2
2      37.4
14     36.0
35     32.2
50     23.8
62     35.2
63     15.4
64      8.8
65      2.0
66      0.0
67     26.0
79     26.0
88      0.0
89     33.6
101    31.8
116    30.2
119     0.0
130    21.4
131    10.8
132     3.0
133     0.0
134     0.0
135    23.2
147     0.0
148    21.6
157    23.8
158    28.8
165     0.0
171    37.6
       ... 
270    24.4
273    33.2
274    18.4
275    14.0
276     4.4
277     1.8
278     0.0
279     0.0
280    12.2
282    20.4
283    24.8
284    32.2
292    22.2
293    10.6
296    31.0
302    24.2
304    21.6
305    31.6
310    22.4
311    16.6
318    31.2
329    26.4
330    33.2
332    21.8
343    28.4
346    28.8
347    18.6
348    14.2
349     5.8
350     2.2
Name: tran, dtype: float64
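
The two index manipulations used above (intersecting the systems that pass each constraint, then shifting 1-based system labels to 0-based column positions) can be seen on a small example. The label arrays here are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 1-based labels of systems passing each chance constraint.
passed_util = np.array([1, 2, 3, 5, 8])
passed_tran = np.array([2, 3, 4, 8, 9])

# Systems that satisfy both constraints (sorted, unique values).
both = np.intersect1d(passed_util, passed_tran)

# Shift to 0-based positions for selecting columns of a default-indexed DataFrame.
positions = both - 1

print(both)       # [2 3 8]
print(positions)  # [1 2 7]
```

Keeping the label array (`both`) separate from the position array (`positions`) makes it easier to see which of the two is needed at each later step.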

In [29]:
subset_kpi = pd.concat([subset_waits, subset_utils, subset_tran], axis=1)

In [30]:
subset_kpi.sort_values(by=['wait', 'util', 'tran'])

| | wait | util | tran |
|---|---|---|---|
| 279 | 0.256866 | 82.502992 | 0.0 |
| 293 | 0.256866 | 82.502992 | 10.6 |
| 280 | 0.256866 | 82.502992 | 12.2 |
| 311 | 0.256866 | 82.502992 | 16.6 |
| 282 | 0.256866 | 82.502992 | 20.4 |
| 304 | 0.256866 | 82.502992 | 21.6 |
| 283 | 0.256866 | 82.502992 | 24.8 |
| 296 | 0.256866 | 82.502992 | 31.0 |
| 318 | 0.256866 | 82.502992 | 31.2 |
| 284 | 0.256911 | 82.502992 | 32.2 |


In [31]:
best_system_index = subset_kpi.sort_values(by=['wait', 'util', 'tran']).index[0]

In [32]:
best_system_index

279

### Step 3: Set up differences

In [33]:
feasible_systems = df_wait[subset_zero]

In [34]:
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
diffs =  pd.DataFrame(feasible_systems.values.T - np.array(feasible_systems[best_system_index])).T
diffs.columns = subset
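
As a small self-contained illustration of this step, the sketch below computes replication-wise differences from the best system with `DataFrame.sub`. The column labels and waiting times are made-up toy values, not taken from the data files.

```python
import pandas as pd

# Hypothetical waiting times: 3 replications (rows) of 3 systems (columns).
waits = pd.DataFrame({
    279: [0.25, 0.26, 0.27],
    280: [0.30, 0.29, 0.31],
    282: [0.25, 0.27, 0.27],
})
best = 279  # label of the system with the lowest mean wait

# Subtract the best system's replications from every column, row by row.
diffs = waits.sub(waits[best], axis=0)
print(diffs)
```

The column for the best system is all zeros by construction, and positive values elsewhere indicate a longer wait than the best system in that replication.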

### Step 4: Bootstrap differences

In [35]:
resample_diffs = bs.resample_all_scenarios(diffs.values.T.tolist(), args)

In [36]:
df_boots_diffs= cf.resamples_to_df(resample_diffs, N_BOOTS)
df_boots_diffs.columns = subset
df_boots_diffs.shape

(1000, 83)

### Step 5: Rank systems

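A system is taken forward if at least a proportion $ y $ of its bootstrapped differences from the best system lie within a fraction $ x $ of the best system's mean waiting time.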

In [37]:
x = 0.1
y = 0.95

In [38]:
indifference = feasible_systems[best_system_index].mean() * x
indifference

0.025686578680000002

In [39]:
# convert each bootstrapped difference to 0 or 1
# 1 = difference within the indifference zone (<= indifference)
# 0 = difference outside the indifference zone (> indifference)

def indifferent(x):
    if x <= indifference:
        return 1
    else:
        return 0

df_indifference = df_boots_diffs.applymap(lambda x: indifferent(x))
df_indifference

Unnamed: 0,1,2,3,15,36,51,63,64,65,66,...,319,330,331,333,344,347,348,349,350,351
1,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0


In [40]:
threshold = N_BOOTS * y
df_within_limit = df_indifference.sum(0)
df_within_limit= pd.DataFrame(df_within_limit, columns=['sum'])
take_forward = df_within_limit.loc[df_within_limit['sum'] >= threshold].index

In [41]:
take_forward

Int64Index([280, 281, 283, 284, 285, 294, 297, 305, 306, 312, 319, 331], dtype='int64')
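
The ranking rule in Step 5 can also be written without an element-wise `applymap`. The following is a minimal vectorised sketch under the same rule (keep a system if at least a proportion `y` of its bootstrapped differences are no larger than `x` times the best system's mean). The function name `within_indifference` and the toy differences are assumptions for illustration only.

```python
import numpy as np

def within_indifference(boot_diffs, best_mean, x=0.1, y=0.95):
    """Return a boolean mask of systems to take forward.

    boot_diffs: (n_boots, n_systems) bootstrapped differences from the best.
    """
    zone = best_mean * x                       # width of the indifference zone
    prop = (boot_diffs <= zone).mean(axis=0)   # share of resamples inside it
    return prop >= y

# Hypothetical bootstrapped differences for three systems.
rng = np.random.default_rng(1)
boot_diffs = np.column_stack([
    np.zeros(1000),                   # the best system compared with itself
    rng.normal(0.01, 0.002, 1000),    # close to the best
    rng.normal(0.10, 0.002, 1000),    # clearly worse than the best
])
keep = within_indifference(boot_diffs, best_mean=0.25)
print(keep)
```

Comparing the mean of the boolean matrix with `y` is equivalent to comparing the column sums with `N_BOOTS * y`, as the notebook cells do.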

## 4. Procedure: Stage 2

### Step 6: More replicates of promising solutions

The user simulates $ n_2 $ additional replicates for the feasible solutions brought forward from stage 1.

In this example, 50 replicates are used in total (45 extra).

In [42]:
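# NOTE: take_forward holds 1-based system labels, whereas these DataFrame columns appear to be
# 0-based (Stage 1 subtracted 1 before selecting); check that this selection picks the intended systems.
df_wait_s2 = pd.DataFrame(crn.load_scenarios(INPUT_DATA1))[take_forward]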
df_util_s2 = pd.DataFrame(crn.load_scenarios(INPUT_DATA2))[take_forward]
df_tran_s2 = pd.DataFrame(crn.load_scenarios(INPUT_DATA3))[take_forward]

N_SCENARIOS = df_wait_s2.shape[1]
N_REPS = df_wait_s2.shape[0]

print("Loaded waiting time data. {0} systems; {1} replications".format(df_wait_s2.shape[1], df_wait_s2.shape[0]))
print("Loaded utilization data. {0} systems; {1} replications".format(df_util_s2.shape[1], df_util_s2.shape[0]))
print("Loaded transfers data. {0} systems; {1} replications".format(df_tran_s2.shape[1], df_tran_s2.shape[0]))

Loaded waiting time data. 12 systems; 50 replications
Loaded utilization data. 12 systems; 50 replications
Loaded transfers data. 12 systems; 50 replications


### Step 7: Repeat steps 2 - 5

#### Step 2 - Chance constraints

In [43]:
passed_1 = bootstrap_chance_constraint(data = df_util_s2.values.T, threshold=min_util, boot_args=args)

In [44]:
passed_1

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype='int64')

In [45]:
take_forward

Int64Index([280, 281, 283, 284, 285, 294, 297, 305, 306, 312, 319, 331], dtype='int64')

In [46]:
cc_1 = np.array([take_forward[x-1] for x in passed_1])
cc_1

array([280, 281, 283, 284, 285, 294, 297, 305, 306, 312, 319, 331], dtype=int64)

In [47]:
passed_2 = bootstrap_chance_constraint(data = df_tran_s2.values.T, threshold=max_tran, boot_args=args, kind='upper')

In [48]:
passed_2

Int64Index([1, 2, 3, 4, 6, 8, 10], dtype='int64')

In [49]:
cc_2 = np.array([take_forward[x-1] for x in passed_2])
cc_2

array([280, 281, 283, 284, 294, 305, 312], dtype=int64)
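
The list comprehensions above translate 1-based positions within the stage 2 subset back to the original system labels. A vectorised equivalent, using the values printed in this notebook, might look like this (the variable names `labels_s1`, `positions` and `labels_passed` are introduced here for illustration):

```python
import numpy as np

# Labels carried forward from stage 1 (values printed earlier in the notebook).
labels_s1 = np.array([280, 281, 283, 284, 285, 294, 297, 305, 306, 312, 319, 331])

# 1-based positions returned by the transfers chance constraint in stage 2.
positions = np.array([1, 2, 3, 4, 6, 8, 10])

# Fancy indexing with 0-based positions recovers the system labels.
labels_passed = labels_s1[positions - 1]
print(labels_passed)  # [280 281 283 284 294 305 312]
```

This reproduces the `cc_2` array shown above without a Python-level loop.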

In [50]:
subset = np.intersect1d(cc_1, cc_2)
subset

array([280, 281, 283, 284, 294, 305, 312], dtype=int64)

In [51]:
# the stage 2 DataFrames are already labelled with the take_forward values,
# so columns are selected with `subset` directly and no index shift is needed
subset_waits = df_wait_s2[subset].mean()
subset_waits.rename('wait', inplace=True)
subset_utils = df_util_s2[subset].mean()
subset_utils.rename('util', inplace=True)
subset_tran = df_tran_s2[subset].mean()
subset_tran.rename('tran', inplace=True)

280    16.06
281    21.60
283    33.78
284    39.84
294    22.66
305    39.16
312    34.74
Name: tran, dtype: float64

In [73]:
subset_kpi = pd.concat([subset_waits, subset_utils, subset_tran], axis=1)
subset_kpi.index.rename('System', inplace=True)

In [74]:
subset_kpi.sort_values(by=['wait', 'util', 'tran'])


| System | wait | util | tran |
|---|---|---|---|
| 280 | 0.29956 | 82.729884 | 16.06 |
| 281 | 0.29956 | 82.729884 | 21.6 |
| 294 | 0.29956 | 82.729884 | 22.66 |
| 283 | 0.299624 | 82.729884 | 33.78 |
| 284 | 0.299798 | 82.729884 | 39.84 |
| 305 | 0.300329 | 82.730468 | 39.16 |
| 312 | 0.300538 | 82.731635 | 34.74 |


In [54]:
best_system_index = subset_kpi.sort_values(by=['wait', 'util', 'tran']).index[0]

In [55]:
best_system_index

280

#### Step 3 - Set up differences from best (stage 2)

In [56]:
feasible_systems = df_wait_s2[subset]
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
diffs =  pd.DataFrame(feasible_systems.values.T - np.array(feasible_systems[best_system_index])).T
diffs.columns = subset

#### Step 4 - Bootstrap differences

In [57]:
resample_diffs = bs.resample_all_scenarios(diffs.values.T.tolist(), args)


In [58]:
df_boots_diffs= cf.resamples_to_df(resample_diffs, args.nboots)
df_boots_diffs.columns = subset
df_boots_diffs.shape

(1000, 7)

In [59]:
x = 0.1
y = 0.95

In [60]:
indifference = feasible_systems[best_system_index].mean() * x
indifference

0.029956006048000014

In [61]:
df_indifference = df_boots_diffs.applymap(lambda x: indifferent(x))
df_indifference

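| | 280 | 281 | 283 | 284 | 294 | 305 | 312 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |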


In [62]:
threshold = args.nboots * y
df_within_limit = df_indifference.sum(0)
df_within_limit= pd.DataFrame(df_within_limit, columns=['sum'])
final_set = df_within_limit.loc[df_within_limit['sum'] >= threshold].index

In [63]:
final_set

Int64Index([280, 281, 283, 284, 294, 305, 312], dtype='int64')

Final set of feasible systems selected from the competing designs

In [75]:
df_doe = pd.read_csv(DESIGN, index_col='System')
temp = df_doe[df_doe.index.isin(final_set)]
subset_kpi = subset_kpi.applymap(lambda x: '%.4f' % x)
df_final = pd.concat([temp, subset_kpi], axis=1)
df_final.sort_values(by=['wait', 'util', 'tran'])



| System | Total beds | Size of Bays | Number of Bays | Number of Singles | wait | util | tran |
|---|---|---|---|---|---|---|---|
| 280 | 47 | 0 | 0 | 47 | 0.2996 | 82.7299 | 16.06 |
| 281 | 47 | 3 | 3 | 38 | 0.2996 | 82.7299 | 21.6 |
| 294 | 47 | 4 | 2 | 39 | 0.2996 | 82.7299 | 22.66 |
| 283 | 47 | 3 | 5 | 32 | 0.2996 | 82.7299 | 33.78 |
| 284 | 47 | 3 | 6 | 29 | 0.2998 | 82.7299 | 39.84 |
| 305 | 47 | 5 | 3 | 32 | 0.3003 | 82.7305 | 39.16 |
| 312 | 47 | 6 | 2 | 35 | 0.3005 | 82.7316 | 34.74 |
