# BoXHED 2.0 quick start

BoXHED 2.0 is a software package for nonparametrically estimating hazard functions via gradient boosting. It extends BoXHED 1.0, which is described in the paper [BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates](http://proceedings.mlr.press/v119/wang20o/wang20o.pdf).

This section demonstrates how to apply BoXHED 2.0 to a synthetic dataset.

### 1. Importing convenience functions from main.py

Here we introduce the functions we import from main.py (the script we use for evaluating BoXHED 2.0).

**_read_synth** reads the synthetic data for training and returns a pandas dataframe.

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates (0, 20, or 40)

output:

a pandas dataframe consisting of the following columns:
* *patient*: the patient (unit) number, running from 1 to the number of patients in the dataset
* *t_start*: the start time of the observation
* *t_end*: the end time of the observation
* *X_i*: the remaining covariates (their column names are arbitrary)
* *delta*: indicator of whether the event of interest was observed for that row

We will see what the input data looks like shortly.

In [1]:
from main import _read_synth

**_read_synth_test** reads the synthetic data for testing and returns a pandas dataframe as well as the true hazard function values (for RMSE calculation).

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates (0, 20, or 40)

output:

* a numpy array containing the true hazard value for each row of the test data.
* a pandas dataframe consisting of the following columns:
  * *t_start*: the start time of the observation
  * *X_i*: the remaining covariates (their column names are arbitrary)

We will see what the input data looks like shortly.

In [2]:
from main import _read_synth_test

**drop_rows** drops rows randomly to introduce censoring. 

input:
* *data*: input data as read by *_read_synth*
* *keep_prob*: probability of each row staying in the dataset

output:
* a pandas dataframe similar to the input, but typically with fewer rows and with discontinuities in time.
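The row-dropping idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation in main.py: the function name `drop_rows_sketch`, the `keep_prob` and `seed` parameters, and the use of plain dictionaries instead of a pandas dataframe are all hypothetical.

```python
import random

def drop_rows_sketch(rows, keep_prob, seed=None):
    """Keep each row independently with probability keep_prob (hypothetical helper)."""
    if not 0 < keep_prob <= 1:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so keep_prob == 1 keeps every row.
    return [row for row in rows if rng.random() < keep_prob]

# Toy trajectory for one patient: contiguous (t_start, t_end) intervals.
rows = [
    {"patient": 1, "t_start": 0.0, "t_end": 0.1},
    {"patient": 1, "t_start": 0.1, "t_end": 0.2},
    {"patient": 1, "t_start": 0.2, "t_end": 0.3},
    {"patient": 1, "t_start": 0.3, "t_end": 0.4},
]

kept = drop_rows_sketch(rows, keep_prob=0.8, seed=0)
print(len(kept), "of", len(rows), "rows kept")
```

Whenever a row in the middle of a patient's history is dropped, the remaining rows no longer cover a contiguous time span, which is the discontinuity in time mentioned above.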

In [3]:
from main import drop_rows

As an example, we select arbitrary values for the simulation parameters and use them to train and test BoXHED 2.0.

### 2. Running an example

Select specific simulation parameters:

In [4]:
exp_num   = 41   # experiment index; could also be 42, 43, or 44
num_irr   = 20   # number of irrelevant features; could also be 0 or 40
keep_prob = 0.8  # probability of keeping a row (1 - dropout probability); any number in (0, 1]
num_quant = 256  # number of quantiles; any integer in [8, 256]

Read in the data:

In [5]:
data = _read_synth(exp_num, num_irr)

Take a look at the data:

In [7]:
data

| Unnamed: 0 | patient | t_start | t_end | X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | ... | X_13 | X_14 | X_15 | X_16 | X_17 | X_18 | X_19 | X_20 | X_21 | delta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.010000 | 0.074706 | 0.265509 | 0.372124 | 0.572853 | 0.908208 | 0.201682 | 0.898390 | 0.944675 | ... | 0.687023 | 0.384104 | 0.769841 | 0.497699 | 0.717619 | 0.991906 | 0.380035 | 0.777445 | 0.934705 | 0 |
| 1 | 1 | 0.074706 | 0.107241 | 0.782933 | 0.553036 | 0.529720 | 0.789356 | 0.023331 | 0.477230 | 0.732314 | ... | 0.070679 | 0.099466 | 0.316272 | 0.518634 | 0.662005 | 0.406830 | 0.912876 | 0.293603 | 0.459066 | 0 |
| 2 | 1 | 0.107241 | 0.152631 | 0.757087 | 0.202692 | 0.711121 | 0.121692 | 0.245489 | 0.143304 | 0.239629 | ... | 0.455274 | 0.410084 | 0.810870 | 0.604933 | 0.654724 | 0.353197 | 0.270260 | 0.992684 | 0.633493 | 0 |
| 3 | 1 | 0.152631 | 0.186180 | 0.511170 | 0.207545 | 0.228658 | 0.595712 | 0.574872 | 0.077064 | 0.035541 | ... | 0.985095 | 0.507642 | 0.682788 | 0.601541 | 0.238869 | 0.258166 | 0.729310 | 0.452571 | 0.175127 | 0 |
| 4 | 1 | 0.186180 | 0.257973 | 0.723726 | 0.337615 | 0.630414 | 0.840615 | 0.856132 | 0.391359 | 0.380494 | ... | 0.293730 | 0.191260 | 0.886451 | 0.503339 | 0.877058 | 0.189194 | 0.758103 | 0.724499 | 0.943725 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 66924 | 5000 | 0.762821 | 0.827155 | 0.728067 | 0.899735 | 0.727782 | 0.268609 | 0.378633 | 0.526506 | 0.972564 | ... | 0.714576 | 0.628564 | 0.519994 | 0.974485 | 0.570136 | 0.464748 | 0.581276 | 0.695218 | 0.103566 | 0 |
| 66925 | 5000 | 0.827155 | 0.851700 | 0.699792 | 0.211259 | 0.008285 | 0.327462 | 0.884665 | 0.940388 | 0.083020 | ... | 0.440402 | 0.065167 | 0.729235 | 0.211562 | 0.665426 | 0.117271 | 0.376724 | 0.613720 | 0.572595 | 0 |
| 66926 | 5000 | 0.851700 | 0.855931 | 0.201042 | 0.138092 | 0.811001 | 0.271753 | 0.091299 | 0.234880 | 0.031343 | ... | 0.763636 | 0.883953 | 0.758610 | 0.618501 | 0.473183 | 0.212099 | 0.933392 | 0.480001 | 0.970905 | 0 |
| 66927 | 5000 | 0.855931 | 0.943535 | 0.561700 | 0.723730 | 0.008085 | 0.392067 | 0.128992 | 0.192462 | 0.098971 | ... | 0.690258 | 0.608737 | 0.007401 | 0.707869 | 0.291385 | 0.873829 | 0.948604 | 0.924394 | 0.886101 | 0 |

As can be seen above, patients are numbered from 1 to N. Each trajectory (row) satisfies $t_{\text{start}} < t_{\text{end}}$, and consecutive trajectories $i$ and $i+1$ of the same patient satisfy $t_{\text{end},i} \leq t_{\text{start},i+1}$. *delta* denotes whether that specific trajectory ended with the event of interest being observed. All the other columns are covariates.
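The two timing constraints above are easy to check before training. The following is a minimal sketch, assuming the rows are available as plain dictionaries keyed by the column names shown above; the helper name `check_trajectories` is hypothetical and not part of BoXHED or main.py.

```python
from collections import defaultdict

def check_trajectories(rows):
    """Return a list of problems found; an empty list means the rows pass."""
    problems = []
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["patient"]].append(row)

    for patient, traj in by_patient.items():
        # Order each patient's rows in time before comparing neighbours.
        traj = sorted(traj, key=lambda r: r["t_start"])
        for i, row in enumerate(traj):
            # Constraint 1: t_start < t_end within a row.
            if not row["t_start"] < row["t_end"]:
                problems.append(f"patient {patient}: t_start >= t_end at t_start={row['t_start']}")
            # Constraint 2: t_end of row i <= t_start of row i+1.
            if i + 1 < len(traj) and row["t_end"] > traj[i + 1]["t_start"]:
                problems.append(f"patient {patient}: overlap after t_end={row['t_end']}")
    return problems

# The first three rows of the table above (covariates omitted).
good = [
    {"patient": 1, "t_start": 0.010000, "t_end": 0.074706, "delta": 0},
    {"patient": 1, "t_start": 0.074706, "t_end": 0.107241, "delta": 0},
    {"patient": 1, "t_start": 0.107241, "t_end": 0.152631, "delta": 0},
]
# A made-up example with overlapping intervals.
bad = [
    {"patient": 2, "t_start": 0.0, "t_end": 0.5, "delta": 0},
    {"patient": 2, "t_start": 0.4, "t_end": 0.9, "delta": 1},
]

print(check_trajectories(good))
print(check_trajectories(bad))
```

Gaps between consecutive rows are allowed by the second constraint, so data thinned by *drop_rows* would still pass this check.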

Import and instantiate the BoXHED model. For simplicity, hyperparameter tuning is omitted here, but it is implemented in *main.py*:

In [8]:
from boxhed import boxhed
boxhed_ = boxhed(max_depth    = 1,
                 n_estimators = 150)

Now we call the preprocessor on the input data:

In [9]:
subjects, X, w, delta = boxhed_.preprocess(
        data             = data,
        quant_per_column = num_quant,
        weighted         = True,
        nthreads         = 1)

boxhed.preprocess takes 4 inputs. The only one needing clarification is the boolean *weighted*, which determines whether the quantiles are weighted in training.

It returns 4 outputs:
* *subjects*: the patient corresponding to each row of *X*, *w*, and *delta*.
* *X*: input covariates as fed to BoXHED 2.0. It consists of the covariates as well as *t_start*.
* *w*: duration of each trajectory.
* *delta*: denotes whether the event of interest has happened at the end of the trajectory or not.
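To make the four outputs concrete, here is a rough sketch of how they relate to the columns of the input data. It is not the BoXHED preprocessor: it ignores the quantile step controlled by *quant_per_column* entirely, assumes that *w* is simply `t_end - t_start`, and assumes *t_start* is placed first in *X*. The helper name `to_training_arrays` is hypothetical.

```python
def to_training_arrays(rows, covariate_names):
    """Split long-format rows into (subjects, X, w, delta); illustrative only."""
    subjects = [r["patient"] for r in rows]
    # Assumption: X holds t_start followed by the covariates, in that order.
    X = [[r["t_start"]] + [r[name] for name in covariate_names] for r in rows]
    # Assumption: the duration of a trajectory is t_end - t_start.
    w = [r["t_end"] - r["t_start"] for r in rows]
    delta = [r["delta"] for r in rows]
    return subjects, X, w, delta

# First two rows of the table above, keeping only X_1 and X_2.
rows = [
    {"patient": 1, "t_start": 0.010000, "t_end": 0.074706,
     "X_1": 0.265509, "X_2": 0.372124, "delta": 0},
    {"patient": 1, "t_start": 0.074706, "t_end": 0.107241,
     "X_1": 0.782933, "X_2": 0.553036, "delta": 0},
]

subjects, X, w, delta = to_training_arrays(rows, ["X_1", "X_2"])
print(subjects, X, w, delta)
```

All four outputs have one entry per row, which is why they can be passed together to the fitting step.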

Fitting a BoXHED model looks like this:

In [10]:
boxhed_.fit(X, delta, w)

boxhed(n_estimators=150)

We now read the test set and its corresponding true hazard values:

In [11]:
true_haz, test_x = _read_synth_test(exp_num, num_irr)

Make predictions on the test set:

In [12]:
preds = boxhed_.predict(test_x)

We now compute the RMSE:

In [13]:
from main import calc_L2
L2 = calc_L2(preds, true_haz)

The point estimate, along with its confidence interval (CI), is as follows:

In [14]:
L2

[0.1752634914366061, [0.17166798400756458, 0.17885899886564763]]
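The result above has the form `[point_estimate, [lower, upper]]`. How *calc_L2* arrives at the interval is not shown in this section, so the following is only a sketch of one common approach: a normal-approximation interval on the mean squared error, mapped through a square root. The function name `rmse_with_ci`, the `z` parameter, and the choice of 1.96 (a 95% interval) are assumptions.

```python
import math

def rmse_with_ci(preds, truth, z=1.96):
    """Return [rmse, [lower, upper]] using a normal approximation on the MSE."""
    if len(preds) != len(truth) or len(preds) < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    sq_err = [(p - t) ** 2 for p, t in zip(preds, truth)]
    n = len(sq_err)
    mse = sum(sq_err) / n
    # Sample standard deviation of the squared errors.
    sd = math.sqrt(sum((e - mse) ** 2 for e in sq_err) / (n - 1))
    half_width = z * sd / math.sqrt(n)
    # Clamp at zero so the square root is always defined.
    lower = math.sqrt(max(mse - half_width, 0.0))
    upper = math.sqrt(mse + half_width)
    return [math.sqrt(mse), [lower, upper]]

# Made-up predictions and true hazard values, for illustration only.
preds = [0.9, 1.1, 2.0, 3.2]
truth = [1.0, 1.0, 2.0, 3.0]
print(rmse_with_ci(preds, truth))
```

With many test rows the interval becomes narrow, as in the output above, because the half-width shrinks in proportion to the inverse square root of the number of rows.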