# BoXHED 2.0

BoXHED 2.0 is a software package for nonparametrically estimating hazard functions via gradient boosting. It extends BoXHED 1.0 whose paper can be found here: [BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates](http://proceedings.mlr.press/v119/wang20o/wang20o.pdf).

## Prerequisites
The software developed and tested in Linux and Mac OS environments. The requirements are the following:
- cmake  (>=3.18.2)
- Python (>=3.8)
- conda

## Quick Start

This section provides a demonstration of applying BoXHED 2.0 to a synthetic data example. 

### 0. Set up a conda environment

We highly recommend devoting a conda environment to BoXHED 2.0. This step makes sure BoXHED 2.0 will not interfere with XGBoost (the library we have borrowed from extensively) when installed. This implementation uses python 3.8.

Installing the conda environment should be done prioer to opening this notebook. Therefore, you need to set up the environment as instructed here and then reopen this notebook. So, please open a terminal and do the following:

First create the conda environment:
```
conda create -n boxhed2.0 python=3.8
```

then activate it
```
conda activate boxhed2.0
```

now install numpy, pandas, scikit-learn, pytz, py3nvml, matplotlib and jupyter notebook by:
```
bash conda_install_packages.sh
```

then run jupyter notebook
```
jupyter notebook 
``` 

now open this notebook using the jupyter you just openened. Then move to step 1.

### 1. Install BoXHED2.0 and Preprocessor

This stage installs BoXHED2.0 as well as the preprocessor. You may install them by running the following line:

In [1]:
! bash setup.sh

~ ~ ~ ~ ~ ~ > creating build directory for boxhed2.0 in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/BoXHED2.0/  ...
 -- successful
~ ~ ~ ~ ~ ~ > running cmake for boxhed in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/BoXHED2.0/build/  ...
 -- successful
~ ~ ~ ~ ~ ~ > running make for boxhed in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/BoXHED2.0/build/  ...
 -- successful
~ ~ ~ ~ ~ ~ > setting up boxhed for python in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/BoXHED2.0/python-package/  ...
 -- successful
~ ~ ~ ~ ~ ~ > boxhed installed successfully  ...
~ ~ ~ ~ ~ ~ > creating build directory for preprocessor in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/build/  ...
 -- successful
~ ~ ~ ~ ~ ~ > running cmake for preprocessor in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/build/  ...
 -- successful
~ ~ ~ ~ ~ ~ > running cmake --build for preprocessor in /Users/a.pakbin-admin/Desktop/boxhed_test/BoXHED2.0Main/build/

### 2. Importing convenience functions from main.py

Here we introduce the functions we import from main.py (the script we use for evaluating BoXHED2.0).

**_read_synth** reads the synthetic data for training and returns a pandas dataframe.

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates, 0, 20, or 40

output:

a pandas data frame consisting of the following columns:
* *patient*: the unit number. It starts from 1 to the number of the patients in the datasets.
* *t_start*: the start time of the observation
* *t_end*: the end time of the observation
* *X_i*: other covariates (the name is not important for other covariates)

In [2]:
from main import _read_synth

**_read_synth_test** reads the synthetic data for testing and returns a pandas dataframe as well as true hazard function (for RMSE calculation)

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates, 0, 20, or 40

output:

* a numpy array for true hazard function for each row of the test data.
* a pandas data frame consisting of the following columns:
  * *t_start*: the start time of the observation
  * *X_i*: other covariates (the name is not important for other covariates)

In [3]:
from main import _read_synth_test

**drop_rows** drops rows randomly to introduce censoring. 

input:
* *data*: input data as read by *_read_synth*
* *num_irrelevant*: probability of each row staying in the dataset

output:
* a pandas data frame similar to the input, but probably with fewer rows and disccontinuity in time.

In [4]:
from main import drop_rows

as an example, we select arbitrary values for simulation parameters and train/test BoXHED2.0 using them.

selecting specific simulation parameters:

In [5]:
ind_exp   = 41  #experiment index. could also be 42, 43, and 44
num_irr   = 20  #number of irrelevant features. could also be 0 and 40
keep_prob = .8  #1-prob_{dropout}. could be any number in (0,1]
num_quant = 256 #number of quantiles. Could be any integer in [8, 256] 

reading in the data

In [6]:
data = _read_synth(ind_exp, num_irr)

importing the preprocessor and initializing it:

In [7]:
from preprocessor import preprocessor
prep = preprocessor()

now we call the preprocessor on the input data:

In [8]:
pats, X, y = prep.preprocess(
        data             = data,
        quant_per_column = num_quant,
        weighted         = True,
        nthreads         = 1)

The preprocessor has 4 inputs. The only one needing clarification is a boolean *weighted* which decides whether the quantiles are weighted in training or no.

It also has 3 outputs:
* *pats*: patients for each row of *X* and *y*.
* *X*: input covariates as fed to BoXHED 2.0. It consists of covariates as well as *t_start*
* *y*: has two columns. First column is the dt (duration of the trajectory) and the second column denotes whether the event has happened at the end of the trajectory or not.

we subsequently initialize a BoXHED model: (for simplicity we have omitted grid searching, but it is implemented in *main.py*)

In [9]:
from boxhed import boxhed
boxhed_ = boxhed(max_depth    = 1,
                 n_estimators = 150)

fitting a BoXHED model would look like:

In [10]:
boxhed_.fit (X,y.iloc[:,0], y.iloc[:,1])

boxhed(n_estimators=150)

we now read the test set and its corresponding true hazard value:

In [11]:
true_haz, test_X = _read_synth_test(ind_exp, num_irr) 

fixing the test data on the boundaries:

In [12]:
test_x = prep.fix_data_on_boundaries(test_X)

making a prediction on the test set:

In [13]:
preds = boxhed_.predict(test_X)

we now measure the RMSE by:

In [14]:
from main import calc_L2
L2 = calc_L2(preds, true_haz)

the point estimate along with the CI is as follows:

In [15]:
L2

[0.17424706681739754, [0.17070350517369048, 0.1777906284611046]]