Source/Discrete_specification.Rmd

---
title: "Approaches based on a discrete survival times specification"
author: "Karla Monterrubio-Gómez, Nathan Constantine-Cooke, and Catalina Vallejos"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: show
    toc: yes
    css: style.css
    theme: simplex
    highlight: textmate
    includes:
      in_header: head.html
      before_body: navbar.html
    self_contained: false
    toc_float:
      collapsed: no
link-citations: yes
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  cache.lazy = FALSE
)
```


# Dataset

In order to demonstrate the methods, we employ publicly available data. 

The dataset used here corresponds to the Hodgkin’s disease (HD) study described
in Pintilie, 2006. The dataset comprises 865 patients diagnosed with early stage
(I or II) HD, and which were treated either with radiation (RT) or with
radiation and chemotherapy (CMT).

The recorded data includes:

* age: Age (years)
* sex: Sex, F=female and M=Male.
* trtgiven: Treatment given, RT=Radiation, CMT=Chemotherapy and radiation
* medwidsi: Size of mediastinum involvement, N=No, S=Small, L=Large
* extranod: Extranodal disease, Y=Extranodal disease, N= Nodal disease
* clinstg: Clinical stage, 1=Stage I, 2=Stage II
* time: time to failure (years) calculated from the date of diagnosis
* status: 0=censoring, 1=relapse and 2=death.

We now load and display the structure of the HD dataset:

```{r, message=FALSE, warning=FALSE}
library(readr)
hd <- data.frame(read_csv("../Data/HD/hd.csv",
                          col_types = cols(X1 = col_skip())))
str(hd)
```

To proceed with the analysis, it is important to change the data type of sex,
trtgiven, medwidsi, and extranod from character to factor. Similarly, we convert
clinstg from numeric to factor.

```{r}
hd$sex      <- as.factor(hd$sex)
hd$trtgiven <- as.factor(hd$trtgiven)
hd$medwidsi <- as.factor(hd$medwidsi)
hd$extranod <- as.factor(hd$extranod)
hd$clinstg  <- as.factor(hd$clinstg)
```

Now, we explore the number of events for each event type:

```{r}
require(pander)
pander::pander(table(hd$status))
```

Thus, we have `r length(which(hd$status==0))` censored patients, 
`r length(which(hd$status==1))` with relapse, and `r length(which(hd$status==2))`
who died. From now on, we assume that the event of interest is relapse, 
i.e. status=1.

In order to create a test set, we use stratified sampling to partition our
dataset into 80% for train and 20% for test.

```{r}
library(splitstackshape)
set.seed(2022)
split_data <- stratified(hd, c("status"), 0.8, bothSets = TRUE)
hd_train   <- split_data$SAMP1[,-1]
hd_test    <- split_data$SAMP2[,-1]
```

Now, we explore the number of observations per status in both train and test
set:

```{r}
pander::pander(table(hd_train$status))
```

```{r}
pander::pander(table(hd_test$status))
```

# BART

BART has a well documented vignette (@sparapani2021). Here, we focus only on 
demonstrating its usage for a CR setting, which corresponds to Section 5.3 in 
@sparapani2021. In the following, we fit the model with the two different 
likelihood formulations to compare the obtained estimates.
 
The first step is to recast the data creating dummy variables for all 
categorical covariates. This is done for both train and test sets:

```{r}
library(nnet)
library(survival)
library(BART)
library(stats)

xtrain = model.matrix(~. , hd_train[,c(1:6)])[,-1]
xtest = model.matrix(~. , hd_test[,c(1:6)])[,-1]
```


## Model formulation 1

The first method employs two binary likelihoods. The first one is a BART 
survival model for the time to the first event and the second model accounts for 
the probability of the event being of type $k=1$ given that the it occurred. The 
user can fit the model by using function `crisk2.bart()`. Note that the required 
binary event indicators $y_{ijk}$ can be constructed beforehand with 
`surv.pre.bart()` and passed to the function using the `y.train` argument. 
Instead, here we pass arguments `times` and `delta` which will construct event 
indicators internally. If we are interested in predictions, the test set can be 
passed directly when fitting the model through the argument `x.test`. Arguments 
to control the MCMC sampler are: `ndpost`, `nskip`, and `keepevery`. Their 
functionality is documented in the help file for the `crisk2.bart()` function. 


```{r, results='hide'}
bart1 <- crisk2.bart(x.train = xtrain, 
                     times = hd_train$time,   # needed if ytrain is not provided
                     delta = hd_train$status, # needed if ytrain is not provided
                     x.test = xtest, 
                     sparse = FALSE,          # set equal TRUE for variable selection 
                     type = 'pbart',
                     ntree = 30, numcut = 100, 
                     ndpost = 500, nskip = 500, keepevery = 5, 
                     seed = 99)
```

Note that the code shown above does not use multi-threading, but BART permits 
its usage by using `mc.crisk2.bart()` function instead of `crisk2.bart()`.

Studying convergence diagnosis of the MCMC chains is key to ensure the 
predictions are valid. For continuous outcomes, this can be done using the 
standard deviation parameter ($\sigma$) that is estimated by BART via the 
`wbart()` function. Convergence assessment is more challenging when using BART
for survival outcomes. If there is a single event type and the model was fitted
using `surv.bart()`, then convergence can be monitored using its `yhat.train`
output. However, as `yhat.train` is high-dimensional (there is one column for
each person-period pair), examples in the BART library suggest to randomly 
select a subset of individuals and visualise convergence diagnostics associated 
to their associated estimates. Here, we will randomly select 5 individuals.

Generally, convergence would be assessed on estimates generated for the 
training data. However, if there are multiple event types and `crisk2.bart()` 
was used to fit the model, `yhat.train` is not present in the output provided
by the current implementation (the same occurs for `crisk.bart()`). 
Here we explore two possible strategies.

First, we use the posterior predictive distribution for the test dataset which 
was calculated based on MCMC draws generated for the training dataset. 
The following code visualises MCMC draws obtained for `yhat.test` oftained for
the first three individuals in the test set (first time-period only)

```{r}
n.times <- length(bart1$times)
# Select 5 random individuals
set.seed(10)
aux <- sample(seq_len(nrow(hd_test)), size = 5)

for(i in aux) {
  plot(bart1$yhat.test[ , (i-1)*n.times + 1], type = "l")
}
```

We can use Geweke diagnostics to assess converge across all time-points for our
randomly selected individuals. 

```{r}
# code adapted from the example in `demo("geweke.lung.surv.bart", package = "BART")`
mycol <- 0
for(i in aux) {
  mycol <- mycol + 1
  # selects samples for individual i across all time-points
  post.mcmc <- bart1$yhat.test[ , (i-1)*n.times + seq_len(n.times)]
  # calculates the geweke diagnostic
  z <- gewekediag(post.mcmc)$z
  # to set the limits in the plot below
  y <- max(c(4, abs(z)))

  ## plot the z scores vs. time for each patient
  if(i==aux[1]) plot(bart1$times, z, ylim=c(-y, y), type='l',
                  xlab='t', ylab='z')
      else lines(bart1$times, z, type='l', col = mycol)
  lines(bart1$times, rep(1.96, n.times), type='l', lty=2)
  lines(bart1$times, rep(-1.96, n.times), type='l', lty=2)
}
```

Alternatively, when contacting the authors to ask their advice about this, they 
suggested an alternative strategy. This requires fitting a new model, using 
the training data as a test set (i.e. `x.test = xtrain`). The code required to
this is provided below. 

```{r, results='hide', eval = FALSE}
bart1.chains <- crisk2.bart(x.train = xtrain, 
                     times = hd_train$time,   # needed if ytrain is not provided
                     delta = hd_train$status, # needed if ytrain is not provided
                     x.test = xtrain,         
                     sparse = FALSE,          # set equal TRUE for variable selection 
                     type = 'pbart',
                     ntree = 30, numcut = 100, 
                     ndpost = 500, nskip = 500, keepevery = 5, 
                     seed = 99)
```

Similar plots for the Geweke diagnostic criteria can be then generated using 
the code above, replacing `bart1` by `bart1.chains`. 

This, however, can be very time consuming --- particularly for large datasets or
when the number of unique event times is large. Moreover, this would assess MCMC
convergence in a different chain to what is contained in `bart1`. Another option 
would be to use the `predict()` function, using the training dataset as test
data. This allows us to obtain estimates (e.g. `surv.test.mean`) for
individuals in the training set. 

```{r, cache.lazy = FALSE}
# prepare the data to the format required by BART
# The training data is used as test dataset
pre <- surv.pre.bart(x.train = xtrain, x.test = xtrain,
                     times = hd_train$time, delta = hd_train$status)
# generate predictions using 
pred.train <- predict(bart1, newdata = pre$tx.test, newdata2 = pre$tx.test)
```

As before, the Geweke diagnostic criteria can be applied:

```{r}
# code adapted from the example in `demo("geweke.lung.surv.bart", package = "BART")`
mycol <- 0
for(i in aux) {
  mycol <- mycol + 1
  # selects samples for individual i across all time-points
  post.mcmc <- pred.train$yhat.test[ , (i-1)*n.times + seq_len(n.times)]
  # calculates the geweke diagnostic
  z <- gewekediag(post.mcmc)$z
  # to set the limits in the plot below
  y <- max(c(4, abs(z)))

  ## plot the z scores vs. time for each patient
  if(i==aux[1]) plot(pred.train$times, z, ylim=c(-y, y), type='l',
                  xlab='t', ylab='z')
      else lines(pred.train$times, z, type='l', col = mycol)
  lines(pred.train$times, rep(1.96, n.times), type='l', lty=2)
  lines(pred.train$times, rep(-1.96, n.times), type='l', lty=2)
}
```

In both cases, we notice that the Geweke statistics exceed the $95\%$ limits 
several times, suggesting the chains have not converged. Thus, the MCMC should 
be run for a longer number of iterations to obtain valid estimates. In practice,
this means that `nskip` should be increased. However, to avoid long running
times when compiling this vignette, this is left as an exercise to the reader.

The remaining of this vignette will continue as if the sampler had converged. 

CIFs for the subjects in the test set can be obtained through `cif.test.mean()`. 
This provides the posterior mean across MCMC samples. In addition, credible 
intervals can be computed from the samples saved in `cif.test`. 
First, we re-organised the predicted CIF for cause 1 for the test dataset. The 
constructed matrix contains one row per subject and the columns correspond to 
the unique time points at which it is evaluated. Second, we compute 95% 
credible intervals.

```{r}
cif.pred <- matrix(bart1$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )

# Compute 95% credible intervals and put in matrix format:
cif.025 <- apply(bart1$cif.test, 2, quantile, probs = 0.025) 
cif.025 <- matrix(cif.025, nrow=nrow(xtest), byrow = TRUE)
cif.975 <- apply(bart1$cif.test, 2, quantile, probs = 0.975) 
cif.975 <- matrix(cif.975, nrow=nrow(xtest), byrow = TRUE)
```

We show CIF curves for the first (red) and second (blue) individuals in the test 
set along with its corresponding credible intervals:

```{r, fig.dim=c(6,4)}
par(mar = c(4, 4, 2, 0.1))

plot(bart1$times,
     cif.pred[1,],
     type = "l",
     col = "red",
     ylim = c(0, 0.6),
     xlab = "Time (years)",
     ylab = "Cumulative incidence")
points(bart1$times,cif.025[1,], col = "red", type ='s', lwd = 1, lty = 2)
points(bart1$times,cif.975[1,], col = "red", type = 's', lwd = 1, lty = 2)
lines(bart1$times,  cif.pred[2,], col="blue")
points(bart1$times,cif.025[2,], col = "blue", type ='s', lwd = 1, lty = 2)
points(bart1$times,cif.975[2,], col = "blue", type = 's', lwd = 1, lty = 2)
legend("bottomright", legend = c("Patient 1", "Patient 2"),
       lty = c(1,1), col = c("red", "blue"))
```

Similar to other approaches, BART permits to do predictions of the CIF at
a specific time point (e.g. $t=5$ years) for a new dataset (e.g. `hd_test`). 
Note that predictions are only provided for time-points present in the training
dataset. Below, we show results for the first 5 subjects in the test set.

```{r}
BART1.pred <- matrix(bart1$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )
BART1.pred[1:5, which(bart1$times == 5)]
```

Note also that if a new test dataset is available, one can do predictions 
afterwards by making a call to the `predict.crisk2bart()` function. For instance:

```{r}
pre <- surv.pre.bart(x.train=xtrain, x.test=xtest, 
                     times=hd_train$time, 
                     delta =hd_train$status)

bart1.pred <- predict(bart1, newdata=pre$tx.test, newdata2=pre$tx.test)

# Same results are obtained if the same test dataset is used
cif.pred[1:4, 1:4]
matrix(bart1.pred$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )[1:4, 1:4]
```

### DART

It is possible to employ a sparse Dirichlet prior for variable selection 
(DART model). This will help us to determine variable importance. In order to 
fit such model we use again `crisk2.bart()` function and set the `sparse` 
argument equal to `TRUE`.

```{r, results='hide'}
dart1 <- crisk2.bart(x.train = xtrain, 
                     times = hd_train$time,   # needed if ytrain is not provided
                     delta = hd_train$status, # needed if ytrain is not provided
                     x.test = xtest, 
                     sparse = TRUE,          # set equal TRUE for variable selection 
                     type = 'pbart',
                     ntree = 30, numcut = 100, 
                     ndpost = 500, nskip = 500, keepevery = 5, 
                     seed = 99)  
```

The output of the function is the same as discussed in the previous section. 
For simplicity, we have not evaluated convergence here, but the approach 
described above could be applied. CIF estimates can also be obtained as shown
before. 

Here, we illustrate the new functionality provided by DART in terms of variable
selection. The plot below shows the estimated marginal posterior probabilities 
of inclusion associated to each input covariates:

```{r, fig.dim=c(6,4)}
dart1$varprob.mean[-1]

plot(dart1$varprob.mean[-1], 
     ylab='Selection Probability', 
     ylim=c(0, 1))
P <- ncol(xtrain)    # use to set thereshold probability for each covariate
abline(h = 1/P, lty = 2)

dart1$varprob.mean[-1] > 1/P
```

According to the plot above only age and treatment are relevant (this assumes
a $1/P$ threshold). However, note that these results should be interpreted with
caution as convergence diagnostics have not been applied. 


As before, predictions of the CIF at $t=5$ years for the test dataset 
(`hd_test`) can also be obtained.

```{r}
DART1.pred <- matrix(dart1$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )
DART1.pred[1:5, which(dart1$times == 5)]
```

## Model formulation 2

This approach is discussed in Section 3.2 of @Bart2020 and fits also two 
separate BART probit models. The first model, corresponds to the conditional 
probability of a cause $k=1$ event at a given time. The second, models the 
conditional probability of an event of type $k=2$ at a specific time, given that 
the individual is still at risk and did not experience a type $k=1$ event.
In this case, the model is fit with function `crisk.bart()`:

```{r, results='hide'}
bart2 <- crisk.bart(x.train = xtrain, times = hd_train$time, 
                    delta=hd_train$status,
                    x.test = xtest, 
                    sparse=FALSE, 
                    type='pbart',
                    ntree = 30, numcut = 100, 
                    ndpost = 500, nskip = 500, keepevery = 5, 
                    seed=99)
# Parallel computation of the model is available using mc.crisk.bart
```

The output is the same as in model formulation 1 and an analysis of convergence
can be performed as before. For simplicity, this is excluded from this example.

**NOTE: we have used a small number of iterations and the MCMC did not appear to
converge. Therefore, the results shown below need to be interpreted with caution.**


As before, we employ `cif.test.mean` to obtain CIFs for the subjects in the 
test set along with 95% credible intervals.

```{r}
cif2.pred <- matrix(bart2$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )

# Compute 95% credible intervals and put in matrix format:
cif2.025 <- apply(bart2$cif.test, 2, quantile, probs = 0.025) 
cif2.025 <- matrix(cif2.025, nrow=nrow(xtest), byrow = TRUE)
cif2.975 <- apply(bart2$cif.test, 2, quantile, probs = 0.975) 
cif2.975 <- matrix(cif2.975, nrow=nrow(xtest), byrow = TRUE)
```

We show CIF curves for patient 1 (red) and 2 (blue) in the test set along with 
its corresponding credible intervals:
```{r, fig.dim=c(6,4)}
par(mar = c(4, 4, 2, 0.1))

plot(bart2$times,
     cif2.pred[1,],
     type = "l",
     col = "red",
     ylim = c(0, 1),
     xlab = "Time (years)",
     ylab = "Cumulative incidence")
points(bart2$times,cif2.025[1,], col = "red", type ='s', lwd = 1, lty = 2)
points(bart2$times,cif2.975[1,], col = "red", type = 's', lwd = 1, lty = 2)
lines(bart2$times,  cif2.pred[2,], col="blue")
points(bart2$times,cif2.025[2,], col = "blue", type ='s', lwd = 1, lty = 2)
points(bart2$times,cif2.975[2,], col = "blue", type = 's', lwd = 1, lty = 2)
```


Below, we show predictions for 5 subjects in a new dataset (e.g. `hd_test`) at 
a specific time point (e.g. $t=5$ years). Note that predictions are only 
provided for time-points present in the training set.

```{r}
BART2.pred <- matrix(bart2$cif.test.mean, nrow=nrow(xtest), byrow = TRUE )
BART2.pred[1:5, which(bart2$times == 5)]
```


Note that the estimates of the 2 model formulations differ. The next plots 
compare CIFs for patient 1, under the 2 different models to show such differences:

```{r}
plot(bart1$times,cif.pred[1,],lwd=2,type="l", col="#009999", ylim=c(0,1), 
     main="Comparison of different formulations for test patient 1",
     xlab="Time", ylab="CIF(t)")
lines(bart1$times, cif2.pred[2,], col="#FFCC00",lwd=3)
      legend("topright", 
       legend=c("Formulation 1", "Formulation 2"),
       col=c("#009999", "#FFCC00"), lty=c(1,1))
```

# Storing predictions

In order to allow comparison with the predictions generated by other methods,
we save the predictions obtained in this vignette.  

```{r save_pred}
pred_BART <- data.frame("testID" = seq_len(nrow(hd_test)),
                      "crisk2.bart" = BART1.pred[, which(bart1$times == 5)],
                      "crisk2.bart_dart" = DART1.pred[, which(dart1$times == 5)],
                      "crisk.bart" = BART2.pred[, which(bart1$times == 5)])
if (file.exists("/.dockerenv")){ # running in docker
  write.csv(pred_BART, "/Predictions/pred_BART.csv", row.names = FALSE)
} else {
  write.csv(pred_BART, "../Predictions/pred_BART.csv", row.names = FALSE)
}
```


# References

<div id="refs"></div>

# Session Info
```{r}
sessionInfo()
```