-
Notifications
You must be signed in to change notification settings - Fork 5
/
Getting-Started.Rmd
238 lines (191 loc) · 7.16 KB
/
Getting-Started.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
---
title: "Getting-Started"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Getting-Started}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
old <- options("digits" = 3)
data.table::setDTthreads(2)
```
This package implements methods to analyse a sequence of target trials.
The steps to do this are:
1. create a data set with all the required variables
2. fit models for censoring and calculate inverse probability weights
3. expand the data set, with records for each eligible patient for each trial period
4. fit models for outcomes
The package provides two options for conducting the analysis: an [all-in-one function](#initiators) and a
set of more [flexible functions](#flexible).
## Required Data
To get started a longitudinal dataset must be created containing:
* time period
* patient identifier
* treatment indicator
* outcome indicator
* censoring indicator
* eligibility indicator for a trial starting in each time period
* other covariates relating to treatment, outcome, or informative censoring to be used in the models for weights or the outcome
An example data set is included to demonstrate the format:
```{r data}
library(TrialEmulation)
# Prepare the example data
data("trial_example")
# Set columns to factors as necessary
trial_example$catvarA <- as.factor(trial_example$catvarA)
trial_example$catvarB <- as.factor(trial_example$catvarB)
head(trial_example)
```
## All-in-one analysis {#initiators}
There is an all-in-one function `initators()` which does all of the steps with one function call.
This is similar to the [INITIATORS SAS macro](https://causalab.sph.harvard.edu/software/).
Call the `initiators()` function to run the complete analysis:
```{r initiators}
result <- initiators(
data = trial_example,
id = "id",
period = "period",
eligible = "eligible",
treatment = "treatment",
estimand_type = "ITT",
outcome = "outcome",
model_var = "assigned_treatment",
outcome_cov = c("catvarA", "catvarB", "nvarA", "nvarB", "nvarC"),
use_censor_weights = FALSE
)
```
```{r initiators_summary}
summary(result)
```
By default, this returns the final `glm` model object and the results using the sandwich estimator.
```{r init_summary}
summary(result$model)
```
Tidy summaries of the robust models are available.
```{r init_robust_summary}
print(result$robust$summary)
```
Also the sandwich robust variance-covariance matrix.
```{r init_robust_matrix}
# only print the first columns
head(result$robust$matrix, c(17, 4))
```
## Flexible Analysis {#flexible}
To gain complete control over the analysis and to inspect the intermediate objects, it can
be useful to run the the data preparation and modelling steps separately.
This also allows for processing of very large data sets, by doing the data preparation in chunks and then
sampling from the expanded trial data.
```{r temp_dir}
# for the purposes of the vignette, we use a temporary directory, however it may be useful to use a permanent
# location in order to inspect the outputs later
working_dir <- file.path(tempdir(TRUE), "trial_emu")
if (!dir.exists(working_dir)) dir.create(working_dir)
```
```{r data_preparation}
prep_data <- data_preparation(
data = trial_example,
id = "id",
period = "period",
eligible = "eligible",
treatment = "treatment",
outcome = "outcome",
outcome_cov = ~ catvarA + catvarB + nvarA + nvarB + nvarC,
data_dir = working_dir,
save_weight_models = TRUE,
estimand_type = "PP",
pool_cense = "none",
use_censor_weights = FALSE,
chunk_size = 500,
separate_files = TRUE,
switch_n_cov = ~ nvarA + nvarB,
quiet = TRUE
)
```
Use `summary` to get an overview of the result.
```{r}
summary(prep_data)
```
For more information about the weighting models, we can inspect them individually
```{r weight_summaries}
prep_data$switch_models$switch_n0
```
If `save_weight_models = TRUE,` the full model objects are saved in `working_dir`, so you can inspect the data used in
those models and further investigate how well those models fit.
```{r weight_files}
list.files(working_dir, "*.rds")
# The path is stored in the saved object
switch_n0 <- readRDS(prep_data$switch_models$switch_n0$path)
summary(switch_n0)
hist(switch_n0$fitted.values, main = "Histogram of weights from model switch_n0")
```
We also see the expanded trial files:
```{r trial_files}
head(prep_data$data)
```
Each of these csv files contains the data for the trial starting at period `_i`.
To create a manageable dataset for the final analysis we can sample from these trials.
Here we sample 10% of the patients without an event at each follow-up time in each trial. All
observations with events are included.
```{r sample}
sampled_data <- case_control_sampling_trials(prep_data, p_control = 0.1)
str(sampled_data)
```
Before proceeding with the modelling, it is possible to manipulate and derive new variables and adjust factor levels
in the `data.frame`. You can also specify transformations in a formula to `outcome_cov`,
`include_followup_time` or `include_trial_period`, such as `~ ns(trial_period)` or
`~ I(nvarA^2) + nvarC > 50 + catvarA:nvarC`.
It is also possible to specify formulas in `trial_msm()` to include, for example, splines with
`ns()`.
Now we can fit the model with the `trial_msm()` function. Since we have sampled from the data,
we should make use of the `use_sample_weights` option to get correct survival estimates.
```{r modelling}
model_result <- trial_msm(
data = sampled_data,
outcome_cov = c("catvarA", "catvarB", "nvarA", "nvarB", "nvarC"),
model_var = "assigned_treatment",
glm_function = "glm",
use_sample_weights = TRUE
)
```
The result is the same type as the previous result from the simple `initiators` function,
containing the `glm` object and the sandwich results.
```{r modelling_result}
summary(model_result)
```
```{r glm_summary}
summary(model_result$model)
```
We can also use this object to predict cumulative incidence curves and use these for treatment comparisons.
Here we use the patients who were in the first trial as the target population. To get the data in the right format,
we can use the `data_template` returned by `data_preparation()`.
```{r predict}
new_data <- data.table::fread(file.path(working_dir, "trial_1.csv"))
new_data <- rbind(data.table::as.data.table(prep_data$data_template), new_data)
model_preds <- predict(model_result, predict_times = c(0:40), newdata = new_data, type = "cum_inc")
```
The `predict` function returns a list of 3 data frames: predictions for assigned treatment 0, for assigned treatment 1
and for the difference. It possible to change the type of predicted values, either cumulative incidence or survival.
```{r}
plot(
model_preds$difference$followup_time,
model_preds$difference$cum_inc_diff,
ty = "l", ylab = "Cumulative Incidence Difference",
xlab = "Follow-up Time",
ylim = c(-0.15, 0.05)
)
lines(model_preds$difference$followup_time, model_preds$difference$`2.5%`, lty = 2)
lines(model_preds$difference$followup_time, model_preds$difference$`97.5%`, lty = 2)
```
```{r cleanup, echo=FALSE}
# clean up
unlink(working_dir, recursive = TRUE)
```
```{r, include=FALSE}
options(old)
data.table::setDTthreads(NULL)
```