/
fitting.Rmd
379 lines (280 loc) · 12.3 KB
/
fitting.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
---
title: "Fitting and selecting models"
author: "Kaique S Alves"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Fitting and selecting models}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: inline
---
## Base functions
`epifitter` provides functions for the fit of (non-flexible) two-parameter population dynamics models to disease progress curve (DPC) data: exponential, monomolecular, logistic and Gompertz.
The goal of fitting these models to DPCs is to estimate numerically meaningful parameters: `y0` that represents primary inoculum and `r`, representing the apparent infection rate. Hence, the importance of choosing the model that best describe the epidemics to better understand its proterties and compare epidemics.
Two approaches can be used to obtain the parameters:
- **Linear regression** models that are fitted to a specific transformation of the disease data according to each of the models.
- **Non-linear regression** models fitted to original data
Both approaches are available in *epifitter*. The simplest way to fit these models to single epidemic data is using the `fit_lin()` and `fit_nlin()` functions. For the latter, the alternative `fit_nlin2()` allows estimating a third parameter, the upper asymptote, when maximum disease intensity is not close to 100%. The `fit_multi()` function is the most flexible and allows to fit all models to multiple epidemic datasets.
First, we need to load the packages we'll need for this tutorial.
```{r message=FALSE, warning=FALSE}
library(epifitter)
library(ggplot2)
library(dplyr)
library(magrittr)
library(cowplot)
```
## Basic usage
### Dataset
To use *epifitter* data at least two variables are needed, one representing the time of each assessment of disease intensity during the course of the epidemics and the other representing a disease intensity variable as proportion (e.g. incidence, severity, prevalence). For the case of designed experiments with replicates, a third variable is needed.
Let's simulate a DPC dataset for one epidemics measured at replicated plots. The simulated data resemble a polyciclic epidemics of sigmoid shape. We can do that using the *epifitter* `sim_logistic()` function of epifitter (more about `?sim_logistic` here).
```{r}
dpcL <- sim_logistic(
N = 100, # duration of the epidemics in days
y0 = 0.01, # disease intensity at time zero
dt = 10, # interval between assessments
r = 0.1, # apparent infection rate
alpha = 0.2, # level of noise
n = 7 # number of replicates
)
```
Let's give a look at the simulated dataset.
```{r}
head(dpcL)
```
The `dpc_L` object generated using `sim_logistic()` is a dataframe with four columns. The `y` variable is a vector for the disease intensity as proportion (0 \< y \< 1). To facilitate visualization, let's make a plot using the `ggplot` function of the *ggplot2* package.
```{r}
ggplot(
dpcL,
aes(time, y,
group = replicates
)
) +
geom_point(aes(time, random_y), shape = 1) + # plot the replicate values
geom_point(color = "steelblue", size = 2) +
geom_line(color = "steelblue") +
labs(
title = "Simulated 'complete' epidemics of sigmoid shape",
subtitle = "Produced using sim_logistic()"
)+
theme_minimal_hgrid()
```
### Linear regression using `fit_lin()`
The `fit_lin()` requires at least the `time` and `y` arguments. In the example, we will call the `random_y` which represents the replicates. A quick way to call these variables attached to the dataframe is shown below.
```{r}
f_lin <- fit_lin(
time = dpcL$time,
y = dpcL$random_y
)
```
`fit_lin()` outputs a list object which contains several elements. Three elements of the list are shown by default: stats of model fit, Infection rate and Initial Inoculum
```{r}
f_lin
```
#### Model fit stats
The `Stats` element of the list shows how each of the four models predicted the observations based on three measures:
- Lin's concordance correlation coefficient `CCC` (Lin 2000), a measure of agreement that takes both bias and precision into account
- Coefficient of determination `r_squared` (R<sup>2</sup>), a measure of precision
- Residual standard deviation `RSE` for each model.
The four models are sorted from the high to the low `CCC`. As expected because the `sim_logistic` function was used to create the synthetic epidemic data, the the *Logistic* model was superior to the others.
#### Model coefficients
The estimates, and respective standard error and upper and lower 95% confidence interval, for the two coefficients of interest are shown in the `Infection rate` and `Initial inoculum` elements. For the latter, both the back-transformed (estimate) and the linearized estimate are shown.
#### Global stats
The element `f_lin$stats_all` provides a wide format dataframe with all the stats for each model.
```{r}
f_lin$stats_all
```
#### Model predictions
The predicted values are stored as a dataframe in the `data` element called using the same `$` operator as above. Both the observed and (`y`) and the back-transformed predictions (`predicted`) are shown for each model. The linearized value and the residual are also shown.
```{r}
head(f_lin$data)
```
#### Plot of predictions
The `plot_fit()` produces, by default, a panel of plots depicting the observed and predicted values by all fitted models. The arguments `pont_size` and `line_size` that control for the size of the dots for the observation and the size of the fitted line, respectively.
```{r}
plot_lin <- plot_fit(f_lin,
point_size = 2,
line_size = 1
)
# Default plots
plot_lin
```
#### Publication-ready plots
The plots are *ggplot2* objects which can be easily customized by adding new layers that override plot paramaters as shown below. The argument `models` allows to select the models(s) to be shown on the plot. The next plot was customized for the logistic model.
```{r}
# Customized plots
plot_fit(f_lin,
point_size = 2,
line_size = 1,
models = "Logistic")+
theme_minimal_hgrid(font_size =18) +
scale_x_continuous(limits = c(0,100))+
scale_color_grey()+
theme(legend.position = "none")+
labs(
x = "Time",
y = "Proportion disease "
)
```
### Non-linear regression
#### Two-parameters
The `fit_nlin()` function uses the Levenberg-Marquardt algorithm for least-squares estimation of nonlinear parameters. In addition to time and disease intensity, starting values for `y0` and `r` should be given in the `starting_par` argument. The output format and interpretation is analogous to the `fit_lin()`.
> NOTE: If you encounter error messages saying "matrix at initial parameter estimates", try to modify the starting values for the parameters to solve the problem.
```{r message=FALSE, warning=FALSE}
f_nlin <- fit_nlin(
time = dpcL$time,
y = dpcL$random_y,
starting_par = list(y0 = 0.01, r = 0.03)
)
f_nlin
```
We can check the results using `plot_fit`.
```{r}
plot_fit(f_nlin) +
theme_minimal_hgrid()#changing plot theme
```
### Estimating `K` (maximum disease)
In many epidemics the last measure (final time) of a DPC does not reach the maximum intensity and, for this reason, estimation of maximum asymptote (carrying capacity `K`) may be necessary. The `fin_lin2()` provides an estimation of `K` in addition to the estimates provided by `fit_lin()`.
Before demonstrating the function, we can transform our simulated data by creating another variable with `y_random2` with maximum about 0.8 (80%). Simplest way is to multiply the `y_random` by 0.8.
```{r}
dpcL2 = dpcL %>%
mutate(random_y = random_y * 0.8)
```
Then we run the `fit_nlin2()` for the new dataset.
```{r message=FALSE, warning=FALSE}
f_nlin2 <- fit_nlin2(
time = dpcL2$time,
y = dpcL2$random_y,
starting_par = list(y0 = 0.01, r = 0.2, K = 0.6)
)
f_nlin2
plot_fit(f_nlin2)
```
> NOTE: The exponential model is not included because it doesn't have a maximum asymptote. The estimated value of `K` is the expected 0.8.
### Fit models to multiple DPCs
Most commonly, there are more than one epidemics to analyse either from observational or experimental studies. When the goal is to fit a common model to all curves, the `fit_multi()` function is in hand. Each DPC needs an unique identified to further combined in a single data frame.
#### Data
Let's use the `sim_` family of functions to create three epidemics and store the data in a single `data.frame`. The Gompertz model was used to simulate these data. Note that we allowed to the `y0` and `r` parameter to differ the DPCs. We should combine the three DPCs using the `bind_rows()` function and name the identifier (`.id`), automatically created as a character vector, for each epidemics as 'DPC'.
```{r}
epi1 <- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.1, alpha = 0.4, n = 4)
epi2 <- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.12, alpha = 0.4, n = 4)
epi3 <- sim_gompertz(N = 60, y0 = 0.003, dt = 5, r = 0.14, alpha = 0.4, n = 4)
multi_epidemic <- bind_rows(epi1,
epi2,
epi3,
.id = "DPC"
)
head(multi_epidemic)
```
We can visualize the three DPCs in a same plot
```{r}
p_multi <- ggplot(multi_epidemic,
aes(time, y, shape = DPC, group = DPC))+
geom_point(size =2)+
geom_line()+
theme_minimal_grid(font_size =18) +
labs(
x = "Time",
y = "Proportion disease "
)
p_multi
```
Or use `facet_wrap()` for ploting them separately.
```{r fig.height=10, fig.width=6}
p_multi +
facet_wrap(~ DPC, ncol = 1)
```
#### Using `fit_multi()`
`fit_multi()` requires at least four arguments: time, disease intensity (as proportion), data and the curve identifier (`strata_cols`). The latter argument accepts one or more strata include as `c("strata1",strata2")`. In the example below, the stratum name is `DPC`, the name of the variable.
By default, the linear regression is fitted to data but adding another argument `nlin = T`, the non linear regressions is fitted instead.
```{r}
multi_fit <- fit_multi(
time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC"
)
```
All parameters of the list can be returned using the \$ operator as below.
```{r}
head(multi_fit$Parameters)
```
Similarly, all data can be returned.
```{r}
head(multi_fit$Data)
```
If nonlinear regression is preferred, the `nlim` argument should be set to `TRUE`
```{r}
multi_fit2 <- fit_multi(
time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC",
nlin = TRUE)
head(multi_fit2$Parameters)
```
#### Want to estimate K?
If you want to estimate `K`, set `nlin = TRUE` and `estimate_K = TRUE`.
> NOTE: If you do not set both arguments `TRUE`, `K` will not be estimated, because `nlin` defaut is `FALSE`. Also remember that when estimating K, we don't fit the *Exponential* model.
```{r}
multi_fit_K <- fit_multi(
time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC",
nlin = T,
estimate_K = T
)
```
```{r}
head(multi_fit_K$Parameters)
```
### Graphical outputs
Use [`ggplot2`](https://ggplot2.tidyverse.org/) to produce elegant data visualizations of models curves and the estimated parameters.
#### DPCs and fitted curves
The original data and the predicted values by each model are stored in `multi_fit$Data`. A nice plot can be produced as follows:
```{r}
multi_fit$Data %>%
ggplot(aes(time, predicted, color = DPC)) +
geom_point(aes(time, y), color = "gray") +
geom_line(size = 1) +
facet_grid(DPC ~ model, scales = "free_y") +
theme_minimal_hgrid()+
coord_cartesian(ylim = c(0, 1))
```
Using the *dplyr* function `filter` only the model of interest can be chosen for plotting.
```{r}
multi_fit$Data %>%
filter(model == "Gompertz") %>%
ggplot(aes(time, predicted, color = DPC)) +
geom_point(aes(time, y),
color = "gray",
size = 2
) +
geom_line(size = 1.2) +
theme_minimal_hgrid() +
labs(
x = "Time",
y = "Disease Intensity"
)
```
#### Apparent infection rate
The `multi_fit$Parameters` element is where all stats and parameters as stored. Let's plot the estimates of the apparent infection rate.
```{r}
multi_fit$Parameters %>%
filter(model == "Gompertz") %>%
ggplot(aes(DPC, r)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
width = 0,
size = 1
) +
labs(
x = "Time",
y = "Apparent infection rate"
) +
theme_minimal_hgrid()
```
## References
Lin L (2000). A note on the concordance correlation coefficient. Biometrics 56: 324 - 325.