---
title: "Semiparametric Latent Variable Modeling"
output:
  rmarkdown::html_vignette:
    fig_width: 6
    fig_height: 4
bibliography: ../inst/REFERENCES.bib
link-citations: yes
vignette: >
  %\VignetteIndexEntry{Semiparametric Latent Variable Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```r
library(galamm)
library(gamm4)
```
This vignette describes how to use `galamm` to estimate latent variable models with smooth terms, or equivalently, generalized additive mixed models with factor structures. The examples are based on Sections 4 and 5 of @sorensenLongitudinalModelingAgeDependent2023, but since we cannot share those data, we instead use somewhat simpler simulated datasets. We will gradually add complexity, starting with a simple generalized additive mixed model. Please refer to the [introductory vignette](https://lcbc-uio.github.io/galamm/articles/galamm.html) for an overview of the statistical models.
## Generalized Additive Mixed Models
We start by showing how `galamm` can be used to estimate generalized additive mixed models.
### Gaussian Responses
The `cognition` dataset contains simulated data with measurements of abilities in three cognitive domains.
```r
head(cognition)
#> id domain x timepoint item trials y
#> 1 1 1 0.06475113 1 11 1 0.16788973
#> 2 1 1 0.06475113 1 12 1 0.08897838
#> 3 1 1 0.06475113 1 13 1 0.03162123
#> 4 1 1 0.15766278 2 11 1 0.46598362
#> 5 1 1 0.15766278 2 12 1 0.84564656
#> 6 1 1 0.15766278 2 13 1 0.20549872
```
For this first example, we focus only on the first item measured for the first domain.
```r
dat <- subset(cognition, domain == 1 & item == "11")
```
Each subject in this dataset has been measured eight times, and we can plot the measurements as follows:
```r
plot(dat$x, dat$y, type = "n", xlab = "x", ylab = "y")
for(i in unique(dat$id)) {
dd <- dat[dat$id == i, ]
lines(dd$x, dd$y, col = "gray")
}
points(dat$x, dat$y, pch = 20, lwd = .05)
```
![Plot of data for domain 1 and item 11.](semiparametric-spaghetti-plot-1.png)
We use a generalized additive mixed model with random intercepts per subject to estimate the function relating $x$ to $y$. In terms of the model framework outlined in the [introductory vignette](https://lcbc-uio.github.io/galamm/articles/galamm.html), we model the $i$th response from the $j$th subject with
$$
y_{ij} = f(x_{ij}) + \eta_{j} + \epsilon_{ij}
$$
where $f(x_{ij})$ is a smooth function to be estimated, $\eta_{j} \sim N(0, \psi)$ is a random intercept, and $\epsilon_{ij} \sim N(0, \phi)$ is a residual term.
This model can be estimated using `gamm4` as follows:
```r
mod_gamm4 <- gamm4(y ~ s(x), random = ~ (1 | id), data = dat, REML = FALSE)
```
The package `gamm4` uses `lme4` to fit the underlying model, and the resulting model has two components. `mod_gamm4$mer` contains the mixed model representation, whereas in `mod_gamm4$gam` the fixed and random effects corresponding to spline coefficients have been converted into single smooth terms. We can look at the model summary for each:
```r
summary(mod_gamm4$mer)
#> Linear mixed model fit by maximum likelihood ['lmerMod']
#>
#> AIC BIC logLik deviance df.resid
#> 3025.2 3052.1 -1507.6 3015.2 1595
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.93755 -0.65215 0.00612 0.62654 3.14289
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> id (Intercept) 0.8551 0.9247
#> Xr s(x) 2.0341 1.4262
#> Residual 0.2501 0.5001
#> Number of obs: 1600, groups: id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> X(Intercept) 1.26938 0.06657 19.067
#> Xs(x)Fx1 -0.15814 0.20156 -0.785
#>
#> Correlation of Fixed Effects:
#> X(Int)
#> Xs(x)Fx1 0.000
summary(mod_gamm4$gam)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> y ~ s(x)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.26938 0.06657 19.07 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x) 6.681 6.681 324.9 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.253
#> lmer.REML = 3015.2 Scale est. = 0.25012 n = 1600
```
We can also plot the estimated smooth term:
```r
plot(mod_gamm4$gam)
```
![Smooth term estimated by gamm4.](semiparametric-gaussian-gamm4-smooth-1.png)
In contrast, invoking the `plot` function on the mixed model part gives us a diagnostic plot.
```r
plot(mod_gamm4$mer)
```
![Diagnostic plot for gamm4 model.](semiparametric-gaussian-gamm4-diagnostic-1.png)
With `galamm` we use similar arguments, but the `random` specification is now part of the model formula.
```r
mod <- galamm(y ~ s(x) + (1 | id), data = dat)
```
As opposed to `gamm4`, `galamm` gives a single summary. As can be seen, smooth terms are reported both among the random effects and on a separate line under the header "Approximate significance of smooth terms:". Reassuringly, the results from fitting the model with `gamm4` and with `galamm` are essentially identical, even though the two packages use somewhat different computational algorithms.
```r
summary(mod)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ s(x) + (1 | id)
#> Data: dat
#>
#> AIC BIC logLik deviance df.resid
#> 3025.2 3052.1 -1507.6 3015.2 1595
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.93755 -0.65215 0.00612 0.62654 3.14290
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> id (Intercept) 0.8551 0.9247
#> Xr s(x) 2.0346 1.4264
#> Residual 0.2501 0.5001
#> Number of obs: 1600, groups: id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.2694 0.06657 19.0672 4.731e-81
#> s(x)Fx1 -0.1582 0.20236 -0.7818 4.343e-01
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x) 6.681 6.681 324.9 <2e-16
```
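As a quick numerical check of this agreement, we can compare the maximized log-likelihoods directly. This is a minimal sketch, assuming both fitted objects from above are still in the workspace and that a `logLik()` method is available for the `galamm` fit (as its summary output suggests):
```r
# Compare maximized log-likelihoods; per the summaries above, both
# should be about -1507.6 despite the different algorithms used.
logLik(mod_gamm4$mer)
logLik(mod)
```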
The `plot` function now gives us a diagnostic plot, which is almost identical to the one produced from the mixed model part of the `gamm4` model.
```r
plot(mod)
```
![Diagnostic plot for model fitted with galamm.](semiparametric-gaussian-gamm-diagnostic-1.png)
In order to plot the smooth term, we use `plot_smooth`.
```r
plot_smooth(mod)
```
![Smooth term estimated with galamm.](semiparametric-gaussian-gamm-smooth1-1.png)
The `plot_smooth` function is a thin wrapper around the `plot.gam` function provided by the `mgcv` package [@woodGeneralizedAdditiveModels2017a]. This means that the arguments accepted by `plot.gam` can also be used here, as seen in the examples below:
```r
plot_smooth(mod,
shade = TRUE, rug = FALSE, seWithMean = TRUE,
shift = +2
)
```
![Alternative ways of visualizing the smooth term.](semiparametric-gaussian-gamm-smooth2-1.png)
```r
plot_smooth(mod, se = FALSE)
```
![Alternative ways of visualizing the smooth term.](semiparametric-gaussian-gamm-smooth2-2.png)
### Binomial Responses
In the cognition dataset, the responses for domain 2 are binomially distributed. We will use the first item to illustrate how such data can be modeled.
```r
dat <- subset(cognition, domain == 2 & item == "21")
```
Again we can fit this model using `gamm4`.
```r
mod_gamm4 <- gamm4(y ~ s(x),
random = ~ (1 | id),
data = dat, family = binomial
)
```
We can look at the summary output as before.
```r
summary(mod_gamm4$mer)
#> Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
#> Family: binomial ( logit )
#>
#> AIC BIC logLik deviance df.resid
#> 983.7 1005.2 -487.8 975.7 1596
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -8.2547 0.0786 0.1946 0.4248 0.8213
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> id (Intercept) 0.2306 0.4802
#> Xr s(x) 0.9694 0.9846
#> Number of obs: 1600, groups: id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> X(Intercept) 2.8115 0.2034 13.824 < 2e-16 ***
#> Xs(x)Fx1 -1.4110 0.4242 -3.326 0.00088 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> X(Int)
#> Xs(x)Fx1 -0.440
summary(mod_gamm4$gam)
#>
#> Family: binomial
#> Link function: logit
#>
#> Formula:
#> y ~ s(x)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.8115 0.1662 16.91 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(x) 2.14 2.14 113.9 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.116
#> glmer.ML = 908.96 Scale est. = 1 n = 1600
```
And we can plot the smooth term. The diagnostic plot is not very useful in the binomial case, so we omit it.
```r
plot(mod_gamm4$gam)
```
![Smooth term estimated by gamm4.](semiparametric-gamm4-binomial-1.png)
Again the `galamm` syntax is similar, but it puts the random effect specification into the model formula.
```r
mod <- galamm(y ~ s(x) + (1 | id), data = dat, family = binomial)
```
The estimates are very similar, although not identical. The apparent difference in deviance is due to differences in how deviance is defined: the call `deviance(mod_gamm4$mer)` gives the same value as reported in the summary of the model fitted with `galamm`.
```r
summary(mod)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ s(x) + (1 | id)
#> Data: dat
#>
#> AIC BIC logLik deviance df.resid
#> 983.7 1005.2 -487.8 908.8 1596
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -8.2237 0.0792 0.1947 0.4245 0.8226
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> id (Intercept) 0.2316 0.4813
#> Xr s(x) 0.9387 0.9688
#> Number of obs: 1600, groups: id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.808 0.1956 14.354 9.989e-47
#> s(x)Fx1 -1.406 0.4134 -3.402 6.697e-04
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(x) 2.124 2.124 114 <2e-16
```
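To see the deviance relationship mentioned above, we can evaluate the call directly; a quick check, assuming `mod_gamm4` is still in the workspace:
```r
# As noted above, this returns the same value as the deviance reported
# in the galamm summary (about 908.8), not the 975.7 shown in the
# glmer summary, because the two summaries define deviance differently.
deviance(mod_gamm4$mer)
```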
```r
plot_smooth(mod)
```
![Smooth term estimated with galamm.](semiparametric-gamm-binomial-1.png)
## Generalized Additive Mixed Models with Factor Structures
We now add factor structures to the GAMMs. These are the types of models that neither `gamm4` nor `mgcv` is able to estimate (at least not without considerable manual hacking), and where `galamm` provides new functionality.
### Gaussian Responses
To illustrate basic usage, we continue with the cognition data, but now use all items of cognitive domain 1. These responses are all conditionally normally distributed.
```r
dat <- subset(cognition, domain == 1)
head(dat)
#> id domain x timepoint item trials y
#> 1 1 1 0.06475113 1 11 1 0.16788973
#> 2 1 1 0.06475113 1 12 1 0.08897838
#> 3 1 1 0.06475113 1 13 1 0.03162123
#> 4 1 1 0.15766278 2 11 1 0.46598362
#> 5 1 1 0.15766278 2 12 1 0.84564656
#> 6 1 1 0.15766278 2 13 1 0.20549872
```
We now need a factor model to associate the underlying latent trait $\eta$ with the measurements $y_{i}$:
$$
y_{i} = \beta_{i} + \lambda_{i} \eta + \epsilon_{i}
$$
In the structural model, we have a smooth term for the relationship between $x$ and the latent trait, a random intercept $\zeta^{(2)}$ for a given timepoint within a subject, and a random intercept $\zeta^{(3)}$ for a given subject across timepoints.
$$
\eta = h(x) + \zeta^{(2)} + \zeta^{(3)}.
$$
The reduced form of the model is
$$
y_{i} = \beta_{i} + \lambda_{i} \left\{ h(x) + \zeta^{(2)} + \zeta^{(3)} \right\} + \epsilon_{i}
$$
We will use a varying-coefficient term, where $h(x)$ is interpreted as a regression coefficient for the effect of $\lambda_{i}$ on $y_{i}$, and this coefficient varies with $x$. In contrast to @hastieVaryingCoefficientModels1993 and other uses of varying-coefficient terms, however, in this case the predictor $\lambda_{i}$ is a model parameter. We have three items loading on $\eta$ and fix the first loading to 1 for identifiability, so the loading matrix is as follows:
```r
(loading_matrix <- matrix(c(1, NA, NA), ncol = 1))
#> [,1]
#> [1,] 1
#> [2,] NA
#> [3,] NA
```
We provide thin wrappers around the `s()` and `t2()` functions from `mgcv` to support factor loadings in smooth terms. The wrappers are named `sl()` and `t2l()` to avoid namespace conflicts with `mgcv` and `gamm4`, and the final letter "l" stands for "loading". In this example, we set `factor = "loading"` to specify that the smooth term should be multiplied by the loadings, which are defined by the `lambda` matrix and whose rows are identified by the variable given in `load.var = "item"`. Using `mgcv`'s `by` variable would also work in this particular case, i.e., replacing `sl(x, factor = "loading")` with `s(x, by = loading)`. However, in most cases this would lead to identifiability issues due to the way varying-coefficient terms are set up by `mgcv`, so `galamm` provides an additional `factor` argument which alleviates most of these issues.
```r
mod <- galamm(
formula = y ~ 0 + item + sl(x, factor = "loading") +
(0 + loading | id / timepoint),
data = dat,
load.var = "item",
lambda = loading_matrix,
factor = "loading"
)
```
We print the model summary below. In the data simulation the factor loadings were set to 1, 1.4, and 0.3, respectively, and they are recovered very well. Furthermore, the ground truth standard deviation at the `id` level was 1, at the `timepoint` level it was 0.5, and the residual standard deviation was 0.1. The estimates are close to these values. Real data will typically not show such a strong signal, but based on these results there are no clear indications that the model is implemented incorrectly.
```r
summary(mod)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ 0 + item + sl(x, factor = "loading") + (0 + loading | id/timepoint)
#> Data: dat
#>
#> AIC BIC logLik deviance df.resid
#> -918.2 -853.4 469.1 -938.2 4790
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.76 6.46 13.58 23.31 35.84
#>
#> Lambda:
#> loading SE
#> lambda1 1.0000 .
#> lambda2 1.3973 0.003531
#> lambda3 0.3009 0.002146
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> timepoint:id loading 0.236886 0.48671
#> id loading 0.857051 0.92577
#> Xr s(x):loading 2.030613 1.42500
#> Residual 0.009932 0.09966
#> Number of obs: 4800, groups: timepoint:id, 1600; id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error t value Pr(>|t|)
#> item11 1.2694 0.06663 19.0513 6.412e-81
#> item12 1.7788 0.09307 19.1128 1.977e-81
#> item13 0.3797 0.02019 18.8077 6.531e-79
#> s(x):loadingFx1 -0.1496 0.19977 -0.7488 4.540e-01
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x):loading 8.719 8.719 4469 <2e-16
```
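The loading estimates can also be extracted programmatically; a minimal sketch, assuming the `factor_loadings()` accessor provided by `galamm`:
```r
# Estimated factor loadings with standard errors; the simulation used
# the true values 1, 1.4, and 0.3.
factor_loadings(mod)
```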
We also plot the smooth term. Since we had a very large amount of data, there is essentially no uncertainty about the estimate.
```r
plot_smooth(mod)
```
![Smooth term for GAMM with factor structure.](semiparametric-gaussian-factor-1.png)
### Binomial Responses
We can now move on to the part of the cognition data that is conditionally binomially distributed. We consider domain 2, where each response measures success or failure in a single trial. In this case there are only two items, so we must change the lambda matrix accordingly. Other than that, and setting `family = binomial`, the model is specified as before.
```r
dat <- subset(cognition, domain == 2)
mod <- galamm(
formula = y ~ 0 + item + sl(x, factor = "loading") +
(0 + loading | id / timepoint),
data = dat,
family = binomial,
load.var = "item",
lambda = matrix(c(1, NA), ncol = 1),
factor = "loading"
)
```
The summary is shown below. The factor loading $\lambda_{2} = 2$ was used when simulating the data, and taking the uncertainty into account, our estimate covers the true value well. Also note that the variation between individuals (group `id`) and the variation between timepoints within individuals (group `timepoint:id`) get lumped together at the `id` level: the estimated variation at the `timepoint:id` level is zero. This is a well-known phenomenon when fitting mixed models, given book-length treatment in @hodgesRichlyParameterizedLinear2013. In this case it is likely due to having only two measurements at each timepoint, combined with our use of the Laplace approximation to integrate over the random effects, which may be inaccurate for binomial data with few repeated observations [@joeAccuracyLaplaceApproximation2008].
```r
summary(mod)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ 0 + item + sl(x, factor = "loading") + (0 + loading | id/timepoint)
#> Data: dat
#>
#> AIC BIC logLik deviance df.resid
#> 1495.5 1538.0 -740.8 1614.8 3193
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -15.8546 0.0279 0.0780 0.1985 1.3948
#>
#> Lambda:
#> loading SE
#> lambda1 1.000 .
#> lambda2 2.202 0.3007
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> timepoint:id loading 0.0000 0.0000
#> id loading 0.6222 0.7888
#> Xr s(x):loading 1.5388 1.2405
#> Number of obs: 3200, groups: timepoint:id, 1600; id, 200; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> item21 2.944 0.1903 15.473 5.249e-54
#> item22 6.319 0.5853 10.796 3.612e-27
#> s(x):loadingFx1 -1.389 0.2837 -4.897 9.733e-07
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(x):loading 2.491 2.491 115.1 <2e-16
```
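To inspect the variance lumping described above, we can look at the variance components directly. A minimal sketch, assuming `galamm` provides a `VarCorr()` method analogous to `lme4::VarCorr()`:
```r
# Variance components; the timepoint:id variance is estimated at zero,
# so all individual-level variation is attributed to the id level.
VarCorr(mod)
```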
The true value 2 for the factor loading is well within the 95 % confidence limits.
```r
confint(mod, parm = "lambda")
#> 2.5 % 97.5 %
#> lambda1 1.612341 2.791192
```
## Smooth Terms with Loadings and Factor Interactions
### Gaussian Responses
Domains 1 and 3 both have Gaussian responses, and we can model them jointly.
```r
dat <- subset(cognition, domain %in% c(1, 3))
```
We also add indicator variables for the two domains.
```r
dat <- cbind(
dat,
model.matrix(~ 0 + domain, data = dat)[, c("domain1", "domain3")]
)
```
We define the loading matrix, now having two columns:
```r
(lmat <- matrix(c(
1, NA, NA, 0, 0, 0, 0,
0, 0, 0, 1, NA, NA, NA
), ncol = 2))
#> [,1] [,2]
#> [1,] 1 0
#> [2,] NA 0
#> [3,] NA 0
#> [4,] 0 1
#> [5,] 0 NA
#> [6,] 0 NA
#> [7,] 0 NA
```
Then we define the model. The smooth term is now `sl(x, by = domain, factor = c("ability1", "ability3"))`, indicating that there should be a separate smooth term for each level of `domain`, and that each term should be multiplied by the corresponding loading, "ability1" or "ability3". We also set `factr = 1e9` to be less strict about convergence than usual, because this model is hard to estimate.
```r
mod_byvar <- galamm(
formula = y ~ domain +
sl(x, by = domain, factor = c("ability1", "ability3")) +
(0 + domain1:ability1 + domain3:ability3 | id / timepoint),
data = dat,
load.var = "item",
lambda = lmat,
factor = c("ability1", "ability3"),
control = galamm_control(
optim_control = list(factr = 1e9, trace = 3,
REPORT = 50, maxit = 1000)
)
)
#> N = 17, M = 20 machine precision = 2.22045e-16
#> At X0, 0 variables are exactly at the bounds
#> At iterate 0 f= 25324 |proj g|= 9642.6
#> At iterate 50 f = 136.9 |proj g|= 716.66
#> At iterate 100 f = -1062.7 |proj g|= 121.27
#> At iterate 150 f = -1069.4 |proj g|= 155.18
#> At iterate 200 f = -1147.3 |proj g|= 84.96
#> At iterate 250 f = -1147.6 |proj g|= 42.374
#> At iterate 300 f = -1209.4 |proj g|= 66.759
#>
#> iterations 338
#> function evaluations 377
#> segments explored during Cauchy searches 341
#> BFGS updates skipped 0
#> active bounds at final generalized Cauchy point 0
#> norm of the final projected gradient 2.43961
#> final function value -1217.34
#>
#> F = -1217.34
#> final value -1217.343067
#> converged
```
The summary shows that we have recovered the true values of the factor loadings well.
```r
summary(mod_byvar)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ domain + sl(x, by = domain, factor = c("ability1", "ability3")) +
#> (0 + domain1:ability1 + domain3:ability3 | id/timepoint)
#> Data: dat
#> Control: galamm_control(optim_control = list(factr = 1e+09, trace = 3, REPORT = 50, maxit = 1000))
#>
#> AIC BIC logLik deviance df.resid
#> -2398.7 -2266.9 1217.3 -2434.7 11182
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2115.14 -1034.62 -885.39 22.95 51.96
#>
#> Lambda:
#> ability1 SE ability3 SE
#> lambda1 1.0000 . . .
#> lambda2 1.3986 0.002569 . .
#> lambda3 0.3014 0.002007 . .
#> lambda4 . . 1.0000 .
#> lambda5 . . 0.9993 0.0007676
#> lambda6 . . 1.0001 0.0007679
#> lambda7 . . 1.9995 0.0015687
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> timepoint:id domain1:ability1 0.23662 0.4864
#> domain3:ability3 0.28037 0.5295 -0.01
#> id domain1:ability1 3.35242 1.8310
#> domain3:ability3 18.30509 4.2784 0.89
#> Xr.0 s(x):domain3:ability3 86.88159 9.3210
#> Xr s(x):domain1:ability1 16.07526 4.0094
#> Residual 0.01007 0.1003
#> Number of obs: 11200, groups: timepoint:id, 1600; id, 200; Xr.0, 8; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.004847 0.004671 -1.0377 2.994e-01
#> domain3 0.002629 0.007592 0.3463 7.291e-01
#> s(x):domain1:ability1Fx1 -0.057658 0.245575 -0.2348 8.144e-01
#> s(x):domain3:ability3Fx1 5.570914 0.314703 17.7021 4.037e-70
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x):domain1:ability1 8.956 8.956 4313 <2e-16
#> s(x):domain3:ability3 8.988 8.988 59831 <2e-16
```
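As before, we can also request confidence limits for the freely estimated loadings, using the same `confint()` call demonstrated earlier:
```r
# Confidence intervals for the five freely estimated loadings.
confint(mod_byvar, parm = "lambda")
```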
We can plot the estimated smooth terms, which recover their simulated ground truth very well.
```r
plot_smooth(mod_byvar, scale = 0, select = 1)
```
![Estimated smooth term for domain 1 in model with domain 1 and domain 3.](semiparametric-gaussian-by-factor1-1.png)
```r
plot_smooth(mod_byvar, scale = 0, select = 2)
```
![Estimated smooth term for domain 3 in model with domain 1 and domain 3.](semiparametric-gaussian-by-factor2-1.png)
### Mixed Gaussian and Binomial Responses
Domain 1 has Gaussian responses and domain 2 has binomial responses, and we can model them jointly. Domain 1 is measured by three items and domain 2 by two, which will be reflected in the loading matrix below.
```r
dat <- subset(cognition, domain %in% c(1, 2))
```
We also add indicator variables for the two domains.
```r
dat <- cbind(
dat,
model.matrix(~ 0 + domain, data = dat)[, c("domain1", "domain2")]
)
```
We define the loading matrix, now having two columns:
```r
(lmat <- matrix(c(
1, NA, NA, 0, 0,
0, 0, 0, 1, NA
), ncol = 2))
#> [,1] [,2]
#> [1,] 1 0
#> [2,] NA 0
#> [3,] NA 0
#> [4,] 0 1
#> [5,] 0 NA
```
Then we define the model. The smooth term is now `sl(x, by = domain, factor = c("ability1", "ability2"))`, indicating that there should be a separate smooth term for each level of `domain`, and that each term should be multiplied by the corresponding loading, "ability1" or "ability2". Because this model has some convergence issues, we omit the timepoint-level random intercepts in this example.
```r
mod_byvar_mixed <- galamm(
formula = y ~ domain +
sl(x, by = domain, factor = c("ability1", "ability2")) +
(0 + domain1:ability1 + domain2:ability2 | id),
data = dat,
family = c(gaussian, binomial),
family_mapping = ifelse(dat$domain == 1, 1L, 2L),
load.var = "item",
lambda = lmat,
factor = c("ability1", "ability2"),
control = galamm_control(
optim_control = list(factr = 1e9, trace = 3,
REPORT = 30, maxit = 1000)
)
)
#> N = 12, M = 20 machine precision = 2.22045e-16
#> At X0, 0 variables are exactly at the bounds
#> At iterate 0 f= 8170.1 |proj g|= 2305.5
#> At iterate 30 f = 4726.2 |proj g|= 39.814
#>
#> iterations 44
#> function evaluations 49
#> segments explored during Cauchy searches 44
#> BFGS updates skipped 0
#> active bounds at final generalized Cauchy point 0
#> norm of the final projected gradient 0.286444
#> final function value 4725.04
#>
#> F = 4725.04
#> final value 4725.035302
#> converged
```
We can look at the model summary:
```r
summary(mod_byvar_mixed)
#> GALAMM fit by maximum marginal likelihood.
#> Formula: y ~ domain + sl(x, by = domain, factor = c("ability1", "ability2")) + (0 + domain1:ability1 + domain2:ability2 | id)
#> Data: dat
#> Control: galamm_control(optim_control = list(factr = 1e+09, trace = 3, REPORT = 30, maxit = 1000))
#>
#> AIC BIC logLik deviance df.resid
#> 9476.1 9566.9 -4725.0 98981.3 7987
#>
#> Lambda:
#> ability1 SE ability2 SE
#> lambda1 1.0000 . . .
#> lambda2 1.4082 0.01335 . .
#> lambda3 0.2869 0.01040 . .
#> lambda4 . . 1.0000 .
#> lambda5 . . 0.7996 0.05898
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> id domain1:ability1 2.4081 1.552
#> domain2:ability2 1.4827 1.218 0.40
#> Xr.0 s(x):domain2:ability2 0.8501 0.922
#> Xr s(x):domain1:ability1 2.2392 1.496
#> Number of obs: 8000, groups: id, 200; Xr.0, 8; Xr, 8
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.04306 0.02253 1.9116 5.593e-02
#> domain2 3.23013 0.16886 19.1290 1.449e-81
#> s(x):domain1:ability1Fx1 -0.08920 0.13855 -0.6438 5.197e-01
#> s(x):domain2:ability2Fx1 2.10045 0.47649 4.4082 1.042e-05
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(x):domain1:ability1 7.865 7.865 880 <2e-16
#> s(x):domain2:ability2 3.036 3.036 285 <2e-16
```
We can plot the estimated smooth terms:
```r
plot_smooth(mod_byvar_mixed, scale = 0, select = 1)
```
![Estimated smooth term for domain 1 in mixed response model with domain 1 and domain 2.](semiparametric-mixed-by-factor1-1.png)
```r
plot_smooth(mod_byvar_mixed, scale = 0, select = 2)
```
![Estimated smooth term for domain 2 in mixed response model with domain 1 and domain 2.](semiparametric-mixed-by-factor2-1.png)
# References