---
title: "QuestionLink Tutorial"
execute:
  warning: false
  message: false
format:
  html:
    toc: true
    toc-location: right
    code-fold: true
    embed-resources: true
knitr:
  opts_chunk:
    R.options:
      knitr.table.html.attr: "quarto-disable-processing=true"
editor: visual
---
# 00 Motivation and methodological background
The QuestionLink R-package aims to help researchers who want to harmonize survey data on a concept that were gathered with different questions. Specifically, the focus lies on (single) survey questions used to capture a latent construct, such as attitudes, values, interests, or subjective evaluations.
Consider two different questions on respondents' subjective evaluation of their general health:
> **European Social Survey (ESS)**
>
> How is your health in general? Would you say it is...
>
> Very good / Good / Fair / Bad / Very bad
> **International Social Survey Programme (ISSP)**
>
> In general, would you say your health is...
>
> Excellent / Very good / Good / Fair / Poor
How would you compare (or even combine) responses to these two questions?
Such problems are the purview of ex-post harmonization, a methodological field that seeks to assess and improve the comparability of data that have already been collected. Such harmonization can be challenging, but increasing comparability and combining (survey) data across different data sources opens up [many new opportunities for research](https://blog.gesis.org/the-sum-and-its-parts-the-benefits-of-combining-data-from-different-surveys/). In what follows, we will give a brief overview of the challenges this entails to motivate the QuestionLink R package.
## Harmonizing (single-item) survey questions for latent constructs
Harmonizing data measured with different questions across different surveys is not easy. For example, before we can even think of combining data across different questions, we need to ensure that [different questions actually measure the same concept](https://blog.gesis.org/apples-and-oranges-how-to-find-out-if-two-questions-measure-the-same-concept/). However, even if they do, we often need to align their measurement units. In other words, before comparing or combining data across questions, we need to ensure that their numerical response scores imply the same level of the measured construct.
Consider again the example about general health. Now imagine a respondent who chose the third response option ("good") in the ISSP question. What response would they have chosen in the ESS question? Also the third response option? But that would be "fair" in the ESS. Or perhaps they would have also chosen "good" in the ESS; thus choosing the second instead of the third response option. In any case, we cannot be certain. And as it turns out, the empirical reality is more complicated than just matching the same response options via recoding. Instead, at least for the ESS and ISSP waves in Germany, [QuestionLink derived the following harmonization solution](https://www.gesis.org/en/services/processing-and-analyzing-data/data-harmonization/question-link/general-health) from ISSP to ESS:
![](figures/General_health_example.svg)
As it turns out, matching a "3" in the ISSP question with a "3" in the ESS question would be inadequate. That seems plausible, since "good" and "fair" are rather different response labels. But matching "good" and "good" across ISSP and ESS also does not work perfectly. After all, respondents take the whole response scale into account when answering a question.
### Harmonizing measurement units
However, how did we derive these values and how are they justified? To understand what is happening here, we have to take a step back and look at the difference between (observed) response scores and (latent) construct expressions.
If we have a single question measuring a latent construct (such as general health), we have to keep in mind that the construct itself does not have natural categories or levels. Instead, the latent construct in respondents' minds is an amorphous idea that is best understood as a continuous value. When respondents then answer a question about that construct, they recall (or infer) their approximate position on that latent, continuous dimension. However, they cannot express this position directly in a survey. Instead, they have to choose the response option they feel best represents that position. And those response options are not continuous; they are discrete and (only) ordinal. In other words, observed response scores are the product of respondents mapping a continuous latent construct value onto a discrete, ordinal response scheme.
Visually, we might imagine our sample of respondents to have a continuous distribution of "true" latent construct values. Depending on their value, i.e., their position on the continuum, they choose a response option. This implies that response options "cut" the distribution of latent scores into ordinal response scores. (This simplified view ignores random response error. For a more realistic model, consider IRT models or differential item functioning methods in general.)
![](figures/measurement_units_continuous_to_ordinal.jpg)
In practice, social scientists often ignore this detail, because we can often treat the ordinal observed response scores as pseudo-metric. However, when we harmonize two different questions for the same construct, the issue becomes pressing. That is because different questions (different wording, different response scales etc.) project latent scores differently onto observed scores. They "cut" the latent continuum into ordinal scores with different thresholds: the thresholds may sit in different positions, and their number differs when the number of response options differs. For a bit more on this issue, we recommend reading this [blog post](https://blog.gesis.org/ceci-nest-pas-une-pipe-disentangling-measurement-and-reality-in-ex-post-harmonization/).
How then can we bridge these differences between two (or more) questions for the same construct? Commonly considered approaches are linear stretching and z-standardization. Both are insufficient in most cases, as you will discover yourself in the section on [insufficient alternatives](#sec-insufficient-alternatives).
### Observed score equating in a random groups design (OSE-RG)
Instead of these alternatives, QuestionLink applies a method called ***observed score equating in a random groups design*** **(OSE-RG)**. In this context, we can only give you a basic intuition. If you are interested, we recommend these two sources to delve deeper into equating.
Kolen and Brennan cover equating in-depth in their book:
> Kolen, M. J., & Brennan, R. L. (2014). *Test Equating, Scaling, and Linking* (3rd ed.). Springer. <https://doi.org/10.1007/978-1-4939-0317-7>
However, equating was originally conceived to harmonize multi-item scales. For applying equating algorithms to harmonize two single survey questions, see:
> Singh, R. K. (2022). Harmonizing single-question instruments for latent constructs with equating using political interest as an example. *Survey Research Methods*, *16*(3), 353--369. <https://doi.org/10.18148/srm/2022.v16i3.7916>
However, we can quickly get an intuition of how OSE-RG works. At its core, OSE-RG tries to answer the question of what a respondent with a certain observed score in one question would answer, on average, in another question on the same construct. As in our example above, what would a respondent who chose a "3 - good" in the ISSP choose in the ESS question on average? The result of OSE-RG then was: 2.29 on average. This "on average" acknowledges both the role of random response error and the fact that there is seldom an exact integer correspondence between the response options of two questions (e.g., a 3 in one question corresponding exactly to a 4 in the other).
This still leaves the question of how we can obtain these **equivalent values**. A first idea might be to show respondents both questions and then compare their responses. In such a setup, we might then take respondents who chose a "1" on question A and average the responses they gave on question B to get the equivalent value. Then we repeat that process for scores 2, 3, and so on. This is, in fact, a practiced equating approach for multi-item scales called **single-group design**. However, for single questions in survey research, the approach is often not feasible. First, the procedure only works if respondents' answers to the two questions are independent. Yet with only two questions, respondents will probably remember their previous answer. This creates a bias that even a randomized order cannot solve completely. Second, we almost never have single-group data available in existing survey data sources, and we might not be able to afford collecting single-group data on all questions in all combinations for our harmonization project.
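The single-group idea can be sketched in a few lines of base R. The data below are hypothetical toy values (not real survey responses), and the object names are chosen for illustration only; the sketch simply averages the responses to question B within each score of question A.

```{r}
# Hypothetical toy data: each row is one respondent who answered BOTH questions
single_group_toy <- data.frame(
  question_A = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
  question_B = c(1, 2, 2, 2, 3, 3, 4, 4, 5, 5)
)

# Equivalent value for each score of A: the mean response to B
# among the respondents who chose that score on A
single_group_equivalents <- tapply(
  single_group_toy$question_B,
  single_group_toy$question_A,
  mean
)

single_group_equivalents
```

Note that this only illustrates the logic of the design; it does not address the memory effects described above.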
So as the "RG" in OSE-RG already implies, we use another research design in QuestionLink: The **random groups** design. This approach is best understood when we imagine a split-half experiment, where one random half of respondents sees one question and the other half the other question. Now the important point is that due to the randomness of group assignment, we expect no systematic differences between the groups in terms of their actual construct values. In such an experiment, both groups have the same latent distribution of the relevant construct. They feel equally healthy on average, with the same standard deviation and so on. However, the distribution of measured values will be quite different, because each question maps this latent distribution differently onto observed response scores.
Visually, we might imagine the situation like this:
![](figures/08_population_link.jpg)
The continuous distributions in the middle are the "true" construct distributions in both random groups. They mirror each other, because they are identical through the random groups design. However, we see that the two questions (above and below) have different thresholds. They map the latent scores differently onto observed response scores. This means, for example, that a person with a specific general health (yellow point) would choose different responses in the two questions.
The result on an aggregate level is that we get two different observed response distributions (think: frequency tables) for the two questions, even though they measure the same latent distribution. Responses to both questions are, after all, randomly drawn from the same population:
![](figures/02_random_groups_design.jpg)
What OSE-RG now does is to simply transform scores of one question so that its observed response distribution aligns with the shape of the other question's observed response distribution. That sounds very abstract, but is actually really intuitive. Imagine we want to harmonize two questions and we know that the underlying population (and thus the latent distribution) is the same. What should the average responses to the two questions look like after successful harmonization? Identical, of course. After all, both random groups have the same true mean construct score (e.g., the same mean general health). Similarly, we would expect the same standard deviation, the same skewness, and the same kurtosis.
If we feed data into an OSE-RG algorithm, we get, in essence, a recoding table with which we can harmonize one question to the format of the other. If we apply it, we get just what we want: The same response distribution shape in the random groups setting, regardless of the question. There are different algorithms to achieve that. The simplest is linear equating, which assumes normally distributed responses and only harmonizes the mean and standard deviation. However, since response distributions are often skewed, we instead use equipercentile equating, which is more flexible and also harmonizes differences in skewness and kurtosis. You can learn more about these algorithms in two blog posts on [linear equating](https://blog.gesis.org/the-new-normal-linear-equating-of-different-instruments/) and [equipercentile equating](https://blog.gesis.org/cats-are-liquids-equipercentile-equating-of-different-instruments/).
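To make the simpler of the two algorithms concrete, here is a minimal base R sketch of linear equating. The responses are hypothetical toy values and `linear_equate()` is an illustrative helper, not a function of the QuestionLink package. It rescales scores of question B so that their mean and standard deviation match those of question A.

```{r}
# Hypothetical toy responses from two random groups (illustration only)
linear_toy_A <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
linear_toy_B <- c(2, 3, 3, 4, 4, 4, 5, 5, 5)

# Linear equating: match the mean and standard deviation of the target question
linear_equate <- function(x, from, to) {
  mean(to) + (sd(to) / sd(from)) * (x - mean(from))
}

# Recoding table: equivalent value in A for each observed score of B
linear_recoding_table <- data.frame(
  score_B = sort(unique(linear_toy_B)),
  equivalent_in_A = linear_equate(
    sort(unique(linear_toy_B)),
    from = linear_toy_B,
    to = linear_toy_A
  )
)
linear_recoding_table

# After recoding, B has the same mean and standard deviation as A
linear_harmonized_B <- linear_equate(linear_toy_B, from = linear_toy_B, to = linear_toy_A)
c(mean_A = mean(linear_toy_A), mean_B_harmonized = mean(linear_harmonized_B))
c(sd_A = sd(linear_toy_A), sd_B_harmonized = sd(linear_harmonized_B))
```

Because the transformation is a straight line, it cannot correct differences in skewness or kurtosis, which is why the more flexible equipercentile approach is preferred here.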
However, in a nutshell, equipercentile equating looks at the cumulative relative frequencies of both questions in the random groups data (i.e., 10% chose response 1, 25% chose response 2 or less, 37% chose response 3 or less and so on). The algorithm then transforms all responses of one question (B) into their percentiles. And then those percentiles are transformed back into (continuous approximations) of responses in the other question (A).
![](figures/05_equipercentile_equating.jpg)
The result after harmonization then looks like this:
![](figures/05_result.jpg)
The result of this recoding of responses to B is that the distribution shape (e.g., mean and standard deviation) now aligns between A and B. Now it is important to note that this does not only harmonize the current random groups data. Instead, once such a recoding table has been derived from the random groups data, we can use it to harmonize the two questions in many other data sets as well. The only assumption that we need is that the two questions work the same in those other samples as well (see the section on [Population invariance](https://matroth.github.io/questionlink/articles/advanced_use_cases#population-invariance)).
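The percentile logic described above can be sketched in base R as follows. This is a simplified textbook-style variant under stated assumptions, not necessarily what the QuestionLink package does internally: scores are consecutive integers, every response option was chosen at least once, and each integer score is treated as spread evenly over the interval from half a point below to half a point above it. The toy data and the helper `equipercentile_table()` are hypothetical.

```{r}
# Hypothetical toy responses from two random groups (illustration only)
equi_toy_A <- rep(1:5, times = c(10, 25, 35, 20, 10)) # target question A
equi_toy_B <- rep(1:5, times = c(5, 15, 30, 30, 20))  # question B, to be harmonized

equipercentile_table <- function(from, to) {
  from_scores <- sort(unique(from))
  to_scores <- sort(unique(to))

  # Relative and cumulative frequencies of the source question
  from_rel <- as.numeric(table(factor(from, levels = from_scores))) / length(from)
  from_cum <- cumsum(from_rel)

  # Percentile rank of each source score:
  # share of responses below the score plus half the share at the score
  percentile_rank <- from_cum - from_rel / 2

  # Continuous cumulative distribution of the target question:
  # each integer score k is spread over [k - 0.5, k + 0.5]
  to_rel <- as.numeric(table(factor(to, levels = to_scores))) / length(to)
  cdf_values <- c(0, cumsum(to_rel))
  cdf_knots <- c(min(to_scores) - 0.5, to_scores + 0.5)

  # Invert the target distribution at the source percentile ranks
  equivalent_value <- approx(x = cdf_values, y = cdf_knots, xout = percentile_rank)$y

  data.frame(source_score = from_scores, percentile_rank, equivalent_value)
}

equi_recoding_table <- equipercentile_table(from = equi_toy_B, to = equi_toy_A)
equi_recoding_table
```

The resulting table has one row per response option of B, with its (continuous) equivalent value in the format of A.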
### Finding random groups data
The only remaining issue is that we usually cannot perform methodological split-half experiments for all the many questions we might want to harmonize for a project. However, when we harmonize data from survey programs we might find that we already have random groups data, at least for some of the questions for a construct that we want to harmonize. Specifically, we are looking for instances where two questions were used in random samples of the same population at the same time.
For example, imagine two survey programs which randomly sample the adult German population along overlapping time series. Then we could take samples of both surveys (with their respective questions for our construct) from the same year and use that for our equating.
![](figures/survey_program_links.jpg)
This approach has the advantage that we can reuse existing data. Data that we most likely have researched and prepared for the harmonization project anyway. However, finding such links can be daunting if done manually. Consider the [QuestionLink example for general life satisfaction](https://www.gesis.org/en/services/processing-and-analyzing-data/data-harmonization/question-link/general-life-satisfaction) across several national and international survey programs with random samples of the adult German population:
![](figures/csm_General_Life_Satisfaction_overview_81dd82e781.png)
The figure shows us all the different questions for general life satisfaction (in black to the right of the plot). Then in color, each row is one survey program that used the respective question. And then each dot represents a wave where that question was used in that survey program in a specific year. On the one hand, the figure makes us optimistic that finding opportunities for equating across survey programs is quite realistic, at least for more common constructs.
On the other hand, we still face some obstacles. First, there are some questions that we cannot harmonize directly, because they do not share a year in which they were used. Consider, for example, the questions "EVS 10pt" and "GPANEL 5pt". Second, we would also like to use all possible opportunities for equating (i.e., connection) instead of just using one possible connection. This is because equating, like all empirical methods, comes with random error (for example due to the randomness of the sampling process). Combining several equatings makes them more robust.
Fortunately, the QuestionLink R-Package helps solve all these issues automatically. To understand how that works, we need to consider the three types of connections that provide opportunities for equating. And we need to understand how to aggregate the results of different connections.
### Types of connections
QuestionLink finds three types of connections: Direct, time-relaxed, and relay connections. The first type, **direct connections**, is the case we have already discussed: Using data where two questions were used in (random samples of) the same populations in the same year:
![](figures/links_direct.jpg)
However, sometimes we can only find instances where two questions were used in almost the same year, but not a direct connection. We call these **time-relaxed connections**, because to use them we need to relax the constraints we put on time overlap. The QuestionLink R-package allows you to choose just how much difference in years you want to allow. Often, one year can be enough. For example, when one survey covers even years and the other survey covers odd years. Time-relaxation is, of course, easier to justify with constructs that are relatively stable over time than with those that fluctuate year by year. However, please consider that QuestionLink uses all possible connections at once. This often means we have counterbalancing time-relaxed connections (i.e., harmonizing one year earlier and then harmonizing one year later):
![](figures/links_time_relaxation.jpg)
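The difference between direct and time-relaxed connections can be illustrated with a small base R sketch. The fielding years are hypothetical, and both the helper `find_connections()` and its argument `max_year_gap` are made up for this illustration; they are not part of the QuestionLink API.

```{r}
# Hypothetical fielding years of two questions (illustration only)
connection_years_A <- c(2000, 2002, 2004, 2006)
connection_years_B <- c(2001, 2003, 2004, 2008)

# List all year pairs whose distance does not exceed the allowed gap
find_connections <- function(years_x, years_y, max_year_gap = 0) {
  pairs <- expand.grid(year_x = years_x, year_y = years_y)
  pairs$gap <- abs(pairs$year_x - pairs$year_y)
  pairs <- pairs[pairs$gap <= max_year_gap, ]
  pairs[order(pairs$year_x, pairs$year_y), ]
}

# Direct connections: both questions fielded in the same year
direct_connections <- find_connections(connection_years_A, connection_years_B, max_year_gap = 0)

# Time-relaxed connections: allow a difference of up to one year
relaxed_connections <- find_connections(connection_years_A, connection_years_B, max_year_gap = 1)

direct_connections
relaxed_connections
```

In this toy example, relaxing the time constraint by one year turns a single direct connection into several usable connections, some linking to an earlier and some to a later year.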
The final and most powerful connections are **relay connections**. They become possible because equating can be chained. Imagine a situation with three questions, A, B, and C. We happen to have random groups data for A-C and for B-C. However, our goal is to harmonize A to B. Fortunately, we can achieve this easily by equating A to C, and then equating the values of A in the format of C to B. In other words, we harmonize A to B via the relay C. The only downside to such relay connections is that they incur random errors twice during equating. However, we can usually find many such relay connections, and by aggregating them we can mitigate this issue.
![](figures/links_relay.jpg)
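As a rough sketch of such chaining, consider two recoding tables, one from A to C and one from C to B. All values below are hypothetical, and the use of linear interpolation between the integer scores of C is an assumption made for this illustration, not a description of the package's internals.

```{r}
# Hypothetical recoding tables (illustrative values, not real equating results)
relay_A_to_C <- data.frame(
  score_A = 1:5,
  equivalent_in_C = c(1.2, 2.1, 3.0, 3.8, 4.6)
)
relay_C_to_B <- data.frame(
  score_C = 1:5,
  equivalent_in_B = c(1.5, 2.5, 3.5, 4.5, 5.5)
)

# The values of A in the format of C are not integers, so we interpolate
# linearly between the integer scores of the C-to-B table (assumption).
# rule = 2 keeps values outside the observed range at the nearest end point.
relay_equivalent_in_B <- approx(
  x = relay_C_to_B$score_C,
  y = relay_C_to_B$equivalent_in_B,
  xout = relay_A_to_C$equivalent_in_C,
  rule = 2
)$y

# Chained recoding table: A harmonized to B via the relay C
relay_A_to_B <- data.frame(
  score_A = relay_A_to_C$score_A,
  equivalent_in_B = relay_equivalent_in_B
)
relay_A_to_B
```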
With these three types of connections, our chances to be able to harmonize the questions for a construct we are interested in have risen dramatically. At the same time, if we find and apply all these connections we are left with many equating solutions (\~ recoding tables) for each question pair (and in both directions; e.g., A-\>B & B-\>A). And most likely, these solutions will fluctuate due to random sampling errors. What QuestionLink now does is to aggregate all harmonization solutions (one for each connection between two questions in one direction) into a single, robust harmonization solution.
After finding all possible connections and performing equatings for each connection, QuestionLink then **summarizes all possible solutions** for a question pair (in one direction) into a single recoding table. This is done by calculating the median equivalent value for each score of the question we want to harmonize. The result is that random errors are mitigated to a large extent. This might seem abstract, but we demonstrate this in the article on advanced use cases in the section [retrieving recoding tables](https://matroth.github.io/questionlink/articles/advanced_use_cases#retrieving-recoding-tables).
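The median aggregation itself is simple to sketch in base R. The three sets of equivalent values below are hypothetical stand-ins for the equating results of three different connections for the same question pair.

```{r}
# Hypothetical equivalent values for the same question pair,
# one row per connection, one column per source score (illustration only)
equating_solutions <- rbind(
  connection_1 = c(1.10, 2.30, 3.20, 4.10, 4.90),
  connection_2 = c(1.30, 2.20, 3.40, 4.00, 4.80),
  connection_3 = c(1.20, 2.60, 3.30, 4.30, 4.70)
)
colnames(equating_solutions) <- paste0("score_", 1:5)

# Summarize into one recoding table: the median equivalent value per source score
median_recoding_table <- data.frame(
  source_score = 1:5,
  median_equivalent = as.numeric(apply(equating_solutions, 2, median))
)
median_recoding_table
```

Because the median is taken separately for each source score, a single connection with an unusually high or low value has little influence on the final table.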
# 01 Setting up a simulated population
Our goal in this section and the next is to simulate survey responses on different measurement instruments that measure the same construct. To do so, we first simulate the underlying "true" development of that construct in our simulated population over time. Usually, we think of true construct scores as continuous, one-dimensional values. There are no natural steps or categories in constructs such as political interest, life satisfaction, or intelligence. Instead we imagine them to be seamless, gradual values along a dimension.
Our setup here is to create a data frame which will contain 2000 simulated construct values per year for every year from 1980 to 2020.
The code works as follows step-by-step:
1. We create the data frame (or tibble in the tidyverse) and populate it with a year variable with values from 1980 to 2020. For now, the data frame has 41 rows (one for each year, with both end points included).
2. We add a variable `target_construct_mean` which represents the idealized trend over time that we want our simulated construct to follow. The formula simply creates a mean value that runs from `-0.5` to `0.5`. Over the forty-year span, the mean will gradually (and linearly) increase.
3. Now we simulate for each year a sample of 2000 true scores. This simulation assumes a normal distribution of true scores around our target mean (`target_construct_mean`) for each year with a standard deviation of 1. We "nest" a vector containing all 2000 scores in each cell in the variable `simulated_true_score`. Currently, the data frame still only has 41 rows, but in each cell in the last variable we have crammed 2000 values.
4. In the last step, we use `unnest()` to arrive at a tidier data frame. What unnest does is to unfold the score vectors in `simulated_true_score` so that each row only contains one value. In other words, we now unfold our data frame of 41 rows (each row a year) into a data frame of 41 \* 2000 = 82000 rows (each row a true construct value).
With the `sample_n()` function, we look at 10 randomly selected rows from the 82000 rows of the simulated data frame. We see that each row has a specific `year`, a target construct mean `target_construct_mean` and a specific simulated value `simulated_true_score`.
```{r}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(dplyr)
library(patchwork)
library(tibble)
library(tidyr)
library(purrr)
library(kableExtra)
library(viridis)
library(ggplot2)
library(forcats)

## Installing questionlink with devtools
# devtools::install_github("https://github.com/MatRoth/questionlink")
## Installing questionlink with pak
# install.packages("pak")
# pak::pak("https://github.com/MatRoth/questionlink")
library(questionlink)

# Helper: render a data frame as a compact, left-aligned table
format_table <- function(df) {
  df |>
    kable() |>
    kable_styling(position = "left", full_width = FALSE)
}
```
```{r}
#| code-fold: show
survey_sample_n <- 2000

sim_df <- tibble(
  year = 1980:2020,
  # construct linearly increases over time
  target_construct_mean = ((year - min(year)) / (max(year) - min(year))) - 0.5,
  # one vector of simulated true scores per year (standard deviation of 1)
  simulated_true_score = map(target_construct_mean, ~ rnorm(survey_sample_n, .x, 1))
) |>
  unnest(simulated_true_score)

sim_df |> sample_n(10) |> format_table()
```
### Simulated change over time
Now we can plot the mean of these continuous construct scores over time. We see that the "true" average construct score for our simulated population slowly shifts by one standard deviation over the course of 40 years. In substantive terms, imagine the average of an opinion slowly and steadily increasing, year after year.
```{r}
sim_df |>
  ggplot(aes(year, simulated_true_score)) +
  geom_smooth(se = FALSE) +
  theme_minimal() +
  scale_y_continuous(
    breaks = c(-0.5, -0.25, 0, 0.25, 0.5),
    minor_breaks = NULL
  )
```
### Normally distributed simulated scores per year
However, the mean values for each year hide that we actually have a (normal) distribution of construct scores for each year. In other words, there is a spread of lower, medium, and higher construct scores in each year. Below, we see these simulated "true" construct score distributions for different years. For better visibility, only full decades are shown. Observe how the distribution slowly wanders to the right, as the average value increases over time.
*(For those unfamiliar with this type of plot: This is a density plot. For the construct values on the x-axis, it shows how frequent they are in the data (or, more precisely, how probable each value is). The values on the y-axis have no direct interpretation. They are simply chosen by R so that the area under the curve is one, because the whole curve represents all values and thus a total probability of one. As an intuition, think of density plots as histograms for continuous data.)*
```{r}
sim_df |>
  filter(year %% 10 == 0) |> # only select full decades
  ggplot(aes(simulated_true_score, color = year |> as.factor())) +
  geom_density() +
  scale_color_viridis(discrete = TRUE, end = 0.8) + # coloring
  theme_minimal()
```
# 02 Simulating survey responses
Now that we have simulated the continuous construct values, we need to derive simulated response values. After all, survey respondents do not give decimal values as their answer. Instead, they choose one of several discrete, ordinal response options, such as "somewhat agree".
To simulate this, we will transform the continuous, simulated scores into discrete, integer response scores. This is done by "cutting" the continuous scores into discrete responses at certain thresholds, using our `cut_responses()` helper function. (Please note that this is a very simplistic view of actual response processes, but it serves our needs here just fine.)
```{r}
# helper function that allows us to cut continuous values into ordinal chunks
cut_responses <- function(x, breaks) {
  # the outer limits -Inf and Inf ensure that every value falls into a category
  ord_vec <- cut(x, c(-Inf, breaks, Inf), labels = FALSE)
  return(ord_vec)
}
```
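To see what this kind of cutting does, here is a small self-contained example that calls base R's `cut()` in the same way as the helper. The scores and thresholds are made-up illustration values. Note that `cut()` by default treats intervals as closed on the right, so a value exactly on a threshold falls into the lower category.

```{r}
# Made-up continuous scores and thresholds (illustration only)
example_scores <- c(-2, -1.5, 0, 1, 3)
example_breaks <- c(-1.5, -0.5, 0.5, 1.5)

# Four thresholds produce five ordered response categories (1 to 5)
example_responses <- cut(example_scores, c(-Inf, example_breaks, Inf), labels = FALSE)
example_responses
```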
As an example, we calculate simulated responses for the year 2000. Below we see the density plot for the continuous construct values we have simulated for the population in the year 2000. The vertical lines over the density plot represent the thresholds with which we will cut this continuous construct into ordinal responses. Values up to the first (leftmost) threshold will become the response score "1", values between the first and second threshold become a response score of "2", and so on.
```{r}
# thresholds used both for cutting and for the vertical lines in the plot
example_thresholds <- c(-1.5, -0.3, 0.5, 2)

cutting_example <- sim_df |>
  filter(year == 2000) |>
  mutate(
    example_response = cut_responses(simulated_true_score, example_thresholds)
  )

cutting_example |>
  ggplot() +
  geom_density(aes(simulated_true_score)) +
  geom_vline(xintercept = example_thresholds) +
  theme_minimal()
```
Below we see the resulting simulated response distribution. The barchart shows the relative frequency of the five possible response options.
```{r}
cutting_example |>
  count(example_response) |>
  mutate(`relative frequency (%)` = 100 * n / sum(n)) |>
  ggplot(aes(example_response, `relative frequency (%)`)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  ggtitle("Example for a simulated response distribution")
```
### Defining four question versions
Now we apply this process to simulate different questions. That means questions which all capture the same concept, but with different question wording, or different response scales. Such differences mean that the same continuous true score distribution is cut into different observed response distributions, because [the thresholds differ](https://blog.gesis.org/not-by-any-stretch-of-the-imagination-a-cautionary-tale-about-linear-stretching/).
Specifically, we will simulate four different questions (A, B, C, and D), which all capture our simulated construct. Questions A, B, and C are all 5-point response scales. Questions B and C are based on the thresholds of question A, but they are additively shifted (different mean response) and multiplicatively stretched (different standard deviation). Question D, meanwhile, is a 10-point instrument, to demonstrate that QuestionLink can handle different numbers of response options with ease.
```{r}
response_thresholds_A <- c(-1.5, -0.5, 0.5, 1.5)
response_thresholds_B <- (response_thresholds_A - 0.5) * 1.5
response_thresholds_C <- (response_thresholds_A + 0.5) * 0.7
response_thresholds_D <- seq(-1, 2, length.out = 9)

sim_df <- sim_df |>
  mutate(
    question_A = cut_responses(simulated_true_score, response_thresholds_A),
    question_B = cut_responses(simulated_true_score, response_thresholds_B),
    question_C = cut_responses(simulated_true_score, response_thresholds_C),
    question_D = cut_responses(simulated_true_score, response_thresholds_D)
  )

sim_df |>
  head() |>
  format_table()
```
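As a quick check of the scale lengths described above, the following sketch counts the response options implied by each set of thresholds. The thresholds are repeated here so that the chunk runs on its own; the object names are chosen for this illustration only.

```{r}
# Thresholds as defined for questions A to D, repeated so this check is self-contained
check_thresholds <- list(
  A = c(-1.5, -0.5, 0.5, 1.5),
  B = (c(-1.5, -0.5, 0.5, 1.5) - 0.5) * 1.5,
  C = (c(-1.5, -0.5, 0.5, 1.5) + 0.5) * 0.7,
  D = seq(-1, 2, length.out = 9)
)

# k thresholds cut the continuum into k + 1 ordered response options
n_response_options <- sapply(check_thresholds, length) + 1
n_response_options
```

Questions A, B, and C each have four thresholds and thus five response options, while the nine thresholds of question D yield ten.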
### Preparing long format for QuestionLink
Currently, the four questions are represented in a wide format data frame. This means we have four variables, one for each question. This also means we have four responses in every row. However, QuestionLink expects data in long format. That means each row in the data frame represents a single `response`. Each response has an associated `year` (in which the response was collected) and a `question` (which represents the different measurement instruments). Transforming data from wide to long format is easily done with `tidyr::pivot_longer()`. Below you see the first six rows in the new, long format data frame:
```{r}
qlink_format_df <- sim_df |>
select(year, contains("question")) |>
pivot_longer(contains("question"), values_to = "response") |>
separate(name, into = c("discard", "question")) |>
select(-discard)
qlink_format_df |> head() |> format_table()
```
#### Overview of response distributions
Here we see the simulated response distributions across all four question variants and across the years. Note that the plot only shows the full decades 1980, 1990, 2000, 2010, and 2020.
```{r}
qlink_format_df |>
filter(year %% 10 == 0) |>
count(year, question, response) |>
group_by(year, question) |>
mutate(
response = factor(response),
relative_frequency = 100 * n / sum(n) # Calculates relative response frequencies
) |>
ggplot(aes(response, relative_frequency))+
facet_grid(
rows = vars(year),
cols = vars(question),
scales = "free_x"
)+
geom_bar(stat = "identity")+
theme_minimal(base_size = 16)+
theme(
panel.grid.major.x = element_blank()
)
```
### Creating artificial "gaps" in the timeseries
One of the main advantages of QuestionLink is that it helps find opportunities for harmonization across many different question variants, surveys and their respective waves. To showcase this, we create gaps in the time-series of our simulated question variants. This is a plausible situation, when different survey programs begin or end their time series at different intervals.
First, the current simulated data without gaps in the time-series:
```{r}
qlink_format_all_years_df <- qlink_format_df
plot_question_years <- function(combined_df){
year_min = min(combined_df$year)
year_max = max(combined_df$year)
years_vec = c(year_min:year_max)
major_breaks_vec = years_vec[years_vec %% 5 == 0]
combined_df |>
group_by(year, question) |>
summarise() |>
ungroup() |>
arrange(question) |>
mutate(question = fct_rev(question)) |>
ggplot(aes(year, question, color = question))+
geom_point(size = 3)+
scale_x_continuous(breaks = major_breaks_vec,
minor_breaks = years_vec)+
theme_minimal(base_size = 16)+
theme(legend.position = "none")
}
qlink_format_df |> plot_question_years()
```
#### Showcasing the time-relaxation feature
The logic of OSE-RG, the method behind QuestionLink, dictates that we have to match data from different instruments gathered from the same "population". Here that would ideally mean data randomly sampled from the population (e.g., adult Germans) in the same year. However, sometimes surveys collect data in alternating years and the direct years cannot be easily matched. If a construct does not change dramatically from year to year, we can "relax" this constraint; for example to allow adjacent years. (For more information on time-relaxed connections, see the section on [types of connections].)
For this reason, we simulate that each question is only asked every two years: in even years for A, B, and C, but in odd years for D.
```{r}
qlink_format_df <- qlink_format_df |>
filter(!(question == "A" & year %% 2 != 0)) |>
filter(!(question == "B" & year %% 2 != 0)) |>
filter(!(question == "C" & year %% 2 != 0)) |>
filter(!(question == "D" & year %% 2 == 0))
qlink_format_df |>
plot_question_years()
```
#### Showcasing the relay link feature
QuestionLink can also bridge larger gaps in joint time series, where two questions do not overlap in time at all. This becomes possible with relay links, where QuestionLink harmonizes one question to a relay question and then onwards to the actual target question. In our example data, we will create a gap so that instruments A and B no longer overlap. However, A and B both overlap with questions C and D. Hence we can form the relay links A -\> C -\> B and A -\> D -\> B. (For more information on relay connections, see the section on [types of connections].)
Thus we arrive at our final simulated survey wave structure. (For simplicity's sake, we assume that each question is used by one of four different surveys sampling the same population, but with different wave schedules.)
```{r}
# Defining gaps in time series by defining the "possible years" for each question
possible_years_lists <- list(
A = c(2000:2015),
B = c(1980:1990),
C = c(1980:2020),
D = c(1985:2005)
)
qlink_format_df <- qlink_format_df |>
mutate(
valid_year = map2_lgl(year, question, ~ .x %in% possible_years_lists[[.y]])
) |>
filter(valid_year) |>
select(-valid_year)
qlink_format_df |>
plot_question_years()
```
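As a quick plausibility check of the resulting wave structure, the following base R sketch rebuilds the survey years of questions A, B, and C from the definitions above (even years only, restricted to each question's possible years) and compares them. The `_sketch` object names are illustrative and not part of QuestionLink.

```{r}
# Keep only even years, mirroring the biennial schedule of A, B, and C
even_years_sketch <- function(years) years[years %% 2 == 0]

years_a_sketch <- even_years_sketch(2000:2015)
years_b_sketch <- even_years_sketch(1980:1990)
years_c_sketch <- even_years_sketch(1980:2020)

# A and B share no survey year, so there is no direct link between them
shared_a_b_sketch <- intersect(years_a_sketch, years_b_sketch)

# Both share years with C, which can therefore serve as a relay
shared_a_c_sketch <- intersect(years_a_sketch, years_c_sketch)
shared_b_c_sketch <- intersect(years_b_sketch, years_c_sketch)

c(A_B = length(shared_a_b_sketch),
  A_C = length(shared_a_c_sketch),
  B_C = length(shared_b_c_sketch))
```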
# 03 Does QuestionLink work?
After putting in all this work to simulate different survey responses for the same construct, we want to harmonize them again. To show how easy this is with QuestionLink, let us harmonize our four simulated questions in one pass. Later, we will discuss each step and the respective options one by one. For now, note how we can complete the entire harmonization process in one pipe with four lines of code. As an example, we harmonize towards the format of question "A".
```{r}
harmonized_df <- qlink_format_df |>
ql_prepare(use_relay = TRUE, time_relaxation = 1) |>
ql_harmonize() |>
ql_transform(target = "A")
```
## Results of QuestionLink
Did it work? Let us find out by plotting the mean scores across the time-series: Once unharmonized and once harmonized with QuestionLink.
```{r}
unharmonized_mean_plot <- qlink_format_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response, color = question), se = FALSE, method = "lm")+
scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
ggtitle("unharmonized")
harmonized_mean_plot <- harmonized_df |>
ggplot(aes(year))+
geom_smooth(aes(y = median_equivalent, color = question), se = FALSE, method = "lm")+
scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
ggtitle("QuestionLink harmonized")
# Combining both plots with patchwork::
unharmonized_mean_plot + harmonized_mean_plot
```
QuestionLink has harmonized both the differences in the mean and the differences in standard deviation. Hence, intercept and slope of our four time-series now align. Substantively speaking, this means that in each year we can safely compare the average score across the four instruments. It also means that QuestionLink has harmonized the growth rate over time (i.e., the slope), by harmonizing the standard deviation of all four questions. Below, we verify this by looking at the standard deviation directly:
```{r}
unharmonized_sd_plot <- qlink_format_df |>
group_by(year, question) |>
summarise(
`standard deviation` = sd(response)
) |>
ggplot(aes(year, `standard deviation`, color = question))+
geom_point()+
theme_minimal()+
ggtitle("unharmonized")+
scale_y_continuous(limits = c(0,3), minor_breaks = NULL)+
theme_minimal()
harmonized_sd_plot <- harmonized_df |>
group_by(year, question) |>
summarise(
`standard deviation` = sd(median_equivalent)
) |>
ggplot(aes(year, `standard deviation`, color = question))+
geom_point()+
theme_minimal()+
ggtitle("QuestionLink harmonized")+
scale_y_continuous(limits = c(0,3), minor_breaks = NULL)+
theme_minimal()
unharmonized_sd_plot + harmonized_sd_plot
```
## Insufficient alternatives {#sec-insufficient-alternatives}
While QuestionLink worked well, you might ask yourself if all this effort was worth it. To address these concerns, let us look at commonly considered alternatives.
### Z-standardization
First, let us look at z-standardization. The approach centers and scales variables so that their mean equals zero ($\bar{x}=0$) and their standard deviation equals one ($S=1$). The resulting scores are called $z$-scores.
The formula is quite simple:
$$
z=\frac{x-\bar{x}}{S}
$$
We center by subtracting the arithmetic mean of the scores from each score and then we scale by dividing by the standard deviation.
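As a small illustration, the formula can be applied by hand and compared with base R's `scale()`, which centers and scales in the same way by default. The toy vector `x_toy` is made up for this sketch.

```{r}
x_toy <- c(2, 3, 4, 5, 6)  # made-up response scores

# Center by the mean, then scale by the (sample) standard deviation
z_manual <- (x_toy - mean(x_toy)) / sd(x_toy)

# scale() returns a one-column matrix, so we convert it to a plain vector
z_scale <- as.numeric(scale(x_toy))

all.equal(z_manual, z_scale)
round(c(mean = mean(z_manual), sd = sd(z_manual)), 2)
```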
One complexity with z-standardization is that we can standardize the same data-set in different ways. As an example, we now z-standardize our simulated data-set in four ways:
1. **Overall:** Here we standardize all responses across all four questions and all forty years in one go. This means we scale with the mean and sd of the entire vector of response scores regardless of question or year.
2. **By question:** However, because we want to harmonize the differences between the four questions, we might instead z-standardize each question separately. This means scaling responses to each question by the mean and sd of each question separately.
3. **By year:** Another common idea is to z-standardize year by year, regardless of the question. This means scaling responses by the mean and sd of all questions year by year.
4. **By year and question combined:** Lastly, we might think that harmonization requires z-standardization over questions and years at the same time. This means scaling responses by the mean and sd of each question in each year separately.
::: callout-note
Note how in the code below, these distinctions are realized with the `dplyr::group_by()` function. That function modifies the following mutate statement so that the `scale()` function standardizes by the correct mean and sd.
:::
```{r}
z_example_df <- qlink_format_df |>
mutate(
response_z_overall = scale(response)
) |>
group_by(question) |>
mutate(
response_z_by_question = scale(response)
) |>
ungroup() |>
group_by(year) |>
mutate(
response_z_by_year = scale(response)
) |>
group_by(year, question) |>
mutate(
response_z_by_year_question = scale(response)
)
```
```{r}
z_overall_plot <- z_example_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response_z_overall, color = question), se = FALSE, method = "lm")+
#scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
theme(legend.position = "none",
axis.title.y = element_blank())+
scale_y_continuous(limits = c(-4, 4))+
ggtitle("overall")
z_by_question_plot <- z_example_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response_z_by_question, color = question), se = FALSE, method = "lm")+
#scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank())+
scale_y_continuous(limits = c(-4, 4))+
ggtitle("by question")
z_by_year_plot <- z_example_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response_z_by_year, color = question), se = FALSE, method = "lm")+
#scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank())+
scale_y_continuous(limits = c(-4, 4))+
ggtitle("by year")
z_by_year_question_plot <- z_example_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response_z_by_year_question, color = question), se = FALSE, method = "lm")+
#scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
theme(axis.title.y = element_blank(),
axis.text.y = element_blank())+
scale_y_continuous(limits = c(-4, 4))+
ggtitle("by year and question")
# Combining both plots with patchwork::
z_overall_plot + z_by_question_plot + z_by_year_plot + z_by_year_question_plot + plot_layout(ncol = 4)
```
None of the four z-standardization solutions looks as we might have hoped. Let us look at each version in turn to discuss what went wrong.
1. **Overall:** If we apply z-standardization to all scores across questions and years, then nothing changes substantially. Only the units are now z-scores, but all question differences persist exactly as before.
2. **By question:** If we z-standardize each question, we get these parallel lines. Each question's time series is now centered at zero in the middle, and each has the same growth over time. This obviously no longer represents the true construct, which grows linearly from 1980 to 2020.
3. **By year:** Scaling by year leads to totally chaotic results. This is because in each year we scale by a mix of the mean and standard deviation of each question.
4. **By year and question combined:** Lastly, scaling by both year and question results in somewhat harmonized data, but unfortunately all time-series information has been lost. Instead of the steady growth of the true construct, we now see no change over time at all.
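The loss of time-series information in the last variant can be reproduced with a tiny made-up example. In this sketch (toy data for a single question, base R only), the raw yearly means grow over time, but after standardizing within each year every yearly mean is zero by construction.

```{r}
# Made-up responses to one question in three years with a rising mean
trend_toy <- data.frame(
  year     = rep(c(1980, 2000, 2020), each = 4),
  response = c(1, 2, 2, 3,  2, 3, 3, 4,  3, 4, 4, 5)
)

z_fun_sketch <- function(x) (x - mean(x)) / sd(x)

# Standardize within each year (here equivalent to "by year and question")
trend_toy$z_within_year <- ave(trend_toy$response, trend_toy$year,
                               FUN = z_fun_sketch)

raw_means_sketch <- tapply(trend_toy$response, trend_toy$year, mean)
z_means_sketch   <- tapply(trend_toy$z_within_year, trend_toy$year, mean)

rbind(raw = raw_means_sketch, standardized = round(z_means_sketch, 10))
```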
### Linear Stretching
A second frequently considered alternative is linear stretching. This method sets the minimum and maximum responses of two questions with a different number of scale points as equal. The process is best understood by examining the following function. Here we supply scores of a source question `x` that we want to harmonize. We also supply the maximum and minimum score of that source question `x` (`max_source` and `min_source`) as well as the maximum and minimum score of a target question (`max_target` and `min_target`). The function then linearly "stretches" responses of `x` towards the different scale point format of the target question.
```{r}
#| code-fold: show
linear_stretching <- function(x,
max_source,
max_target,
min_source = 1,
min_target = 1){
x <- x - min_source #shift so that lowest score becomes 0
x <- x / (max_source - min_source) # scale to range from 0 to 1
x <- x * (max_target - min_target) # scale to range from 0 to target highest minus target lowest
x <- x + min_target # shift so that lowest becomes target lowest
return(x)
}
```
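Written as a single formula, the four steps of the function amount to the following, where the symbols $\min_s$, $\max_s$, $\min_t$, and $\max_t$ are shorthand introduced here for `min_source`, `max_source`, `min_target`, and `max_target`:

$$
x_{stretched}=\frac{x-\min_s}{\max_s-\min_s}\cdot(\max_t-\min_t)+\min_t
$$

For instance, a score of $x=2$ on a 5-point scale stretched towards a 7-point scale becomes:

$$
\frac{2-1}{5-1}\cdot(7-1)+1=0.25\cdot 6+1=2.5
$$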
Here is an example, where we stretch a 5-point response scale `x` towards a 7-point response scale format `y`.
```{r}
tibble(
x = c(1:5, NA, NA),
y = c(1:7),
x_stretched = linear_stretching(x,
max_source = 5, # original 5-point scale
max_target = 7, # 7-point target
min_source = 1,
min_target = 1)
) |>
pivot_longer(everything(), values_to = "scores", names_to = "variable") |>
ggplot(aes(scores, variable |> fct_rev()))+
geom_point(size = 2)+
geom_text(aes(label = scores), vjust = -0.5, size = 6)+
theme_minimal(base_size = 18)+
scale_y_discrete("")+
scale_x_continuous(breaks = c(1:7))
```
Other common transformations, such as **POMP** (percentage of maximum possible) or **scaling to a range from 0 to 1**, are simply variants of linear stretching.
POMP would transform `x` with the scores 1, 2, 3, 4, and 5 to these values:
```{r}
linear_stretching(c(1:5),
max_source = 5, # original 5-point scale
max_target = 100, # POMP target maximum of 100
min_source = 1,
min_target = 0) |>
paste0(collapse = "; ")
```
And scaling to a range from 0 to 1 would transform `x` with the scores 1, 2, 3, 4, and 5 to these values:
```{r}
linear_stretching(c(1:5),
max_source = 5, # original 5-point scale
max_target = 1, # target maximum of 1
min_source = 1,
min_target = 0)|>
paste0(collapse = "; ")
```
#### Applying linear stretching
Does this approach work? Let us find out by stretching all responses towards the format of question A.
If you will recall, these were the minimum and maximum response scores for each question:
```{r}
qlink_format_df |>
group_by(question) |>
summarise(min_score = min(response),
max_score = max(response))
```
In other words, we already see that only scores of "D" will be transformed at all.
```{r}
stretched_df <- qlink_format_df |>
mutate(
stretched_response = ifelse(question == "D",
linear_stretching(response, 10, 5),
response)
)
```
Let us look at the result visually:
```{r}
unharmonized_plot <- qlink_format_df |>
ggplot(aes(year))+
geom_smooth(aes(y = response, color = question), se = FALSE, method = "lm")+
scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
ggtitle("unharmonized")
stretched_plot <- stretched_df |>
ggplot(aes(year))+
geom_smooth(aes(y = stretched_response, color = question), se = FALSE, method = "lm")+
scale_y_continuous(limits = c(1,10), breaks = c(1:10), minor_breaks = NULL)+
theme_minimal()+
ggtitle("linear stretch")
# Combining both plots with patchwork::
unharmonized_plot + stretched_plot
```
Linear stretching did not harmonize responses in any meaningful way. The goal was the format of question "A", but all other questions deviate from that. Questions "B" and "C" even remain unchanged, because they were also 5-point scales. Question "D" was transformed, but not as intended.
# 04 QuestionLink step by step
After seeing the usefulness of QuestionLink, we hope that you are motivated to delve a bit deeper into how to use it.
In our earlier example, we chained all QuestionLink functions together in a single `|>` pipe. Now, however, we look at the functions step by step. We also look at two helper functions that will allow us to diagnose our data set and to plot equating opportunities.
## Wrangling our combined dataset
Before we can start out, we need to wrangle our combined data into the correct form.
This means:
\(1\) Bringing data into the so-called long format, where each row in the data frame is one response.
\(2\) Naming the columns along the convention that QuestionLink expects (mandatory: `year`, `question`, & `response`; optional: `weight` & `population`)
\(3\) Inverting response scores of inverted or reversed items (see section "inverted items").
\(4\) Removing cases with missing values (see section "missing values").
### Long format data
For QuestionLink, we decided on a format that most social scientists are well acquainted with: a data set in **long format**. For an in depth explanation of how to shape data into long format, see here: [R for Data Science (Wickham et al., 2023)](https://r4ds.hadley.nz/data-tidy#sec-pivoting)
The most common case is data that are spread across waves. Consider this toy example for data in **wide format**. Here, *each row is one respondent* with answers across three survey waves. However, for many analyses such as multilevel models (and for QuestionLink), we need to bring the data into long format where *each row is one observation*.
```{r}
wide_format_example <- tibble(
respondent = c(1:5),
year_2000 = c(2,3,4,5,6),
year_2002 = c(6,3,4,3,6),
year_2004 = c(2,1,4,5,6)
)
wide_format_example |> format_table()
```
In R (when using `tidyr` from the `tidyverse`), we can easily lengthen that data with `pivot_longer()`. Here we see the first six rows of the new data frame in long format. There are now three rows per respondent because we have one row per observation (i.e., one row per wave).
```{r}
wide_format_example |>
pivot_longer(cols = contains("year_"), names_to = "wave", values_to = "score") |>
head() |>
format_table()
```
In **QuestionLink** we need just such a long format structure, but with some additional information. Most importantly, for every observation (i.e., response) we need a variable that tells us with which question that response was collected. Conveniently, we already did this step while simulating our example data.
If you consider the data frame we had prepared, you will find that it is quite simple. Each row is a single response. The **three mandatory variables** for QuestionLink are thus the `year` in which the response was collected, the `question` that was used to collect the response, and the `response` as a raw, unharmonized score.
Additionally, QuestionLink accepts **two optional variables**: `weight` and `population`. We will learn more about them under "advanced use cases" later on. However, in short: `weight` allows us to weight responses so that the samples become more comparable across source surveys. An example might be design weights, such as the common practice in several German survey programs to oversample the new federal states of former East Germany. Meanwhile, `population` allows us to record the population from which a response was collected. QuestionLink then avoids matching responses across populations, because this would break the assumptions of observed score equating.
One last requirement for a harmonizable data frame is that it contains **no missing values**. This is easily realized in (tidyverse) R with `tidyr::drop_na()`. However, if there are many missing values in some or all source surveys, we recommend that you look at our in-depth discussion of missing values later on.
```{r}
qlink_format_df |> head() |> format_table()
```
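The requirements above (mandatory column names and no missing values) can be checked before calling any QuestionLink function. The following is a minimal base R sketch on a made-up data frame; `toy_long_df` and the other object names are illustrative, and `complete.cases()` is used here as a base R alternative to `tidyr::drop_na()`.

```{r}
# Made-up long-format data with one missing response
toy_long_df <- data.frame(
  year     = c(2000, 2000, 2002, 2002),
  question = c("A", "B", "A", "B"),
  response = c(3, NA, 4, 2)
)

required_cols_sketch <- c("year", "question", "response")

# 1) Are all mandatory columns present?
has_required_cols <- all(required_cols_sketch %in% names(toy_long_df))

# 2) Drop rows with missing values in any mandatory column
toy_clean_df <- toy_long_df[complete.cases(toy_long_df[required_cols_sketch]), ]

has_required_cols
nrow(toy_clean_df)
```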
## `ql_prepare` Finding harmonization connections
With a data frame in the correct format, we can begin harmonization by calling the QuestionLink function `ql_prepare`:
```{r}
#| code-fold: show
#| message: true
prepare_object <- qlink_format_df |> ql_prepare()
```
This function call results in a QuestionLink object that contains information about all opportunities to harmonize the questions into each other. However, at this step in the process the actual harmonization calculations have not been performed.
The object mainly contains the possible connections and the whole source data set!
We can extract the connections using the function `attr()` :
```{r}
prepare_object |>
attr("connections") |>
head() |>
format_table()
```
If you consider the first row, you see that we can harmonize "source" question B in the year 1980 towards the "target" question C, because both questions have data from that year.
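The idea of a direct connection can be sketched in a few lines of base R: pair up all question-year combinations that share the same year and keep the pairs with two different questions. This only illustrates the concept on made-up data and is not how `ql_prepare()` is implemented internally; the column names merely mimic those of the connections table.

```{r}
# Made-up survey waves: which question was asked in which year
toy_waves <- data.frame(
  question = c("B", "B", "C", "C", "C", "A"),
  year     = c(1980, 1982, 1980, 1982, 2000, 2000)
)

# Self-join on year: every combination of two questions asked in the same year
toy_pairs <- merge(toy_waves, toy_waves, by = "year")
names(toy_pairs) <- c("year", "source_question", "target_question")

# A connection needs two different questions
toy_direct <- toy_pairs[toy_pairs$source_question != toy_pairs$target_question, ]

toy_direct
```

Each same-year pair appears twice, once per direction, mirroring that every connection can be used to harmonize either question towards the other.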
A more convenient way to look at the connections QuestionLink has found is to use its `summary()` method:
------------------------------------------------------------------------
```{r}
#| message: true
prepare_object |> summary()
```
------------------------------------------------------------------------
We see several sections in the `summary()`:
1. "Summary: QuestionLink Prepare" shows where in the QuestionLink process we are. We also see the number of unique question combinations, which will be explained soon.
2. "Number of found connections" shows us that we found 34 "direct connections", meaning that two questions were used in exactly the same year in the same population. For now, the other two connection types are zero.
3. "Additional information" reminds us that we did not provide any of the two optional variables `weight` or `population`. For now, this is fine.
4. "Overview of question combinations" is the most interesting part of the output. It shows which questions we can harmonize with each other. It also details which connections we have to do so. Here we note that we currently do not harmonize all questions with all other questions. We will resolve this soon.
5. "Overview over response options found" shows which response values each question has. This is a good moment to reflect if this is consistent with what you expect. Here you might find negative values (if missings were not adequately coded) or other issues that need to be resolved first. However, for a more in depth check, try `ql_check` , which we will discuss later.
As we have seen under "Overview of question combinations", we have not yet fully harmonized all questions. Specifically, we cannot harmonize from `source_question` "A" towards `target_question` "B" and vice versa. We also cannot harmonize from `source_question` "C" towards `target_question` "D" and vice versa.
We can validate this by looking under the hood of the prepare object, which contains all found connections. Here we find only "direct" connections, because we have not allowed any of the more advanced QuestionLink features yet. Direct connections are instances where two questions are asked in the same year.
```{r}
prepare_object |>
attr("connections") |>
group_by(source_question,target_question,connection_type) |>
summarize(n = dplyr::n()) |>
pivot_wider(names_from = connection_type,
values_from = n) |>
ungroup() |>
format_table()
```
Our current problem is apparently that there are no direct connections between some of the questions. This becomes apparent if we again consider the years in which each question was used:
```{r}
qlink_format_df |>
plot_question_years()
```
We cannot harmonize directly between "A" and "B", because there is a long gap between their two time-series (from 1991 to 1999). We also cannot directly harmonize between "C" and "D" because waves asking question "C" occurred in even years and waves asking question "D" occurred in odd years.
Fortunately, QuestionLink can often compensate for both such issues. We just have to give it permission to do so.
### Bridging different survey rhythms with `time_relaxation`
It is not uncommon that surveys have slightly different wave rhythms. In our example, questions "C" and "D" share a long time series, but are slightly out of sync due to even and odd biennial rhythms. Our proposed solution for this problem is time relaxation, meaning that we allow equating connections between adjacent years. With `time_relaxation = 1`, for example, we would allow QuestionLink to harmonize responses to question C from 1990 with responses to question D from 1991. This may lead to some bias if constructs change quickly over time. Often, however, the error is negligible, especially if we have many waves. After all, in the current setup we would harmonize question C in 1990 not only to D in 1991 but also to D in 1989. Such symmetry will mitigate much of the bias introduced by time mismatch.
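The matching logic can be illustrated with a short base R sketch that uses the survey years of questions C (even years, 1980 to 2020) and D (odd years, 1985 to 2005) from our simulated data. It only demonstrates the idea of allowing a one-year mismatch; it makes no claim about how `ql_prepare()` searches for connections internally.

```{r}
years_c_relax <- seq(1980, 2020, by = 2)  # question C: even years
years_d_relax <- seq(1985, 2005, by = 2)  # question D: odd years

# Without relaxation: no year in which both questions were asked
same_year_relax <- intersect(years_c_relax, years_d_relax)

# With a relaxation of one year: all C-D pairs at most one year apart
year_pairs_relax <- expand.grid(year_c = years_c_relax, year_d = years_d_relax)
relaxed_pairs <- year_pairs_relax[
  abs(year_pairs_relax$year_c - year_pairs_relax$year_d) <= 1, ]

length(same_year_relax)
nrow(relaxed_pairs)

# Each wave of D is matched symmetrically to the C waves before and after it
relaxed_pairs[relaxed_pairs$year_d == 1991, ]
```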
We activate the feature by specifying it in `ql_prepare()` as `time_relaxation = 1`:
```{r}
#| code-fold: show
#| message: true
qlink_format_df |> ql_prepare(time_relaxation = 1)
```
Note that we now have "Time relaxed connections" aside from the "direct connections". And indeed, we can now harmonize "C" and "D". This can be easily verified by using `summary()`. However, to save space in this document, let us take a look directly at the relevant table:
```{r}
qlink_format_df |> ql_prepare(time_relaxation = 1) |>
attr("connections") |>
group_by(source_question,target_question,connection_type) |>
summarize(n = dplyr::n()) |>
pivot_wider(names_from = connection_type,
values_from = n) |>
ungroup() |>
format_table()
```
Note how there is now a second column. Aside from `direct` connections, we now also have `time_relaxation` connections. The pattern of `NA`s also shows us that some instruments can only be harmonized with direct, and others only with time relaxed connections.
### Bridging long gaps in timeseries with `use_relay`
Next, we want to bridge the gap between "A" and "B" by allowing relays, or in equating lingo: "chained equating". The idea is as follows: We might not have a connection from "A" to "B", but we have a connection from "A" to "C" and from "C" to "B". Thus we can use "C" as a relay, by "chaining the equating". This means nothing more than transforming scores from "A" into the format of "C" and then onwards to the format of "B". We enable this with the `ql_prepare()` argument `use_relay = TRUE`.
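To illustrate what chaining means, consider the following base R sketch with two made-up recoding tables. The values are hypothetical and not QuestionLink output, and the use of linear interpolation via `approx()` for non-integer intermediate scores is an assumption of this sketch; QuestionLink may handle that step differently.

```{r}
# Hypothetical equivalents: scores 1 to 5 of "A" expressed in the format of "C"
toy_a_to_c <- c(1.5, 2.5, 3.0, 4.0, 4.5)

# Hypothetical equivalents: scores 1 to 5 of "C" expressed in the format of "B"
toy_c_to_b <- c(1.0, 2.0, 3.5, 4.5, 5.0)

# Chain both steps: A -> C -> B. Intermediate C-equivalents need not be whole
# numbers, so we interpolate linearly between the tabulated C scores.
toy_a_to_b <- approx(x = 1:5, y = toy_c_to_b, xout = toy_a_to_c)$y

data.frame(score_a = 1:5, equivalent_c = toy_a_to_c, equivalent_b = toy_a_to_b)
```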
```{r}
#| code-fold: show
#| message: true
qlink_format_df |> ql_prepare(use_relay = TRUE)
```
Note how the function now displays not only "direct connections" but also "relay connections". And indeed, if we look at the possible connections again, we see that "A" and "B" can now be harmonized with `relay` connections:
```{r}
qlink_format_df |> ql_prepare(use_relay = TRUE) |>
attr("connections") |>
group_by(source_question,target_question,connection_type) |>
summarize(n = dplyr::n()) |>
pivot_wider(names_from = connection_type,
values_from = n) |>
ungroup() |>
format_table()
```
If we allow **both options** at the same time, we get all desired connections:
```{r}
qlink_format_df |>
ql_prepare(
time_relaxation = 1 ,
use_relay = TRUE) |>
attr("connections") |>
group_by(source_question,target_question,connection_type) |>
summarize(n = dplyr::n()) |>
pivot_wider(names_from = connection_type,
values_from = n) |>
ungroup() |>
format_table()
```
All that remains is to save the result into its own object for the next step:
```{r}
#| code-fold: show
prepared_connections <- qlink_format_df |>
ql_prepare(
time_relaxation = 1 ,
use_relay = TRUE)
```
## `ql_harmonize` Calculating the harmonization solutions
After finding all possible connections (i.e., opportunities for harmonization via equating), we need to actually calculate all those connections. To do this, we simply call `ql_harmonize` on the object returned by `ql_prepare`. Note that there are no additional arguments or settings here. This is because all harmonization parameters were defined in the prepare stage.
```{r}
#| message: true
#| code-fold: show
harmonization_solution <- prepared_connections |>
ql_harmonize()
```
Looking at the output, we see that we have now completed the second of three steps towards our data harmonization!
Before we move on, let us take a quick look at the object that `ql_harmonize` has created:
```{r}
harmonization_solution |> str(max.level = 1)
```
We see that the object contains several components. Usually, we do not need to deal with this level of details if we just want to harmonize our data. Nonetheless, there are two components that are important to understanding what `ql_harmonize` has done.
Firstly, `ql_harmonize` goes through every single connection that we have found with `ql_prepare`. These are stored in the attribute `connections_rec_tbl` within the object created by `ql_harmonize`. Below we look at one such connection (i.e., one row in `connections_rec_tbl` ):
```{r}
n_connections <- harmonization_solution |>
attr("connections_rec_tbl") |>
nrow()
harmonization_solution |>
attr("connections_rec_tbl") |>
head(1) |>
t() |>
format_table()
```
In other words, each of the `r n_connections` rows of that connection overview contains details on the connection (e.g., which questions and what type of connection) and most importantly: A recoding table with which to harmonize the two instruments.
However, what we want is only one robust recoding table for each instrument pair (e.g., "A" and "B") in both directions ("A" -\> "B" & "B" -\> "A"). To this end, `ql_harmonize` condenses all these `r n_connections` connections down. Specifically, QuestionLink chooses the median equivalent for each score across all harmonization solutions for a given instrument pair and direction (e.g., "A" -\> "B").
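The condensing step can be pictured with a small base R sketch. The three recoding tables below are made up for illustration (they are not QuestionLink output); each row stands for one connection from "A" to "B" and each column for one response score of "A".

```{r}
# Hypothetical equivalents of A's scores 1 to 5 in the format of B,
# one row per connection (e.g., one per year in which both were asked)
toy_equivalents <- rbind(
  connection_1 = c(1.2, 2.1, 3.0, 4.2, 4.9),
  connection_2 = c(1.0, 2.3, 3.1, 4.0, 5.0),
  connection_3 = c(1.4, 2.2, 3.3, 4.1, 4.8)
)
colnames(toy_equivalents) <- paste0("score_", 1:5)

# One robust recoding table: the median equivalent per source score
toy_median_equivalent <- apply(toy_equivalents, 2, median)

toy_median_equivalent
```

Taking the median rather than the mean keeps a single unusual connection (for example, one based on a small sample) from pulling the final recoding table away from the bulk of the solutions.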