---
title: "Spatial regression analysis for the total number of covid-19 cases over the whole period"
author: "Sven Lautenbach"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
    gallery: TRUE
    lightbox: TRUE
    thumbnails: TRUE
    includes:
      after_body: footer.html
---
```{r setup, include=FALSE}
library(knitr)
library(rmdformats)
require(sf)
require(spdep)
require(ggpubr)
require(ggplot2)
require(tidyverse)
require(tmap)
require(GGally)
require(MASS)
## Global options
options(max.print = "140")
opts_chunk$set(echo = TRUE,
               cache = FALSE,
               prompt = FALSE,
               tidy = FALSE,
               comment = NA,
               message = FALSE,
               warning = FALSE)
opts_knit$set(width = 75)
set.seed(42)
```
```{r}
load("data/unionedListw_d_berlinNotSubdivided.Rdata")
load("data/kreiseWithCovidMeldedatumWeeklyPredictors.Rdata")
```
After the initial visual inspection of the temporal and spatial development of the covid-19 incidence at the district level in Germany and the analysis of global and local spatial autocorrelation, we now explore the association of the incidence rate with socio-economic predictors. We start with a GLM and check for spatial autocorrelation in the residuals. To deal with the spatial autocorrelation we use a spatial eigenvector approach. Furthermore, we investigate whether regression coefficients vary in space. We use the combined neighborhood definition from the previous chapter for the analysis.
# Explorative analysis
## Maps of response and predictors
For reference we create a series of maps for the response and all predictors that we want to use in the following. We also characterise the spatial pattern by means of global Moran's I.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() + tm_polygons(col=c("incidenceTotal"),
legend.hist = TRUE, palette="-plasma",
legend.reverse = TRUE,
                           title = "Covid-19 cases per\n100,000 inhabitants") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Incidence rate, total period") +
tm_scale_bar()
```
The accumulated number of covid-19 cases over the whole period is clearly spatially structured, as indicated by the empirical Bayes index modification of Moran's I.
```{r}
EBImoran.mc(n= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$sumCasesTotal,
x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$EWZ, listw = unionedListw_d, nsim =999)
```
```{r}
mForeigner <- kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Auslaenderanteil"),
legend.hist = TRUE,
legend.reverse = TRUE,
title = "Share of foreigners of inhabitants [%]",
style= "pretty") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Foreign population") +
tm_scale_bar()
print(mForeigner)
```
The share of foreigners in the population is spatially clustered, as indicated by Moran's I.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Auslaenderanteil,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that larger shares of foreigners indicate a larger share of inhabitants with limited skills in the German language and thereby lower access to health and social-distancing related information; districts with higher shares might therefore be associated with higher incidence rates.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Auspendler"),
legend.hist = TRUE,
legend.reverse = TRUE,
title = "Outgoing commuters as share\nof total employees",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Outgoing commuters") +
tm_scale_bar()
```
The share of outgoing commuters is randomly distributed in space.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Auspendler,
listw = unionedListw_d, nsim =999)
```
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Einpendler"),
legend.hist = TRUE,
legend.reverse = TRUE,
title = "Incoming commuters as share\nof total employees",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Incoming commuters") +
tm_scale_bar()
```
The share of incoming commuters is weakly clustered.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Einpendler,
listw = unionedListw_d, nsim =999)
```
The hypothesis for both incoming and outgoing commuters is that they indicate a spill-over of population between districts and thereby are associated with higher incidence rates.
```{r}
mRural <- kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Laendlichkeit"),
legend.hist = TRUE, palette = "Purples",
legend.reverse = TRUE,
              title = "Share of inhabitants in places\nwith less than 150 inh/sqkm",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
            main.title = "Rurality") +
tm_scale_bar()
print(mRural)
```
The share of inhabitants in places with less than 150 inhabitants per square kilometre is weakly spatially clustered.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Laendlichkeit,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that lower population densities and other modes of transport (less public transport) might lead to lower incidence rates.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Studierende"),
legend.hist = TRUE, palette = "Purples",
legend.reverse = TRUE,
title = "Students per \n1000 Inhabitants",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Students") +
tm_scale_bar()
```
The number of students per 1000 inhabitants is randomly distributed in space.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Studierende,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that students have on average higher contact rates (under non-lockdown conditions) and are also more mobile in space (e.g. moving between their place of study and the town where their parents live); districts with a higher share of students might therefore be associated with higher incidence rates.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Hochqualifizierte"),
legend.hist = TRUE,
legend.reverse = TRUE,
title = "Highly qualified employees/ total employees",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Highly qualified employees") +
tm_scale_bar()
```
The share of highly qualified employees (employees with Bachelor or Master degree) is weakly spatially structured.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Hochqualifizierte,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that employees with an academic degree more frequently work at home office and districts with a higher share are thereby associated with lower incidence rates.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Breitbandversorgung"),
legend.hist = TRUE, palette = "Purples",
legend.reverse = TRUE,
              title = "Share of households with internet \nconnectivity > 5 MBit/s",
style= "kmeans") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Highspeed internet") +
tm_scale_bar()
```
The share of households with internet connectivity greater than 5 MBit per second is weakly spatially structured.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Breitbandversorgung,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that districts with better internet access offer higher potential for working in home office and thereby are associated with lower incidence rates.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Langzeitarbeitslosenquote"),
legend.hist = TRUE, palette="Oranges",
legend.reverse = TRUE,
              title = "Long term unemployment rate [%]") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
            main.title = "Long term unemployment rate") +
tm_scale_bar()
```
The long term unemployment rate is clearly spatially clustered.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Langzeitarbeitslosenquote,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that the long term unemployment rate is associated with higher incidence rates as it is a general indicator of economic conditions. Poorer districts might face higher incidence rates due to denser living conditions and less access to information and resources.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded %>%
tm_shape() +
tm_polygons(col=c("Stimmenanteile.AfD"),
legend.hist = TRUE, palette="Blues",
legend.reverse = TRUE,
title = "Share of votes at national election 2017") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey",
outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE,
main.title = "Right-winged party (AfD)") +
tm_scale_bar()
```
The share of votes for the biggest right-wing party (AfD) at the 2017 national election was clearly spatially clustered. The spatial distribution shows a clear east-west pattern (following the former GDR border), with the highest values in rural districts in Saxony.
```{r}
moran.mc(x= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded$Stimmenanteile.AfD,
listw = unionedListw_d, nsim =999)
```
The hypothesis is that the share of right-wing voters is a proxy for the share of the population skeptical of social distancing, mask wearing and vaccination, and might therefore be associated with higher incidence rates at the district level.
## Scatterplot matrix
As a start we create a scatterplot matrix for the response and a number of potential predictors. For this plot we drop the sticky geometry column and select the variables of interest. For the meaning of the variables and their units we refer to the choropleth maps above.
```{r, fig.width=11, fig.height=11}
st_drop_geometry(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded) %>%
  dplyr::select(incidenceTotal, Langzeitarbeitslosenquote,
                Hochqualifizierte, Breitbandversorgung, Stimmenanteile.AfD,
                Studierende, Einpendler, Auspendler,
                Auslaenderanteil, Laendlichkeit) %>%
ggpairs()
```
We see some predictors correlating with the incidence rate as well as moderate collinearity between some of our predictors. A few observations stand out (see the scatterplot matrix for further details):

- The incidence rate is positively associated with the votes for the right-wing party and the share of foreigners at the district level.
- The long term unemployment rate is negatively associated with the share of outgoing and incoming commuters.
- The share of employees with an academic degree is positively associated with the availability of high-speed internet access, the number of students per population and the share of foreigners, and negatively associated with the share of outgoing commuters and the share of the population living in rural areas.
- The share of votes for the right-wing party is negatively associated with the availability of high-speed internet access and the share of foreigners.
- The share of foreigners is also negatively associated with the rurality of the district.
- The share of students is negatively associated with the share of outgoing commuters.
- The availability of high-speed internet access is negatively associated with rurality, the share of foreigners and the share of outgoing commuters, and positively associated with the share of students.
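The collinearity visible in the scatterplot matrix can also be quantified with variance inflation factors (VIF). The following is a minimal, self-contained sketch on simulated data (the variables `x1`, `x2`, `x3` are made up for illustration, not the district predictors):

```{r}
# VIF = 1 / (1 - R^2), where R^2 comes from regressing each predictor
# on all remaining predictors; values well above 1 indicate collinearity
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # moderately collinear with x1
x3 <- rnorm(n)                       # independent of the others
X <- data.frame(x1, x2, x3)

vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)  # x1 and x2 are inflated, x3 stays close to 1
```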
# Normal GLM
We start with a normal GLM before checking for spatial autocorrelation in the residuals. Since we have count data, a Poisson GLM with an offset for the population at risk seems a natural choice.
## Poisson GLM
```{r}
modelGlm <- glm(sumCasesTotal ~ Stimmenanteile.AfD +
Auslaenderanteil + Hochqualifizierte +
Langzeitarbeitslosenquote + Einpendler +
Studierende + Laendlichkeit +
offset(log(EWZ)),
data= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
family=poisson)
summary(modelGlm)
```
All selected predictors seem highly significant, but we need to investigate overdispersion.
```{r}
sumModelGlm <- summary(modelGlm)
(phi <- sumModelGlm$deviance/sumModelGlm$df.residual)
```
A quick check reveals serious overdispersion that needs to be taken into account.
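How the dispersion statistic reacts to overdispersion can be illustrated on simulated data (a self-contained sketch, not the district data):

```{r}
# phi = residual deviance / residual df; approximately 1 for a
# well-specified Poisson model, well above 1 for overdispersed counts
set.seed(42)
n <- 500
x <- rnorm(n)
mu <- exp(1 + 0.5 * x)
yPois <- rpois(n, mu)                   # equidispersed counts
yNb   <- rnbinom(n, size = 1, mu = mu)  # strongly overdispersed counts

phi <- function(m) summary(m)$deviance / summary(m)$df.residual
phi(glm(yPois ~ x, family = poisson))   # close to 1
phi(glm(yNb ~ x, family = poisson))     # far above 1
```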
## Negative binomial GLM to account for overdispersion
Since we should be suspicious of overdispersion, we will fit a negative binomial GLM and afterwards a quasi-Poisson GLM to account for it. Because the negative binomial GLM triggers some complications when used with spatial eigenvector mapping, we will continue with the quasi-Poisson model afterwards. While spatial eigenvector mapping can be used with a negative binomial GLM, we would need to write custom code for that; due to shortage of time we leave this for now.
```{r}
modelGlmNb <- glm.nb(sumCasesTotal ~ Stimmenanteile.AfD +
Auslaenderanteil + Hochqualifizierte +
Langzeitarbeitslosenquote + Einpendler +
Studierende + Laendlichkeit +
offset(log(EWZ)),
data= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded)
summary(modelGlmNb)
```
Considering overdispersion renders three regression coefficients non-significant. We will stepwise reduce model complexity based on the AIC. We use the convenience function *drop1*, which drops one term at a time and reports the AIC. Smaller AIC values indicate a better agreement of the model with the data.
$$ AIC = -2 \cdot \log L + 2 \cdot P$$
where $\log L$ is the maximized log-likelihood and $P$ indicates the number of parameters in the model.
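The definition can be checked against R's built-in `AIC()` on any fitted model; here a toy Poisson GLM on simulated data:

```{r}
set.seed(1)
x <- rnorm(100)
y <- rpois(100, exp(0.5 + 0.3 * x))
m <- glm(y ~ x, family = poisson)

ll <- logLik(m)
# -2 * log-likelihood + 2 * number of parameters
aicByHand <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
all.equal(aicByHand, AIC(m))  # TRUE
```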
```{r}
drop1(modelGlmNb)
```
Based on AIC we drop the share of incoming commuters.
```{r}
modelGlmNb <- update(modelGlmNb, ~ . - Einpendler)
drop1(modelGlmNb)
```
Next we drop the share of students.
```{r}
modelGlmNb <- update(modelGlmNb, ~ . - Studierende)
drop1(modelGlmNb)
```
```{r}
summary(modelGlmNb)
```
Rurality is only marginally significant. The direction of the effects is in line with our hypotheses, with the exception of the long term unemployment rate, which is associated with lower incidence rates.
## Quasi-Poisson GLM to account for overdispersion
As an alternative approach we use a quasi-Poisson GLM. As only a pseudo-likelihood is defined for the quasi distribution families, we can no longer use the AIC for model comparison. Instead we use an F-test to compare nested models.
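The F-test compares nested quasi-models via the change in deviance scaled by the estimated dispersion. A self-contained toy example (simulated counts, not the district data):

```{r}
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(1 + 0.4 * x1))  # x2 has no effect

full    <- glm(y ~ x1 + x2, family = quasipoisson)
reduced <- glm(y ~ x1, family = quasipoisson)
anova(reduced, full, test = "F")  # x2 should not improve the fit
```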
```{r}
modelGlmQp <- glm(sumCasesTotal ~ Stimmenanteile.AfD +
Auslaenderanteil + Hochqualifizierte +
Langzeitarbeitslosenquote + Einpendler +
Studierende + Laendlichkeit +
offset(log(EWZ)),
data= kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
family = quasipoisson)
summary(modelGlmQp)
```
The different distributional assumption leads to somewhat different regression coefficient estimates and standard errors.
```{r}
drop1(modelGlmQp, test = "F")
```
```{r}
modelGlmQp <- update(modelGlmQp, ~ . - Studierende)
drop1(modelGlmQp, test = "F")
```
Even if the share of incoming commuters is only marginally significant, we keep it for the moment.
```{r}
summary(modelGlmQp)
```
Explained deviance:
```{r}
1 - modelGlmQp$deviance / modelGlmQp$null.deviance
```
We end up with a model of decent quality. The directions of effect mostly align with expectations:
- A higher share of votes for the right-wing party is associated with higher incidence, presumably a proxy for the share of the population opposing mask wearing, social distancing and vaccination.
- A higher share of foreigners is associated with higher incidence. Foreigners might not be reached by information campaigns on social distancing due to language barriers.
- A higher share of highly qualified employees is associated with lower incidence rates. For those employees it might be easier to work from home and to avoid close contacts during work hours at the office.
- Higher rurality (a higher share of the population living in rural areas) is associated with lower incidence rates. This might be due to lower contact rates, e.g. through a lower share of public transport.
The long term unemployment rate and the share of incoming commuters are unexpectedly associated with lower incidence rates.
# Checking for spatial autocorrelation in the residuals
## Global Moran's I
For regression residuals we need to use a different test, as residuals are centered around zero and sum to zero.
```{r}
lm.morantest(modelGlmNb, listw = unionedListw_d)
```
```{r}
(moranGlmQp <- lm.morantest(modelGlmQp, listw = unionedListw_d))
```
Both models suffer from significant global spatial autocorrelation.
This implies that the usual assumption of independent errors is violated. In turn, our standard errors might be too low, p-values too small, and the size (and potentially even the sign) of the regression coefficients might be wrong. We therefore need to incorporate spatial autocorrelation in our analysis.
## Plot residuals
First we create centroids that we use in turn for a proportional symbol map of the residuals.
```{r}
kreiseCentroids <- st_centroid(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
of_largest_polygon = TRUE)
```
We add the residuals to the centroids for plotting. In addition to the (deviance) residuals we also store their absolute values, as these are needed to scale the symbols.
```{r}
kreiseCentroids$residGlmNb <- residuals(modelGlmNb)
kreiseCentroids$residGlmNbAbs <- abs(kreiseCentroids$residGlmNb)
kreiseCentroids$residGlmQp <- residuals(modelGlmQp)
kreiseCentroids$residGlmQpAbs <- abs(kreiseCentroids$residGlmQp)
```
The size of the symbols is taken from the absolute value while the color is assigned based on the deviance residuals.
```{r}
m1 <- tm_shape(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded) +
tm_polygons(col="grey") +
tm_shape(kreiseCentroids) +
tm_bubbles(size = "residGlmNbAbs", col= "residGlmNb", palette = "-RdBu",
alpha=.9, perceptual=TRUE, scale=.8, border.alpha=.3,
title.size = "Abs residual", title.col="Residuals", n=3) +
tm_layout(main.title = "Pearson residuals, GLM NB", bg="darkgrey",
legend.outside = TRUE, attr.outside = TRUE) +
tm_scale_bar()
m2 <- tm_shape(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded) +
tm_polygons(col="grey") +
tm_shape(kreiseCentroids) +
tm_bubbles(size = "residGlmQpAbs", col= "residGlmQp", palette = "-RdBu",
alpha=.9, perceptual=TRUE, scale=.8, border.alpha=.3,
title.size = "Abs residual", title.col="Residuals", n=3) +
tm_layout(main.title = "Pearson residuals, GLM QP", bg="darkgrey",
legend.outside = TRUE, attr.outside = TRUE) +
tm_scale_bar()
tmap_arrange(m1,m2)
```
As indicated by global Moran's I, we see that large positive and large negative residuals form clusters. The higher complexity of the quasi-Poisson GLM leads to smaller residuals for some districts.
# Spatial Eigenvector Mapping
The idea behind the spatial eigenvector mapping approach is to use additional covariates that absorb the spatial autocorrelation, leading to unbiased estimates for the other predictors. These additional covariates come from the eigenfunction decomposition of the spatial weight matrix $W$, which yields a set of mutually orthogonal eigenvectors. Eigenvectors with positive eigenvalues represent positive autocorrelation, whereas eigenvectors with negative eigenvalues represent negative autocorrelation. Only eigenvectors with positive eigenvalues are used for the selection.
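The construction can be sketched on a toy lattice: the Moran eigenvectors are the eigenvectors of the doubly centred weight matrix $(I - \mathbf{1}\mathbf{1}^T/n)\, W \,(I - \mathbf{1}\mathbf{1}^T/n)$. A minimal base-R illustration on a 4x4 rook-adjacency grid (an illustrative toy graph, not the district neighborhood structure):

```{r}
nr <- 4; nc <- 4; n <- nr * nc
W <- matrix(0, n, n)
idx <- function(r, c) (r - 1) * nc + c
# rook adjacency: horizontal and vertical neighbors on the grid
for (r in 1:nr) for (c in 1:nc) {
  if (r < nr) W[idx(r, c), idx(r + 1, c)] <- W[idx(r + 1, c), idx(r, c)] <- 1
  if (c < nc) W[idx(r, c), idx(r, c + 1)] <- W[idx(r, c + 1), idx(r, c)] <- 1
}

M <- diag(n) - matrix(1 / n, n, n)  # centering matrix I - 11'/n
eig <- eigen(M %*% W %*% M, symmetric = TRUE)
# candidates for selection: eigenvectors with positive eigenvalues
sum(eig$values > 1e-9)
```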
## Selection of eigenvectors
The function *ME* uses brute force eigenvector selection to reach a subset of such vectors to be added to the RHS of the GLM model to reduce residual autocorrelation to below the specified alpha value (defaults to 0.05). Since eigenvector selection only works on symmetric weights, the weights are made symmetric beforehand.
```{r}
meQp <- spatialreg::ME(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte + Langzeitarbeitslosenquote +
Einpendler + Laendlichkeit,
family = quasipoisson,
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
offset = log(EWZ), listw = unionedListw_d)
```
We refit the GLM incorporating the selected spatial eigenvectors; they are added via *fitted(meQp)*.
```{r}
modelSevmQp <- glm(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte + Langzeitarbeitslosenquote +
Einpendler + Laendlichkeit + fitted(meQp),
family = quasipoisson,
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
offset = log(EWZ))
summary(modelSevmQp)
```
The procedure added `r ncol(fitted(meQp))` spatial eigenvectors to the model. Together this leads to a more than satisfying amount of explained deviance. However, we need to keep in mind that a good share of it comes from the spatial eigenvectors.
```{r}
1 - modelSevmQp$deviance / modelSevmQp$null.deviance
```
Rurality of the district and the share of incoming commuters have now become insignificant, so we drop them step by step from the model.
```{r}
meQp <- spatialreg::ME(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte + Laendlichkeit,
family = quasipoisson,
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
offset = log(EWZ),
listw = unionedListw_d)
```
Refitting GLM under incorporation of the selected spatial eigenvectors:
```{r}
modelSevmQp <- glm(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte + Laendlichkeit + fitted(meQp),
family = quasipoisson, offset = log(EWZ),
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded)
summary(modelSevmQp)
```
```{r}
meQp <- spatialreg::ME(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte,
family = quasipoisson,
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded,
offset = log(EWZ),
listw = unionedListw_d)
```
Refitting GLM under incorporation of the selected spatial eigenvectors:
```{r}
modelSevmQp <- glm(sumCasesTotal ~ Stimmenanteile.AfD + Auslaenderanteil +
Hochqualifizierte + fitted(meQp),
family = quasipoisson, offset = log(EWZ),
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded)
summary(modelSevmQp)
```
The set of selected eigenvectors changed when we dropped the predictors.
## Plotting selected spatial eigenvectors
Presumably we have missed a number of other predictors whose effects are now partially absorbed by different eigenvectors. It might be worth plotting and investigating the eigenvectors that made it into the model. Therefore, we attach the selected eigenvectors to the sf object and plot them.
```{r}
summary(fitted(meQp))
sevQp <- fitted(meQp)
kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem <- st_sf(data.frame(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded, sevQp))
```
```{r}
tm_shape(kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem) +
tm_polygons(col= colnames(sevQp), palette = "-RdBu", lwd=.5,
n=6, midpoint=0, legend.show = FALSE) +
tm_layout(main.title = "Selected spatial eigenvectors", legend.outside = TRUE,
attr.outside = TRUE, panel.show = TRUE,
panel.labels = colnames(sevQp)) +
tm_scale_bar()
```
The spatial eigenvectors included in the model capture, broadly speaking:
- north-west gradients
- differences between regions in the northern part
- patterns that involve Mecklenburg-Vorpommern, Saxony and parts of Bavaria - some of these might remind us of clusters we have seen in the local Moran's I analysis
The spatial eigenvectors might help us derive hypotheses about missing covariates that could be incorporated in the model. In any case they absorb a good share of the spatial autocorrelation in the residuals.
### Rechecking spatial autocorrelation
```{r}
(moranSevmQp <- lm.morantest(modelSevmQp, listw = unionedListw_d))
```
We see that there is still some leftover spatial autocorrelation not absorbed by the spatial eigenvectors. However, the amount of spatial autocorrelation has been substantially reduced, from `r round(as.numeric(moranGlmQp$estimate[1]),2)` to `r round(as.numeric(moranSevmQp$estimate[1]),2)`.
```{r}
kreiseCentroids$residSevmQp <- residuals(modelSevmQp)
kreiseCentroids$residSevmQpAbs <- abs(kreiseCentroids$residSevmQp)
tm_shape(kreiseWithCovidMeldeWeeklyCleanedPredictorsAdded) +
tm_polygons(col="grey") +
tm_shape(kreiseCentroids) +
tm_bubbles(size = "residSevmQpAbs", col= "residSevmQp", palette = "-RdBu",
alpha=.9, perceptual=TRUE, scale=.8, border.alpha=.3,
title.size = "Abs residual", title.col="Residuals", n=5) +
tm_layout(main.title = "Pearson residuals, SEVM QP", bg="darkgrey",
legend.outside = TRUE, attr.outside = TRUE) +
tm_scale_bar()
```
# Spatial varying coefficients?
## Share of foreigners
We can use the eigenvectors to analyse whether regression coefficients (and thereby the effect of predictors) vary in space. We will demonstrate this using the share of foreigners as an example.
```{r}
modelSevmQpInt <- glm(sumCasesTotal ~ Stimmenanteile.AfD + Hochqualifizierte +
(vec1 + vec2 + vec4 + vec5 + vec6 + vec10 +
vec11 + vec18 + vec19 + vec21) * Auslaenderanteil,
family = quasipoisson, offset = log(EWZ),
data = kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem)
summary(modelSevmQpInt)
```
```{r}
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec18:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec11:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec19:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec4:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec5:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec1:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
To avoid including too many eigenvectors in the interaction, we drop vec10 from it as well.
```{r}
modelSevmQpInt <- update(modelSevmQpInt, ~. - vec10:Auslaenderanteil)
drop1(modelSevmQpInt, test="F")
```
```{r}
summary(modelSevmQpInt)
```
```{r}
1 - modelSevmQpInt$deviance / modelSevmQpInt$null.deviance
```
Let's map the resulting regression coefficient for the share of foreigners.
First we map the interaction between the share of foreigners and each of the three remaining eigenvectors.
```{r, fig.width=11, fig.height=6}
mForeignerVec2 <- kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem %>%
  mutate(vec2foreigner = vec2 * Auslaenderanteil) %>%
  tm_shape() + tm_polygons(col=c("vec2foreigner"),
legend.hist = TRUE, palette="-RdBu",
style = "fisher", n = 6, midpoint = 0,
legend.reverse = TRUE,
title = "Share foreigners moderated by eigenvector 2") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE, legend.outside.position = "left",
main.title = "Share foreigners by sev 2") +
tm_scale_bar()
mForeignerVec6 <- kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem %>%
  mutate(vec6foreigner = vec6 * Auslaenderanteil) %>%
  tm_shape() + tm_polygons(col=c("vec6foreigner"),
legend.hist = TRUE, palette="-RdBu",
style = "fisher", n = 6, midpoint = 0,
legend.reverse = TRUE,
title = "Share foreigners moderated by eigenvector 6") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE, legend.outside.position = "left",
main.title = "Share foreigners by sev 6") +
tm_scale_bar()
mForeignerVec21 <- kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem %>%
  mutate(vec21foreigner = vec21 * Auslaenderanteil) %>%
  tm_shape() + tm_polygons(col=c("vec21foreigner"),
legend.hist = TRUE, palette="-RdBu",
style = "fisher", n = 6, midpoint = 0,
legend.reverse = TRUE,
title = "Share foreigners moderated by eigenvector 21") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE, legend.outside.position = "right",
main.title = "Share foreigners by sev 21") +
tm_scale_bar()
tmap_arrange(mForeignerVec2, mForeigner, mForeignerVec6, mForeignerVec21,
ncol = 2, nrow=2)
```
We can calculate the resulting regression coefficient for each district.
$$ y = \beta_1 \cdot Auslaenderanteil + \beta_2 \cdot Auslaenderanteil \cdot vec2 + \beta_3 \cdot Auslaenderanteil \cdot vec6 + \beta_4 \cdot Auslaenderanteil \cdot vec21 + \ldots $$
$$ = Auslaenderanteil \cdot (\beta_1 + \beta_2 \cdot vec2 + \beta_3 \cdot vec6 + \beta_4 \cdot vec21) + \ldots $$
where $(\beta_1 + \beta_2 * vec2 + \beta_3 * vec6 + \beta_4 * vec21)$ represents the resulting regression coefficient per district.
```{r}
kreiseWithCovidMeldeWeeklyCleanedPredictorsAddedSvem %>%
mutate(coefForeigner = coef(modelSevmQpInt)["Auslaenderanteil"] +
vec2*coef(modelSevmQpInt)["vec2:Auslaenderanteil"] +
vec6*coef(modelSevmQpInt)["vec6:Auslaenderanteil"] +
vec21*coef(modelSevmQpInt)["vec21:Auslaenderanteil"] ) %>%
tm_shape() + tm_polygons(col = "coefForeigner",
title = "Regression coefficient") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE, legend.outside.position = "right",
main.title = "Spatial varying regression coefficient for share foreigners") +
tm_scale_bar()
```
We see that the share of foreigners is always associated with higher incidence rates, but that the strength of this relationship varies in space. This might reflect the different communities formed by foreigners (differences in education, integration, language, ...) as well as how well the different groups are addressed by e.g. the health administration.
The effect on incidence rates is largest in Hamburg and adjacent districts, in Berlin, and in Cottbus.
Of course there is much more potential to explore relationships between predictors and spatial eigenvectors, and we have not even touched on interactions between "normal" predictors or higher order effects. We could (and should) try other predictor-eigenvector combinations as well as interactions between the predictors.