This repository has been archived by the owner on Oct 1, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 44
/
08-B-spatial-regression.Rmd
251 lines (187 loc) · 12.7 KB
/
08-B-spatial-regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
---
layout: topic
title: "Spatial regression"
author: Jes & Marissa
output: html_document
---
**Assigned Reading:**
> Chapters 6 & 7 from: Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. 2009. *Mixed Effects Models and Extensions in Ecology with R.* Springer. [link](https://link-springer-com.stanford.idm.oclc.org/book/10.1007%2F978-0-387-87458-6)
**Optional Readings:**
> Beale, C. M., Lennon, J. J., Yearsley, J. M., Brewer, M. J. and Elston, D. A. 2010. Regression analysis of spatial data. *Ecology Letters* **13**: 246-264. [DOI: 10.1111/j.1461-0248.2009.01422.x](http://dx.doi.org/10.1111/j.1461-0248.2009.01422.x)
>
> Hawkins, B. A. 2012. Eight (and a half) deadly sins of spatial analysis. *Journal of Biogeography* **39**: 1-9. [DOI: 10.1111/j.1365-2699.2011.02637.x](http://dx.doi.org/10.1111/j.1365-2699.2011.02637.x)
```{r include = FALSE}
# This code block sets up the r session when the page is rendered to html
# include = FALSE means that it will not be included in the html document
# Write every code block to the html document
knitr::opts_chunk$set(echo = TRUE)
# Write the results of every code block to the html document
knitr::opts_chunk$set(eval = TRUE)
# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/08-B/")
# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```
### Key Points
**General ideas**
+ Spatial and temporal autocorrelation can be problematic because they violate the assumption that the residuals in regression are independent, which causes estimated standard errors of parameters to be biased and causes parametric statistics no longer follow their expected distributions (i.e. p-values are too low).
+ Spatial dependence can be induced by both extrinsic and intrisic factors.
+ Spatial dependence can be modeled in many ways (see Table 2 in Beale et al. 2010) but we will talk about:
+ GLS: spatial dependence modeled by fitting a parametric model to the residuals
+ Spatial filters (PCNM or MEM): spatial structure extracted using ordination of locations and spatial eigenvectors included as covariates
**Definitions**
*stationarity* : when the mean, variance and correlation structure do not vary across time or space. We assume stationarity in the residuals of regression models.
*isotropy* : when a process does not vary with direction (only with distance)
*lag* : the number of steps or links that separate two observations (an integer). For example, the lag between observations sampled in June and August would be 2, if samples are collected once a month, but it would be 1, if samples are collected every other month.
*autocorrelation function* : a function that gives the correlation between two observations as a function of the lag between them. Note: `acf()` takes a series of values and assumes that adjacent values are separated by a lag of 1.
*variogram* : a function that gives the variance between two observations as a function of the distance between them. A *semi-variogram* gives 1/2 of the variance (semivariance) and is usually what is given by stats programs. Note: variograms are parametric functions, whereas *empirical variograms* are estimated from data by grouping observations into short windows of the distance between them.
+ *sill* : the asymptotic value of the variance
+ *range* : the distance at which the variance reaches the asymptote
+ *nugget* : the variance between observations separated by a distance of zero (e.g. variance upon re-sampling)
**How to determine whether you should incorporate autocorrelation in a regression model**
1. Fit a model without autocorrelation.
2. Plot the residuals vs. the covariates along which you expect autocorrelation (e.g. time, x-y coordinates).
3. Plot the correlation among residuals vs. the distance between them. This could be the autocorrelation function, empirical variogram, or Moran's I at different distances (see Friday's reading).
4. Compare models that do and do not model the autocorrelation structure using AIC or other information criteria.
**Functions for modeling residual correlation structure**
| Name | `correlation =` argument in `gls()` | Explanation |
|-------------------|-----------------------------------|-------------|
| compound symmetry | `corCompSym(form =~ x)` | The residual correlation is the same among all observations regardless of the distance between them. |
| autoregressive order 1 (AR-1) | `corAR1(form =~ x)` | The residual correlation is the same ($\rho$) between all observations separated by a lag of 1 and the residual correlation between observations separated by a lag of $n$ is $\rho^n$. |
| autoregressive moving average ARMA(p,q) | `corARMA(start_p, start_q, p, q)` | An observation is a linear function of other observations with separate coeffients estimated for lags of $1$ to $p$ plus a linear function of the error of other observations with separate coefficients estimated for lags of $1$ to $a$ (effectively a more flexible model) |
| exponential correlation | `corExp(form =~ x + y)` | parametric model of the semi-variogram |
| Gaussian correlation | `corGaus(form =~ x + y)`| parametric model of the semi-variogram|
| linear correlation | `corLin(form =~ x + y)` | parametric model of the semi-variogram|
| rational quadratic correlation | `corRatio( form =~ x + y)` | parametric model of the semi-variogram|
| spherical correlation | `corSpher( form =~ x + y)` | parametric model of the semi-variogram|
Note:
+ The variogram models are parametrized by their sill, range, and nugget.
+ In the variogram models, distances are by default Euclidean ($\sqrt{x^2 + y^2}$) so you should not use latitude and longitude. UTM coordinates can be used instead since they are in meters.
+ If using a random-effects model, be sure to apply the correlation structure at the deepest level within groups so that the correlation between groups is zero. Do this using `form =~ x | group`.
### Analysis Example
```{r, setup, echo = T}
# load libraries and data
library(plyr)
library(dplyr)
library(rgdal)
library(magrittr)
AmazonasShape <- readOGR("data/am_municipios/13MUE250GC_SIR.shp")
YFcases = read.csv("data/cleanedYFV_enviro.csv") %>%
mutate(code = as.character(code))
```
#### Data Visualization
```{r, vis}
plot(AmazonasShape)
points(YFcases$x, YFcases$y, col = ifelse(YFcases$YFV > 0, "red", "black"), pch = 16)
AmazonasShape@data %<>% left_join(YFcases, by = c("CD_GEOCODM" = "code"))
spplot(AmazonasShape, "percent_forest", main = "Percent Forest Cover")
spplot(AmazonasShape, "percent_forest_loss", main = "Percent Forest Loss")
spplot(AmazonasShape, "avg.rainfall", main = "Average Monthly Rainfall")
spplot(AmazonasShape, "mean_elevation", main = "Mean Elevation")
```
#### Fit a model without autocorrelation
```{r, model_basic}
YF.glm = glm(YFVoccurence ~ percent_forest_loss + percent_forest +
mean_elevation + population + avg.rainfall,
family = binomial(link = "probit"),
data = YFcases)
summary(YF.glm)
drop1(YF.glm)
```
#### Detect spatial correlation
```{r, detecting_spatial_corr}
library(gstat)
temp_data = data.frame(error = rstandard(YF.glm), x = YFcases$x, y = YFcases$y)
coordinates(temp_data) <- c("x","y")
bubble(temp_data, "error", col = c("black","grey"),
main = "Residuals", xlab = "X-coordinates", ylab = "Y-coordinates")
plot(temp_data$error ~ temp_data$x, xlab = "X-coordinates", ylab = "Errors")
plot(temp_data$error ~ temp_data$y, xlab = "Y-coordinates", ylab = "Errors")
plot(variogram(error ~ 1, temp_data))
plot(variogram(error ~ 1, temp_data, alpha = c(0, 45, 90, 135)))
```
#### Construct models with correlation structure
```{r, model_spatial1, eval = F}
f1 = YFVoccurence ~ percent_forest_loss + percent_forest +
mean_elevation + population + avg.rainfall
model1 <- glm(f1, family = binomial(link = "logit"), correlation = corSpher(form =~ x + y, nugget = TRUE), data = YFcases)
```
The error from this code is `Error in corSpher(form = ~x + y, nugget = TRUE) :`
`could not find function "corSpher"`. This error message occurs because you can't use a spatial correlation structure with `glm`. As laid out in Zuur et al., correlation structure can be specified in `gls`, `gamm`, and `lme`, however `gls` and `lme` do not allow for specifying that the data are binomial.
#### Correlation structure using glmmPQL
F. Dormann, C., M. McPherson, J., B. Araújo, M., Bivand, R., Bolliger, J., Carl, G., G. Davies, R., Hirzel, A., Jetz, W., Daniel Kissling, W., Kühn, I., Ohlemüller, R., R. Peres-Neto, P., Reineking, B., Schröder, B., M. Schurr, F. and Wilson, R. (2007), Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography, 30: 609–628. [doi:10.1111/j.2007.0906-7590.05171.x](http://onlinelibrary.wiley.com/doi/10.1111/j.2007.0906-7590.05171.x/full)
The supplementary material of this paper suggests using the Generalized Linear Mixed Model function `glmmPQL` in the package `MASS` for binary data with a correlation structure. They note that this is a "cheat", as their suggestions is to add a group variable that is the same for all elements of the dataset. Then, glmmPQL can be run on the data even if random effects are not desired (glmmPQL requires random effect be specified).
```{r, model_spatial2}
library(MASS)
library(nlme)
YFcases <- cbind(YFcases, group = factor(rep("a",nrow(YFcases))))
# GLMM fits ----------------------
f1 = YFVoccurence ~ percent_forest_loss + percent_forest +
mean_elevation + population + avg.rainfall
model.base<- glmmPQL(f1, random=~1|group,
data=YFcases,
family=binomial(link = "probit"))
summary(model.base)
plot(Variogram(model.base), main = "No Correlation Structure")
#exponential correlation structure
model.1 <- glmmPQL(f1, random=~1|group,
data=YFcases,
correlation=corExp(form=~x+y, nugget = T),
family=binomial(link = "probit"))
summary(model.1)
plot(Variogram(model.1), main = "Exponential Correlation")
#Gaussian correlation structure
model.2 <- glmmPQL(f1, random=~1|group,
data=YFcases,
correlation=corGaus(form=~x+y, nugget = T),
family=binomial(link = "probit"))
summary(model.2)
plot(Variogram(model.2), main = "Gaussian Correlation")
#spherical correlation structure
model.3 <- glmmPQL(f1, random=~1|group,
data=YFcases,
correlation=corSpher(form=~x+y, nugget = T),
family=binomial(link = "probit"))
summary(model.3)
plot(Variogram(model.3), main = "Spherical Correlation")
#rational quadratic correlation structure
model.4 <- glmmPQL(f1, random=~1|group,
data=YFcases,
correlation=corRatio(form=~x+y, nugget = T),
family=binomial(link = "probit"))
summary(model.4)
plot(Variogram(model.4), main = "Rational Quadratic Correlation")
```
#### Correlation structure using gamm
Alternatively, if you think you need a smoother, you can you `gamm`. Here, we suppose that percent forest has a nonlinear effect on the chance of yellow fever occurence.
```{r, model_spatial_gamm, warning=FALSE, message=FALSE}
library(mgcv)
f2 = YFVoccurence ~ percent_forest_loss + s(percent_forest) +
mean_elevation + population + avg.rainfall
model2.base <- gamm(f2, method = "REML",
data=YFcases,
family=binomial(link = "probit"))
#exponential correlation structure
model2.1 <- gamm(f2, method = "REML",
data=YFcases,
correlation=corExp(form=~x+y, nugget = T),
family=binomial(link = "probit"))
#Gaussian correlation structure
model2.2 <- gamm(f2, method = "REML",
data=YFcases,
correlation=corGaus(form=~x+y, nugget = T),
family=binomial(link = "probit"))
#rational quadratic correlation structure
model2.4 <- gamm(f2, method = "REML",
data=YFcases,
correlation=corRatio(form=~x+y, nugget = T),
family=binomial(link = "probit"))
AIC(model2.base$lme, model2.1$lme, model2.2$lme, model2.4$lme)
```
One benefit of using `gamm` is that you have AIC values to compare the model fits while `glmmPQL` does not calculate AIC values.
### Discussion Questions
1) How should the variogram be interpreted if it is not a uniformly increasing band of points?
2) What data analyses do you foresee needing correlation structure for? Will the spatial/temportal dependence be induced by extrinsic and/or intrisic factors?
3) How do spatial and temporal correlation structures differ?
4) How can we detect when spatial/temportal correlation is caused by a missing covariate?