This repository has been archived by the owner on Oct 1, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 44
/
04-B-binary-data.Rmd
288 lines (161 loc) · 9.2 KB
/
04-B-binary-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
layout: topic
title: "GLM with count, binary, and proportional data"
author: Tad and Meghan Howard
output: html_document
---
**Assigned Reading:**
> Chapters 9 & 10 from: Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. 2009. *Mixed Effects Models and Extensions in Ecology with R.* Springer. [link](https://link-springer-com.stanford.idm.oclc.org/book/10.1007%2F978-0-387-87458-6)
```{r include = FALSE}
# This code block sets up the r session when the page is rendered to html
# include = FALSE means that it will not be included in the html document
# Write every code block to the html document
knitr::opts_chunk$set(echo = TRUE)
# Write the results of every code block to the html document
knitr::opts_chunk$set(eval = TRUE)
# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/04-B/")
# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```
### Key Points
#### Poisson GLM for count data, without overdispersion
* `family = poisson`
* Model selection: AIC or hypothesis testing (z-statistics, `drop1()`, `anova()`)
* Model validation: Use normalized (or Pearson) residuals (as in Ch 4) or deviance residuals (default in R), which give similar results (except for zero-inflated data).
* Overdispersion (variance is larger than mean): Needs correction when Phi (= D/(n-P)) > 1.5, and quick fix is to use `family = quasipoisson`, but cost is that se of parameters will be multiplied by sqrt(Phi).
#### Negative binomial GLM for count data, with overdispersion
+ Use when Phi > 15
+ `glm.nb()` in `library(MASS)` (Modern Applied Statistics with S)
+ Advantage of NB over quasipoisson: `step()` and `stepAIC()` can be used for model selection
+ There can be overdispersion in NB GLM, but options for fixing it are scarse in R.
+ Offset: equation 9.18 on p. 240
`L0 <- glm(Copepod ~ offset(LVol) + fStation, family = poisson, data = Lice)`
#### Bernoulli GLM for binary (presence-absence) data
+ Table 10.1: getting rid of lower (0) and upper (1) bounds of probabilities
+ `family = binomial`
* `family = binomial(link="probit")`
* `family = binomial(link="cloglog")` - when there are many zeros or many ones
+ Bernoulli GAM (Fig 10.6)
#### Binomial GLM for proportional data
+ Model on p. 255: Yi ~ N(ni, pii)
+ `family=quasibinomial` for overdispersed data
***
### Analysis Example
#### My data:
I sampled mosquitoes in 38 sites of varying vegetation cover and microclimate. Given what we know about mosquito responses to microclimate and habitat requirements, I expect that variation in vegetation and microclimate should impact suitability of sites for specific mosquito species, and ultimately impact which/how many mosquitoes are found in a site. I am most interested in *Aedes albopictus*, the main disease vector species found in this study.
The data for the purpose of this exercise include:
+ Presence of *Aedes albopictus* in sites
+ Proportion of *Aedes albopictus* out of total mosquitoes trapped
+ Mosquito abundance data
+ Covariates of interest: vegetation cover, average annual minimum temperature
I want to investigate the relationship between my environmental covariates and 1) the presence of *Aedes albopictus*, and 2) the proportion of *Aedes albo* individuals out of the total trap count of mosquitoes.3) counts of *Aedes albo*
### GLM for Presence/Absence Data
Students in the course can download the data from our Canvas site into a data folder in their working directory.
First, I'll read in the data:
```{r results = 'hide'}
# Read in my data from data folder in your working directory
aedes_dat <- read.csv("data/AedesData_EcoStats_Example.csv")
attach(aedes_dat)
```
Next, I want to do some quick data exploration:
```{r results = 'hide'}
hist(Proportion, breaks=16, xlab = "aedes proportion", freq = FALSE)
hist(aedes, breaks=16, xlab = "vegetation cover", freq = FALSE)
dotchart(aedes, xlab = "Aedes count",
ylab = "Order of data", labels = F)
dotchart(Proportion, xlab = "Proportion Aedes",
ylab = "Order of data", labels = F)
plot(Vegetation,Presence)
plot(TempMin, Presence)
plot(Vegetation,Proportion)
#quick visual check of veg/temp covariation
plot(Vegetation,TempMin)
cor(Vegetation,TempMin) #Pearson correlation coefficient (default)
```
![](images/04-B/aedes_count.png)
The count and proportion data are definitely zero-inflated. The highest values that are up for evaluation as outliers are not considerably larger than the others, so I am going to keep them.
I am going to try fitting a binomial glm for the presence/absence data using vegetation cover and minimum temp. I will use the standard link function (logit). Both logit and probit link functions assume that you have approximately an equal
number of zeros and ones...and I do!
```{r warning = FALSE}
model1 <- glm(Presence ~ Vegetation + TempMin, family = binomial(link = "logit"),
data = aedes_dat)
summary(model1)
#load car library for easy VIF function
library(car)
vif(model1)
```
Looks like minimum temperature is not significant when Vegetation is included. I have a p-value < 0.05 for vegetation, so I'll drop minimum temperature in the model.
```{r}
model2 <- glm(Presence ~ Vegetation, family = binomial(link = "logit"),
data = aedes_dat)
summary(model2)
```
And now I can plot the predicted relationship.
```{r results = 'hide'}
#create new covariate values within observed range for predictions
pred1 <- data.frame(Vegetation = seq(from = 20,
to = 95, by = 5))
Pred <- predict(model2, newdata = pred1, type = "response")
plot(x = Vegetation, y = Presence)
lines(pred1$Vegetation, Pred)
```
![](images/04-B/binom_fit.png)
I can use a likelihood ratio test to determine whether the model without minimum temperature is better.
```{r}
# LR test by dropping each term shows that Vegetation should be retained.
drop1(model1, test = "Chi")
```
### GLM for Proportional Data
I'll now try using the response variable that is the proportion of *Ae. albo* out of the whole sample of mosquitoes per site. I'm still going to stick with vegetation cover in a simple regression.
![](images/04-B/prop_veg.png)
```{r}
model3 <- glm(Proportion ~ Vegetation, family = binomial,
data = aedes_dat)
summary(model3)
```
Big realization: I probably need to deal with nonlinear responses in my data. I expect that there will be a density of *Aedes* around species-specific optimal conditions.
*Friends, What's next on this front?*
In class we discussed using a quadratic term or GAM.
### GLM for count data
I am also interested in general changes in mosquito abundance across the land use gradient. A few studies have shown an increase in overall abundance with forest cover.
Again, I'll start with data exploration and will take a look at the total mosquito abundance data.
```{r results = 'hide'}
hist(Total, breaks=50, xlab = "Total mosquitoes", main = NULL, freq = FALSE)
dotchart(Total, xlab = "Total mosquitoes",
ylab = "Order of data", labels = F)
plot(Vegetation,Total)
```
![](images/04-B/total_hist.png)
![](images/04-B/dot_total.png)
![](images/04-B/veg_total.png)
So I've got some really big values at a few sites that had tons of mosquitoes, and they could have a huge influence on my model. *What to do?*
```{r}
total.mod1 <- glm(Total ~ Vegetation, family = poisson,
data = aedes_dat)
summary(total.mod1)
```
As expected, I have a significant result. I can calculate the explained deviance as the authors do in section *9.5.3*, and if this were an accepted result, I would find that vegetation explains 37.8% of the variation in mosquito abundance.
I'll plot the prediction...
```{r results = 'hide'}
Pred2 <- predict(total.mod1, newdata = pred1, type = "response")
plot(x = Vegetation, y = Total)
lines(pred1$Vegetation, Pred2)
```
![](images/04-B/total_pred.png)
Given that the dispersion parameter is 1, this suggests that I do not have to worry about overdispersion.
Next, validation:
*We need to take the residuals of choice (e.g. deviance) and plot them against (i) the fitted values, (ii) each explanatory variable in the model, (iii) each explanatory variable not in the model (the ones not used in the model, or the ones dropped during the model selection procedure), (iv) against time, and (v) against spatial coordinates, if relevant. We do not want to see any patterns in these graphs. If we do, then there is something wrong, and we need to work out what it is.*
### Discussion Questions
**Q1:** How to validate a binomial GLM model?
**Q2:** What are potential reasons for choosing between a quasi-Poisson model or negative binomial model to deal with overdispersion?
**Q3:** I thought these 2 chapters had a lot of information to take in...what are *your* questions?
****
### After-class follow-up
#### Quantile regression
+ Here is a package "quantreg" and a vignette:
https://cran.r-project.org/web/packages/quantreg/index.html
https://cran.ms.unimelb.edu.au/web/packages/quantreg/vignettes/rq.pdf
+ This looks like a good introduction, although not so recent:
http://onlinelibrary.wiley.com/doi/10.1890/1540-9295(2003)001[0412:AGITQR]2.0.CO;2/abstract