-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
398 lines (317 loc) · 14.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
---
title: "Ordering regression coef plots"
output: rmarkdown::github_document
---
Function that orders values of coefficients from regression models, so the complex multiple models can be plotted at the same time and compared. My theme of choice is [hrbrthemes](https://github.com/hrbrmstr/hrbrthemes). Although I do not know how to include label for reference category below variable name (ideally in smaller font size, or italic) but that it remains left-aligned.
```{r setup, echo=FALSE}
knitr::opts_chunk$set(fig.path = "README_figs/README-")
```
<p float="left">
<img src="README_figs/old.png" width=49% />
<img src="README_figs/new.png" width=49% />
</p>
It still requires a lot of fixing with label names, as can be seen, but I am working on making it into another function.
Function is located in 10-plot-sorting-function.R, and there are two samples of how to use it.
# Here is how to do it.
```{r, echo=FALSE}
sort_models <- function(MDS) {
######## REORDERING FACTORS #######
# Function that reorders factors in multiple models for
# better visualisation
# Required var names:
# - 'full_factor' output name of the factor from lm(...);
# - 'var' name of the variable - it has to be ordered! (alphabetic~ascending);
# - 'factor' probably more precise of the level,
# it is not used in this step, since it is possible that two
# same levels for two different variables exist;
# - 'AME' either average marginal effect or coefficient, numeric.
## 1 # assign the number of models each factor is involved to var n_mod
MDS$n_mod <- 0
TMP <- MDS %>%
group_by(full_factor) %>%
summarise(count = n())
TMP <- as.data.frame(TMP)
# because plot will be rotated, reverse the numbers so the factor with most
# models has the lowest value - thus will be on the bottom of the plot
TMP$count <- (TMP$count - (max(TMP$count) + 1)) * (-1)
for (i in c(1:nrow(TMP))) {
MDS[MDS$full_factor == TMP[i, 1], "n_mod"] <- as.numeric(TMP[i, 2])
}
## 2 # now order according to mean estimator for each factor to n_fact
MDS$n_fact <- 0
TMP <- MDS %>%
group_by(full_factor) %>%
summarise(mean = mean(AME))
TMP <- data.frame(full_factor = TMP$full_factor,
mean = TMP$mean)
# take ranking, i.e. order of each mean
TMP$ordr <- rank(TMP$mean)
for (i in c(1:nrow(TMP))) {
MDS[MDS$full_factor == TMP[i, 1], "n_fact"] <- as.numeric(TMP[i, 3])
}
## 3 # finally, the same just for whole variables
MDS$n_var <- 0
TMP <- MDS %>%
group_by(var) %>%
summarise(mean = mean(AME))
TMP <- data.frame(var = TMP$var,
mean = TMP$mean)
TMP$ordr <- rank(TMP$mean)
for (i in c(1:nrow(TMP))) {
MDS[MDS$var == TMP[i, 1], "n_var"] <- as.numeric(TMP[i, 3])
}
### ordr # NOW CREATER ONE VARIABLE
# it synthesises all the previous rankings according to priority
MDS$ordr <- (MDS$n_mod * 10000) + (MDS$n_var * 100) + MDS$n_fact
#MDS$ordr <- as.integer(MDS$ordr) # this is probably unnecessary transformation
#MDS$ordr <- as.factor(MDS$ordr) # this too
# this is final variable that will be used for "relevel" in ggplot
MDS$ordr <- as.integer(rank(MDS$ordr))
### v_ordr # REORDER LEVELS WITHIN var VARIABLE
# first position of each variable according to model and mean estimators
MDS$v_ord <- (MDS$n_mod * 10000) + (MDS$n_var * 100)
MDS$v_ord <- as.integer(MDS$v_ord)
# reduce the repeated values
MDS_n <- data.frame(var = MDS$var,
v_ord = MDS$v_ord)
MDS_n <- unique(MDS_n)
# sort according to alphabet (the way they are in the original DF)
MDS_n <- MDS_n[order(MDS_n$var), ]
# extract numeric factor for releveling of the variable factor, it has to be
# descenting order, because plot is rotated. And relevel factor.
reord_lev <- order(MDS_n$v_ord, decreasing = TRUE)
MDS$var <- factor(MDS$var, levels(MDS$var)[reord_lev])
return(MDS)
}
```
## Libraries and creating models
```{r, message=FALSE}
library(tidyverse)
library(car)
library(margins)
library(hrbrthemes)
```
Data is from GSS, however, .sav file is too large for github, and saved in .Rdata is data already prepared for OLS in GSS_14 (20-importing-and-cleaning-data.R is how I got there).
```{r}
load("GSS14.Rdata")
## Regression -------------------------------------------------------
GSS_14$r_res16_c <- relevel(GSS_14$r_res16, ref = "CITY")
GSS_14$polviews_m <- relevel(GSS_14$polviews, ref = "MODERATE")
GSS_14$r_religion_m <- relevel(GSS_14$r_religion, ref = "MOD&LIB")
clm_small <- lm(criteria_plus ~ age + sex + race + immigrant +
veteran + prt_ba + r_degree +
r_coninc + r_srcbelt, data = GSS_14)
clm_small_res16 <- update(clm_small, ~ . + r_res16_c)
clm_full <- update(clm_small_res16, ~ . + r_religion_m + r_region_4 +
polviews_m)
```
## Extract values and prepare labels, names, etc.
Now, extract the data from models. It's much easier with `tidy` from `broom` package (e.g. `mc_s <- tidy(clm_small)`), but I started with `margins` for some reason. It's just much slower, but function works best that way now. I'll be making it so it can work either way, once I finish more elegant way of renaming of variables and levels.
In total, we need five variables: `full_factor` output name of the factor from lm(...);
`var` name of the variable - it has to be ordered (alphabetic~ascending); `factor` probably more precise of the level, it is used in plotting and not sorting, since it is possible that two same levels for two different variables exist; `AME` and `lower` and `upper`.
```{r}
#### Extract coefficients (ame) ------------------------------------------------
# mc_s - small model; mc_r - model with childhood place; mc_f - full model
mc_s <- as.data.frame(summary(margins(clm_small)))
mc_r <- as.data.frame(summary(margins(clm_small_res16)))
mc_f <- as.data.frame(summary(margins(clm_full)))
#### Fix the labels and prepare them for ordering and poloting ----------------
# copy of the full name of factors from lm models which will be used later
mc_s$full_factor <- mc_s$factor
mc_r$full_factor <- mc_r$factor
mc_f$full_factor <- mc_f$factor
# create new variable with names of the variables from the model
mc_s$var <- mc_s$factor
mc_r$var <- mc_r$factor
mc_f$var <- mc_f$factor
# asing model names, so they can be distinguished
mc_s$model <- as.factor(c("Reduced"))
mc_r$model <- as.factor(c("W/ childhood"))
mc_f$model <- as.factor(c("Large"))
# first, variable names
mc_s$var <- c(
"Age", "Immigrant", "Parents' education", "Income",
"Education", "Education", "SMSA", "SMSA", "SMSA", "Race",
"Race", "Sex", "Veteran"
)
mc_r$var <- c(
"Age", "Immigrant", "Parents' education", "Income", "Education",
"Education", "Childhood", "Childhood", "SMSA", "SMSA", "SMSA",
"Race", "Race", "Sex", "Veteran"
)
mc_f$var <- c(
"Age", "Immigrant", "Political views", "Political views",
"Political views", "Political views", "Political views",
"Political views", "Parents' education", "Income", "Education",
"Education", "Region", "Region", "Region", "Religion",
"Religion", "Religion", "Religion", "Religion", "Childhood",
"Childhood", "SMSA", "SMSA", "SMSA", "Race", "Race", "Sex", "Veteran"
)
## remove variable name from the name of the factor
mc_s$factor <- c(
"One year older", "Yes", "Either has BA", "Increase of $20,000",
"BA or more", "HS or JC", "Other urban", "Rural",
"Suburban (top 100)", "Black", "Other", "Female", "Yes"
)
mc_r$factor <- c(
"One year older", "Yes", "Either has BA", "Increase of $20,000",
"BA or more", "HS or JC", "Large city or suburbs", "Rural",
"Other urban", "Rural", "Suburban (top 100)", "Black",
"Other", "Female", "Yes"
)
mc_f$factor <- c(
"One year older", "Yes",
"Conservative", "Extremly Liberal",
"Extremly Conservative", "Liberal",
"Slightly Conservative", "Slightly Liberal",
"Either has BA", "Increase of $20,000",
"BA or more", "HS or JC",
"Northeast", "South",
"West", "Cath. or orth.",
"Christian NGG", "Luth., Epsic., or Morm.",
"None, other, or Jewish", "Sect. or Bapt.",
"Large city or suburbs","Rural",
"Other urban", "Rural",
"Suburban (top 100)",
"Black","Other",
"Female", "Yes"
)
```
Now, we combine all three models into one, so they can be ordered and plotted.
```{r}
MB <- rbind(mc_f, mc_s)
MB <- rbind(MB, mc_r)
## make all the labels to be factors
MB$full_factor <- as.factor(MB$full_factor)
MB$factor <- as.factor(MB$factor)
MB$var <- as.factor(MB$var)
```
## Use the function and sort
And we just call the sorting function. It reorders levels in `var` that will be used for faceting, and adds `ordr` to the data frame that will be used for ordering of values. It does it quite crudely, using numbers and additions, but it works. Several "rules" are followed: 1. variable levels are ordered according to the size of the mean values for all applicable models; 2. variables are ordered according to the number of models they are involved with (most models in the bottom); and 3. variables are ordered according to total mean of all values in them.
```{r}
MB <- sort_models(MB)
```
## Additional fixing so the plot is nicer
Some more fixing before plotting - new line in variable names that includes a reference category.
```{r}
levels(MB$var)
levels(MB$var)[1] <- "Religion\n[ref: mod. or lib. prot.]"
levels(MB$var)[2] <- "Political views\n[ref: moderate]"
levels(MB$var)[3] <- "Region\n[ref: Midwest]"
levels(MB$var)[4] <- "Childhood\n[ref: medium city]"
levels(MB$var)[5] <- "Education\n[ref: LT HS]"
levels(MB$var)[10] <- "Metrop. area\n[ref: city (top 100)]"
levels(MB$var)[13] <- "Race\n[ref: white]"
```
Since not variables as whole might not be stat. significant in the models, and thus would be just adding to the clutter (e.g. especially categorical ones), they can be identified and their alpha reduced. A new logical variable in dataframe is useful for that, and the values are manually assigned (at this point). This also allows, for example, to easily see in what models certain categorical variable as whole looses significance.
```{r}
MB$sig <- TRUE
Anova(clm_full)
Anova(clm_small)
Anova(clm_small_res16)
MB$sig[MB$var == "Immigrant"] <- FALSE
MB$sig[(MB$var == "Veteran") &
(MB$model == "Large")] <- FALSE
MB$sig[(MB$var == "Parents' education") &
(MB$model == "Large")] <- FALSE
MB$sig[(MB$var == "Metrop. area\n[ref: city (top 100)]") &
(MB$model == "Large")] <- FALSE
MB$sig[(MB$var == "Region\n[ref: Midwest]") &
(MB$model == "Large")] <- FALSE
MB$sig[(MB$var == "Parents' education") &
(MB$model == "W/ childhood")] <- FALSE
MB$sig[(MB$var == "Metrop. area\n[ref: city (top 100)]") &
(MB$model == "W/ childhood")] <- FALSE
```
If there are specific variables we are interested in, as I was, their background can be shaded to draw attention. Since plot is going to be faceted, adding full width/length geom_rect that has also same main variables is one option. Thus, we are going to assign NA values to all the facets, and add (-)Infs to variables we are interested in.
```{r}
intrst <- data.frame(var = unique(MB$var),
xmn = NA,
xmx = NA,
ymn = NA,
ymx = NA
)
intrst[ 9, 2:5] <- c(-Inf, Inf, -Inf, Inf)
intrst[10, 2:5] <- c(-Inf, Inf, -Inf, Inf)
```
## Ploting
Finally, to plot the data `ordr` should be used to reorder variable `factor` (it contains levels of categorical variables, or units of non-categorical and since it was not used in creation of the `ordr` var it can contain same names for different variables).
```{r}
p_m <- ggplot(data = MB, aes(x = reorder(factor, ordr),
y = AME,
ymin = lower,
ymax = upper,
colour = model,
alpha = sig)
)
# add marginal effects
p_m <- p_m +
geom_pointrange(position = position_dodge(width = 0.5),
shape = 21,
fill = "white"
) +
scale_colour_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))
# make vars. that do not reach stat.sig. transparent
p_m <- p_m + scale_alpha_discrete(range = c(0.5, 1.0))
# add intercept line
p_m <- p_m + geom_hline(yintercept = 0,
color = "dark gray")
# rotate plot and add some labels
p_m <- p_m +
coord_flip() +
labs(x = NULL,
y = "Average Marginal Effect",
colour = "Models",
alpha = "Variable\nsig at p<.05",
title = "Regression on criteria of national belonging (openess)")
# add facets according to model variables
# order of levels is going to take care of variable order
p_m <- p_m + facet_grid(var ~ . ,
scales = "free_y",
space = "free_y",
switch = "y")
# fix the appearance of the plot
p_m <- p_m +
theme(strip.text.y = element_text(angle = 180),
strip.placement = "outside",
strip.background = element_rect(fill = "#d9d9d9", color = NA),
plot.background = element_blank(),
panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "#d9d9d9"),
panel.grid.minor.x = element_line(colour = "#d9d9d9"),
panel.grid.major.y = element_line(colour = "#d9d9d9"),
panel.grid.minor.y = element_line(colour = "#d9d9d9"),
axis.ticks = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
legend.box.background = element_blank(),
legend.title = element_text(size = 10),
axis.title = element_text(size = 10),
title = element_text(size = 11)
)
# add rectangles to highlight variables we are interested in
p_m <- p_m + geom_rect(data = intrst,
aes(xmin = xmn,
xmax = xmx,
ymin = ymn,
ymax = ymx),
alpha = 0.09,
fill = "black",
inherit.aes = FALSE,
show.legend = FALSE
)
```
And this is how finished plot looks like, it does include several warnings, which are for facets that are not shaded.
```{r first-plot, fig.height=7}
# finished plot
p_m
```
## Post-plot-fixing
But, the title comes awkwardly in the middle, to fix it we need to use `ggplotGrob` and `grid.draw` from `grid` package.
```{r second-plot, warning=FALSE, fig.height=7}
p_m_fix <- ggplotGrob(p_m)
p_m_fix$layout$l[p_m_fix$layout$name == "title"] <- 2
grid::grid.draw(p_m_fix)
```
It's all fairly simple, but the function cuts a lot of lines. It can also be cut more, and I'll be trying to do it. :)
If you have any suggestion on how to further improve the plot, I would be thrilled to hear it!