% \VignetteIndexEntry{Analysing the American National Election Study of 1948 with "memisc"}
% \VignetteEngine{knitr::rmarkdown}
```{r,echo=FALSE,message=FALSE}
knitr::opts_chunk$set(comment=NA,
                      fig.align="center",
                      results="markup")
```
# Analysing the American National Election Study of 1948 with *memisc*
## Introduction
This vignette gives an example of the analysis of a typical social science data set.
The data set is the data file of the *American National Election Study* of
1948[^1], available from the [American National Election Studies website](https://electionstudies.org). The data file contains data
from two USA-wide surveys conducted in October and November 1948
by the Survey Research Center, University of Michigan
(principal investigators: Angus Campbell and Robert L. Kahn). The
total number of cases in the data set is 662 and the number of variables
is 65 (more details about this
data set can be found
at <https://electionstudies.org/studypages/1948prepost/1948prepost.htm>).
[^1]: National Election Studies, 1948: *Post-Election Study [dataset].* Ann Arbor, MI: University of Michigan, Center for Political Studies [producer and distributor], 1999. ANES Dataset ID: 1948.T; ICPSR Study Number: 7218. These materials are based on work supported by the National Science Foundation under Grant Nos.: SBR-9707741, SBR-9317631, SES-9209410, SES-9009379, SES-8808361, SES-8341310, SES-8207580, and SOC77-08885. Any opinions, findings and conclusions or recommendations expressed in these materials are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
With 662 cases and 65 variables, the 1948 ANES data set is relatively
small compared to current social science data sets. Larger data
sets can be processed along the same lines as in this vignette.
Unlike the 1948 ANES data, however, such data sets cannot be
included in the package because of their size and, in some cases,
legal restrictions.
This vignette starts with a demonstration of how a data file can be
examined before loading it and how a subset of the data can be loaded
into memory.
After loading this subset into memory, some descriptive analyses are
conducted that showcase the construction of contingency tables and of
general tables of descriptive statistics using the
`genTable` function. In addition, a logit analysis
is demonstrated, together with the collection of the coefficients of
several logit models into a comprehensive table by the `mtable` function.
It should be noted that the analyses reported in the following are conducted
only for the purpose of demonstrating the features of the package and are
not to be considered conclusive scientific evidence of any kind.
This vignette is run with the help of the [*knitr* package](https://cran.r-project.org/package=knitr). This
makes it possible to showcase more than the data management facilities provided
by *memisc*: the following code also demonstrates how output
created with some of the facilities of *memisc* can be neatly
integrated into reports generated with *knitr*.
Before we start, we adjust *knitr*'s output (with which this
vignette is formatted) to produce HTML where possible.
```{r}
knit_print.codebook <- function(x,...)
    knitr::asis_output(format_html(x,...))
knit_print.descriptions <- function(x,...)
    knitr::asis_output(format_html(x,...))
knit_print.ftable <- function(x,options,...)
    knitr::asis_output(
        format_html(x,
                    digits=if(length(options$ftable.digits))
                               options$ftable.digits
                           else 0,
                    ...))
# We can now adjust the number of digits after the decimal point
# for each column, e.g. by adding an `ftable.digits` option
# to an R chunk, as in ```{r,ftable.digits=c(2,2,0)}
knit_print.mtable <- function(x,...)
    knitr::asis_output(format_html(x,...))
```
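The mechanism used above is S3 dispatch on *knitr*'s `knit_print` generic. The following is a minimal sketch with a made-up class `toy_note` (hypothetical, not part of *memisc* or *knitr*); it only assumes that the *knitr* package is installed.

```{r}
# A toy object of a made-up class "toy_note" (hypothetical, for illustration only)
toy.note <- structure(list(text = "Hello from a custom printer"),
                      class = "toy_note")

# A knit_print method: wraps the text in HTML and marks it as "as is"
# output, so that knitr writes it into the document without escaping
knit_print.toy_note <- function(x, ...)
    knitr::asis_output(paste0("<em>", x$text, "</em>"))

# knitr calls knit_print() on visible chunk results; here we call it by hand
toy.out <- knitr::knit_print(toy.note)
class(toy.out)
```

The methods defined for `codebook`, `descriptions`, `ftable`, and `mtable` objects follow the same pattern, with `format_html` producing the HTML.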
## Reading in a "portable" SPSS data file
We start with importing the data into R.
The following code extracts the SPSS portable file
`NES1948.POR` from the
zip file `NES1948.ZIP` delivered with the *memisc* package.
```{r, message=FALSE}
library(memisc)
options(digits=3)
nes1948.por <- unzip(system.file("anes/NES1948.ZIP",package="memisc"),
                     "NES1948.POR",exdir=tempfile())
```
Now the portable file is in a temporary directory and the path
to the file is contained in the string variable `nes1948.por`.
In the next step, the file is declared as an SPSS/PSPP portable
file using the function `spss.portable.file`, which takes the path
to the file as its first argument. `spss.portable.file`
reads in the information about the variables contained in the data
set and counts the
number of cases in the file. That is, standard
I/O operations are used on the file, but the data read in are just
thrown away without allocating core memory for them.
This counting of cases can, of course, be suppressed if it would
take too long.
```{r}
nes1948 <- spss.portable.file(nes1948.por)
print(nes1948)
```
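If counting the cases takes too long for a large file, it can be switched off. The sketch below assumes the `count.cases` argument of `spss.portable.file` as documented in *memisc*; the object names `toy.por` and `nes1948.nocount` are made up for this illustration.

```{r}
library(memisc)
# Extract the portable file again into a fresh temporary directory
toy.por <- unzip(system.file("anes/NES1948.ZIP",package="memisc"),
                 "NES1948.POR",exdir=tempfile())
# Declare the file without counting its cases
nes1948.nocount <- spss.portable.file(toy.por,count.cases=FALSE)
# The information about the variables is still available
head(names(nes1948.nocount))
```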
At this stage, the data are not yet loaded into memory.
But we can see which variables exist inside
the data set:
```{r}
names(nes1948)
```
Note that the variable names are all changed from uppercase to lowercase (SPSS does not distinguish between uppercase and lowercase variable names, and uppercase looks like shouting). Case folding could have been suppressed by the call `spss.portable.file(nes1948.por,to.lower=FALSE)`.
We also can ask for a description ("variable label") for each variable:
```{r}
description(nes1948)
```
or even a code book using
```{r, eval=FALSE}
codebook(nes1948)
```
(this is not shown
here because the output would take more than thirty pages).
We can also get a codebook of the first few variables instead, with
```{r}
codebook(nes1948[1:5])
```
### Reading in a subset of the data
After we have decided which variables to use, we can read in a subset of
the data:
```{r}
vote.48 <- subset(nes1948,
                  select=c(
                      V480018,
                      V480029,
                      V480030,
                      V480045,
                      V480046,
                      V480047,
                      V480048,
                      V480049,
                      V480050
                  ))
```
The subset of the ANES 1948 data we read in is now contained in the
variable `vote.48`, which holds an object of class `data.set`.
A `data.set` is an "embellished" version of a `data.frame`,
a data structure intended to contain `labelled` vectors.
`labelled` vectors carry all the special information
attached to the variables in the original data set, such as
variable labels, value labels, and general missing values.
A short summary of this special information shows up after a
call to `str`.
```{r}
str(vote.48)
```
This output shows, for example, that variable `V480018`
has the description (variable label) "DID R VOTE/FOR WHOM", is considered as
having a nominal level of measurement, and has seven value labels
and one defined missing value.
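To illustrate what such special information looks like, here is a small sketch that builds a single labelled variable by hand with `as.item`. The variable `toy.answer`, its codes, and its labels are invented for this example and have nothing to do with the ANES data.

```{r}
library(memisc)
# Five made-up responses; code 9 stands for "Don't know"
toy.answer <- as.item(c(1,2,2,9,1),
                      labels=c("Yes"=1,"No"=2,"Don't know"=9),
                      missing.values=9,
                      annotation=c(description="Toy survey question"))
# The variable label
description(toy.answer)
# Which observations have a defined missing value?
is.missing(toy.answer)
# Conversion to a factor turns defined missing values into NA
toy.factor <- as.factor(toy.answer)
table(toy.factor,useNA="ifany")
```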
Since the variable names in the ANES data set are not very mnemonic,
we rename the variables:
```{r}
vote.48 <- rename(vote.48,
                  V480018 = "vote",
                  V480029 = "occupation.hh",
                  V480030 = "unionized.hh",
                  V480045 = "gender",
                  V480046 = "race",
                  V480047 = "age",
                  V480048 = "education",
                  V480049 = "total.income",
                  V480050 = "religious.pref"
                  )
```
Since many data sets available from public repositories have
non-mnemonic variable names like those in this example, it is
convenient to do the data loading and renaming in one step.
This is indeed possible:
```{r,eval=FALSE}
vote.48 <- subset(nes1948,
                  select=c(
                      vote           = V480018,
                      occupation.hh  = V480029,
                      unionized.hh   = V480030,
                      gender         = V480045,
                      race           = V480046,
                      age            = V480047,
                      education      = V480048,
                      total.income   = V480049,
                      religious.pref = V480050
                  ))
```
Before we start with analyses, we take a closer look at
the data.
```{r}
codebook(vote.48)
```
We have now obtained a *codebook*, which contains information
on the class and type of the variables in the data set,
the value labels and defined missing values, and counts of
the distinct values of the variables.
## Analysis
### Some descriptive analyses
We start our analyses with a contingency table, but first
we make some preparations: We recode the variables of
interest into a smaller number of categories in order to
get results that are easier to read and interpret.
```{r}
vote.48 <- within(vote.48,{
  vote3 <- recode(vote,
                  1 -> "Truman",
                  2 -> "Dewey",
                  3:4 -> "Other"
                  )
  occup4 <- recode(occupation.hh,
                   10:20 -> "Upper white collar",
                   30 -> "Other white collar",
                   40:70 -> "Blue collar",
                   80 -> "Farmer"
                   )
  relig3 <- recode(religious.pref,
                   1 -> "Protestant",
                   2 -> "Catholic",
                   3:5 -> "Other,none"
                   )
  race2 <- recode(race,
                  1 -> "White",
                  2 -> "Black"
                  )
})
```
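To see on a small scale what `recode` does with codes that are not covered by any rule, consider the following sketch. The vector `toy.vote` and its codes are invented; the example relies on the behaviour described in this vignette (character targets yield a factor) and on `recode`'s default of setting unmatched codes to `NA`.

```{r}
library(memisc)
# Made-up vote codes; 9 is not covered by any recoding rule below
toy.vote <- as.item(c(1,2,3,4,9,1))
toy.vote3 <- recode(toy.vote,
                    1   -> "Truman",
                    2   -> "Dewey",
                    3:4 -> "Other")
# Counts of the new categories; codes without a rule end up as NA
table(toy.vote3,useNA="ifany")
```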
Having constructed the unordered factors
`vote3`, `occup4`, `relig3`, and `race2`, we can
proceed to examine the association between the vote and occupational class,
religious denomination, and race. First, we look at a simple
contingency table.
```{r}
ftable(xtabs(~vote3+occup4,data=vote.48))
```
Tables of percentages may seem more informative about the impact
of various factors on the vote. So we use the function
`genTable` to obtain such tables of percentages:
```{r, ftable.digits=c(2,2,2,0)}
gt1 <- genTable(percent(vote3)~occup4,data=vote.48)
## For knitr-ing, we use ```{r, ftable.digits=c(2,2,2,0)} here.
ftable(gt1,row.vars=2)
```
Evidently, voters from farmer and blue-collar worker households
were especially supportive of President Truman, while voters
of upper white-collar background largely supported
the Republican candidate Dewey.
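What such a table of percentages contains can be reproduced on a small scale with base R. The sketch below uses invented toy data (not the ANES data) and `prop.table` rather than `genTable`, so it only mirrors the idea of computing percentages within each category of the explanatory factor.

```{r}
# Invented toy data: candidate choice and occupational class of ten voters
toy.df <- data.frame(
    vote  = c("Truman","Truman","Dewey","Truman","Dewey",
              "Dewey","Dewey","Truman","Truman","Dewey"),
    occup = rep(c("Blue collar","White collar"),each=5))
toy.counts <- xtabs(~vote+occup,data=toy.df)
# Percentages of the vote within each occupational class (columns sum to 100)
toy.pct <- 100*prop.table(toy.counts,margin=2)
round(toy.pct,1)
```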
```{r, ftable.digits=c(2,2,2,0)}
gt2 <- genTable(percent(vote3)~relig3,data=vote.48)
ftable(gt2,row.vars=2)
```
This table shows that Catholics and adherents of other denominations
were more supportive of Truman than of Dewey.
```{r, ftable.digits=c(2,2,2,0)}
gt3 <- genTable(percent(vote3)~race2,data=vote.48)
ftable(gt3,row.vars=2)
```
African Americans apparently supported Truman by a large majority. The
number of members of this group in the sample is very small, however,
so that such an inference would be very shaky.
```{r, ftable.digits=c(2,2,2,0)}
gt4 <- genTable(percent(vote3)~total.income,data=vote.48)
ftable(gt4,row.vars=2)
```
The table of percentages of the vote by income suggests that income had
a considerable influence on the choice between Truman and Dewey,
but the unequal distribution of income categories warrants a more
refined analysis that takes into account the uncertainty about the
vote percentages. Therefore, the percentages of support for Truman
broken down by income are shown with confidence intervals:
```{r, ftable.digits=c(2,2,2)}
## For knitr-ing, we use ```{r, ftable.digits=c(2,2,2)} here.
inc.tab <- genTable(percent(vote3,ci=TRUE)~total.income,data=vote.48)
ftable(inc.tab,row.vars=c(3,2))
```
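How a confidence interval expresses the uncertainty of a percentage in a small group can be sketched with base R. The counts below are invented, and `binom.test` computes an exact (Clopper-Pearson) interval; whether `percent(...,ci=TRUE)` uses the same type of interval is not stated here, so the numbers need not match *memisc*'s output.

```{r}
# Invented counts: Truman voters out of all voters in a small and a large group
toy.small <- binom.test(x=6,n=10)
toy.large <- binom.test(x=60,n=100)
# Same point estimate (60 percent), very different 95% confidence intervals
toy.ci <- rbind(small=100*toy.small$conf.int,
                large=100*toy.large$conf.int)
colnames(toy.ci) <- c("lower","upper")
round(toy.ci,1)
```

The interval for the small group is much wider, which is why sparsely populated categories allow only cautious conclusions.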
Occupational class is more evenly distributed in the sample, so it may be possible
to obtain more precise estimates of the percentages of support for Truman for occupational
classes:
```{r, ftable.digits=c(2,2,2)}
occup.tab <- genTable(percent(vote3,ci=TRUE)~occup4,data=vote.48)
ftable(occup.tab,row.vars=c(3,2))
```
The upper and lower white-collar and blue-collar classes are quite distinct
with regard to the percentages of support for Truman. The point estimates
of the percentages are outside the confidence intervals of the respective
other occupational classes; the confidence intervals do not even overlap.
However, it is not clear whether farmers are distinct from the blue-collar
and lower white-collar classes.
### Logit modelling of candidate choice
In the following, we conduct a logit analysis of the vote for Truman.
First, we assign non-standard contrasts to the categorical predictors.
Here, the function `contr` is used to assign treatment (dummy)
contrasts to `occup4` and `total.income` with
baseline categories 3 and 4, respectively.
```{r}
vote.48 <- within(vote.48,{
  contrasts(occup4) <- contr("treatment",base = 3)
  contrasts(total.income) <- contr("treatment",base = 4)
})
```
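The effect of choosing a baseline category can be inspected directly in the contrast matrix. The following sketch uses base R's `contr.treatment` on a made-up four-level factor instead of *memisc*'s `contr`; the object name `toy.class` is chosen for illustration only.

```{r}
toy.levels <- c("Upper white collar","Other white collar",
                "Blue collar","Farmer")
toy.class <- factor(toy.levels,levels=toy.levels)
# Treatment (dummy) contrasts with the third level as baseline
contrasts(toy.class) <- contr.treatment(toy.levels,base=3)
# The baseline row contains only zeros; every other level gets its own dummy
contrasts(toy.class)
```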
We now fit some logistic regression models of the
impact of occupational class, income, and religious denomination
on the choice to vote for Truman. The contrasts of
the occupational class and income factors are such that they
compare the choices of the members of the blue-collar class
with all other classes and the middle income group (\$2000-2999)
with the other income groups. The religious denomination factor
compares Protestants with Catholics and those with other or
no denominations.
```{r}
model1 <- glm((vote3=="Truman")~occup4,data=vote.48,
              family="binomial")
model2 <- glm((vote3=="Truman")~total.income,data=vote.48,
              family="binomial")
model3 <- glm((vote3=="Truman")~occup4+total.income,data=vote.48,
              family="binomial")
model4 <- glm((vote3=="Truman")~relig3,data=vote.48,
              family="binomial")
model5 <- glm((vote3=="Truman")~occup4+relig3,data=vote.48,
              family="binomial")
```
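A comparison such as `vote3=="Truman"` yields `TRUE`/`FALSE` values, which `glm` treats as a binary response in which `TRUE` counts as a success. A small sketch with simulated data, in which the variable names and the true effect are invented:

```{r}
set.seed(1948)
# Simulated toy data: the probability of a Truman vote is higher in group "b"
toy.group <- factor(rep(c("a","b"),each=200))
toy.choice <- ifelse(runif(400) < ifelse(toy.group=="b",0.7,0.4),
                     "Truman","Dewey")
toy.logit <- glm((toy.choice=="Truman")~toy.group,family="binomial")
# Logit coefficients and the corresponding odds ratios
cbind(logit=coef(toy.logit),odds.ratio=exp(coef(toy.logit)))
```

With treatment contrasts, the coefficient of `toy.groupb` is the difference in log-odds between group "b" and the baseline group "a".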
First, we use `mtable` to construct a comparative table
of the estimates of `model1`,
`model2`, and `model3`. We thus can compare
the impact of occupational class and income on the choice of
candidate Truman.
```{r}
mtable(model1,model2,model3,summary.stats=c("Nagelkerke R-sq.","Deviance","AIC","N"))
```
`mtable` returns an object of class `"mtable"`. When formatted,
it comes close to the requirements of typical social science
publications. We still want to change the technical variable
names into non-technical ones, for which we can use `relabel`:
```{r}
relabel(mtable(
            "Model 1"=model1,
            "Model 2"=model2,
            "Model 3"=model3,
            summary.stats=c("Nagelkerke R-sq.","Deviance","AIC","N")),
        UNDER="under",
        "AND OVER"="and over",
        occup4="Occup. class",
        total.income="Income",
        gsub=TRUE
        )
```
The comparison of the pseudo-R-squared values of models 1 and 2 suggests that
occupational class has a stronger influence on a preference for Truman than
household income. Indeed, if occupational class is taken into account, the
effect of income is no longer statistically significant, as the column
corresponding to model 3 indicates.
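The pseudo-R-squared reported by `mtable` is labelled "Nagelkerke R-sq.". As a rough sketch of what this statistic measures, it can be computed by hand from the deviances of a binary logit model, assuming the usual definition based on the Cox-Snell R-squared; the data are simulated and all names are made up.

```{r}
set.seed(48)
toy.x <- rnorm(300)
toy.y <- rbinom(300,size=1,prob=plogis(0.5+toy.x))
toy.fit <- glm(toy.y~toy.x,family="binomial")
toy.n <- nobs(toy.fit)
# Cox-Snell R-squared from the drop in deviance relative to the null model
toy.cs <- 1-exp((toy.fit$deviance-toy.fit$null.deviance)/toy.n)
# Nagelkerke's version rescales it so that its maximum is 1
toy.nagelkerke <- toy.cs/(1-exp(-toy.fit$null.deviance/toy.n))
toy.nagelkerke
```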
Second, we compare the effect of occupational class and religious denomination
on the preference for Truman along the same lines as above. We use
`mtable` to collect the estimates of `model1`, `model4`,
and `model5` into a common table.
```{r}
relabel(mtable(
            "Model 1"=model1,
            "Model 4"=model4,
            "Model 5"=model5,
            summary.stats=c("Nagelkerke R-sq.","Deviance","AIC","N")),
        occup4="Occup. class",
        relig3="Religion",
        gsub=TRUE
        )
```
A comparison of the pseudo-R-squared values suggests that the effect of
religious denomination is also weaker than that of occupational class. However,
as the third column in the above table indicates, the effect of religious
denomination remains statistically significant.
```{r,echo=FALSE}
rm(knit_print.codebook,
   knit_print.descriptions,
   knit_print.ftable,
   knit_print.mtable)
```