/
art-002-case-data.Rmd
553 lines (353 loc) · 13.7 KB
/
art-002-case-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
---
title: "Case study: Data"
vignette: >
%\VignetteIndexEntry{Case study: Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
link-citations: yes
always_allow_html: true
---
```{r child = "../man/rmd/common-setup.Rmd"}
```
```{r}
#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-002-")
```
Part 2 of a case study in three parts, illustrating how we work with longitudinal student-level records.
1. *Goals.* Introducing the study.
2. *Data.* Transforming the data to yield the observations of interest.
3. *Results.* Summary statistics, metric, chart, and table.
## Method
Our data processing goal is to reduce the source data tables to the specific observations needed to compute our metrics. The data processing tasks include filtering observations, creating, renaming, and recoding variables, and joining data frames.
The analysis is organized to produce two data frames---students ever enrolled in the programs and students graduating from the programs---that are joined and written to file as a starting point for developing the results.
```{r child = "../man/rmd/note-for-practice-only-2.Rmd"}
```
## Load data
*Start.* If you are writing your own script to follow along, we use these packages in this article:
```{r}
library(midfieldr)
library(midfielddata)
library(data.table)
```
*Load.* Practice datasets. View data dictionaries via `?student`, `?term`, `?degree`.
```{r}
# Load practice data
data(student, term, degree)
```
## Initial processing
*Select (optional).* Reduce the number of columns to the minimum needed by the midfieldr functions.
```{r}
# Work with required midfieldr variables only
student <- select_required(student)
term <- select_required(term)
degree <- select_required(degree)
```
*Initialize.* Assign a working data frame. We often start with the `term` dataset.
```{r}
# Working data frame
DT <- copy(term)
DT
```
The result has `r nrow(DT)` observations. In the case study, we will typically note the number of observations as they change.
## Filter for data sufficiency
Some student records near the lower and upper terms that bound the available data must be excluded to prevent false summaries involving timely degree completion. To apply this filter, we first determine the timely completion term.
```{r child = "../man/rmd/define-timely-completion-term.Rmd"}
```
*Add variables.* Using information in `term`, we add the `timely_term` variable as well as supporting variables used in its construction.
```{r}
# Determine a timely completion term for every student
DT <- add_timely_term(DT, term)
DT
```
*Add variables.* Using information in `term`, we add the `data_sufficiency` variable as well as supporting variables used in its construction.
```{r}
# Determine data sufficiency for every student
DT <- add_data_sufficiency(DT, term)
DT
```
```{r child = "../man/rmd/define-data-sufficiency-criterion.Rmd"}
```
*Filter.* We filter to retain observations for which the data are sufficient then drop all but the ID variable.
```{r}
# Retain observations having sufficient data
DT <- DT[data_sufficiency == "include"]
DT <- DT[, .(mcid)]
DT <- unique(DT)
DT
```
The result has `r nrow(DT)` observations.
## Filter for degree seeking
```{r child = "../man/rmd/define-inner-join.Rmd"}
```
*Filter.* Use an inner join with `student` to retain degree-seeking students only. Select the ID column.
```{r}
# Filter for degree seeking, output unique IDs
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
DT <- unique(DT)
DT
```
The result has `r nrow(DT)` observations. (No change is expected in this example because all students in the midfielddata practice data are degree-seeking.) We preserve this data frame as a baseline set of IDs to be used again.
```{r}
baseline <- copy(DT)
```
## Identify programs
In MIDFIELD datasets, the `cip6` variable identifies the 6-digit code for the program in which a student is enrolled in a given term.
```{r child = "../man/rmd/define-cip.Rmd"}
```
We have already searched `cip` to obtain the codes for the four programs in our case study. The first four digits of the 6-digit CIP codes are:
- Civil Engineering 1408
- Electrical Engineering 1410
- Mechanical Engineering 1419
- Industrial/Systems Engineering 1427, 1435, 1436, and 1437.
From `cip`, we obtain all codes that start with any of the selected 4-digit codes.
```{r}
# Four engineering programs using 4-digit CIP codes
selected_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437"))
selected_programs
```
*Add a variable.* User-defined program names are nearly always required. Add a variable to label each of these `r nrow(selected_programs)` programs with one of the four conventional program abbreviations we will use in comparing metrics, i.e., Civil (CE), Electrical (EE), Mechanical (ME), and Industrial/Systems Engineering (ISE).
```{r}
# Recode program labels. Edit as required.
selected_programs[, program := fcase(
cip6 %like% "^1408", "CE",
cip6 %like% "^1410", "EE",
cip6 %like% "^1419", "ME",
cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"
)]
```
Confirm that the abbreviations match the original 4-digit CIP names. We also illustrate using `options()` to change the number of data.table rows to print.
```{r}
# Preserve settings
op <- options()
# Edit number of rows to print
options(datatable.print.nrows = 15)
# Confirm that abbreviations match the longer program names
selected_programs[, .(cip4name, program)]
```
Having checked that the new abbreviations correctly represent the programs, we can finalize the data frame of program CIPs and names.
```{r}
selected_programs <- selected_programs[, .(cip6, program)]
selected_programs
# Restore original settings
options(op)
```
## Gather ever-enrolled
*Reset* The data frame of baseline IDs is the intake for this section.
```{r}
# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT
```
The result has `r nrow(DT)` observations.
```{r child = "../man/rmd/define-left-join.Rmd"}
```
*Left join (add a variable).* Returns all rows from `DT` and rows from `term` that match on `mcid`---in effect, adding the `cip6` variable to `DT`. Additionally, because `term` contains multiple rows per ID, the merged data frame also has the possibility of multiple rows per ID.
```{r}
# Left-outer join from term to DT
DT <- term[DT, .(mcid, cip6), on = c("mcid")]
DT <- unique(DT)
DT
```
The result has `r nrow(DT)` observations.
*Inner join (add a variable, filter observations).* Returns rows in `DT` and `study_programs` that match on `cip6`. In effect, we add a column of program labels to `DT` and simultaneously filter `DT` to retain rows that match the four case study programs only.
```{r}
# Join program names and retain desired programs only
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT
```
The result has `r nrow(DT)` observations.
*Filter.* Because students can change CIP codes but remain within the same labeled group (e.g., ISE), we drop the `cip6` code and filter for unique combinations of ID and program label.
```{r}
# Filter for unique ID-program combinations
DT[, cip6 := NULL]
DT <- unique(DT)
DT
```
The result has `r nrow(DT)` observations.
*Copy.* Set aside the ever enrolled information under a new name to use later for joining with graduates.
```{r}
# Prepare for joining
setcolorder(DT, c("mcid", "program"))
ever_enrolled <- copy(DT)
ever_enrolled
```
## Gather graduates
*Reset* The data frame of baseline IDs is the intake for this section. As before, the result has `r nrow(baseline)` observations.
```{r}
# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT
```
*Add variables.* We use `term` to again add the `timely_term` variable and its supporting variables.
```{r}
# Add timely completion term
DT <- add_timely_term(DT, term)
DT
```
*Add variables.* We use `degree` to add the `completion_status` variable and its supporting variables.
```{r}
# Add completion status
DT <- add_completion_status(DT, degree)
DT
```
```{r child = "../man/rmd/define-timely-completion-criterion.Rmd"}
```
*Filter.* Retain observations of timely completers only. Drop unnecessary variables.
```{r}
# Retain timely completers
DT <- DT[completion_status == "timely"]
DT <- DT[, .(mcid)]
DT
```
The result has `r nrow(DT)` observations.
*Left join (add variables).* We use a left-join with `degree` to add the CIP codes and terms of the degrees earned.
```{r}
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]
DT
```
The result has `r nrow(DT)` observations.
*Inner join (add a variable, filter observations)* Again, add a column of program labels and filter by program.
```{r}
# Join programs
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT
```
The result has `r nrow(DT)` observations.
*Filter.* Students may have earned multiple degrees in different terms. We retain degrees earned in their first degree term only.
```{r}
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT
```
The result has `r nrow(DT)` observations.
*Filter.* Drop unnecessary variables and filter for unique observations of ID and program label.
```{r}
# Filter for unique ID-program combinations
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT
```
*Copy.* Set aside the graduates information under a new name to use for joining with ever enrolled.
```{r}
# Prepare for joining
setcolorder(DT, c("mcid", "program"))
graduates <- copy(DT)
graduates
```
## Add groupings
We plan to group the data by program, bloc, race/ethnicity, and sex. Program is already present. Bloc labels are added next.
```{r child = "../man/rmd/define-bloc.Rmd"}
```
*Add a variable.* We add a `bloc` variable to the ever enrolled and graduates data frames before joining.
```{r}
ever_enrolled[, bloc := "ever_enrolled"]
graduates[, bloc := "graduates"]
```
*Join.* Combine the two data frames by rows, binding by matching column names.
```{r}
# Combine two data frames
DT <- rbindlist(list(ever_enrolled, graduates), use.names = TRUE)
DT
```
The result has `r nrow(DT)` observations.
```{r child = "../man/rmd/define-grouping-variables.Rmd"}
```
*Add variables.* Use a left join, matching on `mcid`, to add race/ethnicity and sex to the data frame.
```{r}
# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT
```
*Verify prepared data.* `study_observations`, included with midfieldr, contains the case study information developed above. Here we verify that the two data frames have the same content.
```{r}
# Demonstrate equivalence
same_content(DT, study_observations)
```
In this form, the observations are the starting point for part 3 of the case study.
## Closer look
We examine the study observations for a few specific students to better illustrate the structure of these data.
```{r}
#| echo: false
#| eval: false
# Find example IDs
x <- graduates[ever_enrolled, on = "mcid"]
y <- x[is.na(program)]
y$mcid[duplicated(y$mcid)]
y <- x[!is.na(program)]
y$mcid[duplicated(y$mcid)]
x[program == i.program]
mcid_we_want <- "MCID3112470255"
DT[mcid == mcid_we_want]
term[mcid == mcid_we_want]
degree[mcid == mcid_we_want]
```
```{r}
#| echo: false
# Preserve settings
op <- options()
# Edit number of rows to print
options(datatable.print.nrows = 15)
```
*Example 1.* This ID yields one observation only. The student was enrolled in Electrical Engineering but did not complete one of the four case study programs.
```{r}
# Display one student by ID
mcid_we_want <- "MCID3111171519"
DT[mcid == mcid_we_want]
```
A closer look at the student's `term` record confirms the result: the student was enrolled in CIP 141001 (Electrical Engineering) but switched to CIP 110701 (Computer Science). The `degree` record indicates that the student graduated in Computer Science.
```{r}
# Closer look at term
term[mcid == mcid_we_want]
# Closer look at degree
degree[mcid == mcid_we_want]
```
*Example 2.* This ID yields two observations indicating that the student was enrolled in Industrial/Systems Engineering and a timely graduate of that program.
```{r}
# Display one student by ID
mcid_we_want <- "MCID3111150194"
DT[mcid == mcid_we_want]
```
The `term` and `degree` excerpts confirm those observations.
```{r}
# Closer look at terms
term[mcid == mcid_we_want]
# Closer look at degree
degree[mcid == mcid_we_want]
```
*Example 3.* This ID yields two observations indicating that the student was enrolled in Electrical Engineering and in Civil Engineering but a timely graduate of neither program.
```{r}
# Display one student by ID
mcid_we_want <- "MCID3111264877"
DT[mcid == mcid_we_want]
```
The `term` excerpt agrees; the `degree` record shows they graduated in CIP 261399 (Biological and Biomedical Sciences).
```{r}
# Closer look at term
term[mcid == mcid_we_want]
# Closer look at degree
degree[mcid == mcid_we_want]
```
*Example 4.* This ID yields four observations indicating that the student was enrolled in Civil, Electrical, and Mechanical Engineering and a timely graduate of Mechanical.
```{r}
# Display one student by ID
mcid_we_want <- "MCID3112470255"
DT[mcid == mcid_we_want]
```
The `term` and `degree` excerpts confirm those observations.
```{r}
# Closer look at term
term[mcid == mcid_we_want]
# Closer look at degree
degree[mcid == mcid_we_want]
```
```{r}
#| echo: false
# Restore original settings
options(op)
```
## References
<div id="refs"></div>
```{r child = "../man/rmd/common-closing.Rmd"}
```