vignettes/art-002-case-data.Rmd

---
title: "Case study: Data"
vignette: >
  %\VignetteIndexEntry{Case study: Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
link-citations: yes
always_allow_html: true
---

```{r child = "../man/rmd/common-setup.Rmd"}
```

```{r}
#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-002-")
```

Part 2 of a case study in three parts, illustrating how we work with longitudinal student-level records. 

1. *Goals.* &nbsp; Introducing the study.   

2. *Data.* &nbsp; Transforming the data to yield the observations of interest.

3. *Results.* &nbsp; Summary statistics, metric, chart, and table.  


## Method

Our data processing goal is to reduce the source data tables to the specific observations needed to compute our metrics. The data processing tasks include filtering observations, creating, renaming, and recoding variables, and joining data frames. 

The analysis is organized to produce two data frames---students ever enrolled in the programs and students graduating from the programs---that are joined and written to file as a starting point for developing the results. 

```{r child = "../man/rmd/note-for-practice-only-2.Rmd"}
```


## Load data

*Start.* &nbsp; If you are writing your own script to follow along, we use these packages in this article:

```{r}
library(midfieldr)
library(midfielddata)
library(data.table)
```

*Load.* &nbsp; Practice datasets. View data dictionaries via `?student`, `?term`, `?degree`.  

```{r}
# Load practice data
data(student, term, degree)
```


## Initial processing

*Select (optional).* &nbsp; Reduce the number of columns to the minimum needed by the midfieldr functions. 

```{r}
# Work with required midfieldr variables only
student <- select_required(student)
term <- select_required(term)
degree <- select_required(degree)
```

*Initialize.* &nbsp; Assign a working data frame. We often start with the `term` dataset. 

```{r}
# Working data frame
DT <- copy(term)
DT
```

The result has `r nrow(DT)` observations. In the case study, we will typically note the number of observations as they change. 


## Filter for data sufficiency

Some student records near the lower and upper terms that bound the available data must be excluded to prevent false summaries involving timely degree  completion. To apply this filter, we first determine the timely completion term. 

```{r child = "../man/rmd/define-timely-completion-term.Rmd"}
```

*Add variables.*  &nbsp; Using information in `term`, we add the  `timely_term` variable as well as supporting variables used in its construction.

```{r}
# Determine a timely completion term for every student
DT <- add_timely_term(DT, term)
DT
```

*Add variables.*  &nbsp; Using information in `term`, we add the  `data_sufficiency` variable as well as supporting variables used in its construction.

```{r}
# Determine data sufficiency for every student
DT <- add_data_sufficiency(DT, term)
DT
```

```{r child = "../man/rmd/define-data-sufficiency-criterion.Rmd"}
```

*Filter.* We filter to retain observations for which the data are sufficient then drop all but the ID variable.    

```{r}
# Retain observations having sufficient data
DT <- DT[data_sufficiency == "include"]
DT <- DT[, .(mcid)]
DT <- unique(DT)
DT
```

The result has `r nrow(DT)` observations.


## Filter for degree seeking

```{r child = "../man/rmd/define-inner-join.Rmd"}
```

*Filter.* &nbsp; Use an inner join with `student` to retain degree-seeking students only. Select the ID column. 

```{r}
# Filter for degree seeking, output unique IDs
DT <- student[DT, .(mcid), on = c("mcid"), nomatch = NULL]
DT <- unique(DT)
DT
```

The result has `r nrow(DT)` observations. (No change is expected in this example because all students in the midfielddata practice data are degree-seeking.)  We preserve this data frame as a baseline set of IDs to be used again.  

```{r}
baseline <- copy(DT)
```


## Identify programs

In MIDFIELD datasets, the `cip6` variable identifies the 6-digit code for the program in which a student is enrolled in a given term. 

```{r child = "../man/rmd/define-cip.Rmd"}
```

We have already searched `cip` to obtain the codes for the four programs in our case study. The first four digits of the 6-digit CIP codes are: 

- Civil Engineering 1408 
- Electrical Engineering 1410 
- Mechanical Engineering 1419  
- Industrial/Systems Engineering 1427, 1435, 1436, and 1437. 

From `cip`, we obtain all codes that start with any of the selected 4-digit codes. 

```{r}
# Four engineering programs using 4-digit CIP codes
selected_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437"))
selected_programs
```

*Add a variable.* &nbsp;  User-defined program names are nearly always required. Add a variable to label each of these `r nrow(selected_programs)` programs with one of the four conventional program abbreviations we will use in comparing metrics, i.e., Civil (CE), Electrical (EE), Mechanical (ME), and Industrial/Systems Engineering (ISE). 

```{r}
# Recode program labels. Edit as required.
selected_programs[, program := fcase(
  cip6 %like% "^1408", "CE",
  cip6 %like% "^1410", "EE",
  cip6 %like% "^1419", "ME",
  cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"
)]
```

Confirm that the abbreviations match the original 4-digit CIP names. We also illustrate using `options()` to change the number of data.table rows to print. 

```{r}
# Preserve settings
op <- options()
# Edit number of rows to print
options(datatable.print.nrows = 15)

# Confirm that abbreviations match the longer program names
selected_programs[, .(cip4name, program)]
```


Having checked that the new abbreviations correctly represent the programs, we can finalize the data frame of program CIPs and names. 

```{r}
selected_programs <- selected_programs[, .(cip6, program)]
selected_programs

# Restore original settings
options(op)
```


## Gather ever-enrolled

*Reset* &nbsp; The data frame of baseline IDs is the intake for this section.

```{r}
# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT
```

The result has `r nrow(DT)` observations.

```{r child = "../man/rmd/define-left-join.Rmd"}
```

*Left join (add a variable).* &nbsp;  Returns all rows from `DT` and rows from `term` that match on `mcid`---in effect, adding the `cip6` variable to `DT`. Additionally, because `term` contains multiple rows per ID, the merged data frame also has the possibility of multiple rows per ID.  

```{r}
# Left-outer join from term to DT
DT <- term[DT, .(mcid, cip6), on = c("mcid")]
DT <- unique(DT)
DT
```

The result has `r nrow(DT)` observations.

*Inner join (add a variable, filter observations).* &nbsp; Returns rows in `DT` and `study_programs` that match on `cip6`. In effect, we add a column of program labels to `DT` and simultaneously filter `DT` to retain rows that match the four case study programs only.  
 
```{r}
# Join program names and retain desired programs only
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT
```

The result has `r nrow(DT)` observations.

*Filter.* &nbsp; Because students can change CIP codes but remain within the same labeled group (e.g., ISE), we drop the `cip6` code and filter for unique combinations of ID and program label. 

```{r}
# Filter for unique ID-program combinations
DT[, cip6 := NULL]
DT <- unique(DT)
DT
```

The result has `r nrow(DT)` observations.

*Copy.* &nbsp; Set aside the ever enrolled information under a new name to use later for joining with graduates.  

```{r}
# Prepare for joining
setcolorder(DT, c("mcid", "program"))
ever_enrolled <- copy(DT)
ever_enrolled
```


## Gather graduates

*Reset* &nbsp; The data frame of baseline IDs is the intake for this section. As before, the result has `r nrow(baseline)` observations.

```{r}
# IDs of data-sufficient, degree-seeking students
DT <- copy(baseline)
DT
```

*Add variables.*  &nbsp; We use `term` to again add the `timely_term` variable and its supporting variables.

```{r}
# Add timely completion term
DT <- add_timely_term(DT, term)
DT
```

*Add variables.*  &nbsp; We use `degree` to add the `completion_status` variable and its supporting variables.

```{r}
# Add completion status
DT <- add_completion_status(DT, degree)
DT
```

```{r child = "../man/rmd/define-timely-completion-criterion.Rmd"}
```

*Filter.*  &nbsp; Retain observations of timely completers only. Drop unnecessary variables. 

```{r}
# Retain timely completers
DT <- DT[completion_status == "timely"]
DT <- DT[, .(mcid)]
DT
```

The result has `r nrow(DT)` observations.

*Left join (add variables).*  &nbsp; We use a left-join with `degree` to add the CIP codes and terms of the degrees earned. 

```{r}
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]
DT
```

The result has `r nrow(DT)` observations.

*Inner join (add a variable, filter observations)* &nbsp; Again, add a column of program labels and filter by program. 

```{r}
# Join programs
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT
```

The result has `r nrow(DT)` observations.

*Filter.* &nbsp; Students may have earned multiple degrees in different terms. We retain degrees earned in their first degree term only. 

```{r}
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT
```

The result has `r nrow(DT)` observations.

*Filter.* &nbsp; Drop unnecessary variables and filter for unique observations of ID and program label. 

```{r}
# Filter for unique ID-program combinations
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT
```

*Copy.* &nbsp; Set aside the graduates information under a new name to use  for joining with ever enrolled.  

```{r}
# Prepare for joining
setcolorder(DT, c("mcid", "program"))
graduates <- copy(DT)
graduates
```


## Add groupings

We plan to group the data by program, bloc, race/ethnicity, and sex. Program is already present. Bloc labels are added next. 

```{r child = "../man/rmd/define-bloc.Rmd"}
```

*Add a variable.* &nbsp; We add a `bloc` variable to the ever enrolled and graduates data frames before joining. 

```{r}
ever_enrolled[, bloc := "ever_enrolled"]
graduates[, bloc := "graduates"]
```

*Join.* &nbsp; Combine the two data frames by rows, binding by matching column names. 

```{r}
# Combine two data frames
DT <- rbindlist(list(ever_enrolled, graduates), use.names = TRUE)
DT
```

The result has `r nrow(DT)` observations.

```{r child = "../man/rmd/define-grouping-variables.Rmd"}
```

*Add variables.* &nbsp; Use a left join, matching on `mcid`, to add race/ethnicity and sex to the data frame. 

```{r}
# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT
```

*Verify prepared data.* &nbsp; `study_observations`, included with midfieldr, contains the case study information developed above. Here we verify that the two data frames have the same content. 

```{r}
# Demonstrate equivalence
same_content(DT, study_observations)
```

In this form, the observations are the starting point for part 3 of the case study. 


## Closer look

We examine the study observations for a few specific students to better illustrate the structure of these data.  

```{r}
#| echo: false
#| eval: false
# Find example IDs

x <- graduates[ever_enrolled, on = "mcid"]
y <- x[is.na(program)]
y$mcid[duplicated(y$mcid)]

y <- x[!is.na(program)]
y$mcid[duplicated(y$mcid)]

x[program == i.program]

mcid_we_want <- "MCID3112470255"
DT[mcid == mcid_we_want]
term[mcid == mcid_we_want]
degree[mcid == mcid_we_want]
```

```{r}
#| echo: false
# Preserve settings
op <- options()

# Edit number of rows to print
options(datatable.print.nrows = 15)
```

 
*Example 1.* &nbsp; This ID yields one observation only. The student was enrolled in Electrical Engineering but did not complete one of the four case study programs. 

```{r}
# Display one student by ID
mcid_we_want <- "MCID3111171519"
DT[mcid == mcid_we_want]
```

A closer look at the student's `term` record confirms the result: the student was enrolled in CIP 141001 (Electrical Engineering) but switched to CIP 110701 (Computer Science). The `degree` record indicates that the student graduated in Computer Science. 

```{r}
# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]
```

*Example 2.* &nbsp; This ID yields two observations indicating that the student was enrolled in Industrial/Systems Engineering and a timely graduate of that program.  

```{r}
# Display one student by ID
mcid_we_want <- "MCID3111150194"
DT[mcid == mcid_we_want]
```

The `term` and `degree` excerpts confirm those observations. 

```{r}
# Closer look at terms
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]
```

*Example 3.* &nbsp; This ID yields two observations indicating that the student was enrolled in Electrical Engineering and in Civil Engineering but a timely graduate of neither program. 

```{r}
# Display one student by ID
mcid_we_want <- "MCID3111264877"
DT[mcid == mcid_we_want]
```

The `term` excerpt agrees; the `degree` record shows they graduated in CIP 261399 (Biological and Biomedical Sciences). 

```{r}
# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]
```

*Example 4.* &nbsp; This ID yields four observations indicating that the student was enrolled in Civil, Electrical, and Mechanical Engineering and a timely graduate of Mechanical.   

```{r}
# Display one student by ID
mcid_we_want <- "MCID3112470255"
DT[mcid == mcid_we_want]
```

The `term` and `degree` excerpts confirm those observations.

```{r}
# Closer look at term
term[mcid == mcid_we_want]

# Closer look at degree
degree[mcid == mcid_we_want]
```

```{r}
#| echo: false
# Restore original settings
options(op)
```


## References

<div id="refs"></div>

```{r child = "../man/rmd/common-closing.Rmd"}
```