---
title: "CONNECT: DATA LINKING GUIDE"
author: "Strategic Data Project"
date: "Center for Education Policy Research at Harvard University"
output:
  pdf_document:
    toc: yes
    toc_depth: 4
    latex_engine: xelatex
    includes:
      in_header: harvardheader.tex
      before_body: harvard_prefix.tex
  html_document:
    toc: yes
    toc_depth: 4
---

```{r setup, echo=FALSE, error=FALSE, message=FALSE, warning=FALSE, comment=NA}
# Set options for knitr
library(knitr)
knitr::opts_chunk$set(comment=NA, warning=FALSE, echo=TRUE,
                      error=FALSE, message=FALSE, fig.align='center')
options(width=80)
```

# CONNECT: DATA LINKING GUIDE

## Introduction 

### Purpose

**Connect** links data elements from across your system into one analysis file. 
The file allows you to execute analyses inspired by the SDP College Going 
Diagnostic to examine students’ progression through high school and college.

After completing **Connect**, you will have:

- Produced student-level files that track high school completion and graduation
- Linked college enrollment and persistence data from the National
Student Clearinghouse (NSC) to your agency’s student achievement records
- Merged disparate data files to create a single analysis file to support 
**Analyze**.

The National Student Clearinghouse collects information on postsecondary 
enrollment for students across the country. To access your agency’s data, please
visit: studentclearinghouse.org.

At the end of **Connect**, you will have merged 7 files and generated the 
necessary variables for the analysis file.

Note: This guide references **Identify** and requires output from **Clean**. To 
move through Connect, you should review these stages of the toolkit.

### Data and Structure

Connect consists of 10 steps that build one analysis file from various sources.
After the ten steps, there is also a section on producing on-track indicators
for graduation. The steps in **Connect** require data files from **Clean**.


![Task Structure](includes/img/DLTaskDataAndStructure.jpg)

Throughout **Connect** the term "merge" indicates that two files will be linked.
Merging allows you to combine datasets horizontally and add new columns (or
variables) from one dataset to another based on identifier(s) present in both.
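
To make the idea of a merge concrete, here is a minimal sketch on two small, hypothetical data frames (the object names `toy_attributes` and `toy_scores` and their values are invented for illustration and are not part of the toolkit data). It assumes the `dplyr` package, which is loaded later as part of `tidyverse`.

```r
library(dplyr)

# Hypothetical toy data: one row per student in each file
toy_attributes <- data.frame(sid = c(1, 2, 3), male = c(0, 1, 1))
toy_scores     <- data.frame(sid = c(1, 2, 4), test_math_8 = c(310, 295, 330))

# left_join keeps every row of toy_attributes and adds the columns of
# toy_scores wherever sid matches; students without a match get NA
toy_merged <- left_join(toy_attributes, toy_scores, by = "sid")
toy_merged
```

The result has one row per student in the first file and one new column from the second. Which join function is appropriate depends on which unmatched records you want to keep.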


### Step Components

For each step, you will find the following:

- Purpose: an overview of each step;
- Files Needed: data elements or files required to complete each step;
- After this step: an overview of "output" generated by each step.

Also, throughout **Connect**, you will find R code to explain each of the 
sub-steps. Code appears in boxes, like below:

```{r unevaluatedExample, eval=FALSE, echo=TRUE}
# keep only observations if 8th grade math score is not missing
stutest %<>% filter(!is.na(test_math_8))

# check to see if the file is unique by student id
nrow(stutest) == n_distinct(stutest$sid)

```

### Infrastructure

In the infrastructure folder you unzipped from
[sdp.cepr.harvard.edu/toolkit](www.sdp.cepr.harvard.edu/toolkit), look
for the files in the **clean** folder and the files `Connect.R` and 
`Connect_On_Track.R` in the **programs** folder. The files in the **clean**
folder are those you produced via **Clean** and that we have also provided for
you. The R scripts provide a shell for you to fill in the ten steps of
**Connect** and the section on On-Track indicators. Doing so will allow you to
produce the `CG_Analysis` file, which will be saved to the **analysis** folder.

As always, if you would like additional support from the friendly SDP team,
please email us at sdp@gse.harvard.edu.

### Prepare the R Environment

```{r loadRequiredPackages}
#  Load the packages and prepare your R environment
library(tidyverse) # main suite of R packages to ease data analysis
library(magrittr) # allows for some easier pipelines of data

# Read in some R functions that are useful for toolkit tasks, see SDP R Glossary
# for details
source("R/functions.R")
library(haven) # required for importing .dta files
```


## Step 1: Prior Achievement Part 1

### Overview

**Purpose:** Prepare 8th grade test scores for the analysis file.

**Files needed:** Prior_Achievement output file from Task 5 in **Clean**

**After this step** you will have a temporary file `tests` (held in R as the 
`stuach` object) that contains prior achievement information for math, ELA, and 
the math-ELA composite score.


#### 1.1 Load the File and Check Uniqueness

Begin Step 1 of Connect by loading the `Prior_Achievement` file resulting from 
Task 5. 


```{r readPriorAchievement}
# To read data from a zip file we create a connection to the path of the 
# zip file
tmpfileName <- "clean/Prior_Achievement.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuach <- read_stata(con) # read data in the data subdirectory
close(con)
```


Note the structure of the file. Thanks to Task 5, the file is unique by `sid` 
and contains test scores for only 8th grade math, ELA, and the math-ELA composite. 
If your data is not structured like this, please review Task 5.

Raw scores (`raw_score_math` and `raw_score_ela` from Task 5) are not shown. You 
will primarily use scaled or standardized scores in future analyses. However, 
keep raw scores in your file to compare results between scaled, standardized, 
or raw scores later on.

#### 1.2 Rename variables to indicate that they are 8th grade

```{r renameg8Vars}
stuach %<>% rename( 
       test_math_8_raw = raw_score_math,
       test_ela_8_raw = raw_score_ela,
       test_math_8 = scaled_score_math,
       test_ela_8 = scaled_score_ela,
       test_composite_8 = scaled_score_composite,
       test_math_8_std = scaled_math_std,
       test_ela_8_std = scaled_ela_std, 
       test_composite_8_std = scaled_score_composite_std)
```

#### 1.3 Define Prior Achievement Quartiles in Each Subject

Prior achievement quartiles have to be created by subject and by year. 

Create a variable that captures the quartile of an eighth grader’s score in 
each subject, relative to peers who took the same test in the same school year.
This allows you to compare the performance of each student to that of peers in 
the same year.

In R, this can be done by combining `group_by` and `mutate`, as below.

```{r defineQuartiles}
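# ntile() assigns quartiles within each school year; ungroup afterwards so
# that later steps do not inherit the grouping
stuach %<>% group_by(school_year) %>% 
  mutate(qrt_8_math = ntile(test_math_8, 4), 
         qrt_8_ela = ntile(test_ela_8, 4), 
         qrt_8_composite = ntile(test_composite_8, 4)) %>% 
  ungroup()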
```
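
To see what "by year" means in practice, the following sketch applies the same `group_by` plus `ntile` pattern to a small hypothetical data frame (`toy_scores` and its values are invented for illustration). Because quartiles are computed within each school year, a score that would sit in the bottom quartile overall can sit in any quartile of its own year.

```r
library(dplyr)

# Hypothetical scores for eight students across two school years
toy_scores <- data.frame(
  sid         = 1:8,
  school_year = rep(c(2005, 2006), each = 4),
  test_math_8 = c(200, 250, 300, 350, 400, 450, 500, 550)
)

# Quartiles are assigned separately within each school year
toy_scores <- toy_scores %>%
  group_by(school_year) %>%
  mutate(qrt_8_math = ntile(test_math_8, 4)) %>%
  ungroup()

toy_scores
```

In this toy example each year contains four students, so each year receives quartiles 1 through 4 even though every 2006 score is higher than every 2005 score.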

## Step 2: School Crosswalk

### Overview 

**Purpose:** Prepare a crosswalk between school codes and names. This allows 
you to link students’ high school graduation with their college 
enrollment outcomes.

**Files needed:** School research file from Identify

**After this step** you will have created a temporary file `highschoolinfo` 
(held in R as the `schl` object) that contains a crosswalk between school codes 
and school names.

### Steps

#### 2.1 Load the File and Check Uniqueness

```{r loadSchoolData}
# To read data from a zip file we create a connection to the path of the 
# zip file
tmpfileName <- "clean/School.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
schl <- read_stata(con) # read data in the data subdirectory
close(con)
```


Load the School research file from **Identify**. Restrict the "universe" of 
schools to only include high schools.

A crosswalk table ensures the final file is unique by `school_code` and that one 
`school_code` maps to one `school_name`. For example, in an uncleaned file, 
Albert Einstein High School might be spelled three ways, "A. Einstein HS," 
"Einstein High School," or "A.E. HS," but have one `school_code`. Alternatively, 
"Jones High School" might appear under two codes, 153 and 154. You must fix 
these issues before moving on.

```{r distinctSchools}
# keep only the school code and school name
schl <- select(schl, school_name, school_code)
# drop duplicate rows
schl <- distinct(schl)
# check that the file is unique by school_code (should return TRUE)
length(unique(schl$school_code)) == nrow(schl)

```
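
If the uniqueness check fails, it helps to list the offending codes and names. The sketch below builds a hypothetical crosswalk (`toy_schl`, with invented codes and names based on the examples in the text) that contains both kinds of problem, then flags them. It is one possible diagnostic, not a step from the toolkit scripts.

```r
library(dplyr)

# Hypothetical crosswalk containing both kinds of problem
toy_schl <- data.frame(
  school_code = c(101, 101, 153, 154, 200),
  school_name = c("A. Einstein HS", "Einstein High School",
                  "Jones High School", "Jones High School",
                  "Lincoln High School")
)

# School codes attached to more than one name
codes_multi_name <- toy_schl %>%
  group_by(school_code) %>%
  summarize(n_names = n_distinct(school_name)) %>%
  filter(n_names > 1)

# School names attached to more than one code
names_multi_code <- toy_schl %>%
  group_by(school_name) %>%
  summarize(n_codes = n_distinct(school_code)) %>%
  filter(n_codes > 1)

codes_multi_name
names_multi_code
```

Both result tables should be empty before you move on; any rows they contain point to the records that need to be reconciled.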

#### 2.2 Generate First, Last, and Longest High School Variables

Next, generate three variables, `first_hs_code`, `last_hs_code`, and 
`longest_hs_code`. Set them equal to the `school_code` of each high school.

```{r genSchoolRenameVars}
# creates first / last / longest hs id variables
schl$first_hs_code <- schl$school_code
schl$last_hs_code <- schl$school_code
schl$longest_hs_code <- schl$school_code
```


## Step 3: Student Attributes

### Overview 

**Purpose:** Load Student Attributes data to obtain time-invariant information for 
students in the system.

**Files needed:** `Student_Attributes` output file from Task 1 in Clean

**After this step** you will have loaded the Student Attributes data into memory.

### Steps

#### 3.1 Load the File and Check Uniqueness

```{r loadStudentAttributes}
# To read data from a zip file we create a connection to the path of the 
# zip file
tmpfileName <- "clean/Student_Attributes.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuatt <- read_stata(con) # read data in the data subdirectory
close(con)
```
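
The heading for this sub-step mentions checking uniqueness, so here is a minimal sketch of such a check on a hypothetical attributes file (`toy_stuatt` is invented for illustration and deliberately contains a duplicate). The same pattern could be applied to `stuatt`, assuming it is expected to have one row per `sid`.

```r
library(dplyr)

# Hypothetical attributes file in which student 2 appears twice
toy_stuatt <- data.frame(sid = c(1, 2, 2, 3), male = c(0, 1, 1, 0))

# TRUE only when every sid appears exactly once
is_unique <- nrow(toy_stuatt) == n_distinct(toy_stuatt$sid)
is_unique

# List the student ids that appear more than once
dup_sids <- toy_stuatt %>% count(sid) %>% filter(n > 1)
dup_sids
```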

## Step 4: Student School Year

### Overview

**Purpose:** Merge Student School Year data with Student Attributes data in 
memory and generate program participation status variables.

**Files needed:** `Student_School_Year_Ninth` output file from Task 3 in **Clean**

**After this step** you will have merged Student School Year data with Student 
Attributes data in memory and generated variables that indicate if a student 
has ever been classified as FRPL (Free and Reduced Price Lunch), 
IEP (Individualized Education Plan), ELL (English Language Learner) or Gifted
in the system or during high school.

### Steps

#### 4.1 Merge on Student School Year

Merge the Student School Year file with the Student Attributes
file in memory. This adds student-level information that may
change from year to year (e.g., FRPL, IEP, ELL, or Gifted status). This is a
1:m (one-to-many) merge because the Student Attributes file is
unique by `sid` and the Student School Year file is unique by `sid` +
`school_year`.

Before conducting the merge, the Student School Year output file should be 
unique by `sid` and `school_year` and contain information on all available grades. 
The Student School Year data should also include the `first_9th_school_year_observed`
variable.

```{r mergeonStudentSchYear}
tmpfileName <- "clean/Student_School_Year_Ninth.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stusy <- read_stata(con)
close(con)

# Data checks
# Is data unique by sid and school year? (should return TRUE)
nrow(distinct(stusy, sid, school_year)) == nrow(stusy)

# How many unique grades exist?
table(stusy$grade_level)

# Does first 9th school year exist?
"first_9th_school_year_observed" %in% names(stusy)

# Optional: Get data dimensions for both frames for checking
nrow_stusy <- nrow(stusy)
nstu_stusy <- n_distinct(stusy$sid)
nrow_stuatt <- nrow(stuatt)
nstu_stuatt <- n_distinct(stuatt$sid)

# Merge
stusy <- inner_join(stusy, stuatt, by = "sid")
```


#### 4.2 Checking the Merge

In an ideal world records match perfectly. However, administrative records are 
often messy. Perfect merges rarely occur. Therefore, consider a merge 
satisfactory if at least 95% of students appear in both files.

By design, `inner_join` keeps only the observations (rows) whose `sid` appears in both files. 


```{r checkStuSchYearMerge}
# check the number and percentage of students appearing in both files

# Check for perfect merge
nrow(stusy) == nrow_stusy
nstu_stusy == n_distinct(stusy$sid)
nstu_stuatt == n_distinct(stusy$sid)

# Check merge percentage (share of rows and students retained after the merge)
nrow(stusy) / nrow_stusy
n_distinct(stusy$sid) / nstu_stusy
n_distinct(stusy$sid) / nstu_stuatt

stusy <- arrange(stusy, sid)
length(unique(stusy$sid)) == length(unique(stuatt$sid))
```
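
The 95% rule of thumb can also be checked before merging. The sketch below uses two hypothetical files (`toy_stusy` and `toy_stuatt`, invented for illustration) together with `semi_join` and `anti_join` from `dplyr` to compute the share of students who match and to list those who do not.

```r
library(dplyr)

# Hypothetical files: student 4 has no attributes record,
# and student 5 has no school-year record
toy_stusy  <- data.frame(sid = c(1, 1, 2, 3, 4),
                         school_year = c(2005, 2006, 2005, 2005, 2005))
toy_stuatt <- data.frame(sid = c(1, 2, 3, 5), male = c(0, 1, 1, 0))

# Rows of the school-year file whose sid also appears in the attributes file
matched <- semi_join(toy_stusy, toy_stuatt, by = "sid")
share_matched <- n_distinct(matched$sid) / n_distinct(toy_stusy$sid)
share_matched

# Rows of the school-year file with no attributes record
unmatched <- anti_join(toy_stusy, toy_stuatt, by = "sid")
unmatched

# Does the merge meet the 95% threshold?
share_matched >= 0.95
```

Listing the unmatched records, rather than only counting them, makes it easier to spot patterns such as a missing school year or a block of malformed ids.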

#### 4.3 Generate Program Participation Variables

Now, create binary variables (variables that assume values of 0 or 1) to 
indicate if a student ever:

1. qualified to participate in FRPL;
2. qualified for an IEP;
3. was classified as ELL (or LEP);
4. qualified for a gifted program.

These variables (`frpl_ever`, `iep_ever`, `ell_ever`, and `gifted_ever`) allow 
you to explore high school and college outcomes for students who participated 
in these programs for one or more school years. 

Create analogous variables to capture students’ program participation status in 
high school.

```{r genProgramPartVars}
# Collapse the high school years to one row per student, then merge the
# resulting flags back onto the student-year file
tmp <- filter(stusy, (grade_level >= 9 & grade_level <= 12)) %>% 
               group_by(sid) %>% 
  summarize(frpl_ever_hs = ifelse(max(frpl) > 0, 1, 0), 
         iep_ever_hs = max(iep), 
         ell_ever_hs = max(ell), 
         gifted_ever_hs = max(gifted))

stusy <- inner_join(stusy, tmp, by = "sid")
stusy <- arrange(stusy, sid)

stusy %<>% group_by(sid) %>% 
  mutate(frpl_ever = ifelse(max(frpl) > 0, 1, 0), 
         iep_ever = max(iep), 
         ell_ever = max(ell), 
         gifted_ever = max(gifted))

rm(tmp)
```

High school status variables (`frpl_ever_hs`, `iep_ever_hs`, `ell_ever_hs` 
and `gifted_ever_hs`) allow flexibility in defining student subgroups. This is 
useful if participation data is missing non-randomly before a student enters 
high school, or if variables that capture participation are overly inclusive 
across years. For example, a student who demonstrates limited English 
proficiency in 4th grade may be fluent in English by 9th grade. It may or may 
not be appropriate to categorize the student as ELL in analyses that examine 
high school outcomes.
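
The difference between the two sets of flags is easiest to see on a small example. In the hypothetical data below (`toy_years`, invented for illustration), student 1 is classified as ELL only in 4th grade, while student 2 is classified as ELL in 10th grade.

```r
library(dplyr)

# Hypothetical student-year records
toy_years <- data.frame(
  sid         = c(1, 1, 1, 2, 2),
  grade_level = c(4, 9, 10, 9, 10),
  ell         = c(1, 0, 0, 0, 1)
)

# ell_ever looks at all grades; ell_ever_hs looks only at grades 9-12.
# Every student in this toy data has at least one high school record;
# a student with none would need separate handling.
toy_flags <- toy_years %>%
  group_by(sid) %>%
  summarize(
    ell_ever    = max(ell),
    ell_ever_hs = max(ell[grade_level >= 9 & grade_level <= 12])
  )

toy_flags
```

Student 1 has `ell_ever` equal to 1 but `ell_ever_hs` equal to 0, which is exactly the case described in the text.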


## Step 5: Student School Enrollment

### Overview

**Purpose:** Merge Student School Enrollment with data in memory.

**Files needed:** Student_School_Enrollment_Clean output from Task 4 in Clean

**After this step** you will have merged Student School Enrollment data with 
data in memory.

### Steps

#### 5.1 Merge on Student School Enrollment

Merge the Student School Enrollment file onto the analysis file.
This allows you to identify the high schools in which students were 
enrolled at different times.

This is a 1:m merge, as the data file from the previous two
steps is unique by `sid` + `school_year` and the Student School
Enrollment file is unique by `sid`, `school_year`, `school_code`, and
`enrollment_date`.

```{r loadAndMergeEnrollment}
tmpfileName <- "clean/Student_School_Enrollment_Clean.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuschl <- read_stata(con)
close(con)

# Optional - get dimensions for comparing merge
nstu_stusy <- n_distinct(stusy$sid)
nstu_stuschl <- n_distinct(stuschl$sid)
nrow_stusy <- nrow(stusy)

stusy <- inner_join(stusy, stuschl, by = c("sid", "school_year"))
```

Before the merge, the Student School Enrollment file should
be unique by `sid`, `school_year`, `school_code`, and `enrollment_date`.
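
As a sketch of what a 1:m merge does to the shape of the data, the example below joins a hypothetical student-year file to a hypothetical enrollment file (`toy_year` and `toy_enroll`, both invented for illustration). It first checks that the enrollment file is unique by its four-variable key.

```r
library(dplyr)

# Hypothetical student-year file: one row per sid + school_year
toy_year <- data.frame(sid = c(1, 2), school_year = c(2005, 2005),
                       frpl = c(1, 0))

# Hypothetical enrollment file: student 1 has two enrollment spells in 2005
toy_enroll <- data.frame(
  sid             = c(1, 1, 2),
  school_year     = c(2005, 2005, 2005),
  school_code     = c(101, 200, 101),
  enrollment_date = as.Date(c("2004-09-01", "2005-01-15", "2004-09-01"))
)

# The enrollment file should be unique by the full key (should return TRUE)
key_is_unique <- nrow(distinct(toy_enroll, sid, school_year, school_code,
                               enrollment_date)) == nrow(toy_enroll)
key_is_unique

# One student-year row can match several enrollment rows
toy_merged <- inner_join(toy_year, toy_enroll, by = c("sid", "school_year"))
nrow(toy_merged)
```

The merged file has more rows than the student-year file because student 1 contributes one row per enrollment spell. This is why the row ratio checked after the merge can exceed 1.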

#### 5.2 Checking the Merge

```{r checkStuSchlMerge}
# Check percentage of students and rows merged
n_distinct(stusy$sid) / nstu_stusy 
n_distinct(stusy$sid) / nstu_stuschl
nrow(stusy) / nrow_stusy  

# Above 0.95 so we can proceed
```

## Step 6: High School Indicators and Outcomes

### Overview

**Purpose:** Generate high school indicators and outcomes in two categories.

**Files needed:** The file in memory from Steps 3-5 and `highschoolinfo` from 
Step 2. 

**After this step** you will have created a number of high school indicators 
and outcomes: 1) first, last, and longest high school; 2) 9th grade and 
graduation cohorts; and 3) end of high school outcomes: on-time and late graduates 
and high school enrollment outcomes for non-graduates.

### Steps

#### 6.1 Define First, Last, and Longest High School

To begin, make sure that the data includes only student observations in high 
school. (You have done this in **Clean** already; we are checking it again here).

```{r selectHSonly}
stusy %<>% filter(grade_level >= 9 & !is.na(grade_level) & 
                    grade_level <= 12)

```

There might be students who are assigned to high schools but whose attendance 
duration is 0. Drop these school assignments to ensure that you assign students 
to high schools they actually attended. 

```{r dropZeroDayAttend}
stusy %<>% filter(days_enrolled > 0)
```

#### Define First High School

To identify a student’s first high school, determine the first enrollment 
episode for the student.

In some cases, students enroll in more than one school at the same time. In such 
cases, assign them to the school where they attended longest.

Should students have multiple first enrollments of the same length, randomly 
assign them to one of these schools.

```{r assignFirstHS}
# Sort so that, within each student, the earliest enrollment comes first and,
# among simultaneous enrollments, the longest spell comes first
stusy %<>% group_by(sid) %>%
  arrange(sid, school_year, enrollment_date, desc(days_enrolled)) %>% 
  mutate(first_hs_code = first(school_code))

```
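
Sorting alone does not break exact ties at random; it leaves tied rows in whatever order they already had. One way to make the random assignment explicit is to add a random tiebreaker column before sorting. The sketch below does this on hypothetical data (`toy_enroll`, the `tiebreak` column, and the seed value are all invented for illustration).

```r
library(dplyr)

set.seed(2718)  # arbitrary seed so the random assignment is reproducible

# Hypothetical enrollments: student 1 has two simultaneous spells of equal
# length; student 2 has two simultaneous spells of different lengths
toy_enroll <- data.frame(
  sid             = c(1, 1, 2, 2),
  school_year     = c(2005, 2005, 2005, 2005),
  enrollment_date = as.Date(rep("2004-09-01", 4)),
  days_enrolled   = c(180, 180, 180, 90),
  school_code     = c(101, 200, 101, 200)
)

toy_first <- toy_enroll %>%
  mutate(tiebreak = runif(n())) %>%      # random number used only for ties
  arrange(sid, school_year, enrollment_date, desc(days_enrolled), tiebreak) %>%
  group_by(sid) %>%
  mutate(first_hs_code = first(school_code)) %>%
  ungroup() %>%
  select(-tiebreak)

distinct(toy_first, sid, first_hs_code)
```

Student 2 is always assigned to school 101 because that spell is longer. Student 1 is assigned to one of the two tied schools, and setting a seed keeps that choice stable across runs.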


#### Define Last High School


To identify a student’s last high school, determine the last enrollment episode 
for the student. In cases of joint enrollment, use the school where the student 
attended longest. Where joint enrollment duration is the same, randomly assign 
the last high school.


```{r genLastHSCode}
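# Sort descending so that each student's most recent enrollment comes first,
# then take the first school code in that order
stusy %<>% group_by(sid) %>%
  arrange(sid, desc(school_year), desc(withdrawal_date), desc(days_enrolled)) %>% 
  mutate(last_hs_code = first(school_code)) %>%
  ungroup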
```


#### Define Longest High School

To determine the longest enrolling HS, you first have to add up all enrollments 
within a HS. Since in Clean you ensured that there are no overlapping enrollments 
within a school, you can add enrollments up. 

In cases where students enrolled in more than one school for the same amount of 
time, randomly assign the longest high school.

```{r defineLongestHS}
stusy %<>% group_by(sid, school_code) %>% 
  mutate(total_days_enrolled_in_school = sum(days_enrolled))

stusy %<>% group_by(sid) %>% 
  mutate(total_days_enrolled_in_school_max = max(total_days_enrolled_in_school))

stusy %>% select(sid, school_code, enrollment_date, total_days_enrolled_in_school, 
                 days_enrolled, first_hs_code) %>% 
  head

stusy %<>% group_by(sid) %>% 
  mutate(longest_hs_code = unique(school_code[total_days_enrolled_in_school_max ==
                                                total_days_enrolled_in_school])[1])


stusy %>% select(sid, school_code, enrollment_date, total_days_enrolled_in_school, 
                 days_enrolled, first_hs_code, longest_hs_code) %>% 
  head

# Drop temporary variables
stusy$total_days_enrolled_in_school <- NULL
stusy$total_days_enrolled_in_school_max <- NULL
```
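
The logic of summing spells within a school and then picking the school with the largest total can be checked on a small example. The sketch below uses hypothetical data (`toy_enroll`, invented for illustration) and an alternative, more compact formulation with `summarize` and `slice_max`; it is not the toolkit's own code.

```r
library(dplyr)

# Hypothetical enrollments: student 1 has two spells at school 101
# (100 + 90 days) and one spell at school 200 (150 days)
toy_enroll <- data.frame(
  sid           = c(1, 1, 1),
  school_code   = c(101, 200, 101),
  days_enrolled = c(100, 150, 90)
)

toy_longest <- toy_enroll %>%
  group_by(sid, school_code) %>%
  summarize(total_days = sum(days_enrolled), .groups = "drop") %>%
  group_by(sid) %>%
  # with_ties = FALSE keeps a single school per student; ties are resolved
  # by row order here, not at random
  slice_max(total_days, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(sid, longest_hs_code = school_code)

toy_longest
```

School 200 has the longest single spell, but school 101 has the longest total enrollment (190 days), so it is the longest high school for this student.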


#### Merge on highschoolinfo

Merge the `highschoolinfo` tempfile created in Step 2 onto the
current file. This allows you to obtain the school names (`school_name_first_hs`, 
`school_name_last_hs`, and `school_name_longest_hs`) associated with the high 
school codes just captured.

This requires merging data currently loaded in R to the
`schl` object three times – once on `first_hs_code`,
then on `last_hs_code`, and finally on `longest_hs_code`. These
merges will all be m:1 (many to one) because the file in memory
contains multiple observations per school and the tempfile
contains only one per school.

```{r mergeSchoolNames}

stusy <- left_join(stusy, schl[, c("school_name", "first_hs_code")], 
                   by = c("first_hs_code"))

stusy %<>% rename(school_name_first_hs = school_name)

stusy <- left_join(stusy, schl[, c("school_name", "last_hs_code")], 
                   by = c("last_hs_code"))

stusy %<>% rename(school_name_last_hs = school_name)

stusy <- left_join(stusy, schl[, c("school_name", "longest_hs_code")], 
                   by = c("longest_hs_code"))

stusy %<>% rename(school_name_longest_hs = school_name)
```

By default, the `left_join` function keeps all students in the `stusy` enrollment 
data. If a student has no school name that can be assigned, this is filled in as 
a missing value. This can be used to filter the merge results easily.

```{r cleanupMerge}

stusy %<>% filter(!is.na(school_name_longest_hs))
stusy %<>% filter(!is.na(school_name_first_hs) & !is.na(school_name_last_hs))

```


#### 6.2 Assign Ninth Grade and Graduation Cohorts

Assigning students to cohorts will allow you to calculate various indicators 
(e.g. high school graduation, college enrollment) using a different set of 
students in the denominator. For example, when calculating college enrollment, 
you could use the ninth grade cohort to illustrate how high schools prepared 
their incoming freshmen for future success, or you could use the graduating 
cohort to illustrate the percentage of a high school’s graduates enrolling in 
college.

Since the ninth grade cohort is equal to `first_9th_school_year_observed` in the 
student attributes file, just rename `first_9th_school_year_observed` to 
`chrt_ninth`.

```{r renameChrtNinghtVar}
# define ninth grade cohort by renaming first_9th_school_year_observed
stusy %<>% rename(chrt_ninth = first_9th_school_year_observed)
```


The graduation cohort variable, `chrt_grad`, is the school year in which a student 
graduated. If a student obtained a diploma prior to September 1st, the `chrt_grad`
variable is the same as the year of `hs_diploma_date`. If a student received a 
diploma between September 1st and December 31st treat them as graduates for the 
next school year.

```{r defineChrtGrad}
# define graduation cohort
stusy$chrt_grad <- NULL
library(lubridate)
# Use lubridate package to find months and years easily
head(year(stusy$hs_diploma_date))
head(month(stusy$hs_diploma_date))

stusy$chrt_grad <- ifelse(month(stusy$hs_diploma_date) < 9, 
                          year(stusy$hs_diploma_date), 
                            year(stusy$hs_diploma_date) + 1)

```
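
The September 1st rule can be verified on a few hypothetical diploma dates (the vector `toy_dates` is invented for illustration). The sketch applies the same `ifelse` logic as above, using `month` and `year` from `lubridate`.

```r
library(lubridate)

# Hypothetical diploma dates on either side of the September 1st cutoff
toy_dates <- as.Date(c("2009-06-15", "2009-08-31", "2009-09-01", "2009-12-20"))

# Diplomas before September count toward that calendar year;
# diplomas from September through December count toward the next year
toy_chrt_grad <- ifelse(month(toy_dates) < 9,
                        year(toy_dates),
                        year(toy_dates) + 1)
toy_chrt_grad
```

The first two dates fall in the 2009 graduation cohort and the last two fall in the 2010 cohort.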

**Question 6.2** Test your understanding by filling the shaded areas below.

```{r checkResultsChrtGrad}
stusy %>% filter(sid %in% c(16305, 16306, 16307)) %>% 
  select(sid, hs_diploma_date, chrt_ninth, chrt_grad) %>% 
  distinct(.keep_all=TRUE) 
```

Note that in **Clean**, you assigned every student a
`first_9th_school_year_observed`. You either had their first ninth grade in 
the data, or if they transferred into the district later in high school, you 
backward mapped them to an appropriate ninth grade school year. Thus, 
`chrt_ninth` should never be missing. `chrt_grad`, however, could be missing: 
students who did not graduate are not assigned a graduating cohort. 

#### 6.3 Define High School Outcomes

To determine end of high school outcomes, you need a student's last withdrawal 
code (withdrawal code at their last high school). Use the last withdrawal code 
to determine if the student graduated, transferred out, dropped out, or has 
another outcome. 

#### 6.3a Group last withdrawal codes together into four end of high school outcomes:


* 1 = Graduated (define graduated using withdrawal data and `hs_diploma` in the Student Attributes file, or any other source of graduation information used)
* 2 = Transfer Out
* 3 = Drop Out
* 4 = Other (all other reasons for withdrawal)

Outcomes are captured using the `last_wd_group` variable. Assigning the last 
withdrawal code to `last_wd_group` requires an understanding of decision rules in 
your agency. Some withdrawal codes may be ambiguous or redundant and need to be 
combined to fit under the four categories. Therefore, it is important to elicit 
help from those knowledgeable of local graduation, transfer, and dropout policies 
in your agency. For example, there may be special codes for students who are 
incarcerated or pass away that are not well-documented. We provide an example 
below, but you will have to customize this script based on values your agency 
uses. Particularly, for dropouts you should make sure the agency is not being 
penalized for something it does not have control over.

First, examine the values for `last_withdrawal_reason`. You will have to make sure 
that you capture all these values in defining the last withdrawal groups. Should 
your data have any missing values for `last_withdrawal_reason`, be sure to assign 
those to an appropriate category. In some agencies, a missing withdrawal code 
indicates that the student is still enrolled, so assign them as still enrolled.

```{r checkWithdrawalCodes}
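# Group by student explicitly so that one last withdrawal reason is
# returned per student
stusy %>% arrange(sid) %>% group_by(sid) %>% 
  summarize(last_withdrawal = last(last_withdrawal_reason)) %>% 
  select(last_withdrawal) %>% unlist %>% table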

```


```{r recodeWithdrawalCodes}
stusy$last_wd_group <- NA
stusy$last_wd_group[stusy$last_withdrawal_reason %in% c("Home School", 
                                                        "Other Transfer", 
                                                        "Transfer Out of District", 
                                                        "Death")] <- 2
stusy$last_wd_group[stusy$last_withdrawal_reason %in% c("Absenteeism", 
                                                        "No Show", 
                                                        "Expulsion")] <- 3
stusy$last_wd_group[stusy$hs_diploma == 1] <- 1
stusy$last_wd_group[is.na(stusy$last_wd_group)] <- 4
table(is.na(stusy$last_wd_group))
```

Note that we populated the graduating group last; this is because the evidence 
of a high school diploma overrides any other value for last withdrawal reason. 
Any remaining `last_withdrawal_reason` values were then classified as 4, Other. 

#### 6.3b Define High School Outcomes

**Identify On-Time and Late High School Graduates**

First, identify students who graduated within 4 years of entering high school 
(on-time graduates) as well as students who took more than 4 years 
(late graduates). These variables allow you to examine time taken to complete 
high school and explore how this varies across high schools within a system. 
These two variables have to add up to the total graduates.

```{r defineGradTypes}
# define on-time graduates
stusy$ontime_grad <- ifelse(stusy$chrt_ninth >= stusy$chrt_grad -3 & 
                        !is.na(stusy$chrt_ninth) & 
                        !is.na(stusy$chrt_grad) & 
                        stusy$hs_diploma == 1 , 1, 0)

# define late graduates
stusy$late_grad <- ifelse(stusy$ontime_grad == 0 & 
                        !is.na(stusy$chrt_ninth) & 
                        !is.na(stusy$chrt_grad) & 
                        stusy$hs_diploma == 1 , 1, 0)
all(stusy$late_grad + stusy$ontime_grad == stusy$hs_diploma)
```
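
A small worked example shows how the cohort arithmetic separates on-time from late graduates. The data frame `toy_grads` below is hypothetical, and the condition is a condensed version of the rule used above: a student is on time when `chrt_ninth >= chrt_grad - 3`.

```r
# Hypothetical students who all entered ninth grade in the 2005 cohort
toy_grads <- data.frame(
  sid        = 1:3,
  chrt_ninth = c(2005, 2005, 2005),
  chrt_grad  = c(2008, 2009, NA),   # student 3 did not graduate
  hs_diploma = c(1, 1, 0)
)

# On time: graduated, and graduation cohort is at most 3 years after
# the ninth grade cohort
toy_grads$ontime_grad <- ifelse(!is.na(toy_grads$chrt_grad) &
                                  toy_grads$hs_diploma == 1 &
                                  toy_grads$chrt_ninth >= toy_grads$chrt_grad - 3,
                                1, 0)

# Late: graduated but not on time
toy_grads$late_grad <- ifelse(toy_grads$hs_diploma == 1 &
                                toy_grads$ontime_grad == 0, 1, 0)

toy_grads

# The two flags should add up to the diploma indicator for every student
all(toy_grads$ontime_grad + toy_grads$late_grad == toy_grads$hs_diploma)
```

Student 1 (2005 to 2008) is on time, student 2 (2005 to 2009) is late, and student 3 is neither because no diploma was earned.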

**Identifying High School Enrollment Outcomes for Non-Graduates**

Next, assign high school enrollment outcomes for students who have not graduated 
by a point in time. You may define this point, but typically it is the current 
year if data is up to date. It is important that each student is either marked 
as a graduate or assigned to only one of the following categories. Notice how 
the definition of each category is conditional on all previous categories.

```{r hsOutcomesNonGrads}
# still enrolled

maxDataYear <- max(stusy$school_year)
stusy %<>% group_by(sid) %>% 
  mutate(still_enrl = ifelse(max(school_year) == maxDataYear & 
                               hs_diploma != 1, 1, 0))
# transfer out

stusy$transferout <- ifelse(stusy$last_wd_group == 2 & 
                              stusy$hs_diploma!=1 & 
                              stusy$still_enrl != 1, 1, 0)
# drop out
stusy$dropout <- ifelse(stusy$last_wd_group == 3 & 
                              stusy$hs_diploma!=1 & 
                              stusy$still_enrl != 1 & 
                              stusy$transferout != 1, 1, 0)
# disappear
stusy$disappear <- ifelse(stusy$dropout != 1 & 
                              stusy$hs_diploma!=1 & 
                              stusy$still_enrl != 1 & 
                              stusy$transferout != 1, 1, 0)

```
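
Because each student should be either a graduate or in exactly one non-graduate category, the outcome flags should sum to 1 for every student. The sketch below shows one way to check this on hypothetical data (`toy_outcomes` is invented for illustration); the same idea could be applied to the flags created above.

```r
# Hypothetical outcome flags, one row per student
toy_outcomes <- data.frame(
  sid         = 1:5,
  hs_diploma  = c(1, 0, 0, 0, 0),
  still_enrl  = c(0, 1, 0, 0, 0),
  transferout = c(0, 0, 1, 0, 0),
  dropout     = c(0, 0, 0, 1, 0),
  disappear   = c(0, 0, 0, 0, 1)
)

outcome_vars <- c("hs_diploma", "still_enrl", "transferout",
                  "dropout", "disappear")

# Number of outcomes flagged for each student; should be exactly 1
n_outcomes <- rowSums(toy_outcomes[, outcome_vars])
all(n_outcomes == 1)

# Students with zero or multiple outcomes need to be investigated
problem_sids <- toy_outcomes$sid[n_outcomes != 1]
problem_sids
```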

You have generated most of the key high school indicators and outcomes, so you 
no longer need all source variables. Keep only the variables listed here.

Because the remaining variables are time-invariant, the file becomes unique by 
`sid` once you keep only these variables and then drop duplicate rows.

```{r keepTimeInvariantVars}
# keep time-invariant variables
stusy %<>% ungroup %>% 
  select(sid, male, race_ethnicity, 
         last_wd_group, still_enrl, transferout, 
         dropout, disappear,
         matches("hs_diploma|_ever|_hs_code|school_name|chrt|_grad"))

stusy %<>% distinct(.keep_all = TRUE)

# make sure the file is unique by sid (should return TRUE)
nrow(stusy) == length(unique(stusy$sid))

```

## Step 7: Prior Achievement Part 2

### Overview 

**Purpose:** Merge prior achievement test scores onto the analysis file.

**Files needed:** the analysis file in memory from Step 6 and `tests` from Step 1

**After this step** you will have merged prior achievement data with the current
analysis file.

#### 7.1 Merge on tests

You have a data set unique by `sid` that contains key student attributes, 
high school indicators and outcomes. All you need now are 8th grade test scores.
To add prior achievement scores to the file, merge the `stuach` object from 
Step 1 onto the current analysis file. This is a 1:1 merge on `sid`. 

Use `left_join` to keep all the `sid` in the enrollment file. You may expect to 
capture prior achievement for most students, but not all students will have
score information so it is important to keep the enrollment file. For example, 
students who first enroll in the system during high school (after 8th grade) or 
were exempt from tests will not have prior test scores.

Using `left_join` saves a step of dropping students in `stuach` who do not appear in `stusy`. 

```{r MergePriorAchievement}
stusy <- left_join(stusy, stuach, by = "sid")
```

Review the merge results to see what percentage of students
have prior scores.

```{r checkPriorAchieveMerge}
table(is.na(stusy$qrt_8_math))
```
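
The table above gives counts; to read off a percentage directly, you can take the mean of the non-missing indicator. A minimal sketch on hypothetical data (`toy_analysis`, invented for illustration):

```r
# Hypothetical analysis file: 3 of 10 students have no 8th grade quartile
toy_analysis <- data.frame(
  sid        = 1:10,
  qrt_8_math = c(1, 2, NA, 4, 3, NA, 2, 1, NA, 4)
)

# Share of students with a non-missing math quartile
share_with_scores <- mean(!is.na(toy_analysis$qrt_8_math))
share_with_scores
```

In this toy file the share is 0.7, meaning 70% of students have prior achievement data.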

In this case, almost 70% of high school students have prior achievement data. 

## Step 8: Examining the Analysis File Part 1

### Overview

Congratulations! You have now finished working with agency administrative 
records and can save a preliminary analysis file to generate analyses on student 
transitions through high school completion and college-going success. 

First, order the variables in a sensible way.

```{r organizeColumns}
stusy %<>% select(sid, male, race_ethnicity, hs_diploma, hs_diploma_type, 
                  hs_diploma_date, frpl_ever, iep_ever, ell_ever, gifted_ever,
                  frpl_ever_hs, iep_ever_hs, ell_ever_hs, gifted_ever_hs,
                  first_hs_code, last_hs_code, longest_hs_code, school_name_first_hs, 
                  school_name_last_hs, school_name_longest_hs, last_wd_group,
                  chrt_ninth, 
                  chrt_grad, ontime_grad, late_grad, still_enrl, transferout,
                  dropout, disappear, test_math_8_raw, test_math_8, 
                  test_math_8_std, test_ela_8_raw, test_ela_8, 
                  test_ela_8_std, test_composite_8, test_composite_8_std, 
                  qrt_8_math, qrt_8_ela, qrt_8_composite)

# Save as college going
stuCG <- stusy; rm(stusy)
```


All that is left to do is process college enrollment records from the National 
Student Clearinghouse (NSC), and merge these data onto the `Student_CollegeGoing`
file. This creates a single analysis file to generate analyses on student 
transitions through high school and college.

Before moving on, take a moment to admire your work and refamiliarize yourself 
with sources and processes for each of these variables. Ask yourself: What
research files were the variables produced from? How were high school indicators 
and outcomes created?

```{r reviewVarNames}
names(stuCG)
```


Use the following questions for the numbered variables above to guide your thoughts:

1. (1-6): Which research file are these variables from?
2. (7-14): Which research file are these variables from?
3. (15-17): How are first, last, and longest high school codes identified?
4. (18-20): From what research file are first, last, and longest high school names obtained?
5. (21): What does the `last_wd_group` variable describe?
6. (22-23): How are 9th grade and graduation cohorts defined?
7. (24-29): How are graduation and high school enrollment outcomes defined?
8. (30-40): Which research file are these variables from?

## STEP 9: National Student Clearinghouse Data

**Purpose:** Generate college enrollment and persistence indicators

**Files needed:** The `Student_NSC_Enrollment_Indicators` data from **Task 7** 
in **Clean**, and the `Student_CollegeGoing` file you saved in Step 8.

**After this step** you will have created the indicators used for college-going 
analysis.

For all NSC-related analyses, we create two types of indicators: one based on 
the graduating cohort, and another based on the ninth grade cohort. Each of these 
indicators serves a different purpose and answers different questions. For 
example, if you are interested in how well high schools or the entire agency are 
enrolling their graduates in college, use the indicators based on the 
graduating cohort. 

If, however, you want to evaluate how well the high schools or agency are 
preparing incoming freshmen to complete high school and enroll in college, use 
the indicators based on the ninth grade cohort. In addition, we create 
separate indicators to evaluate how soon after high school students enroll in 
college. One set of indicators is based on enrollment on October 1st. The second 
set of indicators is based on enrollment within two years of graduation. This 
latter indicator is calculated from the calendar date of the student’s high 
school graduation (or, for the ninth grade cohort, the expected on-time 
high school graduation date). 

To begin this step, open the `Student_NSC_Enrollment_Indicators` file from Task 7, 
and merge it with the `Student_CollegeGoing` data (the `stuCG` object) from Step 8.

```{r joinNSCData}
tmpfileName <- "clean/Student_NSC_Enrollment_Indicators.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stunsc <- read_stata(con)
close(con)

# Select the variables needed from Student_CollegeGoing into a temporary object
tmp <- select(stuCG, sid, hs_diploma_date, hs_diploma, chrt_grad, chrt_ninth)
# Use inner_join to only keep students in both
stunsc <- inner_join(tmp, stunsc, by = c("sid")); rm(tmp)
```

#### 9.1 Create a variable to indicate if the student enrolled in college within two years of graduating from high school.

Start with the graduating cohort. If the student enrolled in college within 
`2*365` days after high school graduation, set the indicator to 1.

```{r genTwoYearEnrollment}
# Create an indicator to show if the student enrolled within two years 
# of HS graduation

stunsc$enrl_ever_w2_grad <- ifelse(stunsc$first_enrl_date_any < 
                                     (stunsc$hs_diploma_date + (365*2)) &
                                     !is.na(stunsc$hs_diploma_date) & 
                                     !is.na(stunsc$first_enrl_date_any), 
                                   1, 0)
```
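As a quick check of the date arithmetic, the sketch below applies the same rule to three made-up students. The data frame `toy` and its values are hypothetical; only the column names come from the code above.

```r
# Hypothetical records: sid 1 enrolls within two years, sid 2 enrolls too late,
# and sid 3 never enrolls (missing first enrollment date)
toy <- data.frame(
  sid = 1:3,
  hs_diploma_date = as.Date(c("2008-06-01", "2008-06-01", "2008-06-01")),
  first_enrl_date_any = as.Date(c("2008-09-02", "2011-01-15", NA))
)

# Same rule as above: enrolled fewer than 2 * 365 days after the diploma date
toy$enrl_ever_w2_grad <- ifelse(
  toy$first_enrl_date_any < (toy$hs_diploma_date + (365 * 2)) &
    !is.na(toy$hs_diploma_date) &
    !is.na(toy$first_enrl_date_any),
  1, 0)

toy$enrl_ever_w2_grad
```

The student with a missing enrollment date receives 0 rather than `NA`, because the `!is.na()` conditions evaluate to `FALSE` and `NA & FALSE` is `FALSE` in R.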

For the ninth grade cohort, first create a variable that represents on-time 
graduation. Set this date to September 1st the fourth year after the student’s 
ninth grade cohort. Then create the enrollment indicator using this date.

```{r genEnrollForOntimeGrad}
stunsc$ontime_yr <- stunsc$chrt_ninth + 3
stunsc$ontime_date <- mdy(paste0("09", "01", stunsc$ontime_yr))

# Create an indicator to show if the student enrolled within two years of 
# expected HS graduation
stunsc$enrl_ever_w2_ninth <- ifelse(stunsc$first_enrl_date_any < 
                                      (stunsc$ontime_date + (365*2)) & 
                                      !is.na(stunsc$ontime_date) &
                                      !is.na(stunsc$first_enrl_date_any), 
                                    1, 0)

```
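To see what the on-time date and the two-year cutoff look like, here is a small base R sketch with two hypothetical cohort years. It uses `as.Date()` in place of `mdy()` only to stay free of package dependencies.

```r
# Hypothetical ninth grade cohort years
chrt_ninth <- c(2004, 2005)

# Expected on-time graduation: September 1st, three years after the cohort year
ontime_yr   <- chrt_ninth + 3
ontime_date <- as.Date(paste0(ontime_yr, "-09-01"))

# Cutoff used by the two-year enrollment indicator
cutoff <- ontime_date + (365 * 2)
format(cutoff)
```

Because `365 * 2` counts days rather than calendar years, the cutoff falls one day short of the second anniversary whenever a leap day intervenes, as it does for the first cohort here.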

**Question 9.1 Test your understanding by filling the shaded areas below.**

```{r checkCodingNSC}

stunsc %>% filter(sid %in% c(15647,15656,15658)) %>% 
  select(sid, chrt_ninth, chrt_grad, hs_diploma_date, first_enrl_date_any, 
         enrl_ever_w2_grad, ontime_yr, ontime_date, enrl_ever_w2_ninth) %>% 
  distinct(.keep_all=TRUE) %>% 
  as.data.frame

```


#### 9.2 Create variables to indicate if the student was enrolled in college by Oct 1

This variable indicates if a student enrolled in college immediately after high 
school. In addition, we use indicators based on October 1st college enrollment 
to track whether students persisted in college. For this, we need to create 
variables to indicate enrollment on October 1st of the 1st, 2nd, 3rd and 4th 
year after high school graduation.

First, create placeholder variables for both the ninth and graduating cohort, 
for each of the four years. 

```{r genPlaceholderYearVars}
# Create the 4 enrollment outcomes of interest by October 1st

# Create a new data.frame with all variables and merge it onto stunsc

newdf <- data.frame(NA)

for(num in 1:4) {
  eval(parse(text=paste0("newdf$enrl_1oct_grad_yr", num, " <- 0")))
  eval(parse(text=paste0("newdf$enrl_1oct_ninth_yr", num, " <- 0")))
}

# Drop the placeholder first column
newdf <- newdf[, -1]

# We can use cbind, or column bind, instead of merging
stunsc <- cbind(stunsc[, 1:35], newdf)
```
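The same eight placeholder columns can also be created without `eval(parse())`. The sketch below is one possible alternative, shown on a hypothetical three-row data frame `toy` rather than on `stunsc`.

```r
# Hypothetical stand-in for stunsc with three enrollment records
toy <- data.frame(sid = c(1, 1, 2))

# Build the eight column names: grad/ninth cohorts crossed with years 1-4
new_cols <- as.vector(outer(c("enrl_1oct_grad_yr", "enrl_1oct_ninth_yr"),
                            1:4, paste0))

# Assigning a single value to several new column names creates them all at once
toy[new_cols] <- 0

names(toy)
```

Assigning by column name avoids relying on a fixed column position, so the code keeps working if the number of columns in the data changes.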

Now, loop through these year values. We replace the above created placeholder 
variables with 1 if the student enrolled in college as of October 1st and the 
student's enrollment for that year ended after October 1st. We do this for the 
1st, 2nd, 3rd and 4th year after high school graduation. As we loop through the 
variables, we have to make sure that we are only replacing values for students 
who graduated in the year in the current loop, so we have to set a condition 
that ensures that the record is for a student whose diploma date falls in that 
school year.

```{r loopPlaceholderValuesNSC}
stunsc$n_enroll_begin_date[stunsc$sid == 16011]

stunsc %>% filter(sid %in% c(16011, 16016)) %>% 
  select(sid, chrt_grad, hs_diploma_date, first_enrl_date_any, n_enroll_begin_date)

testDF <- stunsc %>% 
  mutate(compDateG = ymd(paste0(chrt_grad, "-10-01")), 
         compDateE = ymd(paste0(chrt_ninth, "-10-01")),
         compInterval = n_enroll_begin_date %--% n_enroll_end_date) %>%
    mutate(enrl_1oct_grad_yr1 = compDateG %within% compInterval, 
           enrl_1oct_grad_yr2 = (compDateG + 365) %within% compInterval,
           enrl_1oct_grad_yr3 = (compDateG + 730) %within% compInterval,
           enrl_1oct_grad_yr4 = (compDateG + 1095) %within% compInterval) %>% 
  mutate(enrl_1oct_grad_yr1 = ifelse(compDateE %within% compInterval, 1,
                                     as.numeric(enrl_1oct_grad_yr1)),
           enrl_1oct_grad_yr2 = ifelse((compDateE + 365) %within% compInterval, 1, 
                                       as.numeric(enrl_1oct_grad_yr2)),
           enrl_1oct_grad_yr3 = ifelse((compDateE + 730) %within% compInterval, 1, 
                                       as.numeric(enrl_1oct_grad_yr3)),
           enrl_1oct_grad_yr4 = ifelse((compDateE + 1095) %within% compInterval, 1, 
         as.numeric(enrl_1oct_grad_yr4)))
  

stunsc %<>% filter(!is.na(stunsc$chrt_grad))

# [stunsc$chrt_grad ==", i, "] 
for(i in as.numeric(na.omit(unique(stunsc$chrt_grad)))){
  for(j in 1:4){
    eval(parse(text=paste0("stunsc$enrl_1oct_grad_yr", j," <- ifelse((
                       (stunsc$enrl_1oct_grad_yr", j," == 0) &
                           (stunsc$n_enroll_begin_date <=
                       ymd(paste0(", i, "+", j-1, ",'-10-01')) & 
                      ymd(paste0(", i, "+", j-1, ",'-05-01')) <= 
                      stunsc$n_enroll_end_date) &
                      (year(stunsc$hs_diploma_date) == ", i, " &
                      month(stunsc$hs_diploma_date) <=9) | 
                    (year(stunsc$hs_diploma_date) == ", i-1, "& 
                      month(stunsc$hs_diploma_date)>9)), 1, 
                    stunsc$enrl_1oct_grad_yr", j,")")))
    
  }
}

table(stunsc$enrl_1oct_grad_yr1)
table(stunsc$enrl_1oct_grad_yr2)
table(stunsc$enrl_1oct_grad_yr3)
table(stunsc$enrl_1oct_grad_yr4)

stunsc %>% filter(sid %in% c(16011, 16016)) %>% 
  select(sid, chrt_grad, hs_diploma_date, first_enrl_date_any,
         n_enroll_begin_date, n_enroll_end_date,
         enrl_1oct_grad_yr1, enrl_1oct_grad_yr2, 
         enrl_1oct_grad_yr3, enrl_1oct_grad_yr4) %>% 
  as.data.frame

for(i in as.numeric(unique(stunsc$ontime_yr))){
  for(j in 1:4){
    eval(parse(text=paste0("stunsc$enrl_1oct_ninth_yr", j," <- ifelse((
                       (stunsc$enrl_1oct_ninth_yr", j," == 0) &
                           (stunsc$n_enroll_begin_date <=
                       ymd(paste0(", i, "+", j-1, ",'-10-01')) & 
                      ymd(paste0(", i, "+", j-1, ",'-05-01')) <=
                  stunsc$n_enroll_end_date) & (stunsc$chrt_ninth == ", i - 3, ")),
                        1, stunsc$enrl_1oct_ninth_yr", j,")")))
    
  }
}
table(stunsc$enrl_1oct_ninth_yr1)
table(stunsc$enrl_1oct_ninth_yr2)
table(stunsc$enrl_1oct_ninth_yr3)
table(stunsc$enrl_1oct_ninth_yr4)

```
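The core idea of the October 1st snapshot can be illustrated on a few made-up enrollment spells. This sketch follows the rule as described in the text (the spell starts on or before October 1st and ends on or after it) and leaves out the diploma-date condition, so it is a simplification and not the exact condition used in the loops above. It assumes `chrt_grad` is the calendar year of graduation.

```r
# Hypothetical enrollment spells: sid 1 has a fall spell in 2008 and in 2009,
# sid 2 only has a spring 2009 spell
toy <- data.frame(
  sid = c(1, 1, 2),
  chrt_grad = c(2008, 2008, 2008),
  n_enroll_begin_date = as.Date(c("2008-08-25", "2009-08-24", "2009-01-12")),
  n_enroll_end_date   = as.Date(c("2008-12-15", "2009-12-14", "2009-05-10"))
)

for (j in 1:4) {
  # October 1st of the j-th fall after the graduation cohort year
  oct1 <- as.Date(paste0(toy$chrt_grad + j - 1, "-10-01"))
  # 1 if the spell covers that October 1st, 0 otherwise
  toy[[paste0("enrl_1oct_grad_yr", j)]] <-
    as.numeric(toy$n_enroll_begin_date <= oct1 & toy$n_enroll_end_date >= oct1)
}

toy[, c("sid", "enrl_1oct_grad_yr1", "enrl_1oct_grad_yr2")]
```

Each spell is flagged only for the year whose October 1st it covers; the spring-only spell is never flagged.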

#### 9.3 Collapse and reshape data to make it unique by sid

At this point, you have all the values that depend on individual college
enrollment episodes. Now we will make the data unique by student, ensuring that 
we have indicators for 2-year, 4-year and any college. Create an indicator that
specifies the type of college. If a record is flagged as both 2-year and 4-year, 
the 4-year assignment is made last, so the indicator reflects the highest level 
of enrollment.

```{r makeNSCuniqueSID}
stunsc$type <- NA
stunsc$type[stunsc$n_college_2yr == 0 & stunsc$n_college_4yr == 0] <- "_none"
stunsc$type[stunsc$n_college_2yr == 1] <- "_2yr"
stunsc$type[stunsc$n_college_4yr == 1] <- "_4yr"
```

Collapse the data by the invariant variables to populate the enrollment 
indicators for each record by student and college type.

Note that when you collapse, only variables specified in the collapse command 
remain in your data.

```{r collapsetoSIDonly}
# We do not need variables that refer to no college, so we can drop those.
stunsc <- stunsc %>% 
  select(sid, chrt_grad, chrt_ninth, hs_diploma_date,
         type, matches("enrl_|first_college")) %>% 
  group_by(sid, type) %>%
  summarize_all(.funs = "max") %>% 
  filter(type !="_none")

tmp <- stunsc %>% select(matches("_any|_2yr|_4yr"))
stunsc <- stunsc %>% select(-matches("_any|_2yr|_4yr"))


tmp %<>% distinct(sid, .keep_all = TRUE)
```
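To illustrate what the collapse does, the sketch below takes the maximum of each indicator within student and college type, using base R's `aggregate()` on a hypothetical four-row data frame. It is a stand-in for the `group_by()` and `summarize_all()` call above, not a replacement for it.

```r
# Hypothetical spell-level records: sid 1 has two 2-year spells and one 4-year
toy <- data.frame(
  sid  = c(1, 1, 1, 2),
  type = c("_2yr", "_2yr", "_4yr", "_4yr"),
  enrl_1oct_grad_yr1 = c(0, 1, 0, 1),
  enrl_1oct_grad_yr2 = c(0, 0, 1, 1)
)

# One row per sid and type; an indicator is 1 if it is 1 in any spell
collapsed <- aggregate(cbind(enrl_1oct_grad_yr1, enrl_1oct_grad_yr2) ~ sid + type,
                       data = toy, FUN = max)
collapsed <- collapsed[order(collapsed$sid, collapsed$type), ]
collapsed
```

Taking the maximum works because the indicators are coded 0/1, so the result is 1 whenever any spell of that type qualified.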

Reshape the data to have one record per student, and separate indicators for 
the type of college.

```{r reshapeNSCDataWide}

stunsc <- as.data.frame(stunsc)
stunsc <- reshape(stunsc, 
                  timevar = "type", 
                  idvar = c("sid", "chrt_grad", "chrt_ninth", 
                            "hs_diploma_date"),
                  direction = "wide", 
                  sep="")
stunsc <- inner_join(stunsc, tmp, by = c("sid" = "sid"))

stunsc %>% filter(sid == 41) %>% 
  select(sid, enrl_ever_w2_ninth_2yr, 
         enrl_1oct_grad_yr1_2yr, enrl_1oct_ninth_yr2_2yr,
         enrl_1oct_ninth_yr3_2yr, enrl_1oct_ninth_yr4_2yr,
         enrl_ever_w2_ninth_4yr, 
         enrl_1oct_grad_yr1_4yr, enrl_1oct_ninth_yr2_4yr,
         enrl_1oct_ninth_yr3_4yr, enrl_1oct_ninth_yr4_4yr)

```


```{r checkDimensions}
nrow(stunsc) == length(unique(stunsc$sid))
# isid sid
```

Ensure that your data is unique by student id.
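A toy version of the reshape makes the resulting column names and missing values easier to see. The data frame `long` below is hypothetical.

```r
# Hypothetical collapsed data: sid 1 has both college types, sid 2 only 4-year
long <- data.frame(
  sid  = c(1, 1, 2),
  type = c("_2yr", "_4yr", "_4yr"),
  enrl_ever_w2_grad = c(1, 1, 0)
)

# One row per sid; the type value is appended to each indicator name
wide <- reshape(long, timevar = "type", idvar = "sid",
                direction = "wide", sep = "")
wide

# The uniqueness check from above, applied to the toy data
nrow(wide) == length(unique(wide$sid))
```

Students with no record for a college type end up with `NA` in that type's columns, which is why missing values are recoded to 0 in later steps.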

#### 9.4 Ensure mutual exclusivity of 2-year and 4-year college enrollment

To ensure that students are not double counted in analyses that list both 2-year 
and 4-year enrollment, these indicators must be mutually exclusive. If a student 
enrolled in both a 2-year and a 4-year college, report the 4-year enrollment.

```{r checkCollegeTypeExclusivity}
varList <- c(paste0("enrl_ever_w2_", c("grad", "ninth")),
             as.vector(outer(c("enrl_1oct_grad_", "enrl_1oct_ninth_"), 
                c("yr1", "yr2", "yr3", "yr4"), paste, sep="")))

for(i in varList){
  var2yr <- paste0(i, "_2yr")
  var4yr <- paste0(i, "_4yr")
  # If a student has both 2-year and 4-year enrollment, keep only the 4-year
  stunsc[, var2yr] <- ifelse(stunsc[, var2yr] %in% 1 & stunsc[, var4yr] %in% 1, 
                             0, stunsc[, var2yr])
}

stunsc$enrl_ever_w2_grad_2yr <- ifelse(is.na(stunsc$enrl_ever_w2_grad_2yr), 
                                       0, stunsc$enrl_ever_w2_grad_2yr)
stunsc$enrl_ever_w2_grad_4yr <- ifelse(is.na(stunsc$enrl_ever_w2_grad_4yr), 
                                       0, stunsc$enrl_ever_w2_grad_4yr)
stunsc$enrl_ever_w2_ninth_2yr <- ifelse(is.na(stunsc$enrl_ever_w2_ninth_2yr), 
                                       0, stunsc$enrl_ever_w2_ninth_2yr)
stunsc$enrl_ever_w2_ninth_4yr <- ifelse(is.na(stunsc$enrl_ever_w2_ninth_4yr), 
                                       0, stunsc$enrl_ever_w2_ninth_4yr)
```
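The rule described in the text (report the 4-year enrollment when a student has both) can be checked on a small hypothetical example:

```r
# Hypothetical indicators: sid 1 enrolled in both types, sid 4 has a missing
# 2-year value
toy <- data.frame(
  sid = 1:4,
  enrl_ever_w2_grad_2yr = c(1, 1, 0, NA),
  enrl_ever_w2_grad_4yr = c(1, 0, 1, 1)
)

# %in% treats NA as "not 1", so missing values never trigger the recode
both <- toy$enrl_ever_w2_grad_2yr %in% 1 & toy$enrl_ever_w2_grad_4yr %in% 1
toy$enrl_ever_w2_grad_2yr[both] <- 0

# Recode remaining missing values to 0
toy$enrl_ever_w2_grad_2yr[is.na(toy$enrl_ever_w2_grad_2yr)] <- 0

toy
```

After the recode, no student has a 1 in both columns, so the two indicators can be summed or listed together without double counting.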

#### 9.5 Create "any college" version of the variables

At this point, we have 2-year and 4-year variables. We create “any college” 
variables by setting them to 1 if either the 2-year or the 4-year variable is 1.


```{r createAnyCollegeVar}

varList <- c(grep("enrl_1oct", names(stunsc), value = TRUE), 
             grep("enrl_ever", names(stunsc), value = TRUE))

# as.vector(outer(as.vector(outer(c("grad", "ninth"), 
#                                 paste0("yr", 1:4), paste, sep="_")),
#                 c("2yr", "4yr"), paste, sep = "_"))

# iterator
stubs <- as.vector(outer(c("grad", "ninth"), 
                                 paste0("yr", 1:4), paste, sep="_"))

for(i in stubs){
  tmp <- grep(i, varList, value = TRUE)
  newVar <- paste0(gsub("_2yr", "", tmp[1]), "_any")
  stunsc[, newVar] <- rowSums(zeroNA(stunsc[, tmp]))
}

stunsc$enrl_ever_w2_grad_any <- rowSums(zeroNA(stunsc[, c("enrl_ever_w2_grad_2yr",
                                                          "enrl_ever_w2_grad_4yr")]))
stunsc$enrl_ever_w2_ninth_any <- rowSums(zeroNA(stunsc[, c("enrl_ever_w2_ninth_2yr",
                                                          "enrl_ever_w2_ninth_4yr")]))
```
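As an illustration of the "any college" rule, the sketch below uses a hypothetical helper `zero_na()` as a stand-in for the toolkit's `zeroNA()` function, whose actual definition lives in `functions.R` and is not shown in this guide.

```r
# Hypothetical stand-in for zeroNA(): replace missing values with 0
zero_na <- function(x) {
  x[is.na(x)] <- 0
  x
}

toy <- data.frame(
  enrl_1oct_grad_yr1_2yr = c(1, 0, 0, NA),
  enrl_1oct_grad_yr1_4yr = c(0, 1, 0, 1)
)

# "Any college" is 1 if either the 2-year or the 4-year indicator is 1
toy$enrl_1oct_grad_yr1_any <- as.numeric(
  rowSums(zero_na(toy[, c("enrl_1oct_grad_yr1_2yr",
                          "enrl_1oct_grad_yr1_4yr")])) > 0)

toy$enrl_1oct_grad_yr1_any
```

Comparing the row sum to 0 keeps the result at 0 or 1 even if the two inputs are not mutually exclusive.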

#### 9.6 Create persistence outcomes for graduates and ninth graders

We consider that a student has persisted to the second year of college if they 
were enrolled in college on October 1st after graduation, and were also enrolled 
on October 1st one year later. Apply the same logic to creating a variable 
that indicates continuous enrollment over four years in college.

```{r createPersistenceVar}
# This is a very un-R way to do this, but it works within the data structure 
# specified by the toolkit

for(chrt in c("ninth", "grad")){
  for(type in c("2yr", "4yr", "any")){
    newVar <- paste0("enrl_1oct_", chrt, "_persist_", type)
    var1 <- paste0("enrl_1oct_", chrt, "_yr1_", type)
    # Enrolled in any college on October 1st of the second year
    var2 <- paste0("enrl_1oct_", chrt, "_yr2_any")
    var3 <- paste0("chrt_", chrt)
      
    stunsc[, newVar] <- ifelse(stunsc[, var1] == 1 & 
                                 stunsc[, var2] == 1 & 
                                 !is.na(stunsc[, var3]), 
                               1, 0)
    stunsc[, newVar] <- zeroNA(stunsc[, newVar])
    # Persistence all 4 years
    newVar <- paste0("enrl_1oct_", chrt, "_all4_", type)
    vList <- paste("enrl_1oct", chrt, 
                        c("yr1", "yr2", "yr3", "yr4"), type, sep = "_")
    var5 <- paste0("chrt_", chrt)
    stunsc[, newVar] <- ifelse(stunsc[, vList[1]] == 1 & 
                                 stunsc[, vList[2]] == 1 & 
                                 stunsc[, vList[3]] == 1 & 
                                 stunsc[, vList[4]] == 1 & 
                                 !is.na(stunsc[, var5]), 
                               1, 0)
    stunsc[, newVar] <- zeroNA(stunsc[, newVar])
  }
}
```
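The persistence definition in the text (enrolled on October 1st after graduation and again one year later) can be sketched for a single cohort and college type as follows. The values are hypothetical.

```r
# Hypothetical students from the graduating cohort
toy <- data.frame(
  sid = 1:3,
  chrt_grad = c(2008, 2008, 2008),
  enrl_1oct_grad_yr1_4yr = c(1, 1, 0),  # enrolled in a 4-year college, year 1
  enrl_1oct_grad_yr2_any = c(1, 0, 1)   # enrolled in any college, year 2
)

# Persisted: enrolled in year 1 and still enrolled somewhere in year 2
toy$enrl_1oct_grad_persist_4yr <- as.numeric(
  toy$enrl_1oct_grad_yr1_4yr %in% 1 &
    toy$enrl_1oct_grad_yr2_any %in% 1 &
    !is.na(toy$chrt_grad))

toy$enrl_1oct_grad_persist_4yr
```

Only the first student persists: the second was not enrolled in year 2, and the third was not enrolled in a 4-year college in year 1.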


## STEP 10: Merge the College-Going and NSC Files

**You're Almost There!**

You are almost finished creating the college-going analysis file. All that is 
left is to merge the `Student_CollegeGoing` file with the NSC college-going 
indicators you created in Step 9 and drop any students classified as `transferout` 
from the analysis. This allows your analysis file to contain student 
characteristics, high school outcomes, and college enrollment and persistence 
outcomes. 

This is a 1:1 merge on `sid` and results in observations that match, as well 
as observations that come only from the `Student_CollegeGoing` file.

```{r mergeTogether}
out <- left_join(stuCG, stunsc, by = "sid")
out %<>% filter(transferout != 1)
cg_analysis <- out
# Save the current file as CG_Analysis.rda
# Create an analysis directory
# dir.create("analysis")
# save(out, file = "analysis/CG_Analysis.rda")
# Or if you want to save the Stata file
# write_dta(out, path = "analysis/CG_Analysis.dta")

# Save the current file as Student_CollegeGoing.rda
# Create a clean directory
# dir.create("clean")
# save(stuCG, file = "clean/Student_CollegeGoing.rda")
# Or if you want to save the Stata file
# write_dta(stuCG, path = "clean/Student_CollegeGoing.dta")
# cleanup
rm(list = ls()[ls() != "cg_analysis"])
```
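The effect of the left join and the `transferout` filter can be seen in a small base R sketch. `merge()` with `all.x = TRUE` plays the role of `left_join()` here, and both data frames are hypothetical.

```r
# Hypothetical college-going file (4 students) and NSC indicators (2 students)
cg  <- data.frame(sid = 1:4, transferout = c(0, 0, 1, 0))
nsc <- data.frame(sid = c(1, 3), enrl_ever_w2_grad_any = c(1, 1))

# Keep every student in cg; students without NSC records get NA
out <- merge(cg, nsc, by = "sid", all.x = TRUE)

# Drop students classified as transferring out
out <- out[out$transferout != 1, ]
out
```

Students who appear only in the college-going file keep their row, with `NA` in the NSC indicator columns.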


# Connect: On Track Indicators

**Purpose:** Generate variables that indicate students’ on-track status at the 
end of each high school year.

**Files needed:** 

`Student_School_Year_Ninth`, `Student_School_Enrollment_Clean`,
and `Student_Class_Enrollment_Merged` from Clean, and `Student_CollegeGoing` 
from Step 8 in Connect.

After this step the variables created will be merged onto the `CG_Analysis` file 
to create a `CG_Analysis_Ontrack` file.

The process of creating the on-track indicators consists of three major steps:

1. Create the sample used for the on-track analyses
2. Create the on-track variables
3. Create GPA test variables.

Before you begin, specify the current school year. For our example, we assume 
that we are in 2010.

```{r globalSchYear}
current_schyr <- 2010
```

Since you just cleared the workspace after creating `Student_CollegeGoing`, you 
need to reload any packages (necessary if you are restarting R at 
this point) and re-source any custom functions in the `functions.R` file.

```{r reloadWorkspace}
# Load the packages and prepare your R environment
library(tidyverse) # main suite of R packages to ease data analysis
library(magrittr) # allows for some easier pipelines of data

# Read in some R functions that are useful for toolkit tasks, see SDP R Glossary
# for details
source("R/functions.R")
library(haven) # required for importing .dta files
```


## Step 1: Create the On-Track Sample


#### 1.1 Keep the students who have all records to be in the sample

The on-track indicator file is based on a subset of the college-going file. In 
order to be included, a student has to be enrolled in the district in the first 
semester of 9th grade, and has to have continuous enrollment. 

If a student ever transferred out, they could have earned credits in another 
district for which we have no records. Thus, we have to drop them from the 
sample. We use the Student School Year file and the Student School Enrollment 
file to determine enrollment.

```{r readinOTdata}
# Read in file 1
tmpfileName <- "clean/Student_School_Year_Ninth.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stusy <- read_stata(con)
close(con)
# Read in file 2
tmpfileName <- "clean/Student_School_Enrollment_Clean.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuschl <- read_stata(con)
close(con)

stuOT <- left_join(stuschl, stusy[, 1:3], by = c("sid", "school_year"))
rm(stusy, stuschl); gc()

```

Create a new withdrawal-code-based binary variable that identifies transfer-out 
codes. This variable will be 1 for all withdrawal codes related to 
transfer-out, not just the last withdrawal code observed for the student.

First review the withdrawal codes:

```{r withdrawalCodeTable}
table(stuOT$withdrawal_code_desc)
```

```{r identifyTransferout}
# Create a list of values that indicate a transfer out

transferCodes <- c("Home School", "Left District", "Other Transfer", 
                   "Transfer Out of District", 
                   "Death")

# Group by sid and test whether any withdrawal_code_desc for a student appears 
# in the transferCodes list; this gives TRUE/FALSE
# Convert this to numeric

stuOT %<>% group_by(sid) %>% 
  mutate(ever_transferout = any(withdrawal_code_desc %in% transferCodes)) %>% 
  mutate(ever_transferout = as.numeric(ever_transferout))

```
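The "ever transferred out" logic can be illustrated with base R's `ave()` on a few hypothetical records. The codes "Graduated" and "Dropped Out" are made up for this example and may not match the labels in your data.

```r
# Hypothetical enrollment records for two students
toy <- data.frame(
  sid = c(1, 1, 2, 2),
  withdrawal_code_desc = c("Transfer Out of District", "Graduated",
                           "Graduated", "Dropped Out")
)

transfer_codes <- c("Home School", "Left District", "Other Transfer",
                    "Transfer Out of District", "Death")

# Flag each record, then spread the maximum flag to all records of a student
toy$ever_transferout <- ave(
  as.numeric(toy$withdrawal_code_desc %in% transfer_codes),
  toy$sid, FUN = max)

toy
```

Both of student 1's records are flagged, even though only one of them carries a transfer-out code.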


Now the `ever_transferout` variable is consistent by student.
Omit students who ever transfer out of the district since we can't determine 
their total credit accumulation in any year in which they left the district.

```{r filterTransferout}
stuOT %<>% filter(ever_transferout == 0)
stuOT %<>% select(-ever_transferout)

```

Keep the relevant variables, and ensure that your file is unique by student id
and school year. Label the object student school year (`stusy`) and leave it in
the workspace for later.

```{r filterSaveSortStuOT}
# Keep only relevant variables and drop duplicates
stusy <- stuOT %>% ungroup %>% 
  select(sid, school_year, grade_level) %>% 
  # Keep one row per sid, school_year, and grade_level
  distinct(sid, school_year, grade_level)
# Confirm uniqueness of rows
n_distinct(paste0(stusy$sid, stusy$school_year)) == nrow(stusy)
# Clean up
rm(stuOT)
```


Next, load the Student Class Enrollment file, and merge with the college-going 
analysis file.

```{r readStuCGStuEnroll}
# Read in file 1
tmpfileName <- "clean/Student_Class_Enrollment_Merged.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuenrl <- read_stata(con)
close(con)
# Read in file 2
tmpfileName <- "analysis/Student_CollegeGoing.dta"
con <- unz(description = "data/analysis.zip", filename = tmpfileName, 
           open = "rb")
stuCG <- read_stata(con)
close(con)

# We can only assess if a student is on track if we have course information for 
# them. Keep only records that appear in both files using inner_join

stuOT <- inner_join(stuenrl, stuCG, by = c("sid"))
rm(stuenrl, stuCG); gc()
```


Merge the file with the Student School Year (`stusy`) object you saved above, 
keeping students who appear in both files.

```{r mergeStusyOn}
# Use inner_join to keep only records found in both stuOT and stusy
stuOT <- inner_join(stuOT, stusy, by = c("sid", "school_year"))
# rm(stusy)
```


Keep only students who were in the district in the first semester of 9th grade, 
to ensure that we have complete records for them.

```{r filterLateHSStudents}
markList <- c("YL", "S1", "Q1")

stuOT %<>% group_by(sid) %>% 
  # TRUE if the student has a grade 9 record in a first-semester marking period
  mutate(enrolled_grade_9 = any(grade_level == 9 & marking_period %in% markList, 
                                na.rm = TRUE)) %>% 
  ungroup

stuOT %<>% filter(enrolled_grade_9) %>% 
  select(-enrolled_grade_9)

```

Restrict to cohorts that have had time to graduate. We assume here that you 
have complete records until the school year before the current year.

```{r restrictCohorts}
stuOT %<>% filter(chrt_ninth <= current_schyr - 4)
```

Identify students who are not enrolled in consecutive school years and omit 
them from the sample. To do this, create a variable that indicates if a 
student is enrolled in one year, not enrolled in the next, and then enrolled 
again.

```{r nonLinEnrollPattern}

nonlin <- stuOT %>% select(sid, school_year) %>% 
  distinct() %>% arrange(sid, school_year) %>% 
  group_by(sid) %>% 
  mutate(syLag = lag(school_year)) %>% ungroup %>% 
  mutate(syDiff = school_year - syLag)

nonlin$nonlin_enrl <- ifelse(nonlin$syDiff > 1 & !is.na(nonlin$syDiff), 1, 0)
nonlin %<>%  select(sid, nonlin_enrl) %>% group_by(sid) %>% 
  mutate(nonlin_enrl = max(nonlin_enrl, na.rm=TRUE)) %>% 
  distinct(.keep_all = TRUE)

```

Merge this indicator onto your current file, and drop all students who have any 
episodes of non-linear enrollment.

```{r mergeanddropNonlinEnroll}

stuOT <- inner_join(stuOT, nonlin, by = "sid")

stuOT %<>% filter(stuOT$nonlin_enrl < 1) %>% 
  select(-nonlin_enrl)
```
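The gap detection can be sketched in base R on two hypothetical students, one with continuous enrollment and one who skips a year:

```r
# Hypothetical school years: sid 2 is missing 2006
toy <- data.frame(
  sid = c(1, 1, 1, 2, 2, 2),
  school_year = c(2005, 2006, 2007, 2005, 2007, 2008)
)
toy <- toy[order(toy$sid, toy$school_year), ]

# Difference from the student's previous school year (NA for the first year)
gap <- ave(toy$school_year, toy$sid, FUN = function(y) c(NA, diff(y)))

# A student is non-linear if any gap is larger than one year
toy$nonlin_enrl <- ave(as.numeric(!is.na(gap) & gap > 1), toy$sid, FUN = max)

unique(toy[, c("sid", "nonlin_enrl")])
```

Student 2 is flagged on every record, so filtering on `nonlin_enrl < 1` removes that student entirely.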

#### 1.2 Resolve inconsistencies with credit variables

Now, you can move on to resolving inconsistencies with the credit variables.

In Clean, you ensured that there are no missing values for credits possible or 
credits earned. Confirm this here.

```{r confirmCleanedCredits}
all(!is.na(stuOT$credits_possible))
all(!is.na(stuOT$credits_earned))

table(stuOT$credits_possible,stuOT$credits_earned)
```

In our data, a handful of courses have 2 or 3 `credits_possible`. This may be 
due to block scheduling.

Some students with 0 `credits_possible` have `credits_earned` > 0. 
Typically a course has 0 credits possible if it does not count towards GPA. In
most cases students receive a "P" for these courses.

```{r gradeCreditTables}
# stuOT %>% filter(credits_possible == 0 & credits_earned != 0) %>% 
#   with(., table(final_grade_mark))

stuOT %>% filter(credits_possible == 0 & credits_earned != 0) %>% 
  with(., table(final_grade_mark, credits_earned))
```

However, notice that in some cases a student has a normal final grade mark for 
these courses. We will change the credits possible to credits earned for these.

To do this, we loop through the possible final grade mark letters and replace 
credits possible with credits earned if the final grade mark contains any of 
those letters.

```{r recodeGrades}

for(gl in c("A", "B", "C", "D", "E")){
  # Rows with 0 credits possible, non-zero credits earned, and a final grade 
  # mark containing the current letter (excluding "NGPA")
  idx <- which(stuOT$credits_possible == 0 & 
                 stuOT$credits_earned != 0 & 
                 grepl(gl, stuOT$final_grade_mark) & 
                 stuOT$final_grade_mark != "NGPA")
  stuOT$credits_possible[idx] <- stuOT$credits_earned[idx]
}


```


Note that our data contains a final grade mark of “NGPA”. These are courses 
that do not earn GPA credits. In the code above, we had to exclude "NGPA" 
specifically, because it contains the letter "A", and as such, it would be 
replaced in the loop for “A”.
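The pitfall is easy to reproduce on a short vector of hypothetical grade marks. The anchored pattern at the end is one possible alternative; it assumes letter grades are a single letter optionally followed by `+` or `-`, which may not hold in your data.

```r
# Hypothetical final grade marks
marks <- c("A", "B+", "NGPA", "P", "F")

# A plain search for "A" also matches "NGPA"
grepl("A", marks)

# Approach used above: match any letter A-E, then exclude "NGPA" explicitly
letter_grade <- grepl("A|B|C|D|E", marks) & marks != "NGPA"

# Alternative under the stated assumption: anchor the pattern to the whole mark
anchored <- grepl("^[A-E][+-]?$", marks)

data.frame(marks, letter_grade, anchored)
```

Both approaches agree on this example; the anchored pattern avoids having to list each non-grade code that happens to contain one of the letters.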

Review final grade marks and credits earned.

```{r checkGradesandCredits}
table(stuOT$final_grade_mark, stuOT$credits_earned)
```


Make sure that final grade marks of F and non credit-carrying letter grades (in
our case, NGPA) have 0 credits earned.

```{r replaceFailingGrades}
# %in% returns FALSE for a missing grade mark, so no NA values are introduced
stuOT$credits_earned <- ifelse(stuOT$final_grade_mark %in% c("F", "NGPA"), 0,
                               stuOT$credits_earned)
```


For the most part, students with `credits_earned == 0` receive an "F" letter 
grade. However, some students receive passing grades and should receive credits 
earned. Review and fix these cases.

```{r reviewEdgeCases}
table(stuOT$final_grade_mark[stuOT$credits_earned == 0])
table(stuOT$final_grade_mark[stuOT$credits_earned == 0],
      stuOT$credits_possible[stuOT$credits_earned == 0])

```

For these students, set `credits_earned` equal to `credits_possible`.

```{r assignCredits}
# Compute the row index first: an assignment made inside with() is not saved 
# back to stuOT
idx <- with(stuOT, which(credits_earned == 0 & 
                           credits_possible != 0 & 
                           grepl("A|B|C|D|E", final_grade_mark) & 
                           final_grade_mark != "NGPA"))
stuOT$credits_earned[idx] <- stuOT$credits_possible[idx]
```
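One thing to keep in mind with recodes like this: an assignment made inside `with()` changes only a temporary copy of the column, not the data frame itself. The hypothetical example below shows the difference and a pattern that does modify the data.

```r
# Hypothetical course records
toy <- data.frame(
  credits_earned   = c(0, 0, 1),
  credits_possible = c(1, 0, 1),
  final_grade_mark = c("A", "F", "B")
)

# This runs without error, but toy is left unchanged
with(toy, credits_earned[credits_earned == 0 & credits_possible != 0] <- 1)
unchanged <- identical(toy$credits_earned, c(0, 0, 1))

# Build the row index with with(), then assign on the data frame itself
idx <- with(toy, which(credits_earned == 0 &
                         credits_possible != 0 &
                         grepl("A|B|C|D|E", final_grade_mark) &
                         final_grade_mark != "NGPA"))
toy$credits_earned[idx] <- toy$credits_possible[idx]

toy$credits_earned
```

Using `which()` also drops any `NA` in the condition, which avoids errors from missing values in a subscripted assignment.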

If `credits_possible` in that observation is also zero, replace `credits_earned` 
with the modal value for the course. Calculate the modal `credits_earned` for 
each course (`cid` is unique by `school_year`, `school_code`, `section_code`, 
and `course_code`).

```{r calculateModalCourseGrades}
stuOT %<>% group_by(cid) %>% 
  mutate(credits_earned_mode = statamode(credits_earned)) %>% 
  ungroup

# Compute the row index first: an assignment made inside with() is not saved 
# back to stuOT
idx <- with(stuOT, which(credits_earned == 0 & 
                           !is.na(credits_earned_mode) & 
                           grepl("A|B|C|D|E", final_grade_mark) & 
                           final_grade_mark != "NGPA"))
stuOT$credits_earned[idx] <- stuOT$credits_earned_mode[idx]

```
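For readers without access to `statamode()`, the idea of a per-course modal value can be sketched with a hypothetical helper, `mode_value()`. Its handling of ties (it returns the smallest of the tied values) is an assumption of this sketch and may differ from `statamode()`.

```r
# Hypothetical helper: most frequent value of a numeric vector
# (ties resolve to the smallest tied value)
mode_value <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical course records: most students in course c1 earned 1 credit
toy <- data.frame(
  cid = c("c1", "c1", "c1", "c2", "c2"),
  credits_earned = c(1, 1, 0, 0.5, 0.5)
)

# Attach the modal credits earned for the course to every record
toy$credits_earned_mode <- ave(toy$credits_earned, toy$cid, FUN = mode_value)

toy
```

The student in course `c1` with 0 credits earned now carries a course mode of 1, which is the value used as the replacement.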

Finally, if both `credits_earned` and `credits_possible` are still 0, leave the 
record as is.

Look at the tabulation and make sure that there are not many cases like this; 
otherwise, work with your agency to determine possible causes. In our example, 
we have 245 cases.

```{r checkGrade0CreditsTable}

stuOT %>% select(final_grade_mark, credits_earned, credits_possible) %>% 
  filter(credits_earned == 0 & final_grade_mark %in% c("A", "B", "C", "D", "E")) %>% 
  with(., table(final_grade_mark, credits_possible))

```


Next, resolve instances in which `credits_earned` > `credits_possible`.

```{r checkCreditsEarnedandPossible}
stuOT %>%  filter(final_grade_mark %in% c("A", "B", "C", "D", "E", "F") & 
                    credits_earned > credits_possible) %>% 
  with(., table(credits_earned, credits_possible))

```


Set `credits_possible` to `credits_earned` if `credits_possible` is zero and
`credits_earned` is non-zero.

```{r adjustCreditsEarned}
# Compute the row index first: an assignment made inside with() is not saved 
# back to stuOT
idx <- with(stuOT, which(credits_possible == 0 & 
                           credits_earned != 0 & 
                           credits_earned > credits_possible & 
                           grepl("A|B|C|D|E", final_grade_mark) & 
                           final_grade_mark != "NGPA"))
stuOT$credits_possible[idx] <- stuOT$credits_earned[idx]
```


Now set remaining `credits_earned` AND `credits_possible` to equal the mode 
credits earned computed above.

```{r useCourseModeCredits}
stuOT$replace_credits <- ifelse(stuOT$credits_possible > 0 &
                           stuOT$credits_earned > stuOT$credits_possible & 
                           grepl("A|B|C|D|E", stuOT$final_grade_mark) &
                           stuOT$final_grade_mark != "NGPA", 1, 0)

stuOT$credits_earned[stuOT$replace_credits == 1 & 
                       !is.na(stuOT$credits_earned_mode)] <- 
  stuOT$credits_earned_mode[stuOT$replace_credits == 1 & 
                              !is.na(stuOT$credits_earned_mode)]

stuOT$credits_possible[stuOT$replace_credits == 1 & 
                       !is.na(stuOT$credits_earned_mode)] <- 
  stuOT$credits_earned_mode[stuOT$replace_credits == 1 & 
                              !is.na(stuOT$credits_earned_mode)]
```

Review remaining mismatched `credits_possible` and `credits_earned`. These all 
have missing modes. We are dropping them at this point.

```{r filterFinalMissingMarks}
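# Review the remaining mismatches among records flagged for replacement
stuOT %>%  filter(credits_earned != credits_possible & replace_credits == 1) %>% 
  with(., table(credits_earned, credits_possible))

# Drop letter-grade records that still have more credits earned than possible
stuOT %<>% filter(!(credits_earned > credits_possible & 
                      grepl("A|B|C|D|E", final_grade_mark) &
                      final_grade_mark != "NGPA"))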
```

Look at the results by credit and by each final grade mark with the code below 
(results not shown).

```{r tableofCreditsandGrade, results='hide'}
table(stuOT$credits_possible, stuOT$credits_earned, stuOT$final_grade_mark)
```

Make sure that `credits_possible` and `credits_earned` are not missing.

```{r checkMissingnessCredits}
all(!is.na(stuOT$credits_possible))
all(!is.na(stuOT$credits_earned))
```


#### 1.3 Create Variable Indicating years in HS

At this point, the only thing left in creating the sample file is to keep one 
observation per student and `school_year`, create a variable that numbers each 
student's years in the high school data (1 for the first year, 2 for the 
second, and so on), and merge it back onto the main file.


```{r calcYearsinHS}

num_yrs <- stuOT %>% select(sid, school_year) %>%
  distinct() %>% 
  arrange(sid, school_year) %>% 
  group_by(sid) %>% 
  # Number each student's school years in order: 1 = first year in high school
  mutate(year_in_hs = row_number()) %>% 
  ungroup

# Join on sid and school_year so that no rows in stuOT are duplicated
stuOT <- inner_join(stuOT, num_yrs, by = c("sid", "school_year"))

table(stuOT$chrt_ninth, stuOT$chrt_grad)
```
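
Step 2 groups by `year_in_hs` and treats it as the student's first, second, third, or fourth year in high school. The sketch below shows one way to build such an index in base R on made-up data; the data frame `enr` is hypothetical, and merging on both `sid` and `school_year` is what keeps the course-level row count unchanged.

```r
# Toy enrollment records: several course rows per student/school_year
enr <- data.frame(
  sid         = c(1, 1, 1, 2, 2),
  school_year = c(2005, 2005, 2006, 2006, 2007)
)

# One row per student/school_year, sorted so years count up in order
yrs <- unique(enr[, c("sid", "school_year")])
yrs <- yrs[order(yrs$sid, yrs$school_year), ]

# Number each student's years 1, 2, 3, ...
yrs$year_in_hs <- ave(yrs$school_year, yrs$sid, FUN = seq_along)

# Merge on both keys so the number of course-level rows stays the same
out <- merge(enr, yrs, by = c("sid", "school_year"))
out
```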

```{r saveStuOTData}
# Save the current file as Student_OnTrack_Sample
# Create an analysis directory
# dir.create("analysis")
# save(stuOT, file = "analysis/Student_OnTrack_Sample.rda")
# Or, if you want to save a Stata file (haven::write_dta takes `path`)
# write_dta(stuOT, path = "analysis/Student_OnTrack_Sample.dta")
```


## Step 2: Create On-Track Variables

In this step you will calculate credits earned in each of the first four high 
school years, generate on-track indicators and generate outcome indicators at 
the end of each year in high school.

To start, open the file you saved in Step 1 above.

```{r readStuOTdataIn}
# Read in file 1
tmpfileName <- "analysis/Student_OnTrack_Sample.dta"
con <- unz(description = "data/analysis.zip", filename = tmpfileName, 
           open = "rb")
stuOT <- read_stata(con)
close(con)
```


#### 2.1 Calculate credits earned in each of a student's first four school years in high school


First, calculate total credits earned.

```{r calcCreditsEarnedYear}
# Cumulative credits within years
stuOT %<>% group_by(sid, year_in_hs) %>% 
  mutate(cum_credits_yr = cumsum(credits_earned)) %>% 
  mutate(cum_credits_yr = max(cum_credits_yr))

# Cumulative credits across years
tmp <- stuOT %>% select(sid, year_in_hs, cum_credits_yr) %>% 
  distinct(sid, year_in_hs, .keep_all = TRUE)

tmp %<>% group_by(sid) %>% arrange(year_in_hs) %>% 
  mutate(cum_credits = cumsum(cum_credits_yr)) %>% arrange(sid, year_in_hs) %>% 
  select(-cum_credits_yr) %>% 
  ungroup()

stuOT <- inner_join(stuOT, tmp, by = c("sid", "year_in_hs"))
rm(tmp)

```
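
To make the two-stage calculation concrete, here is a small base R sketch on made-up data: first total the credits within each student/year, then take a running sum across years within each student. The data frame `credits` and its values are hypothetical.

```r
# Toy course-level records (hypothetical values)
credits <- data.frame(
  sid            = c(1, 1, 1, 1, 2, 2),
  year_in_hs     = c(1, 1, 2, 2, 1, 2),
  credits_earned = c(1, 0.5, 1, 1, 1, 0)
)

# Stage 1: total credits earned within each student/year
by_year <- aggregate(credits_earned ~ sid + year_in_hs, data = credits, FUN = sum)
by_year <- by_year[order(by_year$sid, by_year$year_in_hs), ]

# Stage 2: running total across years within each student
by_year$cum_credits <- ave(by_year$credits_earned, by_year$sid, FUN = cumsum)
by_year
```

Student 1 earns 1.5 credits in year 1 and 2 in year 2, so `cum_credits` is 1.5 and then 3.5. Student 2 earns nothing in year 2, so the running total stays at 1.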


Then, calculate total credits earned in Math and English Language Arts. Replace 
with 0 if student didn't earn credit.


```{r creditsbySubject}
stuOT %<>% group_by(sid, year_in_hs) %>% 
  mutate(cum_credits_yr_math = sum(credits_earned[math_flag == 1]), 
         cum_credits_yr_ela = sum(credits_earned[ela_flag == 1])) %>% 
  ungroup()

# Cumulative credits across years
tmp <- stuOT %>% select(sid, year_in_hs, cum_credits_yr_math, cum_credits_yr_ela) %>% 
  distinct(sid, year_in_hs, .keep_all = TRUE)


tmp %<>% group_by(sid) %>% arrange(year_in_hs) %>% 
  mutate(cum_credits_math = cumsum(cum_credits_yr_math),
         cum_credits_ela = cumsum(cum_credits_yr_ela)) %>% arrange(sid, year_in_hs) %>% 
  select(-cum_credits_yr_ela, -cum_credits_yr_math) %>% 
  ungroup()

stuOT <- inner_join(stuOT, tmp, by = c("sid", "year_in_hs"))
rm(tmp)

# stuOT %>% filter(sid == 10585) %>% 
#  select(sid, year_in_hs, credits_earned, 
#         math_flag, cum_credits_math, cum_credits_ela) %>% 
#   as.data.frame %>% head(40)

stuOT %<>% select(-cum_credits_yr_ela, -cum_credits_yr_math, 
                  -cum_credits_yr)
```


#### 2.2 Generate on track indicators

In our sample district, 23 credits are needed to graduate from high school with 
a regular diploma, including 4 credits in ELA and 3 in Math. In addition, 
promotion from grade to grade is based on total credits earned:

- Promotion to 10th grade: 5 credits
- Promotion to 11th grade: 11 credits
- Promotion to 12th grade: 17 credits

Using this information, define indicators, by year enrolled in high school, of 
whether a student is on track to graduate within 4 years of initial high school 
enrollment.

DEFINITION:

- on track by end of 9th: 5 total credits, 1 math, 1 English
- on track by end of 10th: 10 total credits, 1 math, 2 English
- on track by end of 11th: 15 total credits, 2 math, 3 English
- on track by end of 12th: 20 total credits, 3 math, 4 English

To do this consistently by student, we keep the data in long format, which R 
handles well, and group by `sid` and `year_in_hs`.

```{r onTrackFlags}
stuOT %<>% group_by(sid, year_in_hs) %>%
  mutate(ontrack_endyr = ifelse(cum_credits >= (first(year_in_hs) * 5) & 
           cum_credits_math >= (ceiling(first(year_in_hs)^2 / 6)) & 
           cum_credits_ela >= first(year_in_hs), 1, 0)) %>% ungroup
```
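
The chunk above encodes the thresholds with closed-form expressions in `year_in_hs`. As a sketch, the base R example below writes the same definition out as an explicit lookup table and confirms that the expressions reproduce it. The helper `is_on_track()` is hypothetical and is not part of the toolkit.

```r
# Thresholds copied from the on-track definition, one row per year in high school
thresholds <- data.frame(
  year_in_hs = 1:4,
  total      = c(5, 10, 15, 20),
  math       = c(1, 1, 2, 3),
  ela        = c(1, 2, 3, 4)
)

# The closed-form expressions used in the chunk reproduce the table
formulas_match <- all(thresholds$total == thresholds$year_in_hs * 5) &&
  all(thresholds$math == ceiling(thresholds$year_in_hs^2 / 6)) &&
  all(thresholds$ela == thresholds$year_in_hs)

# Hypothetical helper: 1 if all three requirements are met, otherwise 0
is_on_track <- function(year, total, math, ela) {
  req <- thresholds[match(year, thresholds$year_in_hs), ]
  as.integer(total >= req$total & math >= req$math & ela >= req$ela)
}

on_track <- is_on_track(year  = c(1, 2, 3),
                        total = c(5, 9, 16),
                        math  = c(1, 1, 2),
                        ela   = c(1, 2, 2))
on_track
```

A lookup table is easier to adapt if your agency's requirements do not follow a simple formula; only the rows of `thresholds` would need to change.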


Keep only relevant variables, and make sure you have only one observation per 
student per year observed in high school.


```{r simplifyOTdata}
stuOT %<>% select(sid, school_year, starts_with("hs_diploma"), 
                        last_wd_group, starts_with("chrt_"), 
                        ontime_grad, late_grad, still_enrl, 
                        transferout, dropout, disappear, year_in_hs, 
                        starts_with("cum_credits"), starts_with("ontrack_"))

stuOT %<>% distinct(sid, school_year, .keep_all=TRUE)

```


#### 2.3 Generate outcome indicators at the end of each year in high school

Create labels that you can apply to the variable created for each high school year.


2.3a Loop through the first 3 years of high school. Determine if the student is 
on track or off track, or has dropped out, disappeared or earned a General 
Education Diploma. Make sure that these values are populated consistently for 
each student.

```{r codeEOYoutcomes}
stuOT %<>% group_by(sid, year_in_hs) %>% 
  mutate(status_eoy = ifelse(ontrack_endyr == 1, 1, 2)) %>%
  mutate(status_eoy = ifelse(dropout == 1, 3, status_eoy)) %>%
  mutate(status_eoy = ifelse(disappear == 1, 4, status_eoy)) %>%
  ungroup()

sum(table(stuOT$status_eoy)) == nrow(stuOT)
```
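
The order of the `mutate` calls above matters: each later rule overrides the earlier ones, so a student who disappeared is coded 4 even if also flagged as a dropout. The sketch below shows this on made-up data and attaches labels with `factor()`. The label text is an assumption inferred from the codes used in the chunk.

```r
# Toy end-of-year flags (hypothetical values)
eoy <- data.frame(
  ontrack_endyr = c(1, 0, 1, 0),
  dropout       = c(0, 0, 1, 1),
  disappear     = c(0, 0, 0, 1)
)

# Apply the rules in the same order as the chunk; later rules take precedence
status <- ifelse(eoy$ontrack_endyr == 1, 1, 2)
status <- ifelse(eoy$dropout == 1, 3, status)
status <- ifelse(eoy$disappear == 1, 4, status)

# Assumed labels for the four codes
status_lbl <- factor(status, levels = 1:4,
                     labels = c("On track", "Off track",
                                "Dropped out", "Disappeared"))
table(status_lbl)
```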


2.3b Now, define status after 4th year using diploma information.

```{r defineStatus}
tmp <- stuOT %>% filter(year_in_hs == 4)  %>% 
  group_by(sid) %>% 
  # Start from on-time graduation, then layer each rule on the previous result
  mutate(status_eoy_yr4 = ifelse(ontime_grad == 1 & !is.na(chrt_grad), 1, 0)) %>%
  mutate(status_eoy_yr4 = ifelse(still_enrl == 1 | late_grad == 1, 2, 
                                 status_eoy_yr4)) %>%
  mutate(status_eoy_yr4 = ifelse(is.na(hs_diploma_date) & is.na(status_eoy) & 
                                   disappear == 1, 4, status_eoy_yr4)) %>%
  ungroup() %>% 
  select(sid, status_eoy_yr4)

stuOT <- left_join(stuOT, tmp, by = "sid")

# Replace
stuOT$status_eoy[stuOT$year_in_hs == 4] <- stuOT$status_eoy_yr4[stuOT$year_in_hs == 4]
stuOT$status_eoy_yr4 <- NULL

```

Reshape wide by `sid`

```{r reshapeOTwide}
stuOT$status_after <- stuOT$status_eoy; stuOT$status_eoy <- NULL
stuOT$school_year <- NULL
stuOT %<>% filter(year_in_hs < 5)
# Reshape
stuOT <- as.data.frame(stuOT)
stuOT <- reshape(stuOT, idvar = names(stuOT)[1:13], 
               timevar = "year_in_hs", 
               direction = "wide", sep = "_yr")
stuOT$ontrack_hsgrad_sample <- 1

```

Save

```{r saveOTsample}
# Save the current file as Student_OnTrack_Variables
# Create an analysis directory
# dir.create("analysis")
# save(stuOT, file = "analysis/Student_OnTrack_Variables.rda")
# Or, if you want to save a Stata file (haven::write_dta takes `path`)
# write_dta(stuOT, path = "analysis/Student_OnTrack_Variables.dta")
```

## Step 3: Generate GPA and Test variables


In this step you will calculate GPA and credits earned in each year of high 
school, add SAT and ACT score variables, and identify if a student is considered 
highly qualified.

For this step you will start with the Sample file you saved in Step 1.

```{r loadDataOTagain}
# Read in file 1
tmpfileName <- "analysis/Student_OnTrack_Sample.dta"
con <- unz(description = "data/analysis.zip", filename = tmpfileName, 
           open = "rb")
stuOT <- read_stata(con)
close(con)
```


#### 3.1 Process final grades and credits

First, ensure `final_grade_mark` and `final_grade_mark_num` comport. In the 
tabulation you should see a pattern that makes sense. In the sample data we 
provided you will notice that in addition to the regular grades (A through F), 
we have "P", representing Pass/Fail courses, and "NGPA", representing other 
non-credit bearing courses.

```{r inspectGradeTable}
table(stuOT$final_grade_mark, stuOT$final_grade_mark_num)
```

For "P" and "NGPA", set `credits_possible` and the numeric grade mark to 0 to 
exclude these courses from the GPA calculation.

```{r excludeSomeGrades}
stuOT$credits_possible[stuOT$final_grade_mark %in% c("P", "NGPA")] <- 0
stuOT$final_grade_mark_num[stuOT$final_grade_mark %in% c("P", "NGPA")] <- 0

```


Calculate GPA points by multiplying the numeric final grade marks by credits 
possible. This weights higher-credit courses more heavily in the GPA 
calculation.

```{r GPAcalc1}
stuOT %<>% mutate(gpa_points = final_grade_mark_num * credits_possible)
```

Ensure that the credits possible field is populated for everyone with GPA points.


```{r checkGPAwork}
all(!is.na(stuOT$credits_possible), !is.na(stuOT$gpa_points))
all(stuOT$gpa_points[stuOT$final_grade_mark %in% c("P", "NGPA")]==0)
```


Determine GPA earned by each student every year. Total yearly GPA is calculated 
by dividing the total GPA points by total credits possible for the courses the 
student enrolled in.

```{r gpaEarned}
stuOT %<>% group_by(sid, school_year) %>%
  mutate(tot_gpa_points_yr = sum(gpa_points), 
         tot_gpa_credits_yr = sum(credits_possible)) %>% 
  mutate(gpa_year = tot_gpa_points_yr/tot_gpa_credits_yr) %>% ungroup
```
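
As a worked example of the credit weighting, the sketch below computes one student's yearly GPA from three made-up courses and compares it with a simple unweighted mean. The data frame `grades` and its values are hypothetical.

```r
# Toy grades for one student in one year (hypothetical values)
grades <- data.frame(
  final_grade_mark_num = c(4, 2, 3),
  credits_possible     = c(1, 0.5, 1)
)

# GPA points: numeric grade weighted by credits possible
gpa_points <- grades$final_grade_mark_num * grades$credits_possible

# Yearly GPA: total points divided by total credits possible
gpa_year <- sum(gpa_points) / sum(grades$credits_possible)

# Unweighted mean of the grades, for comparison
gpa_unweighted <- mean(grades$final_grade_mark_num)

c(weighted = gpa_year, unweighted = gpa_unweighted)
```

The weighted GPA is 8 / 2.5 = 3.2, while the unweighted mean is 3, because the half-credit course with the lowest grade counts for less.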

Now you have created GPA and credit variables by student and school year. Keep 
these variables, and ensure your file is unique by student id and school year.

```{r keepOnlyGPAvars}
stuOT %<>% select(sid, school_year, starts_with("tot_gpa"), gpa_year)
stuOT %<>% distinct(sid, school_year, .keep_all=TRUE)
stuOT %<>% filter(!is.na(gpa_year))
# Use a separator so that different sid/school_year pairs cannot collide
nrow(stuOT) == n_distinct(paste(stuOT$sid, stuOT$school_year, sep = "_"))

```

Calculate a cumulative GPA. Start by adding GPA points and credits across years 
until the given year.

```{r cumGPAPoints}
stuOT %<>% group_by(sid) %>% arrange(school_year) %>% 
  mutate(total_points = cumsum(tot_gpa_points_yr), 
         total_credits = cumsum(tot_gpa_credits_yr)) %>% ungroup
```

Then, generate a cumulative GPA by dividing the cumulative GPA points by the 
cumulative credits.

```{r calcCumGPAandDropVars}
stuOT$cum_gpa_yr <- stuOT$total_points/stuOT$total_credits
sum(stuOT$cum_gpa_yr)
stuOT %<>% select(-total_points, -tot_gpa_credits_yr, -tot_gpa_points_yr, 
                  -total_credits)
```
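
The cumulative GPA is a ratio of running totals, not an average of the yearly GPAs. The sketch below illustrates the difference in base R on made-up yearly totals; the data frame `yearly` and its values are hypothetical.

```r
# Toy yearly totals (hypothetical values)
yearly <- data.frame(
  sid                = c(1, 1, 2),
  school_year        = c(2005, 2006, 2005),
  tot_gpa_points_yr  = c(8, 12, 6),
  tot_gpa_credits_yr = c(2, 6, 2)
)
yearly <- yearly[order(yearly$sid, yearly$school_year), ]

# Running totals of points and credits within each student
yearly$total_points  <- ave(yearly$tot_gpa_points_yr, yearly$sid, FUN = cumsum)
yearly$total_credits <- ave(yearly$tot_gpa_credits_yr, yearly$sid, FUN = cumsum)

# Cumulative GPA through each year
yearly$cum_gpa_yr <- yearly$total_points / yearly$total_credits
yearly[, c("sid", "school_year", "cum_gpa_yr")]
```

Student 1 has yearly GPAs of 4 and 2, which average to 3, but the cumulative GPA after the second year is 20 / 8 = 2.5 because the second year carries more credits.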


Identify final cumulative GPA in high school.

```{r finalGPACum}

stuOT %<>% group_by(sid) %>% arrange(school_year) %>% 
  mutate(cum_gpa_final = last(cum_gpa_yr))
```


Reshape data to have one record per student, and variables for each high school 
year.

```{r reshapeWide}
stuOT %<>% select(sid, school_year, cum_gpa_yr, cum_gpa_final) %>% 
  group_by(sid) %>% arrange(school_year) %>% 
  filter(row_number() < 5) %>% 
  mutate(idx = row_number()) %>%
  ungroup %>% arrange(sid, school_year) %>%
  select(-school_year) %>%
  as.data.frame()

# stuOT$cum_gpa_final <- round(stuOT$cum_gpa_final, 3)

tmp <- reshape(stuOT, direction = "wide", 
               idvar = c("sid", "cum_gpa_final"), 
               timevar = "idx")
rm(stuOT)
```
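
To show what the reshape produces, the sketch below runs the same `reshape()` call on a made-up long file with two students. The data frame `long` is hypothetical. With the default separator, the wide columns are named `cum_gpa_yr.1`, `cum_gpa_yr.2`, and so on, and students observed for fewer years get `NA` in the later columns.

```r
# Toy long file: one row per student per year index (hypothetical values)
long <- data.frame(
  sid           = c(1, 1, 2),
  cum_gpa_final = c(2.5, 2.5, 3),
  idx           = c(1, 2, 1),
  cum_gpa_yr    = c(4, 2.5, 3)
)

# One row per student; cum_gpa_yr spreads into one column per value of idx
wide <- reshape(long, direction = "wide",
                idvar = c("sid", "cum_gpa_final"),
                timevar = "idx")
wide
```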


Now, merge these GPA variables and the `Student_OnTrack_Variables` file you 
created in Step 2. Order the variables and save the resulting file in a 
temporary file.

```{r readMergeCGOT}
# Read in file 1
tmpfileName <- "analysis/Student_OnTrack_Variables.dta"
con <- unz(description = "data/analysis.zip", filename = tmpfileName, 
           open = "rb")
stuOT <- read_stata(con)
close(con)

cg_student <- inner_join(stuOT, tmp, by = "sid")
rm(stuOT, tmp)
```


#### 3.2 Add SAT and ACT Scores

Now you are ready to add the SAT and/or ACT scores you cleaned in Task 5 in 
Clean. We are going to convert ACT scores to SAT score equivalents, to ensure 
that we have the indicators on the same scale for each student. For the SAT, we 
are using the Math and ELA portions (excluding Writing). 

Start by adding SAT and ACT scores to the temporary file you saved before this 
step, then save the results in the same temporary file.


```{r loadSATdata}
# Read in the SAT file
tmpfileName <- "clean/SAT.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuSAT <- read_stata(con)
close(con)
# Read in the ACT file
tmpfileName <- "clean/ACT.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName, 
           open = "rb")
stuACT <- read_stata(con)
close(con)

cg_student <- left_join(cg_student, stuSAT, by = "sid")
cg_student <- left_join(cg_student, stuACT, by = "sid")
rm(stuSAT, stuACT)
```


Convert ACT scores to SAT scores for all students according to the ACT-SAT 
concordance table stored in the `ACTtoSAT` function.

```{r ACTSATconvert}
cg_student$sat_act_temp <- ACTtoSAT(cg_student$act_composite_score)

```


Generate the SAT concordance score, `sat_act_concordance`, as follows:

- equal to the SAT `total_score` if the student has an SAT score and does not 
  have an ACT score
- equal to the mean of the SAT score and the converted ACT score if the student 
  has taken both exams
- equal to the converted ACT score calculated above if the student has only an 
  ACT score

```{r convertACTSATConcordance}
cg_student$sat_act_concordance <- NA

cg_student$sat_act_concordance[!is.na(cg_student$sat_total_score) & 
                                 is.na(cg_student$act_composite_score)] <-
  cg_student$sat_total_score[!is.na(cg_student$sat_total_score) & 
                                 is.na(cg_student$act_composite_score)]

# Mean of the SAT score and the converted ACT score (NA unless both are present)
cg_student$sat_act_mean <- (cg_student$sat_total_score + cg_student$sat_act_temp) /2

# Use the mean only for students who took both exams
cg_student$sat_act_concordance[!is.na(cg_student$act_composite_score) &
                          !is.na(cg_student$sat_total_score)] <- 
  cg_student$sat_act_mean[!is.na(cg_student$act_composite_score) &
                          !is.na(cg_student$sat_total_score)] 

cg_student$sat_act_concordance[is.na(cg_student$sat_total_score) & 
                           !is.na(cg_student$act_composite_score)] <- 
  cg_student$sat_act_temp[is.na(cg_student$sat_total_score) & 
                           !is.na(cg_student$act_composite_score)] 

```
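
The three rules can also be written as a single nested `ifelse()`. The sketch below does this on made-up scores. It assumes the ACT score has already been converted to the SAT scale (the `sat_act_temp` column); the values shown are placeholders, not entries from the concordance table.

```r
# Toy scores (hypothetical); sat_act_temp is the ACT score on the SAT scale
scores <- data.frame(
  sat_total_score = c(1200, NA, 1000, NA),
  sat_act_temp    = c(NA, 1100, 1100, NA)
)

has_sat <- !is.na(scores$sat_total_score)
has_act <- !is.na(scores$sat_act_temp)

# Both exams: mean; SAT only: SAT; ACT only: converted ACT; neither: NA
scores$sat_act_concordance <- ifelse(
  has_sat & has_act,
  (scores$sat_total_score + scores$sat_act_temp) / 2,
  ifelse(has_sat, scores$sat_total_score,
         ifelse(has_act, scores$sat_act_temp, NA))
)
scores
```

The four rows cover the four cases: SAT only (1200), ACT only (1100), both (1050), and neither (`NA`).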
 
Check for correctness.

```{r checkACTSATWORK}
# Every student with at least one test score should have a concordance score
with(cg_student, 
     all(!is.na(sat_act_concordance[!is.na(sat_total_score) | 
                                      !is.na(act_composite_score)])))

# Students with neither test score should have no concordance score
with(cg_student, 
     all(is.na(sat_act_concordance[is.na(sat_total_score) & 
                                     is.na(act_composite_score)])))

```


#### 3.3 Identify highly qualified students (eligible to attend flagship university)

We use the SPI definition of highly qualified. Students are considered highly 
qualified if they:

- have a GPA of 3.7 or higher and an SAT score of 1100 or higher, OR
- have a GPA of 3.3 or higher and an SAT score of 1200 or higher, OR
- have a GPA of 3.0 or higher and an SAT score of 1300 or higher

```{r encodeHighlyQualified}

cg_student$highly_qualified <- NA


cg_student$highly_qualified[!is.na(cg_student$chrt_grad) & 
                              cg_student$cum_gpa_final >= 3.7 & 
                          !is.na(cg_student$cum_gpa_final) & 
                            cg_student$sat_act_concordance >= 1100 & 
                            !is.na(cg_student$sat_act_concordance)] <- 1

cg_student$highly_qualified[!is.na(cg_student$chrt_grad) & 
                              cg_student$cum_gpa_final >= 3.3 & 
                          !is.na(cg_student$cum_gpa_final) & 
                            cg_student$sat_act_concordance >= 1200 & 
                            !is.na(cg_student$sat_act_concordance)] <- 1

cg_student$highly_qualified[!is.na(cg_student$chrt_grad) & 
                              cg_student$cum_gpa_final >= 3.0 & 
                          !is.na(cg_student$cum_gpa_final) & 
                            cg_student$sat_act_concordance >= 1300 & 
                            !is.na(cg_student$sat_act_concordance)] <- 1

cg_student$highly_qualified[!is.na(cg_student$chrt_grad) &
                              is.na(cg_student$highly_qualified)] <- 0

table(is.na(cg_student$highly_qualified[is.na(cg_student$chrt_grad)]))
table(!is.na(cg_student$highly_qualified[!is.na(cg_student$chrt_grad)]))
```
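
The three GPA/SAT thresholds can be collected into one vectorized function, sketched below. The helper `is_highly_qualified()` is hypothetical and is not part of the toolkit. Unlike the chunk above, it does not look at `chrt_grad`, and it treats a missing GPA or score as not meeting the threshold.

```r
# Hypothetical helper: 1 if any of the three GPA/SAT rules is met, otherwise 0
is_highly_qualified <- function(gpa, sat) {
  meets <- (gpa >= 3.7 & sat >= 1100) |
    (gpa >= 3.3 & sat >= 1200) |
    (gpa >= 3.0 & sat >= 1300)
  # A missing GPA or score leaves `meets` as NA; count that as not qualified
  as.integer(!is.na(meets) & meets)
}

hq <- is_highly_qualified(gpa = c(3.8, 3.4, 3.1, 3.8, NA),
                          sat = c(1100, 1150, 1300, NA, 1400))
hq
```

In this example the first and third students qualify. The second falls between thresholds, and the last two have a missing score or GPA.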


Keep only cohorts who have had a chance to graduate.


```{r filterCOhortsOT}
cg_student %<>% filter(chrt_ninth == 2005 | chrt_ninth == 2006)

```


Keep the on-track variables.

```{r dropVarsOT}
cg_student %<>% select(sid, starts_with("ontrack_"), 
                       starts_with("cum_credits"), 
                       starts_with("cum_gpa"), 
                       starts_with("status_"), 
                       sat_act_concordance, highly_qualified)
cg_student$ontrack_sample <- 1
```


Merge the on-track variables with the analysis file and save the final college 
going analysis file that now includes on-track indicators.

```{r combineWithAnalysisFile}
cg_analysis <- inner_join(cg_analysis, cg_student, by = "sid")
rm(cg_student)
# use "${analysis}/CG_Analysis.dta", clear
# merge 1:1 sid using `ontrack', nogen
# save "${analysis}/CG_Analysis.dta", replace
```


```{r cleanUpNames, echo=FALSE}
cg_analysis$hs_diploma_date <- cg_analysis$hs_diploma_date.x
cg_analysis$first_hs_name <- cg_analysis$school_name_first_hs
cg_analysis$longest_hs_name <- cg_analysis$school_name_longest_hs
cg_analysis$last_hs_name <- cg_analysis$school_name_last_hs
cg_analysis$chrt_ninth <- cg_analysis$chrt_ninth.x
cg_analysis$chrt_grad <- cg_analysis$chrt_grad.x

cg_analysis %<>% select(sid, male, race_ethnicity, hs_diploma, hs_diploma_type, 
                        hs_diploma_date, frpl_ever, iep_ever, ell_ever, 
                        gifted_ever, frpl_ever_hs, iep_ever_hs, ell_ever_hs,
                        gifted_ever_hs, first_hs_code, first_hs_name, 
                        last_hs_code, last_hs_name, longest_hs_code, 
                        longest_hs_name, last_wd_group, chrt_ninth, chrt_grad, 
                        ontime_grad, late_grad, still_enrl, transferout, 
                        dropout, disappear, test_math_8_raw, test_math_8, 
                        test_math_8_std, test_ela_8, test_ela_8_std, 
                        test_composite_8, test_composite_8_std, 
                        qrt_8_math, qrt_8_ela, qrt_8_composite,
                      matches("first_college_opeid|enrl_1oct_grad_|enrl_ever_w2|enrl_grad_|enrl_1oct_ninth_|cum_credits_yr|ontrack_endyr|status_after_yr|cum_gpa_yr"), 
                        ontrack_hsgrad_sample,
                        cum_gpa_final,
                        sat_act_concordance,
                        highly_qualified,
                        ontrack_sample)
```


Your final file will contain the columns listed below:

```{r columnNamesend}
names(cg_analysis)
```


\vfill
Last updated December 2016.

© 2016 President and Fellows of Harvard College.