-
Notifications
You must be signed in to change notification settings - Fork 3
/
SDPdatalinkingtasks.Rmd
2310 lines (1689 loc) · 80.8 KB
/
SDPdatalinkingtasks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "CONNECT: DATA LINKING GUIDE"
author: "Strategic Data Project"
date: "Center for Education Policy Research at Harvard University"
output:
pdf_document:
toc: yes
toc_depth: 4
latex_engine: xelatex
includes:
in_header: harvardheader.tex
before_body: harvard_prefix.tex
html_document:
toc: yes
toc_depth: 4
---
```{r setup, echo=FALSE, error=FALSE, message=FALSE, warning=FALSE, comment=NA}
# Set options for knitr
library(knitr)
knitr::opts_chunk$set(comment=NA, warning=FALSE, echo=TRUE,
error=FALSE, message=FALSE, fig.align='center')
options(width=80)
```
# CONNECT: DATA LINKING GUIDE
## Introduction
### Purpose
**Connect** links data elements from across your system into one analysis file.
The file allows you to execute analyses inspired by the SDP College Going
Diagnostic to examine students’ progression through high school and college.
After completing connect, you will have:
- Produced student-level files that track high school completion and graduation
- Linked postsecondary college enrollment and persistence data from the National
Student Clearinghouse (NSC), to your agency’s student achievement records -
Merged disparate data files to create a single analysis file to support
**Analyze**.
The National Student Clearinghouse collects information on postsecondary
enrollment for students across the country. To access your agency’s data, please
visit: studentclearinghouse.org.
At the end of **Connect**, you will have merged 7 files and generated the
necessary variables for the analysis file.
Note: This guide references **Identify** and requires output from **Clean**. To
move through Connect, you should review these stages of the toolkit.
### Data and Structure
Connect consists of 10 steps to build one analysis files from various sources.
After the first ten steps, there is also a section on producing indicators to
on-track graduation. The steps in **Connect** require data files from **Clean**.
![Task Structure](includes/img/DLTaskDataAndStructure.jpg)
Throughout **Connect** the term "merge" indicates that two files will be linked.
Merging allows you to combine datasets horizontally and add new columns (or
variables) from one dataset to another based on identifier(s) present in both.
### Step Components
For each step, you will find the following:
- Purpose: an overview of each step;
- Files Needed: data elements or files required to complete each step;
- After this step: an overview of "output" generated by each step.
Also, throughout **Connect**, you will find R code to explain each of the
sub-steps. Code appears in boxes, like below:
```{r unevaluatedExample, eval=FALSE, echo=TRUE}
# keep only observations if 8th grade math score is not missing
stutest %<>% filter(!is.na(test_math_8))
# check to see if the file is unique by student id
nrow(stutest) == n_distinct(stutest$sid)
```
### Infrastructure
In the infrastructure folder you unzipped from
[sdp.cepr.harvard.edu/toolkit](www.sdp.cepr.harvard.edu/toolkit) toolkit, look
for the files in the **clean** folder and the files `Connect.R` and
`Connect_On_Track.R` in the **programs** folder. The files in the **clean**
folder are those you produced via **Clean** and that we have also provided for
you. The do files provide a shell for you to fill in the ten steps of
**Connect** and the section on On-Track indicators. Doing so will allow you to
produce the `CG_Analysis` file which will be saved to the **analysis** folder.
As always, if you would like additional support from the friendly SDP team,
please email us at sdp@gse.harvard.edu.
### Prepare the R Environment
```{r loadRequiredPackages}
# Load the packages and prepare your R environment
library(tidyverse) # main suite of R packages to ease data analysis
library(magrittr) # allows for some easier pipelines of data
# Read in some R functions that are useful for toolkit tasks, see SDP R Glossary
# for details
source("R/functions.R")
library(haven) # required for importing .dta files
```
## Step 1: Prior Achievement Part 1
### Overview
**Purpose:** Prepare 8th grade test scores for the analysis file.
**Files needed:** Prior_Achievement output file from Task 5 in **Clean**
**After this step** you will have a temporary file `tests` that contains
prior achievement information for math, ELA, and math-ELA composite score.
#### 1.1 Load the File and Check Uniqueness
Begin Step 1 of Connect by loading the `Prior_Achievement` file resulting from
Task 5.
```{r readPriorAchievement}
# To read data from a zip file we create a connection to the path of the
# zip file
tmpfileName <- "clean/Prior_Achievement.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
stuach <- read_stata(con) # read data in the data subdirectory
close(con)
```
Note the structure of the file. Thanks to Task 5, the file is unique by `sid`
and contains test scores for only 8th grade math , ELA and math-ELA composite.
If your data is not structured like this, please review Task 5.
Raw scores (`math_raw_score` and `ela_raw_score` from Task 5) are not shown. You
will primarily use scaled or standardized scores in future analyses. However,
keep raw scores in your file to compare results between scaled, standardized,
or raw scores later on.
#### 1.2 Rename variables to indicate that they are 8th grade
```{r renameg8Vars}
stuach %<>% rename(
test_math_8_raw = raw_score_math,
test_ela_8_raw = raw_score_ela,
test_math_8 = scaled_score_math,
test_ela_8 = scaled_score_ela,
test_composite_8 = scaled_score_composite,
test_math_8_std = scaled_math_std,
test_ela_8_std = scaled_ela_std,
test_composite_8_std = scaled_score_composite_std)
```
#### 1.3 Define Prior Achievement Quartiles in Each Subject
Prior achievement quartiles have to be created by subject and by year.
To create a variable to capture the quartile of an eighth grader’s score in
each subject, relative to peers who took the same test, the same school year.
This allows you to compare performance of each student to peers in the same year.
In R this can be done by combining a `group_by` and `mutate` command as below.
```{r defineQuartiles}
stuach %<>% group_by(school_year) %>%
mutate(qrt_8_math = ntile(test_math_8, 4),
qrt_8_ela = ntile(test_ela_8, 4),
qrt_8_composite = ntile(test_composite_8, 4))
```
## Step 2: School Crosswalk
### Overview
**Purpose:** Prepare a crosswalk between school codes and names. This allows
you to link a high school students’ high school graduation with their college
enrollment outcomes.
**Files needed:** School research file from Identify
**After this step** you will have created a temporary file `highschoolinfo` that
contains a crosswalk between school codes and school names.
### Steps
#### 2.1: Load the File and Check Uniqueness
```{r loadSchoolData}
# To read data from a zip file we create a connection to the path of the
# zip file
tmpfileName <- "clean/School.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
schl <- read_stata(con) # read data in the data subdirectory
close(con)
```
Load the School research file from **Identify**. Restrict the "universe" of
schools to only include high schools.
A crosswalk table ensures the final file is unique by `school_code` and that one
`school_code` maps to one `school_name`. For example, in an uncleaned file,
Albert Einstein High School might be spelled three ways, "A. Einstein HS,"
"Einstein High School," or "A.E. HS," but have one `school_code`. Alternatively,
"Jones High School" might have a code of 153 and 154. You must fix these issues
before moving on.
```{r distinctSchools}
# keep only the school code and school name
schl <- select(schl, school_name, school_code)
# keep school_code school_name
# duplicates drop
schl <- distinct(schl)
# // check that the file is unique by school_code
# isid school_code
length(unique(schl$school_code)) == nrow(schl)
```
#### 2.2 Generate First, Last, and Longest High School Variables
Next, generate three variables, `first_hs_code`, `last_hs_code`, and
`longest_hs_code`. Set them equal to the `school_code` of each high school.
```{r genSchoolRenameVars}
# creates first / last / longest hs id variables
schl$first_hs_code <- schl$school_code
schl$last_hs_code <- schl$school_code
schl$longest_hs_code <- schl$school_code
```
## Step 3: Student Attributes
### Overview
**Purpose:** Load Student Attributes data to obtain time-invariant information for
students in the system.
**Files needed:** `Student_Attributes` output file from Task 1 in Clean
**After this step** you will have loaded the Student Attributes data into memory.
### Steps
#### 3.1 Load the File and Check Uniqueness
```{r loadStudentAttributes}
# To read data from a zip file we create a connection to the path of the
# zip file
tmpfileName <- "clean/Student_Attributes.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
stuatt <- read_stata(con) # read data in the data subdirectory
close(con)
```
## Step 4: Student School Year
### Overview
**Purpose:** Merge Student School Year data with Student Attributes data in
memory and generate program participation status variables.
**Files needed:** `Student_School_Year_Ninth` output file from Task 3 in **Clean**
**After this step** You will have merged Student School Year data with Student
Attributes data into memory and generated variables that indicate if a student
has ever been classified as FRPL (Free and Reduced Price Lunch),
IEP (Individualized Education Plan), ELL (English Language Learner) or Gifted
in the system or during high school.
### Steps
#### 4.1 Merge on Student School Year
Merge the Student School Year file with the Student Attributes
file in memory. This adds student-level information that may
change from year to year (i.e FRPL, IEP, ELL, or Gifted). This is a
1:m (one-to-many) merge because the Student Attributes file is
unique by `sid` and the Student School Year file is unique by `sid` +
`school_year`.
Before conducting the merge, the Student School Year output file should be
unique by `sid` and `school_year` and contain information on all available grades.
The Student School Year data should also include the `first_9th_school_year_observed`
variable.
```{r mergeonStudentSchYear}
tmpfileName <- "clean/Student_School_Year_Ninth.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
stusy <- read_stata(con)
close(con)
# Data checks
# Is data unique by sid and school year
nrow(stusy) == length(unique(paste0(stusy$sid, stusy$school_year)))
# How many unique grades exist?
table(stusy$grade_level)
# Does first 9th school year exist?
"first_9th_school_year_observed" %in% names(stusy)
# Optional: Get data dimensions for both frames for checking
nrow_stusy <- nrow(stusy)
nstu_stusy <- n_distinct(stusy$sid)
nrow_stuatt <- nrow(stuatt)
nstu_stuatt <- n_distinct(stuatt$sid)
# Merge
stusy <- inner_join(stusy, stuatt, by = "sid")
```
#### 4.2 Checking the Merge
In an ideal world records match perfectly. However, administrative records are
often messy. Perfect merges rarely occur. Therefore, consider a merge
satisfactory if at least 95% of students appear in both files.
By design, `inner_join` only keeps the matching variables.
```{r checkStuSchYearMerge}
# check the number and percentage of students appearing in both files
# Check for perfect merge
nrow(stusy) == nrow_stusy
nstu_stusy == n_distinct(stusy$sid)
nstu_stuatt == n_distinct(stusy$sid)
# Check merge percentage
nrow(stusy) / nrow_stusy
nstu_stusy / n_distinct(stusy$sid)
nstu_stuatt / n_distinct(stusy$sid)
stusy <- arrange(stusy, sid)
length(unique(stusy$sid)) == length(unique(stuatt$sid))
```
#### 4.3 Generate Program Participation Variables
Now, create binary variables (variables that assume values of 0 or 1) to
indicate if a student ever:
1. qualified to participate in FRPL;
2. qualified for an IEP;
3. classified as ELL (or LEP);
4. qualified for gifted program.
These variables, (`frpl_ever`, `iep_ever`, `ell_ever`, and `gifted_ever`) allow
you to explore high school and college outcomes for students that participated
in these programs for one or more school years.
Create analogous variables to capture students’ program participation status in
high school.
```{r genProgramPartVars}
# In R this is an easy way to go by just using group_by and mutate
tmp <- filter(stusy, (grade_level >= 9 & grade_level <= 12)) %>%
group_by(sid) %>%
summarize(frpl_ever_hs = ifelse(max(frpl) > 0, 1, 0),
iep_ever_hs = max(iep),
ell_ever_hs = max(ell),
gifted_ever_hs = max(gifted))
stusy <- inner_join(stusy, tmp, by = "sid")
stusy <- arrange(stusy, sid)
stusy %<>% group_by(sid) %>%
mutate(frpl_ever = ifelse(max(frpl) > 0, 1, 0),
iep_ever = max(iep),
ell_ever = max(ell),
gifted_ever = max(gifted))
rm(tmp)
```
High school status variables (`frpl_ever_hs`, `iep_ever_hs`, `ell_ever_hs`
and `gifted_ever_hs`) allow flexibility in defining student subgroups. This is
useful if participation data is missing non-randomly before a student enters
high school, or if variables that capture participation are overly inclusive
across years. For example, a student who demonstrates limited English
proficiency in 4th grade may be fluent in English by 9th grade. It may or may
not be appropriate to categorize the student as ELL in analyses that examine
high school outcomes.
## STEP 5: Student School Enrollment
### Overview
**Purpose:** Merge Student School Enrollment with data in memory.
**Files needed:** Student_School_Enrollment_Clean output from Task 4 in Clean
**After this step** you will have merged Student School Enrollment data with
data in memory.
### Steps
#### 5.1 Merge on Student School Enrollment
Merge the Student School Enrollment file onto the analysis file.
This allows you to identify high schools students enrolled at
different times.
This is a 1:m merge, as the data file from the previous two
steps is unique by `sid` + `school_year` and the Student School
Enrollment file is unique by `sid`, `school_year`, `school_code`, and
`enrollment_date`.
```{r loadAndMergeEnrollment}
tmpfileName <- "clean/Student_School_Enrollment_Clean.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
stuschl <- read_stata(con)
close(con)
# Optional - get dimensions for comparing merge
nstu_stusy <- n_distinct(stusy$sid)
nstu_stuschl <- n_distinct(stuschl$sid)
nrow_stusy <- nrow(stusy)
stusy <- inner_join(stusy, stuschl, by = c("sid", "school_year"))
```
Before the merge, the Student School Enrollment file should
be unique by `sid`, `school_year`, `school_code`, and `enrollment_date`.
#### 5.2 Checking the Merge
```{r checkStuSchlMerge}
# Check percentage of students and rows merged
n_distinct(stusy$sid) / nstu_stusy
n_distinct(stusy$sid) / nstu_stuschl
nrow(stusy) / nrow_stusy
# Above 0.95 so we can proceed
```
## STEP 6: High School Indicators and Outcomes
### Overview
**Purpose:** Generate high school indicators and outcomes in two categories.
**Files needed:** The file in memory from Steps 3-5 and `highschoolinfo` from
Step 2.
**After this step** you will have created a number of high school indicators
and outcomes: 1) first, last, and longest high school; 2) 9th grade and
graduation cohorts; and 3) end of high school outcomes: on-time and late graduates
and high school enrollment outcomes for non-graduates.
### Steps
#### 6.1 Define First, Last, and Longest High School
To begin, make sure that the data includes only student observations in high
school. (You have done this in **Clean** already; we are checking it again here).
```{r selectHSonly}
stusy %<>% filter(grade_level >= 9 & !is.na(grade_level) &
grade_level <= 12)
```
There might be students who are assigned to high schools but whose attendance
duration is 0. Drop these school assignments to ensure that you assign students
to high schools they actually attended.
```{r dropZeroDayAttend}
stusy %<>% filter(days_enrolled > 0)
```
#### Define First High School
To identify a student’s first high school, determine the first enrollment
episode for the student.
In some cases, students enroll in more than one school at the same time. In such
cases, assign them to the school where they attended longest.
Should students have multiple first enrollments of the same length, randomly
assign them to one of these schools.
```{r assignFirstHS}
stusy %<>% arrange(sid, school_year, enrollment_date, desc(days_enrolled))
stusy %<>% group_by(sid) %>%
arrange(sid, school_year, enrollment_date, desc(days_enrolled)) %>%
mutate(first_hs_code = first(school_code),
last_hs_code = last(school_code))
```
#### Define Last High School
To identify a student’s last high school, determine the last enrollment episode
for the student. In cases of joint enrollment, use the school where the student
attended longest. Where joint enrollment duration is the same, randomly assign
the last high school.
```{r genLastHSCode}
stusy %<>% group_by(sid) %>%
arrange(sid, desc(school_year), desc(withdrawal_date), desc(days_enrolled)) %>%
mutate(last_hs_code = last(school_code)) %>%
ungroup
```
#### Define Longest High School
To determine the longest enrolling HS, you first have to add up all enrollments
within a HS. Since in Clean you ensured that there are no overlapping enrollments
within a school, you can add enrollments up.
In cases where students enrolled in more than one school for the same amount of
time, randomly assign the longest high school
```{r defineLongestHS}
stusy %<>% group_by(sid, school_code) %>%
mutate(total_days_enrolled_in_school = sum(days_enrolled))
stusy %<>% group_by(sid) %>%
mutate(total_days_enrolled_in_school_max = max(total_days_enrolled_in_school))
stusy %>% select(sid, school_code, enrollment_date, total_days_enrolled_in_school,
days_enrolled, first_hs_code) %>%
head
stusy %<>% group_by(sid) %>%
mutate(longest_hs_code = unique(school_code[total_days_enrolled_in_school_max ==
total_days_enrolled_in_school])[1])
stusy %>% select(sid, school_code, enrollment_date, total_days_enrolled_in_school,
days_enrolled, first_hs_code, longest_hs_code) %>%
head
# Drop temporary variables
stusy$total_day_enrolled_in_school <- NULL
stusy$total_days_enrolled_in_school <- NULL
stusy$total_days_enrolled_in_school_max <- NULL
```
#### Merge on highschoolinfo
Merge the `highschoolinfo` tempfile created in Step 2 onto the
current file. This allows you to obtain school names(`first_hs_name` and
`last_hs_name`) associated with the high school codes just captured.
This requires merging data currently loaded in R to the
`schl` object three times – once on `first_hs_code`,
then on `last_hs_code`, and finally on `longest_hs_code`. These
merges will all be m:1 (many to one) because the file in memory
contains multiple observations per school and the tempfile
contains only one per school.
```{r mergeSchoolNames}
stusy <- left_join(stusy, schl[, c("school_name", "first_hs_code")],
by = c("first_hs_code"))
stusy %<>% rename(school_name_first_hs = school_name)
stusy <- left_join(stusy, schl[, c("school_name", "last_hs_code")],
by = c("last_hs_code"))
stusy %<>% rename(school_name_last_hs = school_name)
stusy <- left_join(stusy, schl[, c("school_name", "longest_hs_code")],
by = c("longest_hs_code"))
stusy %<>% rename(school_name_longest_hs = school_name)
```
By default, the `left_join` function keeps all students in the `stusy` enrollment
data. If a student has no school name that can be assigned, this is filled in as
a missing value. This can be used to filter the merge results easily.
```{r cleanupMerge}
stusy %<>% filter(!is.na(school_name_longest_hs))
stusy %<>% filter(!is.na(school_name_first_hs) & !is.na(school_name_last_hs))
```
#### 6.2 Assign Ninth Grade and Graduation Cohorts
Assigning students to cohorts will allow you to calculate various indicators
(e.g. high school graduation, college enrollment) using a different set of
students in the denominator. For example, when calculating college enrollment,
you could use the ninth grade cohort to illustrate how high schools prepared
their incoming freshmen for future success, or you could use the graduating
cohort to illustrate the percentage of a high school’s graduates enrolling in
college.
Since the ninth grade cohort is equal to `first_9th_school_year_observed` in the
student attributes file, just rename `first_9th_school_year_observed` to
`chrt_ninth`.
```{r renameChrtNinghtVar}
# define ninth grade cohort
# rename first_9th_school_year_observed chrt_ninth
stusy %<>% rename(chrt_ninth = first_9th_school_year_observed)
```
The graduation cohort variable, `chrt_grad`, is the school year in which a student
graduated. If a student obtained a diploma prior to September 1st, the `chrt_grad`
variable is the same as the year of `hs_diploma_date`. If a student received a
diploma between September 1st and December 31st treat them as graduates for the
next school year.
```{r defineChrtGrad}
# define graduation cohort
stusy$chrt_grad <- NULL
library(lubridate)
# Use lubridate package to find months and years easily
head(year(stusy$hs_diploma_date))
head(month(stusy$hs_diploma_date))
stusy$chrt_grad <- ifelse(month(stusy$hs_diploma_date) < 9,
year(stusy$hs_diploma_date),
year(stusy$hs_diploma_date) + 1)
```
**Question 6.2** Test your understanding by filling the shaded areas below.
```{r checkResultsChrtGrad}
stusy %>% filter(sid %in% c(16305, 16306, 16307)) %>%
select(sid, hs_diploma_date, chrt_ninth, chrt_grad) %>%
distinct(.keep_all=TRUE)
```
Note, that in **Clean**, you assigned every student a
`first_ninth_school_year_observed`. You either had their first ninth grade in
the data, or if they transferred into the district later in high school, you
backward mapped them to an appropriate ninth grade school year. Thus,
`chrt_ninth` should never be missing. `chrt_grad`, however, could be missing,
because not all students graduated, thus they will not be assigned a graduating
cohort.
#### 6.3 Define High School Outcomes
To determine end of high school outcomes, you need a student's last withdrawal
code (withdrawal code at their last high school). Use the last withdrawal code
to determine if the student graduated, transferred out, dropped out, or has
another outcome.
#### 6.3a Group last withdrawal codes together into four end of high school outcomes:
* 1 = Graduated (define graduated using withdrawal data and `hs_diploma` in the Student Attributes file, or any other source of graduation information used)
* 2 = Transfer Out
* 3 = Drop Out
* 4 = Other (all other reasons for withdrawal)
Outcomes are captured using the `last_wd_group` variable. Assigning the last
withdrawal code to `last_wd_group` requires an understanding of decision rules in
your agency. Some withdrawal codes may be ambiguous or redundant and need to be
combined to fit under the four categories. Therefore, it is important to elicit
help from those knowledgeable of local graduation, transfer, and dropout policies
in your agency. For example, there may be special codes for students who are
incarcerated or pass away that are not well-documented. We provide an example
below, but you will have to customize this script based on values your agency
uses. Particularly, for dropouts you should make sure the agency is not being
penalized for something it does not have control over.
First, examine the values for `last_withdrawal_reason`. You will have to make sure
that you capture all these values in defining the last withdrawal groups. Should
your data have any missing values for `last_withdrawal_reason`, be sure to assign
those to an appropriate category. In some agencies, a missing withdrawal code
indicates that the student is still enrolled, so assign them as still enrolled.
```{r checkWithdrawalCodes}
stusy %>% arrange(sid) %>%
summarize(last_withdrawal = last(last_withdrawal_reason)) %>%
select(last_withdrawal) %>% unlist %>% table
```
```{r recodeWithdrawalCodes}
stusy$last_wd_group <- NA
stusy$last_wd_group[stusy$last_withdrawal_reason %in% c("Home School",
"Other Transfer",
"Transfer Out of District",
"Death")] <- 2
stusy$last_wd_group[stusy$last_withdrawal_reason %in% c("Absenteeism",
"No Show",
"Expulsion")] <- 3
stusy$last_wd_group[stusy$hs_diploma == 1] <- 1
stusy$last_wd_group[is.na(stusy$last_wd_group)] <- 4
table(is.na(stusy$last_wd_group))
```
Note that we populated the graduating group last; this is because the evidence
of a high school diploma overrides any other value for last withdrawal reason.
Any remaining `last_withdrawal_reason` values were then classified as 4, Other.
#### 6.3b Define High School Outcomes
**Identify On-Time and Late High School Graduates**
First, identify students who graduated within 4 years of entering high school
(on-time graduates) as well as students who took more than 4 years
(late graduates). These variables allow you to examine time taken to complete
high school and explore how this varies across high schools within a system.
These two variables have to add up to the total graduates.
```{r defineGradTypes}
# define on-time graduates
stusy$ontime_grad <- ifelse(stusy$chrt_ninth >= stusy$chrt_grad -3 &
!is.na(stusy$chrt_ninth) &
!is.na(stusy$chrt_grad) &
stusy$hs_diploma == 1 , 1, 0)
# define late graduates
stusy$late_grad <- ifelse(stusy$ontime_grad == 0 &
!is.na(stusy$chrt_ninth) &
!is.na(stusy$chrt_grad) &
stusy$hs_diploma == 1 , 1, 0)
all(stusy$late_grad + stusy$ontime_grad == stusy$hs_diploma)
```
**Identifying High School Enrollment Outcomes for Non-Graduates**
Next, assign high school enrollment outcomes for students who have not graduated
by a point in time. You may define this point, but typically it is the current
year if data is up to date. It is important that each student is either marked
as a graduate or assigned to only one of the following categories. Notice how
the definition of each category is conditional on all previous categories.
```{r hsOutcomesNonGrads}
# still enrolled
maxDataYear <- max(stusy$school_year)
stusy %<>% group_by(sid) %>%
mutate(still_enrl = ifelse(max(school_year) == maxDataYear &
hs_diploma != 1, 1, 0))
# transfer out
stusy$transferout <- ifelse(stusy$last_wd_group == 2 &
stusy$hs_diploma!=1 &
stusy$still_enrl != 1, 1, 0)
# drop out
stusy$dropout <- ifelse(stusy$last_wd_group == 3 &
stusy$hs_diploma!=1 &
stusy$still_enrl != 1 &
stusy$transferout != 1, 1, 0)
# disappear
stusy$disappear <- ifelse(stusy$dropout != 1 &
stusy$hs_diploma!=1 &
stusy$still_enrl != 1 &
stusy$transferout != 1, 1, 0)
```
You have generated most of the key high school indicators and outcomes, so you
no longer need all source variables. Keep only the variables listed here.
Based on these time-invariant variables, the file is now unique by `sid`. To drop
other variables, first keep the 30 then drop duplicates.
```{r keepTimeInvariantVars}
# // keep time-invariant variables
stusy %<>% ungroup %>%
select(sid, male, race_ethnicity,
last_wd_group, still_enrl, transferout,
dropout, disappear,
matches("hs_diploma|_ever|_hs_code|school_name|chrt|_grad"))
stusy %<>% distinct(.keep_all = TRUE)
# // make sure the file is unique by sid
nrow(stusy) == length(unique(stusy$sid))
```
## STEP 7: Prior Achievement Part 2
### Overview
**Purpose:** Merge prior achievement test scores onto the analysis file.
**Files needed:** the analysis file in memory from Step 6 and `tests' from Step 1
**After this step** you will have merged prior achievement data with the current
analysis file.
#### 7.1 Merge on tests
You have a data set unique by `sid` that contains key student attributes,
high school indicators and outcomes. All you need now are 8th grade test scores.
To add prior achievement scores to the file, merge the `stuach` object from
Step 1 onto the current analysis file. This is a 1:1 merge on `sid`.
Use `left_join` to keep all the `sid` in the enrollment file. You may expect to
capture prior achievement for most students, but not all students will have
score information so it is important to keep the enrollment file. For example,
students who first enroll in the system during high school (after 8th grade) or
were exempt from tests will not have prior test scores.
Using `left_join` saves a step of dropping students in `stusy` who do not appear in `stuach`.
```{r MergePriorAchievement}
stusy <- left_join(stusy, stuach, by = "sid")
```
Review the merge results, to see what percentage of students
have prior scores.
```{r checkPriorAchieveMerge}
table(is.na(stusy$qrt_8_math))
```
In this case, almost 70% of high school students have prior achievement data.
## STEP 8: Examining the Analysis File Part 1
### Overview
Congratulations! You now finished working with agency administrative records and
can save a preliminary analysis file to generate analyses on student transitions
through high school completion and college-going success.
First, order the variables in a sensible way.
```{r organizeColumns}
stusy %<>% select(sid, male, race_ethnicity, hs_diploma, hs_diploma_type,
hs_diploma_date, frpl_ever, iep_ever, ell_ever, gifted_ever,
frpl_ever_hs, iep_ever_hs, ell_ever_hs, gifted_ever_hs,
first_hs_code, last_hs_code, longest_hs_code, school_name_first_hs,
school_name_last_hs, school_name_longest_hs, last_wd_group,
chrt_ninth,
chrt_grad, ontime_grad, late_grad, still_enrl, transferout,
dropout, disappear, test_math_8_raw, test_math_8,
test_math_8_std, test_ela_8_raw, test_ela_8,
test_ela_8_std, test_composite_8, test_composite_8_std,
qrt_8_math, qrt_8_ela, qrt_8_composite)
# Save as college going
stuCG <- stusy; rm(stusy)
```
All that is left to do is process college enrollment records from the National
Student Clearinghouse (NSC), and merge these data onto the `Student_CollegeGoing`
file. This creates a single analysis file to generate analyses on student
transitions through high school and college.
Before moving on, take a moment to admire your work and refamiliarize yourself
with sources and processes for each of these variables. Ask yourself: What
research files were the variables produced from? How were high school indicators
and outcomes created?
```{r reviewVarNames}
names(stuCG)
```
Use the following questions for the numbered variables above to guide your thoughts:
1. (1-6): Which research file are these variables from?
2. (7-14): Which research file are these variables from?
3. (15-17): How are first, last, and longest high school codes identified?
4. (18-20): From what research file are first, last, and longest high school names obtained?
5. (21): What does the `last_wd_group` variable describe?
6. (22-23): How are 9th grade and graduation cohorts defined?
7. (24-29): How are graduation and high school enrollment outcomes defined?
8. (30-40): Which research file are these variables from?
## STEP 9: National Student Clearinghouse Data
**Purpose:** Generate college enrollment and persistence indicators
**Files needed:** The `Student_NSC_Enrollment_Indicators` data from **Task 7**
in **Clean**, and the `Student_CollegeGoing` file you saved in Step 8.
**After this step** you have created the indicators that will be used for college
going analysis.
For all NSC related analyses, we create two types of indicators: one bases on
the graduating cohort, and another based on the ninth grade cohort. Each of these
indicators serves a different purpose, and can be used to answer different
questions. For example, if you are interested in how high schools or the entire
agency is doing in enrolling their graduates to college, you would be using the
indicators based on the graduating cohort.
If, however, you want to evaluate how the high schools or agency is preparing
their incoming freshman to go through high school and enroll in college, you
will use the indicators bases on the night grade cohort. In addition, we create
separate indicators to evaluate how soon after high school students enroll in
college. One set of indicators is based on enrollment on October 1st. The second
set of indicators is based on enrollment within two years of graduation. This
latter indicator is calculated based on the calendar date of the student’s high
school graduation (or, in case of the ninth grade cohort, the expected on-time
high school graduation).
To begin this step, open the `Student_NSC_Enrollment_Indicators` file from Task 7,
and merge the `Student_CollegeGoing` (the `stuCG` object) from Step 8.
```{r joinNSCData}
tmpfileName <- "clean/Student_NSC_Enrollment_Indicators.dta"
con <- unz(description = "data/clean.zip", filename = tmpfileName,
open = "rb")
stunsc <- read_stata(con)
close(con)
# merge on variables needed from Student_College_Going to a temp file
tmp <- select(stuCG, sid, hs_diploma_date, hs_diploma, chrt_grad, chrt_ninth)
# Use inner_join to only keep students in both
stunsc <- inner_join(tmp, stunsc, by = c("sid")); rm(tmp)
```
#### 9.1 Create a varaible to indicate if the student enrolled in college within two years of graduating from high school.
Start with the graduating cohort. If the student enrolled in college within
`2*365` days after high school graduation, set the indicator to 1.
```{r genTwoYearEnrollment}
# create and indicator to show if the student enrolled within two years
# of HS graduation
stunsc$enrl_ever_w2_grad <- ifelse(stunsc$first_enrl_date_any <
(stunsc$hs_diploma_date + (365*2)) &
!is.na(stunsc$hs_diploma_date) &
!is.na(stunsc$first_enrl_date_any),
1, 0)
```
For the ninth grade cohort, first create a variable that represents on-time
graduation. Set this date to September 1st the fourth year after the student’s
ninth grade cohort. Then create the enrollment indicator using this date.
```{r genEnrollForOntimeGrad}
stunsc$ontime_yr <- stunsc$chrt_ninth + 3
stunsc$ontime_date <- mdy(paste0("09", "01", stunsc$ontime_yr))
# create and indicator to show if the student enrolled within two years of
# expected HS graduation
stunsc$enrl_ever_w2_ninth <- ifelse(stunsc$first_enrl_date_any <
(stunsc$ontime_date + (365*2)) &
!is.na(stunsc$ontime_date) &
!is.na(stunsc$first_enrl_date_any),
1, 0)
```
**Question 9.1 Test your understanding by filling the shaded areas below.**
```{r checkCodingNSC}
stunsc %>% filter(sid %in% c(15647,15656,15658)) %>%
select(sid, chrt_ninth, chrt_grad, hs_diploma_date, first_enrl_date_any,
enrl_ever_w2_grad, ontime_yr, ontime_date, enrl_ever_w2_ninth) %>%
distinct(.keep_all=TRUE) %>%
as.data.frame
```
#### 9.2 Create variables to indicate if the student was enrolled in college by Oct 1
This variable indicates if a student enrolled in college immediately after high
school. In addition, we use indicators based on October 1st college enrollment
to track if student persisted in college. For this, we need to create variables
to indicate enrollment on October 1st the 1st, 2nd, 3rd and 4th year after high
school graduation.
First, create placeholder variables for both the ninth and graduating cohort,
for each of the four years.