<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    Joseph Chappell, Benjamin Feder, Nathan Barrett</center>
    <a href="https://doi.org/10.5281/zenodo.6407247"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6407247.svg" alt="DOI"></a>


# <center> **Data Exploration: Tennessee Community Colleges** </center>

## **1. Introduction**
Historically, state agencies used their own administrative data to administer programs and complete required reporting. Legal, technological, and human resource barriers prevented the use of data to assist policymakers in making decisions. Accordingly, data were siloed at the agency level and rarely leveraged to evaluate or inform policy development within or across agencies. Through a concerted effort supported by the Statewide Longitudinal Data System (SLDS) Grant Program, states have gradually come together to address this limitation. In this course, you will learn how to review, understand, and link data from different agencies. You will also uncover real-life problems in using the data (such as missing data, errors in the data, and complex data structures) and learn how to address them. 

This notebook introduces you to the concept of creating a group or a "cohort" that will be used for future analysis. We will construct measures to understand who we are including and excluding (coverage) from the cohort and walk you through the decisions that need to be made when constructing the cohort using filters such as award level, gender, major, and others. We begin with introducing you to the data analytical tools to access the data, including connecting R to the server and using SQL queries to pull the data. We will then use these tools to explore completions files from the Tennessee Board of Regents (TBR). We will create a dataset (called a "data frame") and investigate trends in Tennessee community college graduates. At the end of this notebook, we will save the summary statistics in csv files to be used in subsequent notebooks.

## **2. Learning Objectives**

The Applied Data Analytics training uses a project-based approach to develop your analytic skills. You will begin by working with your team to develop and refine a research question. A crucial part of this is data exploration. You will implement techniques using SQL and R to explore and better understand the data that are available to you and address the feasibility of your question. This will form the basis for all future analyses you will do in this class and is a crucial first step for any data analysis workflow. As you work through the notebook, we will have checkpoints for you to practice writing code by making small adjustments, but you can also think about how you might apply any of the techniques and code presented to other datasets to address your research question. 

The guiding research questions we will use for this series of notebooks are quite general: 

>**What are the employment outcomes of the 2015-16 community college graduates? How do these outcomes vary by cohort characteristics and employer characteristics?**

This will allow the code we use to have the most versatility. We will analyze these questions through a variety of different lenses, and will start in this notebook by defining a specific cohort of Tennessee associate's degree recipients in the 2015-2016 academic year. We will then track their earnings and employment outcomes over time in the following notebook. The exploration of the supply side of the labor market will be later supplemented by an analysis of the demand side to enhance our understanding of the overall labor market.

We are going to show just a portion of what you might be interested in investigating to answer these overarching questions, so don't feel restricted by the questions we've decided to try to answer.

>**When defining your research question(s), recall that one key benefit of working with Tennessee administrative records in the ADRF is the ability to integrate higher education and employment data.**

#### **Digging into the literature**

Once a research question is posed&mdash;or a general research topic has been identified&mdash;it is always recommended to thoroughly seek out and examine any current literature/research pertaining to the topic. Previous studies and articles should help direct your research as they can...
- ...document what has and has not been investigated.
- ...demonstrate how others have defined and/or measured key concepts.
- ...provide a foundation for additional research.
- ...contribute confirmatory/contradictory findings.
- ...place your research into context.

A brief search for research articles pertaining to employment outcomes of community college graduates yields numerous articles including the following:

- Stevens, A. H., Kurlaender, M., & Grosz, M. (2019). Career technical education and labor market outcomes evidence from California community colleges. <em>Journal of Human Resources</em>, <em>54</em>(4), 986&mdash;1036.
- Minaya, V., & Scott-Clayton, J. (2020). Labor Market Trajectories for Community College Graduates: How Returns to Certificates and Associate's degrees Evolve Over Time. <em>Education Finance and Policy</em>, 1&mdash;62.
- Carruthers, C. K., Fox, W. F., & Jepsen, C. (2020). Promise Kept? Free Community College, Attainment, and Earnings in Tennessee. February 2020 (work in progress). <em>WORK</em>.
- Stobierski, T. (2020, June 9). Average Salary by Education Level: The Value of a College Degree. <em>Northeastern University</em>. https://www.northeastern.edu/bachelors-completion/news/average-salary-by-education-level/

While the articles you find may not directly address your research questions/topics, they are valuable in refining your research question(s) and methodology. For example, the Carruthers, Fox, & Jepsen (2020) paper contains ample information regarding policy relevance and context. Similarly, you could see an opportunity to expand on the research by Stobierski (2020), who found differences in earnings by education level; however, Stobierski was limited to public-use data from the U.S. Bureau of Labor Statistics. The microdata available via the ADRF would enable you to examine this in far greater detail and place the results within the context of an individual state. You may find research, such as the article authored by Minaya & Scott-Clayton (2020), that closely aligns with your topic of interest. In these instances, you may find helpful information regarding conceptual framework and methodology.

> The research articles are available on the course webpage outside of the ADRF.

#### **Notebook 1a Questions and Goals** 
In this notebook, we focus on seeking answers to the following questions: 
- How many students graduated from Tennessee community colleges in the 2015-16 academic year?
- What filters can be used to define the cohort (e.g., demographics, institutions, enrollment type, etc.)?
- How many students graduated from Tennessee community colleges by subgroup (e.g. demographics, institutions, enrollment type, etc.)?

After completing this notebook you should be able to perform the following analytical tasks:
- load R libraries and establish a connection to the server
- create a cohort sample by using the TBR completions file
- calculate descriptive statistics to understand who is in the population
- create new tables from the larger tables in a database (sometimes called the "analytical frame")
- explore different variables of interest
- clean data
- create aggregate metrics

The specific techniques include, but are not limited to:
- **SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on a condition
 - `GROUP BY`: aggregate data over the variables of interest
 - `ORDER BY`: sort data based on the variables of interest
 - `DISTINCT`: look at distinct values of a variable
 - `JOIN ... ON`: join tables
- **R code**:
 - `group_by` and `summarize` to find group-based measures
 - `mutate` to create new variables
 - `arrange` and `desc` to sort values

#### **Datasets** ####
We will explore and understand the Tennessee Board of Regents (TBR) tables in this notebook:
- **Community College Graduates**: The graduates table is provided by TBR. The data include graduations at all TBR community colleges and covers the time period of summer 2009 through fall 2020.
- **Community College Enrollments**:  Also provided by TBR and contains all enrollment data at TBR community colleges from summer 2009 through fall 2020.

> **NOTE:** Only public 2-year institutions and Tennessee Colleges of Applied Technology (TCATs) institutions report data to TBR. Additional data regarding completions beyond TBR's governance can be found in the post-TBR graduation data.

#### **Directory Structure**

We will constantly read and write csv files to load crosswalks and to save results in all the notebooks. Let's create a few folders in your U drive first so it is easier for you to organize all the files. 

- Open Windows File Explorer
- On the left hand side, find U drive (U:) and click into it
- On the right hand side, open your user folder: FirstName.LastName.UserID
- In your user folder, create a new folder: TN Training
- In the "TN Training" folder, create three subfolders: "Notebooks", "Results", "Output"
- You can copy and paste the class notebooks to the "Notebooks" folder, save summary statistics to the "Results" folder, and save visualizations (in the third notebook) to the "Output" folder.

For example, at the end of this notebook, **we save summary statistics to "U:\\FirstName.LastName.UserID\TN Training\Results\filename.csv"**.
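
As a minimal sketch of how such a path could be assembled and written to from R, the example below uses only base R. The temporary directory is a stand-in for the U drive, and the `summary_stats` table and its values are made up for illustration.

```r
# Stand-in for "U:/FirstName.LastName.UserID/TN Training" (hypothetical location)
base_dir <- file.path(tempdir(), "TN Training")
results_dir <- file.path(base_dir, "Results")

# Create the folder if it does not exist yet
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up summary table, for illustration only
summary_stats <- data.frame(term = c("Fall", "Spring"), num_inds = c(10, 12))

# Save as csv, then read it back to confirm the round trip
out_path <- file.path(results_dir, "filename.csv")
write.csv(summary_stats, out_path, row.names = FALSE)
check <- read.csv(out_path)
```

`file.path()` joins folder names with forward slashes, which R accepts on Windows and which avoids having to escape backslashes in the path.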


## **3. Load the Data**

In this section, we will demonstrate how to use R to read data from a relational database. First, we need to load libraries in R.

#### **R Setup**

We will use several R functions that are not immediately available in base R. Therefore, we need to load them using the built-in function `library()`. For example, running `library(tidyverse)` loads the `tidyverse` suite of packages. It is a collection of packages designed for data science.

> When you run the following code cell, don't worry about the warning message below.

In [None]:
# Database interaction imports
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

__When in doubt, full documentation for a function can be printed with `?<function_name>` or `?<package>::<function_name>`, e.g. `?ggplot2::ggplot` or `?sprintf`.__ Do not worry about memorizing the information in the help documentation - you can always run this command when you are unsure of how to use a function.

> Certain functions exist across multiple packages (e.g. the function `lag` exists in both the `dplyr` and `stats` packages, as also noted in the message yielded from `library(tidyverse)`). When calling a function, you can put the package name first to ensure that you are using the right one. For example, `dplyr::lag` or `stats::lag` calls the `lag` function from `dplyr` or `stats`, respectively. 
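
To make the difference concrete, here is a small sketch on a made-up vector `x`. It loads `dplyr` directly (it is part of the `tidyverse`) and calls both versions of `lag` with their package prefixes.

```r
library(dplyr)

# Made-up values for illustration
x <- c(10, 20, 30)

# dplyr::lag shifts the values down one position and pads the start with NA
lagged_dplyr <- dplyr::lag(x)

# stats::lag is meant for time series: it shifts the time index
# and leaves the values themselves unchanged
lagged_stats <- stats::lag(x)

lagged_dplyr
as.numeric(lagged_stats)
```

The two functions share a name but behave differently, which is why the package prefix is worth writing whenever there is any doubt about which one will be called.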

In [None]:
# See help documentation for head:
# a function we will use frequently to check the content of a table
# It returns the first few rows of a table
?head

#### **Establish a Connection to the Server**

Now, we are ready to connect to the server. We will create the connection using the `DBI` and `odbc` libraries. 

> **Loading R libraries** and **establishing connection** should always be the first step in your Jupyter Notebooks. Make sure you copy these code chunks when you create a new notebook.

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

#### **Formulate Data Query**

Next, we need to dictate what we want to pull in from the database. This part is similar to writing a SQL query in DBeaver. In this example, we will pull in 20 rows of Tennessee community college graduates information, which is stored in the `graduations` table inside the `dbo` schema of the `ds_tn_tbr` database. Before running the code below, test the initial query you will use to bring in your first data frame to make sure it successfully runs in DBeaver:

    SELECT TOP 20 *
    FROM ds_tn_tbr.dbo.graduations ;

We can create the same query as a `character` object in R.

In [None]:
# Create qry character object
# Database name: ds_tn_tbr
# Schema name: dbo
# Table name: graduations
qry <- "
SELECT TOP 20 *
FROM ds_tn_tbr.dbo.graduations;
"

We use `TOP` to read in only the first 20 rows because we're just looking to preview the data and we don't want to eat up memory by reading a huge data frame into R. 

> `TOP` provides one simple way to preview data, but it does not return a random sample. You may see different rows than others who run the same query because, without an `ORDER BY` clause, the database returns whichever rows can be pulled the fastest.
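
Once data are in R, the contrast between "the first rows" and a random sample can be sketched as follows. The `toy` table and its values are made up, and `slice_sample()` from `dplyr` is one possible way to draw a random sample; it is not part of the query above.

```r
library(dplyr)

# Toy table standing in for a database table (values are made up)
toy <- tibble(id = 1:100, value = id * 2)

# Similar in spirit to TOP: the first rows in the order the table is stored
first_rows <- toy %>% head(5)

# A random sample instead; set.seed() makes the draw reproducible
set.seed(123)
random_rows <- toy %>% slice_sample(n = 5)

first_rows
random_rows
```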

#### **Read in the Data** 

Now we can use `con` and `qry` as inputs to `dbGetQuery()` to read the data into R. Compare the results below with the test query you made in DBeaver. To run the code without saving it to a data frame for later reference, you can simply include `dbGetQuery(con,qry)`, as shown below.

In [None]:
# Read in data frame 
dbGetQuery(con,qry)

In [None]:
# See first few rows
dbGetQuery(con, qry) %>%
    head()

> Note: There are other methods you can use to explore the data. Two of these functions are `glimpse` and `names`.
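
A short sketch of these two functions on a toy data frame. The column names mirror a few of those used later in this notebook, but the values are made up.

```r
library(dplyr)

# Toy data frame; values are made up for illustration
toy <- tibble(
    SSN = c("A1", "B2", "C3"),
    YearAward = c(2015, 2015, 2016),
    TermDesc = c("Fall", "Summer", "Spring")
)

# names() returns the column names as a character vector
names(toy)

# glimpse() prints one line per column: its name, type, and first values
glimpse(toy)
```

With a query result, the same calls would be written as `dbGetQuery(con, qry) %>% names()` or `dbGetQuery(con, qry) %>% glimpse()`.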

#### **Checkpoint 1: Explore Columns**

Take a look at the columns in the `graduations` table. Which variables might be useful for your project? Next, try querying another higher education table: explore the `enrollments` table in the `ds_tn_tbr` database.

> Refer to the data dictionary on the class website to get a better understanding of the variables.

In [None]:
# Replace the blanks with the database, schema, and table name
qry <- "
SELECT TOP 20 *
FROM ___.___.__;
"

# Read in data frame
dbGetQuery(con,qry)

# Can write code to explore the data frame
dbGetQuery(con, qry) %>%
    ____()

-----

## **4. Explore the table and understand the data**

Before building a cohort, it is important to understand the quality of the data. Since we will be creating a cohort of graduates, it is important to note that the `graduations` table lists awards, not people. Each row represents a person-degree/credential-major. The data are not always clean. Before creating the cohort it is useful to understand missing values, changes in trends, and inconsistent data. 

Since we hope to create a cohort of students who graduated within a specific time period (the 2015-16 academic year) so that we can track their future employment outcomes as a group, let's take a look at the distribution of the number of graduates by year, or `YearAward` in the data.

Try running the following query in DBeaver to understand the kinds of information you will be bringing into your data frame:

    SELECT YearAward, count(DISTINCT(ssn)) as num_individuals
    FROM ds_tn_tbr.dbo.graduations
    GROUP BY YearAward
    ORDER BY YearAward desc;

Now run the query in R and review the results.

> **NOTE:** Ctrl+C and Ctrl+V can be used to copy and paste text within the ADRF. 

In [None]:
# Exploration query on award year earned
qry <- "
SELECT YearAward, count(DISTINCT(SSN)) as num_individuals
FROM ds_tn_tbr.dbo.graduations
GROUP BY YearAward
ORDER BY YearAward desc;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

    
> **NOTE:** 2009 appears to have far fewer records than the subsequent years. This is not an error; rather, the 2009 calendar year only includes graduates that would be included in the 2009-2010 graduating cohort. 

As you can see, we have quite a large number of potential graduates to pull from. However, the academic year of 2015-2016 includes some graduates from 2015, and others from 2016. To get a better sense of our potential sample size, let's look at the number of individuals that graduated by term in 2015 and 2016 using the following query:

    SELECT TermDesc, TermAward, YearAward, count(DISTINCT(ssn)) as num_individuals
    FROM ds_tn_tbr.dbo.graduations
    WHERE YearAward = 2015 OR YearAward = 2016
    GROUP BY TermDesc, TermAward, YearAward
    ORDER BY YearAward, TermAward ;

In [None]:
# Exploration query on YearAward and TermAward
qry <- "
SELECT TermDesc, TermAward, YearAward, count(DISTINCT(SSN)) as num_individuals
FROM ds_tn_tbr.dbo.graduations
WHERE YearAward = 2015 OR YearAward = 2016
GROUP BY YearAward, TermAward, TermDesc
ORDER BY YearAward, TermAward;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

Now that we have a general understanding of the volume of graduates in each term and calendar year, let's read the data into R and save the resulting data frame as `df`.

In [None]:
# read table into r and assign as df
qry <- "
select *
from ds_tn_tbr.dbo.graduations
where YearAward = 2015 or YearAward = 2016
"
df<-dbGetQuery(con, qry)

# see first few rows of df
head(df)

#### **Checkpoint 2: Explore the Data** 

This checkpoint has two parts. Run the following queries in the cells below to better understand some key variables of interest:

    SELECT YearAward, CIP_Family, count(DISTINCT(ssn)) as num_individuals
    FROM ds_tn_tbr.dbo.graduations
    GROUP BY YearAward, CIP_Family
    ORDER BY YearAward ;

    SELECT YearAward, AwardType, count(DISTINCT(ssn)) as num_individuals
    FROM ds_tn_tbr.dbo.graduations
    GROUP BY YearAward, AwardType
    ORDER BY YearAward ;

After reviewing the results from the queries above, try to answer the following questions:
1. Are there null values in the data you are seeing?
1. Do you see any obvious problems in the data? 
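
If you bring the data into R, one way to check the first question is to count missing values within each group. The sketch below uses a made-up `toy` data frame; it assumes missing values arrive in R as `NA`, which may not hold if the source table encodes them as blank strings.

```r
library(dplyr)

# Toy awards data with made-up values; NA marks a missing CIP family
toy <- tibble(
    SSN = c("A1", "B2", "C3", "D4"),
    YearAward = c(2015, 2015, 2016, 2016),
    CIP_Family = c("52", NA, "51", NA)
)

# Count individuals and missing CIP_Family values within each year
missing_by_year <- toy %>%
    group_by(YearAward) %>%
    summarize(
        num_individuals = n_distinct(SSN),
        num_missing = sum(is.na(CIP_Family))
    )

missing_by_year
```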

In [None]:
# Replace ___ in query to understand CIP_Family/AwardType and year
qry <- "
SELECT YearAward, ____, count(DISTINCT(SSN)) as num_individuals
FROM ds_tn_tbr.dbo.graduations
GROUP BY YearAward, ____
ORDER BY YearAward ;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

In [None]:
# Replace ___ in query to understand CIP_Family/AwardType and year
qry <- "
SELECT YearAward, ____, count(DISTINCT(SSN)) as num_individuals
FROM ds_tn_tbr.dbo.graduations
GROUP BY YearAward, ____
ORDER BY YearAward ;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

-----

## **5. Create the Cohort**

In this section, we will use the Tennessee `graduations` table to create a sample of all students in TBR institutions who earned a degree or certificate during the 2015-16 academic year. This is not as easy as it looks since we still need to further limit the cohort from our original query that yielded `df` because the academic year is not the same as the calendar year.

In addition to establishing a time period, it is common to further narrow your population. Some research questions require you to select only certain graduates. Others focus on degree level (associate's degree recipients, for example) or field of study (business, for example). When establishing your cohort, it is helpful to build an initial query iteratively, checking each restriction before adding others. To recall, our initial query to yield `df` is

    select *
    from ds_tn_tbr.dbo.graduations
    where YearAward = 2015 or YearAward=2016
    
Let's keep track of the number of individuals we currently have in `df`.

In [None]:
# see number of individuals in df
df %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

### Academic Year

As previously mentioned, we have not yet limited `df` to just include graduates from the 2015-16 *academic* year, which includes summer 2015, fall 2015, and spring 2016 graduates. Since the data frame also includes academic certificates (which are not included in regular reporting and are not used for funding purposes), we should exclude those as well.

To isolate these graduates, we can `filter()` `df` on these requirements.
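
Before applying the filter to `df`, a toy sketch may help show the logic. The data below are made up, and the trailing spaces in `TermDesc` are an assumption added to illustrate why trimming whitespace with `trimws()` can matter.

```r
library(dplyr)

# Toy awards; TermDesc is padded with trailing spaces on purpose (made-up values)
toy <- tibble(
    SSN = c("A1", "B2", "C3", "D4", "E5"),
    YearAward = c(2015, 2015, 2015, 2016, 2016),
    TermDesc = c("Spring ", "Summer", "Fall  ", "Spring", "Fall")
)

# Keep summer 2015, fall 2015, and spring 2016; trimws() removes the padding
# so that "Fall  " still matches "Fall"
toy_ay <- toy %>%
    filter(
        (YearAward == 2015 & trimws(TermDesc) %in% c("Summer", "Fall")) |
        (YearAward == 2016 & trimws(TermDesc) == "Spring")
    )

toy_ay$SSN
```

Spring 2015 (A1) and fall 2016 (E5) fall outside the 2015-16 academic year, so only B2, C3, and D4 remain.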

In [None]:
# isolate 2015-16 academic year graduates
df <- df %>%
    filter(
        (YearAward == 2015 & (trimws(TermDesc) == 'Fall' | trimws(TermDesc) == 'Summer')) |
        (YearAward == 2016 & trimws(TermDesc) == 'Spring')
    )

# see number of individuals in df
df %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

In [None]:
# omit academic certificates
df <- df %>%
    filter(
        (NonAcadem == 1)
    )

# see number of individuals in df
df %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

Let's see our breakdown of graduates by term and year to confirm we properly filtered `df`.

In [None]:
# see count of grads by term and year of award
df %>%
    group_by(YearAward, TermDesc) %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

### Associate's degree earners

Now that we have isolated all graduates in the 2015-16 academic year, let's turn our attention to associate's degree earners. According to the data documentation, an associate's degree is assigned a `DegreeLevel_c` value of `2`. Before further subsetting `df`, let's take a look at `DegreeLevel_c`.

In [None]:
# count number of graduates by degree level
df %>%
    group_by(DegreeLevel_c, DegreeLevel) %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

Let's go ahead and `filter` for associate's degree recipients.

In [None]:
# filter for associate's degree recipients
df <- df %>%
    filter(DegreeLevel_c == 2) 

#### **Checkpoint 3: Create Your Sample**
Starting with the `graduations` table, create a sample of graduates of a separate academic year and award level. Name the data frame `df_checkpoint`.

In [None]:
# Replace ____ 
qry <- "
select *
from ds_tn_tbr.dbo.graduations
where YearAward in (___, ___)
"

# Read in data frame and save it as df_checkpoint
df_checkpoint <- dbGetQuery(con,qry)

df_checkpoint <- df_checkpoint %>%
    filter(___)

## **6. Link data across tables**

Now that we have identified a cohort, it is important to link that cohort to other tables to gain further insights. In this example, we will link our data frame of associate's degree recipients in the 2015-16 academic year to the `enrollments` data to gather some of the demographic information captured from the graduated students. First, we will write a query to read some of the demographic information into R so that we can understand how CIP codes and certain demographics may or may not be related.  We don't need all the data in the `enrollments` table because we only need a few characteristics that should not vary by term. As a result, the query below only includes specific columns in the data frame, and enrollment records are limited to those currently in `df`.

> Note: The `WHERE` clause inside the `SSN IN (...)` subquery is ONE translation into SQL of the R code used up to this point to define `df`.

In [None]:
# Query to bring in specific columns in the enrollments table
qry <- "
SELECT DISTINCT SSN, Gender, BirthYear
FROM ds_tn_tbr.dbo.enrollments
WHERE SSN IN (
    SELECT SSN
    FROM tr_tn_2021.dbo.grads1516
    WHERE (
        (YearAward = 2015 and TermDesc in ('Fall', 'Summer')) OR (YearAward = 2016 and TermDesc = 'Spring')) AND 
        NonAcadem = 1 AND
        DegreeLevel_c = 2
    );
"

# Read in data frame and save it as df_demographics
df_demographics <- dbGetQuery(con,qry)

# see df_demographics
glimpse(df_demographics)

While these demographics should remain consistent throughout the `enrollments` data for each student, there may be errors or revisions in the data that could ultimately cause multiple records for a student in `df_demographics`. For example, if a male student's gender was reported in one semester but wasn't reported in a subsequent term, there will be a record indicating a gender of male and another with the gender missing. Likewise, a student's birth year may have been reported as 1997 in one term and 1998 in a different term. We can check for this sort of duplication in R, and additional information regarding identifying and managing duplicates will be covered later in this notebook.

In [None]:
# Create a data frame containing the records that have duplicate SSN
df_demographic_dup <- df_demographics %>% 
    count(SSN) %>% 
    filter(n>1)

nrow(df_demographic_dup)

As we can see, there are several instances where data that is assumed to be static has been reported differently for different semesters. In other words, duplication may result from an individual who has contrasting demographic information across the enrollment records. So, which record do we choose and how do we implement this choice in the code? You could adjust for this in a number of ways: select the most recent demographic record, select the record that most closely corresponds to the time period of interest (such as the term of graduation), etc. In most cases, you will need to adjust your original SQL query to overcome the duplication issue. 

We can use a subquery to find the most recent term for which the student enrolled:

```
SELECT SSN, MAX(TermSeq) as MaxTerm
FROM ds_tn_tbr.dbo.enrollments
GROUP BY SSN
```

With the most recent term identified, we can use an inner join to only extract the records where the `TermSeq` matches the `MAX(TermSeq)`.
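
The same two-step logic can be sketched in R on a toy data frame. This is only an illustration with made-up values; the notebook itself performs this step in SQL.

```r
library(dplyr)

# Toy enrollment records with made-up values;
# student A1 has conflicting birth years across terms
toy_enroll <- tibble(
    SSN = c("A1", "A1", "B2"),
    TermSeq = c(3, 7, 5),
    BirthYear = c("1997", "1998", "1990")
)

# Step 1 (the subquery): the latest TermSeq for each student
max_term <- toy_enroll %>%
    group_by(SSN) %>%
    summarize(MaxTerm = max(TermSeq))

# Step 2 (the inner join): keep only rows from each student's latest term
latest <- toy_enroll %>%
    inner_join(max_term, by = c("SSN" = "SSN", "TermSeq" = "MaxTerm")) %>%
    distinct(SSN, BirthYear)

latest
```

Student A1 keeps only the record from term 7, so the conflicting 1997 birth year drops out.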

In [None]:
# Modified query to get most recent demographic record and prevent most common forms of duplication
qry <- "
SELECT DISTINCT b.SSN, b.Gender, b.BirthYear
FROM (
    SELECT SSN, MAX(TermSeq) as MaxTerm
    FROM ds_tn_tbr.dbo.enrollments
    GROUP BY SSN
) a
INNER JOIN ds_tn_tbr.dbo.enrollments b
ON b.SSN = a.SSN AND b.TermSeq = a.MaxTerm
WHERE a.SSN IN (
    SELECT SSN
    FROM tr_tn_2021.dbo.grads1516
    WHERE (
        (YearAward = 2015 and TermDesc in ('Fall', 'Summer')) OR (YearAward = 2016 and TermDesc = 'Spring')) AND 
        NonAcadem = 1 AND
        DegreeLevel_c = 2
    )
;
"

# Read in data frame and save it as df_demographics
df_demographics <- dbGetQuery(con,qry)

# see df_demographics
glimpse(df_demographics)

There still happens to be one record with inconsistent `BirthYear` values in `df_demographics`. In practice, there are many ways to approach these scenarios. Here, we will select the observation that corresponds to the institution of graduation.

In [None]:
# observation to keep
obs_keep <- df_demographics %>% 
    filter(
        SSN == '(REDACTED)',
        BirthYear == '1991'
    )

# keep only one observation for this individual:
# first filter out all of this individual's observations, then re-add obs_keep
df_demographics<-df_demographics %>%
    filter(SSN != '(REDACTED)') %>%
    rbind(obs_keep)

# make sure same number of rows as number of individuals in df_demographics
# will equal TRUE if so
df_demographics %>%
    summarize(
        TEST = n() == n_distinct(SSN)
    )

Next, we can merge the columns in `df_demographics` for all records in `df` as long as they have the same SSN, as designated by the `SSN` column across the two data frames.  The joining statement in R is as follows:

    df <- df %>% 
        left_join(df_demographics, by = "SSN")
    
A left join is used because we would like to retain all records in the left data frame (`df`) and are only bringing in matched records from the right data frame (`df_demographics`). Conversely, a right join starts with all the records from the right table and only brings in matched records from the left table, and an inner join only includes records for which there is a match between both tables. Notice that if the same column name is used to match the data frames together, you only need to specify the one name for both tables after "by". If the column names are different, you need to declare the column names for the two data frames.
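
The join types described above can be compared on two toy data frames. The names `cohort`, `demo`, and `student_ssn` are hypothetical and the values are made up.

```r
library(dplyr)

# Toy data frames with made-up values
cohort <- tibble(SSN = c("A1", "B2", "C3"), CIP_Family = c("52", "51", "11"))
demo <- tibble(SSN = c("A1", "B2", "D4"), Gender = c("F", "M", "F"))

# left join: every cohort row is kept; C3 has no match, so Gender is NA
left <- cohort %>% left_join(demo, by = "SSN")

# inner join: only SSNs present in both data frames (A1 and B2)
inner <- cohort %>% inner_join(demo, by = "SSN")

# right join: every demo row is kept; D4 has no match, so CIP_Family is NA
right <- cohort %>% right_join(demo, by = "SSN")

# Different key names: declare both, left name = right name
demo_alt <- demo %>% rename(student_ssn = SSN)
left_alt <- cohort %>% left_join(demo_alt, by = c("SSN" = "student_ssn"))
```

Comparing `nrow()` before and after a join is a quick check that the join did not add or drop rows unexpectedly.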

> **NOTE:** In this example, we are going to use R to join data frames. As you will see in the next notebook, with larger tables, it is inefficient or at times not possible to bring extremely large tables into R.  As a result, the joins have to be done in SQL prior to bringing the data frame into R.

In [None]:
#Left join cohort to demographic data
df <- df %>% 
    left_join(df_demographics, by = "SSN")

# See top records in the dataframe
df %>% 
    head()

Now that the data frame is finalized, you can manipulate it further based on age.  If, for example, your working group was only interested in adult graduates, the data frame can be filtered and saved with those records.

> **NOTE:** `BirthYear` is currently a character-type variable in the data frame and will need to be cast as an integer to perform numerical operations. `YearAward` - `BirthYear` will be used to approximate age at graduation.
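
As a small sketch of this calculation, the toy example below stores the approximate age in a new column with `mutate` before filtering. The data are made up, `approx_age` is a hypothetical column name, and `as.integer()` is used here as an alternative to `strtoi()`.

```r
library(dplyr)

# Toy cohort with made-up values; BirthYear is stored as character
toy <- tibble(
    SSN = c("A1", "B2", "C3"),
    YearAward = c(2015, 2016, 2016),
    BirthYear = c("1985", "1996", "1991")
)

# Cast BirthYear to integer, then approximate age at graduation
toy_age <- toy %>%
    mutate(approx_age = YearAward - as.integer(BirthYear))

# Keep graduates aged 25 or older
toy_adults <- toy_age %>%
    filter(approx_age >= 25)

toy_adults
```

Storing the age in its own column makes it available for later grouping or counting without repeating the conversion.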

In [None]:
# Subset the dataframe to specific age group
df_adultgrads <- df %>% 
    filter((YearAward - strtoi(BirthYear)) >= 25)

Then you can see the distribution of adult graduates by approximate age.

In [None]:
# see approximate age
df_adultgrads %>%
    count(YearAward - strtoi(BirthYear))

#### Checkpoint 4: Add columns

Using the cohort that you created in the previous checkpoint, try to join data from the `enrollments` table and further subset your data frame to non-adult graduates. Save the resulting data frame as `df_checkpoint_age`. 

In [None]:
# Modified query to get most recent demographic record and prevent most common forms of duplication
# Replace ___ with your cohort conditions, making sure the parentheses balance
qry <- "
SELECT DISTINCT b.SSN, b.Gender, b.BirthYear
FROM (
    SELECT SSN, MAX(TermSeq) as MaxTerm
    FROM ds_tn_tbr.dbo.enrollments
    GROUP BY SSN
) a
INNER JOIN ds_tn_tbr.dbo.enrollments b
ON b.SSN = a.SSN AND b.TermSeq = a.MaxTerm
WHERE a.SSN IN (
    SELECT SSN
    FROM ds_tn_tbr.dbo.graduations
    WHERE (
        ___
    )
;
"

# Read in data frame and save it as df_demographics_checkpoint
df_demographics_checkpoint <- dbGetQuery(con,qry)

# see df_demographics_checkpoint
glimpse(df_demographics_checkpoint)

# ensure no duplicates after taking max semester
df_demographics_checkpoint %>%
    summarize(
        TEST = ___ == ___
    )

In [None]:
# Add in demographic info
df_checkpoint <- df_checkpoint %>%
    left_join(df_demographics_checkpoint, by = "SSN")

# filter for non-adult graduates
df_checkpoint_age <- df_checkpoint %>%
    filter(___)

# See top records in the data frame
head(df_checkpoint_age)

-----

## **7. Higher Education Graduate Cohort Count and Descriptive Statistics**

Next, we will run some statistics to understand how the data are structured in the cohort data frame. Recall that you have already created a data frame, `df`. Here, we will use the same data frame to further examine the data elements. Recall that each record in our data frame does not represent a person. Each record represents an award or credential (degree or certificate). You can see this by comparing the number of rows, or awards, with the number of graduates.

In [None]:
# compare number of rows to grads
df %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(SSN)
    ) 

The difference between awards and graduates is due to a subset of students earning multiple degrees. If we had not restricted `df` to associate's degree earners, students could also appear more than once because they earned a certificate and a degree, or multiple short-cycle certificates, in an academic year. When framing your higher education cohort, you need to ask if you want to de-duplicate the file (by selecting the highest award, for example) or if you want to focus on one level of degree. Workforce outcomes may be very different for certificate holders without a degree compared to those with associate's degrees. Those with bachelor's or graduate degrees are expected to have higher incomes. Three decisions are required to accurately form the cohort:

1. What time period are you using to define the cohort (one academic year or multiple academic years)?
2. What degree level or levels will you focus on?
3. What will you do with duplicate records?

In this example, we will review the duplicates to help inform decisions.
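As an illustration of the "highest award" option mentioned above, here is a minimal sketch on made-up data. The column `AwardLevel` and the ranking in `level_rank` are hypothetical and are not taken from the TBR tables; your own data will need its own ranking.

```r
library(dplyr)

# Toy award-level data (made up); AwardLevel is a hypothetical column
toy_awards <- tibble(
    SSN = c("001", "001", "002", "003", "003"),
    AwardLevel = c("Certificate", "Associate", "Associate", "Certificate", "Certificate")
)

# Hypothetical ranking: a larger number means a higher award
level_rank <- c("Certificate" = 1, "Associate" = 2)

# Keep one row per person, preferring the highest-ranked award
highest_award <- toy_awards %>%
    mutate(rank = unname(level_rank[AwardLevel])) %>%
    arrange(SSN, desc(rank)) %>%
    distinct(SSN, .keep_all = TRUE)

highest_award
```

The same sort-then-`distinct` pattern works for any other de-duplication rule; only the sort key changes.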

### Duplicates code

The series of commands below helps identify duplicates, create a data frame of duplicates, and list the results. Before we de-duplicate the files, let's save an award-level (not person-level) file for potential award-based analyses. We will name this data frame `df_awards`.

In [None]:
# copy df as df_awards
df_awards <- df

Next we can start to explore the duplicates. First, we will identify a case of duplication, which we can isolate by counting the number of occurrences of each `SSN` in `df`, and then finding the `SSN` with the highest number of occurrences.

In [None]:
# find duplicate example
dup_ex <- df %>%
    count(SSN) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

From here, we can find all rows in `df` with the `SSN` in `dup_ex` so we can further explore a duplicated example. We will select certain variables to highlight the duplication.

In [None]:
# see all duplicated rows in example
df %>%
    filter(SSN == dup_ex$SSN) %>%
    select(SSN, YearAward, TermAward, CIP_6, CIP_Family)

In some cases, the CIP family is the same across all of a graduate's duplicate records, and the double or triple awards are nearly equivalent (for example, Accounting and Finance, or Business and Marketing). Other cases are true duplicates, based on the columns we selected. We will assume that most of a graduate's degrees are in similar CIP families, so we will de-duplicate by keeping the most recent record (or one of the most recent if there are multiple). To do so, we will first sort `df` so that, for each `SSN`, the first row is one of the most recent degrees. From there, we can use `distinct` to isolate the first row within each `SSN`.

However, we cannot simply sort by `TermAward` within each `YearAward` as it is currently encoded, since '4' corresponds to the Summer term, '1' to the Fall term and '3' to the Spring term. Instead, we can leverage the `TermSeq` variable, which tracks the term relative to the first term in the `enrollments` and `completions` tables, summer 2009.

In [None]:
# de-duplicate cohort: keep the most recent award per SSN
df <- df %>%
    arrange(SSN, desc(TermSeq)) %>%
    distinct(SSN, .keep_all = TRUE)

# compare number of rows to grads
df %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(SSN)
    ) 
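To see why the sort key matters, here is a small sketch on made-up data. The `TermSeq` values are invented for illustration, and the example assumes that within one academic year the Summer term comes first, followed by Fall and then Spring.

```r
library(dplyr)

# Toy data: one graduate with three awards in the same academic year.
# TermAward codes follow the text ('4' Summer, '1' Fall, '3' Spring);
# TermSeq values are made up and only need to increase over time.
toy_terms <- tibble(
    SSN = rep("001", 3),
    Term = c("Summer", "Fall", "Spring"),
    TermAward = c("4", "1", "3"),
    TermSeq = c(19, 20, 21)
)

# Sorting on the raw term code keeps the Summer award, the earliest one
by_term_award <- toy_terms %>%
    arrange(SSN, desc(TermAward)) %>%
    distinct(SSN, .keep_all = TRUE)

# Sorting on TermSeq keeps the Spring award, the most recent one
by_term_seq <- toy_terms %>%
    arrange(SSN, desc(TermSeq)) %>%
    distinct(SSN, .keep_all = TRUE)

by_term_award$Term
by_term_seq$Term
```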

Though the sample cohort table for 2015-16 associate's degree earners was created for you in the database, the following code demonstrates how you can create a database table from an R data frame. This can be especially helpful if you want to limit future queries to only include students that are members of the cohort (ultimately improving the efficiency of the query). This code creates the table `grads1516` in the Tennessee training workspace `tr_tn_2021` from the data frame `df`.
```
qry <- " use tr_tn_2021;"
DBI::dbExecute(con, qry)


DBI::dbWriteTable(
    conn = con,
    name = DBI::SQL("dbo.grads1516"), 
    value = df
)
```
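As a rough sketch of how a cohort table can limit later queries, the example below joins a made-up table to a made-up cohort table. It assumes the `RSQLite` package is installed and uses an in-memory SQLite database as a stand-in for the real server connection; the table names `grads_demo` and `wages_demo` and their contents are hypothetical.

```r
library(DBI)

# In-memory SQLite database as a stand-in for the real server (assumption)
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up cohort and a made-up table we want to restrict to the cohort
cohort <- data.frame(SSN = c("001", "002"), stringsAsFactors = FALSE)
wages <- data.frame(
    SSN = c("001", "002", "003"),
    wage = c(1000, 2000, 3000),
    stringsAsFactors = FALSE
)

dbWriteTable(con_demo, "grads_demo", cohort)
dbWriteTable(con_demo, "wages_demo", wages)

# The inner join returns only rows for cohort members
cohort_wages <- dbGetQuery(con_demo, "
    SELECT w.SSN, w.wage
    FROM wages_demo w
    INNER JOIN grads_demo g ON w.SSN = g.SSN
    ORDER BY w.SSN
")

dbDisconnect(con_demo)

cohort_wages
```

Because the join happens in the database, only cohort rows are transferred into R.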

#### **Checkpoint 5: Explore duplicates and remove for your cohort**

For your data frame `df_checkpoint`, explore and come up with a strategy for removing all duplicates.

In [None]:
# replace ___ with code
dup_ex <- ___ %>%
    count(SSN) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

In [None]:
# see all duplicated rows in example
___ %>%
    filter(SSN == dup_ex$___) %>%
    select(___)

In [None]:
# de-duplicate cohort
___ <- ___ %>%
    arrange(SSN, ___) %>%
    distinct(SSN, .keep_all = TRUE)

# compare number of rows to grads
___ %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(SSN)
    ) 

-----

## **8. Exploratory Analysis of the Cohort**

In this section, we will find out more about our 2015-16 graduating cohort. We will begin by isolating the top 5 majors, grouping the 6-digit CIP codes by CIP family.

From there, we will look to see if there are differences by gender. Understanding these patterns is an important part of understanding potential disparities in employment outcomes.


Up to this point, we have identified our cohort (`df`) and removed all duplicates so that it is a person-level file. Let's start by looking at the difference between `CIP_6` and `CIP_Family` to decide on the granularity we want in our major groups.

### Major Groupings



In [None]:
# compare number of distinct CIP families to number of distinct 6-digit CIP codes
df %>%
    summarize(
        num_cip_fam = n_distinct(CIP_Family),
        num_cip_6 = n_distinct(CIP_6)
    )

For the sake of this analysis, we will use `CIP_Family`, as we continue our analysis by major. Let's find the 5 most common majors in the cohort.

In [None]:
# 5 most common majors
df %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    head(5)

Does this list surprise you? For perspective, we will add in another column tracking the proportion of graduates by major.

In [None]:
# 5 most common majors with proportion
df %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5)

Because we hope to build off of this cursory subgroup analysis in later notebooks, let's save the resulting data frame to `df_common_major`.

In [None]:
# 5 most common majors with proportion
df_common_major <- df %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5)
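One detail in the code above is worth noting: `prop` is computed before `head(5)`, so the denominator is all graduates rather than only those in the top majors. The sketch below shows the difference on made-up data, with a top 2 instead of a top 5.

```r
library(dplyr)

# Toy cohort: 5 graduates in family A, 3 in B, 2 in C (made up)
toy <- tibble(CIP_Family = c(rep("A", 5), rep("B", 3), rep("C", 2)))

# Proportion of ALL graduates (mutate before head)
prop_all <- toy %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    mutate(prop = n / sum(n)) %>%
    head(2)

# Proportion among the top 2 majors only (mutate after head)
prop_top <- toy %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    head(2) %>%
    mutate(prop = n / sum(n))

prop_all
prop_top
```

In `prop_all` the proportions need not sum to 1, because graduates outside the listed majors still count toward the denominator.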

### Gender

Additionally, we can look at the gender breakdown within the cohort using the `Gender` variable.

In [None]:
# gender breakdown
df_gender <- df %>%
    count(Gender) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    )

# see df_gender
df_gender

Does anything stand out about the proportion of graduates by gender? Is this something you would expect to see?

### Top Majors by Gender

Let's cross the major breakdown with gender: do the most common majors differ among gender groups? Since we are looking at proportions and counts within multiple combinations of subgroups (`CIP_Family` and `Gender`), we need to adjust the code from above a bit. First, we need to calculate the proportion of observations within each `Gender`, hence the `group_by`. Second, we replace `head` with `slice` to retrieve the top 5 majors within each `Gender` value, instead of only the first 5 rows of the data frame as ordered by `Gender`.

In [None]:
# major/gender breakdown
df_major_gender <- df %>%
    count(CIP_Family, Gender) %>%
    arrange(desc(n)) %>%
    group_by(Gender) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    arrange(Gender, desc(n)) %>%
    slice(1:5)

df_major_gender

The breakdown above shows that females and males alike choose Liberal Arts & Science as their top major, accounting for about REDACTED and REDACTED of all graduates, respectively. Females are more likely than males to choose Health Professions and Related Services (REDACTED compared to REDACTED). Engineering is the second choice for males, while it does not make the top 5 for females.
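To make the difference between `head` and a grouped `slice` concrete, here is a minimal sketch on made-up counts, taking a top 2 instead of a top 5. The values in `toy_counts` are invented for illustration.

```r
library(dplyr)

# Toy counts by gender and major (made up)
toy_counts <- tibble(
    Gender = c("F", "F", "F", "M", "M", "M"),
    CIP_Family = c("A", "B", "C", "A", "B", "C"),
    n = c(50, 30, 20, 10, 25, 15)
)

# head() ignores groups: the top 2 rows overall are both from one gender
top_head <- toy_counts %>%
    arrange(desc(n)) %>%
    head(2)

# slice() on a grouped data frame returns the top 2 rows within each gender
top_by_gender <- toy_counts %>%
    group_by(Gender) %>%
    mutate(prop = n / sum(n)) %>%
    arrange(Gender, desc(n)) %>%
    slice(1:2) %>%
    ungroup()

top_head
top_by_gender
```

The final `ungroup()` is optional here; it simply returns a plain data frame so that later steps are not applied group by group.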

#### **Checkpoint 6: Common Majors and Gender**

Using your own data frame, `df_checkpoint`, identify the 5 most common majors overall and by gender. Save these results to `df_checkpoint_common_major` and `df_checkpoint_major_gender`, respectively.

Do your results vary drastically from those derived from `df`?

In [None]:
# find common major
df_checkpoint_common_major <- df_checkpoint %>%
    ___

df_checkpoint_common_major

In [None]:
# find most common majors by gender
df_checkpoint_major_gender <- df_checkpoint %>%
    ___

df_checkpoint_major_gender

## **9. Export Results to .csv Files**

Now you have successfully finished defining a cohort and a quick subgroup analysis! The last step is to save your results in .csv files so that we can re-use these results in future notebooks. 

<font color=red> Note that you need to change the directory in the write_csv() statements below. Replace ".." with your username.</font>

In [None]:
# Save dataframes to CSV to use in later notebook

# most common majors
write_csv(df_common_major, "U:\\..\\TN Training\\Results\\common_major.csv")

# gender breakdown
write_csv(df_gender, "U:\\..\\TN Training\\Results\\common_gender.csv")

# most common majors by gender
write_csv(df_major_gender, "U:\\..\\TN Training\\Results\\common_major_gender.csv")
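If you want to confirm that a file was written as expected, one option is a quick round trip: write the data frame, read it back, and compare. The sketch below uses a temporary file and a made-up data frame rather than the project directory, so the path and the values are purely illustrative.

```r
library(readr)
library(dplyr)

# Made-up summary table standing in for df_common_major
toy_summary <- tibble(
    CIP_Family = c("A", "B"),
    n = c(5, 3),
    prop = c(0.625, 0.375)
)

# Write to a temporary file (hypothetical path), then read it back
tmp_path <- tempfile(fileext = ".csv")
write_csv(toy_summary, tmp_path)
readback <- suppressMessages(read_csv(tmp_path))

# Same columns and same values after the round trip
identical(names(readback), names(toy_summary))
all.equal(readback$prop, toy_summary$prop)
```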

# References

Simone, Sean, Barrett, Nathan, & Feder, Benjamin. (2022, March 25). Data Exploration for Cohort Analysis using New Jersey Education to Earnings Data System Tables. Zenodo. https://doi.org/10.5281/zenodo.6385510