<center> <img style="float: center;" src="images/CI_horizontal.png" width="450"> </center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    Greg Cumpton, Benjamin Feder, Nathan Barrett, Rukhshan Mian </center>
    <a href="https://doi.org/10.5281/zenodo.6412617"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6412617.svg" alt="DOI"></a>


# <center> **Data Exploration: Texas Colleges and Universities** </center>

## **1. Introduction**
This notebook introduces you to the concept of creating a group, or "cohort", that will be used for future analysis. We will construct measures to understand who we are including and excluding (coverage) from the cohort and walk you through the decisions that need to be made when devising the cohort using filters such as award level, gender, major, and others. Cohorts define the primary population of interest in much research; once created, cohorts may then be used to link to other data sources.

Most of the Tri-Agency research questions (and their *data sources*) rely on the construction of cohorts, some examples of which could include:

   + Title 1 program completers (*PIRL*) 
   + Higher education completers (*THECB graduations file*) 
   + High school graduates (*TEA graduation data*) 

We begin by introducing you to data analytics tools to access the data, including connecting R to the server and using SQL queries to pull the data. We will then leverage these tools to explore the graduations file from the Texas Higher Education Coordinating Board (THECB). To train you in creating a cohort, we will create a dataset (called a "data frame") and investigate trends in Texas college graduates. At the end of this notebook, we will save the summary statistics to csv files so that we can use them in subsequent notebooks.

## **2. Learning Objectives**

You will implement techniques using SQL and R to explore and better understand the data that are available to you, and to address the feasibility of your team's potential research question. This will form the basis for all future analyses you will do in this training class and is a crucial first step for any data analysis workflow. As you work through the notebook, we will have checkpoints for you to practice writing code by making small adjustments, but you are also encouraged to think about how you might apply any of the techniques and code presented to other datasets to address your assigned research question. 

You can access hints and solutions to the checkpoints by running the code cell below.

In [None]:
# Import the file with hints and solutions
source("nb1_hints_and_solutions.txt")

The guiding research questions we will use for this series of notebooks are quite general: 

>**What are the employment outcomes of the 2015 college graduates? How do these outcomes vary by cohort characteristics?**

This will allow the code we use to have the most versatility. We will analyze these questions through a variety of different lenses and will start in this notebook by defining a specific cohort of Texas degree recipients in the 2015 calendar year. We will then track their earnings and employment outcomes over time in the following notebook. The exploration of the supply side of the labor market will be later supplemented by an analysis of the demand side to enhance our understanding of the overall labor market.

>**The key benefit to working with Texas administrative records in the ADRF is the ability to integrate data across sources, including K-12, higher education, training, and employment data.**

At the end of this notebook, you should test your skills by performing the following tasks:

+ (1) Construct one or more cohorts related to your specific research question 
+ (2) Examine the demographic characteristics of your cohort

We are going to show just a portion of what you might be interested in investigating to answer these overarching questions, so don't feel restricted by the questions we've decided to answer.

#### **Notebook Questions and Goals** 
In this notebook, we focus on seeking answers to the following questions: 
- How many students graduated from Texas colleges during the 2015 calendar year?
- What filters can be used to define the cohort (e.g., demographics, institutions, enrollment type, college type, etc.)?
- How many students graduated from Texas colleges by subgroup (e.g. demographics, institutions, enrollment type, college type, etc.)?

After completing this notebook you should be able to perform the following analytical tasks:
- load R libraries and establish a connection to the server
- create a cohort sample by using the THECB graduates file
- calculate descriptive statistics to understand who is in the population
- create new tables from the larger tables in a database (sometimes called the "analytical frame")
- explore different variables of interest
- clean the data
- create aggregate metrics

The specific techniques include, but are not limited to:

**SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on a condition
 - `GROUP BY`: aggregate data over the variables of interest
 - `ORDER BY`: sort data based on the variables of interest
 - `DISTINCT`: look at distinct values of a variable

**R code**:
 - `group_by` and `summarize` to find group-based measures
 - `mutate` to create new variables
 - `arrange` and `desc` to sort values
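
As a rough illustration of how the SQL keywords and R functions above line up, the sketch below applies the R functions to a small made-up data frame. The data frame `toy_grads` and its values are hypothetical; only the column names `gradid` and `gradyear` come from the `graduations` table.

```r
library(dplyr)

# Hypothetical award-level records: one row per person-award
toy_grads <- tibble(
    gradid   = c(1, 1, 2, 3, 4, 4, 5),
    gradyear = c(2015, 2015, 2015, 2016, 2016, 2016, 2016)
)

# Roughly equivalent to:
#   SELECT gradyear, COUNT(DISTINCT gradid) AS num_individuals
#   FROM toy_grads WHERE gradyear >= 2015
#   GROUP BY gradyear ORDER BY gradyear DESC
toy_summary <- toy_grads %>%
    filter(gradyear >= 2015) %>%                          # WHERE
    group_by(gradyear) %>%                                # GROUP BY
    summarize(num_individuals = n_distinct(gradid)) %>%   # COUNT(DISTINCT ...)
    arrange(desc(gradyear))                               # ORDER BY ... DESC

toy_summary
```

The same logic can live in either language; which one you choose often depends on whether the table is small enough to read into R first.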

#### **Datasets** ####
We will explore and understand the Texas Higher Education Coordinating Board (THECB) tables in this notebook:
- **College Graduates**: The graduations table is provided by the THECB. The data include graduations at all Texas colleges and universities and cover the period from January 2011 through December 2020.
- **College Enrollments**: The enrollments table is also provided by the THECB and contains enrollment data for all Texas colleges and universities from fall 2010 through winter 2020.

#### **Directory Structure**

We will frequently read and write csv files to load crosswalks and to save results in all the notebooks. Let's first create a few folders in your U drive so it is easier for you to organize all the files. 

- Open Windows File Explorer
- On the left hand side, find U drive (U:) and click into it
- On the right hand side, open your user folder: FirstName.LastName.UserID *Your name may be truncated*
- In your user folder, create a new folder: TX Training
- In the "TX Training" folder, create three subfolders: "Notebooks", "Results", "Output"
- You can copy and paste the class notebooks to the "Notebooks" folder, save summary statistics to the "Results" folder, and save visualizations (in the third notebook) to the "Output" folder.

At the end of this notebook, **we save summary statistics with the path "U:\\FirstName.LastName.UserID\TX Training\Results\filename.csv". Note that Firstname.Lastname.UserID may be referred to as your username.**

To make your life easier, please insert your ADRF username in place of the `___` inside the quotation marks in the following cell.

In [None]:
# insert ADRF username Firstname.Lastname.UserID
username <- "___"
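
As a minimal sketch of how the username feeds into the results path described above, the example below assembles the path with base R's `file.path()`. The name `username_example` and its value are hypothetical placeholders, and `filename.csv` stands in for whatever file you save.

```r
# Hypothetical username, for illustration only
username_example <- "Jane.Doe.123"

# file.path() joins the pieces with forward slashes,
# which R also accepts on Windows
results_path <- file.path(
    "U:", username_example, "TX Training", "Results", "filename.csv"
)

results_path
```

Building the path from a variable means you only need to type your username once per notebook.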

## **3. Load the Data**

In this section, we will demonstrate how to use R to read data from a relational database. First, we need to load libraries in R that provide certain functionalities we will leverage in this notebook.

#### **R Setup**

We will use several R functions that are not immediately available in base R. Therefore, we need to load them using the built-in function `library()`. For example, running `library(tidyverse)` loads the `tidyverse` suite of packages. It is a collection of packages designed for data science.

> When you run the following code cell, you will see an output cell in red. Though in future coding this red colored cell may be cause for concern and reflect a need to adjust your code, don't worry about this particular warning message. If you are unsure of the meaning of a particular warning message, you can always paste that message into a search engine to see if anyone has run into a similar problem in the past.

In [None]:
# Database interaction imports
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

__When in doubt, full documentation for a function can be printed with `?function_name` or `?package::function_name`, e.g. `?ggplot2::ggplot` or `?sprintf`.__ Do not worry about memorizing the information in the help documentation - you can always run this command when you are unsure of how to use a function.

> Certain functions exist across multiple packages (e.g. the function `lag` exists in both the `dplyr` and `stats` packages, as also noted in the message yielded from `library(tidyverse)`). When calling a function, you can put the package name first to ensure that you are using the right one. For example, `dplyr::lag` or `stats::lag` calls the `lag` function from `dplyr` or `stats`, respectively. 
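
To see why the package prefix matters, here is a small sketch comparing the two `lag` functions on the same made-up vector. The vector `x` is hypothetical.

```r
library(dplyr)

x <- c(10, 20, 30)

# dplyr::lag shifts the values by one position and pads with NA
shifted <- dplyr::lag(x)

# stats::lag is built for time series: it shifts the time index
# and leaves the values themselves unchanged
ts_lagged <- stats::lag(x)

shifted
as.numeric(ts_lagged)
```

Two functions with the same name can behave very differently, so the `package::function` form is a cheap safeguard.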

In [None]:
# See help documentation for head:
# a function we will use frequently to check the contents of a data frame
# It returns the first few rows
?head

#### **Establish a Connection to the Server**

Now that we have loaded the necessary libraries, we are ready to connect to the server. We will create the connection using the `DBI` and `odbc` libraries. 

> **Loading R libraries** and **establishing a connection** should always be the first steps in your Jupyter Notebooks. **Make sure you copy these code chunks when you create a new notebook.** 

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

<font color = "purple"> <h3> Practice Creating a Jupyter Notebook [Optional] </h3> </font>

Choose New from the File menu on the upper left of your screen and select Notebook. Select R as the Kernel. Copy and paste the code cells in the **R Setup** and the **Establish a Connection to the Server** sections from this notebook to your new Jupyter notebook. You may choose to rename your notebook by right-clicking on your new file in the File Browser on the left-hand side bar and selecting rename. Save your new notebook, as it may serve as a future programming space for your project.

> **NOTE:** Ctrl+C and Ctrl+V can be used to copy and paste text within the ADRF. However, when copying specific cells within JupyterLab, the shortcuts are just C and V, respectively.

#### **Formulate Data Query**

Next, we need to dictate what we want to pull in from the server. This part is similar to writing a SQL query in DBeaver. In this example, we will pull in 5 rows of Texas college graduates information, which is stored in the `graduations` table inside the `ds_tx_thecb` database. Before running the code below, test in DBeaver the initial query you will use to bring in your first data frame to make sure it runs successfully:

    SELECT TOP 5 *
    FROM ds_tx_thecb.dbo.graduations ;

We can create the same query as a character object in R.

> The `dbo` schema contains the tables we will be using throughout this training program, even across the different databases.

In [None]:
# Create qry character object
# Database name: ds_tx_thecb
# Schema name: dbo
# Table name: graduations
qry <- "
SELECT TOP 5 *
FROM ds_tx_thecb.dbo.graduations;
"

We use `TOP` to read in only the first 5 rows because we're just looking to preview the data and we don't want to eat up memory by reading a huge data frame into R. 

> `TOP` provides one simple way to get a "sample" of data. You may get different samples of data from others using just the `TOP` clause. However, `TOP` is not returning a random selection of the data, it is just returning the results that can be pulled the fastest.
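
The distinction between "the first rows returned" and "a random sample" also exists once data are in R. As a loose analogy only (this is not what SQL Server does internally), the sketch below contrasts `head()` with `dplyr::slice_sample()` on a made-up data frame; `toy` and `row_id` are hypothetical names.

```r
library(dplyr)

# Hypothetical data frame with 100 rows
toy <- tibble(row_id = 1:100)

# head() always returns the first rows in their stored order
first_five <- toy %>%
    head(5)

# slice_sample() draws rows at random; set.seed() makes the draw repeatable
set.seed(42)
random_five <- toy %>%
    slice_sample(n = 5)

first_five
random_five
```

A preview is fine for checking column names and formats, but summary statistics should not be based on it.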

#### **Read in the Data** 

We can pass `con` and `qry` as arguments to `dbGetQuery()` to read the data into R. Compare the results below with the test query you ran in DBeaver. To run the query without saving the result to a data frame for later reference, you can simply call `dbGetQuery(con, qry)`, as shown below.

> Effectively, `dbGetQuery()` provides a bridge from R (and JupyterLab) to the server to access the table. In contrast, you can run a query without `dbGetQuery()` in DBeaver because the connection to the server has already been established.

In [None]:
# Read in data frame 
dbGetQuery(con,qry)

In [None]:
# See column names
dbGetQuery(con, qry) %>%
    names()

> Note: There are other methods you can use to explore the data. Two of these functions are `glimpse()` and `head()`.
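
As a brief sketch of those two functions, the example below uses a small made-up data frame rather than a query result. The values in `toy` are hypothetical; the column names mirror ones in the `graduations` table.

```r
library(dplyr)

# Hypothetical stand-in for a query result
toy <- tibble(
    gradid   = c(101, 102, 103),
    gradyear = c(2015, 2015, 2016),
    graddegr = c("BA", "BS", "AA")
)

# glimpse() lists each column with its type and first few values
glimpse(toy)

# head() returns the first n rows (6 by default)
head(toy, 2)

# names() returns only the column names
names(toy)
```

`glimpse()` is especially handy for wide tables, because it prints one column per line instead of truncating columns to fit the screen.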

<font color=orange> <h3> **Checkpoint 1: Explore Columns** </h3> </font>

Take a look at the columns in the `graduations` table. Which variables might be useful for your project?  Let's explore another table.  Try to query another higher education data table to see five rows and the names of the columns. For example, explore the `enrollments` table in the `ds_tx_thecb` database.  **Note: For the purposes of extracting data from the server in the ADRF using R, the location will always be ds_tx_SOURCE.dbo.FILE, where the SOURCE is thecb, tea, or twc and the FILE refers to the tables of data within those sources.**

> Refer to the data dictionary on the class website to get a better understanding of the variables.

In [None]:
# Replace the blanks with the table name and an exploration function
qry <- "
SELECT TOP 5 *
FROM ds_tx_thecb.dbo.__;
"

# Read in data frame
dbGetQuery(con,qry)

# Can write additional code to explore the data frame
dbGetQuery(con, qry) %>%
    ____()

Uncomment the lines below if you would like to see a hint or a solution.

In [None]:
#checkpoint_1.hint()

In [None]:
#checkpoint_1.solution()

-----

## **4. Explore the table and understand the data**

Before building a cohort, it is important to understand the quality and characteristics of the data. Since we will be creating a cohort of graduates, it is important to note that the `graduations` table lists awards, not people. Each row represents a person-degree/credential. The data are not always clean, and it is useful to understand missing values, changes in trends, and inconsistent data before furthering any analysis. 

Since we hope to create a cohort of graduates that graduated within a specific time period (2015 calendar year) so that we can track their future employment outcomes as a group, let's take a look at the distribution of the number of graduates by fiscal year, or `gradyear` in the data.

Try running the following query in DBeaver to understand the kinds of information you will be bringing into your data frame:

    SELECT gradyear, count(DISTINCT(gradid)) as num_individuals
    FROM ds_tx_thecb.dbo.graduations
    GROUP BY gradyear
    ORDER BY gradyear desc;

Now run the query in R and review the results.

> **REMINDER:** Ctrl+C and Ctrl+V can be used to copy and paste text within the ADRF. 

In [None]:
# Exploration query on fiscal year earned
qry <- "
SELECT gradyear, count(DISTINCT(gradid)) as num_individuals
FROM ds_tx_thecb.dbo.graduations
GROUP BY gradyear
ORDER BY gradyear desc;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

As you can see, we have quite a large number of potential graduates to pull from within each *fiscal year*. The source of graduates also varies by the type of institution, as the graduations file includes both two-year and four-year college graduations. To get a better sense of our potential sample size across year and college type, let's look at the number of individuals that graduated in the 2015 *calendar year*. 

Recall that `gradyear` does not represent the calendar year, but rather the fiscal year. From the data dictionary, we can see that, for example, those who graduated in the 2020 fiscal year graduated anywhere from September 2019 to August 2020. Therefore, we need to leverage the combination of `gradyear` and `gradmonth` to properly define the 2015 calendar year, as all graduates from January to August will need to correspond to the `gradyear` 2015, and all graduates from September to December to the `gradyear` 2016.

    SELECT gradtypi, gradmonth, gradyear, count(DISTINCT(gradid)) as num_individuals
    FROM ds_tx_thecb.dbo.graduations
    WHERE (gradyear = 2015 and gradmonth < 9) or (gradyear = 2016 and gradmonth >= 9)
    GROUP BY gradmonth, gradyear, gradtypi
    ORDER BY gradmonth;

In [None]:
# Exploration query on gradmonth and gradtypi
qry <- "
SELECT gradtypi, gradmonth, gradyear, count(DISTINCT(gradid)) as num_individuals
FROM ds_tx_thecb.dbo.graduations
WHERE (gradyear = 2015 and gradmonth < 9) or (gradyear = 2016 and gradmonth >= 9)
GROUP BY gradmonth, gradyear, gradtypi
ORDER BY gradmonth;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

Note that there are three college types (`gradtypi`), which represent the following: (1) Universities; (3) CTCs [Community and Technical Colleges] ; and, (5) HRIs [Health-Related Institutions]. Some counts in this table serve as a reminder that disaggregating data can produce numbers too small to include in any unsecured discussions or exported output. Now that we have a general understanding of the volume of graduates in each college type for the 2015 calendar year, let's read the data into R and save the resulting data frame as `Grads_2015`.

In [None]:
# read table into r and assign as Grads_2015
qry <- "
SELECT *
FROM ds_tx_thecb.dbo.graduations
WHERE (gradyear = 2015 and gradmonth < 9) or (gradyear = 2016 and gradmonth>=9)
"
Grads_2015 <- dbGetQuery(con, qry)

# see first few rows of df
head(Grads_2015)
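
The fiscal-to-calendar year logic in the `WHERE` clause can also be expressed as a derived column. The sketch below is one possible way to do this, assuming `gradmonth` is numeric and the fiscal year runs from September through August as described above. The data frame `toy` and the column `calendar_year` are hypothetical.

```r
library(dplyr)

# Hypothetical fiscal year / month combinations
toy <- tibble(
    gradyear  = c(2015, 2015, 2015, 2016, 2016),
    gradmonth = c(5, 8, 12, 9, 3)
)

# September through December belong to the previous calendar year
toy_cal <- toy %>%
    mutate(
        calendar_year = if_else(gradmonth >= 9, gradyear - 1, gradyear)
    )

# Keeps the same rows as the WHERE clause used for Grads_2015
toy_cal %>%
    filter(calendar_year == 2015)
```

A derived calendar year column can be convenient when you want to compare several calendar years at once instead of writing one `WHERE` clause per year.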

<font color=orange> <h3> **Checkpoint 2: Explore the Data** </h3> </font>

Note that `gradmaj` corresponds to the CIP code for the program of study, `graddegr` is the type of award, and `gradwhite` is one of the race/ethnicity indicator variables. THECB also records race/ethnicity values in the aggregate variable `GradEthnicityCode`.

Run the following queries in the notebook to better understand some key variables of interest within each *fiscal year*:

    SELECT gradyear, gradmaj, count(DISTINCT(gradid)) as num_individuals
    FROM ds_tx_thecb.dbo.graduations
    GROUP BY gradyear, gradmaj
    ORDER BY gradyear ;

    SELECT gradyear, graddegr, count(DISTINCT(gradid)) as num_individuals
    FROM ds_tx_thecb.dbo.graduations
    GROUP BY gradyear, graddegr
    ORDER BY gradyear ;
    
    SELECT gradyear, gradwhite, count(DISTINCT(gradid)) as num_individuals
    FROM ds_tx_thecb.dbo.graduations
    GROUP BY gradyear, gradwhite
    ORDER BY gradyear ;


Try running the code in the box below using different columns.  After reviewing this information, try to answer the following questions:
1. Which variables are character and which numeric? 
2. Are there null values in the data you are seeing?  Where are they present?
3. The variable `gradmaj` describes the CIP code of study. Take a moment to review the Classification of Instructional Programs online. Do you see anything unusual in how this variable has been coded? 

In [None]:
# replace ___
qry <- "
SELECT gradyear, ___, count(DISTINCT(gradid)) as num_individuals
FROM ds_tx_thecb.dbo.graduations
GROUP BY gradyear, ___
ORDER BY gradyear
"
dbGetQuery(con, qry)
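
For the first two questions, one possible approach once a result is in R is to check the class of each column and count missing values. The sketch below uses a made-up data frame; `toy` and its values are hypothetical.

```r
# Hypothetical stand-in for a query result
toy <- data.frame(
    gradyear = c(2015, 2015, 2016),
    gradmaj  = c("52020100", NA, "27010100"),
    stringsAsFactors = FALSE
)

# Class (type) of each column
col_types <- sapply(toy, class)

# Number of missing (NA) values in each column
na_counts <- colSums(is.na(toy))

col_types
na_counts
```

Null values in the database arrive in R as `NA`, so `is.na()` is the usual way to find them.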

In [None]:
#checkpoint_2.hint()

In [None]:
#checkpoint_2.solution()

-----

## **5. Create the Cohort**

In this section, we will use the Texas `graduations` table to create a sample of all students in THECB institutions who earned an Associate degree or certificate during the 2015 calendar year to demonstrate how you may choose to narrow your population of interest. In section 6, we return to examining the entire cohort of 2015 calendar year graduates.

In addition to establishing a time period, it is common to further narrow your population.  Some research questions require you to select only certain graduates.  Others focus on degree level (associate degree recipients, for example) or field of study (business, for example). When establishing your cohort, it is helpful to build an initial query iteratively, checking each restriction before adding others. To recall, our initial query to yield the data frame we named `Grads_2015` is

    SELECT *
    FROM ds_tx_thecb.dbo.graduations
    WHERE (gradyear = 2015 and gradmonth < 9) or (gradyear = 2016 and gradmonth >= 9)
    
Let's keep track of the number of individuals we currently have in `Grads_2015`.

In [None]:
# see number of individuals in Grads_2015
Grads_2015 %>%
    summarize(
        num_inds = n_distinct(gradid)
    )


### Associate Degree Earners

Now that we have all graduates in the 2015 calendar year, let's turn our attention to the types of degrees graduates earned. According to the data documentation, an associate degree is assigned a `gradlev` value of `1`. Before subsetting `Grads_2015`, let's see the different types of degree conferred (`graddegr`) within each `gradlev`.

In [None]:
# count number of graduates by gradlev/graddegr combination
Grads_2015 %>%
    group_by(gradlev, graddegr) %>%
      summarize(
        num_inds = n_distinct(gradid)
    )

Note that the value of `gradlev` does not always rise with the level of degree; for example, certifications can have a `gradlev` of 8. After tabulating these two variables and consulting the data dictionary, we find that `gradlev` corresponds to (1) Associate's degrees, (2) Bachelor's degrees, (3) Master's degrees, (4) Doctoral degrees, (5) Professional degrees (MD, etc.), (6-8) Certifications. A quick review of the table produced reveals more nuance behind these shortened classifications, for example:

- The CCC degree (with `gradlev` 5), represents Core Curriculum Completers at a community college who complete a set number of credits that will automatically be accepted at Texas colleges and universities. 
- The BAT degree represents a Bachelor's of Applied Technology and it uses two `gradlev` codes, Bachelor's degrees and Certifications, representing the different types of program offerings. South Texas College's entry requirements for their BAT program are similar to any community college's entry requirements and may be considered similar to a certificate. The University of Texas System BAT requires an Associate's upon entry for their BAT program and completion is more closely aligned with receiving a Bachelor's degree.
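
If you want readable labels in your tables, the shortened classifications above can be attached with `case_when()`. This is only a sketch of the shorthand mapping from the text; it does not capture exceptions such as CCC or BAT, and the data dictionary remains the authoritative source. The data frame `toy` and the column `level_label` are hypothetical.

```r
library(dplyr)

# Hypothetical data frame with one row per gradlev value
toy <- tibble(gradlev = 1:8)

# Attach the shortened classification to each gradlev value
toy_labeled <- toy %>%
    mutate(
        level_label = case_when(
            gradlev == 1 ~ "Associate's",
            gradlev == 2 ~ "Bachelor's",
            gradlev == 3 ~ "Master's",
            gradlev == 4 ~ "Doctoral",
            gradlev == 5 ~ "Professional",
            gradlev %in% 6:8 ~ "Certification",
            TRUE ~ NA_character_
        )
    )

toy_labeled
```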


Let's go ahead and `filter` for associate degree recipients and save the results in a new data frame `Grads_2015_Assoc`.

In [None]:
# filter for associate degree recipients
Grads_2015_Assoc <- Grads_2015 %>%
    filter(gradlev == 1) 

Note that the degrees conferred include a relatively small number of certifications that you may choose to exclude from your analysis. But prior to doing so, always investigate the meaning behind unexpected data points. For example, the CER in this group appears mostly tied to EMT certification programs, as you will see in the following cell by examining CIP codes.

In [None]:
# examine 'CER' for associate degree recipients
Grads_2015_CER <- Grads_2015 %>%
    filter(
        gradlev == 1, 
        graddegr == 'CER'
    ) 

Grads_2015_CER %>%
    group_by(gradmaj) %>%
      summarize(
        num_inds = n_distinct(gradid)
    )

After looking at those with `CER` degrees conferred, it may make sense to drop them from a potential cohort of associate degree earners. You could do so by adding another argument to your `filter()` statement.

In [None]:
# filter for associate degree recipients
Grads_2015_Assoc <- Grads_2015 %>%
    filter(
        gradlev == 1, 
        graddegr != 'CER'
    ) 

We can confirm that our filtering worked as intended by retabulating the number of graduates by degree conferred.

In [None]:
# count number of associate degree graduates by degree conferred.
Grads_2015_Assoc %>%
    group_by(graddegr) %>%
      summarize(
        num_inds = n_distinct(gradid)
    )

<font color=orange> <h3> **Checkpoint 3: Create Your Cohort** </h3> </font>
Starting with the `graduations` table, create a sample of graduates of a different *calendar year*. Name the data frame `df_checkpoint`. Then, modify your cohort to include graduates of a single graduation level. 

In [None]:
# Replace ____ 
qry <- "
SELECT *
FROM ds_tx_thecb.dbo.graduations
WHERE (gradyear = ___ and gradmonth = ___) or (gradyear = ___ and gradmonth = ___)
"

# Read in data frame and save it as df_checkpoint
df_checkpoint <- dbGetQuery(con,qry)

# count number of graduates by gradlev/graddegr combination
df_checkpoint %>%
    group_by(___, ___) %>%
      summarize(
        num_inds = n_distinct(gradid)
    )

In [None]:
# Use the filter statement to modify your cohort to include only one type of graduation level.
df_checkpoint <- df_checkpoint %>%
    filter(___)

In [None]:
#checkpoint_3.hint()

In [None]:
#checkpoint_3.solution()

-----

## **6. Higher Education Graduate Cohort Count and Cohort Construction**

Next we return to the larger graduates file to understand how the `graduations` data is structured in the cohort data frame. Recall that you have already made a data frame that includes all graduations in the 2015 calendar year, `Grads_2015`. Here, we will use the same data frame to further examine the data elements. Recall that each record in our data frame does not represent a person. Each record represents an award or credential (degree or certificate). You can see this by comparing the number of rows, or awards, with the number of graduates.

In [None]:
# compare number of rows (awards) to grads
Grads_2015 %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(gradid)
    ) 

The difference between awards and graduates is due to a subset of students earning multiple degrees. There is the potential for students to have earned a certificate and a degree or multiple short-cycle certificates in a calendar year.  When framing your higher education cohort, you need to think about if you want to de-duplicate the file (e.g., by selecting the highest award) or if you want to focus on one level of degree. Workforce outcomes may be very different for certificate holders without a degree compared to those with associate degrees. Those with bachelor's or graduate degrees are expected to have higher incomes. Four decisions are required to accurately form the cohort:

1. What time period are you using to define the cohort (one year or multiple years)?
1. What degree level or levels will you focus on?
1. Will you limit your cohort to specific degrees?
1. What will you do with multiple records for the same graduate?

In this example, for future processing speed reasons, we will further subset the cohort to only include bachelor's recipients. Then, we will review the duplicates to help inform decisions.

### Bachelor's subset

You may recall the code cell earlier in the notebook where we counted the number of graduates by `gradlev`/`graddegr` combination. In that cell, it seemed as though bachelor's degrees corresponded with the `gradlev` value of 2. However, after consulting with the corresponding data dictionary, it appears that a `gradlev` 2 has different meanings depending on the institution--namely, it references bachelor's degrees for Universities and HRIs, but not CTCs. 

Therefore, we can isolate bachelor's recipients in our `Grads_2015` data frame by filtering for `gradlev` values of 2 as long as the institution is not a CTC (`gradtypi` is not 3).

In [None]:
# create bachelors data frame
bachelors <- Grads_2015 %>% 
    filter(gradlev == 2, gradtypi != 3)

Let's see if we still need to address the potential duplication issue.

In [None]:
# compare number of rows (awards) to grads
bachelors %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(gradid)
    ) 

### Duplicates code

The series of commands below helps identify duplicates, create a data frame of duplicates, and list the results. Before we de-duplicate the files, let's save an award-level (not person-level) file for potential award-based analyses.  We will name this data frame `df_awards`.

In [None]:
# copy bachelors as df_awards
df_awards <- bachelors

Next we can start to explore the duplicates. First, we will identify a case of duplication, which we can isolate by counting the number of occurrences of each `gradid` in `df_awards`, here focusing on the `gradid` with the highest number of occurrences.

In [None]:
# find duplicate example
dup_ex <- df_awards %>%
    count(gradid) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

From here, we can find all rows in `df_awards` with the `gradid` in `dup_ex` so we can further explore a duplicated example. We will select certain variables to highlight the duplication.

In [None]:
# see all duplicated rows in example
df_awards %>%
    filter(gradid == dup_ex$gradid) %>%
    select(gradid, gradmonth, graddegr, gradlev, gradmaj, gradfice)

This individual received multiple bachelor's degrees (BA, BSMTH, BSCHE) and received these credentials in three types of majors from one institution of higher education in December. Sometimes, the CIP program (the first two digits of the `gradmaj`) is the same for all duplicates, and the double or triple awards are nearly equivalent (for example, Accounting and Finance OR Business and Marketing) in addition to other true duplicates, based on the columns we selected. Here, we see two unique CIP programs, 40 (Physical Sciences) and 27 (Mathematics), across the degrees earned.
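
To check how often multiple awards fall in the same CIP program, you could extract the first two digits of `gradmaj` and count distinct values per person. The sketch below does this on made-up records, assuming `gradmaj` is stored as a character string. The data frame `toy_awards`, its values, and the column `cip_family` are hypothetical.

```r
library(dplyr)

# Hypothetical person with three awards across two CIP programs
toy_awards <- tibble(
    gradid  = c(1, 1, 1),
    gradmaj = c("27010100", "27030100", "40050100")
)

# Count awards and distinct two-digit CIP programs per person
toy_programs <- toy_awards %>%
    mutate(cip_family = substr(gradmaj, 1, 2)) %>%
    group_by(gradid) %>%
    summarize(
        n_awards   = n(),
        n_programs = n_distinct(cip_family)
    )

toy_programs
```

A person whose `n_programs` is 1 earned all of their awards within the same broad field, which may inform how you treat their duplicates.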

You may choose among several ways to address duplicates based on your needs in future work, including:

- taking the most recently received degree
- taking the highest degree earned
- including all conferred degrees and acknowledging that each row represents a person and a unique degree

For this example, we will keep each graduate's most recently received degree, that is, the first record per person after sorting by graduation month in descending order.

### De-Duplicating Cohort

Since the `bachelors` data frame is already restricted to 2015 calendar year degree earners, we can simply de-duplicate the awards to individuals by choosing the first record per person after sorting to ensure the most recent degree is selected (highest `gradmonth` per `gradid`).

In [None]:
# unduplicate cohort
# on most recent degree
bachelors <- bachelors %>%
    arrange(gradid, desc(gradmonth)) %>%
    distinct(gradid, .keep_all = TRUE)

We can re-compare the number of awards to the number of graduates to confirm that `bachelors` is now de-duplicated.

In [None]:
# compare number of rows (awards) to grads
bachelors %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(gradid)
    ) 
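
The result of this kind of de-duplication depends entirely on the sort order, because `distinct(..., .keep_all = TRUE)` keeps the first row it encounters for each `gradid`. The sketch below illustrates this on made-up records, assuming `gradmonth` is numeric. The data frame `toy_awards` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical award-level records within one calendar year
toy_awards <- tibble(
    gradid    = c(1, 1, 2, 3, 3, 3),
    gradmonth = c(5, 12, 8, 12, 5, 8),
    graddegr  = c("BA", "BS", "BA", "BS", "BA", "BBA")
)

# Descending sort: the first row per person is the most recent award
most_recent <- toy_awards %>%
    arrange(gradid, desc(gradmonth)) %>%
    distinct(gradid, .keep_all = TRUE)

# Ascending sort: the first row per person is the earliest award
earliest <- toy_awards %>%
    arrange(gradid, gradmonth) %>%
    distinct(gradid, .keep_all = TRUE)

most_recent
earliest
```

Note that sorting on `gradmonth` alone does not break ties between awards conferred in the same month, so you may want additional sort keys if the choice among those records matters for your analysis.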

Though the sample cohort table for 2015 calendar year bachelor's degree earners was already created for you, the following code demonstrates how you can create a database table from an R data frame. Saving the cohort as a permanent table can be especially helpful if you want to limit future queries to only include students that are members of the cohort (ultimately improving the efficiency of the query), or join across larger tables in the server. This code creates the table `grads15` in the TX training workspace database `tr_tx_2021` from the data frame `bachelors`.
```
qry <- "use tr_tx_2021;"
DBI::dbExecute(con, qry)


DBI::dbWriteTable(
    conn = con,
    name = DBI::SQL("dbo.grads15"), 
    value = bachelors
)
```

> Note: If you run this code, you will get an error because a table with the name `grads15` already exists in the `tr_tx_2021` database.

<font color=orange> <h3> **Checkpoint 4: Explore duplicates and remove for your cohort** </h3> </font> 

For your data frame `df_checkpoint`, explore and come up with a strategy for removing all duplicates.

In [None]:
# replace ___ with code
df_checkpoint_awards <- df_checkpoint

dup_ex <- ___ %>%
    count(gradid) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

In [None]:
# see all duplicated rows in example
___ %>%
    filter(gradid == dup_ex$___) %>%
    select(gradid, gradmonth, graddegr, gradlev, gradmaj, gradfice)

In [None]:
# unduplicate cohort
___ <- ___ %>%
    arrange(gradid, ___) %>%
    distinct(gradid, .keep_all = TRUE)

# compare number of rows (awards) to grads
___ %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(gradid)
    ) 

In [None]:
#checkpoint_4.hint()

In [None]:
#checkpoint_4.solution()

-----

## **7. Exploratory Analysis of the Cohort**

In this section, we will find out more about our de-duplicated cohort of 2015 calendar year bachelor's degree recipients. We will begin by isolating the top 5 majors. From there, we will look to see if there are differences by gender. Understanding these patterns is an important part of understanding potential disparities in employment outcomes.

### Major Groupings

Recall that the last question from Checkpoint 2 asked whether you saw an issue with how the CIP codes in `gradmaj` were coded. You probably noted that they have 8 digits rather than the 6 digits used in CIP code reporting. Nearly all of the 8-digit `gradmaj` codes end in two zeros. Let's explore this mystery.

> Reminder: If you are unsure as to what a certain function does, you can always look at the help documentation.

In [None]:
# all have 8 characters!!
bachelors %>% 
    distinct(nchar(gradmaj))

From looking at the multiple awards in an earlier section, we know that the codes do not start with `00`, so the extra digits are not leading zeros. Let's take a look at the last two digits to see whether they are potentially unnecessary, in which case the first six digits would correspond to standard 6-digit codes.

In [None]:
# and the last two characters are not always 00...but they mostly are 
bachelors %>% 
    mutate(
        sub = substring(gradmaj, 7, 8)
    ) %>%
    group_by(sub) %>%
    summarize(
        n = n()
    )

Since REDACTED of the `gradmaj` values end in two zeros, and CIP codes have 6 digits, it may seem like this is a coding error. But as it turns out, Texas uses a two-digit suffix to provide further detail on the diversity of course and program offerings. This serves as a reminder that the theoretical basis, origin, recording, storage, management, and transfer of data should always be thoughtfully considered in any research. Data may contain errors or missing records and may carry inherent or explicit bias, all of which researchers should investigate and account for in their analysis. Since we know the origin of the values, and because we do not need this level of specificity for the class, we will ignore the last two digits.

Next, we will look at the number of different 6-digit CIP codes that appear in `bachelors` and compare that to the number of CIP programs (2-digits).

In [None]:
# Compare the number of distinct 6-digit CIP codes to the number of 2-digit CIP programs
bachelors %>%
    summarize(
        # drop the two-digit Texas suffix before counting 6-digit codes
        num_cip_6 = n_distinct(substring(gradmaj, 1, 6)),
        num_cip_2 = n_distinct(substring(gradmaj, 1, 2))
    )

There are REDACTED unique CIP codes used in `gradmaj` within our cohort, but we only have REDACTED unique codes if we go by the CIP program. For the sake of simplicity, we will use the CIP program as we continue our analysis by major.

In [None]:
# Create a 2-digit CIP program code from the full CIP code in `gradmaj`
bachelors <- bachelors %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

# confirm code works by inspecting first two digits of gradmaj
bachelors %>%
    select(CIP_Program, gradmaj) %>%
    head()
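
The three levels of detail in a `gradmaj` value can be laid out side by side. The following sketch uses a handful of invented 8-character codes (the suffix values in particular are made up) and hypothetical column names `cip_6`, `suffix`, and `cip_2`; it assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical 8-character codes, invented for illustration only
toy_majors <- tibble(
    gradmaj = c("52010100", "52020100", "52020101", "26010100")
)

# Split each code into its parts
toy_levels <- toy_majors %>%
    mutate(
        cip_6  = substring(gradmaj, 1, 6),  # 6-digit CIP code
        suffix = substring(gradmaj, 7, 8),  # two-digit Texas suffix
        cip_2  = substring(gradmaj, 1, 2)   # 2-digit CIP program
    )

toy_levels

# Count distinct values at each level of detail
toy_counts <- toy_levels %>%
    summarize(
        num_full  = n_distinct(gradmaj),
        num_cip_6 = n_distinct(cip_6),
        num_cip_2 = n_distinct(cip_2)
    )

toy_counts
```

In this toy table, the number of distinct values shrinks at each coarser level, which is the same trade-off between detail and simplicity made above when choosing the 2-digit program.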

Now we can find the 5 most common majors in the cohort.

In [None]:
# 5 most common majors
bachelors %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    head(5)

Unless you have memorized the corresponding names for all 2-digit CIP codes, this table will not mean much. However, we have uploaded a CIP code crosswalk to our public database, `ds_public_1`, that we can join to these results to provide more context.

We first need to load this table into R.

In [None]:
# load CIP crosswalk into R
qry <- "
SELECT *
FROM ds_public_1.dbo.cip_lookup
"
cip_lookup <- dbGetQuery(con, qry)

# see first few rows of cip_lookup
head(cip_lookup)

Given the time frame of our cohort and the knowledge that CIP codes change every 10 years, it would make sense for us to use the 2010 columns.

In [None]:
# only select 2010 columns
cip_lookup <- cip_lookup %>%
    select(ends_with("2010"))

head(cip_lookup)

Now we can join `cip_lookup` to our results table.

In [None]:
# 5 most common majors
bachelors %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    head(5) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

Does this list surprise you? For additional perspective, we will add in another column tracking the proportion of graduates by major.

In [None]:
# 5 most common majors with proportion
bachelors %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))
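
Because `inner_join()` silently drops rows that have no match in the lookup, it can be worth checking for unmatched codes. The sketch below is an illustration on made-up data: `toy_counts` and `toy_lookup` are hypothetical, and of the lookup column names only `CIPCode2010` is taken from the code above; `CIPTitle2010` and `CIPCode2020` are assumptions. It assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical counts by 2-digit program; "99" has no match in the lookup
toy_counts <- tibble(
    CIP_Program = c("52", "26", "99"),
    n = c(50, 30, 20)
)

# Hypothetical lookup table with columns for two CIP versions
toy_lookup <- tibble(
    CIPCode2010  = c("52", "26"),
    CIPTitle2010 = c("Business", "Biology"),
    CIPCode2020  = c("52", "26")
)

# Keep only the columns whose names end in "2010"
toy_lookup_2010 <- toy_lookup %>%
    select(ends_with("2010"))

# Rows with a matching code are kept; "99" is dropped
toy_joined <- toy_counts %>%
    inner_join(toy_lookup_2010, by = c("CIP_Program" = "CIPCode2010"))

# anti_join() returns the rows that found no match
toy_unmatched <- toy_counts %>%
    anti_join(toy_lookup_2010, by = c("CIP_Program" = "CIPCode2010"))

toy_joined
toy_unmatched
```

Both key columns here are character vectors. If one side were stored as a number, the join would likely fail or need a conversion first, so checking column types with `glimpse()` before joining is a reasonable habit.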

Because we hope to build off of this cursory subgroup analysis in later notebooks, let's save the resulting data frame to `df_common_major`.

In [None]:
# 5 most common majors with proportion
df_common_major <- bachelors %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

### Gender

Additionally, we can look at the gender breakdown within the cohort using the `gradgen` variable.

In [None]:
# gender breakdown
df_gender <- bachelors %>%
    count(gradgen) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    )

# see df_gender
df_gender

Does anything stand out about the proportion of graduates by gender? Is this something you would expect to see?

### Top Majors by Gender

Let's intersect the major breakdown with gender: do the most common majors differ between gender groups? Since we are looking at proportions and counts within multiple combinations of subgroups (`CIP_Program` and `gradgen`), we need to adjust the code from above a bit. First, we use `group_by()` so that the proportions are calculated within each `gradgen` value. Second, we replace `head()` with `slice()`, which respects the grouping and retrieves the top 5 majors within each `gradgen` value, whereas `head()` would simply return the first 5 rows of the whole table.

In [None]:
# major/gender breakdown
df_major_gender <- bachelors %>%
    count(CIP_Program, gradgen) %>%
    arrange(desc(n)) %>%
    group_by(gradgen) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    arrange(gradgen, desc(n)) %>%
    slice(1:5)

df_major_gender
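
As an alternative sketch of the same "top rows within each group" idea, `slice_max()` (available in `dplyr` 1.0.0 and later) selects the largest values per group without a separate `arrange()` step. The data below are invented, including the `gradgen` values `"F"` and `"M"`, and the example keeps the top 2 rather than the top 5 to stay small.

```r
library(dplyr)

# Hypothetical graduates: six per gender group, with made-up program codes
toy_grads <- tibble(
    gradgen = rep(c("F", "M"), each = 6),
    CIP_Program = c("51", "51", "51", "52", "52", "26",
                    "52", "52", "52", "14", "14", "26")
)

toy_top <- toy_grads %>%
    count(gradgen, CIP_Program) %>%
    group_by(gradgen) %>%
    # proportion is computed within each gradgen group
    mutate(prop = n / sum(n)) %>%
    # keep the 2 largest counts per group; with_ties = FALSE caps the result at 2 rows
    slice_max(n, n = 2, with_ties = FALSE) %>%
    ungroup()

toy_top
```

Here the first `n` in `slice_max()` is the count column to order by, and `n = 2` is the number of rows to keep per group. Because `prop` is calculated before slicing, each proportion is still relative to all graduates in that group, not just the rows that are kept.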

We can then add in our CIP code lookup table for more context.

In [None]:
# add in CIP code lookup table
df_major_gender <- df_major_gender %>%     
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

df_major_gender

Females were more likely than males to choose Health Professions and Related Programs, a field that does not appear among the males' top 5 choices. Note that a larger share of males received Biological and Biomedical Sciences degrees (REDACTED) compared to females (REDACTED), but that more females than males completed these degrees (REDACTED vs. REDACTED). Business, Management, Marketing, and Related Services was the most common major choice for males and the second most common for females.

<font color=orange> <h3> **Checkpoint 5: Common Majors, Gender Breakdown, and Majors by Gender** </h3> </font>

Using your own data frame, `df_checkpoint`, identify the 5 most common majors overall and by gender. Save these results to `df_checkpoint_common_major` and `df_checkpoint_major_gender`, respectively.

Do your results vary drastically from those derived from `bachelors`?

In [None]:
# find most common majors
df_checkpoint <- df_checkpoint %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

df_checkpoint_common_major <- df_checkpoint %>%
    count(___) %>%
    arrange(___(___)) %>%
    mutate(
        prop = ___/sum(___)
    ) %>%
    ___(5) %>%
    inner_join(___, by = c(___ = ___))

df_checkpoint_common_major

In [None]:
# find most common majors by gender
df_checkpoint_major_gender <- df_checkpoint %>%
    count(___, ___) %>%
    arrange(___(___)) %>%
    group_by(___) %>%
    mutate(
        prop = ___/sum(___)
    ) %>%
    arrange(___, ___) %>%
    ___(___) %>%  
    inner_join(___, by = c(___ = ___))

df_checkpoint_major_gender

In [None]:
#checkpoint_5.hint()

In [None]:
#checkpoint_5.solution()

## **8. Export Results to .csv Files**

You have now defined a cohort and completed a quick subgroup analysis! The last step is to save your results in .csv files so that we can reuse them in future notebooks.

In [None]:
# Save dataframes to CSV to use in later notebook

# most common majors
write_csv(df_common_major, sprintf("U:\\%s\\TX Training\\Results\\common_major.csv", username))

# gender breakdown
write_csv(df_gender, sprintf("U:\\%s\\TX Training\\Results\\common_gender.csv", username))

# most common majors by gender
write_csv(df_major_gender, sprintf("U:\\%s\\TX Training\\Results\\common_major_gender.csv", username))
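
A quick way to confirm an export worked is to read the file back in and compare it to the original. This is a minimal sketch, assuming `readr` is installed; it writes to a temporary folder instead of the `U:` drive, and `results_dir`, `toy_gender`, and its values are hypothetical.

```r
library(readr)

# Assumption: a temporary folder stands in for the Results folder on the U: drive
results_dir <- file.path(tempdir(), "Results")
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

# Hypothetical summary table shaped like df_gender
toy_gender <- data.frame(
    gradgen = c("F", "M"),
    n = c(6, 4),
    prop = c(0.6, 0.4)
)

# file.path() builds the path without hand-written separators
out_path <- file.path(results_dir, "common_gender.csv")
write_csv(toy_gender, out_path)

# Read the file back with explicit column types and inspect it
toy_check <- read_csv(
    out_path,
    col_types = cols(
        gradgen = col_character(),
        n = col_double(),
        prop = col_double()
    )
)

toy_check
```

Declaring `gradgen` as a character column guards against type guessing when the file is read back, which matters for short codes that could be mistaken for logical values.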

## **9. References**
Chappell, Joseph, Feder, Benjamin, & Barrett, Nathan. (2022, April 1). Data Exploration for Cohort Analysis using Tennessee Board of Regents Data. Zenodo. https://doi.org/10.5281/zenodo.6407247