# **<center> Data Exploration: Create a Cohort </center>**

Nathan Barrett, Benjamin Feder, Jimmy Green, Gavin Rozzi, Sean Simone, and Angie Tombari

## **1. Introduction**
Historically, state agencies used their own administrative data to administer programs and complete required reporting. These data were siloed and were not leveraged to evaluate or inform policy development within or across agencies. Legal, technological, and human resource barriers prevented using data to assist policymakers in making decisions. Through a concerted effort supported by the Statewide Longitudinal Data System (SLDS) Grant Program, states have gradually come together to address this limitation. In this course, you will learn how to review, understand, and link data from different agencies. You will also uncover real-life problems in using the data (such as missing data, errors in the data, and complex data structures) and learn how to address them. 

This notebook introduces you to the concept of creating a group, or "cohort", that will be used for future analysis. We will construct measures to understand who we are including in and excluding from the cohort (coverage) and walk you through the decisions that need to be made when constructing the cohort using filters such as institution governance and/or level (in terms of highest degree offered), award level, major, and others. We begin by introducing the analytical tools used to load the data, including connecting R to the database and using SQL queries to pull the data. We then use these tools to explore completions files from the NJ Office of the Secretary of Higher Education. We will create a dataset (called a "data frame") and investigate the trends in New Jersey postsecondary graduates. At the end of this notebook, we will save the summary statistics in csv files that will be used in subsequent notebooks.

#### **A Note About COVID-19**

This course will not address the COVID-19 crisis that most states are facing. While some states have access to weekly claims data, in most states it takes time for data to work through business processes before they are available to researchers. For New Jersey, outcomes data (unemployment claims and wage data) aren't posted to the data system until six months after a quarter ends. Higher education completions data aren't posted until eight months after the academic year ends. These correspond to the data collection windows for reporting to the Federal government. Additionally, when someone graduates with a credential from a postsecondary institution, there is a time delay before an analyst can see an outcome (2 to 4 quarters after graduation). As such, analysts must consider these limitations when designing an analysis. Statistics and data visualizations from this course can serve as a baseline for future analysis on COVID-19. Once designed, the same analysis can be repeated over time to show changes pre- and post-COVID-19.

## **2. Learning Objectives**

The Applied Data Analytics training uses a project-based approach to develop your analytic skills. You will begin by working with your team to develop and refine a research question. A crucial part of this is data exploration. You will implement techniques using SQL and R to explore and better understand the data that are available to you and to judge whether addressing your question is feasible. This will form the basis of all the other types of analyses you will do in this class and is a crucial first step for any data analysis workflow. As you work through the notebook, we will have checkpoints for you to practice writing code by making small adjustments, but you can also think about how you might apply any of the techniques and code presented with other datasets to address your research question.

The guiding research questions we will use for the notebooks are quite general: 

>**What are the employment outcomes of the 2012-13 graduating cohort? How do these outcomes vary by cohort characteristics and employer characteristics?** 

This will allow the code we use to have the most versatility. We will analyze these questions through a variety of different lenses, but will start by defining a specific cohort of New Jersey graduates in the 2012-2013 academic year. We will then track their earnings and employment outcomes over time. The exploration of the supply side of the labor market will later be supplemented by an analysis of the demand side to enhance our understanding of the overall labor market.

We are going to show just a portion of what you might be interested in investigating to answer these overarching questions, so don't feel restricted by the questions we've decided to try to answer.

>**When defining your research question(s), recall that one key benefit of working with New Jersey administrative records in the ADRF is the ability to integrate higher education and employment data.**

#### **Notebook 1 Questions and Goals** 
In this notebook, we focus on seeking answers to the following questions: 
- How many students graduated from New Jersey public postsecondary institutions in the 2012-13 academic year?
- What filters can be used to define the cohort (e.g., demographics, institutions, enrollment type, etc.)?
- How many students graduated from New Jersey public postsecondary institutions by subgroup (e.g. demographics, institutions, enrollment type, etc.)?

After completing this notebook you should be able to perform the following analytical tasks:
- load R libraries and establish a connection to the Database
- create a cohort sample by using the OSHE Completions file
- calculate descriptive statistics to understand who is in the population
- create new tables from the larger tables in a database (sometimes called the "analytical frame")
- explore different variables of interest
- clean data
- create aggregate metrics

The specific techniques include, but are not limited to:
- **SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on a condition
 - `GROUP BY`: aggregate data over the variables of interest
 - `ORDER BY`: sort data based on the variables of interest
 - `DISTINCT`: look at distinct values of a variable
 - `JOIN ... ON`: join tables
- **R code**:
 - `group_by` and `summarize` to find group-based measures
 - `mutate` to create new variables
 - `arrange` and `desc` to sort values

#### **Datasets** ####
We will explore and understand the New Jersey Education to Earnings Data System (NJEEDS) tables in this notebook:
- **Higher Education (OSHE) Completions**: The completions table comes from the Office of the Secretary of Higher Education's (OSHE) Student Unit Record data system (SURE). The data include completions at all levels that are reported to the U.S. Department of Education's Integrated Postsecondary Education Data System (IPEDS) Completions Survey.
- **Higher Education (OSHE) Supplemental Tables**: Multiple supplemental tables are available to append contextual information to the completions table. The `supplements_cip` table connects major names to the file. The `supplements_instcode` table connects institution characteristics to the file.

> **NOTE:** Not all colleges and universities within New Jersey report data into the SURE system. Only public institutions reliably report data over time. For private nonprofit institutions, analysts should use caution as to which institutions report data and which do not.


#### **Directory Structure**

We will constantly read and write csv files to load crosswalks and to save results in all the notebooks. Let's create a few folders in your U drive first so it is easier for you to organize all the files. 

- Open a Windows File Explorer
- On the left hand side, find U drive (U:) and click into it
- On the right hand side, open your user folder: FirstName.LastName.UserID
- In your user folder, create a new folder: NJ Training
- In the "NJ Training" folder, create three subfolders: "Notebooks", "Results", "Output"
- You can copy and paste the class notebooks to the "Notebooks" folder, save summary statistics to the "Results" folder, and save visualizations (in the third notebook) to the "Output" folder.

For example, we read all the crosswalks from **"P:\tr-dol-nj\NJ Class Notebooks\xwalks"**.  At the end of this notebook, **we save summary statistics to "U:\\FirstName.LastName.UserID\NJ Training\Results\filename.csv"**.


## **3. Load the Data**

In this section, we will demonstrate how to use R to read data from a relational database. First, we need to load packages in R.

#### **R Setup**

We will use several R functions that are not immediately available in base R. Therefore, we need to load them using the built-in function `library()`. For example, running `library(tidyverse)` loads the `tidyverse` suite of packages. It is a collection of packages designed for data science.

> When you run the following code cell, don't worry about the warning message below.

In [None]:
# Database interaction imports
library(DBI)
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

__When in doubt, full documentation for a function can be printed with `?<function_name>` or `?<package>::<function_name>`, e.g. `?ggplot2::ggplot` or `?sprintf`.__ Do not worry about memorizing the information in the help documentation - you can always run this command when you are unsure of how to use a function.

> Certain functions exist across multiple packages (e.g. the function `lag` exists in both the `dplyr` and `stats` packages, as also noted in the message yielded from `library(tidyverse)`). When calling a function, you can put the package name first to ensure that you are using the right one. For example, `dplyr::lag` or `stats::lag` calls the `lag` function from `dplyr` or `stats`, respectively.
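
To make the masking issue concrete, the short sketch below calls both versions of `lag` on a small made-up vector. The vector `x` is purely illustrative and is not part of the course data.

```r
library(dplyr)

# A small illustrative vector (not course data)
x <- c(10, 20, 30)

# dplyr::lag shifts the values down by one position and pads with NA
shifted <- dplyr::lag(x)
shifted
# → NA 10 20

# stats::lag is meant for time series: it shifts the time index,
# while the values themselves stay the same
x_ts <- ts(x, start = 1)
lagged_ts <- stats::lag(x_ts, k = 1)
start(x_ts)       # time index of the original series
start(lagged_ts)  # time index after lagging
```

Because the two functions behave so differently, writing the package prefix is a cheap way to avoid silently getting the wrong one.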

In [None]:
# See help documentation for head:
# a function we will use frequently to check the content of a table
# It returns the first few rows of a table
?head

#### **Establish a Connection to the Server**

Now, we are ready to connect to the server. We will create the connection using the `DBI` and `odbc` packages.

> **Loading R libraries** and **establishing a connection** should always be the first steps in your Jupyter Notebooks. Make sure you copy these code chunks when you create a new notebook.

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

#### **Formulate Data Query**

Next, we need to specify what we want to pull in from the database. This part is similar to writing a SQL query in DBeaver. In this example, we will pull in 5 rows of New Jersey higher education completions data, which are stored in the `completions` table inside the `ds_nj_oshe` database. Before running the code below, test the initial query you will use to bring in your first data frame in DBeaver to make sure it runs successfully:

    SELECT TOP 5 *
    FROM ds_nj_oshe.dbo.completions ;

Next, we create the same query as a `character` object in R.

In [None]:
# Create qry character object
# Database name: ds_nj_oshe
# Schema name: dbo
# Table name: completions
qry <- "
SELECT TOP 5 *
FROM ds_nj_oshe.dbo.completions;
"

We use `TOP` to read in only the first 5 rows because we're just looking to preview the data and we don't want to eat up memory by reading a huge data frame into R. 

> `TOP` provides one simple way to get a "sample" of data, but it is not a random sample. The database returns whichever rows it can retrieve fastest, so you may see different rows than others who run the same query.

#### **Read in the Data** 

Now we can use `con` and `qry` as inputs to `dbGetQuery()` to read the data into R. Compare the results below with the test query you made in DBeaver. To run the query without saving the result to a data frame for later reference, you can simply call `dbGetQuery(con,qry)`, as shown below.

In [None]:
# Read in data frame 
dbGetQuery(con,qry)

In [None]:
# See first few rows
dbGetQuery(con, qry) %>%
    head()

> Note: There are other methods you can use to explore the data. Two of these functions are `glimpse` and `names`.
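
As a minimal sketch of those two functions, the example below builds a tiny stand-in data frame rather than querying the database. The column names mirror a few columns of the `completions` table, but the values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for a query result (values are made up)
toy <- tibble(
  hashed_ssn = c("a1", "b2"),
  awardyearn = c(2012, 2013),
  awarddate  = c("51512", "051513")
)

# names() returns the column names as a character vector
col_names <- names(toy)
col_names

# glimpse() prints one line per column with its type and first few values
glimpse(toy)
```

In the notebook itself, the same calls can be applied to a query result, for example by piping `dbGetQuery(con, qry)` into `glimpse()`.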

#### **Checkpoint 1: Explore Columns** 

Take a look at the columns in the `completions` table. Which variables might be useful for your project?  Let's explore another table.  Try to query another higher education data table. Explore the `supplements_cip` lookup table in the `ds_nj_oshe` database.

> Refer to the data dictionary on the class website to get a better understanding of the variables.

In [None]:
# Replace ___ with the database, schema, and table name
qry <- "
SELECT TOP 5 *
FROM ___.___.__;
"

# Read in data frame
dbGetQuery(con,qry)

# Can write code to explore the data frame
dbGetQuery(con, qry) %>%
    ____()

-----

## **4. Explore the table and understand the data**

Before building a cohort, it is important to understand the quality of the data. As we will be creating a cohort of graduates, it is important to note that the `completions` table lists awards, not people. Each row represents a person-degree/credential-major. The data are not always clean. Before creating the cohort it is useful to understand missing values, changes in trends, and inconsistent data. 

Since we hope to create a cohort of graduates who completed within a specific time period (the 2012-13 academic year), let's take a look at the distribution of the number of graduates by year, or `awardyearn` in the data.

Try running the following query in DBeaver to understand the kinds of information you will be bringing into your data frame:

    SELECT awardyearn, count(DISTINCT(hashed_ssn)) as num_individuals
    FROM ds_nj_oshe.dbo.completions
    GROUP BY awardyearn
    ORDER BY awardyearn desc;

Now run the query in R and review the results.

In [None]:
# Exploration query on award year earned
qry <- "
SELECT awardyearn, count(DISTINCT(hashed_ssn)) as num_individuals
FROM ds_nj_oshe.dbo.completions
GROUP BY awardyearn
ORDER BY awardyearn desc;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)
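
For comparison, the sketch below shows how the same aggregation could be written with `dplyr` once rows are already in R. It uses a small invented data frame (`toy`) instead of the real table, and the comments map each step back to the SQL keyword it mirrors.

```r
library(dplyr)

# Hypothetical rows: person "a" earned two awards in 2012
toy <- tibble(
  awardyearn = c(2012, 2012, 2012, 2013),
  hashed_ssn = c("a", "a", "b", "c")
)

by_year <- toy %>%
  group_by(awardyearn) %>%                                 # GROUP BY awardyearn
  summarize(num_individuals = n_distinct(hashed_ssn)) %>%  # count(DISTINCT(hashed_ssn))
  arrange(desc(awardyearn))                                # ORDER BY awardyearn desc

by_year
```

Note that 2012 has three award rows but only two distinct individuals, which is why the query counts distinct `hashed_ssn` values rather than rows.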

As you can see, we have quite a large number of potential graduates to pull from. However, the academic year of 2012-2013 includes some graduates from 2012, and others from 2013. To get a better sense of our potential sample size, let's look at the number of individuals that graduated by month in 2012 and 2013 using the following query:

    SELECT awarddate, count(DISTINCT(hashed_ssn)) as num_individuals
    FROM ds_nj_oshe.dbo.completions
    WHERE awardyearn = 2012 OR awardyearn = 2013
    GROUP BY awarddate
    ORDER BY awarddate ;

In [None]:
# Exploration query on awarddate
qry <- "
SELECT awarddate, count(DISTINCT(hashed_ssn)) as num_individuals
FROM ds_nj_oshe.dbo.completions
WHERE awardyearn = 2012 OR awardyearn = 2013
GROUP BY awarddate
ORDER BY awarddate ;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

Notice that there are cases where leading zeros are missing. This can be corrected in SQL or in R, depending on whether the data frame has already been created. In SQL, the following code will add leading zeros to ensure that each `awarddate` is 6 digits:

    RIGHT('000000'+ISNULL(awarddate,''),6)
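
If the data frame has already been read into R, one possible equivalent is `str_pad()` from the `stringr` package (part of the tidyverse). The sketch below uses a few invented `awarddate` values to show the idea.

```r
library(stringr)

# Illustrative award dates; the first two have lost their leading zero
awarddate <- c("51512", "11513", "121512")

# Pad on the left with "0" until every value is 6 characters wide
new_awarddate <- str_pad(awarddate, width = 6, side = "left", pad = "0")
new_awarddate
# → "051512" "011513" "121512"
```

One difference from the SQL snippet above: `str_pad()` leaves missing values as `NA`, whereas the `ISNULL(awarddate,'')` version turns a null date into `'000000'`.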

In [None]:
# Exploration query on awarddate
qry <- "
SELECT right('000000'+ awarddate, 6) as new_awarddate, count(DISTINCT(hashed_ssn)) as num_individuals
FROM ds_nj_oshe.dbo.completions
WHERE awardyearn = 2012 OR awardyearn = 2013
GROUP BY right('000000'+ awarddate, 6)
ORDER BY right('000000'+ awarddate, 6) ;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)

Now that we have created a new column, `new_awarddate`, to adjust for the inconsistency of leading zeros, let's read the data into R and save the resulting data frame as `df`.

In [None]:
# read table into r and assign as df
qry <- "
select *, right('000000' + awarddate, 6) as new_awarddate
from ds_nj_oshe.dbo.completions 
where awardyearn in (2012, 2013)
"
df<-dbGetQuery(con, qry)

# see first few rows of df
head(df)

#### **Checkpoint 2: Explore the Data** 

Run the following queries in the notebook to better understand some key variables of interest:

    SELECT awardyearn, race_single, count(DISTINCT(hashed_ssn)) as num_individuals
    FROM ds_nj_oshe.dbo.completions
    GROUP BY awardyearn, race_single
    ORDER BY awardyearn ;

    SELECT awardyearn, citizenship, count(DISTINCT(hashed_ssn)) as num_individuals
    FROM ds_nj_oshe.dbo.completions
    GROUP BY awardyearn, citizenship
    ORDER BY awardyearn ;

Try running the code in the box below using different columns.  After reviewing this information, try to answer the following questions:
1. Do you see any obvious problems in the data? 
2. Are there null values in the data you are seeing?  Do you see any trends or patterns?
3. Are there years where you need to use a different column to represent the same or similar construct?
4. Are there cases where some values aren't as clean as you would have hoped? 

In [None]:
# Replace ___ in query to understand race/citizenship and year
qry <- "
SELECT awardyearn, ____, count(DISTINCT(hashed_ssn)) as num_individuals
FROM ds_nj_oshe.dbo.completions
GROUP BY awardyearn, ____
ORDER BY awardyearn ;
"

# Read in data frame but don't save the results
dbGetQuery(con,qry)
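
To follow up on the question about null values, one option is to count missing values per column and year in R. The sketch below assumes a small invented data frame (`toy`) with the `race_single` and `citizenship` columns; the category values are placeholders, not actual codes from the data dictionary.

```r
library(dplyr)

# Hypothetical rows with some missing values (codes are placeholders)
toy <- tibble(
  awardyearn  = c(2012, 2012, 2013, 2013),
  race_single = c("A", NA, "B", NA),
  citizenship = c("X", "X", NA, "Y")
)

# Count the number of NA values in each column, separately for each year
na_by_year <- toy %>%
  group_by(awardyearn) %>%
  summarize(across(c(race_single, citizenship), ~ sum(is.na(.x))))

na_by_year
```

A year in which a column is entirely missing is often a sign that the construct was recorded in a different column during that period.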

-----

## **5. Create the Cohort**

In this section, we will use the New Jersey `completions` table to create a sample of all students in NJ SURE participating institutions who earned a bachelor's degree during the 2012-13 academic year. This is not as easy as it looks. Even though we created a new variable to consistently track graduations, we still need to further limit the cohort from our original query that yielded `df` because the academic year is not the same as the calendar year.

In addition to establishing a time period, it is common to narrow your population down. Some research questions require you to select only certain graduates. Some questions focus on degree level (bachelor's degree recipients, for example), or major (health sciences, for example). When establishing your cohort, it is helpful to build an initial query iteratively, checking each restriction before adding others. To recall, our initial query is

    select *, right('000000' + awarddate, 6) as new_awarddate
    from ds_nj_oshe.dbo.completions 
    where awardyearn in (2012, 2013)
    
Let's keep track of the number of individuals we currently have in `df`.

In [None]:
# see number of individuals in df
df %>%
    summarize(
        num_inds = n_distinct(hashed_ssn)
    )

### Academic Year

As previously mentioned, we have not yet limited `df` to just include graduates from the 2012-13 *academic* year, which is defined by the last six months of 2012 and the first six months of 2013.

To isolate these graduates, we can `filter` `df` based on these requirements, as the month of graduation is now consistent in `new_awarddate`.

In [None]:
# isolate 2012-2013 academic year grads
# substring(variable, 1, 2) will isolate the first two characters of the variable
# | represents "or"
df <- df %>%
    filter(
        (awardyearn == 2012 & between(substring(new_awarddate, 1, 2), '07', '12')) |
        (awardyearn == 2013 & between(substring(new_awarddate, 1, 2), '01', '06'))
    )
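
To check the filter logic on a handful of hand-made rows, the sketch below applies the same July-to-June rule to an invented data frame. It assumes, as in the cell above, that the first two characters of `new_awarddate` are the month; the remaining digits are made up. Here the month is converted to an integer before the range check, which is one alternative to comparing the two-character strings directly.

```r
library(dplyr)

# Hypothetical graduates: only "b" (Dec 2012) and "c" (May 2013)
# fall inside the 2012-13 academic year
toy <- tibble(
  hashed_ssn    = c("a", "b", "c", "d"),
  awardyearn    = c(2012, 2012, 2013, 2013),
  new_awarddate = c("051512", "121512", "051513", "081513")
)

toy_ay <- toy %>%
  mutate(award_month = as.integer(substring(new_awarddate, 1, 2))) %>%
  filter(
    (awardyearn == 2012 & between(award_month, 7, 12)) |
    (awardyearn == 2013 & between(award_month, 1, 6))
  )

toy_ay$hashed_ssn
# → "b" "c"
```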

Let's see our breakdown of graduates by month now to confirm we properly filtered `df`.

In [None]:
# see count of grads by month
df %>%
    group_by(new_awarddate) %>%
    summarize(
        num_inds = n_distinct(hashed_ssn)
    )

### Bachelor's degree earners

Now that we have isolated all graduates in the 2012-13 academic year, let's turn our attention to bachelor's degree earners. According to the data documentation, a bachelor's degree is assigned codes from 300-399 inclusive for the `awardtype` variable. Before further subsetting `df`, let's take a look at `awardtype`.

In [None]:
# count number of graduates by awardtype
df %>%
    group_by(awardtype) %>%
    summarize(
        num_inds = n_distinct(hashed_ssn)
    )

There do not appear to be any inconsistencies in the length of the `awardtype` codes (such as the missing leading zeros we saw in `awarddate`), so let's go ahead and `filter` for bachelor's degree recipients.

In [None]:
# filter for bachelor's degree recipients
df <- df %>%
    filter(between(awardtype, '300', '399')) 
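
The length check matters because the filter above compares codes as text. A two-character code such as "31" would typically sort between "300" and "399" as a string and slip through. The sketch below, on invented codes, shows a length check followed by a numeric range filter as one possible safeguard.

```r
library(dplyr)

# Hypothetical codes; "31" stands in for a miscoded or truncated value
toy <- tibble(awardtype = c("300", "31", "350", "400"))

# Step 1: check how many characters each code has
toy %>% count(code_length = nchar(awardtype))

# Step 2: compare as numbers so the range check does not depend on text ordering
bachelors <- toy %>%
  mutate(awardtype_num = as.integer(awardtype)) %>%
  filter(between(awardtype_num, 300, 399))

bachelors$awardtype
# → "300" "350"
```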

#### **Checkpoint 3: Create Your Sample**
Starting with the `completions` table, create a sample of graduates from a different academic year and award level. Name the data frame `df_checkpoint`.

In [None]:
# Replace ____ 
qry <- "
select *, right('000000' + awarddate, 6) as new_awarddate
from ds_nj_oshe.dbo.completions 
where awardyearn in (___, ___)
"

# Read in data frame and save it as df_checkpoint
df_checkpoint <- dbGetQuery(con,qry)

df_checkpoint <- df_checkpoint %>%
    filter(___)

## **6. Link data across tables**

Now that we have identified a cohort, it is important to link that cohort to other tables to gain further insights. In this example, we will link the Classification of Instructional Programs (CIP) codes in our data frame to their names. First, we will write a query to read a CIP crosswalk into R so that we can better understand which subject each CIP code represents. We don't need all the data in the CIP table because we are using the most recent CIP codes. As a result, the query below only includes specific columns in the data frame.

In [None]:
# Query to bring in specific columns from the supplements_cipcode table
qry <- "
SELECT code_2010 as major, title_2010 as major_title, cip2, cip_family
FROM ds_nj_oshe.dbo.supplements_cipcode;
"

# Read in data frame and save it as df_cip
df_cip <- dbGetQuery(con,qry)

# see df_cip
glimpse(df_cip)

Next, we merge the columns from `df_cip` for all records in `df` as long as they have the same CIP code, as designated by the `major` column across the two data frames.  The joining statement in R is as follows:

    df <- df %>% 
        left_join(df_cip, by = "major")
    
A left join is used because we would like to retain all records in the left data frame (`df`) and only bring in matched records from the right data frame (`df_cip`). Conversely, a right join starts with all the records from the right table and only brings in matched records from the left table, and an inner join only includes records for which there is a match between both tables. Notice that if the same column name is used to match the data frames together, you only need to specify that one name after `by`. If the column names are different, you need to declare the column name from each data frame, for example `by = c("left_name" = "right_name")`.
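
The difference between join types is easiest to see on a tiny example. The sketch below uses two invented data frames, `grads` and `cip`, whose key columns have different names; the code "999999" and the titles are placeholders chosen so that one record has no match.

```r
library(dplyr)

# Hypothetical cohort records; "999999" has no match in the crosswalk
grads <- tibble(
  hashed_ssn = c("a", "b", "c"),
  major      = c("110101", "520201", "999999")
)

# Hypothetical crosswalk with a differently named key column
cip <- tibble(
  code_2010  = c("110101", "520201"),
  title_2010 = c("Title A", "Title B")
)

# Left join: keeps all three cohort records; the unmatched one gets NA
left <- left_join(grads, cip, by = c("major" = "code_2010"))

# Inner join: keeps only the two records with a match in both tables
inner <- inner_join(grads, cip, by = c("major" = "code_2010"))

nrow(left)   # → 3
nrow(inner)  # → 2
```

Checking for `NA` values in the joined columns after a left join is a quick way to spot codes that are missing from the crosswalk.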

> **NOTE:** In this example, we are going to use R to join data frames. As you will see in the next notebook, with larger tables, it is inefficient or at times not possible to bring extremely large tables into R.  As a result, the joins have to be done in SQL prior to bringing the data frame into R.

In [None]:
# Left join cohort to CIP code crosswalk data
df <- df %>% 
    left_join(df_cip, by = "major")

# See top records in the dataframe
head(df)

Now that the data frame is finalized, you can manipulate it further based on major. If, for example, your working group were only interested in New Jersey graduates with the CIP code "110101", the data frame could be filtered down to those records and saved.

In [None]:
# Subset the dataframe to specific major
df_compgrads <- df %>% filter(major == '110101')

Then you can see the major title associated with the CIP code. Of course, you can work the other way, where you `filter` for a specific `major_title` or `cip_family`.

In [None]:
# see major title
df_compgrads %>%
    count(major_title)

#### **Checkpoint 4: Add Columns**

Using the cohort that you created in the previous checkpoint, try to join data from the `supplements_cip2010` table and further subset your data frame to a specific major. Save the resulting data frame as `df_checkpoint_major`. If the data you selected was from before 2010, then use CIP 2000 codes.

In [None]:
# Add in cip xwalk
df_checkpoint <- df_checkpoint %>%
    left_join(df_cip, by = "major")

# Replace ___ with a major. See https://nces.ed.gov/ipeds/cipcode outside of the ADRF to select a cipcode.
df_checkpoint_major <- df_checkpoint %>%
    filter(___)

# See top records in the dataframe
head(df_checkpoint_major)

-----

## **7. Higher Education Graduate Cohort Count and Descriptive Statistics**

Next we will run some statistics to understand how data are structured in the cohort data frame. Recall that you had already made a data frame on your own, `df`. Here, we will use the same data frame to further examine the data elements. Recall that each record in our data frame does not represent a person. Each record represents an award or credential (degree or certificate). You can see this by comparing the number of rows, or awards, with the number of graduates.

In [None]:
# compare number of rows to grads
df %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(hashed_ssn)
    ) 

The difference between awards and graduates arises because a subset of students earn two degrees, a certificate and a degree, or multiple short-cycle certificates in an academic year. When framing your higher education cohort, you need to decide whether to de-duplicate the file (by selecting the highest award, for example) or to focus on one level of degree. Workforce outcomes are going to be very different for certificate holders without a degree compared to those with associate's degrees. Those with bachelor's or graduate degrees are expected to have higher incomes. Recall that in this example, we have restricted our cohort to bachelor's degree holders. Three decisions are required to accurately form the cohort:

1. What time period are you using to define the cohort (one academic year or multiple academic years)?
2. What degree level or levels will you focus on?
3. What will you do with duplicate records?

In this example, we will review the duplicates to help inform decisions.
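
As an aside on the third decision, the sketch below shows what "selecting the highest award" could look like in `dplyr`. It rests on a loud assumption that is not confirmed by the data documentation quoted here: that a larger `awardtype` code means a higher award level. The rows are invented, and you would need to check the data dictionary before using this approach.

```r
library(dplyr)

# Hypothetical awards: person "a" has two awards with different codes
awards <- tibble(
  hashed_ssn = c("a", "a", "b"),
  awardtype  = c("150", "310", "320")
)

# ASSUMPTION: a larger awardtype code means a higher award level
highest <- awards %>%
  mutate(awardtype_num = as.integer(awardtype)) %>%
  group_by(hashed_ssn) %>%
  slice_max(awardtype_num, n = 1, with_ties = FALSE) %>%
  ungroup()

highest$awardtype
# → "310" "320"
```

Setting `with_ties = FALSE` guarantees one row per person even when two awards share the same code.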

### Duplicates code

The series of commands below help identify duplicates, create a data frame of duplicates, and list the results. Before we de-duplicate the files, let's save a degree-level (not person-level) file for further analysis.  We will name this data frame `df_awards`.

In [None]:
# copy df as df_awards
df_awards <- df

Next we can start to explore the duplicates. First, we will identify a case of duplication, which we can isolate by counting the number of occurrences of each `hashed_ssn` in `df`, and then finding the `hashed_ssn` with the highest number of occurrences.

In [None]:
# find duplicate example
dup_ex <- df %>%
    count(hashed_ssn) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

From here, we can find all rows in `df` with the `hashed_ssn` in `dup_ex` so we can further explore a duplicated example. We will select certain variables to highlight the duplication.

In [None]:
# see all duplicated rows in example
df %>%
    filter(hashed_ssn == dup_ex$hashed_ssn) %>%
    select(hashed_ssn, new_awarddate, instcode, major, major_title, cip2)

Sometimes the two-digit CIP family code is the same, or nearly the same, across all of a person's duplicate records, and the double or triple degrees are closely related (for example, Accounting and Finance, or Business and Marketing). Other records are true duplicates based on the columns we selected. We will assume that most of a person's degrees fall in similar two-digit CIP families, so we de-duplicate by keeping the most recent record (or one of the most recent if there are multiple). To do so, we will first sort `df` so that, for each `hashed_ssn`, the first row is one of the most recent degrees. From there, we can use `distinct` to isolate the first row within each `hashed_ssn`.

In [None]:
# unduplicate cohort
df <- df %>%
    arrange(hashed_ssn, desc(awardyearn), desc(new_awarddate)) %>%
    distinct(hashed_ssn, .keep_all = TRUE)

# compare number of rows to grads
df %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(hashed_ssn)
    ) 
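
To see why sorting before `distinct` matters, the sketch below runs the same two steps on three invented rows. The `major` values are placeholders that make it easy to see which row survives for each person.

```r
library(dplyr)

# Hypothetical awards: person "a" has an older (2012) and a newer (2013) record
toy <- tibble(
  hashed_ssn    = c("a", "a", "b"),
  awardyearn    = c(2012, 2013, 2013),
  new_awarddate = c("121512", "051513", "051513"),
  major         = c("m1", "m2", "m3")
)

# Sort so the most recent record comes first within each person,
# then keep only the first row per hashed_ssn
dedup <- toy %>%
  arrange(hashed_ssn, desc(awardyearn), desc(new_awarddate)) %>%
  distinct(hashed_ssn, .keep_all = TRUE)

dedup$major
# → "m2" "m3"
```

Without the `arrange` step, `distinct` would keep whichever row happened to appear first, which for person "a" would be the older 2012 record.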

#### **Checkpoint 5: Explore duplicates and remove for your cohort**

For your data frame `df_checkpoint`, explore and come up with a strategy for removing all duplicates.

In [None]:
# replace ___ with code
dup_ex <- __ %>%
    count(hashed_ssn) %>%
    arrange(desc(n)) %>%
    head(1)

# see example
dup_ex

In [None]:
# see all duplicated rows in example
___ %>%
    filter(hashed_ssn == dup_ex$___) %>%
    select(___)

In [None]:
# unduplicate cohort
___ <- ___ %>%
    arrange(hashed_ssn, ___) %>%
    distinct(hashed_ssn, .keep_all = TRUE)

# compare number of rows to grads
___ %>% 
    summarize(
        awards=n(), 
        graduates=n_distinct(hashed_ssn)
    ) 

-----

## **8. Exploratory Analysis of the Cohort**

In this section we will find out more about our 2012-13 graduating cohort. We will begin by isolating the top 10 majors using the two digit CIP code. 

From there we will look to see if there are differences by sex. Understanding these patterns is an important part of understanding potential disparities in employment outcomes.


Up to this point, we have identified our cohort (`df`) and removed all duplicates so that it is a person-level file.

### Major Groupings

Let's start by comparing `major_title` with `cip_family` to decide how granular we want our major groups to be.

In [None]:
# see difference in number of distinct values between major title and cip family
df %>%
    summarize(
        num_cip_fam = n_distinct(cip_family),
        num_major_title = n_distinct(major_title)
    )

For the sake of this analysis, we will use `cip_family`, or the two-digit CIP codes as we continue our analysis by major. Let's find the 10 most common majors in the cohort.

In [None]:
# 10 most common majors
df %>%
    count(cip_family) %>%
    arrange(desc(n)) %>%
    head(10)

Does this list surprise you? For perspective, we will add in another column tracking the proportion of graduates by major.

In [None]:
# 10 most common majors with proportion
df %>%
    count(cip_family) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(10)
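
The order of `mutate` and `head` in this cell is deliberate. The sketch below, on six invented rows, contrasts computing the proportion before and after trimming to the top rows.

```r
library(dplyr)

# Hypothetical cohort of six graduates in three CIP families
toy <- tibble(cip_family = c("A", "A", "A", "B", "B", "C"))

# Proportion computed BEFORE head(): share of the whole cohort
before <- toy %>%
  count(cip_family) %>%
  arrange(desc(n)) %>%
  mutate(prop = n / sum(n)) %>%
  head(2)

# Proportion computed AFTER head(): share among the top rows only
after <- toy %>%
  count(cip_family) %>%
  arrange(desc(n)) %>%
  head(2) %>%
  mutate(prop = n / sum(n))

sum(before$prop)  # less than 1, because family "C" is part of the denominator
sum(after$prop)   # exactly 1, because only the top rows are in the denominator
```

Computing `prop` first, as the notebook does, reports each major's share of all graduates rather than its share of the top 10.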

Because we hope to build off of this cursory subgroup analysis in later notebooks, let's save the resulting data frame to `df_common_major`.

In [None]:
# save the 10 most common majors with proportion
df_common_major <- df %>%
    count(cip_family) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(10)

### Sex

Additionally, we can look at the sex breakdown within the cohort using the `sex` variable.

In [None]:
# sex breakdown
df_sex <- df %>%
    count(sex) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    )

# see df_sex
df_sex

### Top Majors by Sex

Let's look at the major breakdown by sex: do the most common majors differ among sex groups? Since we are looking at proportions and counts within multiple combinations of subgroups (`cip_family` and `sex`), we need to adjust the code from above a bit. First, we calculate the proportion of observations within each `sex`, hence the `group_by`. Second, we replace `head` with `slice` to retrieve the top 10 majors within each `sex` value, instead of returning the top 10 rows as ordered by `sex`.

In [None]:
# major/sex breakdown
df_major_sex <- df %>%
    count(cip_family, sex) %>%
    arrange(desc(n)) %>%
    group_by(sex) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    arrange(sex, desc(n)) %>%
    slice(1:10)

df_major_sex
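
The sketch below illustrates the difference between `head` and a grouped `slice` on a few invented counts. The `sex` labels "F" and "M" and the family names are placeholders, and the top 2 rows are used instead of the top 10 to keep the example small.

```r
library(dplyr)

# Hypothetical counts by sex and CIP family
toy <- tibble(
  sex        = c("F", "F", "F", "M", "M"),
  cip_family = c("A", "B", "C", "A", "B"),
  n          = c(5, 3, 1, 4, 2)
)

# Grouped slice: top 2 majors WITHIN each sex, with within-sex proportions
top2 <- toy %>%
  group_by(sex) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(sex, desc(n)) %>%
  slice(1:2) %>%
  ungroup()

# head(): simply the first 2 rows overall, both from the first sex value
first2 <- toy %>%
  arrange(sex, desc(n)) %>%
  head(2)

top2
first2
```

Here `top2` has four rows (two per group), while `first2` only ever returns rows from the first group in sort order.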

#### **Checkpoint 6: Common Majors and Sex**

Using your own data frame, `df_checkpoint`, identify the 10 most common majors overall and by sex. Save these results to `df_checkpoint_common_major` and `df_checkpoint_major_sex`, respectively.

Do your results vary drastically from those derived from `df`?

In [None]:
# find common major
df_checkpoint_common_major <- df_checkpoint %>%
    ___

df_checkpoint_common_major

In [None]:
# find most common majors by sex
df_checkpoint_major_sex <- df_checkpoint %>%
    ___

df_checkpoint_major_sex

## **9. Export Results to .csv Files**

Now you have successfully finished defining a cohort and a quick subgroup analysis! The last step is to save your results in .csv files so that we can re-use these results in future notebooks. 

<font color=red> Note that you need to change the directory in the write_csv() statements below. Replace ".." with the name of your user folder (FirstName.LastName.UserID).</font>

In [None]:
# Save dataframes to CSV to use in later notebook

# most common majors
write_csv(df_common_major, "U:\\..\\NJ Training\\Results\\common_major.csv")

# sex breakdown
write_csv(df_sex, "U:\\..\\NJ Training\\Results\\common_sex.csv")

# most common majors by sex
write_csv(df_major_sex, "U:\\..\\NJ Training\\Results\\common_major_sex.csv")
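
As an optional check, you can read a saved file back in to confirm it was written as expected. The sketch below writes an invented data frame to a temporary folder (via `tempdir()`) so that it runs anywhere; in the ADRF you would point `results_dir` at your own "Results" folder instead. The names `results_dir`, `out_path`, and `toy` are illustrative only.

```r
library(readr)
library(tibble)

# Illustrative output folder; replace with your own Results folder in practice
results_dir <- file.path(tempdir(), "Results")
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)
out_path <- file.path(results_dir, "common_major.csv")

# Hypothetical summary table
toy <- tibble(cip_family = c("A", "B"), n = c(3L, 2L))

# Write the file, then read it back to confirm the round trip
write_csv(toy, out_path)
check <- read_csv(out_path, col_types = cols())

check
```

Building paths with `file.path()` avoids having to escape backslashes by hand.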