<center><br><br>
    Arkansas Work-Based Learning to Workforce Outcomes <br>
    Applied Data Analytics Training | Spring 2022
    <h1> Characterizing Demand: Descriptive Analysis Checkpoints </h1>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center> Joshua Edelmann, Rukhshan Arif Mian, Benjamin Feder</center>
</center>

***

## Introduction

The purpose of this checkpoint notebook is to apply the methods we used in `04_Characterizing_Demand_Beginner.ipynb` to your cohort by utilizing a measure of your choice. 

In the checkpoint notebooks for `02_Creating_a_cohort.ipynb`, we asked you to create and save your cohort as an SQL table. You will be utilizing the cohort you created as part of this checkpoint notebook. 

At each checkpoint, you will be replacing the `___` with the appropriate variable, function or R code snippet. 

You are encouraged to attempt the checkpoints on your own. That said, hints and suggested solutions are available and can be accessed with the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, # refers to the checkpoint number. For example: we can access the hint and solution for Checkpoint 2 using: `check_2.hint()` and `check_2.solution()`, respectively. 

> Note: This checkpoint notebook has been created by keeping a cohort of apprenticeship graduates in mind. We encourage you to reach out to your team facilitator to learn more about the methods you can use to characterize demand if your cohort is defined differently, such as by apprenticeship starters. Also, the code for accessing hints and solutions is currently commented out – in order for the cells to run, you will need to uncomment them first. 

## Workspace Set Up

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# for as.yearqtr()
suppressMessages(library(zoo))

# Switching warnings back on
options(warn = 0)

source('04_Characterizing_Demand_Beginner_checkpoints_hints_solutions.txt')

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Reading in Data

In [None]:
# read in employer measures table
qry <- "
SELECT * 
FROM tr_ar_2022.dbo.employer_yearly_agg
"

employer_measures <- dbGetQuery(con, qry)

head(employer_measures)

We will create one additional measure: the ratio of full-quarter employees to the total number of employees.

In [None]:
# create new measure
employer_measures <- employer_measures %>% 
    mutate(ratio_full_total = avg_full_num_employed/avg_num_employed)
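
A ratio like this is undefined for any employer-year in which the denominator is zero. We have not checked whether such rows exist in the employer measures table, so the following is only a sketch of one way to guard against them. It uses a small made-up data frame, `toy_measures`, in place of the real table.

```r
library(dplyr)

# Toy data standing in for employer_measures (values are made up)
toy_measures <- tibble(
    federal_ein = c("A", "B", "C"),
    avg_full_num_employed = c(8, 0, 0),
    avg_num_employed = c(10, 4, 0)
)

toy_measures <- toy_measures %>%
    # return NA when no one was employed, instead of letting 0/0 produce NaN
    mutate(ratio_full_total = if_else(avg_num_employed > 0,
                                      avg_full_num_employed / avg_num_employed,
                                      NA_real_))

toy_measures
```

If you adopt a guard like this, remember that later summary functions such as `mean()` will need `na.rm = TRUE` to skip the missing ratios.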

## Checkpoint 1

For this checkpoint, we ask you to pull in data from the cohort you created in the `02_Creating_a_cohort_checkpoints.ipynb` notebook or the cohort you have created for your research project. There are two blanks to fill in: the first with the name of your cohort table and the second with your year of interest.

If your cohort contains individuals from multiple years, we ask that you only focus on one year's worth of data for this notebook.

### Linked Cohort-Wages data (first year after completion)

In [None]:
# using dimensional model to get primary employer information
qry <- "
SELECT
F.Quarter_ID - P.Apprenticeship_End_Quarter_ID AS Quarters_Relative_to_Completion,
P.Person_ID,
F.Primary_Employer_Wages ,
PE.Federal_EIN
FROM 
tr_ar_2022.dbo.___ C --COHORT
JOIN tr_ar_2022.dbo.AR_MDIM_Person P ON (P.Apprentice_Number=C.apprnumber) --PERSON
JOIN tr_ar_2022.dbo.AR_FACT_Quarterly_Observation F --QUARTERLY OBSERVATION FACT
	ON (P.Person_ID=F.Person_ID) 
	AND (F.Quarter_ID BETWEEN (P.Apprenticeship_End_Quarter_ID) AND (P.Apprenticeship_End_Quarter_ID+4))  --QTRS POST COMPLETION
JOIN tr_ar_2022.dbo.AR_RDIM_NAICS_National_Industry NNI ON (P.Apprenticeship_NAICS_National_Industry_ID=NNI.NAICS_National_Industry_ID) --APPRENTICESHIP INDUSTRY
JOIN tr_ar_2022.dbo.AR_MDIM_Employer PE ON (PE.Employer_ID=F.Primary_Employer_ID)  --PRIMARY EMPLOYER
WHERE P.Apprenticeship_Completer='Y' and YEAR(C.exitwagedt) = ___ --RESTRICT COHORT YEAR
"

cohort_wages_empr <- dbGetQuery(con, qry)

head(cohort_wages_empr)

In [None]:
# hint
# check_1.hint()

In [None]:
# solution
# check_1.solution()

Count the number of unique individuals from our cohort who were employed at the time of completion, along with the number of unique employers that primarily employed them.

In [None]:
# save data frame of employment info at time of completion
cohort_wages_empr_comp <- cohort_wages_empr %>% 
    filter(Quarters_Relative_to_Completion == 0)

# see summary stats
cohort_wages_empr_comp %>% 
    summarize(unique_indiv = n_distinct(Person_ID),
              unique_empr = n_distinct(Federal_EIN))

## Checkpoint 2: Selecting Employer Measure

Now, you will select an employer measure with which you will identify groups of employers and then identify how those groups correspond with an employment outcome of interest. Please select one employer measure from the following list:

**Firm characteristics**:

- Total Payroll
- Average earnings per employee 
- Average full quarter earnings per employee
- Earnings per employee at 25th percentile
- Earnings per employee at 75th percentile
- Total full quarter employment
- Total employment

**Opportunity**:

- Number of new hires
- Employment growth rate
- Hiring growth rate

**Stability**:

- Separation growth rate (this is calculated using the same formula as above)
- Number of new hires who become full quarter employees (hired in t-1 whom we see in t+1)
- Ratio of full quarter employees to total number of employees 

In [None]:
employer_measures <- employer_measures %>%
    mutate(measure = _____)

In [None]:
# hint
# check_2.hint()

In [None]:
# solution
# check_2.solution()

## Checkpoint 3: Filtering employer measures for a specific year

In this checkpoint, we ask you to fill in the blank below with the year for which you want to keep the employer measures. We recommend the year prior to your cohort year: assuming a cohort defined by completion, this is ideally the most recent information cohort members could have used before completing their apprenticeship.

In [None]:
select_year = ____

In [None]:
# hint
# check_3.hint()

In [None]:
# solution
# check_3.solution()

In [None]:
employer_measures_cohort <- employer_measures %>%
    # filtering on employer number and year
    filter(federal_ein %in% cohort_wages_empr_comp$Federal_EIN, 
           year == select_year)

In [None]:
# Count the number of unique employers in this new data frame
employer_measures_cohort %>% 
    summarize(unique_emp = n_distinct(federal_ein))
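
If this count is lower than the number of unique employers found at the time of completion, some cohort employers have no row in the employer measures table for the selected year. The sketch below shows one way to list them with `anti_join()`. The data frames `toy_cohort` and `toy_measures` and the year 2016 are made up for illustration; substitute **cohort_wages_empr_comp**, **employer_measures**, and `select_year` in practice.

```r
library(dplyr)

# Toy stand-ins (made-up values) for cohort_wages_empr_comp and employer_measures
toy_cohort <- tibble(Person_ID = 1:4,
                     Federal_EIN = c("A", "A", "B", "D"))
toy_measures <- tibble(federal_ein = c("A", "B", "C"),
                       year = c(2016, 2016, 2016))

# cohort employers with no row in the measures table for the chosen year
toy_unmatched <- toy_cohort %>%
    distinct(Federal_EIN) %>%
    anti_join(toy_measures %>% filter(year == 2016),
              by = c("Federal_EIN" = "federal_ein"))

toy_unmatched
```

Individuals whose primary employer appears in this list will drop out of any later `inner_join()` against the employer measures.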

## Extended Analysis

The rest of this notebook directly follows the code in `04_Characterizing_Demand_Beginner.ipynb`. Feel free to run through it - there will not be any more checkpoints in this notebook past this point.

In [None]:
# getting the mean, median, standard deviation, min and max values for your measure of interest
employer_measures_cohort %>%
    summarize(mean_emp = mean(measure), 
             median_emp = median(measure), 
             sd_emp = sd(measure), 
             min_emp = min(measure), 
             max_emp = max(measure))

Next, you will construct the same summary statistics for all employers who did not primarily employ individuals from your cohort, so that you can compare them to the statistics you created for employers who did.

In [None]:
employer_measures %>%
    filter(!federal_ein %in% cohort_wages_empr$Federal_EIN, year == select_year) %>%
      summarize(
          mean_emp = mean(measure), 
          median_emp = median(measure), 
          sd_emp = sd(measure), 
          min_emp = min(measure), 
          max_emp = max(measure)
      ) 
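
The two sets of summary statistics can also be placed side by side in a single table by flagging each employer and grouping on the flag. This is only a sketch on made-up data: `toy_measures`, `toy_cohort_eins`, and the year 2016 are hypothetical stand-ins for **employer_measures**, the cohort's employer identifiers, and `select_year`.

```r
library(dplyr)

# Toy stand-in for employer_measures (made-up values)
toy_measures <- tibble(
    federal_ein = c("A", "B", "C", "D", "E"),
    year = 2016,
    measure = c(10, 20, 30, 40, 100)
)

# Toy stand-in for the employer identifiers linked to the cohort
toy_cohort_eins <- c("A", "C")

toy_compare <- toy_measures %>%
    filter(year == 2016) %>%
    # flag employers that primarily employed someone in the cohort
    mutate(employs_cohort = federal_ein %in% toy_cohort_eins) %>%
    group_by(employs_cohort) %>%
    summarize(n_empr = n(),
              mean_emp = mean(measure),
              median_emp = median(measure))

toy_compare
```

Keeping both groups in one data frame makes it easier to compare the rows directly or to pass the result to a plot.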

Now, you will identify categories for High, Medium and Low based on your measure of interest. These are defined as follows:

- High: >= 75th percentile of `measure`
- Medium: > 25th percentile and < 75th percentile of `measure`
- Low: <= 25th percentile of `measure`

Use R's `quantile` function to get the 25th and 75th percentiles of `measure`, then assign them to `p25` and `p75`, respectively. If you have different groupings in mind, please feel free to use them instead; we left this step out of the checkpoints because custom groupings are difficult to incorporate into one.

In [None]:
# use pull() to isolate the variable of interest in a vector
p <- employer_measures %>%
    filter(year == select_year) %>%
    pull(measure) %>%
    quantile(probs = c(.25, .75))

p

The first value corresponds to the 25th percentile and the second value corresponds to the 75th percentile.

In [None]:
# extracting 25th and 75th percentile 
p25 <- p[1]
p75 <- p[2]

We define a categorical variable, **measure_cat**, by taking into account the 25th and 75th percentiles. 

In [None]:
employer_measures_cat <- employer_measures %>%
    filter(year == select_year) %>%
    mutate(measure_cat = case_when(
                                # Low: <= 25th percentile
                                measure <= p25 ~ "Low", 
                                # Medium: > 25th and < 75th percentile
                                measure > p25 & measure < p75 ~ "Medium", 
                                # High: >= 75th percentile
                                TRUE ~ "High")
          ) %>%
    select(federal_ein, measure_cat)

# see number of employers in the year within each bin--should be roughly 25% high, 25% low, and 50% in the middle
table(employer_measures_cat$measure_cat)

head(employer_measures_cat)
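
If you would rather use a different grouping than the 25th/75th percentile bins above, one possible alternative is three roughly equal-sized groups (terciles). The sketch below uses dplyr's `ntile()` on a made-up data frame, `toy_measures`; it illustrates the idea and is not the grouping used in the rest of this notebook.

```r
library(dplyr)

# Toy stand-in for one year of employer_measures (made-up values)
toy_measures <- tibble(
    federal_ein = LETTERS[1:9],
    measure = c(5, 1, 9, 3, 7, 2, 8, 4, 6)
)

toy_terciles <- toy_measures %>%
    # ntile() ranks the values and splits them into 3 buckets of (roughly) equal size
    mutate(measure_cat = c("Low", "Medium", "High")[ntile(measure, 3)]) %>%
    select(federal_ein, measure_cat)

table(toy_terciles$measure_cat)
```

Note that `ntile()` splits on rank rather than on a threshold, so employers with tied values can land in different groups. The percentile cutoffs used above do not have that property.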

With our categorical variable of interest defined, we can link **employer_measures_cat** to **cohort_wages_empr_comp** to add the measure category to your primary employer-employee data at the time of apprenticeship completion.

The resulting data frame gives us the category (High/Medium/Low) of your chosen measure into which each apprenticeship completer's first primary employer falls.

> Note: We will only need the **Person_ID** and **measure_cat** variables moving forward, so we will explicitly `select()` them after performing the join.

In [None]:
cohort_wages_merged_measure_cat <- inner_join(cohort_wages_empr_comp, 
                                              employer_measures_cat, 
                                              by=c("Federal_EIN" = "federal_ein")) %>%
    select(Person_ID, measure_cat)

head(cohort_wages_merged_measure_cat)

At this point, we have created a lookup table, **cohort_wages_merged_measure_cat**, which tracks those in the cohort who were employed at the time of their apprenticeship completion and the measure category of their primary employer at that time. Now, we can combine this with the future employment information for these individuals, which is saved in **cohort_wages_empr**.

We can do so by leveraging the `inner_join()` function so that we only include employment histories for individuals who were employed in the quarter of completion and whose primary employers were tracked in our employer measures table in your year of interest. We perform this join on **Person_ID** alone, and not on a combination of **Person_ID** and **Federal_EIN**, because we are interested in seeing *any* form of employment as opposed to looking at employment with the same primary employer. Our goal is to look at initial employment after graduation and how it affects our cohort's trajectories over the next year.

In [None]:
emp_empr_matches_measure <- cohort_wages_empr %>%
    inner_join(cohort_wages_merged_measure_cat, by="Person_ID") %>%
    select(Quarters_Relative_to_Completion, measure_cat, Person_ID)

head(emp_empr_matches_measure)

Next, we will calculate the percentage of our cohort employed in each quarter by the type (high/medium/low) of primary employer they were employed by after they completed an apprenticeship.

> Note: In the Firm Characteristics section, we looked at the number of quarters an individual worked at the same employer. Here, we are looking at each quarter and calculating the percentage of individuals within each group. 

In [None]:
# count number of people within each measure_cat and Quarters_Relative_to_Completion subgroup
# can use count() again because there is one row per person/quarter combination
df_counts <- emp_empr_matches_measure %>% 
    count(measure_cat, Quarters_Relative_to_Completion)

df_counts

From here, we will create a mini-data frame that stores the initial counts of those employed in the quarter of apprenticeship completion based on the measure category of their primary employer.

In [None]:
# store values for each measure_cat group at Quarters_Relative_to_Completion = 0
first_ob <- df_counts %>% 
    filter(Quarters_Relative_to_Completion == 0) %>%
    # renaming "n" column to differentiate from that in df_counts
    rename(n_start = n) %>% 
    # don't need Quarters_Relative_to_Completion since it is filtered to 0
    select(-Quarters_Relative_to_Completion)

first_ob

Now, we can join this mini-data frame **first_ob** back to **df_counts** to calculate the percentage employed based on the measure category of their primary employer at the time of their apprenticeship completion.

In [None]:
df_counts_prop <- df_counts %>%
    inner_join(first_ob, by="measure_cat") %>%
    mutate(prop = n/n_start)

df_counts_prop
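
To make the arithmetic concrete: if 40 people start in the "High" group and 30 of them have earnings two quarters later, the proportion for that quarter is 30/40 = 0.75. The sketch below reproduces this on made-up counts (`toy_counts`) and shows an alternative to the join that uses a grouped `mutate()`. It assumes there is exactly one quarter-0 row per category.

```r
library(dplyr)

# Toy stand-in for df_counts (made-up counts)
toy_counts <- tibble(
    measure_cat = rep(c("High", "Low"), each = 3),
    Quarters_Relative_to_Completion = rep(0:2, times = 2),
    n = c(40, 36, 30, 20, 15, 10)
)

toy_prop <- toy_counts %>%
    group_by(measure_cat) %>%
    # within each category, the baseline is the count in the completion quarter
    mutate(n_start = n[Quarters_Relative_to_Completion == 0],
           prop = n / n_start) %>%
    ungroup()

toy_prop
```

Both approaches should give the same proportions; the join makes the baseline counts available as their own data frame, while the grouped version keeps everything in one pipeline.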

Lastly, we visualize our results to understand how your measure category impacts trajectories of apprenticeship completers after they finish their training. 

In [None]:
df_counts_prop %>%
    ggplot(aes(x = Quarters_Relative_to_Completion, y = prop, color = measure_cat)) +  
    geom_line()
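
The default plot can be made easier to read with axis labels and a percent scale. The following is a sketch on made-up data (`toy_prop`), and the label wording is our own suggestion; it uses `percent_format()` from the scales package.

```r
library(ggplot2)
library(scales)

# Toy stand-in for df_counts_prop (made-up proportions)
toy_prop <- data.frame(
    measure_cat = rep(c("High", "Low"), each = 3),
    Quarters_Relative_to_Completion = rep(0:2, times = 2),
    prop = c(1, 0.9, 0.75, 1, 0.75, 0.5)
)

toy_plot <- ggplot(toy_prop,
                   aes(x = Quarters_Relative_to_Completion, y = prop, color = measure_cat)) +
    geom_line() +
    # mark the observed quarters so the lines are not read as continuous data
    geom_point() +
    # show proportions as whole percentages on a fixed 0-100% axis
    scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
    labs(x = "Quarters relative to completion",
         y = "Share of group employed",
         color = "Employer measure category")

toy_plot
```

Fixing the y-axis at 0 to 100% keeps differences between groups in proportion; remove `limits` if the lines are too close together to distinguish.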