<center><br><br>
    <h4>TANF Data Collaborative </h4>
    <h4>Applied Data Analytics Training | Spring 2022</h4>
    <h1>Characterizing Demand: Descriptive Analysis Checkpoints</h1>
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Maryah Garner, Allison Nunez, Rukhshan Arif Mian, Benjamin Feder</center>
</center>

<br>

***

# Introduction:

The purpose of this checkpoint notebook is to apply the methods we used in `04_Characterizing_Demand_Beginner.ipynb` to your cohort by utilizing a measure of your choice. 

In the checkpoint notebooks for `02_Creating_a_cohort.ipynb`, we asked you to create and save your cohort as an SQL table. You will be utilizing the cohort you created as part of this checkpoint notebook. 

> Note: This checkpoint notebook has been created by keeping a TANF exit cohort in mind. We encourage you to reach out to your team facilitator to learn more about the methods you can use to Characterize Demand if your cohort focuses on TANF entry.

At each checkpoint you will be replacing the `___` with the appropriate variable, function or R code snippet. 

Participants are encouraged to attempt the checkpoints on their own. Having said that, hints and suggested solutions are provided and these can be accessed by utilizing the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, `#` refers to the checkpoint number. For example, we can access the hint and solution for Checkpoint 2 using `check_2.hint()` and `check_2.solution()` respectively. Note: The code for accessing hints and solutions is currently commented out; to run it, you need to uncomment it first. 


# Setting up workspace

#### Read in Libraries

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# for as.yearqtr()
suppressMessages(library(zoo))

# Switching on warnings
options(warn = 0)

source('04_Characterizing_Demand_Beginner_checkpoints_hints_solutions.txt')

#### Establishing Database Connection

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

# Reading in Data

#### Employer Measures

In [None]:
qry <- "
SELECT * 
FROM tr_tdc_2022.dbo.employer_yearly_agg
"

employer_measures <- dbGetQuery(con, qry)

head(employer_measures)

## Checkpoint 1

For this checkpoint, we ask you to pull in data from the cohort you created in the `02_Creating_a_cohort_checkpoints.ipynb` notebook or the cohort you have created for your research project. You will also need to fill in the year-quarter (`yr_quarter`) value(s) for the first quarter after exit.

#### Linked Cohort-Wages data (first quarter after exit)

In [None]:
qry <- "SELECT nb.ssn, wr.Empr_no, wr.Year, wr.Quarter, wr.Wage, wr.yr_quarter 
    FROM 
    tr_tdc_2022.dbo.____ nb
    INNER JOIN 
    (
        select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('___')
    ) wr
    ON wr.SSN=nb.SSN
"

cohort_wages_empr <- dbGetQuery(con,qry)

In [None]:
# hint
# check_1.hint()

In [None]:
# solution
# check_1.solution()
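
As a hedged illustration of how the first quarter after exit could be worked out, the sketch below uses `as.yearqtr()` from `zoo`. The exit date and the `"2018Q2"`-style label are assumptions made for illustration only; check how `yr_quarter` is actually stored in `wages_tanf` before filling in the blank.

```r
suppressMessages(library(zoo))

# Hypothetical exit date, for illustration only
exit_date <- as.Date("2018-02-15")

# Quarter of exit; a yearqtr value advances one quarter per 0.25
exit_qtr <- as.yearqtr(exit_date)
first_qtr_after_exit <- exit_qtr + 0.25

# Assumed label format: four-digit year, "Q", quarter number
first_qtr_label <- format(first_qtr_after_exit, "%YQ%q")
first_qtr_label
```

If your cohort has several exit quarters, the same arithmetic can be applied to each of them to build the list of values for the `IN (...)` clause.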

## Checkpoint 2: 

In this checkpoint, we ask you to update the code below to read in your cohort and link it with wage records for the year-quarter value(s) of interest. You will need to pull in data for the specific interval over which you intend to do your analysis. We have included 4 blank spaces for you to fill in, but feel free to pull as many year-quarter (`yr_quarter`) values as necessary for your analysis. 

#### Cohort Wages 

We use the following code to read in cohort wages for the first 4 quarters (one year) after exit. This is similar to what we saw in the `03_Linkage_and_Longitudinal_Analysis` notebooks. 

In [None]:
qry <- "SELECT nb.ssn, wr.Empr_no, wr.Year, wr.Quarter, wr.Wage, wr.yr_quarter 
    FROM 
    tr_tdc_2022.dbo.___ nb
    INNER JOIN 
    (
        select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('____', '____', '____', '____')
    ) wr
    ON wr.SSN=nb.SSN
"

cohort_wages <- dbGetQuery(con, qry)

head(cohort_wages)

In [None]:
# hint
# check_2.hint()

In [None]:
# solution
# check_2.solution()
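
If you pull many quarters, typing each quoted value by hand becomes error-prone. A minimal sketch, assuming your quarters are held in a character vector (the values and their label format below are hypothetical), builds the `IN (...)` list with base R:

```r
# Hypothetical year-quarter values; the label format is an assumption
quarters <- c("2018Q2", "2018Q3", "2018Q4", "2019Q1")

# Wrap each value in single quotes and separate the values with commas
in_clause <- paste0("'", quarters, "'", collapse = ", ")

# Insert the list into the WHERE clause of the subquery
where_sql <- sprintf("WHERE yr_quarter IN (%s)", in_clause)
where_sql
```

Pasting values into SQL text like this is only reasonable for values you typed yourself; it is not a safe pattern for input that comes from elsewhere.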

# Employer Measures 

In this notebook, you will select an employer measure with which you will identify groups of employers and then identify how those groups correspond with an employment outcome of interest. 



## Checkpoint 3: Selecting Employer Measure

Please select one employer measure from the following list:

**Firm characteristics**:

- Total Payroll
- Average earnings per employee 
- Average full quarter earnings per employee
- Earnings per employee at 25th percentile
- Earnings per employee at 75th percentile
- Total full quarter employment
- Total employment

**Opportunity**:

- Number of new hires
- Employment growth rate
- Hiring growth rate

**Stability**:

- Separation growth rate (this is calculated using the same formula as above)
- Number of new hires who become full quarter employees (hired in t-1 whom we see in t+1)
- Ratio of full quarter employees to total number of employees 

In [None]:
employer_measures <- employer_measures %>%
    mutate(measure = ____)

In [None]:
# hint
# check_3.hint()

In [None]:
# solution
# check_3.solution()
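
To illustrate the pattern on toy data, the sketch below builds a `measure` column for the ratio of full quarter employees to total employees. The column names `full_qtr_emp` and `total_emp` are hypothetical; look at `head(employer_measures)` for the names actually present in the table.

```r
suppressMessages(library(dplyr))

# Toy data; the column names are hypothetical, not those of the real table
toy_employers <- data.frame(
    Empr_no = c("A", "B", "C"),
    full_qtr_emp = c(8, 20, 0),
    total_emp = c(10, 40, 5)
)

# Ratio of full quarter employees to total employees
toy_employers <- toy_employers %>%
    mutate(measure = full_qtr_emp / total_emp)

toy_employers
```

Copying the chosen column into a generic `measure` column means the rest of the notebook can run unchanged whichever measure you pick.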

## Checkpoint 4: Keeping employers who hired from your cohort

In this checkpoint, we ask you to set `select_year` below to the year for which you want to keep the employer measures. We recommend the year prior to your cohort selection.

In [None]:
select_year <- 2017

In [None]:
# hint
# check_4.hint()

In [None]:
# solution
# check_4.solution()

In [None]:
employer_measures_cohort <- employer_measures %>%
    # filtering on employer number
    filter(Empr_no %in% cohort_wages_empr$Empr_no, 
    # filtering on year
           year == select_year)

employer_measures_cohort %>%
    summarize(unique_emp = n_distinct(Empr_no))


#### Summary Statistics

In [None]:
# getting the mean, median, standard deviation, min and max values for your measure of interest
employer_measures_cohort %>%
    summarize(mean_emp = mean(measure), 
             median_emp = median(measure), 
             sd_emp = sd(measure), 
             min_emp = min(measure), 
             max_emp = max(measure))

Next, you will construct the summary statistics for all employers who did not employ individuals from your cohort, so that you can compare them to the statistics you created for employers who did employ individuals from your cohort. 

In [None]:
employer_measures %>%
    filter(!Empr_no %in% cohort_wages_empr$Empr_no, 
                  year == select_year) %>%
      summarize(mean_emp = mean(measure), 
             median_emp = median(measure), 
             sd_emp = sd(measure), 
             min_emp = min(measure), 
             max_emp = max(measure)) 
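
The two sets of statistics can also be placed side by side in one table. The sketch below does this on toy data by flagging employers that hired from the cohort and grouping on that flag; all names prefixed with `toy_` and the `hired_cohort` column are made up for illustration. It also passes `na.rm = TRUE`, which is worth considering if your measure has missing values, since `mean()` and `median()` return `NA` otherwise.

```r
suppressMessages(library(dplyr))

# Toy data standing in for employer_measures and cohort_wages_empr$Empr_no
toy_measures <- data.frame(
    Empr_no = c("A", "B", "C", "D"),
    year = 2017,
    measure = c(10, 20, 30, NA)
)
toy_cohort_employers <- c("A", "B")

toy_summary <- toy_measures %>%
    filter(year == 2017) %>%
    # flag employers that hired from the (toy) cohort
    mutate(hired_cohort = Empr_no %in% toy_cohort_employers) %>%
    group_by(hired_cohort) %>%
    # na.rm = TRUE skips missing values instead of returning NA
    summarize(mean_emp = mean(measure, na.rm = TRUE),
              median_emp = median(measure, na.rm = TRUE),
              n_emp = n())

toy_summary
```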

## Differentiating Employers
In this section, you will group employers into High, Medium and Low categories based on your chosen `measure`. These are defined as follows:

- High: >= 75th percentile of `measure`
- Medium: > 25th percentile and < 75th percentile of `measure`
- Low: <= 25th percentile of `measure`

Use R's `quantile` function to get the 25th and 75th percentiles for `measure` and assign these to `p25` and `p75` respectively. If you have different groupings in mind, please feel free to use them instead; we have not made this step a checkpoint because alternative groupings are difficult to incorporate into one. 

In [None]:
p <- quantile(employer_measures$measure,
              probs = c(.25, .75))

p

The first value corresponds to the 25th percentile and the second value corresponds to the 75th percentile. 

In [None]:
# extracting 25th and 75th percentile 
p25 <- p[1]
p75 <- p[2]
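
Two details of `quantile()` are easy to miss: it returns a named vector, and it stops with an error when the input contains missing values unless `na.rm = TRUE` is set. The toy sketch below shows both; the `toy_` names are made up for illustration.

```r
# Toy values with one missing entry
toy_measure <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

# Without na.rm = TRUE, quantile() would stop with an error here
toy_p <- quantile(toy_measure, probs = c(.25, .75), na.rm = TRUE)

# The result carries the percentile labels as names
names(toy_p)

# Extracting by name with [[ ]] returns plain, unnamed numbers
toy_p25 <- toy_p[["25%"]]
toy_p75 <- toy_p[["75%"]]
```

Extracting by name rather than by position guards against mixing up the two cutoffs if the `probs` vector is later reordered or extended.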

Define a categorical variable, `measure_cat`, by taking into account the 25th and 75th percentiles. 

In [None]:
employer_measures <- employer_measures %>%
    mutate(measure_cat = case_when(
                                # Low: <= 25th percentile
                                measure <= p25 ~ "Low", 
                                # Medium: > 25th and < 75th percentile
                                measure > p25 & measure < p75 ~ "Medium", 
                                # High: >= 75th percentile (explicit condition, so a missing measure stays NA)
                                measure >= p75 ~ "High")
          )

table(employer_measures$measure_cat, useNA="always")
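
A quick way to confirm that the boundaries behave as defined is to run the same `case_when()` logic on a few hand-picked values. The sketch below uses made-up cutoffs of 3 and 7 and writes the High condition explicitly; a catch-all `TRUE ~ "High"` branch would instead label a missing `measure` as High.

```r
suppressMessages(library(dplyr))

# Toy cutoffs and values chosen to sit on and around the boundaries
toy_p25 <- 3
toy_p75 <- 7
toy <- data.frame(measure = c(2, 3, 4, 6, 7, 8, NA))

toy <- toy %>%
    mutate(measure_cat = case_when(
        measure <= toy_p25 ~ "Low",
        measure > toy_p25 & measure < toy_p75 ~ "Medium",
        # explicit condition: rows matching no condition (here, NA) stay NA
        measure >= toy_p75 ~ "High"))

table(toy$measure_cat, useNA = "always")
```

Values equal to a cutoff land in Low (at the 25th percentile) or High (at the 75th percentile), matching the definitions above.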

Next, we select columns for `Empr_no` and `measure_cat` – this removes columns that we do not require for further analysis. 

In [None]:
employer_measures_cat <- employer_measures %>%
    filter(year==select_year) %>%
    select(Empr_no, measure_cat)

## Linking with Cohort (first quarter after exit)

Now that you have a categorical variable of interest, link `employer_measures_cat` to `cohort_wages_empr` to add the measure category to your cohort data.

Keep `ssn` and `measure_cat` and drop all unnecessary columns. For each individual in your cohort who was employed in the first quarter after exit, the resulting dataframe gives the measure category (High/Medium/Low) of that individual's first employer after exit. Note that individuals who were employed by more than one employer in the first quarter after exit will appear in this dataframe once for every job they had. 

In [None]:
emp_empr_matches_measure <- inner_join(cohort_wages_empr, 
                                                employer_measures_cat, by="Empr_no") %>%
                                     select(ssn, measure_cat)

table(emp_empr_matches_measure$measure_cat)
head(emp_empr_matches_measure)
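
Because individuals with several jobs appear more than once, it can help to check how many of them fall into more than one category. The sketch below does this on toy data; every `toy_` name and value is made up for illustration.

```r
suppressMessages(library(dplyr))

# Toy data: person "002" had two employers in the first quarter after exit
toy_wages <- data.frame(
    ssn = c("001", "002", "002"),
    Empr_no = c("A", "A", "B")
)
toy_cat <- data.frame(
    Empr_no = c("A", "B"),
    measure_cat = c("High", "Low")
)

toy_matches <- inner_join(toy_wages, toy_cat, by = "Empr_no") %>%
    select(ssn, measure_cat)

# Number of distinct categories attached to each person
toy_n_cat <- toy_matches %>%
    group_by(ssn) %>%
    summarize(n_cat = n_distinct(measure_cat))

toy_n_cat
```

A person with `n_cat` greater than 1 will be counted in more than one category in the later steps, which is worth keeping in mind when reading the percentages.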

## Merging

Here, you will use wage data for the same individuals you see above (in `emp_empr_matches_measure`). This will allow you to track employment for these individuals over the time period of analysis you selected in Checkpoint 2 (`cohort_wages`).
 
Next, you will combine the two dataframes from above using the `left_join` function, selecting out only variables you need for this analysis.

In [None]:
merged_cohort_wages_cat_m <- cohort_wages %>%  
                        select(ssn, Empr_no, yr_quarter) %>%
                        left_join(emp_empr_matches_measure , by=c('ssn')) %>%
                        filter(!is.na(measure_cat)) # dropping wage records for individuals without a first-quarter employer category


head(merged_cohort_wages_cat_m)
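
It is worth checking how many individuals are lost in this step, since anyone without a category from the first quarter after exit is filtered out. A minimal sketch on toy data (all names and values are made up, including the year-quarter labels):

```r
suppressMessages(library(dplyr))

# Toy data: "003" has wages later on but no first-quarter employer category
toy_cohort_wages <- data.frame(
    ssn = c("001", "001", "002", "003"),
    yr_quarter = c("2018Q2", "2018Q3", "2018Q2", "2018Q3")
)
toy_matches <- data.frame(
    ssn = c("001", "002"),
    measure_cat = c("High", "Low")
)

toy_merged <- toy_cohort_wages %>%
    left_join(toy_matches, by = "ssn") %>%
    filter(!is.na(measure_cat))

# Individuals present in the wage records but absent after the filter
n_dropped <- n_distinct(toy_cohort_wages$ssn) - n_distinct(toy_merged$ssn)
n_dropped
```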

## Percentage of Cohort Employed

Calculate the percentage of your cohort employed in each quarter BY the type of employer they were hired by after they exited TANF.

In [None]:
grouped_measure_cat <- merged_cohort_wages_cat_m %>%
    # grouping by employer measure category (for first employer after exit)
    group_by(measure_cat)  %>% 
    # getting total number of individuals falling in each category in the first quarter after exit
    mutate(pop = n_distinct(ssn)) %>%
    ungroup() %>%
    # grouping by employer measure category and quarter
    group_by(measure_cat, yr_quarter) %>% 
    # counting and creating percentage of cohort employed
    summarize(count = n_distinct(ssn), 
             pop = unique(pop), 
             perc = count/pop)

head(grouped_measure_cat)
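
To make the numerator and denominator concrete, the sketch below runs the same calculation on a small toy dataframe (names and values are made up). The denominator `pop` is the number of distinct individuals in a category, and the numerator `count` is the number of them with wages in a given quarter.

```r
suppressMessages(library(dplyr))

# Toy data: two people in "High", one in "Low"
toy_merged <- data.frame(
    ssn = c("001", "002", "001", "003"),
    yr_quarter = c("2018Q2", "2018Q2", "2018Q3", "2018Q2"),
    measure_cat = c("High", "High", "High", "Low")
)

toy_grouped <- toy_merged %>%
    group_by(measure_cat) %>%
    mutate(pop = n_distinct(ssn)) %>%      # denominator: people per category
    group_by(measure_cat, yr_quarter) %>%
    summarize(count = n_distinct(ssn),     # numerator: people with wages that quarter
              pop = unique(pop),
              perc = count / pop,
              .groups = "drop")

toy_grouped
```

In the toy data, both High individuals have wages in the first quarter but only one does in the second, so `perc` falls from 1 to 0.5 for that category.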

## Visualizing Results

Lastly, we visualize our results to understand how the employment trajectories of TANF recipients after exit differ by the measure category of their first employer. 

In [None]:
grouped_measure_cat %>%
    ggplot() + 
    aes(x = yr_quarter,
        y = perc, 
        group = measure_cat, 
        color = measure_cat) +  
    geom_line() + 
    geom_point() + 
    expand_limits(y = 0)
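
One optional refinement, sketched here on toy data, is to label the y axis as percentages with the `scales` package and to add axis titles. The data values and the label text are made up for illustration.

```r
suppressMessages(library(ggplot2))

# Toy results in the same shape as grouped_measure_cat
toy_grouped <- data.frame(
    measure_cat = c("High", "High", "Low", "Low"),
    yr_quarter = c("2018Q2", "2018Q3", "2018Q2", "2018Q3"),
    perc = c(1, 0.8, 1, 0.6)
)

toy_plot <- ggplot(toy_grouped) +
    aes(x = yr_quarter, y = perc, group = measure_cat, color = measure_cat) +
    geom_line() +
    geom_point() +
    expand_limits(y = 0) +
    # show proportions as percentages on the y axis
    scale_y_continuous(labels = scales::percent) +
    labs(x = "Year-quarter",
         y = "Percent of cohort employed",
         color = "Employer category")

toy_plot
```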