<center><br><br>
    <h4>TANF Data Collaborative </h4>
    <h4>Applied Data Analytics Training | Spring 2022</h4>
    <h1>Characterizing Demand: Descriptive Analysis</h1>
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Maryah Garner, Allison Nunez, Rukhshan Arif Mian, Benjamin Feder</center>
    <a href="https://doi.org/10.5281/zenodo.7459656"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.7459656.svg" alt="DOI"></a>

</center>

<br>

***

# Introduction

Characterizing labor market demand can help us understand the different types of employers at the city, county, or state level. Most research on labor market outcomes emphasizes the role of the employee (labor market supply). While this is important, understanding the employer's role is also critical for improving employment outcomes. 

In the Beginner and Advanced notebooks for `03_Linkage_and_Longitudinal_Analysis.ipynb`, we used descriptive statistics to understand employment outcomes for our cohort. This allowed us to see the various patterns (for example: missingness) exhibited in the data that could impact any statistical analysis performed. The goal of this notebook is now to understand how we can use descriptive statistics for the purpose of characterizing labor demand. This will allow us to understand the types of employers individuals in our cohort are employed by and how this can impact employment outcomes. More specifically, we can identify employer characteristics that are associated with positive employment outcomes for our cohort. 

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# for as.yearqtr()
suppressMessages(library(zoo))

#Switching on warnings
options(warn = 0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

# Data Sources

We utilize the following data sources throughout this notebook:

- **Employer Measures** (`employer_yearly_agg`): Measures we created in the `TDC-Supplemental_Employer_Measures.ipynb` notebook
- **TANF Cohort** (`nb_cohort`): Work eligible (and adult) TANF recipients who exited in '2018 Q2'. We created this in the `02_Creating_a_cohort.ipynb` notebook from Module 2. 
- **UI Wages** (`wages_tanf`): UI Wage records for Indiana

## Describing Employer Measures

This class will focus on 3 essential categories of annualized employer-level measures when it comes to characterizing labor market demand: Firm characteristics, Opportunity, and Stability. We define the specific employer measures that fall into each category below:

**Firm characteristics**:

- Total Payroll
- Average earnings per employee 
- Average full quarter earnings per employee
- Earnings per employee at 25th percentile
- Earnings per employee at 75th percentile
- Total full quarter employment
- Total employment

**Opportunity**:

- Number of new hires
- Employment growth rate
- Hiring growth rate

 $$ \large{g_{et}=\frac{(x_{et} - x_{e,t-1})}{\frac{(x_{et} + x_{e,t-1})}{2}}} $$

where: 
 - $x_{et}$: Total Employment/Hiring/Separation for employer *e* at time *t*
 - $x_{e,t-1}$: Total Employment/Hiring/Separation for employer *e* at time *t-1*
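
As a rough illustration of the formula above, the sketch below computes the growth rate for a hypothetical employer. The function name `growth_rate` and the employment counts are assumptions made for this example; they are not taken from the employer measures notebook.

```r
# Hypothetical helper (not part of the original notebooks): growth rate between
# the previous period (x_prev) and the current period (x_curr), using the
# average of the two periods as the denominator
growth_rate <- function(x_curr, x_prev) {
    (x_curr - x_prev) / ((x_curr + x_prev) / 2)
}

# Made-up example: employment rises from 100 to 120
g_up <- growth_rate(x_curr = 120, x_prev = 100)

# Made-up example: employment falls from 120 to 100
g_down <- growth_rate(x_curr = 100, x_prev = 120)

round(c(g_up, g_down), 3)
```

Because the denominator is the average of the two periods, an increase and the reverse decrease give rates of equal size and opposite sign.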
 


**Stability**:
- Separation growth rate (this is calculated using the same formula as above)
- Number of new hires who become full quarter employees (hired in t-1 whom we see in t+1)
- Ratio of full quarter employees to total number of employees 


As a reminder, these measures were created using Indiana's UI Wage Records and exclude any employers that had fewer than 5 employees. 


To learn more about what each of these measures means and how they were created, we encourage you to go through the supplementary worksheet (`Supplemental_Employer_Measure_Worksheet.xlsx`). 

As discussed in the introduction, we can use descriptive analysis to understand the correlation between employer measures and employment outcomes. We can use these to address questions such as:

- For TANF exiters, what types of employers are they employed by? Are these high-growth or low-growth employers? 

In this notebook, we focus on single measures from 2 of the 3 employer measure categories: 
1. Total Employment (Firm Characteristics)
2. Employment Growth Rate (Opportunity)

Before diving into these measures, we look at how the `employer_yearly_agg` table is structured. 

In [None]:
qry <- "
SELECT * 
FROM tr_tdc_2022.dbo.employer_yearly_agg
"

employer_measures <- dbGetQuery(con, qry)

head(employer_measures, 0)

Next, we look at the columns in the `employer_measures` data frame.

In [None]:
names(employer_measures)

As a reminder, each observation in this data frame corresponds to an employer-year combination. We have data for all years that fall in the following range:

In [None]:
# range calculates the min and max values for a variable of interest
range(employer_measures$year) 

One important note here is that not all employers appear in this data, and those who appear may not appear every year. There are several reasons this could happen:

1. A non-UI covered employer will not submit wages to UI. For example: federal work or gig work.
2. The employer may have fewer than 5 employees (these were filtered out when constructing the employer measures).
3. The employer may not yet be established or may have gone out of business.

Keep all of this in mind when considering the inferential limitations to using the `employer_measures` table.
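
One way to get a feel for this kind of coverage gap is to count how many years each employer appears in. The sketch below uses made-up employer-year records in base R; only the column names `Empr_no` and `year` mirror the table above, and the values are assumptions for illustration.

```r
# Made-up employer-year records (illustrative values only)
toy_measures <- data.frame(
    Empr_no = c("A", "A", "A", "B", "B", "C"),
    year    = c(2016, 2017, 2018, 2017, 2018, 2018)
)

# Number of distinct years in which each employer appears
years_per_employer <- tapply(toy_measures$year, toy_measures$Empr_no,
                             function(y) length(unique(y)))
years_per_employer

# Employers with no record in 2017 would drop out of a 2017-based analysis
missing_2017 <- setdiff(unique(toy_measures$Empr_no),
                        toy_measures$Empr_no[toy_measures$year == 2017])
missing_2017
```

In this toy data, employer "C" only appears in 2018, so it would not contribute any 2017 measures.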

# Analyzing Firm Level Characteristics

In this section, we use firm level characteristics to understand how we can characterize labor demand for our cohort of TANF exiters. 

We explore the following research question: 
-  Do TANF exiters from our cohort stay longer at a job when they are hired by larger employers compared to smaller employers? 
    - We define larger employers as those who have an above average number of employees. Similarly, we define small employers as those who have a below average number of employees

In this notebook, we will be considering employer characteristics from 2017. We will be looking at summary statistics for all employers and for employers who hired TANF exiters from our cohort. We will then look at the correlation between employer characteristics and employee outcomes.

In this analysis, we include individuals from our cohort who were employed in the first quarter after they exit. Since our cohort consists of TANF recipients who exited in '2018 Q2', they must be employed in '2018 Q3' to be included. We use '2018 Q3' to look at how the characteristics of TANF exiters' first employers impact their outcomes.

We will then count the number of quarters they worked for the employer that hired them in '2018 Q3'.

> Note: We consider employer measures from 2017 since this would be the information that was available to TANF agencies when aiding TANF recipients in finding employment in '2018 Q2'. It is very unlikely they were used in practice; however, we hope to provide a practical tool that can be utilized when guiding TANF recipients in the future.

The first step in approaching this research question is to link our cohort with the UI Wage data and only keep the first quarter after exit ('2018 Q3') in UI Wages. 

## Linking Cohort to Wages (2018Q3)

In [None]:
qry <- "SELECT nb.ssn, wr.Empr_no, wr.Year, wr.Quarter, wr.Wage, wr.yr_quarter 
    FROM 
    tr_tdc_2022.dbo.nb_cohort nb
    INNER JOIN 
    (
        select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('2018 Q3')
        AND (SSN IN (SELECT DISTINCT SSN FROM tr_tdc_2022.dbo.nb_cohort))
    ) wr
    ON wr.SSN=nb.SSN
"

cohort_wages_empr <- dbGetQuery(con, qry)

Note that the code above is similar to what we saw in the Beginner and Advanced notebooks for `03_Linkage_and_Longitudinal_Analysis`. The only difference here is that we use an `INNER JOIN`, which will only pull individuals who have wage records from '2018 Q3'. 
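
To make the effect of the join type concrete, here is a minimal sketch in base R with made-up records. The data frames `toy_cohort` and `toy_wages` are hypothetical, and the left join is shown only for contrast.

```r
# Made-up cohort and wage records (identifiers are illustrative only)
toy_cohort <- data.frame(ssn = c("001", "002", "003"))
toy_wages  <- data.frame(
    ssn     = c("001", "001", "003"),
    Empr_no = c("A", "B", "C"),
    Wage    = c(2500, 800, 4100)
)

# Inner join: only cohort members with at least one wage record are kept
inner <- merge(toy_cohort, toy_wages, by = "ssn")

# Left join: every cohort member is kept; those without wages get NA
left <- merge(toy_cohort, toy_wages, by = "ssn", all.x = TRUE)

nrow(inner)  # 3 rows: two jobs for "001", one job for "003"
nrow(left)   # 4 rows: the same three plus an NA row for "002"
```

Note that a person with two employers in the quarter ("001" here) contributes two rows under either join.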

Count the unique number of individuals from our cohort who were employed in '2018 Q3' along with the unique number of employers by whom they were employed. 

In [None]:
cohort_wages_empr %>%
    summarize(unique_indiv = n_distinct(ssn),
              unique_empr = n_distinct(Empr_no))

## Extracting Employer Measures

In this sub-section, we extract the employers (and their 2017 measures, if available) who employed individuals from our TANF cohort in '2018 Q3'. 

We create a subset of the `employer_measures` data frame by keeping:
1. Unique employers from the previous step
2. Year = 2017

In [None]:
employer_measures_cohort <- employer_measures %>%
    # filtering on employer number
    filter(Empr_no %in% cohort_wages_empr$Empr_no, 
    # filtering on year
           year == 2017)

Count the number of employers after applying this filter.

In [None]:
# Count the number of unique employers in this new data frame
employer_measures_cohort %>% 
    summarize(unique_emp = n_distinct(Empr_no))

Note that the number of employers falls when we apply this filter. Recall that measures were not developed for firms that had fewer than 5 employees or for firms with no UI Wage Records in 2017.

## Descriptive Statistics
The measure of interest for this section is the average number of individuals working for an employer in a quarter (averaged over the year), `avg_num_employed`. 
> If employer A employed 100 people in 2017Q1, 102 people in 2017Q2, 98 people in 2017Q3 and 104 people in 2017Q4, their `avg_num_employed` in 2017 would be 101. 
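
The arithmetic in the example can be checked with a couple of lines of R. The vector below simply restates the hypothetical employer A's quarterly head counts; it is not drawn from the real data.

```r
# Quarterly head counts for the hypothetical employer A
quarterly_employment <- c(Q1 = 100, Q2 = 102, Q3 = 98, Q4 = 104)

# Average over the four quarters
avg_employed <- mean(quarterly_employment)
avg_employed
```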

As a first step, we calculate summary statistics for this variable only looking at employers who hired individuals from our cohort in '2018 Q3'. Then we compare these results with summary statistics for all other employers (who did not hire individuals from our cohort) in 2017.

#### Employers who hired TANF recipients from our cohort (in 2018 Q3)

In [None]:
employer_measures_cohort %>%        
    summarize(
        # mean
        mean_emp = mean(avg_num_employed),
        # median
        median_emp = median(avg_num_employed), 
        # standard deviation
        sd_emp = sd(avg_num_employed),
        # min value
        min_emp = min(avg_num_employed), 
        # max value
        max_emp = max(avg_num_employed)
    )

#### Employers who did not hire TANF recipients from our cohort (in 2018 Q3)

In [None]:
employer_measures %>%
    # filtering out employers that hired from our cohort
    filter(!Empr_no %in% cohort_wages_empr$Empr_no, 
    # keeping year == 2017
           year == 2017) %>%
    summarize(mean_emp = mean(avg_num_employed), 
             median_emp = median(avg_num_employed), 
             sd_emp = sd(avg_num_employed),
             min_emp = min(avg_num_employed), 
             max_emp = max(avg_num_employed))


We see how both the mean and median values for average number of employees are higher for employers who hired from our cohort compared to all other employers in 2017. It seems as though TANF recipients are more likely to be hired by larger employers.

## Differentiating Employers

We use the mean value for `avg_num_employed` to create a categorical variable to differentiate employers into 2 groups:

1. Large: Employers with above-average number of employees
2. Small: Employers with below-average number of employees

#### Establish groups of employers

We use the employer measures to establish the average number of employees across all employers in 2017. 

In [None]:
# calculate the mean value of avg_num_employed across all employers in 2017
mean_emp <- employer_measures %>% 
    filter(year == 2017) %>%
    summarize(mean_emp = mean(avg_num_employed))

In [None]:
# creating a variable: emp_cat that equals Large if avg_num_employed >= mean
employer_measures_2017 <- employer_measures %>%
    filter(year == 2017) %>%
    mutate(emp_cat = ifelse(avg_num_employed >= mean_emp$mean_emp, 
                                 "Large", 
                                 "Small"))

In [None]:
# check distribution of emp_cat
table(employer_measures_2017$emp_cat)

Now that we have a categorical variable for the size of the employer, we link the `employer_measures_2017` dataframe to our joined cohort-wages data frame `cohort_wages_empr`.

In [None]:
employer_measures_cat <- employer_measures_2017 %>%
    # selecting out Employer number and categorical variable of interest
    select(Empr_no, emp_cat)

# performing an inner join so that we only keep individuals who are hired by employers for whom we have employer measures
cohort_wages_merged_measures <- inner_join(cohort_wages_empr, 
          employer_measures_cat, 
          by="Empr_no")

table(cohort_wages_merged_measures$emp_cat)

Now, we will count the number of unique individuals and employers we have in our merged dataframe.  

> We expect this number to be lower than what we saw for `cohort_wages_empr` as not all employers are captured in `employer_measures` (reminder: the employer may not have existed in 2017 or may have had fewer than 5 employees). 
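
If you want to see which employers are lost in a join like this, a set difference is one simple option. The sketch below uses made-up employer numbers; the vectors `toy_cohort_employers` and `toy_measure_employers` are hypothetical stand-ins for the employer numbers in the two data frames.

```r
# Made-up employer numbers (illustrative only)
toy_cohort_employers  <- c("A", "B", "C", "D")   # employers seen in cohort wages
toy_measure_employers <- c("A", "C", "E")        # employers with 2017 measures

# Employers that hired from the cohort but have no 2017 measures
unmatched <- setdiff(toy_cohort_employers, toy_measure_employers)
unmatched

# Share of cohort employers that an inner join would drop
share_dropped <- length(unmatched) / length(toy_cohort_employers)
share_dropped
```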

In [None]:
cohort_wages_merged_measures %>%
    summarize(unique_indiv = n_distinct(ssn), 
              unique_emp = n_distinct(Empr_no))

## Employee-employer combinations (matches)
Next, we select out the columns for `ssn`, `Empr_no` and `emp_cat` to have a dataframe that corresponds to employee-employer matches. 

In [None]:
emp_empr_matches <- cohort_wages_merged_measures %>% 
    select(ssn, Empr_no, emp_cat) 

head(emp_empr_matches)

The next step in our methodology is to **count the number of quarters during the first year of exit** for which an employee stayed with the same employer.

Thus, we will require another data pull that captures our TANF cohort's employment for the first year after exit. Since our exit quarter is '2018 Q2', the first year would include: '2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2'. 

## Cohort Wages 

We use the following code to read in cohort wages for the first 4 quarters (one year) after exit. This is similar to what we saw in the `03_Linkage_and_Longitudinal_Analysis` notebooks. 

In [None]:
qry <- "SELECT nb.ssn, wr.Empr_no, wr.Year, wr.Quarter, wr.Wage, wr.yr_quarter 
    FROM 
    tr_tdc_2022.dbo.nb_cohort nb
    INNER JOIN 
    (
        select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2')
    ) wr
    ON wr.SSN=nb.SSN
"

cohort_wages <- dbGetQuery(con,qry)

head(cohort_wages)

## Quarters worked for the same employer

We will look at the number of quarters each individual is employed by the same employer over the year following their exit. 

The `emp_empr_matches` dataframe contains a list of unique employee-employer combinations from '2018 Q3'. `cohort_wages` contains employee-employer combinations from '2018 Q3', '2018 Q4', '2019 Q1' and '2019 Q2'. 

To identify the number of quarters each individual worked for the same employer, we merge the 2 dataframes and count the unique number of quarters for each employee-employer pair. 

In [None]:
merged_cohort <- inner_join(emp_empr_matches, # employee-employer matches 
                            cohort_wages, # employee-employer matches for first year after exit
                            by=c("ssn", "Empr_no"))

Once we have the merged dataframe, we group by our `ssn`-`Empr_no`-`emp_cat` and count the number of unique `yr_quarter`s. 

In [None]:
merged_cohort_num_quart <- merged_cohort %>%
        group_by(ssn, Empr_no, emp_cat) %>%
        summarize(num_quart_emp = n_distinct(yr_quarter)) %>%
        ungroup()

head(merged_cohort_num_quart)

#### Portion of employees that stayed with the same employer for 1, 2, 3 or 4 quarters

The goal here is to explore whether TANF exiters who are hired by larger firms are more likely to stay with that employer compared to exiters who are hired by smaller firms. To do this, we will calculate the portion of employees that stayed with the same employer for 1, 2, 3 or 4 quarters after exit broken down by whether they were hired by a large firm or a smaller firm at the time of exit.
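
Before running this on the real data, it may help to see the calculation on a tiny made-up example. The sketch below uses base R's `table` and `prop.table` rather than the `dplyr` pipeline used in this notebook; the data frame `toy_pairs` and its values are assumptions for illustration.

```r
# Made-up employee-employer pairs: employer size and quarters worked
toy_pairs <- data.frame(
    emp_cat       = c("Large", "Large", "Large", "Large", "Small", "Small"),
    num_quart_emp = c(1, 1, 2, 4, 1, 4)
)

# Cross-tabulate pairs by employer size (rows) and quarters worked (columns)
counts <- table(toy_pairs$emp_cat, toy_pairs$num_quart_emp)

# Convert counts to row percentages, so each employer size sums to 100
portions <- prop.table(counts, margin = 1) * 100
portions
```

The key point is the denominator: each portion is computed within an employer size, so the Large and Small rows each sum to 100 regardless of how many pairs fall in each group.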

In [None]:
grouped <- merged_cohort_num_quart %>%
    # grouping by type of employer (Large or Small)
    group_by(emp_cat) %>%
    # counting the number of employee-employer pairs in first quarter after exit for large and small firms
    mutate(pop = n()) %>%
    ungroup() %>%
    # grouping by type of employer and number of quarters (1, 2, 3, 4)
    group_by(emp_cat, num_quart_emp) %>%
    # counting employee-employer pairs and calculating their portion within each employer type
    summarize(pop=unique(pop),  # every pop value in this grouping is the same, we keep one of them (otherwise, data will repeat itself)
              job_count = n(),  # counting the number of employee-employer pairs by type of employer and number of quarters
             portion=job_count*100/pop) # calculating portion

head(grouped)

## Visualizing Results

We visualize our results in the form of a bar plot to better understand outcomes. 

In [None]:
ggplot(grouped, 
       aes(x=num_quart_emp, 
           y=portion, 
           fill=emp_cat))+
  geom_bar(stat='identity', position='dodge')

As you can see from this visual: 
1. The most common outcome is only staying with the same employer for one quarter 
2. TANF exiters employed by smaller firms are more likely to work with that firm for 1 quarter or 4 quarters compared to their counterparts employed at larger firms. 

# Opportunity

The second measure of interest for this notebook is Employment Growth Rate. The goal is to look at average employment growth rate and identify the outcomes of individuals who are employed by high-, medium- and low-growth employers. 

The outcome we consider is the percent of individuals within each group that were employed in each quarter of the year following exit. For example, of our cohort that was hired in '2018 Q3' by high growth employers, what percentage was employed in the second, third and fourth quarters after exit?

Note that we are looking at employment in general and not just employment with the same employer. Also note the underlying assumption that if a TANF recipient gets a "good" job, they might be able to transition into an even better job more easily. This allows us to understand TANF recipients' total employment trajectory based on the type of employer they were hired by as soon as they exit. 

## Descriptive Statistics

Similar to what we did for Firm Characteristics, we perform basic summary statistics. Our measure of interest for this section is Employment Growth Rate which is captured by the following variable: `avg_emp_rate`. 


> Note: The values for `avg_emp_rate` are bound between -2 and 2
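
To see where these bounds could come from, assume each underlying rate follows the growth formula from the "Describing Employer Measures" section and that the two employment counts are non-negative and not both zero (this is an assumption about how `avg_emp_rate` is built, not something verified here). Then:

$$ |x_{et} - x_{e,t-1}| \le x_{et} + x_{e,t-1} \quad \Rightarrow \quad |g_{et}| = \frac{|x_{et} - x_{e,t-1}|}{\frac{(x_{et} + x_{e,t-1})}{2}} \le 2 $$

The rate equals 2 when $x_{e,t-1} = 0$ and equals -2 when $x_{et} = 0$. An average of rates that each lie between -2 and 2 also lies between -2 and 2.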

In [None]:
# getting the mean, median, standard deviation, min and max values for `avg_emp_rate`
employer_measures_cohort %>%
    summarize(mean_emp = mean(avg_emp_rate), 
             median_emp = median(avg_emp_rate), 
             sd_emp = sd(avg_emp_rate), 
             min_emp = min(avg_emp_rate), 
             max_emp = max(avg_emp_rate))

Next, we compare our results with all other employers from 2017.

In [None]:
employer_measures %>%
    filter(!Empr_no %in% cohort_wages_empr$Empr_no, 
                  year == 2017) %>%
      summarize(mean_emp = mean(avg_emp_rate), 
             median_emp = median(avg_emp_rate), 
             sd_emp = sd(avg_emp_rate), 
             min_emp = min(avg_emp_rate), 
             max_emp = max(avg_emp_rate)) 



Employers who hire from our cohort of TANF exiters have lower mean and median employment growth rates. 

## Differentiating Employers
In this section, we are looking to identify high-, medium-, and low-growth employers. We define these as follows:

- High: >= 75th percentile of `avg_emp_rate`
- Medium: > 25th percentile and < 75th percentile of `avg_emp_rate`
- Low: <= 25th percentile of `avg_emp_rate`

We use R's `quantile` function to get the 25th and 75th percentiles for `avg_emp_rate` and assign these to `p25` and `p75` respectively. 

In [None]:
p <- quantile(employer_measures$avg_emp_rate,
              probs = c(.25, .75))

p

The first value corresponds to 25th percentile and the second value corresponds to the 75th percentile. 
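
As a small aside, `quantile` returns a named vector, so the percentiles can also be selected by name instead of by position. The sketch below uses made-up growth rates; the `toy_` objects are hypothetical and separate from the notebook's `p`.

```r
# Made-up growth rates (illustrative only)
toy_rates <- c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.5, 0.8)

toy_p <- quantile(toy_rates, probs = c(.25, .75))

# quantile() labels each element with its percentile
names(toy_p)

# Select by name; double brackets drop the label
toy_p25 <- toy_p[["25%"]]
toy_p75 <- toy_p[["75%"]]
c(toy_p25, toy_p75)
```

Selecting by name is a design choice that keeps the code correct even if the order of `probs` is later changed.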

In [None]:
# extracting 25th and 75th percentile 
p25 <- p[1]
p75 <- p[2]

We define a categorical variable, `emp_rate_cat` by taking into account the 25th and 75th percentiles. 

In [None]:
employer_measures <- employer_measures %>%
    mutate(emp_rate_cat = case_when(
                                # Low: <= 25th percentile
                                avg_emp_rate <= p25 ~ "Low", 
                                # Medium: > 25th and < 75th percentile
                                avg_emp_rate > p25 & avg_emp_rate < p75 ~ "Medium", 
                                # High: >= 75th percentile
                                TRUE ~ "High")
          )

table(employer_measures$emp_rate_cat)

Next, we select columns for `Empr_no` and `emp_rate_cat` – this removes columns that we do not require for further analysis. 

In [None]:
employer_measures_growth_cat <- employer_measures %>%
    filter(year==2017) %>%
    select(Empr_no, emp_rate_cat)

## Linking with Cohort (first quarter)

Now that we have our categorical variable of interest, we link `employer_measures_growth_cat` to `cohort_wages_empr` to add the categorical variable for employment growth rate to our data. 

We only keep `ssn` and `emp_rate_cat`, removing any unnecessary columns. The resulting dataframe will give us the growth category (High/Medium/Low) a TANF recipient's first employer (after exit) falls in. 

In [None]:
emp_empr_matches_growth <- inner_join(cohort_wages_empr, 
                                                employer_measures_growth_cat, by="Empr_no") %>%
                                     select(ssn, emp_rate_cat)

table(emp_empr_matches_growth$emp_rate_cat)
head(emp_empr_matches_growth)

## Merging

We will use the wage data for the same individuals we see above (in `emp_empr_matches_growth`). This will allow us to track employment for these individuals. We pulled this data earlier (`cohort_wages`) for: '2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2'. 
 
Next, we will combine the two dataframes using the `left_join` function, selecting only the variables we need for this analysis. We perform this join on `ssn` and not a combination of `ssn` and `Empr_no` because we are interested in seeing *any* form of employment as opposed to looking at employment with the same employer. Our goal is to look at initial employment after exit and **how it affects our cohort's trajectories over the next year**. 
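
The effect of joining on `ssn` alone can be seen in a small made-up example. The sketch below uses base R's `merge` with `all.x = TRUE` as a stand-in for `left_join`; the data frames `toy_first` and `toy_year` are hypothetical.

```r
# Made-up first-employer categories (one row per person)
toy_first <- data.frame(ssn = c("001", "002"),
                        emp_rate_cat = c("High", "Low"))

# Made-up wage records over the follow-up year; "001" changes employer
toy_year <- data.frame(
    ssn        = c("001", "001", "002", "003"),
    Empr_no    = c("A", "B", "C", "D"),
    yr_quarter = c("2018 Q3", "2018 Q4", "2018 Q3", "2018 Q4")
)

# Left join on ssn only: every wage record inherits the person's first-employer category
toy_joined <- merge(toy_year, toy_first, by = "ssn", all.x = TRUE)

# Drop people with no first-employer category ("003" here)
toy_joined <- toy_joined[!is.na(toy_joined$emp_rate_cat), ]
toy_joined
```

Person "001" keeps the "High" label in '2018 Q4' even though the job is now with employer "B". Joining on both `ssn` and `Empr_no` would have left that row without a category.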

In [None]:
merged_cohort_wages_cat_g <- cohort_wages %>%  
                        select(ssn, Empr_no, yr_quarter) %>%
                        left_join(emp_empr_matches_growth, by=c('ssn')) %>%
                        filter(!is.na(emp_rate_cat)) # dropping individuals with no growth category for a '2018 Q3' employer



head(merged_cohort_wages_cat_g)

We use the code below to save this table in our SQL database – we will utilize this for one of the later notebooks in Module 3. We have commented this out for now. 

In [None]:
```
qry <- "use tr_tdc_2022;"
DBI::dbExecute(con, qry)

DBI::dbWriteTable(
    conn = con,
    name = DBI::SQL("dbo.merged_cohort_wages_growth"), 
    value = merged_cohort_wages_cat_g,
    overwrite = TRUE
)
```

## Percentage of Cohort Employed

In this section, we calculate the percentage of our cohort employed in each quarter by the type (high-growth/medium-growth/low-growth) of employer they were hired by after they exited TANF.

> Note: In the Firm Characteristics section, we looked at the number of quarters an individual worked at the same employer. Here, we are looking at each quarter and calculating the percentage of individuals employed from each group. 
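
The calculation can be sketched on made-up person-quarter records. The example below uses base R's `tapply` instead of the `dplyr` pipeline in this notebook; the data frame `toy_emp` and its values are assumptions for illustration.

```r
# Made-up person-quarter employment records with first-employer category
toy_emp <- data.frame(
    ssn          = c("001", "001", "002", "003", "003", "004"),
    emp_rate_cat = c("High", "High", "High", "Low", "Low", "Low"),
    yr_quarter   = c("2018 Q3", "2018 Q4", "2018 Q3",
                     "2018 Q3", "2018 Q4", "2018 Q3")
)

# Denominator: distinct people in each category
pop <- tapply(toy_emp$ssn, toy_emp$emp_rate_cat,
              function(s) length(unique(s)))

# Numerator: distinct people employed in each category (rows) and quarter (columns)
count <- tapply(toy_emp$ssn, list(toy_emp$emp_rate_cat, toy_emp$yr_quarter),
                function(s) length(unique(s)))

# Share employed; pop has one value per row, so it is recycled down the rows
perc <- count / as.vector(pop)
perc
```

In this toy data everyone is employed in '2018 Q3', which mirrors the notebook's setup where the first quarter after exit defines the groups.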

In [None]:
grouped_growth_cat <- merged_cohort_wages_cat_g %>%
    # grouping by type of employer growth category (for first employer after exit)
    group_by(emp_rate_cat)  %>% 
    # getting total number of individuals falling in each category in the first quarter after exit
    mutate(pop = n_distinct(ssn)) %>%
    ungroup() %>%
    # grouping by employer growth category and quarter
    group_by(emp_rate_cat, yr_quarter) %>% 
    # counting and creating percentage of cohort employed
    summarize(count = n_distinct(ssn), 
             pop = unique(pop), 
             perc = count/pop)

head(grouped_growth_cat)

## Visualizing Results

Lastly, we visualize our results to understand how the employer growth category impacts trajectories of TANF recipients after they exit. 

In [None]:
grouped_growth_cat %>%
    ggplot() + 
    aes(x = yr_quarter,
        y = perc, 
        group = emp_rate_cat, 
        color = emp_rate_cat) +  
    geom_line() + 
    expand_limits(y = 0)

One might think that high growth rate jobs would lead to more positive employment outcomes, but this does not seem to be the case (especially in later quarters). In the figure above, you can see that those who were initially employed in high growth jobs are the least likely to be employed in '2019 Q2'. It is likely that many other factors are correlated with these high growth employers. This is why, instead of looking at measures in isolation, you might want to group employers based on multiple measures as seen in the next notebook: `04_Characterizing_Demand_Advanced.ipynb`. 

> Note: The next notebook (`04_Characterizing_Demand_Advanced.ipynb`) uses Unsupervised Machine Learning and more advanced coding which might not be appropriate for all class participants.