<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

<center> Julia Lane, Clayton Hunter, Brian Kim, Benjamin Feder, Ekaterina Levitskaya, Tian Lou, Lisa Osorio-Copete. 
</center>

# Using Employment and Employer-Level Measures to Understand Indiana's Labor Market

## Introduction

While in the [Data Exploration](Data_Exploration.ipynb) notebook we focused primarily on understanding our cohort's earnings, here we will first look at two measures of stable employment before switching gears to the demand side of employment: the employers. For the second part of this notebook, we will analyze some employer-level measures created in a supplementary [notebook](Create_Employer_Characteristics.ipynb) to get a better sense of Indiana's labor market and how employers of individuals in our cohort fit into the overall labor market.

### Learning Objectives

We will cover two prominent analyses:

1. Different measures of stable employment
1. Labor market interactions

These two sections use different units of analysis: the first focuses on the individuals in our cohort, and the second switches to their employers.

Before we start looking at their employers, a logical prelude is a deeper dive into our cohort's employment. Here, we will walk through two different measures of stable employment within a cohort and see whether earnings under these measures differed significantly from earnings of those without stable employment. From there, we will load our employer-level measures file and look at the differences in employers of cohort members who experienced different levels of employment.

We would like to find out if there are any distinguishing factors between the overall labor market in Indiana and the employers that hired members of our 2016Q4 cohort. Ultimately, we want to gain a better understanding of the demand side when it comes to employment opportunities for our TANF leavers.

Similar to the [Data Exploration](Data_Exploration.ipynb) notebook, we will pose a few direct questions we will use to answer our ultimate question: **How can we use labor market interactions to help explain employment outcomes of TANF leavers?**

Before we do so, we need to load our external R packages and connect to the database.

### R Setup

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)

# scaling data
library(scales)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

## Stable Employment Measures

As discussed above, we will spend some time in this section looking at our 2016Q4 cohort's employment outcomes. We will examine two different definitions of stable employment and see how average quarterly earnings differ for individuals who satisfy each definition. The two questions we will seek to answer in this section are listed below:

1. How many leavers found stable employment? What percentage is this of our total cohort?
1. What were the average quarterly earnings within these stable jobs?

Let's first load our table matching our 2016Q4 cohort to their employment outcomes into R.

In [None]:
# read table into R
qry = "
select *
from ada_tdc_2020.cohort_2016_earnings
"
df_2016_wages = dbGetQuery(con, qry)

In [None]:
# take a look at df_2016_wages
glimpse(df_2016_wages)

Now, we're ready to start answering our first guiding question for this section.

<font color=green><h3>Question 1: How many leavers found stable employment? What percentage is this of our total cohort? </h3></font> 

How would you define stable employment? In fact, it is quite a subjective measure. Here are the two definitions of stable employment we will look at: 

1. Those with positive earnings all four quarters after exit with the same employer
2. Those who experienced full-quarter employment, meaning an individual had earnings from the same employer in quarters t-1, t, and t+1.

> These are not the only two, but just two common measures of stable employment. If you choose to analyze stable employment within a specific cohort (highly recommended), make sure you clearly state your definition of stable employment.

### Stable Employment Measure #1: Positive earnings all four quarters with the same employer

This calculation is relatively simple given that we have to just manipulate `df_2016_wages`. We will approach this calculation by counting the number of quarters each individual (`ssn`) received wages from each employer (`uiacct`), and then filter for just those `ssn`/`uiacct` combinations that appear in all four quarters in 2017.

In [None]:
# see if we can calculate stable employment measure #1
df_2016_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    head()
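
To see what this pipeline does, here is a minimal sketch on a small made-up data frame. `toy_wages` and all of its values are hypothetical and are not drawn from the real tables. Person `A` is paid by employer `e1` in all four quarters, while person `B` switches employers mid-year, so only the `A`/`e1` pair should remain after the filter.

```r
library(dplyr)

# hypothetical toy data: one row per person/employer/quarter
toy_wages <- tibble(
    ssn     = c("A", "A", "A", "A", "B", "B", "B", "B"),
    uiacct  = c("e1", "e1", "e1", "e1", "e2", "e2", "e3", "e3"),
    quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
    wages   = c(3000, 3100, 3200, 3300, 2000, 2100, 2200, 2300)
)

# count distinct quarters per person/employer pair, keep pairs seen in all four
toy_stable <- toy_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)) %>%
    ungroup() %>%
    filter(n_quarters == 4)

toy_stable
```

Person `B` has earnings in every quarter but never four quarters with one employer, which is why grouping by both `ssn` and `uiacct` matters for this definition.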

From here, we can add one line of code `summarize(n_distinct(ssn))` to calculate the number of individuals in our cohort that experienced this measure of stable employment.

In [None]:
# calculate number of individuals in our cohort that experienced stable employment measure #1
df_2016_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    summarize(n_distinct(ssn))

If you are curious about the number of cohort members who found stable employment (according to this definition) with multiple employers, you can find it with a few more lines of code.

In [None]:
# count individuals who satisfied stable employment measure #1 with more than one employer
df_2016_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    group_by(ssn) %>%
    summarize(n=n()) %>%
    ungroup() %>%
    filter(n>1) %>%
    summarize(num=n())

Returning to our guiding question, we can now calculate the percentage of our cohort that experienced stable employment within this time frame. We just need to load our original cohort into R to serve as the denominator.

In [None]:
# 2016Q4 cohort with most recent case information
qry <- "
SELECT *
FROM ada_tdc_2020.cohort_2016
"

#read into R as df
df_2016 <- dbGetQuery(con,qry)

In [None]:
# save to calculate stable employment percentage
stable <- df_2016_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    summarize(num = n_distinct(ssn))

In [None]:
# percentage employed all four quarters
percent((stable$num/n_distinct(df_2016$ssn)), .01)
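
The second argument passed to `percent()` above is its `accuracy` parameter, which sets the rounding increment on the percentage scale. Here is a minimal sketch with made-up counts (`n_stable` and `n_cohort` are hypothetical values, not results from our cohort) that names the argument explicitly:

```r
library(scales)

# hypothetical counts: 1,234 stable individuals out of a cohort of 10,000
n_stable <- 1234
n_cohort <- 10000

# accuracy = .01 rounds to the nearest hundredth of a percent
pct <- percent(n_stable / n_cohort, accuracy = .01)
pct
```

Naming `accuracy` makes the intent clearer than relying on argument position.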

Now, let's see how the percentage changes when we use our second definition of stable employment.

### Stable Employment Measure #2: Full-Quarter Employment

Finding full-quarter employment is a bit more complicated. Instead of using R, we will venture back into SQL, since we will need to find earnings for our cohort from 2016Q4 through 2018Q1 to calculate if an individual experienced full-quarter employment some time in 2017. We have already created this table, named `full_q_wages_2016` in the `ada_tdc_2020` schema for you using the code below:
> To satisfy full-quarter employment in 2017Q1, an individual needed to have earnings from the same employer in 2016Q4, 2017Q1, and 2017Q2. Therefore, if we want to see all full-quarter employment from 2017Q1 to 2017Q4, we would need all earnings data from 2016Q4 to 2018Q1.

    create table ada_tdc_2020.full_q_wages_2016 as
    select a.ssn, a.tanf_spell_months, a.tanf_total_months, a.county,
    b.year, b.quarter, b.uiacct, b.wages, b.naics_3_digit, b.cnty, 
    format('%s-%s-1', b.year, b.quarter*3-2)::date as job_yr_q
    from ada_tdc_2020.cohort_2016 a
    left join in_dwd.wage_by_employer b
    on a.ssn = b.ssn
    where b.year = 2017 or (b.year = 2016 and b.quarter = 4) or (b.year=2018 and b.quarter=1)
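
The `job_yr_q` column stores the first day of each quarter: the expression `quarter*3-2` maps quarters 1 through 4 to months 1, 4, 7, and 10. Here is a minimal base R sketch of the same arithmetic, using made-up `year` and `quarter` vectors that are assumptions for illustration only:

```r
# hypothetical year/quarter values
year    <- c(2016, 2017, 2017, 2018)
quarter <- c(4, 1, 2, 1)

# first month of each quarter: 1 -> 1, 2 -> 4, 3 -> 7, 4 -> 10
first_month <- quarter * 3 - 2

# build the first day of the quarter as a Date
job_yr_q <- as.Date(sprintf("%d-%02d-01", year, first_month))
job_yr_q
```

Storing the quarter as a date is what lets the later query step forward or backward by exactly one quarter with a three-month interval.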

In [None]:
# get earnings for our cohort from 2016Q4-2018Q1
qry = '
select *
from ada_tdc_2020.full_q_wages_2016
limit 5
'
dbGetQuery(con, qry)

Now that we have earnings for our cohort from 2016Q4-2018Q1, we can calculate full-quarter employment. To do so, we will use three copies of the same table, and then use a `WHERE` clause to make sure we are identifying the same individual and employer combination across three consecutive quarters.

The `\'3 month\'::interval` code can be used when working with dates (`job_yr_q` in this case), as it will match to exactly three months from the original date. Before or after the original date can be indicated with `+` or `-` signs.

In [None]:
# see if we can calculate full-quarter employment
qry = '
select a.ssn, a.uiacct, a.job_yr_q, a.wages
from ada_tdc_2020.full_q_wages_2016 a, ada_tdc_2020.full_q_wages_2016 b, ada_tdc_2020.full_q_wages_2016 c
where a.ssn = b.ssn and a.uiacct=b.uiacct and
a.ssn = c.ssn and a.uiacct = c.uiacct and a.job_yr_q = (b.job_yr_q - \'3 month\'::interval)::date and 
a.job_yr_q = (c.job_yr_q + \'3 month\'::interval)::date
order by a.ssn, a.job_yr_q
limit 5
'
dbGetQuery(con, qry)

The query above selects earnings only for quarters in which an individual experienced full-quarter employment with an employer. Because `full_q_wages_2016` covers only 2016Q4 through 2018Q1, the `WHERE` clause can find both a preceding and a following quarter only for quarters in 2017, so the result excludes 2016Q4 and 2018Q1.
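
The same three-consecutive-quarter logic can also be sketched in R. This is an illustration on made-up data, not the method used to build the tables in this notebook: `toy_q`, its values, and the consecutive quarter counter `q_index` are all hypothetical, and the sketch assumes one row per `ssn`/`uiacct`/quarter. A quarter counts as full-quarter employment only when the same pair also appears in the quarter before and the quarter after.

```r
library(dplyr)

# hypothetical toy data; q_index counts quarters consecutively (2016Q4 = 0, 2017Q1 = 1, ...)
toy_q <- tibble(
    ssn     = c("A", "A", "A", "A", "B", "B", "B"),
    uiacct  = c("e1", "e1", "e1", "e1", "e2", "e2", "e2"),
    q_index = c(0, 1, 2, 3, 0, 1, 3),
    wages   = c(2500, 2600, 2700, 2800, 1500, 1600, 1700)
)

toy_full <- toy_q %>%
    group_by(ssn, uiacct) %>%
    arrange(q_index, .by_group = TRUE) %>%
    # full-quarter employment: the previous row is t-1 and the next row is t+1
    mutate(full_q = (q_index - lag(q_index) == 1) & (lead(q_index) - q_index == 1)) %>%
    ungroup() %>%
    # NA can arise at the first and last row of each group, so treat it as FALSE
    filter(coalesce(full_q, FALSE))

toy_full
```

Person `A` qualifies in the two middle quarters, while person `B` never does because of the gap at `q_index` 2.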

In [None]:
# read full-quarter employment into r as cohort_2016_full
qry = '
select a.ssn, a.uiacct, a.job_yr_q, a.wages
from ada_tdc_2020.full_q_wages_2016 a, ada_tdc_2020.full_q_wages_2016 b, ada_tdc_2020.full_q_wages_2016 c
where a.ssn = b.ssn and a.uiacct=b.uiacct and
a.ssn = c.ssn and a.uiacct = c.uiacct and a.job_yr_q = (b.job_yr_q - \'3 month\'::interval)::date and 
a.job_yr_q = (c.job_yr_q + \'3 month\'::interval)::date
order by a.ssn, a.job_yr_q
'
cohort_2016_full <- dbGetQuery(con, qry)

Now that we have all records of full-quarter employment, along with their earnings in the quarter, we can easily calculate the number of individuals in our cohort who experienced our second measure of stable employment in at least one quarter.

In [None]:
# calculate number of individuals in our cohort that experienced full-quarter employment
cohort_2016_full %>%
    summarize(n=n_distinct(ssn))

In [None]:
# save number of individuals in our cohort that experienced full-quarter employment
full_n <- cohort_2016_full %>%
    summarize(n=n_distinct(ssn))

In [None]:
# calculate proportion of people in our cohort that experienced full-quarter employment
percent((full_n$n/n_distinct(df_2016$ssn)), .01)

We can also count the individuals in our cohort who experienced full-quarter employment with the same employer in all four quarters.

In [None]:
cohort_2016_full %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(job_yr_q)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    summarize(n=n_distinct(ssn))

And then we can calculate this percentage.

In [None]:
# save as full_4
full_4 <- cohort_2016_full %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(job_yr_q)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    summarize(n=n_distinct(ssn))

In [None]:
percent((full_4$n/n_distinct(df_2016$ssn)), .01)

If you're curious, we can see if anyone experienced full quarter employment all four quarters with multiple employers as well.

In [None]:
# count individuals with full-quarter employment in all four quarters at more than one employer
cohort_2016_full %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(job_yr_q)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    group_by(ssn) %>%
    summarize(n_emps = n_distinct(uiacct)) %>%
    filter(n_emps > 1) %>%
    summarize(n=n_distinct(ssn))

Are you surprised at the difference in percentages for our two measures of stable employment?

<font color=red><h3> Checkpoint 1: Recreate for 2009Q1 </h3></font> 

Find the percentage of our 2009Q1 cohort that experienced stable employment using these two metrics. How do they compare? Does this surprise you?

In [None]:
# How many individuals satisfy stable employment measure #1?


In [None]:
# What percentage of our cohort satisfies stable employment measure #1?


In [None]:
# How many individuals satisfy stable employment measure #2?

# Use table "ada_tdc_2020.full_q_wages_2009"


In [None]:
# What percentage of our cohort satisfies stable employment measure #2 for at least one quarter?


<font color=green><h3>Question 2: What were the average quarterly earnings within these stable jobs?</h3></font> 

Let's see if earnings differed for our cohort when comparing our two measures of stable employment. 

### Stable Employment Measure #1: Average Quarterly Earnings

We'll start with our first measure of those that had earnings with the same employer for all four quarters within our time frame. First, we will isolate all `ssn`/`uiacct` combinations that satisfied this stable employment measure, and then filter our original earnings data frame `df_2016_wages` to just include wages for these combinations.

In [None]:
# all ssn and uiacct values from stable employment measure #1 and save to stable_emp_1
stable_emp_1 <- df_2016_wages %>%
    group_by(ssn, uiacct) %>%
    summarize(n_quarters = n_distinct(quarter)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    select(-n_quarters)

> The code used to create `stable_emp_1` is copied from the code used earlier to isolate those who had earnings with the same employer for all four quarters within our time frame, with the addition of the last line so we don't store the number of quarters for which they were employed (which is always four in this case anyways).

In [None]:
# see stable_emp_1
head(stable_emp_1)

Now, we just need to `inner_join` rows in `df_2016_wages` for those with the same `uiacct` and `ssn` combinations as in `stable_emp_1`, and then we can find the average quarterly earnings.

In [None]:
# find average quarterly earnings for these individuals
df_2016_wages %>%
    inner_join(stable_emp_1, by = c('uiacct', 'ssn')) %>%
    summarize(mean_wages = mean(wages))

### Stable Employment Measure #2: Average Quarterly Earnings

For our second stable employment measure, we have already identified `ssn`/`uiacct`/`job_yr_q` combinations for full-quarter employment. We will use a similar strategy in joining `df_2016_wages` before finding the average quarterly earnings for quarters in which members of our cohort experienced full-quarter employment.

In [None]:
# see cohort_2016_full
head(cohort_2016_full)

In [None]:
# find average quarterly earnings for stable employment measure 2
# drop wages from cohort_2016_full first so the join leaves a single wages column
df_2016_wages %>%
    inner_join(select(cohort_2016_full, -wages), by = c('uiacct', 'ssn', 'job_yr_q')) %>%
    summarize(mean_wages = mean(wages))
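
One thing to watch when joining two data frames that both carry a `wages` column: dplyr keeps both copies and renames them `wages.x` and `wages.y`, so a bare `wages` no longer exists after the join. The minimal sketch below uses made-up data (`left_df` and `right_df` are hypothetical) to show the behavior and one possible way around it, dropping the duplicate column before joining.

```r
library(dplyr)

# hypothetical inputs that share the non-key column wages
left_df  <- tibble(ssn = c("A", "B"), uiacct = c("e1", "e2"), wages = c(3000, 2000))
right_df <- tibble(ssn = "A", uiacct = "e1", wages = 3000)

# both inputs have wages, so the joined result has wages.x and wages.y
joined_raw <- inner_join(left_df, right_df, by = c("ssn", "uiacct"))
names(joined_raw)

# dropping the duplicate column first keeps a single wages column
joined_clean <- left_df %>%
    inner_join(select(right_df, -wages), by = c("ssn", "uiacct"))
names(joined_clean)
```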

<font color=red><h3> Checkpoint 2: Wages in Stable Employment for the 2009Q1 Cohort</h3></font> 

Find the average quarterly wages for those in our 2009Q1 cohort who experienced stable employment under the two definitions above.

In [None]:
# average quarterly wages under stable employment measure #1



In [None]:
# average quarterly wages under stable employment measure #2



## Indiana's Employers

In this section, we'll look at the characteristics of Indiana's employers. First, let's load and take a quick look at our employer-level characteristics file `employers_2017` (located in the `ada_tdc_2020` schema), which covers all employers in each quarter of 2017.

### Load the dataset

Before we get started answering these questions, let's load and then take a look at this file.

In [None]:
# look at employer-level characteristics table
qry <- "
select *
from ada_tdc_2020.employers_2017
limit 5
"
dbGetQuery(con, qry)

In [None]:
# read into R
qry <- "
select *
from ada_tdc_2020.employers_2017
"
employers <- dbGetQuery(con, qry)

Let's see how many rows are in `employers`.

In [None]:
# number of rows
nrow(employers)

Let's also see how many employers we have on file per quarter in 2017.

In [None]:
# number of employers by quarter
employers %>%
    count(quarter)

## Indiana's Employers

Now that the `employers` data frame is ready for use, as in the [Data Exploration](Data_Exploration.ipynb) notebook, we will try to answer some broad questions about Indiana's labor market through some more direct questions:

- What is the total number of jobs per quarter? What about total number of full quarter jobs?
- What are the most popular industries by number of employees? What about by number of employers?
- What is the distribution of both total and full-quarter employment of employers per quarter?
- What is the distribution of total and average annual earnings by quarter of these employers?
- Did average employment, hiring, and separation rates across all employers vary by quarter in 2017?

<font color=green><h3>Question 1: What is the total number of jobs per quarter? What about total number of full quarter jobs?</h3></font> 

There are two columns in `employers` we will focus on to answer this set of questions: `num_employed`, which is the number of employees, and `full_num_employed`, which is the number of full-quarter employees.

In [None]:
# find number of employees and full-quarter employees in each quarter
employers %>%
    group_by(quarter) %>%
    summarize(total_jobs = sum(num_employed),
             total_full_quarter_jobs = sum(full_num_employed, na.rm=TRUE))

<font color=green><h3>Question 2: What are the most popular industries by number of employees? What about by number of employers?</h3></font> 

Again, we will leverage the `num_employed` variable in `employers`, and this time, we will group by `naics_3_digit`.

In [None]:
# 10 most popular industries
employers %>%
    group_by(naics_3_digit) %>%
    summarize(num_employed = sum(num_employed)) %>%
    arrange(desc(num_employed)) %>%
    head(10)

Let's use our industry crosswalk to put some names to these NAICS codes. Like in the [Data Exploration](Data_Exploration.ipynb) notebook, we can use the `naics_2017` table in the `public` schema to act as a crosswalk.

In [None]:
# read naics_2017 table into R as naics
qry = '
select *
from public.naics_2017
'
naics <- dbGetQuery(con, qry)

In [None]:
# save 10 most popular industries
pop_naics <- employers %>%
    group_by(naics_3_digit) %>%
    summarize(num_employed = sum(num_employed)) %>%
    arrange(desc(num_employed)) %>%
    # make naics_3_digit character type instead of numeric
    mutate(naics_3_digit = as.character(naics_3_digit)) %>%
    head(10)

Now that we have stored `pop_naics` as a data frame, we can `left_join()` it to `naics` to find the industries associated with each 3-digit NAICS code.

In [None]:
# get industry names of most popular naics
pop_naics %>% 
    left_join(naics, by=c('naics_3_digit' = 'naics_us_code')) %>%
    # don't include the other columns
    select(-c(seq_no,naics_3_digit)) %>%
    # sort order of columns
    select(naics_us_title, num_employed)
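
The `as.character()` step matters because the join key must have the same type on both sides. Here is a minimal sketch on made-up data, where `toy_totals` and `toy_naics` are hypothetical stand-ins for `pop_naics` and `naics` and the employment counts are invented:

```r
library(dplyr)

# hypothetical industry totals with a numeric NAICS code
toy_totals <- tibble(naics_3_digit = c(722, 561), num_employed = c(500, 300))

# hypothetical crosswalk with the code stored as character
toy_naics <- tibble(
    naics_us_code  = c("561", "722"),
    naics_us_title = c("Administrative and Support Services",
                       "Food Services and Drinking Places")
)

toy_named <- toy_totals %>%
    # convert the key to character so it matches the crosswalk column type
    mutate(naics_3_digit = as.character(naics_3_digit)) %>%
    left_join(toy_naics, by = c("naics_3_digit" = "naics_us_code")) %>%
    select(naics_us_title, num_employed)

toy_named
```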

Do any of these industries surprise you? Now, let's move on to our most common industries by number of employers.
> In the following code, `n_distinct()` calculates the number of unique employers in 2017, whereas `n()` counts every employer-quarter row, so an employer present in all four quarters is counted four times.

In [None]:
# calculate number of distinct and total number of employers in all four quarters of 2017
employers %>%
    group_by(naics_3_digit) %>%
    summarize(distinct_emp = n_distinct(uiacct),
             num_emps = n()) %>%
    arrange(desc(distinct_emp)) %>%
    filter(!is.na(naics_3_digit)) %>%
    head(10)
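
The difference between the two counts is easiest to see on a tiny example. In this sketch, `toy_emp` is a hypothetical data frame in which employer `e1` appears in two quarters and employer `e2` in one:

```r
library(dplyr)

# hypothetical employer-quarter rows within a single industry
toy_emp <- tibble(
    naics_3_digit = c(722, 722, 722),
    uiacct        = c("e1", "e1", "e2"),
    quarter       = c(1, 2, 1)
)

toy_counts <- toy_emp %>%
    group_by(naics_3_digit) %>%
    summarize(distinct_emp = n_distinct(uiacct),  # unique employers
              num_emps = n()) %>%                 # employer-quarter rows
    ungroup()

toy_counts
```

Here `distinct_emp` is 2 while `num_emps` is 3, since `e1` contributes one row per quarter.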

Again, we can find the associated industry names with a quick join after saving the resulting data frame above.

In [None]:
# calculate number of distinct and total number of employers in all four quarters of 2017
# save to pop_naics_emps
pop_naics_emps <- employers %>%
    group_by(naics_3_digit) %>%
    summarize(distinct_emp = n_distinct(uiacct),
             num_emps = n()) %>%
    arrange(desc(distinct_emp)) %>%
    filter(!is.na(naics_3_digit)) %>%
    # again make naics_3_digit character type
    mutate(naics_3_digit = as.character(naics_3_digit)) %>%
    head(10)

In [None]:
# get industry names of most popular naics
pop_naics_emps %>% 
    left_join(naics, by=c('naics_3_digit' = 'naics_us_code')) %>%
    # don't include the other columns
    select(-c(seq_no,naics_3_digit)) %>%
    # sort order of columns
    select(naics_us_title, distinct_emp, num_emps)

How does this list compare to the one of the most popular industries by number of total employees?

<font color=green><h3>Question 3: What is the distribution of both total and full-quarter employment of employers per quarter?</h3></font> 

Now, instead of summing `num_employed`, we will look at its distribution across employers, where each row of `employers` is one employer in one quarter. We will find the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles.

In [None]:
# find distribution of total employees by employer and quarter
employers %>%
    summarize('.01' = quantile(num_employed, .01, na.rm=TRUE),
              '.1' = quantile(num_employed, .1, na.rm=TRUE),
              '.25' = quantile(num_employed, .25, na.rm=TRUE),
              '.5' = quantile(num_employed, .5, na.rm=TRUE),
              '.75' = quantile(num_employed, .75, na.rm=TRUE),
              '.9' = quantile(num_employed, .9, na.rm=TRUE),
              '.99' = quantile(num_employed, .99, na.rm=TRUE),
             )

In [None]:
# find distribution of full-quarter employees by employer and quarter
employers %>%
    summarize('.01' = quantile(full_num_employed, .01, na.rm=TRUE),
              '.1' = quantile(full_num_employed, .1, na.rm=TRUE),
              '.25' = quantile(full_num_employed, .25, na.rm=TRUE),
              '.5' = quantile(full_num_employed, .5, na.rm=TRUE),
              '.75' = quantile(full_num_employed, .75, na.rm=TRUE),
              '.9' = quantile(full_num_employed, .9, na.rm=TRUE),
              '.99' = quantile(full_num_employed, .99, na.rm=TRUE),
             )
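
The two cells above call `quantile()` once per percentile. As a more compact alternative, `quantile()` also accepts a vector of probabilities and returns all of them at once. Here is a minimal base R sketch on made-up employer sizes (`toy_sizes` is hypothetical):

```r
# hypothetical employer sizes, one value per employer-quarter row
toy_sizes <- c(1, 1, 2, 3, 5, 8, 13, 40, 250, NA)

# one call returns every requested percentile as a named vector
probs <- c(.01, .1, .25, .5, .75, .9, .99)
size_quantiles <- quantile(toy_sizes, probs = probs, na.rm = TRUE)
size_quantiles
```

The result is a named numeric vector (names such as `50%`), which can be easier to scan than a wide one-row data frame.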

What does this tell you about the relative size of employers in Indiana?

<font color=green><h3>Question 4: What is the distribution of total and average annual earnings by quarter of these employers?
</h3></font> 

In [None]:
# find distribution of total earnings by employer and quarter
employers %>%
    summarize('.01' = quantile(total_earnings, .01, na.rm=TRUE),
              '.1' = quantile(total_earnings, .1, na.rm=TRUE),
              '.25' = quantile(total_earnings, .25, na.rm=TRUE),
              '.5' = quantile(total_earnings, .5, na.rm=TRUE),
              '.75' = quantile(total_earnings, .75, na.rm=TRUE),
              '.9' = quantile(total_earnings, .9, na.rm=TRUE),
              '.99' = quantile(total_earnings, .99, na.rm=TRUE),
             )

In [None]:
# find distribution of average annual earnings by employer and quarter
employers %>%
    summarize('.01' = quantile(avg_earnings, .01, na.rm=TRUE),
              '.1' = quantile(avg_earnings, .1, na.rm=TRUE),
              '.25' = quantile(avg_earnings, .25, na.rm=TRUE),
              '.5' = quantile(avg_earnings, .5, na.rm=TRUE),
              '.75' = quantile(avg_earnings, .75, na.rm=TRUE),
              '.9' = quantile(avg_earnings, .9, na.rm=TRUE),
              '.99' = quantile(avg_earnings, .99, na.rm=TRUE),
             )

Is this what you were expecting to see? How do overall average earnings by employees compare to average earnings within our cohort?

<font color=green><h3>Question 5: Did average employment, hiring, and separation rates across all employers vary by quarter in 2017?</h3></font> 

Here, we will go back to using `group_by` and `summarize` to find our answers.

In [None]:
# find mean and standard deviation of employment rates by quarter
employers %>%
    group_by(quarter) %>%
    summarize(mean = mean(emp_rate, na.rm=TRUE),
             sd = sd(emp_rate, na.rm=TRUE))

In [None]:
# find mean and standard deviation of hiring rates by quarter
employers %>%
    group_by(quarter) %>%
    summarize(mean = mean(hire_rate, na.rm=TRUE),
             sd = sd(hire_rate, na.rm=TRUE))

In [None]:
# find mean and standard deviation of separation rates by quarter
employers %>%
    group_by(quarter) %>%
    summarize(mean = mean(sep_rate, na.rm=TRUE),
             sd = sd(sep_rate, na.rm=TRUE))

Based on your knowledge of employment patterns in 2017, are these results consistent with the overall trends in the United States at the time?

<font color=red><h3> Checkpoint 3: Understanding Our Cohort within Labor Market </h3></font> 

Ideally, we would like to get a better sense of who is employing our 2016 cohort: are they larger employers with lots of turnover? Do they tend to pay their employees better? Please answer the questions posed in "Indiana's Employers" for the employers that employed members of our cohort. Filter the `employers` data frame based on `uiacct` and `quarter`.

In [None]:
# guiding question 1



In [None]:
# guiding question 2



In [None]:
# guiding question 3



In [None]:
# guiding question 4



In [None]:
# guiding question 5



In this notebook, you have explored two separate definitions of stable employment and how quarterly wages changed under the two definitions. Then, you switched over to looking at the demand side of the labor market, learning about all of Indiana's employers in 2017. 

After answering the final checkpoint, you will be able to compare employers of our cohort to the overall labor market in Indiana. Did you find that individuals in our cohort were not employed by certain types of employers? For your next assignment, you will repeat this analysis with our 2009Q1 cohort to better understand the labor market as it began to recover from the Great Recession.