<center><br><br>
    <h4>TANF Data Collaborative </h4>
    <h4>Applied Data Analytics Training | Spring 2022</h4>
    <h1> Longitudinal Analysis and Creating Employment Measures (Advanced)</h1>
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Maryah Garner, Allison Nunez, Rukhshan Arif Mian, Benjamin Feder</center>
</center>

<br>

***

This notebook covers record linkage and creating a data frame that allows for longitudinal analysis for a TANF exit cohort. Job-level metrics are created for this specific cohort of individuals previously developed in the Creating a Cohort notebook. Our goals are as follows:
1. Link the cohort to wage records for a year after exit to follow their employment over time
2. Create job-level employment metrics using the linked data
3. Create person-level employment metrics from the job-level data

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# for as.yearqtr()
suppressMessages(library(zoo))

#Switching on warnings
options(warn = 0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Record Linkage

Record linkage is an important component of almost any analysis, and especially of analyses that link administrative records; only a perfectly curated dataset with no messiness or missing variables would make it unnecessary. Unlike survey data, whose variables are selected with analysis in mind, administrative data is collected for administrative (not academic) purposes. That means we won't have all of the variables we ideally want, and the data can be messy (missing responses, or variables that we may not fully understand or have at our disposal). While we will not directly address missing responses (more on indirectly addressing this in the inference lecture), we can enrich our data set by pulling in relevant information from other data sets. We will describe how to link TANF, UI wage, and employer information to create a panel of individual records over time. We also describe some of the issues that arise when linking records of various sorts.

In the prior notebook, we began our analysis by establishing our cohort of interest, the 2018 Q2 TANF exit cohort, and generated some information about the cohort's demographics and TANF experiences. Now we would like to extend that information by tracking the cohort's employment outcomes over time, which is one way to begin measuring employment 'success'. We can do this by using the IN UI Wage data for a year out from the date of exit, for example. This means we can track a cohort over the span of 4 quarters (as in this notebook), or over the number of quarters of interest for your own analysis. We begin by discussing the linking of individuals in our cohort to their associated wage records. We will discuss employment outcomes in more detail below.


## Linking Cohort to Wage Records

We first consider our newly-constructed cohort table found in **tr_tdc_2022.dbo.nb_cohort**. Recall that at this stage, we only have instances of a person and their associated characteristics at TANF exit. We use this cohort table to link to wage records using a `LEFT JOIN` in SQL on **SSN**. This means that for every individual from our cohort, we pull in unique employers for quarters in which that person was employed. The code below performs this join and is very similar to the code used to develop the tables in the `Teaser` section from the Creating a Cohort notebook.
> Note: We are using `nb.*` within the `SELECT` statement to pull in all variables from the **nb_cohort** data table we created in the previous notebook.

In [None]:
## Linking TANF and UI Wages over time
qry <- "SELECT nb.*,wr.Empr_no, wr.Year, wr.Quarter, wr.Wage, wr.yr_quarter 
    FROM 
    tr_tdc_2022.dbo.nb_cohort nb
    LEFT JOIN 
    (
        select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('2018 Q2', '2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2', '2019 Q3')
        AND (SSN IN (SELECT DISTINCT(SSN) FROM tr_tdc_2022.dbo.nb_cohort))
    ) wr
    ON wr.SSN=nb.SSN;
"

cohort_wages <- dbGetQuery(con,qry)

head(cohort_wages)

Despite being outside of the four-quarter period of interest after exit, we included 2018 Q2 and 2019 Q3 because they will help us develop information on employment transitions. For example, we will use 2018 Q2 to determine where individuals worked (if at all) in the period before the first quarter we track their outcomes (2018 Q3). Similarly, we may want to know if a person remained in the same job after 2019 Q2. This would require having wage records (and corresponding employers) for 2019 Q3, even if we are documenting an outcome for 2019 Q2.

Those without any wages in the wage records will have a single record with "NA" for **Empr_no**, **Year**, **Quarter**, and **Wage** (all variables pulled in from the wage record panel). Think about what types of individuals are missing from wage records. It may be unemployed individuals, but it also may be those not at a UI-covered job, or more likely a mix of both. These caveats will be covered in more depth in the Inference lecture.
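
As a minimal sketch of this behavior, the toy example below performs a left join in R on made-up data. The names `toy_cohort`, `toy_wages`, and `toy_linked`, and all values in them, are hypothetical and are not drawn from the TANF or wage tables.

```r
library(dplyr)

# Hypothetical toy data: three cohort members, only two of whom appear in the wage records
toy_cohort <- tibble(ssn = c("A", "B", "C"))
toy_wages <- tibble(
    ssn     = c("A", "A", "B"),
    Empr_no = c("E1", "E2", "E1"),
    Wage    = c(1200, 300, 2500)
)

# The left join keeps every cohort member; "C" has no wage record,
# so "C" appears once with NA for Empr_no and Wage
toy_linked <- toy_cohort %>%
    left_join(toy_wages, by = "ssn")

toy_linked
```

Person "A" contributes one row per employer, while person "C" stays in the data as a single row of missing wage information.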

We can filter our dataframe to include only those individuals who have a missing **Wage** record, for example. Note: there will only be one record for these individuals. 

In [None]:
#Check for missing values
cohort_wages %>% 
    filter(is.na(Wage)) %>%
    head()

In [None]:
#Confirm that there is only one record per ssn when Wage is missing
cohort_wages %>% 
    filter(is.na(Wage)) %>%
    group_by(ssn) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    head()

Given that we have completed and verified the linkage to the UI wage records, let's summarize some of this new information for our exit cohort. We have quarterly information on their UI-covered employment within 2018Q2 – 2019Q3 and their associated wages. We will use this base information to generate new measures of employment later in this notebook, but we will first use our new linked data asset to generate per-person total wages and get an aggregate measure of the mean quarterly wages. 

In [None]:
# calculate the average quarterly wage for employed individuals in the cohort
cohort_wages %>%
# keep only records with wages so that individuals without wage records do not produce NA totals
    filter(!is.na(Wage)) %>%
    group_by(ssn, yr_quarter) %>% 
# compute the total wages for each person within a given year quarter
    summarize(totalwages=sum(Wage)) %>%  
    ungroup() %>% 
    group_by(yr_quarter) %>%
# calculate the average quarterly wage
    summarize(mean_wages=mean(totalwages))

## Link to Employer Industry
We created a table that provides an associated NAICS code for each employer in each quarter. This table will allow us to add industry-level information to our linked data asset of TANF and employment data. We will start by reading in the **employer_industry** from the **tr_tdc_2022** database. For more information on how employer industry was created, see scripts tdc_create_employer_industry_update.sql and tdc_populate_employer_industry_update.sql in Notebooks>Scripts. For every person-employer pair in the quarters that person-employer pair (job) exists, we are pulling in a 6-digit NAICS code associated with their employer.

In [None]:
# We pull in yr_quarter here so that we can merge with the cohort data based on employer id + relevant quarter
# The format for yr_quarter is: YYYY Q# where # refers to the quarter number. 
# For example: 2018 Q2 – this is the same as the format in your cohort data
qry<- "
SELECT empr_no AS Empr_no, naics_code, yr_quarter
FROM tr_tdc_2022.dbo.employer_industry
WHERE yearq <= 20193
AND yearq >= 20182 
ORDER BY year, quarter 
"
employer_industry <-dbGetQuery(con,qry)

# see employer_industry
head(employer_industry)

Now we can join on the **Empr_no** and **yr_quarter** variables.

> Note: We are using another left join to avoid filtering out any employment information for individuals where their employer does not have an associated record in the `employer_industry` table.

In [None]:
# Link employer industry for each employer (Empr_no) in every year-quarter combination (yr_quarter)
cohort_wages_industry <- cohort_wages %>% 
    left_join(employer_industry, by=c("Empr_no", "yr_quarter"))

head(cohort_wages_industry)

In [None]:
# Which naics code pays the most in quarterly wages?
cohort_wages_industry %>% 
    group_by(naics_code, yr_quarter) %>% 
    summarize(total_wages=sum(Wage)) %>% 
    arrange(desc(total_wages)) %>%
    head()

Interestingly enough, among those without missing NAICS codes, the industry with the largest total quarterly wages paid is temporary help services. According to the U.S. Census Bureau (see References), "This industry comprises establishments primarily engaged in supplying workers to clients' businesses for limited periods of time to supplement the working force of the client. The individuals provided are employees of the temporary help services establishment. However, these establishments do not provide direct supervision of their employees at the clients' work sites." We will keep this in mind when we begin to generate employer measures in the next section of this notebook.

## Employment Measures
Our goal is to create the following set of job-level and person-level measures:

* Job-level
    + same job existed in the prior quarter relative to the quarter of interest
    + same job existed in the quarter following the quarter of interest
    + a job in a quarter is a full-quarter employment job (it exists before, during, and after the quarter of interest)
* Person-level
    + a person has full employment in a quarter
    + a person has one or multiple jobs in a quarter
    
Note that we need the job-level measures here explicitly to create person-level employment measures, since our unit of analysis is the individual.

Now that we have a basic frame that includes employee-employer pairs for the cohort from 2018Q2 - 2019Q3 (where any cohort individuals without wages have a single record of NA for employer number and other variables), we can begin the creation of some employment measures. We will start by enumerating all time periods in the UI records, starting from one, in order from the earliest year-quarter combination to the most recent. This will give us a fixed-time variable that is easier to work with than dates directly. 
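
To illustrate why a running index is convenient, the sketch below compares it with a naive `YYYYQ` integer encoding on three made-up quarters that straddle a year boundary. The object `toy_time` and its `yearq` column are hypothetical and used only for this comparison.

```r
library(dplyr)

# Hypothetical toy quarters that straddle a year boundary (not drawn from the wage records)
toy_time <- tibble(yr_quarter = c("2018 Q3", "2018 Q4", "2019 Q1")) %>%
    mutate(TIME  = row_number(),             # running index: 1, 2, 3
           yearq = c(20183, 20184, 20191))   # naive YYYYQ integer, for comparison only

# Gap between adjacent quarters under each encoding
diff(toy_time$TIME)    # 1 1
diff(toy_time$yearq)   # 1 7
```

With the running index, "one period apart" always means adjacent quarters, so a check such as `TIME - lag(TIME) == 1` needs no special handling at the turn of the year.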

### Setup

In [None]:
#Get time period from UI wage records (for each unique year quarter, generate a fixed time variable-easier to work with than dates)

#Pull out any DISTINCT year and quarter records from the wage tables and make sure that year is a string of length 4 
#(we want year to be 4 digits) and quarter is only a single digit.

wage_qry1<- "
SELECT DISTINCT Year, Quarter, 
CAST(year AS VARCHAR) + ' Q' + CAST(quarter AS VARCHAR) as yr_quarter
FROM ds_in_fssa.dbo.wage_10pct
WHERE LEN(year)=4 AND LEN(Quarter)=1
ORDER BY Year, Quarter
"

#The query orders the unique year and quarter observations and creates "yr_quarter", which joins year and quarter in the format 'YYYY Q#'
#We then create a time variable that enumerates the ordered quarters by row number. 
year_q <-dbGetQuery(con, wage_qry1) %>% 
                       mutate(TIME=row_number()) %>% 
    select(yr_quarter, TIME) 

In [None]:
head(year_q,3)
tail(year_q,3)

Next, we want to join the continuous time variable **TIME** to our data frame. We will use a left join on **cohort_wages_industry** using **yr_quarter** to bring in the **TIME** variable. Once complete, we will have a data frame that consists of all the wage records for our 2018Q2 exit cohort from 2018Q2-2019Q3. Recall that if a person from our cohort has no wages in this time period, they will have a single record missing year, quarter, and any other variables from the wage records but they will still exist as a unique **ssn** in the data frame.

In [None]:
# Join the continuous TIME variable from the year_q data frame
cohort_wages_industry <- cohort_wages_industry %>% 
    left_join(year_q, by = "yr_quarter") 

# view the data frame that now contains TIME
head(cohort_wages_industry)

Given that we now have a linked data asset consisting of our original cohort, their associated employment history, and information on their employer, as well as a fixed time variable for our quarters of interest, we can now construct the job-level employment measures. 

By job, we are referring to an employee-employer pair. As a reminder, we will create measures of whether a job existed in the prior quarter or not, whether a job stayed around for a future quarter, and whether a job was a full employment job (existed in three consecutive quarters, where the assumption is that the person was at this job for at least one full quarter: the middle quarter). We also can count the number of quarters an employee was with an employer (the number of quarters this job existed) and the number of full employment quarters.
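
Before building these measures on the linked data, it may help to see the definitions on a single made-up job. The sketch below assumes a hypothetical person-employer pair (`toy_job`) observed in periods 1, 2, 3, and 5, and uses `coalesce()` to treat a missing lag or lead as "no adjacent record"; it is an illustration of the definitions, not the notebook's own implementation.

```r
library(dplyr)

# Hypothetical job: one person-employer pair observed in TIME 1, 2, 3 and again in 5
toy_job <- tibble(ssn = "A", Empr_no = "E1", TIME = c(1, 2, 3, 5))

toy_measures <- toy_job %>%
    group_by(ssn, Empr_no) %>%
    arrange(TIME, .by_group = TRUE) %>%
    mutate(
        # job existed in the immediately prior period
        same_emp = as.integer(coalesce(TIME - lag(TIME) == 1, FALSE)),
        # job exists in the immediately following period
        stay_emp = as.integer(coalesce(lead(TIME) - TIME == 1, FALSE)),
        # full-quarter job: present before, during, and after the period
        full_emp = as.integer(same_emp == 1 & stay_emp == 1)
    ) %>%
    ungroup()

toy_measures
```

Only period 2 is a full-quarter job here: period 1 has no prior record, period 3 is followed by a gap, and period 5 follows that gap and has no record after it.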

To start off, we will convert the **yr_quarter** variable into the data type `yearqtr` since it will be useful when we look at year-quarter combinations.
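
As a small sketch of what the `yearqtr` type offers, the example below uses made-up values (the name `toy_q` is hypothetical) to show that quarters can be compared and shifted across a year boundary.

```r
library(zoo)

# Hypothetical value, parsed from the same 'YYYY Q#' format used for yr_quarter
toy_q <- as.yearqtr("2018 Q4")

# yearqtr values are ordered in time, so comparisons behave as expected
as.yearqtr("2018 Q2") < toy_q

# adding 0.25 moves forward one quarter, rolling over into the next year
toy_q + 0.25 == as.yearqtr("2019 Q1")
```

Both expressions evaluate to `TRUE`, which is what lets us take the minimum and maximum of **yr_quarter** and filter on them later.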

In [None]:
cohort_wages_industry$yr_quarter <- as.yearqtr(cohort_wages_industry$yr_quarter)

Let's begin by creating a data frame that will select, group, and order relevant variables to measure employment. The resulting dataframe will have job (person-employer) records that exist for more than one quarter arranged by calendar time using the variable **TIME**. We will continue to use this grouped dataframe below.

In [None]:
#Create job-quarter level measures (will be aggregated to job-level measures)
emp_measures <- cohort_wages_industry %>%
# Select variables that you need to measure employment 
    select(ssn, yr_quarter, Empr_no, TIME) %>% 
# To identify whether a person has the same employer 
# from one quarter to the next, we will group by individuals and employers 
    group_by(ssn, Empr_no) %>%
# If an individual is only employed for one quarter, then they do not have steady employment (this also removes those not in wage records)
    filter(n() > 1) %>%
# Within each individual-employer grouping, sort the data by time
    arrange(TIME) 

head(emp_measures)

### Job-level Measures in Each Quarter

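Now that jobs are ordered by time, we can create our baseline measures for each job in each quarter:
1. **same_emp**, takes value 1 if the job existed in the immediately prior quarter and is 0 otherwise
2. **stay_emp**, takes value 1 if the job existed in the quarter immediately following and is 0 otherwise
3. **naics_match**, takes value 1 if the NAICS code for a person's job also appeared for that person in the immediately prior quarter and is 0 otherwise (created later in this section)

The code below also creates **new_emp**, which takes value 1 for the first observed quarter of a job that begins after 2018 Q2 and is 0 otherwise.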

In [None]:
# We will use the mutate function to create new variables from the job-quarter level measures to create job-level measures
emp_measures <- emp_measures %>% 
        mutate(new_emp = ifelse(TIME == min(TIME) & yr_quarter > as.yearqtr('2018 Q2'),1,0), # first record of grouping is a new job
# use the lag function to calculate how many time periods passed between two observations
           time_change_emp = TIME-lag(TIME), 
# The person has the same employer if only one time period passed between current and past observation
           same_emp = case_when(time_change_emp == 1 ~ 1,
                                is.na(time_change_emp) ~ 0,
                                TRUE ~ 0),
# The person stayed with the same employer if only one time period passed between the next and 
# current observation
           time_change_emp2 = lead(TIME)-TIME, 
           stay_emp = case_when(time_change_emp2 == 1 ~ 1,
                               is.na(time_change_emp2) ~ 0, 
                                TRUE ~ 0)) %>%
    select(-c(time_change_emp, time_change_emp2))
                                
# A person has full employment if the same employer employed them in the previous and next time periods
           
head(emp_measures)

Finally, let's create an indicator for whether or not a job was a full quarter job as indicated by the job existing in the quarter of interest, in the immediately prior quarter, and in the quarter immediately following.

In [None]:
# Generate the full employment indicator: a job is a full-quarter job if it also existed in the
# prior quarter (same_emp) and in the following quarter (stay_emp)
emp_measures <- emp_measures %>% 
    mutate(full_emp = ifelse(same_emp==1 & stay_emp == 1,1,0)) %>%
# remove TIME from the data frame
    select(-c(TIME))

head(emp_measures)

We used the quarter before and after our analysis window to generate the full employment variable but will now drop these quarters to display these measures in a neater manner.

In [None]:
#Filter for only the quarters we are interested in (2018Q3-2019Q2)
emp_measures <- emp_measures %>% 
    filter(yr_quarter != min(emp_measures$yr_quarter, na.rm=TRUE) & yr_quarter != max(emp_measures$yr_quarter, na.rm=TRUE))

tail(emp_measures)    

Say we want to add in information about person-job pairs and associated industry changes. With this linked data asset, we can measure if a job in a given time period matches any of the industries in the previous time periods (if they are continuing to work in a specific industry). We are only considering immediately contiguous quarters (time) so that if a person was employed in industry A in Q2 and Q4, that doesn't count as a match, even if they were not employed in Q3.

In [None]:
# Create all distinct person-industry groupings per quarter (TIME)
previous_emp <- cohort_wages_industry %>%
    distinct(ssn, naics_code, TIME) %>%
# remove people who never show up in the wage records
    filter(!is.na(TIME)) %>% 
# Increment time by 1 so that TIME now represents the next time period, but with this period's naics code
    mutate(TIME = TIME + 1) %>% 
    mutate(naics_match = 1)


We are now left with a data frame that associates each person's **TIME** with the naics codes in the previous time period. When we join on **TIME**, **naics_code**, and **ssn**, we will pick up on matches between adjacent time periods. Records that are matched will have a **naics_match**=1, and will otherwise be NA due to the left join.
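
The shift-and-join idea can be sketched on a small made-up example. The objects `toy_ind`, `toy_prev`, and `toy_matched` and the NAICS values in them are hypothetical; the point is only to show which rows pick up a match.

```r
library(dplyr)

# Hypothetical person-industry records: "5613" in TIME 1 and 2, "7225" in TIME 3, "5613" again in TIME 5
toy_ind <- tibble(ssn        = "A",
                  naics_code = c("5613", "5613", "7225", "5613"),
                  TIME       = c(1, 2, 3, 5))

# Shift every record forward one period so it lines up with the following quarter
toy_prev <- toy_ind %>%
    distinct(ssn, naics_code, TIME) %>%
    mutate(TIME = TIME + 1, naics_match = 1)

# A row matches only if the same person had the same NAICS code one period earlier
toy_matched <- toy_ind %>%
    left_join(toy_prev, by = c("ssn", "TIME", "naics_code")) %>%
    mutate(naics_match = ifelse(is.na(naics_match), 0, naics_match))

toy_matched
```

Only the record in period 2 is flagged. The return to "5613" in period 5 is not a match because the person was not in that industry in period 4.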

In [None]:
#Create a data frame that has information on whether a person had a naics code in the previous quarter that matches their current naics code
industry_df <- cohort_wages_industry %>% 
    left_join(previous_emp, by = c("ssn", "TIME", "naics_code")) %>% 
    mutate(naics_match=case_when(naics_match==1 ~ 1, TRUE ~ 0)) %>% 
    select(ssn, Empr_no, naics_match, yr_quarter)

head(industry_df)

With our new **naics_match** variable created, we can join it, along with **emp_measures**, back to **cohort_wages_industry**, which contains all individuals from our cohort. Here, we will use left joins to ensure we do not accidentally filter out any individuals in our cohort. 

Recall that missing employment measures indicate either a person did not have repeated job records or that they had no records in the wage records. If it's the first case, we will assign **new_emp**=1 to indicate a new employment (job) instead of NA and assign only a single quarter of employment. If it's the latter case where a person did not have wages recorded in the relevant time frame, we set all measures equal to 0. 

> Note: `replace_na()` takes a list of variables and replaces the associated NA values on these variables with values you specify.
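
The order of these two steps matters, as the sketch below illustrates on made-up data. The object `toy_emp` and its values are hypothetical: person "A" has a single-quarter job (wages present, measure missing), person "B" has no wage record, and person "C" has a continuing job.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows after the left joins: new_emp is NA wherever no employment measure was attached
toy_emp <- tibble(ssn     = c("A", "B", "C"),
                  Wage    = c(500, NA, 800),
                  new_emp = c(NA, NA, 0))

toy_filled <- toy_emp %>%
    # first, people with no wage record get new_emp = 0
    mutate(new_emp = ifelse(is.na(Wage), 0, new_emp)) %>%
    # then, any remaining NA is a job without repeated records, which counts as a new job
    replace_na(list(new_emp = 1))

toy_filled
```

After both steps, **new_emp** is 1 for "A" and 0 for "B" and "C". Reversing the order would not change this toy result, but skipping the first step would mark "B" as having a new job.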

In [None]:
#Merge in the emp_measures variables and the naics_match from industry_df
#Replace na values in the emp_measures variables as follows:
cohort_wages_industry_2 <- cohort_wages_industry %>%
    left_join(emp_measures, by = c('ssn', 'Empr_no', 'yr_quarter')) %>% 
    left_join(industry_df, by=c('ssn', 'Empr_no', 'yr_quarter')) %>%
    mutate(new_emp = ifelse(is.na(Wage), 0, new_emp)) %>% # no wage record: not a new job
    # The following code replaces missing values for each variable with the value specified. 
    replace_na(list(new_emp = 1, # if new_emp is missing, replace it with 1 
                    same_emp = 0, # if same_emp is missing, replace it with 0
                    stay_emp = 0, # if stay_emp is missing, replace it with 0
                    full_emp = 0)) %>% # if full_emp is missing, replace it with 0
    ungroup()


head(cohort_wages_industry_2)

### Person-level Measures in Each Quarter

Next, let's begin our exploration of person-level measures. Recall that we planned on producing measures tracking a person's quarterly job counts and whether a person did not have any employment in wage records. Note that all person-level measures are generated at the year-quarter level. 

> Note: As an example, a person who is employed at 2 jobs in 2018 Q3 and 3 jobs in 2018 Q4 will have 2 separate person-level job_counts: the job_count would be 2 in 2018 Q3 and 3 in 2018 Q4. 
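
The example in the note can be reproduced on made-up data. The object `toy_jobs` below is hypothetical and uses `count()` as one simple way to get jobs per person per quarter; the notebook's own code uses `n()` inside a grouped `mutate()` instead.

```r
library(dplyr)

# Hypothetical job records: person "A" holds 2 jobs in 2018 Q3 and 3 jobs in 2018 Q4
toy_jobs <- tibble(ssn        = "A",
                   yr_quarter = c("2018 Q3", "2018 Q3", "2018 Q4", "2018 Q4", "2018 Q4"),
                   Empr_no    = c("E1", "E2", "E1", "E2", "E3"))

# One row per person-quarter with the number of jobs held in that quarter
toy_counts <- toy_jobs %>%
    count(ssn, yr_quarter, name = "job_count")

toy_counts
```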

In [None]:
#Begin person-level measures
cohort_wages_industry_2 <- cohort_wages_industry_2 %>% 
    group_by(ssn, yr_quarter) %>% 
    mutate(job_count=n(),
           one_job=case_when(job_count==1 & !is.na(Wage) ~ 1, # if a person has one job + non-missing wage
                           job_count==1 & is.na(Wage) ~ 0, # if a person has one job + missing wage
                           TRUE ~ 0),
           multiple_jobs=case_when(job_count>1 ~ 1, TRUE ~ 0)) %>% # if a person has more than one job
    ungroup()

head(cohort_wages_industry_2)

Finally, we will continue to aggregate our measures to the person-level for every quarter. The resulting data frame will provide us with person-level aggregates for total earnings, full employment, industry matches, single-job holding, and multiple-job holding for every quarter in which a person appears in the wage records.

> Note: We will assign the population of the cohort to a variable called **cohort_pop**, as we will use this as a denominator in future calculations.

In [None]:
# find total cohort population
cohort_pop <- cohort_wages_industry_2 %>%
    summarise(cohort_pop = n_distinct(ssn)) %>%
    pull(cohort_pop)
            
cohort_pop 

In [None]:
# aggregate measures by person
cohort_wages_industry_3 <- cohort_wages_industry_2 %>%
    group_by(ssn, yr_quarter, TIME)  %>%
    summarize(earnings = sum(Wage),
              full_emp = max(full_emp),
              naics_matches = max(naics_match),
              one_job = min(one_job),
              multiple_jobs=max(multiple_jobs)) %>%
    ungroup() %>%
    group_by(ssn) %>%
    arrange(TIME) %>% 
    ungroup() %>% 
    filter(yr_quarter!=min(cohort_wages_industry_2$yr_quarter, na.rm = TRUE) & yr_quarter!=max(cohort_wages_industry_2$yr_quarter,  na.rm = TRUE))

head(cohort_wages_industry_3)

In [None]:
# we see a lower number of individuals than cohort_pop because the filter on yr_quarter dropped the first and last quarters
# ('2018 Q2' and '2019 Q3') as well as individuals with no wage records (their yr_quarter is NA)
cohort_wages_industry_3 %>% 
    summarize(
        n_distinct(ssn)
    )

## Employment Outcomes
Given that we have created some measures at the job- and person-level, we can use these measures to generate aggregate employment outcome information for our cohort in each of the 4 quarters subsequent to TANF exit. We may be interested in the fraction of the cohort that is employed in any capacity in each of the four quarters, the fraction of the cohort at new jobs, or the share of the cohort that has changed industries. 

* What fraction of the cohort is ever employed in any of the four quarters post-exit?
* What fraction of the cohort is employed in each of the four quarters in our analytical window?
* What fraction of the cohort is fully employed in 2018Q3-2019Q2?
* What fraction of the cohort consists of multiple-job holders?
* What fraction remain in at least one job in the same industry from the last quarter in 2018Q3-2019Q2?
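
Each of these questions is a share of the full cohort, so the denominator includes people who never appear in the wage records. A minimal sketch on made-up data, assuming a hypothetical cohort of four people (`toy_cohort_pop`) of whom only two have wage records:

```r
library(dplyr)

# Hypothetical cohort of 4 people; only "A" and "B" ever appear in the wage records
toy_cohort_pop <- 4
toy_person_qtr <- tibble(ssn        = c("A", "A", "B"),
                         yr_quarter = c("2018 Q3", "2018 Q4", "2018 Q3"))

# Share of the whole cohort (not just the employed) with any employment in each quarter
toy_shares <- toy_person_qtr %>%
    group_by(yr_quarter) %>%
    summarize(count_employed = n_distinct(ssn),
              prop_employed  = count_employed / toy_cohort_pop)

toy_shares
```

Here half of the cohort is employed in 2018 Q3 and a quarter of it in 2018 Q4, even though everyone in `toy_person_qtr` is employed in at least one quarter.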



In [None]:
# What proportion of individuals were ever employed the year after exiting TANF?
cohort_wages_industry_3 %>% 
    summarize(count_employed = n_distinct(ssn), 
              share_emp=count_employed/cohort_pop)

In [None]:
# What share of the cohort has any employment in each quarter?
cohort_wages_industry_3 %>% 
    group_by(yr_quarter) %>% 
    summarize(count_employed=n_distinct(ssn), 
              prop_employed=count_employed/cohort_pop)


In [None]:
 # What fraction of the cohort is fully employed in 2018Q3-2019Q2?
df1 <- cohort_wages_industry_3 %>% 
    group_by(yr_quarter) %>% 
    summarise(count_full_emp=sum(full_emp),
              prop_full_emp = count_full_emp/cohort_pop) %>%
 # converting variable to factor to make the plot more readable
    mutate(yr_quarter = as.factor(yr_quarter))

head(df1)


In [None]:
df1 %>% 
    ggplot(aes(x=yr_quarter, y=prop_full_emp)) + geom_col()

In [None]:
# Multiple jobs?
cohort_wages_industry_3 %>% 
    group_by(yr_quarter) %>% 
    summarise(count_multiple_jobs = sum(multiple_jobs),
              prop_multiple_jobs = count_multiple_jobs/cohort_pop)

In [None]:
# What fraction remain in job at same prior industry?
cohort_wages_industry_3 %>%  
    group_by(yr_quarter) %>% 
    summarize(count_match=sum(naics_matches), 
              prop_match=count_match/cohort_pop)

## References

U.S. Census Bureau. (n.d.). North American Industry Classification System. United States Census Bureau. Retrieved May 31, 2022, from https://www.census.gov/naics/?input=561320&year=2022&details=561320