<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 

# **<center> Data Exploration: Cohort Analysis </center>**

<a href="https://doi.org/10.5281/zenodo.4589024"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4589024.svg" alt="DOI"></a>

Tian Lou, Dave McQuown

## **1. Introduction**

In the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb), we have calculated Illinois weekly certified claimant counts during the COVID recession and average weekly total pay. The first measure allows us to see the **stock of certified claimants** during a specific week and how it changes over time. While it helps us understand how many workers continue to experience unemployment and rely on UI benefits for financial support, **it does not allow us to see the inflow of new certified claimants and to track them over time**. For example, among all the certified claimants who received benefits during the week ending March 28th, how many of them are new entrants? How fast do they leave the UI programs? Do claimants from different demographic groups, industries, and locations exit the program at different rates? If and when have they been exposed to interventions, such as economic shocks and training programs? Do these interventions accelerate or slow down UI program exit rates?

To answer these questions, we need to use **longitudinal data**. In this notebook, we introduce a second method of analyzing the UI claims data: **Cohort Analysis**. Specifically, we will focus on new certified claimants who entered UI programs at the beginning of the COVID recession and track their records in the PROMIS file for at least 26 weeks, i.e., the maximum duration of UI benefits in Illinois.[<sup>1</sup>](#fn1) <a id = "5"> </a> Then, for each week after program entry, we will investigate how many and what percentage of these claimants exit the program. We will also look at whether claimants from different industries have different exit rates. 

Similar to the first data exploration notebook, after we finish the analysis, we will save our results in csv files and use them to create various visualizations later. Moreover, in the machine learning analysis, we will use the cohort sample we create in this notebook to examine how the economic shock influences our cohort's program exit rates. As you work through the notebook, we will have checkpoints for you to practice using the code. You can think about how you might apply any of the techniques and code presented in this notebook to your project.

## **2. Learning Objectives**

Throughout this series of notebooks, our overarching questions are: How can states best produce information that can be used by local workforce boards to help inform resource allocation for unemployed workers? What information is the most useful for workforce boards for strategic resource planning and efficient resource allocation? What measures are most useful to local workforce boards and how do we characterize variation in those measures? We use Illinois administrative data, Bureau of Labor Statistics (BLS) workforce data, and Opportunity Insights Economic Tracker data as an example and show you step by step how we develop a project, including how to explore the dataset, select a sample, define outcome measures, conduct subgroup analyses, create visualizations, and build a predictive machine learning model. 

In this notebook, our focus is the **cohort analysis**. After you finish this notebook, you should understand:
- the difference between cross-sectional analysis and cohort analysis
- how to select a cohort from the PROMIS file
- how to define exiters and calculate exit rates
- how to calculate exit rates for claimants from different groups

#### **Research Questions**
Illinois started its massive lockdown in mid-March 2020.[<sup>2</sup>](#fn2) <a id = "6"> </a>In this notebook, we are interested in outcomes of certified claimants who entered UI programs shortly after the lockdown, i.e., during the week ending March 28th and the week ending April 4th, 2020. This cohort will be referred to as **"the COVID-19 cohort"** throughout this project. The specific questions we seek to answer in this notebook are:

- How many certified claimants entered UI programs in Illinois during the week ending March 28th and the week ending April 4th, 2020?
- How many and what percentage of certified claimants in the COVID-19 cohort exit during each week after program entry?
- How does the COVID-19 cohort's exit rate vary by industry?

#### **Datasets**
We will continue to use the Illinois PROMIS certified claims file in this notebook:
- **2020 Illinois PROMIS certified claims file**: weekly UI claims data. Each record represents a certified claim in a certain week. The data has a claimant's demographics, education level, prior industry, occupation, and locations. It also contains detailed information about the claim, such as program type, claim type, certification status, benefit starting date, and benefit amount. Federal Pandemic Unemployment Compensation (FPUC, $600/week) and dependent benefits are included in the total amount paid.

**We will only use the 1% random sample in this notebook. You should also only use the 1% random sample when walking through all notebooks and when identifying the scope of your analysis.**

#### **Analytical Methods**
The specific techniques include, but are not limited to:
- **SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on conditions
 - `GROUP BY`: aggregate data over the variables of interest
 - `ORDER BY`: sort data based on the variables of interest
 - `DISTINCT`: look at distinct values of a variable
 - `JOIN ... ON`: join tables
- **R code**:
 - `group_by` and `summarize` to find group-based measures
 - `mutate` to create new variables
 - `arrange` and `desc` to sort values
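
As a quick illustration of the R verbs listed above, the following sketch chains them on a small made-up data frame. The data frame `toy_claims`, its column values, and the derived column `pay_per_claimant` are hypothetical and are not drawn from the PROMIS file.

```r
library(dplyr)

# Hypothetical weekly claim records (made-up values, not PROMIS data)
toy_claims <- data.frame(
  ssn_id    = c(1, 1, 2, 2, 3),
  industry  = c("Retail", "Retail", "Retail", "Retail", "Health"),
  total_pay = c(300, 320, 250, 250, 400)
)

toy_summary <- toy_claims %>%
  group_by(industry) %>%                                   # one group per industry
  summarize(claimants  = n_distinct(ssn_id),               # distinct people per group
            total_paid = sum(total_pay)) %>%               # total benefits per group
  mutate(pay_per_claimant = total_paid / claimants) %>%    # create a new variable
  arrange(desc(pay_per_claimant))                          # sort, largest first

toy_summary
```

The same pattern (group, summarize, derive, sort) is what the exit-rate calculations later in this notebook rely on.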

#### **Directory Structure**

We will frequently read and write csv files to load crosswalks and to save results in all the notebooks. If you have not done so already, let's create a few folders in your U drive first so it is easier for you to organize all the files. 

- Open a Windows File Explorer
- On the left hand side, find U drive (U:) and click into it
- On the right hand side, open your user folder: FirstName.LastName.UserID
- In your user folder, create a new folder: ETA Training
- In the "ETA Training" folder, create three subfolders: "Notebooks", "Results", "Output"
- You can copy and paste the class notebooks to the "Notebooks" folder, save summary statistics to the "Results" folder, and save visualizations (in the third notebook) to the "Output" folder.

For example, we read all the crosswalks from **"P:\tr-dol-eta\ETA Class Notebooks\xwalks"**.  At the end of this notebook, **we save all the summary statistics to "U:\\FirstName.LastName.UserID\ETA Training\Results\filename.csv"**.

## **3. Create the COVID Cohort**
In this section, we will use the table `il_des_promis_1pct` in the schema `tr_dol_eta.dbo` to create the COVID-19 cohort, i.e., certified claimants who entered UI programs during the week ending March 28th and the week ending April 4th, 2020. First, let's load the R libraries and establish a database connection.

In [None]:
# Database interaction imports
library(DBI)   # provides dbGetQuery() and dbDisconnect(), which are called below
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

#### **What's the Difference Between the Cross-sectional Analysis and the Cohort Analysis?**
In the cohort analysis, we slice the data by using `byr_start_week`, which indicates the week a claimant **started to receive** UI benefits. (*Like `week_end_date`, `byr_start_week` is always a Saturday.*) Recall that in the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb), we used `week_end_date` to select our sample. The variable `week_end_date` shows the week a certified claimant **received** their benefits. The choice of date variable determines the key difference between the cross-sectional sample and the cohort sample. Specifically, the cross-sectional sample includes every certified claimant who received benefits during the time period we are looking at, regardless of when they entered the UI programs. The cohort sample only includes certified claimants who entered the program during the time period we are interested in. **In other words, the cross-sectional sample shows us the stock of certified claimants during a time period, while the cohort sample represents the inflow of new certified claimants during a time period.** 

Moreover, in the cohort analysis, the sample we select should include each claimant's records from program entry to program exit. We can select such a sample directly because the table `il_des_promis_1pct` already combines all the weekly PROMIS files and has been cleaned. In reality, the PROMIS file is updated weekly, so you would first need to append the weekly data to create a table similar to `il_des_promis_1pct`. 
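
As a rough sketch of that appending step, the code below stacks two weekly extracts into one long table. The notebook does not describe how the raw weekly files are named or stored, so the data frames `week_0328` and `week_0404` are hypothetical stand-ins with made-up values, and dropping exact duplicate rows is an assumed cleaning step rather than a documented one.

```r
library(dplyr)

# Two hypothetical weekly extracts with the same columns (made-up values)
week_0328 <- data.frame(
  ssn_id         = c(1, 2, 3),
  week_end_date  = as.Date("2020-03-28"),
  byr_start_week = as.Date(c("2020-03-07", "2020-03-28", "2020-03-28"))
)
week_0404 <- data.frame(
  ssn_id         = c(2, 3),
  week_end_date  = as.Date("2020-04-04"),
  byr_start_week = as.Date(c("2020-03-28", "2020-03-28"))
)

# Stack the weekly extracts, drop exact duplicate rows (assumed cleaning step),
# and sort so each claimant's weeks appear in order
promis_all <- bind_rows(week_0328, week_0404) %>%
  distinct() %>%
  arrange(ssn_id, week_end_date)

promis_all
```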

Next, let's see an example with hypothetical data.

|ssn_id|week_end_date|byr_start_week|
|---|---|---|
| 1 | 2020-03-07 | 2020-03-07 |
| 1 | 2020-03-14 | 2020-03-07 |
| 1 | 2020-03-21 | 2020-03-07 |
| 1 | 2020-03-28 | 2020-03-07 |
| 2 | 2020-03-28 | 2020-03-28 |
| 2 | 2020-04-04 | 2020-03-28 |
| 2 | 2020-04-11 | 2020-03-28 |
| 3 | 2020-03-28 | 2020-03-28 |
| 3 | 2020-04-04 | 2020-03-28 |

The hypothetical data has the same format as the data in the table `il_des_promis_1pct`. Each record represents a person's certified claim during a specific week. For each claimant, `week_end_date` is different in each record, while `byr_start_week` stays the same. We can see that claimant 1 entered during the week ending March 7th and stayed for four weeks. Claimants 2 and 3 both entered during the week ending March 28th and stayed for three and two weeks, respectively. 

Suppose we are interested in looking at the week ending March 28th. When we select the data by using `week_end_date='2020-03-28'`, we get the cross-sectional sample:

|ssn_id|week_end_date|byr_start_week|
|---|---|---|
| 1 | 2020-03-28 | 2020-03-07 |
| 2 | 2020-03-28 | 2020-03-28 |
| 3 | 2020-03-28 | 2020-03-28 |

It shows us that during the week ending March 28th, there were three certified claimants in total.

When we select the data by using `byr_start_week='2020-03-28'`, we get the cohort sample:

|ssn_id|week_end_date|byr_start_week|
|---|---|---|
| 2 | 2020-03-28 | 2020-03-28 |
| 2 | 2020-04-04 | 2020-03-28 |
| 2 | 2020-04-11 | 2020-03-28 |
| 3 | 2020-03-28 | 2020-03-28 |
| 3 | 2020-04-04 | 2020-03-28 |

It tells us that two new claimants (two distinct `ssn_id`) entered during the week ending March 28th. They both stayed in the program during the week ending April 4th. One of them left during the week ending April 11th and the other one left during the week after.
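
The two selections above can be reproduced in R on the same hypothetical table. This is only a sketch for illustration: the data frame `toy` is typed in by hand from the table above and is not read from the database.

```r
library(dplyr)

# Hypothetical data copied from the example table above
toy <- data.frame(
  ssn_id = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  week_end_date = as.Date(c("2020-03-07", "2020-03-14", "2020-03-21", "2020-03-28",
                            "2020-03-28", "2020-04-04", "2020-04-11",
                            "2020-03-28", "2020-04-04")),
  byr_start_week = as.Date(c(rep("2020-03-07", 4), rep("2020-03-28", 5)))
)

# Cross-sectional sample: everyone who received benefits in the week ending March 28th
cross_section <- toy %>% filter(week_end_date == as.Date("2020-03-28"))

# Cohort sample: everyone who entered in the week ending March 28th, with all their weeks
cohort <- toy %>% filter(byr_start_week == as.Date("2020-03-28"))

n_distinct(cross_section$ssn_id)  # 3 claimants in the stock
n_distinct(cohort$ssn_id)         # 2 claimants in the inflow
```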

#### **The COVID-19 Cohort**
We are interested in **certified claimants who entered the regular state UI program during the week ending March 28th and the week ending April 4th** (`byr_start_week='2020-03-28' or byr_start_week='2020-04-04'`). We combine them as one cohort and call them **the COVID-19 cohort**. We choose this cohort for two reasons. First, they are mostly workers who lost jobs immediately after the massive lockdown in Illinois. Recall that the Illinois lockdown order started in mid-March and that claimants usually need two weeks to file and certify their claims. Second, those two weeks account for a disproportionately large number of the claimants who started new benefits during the COVID-19 pandemic. This sample gives us enough observations to conduct subgroup analysis at various levels (e.g., by region and by industry).

As with the cross-sectional analysis, we focus on the **Regular state UI program** (`program_type=1` and `sub_program_type=1`). An additional restriction we include in the cohort analysis is that we limit the sample to **new claims only** (`claim_type=1`). New claims are claims from individuals who are not currently active in the unemployment insurance system. These claims involve a benefit determination based on previous quarters' earnings and begin a new benefit year for the claimant. If a claimant files for unemployment insurance but already has an active benefit year that was opened in response to a previous job loss within one calendar year, that is considered an additional claim and the claimant resumes the existing benefit year. By limiting the sample to new claims and excluding additional claims, we focus in this section only on the spell of unemployment that began the benefit year.

In [None]:
# Select PROMIS certified claimant records from the database to a dataframe

# Store SQL query to variable
query <- "
SELECT ssn_id,
    week_end_date,
    byr_start_week,
    sub_program_type,
    program_type,
    claim_type,
    birth_date,
    gender,
    race,
    ethnicity,
    disability,
    education,
    county_fips_code,
    naics_code,
    occupation_code,
    total_pay
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND claim_type = 1
AND byr_start_week IN ('2020-03-28', '2020-04-04');
"

# Execute query
df_claimants <-dbGetQuery(con,query)

# R interprets dates as character when pulling from the database, must convert with ymd()
df_claimants <- df_claimants %>%
    mutate(week_end_date=ymd(week_end_date),
          byr_start_week=ymd(byr_start_week),
          birth_date=ymd(birth_date))

# See top records in the dataframe
head(df_claimants)

# Close the database connection
dbDisconnect(con)

In [None]:
# Count the number of people in the COVID-19 cohort
# We are using the distinct count of individuals in the cohort as the starting population,
# rather than the count of claimants in the first week because not everyone in the cohort will
# be paid benefits in the first week.
# Save the number in a variable for calculating exit rates
cohort_pop <- (df_claimants %>% summarize(n_distinct(ssn_id)))[1,1]

# Check the result
cohort_pop

`cohort_pop` represents the number of new certified claimants in our 1% random sample who entered the regular state UI program in Illinois during the weeks ending March 28th and April 4th. They account for roughly one quarter of new certified claimants since the start of the COVID recession.

In [None]:
# Join/merge the DataFrame with the county-region crosswalk

# Load Restore Illinois Health Regions
region_dict <- read_csv('P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\reopening_regions.csv', col_types = "ic") %>%
    mutate(county_fips_code = substr(fips,3,5)) %>%
    select(county_fips_code, region)

# Left join region to certified claimants data
# (left_join() takes `by`; `by.y` is an argument of base merge() and is not used by dplyr)
df_claimants <- left_join(df_claimants, region_dict, by = "county_fips_code")

# See top records in the dataframe
head(df_claimants)

#### **Checkpoint 1: Create Your Regional Cohort Sample**
In the class, we have asked you to identify a **region of interest**. Now use the DataFrame `df_claimants` we have created in the example to select your regional cohort sample. How many new claimants in the COVID-19 cohort are from your region of interest?

> Values of `region`: Central, Cook County, North-Central, Northeast, Southern

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full state and are primarily used in the examples rather than the checkpoints.

In [None]:
# Subset the dataset to one region.
# Replace ___ with the region name
df_reg_claimants <- df_claimants %>%
    filter(region == "___")

# See top records in the dataframe
head(df_reg_claimants)

## **4. Calculate Exit Rates**

In this section, we will aggregate the data in `df_claimants` to get the number of and the percentage of the COVID-19 cohort staying in the UI program each week after program entry. Then, we will calculate exit rates of the COVID-19 cohort.

First, we need to create an indicator, `week_number`, which shows the number of weeks after program entry. For example, for claimants who entered the program during the week ending March 28th, March 28th is the first week, April 4th is the second week, and so on. For claimants who started during the week ending April 4th, April 4th is the first week, April 11th is the second week, and so on. Using our hypothetical data, it should look like:

|ssn_id|week_end_date|byr_start_week|week_number|
|---|---|---|---|
| 2 | 2020-03-28 | 2020-03-28 | 1 |
| 2 | 2020-04-04 | 2020-03-28 | 2 |
| 2 | 2020-04-11 | 2020-03-28 | 3 |
| 3 | 2020-03-28 | 2020-03-28 | 1 |
| 3 | 2020-04-04 | 2020-03-28 | 2 |
| 4 | 2020-04-04 | 2020-04-04 | 1 |
| 4 | 2020-04-11 | 2020-04-04 | 2 |
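
The `week_number` column in the table above can be reproduced on the hypothetical rows as follows. This is a sketch on hand-typed data; the data frame `toy_cohort` and the check `all_whole_weeks` are illustrative names, and the check assumes both date columns always fall on Saturdays, as stated earlier in this notebook.

```r
library(dplyr)

# Hypothetical rows copied from the table above
toy_cohort <- data.frame(
  ssn_id = c(2, 2, 2, 3, 3, 4, 4),
  week_end_date = as.Date(c("2020-03-28", "2020-04-04", "2020-04-11",
                            "2020-03-28", "2020-04-04",
                            "2020-04-04", "2020-04-11")),
  byr_start_week = as.Date(c(rep("2020-03-28", 5), rep("2020-04-04", 2)))
)

# Sanity check: every gap between the two dates should be a whole number of weeks
all_whole_weeks <- all(
  as.integer(toy_cohort$week_end_date - toy_cohort$byr_start_week) %% 7 == 0
)

# Weeks elapsed since program entry, counting the entry week as week 1
toy_cohort <- toy_cohort %>%
  mutate(week_number = as.integer(difftime(week_end_date, byr_start_week,
                                           units = "weeks")) + 1)

toy_cohort
```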

In [None]:
# Create week number field
df_claimants <- df_claimants %>% 
    mutate(week_number = as.integer(difftime(week_end_date, byr_start_week, units = "weeks")) + 1) %>%
    arrange(ssn_id, week_end_date)

# See top records in the dataframe
head(df_claimants)

Second, we aggregate our sample by using `week_number`. The resulting data frame shows the number of claimants in the COVID cohort who stayed in the UI program during each week after program entry. 

In [None]:
# Aggregate the sample on week_number
# Save the results to a data frame
cs_week_number <- df_claimants %>%
    group_by(week_number) %>%
    summarize(claimant_count=n_distinct(ssn_id)) %>%
    filter(week_number >= 1)

# See top records in the data frame
head(cs_week_number)

Next, we adjust `claimant_count` in the first week to equal the total cohort population, `cohort_pop`. This ensures that week 1 will be 100% since a small share of claimants get their first payment after the benefit year start week. Finally, we divide `claimant_count` in each week by `cohort_pop` to calculate the percent of claimants remaining as of that week. **The exit rate** is one minus the percent of stayers.

In [None]:
# Calculate percent of stayers and percent of exiters during each week 
# Save the results to a data frame
cs_exits <- cs_week_number %>% mutate(cohort_start_pop=cohort_pop) %>% #Add the starting cohort population to the data frame
    mutate(claimant_count = case_when(week_number == 1 ~ cohort_start_pop, TRUE ~ claimant_count)) %>% #Replace count of claimants in week 1 with the cohort population
    mutate(stay_pct = claimant_count/cohort_start_pop) %>% #Calculate share staying
    mutate(exit_pct = 1 - stay_pct) #Calculate share exiting

# See top 30 records in the data frame
head(cs_exits, 30)

We can see that nearly (REDACTED) of the cohort has exited by (REDACTED). The rate of exit generally slows over the life of the cohort, but (REDACTED). This is because regular benefits cover only 26 weeks within a benefit year. However, some claimants draw down their benefits over more than 26 calendar weeks because their weekly payments are reduced when they earn part-time wages, which is why some claimants remain after the 26th week.
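
To make the arithmetic concrete, here is a small sketch on made-up data. The data frame `toy_weeks` is hypothetical: it contains four claimants, one of whom (claimant 5) has no record in week 1, which mimics a claimant whose first payment comes after the benefit year start week.

```r
library(dplyr)

# Hypothetical claimant-week records; claimant 5 has no week 1 record
toy_weeks <- data.frame(
  ssn_id      = c(2, 2, 2, 3, 3, 4, 4, 5),
  week_number = c(1, 2, 3, 1, 2, 1, 2, 2)
)

# Starting population: distinct claimants, not the week 1 count
toy_cohort_pop <- n_distinct(toy_weeks$ssn_id)

toy_exits <- toy_weeks %>%
  group_by(week_number) %>%
  summarize(claimant_count = n_distinct(ssn_id)) %>%
  mutate(cohort_start_pop = toy_cohort_pop,
         # Week 1 is set to the full cohort so the series starts at 100%
         claimant_count = if_else(week_number == 1, cohort_start_pop, claimant_count),
         stay_pct = claimant_count / cohort_start_pop,
         exit_pct = 1 - stay_pct)

toy_exits
```

In this toy example the raw week 1 count is 3, but the cohort has 4 members, so week 1 is adjusted to 4. By week 3 only one claimant remains, giving a stay share of 0.25 and an exit share of 0.75.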

#### **Checkpoint 2: Calculate Your Regional Cohort's Exit Rates**
Now use the regional cohort sample you created in Checkpoint 1 to calculate exit rates of your regional cohort. Remember to create the `week_number` indicator first. Does your regional cohort have a different trend in exit rates from the state-level trend shown in the example?  

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full state and are primarily used in the examples rather than the checkpoints.

In [None]:
# Create week number field for your regional dataset
df_reg_claimants <- df_reg_claimants %>% 
    mutate(week_number = as.integer(difftime(week_end_date, byr_start_week, units = "weeks")) + 1) %>%
    arrange(ssn_id, week_end_date)

# See top records in the data frame
head(df_reg_claimants)

In [None]:
# Calculate cohort starting population for the regional subset
reg_cohort_pop <- (df_reg_claimants %>% summarize(n_distinct(ssn_id)))[1,1]

# Check the result
reg_cohort_pop

In [None]:
# Aggregate the regional subset on week_number
# Save the results to a data frame
cs_week_number_reg <- df_reg_claimants %>%
    group_by(week_number) %>%
    summarize(claimant_count=n_distinct(ssn_id)) %>%
    filter(week_number >= 1)

# See top records in the data frame
head(cs_week_number_reg)

In [None]:
# Calculate percent of stayers and percent of exiters during each week 
# Save the results to a data frame
cs_reg_exits <- cs_week_number_reg %>% mutate(cohort_start_pop=reg_cohort_pop) %>% #Add the starting cohort population to the data frame
    mutate(claimant_count = case_when(week_number == 1 ~ cohort_start_pop, TRUE ~ claimant_count)) %>% #Replace count of claimants with cohort population in week 1
    mutate(stay_pct = claimant_count/cohort_start_pop) %>% #Calculate share staying
    mutate(exit_pct = 1 - stay_pct) #Calculate share exiting

# See top 30 records in the data frame
head(cs_reg_exits, 30)

## **5. Calculate Exit Rates by Industry**

On May 5th, 2020, the state of Illinois released a five-phase plan to gradually reopen its economy based on regional health metrics and hospital capacities. According to the plan, restrictions on different industries were removed or reduced in different phases. For example, in phase 1, only essential businesses can open; in phase 2, non-essential retail stores can open for curb-side pickup and delivery; in phase 3, manufacturing, offices, retail, barbershops, and salons can open with capacity limits; in phase 4, restaurants and bars can open with capacity limits and travel resumes.[<sup>3</sup>](#fn3) <a id = "7"> </a>

As businesses in different industries reopen in different phases and at different capacities, UI claimants from different industries may return to jobs at varying rates. In this section, we will calculate the COVID cohort's exit rates for each industry and investigate which industries' claimants exit the UI program quickly and which industries' claimants tend to leave slowly.

We start with calculating weekly counts of the COVID cohort claimants who stayed in the UI program for each industry. **We will save the starting cohort count of each industry in a separate column and use them as denominators when calculating exit rates.**

In [None]:
# Join industry groupings used for UI dashboard

# Convert 6-digit NAICS codes to 2-digit NAICS codes by keeping only the first two characters
df_claimants <- df_claimants %>% mutate(naics_maj_code = substr(naics_code,1,2))

# Drop records whose 2-digit code appears in the list below (values that are not known NAICS major codes)
df_claimants <- df_claimants %>% filter(!(naics_maj_code %in% c((REDACTED))))

# See `naics_groups.csv` in the shared/xwalk folder
naics_groups <- read_csv('P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\naics_groups.csv', col_types = "ccc")

# Join NAICS groupings to claimant dataset
df_claimants <- left_join(df_claimants, naics_groups, by = 'naics_maj_code')

In [None]:
# Next, calculate starting cohort population and weekly certified claimant count by industry

# Calculate cohort starting population for each industry
ind_cohort_pop <- df_claimants %>% 
    group_by(naics_maj_code_rv) %>%
    summarize(cohort_start_pop=n_distinct(ssn_id))

# Count certified claimants by week and industry
cs_ind_count <- df_claimants %>% 
    group_by(week_number, naics_maj_code_rv, naics_maj_desc_rv) %>%
    summarize(claimant_count=n_distinct(ssn_id)) %>%
    filter(week_number >= 1)

# Join cohort starting population to count of claimants,
# Replace week 1 count with cohort population
cs_remaining_ind <- left_join(cs_ind_count, ind_cohort_pop, by = 'naics_maj_code_rv') %>%
    mutate(claimant_count = case_when(week_number == 1 ~ cohort_start_pop, TRUE ~ claimant_count)) #Replace count of claimants with cohort population in week 1

# Show top records in the data frame
head(cs_remaining_ind,20)

Now we can calculate the share of claimants who stayed in the UI program for each industry by dividing `claimant_count` by `cohort_start_pop`. The share staying subtracted from one is equal to the share who have exited.

In [None]:
# Calculate share remaining/share exited by industry, save as a data frame
cs_ind_exits <- cs_remaining_ind %>% mutate(stay_pct = claimant_count/cohort_start_pop) %>% #Calculate share staying
    mutate(exit_pct = 1 - stay_pct) #Calculate share exiting

# Show 30 weeks for REDACTED
cs_ind_exits %>% filter(naics_maj_code_rv == (REDACTED))

# Show 30 weeks for REDACTED
cs_ind_exits %>% filter(naics_maj_code_rv == (REDACTED))

Earlier in this notebook, we found that about (REDACTED) of the COVID cohort as a whole had exited by (REDACTED). Looking at a subset of industries, we see that both REDACTED and (REDACTED) had a lower exit rate by (REDACTED) than the cohort as a whole, though the difference is more significant for (REDACTED). By (REDACTED), the final week of regular benefits, about (REDACTED) of the COVID cohort overall had exited. (REDACTED) has a higher exit rate by (REDACTED), while (REDACTED) has a lower exit rate. While (REDACTED) had lower exit rates than (REDACTED), (REDACTED) had a slightly lower exit rate at (REDACTED), but a notably higher exit rate than the cohort overall by (REDACTED).

#### **Checkpoint 3: Calculate Exit Rates by Your Variables of Interest**

In Checkpoint 3 of the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb), you have chosen a dimension (age group, gender, race, education level, occupation, or industry). Now investigate how exit rates of your regional cohort vary by that dimension. Remember that **the denominator should be different for each subgroup**. So start with calculating the number of distinct `ssn_id` for each subgroup and save them in a separate column. In which subgroup do claimants have the highest exit rate? What about the lowest exit rate? 
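
A minimal sketch of the subgroup calculation on made-up data may help before you apply it to your regional cohort. The data frame `toy`, the subgroup labels `"A"` and `"B"`, and the object names below are all hypothetical. For brevity, every toy claimant has a week 1 record, so the week 1 adjustment is omitted here.

```r
library(dplyr)

# Hypothetical claimant-week records for two subgroups of different sizes
toy <- data.frame(
  ssn_id      = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 6, 6),
  group       = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"),
  week_number = c(1, 2, 1, 2, 3, 1, 2, 3, 1, 1, 1, 2)
)

# Denominator: distinct claimants in each subgroup
toy_pop <- toy %>%
  group_by(group) %>%
  summarize(cohort_start_pop = n_distinct(ssn_id))

# Numerator: distinct claimants remaining by week and subgroup
toy_exits <- toy %>%
  group_by(week_number, group) %>%
  summarize(claimant_count = n_distinct(ssn_id), .groups = "drop") %>%
  left_join(toy_pop, by = "group") %>%
  mutate(stay_pct = claimant_count / cohort_start_pop,
         exit_pct = 1 - stay_pct)

# Rank subgroups by exit rate in a chosen week (week 3 here)
week3_rank <- toy_exits %>%
  filter(week_number == 3) %>%
  arrange(desc(exit_pct))

week3_rank
```

Because each subgroup is divided by its own starting population, subgroup B (4 claimants, 1 remaining) shows a higher week 3 exit rate than subgroup A (2 claimants, 1 remaining), even though both have one claimant left.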

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full state and are primarily used in the examples rather than the checkpoints.

Here are some suggested methods of how to regroup these variables.

- **Age Group**: 14-24, 25-34, 35-44, 45-54, 55-64, 65-99; remove anyone younger than 14 or older than 99
- **Race**: white, African American, other race, race unknown
- **Education Level**: less than 12 years, high school graduate, Associate's degree, Bachelor's degree, Master's degree or higher, education unknown
- **Occupation**: use 2-digit occupation codes and combine small categories
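
As one possible way to build the age groups suggested above, the sketch below derives an approximate age from `birth_date` and bins it. Several choices here are assumptions rather than part of the original analysis: age is measured at `byr_start_week`, a year is approximated as 365.25 days, and the column names `age` and `age_group` are illustrative. The data frame `toy` holds made-up birth dates.

```r
library(dplyr)

# Hypothetical claimants with made-up birth dates
toy <- data.frame(
  ssn_id         = 1:5,
  birth_date     = as.Date(c("2000-01-15", "1990-06-30", "1975-01-10",
                             "1955-01-20", "2010-01-01")),
  byr_start_week = as.Date("2020-04-04")
)

toy_age <- toy %>%
  # Approximate age in whole years at the benefit year start (assumption)
  mutate(age = as.integer(floor(
    as.numeric(difftime(byr_start_week, birth_date, units = "days")) / 365.25
  ))) %>%
  # Remove anyone younger than 14 or older than 99
  filter(age >= 14, age <= 99) %>%
  # Conditions are checked in order, so each age lands in the first matching bin
  mutate(age_group = case_when(age <= 24 ~ "14-24",
                               age <= 34 ~ "25-34",
                               age <= 44 ~ "35-44",
                               age <= 54 ~ "45-54",
                               age <= 64 ~ "55-64",
                               TRUE      ~ "65-99"))

toy_age
```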

> <font color='red'>**Note that if you want to break down your regional sample by industry, you need to create the industry category by following the code shown in the example.**</font>

In [None]:
# Example of recoding education level into the groups above
df_reg_claimants <- df_reg_claimants %>%
    mutate(educ_recode = case_when (education >= 1 & education <= 13 ~ "Less than HS graduate",
                                    education >= 14 & education <= 18 ~ "HS graduate to some college",
                                    education >= 19 & education <= 20 ~ "Associate's degree",
                                    education >= 21 & education <= 22 ~ "Bachelor's degree",
                                    education >= 23 ~ "Master's degree or higher",
                                    TRUE ~ "Other"))

# Show top records in the dataframe
head(df_reg_claimants)

In [None]:
# Calculate exit rates for your selected dimension
# First, regroup your chosen dimension field based on the example in the previous cell
# Next, replace ___ below with the selected dimension

# Calculate cohort starting population by subgroup
reg_sub_cohort_pop <- df_reg_claimants %>% 
    group_by(___) %>%
    summarize(cohort_start_pop=n_distinct(ssn_id))

# Count certified claimants by week and subgroup
cs_reg_sub_count <- df_reg_claimants %>% 
    group_by(week_number, ___) %>%
    summarize(claimant_count=n_distinct(ssn_id)) %>%
    filter(week_number >= 1)

# Join cohort starting population to count of claimants,
# Replace week 1 count with cohort population
cs_reg_sub_remaining <- left_join(cs_reg_sub_count, reg_sub_cohort_pop, by = '___') %>%
    mutate(claimant_count = case_when(week_number == 1 ~ cohort_start_pop, TRUE ~ claimant_count)) #Replace count of claimants with cohort population in week 1

# Calculate share remaining/share exited, save as a data frame
cs_reg_sub_exits <- cs_reg_sub_remaining %>% mutate(stay_pct = claimant_count/cohort_start_pop) %>% #Calculate share staying
    mutate(exit_pct = 1 - stay_pct) #Calculate share exiting

# Show top records in the data frame
head(cs_reg_sub_exits)

## **6. Export Results to .csv Files**

Now you have successfully finished your cohort analysis! The last step is to save your results in csv files. We will use these files in the Data Visualization notebook and show you how to create various graphs.

<font color=red> Note that you need to change the directory in the write.csv() statements below. Replace ".." with the name of your user folder (FirstName.LastName.UserID).</font>

In [None]:
# Save the aggregates as CSV

# Save exits overall dataset
write.csv(cs_exits, "U:\\..\\ETA Training\\Results\\cs_exits.csv", row.names = FALSE)

# Save exits by industry dataset
write.csv(cs_ind_exits, "U:\\..\\ETA Training\\Results\\cs_ind_exits.csv", row.names = FALSE)

#### **Checkpoint 4: Save Your Results**
Save the results you get in Checkpoints 2 and 3 in .csv files. 
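
If you want to confirm that a file was written correctly, one option is to read it back after saving. The sketch below uses a temporary folder as a stand-in for your own Results folder, and `results_dir`, `toy_exits`, and `toy_exits.csv` are hypothetical names with made-up values; in practice you would point the path at your folder on the U drive.

```r
# Hypothetical results folder; a temporary folder stands in for the U drive path
results_dir <- file.path(tempdir(), "Results")
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up exit-rate table to save
toy_exits <- data.frame(week_number = 1:3,
                        stay_pct    = c(1, 1, 0.25),
                        exit_pct    = c(0, 0, 0.75))

# Build the full path and write the file without row names
out_file <- file.path(results_dir, "toy_exits.csv")
write.csv(toy_exits, out_file, row.names = FALSE)

# Read the file back and compare it with what was saved
check <- read.csv(out_file)
all.equal(check$exit_pct, toy_exits$exit_pct)
```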

In [None]:
# Save the results of your checkpoints as well

# Save exits for your regional selection
write.csv(cs_reg_exits, "U:\\..\\ETA Training\\Results\\cs_reg_exits.csv", row.names = FALSE)

# Save exits for your regional selection by your chosen dimension
write.csv(cs_reg_sub_exits, "U:\\..\\ETA Training\\Results\\cs_reg_sub_exits.csv", row.names = FALSE)

### **Footnotes:**
<span id="fn1"> 1. <a href='https://www2.illinois.gov/ides/IDES%20Forms%20and%20Publications/CLI105L.pdf'>Illinois Unemployment Insurance benefit Handbook</a> </span>     
[[Go back]](#5)

<span id="fn2"> 2. <a href='https://www2.illinois.gov/pages/executive-orders/executiveorder2020-10.aspx'>Illinois Executive Order in Response to COVID-19</a> </span>    
[[Go back]](#6)

<span id="fn3"> 3. <a href='https://coronavirus.illinois.gov/s/restore-illinois-introduction'>Restore Illinois</a> </span>   
[[Go back]](#7)

> Note that the above links don't work inside of the ADRF since you don't have internet access.

> Click [Go back] to go back to where you were.