<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 

# **<center> Data Exploration: Cross-sectional Analysis </center>**

<a href="https://doi.org/10.5281/zenodo.4588936"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4588936.svg" alt="DOI"></a>

Tian Lou, Dave McQuown

## **1. Introduction**
In the spring of 2020, many states enacted stay-at-home orders and extensive business shutdowns to slow the spread of COVID-19. Within a few months, an unprecedented number of workers filed Unemployment Insurance (UI) claims, placing heavy demands on state governments' finances. In Illinois, for example, about half a million workers filed first-time UI claims in April alone. The state government paid more than 700 million dollars to compensate 1.9 million claims.[<sup>1</sup>](#fn1) <a id = "1"> </a> Even as the state gradually began to reopen its economy, the number of unemployed workers remained high, with many continuing to rely on UI benefits. 

How can local workforce boards help these unemployed workers get back to work quickly and reduce governments' financial burdens? What information is the most useful for them to allocate resources and  plan strategically? How is UI claimants' reemployment affected by economic shocks? Throughout this project, we will explore Illinois UI claims data and use advanced data analytical methods to answer these questions. 

This notebook focuses on understanding the Illinois UI claims data (i.e., the PROMIS file) and using it to construct two measures: weekly certified UI claimant counts and average weekly payments. We will start by introducing the tools used to load the data, including connecting R to the database and using SQL queries to pull the data. We then use these tools to explore the PROMIS file. We will create a cross-sectional sample and investigate the trends in Illinois UI claimant counts, their weekly total payments during the COVID recession, and how these two measures varied across industries. At the end of this notebook, we will save the summary statistics in csv files and use them in the visualization notebooks. Checkpoints throughout the notebook give you a chance to practice the code. As you work through the material, think about how you might apply the techniques and code to your own project. 

## **2. Learning Objectives**

Throughout this series of notebooks, our overarching questions are: How can states best produce information that can be used by local workforce boards to help inform resource allocation for unemployed workers? What information is the most useful for workforce boards for strategic resource planning and efficient resource allocation? What measures are most useful to local workforce boards and how do we characterize variation in those measures? We will use Illinois administrative data, Bureau of Labor Statistics (BLS) workforce data, and Opportunity Insights Economic Tracker data as an example and show you step by step how we develop a project, including: exploring the dataset, selecting a sample, defining outcome measures, conducting subgroup analyses, creating visualizations, and building a predictive machine learning model. 

In this notebook, our focus is the **cross-sectional analysis**. After you finish this notebook, you should understand:
- how to load R libraries and how to establish a connection to the database
- how to create a cross-sectional sample by using the PROMIS file
- how to calculate weekly claimant counts and average weekly total pay
- how to conduct subgroup analysis

#### **Research Questions** 
In this notebook, we focus on seeking answers to the following questions: 
- What are the trends of Illinois certified UI claimant counts during the COVID-19 recession?
- How do Illinois certified UI claimant counts vary by industry? Which industries had the most job losses during the COVID-19 recession?
- What are the average weekly payments received by Illinois certified UI claimants?
- How do Illinois certified UI claimants' weekly payments vary by industry?

#### **Datasets** ####
We will explore and understand the Illinois PROMIS file in this notebook:
- **2020 Illinois PROMIS certified claims file**: weekly UI claims data. Each record represents a certified claim in a certain week. The data has a claimant's demographics, education level, prior industry, occupation, and locations. It also contains detailed information about the claim, such as program type, claim type, certification status, benefit starting date, and benefit amount. Federal Pandemic Unemployment Compensation (FPUC, $600/week) and dependent benefits are included in the total amount paid.

The full certified claims data has more than 26 million rows. Using the full dataset for data exploration is time-consuming. Therefore, **we will use a 1% random sample of the certified claims data in all notebooks. You should also use the 1% random sample when walking through all notebooks and when identifying the scope of your project.** After you decide your research questions and analysis scope, only pull the data and the variables you need from the full dataset.

#### **Analytical Methods** ####
You will work through various techniques of how to use SQL and R to explore the datasets in the ADRF and better understand what you are working with. This will form the basis of all the other types of analyses you will do in this class and is a crucial first step for any data analysis workflow. We will provide an introduction and examples for:

- How to establish a connection to the database in R
- How to create new tables from the larger tables in a database (sometimes called the "analytical frame")
- How to explore different variables of interest
- How to clean data
- How to create aggregate metrics
- How to generate descriptive statistics to describe a specific sample

The specific techniques include, but are not limited to:
- **SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on a condition
 - `GROUP BY`: aggregate data over the variables of interest
 - `ORDER BY`: sort data based on the variables of interest
 - `DISTINCT`: look at distinct values of a variable
 - `JOIN ... ON`: join tables
- **R code**:
 - `group_by` and `summarize` to find group-based measures
 - `mutate` to create new variables
 - `arrange` and `desc` to sort values

#### **Directory Structure**

We will frequently read and write csv files to load crosswalks and to save results in all the notebooks. Let's create a few folders in your U drive first so it is easier for you to organize all the files. 

- Open a Windows File Explorer
- On the left hand side, find U drive (U:) and click into it
- On the right hand side, open your user folder: FirstName.LastName.UserID
- In your user folder, create a new folder: ETA Training
- In the "ETA Training" folder, create three subfolders: "Notebooks", "Results", "Output"
- You can copy and paste the class notebooks to the "Notebooks" folder, save summary statistics to the "Results" folder, and save visualizations (in the third notebook) to the "Output" folder.

For example, we read all the crosswalks from **"P:\tr-dol-eta\ETA Class Notebooks\xwalks"**.  At the end of this notebook, **we save summary statistics to "U:\\FirstName.LastName.UserID\ETA Training\Results\filename.csv"**.


## **3. Load the Data** ##

In this section, we will demonstrate how to use R to read data from a relational database. First, we need to load packages in R.

#### **R Setup**

We will use several R functions that are not immediately available in base R. Therefore, we need to load them using the built-in function `library()`. For example, running `library(tidyverse)` loads the `tidyverse` suite of packages. It is a collection of packages designed for data science.

> When you run the following code cell, don't worry about the warning message below.

In [None]:
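# Database interaction imports
# DBI provides dbConnect(), dbGetQuery(), and dbDisconnect(); odbc provides the driver
library(DBI)
library(odbc)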

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

__When in doubt, full documentation for a function can be printed with `?function_name` or `?package::function_name`, e.g. `?ggplot2::ggplot` or `?sprintf`.__ Do not worry about memorizing the information in the help documentation - you can always run this command when you are unsure of how to use a function.

> Certain functions exist across multiple packages (e.g. the function `lag` exists in both the `dplyr` and `stats` packages, as also noted in the message produced by `library(tidyverse)`). When calling a function, you can put the package name first to ensure that you are using the right one. For example, `dplyr::lag` or `stats::lag` calls the `lag` function from `dplyr` or `stats`, respectively. 
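
To make the namespace point concrete, here is a minimal sketch on a made-up vector (the values are illustrative and not from the PROMIS data). It shows that the two `lag` functions behave differently, which is why the `package::function` prefix can matter.

```r
library(dplyr)

x <- c(10, 20, 30)  # toy values, not from the PROMIS data

# dplyr::lag shifts the values down by one position and pads the front with NA
lagged_dplyr <- dplyr::lag(x)

# stats::lag leaves the values unchanged and only shifts the time-series index
lagged_stats <- stats::lag(x)

print(lagged_dplyr)
print(as.numeric(lagged_stats))
```

In the data manipulation steps of these notebooks, the `dplyr` behavior (shifted values) is usually the one you want.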

In [None]:
# See help documentation for head:
# a function we will use frequently to check the content of a table
# It returns the first few rows of a table
?head

#### **Establish a Connection to the Database**

Now, we are ready to connect to the database `tr_dol_eta`. We will create the database connection using the `DBI` and `odbc` packages. 

> **Loading R functions** and **establishing database connection** should always be the first step in your Jupyter Notebooks. Make sure you copy these code chunks when you create a new notebook.

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

#### **Formulate Data Query**

Next, we need to specify what we want to pull in from the database. This part is similar to writing a SQL query in DBeaver. In this example, we will pull in 20 rows of the Illinois PROMIS data, which is stored in the `il_des_promis_1pct` table inside the `dbo` schema of the `tr_dol_eta` database.

First, we create a query as a `character` string object in R.

In [None]:
# Create query character string
# Database name: tr_dol_eta
# Schema name: dbo
# Table name: il_des_promis_1pct
query <- '
SELECT TOP 20 *
FROM tr_dol_eta.dbo.il_des_promis_1pct;
'

We use `TOP` to read in only the first 20 rows because we're just looking to preview the data and we don't want to eat up memory by reading a huge data frame into R. 

> `TOP` provides one simple way to get a "sample" of data. You may get different rows than others do when using only the `TOP` clause. This is not because `TOP` draws a random sample; it is because, without an `ORDER BY`, the database returns whichever rows it can retrieve fastest.

#### **Read in the Data** 

Now we can use `con` and `query` as inputs to `dbGetQuery()` to read the data into an R data frame. 

In [None]:
# Read in data frame and save it in df
df <- dbGetQuery(con,query)

In [None]:
# See first few rows of df
head(df)

#### **Checkpoint 1: Explore Columns** 

Take a look at the columns in the `il_des_promis_1pct` table. Which variables might be useful for your project?

> Refer to the data dictionary on the class website to understand what the different variables mean.

In [None]:
# Replace ____ with the table name
query <- '
SELECT TOP 20 *
FROM tr_dol_eta.dbo.____;
'

# Read in data frame and save it in df
df <- dbGetQuery(con,query)

# Write code to explore the columns of df
# Hint: there is more than one method. You can use head(), glimpse(), names(), etc.

## **4. Create the Cross-sectional Sample**

In this section, we will use the Illinois PROMIS certified claims file to create a sample of all certified claimants who received benefits since the week ending March 7th, 2020. Each week, PROMIS files are generated for both initial claims and certified claims. For the purpose of this class, **the `il_des_promis_1pct` table in `tr_dol_eta.dbo` only contains certified claims**.

> The full dataset is in the table `il_des_promis`. It contains more than 26 million rows. `il_des_promis_1pct` is a 1% random sample. Using the full dataset for data exploration is time-consuming. Therefore, **we will only use the table `il_des_promis_1pct` in all notebooks. You should also only use the 1% random sample when walking through all notebooks and when identifying the scope of your analysis.** After you decide your research questions and analysis scope, only pull the data and the variables you need from the full dataset.

Recall that an **initial claim** refers to a claim filed by a recently unemployed worker with state unemployment agencies to request a determination of basic eligibility for UI benefits. In Illinois, after filing the initial claim, the person will wait about a week to receive a letter that notifies them when to **certify the claim**. During the certification process, the person will answer questions such as whether they have worked and whether they have actively looked for jobs in the past two weeks. If the person meets the eligibility requirements, they can receive UI benefits.[<sup>2</sup>](#fn2) <a id = "2"> </a>**In Illinois, regular UI claims need to be certified every two weeks.**[<sup>3</sup>](#fn3) <a id = "3"> </a>

We usually analyze UI claimants' weekly outcomes, because they collect their benefits on a **calendar week** basis. A calendar week starts on Sunday and ends on Saturday. The PROMIS file has a variable, `week_end_date`, that indicates the last day of the calendar week. *It is always a Saturday.* We can use it to limit the data to weeks ending on or after the first Saturday of March 2020 (`2020-03-07`). In sections 5 and 6, we will also use `week_end_date` to aggregate the data to the weekly level.  
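
As a small illustration of the Sunday-to-Saturday calendar week, the sketch below maps arbitrary dates to the Saturday that ends their week. The helper name `week_ending_saturday` is hypothetical (it is not part of the PROMIS data or of any package); in the PROMIS file this value is already provided as `week_end_date`.

```r
library(lubridate)

# Hypothetical helper: map any date to the Saturday that ends its Sunday-Saturday week
week_ending_saturday <- function(d) {
    # With week_start = 7, wday() returns 1 for Sunday through 7 for Saturday,
    # so (7 - wday) is the number of days remaining until Saturday
    d + (7 - wday(d, week_start = 7))
}

# A Sunday, a Saturday, and the following Sunday
example_dates <- ymd(c("2020-03-01", "2020-03-07", "2020-03-08"))
example_week_ends <- week_ending_saturday(example_dates)
print(example_week_ends)
```

A helper like this can be handy when you need to align other dated data (for example, an external weekly series) with `week_end_date`.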

Moreover, there are different types of UI programs. In this class, we will focus on the **Regular state UI program**. Thus, we also restrict the sample to `program_type=1` and `sub_program_type=1`.

In [None]:
# Select PROMIS certified claimant records from the database to a dataframe

# Store SQL query to a variable
query <- "
SELECT ssn_id,
    week_end_date,
    byr_start_week,
    sub_program_type,
    program_type,
    claim_type,
    birth_date,
    gender,
    race,
    ethnicity,
    disability,
    education,
    county_fips_code,
    naics_code,
    occupation_code,
    total_pay
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND week_end_date >= '2020-03-07'
"

# Execute query
df_claimants <-dbGetQuery(con,query)

# R interprets dates as character when pulling from the database, must convert with ymd()
df_claimants <- df_claimants %>%
    mutate(week_end_date=ymd(week_end_date),
          byr_start_week=ymd(byr_start_week),
          birth_date=ymd(birth_date))

# See top records in the dataframe
head(df_claimants)

# Close the database connection
dbDisconnect(con)

We need to join `df_claimants` with the county-region crosswalk, `reopening_regions.csv`, on `county_fips_code` to get region. In Checkpoint 2, you will need to use `region` to define your regional sample. 

In [None]:
# Join/merge the DataFrame with the county-region crosswalk

# Load Restore Illinois Health Regions
region_dict <- read_csv('P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\reopening_regions.csv', col_types = "ic") %>%
    mutate(county_fips_code = substr(fips,3,5)) %>%
    select(county_fips_code, region)

#Left join region to certified claimants data
df_claimants <- left_join(df_claimants, region_dict, by = "county_fips_code")

# See top records in the dataframe
head(df_claimants)
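
After a left join against a crosswalk, it is worth checking that the row count is unchanged and seeing how many records did not find a match. The sketch below uses small made-up tables (the IDs, county codes, and county-to-region assignments are illustrative only) rather than the real crosswalk.

```r
library(dplyr)

# Toy claimant records (made-up IDs and county codes, not PROMIS data)
toy_claimants <- data.frame(
    ssn_id = c("a1", "b2", "c3"),
    county_fips_code = c("031", "001", "999"),  # "999" has no match in the toy crosswalk
    stringsAsFactors = FALSE
)

# Toy crosswalk; the region assignments here are illustrative only
toy_regions <- data.frame(
    county_fips_code = c("031", "001"),
    region = c("Cook County", "Central"),
    stringsAsFactors = FALSE
)

toy_joined <- left_join(toy_claimants, toy_regions, by = "county_fips_code")

# A left join keeps every claimant row; unmatched rows get NA for region
unmatched <- toy_joined %>% filter(is.na(region))
print(nrow(toy_joined))
print(unmatched)
```

The same two checks (`nrow()` before and after, and a filter on `is.na(region)`) can be applied to `df_claimants` directly.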

#### **Checkpoint 2: Create Your Sample**
In the class, we have asked you to identify a **region of interest**. Now use the DataFrame `df_claimants` we have created in the example to select your regional sample. 

> Values of `region`: Central, Cook County, North-Central, Northeast, Southern

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full State and are primarily used in the examples rather than the checkpoints.

In [None]:
# Subset the dataframe to a specific region
# Replace ___ with your selected region
df_reg_claimants <- df_claimants %>% filter(region == '___')

# See top records in the dataframe
head(df_reg_claimants)

## **5. Weekly UI Claimant Count**

Now we can aggregate `df_claimants` on `week_end_date` to get weekly claimant counts.

In [None]:
# Aggregate cross-sectional dataset by benefit week ending date
# and confirm that each person has one record each benefit week
cs_counts <- df_claimants %>% 
    group_by(week_end_date) %>%
    summarize(claimant_count_dist=n_distinct(ssn_id), claimant_count=n()) %>% # Count records and distinct SSNs
    mutate(uniq_check = (claimant_count_dist == claimant_count)) # Compare the values calculated in the previous line

# If there are no weeks with a value of FALSE, there is only one record per claimant/week
table(cs_counts$uniq_check)

# Assuming no FALSE values, clean up the dataframe to keep only week_end_date and claimant_count
cs_counts <- cs_counts %>% select(week_end_date, claimant_count)

# See the top 15 records in the dataframe
head(cs_counts, 15)
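
If the uniqueness check ever returns `FALSE` for some week, one way to locate the offending records is to count rows per claimant and week. This is a minimal sketch on made-up data, assuming the same `ssn_id` and `week_end_date` column names as above; the name `n_records` is introduced here for illustration.

```r
library(dplyr)

# Toy data in which claimant "b2" appears twice in the same week
toy_claims <- data.frame(
    ssn_id = c("a1", "b2", "b2", "c3"),
    week_end_date = as.Date(c("2020-03-07", "2020-03-07", "2020-03-07", "2020-03-14")),
    stringsAsFactors = FALSE
)

# Count records per claimant/week and keep only the combinations with more than one
dupes <- toy_claims %>%
    count(ssn_id, week_end_date, name = "n_records") %>%
    filter(n_records > 1)

print(dupes)
```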

We can see that the number of certified claimants spiked in the REDACTED as people REDACTED as a result of the COVID-19 pandemic. The total number of claimants peaked the week ending REDACTED.

The COVID recession has hit a few industries especially hard, such as Accommodation and Food Services, Manufacturing, and Retail Trade. For example, compared to March, the national employment of the Accommodation and Food Services industry in April reduced by nearly 50%.[<sup>4</sup>](#fn4) <a id = "4"> </a> Next, we will break down Illinois weekly claimant counts by industry and investigate which industries have the most job losses and how many workers in these industries remain unemployed over time.

The variable `naics_code` in the `il_des_promis_1pct` table tells us the industry of a claimant's separation employer. In the database, we have 6-digit NAICS codes. However, we will not use the 6-digit codes directly in the analysis, because they are too granular and will result in many small counts. Such results have very limited generalizability. **Most importantly, you will not be able to export any results based on fewer than 10 individuals from the ADRF.** 

> You can use `select distinct(naics_code) from tr_dol_eta.dbo.il_des_promis_1pct;` in DBeaver to check the unique values of `naics_code` and use `select count(distinct(naics_code)) from tr_dol_eta.dbo.il_des_promis_1pct;` to check how many unique 6-digit NAICS codes are in the data.

Therefore, we need to convert the 6-digit NAICS codes to 2-digit NAICS codes so that we have fewer industry groups. However, even if we use 2-digit NAICS codes, some industries still have a very small number of UI claimants in some weeks, such as REDACTED. You can either suppress the counts for these industries or combine them. In this notebook, we will combine the small industries so that our results are consistent with the Certified UI Claimants by Industry in the portal. See `naics_groups.csv` in the `P:\tr-dol-eta\ETA Class Notebooks\xwalks` folder for how we combine small industries.
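
This notebook combines small industries, but for reference, the suppression alternative could look like the sketch below. It uses made-up counts, and the threshold of 10 comes from the export rule mentioned above; the variable name `min_cell_size` is introduced here for illustration, and actual export review may impose additional requirements not covered by this sketch.

```r
library(dplyr)

# Toy weekly counts by 2-digit NAICS code (made-up numbers)
toy_counts <- data.frame(
    naics_maj_code = c("11", "44", "72"),
    claimant_count = c(4L, 250L, 9L),
    stringsAsFactors = FALSE
)

min_cell_size <- 10L  # assumed threshold, based on the "fewer than 10 individuals" rule

# Replace any count below the threshold with NA so it is not reported
toy_suppressed <- toy_counts %>%
    mutate(claimant_count = if_else(claimant_count < min_cell_size,
                                    NA_integer_,
                                    claimant_count))

print(toy_suppressed)
```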

In [None]:
# Convert 6-digit NAICS codes to 2-digit NAICS codes by keeping only the first two characters
df_claimants <- df_claimants %>% mutate(naics_maj_code = substr(naics_code,1,2))

# See the top records in the dataframe
head(df_claimants)

In [None]:
# Check the value of NAICS code to make sure there aren't any unknown values.
# 2-digit NAICS codes list: 11,21,22,23,31,32,33,42,44,45,48,49,51,52,53,54,55,56,61,62,71,72,81,92
table(df_claimants$naics_maj_code)

# Remove the invalid/unknown codes seen in the table above ('0', '00', '99')
df_claimants <- df_claimants %>% filter(!(naics_maj_code %in% c('0','00','99')))

# Check that it worked
table(df_claimants$naics_maj_code)
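
The cell above removes a short list of known-bad codes. An alternative sketch, shown here on made-up values, keeps only codes that appear in the list of 2-digit NAICS codes from the comment above. Note that this allow-list approach also drops missing values, so it is not equivalent to the exclusion filter and may change your counts.

```r
library(dplyr)

# 2-digit NAICS codes, copied from the list in the comment above
known_naics_maj <- c("11", "21", "22", "23", "31", "32", "33", "42", "44", "45",
                     "48", "49", "51", "52", "53", "54", "55", "56", "61", "62",
                     "71", "72", "81", "92")

# Toy 6-digit codes, including invalid and missing values
toy_naics <- data.frame(
    naics_code = c("722511", "0", "999999", "445110", NA),
    stringsAsFactors = FALSE
)

# Keep only the rows whose 2-digit prefix is a known NAICS major code
toy_naics_clean <- toy_naics %>%
    mutate(naics_maj_code = substr(naics_code, 1, 2)) %>%
    filter(naics_maj_code %in% known_naics_maj)

print(toy_naics_clean)
```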

In [None]:
# Combine NAICS major codes based on grouping used for UI dashboard 
# See `naics_groups.csv` in the P:\tr-dol-eta\ETA Class Notebooks\xwalks folder
naics_groups <- read_csv('P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\naics_groups.csv', col_types = "ccc")

# See the top records in the dataframe
head(naics_groups)

In [None]:
# Join NAICS groupings to claimant dataset
df_claimants <- left_join(df_claimants, naics_groups, by = 'naics_maj_code')

# Break down the data by benefit week and NAICS code grouping and save it as a dataframe.
cs_ind_counts <- df_claimants %>% 
    group_by(week_end_date, naics_maj_code_rv, naics_maj_desc_rv) %>%
    summarize(claimant_count=n_distinct(ssn_id))
    
# Show top records in the dataframe
head(cs_ind_counts)

What were the top 3 industries at the peak week of REDACTED?

In [None]:
# Show counts for all industries in descending order
cs_ind_counts %>%
    filter(week_end_date == ymd("REDACTED")) %>%
    arrange(desc(claimant_count))

We can see that the industries with the most certified claimants during the benefit week ending REDACTED are REDACTED, REDACTED, and REDACTED. How many claimants are there from these industries four weeks later, the week ending REDACTED?

In [None]:
# Show counts of certified claimants in the REDACTED 
# industries the week ending REDACTED
cs_ind_counts %>%
    filter(week_end_date == ymd("REDACTED")) %>%
    filter(naics_maj_code_rv %in% c(REDACTED))

Compared to the week ending REDACTED, the number of certified claimants in the week ending REDACTED was about the same for REDACTED but had declined slightly for REDACTED and REDACTED.

#### **Checkpoint 3: Calculate Weekly UI Claimant Counts and Break Down Your Sample by Variables of Interest**

First, calculate weekly UI claimant counts by week using the regional sample you created in Checkpoint 2. During which week did the number of certified claimants in your region peak? Is it the same as the state level trend (i.e., peaked during the week ending REDACTED)?

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full State and are primarily used in the examples rather than the checkpoints.

In [None]:
# Aggregate regional cross-sectional dataset by benefit week ending date
# and confirm that each person has one record each benefit week
# Which variable should you fill in ___ ?
cs_reg_counts <- df_reg_claimants %>% 
    group_by(___) %>%
    summarize(claimant_count_dist=n_distinct(ssn_id), claimant_count=n()) %>% # Count records and distinct SSNs
    mutate(uniq_check = (claimant_count_dist == claimant_count)) # Compare the values calculated in the previous line

In [None]:
# If there are no weeks with a value of FALSE, there is only one record per claimant/week
# Fill in ___ with the appropriate column
table(cs_reg_counts$___)

In [None]:
# Assuming no FALSE values, clean up the dataframe to keep only week_end_date and claimant_count
# Fill in ___ with the appropriate columns
cs_reg_counts <- cs_reg_counts %>% select(___, ___)

Second, in addition to separation industry, the PROMIS file contains detailed information about a claimant's demographics, occupation and educational attainment. Choose one dimension to conduct your subgroup analysis. What are the top demographic groups/industries/education groups during the peak week? What are the trends in these top groups in the subsequent weeks?

Similar to industry, some of these variables are at a very granular level. Here are some suggested ways to regroup these variables to avoid small counts.

- **Age Group**: 14-24, 25-34, 35-44, 45-54, 55-64, 65-99; remove anyone younger than 14 or older than 99
- **Race**: white, African American, other race, race unknown
- **Education Level**: less than 12 years, high school graduate, Associate's degree, Bachelor's degree, Master's degree or higher, education unknown
- **Occupation**: use 2-digit occupation codes and combine small categories if needed
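
As one possible way to build the age groups suggested above, the sketch below computes age in completed years as of `week_end_date` and bins it with `cut()`. It runs on made-up birth dates; measuring age at the benefit week (rather than at some other reference date) is an assumption, and the names `had_birthday`, `age`, and `age_group` are introduced here for illustration.

```r
library(dplyr)
library(lubridate)

# Toy records (made-up birth dates), all for the week ending 2020-03-07
toy_people <- data.frame(
    birth_date = ymd(c("2000-03-07", "1955-03-07", "1955-03-08", "2010-01-01")),
    week_end_date = ymd("2020-03-07")
)

toy_age <- toy_people %>%
    mutate(
        # TRUE if the birthday has already occurred in the year of the benefit week
        had_birthday = (month(week_end_date) * 100 + day(week_end_date)) >=
                       (month(birth_date) * 100 + day(birth_date)),
        # Age in completed years as of the benefit week ending date
        age = year(week_end_date) - year(birth_date) - if_else(had_birthday, 0, 1),
        # Left-closed bins: [14,25), [25,35), ..., [65,100); anything else becomes NA
        age_group = cut(age,
                        breaks = c(14, 25, 35, 45, 55, 65, 100),
                        labels = c("14-24", "25-34", "35-44", "45-54", "55-64", "65-99"),
                        right = FALSE)
    ) %>%
    # Drop anyone younger than 14 or older than 99
    filter(!is.na(age_group))

print(toy_age)
```

The resulting `age_group` column can then be used as the subgroup variable in the `group_by()` call of the aggregation step.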

> <font color='red'>**Note that if you want to break down your regional sample by industry, you need to create the industry category by following the code shown in the example.**</font> 

In [None]:
# Example of recoding education level into the groups above
# You may replace this with a dimension and grouping scheme of your choosing

df_reg_claimants <- df_reg_claimants %>%
    mutate(educ_recode = case_when (education >= 1 & education <= 13 ~ "Less than HS graduate",
                                    education >= 14 & education <= 18 ~ "HS graduate to some college",
                                    education >= 19 & education <= 20 ~ "Associate's degree",
                                    education >= 21 & education <= 22 ~ "Bachelor's degree",
                                    education >= 23 ~ "Master's degree or higher",
                                    TRUE ~ "Other"))

# Show top records in the dataframe
head(df_reg_claimants)

In [None]:
# Aggregate by benefit week and subgroup
# Replace ___ with the subgroup variable

cs_reg_sub_counts <- df_reg_claimants %>% 
    group_by(week_end_date, ___) %>%
    summarize(claimant_count=n_distinct(ssn_id))

## **6. Weekly Total Pay**

Another outcome measure we are interested in is the amount of benefits received by certified UI claimants. We will use the field `total_pay`, which is the total amount paid to the claimant for the week. This includes regular UI benefits, dependent allowances, and supplements such as Federal Pandemic Unemployment Compensation. It excludes deductions for part-time wages earned and tax withholding.

In [None]:
# Calculate the state-level average total pay amount by week
cs_amounts <- df_claimants %>% 
    group_by(week_end_date) %>%
    summarize(avg_total_pay = mean(total_pay))

# Show top 25 records in the dataframe
head(cs_amounts, 25)

The average weekly amount paid was strongly affected by the Federal Pandemic Unemployment Compensation (FPUC) program, which granted each claimant an additional $600 per week in benefits. We can see that the average weekly amount paid REDACTED times the week ending REDACTED when the program began, and REDACTED the week ending REDACTED. 
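
One way to quantify week-to-week jumps like these is to compare each week's average with the previous week's. The sketch below does this on invented numbers (they are not results from the PROMIS data); the column names `prev_avg` and `ratio_to_prev` are introduced here for illustration.

```r
library(dplyr)

# Toy weekly averages (invented values, not actual results)
toy_amounts <- data.frame(
    week_end_date = as.Date(c("2020-03-21", "2020-03-28", "2020-04-04")),
    avg_total_pay = c(300, 330, 900)
)

toy_changes <- toy_amounts %>%
    arrange(week_end_date) %>%
    mutate(prev_avg = dplyr::lag(avg_total_pay),        # previous week's average
           ratio_to_prev = avg_total_pay / prev_avg)    # e.g. 2 means the average doubled

print(toy_changes)
```

The same pattern applies to `cs_amounts`; for a by-industry table such as `cs_ind_amounts`, you would group by industry before calling `mutate()` so that each industry is compared with its own previous week.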

Next, let's see how the average weekly total pay varies by industry. We will focus on the top 3 industries with the most claimants during the peak week.

In [None]:
# Calculate state-level average total pay by industry for the 3 industries with the most claimants the overall peak week
cs_ind_amounts <- df_claimants %>% 
    group_by(week_end_date, naics_maj_code_rv) %>%
    summarize(avg_total_pay = mean(total_pay)) %>%
    filter(naics_maj_code_rv %in% c(REDACTED))

# Show top records in the dataframe
head(cs_ind_amounts, 25)

Similar to the overall state figures, average weekly total pay by industry is heavily impacted by FPUC and experiences REDACTED when that program begins and ends. Each of these industries has a REDACTED average weekly total paid than the state of Illinois overall. 

#### **Checkpoint 4: Calculate the Average Weekly Total Pay for Your Regional Sample**
Now calculate the average weekly total pay of claimants in your region of interest. Then calculate average weekly total pay by your subgroup of interest for claimants in your region. Compare your results with the state-level average weekly benefits. Do claimants in your region of interest and subgroup of interest receive higher or lower benefits on average?

> Note that we use the string `reg` in object names to indicate dataframes and variables based on the regional subset used in the checkpoints. This includes all objects with names beginning with `df_reg`,`cs_reg`, or `reg`. Objects without `reg` in the name reflect the full State and are primarily used in the examples rather than the checkpoints.

In [None]:
# Calculate regional total pay amount by week, 
# using the DataFrame df_reg_claimants, which includes regional claims data and the subgroup variable you created in Checkpoint 3
# Replace ___ with the DataFrame, the benefit week, and the total pay field
cs_reg_amounts <- ___ %>% 
    group_by(___) %>%
    summarize(avg_total_pay = mean(___))

head(cs_reg_amounts)

In [None]:
# Aggregate by benefit week and a subgroup of your choice.
# Replace ___ with the appropriate values
cs_reg_sub_amounts <- ___ %>% 
    group_by(week_end_date, ___) %>%
    summarize(avg_total_pay = mean(___))

head(cs_reg_sub_amounts)

## **7. Export Results to .csv Files**

Now you have successfully finished your cross-sectional analysis! The last step is to save your results in .csv files. We will use these files in the Data Visualization notebook and show you how to create various graphs.

<font color=red> Note that you need to change the directory in the write.csv() statements below. Replace ".." with the name of your user folder (FirstName.LastName.UserID).</font>

In [None]:
# Save dataframes to CSV to use in later notebook

#weekly UI claimant counts
write.csv(cs_counts, "U:\\..\\ETA Training\\Results\\cs_counts.csv", row.names=F)

#weekly UI claimant counts by industry
write.csv(cs_ind_counts, "U:\\..\\ETA Training\\Results\\cs_ind_counts.csv", row.names=F)

#weekly UI average benefit amounts
write.csv(cs_amounts, "U:\\..\\ETA Training\\Results\\cs_amounts.csv", row.names=F)

#weekly UI benefit amounts by industry
write.csv(cs_ind_amounts, "U:\\..\\ETA Training\\Results\\cs_ind_amounts.csv", row.names=F)
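
To avoid editing the path in every `write.csv()` call, you could build the file paths from a single base folder. The sketch below is one way to do that; the helper `results_path` is hypothetical, and the demo writes a made-up table to a temporary folder so that it runs anywhere. In the ADRF, the base folder would instead be your own user folder on the U drive.

```r
# Hypothetical helper: build a path inside the "ETA Training/Results" folder
results_path <- function(base_dir, file_name) {
    file.path(base_dir, "ETA Training", "Results", file_name)
}

# For this demo, use a temporary folder in place of your U drive user folder
demo_base <- tempdir()
dir.create(file.path(demo_base, "ETA Training", "Results"),
           recursive = TRUE, showWarnings = FALSE)

# Made-up one-row table standing in for a summary dataframe
toy_counts <- data.frame(week_end_date = as.Date("2020-03-07"), claimant_count = 123L)

# Write the table and read it back to confirm the file was created
out_file <- results_path(demo_base, "cs_counts_demo.csv")
write.csv(toy_counts, out_file, row.names = FALSE)
read_back <- read.csv(out_file)
print(read_back)
```

With this approach, only the base folder needs to change when you move between users or machines.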

#### **Checkpoint 5: Save Your Results**
Save the results you get in Checkpoints 3 and 4 in .csv files. 

In [None]:
# Save the outputs of your checkpoint work as CSV

#Counts by week for your region of interest
write.csv(cs_reg_counts, "U:\\..\\ETA Training\\Results\\cs_reg_counts.csv", row.names=F)

# Counts by week for your selected region and dimension grouping
write.csv(cs_reg_sub_counts, "U:\\..\\ETA Training\\Results\\cs_reg_sub_counts.csv", row.names=F)

# Average total pay by week for your region of interest
write.csv(cs_reg_amounts, "U:\\..\\ETA Training\\Results\\cs_reg_amounts.csv", row.names=F)

# Average amounts by week for your selected region and dimension grouping
write.csv(cs_reg_sub_amounts, "U:\\..\\ETA Training\\Results\\cs_reg_sub_amounts.csv", row.names=F)

### **Footnotes:**
<span id="fn1"> 1. Data Source: <a href='https://www.oui.doleta.gov/unemploy/claimssum.asp'>Department of Labor, Monthly Benefit and Claims data</a> </span>     
[[Go back]](#1)

<span id="fn2"> 2. <a href='https://www2.illinois.gov/ides/IDES%20Forms%20and%20Publications/CLI105L.pdf'>Illinois Unemployment Insurance benefit Handbook</a> </span>        
[[Go back]](#2)

<span id="fn3"> 3. <a href='https://www2.illinois.gov/ides/IDES%20Forms%20and%20Publications/Reg_UI_Certification_Timeline.pdf'>Illinois Regular UI Certification Timeline</a> </span>       
[[Go back]](#3)

<span id="fn4"> 4. Data Source: <a href='https://www.bls.gov/iag/tgs/iag72.htm'>Bureau of Labor Statistics</a> </span>       
[[Go back]](#4)

> Note that the above links don't work inside of the ADRF since you don't have internet access.

> Click [Go back] to go back to where you were.