# **<center> Disclosure Review </center>**

Benjamin Feder, Tian Lou, Dave McQuown

## **1. Introduction**

At this point, we have created useful visualizations and tables we may want to use outside of the ADRF for a final presentation or report. However, **data cannot leave the ADRF without passing the export review process**, where reviewers confirm that none of the findings are disclosive given the rules provided by the agency. This notebook contains information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and provides an overview of the information needed for disclosure review. Please read through the entire notebook because it will separately discuss different types of outputs that will be flagged in the disclosure review process. The examples are almost all taken directly from the code in the three notebooks from Module 2: [cross-sectional analysis](./1.Data_Exploration_Cross-section_Analysis.ipynb), [cohort analysis](./2.Data_Exploration_Cohort_Analysis.ipynb), and [data visualization](./3.Data_Visualization.ipynb).

## **2. Learning Objectives**

We will use a variety of examples from previous notebooks containing vastly different output to walk through how to safely prepare these outputs so they will pass the ADRF disclosure review process. At the end of each section, we will save the disclosure-proofed outputs as csv and png files and explain how they would be properly documented for export. As you work through the material presented in the notebook, think about how you might apply the techniques and code to your project. 

In this notebook, our focus is **disclosure control**. After you finish this notebook, you should understand:
- how to make sure tables and visualizations are safe for export
- how to provide and document supporting files
- how to adjust subgroups if they do not satisfy the disclosure control requirements

### **Research Questions** 
In this notebook, we focus on preparing our answers for the following questions for export: 

- How do Illinois certified UI claimant counts vary by industry? Which industries had the most job losses during the COVID-19 recession?
- What are the median weekly payments received by Illinois certified UI claimants?
- How many and what percentage of certified claimants in the COVID-19 cohort exit during each week after program entry?
- What are the trends of claimant counts in the industries with the most job losses during the COVID-19 recession over time?
- Which counties in Illinois had the highest claimant rate during the peak week?

### **Datasets**
As in the Data Visualization notebook, we will explore and understand the Illinois PROMIS file and labor force counts from the Bureau of Labor Statistics (BLS) in this notebook:
- **2020 Illinois PROMIS certified claims file**: weekly UI claims data. Each record represents a certified claim in a certain week. The data has a claimant's demographics, education level, prior industry, occupation, and locations. It also contains detailed information about the claim, such as program type, claim type, certification status, benefit starting date, and benefit amount. Federal Pandemic Unemployment Compensation (FPUC, $600/week) and dependent benefits are included in the total amount paid.

**All analyses in this notebook are based on the 1% random sample of the certified claims data. You should also only use the 1% random sample when walking through all notebooks and when identifying the scope of your analysis. The same techniques will apply on the entire population of certified claims.**

- **2019 BLS Labor Force Data**: 2019 annual average county level labor force data. The estimates are from a building-block approach, which uses several data sources, including Current Population Survey (CPS), Current Employment Statistics (CES), state UI system, and American Community Survey (ACS).[<sup>1</sup>](#fn1) <a id = "8"> </a>

### **Methods** ###

Through preparing the answers to these questions for export, we will cover techniques for disclosure-proofing the following types of exports:

- Counts
- Percentiles
- Percentages
- Bar plots
- Line plots
- Heat maps

### **Remarks about Disclosure Review**

In general, any kind of file format can be exported. However, researchers typically export tables, graphs, regression outputs and aggregated data. Every result designed for export therefore needs to be saved in .csv, .txt, or graph format. Results cannot be exported inside a Jupyter notebook, because reviewing an entire notebook makes the disclosure review process too burdensome.

However, this does not mean that you will not need Jupyter notebooks in the export process, as they may contain essential code that can be listed as supplemental information for the reviewers. Thus, it is important to document every step of an analysis in a Jupyter notebook, or any code for that matter. 

A more detailed description of the rules for exporting results can be found in the [export documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before preparing an export submission, please read through the export documentation. Please request an export only when results are final and they are needed for a presentation or final project report.

The general idea behind potential disclosure for the data provided in this class is that every statistic for export must be based on at least **10 individuals**. If the counts of individuals are not clear in the desired table or visualization for export, supplemental counts must be provided in the supporting documentation. 

## **3. Tables** ##

In this section, we demonstrate how to tackle disclosure-proofing tables for export. Here, we will discuss three different types of tables:
1. Counts
2. Percentiles
3. Percentages

Note that this is not an exhaustive list of the types of table-based exports. The list contains some of the most common table-based exports, and there is extensive documentation on disclosure-controlling other table-based exports in the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before we touch on these examples, we will load the necessary libraries and establish our connection to the database.

> These are the same packages used in the Data Visualization [notebook](./3.Data_Visualization.ipynb).

In [None]:
#database interaction imports
library(DBI)
library(odbc)

# for data manipulation/visualization
library(tidyverse)
library(lubridate)
library(sf)

# for calculating percentages
library(scales)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

### **Counts**
Recall the example from the [cross-sectional analysis](./1.Data_Exploration_Cross-section_Analysis.ipynb) where we answered the following question:
- How do Illinois certified UI claimant counts vary by industry? Which industries had the most job losses during the COVID-19 recession?

Our solution consisted of counting the number of certified claimants by their industry of job loss during the peak COVID week (REDACTED), defined as the benefit week with the largest total number of certified claimants. Let's load in the csv containing the counts of certified claimants by industry and benefit week, `cs_ind_counts.csv`.

<font color=red> Before you run the cell below, make sure you have run through [the cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
#Load the statewide aggregates that we exported in the cross-section notebook

#Import weekly certified claimant counts by industry
cs_ind_counts <- read_csv("U:\\..\\ETA Training\\Results\\cs_ind_counts.csv")

As we did in the first notebook, we can see the number of claimants by industry in the peak week.

In [None]:
# Show counts for all industries in descending order
cs_ind_counts %>%
    filter(week_end_date == ymd("REDACTED")) %>%
    arrange(desc(claimant_count))

The export rules for the data state that each statistic in an export must consist of at least 10 individuals. Therefore, for each row--indicating the number of certified claimants by industry in a specific week--there must be at least 10 individuals (certified claimants). To disclosure-proof this table, we can add a second clause inside `filter()` to only include industries with at least 10 claimants.

In [None]:
# filter industry counts to satisfy disclosure rules 
cs_ind_counts_export <- cs_ind_counts %>%
    filter(
        week_end_date == ymd("REDACTED"),
        claimant_count >= 10
    ) %>%
    arrange(desc(claimant_count))

head(cs_ind_counts_export)

Since the table designated for export already contains the number of individuals per grouping, it does not require a supplemental file containing counts per group. This table is safe for export.

> However, the table would require secondary suppression if it also contained the total number of claimants receiving benefits in this week. This additional piece of information would allow a reader to work backwards from the total to the number of claimants employed in an industry with fewer than 10 individuals. In practice, the ADRF staff would also suppress the count for the industry with the second-lowest number of certified claimants.
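
To make the subtraction risk concrete, here is a minimal sketch on made-up numbers. The `toy_counts` table, its industry labels, and the published total are all hypothetical and are not drawn from the PROMIS data; the sketch only illustrates why a second cell has to be withheld once a total is published.

```r
library(dplyr)

# Toy industry counts (hypothetical values, for illustration only)
toy_counts <- tibble(
    industry = c("A", "B", "C", "D"),
    claimant_count = c(120, 45, 12, 4)
)

# Suppose the weekly total is published alongside the table
published_total <- sum(toy_counts$claimant_count)

# Primary suppression: drop cells with fewer than 10 individuals
primary <- toy_counts %>%
    filter(claimant_count >= 10)

# With only primary suppression, the hidden cell is recoverable by subtraction
recovered <- published_total - sum(primary$claimant_count)

# Secondary suppression: also withhold the smallest remaining cell,
# so subtraction only reveals the sum of the two hidden cells
secondary <- primary %>%
    filter(claimant_count > min(claimant_count))

hidden_sum <- published_total - sum(secondary$claimant_count)
```

Here `recovered` equals the suppressed count exactly, while `hidden_sum` only reveals the combined size of two industries.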

### **Fuzzy Percentiles**

In the Cross-Section Analysis, we answered the following question:
- What are the average weekly payments received by Illinois certified UI claimants?

For the sake of displaying how to properly proof percentiles for export, we will slightly adjust this question to:
- What are the *median* weekly payments received by Illinois certified UI claimants?

True percentiles cannot be exported under any circumstances, regardless of whether the unit of analysis is itself directly subject to disclosure review. To still convey a sense of the percentiles in the data, fuzzy percentiles can be exported instead; these are created by averaging two true percentiles. For example, a fuzzy median can be created by averaging the true 45th and 55th percentiles. This section provides a template for creating fuzzy percentiles so that they can pass the disclosure review process.

> The median is another way to describe the 50th percentile.
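
The same idea can be wrapped in a small helper. This is only a sketch: the function name `fuzzy_percentile` and its `width` argument are hypothetical, and using a symmetric window of 5 percentage points for percentiles other than the median is an assumption extrapolated from the 45th/55th example above, so confirm the expected window with the export documentation.

```r
# Hypothetical helper: average the true percentiles `width` below and above p
fuzzy_percentile <- function(x, p, width = 0.05, na.rm = FALSE) {
    # the window must stay inside [0, 1]
    stopifnot(p - width >= 0, p + width <= 1)
    bounds <- quantile(x, probs = c(p - width, p + width),
                       na.rm = na.rm, names = FALSE)
    mean(bounds)
}

# Toy payments (hypothetical values)
pay <- 1:100

fuzzy_med <- fuzzy_percentile(pay, 0.50)  # average of 45th and 55th percentiles
fuzzy_q3  <- fuzzy_percentile(pay, 0.75)  # average of 70th and 80th percentiles
```

On this symmetric toy vector the fuzzy median coincides with `median(pay)`; on real, skewed data the two will usually differ slightly.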

To answer the original question concerning average weekly payments, we used the following code:

    df_claimants %>% 
        group_by(week_end_date) %>%
        summarize(avg_total_pay = mean(total_pay))

Recall that `df_claimants` consisted of all certified claimants in Illinois that received benefits in any week since March 7, 2020. To calculate the median weekly payments, we will need to start with this data frame.

In [None]:
# Select PROMIS certified claimant records from the database to a dataframe

# Store SQL query to a variable
query <- "
SELECT ssn_id,
    week_end_date,
    byr_start_week,
    sub_program_type,
    program_type,
    claim_type,
    birth_date,
    gender,
    race,
    ethnicity,
    disability,
    education,
    county_fips_code,
    naics_code,
    occupation_code,
    total_pay
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND week_end_date >= '2020-03-07'
"

# Execute query
df_claimants <-dbGetQuery(con,query)

# R interprets dates as character when pulling from the database, must convert with ymd()
df_claimants <- df_claimants %>%
    mutate(week_end_date=ymd(week_end_date),
          byr_start_week=ymd(byr_start_week),
          birth_date=ymd(birth_date))

# See top records in the dataframe
head(df_claimants)

# Close the database connection
dbDisconnect(con)

Now, if we were to replace `mean(total_pay)` with `median(total_pay)` in the `summarize()` call, we would be calculating the true median pay received by Illinois certified UI claimants in each week.

In [None]:
# calculate median pay
df_claimants %>% 
    group_by(week_end_date) %>%
    summarize(med_total_pay = median(total_pay)) %>%
    head()

However, as mentioned at the beginning of the section, direct percentiles cannot be exported from the ADRF. Instead, we can export a "fuzzy median", which will be calculated by averaging the true 45th and 55th percentiles. To calculate specific percentiles, we will use the `quantile()` function, which requires the variable of interest and the proportion corresponding to the percentile.

In [None]:
# implement fuzzy median
fuzzy_median_export <- df_claimants %>% 
    group_by(week_end_date) %>%
    summarize(
        fuzzy_med = (quantile(total_pay, .45) + quantile(total_pay, .55))/2
    )

head(fuzzy_median_export)

Notice how the fuzzy median weekly payments do not deviate much from the true median ones.

Although we have prepared the fuzzy median payments per week, we have not provided all of the information necessary to receive this file in an export because **we have not shown that there are at least 10 claimants receiving weekly payments in each week**. There are two potential options:

1. Include the count of certified claimants per week in `fuzzy_median_export` and keep only the weeks with at least 10 certified claimants.
2. Create a separate table showing that all weeks have at least 10 individuals, then filter out any weeks that do not pass the check.
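
For comparison, the first option might look roughly like the sketch below. It runs on a hypothetical stand-in for `df_claimants` (`toy_claimants`, with made-up values) so that the filter has something to remove.

```r
library(dplyr)

# Toy stand-in for df_claimants (hypothetical values):
# 20 claimants in the first week, only 5 in the second
toy_claimants <- tibble(
    ssn_id = 1:25,
    week_end_date = as.Date(c(rep("2020-03-14", 20), rep("2020-03-21", 5))),
    total_pay = c(seq(100, 480, by = 20), 200, 250, 300, 350, 400)
)

# Option 1: carry the supporting count in the same table and filter on it
option_one <- toy_claimants %>%
    group_by(week_end_date) %>%
    summarize(
        fuzzy_med = (quantile(total_pay, .45) + quantile(total_pay, .55)) / 2,
        number_individuals = n_distinct(ssn_id)
    ) %>%
    filter(number_individuals >= 10)
```

Only the first week survives the filter, and the count that justifies it travels with the statistic in a single file.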

Here, we will opt for the second option, and create a separate data frame containing the number of individuals receiving weekly payments per week. To differentiate between the files we want for export and those to demonstrate proof of the export guidelines, we will name this data frame `fuzzy_median_input`.

In [None]:
# number of claimants receiving benefits per week
fuzzy_median_input <- df_claimants %>%
    group_by(week_end_date) %>%
    summarize(
        number_individuals = n_distinct(ssn_id)
    )

fuzzy_median_input

Luckily, the numbers of certified claimants in these weeks are all at least 10, so we will not need to filter out any weeks in `fuzzy_median_export`. `fuzzy_median_export` is now safe for export, assuming that `fuzzy_median_input` is included in the export request as a supplemental file showing that the disclosure constraints are satisfied.

> If `total_pay` contained any `NA`'s, and `NA` values were ignored in generating these fuzzy median estimates, the number of certified claimants per week must consist of only those with non-`NA` weekly payment values.
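
A minimal sketch of that situation, on a hypothetical `toy_pay` table with two missing payments, shows how the supporting count changes once `NA` values are dropped from the calculation:

```r
library(dplyr)

# Toy data (hypothetical values): 12 claimants, 2 with missing payments
toy_pay <- tibble(
    ssn_id = 1:12,
    week_end_date = as.Date("2020-03-14"),
    total_pay = c(seq(100, 550, by = 50), NA, NA)
)

na_aware <- toy_pay %>%
    group_by(week_end_date) %>%
    summarize(
        # na.rm = TRUE drops missing payments from the percentile calculation
        fuzzy_med = (quantile(total_pay, .45, na.rm = TRUE) +
                     quantile(total_pay, .55, na.rm = TRUE)) / 2,
        # every claimant in the week, including those with missing pay
        n_all = n_distinct(ssn_id),
        # only the claimants whose payment actually entered the calculation
        n_non_missing = n_distinct(ssn_id[!is.na(total_pay)])
    )
```

In this case `n_non_missing`, not `n_all`, is the count that belongs in the supporting file.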

### **Percentages**

For any reported percentages or proportions, the underlying counts of individuals contributing to the numerators and denominators must be provided for each statistic in the desired export. In the Cohort Analysis notebook, we answered the following question:

- How many and what percentage of certified claimants in the COVID-19 cohort exit during each week after program entry?

The answer to this question is stored in the csv file `cs_exits.csv`. Let's load that into the environment as `cs_exits`.

<font color=red> Before you run the cell below, make sure you have run through [the cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
#Import weekly exit rates
cs_exits <- read_csv("U:\\..\\ETA Training\\Results\\cs_exits.csv")

# see cs_exits
head(cs_exits)

The answer to the question "How many and what percentage of certified claimants in the COVID-19 cohort exit during each week after program entry" is stored in the `exit_pct` column. This table cannot be exported in its current state because it is not clear to the reviewer how many individuals were in the numerator in creating `exit_pct`. 

Recall the code used to create `cs_exits`:

    cs_exits <- cs_week_number %>% mutate(cohort_start_pop=cohort_pop) %>% #Add the starting cohort population to the dataframe
        mutate(claimant_count = case_when(week_number == 1 ~ cohort_start_pop, TRUE ~ claimant_count)) %>% #Replace count of claimants in week 1 with the cohort population
        mutate(stay_pct = claimant_count/cohort_start_pop) %>% #Calculate share staying
        mutate(exit_pct = 1 - stay_pct)

Since `exit_pct` is simply the complement of `stay_pct` (that is, `1 - stay_pct`), and the numerator and denominator of `stay_pct` are both counts of individuals, or claimants, the total number of individuals exiting per week can be calculated by subtracting `claimant_count` from `cohort_start_pop`. Then, as long as we make sure that there are at least 10 claimants in each numerator and denominator used to calculate `stay_pct` and `exit_pct`, this data frame is ready for export. 
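
Written out, the relationship between the two percentages and the underlying counts is:

$$
\mathrm{exit\_pct} = 1 - \mathrm{stay\_pct} = 1 - \frac{\mathrm{claimant\_count}}{\mathrm{cohort\_start\_pop}} = \frac{\mathrm{cohort\_start\_pop} - \mathrm{claimant\_count}}{\mathrm{cohort\_start\_pop}}
$$

As a purely hypothetical illustration, a cohort of 500 claimants with 460 still receiving benefits in a given week has 500 - 460 = 40 exits, so `exit_pct` is 40/500 = 0.08. The numerator that the reviewer needs to see for `exit_pct` is therefore 40, not 460.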

However, because it is often important to keep the claimant count from the first week as a baseline, we will keep only weeks where there are either 0 total exits or at least 10. If the total number of exits is between 1 and 9, it is considered disclosive and will not pass export review. If you run into this issue, you can report biweekly exit rates, or redact exit rates for weeks where the total number of exiters is between 1 and 9. **Keep in mind that each week is compared against the first-week cohort population, not against the previous week: even if the claimant count changes by fewer than 10 between two consecutive weeks, a week will pass disclosure review as long as its claimant count is either the same as the first-week count or differs from it by at least 10 claimants.**

> In the documentation required for submitting an export, note which variables consist of numerator and denominator for all percentages and proportions.

In [None]:
# calculate number of individuals in the exit_pct
exit_rate_export <- cs_exits %>%
    mutate(
        total_exits = cohort_start_pop - claimant_count
    ) %>%
    filter(
        claimant_count >= 10, # numerator for stay_pct
        cohort_start_pop >= 10, # denominator for stay_pct and exit_pct
        total_exits == 0 | total_exits >= 10 # numerator for exit_pct 
    )

exit_rate_export

Differences of fewer than 10 claimants between two consecutive weeks are acceptable because of potential churn: we do not know whether those receiving benefits in one week are the exact same individuals receiving benefits in the next week. This holds as long as the difference from the original cohort population is either 0 or at least 10.
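
If dropping weeks entirely is undesirable, the redaction alternative mentioned earlier could be sketched as follows. The `toy_exits` table and the `passes` column are hypothetical; the sketch blanks out the counts as well as the rates, on the assumption that leaving a small count visible next to a redacted rate would defeat the purpose.

```r
library(dplyr)

# Toy cohort table (hypothetical values) shaped like cs_exits
toy_exits <- tibble(
    week_number = 1:5,
    cohort_start_pop = 500,
    claimant_count = c(500, 495, 480, 430, 8)
) %>%
    mutate(
        stay_pct = claimant_count / cohort_start_pop,
        exit_pct = 1 - stay_pct,
        total_exits = cohort_start_pop - claimant_count
    )

# Keep every week as a row, but blank out the values that fail the rules
redacted <- toy_exits %>%
    mutate(
        passes = claimant_count >= 10 &
            cohort_start_pop >= 10 &
            (total_exits == 0 | total_exits >= 10),
        stay_pct = if_else(passes, stay_pct, NA_real_),
        exit_pct = if_else(passes, exit_pct, NA_real_),
        claimant_count = if_else(passes, claimant_count, NA_real_),
        total_exits = if_else(passes, total_exits, NA_real_)
    )
```

Weeks 2 and 5 are redacted here: week 2 has only 5 exits relative to the starting cohort, and week 5 has fewer than 10 remaining claimants.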

## **4. Visualizations**

Whereas the previous section contained examples on how to prepare tables for export, this section will discuss preparing certain visualizations for export. The approach for disclosure-proofing visualizations follows suit with proofing tables--it all pertains to the underlying data used to create the visualization. Here, we will discuss three different types of visualizations:
1. Bar plots
2. Line plots
3. Heatmaps

Keep in mind that this is not an exhaustive list of the types of visualization-based exports. Please refer to the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/) for guidance on preparing other visualizations for disclosure review.

> The most common type of visualization that may require more non-obvious underlying data manipulation to satisfy export constraints is a *histogram*, as each bar in the histogram is technically a separate grouping. Therefore, each bar in the histogram, in this case, must contain information from at least 10 individuals.
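
As a rough sketch of what that check could look like, the example below bins a hypothetical payment variable with `cut()`, counts distinct individuals per bin, and then widens the top bin when one bin falls short. The data, bin edges, and object names are all made up for illustration.

```r
library(dplyr)

# Toy payments (hypothetical values), one row per individual
toy_hist <- tibble(
    ssn_id = 1:60,
    total_pay = c(rep(150, 12), rep(250, 30), rep(350, 15), rep(450, 3))
)

# Count distinct individuals in each would-be histogram bar
bin_counts <- toy_hist %>%
    mutate(bin = cut(total_pay, breaks = seq(100, 500, by = 100))) %>%
    group_by(bin) %>%
    summarize(number_individuals = n_distinct(ssn_id))

# Bars that would not pass review
small_bins <- bin_counts %>%
    filter(number_individuals < 10)

# One possible adjustment: merge the two highest bins into a wider one
wider_counts <- toy_hist %>%
    mutate(bin = cut(total_pay, breaks = c(100, 200, 300, 500))) %>%
    group_by(bin) %>%
    summarize(number_individuals = n_distinct(ssn_id))
```

The histogram would then be drawn with the adjusted breaks, and `wider_counts` would serve as the supporting file.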

### **Bar Plots**

Recall the question which was answered in the Data Visualization notebook using a bar plot:

- What were the top five industries with the most claimants during the peak week?

Since a bar plot is separated by distinct categories, or groups, the export requirements are straightforward: each bar must consist of observations from at least 10 individuals. Recall the code used to generate the final bar plot from the data frame `cs_ind_counts`, which contains the counts of certified claimants by industry and benefit week.

In [None]:
# Code adjusting overall graph attributes

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 12, repr.plot.height = 8)

In [None]:
#Find the top 5 industries as of REDACTED
#Filter by week_end_date, sort descending by claimant_count, and keep the top 5 records

cs_ind_counts_peak <- cs_ind_counts %>%
    filter(week_end_date == ymd("REDACTED")) %>%
    arrange(desc(claimant_count)) %>%
    head(5)

head(cs_ind_counts_peak)

In [None]:
#Assign a custom color palette to use with the bar graphs
palette_color <- c("REDACTED" = "orange",
                   "REDACTED" = "blue",
                   "REDACTED" = "red",
                   "REDACTED" = "purple",
                   "REDACTED" = "green3")

In [None]:
#Create a bar chart showing the 5 industries with the most claimants as of the week ending REDACTED

#Specify source dataset and x and y variables
cs_ind_counts_peak_plot <- ggplot(cs_ind_counts_peak, aes(x = claimant_count, 
                                                        y = reorder(naics_maj_desc_rv,claimant_count),
                                                        fill=naics_maj_desc_rv)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = palette_color, guide = "none") +

#Adjust the x scale to set the interval for tick marks
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, 1500, 100),
    limits = c(0, 1500)) +

#Add titles and axis labels
labs(title = "Certified claimants in Illinois by industry the week ending REDACTED",
     subtitle = "Top 5 Industries at the peak week",
     x = "Certified claimant counts", y = "Industry",
     caption = "Data Source: IL PROMIS file")

#Display the graph we just made
print(cs_ind_counts_peak_plot)

Although we can see that each bar contains at least 10 individuals, we should make it as clear as possible for the reviewer. To do so, we will include the specific counts for these five industries in the peak week as a supplemental file. These counts already exist in `cs_ind_counts_peak`. Just so we do not forget, let's save them in `cs_ind_counts_peak_plot_input` as well.

In [None]:
# save cs_ind_counts_peak as cs_ind_counts_peak_plot_input

cs_ind_counts_peak_plot_input <- cs_ind_counts_peak

# see cs_ind_counts_peak_plot_input
cs_ind_counts_peak_plot_input

Again, since each bar consists of at least 10 individuals, there is no need for any additional redaction.
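
Had some bars fallen below the threshold, one way to adjust the subgroups would be to combine the small categories before plotting. The sketch below uses a hypothetical `toy_ind` table and assumes the categories are mutually exclusive, so that summing their counts still counts individuals; the combined group is itself held to the same rule.

```r
library(dplyr)

# Toy industry counts (hypothetical values)
toy_ind <- tibble(
    industry = c("A", "B", "C", "D", "E"),
    claimant_count = c(300, 120, 40, 7, 6)
)

collapsed <- toy_ind %>%
    # relabel every industry with fewer than 10 claimants as "Other"
    mutate(industry = if_else(claimant_count >= 10, industry, "Other")) %>%
    group_by(industry) %>%
    summarize(claimant_count = sum(claimant_count)) %>%
    # the combined group must also contain at least 10 individuals
    filter(claimant_count >= 10) %>%
    arrange(desc(claimant_count))
```

If the combined "Other" group were still below 10, the final filter would drop it rather than let it through.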

### **Line Graphs**

In the Data Visualization notebook, the following question was answered with a line graph:

- What are the trends of claimant counts in the top five industries over time?

To display changes over time, a line graph can be used. Line graphs can be slightly more complicated than bar graphs for the disclosure review process, as each point utilized to create the line graph is also subject to the export guidelines, as opposed to each line just needing to consist of at least 10 individuals. Recall the code used to answer the question of "What are the trends of claimant counts in the top five industries over time?"

In [None]:
#Subset your data to the top 5 industries
#and also filter out the most recent benefit week, which ends 12-26-2020.
cs_ind_counts_top5 <- cs_ind_counts %>%
    filter(naics_maj_code_rv %in% c('REDACTED')) %>%
    filter(week_end_date != ymd("2020-12-26"))

In [None]:
#Create line graph showing trends in the top 5 industries over time

#Specify source and x and y variables
cs_ind_counts_top5_plot <- ggplot(cs_ind_counts_top5, aes(x = week_end_date, y = claimant_count, 
                                                        color=naics_maj_desc_rv)) +

#Adds a line to the graph
geom_line() + 

#Adjust the y scale to assign start and end points as well
#as the interval for tick marks
scale_y_continuous(labels = scales::comma,
    breaks = seq(0, 1400, 100),
    limits = c(0, 1400)) + 

#Adjust the x scale to specify date format, assign start and end points,
#and set the interval for tick marks
scale_x_date(
    breaks = seq(ymd("2020-03-14"), ymd("2020-12-19"), by='2 weeks'),
    labels = date_format(format="%b %d")) + 

#Apply the color palette
scale_color_manual("", values = palette_color) +

#Add a title and labels for the x and y axes
labs(title = "Certified claimants by benefit week in Illinois",
     subtitle = "Top 5 Industries at the peak week",
    x = "Week", y = "Certified claimant counts",
    caption = "Data Source: IL PROMIS file") +

#Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5))

#Display the plot we just made
print(cs_ind_counts_top5_plot)

Similar to the bar plot example, it is pretty clear that each point on all of the lines consists of at least 10 individuals. However, we will again need to make this as clear as possible for the reviewers. Luckily, this table already exists for us as the input table used to generate the above visualization, `cs_ind_counts_top5`. We just need to confirm that all `claimant_count` values are at least 10.

> Because we are not tracking a specific cohort here, but rather counting the number of certified claimants per week at different cross-sections, we do not need to compare the counts at the first week to others to make sure the difference is at least 10.

In [None]:
# get counts of certified claimants by industry and week
cs_ind_counts_top5

Unfortunately, there are too many rows to see all of the counts at once. There are a few different solutions to this issue. Here, we will check whether the number of rows changes when filtering out rows where `claimant_count` is less than 10. As long as the following code outputs `TRUE`, there were no rows, or counts of individuals by industry and week, that did not satisfy the export rules.

In [None]:
# see if number of rows is different when filtering out rows with fewer than ten individuals
nrow(cs_ind_counts_top5) == nrow(cs_ind_counts_top5 %>% filter(claimant_count >= 10))

If the above code had resulted in `FALSE`, the line plot as it is currently displayed would not pass the disclosure review process, and we would need an alternative way to visually display the answer to the question stated at the start of the section. But because it is `TRUE`, let's save `cs_ind_counts_top5` as `cs_ind_counts_top5_plot_input`.

In [None]:
# save as cs_ind_counts_top5_plot_input
cs_ind_counts_top5_plot_input <- cs_ind_counts_top5

head(cs_ind_counts_top5_plot_input)

Note that if we were to export a line graph of exit rates by week, we would need to provide proof that both the counts of the number of individuals contributing to the numerator and denominator were at least 10 for each week.
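
For a rate-based line graph, a check along these lines could list the offending points rather than only returning `TRUE` or `FALSE`. The `toy_rates` table is hypothetical, and the rule applied to the numerator mirrors the one used for `exit_pct` earlier (0 exits or at least 10).

```r
library(dplyr)

# Toy weekly exit data behind a hypothetical exit-rate line graph
toy_rates <- tibble(
    week_number = 1:4,
    cohort_start_pop = 200,
    total_exits = c(0, 14, 6, 35)
)

# Points whose numerator or denominator would block the export
failing_points <- toy_rates %>%
    filter(
        cohort_start_pop < 10 |
            (total_exits > 0 & total_exits < 10)
    )

# TRUE only when every plotted point satisfies the rules
plot_is_safe <- nrow(failing_points) == 0
```

Inspecting `failing_points` shows which weeks need to be redacted or aggregated before the plot is regenerated.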

### **Heat Maps**

In the Data Visualization notebook, a county-level heat map was created to answer this question: 

- Which counties in Illinois had the highest claimant rate during the peak week?

The heat map allowed us to visually display claimant rates by county according to their specific location in Illinois, where the counties with the highest claimant rates were indicated on a color spectrum. Heat maps follow similar rules to those of bar plots, as each location (instead of bar) must consist of observations from at least 10 individuals. However, since the metric displayed is not of individual counts, but rather claimant rates, we cannot filter by claimant rates to make sure the visualization will pass the export review process.

Recall the code written to create the original heat map:

In [None]:
#SQL statement to select ssn_id, week_end_date, and county_fips_code
#from the certified claimants data

# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

# Store SQL query to a variable
query <- "
SELECT ssn_id,
    week_end_date,
    county_fips_code
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND week_end_date >= '2020-03-07';
"

# Execute query
df_claimants_county <-dbGetQuery(con,query)

# R interprets dates as character pulling from the database, must convert with ymd()
df_claimants_county <- df_claimants_county %>%
    mutate(week_end_date=ymd(week_end_date))

# Close the database connection
dbDisconnect(con)

In [None]:
#Calculate certified claimant counts by county
cs_county <- df_claimants_county %>% 
    group_by(week_end_date, county_fips_code) %>%
    summarize(claimant_count=n())

#Filter the data to the week ending REDACTED
cs_county_peak <- cs_county %>%
    filter(week_end_date == ymd("REDACTED")) %>%
    arrange(desc(claimant_count))

In [None]:
#Import Local Area Unemployment Statistics labor force counts,
#limit to most recent year, which is currently 2019.
#Keep the FIPS code and the labor force count, which we rename to simplify handling later on.

laus <- read_csv("P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\labor_by_county.csv", col_names = TRUE, col_types = "cciiiid") %>%
    filter(YEAR == 2019) %>%
    select(county_fips_code = FIPS, labor_force_2019 = `LABOR_FORCE`)

#Join claimant counts by county at REDACTED to the 2019 labor force
#Divide counts by labor force to get the claimant rate per county
#Since we are using a 1% sample, we first divide the labor force by 100 to adjust.
#Coalesce replaces any county without any claimants in the sample with 0.
#Otherwise these would just get values of NA and not display on the map.

cs_county_peak_cr <- left_join(laus, cs_county_peak, by = c("county_fips_code")) %>%
    mutate(claimant_rate = coalesce(claimant_count/(labor_force_2019/100),0))

In [None]:
#Load TIGER/line county shapefile
data_poly <- read_sf(dsn="P:\\tr-dol-eta\\ETA Class Notebooks\\geom", layer="tl_2019_us_county") %>%
    filter(STATEFP == 'REDACTED') %>% #Limit to Illinois
    mutate(lat=as.numeric(INTPTLAT), long=as.numeric(INTPTLON)) %>% #Convert X and Y coordinates from character to numeric
    rename(county_fips_code = COUNTYFP, name = NAME) %>% #Rename some fields
    select(county_fips_code, name, lat, long, geometry) #Select a subset of columns

#Join geographies to the counts dataframe
#Both data frames share the renamed county_fips_code column, so we join on it explicitly using by=
cs_county_peak_cr_geo <-
    left_join(cs_county_peak_cr, data_poly, by = "county_fips_code")

In [None]:
# First, let's adjust the plot attributes so they are more appropriate for maps
# In this case, we want our plot to be taller than it is wide for a map

# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 8, repr.plot.height = 12)

In [None]:
# Map the claimant rate for each county as of REDACTED

# Specify the source dataset, geometry column, and fill column
cs_peak_county_map <- ggplot(cs_county_peak_cr_geo, aes(geometry = geometry, fill = claimant_rate)) +  

# Plot the map
geom_sf() +

# Specifying the coordinate system improves the appearance of the projection
coord_sf(crs=4269) +

# Apply county label names
geom_text(aes(x = long, y = lat, label = name),
               size=2.5, color = "black") +

# Define gradient for fill
scale_fill_gradient(low='white',high='brown') +

# Apply classic theme, which includes some nice visual defaults
theme_classic() +

# Apply map labels
labs(x = "", y = "", color = "", fill = "",
    title = "Claimant rates varied substantially between Illinois counties \nin the week ending REDACTED",
    caption = "Data Source: IL PROMIS file") +

# This code removes some visual elements such as x- and y-axis lines that are
# not desirable for maps
theme(panel.grid = element_blank(),
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    title = element_text(size=15))

# Print the map we just created
print(cs_peak_county_map)

The heat map itself does not show how many individuals contribute to the claimant rate in each county. To find out whether we need to work backwards to recover those counts, we will start by looking at the columns in `cs_county_peak_cr_geo`, the underlying data frame used to create the heat map above.

In [None]:
# see cs_county_peak_cr_geo
cs_county_peak_cr_geo

After reviewing the code above, you can see how `claimant_rate` was created:

    cs_county_peak_cr <- left_join(laus, cs_county_peak, by = c("county_fips_code")) %>%
        mutate(claimant_rate = coalesce(claimant_count/(labor_force_2019/100),0))

Since `labor_force_2019`, the denominator of `claimant_rate`, comes from public data, we do not need to provide any additional information for disclosure review beyond noting in the export request that the variable comes from public data. The numerator, `claimant_count`, is not public data, though, so we need to confirm that each county contains at least 10 individuals. As you can see, some counties have `claimant_count` values below 10, so we will need to set the claimant rates of these counties to `NA` and regenerate the visualization.
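
Rather than scanning the printed data frame by eye, the counties that fall short of the threshold can be listed directly. The following is a minimal sketch on a made-up data frame (`toy_counties` and its values are hypothetical, not taken from the PROMIS data). It assumes that counties with no claimants in the sample have a missing `claimant_count` after the join, and treats them as below the threshold as well.

```r
library(dplyr)

# Hypothetical stand-in for cs_county_peak_cr_geo (values are made up)
toy_counties <- tibble(
    county_fips_code = c("001", "003", "005", "007"),
    claimant_count   = c(25, 8, NA, 10),
    labor_force_2019 = c(1000, 400, 300, 500)
) %>%
    mutate(claimant_rate = coalesce(claimant_count / (labor_force_2019 / 100), 0))

# Counties whose numerator is missing or below the threshold of 10
below_threshold <- toy_counties %>%
    filter(is.na(claimant_count) | claimant_count < 10)

below_threshold
```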

In [None]:
# set the claimant rate to NA for all counties with fewer than 10 claimants
cs_county_peak_cr_geo_filtered <- cs_county_peak_cr_geo %>%
    mutate(
        claimant_rate = ifelse(claimant_count >= 10, claimant_rate, NA)
        )

# see updated data frame
head(cs_county_peak_cr_geo_filtered)
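
Because `head()` only displays the first few rows, it may be worth checking the whole data frame programmatically. The sketch below uses a made-up data frame (`toy_filtered` is hypothetical) and applies the same `ifelse()` rule as above, then stops with an error if any row still carries a rate despite having fewer than 10 claimants.

```r
library(dplyr)

# Hypothetical stand-in for the county-level data (values are made up)
toy_filtered <- tibble(
    claimant_count = c(25, 8, NA, 10),
    claimant_rate  = c(2.5, 2, 0, 2)
) %>%
    mutate(claimant_rate = ifelse(claimant_count >= 10, claimant_rate, NA))

# Rows that still show a rate even though the count is missing or below 10
leaks <- toy_filtered %>%
    filter(!is.na(claimant_rate), is.na(claimant_count) | claimant_count < 10)

# Stop with an error if any small cell slipped through
stopifnot(nrow(leaks) == 0)
```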

Now that we have confirmed that counties with fewer than 10 claimants have `NA` claimant rates, we can recreate the heat map.

In [None]:
# Map the claimant rate for each county as of REDACTED

# Specify the source dataset, geometry column, and fill column
cs_peak_county_map_filtered <- ggplot(cs_county_peak_cr_geo_filtered, aes(geometry = geometry, fill = claimant_rate)) +  

# Plot the map
geom_sf() +

# Specifying the coordinate system improves the appearance of the projection
coord_sf(crs=4269) +

# Apply county label names
geom_text(aes(x = long, y = lat, label = name),
               size=2.5, color = "black") +

# Define gradient for fill
scale_fill_gradient(low='white',high='brown') +

# Apply classic theme, which includes some nice visual defaults
theme_classic() +

# Apply map labels
labs(x = "", y = "", color = "", fill = "",
    title = "Claimant rates varied substantially between Illinois counties \nin the week ending REDACTED",
    caption = "Data Source: IL PROMIS file") +

# This code removes some visual elements such as x- and y-axis lines that are
# not desirable for maps
theme(panel.grid = element_blank(),
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    title = element_text(size=15))

# Print the map we just created
print(cs_peak_county_map_filtered)

As you can see, counties with an insufficient number of individuals contributing to the claimant rates are now grey.
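
The grey fill comes from the `na.value` argument of `scale_fill_gradient()`, which defaults to `"grey50"`. As a small illustration on made-up data (`toy_rates` is hypothetical, and a tile plot stands in for the map), the argument can be set explicitly so the color used for suppressed values is a deliberate choice:

```r
library(ggplot2)

# Hypothetical rates; NA marks a suppressed county
toy_rates <- data.frame(
    county        = c("A", "B", "C"),
    claimant_rate = c(1.2, NA, 3.4)
)

# na.value controls the fill used for NA rates ("grey50" is the default)
toy_plot <- ggplot(toy_rates, aes(x = county, y = 1, fill = claimant_rate)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "brown", na.value = "grey50")

print(toy_plot)
```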

Finally, let's save `cs_county_peak_cr_geo_filtered` as `cs_peak_county_map_filtered_input`.

In [None]:
# save as cs_peak_county_map_filtered_input
cs_peak_county_map_filtered_input <- cs_county_peak_cr_geo_filtered

head(cs_peak_county_map_filtered_input)

## **5. Supervised Machine Learning**


After you create your training and test datasets, you must include the counts of each variable, and you should not alter the datasets after doing so. If you use any dummy variables, you need to provide the counts of 0s and 1s for each dummy variable.
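
One possible way to produce the counts of 0s and 1s is sketched below. The data frames `train` and `test`, their columns, and the helper `count_dummies()` are all hypothetical and only illustrate the idea; the exact format required is described in the export documentation.

```r
# Hypothetical training and test sets made up of dummy variables
train <- data.frame(is_female = c(1, 0, 1, 1, 0, 1),
                    exited    = c(0, 0, 1, 0, 1, 1))
test  <- data.frame(is_female = c(0, 1, 0),
                    exited    = c(1, 0, 0))

# Hypothetical helper: count the 0s and 1s in every column of a data frame
count_dummies <- function(df) {
    counts <- lapply(names(df), function(v) {
        data.frame(variable = v,
                   n_0 = sum(df[[v]] == 0, na.rm = TRUE),
                   n_1 = sum(df[[v]] == 1, na.rm = TRUE))
    })
    do.call(rbind, counts)
}

train_counts <- count_dummies(train)
test_counts  <- count_dummies(test)

train_counts
test_counts
```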

Remember that a plot of y-scores is still a histogram in which each estimate represents an individual data point. It therefore needs to comply with the disclosure threshold described above: each bin must contain at least 10 individuals. Please refer to the disclosure review documentation for more information about exporting histograms.
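
As a rough sketch of how the bin counts could be checked before plotting, the example below computes the histogram without drawing it and flags bins under the threshold. The scores are simulated and the bin width of 0.1 is an arbitrary assumption; neither comes from the notebooks.

```r
set.seed(1)
scores <- runif(200)            # simulated stand-in for predicted scores
breaks <- seq(0, 1, by = 0.1)   # assumed bin edges

# Compute the histogram without plotting to get the count in each bin
h <- hist(scores, breaks = breaks, plot = FALSE)

bin_counts <- data.frame(
    bin_start = head(h$breaks, -1),
    bin_end   = tail(h$breaks, -1),
    n         = h$counts
)

# Bins with fewer than 10 individuals would need to be merged or suppressed
small_bins <- bin_counts[bin_counts$n < 10, ]

bin_counts
small_bins
```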

## **6. Exporting Output and Supplemental Files as .csv and .png**

Before saving the data frames and visualizations as .csv and .png files, respectively, refer to the file naming convention mentioned in the export documentation. The export team will not review submissions that do not adhere to the naming convention.

> As done in the Data Exploration and Visualization notebooks, you can save to .csv and .png using `write.csv()` and `png()`.
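
A minimal sketch of both calls is shown below. The data frame, the plot, and the file names are placeholders and do not follow the required naming convention, and `tempdir()` stands in for your actual output folder.

```r
library(ggplot2)

# Placeholder output folder and file names (not the required naming convention)
out_dir  <- tempdir()
csv_path <- file.path(out_dir, "toy_table.csv")
png_path <- file.path(out_dir, "toy_plot.png")

# Hypothetical disclosure-proofed table
toy_table <- data.frame(county = c("A", "B"), claimant_rate = c(1.2, 3.4))

# Save the table as a .csv file without row names
write.csv(toy_table, csv_path, row.names = FALSE)

# Save the plot as a .png file: open the device, print the plot, close the device
toy_plot <- ggplot(toy_table, aes(x = county, y = claimant_rate)) +
    geom_col()

png(png_path, width = 800, height = 600)
print(toy_plot)
dev.off()
```

Inside `png()`, a ggplot object has to be printed explicitly, and the file is only written once `dev.off()` closes the device.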

## **7. Footnotes**
<span id="fn1"> 1. <a href='https://www.bls.gov/lau/laumthd.htm'>BLS Local Area Unemployment Statistics Estimation Methodology</a> </span>   
[[Go back]](#8)

> Note that the above link does not work inside of the ADRF since you do not have internet access.

> Click [Go back] to go back to where you were.