<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 

# **<center> Data Visualization </center>**

<a href="https://doi.org/10.5281/zenodo.4589040"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4589040.svg" alt="DOI"></a>

Tian Lou, Dave McQuown

## **1. Introduction**
In the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb) and the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb), we have analyzed IL certified claimants data and have created summary statistics, such as weekly claimant counts, average weekly total pay, exit rate of the COVID-19 cohort claimants during each week after program entry, etc. Sometimes, it is hard to see trends and to communicate with your audience by only looking at summary statistics. In this notebook, we will import the csv files we saved in the previous two notebooks and learn how to turn them into informative visualizations. We will also discuss what type of visualization to choose for different types of analyses and how to use labels and plot adjustments to better present your results. As you work through the notebook, we will have checkpoints for you to practice using the code. You can think about how you might apply any of the techniques and code presented in this notebook to your project.

## **2. Learning Objectives**

After you finish this notebook, you should know:

- how to use the R package `ggplot2` to create line plots, bar charts, and maps
- how to choose visualization types for your analyses
- how to improve your visualizations with informative titles, labels, and plot settings


#### **Research Questions** 
We will create various visualizations by using the summary statistics we created in the two data exploration notebooks.  In addition, we will create a map to show how claimant rates vary by county. Claimant rate is defined as the percentage of certified claimants in the labor force. The questions we seek to answer are:

**Cross-sectional Analysis**
- What are the trends of Illinois certified claimant counts during the COVID-19 recession? Which week had the highest number of certified claimants?
- What were the top five industries with the most claimants during the peak week? What are the trends of claimant counts in the top five industries over time?
- Which counties in Illinois had the highest claimant rate during the peak week?

**Cohort Analysis**
- How many and what percentage of certified claimants in the COVID-19 cohort exit during each week after program entry?
- How do the COVID-19 cohort exit rates vary by the top five industries with the most claimants?

#### **Datasets** ####
The summary statistics we created in the two data exploration notebooks are based on the Illinois PROMIS file.
- **2020 Illinois PROMIS certified claims file**: weekly UI claims data. Each record represents a certified claim in a certain week. The data has a claimant's demographics, education level, prior industry, occupation, and locations. It also contains detailed information about the claim, such as program type, claim type, certification status, benefit starting date, and benefit amount. Federal Pandemic Unemployment Compensation (FPUC, $600/week) and dependent benefits are included in the total amount paid.

**All analyses in this notebook are based on the 1% random sample of the certified claims data. You should also only use the 1% random sample when walking through all notebooks and when identifying the scope of your analysis.**

We will also use labor force counts from the Bureau of Labor Statistics (BLS).
- **2019 BLS Labor Force Data**: 2019 annual average county level labor force data. The estimates are from a building-block approach, which uses several data sources, including the Current Population Survey (CPS), Current Employment Statistics (CES), state UI system, and American Community Survey (ACS).[<sup>1</sup>](#fn1) <a id = "8"> </a>

#### **Methods** ####
We will cover the following visualizations in this notebook:
- **Line Plot**: is typically used for time series data to show how a variable changes over time
- **Bar Plot**: visualizes relationships between numerical and categorical variables
- **Heat Map**: shows geographical variations in a variable using graded differences in color

We use the R package `ggplot2` to create all visualizations. Here is a brief introduction to **the syntax of `ggplot2`**:

- start with `ggplot()` <br>
- then, supply a dataset and aesthetic mapping with `x` pertaining to the variable on the x-axis, and so on, for example: `ggplot(dataset, aes(x = ..., y = ...))` <br>
- from there, provide a geometric object represented by `geom_` to convey the desired type of visualization <br>
- finally, add additional layers if necessary using `+` <br>

For example, we can use the code below to create a line plot (geometric object).

    ggplot(data, aes(x = ... , y = ...)) + 
        geom_line()
   
We can add an additional layer `labs()` to create a line plot with a title.

    ggplot(data, aes(x = ... , y = ...)) + 
        geom_line() + 
        labs(title = 'My plot title')
        
The `aes()` call can contain additional arguments outside of `x` and `y` to potentially match the `fill`, `color`, `linetype`, and additional specifications of specific variables in a dataset.
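
As a minimal sketch of the syntax described above, the cell below builds a line plot from a small made-up data frame and maps a third variable to `color` inside `aes()`. The data frame `toy` and its columns `week`, `count`, and `group` are hypothetical and exist only for illustration; they are not part of the claimant data.

```r
library(ggplot2)

# Hypothetical toy data: weekly counts for two made-up groups
toy <- data.frame(
  week  = rep(1:4, times = 2),
  count = c(10, 40, 30, 20, 5, 15, 25, 10),
  group = rep(c("A", "B"), each = 4)
)

# Map week to x, count to y, and group to line color,
# then add a line geometry and a labs() layer with "+"
toy_plot <- ggplot(toy, aes(x = week, y = count, color = group)) +
  geom_line() +
  labs(title = "Toy counts by week", x = "Week", y = "Count")

print(toy_plot)
```

Because `color = group` sits inside `aes()`, `ggplot2` draws one line per group and adds a legend automatically.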

## **3. Cross-sectional Analysis Visualizations**

In this section, we create various visualizations for the cross-sectional analysis to investigate the stock of certified claimants in Illinois over time, by industry, and by county. As always, our first step is to load R packages and establish the database connection. Note that we include new packages `lubridate` and `sf` to work with date-time data and geographic data, respectively.

In [None]:
#database interaction imports
library(DBI)
library(odbc)

# for data manipulation/visualization
library(tidyverse)
library(lubridate)
library(sf)

# for calculating percentages
library(scales)

Then, let's import the cross-sectional analysis csv files. Here is a reminder of what summary statistics each file contains:
- **`cs_counts.csv`**: weekly certified claimant counts
- **`cs_ind_counts.csv`**: weekly certified claimant counts by industry

We also need county level weekly certified claimant counts. We will show you how to calculate these counts and normalize them with county level labor force in the later section.

<font color=red> Before you run the cell below, make sure you have run through [the cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
#Load the statewide aggregates that we exported in the cross-section notebook

#Import weekly certified claimant counts
cs_counts <- read_csv("U:\\..\\ETA Training\\Results\\cs_counts.csv")

#Import weekly certified claimant counts by industry
cs_ind_counts <- read_csv("U:\\..\\ETA Training\\Results\\cs_ind_counts.csv")

#### **Weekly Certified Claimant Count**
Previously, by looking at the counts in DataFrame `cs_counts`, we identified that the number of Illinois certified claimants spiked in (REDACTED) and peaked during the week ending (REDACTED). However, when presenting your results, a long list of numbers is not an effective way for your audience to understand and absorb the information. In this example, since we want to show the time trend in certified claimant counts, we will use a line plot with week ending date as the x-axis and claimant counts as the y-axis.

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in claimant counts

#Specify source dataset and x and y variables
ggplot(cs_counts, aes(x = week_end_date, y = claimant_count)) + 

#Plots a line on the graph
geom_line()

From the graph, we can see that there was a sharp increase in the number of (REDACTED). After the number reached its peak, it started to decrease. About (REDACTED), the number of certified claimants stopped declining and started to fluctuate at a relatively constant level, which is about (REDACTED) of the (REDACTED). **Note that there is a sharp decrease in the (REDACTED). Due to the biweekly certification schedule in Illinois, the most recent week available in the certified claimants data only contains half of the claimants. We will filter out this (REDACTED) in subsequent graphs.** In this case, we will filter out the week ending (REDACTED).

Although we can see the trends from the above graph quickly and clearly, there are several places we need to improve so that the graph provides the audience some context and delivers information more efficiently.

1. **Graph Title**: The title of the visualization should convey the major takeaway(s) of the plot and should answer the original question. In this case, depending on the information you want to address, you may have different titles. For example, if you want to describe the unprecedented increase in the number of certified claimants, the title could be "IL certified claimants increased dramatically since (REDACTED) and peaked in (REDACTED)". If you want to describe the more recent trend, the title could be "IL certified claimants have decreased but remain higher than pre-pandemic level".
2. **X-axis and Y-axis Labels**: The labels in our current graph are variable names. We need to change them to short and easy-to-understand descriptions, such as "certified claimant counts" and "week".
3. **Data Source**: Providing a clear reference to the source of the underlying data used for the visualization can increase the credibility and enable the reproducibility of your results. The reference can be the data agency or the name of the dataset. In this analysis, we use "IL PROMIS File" as our major data source.

> We can add a `labs()` layer to the visualization to change the title, axes' names, and caption of the plot. We simply use `+` to add this layer to the code we showed earlier. 

    labs(
        title = "YOUR TITLE",
        x = "X-AXIS LABEL",
        y = "Y-AXIS LABEL",
        caption = "DATA SOURCE OR NOTES"
        )

4. **X-axis and Y-axis Tick Marks**: On our current graph, the x-axis only shows months, at three-month intervals. It is hard for the audience to see key time points, such as the week the number of certified claimants started to increase and the week it peaked. Therefore, we want to adjust the x-axis ticks so that they show weeks instead of months and appear more frequently. We use `scale_y_continuous()` for the y-axis to specify a continuous data type and to adjust tick marks and axis extent, and `scale_x_date()` for the x-axis to specify a date data type, adjust the x-axis start and end points, specify the display format, and set the tick mark interval. In both cases, the `breaks=` argument controls the minimum and maximum values of the axis labels as well as the tick mark intervals between them. For `scale_x_date()`, the `labels=` argument specifies how the dates will display. Here, `%b %d` means the abbreviated month name followed by the day of the month.

5. **Key Time Points**: In this analysis, labelling key timepoints, such as when the lockdown in IL started or the week with the peak number of claimants, helps the audience understand the context. However, it is not necessary to add key timepoints to every graph. We can add vertical lines to the graph to represent significant events using the `geom_vline()` statement and control labels for these lines using the `annotate()` statement. Within `geom_vline()` we must specify where to place the line using the `xintercept=` command. In our case, we assign the date of the Illinois stay at home order (3-21-2020) to a variable and then pass this variable to `xintercept=`. When we use the `annotate()` statement, we must specify the location of the labels using the `x=` and `y=` statements. Keep in mind you will likely need to adjust these based on specific details of your graph. In the example below, we use a height of REDACTED, which places the label nearly at the top since the max extent of our y-axis is (REDACTED).
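
To make the `breaks=` and `%b %d` settings described above more concrete, the short sketch below generates biweekly date breaks and formats them outside of any plot. The start and end dates and the names `first_break`, `last_break`, and `week_breaks` are hypothetical, chosen only for illustration; substitute the range that fits your own data.

```r
library(lubridate)

# Hypothetical start and end dates, chosen only for illustration
first_break <- ymd("2020-03-14")
last_break  <- ymd("2020-04-25")

# One break every two weeks, starting at first_break and
# stopping at or before last_break
week_breaks <- seq(first_break, last_break, by = "2 weeks")

# "%b %d" prints the abbreviated month name and the day of the month;
# the month abbreviation depends on your system locale
format(week_breaks, "%b %d")
```

A vector like `week_breaks` is what gets passed to `breaks=` in `scale_x_date()`, and the format string is what `labels=` applies to each break.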

In [None]:
# Code adjusting overall graph attributes

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 12, repr.plot.height = 8)

In [None]:
# First, assigning the stay at home order date and overall peak week to variables.
# We will use these to plot vertical lines on the graph representing these dates

# Illinois stay at home order
stayhome_start <- ymd(("2020-03-21"))

# Overall peak for certified claimants
peak_overall <- ymd((REDACTED))

In [None]:
# Filter out the week ending REDACTED
cs_counts <- filter(cs_counts, week_end_date < ymd((REDACTED)))

In [None]:
# Example ggplot2 syntax to make a visualization of trends in claimant counts
# with titles, labels, adjustments to tick marks on the axes, and vertical lines
# reflecting the timing of key events

# Specify source and x and y variables
cs_counts_plot <- ggplot(cs_counts, aes(x = week_end_date, y = claimant_count)) +

# Adds a line to the graph
geom_line() + 

# Add a red vertical dashed line at the stay at home order date
# with a label at the top explaining what it is
geom_vline(xintercept = stayhome_start,
    color = "red", linetype = "dashed") +
annotate("text", x = stayhome_start, y = REDACTED, color = "red",
    hjust = -0.1, label = "COVID-19\nStay at Home") +

# Add a blue vertical dashed line at the peak week for certified claimants overall
# with a label at the top explaining what it is
geom_vline(xintercept = peak_overall,
    color = "blue", linetype = "dashed") +
annotate("text", x = peak_overall, y =REDACTED, color = "blue",
    hjust = -0.1, label = "Peak") +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(labels = scales::comma,
    breaks = seq(0,REDACTED, REDACTED),
    limits = c(0,REDACTED)) + 

#Adjust the x scale to specify date format, assign start and end points,
#and set the interval for tick marks
scale_x_date(
    breaks = seq(ymd("2020-03-14"), ymd(REDACTED), by='2 weeks'),
    labels = date_format(format="%b %d")) + 

#Add a title, labels for the x and y axes, and data source
labs(title = "IL certified claimants increased dramatically since (REDACTED) and peaked in (REDACTED)",
    x = "Week", y = "Certified claimant counts",
    caption = "Data Source: IL PROMIS file") +

#Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5))

#Display the graph that we just created
print(cs_counts_plot)

#### **Checkpoint 1: Create A Line Plot**
Import the regional analysis csv files you saved in the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb). Then create a line plot to show the trend of certified claimant counts in your region of interest. Remember that your plot should include an informative title, axes' labels, data source, and key timepoints. Adjust axes' tick marks and graph sizes to make it easy to read.  

Comment on the trend of certified claimant counts in your region of interest. During which week did the number of certified claimants peak in your region of interest?

Recall that we saved the following files containing the results of your regional selection:
- **`cs_reg_counts.csv`**: weekly certified claimant counts for your selected region
- **`cs_reg_sub_counts.csv`**: weekly certified claimant counts by your selected dimension and groupings

<font color=red> Before you run the cell below, make sure you have finished all checkpoints in [the cross-section analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# Load weekly certified claimants for your regional selection
cs_reg_counts <- read_csv("U:\\..\\ETA Training\\Results\\cs_reg_counts.csv")

# Load weekly certified claimants for your regional selection by your chosen dimension
cs_reg_sub_counts <- read_csv("U:\\..\\ETA Training\\Results\\cs_reg_sub_counts.csv")

In [None]:
#Start with a simple line plot and observe the trend, axes' limits, and other information you may need
#Use this information to improve your graph in the next step

#Specify source dataset and x and y variables
ggplot(___, aes(x =___, y =___)) + 

#Plots a line on the graph
geom_line()

In [None]:
#Note that your regional selection may have a different peak week than the State overall
#Set that here if you want to show a vertical line for the peak
peak <- ymd("___")

#Filter out your dataset to exclude the week ending (REDACTED)
cs_reg_counts <- filter(cs_reg_counts, week_end_date != ymd((REDACTED)))

In [None]:
#Create a line graph for your regional subset
#Replace each ___ with values relevant to your regional selection

#Replace ___ with the x-axis variable and the y-axis variable
cs_reg_counts_plot <- ggplot(cs_reg_counts, aes(x = ___, y = ___)) +

#Adds a line to the graph
geom_line() + 

#Add a red vertical dashed line at the stay at home order date
#with a label at the top explaining what it is
#Replace ___ with the y value where the label should appear
geom_vline(xintercept = stayhome_start,
    color = "red", linetype = "dashed") +
annotate("text", x = stayhome_start, y = ___, color = "red",
    hjust = -0.1, label = "COVID-19\nStay at Home") +

# Add a blue vertical dashed line at the peak week for your regional selection
# with a label at the top explaining what it is
geom_vline(xintercept = peak,
    color = "blue", linetype = "dashed") +
annotate("text", x = peak, y = ___, color = "blue",
    hjust = -0.1, label = "Peak") +

#Adjust the y scale to set start and end points as well
#as the interval for tick marks
#In the "breaks=" statement, replace ___ with the maximum y-axis label value and the tick mark interval
#In the "limits=" statement, replace ___ with the range of values we want to display on the y-axis
scale_y_continuous(labels = scales::comma,
    breaks = seq(0, ___, ___),
    limits = c(0, ___)) + 

#Adjust the x scale to specify date format, assign start and end points,
#and set the interval for tick marks
scale_x_date(
    breaks = seq(ymd((REDACTED)), ymd((REDACTED)), by='2 weeks'),
    labels = date_format(format="%b %d")) + 

#Add a title and labels for the x and y axes
#Replace ___ with your title, x-axis label, y-axis label, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___") +

#Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5))

#Display the graph that we just created
print(cs_reg_counts_plot)

#### **Top Five Industries during the Peak Week**

During the peak week (REDACTED) we saw in the previous graph, which industries had the most certified claimants? To answer this question, we can use a bar plot to compare relative and absolute claimant counts (a numerical variable) of different industries (a categorical variable). Recall that we saved claimant counts by industry in the DataFrame `cs_ind_counts`. Let's get the data during the week ending REDACTED and sort it in descending order.

In [None]:
#Find the top 5 industries as of (REDACTED)
#Filter by week_end_date, sort descending by claimant_count, and keep the top 5 records

cs_ind_counts_peak <- cs_ind_counts %>%
    filter(week_end_date == ymd((REDACTED))) %>%
    arrange(desc(claimant_count)) %>%
    head(5)

head(cs_ind_counts_peak)

In this project, we will create several graphs to show variations across industries. To make sure the color scheme is consistent in these graphs, we need to **assign a color palette to the industry category**.  

In [None]:
#Assign a custom color palette to use with the bar graphs
palette_color <- c((REDACTED) = "orange",
                   (REDACTED) = "blue",
                   (REDACTED) = "red",
                   (REDACTED) = "purple",
                   (REDACTED) = "green3")

Now we can create the bar plot to show peak week certified claimants in each industry. The code structure is similar to the code we used to create the line plot. But instead of `geom_line()`, we use `geom_col()`. The `x` and `y` variables we include in `aes()` are `claimant_count` and `naics_maj_desc_rv`. This way, we get a horizontal bar chart with the x-axis showing the counts and the y-axis showing the descriptions of industries. If you want to create a vertical bar chart, you just need to switch the `x` and `y` variables. In the `scale_fill_manual()` statement, we use `values=` to specify how we want each series to display by passing the `palette_color` variable that we assigned above. `palette_color` relates potential values of `naics_maj_desc_rv` to the display color for each. This statement can accept common English names for many colors as well as hexadecimal color codes.
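
The sketch below illustrates the same pattern on made-up data: a named vector used as a palette, and `reorder()` used to sort the bars by their values. The data frame `toy_ind`, its categories, and `toy_palette` are hypothetical and stand in for the industry data.

```r
library(ggplot2)

# Hypothetical categories and counts, for illustration only
toy_ind <- data.frame(
  industry = c("Retail", "Health", "Food"),
  count    = c(30, 10, 20)
)

# A named vector ties each category value to a display color
toy_palette <- c("Retail" = "orange", "Health" = "blue", "Food" = "red")

# reorder() sorts the industry levels by count, so the bars appear in
# ranked order with the largest at the top of the horizontal chart
toy_bar <- ggplot(toy_ind, aes(x = count, y = reorder(industry, count), fill = industry)) +
  geom_col() +
  scale_fill_manual("", values = toy_palette, guide = "none") +
  labs(x = "Count", y = "Industry")

print(toy_bar)
```

Because the palette is keyed by category name rather than position, the same vector can be reused in later plots and each category keeps its color.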

In [None]:
#Create a bar chart showing the 5 industries with the most claimants as of the week ending (REDACTED)

#Specify source dataset and x and y variables
cs_ind_counts_peak_plot <- ggplot(cs_ind_counts_peak, aes(x = claimant_count, 
                                                        y = reorder(naics_maj_desc_rv,claimant_count),
                                                        fill=naics_maj_desc_rv)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = palette_color, guide = "none") +

#Adjust the x scale to set the interval for tick marks
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, REDACTED, REDACTED),
    limits = c(0, REDACTED)) +

#Add titles and axis labels
labs(title = "Certified claimants in Illinois by industry the week ending (REDACTED)",
     subtitle = "Top 5 Industries at the peak week",
     x = "Certified claimant counts", y = "Industry",
     caption = "Data Source: IL PROMIS file")

#Display the graph we just made
print(cs_ind_counts_peak_plot)

We can see that the five industries with the most certified claimants during the week ending (REDACTED) are (REDACTED), (REDACTED), (REDACTED), (REDACTED), and (REDACTED). How do claimant counts in these industries change over time? Do they have the same trends as the state-level trend, i.e., dramatically increased at the beginning and then slowly decreased? Are there any differences across industries? To answer these questions, we can use a line plot to show time trends of the top five industries.

In [None]:
#Subset your data to the top 5 industries
#and also filter out the most recent benefit week, which ends (REDACTED).
cs_ind_counts_top5 <- cs_ind_counts %>%
    filter(naics_maj_code_rv %in% c((REDACTED))) %>%
    filter(week_end_date != ymd((REDACTED)))

In [None]:
#Create line graph showing trends in the top 5 industries over time

#Specify source and x and y variables
cs_ind_counts_top5_plot <- ggplot(cs_ind_counts_top5, aes(x = week_end_date, y = claimant_count, 
                                                        color=naics_maj_desc_rv)) +

#Adds a line to the graph
geom_line() + 

#Add a red vertical dashed line at the stay at home order date
#with a label at the top explaining what it is
geom_vline(xintercept = stayhome_start,
    color = "red", linetype = "dashed") +
annotate("text", x = stayhome_start, y = REDACTED, color = "red",
    hjust = -0.1, label = "COVID-19\nStay at Home") +

#Add a blue vertical dashed line at the peak date of certified claimants
#with a label at the top explaining what it is
geom_vline(xintercept = peak_overall, 
    color = "blue", linetype = "dashed") +
annotate("text", x = peak_overall, y = REDACTED, color = "blue",
    hjust = -0.1, label = "Peak") +

#Adjust the y scale to assign start and end points as well
#as the interval for tick marks
scale_y_continuous(labels = scales::comma,
    breaks = seq(0, REDACTED, REDACTED),
    limits = c(0, REDACTED)) + 

#Adjust the x scale to specify date format, assign start and end points,
#and set the interval for tick marks
scale_x_date(
    breaks = seq(ymd((REDACTED)), ymd((REDACTED)), by='2 weeks'),
    labels = date_format(format="%b %d")) + 

#Apply the color palette
scale_color_manual("", values = palette_color) +

#Add a title and labels for the x and y axes
labs(title = "Certified claimants by benefit week in Illinois",
     subtitle = "Top 5 Industries at the peak week",
    x = "Week", y = "Certified claimant counts",
    caption = "Data Source: IL PROMIS file") +

#Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5))

#Display the plot we just made
print(cs_ind_counts_top5_plot)

We can see that there is significant variation in trends by industry and when comparing a single industry to all certified claimants. In an earlier step, we graphed counts of certified claimants in Illinois for all industries. We found that the overall peak was the week ending (REDACTED). By the week ending (REDACTED), (REDACTED), the count of certified claimants had fallen by (REDACTED). Looking by industry, we first see that not every industry has the same peak week. For some industries, like (REDACTED), the peak did not occur until well into (REDACTED). While each industry generally increased after COVID and decreased in the year following, the details show substantial variation. While (REDACTED), (REDACTED), and (REDACTED) similarly fell by roughly half between (REDACTED) and (REDACTED), (REDACTED) fell by closer to (REDACTED) over this period. (REDACTED) has a notable trend over the year compared to the others because it flattened over (REDACTED) while the other industries and certified claimants overall were generally decreasing over that time.

#### **Checkpoint 2: Create a Bar Chart**
Now use the subgroup certified claimant counts you saved in the [cross-sectional analysis notebook](./1.Data_Exploration_Cross-section_Analysis.ipynb) to create two visualizations. First, identify the categories with the most certified claimants in your region of interest during the peak week with a bar chart. Second, show the time trends of certified claimant counts in these categories with a line plot. Remember to create your color palette first.

> We have also calculated average weekly total pay in the cross-sectional analysis notebook. You can use the same methods we provide in this section to create graphs for average weekly total pay or other outcomes you are interested in.

In [None]:
#Find the 5 categories with the highest count of claimants as of the week ending REDACTED
cs_reg_sub_counts_peak <- cs_reg_sub_counts %>%
    filter(week_end_date == ymd((REDACTED))) %>%
    arrange(desc(claimant_count)) %>%
    head(5)

head(cs_reg_sub_counts_peak)

In [None]:
#Create a color palette for your categories. Expand or shorten the list based on your needs
___ <- c("___" = "orange",
                   "___" = "blue",
                   "___" = "red",
                   "___" = "purple",
                   "___" = "green3")

In [None]:
#Make your bar graph

#Specify source dataset and x and y variables
#Replace ___ with your dimension variable
___ <- ggplot(cs_reg_sub_counts_peak, aes(x = claimant_count, y = ___, fill=___)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = ___, guide = "none") +

#Adjust the x-axis scale to set minimum and maximum values
#and the interval for tick marks
#In the "breaks=" statement, replace ___ with the maximum x-axis label value and the tick mark interval
#In the "limits=" statement, replace ___ with the x-axis maximum value.
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, ___, ___),
    limits = c(0, ___)) +

#Add titles and axis labels
#Replace ___ with your title, x-axis label , y-axis label, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___")

#Display the plot we just made
print(___)

#### **Certified Claimants by County of Residence** 

We have seen the differences in certified claimants across industries in the previous section. Are there any variations across other dimensions, such as geographical areas? The IL PROMIS file contains claimants' county of residence, `county_fips_code`. We can use it to calculate county level weekly certified claimant counts. Since we have not calculated this variable previously, we need to read in the claimant data and create the measure first.

In [None]:
#SQL statement to select ssn_id, week_end_date, and county_fips_code
#from the certified claimants data

# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

# Store SQL query to a variable
query <- "
SELECT ssn_id,
    week_end_date,
    county_fips_code
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND week_end_date >= '2020-03-07';
"

# Execute query
df_claimants_county <-dbGetQuery(con,query)

# Dates pulled from the database arrive as character strings, so convert them with ymd()
df_claimants_county <- df_claimants_county %>%
    mutate(week_end_date=ymd(week_end_date))

# See top records in the dataframe
head(df_claimants_county)

# Close the database connection
dbDisconnect(con)

In [None]:
#Calculate certified claimant counts by county
cs_county <- df_claimants_county %>% 
    group_by(week_end_date, county_fips_code) %>%
    summarize(claimant_count=n())

#Filter the data to the week ending (REDACTED)
cs_county_peak <- cs_county %>%
    filter(week_end_date == ymd((REDACTED))) %>%
    arrange(desc(claimant_count))

#Show the counties with the most certified claimants as of the peak
head(cs_county_peak)

Should we draw any conclusions based on county level claimant counts? Probably not. From the table, we know that (REDACTED) has the highest number of certified claimants. However, (REDACTED) also has the highest population density among all the counties in IL. If we just use the absolute counts, (REDACTED) will always seem to have the worst labor market condition. However, some counties may have been disproportionately affected during the COVID-19 recession, especially those with an industry mix that favors heavily impacted industries.

Therefore, we need to normalize certified claimant counts by a workforce measure so that it shows us the proportion of workers who received UI benefits during a specific week. Here, we will divide claimant counts by **labor force**, which is available in `labor_by_county.csv` in the shared folder. The labor force measure comes from the Bureau of Labor Statistics (BLS). At the time of writing this notebook, the most recent data we can get from the BLS website is 2019 data. We will call the result the **claimant rate** to distinguish it from the unemployment rate published by BLS. **Labor force is the sum of unemployed workers and employed workers**.

> BLS defines **unemployed workers** as "all persons who had no employment during the reference week, were available for work, except for temporary illness, and had made specific efforts to find employment some time during the 4 week-period ending with the reference week". 
> BLS defines **employed workers** as "all persons who, during the reference week, (a) did any work as paid employees, worked in their own business or profession or on their own farm, or worked 15 hours or more as unpaid workers in an enterprise operated by a member of their family, or (b) were not working but who had jobs from which they were temporarily absent because of vacation, illness, bad weather, childcare problems, maternity or paternity leave, labor management dispute, job training, or other family or personal reasons, whether or not they were paid for the time off or were seeking other jobs".[<sup>2</sup>](#fn2)  <a id = "9"> </a>
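
As a rough sketch of the normalization, the cell below computes a claimant rate for three made-up counties. The tables `toy_laus` and `toy_claims` and all of their values are hypothetical. The example assumes, as in this notebook, that claimant counts come from a 1% sample, so the labor force is divided by 100 before taking the ratio.

```r
library(dplyr)

# Hypothetical labor force counts for three made-up counties
toy_laus <- tibble(
  county_fips_code = c("001", "003", "005"),
  labor_force_2019 = c(50000, 20000, 10000)
)

# Hypothetical 1% sample claimant counts; county "005" has no sampled claimants
toy_claims <- tibble(
  county_fips_code = c("001", "003"),
  claimant_count   = c(40, 30)
)

# Keep every county, scale the labor force to the 1% sample, and
# treat counties with no sampled claimants as a rate of 0
toy_rate <- left_join(toy_laus, toy_claims, by = "county_fips_code") %>%
  mutate(claimant_rate = coalesce(claimant_count / (labor_force_2019 / 100), 0))

toy_rate
```

In this toy data, county "001" has more claimants than county "003" (40 versus 30) but a lower claimant rate (0.08 versus 0.15), which is the kind of reversal that raw counts hide. The rate here is a proportion; multiply by 100 to express it as a percentage.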

In [None]:
#Import Local Area Unemployment Statistics labor force counts,
#limit to most recent year, which is currently 2019.
#Keep the FIPS code and the labor force count, which we rename to simplify handling later on.

laus <- read_csv("P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\labor_by_county.csv", col_names = TRUE, col_types = "cciiiid") %>%
    filter(YEAR == 2019) %>%
    select(county_fips_code = FIPS, labor_force_2019 = `LABOR_FORCE`)

#Join claimant counts by county at 5-9-2020 to the 2019 labor force
#Divide counts by labor force to get the claimant rate per county
#Since we are using a 1% sample, we first divide the labor force by 100 to adjust.
#Coalesce replaces any county without any claimants in the sample with 0.
#Otherwise these would just get values of NA and not display on the map.

cs_county_peak_cr <- left_join(laus, cs_county_peak, by = c("county_fips_code")) %>%
    mutate(claimant_rate = coalesce(claimant_count/(labor_force_2019/100),0))
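The adjustment above can be checked by hand on a small example. The sketch below uses made-up counties and counts (`toy_laus` and `toy_counts` are hypothetical, not the Illinois data) and assumes, as in the cell above, that `claimant_count` comes from a 1% sample. Note that the result is a proportion, so 0.08 corresponds to 8% of the labor force.

```r
library(dplyr)

# Hypothetical labor force counts for three counties
toy_laus <- data.frame(
    county_fips_code = c("001", "003", "005"),
    labor_force_2019 = c(50000, 20000, 8000)
)

# Hypothetical 1% sample counts; county "005" has no claimants in the sample
toy_counts <- data.frame(
    county_fips_code = c("001", "003"),
    claimant_count = c(40, 10)
)

# Same pattern as above: scale the labor force to the 1% sample,
# then replace the NA for the county with no sampled claimants with 0
toy_rates <- left_join(toy_laus, toy_counts, by = "county_fips_code") %>%
    mutate(claimant_rate = coalesce(claimant_count / (labor_force_2019 / 100), 0))

print(toy_rates)
```

For county "001", 40 sampled claimants divided by 500 (1% of 50,000) gives 0.08. County "005" would have been `NA` without `coalesce()`.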

To create a map in R, we also need geographic information, such as longitude and latitude. We can get this information by importing a shapefile of counties into R using the `read_sf()` function. A shapefile is a common format for spatial vector data. The file we will be using is sourced from the TIGER/Line shapefile series that is maintained by the Census Bureau.

In [None]:
#Load TIGER/line county shapefile
data_poly <- read_sf(dsn="P:\\tr-dol-eta\\ETA Class Notebooks\\geom", layer="tl_2019_us_county") %>%
    filter(STATEFP == '17') %>% #Limit to Illinois
    mutate(lat=as.numeric(INTPTLAT), long=as.numeric(INTPTLON)) %>% #Convert X and Y coordinates from character to numeric
    rename(county_fips_code = COUNTYFP, name = NAME) %>% #Rename some fields
    select(county_fips_code, name, lat, long, geometry) #Select a subset of columns

#Join geographies to the claimant rate dataframe
#Since no by= is given, left_join() matches on the columns the two data frames share,
#which should be county_fips_code (renamed above); dplyr prints the columns it used
cs_county_peak_cr_geo <-
    left_join(cs_county_peak_cr, data_poly)
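A left join silently keeps rows that find no match, so it can be worth checking the keys before mapping. The sketch below uses `anti_join()` on two made-up data frames (`toy_rates` and `toy_geo` are hypothetical) to list keys that appear on one side only. A mismatch in key format, such as a 3-digit versus a 5-digit FIPS code, would show up here as every row being unmatched.

```r
library(dplyr)

# Hypothetical claimant rates and geographies with one unmatched key on each side
toy_rates <- data.frame(
    county_fips_code = c("001", "003", "005"),
    claimant_rate = c(0.08, 0.05, 0)
)
toy_geo <- data.frame(
    county_fips_code = c("001", "003", "007"),
    name = c("County A", "County B", "County D")
)

# Rows in the rates table with no matching geography (would not be drawn on the map)
unmatched_rates <- anti_join(toy_rates, toy_geo, by = "county_fips_code")

# Geographies with no matching rate
unmatched_geo <- anti_join(toy_geo, toy_rates, by = "county_fips_code")

print(unmatched_rates)
print(unmatched_geo)
```

If both results have zero rows, every county matched in both directions.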

Now we can use `geom_sf()` to create a county-level map that shows how the claimant rate varies across counties during the week ending REDACTED.

In [None]:
# First, let's adjust the plot attributes so they are more appropriate for maps
# In this case, we want our plot to be taller than it is wide for a map

# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 8, repr.plot.height = 12)

In [None]:
# Map the claimant rate for each county as of (REDACTED)

# Specify the source dataset, geometry column, and fill column
cs_peak_county_map <- ggplot(cs_county_peak_cr_geo, aes(geometry = geometry, fill = claimant_rate)) +  

# Plot the map
geom_sf() +

# Specifying the coordinate system improves the appearance of the projection
coord_sf(crs=4269) +

# Apply county label names
geom_text(aes(x = long, y = lat, label = name),
               size=2.5, color = "black") +

# Define gradient for fill
scale_fill_gradient(low='white',high='brown') +

# Apply classic theme, which includes some nice visual defaults
theme_classic() +

# Apply map labels
labs(x = "", y = "", color = "", fill = "",
    title = "Claimant rates varied substantially between Illinois counties \nin the week ending (REDACTED)",
    caption = "Data Source: IL PROMIS file") +

# This code removes some visual elements such as x- and y-axis lines that are
# not desirable for maps
theme(panel.grid = element_blank(),
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    title = element_text(size=15))

# Print the map we just created
print(cs_peak_county_map)

While the variation in claimant rates between Illinois counties in the week of (REDACTED) is complex, some broad trends emerge. More populous urban counties, which tend to have a higher share of service jobs affected by the stay-at-home order, have claimant rates on the higher end of the range. This is true in the (REDACTED). We also see some higher claimant rates in REDACTED in counties that generally have REDACTED median incomes and REDACTED labor forces. 
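If you want to practice the `geom_sf()` pattern without the Illinois files, the sketch below applies the same steps to the North Carolina county shapefile that ships with the `sf` package. This assumes `sf` and `ggplot2` are installed. The column `rate_per_1000` is a hypothetical rate computed only for illustration and has nothing to do with UI claimants.

```r
library(sf)
library(ggplot2)

# North Carolina county polygons bundled with the sf package (not the Illinois data)
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

# Hypothetical rate for illustration: SID74 cases per 1,000 births (BIR74)
nc$rate_per_1000 <- 1000 * nc$SID74 / nc$BIR74

# Same structure as the claimant rate map: polygons filled by a rate,
# a white-to-brown gradient, and axis elements removed
demo_map <- ggplot(nc) +
    geom_sf(aes(fill = rate_per_1000)) +
    scale_fill_gradient(low = "white", high = "brown") +
    theme_classic() +
    labs(x = "", y = "", fill = "Rate per 1,000",
        caption = "Data Source: nc.shp from the sf package") +
    theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

print(demo_map)
```

Because `nc` is an `sf` object, `geom_sf()` finds the geometry column on its own, so `aes(geometry = geometry)` is only needed when the data frame is a plain table that happens to carry a geometry column, as in the joined data frame above.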

## **4. Cohort Analysis Visualization**

In this section, we will create visualizations for the cohort analysis. Recall that in the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb), we have analyzed claimants who entered UI programs during the week ending March 28 and the week ending April 4 and have combined them as one cohort, **the COVID-19 cohort**. We have investigated their exit rates after program entry and how their exit rates vary by industry.

Again, let's import the summary statistics first.

- **`cs_exits.csv`**: exits by week, statewide
- **`cs_ind_exits.csv`**: exits by industry by week, statewide

<font color=red> Before you run the cell below, make sure you have run through [the cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# Load exits by week, statewide
cs_exits <- read_csv("U:\\..\\ETA Training\\Results\\cs_exits.csv")

# Load exits by industry by week, statewide
cs_ind_exits <- read_csv("U:\\..\\ETA Training\\Results\\cs_ind_exits.csv")

#### **Attrition Curve of the COVID-19 Cohort** 

`cs_exits` and `cs_ind_exits` are both time-series data. Therefore, we will use line plots to show exit rates over time in this section.

In [None]:
# Before plotting the attrition graphs, reset the plot configuration to undo the changes
# we just made for the map
options(repr.plot.width = 12, repr.plot.height = 8)

In [None]:
# Specify source and x and y variables
cs_exits_plot <- ggplot(cs_exits, aes(x = week_number, y = stay_pct)) +

# Adds a line to the graph
geom_line() + 

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) + 

# Adjust the x scale, assign start and end points,
# and set the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 30, 1),
    limits = c(1, 30)) + 

# Add a title and labels for the x and y axes
labs(title = "(REDACTED) of claimants in the COVID-19 cohort exited by (REDACTED), Illinois",
    x = "Week", y = "Share",
    caption = "Data Source: IL PROMIS file") +

# Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust = .5),
        title = element_text(size=18))

# Display the plot you just made
print(cs_exits_plot)

We see that the number of people remaining in the cohort decreases over the life of the cohort, though the rate at which people exit slows down over time. For example, roughly (REDACTED) of the cohort has exited by week eight. Looking at the end of the cohort, however, only roughly (REDACTED) of claimants who persist to (REDACTED) have exited by (REDACTED). After (REDACTED), there is a REDACTED. We expect this because regular UI benefits are only authorized for 26 weeks in Illinois, though claimants may draw them down at a slower rate by earning part-time wages, which is why some still remain after this point. Note the biweekly pattern: the biggest changes in counts occur every other week. This is related to Illinois's biweekly certification schedule: claimants tend to certify for both weeks, or neither.
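One way to make the slowdown and the every-other-week pattern explicit is to compute the week-over-week drop in the share remaining. The sketch below uses made-up counts (`toy_exits` is hypothetical) and assumes that `stay_pct` is the share of the week 1 cohort still claiming. Check this against how `stay_pct` was built in the cohort analysis notebook before applying it to `cs_exits`.

```r
library(dplyr)

# Hypothetical weekly counts of claimants remaining in a cohort
toy_exits <- data.frame(
    week_number = 1:6,
    claimant_count = c(1000, 980, 900, 890, 840, 835)
)

toy_exits <- toy_exits %>%
    arrange(week_number) %>%
    mutate(
        # Share of the week 1 cohort still claiming (assumed definition)
        stay_pct = claimant_count / dplyr::first(claimant_count),
        # Drop in the share remaining since the previous week
        weekly_drop = dplyr::lag(stay_pct) - stay_pct
    )

print(toy_exits)
```

In this toy data the larger drops (0.08 and 0.05) fall on every other week, which is the kind of biweekly pattern described above.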

#### **Checkpoint 3: Create the Attrition Curve for Your Regional Cohort**

Now import the csv files you saved in the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb). Then create a line plot to show exit rates of the COVID-19 cohort in your region of interest. Does it show a similar or different trend compared with the state-level trend shown in the example?

- **`cs_reg_exits.csv`**: exits by week for your selected region
- **`cs_reg_sub_exits.csv`**: exits by week for your selected dimension and groupings

<font color=red> Before you run the cell below, make sure you have finished the checkpoints in [the cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb) and have saved the csv files in your "U:\\..\\ETA Training\\Results" directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# Load exits by week for your selected region
cs_reg_exits <- read_csv("U:\\..\\ETA Training\\Results\\cs_reg_exits.csv")

# Load exits by week for your selected dimension and groupings
cs_reg_sub_exits <- read_csv("U:\\..\\ETA Training\\Results\\cs_reg_sub_exits.csv")

In [None]:
# Plot an attrition curve for your regional selection

# Replace ___ to specify x and y variables
cs_reg_exits_plot <- ggplot(cs_reg_exits, aes(x = ___, y = ___)) +

# Adds a line to the graph
geom_line() + 

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) + 

# Adjust the x scale, assign start and end points,
# and set the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 30, 1),
    limits = c(1, 30)) + 

# Add a title, labels for the x and y axes, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___") +

# Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5))

# Display the plot you just made
print(cs_reg_exits_plot)

#### **Attrition Curves of Top Five Industries**

We mentioned in the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb) that Illinois has a five-phase plan to reopen its economy. Based on regional health metrics and hospital capacities, businesses in some industries could open earlier than others and may open at different capacities. This plan could have different impacts on workers from different industries. In this section, we will identify the five industries with the most COVID-19 cohort claimants in week 1 and use a line plot to examine exit rates across the five industries.

In [None]:
# Find the 5 industries with the most claimants at week 1
cs_ind_exits %>% 
    filter(week_number == 1) %>% 
    arrange(desc(claimant_count)) %>% 
    head(5)
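Instead of reading the codes off the table and typing them into `filter()`, the top five can also be pulled into a vector and reused. The sketch below shows this on made-up data: `toy_ind` and its codes and counts are hypothetical, and only the column names follow `cs_ind_exits`. It uses `slice_max()` and `pull()` from `dplyr`.

```r
library(dplyr)

# Hypothetical industry counts for two weeks and seven industry codes
toy_ind <- data.frame(
    week_number = rep(1:2, each = 7),
    naics_maj_code_rv = rep(c("11", "23", "31", "44", "62", "72", "81"), times = 2),
    claimant_count = c(5, 40, 60, 80, 70, 120, 30,
                       4, 35, 50, 70, 65, 110, 25)
)

# Codes of the five industries with the most claimants at week 1
top5_codes <- toy_ind %>%
    filter(week_number == 1) %>%
    slice_max(claimant_count, n = 5, with_ties = FALSE) %>%
    pull(naics_maj_code_rv)

# Keep all weeks for those five industries
toy_ind_top5 <- filter(toy_ind, naics_maj_code_rv %in% top5_codes)

print(top5_codes)
```

Setting `with_ties = FALSE` guarantees exactly five codes even if two industries have the same count at the cutoff.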

In [None]:
# Plot attrition curves for the top 5 industries

# Subset the top5 industries
cs_ind_exits_top5 <- filter(cs_ind_exits, naics_maj_code_rv %in% c((REDACTED)))


# Specify source and x and y variables
cs_ind_exits_top5_plot <- ggplot(cs_ind_exits_top5, aes(x = week_number, y = stay_pct, color=naics_maj_desc_rv)) +

# Adds a line to the graph
geom_line() + 

# Apply the color palette
scale_color_manual("", values = palette_color) +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) + 

# Adjust the x scale, assign start and end points,
# and set the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 30, 1),
    limits = c(1, 30)) + 

# Add a title and labels for the x and y axes
labs(title = "(REDACTED) \nhad a relatively slow exit rate in the COVID-19 cohort",
     x = "Week", y = "Share",
     caption = "Data Source: IL PROMIS file") +

# Rotate the x-axis labels 90 degrees, shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(angle = 90, vjust=.5),
      title = element_text(size=15))

# Display the plot you just made
print(cs_ind_exits_top5_plot)

Comparing attrition by industry reveals significant differences across industries and in comparison with all industries. While (REDACTED) of the cohort overall had exited by (REDACTED), (REDACTED) had (REDACTED) people exit, and this slower rate of exit is apparent throughout the life of the cohort. At (REDACTED), (REDACTED) of the cohort in these two industries remained, compared to (REDACTED) of the cohort overall.

#### **Checkpoint 4: Create the Attrition Curve for Your Subgroup of Interest**
Now identify the categories with the most COVID-19 cohort claimants in your region of interest during week 1. Then create a line plot to show how exit rates vary across your subgroup of interest. Are there any differences in exit rates across your subgroup of interest? Are they similar to or different from the trend you observed in Checkpoint 3?

In [None]:
# Find the 5 categories with the most claimants at week 1,
# replace ___ with correct data frame and variable name
filter(___, week_number == 1) %>% 
    arrange(desc(___)) %>% 
    head(5)

# Subset your data frame to only the top5 categories
# Replace ___ with the dataset name, the variable name, and the 5 categories.
___ <- filter(___, ___ %in% c(___,___,___,___,___))

In [None]:
# Plot an attrition curve for your regional subgroup selection

# Replace ___ to specify target, source, x, and y variables
___ <- ggplot(___, aes(x = ___, y = ___, color = ___)) +

# Adds a line to the graph
geom_line() + 

# Apply the color palette
scale_color_manual("", values = ___) +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) + 

# Adjust the x scale, assign start and end points,
# and set the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 30, 1),
    limits = c(1, 30)) + 

# Add a title, labels for the x and y axes, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___") +

# Rotate the x-axis labels 90 degrees
theme(axis.text.x = element_text(angle = 90))

# Display the plot you just made
print(___)

## **5. Save Visualizations**
Congratulations on finishing the visualization notebook! The final step is to save the visualizations you created. In order to save our graphs, we must set the output location and dimensions using `png()`, display the graph with `print()`, and then use `dev.off()` to close the output. **Note that if you need to export your visualizations from the ADRF, you will need to provide the underlying counts of your visualizations.** We will discuss more details in the disclosure review notebook.

<font color=red> Note that you need to change the directory in the png() statements below. Replace ".." with your username.</font>

In [None]:
# Example code to save our visualizations

# A graph
png("U:\\..\\ETA Training\\Output\\cs_counts_plot.png", width=12, height=8, units="in", res=150)
print(cs_counts_plot)
dev.off()

# A map using different dimensions
png("U:\\..\\ETA Training\\Output\\cs_peak_county_map.png", width=8, height=10, units="in", res=150)
print(cs_peak_county_map)
dev.off()
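When there are several graphs to save, the `png()`, `print()`, `dev.off()` sequence can be wrapped in a loop over a named list of plots. The sketch below is one possible way to do this. It writes to a temporary directory and uses a made-up plot (`toy_plot`), so replace `out_dir` and the list contents with your own output folder and plots.

```r
library(ggplot2)

# Hypothetical plot standing in for the graphs created in this notebook
toy_plot <- ggplot(data.frame(x = 1:3, y = c(1, 0.8, 0.7)), aes(x = x, y = y)) +
    geom_line()

# Named list: each name becomes the file name
plots <- list(toy_line = toy_plot)

# Temporary folder for illustration; point this at your own Output directory
out_dir <- tempdir()

saved_files <- character(0)
for (plot_name in names(plots)) {
    out_file <- file.path(out_dir, paste0(plot_name, ".png"))
    png(out_file, width = 12, height = 8, units = "in", res = 150)
    print(plots[[plot_name]])
    dev.off()
    saved_files <- c(saved_files, out_file)
}

print(saved_files)
```

`ggsave()` from `ggplot2` is another option for saving a single ggplot object. The explicit `png()` and `dev.off()` calls are kept here to match the example above.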

#### **Checkpoint 5: Save Your Visualizations**

Save the graphs you created in Checkpoints 1-4.

In [None]:
# Save your visualizations

# Replace ___ in the example below with relevant values for your visualizations
# Repeat as many times as necessary
png("U:\\..\\ETA Training\\Output\\___.png", width=___, height=___, units="in", res=150)
print(___)
dev.off()

### **Footnotes:**
<span id="fn1"> 1. <a href='https://www.bls.gov/lau/laumthd.htm'>BLS Local Area Unemployment Statistics Estimation Methodology</a> </span>   
[[Go back]](#8)

<span id="fn2"> 2. <a href='https://www.bls.gov/cps/eetech_methods.pdf'>BLS Local Area Unemployment Statistics Concepts and Definitions</a> </span>  
[[Go back]](#9)

> Note that the above links don't work inside of the ADRF since you don't have internet access.

> Click [Go back] to go back to where you were.