<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

<center> Julia Lane, Clayton Hunter, Brian Kim, Benjamin Feder, Ekaterina Levitskaya, Tian Lou, Lisa Osorio-Copete. 
</center>

## Data Visualization in R

#### Introduction

In this notebook you will learn how to use different visualization methods in R to explore and analyze your data. We will also discuss how to annotate graphs properly so that they communicate your results clearly and accurately.

We will cover the following methods:
- **Histogram** 
(visualizing distributions, continuous variables)
- **Bar plot**
(visualizing relationships between numerical and categorical variables)
- **Small multiples**
(using a series of mini-graphs to compare information by different groups)
- **Heatmap** 
(adding highlights to your data with color-coding)
- **Geographic heatmap**
(showing regional differences in your data)

For all visualizations we are going to use an R package called `ggplot2` (included in the `tidyverse` suite of packages). The syntax of `ggplot2` stays the same in most cases:

- you always start with `ggplot()` <br>
- then, supply a dataset and an aesthetic mapping (x and/or y variables), like this: `ggplot(dataset, aes(x = ..., y = ...))` <br>
- and then you can add layers using `+` <br>
for example, <br>
`ggplot(dataset, aes(x = ...)) + geom_histogram()` to create a histogram, or <br>
`ggplot(dataset, aes(x = ...)) + geom_histogram() + ggtitle('My plot title')` to create a histogram and add a title for the graph, and so on.
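
As a minimal sketch of this layering pattern, the example below uses R's built-in `mtcars` dataset to build a histogram and then add a title as a second layer. The dataset is an assumption made only for illustration and is not part of the TANF data.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the cohort data here (assumption)
p <- ggplot(mtcars, aes(x = mpg)) +                  # dataset + aesthetic mapping (a histogram needs only x)
    geom_histogram(bins = 10) +                      # layer: the histogram itself
    ggtitle('Distribution of miles per gallon')      # layer: a title

p   # printing the object draws the plot
```

Storing the plot in an object such as `p` is optional, but it makes it easy to keep adding layers with `+` later.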

In this notebook we will visualize the following examples for our 2016Q4 cohort (defined in the Data Exploration notebook):

- TANF experience:
    - **Top 10 counties by number of TANF leavers** (bar plot)
    - **Number of TANF leavers by county** (geographic heatmap)
    - **Distribution of spell lengths in the cohort** (histogram)
    

- Employment outcomes:
    - **Distribution of wages in 5 most popular industries** (small multiples)
    - **Most popular industries with the highest average wages** (bar plot)
    - **Employment patterns by quarters** (heatmap)

### R Setup

Let's start by importing necessary R libraries and connecting to the database.

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)
library(lubridate)
library(sf)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

Let's read in the table for our 2016Q4 cohort that we used in the Data Exploration notebook.

In [None]:
# 2016Q4 cohort with most recent case information
qry <- "
SELECT *
FROM ada_tdc_2020.cohort_2016
"

#read into R as df
df_2016 <- dbGetQuery(con,qry)

In [None]:
head(df_2016)
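
The same `DBI` pattern (connect, query, disconnect) can be tried without access to the project database. The sketch below assumes the `RSQLite` package is installed and uses an in-memory SQLite database with a made-up `cohort_demo` table as a stand-in. Neither is part of this project's environment.

```r
library(DBI)

# in-memory SQLite database as a stand-in for the PostgreSQL connection (assumption)
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# hypothetical toy table, not the real cohort
dbWriteTable(demo_con, "cohort_demo",
             data.frame(ssn = c("a", "b", "c"),
                        county = c("001", "003", "001")))

# dbGetQuery() sends the SQL and returns the result as a data frame
demo_df <- dbGetQuery(demo_con, "
SELECT county, COUNT(*) AS n
FROM cohort_demo
GROUP BY county
ORDER BY county
")
demo_df

# close the connection when you are done with it
dbDisconnect(demo_con)
```

In this notebook, the analogous cleanup step would be `dbDisconnect(con)` once you have finished querying.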

## TANF experience

### Top 10 counties by number of TANF leavers

To begin with, it is useful to visualize how the number of TANF leavers differs across the top 10 counties for the 2016Q4 cohort. Bar plots help you visually inspect differences between groups during your own data exploration, and they are also an effective tool for communicating with an outside audience (for example, if you would like to highlight a significant difference between groups, as we do in the example below).

In [None]:
# count number of leavers by county
leavers_county <- df_2016 %>%
                group_by(county) %>%
                summarize(count=n()) %>%
                arrange(desc(count)) %>%
                slice(1:10)  # choose only top 10 counties

In [None]:
head(leavers_county)

If we want to match the county codes to the names of the counties, we can load the `tl_2016_us_county` table from the `public` schema into R and join the two data frames, like we did in the Data Exploration notebook.

> Indiana's state fips code is 18.

In [None]:
# Get county codes and county names for the state of Indiana
qry <- 
"SELECT countyfp as county, name
FROM public.tl_2016_us_county
WHERE statefp = '18'
"
#read into R as df
counties <- dbGetQuery(con, qry)

In [None]:
# see counties
head(counties)

Recall that similar to SQL's `LEFT JOIN`, one of the `tidyverse` packages, `dplyr`, contains `left_join()` which we can use to match the county codes to their proper names.

In [None]:
# left join to county lookup table
leavers_county <- leavers_county %>% 
    left_join(counties, by="county") %>%
    select(name, count)

In [None]:
leavers_county
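
One thing to keep in mind with `left_join()` is that every row of the left table is kept, and codes with no match in the lookup table get `NA` in the new columns. The sketch below illustrates this with made-up tables (`toy_counts` and `toy_lookup` are hypothetical names, not objects from this notebook).

```r
library(dplyr)

# hypothetical counts keyed by county code
toy_counts <- tibble(county = c("001", "003", "999"),
                     count  = c(12, 7, 3))

# hypothetical lookup table; note that code "999" has no entry
toy_lookup <- tibble(county = c("001", "003"),
                     name   = c("County A", "County B"))

# left join keeps all three rows of toy_counts
toy_joined <- toy_counts %>%
    left_join(toy_lookup, by = "county")
toy_joined

# rows that found no match in the lookup table
toy_joined %>%
    filter(is.na(name))
```

Checking for `NA` names before plotting can help catch county codes that would otherwise show up as an unlabeled bar.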

We will now use `ggplot` and the `geom_bar` function to create a bar plot, with the top 10 counties on the x-axis and counts of TANF leavers on the y-axis. 

For any graph, remember to always add:
- a title (`ggtitle`), which should ideally describe the main message/takeaway from the graph
- x and y labels (`xlab`, `ylab`)
- the data source for the graph (`labs(caption = ...)`)

In [None]:
# Full code for the plot

ggplot(leavers_county, aes(x=reorder(name, -count), y=count)) +    # use reorder() to order bars from high to low based on count
geom_bar(stat = 'identity', fill = 'blue') +                       # add color here in the fill = ...
ggtitle('REDACTED county: highest # of TANF leavers in 2016Q4') +    # add title here
xlab('County') +                                                   # x-axis label
ylab('Number of TANF leavers') +                                   # y-axis label
theme(text=element_text(size=15, face="bold"))  +                  # change font size of the graph
labs(caption = 'Source: Indiana TANF data') +                      # add a caption below the plot
theme(plot.caption = element_text(hjust=0)) +                      # this line of code adds caption in the left corner of the plot
theme(axis.text.x = element_text(angle=90))                        # rotate labels on x-axis
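
To see what `reorder()` does on its own, here is a small sketch with made-up values (the `name` and `count` vectors below are hypothetical). It returns a factor whose levels are sorted by the second argument, which is what controls the order of the bars.

```r
# toy data: three hypothetical counties and their counts
name  <- c("A", "B", "C")
count <- c(5, 20, 10)

# levels are sorted by increasing value of the second argument,
# so negating count sorts from the largest count to the smallest
ordered_names <- reorder(name, -count)
levels(ordered_names)
# "B" "C" "A"
```

Dropping the minus sign would sort the bars from low to high instead.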

<font color=red><h3> Checkpoint 1: Recreate for 2009Q1 </h3></font> 

Recreate the same bar plot for the 2009Q1 cohort (the 2009Q1 cohort table is stored in the `ada_tdc_2020` schema as `cohort_2009`).

### TANF leavers by county - geographic heatmap

What if we wanted to show regional differences in the number of TANF leavers by county using a map? A geographic heatmap is a powerful visualization tool that makes it easy to compare and communicate regional differences.

We can leverage a public table in our database that holds the geographic coordinates of counties. We will use the `sf` package, which lets us read in geographic information in one line of code (using the `st_read` function) and prepares the data for plotting. 

In [None]:
# count number of leavers by county
leavers_county_map <- df_2016 %>%
                group_by(county) %>%
                summarize(count=n()) %>%
                arrange(desc(count))

We will read in the necessary geographic information (county codes, county names, polygons, and centroids of those polygons for the state of Indiana) using the `st_read` function from the `sf` package, which prepares the data for plotting.

In [None]:
# Get county codes, county names, polygons, and centroids of those polygons for the Indiana state
qry <- 
"SELECT countyfp as county, LOWER(name) as name, geom, ST_X(ST_Centroid(geom)) as long, ST_Y(ST_Centroid(geom)) as lat
FROM public.tl_2016_us_county
WHERE statefp = '18'
"
#read into R as df
counties <- st_read(con,query=qry)

In [None]:
head(counties)

Now we join the `counties` dataframe with our dataset of TANF leavers' counts by county:

In [None]:
counties <- inner_join(counties,leavers_county_map,by="county")

In [None]:
head(counties)

We will create the map using `ggplot` and `geom_sf` as primary functions (and we will also add title, data source, and county names as labels):

In [None]:
# Create the plot
ggplot(counties) +   # insert the name of the main dataset here
    geom_sf(aes(fill=count), color='white') +                   # in the fill parameter use the "count" variable
    scale_fill_gradient(low="light blue",high="red") +          # choose colors for the gradient
    geom_text(aes(x=long, y=lat, label=name), size=2.3) +       # add county names as labels using centroids defined in the table
    ggtitle("REDACTED County: highest # of TANF leavers in 2016Q4") +   # add the title
    theme(plot.title = element_text(size=15, face="bold")) +          # change title font size and make the font bold
    labs(caption = 'Source: Indiana TANF data') +                     # add a caption below the plot
    theme(plot.caption = element_text(size=17, hjust=0))              # move caption to the left
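
The query above computes the label positions in SQL with `ST_Centroid`. A similar step can be done in R with `sf::st_centroid()`. The sketch below is only an illustration on two made-up square "counties" (the `unit_square` helper, the names, and the counts are all hypothetical), not the method used for the Indiana map.

```r
library(sf)
library(ggplot2)

# hypothetical helper: a unit square with its lower-left corner at (x0, y0)
unit_square <- function(x0, y0) {
    st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                          c(x0, y0 + 1), c(x0, y0))))
}

# two made-up "counties" with made-up counts
toy_map <- st_sf(name  = c("east", "west"),
                 count = c(10, 40),
                 geom  = st_sfc(unit_square(1, 0), unit_square(0, 0)))

# centroids give a sensible position for each label
centers <- st_coordinates(st_centroid(st_geometry(toy_map)))
toy_map$long <- centers[, "X"]
toy_map$lat  <- centers[, "Y"]

p_map <- ggplot(toy_map) +
    geom_sf(aes(fill = count), color = 'white') +          # polygons colored by count
    geom_text(aes(x = long, y = lat, label = name))        # labels placed at the centroids
p_map
```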

<font color=red><h3> Checkpoint 2: Recreate for 2009Q1 </h3></font> 

Recreate the same geographic heatmap for the 2009Q1 cohort.

### Distribution of spell lengths in the cohort

Now let's take a look at the summary statistics of spell lengths in the cohort:

In [None]:
summary(df_2016$tanf_spell_months)

We can use a histogram (`geom_histogram`) to inspect spikes and drops in the distribution of spell lengths in the cohort:

In [None]:
# Full code for the plot
ggplot(df_2016, aes(tanf_spell_months)) + 
geom_histogram(bins=25, fill = 'blue') +
ggtitle('Most Common Spell Length in 2016Q4 Cohort: REDACTED') +
xlab('TANF spell months') +
ylab('Number of individuals') +
theme(text=element_text(size=12, face="bold"))  +
labs(caption = 'Source: Indiana TANF data') +
theme(plot.caption = element_text(hjust=0))
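
The plot above fixes the number of bars with `bins=25`. An alternative is `binwidth`, which fixes the width of each bar instead. Because spell lengths are whole months, a width of 1 gives one bar per month. The sketch below uses made-up values (`toy_spells` is hypothetical) and is only meant to show the argument.

```r
library(ggplot2)

# made-up spell lengths in months (whole numbers)
toy_spells <- data.frame(tanf_spell_months = c(1, 1, 2, 3, 3, 3, 6, 12, 12, 24))

p_hist <- ggplot(toy_spells, aes(x = tanf_spell_months)) +
    geom_histogram(binwidth = 1, fill = 'blue') +   # one bar per month
    xlab('TANF spell months') +
    ylab('Number of individuals')
p_hist

# the bar heights computed by ggplot2 add up to the number of rows
sum(layer_data(p_hist)$count)
```

Which option reads better depends on the range of the data, so it is worth trying both.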

<font color=red><h3> Checkpoint 3: Recreate for 2009Q1 </h3></font> 

Recreate this histogram for the 2009Q1 cohort (or use your own variables of interest).

## Employment outcomes

### Distribution of wages in 5 most popular industries

We can use a density plot (a smoothed version of a histogram) to visualize the distribution of wages in the 5 most popular industries among TANF leavers in the 2016Q4 cohort. 

When we compare multiple groups in one plot, there is a high chance that the plot will become overcrowded. A good solution in such cases is to use small multiples (a series of mini-graphs, one per group, that share the same scale and axes). In this example we will visualize the wage distribution in the 5 most popular industries using a mini density plot for each industry.

First, we need to read in the `cohort_2016_earnings` table.

In [None]:
# read table into R
qry <- "
select *
from ada_tdc_2020.cohort_2016_earnings
"
df_2016_wages <- dbGetQuery(con, qry)

Let's create a table with the top 5 most popular industries.

In [None]:
# save most popular naics as pop_naics
pop_naics <- df_2016_wages %>%
    group_by(naics_3_digit) %>%
    summarize(num = n_distinct(ssn)) %>% 
    arrange(desc(num))  %>%
    slice(1:5)    # choose top 5 industries

In [None]:
pop_naics

In [None]:
# get wages for most popular industries
wages_pop_naics <- df_2016_wages %>%
    filter(naics_3_digit %in% pop_naics$naics_3_digit) %>%
    select(ssn, wages, naics_3_digit, quarter)

It will be useful to add NAICS industry code names - similar to how we did it in the Data Exploration notebook, we will use a table `naics_2017` in the `public` schema with the NAICS industry code names.

In [None]:
# see the naics_2017 table
qry <- '
select *
from public.naics_2017
limit 5
'
dbGetQuery(con, qry)

In [None]:
# read naics_2017 table into R as naics
qry <- '
select *
from public.naics_2017
'
naics <- dbGetQuery(con, qry)

In [None]:
# get industry names of most popular naics using left join, like we did in the Data Exploration notebook
wages_pop_naics <- wages_pop_naics %>% 
    left_join(naics, by=c('naics_3_digit' = 'naics_us_code')) %>%
    select(ssn, wages, naics_us_title, quarter)

In order to create small multiples in R, we will use `facet_grid`:

In [None]:
# Full code for the small multiples plot
ggplot(wages_pop_naics, aes(x = wages)) +                          # include the main dataset name and what variable to use on x-axis
geom_density(fill = 'blue') +                                      # choose a fill color
facet_grid(naics_us_title ~ .) +                                   # use facet grid
theme(strip.text.y = element_text(angle=0, hjust=0)) +             # rotate text labels
ggtitle('Cohort 2016Q4: Wages distribution in top 5 industries') + # add title
xlab('Wages') +                                                    # add x-axis label
ylab('Density') +                                                  # add y-axis label
theme(text=element_text(size=14,face="bold"))  +                   # change font size and make it bold
labs(caption = 'Source: Indiana TANF, UI Wage data') +             # add data source caption
theme(plot.caption = element_text(hjust=0))                        # move data source caption to the left
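
`facet_grid` is one of two faceting functions in `ggplot2`; the other is `facet_wrap`. The sketch below compares them on made-up wages for three hypothetical industries (`toy_wages` is not part of the cohort data).

```r
library(ggplot2)

# made-up wages for three hypothetical industries
set.seed(1)
toy_wages <- data.frame(
    industry = rep(c("Industry A", "Industry B", "Industry C"), each = 50),
    wages    = c(rnorm(50, 3000, 500), rnorm(50, 5000, 800), rnorm(50, 4000, 300))
)

# facet_grid: one row of the grid per industry, all panels share the same x-axis
p_rows <- ggplot(toy_wages, aes(x = wages)) +
    geom_density(fill = 'blue') +
    facet_grid(industry ~ .)

# facet_wrap: panels are wrapped into a layout; ncol = 1 stacks them vertically
p_wrap <- ggplot(toy_wages, aes(x = wages)) +
    geom_density(fill = 'blue') +
    facet_wrap(~ industry, ncol = 1)

p_rows
```

With a single grouping variable the two give similar results; `facet_grid` becomes more useful when you want to cross two grouping variables as rows and columns.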

<font color=red><h3> Checkpoint 4: Recreate for 2009Q1 </h3></font> 

Recreate small multiples for the 2009Q1 cohort (or use your own variables of interest).

### Most popular industries with the highest average wages

We can also look at which of the most popular industries have the highest average wages and visualize them with a bar plot, to see whether there are any drastic differences between them (as we saw in the bar plot above for the differences in counties by number of TANF leavers).

In [None]:
# save most popular naics as pop_naics
pop_naics <- df_2016_wages %>%
    group_by(naics_3_digit) %>%
    summarize(num = n_distinct(ssn)) %>% 
    arrange(desc(num))  %>%
    slice(1:10)     # choose top 10 industries

# get wages for top 10 industries
wages_pop_naics <- df_2016_wages %>%
    filter(naics_3_digit %in% pop_naics$naics_3_digit) %>%
    select(ssn, wages, naics_3_digit, quarter)

In [None]:
# save to quarterly_naics
quarterly_naics <- wages_pop_naics %>%
    group_by(ssn, quarter, naics_3_digit) %>%
    summarize(tot_wages = sum(wages)) %>%
    ungroup()

In [None]:
# find average quarterly wages by industry and include number of people employed at least one quarter in each industry
top_mean_wages <- quarterly_naics %>%
                group_by(naics_3_digit) %>%
                summarize(avg_wages = mean(tot_wages),
                         num_ssns = n_distinct(ssn)) %>%
                arrange(desc(num_ssns)) 

In [None]:
top_mean_wages
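
Why sum wages per person, quarter, and industry before averaging? If someone holds two jobs in the same industry in the same quarter, averaging the raw rows would treat them as two low earners. The sketch below shows the difference on made-up rows (the `toy` table is hypothetical).

```r
library(dplyr)

# person "a" has two jobs in the same industry in the same quarter
toy <- tibble(
    ssn           = c("a", "a", "b"),
    quarter       = c(1, 1, 1),
    naics_3_digit = c("722", "722", "722"),
    wages         = c(1000, 500, 3000)
)

# averaging the raw rows: (1000 + 500 + 3000) / 3
naive_mean <- mean(toy$wages)

# step 1: total wages per person, quarter, and industry
person_quarter <- toy %>%
    group_by(ssn, quarter, naics_3_digit) %>%
    summarize(tot_wages = sum(wages), .groups = "drop")

# step 2: average those totals within each industry: (1500 + 3000) / 2
two_step <- person_quarter %>%
    group_by(naics_3_digit) %>%
    summarize(avg_wages = mean(tot_wages),
              num_ssns  = n_distinct(ssn))

naive_mean
two_step
```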

In [None]:
# get industry names using left join with "naics" table, like we did above
top_mean_wages <- top_mean_wages %>% 
    left_join(naics, by=c('naics_3_digit' = 'naics_us_code')) %>%
    select(naics_us_title, avg_wages)

Recall that you can use the `reorder` function on the x-axis to sort the bars in descending order.

In [None]:
# Full code for the plot
ggplot(top_mean_wages, aes(x= reorder(naics_us_title, -avg_wages), y=avg_wages)) +
geom_bar(stat = 'identity', fill = 'blue') +
ggtitle('REDACTED - highest average wages (2016Q4)') +
xlab('Industry') +
ylab('Average wages') +
theme(text=element_text(size=13, face="bold"))  +
labs(caption = 'Source: Indiana TANF, UI Wage data') +
theme(plot.caption = element_text(hjust=0)) +
theme(axis.text.x = element_text(angle=90))                  

<font color=red><h3> Checkpoint 5: Recreate for 2009Q1 </h3></font> 

Recreate the same bar plot for the 2009Q1 cohort (or use your own variables of interest).

### Employment Patterns by Quarters

The last visualization that we would like to introduce in this notebook is a heatmap table of employment patterns of TANF leavers by quarters.

We know that only a certain proportion of our cohort showed up in Indiana's UI wage records in the year after leaving TANF. It would logically follow that since not every individual in our cohort showed up in the UI wage records, not every individual would be represented in the `cohort_2016_earnings` table. Let's confirm that idea.

In [None]:
# see the number of rows where wages are null
qry <- '
select count(*)
from ada_tdc_2020.cohort_2016_earnings
where wages is null
'
dbGetQuery(con, qry)

When looking at common employment patterns for our cohort, if we were to just use data from `cohort_2016_earnings`, we would be ignoring those who did not appear at all in the wage records. To address that issue, we can `LEFT JOIN` our full 2016 Q4 cohort to our current wage data frame `df_2016_wages`, and then we can work from there.

Before we do the left join, remember how in the Data Exploration notebook we looked at the age breakdown of our cohort? 

Do you think it will make sense to include children under 18 years old in our employment patterns table? We would probably want to filter the dataframe only to adults. Let's do that first.

Let's recreate our 2016 Q4 cohort to include a date of birth variable in order to find out the age of TANF recipients, like we did in the Data Exploration notebook.

In [None]:
# Recreate the cohort to include the date of birth variable (extract only year from the date of birth)
qry <- "
SELECT distinct on (a.ssn)
a.ssn, a.caseid, a.month, a.tanf_start, a.tanf_end, a.tanf_spell_months, a.tanf_total_months,b.county,
substring(a.month,1,4) as rep_year, substring(a.month,5,2) as rep_month, extract(year from dob) as dob_yr  
FROM in_fssa.person_month a
INNER JOIN in_fssa.case_month b 
on a.caseid = b.caseid
WHERE a.affil = '1' and
a.tanf_end = TRUE and 
ssn not in (REDACTED) and
substring(a.month,1,4) = '2016' and 
substring(a.month,5,2) in ('10','11','12')
order by a.ssn, a.month desc;
"

#read into R as df
df_2016_age <- dbGetQuery(con,qry)

In [None]:
head(df_2016_age)

In [None]:
# Add a column with age
df_2016_age <- df_2016_age %>%
    mutate(age = as.integer(rep_year) - dob_yr) 

# Flag those who are 18 years old or older
df_2016_age <- df_2016_age %>%    
    mutate(age_ind = ifelse(age >= 18, "adult", "non_adult"))

In [None]:
head(df_2016_age)

In [None]:
# Choose only those who are 18 years old or older, save to a dataframe called "df_2016_adult"
df_2016_adult <- df_2016_age[df_2016_age$age_ind == "adult", ]

In [None]:
head(df_2016_adult)

Now we can left join our `df_2016_adult` table with `df_2016_wages`:

In [None]:
# left join df_2016_adult to df_2016_wages
total_df <- df_2016_adult %>%
    left_join(df_2016_wages, c("ssn", "tanf_spell_months", "tanf_total_months", "county"))

Now that we have earnings (or lack thereof) for our cohort, let's aggregate wages by quarter. We will eventually only want an indicator of whether an individual was employed in the quarter, which we can find by checking whether their wages for the quarter were greater than 0.

In [None]:
# aggregate by quarter
agg_df <- total_df %>%
    group_by(ssn, quarter) %>%
    summarize(wages = sum(wages)) %>%
    ungroup()

In [None]:
# see that all missing wages are for those with no wages yet
agg_df %>%
    filter(is.na(wages)) %>%
    summarize(n=n(),
             num = n_distinct(ssn))

When we aggregate by quarter, we can still see that there is only one row for everyone in the cohort who has missing wages. Since we want our employment patterns table to encompass all four quarters, we need to create a data frame where each row corresponds to an `ssn` in our cohort, and then an indicator (`wages` > 0) if the individual was employed in the specific quarter. Thus, we need four entries per `ssn` to represent the four quarters of potential employment.

To solve this problem, we can use `complete()`, which will "complete" a data frame based on all potential values of certain columns. We will want to `complete()` our data frame based on the `ssn` and `quarter` combinations, so we need to make sure our quarters span from 1 to 4 and our ssn's encompass all `ssn` values in our cohort.

Let's check what the values of `quarter` are right now, and how many individuals we have for each quarter.

In [None]:
agg_df %>%
    group_by(quarter) %>%
    summarize(n_distinct(ssn))

Due to the left join above, all members of our cohort (only adults) that did not appear in `df_2016_wages` have `NA` values for quarter. We need to fix that before we can use complete, so let's arbitrarily assign all rows with `NA` `quarter` values as Q1 so we can use `complete()`.

In [None]:
# set all where quarter is na equal to 1 so complete uses 1,2,3,4 as the options for quarter
agg_df$quarter[is.na(agg_df$wages)] <- 1

Now, we can `complete()` `agg_df` based on all combinations of `ssn` and `quarter`, and for those that currently don't exist in `agg_df`, we can set their wages to 0.

In [None]:
# complete agg_df so that every ssn has a row for each of the four quarters, filling missing wages with 0
complete_df <- agg_df %>%
    complete(ssn, quarter, fill=list(wages=0))
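
To make the behavior of `complete()` concrete, here is a small sketch on a made-up table (`toy_agg` is hypothetical). Passing `quarter = 1:4` is a variation on the call above, not the notebook's own code: it asks for all four quarters explicitly, even if one of them never appears in the data.

```r
library(dplyr)
library(tidyr)

# person "a" has wages in two quarters; person "b" never appears in the wage records
toy_agg <- tibble(
    ssn     = c("a", "a", "b"),
    quarter = c(1L, 2L, NA),
    wages   = c(100, 50, NA)
)

# give the person with no wage records a placeholder quarter
toy_agg$quarter[is.na(toy_agg$quarter)] <- 1L

# one row per ssn/quarter combination; missing wages become 0
toy_complete <- toy_agg %>%
    complete(ssn, quarter = 1:4, fill = list(wages = 0))

toy_complete
nrow(toy_complete)   # 2 people x 4 quarters
```

Note that the placeholder row for person "b" started with `NA` wages and also ends up as 0, so every row can be turned into a yes/no indicator afterwards.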

You can perform a few sanity checks to make sure you used `complete()` properly for this situation.

In [None]:
# see if number of rows of complete_df is 4 times the number of rows in df_2016_adult (each ssn, quarter combo)
nrow(complete_df) == nrow(df_2016_adult) * 4

In [None]:
# see if number of ssns is same for complete_df and df_2016_adult
n_distinct(complete_df$ssn) == n_distinct(df_2016_adult$ssn)

In [None]:
# see complete_df
head(complete_df)

Now, let's add in our indicator variable based on whether an individual had earnings greater than 0 in a given quarter.

In [None]:
# add new indicator column for if wages are greater than 0
# then also get rid of the actual wages
complete_df <- complete_df %>%
    mutate(wage_ind = ifelse(wages == 0, "no", "yes")) %>%
    select(-wages)

Now that we have a row for every `ssn`/`quarter` combination and our indicator variable if the individual was employed in a given `quarter`, we can now morph our data frame into having each row as one `ssn`, with a "yes"/"no" indicator of employment for each quarter, which will be represented as individual columns. 

In [None]:
# now need to widen the data so that each quarter becomes a column
# but first, convert quarter from a numerical column to character labels (q1-q4) to use as column names
complete_df <- complete_df %>%
    mutate(new_quarter = case_when(
    quarter == 1 ~ "q1",
    quarter == 2 ~ "q2",
    quarter == 3 ~ "q3",
    quarter == 4 ~ "q4")) %>%
    select(-quarter)

In [None]:
# see complete_df
head(complete_df)

From here, we can use the `tidyverse`'s `pivot_wider()` function to "widen" our data frame to get our desired output.

In [None]:
# spread the wage indicator by the new_quarter
complete_df %>% 
    pivot_wider(names_from = new_quarter, values_from = wage_ind) %>%
    head()

In [None]:
# save pivoted table
wage_by_q <- complete_df %>% 
    pivot_wider(names_from = new_quarter, values_from = wage_ind)
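
As a quick illustration of what `pivot_wider()` does to the shape of the data, the sketch below uses a made-up long table with three people and two quarters (`toy_long` is hypothetical).

```r
library(dplyr)
library(tidyr)

# long format: one row per ssn/quarter combination
toy_long <- tibble(
    ssn         = rep(c("a", "b", "c"), each = 2),
    new_quarter = rep(c("q1", "q2"), times = 3),
    wage_ind    = c("yes", "yes", "no", "no", "yes", "no")
)

# wide format: one row per ssn, one column per quarter
toy_wide <- toy_long %>%
    pivot_wider(names_from = new_quarter, values_from = wage_ind)

toy_wide
dim(toy_wide)   # 3 rows (people), 3 columns (ssn, q1, q2)
```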

Now, we want to count the number of people per unique combination of employment by each quarter.

In [None]:
patterns <- wage_by_q %>%
            group_by(q1, q2, q3, q4) %>%
            summarize(count=n_distinct(ssn)) %>%
            arrange(desc(count))

In [None]:
patterns

Do these counts make sense? While this table is already quite informative, let's put the cherry on top of this analysis by visualizing it.

We would like to visualize this table using color to highlight "employment" and "no employment" in each quarter. 

In [None]:
patterns

It is better not to use the `count` column as an index, as there could be duplicate values. We will add those counts to our heatmap later.

In [None]:
# Save counts to use later in the heatmap - we cannot use the counts as index, as there could be duplicate values 
counts <- patterns$count

We will add an index with unique sequential numbers and remove the `count` column:

In [None]:
patterns$Pattern <- seq.int(nrow(patterns))
patterns$count <- NULL

In [None]:
patterns

We now need to convert this table from wide to long format. We can use the `pivot_longer` function:

In [None]:
patterns_long <- pivot_longer(patterns, names_to = 'Quarter', values_to = 'Status', -c(Pattern))

In [None]:
head(patterns_long)
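
`pivot_longer()` is the reverse of `pivot_wider()`. The sketch below shows the reshaping on a made-up table with two patterns and two quarters (`toy_patterns` is hypothetical); it names the `cols` argument explicitly, which selects every column except `Pattern`.

```r
library(dplyr)
library(tidyr)

# wide format: one row per pattern, one column per quarter
toy_patterns <- tibble(
    q1      = c("yes", "no"),
    q2      = c("yes", "no"),
    Pattern = 1:2
)

# long format: one row per pattern/quarter combination
toy_patterns_long <- pivot_longer(toy_patterns,
                                  cols      = -Pattern,
                                  names_to  = 'Quarter',
                                  values_to = 'Status')

toy_patterns_long
```

Each tile in the heatmap corresponds to one row of the long table: its position comes from `Quarter` and `Pattern`, and its color from `Status`.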

Now we are ready to create the visualization using `geom_tile` in `ggplot`:

In [None]:
# Full code for the plot

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table from highest to lowest

ggplot(data = patterns_long, aes(x = Quarter, y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
ylab('Employment - Individual Counts') +                                                     # add y-axis label
ggtitle('2016Q4 Cohort: Employment Patterns by Quarters') +                                                 # add title for the plot
labs(caption = 'Source: Indiana TANF, UI Wage data') +                                       # add data sourcing caption
theme(plot.caption = element_text(hjust=0)) +                                                # move the data sourcing caption to the left corner of the graph
scale_y_discrete(labels=rev(counts))  # rename the y-axis ticks to correspond to the counts from the table
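
The plot above lists the 16 possible patterns by hand. As a possible variation (not the notebook's own code), the levels can be derived from the number of patterns actually present. The sketch below uses made-up data (`toy_long` and `toy_counts` are hypothetical) to show the idea.

```r
library(ggplot2)

# made-up long table: two patterns across two quarters
toy_long <- data.frame(
    Pattern = c(1, 1, 2, 2),
    Quarter = c("q1", "q2", "q1", "q2"),
    Status  = c("yes", "yes", "no", "no")
)
toy_counts <- c(30, 12)   # made-up counts, one per pattern, largest first

# derive the number of patterns instead of typing the levels by hand
n_patterns <- length(toy_counts)

p_tile <- ggplot(toy_long, aes(x = Quarter,
                               y = factor(Pattern, levels = rev(seq_len(n_patterns))))) +
    geom_tile(aes(fill = Status), colour = 'black') +   # one tile per pattern/quarter
    scale_x_discrete(position = 'top') +                # quarter labels on top
    scale_y_discrete(labels = rev(toy_counts)) +        # show counts instead of pattern numbers
    ylab('Individual counts')
p_tile
```

Reversing both the levels and the labels keeps the most common pattern in the top row, matching the order of the table.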

> **Side note**: what if we didn't filter our cohort to only adults? What would our employment patterns look like? 
> Let's recreate the above code using our dataframe with the full cohort, `df_2016`, including children:

In [None]:
# left join df_2016 with the df_2016_wages
total_df <- df_2016 %>%
    left_join(df_2016_wages, c("ssn", "tanf_spell_months", "tanf_total_months", "county"))
    
# aggregate by quarter
agg_df <- total_df %>%
    group_by(ssn, quarter) %>%
    summarize(wages = sum(wages)) %>%
    ungroup()

# set all where quarter is na equal to 1 so complete uses 1,2,3,4 as the options for quarter
agg_df$quarter[is.na(agg_df$wages)] <- 1

# complete agg_df so that every ssn has a row for each of the four quarters, filling missing wages with 0
complete_df <- agg_df %>%
    complete(ssn, quarter, fill=list(wages=0))

# add new indicator column for if wages are greater than 0
# then also get rid of the actual wages
complete_df <- complete_df %>%
    mutate(wage_ind = ifelse(wages == 0, "no", "yes")) %>%
    select(-wages)

# now need to widen the data so that each quarter becomes a column
# but first, convert quarter from a numerical column to character labels (q1-q4) to use as column names
complete_df <- complete_df %>%
    mutate(new_quarter = case_when(
    quarter == 1 ~ "q1",
    quarter == 2 ~ "q2",
    quarter == 3 ~ "q3",
    quarter == 4 ~ "q4")) %>%
    select(-quarter)

# save pivoted table
wage_by_q <- complete_df %>% 
    pivot_wider(names_from = new_quarter, values_from = wage_ind)

patterns <- wage_by_q %>%
            group_by(q1, q2, q3, q4) %>%
            summarize(count=n_distinct(ssn)) %>%
            arrange(desc(count))

# Save counts to use later in the heatmap - we cannot use the counts as index, as there could be duplicate values 
counts <- patterns$count

patterns$Pattern <- seq.int(nrow(patterns))
patterns$count <- NULL

patterns_long <- pivot_longer(patterns, names_to = 'Quarter', values_to = 'Status', -c(Pattern))

# Plot
levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table from highest to lowest

ggplot(data = patterns_long, aes(x = Quarter, y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
ylab('Employment - Individual Counts') +                                                     # add y-axis label
ggtitle('2016Q4 Cohort: Employment Patterns by Quarters') +                                                 # add title for the plot
labs(caption = 'Source: Indiana TANF, UI Wage data') +                                       # add data sourcing caption
theme(plot.caption = element_text(hjust=0)) +                                                # move the data sourcing caption to the left corner of the graph
scale_y_discrete(labels=rev(counts))  # rename the y-axis ticks to correspond to the counts from the table

> Do you notice the difference in the counts? The highest count of those who are not employed now also includes children under 18 years old from the cohort. It would be more accurate to filter our cohort table by only adults, like we did above. It's a good example of how we should think through our data and what we would like to show in our visualization.

<font color=red><h3> Checkpoint 6: Recreate for 2009Q1 </h3></font> 

Recreate this heatmap for the 2009Q1 cohort. Use a cohort filtered to adults only. You can also try visualizing the full cohort, including children, to see the difference in the counts.