# Presentation Prep – Advanced

Maryah Garner, Rukhshan Arif Mian, Allison Nunez

**_Presentation Prep Examples & Exercises_**

This notebook provides information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives an overview of the information needed for disclosure review. _Please read through the entire notebook because it will separately discuss all outputs that will be flagged in the disclosure review process._


For the purpose of this class, the disclosure rules are as follows:

# TDC 2022 Class Export Review Guidelines 

- **Each team will be able to export up to 10 figures/tables**
    
    
- **Every statistic for export should be based on at least 10 individuals and at least 3 employers**.
     - Statistics that are based on 0-9 individuals must be suppressed
     - Statistics that are based on 0-2 employers must be suppressed
    
    
- **All counts will need to be rounded**
    - Counts below 1000 should be rounded to the nearest ten
    - Counts greater than or equal to 1000 should be rounded to the nearest hundred
    > For example, a count of 868 would be rounded to 870 and a count of 1868 would be rounded to 1900

- **All reported wages will need to be rounded to the nearest hundred** 
    
- **All reported averages will need to be rounded to the nearest hundredth** 
   
   
- **All percentages and proportions need to be rounded**
    - The same rounding rule that is applied to counts must be applied to both the numerator and denominator
    - Percentages must then be rounded to the nearest percent
    - Proportions must be rounded to the nearest hundredth


- **Exact percentiles cannot be exported** 
    - Instead, for example, you may calculate a “fuzzy median”, by averaging the true 45th and 55th percentiles
       > If you are calculating the fuzzy percentiles for wage, you will need to round to the nearest hundred after calculating the fuzzy percentile.

       > If you are calculating the fuzzy percentile for a number of individuals, you will need to round to the nearest 10 if the count is less than 1000 and to the nearest hundred if the count is greater than or equal to 1000.
  
- **Exact maxima and minima cannot be exported**
    - Suppress maximum and minimum values in general. 
    - You may replace an exact maximum or minimum with a top-coded value or a fuzzy maximum or minimum value. For example: If the maximum value for earnings is 154,325, it could be top-coded as '100,000+'. And a fuzzy maximum value could be: 
    $$\frac{\text{95th percentile of earnings} + 154325}{2}$$
 
 
- **Complementary suppression**
    - If your figures include totals or depend on preceding or subsequent figures, you need to take into account complementary disclosure risks: that is, whether the figure totals, or the separate figures when read together, might disclose information about fewer than 10 individuals in the data in a way that a single, simpler table would not. Team facilitators and export reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
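
The rounding and fuzzy-maximum rules above can be sketched as small helper functions. This is only an illustration: the function names `round_count`, `export_percent`, and `fuzzy_max` are hypothetical and are not used elsewhere in this notebook. Note that R's `round()` sends exact ties to the even digit, and the guidelines above do not say how ties should be handled.

```r
# Round a count: nearest 10 below 1000, nearest 100 at 1000 or above
round_count <- function(n) {
  ifelse(n < 1000, round(n, digits = -1), round(n, digits = -2))
}

# Percent built from the *rounded* numerator and denominator,
# then rounded to the nearest whole percent
export_percent <- function(numerator, denominator) {
  round(100 * round_count(numerator) / round_count(denominator), digits = 0)
}

# Fuzzy maximum: average of the 95th percentile and the true maximum
fuzzy_max <- function(x) {
  (unname(quantile(x, 0.95, na.rm = TRUE)) + max(x, na.rm = TRUE)) / 2
}

round_count(c(868, 1868))   # 870 1900, as in the example above
export_percent(868, 1868)   # 100 * 870 / 1900, rounded to 46
```

The same helpers can be called inside `mutate()` or `summarise()` when building a supporting table.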


##  Supporting Documentation for Exports

For each exported figure, you will need to provide a table with **underlying counts** of individuals and employers for each statistic depicted in the figure. 

- You will need to include both the rounded and the unrounded counts of individuals.

- If percentages or proportions are to be exported, you must report both the rounded and the unrounded counts of individuals for the numerator and denominator. You must also report the counts of employers for both the numerator and the denominator.
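
As a rough sketch of what such a supporting table could look like for an exported percentage, the example below uses made-up counts. The column names and the `passes` flag are hypothetical illustrations, not a format required by the export team.

```r
# Toy (made-up) counts behind two percentages shown in a figure
supporting <- data.frame(
  group         = c("A", "B"),
  numerator     = c(868, 14),    # unrounded individuals in the numerator
  denominator   = c(1868, 57),   # unrounded individuals in the denominator
  num_employers = c(212, 2),     # employers behind the numerator
  den_employers = c(430, 9)      # employers behind the denominator
)

# Same rounding rule as for counts: nearest 10 below 1000, nearest 100 otherwise
round_count <- function(n) ifelse(n < 1000, round(n, -1), round(n, -2))

supporting$numerator_round   <- round_count(supporting$numerator)
supporting$denominator_round <- round_count(supporting$denominator)
supporting$percent_round <- round(
  100 * supporting$numerator_round / supporting$denominator_round
)

# Flag rows that meet the 10-individual and 3-employer minimums
supporting$passes <- supporting$numerator >= 10 &
  supporting$denominator >= 10 &
  supporting$num_employers >= 3 &
  supporting$den_employers >= 3

supporting
```

In this toy table, group B fails because its numerator is based on only 2 employers, so that statistic would need to be suppressed or regrouped.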

**Code**
- Please provide the code for every output that needs to be exported and the code generating every table (csv) with underlying counts. It is important for the ADRF staff to have the code to better understand what exactly was done and to be able to replicate results. Understanding how research results are created is important in understanding the research output. Thus, it is important to document every step of the analysis in the Jupyter notebook. 

# Goal
In this notebook we will show you how to implement the above export rules while creating the necessary input files and supporting code to create beautiful visuals that are presentation ready. 

We will cover the following visualizations in this notebook:
- **Bar Plot**: visualizes relationships between numerical and categorical variables
- **Bar Plot with distribution bars**: visualizes relationships between numerical and categorical variables, with added bars showing the spread of the numerical variable within each category
- **Heat map**: commonly is used to show the "magnitude of a phenomenon as color in two dimensions" ([wikipedia](https://en.wikipedia.org/wiki/Heat_map)); however we will simplify it to show patterns of employment
- **Line Plot**: is commonly used for time series data to show how a variable changes over time

## Colorblind-friendly Palette
Throughout this notebook we will use colors from this colorblind-friendly selection:

"#009E73", "#0072B2", "#D55E00", "#CC79A7", "#999999", "#E69F00",  "#56B4E9", "#F0E442"

## R Setup

In [None]:
# switching off warnings
options(warn=-1)

#database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# For easier viewing of graphs:
# adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)

# switching warnings back on
options(warn=0)

# avoid scientific notation when printing numbers
options(scipen = 100) 

Next, we connect to the database.

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

# Summarize Information for the Cohort of interest
For our first visual, we will create a barplot that will summarise information about our cohort of interest. In this example we will visualize the average total months on TANF broken down by number of children comparing TANF recipients from our cohort who are on cases with one adult (Single Adult) and recipients from our cohort who are on cases with more than one adult (Not a Single Adult). As a reminder, our cohort consists of TANF recipients who exited in '2018 Q2'. 

#### Query the data

The number of children (**num_child**) and number of adults (**num_adult**) who are on each case are found in the case_month file. In the following query we will join together the tr_tdc_2022.dbo.**nb_cohort** table we created in the **02_Creating_a_cohort.ipynb**  notebook and the ds_in_fssa.dbo.**case_month** table.
- We will LEFT JOIN the case_month to the nb_cohort table by caseid and the year month variables (yr_month from nb_cohort and rptmn from case_month). Since we only have the last yr_month each person had data during '2018 Q2', we are only pulling data from the case_month file for the same year month. 

Note that: 
- Within `nb_cohort`, the `yr_month` variable is structured in the following way: '2018-05-01' 
- Within `case_month`, the `rptmn` variable is structured in the following way: '201805'

Since we are using these variables as part of our join statement, we will need to align how these are structured. For this purpose, we use the `FORMAT` function in SQL to convert the `yr_month` variable from 'YYYY-MM-DD' to 'YYYYMM' so that it follows the same format as `rptmn`. 
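
The same alignment can be checked on a single value in R before running the query. This is only an illustration of the idea: base R's `format()` with `"%Y%m"` stands in for the SQL `FORMAT` call, and the variable names are hypothetical.

```r
# A date structured like nb_cohort.yr_month
yr_month <- as.Date("2018-05-01")

# Drop the day and the separators so the value looks like case_month.rptmn
join_key <- format(yr_month, "%Y%m")

join_key              # "201805"
join_key == "201805"  # TRUE: the two keys now match
```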

In [None]:
query <- "
SELECT nb.ssn, nb.caseid, nb.tanf_total_months, nb.yr_month, cm.rptmn, cm.num_child, cm.num_adult
FROM tr_tdc_2022.dbo.nb_cohort nb
LEFT JOIN  ds_in_fssa.dbo.case_month cm
ON nb.caseid = cm.caseid AND FORMAT(nb.yr_month, 'yyyyMM') = cm.rptmn
;
"


cohort <- dbGetQuery(con, query)

head(cohort,5)

First, we will look at the number of children who are on the same case as the adults in our cohort.

In [None]:
cohort %>%
    count(num_child)

Next, we will look at the number of adults who are on the same case as the individuals in our cohort.

In [None]:
cohort %>%
    count(num_adult)

####  Regroup the data
As you can see, there are fewer than 10 individuals who have REDACTED, REDACTED or REDACTED children associated with their case. Thus, we will have to combine these categories to make it possible to export the data. 

We suggest the following regrouping: 1 child, 2 children, 3 children, and 4 or more children (4+). Note that this is a rougher grouping than necessary for export purposes. However, it will help communicate the information in a more visually appealing way. 

We will also want to regroup the number of adults. To do this we will create a binary variable called `single_adult` that is equal to 1 if there is only one adult on the case and 0 otherwise.

In [None]:
cohort2 <- cohort %>%
    # categorical variable for number of children associated with a case
    mutate(num_child = ifelse(num_child > 3, "4+", num_child),
           # categorical variable for number of adults associated with a case
           single_adult = ifelse(num_adult == 1, "Single Adult", "Not a Single Adult"))

Count the number of individuals in each `num_child` by `single_adult` group to make sure each group contains at least 10 individuals. 

In [None]:
cohort2 %>%
    group_by(num_child, single_adult) %>%
    summarise(count = n_distinct(ssn))

#### Prepare the Data 
Now that we have a sufficient number of individuals in each group, we will calculate the average number of months each group spent on TANF.

For each figure/table you create, you should prepare the data for the supplementary table, and then use that data to produce the figure/table.

> Since we are reporting the average number of months on TANF for the different groups, we will need to **round these averages to the nearest hundredth**. We will use the rounded averages in the figure and provide both the rounded and unrounded averages in the supplementary table.

In [None]:
Figure_1_data <- cohort2 %>%
    # grouping by children + adult categories
    group_by(num_child, single_adult) %>%
    # calculating average months
    summarise(average_months = mean(tanf_total_months),
              # rounding average months to nearest hundredth
              average_months_rounded = round(average_months, 2),
              # counting number of individuals 
              individuals = n_distinct(ssn)) %>%
    ungroup()

Figure_1_data
              
              

#### Save the Data 
Save the data frame Figure_1_data to a csv. 
> In order to submit the following figure for export, you would need this csv as the supporting table. This table will not be exported; it will be used by the export team to make sure the figure passes the disclosure requirements. As in this example, all supporting tables should be generated programmatically (using only code) and the associated figure should be generated only using data that correspond to that table.

> **Note: in order to save the data in this way, you will need a folder called "Data" in the same folder that contains your code/notebooks. If you copied over the Module 3 folder, the Data folder should already exist. If not, please first add a Data folder before running this code.**

In [None]:
write_csv(Figure_1_data, "Data/Figure_1_data.csv")

We define colors for each category below:

In [None]:
single_color <- c('Single Adult' = "#009E73",
                'Not a Single Adult' = "#0072B2") 
single_color

## Bar Plot

Once we have prepared our data, we can use the following code to create a bar plot that depicts the average number of months on TANF for each sub-category. When creating this graph, what we want to compare is the average months on TANF for single adults with non-single adults based on the different number of children categories. Since the comparison variable is `single_adult`, that is what we will input in the `fill` parameter. 

In [None]:
Figure_1 <- Figure_1_data %>%
               # num_child category on the x-axis
    ggplot(aes(x = num_child, 
               # rounded average months on the y-axis
               y = average_months_rounded, 
               # filling bars based on single_adult categories
               fill = single_adult)) +
    geom_bar(stat = "identity", position = 'dodge') +
    scale_fill_manual("", values = single_color) +
    labs(
        x = 'Number of Children', # labelling x axis
        y = 'Average Months on TANF', # labelling y axis
        # Add a title that conveys the main takeaway of the graph
        title = 'On Average, Single Adults Spend Less Time on TANF \n When There is More than One Child on a Case', # The \n splits the title into two lines 
        caption = 'Indiana TANF data (2018 Q2 Exiters)' # cite the source of your data
        ) 

#### Adjusting Font Sizes
In order to make the plot presentation ready, we advise using readable font sizes, as the image will be added to either a presentation or report. We use the following code to implement this:

In [None]:
Figure_1 <- Figure_1 +
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )

# Display the graph that we just created
Figure_1

Note, given the scarcity of non-single adult households, these differences could be driven by a few outliers and you should not be using these differences in averages to make policy recommendations. This is purely for pedagogical purposes.
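
Returning to the font-size step: since the same sizes are useful for any presentation figure, one option is to store the `theme()` call once and add it to each plot. A minimal sketch, where `presentation_text`, `toy`, and `toy_plot` are hypothetical names not used elsewhere in this notebook:

```r
library(ggplot2)

# Font sizes stored once so they can be added to any plot
presentation_text <- theme(
  legend.text  = element_text(size = 24), # legend text font size
  legend.title = element_text(size = 24), # legend title font size
  axis.text.x  = element_text(size = 24), # x axis label font size
  axis.title.x = element_text(size = 24), # x axis title font size
  axis.text.y  = element_text(size = 24), # y axis label font size
  axis.title.y = element_text(size = 24)  # y axis title font size
)

# Toy data (made up) to show the stored theme being added to a plot
toy <- data.frame(x = c("a", "b"), y = c(1, 2))
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_col() +
  presentation_text
```

With this approach, a figure would be updated with `Figure_1 + presentation_text` instead of repeating the full `theme()` call.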

# Fuzzy quantiles

You might not want a figure that only shows the average number of months, and instead want to convey information about the distribution of time on TANF for each subgroup. Next we will create a figure that is very similar to the one we just produced, except that each bar will represent the fuzzy median rather than the mean, and we will include distribution bars on each bar that represent the fuzzy 25th and 75th percentiles of the number of months for each subgroup.

#### Steps for Export
- Counts of individuals for each subgroup (this will be exactly the same as in the figure above)
    - Counts below 1000 should be rounded to the nearest ten
    - Counts above or equal to 1000 should be rounded to the nearest hundred
    
For this figure we need to calculate the fuzzy percentiles. 
- Fuzzy 25th percentile: Calculate the 20th and 30th percentiles and take the average 
- Fuzzy median : Calculate the 45th and 55th percentiles and take the average
- Fuzzy 75th percentile: Calculate the 70th and 80th percentiles and take the average

To get the average we add the percentiles together and divide by 2. For example:
$$fuzzy\  25th\ = \frac{20th + 30th}{2}$$

> If you are calculating the fuzzy percentiles for wage, you will need to round to the nearest hundred after calculating the fuzzy percentile.

> If you are calculating the fuzzy percentile for a number of individuals, you will need to round to the nearest 10 if the count is less than 1000 and to the nearest hundred if the count is greater than or equal to 1000.
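
The averaging step can be wrapped in a small helper. This is a sketch under the assumption that every fuzzy percentile uses the same 5-point window on either side, as in the three cases listed above; the name `fuzzy_quantile` and its `width` argument are hypothetical.

```r
# Average of the percentiles `width` below and `width` above the target p
fuzzy_quantile <- function(x, p, width = 0.05) {
  bounds <- quantile(x, probs = c(p - width, p + width), na.rm = TRUE)
  mean(unname(bounds))
}

# Toy data: the integers 1 to 101
months <- 1:101

fuzzy_quantile(months, 0.25)  # average of the 20th and 30th percentiles: 26
fuzzy_quantile(months, 0.50)  # average of the 45th and 55th percentiles: 51
fuzzy_quantile(months, 0.75)  # average of the 70th and 80th percentiles: 76

# For wages, round after computing the fuzzy percentile, for example:
# round(fuzzy_quantile(wages, 0.50), digits = -2)
```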

#### Calculate the Quantiles

We will be using the cohort2 data frame.

In [None]:
Figure2_data <-  cohort2 %>%
    group_by(num_child, single_adult) %>%
    summarise(individuals = n_distinct(ssn),
              # fuzzy 25th percentile
              fuzzy_25 = (quantile(tanf_total_months, .20) + quantile(tanf_total_months, .30))/2,
              # fuzzy median (50th percentile)
              fuzzy_median = (quantile(tanf_total_months, .45) + quantile(tanf_total_months, .55))/2,
              # fuzzy 75th percentile
              fuzzy_75 = (quantile(tanf_total_months, .70) + quantile(tanf_total_months, .80))/2) %>%
    ungroup()

Figure2_data

## Bar Plot with Distribution Bars

Once we have prepared our data and calculated fuzzy percentiles, we can use the following code to create a bar plot that depicts the (fuzzy) median number of months on TANF for each sub-category. Similar to what we did above, we want to compare the median months on TANF for single adults with non-single adults based on the different number of children categories. Since the comparison variable is `single_adult`, that is what we will input in the `fill` parameter. 

Furthermore, we use the `geom_errorbar` function to add distribution bars that reflect the (fuzzy) 25th and 75th percentiles of months on TANF. 

> Note: Just because you use the `geom_errorbar` function does not mean you have to depict standard errors. You may use it to depict other distribution information such as the 25th and 75th percentiles. 

In [None]:
Figure_2 <- Figure2_data %>%
    ggplot(aes(x = num_child, 
               y = fuzzy_median, 
               fill = single_adult)) +
    geom_bar(stat="identity", position='dodge') +
    # adding distribution bars
    geom_errorbar(aes(ymin = fuzzy_25, 
                      ymax = fuzzy_75),
                width = .2,                 
                position = position_dodge(.9)) +
    scale_fill_manual("", values = single_color) +
    labs(
        x = 'Number of Children', # labelling x axis
        y = 'Median Number of Months on TANF', # labelling y axis
        # Add a title that conveys the main takeaway of the graph
        title = 'Single Adults Spend Less Time on TANF \n When There is More than One Child on a Case', # The \n splits the title into two lines 
        caption = 'Indiana TANF data (2018 Q2 Exiters)' # cite the source of your data
        ) 
Figure_2

Adjust the font size using the code below:

In [None]:
Figure_2 <- Figure_2 +
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )

# Display the graph that we just created
Figure_2

# Patterns of Employment

## Employment Patterns by Quarters (Heatmap)

The next visualization in this notebook is a heatmap displaying our cohort of TANF recipients' employment patterns by quarter, focusing on the 8 most common patterns. We do not use a heatmap in the classic way where each "box" in the map corresponds to a proportion or number. Instead, we will use the heatmap as a format by which to map employment patterns, color-coding each box according to whether the pattern includes employment in a specific quarter. 

#### Query the Data

We will start by combining our cohort of TANF exiters with UI Wage Records. We do this for the nine quarters following our cohort's exit quarter ('2018 Q2'), which is just over 2 years. Thus, we will read in wage records from '2018 Q3' up to and including '2020 Q3'.

In [None]:
# Linking TANF and UI Wages over time
# Utilizing a sub-query to first filter out wage records
qry <- "SELECT nb.ssn, wr.Empr_no, wr.Wage, wr.yr_quarter
    FROM 
    tr_tdc_2022.dbo.nb_cohort nb
    LEFT JOIN 
    (   select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2', '2019 Q3','2019 Q4','2020 Q1','2020 Q2','2020 Q3')
        AND (SSN IN (SELECT DISTINCT(SSN) FROM tr_tdc_2022.dbo.nb_cohort))
    ) wr
    ON wr.SSN=nb.SSN;
"

cohort_wages <- dbGetQuery(con,qry)

head(cohort_wages)

#### Summarize the Data

Once we have our linked cohort-wages dataframe, we will summarise the data so that each person only has one observation per quarter. Here we are considering a person to be employed if they have any positive earnings (anything above zero) in a quarter. In your project you might want to set a higher threshold.

> Note: We create the `employed` variable as a character variable as it will make a better visualization later on. 
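
If you do want a higher earnings threshold, the flag can be written as a small function. This is only a sketch: `flag_employed` and its `threshold` argument are hypothetical names, and the wage values are made up.

```r
# "1" if the wage is above the threshold, "0" otherwise (including missing wages)
flag_employed <- function(wage, threshold = 0) {
  ifelse(!is.na(wage) & wage > threshold, "1", "0")
}

wages <- c(0, 250, NA, 4200)

flag_employed(wages)                    # "0" "1" "0" "1"
flag_employed(wages, threshold = 1000)  # "0" "0" "0" "1"
```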

In [None]:
cohort_wages2 <- cohort_wages %>%
# Create a variable employed, that is 1 for every positive wage and 0 otherwise
        mutate(employed = ifelse(Wage > 0, '1','0'),
               employed = ifelse(is.na(Wage), '0', employed)) %>%
    group_by(ssn, yr_quarter) %>%
# summarise employed so that each person has at most one observation per yr_quarter
    summarise(employed = max(employed)) %>%
    ungroup() 

head(cohort_wages2)

#### Restructure the Data
We will use the `pivot_wider` function to create a new column for every `yr_quarter`, so instead of a person having an observation for every quarter they are employed, each person will have only one observation, and there will be a 1 every quarter they are employed, and a 0 for quarters they are not employed. 

When using the `pivot_wider` function, a value of 'NA' will be assigned for every year-quarter an individual is not employed. We will replace these 'NA' values with 0. 

There will also be a new column "NA" since everyone in our cohort who does not have a job in the period after exit has a value of NA for `yr_quarter`. We remove this column once the data is in its wide format since the employment status is fully represented in the nine year-quarter columns. 
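
The behaviour described above can be seen on a toy data frame with made-up values. As an alternative way to fill the gaps and drop the extra column, this sketch uses the `values_fill` argument of `pivot_wider` and `any_of()`, neither of which is used in this notebook.

```r
suppressMessages(library(dplyr))
library(tidyr)

# Person C has no wage records, so yr_quarter is missing
toy_long <- tibble(
  ssn        = c("A", "A", "B", "C"),
  yr_quarter = c("2018 Q3", "2018 Q4", "2018 Q3", NA),
  employed   = c("1", "0", "1", "0")
)

toy_wide <- toy_long %>%
  # one column per quarter; quarters with no record are filled with "0"
  pivot_wider(names_from = yr_quarter,
              values_from = employed,
              values_fill = "0") %>%
  # drop the column created by the missing yr_quarter, if it exists
  select(-any_of("NA"))

toy_wide
```

Each person now has exactly one row, and person C shows "0" in every quarter.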

In [None]:
cohort_wages3 <- cohort_wages2 %>%
    arrange(yr_quarter) %>%
#Convert the data from long to wide 
    pivot_wider(names_from = yr_quarter, values_from = employed) %>%
# Remove the column NA 
    select(-'NA')
    
# replace NA values with 0
cohort_wages3[is.na(cohort_wages3)] <- '0'

# view the first 2 observations
head(cohort_wages3, 2)

Later in this notebook it will be problematic to have column names start with a number, so we will use the `names` function to assign new names to the data frame.

> Note: The order in which these columns are defined is important. Renaming the columns will not reorder them, so the new names assigned should align with the pre-existing information.
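
One way to avoid depending on column order is to derive each new name from the old one. This is a sketch on a toy vector of names, assuming the quarter columns are named like '2018 Q3'; `old_names` and `new_names` are hypothetical.

```r
# Names as they might appear after pivoting to wide format
old_names <- c("ssn", "2018 Q3", "2018 Q4", "2019 Q1")

# Swap "YYYY Qn" to "Qn_YYYY"; names that do not match the pattern are unchanged
new_names <- sub("^([0-9]{4}) (Q[1-4])$", "\\2_\\1", old_names)

new_names  # "ssn" "Q3_2018" "Q4_2018" "Q1_2019"
```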

In [None]:
names(cohort_wages3) <- c('ssn', 'Q3_2018', 'Q4_2018', 'Q1_2019', 'Q2_2019', 'Q3_2019', 'Q4_2019', 'Q1_2020', 'Q2_2020', 'Q3_2020')

head(cohort_wages3)

Assign the number of people in your cohort to `pop`; we will use this later.

In [None]:
pop <- length(unique(cohort_wages3$ssn))
pop

#### Creating Employment Patterns

To identify employment patterns, we group the data by the nine quarter-year variables to calculate the number of individuals and portion of the cohort who have each employment pattern. This will convert the data so that there is only one observation per unique employment pattern.

> Note: We will use `pop` to get the portion of the cohort who have each employment pattern.

In [None]:
employment_patterns <- cohort_wages3 %>%
# group the data by the nine quarter-year variables
    group_by(Q3_2018, Q4_2018, Q1_2019, Q2_2019, Q3_2019, Q4_2019, Q1_2020, Q2_2020, Q3_2020) %>%
# count the number of individuals per unique employment pattern
    summarise(Individuals = n_distinct(ssn)) %>%
    ungroup() %>%
# Using the pop variable we created in the previous code cell, calculate the portion of the cohort in each employment pattern
    mutate(portion = Individuals/pop) %>%
    arrange(desc(Individuals))
    
head(employment_patterns)          

#### Steps for Export
Now, if we want to create a figure that depicts unique patterns of employment, we will need to take the following steps:
- Get the number of employers that are associated with each employment pattern
- Select only those patterns that are associated with at least 10 individuals and 3 employers
- Round the counts of individuals and population (`pop`)
- Create a percent variable using the rounded counts of individuals and population (`pop`)
- Round the percent
- Make the visual

Getting the number of employers that are associated with each employment pattern is not a trivial task. First, we have to get the SSNs that are associated with each employment pattern, and then we have to link those SSNs back to their employers. 

### Number of Employers Associated with each Employment Pattern

Assign a pattern ID to each employment pattern. 

In [None]:
employment_patterns2 <- employment_patterns %>%
    mutate(pattern_ID = row_number()) 

head(employment_patterns2, 2)

Now that we have a pattern ID, we join the `employment_patterns2` data frame with the `cohort_wages3` data frame linking on the quarter-year variables. This will identify the `pattern_ID` each `ssn` is associated with.

Only keep the `pattern_ID` and `ssn` variables in this new data frame.

In [None]:
ssn_group_xwalk <- employment_patterns2 %>% 
    select(-c(Individuals, portion)) %>%
    right_join(cohort_wages3, by = c('Q3_2018', 'Q4_2018', 'Q1_2019', 'Q2_2019', 'Q3_2019', 'Q4_2019', 'Q1_2020', 'Q2_2020', 'Q3_2020'))  %>%
    select(pattern_ID, ssn)

head(ssn_group_xwalk, 2)

Use the `ssn` to link the `cohort_wages` data frame to the `ssn_group_xwalk` data frame. This will allow us to calculate the number of distinct employers associated with each employment pattern.
> We are using this particular cohort wages data since it is the only one that has all the employer numbers (Empr_no)
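
One detail worth checking when counting employers: `n_distinct()` counts a missing value as its own distinct value unless `na.rm = TRUE` is set, and cohort members with no wage records have a missing `Empr_no` after the left join. A toy illustration with made-up employer numbers (the data frame `toy_employers` is hypothetical):

```r
suppressMessages(library(dplyr))

# Pattern 1 has two real employers plus a missing value;
# pattern 2 (never employed) has only missing values
toy_employers <- tibble(
  pattern_ID = c(1, 1, 1, 2, 2),
  Empr_no    = c("E1", "E2", NA, NA, NA)
)

employer_counts <- toy_employers %>%
  group_by(pattern_ID) %>%
  summarise(with_na    = n_distinct(Empr_no),
            without_na = n_distinct(Empr_no, na.rm = TRUE))

employer_counts
```

Here `with_na` is 3 and 1 while `without_na` is 2 and 0, so the choice can matter when a pattern is close to the 3-employer minimum.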


In [None]:
employers_employment_pattern <- cohort_wages %>%
    left_join(ssn_group_xwalk,  by = 'ssn') %>%
    group_by(pattern_ID) %>%
    # counting the number of employers associated with each pattern
    # (na.rm = TRUE so that a missing Empr_no is not counted as an employer)
    summarise(Num_Employers = n_distinct(Empr_no, na.rm = TRUE))

head(employers_employment_pattern,2)

We link `employers_employment_pattern` to `employment_patterns2` data so that we have a variable for the number of employers associated with each pattern.

In [None]:
employment_patterns3 <- employment_patterns2 %>%
    left_join(employers_employment_pattern, by = "pattern_ID")

head(employment_patterns3, 10)

#### Select only those patterns that are associated with at least 10 individuals and 3 employers
As you can see there are several of these patterns that do not have a sufficient number of individuals associated with them to pass export review. In the figure we will only include patterns of employment that are associated with at least 10 people and 3 employers. 

In [None]:
employment_patterns4 <- employment_patterns3 %>%
    filter(Individuals >=10, 
          Num_Employers >= 3)
employment_patterns4

#### Preparing the Input File for Export
Now we will complete the last 3 steps for getting the input file ready for export:
- Round the counts of individuals and population (`pop`) 
- Create a percent variable using the rounded counts of individuals and population (`pop`)
- Round the percentage

In [None]:
employment_patterns4 <-employment_patterns4 %>%
# pull the pop variable into the data frame
    mutate(pop = pop, 
# Round population to the nearest 10 if it's less than 1000. Round it to the nearest 100 if it's 1000 or more
           pop_round = ifelse(pop < 1000, round(pop, digits = -1), round(pop, digits = -2)),
# Round individual count to the nearest 10 if it's less than 1000. Round it to the nearest 100 if it's 1000 or more
           Individuals_round = ifelse(Individuals < 1000, round(Individuals, digits = -1), round(Individuals, digits = -2)),
# Create a percent variable
            percent = 100*(Individuals_round/pop_round), 
# Round percent
            percent_round = round(percent, 0))

head(employment_patterns4, 2)   

#### Save the Data

> In order to submit the following figure for export, you would need this csv as the supporting table. This table will not be exported; it will be used by the export team to make sure the figure passes the disclosure requirements. As in this example, all supporting tables should be generated programmatically (using only code) and the associated figure should be generated only using data that correspond to that table.

In [None]:
write_csv(employment_patterns4, "Data/Figure3_data.csv")

### Preparing Data for Visualization

We are going to convert the data back into a long format using the `pivot_longer` function. This will create a categorical variable, `Quarter`, that contains the nine year-quarter variables we had as columns before and a variable, `Employed`, that holds the employment status for each pattern in each quarter. Lastly, we keep the variable `percent_round`, which has the (rounded) percentage of people from the cohort with each employment pattern. 
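
One thing to watch with the long format: `Quarter` is a character variable, so a plot axis would order it alphabetically (for example 'Q1_2019' before 'Q3_2018') rather than chronologically. A possible fix, sketched on a toy data frame with hypothetical names, is to convert it to a factor with explicit levels:

```r
# Quarters in chronological order
quarter_levels <- c("Q3_2018", "Q4_2018", "Q1_2019", "Q2_2019", "Q3_2019",
                    "Q4_2019", "Q1_2020", "Q2_2020", "Q3_2020")

# Toy long-format data (made up)
toy_long <- data.frame(
  Quarter  = c("Q1_2019", "Q3_2018", "Q4_2018"),
  Employed = c("1", "0", "1")
)

# Default character ordering is alphabetical
alphabetical <- sort(unique(toy_long$Quarter))

# A factor with explicit levels keeps the chronological order
toy_long$Quarter <- factor(toy_long$Quarter, levels = quarter_levels)
chronological <- levels(droplevels(toy_long$Quarter))

alphabetical   # "Q1_2019" "Q3_2018" "Q4_2018"
chronological  # "Q3_2018" "Q4_2018" "Q1_2019"
```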

In [None]:
Figure3_data <- employment_patterns4 %>%
    select(-c(Individuals, Num_Employers, pop, pop_round, Individuals_round, percent, pattern_ID, portion)) %>%
    pivot_longer(names_to = 'Quarter', values_to = 'Employed', -c(percent_round)) 

head(Figure3_data)

## Heat map

In [None]:
Figure_3 <- Figure3_data %>% 
              # Quarter on the x-axis
    ggplot(aes(x = Quarter, 
               # treat the rounded percent as an ordered category on the y-axis
               y = ordered(percent_round))) +    
    # fill the table with value from Employed column, create black contouring
    geom_tile(aes(fill = Employed), colour = 'black') +  
    # specify a color palette
    scale_fill_brewer("Employed", palette = "Paired") + 
    # move x axis labels to the top
    scale_x_discrete(position = 'top') + 
    # add axis labels, a title, and a caption
    labs(
        # Label Y axis
        y = "Employment - Percentages",
        # Label X axis
        x = "Quarter-Year" ,
        # Add a title that reflects the main takeaway of the figure
        title = "The most common employment pattern is no employment for 2 years after exit",
        # Cite the source of your data
        caption = "Indiana TANF data (2018Q2 Exiters) \n Indiana UI wage Records 2018Q3-2020Q")

Update the font size using the code below:

In [None]:
Figure_3 <- Figure_3 + theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )

# Display the graph that we just created
Figure_3

## Percent Employed Over Time 

We will create a line graph that depicts, broken down by **subgroups**, the percent of individuals from our cohort employed in '2018 Q3' who were employed over the year following exit. The information in this figure will be the same as in the last figure that we created in the `04_Characterizing_Demand_Beginner.ipynb` notebook. The focus of this exercise is to create the necessary supporting table and to improve the visual so that it better communicates relevant information.

#### Read in the Data

We read in the `merged_cohort_wages_growth` table that we created in the `04_Characterizing_Demand_Beginner.ipynb` notebook. 

In [None]:
query <- "
SELECT *
FROM tr_tdc_2022.dbo.merged_cohort_wages_growth
;
"


cohort_GrowthRate <- dbGetQuery(con, query)

head(cohort_GrowthRate, 5)

### Supporting Table

In the figure we will depict the portion of people employed over time for the different subgroups. Thus the supporting table **must** include the following: 
- count: The number of people employed in each quarter for each subgroup 
- num_employers: The number of Employers in each quarter for each subgroup 
- count_round: The number of people employed in each quarter for each subgroup rounded 
    - Counts below 1000 should be rounded to the nearest ten
    - Counts above or equal to 1000 should be rounded to the nearest hundred
- pop: The number of people included in each subgroup
- pop_round: The number of people included in each subgroup rounded
    - Population below 1000 should be rounded to the nearest ten
    - Population above or equal to 1000 should be rounded to the nearest hundred
- percent: The percent of people employed in each quarter for each subgroup
    - We have to use the rounded count and rounded population to calculate this
- percent_round: The percent of people employed in each quarter for each subgroup rounded
    - We have to round to the nearest full percent
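
The rounding rules in the list above can be sketched as two small helpers. This is only an illustration: `round_count` and `rounded_percent` are hypothetical names that do not appear elsewhere in this notebook, and the inputs are the example counts from the class guidelines.

```r
# Hypothetical helper: round a count following the class export rules
# (nearest ten below 1000, nearest hundred at or above 1000)
round_count <- function(x) {
    ifelse(x < 1000, round(x, digits = -1), round(x, digits = -2))
}

# Hypothetical helper: percent computed from the ROUNDED count and the
# ROUNDED population, then rounded to the nearest full percent
rounded_percent <- function(count, pop) {
    round(round_count(count) / round_count(pop) * 100, digits = 0)
}

round_count(868)            # 870
round_count(1868)           # 1900
rounded_percent(868, 1868)  # 870 / 1900 * 100 = 45.79, which rounds to 46
```

Counts that sit exactly on a rounding boundary (for example, 865) may not round up, because R's `round()` follows a round-half-to-even convention; check such cells by hand.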

In [None]:
Figue4_data <- cohort_GrowthRate %>%
    # number of distinct people in each subgroup
    group_by(emp_rate_cat) %>%
    mutate(pop = n_distinct(ssn)) %>%
    group_by(emp_rate_cat, yr_quarter) %>% 
    summarise(
        # number of distinct people in each quarter for each subgroup
        count = n_distinct(ssn), 
        # rounding count
        count_round = ifelse(count < 1000, round(count, digits = -1), 
                             round(count, digits = -2)),
        pop = unique(pop),
        # rounding population
        pop_round = ifelse(pop < 1000, round(pop, digits = -1), 
                           round(pop, digits = -2)),
        # percent is calculated from the rounded count and rounded population
        percent = (count_round / pop_round) * 100,
        # rounding percent
        percent_round = round(percent, digits = 0),
        # number of distinct employers in each quarter for each subgroup
        num_employers = n_distinct(Empr_no)
    )


Figue4_data
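
The supporting table records `count` and `num_employers` so that reviewers can confirm each cell meets the class thresholds (at least 10 individuals and at least 3 employers). Below is a minimal sketch of that check in base R. It uses a made-up toy table rather than the real data, and the column `passes_disclosure` is a hypothetical name introduced only for this illustration.

```r
# Toy supporting table; every value here is made up for illustration only
toy_table <- data.frame(
    emp_rate_cat  = c("High", "High", "Low"),
    yr_quarter    = c("2018 Q3", "2018 Q4", "2018 Q3"),
    count         = c(250, 8, 40),
    num_employers = c(12, 5, 2)
)

# A cell passes only if it covers at least 10 individuals AND at least 3 employers
toy_table$passes_disclosure <- toy_table$count >= 10 & toy_table$num_employers >= 3

# Suppress (set to NA) the rounded count for cells that fail the check
toy_table$count_round <- ifelse(
    toy_table$passes_disclosure,
    ifelse(toy_table$count < 1000,
           round(toy_table$count, digits = -1),
           round(toy_table$count, digits = -2)),
    NA
)

toy_table
```

In this toy table the second row fails on the individual threshold and the third row fails on the employer threshold, so both would need to be suppressed before export.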

### Save the Data
> In order to submit the following figure for export, you would need this csv as the supporting table. This table will not be exported; it will be used by the export team to make sure the figure passes the disclosure requirements. As in this example, all supporting tables should be generated programmatically (using only code), and the associated figure should be generated using only the data in that table.

In [None]:
write_csv(Figue4_data, "Data/Figure4_data.csv")
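
The call above writes into the `Data` folder, which needs to exist before the file can be saved. The sketch below, in base R, creates the folder if it is missing and then reads the file back to confirm that it was written. It uses a temporary folder and a made-up table so that it can run anywhere; in the notebook you would point it at `Data` and `Figue4_data` instead.

```r
# Made-up table standing in for the supporting table
toy_table <- data.frame(
    yr_quarter    = c("2018 Q3", "2018 Q4"),
    percent_round = c(46, 44)
)

# Temporary stand-in for the "Data" folder
out_dir <- file.path(tempdir(), "Data")
if (!dir.exists(out_dir)) {
    dir.create(out_dir)
}

# Write the table, then read it back
out_path <- file.path(out_dir, "Figure4_data.csv")
write.csv(toy_table, out_path, row.names = FALSE)
check <- read.csv(out_path)

# TRUE if the saved file has the same number of rows as the table
nrow(check) == nrow(toy_table)
```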

## Line Plot

We define colors that correspond to each type of employer (high growth, medium growth or low growth). This will be reflected in the line plot that we create.

In [None]:
Employer_color <- c(High = "#D55E00",
                Medium = "#CC79A7",
                   Low =  "#999999") 
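
When a named color vector is passed to a manual ggplot2 scale, each color is matched to a category by name, so the names need to match the labels in the data exactly. Below is a quick check, assuming (this is not confirmed by the data shown here) that `emp_rate_cat` takes the values `High`, `Medium`, and `Low`. `observed_categories` is a made-up stand-in for `unique(Figue4_data$emp_rate_cat)`.

```r
Employer_color <- c(High = "#D55E00", Medium = "#CC79A7", Low = "#999999")

# Made-up stand-in for unique(Figue4_data$emp_rate_cat)
observed_categories <- c("High", "Medium", "Low")

# Categories present in the data that have no color assigned (should be empty)
missing_colors <- setdiff(observed_categories, names(Employer_color))
missing_colors

# TRUE when every category has a color
length(missing_colors) == 0
```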

We create a line plot using the code below

In [None]:
Figure_4 <- Figue4_data %>%
    ggplot(aes(x = yr_quarter, 
               y = percent_round, 
               group = emp_rate_cat, 
               color = emp_rate_cat)) +
    geom_line(size = 1.3) + 
    geom_point(size = 5) + 
    expand_limits(y = 0) +
    labs(colour = "Employment Growth Rate") + # Change the title for the legend
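    # The plot maps color (not fill), so the colors are applied with scale_color_manual;
    # the names in Employer_color must match the values of emp_rate_cat
    scale_color_manual(values = Employer_color) +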
    labs(
        # Labelling x axis
        x = 'Quarter-Year', 
        # Labelling y axis
        y = 'Percent Employed', 
        # Add a title that conveys the main takeaway of the graph
        title = 'TANF Exiters who are employed by High-Growth employers are the least likely to remain employed', 
        # cite the source of your data
        caption = 'Indiana TANF data (2018Q2 Exiters) \n Indiana UI Wage records'
        )




We update the font size and display the line plot. 

In [None]:
Figure_4 <- Figure_4 + 
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )

# Display the graph that we just created
Figure_4

# **Saving Visuals**

In this section, we look at how to export our presentation-ready plots. We use ``ggsave`` to save a plot in PNG, JPEG, and PDF formats without losing quality. In the examples below we save `Figure_1`, but you would want to save each figure that you create for your project. 

### **PNG**

First, we provide an example of using ``ggsave`` with two parameters: `filename` and `plot`.

In [None]:
ggsave(
    filename = sprintf("Figures\\Figure_1.png"), # saving path
    plot = Figure_1 # plot name
)

This might not be the preferred way of saving a plot: when `width` and `height` are not supplied, `ggsave` uses the size of the current graphics device, which here is 6.67 x 6.67. We suggest opening the file we just saved. You will see that the labels are cluttered and the graph cannot be interpreted. Thus, we recommend setting the `width` and `height` parameters in addition to `filename` and `plot`.

In [None]:
ggsave(
    filename = sprintf("Figures\\Figure_1.png"), # saving path
    plot = Figure_1, # plot name
    width = 20, # width
    height = 12 # height
)

The code above saves the plot at a size that is easy to read. We reuse this code to save the plot in JPEG and PDF formats below:

### **JPEG**

In [None]:
ggsave(
    filename = sprintf("Figures\\Figure_1.jpeg"), # saving path
    plot = Figure_1, # plot name
    width = 20, # width
    height = 12 # height
)

### **PDF**

In [None]:
ggsave(
    filename = sprintf("Figures\\Figure_1.pdf"), # saving path
    plot = Figure_1, # plot name
    width = 20, # width
    height = 12 # height
)
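
Since the three `ggsave` calls differ only in the file extension, they can be collapsed into a loop. This is a sketch under some assumptions: it uses a stand-in plot built from R's built-in `mtcars` data and writes to a temporary folder, and `file.path()` is swapped in for the hard-coded `\\` separator so that the path also works outside Windows. `example_plot`, `out_dir`, and `saved_files` are illustrative names, not objects from this notebook.

```r
library(ggplot2)

# Stand-in plot built from the mtcars data that ships with R
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()

# Temporary stand-in for the "Figures" folder
out_dir <- file.path(tempdir(), "Figures")
dir.create(out_dir, showWarnings = FALSE)

# ggsave picks the output device from the file extension
formats <- c("png", "jpeg", "pdf")
saved_files <- character(0)
for (fmt in formats) {
    out_path <- file.path(out_dir, paste0("Figure_example.", fmt))
    ggsave(
        filename = out_path,  # saving path
        plot = example_plot,  # plot name
        width = 20,           # width
        height = 12           # height
    )
    saved_files <- c(saved_files, out_path)
}

saved_files
```

In the notebook, you would replace `example_plot` with each figure you want to export and `out_dir` with your `Figures` folder.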

## References:
- Presentation Preparation, Applied Data Analytics Training, National Center for Science and Engineering Statistics, 2021
- Characterizing Demand, Applied Data Analytics Training, TANF Data Collaborative, 2022
- Data Visualization, Applied Data Analytics Training, California, 2021