# Presentation Prep

Maryah Garner, Rukhshan Arif Mian, Allison Nunez

# Introduction:

The purpose of this checkpoint notebook is to apply the methods you saw in `05_Presentation_Prep_Beginner.ipynb` or `05_Presentation_Prep_Advanced.ipynb` to your cohort.

In the checkpoint notebook `02_Creating_a_cohort.ipynb`, we asked you to create and save your cohort as an SQL table. You will use that cohort in this checkpoint notebook (or an updated cohort you have created since). 

> Note: This checkpoint notebook has been created by keeping a TANF exit cohort in mind. We encourage you to reach out to your team facilitator to learn more about the methods you can use to Characterize Demand if your cohort focuses on TANF entry.

At each checkpoint you will be replacing the `___` with the appropriate variable, function or R code snippet. 

Participants are encouraged to attempt the checkpoints on their own. Hints and suggested solutions are also provided and can be accessed with the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, # refers to the checkpoint number. For example, we can access the hint and solution for Checkpoint 2 using `check_2.hint()` and `check_2.solution()` respectively. Note: The code for accessing hints and solutions is currently commented out; to run it, you need to uncomment it first. 


As a reminder, the export rules for this class are as follows:

# TDC 2022 Class Export Review Guidelines 

- **Each team will be able to export up to 10 figures/tables**
    
    
- **Every statistic for export should be based on at least 10 individuals and at least 3 employers**.
     - Statistics that are based on 0-9 individuals must be suppressed
     - Statistics that are based on 0-2 employers must be suppressed
    
    
- **All counts will need to be rounded**
    - Counts below 1000 should be rounded to the nearest ten
    - Counts greater than or equal to 1000 should be rounded to the nearest hundred
    > For example, a count of 868 would be rounded to 870 and a count of 1868 would be rounded to 1900

- **All reported wages will need to be rounded to the nearest hundred** 
    
- **All reported averages will need to be rounded to the nearest hundredth** 
   
   
- **All percentages and proportions need to be rounded**
    - The same rounding rule that is applied to counts must be applied to both the numerator and denominator
    - Percentages must then be rounded to the nearest percent
    - Proportions must be rounded to the nearest hundredth


- **Exact percentiles cannot be exported** 
    - Instead, for example, you may calculate a “fuzzy median”, by averaging the true 45th and 55th percentiles
       > If you are calculating the fuzzy percentiles for wage, you will need to round to the nearest hundred after calculating the fuzzy percentile.

       > If you are calculating the fuzzy percentile for a number of individuals, you will need to round to the nearest 10 if the count is less than 1000 and to the nearest hundred if the count is greater than or equal to 1000.
  
- **Exact maxima and minima cannot be exported**
    - Suppress maximum and minimum values in general. 
    - You may replace an exact maximum or minimum with a top-coded value or a fuzzy maximum or minimum value. For example: If the maximum value for earnings is 154,325, it could be top-coded as '100,000+'. And a fuzzy maximum value could be: 
    $$\frac{\text{95th percentile of earnings} + 154325}{2}$$
 
 
- **Complementary suppression**
    - If your figures include totals or depend on preceding or subsequent figures, you need to take into account complementary disclosure risks, that is, whether the figure totals, or the separate figures when read together, might disclose information about fewer than 10 individuals in the data in a way that a single, simpler table would not. Team facilitators and export reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
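
The rounding and fuzzy-statistic rules above can be sketched in base R as follows. This is only an illustration under stated assumptions, not an official implementation for the class: the helper names (`round_count`, `round_pct`, `round_wage`, `fuzzy_median`, `fuzzy_max`) and the toy wage vector are hypothetical, and your export reviewers have the final say on how each rule is applied.

```r
# Hypothetical helpers illustrating the export rounding rules (base R only).

# Counts below 1000 go to the nearest ten; 1000 and above go to the nearest hundred.
round_count <- function(n) {
  ifelse(n < 1000, round(n / 10) * 10, round(n / 100) * 100)
}

# Percentages: round the numerator and denominator first,
# then round the result to the nearest percent.
round_pct <- function(numerator, denominator) {
  round(100 * round_count(numerator) / round_count(denominator))
}

# Wages are rounded to the nearest hundred.
round_wage <- function(w) round(w / 100) * 100

# Fuzzy median: the average of the true 45th and 55th percentiles.
fuzzy_median <- function(x) {
  mean(quantile(x, probs = c(0.45, 0.55), na.rm = TRUE, names = FALSE))
}

# Fuzzy maximum: the average of the 95th percentile and the true maximum.
fuzzy_max <- function(x) {
  (quantile(x, probs = 0.95, na.rm = TRUE, names = FALSE) + max(x, na.rm = TRUE)) / 2
}

# Toy wage values, made up purely for illustration
toy_wages <- c(1200, 3400, 5600, 7800, 9100, 15400)

round_count(868)      # 870, matching the example in the guidelines
round_count(1868)     # 1900, matching the example in the guidelines
round_pct(868, 1868)  # 870 / 1900, then rounded to the nearest percent

# Round wages only after the fuzzy statistic has been calculated
round_wage(fuzzy_median(toy_wages))
round_wage(fuzzy_max(toy_wages))
```

Note that base R's `round()` rounds exact halves to the even digit, so it is worth checking edge cases (for example, a count ending in 5) against your reviewers' expectations.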


##  Supporting Documentation for Exports

For each exported figure, you will need to provide a table with **underlying counts** of individuals and employers for each statistic depicted in the figure. 

- You will need to include both the rounded and the unrounded counts of individuals.

- If percentages or proportions are to be exported, you must report both the rounded and the unrounded counts of individuals for the numerator and denominator. You must also report the counts of employers for both the numerator and the denominator.

**Code**
- Please provide the code for every output that needs to be exported and the code generating every table (csv) with underlying counts. It is important for the ADRF staff to have the code to better understand what exactly was done and to be able to replicate results. Understanding how research results are created is important in understanding the research output. Thus, it is important to document every step of the analysis in the Jupyter notebook. 
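
As a rough sketch of what a supporting table might look like, the base R example below builds a small table with unrounded counts, rounded counts, employer counts, and a flag for rows that fall below the export thresholds, then writes it to a CSV file. Every value is made up for illustration, and the column names, the `suppress` flag, and the output file name are assumptions rather than a required layout.

```r
# Toy supporting table; all numbers and labels are made up for illustration only
supporting <- data.frame(
  yr_quarter = c("quarter_a", "quarter_a", "quarter_b"),  # placeholder labels
  group      = c("Long Time", "Short Time", "Long Time"),
  count      = c(868, 7, 1868),   # unrounded counts of individuals
  count_empr = c(120, 2, 340)     # counts of employers
)

# Rounded counts: nearest ten below 1000, nearest hundred otherwise
supporting$count_round <- ifelse(supporting$count < 1000,
                                 round(supporting$count / 10) * 10,
                                 round(supporting$count / 100) * 100)

# Flag rows based on fewer than 10 individuals or fewer than 3 employers
supporting$suppress <- supporting$count < 10 | supporting$count_empr < 3

# Write the table to a CSV file (a temporary directory is used here as a stand-in)
out_file <- file.path(tempdir(), "supporting_table.csv")
write.csv(supporting, out_file, row.names = FALSE)

supporting
```

Flagged rows still need a decision about suppression and any complementary suppression, which is best made together with your team facilitator.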

## Color-blind friendly palette
Throughout this notebook, we will use colors from this color-blind friendly selection:

"#009E73", "#0072B2", "#D55E00", "#CC79A7", "#999999", "#E69F00",  "#56B4E9", "#F0E442"

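For convenience, the palette can be stored as a character vector so that colors are picked by position instead of retyping hex codes. The name `cb_palette` is a hypothetical choice and is not defined elsewhere in this notebook.

```r
# Hypothetical vector name; the hex codes are exactly the ones listed above
cb_palette <- c("#009E73", "#0072B2", "#D55E00", "#CC79A7",
                "#999999", "#E69F00", "#56B4E9", "#F0E442")

# Select colors by position, for example two colors for a two-category plot
cb_palette[c(2, 3)]
```
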
## R Setup

In [None]:
# switching off warnings
options(warn=-1)

#database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# For easier viewing of graphs
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)
options(warn=0)

source('05_Presentation_Prep_checkpoints_hints_solutions.txt')

# avoid scientific notation when printing numbers
options(scipen = 100) 

We connect to the database using the code below.

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Checkpoint 1: Pulling linked cohort-wage data


You will start by combining your cohort of TANF exiters with UI Wage Records. You will read in as many quarters as are relevant for your research. We have provided 4 blank spaces to fill in the quarters of interest, but you can use as many or as few quarters as you would like.

In [None]:
# Linking TANF and UI Wages over time
# Utilizing a sub-query to first filter out wage records
qry <- "SELECT nb.ssn, nb.tanf_total_months, wr.Empr_no, wr.Wage, wr.yr_quarter
    FROM 
    tr_tdc_2022.dbo.____ nb
    LEFT JOIN 
    (   select SSN, Empr_no, Year, Quarter, Wage, yr_quarter
        FROM tr_tdc_2022.dbo.wages_tanf
        WHERE yr_quarter IN ('___', '___', '___', '___')
        AND (SSN IN (SELECT DISTINCT(SSN) FROM tr_tdc_2022.dbo.____))
    ) wr
    ON wr.SSN=nb.SSN;
"

cohort_wages <- dbGetQuery(con,qry)

head(cohort_wages)

In [None]:
# hint
# check_1.hint()

In [None]:
# solution
# check_1.solution()
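
If you need more (or fewer) than four quarters, one option is to build the `IN (...)` list from an R vector instead of typing each value by hand. The sketch below uses only base R. The names `quarters`, `in_list`, and `where_clause` are hypothetical, and the quarter strings are placeholders rather than real `yr_quarter` values, so substitute values in the format your table actually uses.

```r
# Placeholder quarter labels; replace with real yr_quarter values from your data
quarters <- c("QTR_1", "QTR_2", "QTR_3", "QTR_4", "QTR_5")

# Wrap each value in single quotes and join with commas
in_list <- paste(sprintf("'%s'", quarters), collapse = ", ")

# Build the WHERE clause that could be pasted into the sub-query
where_clause <- sprintf("WHERE yr_quarter IN (%s)", in_list)
where_clause
```

Because this pastes text directly into SQL, it should only be used with values you typed yourself, never with untrusted input.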

### Produce summary statistics for time on TANF

#### Goals:
The overall goal of this notebook is to produce a visual that depicts the portion of your cohort that is employed over time, by subgroup, and to create a supplementary table for exporting purposes. 

We will walk through this example looking at how employment differs for those who are on TANF for a long time vs those who are on TANF for a short time. As seen in the code below, we consider someone who was on TANF for more than 6 months to have been on TANF for a long time and someone who was on TANF for 6 months or less to have been on TANF for a short time.

> We recommend working through this notebook using this example and then coming back and changing the constructed categorical variable to something that is more relevant for your research.

In [None]:
cohort_wages <- cohort_wages %>%
    mutate(tanf_time_cat = ifelse(tanf_total_months > 6, "Long Time", "Short Time"))

#### Summarise the Data

Once we have our linked cohort-wages dataframe, we will summarise the data so that each person only has one observation per quarter. Here, we are considering a person to be employed if they have any positive earnings (anything above zero) in a quarter. In your project you might want to set a higher threshold.

> Note: We create the `employed` variable as a character variable as it will make a better visualization later on. 

In [None]:
cohort_wages2 <- cohort_wages %>%
    # Create a variable employed that is '1' for every positive wage and '0' otherwise
    # (missing wages count as not employed)
    mutate(employed = ifelse(Wage > 0, '1', '0'),
           employed = ifelse(is.na(Wage), '0', employed)) %>%
    group_by(ssn, yr_quarter, tanf_time_cat) %>%
    # summarise employed so that each person has at most one observation per yr_quarter;
    # tanf_time_cat is a grouping variable, so it is kept automatically
    summarise(employed = max(employed))

head(cohort_wages2)
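
The text above notes that you might want a higher earnings threshold than zero. A minimal base R sketch of that idea, on made-up data, is shown here. It is a variation and not the notebook's method: it sums wages across employers within a person-quarter before comparing to the threshold, `min_wage` is a hypothetical parameter name, and the value 500 is arbitrary rather than a recommended cutoff.

```r
# Toy wage records, made up for illustration; person "a" has two employers in one quarter
toy <- data.frame(
  ssn        = c("a", "a", "b", "c"),
  yr_quarter = c("q1", "q1", "q1", "q1"),
  Wage       = c(300, 400, 50, NA)
)

min_wage <- 500  # arbitrary threshold, chosen only for this example

# Treat missing wages as zero earnings
toy$Wage[is.na(toy$Wage)] <- 0

# Total quarterly earnings per person, summed across employers
totals <- aggregate(Wage ~ ssn + yr_quarter, data = toy, FUN = sum)

# Employed if total quarterly earnings meet the threshold
totals$employed <- ifelse(totals$Wage >= min_wage, "1", "0")
totals
```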

Assign the number of people in your cohort to `pop`; we will use this later.

In [None]:
pop <- length(unique(cohort_wages2$ssn))
pop

## Checkpoint 2: Calculate Rounded Portion and Counts
Calculate the portion of your cohort employed each quarter using all of the appropriate rounding rules. The goal of this checkpoint is to help reinforce the export rules. You will be updating the blanks to:
- Round the count of individuals 
- Round the population count
- Round percentages

When rounding counts, you will have to identify the condition for which the different rounding rules apply and the appropriate level of rounding that needs to be applied. 

In [None]:
cohort_wages3 <- cohort_wages2 %>%
    group_by(tanf_time_cat) %>%  
    mutate(pop = n_distinct(ssn)) %>%
    ungroup() %>%
    group_by(yr_quarter, tanf_time_cat) %>%
    summarise(count = n_distinct(ssn),
               count_round = ifelse(count < ____, round(count, digits = ___), 
                                   round(count, digits = ___)),
              pop = unique(pop),
              pop_round = ifelse(pop < ___, round(pop, digits = ___), 
                round(pop, digits = ___)),
             percentage = 100*(count_round/pop_round),
             percentage = round(percentage, digits = __))

cohort_wages3

In [None]:
# hint
# check_2.hint()

In [None]:
# solution
# check_2.solution()

Next you will have to get the counts of employers. You need the quarterly number of employers for each group for the supporting table. Since you will not be exporting these counts, you do not need to apply any rounding rules. 

In [None]:
employer_stats <- cohort_wages %>%
    group_by(tanf_time_cat) %>%  
    mutate(pop_empr = n_distinct(Empr_no)) %>%
    ungroup() %>%
    group_by(yr_quarter, tanf_time_cat) %>%
    summarise(count_empr = n_distinct(Empr_no))

employer_stats

Join the employer counts with your cohort wages table.

In [None]:
cohort_wages4 <- cohort_wages3 %>%
    inner_join(employer_stats, by=c('tanf_time_cat', 'yr_quarter')) %>%
            filter(!is.na(yr_quarter))

head(cohort_wages4)

## Checkpoint 3: Choosing colors
Select colors for your categories that will be used in the visuals produced below. We recommend using 2 colors from the color-blind friendly palette provided at the beginning of this notebook. 

In [None]:
fill_color <- c('Long Time' = '____',
                'Short Time' = '___') 
fill_color

In [None]:
# hint
# check_3.hint()

In [None]:
# solution
# check_3.solution()

# Visualizations

We provide code to produce a bar plot and code to produce a line plot. You can choose to complete only one of these visuals. However, working through both of them might reinforce the general structure of `ggplot` and the similarities and differences from one plot type to another.

## Checkpoint 4A
In this checkpoint, we ask you to create a bar plot that: 
- Depicts the percentage of each subgroup that is employed on the y-axis
- Depicts the year-quarter on the x-axis
- Compares the group of individuals who were on TANF for a long time and the group of individuals who were on TANF for a short time

### Bar Plot

In [None]:
Figure_4a <- cohort_wages4 %>%
    ggplot(aes(x = ___, 
               y = ___, 
               fill = ___)) +
    geom_bar(stat="identity", position='dodge') +
    expand_limits(y = 0) +
    labs(colour = "___") + # Change the title for the legend
    scale_fill_manual("", values = fill_color) +
    labs(
        # Labelling x axis
        x = '___', 
        # Labelling y axis
        y = '___', 
        # Add a title that conveys the main takeaway of the graph
        title = '___', 
        # cite the source of your data
        caption = '___'
        )

In [None]:
Figure_4a <- Figure_4a +
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )


Figure_4a

In [None]:
# hint
# check_4a.hint()

In [None]:
# solution
# check_4a.solution()

## Checkpoint 4B
In this checkpoint, we ask you to produce a line graph depicting the same information as you did in the figure you created above. 

Depending on the time interval you selected, a line graph might be a better way to depict this information. 

As a reminder, we ask you to update the following blanks so that: 
- The percentage of each subgroup that is employed is on the y-axis
- The year-quarter is on the x-axis
- You are able to compare the group of individuals who were on TANF for a long time and the group of individuals who were on TANF for a short time

### Line Plot

In [None]:
Figure_4b <- cohort_wages4 %>%
    ggplot(aes(x = ___, 
               y = ___, 
               group = ___, 
               color = ___)) +
    geom_line(size = 1.3) + 
    geom_point(size = 5) + 
    expand_limits(y = 0) +
    labs(colour = "Employment Growth Rate") + # Change the title for the legend
    scale_color_manual("", values = fill_color) +
    labs(
        # Labelling x axis
        x = '____', 
        # Labelling y axis
        y = '____', 
        # Add a title that conveys the main takeaway of the graph
        title = '____', 
        # cite the source of your data
        caption = '___'
        )

In [None]:
Figure_4b <- Figure_4b +
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )


Figure_4b

In [None]:
# hint
# check_4b.hint()

In [None]:
# solution 
# check_4b.solution()

## References:
- Presentation Preparation, Applied Data Analytics Training, National Center for Science and Engineering Statistics, 2021
- Characterizing Demand, Applied Data Analytics Training, TANF Data Collaborative, 2022
- Data Visualization, Applied Data Analytics Training, California, 2021