<center><br><br>
    Arkansas Work-Based Learning to Workforce Outcomes <br>
    Applied Data Analytics Training | Spring 2022
    <h1> Presentation Preparation Checkpoints </h1>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Joshua Edelmann and Benjamin Feder</center>
</center>

***

# Introduction

The purpose of this checkpoint notebook is to apply the methods you saw in `05_Presentation_Prep.ipynb` to a potential visualization of your cohort. Here, we will produce an image depicting your cohort's wages by an aggregated race variable and confirm that the final visual will pass export review.

In the checkpoint notebook `02_Creating_a_cohort.ipynb`, you were asked to create and save your cohort as a SQL table. You will use that cohort (or an updated cohort you have since created) in this checkpoint notebook.

At each checkpoint, you will be replacing the `___` with the appropriate variable, function or R code snippet. 

You are encouraged to attempt the checkpoints on your own. Hints and suggested solutions are also provided and can be accessed with the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, `#` refers to the checkpoint number. For example, you can access the hint and solution for Checkpoint 2 with `check_2.hint()` and `check_2.solution()`, respectively.

> Note: This checkpoint notebook was designed with a cohort of apprenticeship completers in mind. If your cohort is defined differently, such as by apprenticeship starters, we encourage you to ask your team facilitator about the methods you can use. Also, the code for accessing hints and solutions is currently commented out; you will need to uncomment it before those cells will run.

As a reminder, the export rules for this class are as follows:

# AR 2022 Class Export Review Guidelines 

- **Each team will be able to export up to 10 figures/tables**
    
    
- **Every statistic for export must be based on at least 11 individuals and at least 3 employers (when using wage records)**
     - Statistics that are based on 0-10 individuals must be suppressed
     - Statistics that are based on 0-2 employers must be suppressed
    
    
- **All counts will need to be rounded**
    - Counts below 1000 should be rounded to the nearest ten
    - Counts greater than or equal to 1000 should be rounded to the nearest hundred
    > For example, a count of 868 would be rounded to 870 and a count of 1868 would be rounded to 1900

- **All reported wages will need to be rounded to the nearest hundred** 
    
- **All reported averages will need to be rounded to the nearest hundredth** 
   
- **All percentages and proportions need to be rounded**
    - The same rounding rule that is applied to counts must be applied to both the numerator and denominator
    - Percentages must then be rounded to the nearest percent
    - Proportions must be rounded to the nearest hundredth


- **Exact percentiles cannot be exported** 
    - Instead, you may calculate, for example, a “fuzzy median” by averaging the true 45th and 55th percentiles
       - If you are calculating a fuzzy percentile for wages, you will need to round to the nearest hundred after calculating the fuzzy percentile
       - If you are calculating a fuzzy percentile for a number of individuals, you will need to round to the nearest ten if the count is less than 1000 and to the nearest hundred if the count is greater than or equal to 1000
  
- **Exact maxima and minima cannot be exported**
    - Suppress maximum and minimum values in general
    - You may replace an exact maximum or minimum with a top-coded value or a fuzzy maximum or minimum value. For example, if the maximum value for earnings is 154,325, it could be top-coded as '100,000+', and a fuzzy maximum value could be: 
    $$\frac{\text{95th percentile of earnings} + 154325}{2}$$
 
 
- **Complementary suppression**
    - If your figures include totals or depend on preceding or subsequent figures, you need to take complementary disclosure risks into account: that is, whether the totals, or the separate figures read together, might disclose information about fewer than 11 individuals in a way that a single, simpler table would not. Team facilitators and export reviewers will offer guidance on implementing any necessary complementary suppression techniques
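
As a minimal sketch of the percentage, fuzzy percentile, and fuzzy maximum rules above, the following uses simulated wages rather than ADRF data. The variable names and simulated values are assumptions made for illustration; the rounded counts 870 and 1900 are taken from the count-rounding example in the guidelines.

```r
# Simulated quarterly wages for illustration only (not ADRF data)
set.seed(2022)
wages <- round(rlnorm(500, meanlog = 9, sdlog = 0.5))

# Fuzzy median: average of the true 45th and 55th percentiles,
# then rounded to the nearest hundred because it is a wage
fuzzy_median <- mean(quantile(wages, probs = c(0.45, 0.55)))
fuzzy_median_export <- round(fuzzy_median, digits = -2)

# Fuzzy maximum: average of the true 95th percentile and the true maximum,
# again rounded to the nearest hundred
p95 <- quantile(wages, probs = 0.95, names = FALSE)
fuzzy_max <- (p95 + max(wages)) / 2
fuzzy_max_export <- round(fuzzy_max, digits = -2)

# Percentage computed from already-rounded counts
# (868 -> 870 in the numerator, 1868 -> 1900 in the denominator),
# then rounded to the nearest percent
percentage_export <- round(100 * 870 / 1900)

c(fuzzy_median = fuzzy_median_export,
  fuzzy_max = fuzzy_max_export,
  percentage = percentage_export)
```

The fuzzy statistics are computed on the unrounded data first and rounded only at the end, which follows the order described in the guidelines.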

##  Supporting Documentation for Exports

For each exported figure, you will need to provide a table with **underlying counts** of individuals and employers (when appropriate) for each statistic depicted in the figure. 

- You will need to include both the rounded and the unrounded counts of individuals

- If percentages or proportions are to be exported, you must report both the rounded and the unrounded counts of individuals for the numerator and denominator. You must also report the counts of employers for both the numerator and the denominator when working with wage records

**Code**
- Please provide the code for every output that needs to be exported and the code generating every table (csv) with underlying counts. It is important for the ADRF staff to have the code to better understand what exactly was done and to be able to replicate results. Understanding how research results are created is important in understanding the research output. Thus, it is important to document every step of the analysis in the Jupyter notebook.
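
One possible way to produce a supporting table is sketched below. The group labels, employer counts, column names, and file name are hypothetical; the individual counts reuse the 868 and 1868 example from the export guidelines, and the file is written to a temporary directory only so that the sketch runs anywhere.

```r
# Hypothetical supporting table: one row per statistic shown in a figure
supporting_counts <- data.frame(
  group           = c("Group A", "Group B"),  # hypothetical subgroup labels
  count_unrounded = c(868, 1868),             # unrounded counts of individuals
  count_rounded   = c(870, 1900),             # rounded per the export rules
  employer_count  = c(45, 120)                # hypothetical employer counts
)

# Write the table to a CSV file (the file name is an assumption)
out_path <- file.path(tempdir(), "figure_1_supporting_counts.csv")
write.csv(supporting_counts, out_path, row.names = FALSE)

# Read it back to confirm the contents that would be submitted
check <- read.csv(out_path)
check
```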

# R Setup

In this notebook, we will use colors from this color-blind friendly selection:

"#009E73", "#0072B2", "#D55E00", "#CC79A7", "#999999", "#E69F00",  "#56B4E9", "#F0E442"

In [None]:
# switching off warnings
options(warn=-1)

#database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# For easier viewing of graphs:
# adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)

# switching warnings back on
options(warn=0)

source('05_Presentation_Prep_checkpoints_hints_solutions.txt')

# avoid scientific notation when printing numbers
options(scipen = 100) 

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

# Checkpoint 1: Pulling linked cohort-wage data


You will start by combining your cohort of apprenticeship completers with UI Wage Records. We have provided one blank to fill in with the name of your cohort table.

In [None]:
qry <- "
SELECT C.race,
F.Quarter_ID - P.Apprenticeship_End_Quarter_ID AS Quarters_Relative_to_Completion,
P.Person_ID,
F.Primary_Employer_Wages ,
PE.Federal_EIN,
C.apprnumber --pulling in for later
FROM 
tr_ar_2022.dbo.___ C --COHORT
JOIN tr_ar_2022.dbo.AR_MDIM_Person P ON (P.Apprentice_Number=C.apprnumber) --PERSON
JOIN tr_ar_2022.dbo.AR_FACT_Quarterly_Observation F --QUARTERLY OBSERVATION FACT
    ON (P.Person_ID=F.Person_ID) 
    AND (F.Quarter_ID BETWEEN (P.Apprenticeship_End_Quarter_ID + 1) AND (P.Apprenticeship_End_Quarter_ID+4))  --QTRS POST COMPLETION
JOIN tr_ar_2022.dbo.AR_RDIM_NAICS_National_Industry NNI ON (P.Apprenticeship_NAICS_National_Industry_ID=NNI.NAICS_National_Industry_ID) --APPRENTICESHIP INDUSTRY
JOIN tr_ar_2022.dbo.AR_MDIM_Employer PE ON (PE.Employer_ID=F.Primary_Employer_ID)  --PRIMARY EMPLOYER
"

# create binary race variable
cohort_wages <- dbGetQuery(con, qry) %>%
    mutate(binary_race = ifelse(race != "White" | is.na(race), "Non-White", "White"))
    
head(cohort_wages)

In [None]:
# hint
# check_1.hint()

In [None]:
# solution
# check_1.solution()
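
The recode used to build `binary_race` places individuals with a missing race in the "Non-White" group. A small sketch with made-up race values (an assumption, not ADRF data) shows why: `NA != "White"` evaluates to `NA`, but `NA | TRUE` evaluates to `TRUE`, so the `is.na()` term decides the outcome for missing values.

```r
# Made-up race values for illustration, including a missing value
race <- c("White", "Black", NA)

# Same logic as the mutate() call in Checkpoint 1
binary_race <- ifelse(race != "White" | is.na(race), "Non-White", "White")

binary_race
```

If missing values should be reported separately for your project, this condition would need to change before aggregating.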

## Create a wage indicator

Recall that we hope to visualize an employment outcome for our cohort by the aggregated race variable, **binary_race**. The employment outcome in this example is a binary indicator of whether an individual earned more than $5720 from their primary employer in a given quarter (approximate quarterly earnings at the 2022 Arkansas minimum wage).

We will walk through this example looking at how this outcome differs amongst White and Non-White individuals.

> Note: As with the other checkpoint notebooks, we recommend working through this example and then coming back and changing the constructed categorical variable to a more relevant one for your project. Additionally, note that the wage records have not been inflation adjusted and do not longitudinally track minimum wage in the state, so this is a rough estimate.
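
One way to arrive at a quarterly threshold of this size is sketched below. The hourly rate, the 40-hour week, and the 13-week quarter are assumptions made for illustration; the notebook itself does not state how the threshold was derived.

```r
# Assumed inputs (not stated in this notebook)
hourly_minimum_wage <- 11.00  # assumed 2022 Arkansas hourly minimum wage
hours_per_week      <- 40     # assumed full-time schedule
weeks_per_quarter   <- 13     # 52 weeks divided by 4 quarters

# Quarterly earnings for a full-time worker at the assumed hourly rate
quarterly_threshold <- hourly_minimum_wage * hours_per_week * weeks_per_quarter

quarterly_threshold
```

Under these assumptions, part-time workers earning above the hourly minimum could still fall below the quarterly threshold, which is another reason to treat the indicator as a rough estimate.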

In [None]:
cohort_wages <- cohort_wages %>% 
    mutate(
        wage_ind = ifelse(Primary_Employer_Wages > 5720, "Above Minimum Wage", "At or Below Minimum Wage")
    )

# Checkpoint 2: Calculate Rounded Percentages and Counts
Now that you have your employment outcome indicator, you will calculate the percentage of your cohort meeting this employment outcome by the binary race and quarter variables, applying all of the appropriate rounding rules. You will also count the number of employers, because the supporting table must show that each group is based on at least 3 employers. Remember that the goal of this checkpoint is to help reinforce the export rules. You will be updating the blanks to:

- Round the count of individuals 
- Round the population count
- Round percentages

> Note: When rounding counts, you will have to identify the condition for which the different rounding rules apply and the appropriate level of rounding that needs to be applied.

In [None]:
cohort_wages_checkpoint_2 <- cohort_wages %>%
    group_by(Quarters_Relative_to_Completion, binary_race) %>%  
    mutate(pop = n_distinct(Person_ID)) %>%
    ungroup() %>%
    group_by(Quarters_Relative_to_Completion, wage_ind, binary_race) %>%
    summarise(employer_count = n_distinct(Federal_EIN),
              count = n_distinct(Person_ID),
              count_round = ifelse(count < ___, round(count, digits = ___),
                                   round(count, digits = -2)),
              pop = unique(pop),
              pop_round = ifelse(pop < ___, round(pop, digits = ___),
                                 round(pop, digits = ___)),
              percentage = 100*(count_round/pop_round),
              percentage_round = round(percentage, digits = ___))

cohort_wages_checkpoint_2

In [None]:
# hint
# check_2.hint()

In [None]:
# solution
# check_2.solution()
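
Before plotting, it can help to confirm that every group meets the minimum thresholds for individuals and employers. The sketch below uses a hypothetical summary table that borrows the `count` and `employer_count` column names from the code above; the values and the `passes_review` column are made up for illustration.

```r
# Hypothetical summary table (values are made up, not ADRF data)
summary_tbl <- data.frame(
  binary_race    = c("Non-White", "White", "Non-White", "White"),
  count          = c(8, 152, 37, 240),   # unrounded counts of individuals
  employer_count = c(5, 61, 2, 98)       # distinct employers per group
)

# Thresholds from the export review guidelines
min_individuals <- 11
min_employers   <- 3

# Flag groups that satisfy both thresholds
summary_tbl$passes_review <- summary_tbl$count >= min_individuals &
  summary_tbl$employer_count >= min_employers

# Groups that would need to be suppressed before export
subset(summary_tbl, !passes_review)
```

The check is applied to the unrounded counts, since the thresholds refer to the number of individuals and employers actually underlying each statistic.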

# Checkpoint 3: Bar plot

In this checkpoint, we ask you to create a bar plot that: 

- Depicts the percentage of each subgroup that meets the previously established employment outcome on the y-axis
- Depicts the quarter relative to completion on the x-axis
- Compares White and Non-White individuals within each wage category (above versus at or below the minimum wage threshold)

Working through this checkpoint will reinforce the overall general structure of `ggplot()` and the ease of customizing your visual.

In [None]:
# color palette
fill_color <- c('Non-White' = '#009E73', 'White' = '#0072B2')

In [None]:
Figure_checkpoint_3 <- cohort_wages_checkpoint_2 %>%
    ggplot(aes(x = ___, 
               y = ___, 
               fill = ___)) +
    geom_bar(stat="identity", position='dodge') +
    facet_grid(. ~ fct_relevel(____, 'Above Minimum Wage', 'At or Below Minimum Wage')) +
    expand_limits(y = 0) +
    labs(colour = "___") + # Change the title for the legend
    scale_fill_manual('Race Category', values = fill_color) +
    labs(
        # Labelling x axis
        x = '___', 
        # Labelling y axis
        y = '___', 
        # Add a title that conveys the main takeaway of the graph
        title = '___', 
        # cite the source of your data
        caption = '___'
        )

In [None]:
Figure_checkpoint_3 <- Figure_checkpoint_3 +
   theme(
        legend.text = element_text(size = 24), # legend text font size
        legend.title = element_text(size = 24), # legend title font size
        axis.text.x = element_text(size = 24), # x axis label font size
        axis.title.x = element_text(size = 24), # x axis title font size
        axis.text.y = element_text(size = 24), # y axis label font size
        axis.title.y = element_text(size = 24) # y axis title font size
    )


Figure_checkpoint_3

In [None]:
# hint
# check_3.hint()

In [None]:
# solution
# check_3.solution()

# References
- Presentation Prep: Advanced Checkpoints, Applied Data Analytics Training, TANF Data Collaborative, 2022
- Presentation Prep: Beginner Checkpoints, Applied Data Analytics Training, TANF Data Collaborative, 2022
- Presentation Prep, Applied Data Analytics Training, Arkansas WBL, 2022