<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Benjamin Feder, Nathan Barrett </center>
<a href="https://doi.org/10.5281/zenodo.6412967"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6412967.svg" alt="DOI"></a>



# **Presentation Preparation**

At a certain point in your project work, you will begin thinking about how you might want to convey your results to an external audience. Additionally, you might be concerned that some of your results may not align with the regulations imposed by the data providers. This notebook provides information on how to prepare research output for the export process, from both a disclosure control and a production viewpoint.

In Module 2, we explored different earnings outcomes for a cohort of bachelor's degree recipients in Texas. Here, we will prepare some of those previously generated outputs, both tables and figures, for use in final presentations and reports. Additionally, we will ensure that each figure or table contains the information required to be accepted through the export review process.

## **Class-Specific Export Review Guidelines**

- Each team will be able to export up to 10 files - Teams are permitted to export summary data tables and figures

- Every statistic for export should be based on at least 10 individuals

- If wage records are used to generate a file, each statistic must also be based on at least 3 employers

- Counts must be rounded - 0-999 to the nearest 10 and above 999 to the nearest 100

- Percentages, proportions and ratios need to be rounded - First apply the rounding rules to both the numerator and the denominator, then calculate the statistic; percentages must be rounded to the nearest percent and proportions to the nearest hundredth (.01)

- Wages must be rounded to the nearest 100

- Exact percentiles cannot be exported - Instead, for example, you may calculate a “fuzzy median” by averaging the true 45th and 55th percentiles

- Exact maxima and minima cannot be exported - You may replace an exact maximum or minimum with a top-coded value

- Complementary suppression - If your files include values that are dependent on a separate file, you may need to take into account complementary disclosure risks: that is, whether the file’s values, or a combination of files when read together, might disclose information about fewer than 10 individuals in the data in a way that a single, simpler table would not. Team facilitators and export reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
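
As a minimal sketch of the top-coding rule above, the example below caps a vector of made-up wages at a hypothetical threshold. The `wages` values and the `top_code_cap` of 50,000 are assumptions for illustration only; an acceptable cap depends on your data and should be confirmed with your team facilitator or export reviewer.

```r
# Made-up quarterly wages for illustration only
wages <- c(3200, 8700, 12500, 41000, 97350)

# Hypothetical cap; the appropriate value depends on your data and the export rules
top_code_cap <- 50000

# Replace every value above the cap with the cap itself
wages_top_coded <- pmin(wages, top_code_cap)

# The reported maximum is now the cap rather than the exact maximum
max(wages_top_coded)
```

In a table or figure, the capped value would typically be labeled as "50,000 or more" so readers know it is not an exact maximum.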

### **Supporting documentation for exports**

For each exported file, you will need to provide a table with **unrounded underlying counts** of individuals and employers (if using wage records) for each cell/statistic, as well as one with **unrounded values of any other reported statistics**.

- If percentages or proportions are to be exported, you must report both the rounded and the unrounded counts of individuals for the numerator and denominator. You must also report the counts of employers for both the numerator and the denominator if wage records were used.
   
- Please provide the code used to generate each file. It is important for the ADRF staff to have the code to better understand the creation method. Clearly-commented code files will decrease review time.

A more detailed description of the rules for exporting results can be found in the [export documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before preparing an export submission, please read through the export documentation. Your team lead is expected to spearhead the export submission process.

## **Learning Objectives**

We will prepare both tables and visualizations for potential export in answering the following question:

- Do the average/median quarterly earnings and number of individuals employed by quarter in our cohort vary by major?

After completing this notebook, you should know how to disclosure-proof both tables and figures and how to optimize visualizations for final products.

## **Import Packages and Set Up**

In [None]:
# insert ADRF username Firstname.Lastname.UserID
username <- "____"

In [None]:
# Database interaction imports
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

# For easier viewing of graphs
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)
options(warn=0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

We will also load in our cohort and cohort's employment information tables.

In [None]:
# load in cohort
qry <- 
"
SELECT * 
FROM tr_tx_2021.dbo.grads15
"

df_cohort <- dbGetQuery(con, qry)

In [None]:
# read in linked wage records
qry <- "
SELECT *
FROM tr_tx_2021.dbo.nb_cohort_wages_link
"
df_wages_undup <- dbGetQuery(con, qry)

## **Do the average quarterly earnings and number of individuals employed by quarter in our cohort vary by major?**

Recall this question from the Data Exploration: Wages and Data Visualization notebooks, which answered it with a table and an image, respectively. Here, we will walk through preparing both for production and export review.

### **Creating the Unrounded Table**

Recall the code from the Data Exploration: Wages notebook that produces the unrounded table. **You will always need to provide an unrounded table as part of the export submission.**

> Note: Whenever pulling information from the wage records, remember to include individual and employer information, as they will be required to check if the statistics in question pass the disclosure thresholds. Our previously-generated table `nb_cohort_wages_link` already contains employer information in the variable `empr_no`.

In [None]:
# get quarter from graduation
df_wages_undup <- df_wages_undup %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    )

# ignore quarters 0 and 13
df_wages_undup <- df_wages_undup %>%
    filter(!(quarter_number %in% c(0, 13)))

In [None]:
# df_cohort: Create a 2 digit CIP program code from the full CIP code in `gradmaj`
df_cohort <- df_cohort %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

# df_wages_undup: Create a 2 digit CIP program code from the full CIP code in `gradmaj`
df_wages_undup <- df_wages_undup %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )
    
# load CIP crosswalk into R
qry <- "
SELECT *
FROM ds_public_1.dbo.cip_lookup
"
cip_lookup <- dbGetQuery(con, qry)

# only select 2010 columns
cip_lookup <- cip_lookup %>%
    select(ends_with("2010"))


# 5 most common majors
com_majors <- df_cohort %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

In [None]:
# earnings and number employed for most common majors
# first find quarterly wages for each person while including major
# then find average wages within groups
avg_and_num_major <- df_wages_undup %>%
    filter(CIP_Program %in% com_majors$CIP_Program) %>%
    group_by(gradid, quarter_number, CIP_Program) %>%
    summarize(
        total_wages = sum(wage)
    ) %>%
    ungroup() %>%
    group_by(CIP_Program, quarter_number) %>%
    summarize(
            mean_wage = mean(total_wages),
            n_employed = n_distinct(gradid)
        )  %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010")) %>%
    ungroup()


avg_and_num_major

### **Finding Individual Counts**

From the table above, we can see that we have produced the individual counts per grouping and that all categories satisfy the individual count requirement.

### **Finding Employer Counts**

Since we are working with wage records, we also need to make sure each row meets the employer count requirement.

> Note: If you are exporting a table that does not contain information from wage records, you do not need to report underlying employer counts.

In [None]:
# find number of unique employers within each major/quarter 
avg_and_num_major_employers <- df_wages_undup %>%
    filter(CIP_Program %in% com_majors$CIP_Program) %>%
    group_by(CIP_Program, quarter_number) %>%
    summarize(
        number_employers = n_distinct(empr_no)
    ) %>% 
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010")) %>%
    ungroup()

avg_and_num_major_employers

We can confirm that all groups also satisfy the employer count requirement. Let's save both the underlying employer counts as well as the original table that contains the unrounded individual counts and additional statistics.

> If a subgroup did not meet a requirement, we could either drop the category or aggregate it further (e.g. combine it with another major).
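
The aggregation option could look something like the sketch below. The `records` data, its column names, and the "Physical Sciences" grouping are all made up for illustration; they are not part of the course data. The point is that counts for the combined group are recomputed from the underlying records rather than summed.

```r
library(dplyr)

# Made-up person-level records: one row per person, with an employer
records <- tibble(
    person_id = 1:27,
    major = c(rep("Business", 15), rep("Physics", 4), rep("Chemistry", 8)),
    employer_id = c(rep(1:5, 3), 6, 6, 7, 7, rep(8:11, 2))
)

# Hypothetical grouping: fold the two small majors into one broader category
records_agg <- records %>%
    mutate(
        major_group = ifelse(major %in% c("Physics", "Chemistry"), "Physical Sciences", major)
    )

# Recompute distinct counts from the records instead of adding group counts,
# since the same employer could appear in more than one original group
counts_agg <- records_agg %>%
    group_by(major_group) %>%
    summarize(
        n_individuals = n_distinct(person_id),
        n_employers = n_distinct(employer_id)
    ) %>%
    ungroup()

counts_agg
```

Here, Physics alone (4 individuals, 2 employers) would fail both thresholds, while the combined group passes them.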

### **Saving Unrounded and Underlying Counts Tables**

In [None]:
# Save the unrounded table for export that also has individual counts
avg_and_num_major %>% 
    write_csv(sprintf("U:\\%s\\TX Training\\Output\\avg_and_num_major_unrounded.csv", username))

In [None]:
# Save the underlying employer counts
avg_and_num_major_employers %>%
    write_csv(sprintf("U:\\%s\\TX Training\\Output\\employer_counts_for_avg_and_num_major.csv", username))

### **Applying Primary Suppression**

We already verified that all cells met the export conditions based on their individual and employer counts. However, if we were working with a table that did not meet these standards or was too long to inspect row by row in the notebook, we could use the following code to suppress the necessary cells.

> Note: We have purposely included this step after saving the unrounded and underlying counts tables because even if suppression is required, the ADRF staff will need the complete unsuppressed and unrounded counts.

In [None]:
# set all cells that must be suppressed because of individual counts to NA
# note: don't need to set the groupings (major and quarter) to NA because they are the natural groupings, not what we are reporting
avg_and_num_major <- avg_and_num_major %>% 
    mutate(
        mean_wage = ifelse(n_employed < 10, NA, mean_wage),
        n_employed = ifelse(n_employed < 10, NA, n_employed)
    )

# see that no cells were suppressed based on individual counts
avg_and_num_major %>%
    filter(is.na(mean_wage))

In [None]:
# set all cells that must be suppressed because of employer counts to NA
# first find all rows that don't satisfy employer counts
# then join that to original data frame and preserve all rows from original data frame
# set the statistics to NA in every row that matched a failing employer count
avg_and_num_major <- avg_and_num_major_employers %>% 
    filter(number_employers < 3) %>%
    right_join(avg_and_num_major, by = c("CIP_Program","quarter_number", "CIPTitle2010")) %>%
    mutate(
        mean_wage = ifelse(is.na(number_employers), mean_wage, NA),
        n_employed = ifelse(is.na(number_employers), n_employed, NA)
    ) %>% 
    select(-number_employers)

# see that no cells were suppressed based on employer counts
avg_and_num_major %>%
    filter(is.na(mean_wage))

### **Applying Complementary Suppression**

You will not be able to properly evaluate a single file for any potential complementary disclosure issues until all of your files requested for export are created. Your team lead will help you with the process, but you can begin to look for complementary disclosure by comparing totals reported within subgroups across files.
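
To illustrate what comparing totals across files means, here is a small sketch with made-up numbers, ignoring rounding for simplicity. The variable names and values are assumptions for illustration: one file reports a total, and a second file reports the same population by subgroup with one cell suppressed.

```r
# Made-up values: a total from one file and subgroup counts from another
total_in_file_a <- 120
subgroups_in_file_b <- c(Business = 70, Engineering = 45, Physics = NA)  # NA = suppressed cell

# Anyone holding both files can recover the suppressed cell by subtraction
implied_suppressed <- total_in_file_a - sum(subgroups_in_file_b, na.rm = TRUE)
implied_suppressed

# TRUE means the recovered value falls below the individual threshold of 10
implied_suppressed < 10
```

In a case like this, an additional cell (or the total) would also need to be suppressed so that the small count cannot be derived.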

### **Rounding Table**

As a final step in preparing this table for export, we must apply the rounding rules required by the export guidelines. Recall that all counts less than 1000 must be rounded to the nearest 10, counts of at least 1000 must be rounded to the nearest 100, and wages must be reported to the nearest 100. 

In [None]:
# apply proper rounding rules
avg_and_num_major_rounded <- avg_and_num_major %>%
    mutate(
        mean_wage = round(mean_wage, digits = -2),
        n_employed = ifelse(n_employed < 1000, round(n_employed, digits = -1), round(n_employed, digits = -2))
    )

Now we can save our suppressed and rounded table that is ready for export.

In [None]:
# Save final table for export
avg_and_num_major_rounded %>% 
    write_csv(sprintf("U:\\%s\\TX Training\\Output\\avg_and_num_major_rounded.csv", username))

## **Plotting**

Prior to creating any potentially presentation-ready visualization, we *highly recommend* that you check to make sure your underlying tables satisfy the disclosure requirements, as you may have to manipulate the underlying data to avoid or apply suppression. The approach for disclosure-proofing visualizations is similar to the one for tables: it all pertains to the underlying data used to create the visualization.

Please refer to the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/) for guidance on preparing other types of visualizations for disclosure review.

> The most common type of visualization that requires less obvious manipulation of the underlying data to satisfy export constraints is a *histogram*, as each bar in the histogram is technically a separate grouping. Therefore, each bar must be based on at least 10 individuals and, if wage records are used, at least 3 employers.
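
One way to check a histogram before plotting is to count the individuals that fall in each bin, as in the sketch below. The `wages` values and `bin_breaks` are made up for illustration, and the example assumes one value per individual; with wage records, a similar count of distinct employers per bin would also be needed.

```r
# Made-up wage values, one per individual, for illustration only
wages <- c(rep(2500, 12), rep(7500, 15), rep(12500, 11), rep(17500, 4))

# Hypothetical bin edges; these would be the same breaks used for the histogram
bin_breaks <- seq(0, 20000, by = 5000)

# Count individuals in each bin
bin_counts <- table(cut(wages, breaks = bin_breaks))
bin_counts

# Bins with fewer than 10 individuals would need wider bins or suppression
names(bin_counts)[bin_counts < 10]
```

In this example the highest bin holds only 4 individuals, so it could be merged with the neighboring bin by widening the breaks.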

### **Line Plot**

We will work on presenting the information previously saved as a table (`avg_and_num_major_rounded`) as a line plot, just as we did in the Data Visualization notebook. With a line plot, each point constituting each line must pass the export criteria.

In [None]:
# code from data viz notebook

#Specify source dataset and x, y, and color variables
# only take first word in title because the full title makes the graph harder to read
# you can also use all words but make the legend font smaller
avg_wages_major_plot <- ggplot(avg_and_num_major_rounded, aes(x = quarter_number, y = mean_wage, color = word(CIPTitle2010, 1))) + 

#Plots a line on the graph
geom_line(size=1) +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(4000, 20000, 2000),
    limits = c(4000,20000)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Use the scale_colour_brewer function to use colors that are 
#easier to distinguish for those who are color blind
scale_colour_brewer(
  type = "qual",
  palette = 1,
  direction = 1,
  aesthetics = "colour"
) +

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "TX Graduates majoring in Engineering experienced REDACTED earnings after graduation",
    x = "Quarter after Graduation", y = "Average Quarterly Wages",
    caption = "Data Sources: TWC & THECB",
    color = "Major") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)

### **Adjusting Font Sizes**

In order to make the plot presentation ready, we advise using readable font sizes, as the image will be added to either a presentation or report. We use the following code to implement this:

In [None]:
avg_wages_major_plot <- avg_wages_major_plot + 
    theme(
        legend.text = element_text(size=24), # legend text font size
        legend.title = element_text(size=24), # legend title font size
        axis.text.x = element_text(size=24), # x axis label font size
        axis.title.x = element_text(size=24), # x axis title font size
        axis.text.y = element_text(size=24), # y axis label font size
        axis.title.y = element_text(size=24) # y axis title font size
    )

avg_wages_major_plot

Please keep in mind that any statistics (e.g. for counts of subpopulations) have to comply with the export thresholds -  not only the **counts of individuals** (**at least 10**), but also the **counts of employers** (**at least 3**).

## **Fuzzy Percentiles**

Instead of finding the average quarterly earnings for those who graduated with the most common majors, let's say we wanted to report the median quarterly earnings to avoid any potential outlier influence. Since the median is the exact 50th percentile, and exact percentiles cannot be exported, tables or visualizations based on true medians are not permitted. To get a sense of percentiles in the data, fuzzy percentiles can be exported instead; these are created by averaging two true percentiles. For example, a fuzzy median can be created by averaging the true 45th and 55th percentiles. This section provides a template for creating fuzzy percentiles so that they can pass the disclosure review process.

In [None]:
# find true median earnings and number employed for most common majors
# first find quarterly wages for each person while including major
# then find median wages within groups
avg_and_num_major_median <- df_wages_undup %>%
    filter(CIP_Program %in% com_majors$CIP_Program) %>%
    group_by(gradid, quarter_number, CIP_Program) %>%
    summarize(
        total_wages = sum(wage)
    ) %>%
    ungroup() %>%
    group_by(CIP_Program, quarter_number) %>%
    summarize(
            median_wage = median(total_wages),
            n_employed = n_distinct(gradid)
        )  %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010")) %>%
    ungroup()


avg_and_num_major_median

As mentioned above, to export median wages (50th percentile), the medians need to be replaced with fuzzy percentiles that average nearby percentiles. For a fuzzy median, we will take the average of the true 45th and 55th percentiles.

In [None]:
# find fuzzy median earnings and number employed for most common majors
# first find quarterly wages for each person while including major
# then average the true 45th and 55th percentiles of wages within groups
avg_and_num_major_fuzzy_median <- df_wages_undup %>%
    filter(CIP_Program %in% com_majors$CIP_Program) %>%
    group_by(gradid, quarter_number, CIP_Program) %>%
    summarize(
        total_wages = sum(wage)
    ) %>%
    ungroup() %>%
    group_by(CIP_Program, quarter_number) %>%
    summarize(
            fuzzy_median_wage = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
            n_employed = n_distinct(gradid)
        )  %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010")) %>%
    ungroup()


avg_and_num_major_fuzzy_median

Based on a quick glance, we can see that the reported fuzzy median wages do not deviate much from the true medians in this example. Since we have already found the underlying individual and employer counts, and have applied primary suppression (found nothing to suppress), we can finalize this table by rounding the fuzzy median and number employed values.

In [None]:
# apply proper rounding rules
avg_and_num_major_fuzzy_median_rounded <- avg_and_num_major_fuzzy_median %>%
    mutate(
        fuzzy_median_wage = round(fuzzy_median_wage, digits = -2),
        n_employed = ifelse(n_employed < 1000, round(n_employed, digits = -1), round(n_employed, digits = -2))
    )

head(avg_and_num_major_fuzzy_median_rounded)

Now we can save our suppressed and rounded table that is ready for export.

In [None]:
# Save final table for export
avg_and_num_major_fuzzy_median_rounded %>% 
    write_csv(sprintf("U:\\%s\\TX Training\\Output\\avg_and_num_major_fuzzy_median_rounded.csv", username))

Do not forget to save the unsuppressed and unrounded version as well.

In [None]:
# Save unsuppressed, unrounded table for export
avg_and_num_major_fuzzy_median %>% 
    write_csv(sprintf("U:\\%s\\TX Training\\Output\\avg_and_num_major_fuzzy_median_unrounded.csv", username))

## **Proportions/Percentages/Ratios**

When working with percentages (both numerically and visually), you must show the underlying numerator and denominator counts. The following are the rules for calculating any percentages/proportions/ratios:
- You must provide unrounded counts for both the numerator and denominator. They must both meet the threshold for each group, bar, or point on a line specified in the disclosure review (in this class, at least 10 individuals and potentially at least 3 employers, depending on the data source).
- Round the numerator and denominator before calculating any percentage, proportion, or ratio. Then a percentage, proportion, or ratio can be calculated and must be rounded afterwards.
- Do not report 0% or 100%.
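
The order of operations in these rules can be sketched as follows. The counts are made up for illustration, and `round_count` is a hypothetical helper written for this example (it mirrors the count-rounding rule used earlier in this notebook), not a function from any package.

```r
# Count-rounding rule from the export guidelines:
# below 1,000 to the nearest 10, otherwise to the nearest 100
round_count <- function(n) {
    ifelse(n < 1000, round(n, digits = -1), round(n, digits = -2))
}

# Made-up unrounded counts for illustration only
numerator <- 437      # e.g. individuals employed in a quarter
denominator <- 1268   # e.g. individuals in the cohort

# Round both counts first
numerator_rounded <- round_count(numerator)       # 440
denominator_rounded <- round_count(denominator)   # 1300

# Then calculate the statistic from the rounded counts and round it
pct <- round(100 * numerator_rounded / denominator_rounded)   # nearest percent
prop <- round(numerator_rounded / denominator_rounded, 2)     # nearest hundredth

pct
prop
```

The supporting documentation would still list the unrounded 437 and 1268 alongside the rounded counts.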

## **Unsupervised Machine Learning**

Exporting clusters must be treated as any other grouping variable, as each cluster must satisfy a minimum number of individuals and (when applicable) employers to pass disclosure control.
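
As a rough sketch of that check, the example below counts distinct individuals and employers per cluster. The `clustered` data frame and its columns are made up for illustration, and the cluster labels are assumed to have been assigned already by whatever clustering method you used.

```r
library(dplyr)

# Made-up records with a cluster label already assigned
clustered <- tibble(
    person_id = 1:30,
    employer_id = rep(1:6, 5),
    cluster = c(rep(1, 14), rep(2, 11), rep(3, 5))
)

# Count distinct individuals and employers in each cluster,
# then flag whether the cluster meets both thresholds
cluster_counts <- clustered %>%
    group_by(cluster) %>%
    summarize(
        n_individuals = n_distinct(person_id),
        n_employers = n_distinct(employer_id)
    ) %>%
    ungroup() %>%
    mutate(passes = n_individuals >= 10 & n_employers >= 3)

cluster_counts
```

In this toy data, the third cluster contains only 5 individuals, so any statistics describing it could not be exported as is.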

## **Supervised Machine Learning**


After creating your training and test datasets, you must include the counts of each variable, and please do not alter the datasets after doing so. If you use any dummy variables, you need to provide the counts of 0s and 1s for each dummy variable.
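
One possible way to produce the counts of 0s and 1s is sketched below. The `train` data frame and its two dummy variables are made up for illustration and are far smaller than a real training set.

```r
# Made-up training data with two dummy variables, for illustration only
train <- data.frame(
    employed_q1 = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0),
    stem_major  = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
)

# Counts of 0s and 1s for each dummy variable
# (fixing the factor levels keeps a row for a value even if it never occurs)
dummy_counts <- sapply(train, function(x) table(factor(x, levels = c(0, 1))))
dummy_counts
```

The result is a small matrix with one column per dummy variable and one row each for 0 and 1, which can be saved as part of the supporting documentation.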

Remember that a plot of y-scores is still a histogram. Each estimate represents an individual data point, so the plot needs to comply with the disclosure threshold described above: each bin must consist of at least 10 individuals. Please refer to the disclosure review documentation for more information about exporting histograms.

## **Saving Visuals**

In this section, we look at the best ways to export our presentation-ready plot. We use ``ggsave`` to save our plot in PNG, JPEG, and PDF formats without losing quality.

### **PNG**

First, we provide an example of using ``ggsave`` with two parameters: `filename` and `plot`.

In [None]:
ggsave(
    filename = sprintf("U:\\%s\\TX Training\\Output\\avg_wages_major_plot.png", username), # saving path
    plot = avg_wages_major_plot # plot name
)

This might not be the preferred way of saving a plot: when `width` and `height` are not supplied, `ggsave` uses the size of the current graphics device, which here is 6.67 x 6.67. We suggest looking at the file we just saved in its respective path. You will see that the labels are cluttered and the graph cannot be interpreted. Thus, we recommend using the `width` and `height` parameters in addition to `filename` and `plot`.

In [None]:
ggsave(
    filename = sprintf("U:\\%s\\TX Training\\Output\\avg_wages_major_plot.png", username), # saving path
    plot = avg_wages_major_plot,  # plot name
    width = 20, # width
    height = 12 # height
)

The code above saves the plot at a size that can be interpreted easily. We reuse this code to save in JPEG and PDF formats below:

### **JPEG**

In [None]:
ggsave(
    filename = sprintf("U:\\%s\\TX Training\\Output\\avg_wages_major_plot.jpeg", username), # saving path
    plot = avg_wages_major_plot,  # plot name
    width = 20, # width
    height = 12 # height
)

### **PDF**

In [None]:
ggsave(
    filename = sprintf("U:\\%s\\TX Training\\Output\\avg_wages_major_plot.pdf", username), # saving path
    plot = avg_wages_major_plot,  # plot name
    width = 20, # width
    height = 12 # height
)

## **Reminder**
Every single item for export, regardless of whether it is a .csv, .pdf, .png, or something else, must have corresponding proof in the input file to show that every group used to create the statistic followed the disclosure review rules.

> Note: After the end of the course, it is possible to export the code that has been used during the class. To do so, make your team lead aware of your group's interest and they will handle it from there.

In [None]:
# Close the database connection
dbDisconnect(con)

## References

Edelmann, Joshua, Mian, Rukhshan, & Feder, Benjamin. (2022, April 1). Preparing Safe Outputs using Tennessee Education and Employment Data. Zenodo. https://doi.org/10.5281/zenodo.6407279

