<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Joshua Edelmann, Rukhshan Mian, Benjamin Feder </center>
<a href="https://doi.org/10.5281/zenodo.6407279"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6407279.svg" alt="DOI"></a>


## **1. Introduction**

At this point, we have created useful visualizations and tables we may want to use outside of the ADRF for a final presentation or report. However, **data cannot leave the ADRF without passing the export review process**, where reviewers confirm that none of the findings are disclosive given the rules provided by the agency. This notebook explains how to prepare research output for disclosure control. It outlines how to prepare different kinds of outputs before submitting an export request and provides an overview of the information needed for disclosure review. Please read through the entire notebook, because it discusses separately the different types of outputs that will be flagged in the disclosure review process. The examples are almost all taken directly from the code in the notebooks from Module 2.

## **2. Learning Objectives**

We will use a variety of examples from previous notebooks to walk through how to safely prepare various outputs so they will pass the ADRF disclosure review process. At the end of each section, we will explain how the disclosure-proofed files should be properly documented to pass the export process. As you work through the material presented in the notebook, think about how you might apply the techniques and code to your project. 

In this notebook, our focus is **disclosure control**. After you finish this notebook, you should understand:
- how to make sure tables and visualizations are safe for export
- how to provide and document supporting files
- how to adjust subgroups if they do not satisfy the disclosure control requirements

### **Research Questions**
In this notebook, we revisit the research questions presented in the Module 2 notebooks:
- How many students graduated from Tennessee community colleges by subgroup (e.g. demographics, institutions, enrollment type, etc.)?
- What are the average quarterly wages of our graduating cohort in Tennessee?
- Do the stable employment outcomes of our cohort vary by gender?
- What were the most common major types within the cohort?
- Do the average quarterly earnings vary within the most common majors in our cohort?

### **Datasets** 
We will explore and understand the Tennessee Education to Earnings Data System tables in this notebook:
- **Tennessee Board of Regents**: Education data collected from postsecondary institutions in Tennessee.
- **Unemployment Insurance (UI) wage records**: Data are collected from businesses and industries in Tennessee that participate, as required by law, in the Unemployment Insurance (UI) program and its trust fund. Wages are reported monthly by industry and compiled quarterly. This information is used for administering the UI program and feeds into mandatory reports to the US Department of Labor.

### **Methods** ###

Through preparing the answers to these questions for export, we will cover techniques for disclosure-proofing the following types of exports:

- Counts
- Percentiles
- Percentages
- Bar plots
- Line plots

### **Remarks about Disclosure Review**

In general, any kind of file format can be exported. However, researchers typically export tables, graphs, regression outputs and aggregated data. Thus, we expect you to export files in either .csv, .txt or graph format. Results cannot be exported in a Jupyter notebook, as the disclosure review process is too burdensome.

However, this does not mean that you will not need Jupyter notebooks in the export process, as they may contain essential code that can be listed as supplemental information for the reviewers. It is essential to document every step of an analysis in a Jupyter notebook (or in any code, for that matter), both for the reviewers and in case you need to update it to satisfy specific export requirements.

A more detailed description of the rules for exporting results can be found in the [export documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before preparing an export submission, please read through the export documentation. Please request an export only when results are final and they are needed for a presentation or final project report. Your team leads are expected to coordinate the export request process.

The general idea behind potential disclosure for the data provided in this class is that every statistic for export must be based on at least **10 individuals**, and if the cell contains data derived from UI wage records, it must also be based on observations from at least **3 unique employers**. <u>If the counts of individuals and/or employers are not clear in the desired table or visualization for export, supplemental counts must be provided in the supporting documentation.</u> 
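
As a rough illustration of the thresholds above, the check could be expressed as a small base R helper. This is only a sketch: the function name `meets_thresholds` and its arguments are hypothetical and are not part of the ADRF tooling, and the values are toy numbers rather than real data.

```r
# Hypothetical helper: TRUE when a cell meets the thresholds described above.
# n_employers is only checked when it is supplied, i.e. when the cell
# relies on UI wage records.
meets_thresholds <- function(n_individuals, n_employers = NULL,
                             min_individuals = 10, min_employers = 3) {
    ok <- !is.na(n_individuals) & n_individuals >= min_individuals
    if (!is.null(n_employers)) {
        ok <- ok & !is.na(n_employers) & n_employers >= min_employers
    }
    ok
}

# Toy values, not real data
meets_thresholds(c(25, 9, 40))               # education data only
meets_thresholds(c(25, 12, 40), c(5, 2, 3))  # cells that also use UI wage records
```

Any cell for which the helper returns `FALSE` would need to be redacted or aggregated before export.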

> Note: A table using solely Kentucky UI wage records will be held to a more stringent requirement. For the purposes of this class, though, we recommend that if you plan to isolate out-of-state earnings, you pool the states together to report on in-state and out-of-state earnings relative to Tennessee. 

### **Rounding Rules**

For all Applied Data Analytics training programs, you must also abide by specific rounding rules for exports. These pertain to both tables and visualizations:

- Wages: Nearest 100 (ex. 4578 --> 4600)
- Percentages: Nearest whole number (ex. 25.4% --> 25%)
- Counts less than 100: Nearest 10 (ex. 97 --> 100)
- Counts greater than 100: Nearest 100 (ex. 154667 --> 154700)
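
The rules above can be sketched as three small base R helpers. The function names are hypothetical and chosen only for illustration; the inputs are the examples from the list.

```r
# Hypothetical helpers implementing the rounding rules listed above
round_wage       <- function(x) round(x, digits = -2)  # nearest 100
round_percentage <- function(x) round(x, digits = 0)   # nearest whole number
round_count      <- function(x) {
    # counts above 100 go to the nearest 100, all others to the nearest 10
    ifelse(x > 100, round(x, digits = -2), round(x, digits = -1))
}

round_wage(4578)        # 4600
round_percentage(25.4)  # 25
round_count(97)         # 100
round_count(154667)     # 154700
```

Because `ifelse()` propagates `NA`, a redacted count stays `NA` after rounding.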

## **3. Tables** ##

In this section, we demonstrate how to disclosure-proof tables for export. Here, we will walk through preparing tables with three different types of variables:
1. Counts
2. Percentiles
3. Percentages

Note that this is not an exhaustive list of the types of table-based exports. The list contains some of the most common table-based exports, and there is extensive documentation on disclosure-controlling other table-based exports in the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before we touch on these examples, we will load the necessary libraries and establish our connection to the server.

> These are the same packages used in the Data Visualization notebook.

In [None]:
# Database interaction imports
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

In [None]:
# Load cohort table for future use
qry <- "
select * 
from tr_tn_2021.dbo.grads1516
"

df_cohort <- dbGetQuery(con, qry)

### **Counts**
Recall the example from the `Data Exploration: Tennessee Community Colleges` notebook where we answered the following question:

- How many students graduated from Tennessee community colleges by subgroup (e.g. demographics, institutions, enrollment type, etc.)?

Given the potential subgroups of major and gender, our solution consisted of counting the number of graduates from the 2015-16 *academic* year, which includes summer 2015, fall 2015, and spring 2016 graduates, by gender and most common majors to find the most common majors for each gender value. Let's load in the csv containing the most common majors by gender, `common_major_gender.csv`.

> Note: In this section we will not focus on preparing the proportion column because disclosure-proofing proportions will be described in detail in a later section.

<font color=red> Before you run the cell below, make sure you have run through the `1A.Data Exploration` notebook and have saved the csv files in your `U:\\..\\TN Training\\Results` directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
#most common majors by gender
common_major_gender <- read_csv("U:\\..\\TN Training\\Results\\common_major_gender.csv") %>%
    select(-prop)

As we calculated in the first notebook, we can see the number of graduates by gender for the most common majors.

In [None]:
# see common_major_gender
common_major_gender

The export rules for the TBR data state that each statistic in an export must be based on at least <b>10 individuals</b>. Therefore, each row, which reports a number of graduates, must represent at least 10 individuals. To disclosure-proof this table, we can add a `mutate()` call that changes the count (`n` in this table) to `NA` whenever it is less than 10. This ensures that no gender group in our table is reported with fewer than 10 individuals, and it lets us show the reviewer which groups did not satisfy the disclosure review guidelines.

> Note: Because we did not use any UI wage record data to generate this table, we do not need to be concerned with ensuring we have observations from at least 3 unique employers.

In [None]:
# adapt major/gender group counts to satisfy disclosure rules 
common_major_gender_export1 <- common_major_gender %>%
    mutate(
        n = ifelse(n < 10, NA, n)
    ) 

common_major_gender_export1

Finally, to align the desired table for export with the export guidelines, we must adhere to the rounding rules for the training program. Recall that each count less than 100 must be rounded to the nearest 10 and that every count greater than 100 must be rounded to the nearest 100.

> Within the `round()` function, the argument `digits = -1` will round to the nearest 10, and `digits = -2` will round to the nearest 100.

In [None]:
# apply rounding rules for counts
common_major_gender_export1_rounded <- common_major_gender_export1 %>%
    mutate(
        count_rounded = ifelse(n > 100, round(n, digits = -2), round(n, digits = -1))
    ) %>%
# don't need count anymore
    select(-n)

# see table
common_major_gender_export1_rounded

However, excluding a large number of groups below 10 may begin to muddle your output, so other strategies can be employed to maintain table structure while meeting reporting standards for export file release. A common method for doing so is aggregating smaller groups to meet the relevant disclosure review standards (at least 10 individuals in this case).
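
A minimal base R sketch of this aggregation idea, using made-up group labels and counts (none of these values come from the TBR data):

```r
# Toy counts by group (hypothetical values, not real data)
toy_counts <- data.frame(
    major = c("A", "B", "C", "D"),
    n     = c(120, 45, 6, 7),
    stringsAsFactors = FALSE
)

# Relabel every group below the threshold, then re-sum the counts
toy_counts$major_grouped <- ifelse(toy_counts$n < 10, "major_other", toy_counts$major)
toy_aggregated <- aggregate(n ~ major_grouped, data = toy_counts, FUN = sum)

toy_aggregated

# The combined group must itself be re-checked against the threshold
all(toy_aggregated$n >= 10)
```

Here the two small groups are pooled into a single `major_other` row of 13 individuals, which clears the threshold. If the pooled group were still below 10, it would need to be redacted or merged with another group.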

In this example, we will create aggregate `CIP_Family` values so that `Gender`/`CIP_Family` combinations will satisfy the disclosure review limits. First, we will identify the rows in `common_major_gender` where their current counts are insufficient.

In [None]:
# identify small counts
common_major_gender %>% 
    filter(n<10)

From here, we can recode the rows that will not pass disclosure review to aggregate them into one major grouping, and then we can recalculate `n` as we did in the first notebook, except renaming the variable so that it is clear for the reviewer that it tracks the number of graduates per major within each gender, respectively.

In [None]:
# subgroup aggregation
common_major_gender_export2 <- common_major_gender %>%
    mutate(
        CIP_Family = case_when(
            is.na(Gender) & CIP_Family %in% c("Personal Improvement & Leisure", "Health Professions & Related Services", "Protective Services & Public Affairs") ~ "major_other",
            TRUE ~ CIP_Family
        )
    ) %>%
    group_by(Gender, CIP_Family) %>%
    summarize(
        number_individuals = sum(n)
    ) %>%
    ungroup()

common_major_gender_export2

Then we can confirm that none of the values in `number_individuals` need to be redacted and apply the rounding rules.

In [None]:
# adapt major/gender group counts to satisfy disclosure rules 
common_major_gender_export2 <- common_major_gender_export2 %>%
    mutate(
        number_individuals = ifelse(number_individuals < 10, NA, number_individuals)
    ) %>%
    arrange(desc(number_individuals))

common_major_gender_export2

In [None]:
# apply rounding rules for counts
common_major_gender_export2_rounded <- common_major_gender_export2 %>%
    mutate(
        count_rounded = ifelse(number_individuals > 100, round(number_individuals, digits = -2), round(number_individuals, digits = -1))
    ) %>%
# don't need unrounded count anymore
    select(-number_individuals)

# see table
common_major_gender_export2_rounded

To provide proof that the proper rounding rules were applied, we will also save the corresponding unrounded tables (`common_major_gender_export1` for `common_major_gender_export1_rounded` and `common_major_gender_export2` for `common_major_gender_export2_rounded`) as supplemental files containing the true counts per group. These tables (`common_major_gender_export1_rounded` and `common_major_gender_export2_rounded`) are now safe for export.
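
One way to keep the export file and its supporting file side by side is to write both with `write.csv()`. The sketch below uses toy data frames, hypothetical file names, and a temporary directory so that it runs anywhere; in practice you would point it at your own results directory and your own data frames.

```r
# Hypothetical output folder; a temporary directory keeps the sketch portable
out_dir <- file.path(tempdir(), "export_example")
dir.create(out_dir, showWarnings = FALSE)

# Toy stand-ins for a rounded export table and its unrounded supporting table
toy_export <- data.frame(group = c("A", "B"), count_rounded = c(100, 50))
toy_input  <- data.frame(group = c("A", "B"), n = c(97, 52))

# Hypothetical file names: one file for export, one as supporting documentation
write.csv(toy_export, file.path(out_dir, "toy_export_rounded.csv"), row.names = FALSE)
write.csv(toy_input,  file.path(out_dir, "toy_export_input.csv"),   row.names = FALSE)

list.files(out_dir)
```

Using a consistent naming pattern for the export and supporting files makes it easier for a reviewer to match them up.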

### **Fuzzy Percentiles**

Suppose we took a question adapted from the Data Exploration: Wages notebook:

- What are the average quarterly wages of our graduating cohort in Tennessee?

but discovered that a median would better represent what was in the data due to skewness and outliers. Let's slightly adjust this question to:
- What are the *median* quarterly wages of our graduating cohort in Tennessee?
> The median is another way to describe the 50th percentile.

Under no circumstances will true percentiles pass export review, because percentiles represent <b><i>ACTUAL</i></b> observations for an individual, regardless of whether the unit of analysis is itself directly subject to disclosure review. To get a sense of percentiles in the data, fuzzy percentiles can be exported instead. A fuzzy percentile is created by averaging two true percentiles. For example, a fuzzy median can be created by finding the average of the true 45th and 55th percentiles. This section provides a template for creating fuzzy percentiles so that they can pass the disclosure review process.
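
The idea can be sketched as a small base R function. The name `fuzzy_percentile` and the `width` argument are hypothetical; the default `width = 0.05` is an assumption chosen so that `p = 0.50` reproduces the 45th/55th percentile average described above. The input is a toy vector, not wage data.

```r
# Hypothetical helper: average the percentiles `width` below and above p
fuzzy_percentile <- function(x, p, width = 0.05) {
    stopifnot(p - width >= 0, p + width <= 1)
    bounds <- quantile(x, probs = c(p - width, p + width),
                       na.rm = TRUE, names = FALSE)
    mean(bounds)
}

# Toy vector, not real data
toy_wages <- 1:101
fuzzy_percentile(toy_wages, 0.50)  # fuzzy median
fuzzy_percentile(toy_wages, 0.25)  # fuzzy 25th percentile
```

The same thresholds on individuals and employers still apply to every fuzzy percentile that is exported.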

To answer the original question concerning quarterly wages, we used the following code:

    quarterly_wages %>%
        group_by(quarter_number) %>%
        summarize(mean_wage = mean(total_wages))

To calculate the median quarterly wages, we will need to start with the `quarterly_wages` data frame, which is an aggregated version of the table `tr_tn_2021.dbo.nb_cohort_wages_link` created in the `2.Data_Exploration_Wages` notebook.


In [None]:
## read tr_tn_2021.dbo.nb_cohort_wages_link into R
qry <- "
select *
from tr_tn_2021.dbo.nb_cohort_wages_link
"
df_wages <- dbGetQuery(con, qry)

head(df_wages)

Before we can create `quarterly_wages`, we need to create the `quarter_number` variable and filter out wage observations 0 and 13 quarters after graduation.

In [None]:
# get quarter from graduation and ignore quarters 0 and 13
df_wages <- df_wages %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    ) %>%
    filter(!(quarter_number %in% c(0, 13)))

Now we can create `quarterly_wages` using the same code as we did in `2.Data_Exploration_Wages` notebook.

In [None]:
# find and save as quarterly_wages
quarterly_wages <- df_wages %>%
    group_by(SSN, quarter_number) %>%
    summarise(
        total_wages = sum(wge_amt),
    ) %>%
    ungroup()

head(quarterly_wages)

Recall that we used the following code to find the average quarterly earnings per quarter after graduation for our cohort:

    quarterly_wages %>%
            group_by(quarter_number) %>%
            summarize(mean_wage = mean(total_wages))

Now, if we substitute the line `summarize(mean_wage = mean(total_wages))` with `summarize(median_wage = median(total_wages))`, we can calculate the true median quarterly pay received by TN graduates in their first 12 quarters after graduation.

In [None]:
# calculate median pay
quarterly_wages %>% 
    group_by(quarter_number) %>%
    summarize(median_wage = median(total_wages))

However, as mentioned at the beginning of the section, direct percentiles cannot be exported from the ADRF. Instead, we can export a "fuzzy median", which will be calculated by averaging the true 45th and 55th percentiles. To calculate specific percentiles, we will use the `quantile()` function, which requires the variable of interest and the proportion corresponding to the percentile.

In [None]:
# implement fuzzy median
fuzzy_median_export <- quarterly_wages %>% 
    group_by(quarter_number) %>%
    summarize(
        fuzzy_median = (quantile(total_wages, .45) + quantile(total_wages, .55))/2
    )

fuzzy_median_export

Notice how the fuzzy median quarterly wages do not deviate much from the true median ones in this example.

Although we have prepared the fuzzy median quarterly wages, we have not provided all of the information necessary to receive this file in an export because **we have not shown that there are at least 10 graduates per statistic**, or per quarter after graduation.  Additionally, since the table relies on data from Tennessee's UI wage records, **we need to show that there are observations from at least 3 unique employers per cell.** There are two potential options:

1. Include counts of graduates per quarter in `fuzzy_median_export` and keep only the quarters with at least 10 graduates (this filter is unlikely to remove any quarters here, given the large number of graduates employed per quarter) OR
2. Create a separate table to show that all quarters have observations from at least 10 individuals and 3 employers and filter out any quarters that will not pass the check afterwards.

Here, we will opt for the second option and create a separate data frame containing the number of unique graduates receiving quarterly wages per quarter, as well as the number of employers contributing to those observations. To differentiate between the files we want for export and those to demonstrate proof of the export guidelines, we will eventually name this new data frame `fuzzy_median_input`.

We can easily find the number of individuals contributing to each cell in the table by counting the number of unique individuals in `quarterly_wages` by quarter.

In [None]:
# find number of individuals contributing to fuzzy_median_export by quarter_number
fuzzy_median_num_ind <- quarterly_wages %>%
    group_by(quarter_number) %>%
    summarize(
        number_individuals = n_distinct(SSN)
    )

fuzzy_median_num_ind

However, since `quarterly_wages` is aggregated to the `SSN`/`quarter_number` level from the initial `SSN`/`quarter_number`/`empr_nbr` level in Tennessee's wage records, we cannot find the number of unique employers per cell using the same method. Luckily, `df_wages` contains the same subset of individuals as `quarterly_wages` at a less aggregated level, so we can count the number of unique employers by `quarter_number` directly from `df_wages`.
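
The point about aggregation level can be illustrated with a small base R sketch. The data frame and its column names (`person_id`, `employer_id`) are hypothetical toy stand-ins, not the actual wage record fields.

```r
# Toy job-level records: one row per person/employer/quarter (not real data)
toy_jobs <- data.frame(
    person_id      = c("p1", "p1", "p2", "p3", "p3", "p3"),
    employer_id    = c("e1", "e2", "e1", "e3", "e3", "e4"),
    quarter_number = c(1, 1, 1, 2, 2, 2),
    wage           = c(1000, 500, 2000, 1500, 300, 700)
)

# Aggregating to person/quarter drops the employer identifier entirely
toy_quarterly <- aggregate(wage ~ person_id + quarter_number, data = toy_jobs, FUN = sum)
names(toy_quarterly)

# So distinct counts must be taken from the job-level table
n_distinct_by <- function(values, groups) {
    tapply(values, groups, function(v) length(unique(v)))
}

toy_support <- data.frame(
    quarter_number     = sort(unique(toy_jobs$quarter_number)),
    number_individuals = as.vector(n_distinct_by(toy_jobs$person_id, toy_jobs$quarter_number)),
    number_employers   = as.vector(n_distinct_by(toy_jobs$employer_id, toy_jobs$quarter_number))
)

toy_support
```

In this toy example neither quarter would pass review, since both fall well short of 10 individuals and 3 employers.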

In [None]:
# find number of employers
fuzzy_median_num_emp <- df_wages %>% 
    group_by(quarter_number) %>%
    summarize(
        number_employers = n_distinct(empr_nbr)
    )

fuzzy_median_num_emp

We can then join this data frame to `fuzzy_median_num_ind` so that we have the number of unique employers and unique individuals that contributed to each cell in our table for export.

In [None]:
# add in number of unique employers
fuzzy_median_input <- fuzzy_median_num_ind %>%
    inner_join(fuzzy_median_num_emp, by= c("quarter_number"))

fuzzy_median_input

Now that we can confirm that each cell in `fuzzy_median_export` has observations from a sufficient number of individuals and employers, we can apply the rounding rules for wages. The rounding rule for wages dictates that wages must be rounded to the nearest 100. Once we round the wages, `fuzzy_median_export` will be safe for export.

> Note: If `quarterly_wages` contained any `NA`'s or values of 0, and these values were ignored in generating these fuzzy median estimates, the number of graduates must count only those with non-`NA` and non-zero quarterly wage values. Also, if certain cells did not satisfy the export requirements, you can redact or aggregate those cells.

In [None]:
# apply rounding rules
fuzzy_median_export_rounded <- fuzzy_median_export %>%
    mutate(
        fuzzy_median_rounded = round(fuzzy_median, digits = -2)
    ) %>% 
    select(-fuzzy_median)

fuzzy_median_export_rounded

Like before, to provide proof that the proper rounding rules were applied, we will also save the corresponding unrounded table as a supplemental file containing the unrounded values per group. In summary, with `fuzzy_median_export_rounded` as the designated file for export, we can provide the contents of `fuzzy_median_input` and `fuzzy_median_export` as additional information: the former shows the number of employers and individuals, and the latter shows that the proper rounding was used.

### **Percentages**

For any reported percentages or proportions, the underlying counts of individuals contributing to the numerators and denominators must be provided for each statistic in the desired export. Recall the example from the Data Exploration: Wages notebook where we answered the following question:

- Do the stable employment outcomes of our cohort vary by gender?

We saved the answer in the csv file `full_q_stats_gender.csv`. To start this section, let's read in this csv.

<font color=red> Before you run the cell below, make sure you have run through the Data Exploration: Wages notebook and have saved the csv files in your `U:\\..\\TN Training\\Results` directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# read file
full_q_stats_gender <- read_csv("U:\\..\\TN Training\\Results\\full_q_stats_gender.csv") %>%
    select(-avg_wage)

In [None]:
# see full_q_stats_gender file
head(full_q_stats_gender)

For the purposes of this example, we will focus on disclosure-proofing the `prop` column specifically. As mentioned above, to export any proportion (or percentage), you need to provide the counts for the numerator as well as the denominator. Recall that we created this file using the following code:

    full_q_stats_gender <- full_q_wages %>%
        group_by(Gender) %>%
        summarize(
            num_individuals = n_distinct(SSN),
            avg_wage = mean(wge_amt)
        ) %>%
        mutate(
            prop = num_individuals/sum(num_individuals)
        )

    full_q_stats_gender

As you can see, `prop` was created by dividing the `num_individuals` value for each row by the total `num_individuals`, that is, the total number of graduates represented in `full_q_wages`. Thus, the numerator for each `prop` value is the row's corresponding `num_individuals`, and the denominator is the sum of all the `num_individuals` values. Let's make this clearer for the reviewer in a new data frame `full_q_stats_gender_input`.

In [None]:
# describe numerator and denominator of proportion
full_q_stats_gender_input <- full_q_stats_gender %>%
    mutate(
        numerator_for_prop = num_individuals,
        denominator_for_prop = sum(num_individuals)
    )

head(full_q_stats_gender_input)

Now, if `numerator_for_prop` and `denominator_for_prop` do not both contain observations from at least 10 individuals, the corresponding proportions will need to be redacted. However, since `full_q_stats_gender` uses information derived from Tennessee's UI wage records, we also need to confirm that each statistic, or proportion, is based on observations from at least 3 employers as well. Unfortunately, we do not have employer-level information currently in `full_q_stats_gender`, so we need to go back to the code used to generate the file. 

> We've combined all of the code into one code cell for simplicity.

In [None]:
# get full quarter instances
qry <- "
select b.SSN, b.empr_nbr, b.wge_amt, b.job_date, b.grad_date, b.Gender
from  tr_tn_2021.dbo.nb_cohort_wages_link a,  tr_tn_2021.dbo.nb_cohort_wages_link b,  tr_tn_2021.dbo.nb_cohort_wages_link c
where a.SSN = b.SSN and a.empr_nbr = b.empr_nbr and a.job_date = dateadd(month, 3, b.job_date)
and a.SSN = c.SSN and a.empr_nbr = c.empr_nbr and b.job_date = dateadd(month, 3, c.job_date)
"
full_q_wages <- dbGetQuery(con, qry)

full_q_stats_gender <- full_q_wages %>%
    group_by(Gender) %>%
    summarize(
        num_individuals = n_distinct(SSN),
        avg_wage = mean(wge_amt)
    ) %>%
    mutate(
        prop = num_individuals/sum(num_individuals)
    )

full_q_stats_gender

From the cell above, we can see that the data frame `full_q_wages`, which contains information only on those experiencing at least one quarter of full-quarter employment in Tennessee within a three-year time frame, was used to create `full_q_stats_gender`. This table happens to contain employer-level information.

In [None]:
# see full_q_wages
head(full_q_wages)

Therefore, to find the number of unique employers that contributed to each `prop` value, we can calculate the number of distinct `empr_nbr` values per each `Gender`.

In [None]:
# number of employers per Gender
employer_info <- full_q_wages %>%
    group_by(Gender) %>%
    summarize(
        num_employers = n_distinct(empr_nbr)
    ) %>%
    mutate(
        Gender = str_trim(Gender)
    )

employer_info

With `employer_info` saved as a separate data frame, we can join it back to `full_q_stats_gender_input` so that all of the information to satisfy the disclosure controls is in one file.

In [None]:
# join employer_info to full_q_stats_gender_input
full_q_stats_gender_input <- full_q_stats_gender_input %>%
    left_join(employer_info, by="Gender")

full_q_stats_gender_input

Now that we have all of our information in `full_q_stats_gender_input`, we can see that each group in `full_q_stats_gender` satisfies the export requirements. Finally, before we can export `full_q_stats_gender`, we need to apply the proper rounding rules.

> Since percentages are rounded to the nearest whole percent, each proportion is rounded to the nearest hundredth.

In [None]:
# apply rounding
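full_q_stats_gender_export <- full_q_stats_gender %>%
    mutate(
        prop_rounded = round(prop, digits = 2)
    ) %>%
    # keep only the rounded proportion; the unrounded counts and wages
    # are documented in full_q_stats_gender_input
    select(Gender, prop_rounded)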

full_q_stats_gender_export

`full_q_stats_gender_export` is now fully prepared for a safe export, as long as `full_q_stats_gender_input` is added in the export documentation section of the export request. Additionally, it may be worthwhile mentioning that the `numerator_for_prop` and `denominator_for_prop` also track the number of unique individuals per cell in the documentation.
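
To summarize the steps for proportions, here is a compact base R sketch on toy data. The data frame and its column names are hypothetical and the numbers are invented; it only illustrates checking the numerator, denominator, and employer counts before rounding.

```r
# Toy supporting table (hypothetical values, not real data)
toy_props <- data.frame(
    group         = c("A", "B", "C"),
    numerator     = c(450, 542, 8),
    num_employers = c(120, 150, 2)
)
toy_props$denominator <- sum(toy_props$numerator)
toy_props$prop <- toy_props$numerator / toy_props$denominator

# A proportion passes only if numerator, denominator, and employers all pass
passes <- toy_props$numerator >= 10 &
    toy_props$denominator >= 10 &
    toy_props$num_employers >= 3

# Redact failing cells, round the rest to the nearest hundredth
toy_props$prop_rounded <- ifelse(passes, round(toy_props$prop, digits = 2), NA)

toy_props[, c("group", "prop_rounded")]
```

Group `C` is redacted because its numerator is below 10 and it has fewer than 3 employers, while the full `toy_props` table would serve as the supporting documentation.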

## **4. Visualizations**

Whereas the previous section contained examples on how to prepare tables for export, this section will discuss preparing certain visualizations for export. The approach for disclosure-proofing visualizations is similar to the one for tables--it all pertains to the underlying data used to create the visualization. Here, we will discuss two different types of visualizations:
1. Bar plots
2. Line plots

Keep in mind that this is not an exhaustive list of the types of visualization-based exports. Please refer to the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/) for guidance on preparing other visualizations for disclosure review.

> The most common type of visualization that may require more non-obvious underlying data manipulation to satisfy export constraints is a *histogram*, as each bar in the histogram is technically a separate grouping. Therefore, each bar in the histogram, in this case, must contain information from at least 10 individuals and 3 employers, if wage records are used.
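
As a rough sketch of the histogram case, the counts per bin can be inspected with `cut()` and `table()` before plotting. The wage values and break points below are invented for illustration, and only the individual-count threshold is checked; with wage records, the number of employers per bin would need the same treatment.

```r
# Toy wage values with two high outliers (not real data)
toy_wages <- c(rep(2000, 40), rep(5000, 60), rep(8000, 30), 14500, 14800)

# Count the observations that would fall in each histogram bar
breaks <- seq(0, 15000, by = 3000)
bin_counts <- table(cut(toy_wages, breaks = breaks, dig.lab = 5))
bin_counts

# Non-empty bars with fewer than 10 observations would fail review
names(bin_counts)[bin_counts > 0 & bin_counts < 10]

# One option: widen the top bin so every non-empty bar passes
wider_breaks <- c(0, 3000, 6000, 15000)
wider_counts <- table(cut(toy_wages, breaks = wider_breaks, dig.lab = 5))
wider_counts
```

The table of bin counts can then be submitted as the supporting file for the histogram.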

In [None]:
# Code adjusting overall graph attributes

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 12, repr.plot.height = 8)

### **Bar Plots**

Recall the question which was answered in the Data Visualization notebook using a bar plot:

- What were the most common major types within the cohort?

Since a bar plot is separated by distinct categories, or groups, the export requirements are straightforward: each bar must consist of observations from at least 10 individuals. Recall the code used to generate the final bar plot from the `common_major.csv` file, which contains the counts of graduates for the 5 most common majors.

<font color=red> Before you run the cell below, remember to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# most common majors
common_major <- read_csv("U:\\..\\TN Training\\Results\\common_major.csv")

# see common_major
common_major

Recall the code used to generate the bar plot.

In [None]:
#Assign a custom color palette to use with the bar graphs
palette_color <- c("Liberal Arts & Science" = "orange",
                   "Health Professions & Related Services" = "blue",
                   "Business Management & Admin. Services" = "red",
                   "Engineering" = "purple",
                   "Computer & Information Sciences" = "green3"
                  )

#Create a bar chart showing the 5 majors most represented in the graduating cohort

#Specify source dataset and x and y variables
common_major_plot <- ggplot(common_major, aes(x = n, y = reorder(CIP_Family,n), fill=CIP_Family)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = palette_color, guide = "none") +

#Adjust the x scale to set the interval for tick marks
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, 7000, 1000),
    limits = c(0, 7000)) +

#Add titles and axis labels
labs(title = "TN graduates by major",
     subtitle = "Top 5 Majors",
     x = "Number of Graduates", y = "Major",
     caption = "Data Source: TN data")

#Display the graph we just made
print(common_major_plot)

Although we can see that each bar contains at least 10 individuals, we should make it as clear as possible for the reviewer. To do so, we will include the specific counts for these 5 majors as a supplemental file. These counts already exist in `common_major.csv`. Just so we do not forget, let's save them in `common_major_input` as well.

In [None]:
# save common_major as common_major_input
# rename n to number_individuals to make as clear as possible for reviewer

common_major_input <- common_major %>%
    select (CIP_Family, n) %>%
    rename(number_individuals = n)

# see new file
common_major_input

Again, since each bar consists of at least 10 individuals and we are not using any wage records, there is no need for any additional redaction checks.

### **Line Graphs**

In the Data Visualization notebook, we answered the following question with a line graph:

- Do the average quarterly earnings vary within the most common majors in our cohort?

Line graphs can be slightly more complicated than bar graphs for the disclosure review process, as each point utilized to create the line graph is subject to the export guidelines, as opposed to each line just needing to consist of at least 10 individuals (and 3 employers depending on the data source). The information used to generate this line graph is available in `avg_and_num_major.csv`.
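
The point-by-point check can be sketched in base R as follows. The data frame is a toy stand-in with hypothetical column names (`number_individuals`, `number_employers`) and invented values; it is not the contents of `avg_and_num_major.csv`.

```r
# Toy plotting data: one row per point on the line graph (not real data)
toy_points <- data.frame(
    major              = rep(c("Engineering", "Health"), each = 3),
    quarter_number     = rep(1:3, times = 2),
    mean_wage          = c(9120, 9480, 9910, 8040, 8260, 8530),
    number_individuals = c(35, 9, 40, 200, 210, 205),
    number_employers   = c(12, 5, 2, 80, 85, 83)
)

# Every point must pass both thresholds, not just every line
fails <- toy_points$number_individuals < 10 | toy_points$number_employers < 3

# Redact failing points and round the remaining wages to the nearest 100
toy_points$mean_wage_rounded <- ifelse(fails, NA, round(toy_points$mean_wage, digits = -2))

# Points that would need to be redacted or aggregated
toy_points[fails, c("major", "quarter_number")]
```

In this toy example the `Engineering` line as a whole covers plenty of individuals, yet two of its three points fail, which is exactly the situation a line-level check would miss.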

<font color=red> Before you run the cell below, remember to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# average quarterly earnings and number employed by quarter (common majors)
avg_and_num_major <- read_csv("U:\\..\\TN Training\\Results\\avg_and_num_major.csv")

avg_and_num_major

Recall the code written in the Data Visualization notebook used to create the line plot.

In [None]:
#Specify source dataset and x, y, and color variables
# only use the first word of each major name because the full names make the graph harder to read
avg_wages_major_plot <- ggplot(avg_and_num_major, aes(x = quarter_number, y = mean_wage, 
                                                      color = word(CIP_Family, 1))) + 

#Plots a line on the graph
geom_line() +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(4000, 14000, 2000),
    limits = c(4000,14000)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "TN Graduates majoring in engineering and health fields \nexperienced higher earnings after graduation on average",
    x = "Quarter after Graduation", y = "Average Quarterly Wages",
    caption = "Data Source: TN Data",
    color = "Major") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)

Since the wage values were not rounded for the visualization, we need to re-run the visualization after rounding the average wages.

In [None]:
# Round average wages to the nearest hundred, then specify x, y, and color variables
# only use the first word of each major name because the full names make the graph harder to read
avg_wages_major_plot <- avg_and_num_major %>% mutate(mean_wage_rounded = round(mean_wage, digits = -2)) %>%

ggplot(aes(x = quarter_number, y = mean_wage_rounded, color = word(CIP_Family, 1))) + 

#Plots a line on the graph
geom_line() +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(4000, 14000, 2000),
    limits = c(4000,14000)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "TN Graduates majoring in engineering and health fields \nexperienced higher earnings after graduation on average",
    x = "Quarter after Graduation", y = "Average Quarterly Wages",
    caption = "Data Source: TN Data",
    color = "Major") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)
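
The rounding step in the cell above relies on `round()` with a negative `digits` argument, which rounds to the left of the decimal point. As a minimal illustration, using made-up wage values rather than any real results, `digits = -2` rounds each value to the nearest hundred:

```r
# Hypothetical average wages; values are made up for illustration
mean_wage_example <- c(5432.17, 10049.50, 13962.40)

# digits = -2 rounds to the nearest hundred
mean_wage_rounded_example <- round(mean_wage_example, digits = -2)
mean_wage_rounded_example
```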

Since we are plotting average quarterly wages, the reviewer cannot determine from the plot how many individuals contribute to each average, that is, to each combination of major grouping and quarter after graduation. We therefore need to supply these counts in a supplemental table. A key component of this table already exists: the input data frame used to generate the above visualization, `avg_and_num_major`, contains the number of individuals per data point.

In [None]:
# see avg_and_num_major
avg_and_num_major

As we did in the Fuzzy Percentiles section, to find the number of employers within each grouping, we will need to leverage `df_wages`. We can do so by filtering `df_wages` for only the `CIP_Family` and `quarter_number` values in `avg_and_num_major` before finding the number of employers within the `CIP_Family`/`quarter_number` combinations.

In [None]:
# find number of employers per CIP_Family/quarter_number combination
employer_info <- df_wages %>%
    filter(CIP_Family %in% avg_and_num_major$CIP_Family, quarter_number %in% avg_and_num_major$quarter_number) %>%
    group_by(CIP_Family, quarter_number) %>%
    summarize(
        number_employers = n_distinct(empr_nbr)
    )

head(employer_info)
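
Note that filtering with `%in%` on each column separately keeps any row whose `CIP_Family` and `quarter_number` each appear somewhere in `avg_and_num_major`, even if that exact pair does not. An alternative, sketched below with made-up data, is `semi_join()`, which keeps only rows whose pair of values appears in the second table. The data frames and values here are hypothetical; only the column names are taken from the notebook.

```r
library(dplyr)

# Hypothetical wage records; all values are made up
df_wages_example <- tibble(
    CIP_Family = c("Health", "Health", "Health", "Business", "Business"),
    quarter_number = c(1, 1, 2, 1, 2),
    empr_nbr = c("A", "B", "A", "C", "C")
)

# Hypothetical plot input containing only two of the four possible pairs
plot_pairs_example <- tibble(
    CIP_Family = c("Health", "Business"),
    quarter_number = c(1, 2)
)

# semi_join keeps only rows whose pair appears in plot_pairs_example
employer_info_example <- df_wages_example %>%
    semi_join(plot_pairs_example, by = c("CIP_Family", "quarter_number")) %>%
    group_by(CIP_Family, quarter_number) %>%
    summarize(number_employers = n_distinct(empr_nbr), .groups = "drop")

employer_info_example
```

Either approach gives the same final table once the employer counts are inner-joined to the plot input, since the join drops unmatched pairs.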

From here, we can join `employer_info` back to `avg_and_num_major` so that all of the supplemental information accompanying the visualization is in the same file.

In [None]:
# add number of employers
avg_and_num_plot_input <- avg_and_num_major %>%
    inner_join(employer_info, by=c("CIP_Family", "quarter_number"))

avg_and_num_plot_input

At first glance, this line plot appears likely to pass disclosure review because there are no counts below 10. However, what if there were too many rows to view in the notebook? There are a few different solutions to this issue. One is to test whether the number of rows changes when filtering out rows where `n_employed` is less than 10. If the following code outputs `TRUE`, every row, that is, every count of individuals by major and quarter after graduation, satisfies the export rules.

In [None]:
# see if number of rows is different when filtering out rows with fewer than ten observations
nrow(avg_and_num_major) == nrow(avg_and_num_major %>% filter(n_employed >= 10))

If the above code had returned `FALSE`, the line plot as currently displayed would not pass the disclosure review process, and we would need a different way to visually answer the question stated at the start of the section. However, since it returned `TRUE`, the rounded version of the visualization can be exported, provided that the information in `avg_and_num_plot_input` is included as supplemental information.
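
The row-count test above only looks at the number of individuals. A variation, sketched here with made-up data, lists every data point that fails either the individual or the employer threshold. The data frame is a hypothetical stand-in for `avg_and_num_plot_input`, and the thresholds of 10 individuals and 3 employers are assumptions taken from the rule mentioned at the start of this section; the employer rule may not apply to every data source.

```r
library(dplyr)

# Hypothetical stand-in for avg_and_num_plot_input; all values are made up
plot_input_example <- tibble(
    CIP_Family = c("Health", "Health", "Business", "Business"),
    quarter_number = c(1, 2, 1, 2),
    n_employed = c(120, 115, 9, 240),
    number_employers = c(40, 38, 2, 75)
)

# assumed thresholds: 10 individuals and 3 employers
min_individuals <- 10
min_employers <- 3

# list every data point that fails either threshold
failing_points <- plot_input_example %>%
    filter(n_employed < min_individuals | number_employers < min_employers)

failing_points
```

An empty result would mean every point satisfies both thresholds; in this made-up example, one Business data point fails.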

## **5. Unsupervised Machine Learning**

For export purposes, clusters must be treated like any other grouping variable: each cluster must contain the minimum number of individuals and (when applicable) employers to pass disclosure control.
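
As a rough sketch of what this could look like, the code below counts individuals per cluster and flags any cluster below the minimum. The data frame, the `person_id` and `cluster` column names, and the cluster sizes are all hypothetical, and the minimum of 10 is assumed from the rule used in the earlier sections.

```r
library(dplyr)

# Hypothetical cluster assignments; labels and sizes are made up
clustered_example <- tibble(
    person_id = 1:45,
    cluster = c(rep(1, 25), rep(2, 14), rep(3, 6))
)

# count individuals per cluster and flag any cluster below the assumed minimum of 10
cluster_counts <- clustered_example %>%
    count(cluster, name = "number_individuals") %>%
    mutate(passes = number_individuals >= 10)

cluster_counts
```

A cluster that fails the check would need to be handled like any other small subgroup before export.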

## **6. Exporting Output and Supplemental Files as .csv and .png**

To include visualizations and tables in an export request, you can save them to your files directly from a Jupyter notebook as .csv and .png files using `write_csv()` (or `write.csv()`) and `png()`, respectively.
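
A minimal sketch of both steps follows. The table, plot, file names, and image dimensions are hypothetical, and `tempdir()` is used only so the example runs anywhere; in practice you would point the paths to your own results folder.

```r
library(ggplot2)
library(readr)

# Hypothetical table and plot; all values are made up
example_table <- data.frame(
    CIP_Family = c("Health", "Business"),
    number_individuals = c(5200, 6500)
)

example_plot <- ggplot(example_table, aes(x = number_individuals, y = CIP_Family)) +
    geom_col()

# Hypothetical output location; replace with your own results folder
output_dir <- tempdir()
csv_path <- file.path(output_dir, "example_table.csv")
png_path <- file.path(output_dir, "example_plot.png")

# save the table as a .csv file
write_csv(example_table, csv_path)

# open a png device, draw the plot, then close the device to write the file
png(png_path, width = 800, height = 600)
print(example_plot)
dev.off()
```

The call to `dev.off()` matters: the .png file is not finalized until the graphics device is closed.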

# References

Feder, Benjamin, Simone, Sean, & Barrett, Nathan. (2022, March 30). Preparing Safe Outputs using New Jersey Education to Earnings Data System Tables. Zenodo. https://doi.org/10.5281/zenodo.6399350