# **<center> Disclosure Review </center>**

Sean Simone, Nathan Barrett, Benjamin Feder

## **1. Introduction**

At this point, we have created useful visualizations and tables we may want to use outside of the ADRF for a final presentation or report. However, **data cannot leave the ADRF without passing the export review process**, where reviewers confirm that none of the findings are disclosive given the rules provided by the agency. This notebook explains how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and provides an overview of the information needed for disclosure review. Please read through the entire notebook, because it discusses separately the different types of output that will be flagged in the disclosure review process. The examples are almost all taken directly from the code in the notebooks from Module 2.

## **2. Learning Objectives**

We will use a variety of examples from previous notebooks to walk through how to safely prepare various outputs so they will pass the ADRF disclosure review process. At the end of each section, we will save the disclosure-proofed outputs as csv and png files and explain how they should be properly documented to pass the export process. As you work through the material presented in the notebook, think about how you might apply the techniques and code to your project. 

In this notebook, our focus is **disclosure control**. After you finish this notebook, you should understand:
- how to make sure tables and visualizations are safe for export
- how to provide and document supporting files
- how to adjust subgroups if they do not satisfy the disclosure control requirements

### **Research Questions**
In this notebook, we revisit the research questions presented in the Module 2 notebooks:
- How many students graduated from New Jersey public postsecondary institutions by subgroup (e.g. demographics, institutions, enrollment type, etc.)?
- What are the average quarterly wages of our graduating cohort?
- Do the stable employment outcomes of our cohort vary by sex?
- What are the most common employment patterns of our cohort?
- What were the most common major types within the cohort?
- Are there differences in earnings outcomes for Psychology and Business graduates in our cohort over time?

### **Datasets** 
We will explore and understand the New Jersey Education to Earnings Data System (NJEEDS) tables in this notebook:
- **Higher Education (OSHE) Completions**: The completions table comes from the Office of the Secretary of Higher Education's (OSHE) Student Unit Record data system (SURE). The data include completions at all levels that are reported to the U.S. Department of Education's Integrated Postsecondary Education Data System (IPEDS) Completions Survey.
- **Higher Education (OSHE) Supplemental Tables**:  Multiple supplemental tables are available to append contextual information to the completions table. 
- **Unemployment Insurance UI wage records**: Data are collected from businesses and industries in New Jersey that participate, as required by law, in the Unemployment Insurance (UI) program and its trust fund. Wages are reported monthly by industry and compiled quarterly. This information is used for administering the UI program and feeds into mandatory reports to the US Department of Labor.


### **Methods** ###

Through preparing the answers to these questions for export, we will cover techniques for disclosure-proofing the following types of exports:

- Counts
- Percentiles
- Percentages
- Bar plots
- Line plots
- Heat maps

### **Remarks about Disclosure Review**

In general, any kind of file format can be exported. However, researchers typically export tables, graphs, regression outputs and aggregated data. Thus, we expect you to export files in either .csv, .txt or graph format. Results cannot be exported in a Jupyter notebook, as the disclosure review process is too burdensome.

However, this does not mean that you will not need Jupyter notebooks in the export process, as they may contain essential code that can be listed as supplemental information for the reviewers. It is essential to document every step of an analysis in a Jupyter notebook, or any code for that matter. 

A more detailed description of the rules for exporting results can be found in the [export documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before preparing an export submission, please read through the export documentation. Please request an export only when results are final and they are needed for a presentation or final project report.

The general idea behind potential disclosure for the data provided in this class is that every statistic for export must be based on at least **10 individuals**. <u>If the counts of individuals are not clear in the desired table or visualization for export, supplemental counts must be provided in the supporting documentation.</u> 

## **3. Tables** ##

In this section, we demonstrate how to disclosure-proof tables for export. Here, we will walk through preparing tables with three different types of variables:
1. Counts
2. Percentiles
3. Percentages

Note that this is not an exhaustive list of the types of table-based exports. The list contains some of the most common table-based exports, and there is extensive documentation on disclosure-controlling other table-based exports in the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before we touch on these examples, we will load the necessary libraries and establish our connection to the server.

> These are the same packages used in the Data Visualization notebook.

In [None]:
# Database interaction imports
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

### **Counts**
Recall the example from the [Data Exploration: Create a Cohort](./1.Data_Exploration.ipynb) where we answered the following question:
- How many students graduated from New Jersey public postsecondary institutions by subgroup (e.g. demographics, institutions, enrollment type, etc.)?

Our solution consisted of counting the number of graduates between July 1, 2012 and June 30, 2013 by sex and most common majors. Let's load in the csv containing the most common majors by sex, `common_major_sex.csv`.

<font color=red> Before you run the cell below, make sure you have run through the Data Exploration: Create a Cohort notebook and have saved the csv files in your `U:\\..\\NJ Training\\Results` directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
#most common majors by sex
common_major_sex <- read_csv("U:\\..\\NJ Training\\Results\\common_major_sex.csv")

As we calculated in the first notebook, we can see the number of graduates by sex for the most common majors.

In [None]:
# see common_major_sex
common_major_sex

The export rules for the data state that each statistic in an export must consist of at least <b>10 individuals</b>. Therefore, each row, which reports a number of graduates, must be based on at least 10 individuals (graduates). To disclosure-proof this table, we can use `filter()` to keep only the `sex`/`cip_family` combinations with at least 10 graduates.

In [None]:
# filter major counts by sex to satisfy disclosure rules
common_major_sex_export1 <- common_major_sex %>%
    filter(
        n >= 10
    ) %>%
    arrange(desc(n))

common_major_sex_export1

Excluding a large number of groups below 10 may begin to muddle your output, so other strategies can be employed to maintain table structure while meeting reporting standards for export file release. A common method for doing so is aggregating smaller groups to meet the disclosure review standards for the dataset (at least 10 observations).

In this example, we will create aggregate `cip_family` values so that `sex`/`cip_family` combinations will satisfy the disclosure review limits. First, we will identify the rows in `common_major_sex` where their current counts are insufficient.

In [None]:
# identify rows where counts are not sufficient
common_major_sex %>% 
    filter(n<10)

From here, we can recode the rows that will not pass disclosure review, aggregating them into two major groupings. We can then recalculate `n` and `prop` as we did in the first notebook. This time, we rename them `number_individuals` and `proportion_individuals` so that it is clear to the reviewer that they track the number of observations and the proportion of observations per major within each sex, respectively.

In [None]:
common_major_sex_export2 <- common_major_sex %>%
    mutate(
        cip_family = case_when(
            sex == 0 & cip_family %in% c("BUSINESS,", "COMMUNICAT", "SOCIAL SCI") ~ "major_group1",
            sex == 0 & cip_family %in% c("HEALTH PRO", "PSYCHOLOGY", "EDUCATION.", "HISTORY.") ~ "major_group2",
            TRUE ~ cip_family
        )
    ) %>%
    group_by(cip_family, sex) %>%
    summarize(
        number_individuals = sum(n)
    ) %>%
    ungroup() %>%
    group_by(sex) %>%
    mutate(
        proportion_individuals = number_individuals/sum(number_individuals)
    ) %>%
    arrange(sex, proportion_individuals) %>%
    ungroup()

common_major_sex_export2

Since the tables designated for export (`common_major_sex_export1` and `common_major_sex_export2`) already contain the number of individuals per grouping, neither requires a supplemental file containing counts per group. These tables are safe for export.

> Tables may require secondary suppression if totals are included. This additional suppression makes it impossible to work backwards from the total number of individuals to the size of a group with fewer than 10 graduates. In practice, the ADRF staff would suppress the next-lowest counts if this were the case. Regarding `common_major_sex_export1`, however, multiple majors had groups below 10 and the total number of graduates per sex was not reported, so these inferences cannot be made.
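
To make the secondary suppression issue concrete, here is a minimal sketch using made-up counts (the majors and numbers are hypothetical and do not come from NJEEDS). It shows how a single suppressed cell can be recovered from a published total, and how suppressing the next-lowest count as well blocks that subtraction.

```r
library(dplyr)

# Hypothetical counts for illustration only (not NJEEDS data)
toy_counts <- tibble(
    cip_family = c("major_a", "major_b", "major_c"),
    n = c(120, 45, 7)
)

# Primary suppression: hide the one cell below 10
published <- toy_counts %>%
    mutate(n_published = if_else(n >= 10, n, NA_real_))

# If the total is also published, the hidden cell follows by subtraction
total <- sum(toy_counts$n)
recovered <- total - sum(published$n_published, na.rm = TRUE)
recovered

# Secondary suppression: also hide the next-lowest count, so subtraction
# only reveals the combined size of the two hidden cells
published2 <- toy_counts %>%
    mutate(n_published = if_else(rank(n) <= 2, NA_real_, n))
published2
```

Here `recovered` equals the suppressed count exactly, which is why a total cannot accompany a table with only one suppressed cell.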

### **Fuzzy Percentiles**

Suppose we took a question adapted from the Data Exploration: Wages notebook:
- What are the average quarterly wages of our graduating cohort?

but discovered that a median would better represent what was in the data due to skewness and outliers. Let's slightly adjust this question to:
- What are the *median* quarterly wages of our graduating cohort?
> The median is another way to describe the 50th percentile.

Under no circumstances will true percentiles pass export review, because a percentile is an <b><i>ACTUAL</i></b> observed value for an individual and, whatever the unit of analysis, is directly subject to disclosure review. To get a sense of percentiles in the data, you can export fuzzy percentiles instead, which are created by averaging two true percentiles. For example, a fuzzy median can be created by averaging the true 45th and 55th percentiles. This section provides a template for creating fuzzy percentiles so that they can pass the disclosure review process.

To answer the original question concerning quarterly wages, we used the following code:

    quarterly_wages %>%
        group_by(quarter_number) %>%
        summarize(mean_wage = mean(total_wages))

To calculate the median quarterly wages, we will need to start with the `quarterly_wages` data frame, which is an aggregated version of the table `tr_nj_2021.dbo.nb_cohort_wages_link` created in the supplemental table creation notebook.

In [None]:
## read tr_nj_2021.dbo.nb_cohort_wages_link into R
qry <- "
select *
from tr_nj_2021.dbo.nb_cohort_wages_link
"
df_wages <- dbGetQuery(con, qry)

head(df_wages)

Before we can create `quarterly_wages`, we need to create the `quarter_number` variable and filter out wage observations 0 and 13 quarters after graduation.

In [None]:
# get quarter from graduation and ignore quarters 0 and 13
df_wages <- df_wages %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    ) %>%
    filter(!(quarter_number %in% c(0, 13)))

Now we can create `quarterly_wages` using the same code as we did in the second data exploration notebook.

In [None]:
# find and save as quarterly_wages
quarterly_wages <- df_wages %>%
    group_by(hashed_ssn, quarter_number) %>%
    summarise(
        total_wages = sum(wage),
    ) %>%
    ungroup()

Recall that we used the following code to find the average quarterly earnings per quarter after graduation for our cohort:

    quarterly_wages %>%
            group_by(quarter_number) %>%
            summarize(mean_wage = mean(total_wages))

Now, if we substitute `summarize(median_wage = median(total_wages))` for the line `summarize(mean_wage = mean(total_wages))`, we can calculate the true median quarterly pay received by NJ graduates in their first 12 quarters after graduation.

In [None]:
# calculate median pay
quarterly_wages %>% 
    group_by(quarter_number) %>%
    summarize(median_wage = median(total_wages)) %>%
    head()

However, as mentioned at the beginning of the section, direct percentiles cannot be exported from the ADRF. Instead, we can export a "fuzzy median", which will be calculated by averaging the true 45th and 55th percentiles. To calculate specific percentiles, we will use the `quantile()` function, which requires the variable of interest and the proportion corresponding to the percentile.

In [None]:
# implement fuzzy median
fuzzy_median_export <- quarterly_wages %>% 
    group_by(quarter_number) %>%
    summarize(
        fuzzy_median = (quantile(total_wages, .45) + quantile(total_wages, .55))/2
    )

head(fuzzy_median_export)

Notice how the fuzzy median quarterly wages do not deviate much from the true median ones in this example.
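
The same averaging idea extends to other percentiles. The sketch below wraps it in a small helper; the function name `fuzzy_percentile()` and the default half-width of 5 percentile points are assumptions made for illustration (the half-width simply mirrors the 45th/55th example above), and the wages are toy values rather than NJEEDS data.

```r
# Hypothetical helper: average the true percentiles `width` below and above p
fuzzy_percentile <- function(x, p, width = 0.05) {
    # the two percentiles being averaged must stay within [0, 1]
    stopifnot(p - width >= 0, p + width <= 1)
    bounds <- quantile(x, c(p - width, p + width), na.rm = TRUE, names = FALSE)
    mean(bounds)
}

# Toy wages for illustration only
toy_wages <- c(1000, 2000, 3000, 4000, 5000, 6000,
               7000, 8000, 9000, 10000, 11000)

fuzzy_percentile(toy_wages, 0.50)  # fuzzy median
fuzzy_percentile(toy_wages, 0.25)  # fuzzy 25th percentile
```

Inside a pipeline, a call such as `summarize(fuzzy_q1 = fuzzy_percentile(total_wages, 0.25))` would follow the same pattern as the fuzzy median cell. Confirm the acceptable percentile pairs with the export documentation before relying on this default.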

Although we have prepared the fuzzy median quarterly wages, we have not provided all of the information necessary to receive this file in an export because **we have not shown that there are at least 10 graduates per statistic**, or per quarter after graduation. There are two potential options:

1. Include counts of graduates per quarter in `fuzzy_median_export` and keep only the quarters with at least 10 graduates (it is unlikely that any quarter will be dropped in this case, due to the large numbers of those employed per quarter) OR
2. Create a separate table to show that all quarters have at least 10 individuals and filter out any quarters that will not pass the check afterwards.

Here, we will opt for the second option, and create a separate dataframe containing the number of graduates receiving quarterly wages. To differentiate between the files we want for export and those to demonstrate proof of the export guidelines, we will name this dataframe `fuzzy_median_input`.

In [None]:
# find number of individuals contributing to fuzzy_median_export by quarter_number
fuzzy_median_input <- quarterly_wages %>%
    group_by(quarter_number) %>%
    summarize(
        number_individuals = n_distinct(hashed_ssn)
    )

fuzzy_median_input

The counts of graduates employed by quarter are all above 10, so we will not need to filter out any quarters in `fuzzy_median_export`. `fuzzy_median_export` is now safe for export, assuming that `fuzzy_median_input` is included in the export request as a supplemental file showing satisfaction of the disclosure constraints.

> If `quarterly_wages` contained any `NA`'s, and `NA` values were ignored in generating these fuzzy median estimates, the number of graduates must consist of only those with non-`NA` quarterly wage values.
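
A minimal sketch of that adjustment, using a toy data frame with made-up identifiers and wages (the column `id` stands in for `hashed_ssn`): rows with missing wages are dropped before counting, so the supplemental count matches the observations that actually contributed to the statistic.

```r
library(dplyr)

# Toy data for illustration only; ids and wages are made up
toy_quarterly <- tibble(
    id = c("a", "b", "c", "d"),
    quarter_number = c(1, 1, 1, 1),
    total_wages = c(5000, NA, 7000, 6500)
)

# count only individuals whose wage value is not missing
toy_input <- toy_quarterly %>%
    filter(!is.na(total_wages)) %>%
    group_by(quarter_number) %>%
    summarize(number_individuals = n_distinct(id))

toy_input
```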

### **Percentages**

For any reported percentages or proportions, the underlying counts of individuals contributing to the numerators and denominators must be provided for each statistic in the desired export. Recall the examples from the Data Exploration: Wages notebook where we answered the following questions:

- Do the stable employment outcomes of our cohort vary by sex?
- What are the most common employment patterns of our cohort?

By contrasting the approaches for these two examples, we will run through the steps required to export percentages.

The answers to these questions are stored in the csv files `patterns.csv` and `full_q_stats_sex.csv`.

<font color=red> Before you run the cell below, make sure you have run through the Data Exploration: Wages notebook and have saved the csv files in your `U:\\..\\NJ Training\\Results` directory. You also need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# read files
patterns <- read_csv("U:\\..\\NJ Training\\Results\\patterns.csv")
full_q_sex <- read_csv("U:\\..\\NJ Training\\Results\\full_q_stats_sex.csv")

In [None]:
# see files
head(patterns)
full_q_sex

As currently displayed, the proportions in the tables above can be exported. The reviewer can easily examine the numerators if the count columns are pointed out to them, and the denominators for the `prop` variables can be explained in the documentation accompanying the export request (in this case, each proportion is a share of the total observations, with no subgroup breakdown like the one in the [Counts](#Counts) section). As long as the counts for the patterns are all at least 10, the proportions will pass disclosure review.

If we had to suppress the estimate for individuals who did not report sex due to a lack of observations, we would run into a secondary disclosure problem: the size of that group could be inferred with simple arithmetic and division using the information present in the table. In that case, we would also need to suppress the next smallest estimate (1), so you would end up with a table with only one category.

> If you are unsure if your table will require secondary redaction, please ask your team lead.
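
As a rough illustration of the numerator and denominator requirement, the sketch below builds a supporting table for a percentage. The column names `n_stable` and `n_total` and all values are hypothetical; they are not taken from `full_q_stats_sex.csv`.

```r
library(dplyr)

# Hypothetical counts by sex; values are made up
toy_full_q <- tibble(
    sex = c(0, 1),
    n_stable = c(150, 210),   # numerator: individuals with stable employment
    n_total = c(400, 500)     # denominator: all individuals in the group
)

# report the proportion alongside both underlying counts and
# flag whether each one meets the 10-individual minimum
toy_pct_input <- toy_full_q %>%
    mutate(
        prop = n_stable / n_total,
        passes = n_stable >= 10 & n_total >= 10
    )

toy_pct_input
```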

## **4. Visualizations**

Whereas the previous section contained examples of how to prepare tables for export, this section discusses preparing certain visualizations for export. The approach for disclosure-proofing visualizations is similar to the one for tables--it all pertains to the underlying data used to create the visualization. Here, we will discuss three different types of visualizations:
1. Bar plots
2. Line plots
3. Heatmaps

Keep in mind that this is not an exhaustive list of the types of visualization-based exports. Please refer to the [user documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/) for guidance on preparing other visualizations for disclosure review.

> The most common type of visualization that may require more non-obvious underlying data manipulation to satisfy export constraints is a *histogram*, as each bar in the histogram is technically a separate grouping. Therefore, each bar in the histogram, in this case, must contain information from at least 10 individuals.
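
One way to inspect histogram bins before plotting is sketched below. It uses simulated wages (not NJEEDS data) and assumes each value belongs to a different individual; `hist()` with `plot = FALSE` returns the bin breaks and counts without drawing anything.

```r
# Simulated wages for illustration only
set.seed(1)
toy_wages <- rlnorm(500, meanlog = 9, sdlog = 0.5)

# compute bins without drawing the histogram
h <- hist(toy_wages, breaks = 20, plot = FALSE)

# one row per bar, assuming one observation per individual
bin_counts <- data.frame(
    bin_start = head(h$breaks, -1),
    bin_end = tail(h$breaks, -1),
    number_individuals = h$counts
)

# bars that would need wider bins or removal before export
subset(bin_counts, number_individuals < 10)
```

A table like `bin_counts` could also serve as the supplemental file for the histogram, since it lists the count behind every bar.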

### **Bar Plots**

Recall the question which was answered in the Data Visualization notebook using a bar plot:

- What were the most common major types within the cohort?

Since a bar plot is separated by distinct categories, or groups, the export requirements are straightforward: each bar must consist of observations from at least 10 individuals. Recall the code used to generate the final bar plot from the `common_major.csv` file, which contains the counts of graduates for the top 10 majors.

<font color=red> Before you run the cell below, remember to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# most common majors
common_major <- read_csv("U:\\..\\NJ Training\\Results\\common_major.csv")

# see common_major
common_major

Recall the code used to generate the bar plot.

In [None]:
#Assign a custom color palette to use with the bar graphs
palette_color <- c("BUSINESS," = "orange",
                   "PSYCHOLOGY" = "blue",
                   "SOCIAL SCI" = "red",
                   "HEALTH PRO" = "purple",
                   "BIOLOGICAL" = "green3",
                   "COMMUNICAT" = "brown",
                   "EDUCATION." = "black",
                   "ENGINEERIN" = "yellow",
                   "VISUAL AND" = "lavender",
                   "LIBERAL AR" = "seagreen"
                  )

#Create a bar chart showing the 10 majors most represented in the graduating cohort

#Specify source dataset and x and y variables
common_major_plot <- ggplot(common_major, aes(x = n, y = reorder(cip_family,n), fill=cip_family)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = palette_color, guide = "none") +

#Adjust the x scale to set the interval for tick marks
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, 6000, 1000),
    limits = c(0, 6000)) +

#Add titles and axis labels
labs(title = "NJ graduates by major",
     subtitle = "Top 10 Majors",
     x = "Number of Graduates", y = "Major",
     caption = "Data Source: NJEEDS data")

#Display the graph we just made
print(common_major_plot)

Although we can see that each bar contains at least 10 individuals, we should make it as clear as possible for the reviewer. To do so, we will include the specific counts for these 10 majors as a supplemental file. These counts already exist in `common_major.csv`. Just so we do not forget, let's save them in `common_major_input` as well.

In [None]:
# save common_major as common_major_input
# rename n to number_individuals to make as clear as possible for reviewer

common_major_input <- common_major %>%
    select(cip_family, n) %>%
    rename(number_individuals = n)

# see new file
common_major_input

Again, since each bar consists of at least 10 individuals, there is no need for any additional redaction.

### **Line Graphs**

In the Data Visualization notebook, we answered the following question with a line graph:

- Are there differences in earnings outcomes for Psychology and Business graduates in our cohort over time?

Line graphs can be slightly more complicated than bar graphs for the disclosure review process, as each point utilized to create the line graph is subject to the export guidelines, as opposed to each line just needing to consist of at least 10 individuals. The information used to generate this line graph is available in `avg_and_num_major.csv`.

<font color=red> Before you run the cell below, remember to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# average quarterly earnings and number employed by quarter (common majors)
avg_and_num_major <- read_csv("U:\\..\\NJ Training\\Results\\avg_and_num_major.csv")

Recall the code written in the Data Visualization notebook used to create the line plot.

In [None]:
# rename and change type of cip code variable
avg_and_num_major <- avg_and_num_major %>%
    rename(cip = `substring(major, 1, 2)`) %>%
    mutate(
        cip = as.character(cip)
    )

# read in cip code lookup
cip_xwalk <- read_csv("P:\\tr-nj-2021\\NJ Class Notebooks\\xwalks\\oshe_cip_xwalk.csv")

# match to avg_and_num_major and keep desired columns
avg_and_num_major <- avg_and_num_major %>%
    left_join(cip_xwalk, by = c("cip" = "CIPCode-text")) %>%
    select(CIPTitle, quarter_number, mean_wage, n_employed)

#Specify source dataset and x, y, and color variables
# only take first word in title because the full title makes the graph harder to read
avg_wages_major_plot <- ggplot(avg_and_num_major, aes(x = quarter_number, y = mean_wage, 
                                                      color = word(CIPTitle, 1))) + 

#Plots a line on the graph
geom_line() +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(4000, 14000, 2000),
    limits = c(4000,14000)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "NJ Graduates majoring in business experienced REDACTED earnings after graduation on average",
    x = "Quarter after Graduation", y = "Average Quarterly Wages",
    caption = "Data Source: NJEEDS Data",
    color = "Major") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)

Since we are plotting average quarterly wages, the reviewer cannot find the number of individuals contributing to these average quarterly wages calculations for each quarter after graduation for the two majors. Luckily, this table already exists for us as the input table used to generate the above visualization, `avg_and_num_major`. We just need to confirm that all `mean_wage` values have at least 10 graduates.

In [None]:
# get counts of graduates by major and quarter
avg_and_num_major

It appears that this line plot will pass disclosure review because there are no counts below 10. However, what if there were too many rows to view in the notebook? There are a few different solutions to this issue--one being a test to see if the number of rows changes when filtering out rows where `n_employed` is less than 10. As long as the following code outputs `TRUE`, there were no rows, or counts of individuals by major and quarter after graduation, that did not satisfy the export rules.

In [None]:
# see if number of rows is different when filtering out rows with fewer than ten observations
nrow(avg_and_num_major) == nrow(avg_and_num_major %>% filter(n_employed >= 10))

If the above code resulted in `FALSE`, the line plot as it is currently displayed would not pass the disclosure review process, and would require an alternative manipulation to visually display the answer to the question stated at the start of the section. But because it is `TRUE`, let's save `avg_and_num_major` as `avg_and_num_major_plot_input`.

In [None]:
# save as avg_and_num_major_plot_input
avg_and_num_major_plot_input <- avg_and_num_major
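
If a check like this ever fails, it helps to see which points are responsible. The sketch below uses a toy stand-in for `avg_and_num_major` with made-up values; it shows an equivalent `all()` test and then lists the rows that fall below the threshold.

```r
library(dplyr)

# Toy stand-in for avg_and_num_major; all values are made up
toy_major <- tibble(
    CIPTitle = c("Business", "Business", "Psychology", "Psychology"),
    quarter_number = c(1, 2, 1, 2),
    mean_wage = c(9000, 9500, 7000, 7200),
    n_employed = c(250, 240, 12, 8)
)

# TRUE only if every plotted point is based on at least 10 individuals
all(toy_major$n_employed >= 10)

# points that would need to be dropped or aggregated before export
small_cells <- toy_major %>%
    filter(n_employed < 10)

small_cells
```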

### **Heat Maps**

In the Data Visualization notebook, a heat map was created to answer a question we have already analyzed in this notebook as part of the [Percentages](#Percentages) section: 

- What are the most common employment patterns of our cohort?

The heat map allowed us to display employment patterns by quarter after graduation, with an employment indicator taking on a blue or red tile. Heat maps follow similar rules to those of bar plots, as each observation (instead of bar) must consist of experiences from at least 10 individuals. 

Recall the code written to create the original heat map:

In [None]:
# only display 15 most common employment patterns
patterns <- patterns %>%
    head(15)

# Save counts to use later in the heatmap - we cannot use the counts as index, as there could be duplicate values 
counts <- patterns$cnt
pcts <- patterns$prop

# Add index with unique sequential numbers and remove the `cnt` and `prop` columns
patterns$Pattern <- seq.int(nrow(patterns))
patterns$cnt <- NULL
patterns$prop <- NULL

# convert to long format
patterns_long <- pivot_longer(patterns, names_to = 'Quarter', values_to = 'Status', -c(Pattern))

# Full code for the plot

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table from highest to lowest

patterns_long$Quarter <- factor(patterns_long$Quarter, levels=c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11", "Q12"))

ggplot(data = patterns_long, aes(x = Quarter, y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
labs(
    y = "Employment - Percentages",
    title = 'Employment Patterns by Quarters',
    caption = 'Source: NJEEDS data'
) +
scale_y_discrete(labels=rev(pcts))  # rename the y-axis ticks to correspond to the proportions from the table


Because this heatmap labels each pattern with its proportion of the cohort, the reviewer cannot determine the number of individuals behind each employment pattern. Luckily, the counts of individuals contributing to each pattern were stored in `counts` before the `cnt` column was dropped from `patterns` (they also remain in `patterns.csv`), so we can save them as a supporting file to accompany this visualization.
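
A supporting table for the heat map could be assembled along the following lines. This is only a sketch: `toy_cnt` and `toy_prop` are made-up stand-ins for the count and proportion vectors saved in the heat map code, and the name `patterns_input` is a hypothetical choice that follows the `_input` naming used elsewhere in this notebook.

```r
library(dplyr)

# Made-up stand-ins for the saved count and proportion vectors
toy_cnt <- c(900, 300, 120)
toy_prop <- c(0.45, 0.15, 0.06)

# one row per heat map row, in the same order as the plot
patterns_input <- tibble(
    Pattern = seq_along(toy_cnt),
    number_individuals = toy_cnt,
    proportion_individuals = toy_prop
)

# every heat map row must be based on at least 10 individuals
all(patterns_input$number_individuals >= 10)
```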

## **5. Unsupervised Machine Learning**

Exporting clusters must be treated as any other grouping variable, as each cluster must satisfy a minimum number of individuals and (when applicable) employers to pass disclosure control.
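
As a rough sketch of what that check might look like, the example below clusters simulated data with base R's `kmeans()` and tabulates the cluster sizes. The features are random numbers, the choice of three clusters is arbitrary, and the code assumes one row per individual.

```r
# Simulated features for illustration only
set.seed(42)
toy_features <- matrix(rnorm(200 * 2), ncol = 2)

# k-means with 3 clusters (k chosen arbitrarily for this sketch)
fit <- kmeans(toy_features, centers = 3)

# number of individuals per cluster, assuming one row per individual
cluster_sizes <- table(fit$cluster)
cluster_sizes

# TRUE only if every cluster meets the 10-individual minimum
all(cluster_sizes >= 10)
```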

## **6. Exporting Output and Supplemental Files as .csv and .png**

To include visualizations and tables in an export request, you can save them to your files directly from a Jupyter notebook as .csv and .png files using `write_csv()` (or `write.csv()`) and `png()`, respectively.
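
A minimal sketch of both steps is shown below, using a toy table and plot. The output folder here is `tempdir()`, which is only a placeholder; in practice you would point `out_dir` at your own results directory, and the file names are hypothetical.

```r
library(readr)
library(ggplot2)

# Toy table and plot for illustration only
toy_export <- data.frame(
    group = c("a", "b"),
    number_individuals = c(25, 40)
)
toy_plot <- ggplot(toy_export, aes(x = group, y = number_individuals)) +
    geom_col()

# placeholder folder; replace with your own results directory
out_dir <- tempdir()

# save the table as a .csv file
write_csv(toy_export, file.path(out_dir, "toy_export.csv"))

# save the plot as a .png file: open the device, print, then close it
png(file.path(out_dir, "toy_plot.png"), width = 800, height = 600)
print(toy_plot)
dev.off()
```

Remember to save the matching `_input` tables the same way, so that each exported file has its supporting counts alongside it.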

## **7. Geographical Analysis**

If you are exporting **industry-level** information derived from the New Jersey UI wage records, in addition to showing that each reported value is based on observations from at least 10 individuals, you must also provide evidence that each reported value consists of observations from at least three employers and that no single employer makes up more than 80% of the employment for that industrial sector. 

Additional information regarding these requirements is available in the class resources.
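
The three conditions above could be checked together roughly as follows. This is a sketch on made-up records: the column names `industry`, `employer_id`, and `person_id` are hypothetical, and the toy data assume each person has exactly one employer.

```r
library(dplyr)

# Toy records, one row per person-employer pair; all values are made up
toy_jobs <- tibble(
    industry = c(rep("A", 12), rep("B", 12)),
    employer_id = c(rep("e1", 4), rep("e2", 4), rep("e3", 4),
                    rep("e4", 11), "e5"),
    person_id = paste0("p", 1:24)
)

industry_check <- toy_jobs %>%
    # employment per employer within each industry
    group_by(industry, employer_id) %>%
    summarize(n_emp = n_distinct(person_id), .groups = "drop") %>%
    # individuals, employers, and largest employer share per industry
    group_by(industry) %>%
    summarize(
        number_individuals = sum(n_emp),
        number_employers = n_distinct(employer_id),
        max_employer_share = max(n_emp) / sum(n_emp)
    ) %>%
    mutate(
        passes = number_individuals >= 10 &
            number_employers >= 3 &
            max_employer_share <= 0.80
    )

industry_check
```

In this toy example, industry A passes all three checks, while industry B fails because it has only two employers and one of them accounts for more than 80% of employment.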