<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Benjamin Feder, Nathan Caplan, Brian Kim, Ekaterina Levitskaya.

**_Disclosure Review Examples & Exercises_**

This notebook provides information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives an overview of the information needed for disclosure review. _Please read through the entire notebook because it will separately discuss all outputs that will be flagged in the disclosure review process._

In [None]:
# database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)

# scaling data, calculating percentages, overriding default graphing
library(scales)

# adding weights 
library(survey)

# for easier viewing of graphs
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

# General Remarks on Disclosure Review

## Files that can be exported
In general, any kind of file format can be exported. However, most researchers typically export tables, graphs, regression outputs and aggregated data. Thus, please export one of these types, which implies that every result needs to be saved in either .csv, .txt or graph format.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, results contained in a Jupyter notebook cannot be exported. Doing disclosure reviews on output in Jupyter notebooks is too burdensome for the export review team. Jupyter notebooks will only be exported when the output is deleted for the purpose of exporting code. **This does not mean that Jupyter notebooks are not necessary during the export process.** 

## Documentation of code is important
During the export process, please provide the code for every output that needs to be exported. It is important for the ADRF staff to have the code to better understand what exactly was done. Understanding how research results are created is important in understanding the research output. Thus, it is important to document every step of the analysis in the Jupyter notebook. 

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found in the ADRF documentation. This is just a quick overview. Please go to the ADRF documentation and read the entire guidelines (link below) before preparing the files for export. 
- The disclosure review is based on the underlying observations of the study. **Every statistic for export should be based on at least 10 data points at the individual level and at least 3 data points at the institution level**. See the examples below. Please show the disclosure review team that every statistic for export meets these thresholds by providing counts in a corresponding input file. 
- Document the code so the reviewer can follow the data work. Assessing re-identification risks highly depends on the context. Therefore, it is important to provide context information with the analysis for the reviewer. When making comments in the code, make sure not to use any individual statistic (e.g. the median is ...).
- Save the requested output with the corresponding code in the "Input" and "Output" folders. Make sure the code is executable. The code in the Input folder should exactly produce the requested output.
- Export results only when they are final and when needed for the presentation or final project report.
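
The thresholds in the first rule can be written as a small check. The sketch below is illustrative only: the function name `passes_disclosure` and its arguments are hypothetical and not part of any ADRF tooling, and the defaults assume the class thresholds of 10 individuals and 3 institutions.

```r
# Hypothetical helper (not an ADRF function): TRUE when a group meets both
# class thresholds, at least 10 individuals and at least 3 institutions.
passes_disclosure <- function(n_individuals, n_institutions,
                              min_individuals = 10, min_institutions = 3) {
  n_individuals >= min_individuals & n_institutions >= min_institutions
}

# Made-up counts for three groups
passes_disclosure(c(25, 9, 40), c(5, 4, 2))
# TRUE FALSE FALSE
```

The second group fails on individuals and the third fails on institutions, so only the first could be exported as is.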

## To-Do:
- Login to **Gitlab**.
- Read through the **documentation** link: coleridgeinitiative.org/documentation/using-the-adrf/exporting-results/
- Review the Export Request Memo in the shared folder.

# Disclosure Review Walkthrough

This notebook describes how to construct useful statistics and visualizations using the class data and how to prepare them so that they can be approved as output during the export process.

## Fuzzy percentiles

**Question of interest**: _What is the distribution of female earnings across the 2015 cohort?_

Ordinarily, when simply finding the distribution of a numerical variable, the `summary()` function may be used. As a reminder, this function outputs the minimum, first quartile, median, mean, third quartile, and maximum values as per its default output. However, these outputs will not pass the disclosure review process, as some of these statistics may be represented by individual points (such as minimum, maximum, any percentiles, and median). 

Instead, these outputs can be transformed into _fuzzy percentiles_, which can pass the disclosure review process. For example, the 20th and 30th percentiles can be averaged to find a fuzzy 25th percentile.
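
As a rough illustration of the averaging step, the sketch below computes a fuzzy 25th percentile on simulated, unweighted values with base R's `quantile()`. The data and variable names are made up, and the notebook itself works with weighted percentiles from the `survey` package instead.

```r
# Simulated earnings; these values are made up for illustration
set.seed(1)
earnings <- round(rlnorm(500, meanlog = 11, sdlog = 0.5))

# True 20th and 30th percentiles (unweighted, base R)
p <- quantile(earnings, probs = c(0.20, 0.30), names = FALSE)

# Fuzzy 25th percentile: the average of the two neighbouring percentiles
fuzzy_25 <- mean(p)
fuzzy_25
```

Because the fuzzy value is an average of two percentiles, it generally does not coincide with any single observation in the data.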

Before calculating the numerical distribution, the necessary data from the SDR must be properly retrieved. The relevant variables for this question are listed below and included in the SQL code in the following code cell.
- earnings (`salary`)
- survey weight (`wtsurvy`) - the data in the SDR needs to be weighted, as it is a subsample of a population
- individual ID (`refid`)
- institution ID (`sdrincd`)
- year of graduation (`sdryr`) - where year is 2015
- gender (`gender`) - where gender is female

> Note: Whenever pulling statistics, remember to include individual and institution information, as they will be required to check if the statistics in question pass the disclosure threshold.

In [None]:
# Get the relevant variables from the SDR data and save in female_earnings data frame: 
# - earnings (salary),
# - survey weight (wtsurvy),
# - individual ID (refid),
# - institution ID (sdrincd),
# - where year of graduation (sdryr) is 2015, and
# - where gender (gender) is female.

query <- "
SELECT salary, wtsurvy, refid, sdrincd
FROM ncses_2019.nsf_sdr_2017
WHERE sdryr = '2015' 
AND gender = 'F'
"
female_earnings <- dbGetQuery(con,query)

In [None]:
# Show the first few rows of the table

head(female_earnings)

Recall that as per the Survey of Doctorate Recipients' (SDR) data dictionary, `salary` values equal to `9999998` are reserved for 'Logical Skips'. Therefore, these values can be removed. 

In [None]:
# Remove the rows with logical skip values

female_earnings <- female_earnings[female_earnings$salary != '9999998', ]

> To filter these logical skips using the tidyverse, the code would be `female_earnings <- female_earnings %>% filter(salary != '9999998')`.

As done in the Data Exploration notebook, the survey weights can now be added using the `svydesign` function from the `survey` package to calculate the weighted earnings distribution.

In [None]:
# Add weights using "svydesign" function from "survey" library

female_earnings_weighted <- svydesign(ids=~1, data=female_earnings, weights=female_earnings$wtsurvy)

As mentioned above, to export the 25th, 50th, and 75th percentiles, they need to be transformed into fuzzy percentiles by averaging close-by percentiles. For example, in order to find a fuzzy 25th percentile, the 20th and 30th percentiles can be averaged.

Start by finding the following true percentiles on the weighted data using `svyquantile`:
- 20th and 30th (to create a fuzzy 25th percentile),
- 45th and 55th (to create a fuzzy 50th percentile),
- 70th and 80th percentile (to create a fuzzy 75th percentile). 

In [None]:
# Find 20, 30, 45, 55, 70, 80 percentiles on the weighted data

svyquantile(~salary, female_earnings_weighted, c(.20, .30, .45, .55, .70, .80), na.rm=TRUE)

The output from `svyquantile` can be saved as a data frame to allow for further manipulation.

> Note: It is not best practice to save column names as numerical outputs. The code below will show how to work with these types of columns. After creating the data frame, the default column names can be overwritten by manually assigning the desired column names with the base code <br> `colnames(dataframe) <- c('changed', 'column', 'names')`.
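
A minimal sketch of the renaming step on a made-up data frame (the object `pct` and its values are hypothetical):

```r
# Made-up data frame whose column names are numbers stored as text
pct <- data.frame(a = 100, b = 200, c = 300)
colnames(pct) <- c('0.2', '0.3', '0.45')

# Non-syntactic names need backticks (or [[ ]]) when referenced
pct$`0.2`

# Overwrite with descriptive names
colnames(pct) <- c('Q.20', 'Q.30', 'Q.45')
pct$Q.20
```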

In [None]:
# Find 20, 30, 45, 55, 70, 80 percentiles on the weighted data and save results to a dataframe

true <- as.data.frame(svyquantile(~salary, female_earnings_weighted, c(.20, .30, .45, .55, .70, .80), na.rm=TRUE), 
                      col.names = c('Q.20', 'Q.30', 'Q.45', 'Q.55', 'Q.70', 'Q.80'))

# see true
true

If the column names of the `true` data frame are left as numbers, these columns can be referenced by surrounding them in backticks. These percentiles can be averaged to find fuzzy 25th, 50th (median), and 75th percentiles.

In [None]:
# Find values for the fuzzy quantiles by averaging the percentiles 
# (e.g. to find 25th, average 20th and 30th, etc.)

true %>%
    summarize(
        'fuzzy_25' = (`0.2` + `0.3`)/2,
        'fuzzy_50' = (`0.45` + `0.55`)/2,
        'fuzzy_75' = (`0.7` + `0.8`)/2,
    )

Save these fuzzy percentiles to a table.

In [None]:
# save to table

fuzzy <- true %>%
    summarize(
        'fuzzy_25' = (`0.2` + `0.3`)/2,
        'fuzzy_50' = (`0.45` + `0.55`)/2,
        'fuzzy_75' = (`0.7` + `0.8`)/2,
    )

To export these fuzzy percentiles as a CSV, the `write_csv` function can be used as long as the file path and name of the CSV are included. In this example, the final CSV file will be titled `fuzzy_female_earnings.csv` (the more descriptive the name of the file, the easier it is to review).

> Change `YOUR_USERNAME` in the `user` variable to your ADRF username in order to save the CSV to your home folder.

In [None]:
user <- 'YOUR_USERNAME'

In [None]:
fuzzy %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_female_earnings.csv', user))

In order for these fuzzy percentiles to pass disclosure review, they require proof that the underlying counts contain at least 10 data points at the individual level and at least 3 data points at the institution level. 

For example, it is not possible to export the earnings distribution of female PhD graduates from only 1 university, or to compare earnings distributions between just 2 universities. The sample should include at least 3 universities, for example, by comparing a group of at least 3 research universities versus a group of at least 3 teaching universities.

In the SQL query above, the data on individuals (`refid`) and institutions (`sdrincd`) has already been pulled, so it can be used to find the counts of distinct individuals and institutions.

> Individual and institutional counts provided as corresponding proof must be unweighted.

In [None]:
# save individual and institution counts
counts_for_female_earnings <- female_earnings %>%
    summarize(
        individual_count = n_distinct(refid),
        institution_count = n_distinct(sdrincd)
    )

In [None]:
counts_for_female_earnings

This can now be used as proof that the fuzzy percentiles are based on counts that pass the disclosure thresholds.

In [None]:
# save as counts_for_female_earnings.csv
counts_for_female_earnings %>%
    write_csv(sprintf('/nfshome/%s/counts_for_female_earnings.csv', user))

## Visualizations

### SDR - Histogram and Bar Plot

Recall the original question: _What is the distribution of female earnings across the 2015 cohort?_

This question can also be answered visually, instead of numerically, using a histogram. Because the underlying unweighted counts determine if the visualization will pass the disclosure review threshold, the **unweighted** counts will be plotted first.

In [None]:
# unweighted counts for earnings
female_earnings %>%
    ggplot(aes(x=salary)) +
    geom_histogram() +
    scale_x_continuous(labels = comma)

With histograms, each individual bin is considered to be a separate group. Therefore, each bin must consist of at least 10 unique individuals and 3 unique institutions. Above, it seems as though multiple bins may not satisfy the individual count threshold.

Manually-created bins can solve this issue. The lowest and highest salaries in `female_earnings` will help inform potential bin ranges.

In [None]:
# Sort values by lowest first

female_earnings %>%
    arrange(salary) %>%
    head(5)

In [None]:
# Sort values by highest first (with "desc" parameter)

female_earnings %>%
    arrange(desc(salary)) %>%
    head(5)

Based on highest and lowest values, it may make sense to bin the data into 5 salary ranges.

In [None]:
# Create a new column with mutate, `salary_range`, with salary values categorized into 5 bins defined below
female_earnings <- female_earnings %>%
                    mutate(salary_range = case_when(salary >= 0 & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED'
                                                ))

In [None]:
# Show the head of the table with the new column

head(female_earnings)

Based on these bins, the unweighted individual counts can be calculated.

In [None]:
individual_counts <- female_earnings %>% 
                         group_by(salary_range) %>%
                        summarise(individual_count = n_distinct(refid))

individual_counts

The unweighted institution counts can also be calculated.

In [None]:
institution_counts <- female_earnings %>% 
                         group_by(salary_range) %>%
                        summarise(institution_count = n_distinct(sdrincd))

institution_counts

Based on these bins, there are no values in the REDACTED range, and there is only REDACTED value in the REDACTED range (outlier). The REDACTED value in the REDACTED range is not only an outlier, but it will also not pass the disclosure review since the count within this salary range is less than 10 and the corresponding institutions count is less than 3. In this case, because this one salary value is far from the others, it might make sense to remove the outlier.

> Note: in other cases, when the count per bin is under 10 for individuals or under 3 for institutions, consider changing the sizes of the bins so that each bin contains at least 10 data points at the individual level and at least 3 data points at the institution level.
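
One possible way to spot bins that need resizing is to flag them in the counts table. The sketch below uses simulated records and hypothetical bin edges and column names (`person_id`, `inst_id`), not the SDR data.

```r
library(dplyr)

# Made-up records: one row per person, with an institution ID and a salary
set.seed(42)
toy <- data.frame(
  person_id = 1:60,
  inst_id   = sample(paste0('inst_', 1:6), 60, replace = TRUE),
  salary    = c(runif(58, 30000, 90000), 150000, 160000)
)

# Assign each salary to a bin, then count distinct people and institutions
bin_check <- toy %>%
  mutate(bin = cut(salary, breaks = c(0, 50000, 100000, 200000), dig.lab = 6)) %>%
  group_by(bin) %>%
  summarise(
    individual_count  = n_distinct(person_id),
    institution_count = n_distinct(inst_id)
  ) %>%
  mutate(passes = individual_count >= 10 & institution_count >= 3)

bin_check
```

Here the top bin holds only the two high salaries, so it is flagged and would need to be merged with a neighbouring bin or dropped.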

In [None]:
# drop all salaries in REDACTED category

female_earnings <- female_earnings %>%
    filter(salary_range != "REDACTED")

In [None]:
# Check that the REDACTED category was removed

unique(female_earnings$salary_range)

The unweighted counts at the individual level can be reviewed after this change.

In [None]:
individual_counts <- female_earnings %>% 
                         group_by(salary_range) %>%
                        summarise(individual_count = n_distinct(refid))

individual_counts

The unweighted counts at the institution level can be reviewed as well.

In [None]:
institution_counts <- female_earnings %>% 
                         group_by(salary_range) %>%
                        summarise(institution_count = n_distinct(sdrincd))

institution_counts

To graph this histogram using these bins, the `breaks` argument inside `geom_histogram` can be manipulated to impose these edges.

In [None]:
# unweighted counts for earnings with new bins
female_earnings %>%
    ggplot(aes(x=salary)) +
    geom_histogram(breaks=c(REDACTED)) +
    scale_x_continuous(labels = comma)

This can be compared to the original distribution (ignoring the outlier) to confirm that the general distribution remains roughly the same using these bin edges.

In [None]:
# unweighted counts for earnings with default bins (outlier removed)
female_earnings %>%
    ggplot(aes(x=salary)) +
    geom_histogram() +
    scale_x_continuous(labels = comma)

It does not appear as though the manipulated histogram is representative of a more granular visualization. Now that the outlier has been eliminated and the distribution appears as though it may require different bins than originally planned, this distribution can be plotted with the number of individuals in each bin available on the visualization.

> If there are fewer data points than needed, there are different ways to adjust the bins. Play around with the number of bins, drop outliers, or manipulate bins so that each bin contains at least 10 entries at the individual level and at least 3 institutions.

In [None]:
female_earnings %>%
    ggplot(aes(x=salary)) +
    geom_histogram() +
    scale_x_continuous(labels = comma) +
    stat_bin(aes(y=..count.., label= ..count..), geom="text", vjust = -.5)

There is a long tail with small values (less than 10) starting at around REDACTED. This can be further adjusted by customizing the bin range (for example, by combining the values in the tail into larger bin sizes at the end into REDACTED bin range).

> Note: The `breaks` argument must be added to both the `geom_histogram()` and `stat_bin()` calls.
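
One way to keep the two layers in sync is to define the bin edges once and pass the same vector to both calls. The sketch below uses simulated salaries and hypothetical edges stored in `bin_breaks`. It uses `after_stat(count)`, the newer ggplot2 spelling of `..count..`, which assumes a reasonably recent ggplot2 version.

```r
library(ggplot2)

# Made-up salaries and hypothetical bin edges, for illustration only
set.seed(7)
toy <- data.frame(salary = runif(200, 0, 100000))
bin_breaks <- c(0, 25000, 50000, 75000, 100000)

# The same break vector is passed to both layers so that the bars and
# their count labels are computed on identical bins
p <- ggplot(toy, aes(x = salary)) +
  geom_histogram(breaks = bin_breaks) +
  stat_bin(aes(y = after_stat(count), label = after_stat(count)),
           geom = 'text', vjust = -0.5, breaks = bin_breaks)
p
```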

In [None]:
# Try to combine the small values in the tail into larger bin sizes at the end
female_earnings %>%
    ggplot(aes(x=salary)) +
    geom_histogram(breaks=c(REDACTED)) +
    stat_bin(aes(y=..count.., label= ..count..), geom="text", vjust = -.5, breaks=c(REDACTED)) +
    scale_x_continuous(labels = comma) 

This visual, with the new bins, appears to reasonably represent the original distribution. The underlying number of unique institutions within each of these groups (bins) must be confirmed to be at least three.

In [None]:
# Create a new column with mutate, `salary_range_new`, with the new bin ranges defined above

female_earnings <- female_earnings %>%
                    mutate(salary_range_new = case_when(salary >= 0 & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED',
                                                    salary >= REDACTED & salary <= REDACTED ~ 'REDACTED'
                                                ))

In [None]:
individual_counts <- female_earnings %>% 
                         group_by(salary_range_new) %>%
                        summarise(indiv_count = n_distinct(refid))

individual_counts

In [None]:
institution_counts <- female_earnings %>% 
                         group_by(salary_range_new) %>%
                        summarise(inst_count = n_distinct(sdrincd))

institution_counts

In [None]:
# Merge two dataframes to create one dataframe with individual and institution counts

all_counts <- merge(individual_counts, institution_counts, by='salary_range_new')
all_counts

Save this unweighted histogram and the unweighted counts as a disclosure review proof.

> Visualizations in `ggplot2` can be saved using `ggsave`, which will save the most recent visualization.
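
Because `ggsave` defaults to the most recently displayed plot, it can be safer to store the plot in a variable and pass it explicitly through the `plot` argument. A minimal sketch with made-up data, writing to a temporary file rather than to the home folder path used in this notebook:

```r
library(ggplot2)

# Made-up data; the plot object is stored so it can be saved by name
toy <- data.frame(x = c(1, 2, 2, 3, 3, 3))
p <- ggplot(toy, aes(x = x)) + geom_bar()

# tempfile() is used here only so the sketch runs anywhere
out_file <- tempfile(fileext = '.pdf')
ggsave(out_file, plot = p, width = 8, height = 5)
file.exists(out_file)
```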

In [None]:
# Save in the "Input" folder

all_counts %>%
    write_csv(sprintf('/nfshome/%s/counts_for_unweighted_hist_female_earnings.csv', user))

In [None]:
# Save the unweighted histogram with new bins

ggsave(sprintf('/nfshome/%s/unweighted_hist_female_earnings.pdf', user))

For the weighted histogram, which is what will be exported, the same bins must be kept as above. To get the weighted counts, use the `weight` aesthetic in the `ggplot()` call and specify the variable which contains the weights (`wtsurvy`).

In [None]:
# Weighted histogram using the same bins as the unweighted histogram

female_earnings %>%
    ggplot(aes(x=salary, weight=wtsurvy)) +
    geom_histogram(breaks=c(REDACTED)) +
    scale_x_continuous(labels = comma) +
    labs(
        title = 'Most female doctoral recipients from 2015 cohort received a salary between REDACTED in 2017',
        caption = 'SDR NCSES Data'
        )

Now that the visualization is properly prepared, it can be saved for export.

In [None]:
# Save the weighted histogram 
ggsave(sprintf('/nfshome/%s/weighted_hist_female_earnings.pdf', user))

As an alternative, since manipulating the bin edges can be viewed as transitioning from a histogram to a barplot, and the unweighted counts are already saved as input for this data, a bar plot based on the weighted counts can be produced. Each weight represents a certain number of individuals, so to find the weighted counts within each group (bar), the `wtsurvy` variable can be summed by `salary_range_new` grouping.
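
To make the difference between unweighted and weighted counts concrete, the sketch below sums made-up weights per group. The data frame `toy` and its columns are hypothetical stand-ins for `salary_range_new` and `wtsurvy`.

```r
library(dplyr)

# Made-up respondents: each weight is the number of people a row represents
toy <- data.frame(
  group  = c('low', 'low', 'mid', 'mid', 'mid', 'high'),
  weight = c(10.5, 12.0, 8.0, 9.5, 11.0, 20.0)
)

weighted_counts <- toy %>%
  group_by(group) %>%
  summarise(
    unweighted_count = n(),          # rows actually observed
    weighted_sum     = sum(weight)   # estimated number of people represented
  )

weighted_counts
```

The `high` group shows a weighted sum of 20 from a single observed row, which is why disclosure review looks at the unweighted counts.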

In [None]:
# Sum the weights per salary range

female_earnings_grouped <- female_earnings %>%
                                group_by(salary_range_new) %>%
                                summarise(weighted_sum = sum(wtsurvy)) %>%
                                arrange(desc(weighted_sum))

In [None]:
female_earnings_grouped

This can then be piped into a `ggplot` call.

> `scale_x_discrete` allows for the specification of an ordering of the x-axis, the salary ranges.

In [None]:
female_earnings_grouped %>%
    ggplot(aes(x=salary_range_new, y=weighted_sum)) +
    geom_bar(stat="identity") +
    scale_x_discrete(limits = c('REDACTED')) +
    labs(
        x = 'Salary Range',
        y = 'Individuals',
        title = 'Most female doctoral recipients from 2015 cohort received a salary between REDACTED in 2017',
        caption = 'SDR NCSES Data'
        )

If preferred to the adapted histogram, the weighted bar plot can be saved instead.

In [None]:
# Save the weighted bar plot 
ggsave(sprintf('/nfshome/%s/weighted_barplot_female_earnings.pdf', user))

Please keep in mind that any statistics (e.g. for counts of subpopulations) have to comply with the disclosure threshold described above - not only the **counts of individuals** (**at least 10**), but also the **counts of institutions** (**at least 3**).

### UMETRICS - Working with percentages (Bar Plot)

The final section will shift its focus to federal funding histories. As covered in the Data Visualization notebook, it is possible to create a bar plot of the percentage of individuals who received a given number of semesters of federal funding. This example will walk through how to properly prepare this visualization for export.

The following code created the bar plot.

In [None]:
# join sed to umetrics using umetrics_xwalk
qry <- "
select c.*, d.semester, d.team_size from (
select a.*, b.emp_number 
from ncses_2019.nsf_sed a
inner join ncses_2019.sed_umetrics_xwalk b
on a.drf_id = b.drf_id
where a.phdfy = '2015' and a.phdinst in (REDACTED) ) c
inner join ncses_2019.iris_semester d
on c.emp_number = d.emp_number
"
cohort_joined <- dbGetQuery(con, qry)


# find number of semesters of federal funding per person
fed_sems <- cohort_joined %>%
    group_by(drf_id, team_size) %>%
    summarize(n_sems_fed = n_distinct(semester)) %>%
    ungroup() %>%
    mutate(
        n_sems_fed = ifelse(is.na(team_size), 0, n_sems_fed)
    ) %>%
    group_by(drf_id) %>%
    summarize(
        fed_funds = sum(n_sems_fed)
    ) %>%
    ungroup()

# transform ..prop.. variable
prop_fed_funds <- fed_sems %>%
    ggplot(aes(x=fed_funds, y=..prop..*100)) +
    geom_bar()

# add in labels
prop_fed_funds +
    geom_text(aes(label=percent(..prop.., .1), y=..prop..*100), stat="count", vjust = -.5) +
    scale_x_continuous(breaks = seq(0,12, by=1)) + 
    labs(
        x = "Number of Semesters Receiving Federal Funding",
        y = "Percent",
        title = "Most PHD Candidates did not Receive more than REDACTED semesters of Federal Funding",
        caption = "Source: SED NCSES and UMETRICS data"
    )

When working with percentages (both numerically and visually), it is important to show the underlying numerator and denominator counts.

The following are the rules for calculating any percentages/proportions/ratios:
- Calculate percentages, proportions, and ratios using the unweighted counts.
- The numerator and denominator should be above the threshold for each group, or bar, specified in the disclosure review (in this class, at least 10 individuals and at least 3 institutions).
- Round percentages calculated from unweighted counts to 1 decimal place.
- Do not report 0% or 100%.
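
The rules above can be sketched on made-up counts as follows. Setting a cell to `NA` is just one possible way to withhold a value and is an assumption of this sketch. The institution check is omitted here for brevity and would be applied in the same way.

```r
# Made-up unweighted counts per group
counts <- data.frame(
  group     = c('A', 'B', 'C', 'D'),
  numerator = c(120, 45, 6, 0)
)
counts$denominator <- sum(counts$numerator)

# Percentage from unweighted counts, rounded to 1 decimal place
counts$percent <- round(100 * counts$numerator / counts$denominator, 1)

# Withhold cells with fewer than 10 individuals or that would show 0% or 100%
suppress <- counts$numerator < 10 | counts$percent %in% c(0, 100)
counts$percent[suppress] <- NA
counts
```

Groups C and D end up without a reported percentage, while A and B keep theirs.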

In this example, the counts are unweighted, as UMETRICS does not contain any weights. Because the underlying counts (numerator and denominator) for the bars are not evident directly on the visualization, they must be included in the export request as supplemental information in a CSV file.

In [None]:
# Create the underlying table with proportions
fed_sems_prop <- fed_sems %>%
    group_by(fed_funds) %>%                          # for each number of semesters
    summarise(count_individuals = n_distinct(drf_id)) %>%        # count number of unique individuals
    mutate(total_individuals = sum(count_individuals)) %>%                   # get the total count of unique individuals
    mutate(percent = round((count_individuals/total_individuals) * 100, 1))  # find the percentage and round to 1 decimal (not necessary but nice to see)

fed_sems_prop

According to the table, there is a value that does not have at least 10 individuals. Before deciding on an approach for working with this group that will not pass disclosure review, it makes sense to check the underlying number of institutions per group as well.

In [None]:
# see if phdinst is already in fed_sems
names(fed_sems)

The `phdinst` field, which is needed to count the number of underlying institutions within each grouping, is not in `fed_sems`. Recall the code used to create `fed_sems`.

In [None]:
# find number of semesters of federal funding per person
fed_sems <- cohort_joined %>%
    group_by(drf_id, team_size) %>%
    summarize(n_sems_fed = n_distinct(semester)) %>%
    ungroup() %>%
    mutate(
        n_sems_fed = ifelse(is.na(team_size), 0, n_sems_fed)
    ) %>%
    group_by(drf_id) %>%
    summarize(
        fed_funds = sum(n_sems_fed)
    ) %>%
    ungroup()

Perhaps `cohort_joined` contains information linking each individual to an institution, as each individual counted in a bar of the visualization corresponds to a `drf_id` in the data frame. 

In [None]:
# see if phdinst is in cohort_joined
head(cohort_joined$phdinst)

Because `cohort_joined` contains `phdinst` values, `phdinst` information can be added to `fed_sems` by joining by `drf_id` values.

In [None]:
# add phdinst information to fed sems and remove duplicates (multiple rows in cohort_joined for some drf_id values)
fed_sems_inst <- fed_sems %>%
    left_join(cohort_joined %>% select(drf_id, phdinst), by = 'drf_id') %>%
    distinct()

head(fed_sems_inst)

The underlying number of institutions by numbers of semesters with federal funding can now be calculated using similar code to that for calculating the number of individuals, substituting `drf_id` with `phdinst`.

> Because the number of institutions cannot simply be added up as the number of individuals were, the total number of institutions for the denominator will be added below.
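
A small made-up example shows why the per-group institution counts cannot be summed to obtain the denominator: the same institution can appear in several groups. The column names `person_id` and `inst_id` are hypothetical stand-ins for `drf_id` and `phdinst`.

```r
library(dplyr)

# Made-up data: one row per person, with semesters funded and institution
toy <- data.frame(
  person_id = 1:8,
  fed_funds = c(0, 0, 0, 1, 1, 2, 2, 2),
  inst_id   = c('A', 'B', 'C', 'A', 'B', 'A', 'C', 'C')
)

per_group <- toy %>%
  group_by(fed_funds) %>%
  summarise(
    count_individuals  = n_distinct(person_id),
    count_institutions = n_distinct(inst_id)
  )

sum(per_group$count_individuals)   # 8: each person falls in exactly one group
sum(per_group$count_institutions)  # 7: institutions are counted once per group
n_distinct(toy$inst_id)            # 3: the correct overall institution count
```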

In [None]:
# Create the underlying table for institutions
fed_sems_inst_cnt <- fed_sems_inst %>%
    group_by(fed_funds) %>%                          # for each number of semesters
    summarise(count_institutions = n_distinct(phdinst))        # count number of unique institutions

fed_sems_inst_cnt

None of the other `fed_funds` values are in violation of the number of distinct institutions. Therefore, to export this visualization, the bar where `fed_funds` is REDACTED can either be removed from the visualization or be combined with other bars (i.e. greater than 10).

The total number of unique institutions involved in this calculation can be calculated and added to `fed_sems_inst_cnt` using the code cell below.

In [None]:
tot_insts <- fed_sems_inst %>%
    summarize(n = n_distinct(phdinst)) 

fed_sems_inst_cnt <- fed_sems_inst_cnt %>%
    rbind(
        data.frame(fed_funds = as.character("Total/Denominator"), count_institutions = tot_insts$n)
        )

fed_sems_inst_cnt

Assuming this visualization is updated to aggregate or ignore the grouping that would not pass disclosure review, the visualization is ready for export.

In [None]:
# Save the table with the underlying counts to the CSV file 
fed_sems_prop %>%
    write_csv(sprintf('/nfshome/%s/counts_for_funding_histories_bar_plot.csv', user))

In [None]:
# Save the table with the underlying institution counts to a separate CSV file 
fed_sems_inst_cnt %>%
    write_csv(sprintf('/nfshome/%s/institution_counts_for_funding_histories_bar_plot.csv', user))

In [None]:
# Save the bar plot 
ggsave(sprintf('/nfshome/%s/funding_histories_bar_plot.pdf', user))

## Machine Learning

Exporting clusters must be treated as any other grouping variable, as each cluster must satisfy a minimum number of individuals and (when applicable) employers to pass disclosure control.
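
As a sketch of what this means in practice, the example below clusters simulated data with base R's `kmeans()` and then counts distinct individuals and institutions per cluster. The data, the choice of two clusters, and the column names are all assumptions made for illustration. The thresholds are the class thresholds used throughout this notebook.

```r
library(dplyr)

# Made-up data: two numeric features plus person and institution IDs
set.seed(123)
toy <- data.frame(
  person_id = 1:90,
  inst_id   = sample(paste0('inst_', 1:5), 90, replace = TRUE),
  x1 = c(rnorm(45, 0), rnorm(45, 5)),
  x2 = c(rnorm(45, 0), rnorm(45, 5))
)

# k-means with 2 clusters (an arbitrary choice for this sketch)
toy$cluster <- kmeans(toy[, c('x1', 'x2')], centers = 2)$cluster

# Treat the cluster label like any other grouping variable
cluster_counts <- toy %>%
  group_by(cluster) %>%
  summarise(
    individual_count  = n_distinct(person_id),
    institution_count = n_distinct(inst_id)
  ) %>%
  mutate(passes = individual_count >= 10 & institution_count >= 3)

cluster_counts
```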

## Reminder
Every single item for export, regardless of whether it is a .csv, .pdf, .png, or something else, must have corresponding proof in the input file to show that every group used to create this statistic followed the disclosure review rules.

> Note: After the end of the course, it is possible to export the code that has been used during the class. To do so, make your facilitator aware of your group's interest and they will handle it from there.

<font color=red><h3> Checkpoint: Prepare the first export</h3></font>

Prepare a first export of descriptive statistics or visualizations using the rules and code above.