# Presentation Prep

Maryah Garner, Benjamin Feder, Nathan Caplan,  Brian Kim, Ekaterina Levitskaya, Allison Nunez, Rukhshan Arif Mian.

**_Presentation Prep Examples & Exercises_**

This notebook provides information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives an overview of the information needed for disclosure review. _Please read through the entire notebook because it will separately discuss all outputs that will be flagged in the disclosure review process._


For the purpose of this class, the disclosure rules are as follows:



# NCSES 2021 Class Export Review Guidelines 

- **Each team will be able to export up to 7 figures**
    - Teams will not be allowed to export summary data tables to create figures outside the ADRF
    
- **Every statistic for export should be based on at least 10 individuals and at least 3 institutions**.
     - Statistics that are based on 0-9 individuals must be suppressed.
- **All counts will need to be rounded** 
    - Numbers of 999 or below should be rounded to the nearest ten, and numbers above 999 should be rounded to the nearest hundred.  
    
- **All percentages and proportions need to be rounded**
    - The same rounding rule that is applied to counts must be applied to both the numerator and denominator, then percentages must be rounded to the nearest percent, and proportions must be rounded to the nearest hundredth.

- **Exact percentiles cannot be exported** 
    - Instead, for example, you may calculate a “fuzzy median” by averaging the true 45th and 55th percentiles. An example of this appears in the `01_Data_Exploration_SED_SDR.ipynb` notebook in Module 2, and this notebook covers it in more detail. 
  
- **Exact Maxima and Minima cannot be exported**
    - Suppress maximum and minimum values in general. You may replace an exact maximum or minimum with a top-coded value or a .
 
- **Complementary suppression**
    - If your figures include totals or depend on preceding or subsequent figures, you need to take into account complementary disclosure risks: that is, whether the figures' totals, or the separate figures when read together, might disclose information about fewer than 10 individuals in the data in a way that a single, simpler figure would not. Team facilitators and export reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
 
- **Text Analysis**
    - To export figures for text analysis, the same rules apply. So you will have to recover the number of individuals and the number of institutions that are behind the figures you are exporting. 
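
As a rough illustration of the two rounding rules above, the following sketch uses two hypothetical helpers, `round_count` and `round_percent` (neither is part of the class materials). It assumes that counts up to 999 go to the nearest ten and larger counts go to the nearest hundred, and it rounds the numerator and denominator before computing a percentage.

```r
# Hypothetical helper: round a count following the class rounding rule
# (assumption: counts up to 999 go to the nearest ten, larger counts to the nearest hundred)
round_count <- function(x) {
  ifelse(x <= 999, round(x, -1), round(x, -2))
}

# Hypothetical helper: percentage computed from the rounded numerator and
# rounded denominator, then rounded to the nearest percent
round_percent <- function(numerator, denominator) {
  round(100 * round_count(numerator) / round_count(denominator))
}

rounded_counts <- round_count(c(47, 312, 1234))
rounded_pct <- round_percent(47, 312)

rounded_counts
rounded_pct
```

Note that R's `round` follows the IEC 60559 convention for values that sit exactly halfway between two candidates, so a count such as 25 may round to 20 rather than 30. If that matters for your output, it is worth checking with your facilitator.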


##  Supporting documentation for exports

For each exported figure, you will need to provide a table with **underlying counts** of individuals and institutions for each statistic depicted in the figure. 

- You will need to include both the rounded and the unrounded counts of individuals.

- If percentages or proportions are to be exported, you must report both the rounded and the unrounded counts of individuals for the numerator and denominator. You must also report the counts of institutions for both the numerator and the denominator.

- If weighted results are to be exported, you must report both weighted and unweighted individual counts as well as institution counts (you do not need to report weighted institution counts).
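
A supporting table along these lines could be assembled as in the sketch below. The data frame `toy` and its columns (`person_id`, `inst_id`, `group`) are made up for illustration. Since every count here is below 1,000, the rounded column simply rounds to the nearest ten.

```r
library(dplyr)

# Toy data standing in for the class data; all values are made up
toy <- data.frame(
  person_id = 1:25,                                        # hypothetical individual ID
  inst_id   = rep(c("A", "B", "C", "D", "E"), times = 5),  # hypothetical institution ID
  group     = rep(c("x", "y"), times = c(14, 11))          # hypothetical grouping shown in a figure
)

supporting_table <- toy %>%
  group_by(group) %>%
  summarise(
    individuals_unrounded = n_distinct(person_id),  # unrounded count of individuals
    institutions          = n_distinct(inst_id)     # count of institutions
  ) %>%
  mutate(individuals_rounded = round(individuals_unrounded, -1))  # nearest ten

supporting_table
```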

**Text Analysis**
- If you are exporting figures for text analysis on grants, you must report counts of individuals for each `term`. Counts should include the intended population of the figure, which may be fewer than all the individuals on each grant. 
    - Ex. If you are using text analysis to say something about PhD graduates from 2015, you must report the counts for the PhD graduates from 2015, even if other people are on the grants. 
    
**Code**
- Please provide the code for every output that needs to be exported. It is important for the ADRF staff to have the code to better understand what exactly was done. Understanding how research results are created is important in understanding the research output. Thus, it is important to document every step of the analysis in the Jupyter notebook. 

#### General rules to keep in mind
A more detailed description of the rules for exporting results can be found in the [export documentation](https://coleridgeinitiative.org/adrf/documentation/using-the-adrf/exporting-results/). Before preparing an export submission, please read through the export documentation.


## R Setup

In [None]:
# switching off warnings
options(warn=-1)

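# database interaction imports
# DBI provides dbGetQuery() and dbDisconnect(), which are called without a prefix later on
suppressMessages(library(DBI))
suppressMessages(library(odbc))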

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# adding weights 
suppressMessages(library(survey))

# for easier viewing of graphs
# adjust repr.plot.width and repr.plot.height to change the size of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)

# switching warnings back on
options(warn=0)

We use an additional option for this notebook:

- `options(scipen = 100)` prevents R from using scientific notation when plotting or printing numbers. An example is shown below:

In [None]:
# before implementation
print(100000000)

In [None]:
# implementing option
options(scipen = 100) 

In [None]:
# after implementation
print(100000000)
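
If changing a session-wide option is not desirable, a single number can also be shown without scientific notation using base R's `format`. This is a minimal sketch of an alternative to the `scipen` approach; the rest of this notebook does not rely on it, and `big_number` is just an illustrative name.

```r
big_number <- 100000000

# scientific = FALSE forces fixed notation for this one value only
plain <- format(big_number, scientific = FALSE)

# big.mark adds thousands separators, which can help in tables or labels
with_commas <- format(big_number, scientific = FALSE, big.mark = ",")

plain
with_commas
```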

Next, we connect to the database.

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Motivation: Comparing female earnings to male earnings across the 2015 cohort

The motivating question for this notebook is how female earnings compare to male earnings for doctorate recipients who graduated in 2015. We address this question in multiple steps. Along the way, this notebook describes how to construct useful statistics and visualizations using the class data and how to prepare them so that they can be approved as output during the export process.

To start off, we retrieve the necessary data for this notebook. We will be using the SDR table and the relevant variables are: 
- earnings (`salary`)
- survey weight (`wtsurvy`) - the data in the SDR needs to be weighted, as it is a subsample of a population
- individual ID (`refid`)
- institution ID (`sdrincd`)
- year of graduation (`sdryr`) - where year is 2015
- gender (`gender`) – Male or Female
    - We use two categories for gender (Male/Female) because these are the only options that SDR respondents are presented with when the survey is conducted. 
    
> Note: Whenever pulling statistics, remember to include individual and institution information, as they will be required to check if the statistics in question pass the disclosure threshold.

In [None]:
# Get the relevant variables from the SDR data and save in the earnings data frame: 
# - earnings (salary),
# - survey weight (wtsurvy),
# - individual ID (refid),
# - institution ID (sdrincd),
# - gender (gender),
# - where year of graduation (sdryr) is 2015

query <- "
SELECT salary, wtsurvy, refid, sdrincd, gender
FROM ds_nsf_ncses.dbo.nsf_sdr_2017
WHERE sdryr = '2015' 
"
earnings <- dbGetQuery(con,query)

In [None]:
# Show the first few rows of the table
head(earnings)

Recall that as per the Survey of Doctorate Recipients' (SDR) data dictionary, `salary` values equal to `9999998` are reserved for 'Logical Skips'. Therefore, these values can be removed. 

In [None]:
# Remove the rows with logical skip values
earnings <- earnings[earnings$salary != '9999998', ]

> To filter these logical skips using the tidyverse, the code would be `earnings <- earnings %>% filter(salary != '9999998')`.

## Regrouping Data
In our data exploration notebooks in Module 2, we worked on creating broader categories from categorical variables. This sub-section focuses on using the same methods but on a different variable type (continuous).

The column we choose to regroup is **salary**. Remember that we are using the SDR table and it requires us to utilize survey weights for any summary statistics. Since we are not performing any such analysis and are just regrouping, we move forward without using survey weights. 

A suggested regrouping is as follows:

In [None]:
# each bin includes its upper bound; after the first bin, lower bounds are exclusive so bins do not overlap
earnings <- earnings %>%
                    mutate(salary_range = case_when(salary >= 0 & salary <= 50000 ~ '0 - 50K',
                                                    salary > 50000 & salary <= 100000 ~ '50K - 100K',
                                                    salary > 100000 & salary <= 500000 ~ '100K - 500K',
                                                    salary > 500000 & salary <= 1000000 ~ '500K - 1M',
                                                    salary > 1000000 & salary <= 5000000 ~ '1M - 5M'
                                                ))

all_counts <- earnings %>% 
                         group_by(salary_range) %>% # grouping by salary range
                        summarise(individual_count = n_distinct(refid), # get counts of individuals
                                 institution_count = n_distinct(sdrincd)) # get counts of institutions

all_counts

From the table above, we observe that two categories, *500K-1M* and *1M-5M* have an individual count or institution count (or both) that does not meet our disclosure review criteria. To move forward, we do the following: 

1. Drop these two categories (as these are outliers) 
2. Use a different type of grouping that takes these higher values into account

In [None]:
earnings <- earnings %>% 
    filter(salary_range != '500K - 1M', salary_range != '1M - 5M')

In [None]:
# Create a new column with mutate, `salary_range_new`, with narrower bins and a single open-ended top category
# each bin includes its upper bound; after the first bin, lower bounds are exclusive so bins do not overlap
earnings <- earnings %>%
                    mutate(salary_range_new = case_when(salary >= 0 & salary <= 30000 ~ '0 - 30K',
                                                    salary > 30000 & salary <= 60000 ~ '30K - 60K',
                                                    salary > 60000 & salary <= 90000 ~ '60K - 90K',
                                                    salary > 90000 & salary <= 100000 ~ '90K - 100K',
                                                    salary > 100000 & salary <= 150000 ~ '100K - 150K',
                                                    salary > 150000 & salary <= 180000 ~ '150K - 180K',
                                                    salary > 180000 ~ '180K+'
                                                ))
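
As an alternative to `case_when`, base R's `cut` can produce the same kind of binning from a vector of break points. The sketch below is only an illustration on made-up salaries chosen to sit on and around the bin edges; it is not the approach used in the rest of this notebook. With `right = TRUE`, each bin includes its upper bound, which matches how the `case_when` call above assigns values that fall exactly on an edge.

```r
# Made-up salaries, not taken from the SDR data
salary <- c(0, 30000, 30001, 95000, 180000, 180001)

salary_range_cut <- cut(
  salary,
  breaks = c(0, 30000, 60000, 90000, 100000, 150000, 180000, Inf),
  labels = c('0 - 30K', '30K - 60K', '60K - 90K', '90K - 100K',
             '100K - 150K', '150K - 180K', '180K+'),
  right = TRUE,          # intervals are (lower, upper]
  include.lowest = TRUE  # so a salary of exactly 0 lands in the first bin
)

data.frame(salary, salary_range_cut)
```

One side effect of `cut` is that it returns a factor whose levels follow the order of `labels`, so plots built on it keep the bins in salary order.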

Note how this codes any salary greater than 180K in one single category. We check the associated individual and institution counts below:

In [None]:
all_counts <- earnings %>% 
                        group_by(salary_range_new) %>%
                        summarise(indiv_count = n_distinct(refid), # individual counts
                                  inst_count = n_distinct(sdrincd) # institution counts
)

all_counts

We see that all salary categories meet our disclosure review requirements. 
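
Rather than reading the table by eye, the same check can be written as code. The sketch below uses made-up counts in a data frame called `toy_counts`, whose column names mirror `all_counts`; the `passes` column is a hypothetical addition. It flags any row that falls short of the class thresholds of at least 10 individuals and at least 3 institutions.

```r
# Made-up counts; the column names mirror all_counts
toy_counts <- data.frame(
  salary_range_new = c('0 - 30K', '30K - 60K', '180K+'),
  indiv_count = c(120, 9, 45),
  inst_count  = c(30, 5, 2)
)

# Class thresholds: at least 10 individuals and at least 3 institutions
toy_counts$passes <- toy_counts$indiv_count >= 10 & toy_counts$inst_count >= 3

# Rows that would need to be suppressed or regrouped before export
toy_counts[!toy_counts$passes, ]
```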

In [None]:
# Save the underlying counts
all_counts %>%
    write_csv("Tables\\counts_for_unweighted_hist_earnings.csv")

## Bar Plot

The question for this notebook focuses on comparing male earnings to female earnings. We consider using a bar plot to do so by looking at the **percentage of males/females who fall in each salary range** that we created above.


In `01_SED_SDR_Data_Exploration.ipynb`, we looked at the best practices for looking at percentages. We provide a brief recap below:

When working with percentages (both numerically and visually), it is important to show the underlying numerator and denominator counts. The following are the rules for calculating any percentages/proportions/ratios:
- Calculate percentages, proportions, and ratios using the unweighted counts.
- The numerator and denominator should be above the threshold for each group, or bar, specified in the disclosure review (in this class, at least 10 individuals and at least 3 institutions).
- Round percentages calculated from unweighted counts to 1 decimal.
- Do not report 0 and 100%.

#### Weighted counts
Within SDR, each weight represents a certain number of individuals, so to find the weighted counts within each group (bar), the `wtsurvy` variable can be summed by the **salary_range_new-gender** grouping as long as each individual only has one observation in the data frame. We will check this below.


In [None]:
# checking for duplicates
earnings %>%
    group_by(refid) %>% # grouping by unique ID
    mutate(count = n()) %>% # counting the number of instances
    arrange(desc(count)) %>% # ordering by count
    head() # checking first few observations
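
If a single yes/no answer is preferred over inspecting the first few rows, base R's `anyDuplicated` can be used. The sketch below runs on a made-up vector of IDs (`toy_ids`); with the class data, the same call could be applied to `earnings$refid`.

```r
# Made-up IDs; the value 3 appears twice
toy_ids <- c(1, 2, 3, 3, 4)

# anyDuplicated returns 0 when there are no repeats,
# otherwise the position of the first repeated value
first_repeat <- anyDuplicated(toy_ids)
has_duplicates <- first_repeat > 0

first_repeat
has_duplicates
```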

In [None]:
# creating underlying counts
earnings_prop <- earnings %>%
    group_by(gender, salary_range_new) %>%  # for each gender and salary range
    summarise(count_individuals = n_distinct(refid), # counting the number of individuals
              count_institutions = n_distinct(sdrincd), # counting the number of institutions
              count_individuals_weighted = sum(wtsurvy)) %>% # getting the weighted counts of individuals
    mutate(total_individuals = sum(count_individuals)) %>%                   # total count of unique individuals within each gender (the denominator)
    mutate(percent = round((count_individuals/total_individuals) * 100, 1))  # find the percentage and round to 1 decimal (not necessary but nice to see)

earnings_prop

Note that because the underlying counts (numerator and denominator) for the bars are not directly evident in the visualization, they must be included in the export request as supplemental information in a CSV file.

### Plotting:

We reuse our code from the `05_Data_Visualization.ipynb` notebook in Module 2 to create a Bar Plot.

In [None]:
# color-blind friendly palette
cbPalette <- c( "#009E73", "#0072B2", "#D55E00", "#CC79A7", "#999999", "#E69F00",  "#56B4E9", "#F0E442")

bar_plot <- earnings_prop %>%
    ggplot(aes(x=salary_range_new, y=percent, fill=gender)) +
    geom_bar(stat="identity", position='dodge') +
    scale_x_discrete(limits = c('0 - 30K', '30K - 60K', '60K - 90K', '90K - 100K', 
                                '100K - 150K', '150K - 180K', '180K+')) +
    scale_fill_manual(values = cbPalette) +
    labs(
        x = 'Salary Range', # labelling x axis
        y = '% of individuals', # labelling y axis
        title = 'Females who graduated in 2015 earn a REDACTED salary on average compared to Males', # adding title
        caption = 'NCSES SDR 2017 Data (2015 graduates)' # adding a caption
        )

bar_plot

### Adjusting Font Sizes

In order to make our plot presentation ready, it is always advised to use readable font sizes. We use the following code to implement this:

In [None]:
bar_plot <- bar_plot + theme(
        legend.text = element_text(size=24), # legend text font size
        legend.title = element_text(size=24), # legend title font size
        axis.text.x = element_text(size=24), # x axis label font size
        axis.title.x = element_text(size=24), # x axis title font size
        axis.text.y = element_text(size=24), # y axis label font size
        axis.title.y = element_text(size=24) # y axis title font size
    )
    
bar_plot

Please keep in mind that any statistics (e.g. for counts of subpopulations) have to comply with the disclosure threshold described above: not only the **counts of individuals** (**at least 10**), but also the **counts of institutions** (**at least 3**). Additionally, in cases where weights are involved, we have to export weighted counts as well. 

In [None]:
all_counts <- earnings_prop %>% 
        select(c("gender", "salary_range_new", "count_individuals", "count_institutions", "count_individuals_weighted"))

all_counts 

In [None]:
# Save the underlying counts
all_counts %>%
    write_csv("Tables\\counts_for_weighted_bar_earnings.csv")

## Fuzzy percentiles

In this section, we examine the distribution of female earnings compared to male earnings across the 2015 cohort. For this purpose, we utilize fuzzy percentiles.

Ordinarily, when simply finding the distribution of a numerical variable, the `summary()` function may be used. As a reminder, this function outputs the minimum, first quartile, median, mean, third quartile, and maximum values as per its default output. However, these outputs will not pass the disclosure review process, as some of these statistics may be represented by individual points (such as minimum, maximum, any percentiles, and median). 

Instead, these outputs can be transformed into _fuzzy percentiles_, which can pass the disclosure review process. For example, the 20th and 30th percentiles can be averaged to find a fuzzy 25th percentile.
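
To make the averaging concrete, here is a minimal unweighted sketch on made-up values. The helper `fuzzy_percentile` and its `width` argument are hypothetical names introduced for illustration; the weighted version used for the SDR data relies on the `survey` package instead of base R's `quantile`.

```r
# Made-up, unweighted values
x <- 1:101

# Hypothetical helper: average the percentiles at p - width and p + width
fuzzy_percentile <- function(values, p, width = 0.05) {
  unname(mean(quantile(values, probs = c(p - width, p + width), na.rm = TRUE)))
}

fuzzy_25 <- fuzzy_percentile(x, 0.25)  # averages the 20th and 30th percentiles
fuzzy_50 <- fuzzy_percentile(x, 0.50)  # averages the 45th and 55th percentiles
fuzzy_75 <- fuzzy_percentile(x, 0.75)  # averages the 70th and 80th percentiles

c(fuzzy_25, fuzzy_50, fuzzy_75)
```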

As done in the Data Exploration notebook, the survey weights can now be added using the `svydesign` function from the `survey` package to calculate the weighted earnings distribution.

In [None]:
# Add weights using "svydesign" function from "survey" library
earnings_weighted <- svydesign(ids=~1, data=earnings, weights=earnings$wtsurvy)

As mentioned above, to export the 25th, 50th, and 75th percentiles, they need to be transformed into fuzzy percentiles by averaging close-by percentiles. For example, in order to find a fuzzy 25th percentile, the 20th and 30th percentiles can be averaged.

Start by finding the following true percentiles on the weighted data using `svyquantile`:
- 20th and 30th (to create a fuzzy 25th percentile),
- 45th and 55th (to create a fuzzy 50th percentile),
- 70th and 80th percentile (to create a fuzzy 75th percentile). 

In [None]:
# Find 20, 30, 45, 55, 70, 80 percentiles on the weighted data
svyquantile(~salary, earnings_weighted, c(.20, .30, .45, .55, .70, .80), na.rm=TRUE)

In [None]:
weighted_by_gender <- svyby(~salary, ~gender, earnings_weighted, svyquantile, quantiles=c(.20, .30, .45, .55, .70, .80), na.rm=TRUE, keep.var=FALSE, keep.names=FALSE)
names(weighted_by_gender) = c("gender", .20, .30, .45, .55, .70, .80)

weighted_by_gender

> Note: It is not best practice to save column names as numerical outputs. The code below will show how to work with these types of columns. After creating the data frame, the default column names can be overwritten by manually assigning the desired column names with the base code <br> `colnames(dataframe) <- c('changed', 'column', 'names')`.

In [None]:
# Copy the 20, 30, 45, 55, 70, 80 percentiles to a new data frame and give the columns non-numeric names
true <- weighted_by_gender
names(true) <- c('gender', 'Q.20', 'Q.30', 'Q.45', 'Q.55', 'Q.70', 'Q.80')
true

If the column names are left as numbers, as in the `weighted_by_gender` data frame, these columns can be referenced by surrounding their names in backticks. These percentiles can then be averaged to find fuzzy 25th, 50th (median), and 75th percentiles.

In [None]:
# Find values for the fuzzy quantiles by averaging the percentiles 
# (e.g. to find 25th, average 20th and 30th, etc.)
weighted_by_gender %>%
    mutate(
        fuzzy_25 = (`0.2` + `0.3`)/2,
        fuzzy_50 = (`0.45` + `0.55`)/2,
        fuzzy_75 = (`0.7` + `0.8`)/2
    ) %>%
    select(gender, fuzzy_25, fuzzy_50, fuzzy_75) # keep gender so that each row stays labeled

Save these fuzzy percentiles to a table.

In [None]:
# save to table, carrying gender over from weighted_by_gender
# so that each row keeps its label regardless of row order
fuzzy <- weighted_by_gender %>%
    mutate(
        fuzzy_25 = (`0.2` + `0.3`)/2,
        fuzzy_50 = (`0.45` + `0.55`)/2,
        fuzzy_75 = (`0.7` + `0.8`)/2
    ) %>%
    select(gender, fuzzy_25, fuzzy_50, fuzzy_75)

In [None]:
fuzzy %>%
    write_csv("Tables\\fuzzy_earnings.csv")

In order for these fuzzy percentiles to pass disclosure review, they require proof that the underlying counts contain at least 10 data points at the individual level and at least 3 data points at the institution level. 

For example, it is not possible to export a distribution of earnings of female PhD graduates of only 1 university, or to compare earnings distributions between just 2 universities. The sample should include at least 3 universities, for example, by comparing a group of at least 3 research universities versus a group of at least 3 teaching universities.

In the SQL query above, the data on individuals (`refid`) and institutions (`sdrincd`) has already been pulled, so it can be used to find the counts of distinct individuals and institutions.

> Individual and institutional counts provided as corresponding proof must be unweighted.

In [None]:
# save individual and institution counts
counts_for_earnings <- earnings %>%
    group_by(gender) %>%
    summarize(
        count_individuals = n_distinct(refid),
        count_institutions = n_distinct(sdrincd),
        count_weighted_individuals = sum(wtsurvy)
    )

In [None]:
counts_for_earnings

This can now be used as proof that the fuzzy percentiles are based on counts that pass the disclosure thresholds.

In [None]:
# save as counts_for_earnings.csv
counts_for_earnings %>%
    write_csv("Tables\\counts_for_earnings.csv")

## Saving Plots

In this section, we look at the best ways to export our presentation-ready plots. We use ``ggsave`` to save our plots in PNG, JPEG, and PDF formats without losing quality. 

### PNG

First, we provide an example of using ``ggsave`` with two parameters: `filename` and `plot`

In [None]:
ggsave(filename = "Figures\\bar_plot_prob.png", # saving path
       plot = bar_plot # plot name
      )

This might not be the preferred way of saving a plot, as the dimensions of the plot default to 6.67 x 6.67 here. We suggest opening the file we just saved at its path. You will see that the labels are cluttered and the graph cannot be interpreted. Thus, we recommend using the `width` and `height` parameters in addition to `filename` and `plot`. The example below saves our bar plot this way.

In [None]:
ggsave(filename = "Figures\\bar_plot.png", # saving path
       plot = bar_plot,  # plot name
       width = 20, # width
       height = 12 # height
      )

The code above saves the plot in a format that can be interpreted conveniently. We reuse this code to save in JPEG and PDF formats below:

### JPEG

In [None]:
ggsave(filename = "Figures\\bar_plot.jpeg", # saving path
       plot = bar_plot,  # plot name
       width = 20, # width
       height = 12 # height
      )

### PDF

In [None]:
ggsave(filename = "Figures\\bar_plot.pdf", # saving path
       plot = bar_plot,  # plot name
       width = 20, # width
       height = 12 # height
      )
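
The `ggsave` calls above leave the units and resolution at their defaults. The sketch below spells them out using `ggsave`'s `units` and `dpi` arguments, which can be useful if a venue asks for a specific size or resolution. The plot (`toy_plot`) is built from the `mtcars` data set that ships with R, and the file is written to a temporary path, so none of this touches the class data or the `Figures` folder.

```r
library(ggplot2)

# Toy plot built from the built-in mtcars data set
toy_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Temporary file so the sketch leaves nothing behind
out_file <- tempfile(fileext = ".png")

ggsave(filename = out_file,
       plot = toy_plot,
       width = 20,
       height = 12,
       units = "in",  # inches, which is ggsave's default unit
       dpi = 300)     # resolution for raster formats such as PNG and JPEG

file.exists(out_file)
```

For raster formats, raising `dpi` increases the pixel dimensions of the saved file, while `width` and `height` control its physical size.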

## Reminder
Every single item for export, regardless of whether it is a .csv, .pdf, .png, or something else, must have corresponding proof in the input file to show that every group used to create this statistic followed the disclosure review rules.

> Note: After the end of the course, it is possible to export the code that has been used during the class. To do so, make your facilitator aware of your group's interest and they will handle it from there.

In [None]:
# Close the database connection
dbDisconnect(con)