<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Brian Kim, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Avishek Kumar, Jonathan Morgan, Benjamin Feder, Ekaterina Levitskaya, Nathan Caplan.

**_Disclosure Review Examples & Exercises_**

This notebook provides you with information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives you an overview of the information needed for disclosure review. _Please read through the entire notebook because it discusses separately the different types of output that will be flagged in the disclosure review process._

In [None]:
# data manipulation
import pandas as pd
import numpy as np

# database connection
from sqlalchemy import create_engine

# visualization
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [None]:
# Database Connection
host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = create_engine(connection_string)


# General Remarks on Disclosure Review

## Files you can export
In general, you can export any kind of file format. However, most researchers typically export tables, graphs, regression outputs and aggregated data. Thus, we ask you to export one of these types, which implies that every result you would like to export needs to be saved in either .csv, .txt or graph format.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, you can't export results in a Jupyter notebook. Doing disclosure reviews on output in Jupyter notebooks is too burdensome for us. Jupyter notebooks will only be exported when the output is deleted for the purpose of exporting code. **This does not mean that you won't need your Jupyter notebooks during the export process.** 

## Documentation of code is important
During the export process, we ask you to provide the code for every output you would like to export. The ADRF staff need the code to understand exactly what you did, because knowing how a result was created is essential to assessing it. Thus, it is important to document every step of your analysis in your Jupyter notebook.

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found on the class website. This is just a quick overview. You should go to the class website and read the entire guidelines (link below) before preparing your files for export. 
- The disclosure review is based on the underlying observations of your study. **Every statistic you want to export must be based on at least 10 data points at an individual level. When reporting firm-based statistics, on top of the 10 individual data points, you must also show that 1) there are at least 3 firms and 2) employment in no one firm comprises more than 80% of the industry**. You must show the disclosure review team that every statistic you wish to export meets these thresholds by providing the associated counts/percentages in your input file.
- Document your code so the reviewer can follow your data work. Assessing re-identification risk depends heavily on context, so provide context information with your analysis for the reviewer. When making comments in the code, make sure not to include any individual statistic (e.g. the median is ...).
- Save the requested output with the corresponding code in your input and output folder. Make sure the code is executable. The code should exactly produce the output you requested.
- If you are exporting PowerPoint slides that show project results, you have to provide the code that produces the output shown in each slide.
- Please export results only when they are final and you need them for your presentation or final project report.
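
The thresholds in the first rule can be checked programmatically before you request an export. The sketch below is only an illustration on synthetic numbers: the function name `check_cell`, the constants, and the firm labels are hypothetical and are not part of the ADRF tooling.

```python
import pandas as pd

MIN_INDIVIDUALS = 10   # minimum individuals behind any statistic
MIN_FIRMS = 3          # minimum firms behind a firm-based statistic
MAX_FIRM_SHARE = 0.80  # no single firm may exceed this share of employment


def check_cell(n_individuals, firm_employment=None):
    """Return a list of rule violations for one statistic (empty list = OK).

    firm_employment is a hypothetical pandas Series of employment per firm;
    leave it as None for statistics that are not firm-based.
    """
    problems = []
    if n_individuals < MIN_INDIVIDUALS:
        problems.append("fewer than 10 individuals")
    if firm_employment is not None:
        if firm_employment.size < MIN_FIRMS:
            problems.append("fewer than 3 firms")
        if firm_employment.max() / firm_employment.sum() > MAX_FIRM_SHARE:
            problems.append("one firm exceeds 80% of employment")
    return problems


# Synthetic industries with four firms each (made-up employment figures)
dominant = pd.Series([90, 4, 3, 3], index=["firm_a", "firm_b", "firm_c", "firm_d"])
balanced = pd.Series([40, 30, 20, 10], index=["firm_a", "firm_b", "firm_c", "firm_d"])

print(check_cell(100, dominant))  # one firm holds 90% of employment
print(check_cell(100, balanced))  # passes every check
print(check_cell(7))              # too few individuals
```

A check like this does not replace the counts you must include in your input file; it only helps you spot problem cells early.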

## To-Do:
Read through the **documentation** link: adrf.readthedocs.io/en/latest/export_of_results/guidelines.html#documentation

# Disclosure Review Walkthrough

You will reconstruct the statistics and visualizations you created in the [Data Exploration](01_2_Data_Exploration.ipynb) and [Data Visualization](02_1_Data_Visualization.ipynb) notebooks and prepare them in a way so your output will pass the disclosure review.

### Counts

Recall Motivating Question #1 from the Data Exploration notebook.

**How many students got their degrees from Ohio community colleges during the 2012-13 academic year? How does the number vary by the regional location of the college and by degree field?**

To find these answers, you first found your desired cohort using the code below.

In [None]:
# store query to find 2012-13 academic year graduates in a temporary table
qry = '''
create temp table all_grads as
select *
from data_ohio_olda_2018.oh_hei_long
where (degcert_yr_earned = '2012' and (degcert_term_earned = '4' or degcert_term_earned = '1')) or 
    (degcert_yr_earned = '2013' and (degcert_term_earned = '2' or degcert_term_earned = '3'))
'''
conn.execute(qry)

In [None]:
# now create temp table because this is our cohort of 2012-13 community college graduates
# take most recent graduation
qry = '''
create temp table cc_grads as
select a.*, lkp.*
from all_grads a
left join data_ohio_olda_2018.oh_hei_campus_county_lkp lkp
on a.degcert_campus = lkp.campus_num
where lkp.campus_type_code in ('TC', 'SC', 'CC')
'''
conn.execute(qry)

From here, finding the number of students who received community college degrees was done by counting unique `ssn_hash` values. Because the desired statistic for export is a count not related to firm statistics, you just need to show that this count is at least 10.
> If you were exporting a general statistic that is not a count (e.g. the total dollars earned by this cohort in a year), you would need to provide the underlying individual counts per group.

In [None]:
qry = '''
select count(distinct(ssn_hash)) from cc_grads
'''
grad_count = pd.read_sql(qry, conn)

In [None]:
print(grad_count)

To export this statistic as a csv, you can use the `to_csv` function and designate the file path and the name. Here, we will call the file `graduate_counts.csv` (the more descriptive the name of the file, the easier it is to review).

> In the file path, include `YOUR_USERNAME` to save the CSV in your home folder.

In [None]:
grad_count.to_csv('/nfshome/YOUR_USERNAME/graduate_counts.csv')

To find the number of graduates by region, recall that you used pandas' `.groupby()` combined with `nunique()`. Again, because this count does not concern firm statistics, it can serve as its own proof that the counts are all at least 10.

In [None]:
qry =  '''
select *
from cc_grads
'''
df = pd.read_sql(qry, conn)

In [None]:
# find number of graduates by region
df.groupby(['jobsohioregion'])['ssn_hash'].nunique()

Again, because these counts are all at least 10, they are safe for export.

In [None]:
grad_by_region = df.groupby(['jobsohioregion'])['ssn_hash'].nunique()
grad_by_region.to_csv('/nfshome/YOUR_USERNAME/graduates_by_region_counts.csv')

To finish up the first motivating question, you found the number of graduates by two-digit subject codes. Let's see what the output looked like again.

In [None]:
# with creates a mini table
# In the last line of the query, we use ::varchar to convert 'subject_code' in the lkp table from integer
# to text. This is because when we join tables, the variable types should be the same. 
qry= '''
with subject as (select ssn_hash, left(degcert_subject,2) as code from cc_grads)
select subject.ssn_hash, lkp.subject_code_2010, lkp.subject_desc 
from subject
join data_ohio_olda_2018.oh_subject_codes_lkp lkp
on subject.code=lkp.subject_code_2010::varchar; 
'''

subject_df=pd.read_sql(qry,conn)

In [None]:
subject_df.groupby(['subject_code_2010', 'subject_desc'])['ssn_hash'].count().sort_values(ascending=False)

You have two options:
1. Only export those with counts of at least 10.
2. Aggregate the subjects with counts less than 10 so that all counts are at least 10.
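
The second option can be sketched on synthetic data as follows. The subject labels, counts, and the name of the pooled category are made up for illustration; this is one possible way to aggregate, not a prescribed method.

```python
import pandas as pd

# Synthetic counts per subject; the labels and numbers are made up
subject_counts = pd.Series(
    {"Health": 120, "Business": 85, "Engineering": 40, "Arts": 6, "Law": 3, "Music": 4}
)

THRESHOLD = 10

small = subject_counts[subject_counts < THRESHOLD]
large = subject_counts[subject_counts >= THRESHOLD].copy()

# Pool all small groups into a single category
if not small.empty:
    large["Other (aggregated)"] = small.sum()

# The pooled category must itself reach the threshold; if it does not,
# it still has to be merged with another group or left out
aggregated_ok = bool((large >= THRESHOLD).all())

print(large)
print(aggregated_ok)
```

Aggregating keeps the overall total intact, which can be preferable when the table will be read alongside a total count.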

Here, you will see how to only export those with counts of at least 10.

In [None]:
# save as dataframe to easily manipulate
df_counts_by_subject = pd.DataFrame(subject_df.groupby(['subject_code_2010', 'subject_desc'])['ssn_hash'].count().
                                    sort_values(ascending=False)).reset_index()

In [None]:
# limit to subjects with at least 10 graduates
df_counts_by_subject[df_counts_by_subject['ssn_hash'] >= 10]

This table is now good to export.

In [None]:
# limit to subjects with at least 10 graduates
counts_by_subject = df_counts_by_subject[df_counts_by_subject['ssn_hash'] >= 10]
counts_by_subject.to_csv('/nfshome/YOUR_USERNAME/counts_by_subject.csv')

Let's move on to Motivating Question #2:

**How many 2012-13 Ohio community college graduates are employed in Ohio one year after graduation? How many of them have stable employment? How does the number vary by industry?**

And more specifically:

- How many people have positive earnings each quarter during their first year after graduation?
- What are the earning distributions within a year's time of graduates who have positive earnings during the first year after graduation?
- How many people achieved stable employment within the first year after graduation? 
    - **Stable employment metric 1**: have positive earnings during ALL four quarters after graduation
    - **Stable employment metric 2**: work for the same employer during the second quarter and the fourth quarter after graduation
- How does the number of people who have stable employment vary by industry?
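
The two stable employment metrics can be illustrated on a small synthetic table. The sketch below is not the notebook's actual query: the column names (`person`, `quarter`, `employer`, `wages`) and all values are made up, and the toy counts are far too small to pass disclosure review.

```python
import pandas as pd

# Synthetic quarterly job records: one row per person, quarter, and employer
jobs = pd.DataFrame({
    "person":   ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"],
    "quarter":  [1, 2, 3, 4, 1, 2, 4, 1, 2, 3, 4],
    "employer": ["x", "x", "x", "x", "y", "y", "y", "z", "z", "w", "w"],
    "wages":    [100, 120, 110, 130, 90, 95, 99, 80, 85, 70, 75],
})

# Metric 1: positive earnings in all four quarters
quarters_worked = jobs[jobs["wages"] > 0].groupby("person")["quarter"].nunique()
metric_1 = int((quarters_worked == 4).sum())

# Metric 2: same employer in the second and the fourth quarter
q2 = jobs.loc[jobs["quarter"] == 2, ["person", "employer"]]
q4 = jobs.loc[jobs["quarter"] == 4, ["person", "employer"]]
same_employer = q2.merge(q4, on=["person", "employer"])
metric_2 = int(same_employer["person"].nunique())

print(metric_1, metric_2)
```

In this toy table, person `b` misses a quarter but keeps the same employer, while person `c` works all four quarters but changes employer, so the two metrics select different people.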

To answer these questions, you utilized the `cohort_oh_jobs` table available in the `ada_20_osu` schema. So first, let's load the `cohort_oh_jobs` table into Python.

In [None]:
# subset to jobs within 1 year *after* graduation

qry = '''
select * from ada_20_osu.cohort_oh_jobs
'''

df_jobs = pd.read_sql(qry, conn)

To find the number of graduates who had positive earnings in any quarter one year after graduation, you used the code below.

In [None]:
# how many people had wages for at least one quarter
df_jobs['ssn_hash'].nunique()

Since this statistic's count is at least 10 and does not concern specific employers, you do not need to show firm counts or firm employment percentages.

We will save the second subquestion, which concerns an earning distribution, for the last part of exporting this motivating question since it covers an important topic: fuzzy percentiles. In the meantime, let's move on to exporting stable employment metrics. Here, you will work with count statistics that do not include industry or regional breakdowns, so you don't need to worry about showing firm-specific information.

As a reminder, the first stable employment metric you used was finding those who had positive earnings all four quarters their first year after graduation.

In [None]:
#Stable employment metric 1:
# positive earnings during all four quarters
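stable_1 = sum(df_jobs.groupby(['ssn_hash']).count()['sumwages'] == 4)
# sum() returns a plain integer, so wrap it in a DataFrame before writing to csv
pd.DataFrame({'stable_emp_count_1': [stable_1]}).to_csv('/nfshome/YOUR_USERNAME/stable_emp_count_1.csv')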

For the second stable employment metric, you used another table we created in the `ada_20_osu` schema, `cohort_oh_jobs_emp`. This table allows you to find the count for the second stable employment metric, which measures the number of individuals who had the same primary employer in their second and fourth quarters after graduation.

In [None]:
# get all jobs 2 and 4 quarters after graduation
qry = '''
select ssn_hash, deg_date, job_date, employer, naics_3_digit
from ada_20_osu.cohort_oh_jobs_emp
where time_after_grad between 180 and 185 or time_after_grad = 365
'''
stable_emp = pd.read_sql(qry, conn)

In [None]:
# find the number of people who had only one employer and showed up in stable_emp twice
stable_2 = sum((stable_emp.groupby(['ssn_hash'])['employer'].nunique() == 1) &
    (stable_emp.groupby(['ssn_hash']).count()['employer'] == 2))
# sum() returns a plain integer, so wrap it in a DataFrame before writing to csv
pd.DataFrame({'stable_emp_count_2': [stable_2]}).to_csv('/nfshome/YOUR_USERNAME/stable_emp_count_2.csv')

Now that we've found and exported these counts, let's compare the different stable employment metrics by their top 10 industries.

In [None]:
qry = '''
select *
from ada_20_osu.cohort_oh_jobs_emp
'''
emp_df = pd.read_sql(qry, conn)

emp_df.head()

In [None]:
#Get the ssn_hash of people who have four quarters of records
ssn_4q_df=emp_df.groupby(['ssn_hash'])['wages'].agg(['count']).reset_index()
ssn_4q_df=ssn_4q_df[ssn_4q_df['count']==4]

#Merge this with emp_df to get industry code
emp_4q_df=ssn_4q_df.merge(emp_df,left_on='ssn_hash',right_on='ssn_hash')

#Keep the first quarter records only
emp_4q_df=emp_4q_df[emp_4q_df['time_after_grad']<=92]

In [None]:
# find top 10 industries
sort_ind = emp_4q_df.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)
sort_ind.iloc[0:10]

In [None]:
# get all jobs 2 and 4 quarters after graduation
qry = '''
select ssn_hash, deg_date, job_date, employer, naics_3_digit
from ada_20_osu.cohort_oh_jobs_emp
where time_after_grad between 180 and 185 or time_after_grad = 365
'''
stable_emp = pd.read_sql(qry, conn)

In [None]:
#Get the dataframe of people who worked for the same employer during the 2nd and the 4th quarter after graduation
stable_df=stable_emp.groupby(['ssn_hash','employer','naics_3_digit']).count().reset_index()
stable_df = stable_df[stable_df['job_date']==2]

#breakdown the number by industry
sort_ind2=stable_df.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)
sort_ind2.iloc[0:10]

In [None]:
#Compare the number of stable employment defined by the two metrics
# can use same df because same top 10
compare_df=pd.concat([sort_ind.iloc[0:10],sort_ind2.iloc[0:10]],axis=1).reset_index()
compare_df

Even though this statistic is broken down by industry, it does not concern specific employers, so it is ready for export as long as every count is at least 10.

In [None]:
compare_df.to_csv('/nfshome/YOUR_USERNAME/stable_emp_by_industry.csv')

Finally, to answer the second subquestion on the earning distribution, you used outputs from the `.describe()` function. However, you cannot export these outputs because some of those statistics (such as the minimum, maximum, median, and any other percentile) correspond to individual data points. Instead, you need to create _fuzzy percentiles_. For example, to find a fuzzy 25th percentile, you can take the average of the 20th and 30th percentiles.

In [None]:
# distribution of wages per person one year out
df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum').describe()

### Fuzzy percentiles

Let's walk through the code to create the fuzzy percentiles. You can use the `.quantile()` function to find the true values for some percentiles.

Let's say that you want to export the 25th, 50th, and 75th percentiles. You can start by finding the following true percentiles of the wage data:
- 20th and 30th (to create a fuzzy 25th percentile),
- 45th and 55th (to create a fuzzy 50th percentile),
- 70th and 80th percentile (to create a fuzzy 75th percentile). 
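
The averaging described in this list can be wrapped in a small helper. The sketch below uses synthetic values so the result is easy to check by hand; the function name `fuzzy_percentile` and the `half_width` parameter are hypothetical, and the helper assumes that `p - half_width` and `p + half_width` both stay between 0 and 1.

```python
import numpy as np
import pandas as pd


def fuzzy_percentile(values, p, half_width=0.05):
    """Average the true quantiles at p - half_width and p + half_width."""
    lower, upper = values.quantile([p - half_width, p + half_width])
    return (lower + upper) / 2


# Synthetic "wages": the numbers 0 to 100, so the qth percentile is simply q
toy_wages = pd.Series(np.arange(101, dtype=float))

fuzzy_toy = {p: fuzzy_percentile(toy_wages, p) for p in (0.25, 0.50, 0.75)}
print(fuzzy_toy)
```

On this evenly spaced toy series the fuzzy percentiles land on (approximately) 25, 50, and 75; on real, skewed wage data they will generally differ from the true percentiles, which is the point of the technique.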

In [None]:
# save distribution of annual wages per graduate
wages = df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum')

In [None]:
# Find 20, 30, 45, 55, 70, 80 percentiles
wage_qntl = wages.quantile([.20, .30, .45, .55, .70, .80])

In [None]:
wage_qntl

Now let's average the percentiles to create fuzzy 25th, 50th, and 75th percentiles.

In [None]:
# Find values for the fuzzy quantiles by averaging the percentiles 
# (e.g. to find 25th, average 20th and 30th, etc.)

fp_25 = str((wage_qntl[.20] + wage_qntl[.30])/2)
fp_50 = str((wage_qntl[.45] + wage_qntl[.55])/2)
fp_75 = str((wage_qntl[.70] + wage_qntl[.80])/2)

Let's save these fuzzy percentiles to a table.

In [None]:
# Save in pandas dataframe

fuzzy = pd.DataFrame()
fuzzy['percentile'] = ['fuzzy_25', 'fuzzy_50', 'fuzzy_75']
fuzzy['wages'] = [fp_25, fp_50, fp_75]

fuzzy

Now, these percentiles describing the wage distribution of your cohort are safe for export.

In [None]:
fuzzy.to_csv('/nfshome/YOUR_USERNAME/fuzzy_female_earnings.csv')

### Visualizations

In the Data Visualization notebook, Motivating Question #1 is as follows:

**What is the distribution of earnings during the first year after graduation for 2012-13 community college graduates? How does this differ by degree fields?**

To answer these questions, you created variations of histograms, first starting with a simple visualization and then getting more advanced. How would you submit these visualizations for export? Is there anything else you would need to provide? Let's start with a histogram of earnings during the first year after graduation for 2012-13 community college students in Ohio. `df_jobs` already contains everything you need for the first histogram.

In [None]:
# bare histogram of earnings distribution
plt.hist(df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum'))

# The show() function outputs the current state of `pyplot`: our current fig.
plt.show()

The `plt.hist()` call returns three outputs: the counts per bin, the edges of the bins, and the graphical object itself. You need to show that each bin contains at least 10 individual data points before you can export this histogram. The code cell below shows you how you can find the counts per bin. Here, we are assuming you have chosen the default number of bins, but the same approach works if you change the number of bins.

In [None]:
counts, edges, graph = plt.hist(df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum'))

In [None]:
# check whether all bin counts are at least 10
counts

You can adjust the bin size by either aggregating the smaller bins into one larger bin that satisfies the disclosure review process, or cutting off outliers.
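
As an illustration of the first approach, the sketch below merges adjacent bins from left to right until every bin holds at least 10 points. The helper name `merge_small_bins` and the example counts and edges are hypothetical; this is one possible merging strategy, not a required one.

```python
import numpy as np


def merge_small_bins(counts, edges, threshold=10):
    """Merge adjacent histogram bins, left to right, until each reaches the threshold."""
    new_counts, new_edges = [], [edges[0]]
    running = 0
    for count, right_edge in zip(counts, edges[1:]):
        running += count
        if running >= threshold:
            new_counts.append(running)
            new_edges.append(right_edge)
            running = 0
    if new_edges[-1] != edges[-1]:
        # The right tail did not reach the threshold: fold it into the last bin
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            # Fewer than `threshold` points overall: a single bin that still fails
            new_counts.append(running)
            new_edges.append(edges[-1])
    return np.array(new_counts), np.array(new_edges)


# Made-up bin counts and edges with a thin right tail
toy_counts = np.array([50, 12, 6, 3, 2])
toy_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

merged_counts, merged_edges = merge_small_bins(toy_counts, toy_edges)
print(merged_counts)  # the three tail bins are pooled into one bin of 11
print(merged_edges)
```

The merged edges can then be passed to the `bins` argument of `plt.hist()`. You should still inspect the resulting counts, because a dataset with fewer than 10 points in total cannot satisfy the threshold however it is binned.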

In [None]:
# see bin edges
edges

In [None]:
# change edges so that the counts are okay
counts, edges, graph = plt.hist(df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum'), bins = [REDACTED])

In [None]:
# check counts
counts

Now, all the bin counts are at least 10, so this histogram is safe to export. First, let's export the histogram by using the `.savefig()` function, which works similarly to `to_csv()`.
> You cannot save a figure directly after running `plt.show()`. To save the figure, you need to run the plot and then `.savefig()`.

In [None]:
plt.savefig('/nfshome/YOUR_USERNAME/earnings_hist.pdf')

You also need to report the accompanying bin counts stored in `counts`. Please make sure the count exports for visualizations are easy to link to your visualizations by naming them `counts_for_...`.

In [None]:
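# counts is a NumPy array, which has no to_csv method, so wrap it in a DataFrame first
pd.DataFrame({'bin_count': counts}).to_csv('/nfshome/YOUR_USERNAME/counts_for_earnings_hist.csv')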

When you want to export visualizations containing multiple groups, such as the subquestion regarding earnings differences by degree field, you need to show the counts within each group. Let's recall the code used to generate that visualization.

In [None]:
# Find most recent graduation within the span of 2012-13 academic year
# also get two-digit subject code
qry = '''
create temp table cc_grads_recent as
select distinct on (ssn_hash) *, left(degcert_subject, 2) as subject
from (
SELECT *, 
    CASE WHEN degcert_term_earned = 4 THEN
        format('%%s-%%s-01', degcert_yr_earned, 7)::date 
    WHEN degcert_term_earned = 1 THEN
        format('%%s-%%s-01', degcert_yr_earned, 10)::date 
    WHEN degcert_term_earned = 2 THEN
        format('%%s-%%s-01', degcert_yr_earned, 1)::date 
    WHEN degcert_term_earned = 3 THEN
        format('%%s-%%s-01', degcert_yr_earned, 4)::date 
    END AS deg_date
    from cc_grads
) q
order by ssn_hash, deg_date DESC
'''
conn.execute(qry)

In [None]:
df['subject'].unique()

In [None]:
# select these subjects so we can subset most_recent and add the corresponding subject description
# need to set as tuple so we can use .format() properly
pop_subs = tuple(subject_df.groupby(['subject_code_2010'])['ssn_hash'].count().sort_values(
    ascending=False)[0:10].reset_index()['subject_code_2010'])

pop_subs

In [None]:
# save as temp table ten_subs
qry= '''
create temp table ten_subs as
select cc.ssn_hash, cc.deg_date, cc.subject, lkp.subject_desc 
from cc_grads_recent cc
join data_ohio_olda_2018.oh_subject_codes_lkp lkp
on cc.subject=lkp.subject_code_2010::varchar
where cc.subject != 'TR' and cc.subject::int in {}
'''.format(pop_subs)
conn.execute(qry)

In [None]:
# Now that we have this, we can match it to the cohort_oh_jobs table because it already contains the earnings
# for most recent graduation within this time
qry = '''
select distinct t.*, j.deg_date, j.sumwages
from ten_subs t
join ada_20_osu.cohort_oh_jobs j
on j.ssn_hash = t.ssn_hash
'''
top_subs_wage = pd.read_sql(qry, conn)

In [None]:
top_subs_wage.head()

In [None]:
# Calculate each person's earnings during the first year after graduation
df_by_ssn = top_subs_wage.groupby(['ssn_hash', 'subject_desc'])['sumwages'].agg('sum').reset_index()

In [None]:
df_by_ssn.head()

In [None]:
plt.rc('figure', figsize=(15, 10))

# By convention, a returned Axes object is often called `ax`
ax = sns.barplot(
    y="subject_desc", # seaborn is clever enough to create a horizontal chart
    x="sumwages", 
    data=df_by_ssn, # order in data to order in figure
    palette='vlag',
    ci=None
)

ax.set_title('First-Year Earnings Vary Considerably Across Degree Fields');

Before we can safely export this visualization, though, we need to show individual counts for each subject field.
> This policy also applies to line graphs. If you wanted to export earnings over time for your cohort, you would need to show counts for each division of time in the graph.
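
For the line graph case, the counts per time period can be computed alongside the plotted statistic. The sketch below uses synthetic records; the column names (`person`, `quarter`, `wages`) and the `exportable` flag are made up for illustration.

```python
import pandas as pd

# Synthetic records: 12 people with wages in quarters 1-3, only 8 of them in quarter 4
records = pd.DataFrame(
    [(f"p{i}", q, 1000 + 10 * i + q)
     for q in (1, 2, 3, 4)
     for i in range(12)
     if q < 4 or i < 8],
    columns=["person", "quarter", "wages"],
)

# One row per point on the line: the plotted statistic plus the count behind it
by_quarter = records.groupby("quarter").agg(
    mean_wages=("wages", "mean"),
    individual_counts=("person", "nunique"),
)

# Flag any point that is based on fewer than 10 individuals
by_quarter["exportable"] = by_quarter["individual_counts"] >= 10
print(by_quarter)
```

Here the fourth quarter is based on only 8 people, so that point would have to be dropped or combined with another period before the graph could be exported.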

To get individual counts by subject field, we need to join two tables we've already made: 
- `ada_20_osu.cohort_oh_jobs`, which has UI wage records by quarter information
- `ten_subs`, which has subject information and limits cohort to those with degrees in our subjects of interest

In [None]:
qry = '''
select j.*, t.subject, t.subject_desc
from ada_20_osu.cohort_oh_jobs j
join ten_subs t
on t.ssn_hash = j.ssn_hash and t.deg_date = j.deg_date
'''
df_sub = pd.read_sql(qry, conn)

df_sub.head()

In [None]:
# one example that will motivate a loop
df_sub[df_sub['subject'] == '11']['ssn_hash'].nunique()

We can write a `for` loop to find the counts for each subject field.

In [None]:
# now let's get counts for each subject
count_stat1 = list()
for code in pop_subs:
    code = str(code)
    count_stat1.append(df_sub[df_sub['subject'] == code]['ssn_hash'].nunique())

In [None]:
discl_proof = pd.DataFrame({'subject_code':pop_subs, 'individual_counts':count_stat1})

In [None]:
discl_proof

Now that we have provided the necessary statistics for export, we can export the visualization and the corresponding counts per subject.

In [None]:
discl_proof.to_csv('/nfshome/YOUR_USERNAME/counts_for_earnings_hist_spring.csv')

In [None]:
plt.savefig('/nfshome/YOUR_USERNAME/earnings_hist_spring.pdf')

### Machine Learning

After creating your training and test datasets, please include the counts of each variable, and do not alter the datasets afterwards. If you use any dummy variables, you need to provide the count of 0s and 1s for each dummy variable. This will be covered more extensively in the unsupervised machine learning notebooks.
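
The counts of 0s and 1s per dummy variable can be collected in one table. The sketch below uses a synthetic training set; the column names and values are hypothetical, and the `min_count` column is only a convenience for spotting small cells.

```python
import pandas as pd

# Synthetic training data with two dummy variables (made-up names and values)
train = pd.DataFrame({
    "employed_q1": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "full_time":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

# One row per dummy variable, with the number of 0s and 1s
dummy_counts = pd.DataFrame({
    col: train[col].value_counts().reindex([0, 1], fill_value=0)
    for col in train.columns
}).T.rename(columns={0: "count_0", 1: "count_1"})

# Smallest cell per variable, to compare against the disclosure threshold
dummy_counts["min_count"] = dummy_counts[["count_0", "count_1"]].min(axis=1)
print(dummy_counts)
```

In this toy example both variables have a cell below 10, so they would need attention before any output based on them is exported.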

Remember that a plot of y-scores is still a histogram in which each estimate represents an individual data point, so it needs to comply with the disclosure threshold described above.

### Reminder
Every single item you wish to export, regardless of whether it is a .csv, .pdf, .png, or something else, must have corresponding proof in your input file to show that every group used to create this statistic followed our disclosure review rules.

> Note: After the end of the course, you can export the code that you have been using. In order to do that, you will need to clear the outputs of the notebooks.