**_Disclosure Review Examples & Exercises_**

This notebook explains how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives you an overview of the information needed for disclosure review.

In [None]:
# Load packages
import os
import pandas as pd
import numpy as np
import psycopg2

# plotting: import matplotlib explicitly instead of relying on %pylab
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

# General Remarks on Disclosure Review

## Files you can export
In general, you can export any file format. However, most results that researchers export are tables, graphs, regression output and aggregated data. We therefore ask you to export one of these types, which means that every result you would like to export needs to be saved in .csv, .txt or graph format.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, you can't export results in a Jupyter notebook, because doing disclosure reviews on output in Jupyter notebooks is too burdensome for us. Jupyter notebooks will only be exported for the purpose of exporting code, after all output has been deleted. This does not mean that you won't need your Jupyter notebooks during the export process.
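
As a minimal sketch of what "output deleted" means in practice, the snippet below strips the outputs from a notebook's JSON using only the standard library. The helper name `clear_outputs` and the tiny example notebook are hypothetical and exist only for illustration; the Jupyter menus or `jupyter nbconvert --clear-output --inplace` should give an equivalent result.

```python
import json


def clear_outputs(notebook_json: str) -> str:
    """Return the notebook JSON with all code-cell outputs removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed tables, plots, etc.
            cell["execution_count"] = None  # reset the In [n] counter
    return json.dumps(nb, indent=1)


# Tiny hand-written notebook (nbformat 4 layout), for illustration only
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]},
                      "execution_count": 3, "metadata": {}}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 2,
})

cleaned = json.loads(clear_outputs(example))
print(cleaned["cells"][1]["outputs"])
```

The code in each cell is kept unchanged; only the results are removed.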

## Documentation of code is important
During the export process, we ask you to provide the code for every output you are asking to export. The ADRF staff need the code to understand exactly what you did, because knowing how research results were created is essential to assessing your research output. Thus, it is important to document every step of your analysis in your Jupyter notebook.

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found on the class website. This is just a quick overview. We recommend that you go to the class website and read the guidelines before you prepare your files for export.
- The disclosure review is based on the underlying observations of your study. Every statistic you want to export should be based on at least 10 individual data points
- Document your code so the reviewer can follow your data work. Assessing re-identification risks depends heavily on the context. Thus it is important that you provide context information with your analysis for the reviewer
- Save the requested output with the corresponding code in your input and output folder. Make sure the code is executable. The code should produce exactly the output you requested
- If you are exporting PowerPoint slides that show project results, you have to provide the code which produces the output in the slides
- Please export results only when they are final and you need them for your presentation or final project report
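
The 10-observation rule can be checked mechanically before you submit anything. The following is a minimal sketch on made-up counts; the helper name `small_cells` and the constant `MIN_CELL_SIZE` are assumptions introduced here, not part of any ADRF tooling.

```python
import pandas as pd

MIN_CELL_SIZE = 10  # threshold stated in the rules above


def small_cells(counts: pd.DataFrame, threshold: int = MIN_CELL_SIZE) -> pd.DataFrame:
    """Return a boolean frame that is True wherever a cell count is below the threshold."""
    return counts < threshold


# Made-up table of counts, for illustration only
counts = pd.DataFrame(
    {"low": [25, 8], "mid": [14, 30], "high": [12, 11]},
    index=["Q1", "Q2"],
)

flags = small_cells(counts)
n_problem = int(flags.values.sum())
print(flags)
print("cells below threshold:", n_problem)
```

Any cell flagged `True` would have to be regrouped or suppressed before the table can be released.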

# Disclosure Review Walkthrough

We will use Illinois Department of Employment Statistics data to construct the statistics we are interested in and prepare them so that we can submit the output for disclosure review.

In [None]:
# get working directory
mypath = (os.getcwd())
print(mypath)

In [None]:
# pandas-related imports
import pandas as pd

# database interaction imports
import sqlalchemy

In [None]:
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

## Pull data

In this example we will use `ada_tdc_2019.q42014_cohort_wage`, which contains all jobs held in Indiana and Illinois, during a specific subset of quarters, by TANF recipients whose spells ended in 2014 Q4. We will look only at employment in Indiana in 2015 Q1.

In [None]:
# Get data
query = """
SELECT *
from ada_tdc_2019.q42014_cohort_wage
where quarter = 1 and year = 2015 and state = 18
"""

df = pd.read_sql(query, conn)

In [None]:
# Check dataframe
df.head()

In [None]:
# another way to check dataframe
df.info()

In [None]:
# basic stats of the numeric columns
df.describe()

In [None]:
# let's add an earnings categorization for "low", "mid" and "high" using a simple function
def earn_calc(earn):
    if earn < 5000:
        return 'low'
    elif earn < 10000:
        return 'mid'
    else:
        return 'high'

In [None]:
earn_calc(24000)

In [None]:
df['earn_cat'] = df['wages'].apply(earn_calc)

We have now loaded the data that we need to generate some basic statistics about the populations we want to compare.

In [None]:
# Let's look at some first descriptives by group
grouped = df.groupby('earn_cat')
grouped.describe()

In [None]:
grouped.describe().T

A statistic in this table will be released only if it is based on at least 10 entities (in this example, individuals). We can see that the total number of individuals we observe in each group does not satisfy this requirement (see the cell counts). In addition, the table reports percentiles as well as the minimum and maximum values. The minimum and maximum in particular most likely each represent a single individual.

Thus, during disclosure review these values will be suppressed. If we change the cutoffs in the `earn_calc` function instead, we might be able to get the statistics based on these groups released.
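
One way to avoid suppression of individual-level values is to drop them from the table yourself before requesting export. The sketch below uses a made-up dataframe (`toy`) rather than the cohort data, and the list `single_obs_stats` is an assumption about which `describe()` columns to treat as describing single observations.

```python
import pandas as pd

# Made-up wages for two groups of 12 and 11 individuals, for illustration only
toy = pd.DataFrame({
    "earn_cat": ["low"] * 12 + ["high"] * 11,
    "wages": list(range(1000, 2200, 100)) + list(range(12000, 17500, 500)),
})

desc = toy.groupby("earn_cat").describe()

# Statistics that can correspond to a single person's value
single_obs_stats = ["min", "25%", "50%", "75%", "max"]

# describe() returns two-level columns (variable, statistic); drop on the second level
safe = desc.drop(columns=single_obs_stats, level=1)
print(safe)
```

What remains is the count, mean and standard deviation per group, which still need to be checked against the 10-observation threshold.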

In [None]:
# let's redefine the earnings categorization with a lower cutoff between "mid" and "high"
def earn_calc(earn):
    if earn < 5000:
        return 'low'
    elif earn < 7000:
        return 'mid'
    else:
        return 'high'

In [None]:
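# Re-apply the updated function; otherwise earn_cat still reflects the old cutoffs
df['earn_cat'] = df['wages'].apply(earn_calc)
# Let's look at some first descriptives by group
grouped = df.groupby('earn_cat')
grouped.describe()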

In [None]:
grouped.describe().T

Now, we can export statistics based on these groups. Let's safely export the statistics in the following code cell.

In [None]:
# Now let's export the statistics. Ideally we want to have a csv file
# We can save the statistics in a dataframe
export1 = grouped.describe()
# and then write it to csv
export1.to_csv('descriptives_by_group_disclosive.csv')

### Reminder: Export of Statistics
You can save any dataframe as a csv file and export this csv file. Keep in mind that, besides the statistic X you are interested in, you have to include a variable with the count of X so we can see how many observations the statistic is based on. This also applies if you aggregate data. For example, if you aggregate by benefit type, we need to know how many observations are in each benefit program (because after the aggregation each benefit type will be only one data point).
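
A minimal sketch of an aggregation that carries its own counts is shown below. The dataframe `toy` and the column names `benefit_type`, `n_obs` and `mean_wages` are made up for illustration.

```python
import pandas as pd

# Made-up data: 12 records in benefit type A, 15 in benefit type B
toy = pd.DataFrame({
    "benefit_type": ["A"] * 12 + ["B"] * 15,
    "wages": [3000.0] * 12 + [5000.0] * 15,
})

# Named aggregation: one column for the count, one for the statistic
summary = toy.groupby("benefit_type").agg(
    n_obs=("wages", "count"),
    mean_wages=("wages", "mean"),
)
print(summary)
```

Calling `summary.to_csv(...)` on a table like this gives the reviewer the statistic and the number of observations behind it in a single file.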

### Problematic Output
Some subgroups (e.g., by race and gender in some of the Illinois datasets) will result in cell sizes representing fewer than 10 people.

Tables with cells representing fewer than 10 individuals won't be released. In this case, disclosure review would mean deleting all cells with counts of fewer than 10. In addition, secondary suppression has to take place: the disclosure reviewer has to delete as many additional cells as needed to make it impossible to recalculate the suppressed values.
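
To illustrate why secondary suppression is needed, the sketch below blanks out small cells and then flags every row and column in which exactly one cell was suppressed. If row or column totals are published, such a lone suppressed cell can be recovered by subtraction. The helper names and the counts are hypothetical, and this only detects the problem; it is not the reviewers' actual procedure.

```python
import pandas as pd


def primary_suppress(counts: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Replace every cell below the threshold with NaN."""
    return counts.where(counts >= threshold)


def needs_secondary(suppressed: pd.DataFrame):
    """Return labels of rows and columns that contain exactly one suppressed cell."""
    masked = suppressed.isna()
    rows = masked.sum(axis=1) == 1
    cols = masked.sum(axis=0) == 1
    return rows[rows].index.tolist(), cols[cols].index.tolist()


# Made-up counts, for illustration only
counts = pd.DataFrame(
    {"female": [40, 7, 22], "male": [35, 18, 31]},
    index=["group_a", "group_b", "group_c"],
)

suppressed = primary_suppress(counts)
rows, cols = needs_secondary(suppressed)
print(suppressed)
print("rows needing secondary suppression:", rows)
print("columns needing secondary suppression:", cols)
```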

### How to do it better
Instead of asking to export tables like this, prepare your tables in advance so that every cell is based on at least 10 observations.
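
One common way to prepare such a table is to merge small categories before tabulating. The following is a sketch under the assumption that merging is meaningful for your analysis; the helper `collapse_small_categories` and the label `other` are made up for illustration.

```python
import pandas as pd


def collapse_small_categories(values: pd.Series, threshold: int = 10,
                              other: str = "other") -> pd.Series:
    """Relabel every category with fewer than `threshold` rows as `other`."""
    counts = values.value_counts()
    small = counts[counts < threshold].index
    return values.where(~values.isin(small), other)


# Made-up categories: "b" and "c" are too small on their own
toy = pd.Series(["a"] * 20 + ["b"] * 6 + ["c"] * 5)

collapsed = collapse_small_categories(toy)
new_counts = collapsed.value_counts()
print(new_counts)
```

The merged category has to be checked as well: if it still contains fewer than 10 observations, it needs to be combined with a larger group.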

### Fuzzy quartiles
Percentile values themselves cannot be released because each one can represent a single observation. Below we show an example of calculating "fuzzy quartiles", which approximate the true quartiles by averaging two nearby percentiles and can still pass disclosure review.
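
Before working through the cohort data, here is a self-contained sketch of the idea on made-up numbers. The helper `fuzzy_percentile` and its `width` parameter are assumptions introduced for illustration; the width of 0.05 mirrors the 20%/30%, 45%/55% and 70%/80% pairs used in this notebook.

```python
import numpy as np
import pandas as pd


def fuzzy_percentile(values: pd.Series, p: float, width: float = 0.05) -> float:
    """Average the percentiles just below and just above p."""
    lower = values.quantile(p - width)
    upper = values.quantile(p + width)
    return (lower + upper) / 2


# Made-up wages: the values 1, 2, ..., 101
wages = pd.Series(np.arange(1, 102, dtype=float))

fuzzy_25th = fuzzy_percentile(wages, 0.25)
fuzzy_median = fuzzy_percentile(wages, 0.50)
fuzzy_75th = fuzzy_percentile(wages, 0.75)
print(fuzzy_25th, fuzzy_median, fuzzy_75th)
```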

In [None]:
idx = pd.IndexSlice

In [None]:
grouped.describe(percentiles = [.20, .30, .45, .55, .70, .80])\
.loc[:,idx[('wages'), ('20%', '30%', '45%', '55%', '70%', '80%')]]

In [None]:
fuzzy_quartiles = grouped.describe(percentiles = [.20, .30, .45, .55, .70, .80])\
.loc[:,idx[('wages'), ('20%', '30%', '45%', '55%', '70%', '80%')]]

In [None]:
(fuzzy_quartiles.loc[:,idx[('wages'), ('20%')]] + fuzzy_quartiles.loc[:,idx[('wages'), ('30%')]])/2

In [None]:
fuzzy_quartiles['fuzzy_25th'] = (fuzzy_quartiles.loc[:,idx[('wages'), ('20%')]] + fuzzy_quartiles.loc[:,idx[('wages'), ('30%')]])/2
fuzzy_quartiles

In [None]:
# now do that for the 50th (median) and 75th percentiles:
fuzzy_quartiles['fuzzy_median'] = (fuzzy_quartiles.loc[:,idx[('wages'), ('45%')]] + 
                                 fuzzy_quartiles.loc[:,idx[('wages'), ('55%')]])/2
fuzzy_quartiles['fuzzy_75th'] = (fuzzy_quartiles.loc[:,idx[('wages'), ('70%')]] + 
                                 fuzzy_quartiles.loc[:,idx[('wages'), ('80%')]])/2


In [None]:
fuzzy_quartiles.loc[:,idx[("fuzzy_25th", 'fuzzy_median', 'fuzzy_75th'), ('','','')]]

In [None]:
# those "fuzzy_#" columns are now safe for export!
fuzzy_quartiles.loc[:,idx[("fuzzy_25th", 'fuzzy_median', 'fuzzy_75th'), ('','','')]].to_csv('fuzzy_quartiles.csv')

### Reminder: Export of Tables
For all tables, you need to provide the underlying counts of the statistics presented in the table. Make sure you provide all counts. If you calculate ratios (e.g., employment rates), you need to provide the count of individuals who are employed and the count of those who are not. If you are interested in percentages, we still need the underlying counts for disclosure review. Please label the table so that we can easily understand what you are presenting. For this analysis, let's use a similar cohort, but covering all of 2015 instead of only 2015 Q1.
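
As a minimal sketch of a ratio table that carries its underlying counts, consider the made-up data below. The `employed` indicator and all column names are hypothetical and are not columns of the cohort table.

```python
import pandas as pd

# Made-up data: 1 = employed, 0 = not employed
toy = pd.DataFrame({
    "earn_cat": ["low"] * 30 + ["high"] * 25,
    "employed": [1] * 18 + [0] * 12 + [1] * 15 + [0] * 10,
})

table = toy.groupby("earn_cat")["employed"].agg(n_employed="sum", n_total="count")
table["n_not_employed"] = table["n_total"] - table["n_employed"]
table["employment_rate"] = table["n_employed"] / table["n_total"]
print(table)
```

Both the numerator and the complement of each rate are visible, so the reviewer can confirm that each is based on at least 10 individuals.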

In [None]:
# Get data
query = """
SELECT *
from ada_tdc_2019.q42014_cohort_wage
where year = 2015 and state = 18
"""

df = pd.read_sql(query, conn)

In [None]:
# apply the earnings categories function again to create a new variable earn_cat
df['earn_cat'] = df['wages'].apply(earn_calc)

In [None]:
# group the categories
grouped = df.groupby('earn_cat')

In [None]:
# wages by category
grouped[['wages']].describe(percentiles = [.5, .9, .99, .999])

In [None]:
# Let's say we are interested in plotting parts of the crosstabulation as a graph
# First we need to calculate the counts
graph = df.groupby(['earn_cat', 'quarter'])['ssn'].count()

In [None]:
# Note: we need to add the unstack command here because our dataframe has nested indices. 
# We need to flatten out the data before plotting the graph
print(graph)
print(graph.unstack())

In [None]:
# Now we can generate the graph
mygraph = graph.unstack().plot(kind='bar')

In this graph, it is not clearly visible how many observations are in each bar. Thus we either have to provide a corresponding table (as we generated earlier), or we can use the `table=True` option to add a table of counts to the graph. In addition, we want to make sure that all our axes and the legend are labeled properly.

In [None]:
# Graphical representation including underlying values: the option table=True displays the underlying counts
mygraph = graph.unstack().plot(kind='bar', table=True, figsize=(7,5), fontsize=7)
# Adjust legend and axes (the legend entries are the quarters)
mygraph.legend(["1", "2", "3", "4"], title="Quarter", loc=1, ncol=3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
# We don't need an x axis if we display the table of counts
mygraph.axes.get_xaxis().set_visible(False)
# Grab table info
table = mygraph.tables[0]
# Format table
table.set_fontsize(9)

> We're good to go from a counts perspective!

In [None]:
# Now we can export the graph as pdf
# Save plot to file
export2 = mygraph.get_figure()
export2.set_size_inches(15,10, forward=True)
export2.savefig('barchart_jobs_income_category.pdf', bbox_inches='tight', dpi=300)

### Reminder: Export of Graphs
It is important that every point which is plotted in a graph is based on at least 10 observations. Thus scatterplots for example cannot be released. In case you are interested in exporting a histogram, you have to change the bin size to make sure each bin contains at least 10 people. In addition to the graph, you have to provide the ADRF with the underlying table in a .csv or .txt file. This file should have the same name as the graph so the ADRF staff can directly see which files go together. Alternatively, you can include the counts in the graph as shown in the example above. 
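
For histograms, the bin-size requirement can be checked before plotting. The sketch below searches for the largest number of equal-width bins in which every bin holds at least 10 observations and builds the table of counts to submit with the graph. The helper `safe_histogram`, its `max_bins` default and the simulated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def safe_histogram(values, max_bins: int = 20, threshold: int = 10):
    """Return (counts, edges) for the largest bin number where all bins meet the threshold."""
    for n_bins in range(max_bins, 0, -1):
        counts, edges = np.histogram(values, bins=n_bins)
        if counts.min() >= threshold:
            return counts, edges
    raise ValueError("fewer than `threshold` observations in total")


# Simulated wages, for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=5000, scale=1500, size=500)

counts, edges = safe_histogram(values)

# Table of underlying counts to accompany the graph
table = pd.DataFrame({
    "bin_left": edges[:-1],
    "bin_right": edges[1:],
    "count": counts,
})
print(table)
```

Passing `edges` as the `bins` argument when plotting keeps the graph and the table consistent, and saving the table under the same base name as the graph lets the reviewer match the two files.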