**_Privacy and Confidentiality Exercises_**

This notebook shows you how to prepare your results for export and what you have to keep in mind in general when you want to export output. You will learn how to prepare files for export so they meet our export requirements.

In [None]:
# Load packages
from __future__ import print_function
import os
import pandas as pd
import numpy as np
import psycopg2

# matplotlib is imported explicitly so that matplotlib.style is available
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

# General Remarks on Disclosure Review
This notebook provides you with information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives you an overview of the information needed for disclosure review.

## Files you can export
In general, you can export any file format. However, most results that researchers export are tables, graphs, regression output and aggregated data. We therefore ask you to stick to these types, which means that every result you want to export must be saved as a .csv file, a .txt file, or a graph file.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, you can't export results inside a Jupyter notebook, because reviewing output embedded in notebooks is too burdensome for us. Jupyter notebooks will only be exported with all output deleted, for the purpose of exporting code. This does not mean that you won't need your Jupyter notebooks during the export process.
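As a minimal sketch of what "output deleted" means in practice, the snippet below removes all outputs from a notebook represented as nbformat 4 JSON. The function name `strip_outputs` and the toy notebook are illustrative assumptions, not part of the ADRF tooling; the notebook interface's own "clear all outputs" menu entry achieves the same result.

```python
import json

def strip_outputs(notebook):
    """Return a copy of a notebook dict (nbformat 4 JSON) without code outputs.

    Hypothetical helper for illustration only.
    """
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed tables, plots, etc.
            cell["execution_count"] = None  # reset the In [n] counter
    return cleaned

# Tiny in-memory notebook used only to demonstrate the helper
example_nb = {
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": ["df.head()"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["confidential rows"]}]},
    ],
}

clean_nb = strip_outputs(example_nb)
print(clean_nb["cells"][1]["outputs"])
```

To apply this to a real file, you would load it with `json.load`, pass the result through the helper, and write it back with `json.dump`.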

## Documentation of code is important
During the export process we ask you to provide the code for every output you want to export. ADRF staff need the code to understand exactly what you did, because knowing how a result was created is essential for assessing it. Thus, it is important to document every single step of your analysis in your Jupyter notebook.

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found on the class website. This is just a quick overview. We recommend that you go to the class website and read the entire guidelines before you prepare your files for export.
- The disclosure review is based on the underlying observations of your study. Every statistic you want to export should be based on at least 10 individual data points.
- Document your code so the reviewer can follow your data work. Assessing re-identification risk depends heavily on context, so it is important that you provide context information with your analysis for the reviewer.
- Save the requested output with the corresponding code in your input and output folder. Make sure the code is executable. The code should produce exactly the output you requested.
- If you are exporting PowerPoint slides that show project results, you have to provide the code that produces the output in the slides.
- Please export results only when they are final and you need them for your presentation or final project report.
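The 10-observation rule above can be checked before you request an export. The following is a minimal sketch, assuming a pandas dataframe and a hypothetical helper `small_cells`; the column names in the toy data are made up for illustration.

```python
import pandas as pd

MIN_CELL_SIZE = 10  # threshold taken from the rule above

def small_cells(df, group_cols, threshold=MIN_CELL_SIZE):
    """Return the groups that contain fewer than `threshold` rows.

    Hypothetical helper for illustration only.
    """
    counts = df.groupby(group_cols).size()
    return counts[counts < threshold]

# Toy data: group "a" has 12 rows, group "b" only 4
toy = pd.DataFrame({"group": ["a"] * 12 + ["b"] * 4, "value": range(16)})

flagged = small_cells(toy, ["group"])
print(flagged)
```

An empty result means every group meets the threshold. Note that this counts rows, so it only matches the rule if each row corresponds to one individual.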

# Disclosure Review Walkthrough

We will use IL DES and MO DES data to construct the statistics we are interested in and prepare them so that we can submit the output for disclosure review.

In [None]:
# get working directory
mypath = (os.getcwd())
print(mypath)

In [None]:
# connect to database
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) 

## Pull data

In this example we will use the workers who had a job in both MO and IL at some point over the course of our datasets (2005-2016).

In [None]:
# Get data
query = """
SELECT *, il_wage + mo_wage AS earnings
FROM ada_18_uchi.il_mo_overlap_by_qtr
WHERE year = 2011 
AND quarter IN (2,3)"""

In [None]:
# Save query in dataframe
df = pd.read_sql( query, con = conn )

In [None]:
# Check dataframe
df.head()

In [None]:
# another way to check dataframe
df.info()

In [None]:
# basic summary statistics
df.describe()

In [None]:
# let's add an earnings categorization for "low", "mid" and "high" using a simple function
def earn_calc(earn):
    if earn < 16500:
        return 'low'
    elif earn < 45000:
        return 'mid'
    else:
        return 'high'

In [None]:
earn_calc(24000)

In [None]:
df['earn_cat'] = df['earnings'].apply(earn_calc)

We have now loaded the data we need to generate some basic statistics about the populations we want to compare.

In [None]:
# Let's look at some first descriptives by group
grouped = df.groupby('earn_cat')
grouped.describe()

In [None]:
grouped.describe().T

Statistics in this table will be released if they are based on at least 10 entities (in this example, individuals). We can see that the total number of individuals in each group satisfies this (see the count statistic). However, we also report percentiles as well as the minimum and maximum values. The minimum and maximum in particular most likely each represent a single individual.

Thus, during disclosure review these values will be suppressed.
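One way to avoid the suppression step is to drop these statistics yourself before saving the table. The sketch below uses toy data and keeps only `count`, `mean` and `std`; which statistics are acceptable is ultimately the reviewer's decision, so treat the `keep` list as an assumption.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are made up)
toy = pd.DataFrame({
    "earn_cat": ["low"] * 12 + ["high"] * 11,
    "earnings": list(range(1000, 13000, 1000)) + list(range(50000, 61000, 1000)),
})

desc = toy.groupby("earn_cat").describe()

# describe() on a grouped dataframe returns columns of the form
# (variable, statistic); keep only statistics that do not point to
# individual records such as min, max or percentiles
keep = ["count", "mean", "std"]
safe = desc.loc[:, desc.columns.get_level_values(-1).isin(keep)]
print(safe)
```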

In [None]:
# Now let's export the statistics. Ideally we want to have a csv file
# We can save the statistics in a dataframe
export1 = grouped.describe()
# and then print to csv
export1.to_csv('descriptives_by_group.csv')

### Reminder: Export of Statistics
You can save any dataframe as a csv file and export this csv file. The only thing you have to keep in mind is that, besides the statistic X you are interested in, you have to include a count variable for X so we can see how many observations the statistic is based on. This also applies if you aggregate data. For example, if you aggregate by benefit type, we need to know how many observations are in each benefit program (because after the aggregation each benefit type will be only one data point).
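A minimal sketch of an aggregation that carries its counts along is shown below. The columns `benefit_type` and `amount` are hypothetical and the data is made up; the point is only that the count sits next to the statistic in the exported file.

```python
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "benefit_type": ["A"] * 15 + ["B"] * 10,
    "amount": [100.0] * 15 + [200.0] * 10,
})

# Compute the statistic and the number of observations behind it together
export = (toy.groupby("benefit_type")["amount"]
             .agg(mean_amount="mean", n_obs="count")
             .reset_index())

# to_csv() without a path returns the csv text; pass a filename to write a file
csv_text = export.to_csv(index=False)
print(csv_text)
```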

### Problematic Output
Some subgroups (e.g. in some of the Illinois datasets dealing with race and gender) will result in cell sizes representing fewer than 10 people.

Tables with cells representing fewer than 10 individuals won't be released. In this case, disclosure review would mean deleting all cells with counts of less than 10. In addition, secondary suppression has to take place: the disclosure reviewer has to delete as many additional cells as needed to make it impossible to recalculate the suppressed values.
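To illustrate why secondary suppression is needed, here is a deliberately simplified sketch for a one-way table whose total is also published. If only one cell is hidden, it can be recovered as the total minus the visible cells, so a second cell is hidden too. The function `suppress_counts` is a hypothetical helper, and real secondary suppression for multi-way tables is more involved than this.

```python
import pandas as pd

def suppress_counts(counts, threshold=10):
    """Hide small cells, plus one more cell if a single hidden cell
    could be recomputed from a published total.

    Simplified illustration for one-way tables only.
    """
    counts = counts.astype("float")
    primary = counts < threshold
    out = counts.mask(primary)  # primary suppression: small cells become NaN
    # Secondary suppression: with exactly one hidden cell, total minus the
    # visible cells reveals it, so also hide the smallest visible cell
    if primary.sum() == 1 and out.notna().any():
        out[out.idxmin()] = float("nan")
    return out

cells = pd.Series({"a": 40, "b": 7, "c": 25, "d": 60})
result = suppress_counts(cells)
print(result)
```

In this toy table, cell `b` is hidden because it is below the threshold and cell `c` is hidden as well because it is the smallest remaining cell.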

### How to do it better
Instead of asking to export a table like this, you should prepare your tables in advance so that every cell is based on at least 10 observations.
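One common way to prepare such a table is to merge small categories before tabulating. The sketch below is an assumption about how you might do this with pandas; `collapse_small_categories` and the label "Other" are hypothetical names.

```python
import pandas as pd

def collapse_small_categories(values, threshold=10, other_label="Other"):
    """Replace categories with fewer than `threshold` rows by one label.

    Hypothetical helper for illustration only.
    """
    counts = values.value_counts()
    small = counts[counts < threshold].index
    # keep values that belong to large categories, relabel the rest
    return values.where(~values.isin(small), other_label)

# Toy data: "y" and "z" are too small on their own
toy = pd.Series(["x"] * 20 + ["y"] * 6 + ["z"] * 5)

collapsed = collapse_small_categories(toy)
print(collapsed.value_counts())
```

The combined category still has to be checked afterwards, since merging several very small groups can leave it below the threshold as well.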

### Reminder: Export of Tables
For tables of any kind you need to provide the underlying counts of the statistics presented in the table. Make sure you provide all counts. If you calculate ratios, for example employment rates, you need to provide the count of individuals who are employed and the count of those who are not. If you are interested in percentages, we still need the underlying counts for disclosure review. Please label the table in a way that lets us easily understand what it shows.
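As a sketch of a rate table that carries both counts, consider the toy example below. The columns `region` and `employed` are hypothetical and the values are made up.

```python
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "region": ["north"] * 20 + ["south"] * 30,
    "employed": [True] * 15 + [False] * 5 + [True] * 18 + [False] * 12,
})

# Count employed individuals and all individuals per region
table = toy.groupby("region")["employed"].agg(n_employed="sum", n_total="count")

# Report the complementary count and the rate next to the counts
table["n_not_employed"] = table["n_total"] - table["n_employed"]
table["employment_rate"] = table["n_employed"] / table["n_total"]
print(table)
```

With both `n_employed` and `n_not_employed` in the file, the reviewer can verify that the numerator and its complement each meet the threshold.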

In [None]:
df[['il_flag', 'mo_flag']].describe(percentiles = [.5, .9, .99, .999])

In [None]:
# for this example let's cap the job counts to 5
df['il_flag'] = df['il_flag'].apply(lambda x: x if x < 5 else 5)
df['mo_flag'] = df['mo_flag'].apply(lambda x: x if x < 5 else 5)

In [None]:
# Let's say we are interested in plotting parts of the crosstabulation as a graph,
# for example earnings category and number of IL jobs
# First we need to calculate the counts
graph = df.groupby(['earn_cat', 'il_flag'])['ssn'].count()

In [None]:
# Note: we need to add the unstack command here because our dataframe has nested indices. 
# We need to flatten out the data before plotting the graph
print(graph)
print(graph.unstack())

In [None]:
# Now we can generate the graph
mygraph = graph.unstack().plot(kind='bar')

In this graph it is not clearly visible how many observations are in each bar. Thus we either have to provide a corresponding table (as we generated earlier), or we can use the table=True option to add a table of counts to the graph. In addition, we want to make sure that all our axes and the legend are labeled properly.

In [None]:
# Graphical representation including underlying values: the option table=True displays the underlying counts
mygraph = graph.unstack().plot(kind='bar', table=True, figsize=(7,5), fontsize=7)
# Adjust legend and axes
mygraph.legend(["Unknown","1", "2", "3", "4", '5'], loc = 1, ncol= 3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
# Add table with counts
# We don't need an x axis if we display table
mygraph.axes.get_xaxis().set_visible(False)
# Grab table info
table = mygraph.tables[0]
# Format table and figure
table.set_fontsize(9)

> In this example one cell is problematic, so we will instead cap the job counts at a maximum of 4 to ensure that every cell contains at least 10 observations.

In [None]:
# this time let's cap the job counts at 4
df['il_flag'] = df['il_flag'].apply(lambda x: x if x < 4 else 4)
df['mo_flag'] = df['mo_flag'].apply(lambda x: x if x < 4 else 4)

In [None]:
# create our new "graph" dataframe to plot with
graph = df.groupby(['earn_cat', 'il_flag'])['ssn'].count()

In [None]:
# confirm we solved the issue

mygraph = graph.unstack().plot(kind='bar', table=True, figsize=(7,5), fontsize=7)
# Adjust legend and axes
mygraph.legend(["Unknown", "1", "2", "3", "4"], loc=1, ncol=3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
# Add table with counts
# We don't need an x axis if we display table
mygraph.axes.get_xaxis().set_visible(False)
# Grab table info
table = mygraph.tables[0]
# Format table and figure
table.set_fontsize(9)

In [None]:
# We want to export the graph without the table though
# Because we already generated the crosstab earlier which shows the counts
mygraph = graph.unstack().plot(kind='bar', figsize=(7,5), fontsize=7, rot=0)
# Adjust legend and axes
mygraph.legend(["Unknown", "1", "2", "3", "4"], loc=1, ncol=3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
mygraph.set_xlabel("Income category", fontsize=9)
mygraph.annotate('Source: IL & MO DES', xy=(0.7,-0.2), xycoords="axes fraction");

In [None]:
# Now we can export the graph as pdf
# Save plot to file
export2 = mygraph.get_figure()
export2.set_size_inches(15,10, forward=True)
export2.savefig('barchart_jobs_income_category.pdf', bbox_inches='tight', dpi=300)

### Reminder: Export of Graphs
It is important that every point plotted in a graph is based on at least 10 observations. Thus scatterplots, for example, cannot be released. If you are interested in a histogram, you have to change the bin size to make sure that every bin contains at least 10 people. In addition to the graph, you have to provide the ADRF with the underlying table in a .csv or .txt file. This file should have the same name as the graph so ADRF can directly see which files go together. Alternatively, you can include the counts in the graph as shown in the example above.
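For histograms, one possible approach is to reduce the number of bins until every bin meets the threshold. The sketch below assumes equal-width bins and uses random toy data; `safe_histogram` is a hypothetical helper, and it treats empty bins as failing the check, which is a conservative assumption.

```python
import numpy as np

def safe_histogram(values, max_bins=20, threshold=10):
    """Return (counts, edges) for the largest number of equal-width bins
    such that every bin holds at least `threshold` observations.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    for n_bins in range(max_bins, 0, -1):
        counts, edges = np.histogram(values, bins=n_bins)
        if counts.min() >= threshold:
            return counts, edges
    raise ValueError("fewer than `threshold` observations in total")

# Toy data: 500 draws from a standard normal distribution
rng = np.random.default_rng(0)
values = rng.normal(size=500)

counts, edges = safe_histogram(values)
print(len(counts), counts.min())
```

The returned `counts` can be saved as the underlying table for the graph, and `edges` gives the bin boundaries to use when plotting.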