**_Privacy and Confidentiality Exercises_**

This notebook shows you how to prepare your results for export and what to keep in mind whenever you want to export output. You will learn how to prepare files so that they meet our export requirements.

In [None]:
# Load packages
%pylab inline
from __future__ import print_function
import os
import pandas as pd
import numpy as np
import scipy
import sklearn
from sklearn import linear_model
from sklearn.metrics import precision_recall_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
import psycopg2
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
import sqlalchemy
import statsmodels.api as sm
import statsmodels.formula.api as smf

# General Remarks on Disclosure Review
This notebook provides you with information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of output before submitting an export request and gives you an overview of the information needed for disclosure review.

## Files you can export
In general you can export files in any format. However, the results researchers typically export are tables, graphs, regression output, and aggregated data. We therefore ask you to stick to these output types, which means every result you would like to export needs to be saved as a .csv file, a .txt file, or a graph (image) file.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, you can't export results inside a Jupyter notebook, because doing disclosure review on notebook output is too burdensome for us. Jupyter notebooks will only be exported for the purpose of exporting code, and only after all output has been deleted. This does not mean that you won't need your Jupyter notebooks during the export process.
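
As a rough illustration of what "output is deleted" means at the file level: a notebook is a JSON document, and in the version 4 notebook format each code cell stores its results under `outputs` and its run counter under `execution_count`. The sketch below clears both using only the standard library. The helper name `strip_outputs` and the tiny stand-in notebook are made up for this example; in practice you would read and write your own `.ipynb` file, or simply clear all output from the notebook menu.

```python
import json

def strip_outputs(notebook_json):
    """Return notebook JSON text with every code cell's output removed.

    Hypothetical helper for illustration; assumes the version 4
    notebook format, where code cells carry 'outputs' and
    'execution_count' keys.
    """
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

# Tiny stand-in notebook (made-up content) with one executed code cell
example = json.dumps({
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": ["df.head()"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["some output\n"]}]},
    ],
})

cleaned = json.loads(strip_outputs(example))
print(cleaned["cells"][1]["outputs"])
```

The code in the cell (`source`) is kept, so the reviewer can still read what you did.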

## Documentation of code is important
During the export process we ask you to provide the code for every output you want to export. ADRF staff need the code to understand exactly what you did, because understanding how a result was created is essential to assessing it. It is therefore important to document every step of your analysis in your Jupyter notebook.

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found on the class website. This is just a quick overview. We recommend that you go to the class website and read the entire guidelines before you prepare your files for export.
- The disclosure review is based on the underlying observations of your study. Every statistic you want to export should be based on at least 10 individual data points.
- Document your code so the reviewer can follow your data work. Assessing re-identification risk depends heavily on context, so it is important that you provide context information with your analysis for the reviewer.
- Save the requested output with the corresponding code in your input and output folder. Make sure the code is executable. The code should produce exactly the output you requested.
- If you are exporting PowerPoint slides that show project results, you have to provide the code that produces the output shown in the slides.
- Please export results only when they are final and you need them for your presentation or final project report.
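
The "at least 10 data points" rule can be checked mechanically before you submit anything. The following is a minimal sketch, assuming your output is a table of counts stored in a pandas DataFrame. The function name `small_cells` and the example counts are invented for illustration, and the sketch treats every cell below the threshold (including zero) as a problem, which may be stricter than the actual guidelines.

```python
import pandas as pd

MIN_CELL_SIZE = 10  # threshold from the rules above

def small_cells(counts, threshold=MIN_CELL_SIZE):
    """List (row, column, count) for every cell below the threshold."""
    return [
        (row, col, int(counts.loc[row, col]))
        for row in counts.index
        for col in counts.columns
        if counts.loc[row, col] < threshold
    ]

# Hypothetical table of counts
counts = pd.DataFrame(
    {"female": [120, 8], "male": [95, 14]},
    index=["foodstamp", "grant"],
)

flagged = small_cells(counts)
print(flagged)
```

An empty list means every cell meets the threshold. Anything that is listed needs to be regrouped or suppressed before export.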

# Disclosure Review Case Study
To illustrate the export process for research results, let's assume we are working on a joint research project and are interested in finding out more about people who were on different public benefit programs in 2015. Let's say we want to answer the following questions:

1. Do people who receive foodstamps look different than people who receive other subsidies?
2. What does the earnings distribution of these populations look like in the last quarter of 2015?
3. Are different benefit receipts related to better or worse outcomes on the labor market?

We will use IDHS and IDES data to construct the statistics we are interested in and prepare them so that we can submit the output for disclosure review.

In [None]:
# get working directory
mypath = (os.getcwd())
print(mypath)

In [None]:
# connect to database
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) 

## Do people who receive foodstamps look different than people who receive other subsidies?
We are interested in comparing the population of three different benefit programs: foodstamps, grant, tanf. Thus, we can use the IDHS individual case spell data as this gives us the information on the subsidies received and the duration of receipt for each individual. In addition, the database contains demographic information on the recipients. Let's select the SSN (enables linkage to wage data), the benefit type and length in months, age in years, race and gender for all the observations that start and end in 2015.

In [None]:
# Get data from respective IDHS table
query = """
SELECT ssn_hash, rootrace, sex, benefit_type,  
    (2015 - extract(year from birth_date))::int age_years,
ROUND((end_date - start_date)/30.44)::int AS dur_months 
FROM idhs.hh_indcase_spells 
WHERE ((start_date >= '2015-01-01') AND (end_date <= '2015-12-31'));"""

In [None]:
# Save query in dataframe
df_idhs = pd.read_sql( query, con = conn )

In [None]:
# Check dataframe
df_idhs.head()

We have now loaded the data we need to generate some basic statistics about the populations we want to compare.

In [None]:
# Let's look at some first descriptives by group
grouped = df_idhs.groupby('benefit_type')
grouped.describe()

Statistics in this table will be released only if each is based on at least 10 entities (in this example, individuals). The total number of individuals in each group satisfies this requirement (see the count row). However, the table also reports percentiles as well as the minimum and maximum values. The minimum and maximum in particular most likely represent a single individual, so these values will be suppressed during disclosure review. For the dummy variable "sex", keep in mind that there are only two possible values (1 and 2), as the minimum and maximum show. For dummy variables, the mean together with the total number of observations can be used to calculate how many people fall into each category. In our case we know that XX% of the population are coded 2 (female). We observe a total of XXXX people, which means that about XXX people are female. This is fine for disclosure review. But what if only 0.005% of our population were female? The same calculation (164609 / 100 × 0.005 ≈ 8) shows that fewer than 10 people would be female. In this case the table would not be released.
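
The arithmetic in the paragraph above can be written out as a short sketch. The total of 164609 and the share of 0.005% come from the text. The function names, the mean of 1.25, and the group size of 400 are made-up values used only to show how a mean of a 1/2-coded dummy translates into a count.

```python
def implied_count(n_total, share_percent):
    """Number of people implied by a percentage share of the total."""
    return n_total * share_percent / 100.0

def count_from_dummy_mean(mean, n_total, low=1, high=2):
    """For a variable coded low/high, the mean reveals how many are 'high'."""
    share_high = (mean - low) / (high - low)
    return share_high * n_total

# Example from the text: 0.005% of 164609 observations
rare_group = implied_count(164609, 0.005)
print(round(rare_group, 1))  # 8.2, i.e. fewer than 10 people

# Hypothetical example: a mean of 1.25 over 400 people means 100 are coded 2
print(count_from_dummy_mean(1.25, 400))  # 100.0
```

This is why a mean of a dummy variable is treated like a count during disclosure review: anyone who has the mean and the number of observations can recover the size of each category.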

In [None]:
# Now let's export the statistics. Ideally we want to have a csv file
# We can save the statistics in a dataframe
export1 = grouped.describe()
# and then print to csv
export1.to_csv('descriptives_by_group.csv')

### Reminder: Export of Statistics
You can save any dataframe as a csv file and export this csv file. The only thing you have to keep in mind is that, besides the statistic X you are interested in, you have to include a count variable for X so we can see how many observations the statistic is based on. This also applies if you aggregate data. For example, if you aggregate by benefit type, we need to know how many observations are in each benefit program (because after the aggregation each benefit type will be only one data point).
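
One way to keep the count next to the statistic is to compute both in the same aggregation. The sketch below uses a small made-up dataframe. The column names `mean_wage`, `n_obs`, and `meets_threshold` are chosen for this example and are not required names.

```python
import pandas as pd

# Made-up data: 12 foodstamp recipients and 3 grant recipients
df = pd.DataFrame({
    "benefit_type": ["foodstamp"] * 12 + ["grant"] * 3,
    "wage": list(range(1000, 2200, 100)) + [500, 700, 900],
})

# Aggregate the statistic and its underlying count together
summary = df.groupby("benefit_type")["wage"].agg(mean_wage="mean", n_obs="count")

# Flag rows that meet the minimum of 10 observations
summary["meets_threshold"] = summary["n_obs"] >= 10
print(summary)
```

Saving `summary` with `to_csv` then gives the reviewer the statistic and the number of observations behind it in one file.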

In [None]:
# In addition, we are interested in looking at how many people are in each program by race and sex
# We can crosstab this info
pd.crosstab([df_idhs.benefit_type.fillna('missing'), df_idhs.rootrace.fillna('missing')], df_idhs.sex.fillna('missing'), margins=True)

### Problematic Output
We can see that this table contains a lot of small numbers, so it won't be released as is. Disclosure review would mean deleting all cells with counts of less than 10. In addition, secondary suppression has to take place: the disclosure reviewer has to delete as many further cells as needed to make it impossible to recalculate the suppressed values. We can also see that the table has no labels for the information it shows, which means the person doing the disclosure review lacks the context of the analysis.
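
To see why secondary suppression is needed, consider what happens when only the small cell is blanked out but the totals stay in the table. The sketch below uses invented counts and is only meant to show the recalculation, not how reviewers actually process a table.

```python
import pandas as pd

# Hypothetical counts; rows are benefit types, columns are sex
table = pd.DataFrame(
    {"female": [8, 60], "male": [40, 55]},
    index=["grant", "tanf"],
)
table["All"] = table["female"] + table["male"]  # row totals, like margins=True

# Primary suppression: blank out every inner cell below 10
inner = table[["female", "male"]]
suppressed = inner.where(inner >= 10)  # suppressed cells become NaN
suppressed["All"] = table["All"]
print(suppressed)

# The blanked cell can still be recomputed from the row total
recovered = suppressed.loc["grant", "All"] - suppressed.loc["grant", "male"]
print("Recovered value:", recovered)
```

Because the row total minus the remaining cell gives back the suppressed count, at least one more cell (or the total) has to be removed as well.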

### How to do it better
Instead of asking for export of a table like this, you should prepare your tables in advance so that every cell is based on a minimum of 10 observations. In our example we can do this by grouping categories of race, for instance. In addition, we will label the data to make our table easier to understand.
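
Grouping small categories can also be done generically instead of listing codes by hand. The following is a minimal sketch under the assumption that categories are collapsed based on their overall frequency. The function `collapse_rare`, the label "Other", and the category names A to D are placeholders for this example.

```python
import pandas as pd

def collapse_rare(series, threshold=10, other_label="Other"):
    """Replace every category with fewer than `threshold` rows by one label."""
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    return series.where(~series.isin(rare), other_label)

# Made-up categorical variable with two small categories
groups = pd.Series(["A"] * 15 + ["B"] * 12 + ["C"] * 6 + ["D"] * 5)

collapsed = collapse_rare(groups)
print(collapsed.value_counts())
```

Note that the combined category can itself still be too small, and that in a crosstab the threshold applies to each cell rather than to the overall category size, so the resulting table should always be checked again.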

In [None]:
# Group race indicator: put all categories with few observations in one "other"
race = []
for row in df_idhs['rootrace']:
    if row == 1:
        race.append('White')        
    elif row == 2:
        race.append('Black')      
    elif row in (0, 3, 6, 7, 8, 9):
        race.append('Hispanic/Other')
    else:
        race.append('')
df_idhs['race']=race

# Label sex variable
df_idhs['sex'] = df_idhs['sex'].replace([1],'male')
df_idhs['sex'] = df_idhs['sex'].replace([2],'female')

In [None]:
# Now let's tabulate again
pd.crosstab([df_idhs.benefit_type.fillna('missing'), df_idhs.race.fillna('missing')], df_idhs.sex.fillna('missing'), margins=True)

This table now satisfies the requirements for disclosure control. We can save the content in a csv file and then export the table.

In [None]:
# save crosstab in dataframe
export2 = pd.crosstab([df_idhs.benefit_type.fillna('missing'), df_idhs.race.fillna('missing')], df_idhs.sex.fillna('missing'), margins=True)

# save dataframe to csv
export2.to_csv('benefits_by_race_sex.csv')

### Reminder: Export of Tables
For tables of any kind you need to provide the underlying counts of the statistics presented in the table. Make sure you provide all counts. If you calculate ratios, for example employment rates, you need to provide the count of individuals who are employed and the count of those who are not. If you are interested in percentages, we still need the underlying counts for disclosure review. Please label the table so that we can easily understand what you are showing.
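
For a ratio such as an employment rate, the table to export could look like the sketch below, which keeps both underlying counts next to the rate. The data and the column names (`employed`, `n_employed`, `n_not_employed`, `employment_rate`) are invented for this example.

```python
import pandas as pd

# Made-up data: 1 = employed, 0 = not employed
df = pd.DataFrame({
    "benefit_type": ["foodstamp"] * 20 + ["tanf"] * 15,
    "employed": [1] * 12 + [0] * 8 + [1] * 5 + [0] * 10,
})

# Counts of employed and not employed people per benefit type
rates = pd.crosstab(df["benefit_type"], df["employed"]).rename(
    columns={0: "n_not_employed", 1: "n_employed"}
)

# Add the rate, keeping the counts it is based on in the same table
rates["employment_rate"] = rates["n_employed"] / (
    rates["n_employed"] + rates["n_not_employed"]
)
print(rates)
```

In this made-up table the tanf row has only 5 employed people, so its rate would not pass review even though the group as a whole has 15 members.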

In [None]:
# Let's say we are interested in plotting parts of the crosstabulation as a graph, for example benefit type and race
# First we need to calculate the counts
graph = df_idhs.groupby(['benefit_type', 'race'])['ssn_hash'].count()

In [None]:
# Note: we need to add the unstack command here because our dataframe has nested indices. 
# We need to flatten out the data before plotting the graph
print(graph)
print(graph.unstack())

In [None]:
# Now we can generate the graph
mygraph = graph.unstack().plot(kind='bar')

In this graph it is not clearly visible how many observations are in each bar. Thus we either have to provide a corresponding table (as we generated earlier), or we can use the table=True option to add a table of counts to the graph. In addition, we want to make sure that all our axes and the legend are labeled properly.

In [None]:
# Graphical representation including underlying values: the option table=True displays the underlying counts
mygraph = graph.unstack().plot(kind='bar', table=True, figsize=(7,5), fontsize=7)
# Adjust legend and axes
# Keep the legend entries pandas derives from the 'race' column so the labels match the bars
mygraph.legend(title="Race", loc = 1, ncol= 3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
# Add table with counts
# We don't need an x axis if we display table
mygraph.axes.get_xaxis().set_visible(False)
# Grab table info
table = mygraph.tables[0]
# Format table and figure
table.set_fontsize(9)

In [None]:
# We want to export the graph without the table though
# Because we already generated the crosstab earlier which shows the counts
mygraph = graph.unstack().plot(kind='bar', figsize=(7,5), fontsize=7, rot=0)
# Adjust legend and axes
# Keep the legend entries pandas derives from the 'race' column so the labels match the bars
mygraph.legend(title="Race", loc = 1, ncol= 3, fontsize=9)
mygraph.set_ylabel("Number of Observations", fontsize=9)
mygraph.set_xlabel("Benefit Received", fontsize=9)

In [None]:
# Now we can export the graph as png
# Save plot to file
export3 = mygraph.get_figure()
export3.set_size_inches(15,10, forward=True)
export3.savefig('barchart_benefit_type_race.png', bbox_inches='tight', dpi=300)

### Reminder: Export of Graphs
It is important that every point plotted in a graph is based on at least 10 observations. Scatterplots, for example, cannot be released because each point represents a single observation. If you are interested in a histogram, you have to change the bin size to make sure that every bin contains at least 10 people. In addition to the graph you have to provide the ADRF with the underlying table in a .csv or .txt file. This file should have the same name as the graph so ADRF staff can directly see which files go together. Alternatively, you can include the counts in the graph as shown in the example above.

## What does the earnings distribution of these populations look like in the last quarter of 2015?
From the IDES earnings file we get all earnings for the last quarter of 2015. We will merge this information to the IDHS data constructed above in order to look at earning distributions of people in the three benefit types.

In [None]:
# To make our query of the ides database more efficient 
# we first get a unique list of SSNs we have in our idhs dataframe
ssns = df_idhs.ssn_hash.unique()
print(len(ssns))
# format to add to query: list that includes all the SSNs
ssn_qry = ','.join(["'"+s+"'" for s in ssns])
ssn_qry[1:300]

In [None]:
# Select all spells in the 4th quarter of 2015 & the variables needed for people in our SSN list
# We want the total wage received in the quarter, thus we sum over records
query = '''
SELECT ssn, sum(wage) wage 
FROM ides.il_wage 
WHERE year = 2015 AND quarter = 4
AND ssn IN ({})
GROUP BY ssn;'''.format(ssn_qry)

In [None]:
# Save query in dataframe
df_ides = pd.read_sql( query, con = conn )

In [None]:
# Close database connection
conn.close()

In [None]:
# Merge the wage information to our IDHS file
df_idhs_ides = pd.merge(left=df_idhs,right=df_ides, how='inner', left_on=['ssn_hash'], right_on=['ssn'])
print(len(df_idhs_ides))
df_idhs_ides.head()

In [None]:
# Plot distribution, start with foodstamps
# Generate dataframe with foodstamps only
foodstamps = df_idhs_ides[df_idhs_ides['benefit_type'] == "foodstamp"]
# Look at histogram
foodstamps.hist(column='wage', bins=100)

### Problematic Output
We can see that we have a lot of small numbers here. In order to export this graph each bar has to be represented by 10 or more people. However, we can't really see from the graph if this is the case. It also seems that we have a lot of outliers. 

### How to do it better
Instead of asking for export of a graph like this, you should prepare your graph in advance to make sure each data point is based on a minimum of 10 observations. In our example we can do this by checking the counts per bin first and then adjusting the number of bins. In addition, we can remove the outliers.
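
The "check the counts, then adjust the number of bins" step can be automated. The sketch below searches downward from a starting number of equal-width bins until every bin holds at least 10 observations. The function name `bins_meeting_minimum` and the simulated wages are assumptions for this example, and the sketch deliberately treats empty bins as failing the rule.

```python
import numpy as np

def bins_meeting_minimum(values, start_bins=100, min_count=10):
    """Largest number of equal-width bins (<= start_bins) in which
    every bin contains at least `min_count` observations."""
    for bins in range(start_bins, 0, -1):
        counts, edges = np.histogram(values, bins=bins)
        if counts.min() >= min_count:
            return bins, counts, edges
    raise ValueError("Fewer than min_count observations in total")

# Simulated wages (not real data), roughly in the range used in this notebook
rng = np.random.default_rng(0)
wages = rng.uniform(0, 8000, size=500)

n_bins, counts, edges = bins_meeting_minimum(wages)
print(n_bins, counts.min())
```

The returned `counts` and `edges` can be saved as the table that accompanies the histogram, and `n_bins` can be passed to the plotting call so the graph and the table agree.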

In [None]:
# Get rid of outliers: drop spells with wages over 8000 (we don't lose a lot)
foodstamps = foodstamps[foodstamps['wage'] <= 8000]

In [None]:
# Get counts for each bin in the histogram
# count = number of obs in each bin, division = bin edges
binsize = count, division = np.histogram(foodstamps['wage'], bins=100)
print(binsize)

In [None]:
# We can change the binsize until we have 10 or more observations in every bin
binsize = count, division = np.histogram(foodstamps['wage'], bins=10)
print(binsize)

In [None]:
# Now let's generate and export our histogram
# Use the number of bins we just checked (10) so every bar is based on at least 10 observations
hist1 = foodstamps.hist(column='wage', bins=10, xlabelsize=7, ylabelsize=7)
plt.title("Distribution of Earnings - Foodstamp Recipients", fontsize=10)
plt.ylabel("Frequency", fontsize=6)
plt.xlabel("Total Earnings in 2015Q4", fontsize=6)
plt.savefig('earnings_dist_foodstamps.png')

In [None]:
# Now you can do the same for tanf46 and grant recipients

## Are different benefit receipts related to better or worse outcomes on the labor market?
To look at relationships between our demographic characteristics and labor market outcomes (in this case earnings), we run an OLS regression. We can do this using either the statsmodels or the scikit-learn package. In the following we outline how to do this in both packages.

In [None]:
# For our regressions we use the log of earnings so that coefficients
# can be read as approximate percentage changes in earnings
# We observe 0 earnings, but we can't take the log of 0, so we replace 0 with a small positive value
df_idhs_ides.loc[df_idhs_ides['wage'] == 0, 'wage']=0.0001

### Statsmodels package

In [None]:
# Run regression on wages using Statsmodels
model = smf.ols('np.log(wage) ~ C(sex) + C(race) + C(benefit_type) + age_years + I(age_years**2) + dur_months', data= df_idhs_ides)
results = model.fit()
res = results.summary()
print(res)

In [None]:
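# We need to find out the number of observations for each dummy
# Column sums of the design matrix give the count for each dummy
# (wrap zip in list so the pairs are printed, not the zip object)
counts = list(zip(model.exog.sum(0), model.exog_names))
print(counts)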

In [None]:
# Write results in txt file (the with-block closes the file automatically)
with open('OLS_results.txt', "w") as output:
    output.write("{}\n{}".format(res, list(counts)))

### Scikit learn package

In [None]:
# Here we need to create our dummy and quadratic variable first
df_idhs_ides = pd.get_dummies(df_idhs_ides, columns = ['sex','race','benefit_type'])

In [None]:
# Now calculate age square
df_idhs_ides['age_years2'] = (df_idhs_ides['age_years']*df_idhs_ides['age_years'])

In [None]:
# Now generate log of wage
df_idhs_ides['log_wage'] = np.log(df_idhs_ides['wage'])

In [None]:
# Now do the same with scikit learn: regular OLS
ols = linear_model.LinearRegression()
ols.fit(df_idhs_ides[['benefit_type_tanf46','benefit_type_grant','sex_male','race_Hispanic/Other',
                      'race_White','age_years','age_years2','dur_months']], df_idhs_ides['log_wage'])
ols_coef = ols.coef_
ols_int = "Intercept:" + str(ols.intercept_)
ols_obs = "Number of observations:" + str(len(df_idhs_ides))

In [None]:
# Count frequencies of dummies
dummies = df_idhs_ides[['benefit_type_foodstamp','benefit_type_tanf46','benefit_type_grant','sex_male', 
                        'sex_female','race_Hispanic/Other','race_White','race_Black']]
dum_ct = dummies.apply(lambda col: col.value_counts())
dum_ct

In [None]:
# Define list of independent variables (features)
features = ['benefit_type_tanf46','benefit_type_grant','sex_male','race_Hispanic/Other','race_White','age_years','age_years2','dur_months']

In [None]:
# Save coefficients in dataframe
# Note: scikit-learn's LinearRegression does not report standard errors
ols_coef = pd.DataFrame(list(zip(features, ols.coef_)), columns=['Variable', 'Coefficient'])
print(ols_coef)
print(ols_obs)

In [None]:
# Write results in txt file (the with-block closes the file automatically)
with open('OLS_results2.txt', "w") as output:
    output.write("{}\n{}\n{}\n{}".format(ols_obs, ols_coef, ols_int, dum_ct))

### Reminder: Export of Regression Output
You need to provide the ADRF with the number of observations which are included in the regression. Regression output should be written in a .txt or .csv file. If you are including dummies in the regression you need to provide the number of observations for each dummy included in the regression.
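
Independent of the regression package, the counts behind each dummy can be produced directly from the data and written below the number of observations. The sketch below uses made-up data and writes to an in-memory buffer instead of a file. The variable names and the layout of the text are only one possible choice.

```python
import io
import pandas as pd

# Made-up regression sample with two categorical variables
df = pd.DataFrame({
    "sex": ["female"] * 14 + ["male"] * 11,
    "benefit_type": ["foodstamp"] * 13 + ["grant"] * 12,
})

# One 0/1 column per category; the column sum is the number of observations
dummies = pd.get_dummies(df[["sex", "benefit_type"]])
dummy_counts = dummies.sum().astype(int)

# Assemble the text that would accompany the regression output
buffer = io.StringIO()
buffer.write("Number of observations: {}\n".format(len(df)))
buffer.write("Observations per dummy:\n")
buffer.write(dummy_counts.to_string())
report = buffer.getvalue()
print(report)
```

Replacing the buffer with `open('some_file.txt', 'w')` would write the same text to a file that can be submitted together with the coefficient table.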