# RDS Lab 10: Case Study: Predicting employment outcomes based on mental health indicators

For the next few weeks, we're going to be doing an in-depth case study of a single dataset and modelling efforts around that dataset. The process we'll go through should help you as you work on your projects.

The dataset we're going to use is a survey designed to help draw a connection between mental health and employment outcomes. It was collected by Michael Cooley in partnership with SurveyMonkey and published on Kaggle.


## Part 1: Data Collection

The survey asks for responses to the following questions/statements:
- I identify as having a mental illness 
- I have my own computer separate from a smart phone    
- I have been hospitalized before for my mental illness    
    - If yes: How many days were you hospitalized for your mental illness    
- I am currently employed at least part-time    
- I am legally disabled    
- I have my regular access to the internet    
- I live with my parents    
- I have a gap in my resume    
    - If yes: Total length of any gaps in my resume in months.    
- Annual income (including any social welfare programs) in USD  
- I am unemployed    
- I read outside of work and school    
- Annual income from social welfare programs  
- I receive food stamps    
- I am on section 8 housing    
- I have one of the following issues in addition to my illness:
    - Lack of concentration
    - Anxiety
    - Depression
    - Obsessive thinking
    - Mood swings
    - Panic attacks
    - Compulsive behavior
    - Tiredness
- Education level 
- Age    
- Gender    
- Household Income    
- Region    
- Device Type   


### Ethical Issues

This survey obviously includes numerous highly sensitive questions. We don't have any information about informed consent procedures, anonymization strategies, etc. that accompanied this data collection process, so we have to assume there were none. You can, however, read more about SurveyMonkey policies [here](https://help.surveymonkey.com/categories/Privacy_Legal).

Recall the four principles of the Menlo Report:
- Respect for persons: "individuals should be treated as autonomous agents... \[and\] persons with diminished autonomy are entitled to protection"
- Beneficence: "(1) do not harm and (2) maximize possible benefits and minimize possible harms"
- Justice: the benefits and burdens of the research project should be shared fairly 
- Respect for law and public interest

**Poll: Which of the four principles of the Menlo Report do you think is most relevant to *collecting* this data?**

There's no right answer here! Just be ready to explain your thought process.

When collecting data, we should also think about all the possible ways in which it could be used. Let's take a minute to brainstorm some such ways. 

**Poll: Which of the four principles of the Menlo Report do you think is most relevant to *using* this data?**


### Data Quality Issues

Before we even look at the data, we should think about potential data quality issues that are intrinsic to the data collection process. 

What should we be wary of when we look at this data?

## Part 2: Data Cleaning and Profiling

In [2]:
import pandas as pd
import numpy as np
np.random.seed(10)

In [3]:
student_path = "../shared/Lab 10/mental_health_employment_survey.csv"
instructor_path = "../../shared/Lab 10/mental_health_employment_survey.csv"
survey = pd.read_csv(instructor_path)
print(survey.columns)
survey.head()

Index(['Respondent ID', 'Collector ID', 'Start Date', 'End Date', 'IP Address',
       'Email Address', 'First Name', 'Last Name', 'Custom Data 1',
       'I identify as having a mental illness', 'Education',
       'I have my own computer separate from a smart phone',
       'I have been hospitalized before for my mental illness',
       'How many days were you hospitalized for your mental illness',
       'I am currently employed at least part-time', 'I am legally disabled',
       'I have my regular access to the internet', 'I live with my parents',
       'I have a gap in my resume',
       'Total length of any gaps in my resume in months.',
       'Annual income (including any social welfare programs) in USD',
       'I am unemployed', 'I read outside of work and school',
       'Annual income from social welfare programs', 'I receive food stamps',
       'I am on section 8 housing',
       'How many times were you hospitalized for your mental illness',
       'Lack of concentrati

Unnamed: 0,Respondent ID,Collector ID,Start Date,End Date,IP Address,Email Address,First Name,Last Name,Custom Data 1,I identify as having a mental illness,...,Obsessive thinking,Mood swings,Panic attacks,Compulsive behavior,Tiredness,Age,Gender,Household Income,Region,Device Type
0,,,,,,,,,,Response,...,Obsessive thinking,Mood swings,Panic attacks,Compulsive behavior,Tiredness,Response,Response,Response,Response,Response
1,6630447000.0,168522804.0,01/15/2018 03:45:16 AM,01/15/2018 03:48:24 AM,,,,,06f645d7ea5af372d50a62bd17,No,...,Obsessive thinking,,Panic attacks,,,30-44,Male,"$25,000-$49,999",Mountain,Android Phone / Tablet
2,6630410000.0,168522804.0,01/15/2018 03:17:52 AM,01/15/2018 03:18:57 AM,,,,,abca2776418ff1fe24bb85e21f,Yes,...,,,Panic attacks,,Tiredness,18-29,Male,"$50,000-$74,999",East South Central,MacOS Desktop / Laptop
3,6630402000.0,168522804.0,01/15/2018 03:10:28 AM,01/15/2018 03:12:49 AM,,,,,3800088cf4e55278b38bbe67f3,No,...,,,,,,30-44,Male,"$150,000-$174,999",Pacific,MacOS Desktop / Laptop
4,6630335000.0,168522804.0,01/15/2018 02:11:16 AM,01/15/2018 02:12:33 AM,,,,,84585803a3cec189f89fe43d44,No,...,,,,,,30-44,Male,"$25,000-$49,999",New England,Windows Desktop / Laptop


First, it looks like the first row repeats the column names for some features and doesn't contain real data. We can drop it. 

In [4]:
survey = survey.loc[1:, ]

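Second, it looks like the values of some identifying variables (IP address, email address, first and last name) have been removed. Recall that removing direct identifiers is not enough to preserve privacy, because combinations of the remaining attributes can still single out individuals.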
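One rough way to see why dropping names and emails is not sufficient is to count how many respondents share each combination of demographic attributes. The sketch below uses an invented toy table (not the real survey rows), and the choice of `age`, `gender`, and `region` as quasi-identifiers is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for the survey; all values are invented for illustration.
toy = pd.DataFrame({
    "age": ["30-44", "18-29", "30-44", "30-44", "> 60"],
    "gender": ["Male", "Male", "Male", "Male", "Female"],
    "region": ["Mountain", "Pacific", "Mountain", "Pacific", "Pacific"],
})

# Assumed quasi-identifiers: attributes that are not names, but that an
# outsider could plausibly know about a person.
quasi_identifiers = ["age", "gender", "region"]

# Number of respondents sharing each observed combination of values
group_sizes = toy.groupby(quasi_identifiers).size()

# A group of size 1 means that combination points to exactly one respondent
n_unique = int((group_sizes == 1).sum())
print(group_sizes)
print("Combinations held by a single respondent:", n_unique)
```

The smallest group size is what k-anonymity measures; small groups mean that sensitive answers could be tied back to a person by someone who knows a few basic facts about them.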

Let's start by making these column names easier to work with.

In [5]:
features = ['respondent_id', 'collector_id', 'survey_start', 'survey_end', 'ip', 'email', 'first_name', 'last_name', 
 'custom_data', 'has_mental_illness', 'education', 'has_computer', 'hospitalized', 'days_hospitalized', 
 'employed', 'legally_disabled', 'internet_access', 'lives_with_parents', 'resume_gap', 'resume_gap_months',
 'total_income', 'unemployed', 'reads', 'welfare_income', 'gets_food_stamps', 'gets_section8', 'times_hospitalized',
 'lack_of_concentration', 'anxiety', 'depression', 'obsessive_thinking', 'mood_swings',  'panic_attacks', 
 'compulsive_behavior', 'tiredness', 'age', 'gender', 'household_income', 'region', 'device_type']
# Create a dictionary mapping each new feature name to the original column name,
# so that we can easily look up the full original question if we forget what the feature name represents
assert len(features) == len(survey.columns)
feat_dict = dict(zip(features, survey.columns))
survey.columns = features

Third, let's see if we can drop any columns that aren't useful.

In [6]:
# Let's try to figure out what custom_data is
# (print explicitly: only the last expression in a cell is displayed automatically)
print(survey.custom_data.value_counts())
# Looks like we can drop this
survey.drop(["custom_data"], axis=1, inplace=True)

In [7]:
# Let's confirm that all the "identifying" variables are actually all missing, and if so, we'll drop them.
for var in [ 'ip', 'email', 'first_name', 'last_name']:
    # Confirm all rows are missing
    assert sum(survey[var].notna())==0
    # Drop the variable
    survey.drop([var], axis=1, inplace=True)

Fourth, let's confirm that respondent_id is a unique identifier.

In [8]:
# Your code here
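As a hint, here is a minimal sketch of a uniqueness check on an invented toy column (the IDs below are made up). It relies on `Series.is_unique` and `Series.duplicated`; the same calls should apply to `survey.respondent_id`.

```python
import pandas as pd

# Toy IDs, invented for illustration; 102.0 appears twice on purpose.
toy = pd.DataFrame({"respondent_id": [101.0, 102.0, 103.0, 102.0]})
ids = toy["respondent_id"]

# True only if no value occurs more than once
print("Unique:", ids.is_unique)

# keep=False marks every copy of a repeated value, not just the later ones
duplicated_ids = ids[ids.duplicated(keep=False)]
print("Repeated IDs:", duplicated_ids.tolist())

# A usable identifier should also never be missing
print("Missing IDs:", int(ids.isna().sum()))
```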


Fifth, let's make sure all numeric variables are stored as numbers. (We won't worry about dummies yet.)

In [9]:
survey.dtypes

respondent_id            float64
collector_id             float64
survey_start              object
survey_end                object
has_mental_illness        object
education                 object
has_computer              object
hospitalized              object
days_hospitalized         object
employed                  object
legally_disabled          object
internet_access           object
lives_with_parents        object
resume_gap                object
resume_gap_months         object
total_income              object
unemployed                object
reads                     object
welfare_income            object
gets_food_stamps          object
gets_section8             object
times_hospitalized        object
lack_of_concentration     object
anxiety                   object
depression                object
obsessive_thinking        object
mood_swings               object
panic_attacks             object
compulsive_behavior       object
tiredness                 object
age       

In [10]:
# Try to cast each variable that we suspect might be numeric, 
# flagging the variable if that doesn't work
for var in ["days_hospitalized", "resume_gap_months", "total_income", "welfare_income", "times_hospitalized", "age", "household_income"]:
    try:
        survey[var]=pd.to_numeric(survey[var])
        print("\n{} is all numeric.".format(var))
    except (ValueError, TypeError):
        print("\n{} has non numeric values:".format(var))
        print(survey[var].unique())


days_hospitalized is all numeric.

resume_gap_months is all numeric.

total_income is all numeric.

welfare_income is all numeric.

times_hospitalized is all numeric.

age has non numeric values:
['30-44' '18-29' '45-60' '> 60']

household_income has non numeric values:
['$25,000-$49,999' '$50,000-$74,999' '$150,000-$174,999' '$0-$9,999'
 '$100,000-$124,999' '$125,000-$149,999' 'Prefer not to answer'
 '$10,000-$24,999' '$75,000-$99,999' '$200,000+' '$175,000-$199,999']
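Since age and household_income are brackets rather than numbers, one option is to store them as ordered categoricals so that sorting and comparisons respect the bracket order. This is only a sketch on toy values: the age brackets are the ones printed above, while the shortened income list and the decision to treat "Prefer not to answer" as missing are assumptions.

```python
import pandas as pd

# Toy values using the age brackets printed above (not the real rows)
age = pd.Series(["30-44", "18-29", "45-60", "> 60", "18-29"])
age_order = ["18-29", "30-44", "45-60", "> 60"]
age_cat = pd.Series(pd.Categorical(age, categories=age_order, ordered=True))

# Ordered categoricals support min/max and comparisons against a category
print(age_cat.min(), "to", age_cat.max())
print(int((age_cat >= "45-60").sum()), "respondent(s) aged 45 or older")

# Values left out of `categories` become NaN. Here that is used (as an
# assumption) to treat "Prefer not to answer" as missing.
income = pd.Series(["$0-$9,999", "Prefer not to answer", "$10,000-$24,999"])
income_order = ["$0-$9,999", "$10,000-$24,999"]  # shortened list, illustration only
income_cat = pd.Series(pd.Categorical(income, categories=income_order, ordered=True))
print(int(income_cat.isna().sum()), "missing income value(s)")
```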


Sixth, let's examine variables that seem to overlap for any inconsistencies. 

In [11]:
# Let's look back at how some of these questions were described. Do you spot any potential issues?
print(feat_dict['hospitalized'])
print(feat_dict['days_hospitalized'])
print(feat_dict['times_hospitalized'])

I have been hospitalized before for my mental illness
How many days were you hospitalized for your mental illness
How many times were you hospitalized for your mental illness


Let's confirm that responses to these questions are consistent with each other. 

In [12]:
survey.loc[survey.hospitalized=="No", "days_hospitalized"].unique()

array([ 0., nan, 20., 78., 44.,  1., 99.,  6.,  3.,  2.])

In [13]:
survey.loc[survey.hospitalized=="No", "times_hospitalized"].unique()

array([ 0, 19,  1,  3, 69])

These responses are inconsistent, so we have to decide what to trust. Let's recode the data for consistency.

In [14]:
# Your code here
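One possible recoding, sketched on invented toy rows: assume the yes/no answer is the one to trust, and set both counts to zero whenever the respondent said "No". Trusting the counts instead would be an equally defensible choice, so treat this as one option rather than the answer.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; row 1 contradicts itself.
toy = pd.DataFrame({
    "hospitalized": ["No", "No", "Yes", "No", "Yes"],
    "days_hospitalized": [0.0, 20.0, 35.0, np.nan, 4.0],
    "times_hospitalized": [0, 19, 2, 0, 1],
})

said_no = toy["hospitalized"] == "No"

# Rows that answered "No" but still report a positive count
# (comparisons with NaN evaluate to False, so missing values are not flagged)
inconsistent = said_no & (
    (toy["days_hospitalized"] > 0) | (toy["times_hospitalized"] > 0)
)
print("Inconsistent rows:", int(inconsistent.sum()))

# Assumption: trust the yes/no answer and zero out the counts for "No" rows
recoded = toy.copy()
recoded.loc[said_no, ["days_hospitalized", "times_hospitalized"]] = 0
print(recoded)
```

Note that this also fills missing day counts with zero for respondents who said "No", which follows from the same assumption.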

How about these variables? Do they mean the same thing?

In [15]:
print(feat_dict['employed'])
print(feat_dict['unemployed'])

I am currently employed at least part-time
I am unemployed


In [16]:
pd.crosstab(survey.employed, survey.unemployed)

employed \ unemployed,No,Yes
No,29,78
Yes,219,8


Again, we should probably recode the data for consistency. What should we trust?

In [19]:
# Your code here
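Before choosing which column to trust, it can help to label each combination explicitly. In the crosstab, 8 respondents answered "Yes" to both questions (a contradiction), while 29 answered "No" to both, which may be legitimate (for example, students or retirees). The sketch below uses invented toy rows, and the four status labels are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows covering all four answer combinations (invented for illustration)
toy = pd.DataFrame({
    "employed":   ["Yes", "No", "No", "Yes"],
    "unemployed": ["No", "Yes", "No", "Yes"],
})

emp = toy["employed"] == "Yes"
unemp = toy["unemployed"] == "Yes"

# np.select picks the first matching condition for each row;
# rows that match none of them get the default label.
toy["status"] = np.select(
    [emp & ~unemp, ~emp & unemp, ~emp & ~unemp],
    ["employed", "unemployed", "neither"],
    default="contradictory",
)
print(toy)
```

With the labels in place, you can decide separately how to handle the "neither" and "contradictory" groups instead of overwriting one column with the other.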

Finally, let's recode yes-no variables as booleans.

In [20]:
for var in ['has_mental_illness', 'has_computer', 'hospitalized','employed', 'legally_disabled', 
            'internet_access', 'lives_with_parents', 'resume_gap', 'unemployed', 'reads',
            'gets_food_stamps', 'gets_section8']:
    survey[var] = survey[var].map({'Yes': True, 'No': False})
    
for var in ['lack_of_concentration', 'anxiety', 'depression', 'obsessive_thinking', 
            'mood_swings',  'panic_attacks', 'compulsive_behavior', 'tiredness']:
    survey[var] = survey[var].notna()
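One caveat with `map` and a dictionary: any value that is not a key in the dictionary becomes missing, without a warning. The toy series below (invented values, including a deliberately lowercase "yes") shows why it is worth counting missing values after the recode.

```python
import pandas as pd

# Toy answers, invented for illustration; "yes" and None are not dictionary keys
answers = pd.Series(["Yes", "No", "yes", None])
mapped = answers.map({"Yes": True, "No": False})

# Anything other than exactly "Yes" or "No" ends up missing
n_unmapped = int(mapped.isna().sum())
print(mapped)
print("Values that did not map:", n_unmapped)
```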

In [21]:
survey.head()

Unnamed: 0,respondent_id,collector_id,survey_start,survey_end,has_mental_illness,education,has_computer,hospitalized,days_hospitalized,employed,...,obsessive_thinking,mood_swings,panic_attacks,compulsive_behavior,tiredness,age,gender,household_income,region,device_type
1,6630447000.0,168522804.0,01/15/2018 03:45:16 AM,01/15/2018 03:48:24 AM,False,High School or GED,False,False,0.0,False,...,True,False,True,False,False,30-44,Male,"$25,000-$49,999",Mountain,Android Phone / Tablet
2,6630410000.0,168522804.0,01/15/2018 03:17:52 AM,01/15/2018 03:18:57 AM,True,Some Phd,True,False,0.0,True,...,False,False,True,False,True,18-29,Male,"$50,000-$74,999",East South Central,MacOS Desktop / Laptop
3,6630402000.0,168522804.0,01/15/2018 03:10:28 AM,01/15/2018 03:12:49 AM,False,Completed Undergraduate,True,False,0.0,True,...,False,False,False,False,False,30-44,Male,"$150,000-$174,999",Pacific,MacOS Desktop / Laptop
4,6630335000.0,168522804.0,01/15/2018 02:11:16 AM,01/15/2018 02:12:33 AM,False,Some Undergraduate,True,False,,False,...,False,False,False,False,False,30-44,Male,"$25,000-$49,999",New England,Windows Desktop / Laptop
5,6630290000.0,168522804.0,01/15/2018 01:24:12 AM,01/15/2018 01:26:34 AM,True,Completed Undergraduate,True,True,35.0,True,...,True,True,True,True,True,30-44,Male,"$25,000-$49,999",East North Central,iOS Phone / Tablet


Now that the data's a little cleaner, we should do some profiling. Take a look at the distribution of values of some variables that you think are interesting. 


In [75]:
# Your code here
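As a starting point, `value_counts` works well for boolean and categorical variables and `describe` for numeric ones. The sketch below runs on an invented toy table whose column names mirror the cleaned survey; the values themselves are made up.

```python
import pandas as pd

# Toy table, invented for illustration
toy = pd.DataFrame({
    "has_mental_illness": [True, False, False, True, False],
    "education": ["High School or GED", "Some Undergraduate",
                  "Completed Undergraduate", None, "Some Undergraduate"],
    "total_income": [12000, 45000, 80000, 0, 30000],
})

# Share of each answer for a boolean variable
print(toy["has_mental_illness"].value_counts(normalize=True))

# Counts per category, keeping missing values visible
print(toy["education"].value_counts(dropna=False))

# Count, mean, spread, and quartiles for a numeric variable
income_summary = toy["total_income"].describe()
print(income_summary)
```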

How about the overlap of certain variables? Are there any relationships that are worth keeping in mind when we proceed to analyze the data?

In [76]:
# Your code here
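A row-normalized crosstab is one simple way to look at overlap, since it shows the distribution of one variable within each level of another. The toy data below is invented; with the real data you would pass columns of `survey` instead.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "has_mental_illness": [True, True, True, False, False, False, False, False],
    "employed":           [True, False, False, True, True, True, False, True],
})

# normalize="index" makes each row sum to 1, so each cell is the share of
# that row's group falling into the column category
rates = pd.crosstab(toy["has_mental_illness"], toy["employed"], normalize="index")
print(rates)
```

Strong associations found this way (for instance between an outcome and a demographic attribute) are worth noting before modeling, since a model can pick them up as proxies.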

## Part 3: Auditing Fairness in the Data

It's always useful to look at disparate outcomes in the data itself *before* worrying about disparate predictions that arise as a result of modeling.

Pick a protected characteristic that's interesting to you, as well as an outcome variable. Then, implement the following fairness measures. Don't worry, you don't need to use AIF360.

**Disparate impact:**
$$\frac{Pr(Y = 1 | D = \text{unprivileged})}
{Pr(Y = 1 | D = \text{privileged})}$$

In [31]:
# Your code here


0.8968926553672316
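The ratio above can be computed directly with pandas. The sketch below is one possible implementation on an invented toy table; the helper name `disparate_impact`, the choice of gender as the protected characteristic, and "Male" as the privileged group are all assumptions for illustration, and it is not meant to reproduce the number shown above.

```python
import pandas as pd

def disparate_impact(df, outcome, group, privileged):
    """Ratio of positive-outcome rates: unprivileged over privileged.

    Hypothetical helper; treats every value other than `privileged`
    as the unprivileged group.
    """
    is_privileged = df[group] == privileged
    rate_unprivileged = df.loc[~is_privileged, outcome].mean()
    rate_privileged = df.loc[is_privileged, outcome].mean()
    return rate_unprivileged / rate_privileged

# Toy data, invented for illustration: 3 of 4 men and 2 of 4 women employed
toy = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 4,
    "employed": [True, True, True, False, True, True, False, False],
})

di = disparate_impact(toy, "employed", "gender", privileged="Male")
print(di)
```

A value of 1 means both groups have the same rate of positive outcomes; values below 1 mean the unprivileged group has a lower rate.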

**Statistical parity difference:**

$$Pr(Y = 1 | D = \text{unprivileged})- Pr(Y = 1 | D = \text{privileged})$$

In [33]:
# Your code here


-0.07185039370078738
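The difference version follows the same pattern as the ratio. Again this is a sketch on invented toy data, with a hypothetical helper name and an assumed privileged group; it is not meant to reproduce the number shown above.

```python
import pandas as pd

def statistical_parity_difference(df, outcome, group, privileged):
    """Positive-outcome rate of the unprivileged group minus the privileged group.

    Hypothetical helper; treats every value other than `privileged`
    as the unprivileged group.
    """
    is_privileged = df[group] == privileged
    rate_unprivileged = df.loc[~is_privileged, outcome].mean()
    rate_privileged = df.loc[is_privileged, outcome].mean()
    return rate_unprivileged - rate_privileged

# Toy data, invented for illustration: 3 of 4 men and 2 of 4 women employed
toy = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 4,
    "employed": [True, True, True, False, True, True, False, False],
})

spd = statistical_parity_difference(toy, "employed", "gender", privileged="Male")
print(spd)
```

A value of 0 means equal rates; a negative value means the unprivileged group has the lower rate of positive outcomes.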

**Your choice!** 

Come up with another interesting number to explore. 

In [None]:
# Your code here