# Mental Health in Technology Fields (OSMI Survey, 2016)
This data set was posted on [Kaggle.com](https://www.kaggle.com/) by Open Sourcing Mental Illness, LTD under a creative commons license which can be found [here](https://creativecommons.org/licenses/by-sa/4.0/). The primary goal of the data, as stated on the [page](https://www.kaggle.com/osmi/mental-health-in-tech-2016#mental-health-in-tech-2016-neo4j-20161114.zip) which this was upload too, is to use the data to raise awareness and to improve condition for those with mental health disorders in the IT workplace.

*Disclosure* I, Corey Bryant, am not affiliated with this organization. All work here within is exploratory. All analyses will be conducted using Python; I will be using a Python package that I developed, *[researchpy](https://researchpy.readthedocs.io/en/latest/)*, for parts of this analysis.

**Loading the Data**

In [1]:
import pandas as pd
import researchpy as rp
import numpy as np

# These are for running the model and conducting model diagnostics
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from statsmodels.compat import lzip

In [2]:
df = pd.read_csv("C:\\Users\\CoreySSD\\Google Drive\\Python for Data Science, LLC\\Data sets\\OSMI Mental Health in Tech Survey\\mental-heath-in-tech-2016.csv")

print(df.shape, "\n"*2, df.columns)

(1433, 63) 

 Index(['Are you self-employed?',
       'How many employees does your company or organization have?',
       'Is your employer primarily a tech company/organization?',
       'Is your primary role within your company related to tech/IT?',
       'Does your employer provide mental health benefits as part of healthcare coverage?',
       'Do you know the options for mental health care available under your employer-provided coverage?',
       'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
       'Does your employer offer resources to learn more about mental health concerns and options for seeking help?',
       'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
       'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:',
       'Do you 

There are 63 columns with a total of 1,433 observations in the data. The column titles are the wording used in the survey question - pretty long. I will shorten the variables of interests for cleanliness. Also, most of the responses are categorical, so some recoding will be required as I go along.

## Questions to be Answered
Reminder, the only insight I have to the questions are what is displayed above. I do not know if any scales are used within the survey, and I do not know if there are skip patterns present. Skip patterns could effect some statistical tests as I may be including individuals in the N that should not be there.

Since this is an exploratory analysis, I will provide some questions I know that I want to explore below, but I will also allow for data driven questions to come about and be answered as I go along. Each question will be a subsection in this document.

* General descriptives of the survey sample
    * Demographic information such as age, gender, country, number cases of mental health disorder (MHD), family history of MHD, and number of individuals working in IT field

* Is there a relationship between having a MHD and family history of MHD?
    * There should be, most mental health disorders have a genetic component.
    
* Are those with a MHD more likely to talk to their supervisor than those without a MHD?
    * My hypothesis is that those with a MHD will have more reservation than those without a MHD

### Data Wrangling

#### Renaming columns

In [3]:
df.rename(columns= {"What is your age?": "Age",
                   "What is your gender?": "Gender",
                   "Do you currently have a mental health disorder?": "Mental health disorder (MHD)",
                   "Do you have a family history of mental illness?": "Family history of MHD",
                   "What country do you work in?": "Country (Work)",
                   "Have you been diagnosed with a mental health condition by a medical professional?": "Diagnosed by medical professional",
                   "Is your employer primarily a tech company/organization?": "Employer primarily Tech",
                   "Would you feel comfortable discussing a mental health disorder with your coworkers?": "Comfortable talking about MHD w/co-workers",
                   "Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?": "Comfortable talking about MHD w/supervisor(s)",
                   "Do you feel that your employer takes mental health as seriously as physical health?": "Employer takes MHD as serious as PH"
                   }, inplace= True
         )

#### Cleaning the Data
This data is messy - which is good because it means it was uploaded in a pretty raw form. It's just going to take some time. I thought about having a seperate file to do the data cleaning, but decided to demonstrate everything in this document.

For gender, I have decided to collapse the response options into the categories of 'Male', 'Female', 'Other'. For individuals that responded that they transitioned from a gender to another, I classified them as the gender post-transition; all other gender identifications have been placed into the 'Other' category.

In [4]:
def clean_gender(series):
    if type(series) is str:
        if series.upper().strip() in ['MALE', 'M', 'MAN', 'MALE (CIS)', 'MAIL', 'SEX IS MALE', 'MALR', 'MALE.', 'CIS MALE',
                                     'MALE (TRANS, FTM)', 'M|', 'CISDUDE', 'CIS MAN', 'DUDE',
                                     "I'm a man why didn't you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take?"]:
            return "Male" 
        
        elif series.upper().strip() in ['FEMALE', 'F', 'WOMAN', 'Transitioned, M2F', 'TRANSGENDER WOMAN',
                                       'FEMALE ASSIGNED AT BIRTH', 'CIS-WOMAN', 'GENDERQUEER WOMAN', 'FEMALE/WOMAN',
                                       'FM', 'CIS FEMALE', 'FEMALE (PROPS FOR MAKING THIS A FREEFORM FIELD, THOUGH)',
                                       'I IDENTIFY AS FEMALE.', 'CISGENDER FEMALE', 'FEM', 'MTF']:
            return "Female"
        
        else:
            return "Other"
    
    else:
        return series
    
df['Gender'] = df['Gender'].apply(clean_gender)

In [5]:
def clean_country(series):
    if type(series) is str:
        if series.upper().strip() == 'UNITED STATES OF AMERICA':
            return 'USA'
        else:
            return 'Rest of the world'

df['Country (Inhabit)'] = df['What country do you live in?'].apply(clean_country)

## Descriptive Statistics of the Survey Population

In [6]:
df['Age'].describe()

print(df['Age'].describe(), "\n"*2,
      "Let's check for ages that could be considered as odd", "\n"*2, df['Age'][df['Age'] >= 80])

count    1433.000000
mean       34.286113
std        11.290931
min         3.000000
25%        28.000000
50%        33.000000
75%        39.000000
max       323.000000
Name: Age, dtype: float64 

 Let's check for ages that could be considered as odd 

 372     99
564    323
Name: Age, dtype: int64


Well, there clearly are some incorrectly entered responses in this field - unless someone is a baby genius and the other has found the tree of life. I'm going to limit the analyses to individuals who are 19-80.

In [7]:
df = df[(df['Age'] >= 19) & (df['Age'] <= 80)]

df['Age'].describe()

count    1428.000000
mean       34.086134
std         8.086273
min        19.000000
25%        28.000000
50%        33.000000
75%        39.000000
max        74.000000
Name: Age, dtype: float64

In [8]:
rp.summary_cat(df[['Gender', 'Mental health disorder (MHD)', 'Family history of MHD', 
                   'Country (Inhabit)', 'Employer primarily Tech', 
                   'Comfortable talking about MHD w/co-workers',
                   'Comfortable talking about MHD w/supervisor(s)',
                  'Employer takes MHD as serious as PH']])

Unnamed: 0,Variable,Outcome,Count,Percent
0,Gender,Male,1053,73.89
1,,Female,341,23.93
2,,Other,31,2.18
3,Mental health disorder (MHD),Yes,574,40.2
4,,No,528,36.97
5,,Maybe,326,22.83
6,Family history of MHD,Yes,668,46.78
7,,No,486,34.03
8,,I don't know,274,19.19
9,Country (Inhabit),USA,838,58.68


### Summary of Descriptive Statistics

The majority of the sample is male, 73.89%, and lives in the United States of America, 58.68%. There are slightly higher more individuals who reported having a MHD compared to those that did not report having an MHD, (40.20% vs. 36.97% respectively) and slightly over a fifth of respondents stated they might have a MHD, 22.83%. Just under half, 46.78%, of the respondents reported having a family history of MHD with about a third, 34.03%, reporting no family history of MHD. I'm going to assume that a value of "1" indicates "Yes" and a value of "0" indicates "No". With that, about three-quarters, 76.97%, of the respondents work at a company that is primarily in the technology field.

**MHD within the Workplace** <br>
Majority of the respondents feel that they don't know if their employer takes mental health as serious as physical health, 42.99%, with a little under a third feels that their employer does take mental health as serious as physical health. 37.22% feel comfortable with talking about MHD with their direct supervisor(s) and 23.99% feel comfortable with talking about it with co-workers.

## Relationship between MHD and Family History of MHD

Before the test for a relationship between self-reported cases of having a MHD and family history of MHD, I want to check the proportion of those that self-reported having a MHD and have been diagnosed by a medical professional.

In [9]:
rp.crosstab(df['Diagnosed by medical professional'], 
            df['Mental health disorder (MHD)'], prop= 'row')

Unnamed: 0_level_0,Mental health disorder (MHD),Mental health disorder (MHD),Mental health disorder (MHD),Mental health disorder (MHD)
Unnamed: 0_level_1,Maybe,No,Yes,All
Diagnosed by medical professional,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
No,28.43,62.61,8.96,100.0
Yes,17.23,11.34,71.43,100.0
All,22.83,36.97,40.2,100.0


Interestingly, 17.23% of those that have been diagnosed by a medical professional self-reported they might have a MHD and 8.96% of those that stated they have a MHD has not been diagnosed by a medical professional.

I'm going use the self-reported measure of having a MHD and limit the relationship test between those who did not respond with "Maybe" and "I don't know".

In [10]:
crosstab, results = rp.crosstab(df['Family history of MHD'][df['Family history of MHD'] != "I don't know"], 
            df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'], 
            test= 'chi-square', prop= 'cell')

print(crosstab, "\n"*2, results)

                      Mental health disorder (MHD)               
                                                No    Yes     All
Family history of MHD                                            
No                                           32.58  10.11   42.69
Yes                                          15.59  41.72   57.31
All                                          48.17  51.83  100.00 

                 Chi-square test   results
0  Pearson Chi-square ( 1.0) =   219.8647
1                    p-value =     0.0000
2               Cramer's phi =     0.4862


As expected, there is a strong relationship present between currently having a mental health disorder and having a family history of a mental health disorder, $\chi^2$(1) = 219.8647, p< 0.0001, $\phi$= 0.4862; 41.72% of the respondents who have a MHD have a family history of MHD.

## Work taking Mental Health as Serious as Physical Health

Again, I will be limiting these analysis to those that did not respond with "Maybe" to having a MHD.

In [11]:
df['Employer takes MHD as serious as PH'].replace("I don't know", "DK", inplace= True)

crosstab, results = rp.crosstab(df['Employer takes MHD as serious as PH'], 
                                df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'],
                                test= 'chi-square', prop= 'col')

print(crosstab.round(1), "\n"*2, results)

                                    Mental health disorder (MHD)              
                                                              No    Yes    All
Employer takes MHD as serious as PH                                           
DK                                                          45.6   38.2   41.8
No                                                          18.7   32.9   25.9
Yes                                                         35.8   28.9   32.3
All                                                        100.0  100.0  100.0 

                 Chi-square test  results
0  Pearson Chi-square ( 2.0) =   23.4542
1                    p-value =    0.0000
2                 Cramer's V =    0.1624


There is a relationship between having a MHD and feeling that their employer takes mental health as serious as physical health, $\chi^2$(2)= 23.4542, p< 0.001, V= 0.1624. Since this is a 3x2 table, a post-hoc analysis will need to be conducted to further explore this. I'm interested in those that either state 'Yes' or 'No' to feeling like their employer takes mental health as serious as physical health.

In [12]:
crosstab, results = rp.crosstab(df['Employer takes MHD as serious as PH'][df['Employer takes MHD as serious as PH'] != "DK"], 
                                df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'],
                                test= 'chi-square', prop= 'col')

print(crosstab.round(1), "\n"*2, results)

                                    Mental health disorder (MHD)              
                                                              No    Yes    All
Employer takes MHD as serious as PH                                           
No                                                          34.3   53.2   44.5
Yes                                                         65.7   46.8   55.5
All                                                        100.0  100.0  100.0 

                 Chi-square test  results
0  Pearson Chi-square ( 1.0) =   18.6433
1                    p-value =    0.0000
2               Cramer's phi =    0.1899


In [13]:
rp.crosstab(df['Employer takes MHD as serious as PH'][df['Employer takes MHD as serious as PH'] != "DK"], 
                                df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'])

Unnamed: 0_level_0,Mental health disorder (MHD),Mental health disorder (MHD),Mental health disorder (MHD)
Unnamed: 0_level_1,No,Yes,All
Employer takes MHD as serious as PH,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
No,82,148,230
Yes,157,130,287
All,239,278,517


In [14]:
print(f"Odds ratio= {(130 * 82) / (157 * 148): .2f}")

Odds ratio=  0.46


As with before, the relationship exists, $\chi^2$(1)= 18.6433, p< 0.0001, $\phi$= 0.1899; compared to those without a MHD, those with a MHD are less likely to feel that their employer takes mental health as serious as physical health, OR= 0.46.

## Comfort in Discussing Mental Health Disorder at place of Employment

### Talking with Co-workers

In [85]:
df['Comfortable talking to co-workers'] = df['Comfortable talking about MHD w/co-workers']

crosstab, results = rp.crosstab(df['Comfortable talking to co-workers'][df['Comfortable talking to co-workers'] != 'Maybe'],
                                df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'],
                                test= 'chi-square', prop= 'col')

print(crosstab.round(1), "\n"*2, results)

                                  Mental health disorder (MHD)              
                                                            No    Yes    All
Comfortable talking to co-workers                                           
No                                                        52.9   57.5   55.2
Yes                                                       47.1   42.5   44.8
All                                                      100.0  100.0  100.0 

                 Chi-square test  results
0  Pearson Chi-square ( 1.0) =    1.1343
1                    p-value =    0.2869
2               Cramer's phi =    0.0465


No relationship is present between having a MHD and feeling comfortable enough to talk to co-workers about it, $\chi^2$(1)= 1.1343, p= 0.2869.

### Talking with Supervisor

In [87]:
df['Comfortable talking to supervisor'] = df['Comfortable talking about MHD w/supervisor(s)']

crosstab, results = rp.crosstab(df['Comfortable talking to supervisor'][df['Comfortable talking to supervisor'] != 'Maybe'],
                                df['Mental health disorder (MHD)'][df['Mental health disorder (MHD)'] != 'Maybe'],
                                test= 'chi-square', prop= 'col')

print(crosstab.round(1), "\n"*2, results)

                                  Mental health disorder (MHD)              
                                                            No    Yes    All
Comfortable talking to supervisor                                           
No                                                        38.9   46.0   42.6
Yes                                                       61.1   54.0   57.4
All                                                      100.0  100.0  100.0 

                 Chi-square test  results
0  Pearson Chi-square ( 1.0) =    3.0874
1                    p-value =    0.0789
2               Cramer's phi =    0.0715


No statistically significant relationship is present between having a MHD and feeling comfortable enough to talk to a supervisor about it, $\chi^2$(1)= 3.0874, p= 0.0789. Since there appears to be some relationship present, I want to explore this further in a multivariate model.

Due to the requirements of variable names using patsy, I have to make the variable names have no spaces. During this time, I will also be removing "Don't know", "Other", and "Maybe" responses from co-variates in my model.

In [142]:
df['MH_PH'] = df['Employer takes MHD as serious as PH'].replace({"DK": np.nan})
df['MHD'] = df['Mental health disorder (MHD)'].replace({"Maybe": np.nan})
df['Emp_in_tech'] = df['Employer primarily Tech']
df['Emp_size'] = df['How many employees does your company or organization have?']
df['Emp_MH_benefits'] = df['Does your employer provide mental health benefits as part of healthcare coverage?'].replace({"I don't know": np.nan,
                                                                                                                        "Not eligible for coverage / N/A": np.nan})
df['Talk_to_sup'] = df['Comfortable talking about MHD w/supervisor(s)'].replace({"Yes": 1, "No": 0, "Maybe": np.nan})
df['Talk_to_coworkers'] = df['Comfortable talking about MHD w/co-workers'].replace({"Maybe": np.nan})
df['Gender2'] = df["Gender"].replace('Other', np.nan) # Dropping group due to low sample size

model = smf.logit('Talk_to_sup ~ C(MHD) + C(MH_PH) + C(Emp_in_tech) + C(Gender2) + Age + C(Emp_size) + C(Emp_MH_benefits)', data= df).fit(maxiter= 3000)


# Coefficients in logistic regression models are not that intuitive to interprete. I will convert them to odds ratios so it will be easier.
model_odds = pd.DataFrame(np.exp(model.params), columns= ['OR']) # Converting coef. to odds ratios for easier interpretation
model_odds['OR'] = model_odds['OR'].round(4)
model_odds['p-value']= model.pvalues
model_odds['p-value'] = model_odds['p-value'].round(4)
model_odds[['2.5%', '97.5%']] = np.exp(model.conf_int())

print("\n", f"Number of observations= {model.nobs}", "\n",
      f"F({model.df_model: .0f},{model.df_resid: .0f})", "\n",
      f"Log-likelihood p-value = {model.llr_pvalue: .4f}", "\n",
      f"Pseudo R-squared= {model.prsquared: .4f}", "\n"*3, 
     model_odds)

Optimization terminated successfully.
         Current function value: 0.441062
         Iterations 6

 Number of observations= 259 
 F( 11, 247) 
 Log-likelihood p-value =  0.0000 
 Pseudo R-squared=  0.3118 


                                     OR  p-value      2.5%      97.5%
Intercept                       0.4034   0.4582  0.036662   4.439455
C(MHD)[T.Yes]                   0.8492   0.6458  0.422892   1.705223
C(MH_PH)[T.Yes]                18.3860   0.0000  8.855960  38.171411
C(Emp_in_tech)[T.1.0]           1.1018   0.8203  0.477364   2.542991
C(Gender2)[T.Male]              1.1834   0.6742  0.539686   2.595103
C(Emp_size)[T.100-500]          1.1181   0.9072  0.171205   7.301672
C(Emp_size)[T.26-100]           0.6233   0.6059  0.103478   3.755012
C(Emp_size)[T.500-1000]         0.3780   0.3800  0.043083   3.316590
C(Emp_size)[T.6-25]             0.4320   0.3674  0.069623   2.679937
C(Emp_size)[T.More than 1000]   0.3732   0.2834  0.061634   2.259862
C(Emp_MH_benefits)[T.Yes]   

The current model accounts for approximately 31.18% of the variance associated with feeling comfortable enough to talk to the direct supervisor(s) about a mental health disorder. The only significant predictor is feeling that the employer takes mental health as serious as physical health, aOR= 18.39 (8.9, 38.17), p< 0.0001.

I did some exploration with this variable earlier, but given this result, I want to explore it further. However for now, I am taking a break from this data and am going to eat some delicious gallahba.

## What makes Exmployees feel that Work takes Mental Health as Serious as Physical Health

Again, I will be limiting these analysis to those that did not respond with "Maybe" to having a MHD.