### Motivation and problem statement:



Mental health has become a serious concern in the 21st century. Because people in technical jobs often lead a sedentary lifestyle and have few social engagements outside their workplace, it is important to study the workplace factors that affect an individual’s mental health. I want to know what measures employers can take to promote their employees’ mental health and spread awareness to address this growing concern. Aside from company policies on health, leave, insurance, etc., the ability to comfortably communicate health problems to colleagues and supervisors can have a serious impact on an employee’s mental health.

### Datasets:

For this project, I will use the following dataset, which is publicly available on [Kaggle](https://www.kaggle.com/).

- [OSMI Mental Health In Tech Survey 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

The dataset is owned by [OSMI](https://osmihelp.org), and the contents of the website are also licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/). It contains qualitative survey data about mental health in tech, collected in 2014, and covers a range of questions about employers’ mental health policies along with respondents’ age, gender, and country of work and residence.

This dataset contains all the qualitative and quantitative information needed to analyze the problem at hand. Although the OSMI survey results are anonymized and do not contain PII, they still include sensitive demographic data (age, gender). The results of the analysis are intended purely to understand the factors affecting the mental health of people in tech and are not meant to disrespect anyone.

All the raw data collected for the analysis is stored in the [data/raw](./data/raw) folder.


### Unknowns and dependencies:

The Mental Health in Tech Survey data is skewed: the majority of participants are from the US and very few are from low-income countries. The survey has answers from 1259 participants to 26 questions. The small sample size and the skew will make it difficult to support strong claims about whether region or country affects the mental health of people working there. The survey does not appear to have been administered to a representative sample, so we won't be able to generalize the findings to a broader population. Also, because the data is qualitative, the analysis will depend heavily on the range of options offered for each categorical question. The responses may lack objectivity, and the data collection steps may have introduced bias. Finally, the variable `treatment` (Have you sought treatment for a mental health condition?) may not accurately indicate whether an individual suffers from a mental illness.
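As a rough illustration of how the geographic skew described above could be quantified, the sketch below computes the share of respondents per country. The toy data and the helper name `country_shares` are assumptions made for illustration; only the `Country` column name comes from the survey.

```python
import pandas as pd


def country_shares(df, column="Country"):
    """Return the fraction of respondents per country, largest first."""
    return df[column].value_counts(normalize=True)


# Toy stand-in for the survey (values are illustrative, not real counts).
toy = pd.DataFrame(
    {"Country": ["United States"] * 6 + ["United Kingdom"] * 2 + ["Canada", "India"]}
)

shares = country_shares(toy)
print(shares.round(2))
```

On the real data, a single country holding most of the mass in this table would confirm that country-level comparisons rest on very few responses.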

### Research Questions

Q1. How does the frequency of mental health illness vary by age and gender?<br>
**Hypothesis**: The frequency of mental health illness differs across demographic groups.
 
Q2. Does a family history of mental health illness impact the frequency of mental health illness?<br>
**Hypothesis**: The frequency of mental health illness is independent of a family history of mental health illness.
 
Q3. Does the attitude towards mental health impact an individual’s decision to seek treatment for a mental health condition?<br>
**Hypothesis**: The attitude towards mental health does not impact an individual’s decision to seek treatment for a mental health condition.
 
Q4. What are the strongest workplace predictors of mental health illness?<br>
**Hypothesis**: A workplace that provides medical health benefits, awareness of care options, and a safe environment for discussing mental health issues with supervisors and peers contributes to better mental health among its employees than a workplace that does not prioritize its employees’ health.
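One possible way to test an independence hypothesis such as the one in Q2 is a Pearson chi-square test on the contingency table of two Yes/No columns. The sketch below is an assumption about method, not the analysis the project has committed to: it uses `treatment` as a proxy for mental health illness, runs on toy data, and omits the continuity correction (note that `scipy.stats.chi2_contingency` applies one by default for 2x2 tables, so its numbers would differ slightly). The helper name `chi_square_2x2` is hypothetical.

```python
import math

import pandas as pd


def chi_square_2x2(df, row_col, col_col):
    """Pearson chi-square test of independence for two binary columns."""
    observed = pd.crosstab(df[row_col], df[col_col]).to_numpy(dtype=float)
    if observed.shape != (2, 2):
        raise ValueError("both columns must have exactly two distinct values")
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the two columns were independent.
    expected = row_totals @ col_totals / observed.sum()
    stat = float(((observed - expected) ** 2 / expected).sum())
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data (illustrative only): 10 respondents with family history, 10 without.
toy = pd.DataFrame(
    {
        "family_history": ["Yes"] * 10 + ["No"] * 10,
        "treatment": ["Yes"] * 8 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 7,
    }
)

stat, p_value = chi_square_2x2(toy, "family_history", "treatment")
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value would count as evidence against the independence hypothesis; given the sampling caveats noted earlier, any such result describes the respondents rather than the wider tech workforce.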


### Background/Related Work

A related [paper](https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3966-2019.pdf) that measures suicidal tendencies among employees in the tech industry, based on mental health illnesses and certain attitudes towards mental health in the workplace, suggests that suicide rates are directly linked to mental health conditions and vary considerably across age groups, genders, and regions.
The study also suggests that companies which provide remote work, benefits, and awareness around mental health have a positive impact on employees’ mental health and, in turn, reduce suicidal tendencies.

An [InfoQ article](https://www.infoq.com/articles/mental-health-tech-workplace/) suggests that the major hindrance to mental wellness in a workplace is that mental illnesses are stigmatized, and employees do not feel comfortable speaking up about a mental health issue for fear of losing a promotion or their job.

[OSMI](https://osmihelp.org/research) provides useful datasets from its survey, available from 2014 and conducted every year, which asks respondents about attitudes towards mental health illness in their workplace (benefits, care options, consequences of informing an employer about physical or mental health issues, employees’ awareness of the employer’s policies and programmes around mental health, the ability to take leaves of absence, etc.) along with demographic data (age, gender, country). OSMI also provides guidelines for promoting mental wellness in the workplace to executives and HR professionals, based on studies of the survey results.

The research and resources above are the basis for the research questions and hypotheses that I plan to address in this analysis. It will be interesting to see which workplace factors contribute to improved mental health.

### Methodology

#### Data Gathering

```python
# import libraries
import pandas as pd
import numpy as np

# read data files
mhit = pd.read_csv("survey.csv")

# quick look at mental health in tech data
mhit.head(5)
```

| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | remote_work | tech_company | benefits | care_options | wellness_program | seek_help | anonymity | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | | No | Yes | Often | 6-25 | No | Yes | Yes | Not sure | No | Yes | Yes | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | | No | No | Rarely | More than 1000 | No | No | Don't know | No | Don't know | Don't know | Don't know | Don't know | Maybe | No | No | No | No | No | Don't know | No | |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | | | No | No | Rarely | 6-25 | No | Yes | No | No | No | No | Don't know | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | | | Yes | Yes | Often | 26-100 | No | Yes | No | Yes | No | No | No | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | | No | No | Never | 100-500 | Yes | Yes | Yes | No | Don't know | Don't know | Don't know | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | |


#### Data cleaning and preprocessing


Treating missing values:

- Many columns contain missing values. By default, missing values will be set to ‘NaN’ in string columns and to 0 in integer columns.
- The column ‘Gender’ contains 49 distinct responses and will be recoded into just three gender types: male/female/trans(non-binary).
- Missing values in the column ‘Age’ will be replaced by the median age. Because the survey was mostly filled in by people working in the tech industry, one can assume that the typical age of the participants is a safe choice for a missing age. Some ages are also too high or too low to be real; these values will be replaced by the median age as well.
- For the columns with Yes/No inputs, missing values will be replaced by ‘No/Don’t know’.
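The cleaning steps above could be sketched roughly as follows. Several details are assumptions rather than decisions taken from the text: the spelling sets `MALE` and `FEMALE` are hypothetical and far shorter than the 49 real responses would require, the plausible age range of 18 to 75 is an illustrative cut-off, and `self_employed` is filled with `"No"` only as an example of a Yes/No column.

```python
import numpy as np
import pandas as pd

# Hypothetical spelling lists; the real column needs a much fuller mapping.
MALE = {"male", "m", "man"}
FEMALE = {"female", "f", "woman"}


def clean_gender(value):
    """Collapse a free-text response into male / female / trans(non-binary)."""
    if pd.isna(value):
        return value  # leave missing responses missing
    text = str(value).strip().lower()
    if text in MALE:
        return "male"
    if text in FEMALE:
        return "female"
    # Assumption: every other response falls into the third group.
    return "trans(non-binary)"


def clean_age(age, low=18, high=75):
    """Replace missing or implausible ages with the median plausible age."""
    age = pd.to_numeric(age, errors="coerce")
    plausible = age.where(age.between(low, high))
    return plausible.fillna(plausible.median()).round().astype(int)


# Toy rows (illustrative only) covering each case handled above.
toy = pd.DataFrame(
    {
        "Age": [37, 44, -29, 32, 329, np.nan],
        "Gender": ["Female", "M", "male ", "Woman", "Trans woman", "f"],
        "self_employed": [np.nan, "No", "Yes", np.nan, "No", "No"],
    }
)

toy["Age"] = clean_age(toy["Age"])
toy["Gender"] = toy["Gender"].map(clean_gender)
toy["self_employed"] = toy["self_employed"].fillna("No")
print(toy)
```

Computing the median only over plausible ages keeps extreme entries from shifting the replacement value.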

Encoding categorical data:

For each categorical column, we will apply a label encoder to convert the responses into integer classes. The age column, being an integer, also needs to be normalised before fitting any model to find the strongest predictors of whether an individual needs treatment.
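A minimal sketch of the encoding and normalisation step, under two assumptions: pandas category codes stand in for a label encoder (scikit-learn's `LabelEncoder` is probably the tool meant, and it also numbers classes in sorted order), and "normalised" is taken to mean min-max scaling to the range 0 to 1. The helper name `encode_and_scale` is hypothetical.

```python
import pandas as pd


def encode_and_scale(df, categorical_cols, age_col="Age"):
    """Label-encode categorical columns and min-max scale the age column."""
    out = df.copy()
    for col in categorical_cols:
        # One integer per distinct value, in sorted order (missing becomes -1).
        out[col] = out[col].astype("category").cat.codes
    age = out[age_col].astype(float)
    # Assumes at least two distinct ages, otherwise this divides by zero.
    out[age_col] = (age - age.min()) / (age.max() - age.min())
    return out


# Toy rows (illustrative only) using column names from the survey.
toy = pd.DataFrame(
    {
        "Age": [37, 44, 32, 31],
        "treatment": ["Yes", "No", "No", "Yes"],
        "leave": ["Somewhat easy", "Don't know", "Somewhat difficult", "Don't know"],
    }
)

encoded = encode_and_scale(toy, ["treatment", "leave"])
print(encoded)
```

Label encoding imposes an arbitrary order on the answers, which tree-based models tolerate better than linear ones; one-hot encoding is a common alternative if that ordering turns out to matter.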