# Mental Health in Tech Part 1 - EDA
Exported from Filament on Sun, 13 Mar 2022 17:02:36 GMT

---

This dataset comes from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace: [Mental Health in Tech Survey | Kaggle](https://www.kaggle.com/osmi/mental-health-in-tech-survey).

The original dataset is from **Open Sourcing Mental Illness** (OSMI).

**This dataset contains the following data:**

* **Timestamp**

* **Age**

* **Gender**

* **Country**

* **state**: If you live in the United States, which state or territory do you live in?

* **self_employed**: Are you self-employed?

* **family_history**: Do you have a family history of mental illness?

* **treatment**: Have you sought treatment for a mental health condition?

* **work_interfere**: If you have a mental health condition, do you feel that it interferes with your work?

* **no_employees**: How many employees does your company or organization have?

* **remote_work**: Do you work remotely (outside of an office) at least 50% of the time?

* **tech_company**: Is your employer primarily a tech company/organization?

* **benefits**: Does your employer provide mental health benefits?

* **care_options**: Do you know the options for mental health care your employer provides?

* **wellness_program**: Has your employer ever discussed mental health as part of an employee wellness program?

* **seek_help**: Does your employer provide resources to learn more about mental health issues and how to seek help?

* **anonymity**: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

* **leave**: How easy is it for you to take medical leave for a mental health condition?

* **mental_health_consequence**: Do you think that discussing a mental health issue with your employer would have negative consequences?

* **phys_health_consequence**: Do you think that discussing a physical health issue with your employer would have negative consequences?

* **coworkers**: Would you be willing to discuss a mental health issue with your coworkers?

* **supervisor**: Would you be willing to discuss a mental health issue with your direct supervisor(s)?

* **mental_health_interview**: Would you bring up a mental health issue with a potential employer in an interview?

* **phys_health_interview**: Would you bring up a physical health issue with a potential employer in an interview?

* **mental_vs_physical**: Do you feel that your employer takes mental health as seriously as physical health?

* **obs_consequence**: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

* **comments**: Any additional notes or comments

# Contents:

**Part 1** - EDA\
**Part 2** - NLP Word Cloud\
**Part 3** - Modelling

## Data quality checks

In [None]:
# IMPORT PACKAGES

# Data manipulation
#--------------------------
import pandas as pd
import numpy as np

# Data visualisation
#--------------------------
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# LOAD IN DATASET

df = pd.read_csv("survey.csv")

In [None]:
print("This dataset has", df.shape[1], "columns and", df.shape[0], "rows")
print("This dataset has", df.duplicated().sum(), "duplicate rows")

In [None]:
for col in df.columns:
    print("Number of unique values in", col, ":", len(df[col].unique()))

### Check for nulls and datatypes

In [None]:
df.info()

# only Age column numerical

In [None]:
# Unexpected discrepancy between number of non-nulls and number unique for comments 
# Non-null, non-unique values in comments column displayed

df.loc[df['comments'].notna() & df['comments'].duplicated(), ['Age','comments']]

In [None]:
sns.heatmap(df.isna(), cbar=False)
plt.title("Heatmap showing missing values")

plt.show()

In [None]:
cols = []
percentages = []
table = {'column':cols, 'percentage_missing':percentages}
for col in df.columns:
    percentage_missing = df[col].isna().sum()/df.shape[0] *100
    if percentage_missing >0:
        cols.append(col)
        percentages.append(percentage_missing)

pd.DataFrame(table).sort_values('percentage_missing', ascending=False,
                               ignore_index=True)
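
The same table can be built without an explicit loop. A minimal sketch, assuming a toy frame (`toy`, with made-up values) in place of the survey data; it relies on the fact that the column-wise mean of `isna()` is the fraction of missing values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [25, 31, 40, 28],
    "state": ["CA", np.nan, np.nan, "NY"],
    "comments": [np.nan, np.nan, np.nan, "fine"],
})

# isna() gives booleans; their column-wise mean is the fraction missing
missing_pct = toy.isna().mean().mul(100)

# Keep only columns with at least one missing value, largest share first
missing_table = (
    missing_pct[missing_pct > 0]
    .sort_values(ascending=False)
    .rename_axis("column")
    .reset_index(name="percentage_missing")
)
print(missing_table)
```

On the real data, replacing `toy` with `df` should give the same columns and percentages as the loop above.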

### Check numerical columns

In [None]:
df.describe() 

# median age 31
# min and max ages impossible

In [None]:
# Seven Lowest Ages

df[['Age']].sort_values('Age').head(7)

In [None]:
# Seven Highest Ages

df[['Age']].sort_values('Age').tail(7)

In [None]:
def data_check(index_number):
    ''' Function to check reliability of data from respondent with invalid age '''
    return df.iloc[index_number]

In [None]:
data_check(1127)

# appears to be a test entry (comments: 'password:testered')

In [None]:
data_check(989)

# 21 columns, self_employed:obs_consequence, match with test entry 
# (likely the first options on dropdown lists in survey)

In [None]:
data_check(734)

# other data field entries appear valid

In [None]:
def invalid_entries(df):
    ''' Function to identify invalid responses '''
    
    return np.where(((df['Age']<18)|(df['Age']>75))&
           (df['self_employed']=='Yes')&
           (df['family_history']=='Yes')&
           (df['treatment']=='Yes')&
           (df['work_interfere']=='Often')&
           (df['no_employees']=='1-5')&
           (df['remote_work']=='Yes')&
           (df['tech_company']=='Yes')&
           (df['benefits']=='Yes')&
           (df['care_options']=='Yes')&
           (df['wellness_program']=='Yes')&
           (df['seek_help']=='Yes')&
           (df['anonymity']=='Yes')&
           (df['leave']=='Very easy')&
           (df['mental_health_consequence']=='Yes')&
           (df['phys_health_consequence']=='Yes')&
           (df['coworkers']=='Yes')&
           (df['supervisor']=='Yes')&
           (df['mental_health_interview']=='Yes')&
           (df['phys_health_interview']=='Yes')&
           (df['mental_vs_physical']=='Yes')&
           (df['obs_consequence']=='Yes'))

In [None]:
invalid_entries(df)
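
The long chain of conditions in `invalid_entries` can also be expressed as a comparison against a reference row. A rough sketch on toy data, where row 0 plays the role of the known test entry and `answer_cols` is a hypothetical three-column subset of the 21 answer columns:

```python
import pandas as pd

# Toy stand-in for the survey: row 0 plays the role of the known test entry
toy = pd.DataFrame({
    "Age": [-1, 329, 5, 30],
    "self_employed": ["Yes", "Yes", "No", "Yes"],
    "treatment": ["Yes", "Yes", "Yes", "Yes"],
    "leave": ["Very easy", "Very easy", "Don't know", "Very easy"],
})

answer_cols = ["self_employed", "treatment", "leave"]  # hypothetical subset of columns
reference = toy.loc[0, answer_cols]                    # answers of the test entry

# True where every answer column equals the reference row's answer
same_answers = toy[answer_cols].eq(reference, axis="columns").all(axis=1)
bad_age = ~toy["Age"].between(18, 75)                  # outside 18 to 75 inclusive

suspect_rows = toy.index[same_answers & bad_age].tolist()
print(suspect_rows)
```

Note that `eq` treats a missing value as unequal to everything, including another missing value, so rows with nulls in the compared columns are never flagged by this sketch.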

### Check for duplicates

In [None]:
# Check for duplicates in dataset with Timestamp column excluded

exclude_col = ['Timestamp']
include_cols = [x for x in df.columns if x not in exclude_col]
np.where(df[include_cols].duplicated() == True)

In [None]:
def row_style(row):
    ''' Function to colour duplicate rows the same '''
    
    if row.Country == 'Denmark':
        return pd.Series('background-color: mistyrose', row.index)
    elif row.Country == 'United Kingdom':
        return pd.Series('background-color: lemonchiffon', row.index)
    elif row.Country == 'New Zealand':
        return pd.Series('background-color: honeydew', row.index)
    else:
        return pd.Series('background-color: lavender', row.index)

In [None]:
# Four pairs of duplicates 
# (each pair with less than 5 minutes difference between them)
# assume tech or user error

df.iloc[[819,821,859,860,1133,1134,1215,1218]].style.apply(row_style, axis=1)
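
The row numbers above were read off by hand. As an alternative sketch (toy data, made-up values), `duplicated(keep=False)` returns every member of each duplicate group directly, so the pairs can be inspected without listing indices:

```python
import pandas as pd

toy = pd.DataFrame({
    "Timestamp": ["2014-08-28 09:59:39", "2014-08-28 10:02:10",
                  "2014-08-28 11:00:00", "2014-08-28 12:30:00"],
    "Age": [27, 27, 35, 27],
    "Country": ["Denmark", "Denmark", "United Kingdom", "Denmark"],
    "treatment": ["No", "No", "Yes", "Yes"],
})

# Compare on every column except the timestamp
compare_cols = [c for c in toy.columns if c != "Timestamp"]

# keep=False flags every member of a duplicate group, not only the repeats
dupes = toy[toy.duplicated(subset=compare_cols, keep=False)]
print(dupes)
```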

## Data Pre-processing for EDA

**Drop redundant rows**

In [None]:
# Drop the 6 redundant rows from dataframe: 
# test entry 1127, unreliable entry 989, four duplicated rows from EDA

df.drop(df.index[[1127,989,821,860,1134,1218]],inplace=True)

if df.shape[0]!=(1259-6):  # 1259 = original number of rows
    raise Exception(f'unexpected number of rows: {df.shape[0]}')

df.reset_index(inplace=True)

**Replace invalid ages with median**

In [None]:
# Valid age range chosen to be 18 to 75 inclusive

np.where((df['Age']<18)|(df['Age']>75))  # give row indices with invalid ages 

# Assume invalid age either given by accident
# or respondent unwilling to provide but required field 
# (and answers otherwise accurate)

In [None]:
# Replace invalid ages with median age, 31

df.loc[[143,364,390,715,734,1087],'Age']=31

# Check dataframe

df[['Age']].iloc[[143,364,390,715,734,1087]]
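
The replacement can also be written without hard-coded row numbers. A minimal sketch on toy ages; note one assumption that differs from the cell above: the median here is computed over the valid ages only, whereas the notebook uses the value 31 read from `df.describe()`:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [29, -1726, 31, 5, 33, 99999999999]})

valid = toy["Age"].between(18, 75)           # inclusive on both ends
median_age = toy.loc[valid, "Age"].median()  # median of the valid ages only

# where() keeps values where the condition holds and substitutes elsewhere
toy["Age"] = toy["Age"].where(valid, median_age)
print(toy["Age"].tolist())
```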

### Categorise genders

In [None]:
def top_8_by_count_desc_groupby_df(col_name):
    ''' Function creates dataframe with columns: count, cumulative count, cumulative percentage '''
    assert(type(col_name)==str), "Input an appropriate column name, in a STRING format, e.g, 'string'."
    
    # shows the counts 
    groupby_df = df[[col_name,'Timestamp']].groupby(col_name).count().sort_values('Timestamp', ascending = False)

    # adds cumulative counts
    groupby_df.rename(columns={'Timestamp':'count'},inplace=True)
    groupby_df['cumulative'] = groupby_df['count'].cumsum(axis=0)

    # percent column shows the proportion of data accounted for
    groupby_df['percent_data_accounted_for'] = (groupby_df['cumulative']/df.shape[0])*100
    return groupby_df.head(8)
    

In [None]:
top_8_by_count_desc_groupby_df('Gender')
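
For reference, a similar count / cumulative / percentage table can be sketched with `value_counts`, which already sorts by count in descending order. The series `toy` below is made up for illustration:

```python
import pandas as pd

toy = pd.Series(["US", "US", "UK", "US", "Canada", "UK", "Germany", "US"],
                name="Country")

counts = toy.value_counts()  # sorted by count, descending; missing values excluded

summary = pd.DataFrame({
    "count": counts,
    "cumulative": counts.cumsum(),
    "percent_data_accounted_for": counts.cumsum() / len(toy) * 100,
})
print(summary.head(8))
```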

In [None]:
# Group genders

Male = ['Male', 'male', 'M', 'm']
Female = ['Female', 'female', 'F', 'f']

Other = [x for x in df.Gender.unique() if x not in Male and x not in Female]

# Replace all Gender values with Male, Female or Other

df['Gender'] = df['Gender'].replace(Male,'Male')
df['Gender'] = df['Gender'].replace(Female,'Female')
df['Gender'] = df['Gender'].replace(Other,'Other')

df['Gender'].unique()
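
A variant worth considering: normalising case and whitespace before grouping. This is only a sketch with made-up values and a hypothetical `mapping` dictionary, and it is slightly more permissive than the cell above, since entries such as `'male '` or `'MALE'` would land in Male rather than Other:

```python
import pandas as pd

toy = pd.Series(["Male", "male ", "M", "Female", "f",
                 "Woman", "non-binary", "Cis Male"], name="Gender")

# Hypothetical lookup from normalised spellings to categories
mapping = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

cleaned = toy.str.strip().str.lower()           # drop stray spaces, ignore case
grouped = cleaned.map(mapping).fillna("Other")  # anything unmapped becomes Other
print(grouped.value_counts())
```

Missing values would also end up in Other with this approach.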

### Categorise countries

In [None]:
top_8_by_count_desc_groupby_df('Country')

# 3/4 data made up by US and UK entries

In [None]:
countries = ['United States', 'United Kingdom'] # country categories to keep

Other = [x for x in df.Country.unique() if x not in countries] 

df['Country'] = df['Country'].replace(Other,'Other') # combine remaining

df['Country'].unique()
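
The same grouping can be sketched in one step with `where` and `isin`, which keeps the listed countries and overwrites the rest. Toy values only:

```python
import pandas as pd

toy = pd.Series(["United States", "Canada", "United Kingdom",
                 "Germany", "United States"], name="Country")

keep = ["United States", "United Kingdom"]  # country categories to keep

# Values in `keep` are retained; everything else is replaced by 'Other'
grouped = toy.where(toy.isin(keep), "Other")
print(grouped.tolist())
```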

## Visualising data distributions

In [None]:
# Age distribution by gender category

sns.displot(data = df, x = 'Age', hue = 'Gender')

# Female gender shows more positive/right skew (younger) than other genders

In [None]:
# Answer distributions for the following columns

cols = ['Gender', 'self_employed', 'family_history', 'treatment', 'remote_work', 'tech_company', 'benefits', 
        'care_options', 'wellness_program', 'seek_help', 'anonymity', 
        'coworkers', 'supervisor', 'mental_vs_physical', 'obs_consequence']

counter = 0

fig, ax = plt.subplots(5, 3, figsize=(12, 19))

for i in range(5):
    for j in range(3):
        sns.countplot(data = df, x = cols[counter], ax=ax[i, j]).set(ylabel=None)
        counter +=1

In [None]:
# Answer distributions for Likert/Semantic differential scale style questions

fig, ax = plt.subplots(1,2,figsize=(10,4))

sns.countplot(data= df, x= 'work_interfere', ax=ax[0],
              order= ['Never', 'Rarely', 'Sometimes', 'Often'])

sns.countplot(data= df, x= 'leave', ax=ax[1],
              order= ["Don't know", 'Very easy', 'Somewhat easy', 
                      'Somewhat difficult', 'Very difficult'])

# Rotate the x tick labels on both axes so they do not overlap
for axis in ax:
    axis.tick_params(axis='x', labelrotation=90)

In [None]:
# Number of employees 

sns.countplot(data= df, x= 'no_employees',
              order= ['1-5','6-25','26-100','100-500',
                      '500-1000','More than 1000'])

plt.xticks(rotation = 45)

In [None]:
# Answer distributions comparing mental health and physical health

fig, ax = plt.subplots(2,2,figsize=(10,8))

cols = ['mental_health_consequence', 'phys_health_consequence',
          'mental_health_interview', 'phys_health_interview']

counter = 0

for i in range(2):
    for j in range(2):
        sns.countplot(data=df, x=cols[counter], ax=ax[i, j], 
                      order=['No','Maybe','Yes'])
        counter +=1


### Additional tables

In [None]:
# Encode comments as a flag: 1 if the respondent left a comment, 0 otherwise
# (direct assignment avoids chained indexing, which may not update df)
df['comments'] = df['comments'].notna().astype(int)

In [None]:
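# Only consider those respondents who identify as having a mental health condition
# (rows with a null work_interfere are dropped)
df = df.dropna(subset=['work_interfere']).copy()

# Encode treatment as 1 (Yes) / 0 (No) so it can be summed and counted
df['treatment'] = df['treatment'].map({'Yes': 1, 'No': 0})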

In [None]:
def sought_treatment(df, col):
    ''' returns percentage that sought treatment '''
    
    table = df[[col, 'treatment']].groupby(col).agg(['sum', 'count'])
    table['percent'] = table.iloc[:,0]*100/table.iloc[:,1]
    return table

In [None]:
sought_treatment(df, 'Gender')

In [None]:
sought_treatment(df,'family_history')

In [None]:
sought_treatment(df,'comments')
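
Because `treatment` is encoded as 0/1, its group mean is the proportion that sought treatment. A minimal sketch of the same table using named aggregation, on made-up data; the column names `sought` and `respondents` are illustrative, not from the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Other"],
    "treatment": [1, 0, 1, 1, 0, 1],  # already encoded as 1 = Yes, 0 = No
})

# sum = number who sought treatment, count = group size, mean = proportion
table = (
    toy.groupby("Gender")["treatment"]
    .agg(sought="sum", respondents="count", percent="mean")
)
table["percent"] = table["percent"] * 100
print(table)
```

This gives flat column names rather than the two-level columns produced by `agg(['sum', 'count'])`.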

## Conclusions

* About half of respondents had sought treatment for mental health

* Most common age group of respondents 25-35

* Majority of respondents male

* Higher proportion of females sought treatment when compared to other genders

* More respondents would bring up a physical health condition over a mental health condition in an interview

* More respondents anticipated negative consequences from discussing a mental health issue with their employer than from discussing a physical health issue

* Opinions were split on whether employers take mental health as seriously as physical health

* Respondents most commonly answered that their mental health would interfere with their work sometimes