# Supervised Learning Capstone: Mental Health in Tech

### Introduction

This data is extracted from __[Kaggle](https://www.kaggle.com/osmi/mental-health-in-tech-survey)__. It comes from a survey conducted in 2014 that measures attitudes towards mental health in the workplace. The survey includes questions about attitudes, health, and the availability of benefits.

### Research Interest

<b>Objective:</b> To use the variables of this dataset to predict if a person in tech has sought treatment for a mental health condition.

<b>Original Features:</b>
-  <b>Timestamp</b>: Time survey was submitted
-  <b>Age</b>: Age of respondent
-  <b>Gender</b>: Gender of respondent
-  <b>Country</b>: Country of respondent
-  <b>state</b>: State of respondent
-  <b>self_employed</b>: Are you self-employed?
-  <b>family_history</b>: Is there a family history of mental health conditions?
-  <b>treatment</b>: Have you sought treatment for a mental health condition?
-  <b>work_interfere</b>: If you have a mental health condition, do you feel that it interferes with your work?
-  <b>no_employees</b>: How many employees does your company or organization have?
-  <b>remote_work</b>: Do you work remotely (outside of an office) at least 50% of the time?
-  <b>tech_company</b>: Is your employer primarily a tech company/organization?
-  <b>benefits</b>: Does your employer provide mental health benefits?
-  <b>care_options</b>: Do you know the options for mental health care your employer provides? 
-  <b>wellness_program</b>: Has your employer ever discussed mental health as part of an employee wellness program?
-  <b>seek_help</b>: Does your employer provide resources to learn more about mental health issues and how to seek help?
-  <b>anonymity</b>: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
-  <b>leave</b>: How easy is it for you to take medical leave for a mental health condition?
-  <b>mental_health_consequence</b>: Do you think that discussing a mental health issue with your employer would have negative consequences?
-  <b>phys_health_consequence</b>: Do you think that discussing a physical health issue with your employer would have negative consequences?
-  <b>coworkers</b>: Would you be willing to discuss a mental health issue with your coworkers?
-  <b>supervisor</b>: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
-  <b>mental_health_interview</b>: Would you bring up a mental health issue with a potential employer in an interview?
-  <b>phys_health_interview</b>: Would you bring up a physical health issue with a potential employer in an interview?
-  <b>mental_vs_physical</b>: Do you feel that your employer takes mental health as seriously as physical health?
-  <b>obs_consequence</b>: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
-  <b>comments</b>: Any additional notes or comments

### Modules and Data Loading

In [1]:
#Import modules
import numpy as np
import pandas as pd
import sklearn
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline
plt.style.use('dark_background')
#sklearn
from sklearn import ensemble
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, log_loss, recall_score 
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

#for clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

#other learners
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from kmodes.kmodes import KModes

#imblearn
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

#webscraping
import requests
from bs4 import BeautifulSoup
import re
import urllib
from IPython.core.display import HTML

#time series
import statsmodels.api as sm
from pylab import rcParams
import itertools
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA


#warning ignorer
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/GenTaylor/SupervisedLearning/master/mentalhealthintechsurvey.csv")

In [3]:
df.columns

Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

### Data Exploration and Cleaning

In [4]:
df.dtypes

Timestamp                    object
Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
comments                     object
dtype: object

There are 27 features: 26 are stored as objects (strings) and 1, Age, is an integer.
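The same tally can be read from the frame instead of counted by eye. A minimal sketch on a small stand-in frame (`toy` and its values are hypothetical, not the survey data):

```python
import pandas as pd

# Hypothetical stand-in for the survey frame: one integer column, two text columns
toy = pd.DataFrame({
    "Age": [37, 44, 32],
    "Gender": ["Female", "M", "Male"],
    "Country": ["United States", "United States", "Canada"],
})

# Count how many columns share each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the numeric columns only
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Running the same two calls on `df` would give the per-dtype column counts and confirm which column is numeric.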

In [5]:
df.head()

| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | NaN | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | NaN |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | NaN | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | NaN |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | NaN | NaN | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | NaN |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | NaN | NaN | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | NaN |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | NaN | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | NaN |


In [6]:
df.shape

(1259, 27)

In [7]:
#standardize all columns to lowercase for ease of use in querying
df.columns = map(str.lower, df.columns)

#verify
df.columns

Index(['timestamp', 'age', 'gender', 'country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

In [8]:
df.describe(include='all')  

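| | timestamp | age | gender | country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1259 | 1259.0 | 1259 | 1259 | 744 | 1241 | 1259 | 1259 | 995 | 1259 | ... | 1259 | 1259 | 1259 | 1259 | 1259 | 1259 | 1259 | 1259 | 1259 | 164 |
| unique | 1246 | NaN | 49 | 48 | 45 | 2 | 2 | 2 | 4 | 6 | ... | 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 160 |
| top | 2014-08-27 17:33:52 | NaN | Male | United States | CA | No | No | Yes | Sometimes | 6-25 | ... | Don't know | No | No | Some of them | Yes | No | Maybe | Don't know | No | \* Small family business - YMMV. |
| freq | 2 | NaN | 615 | 751 | 138 | 1095 | 767 | 637 | 465 | 290 | ... | 563 | 490 | 925 | 774 | 516 | 1008 | 557 | 576 | 1075 | 5 |
| mean | NaN | 79428150.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | 2818299000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | -1726.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | 27.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | 31.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | 36.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |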


The mean age (about 79 million) and the negative minimum are impossible for survey respondents, so I decided to explore the column's values.

In [9]:
#check values of only int, age

print("Age Values")
print("Average Age: ", df['age'].mean())
print("Minimum Age: ", df['age'].min())
print("Maximum Age: ", df['age'].max())
print("Null values: ", pd.isnull(df['age']).sum())

Age Values
Average Age:  79428148.31135821
Minimum Age:  -1726
Maximum Age:  99999999999
Null values:  0


Age values are obviously off, and the outliers need to be fixed before any machine learning or analysis can be done.

In [10]:
#fix outlier issues for ages
def fixedage(age):
    if age>=1 and age<=99:
        return age
    else:
        return np.nan
df['age'] = df['age'].apply(fixedage)


#check age values again

print("Age Values")
print("Average Age: ", df['age'].mean())
print("Minimum Age: ", df['age'].min())
print("Maximum Age: ", df['age'].max())
print("Null values: ", pd.isnull(df['age']).sum())

Age Values
Average Age:  32.01913875598086
Minimum Age:  5.0
Maximum Age:  72.0
Null values:  5
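The same range filter can also be written without a row-by-row function. A minimal sketch, assuming the same inclusive 1 to 99 range and using made-up values in place of the survey column:

```python
import pandas as pd

# Hypothetical ages, including the kinds of outliers seen in the survey
ages = pd.Series([37, -1726, 44, 99999999999, 5, 72])

# Keep values in the inclusive range 1-99; everything else becomes NaN
cleaned = ages.where(ages.between(1, 99))

print(cleaned.tolist())
print("Null values:", cleaned.isna().sum())
```

`Series.between` is inclusive on both ends by default, which matches the `age>=1 and age<=99` condition.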


#### Handling Null Values

In [11]:
#Check Missing Data
df.isnull().sum().sort_values(ascending=False)

comments                     1095
state                         515
work_interfere                264
self_employed                  18
age                             5
benefits                        0
gender                          0
country                         0
family_history                  0
treatment                       0
no_employees                    0
remote_work                     0
tech_company                    0
care_options                    0
obs_consequence                 0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
timestamp                       0
dtype: int64

In [12]:
#fill null values with avg age

#assign the result back; inplace=True on a selected column may not update df in newer pandas
df['age'] = df['age'].fillna(df['age'].mean())


df['age']=df['age'].astype(int)

#check null again
print("Null values: ", pd.isnull(df['age']).sum())

Null values:  0


In [13]:
#drop timestamp, state, and comments

df = df.drop(columns=['comments', 'timestamp', 'state'])


In [14]:
#Check Missing Data
df.isnull().sum().sort_values(ascending=False)

work_interfere               264
self_employed                 18
obs_consequence                0
mental_vs_physical             0
gender                         0
country                        0
family_history                 0
treatment                      0
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
age                            0
dtype: int64

In [15]:
#Checking the unique values for 'work_interfere'
print("Distinct values for work_interfere:\n", set(df['work_interfere']))

Distinct values for work_interfere:
 {nan, 'Sometimes', 'Rarely', 'Often', 'Never'}
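Before filling the gaps, it can help to see how many responses fall in each category, missing ones included. A small sketch on made-up responses (the Series below is hypothetical, not survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical responses; NaN marks respondents who skipped the question
work_interfere = pd.Series(["Often", "Rarely", np.nan, "Never", np.nan, "Sometimes"])

# Include missing answers in the tally instead of silently dropping them
counts = work_interfere.value_counts(dropna=False)
print(counts)

# Treat a skipped answer as its own category
filled = work_interfere.fillna("Unsure")
print(filled.value_counts())
```

Keeping skipped answers as a separate category preserves the fact that the question was left blank, which may itself carry information.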


In [16]:

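#treat a skipped work_interfere answer as its own category
df['work_interfere'] = df['work_interfere'].fillna("Unsure")

#fill skipped self_employed answers with the most common response, "No"
df['self_employed'] = df['self_employed'].fillna("No")
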

In [17]:
#Check Missing Data
df.isnull().sum().sort_values(ascending=False)

obs_consequence              0
mental_vs_physical           0
gender                       0
country                      0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
age                          0
dtype: int64

##### Check Distinct Values

In [18]:
#Gender
print("Distinct values for gender:\n", set(df['gender']))

Distinct values for gender:
 {'woman', 'ostensibly male, unsure what that really means', 'Male', 'm', 'A little about you', 'Female', 'All', 'Femake', 'Mail', 'Male ', 'Nah', 'Man', 'M', 'fluid', 'Male-ish', 'Androgyne', 'cis-female/femme', 'Female ', 'something kinda male?', 'Agender', 'Genderqueer', 'queer/she/they', 'Make', 'Cis Female', 'Male (CIS)', 'Enby', 'Guy (-ish) ^_^', 'msle', 'maile', 'F', 'Cis Man', 'female', 'Mal', 'Neuter', 'male leaning androgynous', 'queer', 'cis male', 'f', 'p', 'non-binary', 'Malr', 'Trans woman', 'Woman', 'Trans-female', 'Female (cis)', 'Cis Male', 'femail', 'male', 'Female (trans)'}


In [19]:
#Age
print("Distinct values for age:\n", set(df['age']))

Distinct values for age:
 {5, 8, 11, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57, 58, 60, 61, 62, 65, 72}


In [20]:
#Country
print("Distinct values for country:\n", set(df['country']))

Distinct values for country:
 {'Switzerland', 'Spain', 'Zimbabwe', 'China', 'Portugal', 'France', 'Ireland', 'Slovenia', 'India', 'Thailand', 'New Zealand', 'Australia', 'Italy', 'Bahamas, The', 'Nigeria', 'Croatia', 'Sweden', 'Uruguay', 'Bosnia and Herzegovina', 'Philippines', 'Poland', 'Germany', 'Costa Rica', 'United States', 'Denmark', 'Russia', 'Romania', 'South Africa', 'Canada', 'Latvia', 'Japan', 'Hungary', 'Austria', 'Singapore', 'Netherlands', 'Israel', 'Belgium', 'Finland', 'Czech Republic', 'Norway', 'Bulgaria', 'Moldova', 'Georgia', 'Brazil', 'Mexico', 'Greece', 'Colombia', 'United Kingdom'}


The gender responses vary too much in spelling and classification to use as they are. I cleaned them by lower-casing each response and grouping it into one of three categories.

In [21]:
#create gender groups

male= ["man","msle", "mail", "malr","male", "m", "male-ish", "maile", "mal", "male (cis)", 
       "make", "male ","cis man", "Cis Male", "cis male"]

female=["female ","cis-female/femme", "female (cis)", "femail", "cis female", "f", "female", 
        "woman",  "femake"]

other_or_trans =["something kinda male?", "queer/she/they", "p","a little about you","non-binary","nah", "all", 
        "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", 
        "guy (-ish) ^_^",  "neuter",  "queer", "ostensibly male, unsure what that really means", "trans-female","trans woman","female (trans)"]

#map each lower-cased response to its group; unmatched responses are left unchanged
gender_groups = {**dict.fromkeys(male, 'male'),
                 **dict.fromkeys(female, 'female'),
                 **dict.fromkeys(other_or_trans, 'other_or_trans')}

df['gender'] = df['gender'].str.lower().map(gender_groups).fillna(df['gender'])

#Gender
print("Distinct responses for gender:\n", set(df['gender']))

Distinct responses for gender:
 {'other_or_trans', 'female', 'male'}
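With the responses collapsed into three groups, treatment rates can be compared across them. A sketch on made-up rows (the values are illustrative only, not survey results):

```python
import pandas as pd

# Hypothetical rows using the three gender groups and the treatment target
toy = pd.DataFrame({
    "gender": ["male", "female", "male", "other_or_trans", "female", "male"],
    "treatment": ["No", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Row-normalized table: share of each gender group answering No / Yes
rates = pd.crosstab(toy["gender"], toy["treatment"], normalize="index")
print(rates)
```

Each row of `rates` sums to 1, so the `Yes` column reads directly as the proportion of that group who sought treatment.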


In [22]:
print("Distinct values for leave:\n", set(df['leave']))

Distinct values for leave:
 {'Very easy', 'Somewhat easy', 'Somewhat difficult', "Don't know", 'Very difficult'}


<b>Ages:</b> I want to create a feature that groups the ages. I kept groups such as those under 18, because someone in that age group could plausibly be working in tech and taking the survey. Those ages could have been typos, but I did not want to risk deleting valid responses.

In [23]:
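#Group the ages into groups
#right=False makes each bin include its lower edge, so 18 falls in '18-24' rather than '<18'
df['age'] = pd.cut(df['age'],
                         [0, 18, 25, 35, 45, 55, 65, 100],
                         labels=['<18','18-24','25-34','35-44','45-54', '55-64', '65+'],
                         right=False)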
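Whether a boundary age such as 18 lands in '<18' or '18-24' depends on which side of each interval `pd.cut` closes. A small sketch on made-up ages, with an upper edge of 100 assumed for illustration:

```python
import pandas as pd

ages = pd.Series([17, 18, 24, 25, 65])
edges = [0, 18, 25, 35, 45, 55, 65, 100]
labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Default (right=True): intervals are (0, 18], (18, 25], ... so 18 is labeled '<18'
right_closed = pd.cut(ages, edges, labels=labels)

# right=False: intervals are [0, 18), [18, 25), ... so 18 is labeled '18-24'
left_closed = pd.cut(ages, edges, labels=labels, right=False)

print(pd.DataFrame({"age": ages,
                    "right=True": right_closed,
                    "right=False": left_closed}))
```

With labels written as '18-24', '25-34', and so on, the left-closed form is the one whose bins match the label text.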

In [24]:
df.columns

Index(['age', 'gender', 'country', 'self_employed', 'family_history',
       'treatment', 'work_interfere', 'no_employees', 'remote_work',
       'tech_company', 'benefits', 'care_options', 'wellness_program',
       'seek_help', 'anonymity', 'leave', 'mental_health_consequence',
       'phys_health_consequence', 'coworkers', 'supervisor',
       'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence'],
      dtype='object')
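Every remaining column is categorical, so the features need a numeric encoding before most scikit-learn estimators can use them. One possible approach, sketched on a made-up three-column frame (one-hot encoding is an assumption here, not necessarily the encoding used for the final model):

```python
import pandas as pd

# Hypothetical rows with two predictors and the treatment target
toy = pd.DataFrame({
    "gender": ["male", "female", "other_or_trans", "male"],
    "family_history": ["No", "Yes", "Yes", "No"],
    "treatment": ["No", "Yes", "Yes", "No"],
})

# Binary target: 1 if the respondent sought treatment, 0 otherwise
y = toy["treatment"].map({"Yes": 1, "No": 0})

# One-hot encode the categorical predictors (one indicator column per category)
X = pd.get_dummies(toy.drop(columns="treatment"))

print(X.columns.tolist())
print(y.tolist())
```

Ordered answers such as `leave` or the age groups could instead be given an ordinal encoding; which choice works better is an empirical question for the modeling stage.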

#### Practical Uses of the Model



This model is practical for at least two groups: employers and health insurance companies. Employers who want to improve the mental health of their workers can use these results to support employees, which can in turn improve performance. They can also use them in recruiting, since many job seekers want an employer that cares about the mental health of its employees.

Health insurance companies can use this information to better market and sell their policies to companies. Once they have researched the company they are pitching to and better understand its needs, they can decide whether to offer policies with a heavy focus on mental health and well-being.


#### -Genesis Taylor