# Mini-project 1: Machine Learning for Mental Health
2024 Week 4 - Data Understanding and Pre-processing  
**Author:** David Koskas  
**Date:** September 3, 2024  


## Setup

### Kaggle Authentication Process

First, **make sure you have downloaded your Kaggle token file** (`kaggle.json`) from the Kaggle website: go to Settings, then API, and click Create New Token.


In [None]:
# Upload the Kaggle token file to Google Colab
from google.colab import files

uploaded = files.upload()


Saving kaggle.json to kaggle.json


In [None]:
# Create the Kaggle folder & Move the Kaggle token file into it

import os
import shutil

kaggle_dir = '/root/.kaggle'

# Create the Kaggle directory if it doesn't exist
if not os.path.exists(kaggle_dir):
  os.makedirs(kaggle_dir)

source = '/content/kaggle.json'
destination = '/root/.kaggle/kaggle.json'

# Move the Kaggle token file to the Kaggle directory
if os.path.exists(source):
  shutil.move(source, destination)

In [None]:
# Change permissions to the Kaggle token file
!chmod 600 ~/.kaggle/kaggle.json
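
The three steps above (create the folder, move the token, restrict its permissions) can also be done in one pure-Python helper. This is a minimal sketch: `install_kaggle_token` is a hypothetical name introduced here for illustration, not part of the Kaggle API, and the paths are passed in as arguments rather than hard-coded.

```python
import os
import shutil
import stat


def install_kaggle_token(source, kaggle_dir):
    """Move a Kaggle token file into kaggle_dir and restrict it to the owner."""
    # exist_ok=True avoids an explicit os.path.exists check
    os.makedirs(kaggle_dir, exist_ok=True)
    destination = os.path.join(kaggle_dir, 'kaggle.json')

    # Move the token only if it is still at the upload location
    if os.path.exists(source):
        shutil.move(source, destination)

    # Owner read/write only, the same mode as `chmod 600`
    if os.path.exists(destination):
        os.chmod(destination, stat.S_IRUSR | stat.S_IWUSR)
    return destination
```

In this notebook the call would be `install_kaggle_token('/content/kaggle.json', '/root/.kaggle')`.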

### Dataset download from Kaggle & File unzipping


In [None]:
# URL of the page: https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey

# Download the dataset from Kaggle site
!kaggle datasets download -d osmi/mental-health-in-tech-survey

Dataset URL: https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey
License(s): CC-BY-SA-4.0
Downloading mental-health-in-tech-survey.zip to /content
  0% 0.00/48.8k [00:00<?, ?B/s]
100% 48.8k/48.8k [00:00<00:00, 1.36MB/s]


In [None]:
# Unzip the file in the same Colab directory
!unzip /content/mental-health-in-tech-survey.zip -d /content/

Archive:  /content/mental-health-in-tech-survey.zip
  inflating: /content/survey.csv     
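
If the `unzip` command is not available, the standard library can extract the archive as well. A minimal sketch, where `unzip_archive` is a hypothetical helper name used only for illustration:

```python
import zipfile


def unzip_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Here the equivalent call would be `unzip_archive('/content/mental-health-in-tech-survey.zip', '/content/')`.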


## Exploratory Data Analysis

In [None]:
import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('/content/survey.csv')

df.head()


Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [None]:
df.describe()

statistic,Age
count,1259.0
mean,79428150.0
std,2818299000.0
min,-1726.0
25%,27.0
50%,31.0
75%,36.0
max,100000000000.0


In [None]:
# Check the age values
df['Age'].value_counts()

Age,count
29,85
32,82
26,75
27,71
33,70
28,68
31,67
34,65
30,63
25,61


In [None]:
# Identify entries with unrealistic age values
wrong_ages = ((df['Age'] < 15) | (df['Age'] > 80))

print(f"There are {wrong_ages.sum()} people with a wrong age variable.")

There are 8 people with a wrong age variable.
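
The same check can be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses made-up ages purely for illustration, with the same 15 to 80 range as the cell above.

```python
import pandas as pd

# Toy data (invented for this example), including implausible ages
toy = pd.DataFrame({'Age': [37, -1726, 5, 29, 99999, 80, 15]})

# between() keeps both boundaries, so 15 and 80 count as plausible
plausible = toy['Age'].between(15, 80)
wrong_ages_toy = ~plausible

# Equivalent to the explicit comparison used in the notebook
explicit_mask = (toy['Age'] < 15) | (toy['Age'] > 80)

print(f"Implausible ages in the toy data: {wrong_ages_toy.sum()}")
```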


In [None]:
# Create a new DataFrame without the invalid age entries
df_corrected_age = df[~wrong_ages]

df_corrected_age

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2015-09-12 11:17:21,26,male,United Kingdom,,No,No,Yes,,26-100,...,Somewhat easy,No,No,Some of them,Some of them,No,No,Don't know,No,
1255,2015-09-26 01:07:35,32,Male,United States,IL,No,Yes,Yes,Often,26-100,...,Somewhat difficult,No,No,Some of them,Yes,No,No,Yes,No,
1256,2015-11-07 12:36:58,34,male,United States,CA,No,Yes,Yes,Sometimes,More than 1000,...,Somewhat difficult,Yes,Yes,No,No,No,No,No,No,
1257,2015-11-30 21:25:06,46,f,United States,NC,No,No,No,,100-500,...,Don't know,Yes,No,No,No,No,No,No,No,
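
Selecting rows with a boolean mask returns an object that older pandas versions may treat as a slice of the original frame, which can make later in-place edits ambiguous. A minimal sketch on made-up data of taking an explicit copy so that later changes clearly apply to the filtered frame only:

```python
import pandas as pd

# Toy data (invented for this example)
toy = pd.DataFrame({'Age': [37, 200, 29], 'Gender': ['Female', 'M', 'male']})
mask = toy['Age'] > 80

# .copy() makes the filtered frame independent of `toy`
toy_filtered = toy[~mask].copy()
toy_filtered['Gender'] = toy_filtered['Gender'].str.lower()

# The original frame keeps its values; only the copy was modified
print(toy['Gender'].tolist())
print(toy_filtered['Gender'].tolist())
```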


In [None]:
# Check the number of unique values per column
df_corrected_age.nunique()

column,unique_values
Timestamp,1239
Age,45
Gender,46
Country,46
state,45
self_employed,2
family_history,2
treatment,2
work_interfere,4
no_employees,6


In [None]:
# Gender column check
df_corrected_age['Gender'].value_counts()

Gender,count
Male,612
male,204
Female,121
M,116
female,62
F,38
m,34
f,15
Make,4
Woman,3


In [None]:
# Correct inconsistencies in the Gender column

df_corrected_age.loc[:,'Gender'] = df_corrected_age['Gender'].str.lower()

# Replace different variations of male and female with consistent labels
# (replace on the Gender column only, not on the whole DataFrame)
df_corrected_age.loc[:,'Gender'] = df_corrected_age['Gender'].replace(dict.fromkeys(['m', 'man'], 'male'))
df_corrected_age.loc[:,'Gender'] = df_corrected_age['Gender'].replace(dict.fromkeys(['f', 'woman'], 'female'))

df_corrected_age['Gender'].value_counts()

Gender,count
male,968
female,240
make,4
cis male,3
male,3
female (trans),2
female,2
something kinda male?,1
guy (-ish) ^_^,1
cis man,1
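
The output above lists `male` and `female` more than once, which probably means some entries differ only by surrounding whitespace. A minimal sketch, on made-up values, of stripping whitespace before lowercasing and mapping; the `gender_map` dictionary is an assumption for illustration and would need extending after inspecting the real `value_counts()`.

```python
import pandas as pd

# Toy values (invented for this example), some with trailing spaces
toy_gender = pd.Series(['Male', 'male ', 'M', 'Woman', 'f', 'Female ', 'Cis Male'])

# Assumed mapping of short or alternative spellings to two labels
gender_map = {'m': 'male', 'man': 'male', 'f': 'female', 'woman': 'female'}

cleaned = (
    toy_gender
    .str.strip()          # remove surrounding whitespace
    .str.lower()          # unify case
    .replace(gender_map)  # replace values that match a key exactly
)

print(cleaned.value_counts())
```

Values without an exact match in the mapping, such as `cis male` here, are left unchanged, so they still need a separate decision.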


In [None]:
df = df_corrected_age

df.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,male,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [None]:
# Drop columns that are not used in the analysis
# (assigning the result avoids modifying a slice in place)
df = df.drop(columns=['Timestamp', 'state'])

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1251 entries, 0 to 1258
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1251 non-null   int64 
 1   Gender                     1251 non-null   object
 2   Country                    1251 non-null   object
 3   self_employed              1233 non-null   object
 4   family_history             1251 non-null   object
 5   treatment                  1251 non-null   object
 6   work_interfere             989 non-null    object
 7   no_employees               1251 non-null   object
 8   remote_work                1251 non-null   object
 9   tech_company               1251 non-null   object
 10  benefits                   1251 non-null   object
 11  care_options               1251 non-null   object
 12  wellness_program           1251 non-null   object
 13  seek_help                  1251 non-null   object
 14  anonymity    


## Questions

### Distribution of mental health conditions among different age groups in the tech industry

In [None]:
# Define age bins and corresponding labels for categorization
bins = [15, 25, 35, 45, 55, 65, 75, 80]
labels = ['15-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76-80']

# Segment the 'Age' column into the age ranges above; include_lowest = True keeps age 15 in the first bin
df.loc[:, 'Age_group'] = pd.cut(df['Age'], bins = bins, labels = labels, right = True, include_lowest = True)

# Filter to include only individuals who have received treatment
treated_people = df[df['treatment'] == 'Yes']

# Calculate each age group's share (in %) of all treated individuals
normalized_groups = (treated_people.groupby('Age_group', observed = False)['treatment'].count() * 100 / len(treated_people)).round(2)

normalized_groups

Age_group,share of treated respondents (%)
15-25,16.61
26-35,53.96
36-45,24.21
46-55,3.8
56-65,1.27
66-75,0.16
76-80,0.0
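
With `right = True`, `pd.cut` builds intervals that are open on the left, so a value equal to the first bin edge falls outside every bin unless `include_lowest=True` is passed. A short sketch on made-up ages, reusing the bins and labels from the cell above:

```python
import pandas as pd

# Toy ages (invented for this example) sitting on bin edges
toy_age = pd.Series([15, 25, 26, 35, 80])
bins = [15, 25, 35, 45, 55, 65, 75, 80]
labels = ['15-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76-80']

# Without include_lowest, the first interval is (15, 25], so 15 becomes NaN
without_lowest = pd.cut(toy_age, bins=bins, labels=labels, right=True)

# With include_lowest, the first interval also contains 15
with_lowest = pd.cut(toy_age, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({'Age': toy_age, 'without': without_lowest, 'with': with_lowest}))
```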


### How the frequency of mental health issues varies by gender


In [None]:
# Keep only 'male' and 'female' respondents, compute the share of each treatment answer within each gender, then unstack for display
gender_distribution = df[df['Gender'].isin(['male', 'female'])].groupby('Gender')['treatment'].value_counts(normalize = True).round(2).unstack()

gender_distribution

Gender,treatment = No,treatment = Yes
female,0.3,0.7
male,0.55,0.45
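
Row-normalized shares like these can also be produced with `pd.crosstab`, which may read more directly than a `groupby` followed by `unstack`. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (invented for this example)
toy = pd.DataFrame({
    'Gender': ['male', 'male', 'male', 'male', 'female', 'female'],
    'treatment': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes'],
})

# normalize='index' makes each row (gender) sum to 1
shares = pd.crosstab(toy['Gender'], toy['treatment'], normalize='index').round(2)

print(shares)
```

Unlike the `unstack` version, `crosstab` fills combinations that never occur with 0 rather than NaN.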


### Countries with the highest and lowest reported rates of mental health issues in Tech


In [None]:
# Keep only countries with more than 20 respondents
country_counts = df.groupby('Country').size()
country_mask = df['Country'].map(country_counts) > 20

# Compute the treatment shares per country, sorted by the share of 'Yes' answers
country_distribution = df[country_mask].groupby('Country')['treatment'].value_counts(normalize = True).round(2).unstack().fillna(0).sort_values(by = 'Yes', ascending = False)

country_distribution

Country,treatment = No,treatment = Yes
Australia,0.38,0.62
United States,0.45,0.55
Canada,0.49,0.51
United Kingdom,0.5,0.5
Ireland,0.52,0.48
Germany,0.53,0.47
Netherlands,0.67,0.33
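
The size filter can also be expressed with `groupby(...).transform('size')`, which attaches each country's respondent count to every row without a separate lookup. The sketch below uses made-up data and a hypothetical threshold of 1 so that the toy example stays small; the notebook itself uses a threshold of 20.

```python
import pandas as pd

# Toy data (invented for this example)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'treatment': ['Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})
threshold = 1  # hypothetical value chosen for the toy data

# Number of respondents in each row's country, aligned with the rows of `toy`
group_sizes = toy.groupby('Country')['treatment'].transform('size')
enough = toy[group_sizes > threshold]

# Share of 'Yes' answers per remaining country, highest first
treated_share = (
    enough['treatment'].eq('Yes')
    .groupby(enough['Country'])
    .mean()
    .round(2)
    .sort_values(ascending=False)
)

print(treated_share)
```

With only a few dozen respondents per country, these shares carry wide uncertainty, so a ranking like this is best read as descriptive.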
