This notebook is for Task 1: Mental Health in Technology-related jobs. 
Throughout this notebook I will document the required steps to complete this data analysis task and the unsupervised machine learning model utilized. 


Objectives: 
1.) Data Cleaning: Manage missing values, redundant or irrelevant columns, and non-standardized textual inputs. 
2.) Utilize an unsupervised machine learning model to cluster the data.
3.) Provide visualizations to gain insights into each cluster.

In [93]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile

Step 1: Download the dataset from Kaggle

In [94]:
# Download the dataset from Kaggle

api = KaggleApi()
api.authenticate()

kaggle_user = 'osmi'
kaggle_project = 'mental-health-in-tech-2016'
api.dataset_download_files(kaggle_user + '/' + kaggle_project)

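# Unzip the dataset into the working directory.
# A context manager closes the archive and avoids shadowing the built-in zip().
with zipfile.ZipFile(kaggle_project + '.zip') as archive:
    archive.extractall()

# Load the data; read_csv already returns a DataFrame, so no extra conversion is needed
tech_df = pd.read_csv('mental-heath-in-tech-2016_20161114.csv')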

Dataset URL: https://www.kaggle.com/datasets/osmi/mental-health-in-tech-2016


In [95]:
# Check the dataset is as expected: 63 columns and 1433 rows
tech_df.shape

(1433, 63)

In [96]:
# Display the first 5 rows of the dataset
tech_df.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes


Step 2: Rename Columns

In [97]:
tech_df.columns

Index(['Are you self-employed?',
       'How many employees does your company or organization have?',
       'Is your employer primarily a tech company/organization?',
       'Is your primary role within your company related to tech/IT?',
       'Does your employer provide mental health benefits as part of healthcare coverage?',
       'Do you know the options for mental health care available under your employer-provided coverage?',
       'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
       'Does your employer offer resources to learn more about mental health concerns and options for seeking help?',
       'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
       'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:',
       'Do you think that dis

In [98]:
# Rename the columns to short, descriptive names that are easier to read and reference.
new_column_names = ['self_employed', 'number_of_employees', 'tech_company', 'role_tech_related', 'mental_health_benefits', 'mental_health_coverage_awareness', 'mental_health_offical_communication', 
                    'mental_health_resources', 'mental_health_anonymity', 'asking_for_leave_for_mental_health', 'discussing_mental_health_with_employer', 'discussing_physical_health_with_employer',
                    'discussing_mental_health_with_coworker', 'discussing_mental_health_with_supervisor', 'physical_same_mental', 'neg_consequences_open_mental_health', 'private_state_medical_coverage_w_mental_health', 'online_local_resources_mental_health', 'reveal_mental_diagnoses_clients', 'revealed_diagnoses_clients_negative_consequences',
                    'reveal_mental_diagnoses_coworkers_employees', 'revealed_diagnoses_coworkers_negative_consequences', 'productivity_with_mental_health_issue', 'percentage_productivity_with_mental_health_issue', 'previous_employer', 'previous_employer_mental_health_benefits', 'previous_employer_mental_health_coverage_awareness', 'previous_employer_mental_health_official_communication', 'previous_employer_mental_health_resources', 'previous_employer_mental_health_anonymity', 'discussing_mental_health_with_previous_employer', 'discussing_physical_health_with_previous_employer', 'reveal_mental_diagnoses_previous_coworkers', 'reveal_mental_diagnoses_previous_supervisor', 'previous_employer_physical_same_mental', 'previous_employer_neg_consequences_open_mental_health', 'physical_health_interview', 'physical_health_interview_explanation', 'mental_health_interview', 'mental_health_interview_explanation', 'mental_health_career_impact', 'coworkers_perception_of_mental_health', 'share_mental_illness_loved_ones', 'current_or_previous_employment_unsupportive_response_mental_health', 'current_employment_less_likely_to_reveal_mental_health_issue', 'family_history_mental_health', 'past_history_mental_health', 'current_MH_disorder', 'current_diagnosed_MH_disorder', 'believed_MH_disorder', 'at_any_point_diagnosed_MH_disorder', 'name_history_diagnosed_MH_disorder', 'treatment_MH_disorder', 'interference_with_work_with_effective_treatment', 'interference_with_work_NOT_effective_treatment', 
                    'age', 'gender', 'country_residence', 'US_state_residence', 'country_work', 'US_state_work', 'work_position', 'remote_work']

tech_df.columns = new_column_names
tech_df.shape

(1433, 63)
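
Assigning a list to `tech_df.columns` is purely positional, so a misplaced name would silently mislabel a column. A minimal sketch of one way to review the pairing before renaming, using a small hypothetical frame (`toy_df`, `toy_names`, and `name_map` are illustrative names, not part of the dataset):

```python
import pandas as pd

# Hypothetical two-column frame standing in for the survey data
toy_df = pd.DataFrame({
    'Are you self-employed?': [0, 1],
    'What is your age?': [39, 29],
})
toy_names = ['self_employed', 'age']

# Positional assignment requires one new name per existing column
assert len(toy_names) == len(toy_df.columns)

# Pair each original question with its short name for a visual review
name_map = pd.Series(toy_names, index=toy_df.columns)
print(name_map)

# Renaming through an explicit mapping ties each new name to its question
toy_df = toy_df.rename(columns=name_map.to_dict())
```

Printing the same kind of mapping for the full 63-column list would make it easier to spot a short name attached to the wrong survey question.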

Step 3: Examine the pattern of missing values to judge whether each is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR), then replace the missing values with appropriate values.

- Insert 0 where the respondent was not required to answer because an earlier answer made the question not applicable. This notebook labels these values MNAR; because the missingness is fully determined by another observed answer (skip logic), it is closer to the usual definition of MAR.
- Insert "missing" where the respondent was required to answer the question but did not. Assumed to be MNAR. (The missingness may depend on the unanswered value itself.)
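
The two replacement rules above can be sketched on a small hypothetical frame. The column names `has_prev` and `prev_benefits` are illustrative stand-ins, not columns from the survey, and the check that missing answers coincide with the gating column is an assumption about how the skip logic could be verified:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'has_prev' decides whether 'prev_benefits' was asked.
# dtype=object lets the column hold both text answers and the integer 0.
toy = pd.DataFrame({
    'has_prev': [1, 0, 1, 0],
    'prev_benefits': pd.Series(['Yes', np.nan, 'No', np.nan], dtype=object),
    'gender': ['Male', 'female', np.nan, 'Male'],
})

# Skip-logic check: is every missing 'prev_benefits' explained by has_prev == 0?
explained = toy.loc[toy['prev_benefits'].isna(), 'has_prev'].eq(0).all()
print('Missingness explained by skip logic:', explained)

# Rule 1: questions that were not applicable get 0
toy['prev_benefits'] = toy['prev_benefits'].fillna(0)

# Rule 2: questions that were required but left blank get the label "missing"
toy['gender'] = toy['gender'].fillna('missing')
```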

In [99]:
# Understand the patterns of missing values
missing_values = tech_df.isnull().sum()
missing_values

self_employed                0
number_of_employees        287
tech_company               287
role_tech_related         1170
mental_health_benefits     287
                          ... 
US_state_residence         593
country_work                 0
US_state_work              582
work_position                0
remote_work                  0
Length: 63, dtype: int64
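
Several columns share the same missing count (287), which would be consistent with one group of respondents skipping all of them. A minimal sketch, on hypothetical toy data, of how missing counts could be broken down by the `self_employed` flag to test that idea:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: self-employed rows lack company-size answers
toy = pd.DataFrame({
    'self_employed': [0, 0, 1, 1, 0],
    'number_of_employees': ['26-100', '6-25', np.nan, np.nan, '6-25'],
    'age': [39, 29, 43, 38, 43],
})

# Count missing values per column within each self_employed group
missing_by_group = (
    toy.drop(columns='self_employed')
       .isna()
       .groupby(toy['self_employed'])
       .sum()
)
print(missing_by_group)
```

If a column's missing values all fall in one group, the gap is structural (the question was never asked) rather than a refusal to answer.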

In [100]:
# Remove rows and columns related to self-employed workers. Including them could introduce bias, and they are not relevant to an analysis of mental health at tech-related companies with multiple employees.

# .copy() gives an independent DataFrame, which avoids a SettingWithCopyWarning when columns are assigned later
tech_df = tech_df[tech_df['self_employed'] == 0].copy()

# These columns were only answered by self-employed workers and are not relevant to the remaining dataset.

columns_to_delete = ['online_local_resources_mental_health', 'productivity_with_mental_health_issue', 'revealed_diagnoses_coworkers_negative_consequences', 'private_state_medical_coverage_w_mental_health', 'reveal_mental_diagnoses_coworkers_employees', 'reveal_mental_diagnoses_clients', 'percentage_productivity_with_mental_health_issue', 'revealed_diagnoses_clients_negative_consequences']

tech_df = tech_df.drop(columns=columns_to_delete)

In [101]:
# Count respondents without a previous employer (previous_employer == 0); expected: 131
tech_df['previous_employer'].value_counts()

previous_employer
1    1015
0     131
Name: count, dtype: int64

In [102]:
# These columns have missing values because the questions do not apply to respondents without a previous employer. Replace with 0. MNAR.

columns_fill_na = ['previous_employer_neg_consequences_open_mental_health', 
    'discussing_physical_health_with_previous_employer', 'previous_employer_mental_health_resources',
    'previous_employer_mental_health_coverage_awareness', 'previous_employer_physical_same_mental',
    'previous_employer_mental_health_official_communication', 'previous_employer_mental_health_anonymity',
    'discussing_mental_health_with_previous_employer', 'previous_employer_mental_health_benefits',
    'reveal_mental_diagnoses_previous_coworkers', 'reveal_mental_diagnoses_previous_supervisor']

tech_df[columns_fill_na] = tech_df[columns_fill_na].fillna(0)
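
Note that `fillna(0)` on the whole column fills every missing value, including any left by respondents who do have a previous employer. A minimal sketch of a narrower variant, on hypothetical toy data, that fills only the rows where `previous_employer == 0`; whether any such extra gaps exist in the real data is not shown here and would need checking:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has no previous employer, row 2 skipped the question
toy = pd.DataFrame({
    'previous_employer': [1, 0, 1],
    'previous_employer_mental_health_benefits': pd.Series(
        ['Yes', np.nan, np.nan], dtype=object
    ),
})
cols = ['previous_employer_mental_health_benefits']

# Fill only where the question was not applicable
no_prev = toy['previous_employer'] == 0
toy.loc[no_prev, cols] = toy.loc[no_prev, cols].fillna(0)

# Any missing values that remain were skipped by respondents who were asked
remaining = toy[cols].isna().sum()
print(remaining)
```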

In [103]:
# The gender column has 3 missing values. Replace them with "missing"; assumed to be MNAR, possibly reflecting discomfort with answering the question.

tech_df['gender'] = tech_df['gender'].fillna('missing')

In [104]:
# Re-check the missing value counts after the replacements above
missing_values = tech_df.isnull().sum()
missing_values

self_employed                                                           0
number_of_employees                                                     0
tech_company                                                            0
role_tech_related                                                     883
mental_health_benefits                                                  0
mental_health_coverage_awareness                                      133
mental_health_offical_communication                                     0
mental_health_resources                                                 0
mental_health_anonymity                                                 0
asking_for_leave_for_mental_health                                      0
discussing_mental_health_with_employer                                  0
discussing_physical_health_with_employer                                0
discussing_mental_health_with_coworker                                  0
discussing_mental_health_with_supervis