Mini-project 1 : Machine Learning for Mental Health


What you will learn
Data loading and exploration with Pandas.
Preprocessing techniques including cleaning data.


Instructions:
Using the Mental Health dataset and what you have learned this week, answer the following questions:

What is the distribution of mental health conditions among different age groups in the tech industry?
How does the frequency of mental health issues vary by gender?
Identify the countries with the highest and lowest reported rates of mental health issues in the tech industry.


Resources
Use the Mental Health in Tech Survey dataset available on Kaggle.



Hint
Introduction to the Dataset:

Download the dataset from Kaggle.
Load the dataset using Pandas.
Perform initial exploration to understand the dataset structure: what is the distribution of the data? What types of data do I have?
Data Cleaning:

Identify and handle missing values.
Detect and correct any inconsistencies in the data.
Drop irrelevant columns if necessary.
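
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration on a small made-up DataFrame (the rows, the 18 to 100 age window, and the choice to drop `comments` are assumptions for the example, not values taken from the survey); the same calls would apply to the DataFrame loaded from survey.csv.

```python
import pandas as pd

# Toy stand-in for survey.csv (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Age": [37, 44, -1, 32, 329],
    "Gender": ["Female", "M", "male", "f", "Male "],
    "state": ["IL", None, None, "TX", None],
    "comments": [None, None, "n/a", None, None],
})

# 1. Identify missing values: number of nulls in each column
missing_counts = toy.isna().sum()

# 2. Detect inconsistencies: implausible ages and free-text label variants
plausible_age = toy["Age"].between(18, 100)  # assumed plausible range
n_bad_age = int((~plausible_age).sum())
gender_labels = sorted(toy["Gender"].str.strip().str.lower().unique())

# 3. Keep plausible rows and drop a column not needed for the questions
cleaned = toy.loc[plausible_age].drop(columns=["comments"])

print(missing_counts)
print(n_bad_age, gender_labels)
print(cleaned.shape)
```

Whether to drop, impute, or keep rows with missing or implausible values depends on the question being asked, so the filters here are only one possible choice.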


In [2]:
# What is the distribution of mental health conditions among different age groups in the tech industry?
import pandas as pd

mhdf = pd.read_csv('survey.csv')
print(mhdf.info())
print(mhdf.head())

#define age groups
age_groups = {
    '18-24': range(18, 25),
    '25-34': range(25, 35),
    '35-44': range(35, 45),
    '45-54': range(45, 55),
    '55-64': range(55, 65),
    '65+': range(65, 100)
}
#add age group column
mhdf['Age_Group'] = mhdf['Age'].apply(lambda x: next((k for k, v in age_groups.items() if x in v), None))

#mental health conditions distribution by age group
mh_age_group = mhdf.groupby('Age_Group')['treatment'].value_counts(normalize=True).unstack()
print(mh_age_group)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [7]:
mhdf['Age_Group'].value_counts()

Age_Group
25-34    707
35-44    320
18-24    156
45-54     51
55-64     15
65+        2
Name: count, dtype: int64

The distribution of treatment for mental health conditions is fairly even across most age groups in the tech industry: for five of the six age groups, the shares of "Yes" and "No" responses both sit close to 50%. The exception is the 55-64 age group, where a larger share of respondents report having sought treatment. This is only a broad view of where conditions may be more prevalent, and group sizes matter: the 55-64 group has only 15 respondents and the 65+ group only 2, so their percentages are far less reliable than those of the 25-34 group (707 respondents). To dig deeper, we should start with the outlier 55-64 group and try to understand why more of its members have sought treatment, then conduct a more general analysis of the other age groups, where the distribution is fairly even.

* Check how many people are in each age group (do a count per Age_Group).
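
One way to read shares and group sizes side by side is to compute both in a single table. The sketch below uses `pd.cut` as an alternative way to build the same age bins and runs on made-up rows (the toy data and the `summary`, `n`, and `share_yes` names are assumptions for the example); on the real data the same pattern would apply to `mhdf`.

```python
import pandas as pd

# Hypothetical rows standing in for the Age and treatment columns
toy = pd.DataFrame({
    "Age": [22, 24, 29, 31, 33, 41, 58, 60],
    "treatment": ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"],
})

# With right=False the bins are left-inclusive: [18, 25), [25, 35), ...
edges = [18, 25, 35, 45, 55, 65, 100]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
toy["Age_Group"] = pd.cut(toy["Age"], bins=edges, labels=labels, right=False)

# Per age group: number of respondents and share answering "Yes"
summary = (
    toy.assign(treated=toy["treatment"].eq("Yes"))
       .groupby("Age_Group", observed=True)["treated"]
       .agg(n="count", share_yes="mean")
)
print(summary)
```

Ages outside the 18 to 99 window fall outside every bin and become missing in `Age_Group`, which matches the `None` the dictionary-based lookup returns for such values.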

In [19]:
# How does the frequency of mental health issues vary by gender?

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Change the Gender column so that any variant of M maps to Male and any variant of F maps to Female
def clean_gender(gender):
    gender = gender.strip().lower()

    if gender in ['male', 'm', 'male ', 'cis male', 'cis man', 'guy (-ish) ^_^', 'man', 'mal', 'msle', 'mail', 'malr', 'make', 'maile']:
        return 'Male'
    elif gender in ['female', 'f', 'woman', 'femake', 'female ', 'cis female', 'cis-female/femme', 'female (cis)', 'femail']:
        return 'Female'
    elif gender in ['non-binary', 'enby', 'androgyne', 'genderqueer', 'agender', 'fluid', 'male leaning androgynous', 'something kinda male?', 'queer/she/they', 'queer']:
        return 'Non-Binary'
    elif gender in ['trans-female', 'trans woman', 'female (trans)']:
        return 'Male'  # Adjusted to treat trans females as male
    elif gender in ['nah', 'all', 'a little about you', 'ostensibly male, unsure what that really means', 'neuter', 'p']:
        return 'Other'
    else:
        return 'Other'

mhdf['Gender'] = mhdf['Gender'].apply(clean_gender)  # Apply the function to the 'Gender' column

# Mapping the cleaned gender to encoded values
# (clean_gender never returns 'Trans Female', so code 3 is currently unused)
mhdf['Encoded_Gender'] = mhdf['Gender'].map({'Male': 0, 'Female': 1, 'Non-Binary': 2, 'Trans Female': 3, 'Other': 4})

# Group the data by gender and mental health issues, and count the occurrences
gender_mental_health = mhdf.groupby(['Gender', 'treatment']).size().unstack()
print(gender_mental_health)

treatment    No  Yes
Gender              
Female       77  170
Male        541  453
Non-Binary    3    7
Other         1    7
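
Raw counts are hard to compare because the groups differ so much in size, so it may help to convert them to shares. The sketch below re-enters the counts from the table above by hand (the `n` and `share_yes` column names are assumptions for the example) and computes the share of each group that sought treatment.

```python
import pandas as pd

# Counts copied from the printed gender/treatment table
counts = pd.DataFrame(
    {"No": [77, 541, 3, 1], "Yes": [170, 453, 7, 7]},
    index=["Female", "Male", "Non-Binary", "Other"],
)

# Row-normalise: share of each gender group that sought treatment
counts["n"] = counts["No"] + counts["Yes"]
counts["share_yes"] = (counts["Yes"] / counts["n"]).round(3)
print(counts)
```

On these counts roughly 69% of Female respondents and 46% of Male respondents report having sought treatment. The Non-Binary and Other groups contain only 10 and 8 respondents, so their shares should be read with caution.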


In [3]:
mhdf

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,Age_Group
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,No,No,Some of them,Yes,No,Maybe,Yes,No,,35-44
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Maybe,No,No,No,No,No,Don't know,No,,35-44
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,No,No,Yes,Yes,Yes,Yes,No,No,,25-34
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,,25-34
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,No,No,Some of them,Yes,Yes,Yes,Don't know,No,,25-34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2015-09-12 11:17:21,26,male,United Kingdom,,No,No,Yes,,26-100,...,No,No,Some of them,Some of them,No,No,Don't know,No,,25-34
1255,2015-09-26 01:07:35,32,Male,United States,IL,No,Yes,Yes,Often,26-100,...,No,No,Some of them,Yes,No,No,Yes,No,,25-34
1256,2015-11-07 12:36:58,34,male,United States,CA,No,Yes,Yes,Sometimes,More than 1000,...,Yes,Yes,No,No,No,No,No,No,,25-34
1257,2015-11-30 21:25:06,46,f,United States,NC,No,No,No,,100-500,...,Yes,No,No,No,No,No,No,No,,45-54


In [None]:
import pandas as pd

mhdf = pd.read_csv('survey.csv')

mhdf['treatment'] = mhdf['treatment'].str.strip().str.lower()
mhdf['mental_health_issues'] = mhdf['treatment'] == 'yes'

country_avg = mhdf.groupby('Country', as_index=False)['mental_health_issues'].mean()
country_avg.rename(columns={'mental_health_issues': 'avg_mental_health_issues'}, inplace=True)
country_avg = country_avg.sort_values('avg_mental_health_issues', ascending=False)

print("Top 5 Countries with Highest Mental Health Issues:")
print(country_avg[['Country', 'avg_mental_health_issues']].head(5))

print("\nBottom 5 Countries with Lowest Mental Health Issues:")
print(country_avg[['Country', 'avg_mental_health_issues']].tail(5))

Top 5 Countries with Highest Mental Health Issues:
     Country  avg_mental_health_issues
24     Japan                       1.0
11   Croatia                       1.0
38  Slovenia                       1.0
27   Moldova                       1.0
13   Denmark                       1.0

Bottom 5 Countries with Lowest Mental Health Issues:
                   Country  avg_mental_health_issues
32             Philippines                       0.0
16                 Georgia                       0.0
30                 Nigeria                       0.0
4   Bosnia and Herzegovina                       0.0
31                  Norway                       0.0


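The treatment column is first normalized (whitespace stripped, text lowercased) and converted to a boolean that is True when the respondent sought treatment. Averaging that boolean per country gives the fraction of respondents in each country who sought treatment, which already lies between 0 and 1, so no further scaling is needed. Countries at 0 have the lowest rate and countries at 1 have the highest. A rate of exactly 0 or 1 typically means a country has very few respondents, so these rankings should be read alongside the number of respondents per country.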
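
A possible refinement is to report the number of respondents next to each rate and rank only countries above a minimum sample size. The sketch below runs on made-up rows; the country names, the `MIN_RESPONDENTS` threshold, and the `n` and `rate` column names are all assumptions chosen for illustration, not values from the survey.

```python
import pandas as pd

# Hypothetical rows standing in for the Country and treatment columns
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "treatment": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

MIN_RESPONDENTS = 3  # arbitrary illustrative threshold, not from the source

# Per country: number of respondents and fraction who sought treatment
by_country = (
    toy.assign(treated=toy["treatment"].str.strip().str.lower().eq("yes"))
       .groupby("Country")["treated"]
       .agg(n="count", rate="mean")
)

# Rank only the countries with enough respondents to be meaningful
reliable = (
    by_country[by_country["n"] >= MIN_RESPONDENTS]
    .sort_values("rate", ascending=False)
)
print(reliable)
```

In this toy example country C has a rate of 1.0 from a single respondent and is excluded from the ranking, which is the kind of case the threshold is meant to catch.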