# **Mental_Health_Survey-2014-2016**



# **Project Summary -**

This project focuses on performing an exploratory data analysis (EDA) on a workplace mental health survey dataset to understand patterns, trends, and relationships influencing mental health treatment among professionals.

The dataset contains 1,259 responses with a mix of demographic, workplace, and mental-health-related attributes, such as age, gender, work interference, family history, benefits, and treatment status.

Key findings:

1. Employees whose mental health interferes with work are significantly more likely to seek treatment

2. Individuals with a family history of mental illness show higher treatment rates

3. Availability of mental health benefits positively influences treatment-seeking behavior

4. The dataset is demographically and geographically skewed, which may affect generalization

**Tools & Technologies**

1. Python

2. Pandas

3. Seaborn & Matplotlib

4. Jupyter Notebook




# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

Answer Here.

# ***Let's Begin !***

## ***1. Know Your Data***

### Dataset Information

In [None]:
# Dataset Info
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("/content/survey.csv")


In [None]:
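# Only the last expression in a cell is shown automatically, so output each result explicitly
print(df.shape)
display(df.head())  # display() is provided by Jupyter/IPython
df.info()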


In [None]:
df.isnull().sum()


In [None]:
df.isnull().sum()[df.isnull().sum() > 0]


In [None]:
# Percentage of missing values, for columns that have any.
# The outer parentheses matter: without them the mask would index the literal 100.
((df.isnull().sum() / len(df)) * 100)[df.isnull().sum() > 0]
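
The count and percentage checks above can be combined into one table. This is a minimal sketch: `missing_summary` is a hypothetical helper name, and the small frame is made-up data rather than the survey.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame, NOT the survey data
toy = pd.DataFrame({
    "Age": [25, 31, np.nan, 40],
    "state": [None, "CA", None, None],
    "treatment": ["Yes", "No", "No", "Yes"],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(df)`.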


In [None]:
df['Age'].describe()


In [None]:
df['self_employed'].value_counts()


In [None]:
df[(df['Age'] < 18) | (df['Age'] > 100)]


In [None]:
# Treat implausible ages as missing; they are filled with the median later
df.loc[(df['Age'] < 18) | (df['Age'] > 100), 'Age'] = np.nan


In [None]:
df['Age'].describe()


In [None]:
df['self_employed'].value_counts()

In [None]:
# Only 18 of 1259 values (~1.4%) are missing, so filling with the mode is reasonable.
# The mode, df['self_employed'].mode()[0], is 'No'.
df['self_employed'] = df['self_employed'].fillna('No')
df['self_employed'].value_counts()  # 1113 + 146 = 1259, so the nulls were replaced

In [None]:
df['work_interfere'].isnull().sum()

In [None]:
df['work_interfere'].value_counts()

In [None]:
df['work_interfere'].mode()[0]

In [None]:
# 264 of 1259 values (~21%) are missing, which is too many to fill with the mode:
# that would assume all 264 respondents feel exactly like the most common group.
# Keep them as a separate 'Unknown' category instead.
df['work_interfere'] = df['work_interfere'].fillna('Unknown')
df['work_interfere'].value_counts()
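
The effect of the two fill strategies on a category's share can be sketched on a small made-up column. The values below are hypothetical, not survey responses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, NOT the survey data
s = pd.Series(["Sometimes", "Never", "Sometimes", np.nan, np.nan, "Often"])

mode_filled = s.fillna(s.mode()[0])  # pretends every missing answer is the most common one
flag_filled = s.fillna("Unknown")    # keeps missing answers as their own category

# Share of 'Sometimes' under each approach
share_answered = s.value_counts(normalize=True)["Sometimes"]        # answered rows only
share_mode = mode_filled.value_counts(normalize=True)["Sometimes"]  # inflated by imputation
share_flag = flag_filled.value_counts(normalize=True)["Sometimes"]  # no answers invented

print(share_answered, share_mode, share_flag)
```

Mode imputation raises the share of the most common answer, while the `'Unknown'` category leaves the observed counts untouched.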

In [None]:
# State
df['state'].value_counts()

In [None]:
# 515 of 1259 values (~41%) are missing. State is location-based, not opinion-based:
# filling with the mode would claim that 515 more people live in California,
# which is clearly wrong. Use 'Not Specified' instead.
df['state'].isnull().sum()


In [None]:
df['state'] = df['state'].fillna('Not Specified')
df['state'].isnull().sum()

In [None]:
# 1095 of 1259 values (~86.97%) are missing: almost 9 out of 10 people did not write a comment
df['comments'].isnull().sum()


In [None]:
# Dropping this column would also be reasonable, since comments are not used in the analysis.
# It is kept here with the missing values filled, which leaves that option open.
# df.drop(columns=['comments'], inplace=True)
df['comments'] = df['comments'].fillna('No Comment')
df['comments'].isnull().sum()


In [None]:
df.info()

In [None]:
df.isnull().sum()[df.isnull().sum() > 0]

In [None]:
# Fill the ages set to missing above with the median, which is robust to outliers (unlike the mean)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Age'].isnull().sum()
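
The two Age steps (mark implausible values as missing, then fill with the median) can be sketched end to end on made-up numbers. The 18 to 100 range matches the filter used earlier; the ages themselves are hypothetical.

```python
import pandas as pd

# Hypothetical ages with two implausible entries, NOT survey data
age = pd.Series([25, -5, 32, 999, 40], dtype="float64")

# The mean is pulled far from typical values by the outliers; the median is not
print(age.mean(), age.median())

# Keep values in the plausible range and turn the rest into NaN
cleaned = age.where(age.between(18, 100))

# Fill the gaps with the median of the remaining valid ages
filled = cleaned.fillna(cleaned.median())
print(filled.tolist())
```

`Series.where` keeps values where the condition holds and sets the others to NaN, so it does the masking in a single expression.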

In [None]:
df.info()

In [None]:
df.select_dtypes(include=['int64', 'float64']).columns


In [None]:
for col in df.columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(dropna=False))


In [None]:
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        print(f"\n{col}")
        print(df[df[col] < 0])


In [None]:
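# Normalise the free-text entries, then match keywords. Order matters: 'male' also
# matches 'female' and 'man' matches 'woman', so Female must be assigned after Male.
gender = df['Gender'].astype(str).str.strip().str.lower()
df['Gender_cleaned'] = 'Other'

# Single-letter entries ('m', 'f'), if present, contain no keyword and are matched exactly
df.loc[gender.str.contains('male|man') | gender.eq('m'), 'Gender_cleaned'] = 'Male'
df.loc[gender.str.contains('female|woman') | gender.eq('f'), 'Gender_cleaned'] = 'Female'
df.loc[gender.str.contains('trans|non-binary|non binary|queer|enby'), 'Gender_cleaned'] = 'Non-binary'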
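
Keyword matching on free text is order-sensitive, which a few made-up entries make visible. In this sketch `clean_gender` is a hypothetical helper and the sample values are illustrative, not taken from the survey.

```python
import pandas as pd

# Hypothetical free-text entries, NOT taken from the survey
raw = pd.Series(["Male", "female", "Woman", "M", "f", "Trans woman", None])

# 'male' is a substring of 'female' and 'man' of 'woman', so this pattern
# also flags the female entries; single letters match nothing
male_hits = raw.str.contains("male|man", case=False, na=False).tolist()
print(male_hits)

def clean_gender(value):
    """Map one free-text entry to Male / Female / Non-binary / Other."""
    if not isinstance(value, str):
        return "Other"
    v = value.strip().lower()
    # Most specific groups are checked first so they are not swallowed by 'male'/'man'
    if any(k in v for k in ("trans", "non-binary", "non binary", "queer", "enby")):
        return "Non-binary"
    if v == "f" or "female" in v or "woman" in v:
        return "Female"
    if v == "m" or "male" in v or "man" in v:
        return "Male"
    return "Other"

cleaned = raw.map(clean_gender).tolist()
print(cleaned)
```

Checking the most specific group first gives the same result as assigning it last with `df.loc`, and keeps the precedence explicit in one place.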


In [None]:
df['Gender'].value_counts()

In [None]:
df['Gender_cleaned'].value_counts()

In [None]:
# Now Drop the Gender Column
df.drop(columns=['Gender'], inplace=True)

In [None]:
df.info()

### Data Visualization, Storytelling & Experimenting with Charts: Understand the Relationships Between Variables


### UNIVARIATE ANALYSIS

In [None]:
df['Age'].describe()


In [None]:
df['Gender_cleaned'].value_counts(normalize=True) * 100


In [None]:
plt.figure(figsize=(8, 5))

sns.histplot(
    df['Age'],
    bins=30,
    kde=True,
    color='#F4A6B1',   # Pale pink
    edgecolor='white'
)

plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Gender_cleaned', data=df,color='#F4A6B1')
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


In [None]:
df['Country'].value_counts(normalize=True).head(10) * 100


In [None]:
print(list(df.columns))

In [None]:
# Drop any column whose name is an empty string
df = df.loc[:, df.columns != '']


In [None]:

# 'Country' in df.columns  # should be True
top_countries = df['Country'].value_counts().head(10).index

plt.figure(figsize=(10,5))
sns.countplot(y='Country', data=df, order=top_countries,color='#F4A6B1')
plt.title('Top 10 Countries by Number of People')
plt.xlabel('Count')
plt.ylabel('Country')
plt.show()


In [None]:
# Share of respondents who sought treatment (%)
df['treatment'].value_counts(normalize=True) * 100


In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='treatment', data=df, color='#F4A6B1')
plt.title('Mental Health Treatment Distribution')
plt.xlabel('Treatment')
plt.ylabel('Count')
plt.show()


In [None]:
# For each column: describe() / value_counts() first, then the visualization.
# In univariate analysis, I first examine statistical summaries using describe() and value_counts(),
# then validate the patterns visually using Seaborn plots to understand the distributions.

### BIVARIATE ANALYSIS

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='treatment', y='Age', data=df, color='#F4A6B1')
plt.title('Age vs Mental Health Treatment')
plt.xlabel('Treatment')
plt.ylabel('Age')
plt.show()
# People who sought treatment tend to be slightly older, suggesting age may influence mental health awareness.

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Gender_cleaned',hue='treatment',data=df,palette=['#F4A6B7', '#EAEAEA'])
plt.title('Gender vs Treatment')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='work_interfere', hue='treatment', data=df,palette=['#F4A6B7', '#EAEAEA'])
plt.title('Work Interference vs Treatment')
plt.xlabel('Work Interference')
plt.ylabel('Count')
plt.show()


In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='family_history', hue='treatment', data=df, palette=['#F4A6B7', '#EAEAEA'])
plt.title('Family History vs Treatment')
plt.xlabel('Family History')
plt.ylabel('Count')
plt.show()


In [None]:
plt.figure(figsize=(7,4))
sns.countplot(x='benefits', hue='treatment', data=df, palette=['#F4A6B7', '#EAEAEA'])
plt.title('Company Benefits vs Treatment')
plt.xlabel('Benefits')
plt.ylabel('Count')
plt.show()
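
The count plots above compare raw counts, which are hard to read when group sizes differ. A per-group treatment rate is one possible complement. This is a minimal sketch: `treatment_rate` is a hypothetical helper, and the toy frame is made-up data that only reuses the `benefits` and `treatment` column names.

```python
import pandas as pd

def treatment_rate(frame, by, target="treatment"):
    """Percentage of 'Yes' in `target` within each level of `by`, plus group size."""
    is_yes = frame[target].eq("Yes")
    out = is_yes.groupby(frame[by]).agg(rate="mean", n="count")
    out["rate"] = (out["rate"] * 100).round(1)
    return out.sort_values("rate", ascending=False)

# Hypothetical toy data, NOT the survey responses
toy = pd.DataFrame({
    "benefits":  ["Yes", "Yes", "No", "No", "No", "Yes"],
    "treatment": ["Yes", "No",  "No", "No", "Yes", "Yes"],
})
rates = treatment_rate(toy, "benefits")
print(rates)
```

On the cleaned survey frame, the same helper could be called as `treatment_rate(df, 'work_interfere')` or `treatment_rate(df, 'family_history')`.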


### MULTIVARIATE ANALYSIS

In [None]:
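# Work on a copy so the cleaned DataFrame stays intact for other cells
df_corr = df.copy()
df_corr.drop(columns=['Country', 'state', 'comments'], inplace=True)

# Encode Yes/No style answers as 1/0; any answer not in the map becomes NaN
binary_map = {
    'Yes': 1, 'No': 0,
    'True': 1, 'False': 0,
    'Male': 1, 'Female': 0
}
for col in df_corr.columns:
    if df_corr[col].dtype == 'object':
        df_corr[col] = df_corr[col].map(binary_map)

plot_cols = ['Age', 'family_history', 'treatment', 'benefits', 'seek_help']
# Drop rows with unmapped answers so every correlation uses the same respondents
df_plot = df_corr[plot_cols].dropna()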
plt.figure(figsize=(10, 8))
sns.heatmap(
    df_plot.corr(),
    annot=True,
    cmap='coolwarm',
    fmt='.2f'
)
plt.title('Correlation Heatmap')
plt.show()
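
When reading the heatmap, it helps to know that the Pearson correlation between two 0/1 columns is the phi coefficient of their 2x2 table. The sketch below checks this on made-up data; the values are hypothetical and only the column names come from the dataset.

```python
import math
import pandas as pd

# Hypothetical 0/1 encoded toy data, NOT the survey responses
toy = pd.DataFrame({
    "family_history": [1, 1, 1, 0, 0, 0, 0, 1],
    "treatment":      [1, 1, 0, 0, 0, 1, 0, 1],
})

# Pearson correlation, as computed by DataFrame.corr() for the heatmap
pearson = toy["family_history"].corr(toy["treatment"])

# Phi coefficient from the 2x2 contingency table
table = pd.crosstab(toy["family_history"], toy["treatment"])
n11, n10 = table.loc[1, 1], table.loc[1, 0]
n01, n00 = table.loc[0, 1], table.loc[0, 0]
phi = (n11 * n00 - n10 * n01) / math.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

print(pearson, phi)
```

Cells that involve `Age` mix a continuous variable with a binary one, so they are not phi coefficients and should be read separately.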

## **5. Solution to Business Objective**

# **Conclusion**

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***