# **Project Name**    - Mental Health EDA Project



##### **Project Type**    - Cleaning data on excel and EDA
##### **Contribution**    - Individual
##### **By -** Manne Kovidha


# **Project Summary -**

This project focuses on analyzing mental health conditions among individuals working in the technology industry, using survey data collected from tech workers across various countries. The primary goal is to understand how mental health issues are distributed across demographic groups and how workplace environments affect an individual’s likelihood to seek help or receive support.

The dataset includes variables like age, gender, country, work type (remote or not), company size, availability of mental health benefits, openness to mental health discussions, and whether individuals have sought treatment. With these factors, an Exploratory Data Analysis (EDA) was performed using Univariate, Bivariate, and Multivariate (UBM) techniques.

Univariate analysis showed that most survey participants are between 25–35 years of age and predominantly male. Most respondents come from the United States, indicating that results may be more representative of Western workplace cultures. Many people have sought treatment for mental health issues, suggesting mental health problems are not rare in the tech industry.

Bivariate analysis revealed how different variables interact. For instance, older individuals were more likely to seek treatment, and people who experienced interference in their work due to mental health were more likely to seek help. Gender also played a role — females and gender minorities showed slightly higher treatment-seeking behavior than males. People working in larger organizations reported better access to care options and benefits, while smaller companies often lacked these resources.

Multivariate analysis deepened this understanding by looking at three or more variables at a time. For example, comparing age, gender, and treatment status helped reveal that younger males were less likely to seek treatment. A comparison of country, remote work status, and availability of mental health benefits showed big differences between countries, indicating that workplace mental health support is not consistent across regions.

From a business perspective, this analysis highlights several important takeaways. Many employees are unsure whether their companies offer mental health benefits, which shows a gap in communication from employers. Supervisor and coworker support strongly influence whether someone feels safe talking about mental health. Companies that do not actively support employee well-being may face lower productivity, higher turnover, and a more stressed workforce.

In conclusion, this EDA project helps us understand not only where mental health issues are most frequent but also which workplace factors encourage or discourage individuals from seeking help. Companies that take this data seriously can design better mental health policies, improve employee satisfaction, and promote a healthier work culture.



# **GitHub Link -**

GitHub Link- Click here: https://github.com/Hiiiiii10/Mental-Health-EDA-Project

# **Problem Statement**


Mental health issues are rising in the tech industry, but many companies are unaware of how factors like geography, work culture, and support systems influence whether employees seek treatment or suffer silently. Without proper data, it’s difficult for employers to take informed action.

#### **Define Your Business Objective?**

To identify patterns in mental health-related attitudes and behaviors among tech employees based on geographic and workplace variables, in order to help companies design better mental health policies and improve overall employee well-being.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://raw.githubusercontent.com/Hiiiiii10/Mental-Health-EDA-Project/main/Mental%20Health%20Survey%20cleaned.csv"

The dataset is read from the copy that I uploaded to GitHub along with this EDA file. The file has already been checked and cleaned in Excel: the comments column was removed, the gender entries were checked for spelling errors and standardized accordingly, and the timestamp was split into two separate columns, date and time. Duplicate values were also addressed.
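For readers who prefer to reproduce the Excel cleaning in code, the steps described above could be sketched in pandas roughly as follows. This is only an illustration on toy data: the rows, the `raw` and `cleaned` names, and the choice to drop the original `Timestamp` column after splitting it are assumptions, not the actual Excel procedure.

```python
import pandas as pd

# Toy stand-in for the raw survey export (values are illustrative only).
raw = pd.DataFrame({
    "Timestamp": ["2014-08-27 11:29:31", "2014-08-27 11:29:37", "2014-08-27 11:29:37"],
    "Age": [37, 44, 44],
    "Gender": ["Female", "M", "M"],
    "comments": [None, "n/a", "n/a"],
})

# 1. Drop the free-text comments column.
cleaned = raw.drop(columns=["comments"])

# 2. Split the timestamp into separate date and time columns.
ts = pd.to_datetime(cleaned["Timestamp"])
cleaned["date"] = ts.dt.date
cleaned["time"] = ts.dt.time
cleaned = cleaned.drop(columns=["Timestamp"])  # assumption: original column not kept

# 3. Remove exact duplicate rows.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Gender standardization is the remaining step; it depends on the spellings actually present in the raw file, so it is not reproduced here.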

### Dataset Loading

In [None]:
# Load Dataset from the GitHub URL defined above
try:
    df = pd.read_csv(url)
    print("Data loaded successfully.")
except Exception as e:
    # Report the problem and stop: every later cell depends on df
    print("Error loading file:", e)
    raise

The dataset is loaded.

### Dataset First View

In [None]:
# Dataset First Look
df.head()

This shows the first few records of the mental health survey dataset. Each record begins with the respondent's age, gender, and country, followed by mental-health-related fields such as whether treatment has been sought.

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

**Dataset dimensions:** the cleaned dataset has 1258 rows and 28 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

The dataset contains columns such as timestamp, age, and gender, among others.
The output shows the number of non-null cells in each column and its data type (every column except age is of type object).

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_values = df.isnull().sum()

# Columns with missing data
print("🧾 Missing Values Summary:")
print(missing_values[missing_values > 0])

# Plot missing values using seaborn heatmap (seaborn and matplotlib are already imported above)

plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, cmap='Reds', yticklabels=False)
plt.title("Heatmap of Missing Values in Dataset")
plt.show()
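A heatmap shows where the gaps are, but not how large they are relative to the dataset. As a complementary check, the share of missing values per column can be computed directly. The sketch below uses a small toy frame (the values are made up for illustration); in the notebook the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; in the notebook this would be df.
toy = pd.DataFrame({
    "age": [29, 41, 35, 27],
    "state": ["CA", np.nan, np.nan, "NY"],
    "work_interfere": ["Often", "Rarely", np.nan, "Never"],
})

# Percentage of missing cells per column, largest first, complete columns omitted.
missing_pct = (toy.isnull().mean() * 100).round(1)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_pct)
```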

### What did you know about your dataset?

This dataset is from a 2014 survey conducted in the tech industry to understand attitudes toward mental health and the frequency of mental health disorders at the workplace. It includes demographic information, work conditions, and questions about mental health support, treatment, and stigma, helping explore how these factors influence mental health outcomes in tech environments.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
column_info = pd.DataFrame({
    'Column Name': df.columns,
    'Data Type': df.dtypes.values
})

column_info

In [None]:
# Dataset Describe
df.describe()

### Variable Description

### Mental Health Survey Dataset

| Variable                      | Description                                                                 |
|------------------------------|-----------------------------------------------------------------------------|
| `Timestamp`                  | Date and time the survey response was submitted.                           |
| `Age`                        | Age of the respondent (in years).                                          |
| `Gender`                     | Self-identified gender of the respondent.                                  |
| `Country`                    | Country of residence.                                                      |
| `state`                      | State or region (if applicable).                                           |
| `self_employed`              | Whether the respondent is self-employed.                                   |
| `family_history`             | Whether the respondent has a family history of mental illness.             |
| `treatment`                  | Whether the respondent has sought treatment for mental health issues.      |
| `work_interfere`             | How mental health issues interfere with work.                              |
| `no_employees`               | Size of the company the respondent works for.                              |
| `remote_work`                | Whether the respondent works remotely.                                     |
| `tech_company`               | Whether the employer is a tech company.                                    |
| `benefits`                   | Whether the employer provides mental health benefits.                      |
| `care_options`               | Availability of mental health care options at the workplace.               |
| `wellness_program`           | Whether a wellness program exists at the workplace.                        |
| `seek_help`                  | Whether the employer encourages seeking mental health help.                |
| `anonymity`                  | Whether anonymity is protected when seeking mental health help.            |
| `leave`                      | Ease of taking leave for mental health reasons.                            |
| `mental_health_consequence` | Perceived consequence of disclosing mental health issues at work.          |
| `phys_health_consequence`   | Perceived consequence of disclosing physical health issues at work.        |
| `coworkers`                  | Comfort discussing mental health with coworkers.                           |
| `supervisor`                 | Comfort discussing mental health with supervisors.                         |
| `mental_health_interview`   | Willingness to discuss mental health in a job interview.                   |
| `phys_health_interview`     | Willingness to discuss physical health in a job interview.                 |
| `mental_vs_physical`        | Belief about how mental and physical health are valued comparatively.      |
| `obs_consequence`           | Observed negative consequences of discussing mental health at the workplace.|
| `comments`                  | Open-ended additional comments from the respondent (optional).             |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"\nUnique values in '{col}':")
    print(df[col].unique())

**Unique Values for Each Variable**

After cleaning the data, the number of unique values of gender is four: Male, Female, Other and Unknown.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Clean column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

df['age'] = pd.to_numeric(df['age'].astype(str).str.strip(), errors='coerce')


# Remove ages outside 15 to 100
df = df[(df['age'] >= 15) & (df['age'] <= 100)]

# Row count after
print("After removing outliers:", df.shape)

# Summary stats
print(df['age'].describe())

# Check raw age values
print("Unique raw age values:")
print(df['age'].unique())

This data wrangling code removes the rows that contain age outliers. The raw column includes values such as -1 and implausibly large numbers (for example 9999), which are unrealistic and should therefore be excluded from the analysis. The unique values printed above no longer include the negative or very large ages that appeared in the earlier unique-values output, which confirms that the filter worked.
The gender data was already cleaned in Excel, so it is correctly standardized into four categories: Male, Female, Other, and Unknown.
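To make the effect of the age filter concrete, here is a minimal sketch on toy data. The sample values are invented for illustration; the conversion and the 15 to 100 range mirror the wrangling code above. Note that entries that cannot be parsed become `NaN` and are dropped by the range check as well.

```python
import pandas as pd

# Illustrative raw ages, including whitespace, a negative, a huge value and text.
toy = pd.DataFrame({"age": ["32", " 45 ", "-1", "9999", "abc", "15"]})

# Same conversion as in the notebook: strip whitespace, coerce bad entries to NaN.
toy["age"] = pd.to_numeric(toy["age"].astype(str).str.strip(), errors="coerce")

# between() is inclusive on both ends and evaluates to False for NaN.
valid = toy["age"].between(15, 100)
filtered = toy[valid]

print("Rows dropped:", int((~valid).sum()))
print(filtered["age"].tolist())
```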

### What all manipulations have you done and insights you found?

**Cleaned the age column**

➤ Converted all age values to numeric format using pd.to_numeric() after stripping surrounding whitespace; entries that cannot be parsed are coerced to NaN.

➤ Removed rows where age was unrealistic (outside the 15–100 range), including negative and corrupted entries.

**Cleaned the gender column**

➤ Mapped inconsistent gender entries (e.g., "M", "maile", "cis male", "F", "femail", etc.) into standard categories: Male, Female, Other, and Unknown. This step was done in Excel before the file was loaded.
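The gender mapping was performed in Excel, but an equivalent could be sketched in Python as below. The `standardize_gender` helper and the `MALE` and `FEMALE` spelling sets are hypothetical; they only cover the examples mentioned above plus a few obvious variants, and the real raw file may need more entries.

```python
import pandas as pd

# Hypothetical spelling sets; extend them to match the raw data.
MALE = {"m", "male", "maile", "cis male", "man"}
FEMALE = {"f", "female", "femail", "cis female", "woman"}

def standardize_gender(value):
    """Map a free-text gender entry to Male, Female, Other or Unknown."""
    if pd.isna(value):
        return "Unknown"
    text = str(value).strip().lower()
    if text == "":
        return "Unknown"
    if text in MALE:
        return "Male"
    if text in FEMALE:
        return "Female"
    return "Other"

raw_values = ["M", "maile", "Cis Male", "F", "femail", "non-binary", None, "  "]
standardized = pd.Series([standardize_gender(v) for v in raw_values])

print(standardized.value_counts())
```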

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Distribution of Age **(Univariate Analysis: charts 1-8)**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(df['age'], bins=30, kde=True)
plt.title('Age Distribution of Respondents')
plt.show()

##### 1. Why did you pick the specific chart?

Histogram best visualizes a single continuous variable.

##### 2. What is/are the insight(s) found from the chart?

Most respondents are aged 25–35, a key workforce demographic.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Suggests which age group to target for mental health initiatives.

#### Chart - 2: Gender Distribution

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='standardized_gender')
plt.title('Gender Distribution')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Countplot shows frequency of categorical values.

##### 2. What is/are the insight(s) found from the chart?

Most respondents identify as male.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Programs must address gender disparities in awareness.

#### Chart - 3: Country-wise Participation

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
top_countries = df['country'].value_counts().nlargest(10)
top_countries.plot(kind='bar')
plt.title('Top 10 Respondent Countries')
plt.ylabel('Number of Respondents')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares the number of respondents across countries.

##### 2. What is/are the insight(s) found from the chart?

The US dominates the response pool, followed by the United Kingdom and Canada.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights may be skewed toward US work culture and therefore might not be accurate and relevant to other countries.

#### Chart - 4: Treatment History

In [None]:
# Chart - 4 visualization code
sns.countplot(x='treatment', data=df)
plt.title('Have You Sought Mental Health Treatment?')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot suits a binary response, and this question has only two options: yes or no.

##### 2. What is/are the insight(s) found from the chart?

 More than half have sought treatment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indicates a high need for support programs, because roughly half of the respondents have still not considered or taken treatment. No negative impact.

#### Chart - 5: Remote Work

In [None]:
# Chart - 5 visualization code
sns.countplot(x='remote_work', data=df)
plt.title('Do You Work Remotely?')
plt.show()

##### 1. Why did you pick the specific chart?

Understand remote working status.

##### 2. What is/are the insight(s) found from the chart?

 Many work remotely.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Remote mental health support is crucial. No negative impact.

#### Chart - 6: Self-employed Status


In [None]:
# Chart - 6 visualization code
sns.countplot(x='self_employed', data=df)
plt.title('Self-employed Status')
plt.show()

##### 1. Why did you pick the specific chart?

Show employment type.

##### 2. What is/are the insight(s) found from the chart?

Few are self-employed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Emphasizes importance of employer-led mental health policies.

#### Chart - 7: Mental Health Consequence at Work

In [None]:
# Chart - 7 visualization code
sns.countplot(x='mental_health_consequence', data=df)
plt.title('Perceived Mental Health Consequence at Work')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Understand perceived stigma.

##### 2. What is/are the insight(s) found from the chart?

Many fear consequences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Reducing stigma can improve productivity. Ignoring it could hurt retention.

#### Chart - 8: Benefits Availability

In [None]:
# Chart - 8 visualization code
sns.countplot(x='benefits', data=df)
plt.title('Are Mental Health Benefits Provided by Employer?')
plt.show()

##### 1. Why did you pick the specific chart?

Visualize benefit awareness.

##### 2. What is/are the insight(s) found from the chart?

Many unsure.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Communication gap; unutilized resources. Potential for negative impact.

#### Chart - 9: Age vs Treatment **(Bivariate Analysis: chart 9-16)**

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(x='treatment', y='age', data=df)
plt.title('Age Distribution by Treatment Seeking')
plt.show()

##### 1. Why did you pick the specific chart?

 Boxplot compares distribution across categories.

##### 2. What is/are the insight(s) found from the chart?

Respondents who sought treatment tend to be slightly older, with most of them between about 27 and 38 years of age.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Design campaigns for younger workers. Ignoring them risks long-term issues.
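The boxplot reading can be backed by numbers by computing the quartiles of age within each treatment group. The sketch below uses invented toy values purely to show the call; in the notebook the same `groupby` would be run on `df`.

```python
import pandas as pd

# Toy data with illustrative ages; in the notebook this would be df.
toy = pd.DataFrame({
    "age": [24, 26, 29, 31, 27, 33, 36, 40],
    "treatment": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
})

# Quartiles of age per treatment group: these are the box edges and the median line.
age_summary = (
    toy.groupby("treatment")["age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(age_summary)
```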

#### Chart - 10: Gender vs Treatment

In [None]:
# Chart - 10 visualization code
sns.countplot(data=df, x='standardized_gender', hue='treatment')
plt.title('Treatment Seeking by Gender')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

 Compare treatment by gender.

##### 2. What is/are the insight(s) found from the chart?

Females more open to treatment than men.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gender-sensitive outreach can improve participation.
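Because the sample is predominantly male, raw counts in a countplot can hide differences in rates. A row-normalized cross-tabulation gives the share of each gender that sought treatment. The sketch below runs on made-up toy rows; only the `pd.crosstab` call is meant to carry over to `df`.

```python
import pandas as pd

# Toy data (illustrative): 6 male and 4 female respondents.
toy = pd.DataFrame({
    "standardized_gender": ["Male"] * 6 + ["Female"] * 4,
    "treatment": ["Yes", "No", "No", "No", "Yes", "No",
                  "Yes", "Yes", "Yes", "No"],
})

# normalize="index" turns each row into proportions that sum to 1.
rates = pd.crosstab(
    toy["standardized_gender"], toy["treatment"], normalize="index"
)

print(rates.round(2))
```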

#### Chart - 11: Country vs Treatment (Top 5)

In [None]:
# Chart - 11 visualization code
top5 = df['country'].value_counts().nlargest(5).index
sns.countplot(data=df[df['country'].isin(top5)], x='country', hue='treatment')
plt.title('Treatment Seeking by Country (Top 5)')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

See treatment trends by geography.

##### 2. What is/are the insight(s) found from the chart?

Different cultures may affect openness. In the United States, significantly more respondents report having sought treatment than in the other top 5 countries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps localize mental health campaigns.
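Since the United States contributes far more respondents than any other country, comparing countries by treatment rate alongside sample size may be more informative than comparing raw counts. The following is a sketch on invented toy rows; the column names match the notebook, while `sought`, `respondents`, and `treatment_rate` are names introduced here for illustration.

```python
import pandas as pd

# Toy data with illustrative values; in the notebook this would be df.
toy = pd.DataFrame({
    "country": ["United States"] * 6 + ["United Kingdom"] * 3
               + ["Canada"] * 2 + ["India"],
    "treatment": ["Yes", "Yes", "Yes", "No", "No", "No",
                  "Yes", "Yes", "No",
                  "No", "Yes",
                  "Yes"],
})

# Keep the three largest countries (the notebook uses the top 5).
top = toy["country"].value_counts().nlargest(3).index
subset = toy[toy["country"].isin(top)]

# Sample size and share of respondents who sought treatment, per country.
summary = (
    subset.assign(sought=subset["treatment"].eq("Yes"))
    .groupby("country")["sought"]
    .agg(respondents="size", treatment_rate="mean")
    .sort_values("respondents", ascending=False)
)

print(summary)
```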

#### Chart - 12: Work Interference vs Treatment


In [None]:
# Chart - 12 visualization code
sns.countplot(x='work_interfere', hue='treatment', data=df)
plt.title('Work Interference vs Treatment')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Explore how work interference relates to treatment seeking.

##### 2. What is/are the insight(s) found from the chart?

Higher interference means more treatment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Shows ROI of addressing interference early.

#### Chart - 13: Remote Work vs Benefits

In [None]:
# Chart - 13 visualization code
sns.countplot(x='remote_work', hue='benefits', data=df)
plt.title('Remote Work vs Mental Health Benefits')
plt.show()

##### 1. Why did you pick the specific chart?

Compare remote work with benefits.

##### 2. What is/are the insight(s) found from the chart?

Many remote workers unsure about benefits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Highlights need for virtual HR engagement.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the correlation between numeric variables. This chart is included because the sample EDA template calls for it; however, it is of limited use here because age is the only numeric column.



##### 2. What is/are the insight(s) found from the chart?

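Because age is the only numeric column, the heatmap contains a single cell (age against itself, 1.00) and reveals no relationships with the survey responses.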
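One possible way to make a correlation matrix meaningful for this kind of data is to encode the Yes/No columns as 1/0 before calling `corr()`. This is a suggestion rather than something done in the notebook, and the toy values below are invented; columns with more than two answers (such as "Don't know") would need a different treatment.

```python
import pandas as pd

# Toy data with illustrative values; in the notebook this would be df.
toy = pd.DataFrame({
    "age": [25, 32, 41, 29, 37, 45],
    "family_history": ["No", "Yes", "Yes", "No", "Yes", "No"],
    "treatment": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Encode strictly binary columns as integers so corr() can use them.
encoded = toy.copy()
for col in ["family_history", "treatment"]:
    encoded[col] = encoded[col].map({"No": 0, "Yes": 1}).astype(int)

corr = encoded.corr()
print(corr.round(2))
```

The resulting matrix could then be passed to `sns.heatmap` in place of `df[numeric_cols].corr()`.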

#### Chart - 15: No. of Employees vs Care Options

In [None]:
# Chart - 15 visualization code
plt.figure(figsize=(12, 5))
sns.countplot(x='no_employees', hue='care_options', data=df)
plt.title('Company Size vs Availability of Care Options')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Show correlation with org size.

##### 2. What is/are the insight(s) found from the chart?

Bigger firms provide more options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Small companies may need government/NGO support.
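Company-size labels are strings, so by default they are not sorted from smallest to largest, which makes the size trend harder to read. Declaring the column as an ordered categorical is one way to fix the order. The `size_order` list below is an assumption about the labels in the cleaned file and should be checked against `df['no_employees'].unique()`; the toy rows are illustrative.

```python
import pandas as pd

# Assumed size labels, smallest to largest; verify against the actual data.
size_order = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]

toy = pd.DataFrame({
    "no_employees": ["More than 1000", "1-5", "26-100", "6-25", "1-5", "100-500"],
    "care_options": ["Yes", "No", "Not sure", "No", "No", "Yes"],
})

# An ordered categorical sorts by company size instead of alphabetically.
toy["no_employees"] = pd.Categorical(
    toy["no_employees"], categories=size_order, ordered=True
)
sorted_sizes = toy.sort_values("no_employees")["no_employees"].tolist()

print(sorted_sizes)
```

For the chart itself, the same list can be passed to seaborn via `sns.countplot(..., order=size_order)`.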

#### Chart - 16: Supervisor Support vs Seeking Help

In [None]:
# Chart - 16 visualization code
sns.countplot(x='supervisor', hue='seek_help', data=df)
plt.title('Supervisor Support vs Seeking Mental Health Help')
plt.show()

##### 1. Why did you pick the specific chart?

Analyze influence of leadership.

##### 2. What is/are the insight(s) found from the chart?

Respondents who feel comfortable discussing mental health with their supervisor more often report that their employer encourages seeking help.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Train managers to be mental-health friendly.

#### Chart - 17: Age, Gender, and Treatment **(Multivariate Analysis: charts 17-20)**

In [None]:
# Chart - 17 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='standardized_gender', y='age', hue='treatment', data=df)
plt.title('Age and Gender vs Treatment Seeking')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Identify cross-demographic patterns.

##### 2. What is/are the insight(s) found from the chart?

Gender-age gaps in seeking help.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Refine initiatives by cohort.
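To identify cohorts more precisely than a grouped boxplot allows, age can be binned and the treatment rate computed per age band and gender. The sketch below is illustrative only: the toy rows, the band edges, and the `age_band` and `sought` columns are assumptions introduced for the example.

```python
import pandas as pd

# Toy data with illustrative values; in the notebook this would be df.
toy = pd.DataFrame({
    "age": [22, 24, 28, 33, 38, 45, 23, 27, 36, 41],
    "standardized_gender": ["Male"] * 6 + ["Female"] * 4,
    "treatment": ["No", "No", "No", "Yes", "Yes", "Yes",
                  "Yes", "Yes", "Yes", "No"],
})

# Bin ages into bands; the lower edge 14 makes age 15 fall into the first band.
toy["age_band"] = pd.cut(
    toy["age"], bins=[14, 25, 35, 100], labels=["15-25", "26-35", "36+"]
)
toy["sought"] = toy["treatment"].eq("Yes").astype(int)

# Share of respondents who sought treatment per age band and gender.
rate = toy.pivot_table(
    index="age_band",
    columns="standardized_gender",
    values="sought",
    aggfunc="mean",
    observed=False,
)

print(rate)
```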

#### Chart - 18 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, vars=['age'], hue='treatment')
plt.suptitle('Pair Plot of Age with Treatment Status', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Explore multivariate relationships.

##### 2. What is/are the insight(s) found from the chart?

Visual check of how age and treatment interact across subgroups.

#### Chart - 19: Mental vs Physical Consequences and Support

In [None]:
# Chart - 19 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='mental_vs_physical', hue='mental_health_consequence', data=df)
plt.title('Mental vs Physical Health Priority vs Consequences')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Compare attitudes and real-world effects.

##### 2. What is/are the insight(s) found from the chart?

Mental health seen as less prioritized.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Balanced HR messaging can help destigmatize mental illness.

#### Chart - 20: Work Interference vs Leave Policy

In [None]:
# Chart - 20 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='work_interfere', hue='leave')
plt.title('Work Interference vs Leave Policy')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Evaluate HR policy effectiveness.

##### 2. What is/are the insight(s) found from the chart?

Poor leave policies linked to higher interference.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Reworking leave can reduce burnout.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

**Better Awareness Programs:** Many employees are unsure if benefits are available. Employers must improve communication about mental health resources.

**Training for Managers:** Supervisor support directly impacts employees’ comfort with discussing mental health. Train managers to create a safe space.

**Country-specific Policies:** Since cultural differences exist, mental health programs should be tailored to regional needs and expectations.

**Target Younger Workers:** Younger employees are less likely to seek help. Initiatives should focus on awareness and normalization within this group.

**Support for Remote Workers:** As many employees work remotely, virtual mental health services should be strengthened.

# **Conclusion**

This analysis shows that mental health issues are common in the tech industry, especially among young adults. Factors like supportive supervisors, clear mental health benefits, and country culture strongly affect whether people seek help. Many still fear negative consequences at work. To create a healthier workplace, companies should improve communication, reduce stigma, and offer targeted mental health support.