# **Project Name**    - Mental Health Survey EDA

---





##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Ashwin kanth Marapally

# **Project Summary -**

This project involves a comprehensive Exploratory Data Analysis (EDA) of the Mental Health in Tech Survey dataset. The survey, conducted among professionals in the technology sector, aims to understand employees' mental health conditions, treatment behavior, workplace support, and the cultural perceptions of mental wellness in tech environments.

**The analysis includes:**

* Cleaning and preprocessing (handling missing values, correcting outliers, and data type conversion)

* Univariate, Bivariate, and Multivariate visualizations (UBM) to identify patterns

* Analysis of mental health treatment behavior across different genders, age groups, employment types, and work environments

* Key focus on how work interference, remote work, and company support influence treatment-seeking behavior

By identifying high-risk groups and support gaps, this project helps build actionable recommendations for organizations to create a more inclusive and responsive mental health environment.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

In today’s high-stress and fast-paced tech industry, mental health challenges are increasingly prevalent, yet they remain under-discussed and under-supported in many workplaces. Many employees struggle silently, unaware of or unable to access proper mental health care and support systems.

This project aims to answer the following core questions:

* What are the demographic and workplace factors that influence mental health conditions in tech?

* Are employees seeking treatment when needed — and if not, why?

* How do remote work, company size, gender, and awareness of benefits affect mental health support?

* What patterns indicate gaps in mental health accessibility or awareness?

The ultimate goal is to equip organizations with data-driven insights to help them:

* **Identify at-risk groups**

* **Address awareness and access gaps**

* **Reduce stigma**

* **Create a positive business impact through better mental health culture**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

mental_health_data = pd.read_csv("/content/survey.csv")
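Since the guidelines ask for exception handling, the load step could be wrapped as in the minimal sketch below. The helper name `load_survey` is hypothetical and not part of the original notebook; it only checks that the file exists and contains rows before the analysis continues.

```python
from pathlib import Path

import pandas as pd


def load_survey(csv_path: str) -> pd.DataFrame:
    """Read the survey CSV and fail early with a readable message."""
    path = Path(csv_path)
    # Stop immediately if the path is wrong instead of failing later in the notebook
    if not path.is_file():
        raise FileNotFoundError(f"Survey file not found: {path}")
    data = pd.read_csv(path)
    # A header-only file would load as an empty DataFrame, so check for rows
    if data.empty:
        raise ValueError(f"Survey file has no rows: {path}")
    return data


# Example call (assumes the Colab path used in this notebook):
# mental_health_data = load_survey("/content/survey.csv")
```

With this helper, a missing upload surfaces as a single clear error at the top of the notebook rather than as a `NameError` several cells later.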

### Dataset First View

In [None]:
# Dataset First Look

mental_health_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Rows: ", mental_health_data.shape[0])
print("Columns: ", mental_health_data.shape[1])

### Dataset Information

In [None]:
# Dataset Info

mental_health_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

mental_health_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

mental_health_missing_data = pd.DataFrame({
    "Missing Values" : mental_health_data.isnull().sum(),
    "Missing_values_percentage %" : mental_health_data.isnull().mean()* 100
}).sort_values("Missing_values_percentage %", ascending = False)

mental_health_missing_data

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10, 6))
sns.heatmap(mental_health_data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

**Dataset Overview:**

* The dataset contains 1,259 rows and 27 columns, which fall into the groups below.

* Demographics (e.g., Age, Gender, Country, state)

* Work-related details (remote_work, no_employees, self_employed)

* Mental health experiences (treatment, work_interfere, seek_help, family_history, etc.)

**Data Cleaning Observations:**

* The Age column contained extreme outliers (e.g., values like -1726 and 99999999999), which were removed using the IQR method.

* The Gender column had more than 45 unique variations due to free-text input (e.g., "male", "M", "cis male", etc.), which were standardized into 'Male', 'Female', and 'Other'.

**Several columns had missing values, including:**

* self_employed (~1.4% missing)

* work_interfere (~21% missing)

* state (~41% missing)

These were handled via mode imputation or appropriate default values after analysis.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

mental_health_data.columns.tolist()

In [None]:
# Dataset Describe

mental_health_data.describe(include='all')

### Variables Description

**Age:**	Respondent's age (numeric). Outliers were present and removed using IQR.

**Gender:**	Self-reported gender. Normalized into Male, Female, and Other.

**Country:**	Country where the respondent resides.

**state:**	U.S. state or territory where the respondent lives (asked only of respondents in the United States).

**self_employed:**	Whether the respondent is self-employed (Yes / No).

**family_history:**	Indicates if the respondent has a family history of mental illness.

**treatment:**	Whether the respondent has sought mental health treatment.

**work_interfere:**	How often mental health interferes with work (Never, Rarely, Sometimes, Often).

**no_employees:**	Size of the company the respondent works for (e.g., 6-25, More than 1000).

**remote_work:**	Whether the respondent works remotely (outside of an office) at least 50% of the time.

**tech_company:**	Whether the respondent works for a tech company (Yes / No).

**benefits:**	Whether the employer provides mental health benefits.

**care_options:**	Awareness of mental health care options provided by the employer.

**wellness_program:**	Whether the employer has ever discussed mental health as part of an employee wellness program.

**seek_help:**	Whether the employer provides resources to learn more about mental health issues and how to seek help.

**anonymity:**	Does the employer maintain anonymity for those seeking mental health treatment?

**leave:**	How easy it is for the respondent to take medical leave for a mental health condition.

**mental_health_consequence:**	Perceived consequence of discussing mental health at work.

**phys_health_consequence:**	Perceived consequence of discussing physical health at work.

**coworkers, supervisor, mental_health_interview, phys_health_interview:**	Attitudes toward discussing mental health with colleagues or during job interviews.

**obs_consequence:**	Whether the respondent has heard of or observed negative consequences for coworkers with mental health conditions in the workplace.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

mental_health_data.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Create copy of the dataset

mental_health_data_copy = mental_health_data.copy()

In [None]:
# Handling Missing Values

mental_health_data_copy.drop(columns=['comments'], inplace=True)

mental_health_data_copy.fillna({'state': 'Not specified'}, inplace=True)
mental_health_data_copy.fillna({"work_interfere": mental_health_data_copy["work_interfere"].mode()[0]}, inplace=True)
mental_health_data_copy.fillna({'self_employed': 'No'}, inplace=True)

In [None]:
# cross checking null values

mental_health_data_copy.isnull().sum()

In [None]:
# Converting data types

mental_health_data_copy['Timestamp'] = pd.to_datetime(mental_health_data_copy['Timestamp'])

categorical_columns = ["Gender","Country","self_employed","treatment"]

mental_health_data_copy[categorical_columns] = mental_health_data_copy[categorical_columns].astype('category')

In [None]:
# Cleaning the Gender Column (As there are 46 unique genders in dataset)

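def clean_gender(gender):
    # Normalize case and surrounding whitespace before matching
    gender = str(gender).strip().lower()

    # Known male spellings (including the typo 'malr')
    if gender in ['male', 'm', 'cis male', 'cis man', 'man', 'malr']:
        return 'Male'
    # Known female spellings (including the typo 'femail')
    elif gender in ['female', 'f', 'femail', 'cis female', 'cis woman', 'woman']:
        return 'Female'
    # Everything else, including unrecognized spellings, is grouped as 'Other'
    else:
        return 'Other'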

mental_health_data_copy['Gender'] = mental_health_data_copy['Gender'].apply(clean_gender)



In [None]:
# Check outliers in Age using IQR Method

Q1 = mental_health_data_copy["Age"].quantile(0.25)
Q3 = mental_health_data_copy["Age"].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = mental_health_data_copy[(mental_health_data_copy["Age"] < lower_bound) | (mental_health_data_copy["Age"] > upper_bound)]

print(f"Number of Age outliers: {len(outliers)}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

In [None]:
# Removing outliers

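# .copy() makes df_cleaned an independent DataFrame, which avoids SettingWithCopyWarning later
df_cleaned = mental_health_data_copy[(mental_health_data_copy['Age'] >= lower_bound) & (mental_health_data_copy['Age'] <= upper_bound)].copy()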


In [None]:
# save cleaned dataset

df_cleaned.to_csv("cleaned_mental_health_data.csv", index=False)

### What all manipulations have you done and insights you found?

**1. Outlier Handling**

* Detected extreme outliers in the Age column (e.g., -1726, 99999999999).

* Used IQR method to compute lower and upper bounds.

* Removed 40 rows outside the age range 13.5 – 49.5 for clean analysis.
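The IQR rule described above can be sketched as a small reusable function. The sketch also shows a domain-based alternative, because a tight IQR fence can drop legitimate older respondents along with the data-entry errors. The function names, the sample ages, and the 18 to 75 working-age range are assumptions made for illustration, not values taken from the survey.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_plausible_age(df: pd.DataFrame, low: int = 18, high: int = 75) -> pd.DataFrame:
    """Keep rows whose Age lies in an assumed plausible working-age range."""
    return df[df["Age"].between(low, high)].copy()


# Hypothetical ages: two entry errors (-1726, 329) and two valid older respondents (55, 61)
ages = pd.DataFrame({"Age": [-1726, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 55, 61, 329]})

lower, upper = iqr_bounds(ages["Age"])
iqr_kept = ages[ages["Age"].between(lower, upper)]
range_kept = filter_plausible_age(ages)

print(f"IQR fences: {lower} to {upper}, rows kept: {len(iqr_kept)}")
print(f"Plausible-range filter, rows kept: {len(range_kept)}")
```

On this sample the IQR fences remove the two entry errors and also the respondents aged 55 and 61, while the range filter removes only the entry errors. Which behavior is preferable depends on whether older employees matter to the analysis.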

**2. Missing Values Handling**

* Dropped the comments column (86% missing).

**Filled missing:**

* self_employed → with "No"

* work_interfere → with mode ("Sometimes")

* state → with "Not specified"
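Mode imputation moves every non-response into the most common answer, which changes the shape of the distribution. The sketch below contrasts it with an explicit placeholder category on a tiny hypothetical sample; the label "Not reported" is an assumption, not a value from the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical answers with two missing responses
survey = pd.DataFrame(
    {"work_interfere": ["Sometimes", np.nan, "Never", np.nan, "Often", "Sometimes"]}
)

# Approach used in this notebook: fill gaps with the most frequent answer
mode_value = survey["work_interfere"].mode()[0]
mode_filled = survey["work_interfere"].fillna(mode_value)

# Alternative: keep non-responses visible as their own category
label_filled = survey["work_interfere"].fillna("Not reported")

print(mode_filled.value_counts())
print(label_filled.value_counts())
```

In the sample, mode imputation doubles the "Sometimes" count from 2 to 4, whereas the placeholder keeps the original counts intact and lets charts show non-response separately.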

**3. Data Type Fixes**

* Converted Timestamp to datetime

* Converted Gender, Country, self_employed, and treatment to the category type for efficient analysis


**4. Gender Normalization**

* Cleaned messy Gender values (e.g., "male", "MALE ", "femail", "cis man", etc.)

* Consolidated into 3 categories: Male, Female, and Other
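Because every unrecognized spelling falls into "Other", it can help to review what ends up in that bucket before trusting the three-way split. The sketch below is one way to do that; `normalize_gender`, the lookup sets, and the sample values (including the typo "Mail") are hypothetical and only mirror the spellings listed in this notebook.

```python
import pandas as pd

# Spellings taken from the cleaning step in this notebook
MALE = {"male", "m", "man", "cis male", "cis man", "malr"}
FEMALE = {"female", "f", "woman", "cis female", "cis woman", "femail"}


def normalize_gender(raw) -> str:
    """Map a free-text gender entry to Male, Female, or Other."""
    key = str(raw).strip().lower()
    if key in MALE:
        return "Male"
    if key in FEMALE:
        return "Female"
    return "Other"


# Hypothetical raw entries, including a misspelling that is not in either set
raw = pd.Series(["Male", "MALE ", "femail", "cis man", "F", "non-binary", "Mail"])
cleaned = raw.map(normalize_gender)

# List the normalized spellings that were grouped as Other, with their counts
unmapped = raw[cleaned.eq("Other")].str.strip().str.lower().value_counts()
print(unmapped)
```

Here the review surfaces "mail", a likely misspelling of "male" that would otherwise be counted under "Other". Adding such spellings to the lookup sets keeps the "Other" group meaningful.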

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate Visualizations**

#### Chart - 1

In [None]:
# Chart - 1 Age Distribution

plt.figure(figsize=(12, 6))
sns.histplot(df_cleaned['Age'], bins=20, kde=True)
plt.title("Age Distribution After Removing Outliers (IQR-based)")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()



##### 1. Why did you pick the specific chart?

I selected a histogram because it is the most effective way to explore how a continuous variable like Age is distributed.

##### 2. What is/are the insight(s) found from the chart?

**Insight :** Respondent ages cluster mainly between 25 and 35, so most of the tech employees who responded are relatively young. Counts decline gradually after 35, and there is no representation above 49 because the IQR-based filter removed every age above the 49.5 upper bound. The concentration may mean that younger professionals are more affected by or more aware of mental health topics, but it may equally reflect the age profile of the tech workforce or of the people the survey reached.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Understanding the age concentration of employees who engage with mental health surveys allows companies to tailor wellness programs and mental health support. If most affected or engaged employees are in their 20s and 30s, organizations can design more age-relevant mental health resources, workshops, and communication strategies.

#### Chart - 2

In [None]:
# Chart - 2 Gender Count

plt.figure(figsize=(8, 6))
sns.countplot(data=df_cleaned, x='Gender')
plt.title("Gender Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To observe gender diversity in the dataset and check representation across male, female, and other categories.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** The dataset is heavily male-dominated, with relatively fewer female and other-gender respondents. This reflects either the tech industry imbalance or a greater willingness among men to participate in this specific survey.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Gender inclusion policies should ensure that mental health support is inclusive and customized for all gender identities.

#### Chart - 3

In [None]:
# Chart - 3 Treatment Sought

plt.figure(figsize=(8, 6))
sns.countplot(data=df_cleaned, x='treatment')
plt.title("Treatment Sought Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

To evaluate how many people have sought treatment for mental health, which indicates openness and access to care.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** A significant number of respondents have sought mental health treatment. This may indicate rising awareness and acceptance, or it may reflect underlying stress and burnout in the tech industry.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Organizations should continue reducing stigma, offering accessible therapy, and building a culture of mental health awareness.

#### Chart - 4

In [None]:
# Chart - 4 Self-employed Status

self_emp_counts = df_cleaned["self_employed"].value_counts()

labels = self_emp_counts.index
sizes = self_emp_counts.values

plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title("Self-Employed Status")
plt.show()


##### 1. Why did you pick the specific chart?

To check how many respondents are self-employed, as this affects access to structured mental health benefits.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Most respondents are not self-employed, indicating they may depend more on company-provided benefits and culture for mental wellness.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Organizations must recognize this dependency and ensure their employee wellness infrastructure is reliable and easy to access.

#### Chart - 5

In [None]:
# Chart - 5 Family History of Mental Illness

sns.countplot(data=df_cleaned, x="family_history")
plt.title("Family History of Mental Illness")
plt.show()

##### 1. Why did you pick the specific chart?

To see how many respondents have a family history of mental illness, a known risk factor.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** A large number of individuals reported having a family history, which could contribute to a higher personal risk or awareness of mental health challenges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Support systems could include preventive screenings and educational sessions for employees with higher personal risk factors.

# **Bivariate Visualizations**

#### Chart - 6

In [None]:
# Chart - 6 Age vs Treatment

plt.figure(figsize=(10, 6))
sns.boxplot(data=df_cleaned, x='treatment', y='Age')
plt.title("Age vs Treatment")
plt.show()

##### 1. Why did you pick the specific chart?

To explore whether age has an influence on treatment-seeking behavior.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** The median age of those who sought treatment is slightly lower than those who did not. This suggests younger professionals might be more open to seeking help.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Mental health outreach and awareness campaigns might be more effective when targeted at younger employees, especially in onboarding and training.

#### Chart - 7

In [None]:
# Chart - 7 Gender vs Treatment

plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x="Gender", hue="treatment")
plt.title("Gender vs Treatment")
plt.show()

##### 1. Why did you pick the specific chart?

To understand how treatment-seeking behavior differs across genders.

##### 2. What is/are the insight(s) found from the chart?

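**Insight:** All genders report seeking treatment, but the distribution varies. Men have the tallest bars in both categories because they make up most of the sample, so raw counts are misleading here. Relative to group size, men are less likely than women to have sought treatment, which matches the per-gender breakdown in Chart 9 and may reflect differing comfort levels or access.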
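Count plots are hard to compare when group sizes differ, so a row-normalized cross-tabulation is a useful companion to this chart. The sketch below uses a hypothetical helper, `treatment_rate_by`, and a small made-up sample; the same helper could be applied to other columns such as remote_work.

```python
import pandas as pd


def treatment_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Percentage of Yes/No treatment answers within each group of `column`."""
    table = pd.crosstab(df[column], df["treatment"], normalize="index")
    return table.mul(100).round(1)


# Hypothetical sample: 8 men (3 treated) and 4 women (3 treated)
sample = pd.DataFrame(
    {
        "Gender": ["Male"] * 8 + ["Female"] * 4,
        "treatment": ["Yes"] * 3 + ["No"] * 5 + ["Yes"] * 3 + ["No"] * 1,
    }
)

rates = treatment_rate_by(sample, "Gender")
print(rates)
```

In this sample both genders have three treated respondents, so a count plot would show equal "Yes" bars, yet the rates are 37.5% for men and 75.0% for women.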

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Companies should consider gender-specific support strategies and make resources equally approachable for all.

#### Chart - 8

In [None]:
# Chart - 8 Remote Work vs Treatment

plt.figure(figsize=(8,6))
sns.countplot(data=df_cleaned, x="remote_work", hue="treatment",  palette='viridis')
plt.title("Remote Work vs Treatment")
plt.show()


##### 1. Why did you pick the specific chart?

To examine if working remotely influences treatment-seeking behavior.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** The bar chart shows that among those who do not work remotely, both “Yes” and “No” responses for seeking treatment are nearly balanced, with slightly more saying “No.”
However, for those who work remotely, there is a slight increase in treatment-seeking.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** This insight is critical for companies with hybrid or remote models. Employers should:

* Offer remote-access mental health services (e.g., teletherapy, mental wellness apps).

* Encourage open communication and support channels in remote settings.

* Ensure that remote employees are not left out of wellness programs designed for in-office teams.

# **Multivariate Analysis**

#### Chart - 9

In [None]:
# Chart - 9 Work Interference vs Treatment by Gender

sns.catplot(x="work_interfere", hue="treatment", col="Gender",
            data=df_cleaned, kind="count", height=5, aspect=1)
plt.show()

##### 1. Why did you pick the specific chart?

 It allows us to break down how work interference due to mental health varies across gender, and how it relates to whether someone sought treatment.

 This multi-panel approach is excellent for spotting trends within each gender group separately, which would be hidden in a single plot.

##### 2. What is/are the insight(s) found from the chart?

**Insight:**

**1. For Male respondents:**

* The largest group falls under "Sometimes" work interference.

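* A majority of them did not seek treatment, even though interference exists. Part of this group comes from the missing work_interfere values that were imputed as "Sometimes" during wrangling, so the bar overstates genuine "Sometimes" responses.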

* Treatment-seeking is relatively low for “Rarely” and “Never” categories.


**2. For Female respondents:**

* More women seek treatment when they report work interference as "Often" or "Sometimes".

* Women are more proactive in seeking help when they notice mental health impacting work.

**3. For Other genders:**

* The sample size is small, but those reporting “Sometimes” interference are more likely to seek treatment.

* Despite lower numbers, the behavior is more similar to female respondents than male.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Female employees are more likely to seek help when mental health interferes with work, suggesting higher awareness or reduced stigma in this group. Male employees may need more proactive encouragement or a safer environment to seek help.
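The panel comparison can be backed by a grouped summary that reports the treatment rate together with the group size, so that small groups such as "Other" are not over-interpreted. This is a sketch on hypothetical rows; the `MIN_GROUP_SIZE` threshold is an assumption chosen only for the example.

```python
import pandas as pd

# Hypothetical respondents
sample = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Other"],
        "work_interfere": ["Sometimes", "Sometimes", "Sometimes", "Never",
                           "Often", "Often", "Sometimes"],
        "treatment": ["Yes", "No", "No", "No", "Yes", "Yes", "Yes"],
    }
)

# Treatment rate and respondent count for every Gender x work_interfere cell
summary = (
    sample.assign(sought=sample["treatment"].eq("Yes"))
    .groupby(["Gender", "work_interfere"])["sought"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "treatment_rate", "size": "n"})
)

# Keep only cells with enough respondents to support a comparison (assumed cutoff)
MIN_GROUP_SIZE = 2
reliable = summary[summary["n"] >= MIN_GROUP_SIZE]
print(reliable)
```

Reporting `n` next to each rate makes it clear which cells rest on a handful of respondents and should be read as anecdotal.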

#### Chart - 10

In [None]:
# Chart - 10  Remote Work vs Mental Health Support

sns.countplot(x='remote_work', hue='seek_help', data=df_cleaned)
plt.title("Remote Work vs Mental Health Help Access")
plt.show()

##### 1. Why did you pick the specific chart?

We used a countplot with a hue to compare how remote and on-site respondents answered the question about employer-provided mental health help resources (seek_help).

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Among non-remote workers, the largest group says "No", followed by a significant number who answered "Don't know". Surprisingly, only a minority said "Yes".

Among remote workers, even fewer say "Yes", while the majority responded "No", and a notable group is also unsure ("Don't know").

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:**

* Awareness is low — many employees don’t know whether they have access to mental health support.

* Remote workers may be especially disconnected from wellness resources and HR communication.

#### Chart - 11

In [None]:
# Chart - 11 Company Size vs Mental Health Benefits

sns.countplot(x='no_employees', hue='benefits', data=df_cleaned)
plt.title("Company Size vs Mental Health Benefits")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To explore if company size affects the likelihood of offering mental health benefits.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Larger organizations are more likely to provide mental health benefits than small companies. Small and mid-size companies show inconsistency.
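To check this pattern numerically, the benefit answers can be expressed as shares within each company-size band, with the bands listed from smallest to largest. The sketch below uses made-up rows, and the six labels in `SIZE_ORDER` are assumed; only "6-25" and "More than 1000" are quoted in this notebook, so the list should be checked against the actual unique values of no_employees.

```python
import pandas as pd

# Assumed size labels, smallest to largest (verify against the real column values)
SIZE_ORDER = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]

# Hypothetical respondents
sample = pd.DataFrame(
    {
        "no_employees": ["More than 1000", "1-5", "6-25", "More than 1000", "1-5", "6-25"],
        "benefits": ["Yes", "No", "Don't know", "Yes", "No", "Yes"],
    }
)

# Share of each benefits answer within a size band, rows sorted by company size
share = (
    pd.crosstab(sample["no_employees"], sample["benefits"], normalize="index")
    .reindex(SIZE_ORDER)
    .dropna(how="all")  # drop size bands that do not occur in the data
)
print(share.round(2))
```

The same `SIZE_ORDER` list can be passed to the `order` argument of `sns.countplot` so that the bars in the chart also run from the smallest to the largest companies.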

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** There is a need to help SMEs adopt scalable and affordable mental health support systems or use third-party wellness services.

# **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the goal of improving employee mental health awareness, support, and engagement, we suggest the following data-backed strategies:

**1. Enhance Awareness and Accessibility of Mental Health Support**

Many employees either don’t know or believe they don’t have access to mental health resources — especially remote workers. The company should:

* Regularly communicate available support via email, intranet, and meetings.

* Include mental health resource overviews during onboarding.

* Use digital tools (e.g., chatbots, apps) to provide 24/7 access to help.

**2. Target Support Based on Risk Factors**

Groups to prioritize, based on the charts above:

* Employees aged 25–35

* Those with a family history of mental illness

* Those who report work interference from mental health

Focus interventions (e.g., preventive workshops, stress-relief programs) on these groups.

**3. Address Gender Gaps in Treatment-Seeking Behavior**

The analysis revealed that males are less likely to seek treatment even when mental health affects work. This may be due to stigma or cultural resistance. The company should:

* Promote mental health in male-dominated teams through leaders sharing personal experiences.

* Introduce anonymous self-assessments or surveys.

**4. Support Remote Workers Differently**

Remote employees are:

* Less aware of support options

* Slightly more likely to seek treatment than on-site employees (Chart 8), which signals demand for support that is reachable outside the office

Offer teletherapy, virtual wellness sessions, and regular check-ins to bridge this gap.

# **Conclusion**

This EDA uncovered key behavioral patterns and risk indicators affecting mental health support and treatment in the tech workplace. Key findings include:

* The majority of respondents are in the 25–35 age group, a critical audience for intervention.

* A significant number of employees reported a family history of mental illness, increasing their potential vulnerability.

* There is a clear link between work interference and treatment-seeking, but gender plays a major role in response behavior.

* Remote employees are less likely to know or access mental health support, despite being in potentially more isolated environments.

* Males are less likely to seek treatment even when needed, indicating the need for stigma-reducing strategies.

In conclusion, the organization must implement proactive, inclusive, and well-communicated mental health strategies tailored to age, gender, and work mode to truly support employee well-being and productivity.