In [None]:
    from google.colab import drive
    drive.mount('/content/drive')

# **Project Name**    - Mental Health in Tech: A 2014 Workplace Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

In today’s fast-paced and demanding technology industry, the mental well-being of employees plays a crucial role in determining both individual productivity and organizational success. Despite increasing awareness, mental health issues often remain under-addressed in workplaces, especially in tech companies where long working hours, tight deadlines, and high job stress are common. This project focuses on performing an in-depth Exploratory Data Analysis (EDA) of a 2014 survey dataset that captures the prevalence of mental health challenges and attitudes toward mental health in tech workplaces across the globe.

The dataset, originally collected by Open Sourcing Mental Illness (OSMI), contains detailed responses from technology employees about their personal mental health conditions, workplace support systems, and perceptions regarding mental health and physical health parity. It comprises features like age, gender, country, self-employment status, company size, mental health history, and whether the individual has sought treatment. It also covers workplace factors such as availability of mental health benefits, ease of taking medical leave, remote work frequency, and the presence of wellness programs. These attributes allow us to explore critical questions such as: What is the distribution of mental health issues across demographics? How do workplace policies influence treatment-seeking behavior? Are there regional differences in attitudes and support for mental health?

Our business objective in this EDA is to uncover patterns and insights that can inform tech companies about gaps in their mental health support and identify opportunities to build a more inclusive, supportive work culture. This is particularly important as mental health issues not only affect individual employees but also impact team productivity, employee retention, and overall organizational health.

To achieve this, we will first clean and preprocess the dataset by addressing outliers (such as unrealistic age entries), standardizing inconsistent gender responses, handling missing values, and converting data types where necessary. Following that, we will undertake comprehensive univariate, bivariate, and multivariate analyses. Univariate analyses will focus on understanding the basic distributions of demographic and workplace attributes. Bivariate and multivariate analyses will help identify relationships between workplace characteristics and mental health outcomes — for instance, how the presence of mental health benefits relates to whether employees seek treatment, or how ease of taking medical leave correlates with fear of workplace consequences.

A key highlight of this EDA will be the creation of advanced and unique visualizations, including interactive country-level treatment rates, correlation heatmaps, and text mining on free-text comments. We will also derive new metrics such as a Mental Health Support Index that combines multiple features like benefits, care options, and wellness programs to provide a single score of workplace mental health friendliness. Such derived metrics will help identify regions, company types, or employee groups where interventions are most urgently needed.

By the end of this project, we aim to deliver actionable insights that can guide tech companies in implementing or improving mental health policies. We will also reflect on the limitations of the dataset — such as the bias inherent in self-reported data — and propose directions for future work, including comparative studies using more recent surveys like OSMI’s 2016 dataset. The ultimate goal of this EDA is to use data-driven analysis to contribute to a healthier, more supportive environment for employees in the technology sector.



# **GitHub Link -**

https://github.com/BrijKumbhani/EDA-on-Mental-Health-in-Tech-A-2014-Workplace-Analysis

# **Problem Statement**


Understand mental health trends in tech workplaces and the support systems those workplaces provide.

#### **Business Objective**

Help tech companies identify gaps and improve mental health support.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

try:
    # Data manipulation
    import pandas as pd
    import numpy as np

    # Visualization
    import matplotlib.pyplot as plt
    import seaborn as sns
    import plotly.express as px  # For interactive charts
    from wordcloud import WordCloud  # For text analysis

    # Warnings
    import warnings
    warnings.filterwarnings("ignore")

    # Settings
    sns.set(style="whitegrid")
    plt.rcParams["figure.figsize"] = (10,6)

    print("All libraries imported successfully!")

except ImportError as e:
    print(f"Error importing libraries: {e}")
    print("Please ensure all required libraries are installed.")


### Dataset Loading

In [None]:
# Load Dataset
# Read the survey CSV from the Google Drive mounted in the first cell

try:
    file_path = "/content/drive/MyDrive/Labmentix/Project-2/survey.csv"

    # Load the dataset
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
    print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please check the file path or file format.")


### Dataset First View

In [None]:
# Dataset First Look
# Display the first few rows of the dataset to get an initial view
try:
    display(df.head())
except Exception as e:
    print(f"Error displaying head of dataset: {e}")

# Display dataset info to check data types and non-null counts
try:
    df.info()  # info() prints its report and returns None, so there is nothing to store
except Exception as e:
    print(f"Error getting dataset info: {e}")

# Display summary statistics for numerical columns
try:
    display(df.describe())
except Exception as e:
    print(f"Error generating summary statistics: {e}")

# Check for missing values in each column
try:
    missing_values = df.isnull().sum()
    print("Missing values in each column:")
    print(missing_values[missing_values > 0])
except Exception as e:
    print(f"Error checking missing values: {e}")


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
try:
    rows, columns = df.shape
    print(f"Number of rows in the dataset: {rows}")
    print(f"Number of columns in the dataset: {columns}")
except Exception as e:
    print(f"Error retrieving dataset shape: {e}")


### Dataset Information

In [None]:
# Dataset Info
# Display data types, non-null counts, and memory usage
try:
    df.info()  # info() prints its report and returns None, so there is nothing to store
except Exception as e:
    print(f"Error retrieving dataset info: {e}")


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

try:
    duplicate_count = df.duplicated().sum()
    print(f"Total duplicate rows in the dataset: {duplicate_count}")
except Exception as e:
    print(f"Error checking for duplicate rows: {e}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
try:
    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    if not missing_values.empty:
        print("Missing values detected in the following columns:")
        print(missing_values)
    else:
        print("No missing values detected in the dataset.")
except Exception as e:
    print(f"Error counting missing values: {e}")


In [None]:
# Visualizing the missing values
try:
    plt.figure(figsize=(12, 6))
    sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
    plt.title('Heatmap of Missing Values')
    plt.show()
except Exception as e:
    print(f"Error visualizing missing values: {e}")


### What do you know about your dataset?

The dataset consists of survey responses on mental health in the tech industry. It covers demographic information (age, gender, country), employment details (company size, remote work, self-employment), and workplace mental health support indicators (benefits, wellness programs, care options). Most features are categorical, alongside a numeric Age column. The exact row, column, and duplicate-row counts are printed by the cells above. Some columns contain missing values, which will be handled during data cleaning. The data types are mostly appropriate, but Age, Gender, and the other categorical fields need further checks for consistency and correctness.
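The checks in this section can be gathered into a single summary. The sketch below is a minimal illustration, assuming a hypothetical helper named `dataset_overview` and a toy frame with made-up values in place of the survey data:

```python
import pandas as pd

def dataset_overview(df: pd.DataFrame) -> dict:
    """Collect the headline facts about a DataFrame in one place."""
    missing = df.isnull().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually have gaps
        "missing_by_column": missing[missing > 0].to_dict(),
    }

# Toy frame standing in for the survey data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [29, 35, 29, None],
    "Gender": ["Male", "female", "Male", "F"],
})
overview = dataset_overview(toy)
print(overview)
```

Calling `dataset_overview(df)` on the loaded survey frame would report the same four facts for the real data.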


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
try:
    print("Columns in the dataset:")
    print(df.columns.tolist())
except Exception as e:
    print(f"Error retrieving dataset columns: {e}")


In [None]:
# Dataset Describe
try:
    display(df.describe())
except Exception as e:
    print(f"Error generating dataset description: {e}")


### Variables Description

Each variable in the dataset is described below:

- Timestamp: The date and time when the survey was submitted.

- Age: The age of the respondent (numeric).

- Gender: The gender identity of the respondent (categorical).

- Country: The country where the respondent resides.

- state: If in the United States, the state or territory where the respondent lives.

- self_employed: Indicates whether the respondent is self-employed (Yes/No).

- family_history: Whether the respondent has a family history of mental illness (Yes/No).

- treatment: Whether the respondent has sought treatment for a mental health condition (Yes/No).

- work_interfere: How often the respondent feels their mental health condition interferes with work (Never/Rarely/Sometimes/Often).

- no_employees: Size of the company where the respondent works (categorical ranges).

- remote_work: Whether the respondent works remotely at least 50% of the time (Yes/No).

- tech_company: Whether the employer is primarily a tech company (Yes/No).

- benefits: Whether the employer provides mental health benefits (Yes/No/Don't know).

- care_options: Whether the respondent knows the mental health care options provided by the employer.

- wellness_program: Whether the employer has discussed mental health as part of a wellness program.

- seek_help: Whether the employer provides resources to learn about mental health and how to seek help.

- anonymity: Whether the respondent's anonymity is protected if they seek treatment.

- leave: Ease of taking medical leave for mental health issues (Very easy/Somewhat easy/Don't know/Somewhat difficult/Very difficult).

- mental_health_consequence: Whether discussing a mental health issue with the employer could have negative consequences.

- phys_health_consequence: Whether discussing a physical health issue with the employer could have negative consequences.

- coworkers: Willingness to discuss mental health with coworkers.

- supervisor: Willingness to discuss mental health with supervisor.

- mental_health_interview: Willingness to discuss mental health during a job interview.

- phys_health_interview: Willingness to discuss physical health during a job interview.

- mental_vs_physical: Whether the respondent feels mental health is treated as seriously as physical health by their employer.

- obs_consequence: Whether the respondent has observed negative consequences for coworkers with mental health conditions.

- comments: Any additional notes or comments provided by the respondent (free text).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
try:
    for col in df.columns:
        unique_count = df[col].nunique()
        print(f"{col}: {unique_count} unique values")
except Exception as e:
    print(f"Error checking unique values: {e}")
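Unique counts show how many labels a column has, but not which labels need cleaning. As a rough sketch, the hypothetical helper `list_category_values` below lists the actual values for low-cardinality columns; the `max_unique` threshold and the toy data are assumptions for illustration:

```python
import pandas as pd

def list_category_values(df: pd.DataFrame, max_unique: int = 10) -> dict:
    """Return {column: sorted unique values} for columns with few distinct values."""
    result = {}
    for col in df.columns:
        values = df[col].dropna().unique()
        if len(values) <= max_unique:
            result[col] = sorted(map(str, values))
    return result

# Toy frame (illustrative values only, not the survey data)
toy = pd.DataFrame({
    "treatment": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Gender": ["Male", "male", "M", "Female", "Male", "Female"],
    "Age": [25, 31, 40, 28, 35, 22],
})
values = list_category_values(toy, max_unique=5)
for col, labels in values.items():
    print(f"{col}: {labels}")
```

Inconsistent spellings such as `Male`, `male`, and `M` become visible immediately, which is what motivates the standardization step in the wrangling section.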


## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

try:
    # Remove duplicate rows
    df = df.drop_duplicates()

    # Remove rows with invalid ages; .copy() gives an independent frame so the
    # column assignments below do not act on a view of the original
    df = df[(df['Age'] >= 16) & (df['Age'] <= 100)].copy()

    # Gender standardization: normalize case and whitespace first
    df['Gender'] = df['Gender'].str.lower().str.strip()

    gender_map = {
        'male': 'Male', 'm': 'Male', 'man': 'Male', 'cis male': 'Male', 'male-ish': 'Male', 'msle': 'Male', 'mail': 'Male', 'malr': 'Male',
        'female': 'Female', 'f': 'Female', 'woman': 'Female', 'cis female': 'Female', 'femail': 'Female'
    }

    # Map known variants; anything unmapped (including missing values) becomes 'Other'
    df['Gender'] = df['Gender'].map(gender_map).fillna('Other')

    # Handle missing values
    if 'work_interfere' in df.columns:
        df['work_interfere'] = df['work_interfere'].fillna("Don't know")
    if 'state' in df.columns:
        df['state'] = df['state'].fillna("Not specified")
    if 'leave' in df.columns:
        df['leave'] = df['leave'].fillna(df['leave'].mode()[0])
    if 'comments' in df.columns:
        df['comments'] = df['comments'].fillna("")

    # Convert key columns to category dtype
    categorical_cols = [
        'Gender', 'Country', 'state', 'self_employed', 'family_history', 'treatment',
        'work_interfere', 'no_employees', 'remote_work', 'tech_company', 'benefits',
        'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
        'mental_health_consequence', 'phys_health_consequence', 'coworkers', 'supervisor',
        'mental_health_interview', 'phys_health_interview', 'mental_vs_physical', 'obs_consequence'
    ]

    for col in categorical_cols:
        if col in df.columns:
            df[col] = df[col].astype('category')

    print("Advanced data wrangling completed. Dataset is now clean and ready for EDA.")

except Exception as e:
    print(f"Error during advanced data wrangling: {e}")


### What manipulations have you done and what insights did you find?

1️⃣ Removed duplicate rows

- We dropped any fully duplicate rows to ensure that repeated survey responses do not bias our analysis.

2️⃣ Removed rows with unrealistic ages

- We filtered the dataset to keep only rows where age is between 16 and 100 years.

- This helped remove entries where respondents may have entered incorrect ages (for example, 0, 999, or other outliers).

3️⃣ Standardized gender values

- We cleaned the Gender column by converting all text to lowercase and trimming whitespace.

- We mapped common variations (like "m", "man", "cis male", "femail", etc.) to standard categories: Male, Female, or Other.

- All non-standard or unclear gender responses were grouped under Other for consistent analysis.

4️⃣ Handled missing values in selected columns

- work_interfere: Filled missing values with "Don't know" to avoid losing data in group analysis.

- state: Filled missing values with "Not specified" since location can be useful for regional insights.

- leave: Filled missing values with the mode (most common category) to reflect the typical response in the dataset.

- comments: Filled missing values with an empty string to avoid errors during text analysis.

5️⃣ Converted relevant columns to categorical data type

- We converted columns with categorical responses (e.g., Gender, Country, self_employed, treatment, benefits) to category data type.

- This improves memory efficiency and speeds up grouping, aggregation, and plotting operations.
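The gender standardization step can be sketched as a small reusable function. This is a minimal illustration, not the notebook's exact cell: `standardize_gender` and `GENDER_MAP` are hypothetical names, the map holds only a subset of the variants listed in the wrangling code, and the sample values are made up.

```python
import pandas as pd

# Subset of the variant spellings from the wrangling cell; extend as new ones appear
GENDER_MAP = {
    "male": "Male", "m": "Male", "man": "Male", "cis male": "Male",
    "female": "Female", "f": "Female", "woman": "Female", "cis female": "Female",
}

def standardize_gender(series: pd.Series) -> pd.Series:
    """Lowercase and trim, map known variants, send everything else to 'Other'."""
    cleaned = series.str.lower().str.strip()
    return cleaned.map(GENDER_MAP).fillna("Other")

# Illustrative raw responses, including a missing value
raw = pd.Series(["M", " Female ", "cis male", "non-binary", None])
standardized = standardize_gender(raw)

# Audit which non-missing raw labels ended up in 'Other'
unmapped = raw[standardized == "Other"].dropna().unique()
print(standardized.tolist())
print(list(unmapped))
```

Listing the unmapped labels is a cheap audit: it shows whether "Other" contains genuinely distinct identities or simply misspellings that the map does not yet cover.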

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Age Distribution

In [None]:
# Chart - 1 visualization code
try:
    plt.figure(figsize=(10, 6))
    sns.kdeplot(data=df, x='Age', fill=True, color='skyblue', linewidth=2)
    plt.title('Age Distribution of Respondents')
    plt.xlabel('Age')
    plt.ylabel('Density')
    plt.grid(True)
    plt.show()
except Exception as e:
    print(f"Error plotting age distribution: {e}")


##### 1. Why did you pick the specific chart?

We chose a KDE (kernel density estimate) plot because it gives a smooth view of the age distribution. Unlike a histogram, its shape does not depend on where the bin edges fall, which makes it easier to judge whether the distribution is roughly symmetric, skewed, or multi-modal and to spot unusual clusters. Since age is effectively a continuous variable, a KDE is a natural way to visualize it.

##### 2. What is/are the insight(s) found from the chart?

- The age distribution is concentrated between the mid-20s and early 40s, indicating that most respondents are in this working-age bracket.

- There is a peak (mode) around the late 20s to early 30s, suggesting that this is the most common age group among survey participants.

- There are fewer respondents at both the younger and older extremes, consistent with the expected age distribution of tech employees.

- Since we already removed unrealistic ages during data wrangling, no extreme outliers appear in the chart.
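Claims about where the ages concentrate can be checked numerically alongside the chart. The sketch below assumes a hypothetical helper `age_band_shares` and age bands chosen purely for illustration; the toy ages are not survey values.

```python
import pandas as pd

def age_band_shares(ages: pd.Series) -> pd.Series:
    """Percentage of respondents per age band (the band edges are an assumption)."""
    bins = [15, 24, 34, 44, 54, 100]          # right-inclusive edges: (15, 24], (24, 34], ...
    labels = ["16-24", "25-34", "35-44", "45-54", "55+"]
    bands = pd.cut(ages, bins=bins, labels=labels)
    return (bands.value_counts(normalize=True).sort_index() * 100).round(1)

# Illustrative ages only
toy_ages = pd.Series([22, 27, 29, 31, 33, 36, 41, 47])
shares = age_band_shares(toy_ages)
print(shares)
```

Running `age_band_shares(df['Age'])` on the cleaned data would give the share of respondents in each band, which makes statements like "concentrated between the mid-20s and early 40s" verifiable.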

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
The insights highlight the core demographic (mid-20s to early 40s) of tech employees participating in the survey. Understanding the age range of the workforce helps businesses tailor mental health resources, wellness programs, and communication strategies to meet the needs of this age group. For example, mental health initiatives can focus on work-life balance challenges and burnout prevention strategies relevant to this age demographic.

❌ No negative growth impact:
There is no insight from this chart that would directly lead to negative business growth. However, failing to act on the knowledge of the age concentration (e.g., not designing age-relevant programs) could indirectly result in lower employee satisfaction or retention.

#### Chart - 2: Gender Distribution

In [None]:
# Chart - 2 visualization code
try:
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Gender', palette='Set2')
    plt.title('Gender Distribution of Respondents')
    plt.xlabel('Gender')
    plt.ylabel('Number of Respondents')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting gender distribution: {e}")


##### 1. Why did you pick the specific chart?

We selected a bar plot because it is the most effective way to display the frequency of categorical variables like gender. The bar plot allows for easy comparison between the standardized gender categories (Male, Female, Other) and helps visualize the relative proportions of each group in the dataset.

##### 2. What is/are the insight(s) found from the chart?

- The majority of respondents identify as Male, which aligns with the known gender distribution in the tech industry in 2014.

- A smaller proportion identify as Female, while an even smaller portion fall into the Other category, which includes non-binary and less common gender identities as standardized during data wrangling.

- This distribution reflects the gender imbalance that exists in tech workplaces, providing important context for later analysis on treatment, support, and workplace perceptions.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
This insight helps organizations recognize the gender composition of their workforce and highlights the importance of designing inclusive mental health programs that address the unique needs and challenges faced by different gender groups. It also provides a foundation for exploring whether mental health resources are equitably accessed across genders.

❌ No negative growth impact:
There is no direct negative business impact from this chart. However, ignoring the gender imbalance or failing to create inclusive mental health initiatives could contribute to negative outcomes like reduced diversity, lower employee morale, or difficulty attracting talent from underrepresented gender groups.

#### Chart - 3: Country Distribution

In [None]:
# Chart - 3 visualization code
try:
    plt.figure(figsize=(12, 6))
    top_countries = df['Country'].value_counts().nlargest(10)
    sns.barplot(x=top_countries.index, y=top_countries.values, palette='pastel')
    plt.title('Top 10 Countries by Number of Respondents')
    plt.xlabel('Country')
    plt.ylabel('Number of Respondents')
    plt.xticks(rotation=45)
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting country distribution: {e}")


##### 1. Why did you pick the specific chart?

We chose a bar plot of the top 10 countries to focus on the countries that contribute the most data to the survey. This helps avoid clutter from countries with very few respondents, making the chart cleaner and easier to interpret. Bar plots are ideal for comparing counts across discrete categories like countries.
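One way to keep the chart uncluttered without silently dropping the long tail is to fold the remaining countries into a single bucket. The helper name `top_n_with_other` and the toy data below are assumptions for illustration:

```python
import pandas as pd

def top_n_with_other(series: pd.Series, n: int = 10) -> pd.Series:
    """Counts for the n most frequent values, with the rest summed into 'Other'."""
    # Plain strings so the 'Other' label can be appended even if the column is categorical
    counts = series.astype(str).value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({"Other": remainder})])
    return top

# Illustrative country labels only
toy_countries = pd.Series(
    ["United States"] * 5 + ["United Kingdom"] * 3 + ["Canada"] * 2 + ["Germany", "India"]
)
summary = top_n_with_other(toy_countries, n=3)
print(summary)
```

Because the bucket keeps the total intact, the bar heights still add up to the full number of respondents.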

##### 2. What is/are the insight(s) found from the chart?

- The majority of respondents are from the United States, followed by countries like the United Kingdom, Canada, and Germany.

- There is significant representation from a few other countries, but the dataset is heavily weighted toward English-speaking and Western countries.

- This suggests that our later analyses will largely reflect the mental health landscape and workplace practices in these regions.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
Knowing where most of the data originates allows companies and analysts to tailor insights and recommendations to regions where the data is most representative. It also highlights the importance of region-specific mental health initiatives and policies. Businesses operating in these countries can use the insights with higher confidence.

❌ No negative growth impact:
There is no direct negative business impact from this insight. However, misapplying findings to underrepresented regions (where data is sparse) without caution could lead to ineffective or inappropriate policy decisions.

#### Chart - 4: Company Size Distribution

In [None]:
# Chart - 4 visualization code
try:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='no_employees', order=df['no_employees'].value_counts().index, palette='Blues')
    plt.title('Distribution of Company Sizes')
    plt.xlabel('Number of Employees')
    plt.ylabel('Number of Respondents')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting company size distribution: {e}")


##### 1. Why did you pick the specific chart?

We chose a bar plot because company size is a categorical variable representing ranges of employee counts. A bar plot clearly shows the number of respondents from each company size category, making it easy to identify which types of organizations are most represented in the dataset.

##### 2. What is/are the insight(s) found from the chart?

- The largest groups of respondents come from companies with 6-25 and 26-100 employees, with the "More than 1000" category close behind. The 500-1000 band is the least represented.

- Responses span every company size, but small and mid-sized organizations together account for the largest share of this survey.

- This suggests that mental health policies and experiences reflected in the dataset may be more indicative of practices in small to mid-sized companies.
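Company size is an ordinal variable, so it can help to read the counts in size order rather than frequency order. The sketch below assumes the band labels match the values in `no_employees`; `SIZE_ORDER` and `ordered_size_counts` are hypothetical names and the toy data is illustrative.

```python
import pandas as pd

# Size bands in ascending order (assumed to match the survey's no_employees labels)
SIZE_ORDER = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]

def ordered_size_counts(sizes: pd.Series) -> pd.Series:
    """Count respondents per size band, returned from smallest to largest band."""
    ordered = pd.Categorical(sizes, categories=SIZE_ORDER, ordered=True)
    return pd.Series(ordered).value_counts().sort_index()

# Illustrative responses only
toy_sizes = pd.Series(["26-100", "1-5", "More than 1000", "26-100", "6-25"])
size_counts = ordered_size_counts(toy_sizes)
print(size_counts)
```

For plotting, passing `order=SIZE_ORDER` to `sns.countplot` would arrange the bars the same way.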

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
Understanding company size distribution helps businesses benchmark their mental health policies against similarly sized organizations. It also enables analysts to tailor recommendations based on typical challenges faced by small, mid-sized, or large companies in supporting mental health.

❌ No negative growth impact:
There is no direct negative business impact from this insight. However, making assumptions about mental health support in large corporations based on data that mostly represents smaller companies could lead to less relevant recommendations.

#### Chart - 5: Remote Work Status Distribution

In [None]:
# Chart - 5 visualization code
try:
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='remote_work', palette='Greens')
    plt.title('Distribution of Remote Work Status')
    plt.xlabel('Remote Work (50% or More of Time)')
    plt.ylabel('Number of Respondents')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting remote work distribution: {e}")


##### 1. Why did you pick the specific chart?

A bar plot is ideal for showing the frequency of responses for a binary or categorical variable like remote work status. It makes it easy to see how many respondents work remotely at least 50% of the time versus those who do not, allowing for straightforward comparison.

##### 2. What is/are the insight(s) found from the chart?

- The majority of respondents do not work remotely at least 50% of the time.

- A smaller but notable proportion of respondents report working remotely the majority of the time.

- This suggests that while remote work was present in tech workplaces in 2014, it was not the dominant mode of working for most survey participants.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
This insight helps companies understand the baseline prevalence of remote work at the time of the survey, which is important for interpreting mental health support structures. For example, businesses can consider whether mental health resources were designed more for in-office or remote workers and adjust accordingly.

❌ No negative growth impact:
There is no negative growth insight from this chart directly. However, failure to provide adequate mental health support tailored to remote workers could lead to disengagement or lower productivity in that group.

#### Chart - 6: Tech Company Status Distribution

In [None]:
# Chart - 6 visualization code
try:
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='tech_company', palette='Oranges')
    plt.title('Distribution of Respondents by Tech Company Status')
    plt.xlabel('Tech Company (Yes/No)')
    plt.ylabel('Number of Respondents')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting tech company status distribution: {e}")


##### 1. Why did you pick the specific chart?

We selected a bar plot because tech company status is a binary categorical variable (Yes or No). A bar plot provides a clear, simple comparison of how many respondents work in tech companies versus non-tech organizations. This format helps us understand the overall representation of tech-sector employees in the dataset.

##### 2. What is/are the insight(s) found from the chart?

- The majority of respondents work at tech companies, as expected given the focus of the survey on the tech industry.

- There is still a notable portion of respondents from non-tech companies, suggesting that mental health in the workplace is also a concern in sectors outside of core tech.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
This insight helps ensure that mental health analysis and recommendations can be tailored to the tech sector where most respondents work. Companies can benchmark their policies against industry standards and better understand if they are aligned with common practices within tech organizations.

❌ No negative growth impact:
No direct negative growth impact arises from this insight. However, overlooking the needs of employees in non-tech firms who participated in the survey could result in less inclusive mental health strategies.

#### Chart - 7: Treatment Frequency Distribution

In [None]:
# Chart - 7 visualization code
try:
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='treatment', palette='Purples')
    plt.title('Distribution of Mental Health Treatment Seeking')
    plt.xlabel('Sought Treatment (Yes/No)')
    plt.ylabel('Number of Respondents')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting treatment frequency distribution: {e}")


##### 1. Why did you pick the specific chart?

A bar plot is appropriate because treatment is a binary categorical variable. The bar plot effectively shows how many respondents reported seeking mental health treatment versus those who did not, offering a clear visual comparison of the two groups.

##### 2. What is/are the insight(s) found from the chart?

- Respondents are split almost evenly between those who have sought mental health treatment and those who have not.

- The group that has not sought treatment includes both people who have no condition to treat and people who may face barriers or reluctance to seek help; this chart alone cannot separate the two.

- This sets the stage for deeper analysis of what workplace or personal factors influence treatment-seeking behavior.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
These insights can help companies assess whether their employees are accessing mental health resources and identify potential gaps. Organizations can use this information to promote their support systems more effectively and reduce stigma around seeking treatment.

❌ No negative growth impact:
There is no direct negative growth insight from this chart. However, companies that ignore this kind of data risk missing opportunities to support their workforce’s mental well-being, which could indirectly affect morale and productivity.

#### Chart - 8: Age vs Treatment

In [None]:
# Chart - 8 visualization code
# Age vs Treatment: Box plot comparing age distributions for those who sought treatment vs those who did not

try:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='treatment', y='Age', palette='coolwarm')
    plt.title('Age Distribution by Treatment Status')
    plt.xlabel('Sought Treatment (Yes/No)')
    plt.ylabel('Age')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting age vs treatment box plot: {e}")


##### 1. Why did you pick the specific chart?

We chose a box plot because it provides a clear summary of the distribution of age (a numeric variable) for each treatment category (Yes/No). A box plot shows the median, quartiles, and outliers, which helps us compare age distributions and identify whether certain age groups are more or less likely to seek treatment.

##### 2. What is/are the insight(s) found from the chart?

- The median age of those who sought treatment is similar to those who did not.

- The spread of ages is slightly wider among those who sought treatment, indicating that mental health concerns and the decision to seek help span across age groups.

- No strong age-related trend is immediately obvious, suggesting that age alone may not be a primary factor in treatment-seeking behavior within this dataset.
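The box plot comparison can be backed by the quartiles it draws. This is a minimal sketch, assuming a hypothetical helper `age_summary_by_group` and a toy frame with made-up values:

```python
import pandas as pd

def age_summary_by_group(df: pd.DataFrame, group_col: str = "treatment",
                         age_col: str = "Age") -> pd.DataFrame:
    """Quartiles and interquartile range of age for each group."""
    grouped = df.groupby(group_col, observed=True)[age_col]
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ["q1", "median", "q3"]
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary

# Illustrative values only
toy = pd.DataFrame({
    "treatment": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Age": [26, 30, 40, 25, 31, 35],
})
age_stats = age_summary_by_group(toy)
print(age_stats)
```

Comparing the `median` and `iqr` columns between the two groups gives a number to attach to "similar median" and "slightly wider spread".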

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

✅ Positive business impact:
These insights highlight that mental health support strategies should be inclusive of employees across all age groups. Companies can ensure that resources and programs are designed to meet the needs of both younger and older employees, rather than assuming that mental health concerns are concentrated in a specific age group.

❌ No negative growth impact:
There is no direct negative business impact from this chart. However, failing to recognize that mental health needs span all age groups could result in programs that overlook certain segments of the workforce.

#### Chart - 9: Gender vs Treatment

In [None]:
# Chart - 9 visualization code
# Gender vs Treatment: Grouped bar plot showing treatment frequency by gender

try:
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='Gender', hue='treatment', palette='Set1')
    plt.title('Treatment Seeking by Gender')
    plt.xlabel('Gender')
    plt.ylabel('Number of Respondents')
    plt.legend(title='Sought Treatment')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting gender vs treatment grouped bar plot: {e}")


##### 1. Why did you pick the specific chart?

We selected a grouped bar plot because it effectively displays the relationship between two categorical variables: Gender and treatment. This format allows us to compare how frequently respondents from each gender category reported seeking mental health treatment, making it easy to spot differences across groups.

##### 2. What is/are the insight(s) found from the chart?

- Across all gender categories, a substantial portion of respondents reported seeking mental health treatment.

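- The proportion of treatment seekers appears relatively consistent across Male and Female groups, although the chart shows raw counts and the groups differ greatly in size, so this should be confirmed with within-gender percentages.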

- The Other category shows fewer respondents overall, which limits the strength of conclusions, but treatment seeking is still present in that group.
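
Because the gender groups differ in size, raw counts can hide differences in treatment rates. A row-normalized cross-tabulation puts every group on the same 0-100% scale. This is a minimal sketch: the helper `treatment_rate_by_group` and the `toy` rows are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def treatment_rate_by_group(frame: pd.DataFrame, group_col: str,
                            target_col: str = 'treatment') -> pd.DataFrame:
    """Percentage of each target value within each group (each row sums to 100)."""
    table = pd.crosstab(frame[group_col], frame[target_col], normalize='index') * 100
    return table.round(1)

# Made-up rows for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    'Gender': ['Male'] * 6 + ['Female'] * 3 + ['Other'],
    'treatment': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

gender_rates = treatment_rate_by_group(toy, 'Gender')
print(gender_rates)
```

In the notebook, `treatment_rate_by_group(df, 'Gender')` would give the within-gender percentages that the count plot cannot show directly.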

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive business impact:
These insights emphasize the importance of designing mental health support initiatives that are inclusive across all gender identities. They also help companies understand that treatment-seeking behavior exists across genders, guiding equitable resource allocation.

❌ No negative growth impact:
No direct negative growth insight arises from this chart. However, failing to provide gender-inclusive mental health resources or overlooking minority gender groups due to smaller numbers could undermine diversity and inclusion efforts.

#### Chart - 10: Company Size vs Mental Health Benefits

In [None]:
# Chart - 10 visualization code
# Company Size vs Benefits: Grouped bar plot showing mental health benefit availability by company size

try:
    plt.figure(figsize=(10, 6))
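    # Order bars by company size rather than by respondent count.
    # Labels are assumed to match the raw survey categories; fall back to count order otherwise.
    known_sizes = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']
    size_order = [s for s in known_sizes if s in set(df['no_employees'])] or list(df['no_employees'].value_counts().index)
    sns.countplot(data=df, x='no_employees', hue='benefits', order=size_order, palette='Set2')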
    plt.title('Mental Health Benefits by Company Size')
    plt.xlabel('Number of Employees')
    plt.ylabel('Number of Respondents')
    plt.legend(title='Provides Mental Health Benefits')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting company size vs benefits grouped bar plot: {e}")


##### 1. Why did you pick the specific chart?

We selected a grouped bar plot because it provides a clear visual comparison of mental health benefit availability across different company sizes. This format makes it easy to see how frequently companies of various sizes offer mental health benefits, helping identify any patterns based on company scale.

##### 2. What is/are the insight(s) found from the chart?

- Larger companies (for example, those with 100-500 or 500-1000 employees) appear more likely to offer mental health benefits compared to smaller firms.

- Smaller companies (such as those with 1-5 employees) more often do not provide mental health benefits.

- This suggests that the likelihood of offering mental health benefits rises with company size. Because the bars show raw counts, the pattern should be confirmed with within-size proportions.
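
One way to check the size-benefits pattern is to compute, for each company size, the share of respondents in each `benefits` category, with the sizes listed from smallest to largest. The sketch below assumes the size labels in `SIZE_ORDER` match the values in `no_employees`; the helper name and the `toy` rows are hypothetical.

```python
import pandas as pd

# Assumed size labels, smallest to largest; adjust if the cleaned data uses other labels.
SIZE_ORDER = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']

def benefits_share_by_size(frame: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each benefits answer within each company size, ordered by size."""
    table = pd.crosstab(frame['no_employees'], frame['benefits'], normalize='index') * 100
    present = [size for size in SIZE_ORDER if size in table.index]
    return table.reindex(present).round(1)

# Made-up rows for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    'no_employees': ['1-5'] * 4 + ['More than 1000'] * 4 + ['26-100'] * 2,
    'benefits': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'],
})

benefit_shares = benefits_share_by_size(toy)
print(benefit_shares)
```

If the `Yes` column rises steadily from the first row to the last when run on `df`, that supports the size-benefits relationship described above.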

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive business impact:
This insight can guide small and mid-sized companies to recognize gaps in their mental health support offerings compared to larger organizations. It can also help HR teams in larger firms benchmark their practices and further strengthen benefit programs.

❌ No negative growth impact:
There is no direct negative growth insight from this chart. However, smaller companies that continue to neglect mental health benefits could risk lower employee satisfaction, retention challenges, and reduced competitiveness.

#### Chart - 11: Remote Work vs Mental Health Consequence Fear

In [None]:
# Chart - 11 visualization code
# Remote Work vs Mental Health Consequence Fear: Grouped bar plot showing fear of negative consequences by remote work status

try:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='remote_work', hue='mental_health_consequence', palette='Set3')
    plt.title('Fear of Negative Consequences by Remote Work Status')
    plt.xlabel('Remote Work (50% or More of Time)')
    plt.ylabel('Number of Respondents')
    plt.legend(title='Fear of Negative Consequence')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting remote work vs mental health consequence fear: {e}")


##### 1. Why did you pick the specific chart?

A grouped bar plot is ideal for comparing two categorical variables: remote work status and perceived fear of negative consequences when discussing mental health. This chart type makes it easy to compare levels of fear between remote and non-remote workers.

##### 2. What is/are the insight(s) found from the chart?

- Both remote and non-remote workers show a range of responses regarding fear of negative consequences, with no extreme differences between the groups.

- There is a slight trend where non-remote workers report higher fear of negative consequences compared to remote workers, but the pattern is not dramatic.

- This suggests that workplace culture and policies may influence fear levels more than remote work status itself.
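
To put a number on how "slight" the difference between the groups is, an effect size such as Cramér's V can be computed from the contingency table. It ranges from 0 (no association) to 1 (perfect association). The sketch below computes it directly with NumPy. The function name `cramers_v` and both toy frames are assumptions for illustration, and the function assumes each variable has at least two observed levels.

```python
import numpy as np
import pandas as pd

def cramers_v(frame: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Cramér's V for two categorical columns, computed from the chi-square statistic."""
    observed = pd.crosstab(frame[col_a], frame[col_b]).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1  # assumes at least two levels per variable
    return float(np.sqrt(chi2 / (n * min_dim)))

# Made-up rows for illustration only; these are NOT survey results.
independent = pd.DataFrame({
    'remote_work': ['Yes', 'Yes', 'No', 'No'],
    'mental_health_consequence': ['Yes', 'No', 'Yes', 'No'],
})
associated = pd.DataFrame({
    'remote_work': ['Yes', 'Yes', 'No', 'No'],
    'mental_health_consequence': ['No', 'No', 'Yes', 'Yes'],
})

v_independent = cramers_v(independent, 'remote_work', 'mental_health_consequence')
v_associated = cramers_v(associated, 'remote_work', 'mental_health_consequence')
print(v_independent, v_associated)
```

On the survey data, a value close to 0 from `cramers_v(df, 'remote_work', 'mental_health_consequence')` would be consistent with the reading that remote work status matters little here.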

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive business impact:
These insights help businesses identify that fear of negative consequences exists regardless of remote work status. This means mental health policies and protections need to address both remote and in-office employees equally. It can also guide training and communication strategies to reduce stigma company-wide.

❌ No negative growth impact:
There is no direct negative growth impact from this insight. However, ignoring fear of consequences among in-office employees while focusing only on remote staff could lead to morale and trust issues.

#### Chart - 12: Age + Gender + Treatment

In [None]:
# Chart - 12 visualization code
# Age + Gender + Treatment: Box plot showing age distribution by gender and treatment status

try:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='Gender', y='Age', hue='treatment', palette='Set1')
    plt.title('Age Distribution by Gender and Treatment Status')
    plt.xlabel('Gender')
    plt.ylabel('Age')
    plt.legend(title='Sought Treatment')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting age + gender + treatment box plot: {e}")


##### 1. Why did you pick the specific chart?

We selected a box plot with hue grouping because it allows us to compare the age distribution across genders and treatment statuses simultaneously. This format effectively shows medians, spread, and outliers, helping us explore whether age and gender together relate to the likelihood of seeking treatment.

##### 2. What is/are the insight(s) found from the chart?

- For both Male and Female respondents, the age distribution of those who sought treatment is broadly similar to those who did not.

- There is no strong age difference in treatment-seeking behavior within gender groups, suggesting age and gender together do not have a major combined effect on treatment in this dataset.

- The Other gender group shows fewer respondents, making interpretation limited for that category.
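
Since some gender and treatment combinations contain few respondents, it helps to read each box alongside its group size. The following sketch tabulates the median age and count per combination and flags small cells. The helper name, the `min_n` threshold of 2, and the `toy` rows are all hypothetical choices for illustration.

```python
import pandas as pd

def age_by_gender_and_treatment(frame: pd.DataFrame, min_n: int = 2) -> pd.DataFrame:
    """Group size and median Age for every Gender x treatment combination."""
    summary = (
        frame.groupby(['Gender', 'treatment'])['Age']
        .agg(n='count', median_age='median')
        .reset_index()
    )
    # Flag cells too small to interpret; the threshold is an arbitrary example value
    summary['small_cell'] = summary['n'] < min_n
    return summary

# Made-up rows for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Other'],
    'treatment': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
    'Age': [30, 34, 28, 40, 29, 33, 27],
})

cell_summary = age_by_gender_and_treatment(toy)
print(cell_summary)
```

With real data a larger `min_n` would be more sensible; the point is simply to make thin categories visible before interpreting their boxes.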

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive business impact:
This insight highlights that mental health strategies should be inclusive across age and gender combinations, rather than targeting specific age-gender segments. It helps businesses ensure fairness in designing support programs and avoid bias in assumptions about who needs help.

❌ No negative growth impact:
There is no direct negative growth insight. However, overlooking small groups like the Other category due to their size could contribute to exclusion in mental health initiatives.

#### Chart - 13: Company Size + Benefits + Treatment

In [None]:
# Chart - 13 visualization code
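# Company Size + Treatment: Grouped bar plot showing treatment seeking across company size
# (benefit availability is not plotted here; see Chart 10 for benefits by company size)

try:
    plt.figure(figsize=(12, 6))
    # Order bars by company size rather than by respondent count.
    # Labels are assumed to match the raw survey categories; fall back to count order otherwise.
    known_sizes = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']
    size_order = [s for s in known_sizes if s in set(df['no_employees'])] or list(df['no_employees'].value_counts().index)
    sns.countplot(data=df, x='no_employees', hue='treatment', palette='Set2', order=size_order)
    plt.title('Treatment Seeking by Company Size')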
    plt.xlabel('Company Size (Number of Employees)')
    plt.ylabel('Number of Respondents')
    plt.legend(title='Sought Treatment')
    plt.grid(axis='y')
    plt.show()
except Exception as e:
    print(f"Error plotting company size + benefits + treatment grouped bar plot: {e}")

##### 1. Why did you pick the specific chart?

We chose a grouped bar plot because it lets us simultaneously compare treatment-seeking behavior across different company sizes and explore how mental health benefit availability (which typically correlates with company size) might influence this behavior. Grouped bars make patterns across multiple categorical variables easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

- Respondents from larger companies generally report seeking treatment more often.

- Smaller companies show more respondents who did not seek treatment, suggesting that limited mental health benefits (common in smaller firms) could be a contributing factor.

- This indicates that company size may influence employees' likelihood of seeking care. Benefit availability is not plotted directly in this chart, so its role is inferred from the size-benefits pattern in Chart 10.
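
To examine all three variables together, the treatment rate can be tabulated for every combination of company size and benefit availability. This is a sketch under assumptions: the size labels in `SIZE_ORDER`, the helper name, and the `toy` rows are hypothetical and are not taken from the survey.

```python
import pandas as pd

# Assumed size labels, smallest to largest; adjust if the cleaned data uses other labels.
SIZE_ORDER = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']

def treatment_rate_by_size_and_benefits(frame: pd.DataFrame) -> pd.DataFrame:
    """Percentage who sought treatment, by company size (rows) and benefits (columns)."""
    sought = frame['treatment'].eq('Yes').astype(float)
    rates = frame.assign(sought=sought).pivot_table(
        index='no_employees', columns='benefits', values='sought', aggfunc='mean'
    )
    present = [size for size in SIZE_ORDER if size in rates.index]
    return (rates.reindex(present) * 100).round(1)

# Made-up rows for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    'no_employees': ['1-5'] * 4 + ['More than 1000'] * 4,
    'benefits': ['No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'],
    'treatment': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

size_benefit_rates = treatment_rate_by_size_and_benefits(toy)
print(size_benefit_rates)
```

Reading across a row separates the effect of benefits from the effect of size, and reading down a column does the reverse, which a two-variable count plot cannot do.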

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive business impact:
These insights can help small and mid-sized companies recognize where they might need to strengthen their mental health support offerings to encourage employees to seek care. Larger firms can use this to validate their benefit programs and identify areas for improvement.

❌ No negative growth impact:
There is no direct negative growth insight. However, smaller companies that do not address mental health support gaps risk higher employee stress levels and potential retention challenges.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation heatmap: Show correlation between numeric and encoded ordinal features

try:
    # Select relevant numeric / ordinal-like features
    heatmap_df = df.copy()

    # Example encoding for ordinal features (adjust based on actual data mapping)
    if 'leave' in heatmap_df.columns:
        leave_map = {
            'Very easy': 5,
            'Somewhat easy': 4,
            "Don't know": 3,
            'Somewhat difficult': 2,
            'Very difficult': 1
        }
        heatmap_df['leave_encoded'] = heatmap_df['leave'].map(leave_map)

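    if 'work_interfere' in heatmap_df.columns:
        # "Don't know" has no position on the frequency scale, so it is left unmapped:
        # .map() returns NaN for it and .corr() skips those rows pairwise
        interfere_map = {
            'Never': 1,
            'Rarely': 2,
            'Sometimes': 3,
            'Often': 4,
            'Always': 5
        }
        heatmap_df['work_interfere_encoded'] = heatmap_df['work_interfere'].map(interfere_map)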

    # Select columns for correlation
    corr_cols = ['Age']
    if 'leave_encoded' in heatmap_df.columns:
        corr_cols.append('leave_encoded')
    if 'work_interfere_encoded' in heatmap_df.columns:
        corr_cols.append('work_interfere_encoded')

    corr_matrix = heatmap_df[corr_cols].corr()

    # Plot heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    plt.title('Correlation Heatmap of Numeric and Ordinal Features')
    plt.show()
except Exception as e:
    print(f"Error generating correlation heatmap: {e}")

##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for visualizing the strength and direction of relationships between numeric and encoded ordinal variables. It helps us quickly identify any linear associations that may exist, guiding further analysis or informing feature selection for modeling.

##### 2. What is/are the insight(s) found from the chart?

- Age shows weak or no linear correlation with the encoded leave and work_interfere values, suggesting age is not strongly associated with these workplace experiences.

- There may be mild correlation between leave ease and work_interfere, indicating employees who find it easier to take leave may report lower interference of mental health with work.

- Overall, the heatmap suggests limited linear relationships among these features.
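
Pearson correlation treats the encoded levels as evenly spaced numbers. For ordinal answers, a rank-based measure such as Spearman's correlation is a common alternative. The sketch below also leaves "Don't know" unmapped so that it does not sit on the ordinal scale. The two mapping dictionaries, the helper name, and the `toy` rows are assumptions for illustration and would need to match the actual labels in the data.

```python
import pandas as pd

# Assumed ordinal scales; any label not listed (e.g. "Don't know") becomes NaN.
LEAVE_ORDER = {'Very difficult': 1, 'Somewhat difficult': 2, 'Somewhat easy': 3, 'Very easy': 4}
INTERFERE_ORDER = {'Never': 1, 'Rarely': 2, 'Sometimes': 3, 'Often': 4}

def spearman_matrix(frame: pd.DataFrame) -> pd.DataFrame:
    """Spearman rank correlations between Age and the two encoded ordinal columns."""
    encoded = pd.DataFrame({
        'Age': frame['Age'],
        'leave_rank': frame['leave'].map(LEAVE_ORDER),
        'work_interfere_rank': frame['work_interfere'].map(INTERFERE_ORDER),
    })
    # Rows with NaN are excluded pair by pair
    return encoded.corr(method='spearman')

# Made-up rows for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    'Age': [25, 31, 38, 44, 29],
    'leave': ['Very easy', 'Somewhat easy', 'Somewhat difficult', 'Very difficult', "Don't know"],
    'work_interfere': ['Never', 'Rarely', 'Sometimes', 'Often', 'Sometimes'],
})

rank_corr = spearman_matrix(toy)
print(rank_corr)
```

The resulting matrix can be passed to `sns.heatmap` in place of the Pearson matrix to compare the two views.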

#### Chart - 15 - Pair Plot

In [None]:
# Pair plot: Visualize pairwise relationships between numeric and encoded ordinal features

try:
    # Reuse or create encoded features
    pair_df = df.copy()

    if 'leave' in pair_df.columns:
        leave_map = {
            'Very easy': 5,
            'Somewhat easy': 4,
            "Don't know": 3,
            'Somewhat difficult': 2,
            'Very difficult': 1
        }
        pair_df['leave_encoded'] = pair_df['leave'].map(leave_map)

    if 'work_interfere' in pair_df.columns:
        # "Don't know" has no position on the frequency scale, so it is left unmapped (NaN)
        interfere_map = {
            'Never': 1,
            'Rarely': 2,
            'Sometimes': 3,
            'Often': 4,
            'Always': 5
        }
        pair_df['work_interfere_encoded'] = pair_df['work_interfere'].map(interfere_map)

    plot_cols = ['Age']
    if 'leave_encoded' in pair_df.columns:
        plot_cols.append('leave_encoded')
    if 'work_interfere_encoded' in pair_df.columns:
        plot_cols.append('work_interfere_encoded')

    sns.pairplot(pair_df[plot_cols], corner=True, diag_kind='kde', plot_kws={'alpha': 0.6})
    plt.suptitle('Pair Plot of Numeric and Ordinal Features', y=1.02)
    plt.show()
except Exception as e:
    print(f"Error generating pair plot: {e}")

##### 1. Why did you pick the specific chart?

A pair plot allows us to visualize pairwise relationships between multiple numeric and ordinal features at once. It provides scatter plots for each pair of variables and distributions for each individual variable. This helps us detect patterns, clusters, or outliers that may not be visible in individual plots or correlation matrices.

##### 2. What is/are the insight(s) found from the chart?

- The scatter plots show no strong linear or non-linear relationships between age, leave ease, and work interference in this dataset.

- The distributions on the diagonal show that most respondents fall within a similar age range, and that the encoded leave ease and work interference values take only a handful of discrete levels.

- There are no clear clusters, suggesting no subgrouping of respondents based on these features alone.
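
Because the encoded ordinal columns take only a few integer values, many points in the scatter panels fall on top of each other, which can hide how dense each combination is. Adding a small random jitter before plotting is one common workaround. This is a sketch, not part of the original analysis: the helper `add_jitter`, the jitter `amount`, and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def add_jitter(values: pd.Series, amount: float = 0.2, seed: int = 0) -> pd.Series:
    """Return values shifted by uniform noise in [-amount, amount] for plotting only."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-amount, amount, size=len(values))
    return values + noise

# Made-up encoded values for illustration only; these are NOT survey results.
encoded_leave = pd.Series([1, 2, 2, 3, 4, 5, 5], dtype=float)

jittered = add_jitter(encoded_leave)
print(jittered.round(2).tolist())
```

The jittered columns would be used only for the scatter panels; summaries and correlations should still be computed on the unjittered values.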

## **5. Solution to Business Objective**

The business objective of this EDA project was to explore the mental health landscape within the tech industry workplace, with a focus on understanding treatment-seeking behavior, workplace support systems, and employee perceptions regarding mental health. Our goal was to provide insights that would help organizations identify gaps and opportunities to strengthen their mental health policies and culture.

Through detailed analysis of the survey data, we have provided a solution that meets this objective in several ways:

1️⃣ Demographic Profiling:
We identified that most respondents are in the age range of mid-20s to early 40s and predominantly identify as male. This helps organizations understand the key demographics of their workforce and design mental health initiatives that are relevant to the largest employee segments, while ensuring inclusion of underrepresented groups.

2️⃣ Workplace Characteristics:
The analysis revealed that larger companies are more likely to offer mental health benefits, while smaller companies often lack formal support. This highlights an opportunity for smaller firms to strengthen their mental health resources to match industry standards and employee expectations.

3️⃣ Treatment Patterns:
A substantial proportion of employees have sought mental health treatment, indicating a clear need for accessible and supportive mental health programs. However, barriers still exist, as seen in the portion of respondents who have not sought help. Companies can use this insight to address stigma, improve communication about available resources, and make mental health care easier to access.

4️⃣ Fear of Consequences:
Our multivariate analyses showed that fear of negative consequences from discussing mental health exists across different work settings, regardless of remote work status or company size. This points to the need for companies to create psychologically safe environments where employees feel comfortable seeking help without fear of judgment or professional risk.

5️⃣ No Strong Age or Gender Barriers:
The data did not indicate strong differences in treatment-seeking behavior based on age or gender, suggesting that mental health support efforts should be designed to reach all employees rather than targeting or excluding specific age or gender groups.

6️⃣ Data-Driven Recommendations:
By analyzing the relationship between variables such as company size, benefits, treatment, and perceptions of support, the solution equips organizations with actionable insights. These include focusing on benefit availability, creating inclusive programs, and addressing cultural barriers to mental health discussions.

In summary, our solution to the business objective is a comprehensive set of data-driven insights that can help tech companies of all sizes create more inclusive, supportive, and effective mental health policies. These insights will help improve employee well-being, reduce stigma, and contribute to a healthier workplace culture that benefits both employees and the business.

# **Conclusion**

This exploratory data analysis provided valuable insights into the state of mental health and workplace support within the tech industry, based on the 2014 survey dataset. Our analysis covered demographic patterns, workplace characteristics, treatment-seeking behavior, and the relationship between company support systems and employee mental health outcomes.

Key findings include:

- The survey respondents are primarily in their mid-20s to early 40s, with a majority identifying as male.

- Larger companies are more likely to offer mental health benefits, while smaller firms often lack formal support programs.

- A substantial portion of respondents have sought mental health treatment, indicating both awareness and need for accessible care.

- Fear of negative workplace consequences persists across company sizes and remote work status, underscoring the importance of fostering safe and supportive work environments.

- Age and gender alone do not appear to be major factors influencing treatment-seeking behavior, highlighting the need for mental health programs that are broadly inclusive.

Overall, this EDA demonstrates that while progress has been made in some areas, significant gaps remain in mental health support within the tech sector, particularly in smaller organizations. The insights generated can guide companies in designing and implementing targeted strategies that promote mental well-being, reduce stigma, and enhance employee satisfaction and productivity.

# **Future Scope**

This analysis provides a solid foundation, but there are several ways it could be extended to generate even deeper, more actionable insights:

1️⃣ Incorporate More Recent Data
The dataset is from 2014. Analyzing more recent surveys (e.g., OSMI's 2016, 2017, or newer datasets) could reveal how workplace mental health support has evolved, especially after major shifts like remote work during the COVID-19 pandemic.

2️⃣ Perform Predictive Modeling
Building machine learning models (e.g., logistic regression, decision trees) could help predict which employees are most likely to seek treatment based on demographic and workplace features, allowing targeted interventions.

3️⃣ Sentiment and Text Analysis of Comments
A deeper natural language processing (NLP) analysis of the comments column could provide qualitative insights into employee concerns and suggestions regarding mental health.

4️⃣ Regional and Cultural Comparisons
Comparing mental health support patterns across countries and regions could guide localized strategies that respect cultural differences.

5️⃣ Longitudinal Studies
Tracking mental health support trends over time in the same companies (if data were available) would help assess the impact of initiatives and policy changes.

6️⃣ Interactive Dashboards
Converting this static analysis into interactive dashboards (using tools like Power BI, Tableau, or Plotly Dash) would allow HR teams and decision makers to explore data dynamically.