# **Are companies providing sufficient mental health support?**


![Mental Health](src/pic/pic.png)


Mental health is essential because it affects every aspect of our lives, from how we think and feel to how we interact with others and handle challenges. Good mental health enables us to cope with stress, build healthy relationships, and make meaningful contributions to our communities. Prioritizing mental health promotes overall well-being, boosts productivity, and enhances quality of life. Ignoring it can lead to serious emotional, physical, and social consequences. Taking care of our mental health is not just self-care; it’s a foundation for living a balanced and fulfilling life.


---


In [1]:
import sqlite3
from math import sqrt
import duckdb
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [2]:
from src.func.chart_functions import bar_chart, histogram_chart, histogram_chart_color

# **Connect to DB**


In [3]:
conn = sqlite3.connect("src/db/mental_health.sqlite")
df = pd.read_sql(
    """
                SELECT 
                    a.UserID
                    , a.SurveyID AS 'Year'
                    , a.QuestionID 
                    , q.questiontext 
                    , a.AnswerText 

                FROM Answer AS a 
                    LEFT JOIN Question AS q 
                    ON a.questionid = q.questionid;""",
    conn,
)
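
The connection opened above is not closed anywhere in this section. As a minimal sketch, the same join can be wrapped in `contextlib.closing` so the connection is released even if the query fails. The in-memory database, its two tables, and the single row are invented for illustration; only the column names come from the query above.

```python
import sqlite3
from contextlib import closing

QUERY = """
    SELECT a.UserID, a.SurveyID AS Year, a.QuestionID, q.questiontext, a.AnswerText
    FROM Answer AS a
        LEFT JOIN Question AS q
        ON a.questionid = q.questionid
"""


def fetch_answers(conn):
    """Run the Answer/Question join on an open connection and return all rows."""
    return conn.execute(QUERY).fetchall()


# closing() guarantees conn.close() runs even if the query raises.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    # Toy schema and rows, invented for this example only.
    demo_conn.execute("CREATE TABLE Question (questiontext TEXT, questionid INTEGER)")
    demo_conn.execute(
        "CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)"
    )
    demo_conn.execute("INSERT INTO Question VALUES ('What is your age?', 1)")
    demo_conn.execute("INSERT INTO Answer VALUES ('37', 2014, 1, 1)")
    demo_rows = fetch_answers(demo_conn)

print(demo_rows)
```

With the real file, the same pattern would wrap `sqlite3.connect("src/db/mental_health.sqlite")`, with `pd.read_sql(QUERY, conn)` called inside the `with` block.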

# **EDA**


## **Explore Dataset**


In [4]:
df.shape

(236898, 5)

In [5]:
df.columns

Index(['UserID', 'Year', 'QuestionID', 'questiontext', 'AnswerText'], dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236898 entries, 0 to 236897
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   UserID        236898 non-null  int64 
 1   Year          236898 non-null  int64 
 2   QuestionID    236898 non-null  int64 
 3   questiontext  236898 non-null  object
 4   AnswerText    236898 non-null  object
dtypes: int64(3), object(2)
memory usage: 9.0+ MB


In [7]:
df.describe().round(2)

Unnamed: 0,UserID,Year,QuestionID
count,236898.0,236898.0,236898.0
mean,2514.52,2016.57,48.66
std,1099.46,1.42,36.13
min,1.0,2014.0,1.0
25%,1691.0,2016.0,15.0
50%,2652.0,2016.0,48.0
75%,3439.0,2017.0,80.0
max,4218.0,2019.0,118.0


In [8]:
df.isna().sum()

UserID          0
Year            0
QuestionID      0
questiontext    0
AnswerText      0
dtype: int64

In [9]:
df.duplicated().sum()

0

## **Understanding the Respondents**


### **How many users were there in each year?**


In [10]:
users_df = duckdb.sql(
    """
    SELECT 
        Year, 
        count(DISTINCT UserID) AS 'Qty Users'
    FROM df
    GROUP BY Year
    ORDER BY Year
"""
).to_df()

In [11]:
bar_chart(df=users_df, xaxis="Year", yaxis="Qty Users", title="Unique Users by Year")

Survey participation peaked in **2016** with 1,433 users, with **2014** second at 1,260 users.\
However, participation steadily declined in later years, with only 352 users in **2019**.\
This decrease could reflect reduced survey visibility, participant fatigue, or shifts in interest.\
The high participation in earlier years might skew overall results, emphasizing the importance of considering temporal biases in the analysis.\
Future surveys should explore strategies to reengage participants and ensure consistent representation.


### **Users Sociodemographics**


In [12]:
# Create table with age, gender and living location
demografic_df = duckdb.sql(
    """
WITH
    age AS (
        SELECT UserID, Year, CAST(AnswerText AS INT64) AS Age
        FROM df
        WHERE QuestionID = 1
        ),
    gender AS (
        SELECT UserID, Year, AnswerText AS Gender
        FROM df
        WHERE QuestionID = 2
        ),
    location AS (
        SELECT UserID, Year, AnswerText AS Location
        FROM df
        WHERE QuestionID = 3
        ),
    final AS (
        SELECT age.UserID, age.Year, age.Age, gender.Gender, location.Location
        FROM age
            LEFT JOIN gender
            ON age.UserID = gender.UserID

            LEFT JOIN location
            ON age.UserID = location.UserID 
    )

SELECT * FROM final;
"""
).to_df()

#### **Age Data Cleaning**


In [13]:
histogram_chart(
    df=demografic_df,
    xaxis="Age",
    title="Age distribution",
)

The age histogram shows that the largest portion of respondents falls within the working-age group, specifically 25 to 40 years.\
Outliers are observed at both extremes: respondents younger than 18 years and older than 100 years.\
Since the number of outliers is minimal, they will be excluded from the analysis. The age range will be restricted to 18 to 67 years, based on the following rationale:

- 18 years: The minimum age at which individuals can fully enter the workforce.
- 67 years: Used here as an approximate upper bound for retirement age.

This adjustment ensures that the analysis focuses on a realistic and relevant working-age demographic.
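
To back the claim that the outliers are few, it may help to count how many rows the 18 to 67 filter would remove before applying it. The sketch below uses a hypothetical `ages` series in place of `demografic_df["Age"]`; the values are invented.

```python
import pandas as pd

# Hypothetical ages standing in for demografic_df["Age"].
ages = pd.Series([-1, 5, 18, 29, 34, 41, 67, 72, 329], name="Age")

in_range = ages.between(18, 67)  # inclusive on both ends, like SQL BETWEEN
n_dropped = int((~in_range).sum())
share_dropped = round(n_dropped / len(ages) * 100, 2)

print(f"{n_dropped} of {len(ages)} rows fall outside 18-67 ({share_dropped}%)")
```

On the real data, a small `share_dropped` would support excluding these rows rather than imputing them.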


In [14]:
demografic_df = duckdb.sql(
    """
    SELECT *

    FROM demografic_df 

    WHERE
        Age BETWEEN 18 AND 67 
    """
).to_df()

#### **Gender Data Cleaning**


In [15]:
gender_df = duckdb.sql(
    """
    SELECT
        Gender,
        COUNT(Gender) AS 'Qty' 

    FROM demografic_df

    GROUP BY Gender 
    ORDER BY Qty DESC
"""
).to_df()

In [16]:
bar_chart(df=gender_df, xaxis="Gender", yaxis="Qty", title="Qty by Gender")

In [17]:
# The [4:] slice assumes the four most frequent labels are the male/female variants.
print("Qty of genders other than male and female:", gender_df["Qty"][4:].sum())

Qty of genders other than male and female: 146


The gender labels Male and Female are inconsistently capitalized. To standardize the data, all gender labels will be converted to lowercase.\
Additionally, other gender labels will be eliminated for the following reasons:

- The group is too small (146 responses) to support separate calculations.
- Some labels appear to be insincere responses.

This cleaning step ensures consistency and accuracy in the dataset, focusing on meaningful and reliable insights.
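
The count of other labels above relies on the slice `gender_df["Qty"][4:]`, which depends on row order. A sketch of an order-independent check is shown below, using a hypothetical `gender` series in place of `demografic_df["Gender"]`; the labels are invented.

```python
import pandas as pd

# Hypothetical raw labels standing in for demografic_df["Gender"].
gender = pd.Series(["Male", "male", "Female", "female", "Nonbinary", "MALE", "queer"])

normalized = gender.str.lower()
keep = normalized.isin(["male", "female"])

n_other = int((~keep).sum())  # rows the cleaning step would drop
kept_counts = normalized[keep].value_counts().to_dict()

print(n_other, kept_counts)
```

Because the mask is built from the normalized labels, the result does not change if the frequency ranking of the labels changes.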


In [18]:
demografic_df = duckdb.sql(
    """
    SELECT
        UserID,
        Year,
        Age,
        LOWER(Gender) AS Gender,
        Location

    FROM demografic_df
    
    WHERE LOWER(Gender) IN ('male', 'female') 
"""
).to_df()

#### **Age Distribution by Gender**


In [19]:
histogram_chart_color(
    df=demografic_df, xaxis="Age", title="Age Distribution by Gender", color="Gender"
)

The histogram displays respondents aged 18 to 67, representing a typically working-age population.\
The primary group is concentrated from the late 20s to the 40s, indicating that the survey attracted actively working respondents, likely those who have experienced psychological challenges.\
However, the younger (18-24) and older (55-67) age groups appear underrepresented. This could suggest a potential bias in the data, as these groups might not fully reflect the perspectives of early-career individuals or those nearing retirement.


### **Location**


In [20]:
location_df = duckdb.sql(
    """
    SELECT 
        Location,
        COUNT(Location) AS 'Qty'

    FROM demografic_df
    GROUP BY Location
    ORDER BY Qty DESC
"""
).to_df()

In [21]:
bar_chart(df=location_df, xaxis="Location", yaxis="Qty", title="Qty of Locations")

We observe that the Location column contains duplicate entries for the same country: United States of America and United States.\
To standardize the data, all instances of United States will be updated to United States of America. Other countries do not have duplicates.\
Since the majority of respondents are from the United States of America, all countries will be grouped by continents for further analysis.\
This approach will simplify the dataset and ensure that regional trends are accurately represented in subsequent calculations.


In [22]:
demografic_df = duckdb.sql(
    """
    SELECT 
        UserID,
        Year,
        Age,
        Gender,
        CASE 
            WHEN Location = 'United States' THEN 'United States of America' 
            ELSE Location 
            END AS 'Location' 

    FROM demografic_df
"""
).to_df()

In [23]:
location_map = {
    "United States of America": "North America",
    "United Kingdom": "Europe",
    "Canada": "North America",
    "Germany": "Europe",
    "Netherlands": "Europe",
    "Australia": "Australia/Oceania",
    "France": "Europe",
    "Ireland": "Europe",
    "India": "Asia",
    "Brazil": "South America",
    "Sweden": "Europe",
    "Spain": "Europe",
    "Switzerland": "Europe",
    "New Zealand": "Australia/Oceania",
    "Poland": "Europe",
    "Portugal": "Europe",
    "Italy": "Europe",
    "Belgium": "Europe",
    "South Africa": "Africa",
    "Russia": "Asia",
    "Bulgaria": "Europe",
    "Norway": "Europe",
    "Mexico": "North America",
    "Finland": "Europe",
    "Israel": "Asia",
    "Denmark": "Europe",
    "Japan": "Asia",
    "Romania": "Europe",
    "Austria": "Europe",
    "Greece": "Europe",
    "Pakistan": "Asia",
    "Colombia": "South America",
    "Czech Republic": "Europe",
    "Estonia": "Europe",
    "Turkey": "Asia",
    "Singapore": "Asia",
    "Hungary": "Europe",
    "Argentina": "South America",
    "Croatia": "Europe",
    "Serbia": "Europe",
    "Ukraine": "Europe",
    "Bangladesh": "Asia",
    "Chile": "South America",
    "Bosnia and Herzegovina": "Europe",
    "Iceland": "Europe",
    "Lithuania": "Europe",
    "Costa Rica": "North America",
    "Afghanistan": "Asia",
    "Algeria": "Africa",
    "Nigeria": "Africa",
    "Indonesia": "Asia",
    "China": "Asia",
    "Georgia": "Asia",
    "Hong Kong": "Asia",
    "Uruguay": "South America",
    "Slovakia": "Europe",
    "Brunei": "Asia",
    "Iran": "Asia",
    "Vietnam": "Asia",
    "Ecuador": "South America",
    "Venezuela": "South America",
    "Guatemala": "North America",
    "Philippines": "Asia",
    "Belarus": "Europe",
    "Moldova": "Europe",
    "Thailand": "Asia",
    "Latvia": "Europe",
    "Slovenia": "Europe",
    "Ghana": "Africa",
    "Macedonia": "Europe",
    "Saudi Arabia": "Asia",
    "Jordan": "Asia",
    "Ethiopia": "Africa",
    "Kenya": "Africa",
    "Mauritius": "Africa",
    "Taiwan": "Asia",
}

In [24]:
demografic_df["Continent"] = demografic_df["Location"].map(location_map)
continent_df = demografic_df["Continent"].value_counts().reset_index()

In [25]:
bar_chart(
    df=continent_df, xaxis="Continent", yaxis="count", title="Distribution by Continent"
)

Based on the available data, we see that it would be most effective to use only North America and Europe data for further analysis. Other continents have insufficient data for meaningful conclusions.\
This also reveals a potential bias, as the IT sector globally has a significant workforce in Asian countries. However, the data for Asian countries is underrepresented in the dataset, limiting its ability to reflect the true distribution.
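
Because `Series.map` returns `NaN` for any country missing from `location_map`, such rows would be silently dropped by a later continent filter. The sketch below lists unmapped countries so they can be reviewed first; the two-entry mapping and the `locations` series are hypothetical stand-ins for the real objects.

```python
import pandas as pd

# Hypothetical subset of the mapping and of the Location column.
location_map = {"United States of America": "North America", "Germany": "Europe"}
locations = pd.Series(
    ["Germany", "United States of America", "Bahamas, The", "Germany"]
)

continents = locations.map(location_map)  # unmapped values become NaN
unmapped = sorted(locations[continents.isna()].unique())

print(unmapped)
```

An empty `unmapped` list on the real data would confirm that the mapping covers every country in the dataset.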


In [26]:
demografic_df = duckdb.sql(
    """
    SELECT * 

    FROM demografic_df 

    WHERE 
        Continent IN ('North America', 'Europe')
    
    ORDER BY Gender ASC
"""
).to_df()

#### **Explore Gender Data**


In [27]:
gender_gr = duckdb.sql(
    """
    SELECT 
        Gender,
        COUNT(Gender) AS 'Qty'

    FROM demografic_df
    GROUP BY Gender
    ORDER BY Gender DESC
"""
).to_df()

In [28]:
bar_chart(df=gender_gr, xaxis="Gender", yaxis="Qty", title="Distribution by Gender")

We can see that more men than women among the respondents work in the IT sector.\
This split is in line with commonly cited figures, in which women make up roughly 20-30% of the IT workforce.\
Therefore, the gender split does not point to a significant sampling bias.\
The quantitative and percentage distribution between genders by year is presented below.


In [29]:
gender_gr = duckdb.sql(
    """
    WITH 
        male_df AS (
            SELECT
                Year, 
                COUNT(Gender) AS 'Male'

            FROM demografic_df 

            WHERE Gender = 'male'

            GROUP BY Year
        ), 

        female_df AS (
            SELECT 
                Year,
                COUNT(Gender) AS 'Female'
            FROM demografic_df 
            WHERE Gender = 'female' 
            GROUP BY Year
        ),

        app_df AS (
            SELECT 
                m.Year, 
                m.Male,
                f.Female

            FROM male_df AS m 
                LEFT JOIN female_df AS f 
                ON m.Year = f.Year
        ), 

        final_df AS (
            SELECT
                Year, 
                Male,
                Female, 
                ROUND(Male / (Male + Female) * 100, 2) AS 'Male_proc',
                ROUND(Female / (Male + Female) * 100, 2) AS 'Female_proc'
            
            FROM app_df 
        )
     
SELECT * FROM final_df 
ORDER BY Year ASC
"""
).to_df()

In [30]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=gender_gr["Year"],
        y=gender_gr["Male"],
        name="Male",
        marker_color="indianred",
        text=gender_gr["Male"],
    )
)

fig.add_trace(
    go.Bar(
        x=gender_gr["Year"],
        y=gender_gr["Female"],
        name="Female",
        marker_color="lightsalmon",
        text=gender_gr["Female"],
    )
)

fig.update_layout(
    title=dict(text="Male And Female Distribution by Year"),
    xaxis=dict(title=dict(text="Year")),
)

In [31]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=gender_gr["Year"],
        y=gender_gr["Male_proc"],
        name="Male",
        marker_color="indianred",
        text=gender_gr["Male_proc"],
        textposition="top center",
        mode="text+lines+markers",
    )
)

fig.add_trace(
    go.Scatter(
        x=gender_gr["Year"],
        y=gender_gr["Female_proc"],
        name="Female",
        marker_color="lightsalmon",
        text=gender_gr["Female_proc"],
        textposition="top center",
        mode="text+lines+markers",
    )
)

fig.update_layout(
    title="Percentage Distribution of Male and Female",
    xaxis=dict(title=dict(text="Year")),
)

The chart reflects the gender distribution of survey respondents over the years, aligning with the typical representation of females in the IT industry, which ranges from 20% to 30%.\
In 2014, males constituted 79.66% of respondents, while females accounted for 20.34%, consistent with industry norms. By 2018, the proportion of female respondents reached 32.78%, slightly above the typical range. Males remain the majority across all years, with proportions between 67.22% and 79.66%, reflecting the industry's demographic composition. These figures suggest that the gender distribution in the survey is broadly representative of the IT sector.


The target group consists of:

- Men and women
- Living in Europe and North America
- Aged between 18 and 67 years


## **Prevalence of mental health conditions**


### **How many respondents currently have or have had mental health issues?**


In the Have_disorder column, we will standardize the values for further calculations. Only three values will be retained:

- Yes
- No
- Don't Know

This ensures consistency and simplifies the data for accurate analysis.


In [32]:
have_disorder_df = duckdb.sql(
    """
        SELECT
            d.UserID,
            d.Year,
            d.Age,
            d.Gender,
            d.Location,
            d.Continent,
            CASE 
                WHEN hd.Have_disorder = 'Possibly' THEN 'Don''t Know'
                WHEN hd.Have_disorder = 'Maybe' THEN 'Don''t Know' 
                ELSE hd.Have_disorder
                END AS Have_disorder

        FROM demografic_df AS d 
            LEFT JOIN (
                SELECT 
                    UserID, 
                    AnswerText AS 'Have_disorder' 
                FROM df 
                WHERE QuestionID = 33
                ) AS hd 
            ON d.UserID = hd.UserID 

        WHERE
            d.Year != 2014
    """
).to_df()

In [33]:
have_disorder_gr = duckdb.sql(
    """
    WITH 
        main_df AS (
            SELECT 
                Have_disorder, 
                COUNT(Have_disorder) AS 'count',
                (SELECT COUNT(Have_disorder) FROM have_disorder_df) AS 'Total'

            FROM have_disorder_df

            GROUP BY Have_disorder
        ),

        final_df AS (
            SELECT
                Have_disorder,
                count,
                Total,
                ROUND(count / Total * 100, 2) AS 'count%'

            FROM main_df
        )

SELECT * FROM final_df
ORDER BY count DESC
"""
).to_df()

In [34]:
bar_chart(
    df=have_disorder_gr,
    xaxis="Have_disorder",
    yaxis="count%",
    title="Do you currently have a mental health disorder? (% of respondents)",
)

The chart summarizes the responses to whether participants have a mental health disorder. Among respondents, 42.04% reported having a disorder, 32.92% stated they do not have one, and 25.04% were uncertain or did not know. The high share of "Don't Know" responses indicates a notable level of uncertainty or lack of awareness about mental health, suggesting the need for better education and resources to help individuals recognize mental health conditions. This distribution provides a foundation for exploring workplace and demographic factors influencing mental health awareness and prevalence.


In [35]:
male_disorder_df = duckdb.sql(
    """
    WITH 
        main_df AS (
            SELECT 
                Have_disorder,
                count(Gender) AS count,
                (SELECT COUNT(Gender) FROM have_disorder_df WHERE Gender = 'male') AS 'Total'

            FROM have_disorder_df

            WHERE
                Gender = 'male'

            GROUP BY Have_disorder
        ),

        final_df AS (
            SELECT
                Have_disorder,
                count,
                Total, 
                ROUND(count / Total * 100, 2) AS 'count%'

            FROM main_df
        )

SELECT * FROM final_df
ORDER BY count DESC
    """
).to_df()

In [36]:
female_disorder_df = duckdb.sql(
    """
    WITH 
        main_df AS (
            SELECT 
                Have_disorder,
                count(Gender) AS 'count',
                (SELECT COUNT(Gender) FROM have_disorder_df WHERE Gender = 'female') AS 'Total'

            FROM have_disorder_df

            WHERE
                Gender = 'female'

            GROUP BY Have_disorder
        ),

        final_df AS (
            SELECT
                Have_disorder,
                count,
                Total,
                ROUND(count / Total * 100, 2) AS 'count%' 

            FROM main_df
        )

SELECT * FROM final_df  
ORDER BY count DESC
    """
).to_df()

In [37]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=male_disorder_df["Have_disorder"],
        y=male_disorder_df["count%"],
        name="Male",
        marker_color="indianred",
        text=male_disorder_df["count%"],
    )
)

fig.add_trace(
    go.Bar(
        x=female_disorder_df["Have_disorder"],
        y=female_disorder_df["count%"],
        name="Female",
        marker_color="lightsalmon",
        text=female_disorder_df["count%"],
    )
)

fig.update_layout(
    title="Mental Health Disorder Responses by Gender (%)",
    xaxis=dict(title=dict(text="Have Disorder")),
)

The chart provides a gender-wise breakdown of responses regarding mental health disorders:

Females:

- 56.34% reported having a mental health disorder, the highest proportion across both genders, suggesting a greater prevalence or willingness to acknowledge mental health challenges.
- 24.15% stated they do not have a disorder, while 19.51% were uncertain, indicating relatively lower uncertainty about their mental health compared to males.

Males:

- 36.42% of males reported having a mental health disorder, about 20 percentage points lower than females.
- 36.37% stated they do not have a disorder, a nearly equal proportion to those who reported having one.
- 27.21% of males were uncertain, reflecting a higher level of uncertainty compared to females.

Insights:\
The data suggests gender differences in how mental health is experienced or perceived. Females are more likely to report having a disorder, while males show higher levels of uncertainty. This could reflect differences in awareness, societal expectations, or reporting behavior. Addressing these disparities could help tailor mental health resources and education for different genders effectively.
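
Whether a gap between two groups is larger than chance would explain can be checked with a pooled two-proportion z-test. The sketch below is one common way to do this and is not necessarily the method used elsewhere in this analysis. The helper `two_proportion_z_test` and the counts passed to it are hypothetical; the real counts would come from `male_disorder_df` and `female_disorder_df`.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts, NOT taken from the survey: 60 of 100 vs 45 of 100 "Yes".
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(round(z, 3), round(p_value, 4))
```

A small p-value would indicate that the observed gap is unlikely under equal proportions. It says nothing about why the groups differ, and the normal approximation assumes reasonably large counts in each group.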


In [38]:
histogram_chart_color(
    df=have_disorder_df, xaxis="Age", color="Have_disorder", title="Disorder by Age"
)

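The largest group experiencing disorders is between 25 and 40 years old, often regarded as the most productive working age. Because this is also the largest respondent group (see the age distribution above), the histogram reflects counts rather than prevalence rates.\
Given this situation, companies should focus more on supporting their employees to ensure well-being and productivity.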


In [39]:
# Disorder between continents
continent_disorder_df = duckdb.sql(
    """
    WITH 
        na_main_df AS (
            SELECT
                Have_disorder,
                COUNT(Continent) AS 'North_America',
                (SELECT COUNT(Continent) FROM have_disorder_df WHERE Continent = 'North America') AS 'Total' 
                
            FROM have_disorder_df

            WHERE 
                Continent = 'North America'

            GROUP BY Have_disorder
        ), 

        na_final_df AS (
            SELECT
                Have_disorder,
                ROUND(North_America / Total * 100, 2) AS 'North_America_proc'

            FROM na_main_df 
        ),

        eu_main_df AS (
            SELECT
                Have_disorder,
                COUNT(Continent) AS 'Europe', 
                (SELECT COUNT(Continent) FROM have_disorder_df WHERE Continent = 'Europe') AS 'Total'

            FROM have_disorder_df 

            WHERE Continent = 'Europe' 
            GROUP BY Have_disorder
        ),

        eu_final_df AS (
            SELECT 
                Have_disorder,
                ROUND(Europe / Total * 100, 2) AS 'Europe_proc' 

            FROM eu_main_df 
        ),

        final_df AS (
            SELECT
                na.Have_disorder,
                na.North_America_proc,
                eu.Europe_proc,
                CASE 
                    WHEN na.Have_disorder = 'Yes' THEN 1 
                    WHEN na.Have_disorder = 'No' THEN 2
                    ELSE 3 
                    END AS 'order_col'

            FROM na_final_df AS na 
                LEFT JOIN eu_final_df AS eu 
                ON na.Have_disorder = eu.Have_disorder
        )
    
    SELECT * FROM final_df 
    ORDER BY order_col ASC
"""
).to_df()

In [40]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=continent_disorder_df["Have_disorder"],
        y=continent_disorder_df["North_America_proc"],
        name="North America",
        marker_color="indianred",
        text=continent_disorder_df["North_America_proc"],
    )
)

fig.add_trace(
    go.Bar(
        x=continent_disorder_df["Have_disorder"],
        y=continent_disorder_df["Europe_proc"],
        name="Europe",
        marker_color="lightsalmon",
        text=continent_disorder_df["Europe_proc"],
    )
)

fig.update_layout(
    title="Percentage Distribution by Continent",
    xaxis=dict(title=dict(text="Have Disorder")),
)

The chart compares the prevalence of mental health disorder responses between North America and Europe:

Key Observations:

"Yes" Responses:

- In North America, 47.34% of respondents reported having a mental health disorder, compared with 27.67% in Europe, a gap of about 20 percentage points.
- This suggests greater reported prevalence or openness about mental health in North America.

"No" Responses:

- In Europe, 43.79% stated they do not have a mental health disorder, compared to 28.91% in North America.
- This reflects a potentially lower reported prevalence or different cultural attitudes toward mental health in Europe.

"Don't Know" Responses:

- 23.75% of North American respondents and 28.53% of Europeans were uncertain about their mental health status.
- The slightly higher uncertainty in Europe may indicate less awareness or willingness to engage with mental health topics.

Insights:\
Cultural and regional differences likely influence how respondents perceive and report mental health. North America shows a higher prevalence of "Yes" responses, possibly reflecting greater mental health awareness or reduced stigma. In contrast, Europe has more "No" responses, but with a notable proportion of uncertainty, suggesting a need for further education and outreach in both regions.
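
The per-group percentages above are built from one pair of CTEs per group. As a sketch of a more compact alternative, a window function can divide each count by its group total in a single query. The example runs on the standard-library `sqlite3` module with a hypothetical `responses` table and invented rows. DuckDB also supports window functions, so the same query is expected to run through `duckdb.sql`, but that has not been checked against this notebook's data.

```python
import sqlite3
from contextlib import closing

QUERY = """
    SELECT
        Continent,
        Have_disorder,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY Continent), 2) AS pct
    FROM responses
    GROUP BY Continent, Have_disorder
    ORDER BY Continent, Have_disorder
"""

# Hypothetical rows, invented for illustration.
rows = [
    ("Europe", "Yes"), ("Europe", "No"), ("Europe", "No"), ("Europe", "Don't Know"),
    ("North America", "Yes"), ("North America", "Yes"), ("North America", "No"),
    ("North America", "Don't Know"),
]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE responses (Continent TEXT, Have_disorder TEXT)")
    conn.executemany("INSERT INTO responses VALUES (?, ?)", rows)
    # One row per (continent, answer) with its share of that continent's total.
    shares = conn.execute(QUERY).fetchall()

for continent, answer, pct in shares:
    print(continent, answer, pct)
```

The result is in long format (one row per continent and answer), which can be passed to `px.bar` with `color="Continent"` and `barmode="group"` instead of adding one trace per group.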


## **Workplace perceptions**


### **Are workplaces providing sufficient mental health support?**


In [41]:
workplaces_df = duckdb.sql(
    """
        SELECT 
            h.UserID,
            h.Year,
            h.Age,
            h.Gender,
            h.Location,
            h.Continent,
            h.Have_disorder,
            CASE 
                WHEN c.Health_benefits = 'Not eligible for coverage / NA' THEN 'No'
                ELSE c.Health_benefits
                END AS Health_benefits,

            mhs.Mental_health_services,
            fmh.Formal_mental_health,
            hr.Health_resources

        FROM have_disorder_df AS h 
            -- Does your employer provide mental health benefits as part of healthcare coverage?
            LEFT JOIN (
                SELECT 
                    UserID, 
                    AnswerText AS Health_benefits
                FROM df 
                WHERE QuestionID = 10
            ) AS c 
            ON h.UserID = c.UserID 

            -- Do you know the options for mental health care available under your employer-provided health coverage?
            LEFT JOIN (
                SELECT 
                    UserID,
                    AnswerText AS Mental_health_services
                FROM df 
                WHERE QuestionID = 14
            ) AS mhs 
            ON h.UserID = mhs.UserID 

            -- Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
            LEFT JOIN (
                SELECT 
                    UserID, 
                    AnswerText AS Formal_mental_health
                FROM df 
                WHERE QuestionID = 15            
            ) AS fmh
            ON h.UserID = fmh.UserID

            -- Does your employer offer resources to learn more about mental health disorders and options for seeking help?
            LEFT JOIN (
                SELECT 
                    UserID,
                    AnswerText AS Health_resources 
                FROM df
                WHERE QuestionID = 16
            ) AS hr 
            ON h.UserID = hr.UserID 

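        -- Note: each comparison below also drops rows where the joined answer is NULL,
        -- because NULL != '-1' does not evaluate to TRUE.
        WHERE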
            c.Health_benefits != '-1'
            AND mhs.Mental_health_services != '-1'
            AND fmh.Formal_mental_health != '-1'
            AND hr.Health_resources != '-1'
    """
).to_df()

#### **Health benefits**


In [42]:
health_summary = duckdb.sql(
    """
WITH 
    group_data AS (
        SELECT
            Health_benefits,
            COUNT(Health_benefits) AS count
        
        FROM workplaces_df

        GROUP BY Health_benefits
        ),
    sum_data AS (
        SELECT 
            COUNT(Health_benefits) AS Total
        FROM workplaces_df     
        ),
    app_tbl AS (
            SELECT
                gd.Health_benefits,
                gd.count,
                (SELECT * FROM sum_data) AS Total

            FROM group_data AS gd
        ),
    prc_calc AS (
            SELECT 
                Health_benefits,
                ROUND(count / Total * 100, 2) AS 'count%',
                CASE 
                    WHEN Health_benefits = 'Yes' THEN 1 
                    WHEN Health_benefits = 'No' THEN 2 
                    ELSE 3 
                    END AS 'order_col' 

            FROM app_tbl
        )

SELECT * FROM prc_calc 
ORDER BY order_col ASC
"""
).to_df()

In [43]:
bar_chart(
    df=health_summary,
    xaxis="Health_benefits",
    yaxis="count%",
    title="Overview of Health Benefits",
)

The chart summarizes respondents' awareness of whether their workplace provides health benefits for mental health:

Key Observations:

1. "Yes" Responses:
   - 59.46% of respondents indicated that their workplace provides mental health benefits, suggesting that a majority have access to such resources.
2. "No" Responses:
   - Only 11.40% stated that their workplace does not offer mental health benefits (this share includes answers recoded from "Not eligible for coverage / NA"), so an explicit lack of coverage is reported by a small minority.
3. "I Don't Know" Responses:
   - 29.15% of respondents were unsure whether their workplace provides mental health benefits.
   - This highlights a significant gap in communication or awareness about available resources, even among employees who might have access to them.

Insights:\
While most respondents reported having access to mental health benefits, nearly a third were unaware of their availability, pointing to a need for better communication and education in workplaces about mental health resources. Increasing awareness could encourage greater utilization of these benefits and improve workplace mental health support.


In [44]:
health_benefits_df = duckdb.sql(
    """
WITH 
    male_df AS (
        SELECT
            Health_benefits,
            COUNT(Gender) AS 'Male',
            (SELECT COUNT(Gender) FROM workplaces_df WHERE Gender = 'male') AS 'Total'
        FROM workplaces_df 
        WHERE Gender = 'male'
        GROUP BY Health_benefits
        ),

    male_final_df AS (
        SELECT 
            Health_benefits, 
            ROUND(Male / Total * 100, 2) AS 'Male_proc'
        FROM male_df
        ),
    female_df AS (
        SELECT
            Health_benefits,
            COUNT(Gender) AS 'Female',
            (SELECT COUNT(Gender) FROM workplaces_df WHERE Gender = 'female') AS 'Total'
        FROM workplaces_df 
        WHERE Gender = 'female'
        GROUP BY Health_benefits
    ),
    female_final_df AS (
        SELECT 
            Health_benefits, 
            ROUND(Female / Total * 100, 2) AS 'Female_proc'
        FROM female_df
    ),
    final_df AS (
        SELECT 
            m.Health_benefits,
            m.Male_proc AS 'Male%',
            f.Female_proc AS 'Female%', 
            CASE 
                WHEN m.Health_benefits = 'Yes' THEN 1 
                WHEN m.Health_benefits = 'No' THEN 2
                ELSE 3 
                END AS 'order_col' 

        FROM male_final_df AS m
            LEFT JOIN female_final_df AS f 
            ON m.Health_benefits = f.Health_benefits
    )

SELECT * FROM final_df 
ORDER BY order_col ASC
"""
).to_df()

In [45]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=health_benefits_df["Health_benefits"],
        y=health_benefits_df["Male%"],
        name="Male",
        text=health_benefits_df["Male%"],
        marker_color="indianred",
    )
)

fig.add_trace(
    go.Bar(
        x=health_benefits_df["Health_benefits"],
        y=health_benefits_df["Female%"],
        name="Female",
        text=health_benefits_df["Female%"],
        marker_color="lightsalmon",
    )
)

fig.update_layout(title="Health Benefits Distribution by Gender")

The chart shows a gender-wise breakdown of workplace health benefits for mental health:

Key Observations:

1. "Yes" Responses:
   - 67.01% of females reported that their workplace provides mental health benefits, compared to 56.34% of males.
   - This suggests females are more likely to recognize or acknowledge the availability of mental health benefits at their workplace.
2. "No" Responses:
   - A slightly higher percentage of males (12.68%) than females (8.29%) indicated that their workplace does not offer mental health benefits.
   - This could reflect differences in workplace environments or perceptions.
3. "I Don't Know" Responses:
   - 30.98% of males were unsure about the availability of mental health benefits, compared to 24.70% of females.
   - Males appear less aware or less informed about mental health resources in their workplace.

Insights:\
Females are more likely to report the availability of mental health benefits, potentially indicating greater awareness or communication about these resources. The higher level of uncertainty among males suggests a need for better information dissemination to ensure employees are aware of and can access available mental health support. This emphasizes the importance of targeted communication strategies to address gaps in awareness, particularly for male employees.
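The per-gender percentages produced by the CTE chain above can also be cross-checked directly in pandas. The sketch below is illustrative only: `toy_df` is a small made-up stand-in for `workplaces_df`, and the answer label `"I don't know"` is an assumption about how that category is spelled in the data.

```python
import pandas as pd

# Hypothetical toy data standing in for workplaces_df (values are made up).
toy_df = pd.DataFrame(
    {
        "Gender": ["male", "male", "male", "male", "female", "female"],
        "Health_benefits": ["Yes", "No", "Yes", "I don't know", "Yes", "No"],
    }
)

# normalize="columns" divides every cell by its column (gender) total,
# which mirrors the "count / Total * 100" step of the SQL query.
share = (
    pd.crosstab(toy_df["Health_benefits"], toy_df["Gender"], normalize="columns")
    .mul(100)
    .round(2)
)

# Same Yes / No / other ordering that the CASE ... order_col expression enforces.
share = share.reindex(["Yes", "No", "I don't know"])
print(share)
```

One crosstab call handles both genders, so the male and female branches cannot drift apart the way two hand-written CTEs can.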


In [46]:
benefits_continent_df = duckdb.sql(
    """
    WITH 
        /* North America part */
        na_main_df AS (
            SELECT 
                Health_benefits,
                COUNT(Continent) AS 'North_America',
                (SELECT COUNT(Continent) FROM workplaces_df WHERE Continent = 'North America') AS 'Total'

            FROM workplaces_df

            WHERE Continent = 'North America'

            GROUP BY Health_benefits
        ), 

        na_final_df AS (
            SELECT
                Health_benefits, 
                ROUND(North_America / Total * 100, 2) AS 'North_America_proc' 

            FROM na_main_df 
        ),
        /* Europe part */

        eu_main_df AS (
            SELECT 
                Health_benefits,
                COUNT(Continent) AS 'Europe',
                (SELECT COUNT(Continent) FROM workplaces_df WHERE Continent = 'Europe') AS 'Total'

            FROM workplaces_df

            WHERE Continent = 'Europe'

            GROUP BY Health_benefits 
        ), 
        eu_final_df AS (
            SELECT 
                Health_benefits,
                ROUND(Europe / Total * 100, 2) AS 'Europe_proc' 

            FROM eu_main_df
        ),
        /* Final df */
        final_df AS (
            SELECT 
                na.Health_benefits,
                na.North_America_proc,
                eu.Europe_proc,
                CASE 
                    WHEN na.Health_benefits = 'Yes' THEN 1 
                    WHEN na.Health_benefits = 'No' THEN 2 
                    ELSE 3 
                    END AS 'order_col' 

            FROM na_final_df AS na 
                LEFT JOIN eu_final_df AS eu 
                ON na.Health_benefits = eu.Health_benefits
        )


    SELECT * FROM final_df 
    ORDER BY order_col ASC
"""
).to_df()

In [47]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=benefits_continent_df["Health_benefits"],
        y=benefits_continent_df["North_America_proc"],
        name="North America",
        text=benefits_continent_df["North_America_proc"],
        marker_color="indianred",
    )
)

fig.add_trace(
    go.Bar(
        x=benefits_continent_df["Health_benefits"],
        y=benefits_continent_df["Europe_proc"],
        name="Europe",
        text=benefits_continent_df["Europe_proc"],
        marker_color="lightsalmon",
    )
)

fig.update_layout(title="Benefit Distribution by Continent")

The chart provides a regional comparison of awareness of workplace health benefits for mental health between North America and Europe:\
Key Observations:

1. "Yes" Responses:
   - 66.06% of respondents in North America reported that their workplace provides mental health benefits, compared to 32.91% in Europe.
   - This large gap (about 33 percentage points) suggests that North American workplaces are more likely to offer or communicate about mental health benefits.
2. "No" Responses:
   - Only 6.80% of North American respondents indicated their workplace does not provide mental health benefits, compared to 29.87% in Europe.
   - This highlights a disparity in workplace mental health support, with European workplaces lagging behind.
3. "I Don't Know" Responses:
   - 27.14% of North Americans were unsure about the availability of mental health benefits, compared to 37.22% of Europeans.
   - A higher level of uncertainty in Europe suggests less communication or awareness about mental health resources in workplaces.

Insights:\
North American workplaces appear more proactive in offering and communicating about mental health benefits compared to European workplaces. The higher uncertainty and lower affirmative responses in Europe underscore the need for improved mental health support systems and communication in the region. This disparity highlights an opportunity for European organizations to prioritize mental health initiatives to align with global best practices.
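Whether a regional gap of this size is statistically meaningful depends on the group sizes, which are not shown in this section. As a rough sketch, a two-proportion z-test could be run as below. The counts passed in (`991` of `1500` and `132` of `400`) are hypothetical placeholders chosen only to resemble the reported "Yes" shares; they are not taken from the dataset.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts: "Yes" answers out of all respondents per region.
z_stat, p_val = two_proportion_z_test(991, 1500, 132, 400)
print(f"z = {z_stat:.2f}, p = {p_val:.3g}")
```

The test only addresses sampling error; it does not account for the self-selection issues discussed under potential biases.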


### **Do you know the options for mental health care available under your employer-provided health coverage?**


In [48]:
overview_mental_helth_service = duckdb.sql(
    """
    WITH 
        main_df AS (
            SELECT
                Mental_health_services,
                COUNT(Mental_health_services) AS 'count'
            FROM workplaces_df

            GROUP BY Mental_health_services
        ), 

        final_df AS (
            SELECT *
            , CASE 
                WHEN Mental_health_services = 'Yes' THEN 1 
                WHEN Mental_health_services = 'No' THEN 2 
                ELSE 3 
                END AS 'order_col' 
            
            FROM main_df
        )
    SELECT * FROM final_df 
    ORDER BY order_col ASC
"""
).to_df()

In [49]:
bar_chart(
    df=overview_mental_helth_service,
    xaxis="Mental_health_services",
    yaxis="count",
    title="Overview of options for mental health care available under your employer-provided health coverage",
)

The chart summarizes respondents' awareness of the availability of mental health services in their workplaces:

Key Observations:

1. "Yes" Responses:
   - 809 respondents indicated that their workplace provides mental health services, showing a significant portion of organizations are offering support.
2. "No" Responses:
   - 852 respondents reported that their workplace does not provide mental health services, slightly exceeding those who confirmed availability.
   - This highlights that many workplaces still lack dedicated mental health services.
3. "I Am Not Sure" Responses:
   - 322 respondents were unsure about the availability of mental health services in their workplace.
   - This uncertainty points to gaps in communication or visibility of mental health resources.

Insights:\
The nearly equal split between "Yes" and "No" responses shows that while mental health services are becoming more common, many workplaces still do not provide them. Furthermore, the significant number of "I am not sure" responses suggests a need for improved communication and awareness to ensure employees know about the resources available to them.
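To put the raw counts on the same footing as the percentage charts, they can be converted to shares. This is a small worked example using the three counts quoted above, assuming they cover every response to this question.

```python
# Counts quoted in the observations above.
counts = {"Yes": 809, "No": 852, "I am not sure": 322}

total = sum(counts.values())

# Share of each answer as a percentage of all responses.
shares = {answer: round(n / total * 100, 2) for answer, n in counts.items()}

for answer, pct in shares.items():
    print(f"{answer}: {pct}%")
```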


#### **Mental health services by Gender**


In [50]:
gender_services_df = duckdb.sql(
    """
    WITH
        /* Male part */ 
        male_main_df AS (
            SELECT
                Mental_health_services,
                COUNT(Gender) AS 'Male',
                (SELECT COUNT(Gender) FROM workplaces_df WHERE Gender = 'male') AS 'Total'
            
            FROM workplaces_df 

            WHERE 
                Gender = 'male' 
            
            GROUP BY Mental_health_services
        ),
        male_final_df AS (
            SELECT 
                Mental_health_services,
                ROUND(Male / Total * 100, 2) AS 'Male_proc'

            FROM male_main_df
        ), 

        /* Female part */
        female_main_df AS (
            SELECT
                Mental_health_services,
                COUNT(Gender) AS 'Female',
                (SELECT COUNT(Gender) FROM workplaces_df WHERE Gender = 'female') AS 'Total'
            
            FROM workplaces_df 

            WHERE 
                Gender = 'female' 
            
            GROUP BY Mental_health_services 
        ), 

        female_final_df AS (
            SELECT 
                Mental_health_services, 
                ROUND(Female / Total * 100, 2) AS 'Female_proc' 

            FROM female_main_df
        ),
        /* Merge male and female tables */
        final_df AS (
            SELECT 
                m.Mental_health_services,
                m.Male_proc,
                f.Female_proc,
                CASE 
                    WHEN m.Mental_health_services = 'Yes' THEN 1 
                    WHEN m.Mental_health_services = 'No' THEN 2 
                    ELSE 3 
                    END AS 'order_col' 

            FROM male_final_df AS m 
                LEFT JOIN female_final_df AS f 
                ON m.Mental_health_services = f.Mental_health_services
        )

    SELECT * FROM final_df
    ORDER BY order_col ASC
"""
).to_df()

In [51]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=gender_services_df["Mental_health_services"],
        y=gender_services_df["Male_proc"],
        name="Male",
        text=gender_services_df["Male_proc"],
        marker_color="indianred",
    )
)

fig.add_trace(
    go.Bar(
        x=gender_services_df["Mental_health_services"],
        y=gender_services_df["Female_proc"],
        name="Female",
        text=gender_services_df["Female_proc"],
        marker_color="lightsalmon",
    )
)

fig.update_layout(title="Mental Health Services Distribution by Gender")

The chart provides a gender-based breakdown of awareness of workplace mental health services:

Key Observations:

1. "Yes" Responses:
   - 50.09% of females reported their workplace provides mental health services, compared to 36.97% of males.
   - This suggests females are more likely to acknowledge or have access to workplace mental health services.
2. "No" Responses:
   - 46.08% of males indicated their workplace does not offer mental health services, compared to 35.41% of females.
   - This could reflect differences in workplace environments or perceptions of service availability.
3. "I Am Not Sure" Responses:
   - 16.95% of males and 14.51% of females were unsure about the availability of mental health services.
   - This indicates a slightly higher level of uncertainty among males.

Insights:\
Females are more likely than males to report the availability of mental health services at their workplace, possibly reflecting better communication or higher utilization among females. The higher percentage of "No" and "I am not sure" responses among males highlights the need for greater awareness and outreach to ensure all employees are informed about available mental health resources.


In [52]:
continent_service_df = duckdb.sql(
    """
    WITH 
        /* North America part */
        na_service_df AS (
            SELECT 
                Mental_health_services,
                COUNT(Continent) AS 'North_America',
                (SELECT COUNT(Continent) FROM workplaces_df WHERE Continent = 'North America') AS 'Total'

            FROM workplaces_df
            WHERE Continent = 'North America'
            GROUP BY Mental_health_services
        ),
        na_final_df AS (
            SELECT
                Mental_health_services, 
                ROUND(North_America / Total * 100, 2) AS 'North_America_proc'

            FROM na_service_df
        ), 

        /* Europe part */
        eu_service_df AS (
            SELECT 
                Mental_health_services,
                COUNT(Continent) AS 'Europe',
                (SELECT COUNT(Continent) FROM workplaces_df WHERE Continent = 'Europe') AS 'Total'

            FROM workplaces_df
            WHERE Continent = 'Europe'
            GROUP BY Mental_health_services 
        ),
        eu_final_df AS (
            SELECT
                Mental_health_services, 
                ROUND(Europe / Total * 100, 2) AS 'Europe_proc'

            FROM eu_service_df 
        ),
        /* Final df */
        final_df AS (
            SELECT 
                na.Mental_health_services,
                na.North_America_proc,
                eu.Europe_proc,
                CASE 
                    WHEN na.Mental_health_services = 'Yes' THEN 1
                    WHEN na.Mental_health_services = 'No' THEN 2
                    ELSE 3 
                    END AS 'order_col' 

            FROM na_final_df AS na 
                LEFT JOIN eu_final_df AS eu 
                ON na.Mental_health_services = eu.Mental_health_services
        )
       
    SELECT * FROM final_df
    ORDER BY order_col ASC
"""
).to_df()

In [53]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=continent_service_df["Mental_health_services"],
        y=continent_service_df["North_America_proc"],
        name="North America",
        text=continent_service_df["North_America_proc"],
        marker_color="indianred",
    )
)

fig.add_trace(
    go.Bar(
        x=continent_service_df["Mental_health_services"],
        y=continent_service_df["Europe_proc"],
        name="Europe",
        text=continent_service_df["Europe_proc"],
        marker_color="lightsalmon",
    )
)

fig.update_layout(title="Mental Health Services Distribution by Continent")

The chart compares awareness of workplace mental health services between North America and Europe:

Key Observations:

1. "Yes" Responses:
   - 45.28% of North American respondents reported their workplace provides mental health services, compared to only 22.78% in Europe.
   - This highlights a significant disparity, with North American workplaces being more likely to offer such services.
2. "No" Responses:
   - 39.74% of North Americans stated their workplace does not offer mental health services, while this figure is much higher in Europe at 55.95%.
   - This indicates that European workplaces are less likely to provide mental health services.
3. "I Am Not Sure" Responses:
   - 14.99% of North American respondents and 21.27% of Europeans were unsure about service availability.
   - The higher uncertainty in Europe may reflect less communication or visibility around these services.

Insights:\
North America shows stronger workplace support for mental health services compared to Europe, both in terms of availability and awareness. European workplaces have a higher proportion of "No" and "I am not sure" responses, suggesting a need for better adoption and communication of mental health resources. Bridging this gap could significantly enhance employee well-being in European workplaces.


### **Prevalence Rate calculation**


To calculate the prevalence and confidence intervals for three mental health conditions, you need:

1. Counts of Responses for Each Condition:
   - The number of respondents who answered "Yes" (e.g., diagnosed with a specific condition).
   - The total number of respondents for each condition.
2. Formula: $\text{Prevalence}(\%) = \frac{\text{Yes Count}}{\text{Total Count}} \times 100$
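As a minimal sketch of the formula, the helper below (`prevalence_pct` is a hypothetical name, not part of the notebook) computes a prevalence percentage from two counts. The numbers in the example call are made up for illustration.

```python
def prevalence_pct(yes_count: int, total_count: int) -> float:
    """Prevalence in percent: yes_count / total_count * 100."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return yes_count / total_count * 100


# Made-up example: 50 diagnosed respondents out of 400.
example_rate = round(prevalence_pct(50, 400), 2)
print(example_rate)
```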


In [54]:
prevalence_df = df.loc[df["QuestionID"].isin([115])].rename(
    columns={"AnswerText": "Diagnose"}
)

In [55]:
total_count = prevalence_df["UserID"].count()

In [56]:
diagnose = [
    "Mood Disorder (Depression, Bipolar Disorder, etc)",
    "Anxiety Disorder (Generalized, Social, Phobia, etc)",
    "Attention Deficit Hyperactivity Disorder",
]

In [57]:
mood_disorder = len(prevalence_df.loc[prevalence_df["Diagnose"] == diagnose[0]])
anxiety_disorder = len(prevalence_df.loc[prevalence_df["Diagnose"] == diagnose[1]])
attention_deficit = len(prevalence_df.loc[prevalence_df["Diagnose"] == diagnose[2]])

In [None]:
mood_disorder_prelevance = mood_disorder / total_count * 100
anxiety_disorder_prelevance = anxiety_disorder / total_count * 100
attention_deficit_prelevance = attention_deficit / total_count * 100

### **Confidence Interval**


In statistics, a confidence interval is a range of values, derived from sample data, that is likely to contain the true value of an unknown population parameter. The interval is tied to a specific confidence level, typically expressed as a percentage.\
Formula: $CI = \hat{p} \pm Z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ \
where $\hat{p}$ is the prevalence rate (as a proportion), $Z$ is the Z-score corresponding to the desired confidence level, and $n$ is the sample size. For a 90% confidence interval, $Z$ is approximately 1.645.\
For the calculations we use a 90% confidence interval. Because the data contains bias, the intervals should be read cautiously: they describe sampling variability only and do not correct for skew in the dataset.
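The formula can be sketched as a small function. `proportion_ci` is a hypothetical helper name, and the inputs in the example call (a proportion of 0.20 and a sample size of 2000) are illustrative values, not figures from the dataset. The Z-score is derived from the confidence level with the standard library rather than hard-coded.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat: float, n: int, confidence: float = 0.90) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    # Two-sided interval: 90% confidence leaves 5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


z_90 = NormalDist().inv_cdf(0.95)  # Z-score for a 90% interval
low, high = proportion_ci(0.20, 2000)
print(f"Z = {z_90:.3f}, CI = [{low:.4f}, {high:.4f}]")
```

This approximation works best when both $n\hat{p}$ and $n(1-\hat{p})$ are reasonably large.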


In [None]:
# Prevalence
mood_perv = mood_disorder / total_count
anxiety_prev = anxiety_disorder / total_count
attention_prev = attention_deficit / total_count

In [None]:
# Prevalence percentage
mood_perv_proc = round(mood_disorder / total_count * 100, 2)
anxiety_prev_proc = round(anxiety_disorder / total_count * 100, 2)
attention_prev_proc = round(attention_deficit / total_count * 100, 2)

In [None]:
# Margin of error: half-width of the 90% CI (Z = 1.645), in percentage points
ci_mood = sqrt(mood_perv * (1 - mood_perv) / total_count) * 1.645 * 100
ci_anxiety = sqrt(anxiety_prev * (1 - anxiety_prev) / total_count) * 1.645 * 100
ci_attention = sqrt(attention_prev * (1 - attention_prev) / total_count) * 1.645 * 100

In [90]:
data = {
    "Condition": ["Mood", "Anxiety", "Attention"],
    "Prevalence_Rate": [
        mood_perv_proc,
        anxiety_prev_proc,
        attention_prev_proc,
    ],  # In percentage
    "error": [ci_mood, ci_anxiety, ci_attention],
}

In [91]:
ci_df = pd.DataFrame(data=data)

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=ci_df["Condition"],
        y=ci_df["Prevalence_Rate"],
        error_y=dict(type="data", array=ci_df["error"]),
        marker_color="#e6cea0",
    )
)
fig.update_layout(
    barmode="group",
    title_text="Prevalence Rates of Mental Health Conditions with Confidence Intervals",
    xaxis_title="Mental Health Condition",
    yaxis_title="Prevalence Rate (%)",
)

Key Observations:

1. Mood Disorders:
   - Prevalence Rate: 19.88%, the highest among the three conditions.
   - Margin of error: ±1.44 percentage points, indicating a relatively precise estimate.
2. Anxiety Disorders:
   - Prevalence Rate: 16.65%, slightly lower than mood disorders but still substantial.
   - Margin of error: ±1.35 percentage points, showing good precision in the estimate.
3. Attention Disorders:
   - Prevalence Rate: 5.84%, significantly lower compared to the other two conditions.
   - Margin of error: ±0.85 percentage points, suggesting reasonable reliability despite being the least prevalent.

Insights:

- Mood and Anxiety Disorders are the most commonly reported conditions, indicating a need for targeted workplace or healthcare interventions in these areas.
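- The lower prevalence of attention disorders suggests they may be less common in the population or underreported. The smaller margin of error follows from the lower rate itself, since the error term shrinks as the proportion moves away from 50%.
- The relatively low margins of error across all conditions indicate that sampling variability is small. They do not account for the dataset biases described in the next section.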
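The reported rates and margins can be turned into explicit interval bounds. The sketch below uses only the figures quoted above; the `overlaps` helper is a hypothetical name introduced for illustration. Treat the overlap check as a rough reading only, since the three conditions come from the same respondents.

```python
# (prevalence %, margin of error in percentage points) as reported above.
reported = {
    "Mood": (19.88, 1.44),
    "Anxiety": (16.65, 1.35),
    "Attention": (5.84, 0.85),
}

# Lower and upper bounds of each 90% interval.
intervals = {
    name: (round(rate - err, 2), round(rate + err, 2))
    for name, (rate, err) in reported.items()
}


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


mood_vs_anxiety = overlaps(intervals["Mood"], intervals["Anxiety"])

for name, (low, high) in intervals.items():
    print(f"{name}: {low}% to {high}%")
print("Mood and Anxiety intervals overlap:", mood_vs_anxiety)
```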


### **Potential Biases in the Dataset**

_Selection Bias_\
The dataset was collected via an internet survey, meaning participants self-selected to respond. This may overrepresent individuals who are more aware of or affected by mental health issues, skewing prevalence rates higher than in the general population.\
Groups with limited internet access or lower engagement in online surveys, such as older adults or individuals in rural areas, may be underrepresented.

_Demographic Bias_\
Gender Representation: The dataset reflects the gender distribution typical of the IT industry, where females constitute 20-30% of the workforce. While this aligns with industry norms, it may not reflect the broader population.

Regional Representation: A significant proportion of respondents are from North America and Europe, potentially neglecting perspectives from other regions, such as Asia or Africa, where cultural attitudes toward mental health differ.

Age Distribution: The dataset likely overrepresents working-age adults, with fewer responses from younger individuals (e.g., students) or older adults (e.g., retirees), limiting the generalizability of findings across all age groups.

### **Conclusion**

1. Gender Differences:
   - Females report higher prevalence rates for mental health conditions and greater awareness of workplace mental health benefits compared to males. This suggests potential differences in mental health experiences or willingness to report.
2. Regional Disparities:
   - North America shows higher awareness and availability of mental health benefits and services compared to Europe. This underscores a need for better adoption and communication of mental health support in European workplaces.
3. Workplace Support:
   - While over half of respondents reported access to workplace mental health benefits, a significant portion (nearly 30%) were unsure, indicating gaps in communication and resource visibility.
