# <center> Enhancing the quality of data </center>
<center> DLBDSDQDW01 - Data Wrangling and Data Quality </center>
<center> IU International University of Applied Sciences </center>

# Greetings
In this project, we tackle the challenges of analyzing messy, real-world datasets. Moving from a hypothesis straight to data analysis is rarely seamless, so this work treats data cleaning, reshaping, and tidying as foundational steps of the analytical process. By applying Data Wrangling techniques and Data Quality methods, we aim to uncover patterns and insights despite the noisy and incomplete nature of the data. The analysis highlights the relationships and trends within the dataset and demonstrates how unstructured data can be managed and interpreted to support informed decision-making. This project is based on a quantitative analysis of [data](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-2016/) collected through anonymous surveys of people working in IT-related companies around the world.

### List of contents :
1. __Introduction__
2. __Exploratory Data Analysis (EDA)__
3. _Data Pre-processing_
4. _Clustering_
5. _Clusters Profiling_

Importing the required libraries

In [None]:
# Importing the required libraries
import matplotlib.pyplot as plt
from pathlib import Path
import seaborn as sns
import pandas as pd
import numpy as np
import kagglehub
import textwrap
import warnings
import shutil

In [None]:
# Ignoring irrelevant warnings
warnings.filterwarnings('ignore')

## 1. Introduction
First, let's start by loading the data.

In [None]:
# Setting the project root (the parent of the current working directory)
path = Path.cwd().parent

### The two commented-out steps below only need to run once;
### re-downloading and moving the data on every run is unnecessary.

# Download latest version
# path_to_dataset = kagglehub.dataset_download("osmi/mental-health-in-tech-2016")

# Move the file or directory
# shutil.move(f"{path_to_dataset}/mental-heath-in-tech-2016_20161114.csv", f"{path}/data")

# Loading the data
df = pd.read_csv(f'{path}/data/mental-heath-in-tech-2016_20161114.csv')

# Printing the first rows
df.head()

In [None]:
# Printing the shape of our data
print(f"The data is formed through {df.shape[1]} columns/features and {df.shape[0]} rows/records.")

To investigate further, we could use the __".info()"__ method, but we also need to explore the unique values within each column, as this information will help us later on.

In [None]:
# Creating a user-defined function: discover_df
def discover_df(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Build a per-column report for an existing dataframe.

    Parameters:
    dataframe: any pandas dataframe

    Returns:
    A pandas dataframe with one row per column, reporting its missing values, unique values, and data type.
    """
    # Initiating an empty list data_info
    data_info = []

    # Looping over the dataset
    for index, column in enumerate(dataframe.columns):

        # Collecting the necessary information
        info = {
            # The name of the column
            'name': column,

            # The number of empty values in a column
            'empty_values': dataframe[column].isna().sum(),

            # The number of unique values, excluding missing values
            'unique_values_count': dataframe[column].nunique(),

            # The list of unique values, excluding missing values
            'unique_values_list': [element for element in dataframe[column].dropna().unique() if element != "nan"],

            # The data type of column
            'data_type': dataframe[column].dtype
        }

        # Appending the values in the pre-defined dictionary
        data_info.append(info)

    # Create a DataFrame from the gathered information
    discovered_df = pd.DataFrame(data_info)

    # Return the output
    return discovered_df

In [None]:
# Building the report and displaying it
discovered = discover_df(df)

discovered

__Insights :__
- Out of a total of 63 columns, 7 columns contain numerical data while 15 columns contain more than 500 missing entries.
- Unusual entries within columns such as _age_ and _gender_ were identified.
- Large numbers of unique values necessitate grouping into smaller categories.
- Long column names written as full questions would benefit from being shortened.
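The grouping mentioned in the insights can be sketched on toy data. The mapping `GENDER_MAP`, the helper `group_gender`, the age range of 18 to 75, and the toy values themselves are assumptions made for illustration only; they are not taken from the survey.

```python
import pandas as pd

# Toy data standing in for free-text survey answers (not real records)
toy = pd.DataFrame({
    "age": [29, 3, 41, 323, 35],
    "gender": ["Male", "female ", "M", "F", "non-binary"],
})

# Hypothetical mapping from normalized spellings to a few categories
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}


def group_gender(value: object) -> str:
    """Map a free-text entry to 'Male', 'Female', 'Other', or 'Unknown'."""
    if pd.isna(value):
        return "Unknown"
    # Normalize case and surrounding whitespace before the lookup
    key = str(value).strip().lower()
    return GENDER_MAP.get(key, "Other")


toy["gender_grouped"] = toy["gender"].map(group_gender)

# Assumed plausible age range; values outside it are set to missing
toy["age_clean"] = toy["age"].where(toy["age"].between(18, 75))

print(toy)
```

Setting implausible ages to missing, rather than dropping the rows, keeps the remaining answers of those respondents available for later analysis.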

We'll start by replacing the long question-style column names with short ones.
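As a side note, renaming by position depends on the list of new names having exactly the same order and length as the columns. A minimal sketch of a more defensive alternative, assuming a hypothetical `rename_map` dictionary and a two-column toy frame, uses an explicit question-to-name mapping:

```python
import pandas as pd

# Toy frame with two of the survey questions as column names
toy = pd.DataFrame({
    "Are you self-employed?": [0, 1],
    "What is your age?": [29, 41],
})

# Hypothetical mapping from the original question to its short name
rename_map = {
    "Are you self-employed?": "is_self_employed",
    "What is your age?": "age",
}

# Fail early if a question in the mapping is absent from the data
missing = [question for question in rename_map if question not in toy.columns]
if missing:
    raise KeyError(f"Questions not found in the data: {missing}")

# rename returns a new frame and leaves the original untouched
renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

The trade-off is verbosity: a mapping repeats every question as a key, but a reordered or missing column then raises an error instead of silently mislabeling the data.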

In [None]:
# Here are the new names of the columns
new_columns_names = [
    # Are you self-employed?
    'is_self_employed',
    # How many employees does your company or organization have?
    'organization_size',
    # Is your employer primarily a tech company/organization?
    'is_tech_company',
    # Is your primary role within your company related to tech/IT?
    'is_tech_role',
    # Does your employer provide mental health benefits as part of healthcare coverage?
    'is_mh_benefits_provided',
    # Do you know the options for mental health care available under your employer-provided coverage?
    'is_aware_mh_care_available',
    # Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
    'is_mh_discussed_by_employer',
    # Does your employer offer resources to learn more about mental health concerns and options for seeking help?
    'is_mh_resources_provided_by_employer',
    # Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
    'is_anonymity_protected',
    # If a mental health issue prompted you to request a medical leave from work, asking for that leave would be
    'how_is_asking_for_medical_leave_due_to_mhi',
    # Do you think that discussing a mental health disorder with your employer would have negative consequences?
    'is_discussing_mhd_with_employer_have_negative_consequences',
    # Do you think that discussing a physical health disorder with your employer would have negative consequences?
    'is_discussing_phd_with_employer_have_negative_consequences',
    # Would you feel comfortable discussing a mental health disorder with your coworkers?
    'is_willing_to_discuss_mhi_with_colleagues',
    # Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?
    'is_willing_to_discuss_mhi_with_direct_supervisor',
    # Do you feel that your employer takes mental health as seriously as physical health?
    'is_employer_takes_mh_seriously',
    # Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?
    'is_aware_of_previous_negative_consequence_of_colleagues_with_mhi',
    # Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?
    'have_medical_coverage_includes_mental_health_issue',
    # Do you know local or online resources to seek help for a mental health disorder?
    'know_how_to_seek_help',
    # If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?
    'is_willing_to_reveal_previous_mental_health_issue_to_business_contacts',
    # If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?
    'is_impacted_negatively_1',
    # If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?
    'is_able_to_reveal_previous_mental_health_issue_to_coworkers',
    # If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?
    'is_impacted_negatively_2',
    # Do you believe your productivity is ever affected by a mental health issue?
    'is_productivity_impacted',
    # If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?
    'percentage_impacted',
    # Do you have previous employers?
    'is_previously_employed',
    # Have your previous employers provided mental health benefits?
    'is_previous_employer_provides_mh_benefits',
    # Were you aware of the options for mental health care provided by your previous employers?
    'is_aware_mh_options_by_previous_employer',
    # Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?
    'is_mh_discussed_by_previous_employer',
    # Did your previous employers provide resources to learn more about mental health issues and how to seek help?
    'is_mh_resources_provided_by_previous_employer',
    # Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?
    'is_anonymity_protected_by_previous_employer',
    # Do you think that discussing a mental health disorder with previous employers would have negative consequences?
    'is_discussing_mhd_with_previous_employer_have_negative_consequences',
    # Do you think that discussing a physical health disorder with previous employers would have negative consequences?
    'is_discussing_phd_with_previous_employer_have_negative_consequences',
    # Would you have been willing to discuss a mental health issue with your previous co-workers?
    'is_willing_to_discuss_mhi_with_previous_colleagues',
    # Would you have been willing to discuss a mental health issue with your direct supervisor(s)?
    'is_willing_to_discuss_mhi_with_previous_direct_supervisor',
    # Did you feel that your previous employers took mental health as seriously as physical health?
    'is_previous_employer_takes_mh_seriously',
    # Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?
    'is_aware_of_previous_negative_consequence_of_colleagues_with_mhi_in_previous_workplace',
    # Would you be willing to bring up a physical health issue with a potential employer in an interview?
    'is_willing_to_bring_phi_in_interview',
    # Why or why not?
    'why_or_why_not_bring_phi_in_interview',
    # Would you bring up a mental health issue with a potential employer in an interview?
    'is_willing_to_bring_mhi_in_interview',
    # Why or why not?
    'why_or_why_not_bring_mhi_in_interview',
    # Do you feel that being identified as a person with a mental health issue would hurt your career?
    'is_being_identified_with_mhi_would_hurt_your_career',
    # Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?
    'is_being_identified_with_mhi_would_lower_your_status_among_colleagues',
    # How willing would you be to share with friends and family that you have a mental illness?
    'is_wiling_to_share_about_mhi',
    # Have you observed or experienced an unsupported or badly handled response to a mental health issue in your current or previous workplace?
    'previously_observed_experienced_response_to_mhi',
    # Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?
    'is_less_encouraged_to_reveal_mhi',
    # Do you have a family history of mental illness?
    'family_history_of_mhi',
    # Have you had a mental health disorder in the past?
    'previous_history_of_mhi',
    # Do you currently have a mental health disorder?
    'is_having_mhd',
    # If yes, what condition(s) have you been diagnosed with?
    'known_conditions',
    # If maybe, what condition(s) do you believe you have?
    'suspected_conditions',
    # Have you been diagnosed with a mental health condition by a medical professional?
    'diagnosed_by_professional',
    # If so, what condition(s) were you diagnosed with?
    'diagnosed_conditions_by_professional',
    # Have you ever sought treatment for a mental health issue from a mental health professional?
    'is_sought_treatment_for_mhi',
    # If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?
    'is_mhi_interferes_with_your_work_when_treated_effectively',
    # If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?
    'is_mhi_does_not_interfere_with_your_work_when_treated_effectively',
    # What is your age?
    'age',
    # What is your gender?
    'gender',
    # What country do you live in?
    'country_of_residency',
    # What US state or territory do you live in?
    'us_state_residency',
    # What country do you work in?
    'country_of_work',
    # What US state or territory do you work in?
    'us_state_work',
    # Which of the following best describes your work position?
    'role_description',
    # Do you work remotely?
    'is_remote'
]

# Setting the columns names in df
df.columns = new_columns_names

# Preview the data
df.head()

### Missing values



In [None]:
# Rebuild the report with the new column names
discovered = discover_df(df)

# Determine the columns with missing values
missing_vlaues_columns = discovered[discovered["empty_values"] != 0].sort_values(by="empty_values", ascending=False)

# Print the dataframe
missing_vlaues_columns

In [None]:
# Print the number of columns with missing values
print(f"We have {missing_vlaues_columns.shape[0]} columns exhibiting missing values.")

### Flag Columns
There are many reasons for a column to contain missing values. One of them is that the column is preceded by a __Flag Column__. In this dataset, we have 6 flag columns in which missing values are the result of a __"Condition not met"__.


Let's deal with columns 1 to 16 first. This group is preceded by a flag column that records whether the respondent is an employee (0) or self-employed (1).

In [None]:
# Determine how many self-employed respondents are in the dataset.
employment_df = df["is_self_employed"].value_counts().reset_index()
employment_df.columns = ['is_self_employed', 'number_of_employees']

# Look the counts up by flag value instead of relying on the row order
counts = employment_df.set_index('is_self_employed')['number_of_employees']

# Print the results
print(f"In this dataset, we have {counts.loc[1]} self-employed and {counts.loc[0]} employed respondents.")

In the following 15 questions, self-employed respondents aren't expected to answer organization-related questions, and leaving these fields empty would confuse our analysis.

In [None]:
# Storing the Organization-related questions in a list
non_answered_by_self_employed = [
    "organization_size",
    "is_tech_company",
    "is_tech_role",
    "is_mh_benefits_provided",
    "is_aware_mh_care_available",
    "is_mh_discussed_by_employer",
    "is_mh_resources_provided_by_employer",
    "is_anonymity_protected",
    "how_is_asking_for_medical_leave_due_to_mhi",
    "is_discussing_mhd_with_employer_have_negative_consequences",
    "is_discussing_phd_with_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_colleagues",
    "is_willing_to_discuss_mhi_with_direct_supervisor",
    "is_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi"
]

However, we must check our assumption first.

In [None]:
# Count the missing answers of self-employed respondents to these questions
df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed].isna().sum()

Our assumption is correct. Now, we'll fill the missing values to avoid confusion in the analysis.

In [None]:
# Declaring the variable : value_to_replace
value_to_replace = "Self-employed (Non Applicable)"

# Re-assigning the old entries with the adjusted entries
df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed] = df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed].fillna(value_to_replace)

# Verifying the number of missing values afterwards
df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed].isna().sum()

The same process is repeated for the rest of the flag columns.
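Since the same steps recur for every flag column, they could be wrapped in a small helper. The sketch below is only an illustration on toy data: the function name `fill_not_applicable` and its parameters are hypothetical, and the columns are cast to `object` first on the assumption that mixing text labels into numeric columns should not depend on implicit type conversion.

```python
import numpy as np
import pandas as pd


def fill_not_applicable(data, flag_column, flag_value, columns, label):
    """Return a copy of `data` where missing entries in `columns` are
    replaced by `label` for rows whose flag column equals `flag_value`."""
    result = data.copy()
    flagged = result[flag_column] == flag_value
    for column in columns:
        # Cast to object so a text label can live next to numeric answers
        values = result[column].astype(object)
        # Keep every value except missing ones on flagged rows
        result[column] = values.where(~(flagged & values.isna()), label)
    return result


# Toy data: one self-employed respondent and two employees
toy = pd.DataFrame({
    "is_self_employed": [1, 0, 0],
    "organization_size": [np.nan, "26-100", np.nan],
    "is_tech_company": [np.nan, 1.0, 0.0],
})

filled = fill_not_applicable(
    toy,
    flag_column="is_self_employed",
    flag_value=1,
    columns=["organization_size", "is_tech_company"],
    label="Self-employed (Non Applicable)",
)
print(filled)
```

Missing answers from rows that do not match the flag value are left untouched, so they can still be treated as genuine non-responses.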

In [None]:
# Storing the questions about previous employers in a list
non_answered_by_non_previously_employed = [
    "is_previous_employer_provides_mh_benefits",
    "is_aware_mh_options_by_previous_employer",
    "is_mh_discussed_by_previous_employer",
    "is_mh_resources_provided_by_previous_employer",
    "is_anonymity_protected_by_previous_employer",
    "is_discussing_mhd_with_previous_employer_have_negative_consequences",
    "is_discussing_phd_with_previous_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_previous_colleagues",
    "is_willing_to_discuss_mhi_with_previous_direct_supervisor",
    "is_previous_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi_in_previous_workplace",
]

In [None]:
# Declaring the variable : value_to_replace
value_to_replace = "Not previously employed (Non Applicable)"

# Filling the missing answers of respondents without previous employers
df.loc[df["is_previously_employed"] == 0, non_answered_by_non_previously_employed] = df.loc[df["is_previously_employed"] == 0, non_answered_by_non_previously_employed].fillna(value_to_replace)

# Verifying the number of missing values afterwards
df.loc[df["is_previously_employed"] == 0, non_answered_by_non_previously_employed].isna().sum()

In [None]:
# Storing the questions that follow the self-employment flag in a list
new_list = [
    "have_medical_coverage_includes_mental_health_issue",
    "know_how_to_seek_help",
    "is_willing_to_reveal_previous_mental_health_issue_to_business_contacts",
    "is_impacted_negatively_1",
    "is_able_to_reveal_previous_mental_health_issue_to_coworkers",
    "is_impacted_negatively_2",
    "is_productivity_impacted",
    "percentage_impacted",
]

# Count the missing answers of employed respondents to these questions
df.loc[df["is_self_employed"] == 0, new_list].isna().sum()

## 2. Exploratory Data Analysis
In this section, we're iterating over columns based on relevancy. For a clearer view, visualizations are used where appropriate.

In [None]:
# discovered["empty_values"] = discovered["empty_values"].apply(lambda x: x - 287 if x >= 287 else x)

In [None]:
discovered

In [None]:
# Setting the dataframe of information, using the renamed columns.
# Only employed respondents are kept, since these questions do not apply to the self-employed.
frame = (
    df[df["is_self_employed"] == 0]
    .groupby("is_tech_company")["organization_size"]
    .value_counts()
    .reset_index(name="number_of_employees")
)

# Renaming its columns
frame.columns = ["is_tech_company", "organization_size", "number_of_employees"]

# Adjusting variables
frame["is_tech_company"] = frame["is_tech_company"].replace(
    {0.0: "Non-Technology-related Organization", 1.0: "Technology-related Organization"})

# Set the figure size
plt.figure(figsize=(15,7))

# Generating the plot
sns.barplot(x="organization_size", y="number_of_employees", hue="is_tech_company", data=frame,
            palette={"Technology-related Organization": "#4CCD99", "Non-Technology-related Organization": "#FFC700"}, saturation=0.7)

# Setting labels
plt.xlabel('')
plt.ylabel('')

# Restricting the y-axis to the range from 1 to 250
plt.ylim(1, 250)

# Show legend
plt.legend(title='Company Type', loc='upper right')

# Save the figure
# plt.savefig(f"{path}/assets/Fig01 - Number of Employees per company size and type.png")

# Showing the plot
plt.show()

In the following section, a comparison is provided of current and previous employers regarding the following areas :
1. Availability of Mental Health Benefits in Healthcare Coverage.
2. Knowledge of Mental Health Care Options Under Employer-Provided Coverage.
3. Formal Discussions on Mental Health in the Workplace.
4. Availability of Resources for Mental Health Concerns.
5. Anonymity Protection for Mental Health or Substance Abuse Treatment.
6. Potential Negative Consequences of Discussing Mental Health Disorders with Employer.
7. Potential Negative Consequences of Discussing Physical Health Issues with Employer.
8. Comfort Level Discussing Mental Health with Coworkers.
9. Comfort Level Discussing Mental Health with Direct Supervisors.
10. Perception of Employer's Attitude Towards Mental Health vs. Physical Health.
11. Observation of Negative Consequences for Coworkers Open About Mental Health Issues.
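Besides plotting, such a comparison can be expressed as a side-by-side table of answer shares. The following is a minimal sketch on toy data; the helper `compare_shares` and the toy column names `current` and `previous` are hypothetical and only illustrate the idea.

```python
import pandas as pd


def compare_shares(data, current_col, previous_col):
    """Return the percentage of each answer for both columns, side by side."""
    shares = pd.DataFrame({
        # value_counts(normalize=True) ignores missing answers by default
        "current_pct": data[current_col].value_counts(normalize=True),
        "previous_pct": data[previous_col].value_counts(normalize=True),
    })
    # Answers absent from one column get a share of 0
    return shares.fillna(0).mul(100).round(1)


# Toy answers to one question about the current and the previous employer
toy = pd.DataFrame({
    "current": ["Yes", "No", "Yes", "I don't know"],
    "previous": ["No", "No", "Yes", "No"],
})

comparison = compare_shares(toy, "current", "previous")
print(comparison)
```

Percentages make the two groups comparable even when the number of respondents who answered each question differs.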

In [None]:
# Setting lists of columns (using the renamed columns)
previous_employer = [
    "is_previous_employer_provides_mh_benefits",
    "is_aware_mh_options_by_previous_employer",
    "is_mh_discussed_by_previous_employer",
    "is_mh_resources_provided_by_previous_employer",
    "is_anonymity_protected_by_previous_employer",
    "is_discussing_mhd_with_previous_employer_have_negative_consequences",
    "is_discussing_phd_with_previous_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_previous_colleagues",
    "is_willing_to_discuss_mhi_with_previous_direct_supervisor",
    "is_previous_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi_in_previous_workplace"
]

current_employer = [
    "is_mh_benefits_provided",
    "is_aware_mh_care_available",
    "is_mh_discussed_by_employer",
    "is_mh_resources_provided_by_employer",
    "is_anonymity_protected",
    "is_discussing_mhd_with_employer_have_negative_consequences",
    "is_discussing_phd_with_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_colleagues",
    "is_willing_to_discuss_mhi_with_direct_supervisor",
    "is_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi"
]

# Setting titles' list
titles = [
    "Availability of Mental Health Benefits in Healthcare Coverage",
    "Knowledge of Mental Health Care Options Under Employer-Provided Coverage",
    "Formal Discussions on Mental Health in the Workplace",
    "Availability of Resources for Mental Health Concerns",
    "Anonymity Protection for Mental Health or Substance Abuse Treatment",
    "Potential Negative Consequences of Discussing Mental Health Disorders with Employer",
    "Potential Negative Consequences of Discussing Physical Health Issues with Employer",
    "Comfort Level Discussing Mental Health with Coworkers",
    "Comfort Level Discussing Mental Health with Direct Supervisors",
    "Perception of Employer's Attitude Towards Mental Health vs. Physical Health",
    "Observation of Negative Consequences for Coworkers Open About Mental Health Issues"
]

In [None]:
# Define the size of the figure and the number of rows and columns
fig, axes = plt.subplots(nrows=len(previous_employer), ncols=2, 
                         figsize=(12, 3.2 * len(previous_employer)), sharey=True)
plt.subplots_adjust(hspace=0.75)

# Iterate through each question in previous_employer and current_employer lists
for i in range(len(previous_employer)):
    # Plot the count plot for the previous employer question
    sns.countplot(x=previous_employer[i], data=df, ax=axes[i, 0], color='#FFC700')
    axes[i, 0].set_ylabel("")
    axes[i, 0].set_xlabel("")
    axes[i, 0].set_ylim(1, 1000)
    axes[i, 0].set_xticklabels([textwrap.fill(label.get_text(), 15) for label in axes[i, 0].get_xticklabels()])
    axes[i, 0].legend(["Previous employment"], loc="upper right")

    # Plot the count plot for the current employer question
    sns.countplot(x=current_employer[i], data=df, ax=axes[i, 1], color='#4CCD99')
    axes[i, 1].set_ylabel("")
    axes[i, 1].set_xlabel("")
    axes[i, 1].set_ylim(1, 1000)
    axes[i, 1].set_xticklabels([textwrap.fill(label.get_text(), 15) for label in axes[i, 1].get_xticklabels()])
    axes[i, 1].legend(["Current employment"], loc="upper right")


    # Set a single title above each row of graphs
    axes[i, 0].set_title(titles[i], fontsize=14, pad=16)
    axes[i, 0].title.set_position([1.1, 5])

# Saving the figure 
# plt.savefig(f"{path}/assets/Fig02 - Comparison of Mental Health Issues treatment in Current and previous employment.png", dpi=300)

# Show the plot
plt.show()

Next, we explore whether employees intend to reveal physical and mental health issues in a job interview.

In [None]:
# Performing aggregations on the renamed columns, then saving them into dictionaries
phi = df["is_willing_to_bring_phi_in_interview"].value_counts().to_dict()
mhi = df["is_willing_to_bring_mhi_in_interview"].value_counts().to_dict()

# Setting the components for the chart
categories = ['Yes', 'Maybe', 'No']
phi_values = [phi['Yes'], phi['Maybe'], phi['No']]
mhi_values = [mhi['Yes'], mhi['Maybe'], mhi['No']]

# Set the width of the bars
bar_width = 0.35

# Create figure and axis
fig, ax = plt.subplots(figsize=(15, 4))

# Plot the first set of bars
bar1 = ax.bar(np.arange(len(categories)), phi_values, bar_width, color='#03AED2', label='Physical health issues')

# Calculate the position for the second set of bars
bar2_position = np.arange(len(categories)) + bar_width

# Plot the second set of bars
bar2 = ax.bar(bar2_position, mhi_values, bar_width, color='#F3CA52', label='Mental health issues')

# Set labels and ticks
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_ylim(1, 1000)
ax.set_xticks(np.arange(len(categories)))
ax.set_xticklabels(categories)

# Add legend
bars = [bar1, bar2]
labels = [bar.get_label() for bar in bars]
ax.legend(bars, labels)

# Saving the figure 
# plt.savefig(f"{path}/assets/Fig03 - Intention tp reveal Health Issues in interviews.png", bbox_inches='tight', dpi=300)

# Show plot
plt.show()

In the upcoming section, we look at how employees perceive the impact of being identified with a mental illness on their careers and on their status among their peers. Then, we assess their openness to discussing it with friends and relatives.

In [None]:
# Saving the (renamed) column names in a list
columns = [
    "is_being_identified_with_mhi_would_hurt_your_career",
    "is_being_identified_with_mhi_would_lower_your_status_among_colleagues",
    "is_wiling_to_share_about_mhi"
]

# Saving the prospected charts' titles in a list
titles = [
    "Negative impact on the career",
    "Low status probability",
    "Openness about mental health issue"
]

# Create a figure and three subplots
fig3, axes = plt.subplots(1, 3, figsize=(27, 10))

# Starting the for loop
for index, column in enumerate(columns):
    # Setting the dictionary with the relevant details
    the_dict = df[column].value_counts().to_dict()

    # Extracting the categories and their values
    categories = list(the_dict.keys())
    values = [the_dict[category] for category in categories]

    # Plotting the chart with percentages but without wedge labels
    axes[index].pie(values, labels=None, autopct='%1.1f%%', textprops={'fontsize': 24}, startangle=90,
                    colors=["#87C4FF", "#5CD2E6", "#F6F193", "#ECEE81", "#C5EBAA", "#A5DD9B",])
    axes[index].set_title(titles[index], fontweight='bold', fontsize=28)

    # Adding legend under each chart with extra space at the bottom
    axes[index].legend(categories, loc='lower center', bbox_to_anchor=(0.5, -0.2), fontsize=18)

# Adjust layout
plt.tight_layout()

# Saving the plot
# plt.savefig(f'{path}/assets/Fig04 - Potential consequences of revealing Mental Health Issues.png', bbox_inches='tight', dpi=300)

# Show the plot
plt.show()

Moving on, we investigate whether the respondents have already observed or experienced a badly handled mental health case, since this would strongly affect their willingness to report a case of their own.

In [None]:
# Setting a subset with the two (renamed) columns called new_df
new_df = df[[
    "previously_observed_experienced_response_to_mhi",
    "is_less_encouraged_to_reveal_mhi"
]].copy()

# Setting the figure size
plt.figure(figsize=(15, 4))

# Plotting the bar plot using seaborn
sns.countplot(
    x="previously_observed_experienced_response_to_mhi",
    hue="is_less_encouraged_to_reveal_mhi",
    order=["Yes, I experienced", "Yes, I observed", "Maybe/Not sure", "No"],
    palette=["#03AED2",  "#A5DD9B", "#F3CA52"],
    data=new_df)

# Setting labels and legend
plt.xlabel("")
plt.ylabel("")
plt.ylim(1, 140)
plt.legend(title="Does it impact?")

# Saving the figure
# plt.savefig(f'{path}/assets/Fig05 - Previous experiences with badly handed mental health issues.png', bbox_inches='tight', dpi=300)

# Showing the chart
plt.show()

Next, we assess how a family history of mental health issues relates to the respondents' own history of mental health issues.
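This relationship can also be summarized numerically with a row-normalized cross-tabulation. The sketch below runs on a few invented rows; only the column names `family_history_of_mhi` and `previous_history_of_mhi` come from the renamed dataset, and the values are not real survey records.

```python
import pandas as pd

# Toy answers (invented for illustration)
toy = pd.DataFrame({
    "family_history_of_mhi": ["Yes", "Yes", "No", "No", "I don't know", "Yes"],
    "previous_history_of_mhi": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Share of each personal-history answer within every family-history group
table = (
    pd.crosstab(
        toy["family_history_of_mhi"],
        toy["previous_history_of_mhi"],
        normalize="index",
    )
    .mul(100)
    .round(1)
)
print(table)
```

Normalizing by row makes groups of different sizes comparable, which raw counts in a bar chart do not show directly.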

In [None]:
# Setting a subset with the two (renamed) columns called new_df
new_df = df[[
    "family_history_of_mhi",
    "previous_history_of_mhi"
]].copy()

# Setting the figure size
plt.figure(figsize=(15, 4))

# Plotting the bar plot using seaborn
sns.countplot(
    x="family_history_of_mhi",
    hue="previous_history_of_mhi",
    palette=["#03AED2",  "#A5DD9B", "#F3CA52"],
    order=["Yes", "I don't know", "No"],
    data=new_df)

# Setting labels and legend
plt.xlabel("")
plt.ylabel("")
plt.ylim(1, 500)
plt.legend(title="Previous mental health issues")

# Saving the figure
# plt.savefig(f'{path}/assets/Fig06 - Previous and family mental health issues.png', bbox_inches='tight', dpi=300)

# Showing the chart
plt.show()

With __575__ respondents reporting a mental health disorder and another __531__ suspecting they have one, we want to explore which mental health disorders are the most frequent.

In [None]:
# Define a new helper function
def conditions_counter(column_name):
    """
    :param column_name: The name of the desired column
    :return: a DataFrame containing the conditions and the number of their occurrences in the column
    """

    # Counting how often each distinct answer appears (empty values are excluded)
    answer_counts = df[column_name].value_counts()

    # Creating an empty dictionary
    a_dict = dict()

    # Iterating over each distinct answer and its number of occurrences
    for element, occurrences in answer_counts.items():
        # An answer may list several conditions separated by '|';
        # split() returns a single-item list when no '|' is present
        for unit in element.split("|"):
            # Adding the occurrences to the running total of the condition
            a_dict[unit] = a_dict.get(unit, 0) + int(occurrences)

    # Turning the data stored in the dictionary into a dataframe
    dframe = pd.DataFrame(list(a_dict.items()), columns=['Condition', 'Count'])

    # The final output of the function
    return dframe
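
For comparison, here is a sketch that counts how many respondents mention each condition using the pandas methods `str.split` and `explode` instead of an explicit loop. The toy Series is made up and only mimics the `a|b` format of the survey answers.

```python
import pandas as pd

# Toy answers in the same "a|b" format as the survey (values are made up)
answers = pd.Series(["Anxiety|Depression", "Anxiety", None, "Anxiety|Depression", "ADHD"])

condition_counts = (
    answers.dropna()              # ignore empty answers
    .str.split("|")               # one list of conditions per respondent
    .explode()                    # one row per (respondent, condition)
    .value_counts()               # number of respondents per condition
    .rename_axis("Condition")
    .reset_index(name="Count")
)
print(condition_counts)
```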

In [None]:
# Setting the dataframes
known_conditions = conditions_counter("If yes, what condition(s) have you been diagnosed with?")
suspected_conditions = conditions_counter("If maybe, what condition(s) do you believe you have?")
diagnosed_conditions = conditions_counter("If so, what condition(s) were you diagnosed with?")

# Merging the data frames into conditions_df
conditions_df = pd.merge(
    pd.merge(
        known_conditions,
        suspected_conditions,
        on='Condition', how='outer',
        suffixes=('_known', '_suspected')
    ), diagnosed_conditions, on='Condition', how='outer')

# Replacing empty values
conditions_df[['Count_known', 'Count_suspected', 'Count']] = conditions_df[
    ['Count_known', 'Count_suspected', 'Count']].replace(np.nan, 0)

# Adjusting the data type
conditions_df[['Count_known', 'Count_suspected', 'Count']] = conditions_df[
    ['Count_known', 'Count_suspected', 'Count']].astype('int64')

# Adding a new column called: Total
conditions_df["Total"] = conditions_df["Count_known"] + conditions_df["Count_suspected"] + conditions_df["Count"]
conditions_df.sort_values("Total", ascending=False, inplace=True)

# Renaming the columns for better clarity
conditions_df.columns = ["Conditions", "Known", "Suspected", "Diagnosed_by_professional", "Total"]

# Displaying the conditions
conditions_df

With over __800__ respondents having already sought treatment for a mental health disorder, we want to investigate how these disorders interfere with productivity.

In [None]:
# Setting the dataframes
is_affected = pd.DataFrame(df["If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?"].value_counts()).reset_index()
is_not_affected = pd.DataFrame(df["If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?"].value_counts()).reset_index()

# Setting the column names
is_affected.columns = is_not_affected.columns = ["Categories", "Count"]

# Gathering the categories
categories_available = list(is_affected.Categories.unique())

# Create a subplot with two columns and one row and adjusting their width space
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 4))
plt.subplots_adjust(wspace=0.25)

# Plot the 1st plot
sns.barplot(data=is_affected, y='Categories', x='Count', palette='summer', ax=axes[0], order=categories_available)
axes[0].set_title("Interference with Work when Treated Effectively")
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].set_xlim(1, 600)

# Plot the 2nd plot
sns.barplot(data=is_not_affected, y='Categories', x='Count', palette='summer', ax=axes[1], order=categories_available)
axes[1].set_title("Interference with Work when NOT Treated Effectively")
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].set_xlim(1, 600)

# Inverting the x-axis and deleting the labels of the 1st plot 
axes[0].invert_xaxis()
axes[0].set_yticklabels([])

# Save the plot as a picture by specifying the file name and format
# plt.savefig(f'{path}/assets/Fig07 - Comparison of the interference of Mental health issues after being treated.png', bbox_inches='tight', dpi=300)

# Show the plot
plt.show()

Earlier on, we saw some unusual entries in the age column; a box plot makes them visible.

In [None]:
# Create a vertical box plot for the numerical column with outliers
plt.figure(figsize=(5, 5))
ax = sns.boxplot(y='What is your age?', data=df, palette='summer')

# Removing the y-axis label
plt.ylabel('')

# Set y-axis limits and omit the first 0
ax.set_ylim(1, 350)

# Save the plot as a picture by specifying the file name and format
# plt.savefig(f'{path}/assets/Fig08 - Age distribution using boxplot.png', bbox_inches='tight', dpi=300)

# Displaying the figure
plt.show()
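
The points drawn beyond the whiskers of a box plot follow the usual 1.5 × IQR rule. As a minimal sketch, the same rule can be applied by hand to list the suspicious ages. The ages below are made up, and the 1.5 factor is the conventional default rather than a choice made in this notebook.

```python
import pandas as pd

# Made-up ages, including a few implausible entries
ages = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 41, 3, 99, 323])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```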

The same applies to the gender column: more than _3_ unique values would already be unusual, yet we have __70__.

In [None]:
# Counting unique values
df["What is your gender?"].value_counts()
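
Free-text answers like these are typically harmonised before any analysis. Below is a rough sketch of one way to do it: trim and lowercase the text, then map known spellings to a few categories. The sample values and the `mapping` dictionary are hypothetical; a real mapping would have to be built from the actual unique values shown above.

```python
import pandas as pd

# Made-up sample of free-text gender answers
raw = pd.Series(["Male", "male ", "M", "Female", "f", "woman", "Non-binary", None])

# Hypothetical mapping from normalised spellings to categories
mapping = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}

# Normalising whitespace and case before looking up the category
cleaned = raw.str.strip().str.lower()
gender = cleaned.map(mapping)

# Answers that were given but are not in the mapping go to "Other";
# missing answers stay missing
gender.loc[cleaned.notna() & gender.isna()] = "Other"
print(gender.value_counts())
```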

In [None]:
# Getting the dataframes
work_country = df["What country do you work in?"].value_counts().reset_index()
residence_country = df["What country do you live in?"].value_counts().reset_index()
work_country.columns = residence_country.columns = ["Country", "Count"]

# Merging both dataframes on the country name
countries = pd.merge(work_country, residence_country, on='Country', how='outer', suffixes=('_work', '_residency'))

# Dropping countries that appear in only one of the two columns
countries.dropna(inplace=True)

# Set types to integer
countries[["Count_work", "Count_residency"]] = countries[["Count_work", "Count_residency"]].astype("int64")
countries["Total"] = countries["Count_work"] + countries["Count_residency"]

# Preview the data
countries
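
The merged table compares per-country totals, but it does not reveal how many respondents work in a different country than the one they live in. A small sketch of a row-wise comparison follows; the toy data frame and its short column names are made up for illustration.

```python
import pandas as pd

# Made-up work and residence countries for four respondents
toy = pd.DataFrame({
    "work": ["Germany", "Canada", "India", "Germany"],
    "live": ["Germany", "Canada", "India", "France"],
})

# True where the two countries differ for the same respondent
differs = toy["work"] != toy["live"]
n_cross_border = int(differs.sum())
print(n_cross_border)
```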

We have not yet seen the exact description of the employees' roles.

In [None]:
# Applying the helper function to the role column
roles = conditions_counter("Which of the following best describes your work position?")
roles

In [None]:
# Getting the counts of the unique classes
df["Do you work remotely?"].value_counts()

## Summary

In this notebook, we discussed the following aspects:
- A basic exploration of the dataset including summary statistics, visualizations, and unique values.
- The dataset requires adequate data handling procedures.

## Author
<a href="https://www.linkedin.com/in/ab0858s/">Abdelali BARIR</a> is a veteran of the Moroccan Royal Armed Forces and a self-taught Python programmer. He is currently enrolled in the B.Sc. Data Science program at __IU International University of Applied Sciences__.

## Change Log

| Date         | Version   | Changed By       | Change Description        |
|--------------|-----------|------------------|---------------------------|
| 2024-07-10   | 1.0       | Abdelali Barir   | Modified markdown         |
