# <center> Enhancing the quality of data </center>
<center> DLBDSDQDW01 - Data Wrangling and Data Quality </center>
<center> IU International University of Applied Sciences </center>

# Greetings
In this project, we tackle the challenges of analyzing messy, real-world datasets. Moving from a hypothesis straight to data analysis is rarely seamless, so this work emphasizes data cleaning, reshaping, and tidying as foundational steps of the analytical process. By applying data wrangling techniques and data quality methods, we aim to uncover patterns and insights despite the noisy and incomplete nature of the data. The analysis highlights the relationships and trends within the dataset and demonstrates how unstructured data can be managed and interpreted to support informed decision-making. The project is based on a quantitative analysis of [data](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-2016/) collected through anonymous surveys of people working in IT-related companies around the world.

### List of contents:
1. _Introduction_
2. _Exploratory Data Analysis_
3. _Data Preprocessing_
4. _Summary_

Importing the required libraries

In [1]:
# Importing the required libraries
import matplotlib.pyplot as plt
from pathlib import Path
import seaborn as sns
import pandas as pd
import numpy as np
import kagglehub
import textwrap
import warnings
import shutil

# Ignoring irrelevant warnings
warnings.filterwarnings('ignore')

## 1. Introduction
First, let's start by loading the data.

In [2]:
# Resolving the project root (the parent of the current working directory)
path = Path.cwd().parent
csv_name = "mental-heath-in-tech-2016_20161114.csv"
csv_path = path / "data" / csv_name

# Download the dataset and move the file only once,
# so that re-running the cell does not overwrite the data.
if not csv_path.exists():
    # Download latest version
    path_to_dataset = kagglehub.dataset_download("osmi/mental-health-in-tech-2016")

    # Move the file into the data directory
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(f"{path_to_dataset}/{csv_name}", csv_path)

# Loading the data
df = pd.read_csv(csv_path)

# Printing the first rows
df.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes


In [3]:
# Printing the shape of our data
print(f"The data is formed through {df.shape[1]} columns/features and {df.shape[0]} rows/records.")

The data is formed through 63 columns/features and 1433 rows/records.


## 2. Exploratory Data Analysis

To investigate further, we could use the __".info()"__ method, but we also need to explore the unique values within each column, as this information will help us later on.
We'll define a new function that reports the information we need about the dataset.

In [4]:
# Creating a user-defined function : discover_df
def discover_df(dataframe : pd.DataFrame, only_na=False, only_binary=False) -> pd.DataFrame:
    """
    Construct a dataframe with custom features based on an existing dataframe.

    Parameters:
    dataframe: any pandas dataframe
    only_na: (optional) False by default; if True, return only the columns containing missing values.
    only_binary: (optional) False by default; if True, return only the columns containing binary values.

    Returns:
    dataframe: a pandas dataframe with a detailed report on the dataset, including missing values, unique values, and data types.
    """
    # Initiating an empty list data_info
    data_info = []

    # Looping over the columns of the dataframe passed as argument
    for column in dataframe.columns:

        # Collecting the necessary information
        info = {
            # The name of the column
            'name': column,

            # The number of empty values in a column
            'empty_values': dataframe[column].isna().sum(),

            # The number of unique values (missing values are not counted)
            'unique_values_count': dataframe[column].nunique(),

            # The list of unique values
            # (the identity check misses NaN in float columns, so nan can still be listed there)
            'unique_values_list': [element for element in dataframe[column].unique() if element is not np.nan and element != "nan" ],

            # The data type of column
            'data_type': dataframe[column].dtypes
        }

        # Appending the values in the pre-defined dictionary
        data_info.append(info)

    # Create a DataFrame from the gathered information
    discovered_df = pd.DataFrame(data_info).sort_values(by="empty_values", ascending=False)

    # Return the output
    if only_na:
        discovered_df = discovered_df[discovered_df['empty_values'] != 0]
        return discovered_df
    elif only_binary:
        discovered_df = discovered_df[discovered_df['unique_values_count'] == 2]
        return discovered_df
    else:
        return discovered_df
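
As a complement, the same kind of report can be sketched with pandas built-ins that skip missing values regardless of the column type. This is only an illustrative variant on hypothetical toy data; the name `summarize_columns` and the `toy` dataframe are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey dataframe
toy = pd.DataFrame({
    "flag": [1.0, 0.0, np.nan, 1.0],
    "answer": ["Yes", "No", None, "Yes"],
    "age": [39, 29, 38, 43],
})

def summarize_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Per-column report: missing count, distinct non-missing values, dtype."""
    rows = []
    for column in dataframe.columns:
        series = dataframe[column]
        rows.append({
            "name": column,
            "empty_values": int(series.isna().sum()),
            # nunique() ignores missing values by default
            "unique_values_count": series.nunique(),
            # dropna() removes missing values whatever the dtype
            "unique_values_list": series.dropna().unique().tolist(),
            "data_type": series.dtype,
        })
    return pd.DataFrame(rows).sort_values(by="empty_values", ascending=False)

summary = summarize_columns(toy)
print(summary)
```

With this variant, float columns no longer list `nan` among their unique values, so the list and the count always agree.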

In [5]:
# Apply the function to our data
discovered = discover_df(df)

# Print the dataframe
discovered

Unnamed: 0,name,empty_values,unique_values_count,unique_values_list,data_type
19,If you have revealed a mental health issue to ...,1289,3,"[I'm not sure, Yes, No]",object
23,"If yes, what percentage of your work time (tim...",1229,4,"[1-25%, 76-100%, 26-50%, 51-75%]",object
3,Is your primary role within your company relat...,1170,2,"[nan, 1.0, 0.0]",float64
16,Do you have medical coverage (private insuranc...,1146,2,"[nan, 1.0, 0.0]",float64
18,If you have been diagnosed or treated for a me...,1146,5,"[Sometimes, if it comes up, No, because it doe...",object
...,...,...,...,...,...
55,What is your age?,0,53,"[39, 29, 38, 43, 42, 30, 37, 44, 28, 34, 35, 5...",int64
57,What country do you live in?,0,53,"[United Kingdom, United States of America, Can...",object
59,What country do you work in?,0,53,"[United Kingdom, United States of America, Can...",object
61,Which of the following best describes your wor...,0,264,"[Back-end Developer, Back-end Developer|Front-...",object


__Insights :__
- Out of a total of 63 columns, 7 columns contain numerical data while 15 columns contain more than 500 missing entries.
- Unusual entries within columns such as _age_ and _gender_ were identified.
- Columns with a large number of unique values need to be grouped into fewer categories.
- Long column names phrased as questions would benefit from being shortened.
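
Counts like the ones above can be derived directly from the dataframe rather than by reading the report. The sketch below uses a small hypothetical dataframe and a hypothetical threshold (`missing_threshold`); on the real data the threshold would be 500.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook works on the survey dataframe df
toy = pd.DataFrame({
    "age": [39, 29, 38, 43],
    "is_tech_role": [np.nan, np.nan, np.nan, 1.0],
    "gender": ["Male", "male", "M", None],
})

# Assumed threshold for "many missing entries" on this toy example
missing_threshold = 2

# Columns holding numerical data
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Columns whose number of missing entries exceeds the threshold
missing_per_column = toy.isna().sum()
sparse_columns = missing_per_column[missing_per_column > missing_threshold].index.tolist()

# Number of distinct values per column, highest first
cardinality = toy.nunique().sort_values(ascending=False)

print(numeric_columns, sparse_columns)
print(cardinality)
```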

## 3. Data Preprocessing


We'll start by replacing the long, question-style column names with short ones.

In [6]:
# Here are the new names of the columns
new_columns_names = [
    # Are you self-employed?
    'is_self_employed',
    # How many employees does your company or organization have?
    'organization_size',
    # Is your employer primarily a tech company/organization?
    'is_tech_company',
    # Is your primary role within your company related to tech/IT?
    'is_tech_role',
    # Does your employer provide mental health benefits as part of healthcare coverage?
    'is_mh_benefits_provided',
    # Do you know the options for mental health care available under your employer-provided coverage?
    'is_aware_mh_care_available',
    # Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
    'is_mh_discussed_by_employer',
    # Does your employer offer resources to learn more about mental health concerns and options for seeking help?
    'is_mh_resources_provided_by_employer',
    # Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
    'is_anonymity_protected',
    # If a mental health issue prompted you to request a medical leave from work, asking for that leave would be
    'how_is_asking_for_medical_leave_due_to_mhi',
    # Do you think that discussing a mental health disorder with your employer would have negative consequences?
    'is_discussing_mhd_with_employer_have_negative_consequences',
    # Do you think that discussing a physical health disorder with your employer would have negative consequences?
    'is_discussing_phd_with_employer_have_negative_consequences',
    # Would you feel comfortable discussing a mental health disorder with your coworkers?
    'is_willing_to_discuss_mhi_with_colleagues',
    # Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?
    'is_willing_to_discuss_mhi_with_direct_supervisor',
    # Do you feel that your employer takes mental health as seriously as physical health?
    'is_employer_takes_mh_seriously',
    # Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?
    'is_aware_of_previous_negative_consequence_of_colleagues_with_mhi',
    # Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?
    'have_medical_coverage_includes_mental_health_issue',
    # Do you know local or online resources to seek help for a mental health disorder?
    'know_how_to_seek_help',
    # If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?
    'is_willing_to_reveal_previous_mental_health_issue_to_business_contacts',
    # If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?
    'is_impacted_negatively_1',
    # If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?
    'is_able_to_reveal_previous_mental_health_issue_to_coworkers',
    # If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?
    'is_impacted_negatively_2',
    # Do you believe your productivity is ever affected by a mental health issue?
    'is_productivity_impacted',
    # If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?
    'percentage_impacted',
    # Do you have previous employers?
    'is_previously_employed',
    # Have your previous employers provided mental health benefits?
    'is_previous_employer_provides_mh_benefits',
    # Were you aware of the options for mental health care provided by your previous employers?
    'is_aware_mh_options_by_previous_employer',
    # Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?
    'is_mh_discussed_by_previous_employer',
    # Did your previous employers provide resources to learn more about mental health issues and how to seek help?
    'is_mh_resources_provided_by_previous_employer',
    # Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?
    'is_anonymity_protected_by_previous_employer',
    # Do you think that discussing a mental health disorder with previous employers would have negative consequences?
    'is_discussing_mhd_with_previous_employer_have_negative_consequences',
    # Do you think that discussing a physical health disorder with previous employers would have negative consequences?
    'is_discussing_phd_with_previous_employer_have_negative_consequences',
    # Would you have been willing to discuss a mental health issue with your previous co-workers?
    'is_willing_to_discuss_mhi_with_previous_colleagues',
    # Would you have been willing to discuss a mental health issue with your direct supervisor(s)?
    'is_willing_to_discuss_mhi_with_previous_direct_supervisor',
    # Did you feel that your previous employers took mental health as seriously as physical health?
    'is_previous_employer_takes_mh_seriously',
    # Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?
    'is_aware_of_previous_negative_consequence_of_colleagues_with_mhi_in_previous_workplace',
    # Would you be willing to bring up a physical health issue with a potential employer in an interview?
    'is_willing_to_bring_phi_in_interview',
    # Why or why not?
    'why_or_why_not_bring_phi_in_interview',
    # Would you bring up a mental health issue with a potential employer in an interview?
    'is_willing_to_bring_mhi_in_interview',
    # Why or why not?
    'why_or_why_not_bring_mhi_in_interview',
    # Do you feel that being identified as a person with a mental health issue would hurt your career?
    'is_being_identified_with_mhi_would_hurt_your_career',
    # Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?
    'is_being_identified_with_mhi_would_lower_your_status_among_colleagues',
    # How willing would you be to share with friends and family that you have a mental illness?
    'is_wiling_to_share_about_mhi',
    # Have you observed or experienced an unsupported or badly handled response to a mental health issue in your current or previous workplace?
    'previously_observed_experienced_response_to_mhi',
    # Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?
    'is_less_encouraged_to_reveal_mhi',
    # Do you have a family history of mental illness?
    'family_history_of_mhi',
    # Have you had a mental health disorder in the past?
    'previous_history_of_mhi',
    # Do you currently have a mental health disorder?
    'is_having_mhd',
    # If yes, what condition(s) have you been diagnosed with?
    'known_conditions',
    # If maybe, what condition(s) do you believe you have?
    'suspected_conditions',
    # Have you been diagnosed with a mental health condition by a medical professional?
    'diagnosed_by_professional',
    # If so, what condition(s) were you diagnosed with?
    'diagnosed_conditions_by_professional',
    # Have you ever sought treatment for a mental health issue from a mental health professional?
    'is_sought_treatment_for_mhi',
    # If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?
    'is_mhi_interferes_with_your_work_when_treated_effectively',
    # If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?
    'is_mhi_does_not_interfere_with_your_work_when_treated_effectively',
    # What is your age?
    'age',
    # What is your gender?
    'gender',
    # What country do you live in?
    'country_of_residency',
    # What US state or territory do you live in?
    'us_state_residency',
    # What country do you work in?
    'country_of_work',
    # What US state or territory do you work in?
    'us_state_work',
    # Which of the following best describes your work position?
    'role_description',
    # Do you work remotely?
    'is_remote'
]

# Setting the columns names in df
df.columns = new_columns_names

# Preview the data
df.head()

Unnamed: 0,is_self_employed,organization_size,is_tech_company,is_tech_role,is_mh_benefits_provided,is_aware_mh_care_available,is_mh_discussed_by_employer,is_mh_resources_provided_by_employer,is_anonymity_protected,how_is_asking_for_medical_leave_due_to_mhi,...,is_mhi_interferes_with_your_work_when_treated_effectively,is_mhi_does_not_interfere_with_your_work_when_treated_effectively,age,gender,country_of_residency,us_state_residency,country_of_work,us_state_work,role_description,is_remote
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes
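
Assigning a list to `df.columns` relies on the list being in exactly the same order as the columns. A possible alternative, sketched here on a hypothetical two-column dataframe, is an explicit question-to-name mapping passed to `DataFrame.rename`; the `column_mapping` dictionary below is an illustrative subset, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical toy data with question-style column names
toy = pd.DataFrame({
    "What is your age?": [39, 29],
    "What is your gender?": ["Male", "male"],
})

# Explicit mapping from the original question to the short name (subset only)
column_mapping = {
    "What is your age?": "age",
    "What is your gender?": "gender",
}

# Fail early if a question in the mapping is absent from the data
missing_questions = set(column_mapping) - set(toy.columns)
assert not missing_questions, f"Unknown columns: {missing_questions}"

# rename() matches by name, so the order of the columns no longer matters
renamed = toy.rename(columns=column_mapping)
print(renamed.columns.tolist())
```

One caveat for this dataset: the question "Why or why not?" appears twice, and pandas loads the second occurrence as "Why or why not?.1", so a mapping would have to use those de-duplicated names.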


### 3.1 Handling Raw input
Including open-ended questions in surveys offers a great opportunity to learn more about the ideas of the respondents. However, there is a trade-off: the more elaborate the question, the harder it is to classify the participant's answer. In some cases, the presence of certain words can determine the orientation of the answer, but this isn't one of those cases. Instead, we load an arranged version of the data in which these answers are already grouped into categories.
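
For contrast, here is a minimal sketch of what keyword-based classification could look like, and why it is fragile for free-text answers. The keyword lists and the helper `categorize_answer` are hypothetical and are not the rules used to produce the arranged data.

```python
import pandas as pd

# Hypothetical keyword lists; not the rules behind the arranged data
keyword_categories = {
    "Bias and Discrimination": ["stigma", "discriminat", "bias"],
    "Privacy and Personal Concerns": ["private", "privacy"],
}

def categorize_answer(answer, default="Other"):
    """Return the first category whose keywords appear in the answer."""
    if not isinstance(answer, str):
        return default
    lowered = answer.lower()
    for category, keywords in keyword_categories.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default

answers = pd.Series([
    "There is too much stigma around it",
    "It is private",
    "No stigma at my company, so I would",  # negation defeats keyword matching
    None,
])

categories = answers.map(categorize_answer)
print(categories.tolist())
```

The third answer expresses the opposite opinion but still lands in the same category as the first, which illustrates why keyword matching alone does not suit this kind of question.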

In [7]:
# Load the arranged data
data = pd.read_csv(f'{path}/data/data_v0.5.csv')

# Replace the old entries with the new categories
df['why_or_why_not_bring_phi_in_interview'] = data["Why or why not?"]
df['why_or_why_not_bring_mhi_in_interview'] = data["Why or why not?.1"]

# Display the new categories
df['why_or_why_not_bring_phi_in_interview'].value_counts()

why_or_why_not_bring_phi_in_interview
Bias and Discrimination                517
Low hiring chances                     237
Performance-related Concerns           235
Transparency and Integrity             210
Privacy and Personal Concerns          101
Uncertainty about Employer Reaction     50
Other                                   25
Ineffective Disclosure                  25
Legal Implications                      24
Negative impact on the outcome           9
Name: count, dtype: int64

### 3.2 Handling Missing values
The existence of missing values in a dataset can be caused by several factors.
Let's first identify the columns where we have missing values.

In [8]:
# List only the columns with missing values
discovered = discover_df(df, only_na=True)

# Print the dataframe
discovered.head()

Unnamed: 0,name,empty_values,unique_values_count,unique_values_list,data_type
19,is_impacted_negatively_1,1289,3,"[I'm not sure, Yes, No]",object
23,percentage_impacted,1229,4,"[1-25%, 76-100%, 26-50%, 51-75%]",object
3,is_tech_role,1170,2,"[nan, 1.0, 0.0]",float64
16,have_medical_coverage_includes_mental_health_i...,1146,2,"[nan, 1.0, 0.0]",float64
18,is_willing_to_reveal_previous_mental_health_is...,1146,5,"[Sometimes, if it comes up, No, because it doe...",object


In [9]:
# Print the number of columns with missing values
print(f"We have {discovered.shape[0]} columns exhibiting missing values.")

We have 42 columns exhibiting missing values.


#### 3.2.1 Flag Columns
There are many reasons for a column to contain missing values. One of them is that the column is preceded by a __Flag Column__, a question whose answer determines whether the following questions apply. In this dataset, we have 6 flag columns in which missing values are the result of a __"Condition not met"__.


Let's start with the first flag column, __is_self_employed__, which records whether the respondent is an employee (0) or self-employed (1), and with the 15 columns that follow it.
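
One way to check whether a flag column really explains the missing values is to cross-tabulate the flag against the missingness of a dependent column. The sketch below uses a hypothetical toy dataframe; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is an employee with an unexplained gap
toy = pd.DataFrame({
    "is_self_employed": [0, 0, 1, 1, 0],
    "organization_size": ["26-100", "6-25", np.nan, np.nan, np.nan],
})

# Rows: value of the flag; columns: whether the dependent answer is missing
missing_by_flag = pd.crosstab(
    toy["is_self_employed"],
    toy["organization_size"].isna(),
    rownames=["is_self_employed"],
    colnames=["organization_size missing"],
)
print(missing_by_flag)

# Missing answers that the flag does not explain (employees with no answer)
unexplained = int(
    (toy["organization_size"].isna() & (toy["is_self_employed"] == 0)).sum()
)
print(unexplained)
```

A non-zero `unexplained` count means that some gaps remain after the flag-based fill and need a different treatment.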

In [10]:
# Determine how many self-employed respondents are in the dataset.
employment_df = df["is_self_employed"].value_counts().reset_index()
employment_df.columns = ['is_self_employed', 'number_of_employees']

# Print the results (rows are sorted by count: employees first, self-employed second)
print(f"In this dataset, we have {employment_df.iloc[1,1]} self-employed "
      f"and {employment_df.iloc[0,1]} employed respondents.")

In this dataset, we have 287 self-employed and 1146 employed respondents.


Self-employed respondents aren't expected to answer the following 15 organization-related questions, and leaving these fields empty would confuse our analysis.

In [11]:
# Storing the Organization-related questions in a list
non_answered_by_self_employed = [
    "organization_size",
    "is_tech_company",
    "is_tech_role",
    "is_mh_benefits_provided",
    "is_aware_mh_care_available",
    "is_mh_discussed_by_employer",
    "is_mh_resources_provided_by_employer",
    "is_anonymity_protected",
    "how_is_asking_for_medical_leave_due_to_mhi",
    "is_discussing_mhd_with_employer_have_negative_consequences",
    "is_discussing_phd_with_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_colleagues",
    "is_willing_to_discuss_mhi_with_direct_supervisor",
    "is_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi"
]

However, we must check our assumption first.

In [12]:
# Show the responses of self-employed respondents to these questions
df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed].isna().sum()

organization_size                                                   287
is_tech_company                                                     287
is_tech_role                                                        287
is_mh_benefits_provided                                             287
is_aware_mh_care_available                                          287
is_mh_discussed_by_employer                                         287
is_mh_resources_provided_by_employer                                287
is_anonymity_protected                                              287
how_is_asking_for_medical_leave_due_to_mhi                          287
is_discussing_mhd_with_employer_have_negative_consequences          287
is_discussing_phd_with_employer_have_negative_consequences          287
is_willing_to_discuss_mhi_with_colleagues                           287
is_willing_to_discuss_mhi_with_direct_supervisor                    287
is_employer_takes_mh_seriously                                  

Our assumption is correct. Now, we'll fill the missing values to avoid confusion in the analysis.

In [13]:
# Declaring the variable : value_to_replace
value_to_replace = "Self-employed (Non Applicable)"

# Re-assigning the old entries with the adjusted entries
self_employed_mask = df["is_self_employed"] == 1
df.loc[self_employed_mask, non_answered_by_self_employed] = (
    df.loc[self_employed_mask, non_answered_by_self_employed].fillna(value_to_replace)
)

# Verifying the number of missing value afterwards
df.loc[df["is_self_employed"] == 1, non_answered_by_self_employed].isna().sum()

organization_size                                                   0
is_tech_company                                                     0
is_tech_role                                                        0
is_mh_benefits_provided                                             0
is_aware_mh_care_available                                          0
is_mh_discussed_by_employer                                         0
is_mh_resources_provided_by_employer                                0
is_anonymity_protected                                              0
how_is_asking_for_medical_leave_due_to_mhi                          0
is_discussing_mhd_with_employer_have_negative_consequences          0
is_discussing_phd_with_employer_have_negative_consequences          0
is_willing_to_discuss_mhi_with_colleagues                           0
is_willing_to_discuss_mhi_with_direct_supervisor                    0
is_employer_takes_mh_seriously                                      0
is_aware_of_previous

The same process is repeated for the rest of the flag columns.
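
Since the same conditional fill recurs for every flag column, it could be wrapped in a small helper. The function `fill_where` below is a hypothetical sketch on toy data, not the notebook's code; it returns a copy and casts a column to `object` only when a text label actually has to be stored in it.

```python
import numpy as np
import pandas as pd

def fill_where(dataframe: pd.DataFrame, condition: pd.Series, columns: list, value) -> pd.DataFrame:
    """Fill missing entries of `columns` with `value`, only on rows where `condition` holds."""
    result = dataframe.copy()
    for column in columns:
        to_fill = condition & result[column].isna()
        if to_fill.any():
            # Cast to object so a text label can be stored in a numeric column
            result[column] = result[column].astype(object)
            result.loc[to_fill, column] = value
    return result

# Hypothetical toy data: rows 1 and 2 are self-employed, row 3 is an employee
toy = pd.DataFrame({
    "is_self_employed": [0, 1, 1, 0],
    "is_tech_company": [1.0, np.nan, np.nan, np.nan],
    "organization_size": ["6-25", np.nan, np.nan, "26-100"],
})

filled = fill_where(
    toy,
    toy["is_self_employed"] == 1,
    ["is_tech_company", "organization_size"],
    "Self-employed (Non Applicable)",
)
print(filled)
```

The employee's missing `is_tech_company` answer is left untouched, because the condition does not hold for that row.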

In [14]:
# Storing the previous-employer-related questions in a list
non_answered_by_non_previously_employed = [
    "is_previous_employer_provides_mh_benefits",
    "is_aware_mh_options_by_previous_employer",
    "is_mh_discussed_by_previous_employer",
    "is_mh_resources_provided_by_previous_employer",
    "is_anonymity_protected_by_previous_employer",
    "is_discussing_mhd_with_previous_employer_have_negative_consequences",
    "is_discussing_phd_with_previous_employer_have_negative_consequences",
    "is_willing_to_discuss_mhi_with_previous_colleagues",
    "is_willing_to_discuss_mhi_with_previous_direct_supervisor",
    "is_previous_employer_takes_mh_seriously",
    "is_aware_of_previous_negative_consequence_of_colleagues_with_mhi_in_previous_workplace",
]

# Declaring the variable : value_to_replace
value_to_replace = "Not previously employed (Non Applicable)"

# Fill the columns with the specific value
df.loc[df["is_previously_employed"] == 0, non_answered_by_non_previously_employed] = df.loc[df["is_previously_employed"] == 0, non_answered_by_non_previously_employed].fillna(value_to_replace)

In [15]:
# Arranging another collection of columns that share the same number of missing values
unanswered_by_employees = [
    "have_medical_coverage_includes_mental_health_issue",
    "know_how_to_seek_help",
    "is_willing_to_reveal_previous_mental_health_issue_to_business_contacts",
    "is_impacted_negatively_1",
    "is_able_to_reveal_previous_mental_health_issue_to_coworkers",
    "is_impacted_negatively_2",
    "is_productivity_impacted",
    "percentage_impacted",
]

# Declaring the variable : value_to_replace
value_to_replace = "Employees (Unanswered)"

# Fill the columns with the specific value
df.loc[df["is_self_employed"] == 0, unanswered_by_employees] = df.loc[df["is_self_employed"] == 0, unanswered_by_employees].fillna(value_to_replace)

Following the same approach, we'll fill the remaining columns.

In [16]:
# Fixing the missing values in the known_conditions column
df.loc[df["is_having_mhd"] == "Maybe", "known_conditions"] = df.loc[df["is_having_mhd"] == "Maybe", "known_conditions"].fillna("I'm not sure")

df.loc[df["is_having_mhd"] == "No", "known_conditions"] = df.loc[df["is_having_mhd"] == "No", "known_conditions"].fillna("I don't have a condition")


# Fixing the missing values in the suspected_conditions column
df.loc[df["is_having_mhd"] == "Yes", "suspected_conditions"] = df.loc[df["is_having_mhd"] == "Yes", "suspected_conditions"].fillna("I'm aware of my condition")

df.loc[df["is_having_mhd"] == "No", "suspected_conditions"] = df.loc[df["is_having_mhd"] == "No", "suspected_conditions"].fillna("I don't have a condition")


# Fixing the missing values in the diagnosed_conditions_by_professional column
df.loc[df["diagnosed_by_professional"] == "No", "diagnosed_conditions_by_professional"] = df.loc[df["diagnosed_by_professional"] == "No", "diagnosed_conditions_by_professional"].fillna("Not diagnosed by a professional (Non Applicable)")


# Fixing the missing values in the us_state_residency and us_state_work column
df.loc[df["country_of_residency"] != "United States of America", "us_state_residency"] = df.loc[
    df["country_of_residency"] != "United States of America", "us_state_residency"].fillna(
    "I don't live in the USA (Non Applicable)")

df.loc[df["country_of_work"] != "United States of America", "us_state_work"] = df.loc[
    df["country_of_work"] != "United States of America", "us_state_work"].fillna(
    "I don't work in the USA (Non Applicable)")

Next, the __is_tech_role__ column needs to be filled based on the __role_description__ column. Therefore, we'll clean that column first.

Although the __role_description__ column has no missing values, it uses a delimiter __( | )__ to separate multiple choices. If we were tidying the data without knowing how the dataset will be used, the obvious approach would be to leave the data as it is. However, we know that the data will be used for clustering, so keeping the number of unique values in each column as low as possible makes the job much easier.

We'll proceed by defining a function __classifier__, which will help us classify the __role_description__ column into unique categories.

In [17]:
# Define a user-defined function : classifier
def classifier(data : pd.DataFrame, column : str, categories_dict : dict) -> None:
    """
    Replace a multiple-choice entry with a dedicated category, in place.

    Parameters:
    data : pandas dataframe
    column : the column of interest from the dataframe
    categories_dict : the dictionary of categories

    Returns:
    None
    """
    # Iterate over the unique entries of the column
    for element in data[column].unique():

        # Splitting the entries
        for part in element.split("|"):

            # Iterating over the dictionary of categories
            for k, v in categories_dict.items():

                # Check whether this choice belongs to the category
                if part in v :
                    # Assign the result back instead of a chained in-place replace
                    data[column] = data[column].replace(element, k)
                    break
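
To make the "first matching choice wins" behaviour concrete, here is a small self-contained sketch of the same idea written with a reverse lookup and `Series.map`. The toy values, including the unmatched "Astronaut", and the helper `first_matching_category` are hypothetical.

```python
import pandas as pd

# Hypothetical toy entries using the "|" delimiter
roles = pd.Series([
    "Back-end Developer|Front-end Developer",
    "Supervisor/Team Lead",
    "Designer|Front-end Developer",
    "Astronaut",  # hypothetical value that matches no category
])

categories = {
    "IT": ["Back-end Developer", "Front-end Developer", "DevOps/SysAdmin"],
    "Management": ["Supervisor/Team Lead", "Executive Leadership"],
    "Design": ["Designer"],
}

# Reverse lookup: individual choice -> category
choice_to_category = {
    choice: category
    for category, choices in categories.items()
    for choice in choices
}

def first_matching_category(entry: str) -> str:
    """Return the category of the first choice that belongs to one."""
    for part in entry.split("|"):
        if part in choice_to_category:
            return choice_to_category[part]
    return entry  # leave unmatched entries untouched

classified = roles.map(first_matching_category)
print(classified.tolist())
```

Note that "Designer|Front-end Developer" ends up in Design rather than IT, because the result depends on the order in which the respondent's choices are listed.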

Next, we'll apply the function.

In [18]:
# Getting required roles
roles_categories = {
    "IT": ["Back-end Developer", "Front-end Developer", "DevOps/SysAdmin"],
    "Management": ["Supervisor/Team Lead","Executive Leadership"],
    "Advocacy": ["Dev Evangelist/Advocate"],
    "One-person shop":["One-person shop"],
    "Support": ["Support"],
    "Design": ["Designer"],
    "Sales": ["Sales"],
    "Other": ["Other"],
    "HR": ['HR']
}

# Apply the function classifier
classifier(df, 'role_description', roles_categories)

# Check results
df["role_description"].value_counts()

role_description
IT                 743
Management         250
Other              166
One-person shop    104
Support             63
Advocacy            50
Design              45
HR                   7
Sales                5
Name: count, dtype: int64

Then, we'll apply an anonymous function that replaces the values with 1 if the respondent is in IT, and 0 otherwise.


In [19]:
# Apply anonymous function
df["is_tech_role"] = df["role_description"].apply(lambda x: 1 if x == "IT" else 0)

# Show the values count
df["is_tech_role"].value_counts()

is_tech_role
1    743
0    690
Name: count, dtype: int64

At the end of this section, we should no longer have missing values caused by flag columns. Let's verify this claim.

In [20]:
# Print the detailed dataframe
discover_df(df, only_na=True)

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 44 | is_less_encouraged_to_reveal_mhi | 776 | 3 | [Yes, No, Maybe] | object |
| 19 | is_impacted_negatively_1 | 143 | 4 | [Employees (Unanswered), I'm not sure, Yes, No] | object |
| 5 | is_aware_mh_care_available | 133 | 4 | [Yes, Self-employed (Non Applicable), I am not... | object |
| 43 | previously_observed_experienced_response_to_mhi | 89 | 4 | [No, Maybe/Not sure, Yes, I experienced, Yes, ... | object |
| 23 | percentage_impacted | 83 | 5 | [Employees (Unanswered), 1-25%, 76-100%, 26-50... | object |
| 48 | known_conditions | 7 | 130 | [I don't have a condition, Anxiety Disorder (G... | object |
| 49 | suspected_conditions | 5 | 101 | [I don't have a condition, I'm aware of my con... | object |
| 51 | diagnosed_conditions_by_professional | 5 | 117 | [Anxiety Disorder (Generalized, Social, Phobia... | object |
| 56 | gender | 3 | 70 | [Male, male, Male , Female, M, female, m, I id... | object |


This wraps up this section.

#### 3.2.2 Filling missing values with common values

When the number of missing values is small relative to the size of the dataset (10 missing values is a lot among 18 observations but little among 300), a common approach among data scientists is to fill them with the most pertinent value, since this has little effect on the distribution of the column.

This is the approach we'll use in the following cases, where the number of missing values is relatively small. It seems simple, but it becomes more complicated as we dive deeper into the process.
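To make the "relatively small" criterion concrete, here is a minimal sketch that fills a column with its most frequent value only when the share of missing entries stays under a cut-off. The toy dataframe and the 20% threshold (`MAX_MISSING_SHARE`) are assumptions chosen for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy dataframe: one column with few gaps, one that is mostly empty
toy = pd.DataFrame({
    "answer": ["Yes", "No", "No", np.nan, "No", "Yes", "No", "No", "No", "No"],
    "sparse": [np.nan, np.nan, np.nan, "A", np.nan, np.nan, "B", np.nan, np.nan, np.nan],
})

# Hypothetical cut-off: fill only when at most 20% of the entries are missing
MAX_MISSING_SHARE = 0.2

# Share of missing entries per column
missing_share = toy.isna().mean()

for column in toy.columns:
    if 0 < missing_share[column] <= MAX_MISSING_SHARE:
        # mode() ignores missing values; take the first value in case of ties
        most_frequent = toy[column].mode().iloc[0]
        toy[column] = toy[column].fillna(most_frequent)

print(toy.isna().sum())
```

Here the _answer_ column (10% missing) is filled with "No", while the _sparse_ column (80% missing) is left untouched because filling it would distort its distribution.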

In [21]:
# Show the distribution of empty values
df.loc[df["is_aware_mh_care_available"].isna(), "is_mh_discussed_by_employer"].value_counts()

is_mh_discussed_by_employer
No              102
Yes              23
I don't know      8
Name: count, dtype: int64

In [22]:
# Fill the values (assigning back instead of chaining an in-place call)
df["is_aware_mh_care_available"] = df["is_aware_mh_care_available"].fillna("No")

# Show the new distribution
df["is_aware_mh_care_available"].value_counts()

is_aware_mh_care_available
No                                487
I am not sure                     352
Yes                               307
Self-employed (Non Applicable)    287
Name: count, dtype: int64

In [23]:
# Determine the number of missing values based on the previous response
df.loc[df["is_impacted_negatively_1"].isna(), "is_willing_to_reveal_previous_mental_health_issue_to_business_contacts"].value_counts()

is_willing_to_reveal_previous_mental_health_issue_to_business_contacts
Not applicable to me                         90
No, because it would impact me negatively    31
No, because it doesn't matter                21
Sometimes, if it comes up                     1
Name: count, dtype: int64

In [24]:
# Column holding the answers to the previous question
previous_answer = "is_willing_to_reveal_previous_mental_health_issue_to_business_contacts"

# Store the answers whose missing entries are filled with "No"
choices = ["Not applicable to me", "No, because it would impact me negatively", "No, because it doesn't matter"]

# Iterate over these answers
for element in choices:
    # Fill only the missing entries of the rows that gave this answer
    mask = (df[previous_answer] == element) & df["is_impacted_negatively_1"].isna()
    df.loc[mask, "is_impacted_negatively_1"] = "No"

# Fill the remaining missing entry with the adequate answer
mask = (df[previous_answer] == "Sometimes, if it comes up") & df["is_impacted_negatively_1"].isna()
df.loc[mask, "is_impacted_negatively_1"] = "I'm not sure"
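The same conditional fill can be sketched in a single step by mapping the reference answers to fill values and passing the result to `fillna`. The toy dataframe, its column names (`reference`, `target`), and the dictionary `fill_by_answer` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data: "reference" plays the role of the previous question,
# "target" is the column that contains the gaps
toy = pd.DataFrame({
    "reference": [
        "Not applicable to me",
        "Sometimes, if it comes up",
        "Not applicable to me",
        "No, because it doesn't matter",
    ],
    "target": [np.nan, np.nan, "Yes", np.nan],
})

# Hypothetical mapping: reference answer -> value used to fill the gap
fill_by_answer = {
    "Not applicable to me": "No",
    "No, because it would impact me negatively": "No",
    "No, because it doesn't matter": "No",
    "Sometimes, if it comes up": "I'm not sure",
}

# fillna aligns on the index, so only the missing entries are replaced
toy["target"] = toy["target"].fillna(toy["reference"].map(fill_by_answer))
print(toy["target"].tolist())
```

Existing answers (the "Yes" in the third row) are preserved, because `fillna` only touches the entries that were missing.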

In [25]:
# Determine the number of missing values based on the previous response
df.loc[
    df["percentage_impacted"].isna(),
    "is_productivity_impacted"
].value_counts()

is_productivity_impacted
Unsure                  38
Not applicable to me    31
No                      14
Name: count, dtype: int64

The answers these respondents gave to the previous question (Unsure, Not applicable to me, No) imply a minor impact of mental health issues on their productivity. We'll use this information to choose an adequate value to fill the missing entries.

In [26]:
# Declaring the adequate value
value_to_replace = "1-25%"

# Filling the missing values (assigning back instead of chaining an in-place call)
df["percentage_impacted"] = df["percentage_impacted"].fillna(value_to_replace)

In the following columns, a minor portion of the data is missing and will be replaced with the most frequent value.

In [27]:
# Filling missing values with the most frequent value of each column
columns_to_fill = [
    "previously_observed_experienced_response_to_mhi",
    "known_conditions",
    "suspected_conditions",
    "diagnosed_conditions_by_professional",
]

for column in columns_to_fill:
    # value_counts() sorts by frequency, so the first index entry is the most frequent value
    df[column] = df[column].fillna(df[column].value_counts().index[0])

In [28]:
# Show the distribution of missing values with regard to the following column
df.loc[
    df["is_less_encouraged_to_reveal_mhi"].isna(),
    "previously_observed_experienced_response_to_mhi"
].value_counts()

previously_observed_experienced_response_to_mhi
No                    656
Maybe/Not sure         74
Yes, I experienced     28
Yes, I observed        18
Name: count, dtype: int64

In [29]:
# Fill the values based on the response with the most pertinent values
fill_values = {
    "Yes, I experienced": "Yes",
    "Yes, I observed": "Yes",
    "Maybe/Not sure": "No",
    "No": "No",
}

for answer, fill_value in fill_values.items():
    # Fill only the missing entries of the rows that gave this answer
    mask = (
        (df["previously_observed_experienced_response_to_mhi"] == answer)
        & df["is_less_encouraged_to_reveal_mhi"].isna()
    )
    df.loc[mask, "is_less_encouraged_to_reveal_mhi"] = fill_value

In [30]:
# Show the columns with missing values
discover_df(df, only_na=True)

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 56 | gender | 3 | 70 | [Male, male, Male , Female, M, female, m, I id... | object |


With only one column remaining, this wraps up this section. We'll address that column in the next section.

#### 3.3 Handling redundant data
The _gender_ column contains 70 unique values, which fall into 3 broad categories: __Male, Female, and Other__.

In [31]:
# Reassigning the values to Male, Female, or Other
# (the result is assigned back instead of chaining an in-place replace)
df['gender'] = df['gender'].replace(
    ['male', 'm', 'M', 'Male (cis)', 'cisdude', 'Dude', 'Male.', 'Cis male',
     'Cis Male', 'cis male', 'cis man', 'mail', 'Male (trans, FtM)',
     'Male/genderqueer', 'Malr', 'Man', 'Sex is male', 'Male ', 'MALE', 'man',
     'male ', 'M|', 'I\'m a man why didn\'t you make this a drop down question. '
                    'You should of asked sex? And I would of answered yes please. '
                    'Seriously how much text can this take? '
     ], 'Male') # Value to replace the entries with

df['gender'] = df['gender'].replace(
    ['Female', 'female', 'I identify as female.', 'female ', 'Female assigned at birth ',
     'Cis female ', 'F', 'Woman', 'f', 'Transitioned, M2F', 'woman', 'female/woman',
     'fem', ' Female', 'Cis-woman', 'Female ', 'Female or Multi-Gender Femme',
     'Cisgender Female', 'fm', 'Female (props for making this a freeform field, though)',
     ], 'Female') # Value to replace the entries with

df['gender'] = df['gender'].replace(
    ['Genderfluid (born female)', 'female-bodied; no feelings about gender', 'non-binary',
     'AFAB', 'Agender', 'genderqueer', 'Genderflux demi-girl', 'mtf', 'Genderqueer',
     'Unicorn', 'Androgynous', 'Bigender', 'Enby', 'Transgender woman', 'Human', 'human',
     'Other/Transfeminine', 'Queer', 'Fluid', 'nb masculine', np.nan, 'genderqueer woman',
     'Genderfluid', 'none of your business', 'male 9:1 female, roughly', 'Nonbinary',
     ], 'Other') # Value to replace the entries with

# Overview of the transformed data
df["gender"].value_counts()

gender
Male      1059
Female     340
Other       34
Name: count, dtype: int64
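Many of the variants listed above differ only in case or surrounding whitespace. As a hedged sketch, normalising the text first shortens the lookup lists considerably. The toy series, the two lookup sets, and the helper `normalise_gender` are assumptions for illustration and cover only a handful of the 70 original values.

```python
import numpy as np
import pandas as pd

# Toy sample of free-text answers (illustrative values only)
raw = pd.Series(["Male", "male ", " Female", "F", "M", "Nonbinary", np.nan, "Cis Male"])

# Hypothetical lookup sets, written in lower case without surrounding spaces
male = {"male", "m", "man", "cis male"}
female = {"female", "f", "woman", "cis female"}


def normalise_gender(value) -> str:
    """Map a free-text answer to Male, Female, or Other."""
    if pd.isna(value):
        return "Other"
    cleaned = str(value).strip().lower()
    if cleaned in male:
        return "Male"
    if cleaned in female:
        return "Female"
    return "Other"


gender = raw.map(normalise_gender)
print(gender.value_counts())
```

Anything that is not explicitly recognised falls into Other, so the lookup sets should be reviewed against the full list of unique values before relying on the result.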

#### 3.4 Handling outliers
The __age__ column contained several implausible inputs such as 3, 15, 17, 99, and 323. We'll start by handling these outliers, then convert the column into a categorical one.

In [32]:
# Arranging a list of accepted entries
age_entries = df.loc[(df["age"] >= 18) & (df["age"] <= 65), "age"].unique().tolist()

# Getting the median of the distinct accepted ages
age_median = np.median(age_entries)

# Replacing unusual entries with the median
df["age"] = df["age"].apply(lambda x: x if x in age_entries else age_median)

# Transforming the data type into int
df["age"] = df["age"].astype(int)
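The cell above takes the median of the distinct accepted ages. A different choice, sketched below on toy data, is to use the median of all valid observations, which weights each age by how often it occurs. The toy ages are illustrative only, and this variant may give different results from the ones reported in this notebook.

```python
import pandas as pd

# Toy ages including two implausible entries (illustrative values only)
ages = pd.Series([25, 31, 3, 40, 323, 29, 35])

# Boolean mask of the accepted range, both bounds included
valid = ages.between(18, 65)

# Median computed over the valid observations only
median_valid = ages[valid].median()

# Keep valid ages, replace the others with the median, then cast to int
cleaned = ages.where(valid, median_valid).astype(int)
print(cleaned.tolist())
```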

Next, we define a helper function that maps each age to an age group.

In [33]:
# Define a helper function: age_categorizer
def age_categorizer(data: pd.DataFrame, column: str) -> list:
    """
    Returns the age category of each entry of the given column.
    :param data: Pandas DataFrame
    :param column: str
    :return: list of category labels, one per age within 18-65
    """
    # Define age categories
    age_categories = {
        "Before 20s": list(range(18, 20)),
        "20s": list(range(20, 30)),
        "30s": list(range(30, 40)),
        "40s": list(range(40, 50)),
        "Above 50s": list(range(50, 66))
    }

    # Creating a list to collect the labels
    new_categories = []

    # Loop over the column elements
    for element in data[column].tolist():

        # Iterating over the dictionary of categories
        for k, v in age_categories.items():

            # If an age is in a given range, we store the label of that range
            if element in v:
                new_categories.append(k)
                break

    # Deliver the results
    return new_categories

Let's apply the function.

In [34]:
# Apply the function to the age column
df["age"] = age_categorizer(df, "age")

# Demonstrate the new categories
df["age"].value_counts()

age
30s           678
20s           443
40s           243
Above 50s      65
Before 20s      4
Name: count, dtype: int64
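The same binning can be sketched with `pd.cut`, which assigns each value to a labelled interval. The toy ages are illustrative; the bin edges reuse the ranges of the helper function above, with `right=False` so that each interval includes its lower bound and excludes its upper bound.

```python
import pandas as pd

# Toy ages covering the edges of every group (illustrative values only)
ages = pd.Series([18, 19, 20, 29, 30, 45, 50, 65])

# Bin edges: [18, 20), [20, 30), [30, 40), [40, 50), [50, 66)
bins = [18, 20, 30, 40, 50, 66]
labels = ["Before 20s", "20s", "30s", "40s", "Above 50s"]

# right=False makes the intervals closed on the left and open on the right
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.value_counts().sort_index())
```

The result is an ordered categorical column, so the groups keep their natural order in tables and plots instead of being sorted alphabetically.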

#### 3.5 Correcting data types
Handling data types is essential. Instead of Yes/No, we'll transform the binary columns that still hold text answers into numerical 0/1 columns with the int64 data type.

In [35]:
# Display the dataframe composed of binary columns
data = discover_df(df, only_binary=True)

# Print the dataframe
data

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 0 | is_self_employed | 0 | 2 | [0, 1] | int64 |
| 3 | is_tech_role | 0 | 2 | [1, 0] | int64 |
| 24 | is_previously_employed | 0 | 2 | [1, 0] | int64 |
| 50 | diagnosed_by_professional | 0 | 2 | [Yes, No] | object |
| 52 | is_sought_treatment_for_mhi | 0 | 2 | [0, 1] | int64 |


We treat the remaining Yes/No column accordingly, using an anonymous function.

In [36]:
# Applying anonymous function
df["diagnosed_by_professional"] = df["diagnosed_by_professional"].apply(lambda x: 1 if x == "Yes" else 0)

# Display the dataframe composed of binary columns
data = discover_df(df, only_binary=True)

# preview the results
data

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 0 | is_self_employed | 0 | 2 | [0, 1] | int64 |
| 3 | is_tech_role | 0 | 2 | [1, 0] | int64 |
| 24 | is_previously_employed | 0 | 2 | [1, 0] | int64 |
| 50 | diagnosed_by_professional | 0 | 2 | [1, 0] | int64 |
| 52 | is_sought_treatment_for_mhi | 0 | 2 | [0, 1] | int64 |
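A lambda with an `else 0` branch silently turns any unexpected value into 0. As a minimal sketch on toy data, an explicit mapping leaves unexpected values as missing, so they can be counted before the cast. The toy answers are illustrative only.

```python
import pandas as pd

# Toy Yes/No answers (illustrative values only)
answers = pd.Series(["Yes", "No", "No", "Yes"])

# Values absent from the mapping would become NaN instead of 0
encoded = answers.map({"Yes": 1, "No": 0})

# Count the entries that the mapping did not cover
unexpected = int(encoded.isna().sum())
print(f"Unexpected values: {unexpected}")

# Safe to cast once no entry is missing
encoded = encoded.astype("int64")
print(encoded.tolist())
```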


Let's review the dataset again.

In [37]:
# Display the dataframe again
data = discover_df(df)

# Limit the dataframe to the columns with 3 unique values
data = data[data['unique_values_count'] == 3 ]

# Preview the data
data

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 2 | is_tech_company | 0 | 3 | [1.0, Self-employed (Non Applicable), 0.0] | object |
| 15 | is_aware_of_previous_negative_consequence_of_c... | 0 | 3 | [No, Self-employed (Non Applicable), Yes] | object |
| 16 | have_medical_coverage_includes_mental_health_i... | 0 | 3 | [Employees (Unanswered), 1.0, 0.0] | object |
| 36 | is_willing_to_bring_phi_in_interview | 0 | 3 | [Maybe, Yes, No] | object |
| 38 | is_willing_to_bring_mhi_in_interview | 0 | 3 | [Maybe, No, Yes] | object |
| 44 | is_less_encouraged_to_reveal_mhi | 0 | 3 | [No, Yes, Maybe] | object |
| 45 | family_history_of_mhi | 0 | 3 | [No, Yes, I don't know] | object |
| 46 | previous_history_of_mhi | 0 | 3 | [Yes, Maybe, No] | object |
| 47 | is_having_mhd | 0 | 3 | [No, Yes, Maybe] | object |
| 56 | gender | 0 | 3 | [Male, Female, Other] | object |


Instead of keeping 0/1 in non-binary columns, we'll transform the input of these columns into Yes/No. Note that every value other than 1.0, including the placeholder answers, becomes No.

In [38]:
# Applying anonymous function
df["is_tech_company"] = df["is_tech_company"].apply(
    lambda x: "Yes" if x == 1.0 else "No")

df["have_medical_coverage_includes_mental_health_issue"] = df["have_medical_coverage_includes_mental_health_issue"].apply(
    lambda x: "Yes" if x == 1.0 else "No")
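If the placeholder answers should stay distinguishable from a genuine No, a hedged alternative is to translate only the numeric codes and pass every other value through unchanged. The toy series and the dictionary `code_to_label` are assumptions for this example, and this variant keeps three categories rather than the two produced above.

```python
import pandas as pd

# Toy column mixing numeric codes with a placeholder answer (illustrative values only)
raw = pd.Series([1.0, 0.0, "Self-employed (Non Applicable)", 1.0], dtype=object)

# Hypothetical mapping limited to the numeric codes
code_to_label = {1.0: "Yes", 0.0: "No"}

# Values without a mapping (the placeholder) are returned as they are
labels = raw.map(lambda value: code_to_label.get(value, value))
print(labels.tolist())
```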

Let's preview the results.

In [39]:
# Display the dataframe again
data = discover_df(df)

# Limit the dataframe to the columns with 3 unique values
data = data[data['unique_values_count'] == 3 ]

# Preview the data
data

| | name | empty_values | unique_values_count | unique_values_list | data_type |
|---|---|---|---|---|---|
| 15 | is_aware_of_previous_negative_consequence_of_c... | 0 | 3 | [No, Self-employed (Non Applicable), Yes] | object |
| 36 | is_willing_to_bring_phi_in_interview | 0 | 3 | [Maybe, Yes, No] | object |
| 38 | is_willing_to_bring_mhi_in_interview | 0 | 3 | [Maybe, No, Yes] | object |
| 44 | is_less_encouraged_to_reveal_mhi | 0 | 3 | [No, Yes, Maybe] | object |
| 45 | family_history_of_mhi | 0 | 3 | [No, Yes, I don't know] | object |
| 46 | previous_history_of_mhi | 0 | 3 | [Yes, Maybe, No] | object |
| 47 | is_having_mhd | 0 | 3 | [No, Yes, Maybe] | object |
| 56 | gender | 0 | 3 | [Male, Female, Other] | object |
| 62 | is_remote | 0 | 3 | [Sometimes, Never, Always] | object |


In [40]:
# Save the cleaned data
df.to_csv(f"{path}/data/data_v1.0.csv", index=False)
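Writing the file fails if the target folder does not exist yet. The sketch below, which uses a temporary directory and a toy dataframe as stand-ins for the project folder and the cleaned data, creates the folder first and reloads the file as a quick sanity check.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe standing in for the cleaned data (illustrative values only)
toy = pd.DataFrame({"age": ["30s", "20s"], "is_tech_role": [1, 0]})

# A temporary directory stands in for the project folder
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"

    # Create the folder if needed; no error if it already exists
    out_dir.mkdir(parents=True, exist_ok=True)

    out_file = out_dir / "data_v1.0.csv"
    toy.to_csv(out_file, index=False)

    # Reload the file to confirm that shape and columns survived the round trip
    reloaded = pd.read_csv(out_file)

print(reloaded.shape)
```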

## 4. Summary

In this notebook, we discussed the following aspects :
- A basic exploration of the dataset including summary statistics, missing, and unique values.
- The dataset requires adequate data handling procedures.
- The ad

## Author
<a href="https://www.linkedin.com/in/ab0858s/">Abdelali BARIR</a> is a veteran of the Moroccan Royal Armed Forces and a self-taught Python programmer, currently enrolled in the B.Sc. Data Science program at __IU International University of Applied Sciences__.

## Change Log

| Date         | Version   | Changed By       | Change Description        |
|--------------|-----------|------------------|---------------------------|
| 2024-07-10   | 1.0       | Abdelali Barir   | Modified markdown         |
