# Greetings
The [data](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-2016/) used in this notebook was provided by Open-Sourcing Mental Illness, LTD (OSMI) in 2016. It contains the results of a survey conducted to gauge the impact of mental health illnesses among employees.

## Data Exploration
This section is dedicated to shaping an understanding of the data, identifying any underlying issues, and preparing to address them in subsequent phases. This stage is pivotal because it lays the foundation for later decisions about data imputation, data wrangling, and more.

During this phase, we solely conduct exploratory data analysis, using descriptive statistics and visualizations. No alterations will be made to the data at this stage.

### List of contents
1. __Overview__
2. __Exploratory Data Analysis__
    - 2.1. Columns 0 to 23
    - 2.2. Columns 24 to 35
    - 2.3. Columns 36 to 39
    - 2.4. Columns 40 to 46
    - 2.6. Columns 47 to 54
    - 2.7. Columns 55 to 56
    - 2.8. Columns 57 to 60
3. __Summary__

Importing the required libraries

In [None]:
# Path library for generating OS paths efficiently
from pathlib import Path

# Data analysis library
import pandas as pd

# Python library for numerical computation 
import numpy as np

# Powerful data visualization library
import seaborn as sns
sns.set_theme(style="whitegrid")

# The base library for plotting graphs in python
import matplotlib.pyplot as plt

# Geographical plotting library
import folium
import geopy

## 1. Overview

The aim of this phase is to get to know the data at hand.

First, let's start by loading the data. 

In [None]:
# Formulating the directory 
path = Path.cwd().parent

# Loading the data
data = pd.read_csv(f'{path}/data/mental-heath-in-tech-2016_20161114.csv')
data.head()

In [None]:
print(f"The data is formed through {data.shape[1]} columns/features and {data.shape[0]} rows/records.")

In [None]:
data.describe()

__Insights :__
- Out of a total of 63 columns, merely 7 contain numerical data. 
- The changing count between columns implies the presence of missing values within the dataset. 
- Unusual entries within the age column were identified, necessitating an appropriate filling approach.
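
A quick way to surface such unusual ages is to flag entries outside a plausible range. The snippet below is a minimal sketch on a hypothetical sample; the bounds `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration, not values taken from the survey.

```python
import pandas as pd

# Hypothetical sample that mimics unusual entries in an age column
ages = pd.Series([29, 3, 41, 323, 15, 35], name="What is your age?")

# Assumed plausible working-age range; adjust to your own judgment
MIN_AGE, MAX_AGE = 18, 75

# Boolean mask of entries that fall outside the assumed range
implausible = ~ages.between(MIN_AGE, MAX_AGE)

print(ages[implausible].tolist())
print(f"{implausible.sum()} of {len(ages)} entries fall outside {MIN_AGE}-{MAX_AGE}")
```

The mask only identifies the suspicious rows; how to fill them is left to the pre-processing phase.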

To investigate further, we could use the __".info()"__ method, but we also need the number of unique values within each column, as this information will help us later on.

In [None]:
# Initiating an empty list data_info 
data_info = []

# Gathering the attributes in one place 
for column in data.columns:
    info = {
        'name': column,  # The name of the column
        'empty_values': data[column].isna().sum(),  # The number of empty values in a column
        'unique_values': data[column].nunique(),  # The number of unique values, excluding empty values
        'data_type': data[column].dtype  # The data type of the column
    }

    # Appending the values  
    data_info.append(info)

# Create a DataFrame from the gathered information
null_categories_data = pd.DataFrame(data_info).sort_values(by=['empty_values'], ascending=False)
null_categories_data

## 2. Exploratory Data Analysis
In this section, we iterate over the columns in groups based on relevance. Depending on the group, we may inspect 11 columns at a time or only two or three. For a clearer view, visualizations are used where appropriate.

### 2.1 Columns 0 to 23

The columns __Are you self-employed?, How many employees does your company or organization have?, Is your employer primarily a tech company/organization?, Is your primary role within your company related to tech/IT?__ capture basic information about the respondent.

In [None]:
# Performing aggregations on the first few columns using the value_counts method
aggregations = {
    "Number of employees": data["Are you self-employed?"].value_counts()[0],
    "Number of self-employed": data["Are you self-employed?"].value_counts()[1],
    "Number of employees in IT related companies":
        data['Is your employer primarily a tech company/organization?'].value_counts()[1],
    "Number of employees in non-IT related companies":
        data['Is your employer primarily a tech company/organization?'].value_counts()[0],
    "Number of employees in non-IT related companies with IT related role":
        data["Is your primary role within your company related to tech/IT?"].value_counts()[1],
    "Number of employees in non-IT related companies with non-IT related role":
        data["Is your primary role within your company related to tech/IT?"].value_counts()[0]
}

# Print the dictionary
aggregations

The following pie chart better illustrates the data:

In [None]:
# Setting the labels and sizes lists 
labels = [
    'Self-employed',
    'Employees in IT related companies',
    'Employees with IT-related role in a non-IT related company',
    'Employees with non-IT-related role in a non-IT related company']

sizes = [
    aggregations["Number of self-employed"],
    aggregations["Number of employees in IT related companies"],
    aggregations["Number of employees in non-IT related companies with IT related role"],
    aggregations["Number of employees in non-IT related companies with non-IT related role"]
]

# Setting the style of the visuals
plt.style.use('fast')

# Starting the fig1 and ax subplots objects
fig1, ax = plt.subplots(figsize=(15, 5))
ax.pie(
    sizes,  # List of values
    autopct='%1.2f%%',  # Showing 2 digits after the decimal point
    textprops={'fontsize': 12}  # Setting the size of the ratios
)

ax.legend(
    labels=labels,  # List of labels
    loc="upper center",  # Position of the legend
    bbox_to_anchor=(0.5, 1.1),
    fontsize=8)  # Size of the legend

# Plotting the figure
plt.show()

Organization size varies across our dataset. The next graph describes this variation:

In [None]:
# Performing aggregation using .value_counts()
organization_size = pd.DataFrame(data["How many employees does your company or organization have?"].value_counts()).reset_index()
organization_size

In [None]:
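# Plotting horizontal bars plot: counts on the x-axis, size categories on the y-axis
# (the "count" column name is what value_counts().reset_index() produces in pandas 2.0+)
sns.barplot(x="count",
            y="How many employees does your company or organization have?",
            data=organization_size)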

# Setting proper x/y-axis labels 
plt.xlabel("Number of respondents")
plt.ylabel("Organization Size")

# Showing the plot
plt.show()

__Insights__ :
- Employees in IT-related companies did not state whether they were hired in an IT-related role. These 883 missing values need to be handled properly.
- The employees work for organizations of many sizes, with medium and larger organizations dominating. The 287 missing values in this column correspond to the self-employed respondents, who did not answer this question.
- __Sharing the exact same number of missing values__, the following columns each contain between 3 and 6 unique values:
    - Does your employer provide mental health benefits as part of healthcare coverage?
    - Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?
    - Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
    - Does your employer offer resources to learn more about mental health concerns and options for seeking help?
    - Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
    - If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:
    - Do you think that discussing a mental health disorder with your employer would have negative consequences?
    - Do you think that discussing a physical health issue with your employer would have negative consequences?
    - Would you feel comfortable discussing a mental health disorder with your coworkers?
    - Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?
    - Do you feel that your employer takes mental health as seriously as physical health?
- The __Do you know the options for mental health care available under your employer-provided coverage?__ column has exactly four unique values and 420 missing values. 
- __Exceeding 1146 missing values__ each, the following columns are likely to be dropped due to the significant number of missing values:
    - Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?
    - Do you know local or online resources to seek help for a mental health disorder?
    - If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?
    - If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?
    - If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?
    - If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?
    - Do you believe your productivity is ever affected by a mental health issue?
    - If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?
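
Columns that share a missing-value count can be grouped programmatically instead of being read off the table. The following is a minimal sketch on a hypothetical toy frame (the `q1` to `q4` column names are placeholders, not survey questions). It also checks whether two columns are missing on exactly the same rows, since an equal count alone does not prove that.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey: q1 and q2 are skipped together, q3 is skipped on its own
toy = pd.DataFrame({
    "q1": ["Yes", np.nan, "No", np.nan],
    "q2": ["No", np.nan, "Yes", np.nan],
    "q3": [np.nan, "Maybe", "Yes", "No"],
    "q4": ["a", "b", "c", "d"],
})

# Number of missing values per column
missing = toy.isna().sum()

# Map each distinct missing count to the columns that share it
shared = {}
for column, count in missing.items():
    if count > 0:
        shared.setdefault(int(count), []).append(column)

print(shared)

# Equal counts are only a hint; compare the masks to confirm the same rows are missing
same_rows = toy["q1"].isna().equals(toy["q2"].isna())
print(same_rows)
```

If the masks match, the missing values most likely come from the same skip pattern in the questionnaire and can be treated together.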

### 2.2 Columns 24 to 35
The next column separates the respondents by whether or not they have had previous employers.

In [None]:
# Performing aggregation using .value_counts()
data["Do you have previous employers?"].value_counts()

__Insights__ :
- The majority of employees have previous employment history.
- The next 11 columns have either four or five unique values and share the exact same count of 169 missing values, which reflects the missing responses of the 169 employees who don't have a previous employer:
    - Have your previous employers provided mental health benefits?
    - Were you aware of the options for mental health care provided by your previous employers?
    - Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?
    - Did your previous employers provide resources to learn more about mental health issues and how to seek help?
    - Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?
    - Do you think that discussing a mental health disorder with previous employers would have negative consequences?
    - Do you think that discussing a physical health issue with previous employers would have negative consequences?
    - Would you have been willing to discuss a mental health issue with your previous co-workers?
    - Would you have been willing to discuss a mental health issue with your direct supervisor(s)?
    - Did you feel that your previous employers took mental health as seriously as physical health?
    - Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?
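
The claim that the missing answers belong to respondents without a previous employer can be checked by comparing two boolean masks. This is a minimal sketch on hypothetical toy data; `has_previous_employers` and `previous_benefits` are shortened stand-ins for the survey questions, and the 0/1 coding of the first column is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with shortened stand-in column names
toy = pd.DataFrame({
    "has_previous_employers": [1, 0, 1, 0, 1],
    "previous_benefits": ["Yes", np.nan, "No", np.nan, "Some"],
})

# Rows where the follow-up question is unanswered
unanswered = toy["previous_benefits"].isna()

# Rows where the respondent reported no previous employer
no_previous = toy["has_previous_employers"] == 0

# True when the two masks select exactly the same rows
masks_match = bool((unanswered == no_previous).all())
print(masks_match)
```

When the masks match, the empty values are structural (the question did not apply) rather than genuinely unknown, which matters for the choice of imputation later on.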

### 2.3 Columns 36 to 39
Next, we observe how the opinions are distributed in the matter of revealing both physical and mental health issues.

In [None]:
# Performing aggregations on the columns, then saving them into dictionaries
phi = data[
    "Would you be willing to bring up a physical health issue with a potential employer in an interview?"].value_counts().to_dict()
mhi = data[
    "Would you bring up a mental health issue with a potential employer in an interview?"].value_counts().to_dict()

# Setting the components for the chart
categories = ['Yes', 'Maybe', 'No']
phi_values = [phi['Yes'], phi['Maybe'], phi['No']]
mhi_values = [mhi['Yes'], mhi['Maybe'], mhi['No']]

# Set the width of the bars
bar_width = 0.35

# Create figure and axis
fig, ax = plt.subplots()

# Plot the first set of bars
bar1 = ax.bar(np.arange(len(categories)), phi_values, bar_width, color='blue', label='Physical health issues')

# Calculate the position for the second set of bars
bar2_position = np.arange(len(categories)) + bar_width

# Plot the second set of bars
bar2 = ax.bar(bar2_position, mhi_values, bar_width, color='red', label='Mental health issues')

# Set labels and ticks
ax.set_xlabel('Categories')
ax.set_ylabel('Count')
ax.set_xticks(np.arange(len(categories)))
ax.set_xticklabels(categories)

# Add legend
bars = [bar1, bar2]
labels = [bar.get_label() for bar in bars]
ax.legend(bars, labels)

# Show plot
plt.title('Intention to reveal physical/mental health issues')
plt.show()

__Insights :__
- Responses about revealing physical health issues are spread across all three options, leaning slightly towards uncertainty and disagreement while still including a share of agreement.
- Responses about revealing mental health issues are predominantly negative, with disagreement and uncertainty clearly outweighing agreement.
- The free-text reasons given for these choices vary widely, with more than 1000 distinct entries.

### 2.4 Columns 40 to 46

Among these columns, we shall explore the two columns with a suspicious number of empty values.

In [None]:
# Setting a subset called new_df 
new_df = data[[
    'Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?',
    'Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?']].copy()

# Replacing the missing values in both columns of the subset
# (assigning the result avoids calling fillna(inplace=True) on a column selection)
new_df = new_df.fillna('empty_value')

# Performing some aggregations
new_df.value_counts()

__Insights__ : 
- The following columns hold three to six unique values with no empty values:
    - Do you feel that being identified as a person with a mental health issue would hurt your career?
    - Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?
    - How willing would you be to share with friends and family that you have a mental illness?
    - Do you have a family history of mental illness?
    - Have you had a mental health disorder in the past?
- The two remaining columns can benefit from an additional unique value, __I don't know__, instead of being dropped due to the enormous number of empty values.
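
As an illustration of that last idea, the sketch below fills empty answers with an explicit category on a hypothetical Series (`observed_bad_response` is a placeholder name). Because `fillna` returns a new object, the original data stays unaltered, in line with the rule for this phase.

```python
import numpy as np
import pandas as pd

# Hypothetical answers with empty values
observed = pd.Series(["Yes", np.nan, "No", np.nan, "Maybe"], name="observed_bad_response")

# fillna returns a new Series, so the original is left untouched
filled = observed.fillna("I don't know")

print(filled.value_counts().to_dict())
print(observed.isna().sum())
```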

### 2.6 Columns 47 to 54

Among these columns, we shall begin by exploring the existence of mental health issues among respondents.

In [None]:
# Inspect the combinations of values within these two columns
data[["Do you currently have a mental health disorder?",
      "Have you been diagnosed with a mental health condition by a medical professional?"]].value_counts()

In [None]:
portion = data.iloc[:, 47:52]

portion.head(1433)

In [None]:
# Extracting the unreasonable empty values 
portion[(portion["Do you currently have a mental health disorder?"] != 'No')
        &
        (portion["If yes, what condition(s) have you been diagnosed with?"].isna())
        &
        (portion["If maybe, what condition(s) do you believe you have?"].isna())
        ]


The following function __conditions_counter__ will help us determine the most selected conditions and how many respondents believe they deal with each condition.

In [None]:
# Defining a function called conditions_counter
def conditions_counter(column_name):
    """
    :param column_name: The name of column desired
    :return: a Dataframe containing the conditions and the number of their occurrences in the column
    """

    # Counting how many respondents gave each distinct answer, excluding empty values
    answer_counts = data[column_name].value_counts()

    # Creating an empty dictionary
    a_dict = dict()

    # Iterating over each distinct answer and the number of respondents who gave it
    for answer, respondents in answer_counts.items():
        # An answer may list several conditions separated by '|'
        for unit in answer.split("|"):
            # Adding the respondents of this answer to the running total of the condition
            a_dict[unit] = a_dict.get(unit, 0) + int(respondents)

    # Turning the data stored in the dictionary into a dataframe
    dframe = pd.DataFrame(list(a_dict.items()), columns=['Condition', 'Count'])

    # The final output of the function
    return dframe

In [None]:
# Setting the dataframes
known_conditions = conditions_counter("If yes, what condition(s) have you been diagnosed with?")
suspected_conditions = conditions_counter("If maybe, what condition(s) do you believe you have?")
diagnosed_conditions = conditions_counter("If so, what condition(s) were you diagnosed with?")

We'll merge the generated datasets to provide a compact overview.

In [None]:
# Merging the data frames into final_df
final_df = pd.merge(
    pd.merge(
        known_conditions,
        suspected_conditions,
        on='Condition', how='outer',
        suffixes=('_confirmed', '_suspected')
    ), diagnosed_conditions, on='Condition', how='outer')

# Replacing empty values
final_df[['Count_confirmed', 'Count_suspected', 'Count']] = final_df[
    ['Count_confirmed', 'Count_suspected', 'Count']].replace(np.nan, 0)

# Adjusting the data type
final_df[['Count_confirmed', 'Count_suspected', 'Count']] = final_df[
    ['Count_confirmed', 'Count_suspected', 'Count']].astype('int64')

# Adding a new column called: Total
final_df["Total"] = final_df["Count_confirmed"] + final_df["Count_suspected"] + final_df["Count"]
final_df.sort_values("Total", ascending=False, inplace=True)

# Print final_df
final_df

Here, we can see that there are some redundant values:

In [None]:
sorted(final_df["Condition"].to_list())

In [None]:
# Create a new DataFrame for counting unique values and their counts from both columns
is_affected = pd.DataFrame(data[
                               "If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?"].value_counts()).reset_index()
is_not_affected = pd.DataFrame(data[
                                   "If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?"
                               ].value_counts()).reset_index()

is_affected.columns = is_not_affected.columns = ["Categories", "Count"]

# Merging the data frames into merged_df
# (the suffixes name the treated and untreated cases the two questions ask about)
merged_df = pd.merge(
        is_affected,
        is_not_affected,
        on='Categories', how='outer',
        suffixes=('_treated', '_not_treated'))

# Melt the DataFrame to create a "long-form" representation
melted = pd.melt(merged_df, id_vars='Categories', var_name='Condition', value_name='Count')

# Plot side-by-side bars using Seaborn
plt.figure(figsize=(10, 6))
sns.barplot(data=melted, x='Categories', y='Count', hue='Condition')
plt.title('Impact of Mental Health Issue on Work')
plt.xlabel('Categories')
plt.ylabel('Count')
plt.show()

__Insights__:
- These columns require some feature engineering to transform them from text into new columns with numerical values of 0 and 1.
- Since there are some extra missing values, adopting an adequate approach to fill the missing values is required. 
- Redundancy should be removed by standardizing the respondents' input.
- The __MAJOR__ conditions are: 
    - Mood Disorder (Depression, Bipolar Disorder, etc)
    - Anxiety Disorder (Generalized, Social, Phobia, etc)
    - Attention Deficit Hyperactivity Disorder
    - Post-traumatic Stress Disorder
    - Personality Disorder (Borderline, Antisocial, Paranoid, etc)
    - Obsessive-Compulsive Disorder
    - Substance Use Disorder
    - Stress Response Syndromes
    - Addictive Disorder
    - Eating Disorder (Anorexia, Bulimia, etc)
- According to our respondents, these conditions interfere with an employee's productivity __RARELY__ rather than __OFTEN__.
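
The transformation into 0/1 columns mentioned in the first insight can be sketched with the pandas method `Series.str.get_dummies`, which splits on a separator and creates one indicator column per value. The responses below are hypothetical and use shortened condition names; only the pipe-separated format is taken from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical responses in the pipe-separated format of the condition columns
responses = pd.Series([
    "Anxiety Disorder|Mood Disorder",
    "Mood Disorder",
    np.nan,
    "Eating Disorder|Anxiety Disorder",
])

# One 0/1 column per condition; a missing answer becomes a row of zeros
encoded = responses.str.get_dummies(sep="|")

print(encoded)
print(encoded.sum().to_dict())
```

Note that this treats a missing answer the same as "no condition selected", so the missing values should be handled deliberately before or after encoding.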

### 2.7 Columns 55 to 56
For these columns, running the __.describe()__ and __.value_counts()__ methods is enough to unfold the information we seek.

In [None]:
data["What is your age?"].describe()

In [None]:
# Create a box plot for the numerical column with outliers
fig2 = plt.figure(figsize=(5, 6))
sns.boxplot(y='What is your age?', data=data)
plt.title('Age distribution')
plt.ylabel('Age')
plt.ylim(1, None)  # Start the y-axis at 1 (pyplot exposes ylim, not set_ylim)
plt.savefig('box_plot.png', bbox_inches='tight')  # Specify the file name and format
plt.show()

In [None]:
# Create a vertical box plot for the numerical column with outliers
# (matplotlib and seaborn are already imported at the top of the notebook)
plt.figure(figsize=(4, 5))
ax = sns.boxplot(y='What is your age?', data=data)


# Set y-axis limits and omit the first 0
ax.set_ylim(1, None)

# Save the plot as a picture
plt.savefig('box_plot_1.png', bbox_inches='tight')  # Specify the file name and format
plt.show()

In [None]:
gender = pd.DataFrame(data["What is your gender?"].value_counts()).reset_index()
gender

__Insights__ : 
- Obvious typos must be handled, given the unreasonable values 3, 15, and 323 in the age column.
- The redundancy in the gender section must be addressed as well.
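
One possible way to address the gender redundancy is to normalize case and whitespace and then map known spellings to canonical labels. The entries and the `canonical` mapping below are hypothetical and far from exhaustive; collapsing every unmapped entry into "Other" is an assumption made only to keep the sketch short.

```python
import pandas as pd

# Hypothetical free-text entries illustrating the kind of redundancy described above
gender = pd.Series(["Male", "male", " M", "Female", "female ", "F", "non-binary"])

# Assumed mapping from normalized spellings to canonical labels; extend as needed
canonical = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

# Normalize whitespace and case, then map; unmapped entries fall back to "Other"
normalized = gender.str.strip().str.lower()
standardized = normalized.map(canonical).fillna("Other")

print(standardized.value_counts().to_dict())
```

In practice the mapping should be built from the actual list of unique values shown above, and the fallback category deserves a manual review rather than a blanket label.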


### 2.8 Columns 57 to 60
These columns look fine with no unusual inputs.

In [None]:
# Counting the respondents per US state of residence and per US state of work
geo_employee_residency = pd.DataFrame(data["What US state or territory do you live in?"].value_counts()).reset_index()
geo_employee_work = pd.DataFrame(data["What US state or territory do you work in?"].value_counts()).reset_index()

geo_employee_work.columns = geo_employee_residency.columns = ["States", "Count"]

In [None]:
from geopy.geocoders import Nominatim

def get_state_coordinates(state):
    geo_locator = Nominatim(user_agent="my_geocoder")
    location = geo_locator.geocode(state + ", USA")
    if location:
        return location.latitude, location.longitude
    else:
        return None

In [None]:
coordinates_work = {}

for element in geo_employee_work["States"].to_list() :
    value = get_state_coordinates(element)
    coordinates_work[element] = value

coordinates_work

In [None]:
coordinates_residency ={}

for element in geo_employee_residency["States"].to_list() :
    value = get_state_coordinates(element)
    coordinates_residency[element] = value

coordinates_residency

In [None]:
from folium.plugins import HeatMap

# Create a map centered at the United States
map_us = folium.Map(location=[37.0902, -95.7129], zoom_start=4)

# Number of respondents per work state, keyed by state name
work_counts = geo_employee_work.set_index("States")["Count"].to_dict()

# Build [latitude, longitude, weight] triples, skipping states that could not be geocoded
heat_data = [
    [coords[0], coords[1], int(work_counts[state])]
    for state, coords in coordinates_work.items()
    if coords is not None
]

# Create the heatmap
HeatMap(heat_data).add_to(map_us)

# Add a marker and a circle scaled by the respondent count for each geocoded state
for state, coords in coordinates_work.items():
    if coords is None:
        continue
    count = int(work_counts[state])
    folium.Marker(location=list(coords), popup=state).add_to(map_us)
    folium.Circle(
        location=list(coords),
        radius=count * 1000,  # Radius in meters; adjust as per your requirement
        popup=f"{state} - {count} respondents",
        fill=True,
        fill_opacity=0.5,
        color='blue',
        fill_color='blue'
    ).add_to(map_us)


# Display the map
map_us

In [None]:
# Create a geolocator object
geolocator = Nominatim(user_agent="my_geocoder")

# Example list of cities in the United States
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']

# Dictionary to store city coordinates
city_coordinates = {}

# Retrieve coordinates for each city
for city in cities:
    location = geolocator.geocode(city)
    if location:
        city_coordinates[city] = (location.latitude, location.longitude)

print(city_coordinates)

In [None]:
# folium, HeatMap, pandas and get_state_coordinates are already available from earlier cells

# Example data - replace this with your own data
# (named example_data so that the survey DataFrame `data` is not overwritten)
example_data = {
    'State': ['California', 'New York', 'Texas', 'Florida', 'Illinois'],
    'Value': [100, 200, 150, 300, 250]
}

# Geocode the states and create a DataFrame with coordinates and values
coordinates = []
for state, value in zip(example_data['State'], example_data['Value']):
    coords = get_state_coordinates(state)
    if coords:
        coordinates.append({'State': state, 'Latitude': coords[0], 'Longitude': coords[1], 'Value': value})
df = pd.DataFrame(coordinates)

# Create a base map centered around the US
us_map = folium.Map(location=[37.0902, -95.7129], zoom_start=4)

# Convert data to a list of [latitude, longitude, weight] points
heat_data = df[['Latitude', 'Longitude', 'Value']].values.tolist()

# Create HeatMap layer
HeatMap(heat_data).add_to(us_map)

# Save the map to an HTML file
us_map.save("heatmap_us_map.html")


## 3. Summary
The exploration of this dataset revealed several issues. Those that will have to be handled in the pre-processing phase include:
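- Missing values are widespread and often follow the structure of the survey (for example, self-employed respondents and respondents without previous employers skip the related questions), so each group of columns needs its own filling or dropping strategy.
- The age column contains unreasonable values (3, 15, and 323), and the gender column contains redundant entries that must be standardized.
- The condition columns store several conditions in one text value and contain redundant entries, so they need to be standardized and transformed into numerical 0/1 columns.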

## Authors

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01">Joseph Santarcangelo</a> has a PhD in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

<a href="https://www.linkedin.com/in/nayefaboutayoun/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01">Nayef Abou Tayoun</a> is a Data Scientist at IBM and is pursuing a Master of Management in Artificial Intelligence degree at Queen's University.

## Change Log

| Date       | Version | Changed By    | Change Description      |
|------------|---------|---------------|-------------------------|
| 2021-10-12 | 1.1     | Lakshmi Holla | Modified markdown       |
| 2020-09-20 | 1.0     | Joseph        | Modified Multiple Areas |
| 2020-11-10 | 1.1     | Nayef         | updating the input data |