<center><img src="https://github.com/SoumyaK4/SoumyaK4/blob/main/Logo%20B.png?raw=true" width="120" height="50" /></center>

---
<a name = TOC></a>
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-profiling**](#Section5)<br>
  - **5.1** [**Checking for Missing Data**](#Section51)
  - **5.2** [**Checking for Redundant Data**](#Section52)
  - **5.3** [**Checking for Inconsistent Data**](#Section53)
  - **5.4** [**Checking for Outliers**](#Section54)

**6.** [**Data Pre-processing**](#Section6)<br>
  - **6.1** [**Handling of Missing Data**](#Section61)
  - **6.2** [**Handling of Redundant Data**](#Section62)
  - **6.3** [**Handling of Inconsistent Data**](#Section63)
  - **6.4** [**Handling of Outliers**](#Section64)

**7.** [**Data Post-profiling**](#Section7)<br>
  - **7.1** [**Checking for Missing Data**](#Section71)
  - **7.2** [**Checking for Redundant Data**](#Section72)
  - **7.3** [**Checking for Inconsistent Data**](#Section73)
  - **7.4** [**Checking for Outliers**](#Section74)

**8.** [**Exploratory Data Analysis**](#Section8)<br>
  - **8.1** [**How does age relate to various behaviors and/or their awareness of their employer's attitude toward mental health?**](#Section81)
  - **8.2** [**What is the density distribution of Age feature?**](#Section82)

**9.** [**Summarization**](#Section9)<br>

---
<a name = Section1></a>
# **1. Introduction**
[To ToC](#TOC)

---

- Use **minimum 3 points** and 2 images 

<center><img width=40% src="https://raw.githubusercontent.com/insaid2018/PGPDSAI/main/03%20Term%203%20-%20EDA%20%26%20Data%20Storytelling/03%20Module%203/img/03%20mental-health.jpg"></center>

---
<a name = Section2></a>
# **2. Problem Statement**
[To ToC](#TOC)

---
- Write down the problems

<center><img width=40% src="https://raw.githubusercontent.com/insaid2018/PGPDSAI/main/03%20Term%203%20-%20EDA%20%26%20Data%20Storytelling/03%20Module%203/img/04%20mental-health.png"></center>


**<h4>Scenario</h4>**

- <a href="https://osmihelp.org/">**OSMI**</a> is an organization working to **help people** **identify** and **overcome mental health disorders** while working in the tech space.

- They **perform surveys** to **measure attitudes** towards mental health in the tech workplace.

- Check out <a href="https://www.youtube.com/watch?v=NHulgcO_16U&list=PL1MEC8mwrpaIdzYKRidvNB5eYSwWrqFZ3">**Talks at Google**</a> for more clarity about mental health in the tech industry.


---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
[To ToC](#TOC)

---

- 4 useful [EDA Libraries](https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae)

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
# !pip install -q datascience                                         # Package that is required by pandas profiling
# !pip install -q pandas-profiling                                    # Library to generate basic statistics about data
# !pip install -q autoviz                                             # Automatically Visualize any dataset
# !pip install -q sweetviz                                            # Library to do EDA
!pip install -q dtale                                               # Easy way to view & analyze Pandas data structures

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Do not re-run the install cell (3.1) or the upgrade cell (3.2) after restarting the runtime.

In [None]:
# !pip install -q --upgrade pandas-profiling

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing for numerical data analysis
import pandas as pd                                                 # Importing for panel data analysis
import dtale                                                        # Importing for profiling the dataset
# pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
# pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
# pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
# pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
# from collections import Counter                                     # For counting hashable objects
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # All warnings will be suppressed

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
[To ToC](#TOC)

---

- This dataset was obtained from a survey conducted in 2014.

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 100 | 5 | 10 KB| 


| Id | Features | Description |
| :-- | :--| :--| 
|01|**Timestamp**|Time the survey was submitted.|
|02|**Age**|The age of the person.| 
|03|**Gender**|The gender of the person.|
|04|**Country**|The country where the person lives.|
|05|**state**|The state where the person lives.|



In [None]:
data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/survey.csv')
print('Data Shape:', data.shape)
data.head()

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [None]:
data.describe()
# data.describe(include='all')                                # To include all data types

In [None]:
# IQR-based range for Age (the described features include no 'Quantity' column)
q1, q3 = data['Age'].quantile([0.25, 0.75])
IQR = q3 - q1

low_range = q1 - 1.5*IQR
high_range = q3 + 1.5*IQR

print('Low range  : {} '.format(low_range))
print('High range : {} '.format(high_range))


**Observations:**

- The **average age** of the respondents is found to be **79428148 years**, which is **absurd**.

- Around **25%** of people have an **age** less than or equal to **27 years**.

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [None]:
data.info()

**Observations:**

- The **minimum** and **maximum ages** are found to be **negative** and **very large numbers**.

- It implies that there is **something wrong** with our **data**.

---
<a name = Section5></a>
# **5. Data Pre-Profiling**
[To ToC](#TOC) <br>

---

- [List of Dataframe functions](https://www.w3schools.com/python/pandas/pandas_ref_dataframe.asp)
- outliers - box plot, describe()
- duplicates - .duplicated()
- missing values - .info()
- inconsistency in dtypes - know what the type should be; .info() shows what it actually is.
- typos - check with value_counts() or unique()
- format - check with unique()

<a name = Section51></a>
### **5.1 Checking for Missing Data**

- In this section, we will identify missing data and check the proportion of it and take appropriate measures.

In [None]:
data['col_name'].isna().sum()                           # number of missing values ('col_name' is a placeholder)
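
The missing-data check described in this section can also be run for every column at once. The following is a minimal sketch, assuming a small toy frame (`toy`, with made-up values) in place of the survey data; the names `missing_count`, `missing_pct`, and `missing_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [25, 31, np.nan, 40],
    "state": ["CA", None, None, "NY"],
    "Country": ["United States", "Canada", "United Kingdom", "United States"],
})

# Number of missing cells per column
missing_count = toy.isna().sum()

# Share of missing cells per column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(2)

# Combine both views and list the worst-affected columns first
missing_summary = pd.DataFrame({"missing_count": missing_count,
                                "missing_pct": missing_pct})
print(missing_summary.sort_values("missing_pct", ascending=False))
```

The percentage view is what typically drives the decision between imputing a feature and dropping it.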

<a name = Section52></a>
### **5.2 Checking for Redundant Data**

- In this section, we will identify redundant data and check the proportion of it and take appropriate measures.
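
One possible way to measure redundancy is to count fully duplicated rows and express them as a share of all rows. This is a sketch on a hypothetical toy frame (`toy`), not the survey data itself.

```python
import pandas as pd

# Toy frame with one repeated row (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [25, 25, 31, 40],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Country": ["Canada", "Canada", "Canada", "India"],
})

# True for every row that repeats an earlier row in all columns
dup_mask = toy.duplicated()

dup_count = int(dup_mask.sum())
dup_pct = round(dup_count / len(toy) * 100, 2)

print("Duplicate rows:", dup_count)
print("Share of duplicate rows (%):", dup_pct)
```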

<a name = Section53></a>
### **5.3 Checking for Inconsistent Data**

- In this section, we will **identify inconsistency** in data.
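
Inconsistencies usually show up either as an unexpected dtype or as several spellings of the same category. The sketch below assumes a toy frame (`toy`) with made-up values; it checks whether a timestamp column is really stored as datetime and lists the distinct labels of a categorical column.

```python
import pandas as pd

# Toy frame: timestamps stored as text, and one category spelled three ways
toy = pd.DataFrame({
    "Timestamp": ["2014-08-27 11:29:31", "2014-08-27 11:29:37", "2014-08-27 11:29:44"],
    "Gender": ["Male", "male", "M"],
})

# Actual dtypes, to compare against what each feature should be
print(toy.dtypes)

# False here means the timestamps were read as text, not as datetime
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Timestamp"])
print("Timestamp stored as datetime?", timestamp_is_datetime)

# Distinct labels and their frequencies reveal typos and format differences
gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```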


<a name = Section54></a>
### **5.4 Checking for Outliers**

- Check for outliers in our data
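
A common rule of thumb flags values lying more than 1.5 times the interquartile range outside the quartiles. The following sketch applies that rule to a toy `ages` series with made-up values; the fence names are illustrative, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Toy ages for illustration, including two clearly implausible values
ages = pd.Series([-1729, 23, 27, 31, 32, 36, 329], name="Age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print("Potential outliers:", outliers.tolist())
```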

---
<a name = Section6></a>
# **6. Data Pre-Processing**
[**To ToC**](#TOC) 

---
- [List of Dataframe functions](https://www.w3schools.com/python/pandas/pandas_ref_dataframe.asp)
- outliers - drop - depends on the objective - contextual
- duplicates - .drop_duplicates() - contextual
- missing values - fillna(mean/median/mode) or delete the row/column.
- inconsistency in dtypes - .astype()
- typos - replace
- format - replace
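
The `replace` and `.astype()` entries in the list above can be sketched as follows. The toy frame and the `gender_map` dictionary are hypothetical; a real mapping would have to be built from the labels actually found in the data.

```python
import pandas as pd

# Toy frame: inconsistent category labels and ages stored as text
toy = pd.DataFrame({
    "Gender": ["Male", "male", "M", "Female", "f"],
    "Age": ["25", "31", "40", "29", "35"],
})

# Hypothetical mapping from observed variants to one canonical label
gender_map = {"male": "Male", "M": "Male", "f": "Female"}
toy["Gender"] = toy["Gender"].replace(gender_map)

# Typecast the text column to integers
toy["Age"] = toy["Age"].astype(int)

print(toy["Gender"].value_counts())
print(toy["Age"].sum())
```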

<a name = Section61></a>
### **6.1 Handling of Missing Data**

- In this section, we will take appropriate measures for missing data.

**Observations:**

- We can observe that following features are found to have missing values along with the proportions:

|Feature|Object Type|Missing Proportion|Solution|
|:--:|:--:|:--:|:--|
|state|Object|40%|Replace with mode.|
|self_employed|Object|1.43%|Replace with mode.|
|work_interfere|Object|20.97%|Replace with mode.|
|comments|Object|86.97%|Drop the feature.|

In [None]:
value = data['col_name'].mode()[0]                                  # or .mean() / .median() for numeric features
data['col_name'] = data['col_name'].fillna(value)                   # 'col_name' is a placeholder; a custom value works as well

In [None]:
# Dropping rows containing missing values
data.dropna(inplace=True)

# Checking for missing values again
data.isna().sum()

<a name = Section62></a>
### **6.2 Handling of Redundant Data**

- In this section, we will take appropriate measures for redundant data.

In [None]:
# print('Contains Duplicate Rows?', data.duplicated().any())
print('Contains how many Duplicate Rows?', data.duplicated().sum())

# We will start by first removing the duplicate rows
data.drop_duplicates(inplace=True)

<a name = Section63></a>
### **6.3 Handling of Inconsistent Data**

- In this section, we will **take appropriate measures** for inconsistent data.

- Previously, we observed that the **Timestamp** feature was **incorrectly identified** as Object, so we will rectify it.

- [Drop](https://www.w3schools.com/python/pandas/ref_df_drop.asp) the columns that are not needed
- [Typecast](https://www.w3schools.com/python/pandas/ref_df_astype.asp) the datatype

In [None]:
data['Timestamp'] = pd.to_datetime(data['Timestamp'])            # typecast the column to datetime

**Observation:**

- Now, we **handled inconsistency** of data **manually** for **one feature**, but it would be **impossible** when you have **hundreds of features**.

- In that case, we can **use interactive plots** like plotly to know all the possible values in each feature.

- Next, we will **identify** all the categorical features and render a bar plot to identify the **present values**.

- If we find any inconsistency in the feature, then we will take appropriate measures.

**Note:**

- The **approach followed below** for basic data analysis is **not mandatory**.

- You can **also go feature by feature** and analyze the data to understand its underlying structure.

- To **make our life easier**, we will be **utilizing a small hack**.


In [None]:
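import plotly.graph_objects as go                                   # Needed for go.Figure(); not imported in section 3.3

# Assumption: cat_features holds the names of the categorical (object-typed) columns
cat_features = data.select_dtypes(include='object').columns.tolist()

# Initiating a plotly figure
fig = go.Figure()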

# Adding first graph of Gender
fig.add_bar(x=data[cat_features[0]], y=data[cat_features[0]].index)

# Adding a button to select different features
button = [dict(method = 'restyle',
               args = [{'x': [data[cat_features[k]], 'undefined'],
                        'y': [data[cat_features[k]].index, 'undefined'],
                        'visible':[True, False]}], 
               label = cat_features[k])   for k in range(0, len(cat_features))]  

# Updating the layout of the graph
fig.update_layout(title_text='Frequency Distribution of Feature Values',
                  title_x=0.4,
                  width=1000,
                  height=450,
                  updatemenus=[dict(active=0,
                                    buttons=button,
                                    x=1.15,
                                    y=1,
                                    xanchor='left',
                                    yanchor='top')])

# Adding extra annotations alongside the button
fig.add_annotation(x=1.03,
                   y=0.97,
                   xref='paper',
                   yref='paper',
                   showarrow=False,
                   xanchor='left',
                   yanchor = 'top',
                   text='Feature')

# Display the graph
fig.show()

**Observations:**

- By interacting with the above figure, we can safely conclude that all the **other features** have **correct values**.

<a name = Section64></a>
### **6.4 Handling of Outliers**

- Next, recall that our **age** feature showed some **absurd numbers** such as 329, 999999999999, and -1729.

- These are **outliers**, and we will **cap** these values as follows:
  - All values above 75 will be capped to 75 (on **average, 65** is the **retirement** age, but we keep an extra buffer).
  - All values below 14 will be raised to 14.

In [None]:
# capping the outliers (using .loc avoids chained assignment)
data.loc[data['Age'] > 75, 'Age'] = 75
data.loc[data['Age'] < 14, 'Age'] = 14

**Observation:**

- Now that we have successfully cleansed our data we are good to go with exploring our data and finding insights.

---
<a name = Section7></a>
# **7. Data Post-Profiling**
[**To ToC**](#TOC)

---

<a name = Section71></a>
### **7.1 Checking for Missing Data**


<a name = Section72></a>
### **7.2 Checking for Redundant Data**


<a name = Section73></a>
### **7.3 Checking for Inconsistent Data**

<a name = Section74></a>
### **7.4 Checking for Outliers**

---
<a name = Section8></a>
# **8. Exploratory Data Analysis** 
[**To ToC**](#TOC)

#### Ask 15 ± 5 questions: relevant, reasonable, non-vague, covering univariate, bivariate, and multivariate analysis<br>

---



<a name = Section81></a>
**<h4>Question:** How does age relate to various behaviors and/or their awareness of their employer's attitude toward mental health?</h4>

**Observation:**

- 
-

<a name = Section82></a>
**<h4>Question:** What is the density distribution of Age feature?</h4>
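
One way to approach this question is a density-normalized histogram. The sketch below uses synthetic ages generated with NumPy (not the survey data) and plain matplotlib; the distribution parameters and the bin count are arbitrary assumptions chosen only to produce a plausible shape.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic ages for illustration only, clipped to the 14-75 range used for capping
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=32, scale=7, size=500).clip(14, 75)

# density=True rescales the bars so that their total area equals 1
fig, ax = plt.subplots(figsize=(8, 4))
heights, edges, _ = ax.hist(ages, bins=20, density=True,
                            color="steelblue", edgecolor="white")

ax.set_xlabel("Age")
ax.set_ylabel("Density")
ax.set_title("Density distribution of Age (synthetic data)")
# In a notebook with %matplotlib inline, the figure renders when the cell finishes
```

With the real data, `ages` would be replaced by the cleaned `data['Age']` column; a smoothed curve such as seaborn's `kdeplot` is an alternative to the histogram.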

**Observation:**

- 
-


---
<a name = Section9></a>
# **9. Summarization**
[To ToC](#TOC)

---

- **<h4>Conclusion</h4>**

  - The mental health survey has **helped** us to **understand** the **mental condition of employees** working in tech firms across countries.

  - A total of **1259 entries were recorded** during the survey out of which **1007 were recorded** from the **top 3 countries**.

  - The **United States leads the chart** in terms of participation in the survey **followed by** the **United Kingdom** and **Canada**.

  - At the **state level**, **California leads the chart**.

  - **48.1%** of **males**, **70%** of **females**, and **88%** of **trans** respondents in the survey were found to have **sought treatment**.

  - The following **parameters** are found to **affect mental health** the most, and thus the need for treatment:
    - Age
    - Family history
    - Work interference
    - Number of employees working in a company


-  **<h4>Actionable Insights</h4>**

  - There should be an **awareness program** about mental health and its effects.

  - Relationship **Managers** **should be supportive** with the right guidance towards their employees.

  - Managers should be **unbiased** concerning the work and the employees.

  - There should be **appropriate measures** and **support** for the employees suffering from mental health issues.

  - It is **good to give** an **appreciation** at work **regularly**.