### Table of Contents
- 1. [Problem Statement](#section1)<br/>
    - 1.1 [Introduction](#section11)<br/>
    - 1.2 [Data source and data set](#section12)<br/>
- 2. [Load the packages and data](#section2)<br/>
- 3. [Examining Data Available From An Existing Information Source-Data profiling](#section3)<br/>
    - 3.1 [Pre_profiling](#section31)<br/>
    - 3.2 [Initial observations](#section32)<br/>
    - 3.3 [Post_profiling](#section33)<br/>
    - 3.4 [Final observations](#section34)<br/>
- 4. [Data normalization](#section4)<br/>
    - 4.1 [Standardize column headers to lower case](#section401)<br/>
    - 4.2 [Convert timestamp to date-time](#section402)<br/>
    - 4.3 [Missing data and its imputation](#section403)<br/>
    - 4.4 [Outlier Treatment](#section404)<br/>
    - 4.5 [Handling NaN data in categorical variables](#section405)<br/>
    - 4.6 [Grouping](#section406)<br/>
- 5. [Identify patterns in the data](#section5)<br/>
    - 5.1 [Treatment vs work_interfere](#section501)<br/>
    - 5.2 [Age Category Vs seeking treatment](#section502)<br/>
    - 5.3 [Family history Vs Seeking treatment](#section503)<br/>
    - 5.4 [Employee count of Companies](#section505)<br/>
    - 5.5 [Employee Count Vs treatment](#section506)<br/>
    - 5.6 [Using Donut chart to check the relationship between Gender and Treatment](#section507)<br/>
    - 5.7 [Seaborn swarm plot](#section508)<br/>
- 6. [Analysis through questions](#section6)<br/>
     - 6.1 [How does the frequency of mental health illness vary by geographic location?](#section601)<br/>
         - 6.1.1 [Which countries contribute the most?](#section602)<br/>
         - 6.1.2 [Which state contributes the most?](#section603)<br/>
         - 6.1.3 [What is the contribution of top 3 countries among all?](#section604)<br/>
         - 6.1.4 [What is the count and percentage of work interfere in work of the employees for the top 3 countries?](#section605)<br/>
         - 6.1.5 [What is the total number of employees going for treatment from the top 3 countries?](#section606)<br/>
         - 6.1.6 [How many people did go for treatment on the basis of gender for the top 3 countries?](#section607)<br/>
     - 6.2 [Relationship between mental health and attitude.](#section608)<br/>
- 7. [Conclusion](#section7)<br/>

<a id=section1></a> 
# 1. Problem Statement
![Oct22_18_862457080.png](https://hbr.org/resources/images/article_assets/2018/10/Oct22_18_862457080.png " Mental Health")

__How do we better understand the prevalence of mental health issues in the workplace?__

<a id=section11></a> 
### 1.1. Introduction
This dataset is from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace.
The survey aims to gauge how mental health is viewed within the tech/IT workplace and how prevalent certain mental health disorders are in the tech industry. The Open Sourcing Mental Illness (OSMI) team of volunteers uses this data to drive its work in raising awareness and improving conditions for people with mental health disorders in the IT workplace.

This exploratory data analysis is a way to practice the Python skills I have learned. The notebook walks through the data set step by step to explain how to approach it, and I record my observations as the analysis progresses.

<a id=section12></a> 
### 1.2. Data source and dataset
__a__. How was it collected? 

- __Name__: "Annual Mental Health in Tech Survey"
- __Sponsoring Organization__: Open Sourcing Mental Illness (OSMI)
- __Year__: 2014
- __Description__: "With over 1200 responses, we believe the 2014 __Mental Health in Tech Survey__ was the largest survey done on mental health in the tech industry." Since then, OSMI has conducted two more surveys, in 2016 and 2017, and the 2019 survey is open.

__b__. Is it a sample? If yes, was it properly sampled?
- Yes, it is a sample. We don't have official information about the data collection method, but it appears *not* to be a random sample, so we can assume that it is not representative. 

__c__. Data Set Description


|Column Name  |Description |
|------|---------------|
|Timestamp| |
|Age| |
|Gender| |
|Country| |
|state| If you live in the United States, which state or territory do you live in?|
|self_employed| Are you self-employed?|
|family_history| Do you have a family history of mental illness?|
|treatment| Have you sought treatment for a mental health condition?|
|work_interfere| If you have a mental health condition, do you feel that it interferes with your work?|
|no_employees| How many employees does your company or organization have?|
|remote_work| Do you work remotely (outside of an office) at least 50% of the time?|
|tech_company| Is your employer primarily a tech company/organization?|
|benefits| Does your employer provide mental health benefits?|
|care_options| Do you know the options for mental health care your employer provides?|
|wellness_program| Has your employer ever discussed mental health as part of an employee wellness program?|
|seek_help| Does your employer provide resources to learn more about mental health issues and how to seek help?|
|anonymity| Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?|
|leave| How easy is it for you to take medical leave for a mental health condition?|
|mental_health_consequence| Do you think that discussing a mental health issue with your employer would have negative consequences?|
|phys_health_consequence| Do you think that discussing a physical health issue with your employer would have negative consequences?|
|coworkers| Would you be willing to discuss a mental health issue with your coworkers?|
|supervisor| Would you be willing to discuss a mental health issue with your direct supervisor(s)?|
|mental_health_interview| Would you bring up a mental health issue with a potential employer in an interview?|
|phys_health_interview| Would you bring up a physical health issue with a potential employer in an interview?|
|mental_vs_physical| Do you feel that your employer takes mental health as seriously as physical health?|
|obs_consequence| Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?|
|comments| Any additional notes or comments|


<a id=section2></a> 
# 2. Load the packages and data

In [14]:
import sys
import numpy as np
import pandas as pd
import pandas_profiling          # Generates an HTML profile report of a DataFrame
import matplotlib.pyplot as plt
import seaborn as sns            # Provides a high level interface for drawing attractive and informative statistical graphics

# %matplotlib inline sets the backend of matplotlib to the 'inline' backend: with this backend, the output of
# plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell
# that produced it.
%matplotlib inline

In [15]:
# we can see the value of multiple statements at once
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
MentalData = pd.read_csv("C://Users//t.shah//Desktop//TanujGit//ExploratoryDataAnalysis-EDA//DataSet/MentalSurvey.csv")
pd.set_option('display.max_columns', 100)                                       # Display all dataframe columns in outputs (it has 27 columns, which is wider than the notebook)
MentalData.head()
MentalData.tail()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,8/27/2014 11:29,37,Female,United States,IL,,No,Yes,Often,25-Jun,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,8/27/2014 11:29,44,M,United States,IN,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,8/27/2014 11:29,32,Male,Canada,,,No,No,Rarely,25-Jun,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,8/27/2014 11:29,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,8/27/2014 11:30,31,Male,United States,TX,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
1254,9/12/2015 11:17,26,male,United Kingdom,,No,No,Yes,,26-100,No,Yes,No,No,No,No,Don't know,Somewhat easy,No,No,Some of them,Some of them,No,No,Don't know,No,
1255,9/26/2015 1:07,32,Male,United States,IL,No,Yes,Yes,Often,26-100,Yes,Yes,Yes,Yes,No,No,Yes,Somewhat difficult,No,No,Some of them,Yes,No,No,Yes,No,
1256,11/7/2015 12:36,34,male,United States,CA,No,Yes,Yes,Sometimes,More than 1000,No,Yes,Yes,Yes,No,No,Don't know,Somewhat difficult,Yes,Yes,No,No,No,No,No,No,
1257,11/30/2015 21:25,46,f,United States,NC,No,No,No,,100-500,Yes,Yes,No,Yes,No,No,Don't know,Don't know,Yes,No,No,No,No,No,No,No,
1258,2/1/2016 23:04,25,Male,United States,IL,No,Yes,Yes,Sometimes,26-100,No,No,Yes,Yes,No,No,Yes,Don't know,Maybe,No,Some of them,No,No,No,Don't know,No,
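
The `no_employees` column in the rows above contains the value `25-Jun`, which looks like a company-size range (`6-25`) that a spreadsheet program converted into a date before the CSV was saved. The following is a minimal sketch of one way to map such labels back, not a step taken from the original notebook. The mapping is an assumption: only `25-Jun` is visible in the output, and `5-Jan` is a guess at how a `1-5` range would have been damaged. The helper name `repair_employee_ranges` is hypothetical.

```python
import pandas as pd

# Assumed mapping: spreadsheet software appears to have turned numeric ranges
# into dates. Only "25-Jun" is visible in the notebook output; "5-Jan" is a
# guess at the other damaged label.
DATE_MANGLED_RANGES = {"25-Jun": "6-25", "5-Jan": "1-5"}


def repair_employee_ranges(values: pd.Series) -> pd.Series:
    """Return a copy with date-like labels mapped back to numeric ranges."""
    return values.replace(DATE_MANGLED_RANGES)


# Small stand-in sample, not the real survey data.
sample = pd.Series(["25-Jun", "More than 1000", "26-100", "5-Jan"])
repaired = repair_employee_ranges(sample)
print(repaired.tolist())
```

With the notebook's data, the same helper could be applied as `MentalData['no_employees'] = repair_employee_ranges(MentalData['no_employees'])`, after checking `MentalData['no_employees'].unique()` to confirm which labels are actually present.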


<a id=section3></a> 
# 3. Examining Data Available From An Existing Information Source-Data profiling

In [16]:
MentalData.dtypes

Timestamp                    object
Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
comments                     object
dtype: object

In [17]:
pd.set_option('display.float_format', '{:f}'.format)   # Show floats in plain decimal notation instead of scientific notation
MentalData.describe(include = 'all')

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
count,1259,1259.0,1259,1259,744,1241,1259,1259,995,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,1259,164
unique,884,,49,48,45,2,2,2,4,6,2,2,3,3,3,3,3,5,3,3,3,3,3,3,3,2,160
top,8/27/2014 12:31,,Male,United States,CA,No,No,Yes,Sometimes,25-Jun,No,Yes,Yes,No,No,No,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,Don't know,No,* Small family business - YMMV.
freq,8,,615,751,138,1095,767,637,465,290,883,1031,477,501,842,646,819,563,490,925,774,516,1008,557,576,1075,5
mean,,79428148.311358,,,,,,,,,,,,,,,,,,,,,,,,,
std,,2818299442.981952,,,,,,,,,,,,,,,,,,,,,,,,,
min,,-1726.0,,,,,,,,,,,,,,,,,,,,,,,,,
25%,,27.0,,,,,,,,,,,,,,,,,,,,,,,,,
50%,,31.0,,,,,,,,,,,,,,,,,,,,,,,,,
75%,,36.0,,,,,,,,,,,,,,,,,,,,,,,,,
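
The `count` row above is below 1259 for four columns, which implies missing values: `state` (515 missing), `self_employed` (18), `work_interfere` (264) and `comments` (1095). As a minimal sketch, these gaps can be counted directly instead of being read off the `describe` output. The helper name `missing_summary` and the small demo frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny stand-in frame; with the notebook's data you would pass MentalData.
demo = pd.DataFrame({
    "state": ["IL", None, None, "TX"],
    "work_interfere": ["Often", "Rarely", None, "Never"],
    "treatment": ["Yes", "No", "No", "Yes"],
})
summary = missing_summary(demo)
print(summary)
```

Calling `missing_summary(MentalData)` should list the same four columns, which is a quick cross-check before deciding how to impute them.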


In [20]:
MentalData.shape

(1259, 27)

In [21]:
MentalData.columns

Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

### Observations

- The Age column contains large discrepancies. Its _minimum value is -1726_, which is impossible because an age cannot be less than 0.
- At the other extreme, the _maximum is on the order of 99999999 or larger_ (the mean alone is about 79.4 million), which is equally impossible because no human age comes anywhere near that value.
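
The impossible `Age` values noted above can be isolated with a simple range check. This is a minimal sketch under stated assumptions: the bounds of 15 and 100 are an arbitrary choice for a working-age respondent, the helper name `flag_implausible_ages` is hypothetical, and the sample series is a stand-in rather than the real column.

```python
import pandas as pd

# Assumed bounds for a plausible respondent age; adjust as needed.
MIN_AGE, MAX_AGE = 15, 100


def flag_implausible_ages(ages: pd.Series, low: int = MIN_AGE, high: int = MAX_AGE) -> pd.Series:
    """Boolean mask that is True where an age falls outside [low, high]."""
    return ~ages.between(low, high)


# Stand-in values: -1726 is the minimum reported by describe(); the large
# value is an illustrative placeholder for an impossible maximum.
ages = pd.Series([37, 44, -1726, 31, 99999999])
mask = flag_implausible_ages(ages)

print(ages[mask].tolist())      # entries outside the plausible range
print(ages[~mask].median())     # median of the remaining, plausible ages
```

The median of the plausible ages is a common candidate for replacing the flagged entries, because it is not distorted by the extreme values the way the mean is.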

In [23]:
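import pandas_profiling                                                      # Get a quick overview for all the variables using pandas_profiling
profile = pandas_profiling.ProfileReport(MentalData)
# Note: `outputfile` is the keyword used by older pandas_profiling releases; newer releases appear to name it `output_file`.
profile.to_file(outputfile="mypreprofillingmentalsurvey.html")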