# Skillenza - Devengers Hackathon EDA

### Objective of the problem
The objective is to predict the value of the “treatment” attribute from the given features of the test data. The predictions are to be written to a CSV file along with the ID, which is the unique identifier for each tuple. Please view the sample submission file to see how the submission file should be formatted, then upload your submission file to get a score. 

### Description of files
**Training File** : This file contains all features, including the target. It is used to train and validate the machine learning model.  
**Test File** : This file contains all features except the target variable. A prediction is to be made for every tuple in the test file. The predicted values are to be written to a CSV file along with the ID and uploaded.  
**Sample Submission** : An example of what the actual submission file should look like.
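
As a rough illustration of the submission format described above, the sketch below pairs IDs with predicted labels and serializes them as CSV. It makes several assumptions: that the ID column is named `s.no` (as in the training data), that the labels are `Yes`/`No`, and that `build_submission` is a hypothetical helper, not part of the competition kit. The sample submission file remains the authority on the exact layout.

```python
import pandas as pd


def build_submission(ids, predictions, id_col="s.no", target_col="treatment"):
    """Pair each test-row ID with its predicted label (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({id_col: list(ids), target_col: list(predictions)})


# Toy IDs and labels for illustration only; real values would come from
# the test file and a trained model.
submission = build_submission([1001, 1002, 1003], ["Yes", "No", "Yes"])

# With no path, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so only the ID and prediction columns are written.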

### Features 
 - Timestamp  
 - Age  
 - Gender 
 - Country 
 - state: If you live in the United States, which state or territory do you live in? 
 - self_employed: Are you self-employed? 
 - family_history: Do you have a family history of mental illness? 
 - treatment: Does the respondent really need treatment? This is the target variable. 
 - work_interfere: If you have a mental health condition, do you feel that it interferes with your work? 
 - no_employees: How many employees does your company or organization have? 
 - remote_work: Do you work remotely (outside of an office) at least 50% of the time? 
 - tech_company: Is your employer primarily a tech company/organization? 
 - benefits: Does your employer provide mental health benefits? 
 - care_options: Do you know the options for mental health care your employer provides? 
 - wellness_program: Has your employer ever discussed mental health as part of an employee wellness program? 
 - seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help? 
 - anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources? 
 - leave: How easy is it for you to take medical leave for a mental health condition? 
 - mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? 
 - phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? 
 - coworkers: Would you be willing to discuss a mental health issue with your coworkers? 
 - supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)? 
 - mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview? 
 - phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview? 
 - mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health? 
 - obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace? 
 - comments: Any additional notes or comments.

In [7]:
import datetime
import glob
import ipywidgets
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
from sklearn import metrics
import time

%matplotlib inline
%run ./plugins/widgets.py

Widget Loaded


### Global parameters and variables

In [8]:
plt.rcParams['figure.figsize'] = [16, 9]
plt.rcParams['font.size'] = 14
plt.rcParams['axes.grid'] = True
plt.rcParams['figure.facecolor'] = 'white'
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

## Load data

In [16]:
train_df = pd.read_csv('./data/devengers_train.csv')
print("Shape : ", train_df.shape)
train_df.sample(10)

Shape :  (1000, 28)


Unnamed: 0,s.no,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
583,584,2014-08-27 21:39:23,35,Male,United States,IN,No,No,No,Never,6-25,No,Yes,No,Yes,No,No,Yes,Very easy,No,No,Some of them,Yes,Maybe,Maybe,Yes,No,
124,125,2014-08-27 12:33:56,27,Male,Canada,,No,No,Yes,Never,100-500,No,No,Yes,Not sure,No,No,Don't know,Very difficult,Maybe,No,Some of them,Yes,No,Maybe,No,Yes,
459,460,2014-08-27 15:59:47,37,M,United States,OR,No,No,No,,500-1000,No,Yes,Yes,No,No,No,Don't know,Don't know,Maybe,Maybe,Some of them,Some of them,No,No,Don't know,No,
441,442,2014-08-27 15:43:45,36,Male,United States,KY,No,No,No,Rarely,More than 1000,Yes,Yes,Yes,No,Yes,Don't know,Don't know,Somewhat easy,Maybe,No,No,No,No,No,Don't know,No,
869,870,2014-08-28 17:10:08,43,Male,Mexico,,Yes,No,No,,More than 1000,Yes,Yes,Don't know,No,No,Don't know,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,Don't know,No,
852,853,2014-08-28 16:58:33,21,Male,United Kingdom,,No,Yes,No,Never,6-25,No,No,No,No,No,No,Yes,Somewhat easy,Maybe,No,Some of them,Some of them,No,Maybe,Don't know,No,
16,17,2014-08-27 11:34:20,23,Male,United Kingdom,,,No,Yes,Sometimes,26-100,Yes,Yes,Don't know,No,Don't know,Don't know,Don't know,Very easy,Maybe,No,Some of them,No,Maybe,Maybe,No,No,My company does provide healthcare but not to ...
270,271,2014-08-27 13:55:38,30,Male,Ireland,,Yes,No,No,Sometimes,1-5,Yes,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Some of them,Some of them,No,Maybe,No,No,
146,147,2014-08-27 12:39:48,24,M,United Kingdom,,No,No,No,Never,6-25,No,Yes,No,No,No,No,Yes,Don't know,Maybe,Maybe,No,Some of them,No,No,Don't know,No,
376,377,2014-08-27 15:22:45,27,female,United States,WA,No,No,Yes,Rarely,More than 1000,No,Yes,Yes,Yes,No,No,Yes,Don't know,Yes,Maybe,No,No,No,No,No,No,
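
The sampled rows above spell gender inconsistently (`Male`, `M`, `female`), so the column probably needs normalizing before it can be used as a categorical feature. The following is a minimal sketch of one way to do that. The function name `normalize_gender`, the synonym sets, and the catch-all `other` bucket are all assumptions made for illustration; the full set of spellings in the data should be inspected first.

```python
def normalize_gender(raw):
    """Map a free-text gender entry to 'male', 'female', or 'other'."""
    if raw is None:
        return "other"
    value = str(raw).strip().lower()
    # Assumed synonym sets; extend them after inspecting the real values.
    if value in {"m", "male", "man"}:
        return "male"
    if value in {"f", "female", "woman"}:
        return "female"
    # Anything unrecognized (including missing values) falls through here.
    return "other"


# Spellings taken from the sampled rows shown above.
samples = ["Male", "M", "female"]
normalized = [normalize_gender(s) for s in samples]
print(normalized)
```

On the DataFrame, this could be applied with `train_df['Gender'].map(normalize_gender)`, and `train_df['Gender'].value_counts()` would show which raw spellings still land in `other`.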


## Data Pre-processing

In [15]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 28 columns):
s.no                         1000 non-null int64
Timestamp                    1000 non-null object
Age                          1000 non-null int64
Gender                       1000 non-null object
Country                      1000 non-null object
state                        600 non-null object
self_employed                982 non-null object
family_history               1000 non-null object
treatment                    1000 non-null object
work_interfere               774 non-null object
no_employees                 1000 non-null object
remote_work                  1000 non-null object
tech_company                 1000 non-null object
benefits                     1000 non-null object
care_options                 1000 non-null object
wellness_program             1000 non-null object
seek_help                    1000 non-null object
anonymity                    1000 non-null object