# EDA
**Project notes** will be collected in the *project_notes.md* file.

## Preparation

### Libraries & Settings

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown

In [27]:
# set options
pd.set_option('display.float_format', '{:,.2f}'.format)

### Data

In [44]:
# read data
df = pd.read_csv("/".join(["../data/raw", "hr-analytics-prediction", "HR-Employee-Attrition.csv"]))

# read data dictionary
Markdown(filename="../docs/data_dictionary.md")

| column                    | scale      | type | description |
|--------------------------:|:----------:|:----:|:------------|
| Age                       | metric     | int  | Age of the employee |
| Attrition                 | binary     | cat  | employee attrition (TARGET) |
| BusinessTravel            | category   | cat  | how frequently an employee travels for business purpose |
| DailyRate                 | metric     | int  | Daily Wage of an employee |
| Department                | category   | cat  | Employee department |
| DistanceFromHome          | metric     | int  | distance from home to office in km |
| Education                 | category   | int  | employee qualification (masked) |
| EducationField            | category   | cat  | Stream of education |
| EmployeeCount             | metric     | int  | Employee count |
| EmployeeNumber            | individual | int  | employee number |
| EnvironmentSatisfaction   | category   | int  | satisfaction with the work environment |
| Gender                    | binary     | cat  | gender of employee |
| HourlyRate                | metric     | int  | employee hourly rate |
| JobInvolvement            | category   | int  | employee job involvement |
| JobLevel                  | category   | int  | level of job |
| JobRole                   | category   | cat  | employee job role |
| JobSatisfaction           | category   | int  | is employee satisfied? |
| MaritalStatus             | category   | cat  | employee marital status |
| MonthlyIncome             | metric     | int  | employee monthly income |
| MonthlyRate               | metric     | int  | employee monthly rate |
| NumCompaniesWorked        | metric     | int  | number of companies worked for |
| Over18                    | category   | cat  | age over 18 |
| OverTime                  | binary     | cat  | whether employee works overtime |
| PercentSalaryHike         | metric     | int  | salary hike |
| PerformanceRating         | category   | int  | performance rate |
| RelationshipSatisfaction  | category   | int  | relationship satisfaction |
| StandardHours             | metric     | int  | standard work hours per week |
| StockOptionLevel          | category   | int  | company stock option level |
| TotalWorkingYears         | metric     | int  | total working years |
| TrainingTimesLastYear     | metric     | int  | training time |
| WorkLifeBalance           | category   | int  | work life balance |
| YearsAtCompany            | metric     | int  | total years at current company |
| YearsInCurrentRole        | metric     | int  | total years in current role |
| YearsSinceLastPromotion   | metric     | int  | years since last promotion |
| YearsWithCurrentManager   | metric     | int  | years worked under current manager |
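
A data dictionary is maintained by hand, so its column names can drift from the ones in the data. The following is a minimal sketch of a consistency check, assuming the dictionary is a markdown table whose first column holds the column names. The helper `dictionary_mismatches` and the toy inputs are hypothetical and not part of the project.

```python
import pandas as pd

def dictionary_mismatches(md_table: str, frame: pd.DataFrame) -> dict:
    """Compare the first column of a markdown table with the frame's column names."""
    documented = []
    for line in md_table.strip().splitlines()[2:]:  # skip header and separator rows
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        documented.append(cells[0])
    actual = set(frame.columns)
    return {
        "only_in_dictionary": sorted(set(documented) - actual),
        "only_in_data": sorted(actual - set(documented)),
    }

# toy dictionary with one deliberately misspelled column name (assumption, for illustration)
toy_dictionary = """
| column   | scale  | type | description |
|---------:|:------:|:----:|:------------|
| Age      | metric | int  | age of the employee |
| Jobrole  | cat    | cat  | employee job role |
"""
toy_frame = pd.DataFrame({"Age": [41], "JobRole": ["Sales Executive"]})
mismatches = dictionary_mismatches(toy_dictionary, toy_frame)
print(mismatches)
```

In the notebook, the same helper could be called with the text of `../docs/data_dictionary.md` and `df`, provided the file contains only the table.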


In [29]:
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1460 | 1461 | 1462 | 1463 | 1464 | 1465 | 1466 | 1467 | 1468 | 1469 |
|--:|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 41 | 49 | 37 | 33 | 27 | 32 | 59 | 30 | 38 | 36 | ... | 29 | 50 | 39 | 31 | 26 | 36 | 39 | 27 | 49 | 34 |
| Attrition | Yes | No | Yes | No | No | No | No | No | No | No | ... | No | Yes | No | No | No | No | No | No | No | No |
| BusinessTravel | Travel_Rarely | Travel_Frequently | Travel_Rarely | Travel_Frequently | Travel_Rarely | Travel_Frequently | Travel_Rarely | Travel_Rarely | Travel_Frequently | Travel_Rarely | ... | Travel_Rarely | Travel_Rarely | Travel_Rarely | Non-Travel | Travel_Rarely | Travel_Frequently | Travel_Rarely | Travel_Rarely | Travel_Frequently | Travel_Rarely |
| DailyRate | 1102 | 279 | 1373 | 1392 | 591 | 1005 | 1324 | 1358 | 216 | 1299 | ... | 468 | 410 | 722 | 325 | 1167 | 884 | 613 | 155 | 1023 | 628 |
| Department | Sales | Research & Development | Research & Development | Research & Development | Research & Development | Research & Development | Research & Development | Research & Development | Research & Development | Research & Development | ... | Research & Development | Sales | Sales | Research & Development | Sales | Research & Development | Research & Development | Research & Development | Sales | Research & Development |
| DistanceFromHome | 1 | 8 | 2 | 3 | 2 | 2 | 3 | 24 | 23 | 27 | ... | 28 | 28 | 24 | 5 | 5 | 23 | 6 | 4 | 2 | 8 |
| Education | 2 | 1 | 2 | 4 | 1 | 2 | 3 | 1 | 3 | 3 | ... | 4 | 3 | 1 | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
| EducationField | Life Sciences | Life Sciences | Other | Life Sciences | Medical | Life Sciences | Medical | Life Sciences | Life Sciences | Medical | ... | Medical | Marketing | Marketing | Medical | Other | Medical | Medical | Life Sciences | Medical | Medical |
| EmployeeCount | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| EmployeeNumber | 1 | 2 | 4 | 5 | 7 | 8 | 10 | 11 | 12 | 13 | ... | 2054 | 2055 | 2056 | 2057 | 2060 | 2061 | 2062 | 2064 | 2065 | 2068 |


In [30]:
# duplicates (not shown: a cell displays only its last expression)
df.duplicated().sum()

# duplicates ignoring the ID column
df.drop(columns="EmployeeNumber").duplicated().sum()

np.int64(0)
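
Because the cell above shows only its last result, the count that includes the ID column goes unreported. A minimal sketch that returns both counts at once follows. The helper `duplicate_counts` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def duplicate_counts(frame: pd.DataFrame, id_col: str) -> dict:
    """Count duplicated rows with and without the identifier column."""
    return {
        "with_id": int(frame.duplicated().sum()),
        "without_id": int(frame.drop(columns=id_col).duplicated().sum()),
    }

# toy data (assumption): rows 0 and 1 differ only in the ID
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],
    "Age": [30, 30, 41],
    "Gender": ["F", "F", "M"],
})
counts = duplicate_counts(toy, id_col="EmployeeNumber")
print(counts)
```

Comparing the two numbers separates exact duplicate records from distinct employees who merely share all attribute values.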

In [31]:
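def overview(df=df):
    """Display dtype, non-null count, missing values, and unique values per column."""
    display(pd.DataFrame({"dtypes": df.dtypes,
                          "total": df.count(),
                          "missing_n": df.isna().sum(),
                          "missing_%": df.isna().mean() * 100,  # share of missing values in percent
                          "uniques_n": df.nunique(),
                          "uniques": [df[col].unique() for col in df.columns]
                          }))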

overview()

|                  | dtypes | total | missing_n | missing_% | uniques_n | uniques |
|-----------------:|:------:|------:|----------:|----------:|----------:|:--------|
| Age              | int64  | 1470  | 0         | 0.0       | 43        | [41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2... |
| Attrition        | object | 1470  | 0         | 0.0       | 2         | [Yes, No] |
| BusinessTravel   | object | 1470  | 0         | 0.0       | 3         | [Travel_Rarely, Travel_Frequently, Non-Travel] |
| DailyRate        | int64  | 1470  | 0         | 0.0       | 886       | [1102, 279, 1373, 1392, 591, 1005, 1324, 1358,... |
| Department       | object | 1470  | 0         | 0.0       | 3         | [Sales, Research & Development, Human Resources] |
| DistanceFromHome | int64  | 1470  | 0         | 0.0       | 29        | [1, 8, 2, 3, 24, 23, 27, 16, 15, 26, 19, 21, 5... |
| Education        | int64  | 1470  | 0         | 0.0       | 5         | [2, 1, 4, 3, 5] |
| EducationField   | object | 1470  | 0         | 0.0       | 6         | [Life Sciences, Other, Medical, Marketing, Tec... |
| EmployeeCount    | int64  | 1470  | 0         | 0.0       | 1         | [1] |
| EmployeeNumber   | int64  | 1470  | 0         | 0.0       | 1470      | [1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,... |
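
The overview shows `EmployeeCount` with a single unique value, so it carries no information. A minimal sketch for listing such constant columns follows. The helper `constant_columns` and the toy frame are hypothetical, and whether to drop these columns remains a project decision.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns with at most one distinct value (NaN ignored)."""
    n_unique = frame.nunique()
    return n_unique[n_unique <= 1].index.tolist()

# toy data (assumption): one constant column, one varying column
toy = pd.DataFrame({"EmployeeCount": [1, 1, 1], "Age": [41, 49, 37]})
constant = constant_columns(toy)
print(constant)
```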


In [None]:
# Transform target: encode Yes/No as 1/0
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0}).astype(int)


In [37]:
df.describe().T

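|                         | count  | mean    | std    | min   | 25%    | 50%    | 75%     | max    |
|------------------------:|-------:|--------:|-------:|------:|-------:|-------:|--------:|-------:|
| Age                     | 1470.0 | 36.92   | 9.14   | 18.0  | 30.0   | 36.0   | 43.0    | 60.0   |
| Attrition               | 1470.0 | 0.16    | 0.37   | 0.0   | 0.0    | 0.0    | 0.0     | 1.0    |
| DailyRate               | 1470.0 | 802.49  | 403.51 | 102.0 | 465.0  | 802.0  | 1157.0  | 1499.0 |
| DistanceFromHome        | 1470.0 | 9.19    | 8.11   | 1.0   | 2.0    | 7.0    | 14.0    | 29.0   |
| Education               | 1470.0 | 2.91    | 1.02   | 1.0   | 2.0    | 3.0    | 4.0     | 5.0    |
| EmployeeCount           | 1470.0 | 1.0     | 0.0    | 1.0   | 1.0    | 1.0    | 1.0     | 1.0    |
| EmployeeNumber          | 1470.0 | 1024.87 | 602.02 | 1.0   | 491.25 | 1020.5 | 1555.75 | 2068.0 |
| EnvironmentSatisfaction | 1470.0 | 2.72    | 1.09   | 1.0   | 2.0    | 3.0    | 4.0     | 4.0    |
| HourlyRate              | 1470.0 | 65.89   | 20.33  | 30.0  | 48.0   | 66.0   | 83.75   | 100.0  |
| JobInvolvement          | 1470.0 | 2.73    | 0.71   | 1.0   | 2.0    | 3.0    | 3.0     | 4.0    |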


In [35]:
df.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   