# Background

Employee attrition refers to employees leaving an organization, for voluntary or involuntary reasons.

Attrition can occur naturally, through retirement, or through resignation, termination of contract, or a company's decision to make a position redundant. However, losing the most productive, creative, and engaged employees causes a huge loss to the organization.

As a responsible employer, you should care not only about the satisfaction of your customers but also about the attrition of your employees. To prevent unnecessary attrition, employers should first understand why their employees choose to leave and then take appropriate action to fix the existing problems, so as to create a comfortable environment for current employees and attract new talent to the company.

# Introduction

Let's look at how IBM's HR analyzed attrition among its employees.

The dataset is called 'IBM HR Analytics Employee Attrition & Performance' and can be found on [Kaggle](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset).

The dataset is a CSV file with records for 1,470 IBM employees. It contains detailed attributes such as *age, employee role, daily rate, job satisfaction, years at the company, years in current role, and whether the employee has left IBM*.

We can explore the most common factors that lead to employee attrition and even use the data to predict the attrition of employees.

The steps we will go through are:

- **Data pre-processing**
    - We will do some basic data processing, including reading in the dataset to get an overall view of the data and to understand its structure.
    
- **Data visualization**
    - We will make several plots to analyze the factors that may influence employee turnover, and let the audience interact with the charts.
- **Contextual visualization**
    - We look at related work on employee attrition and compare its results with ours.


## 1. Data pre-processing

First, let's import the necessary libraries and read in our dataset, then show its first ten rows. You can click **Cell > Run All** to see the results.

In [7]:
import pandas as pd
import bqplot
import numpy as np
import traitlets
import ipywidgets
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [8]:
# read in data
attrition = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
# show the first 10 rows
attrition.head(10)

|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| 5 | 32 | No | Travel_Frequently | 1005 | Research & Development | 2 | 2 | Life Sciences | 1 | 8 | ... | 3 | 80 | 0 | 8 | 2 | 2 | 7 | 7 | 3 | 6 |
| 6 | 59 | No | Travel_Rarely | 1324 | Research & Development | 3 | 3 | Medical | 1 | 10 | ... | 1 | 80 | 3 | 12 | 3 | 2 | 1 | 0 | 0 | 0 |
| 7 | 30 | No | Travel_Rarely | 1358 | Research & Development | 24 | 1 | Life Sciences | 1 | 11 | ... | 2 | 80 | 1 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |
| 8 | 38 | No | Travel_Frequently | 216 | Research & Development | 23 | 3 | Life Sciences | 1 | 12 | ... | 2 | 80 | 0 | 10 | 2 | 3 | 9 | 7 | 1 | 8 |
| 9 | 36 | No | Travel_Rarely | 1299 | Research & Development | 27 | 3 | Medical | 1 | 13 | ... | 2 | 80 | 2 | 17 | 3 | 2 | 7 | 7 | 7 | 7 |


The meaning of each column is fairly easy to guess, but to understand the structure of the dataset it also helps to know the data type of each column.

Which columns are `object` (descriptive text)? Which columns are `int64` (numerical)?

In [9]:
# show the information of the dataset, including the column names, the count of each column and the data type of each column
attrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

The following columns are descriptive (non-numerical, dtype `object`):

- Attrition
- BusinessTravel
- Department
- EducationField
- Gender
- JobRole
- MaritalStatus
- Over18
- OverTime
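Rather than reading this list off the `info()` output by hand, the split can also be obtained by selecting columns by dtype. The following is only a minimal sketch: `sample` is a hypothetical three-row stand-in built from a few values of the first rows shown above, so that the snippet runs without the CSV file. On the real data the same calls would be applied to `attrition`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: four columns, three rows.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Research & Development"],
    "DailyRate": [1102, 279, 1373],
})

# Numeric columns (int64 in this dataset).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric, i.e. the descriptive text columns.
object_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("descriptive:", object_cols)
```

Selecting with `exclude="number"` instead of naming the text dtype keeps the sketch independent of how a given pandas version labels string columns.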

For the numerical data, a good way to understand the information behind the numbers is to calculate some common summary statistics. Luckily, pandas can do all the calculations for us!

### Show the summary statistics (count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles) of the numerical data.

In [10]:
attrition.describe()

|  | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |


*However, note that some of these calculations are meaningless. For example, some columns use integers to represent a level, such as job level (1-5). It is pointless to calculate the mean or standard deviation of such numbers.*
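For level-coded columns like these, a frequency table (and, if a single number is wanted, the median) is usually more informative than the mean. A minimal sketch follows; the `levels` Series holds made-up values for illustration only and is not taken from the dataset. On the real data one would use, for example, `attrition['JobLevel']`.

```python
import pandas as pd

# Hypothetical job levels of eight employees (illustrative values only).
levels = pd.Series([1, 2, 2, 1, 3, 1, 5, 2], name="JobLevel")

# Number of employees at each level, ordered by level rather than by frequency.
level_counts = levels.value_counts().sort_index()

# Share of employees at each level.
level_share = levels.value_counts(normalize=True).sort_index()

# The median is still meaningful for ordered levels.
level_median = levels.median()

print(level_counts)
print(level_share)
print("median level:", level_median)
```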

## 2. Data Visualization

#### Next comes the data visualization. We believe that most employers would like to know why their employees leave the company, and this chart might help them.

In [11]:
@ipywidgets.interact(features = ['Age', 'Gender','MaritalStatus','Department','JobRole','OverTime'], 
                     style = plt.style.available)
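def count_plot(features,style):
    # features: the single column chosen in the drop-down; style: a matplotlib style name
    with plt.style.context(style):
        plt.figure(figsize = (20,5))
        # count employees per category, split by whether they left (Attrition = Yes/No)
        sns.countplot(x = features, hue = "Attrition", data = attrition)
        plt.show()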

interactive(children=(Dropdown(description='features', options=('Age', 'Gender', 'MaritalStatus', 'Department'…

The interactive visualization above shows the number of former and current employees in each segment of the selected feature.

### How to interact with the plot:
You can choose the factor you are interested in from the drop-down menu to get an overview of how attrition differs between its segments, and you can also choose the visualization style.

### Conclusion 1:
Based on the plots, we can draw a general conclusion for each factor:

- Age: Employees aged 28 to 31 tend to leave more often than other age groups.
- Gender: Men tend to leave at a higher rate than women.
- Marital Status: Single employees are more likely to leave than married or divorced employees.
- Department: Employees in the Research & Development department tend to be more likely to leave the company.
- Job Role: Sales executives and laboratory technicians are more likely to leave.
- Overtime: Employees who work overtime are much more likely to leave the company than those who don't.
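A count plot shows absolute numbers, so a large group can show many leavers simply because it has many employees. To compare groups of different sizes, it can help to look at the share of leavers within each group as well. The sketch below illustrates the idea with a row-normalized cross-tabulation. The `sample` frame is hypothetical and deliberately tiny; on the real data the same call would be applied to the columns of `attrition`.

```python
import pandas as pd

# Hypothetical data: department "B" has more leavers in absolute terms (2 vs. 1),
# but a lower share of leavers (2 of 8 vs. 1 of 2).
sample = pd.DataFrame({
    "Department": ["A", "A"] + ["B"] * 8,
    "Attrition": ["Yes", "No"] + ["Yes", "Yes"] + ["No"] * 6,
})

# Absolute counts, which is what a count plot displays.
counts = pd.crosstab(sample["Department"], sample["Attrition"])

# Row-normalized shares: each row sums to 1, so groups of different size are comparable.
rates = pd.crosstab(sample["Department"], sample["Attrition"], normalize="index")

print(counts)
print(rates)
```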

#### We also wondered whether employees' circumstances are influenced by other factors that in turn lead to attrition.

In [12]:
@ipywidgets.interact(x = ['JobLevel','Education','DistanceFromHome','JobInvolvement','JobSatisfaction','RelationshipSatisfaction','PerformanceRating'], 
                     y = ["WorkLifeBalance","PercentSalaryHike",'TotalWorkingYears','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager'],
                     style = plt.style.available)
def bar_plot(x,y,style):
    with plt.style.context(style):
        plt.figure(figsize = (20,5))
        # bar height is the mean of column y for leavers vs. stayers, split by the levels of column x
        sns.barplot(x = 'Attrition', y = y, hue = x, data = attrition)
        plt.show()

interactive(children=(Dropdown(description='x', options=('JobLevel', 'Education', 'DistanceFromHome', 'JobInvo…

The second interactive visualization focuses on patterns in the numerical columns.

We tried to find out whether there is any correlation between employees' circumstances (JobLevel, Education, DistanceFromHome, JobInvolvement, JobSatisfaction, RelationshipSatisfaction, PerformanceRating, work-life balance, working years, percent salary hike) and their attrition.
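By default, `sns.barplot` draws the mean of the `y` column for each combination of the `x` and `hue` values, with error bars around it. The bar heights can therefore be checked numerically with a `groupby`. This is a minimal sketch on a hypothetical six-row frame (`sample`), not on the real data:

```python
import pandas as pd

# Hypothetical values, for illustration only.
sample = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "JobLevel": [1, 1, 1, 2, 2, 2],
    "YearsWithCurrManager": [0, 2, 4, 7, 9, 6],
})

# Mean of the y column per (Attrition, JobLevel) group:
# rows correspond to the x-axis groups, columns to the hue levels.
bar_heights = (
    sample.groupby(["Attrition", "JobLevel"])["YearsWithCurrManager"]
    .mean()
    .unstack("JobLevel")
)

print(bar_heights)
```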

### How to interact with the plot:
You can choose any two features and a visualization style from the drop-down menus and check how they differ between employees who left and employees who stayed.

For the columns encoded as integers (1, 2, 3, 4, ...), a larger number means a higher level. For example, an employee with an education value of 5 has the highest level of education.

### Conclusion 2:

For most pairs of features (e.g. relationship satisfaction and work-life balance), employees who leave and employees who stay do not differ much.

For some pairs (e.g. job level and years with current manager), there is a clear distinction between employees who leave and employees who stay. Employees at a higher job level who have worked for the same manager for a long time (>= 4 years) are much more likely to leave than others. Similar patterns appear for a higher job level combined with longer total working years, years at the company, years in current role, and years since last promotion.

However, some factors have the *opposite* effect. Employees who are more satisfied with their job and with their relationships with coworkers tend to keep contributing to the company and are less likely to leave.

In general:

- **Some factors, such as education, have no significant influence on attrition.**

- **Some factors, such as distance from home, work-life balance, and job involvement, have some influence on attrition, but not a pronounced one.**

- **Sometimes people choose to leave because they are tired of staying in the same environment for a long time, even though they have reached a high-level position.**

- **Besides, people who love their working environment and colleagues tend to stay at the company longer.**

## 3. Contextual visualizations

**1. The [first contextual visualization](https://leaderchat.org/2012/05/28/exit-interviews-show-top-10-reasons-why-employees-quit/) shows the top 10 reasons why employees quit.**

<img src="https://i0.wp.com/leaderchat.org/wp-content/uploads/2012/05/top-10-reasons-why-employees-leave1.jpg"/>

The top 3 reasons are *limited career opportunities, supervisor lacked respect, and compensation*.

The picture indicates that most employees choose to leave because their working environment does not provide enough room for growth, recognition, or workload balance. This finding matches our conclusion that people who enjoy their working environment and colleagues tend to stay at the company longer.

We can also infer that employees who have already reached senior positions choose to leave because they no longer find the job challenging or attractive.

**2. The [dashboard](https://www.knime.com/blog/predicting-employee-attrition-with-machine-learning) uses the same dataset as ours, but is visualized with Tableau.**

<img src="https://www.knime.com/sites/default/files/15-customer-attrition-machine-learning.png"/>

Similarly, the dashboard analyzes the percentage of employee attrition by segment (gender, business travel, department, salary hike, and distance from home) with machine learning. Based on the dashboard, the author concludes that male employees who travel frequently, work in the HR department, have a low salary hike, and live far from the workplace have a high probability of leaving the company.



# References:
 - https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/notebooks
 - https://leaderchat.org/2012/05/28/exit-interviews-show-top-10-reasons-why-employees-quit/
 - https://www.knime.com/blog/predicting-employee-attrition-with-machine-learning