# Employee Turnover Prediction

Problem statement: 

The goal is to predict which employees will stay with the company and which will leave in the upcoming year. Starting from the given features, we performed exploratory data analysis, descriptive statistical analysis, outlier analysis, feature engineering, and feature selection, adjusting the data as required. We built visualizations with Seaborn and Matplotlib to surface patterns in the data, then trained machine learning models and compared them to find the most accurate one.



# Project Task 1: Data cleaning and statistical analysis

Task explanation:

● Import the dataset, convert the target variable values to 0 and 1, and separate the features from the target values.

● Split the features into numeric and categorical datasets.

● Run descriptive statistical analysis on numerical features.

# Importing Libraries

In [2]:
#for Manipulations
import pandas as pd
import numpy as np

#for Data visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#for Scientific computations
from scipy import stats
from sklearn.preprocessing import LabelEncoder

# to Ignore warnings
import warnings
warnings.filterwarnings("ignore")


# Reading the Dataset  

In [3]:
import os

In [4]:
emp = pd.read_csv('employee.csv')

In [6]:
emp.head()

Unnamed: 0,EmployeId,Age,Gender,MaritalStatus,Turnover,Travelling,Vertical,Qualifications,EducationField,EmployeSatisfaction,...,RelationshipSatisfaction,Hours,StockOptionLevel,TrainingTimesLastYear,Work&Life,YearsAtCompany,YearsAtCompany.1,YearsSinceLastPromotion,YearsWithCurrentManager,DistanceFromHome
0,63,29,M,Divorced,No,Sometimes,Research & Development,1,Medical,3,...,4,80,1,2,3,3,4,2,2,2
1,723,23,M,Single,No,Mostly,Sales,1,Life Sciences,2,...,4,80,0,3,3,1,2,0,0,6
2,1297,36,M,Single,Yes,Sometimes,Human Resources,4,Life Sciences,2,...,3,80,0,3,4,7,8,0,0,10
3,51,30,M,Divorced,No,Sometimes,Research & Development,4,Medical,3,...,2,80,3,5,3,10,11,3,9,12
4,1498,29,M,Single,Yes,Sometimes,Sales,3,Technical Degree,4,...,1,80,0,2,3,3,4,2,2,24


In [5]:
#Checking the columns
emp.columns

Index(['EmployeId', 'Age', 'Gender', 'MaritalStatus', 'Turnover', 'Travelling',
       'Vertical', 'Qualifications', 'EducationField', 'EmployeSatisfaction',
       'JobEngagement', 'JobLevel', 'JobSatisfaction', 'Role', 'DailyBilling',
       'HourBilling', 'MonthlyBilling', 'MonthlyRate', 'Work Experience',
       'OverTime', 'PercentSalaryHike', 'Last Rating',
       'RelationshipSatisfaction', 'Hours', 'StockOptionLevel',
       'TrainingTimesLastYear', 'Work&Life', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrentManager', 'DistanceFromHome'],
      dtype='object')

In [6]:
#Checking the shape of the dataset
emp.shape

(1470, 32)

In [7]:
#Checking Missing Values
emp.isnull().sum()

EmployeId                   0
Age                         0
Gender                      0
MaritalStatus               0
Turnover                    0
Travelling                  0
Vertical                    0
Qualifications              0
EducationField              0
EmployeSatisfaction         0
JobEngagement               0
JobLevel                    0
JobSatisfaction             0
Role                        0
DailyBilling                0
HourBilling                 0
MonthlyBilling              0
MonthlyRate                 0
Work Experience             0
OverTime                    0
PercentSalaryHike           0
Last Rating                 0
RelationshipSatisfaction    0
Hours                       0
StockOptionLevel            0
TrainingTimesLastYear       0
Work&Life                   0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrentManager     0
DistanceFromHome            0
dtype: int64

In [8]:
#Checking Duplicates
emp.duplicated().sum()

0

# Convert the target variable values to 0 and 1

In [9]:
#Check the unique values of the target variable
emp['Turnover'].unique()

array(['No', 'Yes'], dtype=object)

In [10]:
#Convert the Target variable into 0 & 1
emp['Turnover']=emp['Turnover'].str.replace('Yes','1')
emp['Turnover']=emp['Turnover'].str.replace('No','0')


In [11]:
#Converting String to Integer
emp['Turnover']=pd.to_numeric(emp['Turnover'])

In [12]:
emp['Turnover'].unique()

array([0, 1], dtype=int64)
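
The two `str.replace` calls followed by `pd.to_numeric` work for this column, but `str.replace` matches substrings, so an unexpected label could be altered silently. As a hedged alternative, the sketch below uses an explicit dictionary with `Series.map`, which turns any unmapped label into `NaN` so it can be caught. The `turnover` series here is a hypothetical stand-in for `emp['Turnover']`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the Turnover column of employee.csv
turnover = pd.Series(["No", "Yes", "No", "Yes", "No"], name="Turnover")

# Map each label explicitly; any label missing from the dict becomes NaN
mapping = {"No": 0, "Yes": 1}
encoded = turnover.map(mapping)

# Fail loudly if an unexpected label slipped through, then cast to int
assert encoded.notna().all(), "unexpected label in Turnover"
encoded = encoded.astype("int64")

print(encoded.tolist())  # [0, 1, 0, 1, 0]
```

On the real frame, the equivalent would be `emp['Turnover'] = emp['Turnover'].map(mapping)`, assuming the column still holds only `'No'` and `'Yes'`.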

In [13]:
#Check that the target variable was converted to 0 and 1
emp.head()

Unnamed: 0,EmployeId,Age,Gender,MaritalStatus,Turnover,Travelling,Vertical,Qualifications,EducationField,EmployeSatisfaction,...,RelationshipSatisfaction,Hours,StockOptionLevel,TrainingTimesLastYear,Work&Life,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrentManager,DistanceFromHome
0,63,29,M,Divorced,0,Sometimes,Research & Development,1,Medical,3,...,4,80,1,2,3,3,2,2,2,2
1,723,23,M,Single,0,Mostly,Sales,1,Life Sciences,2,...,4,80,0,3,3,1,0,0,0,6
2,1297,36,M,Single,1,Sometimes,Human Resources,4,Life Sciences,2,...,3,80,0,3,4,7,7,0,0,10
3,51,30,M,Divorced,0,Sometimes,Research & Development,4,Medical,3,...,2,80,3,5,3,10,6,3,9,12
4,1498,29,M,Single,1,Sometimes,Sales,3,Technical Degree,4,...,1,80,0,2,3,3,2,2,2,24
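
Once the target is 0/1, its mean is the fraction of employees labelled as leavers, which is why the `Turnover` mean of 0.181633 reported in the descriptive statistics corresponds to roughly 18% of the 1470 rows. A minimal sketch of checking the class balance, using a hypothetical ten-row target rather than the real column:

```python
import pandas as pd

# Hypothetical 0/1 target; the real values would come from emp['Turnover']
y = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="Turnover")

# Rows per class, and the fraction of rows per class
counts = y.value_counts().sort_index()
share = y.value_counts(normalize=True).sort_index()

# For a 0/1 column the mean equals the fraction of 1s
leave_rate = y.mean()

print(counts)
print(share)
print(leave_rate)
```

An imbalance like this is worth keeping in mind later, because plain accuracy can look good for a model that rarely predicts the minority class.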


# Separate features from target values

In [14]:
#Separate the features from the target variable
emp_x = emp[['EmployeId', 'Age', 'Gender', 'MaritalStatus','Travelling',
       'Vertical', 'Qualifications', 'EducationField', 'EmployeSatisfaction',
       'JobEngagement', 'JobLevel', 'JobSatisfaction', 'Role', 'DailyBilling',
       'HourBilling', 'MonthlyBilling', 'MonthlyRate', 'Work Experience',
       'OverTime', 'PercentSalaryHike', 'Last Rating',
       'RelationshipSatisfaction', 'Hours', 'StockOptionLevel',
       'TrainingTimesLastYear', 'Work&Life', 'YearsAtCompany',
        'YearsSinceLastPromotion','YearsWithCurrentManager', 'DistanceFromHome']]
emp_y = emp['Turnover']

In [15]:
emp_x.head()

Unnamed: 0,EmployeId,Age,Gender,MaritalStatus,Travelling,Vertical,Qualifications,EducationField,EmployeSatisfaction,JobEngagement,...,Last Rating,RelationshipSatisfaction,Hours,StockOptionLevel,TrainingTimesLastYear,Work&Life,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrentManager,DistanceFromHome
0,63,29,M,Divorced,Sometimes,Research & Development,1,Medical,3,3,...,4,4,80,1,2,3,3,2,2,2
1,723,23,M,Single,Mostly,Sales,1,Life Sciences,2,3,...,3,4,80,0,3,3,1,0,0,6
2,1297,36,M,Single,Sometimes,Human Resources,4,Life Sciences,2,3,...,3,3,80,0,3,4,7,0,0,10
3,51,30,M,Divorced,Sometimes,Research & Development,4,Medical,3,3,...,3,2,80,3,5,3,10,3,9,12
4,1498,29,M,Single,Sometimes,Sales,3,Technical Degree,4,3,...,3,1,80,0,2,3,3,2,2,24


In [16]:
emp_y.head()

0    0
1    0
2    1
3    0
4    1
Name: Turnover, dtype: int64
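
The hand-typed list used for `emp_x` contains 30 column names and does not include `YearsInCurrentRole`, which is present in `emp.columns`; the notebook does not say whether that omission is intentional. If the intent is simply "every column except the target", dropping the target by name avoids retyping the list. The sketch below assumes that intent and uses a hypothetical miniature frame, `emp_demo`, in place of the real data.

```python
import pandas as pd

# Hypothetical miniature of the employee table
emp_demo = pd.DataFrame({
    "Age": [29, 23, 36],
    "Gender": ["M", "M", "M"],
    "Turnover": [0, 0, 1],
    "YearsAtCompany": [3, 1, 7],
    "YearsInCurrentRole": [2, 0, 7],
})

target = "Turnover"
X = emp_demo.drop(columns=[target])  # every column except the target
y = emp_demo[target]

# Sanity checks: the target is gone and no other column was lost
assert target not in X.columns
assert X.shape[1] == emp_demo.shape[1] - 1

print(list(X.columns))
```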

# Split the features into numeric and categorical datasets

In [17]:
#Split the columns by dtype; note that `numeric` still contains the Turnover target
category = emp.select_dtypes("object")
numeric = emp.select_dtypes("number")


In [18]:
category.head()

Unnamed: 0,Gender,MaritalStatus,Travelling,Vertical,EducationField,Role,OverTime
0,M,Divorced,Sometimes,Research & Development,Medical,Laboratory Technician,No
1,M,Single,Mostly,Sales,Life Sciences,Sales Representative,No
2,M,Single,Sometimes,Human Resources,Life Sciences,Manager,No
3,M,Divorced,Sometimes,Research & Development,Medical,Manufacturing Director,No
4,M,Single,Sometimes,Sales,Technical Degree,Sales Representative,No


In [19]:
category.shape

(1470, 7)

In [20]:
numeric.head()


Unnamed: 0,EmployeId,Age,Turnover,Qualifications,EmployeSatisfaction,JobEngagement,JobLevel,JobSatisfaction,DailyBilling,HourBilling,...,RelationshipSatisfaction,Hours,StockOptionLevel,TrainingTimesLastYear,Work&Life,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrentManager,DistanceFromHome
0,63,29,0,1,3,3,1,1,368,37,...,4,80,1,2,3,3,2,2,2,2
1,723,23,0,1,2,3,1,3,599,97,...,4,80,0,3,3,1,0,0,0,6
2,1297,36,1,4,2,3,2,4,833,34,...,3,80,0,3,4,7,7,0,0,10
3,51,30,0,4,3,3,3,3,291,66,...,2,80,3,5,3,10,6,3,9,12
4,1498,29,1,3,4,3,1,4,143,61,...,1,80,0,2,3,3,2,2,2,24


In [21]:
numeric.shape

(1470, 25)
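
The split above is taken from `emp`, so the 25 numeric columns include the `Turnover` target, and 25 + 7 accounts for all 32 columns. If only features should be split, the same idea can be applied to a frame without the target. A minimal sketch under that assumption, with a hypothetical `features` frame; `exclude="number"` is used for the categorical side so the result does not depend on how a given pandas version labels string columns:

```python
import pandas as pd

# Hypothetical feature frame with the target already removed
features = pd.DataFrame({
    "Age": [29, 23, 36],
    "Gender": ["M", "M", "M"],
    "OverTime": ["No", "No", "No"],
    "YearsAtCompany": [3, 1, 7],
})

# Numeric columns, and everything that is not numeric
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = features.select_dtypes(exclude="number").columns.tolist()

# Every column should land in exactly one of the two groups
assert len(num_cols) + len(cat_cols) == features.shape[1]

numeric_features = features[num_cols]
categorical_features = features[cat_cols]

print(num_cols)  # ['Age', 'YearsAtCompany']
print(cat_cols)  # ['Gender', 'OverTime']
```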

# Run Descriptive statistical analysis on numerical features

In [22]:
emp.describe()

Unnamed: 0,EmployeId,Age,Turnover,Qualifications,EmployeSatisfaction,JobEngagement,JobLevel,JobSatisfaction,DailyBilling,HourBilling,...,RelationshipSatisfaction,Hours,StockOptionLevel,TrainingTimesLastYear,Work&Life,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrentManager,DistanceFromHome
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,1056.771429,37.065306,0.181633,2.87619,2.702041,2.712245,2.087075,2.785034,763.046939,65.07483,...,2.727891,80.0,0.817007,2.769388,2.798639,6.772109,4.066667,2.095238,3.945578,9.278231
std,594.598084,9.522562,0.385673,1.019038,1.095039,0.731141,1.109663,1.09584,426.331994,20.604377,...,1.086822,0.0,0.88007,1.3509,0.714718,5.777745,3.741427,3.261537,3.702486,8.15712
min,8.0,18.0,0.0,1.0,1.0,1.0,1.0,1.0,107.0,30.0,...,1.0,80.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,531.25,30.0,0.0,2.0,2.0,2.0,1.0,2.0,403.75,47.0,...,2.0,80.0,0.0,2.0,2.0,3.0,2.0,0.0,1.0,2.0
50%,1073.0,36.0,0.0,3.0,3.0,3.0,2.0,3.0,704.5,66.0,...,3.0,80.0,1.0,2.0,3.0,5.0,3.0,1.0,2.0,7.0
75%,1553.75,44.0,0.0,4.0,4.0,3.0,3.0,4.0,1151.0,83.0,...,4.0,80.0,1.0,3.0,3.0,9.0,7.0,2.0,7.0,15.0
max,2060.0,59.0,1.0,5.0,4.0,4.0,5.0,4.0,1495.0,100.0,...,4.0,80.0,3.0,6.0,4.0,33.0,17.0,15.0,17.0,29.0
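
In the table above, `Hours` has a minimum and maximum of 80 and a standard deviation of 0, so it is constant across all rows and cannot help separate leavers from stayers. One way to find such columns programmatically is to count distinct values. The sketch below uses a hypothetical two-column frame; whether to drop the column is a modelling decision the notebook has not made at this point.

```python
import pandas as pd

# Hypothetical frame: Hours is constant, Age is not
demo = pd.DataFrame({
    "Hours": [80, 80, 80, 80],
    "Age": [29, 23, 36, 30],
})

# A column with a single distinct value carries no information
constant_cols = [c for c in demo.columns if demo[c].nunique(dropna=False) == 1]
reduced = demo.drop(columns=constant_cols)

print(constant_cols)          # ['Hours']
print(list(reduced.columns))  # ['Age']
```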


In [23]:
emp['EmployeId'].describe()

count    1470.000000
mean     1056.771429
std       594.598084
min         8.000000
25%       531.250000
50%      1073.000000
75%      1553.750000
max      2060.000000
Name: EmployeId, dtype: float64

In [24]:
emp.mean(numeric_only=True) #find the mean (numeric_only is required on pandas 2.x, where text columns raise an error)

EmployeId                    1056.771429
Age                            37.065306
Turnover                        0.181633
Qualifications                  2.876190
EmployeSatisfaction             2.702041
JobEngagement                   2.712245
JobLevel                        2.087075
JobSatisfaction                 2.785034
DailyBilling                  763.046939
HourBilling                    65.074830
MonthlyBilling               6752.281633
MonthlyRate                 14539.982313
Work Experience                 2.659184
PercentSalaryHike              15.122449
Last Rating                     3.162585
RelationshipSatisfaction        2.727891
Hours                          80.000000
StockOptionLevel                0.817007
TrainingTimesLastYear           2.769388
Work&Life                       2.798639
YearsAtCompany                  6.772109
YearsInCurrentRole              4.066667
YearsSinceLastPromotion         2.095238
YearsWithCurrentManager         3.945578
DistanceFromHome                9.278231
dtype: float64

In [25]:
emp.median(numeric_only=True) #find the median

EmployeId                    1073.0
Age                            36.0
Turnover                        0.0
Qualifications                  3.0
EmployeSatisfaction             3.0
JobEngagement                   3.0
JobLevel                        2.0
JobSatisfaction                 3.0
DailyBilling                  704.5
HourBilling                    66.0
MonthlyBilling               4854.0
MonthlyRate                 14717.5
Work Experience                 1.0
PercentSalaryHike              14.0
Last Rating                     3.0
RelationshipSatisfaction        3.0
Hours                          80.0
StockOptionLevel                1.0
TrainingTimesLastYear           2.0
Work&Life                       3.0
YearsAtCompany                  5.0
YearsInCurrentRole              3.0
YearsSinceLastPromotion         1.0
YearsWithCurrentManager         2.0
DistanceFromHome                7.0
dtype: float64

In [26]:
emp.std(numeric_only=True) #find the standard deviation

EmployeId                    594.598084
Age                            9.522562
Turnover                       0.385673
Qualifications                 1.019038
EmployeSatisfaction            1.095039
JobEngagement                  0.731141
JobLevel                       1.109663
JobSatisfaction                1.095840
DailyBilling                 426.331994
HourBilling                   20.604377
MonthlyBilling              5141.197951
MonthlyRate                 7172.390592
Work Experience                2.501769
PercentSalaryHike              3.817658
Last Rating                    0.369112
RelationshipSatisfaction       1.086822
Hours                          0.000000
StockOptionLevel               0.880070
TrainingTimesLastYear          1.350900
Work&Life                      0.714718
YearsAtCompany                 5.777745
YearsInCurrentRole             3.741427
YearsSinceLastPromotion        3.261537
YearsWithCurrentManager        3.702486
DistanceFromHome               8.157120
dtype: float64


In [27]:
emp.var(numeric_only=True) #find the variance

EmployeId                   3.535469e+05
Age                         9.067919e+01
Turnover                    1.487434e-01
Qualifications              1.038439e+00
EmployeSatisfaction         1.199111e+00
JobEngagement               5.345675e-01
JobLevel                    1.231351e+00
JobSatisfaction             1.200865e+00
DailyBilling                1.817590e+05
HourBilling                 4.245403e+02
MonthlyBilling              2.643192e+07
MonthlyRate                 5.144319e+07
Work Experience             6.258850e+00
PercentSalaryHike           1.457451e+01
Last Rating                 1.362438e-01
RelationshipSatisfaction    1.181182e+00
Hours                       0.000000e+00
StockOptionLevel            7.745234e-01
TrainingTimesLastYear       1.824930e+00
Work&Life                   5.108218e-01
YearsAtCompany              3.338234e+01
YearsInCurrentRole          1.399828e+01
YearsSinceLastPromotion     1.063762e+01
YearsWithCurrentManager     1.370840e+01
DistanceFromHome

In [28]:
emp.skew(numeric_only=True) #find the skewness

EmployeId                  -0.051467
Age                         0.309846
Turnover                    1.653221
Qualifications             -0.287690
EmployeSatisfaction        -0.324228
JobEngagement              -0.498928
JobLevel                    0.938151
JobSatisfaction            -0.393550
DailyBilling                0.123899
HourBilling                -0.052079
MonthlyBilling              1.304112
MonthlyRate                -0.016495
Work Experience             1.134661
PercentSalaryHike           0.917203
Last Rating                 1.830742
RelationshipSatisfaction   -0.317500
Hours                       0.000000
StockOptionLevel            0.970663
TrainingTimesLastYear       0.619843
Work&Life                  -0.557695
YearsAtCompany              1.618834
YearsInCurrentRole          1.176682
YearsSinceLastPromotion     2.209494
YearsWithCurrentManager     1.072928
DistanceFromHome            0.882660
dtype: float64
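
To turn the skewness output into a short list of columns worth a closer look, the values can be filtered by magnitude. The sketch below copies five values from the output above into a series and flags those with absolute skewness above 1.0. That cutoff is only a common rule of thumb assumed here for illustration; the notebook does not specify one.

```python
import pandas as pd

# A few skewness values copied from the output above
skew = pd.Series({
    "Age": 0.309846,
    "HourBilling": -0.052079,
    "JobLevel": 0.938151,
    "MonthlyBilling": 1.304112,
    "YearsSinceLastPromotion": 2.209494,
})

threshold = 1.0  # assumed rule-of-thumb cutoff, not from the notebook

# Keep columns whose skewness magnitude exceeds the cutoff, largest first
flagged = skew[skew.abs() > threshold].sort_values(ascending=False)

print(flagged.index.tolist())  # ['YearsSinceLastPromotion', 'MonthlyBilling']
```

On the full data, the same filter could be applied directly to the result of `emp.skew(numeric_only=True)`.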

In [29]:
emp.kurt(numeric_only=True) #find the kurtosis

EmployeId                  -1.150945
Age                        -0.630356
Turnover                    0.734137
Qualifications             -0.573288
EmployeSatisfaction        -1.200799
JobEngagement               0.173943
JobLevel                    0.170524
JobSatisfaction            -1.167805
DailyBilling               -1.250962
HourBilling                -1.205422
MonthlyBilling              0.587944
MonthlyRate                -1.065124
Work Experience             0.203050
PercentSalaryHike          -0.191919
Last Rating                 1.353457
RelationshipSatisfaction   -1.192346
Hours                       0.000000
StockOptionLevel            0.274499
TrainingTimesLastYear       0.335005
Work&Life                   0.452730
YearsAtCompany              3.125568
YearsInCurrentRole          1.309247
YearsSinceLastPromotion     4.787841
YearsWithCurrentManager     0.867401
DistanceFromHome           -0.396525
dtype: float64
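
The six statistics above can also be collected into a single table, which makes columns easier to compare side by side. The sketch below passes a list of method names to `DataFrame.agg`. It uses a hypothetical five-row frame whose values are taken from the `emp.head()` preview, so the numbers it prints will not match the full-dataset results. Note that pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0.

```python
import pandas as pd

# Hypothetical five-row sample, values taken from the emp.head() preview
demo = pd.DataFrame({
    "Age": [29, 23, 36, 30, 29],
    "YearsAtCompany": [3, 1, 7, 10, 3],
})

# One row per column, one column per statistic
stats_to_compute = ["mean", "median", "std", "var", "skew", "kurt"]
summary = demo.agg(stats_to_compute).T

print(summary.round(3))
```

Applied to the notebook's data, the call would be `numeric.agg(stats_to_compute).T`, assuming `numeric` is the numeric-only frame created earlier.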

# Conclusion

Project Task 1 is complete: we imported the dataset, converted the target variable values to 0 and 1, separated the features from the target values, split the columns into numeric and categorical datasets, and ran descriptive statistical analysis on the numeric features to gain initial insights into the dataset.