### Problem Statement

A large company, XYZ, employs around 4000 people at any given time. However, every year around 15% of its employees leave the company and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition (employees leaving, either on their own or because they were fired) is bad for the company, for the following reasons:

- The former employees' projects get delayed, which makes it difficult to meet timelines, resulting in a loss of reputation among consumers and partners.
- A sizeable department has to be maintained for the purpose of recruiting new talent.
- More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company.

Hence, the management has contracted an HR analytics firm to understand which factors they should focus on in order to curb attrition. In other words, they want to know what changes they should make to their workplace in order to get most of their employees to stay. They also want to know which of these variables is most important and needs to be addressed right away.

###### Goal of the case study

You are required to model the probability of attrition. The results will be used by the management to understand what changes they should make to their workplace in order to get most of their employees to stay.

###### Ignore Warnings

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

###### Import required libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

###### Read the required Data

In [3]:
basePath = 'D:\\MyCodes\\Python\\LetsUpgrade_Repo\\LU-Day-Wise\\Day-07\\Assignment\\'
emp_df = pd.read_csv( basePath + 'Employee_Attrition_Data.csv' )
emp_df.head()

| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeID | Gender | ... | NumCompaniesWorked | Over18 | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | No | Travel_Rarely | Sales | 6 | 2 | Life Sciences | 1 | 1 | Female | ... | 1.0 | Y | 11 | 8 | 0 | 1.0 | 6 | 1 | 0 | 0 |
| 1 | 31 | Yes | Travel_Frequently | Research & Development | 10 | 1 | Life Sciences | 1 | 2 | Female | ... | 0.0 | Y | 23 | 8 | 1 | 6.0 | 3 | 5 | 1 | 4 |
| 2 | 32 | No | Travel_Frequently | Research & Development | 17 | 4 | Other | 1 | 3 | Male | ... | 1.0 | Y | 15 | 8 | 3 | 5.0 | 2 | 5 | 0 | 3 |
| 3 | 38 | No | Non-Travel | Research & Development | 2 | 5 | Life Sciences | 1 | 4 | Male | ... | 3.0 | Y | 11 | 8 | 3 | 13.0 | 5 | 8 | 7 | 5 |
| 4 | 32 | No | Travel_Rarely | Research & Development | 10 | 1 | Medical | 1 | 5 | Male | ... | 4.0 | Y | 12 | 8 | 2 | 9.0 | 2 | 6 | 0 | 4 |
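
The absolute Windows path in the cell above ties the notebook to one machine. A minimal sketch of a more portable way to locate the file, assuming the CSV is kept in a folder called `data` next to the notebook. The folder name and the `load_employee_data` helper are hypothetical and not part of the original notebook; only the file name comes from it.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: adjust to wherever the CSV actually lives.
DATA_DIR = Path("data")
# File name as used in the notebook.
CSV_NAME = "Employee_Attrition_Data.csv"


def load_employee_data(data_dir=DATA_DIR, csv_name=CSV_NAME):
    """Read the attrition CSV, failing with a clear message if it is missing."""
    csv_path = Path(data_dir) / csv_name
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

With the file in place, `emp_df = load_employee_data()` would stand in for the two lines that build `basePath` and call `pd.read_csv`.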


###### Feature Engineering:
- Handle Missing/Duplicate Values
- Feature Selection
- Feature Scaling and Normalization

In [4]:
emp_df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EmployeeCount', 'EmployeeID', 'Gender',
       'JobLevel', 'JobRole', 'MaritalStatus', 'MonthlyIncome',
       'NumCompaniesWorked', 'Over18', 'PercentSalaryHike', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

###### Check and handle missing values

In [5]:
emp_df.isna().any()

Age                        False
Attrition                  False
BusinessTravel             False
Department                 False
DistanceFromHome           False
Education                  False
EducationField             False
EmployeeCount              False
EmployeeID                 False
Gender                     False
JobLevel                   False
JobRole                    False
MaritalStatus              False
MonthlyIncome              False
NumCompaniesWorked          True
Over18                     False
PercentSalaryHike          False
StandardHours              False
StockOptionLevel           False
TotalWorkingYears           True
TrainingTimesLastYear      False
YearsAtCompany             False
YearsSinceLastPromotion    False
YearsWithCurrManager       False
dtype: bool

In [6]:
emp_df.isnull().any()

Age                        False
Attrition                  False
BusinessTravel             False
Department                 False
DistanceFromHome           False
Education                  False
EducationField             False
EmployeeCount              False
EmployeeID                 False
Gender                     False
JobLevel                   False
JobRole                    False
MaritalStatus              False
MonthlyIncome              False
NumCompaniesWorked          True
Over18                     False
PercentSalaryHike          False
StandardHours              False
StockOptionLevel           False
TotalWorkingYears           True
TrainingTimesLastYear      False
YearsAtCompany             False
YearsSinceLastPromotion    False
YearsWithCurrManager       False
dtype: bool
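
`isna().any()` only says which columns contain gaps, not how many. A small sketch of how the counts and percentages could be obtained, shown on a made-up toy frame rather than on `emp_df` (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; two columns contain gaps.
toy = pd.DataFrame({
    "Age": [51, 31, 32, 38],
    "NumCompaniesWorked": [1.0, np.nan, 1.0, 3.0],
    "TotalWorkingYears": [1.0, 6.0, np.nan, np.nan],
})

missing_count = toy.isna().sum()       # number of missing cells per column
missing_pct = toy.isna().mean() * 100  # the same, as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Show only the columns that actually have gaps.
print(summary[summary["missing"] > 0])
```

Running the same two expressions on `emp_df` would show how much data is affected before deciding whether to drop or fill the rows.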

In [7]:
emp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   Department               4410 non-null   object 
 4   DistanceFromHome         4410 non-null   int64  
 5   Education                4410 non-null   int64  
 6   EducationField           4410 non-null   object 
 7   EmployeeCount            4410 non-null   int64  
 8   EmployeeID               4410 non-null   int64  
 9   Gender                   4410 non-null   object 
 10  JobLevel                 4410 non-null   int64  
 11  JobRole                  4410 non-null   object 
 12  MaritalStatus            4410 non-null   object 
 13  MonthlyIncome            4410 non-null   int64  
 14  NumCompaniesWorked      

###### Drop rows with missing values, as they are few in number (28 of 4410)

In [8]:
emp_df = emp_df.dropna(axis=0)
emp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4382 entries, 0 to 4408
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4382 non-null   int64  
 1   Attrition                4382 non-null   object 
 2   BusinessTravel           4382 non-null   object 
 3   Department               4382 non-null   object 
 4   DistanceFromHome         4382 non-null   int64  
 5   Education                4382 non-null   int64  
 6   EducationField           4382 non-null   object 
 7   EmployeeCount            4382 non-null   int64  
 8   EmployeeID               4382 non-null   int64  
 9   Gender                   4382 non-null   object 
 10  JobLevel                 4382 non-null   int64  
 11  JobRole                  4382 non-null   object 
 12  MaritalStatus            4382 non-null   object 
 13  MonthlyIncome            4382 non-null   int64  
 14  NumCompaniesWorked      
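
Dropping rows is the approach this notebook takes. For comparison, a sketch of one common alternative, filling each gap with the column median, on an invented toy frame. This is an illustration only and not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "NumCompaniesWorked": [1.0, np.nan, 1.0, 3.0, 4.0],
    "TotalWorkingYears": [1.0, 6.0, np.nan, 13.0, 9.0],
})

# Approach used in the notebook: drop every row that has a missing value.
dropped = toy.dropna(axis=0)
rows_lost = len(toy) - len(dropped)

# Alternative (not used in the notebook): fill each gap with the column median.
imputed = toy.fillna(toy.median(numeric_only=True))

print("Rows lost by dropping:", rows_lost)
print(imputed)
```

Dropping keeps the data untouched but shrinks it; imputing keeps every row at the cost of inserting estimated values.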

###### Check for duplicate rows (none present)

In [9]:
emp_df.duplicated().any()

False

###### Drop unnecessary columns: EmployeeID, EmployeeCount and Over18

In [10]:
print( 'Col List Before : ', len(emp_df.columns) )
emp_df = emp_df.drop( ['EmployeeID','EmployeeCount', 'Over18'], axis=1 )
print( 'Col List After : ', len(emp_df.columns) )

Col List Before :  24
Col List After :  21
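
Columns that hold a single value for every row cannot help explain attrition. A sketch of how such columns could be found programmatically instead of by inspection, using an invented toy frame. In the real data, `describe()` reports a standard deviation of 0 for `StandardHours`, so that column appears to be constant as well, although the notebook keeps it.

```python
import pandas as pd

# Toy frame with invented values that mirrors the kind of columns in the dataset.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [8, 8, 8, 8],
    "Age": [51, 31, 32, 31],
})

# A column with exactly one distinct value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)

reduced = toy.drop(columns=constant_cols)
```

Identifier columns such as `EmployeeID` are not caught by this check and still have to be dropped by name.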


###### Uni-variate Analysis

In [11]:
emp_df.describe()

| | Age | DistanceFromHome | Education | JobLevel | MonthlyIncome | NumCompaniesWorked | PercentSalaryHike | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 | 4382.0 |
| mean | 36.933364 | 9.198996 | 2.912369 | 2.063898 | 65061.702419 | 2.693291 | 15.210634 | 8.0 | 0.794614 | 11.290278 | 2.798266 | 7.010497 | 2.191693 | 4.126198 |
| std | 9.137272 | 8.105396 | 1.024728 | 1.106115 | 47142.310175 | 2.497832 | 3.663007 | 0.0 | 0.852397 | 7.785717 | 1.289402 | 6.129351 | 3.224994 | 3.569674 |
| min | 18.0 | 1.0 | 1.0 | 1.0 | 10090.0 | 0.0 | 11.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 2.0 | 2.0 | 1.0 | 29110.0 | 1.0 | 12.0 | 8.0 | 0.0 | 6.0 | 2.0 | 3.0 | 0.0 | 2.0 |
| 50% | 36.0 | 7.0 | 3.0 | 2.0 | 49190.0 | 2.0 | 14.0 | 8.0 | 1.0 | 10.0 | 3.0 | 5.0 | 1.0 | 3.0 |
| 75% | 43.0 | 14.0 | 4.0 | 3.0 | 83790.0 | 4.0 | 18.0 | 8.0 | 1.0 | 15.0 | 3.0 | 9.0 | 3.0 | 7.0 |
| max | 60.0 | 29.0 | 5.0 | 5.0 | 199990.0 | 9.0 | 25.0 | 8.0 | 3.0 | 40.0 | 6.0 | 40.0 | 15.0 | 17.0 |


In [12]:
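# numeric_only=True skips the text columns; pandas 2.x raises a TypeError without it
emp_df.mean(numeric_only=True)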

Age                           36.933364
DistanceFromHome               9.198996
Education                      2.912369
JobLevel                       2.063898
MonthlyIncome              65061.702419
NumCompaniesWorked             2.693291
PercentSalaryHike             15.210634
StandardHours                  8.000000
StockOptionLevel               0.794614
TotalWorkingYears             11.290278
TrainingTimesLastYear          2.798266
YearsAtCompany                 7.010497
YearsSinceLastPromotion        2.191693
YearsWithCurrManager           4.126198
dtype: float64

In [13]:
# Variance of the numeric columns only
emp_df.var(numeric_only=True)

Age                        8.348974e+01
DistanceFromHome           6.569744e+01
Education                  1.050068e+00
JobLevel                   1.223490e+00
MonthlyIncome              2.222397e+09
NumCompaniesWorked         6.239165e+00
PercentSalaryHike          1.341762e+01
StandardHours              0.000000e+00
StockOptionLevel           7.265814e-01
TotalWorkingYears          6.061739e+01
TrainingTimesLastYear      1.662558e+00
YearsAtCompany             3.756894e+01
YearsSinceLastPromotion    1.040059e+01
YearsWithCurrManager       1.274257e+01
dtype: float64

In [14]:
# Skewness of the numeric columns only
emp_df.skew(numeric_only=True)

Age                        0.413048
DistanceFromHome           0.955517
Education                 -0.288977
JobLevel                   1.021797
MonthlyIncome              1.367457
NumCompaniesWorked         1.029174
PercentSalaryHike          0.819510
StandardHours              0.000000
StockOptionLevel           0.967263
TotalWorkingYears          1.115419
TrainingTimesLastYear      0.551818
YearsAtCompany             1.764619
YearsSinceLastPromotion    1.980992
YearsWithCurrManager       0.834277
dtype: float64
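
Several columns are right-skewed (`MonthlyIncome`, `YearsAtCompany` and `YearsSinceLastPromotion` all have skewness above 1), and scaling is listed under feature engineering but not carried out. A minimal sketch of one possible treatment, a log transform followed by z-score standardisation. The five input values are only the min, quartiles and max reported by `describe()` for `MonthlyIncome`, used here as a stand-in for the full column, and the choice of `log1p` is an assumption rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Illustrative values only: min, 25%, 50%, 75% and max of MonthlyIncome from describe().
income = pd.Series([10090, 29110, 49190, 83790, 199990], name="MonthlyIncome")

# log1p compresses the long right tail.
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print("skew before:", skew_before, "after:", skew_after)

# Z-score standardisation: zero mean and unit (sample) standard deviation.
scaled = (log_income - log_income.mean()) / log_income.std()
```

On a real column the transform would be applied to `emp_df['MonthlyIncome']` directly, and the skewness re-checked afterwards.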

###### Hypothesis Testing

###### Hypothesis #1 - Employees younger than 33 have a higher attrition rate

In [30]:
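# Keep only the employees who left the company
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Number of leavers younger than 33
attrVal = (df.Age < 33).sum()

# Percentage of all leavers in this group (a share of leavers, not a per-group attrition rate)
print( 'Employees Left due to Age : ', round((attrVal/totalAttr)*100,3), '%' )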

Total Employees Left the Company :  705
Employees Left due to Age :  54.326 %
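
The figure above is the share of leavers who are under 33. Whether that group has a *higher attrition rate* depends on how many under-33 employees there are in total, so the rate has to be computed inside each group. A sketch of both quantities on an invented toy frame (the ages, outcomes and group labels are made up for illustration):

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "Age":       [25, 28, 30, 31, 29, 35, 40, 45, 50, 55],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No"],
})

left = toy["Attrition"].eq("Yes")

# Share of leavers who are under 33 (the quantity printed by the cell above).
share_of_leavers = (toy.loc[left, "Age"] < 33).mean() * 100

# Attrition rate inside each age group (the quantity the hypothesis is about).
age_group = (toy["Age"] < 33).map({True: "under 33", False: "33 and over"})
rate_by_group = left.astype(int).groupby(age_group).mean() * 100

print("Share of leavers under 33:", round(share_of_leavers, 3), "%")
print(rate_by_group)
```

The hypothesis is supported only if the rate for the group of interest is clearly higher than the rate for the rest.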


###### Hypothesis #2 - Employees whose DistanceFromHome is greater than 7 have a higher attrition rate

In [35]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with DistanceFromHome > 7, reported as a percentage of all leavers
attrVal = (df.DistanceFromHome > 7).sum()

print( 'Employees Left due to DistanceFromHome : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to DistanceFromHome :  47.801 %
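
The percentages in this section are descriptive. A formal hypothesis test would ask whether leaving is statistically independent of belonging to the group. A sketch of Pearson's chi-square test for a 2x2 table, written with the standard library only. The function name `chi_square_2x2` and the counts in the example are invented for illustration and are not results from this dataset.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table

        [[a, b],    rows: in the group / not in the group
         [c, d]]    cols: left / stayed

    Returns (statistic, p_value) for 1 degree of freedom,
    without continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: 40 of 100 employees in the group left, 20 of 100 outside it left.
stat, p = chi_square_2x2(40, 60, 20, 80)
print("chi-square:", round(stat, 3), "p-value:", round(p, 4))
```

A small p-value would suggest that the difference in attrition between the two groups is unlikely to be due to chance alone.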


###### Hypothesis #3 - Employees with more than 2 years at the company have a higher attrition rate

In [42]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with YearsAtCompany > 2, reported as a percentage of all leavers
attrVal = (df.YearsAtCompany > 2).sum()

print( 'Employees Left due to YearsAtCompany : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to YearsAtCompany :  56.879 %


###### Hypothesis #4 - Employees with a PercentSalaryHike below 15 have a higher attrition rate

In [32]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with PercentSalaryHike < 15, reported as a percentage of all leavers
attrVal = (df.PercentSalaryHike < 15).sum()

print( 'Employees Left due to PercentSalaryHike : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to PercentSalaryHike :  51.915 %


###### Hypothesis #5 - Employees with a MonthlyIncome below 65000 have a higher attrition rate

In [38]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with MonthlyIncome < 65000, reported as a percentage of all leavers
attrVal = (df.MonthlyIncome < 65000).sum()

print( 'Employees Left due to MonthlyIncome : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to MonthlyIncome :  69.504 %


###### Hypothesis #6 - Employees who have worked at more than 2 companies have a higher attrition rate

In [47]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with NumCompaniesWorked > 2, reported as a percentage of all leavers
attrVal = (df.NumCompaniesWorked > 2).sum()

print( 'Employees Left due to NumCompaniesWorked : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to NumCompaniesWorked :  41.986 %


###### Hypothesis #7 - Employees with a JobLevel below 2 have a higher attrition rate

In [40]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with JobLevel < 2, reported as a percentage of all leavers
attrVal = (df.JobLevel < 2).sum()

print( 'Employees Left due to JobLevel : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to JobLevel :  35.461 %


###### Hypothesis #8 - Employees with more than 5 years of total work experience have a higher attrition rate

In [50]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with TotalWorkingYears > 5, reported as a percentage of all leavers
attrVal = (df.TotalWorkingYears > 5).sum()

print( 'Employees Left due to TotalWorkingYears : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to TotalWorkingYears :  61.56 %


###### Hypothesis #9 - Employees whose YearsSinceLastPromotion is greater than 1 have a higher attrition rate

In [61]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with YearsSinceLastPromotion > 1, reported as a percentage of all leavers
attrVal = (df.YearsSinceLastPromotion > 1).sum()

print( 'Employees Left due to YearsSinceLastPromotion : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to YearsSinceLastPromotion :  33.191 %


###### Hypothesis #10 - Employees with TrainingTimesLastYear greater than 2 have a higher attrition rate

In [62]:
# Leavers only
df = emp_df[emp_df['Attrition'] == 'Yes']

totalAttr = len(df)
print( 'Total Employees Left the Company : ', totalAttr )

# Leavers with TrainingTimesLastYear > 2, reported as a percentage of all leavers
attrVal = (df.TrainingTimesLastYear > 2).sum()

print( 'Employees Left due to TrainingTimesLastYear : ', round((attrVal/totalAttr)*100,3), '%' )

Total Employees Left the Company :  705
Employees Left due to TrainingTimesLastYear :  52.199 %
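
The ten cells in this section repeat the same steps with a different column and threshold. A sketch of how they could be folded into one helper and a table of checks. The helper name `leaver_share`, the `checks` dictionary and the toy data are hypothetical; only the column names and thresholds come from the notebook.

```python
import pandas as pd


def leaver_share(df, column, condition):
    """Percentage of leavers (Attrition == 'Yes') whose `column` satisfies `condition`."""
    leavers = df[df["Attrition"] == "Yes"]
    if leavers.empty:
        raise ValueError("no employees with Attrition == 'Yes'")
    return round(condition(leavers[column]).mean() * 100, 3)


# Label -> (column, condition); two of the ten hypotheses shown as examples.
checks = {
    "Age < 33": ("Age", lambda s: s < 33),
    "DistanceFromHome > 7": ("DistanceFromHome", lambda s: s > 7),
}

# Toy frame with invented values.
toy = pd.DataFrame({
    "Age": [25, 40, 30, 50],
    "DistanceFromHome": [10, 2, 3, 20],
    "Attrition": ["Yes", "Yes", "No", "Yes"],
})

results = {label: leaver_share(toy, col, cond) for label, (col, cond) in checks.items()}
for label, pct in results.items():
    print(label, ":", pct, "%")
```

Adding the remaining hypotheses would only require extra entries in `checks`, and passing `emp_df` instead of `toy` would reproduce the percentages printed by the individual cells.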


******

### Steps to be taken (Hypothesis and Suggested Solution)

- Hypothesis #1 - Employees younger than 33 have a higher attrition rate
    - **Solution - Offer more activities, workshops, pay hikes or business travel to keep the younger employees motivated to stay**
- Hypothesis #2 - Employees whose DistanceFromHome is greater than 7 have a higher attrition rate
    - **Solution - Encourage work-from-home policies for employees who live far away, to reduce travel time and increase productivity**
- Hypothesis #3 - Employees with more than 2 years at the company have a higher attrition rate
    - **Solution - Offer salary hikes, promotions or training on new skills wherever possible**
- Hypothesis #4 - Employees with a PercentSalaryHike below 15 have a higher attrition rate
    - **Solution - Raise salaries wherever possible**
- Hypothesis #5 - Employees with a MonthlyIncome below 65000 have a higher attrition rate
    - **Solution - Raise salaries wherever possible**
- Hypothesis #6 - Employees who have worked at more than 2 companies have a higher attrition rate
    - **Solution - Where possible, try to hire employees who have stayed loyal to their previous companies**
- Hypothesis #7 - Employees with a JobLevel below 2 have a higher attrition rate
    - **Solution - Promote employees wherever possible**
- Hypothesis #8 - Employees with more than 5 years of total work experience have a higher attrition rate
    - **Solution - Offer management training, pay hikes or business travel to keep the experienced employees motivated to stay**
- Hypothesis #9 - Employees whose YearsSinceLastPromotion is greater than 1 have a higher attrition rate
    - **Solution - Promote employees wherever possible**
- Hypothesis #10 - Employees with TrainingTimesLastYear greater than 2 have a higher attrition rate
    - **Solution - Train employees regularly, both on new skills and on the tech and management skills the company requires**

*****