**EMPLOYEE ATTRITION PREDICTION**

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

* Education
> 1 : 'Below College',
> 2 : 'College',
> 3 : 'Bachelor',
> 4 : 'Master',
> 5 : 'Doctor'

* EnvironmentSatisfaction
> 1 : 'Low',
> 2 : 'Medium',
> 3 : 'High',
> 4 : 'Very High'

* JobInvolvement
> 1 : 'Low',
> 2 : 'Medium',
> 3 : 'High',
> 4 : 'Very High'

* JobSatisfaction
> 1 : 'Low',
> 2 : 'Medium',
> 3 : 'High',
> 4 : 'Very High'

* PerformanceRating
> 1 : 'Low',
> 2 : 'Good',
> 3 : 'Excellent',
> 4 : 'Outstanding'

* RelationshipSatisfaction
> 1 : 'Low',
> 2 : 'Medium',
> 3 : 'High',
> 4 : 'Very High'

* WorkLifeBalance
> 1 : 'Bad',
> 2 : 'Good',
> 3 : 'Better',
> 4 : 'Best'
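
The integer codes listed above can be turned back into readable labels before plotting. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data; the dictionary names are illustrative and only two of the seven maps are shown.

```python
import pandas as pd

# Label maps copied from the data dictionary above (two of the seven shown).
EDUCATION_LABELS = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}
WORK_LIFE_LABELS = {1: "Bad", 2: "Good", 3: "Better", 4: "Best"}

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"Education": [3, 1, 5], "WorkLifeBalance": [4, 2, 1]})

# Series.map replaces each code with its label and leaves the original frame untouched.
decoded = toy.assign(
    Education=toy["Education"].map(EDUCATION_LABELS),
    WorkLifeBalance=toy["WorkLifeBalance"].map(WORK_LIFE_LABELS),
)
print(decoded)
```

Keeping the numeric columns for modeling and decoding only for plots preserves the ordinal order of the codes.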

# Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as w

w.filterwarnings("ignore")

In [3]:
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# Reading Train Data Set

In [4]:
df = pd.read_csv("train.csv")
df.drop("id", axis=1, inplace=True)
original_df = pd.read_csv("original_data.csv")
original_df.drop("EmployeeNumber", axis=1, inplace=True)

In [283]:
df.head()


Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition
0,36,Travel_Frequently,599,Research & Development,24,3,Medical,1,4,Male,42,3,1,Laboratory Technician,4,Married,2596,5099,1,Y,Yes,13,3,2,80,1,10,2,3,10,0,7,8,0
1,35,Travel_Rarely,921,Sales,8,3,Other,1,1,Male,46,3,1,Sales Representative,1,Married,2899,10778,1,Y,No,17,3,4,80,1,4,3,3,4,2,0,3,0
2,32,Travel_Rarely,718,Sales,26,3,Marketing,1,3,Male,80,3,2,Sales Executive,4,Divorced,4627,16495,0,Y,No,17,3,4,80,2,4,3,3,3,2,1,2,0
3,38,Travel_Rarely,1488,Research & Development,2,3,Medical,1,3,Female,40,3,2,Healthcare Representative,1,Married,5347,13384,3,Y,No,14,3,3,80,0,15,1,1,6,0,0,2,0
4,50,Travel_Rarely,1017,Research & Development,5,4,Medical,1,2,Female,37,3,5,Manager,1,Single,19033,19805,1,Y,Yes,13,3,3,80,0,31,0,3,31,14,4,10,1


In [284]:
original_df.head()


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2


In [None]:
df.describe()


In [None]:
original_df.describe()


In [None]:
df.info()


In [None]:
original_df.info()


In [None]:
import missingno as msn

msn.matrix(df)

In [None]:
msn.matrix(original_df)


* There are no missing values in either the original or the generated dataset.
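
The missingno matrix is a visual check; the same conclusion can be confirmed numerically. A minimal sketch, assuming a toy frame with deliberately injected gaps (the real frames should report zero everywhere):

```python
import numpy as np
import pandas as pd

# Toy frame with two injected gaps; the values are made up for illustration.
toy = pd.DataFrame(
    {"Age": [36, np.nan, 32], "Department": ["Sales", "Sales", None]}
)

# isna() flags missing cells; summing per column counts them.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing cells:", total_missing)
```

On the real data, `df.isna().sum().sum()` returning 0 would match the observation above.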

# EDA

## 1. Age

In [None]:
sns.histplot(data=original_df, x="Age", kde=True)

### Most of the employees are between 30 and 40 years of age, and the distribution is positively skewed.

In [None]:
print(original_df.Age.min())
print(original_df.Age.max())


* Everyone is above 18, so we can drop the **Over18** feature from our data set in the cleaning phase.

## 2. Business Travel

In [None]:
sns.countplot(data=original_df, x="BusinessTravel")

> ### Most of the employees travel rarely

## BusinessTravel Based on Attrition Rate

In [None]:
sns.countplot(data=original_df, hue="BusinessTravel", x="Attrition")

* People who **do not travel or travel rarely** have a very low attrition rate.
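
A count plot shows raw counts, so groups of different sizes are hard to compare. One way to read off an actual rate is to compute the share of leavers within each category. This is a minimal sketch: `attrition_rate_by` is a hypothetical helper, the toy values are made up, and it assumes the target is stored as "Yes"/"No" as in `original_df`.

```python
import pandas as pd


def attrition_rate_by(frame, column, target="Attrition"):
    """Return the share of leavers within each level of `column`.

    Assumes the target column holds "Yes"/"No" strings.
    """
    left = frame[target].eq("Yes")
    # The mean of a boolean Series is the fraction of True values.
    return left.groupby(frame[column]).mean().sort_values(ascending=False)


# Toy data standing in for original_df; the values are made up.
toy = pd.DataFrame(
    {
        "BusinessTravel": ["Travel_Rarely"] * 4
        + ["Travel_Frequently"] * 2
        + ["Non-Travel"] * 2,
        "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    }
)

rates = attrition_rate_by(toy, "BusinessTravel")
print(rates)
```

The same helper could be applied to any categorical column explored in this section, for example `attrition_rate_by(original_df, "Department")`.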

## 3. DailyRate

In [None]:
sns.histplot(data=original_df, x="DailyRate", kde=True)

> ### Daily Rate of employees is normally distributed. 

## Attrition vs DailyRate

In [None]:
sns.histplot(data=original_df, x="DailyRate", hue="Attrition", kde=True)

> ### Attrition looks the same across all daily rates, so daily rate does not appear to affect the attrition rate.
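
Overlaid histograms can be backed up with per-group summary statistics. A minimal sketch, assuming made-up values in a toy frame; on the real data, similar means and medians for both groups would support the observation above.

```python
import pandas as pd

# Toy data standing in for original_df; the values are made up.
toy = pd.DataFrame(
    {
        "DailyRate": [200, 800, 1400, 300, 900, 1300],
        "Attrition": ["No", "No", "No", "Yes", "Yes", "Yes"],
    }
)

# One row per attrition group, one column per statistic.
summary = toy.groupby("Attrition")["DailyRate"].agg(["mean", "median"])
print(summary)
```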

## 4. Department

In [None]:
sns.countplot(data=original_df, x="Department")

> ### Compared to the other departments, the HR department has fewer employees.

## Department vs Attrition

In [None]:
sns.countplot(data=original_df, hue="Department", x="Attrition")

> ### The Sales department has a slightly higher attrition rate than the other departments.

## 5. Distance From Home (in Km)

In [None]:
sns.histplot(data=original_df, x="DistanceFromHome", kde=True)

> ### Most of the employees live near the company.

## Distance from home vs Attrition

In [None]:
sns.displot(data=original_df, x="DistanceFromHome", hue="Attrition", kde=True)

> ### We can see that distance from home is not a factor for attrition.

## 6. Education

In [None]:
sns.countplot(data=original_df, x="Education")
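# Relabel the five bars (positions 0-4); plt.xlabel would set the axis title instead.
plt.xticks(
    ticks=range(5),
    labels=["Below College", "College", "Bachelor", "Master", "Doctor"],
)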

> ### Most of the employees hold a College, Bachelor, or Master degree.

## Education vs Attrition

In [None]:
sns.countplot(data=original_df, hue="Education", x="Attrition")

> ### We can see employees with Below College education are more likely to leave the company as compared to other employees.

## 7. Education Field (Employee's field of study)

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(
    data=original_df,
    x="EducationField",
)

> ### Most of the employees are from Life Sciences and Medical field

## Education Field vs Attrition

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=original_df, hue="EducationField", x="Attrition")

> ### There is a slightly higher chance of employees leaving the company if they are from a Marketing or Technical background.

## 8. EmployeeCount

In [None]:
sns.countplot(data=original_df, x="EmployeeCount")

* ### We will drop this column/feature, as it contains only one value and is of no use for prediction.
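
Single-valued columns like this one can also be found programmatically rather than one plot at a time. A minimal sketch, assuming a toy frame with made-up rows:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame(
    {
        "Age": [36, 35, 32],
        "EmployeeCount": [1, 1, 1],
        "Over18": ["Y", "Y", "Y"],
        "StandardHours": [80, 80, 80],
    }
)

# A column with exactly one distinct value carries no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```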

## 9. Environment Satisfaction (1-Low, 2-Medium, 3-High, 4-Very High)

In [None]:
sns.countplot(data=original_df, x="EnvironmentSatisfaction")

> ### So most of the employees are satisfied with the environment.

In [None]:
sns.countplot(data=original_df, hue="EnvironmentSatisfaction", x="Attrition")

> ### Employees with low Environment Satisfaction are more likely to leave the company, so Environment Satisfaction is a factor for Attrition.

## 10. Gender

In [None]:
# Compute the counts once and let matplotlib format the percentages.
gender_counts = original_df.Gender.value_counts()
plt.pie(
    gender_counts,
    autopct="%.2f%%",
    explode=[0.2, 0],
    shadow=True,
    startangle=90,
)
# Take the legend labels from the data instead of assuming their order.
plt.legend(list(gender_counts.index))

> ### 60% of the employees working are male.

## Gender vs Attrition

In [None]:
sns.countplot(data=original_df, hue="Gender", x="Attrition")

> ### So Gender has no effect on Attrition.

## 11. HourlyRate

In [None]:
sns.histplot(data=original_df, x="HourlyRate", kde=True)

> ### We can see that the Hourly Rate is normally distributed.

## Hourly Rate vs Attrition

In [None]:
sns.displot(data=original_df, x="HourlyRate", hue="Attrition", kde=True)

> ### Hourly Rate doesn't impact Attrition.

## 12. Job Involvement (1-Low, 2-Medium, 3-High, 4-Very High)

In [None]:
sns.countplot(data=original_df, x="JobInvolvement")

> ### Most of the employees have High JobInvolvement(3)

## Job Involvement vs Attrition

In [None]:
sns.countplot(data=original_df, hue="JobInvolvement", x="Attrition")

> ### We can see that the employees with JobInvolvement 1 are more likely to leave the company.

## 13. Job Level

In [None]:
sns.countplot(data=original_df, x="JobLevel")

> ### Most of the employees working in the company are junior level employees.

## Job Level vs Attrition

In [None]:
sns.countplot(data=original_df, hue="JobLevel", x="Attrition")

> ### Employees with JobLevel 1 are more likely to leave the company.

## 14. Job Role

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=original_df, x="JobRole")
plt.xticks(rotation=90)

> ### Most of the employees are in the Sales Executive, Research Scientist, and Laboratory Technician roles.

## Job Role vs Attrition

In [None]:
sns.countplot(data=original_df, hue="JobRole", x="Attrition")

> ### Sales and Human Resources roles have the highest attrition rates.

## 15. Job Satisfaction (1:Low, 2:Medium, 3:High, 4:Very High)

In [None]:
sns.countplot(data=original_df, x="JobSatisfaction")

> ### Most of the employees are satisfied with their job.

## JobSatisfaction vs Attrition

In [None]:
sns.countplot(data=original_df, x="JobSatisfaction", hue="Attrition")

> ### Employees with low JobSatisfaction are more likely to leave the company.

## 16. Marital Status

In [None]:
sns.countplot(data=original_df, x="MaritalStatus")

> ### If divorced employees are counted as single, since they are no longer married, single employees form the largest group in the company.

## Marital Status vs Attrition

In [None]:
sns.countplot(data=original_df, x="MaritalStatus", hue="Attrition")

> ### Single people are more likely to leave the company; they are often early in their careers, so they can switch jobs easily and have fewer responsibilities.

## 17. Monthly Income

In [None]:
sns.histplot(data=original_df, x="MonthlyIncome", kde=True)

> ### MonthlyIncome data is positively skewed.
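
Skewness read off a histogram can be checked with a number. A minimal sketch using `Series.skew()` on made-up incomes: a positive value indicates a long right tail, and a symmetric sample gives a value near zero.

```python
import pandas as pd

# Made-up incomes with one large value creating a long right tail.
incomes = pd.Series([2000, 2500, 3000, 3500, 19000], name="MonthlyIncome")
# A symmetric sample for comparison.
symmetric = pd.Series([1, 2, 3, 4, 5])

income_skew = incomes.skew()
symmetric_skew = symmetric.skew()

print("Skew of incomes:", round(income_skew, 2))
print("Skew of symmetric sample:", round(symmetric_skew, 2))
```

On the real data, `original_df["MonthlyIncome"].skew()` would give the corresponding figure.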

## MonthlyIncome vs Attrition

In [None]:
sns.histplot(data=original_df, x="MonthlyIncome", hue="Attrition", kde=True)

> ### Low Monthly Income is a factor for Attrition.

## 18. Monthly Rate

In [None]:
sns.histplot(data=original_df, x="MonthlyRate")

## MonthlyRate vs Attrition

In [None]:
sns.histplot(data=original_df, x="MonthlyRate", hue="Attrition")

> ### Attrition is same for all monthly rates.

## 19. Number of Companies Worked

In [None]:
sns.countplot(data=original_df, x="NumCompaniesWorked")

> ### Most of the employees have worked in 1 company.

## NumCompaniesWorked vs Attrition

In [None]:
sns.countplot(data=original_df, x="NumCompaniesWorked", hue="Attrition")

> ### Employees who have worked at more companies are more likely to leave. Employees who have worked at only one company also account for a large number of leavers.

## 20. Over Time

In [None]:
sns.countplot(data=original_df, x="OverTime")

> ### Few employees are working overtime.

## OverTime vs Attrition

In [None]:
sns.countplot(data=original_df, x="OverTime", hue="Attrition")

> ### Employees who work overtime are more likely to leave the company.

## 21. Stock Option Level

In [None]:
sns.countplot(data=original_df, x="StockOptionLevel")

> ### Most of the employees have stock option level 0 or 1.

## StockOptionLevel vs Attrition

In [None]:
sns.countplot(data=original_df, x="StockOptionLevel", hue="Attrition")

> ### Stock Option Level has no effect on Attrition.

## 22. Total Working Years

In [None]:
sns.histplot(data=original_df, x="TotalWorkingYears", kde=True)

> ### Most of the employees have 5 to 10 years of experience.

## TotalWorkingYears vs Attrition

In [None]:
sns.histplot(data=original_df, x="TotalWorkingYears", hue="Attrition", kde=True)

> ### Employees who have worked for less than 10 years are more likely to leave the company.

## 23. Training Times Last Year

In [None]:
sns.countplot(data=original_df, x="TrainingTimesLastYear")

> ### Most of the employees attended 2 or 3 trainings last year.

## TrainingTimesLastYear vs Attrition

In [None]:
sns.countplot(data=original_df, x="TrainingTimesLastYear", hue="Attrition")

> ### Employees given more training are less likely to leave the company.

## 24. Work Life Balance

In [None]:
sns.countplot(data=original_df, x="WorkLifeBalance")

> ### Most of the employees have a work life balance of 3.

## WorkLifeBalance vs Attrition

In [None]:
sns.countplot(data=original_df, x="WorkLifeBalance", hue="Attrition")

> ### Work Life Balance affects the attrition rate: a poorer work life balance leads to a higher attrition rate.

## 25. Years At Company

In [None]:
sns.histplot(data=original_df, x="YearsAtCompany", kde=True)

> ### Most of the employees have been working in the company for 0 to 5 years.

## YearsAtCompany vs Attrition

In [None]:
sns.histplot(data=original_df, x="YearsAtCompany", hue="Attrition", kde=True)

> ### The fewer the YearsAtCompany, the higher the chance of Attrition.

## 26. YearsInCurrentRole

In [None]:
sns.countplot(data=original_df, x="YearsInCurrentRole")

> ### Most of the employees have been in the same role for 0, 2, or 7 years.

## YearsInCurrentRole vs Attrition

In [None]:
sns.countplot(data=original_df, x="YearsInCurrentRole", hue="Attrition")

> ### If an employee stays in the same role for more years, there is a high chance of attrition.

## 27. YearsSinceLastPromotion

In [None]:
sns.countplot(data=original_df, x="YearsSinceLastPromotion")

> ### Most of the employees have 0, 1, 2 or 7 years since last promotion.

## YearsSinceLastPromotion vs Attrition

In [None]:
sns.countplot(data=original_df, x="YearsSinceLastPromotion", hue="Attrition")

> ### If an employee has not been promoted in the last 2 years, they are more likely to leave the company.

## 28. YearsWithCurrManager

In [None]:
sns.countplot(data=original_df, x="YearsWithCurrManager")

> ### Most of the employees are with the same manager for 0, 2, 3, 7, 8 years.

## YearsWithCurrManager vs Attrition

In [None]:
sns.countplot(data=original_df, x="YearsWithCurrManager", hue="Attrition")

> ### Years with current manager doesn't have any impact on Attrition.

## 29. Attrition

In [None]:
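# The hard-coded labels (1477 and 200 out of 1677) describe the generated train set,
# so plot df rather than original_df and let matplotlib compute the percentages.
attrition_counts = df.Attrition.value_counts()
plt.pie(
    attrition_counts,
    autopct="%.2f%%",
    explode=[0.2, 0],
    shadow=True,
    startangle=90,
)
plt.legend(["Not Churn", "Churn"])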

> ### About 88% of the employees in the generated train set did not churn and about 12% churned.
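
The percentages can be derived from the data instead of typed in by hand. A minimal sketch, assuming the counts implied by the hard-coded labels in the cell above (1477 stayers and 200 leavers out of 1677 rows); the Series is rebuilt here only for illustration.

```python
import pandas as pd

# Counts implied by the labels above: 1477 stayed (0), 200 left (1).
attrition = pd.Series([0] * 1477 + [1] * 200, name="Attrition")

# normalize=True returns fractions instead of raw counts.
shares = (attrition.value_counts(normalize=True) * 100).round(2)
print(shares)
```

A split this uneven is worth keeping in mind when choosing an evaluation metric, since plain accuracy can look high even for a model that never predicts attrition.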

# DATA MUNGING

### Making a function to clean the data efficiently and making it ready for modeling.

In [5]:
# Change the position of Attrition column to the last in original_df

original_df = original_df[
    [
        "Age",
        "BusinessTravel",
        "DailyRate",
        "Department",
        "DistanceFromHome",
        "Education",
        "EducationField",
        "EmployeeCount",
        "EnvironmentSatisfaction",
        "Gender",
        "HourlyRate",
        "JobInvolvement",
        "JobLevel",
        "JobRole",
        "JobSatisfaction",
        "MaritalStatus",
        "MonthlyIncome",
        "MonthlyRate",
        "NumCompaniesWorked",
        "Over18",
        "OverTime",
        "PercentSalaryHike",
        "PerformanceRating",
        "RelationshipSatisfaction",
        "StandardHours",
        "StockOptionLevel",
        "TotalWorkingYears",
        "TrainingTimesLastYear",
        "WorkLifeBalance",
        "YearsAtCompany",
        "YearsInCurrentRole",
        "YearsSinceLastPromotion",
        "YearsWithCurrManager",
        "Attrition",
    ]
]

In [6]:
# Concatenate the two dataframes row-wise (axis=0 stacks them) and then remove the duplicate rows.
print(df.shape)
print(original_df.shape)
df = pd.concat([df, original_df], axis=0)
print(df.shape)

(1677, 34)
(1470, 34)
(3147, 34)


In [7]:
df.drop_duplicates(inplace=True)
print(df.shape)


(3147, 34)
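
The shape is unchanged, so no duplicate rows were found. The following is a minimal sketch of the stack-then-deduplicate step on made-up frames. `ignore_index=True` is an addition not used in the cell above; it gives the stacked frame a fresh 0..n-1 index.

```python
import pandas as pd

# Two made-up frames with identical columns; one row appears in both.
a = pd.DataFrame({"Age": [36, 35], "Attrition": [0, 1]})
b = pd.DataFrame({"Age": [35, 41], "Attrition": [1, 0]})

# axis=0 stacks rows; ignore_index avoids repeated index labels.
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# A row is a duplicate only if every column matches an earlier row.
deduped = stacked.drop_duplicates()

print(stacked.shape, deduped.shape)
```

Because rows are compared on every column, a row whose Attrition is stored as "Yes" can never match a row where it is stored as 1, so the order of encoding and deduplication matters.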


In [8]:
num_cols = [
    "Age",
    "DailyRate",
    "DistanceFromHome",
    "HourlyRate",
    "MonthlyIncome",
    "MonthlyRate",
    "PercentSalaryHike",
    "TotalWorkingYears",
    "YearsAtCompany",
    "YearsInCurrentRole",
    "YearsSinceLastPromotion",
    "YearsWithCurrManager",
]

In [9]:
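def cleandata(df):
    # Drop the constant columns; this returns a new frame, so the caller's df is not modified.
    df = df.drop(["EmployeeCount", "Over18", "StandardHours"], axis=1)
    # Every column that is neither numeric nor the target is treated as categorical.
    cat_cols = [c for c in df.columns if c not in num_cols and c != "Attrition"]
    # The original data stores Attrition as "Yes"/"No"; the train data already uses 1/0.
    df["Attrition"] = df["Attrition"].replace({"Yes": 1, "No": 0}).astype(int)
    # One-hot encode, dropping the first level of each column to avoid redundancy.
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    return df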
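
To make the encoding step concrete, here is a minimal sketch of `pd.get_dummies` with `drop_first=True` on a made-up frame. Each categorical column with k levels becomes k-1 indicator columns, and the dropped (alphabetically first) level is represented by all of that column's indicators being zero.

```python
import pandas as pd

# Made-up rows; Age stays numeric, the other two columns are one-hot encoded.
toy = pd.DataFrame(
    {
        "Age": [36, 35, 32],
        "OverTime": ["Yes", "No", "No"],
        "Department": ["Sales", "Research & Development", "Human Resources"],
    }
)

encoded = pd.get_dummies(toy, columns=["OverTime", "Department"], drop_first=True)
print(list(encoded.columns))
```

Here "No" and "Human Resources" are the dropped baseline levels.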

In [10]:
cleaneddf = cleandata(df)


In [11]:
cleaneddf.to_csv("cleantrain.csv", index=False)