# Predicting Employee Attrition in Recession

# Overview of Problem

As the COVID-19 keeps unleashing its havoc, the world continues to get pushed into the crisis of the great economic recession, more and more companies start to cut down their underperforming employees. Companies firing hundreds and thousands of Employees is a typical headline today. Cutting down employees or reducing an employee salary is a tough decision to take. It needs to be taken with utmost care as imprecision in the identification of employees whose performance is attriting may lead to sabotaging of both employees' career and the company's reputation in the market.

## Aim of The Competition

To predict Employee Attrition by the given data about his/her past history.

## Acknowledgements

We thank IBM for providing us with the dataset.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Model Selection and Evaluation
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn import preprocessing

# Machine Learning Models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Displaying Plots Inline
%matplotlib inline

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')


## Importing Training Data

In [2]:
# Reading the CSV file into a DataFrame
df = pd.read_csv("train.csv", index_col=0)

# Displaying the first few rows of the DataFrame to get an overview of the data
(df.head())


Unnamed: 0_level_0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,30,0,Non-Travel,Research & Development,2,3,Medical,571,3,Female,...,3,0,12,2,11,7,6,7,4,1
2,36,0,Travel_Rarely,Research & Development,12,4,Life Sciences,1614,3,Female,...,3,2,7,2,3,2,1,1,2,1
3,55,1,Travel_Rarely,Sales,2,1,Medical,842,3,Male,...,3,0,12,3,9,7,7,3,5,1
4,39,0,Travel_Rarely,Research & Development,24,1,Life Sciences,2014,1,Male,...,3,0,18,2,7,7,1,7,4,1
5,37,0,Travel_Rarely,Research & Development,3,3,Other,689,3,Male,...,3,1,10,2,10,7,7,8,1,1


##  Converting Categorical Data to Numerical Data

In [3]:

# Mapping BusinessTravel categories to numerical values
df.BusinessTravel[df.BusinessTravel == 'Non-Travel'] = 0
df.BusinessTravel[df.BusinessTravel == 'Travel_Rarely'] = 1
df.BusinessTravel[df.BusinessTravel == 'Travel_Frequently'] = 2

# Mapping Department categories to numerical values
df.Department[df.Department == 'Research & Development'] = 0
df.Department[df.Department == 'Sales'] = 1
df.Department[df.Department == 'Human Resources'] = 2

# Mapping EducationField categories to numerical values
df.EducationField[df.EducationField == 'Medical'] = 0
df.EducationField[df.EducationField == 'Life Sciences'] = 1
df.EducationField[df.EducationField == 'Other'] = 2
df.EducationField[df.EducationField == 'Marketing'] = 3
df.EducationField[df.EducationField == 'Technical Degree'] = 4
df.EducationField[df.EducationField == 'Human Resources'] = 5

# Mapping MaritalStatus categories to numerical values
df.MaritalStatus[df.MaritalStatus == 'Single'] = 0
df.MaritalStatus[df.MaritalStatus == 'Married'] = 1
df.MaritalStatus[df.MaritalStatus == 'Divorced'] = 2

# Mapping Gender categories to numerical values
df.Gender[df.Gender == 'Male'] = 0
df.Gender[df.Gender == 'Female'] = 1

# Mapping OverTime categories to numerical values
df.OverTime[df.OverTime == 'No'] = 0
df.OverTime[df.OverTime == 'Yes'] = 1

# Displaying the DataFrame after converting categorical data to numerical data
(df.head())


Unnamed: 0_level_0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,30,0,0,0,2,3,0,571,3,1,...,3,0,12,2,11,7,6,7,4,1
2,36,0,1,0,12,4,1,1614,3,1,...,3,2,7,2,3,2,1,1,2,1
3,55,1,1,1,2,1,0,842,3,0,...,3,0,12,3,9,7,7,3,5,1
4,39,0,1,0,24,1,1,2014,1,0,...,3,0,18,2,7,7,1,7,4,1
5,37,0,1,0,3,3,2,689,3,0,...,3,1,10,2,10,7,7,8,1,1


In [4]:
# Checking the Number of Rows and Columns Before Dropping 'JobRole' Column
print("Number of rows and columns before dropping 'JobRole' column:", df.shape)

# Dropping the 'JobRole' Column
df = df.drop(['JobRole'], axis=1)

# Checking the Number of Rows and Columns After Dropping 'JobRole' Column
print("Number of rows and columns after dropping 'JobRole' column:", df.shape)


Number of rows and columns before dropping 'JobRole' column: (1628, 28)
Number of rows and columns after dropping 'JobRole' column: (1628, 27)


## Checking and Handling Duplicate or Multiple Entries

In [5]:

# Checking for duplicate entries based on 'EmployeeNumber' and removing them
df.drop_duplicates(subset='EmployeeNumber', inplace=True)

# Displaying the Number of Rows and Columns in the New Data
print("Number of rows and columns in the new data:", df.shape)


Number of rows and columns in the new data: (1000, 27)


In [6]:
df

Unnamed: 0_level_0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,30,0,0,0,2,3,0,571,3,1,...,3,0,12,2,11,7,6,7,4,1
2,36,0,1,0,12,4,1,1614,3,1,...,3,2,7,2,3,2,1,1,2,1
3,55,1,1,1,2,1,0,842,3,0,...,3,0,12,3,9,7,7,3,5,1
4,39,0,1,0,24,1,1,2014,1,0,...,3,0,18,2,7,7,1,7,4,1
5,37,0,1,0,3,3,2,689,3,0,...,3,1,10,2,10,7,7,8,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,36,0,0,1,10,4,0,592,2,0,...,3,0,10,3,10,3,9,7,4,1
997,40,0,1,0,16,3,1,1641,3,1,...,3,0,18,2,4,2,3,3,2,1
998,46,1,1,1,9,2,0,118,3,0,...,3,0,9,3,9,8,4,7,4,1
999,30,0,1,0,2,3,0,833,3,1,...,4,0,12,4,0,0,0,0,5,1


In [7]:
y = df['Attrition']
X = df.drop('Attrition', axis=1)

In [8]:
### 

In [9]:
from sklearn.preprocessing import StandardScaler

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=5000)

# Initializing the StandardScaler
scaler = StandardScaler()

# Fit and transform the scaler on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)


# Rough Models Evaluation

## RandomForestClassifier

In [10]:
# RandomForestClassifier with scaled data 
from sklearn.ensemble import RandomForestClassifier
xg = RandomForestClassifier()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.86


In [11]:
# RandomForestClassifier without scaled data 
from sklearn.ensemble import RandomForestClassifier
xg = RandomForestClassifier()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)

print(metrics.accuracy_score(y_test, y_pred_class))

0.87


## DecisionTreeClassifier

In [12]:
# RandomForestClassifier with scaled data 
xg = DecisionTreeClassifier()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)



print(metrics.accuracy_score(y_test, y_pred_class))

0.81


## AdaBoostClassifier

In [13]:
# RandomForestClassifier with scaled data 
xg = AdaBoostClassifier()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)



print(metrics.accuracy_score(y_test, y_pred_class))

0.88


## LogisticRegression

In [14]:
# LogisticRegression with scaled data 
from sklearn.linear_model import LogisticRegression
xg = LogisticRegression()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)

print("LogisticRegression Accuracy:", metrics.accuracy_score(y_test, y_pred_class))

LogisticRegression Accuracy: 0.91


## SVC

In [15]:
# Support Vector Classifier (SVC) with scaled data 
from sklearn.svm import SVC
xg = SVC()
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)

print("SVC Accuracy:", metrics.accuracy_score(y_test, y_pred_class))

SVC Accuracy: 0.86


## ANN

In [16]:
# ANN with scaled data 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assuming you have already split your data into training and testing sets (X_train, X_test, y_train, y_test)

# Initialize the MLPClassifier
ann_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)

# Train the model
ann_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_ann = ann_model.predict(X_test_scaled)

# Calculate accuracy
accuracy_ann = accuracy_score(y_test, y_pred_ann)

print(accuracy_ann)


0.83


## Ensemble

In [17]:
# Ensemble of All Models
from sklearn.ensemble import VotingClassifier

# Creating a list of models
models = [('RandomForest', RandomForestClassifier()),
          ('DecisionTree', DecisionTreeClassifier()),
          ('AdaBoost', AdaBoostClassifier()),
          ('LogisticRegression', LogisticRegression()),
          ('SVC', SVC()),('ANN', ann_model)]

# Creating a VotingClassifier with hard voting
ensemble_model = VotingClassifier(estimators=models, voting='hard')
ensemble_model.fit(X_train_scaled, y_train)
y_pred_class_ensemble = ensemble_model.predict(X_test_scaled)

print("Ensemble Model Accuracy:", metrics.accuracy_score(y_test, y_pred_class_ensemble))

Ensemble Model Accuracy: 0.88


## Fine-tuning AdaBoostClassifier Hyperparameters

Among all the models AdaBoostClassifier gives us the best result, lets modify hyperparameters to get more accurate result. We did'nt prefer Logistic Regression
as it cannot handle Non-Linear trends and AdaBoost tends to be less prone to overfitting.

In [18]:
# Fine-tuned RandomForestClassifier with scaled data 

xg = AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=.5, algorithm='SAMME', random_state=10)
xg.fit(X_train_scaled, y_train)
y_pred_class = xg.predict(X_test_scaled)



print(metrics.accuracy_score(y_test, y_pred_class))


0.9


### Loading test data and seeing our model prediction 

In [19]:
dft = pd.read_csv("test.csv",index_col=0)

# Mapping BusinessTravel categories to numerical values
dft.BusinessTravel[dft.BusinessTravel == 'Non-Travel'] = 0
dft.BusinessTravel[dft.BusinessTravel == 'Travel_Rarely'] = 1
dft.BusinessTravel[dft.BusinessTravel == 'Travel_Frequently'] = 2

# Mapping Department categories to numerical values
dft.Department[dft.Department == 'Research & Development'] = 0
dft.Department[dft.Department == 'Sales'] = 1
dft.Department[dft.Department == 'Human Resources'] = 2

# Mapping EducationField categories to numerical values
dft.EducationField[dft.EducationField == 'Medical'] = 0
dft.EducationField[dft.EducationField == 'Life Sciences'] = 1
dft.EducationField[dft.EducationField == 'Other'] = 2
dft.EducationField[dft.EducationField == 'Marketing'] = 3
dft.EducationField[dft.EducationField == 'Technical Degree'] = 4
dft.EducationField[dft.EducationField == 'Human Resources'] = 5

# Mapping MaritalStatus categories to numerical values
dft.MaritalStatus[dft.MaritalStatus == 'Single'] = 0
dft.MaritalStatus[dft.MaritalStatus == 'Married'] = 1
dft.MaritalStatus[dft.MaritalStatus == 'Divorced'] = 2

# Mapping Gender categories to numerical values
dft.Gender[dft.Gender == 'Male'] = 0
dft.Gender[dft.Gender == 'Female'] = 1

# Mapping OverTime categories to numerical values
dft.OverTime[dft.OverTime == 'No'] = 0
dft.OverTime[dft.OverTime == 'Yes'] = 1

dft = dft.drop(['JobRole'], axis=1)

In [20]:
# Displaying the DataFrame after converting categorical data to numerical data
(dft.head())

Unnamed: 0_level_0,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,JobInvolvement,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,28,1,0,9,3,0,377,4,0,3,...,4,1,5,3,5,2,0,4,5,1
2,31,1,1,6,4,0,653,1,0,4,...,4,2,13,4,7,7,5,7,3,1
3,37,1,0,6,3,0,474,3,0,4,...,3,2,13,2,7,7,6,7,4,1
4,42,1,0,1,2,1,827,4,1,2,...,3,1,8,4,4,3,0,2,5,1
5,45,0,0,4,2,1,972,3,0,3,...,3,0,9,5,9,7,0,8,2,1


In [21]:
dft['Attrition'] = xg.predict_proba(dft)[::,1]

In [22]:
dft['Attrition']

Id
1      0.181762
2      0.211934
3      0.271582
4      0.272684
5      0.165196
         ...   
466    0.212499
467    0.304008
468    0.257207
469    0.198537
470    0.193885
Name: Attrition, Length: 470, dtype: float64

In [23]:
# Assuming 'Attrition' contains probability values in the range [0, 1]
threshold = 0.5
dft['Attrition'] = (dft['Attrition'] > threshold).astype(int)


In [24]:
dft['Attrition']

Id
1      0
2      0
3      0
4      0
5      0
      ..
466    0
467    0
468    0
469    0
470    0
Name: Attrition, Length: 470, dtype: int64

## Evaluation Metrics


Following evaluation metrics are used to analyse the performance of the model:

1.) Percentage Error

2.) Accuracy

### 1.) Percentage Error

In [25]:
import pandas as pd

# Load actual values from the CSV file
actual_values = pd.read_csv('test_actual_values.csv', index_col='Id')['Attrition']


# 'dft_predicted' is a DataFrame with predicted probabilities
threshold = 0.5
predicted_values = (dft['Attrition'] > threshold).astype(int)

# Calculate Percentage Error
percentage_error = (actual_values != predicted_values).mean() * 100

print("Percentage Error:", percentage_error)



Percentage Error: 12.127659574468085


### 2.) Accuracy

In [26]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(actual_values, predicted_values)

print("Accuracy:", accuracy)

Accuracy: 0.8787234042553191
