# Employee Churn Prediction 

## 1. Problem Statement

Employee attrition threatens organizational stability and growth. This project develops a machine learning model that predicts whether an employee is likely to leave the company, using HR-related features such as satisfaction level, working hours, and tenure. Accurate predictions can help HR teams address employee concerns proactively, improve retention strategies, and reduce turnover-related costs.

**Problem**: What is likely to make an employee leave the company?

The dataset provided by the HR department at Salifort Motors comprises 14,999 rows and 10 columns, capturing various attributes pertaining to employee demographics, job-related factors, and potential indicators of turnover.

Here are the variables included in the dataset along with their descriptions:

satisfaction_level: Employee-reported job satisfaction level [0–1]

last_evaluation: Score of the employee's last performance review [0–1]

number_project: Number of projects the employee contributes to

average_montly_hours: Average number of hours the employee worked per month (the column name is misspelled in the dataset)

time_spend_company: Duration of the employee's tenure with the company (in years)

Work_accident: Whether or not the employee experienced an accident while at work

left: Whether or not the employee left the company (0 for 'stay' and 1 for 'left')

promotion_last_5years: Whether or not the employee was promoted in the last 5 years (0 for 'No' and 1 for 'Yes')

Department: The employee's department

salary: The employee's salary band (low, medium, or high)

This comprehensive dataset enables in-depth analysis and modeling to identify patterns and factors influencing employee turnover within the organization.


## 2. Data Collection

Dataset source: [www.kaggle.com/datasets/raminhuseyn/hr-analytics-data-set](https://www.kaggle.com/datasets/raminhuseyn/hr-analytics-data-set)

### 2.1. Load Data and Import Required Libraries

In [1]:
# Basic Import
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#Modelling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
    f1_score, classification_report, roc_auc_score,
)

In [2]:
df = pd.read_csv("HR_dataset.csv")
df.head()

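| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |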


## 3. Data Checks to Perform

### 3.1. Check Missing Values

In [3]:
df.isnull().sum()       #No missing value

satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
Department               0
salary                   0
dtype: int64

### 3.2. Check Duplicates

In [4]:
print(df.duplicated().sum())        

3008


In [5]:
# Remove duplicates
df.drop_duplicates(inplace=True)

In [6]:
#Checking shape of dataset
print("Shape of dataset after removing duplicates: ", df.shape)

Shape of dataset after removing duplicates:  (11991, 10)
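
The counts above are consistent: 14,999 original rows minus 3,008 duplicates leaves 11,991. As a minimal sketch of what the two calls do, the toy frame below (invented for illustration, not taken from the HR dataset) shows that `duplicated()` flags each repeat of an earlier row and `drop_duplicates()` keeps the first occurrence:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical, as are rows 1 and 3.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.80, 0.11],
    "salary": ["low", "medium", "low", "medium", "medium"],
})

n_duplicates = int(toy.duplicated().sum())   # repeats of an earlier row
deduplicated = toy.drop_duplicates()         # keeps the first occurrence

print(n_duplicates)
print(deduplicated.shape)
print(deduplicated.index.tolist())           # original index labels are kept

# The same bookkeeping applied to the counts reported above.
assert 14999 - 3008 == 11991
```

Because `drop_duplicates` keeps the original index labels, the index of the cleaned frame has gaps, which is why later outputs show index values larger than the number of rows.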


### 3.3. Check data type

In [7]:
df.info()           # 2 categorical and 8 numerical columns

<class 'pandas.core.frame.DataFrame'>
Index: 11991 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     11991 non-null  float64
 1   last_evaluation        11991 non-null  float64
 2   number_project         11991 non-null  int64  
 3   average_montly_hours   11991 non-null  int64  
 4   time_spend_company     11991 non-null  int64  
 5   Work_accident          11991 non-null  int64  
 6   left                   11991 non-null  int64  
 7   promotion_last_5years  11991 non-null  int64  
 8   Department             11991 non-null  object 
 9   salary                 11991 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.0+ MB


### 3.4. Check unique values of each column

In [8]:
#Number of unique values in each column
df.nunique()

satisfaction_level        92
last_evaluation           65
number_project             6
average_montly_hours     215
time_spend_company         8
Work_accident              2
left                       2
promotion_last_5years      2
Department                10
salary                     3
dtype: int64

In [9]:
for col in df.columns:
    print(f"Unique values of column '{col}':\n")
    print(df[col].unique())
    print('\n','-'*50,'\n')

Unique values of column 'satisfaction_level':

[0.38 0.8  0.11 0.72 0.37 0.41 0.1  0.92 0.89 0.42 0.45 0.84 0.36 0.78
 0.76 0.09 0.46 0.4  0.82 0.87 0.57 0.43 0.13 0.44 0.39 0.85 0.81 0.9
 0.74 0.79 0.17 0.24 0.91 0.71 0.86 0.14 0.75 0.7  0.31 0.73 0.83 0.32
 0.54 0.27 0.77 0.88 0.48 0.19 0.6  0.12 0.61 0.33 0.56 0.47 0.28 0.55
 0.53 0.59 0.66 0.25 0.34 0.58 0.51 0.35 0.64 0.5  0.23 0.15 0.49 0.3
 0.63 0.21 0.62 0.29 0.2  0.16 0.65 0.68 0.67 0.22 0.26 0.99 0.98 1.
 0.52 0.93 0.97 0.69 0.94 0.96 0.18 0.95]

 -------------------------------------------------- 

Unique values of column 'last_evaluation':

[0.53 0.86 0.88 0.87 0.52 0.5  0.77 0.85 1.   0.54 0.81 0.92 0.55 0.56
 0.47 0.99 0.51 0.89 0.83 0.95 0.57 0.49 0.46 0.62 0.94 0.48 0.8  0.74
 0.7  0.78 0.91 0.93 0.98 0.97 0.79 0.59 0.84 0.45 0.96 0.68 0.82 0.9
 0.71 0.6  0.65 0.58 0.72 0.67 0.75 0.73 0.63 0.61 0.76 0.66 0.69 0.37
 0.64 0.39 0.41 0.43 0.44 0.36 0.38 0.4  0.42]

 -------------------------------------------------- 

...

### 3.5. Check statistics of dataset

In [10]:
df.describe()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 11991.0 | 11991.0 | 11991.0 | 11991.0 | 11991.0 | 11991.0 | 11991.0 | 11991.0 |
| mean | 0.629658 | 0.716683 | 3.802852 | 200.473522 | 3.364857 | 0.154282 | 0.166041 | 0.016929 |
| std | 0.24107 | 0.168343 | 1.163238 | 48.727813 | 1.33024 | 0.361234 | 0.372133 | 0.129012 |
| min | 0.09 | 0.36 | 2.0 | 96.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.48 | 0.57 | 3.0 | 157.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.66 | 0.72 | 4.0 | 200.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.82 | 0.86 | 5.0 | 243.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 7.0 | 310.0 | 10.0 | 1.0 | 1.0 | 1.0 |


## 4. Preparing X and y (independent and dependent variables, respectively)

In [11]:
X = df.drop(columns=['left'])

X.head()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | sales | low |


In [12]:
y = df['left']
y

0        1
1        1
2        1
3        1
4        1
        ..
11995    0
11996    0
11997    0
11998    0
11999    0
Name: left, Length: 11991, dtype: int64

## 5. Train-Test Split

In [13]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
X_train.shape,X_test.shape

((8393, 9), (3598, 9))
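
The split above is purely random, so the share of leavers can drift slightly between the two sets (1,388 of 8,393 training rows, about 16.5%, against 603 of 3,598 test rows, about 16.8%). One possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A minimal sketch on synthetic labels (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (1 in 6 positives), for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 3))
y_demo = np.array([1] * 100 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the overall positive rate (100/600).
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Adopting this in the notebook would change the rows assigned to each split, so the metrics reported in later sections would shift slightly.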

In [14]:
X_train

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|
| 6957 | 0.96 | 0.37 | 3 | 111 | 2 | 0 | 0 | sales | high |
| 9240 | 0.19 | 1.00 | 4 | 188 | 4 | 1 | 0 | marketing | medium |
| 618 | 0.45 | 0.57 | 2 | 148 | 3 | 0 | 0 | marketing | high |
| 9296 | 0.72 | 0.79 | 4 | 154 | 3 | 0 | 0 | IT | medium |
| 6030 | 0.54 | 0.82 | 2 | 279 | 3 | 1 | 0 | marketing | low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11973 | 0.49 | 0.71 | 4 | 178 | 8 | 0 | 0 | IT | medium |
| 5200 | 0.77 | 0.52 | 4 | 216 | 3 | 0 | 0 | sales | medium |
| 5399 | 0.84 | 0.53 | 5 | 190 | 3 | 0 | 0 | technical | medium |
| 861 | 0.43 | 0.48 | 2 | 144 | 3 | 0 | 0 | sales | low |


In [15]:
y_train

6957     0
9240     0
618      1
9296     0
6030     0
        ..
11973    0
5200     0
5399     0
861      1
7279     0
Name: left, Length: 8393, dtype: int64

## 6. Transforming Data With Column Transformers

In [16]:
# Identifying numerical and categorical features
cat_features = X.select_dtypes(include='object').columns
num_features = X.select_dtypes(exclude='object').columns

print("Categorical Features: ",cat_features)
print("Numerical Features: ",num_features)

Categorical Features:  Index(['Department', 'salary'], dtype='object')
Numerical Features:  Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years'],
      dtype='object')


In [17]:
# Creating Column Transformer with 2 types of transformer
numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ('OneHotEncoder',oh_transformer,cat_features),
        ('StandardScaler',numeric_transformer,num_features)
    ]
)

In [18]:
preprocessor

In [19]:
# Transform both X_train, X_test using preprocessor

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [20]:
pd.DataFrame(X_train)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.367002 | -2.065493 | -0.696994 | -1.850387 | -1.027132 | -0.422838 | -0.13259 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.814273 | 1.682120 | 0.162902 | -0.268909 | 0.476438 | 2.364971 | -0.13259 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.740076 | -0.875775 | -1.556889 | -1.090456 | -0.275347 | -0.422838 | -0.13259 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.375436 | 0.432915 | 0.162902 | -0.967224 | -0.275347 | -0.422838 | -0.13259 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.368239 | 0.611373 | -1.556889 | 1.600110 | -0.275347 | 2.364971 | -0.13259 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8388 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.574815 | -0.042972 | 0.162902 | -0.474296 | 3.483577 | -0.422838 | -0.13259 |
| 8389 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.582012 | -1.173204 | 0.162902 | 0.306174 | -0.275347 | -0.422838 | -0.13259 |
| 8390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.871219 | -1.113718 | 1.022797 | -0.227832 | -0.275347 | -0.422838 | -0.13259 |
| 8391 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.822707 | -1.411148 | -1.556889 | -1.172611 | -0.275347 | -0.422838 | -0.13259 |


In [21]:
X_train.shape, X_test.shape

((8393, 18), (3598, 18))

In [22]:
y_train.value_counts()

left
0    7005
1    1388
Name: count, dtype: int64

In [23]:
# Over-sampling the minority class
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

X_train.shape, y_train.shape

((14010, 18), (14010,))
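
SMOTE balances the training set by adding synthetic leavers until both classes have 7,005 rows, hence 14,010 rows in total. The sketch below illustrates the core idea only: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is a simplified stand-in written with NumPy, not imbalanced-learn's implementation, and `smote_like_oversample` is a hypothetical helper name.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Illustrative data only: 20 minority points in 2 dimensions.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))
synthetic = smote_like_oversample(X_minority, n_new=30)
print(synthetic.shape)

# Balancing the training set above: 1,388 leavers are topped up to 7,005.
n_needed = 7005 - 1388
print(n_needed, 7005 * 2)
```

Because every synthetic point is an interpolation of two real minority points, it always lies inside the range spanned by the original minority samples.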

## 7. Model Training with Evaluation

In [24]:
def evaluate_model(true,predicted):
    accuracy = accuracy_score(true,predicted)
    precision = precision_score(true,predicted)
    recall = recall_score(true,predicted)
    roc_auc = roc_auc_score(true,predicted)
    f1 = f1_score(true,predicted)
    return accuracy,precision,recall,roc_auc,f1
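
Note that `evaluate_model` passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single operating point, so the value equals balanced accuracy, which is why the reported training ROC_AUC matches training accuracy on the SMOTE-balanced data. A threshold-independent AUC needs scores or probabilities instead, for example from `model.predict_proba(X_test)[:, 1]`. The sketch below uses invented labels and scores to show the difference:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Illustrative labels and predicted probabilities (not from the HR data).
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])
y_label = (y_score >= 0.5).astype(int)   # hard predictions at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, y_score)   # ranks all positive/negative pairs
auc_from_labels = roc_auc_score(y_true, y_label)   # single operating point only

print(auc_from_scores)
print(auc_from_labels, balanced_accuracy_score(y_true, y_label))
```

Here 8 of the 9 positive/negative pairs are ranked correctly by the scores (AUC of 8/9), while the hard labels give 2/3, the same as balanced accuracy.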

In [25]:
# Model Building
models = {
    "Decision Tree Classifier":DecisionTreeClassifier(),
    "Logistic Regression":LogisticRegression(),
    'K-Nearest Neighbours Classifier':KNeighborsClassifier(),
    "Random Forest Classifier":RandomForestClassifier(),
    'AdaBoost Classifier':AdaBoostClassifier(),
    'Gradient Boosting Classifier':GradientBoostingClassifier(),
    'XgBoost Classifier':XGBClassifier()
}


for name, model in models.items():
    model.fit(X_train,y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training set performance
    train_results = evaluate_model(y_train,y_train_pred)

    # Test set performance
    test_results = evaluate_model(y_test,y_test_pred)

    print(name)

    print('Model performance at training set:\n')
    print('- Accuracy: ',train_results[0])
    print('- F1 Score: ',train_results[4])
    print('- Precision: ',train_results[1])
    print('- Recall: ',train_results[2])
    print('- ROC_AUC Score: ',train_results[3])
    print('\n')


    print("-----------------------------------------------------------------\n")

    print('Model performance at testing set:\n')
    print('- Accuracy: ',test_results[0])
    print('- F1 Score: ',test_results[4])
    print('- Precision: ',test_results[1])
    print('- Recall: ',test_results[2])
    print('- ROC_AUC Score: ',test_results[3])
    print('\n')

    print("X"*100,'\n')

Decision Tree Classifier
Model performance at training set:

- Accuracy:  1.0
- F1 Score:  1.0
- Precision:  1.0
- Recall:  1.0
- ROC_AUC Score:  1.0


-----------------------------------------------------------------

Model performance at testing set:

- Accuracy:  0.9694274596998332
- F1 Score:  0.908485856905158
- Precision:  0.9115191986644408
- Recall:  0.9054726368159204
- ROC_AUC Score:  0.9438882382744043


XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

Logistic Regression
Model performance at training set:

- Accuracy:  0.7909350463954319
- F1 Score:  0.7965548378134334
- Precision:  0.7757034632034632
- Recall:  0.8185581727337616
- ROC_AUC Score:  0.7909350463954319


-----------------------------------------------------------------

Model performance at testing set:

- Accuracy:  0.7759866592551418
- F1 Score:  0.5537098560354374
- Precision:  0.41562759767248547
- Recall:  0.8291873963515755
- ROC_AUC Score:  0.797231

Among these models, Random Forest and XGBoost stand out, so we examine their confusion matrices and classification reports in more detail.

In [26]:
models = {
    "Random Forest Classifier":RandomForestClassifier(),
    'XgBoost Classifier':XGBClassifier()
}

for name, mymodel in models.items():
    print(name)
    mymodel.fit(X_train,y_train)

    y_pred = mymodel.predict(X_test)
    print("\nConfusion Matrix:- \n",confusion_matrix(y_test,y_pred))
    print("\nClassification Report:- \n",classification_report(y_test,y_pred))

    print("-----------------------------------------------------------------\n")

Random Forest Classifier

Confusion Matrix:- 
 [[2982   13]
 [  60  543]]

Classification Report:- 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      2995
           1       0.98      0.90      0.94       603

    accuracy                           0.98      3598
   macro avg       0.98      0.95      0.96      3598
weighted avg       0.98      0.98      0.98      3598

-----------------------------------------------------------------

XgBoost Classifier

Confusion Matrix:- 
 [[2970   25]
 [  51  552]]

Classification Report:- 
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      2995
           1       0.96      0.92      0.94       603

    accuracy                           0.98      3598
   macro avg       0.97      0.95      0.96      3598
weighted avg       0.98      0.98      0.98      3598

-----------------------------------------------------------------
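
The class-1 rows of the classification reports can be checked directly against the confusion matrices. The short sketch below recomputes precision, recall, and F1 for the `left` class from the two matrices printed above; `positive_class_metrics` is a helper name introduced here for illustration.

```python
import numpy as np

def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (the scikit-learn convention)."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Matrices copied from the output above.
rf_cm = [[2982, 13], [60, 543]]
xgb_cm = [[2970, 25], [51, 552]]

for name, cm in [("Random Forest", rf_cm), ("XGBoost", xgb_cm)]:
    p, r, f = positive_class_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For churn prediction the false negatives (60 and 51 here) are the leavers the model fails to flag, so recall on class 1 is the figure to watch.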



## 8. Hyperparameter Tuning & Cross-Validation

In [27]:
rf_params = {
    'min_samples_split':[2,8,12,15,18],
    'max_depth': [2,8,12, 15, 18],
    'max_features': [2,5,8,12,15,18],
    'n_estimators': [100,200,500,1000]
}
xg_params = { 
    'n_estimators': [100,200,500,1000], 
    'max_depth': [2,8,12,15,18], 
    'learning_rate': [0.01,0.1], 
    'colsample_bytree': [0.2,0.5,0.8,1]
}

In [28]:
randomcv_models = [
    ('RF',RandomForestClassifier(),rf_params),
    ('XGB',XGBClassifier(),xg_params)
]

In [29]:
# Hyperparameter Tuning using RandomizedSearchCV

model_param = {}
for name,model,params in randomcv_models:
    cv = StratifiedKFold()
    random_search = RandomizedSearchCV(estimator=model,param_distributions=params,verbose=2,cv=cv,n_jobs=-1,error_score='raise')
    random_search.fit(X_train,y_train)
    model_param[name] = random_search.best_params_

for model_name in model_param:
    print('-'*20,'Best Params for ',model_name,'-'*20)
    print(model_param[model_name])

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
-------------------- Best Params for  RF --------------------
{'n_estimators': 1000, 'min_samples_split': 2, 'max_features': 15, 'max_depth': 12}
-------------------- Best Params for  XGB --------------------
{'n_estimators': 1000, 'max_depth': 12, 'learning_rate': 0.01, 'colsample_bytree': 0.5}
[CV] END max_depth=18, max_features=5, min_samples_split=18, n_estimators=100; total time=   1.3s
[CV] END max_depth=18, max_features=5, min_samples_split=18, n_estimators=200; total time=   2.4s
[CV] END max_depth=18, max_features=5, min_samples_split=18, n_estimators=200; total time=   2.6s
[CV] END max_depth=18, max_features=5, min_samples_split=18, n_estimators=200; total time=   2.5s
[CV] END max_depth=18, max_features=5, min_samples_split=18, n_estimators=200; total time=   2.6s
[CV] END max_depth=8, max_features=8, min_samples_split=8, n_estimators=200; total time=   
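
The "10 candidates, totalling 50 fits" lines follow from the defaults: `RandomizedSearchCV` samples `n_iter=10` parameter combinations and `StratifiedKFold` uses 5 folds. The sketch below counts how large each full grid is, to show how small a fraction of it the random search actually tries; `grid_size` is a helper introduced for illustration.

```python
from math import prod

rf_params = {
    'min_samples_split': [2, 8, 12, 15, 18],
    'max_depth': [2, 8, 12, 15, 18],
    'max_features': [2, 5, 8, 12, 15, 18],
    'n_estimators': [100, 200, 500, 1000],
}
xg_params = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [2, 8, 12, 15, 18],
    'learning_rate': [0.01, 0.1],
    'colsample_bytree': [0.2, 0.5, 0.8, 1],
}

def grid_size(params):
    """Number of distinct combinations in a parameter grid."""
    return prod(len(values) for values in params.values())

n_iter, n_folds = 10, 5   # defaults of RandomizedSearchCV and StratifiedKFold
for name, params in [('RF', rf_params), ('XGB', xg_params)]:
    print(name, grid_size(params), 'combinations;', n_iter * n_folds, 'fits sampled')
```

With 600 and 160 possible combinations respectively, 10 sampled candidates cover only a small part of each grid, and no `random_state` is set, so a rerun may select different best parameters.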

In [31]:
models = {
    'Random Forest Classifier': RandomForestClassifier(n_estimators=1000,min_samples_split=2, 
                                          max_features=15,max_depth=12),
    'XgBoost Classifier': XGBClassifier(n_estimators=1000,max_depth=12,learning_rate=0.01, colsample_bytree=0.5) 
}

model_list = []
train_acc_list = []   # training accuracy per model
test_acc_list = []    # test accuracy per model

for name, model in models.items():
    model.fit(X_train,y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training set performance
    train_results = evaluate_model(y_train,y_train_pred)

    # Test set performance
    test_results = evaluate_model(y_test,y_test_pred)

    print(name)
    model_list.append(name)

    print('Model performance at training set:\n')
    print('- Accuracy: ',train_results[0])
    print('- F1 Score: ',train_results[4])
    print('- Precision: ',train_results[1])
    print('- Recall: ',train_results[2])
    print('- ROC_AUC Score: ',train_results[3])
    print('\n')

    train_acc_list.append(train_results[0])

    print("-----------------------------------------------------------------\n")

    print('Model performance at testing set:\n')
    print('- Accuracy: ',test_results[0])
    print('- F1 Score: ',test_results[4])
    print('- Precision: ',test_results[1])
    print('- Recall: ',test_results[2])
    print('- ROC_AUC Score: ',test_results[3])
    print('\n')

    test_acc_list.append(test_results[0])
    print("X"*100,'\n')


Random Forest Classifier
Model performance at training set:

- Accuracy:  0.9946466809421841
- F1 Score:  0.9946217282179993
- Precision:  0.9992795389048992
- Recall:  0.9900071377587437
- ROC_AUC Score:  0.9946466809421841


-----------------------------------------------------------------

Model performance at testing set:

- Accuracy:  0.980544747081712
- F1 Score:  0.9399656946826758
- Precision:  0.9733570159857904
- Recall:  0.9087893864013267
- ROC_AUC Score:  0.9518905195779588


XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

XgBoost Classifier
Model performance at training set:

- Accuracy:  0.9977872947894361
- F1 Score:  0.9977836562522342
- Precision:  0.9994270982526496
- Recall:  0.9961456102783726
- ROC_AUC Score:  0.9977872947894362


-----------------------------------------------------------------

Model performance at testing set:

- Accuracy:  0.980822679266259
- F1 Score:  0.9411764705882353
- Precision:  0.9

In [32]:
for name, mymodel in models.items():
    print(name)
    mymodel.fit(X_train,y_train)

    y_pred = mymodel.predict(X_test)
    print("\nConfusion Matrix:- \n",confusion_matrix(y_test,y_pred))
    print("\nClassification Report:- \n",classification_report(y_test,y_pred))

    print("-----------------------------------------------------------------\n")

Random Forest Classifier

Confusion Matrix:- 
 [[2980   15]
 [  55  548]]

Classification Report:- 
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      2995
           1       0.97      0.91      0.94       603

    accuracy                           0.98      3598
   macro avg       0.98      0.95      0.96      3598
weighted avg       0.98      0.98      0.98      3598

-----------------------------------------------------------------

XgBoost Classifier

Confusion Matrix:- 
 [[2977   18]
 [  51  552]]

Classification Report:- 
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      2995
           1       0.97      0.92      0.94       603

    accuracy                           0.98      3598
   macro avg       0.98      0.95      0.96      3598
weighted avg       0.98      0.98      0.98      3598

-----------------------------------------------------------------



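We choose the XGBoost classifier as the final model. After tuning, both models reach about 98% test accuracy, but XGBoost misses fewer leavers (51 false negatives versus 55 for Random Forest), giving it slightly higher recall on the `left` class (0.92 versus 0.91) at the cost of three more false positives (18 versus 15).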