**HR Analytics Project: Understanding Attrition in HR**
**Project Description**

Every year, companies hire many new employees and invest time and money in training them. Many also run training programs for their existing employees. The aim of these programs is to increase employee effectiveness. But where does HR analytics fit into this, and is it only about improving employee performance?

**HR Analytics**
Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment. HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes.

**Attrition in HR**
Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture, and motivation systems that help the organization retain top employees.
How does attrition affect companies, and how does HR analytics help in analyzing attrition? We discuss the first question here; for the second, we write the code and work through the process step by step.

**How Attrition Affects Companies**

A major problem with high employee attrition is its cost to an organization. Job postings, hiring processes, paperwork, and new-hire training are some of the common expenses of losing employees and replacing them. Additionally, regular employee turnover prevents an organization from building up its collective knowledge base and experience over time. This is especially concerning if the business is customer-facing, as customers often prefer to interact with familiar people. Errors and issues are also more likely when the workforce constantly includes new workers.



In [7]:
import pandas as pd

# Local path to the dataset CSV
path = '/Users/siddhant/Downloads/WA_Fn-UseC_-HR-Employee-Attrition.csv'

# Read the CSV file into a DataFrame using pandas
df = pd.read_csv(path)

# Display the DataFrame (Jupyter shows the first and last rows) to verify it was loaded properly
df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [9]:
df.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

Next, we one-hot encode the 'BusinessTravel', 'Department', and 'EducationField' columns and label-encode 'Attrition' and 'Gender'.

In [10]:
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [11]:
from sklearn.preprocessing import LabelEncoder

# One-hot encode the nominal categorical columns
df = pd.get_dummies(df, columns=['BusinessTravel', 'Department', 'EducationField'])

# Label-encode the binary columns 'Attrition' and 'Gender'
# (LabelEncoder assigns codes in sorted label order, so 'No' -> 0 and 'Yes' -> 1)
label_encoder = LabelEncoder()
df['Attrition'] = label_encoder.fit_transform(df['Attrition'])
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# Display the encoded DataFrame
df.head()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,...,BusinessTravel_Travel_Rarely,Department_Human Resources,Department_Research & Development,Department_Sales,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree
0,41,1,1102,1,2,1,1,2,0,94,...,1,0,0,1,0,1,0,0,0,0
1,49,0,279,8,1,1,2,3,1,61,...,0,0,1,0,0,1,0,0,0,0
2,37,1,1373,2,2,1,4,4,1,92,...,1,0,1,0,0,0,0,0,1,0
3,33,0,1392,3,4,1,5,4,0,56,...,0,0,1,0,0,1,0,0,0,0
4,27,0,591,2,1,1,7,1,1,40,...,1,0,1,0,0,0,0,1,0,0
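
Because the same LabelEncoder instance is refitted for each column, its `classes_` attribute only reflects the most recent fit. The sketch below shows one way to confirm which label maps to which code right after fitting. It uses a small hypothetical Series in place of df['Attrition'], and the names `attrition`, `encoder`, and `mapping` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for df['Attrition']
attrition = pd.Series(['Yes', 'No', 'Yes', 'No', 'No'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(attrition)

# classes_ holds the original labels in sorted order;
# the position of each label is the integer code assigned to it
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Checking the mapping this way avoids having to assume later on that 1 means 'Yes'.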


In [54]:
df["OverTime"].value_counts()

No     1054
Yes     416
Name: OverTime, dtype: int64

In [55]:
df["Over18"].value_counts()

Y    1470
Name: Over18, dtype: int64
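
A column with a single value, like Over18 above, cannot help a model separate classes. The preview rows earlier also show EmployeeCount as 1 and StandardHours as 80 in every displayed row, so those may be constant as well. A minimal sketch for finding such columns follows, using a hypothetical miniature frame (`sample`) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature frame; column names mirror the dataset
sample = pd.DataFrame({
    'Over18': ['Y', 'Y', 'Y', 'Y'],
    'StandardHours': [80, 80, 80, 80],
    'Age': [41, 49, 37, 33],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant
n_unique = sample.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop the constant columns in one step
reduced = sample.drop(columns=constant_cols)
print(reduced.columns.tolist())
```

Running the same check on df would list every constant column, which could then be dropped together.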

In [56]:
df["JobRole"].value_counts()

Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: JobRole, dtype: int64

In [12]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Label-encode the binary 'OverTime' column ('No' -> 0, 'Yes' -> 1)
df['OverTime'] = label_encoder.fit_transform(df['OverTime'])

# Drop 'Over18': it holds the single value 'Y' for every row, so it carries no information
df = df.drop(columns=['Over18'])

# One-hot encode the 'JobRole' column
df = pd.get_dummies(df, columns=['JobRole'])

# Label-encode the 'MaritalStatus' column
df['MaritalStatus'] = label_encoder.fit_transform(df['MaritalStatus'])
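
The next cell applies SMOTE, an oversampling technique intended for imbalanced targets. Before resampling, it can help to check how skewed the target actually is. The sketch below does this on a hypothetical encoded target (`target`); in the notebook the same calls would be made on df['Attrition'].

```python
import pandas as pd

# Hypothetical encoded target: 1 = employee left, 0 = employee stayed
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='Attrition')

# Absolute number of rows per class
counts = target.value_counts()

# Share of rows per class (sums to 1.0)
shares = target.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```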


In [32]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

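# Separate features and target. X and y are not defined in the cells shown above;
# the definition below is an assumption that is consistent with the feature
# indices printed in the importance table further down.
X = df.drop(columns=['Attrition'])
y = df['Attrition']

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to generate synthetic minority-class samples.
# Note: resampling before the split means the test set also contains synthetic samples.
X_resampled_smote, y_resampled_smote = smote.fit_resample(X, y)

# Split the resampled data into training and testing sets
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_resampled_smote, y_resampled_smote, test_size=0.2, random_state=42)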

# Initialize the RandomForestClassifier
rf_classifier_smote = RandomForestClassifier(random_state=42)

# Train the classifier
rf_classifier_smote.fit(X_train_smote, y_train_smote)

# Predict using the trained classifier
y_pred_rf_smote = rf_classifier_smote.predict(X_test_smote)

# Evaluate accuracy
accuracy_rf_smote = accuracy_score(y_test_smote, y_pred_rf_smote)
print(f'Accuracy using RandomForestClassifier with SMOTE: {accuracy_rf_smote:.2f}')

# Classification report
print("Classification Report using RandomForestClassifier with SMOTE:")
print(classification_report(y_test_smote, y_pred_rf_smote))


Accuracy using RandomForestClassifier with SMOTE: 0.92
Classification Report using RandomForestClassifier with SMOTE:
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       250
           1       0.94      0.90      0.92       244

    accuracy                           0.92       494
   macro avg       0.92      0.92      0.92       494
weighted avg       0.92      0.92      0.92       494



In [33]:
import numpy as np
from sklearn.model_selection import cross_val_score

# Reuse the classifier and the SMOTE-resampled data from the previous cell.
# cross_val_score clones the classifier and refits it on each fold.

# Perform 5-fold cross-validation
cv_scores_rf_smote = cross_val_score(rf_classifier_smote, X_resampled_smote, y_resampled_smote, cv=5, scoring='accuracy')

# Print cross-validation scores
print("Cross-validation scores using RandomForestClassifier with SMOTE:")
print(cv_scores_rf_smote)
print(f"Mean CV accuracy using RandomForestClassifier with SMOTE: {np.mean(cv_scores_rf_smote):.2f}")

# Print the accuracy of each fold
for i, score in enumerate(cv_scores_rf_smote, 1):
    print(f"Fold {i} accuracy: {score:.2f}")


Cross-validation scores using RandomForestClassifier with SMOTE:
[0.68016194 0.95943205 0.95740365 0.94523327 0.72413793]
Mean CV accuracy using RandomForestClassifier with SMOTE: 0.85
Fold 1 accuracy: 0.68
Fold 2 accuracy: 0.96
Fold 3 accuracy: 0.96
Fold 4 accuracy: 0.95
Fold 5 accuracy: 0.72


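**Accuracy:** The overall accuracy on the held-out 20% split is 92%. This split was drawn after SMOTE resampling, so the test set includes synthetic samples and the figure may overstate performance on real employee records. The 5-fold cross-validation accuracy is lower on average (0.85) and varies widely across folds (0.68 to 0.96).

**Precision and Recall:** On the same split, both classes (0 and 1) have precision, recall, and F1-scores between 0.90 and 0.94, indicating balanced performance across the two classes on the resampled data.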



In [34]:
# Assuming your RandomForestClassifier model is named rf_classifier_smote and X contains employee features

# Extract feature importances
feature_importance = rf_classifier_smote.feature_importances_

# Create a DataFrame to display feature importances
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)


                              Feature  Importance
21                   StockOptionLevel    0.072579
13                      MonthlyIncome    0.054119
9                      JobInvolvement    0.043157
11                    JobSatisfaction    0.043112
10                           JobLevel    0.039538
6             EnvironmentSatisfaction    0.038573
28               YearsWithCurrManager    0.036859
33  Department_Research & Development    0.036110
2                    DistanceFromHome    0.033866
14                        MonthlyRate    0.033785
5                      EmployeeNumber    0.032905
1                           DailyRate    0.032435
0                                 Age    0.031779
8                          HourlyRate    0.031685
38             EducationField_Medical    0.031392
22                  TotalWorkingYears    0.031015
26                 YearsInCurrentRole    0.028501
19           RelationshipSatisfaction    0.026833
25                     YearsAtCompany    0.024941


**Top Features Impacting Attrition Risk:**

1) StockOptionLevel (0.073): Employees with higher stock options might have lower attrition.

2) MonthlyIncome (0.054): Lower monthly income might contribute to higher attrition.

3) JobInvolvement (0.043): Higher job involvement might lead to lower attrition.

4) JobSatisfaction (0.043): Higher job satisfaction might reduce attrition.

5) JobLevel (0.040): Higher job levels could be associated with lower attrition.
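
The `feature_importances_` values of a random forest are impurity-based: they rank features but do not say in which direction a feature moves the prediction, and they are computed from the training data. As a cross-check, scikit-learn's `permutation_importance` measures how much a score drops on held-out data when one feature is shuffled. The sketch below runs it on synthetic data; `X_demo`, `y_demo`, and `model` are hypothetical stand-ins for the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the employee features (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Rank features from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```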

In [37]:
# Assuming 'Attrition' is encoded as numeric values (1 for 'Yes', 0 for 'No')
correlation = df['StockOptionLevel'].corr(df['Attrition'])
print(f"Correlation between StockOptionLevel and Attrition: {correlation:.2f}")


Correlation between StockOptionLevel and Attrition: -0.14
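
A single correlation coefficient summarizes a linear trend, and StockOptionLevel only takes a few discrete values. A complementary view is the attrition rate within each level. The sketch below computes it on a hypothetical frame (`toy`); in the notebook the same groupby could be run on df once Attrition is encoded as 0/1.

```python
import pandas as pd

# Hypothetical rows; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'StockOptionLevel': [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
    'Attrition':        [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1,
# i.e. the attrition rate within each stock option level
rate_by_level = toy.groupby('StockOptionLevel')['Attrition'].mean()
print(rate_by_level)
```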
