### HR Employee Attrition dataset (https://www.kaggle.com/datasets/saurabhbadole/hr-employee-attrition) 
#### This is a historical employee dataset with a number of features about each employee. It provides information on employees within an organization, including their demographics, job-related factors, and attrition status. The dataset has 35 columns and 1470 rows. This is a classification problem, and I will fit a logistic regression model here.

In [None]:
# Needed imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# pip install scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [3]:
# load the dataset:
df = pd.read_csv("HR-Employee-Attrition.csv")

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


#### The target variable is the "Attrition" column, which records whether an employee left the company; it has binary Yes/No values. There are many other variables. I will choose the supporting variables based on their correlation with the "Attrition" column. First, I will clean up the dataset:

### Cleaning up the dataset:

In [None]:
# checking for duplicates:
# no duplicates.
df.duplicated().sum()

0

In [None]:
# no missing values:
df.isna().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
DistanceFromHome            0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

In [None]:
# checking the target variable.
# only Yes/No values. I will use LabelEncoder to convert the values of the column to 0 or 1.
# before applying LabelEncoder I will check the other variables that may also need it, so I can encode them all at once.
df['Attrition'].value_counts()

Attrition
No     1233
Yes     237
Name: count, dtype: int64
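
The counts above show that the target is imbalanced: roughly 84% of the rows are "No". As a minimal sketch, the class shares and the accuracy of a trivial "always predict No" model can be computed directly from those two counts. The Series below is typed in by hand from the output above rather than read from the CSV, so the block runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 1233, "Yes": 237}, name="count")

# Share of each class; on the real data this matches
# df["Attrition"].value_counts(normalize=True).
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = shares.max()

print(shares.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later for the logistic regression model is easier to judge against this baseline.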

In [None]:
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [10]:
# some columns are not useful for predicting attrition: "Over18" - all employees are adults, "EmployeeNumber" - just an identifier,
# "EmployeeCount" - constant value for this dataset; "EducationField" and "Department" are categorical columns,
# they could be useful, but we have enough other candidates for supporting variables; the "Education" column is a numerical
# feature that represents education level, but it may not influence attrition as strongly as salary or job satisfaction.
# I will drop those columns:

columns_to_drop = [
    'Over18',
    'EmployeeCount',
    'EmployeeNumber',
    'Department',
    'EducationField',
    "Education"
]

df = df.drop(columns_to_drop, axis=1)
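
Columns that are constant, or unique for every row, carry no signal for the model. A rough way to spot them is to count distinct values per column. The sketch below uses a small hand-made frame (the values are illustrative, not taken from the CSV), and the "unique per row" rule is only a heuristic: on real data it can also flag continuous features, so the result should be reviewed by hand.

```python
import pandas as pd

# Toy frame that mimics a few columns of the dataset (illustrative values).
toy = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "EmployeeCount": [1, 1, 1, 1],      # same value in every row
    "Over18": ["Y", "Y", "Y", "Y"],     # same value in every row
    "EmployeeNumber": [1, 2, 4, 5],     # different value in every row
})

# Number of distinct values in each column.
n_unique = toy.nunique()

# Constant columns: exactly one distinct value.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Identifier-like columns: as many distinct values as rows (heuristic only).
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)
```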

In [11]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,80,1,6,3,3,2,2,2,2


In [None]:
# now the dataset has three columns with categorical values, not counting the target column: "BusinessTravel", "JobRole" and "Gender".
# let's check the "BusinessTravel" column. It has three different values; I will keep it and convert the categories into numerical form
# using one-hot encoding. I kept this variable because frequent business travel is often associated with higher attrition due to stress and job demands.

df['BusinessTravel'].value_counts()

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

In [14]:
# use OneHotEncoder for the "BusinessTravel" column; it will create multiple columns with numeric values:
from sklearn.preprocessing import OneHotEncoder
variables = ["BusinessTravel"]
             
# use encoder:
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
one_hot_encoded = encoder.fit_transform(df[variables]).astype(int)
df = pd.concat([df,one_hot_encoded],axis=1).drop(columns=variables)

In [15]:
df.head()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,...,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely
0,41,Yes,1102,1,2,Female,94,3,2,Sales Executive,...,8,0,1,6,4,0,5,0,0,1
1,49,No,279,8,3,Male,61,2,2,Research Scientist,...,10,3,3,10,7,1,7,0,1,0
2,37,Yes,1373,2,4,Male,92,2,1,Laboratory Technician,...,7,3,3,0,0,0,0,0,0,1
3,33,No,1392,3,4,Female,56,3,1,Research Scientist,...,8,3,3,8,7,3,0,0,1,0
4,27,No,591,2,1,Male,40,3,1,Laboratory Technician,...,6,3,3,2,2,2,2,0,0,1


In [16]:
# next we will remove one of the new dummy variables, "BusinessTravel_Travel_Rarely". It is redundant:
# a row with zeroes in both remaining dummy columns already means "Travel_Rarely"
df = df.drop("BusinessTravel_Travel_Rarely", axis=1)
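
Why one dummy column can be dropped without losing information: each row has exactly one 1 across the three one-hot columns, so any one of them equals 1 minus the sum of the others. The sketch below shows this on a four-row toy column. It uses `pd.get_dummies` instead of the `OneHotEncoder` from the cell above, purely to keep the example short; the column names it produces follow the same `BusinessTravel_<value>` pattern.

```python
import pandas as pd

# Toy column with the three categories seen in the dataset.
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"]
})

# One 0/1 column per category.
dummies = pd.get_dummies(toy["BusinessTravel"], prefix="BusinessTravel").astype(int)

# Exactly one 1 per row, so the columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Drop one column, then rebuild it from the two that remain.
reduced = dummies.drop(columns="BusinessTravel_Travel_Rarely")
reconstructed = 1 - reduced.sum(axis=1)

print(dummies)
print(reconstructed.tolist())
```

Keeping all three columns would make them perfectly collinear with the intercept of a linear model, which is the usual reason for dropping one.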

In [None]:
df.head()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently
0,41,Yes,1102,1,2,Female,94,3,2,Sales Executive,...,0,8,0,1,6,4,0,5,0,0
1,49,No,279,8,3,Male,61,2,2,Research Scientist,...,1,10,3,3,10,7,1,7,0,1
2,37,Yes,1373,2,4,Male,92,2,1,Laboratory Technician,...,0,7,3,3,0,0,0,0,0,0
3,33,No,1392,3,4,Female,56,3,1,Research Scientist,...,0,8,3,3,8,7,3,0,0,1
4,27,No,591,2,1,Male,40,3,1,Laboratory Technician,...,1,6,3,3,2,2,2,2,0,0


In [None]:
# "JobRole" column. Job role can significantly impact attrition, but it has 9 different values, which is too many to one-hot encode here.
# I will drop this column.
df['JobRole'].value_counts()

JobRole
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: count, dtype: int64

In [21]:
# drop "JobRole" column: 
df = df.drop("JobRole", axis=1)

In [None]:
# "Gender" column, I will check unique options, should be just two but who knows...
df['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [23]:
# now we will use LabelEncoder to convert the values of the "Gender" and "Attrition" columns to 0 or 1:

from sklearn.preprocessing import LabelEncoder
# list of all boolean variables we want to convert
variables = ['Gender', 'Attrition']

# initialize the encoder and convert everything
encoder = LabelEncoder()
df[variables] = df[variables].apply(encoder.fit_transform)
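
`LabelEncoder` assigns codes in sorted order of the labels, so "No" becomes 0 and "Yes" becomes 1, and likewise "Female" becomes 0 and "Male" becomes 1. When a single encoder is re-fitted for each column, as in the cell above, its `classes_` attribute only remembers the last column. The sketch below is a variant, not the cell's exact approach: it keeps one encoder per column in a dictionary (the names `encoders` and `mappings` are my own) so the label-to-code mapping can be inspected afterwards. The toy frame is hand-made.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the two binary text columns.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# One fitted encoder per column, kept for later inspection.
encoders = {}
for column in ["Gender", "Attrition"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ is sorted; the position of a label is its integer code.
mappings = {
    column: {str(label): code for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}

print(mappings)
print(toy)
```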

In [24]:
df.head()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently
0,41,1,1102,1,2,0,94,3,2,4,...,0,8,0,1,6,4,0,5,0,0
1,49,0,279,8,3,1,61,2,2,2,...,1,10,3,3,10,7,1,7,0,1
2,37,1,1373,2,4,1,92,2,1,3,...,0,7,3,3,0,0,0,0,0,0
3,33,0,1392,3,4,0,56,3,1,3,...,0,8,3,3,8,7,3,0,0,1
4,27,0,591,2,1,1,40,3,1,2,...,1,6,3,3,2,2,2,2,0,0


#### Now, all columns contain only numeric values. In total, there are 29 columns. Next, we will check the data distribution and correlation to decide which columns to keep for the logistic regression model.
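
Because `df.head()` hides the middle columns behind `...`, it is worth confirming the "only numeric values" claim programmatically before modelling. A minimal sketch, on a hand-made frame: `select_dtypes(exclude="number")` lists every column that is still non-numeric. In the toy frame "OverTime" is deliberately left as text to show what the check reports; whether any such column remains in the real `df` should be confirmed by running the same call on it.

```python
import pandas as pd

# Toy frame: two numeric columns and one column deliberately left as text.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": [1, 0, 1],
    "OverTime": ["Yes", "No", "Yes"],   # illustrative leftover text column
})

# Columns whose dtype is not numeric; an empty list means the frame is ready.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print("non-numeric columns:", non_numeric)
print("total columns:", toy.shape[1])
```

If the list is not empty on the real data, those columns need encoding (or dropping) before they can be passed to `LogisticRegression`.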

### Visualising the Data.