<h1>1. Predicting Employee Attrition Using Logistic Regression</h1>
<h3><b>Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values if any.</li>
    <li>Encode categorical variables (e.g., one-hot encoding for department, gender, etc.).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b>Task:</b> Implement logistic regression to predict employee attrition and evaluate the model using precision, recall, and F1-score.</h3>



In [50]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

In [51]:
# Loading the dataset
employee_dataset = pd.read_csv('..\\..\\Datasets\\HREmployeeAttrition.csv')
print(employee_dataset.shape, '\n')
employee_dataset.head()

(1470, 35) 



Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [52]:
# Printing basic statistics of the dataset
employee_dataset.describe()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,2.912925,1.0,1024.865306,2.721769,65.891156,2.729932,2.063946,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,1.024165,0.0,602.024335,1.093082,20.329428,0.711561,1.10694,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [53]:
# Printing information of the dataset
employee_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [54]:
# Checking for the missing values in the dataset
employee_dataset.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

-> Since the dataset has no missing values, we can proceed to the next preprocessing step, i.e., <b>encoding categorical variables</b>.
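
-> For reference, had the check above reported missing values, one common option would be simple imputation. The block below is only a sketch on a hypothetical toy frame (the `toy` data and the median/mode strategy are assumptions, not part of this notebook's pipeline):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real dataset has no missing values.
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "Department": ["Sales", None, "Sales", "Human Resources"],
})

# Split columns by type: numbers get the median, everything else the mode.
numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.select_dtypes(exclude="number").columns

imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
for col in other_cols:
    # mode() returns the most frequent value(s); take the first one.
    imputed[col] = imputed[col].fillna(imputed[col].mode()[0])

print(imputed.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is why it is used here for the numerical column.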

<h3>2. Encoding Categorical Variables</h3>

In [55]:
# Identifying the categorical features in the dataset, datatype 'object'
categorical_features = employee_dataset.select_dtypes('object').columns
print("Categorical Features in the data:\n", categorical_features)

# Separating the numerical features
numerical_features = employee_dataset.drop(categorical_features, axis=1)
print("\nNumerical Features in the data:\n", numerical_features.columns)
print(numerical_features.shape)

Categorical Features in the data:
 Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')

Numerical Features in the data:
 Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')
(1470, 26)


In [56]:
# Printing the categories in each categorical column
for category in categorical_features:
    print(category, ': ', employee_dataset[category].nunique(), ': ', employee_dataset[category].unique())
    print(employee_dataset[category].value_counts(), '\n')

Attrition :  2 :  ['Yes' 'No']
Attrition
No     1233
Yes     237
Name: count, dtype: int64 

BusinessTravel :  3 :  ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64 

Department :  3 :  ['Sales' 'Research & Development' 'Human Resources']
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64 

EducationField :  6 :  ['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree'
 'Human Resources']
EducationField
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: count, dtype: int64 

Gender :  2 :  ['Female' 'Male']
Gender
Male      882
Female    588
Name: count, dtype: int64 

JobRole :  9 :  ['Sales Executive' 'Research Scientist' 'Laboratory Technician'
 'Manufacturing Director' 'Healthcare R

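-> One-hot encoding would add one column per category for every categorical feature, which widens the dataset, so label encoding is used here to keep a single column per feature. Note that label encoding assigns arbitrary integer codes, and logistic regression will treat those codes as an ordering even for nominal features such as JobRole.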
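
-> To put a number on the size difference between the two encodings, the sketch below counts the resulting feature columns. The category counts are copied from this notebook's value_counts outputs, and the `toy` frame is a hypothetical illustration of `pd.get_dummies`:

```python
import pandas as pd

# Category counts per categorical feature, taken from the notebook outputs
# (Attrition is the target, so it is left out).
n_categories = {
    "BusinessTravel": 3, "Department": 3, "EducationField": 6, "Gender": 2,
    "JobRole": 9, "MaritalStatus": 3, "Over18": 1, "OverTime": 2,
}
n_numeric = 26  # numerical columns reported by the notebook

# Label encoding keeps one column per feature; one-hot adds one per category.
label_encoded_cols = n_numeric + len(n_categories)
one_hot_cols = n_numeric + sum(n_categories.values())
print(label_encoded_cols, one_hot_cols)

# Toy illustration of one-hot encoding on a single column.
toy = pd.DataFrame({"Department": ["Sales", "Research & Development", "Human Resources"]})
one_hot = pd.get_dummies(toy, columns=["Department"])
print(one_hot.columns.tolist())
```

Under these counts, label encoding gives 34 feature columns and one-hot encoding gives 55.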

In [57]:
# Applying Label encoding
encoder = LabelEncoder()
encoded_features_df = employee_dataset.copy()

# Apply LabelEncoder to each categorical feature
for feature in categorical_features:
    encoded_features_df[feature] = encoder.fit_transform(employee_dataset[feature])

employee_dataset = encoded_features_df.copy()
employee_dataset

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,2,1102,2,1,2,1,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,0,1,279,1,8,1,1,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1,2,1373,1,2,2,4,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,0,1,1392,1,3,4,1,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,0,2,591,1,2,1,3,1,7,...,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,0,1,884,1,23,2,3,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,0,2,613,1,6,1,3,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,0,2,155,1,4,3,1,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,0,1,1023,2,2,3,3,1,2065,...,4,80,0,17,3,2,9,6,0,8


In [58]:
# Printing the categories in each categorical column after encoding
for category in employee_dataset[categorical_features]:
    print(category, ': ', employee_dataset[category].nunique(), ': ', employee_dataset[category].unique())
    print(employee_dataset[category].value_counts(), '\n')

Attrition :  2 :  [1 0]
Attrition
0    1233
1     237
Name: count, dtype: int64 

BusinessTravel :  3 :  [2 1 0]
BusinessTravel
2    1043
1     277
0     150
Name: count, dtype: int64 

Department :  3 :  [2 1 0]
Department
1    961
2    446
0     63
Name: count, dtype: int64 

EducationField :  6 :  [1 4 3 2 5 0]
EducationField
1    606
3    464
2    159
5    132
4     82
0     27
Name: count, dtype: int64 

Gender :  2 :  [0 1]
Gender
1    882
0    588
Name: count, dtype: int64 

JobRole :  9 :  [7 6 2 4 0 3 8 5 1]
JobRole
7    326
6    292
2    259
4    145
0    131
3    102
8     83
5     80
1     52
Name: count, dtype: int64 

MaritalStatus :  3 :  [2 1 0]
MaritalStatus
1    673
2    470
0    327
Name: count, dtype: int64 

Over18 :  1 :  [0]
Over18
0    1470
Name: count, dtype: int64 

OverTime :  2 :  [1 0]
OverTime
0    1054
1     416
Name: count, dtype: int64 



In [59]:
# Printing info of the dataset
employee_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   int64
 2   BusinessTravel            1470 non-null   int64
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   int64
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   int64
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   int64
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null 

-> All the object features are now integer-encoded, so the categorical variables have been successfully encoded.
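
-> The integer codes follow the sorted order of the labels, which can be checked through the encoder's `classes_` attribute. A minimal sketch on a few BusinessTravel values (the `values` list is a hand-picked sample, not the full column):

```python
from sklearn.preprocessing import LabelEncoder

# A small sample of BusinessTravel labels from the dataset.
values = ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because the loop above reuses one encoder for every column, only the last column's `classes_` is retained afterwards; keeping one encoder per column (for example in a dict) would preserve every mapping.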

<h3>3. Standardize numerical features</h3>

In [60]:
# Applying StandardScaler to the numerical features
scaler = StandardScaler()
numerical_features_scaled = scaler.fit_transform(numerical_features)

# Creating a dataframe for the scaled numerical features
numerical_features_scaled_df = pd.DataFrame(numerical_features_scaled, columns=numerical_features.columns)
# Replacing the original numerical columns with their scaled versions
employee_dataset = employee_dataset.drop(columns=numerical_features.columns)
employee_dataset = pd.concat([employee_dataset, numerical_features_scaled_df], axis=1)
employee_dataset

Unnamed: 0,Attrition,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime,Age,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,2,2,1,0,7,2,0,1,0.446350,...,-1.584178,0.0,-0.932014,-0.421642,-2.171982,-2.493820,-0.164613,-0.063296,-0.679146,0.245834
1,0,1,1,1,1,6,1,0,0,1.322365,...,1.191438,0.0,0.241988,-0.164511,0.155707,0.338096,0.488508,0.764998,-0.368715,0.806541
2,1,2,1,4,1,2,2,0,1,0.008343,...,-0.658973,0.0,-0.932014,-0.550208,0.155707,0.338096,-1.144294,-1.167687,-0.679146,-1.155935
3,0,1,1,1,0,6,1,0,1,-0.429664,...,0.266233,0.0,-0.932014,-0.421642,0.155707,0.338096,0.161947,0.764998,0.252146,-1.155935
4,0,2,1,3,1,2,1,0,0,-1.086676,...,1.191438,0.0,0.241988,-0.678774,0.155707,0.338096,-0.817734,-0.615492,-0.058285,-0.595227
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,0,1,1,3,1,2,1,0,0,-0.101159,...,0.266233,0.0,0.241988,0.735447,0.155707,0.338096,-0.327893,-0.615492,-0.679146,-0.314873
1466,0,2,1,3,1,0,1,0,0,0.227347,...,-1.584178,0.0,0.241988,-0.293077,1.707500,0.338096,-0.001333,0.764998,-0.368715,0.806541
1467,0,2,1,1,1,4,1,0,1,-1.086676,...,-0.658973,0.0,0.241988,-0.678774,-2.171982,0.338096,-0.164613,-0.615492,-0.679146,-0.314873
1468,0,1,2,3,1,7,1,0,0,1.322365,...,1.191438,0.0,-0.932014,0.735447,0.155707,-1.077862,0.325228,0.488900,-0.679146,1.086895


In [61]:
# Printing basic statistics of the dataset after scaling
employee_dataset.describe().round(3)

Unnamed: 0,Attrition,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime,Age,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,0.161,1.607,1.261,2.248,0.6,4.459,1.097,0.0,0.283,-0.0,...,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0
std,0.368,0.665,0.528,1.331,0.49,2.462,0.73,0.0,0.451,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.072,...,-1.584,0.0,-0.932,-1.45,-2.172,-2.494,-1.144,-1.168,-0.679,-1.156
25%,0.0,1.0,1.0,1.0,0.0,2.0,1.0,0.0,0.0,-0.758,...,-0.659,0.0,-0.932,-0.679,-0.62,-1.078,-0.654,-0.615,-0.679,-0.595
50%,0.0,2.0,1.0,2.0,1.0,5.0,1.0,0.0,0.0,-0.101,...,0.266,0.0,0.242,-0.165,0.156,0.338,-0.328,-0.339,-0.369,-0.315
75%,0.0,2.0,2.0,3.0,1.0,7.0,2.0,0.0,1.0,0.665,...,1.191,0.0,0.242,0.478,0.156,0.338,0.325,0.765,0.252,0.807
max,1.0,2.0,2.0,5.0,1.0,8.0,2.0,0.0,1.0,2.527,...,1.191,0.0,2.59,3.692,2.483,1.754,5.387,3.802,3.977,3.61


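-> The scaled numerical features now have a mean of 0 and a standard deviation of 1, except for constant columns such as StandardHours, whose standard deviation stays 0. The label-encoded categorical columns were left unscaled. The dataset is now ready to be used for model training.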
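
-> As a sanity check on what StandardScaler computes, the sketch below compares it with a manual z-score, z = (x - mean) / std, on a toy array. The array is an assumption made for illustration: the first column holds the five ages from the head of the dataset, the second mimics a constant column like StandardHours.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: five ages plus a constant column (always 80).
X = np.array([[41.0, 80.0], [49.0, 80.0], [37.0, 80.0], [33.0, 80.0], [27.0, 80.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
manual_age = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std(ddof=0)

print(np.allclose(scaled[:, 0], manual_age))
print(scaled[:, 1])  # a zero-variance column is centered to 0, not divided by 0
```

Note that `describe()` reports the sample standard deviation (ddof=1), so the values it shows are only approximately 1 before rounding.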

<h2>Model Training</h2>

In [62]:
# Separating the features and target variable
X = employee_dataset.drop('Attrition', axis=1)
Y = employee_dataset['Attrition']

# Splitting the dataset into training and testing sets in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [63]:
# Applying the model
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)

In [64]:
# Predicting the target variable
Y_pred = lr_model.predict(X_test)
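
-> A possible variation on the steps above, not the approach used in this notebook: split first and let a pipeline fit the scaler on the training rows only, so that no test-set statistics influence preprocessing. The sketch uses synthetic data from `make_classification` as a stand-in; the `stratify` and `max_iter` settings are assumptions chosen for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attrition data: 1470 rows, roughly 16% positives.
X, y = make_classification(n_samples=1470, n_features=10, weights=[0.84], random_state=42)

# stratify=y keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when transforming the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred.shape)
```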

<h2>Model Evaluation</h2>

<h3>1. Precision Score</h3>

In [65]:
# Calculating the precision of the model
precision = precision_score(Y_test, Y_pred)
print('Precision of the model:', precision)

Precision of the model: 0.7368421052631579


<h3>2. Recall Score</h3>

In [66]:
# Calculating recall score of the model
recall = recall_score(Y_test, Y_pred)
print('Recall score of the model:', recall)

Recall score of the model: 0.358974358974359


<h3>3. F1 Score</h3>

In [67]:
# Calculating f1 score of the model
f1 = f1_score(Y_test, Y_pred)
print('F1 score of the model:', f1) 

F1 score of the model: 0.4827586206896552


<h3>The Classification Report</h3>

In [68]:
print('Classification Report:\n', classification_report(Y_test, Y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.98      0.94       255
           1       0.74      0.36      0.48        39

    accuracy                           0.90       294
   macro avg       0.82      0.67      0.71       294
weighted avg       0.89      0.90      0.88       294



-> The logistic regression model reaches fairly high precision (0.74) but low recall (0.36) on the attrition class: most of the employees it flags as leaving do leave, but it detects only about a third of the actual attritions. The F1-score of 0.48, the harmonic mean of precision and recall, is pulled down by the low recall. The overall accuracy of 0.90 is driven mainly by the majority non-attrition class (recall of 0.98 on 255 of the 294 test cases), so it overstates how well attrition itself is predicted. Further tuning or alternative modeling approaches may help capture more attrition cases.
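
-> The three scores can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the reported scores and the class supports (39 attrition and 255 non-attrition test cases), so treat them as a reconstruction:

```python
# Inferred counts for the attrition class (label 1).
tp = 14   # attritions correctly predicted (recall 14/39)
fp = 5    # non-attritions flagged as attrition (precision 14/19)
fn = 25   # attritions the model missed (39 - 14)
tn = 250  # non-attritions correctly predicted (255 - 5)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

With these counts, 25 of the 39 actual attritions go undetected, which is what the low recall reflects.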

<hr>