# **IBM HR Analytics Employee Attrition & Performance**

## 1. Brainstorm

This dataset contains detailed information about employees, such as their age, job role, income, experience, satisfaction levels, and work environment.
It is taken from Kaggle’s “IBM HR Analytics Employee Attrition & Performance” dataset and is used to analyze and predict employee attrition (whether an employee will leave the company or stay).

## 2. Dataset
**Source**: IBM HR Analytics Employee Attrition & Performance

**Key Columns:** EmployeeNumber, Age, Gender, Department, JobRole, MonthlyIncome, YearsAtCompany, JobSatisfaction, WorkLifeBalance, PerformanceRating, Attrition (target).

**Timeline:** Contains employee records collected over multiple years across different departments.

**Location:** Kaggle

## 3. Problem Statement

Employee turnover is a major challenge for organizations. When employees leave frequently, it affects productivity, team morale, and company performance.
Predicting which employees are likely to leave can help HR teams take preventive actions.
This project uses machine learning models to predict employee attrition based on various HR and performance-related factors.

## 4. Domain

Human Resources Analytics / Employee Retention

## 5. Objective

The primary objective of this project is to analyze employee-related data to understand the factors influencing attrition and performance, and to build predictive models that help organizations proactively manage workforce challenges.


## 6. Outcome

I aim to gain a comprehensive understanding of the factors that influence employee attrition and performance. I will build predictive models capable of identifying employees who are at risk of leaving, analyze the key drivers of attrition, and derive actionable insights that can help HR managers design effective retention strategies.

**Sample Input:**  Age, Gender, Department, JobRole, MonthlyIncome, DistanceFromHome, YearsAtCompany, OverTime, JobSatisfaction, EducationField, MaritalStatus

**Sample Output:** Predicted Employee Attrition Status


## 7. Algorithms
This is a classification problem, as the goal is to predict whether an employee will leave the company. I therefore plan to apply the following supervised classification algorithms to build predictive models and identify the factors influencing employee attrition.

* Logistic Regression
* Decision Tree Classifier
* Random Forest Classifier
* Support Vector Machine (SVM / SVC)
* Naïve Bayes
* K-Nearest Neighbors (KNN)






## EDA Process

Library Importing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Loading the Dataset

In [None]:
Ep_df=pd.read_csv('https://raw.githubusercontent.com/Sasi2528/IBM-HR-Analytics-Employee-Attrition-Performance/refs/heads/main/WA_Fn-UseC_-HR-Employee-Attrition.csv')
Ep_df

Finding the First Five Rows

In [None]:
Ep_df.head()


Finding the Last Five Rows

In [None]:
Ep_df.tail()

Finding the Count of Columns & Rows

In [None]:
Ep_df.shape
print("Count of Rows:", Ep_df.shape[0])
print("Count of Columns:", Ep_df.shape[1])

### Basic Information
Finding the data types

In [None]:
Ep_df.info()

### Finding the datatype

In [None]:
print(Ep_df.dtypes)

Finding the column Names


In [None]:
print('Column',Ep_df.columns)

### Finding the categorical columns

In [None]:
Categorical_cols=Ep_df.select_dtypes(include=['object']).columns
Categorical_cols


### Finding the Numerical Columns

In [None]:
# 'number' matches every integer and float width, regardless of platform
Numerical_cols=Ep_df.select_dtypes(include='number').columns
Numerical_cols
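
The dtype-based split used in the two cells above can be tried on a tiny frame first. The sketch below is only an illustration: the rows are invented and only the column names are borrowed from the dataset. It treats every non-numeric column as categorical, which is an assumption that happens to hold for this kind of HR table.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993.0, 5130.0, 2090.0],
    "Department": ["Sales", "Research & Development", "Research & Development"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Numeric columns: any integer or float dtype.
numerical = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here.
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```

Selecting by `exclude="number"` avoids having to name the exact string dtype, which can differ between pandas versions.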

## Interpretation

In this project, the IBM HR Analytics Employee Attrition and Performance dataset is first loaded for detailed analysis. The primary objective of this dataset is to understand the various factors that influence employee attrition — whether an employee stays in the company or decides to leave.

After loading the dataset, its structure is carefully examined to understand the total number of rows, columns, and data types of each feature. The dataset contains 1,470 rows and 35 columns, where each row represents an employee record, and each column represents a specific attribute related to employee demographics, job satisfaction, compensation, and work environment.

# Stage 2

## EDA (Visualization) and Pre-processing

### Finding the Missing Values

In [None]:
Ep_df.isnull()

### Finding the sum of Null Values

In [None]:
Ep_df.isnull().sum()

### Handling Duplicates

In [None]:
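# duplicated() flags every row that repeats an earlier row; the sum is the number of duplicate rows
Ep_df_Duplicate = Ep_df.duplicated()
print("Number of duplicate rows:", Ep_df_Duplicate.sum())

There are no duplicate rows in the dataset.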
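
Missing values and duplicates are easier to judge as counts than as full boolean tables. A minimal sketch on invented data (the frame `toy` is hypothetical, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age and one fully repeated row (rows 1 and 2).
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Age": [41, 49, 49, np.nan],
    "Attrition": ["Yes", "No", "No", "No"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
```

If duplicates were found, `toy.drop_duplicates()` would return a copy without them.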

**Finding Outliers**

Before Removing Outlier Dataset Shape

In [None]:
numerical_cols = Ep_df.select_dtypes(include=[np.number]).columns
print('Before removing outlier Dataset shape',Ep_df.shape)

In [None]:
numeric_cols = Ep_df.select_dtypes(include=['int64', 'float64']).columns
print("Numeric columns:", numeric_cols)

In [None]:
plt.figure(figsize=(14,5))
sns.boxplot(data=Ep_df[numeric_cols])
plt.title(" Before Removing Outlier")
plt.xticks(rotation=60)
plt.show()

In [None]:
for col in Ep_df.select_dtypes(include=['int','float64']):
  Q1=Ep_df[col].quantile(0.25)
  Q3=Ep_df[col].quantile(0.75)
  IQR=Q3-Q1
  print(f"{col}:IQR-{IQR}")

In [None]:
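# Q1 and IQR still hold the values from the final loop iteration,
# so these bounds apply only to the last numeric column (`col`).
lower_bound = Q1 - 1.5 * IQR
lower_bound

In [None]:
# Upper bound for the same column
Upper_bound = Q3 + 1.5 * IQR
Upper_bound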

## Removing outliers

In [None]:
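# Keep only the rows whose value in `col` (the last numeric column) lies within the IQR bounds
Ep_df_Removed_Outliers = Ep_df[(Ep_df[col] >= lower_bound) & (Ep_df[col] <= Upper_bound)]
Ep_df_Removed_Outliers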

After Removing Outlier Dataset Shape

In [None]:
print('After removing outlier Dataset shape',Ep_df_Removed_Outliers.shape)
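
The cells above compute the bounds for, and filter on, a single column. To apply the same IQR rule to several columns at once, one option is sketched below. The helper name `remove_iqr_outliers` and the toy data are assumptions made for illustration; this is not the notebook's own method.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own bounds (the Series align on column names).
    within = df[columns].ge(lower, axis=1) & df[columns].le(upper, axis=1)
    return df[within.all(axis=1)]

# Toy data: one extreme income value, no extreme tenure value.
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 2500, 3000, 3500, 4000, 50000],
    "YearsAtCompany": [1, 2, 3, 4, 5, 6],
})
cleaned = remove_iqr_outliers(toy, ["MonthlyIncome", "YearsAtCompany"])
print(toy.shape, "->", cleaned.shape)
```

One caveat: a column whose IQR is zero (for example a rating that is almost always the same value) would lose every row that differs from that value, so the column list should be chosen deliberately.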

Boxplot After Removing Outliers for the Numerical Column

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=Ep_df_Removed_Outliers[col])
plt.title("After Removing Outliers")
plt.show()

**Finding Skewness**

In [None]:
Ep_df_Skewness = Ep_df[numeric_cols].skew()
print(Ep_df_Skewness)

Skewness Visualization

In [None]:
plt.figure(figsize=(14,7))
sns.kdeplot(data=Ep_df_Skewness,fill=True)
plt.title("Skewness")
plt.show()
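
Beyond plotting, the skewness values can be used to flag columns that may benefit from a transformation. The sketch below uses invented data, and the cut-off of |skew| > 1 is a common rule of thumb adopted here as an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy data: income has one very large value, age is symmetric.
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 2200, 2500, 2700, 3000, 3200, 19000],
    "Age": [25, 30, 35, 40, 45, 50, 55],
})

skewness = toy.skew()
# Flag columns whose absolute skewness exceeds the assumed threshold of 1.
highly_skewed = skewness[skewness.abs() > 1].index.tolist()
print(skewness)
print("Highly skewed:", highly_skewed)

# log1p compresses a long right tail; it is only valid for non-negative values.
transformed = np.log1p(toy[highly_skewed])
print(transformed.skew())
```

On this toy column the log transform reduces the skewness but does not remove it, which is typical when a single extreme value dominates.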

# Visualization

Univariate

In [None]:
sns.histplot(Ep_df['YearsWithCurrManager'], kde=True, color='skyblue')
plt.title(f"Histogram & KDE of YearsWithCurrManager")
plt.show()
print(Ep_df['YearsWithCurrManager'].describe())

**Interpretation:**

The histogram of YearsWithCurrManager indicates that most employees have spent relatively few years with their current manager. Only a small proportion have stayed with the same manager for many years.

In [None]:
sns.boxplot(Ep_df['JobInvolvement'])
plt.title("Boxplot of JobInvolvement")
plt.show()
print(Ep_df['JobInvolvement'].describe())

**Interpretation:**

The JobInvolvement represents how actively employees are engaged in their work and how much they identify with their job role.


In [None]:
# countplot shows the number of employees in each gender category
sns.countplot(x=Ep_df['Gender'], color='blue')
plt.title("Count Plot of Gender")
plt.show()
print(Ep_df['Gender'].describe())

**Interpretation:**

The count plot of Gender shows the distribution of employees by gender within the organization. The plot indicates that the organization has more male employees than female employees.

In [None]:
# A boxplot suits this continuous variable and matches the title below
sns.boxplot(x=Ep_df['MonthlyIncome'],color='blue')
plt.title("Boxplot of Monthly Income")
plt.show()
print(Ep_df['MonthlyIncome'].describe())

**Interpretation:**

MonthlyIncome indicates that most employees earn within a moderate income range, while a few employees have significantly higher monthly incomes.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='YearsAtCompany', data=Ep_df,color='#808000')
plt.title("YearsAtCompany",color='#808000')
plt.show()


**Interpretation:**

YearsAtCompany illustrates the number of employees grouped by their total years of service in the organization.
From the visualization, it can be observed that the majority of employees have relatively fewer years of experience within the company, while the number of employees gradually decreases as years of service increase.

Bivariate

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Attrition', y='MonthlyIncome', data=Ep_df,hue='Attrition',palette='Set2',legend=False)
plt.title('Attrition vs. Monthly Income')
plt.show()

**Interpretation:**

This boxplot compares the monthly income distribution of employees who left (Attrition = Yes) with that of employees who stayed (Attrition = No).
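
A grouped summary can put numbers next to a plot like this one. The sketch below uses invented values in a hypothetical frame `toy`; on the real data the same `groupby` call would be applied to `Ep_df`.

```python
import pandas as pd

# Toy data; incomes are invented for illustration only.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "MonthlyIncome": [2100, 5200, 6100, 2900, 4800, 9900],
})

# Count, median and mean income for each attrition group.
income_by_attrition = toy.groupby("Attrition")["MonthlyIncome"].agg(["count", "median", "mean"])
print(income_by_attrition)
```

The median is reported alongside the mean because income is usually right-skewed, so a few high earners can pull the mean upward.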

In [None]:
plt.figure(figsize=(15,6))
sns.scatterplot(data=Ep_df,x='YearsAtCompany',y='MonthlyIncome',hue='PerformanceRating',palette='viridis')
plt.title("Performance Rating vs Monthly Income over Years at Company")
plt.show()


**Interpretation:**

It can be observed that employees with higher performance ratings generally earn higher monthly incomes, especially as their tenure increases.
This trend suggests that performance and experience play a significant role in salary growth within the organization.

In [None]:
sns.barplot(x='Attrition',y='JobSatisfaction',data=Ep_df,hue='Attrition',palette='coolwarm',legend=False)
plt.title('Attrition vs. JobSatisfaction')
plt.show()

**Interpretation:**

This bar plot compares the mean JobSatisfaction score of employees who left with that of employees who stayed.

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(Ep_df[Ep_df['Attrition'] == 'Yes']['Age'], color='red', label='Left', kde=True)
sns.histplot(Ep_df[Ep_df['Attrition'] == 'No']['Age'], color='green', label='Stayed', kde=True)
plt.title('Attrition vs. Age')
plt.legend()
plt.show()

**Interpretation:**

The overlaid histograms compare the age distribution of employees who left (Attrition = Yes) with that of employees who stayed (Attrition = No).

In [None]:
plt.figure(figsize=(15,6))
sns.violinplot(data=Ep_df,x='YearsSinceLastPromotion',y='PerformanceRating',hue='Attrition', palette='Set2')
plt.title('Years Since Last Promotion vs Performance Rating by Attrition')
plt.show()

**Interpretation:**

The violin plot shows the relationship between the number of years since an employee’s last promotion and their performance rating, split by attrition status.
Employees who have not been promoted for many years tend to have lower performance ratings, and some of these employees also leave the company.
This suggests that promotion opportunities and recognition of performance could influence employee retention.


### Multivariate

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data=Ep_df, vars=numerical_cols, hue='Attrition', palette='Set1', diag_kind='kde')
plt.tight_layout()
plt.show()


**Interpretation:**

The pairplot reveals that attrition is more common among younger employees with lower income and fewer years at the company. Distance from home also appears to have a mild effect, with employees living farther away showing a slightly higher tendency to leave. These findings highlight that demographic and financial factors jointly influence employee attrition patterns.

In [None]:
print(Ep_df.info())

In [None]:
print(Ep_df.describe())

In [None]:
Ep_df.select_dtypes(include='object').nunique()

# Stage 3

## Feature Engineering

**Encoding the categorical Column**

In [None]:
Ep_df['Attrition']=Ep_df['Attrition'].map({'Yes':1, 'No':0}).astype(int)

In [None]:
Ep_df['Attrition'].unique().tolist()

In [None]:
Ep_df['Attrition'].info()

In [None]:
for col in Ep_df.select_dtypes(include=['object']).columns:
  print(f"{col}:{Ep_df[col].unique().tolist()}")

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoding = OneHotEncoder(sparse_output=False)
categorical_encoding=encoding.fit_transform(Ep_df[['BusinessTravel','Department','EducationField','Gender',
                                                   'JobRole','MaritalStatus','Over18','OverTime']])
categorical_encoding

In [None]:
Ep_df_encoding=pd.DataFrame(categorical_encoding,columns=encoding.get_feature_names_out())
encoding_names=Ep_df_encoding.columns
print("Encoding Names:",encoding_names)

In [None]:
Ep_df_dropped=Ep_df.drop(['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus','Over18','OverTime'],axis=1)
# Append every one-hot encoded column to the remaining features
FEP_df=pd.concat([Ep_df_dropped,Ep_df_encoding],axis=1)

In [None]:
FEP_df.info()

In [None]:
FEP_df.head(10)

In [None]:
FEP_df.describe()

**Feature Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
Scaler=StandardScaler()

In [None]:
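# Standardize the original numeric features
scaled_numeric=Scaler.fit_transform(FEP_df[numeric_cols])
scaled_numeric_df=pd.DataFrame(scaled_numeric,columns=numeric_cols,index=FEP_df.index)
# Combine the scaled features with the remaining columns (Attrition and the one-hot columns),
# so that no column name appears twice
FEP_df_process=pd.concat([scaled_numeric_df,FEP_df.drop(columns=numeric_cols)],axis=1)
FEP_df_process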


In [None]:
FEP_df_process['Attrition'].info()

**Feature Selection**

In [None]:
corr_matrix=FEP_df_process.corr()
corr_matrix

In [None]:
FEP_df_process.shape

In [None]:
FEP_df_process = FEP_df_process.loc[:, ~FEP_df_process.columns.duplicated()]

In [None]:
X=FEP_df_process.drop(['Attrition', 'EmployeeCount','Over18_Y'],axis=1)
y=FEP_df_process['Attrition'].astype(int)


In [None]:
FEP_df['Attrition'].info()

**Interpretation:**

I chose “Attrition” as the target because it is the main outcome variable: it represents the problem we want to solve, namely predicting whether an employee stays or leaves.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_selection import SelectKBest,f_classif
select=SelectKBest(score_func=f_classif,k=10)
select.fit(X,y)


In [None]:
x_new=select.fit_transform(X,y)
Top_features=X.columns[select.get_support()]
print(Top_features)
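
Besides the names of the selected features, `SelectKBest` also exposes the score of every feature through `scores_`, which helps to see how clear the cut-off is. A minimal sketch on synthetic data (the column names `informative`, `noise_a`, and `noise_b` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = np.array([0] * 50 + [1] * 50)
# One feature shifted by the class label, two pure-noise features.
X_toy = pd.DataFrame({
    "informative": y_toy * 2.0 + rng.normal(size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)
# ANOVA F-score per feature, highest first.
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
selected = X_toy.columns[selector.get_support()].tolist()

print(scores)
print("Selected:", selected)
```

Fitting the selector on the training split only, rather than on all rows, would avoid information from the test rows influencing which features are kept.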

In [None]:
X = X[Top_features]
y=FEP_df_process['Attrition'].astype(int)

**Model Building**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
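X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
# Standardize using statistics from the training split only.
# Note: the models below are fitted on X_train/X_test, not on these scaled copies.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)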
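
Two optional refinements to the split above are sketched here on random toy data: `stratify=y` keeps the class proportions similar in both splits, and wrapping the scaler and the model in a pipeline ensures the scaler is fitted on the training rows only. The arrays `X_toy` and `y_toy` are invented, and Logistic Regression is used purely as an example estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (rng.random(200) < 0.2).astype(int)  # roughly 20% positives

# stratify keeps the share of positives similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The pipeline fits the scaler on the training data and reuses it at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

print("Positive share train/test:", y_tr.mean(), y_te.mean())
print("Test accuracy:", model.score(X_te, y_te))
```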

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report,ConfusionMatrixDisplay,f1_score

In [None]:
classifiers={
    'Logistic Regression':LogisticRegression(),
    'K-Nearest Neighbours': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Classifier': SVC(),
    'Naive Bayes': GaussianNB(),
    'XGBoost': XGBClassifier()

}

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Stage 4

**Compare the Models**

In [None]:
results={}
for name, clf in classifiers.items():
  clf.fit(X_train,y_train)
  y_pred=clf.predict(X_test)
  accuracy=accuracy_score(y_test,y_pred)
  f1=f1_score(y_test,y_pred)
  results[name]={'Accuracy':accuracy, 'F1 Score':f1}

results_df=pd.DataFrame(results).T
results_df

**Interpretation:**

    
* Support Vector Classifier (SVC) achieved the highest accuracy, meaning it classifies the most samples correctly overall. However, its F1 score is quite low, which suggests it is biased toward the majority class.

* Naïve Bayes, despite a slightly lower accuracy, achieved the highest F1 score. This indicates it is better balanced and likely more effective at detecting the minority (attrition) class.

* Logistic Regression and KNN perform moderately well on both metrics, offering a stable trade-off between accuracy and F1.

* Decision Tree and XGBoost show lower overall accuracy and F1, possibly due to overfitting or lack of hyperparameter tuning.
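
The gap between accuracy and F1 described above is easy to reproduce with a worked example. The sketch below scores a trivial "predictor" that always outputs the majority class; the 84/16 label split is made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 84 employees stayed (0), 16 left (1).
y_true = np.array([0] * 84 + [1] * 16)
# A baseline that predicts "stays" for everyone.
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
f1 = f1_score(y_true, y_pred_majority, zero_division=0)
rec = recall_score(y_true, y_pred_majority, zero_division=0)

print(f"Accuracy: {acc:.2f}  F1: {f1:.2f}  Recall: {rec:.2f}")
```

The baseline reaches 0.84 accuracy while finding none of the leavers, which is why F1 and recall on the attrition class are worth reading alongside accuracy.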

In [None]:
plt.figure(figsize=(10,6))
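# Assign hue explicitly so the palette is applied without a seaborn deprecation warning
sns.barplot(data=results_df.reset_index(), x='index', y='Accuracy', hue='index', palette="coolwarm", legend=False)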
plt.title("Model Accuracy Comparison")
plt.xlabel("Classifier", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Using KFold cross validation**

In [None]:
from sklearn.model_selection import KFold, cross_val_score
k=5
kf=KFold(n_splits=k, shuffle=True, random_state=4)
print(f"{'Classifier':<25}{'Mean Accuracy':<20}{'Mean F1 Score':<20}")
for name, clf in classifiers.items():
  # Use separate names so the imported accuracy_score and f1_score functions are not overwritten
  acc_scores=cross_val_score(clf,X,y,cv=kf,scoring='accuracy')
  f1_scores=cross_val_score(clf,X,y,cv=kf,scoring='f1_weighted')
  print(f"{name:<25}{acc_scores.mean():<20.4f}{f1_scores.mean():<20.4f}")

**Interpretation**

* All models show reasonably high accuracy and weighted F1 scores across the 5 folds.

* Accuracy and F1 are close here because the F1 is computed with `f1_weighted`, which weights each class by its support and is therefore dominated by the majority class. This closeness does not by itself indicate that the dataset is balanced.

* Support Vector Classifier (SVC) has the highest mean accuracy, making it the most accurate model on average across the folds.

* Random Forest delivers the best weighted F1 score, indicating a slightly better balance of precision and recall across both classes.

* K-Nearest Neighbours and XGBoost also perform strongly, close to the top performers.
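
To see how the choice of F1 variant affects the picture, several scorers can be collected in one pass with `cross_validate`. The sketch below is an illustration on synthetic data; it also swaps in `StratifiedKFold`, which keeps the class ratio similar in every fold, as an assumption rather than the notebook's own setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 160 negatives, 40 positives, mild class separation.
y_toy = np.array([0] * 160 + [1] * 40)
X_toy = rng.normal(size=(200, 4)) + y_toy[:, None] * 0.8

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
metrics = ["accuracy", "f1", "f1_weighted"]
cv_results = cross_validate(LogisticRegression(), X_toy, y_toy, cv=cv, scoring=metrics)

# "f1" scores the positive class only; "f1_weighted" averages over classes by support.
for metric in metrics:
    print(f"{metric:<12} {cv_results[f'test_{metric}'].mean():.4f}")
```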

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid={
    'C':[0.1,1,10,100],
    'gamma':['scale','auto',0.1,1,10],
    'kernel':['rbf','linear']
}
grid_search=GridSearchCV(estimator=SVC(),param_grid=param_grid,refit=True,verbose=2,cv=5)
grid_search.fit(X_train,y_train)
print("Best parameters found:")
print(grid_search.best_params_)
best_svm_model=grid_search.best_estimator_

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

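# Only valid solver/penalty pairs are listed, so no fits fail;
# 'elasticnet' is left out because it additionally requires l1_ratio
common = {'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 500, 1000]}
param_grid_lr = [
    {'solver': ['lbfgs'], 'penalty': ['l2', None], **common},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], **common},
    {'solver': ['saga'], 'penalty': ['l1', 'l2', None], **common}
]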
log_reg = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=log_reg, param_grid=param_grid_lr,refit=True, verbose=2, cv=5, n_jobs=-1)

grid_search_lr.fit(X_train, y_train)
print("Best parameters for Logistic Regression:")
print(grid_search_lr.best_params_)

best_log_model = grid_search_lr.best_estimator_

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid_rf = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
rf = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf,
                              refit=True, verbose=2, cv=5, n_jobs=-1)

grid_search_rf.fit(X_train, y_train)

print("Best parameters for Random Forest:")
print(grid_search_rf.best_params_)

best_rf_model = grid_search_rf.best_estimator_
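
By default `GridSearchCV` ranks parameter settings by the estimator's own score, which is accuracy for classifiers. When the minority class matters more, the `scoring` argument can be set to a different metric. The sketch below shows this on synthetic data with a deliberately small grid; the data and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 120 negatives, 30 positives.
y_toy = np.array([0] * 120 + [1] * 30)
X_toy = rng.normal(size=(150, 3)) + y_toy[:, None] * 1.0

search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
    scoring="f1",  # rank settings by F1 on the positive class instead of accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean F1:", round(search.best_score_, 3))
```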

In [None]:
from sklearn.metrics import accuracy_score, classification_report

In [None]:
best_models = {
    'SVC': grid_search.best_estimator_,
    'Random Forest': grid_search_rf.best_estimator_,
    'Logistic Regression':grid_search_lr.best_estimator_
}
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Attrition', 'Attrition']))

In [None]:
results = {}

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, target_names=['No Attrition', 'Attrition'], output_dict=True)
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']
    }

metrics = list(next(iter(results.values())).keys())
x = np.arange(len(metrics))
width = 0.25
plt.figure(figsize=(10, 6))
for i, (name, scores) in enumerate(results.items()):
    plt.bar(x + i*width, list(scores.values()), width=width, label=name)
plt.xticks(x + width, metrics)
plt.ylim(0, 1)
plt.title('Model Performance Comparison', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

**Interpretation**

Results after hyperparameter tuning:

| Model                   | Accuracy | Attrition Recall | Attrition F1-score |
| ----------------------- | -------- | ---------------- | ------------------ |
| **SVC**                 | 0.87     | 0.13             | 0.20               |
| **Random Forest**       | 0.85     | 0.15             | 0.21               |
| **Logistic Regression** | 0.87     | 0.10             | 0.17               |

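- Evaluated models: Logistic Regression, Random Forest, Decision Tree, KNN, SVC, Naïve Bayes, XGBoost. Only SVC, Random Forest, and Logistic Regression were tuned.

- Based on accuracy (≈0.87) and attrition F1-score (≈0.20), the Support Vector Classifier (SVC) still performs best overall, although Random Forest has a marginally higher attrition recall (0.15) and F1-score (0.21).

- All three tuned models remain biased toward the majority “No Attrition” class: attrition recall lies between 0.10 and 0.15, so most employees who leave are not flagged.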

# **IBM HR Analytics Employee Attrition & Performance**

## **Project Overview**

The goal of this project is to analyze employee data to identify the key factors that lead to employee attrition (employees leaving the company) and to build predictive models that can accurately forecast the likelihood of attrition. The specific aims are to:

- Understand the main drivers of employee turnover.
- Predict which employees are at risk of leaving, enabling proactive retention strategies.
- Improve overall workforce performance and satisfaction through data-driven HR decisions.


## **Stage 1**

### Data choosing & Understanding the Dataset

- Dataset: 'WA_Fn-UseC_-HR-Employee-Attrition.csv'
- Columns include EmployeeNumber, Age, Gender, Department, JobRole, MonthlyIncome, YearsAtCompany, JobSatisfaction, WorkLifeBalance, PerformanceRating, Attrition.
- Checked data types, null values, and duplicates.

## **Stage 2**

### EDA (Visualization) and Pre-processing
 - Finding the outliers
 - Removing the outliers
 - Finding the Skewness
 - Visualize the Univariate
 - Visualize the Bivariate
 - Visualize the Multivariate

 ## **Stage 3**

 ### Feature Engineering
 - Encoding the categorical columns (one-hot encoding)
 - Feature selection (using SelectKBest)
 - Feature scaling (StandardScaler)
 - Model building

  ## **Stage 4**
  ### Model Evaluation
  - Logistic Regression
  - Decision Tree Classifier
  - Random Forest Classifier
  - Support Vector Machine (SVM / SVC)
  - Naïve Bayes
  - K-Nearest Neighbors (KNN)
  - XGBoost
   ### Model Comparison

   - Using KFold cross-validation
   - Hyperparameter tuning

   Finally, select the best model for predicting attrition.

  ### Future Enhancement

- While the current model provides valuable insights into employee attrition and performance, there is significant scope to make the solution more robust, accurate, and practical for real-world HR applications.

 ### Conclusion
The IBM HR Analytics - Employee Attrition & Performance project successfully explored employee-related data to identify key factors influencing attrition and to build predictive models that help forecast which employees are at risk of leaving the organization.

Through data preprocessing, feature selection, and model comparison, the analysis revealed that factors such as overtime, job level, total working years, marital status, and monthly income play a crucial role in employee attrition.

Among the tested machine learning models — including Logistic Regression, Random Forest, SVM, KNN, Decision Tree, Naïve Bayes, and XGBoost — models like Support Vector Classifier (SVC) and Random Forest delivered the highest accuracy, while Naïve Bayes performed better in detecting minority (attrition) cases, as reflected in its F1 score.

The findings highlight the importance of data-driven HR strategies. By understanding what drives employees to leave, organizations can proactively implement retention initiatives, optimize workload, enhance job satisfaction, and improve workforce stability.








