# ABOUT THE PROJECT

# **Predicting Employee Attrition**  

### **Project Overview**  
Employee attrition is a critical concern for organizations, as high turnover can lead to increased costs and loss of talent. This project aims to predict which employees are likely to leave the company using Machine Learning techniques, helping HR teams take proactive measures to retain valuable employees.  

### **Techniques Used**  
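- **Random Forest**: An ensemble method that trains many decision trees on bootstrapped samples and aggregates their votes. Averaging over trees makes it less prone to overfitting than a single decision tree, and it works well with tabular data. It is listed here as a candidate; it is not trained in the cells below.  
- **XGBoost (Extreme Gradient Boosting)**: A gradient boosting algorithm that adds trees sequentially, each one correcting the errors of the previous ones, which lets it capture non-linear relationships in the data. `XGBClassifier` is imported in this notebook, but its use is commented out.  
- **Support Vector Machine (linear kernel)**: The model actually trained and evaluated in this notebook, using scikit-learn's `SVC(kernel='linear')`.  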

### **Dataset**  
The **HR Analytics Dataset** is loaded from `aug_train.csv`. Each row describes one person, and the `target` column marks the outcome to predict (1 = leave, 0 = stay). The features present in the loaded data are:  
- **Location**: `city`, `city_development_index`  
- **Demographics and education**: `gender`, `enrolled_university`, `education_level`, `major_discipline`  
- **Experience**: `relevent_experience`, `experience`, `last_new_job`  
- **Employer**: `company_size`, `company_type`  
- **Training**: `training_hours`  

### **Objective**  
The primary goal is to build a predictive model that can:  
✅ Identify key factors leading to employee attrition  
✅ Provide HR teams with insights into employee retention  
✅ Help companies reduce turnover and improve workplace satisfaction  

### **Expected Outcomes**  
- A machine learning model that predicts whether an employee is likely to leave the company.  
- Insights into the most influential factors affecting employee attrition.  
- Visualizations showcasing trends and patterns in employee turnover.  


# IMPORTING LIBRARIES AND LOADING THE DATA

In [163]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

In [None]:
job_change_train_data=pd.read_csv('/content/aug_train.csv')

In [None]:
job_change_train_data.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


# Data Preprocessing

In [118]:
job_change_train_data.shape

(18014, 14)

In [119]:
job_change_train_data.isnull().sum()

Unnamed: 0,0
enrollee_id,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


In [120]:
job_change_train_data = job_change_train_data.dropna(subset=['enrolled_university'])

In [121]:
job_change_train_data = job_change_train_data.dropna(subset=['education_level','experience','last_new_job'])

In [122]:
job_change_train_data.isnull().sum()

Unnamed: 0,0
enrollee_id,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


In [123]:
job_change_train_data['gender'].value_counts()

Unnamed: 0_level_0,count
gender,Unnamed: 1_level_1
Male,16635
Female,1206
Other,173


In [124]:
job_change_train_data['gender']=job_change_train_data['gender'].fillna(value='Male')

In [125]:
job_change_train_data['major_discipline'].value_counts()

Unnamed: 0_level_0,count
major_discipline,Unnamed: 1_level_1
STEM,16215
Humanities,653
Other,364
Business Degree,322
Arts,248
No Major,212


In [126]:
job_change_train_data['major_discipline']=job_change_train_data['major_discipline'].fillna(value='STEM')

In [127]:
job_change_train_data['company_size'].value_counts()

Unnamed: 0_level_0,count
company_size,Unnamed: 1_level_1
50-99,8178
100-500,2565
10000+,1964
10/49,1394
1000-4999,1282
<10,1242
500-999,847
5000-9999,542


In [128]:
mode_value = job_change_train_data['company_size'].mode()[0]
print(mode_value)

50-99


In [129]:
job_change_train_data['company_size'] = job_change_train_data.groupby('education_level')['company_size'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
job_change_train_data['company_type'] = job_change_train_data.groupby('education_level')['company_type'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
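The group-wise imputation above can be easier to follow on a tiny example. The sketch below wraps the same idea in a helper; the function name `fill_with_group_mode` and the toy values are assumptions made for illustration, not part of the original notebook. Note that rows whose grouping column is itself missing are not filled by `groupby(...).transform(...)`, which is why `education_level` gaps are dropped earlier in this notebook.

```python
import pandas as pd

def fill_with_group_mode(df, group_col, target_col, fallback="Unknown"):
    """Fill missing values in target_col with the most frequent value of each group."""
    def _fill(series):
        modes = series.mode()  # mode() ignores NaN; it is empty if the whole group is missing
        return series.fillna(modes.iloc[0] if not modes.empty else fallback)
    return df.groupby(group_col)[target_col].transform(_fill)

# Toy data (hypothetical values) with gaps in company_size
toy = pd.DataFrame({
    "education_level": ["Graduate", "Graduate", "Graduate", "Masters", "Masters", "Phd"],
    "company_size": ["50-99", "50-99", None, "10000+", None, None],
})
toy["company_size"] = fill_with_group_mode(toy, "education_level", "company_size")
print(toy)
```

In this toy frame the `Phd` group has no observed value at all, so it falls back to `'Unknown'`, while the other groups receive their own most frequent size.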


In [130]:
job_change_train_data['company_type'].isnull().sum()

0

In [131]:
job_change_train_data.isnull().sum()

Unnamed: 0,0
enrollee_id,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


In [132]:
X=job_change_train_data.drop(columns=['enrollee_id','city'])
Y=job_change_train_data['target']

In [133]:
X.tail()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
19153,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,50-99,Pvt Ltd,1,42,1.0
19154,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,50-99,Pvt Ltd,4,52,1.0
19155,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,0.802,Male,Has relevent experience,no_enrollment,High School,STEM,<1,500-999,Pvt Ltd,2,97,0.0
19157,0.855,Male,No relevent experience,no_enrollment,Primary School,STEM,2,50-99,Pvt Ltd,1,127,0.0


In [134]:
# Label encoding --> handling the categorical data
labelEncode=LabelEncoder()

X['gender']=labelEncode.fit_transform(X['gender'])
X['relevent_experience']=labelEncode.fit_transform(X['relevent_experience'])
X['enrolled_university']=labelEncode.fit_transform(X['enrolled_university'])
X['education_level']=labelEncode.fit_transform(X['education_level'])
X['major_discipline']=labelEncode.fit_transform(X['major_discipline'])
X['company_type']=labelEncode.fit_transform(X['company_type'])
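Reusing one `LabelEncoder` for every column works for training, but each `fit_transform` call overwrites the previous mapping, so the codes cannot be recovered later. One possible alternative, sketched here on toy data, is to keep one fitted encoder per column. The `encoders` dictionary and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical rows) using two of the notebook's column names
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "relevent_experience": ["Has relevent experience", "No relevent experience",
                            "Has relevent experience", "Has relevent experience"],
})

encoders = {}          # one fitted encoder per column, kept for later reuse
encoded = toy.copy()
for col in ["gender", "relevent_experience"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so the position of each label is its integer code
gender_mapping = {label: int(code) for code, label in enumerate(encoders["gender"].classes_)}
print(gender_mapping)
```

With the encoders stored, `encoders['gender'].transform([...])` can encode new raw inputs consistently, and `inverse_transform` can turn codes back into labels.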

In [135]:
X['gender'].value_counts()

# 0 --> Female
# 1 --> Male
# 2 --> Other

Unnamed: 0_level_0,count
gender,Unnamed: 1_level_1
1,16635
0,1206
2,173


In [136]:
X.head()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,0.92,1,0,2,0,5,>20,50-99,5,1,36,1.0
1,0.776,1,1,2,0,5,15,50-99,5,>4,47,0.0
2,0.624,1,1,0,0,5,5,50-99,5,never,83,0.0
4,0.767,1,0,2,2,5,>20,50-99,1,4,8,0.0
5,0.764,1,0,1,0,5,11,50-99,5,1,24,1.0


In [137]:
X['last_new_job'].value_counts()

Unnamed: 0_level_0,count
last_new_job,Unnamed: 1_level_1
1,7789
>4,3210
2,2827
never,2187
4,1010
3,991


In [138]:
X['last_new_job']=X['last_new_job'].replace({'>4':5 ,'never':0})

In [139]:
X['company_size'].value_counts()

Unnamed: 0_level_0,count
company_size,Unnamed: 1_level_1
50-99,8178
100-500,2565
10000+,1964
10/49,1394
1000-4999,1282
<10,1242
500-999,847
5000-9999,542


In [140]:
mapping = {
    '<10': 0,
    '10/49': 1,  # or '10-49' if you standardize
    '50-99': 2,
    '100-500': 3,
    '500-999': 4,
    '1000-4999': 5,
    '5000-9999': 6,
    '10000+': 7
}

X['company_size'] = X['company_size'].replace(mapping)


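`Series.replace` with a dictionary leaves any value that is not in the dictionary unchanged, so a misspelled category would silently stay a string. A stricter variant, sketched below under the assumption that every category should be known in advance, uses `Series.map` and raises on anything unmapped. The helper name `encode_company_size` is hypothetical.

```python
import pandas as pd

# Same ordinal mapping as in the notebook cell above
COMPANY_SIZE_ORDER = {
    "<10": 0, "10/49": 1, "50-99": 2, "100-500": 3,
    "500-999": 4, "1000-4999": 5, "5000-9999": 6, "10000+": 7,
}

def encode_company_size(series):
    """Map company_size labels to ordinal codes, failing loudly on unknown labels."""
    unknown = set(series.dropna().unique()) - set(COMPANY_SIZE_ORDER)
    if unknown:
        raise ValueError(f"Unmapped company_size values: {sorted(unknown)}")
    # Nullable integer dtype keeps missing entries as <NA> instead of forcing floats
    return series.map(COMPANY_SIZE_ORDER).astype("Int64")

codes = encode_company_size(pd.Series(["<10", "10/49", "10000+"]))
print(codes.tolist())
```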


In [141]:
X['experience']=X['experience'].replace({'>20':21})

In [142]:
X=X.drop(columns='major_discipline')

In [143]:
X.head()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,experience,company_size,company_type,last_new_job,training_hours,target
0,0.92,1,0,2,0,21,2,5,1,36,1.0
1,0.776,1,1,2,0,15,2,5,5,47,0.0
2,0.624,1,1,0,0,5,2,5,0,83,0.0
4,0.767,1,0,2,2,21,2,1,4,8,0.0
5,0.764,1,0,1,0,11,2,5,1,24,1.0


In [144]:
X['experience']=X['experience'].replace({'<1':0})

In [145]:
X['experience']=X['experience'].astype(int)

In [146]:
# Check data types of all columns in X
print(X.dtypes)


city_development_index    float64
gender                      int64
relevent_experience         int64
enrolled_university         int64
education_level             int64
experience                  int64
company_size                int64
company_type                int64
last_new_job               object
training_hours              int64
target                    float64
dtype: object


In [147]:
X.shape

(18014, 11)

In [148]:
X['last_new_job']=X['last_new_job'].astype(int)

In [153]:
X['target']=X['target'].astype(int)

In [154]:
X.head()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,experience,company_size,company_type,last_new_job,training_hours,target
0,0.92,1,0,2,0,21,2,5,1,36,1
1,0.776,1,1,2,0,15,2,5,5,47,0
2,0.624,1,1,0,0,5,2,5,0,83,0
4,0.767,1,0,2,2,21,2,1,4,8,0
5,0.764,1,0,1,0,11,2,5,1,24,1


# Training the Model with Processed Data

In [155]:
X.head()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,experience,company_size,company_type,last_new_job,training_hours,target
0,0.92,1,0,2,0,21,2,5,1,36,1
1,0.776,1,1,2,0,15,2,5,5,47,0
2,0.624,1,1,0,0,5,2,5,0,83,0
4,0.767,1,0,2,2,21,2,1,4,8,0
5,0.764,1,0,1,0,11,2,5,1,24,1


In [156]:
X['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,13593
1,4421


In [157]:
Leave=X[X.target == 1]
Stay=X[X.target == 0]

In [158]:
print(Leave.shape)
print(Stay.shape)

(4421, 11)
(13593, 11)


In [159]:
new_stay=Stay.sample(n=4400)

In [160]:
new_data_set=pd.concat([Leave,new_stay],axis=0)

In [180]:
new_data_set.head()

Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,experience,company_size,company_type,last_new_job,training_hours,target
0,0.92,1,0,2,0,21,2,5,1,36,1
5,0.764,1,0,1,0,11,2,5,1,24,1
7,0.762,1,0,2,0,13,0,5,5,18,1
8,0.92,1,0,2,0,7,2,5,1,46,1
10,0.624,1,1,0,1,2,2,5,0,32,1


In [164]:
new_data_set['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
1,4421
0,4400
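The balancing step above draws a random subset of the majority class without a seed, so the sampled rows change from run to run. A minimal sketch of a reproducible version follows. It differs slightly from the notebook: it samples every class down to the exact minority count rather than a hand-picked 4400, and the helper name `undersample_majority` and the seed value are assumptions.

```python
import pandas as pd

def undersample_majority(df, label_col, random_state=7):
    """Downsample every class to the size of the smallest class, reproducibly."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy data (hypothetical): 3 leavers, 7 stayers
toy = pd.DataFrame({"x": range(10), "target": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]})
balanced = undersample_majority(toy, "target")
print(balanced["target"].value_counts())
```

Because the notebook balances the data before the train-test split, the test set is balanced as well. If the goal is to estimate performance on the original class ratio, one option is to split first and undersample only the training portion.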


In [165]:
X=new_data_set.drop(columns='target')
Y=new_data_set['target']
# Train-test split


Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,experience,company_size,company_type,last_new_job,training_hours
0,0.92,1,0,2,0,21,2,5,1,36
5,0.764,1,0,1,0,11,2,5,1,24
7,0.762,1,0,2,0,13,0,5,5,18
8,0.92,1,0,2,0,7,2,5,1,46
10,0.624,1,1,0,1,2,2,5,0,32


In [169]:
X.shape

(8821, 10)

In [166]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,stratify=Y,random_state=7)

In [167]:
# model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

classifer=SVC(kernel='linear')

In [171]:
X.shape,X_train.shape,X_test.shape

((8821, 10), (6174, 10), (2647, 10))

In [172]:
classifer.fit(X_train, Y_train)
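Support vector machines are sensitive to feature scale, and the columns here span very different ranges (for example `city_development_index` below 1 versus `training_hours` in the tens or hundreds). One option worth trying is to standardize the features inside a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification`, so it says nothing about how much scaling would help on this dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 10 notebook features
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=7)
X_demo[:, 0] *= 100  # mimic one column with a much wider range than the others

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=7
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svc.fit(X_tr, y_tr)

test_accuracy = scaled_svc.score(X_te, y_te)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also means a saved model carries its preprocessing with it.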

In [173]:
from sklearn.metrics import accuracy_score

# Accuracy on the training split
predictions=classifer.predict(X_train)

score=accuracy_score(Y_train,predictions)

In [174]:
print('train data accuracy:',score)

train data accuracy: 0.6705539358600583


In [175]:
# Accuracy on the held-out test split (accuracy_score was imported above)
predictions=classifer.predict(X_test)

score=accuracy_score(Y_test,predictions)

In [176]:
print(score)

0.6667925953910087
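Accuracy alone does not show which kind of mistake the model makes. For an attrition use case it may matter more how many actual leavers are caught (recall) than how many predictions are right overall. The sketch below computes a confusion matrix, precision, and recall on a small hand-made example; the label lists are invented for illustration and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = leave, 0 = stay
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
# tn=2 fp=2 fn=1 tp=3
# precision=0.60 recall=0.75
```

The same calls could be applied to `Y_test` and the test predictions from the cell above.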


# Making a Predictive System

In [177]:
import pandas as pd

def predict_new_data(input_data):
    # Convert input_data (1D array or tuple) to a DataFrame with one row
    input_df = pd.DataFrame([input_data], columns=[
        'city_development_index', 'gender', 'relevent_experience', 'enrolled_university',
        'education_level', 'experience', 'company_size', 'company_type',
        'last_new_job', 'training_hours'
    ])
    # Make the prediction
    prediction = classifer.predict(input_df)

    return prediction[0]  # Return the predicted label

In [182]:
input_data = (0.764, 1, 0, 1, 0, 11, 2, 5, 1, 24)  # Example input
predicted_class = predict_new_data(input_data)

print(f"Predicted class: {predicted_class}")


Predicted class: 1
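`predict_new_data` expects values that are already encoded, so raw strings such as `'>20'` or `'50-99'` must first go through the same conversions used during preprocessing. The sketch below collects the ordinal conversions from this notebook into one helper; the names `to_int` and `encode_ordinal_fields` are hypothetical. The label-encoded columns (`gender`, `company_type`, and so on) are not covered here, since they would need the fitted encoders.

```python
# Mappings copied from the preprocessing cells of this notebook
COMPANY_SIZE = {
    "<10": 0, "10/49": 1, "50-99": 2, "100-500": 3,
    "500-999": 4, "1000-4999": 5, "5000-9999": 6, "10000+": 7,
}
EXPERIENCE_SPECIAL = {">20": 21, "<1": 0}
LAST_NEW_JOB_SPECIAL = {">4": 5, "never": 0}

def to_int(value, special):
    """Map special string labels to ints; otherwise parse the numeric string."""
    return special[value] if value in special else int(value)

def encode_ordinal_fields(record):
    """Convert the raw ordinal fields of one record into the integer codes used for training."""
    return {
        "experience": to_int(record["experience"], EXPERIENCE_SPECIAL),
        "last_new_job": to_int(record["last_new_job"], LAST_NEW_JOB_SPECIAL),
        "company_size": COMPANY_SIZE[record["company_size"]],
    }

raw = {"experience": ">20", "last_new_job": "never", "company_size": "50-99"}
encoded_record = encode_ordinal_fields(raw)
print(encoded_record)
# {'experience': 21, 'last_new_job': 0, 'company_size': 2}
```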


# SAVING THE MODEL


In [183]:
import pickle


In [184]:
filename='employee-predict-model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename,'wb') as model_file:
    pickle.dump(classifer,model_file)

In [185]:
# Only unpickle files from a trusted source
with open('employee-predict-model.sav','rb') as model_file:
    loaded_model=pickle.load(model_file)
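After saving, it can be useful to confirm that the reloaded model behaves like the original. The sketch below does a round trip through a temporary file with a tiny stand-in model; the data, the file name `demo-model.sav`, and the variable names are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Tiny stand-in model trained on hypothetical, linearly separable points
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_demo = np.array([0, 0, 1, 1])
model = SVC(kernel="linear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.sav")
    with open(path, "wb") as fh:   # file is closed when the block exits
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should give identical predictions on the same inputs
same_predictions = bool((model.predict(X_demo) == restored.predict(X_demo)).all())
print(same_predictions)
```

A pickled scikit-learn model is generally meant to be loaded with the same library version that produced it, so recording the version alongside the file is a reasonable precaution.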

In [186]:
import pandas as pd

def predict_new_data(input_data):
    # Convert input_data (1D array or tuple) to a DataFrame with one row
    input_df = pd.DataFrame([input_data], columns=[
        'city_development_index', 'gender', 'relevent_experience', 'enrolled_university',
        'education_level', 'experience', 'company_size', 'company_type',
        'last_new_job', 'training_hours'
    ])

    # Ensure the input data has the same feature columns as the training data
    # If you have any preprocessing steps (like scaling or encoding), apply them here

    # Make the prediction
    prediction = loaded_model.predict(input_df)

    return prediction[0]  # Return the predicted label

# Example: Predicting a new data point


In [187]:
input_data = (0.764, 1, 0, 1, 0, 11, 2, 5, 1, 24)  # Example input
predicted_class = predict_new_data(input_data)

print(f"Predicted class: {predicted_class}")

Predicted class: 1
