Dataset link: https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction

In [1]:
# importing modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv('HR_comma_sep.csv')
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.80,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low
...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low
14995,0.37,0.48,2,160,3,0,1,0,support,low
14996,0.37,0.53,2,143,3,0,1,0,support,low
14997,0.11,0.96,6,280,4,0,1,0,support,low


#### Column Description
- The "last_evaluation" column holds the employee's most recent performance evaluation score, a decimal number between 0 and 1.

A value of 0 is the lowest possible evaluation score and a value of 1 is the highest. In the rows displayed above, the scores are recorded to two decimal places.

For example, an employee with a "last_evaluation" score of 0.87 received a relatively high evaluation. Such a score could be derived from a combination of factors such as productivity, quality of work, job knowledge, and communication skills, which are typically assessed during an employee's performance review.

#### 1. Data Pre-processing

In [3]:
# Remove any rows with missing data
df.dropna(inplace=True)
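
Before dropping rows, it can help to see how many values are actually missing. The following is a minimal sketch on a small made-up frame; the `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not taken from HR_comma_sep.csv
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "salary": ["low", "medium", "medium"],
})

# Count missing values per column before deciding to drop anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one missing value
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

The same `isna().sum()` call on `df` would show whether the `dropna` step removes any rows at all.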

In [4]:
# Encode categorical variables using one-hot encoding
categorical_columns = ["Department", "salary"]
df = pd.get_dummies(df, columns=categorical_columns)
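
To illustrate what one-hot encoding produces, here is a small sketch on a made-up frame (the `toy` values are assumptions, not dataset rows). Passing `dtype=int` is an optional choice made here so that the indicator columns hold 0/1; recent pandas versions otherwise return boolean columns.

```python
import pandas as pd

# Hypothetical toy frame using the same two categorical column names
toy = pd.DataFrame({
    "Department": ["sales", "support", "sales"],
    "salary": ["low", "medium", "low"],
})

# Each category becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["Department", "salary"], dtype=int)
print(encoded)
```

Each original column is replaced by one indicator column per category, which is why `Department` and `salary` no longer appear in the encoded frame.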


#### 2. Feature Engineering
Feature engineering on the dataframe can be done with the following step:

- Create a new feature called "productivity" by dividing the "average_montly_hours" column by the employee's tenure in months, that is, 12 times the "time_spend_company" column (assuming that column is measured in years). This feature can represent the employee's average monthly productivity, taking into account the amount of time they have spent at the company.



In [5]:
# Assuming time_spend_company is measured in years, multiply it by 12 to get months
df['productivity'] = df['average_montly_hours'] / (12*df['time_spend_company'])
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department_IT,Department_RandD,...,Department_management,Department_marketing,Department_product_mng,Department_sales,Department_support,Department_technical,salary_high,salary_low,salary_medium,productivity
0,0.38,0.53,2,157,3,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,4.361111
1,0.80,0.86,5,262,6,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,3.638889
2,0.11,0.88,7,272,4,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,5.666667
3,0.72,0.87,5,223,5,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,3.716667
4,0.37,0.52,2,159,3,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,4.416667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,4.194444
14995,0.37,0.48,2,160,3,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,4.444444
14996,0.37,0.53,2,143,3,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,3.972222
14997,0.11,0.96,6,280,4,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,5.833333
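
As a quick sanity check of the formula, the first rows of the output can be recomputed by hand, for example 157 / (12 × 3) ≈ 4.361111. The sketch below rebuilds just the two relevant columns for the first three rows, with values copied from the displayed output; the `rows` frame itself is an illustrative helper.

```python
import pandas as pd

# First three rows of the two input columns, copied from the displayed output
rows = pd.DataFrame({
    "average_montly_hours": [157, 262, 272],
    "time_spend_company": [3, 6, 4],
})

# Monthly hours divided by tenure in months (years * 12)
rows["productivity"] = rows["average_montly_hours"] / (12 * rows["time_spend_company"])
print(rows["productivity"].round(6).tolist())
```

The results should agree with the "productivity" values of rows 0 to 2 in the table.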


In [6]:
# Standardize the numeric columns (z-score) to complete the data pre-processing
numeric_columns = ['number_project', 'average_montly_hours', 'time_spend_company', 'productivity']
df[numeric_columns] = (df[numeric_columns] - df[numeric_columns].mean()) / df[numeric_columns].std()
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department_IT,Department_RandD,...,Department_management,Department_marketing,Department_product_mng,Department_sales,Department_support,Department_technical,salary_high,salary_low,salary_medium,productivity
0,0.38,0.53,-1.462814,-0.882010,-0.341224,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,-0.480459
1,0.80,0.86,0.971081,1.220382,1.713379,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,-0.812702
2,0.11,0.88,2.593677,1.420610,0.343644,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0.120133
3,0.72,0.87,0.971081,0.439493,1.028511,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,-0.776922
4,0.37,0.52,-1.462814,-0.841965,-0.341224,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,-0.454902
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,-1.462814,-1.002147,-0.341224,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,-0.557131
14995,0.37,0.48,-1.462814,-0.821942,-0.341224,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,-0.442124
14996,0.37,0.53,-1.462814,-1.162329,-0.341224,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,-0.659359
14997,0.11,0.96,1.782379,1.580792,0.343644,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,0.196804
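
The scaling step is a z-score standardization: each column has its mean subtracted and is then divided by its standard deviation. A minimal sketch on the first five rows of two columns (values copied from the earlier output) shows the effect. Because the mean and standard deviation here come from only five rows, the scaled numbers will not match the table above. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# First five rows of two numeric columns, copied from the unscaled output
toy = pd.DataFrame({
    "number_project": [2, 5, 7, 5, 2],
    "average_montly_hours": [157, 262, 272, 223, 159],
})

# z-score: subtract the column mean, divide by the column standard deviation
scaled = (toy - toy.mean()) / toy.std()

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean().round(6))
print(scaled.std().round(6))
```

After this transformation the columns share a common scale, so features with large raw ranges such as monthly hours do not dominate features with small ranges such as project counts.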
