# **DATA PREPROCESSING**



---


# **1. Import Libraries**

In [None]:
# drive module for mounting gdrive storage
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
%cd '/content/gdrive/MyDrive/PRCL_10281-employee_performance_analysis'

Mounted at /content/gdrive
/content/gdrive/MyDrive/PRCL_10281-employee_performance_analysis


In [None]:
# show current directory
!pwd

/content/gdrive/MyDrive/PRCL_10281-employee_performance_analysis


In [None]:
# basic libraries for statistics and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import pylab
import os
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")



---


# **2. Load Data**
* File containing employee performance data

In [None]:
# load the raw employee performance data from its CSV file
df = pd.read_csv('/content/gdrive/MyDrive/PRCL_10281-employee_performance_analysis/2-data/2-raw/INX_Future_Inc_Employee_Performance_CDS_Project2_Data.csv')
display(df.head())
print('='*150)
display(df.tail())

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,E1001000,32,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,10,3,...,4,10,2,2,10,7,0,8,No,3
1,E1001006,47,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,14,4,...,4,20,2,3,7,7,1,7,No,3
2,E1001007,40,Male,Life Sciences,Married,Sales,Sales Executive,Travel_Frequently,5,4,...,3,20,2,3,18,13,1,12,No,4
3,E1001009,41,Male,Human Resources,Divorced,Human Resources,Manager,Travel_Rarely,10,4,...,2,23,2,2,21,6,12,6,No,3
4,E1001010,60,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,16,4,...,4,10,1,3,2,2,2,2,No,3




Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
1195,E100992,27,Female,Medical,Divorced,Sales,Sales Executive,Travel_Frequently,3,1,...,2,6,3,3,6,5,0,4,No,4
1196,E100993,37,Male,Life Sciences,Single,Development,Senior Developer,Travel_Rarely,10,2,...,1,4,2,3,1,0,0,0,No,3
1197,E100994,50,Male,Medical,Married,Development,Senior Developer,Travel_Rarely,28,1,...,3,20,3,3,20,8,3,8,No,3
1198,E100995,34,Female,Medical,Single,Data Science,Data Scientist,Travel_Rarely,9,3,...,2,9,3,4,8,7,7,7,No,3
1199,E100998,24,Female,Life Sciences,Single,Sales,Sales Executive,Travel_Rarely,3,2,...,1,4,3,3,2,2,2,0,Yes,2




---


# **3. Assessing the data**

This step involves visually and programmatically examining the data for data quality and tidiness issues.


In [None]:
# Print shape
print(df.shape)

(1200, 28)


In [None]:
# check data types for columns of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   EmpNumber                     1200 non-null   object
 1   Age                           1200 non-null   int64 
 2   Gender                        1200 non-null   object
 3   EducationBackground           1200 non-null   object
 4   MaritalStatus                 1200 non-null   object
 5   EmpDepartment                 1200 non-null   object
 6   EmpJobRole                    1200 non-null   object
 7   BusinessTravelFrequency       1200 non-null   object
 8   DistanceFromHome              1200 non-null   int64 
 9   EmpEducationLevel             1200 non-null   int64 
 10  EmpEnvironmentSatisfaction    1200 non-null   int64 
 11  EmpHourlyRate                 1200 non-null   int64 
 12  EmpJobInvolvement             1200 non-null   int64 
 13  EmpJobLevel       

In [None]:
# checking the missing values in the data
df.isnull().sum()

Unnamed: 0,0
EmpNumber,0
Age,0
Gender,0
EducationBackground,0
MaritalStatus,0
EmpDepartment,0
EmpJobRole,0
BusinessTravelFrequency,0
DistanceFromHome,0
EmpEducationLevel,0


---


* The employee performance dataset consists of 1200 records, each with 28 features (columns).

* There are only two feature data types: integer and object.

* None of the columns contain null values.
---
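The three observations above (shape, data types, null counts) can also be collected into a single summary table. The following is a minimal sketch, assuming a small stand-in frame named `sample` and a hypothetical helper `summarize`; neither is part of the original notebook, which works on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the employee data; the notebook itself uses `df`.
sample = pd.DataFrame({
    "Age": [32, 47, 40],
    "Gender": ["Male", "Male", "Female"],
    "EmpDepartment": ["Sales", "Sales", "Development"],
})

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isnull().sum(),
        "n_unique": frame.nunique(),
    })

summary = summarize(sample)
print(sample.shape)  # (rows, columns)
print(summary)
```

Calling `summarize(df)` on the real data would give the same overview in one place instead of across three cells.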

In [None]:
# Checking the proportion of each unique value in every categorical column

cols_cat = df.select_dtypes(['object'])

for i in cols_cat.columns:
    print('Unique values in',i, 'are :')
    print(df[i].value_counts(normalize = True))
    print('*'*40)

Unique values in EmpNumber are :
EmpNumber
E1001000    0.000833
E100346     0.000833
E100342     0.000833
E100341     0.000833
E100340     0.000833
              ...   
E1001718    0.000833
E1001717    0.000833
E1001716    0.000833
E1001713    0.000833
E100998     0.000833
Name: proportion, Length: 1200, dtype: float64
****************************************
Unique values in Gender are :
Gender
Male      0.604167
Female    0.395833
Name: proportion, dtype: float64
****************************************
Unique values in EducationBackground are :
EducationBackground
Life Sciences       0.410000
Medical             0.320000
Marketing           0.114167
Technical Degree    0.083333
Other               0.055000
Human Resources     0.017500
Name: proportion, dtype: float64
****************************************
Unique values in MaritalStatus are :
MaritalStatus
Married     0.456667
Single      0.320000
Divorced    0.223333
Name: proportion, dtype: float64
*******************************

In [None]:
# Checking the count of unique values in each categorical column

cols_uniq = df.select_dtypes(['object'])

for i in cols_uniq.columns:
    print('Unique values in',i, 'are :')
    print(df[i].nunique())
    print('*'*40)

Unique values in EmpNumber are :
1200
****************************************
Unique values in Gender are :
2
****************************************
Unique values in EducationBackground are :
6
****************************************
Unique values in MaritalStatus are :
3
****************************************
Unique values in EmpDepartment are :
6
****************************************
Unique values in EmpJobRole are :
19
****************************************
Unique values in BusinessTravelFrequency are :
3
****************************************
Unique values in OverTime are :
2
****************************************
Unique values in Attrition are :
2
****************************************

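The two loops above can likely be condensed, since `nunique` and `value_counts` both work without an explicit loop. This is a sketch on a small made-up frame (`sample` is an assumption, not the notebook's data), and it treats every non-numeric column as categorical, which is also an assumption.

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself uses `df`.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "MaritalStatus": ["Single", "Married", "Married", "Divorced"],
    "Age": [32, 47, 40, 41],
})

# Assumption: every non-numeric column is categorical.
categorical = sample.select_dtypes(exclude="number")

# Number of distinct values per categorical column, returned as one Series.
unique_counts = categorical.nunique()

# Share of each value in one column, expressed in percent.
gender_share = categorical["Gender"].value_counts(normalize=True).mul(100).round(1)

print(unique_counts)
print(gender_share)
```

The Series returned by `nunique()` is easier to sort or filter than printed loop output, for example to spot high-cardinality columns.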

In [None]:
# EmpNumber is not necessary for analysis and should be dropped
df = df.drop('EmpNumber', axis=1)
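`EmpNumber` is dropped because it has one distinct value per row and so carries no information for analysis. A rough way to flag such identifier-like columns is sketched below; the frame `sample` is made up for illustration, and the rule "all values distinct" is an assumption that can also match continuous numeric columns, so the result should be reviewed before dropping anything.

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself uses `df`.
sample = pd.DataFrame({
    "EmpNumber": ["E1001000", "E1001006", "E1001007", "E1001009"],
    "Gender": ["Male", "Male", "Female", "Male"],
})

# A column whose values are all distinct behaves like a row identifier.
id_like = [col for col in sample.columns if sample[col].nunique() == len(sample)]

# Drop the flagged columns; `columns=` avoids having to pass axis=1.
cleaned = sample.drop(columns=id_like)

print(id_like)
print(cleaned.columns.tolist())
```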

In [None]:
# Check distribution of categorical features
df.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
Gender,1200,2,Male,725
EducationBackground,1200,6,Life Sciences,492
MaritalStatus,1200,3,Married,548
EmpDepartment,1200,6,Sales,373
EmpJobRole,1200,19,Sales Executive,270
BusinessTravelFrequency,1200,3,Travel_Rarely,846
OverTime,1200,2,No,847
Attrition,1200,2,No,1022


In [None]:
# Check distribution of numerical features (min, max, mean, std, quartiles)
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,1200.0,36.918333,9.087289,18.0,30.0,36.0,43.0,60.0
DistanceFromHome,1200.0,9.165833,8.176636,1.0,2.0,7.0,14.0,29.0
EmpEducationLevel,1200.0,2.8925,1.04412,1.0,2.0,3.0,4.0,5.0
EmpEnvironmentSatisfaction,1200.0,2.715833,1.090599,1.0,2.0,3.0,4.0,4.0
EmpHourlyRate,1200.0,65.981667,20.211302,30.0,48.0,66.0,83.0,100.0
EmpJobInvolvement,1200.0,2.731667,0.707164,1.0,2.0,3.0,3.0,4.0
EmpJobLevel,1200.0,2.0675,1.107836,1.0,1.0,2.0,3.0,5.0
EmpJobSatisfaction,1200.0,2.7325,1.100888,1.0,2.0,3.0,4.0,4.0
NumCompaniesWorked,1200.0,2.665,2.469384,0.0,1.0,2.0,4.0,9.0
EmpLastSalaryHikePercent,1200.0,15.2225,3.625918,11.0,12.0,14.0,18.0,25.0


**Insights**

* Approximately **46%** of employees are married, **22%** are divorced, and **32%** are single.

* Three departments **(Sales, Development, Research and Development)** out of the total of six constitute **89%** of the total employees.

* About **71%** of total employees rarely travel.

* Only **29%** of total employees work overtime.

* There is a low employee attrition of about **15%**.

* Employees come from **six** unique educational backgrounds.

* **Nineteen** unique employee job roles are present in this company.

* Most employees have an education level of **3**.

* Job satisfaction is high for the majority of employees.

* Only **11%** of employees achieved a **level 4** performance rating.

* **NB:** This information can be used to predict which employees are likely to leave the company.
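Several of the percentages above follow directly from the `freq` column of the categorical summary. The arithmetic can be checked with a short sketch like the one below; the counts are copied from that output, while the helper name `pct` is an assumption introduced here. Because `Attrition` and `OverTime` each have exactly two values, the "Yes" count is the total minus the "No" count.

```python
# Counts copied from the describe(include=['O']) output of the notebook.
total = 1200
no_counts = {"Attrition": 1022, "OverTime": 847}  # frequency of "No"
travel_rarely_count = 846

def pct(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)

# Two-valued columns: everything that is not "No" is "Yes".
attrition_rate = pct(total - no_counts["Attrition"], total)
overtime_rate = pct(total - no_counts["OverTime"], total)
travel_rarely_rate = pct(travel_rarely_count, total)

print(attrition_rate, overtime_rate, travel_rarely_rate)
```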



---


# **4. Data Cleaning**

In [None]:
df.isna().values.any() # checking NaN values

False

In [None]:
df.isnull().values.any() # isnull() is an alias of isna(), so this repeats the same check

False

After assessing the data, no data quality issues were observed, so no further cleaning steps are required:

* All remaining features in the dataset are deemed necessary for the analysis.
* All data types are deemed accurate.
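The null checks above can be wrapped, together with a duplicate-row count, into one small report. This is only a sketch: the helper `quality_report` and the frame `sample` are assumptions for illustration, and the duplicate check is an extra that the original notebook does not run.

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself uses `df`.
sample = pd.DataFrame({
    "Age": [32, 47, 40],
    "Attrition": ["No", "No", "Yes"],
})

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize basic data-quality signals for a DataFrame."""
    return {
        # True if any cell is NaN/None (same check as isna().values.any()).
        "has_nulls": bool(frame.isna().values.any()),
        # Rows that exactly repeat an earlier row.
        "n_duplicate_rows": int(frame.duplicated().sum()),
    }

report = quality_report(sample)
print(report)
```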

### **4.1 Save Preprocessed Data**

In [None]:
# save df as CSV in the processed-data folder (2-data/1-processed)

df.to_csv('/content/gdrive/MyDrive/PRCL_10281-employee_performance_analysis/2-data/1-processed/pre_processed_data.csv', index=False)
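After saving, it can be worth reading the file back to confirm that it round-trips. The sketch below writes to a temporary directory so that it runs anywhere; the folder and frame names are placeholders, not the notebook's Drive path or data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in data; the notebook itself saves `df`.
sample = pd.DataFrame({"Age": [32, 47], "Attrition": ["No", "Yes"]})

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder output location inside a throwaway directory.
    out_path = Path(tmp) / "1-processed" / "pre_processed_data.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, as in the cell above.
    sample.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded.columns.tolist())
```

Writing with `index=False` is what prevents an extra unnamed index column from appearing when the file is loaded again.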