# Human Resources Data Analysis and Attrition Prediction

_Exploratory Analysis of HR Dataset_

--- 

HR analysis helps us interpret organizational data. It identifies people-related trends and enables the HR department to take appropriate action to keep the organization running smoothly and profitably. Attrition is one of the most complex challenges faced by HR managers and staff.
Machine learning models can be used to predict potential cases of attrition, helping HR staff take the necessary steps to retain employees.

Before any statistical processing, we need to ensure the quality of our data by **cleaning** it where needed (deleting redundant data, formatting strings, deleting NaNs, performing data transformations) and by **visualizing** it.
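
As a minimal sketch of these cleaning steps, consider a small made-up dataframe. The columns `name`, `dept` and `salary` are hypothetical and are not part of the HR dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one duplicated row, untidy strings, one NaN
raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", "Carol"],
    "dept": ["sales", "R&D", "R&D", "Sales"],
    "salary": [3000.0, 4000.0, 4000.0, np.nan],
})

clean = (
    raw.drop_duplicates()                         # delete redundant rows
       .assign(
           name=lambda d: d["name"].str.strip(),  # remove stray whitespace
           dept=lambda d: d["dept"].str.title(),  # harmonise capitalisation
       )
       .dropna()                                  # delete rows containing NaNs
       .reset_index(drop=True)
)
print(clean)
```

Whether dropping rows with NaNs is appropriate depends on the dataset; imputing the missing values is a common alternative.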

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns 

%matplotlib inline

## Data Loading

_Note:_ A Matplotlib [colormap](https://matplotlib.org/stable/users/explain/colors/colormaps.html), applied through the pandas `style` accessor, makes a dataframe easier to read.

In [None]:
data = pd.read_csv('HR-Employee-Attrition.csv')

display( data.head().style.background_gradient(cmap='BuPu') )

For each employee, we observe the following 35 variables:
* `Age` : Age in years of the employee
* `Attrition` : Whether the employee has left the company (`Yes` / `No`)
* `BusinessTravel` : How often the employee travels for work
* `DailyRate` : Daily rate at which an employee is paid
* `Department` : Department where the employee works
* `DistanceFromHome` : Distance an employee travels from home to work
* `Education` : Level of education of the employee
* `EducationField` : What field the employee studied in school
* `EmployeeCount` : Employee count
* `EmployeeNumber` : Employee number
* `EnvironmentSatisfaction` : Employee environment satisfaction
* `Gender` : Gender of the employee
* `HourlyRate` : Hourly rate of pay of the employee
* `JobInvolvement` : Employee job involvement ratings
* `JobLevel` : Employee job level
* `JobRole` : Employee job role
* `JobSatisfaction` : Employee job satisfaction
* `MaritalStatus` : Employee marital status
* `MonthlyIncome` : Employee monthly income
* `MonthlyRate` : Employee monthly rate
* `NumCompaniesWorked` : Number of companies the employee has worked for
* `Over18` : Whether the employee is over 18 years old
* `OverTime` : Whether the employee works overtime
* `PercentSalaryHike` : Salary increase, in percent
* `PerformanceRating` : Performance rating
* `RelationshipSatisfaction` : Relationship satisfaction
* `StandardHours` : Employee standard hours worked
* `StockOptionLevel` : Stock options level
* `TotalWorkingYears` : Total number of years worked
* `TrainingTimesLastYear` : Number of training sessions attended last year
* `WorkLifeBalance` : Work life balance rating
* `YearsAtCompany` : Years at the company
* `YearsInCurrentRole` : Years in current role
* `YearsSinceLastPromotion` : Years since last promotion
* `YearsWithCurrManager` : Years with current manager

##### <span style="color:purple"> **Question:** How many employees are we looking at? </span>

In [None]:
### TO BE COMPLETED ###

[...]

## Data Cleaning

##### <span style="color:purple"> **Todo:** Check data quality.</span>

1. Does this dataset contain duplicate rows?
2. How many distinct (unique) values does each variable take?
3. How many empty cells are there in each column?
4. How are the variables encoded? What type?
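
The four checks above map onto standard pandas calls. The sketch below runs them on a toy dataframe with hypothetical values (not the HR data), so it is a hint rather than the solution file.

```python
import pandas as pd

# Toy dataframe (hypothetical values) standing in for the HR data
df = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "Department": ["Sales", "R&D", "Sales", None],
    "Over18": ["Y", "Y", "Y", "Y"],
})

n_duplicates = df.duplicated().sum()  # 1. number of fully duplicated rows
n_unique = df.nunique()               # 2. distinct values per column (NaN excluded)
n_missing = df.isna().sum()           # 3. empty cells per column
col_types = df.dtypes                 # 4. storage type of each column

print(n_duplicates, n_unique, n_missing, col_types, sep="\n\n")
```

A column whose `nunique()` is 1, such as `Over18` in this toy example, is constant and is a natural candidate for removal.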

In [None]:
### TO BE COMPLETED ###

[...]

In [33]:
# %load solutions/data_quality.py

> Comments.

##### <span style="color:purple"> **Todo:** Remove unnecessary columns.</span>

Drop the columns that:
- contain only _one_ distinct value, and therefore contribute nothing to the analysis;
- contain an _index_ value (a per-employee identifier).

Use the pandas [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method.
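
One possible way to find and drop such columns is sketched below on a toy dataframe. The values are made up; the column names mirror the HR dataset, and treating `EmployeeNumber` as the index-like column is an assumption to verify on the real data.

```python
import pandas as pd

# Toy dataframe (hypothetical values)
df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5],     # index-like identifier
    "Age": [41, 49, 37, 33],
    "Over18": ["Y", "Y", "Y", "Y"],     # constant column
    "StandardHours": [80, 80, 80, 80],  # constant column
})

# Columns holding a single distinct value carry no information
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

# Identifier columns cannot be detected reliably; name them explicitly
index_cols = ["EmployeeNumber"]

df = df.drop(columns=constant_cols + index_cols)
print(df.columns.tolist())
```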

In [None]:
### TO BE COMPLETED ###

[...]

In [32]:
# %load solutions/data_drop.py

Consider two dataframes, `num` and `cat`, containing _quantitative_ and _qualitative_ data respectively.

In [None]:
num = data.select_dtypes(exclude='O')
display(num.head())

In [None]:
cat = data.select_dtypes(include='O')
display(cat.head())

##### <span style="color:purple"> **Question:**  What is the proportion of qualitative data in this dataset? Of quantitative data?</span>
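
As a hint, the proportions can be computed from column counts. This sketch uses a toy dataframe with hypothetical values and selects columns with `include="number"`, which is one possible alternative to the `'O'` (object) selector used above.

```python
import pandas as pd

# Toy dataframe (hypothetical values): two numeric and two text columns
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Gender": ["Female", "Male", "Male"],
    "OverTime": ["Yes", "No", "Yes"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

prop_num = len(num_cols) / df.shape[1]
prop_cat = len(cat_cols) / df.shape[1]
print(f"quantitative: {prop_num:.0%}, qualitative: {prop_cat:.0%}")
```

Note that a numeric type does not guarantee a quantitative variable: ratings stored as integers are ordinal categories.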

In [None]:
### TO BE COMPLETED ###

[...]

In [34]:
# %load solutions/data_proportion.py

##### <span style="color:purple"> **Todo:** For each variable, display its basic statistical descriptors: variance, mean, quantiles, _etc._</span>
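
A minimal sketch of such a summary, on a toy dataframe with hypothetical values, could rely on `describe` and `var`:

```python
import pandas as pd

# Toy dataframe (hypothetical values)
df = pd.DataFrame({
    "Age": [30, 40, 50, 60],
    "Gender": ["Female", "Male", "Male", "Female"],
})

num_summary = df.describe()                  # count, mean, std, min, quartiles, max
cat_summary = df.describe(exclude="number")  # count, unique, top, freq
variance = df.var(numeric_only=True)         # sample variance (ddof=1)

print(num_summary, cat_summary, variance, sep="\n\n")
```

`describe` reports the standard deviation rather than the variance, which is why `var` is called separately here.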

In [None]:
### TO BE COMPLETED ###

[...]

In [35]:
# %load solutions/stat_summary.py

## Correlation Analysis

##### <span style="color:purple"> **Todo:** Assess the correlation between variables.</span>

1. Plot the correlation matrix of the variables.
2. Select the columns with high correlation.
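
The selection step can be sketched as follows on synthetic data. The 0.7 threshold and the generated values are assumptions chosen for illustration, not properties of the HR dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): JobLevel is built to follow TotalWorkingYears
rng = np.random.default_rng(0)
years = rng.integers(0, 40, size=200)
df = pd.DataFrame({
    "TotalWorkingYears": years,
    "JobLevel": years // 10 + 1,
    "DailyRate": rng.integers(100, 1500, size=200),  # independent noise
})

corr = df.corr()  # Pearson correlation matrix

# Keep the strict upper triangle so each pair appears once, without the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

threshold = 0.7  # hypothetical cut-off
high = pairs[pairs.abs() > threshold]
print(high)
```

The matrix itself can be drawn with `sns.heatmap(corr, annot=True)`.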

In [None]:
### TO BE COMPLETED ###

[...]

In [36]:
# %load solutions/data_correlation.py

##### <span style="color:purple"> **Todo:** Draw a pairplot of the 5 most highly correlated variables.</span>

Color the observations according to their attrition.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data_pairplot.py

According to the figure above, attrition appears to be strongly correlated with employees' career history. 

##### <span style="color:purple"> **Todo:** Represent attrition according to the worker's historical profile.</span>

- Plot the total number of years worked as a function of (i) the number of years spent in this company, (ii) the number of years spent in the current position and, finally, (iii) the number of years since the last promotion. 
- Colour the dots according to attrition.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/history.py

> Comments.

## Data Visualization

### Breakdown of Overtime Work

##### <span style="color:purple"> **Question:** What is the profile of employees working overtime?</span>

View the distribution of overtime work according to, for example, employees' marital status, department, gender or age.

You can use the seaborn [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot) function.
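
Before plotting, the same breakdown can be checked numerically with a contingency table. The sketch below uses a toy dataframe with hypothetical values.

```python
import pandas as pd

# Toy dataframe (hypothetical values)
df = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Single", "Divorced", "Married"],
})

# Row-normalised table: share of overtime / no overtime within each marital status
table = pd.crosstab(df["MaritalStatus"], df["OverTime"], normalize="index")
print(table)
```

The graphical counterpart would be along the lines of `sns.histplot(data=df, x="MaritalStatus", hue="OverTime", multiple="dodge")`.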

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/overtime.py

> Comments.

### Company Departure

##### <span style="color:purple"> **Question:** What is the profile of employees leaving the company?</span>

Carry out the same exploratory study, but distinguish between employees who leave the company (attrition) and those who do not.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/attrition.py

> Comments.

## One-hot Encoding of Categorical Data

To facilitate the statistical analysis (LDA in particular), we will "translate" the categorical variables into [one-hot](https://fr.wikipedia.org/wiki/Encodage_one-hot) encoded form. Binary variables (`Attrition`, `Gender`, `OverTime`) are simply mapped to a single 0/1 column.
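
For reference, pandas offers `get_dummies` as a more compact route to the same kind of encoding. The sketch below is an alternative to the explicit loops used in this notebook, shown on a toy dataframe with hypothetical rows; the `Business` prefix is chosen to mimic the column names built below.

```python
import pandas as pd

# Toy dataframe (hypothetical rows)
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
    "OverTime": ["Yes", "No", "Yes"],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["BusinessTravel"], prefix="Business", dtype=int)

# A binary variable needs only a single 0/1 column
encoded["OverTime"] = encoded["OverTime"].map({"No": 0, "Yes": 1})
print(encoded)
```

Each row has exactly one `1` among the `Business_*` columns, which is the defining property of a one-hot encoding.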

### Attrition

In [None]:
data.loc[data['Attrition']=='No','Attrition'] = 0
data.loc[data['Attrition']=='Yes','Attrition'] = 1

data['Attrition'] = data['Attrition'].astype('int')

### Business Travel

In [None]:
BT = data['BusinessTravel'].unique()

for travel in BT:
    data['Business_'+travel] = 0
    data.loc[data['BusinessTravel']==travel,'Business_'+travel] = 1
    data['Business_'+travel] = data['Business_'+travel].astype('int')

data = data.drop('BusinessTravel',axis=1)
# data.head()

### Working Department

In [None]:
DPT = data['Department'].unique()
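# Short column names, listed in the order returned by unique() (order of first
# appearance in the data); print DPT to check that the two orders match
DPT_names = ['Sales', 'R & D', 'Dpt HR']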

for i, dpt in enumerate(DPT):
    data[DPT_names[i]] = 0
    data.loc[data['Department']==dpt, DPT_names[i]] = 1
    data[DPT_names[i]] = data[DPT_names[i]].astype('int')

data = data.drop('Department',axis=1)
# data.head()

### Education Field

In [None]:
EDUC = data['EducationField'].unique()
# Column names listed in the order returned by unique(); print EDUC to check
EDUC_names = ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Educ HR']

for i, educ in enumerate(EDUC):
    data[EDUC_names[i]] = 0
    data.loc[data['EducationField']==educ, EDUC_names[i]] = 1
    data[EDUC_names[i]] = data[EDUC_names[i]].astype('int')

data = data.drop('EducationField',axis=1)
# data.head()

### Gender

In [None]:
data.loc[data['Gender']=='Male','Gender'] = 1
data.loc[data['Gender']=='Female','Gender'] = 0

data['Gender'] = data['Gender'].astype('int')

### Job Role

In [None]:
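JOB = data['JobRole'].unique()
# Copy the array: `JOB_names = JOB` would alias it, so renaming the last entry
# would also overwrite the value in JOB that the loop below compares against
JOB_names = JOB.copy()
JOB_names[-1] = 'Job HR'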

for i, job in enumerate(JOB):
    data[JOB_names[i]] = 0
    data.loc[data['JobRole']==job, JOB_names[i]] = 1
    data[JOB_names[i]] = data[JOB_names[i]].astype('int')

data = data.drop('JobRole',axis=1)
# data.head()

### MaritalStatus

In [None]:
STATUS = data['MaritalStatus'].unique()
STATUS_names = STATUS

for i, status in enumerate(STATUS):
    data[STATUS_names[i]] = 0
    data.loc[data['MaritalStatus']==status, STATUS_names[i]] = 1
    data[STATUS_names[i]] = data[STATUS_names[i]].astype('int')

data = data.drop('MaritalStatus',axis=1)
# data.head()

### OverTime

In [None]:
data.loc[data['OverTime']=='No','OverTime'] = 0
data.loc[data['OverTime']=='Yes','OverTime'] = 1

data['OverTime'] = data['OverTime'].astype('int')

We check the resulting column types:

In [None]:
data.dtypes

## Prediction of Attrition by **Logistic Regression**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from sklearn.linear_model import LogisticRegression

##### <span style="color:purple"> **Todo:** Using the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, define a training and test data set.</span>
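
As a hint, the call could look like the sketch below. The data are synthetic, and the `test_size`, `stratify` and `random_state` settings are illustrative choices rather than values prescribed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded HR data (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, size=100),
    "OverTime": rng.integers(0, 2, size=100),
    "Attrition": np.repeat([0, 1], [80, 20]),  # imbalanced target
})

X = toy.drop(columns="Attrition")  # explanatory variables
y = toy["Attrition"]               # target

# stratify=y keeps the same share of leavers in both subsets
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(train_x.shape, test_x.shape)
```

Stratifying matters when the target is imbalanced, since a purely random split could leave very few positive cases in the test set.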

In [None]:
### TO BE COMPLETED ###

[...]

train_x, test_x, train_y, test_y = ...

In [None]:
# %load solutions/train_test_split.py

##### <span style="color:purple"> **Todo:** Implement attrition prediction using a logistic regression model.</span>
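
The general workflow (fit, predict, score) is sketched below on synthetic data generated with `make_classification`. The sketch uses the default solver rather than `newton-cholesky`, and none of the numbers it prints relate to the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (hypothetical stand-in for attrition)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_x, train_y)      # estimate the coefficients on the training set
pred_y = clf.predict(test_x)   # predicted class for each test observation

accuracy = accuracy_score(test_y, pred_y)
print(f"accuracy: {accuracy:.3f}")
print(classification_report(test_y, pred_y))
```

With an imbalanced target, accuracy alone can be misleading: a model that always predicts the majority class already scores about 80% here, so the per-class precision and recall in the report deserve attention.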

In [None]:
### TO BE COMPLETED ###

clf = LogisticRegression(solver='newton-cholesky')
clf.fit(...)

[...]

accuracy = ...

print( classification_report( ... ) )

In [None]:
# %load solutions/logistic_regression.py

> Comments.

## Prediction of Attrition by **Linear Discriminant Analysis** (LDA)

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

##### <span style="color:purple"> **Todo:** Implement attrition prediction using LDA.</span>
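
LDA follows the same scikit-learn estimator interface as logistic regression. The sketch below, on synthetic data unrelated to the HR dataset, also shows the projection onto the discriminant axis.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (hypothetical stand-in for attrition)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

lda = LinearDiscriminantAnalysis()
lda.fit(train_x, train_y)
pred_y = lda.predict(test_x)

# With 2 classes there is a single discriminant axis (n_classes - 1)
projected = lda.transform(test_x)

accuracy = accuracy_score(test_y, pred_y)
print(f"accuracy: {accuracy:.3f}", projected.shape)
```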

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/linear_discriminant_analysis.py

> Comments.

## Prediction of Attrition by **Quadratic Discriminant Analysis** (QDA)

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

##### <span style="color:purple"> **Todo:** Implement attrition prediction using QDA.</span>
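
A sketch on synthetic data follows. The `reg_param=0.1` value is an arbitrary illustrative choice: QDA estimates one covariance matrix per class, so some regularisation can help when columns are collinear, as sets of one-hot columns are.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (hypothetical stand-in for attrition)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# reg_param shrinks each class covariance estimate (hypothetical value)
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
qda.fit(train_x, train_y)
pred_y = qda.predict(test_x)
proba = qda.predict_proba(test_x)  # one column of class probabilities per class

accuracy = accuracy_score(test_y, pred_y)
print(f"accuracy: {accuracy:.3f}")
```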

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/quadratic_discriminant_analysis.py

> Comments.