# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project 5 : HR Analytics - Attrition

## Learning Objectives

At the end of the mini project, you will be able to -

* Get an understanding of the dataset.
* Perform extensive EDA and visualizations
* Handcraft the raw data into a form suitable for an ML problem
* Predict (classify) employee attrition based on employee performance


Perform exhaustive EDA and engineer the features to build a model on the training data that predicts (classifies) whether an employee (from a test dataset) will quit the company or not.


## Information

### HR Analytics

In any organization, Human Resources (HR) acts as the backbone. The strength of the company's performance depends on the people who make up the various roles and departments, so it is vital to monitor employees' data and base business decisions on it.


HR Analytics is one of the newest yet most powerful domains that uses Data Science and Machine Learning. HR analytics is a broad term and has multiple applications including, but not limited to, the following:

- Performance Analysis
- Attrition Analysis and Prediction
- Hiring Analytics
- Employee satisfaction and perk recommendation
- Skills assessment and team restructuring


### About the Dataset

This Mini-Project uses the Dataset from the [Kaggle Link](https://www.kaggle.com/code/whale9490/ibm-hr-analytics-employee-attrition-performance).

This Mini-Project is based on HR Analytics.
The goal of this project is to use the above dataset to study employee performance and predict the dissatisfied employees who are most likely to quit the company.

We have data on 1470 employees, with 35 fields that are self-explanatory.
The fields (variables) are a mix of categorical and numerical data.

Except for `Attrition`, all other fields are feature variables. The fields are listed in alphabetical order as follows:

- `Age`
- **`Attrition`** - *TARGET VARIABLE*
- `BusinessTravel`
- `DailyRate`
- `Department`
- `DistanceFromHome`
- `Education`
- `EducationField`
- `EmployeeCount`
- `EmployeeNumber`
- `EnvironmentSatisfaction`
- `Gender`
- `HourlyRate`
- `JobInvolvement`
- `JobLevel`
- `JobRole`
- `JobSatisfaction`
- `MaritalStatus`
- `MonthlyIncome`
- `MonthlyRate`
- `NumCompaniesWorked`
- `Over18`
- `OverTime`
- `PercentSalaryHike`
- `PerformanceRating`
- `RelationshipSatisfaction`
- `StandardHours`
- `StockOptionLevel`
- `TotalWorkingYears`
- `TrainingTimesLastYear`
- `WorkLifeBalance`
- `YearsAtCompany`
- `YearsInCurrentRole`
- `YearsSinceLastPromotion`
- `YearsWithCurrManager`
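
As a small illustration of the target/feature split and the categorical/numerical mix described above, here is a minimal sketch. It assumes a toy frame named `toy` with only four of the columns and a few illustrative rows; the names `X`, `y`, `numerical`, and `categorical` are introduced here for the example and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in with four of the column names; only a few illustrative rows.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "DailyRate": [1102, 279, 1373],
})

y = toy["Attrition"]               # target variable
X = toy.drop(columns="Attrition")  # every other field is a feature

# Separate the features by dtype to see the numerical/categorical mix.
numerical = X.select_dtypes(include="number").columns.tolist()
categorical = X.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical)
print("categorical:", categorical)
```

The same two `select_dtypes` calls should work on the full data frame once it is loaded.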

**Python Packages used:**  

* [`google.colab`](https://colab.research.google.com/notebooks/io.ipynb) for linking the notebook to your Google Drive
* [`pandas`](https://pandas.pydata.org/docs/reference/index.html) for data frames and easy reading of CSV files
* [`numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for pre-processing data, building ML models, and computing performance metrics
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting


## Importing the packages

In [None]:
### The required libraries and packages ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from google.colab import drive
import os
from tqdm import tqdm
import time
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import Normalizer

## Importing the Data

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
path = 'drive/MyDrive/Colab Notebooks/M3_MP5_HRA/'
# path = 'drive/MyDrive/<YOUR FOLDER NAME AS IT APPEARS ON GOOGLE DRIVE>'

df_raw = pd.read_csv(path+'Employee_Attrition.csv')
print(df_raw.shape)
df_raw.head()

(1470, 35)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [None]:
df = df_raw.copy()
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [None]:
df.iloc[:3, :10]

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4


In [None]:
df.iloc[:3, 10:20]

Unnamed: 0,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate
0,2,Female,94,3,2,Sales Executive,4,Single,5993,19479
1,3,Male,61,2,2,Research Scientist,2,Married,5130,24907
2,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396


In [None]:
df.iloc[:3, 20:28]

Unnamed: 0,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel
0,8,Y,Yes,11,3,1,80,0
1,1,Y,No,23,4,4,80,1
2,6,Y,Yes,15,3,2,80,0


In [None]:
df.iloc[:3, 28:]

Unnamed: 0,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,8,0,1,6,4,0,5
1,10,3,3,10,7,1,7
2,7,3,3,0,0,0,0


## Graded Exercises (10 points)

**Exercises 1 to 4** (7 points)

These exercises deal with the data: its basic analysis, visualization, and preparation of the **FEATURES** only.

**Exercises 5 & 6** (3 points)

Exercises 5 and 6 deal with the classification models and their performance metrics.

As you can see, this Mini-Project is centered around the data rather than the algorithms.

### Exercise 1 (1 point): Basic EDA

- Check the shape of the data
- Check the nulls present in each field
- Check the unique number of entries per field
- Check the statistics of the data for each column
- Drop the features that are redundant or that have constant values across all rows


**Hint** : Use the `pandas` module
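
The checklist above can be sketched with a handful of `pandas` calls. This is only an illustration on a made-up frame (`toy`, with invented values and a deliberately small set of columns); in the notebook the same calls would be applied to `df`. The names `constant_cols` and `toy_clean` are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
    "StandardHours": [80, 80, 80, 80],   # constant column
    "Over18": ["Y", "Y", "Y", "Y"],      # constant column
    "MonthlyIncome": [5993, 5130, None, 2909],
})

print(toy.shape)                     # (rows, columns)
print(toy.isnull().sum())            # nulls per field
print(toy.nunique())                 # unique entries per field
print(toy.describe(include="all"))   # statistics for every column

# A column with a single unique value carries no information for a classifier.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy_clean = toy.drop(columns=constant_cols)
print("dropped:", constant_cols)
```

Identifier-like columns (one unique value per row) can be spotted the same way, by comparing `nunique()` with the number of rows.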

In [None]:
# Check the shape of the data
# YOUR CODE HERE

In [None]:
# Check the nulls present in each field
# YOUR CODE HERE

In [None]:
# Check the unique number of entries per field
# YOUR CODE HERE

In [None]:
# Check the statistics of the data for each column
# YOUR CODE HERE

### Exercise 2 (2 points): Features Visualization - 1

Plot the data distribution to see how the data is distributed (ex: Normal, Uniform, Poisson, Skewed-Normal etc.)

1. Determine the variables that are best viewed with:
 - Histograms (choose an appropriate bin size/number of bins if the default does not give a good plot)
 - Bar plot 
 - Categorical Plot (Box/Violin/Swarm)

2. Display each feature variable using the appropriate plot type.

**Hints**: Refer to the `seaborn` or `matplotlib` documentation to achieve the respective tasks

**Optional**:
It is preferable to show multiple variables **in a single picture, such as a 4x5 grid** (subplot) format, rather than scrolling through 35 individual images. Adjust settings such as spacing, font size, aspect ratio, and choice of colors to produce a visually appealing picture of the kind that appears on professional dashboard-based websites, as in some of the links below:

[Image-1](https://docs.microsoft.com/en-us/power-bi/create-reports/media/service-dashboards/power-bi-dashboard2.png)

[Image-2](https://d22e4d61ky6061.cloudfront.net/sites/default/files/full-width-stripy-socks/scrnshot-infrastructure-complete.png)

[Image-3](https://www.google.com/url?sa=i&url=https%3A%2F%2Fmjsharma.github.io%2FAzureMonitorDashboards%2F&psig=AOvVaw1Lt7Zus-l0yXWJ_YIg4eP4&ust=1649836098729000&source=images&cd=vfe&ved=2ahUKEwjTqIvGhI73AhVEQ2wGHfLsBT0QjRx6BAgAEAk)

At least try to present the Histograms/Box/Categorical Plots in 3 pictures.
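
One possible way to put several variables into a single subplot grid is sketched below, using `matplotlib` only. The data frame `toy` is randomly generated for the example (it is not the HR data), and the rule "numeric column gets a histogram, text column gets a bar plot" is a simplifying assumption; ordinal features stored as integers may be better served by bar plots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated stand-in data, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, size=200),
    "MonthlyIncome": rng.gamma(2.0, 3000.0, size=200),
    "Department": rng.choice(["Sales", "R&D", "HR"], size=200),
    "OverTime": rng.choice(["Yes", "No"], size=200),
})

# Split the columns by dtype: numeric -> histogram, text -> bar plot.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One figure, one axis per variable, laid out on a 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, col in zip(axes.ravel(), num_cols + cat_cols):
    if col in num_cols:
        ax.hist(toy[col], bins=20, color="steelblue", edgecolor="white")
    else:
        counts = toy[col].value_counts()
        ax.bar(counts.index.tolist(), counts.values, color="darkorange")
    ax.set_title(col, fontsize=10)
fig.tight_layout()  # avoid overlapping titles and tick labels
```

For the full dataset, the grid dimensions in `plt.subplots` would be enlarged (for example to 4x5) and `figsize` adjusted to match.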

In [None]:
# List the Variables that would be best described by Histogram
# YOUR CODE HERE

In [None]:
# List the Variables that would be best described by Bar Plots
# YOUR CODE HERE

In [None]:
# Plot the histograms
# YOUR CODE HERE

In [None]:
# Plot the bar Plots
# YOUR CODE HERE

In [None]:
# Plot the Categorical Plots
# YOUR CODE HERE

### Exercise 3 (2 points): Feature Engineering 

- Fill the missing values:
  - numerical: With mean
  - categorical: with `Others`/Suitable name as appropriate
- Identify and list the features that need to be bucketized into discrete bins
- Convert the text categories into numerical values (for example with `sklearn`'s `LabelEncoder`), then normalize the features using `sklearn`'s `Normalizer`
- Ensure that entire dataset has **Numerical values Only**
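
The steps above can be sketched as follows. This is a minimal example on a made-up frame (`toy`), not a reference solution: the bin edges for `Age`, the new column name `AgeBin`, and the choice of `LabelEncoder` for the text columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, Normalizer

# Made-up data with gaps in both numerical and categorical columns.
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "MonthlyIncome": [5993.0, 5130.0, 2090.0, np.nan],
    "Department": ["Sales", None, "Research & Development", "Sales"],
    "OverTime": ["Yes", "No", "Yes", "No"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numerical gaps -> column mean; categorical gaps -> the label "Others".
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna("Others")

# Bucketize a continuous feature into discrete bins (edges are arbitrary here).
toy["AgeBin"] = pd.cut(toy["Age"], bins=[0, 35, 45, 100], labels=False)

# Text categories -> integer codes, one encoder per column.
for col in cat_cols:
    toy[col] = LabelEncoder().fit_transform(toy[col])

# Normalizer rescales each ROW to unit L2 norm.
scaled = pd.DataFrame(Normalizer().fit_transform(toy), columns=toy.columns)

# Every column should now be numeric.
print(toy.dtypes)
```

Note that `Normalizer` works per sample (row), whereas `StandardScaler` standardizes each feature (column); which one suits the model is a design choice worth stating in your findings.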

In [None]:
# Check for Null and if any, replace with mean
# YOUR CODE HERE

In [None]:
# Convert the Text categories into Numerical Categories
# YOUR CODE HERE

In [None]:
# Identify and list the features and target which need to be bucketized into discrete bins
# YOUR CODE HERE

In [None]:
# Normalization
# YOUR CODE HERE

In [None]:
# Ensure that entire dataset has Numerical values Only
# YOUR CODE HERE

### Exercise 4 (2 points): Data Preparation

 - Plot an annotated heatmap of the correlation
 - Drop highly correlated variables, if any
 - Split the data into training and testing
 - Check if the Target data is balanced/imbalanced
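
A rough sketch of these four steps is given below. The data are randomly generated (two deliberately near-duplicate columns and a roughly 84/16 target), the 0.9 correlation threshold is an arbitrary assumption, and the heatmap is drawn with plain `matplotlib`; `seaborn.heatmap(corr, annot=True)` is a shorter alternative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data, for illustration only.
rng = np.random.default_rng(42)
n = 200
years = rng.integers(0, 20, size=n)
toy = pd.DataFrame({
    "YearsAtCompany": years,
    "YearsInCurrentRole": years + rng.integers(0, 2, size=n),  # near copy
    "DistanceFromHome": rng.integers(1, 30, size=n),
    "Attrition": rng.choice([0, 1], size=n, p=[0.84, 0.16]),
})
X = toy.drop(columns="Attrition")
y = toy["Attrition"]

# Annotated heatmap of the feature correlations.
corr = X.corr()
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax)

# Keep the upper triangle only, so each pair is examined once,
# then drop the second member of any pair above the threshold.
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X = X.drop(columns=to_drop)

# Stratify so train and test keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Class proportions reveal whether the target is balanced.
print(y.value_counts(normalize=True))
```

If the target turns out to be imbalanced, accuracy alone can be misleading, which is why precision, recall, and F1-score matter in the later exercises.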

In [None]:
# Plot an annotated heatmap of the correlation
# YOUR CODE HERE

In [None]:
# Drop highly correlated variables, if any
# YOUR CODE HERE

In [None]:
# Split the data into training and testing
# YOUR CODE HERE

In [None]:
# Check if the Target data is balanced/imbalanced
# YOUR CODE HERE

### Exercise 5 (2 points) : Building ML models

- Build any 3 `sklearn`'s classifiers of your choice
- Train and fit on the Training Data
- Predict on the Test Data

**Hint**: Train the model and predict in separate cells. It will save time when debugging.
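
As an illustration of the train/predict pattern, here is a minimal sketch with three `sklearn` classifiers. The three models and their hyperparameters are arbitrary choices made for this example, and the data come from `make_classification` (synthetic, imbalanced) rather than from the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features and target.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Three classifiers chosen only as examples.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Step 1: fit every model on the training data.
for name, model in models.items():
    model.fit(X_train, y_train)

# Step 2: predict on the held-out test data.
preds = {name: model.predict(X_test) for name, model in models.items()}
for name, ypred in preds.items():
    print(name, "test accuracy:", (ypred == y_test).mean())
```

In the notebook, the fitting loop and the prediction step would sit in separate cells, as the hint suggests.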

In [None]:
# MODEL 1

# YOUR CODE HERE

In [None]:
# MODEL 2

# YOUR CODE HERE

In [None]:
# MODEL 3

# YOUR CODE HERE

### Exercise 6 (1 point) : Attrition Analysis

- Write a function to compute the classification metrics. The arguments of the function will be the true and predicted values of the `y` variable. The function should:
  - Print the Classification Report
  - Print the Confusion Matrix
  - Print the AUC metrics (or plot ROC curves)

- Explain why one model behaves better than the other(s) in terms of Accuracy, Precision, Recall and F1-Score
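
The metric calls involved can be sketched as below. The helper `report_metrics` is a hypothetical name used only for this example (it is not the graded `att_analysis` function), the labels are made up, and the sketch assumes the target has already been encoded as 0 (No) and 1 (Yes).

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)


def report_metrics(ytrue, ypred):
    """Hypothetical helper: print and return basic classification metrics."""
    ytrue, ypred = np.asarray(ytrue), np.asarray(ypred)
    print(classification_report(ytrue, ypred))

    cm = confusion_matrix(ytrue, ypred)  # rows: true class, cols: predicted
    print(cm)

    # AUC with class 1 (Yes) as the positive class.
    auc_yes = roc_auc_score(ytrue, ypred)
    # AUC with class 0 (No) as the positive class: flip labels and predictions.
    auc_no = roc_auc_score(1 - ytrue, 1 - ypred)
    print("AUC (Yes):", auc_yes, "| AUC (No):", auc_no)
    return cm, auc_yes, auc_no


# Made-up labels: 5 negatives, 3 positives, one error of each kind.
ytrue = np.array([0, 0, 0, 0, 1, 1, 0, 1])
ypred = np.array([0, 0, 1, 0, 1, 0, 0, 1])
cm, auc_yes, auc_no = report_metrics(ytrue, ypred)
```

Because the function receives hard 0/1 predictions, the AUC here reduces to the average of the true-positive and true-negative rates and is the same for both classes; passing probabilities from `predict_proba` to `roc_auc_score` instead would give a finer-grained ROC.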



In [None]:
def att_analysis(ytrue, ypred):
  # Print the Classification Report
  print("="*30)
  print("CLASSIFICATION REPORT = ")
  print("="*30)
  # YOUR CODE HERE

  # Print the Confusion Matrix
  print("="*30)
  print("CONFUSION MATRIX = ")
  print("="*30)
  # YOUR CODE HERE

  # Print the AUC metrics (or Plot ROC curves)
  print("="*30)
  print("AUC FOR CLASS 0 (NO) = ") #  or print("ROC CURVE : ") 
  print("="*30) 
  # YOUR CODE HERE

  # Print the AUC metrics (or Plot ROC curves)
  print("="*30)
  print("AUC FOR CLASS 1 (YES) = ")#  or print("ROC CURVE : ") 
  print("="*30)   
  # YOUR CODE HERE

In [None]:
# CLASSIFICATION METRICS- MODEL1
att_analysis(
    # YOUR CODE HERE
)

In [None]:
# CLASSIFICATION METRICS- MODEL2
att_analysis(
    # YOUR CODE HERE
)

In [None]:
# CLASSIFICATION METRICS- MODEL3
att_analysis(
    # YOUR CODE HERE
)

#### YOUR FINDINGS / Reasoning for which model is better and why (Qualitatively and Quantitatively)

Explain why one model behaves better than the other(s) in terms of Accuracy, Precision, Recall and F1-Score

## Additional Ungraded Exercise for Practice:

- Try out other ML models