# Attrition Analytics - Exploratory Analysis & Predictive Modeling

- The entire code used in this kernel is uploaded into my [github repo](https://github.com/Niranjankumar-c/HRAnalyticsEmployeeAttrition). Feel free to fork the repo
- For more beginner friendly projects in Data Science, check out my github https://github.com/Niranjankumar-c

## Objective:
The objective of this report is to study factors such as salary, satisfaction level, growth opportunities, facilities, policies and procedures, recognition, appreciation and employee suggestions, in order to understand the level of attrition in an organization and the factors that help retain employees. The study also helps identify where organizations are lagging in retention.

## Hypothesis:

> Employee attrition increases the costs of recruiting, hiring and training replacements in the industries.
> Employee attrition reduces production and profit in the industries.

# Approach
> Decision Tree Modeling: I used a decision tree to build the model. The major hurdle was class imbalance: 1233 employees in the negative class (Attrition = No) against 237 in the positive class (Attrition = Yes), which makes the data highly imbalanced. So I used stratified sampling based on the proportion of attrition in the overall data.

> Model Evaluation: I used k-fold cross validation to assess how the results of the model will generalize to an independent test data set, with k = 5 (5-fold cross validation). The model was fit on the stratified training sample and tested on the held-out test set.

> ROC Curve: The ROC curve is a simple plot that shows the trade-off between the true positive rate and the false positive rate of a classifier for various choices of the probability threshold. ROC area = 0.6128 and F1 = 0.81

> Decision Tree Visualization: Used Python pydotplus package to visualize the decision tree to draw insights
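
The stratified 5-fold idea described above can be sketched on synthetic labels. This is only an illustration, not the code used later in the notebook: it assumes scikit-learn's `StratifiedKFold`, and the label vector `y_demo` is made up to mimic a positive rate of roughly 16%.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 1000 rows, 16% of them positive (1 = left the company)
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.zeros((len(y_demo), 1))  # features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)

fold_rates = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # share of positives inside each held-out fold
    fold_rates.append(y_demo[test_idx].mean())

print(fold_rates)
```

Each held-out fold keeps about the same share of positives as the full data, which is the point of stratifying an imbalanced target.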


# Import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
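#using the seaborn style for graphs
#(matplotlib 3.6+ renamed this style to "seaborn-v0_8", so fall back to that name if needed)
plt.style.use("seaborn" if "seaborn" in plt.style.available else "seaborn-v0_8")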

In [None]:
## Read the dataset
employee_data = pd.read_csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
employee_data.head()

In [None]:
##looking for any missing values

employee_data.isnull().sum()

In [None]:
employee_data.info()

# Exploratory Data Analysis

In [None]:
## basic descriptive statistics
employee_data.describe()

In [None]:
#Mapping the attrition 1 - yes and 0 - no in the new column

employee_data["left"] = np.where(employee_data["Attrition"] == "Yes",1,0)

In [None]:
employee_data.head()

In [None]:
#suppressing all the warnings
import warnings
warnings.filterwarnings('ignore')

- Remove features that are not useful

In [None]:
def NumericalVariables_targetPlots(df,segment_by,target_var = "Attrition"):
    """A function for plotting the distribution of numerical variables and its effect on attrition"""
    
    fig, ax = plt.subplots(ncols= 2, figsize = (14,6))    

    #boxplot for comparison
    sns.boxplot(x = target_var, y = segment_by, data=df, ax=ax[0])
    ax[0].set_title("Comparison of " + segment_by + " vs " + target_var)
    
    #distribution plot
    ax[1].set_title("Distribution of "+segment_by)
    ax[1].set_ylabel("Frequency")
    sns.distplot(a = df[segment_by], ax=ax[1], kde=False)
    
    plt.show()

In [None]:
def CategoricalVariables_targetPlots(df, segment_by,invert_axis = False, target_var = "left"):
    
    """A function for Plotting the effect of variables(categorical data) on attrition """
    
    fig, ax = plt.subplots(ncols= 2, figsize = (14,6))
    
    #countplot for distribution along with target variable
    #invert_axis swaps the axes so that long category names don't overlap
    if not invert_axis:
        sns.countplot(x = segment_by, data=df,hue="Attrition",ax=ax[0])
    else:
        sns.countplot(y = segment_by, data=df,hue="Attrition",ax=ax[0])
        
    ax[0].set_title("Comparison of " + segment_by + " vs " + "Attrition")
    
    #plot the effect of variable on attrition (mean of the 0/1 target per category)
    #ax is passed explicitly so the bars always land on the right-hand subplot
    if not invert_axis:
        sns.barplot(x = segment_by, y = target_var ,data=df,ci=None,ax=ax[1])
        ax[1].set_ylabel("Average(Attrition)")
    else:
        sns.barplot(y = segment_by, x = target_var ,data=df,ci=None,ax=ax[1])
        ax[1].set_xlabel("Average(Attrition)")
        
    ax[1].set_title("Attrition rate by {}".format(segment_by))
    plt.tight_layout()

    plt.show()
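
The right-hand bar plot produced by this helper shows the mean of the 0/1 target per category, which is the attrition rate of that category. A minimal sketch of the same number computed directly with pandas, using a made-up toy frame (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data, made up for illustration; "left" is 1 if the employee left, else 0
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "left":     [1,     1,     0,     0,     1,    0,    0,    0,    0,    0],
})

# The mean of a 0/1 column within a group is that group's attrition rate
attrition_rate = toy.groupby("OverTime")["left"].mean()
print(attrition_rate)
```

Because `sns.barplot` aggregates with the mean by default, the bar heights correspond to this `groupby(...).mean()` result.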

## Analyzing the variables

- Numerical Variables

### Age

In [None]:
# check the distribution of employee age and whether it is related to attrition

NumericalVariables_targetPlots(employee_data,segment_by="Age")

- The median age of employees in the company lies in the 30 to 40 year range. The minimum age is 18 years and the maximum is 60 years.
- From the age comparison boxplot, the majority of people who left the company are below 40 years old, while most of those who stayed are between 32 and 40 years old

### Daily Rate & Monthly Income & HourlyRate

In [None]:
#Analyzing the daily wage rate vs employee left the company or not

NumericalVariables_targetPlots(employee_data,"DailyRate")

In [None]:
NumericalVariables_targetPlots(employee_data,"MonthlyIncome")

- Employees on lower daily rates are more prone to leave the company than employees on higher rates. The same trend holds for monthly income.

**Hourly Rate**

In [None]:
NumericalVariables_targetPlots(employee_data,"HourlyRate")

- The plot shows no significant difference in hourly rate between employees who left and those who stayed. Therefore hourly rate is not considered significant to attrition.

### PercentSalaryHike

In [None]:
NumericalVariables_targetPlots(employee_data,"PercentSalaryHike")

- The majority of employees (60% of total strength) receive a 16% salary hike; employees who received a smaller salary hike have left the company.

### Total Working years

In [None]:
NumericalVariables_targetPlots(employee_data,"TotalWorkingYears")

In [None]:
#note: seaborn 0.9+ uses `height` in place of the older `size` argument
sns.lmplot(x = "TotalWorkingYears", y = "PercentSalaryHike", data=employee_data,fit_reg=False,hue="Attrition",height=6,
           aspect=1.5)

plt.show()

- Employees with fewer working years have received a 25% salary hike when they switched to another company, but there is no linear relationship between working years and salary hike.
- No attrition is seen among employees with more than 20 years of experience when their salary hike is above 20%; even when the hike is below 20%, the attrition rate among these employees is very low.
- Employees with fewer years of experience are prone to leave the company in search of better pay, irrespective of salary hike

### Distance From Home

In [None]:
NumericalVariables_targetPlots(employee_data,"DistanceFromHome")

- More people live close to the office, and attrition levels are lower for distances below 10. As the distance from home increases, the attrition rate also increases

## Analyzing the variables

- Categorical Variables

### Job Involvement

In [None]:
#cross tabulation between attrition and JobInvolvement
pd.crosstab(employee_data.JobInvolvement,employee_data.Attrition)

In [None]:
#calculating the percentage of people having different job involvement rate
round(employee_data.JobInvolvement.value_counts()/employee_data.shape[0] * 100,2)

In [None]:
CategoricalVariables_targetPlots(employee_data,"JobInvolvement")

1. In the total data set, 59% have high job involvement whereas 25% have a medium involvement rate
2. From the above plot we can observe that around 50% of people with low job involvement (levels 1 & 2) have left the company.
3. Even people with high job involvement show notable attrition: around 15% of that category have left the company
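
The per-category attrition rates quoted above can be read straight off a row-normalized cross tabulation. A minimal sketch on a made-up toy frame (the counts are illustrative only), using the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "JobInvolvement": [1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
    "Attrition": ["Yes", "Yes", "No", "No",
                  "Yes", "No", "No", "No", "No", "No", "No", "No"],
})

# normalize="index" divides each row by its total, so every row sums to 1
rates = pd.crosstab(toy["JobInvolvement"], toy["Attrition"], normalize="index")
print(rates)
```

The `Yes` column of `rates` is then the share of each involvement level that left, which is easier to compare across levels than raw counts.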

### JobSatisfaction

In [None]:
CategoricalVariables_targetPlots(employee_data,"JobSatisfaction")

As expected, people with low job satisfaction leave the company: around 23% of that category have left. What is surprising is that, of the people who rated their job satisfaction medium or high, around 32% have left the company. There must be some other factor that triggers their exit.

### Performance Rating

In [None]:
#checking the number of categories under performance rating
employee_data.PerformanceRating.value_counts()

In [None]:
#calculate the percentage of performance rating per category in the whole dataset
round(employee_data.PerformanceRating.value_counts()/employee_data.shape[0] * 100,2)

Around 85% of people in the company are rated Excellent and the remaining 15% are rated Outstanding

In [None]:
CategoricalVariables_targetPlots(employee_data,"PerformanceRating")

Contrary to the common belief that employees with a higher rating will not leave the company, there is no significant difference in attrition rate between the performance ratings.

### RelationshipSatisfaction

In [None]:
#percentage of each relationship satisfaction category across the data
round(employee_data.RelationshipSatisfaction.value_counts()/employee_data.shape[0] * 100,2)

In [None]:
CategoricalVariables_targetPlots(employee_data,"RelationshipSatisfaction")

Here too, we found that almost 30% of employees with high and very high RelationshipSatisfaction have left the company. Again there is no visible trend between RelationshipSatisfaction and attrition rate

### WorkLifeBalance


In [None]:
#percentage of worklife balance rating across the company data
round(employee_data.WorkLifeBalance.value_counts()/employee_data.shape[0] * 100,2)

More than 60% of the employees rated their work-life balance as Better and 10% rated it as Best

In [None]:
CategoricalVariables_targetPlots(employee_data,"WorkLifeBalance")

- As expected, more than 30% of the people who rated their WorkLifeBalance as Bad have left the company, while around 15% of the people who rated it as Best also left

### OverTime

In [None]:
CategoricalVariables_targetPlots(employee_data,"OverTime")

More than 30% of employees who worked overtime have left the company, whereas 90% of employees who did not work overtime have stayed. Therefore overtime is a strong indicator of attrition

### BusinessTravel

In [None]:
CategoricalVariables_targetPlots(employee_data,segment_by="BusinessTravel")

- More people travel rarely than travel frequently. Among people who travel frequently, around 25% have left the company; in the other cases the attrition rate does not vary significantly with travel

### Department

In [None]:
employee_data.Department.value_counts()

In [None]:
CategoricalVariables_targetPlots(employee_data,segment_by="Department")

- Comparing departments, HR has seen only a marginally higher turnover rate, whereas the numbers are significant in Sales, which accounts for about 39% of all employees who left. Attrition is not appreciable in R & D, where 67% have recorded no attrition.
- Within each department, Sales has the highest attrition rate at about 20.6%, followed by HR at around 18%

### EducationField

In [None]:
employee_data.EducationField.value_counts()

In [None]:
CategoricalVariables_targetPlots(employee_data,"EducationField",invert_axis=True)

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(y = "EducationField", x = "left", hue="Education", data=employee_data,ci=None)
plt.show()

- Most employees have a Life Sciences background, followed by Medical and Marketing
- Employees in the EducationField of Human Resources and Technical Degree have the highest attrition levels, around 26% and 23% respectively
- When compared with Education level, we observed that employees at the highest level of education in their field of study have left the company. We can conclude that EducationField is a strong indicator of attrition

### EnvironmentSatisfaction

In [None]:
CategoricalVariables_targetPlots(employee_data,"EnvironmentSatisfaction")

We can see that around 25% of people with low environment satisfaction leave the company

### Gender Vs Attrition

In [None]:
#Gender goes on the x-axis and MonthlyIncome on the y-axis, so the labels must match
sns.boxplot(x='Gender', y='MonthlyIncome', data=employee_data)
plt.title('MonthlyIncome vs Gender Box Plot', fontsize=20)      
plt.xlabel('Gender', fontsize=16)
plt.ylabel('MonthlyIncome', fontsize=16)
plt.show()

In [None]:
CategoricalVariables_targetPlots(employee_data,"Gender")

- The monthly income distribution for male and female employees is almost the same, and the attrition rate for both is similar at around 15%. Gender is not a strong indicator of attrition

In [None]:
fig,ax = plt.subplots(2,3, figsize=(20,20))               # 'ax' holds references to all six axes
plt.suptitle("Comparison of various factors vs Gender", fontsize=20)
gender_factors = ["DistanceFromHome","YearsAtCompany","TotalWorkingYears",
                  "YearsInCurrentRole","YearsSinceLastPromotion","NumCompaniesWorked"]
#one bar plot per factor, filling the 2x3 grid row by row
for axis, factor in zip(ax.flatten(), gender_factors):
    sns.barplot(x="Gender", y=factor, hue="Attrition", data=employee_data, ax=axis, ci=None)
plt.show()

1. Distance from home matters more to female employees than to male employees.
2. Female employees spend more years in one company compared to their male counterparts.
3. Female employees who have spent more years in the current company are more inclined to switch.

### Job Role

In [None]:
CategoricalVariables_targetPlots(employee_data,"JobRole",invert_axis=True)

1. The most common job role is Sales Executive, followed by Research Scientist and then Laboratory Technician
2. Sales Representatives are the most likely to quit the company, followed by Laboratory Technicians and Human Resources staff; their attrition rates are 40%, 24% and 22% respectively

### Marital Status

In [None]:
CategoricalVariables_targetPlots(employee_data,"MaritalStatus")

From the plot, it is clear that irrespective of marital status, a large number of people stay with the company and do not leave. Therefore, marital status is a weak predictor of attrition

# Building Decision Tree

In [None]:
from sklearn.model_selection import train_test_split

#for fitting classification tree
from sklearn.tree import DecisionTreeClassifier

#to create a confusion matrix
from sklearn.metrics import confusion_matrix

#import whole class of metrics
from sklearn import metrics

In [None]:
employee_data.Attrition.value_counts().plot(kind = "bar")
plt.xlabel("Attrition")
plt.ylabel("Count")
plt.show()

In [None]:
employee_data["Attrition"].value_counts()

From the exploratory data analysis, the variables that are not significant to attrition are:

- EmployeeCount, EmployeeNumber, Gender, HourlyRate, JobLevel, MaritalStatus, Over18, StandardHours

In [None]:
#copying the main employee data to another dataframe
employee_data_new = employee_data.copy()

In [None]:
#dropping the not significant variables
employee_data_new.drop(["EmployeeCount","EmployeeNumber","Gender","HourlyRate","Over18","StandardHours","left"], axis=1,inplace=True)

# Handling Categorical Variables

- Segregate the numerical and Categorical variables
- Convert Categorical variables to dummy variables

In [None]:
#data types of variables
dict(employee_data_new.dtypes)

In [None]:
#segregating the variables based on datatypes

numeric_variable_names = employee_data_new.select_dtypes(include=['float64', 'int64', 'float32', 'int32']).columns.tolist()

categorical_variable_names = employee_data_new.select_dtypes(include=["object"]).columns.tolist()

In [None]:
categorical_variable_names

In [None]:
#store the numerical variables data in a separate dataset

employee_data_num = employee_data_new[numeric_variable_names]

In [None]:
#store the categorical variables data in a separate dataset,
#dropping the target (Attrition) without modifying a slice in place
employee_data_cat = employee_data_new[categorical_variable_names].drop(columns=["Attrition"])

In [None]:
#converting into dummy variables

employee_data_cat = pd.get_dummies(employee_data_cat)
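
To make the dummy-variable step concrete, here is a small sketch on a made-up frame (`toy_cat` and its values are illustrative, not the notebook's data). `pd.get_dummies` replaces every text column with one indicator column per category:

```python
import pandas as pd

# Toy categorical data, made up for illustration
toy_cat = pd.DataFrame({
    "OverTime": ["Yes", "No", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
})

# Each category becomes its own column named <column>_<category>
toy_dummies = pd.get_dummies(toy_cat)
print(list(toy_dummies.columns))
```

Every row has exactly one active indicator per original column, so two text columns turn into five indicator columns here.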

In [None]:
#Merging both the numerical and categorical data

employee_data_final = pd.concat([employee_data_num, employee_data_cat,employee_data_new[["Attrition"]]],axis=1)

In [None]:
employee_data_final.head()

In [None]:
#final features
features =  list(employee_data_final.columns.difference(["Attrition"]))

In [None]:
features

## Separating the Target and the Predictors

In [None]:
#separating the target and predictors

X = employee_data_final[features]
y = employee_data_final[["Attrition"]]

In [None]:
X.shape

# Train-Test Split(Stratified Sampling of Y)

In [None]:
# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

#function for crossvalidate score
from sklearn.model_selection import cross_validate

#to find the best hyperparameters with a grid search
from sklearn.model_selection import GridSearchCV

In [None]:
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size = 0.3,stratify = y,random_state = 100)

In [None]:
#Checks
#Proportion in training data
y_train.Attrition.value_counts()/len(y_train)

In [None]:
#Checks
#Proportion in training data
pd.DataFrame(y_train.Attrition.value_counts()/len(y_train)).plot(kind = "bar")
plt.show()

In [None]:
#Proportion of test data
y_test.Attrition.value_counts()/len(y_test)
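
The effect of `stratify` checked above can be reproduced on synthetic labels. This is a sketch under assumed data: `y_demo` is a made-up label vector with a 16% "Yes" rate, not the notebook's target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 160 "Yes" and 840 "No"
y_demo = np.array(["Yes"] * 160 + ["No"] * 840)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=100
)

# With stratify, both parts keep (almost exactly) the original class share
train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(train_rate, test_rate)
```

Without `stratify`, a random split of imbalanced data can leave the test set with noticeably more or fewer positives than the training set.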

In [None]:
#make a pipeline for decision tree model 

pipelines = {
    "clf": make_pipeline(DecisionTreeClassifier(max_depth=3,random_state=100))
}

#### Cross Validate 
- To check the accuracy of the pipeline

In [None]:
scores = cross_validate(pipelines['clf'], X_train, y_train,return_train_score=True)

In [None]:
scores['test_score'].mean()

Average accuracy of pipeline with Decision Tree Classifier is 83.48%

# Cross-Validation and Hyper Parameters Tuning
Hyperparameter tuning is the process of finding the best combination of parameters for the model by training and evaluating the model for each combination of the parameters; cross validation is how each combination is scored.
- Declare the hyperparameters used to fine-tune the Decision Tree Classifier

**A decision tree is built by a greedy algorithm: at each node it picks the locally best split rather than searching the entire space of possible decision trees, and left unchecked it keeps splitting until it overfits. So we need to find optimal parameters or criteria for stopping the decision tree at some point. We use the hyperparameters to prune the decision tree.**
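
A small sketch of what limiting the tree does, on synthetic data from `make_classification` (the data and the names `full_tree` and `shallow_tree` are assumptions for illustration, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=100)

# No stopping criteria: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=100).fit(X_demo, y_demo)

# max_depth=3 stops growth early, which acts as a simple form of pruning
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=100).fit(X_demo, y_demo)

print("full:   ", full_tree.get_depth(), full_tree.get_n_leaves(), full_tree.score(X_demo, y_demo))
print("shallow:", shallow_tree.get_depth(), shallow_tree.get_n_leaves(), shallow_tree.score(X_demo, y_demo))
```

The unrestricted tree memorizes the training data, while the shallow one trades training accuracy for a much smaller tree; the grid search is meant to find a balance between the two.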

In [None]:
decisiontree_hyperparameters = {
    "decisiontreeclassifier__max_depth": np.arange(3,12),
    "decisiontreeclassifier__max_features": np.arange(3,10),
    "decisiontreeclassifier__min_samples_split": [2,3,4,5,6,7,8,9,10,11,12,13,14,15],
    "decisiontreeclassifier__min_samples_leaf" : np.arange(1,3)
}
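
Two things about this grid may be worth spelling out. The `decisiontreeclassifier__` prefix is the step name that `make_pipeline` derives from the lowercased class name, followed by a double underscore and the parameter name. And the grid is exhaustive, so its size multiplies quickly. A sketch that counts the candidates with `ParameterGrid`, assuming the same grid as above and 5 folds:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the snippet is self-contained
grid = {
    "decisiontreeclassifier__max_depth": np.arange(3, 12),
    "decisiontreeclassifier__max_features": np.arange(3, 10),
    "decisiontreeclassifier__min_samples_split": list(range(2, 16)),
    "decisiontreeclassifier__min_samples_leaf": np.arange(1, 3),
}

# Number of parameter combinations, and model fits if each is scored with 5 folds
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

That is 9 x 7 x 14 x 2 = 1764 combinations, or 8820 fits with 5-fold cross validation, which is why running the search with `n_jobs=-1` helps.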

In [None]:
pipelines['clf']

## Decision Tree classifier with gini index

#### Fit and tune models with cross-validation

Now that we have our <code style="color:steelblue">pipelines</code> and <code style="color:steelblue">hyperparameters</code> dictionaries declared, we're ready to tune our models with cross-validation.
- We are doing 5 fold cross validation

In [None]:
#Create a cross validation object from the decision tree classifier and its hyperparameters

clf_model = GridSearchCV(pipelines['clf'], decisiontree_hyperparameters, cv=5, n_jobs=-1)

In [None]:
#fit the model with train data
clf_model.fit(X_train, y_train)

In [None]:
#Display the best parameters for Decision Tree Model
clf_model.best_params_

In [None]:
#Display the best score for the fitted model
clf_model.best_score_

In [None]:
#In a Pipeline we can use the step's string name to get the decisiontreeclassifier

clf_model.best_estimator_.named_steps['decisiontreeclassifier']

In [None]:
#saving into a variable to get graph

clf_best_model = clf_model.best_estimator_.named_steps['decisiontreeclassifier']

# Model Performance Evaluation
- On Test Data

In [None]:
#Making a dataframe of actual and predicted data from test set

tree_test_pred = pd.concat([y_test.reset_index(drop = True),pd.DataFrame(clf_model.predict(X_test))],axis=1)
tree_test_pred.columns = ["actual","predicted"]

#setting the index to original index
tree_test_pred.index = y_test.index

In [None]:
tree_test_pred.head()

In [None]:
#keeping only positive condition (yes for attrition)

#column 1 of predict_proba is the probability of the second class ("Yes")
pred_probability = pd.DataFrame({"predicted_prob": clf_model.predict_proba(X_test)[:, 1]},
                                index=y_test.index)

In [None]:
#merging the predicted data and its probability value

tree_test_pred = pd.concat([tree_test_pred,pred_probability],axis=1)

In [None]:
tree_test_pred.head()

In [None]:
#converting the labels Yes --> 1 and No --> 0 for further operations below

tree_test_pred["actual_left"] = np.where(tree_test_pred["actual"] == "Yes",1,0)
tree_test_pred["predicted_left"] = np.where(tree_test_pred["predicted"] == "Yes",1,0)

In [None]:
tree_test_pred.head()

### Confusion Matrix

The confusion matrix is a way of tabulating the number of misclassifications, i.e., the number of predicted classes which ended up in a wrong classification bin based on the true classes.

In [None]:
#confusion matrix
metrics.confusion_matrix(tree_test_pred.actual,tree_test_pred.predicted,labels=["Yes","No"])

In [None]:
#confusion matrix visualization using seaborn heatmap

sns.heatmap(metrics.confusion_matrix(tree_test_pred.actual,tree_test_pred.predicted,
                                    labels=["Yes","No"]),cmap="Greens",annot=True,fmt=".2f",
           xticklabels = ["Left", "Not Left"] , yticklabels = ["Left", "Not Left"])

plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
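
How to read the matrix depends on the `labels` order. A minimal sketch with made-up predictions (the two lists below are illustrative): with `labels=["Yes","No"]`, the first row holds the employees who actually left and the first column holds those predicted to leave.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for eight employees
actual    = ["Yes", "Yes", "Yes", "No",  "No", "No", "No", "No"]
predicted = ["Yes", "No",  "No",  "Yes", "No", "No", "No", "No"]

cm = confusion_matrix(actual, predicted, labels=["Yes", "No"])

# Rows are true classes, columns are predicted classes, both in the order given by `labels`
tp, fn = cm[0]  # actual "Yes": predicted "Yes" / predicted "No"
fp, tn = cm[1]  # actual "No":  predicted "Yes" / predicted "No"
print(cm)
```

If `labels` is omitted, scikit-learn sorts the classes, so "No" would come first and the four cells would swap positions.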

In [None]:
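#Area Under ROC Curve
#note: this score is computed from the hard 0/1 predictions, so it reflects a single
#operating point; passing predicted_prob instead gives the threshold-independent AUC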

auc_score_test = metrics.roc_auc_score(tree_test_pred.actual_left,tree_test_pred.predicted_left)
print("AUROC Score:",round(auc_score_test,4))
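
`roc_auc_score` returns different values depending on whether it receives hard class labels or predicted probabilities. A small sketch on made-up numbers (`y_true_demo` and `prob_demo` are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
prob_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Hard predictions at the usual 0.5 cutoff
hard_demo = (prob_demo >= 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true_demo, prob_demo)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true_demo, hard_demo)  # only one operating point
print(auc_from_probs, auc_from_labels)
```

With hard labels the score collapses to the average of the true positive rate and the true negative rate at that single cutoff, so it is usually lower than the probability-based AUC.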

In [None]:
##Plotting the ROC Curve

fpr, tpr, thresholds = metrics.roc_curve(tree_test_pred.actual_left, tree_test_pred.predicted_prob,drop_intermediate=False)


plt.figure(figsize=(8, 6))
plt.plot( fpr, tpr, label='ROC curve (area = %0.4f)' % auc_score_test)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.legend(loc="lower right")
plt.show()


From the ROC curve, we have a choice to make depending on the value we place on true positives and our tolerance for false positives
- If we wish to find more of the people who are leaving, we can increase the true positive rate by lowering the probability cutoff for classification. However, doing so also increases the false positive rate, so we need to find the optimum cutoff value for classification
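
The trade-off can be seen by computing both rates at two cutoffs. This is a sketch on made-up scores; `rates_at_cutoff` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

# Made-up true labels and predicted probabilities of leaving
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1])
prob_demo = np.array([0.1, 0.3, 0.45, 0.6, 0.35, 0.7, 0.9])

def rates_at_cutoff(y_true, prob, cutoff):
    """Return (true positive rate, false positive rate) for a probability cutoff."""
    pred = (prob >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

tpr_default, fpr_default = rates_at_cutoff(y_true_demo, prob_demo, 0.5)
tpr_low, fpr_low = rates_at_cutoff(y_true_demo, prob_demo, 0.3)
print(tpr_default, fpr_default)
print(tpr_low, fpr_low)
```

Lowering the cutoff from 0.5 to 0.3 catches every leaver in this toy example, but it also flags three of the four employees who stayed.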

#### Metrics

- Recall: the number of correctly classified positive examples divided by the total number of positive examples. High recall indicates that most members of the class are correctly recognized
- Precision: the number of correctly classified positive examples divided by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive
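
Both definitions can be checked by hand on a tiny made-up example and compared with scikit-learn's `recall_score` and `precision_score` (the label vectors below are illustrative only):

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = left, 0 = stayed
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0]

# Count the cells of the confusion matrix by hand
tp = sum(1 for t, p in zip(y_true_demo, y_pred_demo) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true_demo, y_pred_demo) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true_demo, y_pred_demo) if t == 0 and p == 1)

recall_manual = tp / (tp + fn)      # correct positives / all actual positives
precision_manual = tp / (tp + fp)   # correct positives / all predicted positives

print(recall_manual, recall_score(y_true_demo, y_pred_demo))
print(precision_manual, precision_score(y_true_demo, y_pred_demo))
```

Here only one of the three leavers is found (recall of one third), and one of the two flagged employees really left (precision of one half).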

In [None]:
#calculating the recall score

print("Recall Score:",round(metrics.recall_score(tree_test_pred.actual_left,tree_test_pred.predicted_left) * 100,3))

In [None]:
#calculating the precision score

print("Precision Score:",round(metrics.precision_score(tree_test_pred.actual_left,tree_test_pred.predicted_left) * 100,3))

In [None]:
print(metrics.classification_report(tree_test_pred.actual_left,tree_test_pred.predicted_left))

# Visualization of Decision Tree
- Dependencies 
    - Need to install graphviz (conda install pydot graphviz)
    - Set the environment path variable to graphviz folder

In [None]:
# conda install pydot graphviz
#! pip install pydotplus

In [None]:
from sklearn.tree import export_graphviz

In [None]:
!pip install pydotplus

In [None]:
import pydotplus as pdot

In [None]:
from io import StringIO  #sklearn.externals.six is no longer available in newer scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus as pdot

In [None]:
#write the dot data
dot_data = StringIO()

In [None]:
#export the decision tree along with the feature names into a dot file format

export_graphviz(clf_best_model,out_file=dot_data,filled=True,
                rounded=True,special_characters=True,feature_names = X_train.columns.values,class_names = ["No","Yes"])

In [None]:
#make a graph from dot file 
graph = pdot.graph_from_dot_data(dot_data.getvalue())

In [None]:
Image(graph.create_png())

In [None]:
#export the tree diagram
graph.write_png("employee_attrition.png")
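
If Graphviz is not available, a plain-text view of a tree is a possible alternative. The sketch below is not the method used above: it swaps in scikit-learn's `export_text` and fits a small tree on synthetic data, with made-up feature names `f0` to `f3`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and hypothetical feature names, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=100)
feature_names = ["f0", "f1", "f2", "f3"]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=100).fit(X_demo, y_demo)

# export_text renders the split rules as indented text, no Graphviz required
tree_text = export_text(demo_tree, feature_names=feature_names)
print(tree_text)
```

The same call should work on the tuned model by passing `clf_best_model` and `list(X_train.columns)` instead of the demo objects.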