# 0. Attrition Analysis

> Data is from [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)

Employee attrition is a major concern for many companies. [Lessening employee turnover](http://bit.ly/2f8I4js) involves a trade-off: although more than 40 percent of employees may be at risk of leaving, companies still take healthy growth seriously. Take large Chinese internet companies such as NetEase and Huawei as examples: attrition remains a big concern even though they can recruit many new employees every year. Key-person risk may alarm a company's CEO and put the company in jeopardy, but ordinary employees like you and me who leave as the company grows can also raise a red flag. Attrition is essentially the slow decay of a company; if management takes no action, the workforce shrinks, production stalls, and the company may eventually go out of business. If companies can tackle this challenge, however, they stand to gain more positive growth. This problem therefore caught our attention, and our team decided to build a well-fitting model that predicts employee attrition in a company from assorted variables. According to McKinsey research on people analytics, information about an employee, such as compensation level and demographic profile, can indeed help predict attrition or raise a red flag on certain employees, so that a model can support HR decision-making.

# 1. Preparation

## 1.1 Import Packages

Import the packages used for exploratory data analysis and data visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import warnings
%matplotlib inline 
%config InlineBackend.figure_format = 'retina' # Set to retina version
pd.set_option('display.max_columns', None) # Set max columns output
warnings.filterwarnings('ignore')

## 1.2  Overview of Data Set

In [None]:
df = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
print(df.shape)
display(df.head())

This overview shows that the data set includes 1470 observations and 35 features. The features 'Over18', 'EmployeeCount' and 'StandardHours' have the same value in every row, and 'EmployeeNumber' is only an identifier for each employee, so we drop these columns.
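
The claim that some columns are constant or identifier-like can be checked by counting distinct values per column. The sketch below uses a small made-up frame (only the column names mirror the data set) and is an illustration, not the check the authors ran.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "Over18": ["Y", "Y", "Y", "Y"],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeNumber": [1, 2, 4, 5],
})

n_unique = toy.nunique()
# Columns with a single distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols)
print(id_like_cols)
```

On the real data, the same two lines applied to `df` would list the candidates for dropping; a continuous feature with no repeated values would also show up as identifier-like, so the result still needs a manual look.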

In [None]:
df = df.drop(columns=['Over18', 'EmployeeCount', 'StandardHours', 'EmployeeNumber'])

Following the description page of the data set, assign descriptive labels to the integer-coded categorical features.

In [None]:
education_map = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
environment_satisfaction_map = {1: 'Low', 2:'Medium', 3:'High', 4:'Very High'}
job_involvement_map = {1: 'Low', 2:'Medium', 3:'High', 4:'Very High'}
job_satisfaction_map = {1: 'Low', 2:'Medium', 3:'High', 4:'Very High'}
performance_rating_map = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}
relationship_satisfaction_map = {1: 'Low', 2:'Medium', 3:'High', 4:'Very High'}
work_life_balance_map = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}
# Replace the integer codes of each ordinal feature with its descriptive label
df['Education'] = df["Education"].apply(lambda x: education_map[x])
df['EnvironmentSatisfaction'] = df["EnvironmentSatisfaction"].apply(lambda x: environment_satisfaction_map[x])
df['JobInvolvement'] = df["JobInvolvement"].apply(lambda x: job_involvement_map[x])
df['JobSatisfaction'] = df["JobSatisfaction"].apply(lambda x: job_satisfaction_map[x])
df['PerformanceRating'] = df["PerformanceRating"].apply(lambda x: performance_rating_map[x])
df['RelationshipSatisfaction'] = df["RelationshipSatisfaction"].apply(lambda x: relationship_satisfaction_map[x])
df['WorkLifeBalance'] = df["WorkLifeBalance"].apply(lambda x: work_life_balance_map[x])

Then check the data set again.

In [None]:
display(df.head())

### Type of Features

Take a close look at the type of features.

#### Categorical
- **Nominal**
    - Attrition
    - BusinessTravel
    - Department
    - EducationField
    - Gender
    - JobRole
    - MaritalStatus
    - OverTime
- **Ordinal**
    - Education
    - EnvironmentSatisfaction
    - JobInvolvement
    - JobLevel
    - JobSatisfaction
    - PerformanceRating
    - RelationshipSatisfaction
    - StockOptionLevel
    - WorkLifeBalance
    
#### Numerical
- **Continuous**
    - Age
    - DailyRate
    - DistanceFromHome
    - HourlyRate
    - MonthlyIncome
    - MonthlyRate
    - NumCompaniesWorked
    - PercentSalaryHike
    - TotalWorkingYears
    - TrainingTimesLastYear
    - YearsAtCompany
    - YearsInCurrentRole
    - YearsSinceLastPromotion
    - YearsWithCurrManager
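
Once ordinal features carry string labels, pandas sorts them alphabetically unless the order is declared. A minimal sketch, assuming the education labels from the mapping in Section 1.2 and a made-up sample, shows how an ordered categorical dtype keeps the intended ranking:

```python
import pandas as pd

# Hypothetical sample of already-labelled education levels.
education = pd.Series(["Master", "Below College", "Doctor", "Bachelor", "College"])

# Order taken from the education mapping: 1 (lowest) to 5 (highest).
levels = ["Below College", "College", "Bachelor", "Master", "Doctor"]
education = education.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Sorting and comparisons now follow the declared order, not the alphabet.
sorted_levels = education.sort_values().tolist()
at_least_bachelor = int((education >= "Bachelor").sum())

print(sorted_levels)
print(at_least_bachelor)
```

The same dtype could be applied to the other ordinal features if plots or tables should follow the rating order.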

## 1.3 Check Missing Values

In [None]:
print("Missing Value:", df.isnull().any().any())

The output is `False`, so this is a complete data set with no missing values.

# 2. EDA (Exploratory Data Analysis)

## 2.1 Distribution of Target Variable

In [None]:
colors = ['#66b3ff', '#ff9999']
explode = (0.05,0.05)
plt.figure(figsize=(5, 5))
plt.pie(df['Attrition'].value_counts(), colors = colors, labels=['No', 'Yes'], 
        autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = explode)
plt.legend()
plt.title("Attrition (Target) Distribution")
plt.show()

Note that the target variable is quite imbalanced: 16.1% of employees left, while most stayed. This imbalance may affect the models trained later.
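
To see why this matters, consider a trivial "model" that always predicts the majority class. The sketch below uses hypothetical labels with a similar imbalance (16 leavers out of 100), not the real data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = leaves ('Yes'), 0 = stays ('No').
y_true = np.array([1] * 16 + [0] * 84)

# A baseline that always predicts the majority class (stay).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall_yes = recall_score(y_true, y_pred)  # recall for the 'Yes' class (label 1)

print(accuracy, recall_yes)
```

The baseline looks accurate while finding none of the leavers, which is why accuracy alone is a poor guide on imbalanced targets.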

## 2.2 Frequency Analysis

### 2.2.1 Frequency Analysis of Numerical Features

The numerical features in this data set are: 'Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion' and 'YearsWithCurrManager'. Plot the distribution of each feature and look for patterns.

In [None]:
numerical_list = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate',
                  'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear',
                  'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

plt.figure(figsize=(10, 10))
for i, column in enumerate(numerical_list, 1):
    plt.subplot(5, 3, i)
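    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
    sns.histplot(df[column], bins=20, kde=True, stat='density')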
plt.tight_layout()
plt.show()

#### Findings of Numerical Features

- Age: Age is roughly normally distributed, ranging from about 20 to 60. Most employees are between 30 and 40.
- DistanceFromHome: Most employees live close to the company; most distances are below 10 km.
- MonthlyIncome: Most monthly incomes are centred around 5000. Only a few employees earn more than 10000.
- NumCompaniesWorked: Most employees have worked for only one company.
- TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager: These time features are right-skewed; most people have stayed at the company for only a few years.
- DailyRate, HourlyRate, MonthlyRate: These are roughly uniformly distributed, with similar counts in each interval.
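
Visual impressions such as "right-skewed" or "uniform" can be backed by a number. The sketch below computes sample skewness on synthetic stand-ins (an exponential and a uniform variable, both assumptions for illustration, not the real columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a right-skewed tenure-like variable and a
# roughly uniform rate-like variable.
toy = pd.DataFrame({
    "tenure_like": rng.exponential(scale=5.0, size=2000),
    "rate_like": rng.uniform(100, 1500, size=2000),
})

# Positive skewness indicates a long right tail; values near 0 indicate symmetry.
skewness = toy.skew()
print(skewness.round(2))
```

Applied to the real data, `df[numerical_list].skew()` would give one value per numerical feature.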

### 2.2.2 Frequency Analysis of Categorical Features

The categorical features are: 'Attrition', 'BusinessTravel', 'Department', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel' and 'WorkLifeBalance'. First, cast these features to `object` for the frequency analysis, because some categorical variables are stored as integers in the original data set. Then plot the frequency of each categorical feature.

In [None]:
cate_list = ['Attrition', 'BusinessTravel', 'Department', 'Education', 'EducationField', 
             'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole',
             'JobSatisfaction', 'MaritalStatus', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction',
             'StockOptionLevel', 'WorkLifeBalance']

for i in cate_list:
    df[i] = df[i].astype(object)
    
plt.figure(figsize=(20, 20))
n_rows, n_cols = 7, 3
row, col = 0, 0

for column in cate_list:
    if column == 'JobRole':
        # JobRole has many levels, so give it a full-width row
        plt.subplot2grid((n_rows, n_cols), (row, col), colspan=n_cols)
        sns.countplot(x=column, data=df, palette='Set2')
        row, col = row + 1, 0
        continue
    plt.subplot2grid((n_rows, n_cols), (row, col))
    # Pass the column by keyword; recent seaborn treats the first positional argument as `data`
    sns.countplot(x=column, data=df, palette='Set2')
    col += 1
    if col == n_cols:
        row, col = row + 1, 0
        
plt.tight_layout()
plt.show()

#### Findings of Categorical Features

- In terms of satisfaction (environment, job and relationship satisfaction), most employees are satisfied: the "High" and "Very High" levels (3 and 4 in the original coding) are the most frequent.
- Most employees hold a bachelor's degree, and many majored in life sciences.
- Since most employees are relatively new to the company, job level 1 is the most common job level.
- Job involvement is fairly high among employees.

## 2.3 Correlation Analysis

Plot the correlation matrix as a heatmap.

In [None]:
plt.figure(figsize=(20, 20))
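# Correlations are only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="Greys", annot_kws={"size":15})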
plt.show()

#### Findings of Correlation Analysis

- The time features (years at company, in current role, since last promotion and with current manager) are quite strongly correlated with each other, as shown by the darker cells.
- Total working years is also correlated with monthly income, as expected.
- Most features are not correlated with each other, so no action is needed for them. For the time features, we will decide whether to alter them depending on the machine learning model.
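
Reading a large heatmap by eye is error-prone, so it can help to list the most correlated pairs directly. This is a sketch on synthetic columns (built to be correlated on purpose, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: two tenure-like columns constructed to be
# correlated, plus an unrelated column.
years_at_company = rng.integers(0, 20, size=n).astype(float)
toy = pd.DataFrame({
    "YearsAtCompany": years_at_company,
    "YearsInCurrentRole": years_at_company * 0.6 + rng.normal(0, 1, size=n),
    "DailyRate": rng.uniform(100, 1500, size=n),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs.head())
```

The result is a Series indexed by feature pairs, sorted from the strongest absolute correlation down.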

## 2.4 Cross Analysis

### 2.4.1 Cross Analysis between Attrition and Numerical Features

In [None]:
plt.figure(figsize=(10, 10))
for i, column in enumerate(numerical_list, 1):
    plt.subplot(5, 3, i)
    sns.violinplot(data=df, x=column, y='Attrition')
plt.tight_layout()
plt.show()

#### Findings of Cross Analysis between Attrition and Numerical Features

> The violin plots above show the distribution of each feature per attrition group, together with the median and quartiles that a boxplot would show. These findings are based on surface-level results only, without considering confounding.

- Age: Younger employees tend to leave more often than older employees.
- DistanceFromHome: Those who live far from the company are more likely to leave.
- MonthlyIncome: Most employees who left earned below 5000, while those whose income is above 5000 tend to stay.
- NumCompaniesWorked and time features: People who have worked for many companies and for many years are likely to stay.
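
A per-group summary table gives the same comparison in numbers. The rows below are made up for illustration; only the column names mirror the data set:

```python
import pandas as pd

# Made-up rows, not taken from the real data.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "Age": [28, 45, 39, 31, 50, 36],
    "MonthlyIncome": [2800, 9500, 6100, 3400, 12000, 5200],
})

# Median of each numerical feature within each attrition group.
group_medians = toy.groupby("Attrition")[["Age", "MonthlyIncome"]].median()
print(group_medians)
```

On the real data, `df.groupby('Attrition')[numerical_list].median()` would produce one row per attrition group.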


### 2.4.2 Cross Analysis between Attrition and Categorical Features

In [None]:
plt.figure(figsize=(20, 20))
gridspec.GridSpec(7, 3)
locator1, locator2 = [0, 0]
for column in cate_list:
    if column == 'JobRole':
        plt.subplot2grid((7, 3), (locator1, locator2), colspan=3, rowspan=1)
        sns.countplot(x=column, hue='Attrition', data=df, palette='BrBG')
        locator1 += 1
        locator2 = 0
        continue
    plt.subplot2grid((7, 3), (locator1, locator2))
    sns.countplot(x=column, hue='Attrition', data=df, palette='BrBG')
    locator2 += 1
    if locator2 == 3:
        locator1 += 1
        locator2 = 0
        continue
    if locator1 == 7:
        break
plt.tight_layout()
plt.show()

#### Findings of Cross Analysis between Attrition and Categorical Features

> Findings from comparing attrition across the levels of each categorical feature.

- Those who travel for business are more likely to leave the company.
- Human Resources employees are the most stable group.
- Employees who hold a doctoral degree are stable.
- Environment satisfaction is not strongly related to attrition.
- The higher the job level, the less likely employees are to leave.
- Technical employees tend to leave.
- A low performance rating and a low stock option level may lead to attrition.
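
Count plots make large groups look important simply because they are large; attrition rates within each level are easier to compare. A minimal sketch with made-up rows (the level names follow the data set's `BusinessTravel` categories):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently", "Travel_Frequently", "Non-Travel", "Non-Travel",
                       "Non-Travel", "Travel_Frequently", "Non-Travel", "Non-Travel"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# Row-normalised table: share of leavers and stayers within each level.
rates = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index")
print(rates.round(2))
```

Each row sums to 1, so the 'Yes' column can be read directly as the attrition rate of that level.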

# 3. Preprocessing and Encoding of Features

After exploration of data, we need to process these features in order to apply them into machine learning models.

In [None]:
from sklearn import preprocessing
from IPython.display import Image
# Reload the data
df = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df = df.drop(columns=['Over18', 'EmployeeCount', 'StandardHours', 'EmployeeNumber'])

## 3.1 Preprocessing and Encoding of Categorical Features

### 3.1.1 Encode all 'object' features as integers

In [None]:
for cate_features in df.select_dtypes(include='object').columns:
    le = preprocessing.LabelEncoder()
    df[cate_features] = le.fit_transform(df[cate_features])
    print(cate_features, "original classes:", list(le.classes_))

### 3.1.2 Dummy Nominal Features

Select the nominal features that have more than two classes and convert them to dummy variables.

In [None]:
dummies = ['Department', 'EducationField', 'JobRole', 'MaritalStatus']
df = pd.get_dummies(data=df, columns=dummies)
display(df.head())

## 3.2 Preprocessing and Encoding of Numerical Features

### 3.2.1 Scaling Numerical Features

StandardScaler standardizes a feature by subtracting its mean $\mu$ and dividing by its standard deviation $S$, so that the scaled feature has zero mean and unit variance.

$$Z=\frac{X-\mu}{S}$$
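
As a small worked example, the formula can be applied by hand and compared with `StandardScaler`. The values below are made up; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single feature.
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mu = x.mean()  # 5.0
s = x.std()    # population standard deviation (ddof=0): 2.0
z_manual = (x - mu) / s

z_sklearn = StandardScaler().fit_transform(x)
same = np.allclose(z_manual, z_sklearn)
print(same)
```

For instance, the value 9 becomes (9 - 5) / 2 = 2, meaning two standard deviations above the mean.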

In [None]:
std = preprocessing.StandardScaler()
# Standardize the numerical columns (zero mean, unit variance) and write them back
df[numerical_list] = std.fit_transform(df[numerical_list])
display(df.head())

# 4. Model Selection

After dummy encoding and standardizing the numerical features, we split the data set into a training set and a test set with a ratio of 7:3 (`test_size=0.3`). We then keep two versions of the training set:
- Type 1: the training set at its original size;
- Type 2: the training set after oversampling.

We decide which version to use by comparing the performance of each after fitting a machine learning model.

Moreover, this report uses grid search, which tunes hyperparameters by evaluating every combination in a given grid in order to determine the best values for a model. This matters because a model's performance depends on the hyperparameter values specified. First, we choose some classification models as candidates. For each one, we search a customized parameter grid with 5-fold cross-validation (`cv=5`) to find an optimized model. We then apply the trained model to the test set, evaluate its performance, and compare the models.

![flow.png](attachment:flow.png)

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, roc_auc_score
from subprocess import call
from IPython.display import Image 
from imblearn.over_sampling import SMOTE

Define a function to output the confusion matrix after building the model.

In [None]:
def my_confusion_matrix(test, test_pred):
    cf = pd.DataFrame(confusion_matrix(test, test_pred), 
                      columns=['Predicted No', 'Predicted Yes'], 
                      index=['True No', 'True Yes'])
    report = pd.DataFrame(classification_report(test, test_pred, target_names=['No', 'Yes'], 
                                                        output_dict=True)).round(2).transpose()
    display(cf)
    display(report)

Define a function to draw the ROC curve and report the AUC.

In [None]:
def plot_roc_curve(model, y, x, label='Decision Tree'):
    # Use the predicted probability of the positive class for both the curve and the AUC,
    # so that the area in the legend matches the curve that is drawn
    y_score = model.predict_proba(x)[:, 1]
    model_auc = roc_auc_score(y, y_score)
    fpr, tpr, thresholds = roc_curve(y, y_score)
    plt.figure(figsize=(15, 10))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='%s ROC curve (area = %0.2f)' % (label, model_auc))
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # chance level
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.fill_between(fpr, tpr, color='orange', alpha=0.2)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

Split the data set into a training set and a test set with a ratio of 7:3 (`test_size=0.3`).
- Keep two versions of the training set:
    - the training set at its original size
    - the training set after oversampling

Decide which version of the training set to use when fitting a machine learning model.
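
The oversampled version is produced with SMOTE, which creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that core step on hypothetical 2-D points; the helper `synthesize` is made up for this illustration and is a simplification, not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k=2, rng=rng):
    """Create one synthetic point per minority point (simplified SMOTE step)."""
    new_points = []
    for i, x in enumerate(points):
        # Distances from x to every minority point; exclude x itself.
        dist = np.linalg.norm(points - x, axis=1)
        dist[i] = np.inf
        neighbours = np.argsort(dist)[:k]
        # Pick one of the k nearest neighbours and interpolate towards it.
        neighbour = points[rng.choice(neighbours)]
        lam = rng.uniform(0.0, 1.0)
        new_points.append(x + lam * (neighbour - x))
    return np.array(new_points)

synthetic = synthesize(minority)
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, oversampling adds no information from outside the training set; it only rebalances the classes the model sees.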

In [None]:
X = df.drop(columns=['Attrition'])
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
oversampler = SMOTE(random_state=0)
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
smote_X_train, smote_y_train = oversampler.fit_resample(X_train, y_train)

After oversampling, the class distribution of the training set changes as follows.

In [None]:
colors = ['#66b3ff', '#ff9999']
explode = (0.05,0.05)
plt.figure(figsize=(5, 5))
plt.pie(pd.Series(smote_y_train).value_counts(), colors = colors, labels=['No', 'Yes'], 
        autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = explode)
plt.legend()
plt.title("Oversampled Targets in Training Set")
plt.show()

## 4.1 Decision Tree

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

### 4.1.1 Decision Tree

In [None]:
params = {"criterion": ("gini", "entropy"), 
          "splitter": ("best", "random"), 
          "max_depth": np.arange(1, 20), 
          "min_samples_split": [2, 3, 4], 
          "min_samples_leaf": np.arange(1, 20)}
tree1_grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, scoring="roc_auc", n_jobs=-1, cv=5)
tree1_grid.fit(X_train, y_train)

#### Optimized Model 

In [None]:
print(tree1_grid.best_score_)
print(tree1_grid.best_params_)
print(tree1_grid.best_estimator_)

#### Tree Graph

In [None]:
tree1_clf = DecisionTreeClassifier(random_state=0, **tree1_grid.best_params_)
tree1_clf.fit(X_train, y_train)
# class_names must follow the ascending order of the encoded classes: 0 = 'No', 1 = 'Yes'
tree.export_graphviz(tree1_clf, out_file='/kaggle/working/tree1.dot', special_characters=True, rounded = True, filled= True,
                     feature_names=list(X.columns), class_names=['No', 'Yes'])
call(['dot', '-T', 'png', '/kaggle/working/tree1.dot', '-o', '/kaggle/working/tree1.png'])
display(Image("/kaggle/working/tree1.png", height=2000, width=1900))

#### Results of Test Set

In [None]:
y_test_pred_tree1 = tree1_clf.predict(X_test)
my_confusion_matrix(y_test, y_test_pred_tree1) # Defined before
# Compute the AUC from predicted probabilities rather than hard labels
tree1_auc = roc_auc_score(y_test, tree1_clf.predict_proba(X_test)[:, 1])
print("AUC:", tree1_auc)

#### Importance of Features

In [None]:
IP = pd.DataFrame({"Features": np.array(X.columns), "Importance": tree1_clf.feature_importances_})
IP = IP.sort_values(by=['Importance'], ascending=False)
plt.figure(figsize=(15, 10))
sns.barplot(x='Importance', y='Features', data=IP[:10])
plt.show()

#### ROC curve

In [None]:
plot_roc_curve(tree1_clf, y_test, X_test)
plt.show()

### 4.1.2 Decision Tree (Oversample)

In [None]:
tree2_grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, scoring="roc_auc", n_jobs=-1, cv=5)
tree2_grid.fit(smote_X_train, smote_y_train)

#### Optimized Model 

In [None]:
print(tree2_grid.best_score_)
print(tree2_grid.best_params_)
print(tree2_grid.best_estimator_)

#### Tree Graph

In [None]:
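# Use the same random_state as in the grid search so the selected parameters apply to the same model
tree2_clf = DecisionTreeClassifier(random_state=0, **tree2_grid.best_params_)
tree2_clf.fit(smote_X_train, smote_y_train)
# class_names must follow the ascending order of the encoded classes: 0 = 'No', 1 = 'Yes'
tree.export_graphviz(tree2_clf, out_file='/kaggle/working/tree2.dot', special_characters=True, rounded = True, filled= True,
                     feature_names=list(X.columns), class_names=['No', 'Yes'])
# Render the .dot file written above (the original pointed at a non-existent 'tree.dot')
call(['dot', '-T', 'png', '/kaggle/working/tree2.dot', '-o', '/kaggle/working/tree2.png'])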
display(Image("/kaggle/working/tree2.png", height=2000, width=1900))

#### Results of Test Set

In [None]:
y_test_pred_tree2 = tree2_clf.predict(X_test)
my_confusion_matrix(y_test, y_test_pred_tree2)
# Compute the AUC from predicted probabilities rather than hard labels
tree2_auc = roc_auc_score(y_test, tree2_clf.predict_proba(X_test)[:, 1])
print("AUC:", tree2_auc)

#### ROC Curve

In [None]:
plot_roc_curve(tree2_clf, y_test, X_test)
plt.show()

#### Importance of Features

In [None]:
IP = pd.DataFrame({"Features": np.array(X.columns), "Importance": tree2_clf.feature_importances_})
IP = IP.sort_values(by=['Importance'], ascending=False)
plt.figure(figsize=(15, 10))
sns.barplot(x='Importance', y='Features', data=IP[:10])
plt.show()

# 5. Evaluation

In [None]:
tree1_fpr, tree1_tpr, tree1_thresholds = roc_curve(y_test, tree1_clf.predict_proba(X_test)[:,1])
tree2_fpr, tree2_tpr, tree2_thresholds = roc_curve(y_test, tree2_clf.predict_proba(X_test)[:,1])
plt.figure(figsize=(15, 10))
plt.plot(tree1_fpr, tree1_tpr, color='skyblue', lw=2, label='Decision Tree ROC curve (area = %0.2f)' % tree1_auc)
plt.plot(tree2_fpr, tree2_tpr, color='darkorange', lw=2, label='Decision Tree (with oversampling) ROC curve (area = %0.2f)' % tree2_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.fill_between(tree2_fpr, tree2_tpr, color='darkorange', alpha=0.2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (With and Without Oversampling)')
plt.legend(loc="lower right")
plt.show()

In [None]:
report1 = pd.DataFrame(classification_report(y_test, y_test_pred_tree1, target_names=['No', 'Yes'], output_dict=True)).round(2).transpose()
report2 = pd.DataFrame(classification_report(y_test, y_test_pred_tree2, target_names=['No', 'Yes'], output_dict=True)).round(2).transpose()
# Select the 'Yes' row by label instead of by integer position
evaluation = pd.DataFrame([{'Method': 'Decision Tree (without oversample)', 'F1': report1.loc['Yes', 'f1-score'], 'Precision': report1.loc['Yes', 'precision'], 'Recall': report1.loc['Yes', 'recall'], 'AUC': tree1_auc}, 
                           {'Method': 'Decision Tree (with oversample)', 'F1': report2.loc['Yes', 'f1-score'], 'Precision': report2.loc['Yes', 'precision'], 'Recall': report2.loc['Yes', 'recall'], 'AUC': tree2_auc}])
display(evaluation)

Compared with the model trained on the original training set, the model trained on the oversampled training set scores higher in terms of precision, recall and F1-score. In a company, attrition usually costs much more than taking action to retain employees, so being able to identify the employees who intend to leave is what matters. Hence, we focus on the AUC value and the recall for the ‘Yes’ class. According to the feature importances of the fitted decision trees, working overtime is the most significant driver of attrition. Laboratory Technicians and Sales Representatives are also more likely to leave, and employees whose education field is Human Resources tend to leave as well.
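
Since the comparison rests on AUC and recall, it is worth noting that AUC depends on whether it is computed from predicted probabilities or from hard 0/1 predictions. The sketch below uses made-up labels and scores to show the difference:

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Made-up ground truth and predicted probabilities of the 'Yes' class.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90])

# Hard labels at the default 0.5 threshold.
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)  # ranks all thresholds
auc_from_labels = roc_auc_score(y_true, y_label)  # a single operating point
recall_yes = recall_score(y_true, y_label)

print(auc_from_scores, auc_from_labels, recall_yes)
```

AUC from hard labels collapses the ROC curve to a single threshold, so it generally understates how well the model ranks employees; recall, in contrast, is always tied to a chosen threshold.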