<a href="https://colab.research.google.com/github/Pataweepr/applyML_vistec_2019/blob/master/hw5_hr_employee_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HR Employee Attrition prediction using XGBoost

In this lab, we will work on the task of employee attrition prediction: predicting whether an employee will leave the job. The model we will use is XGBoost, a gradient-boosted ensemble of decision trees. One key feature of XGBoost is that it can handle missing values by learning a default direction to take at each split when the value is null. Moreover, because it is built from decision trees, its decisions can be inspected (through individual trees and feature importances) while it maintains high performance. In this lab, we will learn how to tune and analyze the model to gain insight into what it is doing.

The data is modified from the [Kaggle tutorial](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset).

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import mode
from sklearn import preprocessing

from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import shuffle
from sklearn import metrics

import seaborn as sns

import xgboost as xgb

from numpy import random 

seed = 5
random.seed(seed)

The data can be acquired as usual. Get the data [here](https://drive.google.com/file/d/1iWb3YTeCddIhqAeRvhW6z9YnhQFYX4_8/view?usp=sharing) by clicking add to drive.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive/')

Read the data file. Study the names of the features. There are two columns that should be dropped. 

Which ones?

** Ans: **

In [0]:
hr_data = pd.read_csv("/content/gdrive/My Drive/hr-employee-attrition.csv")
## TODO#1 ##
# Drop two useless columns


The column we will predict is the Attrition. Change Attrition==No to 0 and Attrition==Yes to 1.
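
As a minimal sketch of this kind of label mapping, the toy Series below (not the real data) is converted with `Series.map`. The variable names are illustrative only.

```python
import pandas as pd

# Toy labels standing in for the Attrition column (illustrative, not the real data)
toy_attrition = pd.Series(["Yes", "No", "No", "Yes"])

# Map each string label to an integer; any label missing from the dict becomes NaN
toy_encoded = toy_attrition.map({"No": 0, "Yes": 1})
print(toy_encoded.tolist())
```

Because unmapped labels turn into NaN, checking for nulls after the mapping is a cheap way to catch unexpected values.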

In [0]:
## TODO#2 ##


Is there an imbalance in the prediction column? What is the number of people who left? What is the number of people who stayed?

** Ans: **

How many cells have null values?

** Ans: **
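
One possible way to count classes and null cells is sketched below on a small made-up DataFrame. The column values are invented for illustration; the same calls should apply to `hr_data`.

```python
import numpy as np
import pandas as pd

# Toy frame with a binary label and a few missing cells (illustrative only)
toy = pd.DataFrame({
    "Attrition": [0, 0, 0, 1, 0, 1],
    "Age": [41, np.nan, 37, 33, 27, np.nan],
    "Department": ["Sales", "HR", None, "Sales", "HR", "HR"],
})

# Number of rows per class
class_counts = toy["Attrition"].value_counts()

# isnull() marks each missing cell; summing twice counts them over the whole frame
null_cells = int(toy.isnull().sum().sum())

print(class_counts)
print("null cells:", null_cells)
```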

Below are the same functions we used before. 

The function splitTrainTest splits the data into train, validation, and test sets in a stratified manner.

The function get_feature_groups splits the features into numerical features and categorical features.
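
For comparison, scikit-learn can also stratify a split directly through the `stratify` argument of `train_test_split`. The sketch below uses toy labels (80 negatives, 20 positives) to show that the class proportion is preserved on both sides. It is an alternative illustration, not the function used in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 negatives and 20 positives (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class proportion in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=5
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test :", y_te.mean())
```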

In [0]:
def splitTrainTest(data,ratio_train_valid_test,name_label):
    keyDatas = data[name_label].value_counts().keys()
    train_parts, valid_parts, test_parts = [], [], []
    first_ratio = ratio_train_valid_test[2]/np.sum(ratio_train_valid_test)
    second_ratio =  ratio_train_valid_test[1]/np.sum(ratio_train_valid_test[:2])
    # Split each class separately so that every set keeps the class proportions
    for k in keyDatas:
        tmp = data[data[name_label]==k]
        tmp_train, tmp_test = train_test_split(tmp, test_size = first_ratio, random_state=seed)
        tmp_train, tmp_valid = train_test_split(tmp_train, test_size = second_ratio, random_state=seed)
        train_parts.append(tmp_train)
        valid_parts.append(tmp_valid)
        test_parts.append(tmp_test)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead.
    # reset_index returns a new frame, so its result must be assigned.
    train = pd.concat(train_parts).reset_index(drop=True)
    valid = pd.concat(valid_parts).reset_index(drop=True)
    test = pd.concat(test_parts).reset_index(drop=True)
    return train, valid, test

In [0]:
def get_feature_groups(ames_df):
    # Numerical Features
    numberical_features = ames_df.select_dtypes(include=['int64','float64']).columns
    # We drop Attrition since it is the label, not an input feature
    numberical_features = numberical_features.drop('Attrition') 

    # Categorical Features
    catagorical_features = ames_df.select_dtypes(include=['object']).columns
    return list(numberical_features), list(catagorical_features)

## Data exploration

The code below converts the string values of the categorical features into pandas [categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).

In [0]:
num_features, cat_features = get_feature_groups(hr_data)

print('numerical features')
print(num_features)
print('----------------------------------')
print('categorical features')
print(cat_features)

for col in cat_features:
    hr_data[col] = pd.Categorical(hr_data[col])

### Numerical features analysis

The code below shows the distribution of the numerical features.

In [0]:
f = pd.melt(hr_data, value_vars=sorted(num_features))
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')
plt.show()

Create two sets of plots. One using only the data from employees who left. The other using the data from employees who stayed.

What features show the biggest differences? Does it make sense?

** Ans: **

Suggest other methods to study the difference between leave and stay besides looking at the distributions.

** Ans: **

In [0]:
## TODO#3 ##


### Categorical features analysis

The code below shows the distribution of the categorical features.

In [0]:
f = pd.melt(hr_data, value_vars=sorted(cat_features))
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
plt.xticks(rotation='vertical')
g = g.map(sns.countplot, 'value')
[plt.setp(ax.get_xticklabels(), rotation=60) for ax in g.axes.flat]
g.fig.tight_layout()
plt.show()

Identify at least two useless features in the data (from both the categorical and the numerical features) and remove them.

** Ans: ** 

In [0]:
## TODO#4 ##


### Change categorical features into numbers

We need to change categorical features into numbers. There are many ways to do this. In this lab, we will one-hot encode our features.

**One-hot encoding** is a method that changes each category into a binary vector in which the index corresponding to the category is one and all other entries are zero. For example, for the gender feature we change "male" to \[1, 0\] and "female" to \[0, 1\].

Pandas has a nice function, [pandas.get_dummies](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), that can help us do this.

For a more advanced but extremely robust method to change categories to numbers, see the optional section.
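
To make the idea concrete, here is a small sketch of `pandas.get_dummies` on a made-up two-column frame. The data is invented for illustration. Note that pandas orders the new columns alphabetically by category, so the column order may differ from the \[male, female\] order used in the text above.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (illustrative only)
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Age": [41, 37, 29]})

# One indicator column per category, named <prefix>_<category>
onehot = pd.get_dummies(toy["Gender"], prefix="Gender")

# Replace the original column with its one-hot columns
toy_encoded = pd.concat([toy.drop(columns="Gender"), onehot], axis=1)
print(toy_encoded)
```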

In [0]:
## TODO#5 ##
# Change the categorical features using one hot encoding


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
for feature in cat_features:
  # Create one-hot feature columns for this categorical feature
  df_onehot = pd.get_dummies(hr_data[feature], prefix=feature)
  # Concat the new columns to the dataframe
  hr_data = pd.concat([hr_data, df_onehot], axis=1)
  hr_data = hr_data.drop(columns = feature)
        </code>
      </pre>
</details>

In [0]:
print(hr_data.shape)
print(hr_data.columns)

### Down sampling

In order to deal with class imbalance, we will down sample the data. This can be done by randomly removing samples from the majority class. The function below down samples the data.

In [0]:
def down_sampling(X,ratio_neg_pos):
  X_pos = X.loc[X['Attrition'] == 1].reset_index(drop=True)
  X_neg = X.loc[X['Attrition'] == 0].reset_index(drop=True)
  ## shuffle
  np_rand_ind = np.array(X_neg.index)
  random.shuffle(np_rand_ind)
  size_neg = int(ratio_neg_pos*X_pos.shape[0])
  np_rand_ind = np_rand_ind[:size_neg]
  X_neg = X_neg.iloc[np_rand_ind]
  X_down_samp = pd.concat([X_pos,X_neg])
  X_down_samp = shuffle(X_down_samp,random_state = seed)
  # reset_index returns a new frame, so its result must be assigned
  X_down_samp = X_down_samp.reset_index(drop=True)
  return X_down_samp

In [0]:
ratio = np.array([70,20,10])
train_set, valid_set, test_set = splitTrainTest(hr_data,ratio,'Attrition')

print(train_set.shape)
print(valid_set.shape)
print(test_set.shape)
print('---------')
# We do not need the ratio to be 1:1 but they should be in the same order of magnitude
ratio_neg_pos = 1.5
train_set = down_sampling(train_set,ratio_neg_pos)
print(train_set.shape)
print(valid_set.shape)
print(test_set.shape)

X_train = train_set.drop(columns = ['Attrition'])
y_train = train_set['Attrition']

X_valid = valid_set.drop(columns = ['Attrition'])
y_valid = valid_set['Attrition']

X_test = test_set.drop(columns = ['Attrition'])
y_test = test_set['Attrition']

In [0]:
print(train_set.columns)
train_set.head(10)

### Normalize the data

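Use a min-max scaler to normalize the features in the train, validation, and test sets. Fit the scaler on the training set only, then apply the same transformation to all three sets. The later cells expect the results in `X_train_minmax`, `X_valid_minmax`, and `X_test_minmax`.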
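
Min-max scaling maps each feature to $x' = (x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum come from the training set. The sketch below applies scikit-learn's `MinMaxScaler` to toy arrays (the `*_toy` names are illustrative only). It also shows that values outside the training range can fall outside \[0, 1\] after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy train and test features (illustrative only)
X_train_toy = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test_toy = np.array([[25.0, 0.0]])

scaler = MinMaxScaler()
# Learn the per-column min and max from the training data only
X_train_toy_minmax = scaler.fit_transform(X_train_toy)
# Reuse the training min and max on unseen data
X_test_toy_minmax = scaler.transform(X_test_toy)

print(X_train_toy_minmax)
print(X_test_toy_minmax)
```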

In [0]:
## TODO#6 ##


## XGBoost

[XGBoost](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) stands for Extreme Gradient Boosting. It is an ensemble of decision trees that can be used for classification and regression. Gradient boosting refers to a family of machine learning models built by adding weak learners one at a time, each new one trained to correct the errors of the ensemble so far. Unlike the trees in a random forest, which are grown independently of each other, XGBoost trees are usually shallow. Like random forests, XGBoost has built-in feature selection capability.

The code below is an example of an XGBoost model.
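
The idea of combining weak learners can be sketched in a few lines. The toy loop below is plain gradient boosting for squared error using shallow scikit-learn regression trees. It is a simplified illustration, not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization. All names and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.RandomState(5)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual
    residual = y_toy - prediction
    # A shallow tree is the weak learner fitted to the current residual
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=5)
    weak_tree.fit(X_toy, residual)
    # Add a shrunken version of the new tree to the ensemble prediction
    prediction += learning_rate * weak_tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after 50 trees :", mse_history[-1])
```

Each tree on its own is a poor model, but the training error keeps shrinking as trees are added. The learning rate controls how much each tree contributes.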

In [0]:
model = xgb.XGBClassifier(silent=False, 
                          scale_pos_weight=ratio_neg_pos,
                          learning_rate=0.01,  
                          colsample_bytree = 0.4,
                          subsample = 0.8,
                          min_child_weight = 2.5 ,
                          objective='binary:logistic', 
                          n_estimators=1000,
                          reg_alpha = 0.3,
                          max_depth=5, 
                          gamma=10)

**What is the meaning of each setting listed above?** You do not need to answer this, but you should try to understand its implications.
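
As a starting point, the dictionary below repeats the values from the cell above with a one-line note on each. The comments paraphrase the XGBoost parameter documentation linked earlier and are meant as a rough guide; check the documentation for the exact definitions. The name `xgb_params` is introduced here for illustration only.

```python
# Same values as the model above, annotated.
# scale_pos_weight is written as 1.5 because ratio_neg_pos = 1.5 in this notebook.
xgb_params = {
    "scale_pos_weight": 1.5,         # weight on positive examples, used to offset class imbalance
    "learning_rate": 0.01,           # shrinkage applied to each new tree's contribution
    "colsample_bytree": 0.4,         # fraction of features sampled when building each tree
    "subsample": 0.8,                # fraction of training rows sampled for each tree
    "min_child_weight": 2.5,         # minimum sum of instance weight (hessian) needed in a child
    "objective": "binary:logistic",  # logistic loss for binary classification, outputs a probability
    "n_estimators": 1000,            # number of boosting rounds (trees)
    "reg_alpha": 0.3,                # L1 regularization term on the weights
    "max_depth": 5,                  # maximum depth of each tree
    "gamma": 10,                     # minimum loss reduction required to make a further split
}

for name, value in xgb_params.items():
    print(name, "=", value)
```

A small learning rate combined with many trees, row and column subsampling, and a large `gamma` all push the model towards a more conservative fit.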

In [0]:
# Set the validation set
eval_set = [(X_valid_minmax,y_valid)]
eval_metric = ['auc','error']

# train the model
model.fit(X_train_minmax, y_train,eval_metric=eval_metric,eval_set=eval_set,verbose =True)

Use the model to classify the test set, e.g. *.predict()* . Also get probability values, e.g. *.predict_proba()*, so that we can create an RoC curve.

In [0]:
## TODO#6 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
y_pred_test_prob = model.predict_proba(X_test_minmax)[:,1]
y_pred_test = model.predict(X_test_minmax)
print(y_pred_test)
print(y_pred_test_prob)
print(y_test.values)
        </code>
      </pre>
</details>

What probability threshold does the model use to decide between class 0 and class 1?

** Ans: **

We will re-visit this when we look into the RoC curve.

Create a confusion matrix. 

Which class is harder to predict? What do you think makes this class harder?

** Ans: **
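
As a reminder of how to read the output of scikit-learn's `confusion_matrix`, here is a sketch on made-up labels (not the model's predictions). Rows correspond to the true class and columns to the predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (illustrative only)
y_true_toy = np.array([0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 1, 0, 1])

# Row i, column j counts samples whose true class is i and predicted class is j
cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```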

In [0]:
## TODO#7 ##


### Metrics for imbalanced classification

Next we will compare different kinds of metrics, namely accuracy, precision, and recall.

The code below calculates different metrics on the test set. Do you know the difference between rows and columns?

In [0]:
print('acc : ',accuracy_score(y_test.values, y_pred_test))
tg_name = ['Attrition : no','Attrition : yes']
print(classification_report(y_test.values,y_pred_test,target_names=tg_name))

If you always answer the majority class (Attrition = no), what are the accuracy, the recall (for Attrition = yes), and the precision?

** Ans: **

It turns out that always answering the majority class gives higher accuracy but zero recall. **This shows that accuracy might not be the best metric when there is high class imbalance.**
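
The effect is easy to reproduce on toy labels. The sketch below uses an invented 8:2 class split, not the real test set, and scores a classifier that always predicts the majority class. The `zero_division=0` argument is needed because precision is undefined when no positive is ever predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 8 negatives and 2 positives (illustrative only)
y_true_toy = np.array([0] * 8 + [1] * 2)
# A "classifier" that always answers the majority class
y_majority = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_majority)
rec = recall_score(y_true_toy, y_majority, zero_division=0)
prec = precision_score(y_true_toy, y_majority, zero_division=0)

print("accuracy :", acc)
print("recall   :", rec)
print("precision:", prec)
```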

Note that we are still missing one important piece. The threshold value for our prediction.

In [0]:
## TODO#8 ##
# Calculate accuracy for majority class prediction


### RoC

Next we will look at the RoC curve. The code below plots the RoC curve.

In [0]:
fpr, tpr, thresholds = metrics.roc_curve(y_test.values, y_pred_test_prob, pos_label=1)
thresholds = np.array(thresholds)[::-1]

plt.plot(fpr,tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

If we really care about the accuracy, we can maximize the accuracy by varying the threshold. Modify the code below to find the threshold that maximizes the accuracy.

What are the best threshold and best accuracy values?

** Ans: **

Note: we should do this by tuning on the validation set rather than the test set.
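
The search itself can be sketched on toy validation data as follows. The labels and probabilities are made up, and the candidate thresholds are simply the distinct predicted probabilities. This is one possible approach under those assumptions, and several thresholds may tie for the best accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy validation labels and predicted probabilities (illustrative only)
y_valid_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
prob_valid_toy = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

# Accuracy can only change at a predicted probability, so those are the candidates
candidate_thresholds = np.unique(prob_valid_toy)

# Predict class 1 when the probability is at least the threshold
accuracies = [
    accuracy_score(y_valid_toy, (prob_valid_toy >= t).astype(int))
    for t in candidate_thresholds
]

# argmax returns the first (lowest) threshold among ties
best_idx = int(np.argmax(accuracies))
best_threshold = candidate_thresholds[best_idx]
best_accuracy = accuracies[best_idx]

print("best threshold:", best_threshold)
print("best accuracy :", best_accuracy)
```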

In [0]:
## TODO#8 ##
## MODIFY THIS CODE ##
for thes in thresholds:
  print('threshold : ', thes)
  y_pred_test_th = y_pred_test_prob >= thes
  y_pred_test_th = list(y_pred_test_th.astype(int))
  conf_test = confusion_matrix(y_test.values, y_pred_test_th)

## Tree visualization

We can visualize the trees in XGBoost to get a sense on what the model is doing, by using the code below.

In [0]:
# View tree
# Because graphviz seems to have some resolution problems with pyplot and jupyter, we have to use the trick below to get a readable graph
fig, ax = plt.subplots(figsize=(5, 5), dpi=350)

# Plot the n-th tree of the model. Note that the model consists of many trees (as defined when creating the model)
xgb.plot_tree(model, num_trees=15, rankdir='LR', ax=ax)

## Feature importance

Another useful analysis is to look at how often each feature is used in the trees. The feature importance of a single tree signifies how much a feature helps separate the classes in that tree. The feature importance of every tree can be averaged and shown in a single plot below:

In [0]:
xgb.plot_importance(model)

What are the top four features with highest importance? Do they make sense? Do they agree with your analysis earlier (in the data exploration section)?

** Ans: **

You can read more about how feature importance is calculated in [Matthew Drury's answer](https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).

## (Optional): Target encoding

Besides one hot encoding, there are other ways to change categorical features to numbers.

The most popular methods are based on target encoding, which changes the value of each category to the average value of the class you want to predict. More advanced versions involve smoothing and prior values so that the encoding will not overfit. You can read more about target encoding [here](https://maxhalford.github.io/blog/target-encoding-done-the-right-way/). Another more rigorous source can be found [here](https://dl.acm.org/citation.cfm?id=507538).
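
A minimal sketch of one smoothed variant is shown below. Each category is encoded as $(n \cdot \text{mean} + m \cdot \text{prior}) / (n + m)$, where $n$ is the category count, the prior is the overall target mean, and $m$ is a smoothing weight. The function name `smoothed_target_encode`, the value of `m`, and the toy data are assumptions made for this illustration. The statistics should be computed on the training set only.

```python
import pandas as pd

def smoothed_target_encode(train, column, target, m=5.0):
    """Return a per-category encoding and the global prior (hypothetical helper)."""
    # Global mean of the target, used as the prior
    prior = train[target].mean()
    # Per-category mean and count of the target
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Blend the category mean with the prior; rare categories lean towards the prior
    encoding = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return encoding, prior

# Toy training data (illustrative only)
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "R&D", "R&D", "HR"],
    "Attrition": [1, 1, 0, 0, 0, 1],
})

encoding, prior = smoothed_target_encode(toy, "Department", "Attrition", m=2.0)
toy["Department_te"] = toy["Department"].map(encoding)

# Categories unseen during training fall back to the prior
unseen = pd.Series(["Sales", "Legal"]).map(encoding).fillna(prior)

print(encoding)
print(unseen.tolist())
```

With a larger `m`, every category is pulled more strongly towards the prior, which reduces overfitting on rare categories at the cost of a less informative encoding.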

Perform any target encoding variants of your choice and re-do the XGBoost training and compare the difference.

## (Optional): XGBoost tuning

There are many hyperparameters in XGBoost, which can be cumbersome to tune. This [blog](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python) describes tips and tricks for tuning the hyperparameters.

Try tuning the hyperparameters to get better classification results.