<b>Data mining project - 2020/21</b><br>
<b>Author</b>: [Alexandra Bradan](https://github.com/alexandrabradan)<br>
<b>Python version</b>: 3.x<br>
<b>Last update</b>: 20/11/2020

In [213]:
%matplotlib inline

# general libraries
import sys
import math
import operator
import itertools
import pydotplus
import collections
import missingno as msno
from pylab import MaxNLocator
from collections import Counter
from collections import defaultdict
from IPython.display import Image

# pandas libraries
import pandas as pd
from pandas import DataFrame
from pandas.testing import assert_frame_equal

# visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot

# numpy libraries
import numpy as np
from numpy import std
from numpy import mean
from numpy import arange
from numpy import unique
from numpy import percentile

# scipy libraries
import scipy.stats as stats
from scipy.stats import kstest
from scipy.stats import normaltest

# sklearn libraries
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.experimental import enable_iterative_imputer  # explicitly require this experimental feature
from sklearn.impute import IterativeImputer

from sklearn import tree
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.pipeline import make_pipeline as imbmake_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, fbeta_score, recall_score, precision_score, classification_report

In [214]:
data_directory = "../../../data/"
plot_directory = "../../../plots/DataUnderstanding/"
TR_file = data_directory + "Train_HR_Employee_Attrition.csv"
TR_cleaned_file = data_directory + "Cleaned_Train_HR_Employee_Attrition.csv"
TS_file = data_directory + "One_Hot_Encoding_Test_HR_Employee_Attrition.csv"

In [215]:
df_cleaned = pd.read_csv(TR_cleaned_file, sep=",") 
df_ts = pd.read_csv(TS_file, sep=",") 

In [216]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 883 entries, 0 to 882
Data columns (total 36 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Age                                     883 non-null    int64  
 1   Attrition                               883 non-null    int64  
 2   BusinessTravel_Non-Travel               883 non-null    int64  
 3   BusinessTravel_Travel_Rarely            883 non-null    int64  
 4   BusinessTravel_Travel_Frequently        883 non-null    int64  
 5   DistanceFromHome                        883 non-null    int64  
 6   Education                               883 non-null    int64  
 7   EnvironmentSatisfaction                 883 non-null    int64  
 8   Gender                                  883 non-null    int64  
 9   JobInvolvement                          883 non-null    int64  
 10  JobLevel                                883 non-null    int64 

In [217]:
print(df_cleaned.shape)
print(df_ts.shape)

(883, 36)
(219, 37)


<h2> Discretisation approach </h2> 
Discretisation transforms continuous variables into discrete ones. The process is also known as <b>binning</b>, where each bin is one interval of values. Discretisation methods fall into two categories:

- unsupervised: do not use any information, other than the variable distribution, to create the contiguous bins in which the values will be placed;
- supervised: typically use target information in order to create bins or intervals.

Since we are dealing with decision trees (DT), it is natural to use a **supervised discretisation method** with them:

<u>Step 1</u>: A decision tree of limited depth (2, 3 or 4) is trained on the variable we want to discretise in order to predict the target;

<u>Step 2</u>: The original variable values are then replaced by the probability returned by the tree. The probability is the same for all the observations within a single bin, so replacing the values by the probability is equivalent to grouping the observations within the cut-offs decided by the decision tree.
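
The two steps can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in this notebook: the generated `age` and `attrition` arrays, the depth of 2 and the column name `Age_tree` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for a continuous feature and a binary target (assumption).
age = rng.integers(18, 61, size=500)
attrition = (rng.random(500) < np.where(age < 30, 0.4, 0.1)).astype(int)
X = pd.DataFrame({"Age": age})

# Step 1: train a shallow tree on the single variable to predict the target.
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(X[["Age"]], attrition)

# Step 2: replace each value with the predicted probability of the positive class.
# All observations that fall in the same leaf share the same probability.
X["Age_tree"] = tree_model.predict_proba(X[["Age"]])[:, 1]

n_bins = X["Age_tree"].nunique()
print("number of bins found by the tree:", n_bins)
```

A tree of depth 2 has at most 4 leaves, so the new column takes at most 4 distinct values, one per bin.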

**Advantages** :
- The probabilistic predictions returned by the decision tree are monotonically related to the target.
- The new bins show decreased entropy, that is, the observations within each bucket/bin are more similar to each other than to those of other buckets/bins.
- The tree finds the bins automatically.

**Disadvantages**:
- It may cause over-fitting.
- More importantly, some tuning of the tree parameters might be needed to obtain the optimal splits (e.g., depth, the minimum number of samples in one partition, the maximum number of partitions, and a minimum information gain). This can be time-consuming.
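
As an illustration of the tuning mentioned above, the tree parameters can be chosen by cross-validated grid search. The sketch below runs on synthetic data; the parameter grid, the ROC AUC scoring and the 5-fold split are assumptions for the example, not settings taken from this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic single feature and binary target (assumption).
x = rng.integers(18, 61, size=500).reshape(-1, 1)
y = (rng.random(500) < np.where(x.ravel() < 30, 0.4, 0.1)).astype(int)

# Hypothetical grid: tree depth and minimum number of samples per leaf (bin).
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [10, 30]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(x, y)

best_depth = search.best_params_["max_depth"]
print("best parameters:", search.best_params_)
```

The search has to be repeated for every variable to discretise, which is where the extra cost comes from.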

<u>Features to discretize</u>:
- Age
- DistanceFromHome
- YearsAtCompany
- YearsInCurrentRole
- NumCompaniesWorked
- MonthlyIncome
- MonthlyHours

- PercentSalaryHike
- TaxRate
- OverallSatisfaction

<h2> Training discretisation </h2>

In [218]:
X_train = df_cleaned.copy()
y_train = df_cleaned['Attrition']

In [219]:
def discretize_based_on_histogram_distribution(curr_column, bins, labels):
    """
    Bin curr_column of df_cleaned (train) and df_ts (test) in place, using the same
    left-closed intervals [a, b) for both, and replace the values with integer labels.
    Values outside the bin range become NaN and make astype(int) fail, so the bin
    edges must cover the range of both datasets.
    """
    # value ranges, to check that the bins cover both datasets
    print("%s max_train" %curr_column, df_cleaned[curr_column].max(), "%s min_train" % curr_column, df_cleaned[curr_column].min())
    print("%s max_test" %curr_column, df_ts[curr_column].max(), "%s min_test" % curr_column, df_ts[curr_column].min())
    # training set
    print(pd.cut(df_cleaned[curr_column], bins, labels=labels, include_lowest=True, right=False).unique())
    df_cleaned[curr_column] = pd.cut(df_cleaned[curr_column], bins, labels=labels, include_lowest=True, right=False).astype(int)
    # test set
    print(pd.cut(df_ts[curr_column], bins, labels=labels, include_lowest=True, right=False).unique())
    df_ts[curr_column] = pd.cut(df_ts[curr_column], bins, labels=labels, include_lowest=True, right=False).astype(int)
    print("%s train_unique" % curr_column, sorted(df_cleaned[curr_column].unique()))
    print("%s test_unique" % curr_column, sorted(df_ts[curr_column].unique()))

<h6>Age </h6>
Bin Age into fixed-width intervals of 10 years, from [10, 20) to [60, 70), labelled 1 to 6

In [220]:
bins = list(range(10, 71, 10))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("Age", bins, labels)

Age max_train 60 Age min_train 18
Age max_test 58 Age min_test 22
[5, 3, 2, 4, 6, 1]
Categories (6, int64): [1 < 2 < 3 < 4 < 5 < 6]
[2, 3, 5, 4]
Categories (4, int64): [2 < 3 < 4 < 5]
Age train_unique [1, 2, 3, 4, 5, 6]
Age test_unique [2, 3, 4, 5]
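
The output above can be read with a small worked example of how `pd.cut` assigns labels when `right=False`, i.e. with left-closed intervals [a, b). The six sample ages are made up for illustration; the bins and labels are the ones used for Age.

```python
import pandas as pd

bins = list(range(10, 71, 10))        # [10, 20, 30, 40, 50, 60, 70]
labels = list(range(1, len(bins)))    # [1, 2, 3, 4, 5, 6]

# Hypothetical sample values, including values that sit exactly on a bin edge.
ages = pd.Series([18, 20, 29, 30, 59, 60])

codes = pd.cut(ages, bins, labels=labels, include_lowest=True, right=False).astype(int)
print(codes.tolist())  # [1, 2, 2, 3, 5, 6]
```

A value equal to an edge falls in the bin that starts at that edge: 20 goes to bin 2 and 60 to bin 6. This is consistent with the training set (ages 18 to 60) covering labels 1 to 6, while the test set (ages 22 to 58) only covers labels 2 to 5.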


<h6>DistanceFromHome </h6>
Bin DistanceFromHome into fixed-width intervals of width 5, from [0, 5) to [25, 30), labelled 1 to 6

In [221]:
bins = list(range(0, 31, 5))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("DistanceFromHome", bins, labels)

DistanceFromHome max_train 29 DistanceFromHome min_train 1
DistanceFromHome max_test 29 DistanceFromHome min_test 1
[2, 1, 4, 6, 3, 5]
Categories (6, int64): [1 < 2 < 3 < 4 < 5 < 6]
[4, 1, 2, 3, 5, 6]
Categories (6, int64): [1 < 2 < 3 < 4 < 5 < 6]
DistanceFromHome train_unique [1, 2, 3, 4, 5, 6]
DistanceFromHome test_unique [1, 2, 3, 4, 5, 6]


<h6> YearsAtCompany </h6>

In [222]:
to_drop_indexes = df_ts.index[df_ts["YearsAtCompany"] > 20]
df_ts.drop(list(to_drop_indexes), axis=0, inplace=True)
df_ts.reset_index(drop=True, inplace=True)
print("dropped rows = ", len(to_drop_indexes), sep="\t")

df_ts.shape

dropped rows = 	0


(219, 37)

In [223]:
bins = list(range(0, 26, 5))
labels = list(range(1, len(bins)))
discretize_based_on_histogram_distribution("YearsAtCompany", bins, labels)

YearsAtCompany max_train 20 YearsAtCompany min_train 0
YearsAtCompany max_test 20 YearsAtCompany min_test 0
[2, 1, 3, 4, 5]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
[1, 2, 3, 5, 4]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
YearsAtCompany train_unique [1, 2, 3, 4, 5]
YearsAtCompany test_unique [1, 2, 3, 4, 5]


<h6> YearsInCurrentRole </h6>

In [224]:
to_drop_indexes = df_ts.index[df_ts["YearsInCurrentRole"] > 16]
df_ts.drop(list(to_drop_indexes), axis=0, inplace=True)
df_ts.reset_index(drop=True, inplace=True)
print("dropped rows = ", len(to_drop_indexes), sep="\t")

df_ts.shape

dropped rows = 	0


(219, 37)

In [225]:
bins = list(range(0, 21, 5))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("YearsInCurrentRole", bins, labels)

YearsInCurrentRole max_train 16 YearsInCurrentRole min_train 0
YearsInCurrentRole max_test 15 YearsInCurrentRole min_test 0
[2, 1, 3, 4]
Categories (4, int64): [1 < 2 < 3 < 4]
[1, 3, 2, 4]
Categories (4, int64): [1 < 2 < 3 < 4]
YearsInCurrentRole train_unique [1, 2, 3, 4]
YearsInCurrentRole test_unique [1, 2, 3, 4]


<h6> NumCompaniesWorked </h6>

In [226]:
bins = list(range(0, 11, 5))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("NumCompaniesWorked", bins, labels)

NumCompaniesWorked max_train 9 NumCompaniesWorked min_train 0
NumCompaniesWorked max_test 9 NumCompaniesWorked min_test 0
[2, 1]
Categories (2, int64): [1 < 2]
[2, 1]
Categories (2, int64): [1 < 2]
NumCompaniesWorked train_unique [1, 2]
NumCompaniesWorked test_unique [1, 2]


NumCompaniesWorked is a discretisation candidate

<h6> MonthlyIncome </h6>

In [227]:
bins = list(range(0, 30000, 2500))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("MonthlyIncome", bins, labels)

MonthlyIncome max_train 26997 MonthlyIncome min_train 1009
MonthlyIncome max_test 25479 MonthlyIncome min_test 1393
[4, 2, 3, 1, 6, ..., 5, 11, 10, 8, 9]
Length: 11
Categories (11, int64): [1 < 2 < 3 < 4 ... 8 < 9 < 10 < 11]
[2, 3, 5, 4, 1, ..., 7, 8, 11, 10, 9]
Length: 11
Categories (11, int64): [1 < 2 < 3 < 4 ... 8 < 9 < 10 < 11]
MonthlyIncome train_unique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
MonthlyIncome test_unique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


<h6> MonthlyHours </h6>

In [228]:
to_drop_indexes = df_ts.index[df_ts["MonthlyHours"] > 590.9767441860465]
df_ts.drop(list(to_drop_indexes), axis=0, inplace=True)
df_ts.reset_index(drop=True, inplace=True)
print("dropped rows = ", len(to_drop_indexes), sep="\t")

df_ts.shape

dropped rows = 	0


(219, 37)

In [229]:
bins = list(range(0, 601, 200))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("MonthlyHours", bins, labels)

MonthlyHours max_train 590.9767441860465 MonthlyHours min_train 26.04347826086957
MonthlyHours max_test 574.7954545454545 MonthlyHours min_test 33.71590909090909
[1, 3, 2]
Categories (3, int64): [1 < 2 < 3]
[2, 3, 1]
Categories (3, int64): [1 < 2 < 3]
MonthlyHours train_unique [1, 2, 3]
MonthlyHours test_unique [1, 2, 3]


<h6> PercentSalaryHike </h6>

In [230]:
bins = list(range(0, 31, 5))
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("PercentSalaryHike", bins, labels)

PercentSalaryHike max_train 25 PercentSalaryHike min_train 11
PercentSalaryHike max_test 25 PercentSalaryHike min_test 11
[4, 3, 6, 5]
Categories (4, int64): [3 < 4 < 5 < 6]
[3, 4, 5, 6]
Categories (4, int64): [3 < 4 < 5 < 6]
PercentSalaryHike train_unique [3, 4, 5, 6]
PercentSalaryHike test_unique [3, 4, 5, 6]


<h6> TaxRate </h6>

In [231]:
bins = list(np.linspace(0, 1, 11))
print("bins", bins)
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("TaxRate", bins, labels)

bins [0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0]
TaxRate max_train 0.9513959334891722 TaxRate min_train 0.0
TaxRate max_test 0.9138676137092978 TaxRate min_test 0.0
[4, 5, 8, 9, 3, 6, 1, 7, 2, 10]
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]
[9, 8, 7, 1, 4, 3, 6, 2, 5, 10]
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]
TaxRate train_unique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
TaxRate test_unique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
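
The printed `bins` list for TaxRate contains edges such as 0.30000000000000004, a floating-point artefact of `np.linspace`. The sketch below shows the possible effect on a value lying exactly on a nominal edge. The value 0.3 and the alternative `clean_edges` list are assumptions for illustration; the output above does not show whether any TaxRate value actually sits on an edge.

```python
import numpy as np
import pandas as pd

labels = list(range(1, 11))
raw_edges = list(np.linspace(0, 1, 11))      # fourth edge is 0.30000000000000004
clean_edges = [i / 10 for i in range(11)]    # fourth edge is exactly 0.3

value = pd.Series([0.3])  # hypothetical value lying on the nominal edge

raw_bin = int(pd.cut(value, raw_edges, labels=labels, include_lowest=True, right=False).astype(int).iloc[0])
clean_bin = int(pd.cut(value, clean_edges, labels=labels, include_lowest=True, right=False).astype(int).iloc[0])

print(raw_bin, clean_bin)  # 3 4
```

With the raw edges, 0.3 is slightly below the fourth edge and lands in bin 3; with edges built by integer division it lands in bin 4, as the interval [0.3, 0.4) suggests.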


<h6> OverallSatisfaction </h6>

In [232]:
bins = list(np.linspace(0, 5, 11))
print("bins", bins)
labels= list(range(1, len(bins)))
discretize_based_on_histogram_distribution("OverallSatisfaction", bins, labels)

bins [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
OverallSatisfaction max_train 4.0 OverallSatisfaction min_train 1.2
OverallSatisfaction max_test 3.6 OverallSatisfaction min_test 1.4
[6, 7, 5, 8, 3, 4, 9]
Categories (7, int64): [3 < 4 < 5 < 6 < 7 < 8 < 9]
[5, 7, 6, 8, 4, 3]
Categories (6, int64): [3 < 4 < 5 < 6 < 7 < 8]
OverallSatisfaction train_unique [3, 4, 5, 6, 7, 8, 9]
OverallSatisfaction test_unique [3, 4, 5, 6, 7, 8]


<h2> One-hot encode the discretised variables and save them to new files </h2>

In [233]:
def map_probabilities_to_increasing_integers(var_tree):
    values = sorted(list(X_train[var_tree].unique()))
    values_map = {}
    for i in range(1, len(values) + 1):
        values_map[str(values[i-1])] = i
    return values_map

In [234]:
def get_bin_indeces(var, X, train_or_test_flag):
    var_tree = "%s_tree" % var
    values_map = map_probabilities_to_increasing_integers(var_tree)
    bin_edges = [float(x) for x in list(values_map.keys())]
    bin_indeces = []
    if train_or_test_flag == "test":
        values = X[var]
    else:
        values = X[var_tree]
    for x in values:
        for edge in bin_edges:
            if x <= edge:
                bin_indeces.append(values_map[str(edge)])
                break
    return bin_indeces

In [235]:
def replace_categorical_feature_with_dummy_ones(column_name, categories_list, dummy_features, X):
    """
    Function which replaces the nominal feature passed by argument with dummy ones, 
    to convert nominal column's M values in M new binary (dummy) features.
    """
    # retrieve the nominal feature's index. It is used to know where to insert the new M binary features
    index = X.columns.get_loc(column_name)
    for i in range(0, dummy_features.shape[1]):
        index += 1
        # flatten the sparse column to a 1-D integer array before inserting it
        X.insert(index, column_name + "_" + str(categories_list[i]),
                 dummy_features[:, i].toarray().ravel().astype(int), True)
    # remove categorical feature
    del X[column_name]
    return X

In [236]:
def perform_one_encoding(column_name, X):
    unique = list(X[column_name].unique())
    categories_list = sorted(unique)
    encoder = OneHotEncoder(categories=[categories_list])   # explicitly force the encoding order
    # fit and transform model on data
    dummy_features = encoder.fit_transform(X[column_name].values.reshape(-1,1))
    # add dummy features to dataset, replacing categorical feature
    return replace_categorical_feature_with_dummy_ones(column_name, categories_list, dummy_features, X)

In [237]:
print(df_cleaned["DistanceFromHome"])

0      2
1      1
2      4
3      2
4      2
      ..
878    3
879    4
880    2
881    1
882    1
Name: DistanceFromHome, Length: 883, dtype: int64


In [238]:
discrete_variables = ["DistanceFromHome", "YearsAtCompany", "YearsInCurrentRole", "NumCompaniesWorked",
                      "MonthlyIncome", "PercentSalaryHike", "TaxRate", "Age", "MonthlyHours", 
                      "OverallSatisfaction"]
for var in discrete_variables:
    print(var)
    df_cleaned = perform_one_encoding(var, df_cleaned)
    df_ts = perform_one_encoding(var, df_ts)

DistanceFromHome
YearsAtCompany
YearsInCurrentRole
NumCompaniesWorked
MonthlyIncome
PercentSalaryHike
TaxRate
Age
MonthlyHours
OverallSatisfaction


In [239]:
discrete_variables = ["Education", "JobLevel", "EnvironmentSatisfaction", "JobInvolvement", "JobSatisfaction",
                      "RelationshipSatisfaction", "WorkLifeBalance", "StockOptionLevel", "TrainingTimesLastYear"]
for var in discrete_variables:
    df_cleaned = perform_one_encoding(var, df_cleaned)
    df_ts = perform_one_encoding(var, df_ts)

In [240]:
for x in df_cleaned.columns:
    if len(sorted(df_cleaned[x].unique())) > 2:
        print(x)

In [241]:
for x in df_ts.columns:
    if len(sorted(df_ts[x].unique())) > 2:
        print(x)

In [242]:
print(df_cleaned.shape)
print(df_ts.shape)

(883, 116)
(219, 114)


In [243]:
df1 = df_cleaned.copy()
df2 = df_ts.copy()

In [244]:
print(df1.shape)
print(df2.shape)

(883, 116)
(219, 114)


In [245]:
set(df1.columns).difference(set(df2.columns))

{'Age_1', 'Age_6', 'OverallSatisfaction_9'}
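
The set difference above shows dummy columns that exist only in the training set, because each dataset was encoded from its own observed categories. One possible way to give both frames the same columns is `DataFrame.reindex`, sketched below on toy frames. The frames `train` and `test` and their values are made up for illustration, and this step is not part of the notebook.

```python
import pandas as pd

# Toy one-hot encoded frames (assumption): the test frame lacks the Age_1 dummy.
train = pd.DataFrame({"Age_1": [1, 0], "Age_2": [0, 1], "Attrition": [0, 1]})
test = pd.DataFrame({"Age_2": [1, 1], "Attrition": [0, 0]})

missing = sorted(set(train.columns) - set(test.columns))
print("missing in test:", missing)

# Add the missing dummies as all-zero columns and follow the training column order.
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(test_aligned)
```

Note that `reindex` also drops any column that appears only in the test frame, so such columns should be checked before aligning.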

In [246]:
df1.to_csv(data_directory + "Discretized_HISTOGRAM_One_Hot_Encoding_Train_HR_Employee_Attrition.csv", index=False, header=True)
df2.to_csv(data_directory + "Discretized_HISTOGRAM_One_Hot_Encoding_Test_HR_Employee_Attrition.csv", index=False, header=True)

In [247]:
df_discretized = pd.read_csv(data_directory + "Discretized_HISTOGRAM_One_Hot_Encoding_Train_HR_Employee_Attrition.csv", sep=",") 
df_discretized.shape

(883, 116)

In [248]:
df_discretized = pd.read_csv(data_directory + "Discretized_HISTOGRAM_One_Hot_Encoding_Test_HR_Employee_Attrition.csv", sep=",") 
df_discretized.shape

(219, 114)