### Summary

<p>
Some employees of a company will consider leaving it. This phenomenon is known as "attrition" or "churn".
Given suitable data, algorithms can try to predict which employees are going to leave the company. It is up to
the company's management to act on this knowledge, for example by offering incentives to retain the employee.
</p>

<p>
The algorithms applied here to the attrition data set are logistic regression and random forest. Before
modelling, the data is prepared and split into training, validation, and test data. The quality of the
results is measured with the metrics accuracy, precision, and recall.
</p>

<p>
According to the evaluation, the logistic regression model worked best for this data set.
</p>


### Load data and packages

In [1]:
# Test notebook works
3+3

In [2]:
getwd()

In [3]:
library(tidyverse)

"package 'tidyverse' was built under R version 3.5.2"
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.3.1     v forcats 0.4.0
"package 'forcats' was built under R version 3.5.2"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


In [4]:
library(rsample)

"package 'rsample' was built under R version 3.5.2"
Attaching package: 'rsample'

The following object is masked from 'package:tidyr':

    fill



In [5]:
library(Metrics)

"package 'Metrics' was built under R version 3.5.2"

In [6]:
library(ranger)

"package 'ranger' was built under R version 3.5.2"

### Read in the data

In [7]:
file_path = 'WA_Fn-UseC_-HR-Employee-Attrition.csv'
file_path

In [8]:
attrition = read.csv(file_path)
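
The first column shows up as `ï..Age` in the outputs further down. A prefix like this is typically what a UTF-8 byte order mark (BOM) at the start of a CSV file looks like when it is read as part of the header. The following is a minimal sketch, assuming the file really does start with a BOM: passing `fileEncoding = "UTF-8-BOM"` to `read.csv` strips it. The small temporary file and the name `demo` are made up for illustration and are not the attrition data.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The content is a made-up stand-in, not the real attrition file.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Age,Attrition\n41,Yes\n49,No\n")), tmp)

# "UTF-8-BOM" tells R to drop the byte order mark while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
colnames(demo)
```

Because the models in this notebook use the formula `Attrition ~ .`, the odd column name does not change any result; it only affects readability.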

### Explore data

In [9]:
dim(attrition)
nrow(attrition)
ncol(attrition)

In [10]:
colnames(attrition)

The variable 'Attrition' is the response or dependent variable, which we want to predict.

In [11]:
glimpse(attrition)

Observations: 1,470
Variables: 35
$ ï..Age                   <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
$ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, ...
$ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_R...
$ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
$ Department               <fct> Sales, Research & Development, Research & ...
$ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
$ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
$ EducationField           <fct> Life Sciences, Life Sciences, Other, Life ...
$ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
$ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
$ Gender                   <fct> Female, Male, Male, Female, Male, Male, Fe...
$ HourlyRate      

In [12]:
unique(attrition$Attrition )

In [13]:
summary(attrition)

     ï..Age      Attrition            BusinessTravel   DailyRate     
 Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
 1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
 Median :36.00              Travel_Rarely    :1043   Median : 802.0  
 Mean   :36.92                                       Mean   : 802.5  
 3rd Qu.:43.00                                       3rd Qu.:1157.0  
 Max.   :60.00                                       Max.   :1499.0  
                                                                     
                  Department  DistanceFromHome   Education    
 Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
 Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
 Sales                 :446   Median : 7.000   Median :3.000  
                              Mean   : 9.193   Mean   :2.913  
                              3rd Qu.:14.000   3rd Qu.:4.000  
                              Max.   :29.000   Max.   :5.000  

In [14]:
head(attrition)

ï..Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,...,3,80,0,8,2,2,7,7,3,6


In [15]:
tail(attrition)

Unnamed: 0,ï..Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1465,26,No,Travel_Rarely,1167,Sales,5,3,Other,1,2060,...,4,80,0,5,2,3,4,2,0,0
1466,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1467,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1468,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1469,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8
1470,34,No,Travel_Rarely,628,Research & Development,8,3,Medical,1,2068,...,1,80,0,6,3,4,4,3,1,2


In [16]:
# Drop Over18 from the model
attrition$Over18 <- NULL

### Train-Test-Validate data

Initial split (from the rsample library)

In [17]:
set.seed(42)
attr_split <- initial_split(attrition, prop=0.75)

In [18]:
summary(attr_split)

       Length Class      Mode   
data     34   data.frame list   
in_id  1103   -none-     numeric
out_id    1   -none-     logical
id        1   tbl_df     list   

Read out training and testing data into variables

In [19]:
training_attr <- training(attr_split)
head(training_attr, 2)
dim(training_attr)

Unnamed: 0,ï..Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
5,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
7,59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,...,1,80,3,12,3,2,1,0,0,0


In [20]:
testing_attr <- testing(attr_split)
head(testing_attr, 2)
dim(testing_attr)


ï..Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7


Train-Validate Split<br>
Create folds. <br>
Creating a number of folds means dividing the training data into subsets.<br>


In [21]:
#v: number of folds
cv_split_attr <- vfold_cv(training_attr, v=5)
summary(cv_split_attr)


 splits.Length  splits.Class  splits.Mode      id           
 4       rsplit  list                     Length:5          
 4       rsplit  list                     Class :character  
 4       rsplit  list                     Mode  :character  
 4       rsplit  list                                       
 4       rsplit  list                                       

In [22]:
str(cv_split_attr[1,1])

Classes 'rset', 'tbl_df', 'tbl' and 'data.frame':	1 obs. of  1 variable:
 $ splits:List of 1
  ..$ 1:List of 4
  .. ..$ data  :'data.frame':	1103 obs. of  34 variables:
  .. .. ..$ ï..Age                  : int  27 59 30 38 36 29 31 34 28 29 ...
  .. .. ..$ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
  .. .. ..$ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 2 3 3 3 3 3 3 ...
  .. .. ..$ DailyRate               : int  591 1324 1358 216 1299 153 670 1346 103 1389 ...
  .. .. ..$ Department              : Factor w/ 3 levels "Human Resources",..: 2 2 2 2 2 2 2 2 2 2 ...
  .. .. ..$ DistanceFromHome        : int  2 3 24 23 27 15 26 19 24 21 ...
  .. .. ..$ Education               : int  1 3 1 3 3 2 1 2 3 4 ...
  .. .. ..$ EducationField          : Factor w/ 6 levels "Human Resources",..: 4 4 2 2 4 2 2 4 2 2 ...
  .. .. ..$ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ EmployeeNumber          

This creates a data frame that stores 5 folds. Each split is a list that contains the full training data frame (1103 observations) together with the indices of the rows used for training in that fold.

Create train-validation <br>
Each fold now defines its own division of the training data into a training part and a validation (testing) part.<br>
Essentially this is cross-validation: <br>
First, create subsets or folds.<br>
Second, for each fold, use that subset as the validation (testing) set and the remaining subsets as the training set.
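
To make the two steps concrete, here is a minimal base R sketch of what a 5-fold split does conceptually. It is not how `vfold_cv` is implemented; the toy data frame `toy` and the helper name `make_folds` are made up for illustration.

```r
# Toy data: 20 rows standing in for the training data (hypothetical values)
toy <- data.frame(id = 1:20, y = rep(c("No", "Yes"), times = c(15, 5)))

# Hypothetical helper: assign every row to one of v folds at random,
# then build a train/validate pair for each fold
make_folds <- function(data, v = 5, seed = 42) {
  set.seed(seed)
  fold_id <- sample(rep(seq_len(v), length.out = nrow(data)))
  lapply(seq_len(v), function(k) {
    list(train    = data[fold_id != k, , drop = FALSE],
         validate = data[fold_id == k, , drop = FALSE])
  })
}

folds <- make_folds(toy, v = 5)

# Each fold trains on 4/5 of the rows and validates on the remaining 1/5
sapply(folds, function(f) c(train = nrow(f$train), validate = nrow(f$validate)))
```

With the 1103 training rows of this notebook, the same idea yields roughly 882 rows for training and 221 for validation per fold, which matches the sizes shown in the outputs of the following cells.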

In [23]:
cv_data_attr <- cv_split_attr %>% mutate(train=map(.x = splits, .f=~training(.x)),
                                         test=map(.x=splits, .f=~testing(.x)))


In [24]:
class(cv_data_attr)

In [25]:
colnames(cv_data_attr)

Now the data is split into training, validation (5 folds), and test data and is therefore
prepared for model building.

Building cross-validated models

In [26]:
colnames(cv_data_attr)

In [27]:
str(cv_data_attr[1,"train"])

Classes 'tbl_df', 'tbl' and 'data.frame':	1 obs. of  1 variable:
 $ train:List of 1
  ..$ 1:'data.frame':	882 obs. of  34 variables:
  .. ..$ ï..Age                  : int  59 30 38 36 29 31 34 28 32 38 ...
  .. ..$ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
  .. ..$ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 3 3 3 3 3 3 ...
  .. ..$ DailyRate               : int  1324 1358 216 1299 153 670 1346 103 334 371 ...
  .. ..$ Department              : Factor w/ 3 levels "Human Resources",..: 2 2 2 2 2 2 2 2 2 2 ...
  .. ..$ DistanceFromHome        : int  3 24 23 27 15 26 19 24 5 2 ...
  .. ..$ Education               : int  3 1 3 3 2 1 2 3 2 3 ...
  .. ..$ EducationField          : Factor w/ 6 levels "Human Resources",..: 4 2 2 4 2 2 4 2 2 2 ...
  .. ..$ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ EmployeeNumber          : int  10 11 12 13 15 16 18 19 21 24 ...
  .. ..$ EnvironmentSatisfa

### Building a logistic regression model 

In [28]:
# Use all predictor variables as features
# cv_models_attr <- cv_data_attr  %>% mutate(model= map(train, ~glm(formula=Attrition~., data=.x, family="binomial")))

With Over18 still in the data set, this model produces the following error:
<p>
Error in mutate_impl(.data, dots): Evaluation error: contrasts can be applied only to factors with 2 or more levels.
</p>
<p>
How to solve this problem?<br>
Clue: "can be applied only to factors with 2 or more levels"<br>
All variables are used as predictors.<br>
Attrition, the response, is a factor.<br>
The factor predictors are:<br>
BusinessTravel + Department + EducationField + Gender + JobRole + MaritalStatus + Over18 + OverTime<br>
Over18 has only one level: Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...<br>
This seems critical.
</p>
<p>
Solution?
</p>
<p>
Induction (Latin inducere: 'bring about', 'induce', 'introduce') has, since Aristotle, meant drawing a general conclusion, such as a general concept or a law of nature, from observed phenomena.
</p>
<p>
Test every factor one at a time and observe the result.
</p>
<p>
Result:<br>
Over18 throws an error.
</p>
<p>
Investigate:<br>
Over18 has only one level: all employees are over 18. It makes sense to drop it from the model.
</p>
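
Instead of testing each factor by hand, the same check can be sketched programmatically by counting the distinct values per column. The data frame `toy` below is a made-up stand-in for the attrition data, and the variable names are chosen for illustration only.

```r
# Hypothetical stand-in for the attrition data
toy <- data.frame(
  Attrition = factor(c("Yes", "No", "No", "Yes")),
  Gender    = factor(c("Female", "Male", "Male", "Female")),
  Over18    = factor(c("Y", "Y", "Y", "Y")),
  Age       = c(41, 49, 37, 33)
)

# Count the distinct values in every column
n_distinct_values <- vapply(toy, function(col) length(unique(col)), integer(1))

# Columns with a single distinct value carry no information for the model
constant_cols <- names(n_distinct_values)[n_distinct_values == 1]
constant_cols

# Drop them before fitting
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
```

A single-level factor such as Over18 triggers the contrasts error. Constant numeric columns do not raise that error, but they add nothing to the fit either, so a check like this is a cheap first step before modelling.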

In [29]:
# Use only the factor predictors (without Over18) as features
cv_models_attr <- cv_data_attr %>%
  mutate(model = map(train, ~glm(formula = Attrition ~ BusinessTravel + Department + EducationField +
                                   Gender + JobRole + MaritalStatus + OverTime,
                                 data = .x, family = "binomial")))

In [30]:
# Check the model
# The model works
summary(cv_models_attr)

 splits.Length  splits.Class  splits.Mode      id           
 4       rsplit  list                     Length:5          
 4       rsplit  list                     Class :character  
 4       rsplit  list                     Mode  :character  
 4       rsplit  list                                       
 4       rsplit  list                                       
 train.Length  train.Class  train.Mode test.Length  test.Class  test.Mode 
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 model.Length  model.Class  model.Mode
 30    glm   list                     
 30    glm   list                     
 30    glm   list                     
 30    glm   list           

Dropping Over18 from the model formula with a minus sign does not work. Instead, it is dropped from the data set near the start of the notebook, before the data is split.

In [31]:
 cv_models_lr <- cv_data_attr  %>% mutate(model = map(train, ~glm(formula=Attrition~., data = .x, family = "binomial")))

In [32]:
summary(cv_models_lr)

 splits.Length  splits.Class  splits.Mode      id           
 4       rsplit  list                     Length:5          
 4       rsplit  list                     Class :character  
 4       rsplit  list                     Mode  :character  
 4       rsplit  list                                       
 4       rsplit  list                                       
 train.Length  train.Class  train.Mode test.Length  test.Class  test.Mode 
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 model.Length  model.Class  model.Mode
 30    glm   list                     
 30    glm   list                     
 30    glm   list                     
 30    glm   list           

In [33]:
colnames(cv_models_lr)

### Evaluating the attrition classification model for a single fold

Extract the model and the validation data frame from the first fold of the cross-validation.

In [34]:
model <- cv_models_lr$model[[1]]
validate <- cv_models_lr$test[[1]]

In [35]:
colnames(validate)

In [36]:
# Prepare binary vector of actual Attrition values in validate
validate_actual <- validate$Attrition == "Yes"
# Predict the probabilities for the observations in validate
validate_probabilities <- predict(model, validate, type="response")


"prediction from a rank-deficient fit may be misleading"

In [37]:
glimpse(validate_actual)
glimpse(validate_probabilities)


 logi [1:221] FALSE FALSE FALSE FALSE FALSE TRUE ...
 Named num [1:221] 0.3013 0.04876 0.12358 0.00296 0.15301 ...
 - attr(*, "names")= chr [1:221] "5" "16" "18" "19" ...


In [38]:
# Turning probabilities into predictions depending on threshold of 0.5
validate_predictions <- ifelse(validate_probabilities > 0.5, TRUE, FALSE)
glimpse(validate_predictions)

 Named logi [1:221] FALSE FALSE FALSE FALSE FALSE TRUE ...
 - attr(*, "names")= chr [1:221] "5" "16" "18" "19" ...
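
The threshold of 0.5 is a choice, and it directly affects how many leaving employees are caught. The sketch below uses made-up probabilities and outcomes (not taken from the model above) and a hypothetical helper `recall_at` to show how recall changes with the threshold.

```r
# Made-up probabilities and outcomes, for illustration only
probabilities <- c(0.05, 0.35, 0.62, 0.45, 0.10, 0.80)
actual        <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

# Hypothetical helper: recall = true positives / all actual positives
recall_at <- function(threshold) {
  predicted <- probabilities > threshold
  sum(predicted & actual) / sum(actual)
}

recall_at(0.5)  # only probabilities above 0.5 count as "Yes"
recall_at(0.3)  # a lower threshold flags more employees
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of more false alarms, that is, lower precision.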


Make a contingency table or confusion matrix

In [39]:
table(validate_actual, validate_predictions)

               validate_predictions
validate_actual FALSE TRUE
          FALSE   183    3
          TRUE     18   17

#### Calculating binary classification metrics for selected fold



<ul>
<li>Accuracy: the share of correct predictions among all observations</li>
<li>Precision: the share of positive predictions that are actually positive</li>
<li>Recall: the share of actually positive observations that are predicted as positive</li>
</ul>
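
As a cross-check, the three metrics can be worked out by hand from the counts in the confusion matrix above. This is a small sketch in base R; the variable names are chosen here so that they do not mask the functions from the Metrics package.

```r
# Counts taken from the confusion matrix above (rows: actual, columns: predicted)
tn <- 183  # actual FALSE, predicted FALSE
fp <- 3    # actual FALSE, predicted TRUE
fn <- 18   # actual TRUE,  predicted FALSE
tp <- 17   # actual TRUE,  predicted TRUE

accuracy_manual  <- (tp + tn) / (tp + tn + fp + fn)
precision_manual <- tp / (tp + fp)
recall_manual    <- tp / (tp + fn)

round(c(accuracy  = accuracy_manual,
        precision = precision_manual,
        recall    = recall_manual), 3)
```

For this fold the accuracy is high (about 0.905) mainly because most employees stay, while the recall (about 0.486) shows that fewer than half of the leaving employees are found.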


In [40]:
accuracy(validate_actual, validate_predictions)

In [41]:
precision(validate_actual, validate_predictions)

In [42]:
recall(validate_actual, validate_predictions)

#### Prepare for cross-validated performance

In [43]:
suppressWarnings(
cv_prep_lr <-  cv_models_lr %>% 
  mutate(
    # Prepare binary vector of actual Attrition values in validate
    validate_actual = map(test, ~.x$Attrition == "Yes"),
    # Prepare binary vector of predicted Attrition values for validate
    validate_predicted = map2(.x = model, .y = test, ~predict(.x, .y, type = "response") > 0.5)
  ))

In [44]:
summary(cv_prep_lr )

 splits.Length  splits.Class  splits.Mode      id           
 4       rsplit  list                     Length:5          
 4       rsplit  list                     Class :character  
 4       rsplit  list                     Mode  :character  
 4       rsplit  list                                       
 4       rsplit  list                                       
 train.Length  train.Class  train.Mode test.Length  test.Class  test.Mode 
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 34          data.frame  list          34          data.frame  list       
 model.Length  model.Class  model.Mode
 30    glm   list                     
 30    glm   list                     
 30    glm   list                     
 30    glm   list           

In [45]:
# Calculate the validate recall for each cross validation fold
cv_performance_recall <- cv_prep_lr %>% mutate(validate_recall =  map2_dbl(.x=validate_actual, .y=validate_predicted, ~recall(.x, .y)))

In [46]:
# Print the validate_recall column
cv_performance_recall$validate_recall

In [47]:
mean(cv_performance_recall$validate_recall)

The average validate recall is 0.43.

#### Can a random forest model perform better than the logistic regression model (recall = 0.43)?

In [48]:
# Tuning the model by inserting different values of mtry
cv_tune <- cv_data_attr %>% crossing(mtry=c(2,4,8,16))


In [49]:
# Build a cross validation model for each fold & mtry combination of the training data
cv_rf <- cv_tune %>% mutate(rf_model = 
                            map2(.x= train, .y=mtry,.f= ~ranger(formula=Attrition~.,data=.x,
                                                     mtry=.y, num.trees=100, seed=42)))



In [50]:
colnames(cv_rf )

In [51]:
cv_rf$mtry
length(cv_performance_recall$validate_recall)

Prepare the validate_actual and validate_predicted columns 
for each mtry/fold combination of the random forest model.

In [52]:
cv_prepare_rf <- cv_rf %>% mutate(validate_actual = map(test, ~.x$Attrition == "Yes"),
                           validate_prediction = map2(.x=rf_model, .y=test, ~predict(.x, .y, type="response")$predictions == "Yes"))

In [53]:
# Calculate the validate recall for each cross validation fold
cv_performance_recall <- cv_prepare_rf %>% mutate(recall= map2_dbl(.x=validate_actual, .y=validate_prediction, ~recall(actual=.x,predicted=.y)))

In [54]:
colnames(cv_performance_recall)

In [55]:
cv_performance_recall$recall
length(cv_performance_recall$recall)

In [56]:
# Calculate the mean recall for each mtry used 
cv_average_recall_mtry <- cv_performance_recall %>% 
select(mtry, recall) %>% group_by(mtry) %>% summarise(mean_recall = mean(recall))
# Problem solved: use select otherwise you run into error
# https://community.rstudio.com/t/warning-error-in-names-attribute-must-be-the-same-length-as-the-vector/11312/3

In [57]:
colnames(cv_average_recall_mtry)
class(cv_average_recall_mtry)
dim(cv_average_recall_mtry)


In [None]:
print(cv_average_recall_mtry )

The best-performing random forest model has a recall of 0.128 with an mtry of 4. The logistic regression model was clearly better, with an average validation recall of 0.43.

### Building the logistic regression model as the best model and making predictions on the testing data

In [None]:

# Build the logistic regression model using all training data
best_model <- glm(formula = Attrition~., 
                  data = training_attr , family = "binomial")


# Prepare binary vector of actual Attrition values for testing_attr
test_actual <- testing_attr$Attrition == "Yes"

# Prepare binary vector of predicted Attrition values for testing_attr
test_predicted <- predict(best_model, testing_attr, type = "response") > 0.5

In [None]:
summary(best_model)

In [None]:
class(test_predicted)

In [None]:
test_predicted[1:10]

In [None]:
# Compare the actual & predicted performance visually using a table
table(test_actual, test_predicted)

# Calculate the test accuracy
accuracy(test_actual, test_predicted)

# Calculate the test precision
precision(test_actual, test_predicted)

# Calculate the test recall
recall(test_actual, test_predicted)

Of all employees who actually leave, 45% can be identified as at risk of leaving.