# Case-study: Data Science

**Author**: Jacopo Ventura

**Date**: 24th September 2017

Dataset: HR employee attrition and performance.

Tasks:
1. Import, clean and visualize the data
2. Build a predictive model of Employee churn
3. Generate and validate hypotheses about why Employees churn
4. Build a lookalike model of the users and reason about their groupings

The goal of the project is to predict attrition. In HR terminology, attrition occurs when an employee retires or when the company eliminates their job.

## 2. Build a predictive model of Employee churn (Attrition)

We now build a predictive model of Employee churn (Attrition). We proceed as follows:

- clean the dataset according to the data wrangling phase
- generate training and test data from the dataset
- prepare data for training the Machine Learning model
- train different binary classifier models: logistic regression, random forest, SVM
- fine-tune the best model
- evaluate the tuned model on the test dataset


In [1]:
# Import packages for data analysis
import os    
import tarfile
from six.moves import urllib
import pandas as pd    
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Import and suppress warnings
import warnings
warnings.filterwarnings('ignore')

# step 1: Import data as Pandas dataframe
data_path='C:/Users/jacopo/Desktop/WA_Fn-UseC_-HR-Employee-Attrition.csv'
dataset = pd.read_csv(data_path)   # dataset as pandas dataframe

### Clean dataset by eliminating Over18, EmployeeCount, EmployeeNumber and StandardHours

In [2]:
# Drop attributes from the dataframe (see data wrangling phase)
dataset.drop(["EmployeeCount","Over18","EmployeeNumber","StandardHours"],axis = 1, inplace = True)

### Separate training and test dataset

Apply stratified sampling on the `Attrition` label to split the dataset into training and test sets.

In [3]:
# perform stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)  # n_splits: number of re-shuffled train/test splits
# split() yields positional indices, so select rows with iloc
for train_index, test_index in split.split(dataset, dataset["Attrition"]):
    dataset_train = dataset.iloc[train_index]
    dataset_test = dataset.iloc[test_index]
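
To illustrate what stratification buys us, the following minimal sketch runs the same splitter on a hypothetical toy frame (the column `x` and the 20% positive rate are assumptions made for the example, not values from the HR dataset) and compares the share of `Yes` labels in each part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 rows, 20% of them labelled "Yes"
toy = pd.DataFrame({
    "x": np.arange(100),
    "Attrition": ["Yes"] * 20 + ["No"] * 80,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["Attrition"]))

# The splitter returns positional indices, hence iloc
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Share of positive labels in the full data and in each part
full_rate = (toy["Attrition"] == "Yes").mean()
train_rate = (toy_train["Attrition"] == "Yes").mean()
test_rate = (toy_test["Attrition"] == "Yes").mean()
print(full_rate, train_rate, test_rate)
```

The three rates should be close to each other, which is the property we rely on when the positive class is rare.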

In [4]:
# Separate features and label
dataset_test_Features = dataset_test.drop(["Attrition"], axis=1) 
dataset_test_Label = dataset_test["Attrition"].copy()

dataset_train_Features = dataset_train.drop(["Attrition"], axis=1) 
dataset_train_Label = dataset_train["Attrition"].copy()

### Prepare data for training the Machine Learning model

In this phase, we normalize the numerical features and convert the categorical variables into dummy/indicator variables, since most Machine Learning models work with numerical values only.

In [5]:
# Define numerical attributes
num_attribs = ["Age","DailyRate","DistanceFromHome","HourlyRate","MonthlyIncome",
              "MonthlyRate","NumCompaniesWorked","PercentSalaryHike",
              "TotalWorkingYears","TrainingTimesLastYear","YearsAtCompany","YearsInCurrentRole",
              "YearsSinceLastPromotion","YearsWithCurrManager",
              "Education","EnvironmentSatisfaction","JobInvolvement","JobLevel","JobSatisfaction","PerformanceRating",
              "RelationshipSatisfaction","StockOptionLevel","WorkLifeBalance"] 

# Define categorical attributes
cat_attribs = ["BusinessTravel","Department","EducationField","Gender","JobRole",
              "MaritalStatus","OverTime"]       

In [6]:
from sklearn import preprocessing

# Normalize numerical features
normalizer = preprocessing.Normalizer()
dataset_train_FeaturesX_normalized = normalizer.fit_transform(dataset_train_Features[num_attribs])

# Encode categorical attributes and return a numpy array
# (.values replaces .as_matrix(), which was removed in pandas 1.0)
dataset_train_FeaturesX_encoded = pd.get_dummies(dataset_train_Features[cat_attribs]).values

# Merge the two feature arrays
dataset_train_FeaturesX_encoded_normalized = np.concatenate((dataset_train_FeaturesX_normalized, 
                                                             dataset_train_FeaturesX_encoded), 
                                                             axis=1)

# Convert label from categorical to numerical
dataset_train_Label.replace(['No','Yes'],[0,1], inplace=True)
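
A note on the normalization step: `preprocessing.Normalizer` rescales each sample (row) to unit norm, which is different from scaling each feature (column). The sketch below contrasts it with `preprocessing.StandardScaler` on a hypothetical two-column matrix. It is only an illustration of the two behaviours; `StandardScaler` is not used in this notebook, and swapping it in would change the scores reported below.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical matrix with two features on very different scales
X = np.array([[30.0, 5000.0],
              [45.0, 12000.0],
              [52.0, 3000.0]])

# Normalizer: every ROW is divided by its own L2 norm
row_scaled = preprocessing.Normalizer().fit_transform(X)

# StandardScaler: every COLUMN is shifted to zero mean and unit variance
col_scaled = preprocessing.StandardScaler().fit_transform(X)

row_norms = np.linalg.norm(row_scaled, axis=1)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)
print(row_norms)
print(col_means, col_stds)
```

With row-wise normalization, a feature with large values (such as an income column) dominates each row, so smaller features end up close to zero.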

### Train Machine Learning model

We train three different Machine Learning models:
- Logistic regression
- Random Forest
- Support Vector Machine

We evaluate the performance of each trained model through the $F_1$ score,

$F_1 = 2\,\dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$,

estimated with 10-fold cross-validation on the training set.
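
As a small worked example of the formula, the sketch below computes the score by hand from true positives, false positives and false negatives. The labels and the helper `f1_from_counts` are hypothetical and serve only to illustrate the definition.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no true positives: F1 is taken to be 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = employee left, 0 = employee stayed
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

toy_f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, toy_f1)
```

Here precision and recall are both 2/3, so the score is 2/3 as well. True negatives do not enter the formula, which is why the score is informative for a rare positive class.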

In [7]:
# Step 5: try some ML models
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

In [8]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

# cross validation on the training dataset
scores = cross_val_score(log_reg,
                         dataset_train_FeaturesX_encoded_normalized, 
                         dataset_train_Label, 
                         cv=10, 
                         scoring='f1')
np.mean(scores)

0.23727456762239374

In [9]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=20,
                                 max_leaf_nodes=20,
                                 n_jobs=-1,
                                 class_weight = 'balanced')


# cross validation on the training dataset
scores = cross_val_score(rnd_clf, 
                         dataset_train_FeaturesX_encoded_normalized, 
                         dataset_train_Label, 
                         cv=10, 
                         scoring='f1')
np.mean(scores)

0.40255257828749491

In [10]:
# SVM
from sklearn.svm import SVC

# Note: degree and coef0 are ignored by the RBF kernel
svm_clf = SVC(kernel="rbf", 
              degree=3, 
              coef0=1, 
              C=5)

# cross validation on the training dataset
scores = cross_val_score(svm_clf, 
                         dataset_train_FeaturesX_encoded_normalized, 
                         dataset_train_Label,
                         cv=10, 
                         scoring='f1')
np.mean(scores)

0.0
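
A score of exactly 0.0 means that no positive sample was correctly identified in any fold. One plausible explanation, which is an assumption and was not verified in this notebook, is that the SVM predicts the majority class for every sample. The sketch below reproduces that situation with a hypothetical imbalanced label vector and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 16 positives out of 100
y = np.array([1] * 16 + [0] * 84)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class (here: 0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

# No positive predictions, so precision is undefined and F1 is reported as 0
baseline_f1 = f1_score(y, y_pred, zero_division=0)
print(baseline_f1)
```

Inspecting the predictions of the fitted SVM, for instance with a confusion matrix, would confirm or rule out this explanation.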

### Fine tuning of Random Forest

We select the Random Forest model since it has the highest cross-validated $F_1$ score. We now fine-tune its hyperparameters with a grid search.

In [11]:
# Fine - tuning of the selected ML model

from sklearn.model_selection import GridSearchCV

param_grid = [{'n_estimators': [10, 30, 100], 
               'max_features': [2, 8, 20],  
               'max_leaf_nodes': [5, 10, 20 ,40]}]

forest_clf = RandomForestClassifier(class_weight = 'balanced')
grid_search = GridSearchCV(forest_clf, param_grid, cv=5,scoring='f1')
grid_search.fit(dataset_train_FeaturesX_encoded_normalized, dataset_train_Label)
print(grid_search.best_params_)
print(grid_search.best_score_)

# Best model
best_model = grid_search.best_estimator_
y_train_predict = best_model.predict(dataset_train_FeaturesX_encoded_normalized)
print('Confusion matrix:')
print(confusion_matrix(dataset_train_Label,y_train_predict))

{'max_features': 2, 'max_leaf_nodes': 10, 'n_estimators': 100}
0.468206689344
Confusion matrix:
[[867 119]
 [ 52 138]]


### Test on the test dataset

We now test the tuned Random Forest model on the test dataset. First, we transform the test dataset in the same way as the training data: we normalize the numerical features with the already fitted normalizer and convert the categorical features into numerical ones.

In [12]:
# Normalize numerical features
dataset_test_FeaturesX_normalized = normalizer.transform(dataset_test_Features[num_attribs])

# Encode categorical attributes and return a numpy array
# (.values replaces .as_matrix(), which was removed in pandas 1.0)
dataset_test_FeaturesX_encoded = pd.get_dummies(dataset_test_Features[cat_attribs]).values

# Merge the two feature arrays
dataset_test_FeaturesX_encoded_normalized = np.concatenate((dataset_test_FeaturesX_normalized, 
                                                            dataset_test_FeaturesX_encoded), axis=1)

# Convert label from categorical to numerical
dataset_test_Label.replace(['No','Yes'],[0,1], inplace=True)
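
Because `pd.get_dummies` is applied to the training and test features separately, the two encoded arrays only line up if every category appears in both splits. The following sketch shows one way to guard against a mismatch by reindexing the test columns to the training layout. The frames and the missing `Sales` category are hypothetical; in this notebook the splits may well contain all categories already.

```python
import pandas as pd

# Hypothetical frames: the test split happens to lack the "Sales" category
train_cat = pd.DataFrame({"Department": ["Sales", "R&D", "HR"]})
test_cat = pd.DataFrame({"Department": ["R&D", "HR"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Reindex the test columns to the training layout;
# categories missing from the test split become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

After alignment, both arrays have the same number of columns in the same order, which is what the trained model expects.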

In [13]:
# Generate confusion matrix
from sklearn.metrics import confusion_matrix

y_test_predict = best_model.predict(dataset_test_FeaturesX_encoded_normalized)
print('Confusion matrix:')
print(confusion_matrix(dataset_test_Label,y_test_predict))
print('F1 score:')
f1_score(dataset_test_Label, y_test_predict)

Confusion matrix:
[[202  45]
 [ 22  25]]
F1 score:


0.42735042735042733
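
As a sanity check, the precision, recall and $F_1$ score can be re-derived from the two confusion matrices printed above. The helper `scores_from_confusion` is a hypothetical name introduced for this sketch; it assumes the scikit-learn layout, where rows are true labels and columns are predictions.

```python
def scores_from_confusion(matrix):
    """Return (precision, recall, F1) for the positive class.

    Expects the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the notebook output
train_matrix = [[867, 119], [52, 138]]
test_matrix = [[202, 45], [22, 25]]

train_p, train_r, train_f1 = scores_from_confusion(train_matrix)
test_p, test_r, test_f1 = scores_from_confusion(test_matrix)

print(round(train_p, 3), round(train_r, 3), round(train_f1, 3))
print(round(test_p, 3), round(test_r, 3), round(test_f1, 3))
```

The test values reproduce the reported score of about 0.427. The score on the training data is noticeably higher (about 0.617), which suggests that the tuned model still overfits to some degree.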