# HOW TO PREDICT AN EMPLOYEE'S RESIGNATION?

Nowadays, many companies have to deal with attrition. One definition of this concept that can be found online is:

"The unpredictable and uncontrollable, but normal, reduction of work force due to resignations, retirement, sickness, or death." (http://www.businessdictionary.com/definition/attrition.html)

But is it really unpredictable? Kaggle, the well-known data science website, provided a useful dataset for HR analytics (unfortunately no longer available). This dataset contains information (such as evaluation, wage level or department) on roughly 15,000 employees (14,999 rows). My goal is to anticipate the resignation of the best of them.

## Data preprocessing

In [1]:
import pandas
import matplotlib.pyplot as plt
import numpy as np
import math
#import sklearn.cross_validation
#import sklearn.grid_search
import random
from sklearn.linear_model import LogisticRegression
import sklearn.feature_selection
import sklearn.ensemble
hr = pandas.read_csv("C:/Users/tbeucher/Documents/Projet Python/HR_comma_sep.csv")
hr.info()
hr.head(n = 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


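| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
| 5 | 0.41 | 0.5 | 2 | 153 | 3 | 0 | 1 | 0 | sales | low |
| 6 | 0.1 | 0.77 | 6 | 247 | 4 | 0 | 1 | 0 | sales | low |
| 7 | 0.92 | 0.85 | 5 | 259 | 5 | 0 | 1 | 0 | sales | low |
| 8 | 0.89 | 1.0 | 5 | 224 | 5 | 0 | 1 | 0 | sales | low |
| 9 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |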


#### Selection of the best employees

I want to keep only the best employees, because a company does not need to retain the weaker ones. As an arbitrary definition, a good employee is someone who satisfies at least ONE of the following conditions:
- His/her last evaluation is above average (70% is the mean for this variable in the whole dataset)
- He/she has been given a promotion recently (in the last 5 years)
- He/she is a hard worker (works more than 200 hours a month)

In [3]:
# The dataset with all "good" employees will be called hr_good

# An employee is "good" if at least one of the three conditions holds
good = ((hr["last_evaluation"] > 0.70)
        | (hr["average_montly_hours"] > 200)
        | (hr["promotion_last_5years"] > 0))
hr_good = hr[good]

print("We have", len(hr_good), "observations in the new dataset")
print(int(np.mean(hr_good.left)*100), "% of these employees left the company")


We have 10480 observations in the new dataset
19 % of these employees left the company


#### Train/test split

In [4]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
hr_good = hr_good.copy()
# Draw one random integer in [0, 10000] per row
hr_good["split"] = [random.randint(0, 10000) for _ in range(len(hr_good))]

cond_test = hr_good["split"] > 9850   # around 150 employees in the test dataset
cond_train = hr_good["split"] < 9851

test = hr_good[cond_test].drop("split", axis=1)
hr_good = hr_good[cond_train].drop("split", axis=1)
copie_hr = hr.copy()  # an actual copy (plain assignment would only alias hr), just for safety
print("There are", len(test), "good employees in the test dataset.")

There are 153 good employees in the test dataset.


To do: set a random seed so the split is reproducible, and use scikit-learn's train/test split.
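As a possible way to carry out that to-do item, here is a minimal sketch using scikit-learn's train_test_split with a fixed random_state. It is not the split used to produce the results in this notebook. The DataFrame `toy` is a synthetic stand-in for hr_good (the real CSV is not assumed to be available), and the choices of 150 test rows, seed 42 and stratification on `left` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hr_good (assumption: the real data is not loaded here)
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "last_evaluation": rng.uniform(0.36, 1.0, size=n),
    "average_montly_hours": rng.integers(96, 311, size=n),
    "left": rng.binomial(1, 0.19, size=n),
})

# Hold out exactly 150 rows for testing.
# random_state makes the split reproducible from run to run;
# stratify keeps the share of leavers similar in both parts.
train, test = train_test_split(
    toy, test_size=150, random_state=42, stratify=toy["left"]
)

print("Train rows:", len(train), "- test rows:", len(test))
print("Share of leavers (train / test):",
      round(train["left"].mean(), 3), "/", round(test["left"].mean(), 3))
```

Compared with drawing a random integer per row, this gives an exact test size and the same split every time the cell is run.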

## Data Analysis

To be improved

In [9]:
left = hr_good[hr_good.left > 0]
stay = hr_good[hr_good.left < 1]
print("Summary of quantitative variables (only on employees who left the company)")
print("There are", len(left), "such employees. That represents approximately", int(len(left)/len(hr_good)*100), "% of the train dataset")
left.describe()

Summary of quantitative variables (only on employees who left the company)
There are 1985 such employees. That represents approximately 19 % of the train dataset


| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 1985.0 | 1985.0 | 1985.0 | 1985.0 | 1985.0 | 1985.0 | 1985.0 | 1985.0 |
| mean | 0.463592 | 0.875914 | 5.21864 | 256.318892 | 4.521914 | 0.047355 | 1.0 | 0.009572 |
| std | 0.345975 | 0.103355 | 1.168537 | 32.735292 | 0.821 | 0.212451 | 0.0 | 0.097391 |
| min | 0.09 | 0.45 | 2.0 | 128.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 25% | 0.1 | 0.83 | 4.0 | 242.0 | 4.0 | 0.0 | 1.0 | 0.0 |
| 50% | 0.48 | 0.89 | 5.0 | 258.0 | 5.0 | 0.0 | 1.0 | 0.0 |
| 75% | 0.81 | 0.95 | 6.0 | 277.0 | 5.0 | 0.0 | 1.0 | 0.0 |
| max | 0.92 | 1.0 | 7.0 | 310.0 | 6.0 | 1.0 | 1.0 | 1.0 |


In [10]:
print("Summary of quantitative variables (only on employees who are still working for the company)")
print("There are", len(stay), "such employees. That represents approximately", int(len(stay)/len(hr_good)*100), "% of the train dataset")
stay.describe()

Summary of quantitative variables (only on employees who are still working for the company)
There are 8342 such employees. That represents approximately 80 % of the train dataset


| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 8342.0 | 8342.0 | 8342.0 | 8342.0 | 8342.0 | 8342.0 | 8342.0 | 8342.0 |
| mean | 0.677281 | 0.766845 | 3.824982 | 212.943299 | 3.391153 | 0.17298 | 0.0 | 0.035723 |
| std | 0.21563 | 0.150735 | 0.962956 | 42.979044 | 1.571944 | 0.378253 | 0.0 | 0.185609 |
| min | 0.12 | 0.36 | 2.0 | 96.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.55 | 0.66 | 3.0 | 181.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.7 | 0.78 | 4.0 | 219.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.84 | 0.89 | 4.0 | 248.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 6.0 | 287.0 | 10.0 | 1.0 | 0.0 | 1.0 |
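The two summaries are easier to compare side by side: in the train dataset, leavers have a lower mean satisfaction (about 0.46 against 0.68) and a higher mean number of monthly hours (about 256 against 213). One possible way to get that comparison in a single table is a groupby on the left column. The sketch below runs on a handful of made-up rows (the `toy` DataFrame is an assumption for illustration, not the real data), so its numbers mean nothing in themselves.

```python
import pandas as pd

# Toy rows, made up for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.11, 0.72, 0.80, 0.65, 0.90],
    "average_montly_hours": [157, 272, 223, 180, 200, 210],
    "left": [1, 1, 1, 0, 0, 0],
})

# One row per group: left = 0 (stayed) and left = 1 (left the company)
summary = toy.groupby("left").mean()

# Difference of the means: leavers minus stayers
diff = summary.loc[1] - summary.loc[0]

print(summary)
print(diff)
```

On the real data, replacing `toy` with hr_good would give the two mean rows of the tables above in one DataFrame.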
