
# Final Assignment 1 - Identifying Employee Attrition

## The Problem Question

The goal of this task is to identify features of interest that can be used to predict employee attrition. The question we try to answer is:

>**How can prior indicators from features be used to predict if an employee is at risk of leaving the company?**



In [4]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn

from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import classification_report

First we retrieve the data.

In [5]:
#Import .csv file and save in variable employee_data 
employee_data = pd.read_csv("/content/WA_Fn-UseC_-HR-Employee-Attrition.csv", sep=",")
employee_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## Converting string text into dummy variables

To measure the correlation between features, their values have to be comparable. Many of the features contain text, which makes this comparison difficult. To get around this, we replace the text with dummy variables. Because every text feature takes its values from a predefined set, we know exactly which values occur in each feature and can assign a dummy value to each one.
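Before converting anything, it can help to list which columns hold text and which values each one contains. The sketch below does this on a hypothetical three-row miniature of the data (`sample` and its contents are made up for illustration); on the real data the same two lines would be applied to `employee_data`.

```python
import pandas as pd

# Hypothetical miniature of the HR data, used only for illustration.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "Over18": ["Y", "Y", "Y"],
})

# Keep only the non-numeric (text) columns.
text_columns = sample.select_dtypes(exclude="number").columns

# Collect the distinct values found in each text column.
unique_values = {col: sorted(sample[col].unique()) for col in text_columns}

for col, values in unique_values.items():
    print(f"{col}: {len(values)} unique -> {values}")
```

A column that reports a single unique value, such as "Over18" here, carries no information for the prediction.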

### Dummy for features with only two options

We can use the existing pandas function `get_dummies` to convert the features "Gender", "Attrition", "OverTime" and "Over18". "Gender" only contains the values "Male" and "Female", while "Attrition" and "OverTime" only contain "Yes" and "No". "Over18" turns out to hold a single value, which is handled further down.

Note that this renames each column to the old name followed by the value that is kept (e.g. "Attrition" --> "Attrition_Yes"), because `drop_first=True` removes the dummy column for the first value in sorted order. We change the names back at the end of this section.
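The renaming can be checked on a tiny example. This is a minimal sketch on a made-up four-row column; `dtype=int` is an addition here so that the result is shown as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Made-up column with the same two values as the real "Attrition" feature.
sample = pd.DataFrame({"Attrition": ["Yes", "No", "Yes", "No"]})

# The categories are sorted ("No", "Yes"); drop_first removes the "No" column,
# so the only column left is "Attrition_Yes".
encoded = pd.get_dummies(sample, columns=["Attrition"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["Attrition_Yes"].tolist())
```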

In [6]:
#Replace the feature "Attrition" with a dummy column (the original text column is removed)
employee_data_attrition_dummy = pd.get_dummies(employee_data, columns=['Attrition'],
                                               drop_first=True)

In [7]:
#Replace the feature "Gender" with a dummy column (the original text column is removed)
employee_data_attrition_gender_dummy = pd.get_dummies(employee_data_attrition_dummy,
                                                      columns=['Gender'], drop_first=True) 

In [11]:
#Replace the feature "Over18" with a dummy column (the original text column is removed)
employee_data_attrition_gender_over_dummy = pd.get_dummies(employee_data_attrition_gender_dummy, 
                         columns=['Over18'], drop_first=True)

Note that the feature 'Over18' contains only one value ('Y'), so the column is dropped without any dummy feature taking its place. This is not a problem: a feature that has the same value for every employee cannot influence our result.
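The behaviour described above can be reproduced on a small made-up frame: with a single category and `drop_first=True`, no dummy column remains. The values in `sample` are illustrative only.

```python
import pandas as pd

# Made-up frame in which "Over18" has a single value, as in the real data.
sample = pd.DataFrame({"Age": [41, 49, 37], "Over18": ["Y", "Y", "Y"]})

# One category minus the dropped first category leaves zero dummy columns.
encoded = pd.get_dummies(sample, columns=["Over18"], drop_first=True)

print(sample["Over18"].nunique())
print(encoded.columns.tolist())
```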

In [12]:
#Replace the feature "OverTime" with a dummy column (the original text column is removed)
employee_data_attrition_gender_over_time_dummy = pd.get_dummies(employee_data_attrition_gender_over_dummy, 
                         columns=['OverTime'], drop_first=True)

All that is left is to rename the new columns to their prior names.

In [17]:
employee_data_attrition_gender_over_time_dummy.rename(columns={'Attrition_Yes': 'Attrition',
                                                               'Gender_Male': 'Gender',
                                                               'OverTime_Yes': 'OverTime'},
                                                      inplace = True)

This is a key for reading the values:

>"Attrition": 1 = Yes, 0 = No
>
>"Gender": 1 = Male, 0 = Female
>
>"OverTime": 1 = Yes, 0 = No
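As an alternative to the four separate calls, `get_dummies` accepts several columns at once. The sketch below is one possible way to write the whole section in two statements, shown on a made-up three-row frame. The name `binary_columns` and the use of `dtype=int` are choices made for this example, not part of the original cells.

```python
import pandas as pd

# Made-up rows with the same four text features as the real data.
sample = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "Gender": ["Female", "Male", "Male"],
    "Over18": ["Y", "Y", "Y"],
    "OverTime": ["Yes", "No", "Yes"],
})

binary_columns = ["Attrition", "Gender", "Over18", "OverTime"]

# Encode all listed columns in one call; "Over18" disappears because it
# has only one value.
encoded = pd.get_dummies(sample, columns=binary_columns, drop_first=True, dtype=int)

# Restore the original column names.
encoded = encoded.rename(columns={
    "Attrition_Yes": "Attrition",
    "Gender_Male": "Gender",
    "OverTime_Yes": "OverTime",
})

print(encoded)
```

Using `rename` without `inplace=True` returns a new dataframe, which avoids having to keep track of which intermediate variable has already been modified.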

### Dummy for features with more than two options

As some features have more than two unique values, we cannot use the previous method. Instead, we have to replace the values manually, giving each text value its own integer code. Each change is stored under a new variable name.
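One detail worth keeping in mind when chaining variables like this: in pandas, a plain assignment such as `new = old` does not create a second dataframe; both names refer to the same object. A minimal sketch with made-up values (the names `original`, `alias` and `independent` are hypothetical) shows the difference from `.copy()`:

```python
import pandas as pd

# Made-up integer codes, for illustration only.
original = pd.DataFrame({"Department": [3, 2, 1]})

alias = original               # second name for the same dataframe
independent = original.copy()  # separate dataframe with its own data

# A change made through the alias is visible through the original name too.
alias.loc[0, "Department"] = 99

print(alias is original)
print(original["Department"].tolist())
print(independent["Department"].tolist())
```

If the earlier dataframes need to stay unchanged, `.copy()` is the safer choice.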

In [20]:
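#Create an independent copy of the previous dataframe so the earlier one stays unchanged
employee_data_attrition_gender_over_time_business_dummy = employee_data_attrition_gender_over_time_dummy.copy()
#Replace each "BusinessTravel" category with an integer code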
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Travel_Rarely', 'BusinessTravel'] = 3
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Travel_Frequently', 'BusinessTravel'] = 2
#The dataset spells this category "Non-Travel"
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Non-Travel', 'BusinessTravel'] = 1
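The same replacement can be written more compactly with a dictionary and `Series.map`. This also addresses a side effect of the `.loc` approach: the column keeps its `object` dtype (the `info()` output further down lists "BusinessTravel" as `object`), which is why it does not appear in the correlation table. The sketch below uses made-up rows. The dictionary name `travel_codes` is hypothetical, and the spelling "Non-Travel" is an assumption about the CSV that should be checked against `employee_data['BusinessTravel'].unique()`.

```python
import pandas as pd

# Made-up rows; the label spellings are assumptions about the CSV contents.
sample = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],
})

# Same integer codes as in the cell above.
travel_codes = {"Travel_Rarely": 3, "Travel_Frequently": 2, "Non-Travel": 1}

mapped = sample["BusinessTravel"].map(travel_codes)

# Any label missing from the dictionary becomes NaN, so a misspelled
# category is caught here instead of silently staying as text.
assert mapped.notna().all(), "unmapped BusinessTravel label found"

# Store the result as integers so numeric methods such as corr() include it.
sample["BusinessTravel"] = mapped.astype(int)

print(sample["BusinessTravel"].tolist())
print(sample["BusinessTravel"].dtype)
```

The same pattern would apply to "Department", "EducationField", "JobRole" and "MaritalStatus", each with its own dictionary. Note that integer codes impose an order on the categories (for example Sales = 3 > Human Resources = 1), which a correlation or a logistic regression will treat as meaningful.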

In [21]:
#Continue under a new name and replace each "Department" category with an integer code
employee_data_attrition_gender_over_time_business_dep_dummy = employee_data_attrition_gender_over_time_business_dummy
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Sales', 'Department'] = 3
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Research & Development', 'Department'] = 2
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Human Resources', 'Department'] = 1

In [22]:
#Continue under a new name and replace each "EducationField" category with an integer code
employee_data_attrition_gender_over_time_business_dep_educ_dummy = employee_data_attrition_gender_over_time_business_dep_dummy
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Technical Degree', 'EducationField'] = 6
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Other', 'EducationField'] = 5
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Medical', 'EducationField'] = 4
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Marketing', 'EducationField'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Life Sciences', 'EducationField'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Human Resources', 'EducationField'] = 1

In [23]:
#Continue under a new name and replace each "JobRole" category with an integer code
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy = employee_data_attrition_gender_over_time_business_dep_educ_dummy
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Sales Representative', 'JobRole'] = 9
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Sales Executive', 'JobRole'] = 8
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Research Scientist', 'JobRole'] = 7
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Research Director', 'JobRole'] = 6
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Manufacturing Director', 'JobRole'] = 5
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Manager', 'JobRole'] = 4
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Laboratory Technician', 'JobRole'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Human Resources', 'JobRole'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Healthcare Representative', 'JobRole'] = 1

In [24]:
#Continue under a new name and replace each "MaritalStatus" category with an integer code
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy = employee_data_attrition_gender_over_time_business_dep_educ_job_dummy
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Single', 'MaritalStatus'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Married', 'MaritalStatus'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Divorced', 'MaritalStatus'] = 1

In [28]:
#Give the fully encoded dataframe a shorter name
employee_data_dummies = employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy

In [32]:
employee_data_dummies.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition_Yes,Gender_Male
0,41,3,1102,3,1,2,2,1,1,2,...,0,8,0,1,6,4,0,5,1,0
1,49,2,279,2,8,1,2,1,2,3,...,1,10,3,3,10,7,1,7,0,1
2,37,3,1373,2,2,2,5,1,4,4,...,0,7,3,3,0,0,0,0,1,1
3,33,2,1392,2,3,4,2,1,5,4,...,0,8,3,3,8,7,3,0,0,0
4,27,3,591,2,2,1,4,1,7,1,...,1,6,3,3,2,2,2,2,0,1


In [33]:
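#Correlate the numeric columns only; columns still stored as object dtype are skipped
#(numeric_only requires pandas 1.5 or later)
employee_data_dummies.corr(numeric_only=True)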

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition_Yes,Gender_Male
Age,1.0,0.010661,-0.001686,0.208034,,-0.010145,0.010146,0.024287,0.02982,0.509604,...,0.03751,0.680381,-0.019621,-0.02149,0.311309,0.212901,0.216513,0.202089,-0.159205,-0.036311
DailyRate,0.010661,1.0,-0.004985,-0.016806,,-0.05099,0.018355,0.023381,0.046135,0.002966,...,0.042143,0.014515,0.002453,-0.037848,-0.034055,0.009932,-0.033229,-0.026363,-0.056652,-0.011716
DistanceFromHome,-0.001686,-0.004985,1.0,0.021042,,0.032916,-0.016075,0.031131,0.008783,0.005303,...,0.044872,0.004628,-0.036942,-0.026556,0.009508,0.018845,0.010029,0.014406,0.077924,-0.001851
Education,0.208034,-0.016806,0.021042,1.0,,0.04207,-0.027128,0.016775,0.042438,0.101589,...,0.018422,0.14828,-0.0251,0.009819,0.069114,0.060236,0.054254,0.069065,-0.031373,-0.016547
EmployeeCount,,,,,,,,,,,...,,,,,,,,,,
EmployeeNumber,-0.010145,-0.05099,0.032916,0.04207,,1.0,0.017621,0.035179,-0.006888,-0.018519,...,0.062227,-0.014365,0.023603,0.010309,-0.01124,-0.008416,-0.009019,-0.009197,-0.010577,0.022556
EnvironmentSatisfaction,0.010146,0.018355,-0.016075,-0.027128,,0.017621,1.0,-0.049857,-0.008278,0.001212,...,0.003432,-0.002693,-0.019359,0.027627,0.001458,0.018007,0.016194,-0.004999,-0.103369,0.000508
HourlyRate,0.024287,0.023381,0.031131,0.016775,,0.035179,-0.049857,1.0,0.042861,-0.027853,...,0.050263,-0.002334,-0.008548,-0.004607,-0.019582,-0.024106,-0.026716,-0.020123,-0.006846,-0.000478
JobInvolvement,0.02982,0.046135,0.008783,0.042438,,-0.006888,-0.008278,0.042861,1.0,-0.01263,...,0.021523,-0.005533,-0.015338,-0.014617,-0.021355,0.008717,-0.024184,0.025976,-0.130016,0.01796
JobLevel,0.509604,0.002966,0.005303,0.101589,,-0.018519,0.001212,-0.027853,-0.01263,1.0,...,0.013984,0.782208,-0.018191,0.037818,0.534739,0.389447,0.353885,0.375281,-0.169105,-0.039403
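For the problem question, the most relevant part of this table is the single column that relates each feature to attrition. A possible way to extract and rank it is sketched below on made-up numbers. The frame `sample` and the name `corr_with_target` are hypothetical, and on the real data the target column may be called "Attrition" or "Attrition_Yes" depending on whether the rename cell has been run.

```python
import pandas as pd

# Made-up values, for illustration only.
sample = pd.DataFrame({
    "Attrition": [1, 0, 1, 0, 0, 1],
    "Age": [41, 49, 37, 33, 27, 32],
    "OverTime": [1, 0, 1, 1, 0, 0],
    "EmployeeCount": [1, 1, 1, 1, 1, 1],  # constant, so its correlation is NaN
})

corr_with_target = (
    sample.corr(numeric_only=True)["Attrition"]
    .drop("Attrition")   # the target correlates perfectly with itself
    .dropna()            # constant columns such as EmployeeCount give NaN
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_target)
```

Sorting by absolute value puts strong negative and strong positive relationships side by side, since both are of interest when looking for indicators of attrition.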


In [35]:
employee_data_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   BusinessTravel            1470 non-null   object
 2   DailyRate                 1470 non-null   int64 
 3   Department                1470 non-null   object
 4   DistanceFromHome          1470 non-null   int64 
 5   Education                 1470 non-null   int64 
 6   EducationField            1470 non-null   object
 7   EmployeeCount             1470 non-null   int64 
 8   EmployeeNumber            1470 non-null   int64 
 9   EnvironmentSatisfaction   1470 non-null   int64 
 10  HourlyRate                1470 non-null   int64 
 11  JobInvolvement            1470 non-null   int64 
 12  JobLevel                  1470 non-null   int64 
 13  JobRole                   1470 non-null   object
 14  JobSatisfaction         