<a href="https://colab.research.google.com/github/Ad2891/Slutuppgift-team12/blob/main/Slutuppgift.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Assignment 1 - Identifying Employee Attrition

## The Problem Question

Our goal in this task is to identify the features of interest that can help predict employee attrition. The question we pose is:

>**How can prior indicators from features be used to predict if an employee is at risk of leaving the company?**



In [4]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn

from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import classification_report

First we retrieve the data.

In [5]:
#Import .csv file and save in variable employee_data 
employee_data = pd.read_csv("/content/WA_Fn-UseC_-HR-Employee-Attrition.csv", sep=",")
employee_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## Converting string text into dummy variables

To see correlations between the features, we have to be able to compare their values numerically. Many of the features contain text values, which makes this comparison difficult. To work around this, we use dummy variables in their place. Every text feature takes its values from a fixed, predefined set, so we know in advance which values each dummy variable has to represent.

### Dummy for features with only two options

We can use an existing pandas function, `pd.get_dummies`, to convert the features "Gender", "Attrition", "OverTime" and "Over18". "Gender" only contains the values "Male" and "Female", while "Attrition" and "OverTime" only contain "Yes" and "No" ("Over18" is a special case, discussed below).

Note that this renames each column to the old name followed by the value that is kept. Because we pass `drop_first=True`, the first value in sorted order is dropped and the remaining one gives the column its name (e.g. "Attrition" --> "Attrition_Yes"). We change the names back at the end of this section.
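
To illustrate the naming behaviour, here is a minimal sketch on a small made-up frame. The name `toy` and its values are assumptions for illustration only, not the HR data.

```python
import pandas as pd

# Hypothetical toy data, not the HR dataset
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes"],
                    "Age": [41, 49, 37, 33]})

# The categories are sorted ("No", "Yes"); drop_first=True removes the
# first one, so only the "Attrition_Yes" column remains.
toy_dummy = pd.get_dummies(toy, columns=["Attrition"], drop_first=True)
print(list(toy_dummy.columns))

# Depending on the pandas version, the dummy column may be boolean
# rather than 0/1; astype(int) gives 0/1 in either case.
attrition_flags = toy_dummy["Attrition_Yes"].astype(int).tolist()
print(attrition_flags)
```
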

In [18]:
#Create dummy feature for the feature "Attrition" and then drop it
employee_data_attrition_dummy = pd.get_dummies(employee_data, columns=['Attrition'],
                                               drop_first=True)

In [19]:
#Create dummy feature for the feature "Gender" and then drop it
employee_data_attrition_gender_dummy = pd.get_dummies(employee_data_attrition_dummy,
                                                      columns=['Gender'], drop_first=True) 

In [20]:
#Create dummy feature for the feature "Over18" and then drop it
employee_data_attrition_gender_over_dummy = pd.get_dummies(employee_data_attrition_gender_dummy, 
                         columns=['Over18'], drop_first=True)

Note that the feature 'Over18' only contains one value ('Y'), so with `drop_first=True` the column is dropped and no dummy feature takes its place. This isn't a problem, as a feature with a single constant value could not have influenced our result.
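
The effect on a single-valued column can be checked on a small made-up frame. This is a sketch under the assumption that the column really holds just one distinct value, and the name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a constant "Over18" column
toy = pd.DataFrame({"Over18": ["Y", "Y", "Y"],
                    "Age": [41, 49, 37]})

# Only one distinct value is present
n_unique = toy["Over18"].nunique()
print(n_unique)

# With a single category, drop_first=True leaves no dummy column at all
toy_dummy = pd.get_dummies(toy, columns=["Over18"], drop_first=True)
print(list(toy_dummy.columns))
```
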

In [21]:
#Create dummy feature for the feature "OverTime" and then drop it
employee_data_attrition_gender_over_time_dummy = pd.get_dummies(employee_data_attrition_gender_over_dummy, 
                         columns=['OverTime'], drop_first=True)

All that is left is to rename the new columns back to their original names.

In [22]:
employee_data_attrition_gender_over_time_dummy.rename(columns={'Attrition_Yes': 'Attrition',
                                                               'Gender_Male': 'Gender',
                                                               'OverTime_Yes': 'OverTime'},
                                                      inplace = True)

As a quick reference, the encoded values mean:

>"Attrition": 1 = Yes, 0 = No
>
>"Gender": 1 = Male, 0 = Female
>
>"OverTime": 1 = Yes, 0 = No

### Dummy for features with more than two options

As some features have more than two unique values, a single 0/1 column is not enough to represent them. Instead, we replace each text value with an integer code manually. Each feature is handled in its own step, under a new dataframe name.

In [51]:
#Create a new dataframe as an independent copy of the previous dataframe
employee_data_attrition_gender_over_time_business_dummy = employee_data_attrition_gender_over_time_dummy.copy()
#Create the dummy values from the available values
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Travel_Rarely', 'BusinessTravel'] = 3
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Travel_Frequently', 'BusinessTravel'] = 2
employee_data_attrition_gender_over_time_business_dummy.loc[
                employee_data_attrition_gender_over_time_business_dummy['BusinessTravel'] 
                == 'Non-Travel', 'BusinessTravel'] = 1
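
The same encoding can be written more compactly with a dictionary and `Series.map`. This is an alternative sketch, not the method used in this notebook. The names `toy` and `travel_codes` are hypothetical, and the codes simply mirror the ones assigned above.

```python
import pandas as pd

# Hypothetical toy data using the same category labels as "BusinessTravel"
toy = pd.DataFrame({"BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                                       "Non-Travel", "Travel_Rarely"]})

# Same integer codes as in the cell above
travel_codes = {"Non-Travel": 1, "Travel_Frequently": 2, "Travel_Rarely": 3}

# Work on a copy so the original frame keeps its text values
toy_encoded = toy.copy()
toy_encoded["BusinessTravel"] = toy_encoded["BusinessTravel"].map(travel_codes)

print(toy_encoded["BusinessTravel"].tolist())
print(toy_encoded["BusinessTravel"].dtype)
```

When every value has an entry in the dictionary, the result is already an integer column. Any value missing from the dictionary would become `NaN`, so checking for missing values after mapping is a reasonable precaution.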

In [52]:
#Create a new dataframe as an independent copy of the previous dataframe
employee_data_attrition_gender_over_time_business_dep_dummy = employee_data_attrition_gender_over_time_business_dummy.copy()
#Create the dummy values from the available values
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Sales', 'Department'] = 3
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Research & Development', 'Department'] = 2
employee_data_attrition_gender_over_time_business_dep_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_dummy['Department'] 
                == 'Human Resources', 'Department'] = 1

In [53]:
#Create a new dataframe as an independent copy of the previous dataframe
employee_data_attrition_gender_over_time_business_dep_educ_dummy = employee_data_attrition_gender_over_time_business_dep_dummy.copy()
#Create the dummy values from the available values
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Technical Degree', 'EducationField'] = 6
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Other', 'EducationField'] = 5
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Medical', 'EducationField'] = 4
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Marketing', 'EducationField'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Life Sciences', 'EducationField'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_dummy['EducationField'] 
                == 'Human Resources', 'EducationField'] = 1

In [54]:
#Create a new dataframe as an independent copy of the previous dataframe
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy = employee_data_attrition_gender_over_time_business_dep_educ_dummy.copy()
#Create the dummy values from the available values
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Sales Representative', 'JobRole'] = 9
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Sales Executive', 'JobRole'] = 8
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Research Scientist', 'JobRole'] = 7
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Research Director', 'JobRole'] = 6
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Manufacturing Director', 'JobRole'] = 5
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Manager', 'JobRole'] = 4
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Laboratory Technician', 'JobRole'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Human Resources', 'JobRole'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_dummy['JobRole'] 
                == 'Healthcare Representative', 'JobRole'] = 1

In [55]:
#Create a new dataframe as an independent copy of the previous dataframe
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy = employee_data_attrition_gender_over_time_business_dep_educ_job_dummy.copy()
#Create the dummy values from the available values
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Single', 'MaritalStatus'] = 3
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Married', 'MaritalStatus'] = 2
employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy.loc[
                employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy['MaritalStatus'] 
                == 'Divorced', 'MaritalStatus'] = 1

In [56]:
#rename the last dataframe
employee_data_dummies = employee_data_attrition_gender_over_time_business_dep_educ_job_rel_dummy
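
A plain assignment like the rename above only binds a second name to the same DataFrame object. An independent frame requires `.copy()`. The following sketch shows the difference on made-up data, and all names in it are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
original = pd.DataFrame({"MaritalStatus": ["Single", "Married"]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# A change made through the alias is visible through the original name...
alias["Encoded"] = [3, 2]
print("Encoded" in original.columns)

# ...but the independent copy is unaffected.
print("Encoded" in independent.columns)
```
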

Replacing the text values has not changed the dtype of these columns: they still have the `object` dtype even though they now hold integers, so we have to convert them manually. Without this step, the columns could not be included in a correlation calculation.

In [61]:
#Convert the manually encoded columns to integer dtype in one call
employee_data_dummies = employee_data_dummies.astype({'Department': int,
                                                      'BusinessTravel': int,
                                                      'EducationField': int,
                                                      'JobRole': int,
                                                      'MaritalStatus': int})
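
The dtype issue can be reproduced on a small made-up frame. This sketch assumes the text column starts out with the `object` dtype, and the name `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the text column is created with object dtype
toy = pd.DataFrame({
    "Department": pd.Series(["Sales", "Human Resources", "Sales"], dtype=object),
    "Attrition": [1, 0, 1],
})

# Replace the text values with integer codes, as in the cells above
toy.loc[toy["Department"] == "Sales", "Department"] = 3
toy.loc[toy["Department"] == "Human Resources", "Department"] = 1

# The column holds integers now, but its dtype is still object
dtype_before = toy["Department"].dtype
print(dtype_before)

# Convert explicitly so the column is treated as numeric
toy = toy.astype({"Department": int})
dtype_after = toy["Department"].dtype
print(dtype_after)

# The converted column now takes part in the correlation matrix
corr_matrix = toy.corr()
print(corr_matrix.columns.tolist())
```
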

## Removing unnecessary features

Many of the features in the dataframe are closely linked to others, so we need to remove a number of them to reduce redundancy and avoid using the same information twice in our prediction model.
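
One possible way to spot candidates for removal is to look for constant columns and for pairs of strongly correlated columns. This is a rough sketch on made-up data, not necessarily the selection procedure used in this notebook. The frame `toy` and the `threshold` value are assumptions.

```python
import pandas as pd

# Hypothetical toy data; column names mirror some features of the dataset
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80, 80],
    "YearsAtCompany": [6, 10, 0, 8, 2],
    "YearsInCurrentRole": [4, 7, 0, 7, 2],
    "DistanceFromHome": [1, 8, 2, 3, 2],
})

# Columns with a single distinct value carry no information
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

# Absolute pairwise correlations between the remaining columns
corr = reduced.corr().abs()

# Assumed cut-off; the right value depends on the data and the model
threshold = 0.9
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(constant_cols)
print(redundant_pairs)
```

For each flagged pair, one of the two features could be dropped. Which one to keep is a judgment call that depends on the problem.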