### 1. Exploring `HR_Dataset`

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
hr_dataset = pd.read_csv('../data/HR_dataset/HR_Dataset.csv')
pd.options.display.max_columns = None
hr_dataset.head()

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,"Adinolfi, Wilson K",10026,0,0,1,1,5,4,0,62506,0,19,Production Technician I,MA,1960,07/10/83,M,Single,US Citizen,No,White,7/5/2011,,N/A-StillEmployed,Active,Production,Michael Albert,22.0,LinkedIn,Exceeds,4.6,5,0,1/17/2019,0,1
1,"Ait Sidi, Karthikeyan",10084,1,1,1,5,3,3,0,104437,1,27,Sr. DBA,MA,2148,05/05/75,M,Married,US Citizen,No,White,3/30/2015,6/16/2016,career change,Voluntarily Terminated,IT/IS,Simon Roup,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,"Akinkuolie, Sarah",10196,1,1,0,5,5,3,0,64955,1,20,Production Technician II,MA,1810,09/19/88,F,Married,US Citizen,No,White,7/5/2011,9/24/2012,hours,Voluntarily Terminated,Production,Kissy Sullivan,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,"Alagbe,Trina",10088,1,1,0,1,5,3,0,64991,0,19,Production Technician I,MA,1886,09/27/88,F,Married,US Citizen,No,White,1/7/2008,,N/A-StillEmployed,Active,Production,Elijiah Gray,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,"Anderson, Carol",10069,0,2,0,5,5,3,0,50825,1,19,Production Technician I,MA,2169,09/08/89,F,Divorced,US Citizen,No,White,7/11/2011,9/6/2016,return to school,Voluntarily Terminated,Production,Webster Butler,39.0,Google Search,Fully Meets,5.0,4,0,2/1/2016,0,2


In [148]:
# Inspect dtypes and non-null counts before cleaning:
hr_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Employee_Name               311 non-null    object 
 1   EmpID                       311 non-null    int64  
 2   MarriedID                   311 non-null    int64  
 3   MaritalStatusID             311 non-null    int64  
 4   GenderID                    311 non-null    int64  
 5   EmpStatusID                 311 non-null    int64  
 6   DeptID                      311 non-null    int64  
 7   PerfScoreID                 311 non-null    int64  
 8   FromDiversityJobFairID      311 non-null    int64  
 9   Salary                      311 non-null    int64  
 10  Termd                       311 non-null    int64  
 11  PositionID                  311 non-null    int64  
 12  Position                    311 non-null    object 
 13  State                       311 non

In [149]:
hr_dataset.duplicated().sum()

0

In [150]:
hr_dataset.isnull().sum()

Employee_Name                   0
EmpID                           0
MarriedID                       0
MaritalStatusID                 0
GenderID                        0
EmpStatusID                     0
DeptID                          0
PerfScoreID                     0
FromDiversityJobFairID          0
Salary                          0
Termd                           0
PositionID                      0
Position                        0
State                           0
Zip                             0
DOB                             0
Sex                             0
MaritalDesc                     0
CitizenDesc                     0
HispanicLatino                  0
RaceDesc                        0
DateofHire                      0
DateofTermination             207
TermReason                      0
EmploymentStatus                0
Department                      0
ManagerName                     0
ManagerID                       8
RecruitmentSource               0
PerformanceSco

___
___
### 2. Checking database mistakes

2.1. `MaritalStatusID`, `MaritalDesc` and `MarriedID`

2.2. `Termd`, `EmploymentStatus`, `DateofTermination` and `EmpStatusID`

2.3. `GenderID` and `Sex`
___

2.1. Checking whether `MaritalStatusID` matches `MaritalDesc`

In [151]:
# Visually check whether the value counts match:
print(f"MaritalStatusID values: \n\n{hr_dataset['MaritalStatusID'].value_counts()}")
print("------------------------------------\n")
print(f"MaritalDesc values: \n\n{hr_dataset['MaritalDesc'].value_counts()}")

# The counts appear to match

MaritalStatusID values: 

MaritalStatusID
0    137
1    124
2     30
3     12
4      8
Name: count, dtype: int64
------------------------------------

MaritalDesc values: 

MaritalDesc
Single       137
Married      124
Divorced      30
Separated     12
Widowed        8
Name: count, dtype: int64


In [152]:
# Build a dict to check manually that each description maps to the expected ID:
marital_desc_dict = dict(zip(hr_dataset['MaritalDesc'], hr_dataset['MaritalStatusID']))
marital_desc_dict
# The mapping looks as expected

{'Single': 0, 'Married': 1, 'Divorced': 2, 'Widowed': 4, 'Separated': 3}

In [153]:
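# Print the EmpID of any row whose MaritalStatusID disagrees with the dict built above
for emp_id, status_id, desc in zip(hr_dataset['EmpID'], hr_dataset['MaritalStatusID'], hr_dataset['MaritalDesc']):
    if status_id != marital_desc_dict[desc]:
        print(emp_id)

# Nothing is printed, so the two columns agree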

In [154]:
# Let's check more efficiently:
pd.crosstab(hr_dataset['MaritalStatusID'], [hr_dataset['MarriedID'], hr_dataset['MaritalDesc']], rownames = ['MaritalStatusID'], colnames = ['MarriedID','MaritalDesc']) 

MarriedID,0,0,0,0,1
MaritalDesc,Divorced,Separated,Single,Widowed,Married
MaritalStatusID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,0,0,137,0,0
1,0,0,0,0,124
2,30,0,0,0,0
3,0,12,0,0,0
4,0,0,0,8,0
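The same check can be made programmatic, so that it fails loudly instead of relying on reading the table. A minimal sketch, assuming a toy frame `toy` stands in for `hr_dataset` and that `is_one_to_one` is a hypothetical helper (not part of pandas):

```python
import pandas as pd

def is_one_to_one(df, id_col, desc_col):
    """Return True when each ID maps to exactly one description and vice versa."""
    ids_per_desc = df.groupby(desc_col)[id_col].nunique()
    descs_per_id = df.groupby(id_col)[desc_col].nunique()
    return bool((ids_per_desc == 1).all() and (descs_per_id == 1).all())

# Toy data standing in for hr_dataset (values are illustrative only)
toy = pd.DataFrame({
    'MaritalStatusID': [0, 1, 2, 0, 1],
    'MaritalDesc': ['Single', 'Married', 'Divorced', 'Single', 'Married'],
})
consistent = is_one_to_one(toy, 'MaritalStatusID', 'MaritalDesc')

# Introduce a deliberate mismatch to show the check failing
broken = toy.copy()
broken.loc[0, 'MaritalDesc'] = 'Married'
inconsistent = is_one_to_one(broken, 'MaritalStatusID', 'MaritalDesc')
```

On the real data the call would be `is_one_to_one(hr_dataset, 'MaritalStatusID', 'MaritalDesc')`, and the same helper could be reused for `GenderID` and `Sex`.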


___
2.2. Checking `Termd`, `EmploymentStatus`, `DateofTermination` and `EmpStatusID`

In [155]:
pd.crosstab(hr_dataset['EmpStatusID'], [hr_dataset['Termd'], hr_dataset['EmploymentStatus']], rownames = ['EmpStatusID'], colnames = ['Termd', 'EmploymentStatus'])

Termd,0,1,1
EmploymentStatus,Active,Terminated for Cause,Voluntarily Terminated
EmpStatusID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,182,2,0
2,11,0,0
3,14,0,0
4,0,14,0
5,0,0,88


In [156]:
pd.crosstab(hr_dataset['Termd'], hr_dataset['DateofTermination'])

DateofTermination,1/11/2014,1/12/2014,1/12/2017,1/15/2016,1/2/2012,1/26/2016,1/7/2013,1/9/2013,1/9/2014,10/22/2011,10/25/2015,10/31/2014,10/31/2015,11/10/2018,11/11/2016,11/14/2015,11/15/2015,11/15/2016,11/30/2012,11/4/2015,12/12/2015,12/15/2015,12/28/2017,2/12/2016,2/19/2016,2/21/2016,2/22/2017,2/25/2018,2/4/2013,2/5/2016,2/8/2012,2/8/2016,3/15/2015,3/31/2014,4/1/2013,4/1/2016,4/15/2013,4/15/2015,4/15/2018,4/24/2014,4/29/2018,4/4/2014,4/6/2017,4/7/2012,4/7/2018,4/8/2015,5/1/2016,5/1/2018,5/14/2012,5/15/2014,5/17/2016,5/18/2016,5/25/2016,5/30/2011,6/15/2013,6/16/2016,6/18/2013,6/24/2013,6/25/2015,6/27/2015,6/29/2015,6/4/2015,6/4/2018,6/5/2013,6/6/2017,6/8/2016,7/2/2014,7/30/2018,7/8/2017,8/13/2018,8/15/2015,8/19/2012,8/19/2013,8/19/2018,8/2/2014,8/30/2010,8/4/2017,8/7/2014,9/1/2015,9/12/2015,9/15/2015,9/15/2016,9/19/2016,9/23/2016,9/24/2012,9/25/2013,9/26/2011,9/26/2017,9/26/2018,9/27/2018,9/29/2015,9/4/2014,9/5/2015,9/5/2016,9/6/2016,9/7/2015
Termd,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,2,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,2
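Because almost every termination date is distinct, this crosstab is hard to read. A more compact sketch of the same check compares the `Termd` flag against whether a date is present at all. The toy frame below is illustrative, not the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Termd': [0, 1, 1, 0],
    'DateofTermination': [None, '6/16/2016', '9/24/2012', None],
})

# A terminated employee should have a date; an active one should not
has_date = toy['DateofTermination'].notna()
mismatches = toy[(toy['Termd'] == 1) != has_date]
print(len(mismatches))
```

An empty `mismatches` frame means the flag and the date column agree for every row.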


`EmpStatusID` values 4 and 5 appear to mark terminated employees and to encode the type of termination (for cause and voluntary, respectively). Let's check how `EmpStatusID` values 1, 2 and 3 relate to other fields.

In [157]:
pd.crosstab(hr_dataset['EmpStatusID'], [hr_dataset['Termd'], hr_dataset['EmploymentStatus']], rownames = ['EmpStatusID'], colnames = ['Termd', 'EmploymentStatus'])

Termd,0,1,1
EmploymentStatus,Active,Terminated for Cause,Voluntarily Terminated
EmpStatusID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,182,2,0
2,11,0,0
3,14,0,0
4,0,14,0
5,0,0,88


In [158]:
check_empstatusid = hr_dataset[hr_dataset['EmpStatusID'].isin([1, 2, 3])]

In [159]:
pd.crosstab(check_empstatusid['EmpStatusID'], check_empstatusid['CitizenDesc'])

CitizenDesc,Eligible NonCitizen,Non-Citizen,US Citizen
EmpStatusID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,7,1,176
2,0,0,11
3,0,0,14


In [160]:
pd.crosstab(check_empstatusid['EmpStatusID'], check_empstatusid['TermReason'])

TermReason,Fatal attraction,N/A-StillEmployed,"no-call, no-show"
EmpStatusID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,182,1
2,0,11,0
3,0,14,0


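`Termd`, `EmploymentStatus` and `DateofTermination` are clearly related to one another, but their relation to `EmpStatusID` is not fully clear: IDs 4 and 5 match the two termination types, while two employees with ID 1 are recorded as terminated for cause, and nothing checked here distinguishes IDs 1, 2 and 3.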
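To inspect the rows behind the unexpected cells of the crosstab, one option is to filter on the combination directly. A sketch on illustrative toy data (the column names are the dataset's, the rows are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'EmpID': [1, 2, 3, 4],
    'EmpStatusID': [1, 1, 4, 5],
    'EmploymentStatus': ['Active', 'Terminated for Cause',
                         'Terminated for Cause', 'Voluntarily Terminated'],
})

# IDs 1-3 appear to describe active employees, so a non-Active status is suspicious
suspicious = toy[toy['EmpStatusID'].isin([1, 2, 3])
                 & (toy['EmploymentStatus'] != 'Active')]
suspicious_ids = suspicious['EmpID'].tolist()
```

Applied to `hr_dataset`, the same filter would return the rows that need a manual decision about which field to trust.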

___
2.3. Checking `GenderID` and `Sex`

In [161]:
pd.crosstab(hr_dataset['GenderID'], hr_dataset['Sex']) 

Sex,F,M
GenderID,Unnamed: 1_level_1,Unnamed: 2_level_1
0,176,0
1,0,135


___
___
### 3. Data cleaning

3.1. Replacing `0`/`1` with `No`/`Yes`

3.2. `HispanicLatino` data format

3.3. Column names

3.4. Date fields format

3.5. Filling nulls

3.6. Correcting different positions for the same `PositionID`

3.7. Correcting department data

3.8. Exporting the cleaned dataset
___

3.1. Replacing `0`/`1` with `No`/`Yes`

In [162]:
conv_bool = {0: 'No', 1: 'Yes'}

In [163]:
hr_dataset['MarriedID'] = hr_dataset['MarriedID'].map(conv_bool)
hr_dataset['Termd'] = hr_dataset['Termd'].map(conv_bool)
hr_dataset['FromDiversityJobFairID'] = hr_dataset['FromDiversityJobFairID'].map(conv_bool)
hr_dataset.sample(1)

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
111,"Gonzalez, Cayo",10031,No,2,1,1,5,4,Yes,59892,No,19,Production Technician I,MA,2108,09/29/69,M,Divorced,US Citizen,No,Black or African American,7/11/2011,,N/A-StillEmployed,Active,Production,Brannon Miller,12.0,Diversity Job Fair,Exceeds,4.5,4,0,2/18/2019,0,1
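One caveat with `Series.map` and a dict: any value that is not a key in the dict becomes `NaN`, so it is worth confirming that the mapping introduced no nulls. A small sketch with made-up values:

```python
import pandas as pd

conv_bool = {0: 'No', 1: 'Yes'}
flags = pd.Series([0, 1, 2])  # 2 is deliberately not in the mapping

mapped = flags.map(conv_bool)
# Count values that had no entry in the dict and were turned into NaN
unmapped_count = int(mapped.isna().sum())
```

On the notebook's data, the equivalent check would be `hr_dataset[['MarriedID', 'Termd', 'FromDiversityJobFairID']].isna().sum()` right after the mapping.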


___
3.2. `HispanicLatino` data format.

In [164]:
hr_dataset['HispanicLatino'].unique()

array(['No', 'Yes', 'no', 'yes'], dtype=object)

In [165]:
hr_dataset['HispanicLatino'] = hr_dataset['HispanicLatino'].str.title()

In [166]:
hr_dataset['HispanicLatino'].unique()

array(['No', 'Yes'], dtype=object)

___
3.3. Column names

In [167]:
hr_dataset.rename(columns= {'MarriedID': 'Married', 'Termd': 'Term'}, inplace=True)

___
3.4. Date fields format

In [168]:
hr_dataset['DOB'] = pd.to_datetime(hr_dataset['DOB'], format='%m/%d/%y')
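Note that `%y` follows the usual two-digit-year convention: 69 to 99 map to 1969 to 1999 and 00 to 68 map to 2000 to 2068, so a birth year written as `64` is parsed as 2064. A possible correction, sketched on made-up dates and assuming that no employee was born after a chosen cutoff (the `cutoff` value below is hypothetical):

```python
import pandas as pd

dob = pd.to_datetime(pd.Series(['07/10/83', '06/01/64']), format='%m/%d/%y')

# Assumption: any DOB after the cutoff is a century error and is shifted back 100 years
cutoff = pd.Timestamp('2019-12-31')
in_future = dob > cutoff
dob = dob.mask(in_future, dob - pd.DateOffset(years=100))
fixed_years = dob.dt.year.tolist()
```

The cutoff should be chosen from what is known about the data, for example the latest hire date.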

In [169]:
def conv_dates(df, *cols):
    """Convert the given columns of df to datetime in place (expects m/d/YYYY strings)."""
    for col in cols:
        df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

In [170]:
conv_dates(hr_dataset, 'DateofHire', 'DateofTermination', 'LastPerformanceReview_Date')

In [171]:
hr_dataset.dtypes

Employee_Name                         object
EmpID                                  int64
Married                               object
MaritalStatusID                        int64
GenderID                               int64
EmpStatusID                            int64
DeptID                                 int64
PerfScoreID                            int64
FromDiversityJobFairID                object
Salary                                 int64
Term                                  object
PositionID                             int64
Position                              object
State                                 object
Zip                                    int64
DOB                           datetime64[ns]
Sex                                   object
MaritalDesc                           object
CitizenDesc                           object
HispanicLatino                        object
RaceDesc                              object
DateofHire                    datetime64[ns]
DateofTerm

___
3.5. Filling nulls (in ManagerID):

In [172]:
hr_dataset['ManagerID'].isnull().sum()

8

In [173]:
hr_dataset[hr_dataset['ManagerID'].isnull()][['ManagerID', 'ManagerName']]

Unnamed: 0,ManagerID,ManagerName
19,,Webster Butler
30,,Webster Butler
44,,Webster Butler
88,,Webster Butler
135,,Webster Butler
177,,Webster Butler
232,,Webster Butler
251,,Webster Butler


In [174]:
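# All eight missing IDs belong to Webster Butler, whose ManagerID is 39 elsewhere in the data
hr_dataset['ManagerID'] = hr_dataset['ManagerID'].fillna(39)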

In [175]:
hr_dataset['ManagerID'].isnull().sum()

0

In [176]:
hr_dataset['ManagerID'] = hr_dataset['ManagerID'].astype(int)

In [177]:
hr_dataset['ManagerID'].dtypes

dtype('int64')
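Hard-coding `39` works here because all eight missing rows share one manager. A more general sketch derives the fill value from the data itself, assuming every manager name with a missing ID also appears on a row where the ID is present (the toy rows below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'ManagerName': ['Webster Butler', 'Webster Butler', 'Simon Roup', 'Simon Roup'],
    'ManagerID': [39.0, None, 4.0, None],
})

# Build a name -> ID lookup from the rows where the ID is known
known = toy.dropna(subset=['ManagerID']).drop_duplicates('ManagerName')
name_to_id = known.set_index('ManagerName')['ManagerID']

# Fill only the missing IDs, then cast to int once no nulls remain
toy['ManagerID'] = toy['ManagerID'].fillna(toy['ManagerName'].map(name_to_id)).astype(int)
manager_ids = toy['ManagerID'].tolist()
```

If a manager name maps to more than one ID in the real data, `drop_duplicates` silently keeps the first one, so that case would need a separate check.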

___
3.6. Correcting different positions for the same `PositionID`

In [178]:
# Chain drop_duplicates() so the result is a new frame rather than an in-place change on a slice
positions = hr_dataset[['PositionID', 'Position']].drop_duplicates()
positions.sort_values('PositionID')

Unnamed: 0,PositionID,Position
29,1,Accountant I
132,2,Administrative Assistant
32,3,Area Sales Manager
70,4,BI Developer
42,5,BI Director
308,6,CIO
240,7,Data Architect
18,8,Database Administrator
249,9,Data Analyst
12,9,Data Analyst
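Instead of scanning the sorted table by eye, the IDs that carry more than one label can be listed directly. A sketch on toy rows (illustrative values, real column names):

```python
import pandas as pd

toy = pd.DataFrame({
    'PositionID': [9, 9, 23, 23, 24],
    'Position': ['Data Analyst', 'Data Analyst ', 'Shared Services Manager',
                 'Software Engineer', 'Software Engineer'],
})

# Count distinct labels per ID and keep only the ambiguous IDs
labels_per_id = toy.groupby('PositionID')['Position'].nunique()
ambiguous_ids = labels_per_id[labels_per_id > 1].index.tolist()
```

Swapping the two column names in the `groupby` gives the reverse check: labels that are attached to more than one ID.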


Let's see what's happening with the "Software Engineer" with PositionID 23 and 24:

In [179]:
hr_dataset[hr_dataset['Position'] == 'Software Engineer']

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
6,"Andreola, Colby",10194,No,0,0,1,4,3,No,95660,No,24,Software Engineer,MA,2110,1979-05-24,F,Single,US Citizen,No,White,2014-11-10,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,LinkedIn,Fully Meets,3.04,3,4,2019-01-02,0,19
37,"Carabbio, Judith",10085,No,0,0,1,4,3,No,93396,No,24,Software Engineer,MA,2132,1987-04-05,F,Single,US Citizen,No,White,2013-11-11,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,Indeed,Fully Meets,4.96,4,6,2019-01-30,0,3
66,"Del Bosque, Keyla",10155,No,0,0,1,4,3,No,101199,No,24,Software Engineer,MA,2176,1979-07-05,F,Single,US Citizen,No,Black or African American,2012-01-09,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,CareerBuilder,Fully Meets,3.79,5,5,2019-01-25,0,8
86,"Exantus, Susan",10290,Yes,1,0,4,4,2,No,99280,Yes,24,Software Engineer,MA,1749,1987-05-15,F,Married,US Citizen,No,Black or African American,2011-05-02,2013-06-05,attendance,Terminated for Cause,Software Engineering,Alex Sweetwater,10,Indeed,Needs Improvement,2.1,5,4,2012-08-10,4,19
180,"Martin, Sandra",10110,No,0,0,1,4,3,No,105688,No,24,Software Engineer,MA,2135,1987-11-07,F,Single,US Citizen,No,Asian,2013-11-11,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,Google Search,Fully Meets,4.5,5,4,2019-01-14,0,14
212,"Patronick, Lucas",10005,No,0,1,5,4,4,Yes,108987,Yes,24,Software Engineer,MA,1844,1979-02-20,M,Single,US Citizen,No,Black or African American,2011-11-07,2015-09-07,Another position,Voluntarily Terminated,Software Engineering,Alex Sweetwater,10,Diversity Job Fair,Exceeds,5.0,5,3,2015-08-16,0,13
227,"Quinn, Sean",10131,Yes,1,1,5,1,3,Yes,83363,Yes,23,Software Engineer,MA,2045,1984-11-06,M,Married,Eligible NonCitizen,No,Black or African American,2011-02-21,2015-08-15,career change,Voluntarily Terminated,Software Engineering,Janet King,2,Diversity Job Fair,Fully Meets,4.15,4,0,2014-04-19,0,4
245,"Saada, Adell",10126,Yes,1,0,1,4,3,No,86214,No,24,Software Engineer,MA,2132,1986-07-24,F,Married,US Citizen,No,White,2012-11-05,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,Indeed,Fully Meets,4.2,3,6,2019-02-13,0,2
274,"Szabo, Andrew",10024,No,0,1,1,4,4,No,92989,No,24,Software Engineer,MA,2140,1983-05-06,M,Single,US Citizen,No,White,2014-07-07,NaT,N/A-StillEmployed,Active,Software Engineering,Alex Sweetwater,10,LinkedIn,Exceeds,4.5,5,5,2019-02-18,0,1
285,"True, Edward",10102,No,0,1,5,4,3,Yes,100416,Yes,24,Software Engineer,MA,2451,1983-06-14,M,Single,Non-Citizen,No,Black or African American,2013-02-18,2018-04-15,medical issues,Voluntarily Terminated,Software Engineering,Alex Sweetwater,10,Diversity Job Fair,Fully Meets,4.6,3,4,2017-02-12,0,9


In [180]:
hr_dataset[hr_dataset['PositionID'] == 23]

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
164,"LeBlanc, Brandon R",10134,Yes,1,1,1,1,3,No,93046,No,23,Shared Services Manager,MA,1460,1984-06-10,M,Married,US Citizen,No,White,2016-01-05,NaT,N/A-StillEmployed,Active,Admin Offices,Janet King,2,CareerBuilder,Fully Meets,4.1,4,0,2019-01-28,0,20
227,"Quinn, Sean",10131,Yes,1,1,5,1,3,Yes,83363,Yes,23,Software Engineer,MA,2045,1984-11-06,M,Married,Eligible NonCitizen,No,Black or African American,2011-02-21,2015-08-15,career change,Voluntarily Terminated,Software Engineering,Janet King,2,Diversity Job Fair,Fully Meets,4.15,4,0,2014-04-19,0,4


Both rows report to the same manager, so the more plausible reading is that `PositionID` is correct and `Position` is the wrong field:

In [181]:
# Use .loc so the assignment is applied to hr_dataset itself, not to a temporary copy
hr_dataset.loc[hr_dataset['PositionID'] == 23, 'Position'] = "Shared Services Manager"

In [182]:
hr_dataset['Position'][hr_dataset['PositionID'] == 23].value_counts()

Position
Shared Services Manager    2
Name: count, dtype: int64

In [183]:
hr_dataset['Position'][hr_dataset['PositionID'] == 13].value_counts()

Position
IT Manager - DB         2
IT Manager - Support    1
IT Manager - Infra      1
Name: count, dtype: int64

In [184]:
# Collapse the IT Manager variants into a single label (.loc avoids chained assignment)
hr_dataset.loc[hr_dataset['PositionID'] == 13, 'Position'] = 'IT Manager'

In [185]:
hr_dataset['Position'][hr_dataset['PositionID'] == 13].value_counts()

Position
IT Manager    4
Name: count, dtype: int64

In [186]:
hr_dataset['Position'][hr_dataset['PositionID'] == 9].value_counts()

Position
Data Analyst     7
Data Analyst     1
Name: count, dtype: int64

In [187]:
# Unify the two 'Data Analyst' spellings (.loc avoids chained assignment)
hr_dataset.loc[hr_dataset['PositionID'] == 9, 'Position'] = 'Data Analyst'

In [188]:
hr_dataset['Position'][hr_dataset['PositionID'] == 9].value_counts()

Position
Data Analyst    8
Name: count, dtype: int64
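The two `Data Analyst` labels appear to differ only by trailing whitespace. Rather than fixing each case by hand, stray whitespace can be stripped from the text columns in one pass. A sketch on toy data, where the list of columns is an assumption to adapt:

```python
import pandas as pd

toy = pd.DataFrame({
    'PositionID': [9, 9],
    'Position': ['Data Analyst', 'Data Analyst '],
    'Department': ['IT/IS', 'IT/IS  '],
})

# Strip leading/trailing whitespace from each text column of interest
for col in ['Position', 'Department']:
    toy[col] = toy[col].str.strip()

unique_positions = toy['Position'].unique().tolist()
unique_departments = toy['Department'].unique().tolist()
```

Running this kind of pass right after loading would remove whitespace-only duplicates before the ID checks.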

___
3.7. Cleaning Department ID and Department errors

In [189]:
# Chain drop_duplicates() so the result is a new frame rather than an in-place change on a slice
depts = hr_dataset[['DeptID', 'Department']].drop_duplicates()
depts.sort_values('DeptID')

Unnamed: 0,DeptID,Department
26,1,Admin Offices
227,1,Software Engineering
150,2,Executive Office
1,3,IT/IS
6,4,Software Engineering
0,5,Production
32,6,Sales
64,6,Production


Let's correct the name of production's department:

In [190]:
depts.loc[64,'Department']

'Production       '

In [191]:
# Strip surrounding whitespace so 'Production       ' becomes 'Production'
hr_dataset['Department'] = hr_dataset['Department'].str.strip()

Let's look at the rows where `DeptID` == 6 and `Department` == `Production` to decide which field is correct:

In [192]:
hr_dataset[hr_dataset['DeptID'] == 6].sort_values('Department').head()

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
64,"Dee, Randy",10311,Yes,1,1,1,6,1,No,56991,No,19,Production Technician I,MA,2138,1988-04-15,M,Married,US Citizen,No,White,2018-07-09,NaT,N/A-StillEmployed,Active,Production,Brannon Miller,12,Indeed,Fully Meets,4.3,4,3,2019-01-31,2,2
32,"Bunbury, Jessica",10188,Yes,1,0,5,6,3,No,74326,Yes,3,Area Sales Manager,VA,21851,2064-06-01,F,Married,Eligible NonCitizen,No,Black or African American,2011-08-15,2014-08-02,Another position,Voluntarily Terminated,Sales,John Smith,17,Google Search,Fully Meets,3.14,5,0,2013-02-10,1,19
282,"Torrence, Jack",10013,No,3,1,1,6,4,No,64397,No,3,Area Sales Manager,ND,58782,2068-01-15,M,Separated,US Citizen,No,White,2006-01-09,NaT,N/A-StillEmployed,Active,Sales,Lynn Daneault,21,Indeed,Exceeds,4.1,3,0,2019-01-04,0,6
278,"Terry, Sharlene",10161,No,0,0,1,6,3,No,58370,No,3,Area Sales Manager,OR,97756,2065-05-07,F,Single,US Citizen,No,Black or African American,2014-09-29,NaT,N/A-StillEmployed,Active,Sales,Lynn Daneault,21,Indeed,Fully Meets,3.69,3,0,2019-01-28,0,18
270,"Strong, Caitrin",10241,Yes,1,0,1,6,3,No,60120,No,3,Area Sales Manager,MT,59102,1989-05-12,F,Married,US Citizen,No,Black or African American,2010-09-27,NaT,N/A-StillEmployed,Active,Sales,John Smith,17,Indeed,Fully Meets,4.1,4,0,2019-01-31,0,18


In [193]:
hr_dataset['Department'][hr_dataset['ManagerName'] == 'Brannon Miller'].value_counts()

Department
Production    22
Name: count, dtype: int64

In [194]:
hr_dataset['DeptID'][hr_dataset['ManagerName'] == 'Brannon Miller'].value_counts()

DeptID
5    21
6     1
Name: count, dtype: int64

Every other employee under this manager has `DeptID` 5, so `Department` is correct and the `DeptID` of 6 is the error:

In [195]:
# Use .loc so the assignment is applied to hr_dataset itself, not to a temporary copy
hr_dataset.loc[hr_dataset['Department'] == 'Production', 'DeptID'] = 5

In [196]:
hr_dataset['DeptID'][hr_dataset['ManagerName'] == 'Brannon Miller'].value_counts()

DeptID
5    22
Name: count, dtype: int64
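The reasoning used here (most rows of a group agree, one is an outlier) can be generalized by comparing each row's `DeptID` with the most common `DeptID` of its `Department`. A sketch on made-up rows. The majority-vote rule is an assumption, not something the dataset guarantees:

```python
import pandas as pd

toy = pd.DataFrame({
    'Department': ['Production', 'Production', 'Production', 'Sales', 'Sales'],
    'DeptID': [5, 5, 6, 6, 6],
})

# Most frequent DeptID within each Department (first mode on ties)
modal_id = toy.groupby('Department')['DeptID'].transform(lambda s: s.mode().iloc[0])
outliers = toy[toy['DeptID'] != modal_id]
outlier_rows = outliers.index.tolist()
```

The flagged rows are candidates for review rather than automatic corrections, since the majority can also be wrong.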

Let's check now the values where DeptID == 1:

In [197]:
hr_dataset['Department'][hr_dataset['DeptID'] == 1].value_counts()

Department
Admin Offices           9
Software Engineering    1
Name: count, dtype: int64

In [198]:
hr_dataset[hr_dataset['DeptID'] == 1].sort_values('Department', ascending= False)

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
227,"Quinn, Sean",10131,Yes,1,1,5,1,3,Yes,83363,Yes,23,Shared Services Manager,MA,2045,1984-11-06,M,Married,Eligible NonCitizen,No,Black or African American,2011-02-21,2015-08-15,career change,Voluntarily Terminated,Software Engineering,Janet King,2,Diversity Job Fair,Fully Meets,4.15,4,0,2014-04-19,0,4
26,"Boutwell, Bonalyn",10081,Yes,1,0,1,1,3,Yes,106367,No,26,Sr. Accountant,MA,2468,1987-04-04,F,Married,US Citizen,No,Black or African American,2015-02-16,NaT,N/A-StillEmployed,Active,Admin Offices,Brandon R. LeBlanc,3,Diversity Job Fair,Fully Meets,5.0,4,3,2019-02-18,0,4
29,"Brown, Mia",10238,Yes,1,0,1,1,3,Yes,63000,No,1,Accountant I,MA,1450,1987-11-24,F,Married,US Citizen,No,Black or African American,2008-10-27,NaT,N/A-StillEmployed,Active,Admin Offices,Brandon R. LeBlanc,1,Diversity Job Fair,Fully Meets,4.5,2,6,2019-01-15,0,14
97,"Foster-Baker, Amy",10080,Yes,1,0,1,1,3,No,99351,No,26,Sr. Accountant,MA,2050,1979-04-16,F,Married,US Citizen,No,White,2009-01-05,NaT,N/A-StillEmployed,Active,Admin Offices,Board of Directors,9,Other,Fully Meets,5.0,3,2,2019-02-08,0,3
132,"Howard, Estelle",10182,Yes,1,0,1,1,3,No,49920,Yes,2,Administrative Assistant,MA,2170,1985-09-16,F,Married,US Citizen,No,Black or African American,2015-02-16,2015-04-15,"no-call, no-show",Terminated for Cause,Admin Offices,Brandon R. LeBlanc,1,Indeed,Fully Meets,3.24,3,4,2015-04-15,0,6
160,"LaRotonda, William",10038,No,2,1,1,1,3,No,64520,No,1,Accountant I,MA,1460,1984-04-26,M,Divorced,US Citizen,No,Black or African American,2014-01-06,NaT,N/A-StillEmployed,Active,Admin Offices,Brandon R. LeBlanc,1,Website,Fully Meets,5.0,4,4,2019-01-17,0,3
164,"LeBlanc, Brandon R",10134,Yes,1,1,1,1,3,No,93046,No,23,Shared Services Manager,MA,1460,1984-06-10,M,Married,US Citizen,No,White,2016-01-05,NaT,N/A-StillEmployed,Active,Admin Offices,Janet King,2,CareerBuilder,Fully Meets,4.1,4,0,2019-01-28,0,20
255,"Singh, Nan",10039,No,0,0,1,1,3,No,51920,No,2,Administrative Assistant,MA,2330,1988-05-19,F,Single,US Citizen,No,White,2015-05-01,NaT,N/A-StillEmployed,Active,Admin Offices,Brandon R. LeBlanc,1,Website,Fully Meets,5.0,3,5,2019-01-15,0,2
259,"Smith, Leigh Ann",10153,Yes,1,0,5,1,3,Yes,55000,Yes,2,Administrative Assistant,MA,1844,1987-06-14,F,Married,US Citizen,No,Black or African American,2011-09-26,2013-09-25,career change,Voluntarily Terminated,Admin Offices,Brandon R. LeBlanc,1,Diversity Job Fair,Fully Meets,3.8,4,4,2013-08-15,0,17
268,"Steans, Tyrone",10147,No,0,1,1,1,3,No,63003,No,1,Accountant I,MA,2703,1986-09-01,M,Single,US Citizen,No,White,2014-09-29,NaT,N/A-StillEmployed,Active,Admin Offices,Brandon R. LeBlanc,1,Indeed,Fully Meets,3.9,5,5,2019-01-18,0,9


In [199]:
hr_dataset['Position'][hr_dataset['Department'] == 'Software Engineering'].value_counts()

Position
Software Engineer               9
Software Engineering Manager    1
Shared Services Manager         1
Name: count, dtype: int64

In [200]:
hr_dataset['Department'][hr_dataset['Position'] == 'Shared Services Manager'].value_counts()

Department
Admin Offices           1
Software Engineering    1
Name: count, dtype: int64

In [201]:
hr_dataset[hr_dataset['Position'] == 'Shared Services Manager']

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
164,"LeBlanc, Brandon R",10134,Yes,1,1,1,1,3,No,93046,No,23,Shared Services Manager,MA,1460,1984-06-10,M,Married,US Citizen,No,White,2016-01-05,NaT,N/A-StillEmployed,Active,Admin Offices,Janet King,2,CareerBuilder,Fully Meets,4.1,4,0,2019-01-28,0,20
227,"Quinn, Sean",10131,Yes,1,1,5,1,3,Yes,83363,Yes,23,Shared Services Manager,MA,2045,1984-11-06,M,Married,Eligible NonCitizen,No,Black or African American,2011-02-21,2015-08-15,career change,Voluntarily Terminated,Software Engineering,Janet King,2,Diversity Job Fair,Fully Meets,4.15,4,0,2014-04-19,0,4


In [202]:
# With only 2 people it is not clear which department is correct. Let's look at the employees reporting to Janet King:
hr_dataset[hr_dataset['ManagerName'] == 'Janet King']

Unnamed: 0,Employee_Name,EmpID,Married,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Term,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
27,"Bozzi, Charles",10175,No,0,1,5,5,3,No,74312,Yes,18,Production Manager,MA,1901,1970-03-10,M,Single,US Citizen,No,Asian,2013-09-30,2014-08-07,retiring,Voluntarily Terminated,Production,Janet King,2,Indeed,Fully Meets,3.39,3,0,2014-02-20,0,14
36,"Candie, Calvin",10001,No,0,1,1,5,4,No,72640,No,18,Production Manager,MA,2169,1983-08-09,M,Single,US Citizen,No,White,2016-01-28,NaT,N/A-StillEmployed,Active,Production,Janet King,2,Indeed,Exceeds,5.0,3,0,2019-02-22,0,14
54,"Corleone, Michael",10282,No,2,1,1,5,2,No,68051,No,18,Production Manager,MA,1803,1975-12-17,M,Divorced,US Citizen,No,White,2010-07-20,NaT,N/A-StillEmployed,Active,Production,Janet King,2,CareerBuilder,Needs Improvement,4.13,2,0,2019-01-14,3,3
55,"Corleone, Vito",10019,No,0,1,1,5,4,No,170500,No,10,Director of Operations,MA,2030,1983-03-19,M,Single,US Citizen,No,Black or African American,2009-01-05,NaT,N/A-StillEmployed,Active,Production,Janet King,2,Indeed,Exceeds,3.7,5,0,2019-02-04,0,15
78,"Dunn, Amy",10105,No,0,0,1,5,3,No,75188,No,18,Production Manager,MA,1731,1973-11-28,F,Single,US Citizen,No,White,2014-09-18,NaT,N/A-StillEmployed,Active,Production,Janet King,2,Google Search,Fully Meets,4.52,4,0,2019-01-15,0,4
118,"Gray, Elijiah",10098,No,2,1,1,5,3,No,62957,No,18,Production Manager,MA,1752,1981-07-11,M,Divorced,US Citizen,No,White,2015-06-02,NaT,N/A-StillEmployed,Active,Production,Janet King,2,Employee Referral,Fully Meets,4.63,3,0,2019-01-04,0,2
131,"Houlihan, Debra",10272,Yes,1,0,1,6,3,No,180000,No,11,Director of Sales,RI,2908,2066-03-17,F,Married,US Citizen,No,White,2014-05-05,NaT,N/A-StillEmployed,Active,Sales,Janet King,2,LinkedIn,Fully Meets,4.5,4,0,2019-01-21,0,19
137,"Immediato, Walter",10289,Yes,1,1,5,5,2,No,83082,Yes,18,Production Manager,MA,2128,1976-11-15,M,Married,US Citizen,No,Asian,2011-02-21,2012-09-24,unhappy,Voluntarily Terminated,Production,Janet King,2,Indeed,Needs Improvement,2.34,2,0,2012-04-12,3,4
157,"Landa, Hans",10092,Yes,1,1,4,5,3,No,82758,Yes,18,Production Manager,MA,1890,1972-07-01,M,Married,US Citizen,No,White,2011-01-10,2015-12-12,attendance,Terminated for Cause,Production,Janet King,2,Employee Referral,Fully Meets,4.78,4,0,2015-02-15,0,9
164,"LeBlanc, Brandon R",10134,Yes,1,1,1,1,3,No,93046,No,23,Shared Services Manager,MA,1460,1984-06-10,M,Married,US Citizen,No,White,2016-01-05,NaT,N/A-StillEmployed,Active,Admin Offices,Janet King,2,CareerBuilder,Fully Meets,4.1,4,0,2019-01-28,0,20


All of them are managers or directors, so we assume the incorrect value is the `Software Engineering` department:

In [203]:
# Use .loc so the assignment is applied to hr_dataset itself, not to a temporary copy
hr_dataset.loc[hr_dataset['Position'] == 'Shared Services Manager', 'Department'] = 'Admin Offices'

In [204]:
hr_dataset['Department'][hr_dataset['Position'] == 'Shared Services Manager'].value_counts()

Department
Admin Offices    2
Name: count, dtype: int64

___
3.8. Exporting the cleaned dataset

In [205]:
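# index=False keeps the row index out of the file, so reloading does not add an unnamed column
hr_dataset.to_csv('../data/HR_dataset/HR_Dataset_clean.csv', index=False)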
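When the cleaned file is read back, the date columns arrive as strings again unless they are parsed explicitly. A sketch of the round trip using an in-memory buffer instead of a file on disk (the toy frame and `buffer` are illustrative):

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'EmpID': [10026, 10084],
    'DateofHire': pd.to_datetime(['2011-07-05', '2015-03-30']),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot store
reloaded = pd.read_csv(buffer, parse_dates=['DateofHire'])
columns = reloaded.columns.tolist()
```

For the real file, the list passed to `parse_dates` would include `DOB`, `DateofHire`, `DateofTermination` and `LastPerformanceReview_Date`.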

___
___
## 4. ANALYSIS QUESTIONS:

### Engagement
- Relation between `EngagementSurvey` and `SpecialProjectsCount`?
- Relation between `EmpSatisfaction` and `SpecialProjectsCount`?
- Relation between `Absences` and `EngagementSurvey`?
- Relation between `DaysLateLast30` and `EngagementSurvey`?
- Relation between `PerformanceScore` and `EmpSatisfaction`?
- Relation between `LastPerformanceReview_Date` and `ManagerID`?
- Relation between `EmpSatisfaction` and `Salary`?

### Attrition
- Relation between `Term` (originally `Termd`) and `ManagerID`?
- Relation between `Term` (originally `Termd`) and `Position`?
- Relation between `Term` (originally `Termd`) and `Department`?

### Diversity
- What is the overall diversity `Sex` profile of the organization?
- What is the overall diversity `RaceDesc` profile of the organization?

### Worker differences
- Relation between `RecruitmentSource` and `Salary`?
- Relation between `Sex` and `Salary`?
- Relation between `RaceDesc` and `Salary`?
- Relation between `Department` and `Salary`?
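
Several of these questions reduce to comparing a rate or a typical value across groups. A minimal sketch for the attrition questions, on made-up rows and assuming the termination column holds `Yes`/`No` under the name `Term`, as produced in section 3:

```python
import pandas as pd

toy = pd.DataFrame({
    'Department': ['Production', 'Production', 'Sales', 'Sales'],
    'Term': ['Yes', 'No', 'No', 'No'],
})

# Share of terminated employees per department
attrition = toy['Term'].eq('Yes').groupby(toy['Department']).mean()
production_rate = attrition['Production']
sales_rate = attrition['Sales']
```

The same pattern with `Salary` and `.median()` in place of the boolean mean would address the salary comparisons by `Sex`, `RaceDesc`, `RecruitmentSource` or `Department`.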

In [206]:
# Let's check the main correlations between numeric columns
# (numeric_only=True skips text columns such as Employee_Name, which corr() cannot convert to float):
plt.figure(figsize = (20,15))
sns.heatmap(hr_dataset.corr(numeric_only = True), cmap = "crest", annot = True,
            vmin = -1, vmax = 1);