### 1. Exploring `HR_Dataset`: checking for data errors

In [2]:
import pandas as pd

In [3]:
hr_dataset = pd.read_csv('data/HR_Dataset.csv')
pd.options.display.max_columns = None
hr_dataset.head()

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,"Adinolfi, Wilson K",10026,0,0,1,1,5,4,0,62506,0,19,Production Technician I,MA,1960,07/10/83,M,Single,US Citizen,No,White,7/5/2011,,N/A-StillEmployed,Active,Production,Michael Albert,22.0,LinkedIn,Exceeds,4.6,5,0,1/17/2019,0,1
1,"Ait Sidi, Karthikeyan",10084,1,1,1,5,3,3,0,104437,1,27,Sr. DBA,MA,2148,05/05/75,M,Married,US Citizen,No,White,3/30/2015,6/16/2016,career change,Voluntarily Terminated,IT/IS,Simon Roup,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,"Akinkuolie, Sarah",10196,1,1,0,5,5,3,0,64955,1,20,Production Technician II,MA,1810,09/19/88,F,Married,US Citizen,No,White,7/5/2011,9/24/2012,hours,Voluntarily Terminated,Production,Kissy Sullivan,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,"Alagbe,Trina",10088,1,1,0,1,5,3,0,64991,0,19,Production Technician I,MA,1886,09/27/88,F,Married,US Citizen,No,White,1/7/2008,,N/A-StillEmployed,Active,Production,Elijiah Gray,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,"Anderson, Carol",10069,0,2,0,5,5,3,0,50825,1,19,Production Technician I,MA,2169,09/08/89,F,Divorced,US Citizen,No,White,7/11/2011,9/6/2016,return to school,Voluntarily Terminated,Production,Webster Butler,39.0,Google Search,Fully Meets,5.0,4,0,2/1/2016,0,2


___
___
## CHECKS:

1.1. `MaritalStatusID`, `MaritalDesc` and `MarriedID`

1.2. `Termd`, `EmploymentStatus`, `DateofTermination` and `EmpStatusID`

1.3. `HispanicLatino` and `RaceDesc`

1.4. `GenderID` and `Sex`
___

1.1.1. Checking whether `MaritalStatusID` matches `MaritalDesc`

In [4]:
# Visual check: do the value_counts of both columns match?
print(f"MaritalStatusID values: \n\n{hr_dataset['MaritalStatusID'].value_counts()}")
print("------------------------------------\n")
print(f"MaritalDesc values: \n\n{hr_dataset['MaritalDesc'].value_counts()}")

# They appear to match.

MaritalStatusID values: 

0    137
1    124
2     30
3     12
4      8
Name: MaritalStatusID, dtype: int64
------------------------------------

MaritalDesc values: 

Single       137
Married      124
Divorced      30
Separated     12
Widowed        8
Name: MaritalDesc, dtype: int64


In [5]:
# Building a dict to check that each ID pairs up with the expected description:
dict(zip(hr_dataset['MaritalStatusID'], hr_dataset['MaritalDesc']))

# They appear to.

{0: 'Single', 1: 'Married', 2: 'Divorced', 4: 'Widowed', 3: 'Separated'}
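
Note that `dict(zip(...))` keeps only the last value seen for each key, so an ID paired with two different descriptions would go unnoticed. A minimal sketch of a stricter check, assuming a small hypothetical sample in place of `hr_dataset` (the rows below are illustrative, not the real data): count the distinct descriptions attached to each ID.

```python
import pandas as pd

# Hypothetical toy rows standing in for hr_dataset (not the real data)
sample = pd.DataFrame({
    "MaritalStatusID": [0, 1, 2, 0, 1, 3, 4],
    "MaritalDesc": ["Single", "Married", "Divorced", "Single",
                    "Married", "Separated", "Widowed"],
})

# Number of distinct descriptions attached to each ID
desc_per_id = sample.groupby("MaritalStatusID")["MaritalDesc"].nunique()

# IDs that map to more than one description (empty if the mapping is clean)
ambiguous_ids = desc_per_id[desc_per_id > 1]
print(ambiguous_ids.empty)

# Unique (ID, description) pairs, i.e. the actual lookup table
pairs = sample[["MaritalStatusID", "MaritalDesc"]].drop_duplicates()
print(pairs.sort_values("MaritalStatusID").to_string(index=False))
```

On the real data, the same two expressions can be run with `hr_dataset` in place of `sample`.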

1.1.2. Checking whether `MaritalStatusID` matches `MarriedID`

In [6]:
# Visual check: do the value_counts of both columns match?
print(f"MaritalStatusID values : \n\n{hr_dataset['MaritalStatusID'].value_counts()}\n")
print("------------------------------------\n")
print(f"MarriedID values: \n\n{hr_dataset['MarriedID'].value_counts()}")

# They appear to match: 124 married employees in both columns.

MaritalStatusID values : 

0    137
1    124
2     30
3     12
4      8
Name: MaritalStatusID, dtype: int64

------------------------------------

MarriedID values: 

0    187
1    124
Name: MarriedID, dtype: int64


In [7]:
# Building a dict to check that the values pair up as expected:
dict(zip(hr_dataset['MaritalStatusID'], hr_dataset['MarriedID']))

# They appear to: only MaritalStatusID 1 maps to MarriedID 1.

{0: 0, 1: 1, 2: 0, 4: 0, 3: 0}

In [8]:
# PROBLEM: reversing the order also returns a mapping, but here it is misleading.
# MarriedID 0 covers several MaritalStatusID values and the dict keeps only one of them
# (the value_counts do not line up):
dict(zip(hr_dataset['MarriedID'], hr_dataset['MaritalStatusID']))

{0: 4, 1: 1}
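
The reversed dict shows `{0: 4, 1: 1}` because the relation is many-to-one: `zip` pairs every row, and the dict overwrites earlier values for the same key. A sketch of a check that does not depend on direction, using hypothetical toy rows and assuming the rule suggested by the output above (`MarriedID` is 1 only when `MaritalStatusID` is 1):

```python
import pandas as pd

# Hypothetical toy rows standing in for hr_dataset (not the real data)
sample = pd.DataFrame({
    "MaritalStatusID": [0, 1, 2, 3, 4, 1, 0],
    "MarriedID":       [0, 1, 0, 0, 0, 1, 0],
})

# Contingency table: every (MaritalStatusID, MarriedID) combination with its count
table = pd.crosstab(sample["MaritalStatusID"], sample["MarriedID"])
print(table)

# Assumed rule: MarriedID should be 1 exactly when MaritalStatusID == 1
expected = (sample["MaritalStatusID"] == 1).astype(int)
mismatches = sample[sample["MarriedID"] != expected]
print(len(mismatches))
```

The contingency table shows all combinations at once, so nothing is lost to key overwriting, and `mismatches` lists the offending rows if the assumed rule is violated.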

___
1.2.1. Check between `Termd`, `EmploymentStatus`, `DateofTermination` and `EmpStatusID`

In [9]:
# Counting null values (should match the number of active employees):
hr_dataset['DateofTermination'].isnull().sum()

207
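
Matching totals do not guarantee that the same rows are involved. A minimal row-level sketch, on hypothetical toy rows, that flags employees whose `DateofTermination` is missing while `Termd` is 1, or present while `Termd` is 0:

```python
import pandas as pd

# Hypothetical toy rows standing in for hr_dataset (not the real data)
sample = pd.DataFrame({
    "Termd": [0, 1, 1, 0],
    "DateofTermination": [None, "6/16/2016", "9/24/2012", None],
})

has_no_date = sample["DateofTermination"].isnull()
is_active = sample["Termd"] == 0

# Rows where the two columns disagree (empty if they are consistent)
inconsistent = sample[has_no_date != is_active]
print(len(inconsistent))
```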

In [10]:
print(f"Termd values: \n\n{hr_dataset['Termd'].value_counts()}\n")
print("------------------------------------\n")
print(f"EmploymentStatus values: \n\n{hr_dataset['EmploymentStatus'].value_counts()}")
print("------------------------------------\n")
print(f"EmpStatusID values: \n\n{hr_dataset['EmpStatusID'].value_counts()}")

Termd values: 

0    207
1    104
Name: Termd, dtype: int64

------------------------------------

EmploymentStatus values: 

Active                    207
Voluntarily Terminated     88
Terminated for Cause       16
Name: EmploymentStatus, dtype: int64
------------------------------------

EmpStatusID values: 

1    184
5     88
3     14
4     14
2     11
Name: EmpStatusID, dtype: int64


**Warning:** `Termd` matches `EmploymentStatus` (207 active, 104 terminated), but not `EmpStatusID`, which has five codes and only 184 rows with code 1.
Pending: analyze what the `EmpStatusID` codes represent.
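
One possible way to start on the pending item is to cross-tabulate `EmpStatusID` against `EmploymentStatus` and see which codes fall under each label. The sketch below uses only the five rows shown by `hr_dataset.head()`, so it illustrates the technique and says nothing about the full distribution.

```python
import pandas as pd

# The five rows shown by hr_dataset.head(), reduced to the relevant columns
sample = pd.DataFrame({
    "EmpStatusID": [1, 5, 5, 1, 5],
    "EmploymentStatus": ["Active", "Voluntarily Terminated",
                         "Voluntarily Terminated", "Active",
                         "Voluntarily Terminated"],
    "Termd": [0, 1, 1, 0, 1],
})

# Rows: status codes; columns: status labels; cells: number of employees
status_table = pd.crosstab(sample["EmpStatusID"], sample["EmploymentStatus"])
print(status_table)

# Codes that appear under more than one EmploymentStatus label
labels_per_code = (status_table > 0).sum(axis=1)
print(labels_per_code[labels_per_code > 1])
```

Running the same `pd.crosstab` call on `hr_dataset` should show how codes 2, 3 and 4 are distributed across the three labels.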

In [11]:
# Building a dict to check that each status pairs up with the expected Termd value:
dict(zip(hr_dataset['EmploymentStatus'],hr_dataset['Termd']))


{'Active': 0, 'Voluntarily Terminated': 1, 'Terminated for Cause': 1}

___
1.3.1. Check between `HispanicLatino` and `RaceDesc`

In [12]:
print(f"HispanicLatino values: \n\n{hr_dataset['HispanicLatino'].value_counts()}")
print("------------------------------------\n")
print(f"RaceDesc values: \n\n{hr_dataset['RaceDesc'].value_counts()}")

HispanicLatino values: 

No     282
Yes     27
no       1
yes      1
Name: HispanicLatino, dtype: int64
------------------------------------

RaceDesc values: 

White                               187
Black or African American            80
Asian                                29
Two or more races                    11
American Indian or Alaska Native      3
Hispanic                              1
Name: RaceDesc, dtype: int64
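
The output above shows two issues: `HispanicLatino` mixes `Yes`/`No` with lowercase `yes`/`no`, and `RaceDesc` contains a single `Hispanic` entry. A minimal sketch, on hypothetical toy rows, of normalizing the casing and cross-checking the two columns:

```python
import pandas as pd

# Hypothetical toy rows standing in for hr_dataset (not the real data)
sample = pd.DataFrame({
    "HispanicLatino": ["No", "Yes", "no", "yes", "No"],
    "RaceDesc": ["White", "White", "Asian", "Hispanic",
                 "Black or African American"],
})

# Normalize casing and stray whitespace so 'no' and 'No' count as one value
sample["HispanicLatino"] = sample["HispanicLatino"].str.strip().str.capitalize()
print(sample["HispanicLatino"].value_counts())

# Rows labelled 'Hispanic' in RaceDesc but not flagged 'Yes' in HispanicLatino
conflict = sample[(sample["RaceDesc"] == "Hispanic")
                  & (sample["HispanicLatino"] != "Yes")]
print(len(conflict))
```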


___
## ANALYSIS QUESTIONS:

### Engagement
- Relationship between `EngagementSurvey` and `SpecialProjectsCount`?
- Relationship between `EmpSatisfaction` and `SpecialProjectsCount`?
- Relationship between `Absences` and `EngagementSurvey`?
- Relationship between `DaysLateLast30` and `EngagementSurvey`?
- Relationship between `PerformanceScore` and `EmpSatisfaction`?
- Relationship between `LastPerformanceReview_Date` and `ManagerID`?
- Relationship between `EmpSatisfaction` and `Salary`?
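
For the numeric pairs in this list, a correlation matrix is one possible first look. The sketch below uses the five rows shown by `hr_dataset.head()` purely to illustrate the call; five rows are far too few to draw any conclusion, and Pearson correlation is an assumption here, not a choice made in this notebook.

```python
import pandas as pd

# Numeric columns from the five rows shown by hr_dataset.head()
sample = pd.DataFrame({
    "EngagementSurvey": [4.6, 4.96, 3.02, 4.84, 5.0],
    "EmpSatisfaction": [5, 3, 3, 5, 4],
    "SpecialProjectsCount": [0, 6, 0, 0, 0],
    "Absences": [1, 17, 3, 15, 2],
    "Salary": [62506, 104437, 64955, 64991, 50825],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = sample.corr()
print(corr_matrix.round(2))
```

`PerformanceScore` and `ManagerID` are categorical, so those questions would need a grouped comparison instead of a correlation.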

### Attrition
- Relationship between `Termd` and `ManagerID`?
- Relationship between `Termd` and `Position`?
- Relationship between `Termd` and `Department`?
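
Since `Termd` is 0 or 1, its mean within a group is that group's termination rate. A sketch of this idea grouped by `Department`, using only the five rows shown by `hr_dataset.head()`; the column names `termination_rate` and `employees` are introduced here for illustration.

```python
import pandas as pd

# Department and Termd from the five rows shown by hr_dataset.head()
sample = pd.DataFrame({
    "Department": ["Production", "IT/IS", "Production",
                   "Production", "Production"],
    "Termd": [0, 1, 1, 0, 1],
})

# Mean of a 0/1 flag = share of terminated employees in each group
rate = (
    sample.groupby("Department")["Termd"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "termination_rate", "count": "employees"})
)
print(rate)
```

Replacing `"Department"` with `"ManagerID"` or `"Position"` covers the other two questions; keeping the group size next to the rate helps avoid over-reading very small groups.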

### Differences between workers
- Relationship between `RecruitmentSource` and `Salary`?
- Relationship between `Sex` and `Salary`?
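
A grouped summary of `Salary` is one simple way to approach these questions. The sketch below uses the five rows shown by `hr_dataset.head()` to illustrate the calls only; the median is used alongside the mean on the assumption that salaries may be skewed.

```python
import pandas as pd

# Sex, RecruitmentSource and Salary from the five rows shown by hr_dataset.head()
sample = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "F"],
    "RecruitmentSource": ["LinkedIn", "Indeed", "LinkedIn",
                          "Indeed", "Google Search"],
    "Salary": [62506, 104437, 64955, 64991, 50825],
})

# Group size, median and mean salary per sex
salary_by_sex = sample.groupby("Sex")["Salary"].agg(["count", "median", "mean"])
print(salary_by_sex)

# Median salary per recruitment source, highest first
salary_by_source = (
    sample.groupby("RecruitmentSource")["Salary"]
    .median()
    .sort_values(ascending=False)
)
print(salary_by_source)
```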
