In [1]:
import pandas as pd
import numpy as np

# Human Resources / People Analytics Data
A project for data cleaning and data visualisation. See the disclaimer attributed to the original author of this project.

# Data cleaning

## Extraction Phase

There are five datasets in total, one of which contains survey responses. The other four are essential for the overall analysis, while the survey data will be used separately for engagement analysis. Together, they form the basis of the People Analytics work.

The four core datasets are:
- company data → `2021.06_COL_2021.txt`
- job_detail data → `2021.06_job_profile_mapping.txt`
- full data → `CompanyData.txt`
- demographic data → `Diversity.txt`

Plus the survey data → `EngagementSurvey.txt`

### Importing the data

In [2]:
company_data = pd.read_csv('data/2021.06_COL_2021.txt', sep='\t')
company_data.head()

Unnamed: 0,Office,COL Amount,Currency
0,NYC,100,USD
1,Boulder,70,USD
2,Oslo,70,NOK
3,SanJose,90,USD
4,London,90,GBP


In [3]:
job_data = pd.read_csv('data/2021.06_job_profile_mapping.txt', sep='\t')
job_data.head()

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15


In [4]:
full_data = pd.read_csv('data/CompanyData.txt', sep='\t', encoding='utf_16_le')
full_data.head()

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,Notes
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,JP_1000,Changes for 2021.06:
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,JP_1001,Changes for 2021.06:
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,JP_1022,Changes for 2021.06:
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,JP_1036,Changes for 2021.06:
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,JP_1015,Changes for 2021.06:


In [5]:
demographic_data = pd.read_csv('data/Diversity.txt', sep='\t')
demographic_data.head()

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation
0,100001,female,female,White,0,0,Undergraduate,Heterosexual
1,100002,male,male,White,0,1,Undergraduate,Heterosexual
2,100003,female,female,Asian,0,0,Undergraduate,Heterosexual
3,100004,male,male,White,0,0,Undergraduate,Heterosexual
4,100005,male,male,Hispanic or Latino,0,0,Undergraduate,Missing


In [7]:
survey_data = pd.read_csv('data/EngagementSurvey.txt', sep='\t')
survey_data.head()

Unnamed: 0,EmployeeID,Survey,I would recommend my friends or Family to work at TheCompany,I feel engaged in my work.,I believe Leadership cares about the employees at TheCompany,My manager supports me in my role at TheCompany,"TheCompany cares about Diversity, Equity and Inclusion.",I believe there is room for me to grow at TheCompany,I work on interesting projects.,My manager motivates me to work hard.,...,I believe TheCompany is in a great position in the market for the next few years to be succesful.,I plan on staying with TheCompany for at least 2 more years.,I believe I am fairly compensated for my work.,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,What does TheCompany do well?,What can TheCompany improve?
0,100001,2023Q2,3,3,2,1,1,3,2,4,...,3,4,2,2,4,4,3,3,,
1,100002,2023Q2,3,4,2,4,2,4,4,2,...,3,1,5,1,2,4,3,1,,
2,100009,2023Q2,3,4,2,3,4,3,2,2,...,1,1,3,2,2,3,2,4,,
3,100014,2023Q2,4,1,5,2,5,3,4,4,...,3,4,2,2,4,3,3,2,,
4,100018,2023Q2,4,1,1,1,2,3,1,3,...,4,3,2,4,3,2,2,3,,


# Transformation Phase

Let's first find out how many rows and columns each table has.

In [19]:
tables = {'company': company_data, 'job': job_data, 'full': full_data,
          'demographic': demographic_data, 'survey': survey_data}
for name, df in tables.items():
    print(f'The {name} data has {df.shape[1]} columns and {df.shape[0]} rows')

The company data has 3 columns and 9 rows
The job data has 6 columns and 54 rows
The full data has 25 columns and 4968 rows
The demographic data has 8 columns and 4968 rows
The survey data has 23 columns and 2827 rows


## Company Data

In [23]:
company_data

Unnamed: 0,Office,COL Amount,Currency
0,NYC,100,USD
1,Boulder,70,USD
2,Oslo,70,NOK
3,SanJose,90,USD
4,London,90,GBP
5,Tokyo,85,JPY
6,HongKong,85,HKD
7,SanFran,100,USD
8,Austin,70,USD


In [20]:
company_data.describe(include='all')

Unnamed: 0,Office,COL Amount,Currency
count,9,9.0,9
unique,9,,5
top,NYC,,USD
freq,1,,5
mean,,84.444444,
std,,12.104866,
min,,70.0,
25%,,70.0,
50%,,85.0,
75%,,90.0,


The company data is a fairly simple table: it stores the office location, the `COL Amount` (presumably a cost-of-living figure, given the file name), and the currency as a three-letter code. No null values were found, though we might need to come back here after examining the rest of the data.
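The "no null values" observation can be checked explicitly. The sketch below rebuilds the nine rows from the output above as a stand-alone frame (the name `company_sample` is only for illustration) and counts missing values per column. It also confirms that each office appears once, which matters if the table is later used as a lookup.

```python
import pandas as pd

# Rows copied from the company_data output shown above.
company_sample = pd.DataFrame({
    'Office': ['NYC', 'Boulder', 'Oslo', 'SanJose', 'London',
               'Tokyo', 'HongKong', 'SanFran', 'Austin'],
    'COL Amount': [100, 70, 70, 90, 90, 85, 85, 100, 70],
    'Currency': ['USD', 'USD', 'NOK', 'USD', 'GBP',
                 'JPY', 'HKD', 'USD', 'USD'],
})

# Count missing values per column; all zeros means no nulls.
null_counts = company_sample.isna().sum()
print(null_counts)

# Check whether any office name is repeated.
has_duplicate_offices = company_sample['Office'].duplicated().any()
print(has_duplicate_offices)
```

On the real data, the same two calls can be run directly on `company_data`.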

## Job data

In [25]:
job_data.head(10)

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15
5,Corporate,Coordinator,JP_1005,50000.0,Individual Contributor,0.15
6,Corporate,HR Coordinator,JP_1006,50000.0,Individual Contributor,0.15
7,Corporate,Counsel,JP_1007,220000.0,Individual Contributor,0.15
8,Corporate,Finance Coordinator,JP_1008,55000.0,Individual Contributor,0.15
9,Corporate,Accountant,JP_1009,85000.0,Individual Contributor,0.15


In [21]:
job_data.describe(include='all')

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
count,54,54,54,54.0,54,54.0
unique,5,54,54,31.0,8,
top,Corporate,CEO,JP_1000,85000.0,Individual Contributor,
freq,21,1,1,4.0,20,
mean,,,,,,0.227778
std,,,,,,0.15068
min,,,,,,0.1
25%,,,,,,0.15
50%,,,,,,0.15
75%,,,,,,0.2


### Converting `Compensation` to a float data type

In [42]:
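# The raw header is padded with spaces, so normalise the column name first.
job_data.rename(columns={' Compensation ': 'Compensation'}, inplace=True)
# Strip padding and commas from the strings, then convert them to numbers.
job_data['Compensation'] = job_data['Compensation'].str.strip().str.replace(',', '')
job_data['Compensation'] = pd.to_numeric(job_data['Compensation'])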

In [44]:
job_data.head()

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15


In [45]:
job_data.describe(include='all')

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
count,54,54,54,54.0,54,54.0
unique,5,54,54,,8,
top,Corporate,CEO,JP_1000,,Individual Contributor,
freq,21,1,1,,20,
mean,,,,164907.407407,,0.227778
std,,,,116584.614752,,0.15068
min,,,,50000.0,,0.1
25%,,,,76250.0,,0.15
50%,,,,115000.0,,0.15
75%,,,,215000.0,,0.2


In [46]:
job_data.isna().sum()

Department      0
Job_title       0
Job_Profile     0
Compensation    0
Level           0
Bonus %         0
dtype: int64

The job data has no null values. At this stage, the only change needed is converting the `Compensation` column to a numeric type.
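To make the conversion step easier to follow, here is a minimal stand-alone sketch. The raw values are hypothetical: the original file's exact formatting is not shown in this notebook, so padded strings with comma separators are an assumption based on the cleaning code above. Stripping all column names at once is a small variation on the single-column rename used earlier.

```python
import pandas as pd

# Hypothetical raw values; the real file's formatting is an assumption.
raw = pd.DataFrame({' Compensation ': [' 500,000.00 ', ' 100,000.00 ', ' 65,000.00 ']})

# Strip whitespace from every column name instead of renaming one by one.
raw.columns = raw.columns.str.strip()

# Remove padding and commas, then convert the strings to numbers.
cleaned = raw['Compensation'].str.strip().str.replace(',', '', regex=False)
raw['Compensation'] = pd.to_numeric(cleaned)

print(raw['Compensation'].dtype)
print(raw['Compensation'].tolist())
```

Note that the `.str` accessor only works while the column still holds strings, so re-running the conversion on an already numeric column would raise an error.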

## Full data

In [47]:
full_data.head(10)

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,Notes
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,JP_1000,Changes for 2021.06:
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,JP_1001,Changes for 2021.06:
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,JP_1022,Changes for 2021.06:
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,JP_1036,Changes for 2021.06:
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,JP_1015,Changes for 2021.06:
5,100006,Nelson,Grillo,3645 Coolidge Street,North Custer,MT,Montana,59024,US,United States,...,Sales,USD,0.15,Account Executive,6/21/1993,Individual Contributor,76000,1,JP_1029,Changes for 2021.06: Termed
6,100007,Kevin,Rainey,977 Black Oak Hollow Road,Santa Clara,CA,California,95054,US,United States,...,Customer Service,USD,0.15,Account Manager,5/13/1990,Individual Contributor,56000,0,JP_1016,Changes for 2021.06:
7,100008,Melanie,Hurst,2751 Holden Street,San Diego,CA,California,92103,US,United States,...,Sales,USD,0.15,Account Executive,1/23/1983,Individual Contributor,72000,0,JP_1029,Changes for 2021.06:
8,100009,Greg,Boon,4791 Loving Acres Road,Grapevine,TX,Texas,76051,US,United States,...,Sales,USD,0.2,"Director, Sales",1/4/1992,Director,74000,1,JP_1030,Changes for 2021.06:
9,100010,Frank,Stockdale,1413 Roy Alley,Centennial,CO,Colorado,80111,US,United States,...,Customer Service,USD,0.15,Account Manager,10/21/1989,Individual Contributor,52000,0,JP_1016,Changes for 2021.06:


In [51]:
full_data['Notes']

0                              Changes for 2021.06:  
1                              Changes for 2021.06:  
2                              Changes for 2021.06:  
3                              Changes for 2021.06:  
4                              Changes for 2021.06:  
                            ...                      
4963    Changes for 2021.06: Added on 2021.06, Termed
4964          Changes for 2021.06: Added on 2021.06, 
4965          Changes for 2021.06: Added on 2021.06, 
4966          Changes for 2021.06: Added on 2021.06, 
4967          Changes for 2021.06: Added on 2021.06, 
Name: Notes, Length: 4968, dtype: object

Columns hidden in the truncated output above:

In [57]:
full_data.head().iloc[:,5:]

Unnamed: 0,State,StateFull,ZipCode,Country,CountryFull,Age,Office,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,Notes
0,NY,New York,13212,US,United States,35,NYC,5/4/2009,12/12/2999,Corporate,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,JP_1000,Changes for 2021.06:
1,GA,Georgia,31206,US,United States,49,NYC,5/4/2009,12/12/2999,Corporate,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,JP_1001,Changes for 2021.06:
2,NJ,New Jersey,7087,US,United States,32,NYC,5/18/2009,6/5/2013,Corporate,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,JP_1022,Changes for 2021.06:
3,GA,Georgia,30165,US,United States,25,Boulder,6/22/2009,10/16/2013,Corporate,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,JP_1036,Changes for 2021.06:
4,CA,California,92705,US,United States,49,NYC,7/13/2009,1/10/2011,Corporate,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,JP_1015,Changes for 2021.06:


In [62]:
full_data.describe(include='all').iloc[:,:13]

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,Age,Office,Start_Date
count,4968.0,4968,4968,4846,4846,4446,4445,4968.0,4968,4968,4968.0,4968,4968
unique,,1455,2863,4843,1962,101,51,2837.0,5,5,,9,609
top,,James,Smith,4866 Fairfax Drive,New York,CA,California,0.0,US,United States,,NYC,1/12/2015
freq,,97,78,2,87,445,494,122.0,4446,4446,,1796,41
mean,102484.5,,,,,,,,,,44.187399,,
std,1434.282399,,,,,,,,,,12.368092,,
min,100001.0,,,,,,,,,,19.0,,
25%,101242.75,,,,,,,,,,34.0,,
50%,102484.5,,,,,,,,,,44.0,,
75%,103726.25,,,,,,,,,,54.0,,


In [63]:
full_data.describe(include='all').iloc[:,13:]

Unnamed: 0,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,Notes
count,4968,4968,4968,4968,4968.0,4968,4968,4968,4968.0,4968.0,4968,4968
unique,1797,2,5,5,,54,4268,8,,,54,4
top,12/12/2999,Corporate,Technology,USD,,Software Engineer,6/10/1983,Individual Contributor,,,JP_1038,Changes for 2021.06:
freq,2413,2972,1915,4446,,1019,5,2947,,,1019,3921
mean,,,,,0.160688,,,,150285.6,0.62661,,
std,,,,,0.035295,,,,616953.6,0.483753,,
min,,,,,0.1,,,,2000.0,0.0,,
25%,,,,,0.15,,,,58000.0,0.0,,
50%,,,,,0.15,,,,76000.0,1.0,,
75%,,,,,0.15,,,,94000.0,1.0,,


Notice that `Termination_Date` includes the value `12/12/2999`, which most likely indicates that the employee has not been terminated. For analysis purposes, this placeholder will be replaced with `NaN`.
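A minimal sketch of that replacement, using a few rows copied from the output above (the frame name `sample` and the `still_employed` helper are only for illustration). It masks the placeholder before parsing, so that rows without a real termination date end up as missing values. A side benefit: the year 2999 lies beyond what nanosecond-resolution timestamps can represent (roughly up to 2262), so parsing it directly may fail depending on the pandas version.

```python
import pandas as pd

# Sample rows copied from the Termination_Date column shown above.
sample = pd.DataFrame({
    'EmployeeID': [100001, 100002, 100003],
    'Termination_Date': ['12/12/2999', '12/12/2999', '6/5/2013'],
})

# Replace the placeholder with a missing value before parsing dates.
is_placeholder = sample['Termination_Date'] == '12/12/2999'
sample['Termination_Date'] = sample['Termination_Date'].mask(is_placeholder)

# Parse the remaining strings; missing values become NaT.
sample['Termination_Date'] = pd.to_datetime(
    sample['Termination_Date'], format='%m/%d/%Y'
)

# Rows with no termination date are treated as not terminated.
still_employed = sample['Termination_Date'].isna()
print(still_employed.sum())
```

The same steps should apply to `full_data['Termination_Date']`, assuming every date in that column follows the month/day/year format seen in the preview.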