In [1]:
import pandas as pd
import numpy as np
import warnings
from datetime import datetime, timedelta
warnings.filterwarnings('ignore')

# Human Resource / People analytics data.
A project for data cleaning and data visualisation. See the disclaimer attributed to the original author of this project.

# Data cleaning

## Extraction Phase

There are five datasets in total, one of which contains survey responses. The remaining four are essential for the overall analysis, while the survey data will be used separately for engagement analysis. Combined, they form the People Analytics dataset.

The four core datasets are:
- company data → `2021.06_COL_2021.txt`
- job_detail data → `2021.06_job_profile_mapping.txt`
- full data → `CompanyData.txt`
- demographic data → `Diversity.txt`

Plus the survey data → `EngagementSurvey.txt`
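
As a rough sketch of the extraction step, the five files listed above could be described in one mapping and read in a loop. The names `SOURCES` and `load_tables` are hypothetical and not part of the original notebook; the encodings mirror the reads used in this notebook (only `CompanyData.txt` is read as UTF-16 LE), and files that are missing are skipped rather than raising.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping: table name -> (file name, encoding).
SOURCES = {
    "company_data": ("2021.06_COL_2021.txt", "utf-8"),
    "job_data": ("2021.06_job_profile_mapping.txt", "utf-8"),
    "full_data": ("CompanyData.txt", "utf_16_le"),
    "demographic_data": ("Diversity.txt", "utf-8"),
    "survey_data": ("EngagementSurvey.txt", "utf-8"),
}


def load_tables(data_dir, sources=SOURCES):
    """Read every tab-separated file found in data_dir into a dict of DataFrames."""
    tables = {}
    for name, (filename, encoding) in sources.items():
        path = Path(data_dir) / filename
        if path.exists():  # skip files that are not present
            tables[name] = pd.read_csv(path, sep="\t", encoding=encoding)
    return tables
```

Calling `load_tables('data')` would return a dictionary keyed by table name, which keeps the file list and the read options in one place.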

In [2]:
def log(message):
    timestamp_format = '%Y-%b-%d-%H:%M:%S'  # %b = abbreviated month name; %h is platform-specific
    now = datetime.now()
    timestamp = now.strftime(timestamp_format)
    with open("logfile.txt",'a') as f:
        f.write(timestamp+', '+message+'\n')
    print(message)

### Importing the data

In [3]:
import time

In [4]:
log('Extracting Data ...')
start_time = time.time()

Extracting Data ...


In [5]:
company_data = pd.read_csv('data/2021.06_COL_2021.txt', sep='\t')
company_data.head()

Unnamed: 0,Office,COL Amount,Currency
0,NYC,100,USD
1,Boulder,70,USD
2,Oslo,70,NOK
3,SanJose,90,USD
4,London,90,GBP


In [6]:
job_data = pd.read_csv('data/2021.06_job_profile_mapping.txt', sep='\t')
job_data.head()

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15


In [7]:
full_data = pd.read_csv('data/CompanyData.txt', sep='\t', encoding='utf_16_le')
full_data.head()

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,Notes
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,JP_1000,Changes for 2021.06:
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,JP_1001,Changes for 2021.06:
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,JP_1022,Changes for 2021.06:
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,JP_1036,Changes for 2021.06:
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,JP_1015,Changes for 2021.06:


In [8]:
demographic_data = pd.read_csv('data/Diversity.txt', sep='\t')
demographic_data.head()

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation
0,100001,female,female,White,0,0,Undergraduate,Heterosexual
1,100002,male,male,White,0,1,Undergraduate,Heterosexual
2,100003,female,female,Asian,0,0,Undergraduate,Heterosexual
3,100004,male,male,White,0,0,Undergraduate,Heterosexual
4,100005,male,male,Hispanic or Latino,0,0,Undergraduate,Missing


In [9]:
survey_data = pd.read_csv('data/EngagementSurvey.txt', sep='\t')
survey_data.head()

Unnamed: 0,EmployeeID,Survey,I would recommend my friends or Family to work at TheCompany,I feel engaged in my work.,I believe Leadership cares about the employees at TheCompany,My manager supports me in my role at TheCompany,"TheCompany cares about Diversity, Equity and Inclusion.",I believe there is room for me to grow at TheCompany,I work on interesting projects.,My manager motivates me to work hard.,...,I believe TheCompany is in a great position in the market for the next few years to be succesful.,I plan on staying with TheCompany for at least 2 more years.,I believe I am fairly compensated for my work.,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,What does TheCompany do well?,What can TheCompany improve?
0,100001,2023Q2,3,3,2,1,1,3,2,4,...,3,4,2,2,4,4,3,3,,
1,100002,2023Q2,3,4,2,4,2,4,4,2,...,3,1,5,1,2,4,3,1,,
2,100009,2023Q2,3,4,2,3,4,3,2,2,...,1,1,3,2,2,3,2,4,,
3,100014,2023Q2,4,1,5,2,5,3,4,4,...,3,4,2,2,4,3,3,2,,
4,100018,2023Q2,4,1,1,1,2,3,1,3,...,4,3,2,4,3,2,2,3,,


In [10]:
log('Extraction done in --- %s seconds ---' % (time.time()-start_time))

Extraction done in --- 0.2184157371520996 seconds ---


# Transformation Phase

In [11]:
log('Transforming Data ...')
start_time = time.time()

Transforming Data ...


Let's first find out how many rows and columns each table has.

In [12]:
print('The company data has {0} columns and {1} rows'.format(company_data.shape[1],company_data.shape[0]))
print('The job data has {0} columns and {1} rows'.format(job_data.shape[1],job_data.shape[0]))
print('The full data has {0} columns and {1} rows'.format(full_data.shape[1],full_data.shape[0]))
print('The demographic data has {0} columns and {1} rows'.format(demographic_data.shape[1],demographic_data.shape[0]))
print('The survey data has {0} columns and {1} rows'.format(survey_data.shape[1],survey_data.shape[0]))

The company data has 3 columns and 9 rows
The job data has 6 columns and 54 rows
The full data has 25 columns and 4968 rows
The demographic data has 8 columns and 4968 rows
The survey data has 23 columns and 2827 rows


## Company Data

In [13]:
company_data

Unnamed: 0,Office,COL Amount,Currency
0,NYC,100,USD
1,Boulder,70,USD
2,Oslo,70,NOK
3,SanJose,90,USD
4,London,90,GBP
5,Tokyo,85,JPY
6,HongKong,85,HKD
7,SanFran,100,USD
8,Austin,70,USD


We are going to remove `COL Amount` because it is not relevant to this analysis.

In [14]:
company_data.drop('COL Amount', axis=1, inplace=True)

This table only stores office details. `COL Amount` is not a meaningful feature for this analysis, so I decided to remove it for now.

## Job data

In [15]:
job_data.head(10)

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15
5,Corporate,Coordinator,JP_1005,50000.0,Individual Contributor,0.15
6,Corporate,HR Coordinator,JP_1006,50000.0,Individual Contributor,0.15
7,Corporate,Counsel,JP_1007,220000.0,Individual Contributor,0.15
8,Corporate,Finance Coordinator,JP_1008,55000.0,Individual Contributor,0.15
9,Corporate,Accountant,JP_1009,85000.0,Individual Contributor,0.15


In [16]:
job_data.describe(include='all')

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
count,54,54,54,54.0,54,54.0
unique,5,54,54,31.0,8,
top,Corporate,CEO,JP_1000,85000.0,Individual Contributor,
freq,21,1,1,4.0,20,
mean,,,,,,0.227778
std,,,,,,0.15068
min,,,,,,0.1
25%,,,,,,0.15
50%,,,,,,0.15
75%,,,,,,0.2


### Converting `Compensation` into float data type

In [17]:
job_data.rename(columns={' Compensation ':'Compensation'}, inplace=True)
job_data['Compensation']=job_data['Compensation'].str.strip().str.replace(',','')
job_data['Compensation']=pd.to_numeric(job_data['Compensation'])
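
To illustrate the conversion above, here is a minimal sketch on invented values that look like text amounts with padding and thousands separators. The sample values and the variable names `raw`, `cleaned`, and `bad_rows` are assumptions for illustration only. Passing `errors="coerce"` is an optional safeguard that turns unparseable entries into `NaN` so they can be inspected instead of raising.

```python
import pandas as pd

# Toy values, invented for illustration: padded text with thousands separators.
raw = pd.Series([" 500,000.00 ", " 65,000.00 ", "n/a"], name="Compensation")

# Strip the padding and remove the separators before parsing.
cleaned = raw.str.strip().str.replace(",", "", regex=False)

# errors="coerce" yields NaN for entries that cannot be parsed as numbers.
compensation = pd.to_numeric(cleaned, errors="coerce")

# Original text of the rows that failed to parse, kept for review.
bad_rows = raw[compensation.isna()]
```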

In [18]:
job_data.head()

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
0,Corporate,CEO,JP_1000,500000.0,CSuite,1.0
1,Corporate,HR Manager,JP_1001,100000.0,Manager,0.2
2,Corporate,AR Specialist,JP_1002,65000.0,Individual Contributor,0.15
3,Corporate,AP Specialist,JP_1003,65000.0,Individual Contributor,0.15
4,Corporate,FP&A Analyst,JP_1004,70000.0,Individual Contributor,0.15


In [19]:
job_data.describe(include='all')

Unnamed: 0,Department,Job_title,Job_Profile,Compensation,Level,Bonus %
count,54,54,54,54.0,54,54.0
unique,5,54,54,,8,
top,Corporate,CEO,JP_1000,,Individual Contributor,
freq,21,1,1,,20,
mean,,,,164907.407407,,0.227778
std,,,,116584.614752,,0.15068
min,,,,50000.0,,0.1
25%,,,,76250.0,,0.15
50%,,,,115000.0,,0.15
75%,,,,215000.0,,0.2


In [20]:
job_data.isna().sum()

Department      0
Job_title       0
Job_Profile     0
Compensation    0
Level           0
Bonus %         0
dtype: int64

The job details data has no null values. At this stage, the only change needed is converting the `Compensation` data type.

Change note: adding a validation step discovered during normalization.

Checking whether the job profiles in full_data match the job profiles in job_data.

In [21]:
job_data.rename(columns={'Compensation':'Salary', 'Bonus %':'Bonus_pct', 'Level':'level'}, inplace=True)
col_to_drop = list(job_data.columns)
col_to_drop

['Department', 'Job_title', 'Job_Profile', 'Salary', 'level', 'Bonus_pct']

In [22]:
job_copy = job_data.copy()

In [23]:
job_data.equals(job_copy)

True

In [24]:
job_copy=job_copy[['Job_Profile']]
job_copy.head()

Unnamed: 0,Job_Profile
0,JP_1000
1,JP_1001
2,JP_1002
3,JP_1003
4,JP_1004


In [25]:
col_to_drop.remove('Salary')
job_copy = job_copy.merge(full_data[col_to_drop], on='Job_Profile', how='left').drop_duplicates()
print(job_copy.shape)
job_copy.reset_index(drop=True,inplace=True)
job_copy

(60, 5)


Unnamed: 0,Job_Profile,Department,Job_title,level,Bonus_pct
0,JP_1000,Corporate,CEO,CSuite,1.0
1,JP_1001,Corporate,HR Manager,Manager,0.2
2,JP_1001,Corporate,HR Manager,Individual Contributor,0.2
3,JP_1002,Corporate,AR Specialist,Individual Contributor,0.15
4,JP_1003,Corporate,AP Specialist,Individual Contributor,0.15
5,JP_1004,Corporate,FP&A Analyst,Individual Contributor,0.15
6,JP_1005,Corporate,Coordinator,Individual Contributor,0.15
7,JP_1006,Corporate,HR Coordinator,Individual Contributor,0.15
8,JP_1007,Corporate,Counsel,Individual Contributor,0.15
9,JP_1008,Corporate,Finance Coordinator,Individual Contributor,0.15


Clearly, some job profiles are duplicated: in full_data the same `Job_Profile` id can map to more than one combination of department, job title, level, and bonus. `Salary` was omitted here because it does not depend on job profile; had we included `Salary`, there would be **750** unique job profiles. We are going to give each distinct combination found above its own job profile id. This does not remove any original job profile; it only adds more. Originally we had 54 job profiles, and now we will add 6 more.
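
The idea can be sketched on a toy table. The rows below are invented for illustration, and the variable names (`combos`, `ambiguous`, `new_profiles`) are hypothetical; the sketch only shows one way to find profile ids that map to several attribute combinations and to renumber the combinations.

```python
import pandas as pd

# Toy employee rows, invented for illustration.
employees = pd.DataFrame({
    "Job_Profile": ["JP_1000", "JP_1001", "JP_1001", "JP_1002"],
    "Job_title": ["CEO", "HR Manager", "HR Manager", "AR Specialist"],
    "level": ["CSuite", "Manager", "Individual Contributor", "Individual Contributor"],
})

# Distinct (profile id, attributes) combinations.
combos = employees.drop_duplicates()

# Profile ids that point at more than one attribute combination.
per_profile = combos.groupby("Job_Profile").size()
ambiguous = per_profile[per_profile > 1].index.tolist()

# Give every distinct attribute combination its own sequential id.
new_profiles = (
    combos.drop(columns="Job_Profile").drop_duplicates().reset_index(drop=True)
)
new_profiles["Job_Profile"] = [f"JP_{1000 + i}" for i in range(len(new_profiles))]
```

Note that renumbering sequentially shifts the ids of later profiles, which is also what happens in this notebook.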

Creating new job profiles

In [26]:
job_copy.drop('Job_Profile',axis=1, inplace=True)
job_copy['Job_Profile']=['JP_'+str(i) for i in range(1000,1000+job_copy.shape[0])]
job_copy

Unnamed: 0,Department,Job_title,level,Bonus_pct,Job_Profile
0,Corporate,CEO,CSuite,1.0,JP_1000
1,Corporate,HR Manager,Manager,0.2,JP_1001
2,Corporate,HR Manager,Individual Contributor,0.2,JP_1002
3,Corporate,AR Specialist,Individual Contributor,0.15,JP_1003
4,Corporate,AP Specialist,Individual Contributor,0.15,JP_1004
5,Corporate,FP&A Analyst,Individual Contributor,0.15,JP_1005
6,Corporate,Coordinator,Individual Contributor,0.15,JP_1006
7,Corporate,HR Coordinator,Individual Contributor,0.15,JP_1007
8,Corporate,Counsel,Individual Contributor,0.15,JP_1008
9,Corporate,Finance Coordinator,Individual Contributor,0.15,JP_1009


In [27]:
job_data[job_data.duplicated(subset='Job_Profile')] # Check if there are still duplicates

Unnamed: 0,Department,Job_title,Job_Profile,Salary,level,Bonus_pct


In [28]:
full_data.drop('Job_Profile', axis=1, inplace=True)

In [29]:
col_to_drop.remove('Job_Profile')

In [30]:
full_data=full_data.merge(job_copy, on=col_to_drop, how='left')
full_data.head(10)

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,Changes for 2021.06:,JP_1000
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,Changes for 2021.06:,JP_1001
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,Changes for 2021.06:,JP_1027
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,Changes for 2021.06:,JP_1041
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,Changes for 2021.06:,JP_1018
5,100006,Nelson,Grillo,3645 Coolidge Street,North Custer,MT,Montana,59024,US,United States,...,Sales,USD,0.15,Account Executive,6/21/1993,Individual Contributor,76000,1,Changes for 2021.06: Termed,JP_1034
6,100007,Kevin,Rainey,977 Black Oak Hollow Road,Santa Clara,CA,California,95054,US,United States,...,Customer Service,USD,0.15,Account Manager,5/13/1990,Individual Contributor,56000,0,Changes for 2021.06:,JP_1020
7,100008,Melanie,Hurst,2751 Holden Street,San Diego,CA,California,92103,US,United States,...,Sales,USD,0.15,Account Executive,1/23/1983,Individual Contributor,72000,0,Changes for 2021.06:,JP_1034
8,100009,Greg,Boon,4791 Loving Acres Road,Grapevine,TX,Texas,76051,US,United States,...,Sales,USD,0.2,"Director, Sales",1/4/1992,Director,74000,1,Changes for 2021.06:,JP_1035
9,100010,Frank,Stockdale,1413 Roy Alley,Centennial,CO,Colorado,80111,US,United States,...,Customer Service,USD,0.15,Account Manager,10/21/1989,Individual Contributor,52000,0,Changes for 2021.06:,JP_1020


In [31]:
full_data.columns

Index(['EmployeeID', 'First_Name', 'Surname', 'StreetAddress', 'City', 'State',
       'StateFull', 'ZipCode', 'Country', 'CountryFull', 'Age', 'Office',
       'Start_Date', 'Termination_Date', 'Office_Type', 'Department',
       'Currency', 'Bonus_pct', 'Job_title', 'DOB', 'level', 'Salary',
       'Active Status', 'Notes', 'Job_Profile'],
      dtype='object')

Validated the job profiles against job_data, created new job profiles, and updated them in full_data.
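
One way to double-check a merge like this is to let pandas verify the key relationship and report unmatched rows. The sketch below uses invented toy frames; `validate` and `indicator` are standard `DataFrame.merge` parameters, while the frame and column values are assumptions for illustration.

```python
import pandas as pd

# Toy lookup table: one row per attribute combination (invented values).
profiles = pd.DataFrame({
    "Job_title": ["CEO", "HR Manager", "HR Manager"],
    "level": ["CSuite", "Manager", "Individual Contributor"],
    "Job_Profile": ["JP_1000", "JP_1001", "JP_1002"],
})

# Toy employee table (invented values).
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Job_title": ["CEO", "HR Manager", "HR Manager"],
    "level": ["CSuite", "Individual Contributor", "Manager"],
})

# validate raises MergeError if the lookup keys are not unique;
# indicator adds a _merge column showing where each row came from.
merged = employees.merge(
    profiles,
    on=["Job_title", "level"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Employees that found no matching profile (should be empty).
unmatched = merged[merged["_merge"] == "left_only"]
```

If `unmatched` is empty and the row count is unchanged, every employee received exactly one job profile.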

## Full data

In [32]:
full_data.head(10)

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,Changes for 2021.06:,JP_1000
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,Changes for 2021.06:,JP_1001
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,Changes for 2021.06:,JP_1027
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,Changes for 2021.06:,JP_1041
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,Changes for 2021.06:,JP_1018
5,100006,Nelson,Grillo,3645 Coolidge Street,North Custer,MT,Montana,59024,US,United States,...,Sales,USD,0.15,Account Executive,6/21/1993,Individual Contributor,76000,1,Changes for 2021.06: Termed,JP_1034
6,100007,Kevin,Rainey,977 Black Oak Hollow Road,Santa Clara,CA,California,95054,US,United States,...,Customer Service,USD,0.15,Account Manager,5/13/1990,Individual Contributor,56000,0,Changes for 2021.06:,JP_1020
7,100008,Melanie,Hurst,2751 Holden Street,San Diego,CA,California,92103,US,United States,...,Sales,USD,0.15,Account Executive,1/23/1983,Individual Contributor,72000,0,Changes for 2021.06:,JP_1034
8,100009,Greg,Boon,4791 Loving Acres Road,Grapevine,TX,Texas,76051,US,United States,...,Sales,USD,0.2,"Director, Sales",1/4/1992,Director,74000,1,Changes for 2021.06:,JP_1035
9,100010,Frank,Stockdale,1413 Roy Alley,Centennial,CO,Colorado,80111,US,United States,...,Customer Service,USD,0.15,Account Manager,10/21/1989,Individual Contributor,52000,0,Changes for 2021.06:,JP_1020


In [33]:
full_data['Notes']

0                              Changes for 2021.06:  
1                              Changes for 2021.06:  
2                              Changes for 2021.06:  
3                              Changes for 2021.06:  
4                              Changes for 2021.06:  
                            ...                      
4963    Changes for 2021.06: Added on 2021.06, Termed
4964          Changes for 2021.06: Added on 2021.06, 
4965          Changes for 2021.06: Added on 2021.06, 
4966          Changes for 2021.06: Added on 2021.06, 
4967          Changes for 2021.06: Added on 2021.06, 
Name: Notes, Length: 4968, dtype: object

Columns hidden in the truncated view above:

In [34]:
full_data.head().iloc[:,5:]

Unnamed: 0,State,StateFull,ZipCode,Country,CountryFull,Age,Office,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
0,NY,New York,13212,US,United States,35,NYC,5/4/2009,12/12/2999,Corporate,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,Changes for 2021.06:,JP_1000
1,GA,Georgia,31206,US,United States,49,NYC,5/4/2009,12/12/2999,Corporate,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,Changes for 2021.06:,JP_1001
2,NJ,New Jersey,7087,US,United States,32,NYC,5/18/2009,6/5/2013,Corporate,Marketing,USD,0.15,Graphic Designer,1/25/1989,Individual Contributor,77000,0,Changes for 2021.06:,JP_1027
3,GA,Georgia,30165,US,United States,25,Boulder,6/22/2009,10/16/2013,Corporate,Technology,USD,0.5,CTO,5/1/1996,CSuite,400000,0,Changes for 2021.06:,JP_1041
4,CA,California,92705,US,United States,49,NYC,7/13/2009,1/10/2011,Corporate,Customer Service,USD,0.15,Associate Account Manager,5/5/1972,Manager,51000,0,Changes for 2021.06:,JP_1018


In [35]:
full_data.describe(include='all').iloc[:,:13]

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,Age,Office,Start_Date
count,4968.0,4968,4968,4846,4846,4446,4445,4968.0,4968,4968,4968.0,4968,4968
unique,,1455,2863,4843,1962,101,51,2837.0,5,5,,9,609
top,,James,Smith,4866 Fairfax Drive,New York,CA,California,0.0,US,United States,,NYC,1/12/2015
freq,,97,78,2,87,445,494,122.0,4446,4446,,1796,41
mean,102484.5,,,,,,,,,,44.187399,,
std,1434.282399,,,,,,,,,,12.368092,,
min,100001.0,,,,,,,,,,19.0,,
25%,101242.75,,,,,,,,,,34.0,,
50%,102484.5,,,,,,,,,,44.0,,
75%,103726.25,,,,,,,,,,54.0,,


In [36]:
full_data.describe(include='all').iloc[:,13:]

Unnamed: 0,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
count,4968,4968,4968,4968,4968.0,4968,4968,4968,4968.0,4968.0,4968,4968
unique,1797,2,5,5,,54,4268,8,,,4,60
top,12/12/2999,Corporate,Technology,USD,,Software Engineer,6/10/1983,Individual Contributor,,,Changes for 2021.06:,JP_1043
freq,2413,2972,1915,4446,,1019,5,2947,,,3921,1016
mean,,,,,0.160688,,,,150285.6,0.62661,,
std,,,,,0.035295,,,,616953.6,0.483753,,
min,,,,,0.1,,,,2000.0,0.0,,
25%,,,,,0.15,,,,58000.0,0.0,,
50%,,,,,0.15,,,,76000.0,1.0,,
75%,,,,,0.15,,,,94000.0,1.0,,


### Data Validation

We are going to check whether any values violate data integrity. First, we will check the employees' ages.  
I will start with a general check of the age distribution. It appears in the descriptive statistics above and, in more detail, below.

In [37]:
full_data.describe(include='all').loc[:,'Age']

count     4968.000000
unique            NaN
top               NaN
freq              NaN
mean        44.187399
std         12.368092
min         19.000000
25%         34.000000
50%         44.000000
75%         54.000000
max         90.000000
Name: Age, dtype: float64

Check complete. The minimum age is 19 and the maximum age is 90. This looks plausible, except perhaps age 90, which could belong to the CEO or someone in a senior position.  
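
A range check like this can also be made explicit so that unusual ages are listed rather than read off the summary. The bounds `MIN_AGE` and `MAX_AGE` below are assumptions chosen for illustration, not rules from the source data, and the toy rows are invented.

```python
import pandas as pd

# Toy rows, invented for illustration.
people = pd.DataFrame({"EmployeeID": [1, 2, 3, 4], "Age": [19, 44, 90, 15]})

# Assumed plausibility bounds; adjust to the company's actual policy.
MIN_AGE, MAX_AGE = 18, 75

# between() is inclusive on both ends by default.
out_of_range = people[~people["Age"].between(MIN_AGE, MAX_AGE)]
```

Rows in `out_of_range` are candidates for manual review, not automatically errors.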

Next, we check the termination date. A simple check is that the value should not be later than today's date. A future date can still be legitimate, however, because an employee might have given notice or plan to leave in the near future (depending on company rules).

Notice that `Termination_Date` includes 12/12/2999, which indicates that the employee has not been terminated. For example:

In [38]:
full_data.loc[full_data['Termination_Date']=='12/12/2999'].head()

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,Corporate,USD,1.0,CEO,1/5/1986,CSuite,500000,1,Changes for 2021.06:,JP_1000
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,Corporate,USD,0.2,HR Manager,7/13/1971,Manager,70000,1,Changes for 2021.06:,JP_1001
8,100009,Greg,Boon,4791 Loving Acres Road,Grapevine,TX,Texas,76051,US,United States,...,Sales,USD,0.2,"Director, Sales",1/4/1992,Director,74000,1,Changes for 2021.06:,JP_1035
13,100014,Jacqueline,Hernandez,1676 Hidden Meadow Drive,Crystal,ND,North Dakota,58222,US,United States,...,Corporate,USD,0.15,AP Specialist,10/29/1997,Individual Contributor,67000,1,Changes for 2021.06:,JP_1004
17,100018,Virginia,Ferguson,3024 Traders Alley,Maysville,MO,Missouri,64469,US,United States,...,Customer Service,USD,0.3,"SVP, Customer Service",6/20/1993,SVP,198000,1,Changes for 2021.06:,JP_1021


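For analysis purposes, a missing value is more meaningful than this placeholder. Because 12/12/2999 lies outside the range that `datetime64[ns]` can represent, parsing with `errors='coerce'` turns it into `NaT`.  
We can then check for termination dates that are later than today's date.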

In [39]:
full_data['Start_Date']=pd.to_datetime(full_data['Start_Date'])
full_data['Termination_Date']=pd.to_datetime(full_data['Termination_Date'], errors='coerce')
full_data.dtypes

EmployeeID                   int64
First_Name                  object
Surname                     object
StreetAddress               object
City                        object
State                       object
StateFull                   object
ZipCode                     object
Country                     object
CountryFull                 object
Age                          int64
Office                      object
Start_Date          datetime64[ns]
Termination_Date    datetime64[ns]
Office_Type                 object
Department                  object
Currency                    object
Bonus_pct                  float64
Job_title                   object
DOB                         object
level                       object
Salary                       int64
Active Status                int64
Notes                       object
Job_Profile                 object
dtype: object
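
As an alternative sketch, the placeholder can be blanked out explicitly before parsing, so the result does not depend on the date being out of range for the datetime dtype. The constant name `SENTINEL` and the toy values are assumptions for illustration; the explicit `format` assumes month/day/year strings like the ones shown in this notebook.

```python
import pandas as pd

SENTINEL = "12/12/2999"  # placeholder meaning "not terminated"

# Toy values, invented for illustration.
raw = pd.Series(["6/5/2013", SENTINEL, "1/10/2011"], name="Termination_Date")

# mask() replaces the placeholder with NaN; to_datetime then yields NaT for it.
termination = pd.to_datetime(raw.mask(raw == SENTINEL), format="%m/%d/%Y")

# True where the employee has no termination date.
still_employed = termination.isna()
```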

In [40]:
from datetime import datetime,timedelta

full_data.loc[full_data['Termination_Date']>datetime.today(),'Start_Date':]

Unnamed: 0,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile
189,2010-12-27,2023-10-28,Technology,Technology,USD,0.30,"VP, Technology",11/17/1987,VP,98000,1,Changes for 2021.06: Termed,JP_1047
240,2011-02-28,2023-07-01,Technology,Technology,USD,0.15,Software Engineer,8/2/1957,Individual Contributor,97000,1,Changes for 2021.06: Termed,JP_1043
257,2011-03-14,2023-08-24,Technology,Technology,USD,0.15,Senior Software Engineer,5/13/1969,Senior,99000,1,Changes for 2021.06: Termed,JP_1046
351,2011-08-22,2023-09-22,Technology,Technology,USD,0.15,Software Engineer,1/26/1961,Individual Contributor,99000,1,Changes for 2021.06: Termed,JP_1043
352,2011-08-22,2023-12-14,Technology,Technology,USD,0.15,Software Engineer,5/11/1985,Individual Contributor,94000,1,Changes for 2021.06: Termed,JP_1043
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4828,2021-03-01,2023-10-07,Technology,Technology,USD,0.15,Software Engineer,8/23/1974,Individual Contributor,77000,1,"Changes for 2021.06: Added on 2021.06, Termed",JP_1043
4886,2021-09-27,2023-10-24,Technology,Customer Service,USD,0.20,"Manager, Customer Service",5/19/1981,Manager,76500,1,"Changes for 2021.06: Added on 2021.06, Termed",JP_1023
4909,2021-05-17,2023-07-07,Corporate,Corporate,USD,0.15,HR Analyst,11/22/1958,Individual Contributor,65000,1,"Changes for 2021.06: Added on 2021.06, Termed",JP_1017
4951,2021-09-13,2023-10-22,Technology,Customer Service,USD,0.15,CS Operations Specialist,12/5/1958,Individual Contributor,80000,1,"Changes for 2021.06: Added on 2021.06, Termed",JP_1025


It turns out that many rows **(124 rows)** have a termination date that fails this check. This may or may not happen in a real-world scenario, depending on company rules: the company might allow, for example, a 1-month notice period, but anything longer than that is a little unrealistic. So, for the purposes of this project, we are going to set these termination dates to today's date.

While we are at it, we should also check whether any rows have a start date after the termination date.

In [41]:
full_data.loc[full_data['Termination_Date']<full_data['Start_Date'],'Start_Date':]

Unnamed: 0,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Notes,Job_Profile


Check complete.

In [42]:
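# Converting future termination dates to today's date.
# A single .loc call on the DataFrame avoids chained assignment, which may not write back.
today = datetime.today()
full_data.loc[full_data['Termination_Date']>today, 'Termination_Date']=today
full_data.iloc[189,12:]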

Start_Date                   2010-12-27 00:00:00
Termination_Date      2023-06-18 19:12:34.307585
Office_Type                           Technology
Department                            Technology
Currency                                     USD
Bonus_pct                                    0.3
Job_title                         VP, Technology
DOB                                   11/17/1987
level                                         VP
Salary                                     98000
Active Status                                  1
Notes               Changes for 2021.06:  Termed
Job_Profile                              JP_1047
Name: 189, dtype: object
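
The capping step can be sketched on a toy series. The reference date is fixed here only to keep the example reproducible, which is an assumption; the notebook itself uses the current date and time. `Series.mask` is used because a comparison against `NaT` is `False`, so missing termination dates are left untouched.

```python
import pandas as pd

# Fixed reference date for reproducibility (assumption for this sketch).
today = pd.Timestamp("2023-06-18")

# Toy values, invented for illustration: future, past, and missing.
termination = pd.Series(pd.to_datetime(["2023-10-28", "2013-06-05", None]))

# Replace only the dates later than the reference date.
is_future = termination > today
capped = termination.mask(is_future, today)

# Number of rows that were capped.
n_capped = int(is_future.sum())
```

Using a date without a time component also avoids the microsecond timestamps that appear in the output above.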

Dropping `Notes` column.

In [43]:
full_data.drop('Notes', axis=1, inplace=True)

Adding `start_year` and `termination_year` columns

In [44]:
full_data['start_year']=full_data['Start_Date'].dt.strftime('%Y')
full_data['termination_year']=full_data['Termination_Date'].dt.strftime('%Y')
full_data.tail().iloc[:,12:]

Unnamed: 0,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year
4963,2021-03-08,2022-11-27,Corporate,Marketing,USD,0.5,Chief Marketing Officer,11/27/1950,CSuite,370000,1,JP_1051,2021,2022.0
4964,2021-01-25,NaT,Technology,Technology,USD,0.15,Software Engineer,7/5/1975,Individual Contributor,77000,1,JP_1043,2021,
4965,2021-08-23,NaT,Corporate,Sales,USD,0.15,Sales Team Lead,4/3/1982,Senior,85500,1,JP_1039,2021,
4966,2021-02-01,NaT,Technology,Technology,USD,0.15,Senior Software Engineer,6/18/2001,Senior,98000,1,JP_1046,2021,
4967,2021-09-06,NaT,Corporate,Corporate,USD,0.5,Chief Financial Officer,2/6/1970,CSuite,390000,1,JP_1054,2021,


In [45]:
full_data['termination_year']=full_data['termination_year'].fillna(0) # Fill null values in termination year with 0
full_data = full_data.astype({'start_year':int,'termination_year':int})
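
As a possible alternative to using 0 as a stand-in for "no termination year", pandas' nullable `Int64` dtype can hold whole-number years alongside missing values. This is only a sketch on invented values; if it were adopted, later checks such as `termination_year > 0` would need to become `termination_year.notna()`.

```python
import pandas as pd

# Toy values, invented for illustration: one terminated, one still employed.
termination = pd.Series(pd.to_datetime(["2022-11-27", None]))

# dt.year gives floats with NaN for NaT; Int64 keeps integers plus <NA>.
termination_year = termination.dt.year.astype("Int64")

# True where a termination year exists.
has_terminated = termination_year.notna()
```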

Adding `tenure_months` and `tenure_years`, computed differently depending on whether the employee is still active

Before adding them, we check that the active status is consistent with the termination date, i.e. whether any employee marked as active has already been terminated.

In [46]:
full_data.loc[(full_data['Active Status']==1) & (full_data['termination_year']>0)]

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year
5,100006,Nelson,Grillo,3645 Coolidge Street,North Custer,MT,Montana,59024,US,United States,...,USD,0.15,Account Executive,6/21/1993,Individual Contributor,76000,1,JP_1034,2009,2021
32,100033,Fannie,Smith,1829 Burnside Court,Phoenix,AZ,Arizona,85003,US,United States,...,USD,0.15,HR Coordinator,1/16/1976,Individual Contributor,67000,1,JP_1007,2009,2022
35,100036,Joellen,Deleon,4719 Edgewood Avenue,Fresno,CA,California,93704,US,United States,...,USD,0.15,Senior Account Manager,6/19/1986,Senior,59000,1,JP_1022,2010,2021
62,100063,Marius,Andersen,Bnntjernveien 186,HNEFOSS,,,03514,NO,Norway,...,NOK,0.15,Senior Software Engineer,3/1/1993,Senior,534000,1,JP_1046,2010,2022
81,100082,Danielle,Bowler,1399 Goosetown Drive,Saluda,NC,North Carolina,28773,US,United States,...,USD,0.15,Account Manager,4/7/1981,Individual Contributor,59000,1,JP_1020,2010,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4921,104922,William,Connelly,3138 Godfrey Road,New York,New York,New York,10023,US,United States,...,USD,0.20,"Director, Engineering",6/7/1956,Director,140000,1,JP_1045,2021,2022
4951,104952,Charles,Arant,3440 Reynolds Alley,Cypress,California,California,90630,US,United States,...,USD,0.15,CS Operations Specialist,12/5/1958,Individual Contributor,80000,1,JP_1025,2021,2023
4954,104955,Annette,Miller,506 Hill Haven Drive,Killeen,Texas,Texas,76541,US,United States,...,USD,0.15,CS Operations Specialist,8/15/1961,Individual Contributor,80000,1,JP_1025,2021,2023
4955,104956,Walter,Gittens,4508 Arrowood Drive,Jacksonville,Florida,Florida,32202,US,United States,...,USD,0.15,Software Engineer,4/19/1957,Individual Contributor,77000,1,JP_1043,2021,2023


There are 700 rows where a terminated employee is still marked as active. We are going to set the active status of these rows to false (0).

In [47]:
# Assign through a single .loc call so the change is written back to full_data.
full_data.loc[(full_data['Active Status']==1) & (full_data['termination_year']>0), 'Active Status']=0
full_data.loc[(full_data['Active Status']==1) & (full_data['termination_year']>0)]

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year
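
The same reconciliation can be sketched on a toy frame, counting the inconsistent rows before fixing them. The rows are invented and the names `inconsistent` and `n_fixed` are hypothetical; the sketch tests for a present termination date directly instead of going through `termination_year`.

```python
import pandas as pd

# Toy rows, invented for illustration.
staff = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Active Status": [1, 1, 0],
    "Termination_Date": pd.to_datetime(["2022-11-27", None, "2013-06-05"]),
})

# Marked active although a termination date is recorded.
inconsistent = (staff["Active Status"] == 1) & staff["Termination_Date"].notna()

# Keep the count for the log before overwriting anything.
n_fixed = int(inconsistent.sum())

# Boolean mask plus column label in one .loc call writes back reliably.
staff.loc[inconsistent, "Active Status"] = 0
```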


Conversion complete; now adding the columns.

First, compute the difference in days (this column will be dropped at the end) to make the month and year calculations easier.

In [48]:
# Tenure ends at the termination date for inactive employees and at today's date for active ones.
today = datetime.today()
is_active = full_data['Active Status']==1
full_data['diff_in_days']=(full_data['Termination_Date']-full_data['Start_Date']).where(~is_active, today-full_data['Start_Date'])

full_data['tenure_months']=full_data['diff_in_days']/timedelta(days=30)
full_data['tenure_years']=full_data['diff_in_days']/timedelta(days=365)
full_data.tail(15).loc[:,'Start_Date':]

Unnamed: 0,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year,diff_in_days,tenure_months,tenure_years
4953,2021-02-08,NaT,Technology,Marketing,USD,0.15,Programmatic Marketing Specialist,6/26/1975,Individual Contributor,49000,1,JP_1028,2021,0,860 days 19:12:35.035740,28.693347,2.358357
4954,2021-09-13,2023-06-18 19:12:34.307585,Corporate,Customer Service,USD,0.15,CS Operations Specialist,8/15/1961,Individual Contributor,80000,0,JP_1025,2021,2023,643 days 19:12:34.307585,21.460013,1.763837
4955,2021-08-30,2023-03-02 00:00:00.000000,Technology,Technology,USD,0.15,Software Engineer,4/19/1957,Individual Contributor,77000,0,JP_1043,2021,2023,549 days 00:00:00,18.3,1.50411
4956,2021-04-26,NaT,Corporate,Marketing,USD,0.1,"Associate, Marketing",3/6/1950,Associate,120000,1,JP_1030,2021,0,783 days 19:12:35.035740,26.12668,2.147398
4957,2021-11-22,NaT,Technology,Technology,USD,0.15,Software Engineer,8/5/1977,Individual Contributor,77000,1,JP_1043,2021,0,573 days 19:12:35.035740,19.12668,1.572056
4958,2021-11-22,NaT,Corporate,Corporate,USD,0.2,"Manager, Finance",9/6/1979,Manager,110000,1,JP_1053,2021,0,573 days 19:12:35.035740,19.12668,1.572056
4959,2021-11-22,NaT,Corporate,Corporate,USD,0.15,HR Coordinator,11/24/1996,Individual Contributor,50000,1,JP_1007,2021,0,573 days 19:12:35.035740,19.12668,1.572056
4960,2021-04-19,NaT,Corporate,Corporate,USD,0.15,"Manager, Corporate",6/4/1965,Manager,120000,1,JP_1013,2021,0,790 days 19:12:35.035740,26.360014,2.166576
4961,2021-09-13,NaT,Technology,Corporate,USD,0.5,Chief Human Resources Officer,2/5/1957,CSuite,266000,1,JP_1055,2021,0,643 days 19:12:35.035740,21.460014,1.763837
4962,2021-06-21,NaT,Technology,Sales,USD,0.15,Sales Team Lead,12/22/1974,Senior,95000,1,JP_1039,2021,0,727 days 19:12:35.035740,24.260014,1.993974
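
The tenure calculation can be sketched on two toy rows as follows. The fixed `as_of` date is an assumption made here so the numbers are reproducible; the notebook itself uses today's date. As in the cell above, a month is approximated as 30 days and a year as 365 days.

```python
import pandas as pd

# Toy rows: one terminated employee and one active employee
df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2021-08-30', '2021-11-22']),
    'Termination_Date': pd.to_datetime(['2023-03-02', None]),
    'Active Status': [0, 1],
})

# Assumed reference date for the example (stands in for "today")
as_of = pd.Timestamp('2023-06-18')

# Keep the termination date for inactive rows, use the reference date otherwise
end_date = df['Termination_Date'].where(df['Active Status'] == 0, as_of)
tenure = end_date - df['Start_Date']

# Dividing a timedelta by a fixed-length timedelta yields a plain float
df['tenure_months'] = tenure / pd.Timedelta(days=30)
df['tenure_years'] = tenure / pd.Timedelta(days=365)

print(df[['tenure_months', 'tenure_years']])
```

Fixing the reference date also keeps the saved tenure values stable between runs, which a call to today's date does not.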


Checking other parts of the data

In [49]:
full_data.iloc[4960:,12:]

Unnamed: 0,Start_Date,Termination_Date,Office_Type,Department,Currency,Bonus_pct,Job_title,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year,diff_in_days,tenure_months,tenure_years
4960,2021-04-19,NaT,Corporate,Corporate,USD,0.15,"Manager, Corporate",6/4/1965,Manager,120000,1,JP_1013,2021,0,790 days 19:12:35.035740,26.360014,2.166576
4961,2021-09-13,NaT,Technology,Corporate,USD,0.5,Chief Human Resources Officer,2/5/1957,CSuite,266000,1,JP_1055,2021,0,643 days 19:12:35.035740,21.460014,1.763837
4962,2021-06-21,NaT,Technology,Sales,USD,0.15,Sales Team Lead,12/22/1974,Senior,95000,1,JP_1039,2021,0,727 days 19:12:35.035740,24.260014,1.993974
4963,2021-03-08,2022-11-27,Corporate,Marketing,USD,0.5,Chief Marketing Officer,11/27/1950,CSuite,370000,0,JP_1051,2021,2022,629 days 00:00:00,20.966667,1.723288
4964,2021-01-25,NaT,Technology,Technology,USD,0.15,Software Engineer,7/5/1975,Individual Contributor,77000,1,JP_1043,2021,0,874 days 19:12:35.035740,29.160014,2.396713
4965,2021-08-23,NaT,Corporate,Sales,USD,0.15,Sales Team Lead,4/3/1982,Senior,85500,1,JP_1039,2021,0,664 days 19:12:35.035740,22.160014,1.821371
4966,2021-02-01,NaT,Technology,Technology,USD,0.15,Senior Software Engineer,6/18/2001,Senior,98000,1,JP_1046,2021,0,867 days 19:12:35.035740,28.92668,2.377535
4967,2021-09-06,NaT,Corporate,Corporate,USD,0.5,Chief Financial Officer,2/6/1970,CSuite,390000,1,JP_1054,2021,0,650 days 19:12:35.035740,21.693347,1.783015


In [50]:
full_data.drop('diff_in_days', axis=1, inplace=True)

#### Cleaning inconsistent state abbreviation

In [51]:
list(set(full_data['State']))[:10]

[nan, 'MA', 'WV', 'AZ', 'ME', 'MN', 'GA', 'New Hampshire', 'Georgia', 'LA']

We can see from the example above that some states are not abbreviated, which defeats the purpose of the `StateFull` column. We are going to replace the full names in `State` with their abbreviated codes.
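
As a small illustration of the problem, the sketch below flags values that are not two-letter codes and maps them through a lookup. The `name_to_code` dictionary is a hypothetical, partial lookup written by hand for this example; the notebook builds the full lookup from a scraped table instead.

```python
import pandas as pd

# Toy sample mixing abbreviations, full names and a missing value
state = pd.Series(['MA', 'New Hampshire', 'Georgia', 'GA', None])

# Anything that is present but not exactly two characters is not abbreviated
present = state.dropna()
not_abbreviated = present[present.str.len() != 2].unique().tolist()

# Hypothetical partial lookup, for illustration only
name_to_code = {'New Hampshire': 'NH', 'Georgia': 'GA'}
state_code = state.replace(name_to_code)

print(not_abbreviated)
print(state_code.tolist())
```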

Checking whether any rows outside the US have a state value

In [52]:
print('Null values for states beside US =',full_data.loc[full_data['Country']!='US']['State'].isna().sum())
full_data.loc[full_data['Country']!='US']['State']

Null values for states beside US = 522


54      NaN
62      NaN
105     NaN
113     NaN
126     NaN
       ... 
4474    NaN
4475    NaN
4496    NaN
4520    NaN
4521    NaN
Name: State, Length: 522, dtype: object

We are going to scrape a table of US state abbreviations from the web.

In [53]:
url='https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971'

In [54]:
import requests
from bs4 import BeautifulSoup

html_data = requests.get(url).text
html_data

'<!DOCTYPE html>\n\n<!--[if lt IE 9]><html class="no-js lt-ie9" lang="en" dir="ltr"><![endif]--><!--[if gt IE 8]><!-->\n<html class="no-js" lang="en" dir="ltr">\n<!--<![endif]-->\n\n<head>\n<meta charset="utf-8">\n<!-- Web Experience Toolkit (WET) / BoÃ\x83Â®te Ã\x83Â\xa0 outils de l\'expÃ\x83Â©rience Web (BOEW)\n     wet-boew.github.io/wet-boew/License-en.htm / wet-boew.github.io/wet-boew/Licence-fr.htm -->\n\n<title>List of U&#46;S&#46; States with Codes and Abbreviations</title>\n<meta name="description" content="List of U.S. States with Codes and Abbreviations - Table of: Code, State, Abbreviation, Alpha code" />\n<meta name="dcterms.creator" content="Government of Canada, Statistics Canada" />\n<meta name="dcterms.title" content="List of U.S. States with Codes and Abbreviations" />\n<meta name="dcterms.issued" title="W3CDTF" content="2008-12-29" />\n<meta name="dcterms.modified" title="W3CDTF" content="2019-05-06" />\n<meta name="dcterms.subject" title="gcstc" content="none" />\n<

In [55]:
soup = BeautifulSoup(html_data, 'html5lib')

In [56]:
table = soup.find_all('table')[0] # The located table

In [57]:
table_dict = {'StateFull': [], 'Alpha code': []}

# Each table row holds 3 'td' cells, so the position within a row is the cell index modulo 3
for i, cell in enumerate(table.find_all('td')): # Iterate over cells, not rows
    # Extract state name (first cell of the row)
    if i%3==0:
        table_dict['StateFull'].append(cell.get_text(strip=True))
    # Extract state code (third cell of the row)
    if i%3==2:
        table_dict['Alpha code'].append(cell.get_text(strip=True))
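
An alternative sketch that walks the table row by row instead of counting cells. The `sample_html` string is a hypothetical miniature of the scraped page, assuming each data row carries three `td` cells (name, abbreviation, alpha code); the real page may be laid out differently. It also uses the built-in `'html.parser'` rather than `'html5lib'` to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the scraped table, for illustration only
sample_html = """
<table>
  <tr><th>Code</th><th>State</th><th>Abbreviation</th><th>Alpha code</th></tr>
  <tr><th>01</th><td>Alabama</td><td>Ala.</td><td>AL</td></tr>
  <tr><th>02</th><td>Alaska</td><td>Alaska</td><td>AK</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for tr in soup.find('table').find_all('tr'):
    # Collect the stripped text of every data cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # Skip header rows and anything that does not have the expected 3 cells
    if len(cells) == 3:
        rows.append({'StateFull': cells[0], 'Alpha code': cells[2]})

print(rows)
```

Working per row means a missing or extra cell affects only that row, whereas a running counter would shift every row after it.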

In [58]:
state_df = pd.DataFrame(table_dict)
state_df.head()

Unnamed: 0,StateFull,Alpha code
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [59]:
full_data=full_data.merge(state_df, on='StateFull', how='left')
full_data.head()

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,...,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year,tenure_months,tenure_years,Alpha code
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,NY,New York,13212,US,United States,...,1/5/1986,CSuite,500000,1,JP_1000,2009,0,171.960014,14.1337,NY
1,100002,David,Rickards,4265 Graystone Lakes,Macon,GA,Georgia,31206,US,United States,...,7/13/1971,Manager,70000,1,JP_1001,2009,0,171.960014,14.1337,GA
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,NJ,New Jersey,7087,US,United States,...,1/25/1989,Individual Contributor,77000,0,JP_1027,2009,2013,49.3,4.052055,NJ
3,100004,Justin,Edgin,1262 Limer Street,Rome,GA,Georgia,30165,US,United States,...,5/1/1996,CSuite,400000,0,JP_1041,2009,2013,52.566667,4.320548,GA
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,CA,California,92705,US,United States,...,5/5/1972,Manager,51000,0,JP_1018,2009,2011,18.2,1.49589,CA
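
A left merge like this one can be checked with two standard `merge` options, sketched here on toy frames. `validate='many_to_one'` raises an error if a state name appears more than once in the lookup, and `indicator=True` adds a `_merge` column that shows which rows found no match.

```python
import pandas as pd

# Toy frames standing in for full_data and state_df
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'StateFull': ['New York', 'Washington DC', None],
})
lookup = pd.DataFrame({
    'StateFull': ['New York', 'Washington'],
    'Alpha code': ['NY', 'WA'],
})

merged = employees.merge(
    lookup,
    on='StateFull',
    how='left',
    validate='many_to_one',  # lookup keys must be unique
    indicator=True,          # adds the _merge column
)

# State names that are present in the data but missing from the lookup
unmatched = (
    merged.loc[merged['_merge'] == 'left_only', 'StateFull']
    .dropna()
    .unique()
    .tolist()
)
print(len(merged), unmatched)
```

A check of this kind would surface names such as `Washington DC` that need manual handling.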


Washington DC is added manually as `DC` because the data contains two Washingtons: Washington DC and Washington state. The correct state name for Washington DC is District of Columbia, so we are going to change that as well.

In [61]:
full_data.loc[full_data['Country']!='US']['Alpha code'] # Check whether there are still the same number of null values

54      NaN
62      NaN
105     NaN
113     NaN
126     NaN
       ... 
4474    NaN
4475    NaN
4496    NaN
4520    NaN
4521    NaN
Name: Alpha code, Length: 522, dtype: object

In [62]:
full_data.drop('State', axis=1, inplace=True)
full_data.rename(columns={'Alpha code':'State_code'}, inplace=True)
full_data.head()

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,StateFull,ZipCode,Country,CountryFull,Age,...,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year,tenure_months,tenure_years,State_code
0,100001,Patrice,Moore,1427 Buckhannan Avenue,North Syracuse,New York,13212,US,United States,35,...,1/5/1986,CSuite,500000,1,JP_1000,2009,0,171.960014,14.1337,NY
1,100002,David,Rickards,4265 Graystone Lakes,Macon,Georgia,31206,US,United States,49,...,7/13/1971,Manager,70000,1,JP_1001,2009,0,171.960014,14.1337,GA
2,100003,Grace,Maldonado,1680 Hudson Street,Weehawken,New Jersey,7087,US,United States,32,...,1/25/1989,Individual Contributor,77000,0,JP_1027,2009,2013,49.3,4.052055,NJ
3,100004,Justin,Edgin,1262 Limer Street,Rome,Georgia,30165,US,United States,25,...,5/1/1996,CSuite,400000,0,JP_1041,2009,2013,52.566667,4.320548,GA
4,100005,Benjamin,Vargas,2431 Rainbow Road,Santa Ana,California,92705,US,United States,49,...,5/5/1972,Manager,51000,0,JP_1018,2009,2011,18.2,1.49589,CA


In [63]:
full_data.loc[full_data['StateFull']=='Washington DC', 'State_code']='DC'
full_data.loc[full_data['State_code']=='DC', 'StateFull']='District of Columbia'
full_data.loc[full_data['State_code']=='DC']

Unnamed: 0,EmployeeID,First_Name,Surname,StreetAddress,City,StateFull,ZipCode,Country,CountryFull,Age,...,DOB,level,Salary,Active Status,Job_Profile,start_year,termination_year,tenure_months,tenure_years,State_code
661,100662,Jerome,Hamlin,273 School Street,Washington,District of Columbia,20036,US,United States,41,...,1/13/1980,Individual Contributor,66000,1,JP_1007,2012,0,130.893347,10.758357,DC
686,100687,Charles,Smith,1299 Rhode Island Avenue,Silver Spring,District of Columbia,20904,US,United States,31,...,8/18/1989,Individual Contributor,95000,0,JP_1043,2012,2020,89.166667,7.328767,DC
882,100883,James,Maynard,3350 School Street,Washington,District of Columbia,20005,US,United States,53,...,1/14/1968,Senior,98000,0,JP_1046,2013,2023,122.493347,10.067946,DC
1115,101116,Justin,Sherwood,4184 Passaic Street,Washington,District of Columbia,20036,US,United States,31,...,12/20/1989,Associate,83000,0,JP_1030,2014,2015,23.233333,1.909589,DC
1255,101256,John,McArthur,1729 Hickory Lane,Washington,District of Columbia,20009,US,United States,46,...,1/28/1975,Senior,97000,1,JP_1046,2014,0,111.993347,9.204933,DC
1335,101336,Thomas,Cantrell,421 Hickory Lane,Beltsville,District of Columbia,20705,US,United States,49,...,3/6/1972,Individual Contributor,56000,0,JP_1020,2014,2015,10.166667,0.835616,DC
1701,101702,Willard,Gonzales,1140 Passaic Street,Washington,District of Columbia,20005,US,United States,62,...,2/7/1959,Individual Contributor,90000,1,JP_1043,2015,0,101.260014,8.322741,DC
2026,102027,Nathan,Romano,3351 Massachusetts Avenue,Washington,District of Columbia,20005,US,United States,29,...,2/29/1992,Senior,95000,0,JP_1046,2015,2016,3.8,0.312329,DC
2315,102316,Mary,Elliot,3439 Hickory Lane,Washington,District of Columbia,20007,US,United States,48,...,6/15/1973,Individual Contributor,65000,0,JP_1007,2016,2018,19.9,1.635616,DC
2788,102789,Joan,Escobar,2389 Hickory Lane,Washington,District of Columbia,20017,US,United States,25,...,9/12/1995,Individual Contributor,97000,0,JP_1043,2017,2018,20.1,1.652055,DC


In [64]:
# Check whether state_code is an abbreviation now
set(full_data['State_code'])

{'AK',
 'AL',
 'AR',
 'AZ',
 'CA',
 'CO',
 'CT',
 'DC',
 'DE',
 'FL',
 'GA',
 'HI',
 'IA',
 'ID',
 'IL',
 'IN',
 'KS',
 'KY',
 'LA',
 'MA',
 'MD',
 'ME',
 'MI',
 'MN',
 'MO',
 'MS',
 'MT',
 'NC',
 'ND',
 'NE',
 'NH',
 'NJ',
 'NM',
 'NV',
 'NY',
 'OH',
 'OK',
 'OR',
 'PA',
 'RI',
 'SC',
 'SD',
 'TN',
 'TX',
 'UT',
 'VA',
 'VT',
 'WA',
 'WI',
 'WV',
 'WY',
 nan}

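Update age to reflect the current year. The first cell below recomputes age from `DOB` as of today; it only displays the result, so the stored `Age` column shown in the second cell keeps its original values.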

In [85]:
full_data['DOB']=pd.to_datetime(full_data['DOB'])
round((datetime.today()-full_data['DOB'])/timedelta(days=365))

0       37.0
1       52.0
2       34.0
3       27.0
4       51.0
        ... 
4963    73.0
4964    48.0
4965    41.0
4966    22.0
4967    53.0
Name: DOB, Length: 4968, dtype: float64

In [86]:
full_data['Age']

0       35
1       49
2       32
3       25
4       49
        ..
4963    70
4964    46
4965    39
4966    20
4967    51
Name: Age, Length: 4968, dtype: int64
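
Dividing by 365 days and rounding gives the nearest year, which can overstate age: someone born on 7/13/1971 is 51 on 2023-06-18, but the rounded value is 52. A sketch of counting completed years instead, assuming a fixed reference date `as_of` chosen for this example:

```python
import pandas as pd

# Toy dates of birth in the same month/day/year format as the data
dob = pd.to_datetime(pd.Series(['1/5/1986', '7/13/1971', '2/29/1992']),
                     format='%m/%d/%Y')

# Assumed reference date for the example (stands in for "today")
as_of = pd.Timestamp('2023-06-18')

# True when this year's birthday has already happened by the reference date
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)

# Completed years: year difference, minus one if the birthday is still ahead
age = as_of.year - dob.dt.year - (~had_birthday).astype(int)

print(age.tolist())
```

Assigning the result back to the `Age` column would be the step that actually updates it.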

The full data (or main data) is a join of the other 3 datasets (all except the survey). It includes all details about each employee.

In [65]:
log("Transformation that were done in full data {0} :".format(datetime.today()))
log("- Replacing termination date value of 12/12/2999 to NaN")
log("- Converting violating termination date values to today's date")
log("- Drop 'notes' column")
log("- Adding start_year and termination_year")
log("- Inactivating status for terminated employees")
log("- Adding tenure_year and tenure_months")
log("- Cleaning state column")

Transformation that were done in full data 2023-06-18 19:14:37.273685 :
- Replacing termination date value of 12/12/2999 to NaN
- Converting violating termination date values to today's date
- Drop 'notes' column
- Adding start_year and termination_year
- Inactivating status for terminated employees
- Adding tenure_year and tenure_months
- Cleaning state column


## Demographic Data

In [66]:
demographic_data.head(10)

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation
0,100001,female,female,White,0,0,Undergraduate,Heterosexual
1,100002,male,male,White,0,1,Undergraduate,Heterosexual
2,100003,female,female,Asian,0,0,Undergraduate,Heterosexual
3,100004,male,male,White,0,0,Undergraduate,Heterosexual
4,100005,male,male,Hispanic or Latino,0,0,Undergraduate,Missing
5,100006,male,Prefer not to say,White,0,0,Undergraduate,Heterosexual
6,100007,male,male,Asian,0,0,Some College,Bisexual
7,100008,female,female,White,0,0,High School,Missing
8,100009,male,male,White,0,0,Undergraduate,Heterosexual
9,100010,male,male,White,0,0,PhD,Gay


In [67]:
demographic_data.isna().sum()

EmployeeID              0
Gender                  0
Gender Identity         0
Race/Ethnicity        549
Veteran                 0
Disability              0
Education               0
Sexual Orientation      0
dtype: int64

In [68]:
demographic_data.describe(include='all')

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation
count,4968.0,4968,4968,4419,4968.0,4968.0,4968,4968
unique,,2,5,9,,,5,6
top,,male,male,White,,,Undergraduate,Heterosexual
freq,,2570,2308,2658,,,3288,3141
mean,102484.5,,,,0.047504,0.042271,,
std,1434.282399,,,,0.212736,0.201226,,
min,100001.0,,,,0.0,0.0,,
25%,101242.75,,,,0.0,0.0,,
50%,102484.5,,,,0.0,0.0,,
75%,103726.25,,,,0.0,0.0,,


Investigating null values in Race/Ethnicity column

In [69]:
set(demographic_data['Race/Ethnicity'])

{'American Indian or Alaska Native',
 'Asian',
 'Black or African American',
 'Hispanic or Latino',
 'Native American or Alaska Native',
 'Native Hawaiian or Other Pacific Islander',
 'Native Hawaiian or Pacific Islander',
 'Two or More Races',
 'White',
 nan}

We have 9 unique ethnicity labels in the data. `NaN` values will be treated as `Prefer not to say`, as this is more meaningful for our data analysis purposes.

In [70]:
demographic_data.loc[demographic_data['Race/Ethnicity'].isna()].head(5)

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation
54,100055,male,male,,0,0,Undergraduate,Heterosexual
62,100063,male,Prefer not to say,,0,0,Undergraduate,Heterosexual
105,100106,female,female,,0,0,Graduate,Heterosexual
113,100114,female,female,,0,0,Undergraduate,Heterosexual
126,100127,female,female,,0,0,Undergraduate,Missing


In [71]:
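# Assign the filled column back; inplace=True on a selected column may act on a copy
demographic_data['Race/Ethnicity']=demographic_data['Race/Ethnicity'].fillna('Prefer not to say')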
demographic_data.loc[demographic_data['Race/Ethnicity'].isna()]

Unnamed: 0,EmployeeID,Gender,Gender Identity,Race/Ethnicity,Veteran,Disability,Education,Sexual Orientation


This also removes all null values from demographic data.
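
The set of labels printed earlier contains two pairs that look like spelling variants of the same category (`American Indian or Alaska Native` / `Native American or Alaska Native`, and `Native Hawaiian or Other Pacific Islander` / `Native Hawaiian or Pacific Islander`). If they are meant to be the same group, which is an assumption and not something the data confirms, they could be merged before filling nulls. The choice of canonical label below is arbitrary.

```python
import pandas as pd

# Toy column with two variant spellings and a missing value
race = pd.Series(
    ['White',
     'Native American or Alaska Native',
     None,
     'Native Hawaiian or Pacific Islander'],
    name='Race/Ethnicity',
)

# Hypothetical mapping from variant label to the label kept
canonical = {
    'Native American or Alaska Native': 'American Indian or Alaska Native',
    'Native Hawaiian or Pacific Islander': 'Native Hawaiian or Other Pacific Islander',
}

# Merge the variants first, then fill the remaining nulls
cleaned = race.replace(canonical).fillna('Prefer not to say')

print(cleaned.tolist())
```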

## Survey Data

In [72]:
survey_data.head(10)

Unnamed: 0,EmployeeID,Survey,I would recommend my friends or Family to work at TheCompany,I feel engaged in my work.,I believe Leadership cares about the employees at TheCompany,My manager supports me in my role at TheCompany,"TheCompany cares about Diversity, Equity and Inclusion.",I believe there is room for me to grow at TheCompany,I work on interesting projects.,My manager motivates me to work hard.,...,I believe TheCompany is in a great position in the market for the next few years to be succesful.,I plan on staying with TheCompany for at least 2 more years.,I believe I am fairly compensated for my work.,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,What does TheCompany do well?,What can TheCompany improve?
0,100001,2023Q2,3,3,2,1,1,3,2,4,...,3,4,2,2,4,4,3,3,,
1,100002,2023Q2,3,4,2,4,2,4,4,2,...,3,1,5,1,2,4,3,1,,
2,100009,2023Q2,3,4,2,3,4,3,2,2,...,1,1,3,2,2,3,2,4,,
3,100014,2023Q2,4,1,5,2,5,3,4,4,...,3,4,2,2,4,3,3,2,,
4,100018,2023Q2,4,1,1,1,2,3,1,3,...,4,3,2,4,3,2,2,3,,
5,100020,2023Q2,3,3,4,1,4,2,2,2,...,4,3,4,4,3,1,1,3,,
6,100021,2023Q2,4,1,4,2,2,4,4,1,...,2,3,4,2,3,4,4,4,,
7,100024,2023Q2,4,3,4,2,2,2,2,3,...,3,3,3,4,3,2,2,2,,
8,100031,2023Q2,4,4,2,4,3,4,4,4,...,3,4,4,1,3,2,2,4,,
9,100038,2023Q2,3,2,3,2,3,3,1,3,...,2,2,3,4,3,2,4,4,,


In [73]:
survey_data.describe(include='all').iloc[:,:10]

Unnamed: 0,EmployeeID,Survey,I would recommend my friends or Family to work at TheCompany,I feel engaged in my work.,I believe Leadership cares about the employees at TheCompany,My manager supports me in my role at TheCompany,"TheCompany cares about Diversity, Equity and Inclusion.",I believe there is room for me to grow at TheCompany,I work on interesting projects.,My manager motivates me to work hard.
count,2827.0,2827,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0
unique,,1,,,,,,,,
top,,2023Q2,,,,,,,,
freq,,2827,,,,,,,,
mean,102732.81995,,2.919703,2.931022,2.925716,2.907676,2.881854,2.901309,2.933498,2.931022
std,1443.057615,,0.99018,1.025774,1.001485,1.016652,0.99584,1.012216,1.009421,1.002041
min,100001.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,101481.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
50%,102786.0,,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
75%,104030.5,,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


In [74]:
survey_data.describe(include='all').iloc[:,10:]

Unnamed: 0,I am motivated to work hard at TheCompany.,I believe I am recognized for the work I do.,I believe there are good career opportunities for me at TheCompany.,I believe TheCompany is in a great position in the market for the next few years to be succesful.,I plan on staying with TheCompany for at least 2 more years.,I believe I am fairly compensated for my work.,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,What does TheCompany do well?,What can TheCompany improve?
count,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,2827.0,0.0,0.0
unique,,,,,,,,,,,,,
top,,,,,,,,,,,,,
freq,,,,,,,,,,,,,
mean,2.936328,2.938451,2.922179,2.941988,2.922179,2.921825,2.915104,2.923594,2.895649,2.940927,2.914751,,
std,1.001863,0.997749,1.01804,1.012394,1.02047,0.99818,1.002408,1.020923,1.03105,1.022592,1.024549,,
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,
25%,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,
50%,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,,
75%,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,,


The survey uses a rating scale of 1-5, with 5 being the highest. There are 19 rated questions in the survey. Two additional questions ask for feedback about TheCompany. I am going to extract the `quarter` value from the `Survey` column.

In [75]:
survey_data['survey_quarter']=survey_data['Survey'].str[-2:]
survey_data.head().iloc[:,16:]

Unnamed: 0,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,What does TheCompany do well?,What can TheCompany improve?,survey_quarter
0,2,4,4,3,3,,,Q2
1,1,2,4,3,1,,,Q2
2,2,2,3,2,4,,,Q2
3,2,4,3,3,2,,,Q2
4,4,3,2,2,3,,,Q2
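
Slicing the last two characters works as long as every value has the `YYYYQn` shape. A slightly stricter sketch uses `str.extract` with named groups, which also yields the year and leaves `NaN` where a value does not match. The `survey_year` column name is hypothetical; only `survey_quarter` appears in the notebook.

```python
import pandas as pd

# Toy values in the same shape as the Survey column
survey = pd.Series(['2023Q2', '2022Q4'])

# Named groups become the column names of the extracted frame
parts = survey.str.extract(r'^(?P<survey_year>\d{4})(?P<survey_quarter>Q[1-4])$')
parts['survey_year'] = parts['survey_year'].astype(int)

print(parts)
```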


Let's look at the null values

In [76]:
survey_data.isna().sum()

EmployeeID                                                                                              0
Survey                                                                                                  0
I would recommend my friends or Family to work at TheCompany                                            0
I feel engaged in my work.                                                                              0
I believe Leadership cares about the employees at TheCompany                                            0
My manager supports me in my role at TheCompany                                                         0
TheCompany cares about Diversity, Equity and Inclusion.                                                 0
I believe there is room for me to grow at TheCompany                                                    0
I work on interesting projects.                                                                         0
My manager motivates me to work hard.         

All rows for the feedback columns, i.e. `What does TheCompany do well?` and `What can TheCompany improve?`, are null. We are going to remove those columns because at this stage they do not give any information.

In [77]:
survey_data.drop(['What does TheCompany do well?','What can TheCompany improve?'], axis=1,
                inplace=True)
survey_data.head()

Unnamed: 0,EmployeeID,Survey,I would recommend my friends or Family to work at TheCompany,I feel engaged in my work.,I believe Leadership cares about the employees at TheCompany,My manager supports me in my role at TheCompany,"TheCompany cares about Diversity, Equity and Inclusion.",I believe there is room for me to grow at TheCompany,I work on interesting projects.,My manager motivates me to work hard.,...,I believe there are good career opportunities for me at TheCompany.,I believe TheCompany is in a great position in the market for the next few years to be succesful.,I plan on staying with TheCompany for at least 2 more years.,I believe I am fairly compensated for my work.,I believe there is little to not politics at TheCompany,I feel comfortable going to someone in leadership if there is an issue.,My values align with the culture at TheCompany,I know what TheCompany values are at TheCompany,I feel like I can take off my accrued Paid Time Off (PTO)/Vacation without feeling guilty,survey_quarter
0,100001,2023Q2,3,3,2,1,1,3,2,4,...,2,3,4,2,2,4,4,3,3,Q2
1,100002,2023Q2,3,4,2,4,2,4,4,2,...,4,3,1,5,1,2,4,3,1,Q2
2,100009,2023Q2,3,4,2,3,4,3,2,2,...,2,1,1,3,2,2,3,2,4,Q2
3,100014,2023Q2,4,1,5,2,5,3,4,4,...,3,3,4,2,2,4,3,3,2,Q2
4,100018,2023Q2,4,1,1,1,2,3,1,3,...,4,4,3,2,4,3,2,2,3,Q2
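
When the empty columns are not known by name in advance, the same result can be reached generically. A minimal sketch on a toy frame, using `dropna(axis=1, how='all')` to drop every column that contains only nulls:

```python
import pandas as pd

# Toy survey frame with one entirely empty feedback column
df = pd.DataFrame({
    'EmployeeID': [100001, 100002],
    'I feel engaged in my work.': [3, 4],
    'What does TheCompany do well?': [None, None],
})

# List the all-null columns first, so the drop can be logged
empty_cols = df.columns[df.isna().all()].tolist()

# Drop every column whose values are all null
cleaned = df.dropna(axis=1, how='all')

print(empty_cols, cleaned.columns.tolist())
```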


In [78]:
log('Transformation done in --- %s seconds ---' % (time.time()-start_time))

Transformation done in --- 124.50651621818542 seconds ---


We just dropped the two empty columns and added the survey `quarter` to the data.  
Now, let's load the cleaned data into new files.

## Loading Phase

In [79]:
log('Loading Data ...')
start_time = time.time()

Loading Data ...


In [80]:
company_data.to_csv('cleanData/company_details.csv', index=False)
job_data.to_csv('cleanData/job_details.csv', index=False)
full_data.to_csv('cleanData/main_data.csv', index=False)
demographic_data.to_csv('cleanData/employee_details.csv', index=False)
survey_data.to_csv('cleanData/survey_data.csv', index=False)
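
`to_csv` does not create missing folders, so the cell above expects `cleanData/` to exist already. A small sketch of creating the folder first and writing several frames in a loop; it writes into a temporary directory here so the example has no side effects, and the frame contents are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned datasets
frames = {
    'company_details': pd.DataFrame({'Office': ['NYC'], 'COL Amount': [100]}),
}

# Temporary parent directory used only for this example
out_dir = Path(tempfile.mkdtemp()) / 'cleanData'
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

for name, frame in frames.items():
    frame.to_csv(out_dir / f'{name}.csv', index=False)

written = sorted(p.name for p in out_dir.iterdir())
print(written)
```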

In [81]:
log('Loading done in --- %s seconds ---' % (time.time()-start_time))
log('\n')

Loading done in --- 0.20245862007141113 seconds ---


