# People Analytics Case Study

## Libraries

In [268]:
# import libraries
import pandas as pd

This analysis uses a single library, pandas, to load, inspect, and clean the data.

## Data Overview

In [269]:
# load data
df = pd.read_csv('../data/raw/Human Resources.csv')

In [270]:
# View the DataFrame
df

Unnamed: 0,id,first_name,last_name,birthdate,gender,race,department,jobtitle,location,hire_date,termdate,location_city,location_state
0,00-0037846,Kimmy,Walczynski,6/4/1991,Male,Hispanic or Latino,Engineering,Programmer Analyst I,Headquarters,1/20/2002,,Cleveland,Ohio
1,00-0041533,Ignatius,Springett,6/29/1984,Male,White,Business Development,Business Analyst,Headquarters,4/8/2019,,Cleveland,Ohio
2,00-0045747,Corbie,Bittlestone,7/29/1989,Male,Black or African American,Sales,Solutions Engineer Manager,Headquarters,10/12/2010,,Cleveland,Ohio
3,00-0055274,Baxy,Matton,9/14/1982,Female,White,Services,Service Tech,Headquarters,4/10/2005,,Cleveland,Ohio
4,00-0076100,Terrell,Suff,4/11/1994,Female,Two or More Races,Product Management,Business Analyst,Remote,9/29/2010,2029-10-29 06:09:38 UTC,Flint,Michigan
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37403,,,,,,,,,,,,,
37404,,,,,,,,,,,,,
37405,,,,,,,,,,,,,
37406,,,,,,,,,,,,,


In [271]:
# General information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37408 entries, 0 to 37407
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              22214 non-null  object
 1   first_name      22214 non-null  object
 2   last_name       22214 non-null  object
 3   birthdate       22214 non-null  object
 4   gender          22214 non-null  object
 5   race            22214 non-null  object
 6   department      22214 non-null  object
 7   jobtitle        22267 non-null  object
 8   location        22214 non-null  object
 9   hire_date       22214 non-null  object
 10  termdate        3929 non-null   object
 11  location_city   22214 non-null  object
 12  location_state  22214 non-null  object
dtypes: object(13)
memory usage: 3.7+ MB
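
The `info()` output above shows 37,408 rows but only 22,214 non-null values in most columns, so roughly 15,194 rows carry no employee data, and `jobtitle` has 53 more non-null values than the other columns. A per-column missing-value summary makes gaps like these easier to spot. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "pct_missing": (missing / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the HR data (values are made up for illustration)
toy = pd.DataFrame({
    "id": ["a", "b", None, None],
    "jobtitle": ["Analyst", "Tech", "Support", None],
    "termdate": [None, "2012-12-10", None, None],
})

summary = missing_summary(toy)
print(summary)
```

In the notebook, the same helper could be called as `missing_summary(df)` right after loading the CSV.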


## Data Cleaning

In [272]:
# Rename columns in the DataFrame
df = df.rename(columns={
    'birthdate': 'birth_date',
    'jobtitle': 'job_title',
    'termdate': 'term_date',
    'id': 'emp_id'
})

In [273]:
# Remove rows where all values are NaN (empty)
df = df.dropna(how='all')

In [274]:
df

Unnamed: 0,emp_id,first_name,last_name,birth_date,gender,race,department,job_title,location,hire_date,term_date,location_city,location_state
0,00-0037846,Kimmy,Walczynski,6/4/1991,Male,Hispanic or Latino,Engineering,Programmer Analyst I,Headquarters,1/20/2002,,Cleveland,Ohio
1,00-0041533,Ignatius,Springett,6/29/1984,Male,White,Business Development,Business Analyst,Headquarters,4/8/2019,,Cleveland,Ohio
2,00-0045747,Corbie,Bittlestone,7/29/1989,Male,Black or African American,Sales,Solutions Engineer Manager,Headquarters,10/12/2010,,Cleveland,Ohio
3,00-0055274,Baxy,Matton,9/14/1982,Female,White,Services,Service Tech,Headquarters,4/10/2005,,Cleveland,Ohio
4,00-0076100,Terrell,Suff,4/11/1994,Female,Two or More Races,Product Management,Business Analyst,Remote,9/29/2010,2029-10-29 06:09:38 UTC,Flint,Michigan
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22433,,,,,,,,Support Staff III,,,,,
22434,,,,,,,,Support Staff III,,,,,
22435,,,,,,,,Support Staff III,,,,,
22436,,,,,,,,Support Staff III,,,,,


In [275]:
# Check for duplicated rows
df.duplicated().sum()

50

In [276]:
# Show duplicated rows
df[df.duplicated()]

Unnamed: 0,emp_id,first_name,last_name,birth_date,gender,race,department,job_title,location,hire_date,term_date,location_city,location_state
22322,,,,,,,,Support Staff II,,,,,
22323,,,,,,,,Support Staff II,,,,,
22324,,,,,,,,Support Staff II,,,,,
22325,,,,,,,,Support Staff II,,,,,
22326,,,,,,,,Support Staff II,,,,,
22327,,,,,,,,Support Staff II,,,,,
22328,,,,,,,,Support Staff II,,,,,
22329,,,,,,,,Support Staff II,,,,,
22330,,,,,,,,Support Staff II,,,,,
22331,,,,,,,,Support Staff II,,,,,


In [277]:
# Remove rows where all columns except 'job_title' are NaN
df = df.dropna(how='all', subset=df.columns.difference(['job_title']))
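
The two `dropna` calls behave differently: the first removes only rows where every column is empty, while the second ignores `job_title` when deciding, so rows that hold nothing but a job title are removed as well. The sketch below illustrates the difference on a three-row toy frame; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "emp_id": ["00-1", None, None],
    "job_title": ["Analyst", "Support Staff II", None],
    "department": ["Sales", None, None],
})

# Step 1: drops only the row in which every column is missing (index 2)
step1 = toy.dropna(how="all")

# Step 2: ignore 'job_title' when checking; the title-only row (index 1) goes too
other_cols = toy.columns.difference(["job_title"])
step2 = step1.dropna(how="all", subset=other_cols)

print(len(toy), len(step1), len(step2))
```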

In [278]:
# Count the occurrences of each value in the 'emp_id' column
duplicate_counts = df['emp_id'].value_counts()

# Filter the counts to show only duplicates
duplicates = duplicate_counts[duplicate_counts > 1]

# Show the duplicated ids and their counts
print(duplicates)

Series([], Name: count, dtype: int64)
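
The empty Series confirms that no `emp_id` occurs more than once. An alternative way to run the same check, sketched here on invented toy data, is `Series.duplicated(keep=False)`, which returns the full offending rows rather than only the counts, together with `Series.is_unique` as a quick yes/no flag.

```python
import pandas as pd

# Toy data with one deliberately repeated id (values are illustrative)
toy = pd.DataFrame({
    "emp_id": ["00-1", "00-2", "00-2", "00-3"],
    "first_name": ["Ann", "Bo", "Bo", "Cy"],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupe_rows = toy[toy["emp_id"].duplicated(keep=False)]
print(dupe_rows)

# A simple uniqueness flag that could back an assertion in a pipeline
ids_unique = toy["emp_id"].is_unique
print(ids_unique)
```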


In [279]:
# Drop the 'location_city' column, which is not relevant to this analysis.
# Reassigning instead of using inplace=True avoids SettingWithCopyWarning here.
df = df.drop(columns=['location_city'])


In [280]:
# Count the number of rows for each value in 'location_state'
df['location_state'].value_counts()

location_state
Ohio            18025
Pennsylvania     1115
Illinois          868
Indiana           700
Michigan          673
Kentucky          451
Wisconsin         382
Name: count, dtype: int64

In [281]:
# Map the original states to California and other western U.S. states for storytelling purposes.
# The original dataset is fictitious, which allows this creative flexibility in the analysis.

state_mapping = {
    'Ohio': 'California',
    'Pennsylvania': 'Oregon',
    'Illinois': 'Nevada',
    'Indiana': 'Arizona',
    'Michigan': 'Utah',
    'Kentucky': 'New Mexico',
    'Wisconsin': 'Idaho'
}

# Apply the mapping to the 'location_state' column.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
df = df.assign(location_state=df['location_state'].replace(state_mapping))


In [282]:
# Recount rows per state to confirm the mapping preserved the totals
df['location_state'].value_counts()

location_state
California    18025
Oregon         1115
Nevada          868
Arizona         700
Utah            673
New Mexico      451
Idaho           382
Name: count, dtype: int64
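
The counts match the pre-mapping totals, which suggests all seven states were covered. Because `replace` silently passes through any value that is missing from the dictionary, an explicit check for unmapped values can be useful. The sketch below uses a shortened copy of the mapping and an invented extra state ("Texas") to show the difference between `replace` and `map`.

```python
import pandas as pd

# Shortened copy of the mapping; "Texas" is a made-up value with no entry
state_mapping = {"Ohio": "California", "Michigan": "Utah"}
states = pd.Series(["Ohio", "Michigan", "Texas"], name="location_state")

# Values present in the data but absent from the mapping
unmapped = sorted(set(states.dropna()) - set(state_mapping))
print(unmapped)

replaced = states.replace(state_mapping)  # unmapped values pass through unchanged
mapped = states.map(state_mapping)        # unmapped values become NaN
```

Running the `unmapped` check before applying the mapping would surface any state that the dictionary does not cover.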

In [283]:
# Convert the 'birth_date' and 'hire_date' columns to datetime format.
# errors='coerce' turns any unparseable value into NaT.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
df = df.assign(
    birth_date=pd.to_datetime(df['birth_date'], errors='coerce'),
    hire_date=pd.to_datetime(df['hire_date'], errors='coerce'),
)


In [284]:
# Check for NaT values in 'birth_date' and 'hire_date'
nat_counts = df[['birth_date', 'hire_date']].isna().sum()
print(nat_counts)

birth_date    0
hire_date     0
dtype: int64
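
Zero NaT values means every date string was parsed. Since the raw values follow a month/day/year pattern (for example `6/4/1991`), passing an explicit `format` is one way to make the parsing stricter, and comparing the parsed result with the raw column shows exactly which strings failed. The sketch below is illustrative; the third value is invented to force a failure.

```python
import pandas as pd

# Two dates in the dataset's m/d/Y style plus one invented bad value
raw = pd.Series(["6/4/1991", "6/29/1984", "not a date"])

# An explicit format avoids ambiguous day/month inference
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Raw strings that were present but could not be parsed
failed = raw[parsed.isna() & raw.notna()]
print(failed)
```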


In [285]:
df

Unnamed: 0,emp_id,first_name,last_name,birth_date,gender,race,department,job_title,location,hire_date,term_date,location_state
0,00-0037846,Kimmy,Walczynski,1991-06-04,Male,Hispanic or Latino,Engineering,Programmer Analyst I,Headquarters,2002-01-20,,California
1,00-0041533,Ignatius,Springett,1984-06-29,Male,White,Business Development,Business Analyst,Headquarters,2019-04-08,,California
2,00-0045747,Corbie,Bittlestone,1989-07-29,Male,Black or African American,Sales,Solutions Engineer Manager,Headquarters,2010-10-12,,California
3,00-0055274,Baxy,Matton,1982-09-14,Female,White,Services,Service Tech,Headquarters,2005-04-10,,California
4,00-0076100,Terrell,Suff,1994-04-11,Female,Two or More Races,Product Management,Business Analyst,Remote,2010-09-29,2029-10-29 06:09:38 UTC,Utah
...,...,...,...,...,...,...,...,...,...,...,...,...
22209,99-9797418,Dorella,Garvan,1998-07-08,Female,Hispanic or Latino,Research and Development,Research Assistant I,Headquarters,2012-02-08,,California
22210,99-9869877,Dasie,Thorsby,2001-04-19,Female,Two or More Races,Services,Service Manager,Headquarters,2017-10-06,,California
22211,99-9919822,Nerty,Wilding,1970-02-09,Female,Two or More Races,Training,Junior Trainer,Headquarters,2001-02-08,,California
22212,99-9960380,Mabelle,Dawks,1985-09-02,Male,Two or More Races,Accounting,Staff Accountant I,Headquarters,2005-04-03,2012-12-10 14:29:59 UTC,California


In [286]:
# General information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22214 entries, 0 to 22213
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   emp_id          22214 non-null  object        
 1   first_name      22214 non-null  object        
 2   last_name       22214 non-null  object        
 3   birth_date      22214 non-null  datetime64[ns]
 4   gender          22214 non-null  object        
 5   race            22214 non-null  object        
 6   department      22214 non-null  object        
 7   job_title       22214 non-null  object        
 8   location        22214 non-null  object        
 9   hire_date       22214 non-null  datetime64[ns]
 10  term_date       3929 non-null   object        
 11  location_state  22214 non-null  object        
dtypes: datetime64[ns](2), object(10)
memory usage: 2.2+ MB
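
After these steps `term_date` is still stored as `object`, and the sample rows show values such as `2029-10-29 06:09:38 UTC`, which lies in the future. A possible next step, sketched here as an assumption rather than as part of the original notebook, is to parse the column as UTC timestamps and flag termination dates that fall after a chosen reference date. The reference date `as_of` below is a hypothetical choice.

```python
import pandas as pd

# Sample values in the same style as the 'term_date' column
term_raw = pd.Series([
    "2029-10-29 06:09:38 UTC",
    None,
    "2012-12-10 14:29:59 UTC",
])

# Strip the literal " UTC" suffix, then parse and localize to UTC
term = pd.to_datetime(
    term_raw.str.replace(" UTC", "", regex=False),
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",
    utc=True,
)

# Hypothetical reference date; comparisons against NaT evaluate to False
as_of = pd.Timestamp("2024-01-01", tz="UTC")
future = term > as_of
print(future)
```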
