# Import Necessary Libraries

In [1]:
import pandas as pd

In [2]:
# Show all columns and rows when displaying a DataFrame (e.g. with .head()), so that nothing is truncated.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# About the dataset
The dataset, `The Ohio State University 2023 Combined Earnings`, is taken from the university's website at https://apps.hr.osu.edu/salaries/

**Statistics and Reports**\
The Office of Human Resources provides analysis and reporting of human resources management information, including demographics, internal and external markets for faculty and staff salaries, and mandated federal and state reporting.

**Earnings**\
Earnings dataset includes all non-student employees’ regular, overtime, bonus and sick leave/vacation payout for the timeframe. For individuals with more than one job title/appointment, their earnings are listed in the area that corresponds to their primary appointment.

**About the Data**\
Amounts represent paid earnings, including corrections to prior payrolls and in some cases may reflect negative values.
- Regular: components of the employee’s regular pay, including paid leave
- Overtime: includes call back pay, FLSA premium and holiday pay (worked)
- Bonus: includes staff award, STEP faculty and STEP staff
- Other: remaining components of the employee’s pay

**Form Field Definitions**\
Database: Differentiates positions associated with the Wexner Medical Center/Health System, Athletics, and the rest of the university.
The Ohio State Athletics Department operates a self-sustaining budget, receiving no University general funds, student fees or state tax support.
The Wexner Medical Center/Health System operates a self-sustaining budget, receiving no University general funds, student fees or state tax support.

Resulting Data: Earnings data includes all non-student employees’ regular, overtime, bonus and sick leave/vacation payout for 2023. For individuals with more than one job title/appointment, their earnings are listed in the area that corresponds to their primary appointment.

**It contains the following columns:**
- Last Name - Legal
- First Name - Preferred
- Job Profile Name
- Cost Center
- Cost Center Hierarchy CCH6
- Position Group
- Regular Pay (Base Pay + Paid Time Off + Premium)
- Bonus
- Overtime
- Other (Allowance + Supplemental + Uncategorized)
- Gross Pay

# Load the Dataset

In [3]:
df = pd.read_excel('Data/2023-Earnings-Combined.xlsx')

# Overview of the dataset to understand its structure and contents.
Use transpose to show a little more of the output

In [4]:
# View the first few rows
df.head().T

Unnamed: 0,0,1,2,3,4
Last Name - Legal,Green,Brew,Judd,Wilkinson,Conroy
First Name - Preferred,Eric,Chris,Robin,Ian,Maria
Job Profile Name,Professor - Clinical,Visiting Associate Professor,Associate Professor (9M),Professor (9M),Associate Professor (9M)
Cost Center,CC10587 Veterinary Medicine | Veterinary Clini...,CC11786 Engineering | Computer Science and Eng...,CC12460 Arts and Sciences | History,CC11186 EHE | Teaching and Learning Administra...,CC11809 Engineering | Knowlton School of Archi...
Cost Center Hierarchy CCH6,Veterinary Medicine CCH6,Engineering CCH6,Arts and Sciences CCH6,Education and Human Ecology CCH6,Engineering CCH6
Position Group,University,University,University,University,University
Regular Pay (Base Pay + Paid Time Off + Premium),181225.86,7429.9,111042.26,114951.75,106820.5
Bonus,29332.76,0.0,0.0,0.0,0.0
Overtime,0.0,0.0,0.0,0.0,0.0
Other (Allowance + Supplemental + Uncategorized),3000.0,0.0,3000.0,0.0,0.0


In [5]:
# Overview of data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47656 entries, 0 to 47655
Data columns (total 11 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Last Name - Legal                                 47656 non-null  object 
 1   First Name - Preferred                            47656 non-null  object 
 2   Job Profile Name                                  47656 non-null  object 
 3   Cost Center                                       47656 non-null  object 
 4   Cost Center Hierarchy CCH6                        47656 non-null  object 
 5   Position Group                                    47656 non-null  object 
 6   Regular Pay (Base Pay + Paid Time Off + Premium)  47656 non-null  float64
 7   Bonus                                             47656 non-null  float64
 8   Overtime                                          47656 non-null  float64
 9   Other (Allowance 

**There are no missing values and each column has the correct data type.**

# Rename columns
Some column names are long, so let's rename them to make the dataset easier to work with.

In [6]:
df.columns

Index(['Last Name - Legal', 'First Name - Preferred', 'Job Profile Name',
       'Cost Center', 'Cost Center Hierarchy CCH6', 'Position Group',
       'Regular Pay (Base Pay + Paid Time Off + Premium)', 'Bonus', 'Overtime',
       'Other (Allowance + Supplemental + Uncategorized)', 'Gross Pay'],
      dtype='object')

In [7]:
df.rename(columns = {"Last Name - Legal": "last_name",
                     "First Name - Preferred" : "first_name",
                     "Job Profile Name": "job",
                     "Cost Center" : "cost_center",
                     "Cost Center Hierarchy CCH6" : "hierarchy",
                     "Position Group" : "position_group",
                     "Regular Pay (Base Pay + Paid Time Off + Premium)" : "regular_pay",
                     "Bonus" : "bonus",
                     "Overtime" : "overtime",
                     "Other (Allowance + Supplemental + Uncategorized)" : "other",
                     "Gross Pay" : "gross_pay"
                     }, inplace = True)

In [8]:
df.head(2)

Unnamed: 0,last_name,first_name,job,cost_center,hierarchy,position_group,regular_pay,bonus,overtime,other,gross_pay
0,Green,Eric,Professor - Clinical,CC10587 Veterinary Medicine | Veterinary Clini...,Veterinary Medicine CCH6,University,181225.86,29332.76,0.0,3000.0,213558.62
1,Brew,Chris,Visiting Associate Professor,CC11786 Engineering | Computer Science and Eng...,Engineering CCH6,University,7429.9,0.0,0.0,0.0,7429.9
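
In the two rows shown above, `gross_pay` equals the sum of the four pay components. Below is a minimal sketch of a consistency check, assuming that relationship is meant to hold for every row. The `sample` frame is a hypothetical stand-in built from those two rows, and the 0.01 tolerance is an arbitrary choice to absorb floating-point rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`, built from the two rows displayed above
sample = pd.DataFrame({
    "regular_pay": [181225.86, 7429.90],
    "bonus": [29332.76, 0.0],
    "overtime": [0.0, 0.0],
    "other": [3000.0, 0.0],
    "gross_pay": [213558.62, 7429.90],
})

# Sum the four pay components row by row
components = ["regular_pay", "bonus", "overtime", "other"]
component_sum = sample[components].sum(axis=1)

# Compare with gross_pay, allowing a one-cent tolerance (an assumption)
matches = np.isclose(component_sum, sample["gross_pay"], atol=0.01)
mismatched_rows = sample[~matches]
print(f"Rows where gross_pay differs from the component sum: {len(mismatched_rows)}")
```

Running the same comparison on the full `df` would show whether any rows need a closer look before analysis.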


# Numerical Values

In [9]:
# Summary statistics for numerical columns
df.describe().apply(lambda s: s.apply('{0:.0f}'.format))

Unnamed: 0,regular_pay,bonus,overtime,other,gross_pay
count,47656,47656,47656,47656,47656
mean,67554,2204,1142,2775,73676
std,72626,15181,4530,35331,96414
min,-2186,-1000,-15,-2500,0
25%,27117,0,0,0,28443
50%,53867,0,0,0,56560
75%,86062,600,210,38,90448
max,2433750,1385738,153042,6923463,9173463


## Analysis
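- The **negative minimum values are expected**: the dataset description says that "amounts represent paid earnings, including corrections to prior payrolls and in some cases may reflect negative values".
- The mean is higher than the median in every column, indicating **right-skewed distributions**.
- The standard deviation is at least as large as the mean in every column, indicating **high variability** (among the pay components, regular pay has the largest standard deviation in absolute terms, while bonus and other pay vary the most relative to their means).
- There are some **extreme outliers**: for example, the maximum gross pay is about 9.17 million, while the 75th percentile is about 90 thousand. These need to be investigated and possibly removed or capped so that they do not skew later analyses.
- For bonus, overtime and other pay, the 25th and 50th percentiles are 0, so for each of these components **at least half of the employees received nothing** (or a negative correction).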
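
The skewness and outlier observations above can also be quantified. The sketch below is one possible approach, not part of the original analysis: it uses a small made-up `pay` series (illustrative values, not taken from the dataset) and the common 1.5 × IQR rule, which is an assumption about how an "outlier" might be defined here.

```python
import pandas as pd

# Made-up gross pay values for illustration only (not from the dataset)
pay = pd.Series([28000, 54000, 56000, 61000, 90000, 2400000], name="gross_pay")

# Positive skewness means a long right tail (mean pulled above the median)
skewness = pay.skew()

# Flag values above the upper fence of the 1.5 * IQR rule
q1, q3 = pay.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = pay[pay > upper_fence]

print(f"skewness: {skewness:.2f}")
print(f"upper fence: {upper_fence:.0f}")
print(outliers)
```

On the real data the same steps would apply to `df["gross_pay"]`. Whether flagged rows should be removed, capped, or kept depends on the question being asked.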

# Find duplicates

In [10]:
# Find Duplicates
df.duplicated().any()

False

**There are no duplicates in the dataset.**
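
Note that `df.duplicated()` only flags rows that are identical in every column. As a hedged sketch, the example below shows how a check on a subset of columns could look. The `people` frame and the choice of key columns are hypothetical, and a repeated name and job title does not necessarily mean the same person.

```python
import pandas as pd

# Hypothetical rows: the first two share name and job but differ in pay
people = pd.DataFrame({
    "last_name": ["Doe", "Doe", "Roe"],
    "first_name": ["Alex", "Alex", "Sam"],
    "job": ["Analyst", "Analyst", "Nurse"],
    "gross_pay": [50000.0, 1200.0, 64000.0],
})

# Full-row check: no row is identical to another in every column
full_row_duplicates = people.duplicated().any()

# Subset check: keep=False marks every row that shares the key columns
key_columns = ["last_name", "first_name", "job"]
repeated_keys = people[people.duplicated(subset=key_columns, keep=False)]

print(f"Any full-row duplicates: {full_row_duplicates}")
print(repeated_keys)
```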

# Names
## Check if all names are consistent
Check if all names start with a capital letter

In [11]:
# Function to check if a string starts with a capital letter
def starts_with_capital(name):
    return name[0].isupper()

# Check if all first names start with a capital letter
first_names_check = df['first_name'].apply(starts_with_capital)
all_first_names_capitalized = first_names_check.all()

# Check if all last names start with a capital letter
last_names_check = df['last_name'].apply(starts_with_capital)
all_last_names_capitalized = last_names_check.all()

# Report results
print(f"All first names start with a capital letter: {all_first_names_capitalized}")
print(f"All last names start with a capital letter: {all_last_names_capitalized}")

# Report any anomalies
if not all_first_names_capitalized:
    print("First names with issues:")
    print(df[~first_names_check]['first_name'])

if not all_last_names_capitalized:
    print("Last names with issues:")
    print(df[~last_names_check]['last_name'])

All first names start with a capital letter: False
All last names start with a capital letter: False
First names with issues:
1074                sam
1411              jason
2032              andre
2287               jeff
2911             louiza
3128               judy
6427                yan
6557               omer
6840              brett
8392              kevin
8553              peter
8822               john
9065               tish
9214              kathy
9372           laurence
9720              kezia
10216         zhangfang
10227              john
11324              kojo
12027            joseph
12915              greg
13995             kelly
14208          clarence
14437             frank
15347              tony
15698               ian
16425             nadia
16715            heaven
19562            senait
19983            selina
20180            calvin
20431              tara
20877               joe
21056         zachariah
21070              isha
22088              bill
22133     

## Capitalize first letters of the names

In [12]:
# Title-case the names so that each word starts with a capital letter
df['first_name'] = df['first_name'].str.title()
df['last_name'] = df['last_name'].str.title()
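
One side effect worth knowing: `Series.str.title()` applies Python's `str.title()` to each element, which also lowercases every letter that is not at the start of a word. The sketch below uses plain Python strings and made-up example names to compare it with a hypothetical helper, `capitalize_first_only`, that touches only the first character. It illustrates the trade-off and is not a recommendation to change the step above.

```python
def capitalize_first_only(name: str) -> str:
    """Uppercase the first character and leave the rest unchanged."""
    return name[:1].upper() + name[1:]

# Made-up example names covering a few typical cases
names = ["sam", "McKenna", "AJ", "kerry-ann"]

# str.title() capitalizes each word and lowercases the remaining letters
title_cased = [name.title() for name in names]

# The helper keeps internal capitals but does not fix later words
first_only = [capitalize_first_only(name) for name in names]

print(title_cased)
print(first_only)
```

With `str.title()`, internal capitals are lost but hyphenated parts are capitalized. The helper keeps internal capitals but leaves the second part of a hyphenated name as it was.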

In [13]:
# Function to check if a string starts with a capital letter
def starts_with_capital(name):
    return name[0].isupper()

# Check if all first names start with a capital letter
first_names_check = df['first_name'].apply(starts_with_capital)
all_first_names_capitalized = first_names_check.all()

# Check if all last names start with a capital letter
last_names_check = df['last_name'].apply(starts_with_capital)
all_last_names_capitalized = last_names_check.all()

# Report results
print(f"All first names start with a capital letter: {all_first_names_capitalized}")
print(f"All last names start with a capital letter: {all_last_names_capitalized}")

# Report any anomalies
if not all_first_names_capitalized:
    print("First names with issues:")
    print(df[~first_names_check]['first_name'])

if not all_last_names_capitalized:
    print("Last names with issues:")
    print(df[~last_names_check]['last_name'])

All first names start with a capital letter: False
All last names start with a capital letter: True
First names with issues:
31280    .
Name: first_name, dtype: object


In [14]:
df.query('first_name == "."')

Unnamed: 0,last_name,first_name,job,cost_center,hierarchy,position_group,regular_pay,bonus,overtime,other,gross_pay
31280,Nitin,.,Researcher 2,CC12851 Medicine | Surgery Administration,Medicine CCH6,University,25865.99,0.0,0.0,1192.67,27058.66


# Create new data

## Full name
It might be useful to create a column with the full names for performing some text-based analyses later.

In [15]:
df['full_name'] = df['first_name'] + " " + df['last_name']

In [16]:
df.head(2)

Unnamed: 0,last_name,first_name,job,cost_center,hierarchy,position_group,regular_pay,bonus,overtime,other,gross_pay,full_name
0,Green,Eric,Professor - Clinical,CC10587 Veterinary Medicine | Veterinary Clini...,Veterinary Medicine CCH6,University,181225.86,29332.76,0.0,3000.0,213558.62,Eric Green
1,Brew,Chris,Visiting Associate Professor,CC11786 Engineering | Computer Science and Eng...,Engineering CCH6,University,7429.9,0.0,0.0,0.0,7429.9,Chris Brew


## Gender
I want to engineer a gender feature from the first names in the dataset. I'm going to use the gender-guesser library, which maps first names to likely genders. Keep in mind that gender detection based on first names is not always accurate.

In [17]:
# Some first names consist of two words, which gender-guesser can't place, so split the column and look up only the first word.
df1 = df['first_name'].str.split().str[0]

In [18]:
import gender_guesser.detector as gender

# Initialize the gender detector
d = gender.Detector()

# Guess gender from first names
def guess_gender(first_name):
    # Use a local name that does not shadow the imported `gender` module
    guess = d.get_gender(first_name)
    # Standardize the output of the gender detector
    if guess in ['male', 'mostly_male']:
        return 'Male'
    elif guess in ['female', 'mostly_female']:
        return 'Female'
    else:
        # Anything else (e.g. 'andy' for androgynous names, or 'unknown')
        return 'Unknown'

df['gender'] = df1.apply(guess_gender)

In [19]:
df.gender.value_counts()

gender
Female     24400
Male       17113
Unknown     6143
Name: count, dtype: int64
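
To put these counts in perspective, they can be converted to percentages. The sketch below rebuilds the counts shown above as a small Series so that it runs on its own. On the real frame, `df["gender"].value_counts(normalize=True)` would return the fractions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"Female": 24400, "Male": 17113, "Unknown": 6143}, name="count")

# Share of each category as a percentage of all rows
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```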

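### Fill in unknown gender for people whose gross pay is > $400,000
Because information about people with extremely high salaries might be important for the analysis, I googled their gender. Note that assignments matched on `first_name` apply to every employee with that first name, not only to the high earners.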

In [20]:
df[(df["gender"] == "Unknown") & (df["gross_pay"] > 400000)]

Unnamed: 0,last_name,first_name,job,cost_center,hierarchy,position_group,regular_pay,bonus,overtime,other,gross_pay,full_name,gender
912,Hou,Kewei,Professor (9M),CC11600 Fisher College | Finance,Business CCH6,University,335349.26,0.0,0.0,89194.79,424544.05,Kewei Hou,Unknown
993,Backes,Floor,Professor,CC12858 Medicine | Obstetrics and Gynecology,Medicine CCH6,University,487446.67,98960.42,0.0,956.67,587363.76,Floor Backes,Unknown
1229,Paskett,Electra,Professor,CC11287 Medicine | IM Cancer Prevention and Co...,Medicine CCH6,University,339704.05,70201.78,0.0,0.0,409905.83,Electra Paskett,Unknown
1345,Satyapriya,Sree,Associate Professor - Clinical,CC12848 Medicine | Anesthesiology,Medicine CCH6,University,424075.08,31899.59,0.0,58199.99,514174.66,Sree Satyapriya,Unknown
1482,Ramaswamy,Bhuvaneswari,Professor - Clinical,CC11299 Medicine | IM Medical Oncology,Medicine CCH6,University,390553.36,19679.09,0.0,0.0,410232.45,Bhuvaneswari Ramaswamy,Unknown
1653,Thariani,Rachit,"Chief Administrative Officer, Post-Acute & Hom...",CC92601 Health System Shared Services | Popula...,Health System | Shared Services CCH6,Health System,451266.0,97781.74,0.0,0.0,549047.74,Rachit Thariani,Unknown
3557,Baliga,Ragavendra,Professor - Clinical,CC11283 Medicine | IM Cardiovascular Medicine,Medicine CCH6,University,388198.56,36748.23,0.0,0.0,424946.79,Ragavendra Baliga,Unknown
3801,Narula,Vimal,Professor - Clinical,CC12866 Medicine | Surgery General,Medicine CCH6,University,391297.39,24955.11,0.0,0.0,416252.5,Vimal Narula,Unknown
4668,Erel,Isil,Professor (9M),CC11600 Fisher College | Finance,Business CCH6,University,359374.99,0.0,0.0,119070.71,478445.7,Isil Erel,Unknown
4680,Sandhu,Gurneet,Assistant Professor - Clinical,CC12848 Medicine | Anesthesiology,Medicine CCH6,University,403542.36,51496.94,0.0,262871.75,717911.05,Gurneet Sandhu,Unknown


In [21]:
df.loc[df["first_name"] == "Nahush", "gender"] = "Male"
df.loc[df["first_name"] == "Asvin", "gender"] = "Male"
df.loc[df["first_name"] == "Zihai", "gender"] = "Male"
df.loc[df["first_name"] == "Ana Suelves", "gender"] = "Female"
df.loc[df["first_name"] == "Kinh Luan", "gender"] = "Male"
df.loc[df["first_name"] == "Yiping", "gender"] = "Male"
df.loc[df["first_name"] == "Kartik", "gender"] = "Male"
df.loc[df["first_name"] == "Vishnu", "gender"] = "Male"
df.loc[df["first_name"] == "Somashekar", "gender"] = "Male"
df.loc[df["first_name"] == "Sonu", "gender"] = "Male"
df.loc[df["first_name"] == "Gates", "gender"] = "Male"
df.loc[df["first_name"] == "Umair", "gender"] = "Male"
df.loc[df["first_name"] == "Kanu", "gender"] = "Male"
df.loc[df["first_name"] == "Lanla", "gender"] = "Female"
df.loc[df["first_name"] == "Gehan", "gender"] = "Female"
df.loc[df["first_name"] == "M. Rizwan", "gender"] = "Male"
df.loc[df["first_name"] == "Fatoumata", "gender"] = "Female"
df.loc[df["first_name"] == "Floor", "gender"] = "Female"
df.loc[df["first_name"] == "Jose A", "gender"] = "Female"
df.loc[df["first_name"] == "Rachit", "gender"] = "Male"
df.loc[df["first_name"] == "Gurneet", "gender"] = "Male"
df.loc[df["first_name"] == "Zarine", "gender"] = "Female"
df.loc[df["first_name"] == "Taimur", "gender"] = "Male"
df.loc[df["first_name"] == "Amna", "gender"] = "Female"
df.loc[df["first_name"] == "Arnab", "gender"] = "Male"
df.loc[df["first_name"] == "Arpit", "gender"] = "Male"
df.loc[df["first_name"] == "Sabrena", "gender"] = "Female"
df.loc[df["first_name"] == "Lu", "gender"] = "Male"
df.loc[df["first_name"] == "Qian", "gender"] = "Female"
df.loc[df["first_name"] == "Casey", "gender"] = "Male"
df.loc[df["first_name"] == "Sree", "gender"] = "Female"
df.loc[df["full_name"] == "Pat Schneider", "gender"] = "Male"
df.loc[df["first_name"] == "Kimmy", "gender"] = "Female"
df.loc[df["first_name"] == "Vikram", "gender"] = "Male"
df.loc[df["full_name"] == "Ling Hu", "gender"] = "Male"
df.loc[df["full_name"] == "Yun Xia", "gender"] = "Male"
df.loc[df["first_name"] == "Plato", "gender"] = "Male"
df.loc[df["first_name"] == "Shang-Jui", "gender"] = "Male"
df.loc[df["first_name"] == "Xiaoyi", "gender"] = "Female"
df.loc[df["first_name"] == "Archana Pahlaj", "gender"] = "Female"
df.loc[df["first_name"] == "Varun", "gender"] = "Male"
df.loc[df["first_name"] == "Liang-Shih", "gender"] = "Male"
df.loc[df["first_name"] == "Hasel", "gender"] = "Male"
df.loc[df["first_name"] == "Spero", "gender"] = "Male"
df.loc[df["first_name"] == "Sorabh", "gender"] = "Male"
df.loc[df["first_name"] == "Jr", "gender"] = "Male"
df.loc[df["first_name"] == "Demicha", "gender"] = "Female"
df.loc[df["first_name"] == "Xen", "gender"] = "Male"
df.loc[df["first_name"] == "Chyke", "gender"] = "Male"
df.loc[df["first_name"] == "Shaoli", "gender"] = "Female"
df.loc[df["first_name"] == "Waqas", "gender"] = "Male"
df.loc[df["first_name"] == "Shang-Jui Wang", "gender"] = "Male"
df.loc[df["first_name"] == "Adeeti", "gender"] = "Female"
df.loc[df["first_name"] == "Kerry-Ann", "gender"] = "Female"
df.loc[df["first_name"] == "Thangam", "gender"] = "Female"
df.loc[df["first_name"] == "Carroll Ann", "gender"] = "Female"
df.loc[df["first_name"] == "Yuchi", "gender"] = "Female"
df.loc[df["first_name"] == "Emmanuel", "gender"] = "Male"
df.loc[df["first_name"] == "Lambros", "gender"] = "Male"
df.loc[df["first_name"] == "Chathur", "gender"] = "Male"
df.loc[df["first_name"] == "Mhd Ezzat", "gender"] = "Male"
df.loc[df["first_name"] == "Electra", "gender"] = "Female"
df.loc[df["first_name"] == "Rukya", "gender"] = "Female"
df.loc[df["first_name"] == "Joici", "gender"] = "Female"
df.loc[df["first_name"] == "Thura", "gender"] = "Male"
df.loc[df["first_name"] == "Cathann", "gender"] = "Female"
df.loc[df["first_name"] == "Sharukh", "gender"] = "Male"
df.loc[df["full_name"] == "Lang Li", "gender"] = "Male"
df.loc[df["first_name"] == "Saurabh", "gender"] = "Male"
df.loc[df["first_name"] == "Musab", "gender"] = "Male"
df.loc[df["first_name"] == "Ritesh", "gender"] = "Male"
df.loc[df["first_name"] == "Sujith", "gender"] = "Male"
df.loc[df["first_name"] == "Subhankar", "gender"] = "Male"
df.loc[df["full_name"] == "Kewei Hou", "gender"] = "Male"
df.loc[df["first_name"] == "Bhuvaneswari", "gender"] = "Female"
df.loc[df["first_name"] == "Ragavendra", "gender"] = "Male"
df.loc[df["first_name"] == "Vimal", "gender"] = "Male"
df.loc[df["first_name"] == "Isil", "gender"] = "Female"
df.loc[df["full_name"] == "Ness Shroff", "gender"] = "Male"
df.loc[df["first_name"] == "C Nicholas", "gender"] = "Male"
df.loc[df["first_name"] == "Jill G", "gender"] = "Female"
df.loc[df["first_name"] == "Sakima", "gender"] = "Male"
df.loc[df["first_name"] == "Bingfeng", "gender"] = "Male"
df.loc[df["first_name"] == "Gbemiga", "gender"] = "Male"
df.loc[df["first_name"] == "Xuan", "gender"] = "Male"
df.loc[df["full_name"] == "Na Li", "gender"] = "Female"
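
The long list of `df.loc` assignments above could also be expressed as lookup dictionaries, which makes duplicates easier to spot. This is a minimal sketch of that idea, not the method used in this notebook. The `toy` frame and the names `first_name_overrides` and `full_name_overrides` are hypothetical, "Paige Doe" is a made-up row, and only a few of the assignments above are reproduced.

```python
import pandas as pd

# Hypothetical stand-in for `df` with a handful of rows
toy = pd.DataFrame({
    "first_name": ["Floor", "Kewei", "Rachit", "Paige"],
    "last_name": ["Backes", "Hou", "Thariani", "Doe"],
    "gender": ["Unknown"] * 4,
})
toy["full_name"] = toy["first_name"] + " " + toy["last_name"]

# Manual overrides, keyed by first name or by full name
first_name_overrides = {"Floor": "Female", "Rachit": "Male"}
full_name_overrides = {"Kewei Hou": "Male"}

# Apply each dictionary; rows without a match keep their current value
for column, overrides in [("first_name", first_name_overrides),
                          ("full_name", full_name_overrides)]:
    mapped = toy[column].map(overrides)
    toy["gender"] = mapped.fillna(toy["gender"])

print(toy[["full_name", "gender"]])
```

As with the `df.loc` version, a first-name override affects every row with that first name, so full-name keys are the safer choice for ambiguous names.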

In [22]:
df.gender.value_counts()

gender
Female     24438
Male       17239
Unknown     5979
Name: count, dtype: int64

### Check if there are other names we can categorize by gender

In [23]:
unknown = df[df["gender"] == "Unknown"]

In [24]:
unknown.first_name.value_counts()

first_name
Paige                            44
Jackie                           41
Pat                              17
Dominique                        14
Kendall                          13
Aubrey                           12
Gabby                            12
Jo                               12
Peyton                           11
Jing                             11
Summer                           10
Aj                               10
Mariama                           9
Yang                              9
Tj                                9
Avery                             8
Wei                               8
Lei                               8
Zac                               8
Skyler                            7
Carey                             7
Min                               7
Addison                           7
Shea                              7
Payton                            7
Lesley                            7
Abhishek                          6
Ming             

In [25]:
df.loc[df["first_name"] == "Paige", "gender"] = "Female"
df.loc[df["first_name"] == "Jackie", "gender"] = "Female"
df.loc[df["first_name"] == "Aubrey", "gender"] = "Female"
df.loc[df["first_name"] == "Gabby", "gender"] = "Female"
df.loc[df["first_name"] == "Summer", "gender"] = "Female"
df.loc[df["first_name"] == "Mariama", "gender"] = "Female"
df.loc[df["first_name"] == "Zac", "gender"] = "Male"
df.loc[df["first_name"] == "Shea", "gender"] = "Female"
df.loc[df["first_name"] == "Payton", "gender"] = "Female"
df.loc[df["first_name"] == "Mckenna", "gender"] = "Female"
df.loc[df["first_name"] == "Bri", "gender"] = "Female"
df.loc[df["first_name"] == "Catie", "gender"] = "Female"
df.loc[df["first_name"] == "Latoya", "gender"] = "Female"
df.loc[df["first_name"] == "Skylar", "gender"] = "Female"
df.loc[df["first_name"] == "Erinn", "gender"] = "Female"
df.loc[df["first_name"] == "Bekah", "gender"] = "Female"
df.loc[df["first_name"] == "Gabbie", "gender"] = "Female"
df.loc[df["first_name"] == "Dreama", "gender"] = "Female"
df.loc[df["first_name"] == "Celine", "gender"] = "Female"
df.loc[df["first_name"] == "Destany", "gender"] = "Female"
df.loc[df["first_name"] == "Dannielle", "gender"] = "Female"
df.loc[df["first_name"] == "Marylynn", "gender"] = "Female"
df.loc[df["first_name"] == "Courtney", "gender"] = "Female"
df.loc[df["first_name"] == "Frank", "gender"] = "Male"
df.loc[df["first_name"] == "Katie", "gender"] = "Female"
df.loc[df["first_name"] == "Page", "gender"] = "Female"
df.loc[df["first_name"] == "Tiffany", "gender"] = "Female"
df.loc[df["first_name"] == "Jermane", "gender"] = "Male"
df.loc[df["first_name"] == "Marlina", "gender"] = "Female"
df.loc[df["first_name"] == "Maresa", "gender"] = "Female"
df.loc[df["first_name"] == "Angel-Raphaela", "gender"] = "Female"
df.loc[df["first_name"] == "Karolyn", "gender"] = "Female"
df.loc[df["first_name"] == "Tish", "gender"] = "Female"
df.loc[df["first_name"] == "Gregory Cj", "gender"] = "Male"

In [26]:
df.gender.value_counts()

gender
Female     24639
Male       17248
Unknown     5769
Name: count, dtype: int64

In [27]:
# Drop the rows whose gender is still unknown
df.drop(df[df.gender == 'Unknown'].index, inplace=True) 
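
According to the counts above, this step removes 5,769 of 47,656 rows (about 12%), which is worth keeping in mind when interpreting later results. As a hedged alternative sketch, the same filter can be written with a boolean mask, and `reset_index(drop=True)` makes the index contiguous again. The `toy` frame is hypothetical and "Kendall Doe" is a made-up row.

```python
import pandas as pd

# Hypothetical stand-in for `df`
toy = pd.DataFrame({
    "full_name": ["Eric Green", "Kendall Doe", "Chris Brew"],
    "gender": ["Male", "Unknown", "Male"],
})

rows_before = len(toy)

# Keep only rows with a known gender and renumber the index from 0
known = toy[toy["gender"] != "Unknown"].reset_index(drop=True)

rows_dropped = rows_before - len(known)
share_dropped = rows_dropped / rows_before
print(f"Dropped {rows_dropped} of {rows_before} rows ({share_dropped:.0%})")
```

Without the reset, the index keeps its original labels with gaps where rows were removed, which matters only if later code relies on positional labels.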

# Drop columns and change the column order


In [28]:
df.drop(['last_name', 'first_name'], axis=1, inplace=True)

In [29]:
df = df[['full_name', 'gender', 'position_group', 'job', 'hierarchy', 'cost_center',
         'regular_pay', 'bonus', 'other', 'overtime', 'gross_pay']]

In [30]:
df.head(2)

Unnamed: 0,full_name,gender,position_group,job,hierarchy,cost_center,regular_pay,bonus,other,overtime,gross_pay
0,Eric Green,Male,University,Professor - Clinical,Veterinary Medicine CCH6,CC10587 Veterinary Medicine | Veterinary Clini...,181225.86,29332.76,3000.0,0.0,213558.62
1,Chris Brew,Male,University,Visiting Associate Professor,Engineering CCH6,CC11786 Engineering | Computer Science and Eng...,7429.9,0.0,0.0,0.0,7429.9


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41887 entries, 0 to 47655
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   full_name       41887 non-null  object 
 1   gender          41887 non-null  object 
 2   position_group  41887 non-null  object 
 3   job             41887 non-null  object 
 4   hierarchy       41887 non-null  object 
 5   cost_center     41887 non-null  object 
 6   regular_pay     41887 non-null  float64
 7   bonus           41887 non-null  float64
 8   other           41887 non-null  float64
 9   overtime        41887 non-null  float64
 10  gross_pay       41887 non-null  float64
dtypes: float64(5), object(6)
memory usage: 3.8+ MB
