# Creating synthetic Core HR records with Python libraries

**Libraries:** 
- pandas
- faker
- random
- datetime

## Composition of the dataset
- employee personal information:
    - first & last name
    - contact information (personal email & phone number, company email & phone number)
    - home address
    - demographic information: gender, ethnicity, age & date of birth
- employment information
    - employee number
    - position details (name, department, type, hierarchy level)
    - work location (province of employment, city)
    - tenure information (start & termination date if applicable)
- basic compensation & total rewards information
    - standard working hours & eligibility for overtime
    - base salary (hourly and annually)
    - benefit group

## Sources for the distribution weights
- gender weights: [per Canada's 2021 Census](https://publications.gc.ca/collections/collection_2022/statcan/98-500-x/98-500-x2021014-eng.pdf), the distribution of gender was:
<table>
    <tr>
        <th>Gender</th>
        <th>% of population</th>
    </tr>
    <tr>
        <td>men</td>
        <td>49.21%</td>
    </tr>
    <tr>
        <td>women</td>
        <td>50.66%</td>
    </tr>
    <tr>
        <td>non-binary</td>
        <td>0.13%</td>
    </tr>
</table>
- province or territory of employment weights: [per Statistics Canada's population estimates by province and territory for Q3 2024, released on September 25, 2024](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1710000901)
<table>
    <tr>
        <th>Province</th>
        <th>% of population</th>
    </tr>
    <tr>
        <td>Alberta</td>
        <td>11.84%</td>
    </tr>
    <tr>
        <td>British Columbia</td>
        <td>13.80%</td>
    </tr>
    <tr>
        <td>Manitoba</td>
        <td>3.62%</td>
    </tr>
    <tr>
        <td>New Brunswick</td>
        <td>2.07%</td>
    </tr>
    <tr>
        <td>Newfoundland and Labrador</td>
        <td>1.32%</td>
    </tr>
    <tr>
        <td>Northwest Territories</td>
        <td>0.11%</td>
    </tr>
    <tr>
        <td>Nova Scotia</td>
        <td>2.61%</td>
    </tr>
    <tr>
        <td>Nunavut</td>
        <td>0.10%</td>
    </tr>
    <tr>
        <td>Ontario</td>
        <td>39.05%</td>
    </tr>
    <tr>
        <td>Prince Edward Island</td>
        <td>0.43%</td>
    </tr>
    <tr>
        <td>Quebec</td>
        <td>21.93%</td>
    </tr>
    <tr>
        <td>Saskatchewan</td>
        <td>3.00%</td>
    </tr>
    <tr>
        <td>Yukon</td>
        <td>0.11%</td>
    </tr>
</table>
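- hierarchy level weights: per this [ratio of individual contributors/managers/directors](https://ravio.com/blog/efective-management-structures-how-to-know-if-your-company-is-too-top-heavy#); the code labels these levels Support, Professional, Director and Executive, in the same order as the table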
<table>
    <tr>
        <th>Level</th>
        <th>%</th>
    </tr>
     <tr>
        <td>Support</td>
        <td>7%</td>
    </tr>
    <tr>
        <td>Individual contributor</td>
        <td>72%</td>
    </tr>
    <tr>
        <td>Manager/Director</td>
        <td>16%</td>
    </tr>
    <tr>
        <td>Senior Leadership</td>
        <td>5%</td>
    </tr>
</table>
- ethnicity distribution weights per the [demographics of Canada](https://en.wikipedia.org/wiki/Demographics_of_Canada)
<table>
    <tr>
        <th>Ethnicity</th>
        <th>%</th>
    </tr>
    <tr>
        <td>caucasian</td>
        <td>69%</td>
    </tr>
    <tr>
        <td>african descent</td>
        <td>4.3%</td>
    </tr>
    <tr>
        <td>indigenous</td>
        <td>5%</td>
    </tr>
    <tr>
        <td>bi-racial</td>
        <td>3%</td>
    </tr>
    <tr>
        <td>hispanic</td>
        <td>1.6%</td>
    </tr>
    <tr>
        <td>pacific islander</td>
        <td>2.6%</td>
    </tr>
    <tr>
        <td>middle eastern</td>
        <td>1.9%</td>
    </tr>
    <tr>
        <td>asian</td>
        <td>16%</td>
    </tr>
</table>
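
`random.choices` treats its weights as relative values, so a table of percentages does not have to sum to exactly 100% to be usable. The sketch below is only an illustration of that behaviour with the gender figures from the first table; the seed and the sample size are arbitrary assumptions, and the variable names are not part of the notebook.

```python
import random
from collections import Counter

# Gender shares from the 2021 Census table above, expressed as fractions.
gender_weights = {"men": 0.4921, "women": 0.5066, "non-binary": 0.0013}

# random.choices only needs relative weights, but checking the total
# helps catch transcription errors in a table.
total = sum(gender_weights.values())
normalized = {label: w / total for label, w in gender_weights.items()}

rng = random.Random(42)  # arbitrary seed, only so the run is repeatable
draws = rng.choices(
    population=list(gender_weights.keys()),
    weights=list(gender_weights.values()),
    k=10_000,
)
observed = {label: n / len(draws) for label, n in Counter(draws).items()}

print(f"sum of weights: {total:.4f}")
for label in gender_weights:
    # target share vs. share actually observed in the sample
    print(label, round(normalized[label], 4), round(observed.get(label, 0.0), 4))
```

With a sample of this size the observed shares should sit close to the targets, although the rarest category can vary noticeably from run to run.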

In [1]:
# importing libraries
from faker import Faker
import pandas as pd

# creating a Faker instance in Canada
fake = Faker(locale='en_CA')

# importing the random module for weighted choices
import random

# importing datetime for time intelligence
from datetime import datetime, timedelta

In [2]:
# instancing hierarchy levels with their weights
hierarchy = {
    "Support": 0.07,
    "Professional": 0.72,
    "Director": 0.16,
    "Executive": 0.05
}

# instancing genders with their weights
genders = {
    "Man": 0.4921,
    "Woman": 0.5066,
    "Person": 0.0013
}

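# instancing ethnicities with their weights
# note: Caucasian uses 0.656 rather than the 69% listed in the table above;
# with 0.656 the weights sum to 1.0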
ethnicities = {
    "Caucasian": 0.656,
    "African descent": 0.043,
    "Indigenous": 0.05,
    "Bi-racial": 0.03,
    "Hispanic": 0.016,
    "Pacific Islander": 0.026,
    "Middle Eastern": 0.019,
    "Asian": 0.16
}

# creating the weights for the provinces
province_pool = ['Alberta', 'British Columbia', 'Manitoba', 'New Brunswick',
                 'Newfoundland and Labrador', 'Northwest Territories',
                 'Nova Scotia', 'Nunavut', 'Ontario', 'Prince Edward Island',
                 'Quebec', 'Saskatchewan', 'Yukon']
province_acronyms = ['AB', 'BC', 'MB', 'NB', 'NL', 'NT', 'NS', 'NU', 'ON',
                     'PE', 'QC', 'SK', 'YT']
province_with_acronyms = list(zip(province_pool, province_acronyms))
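# weights follow the order of province_pool; random.choices treats them as
# relative, so they do not need to sum to exactly 1
province_weights = [0.1184, 0.1380, 0.0362, 0.0207, 0.0132, 0.0011, 0.0261,
                    0.0010, 0.3905, 0.0043, 0.2193, 0.0300, 0.0011]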

# creating the list of departments
department_pool = ["Human Capital Strategy", "Financial Strategy & Analysis",
                   "Communications & Brand Strategy", "Service Delivery & Operations",
                   "Technology Solutions & Services", "Client Experience & Engagement",
                   "Regulatory Compliance & Risk Management", "Portfolio Management & Execution",
                   "Corporate Services & Solutions"]
department_acronyms = ["HCS", "FSA", "CBS", "SDO", "TSS", "CEE", "RCRM", "PME", "CSS"]
department_with_acronyms = list(zip(department_pool, department_acronyms))

# creating employee types with their weights
ee_types = {
    "Full Time - Regular": 0.97,
    "Full Time - Temporary": 0.01,
    "Part Time - Regular": 0.01,
    "Part Time - Temporary": 0.01
}

# creating basic salary bands & benefits groups per hierarchy level
salary_bands = {
    'Support': {'Minimum': 45000, 'Midpoint': 55000, 'Maximum': 65000,
               'Benefits': 'Class 4'},
    'Professional': {'Minimum': 60000, 'Midpoint': 70000, 'Maximum': 80000,
                    'Benefits': 'Class 3'},
    'Director': {'Minimum': 90000, 'Midpoint': 120000, 'Maximum': 140000,
                'Benefits': 'Class 2'},
    'Executive': {'Minimum': 130000, 'Midpoint': 180000, 'Maximum': 250000,
                 'Benefits': 'Class 1'}
}

In the separate file 'job_titles.csv', sample job titles are compiled per hierarchy level and department, in the format of the examples below.
<table>
    <tr>
        <th>Department</th>
        <th>Hierarchy Level</th>
        <th>Job Title</th>
    </tr>
    <tr>
        <td>Corporate Services & Solutions</td>
        <td>Support</td>
        <td>Receptionist</td>
    </tr>
    <tr>
        <td>Client Experience & Engagement</td>
        <td>Professional</td>
        <td>Account Manager</td>
    </tr>
    <tr>
        <td>Human Capital Strategy</td>
        <td>Director</td>
        <td>Director of Talent Acquisition</td>
    </tr>
    <tr>
        <td>Financial Strategy & Analysis</td>
        <td>Executive</td>
        <td>Executive Director of Finance</td>
    </tr> 
</table>
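
For illustration, rows in this shape can also be indexed by (department, hierarchy level) so that picking a title is a single dictionary access. This is a sketch of an alternative to filtering a dataframe, not the approach the notebook uses; `titles_by_key` and `pick_title` are hypothetical names, and the rows are just the four examples from the table.

```python
import random
from collections import defaultdict

# Rows in the same (department, hierarchy level, job title) shape as the
# examples above; the real file holds many more rows.
rows = [
    ("Corporate Services & Solutions", "Support", "Receptionist"),
    ("Client Experience & Engagement", "Professional", "Account Manager"),
    ("Human Capital Strategy", "Director", "Director of Talent Acquisition"),
    ("Financial Strategy & Analysis", "Executive", "Executive Director of Finance"),
]

# Index titles by (department, level) so each lookup is a dict access.
titles_by_key = defaultdict(list)
for department, level, title in rows:
    titles_by_key[(department, level)].append(title)


def pick_title(department, level, rng=random):
    """Return a random title for the pair, or None when no title exists."""
    candidates = titles_by_key.get((department, level))
    return rng.choice(candidates) if candidates else None


print(pick_title("Human Capital Strategy", "Director"))
print(pick_title("Human Capital Strategy", "Support"))
```

Returning `None` for a missing pair makes gaps in the title list easy to spot in the generated data, instead of storing an error message as a job title.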

In [3]:
# loading the job titles into a dataframe
job_titles = pd.read_csv('job_titles.csv')
job_titles.sample(5)

Unnamed: 0,Department,Hierarchy Level,Job Title
113,Client Experience & Engagement,Executive,Vice President of Customer Experience
37,Financial Strategy & Analysis,Executive,Senior Vice President of Finance
83,Technology Solutions & Services,Professional,Network Administrator
97,Service Delivery & Operations,Executive,Chief Operating Officer
124,Regulatory Compliance & Risk Management,Professional,Regulatory Compliance Specialist


In the separate file 'cities.csv', sample cities per province are compiled to serve as work locations.
Five cities are provided for the largest provinces, three for medium-sized provinces and only the capital for small provinces.
<table>
    <tr>
        <th>Province</th>
        <th>Cities</th>
    </tr>
    <tr>
        <td>Ontario</td>
        <td>Toronto, Ottawa, Mississauga, Brampton, Hamilton</td>
    </tr>
    <tr>
        <td>Alberta</td>
        <td>Calgary, Edmonton, Red Deer</td>
    </tr>
    <tr>
        <td>Nova Scotia</td>
        <td>Halifax</td>
   </tr> 
</table>

In [4]:
# loading the cities into a dataframe
cities = pd.read_csv('cities.csv')
cities.sample(5)

Unnamed: 0,Province,City
12,Alberta,Red Deer
2,Ontario,Mississauga
24,Yukon,Whitehorse
14,British Columbia,Surrey
18,Newfoundland and Labrador,St. John’s


In [5]:
# defining a function to generate random date between two dates
def random_between_dates(start, end):
    delta = end - start
    random_point = random.randint(0, delta.days)
    return start + timedelta(days= random_point)

# defining a function to calculate age either at time of departure or now
def calcul_age(birth_date, reference_date):
    return reference_date.year - birth_date.year - \
            ((reference_date.month, reference_date.day) < \
             (birth_date.month, birth_date.day))

# defining a function to extract a random job title with department + level
def get_job_title(department, hierarchy_level):
    filtered_df = job_titles[(job_titles['Department'] == department) & (job_titles['Hierarchy Level'] == hierarchy_level)]
    if not filtered_df.empty:
        return random.choice(filtered_df['Job Title'].tolist())
    else:
        return 'Invalid department or hierarchy level.'

# defining a function to extract a random city per province
def get_city(province):
    filtered_df = cities[cities['Province'] == province]
    if not filtered_df.empty:
        return random.choice(filtered_df['City'].tolist())
    else:
        return 'Invalid province.'

# defining a function to extract salary and benefits per level
def get_salary_benefits(level):
    level_info = salary_bands.get(level)
    if level_info:
        annual_comp = fake.random_int(
            min= level_info['Minimum'],
            max= level_info['Maximum'],
            step= 1000)
        benefits = level_info['Benefits']
        return annual_comp, benefits
    else:
        raise ValueError('Invalid level provided')

# defining constants
_year = 365.25 # days, as we're taking into account leap years
_current_date = datetime.now()
_weeks_perYear = 52

In [6]:
# creating a function to generate employee records
def create_employees(num_employees):
    employee_list = []
    for i in range(num_employees):
        employee = {}
        employee['ee#'] = 10000000+i
        employee['first_name'] = fake.first_name()
        employee['last_name'] = fake.last_name()
        employee['full_name'] = f"{employee['last_name']}, {employee['first_name']}"
          
        # instancing random choices for DEI variables
        level = random.choices(population= list(hierarchy.keys()),
                               weights= list(hierarchy.values()),
                               k= 1)[0]
        department, d_acronym = random.choices(department_with_acronyms,
                                               k= 1)[0]
        position = get_job_title(department, level)
        type = random.choices(population= list(ee_types.keys()),
                              weights= list(ee_types.values()),
                              k= 1)[0]
        work_hours = 37.5 if "Full Time" in type else 24
        overtime = "Regular" not in type
        annual_comp, benefits = get_salary_benefits(level)
        hour_comp = annual_comp / (work_hours * _weeks_perYear)
        gender = random.choices(population= list(genders.keys()),
                               weights= list(genders.values()),
                               k= 1)[0]
        ethnicity = random.choices(population= list(ethnicities.keys()),
                                   weights= list(ethnicities.values()),
                                   k= 1)[0]
        province, p_acronym = random.choices(province_with_acronyms,
                                           weights= province_weights,
                                           k= 1)[0]
        work_location = get_city(province)
        
        # generating appropriate dates
        '''
        a birth date is drawn first from the last 100 years.
        from the birth date, a start date is drawn so that the employee is between
        25 and 60 years old when joining the company, and never later than the
        current date. if the employee is not yet 25 today, the date range is empty,
        the draw raises a ValueError and the whole sequence is retried.

        from the start date, a termination date is drawn with a lower bound equal
        to the start date (the employee left on their first day) and an upper bound
        equal to the earliest of the current date (which is converted to None),
        40 years of tenure, or age 70.
        to keep enough employees active while respecting those retirement limits,
        a random probability is then applied to nullify the termination date.
        '''
        
        while True:
            try:
                birth_date = random_between_dates(
                            _current_date - timedelta(days= _year * 100),
                            _current_date)
                min_start_date = birth_date + timedelta(days= _year*25)
                max_start_date = min(_current_date,
                             birth_date + timedelta(days= _year*60))
                start_date = random_between_dates(min_start_date, 
                                                  max_start_date)
                if start_date > _current_date:
                    raise ValueError
                min_term_date = start_date
                max_term_date = min(_current_date,
                                    start_date + timedelta(days=_year*40),
                                    birth_date + timedelta(days=_year*70))
                term_date = random_between_dates(min_term_date,
                                                 max_term_date)
                if term_date == _current_date:
                    term_date = None
                # keeping the drawn termination date for about 20% of records;
                # the others become active if the tenure and age limits allow it
                generate_term = random.choices([True, False],
                                               weights= [20,80],
                                               k= 1)[0]
                if not generate_term:
                    potential_tenure = (_current_date - start_date).days \
                                        // _year
                    potential_age = (_current_date - birth_date).days \
                                        // _year
                    if potential_tenure <= 40 and potential_age <= 70:
                        term_date = None
                break
            except ValueError:
                continue
                    
        # calculating tenure
        if term_date is None:
            tenure = (_current_date - start_date).days / _year
        else:
            tenure = (term_date - start_date).days / _year
        tenure = round(tenure, 1)
        
        # calculating age at time of departure if terminated
        if term_date is None:
            age = (_current_date - birth_date).days / _year
        else:
            age = (term_date - birth_date).days / _year
        age = round(age, 1)
              
        # instancing the dataframe's columns
        employee['personal_email'] = f"{employee['first_name']}{employee['last_name'][0]}@example.ca"
        employee['personal_phone'] = fake.phone_number()
        employee['company_email'] = f"{employee['first_name']}.{employee['last_name']}@company-abc.ca"
        employee['company_phone'] = fake.phone_number()
        employee['address'] = f"{fake.street_address()}, {fake.city()}, {p_acronym} {fake.postalcode_in_province(p_acronym)}"
        employee['position'] = position
        employee['position_type'] = type
        employee['standard_hours'] = work_hours
        employee['eligibility_overtime'] = overtime
        employee['annual_comp'] = annual_comp
        employee['hour_comp'] = round(hour_comp,2)
        employee['benefit_group'] = benefits
        employee['start_date'] = start_date
        employee['term_date'] = term_date
        employee['tenure'] = tenure
        employee['birth_date'] = birth_date
        employee['age'] = age
        employee['department'] = department
        employee['dpt_acronym'] = d_acronym
        employee['work_location'] = work_location
        employee['employment_province'] = province
        employee['province_acronym'] = p_acronym
        employee['level'] = level
        employee['gender'] = gender
        employee['ethnicity'] = ethnicity
        employee_list.append(employee)
    return pd.DataFrame(employee_list)
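
The date rules described inside `create_employees` can be restated as a standalone check, which is handy for spot-checking generated rows. The sketch below is one possible reading of those rules, not code from the notebook; `dates_are_consistent` and `DAYS_PER_YEAR` are hypothetical names, and the sample dates are invented for the example.

```python
from datetime import datetime, timedelta

DAYS_PER_YEAR = 365.25  # same convention as _year above


def years(n):
    """Length of n years as a timedelta, using the 365.25-day convention."""
    return timedelta(days=DAYS_PER_YEAR * n)


def dates_are_consistent(birth_date, start_date, term_date, today):
    """Check one record against the date rules used in create_employees."""
    # Hired between ages 25 and 60, and never in the future.
    if not birth_date + years(25) <= start_date <= birth_date + years(60):
        return False
    if start_date > today:
        return False

    # Active employees must still be within the tenure and age limits.
    if term_date is None:
        tenure_years = (today - start_date).days // DAYS_PER_YEAR
        age_years = (today - birth_date).days // DAYS_PER_YEAR
        return tenure_years <= 40 and age_years <= 70

    # Terminated no later than today, 40 years of tenure, or age 70.
    latest_term = min(today, start_date + years(40), birth_date + years(70))
    return start_date <= term_date <= latest_term


today = datetime(2024, 10, 1)
birth = datetime(1980, 1, 27)
print(dates_are_consistent(birth, datetime(2010, 5, 1), None, today))
print(dates_are_consistent(birth, datetime(2010, 5, 1), datetime(2009, 1, 1), today))
```

The first call describes an active employee hired at 30 and passes; the second has a termination date before the start date and fails.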

In [7]:
# creating a dataframe to hold the output of the function
# and visualize it to check for correct output
records = create_employees(2000)
records.sample(5)

Unnamed: 0,ee#,first_name,last_name,full_name,personal_email,personal_phone,company_email,company_phone,address,position,...,birth_date,age,department,dpt_acronym,work_location,employment_province,province_acronym,level,gender,ethnicity
1807,10001807,Sheila,Rodriguez,"Rodriguez, Sheila",SheilaR@example.ca,(328) 845-5219 x847,Sheila.Rodriguez@company-abc.ca,+1 (987) 298-1686,"759 Nelson Lights, North Madelinestad, QC J3Y9A8",Senior Vice President of Finance,...,1934-03-05 10:17:28.713475,68.4,Financial Strategy & Analysis,FSA,Quebec City,Quebec,QC,Executive,Man,Middle Eastern
194,10000194,Denise,Hernandez,"Hernandez, Denise",DeniseH@example.ca,1 (218) 620-8992,Denise.Hernandez@company-abc.ca,1 (719) 544-8136,"887 Mckee Estates Suite 567, North Tonya, MB R...",Office Services Assistant,...,1980-01-27 10:17:28.713475,44.7,Corporate Services & Solutions,CSS,Winnipeg,Manitoba,MB,Professional,Man,Caucasian
1447,10001447,Alex,Howe,"Howe, Alex",AlexH@example.ca,1 (635) 085-4529,Alex.Howe@company-abc.ca,309-610-1889 x978,"98849 Mcgee Common Suite 548, South Emily, QC ...",Customer Experience Manager,...,1957-05-18 10:17:28.713475,67.4,Client Experience & Engagement,CEE,Longueuil,Quebec,QC,Professional,Man,Caucasian
1926,10001926,David,Harvey,"Harvey, David",DavidH@example.ca,1-201-228-1265,David.Harvey@company-abc.ca,491-266-0742 x186,"25546 Jacqueline Dale, Kennethberg, ON N4C2M2",Office Services Assistant,...,1931-08-05 10:17:28.713475,53.5,Corporate Services & Solutions,CSS,Ottawa,Ontario,ON,Professional,Man,Caucasian
1421,10001421,Ellen,Morton,"Morton, Ellen",EllenM@example.ca,191 335 4270,Ellen.Morton@company-abc.ca,146.025.1830,"6036 Sims Mill Suite 656, South Shannonhaven, ...",Global Compliance Officer,...,1933-09-20 10:17:28.713475,56.0,Regulatory Compliance & Risk Management,RCRM,Winnipeg,Manitoba,MB,Executive,Man,Bi-racial


In [8]:
# export the dataframe to .csv file
records.to_csv('CoreHR_dataset.csv', index=False)