# HR Synthetic Database Generator - Agentic Workflow

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DougTrajano/agentic-ai-workshop/blob/main/hr_synthetic_database.ipynb)

This Jupyter Notebook demonstrates how to generate a synthetic HR database from a user-provided company description using an agentic workflow.

The dataset generation workflow follows these main phases:

1. **Company Specification**: Convert user input into a structured company specification using an LLM.
2. **Demographic Ratios**: Generate demographic ratios based on company characteristics using an LLM.
3. **Database Setup**: Create the database structure with the HR schema.
4. **Business Unit Processing**: For each business unit:
   - Add the business unit and its director job to the database.
   - Generate the director employee (including education, compensation, and storage).
   - Process all departments within the business unit.
5. **Department Processing**: For each department:
   - Add the department and all of its job roles to the database.
   - Generate department employees in parallel batches (manager + staff).
   - Each employee generation includes education determination, compensation calculation, and database storage.
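
The nesting of phases 4 and 5 can be sketched as two loops. This is only an illustration of the traversal order under simplified assumptions: the `plan_generation` helper and the dictionary layout are hypothetical stand-ins rather than the notebook's actual functions, and the real workflow generates department employees in parallel batches instead of sequentially.

```python
def plan_generation(company: dict) -> list[str]:
    """Return the ordered list of steps for a (hypothetical) company spec."""
    steps = []
    for bu in company['business_units']:
        # Phase 4: the business unit and its director come first.
        steps.append(f"add business unit: {bu['name']}")
        steps.append(f"generate director: {bu['director']}")
        for dept in bu['departments']:
            # Phase 5: the department, its manager, then staff per job headcount.
            steps.append(f"add department: {dept['name']}")
            steps.append(f"generate manager: {dept['manager']}")
            for job, headcount in dept['jobs'].items():
                steps.append(f"generate {headcount} x {job}")
    return steps


# Illustrative input only; names and headcounts are made up.
company = {
    'business_units': [
        {
            'name': 'Platform',
            'director': 'Director of Platform',
            'departments': [
                {
                    'name': 'Data Engineering',
                    'manager': 'Data Engineering Manager',
                    'jobs': {'Senior Data Engineer': 2, 'Junior Data Engineer': 3},
                }
            ],
        }
    ]
}

for step in plan_generation(company):
    print(step)
```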


## Set up Environment


### Install Dependencies

In this section, we will install all necessary dependencies required for the notebook to run successfully.

In [3]:
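# Quote the version specifiers so the shell does not treat `>` as output redirection.
%pip install -q "pydantic==2.11" "pydantic-settings==2.11" \
    "duckdb>=1.3" "faker>=37.12" "datasets>=4.0" "huggingface_hub>=0.35" \
    "langchain>=1.0" "langchain-openai>=1.0" "langgraph>=1.0"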

Note: you may need to restart the kernel to use updated packages.


### Define Parameters

We'll configure the path where our DuckDB database will be stored. This local database will hold all the generated HR data including employees, departments, jobs, and compensation information.

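The required environment variables are the OpenAI API key (`OPENAI_API_KEY`) and the Hugging Face token (`HF_TOKEN`). On Google Colab they are read from the notebook's user secrets; elsewhere they must already be set in the process environment.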


In [None]:
from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Agentic Workflow settings."""

    OPENAI_API_KEY: SecretStr = Field(
        ...,
        description='API key for OpenAI services.'
    )

    HF_TOKEN: SecretStr = Field(
        ...,
        description='Hugging Face API token for accessing models and datasets.'
    )

    DUCKDB_PATH: str = Field(
        './data/hr_database.duckdb',
        description='Path to the DuckDB database file.'
    )

In [None]:
import os


# Detect if running on Google Colab
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Load environment variables based on environment
if IN_COLAB:
    # Running on Google Colab - use userdata
    from google.colab import userdata

    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')


settings = Settings()
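
Outside Colab, `Settings()` reads `OPENAI_API_KEY` and `HF_TOKEN` from the process environment, and because both fields are required, construction fails with a validation error when either is absent. A small preflight check can give a friendlier message. The sketch below uses only the standard library; `missing_env_vars` is a hypothetical helper, not part of the notebook.

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the process environment."""
    return [name for name in names if not os.environ.get(name)]


required = ['OPENAI_API_KEY', 'HF_TOKEN']
missing = missing_env_vars(required)
if missing:
    print(f"Set these variables before creating Settings(): {', '.join(missing)}")
```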

## Define Database Schema


### Data Models

We'll define the data models representing the HR database schema using Pydantic for type-safe, validated data structures.

In [None]:
import datetime
import uuid
from enum import Enum, StrEnum
from typing import List, Optional

from pydantic import BaseModel, Field, computed_field, field_validator

#### Common Fields

Define base classes that will be inherited by other models:

- **BaseFields**: Provides auto-generated unique identifiers (UUID) for all entities
- **StartAndEndDates**: Adds start/end date fields with validation for time-bound entities like contracts

In [None]:
class BaseFields(BaseModel):
    """Base fields providing a unique identifier for all models.

    This base class ensures every model in the HR system has a consistent
    unique identifier that is automatically generated using UUID1.
    """

    id: uuid.UUID = Field(
        default_factory=uuid.uuid1,
        description='Auto-generated unique identifier'
    )


class StartAndEndDates(BaseModel):
    """Mixin for models that require time-bound validity periods.

    This mixin provides start and end date fields with validation to ensure
    logical date ordering. Useful for contracts, employment periods, and
    other time-limited entities.
    """

    start_date: datetime.date = Field(..., description='Start date')
    end_date: Optional[datetime.date] = Field(None, description='End date')

    @field_validator('end_date')
    @classmethod
    def validate_end_date(cls, v, info):
        """Ensure end date is after start date.

        Args:
            v: The end_date value being validated
            info: Pydantic ValidationInfo; ``info.data`` holds the fields
                validated so far (including start_date, declared earlier)

        Returns:
            datetime.date: The validated end_date

        Raises:
            ValueError: If end_date is before or equal to start_date
        """
        # Pydantic v2 passes a ValidationInfo object, not a dict of values.
        start_date = info.data.get('start_date')
        if v and start_date and v <= start_date:
            raise ValueError('End date must be after start date')
        return v
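
With `default_factory=uuid.uuid1`, the factory is called once per instance, so every model object receives its own identifier. The sketch below shows the same mechanism with a standard-library dataclass standing in for the Pydantic model (the `Record` class is hypothetical). Version-1 UUIDs are built from a host ID, a sequence number, and the current time rather than being purely random.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for BaseFields, using a dataclass instead of Pydantic."""

    id: uuid.UUID = field(default_factory=uuid.uuid1)


first, second = Record(), Record()
print(first.id, second.id)          # two different identifiers
print(first.id.version)             # UUID version of the generated id
print(first.id != second.id)        # the factory ran separately for each instance
```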

#### Company Models

Define the core company organizational models:

- **Job**: Represents a position with level, family, contract type, and workplace arrangement
- **Department**: A functional unit with a manager and specific job roles
- **BusinessUnit**: A major division led by a director, containing multiple departments
- **Company**: The complete organizational structure with industry classification and business units
- **Ratios**: Demographic distribution ratios for generating diverse, realistic workforce composition

These enums and models define the hierarchical structure and classifications used throughout the HR system.

In [None]:
class JobLevel(str, Enum):
    """Hierarchical job level classifications for career progression tracking.

    Defines standardized levels that represent responsibility, experience,
    and authority within the organizational structure, enabling clear
    career pathing and compensation benchmarking.
    """

    INTERN = 'Intern'
    JUNIOR = 'Junior'
    MID = 'Mid'
    SENIOR = 'Senior'
    LEAD = 'Lead'
    MANAGER = 'Manager'
    DIRECTOR = 'Director'
    PRESIDENT = 'President'


class JobFamily(str, Enum):
    """Functional job family classifications grouping related roles.

    Organizes positions by functional area and skill set, facilitating
    talent management, training programs, and career development within
    similar discipline areas.
    """

    ENGINEERING = 'Engineering'
    PRODUCT = 'Product'
    DESIGN = 'Design'
    DATA = 'Data'
    MARKETING = 'Marketing'
    SALES = 'Sales'
    CUSTOMER_SUCCESS = 'Customer Success'
    FINANCE = 'Finance'
    HUMAN_RESOURCES = 'Human Resources'
    LEGAL = 'Legal'
    OPERATIONS = 'Operations'
    SECURITY = 'Security'
    QUALITY_ASSURANCE = 'Quality Assurance'
    BUSINESS_DEVELOPMENT = 'Business Development'
    EXECUTIVE = 'Executive'


class ContractType(str, Enum):
    """Employment contract classifications defining work arrangements.

    Specifies the nature of the employment relationship, affecting benefits,
    working hours, and legal obligations between employer and employee.
    """

    FULL_TIME = 'Full-Time'
    PART_TIME = 'Part-Time'
    CONTRACT = 'Contract'
    TEMPORARY = 'Temporary'
    INTERN = 'Intern'


class WorkplaceType(str, Enum):
    """Work location and arrangement classifications for modern workplace flexibility.

    Defines where and how work is performed, supporting diverse work
    arrangements and accommodating different employee preferences and
    business needs.
    """

    REMOTE = 'Remote'
    ONSITE = 'Onsite'
    HYBRID = 'Hybrid'


class Job(BaseFields):
    """Complete job position specification within the organizational structure.

    Defines a specific role with all its characteristics including level,
    functional area, contract terms, and workplace arrangements. Used for
    both organizational planning and employee assignment.
    """

    name: str = Field(..., description='Job name')
    description: Optional[str] = Field(None, description='A brief description of the job')
    job_level: JobLevel = Field(..., description='Job level classification')
    job_family: JobFamily = Field(..., description='Job family classification')
    contract_type: ContractType = Field(..., description='Contract type')
    workplace_type: WorkplaceType = Field(..., description='Type of workplace')


class Industry(str, Enum):
    """Industry classifications for companies following standard industry categories.

    These classifications align with common sector groupings used in business
    and financial analysis, providing a standardized way to categorize companies
    by their primary business focus.
    """

    COMMUNICATION_SERVICES = 'Communication Services'
    CONSULTING = 'Consulting'
    CONSUMER_STAPLES = 'Consumer Staples'
    EDUCATION = 'Education'
    ENERGY = 'Energy'
    FINANCIALS = 'Financials'
    HEALTHCARE = 'Healthcare'
    INDUSTRIALS = 'Industrials'
    REAL_ESTATE = 'Real Estate'
    RETAIL = 'Retail'
    TECHNOLOGY = 'Technology'
    UTILITIES = 'Utilities'


class Department(BaseFields):
    """Specification for a department within a company's organizational structure.

    Departments represent functional units within business units, each with a
    designated manager and a specific set of job roles with defined headcounts.
    This model is used for organizational planning and synthetic data generation.
    """

    name: str = Field(..., description='Department name')
    description: Optional[str] = Field(None, description='Department description')
    manager: Job = Field(..., description='Manager of the department')
    jobs: List['JobsSpec'] = Field(
        ..., description='List of jobs and their headcount in this department'
    )

    class JobsSpec(BaseModel):
        """Specification for job distribution and headcount planning within a department.

        Defines the relationship between specific job types and their planned
        headcount allocation, enabling precise workforce planning and
        organizational structure modeling.
        """

        job: Job = Field(..., description='Job specification')
        headcount: int = Field(..., ge=1, description='Planned headcount for job')


class BusinessUnit(BaseFields):
    """Specification for a business unit within a company's organizational hierarchy.

    Business units represent major operational divisions within a company,
    each led by a director and containing multiple departments. This structure
    enables clear accountability and management hierarchy modeling.
    """

    name: str = Field(..., description='Business unit name')
    description: Optional[str] = Field(None, description='Business unit description')
    director: Job = Field(..., description='Director of the business unit')
    departments: List[Department] = Field(
        ..., description='List of departments within the business unit'
    )


class Company(BaseFields):
    """Complete specification for a company structure.

    This model guides the synthetic data generation process by defining
    the company's structure, demographics, and workforce composition.
    """

    name: str = Field(..., description='Company name')
    description: Optional[str] = Field(None, description='Company description')
    industry: Industry = Field(..., description='Industry classification')
    business_units: List[BusinessUnit] = Field(
        ..., description='List of business units within the company'
    )


class Ratios(BaseModel):
    """Demographic distribution ratios for workforce composition modeling.

    This model defines the proportional representation of different demographic
    groups within a company's workforce. All ratios should sum to 1.0 within
    each category and use 2 decimal places for precision in synthetic data generation.
    """

    gender: 'GenderRatios' = Field(..., description='Proportion of employees by gender')
    ethnicity: 'EthnicityRatios' = Field(..., description='Proportion of employees by ethnicity')
    generation: 'GenerationRatios' = Field(
        ..., description='Proportion of employees by generation'
    )

    class GenderRatios(BaseModel):
        """Proportional distribution of employees by gender identity.

        Represents the workforce composition across different gender identities,
        supporting inclusive demographic modeling. All ratios should sum to 1.0.
        """

        MALE: float = Field(..., ge=0, le=1, description='Proportion of male employees')
        FEMALE: float = Field(..., ge=0, le=1, description='Proportion of female employees')
        NON_BINARY: float = Field(
            ..., ge=0, le=1, description='Proportion of non-binary employees'
        )
        PREFER_NOT_TO_SAY: float = Field(
            ..., ge=0, le=1, description='Proportion of employees who prefer not to say'
        )

    class EthnicityRatios(BaseModel):
        """Proportional distribution of employees by ethnic background.

        Represents workforce diversity across major ethnic categories,
        enabling realistic demographic modeling. All ratios should sum to 1.0.
        """

        WHITE: float = Field(..., ge=0, le=1, description='Proportion of white employees')
        BLACK: float = Field(..., ge=0, le=1, description='Proportion of black employees')
        ASIAN: float = Field(..., ge=0, le=1, description='Proportion of Asian employees')
        HISPANIC: float = Field(..., ge=0, le=1, description='Proportion of Hispanic employees')
        OTHER: float = Field(
            ..., ge=0, le=1, description='Proportion of employees from other ethnicities'
        )

    class GenerationRatios(BaseModel):
        """Proportional distribution of employees by generational cohorts.

        Represents the age distribution of the workforce using standard
        generational categories. Enables modeling of generational diversity
        and workplace dynamics. All ratios should sum to 1.0.
        """

        BABY_BOOMER: float = Field(
            ..., ge=0, le=1, description='Proportion of Baby Boomer employees'
        )

        GEN_X: float = Field(..., ge=0, le=1, description='Proportion of Gen X employees')

        MILLENNIAL: float = Field(
            ..., ge=0, le=1, description='Proportion of Millennial employees'
        )

        GEN_Z: float = Field(..., ge=0, le=1, description='Proportion of Gen Z employees')
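
The docstrings ask that each group of ratios sum to 1.0, while the field constraints (`ge=0, le=1`) only bound the individual values. A minimal sketch of how such ratios might be checked and then used to draw demographic attributes, with plain dictionaries and the standard library: `check_ratios` and `sample_category` are hypothetical helpers, and the ratio values are made up for illustration.

```python
import math
import random


def check_ratios(ratios: dict[str, float], tol: float = 0.01) -> bool:
    """Return True when the ratios sum to 1.0 within a small tolerance."""
    return math.isclose(sum(ratios.values()), 1.0, abs_tol=tol)


def sample_category(ratios: dict[str, float], rng: random.Random) -> str:
    """Draw one category with probability proportional to its ratio."""
    categories = list(ratios)
    weights = list(ratios.values())
    return rng.choices(categories, weights=weights, k=1)[0]


# Illustrative values only.
gender_ratios = {'MALE': 0.48, 'FEMALE': 0.47, 'NON_BINARY': 0.03, 'PREFER_NOT_TO_SAY': 0.02}
rng = random.Random(42)  # seeded so repeated runs draw the same sequence

if check_ratios(gender_ratios):
    draws = [sample_category(gender_ratios, rng) for _ in range(1000)]
    print({key: draws.count(key) for key in gender_ratios})
```

The tolerance allows for ratios rounded to two decimal places, which may not add up to exactly 1.0.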

#### Compensation Models

Define the compensation model that captures employee pay structure:

- **RateType**: Whether the employee is paid hourly or on salary
- **Compensation**: Complete compensation package including base salary, bonuses, and commissions

The model automatically calculates total compensation by summing all components.

In [None]:
class RateType(str, Enum):
    """Classification of compensation payment structures.

    Defines whether an employee is paid on an hourly basis or receives
    a fixed annual salary, affecting how compensation is calculated
    and administered.
    """

    HOURLY = 'Hourly'
    SALARY = 'Salary'


class Compensation(BaseFields):
    """The compensation package information for an employee."""

    annual_base_salary: float = Field(..., ge=1, description='Base annual salary')
    annual_bonus_amount: float | None = Field(None, ge=0, description='Annual bonus amount')
    annual_commission_amount: float | None = Field(
        None, ge=0, description='Annual commission amount (usually a percentage of sales)'
    )

    rate_type: RateType = Field(..., description='Type of compensation rate')

    @property
    def total_compensation(self) -> float:
        """Calculate the total annual compensation including all components.

        Returns:
            float: Sum of base salary, bonus amount, and commission amount.
                   None values are treated as zero in the calculation.
        """
        return (
            self.annual_base_salary
            + (self.annual_bonus_amount or 0)
            + (self.annual_commission_amount or 0)
        )

#### Employee Models

Define the employee model and related demographic classifications:

- **Gender, Ethnicity, Generation**: Demographic categories for workforce diversity
- **EducationLevel, EducationField**: Academic qualifications and areas of study
- **Employee**: Complete employee record with demographics, education, and organizational placement

The Employee model includes computed properties that automatically generate realistic names and determine generational cohort based on birth date.
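
One consequence of this design is worth keeping in mind: a plain property backed by a random generator is re-evaluated on every access, so two reads of `first_name` on the same employee may return different names. The sketch below contrasts a property with `functools.cached_property`, using the standard library's `random` in place of Faker; the `Person` class and the name list are hypothetical.

```python
import random
from functools import cached_property

NAMES = ['Alex', 'Jordan', 'Sam', 'Taylor', 'Morgan', 'Casey', 'Riley', 'Quinn']


class Person:
    """Hypothetical class contrasting uncached and cached random attributes."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    @property
    def uncached_name(self) -> str:
        # Evaluated on every access, so repeated reads can differ.
        return self._rng.choice(NAMES)

    @cached_property
    def cached_name(self) -> str:
        # Evaluated on first access, then stored on the instance.
        return self._rng.choice(NAMES)


person = Person(seed=7)
print([person.uncached_name for _ in range(3)])
print([person.cached_name for _ in range(3)])
```

Whether to cache is a design choice: caching keeps a record's name stable across reads, while the uncached form is simpler when each value is read exactly once.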

In [None]:
from faker import Faker


faker = Faker()


class Gender(StrEnum):
    """Gender identity classifications supporting inclusive workforce representation.

    Provides comprehensive gender categories that respect individual identity
    while enabling demographic analysis and reporting.
    """

    MALE = 'Male'
    FEMALE = 'Female'
    NON_BINARY = 'Non-binary'
    PREFER_NOT_TO_SAY = 'Prefer not to say'


class Ethnicity(StrEnum):
    """Ethnic background classifications for workforce diversity tracking.

    Enables demographic analysis and diversity reporting while respecting
    individual backgrounds and promoting inclusive workplace practices.
    """

    ASIAN = 'Asian'
    BLACK = 'Black'
    HISPANIC = 'Hispanic'
    WHITE = 'White'
    OTHER = 'Other'


class Generation(StrEnum):
    """Generational cohort classifications based on birth year ranges.

    Categorizes employees into standard generational groups for analyzing
    workplace dynamics, communication preferences, and career development needs.
    """

    BABY_BOOMER = 'Baby Boomer'
    GEN_X = 'Generation X'
    MILLENNIAL = 'Millennial'
    GEN_Z = 'Generation Z'


class EducationLevel(StrEnum):
    """Academic achievement levels for skills and qualification assessment.

    Represents the highest level of formal education completed, used for
    job matching, career development, and compensation analysis.
    """

    HIGH_SCHOOL = 'High School'
    ASSOCIATE = 'Associate Degree'
    BACHELORS = 'Bachelor Degree'
    MASTERS = 'Master Degree'
    DOCTORATE = 'Doctorate'


class EducationField(StrEnum):
    """Academic discipline classifications for specialized knowledge assessment.

    Comprehensive categorization of fields of study that enables skills
    matching, career pathing, and department alignment based on educational
    background and expertise areas.
    """

    AGRICULTURE = 'Agriculture'
    ARTS = 'Arts'
    BIOLOGICAL_SCIENCES = 'Biological Sciences'
    BUSINESS = 'Business & Management'
    COMMUNICATION_MEDIA = 'Communication, Journalism & Media'
    COMPUTER_SCIENCE = 'Computer Science'
    CIVIL_ENGINEERING = 'Civil Engineering'
    ELECTRICAL_ENGINEERING = 'Electrical Engineering'
    MECHANICAL_ENGINEERING = 'Mechanical Engineering'
    CHEMICAL_ENGINEERING = 'Chemical Engineering'
    BIOMEDICAL_ENGINEERING = 'Biomedical Engineering'
    MATERIALS_ENGINEERING = 'Materials Engineering'
    ECONOMICS = 'Economics'
    HEALTH_SCIENCES = 'Health Sciences'
    LAW = 'Law'
    LITERATURE = 'Literature'
    MATHEMATICS_STATISTICS = 'Mathematics & Statistics'
    MEDICINE = 'Medicine'
    MILITARY_SCIENCE = 'Military Science'
    NURSING = 'Nursing'
    PEDAGOGY = 'Pedagogy'
    PHARMACY = 'Pharmacy'
    PHYSICAL_SCIENCES = 'Physics & Chemistry'
    POLITICAL_SCIENCE = 'Political Science'
    PSYCHOLOGY = 'Psychology'
    RELIGIOUS_STUDIES = 'Religious Studies'
    SOCIAL_SCIENCES = 'Social Sciences'


class Employee(BaseFields):
    """Comprehensive employee record containing all personal and professional information.

    This model serves as the central employee data structure, linking together
    demographic information, educational background, and organizational placement.
    Used for both synthetic data generation and real HR system modeling.
    """

    job_id: uuid.UUID = Field(..., description='Job ID')
    department_id: Optional[uuid.UUID] = Field(None, description='Department ID')
    business_unit_id: Optional[uuid.UUID] = Field(None, description='Business Unit ID')
    birth_date: datetime.date = Field(..., description='Date of birth')
    gender: Gender = Field(..., description='Gender identification')
    ethnicity: Ethnicity = Field(..., description='Ethnicity identification')
    education_level: Optional[EducationLevel] = Field(
        None, description='Highest education level completed'
    )

    education_field: Optional[EducationField] = Field(
        None, description='Field of study for the employee'
    )

    @computed_field
    @property
    def first_name(self) -> str:
        """Generate an appropriate first name based on the employee's gender.

        Uses Faker library to generate culturally appropriate names that
        align with the employee's gender identity for realistic data modeling.

        Returns:
            str: A generated first name appropriate for the employee's gender
        """
        if self.gender == Gender.MALE:
            return faker.first_name_male()
        elif self.gender == Gender.FEMALE:
            return faker.first_name_female()
        else:
            return faker.first_name()

    @computed_field
    @property
    def last_name(self) -> str:
        """Generate an appropriate last name based on the employee's gender.

        Uses Faker library to generate culturally appropriate surnames that
        align with the employee's gender identity for consistent data modeling.

        Returns:
            str: A generated last name appropriate for the employee's gender
        """
        if self.gender == Gender.MALE:
            return faker.last_name_male()
        elif self.gender == Gender.FEMALE:
            return faker.last_name_female()
        else:
            return faker.last_name()

    @computed_field
    @property
    def generation(self) -> Generation:
        """Automatically determine generational cohort based on birth date.

        Uses standard generational date ranges to classify employees into
        appropriate cohorts for workplace analysis and management strategies.

        Returns:
            Generation: The generational category based on birth year

        Raises:
            ValueError: If birth date falls outside recognized generational ranges
        """
        if self.birth_date < datetime.date(1965, 1, 1):
            return Generation.BABY_BOOMER
        elif self.birth_date < datetime.date(1981, 1, 1):
            return Generation.GEN_X
        elif self.birth_date < datetime.date(1997, 1, 1):
            return Generation.MILLENNIAL
        elif self.birth_date < datetime.date(2016, 1, 1):
            return Generation.GEN_Z
        else:
            raise ValueError('Unknown generation')
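
The cutoffs used by `generation` are exclusive upper bounds on the birth date. The same lookup can be sketched as a table search with `bisect`, which makes the boundary behaviour easy to check. `cohort_for` is a hypothetical standalone function that copies the cutoffs from the cell above; it is not part of the notebook.

```python
import datetime
from bisect import bisect_right

# Exclusive upper bounds, copied from the Employee.generation property.
CUTOFFS = [
    datetime.date(1965, 1, 1),
    datetime.date(1981, 1, 1),
    datetime.date(1997, 1, 1),
    datetime.date(2016, 1, 1),
]
COHORTS = ['Baby Boomer', 'Generation X', 'Millennial', 'Generation Z']


def cohort_for(birth_date: datetime.date) -> str:
    """Return the first cohort whose exclusive upper bound lies after birth_date."""
    index = bisect_right(CUTOFFS, birth_date)
    if index >= len(COHORTS):
        raise ValueError('Unknown generation')
    return COHORTS[index]


# A birth date exactly on a cutoff belongs to the later cohort.
print(cohort_for(datetime.date(1980, 12, 31)))
print(cohort_for(datetime.date(1981, 1, 1)))
```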

### Database Class

Create the Database class that provides an opinionated interface for interacting with DuckDB:

- **create_tables()**: Initialize the database schema with all HR tables and relationships
- **add_employee()**: Insert employee records with automatic UUID conflict resolution
- **add_business_unit()**: Store business unit information
- **add_department()**: Store department information linked to business units
- **add_job()**: Store job specifications
- **add_compensation()**: Store compensation packages linked to employees

The database enforces referential integrity through foreign key constraints and ensures data consistency.
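
Foreign keys make insertion order matter: a row in `employees` can only be stored once the `jobs` row it references exists. The sketch below demonstrates this with the standard library's `sqlite3` as a stand-in for DuckDB, purely so that it runs without extra dependencies. The two-table schema is a trimmed, hypothetical version of the real one, `try_insert_employee` is an illustrative helper, and SQLite needs `PRAGMA foreign_keys = ON` to enforce the constraint.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('PRAGMA foreign_keys = ON')  # SQLite-specific; enforcement is off by default
con.execute('CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, name TEXT NOT NULL)')
con.execute("""
    CREATE TABLE IF NOT EXISTS employees (
        id TEXT PRIMARY KEY,
        job_id TEXT NOT NULL,
        FOREIGN KEY (job_id) REFERENCES jobs(id)
    )
""")


def try_insert_employee(employee_id: str, job_id: str) -> bool:
    """Return True if the insert succeeds, False on a constraint violation."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                'INSERT INTO employees (id, job_id) VALUES (?, ?)',
                (employee_id, job_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


before = try_insert_employee('e1', 'j1')  # job 'j1' does not exist yet
con.execute("INSERT INTO jobs (id, name) VALUES ('j1', 'Data Engineer')")
con.commit()
after = try_insert_employee('e1', 'j1')   # the parent row now exists
print(before, after)
```

This is why the workflow adds jobs, business units, and departments before generating the employees that point at them.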

In [None]:
import os

import duckdb


class Database:
    """Database interface for DuckDB.

    Provides a database abstraction layer for managing HR entities
    including employees, business units, departments, jobs, and compensation
    records. Uses DuckDB for efficient analytical queries and data processing.

    Attributes:
        file_path (str): Path to the DuckDB database file
    """

    def __init__(self, file_path: str = './data/hr_database.duckdb'):
        """Initialize the database connection with the specified file path.

        Args:
            file_path (str): Path to the DuckDB database file.
                Directory will be created if it doesn't exist.
                Defaults to "./data/hr_database.duckdb"
        """
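        self.file_path = file_path
        # Only create the parent directory when the path has one;
        # os.makedirs('') raises FileNotFoundError for bare filenames.
        parent_dir = os.path.dirname(self.file_path)
        if parent_dir:
            os.makedirs(parent_dir, exist_ok=True)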

    def create_tables(self):
        """Create the complete HR database schema with all necessary tables.

        Creates tables for business_units, departments, jobs, employees, and
        compensations with proper foreign key relationships and constraints.
        This method is idempotent and safe to call multiple times.

        Tables created:
        - business_units: Top-level organizational divisions
        - departments: Functional units within business units
        - jobs: Job position definitions and classifications
        - employees: Employee personal and demographic information
        - compensations: Employee compensation packages and amounts
        """
        with duckdb.connect(self.file_path) as con:
            # Create business_units table
            con.execute("""
                CREATE TABLE IF NOT EXISTS business_units (
                    id VARCHAR PRIMARY KEY,
                    name VARCHAR NOT NULL,
                    description VARCHAR,
                    director_job_id VARCHAR NOT NULL
                );
            """)

            # Create departments table
            con.execute("""
                CREATE TABLE IF NOT EXISTS departments (
                    id VARCHAR PRIMARY KEY,
                    name VARCHAR NOT NULL,
                    description VARCHAR,
                    manager_job_id VARCHAR NOT NULL,
                    business_unit_id VARCHAR NOT NULL,
                    FOREIGN KEY (business_unit_id) REFERENCES business_units(id)
                );
            """)

            # Create jobs table
            con.execute("""
                CREATE TABLE IF NOT EXISTS jobs (
                    id VARCHAR PRIMARY KEY,
                    name VARCHAR NOT NULL,
                    description VARCHAR,
                    job_level VARCHAR NOT NULL,
                    job_family VARCHAR NOT NULL,
                    contract_type VARCHAR NOT NULL,
                    workplace_type VARCHAR NOT NULL
                );
            """)

            # Create employees table
            con.execute("""
                CREATE TABLE IF NOT EXISTS employees (
                    id VARCHAR PRIMARY KEY,
                    job_id VARCHAR NOT NULL,
                    department_id VARCHAR,
                    business_unit_id VARCHAR,
                    first_name VARCHAR NOT NULL,
                    last_name VARCHAR NOT NULL,
                    birth_date DATE NOT NULL,
                    gender VARCHAR NOT NULL,
                    ethnicity VARCHAR NOT NULL,
                    education_level VARCHAR,
                    education_field VARCHAR,
                    generation VARCHAR NOT NULL,
                    FOREIGN KEY (job_id) REFERENCES jobs(id),
                    FOREIGN KEY (department_id) REFERENCES departments(id),
                    FOREIGN KEY (business_unit_id) REFERENCES business_units(id)
                );
            """)

            # Create compensation table
            con.execute("""
                CREATE TABLE IF NOT EXISTS compensations (
                    id VARCHAR PRIMARY KEY,
                    employee_id VARCHAR NOT NULL,
                    annual_base_salary DECIMAL(12,2) NOT NULL,
                    annual_bonus_amount DECIMAL(12,2),
                    annual_commission_amount DECIMAL(12,2),
                    rate_type VARCHAR NOT NULL,
                    total_compensation DECIMAL(12,2) NOT NULL,
                    FOREIGN KEY (employee_id) REFERENCES employees(id)
                );
            """)

    def add_employee(self, employee: Employee):
        """Insert a new employee record into the database.

        Stores complete employee information including demographics, education,
        and organizational assignments. Automatically handles UUID conversion
        and enum value extraction.

        Args:
            employee (Employee): Employee model instance containing all
                               required employee information

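        Note:
            The intended usage is to set either department_id or
            business_unit_id, not both. The schema created by create_tables()
            has no CHECK constraint for this, so the database does not
            enforce it and callers must ensure it themselves.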
        """
        with duckdb.connect(self.file_path) as con:
            # Retry a bounded number of times. A new UUID resolves a primary-key
            # clash, but other violations (e.g. a missing foreign-key target)
            # would otherwise keep an unbounded loop spinning forever.
            max_attempts = 3
            for attempt in range(1, max_attempts + 1):
                try:
                    con.execute(
                        """
                        INSERT INTO employees (
                            id, job_id, department_id, business_unit_id, first_name, last_name,
                            birth_date, gender, ethnicity, education_level,
                            education_field, generation
                        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    """,
                        (
                            str(employee.id),
                            str(employee.job_id),
                            str(employee.department_id) if employee.department_id else None,
                            str(employee.business_unit_id) if employee.business_unit_id else None,
                            employee.first_name,
                            employee.last_name,
                            employee.birth_date,
                            employee.gender.value,
                            employee.ethnicity.value,
                            employee.education_level.value if employee.education_level else None,
                            employee.education_field.value if employee.education_field else None,
                            employee.generation.value,
                        ),
                    )
                except duckdb.ConstraintException as e:
                    # Log the id: the name properties return a new value on each access.
                    print(f'Failed to add employee {employee.id} (attempt {attempt}): {e}')
                    if attempt == max_attempts:
                        raise
                    # Regenerate UUID and retry
                    employee.id = uuid.uuid1()
                    continue
                break

    def add_business_unit(self, business_unit: BusinessUnit):
        """Insert a new business unit record into the database.

        Creates a business unit entry with its associated director job
        reference. The director job should be added separately using add_job().

        Args:
            business_unit (BusinessUnit): Business unit model instance with
                                        name, description, and director information
        """
        with duckdb.connect(self.file_path) as con:
            con.execute(
                """
                INSERT INTO business_units (id, name, description, director_job_id)
                VALUES (?, ?, ?, ?)
            """,
                (
                    str(business_unit.id),
                    business_unit.name,
                    business_unit.description,
                    str(business_unit.director.id),
                ),
            )

    def add_department(self, department: Department, business_unit_id: str):
        """Insert a new department record linked to its parent business unit.

        Creates a department entry with its manager job reference and business
        unit association. The manager job should be added separately using add_job().

        Args:
            department (Department): Department model instance with name,
                                   description, and manager information
            business_unit_id (str): UUID string of the parent business unit
        """
        with duckdb.connect(self.file_path) as con:
            con.execute(
                """
                INSERT INTO departments (id, name, description, manager_job_id, business_unit_id)
                VALUES (?, ?, ?, ?, ?)
            """,
                (
                    str(department.id),
                    department.name,
                    department.description,
                    str(department.manager.id),
                    business_unit_id,
                ),
            )

    def add_job(self, job: Job):
        """Insert a new job position record into the database.

        Stores complete job specification including level, family, contract
        type, and workplace arrangement. Automatically handles enum value
        extraction for database storage.

        Args:
            job (Job): Job model instance containing position details,
                      classifications, and work arrangements
        """
        with duckdb.connect(self.file_path) as con:
            con.execute(
                """
                INSERT INTO jobs (
                    id, name, description, job_level, job_family,
                    contract_type, workplace_type
                ) VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
                (
                    str(job.id),
                    job.name,
                    job.description,
                    job.job_level.value,
                    job.job_family.value,
                    job.contract_type.value,
                    job.workplace_type.value,
                ),
            )

    def add_compensation(self, compensation: Compensation, employee_id: str):
        """Insert a compensation record linked to an employee.

        Stores the complete compensation package including base salary, bonuses,
        and commissions. The total is read from the model's `total_compensation`
        attribute and stored as-is.

        Args:
            compensation (Compensation): Compensation model instance with
                                       salary and benefit information
            employee_id (str): UUID string of the associated employee
        """
        with duckdb.connect(self.file_path) as con:
            while True:
                try:
                    con.execute(
                        """
                        INSERT INTO compensations (
                            id, employee_id, annual_base_salary, annual_bonus_amount,
                            annual_commission_amount, rate_type, total_compensation
                        ) VALUES (?, ?, ?, ?, ?, ?, ?)
                    """,
                        (
                            str(compensation.id),
                            employee_id,
                            compensation.annual_base_salary,
                            compensation.annual_bonus_amount,
                            compensation.annual_commission_amount,
                            compensation.rate_type.value,
                            compensation.total_compensation,
                        ),
                    )
                except duckdb.ConstraintException as e:
                    print(f'Failed to add compensation for employee {employee_id}: {e}')
                    # Regenerate UUID and retry
                    compensation.id = uuid.uuid1()
                    continue
                break
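
The retry loops above regenerate the id whenever the insert raises a constraint error. The sketch below shows a bounded variant of that pattern. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and `insert_with_retry`, the `items` table, and `max_attempts` are hypothetical names that are not part of the notebook. The cap matters because a violation that a fresh id cannot fix (for example a NOT NULL or foreign-key error) would otherwise be retried indefinitely.

```python
import sqlite3
import uuid


def insert_with_retry(
    con: sqlite3.Connection, row_id: str, name: str, max_attempts: int = 3
) -> str:
    """Insert a row, regenerating the id on a constraint error; return the id used."""
    for _ in range(max_attempts):
        try:
            con.execute('INSERT INTO items (id, name) VALUES (?, ?)', (row_id, name))
            return row_id
        except sqlite3.IntegrityError:
            # A fresh id only helps with key collisions; other violations
            # keep failing and eventually hit the attempt limit below.
            row_id = str(uuid.uuid4())
    raise RuntimeError(f'Could not insert {name!r} after {max_attempts} attempts')


con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT NOT NULL)')

first_id = insert_with_retry(con, 'fixed-id', 'first')
second_id = insert_with_retry(con, 'fixed-id', 'second')  # collides, gets a new id
row_count = con.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(first_id, second_id, row_count)
```

Whether a cap is appropriate for the notebook's `Database` class depends on which constraints the schema defines, so treat this as one possible design rather than a required change.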

## Define Helper Functions

Define helper functions for generating realistic demographic data:

- **get_birth_date()**: Generates random birth dates within appropriate year ranges for each generation
- **weighted_random_choice()**: Selects items based on weighted probabilities to match demographic ratios

In [None]:
import random
from typing import Any


# Set a fixed seed for reproducibility
random.seed(1993)


def get_birth_date(generation: Generation) -> datetime.date:
    """Generate a realistic birth date for the specified generational cohort.

    Picks a random birth date within the commonly used year range for the
    given generation. Draws from the module-level `random` generator, which
    is seeded once at the top of this cell for reproducibility.

    Args:
        generation (Generation): The target generational cohort

    Returns:
        datetime.date: A randomly generated birth date within the appropriate
                      year range for the specified generation

    Raises:
        ValueError: If an invalid or unrecognized generation is provided

    Note:
        Generation year ranges:
        - Baby Boomer: 1946-1964
        - Generation X: 1965-1980
        - Millennial: 1981-1996
        - Generation Z: 1997-2012
    """
    if generation == Generation.BABY_BOOMER:
        start_year, end_year = 1946, 1964
    elif generation == Generation.GEN_X:
        start_year, end_year = 1965, 1980
    elif generation == Generation.MILLENNIAL:
        start_year, end_year = 1981, 1996
    elif generation == Generation.GEN_Z:
        start_year, end_year = 1997, 2012
    else:
        raise ValueError(f'Invalid generation: {generation}')

    # Generate a random birth date within the range. The day is capped at 28
    # so the date is valid in every month, including February.
    birth_date = datetime.date(
        year=random.randint(start_year, end_year),  # nosec B311
        month=random.randint(1, 12),  # nosec B311
        day=random.randint(1, 28),  # nosec B311
    )

    return birth_date


def weighted_random_choice(choices: dict[Any, float]) -> Any:
    """Select an item from a weighted distribution using random sampling.

    Implements weighted random selection where each choice has a probability
    proportional to its weight. Draws a value uniformly from [0, total] and
    walks the items, subtracting each weight until the value falls inside one.

    Args:
        choices (dict[Any, float]): Dictionary mapping items to their weights.
                                  Weights should be positive numbers and
                                  don't need to sum to 1.0

    Returns:
        Any: The randomly selected item based on weighted probabilities

    Example:
        >>> choices = {'A': 0.7, 'B': 0.2, 'C': 0.1}
        >>> result = weighted_random_choice(choices)
        >>> # 'A' has 70% chance, 'B' has 20% chance, 'C' has 10% chance

    Note:
        Uses a fixed random seed (1993) for reproducible results across runs.
        If weights don't sum exactly due to floating-point precision,
        returns the last choice as fallback.
    """
    total = sum(choices.values())
    r = random.uniform(0, total)  # nosec B311
    for item, weight in choices.items():
        if r < weight:
            return item
        r -= weight
    return list(choices.keys())[-1]  # Fallback
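
As a quick sanity check of the two helpers, the sketch below re-implements the same logic in a self-contained form and compares the sampled shares with the input weights. The string keys, the `pick` and `birth_date` names, and the weights in `ratios` are illustrative assumptions; the notebook itself uses the `Generation` enum and LLM-generated ratios. A local `random.Random` instance is used so the check does not disturb the global seed.

```python
import datetime
import random
from collections import Counter

# Year ranges copied from get_birth_date; the string keys are illustrative.
YEAR_RANGES = {
    'BABY_BOOMER': (1946, 1964),
    'GEN_X': (1965, 1980),
    'MILLENNIAL': (1981, 1996),
    'GEN_Z': (1997, 2012),
}


def pick(weights: dict[str, float], rng: random.Random) -> str:
    """Subtract-and-compare scan, as in weighted_random_choice."""
    r = rng.uniform(0, sum(weights.values()))
    for item, weight in weights.items():
        if r < weight:
            return item
        r -= weight
    return list(weights)[-1]  # floating-point fallback


def birth_date(generation: str, rng: random.Random) -> datetime.date:
    """Random date inside the generation's year range (day capped at 28)."""
    start, end = YEAR_RANGES[generation]
    return datetime.date(rng.randint(start, end), rng.randint(1, 12), rng.randint(1, 28))


rng = random.Random(1993)
ratios = {'BABY_BOOMER': 1, 'GEN_X': 3, 'MILLENNIAL': 4, 'GEN_Z': 2}  # hypothetical weights

sample = [pick(ratios, rng) for _ in range(10_000)]
shares = {name: count / len(sample) for name, count in Counter(sample).items()}
dates = [birth_date(name, rng) for name in sample[:100]]
print(shares)
```

The weights here sum to 10 rather than 1, which illustrates that only their relative sizes matter.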

## Define Agentic Workflow

In this section, we will implement our LangGraph-based Agentic Workflow for synthetic HR database generation.

We will use the [LangGraph Functional API](https://docs.langchain.com/oss/python/langgraph/functional-api) to define tasks and orchestrate the workflow.

The LangGraph Functional API uses two key building blocks:

- `@entrypoint` – Marks a function as the starting point of a workflow, encapsulating logic and managing execution flow, including handling long-running tasks and interrupts.
- `@task` – Represents a discrete unit of work, such as an API call or data processing step, that can be executed asynchronously within an entrypoint. Tasks return a future-like object that can be awaited or resolved synchronously.

In [None]:
from itertools import batched  # Requires Python 3.12+

from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.func import entrypoint, task

### Workflow tasks

Define the individual tasks that make up our agentic workflow. Each task is decorated with `@task` to enable parallel execution and state management:

**AI-Powered Generation Tasks:**
- **get_company_spec()**: Transforms user input into a structured company specification using GPT-4o
- **get_demographic_ratios()**: Generates realistic demographic distributions based on company characteristics
- **get_education_fields()**: Determines appropriate education level and field for each employee role
- **get_employee_compensation()**: Calculates realistic compensation packages based on role and qualifications

**Database Operations:**
- **create_database()**: Initializes the DuckDB database with the HR schema
- **add_business_unit_to_db()**: Stores business unit and director job records
- **add_department_to_db()**: Stores department, manager, and all job specifications
- **add_employee_to_db()**: Stores employee and compensation records

**Data Generation Orchestration:**
- **generate_employee()**: Creates a complete employee record with education and compensation
- **generate_department()**: Generates all employees for a department (manager + staff) in parallel batches

These tasks work together to create a realistic, diverse HR database with proper organizational hierarchy.
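
The education and compensation tasks pass `model_dump()` dictionaries into a prompt template that uses doubled braces. The sketch below uses plain `str.format` to show how such a template renders, on the assumption that LangChain's default f-string template format follows the same brace rules. The `employee` and `job` dictionaries are hypothetical, and the `json.dumps` variant is an alternative for comparison, not what the notebook does.

```python
import json

template = '{{"employee": "{employee}", "job": "{job}"}}'
employee = {'gender': 'FEMALE', 'education_level': None}  # hypothetical fields
job = {'name': 'Data Analyst'}  # hypothetical fields

# Doubled braces render as literal braces; each dict is inserted via str().
rendered = template.format(employee=employee, job=job)
print(rendered)

# Assumed alternative: serialize first so nested values stay JSON objects.
as_json = json.dumps({'employee': employee, 'job': job})
print(as_json)
```

In the rendered template each value is a quoted Python repr string, which a model can usually still read. The serialized variant keeps the nesting explicit.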

In [None]:
@task
def get_company_spec(user_input: str) -> Company:
    """Generate a comprehensive company specification from natural language input.

    Uses an OpenAI language model to transform user requirements into a
    structured Company model with a complete organizational hierarchy,
    including business units, departments, and job positions.

    Args:
        user_input (str): Natural language description of the desired company
                         structure, industry, and characteristics

    Returns:
        Company: Structured company specification with business units,
                departments, jobs, and leadership roles properly defined

    Note:
        The LLM is instructed to create realistic organizational hierarchies
        with diverse names and common business structures.
    """
    llm = ChatOpenAI(model='gpt-4o', timeout=600, max_retries=3)

    prompt = ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(
                'You are an experienced business strategist specializing in creating '
                'detailed organizational specifications from brief company descriptions.\n\n'
                'Your task is to design a realistic company structure that includes:\n\n'
                '- Business Units (based on product lines, regions, or functions).\n\n'
                '- Departments within each business unit (e.g., HR, IT, Sales, Marketing, Finance, etc.).\n\n'
                '- Key Roles and Jobs at different levels, ensuring diversity and realism in job titles and names.\n\n'
                'Guidelines:\n'
                '- Each business unit should be led by a Director overseeing multiple departments.\n\n'
                '- Each department should have a Manager and several distinct job roles across senior, mid-level, and junior positions.\n\n'
                '- Use realistic and varied names for all business units, departments, and roles.\n\n'
                '- Usually, a company has 3-5 business units, each with 3-7 departments, and each department with 3-10 job roles.\n\n'
                '- Ensure the organizational hierarchy is coherent and reflects common corporate structures.'
            ),
            HumanMessagePromptTemplate.from_template('{text}'),
        ]
    )

    chain = prompt | llm.with_structured_output(Company)

    response = chain.invoke({'text': user_input})
    return response


@task
def get_demographic_ratios(company_spec: Company) -> Ratios:
    """Generate realistic demographic distribution ratios for the company.

    Analyzes the company specification and industry to create appropriate
    demographic ratios that align with industry benchmarks and realistic
    workforce distributions. Uses an OpenAI language model to generate the ratios.

    Args:
        company_spec (Company): The company specification containing industry,
                               size, and organizational structure information

    Returns:
        Ratios: Demographic ratios for gender, ethnicity, and generation
               distributions that reflect realistic workforce composition

    Note:
        The LLM considers industry standards and company characteristics
        to generate statistically reasonable demographic distributions.
    """
    llm = ChatOpenAI(model='gpt-4o', timeout=60, max_retries=3)

    prompt = ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(
                'You are an expert at defining demographic ratios '
                'based on the company specification and aligning with industry benchmarks.'
            ),
            HumanMessagePromptTemplate.from_template('{company_spec}'),
        ]
    )

    chain = prompt | llm.with_structured_output(Ratios)

    response = chain.invoke({'company_spec': company_spec.model_dump()})
    return response


@task
def get_education_fields(employee: Employee, job: Job):
    """Determine appropriate education level and field for an employee's role.

    Uses AI analysis to match employee demographics and job requirements
    with realistic education credentials. Considers job family, level,
    and industry standards to assign appropriate qualifications.

    Args:
        employee (Employee): Employee model with demographic information
        job (Job): Job model with position requirements and classifications

    Returns:
        EducationFieldsResponse: Education level and field assignments
                               that align with the job requirements
    """

    class EducationFieldsResponse(BaseModel):
        """Generates education level and field of an employee."""

        education_level: EducationLevel = Field(
            ..., description='The education level of the employee.'
        )

        education_field: EducationField = Field(
            ..., description='The field of education of the employee.'
        )

    llm = ChatOpenAI(model='gpt-5-nano', timeout=20, max_retries=5)

    prompt = ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(
                'You are an expert HR professional who determines '
                'the education level and field of an employee based on their data and job role.'
            ),
            HumanMessagePromptTemplate.from_template(
                """{{"employee": "{employee}", "job": "{job}"}}"""
            ),
        ]
    )

    chain = prompt | llm.with_structured_output(EducationFieldsResponse)

    response: EducationFieldsResponse = chain.invoke(
        {'employee': employee.model_dump(), 'job': job.model_dump()}
    )
    return response


@task
def get_employee_compensation(employee: Employee, job: Job) -> Compensation:
    """Calculate appropriate compensation package for an employee's position.

    Analyzes employee qualifications, experience indicators, and job
    characteristics to determine realistic compensation including base
    salary, bonuses, and commissions. Considers market rates and internal equity.

    Args:
        employee (Employee): Employee model with demographics and education
        job (Job): Job model with level, family, and workplace information

    Returns:
        Compensation: Complete compensation package with base salary,
                     bonuses, and commission amounts appropriate for the role
    """
    llm = ChatOpenAI(model='gpt-5-nano', timeout=20, max_retries=5)

    prompt = ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(
                'You are an expert HR professional who determines '
                'the compensation of an employee based on their data and job role.'
            ),
            HumanMessagePromptTemplate.from_template(
                """
                {{"employee": "{employee}", "job": "{job}"}}
                """
            ),
        ]
    )

    chain = prompt | llm.with_structured_output(Compensation)

    response = chain.invoke({'employee': employee.model_dump(), 'job': job.model_dump()})
    return response


@task
def create_database() -> str:
    """Initialize a new DuckDB database with the complete HR schema.

    Creates a fresh database instance with all necessary tables for storing
    HR data including business units, departments, jobs, employees, and
    compensation records. Sets up proper relationships and constraints.

    Returns:
        str: Path of the database file, taken from settings.DUCKDB_PATH

    Note:
        This operation is idempotent and safe to call multiple times.
        Uses the database path configured in settings.DUCKDB_PATH.
    """
    db = Database(file_path=settings.DUCKDB_PATH)
    db.create_tables()
    # Return the path so the workflow's completion message can report it
    return str(settings.DUCKDB_PATH)


@task
def add_department_to_db(department: Department, business_unit_id: str):
    """Add a new department record to the database."""
    db = Database(file_path=settings.DUCKDB_PATH)

    db.add_job(department.manager)

    for job_spec in department.jobs:
        db.add_job(job_spec.job)

    db.add_department(department, business_unit_id)


@task
def add_business_unit_to_db(business_unit: BusinessUnit):
    """Add a new business unit record to the database."""
    db = Database(file_path=settings.DUCKDB_PATH)
    db.add_job(business_unit.director)
    db.add_business_unit(business_unit)


@task
def add_employee_to_db(employee: Employee, compensation: Compensation):
    """Add a new employee record to the database."""
    db = Database(file_path=settings.DUCKDB_PATH)
    db.add_employee(employee)
    db.add_compensation(compensation, str(employee.id))


@task
def generate_employee(
    job: Job,
    birth_date: datetime.date,
    gender: Gender,
    ethnicity: Ethnicity,
    department_id: uuid.UUID | None = None,
    business_unit_id: uuid.UUID | None = None,
):
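    """Generate one employee record for a job and store it in the database.

    The demographic values are sampled by the caller. Education and
    compensation are resolved through the LLM tasks before the employee and
    compensation rows are inserted.
    """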
    employee = Employee(
        job_id=job.id,
        department_id=department_id,
        business_unit_id=business_unit_id,
        birth_date=birth_date,
        gender=gender,
        ethnicity=ethnicity,
        education_level=None,
        education_field=None,
    )

    education_fields = get_education_fields(employee, job).result()
    employee.education_level = education_fields.education_level
    employee.education_field = education_fields.education_field

    compensation = get_employee_compensation(employee, job).result()

    add_employee_to_db(employee, compensation).result()


@task
def generate_department(
    department: Department, ratios: Ratios, business_unit_id: uuid.UUID | None = None
):
    """Generate all employees for a department using the demographic ratios.

    Creates the manager first, then the staff for each job in parallel
    batches. The records are added to the database.
    """
    # The manager is an employee too, so generate that record first
    generate_employee(
        job=department.manager,
        department_id=department.id,
        business_unit_id=business_unit_id,
        birth_date=get_birth_date(
            Generation[weighted_random_choice(ratios.generation.model_dump())]
        ),
        gender=Gender[weighted_random_choice(ratios.gender.model_dump())],
        ethnicity=Ethnicity[weighted_random_choice(ratios.ethnicity.model_dump())],
    ).result()

    for job_spec in department.jobs:
        # Parallel execution with batching
        for batch in batched(range(job_spec.headcount), n=5):
            futures = [
                generate_employee(
                    job=job_spec.job,
                    department_id=department.id,
                    business_unit_id=business_unit_id,
                    birth_date=get_birth_date(
                        Generation[weighted_random_choice(ratios.generation.model_dump())]
                    ),
                    gender=Gender[weighted_random_choice(ratios.gender.model_dump())],
                    ethnicity=Ethnicity[weighted_random_choice(ratios.ethnicity.model_dump())],
                )
                for _ in batch
            ]

            _ = [future.result() for future in futures]
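
The fan-out pattern in `generate_department()` (start a batch of calls, then wait for every result before starting the next batch) can be sketched with the standard library alone. This is an analogy under stated assumptions, not LangGraph's implementation: `ThreadPoolExecutor` stands in for the `@task` futures, `fake_generate_employee` is a hypothetical placeholder for the real task, and `in_batches` mimics `itertools.batched` so the sketch also runs on Python versions before 3.12.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable[int], n: int) -> Iterator[tuple[int, ...]]:
    """Yield tuples of at most n items (stand-in for itertools.batched)."""
    iterator = iter(items)
    while batch := tuple(islice(iterator, n)):
        yield batch


def fake_generate_employee(index: int) -> str:
    """Hypothetical stand-in for a slow, I/O-bound generation call."""
    time.sleep(0.01)
    return f'employee-{index}'


headcount = 12
results: list[str] = []

with ThreadPoolExecutor(max_workers=5) as pool:
    for batch in in_batches(range(headcount), 5):
        # Fan out: submit every call in the batch before waiting on any of them.
        futures = [pool.submit(fake_generate_employee, i) for i in batch]
        # Wait for the whole batch to finish before starting the next one.
        results.extend(future.result() for future in futures)

print(len(results))
```

Batching bounds the number of calls in flight at once, which is one way to stay within API rate limits.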

### Workflow entrypoint

Define the main workflow entrypoint that orchestrates the entire dataset generation process:

The `dataset_workflow()` function coordinates all tasks in the correct sequence:

1. **Specification Phase**: Convert user input to structured company spec and generate demographic ratios
2. **Database Setup**: Create the database schema
3. **Hierarchical Generation**: For each business unit:
   - Add the business unit and director job to database
   - Generate the director employee record
   - For each department:
     - Add department and all job roles
     - Generate department employees (manager + staff) in parallel batches

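The workflow uses **InMemorySaver** for checkpointing, which keeps workflow state in process memory. This is useful for debugging and for state recovery within a session, but the saved state is lost when the kernel restarts.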

All employee generation respects the demographic ratios to ensure realistic diversity throughout the organization.

In [None]:
checkpointer = InMemorySaver()


@entrypoint(checkpointer=checkpointer)
def dataset_workflow(user_input: str) -> str:
    """Complete AI-powered workflow for generating synthetic HR datasets.

    Orchestrates the end-to-end process of transforming user requirements
    into a fully populated HR database with realistic employee data.
    Combines AI-driven specification generation with systematic data creation.

    Args:
        user_input (str): Natural language description of the desired
                         company structure, industry, and characteristics

    Returns:
        str: Completion message with database information

    Workflow Steps:
        1. Generate company specification from user input
        2. Create demographic ratios based on company characteristics
        3. Initialize database with proper schema
        4. Generate business units and their directors
        5. Create departments with managers and employees
        6. Assign education and compensation to all employees

    Note:
        Uses LangGraph checkpointing for workflow state management and
        recovery. Ensures consistent demographic distributions across
        all generated employees.
    """
    company = get_company_spec(user_input).result()
    ratios = get_demographic_ratios(company).result()

    db_name = create_database().result()

    for business_unit in company.business_units:
        # Add business unit to database
        add_business_unit_to_db(business_unit).result()

        # Add director as employee
        generate_employee(
            job=business_unit.director,
            business_unit_id=business_unit.id,
            birth_date=get_birth_date(
                Generation[weighted_random_choice(ratios.generation.model_dump())]
            ),
            gender=Gender[weighted_random_choice(ratios.gender.model_dump())],
            ethnicity=Ethnicity[weighted_random_choice(ratios.ethnicity.model_dump())],
        ).result()

        # Add departments and their employees
        for department in business_unit.departments:
            add_department_to_db(department, str(business_unit.id)).result()
            generate_department(department, ratios, business_unit.id).result()

    return f'Dataset generation completed. Database: {db_name}'

## Run Agentic Workflow

Execute the agentic workflow with a sample company description.

This example creates a global retail and wholesale enterprise with:

- Multiple business units (domestic retail, international, wholesale clubs, shared services)
- Various store formats (supercenters, warehouse clubs, discount stores)
- Both physical and digital operations (brick-and-mortar + eCommerce)
- Comprehensive product offerings (groceries, essentials, general merchandise)

The workflow will:

1. Generate a complete organizational structure with realistic business units and departments
2. Create hundreds or thousands of employee records with diverse demographics
3. Assign appropriate education credentials and compensation to each employee
4. Store everything in a queryable DuckDB database
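
To get a feel for the scale implied by "hundreds or thousands", the sketch below estimates employee and LLM call counts from the ranges given in the company specification prompt (3-5 business units, 3-7 departments each, 3-10 job roles each). The `estimate` function and the headcount per role (1 and 5) are assumptions for illustration. The actual headcounts come from the generated specification, and request retries are not counted.

```python
def estimate(
    business_units: int,
    departments_per_unit: int,
    roles_per_department: int,
    headcount_per_role: int,
) -> dict[str, int]:
    """Rough count of generated employees and LLM calls for one run."""
    departments = business_units * departments_per_unit
    staff = departments * roles_per_department * headcount_per_role
    # One director per business unit, one manager per department, plus staff.
    employees = business_units + departments + staff
    # Two calls up front (company spec, ratios), then two per employee
    # (education fields, compensation).
    return {'employees': employees, 'llm_calls': 2 + 2 * employees}


low = estimate(3, 3, 3, headcount_per_role=1)
high = estimate(5, 7, 10, headcount_per_role=5)
print(low)   # {'employees': 39, 'llm_calls': 80}
print(high)  # {'employees': 1790, 'llm_calls': 3582}
```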

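The streaming output shows progress while the workflow runs. With `stream_mode='tasks'`, the stream emits an event when a task starts and another when it finishes, so the step counter tracks task events rather than employees.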

In [None]:
from langchain_core.runnables import RunnableConfig


# Anonymized Walmart description ;)
brief_description = """\
A global retail and wholesale enterprise that operates a diverse network of physical stores and digital platforms. \
The company's activities span three main divisions: its domestic retail operations, international markets, and membership-based wholesale clubs. \
The company also has a shared services division that provides centralized corporate functions such as HR, Finance, IT, etc. \
Its store formats include large-scale supercenters, supermarkets, warehouse clubs, cash-and-carry outlets, and discount stores. \
In addition to its brick-and-mortar presence, it runs multiple eCommerce platforms and mobile applications across different countries, \
including regional online marketplaces and digital payment services. \
The company's offerings cover a broad range of consumer needs, with a particular strength in groceries, everyday essentials, and general merchandise.
"""

config = RunnableConfig(configurable={'thread_id': '1'})


counter = 0
for step in dataset_workflow.stream(input=brief_description, config=config, stream_mode='tasks'):
    counter += 1
    print(f'\rStep {counter} - {step}', end='')

## Hugging Face Dataset

In this section, we will format the generated database into a Hugging Face dataset and push it to the Hugging Face Hub for easy access and sharing.

### Format Database as HF Dataset

Extract data from the DuckDB database and convert it into a Hugging Face DatasetDict.

This process:

1. Connects to the generated DuckDB database in read-only mode
2. Queries each table (business_units, departments, jobs, employees, compensations)
3. Converts each table to a pandas DataFrame
4. Creates a Hugging Face Dataset from each DataFrame
5. Combines them into a DatasetDict with named splits

The resulting dataset structure provides easy access to all HR data and shows the record counts for each table.

In [None]:
from datasets import Dataset, DatasetDict


with duckdb.connect(settings.DUCKDB_PATH, read_only=True) as con:
    # Read each table into a pandas DataFrame
    business_units_df = con.execute('SELECT * FROM business_units').df()
    departments_df = con.execute('SELECT * FROM departments').df()
    jobs_df = con.execute('SELECT * FROM jobs').df()
    employees_df = con.execute('SELECT * FROM employees').df()
    compensations_df = con.execute('SELECT * FROM compensations').df()

# Convert DataFrames to Hugging Face Datasets
hf_dataset = DatasetDict(
    {
        'business_units': Dataset.from_pandas(business_units_df),
        'departments': Dataset.from_pandas(departments_df),
        'jobs': Dataset.from_pandas(jobs_df),
        'employees': Dataset.from_pandas(employees_df),
        'compensations': Dataset.from_pandas(compensations_df),
    }
)

# Display dataset info
print('Dataset structure:')
print(hf_dataset)

print('\nSample counts:')
for split_name, dataset in hf_dataset.items():
    print(f'  {split_name}: {len(dataset)} records')

### Push to Hugging Face Hub

Upload the formatted dataset to the Hugging Face Hub for easy sharing and access.

In [None]:
dataset_name = 'dougtrajano/hr-synthetic-database'

# Push to Hugging Face Hub
for split_name, dataset in hf_dataset.items():
    dataset.push_to_hub(
        repo_id=dataset_name,
        config_name=str(split_name),
    )

print(f'Dataset successfully pushed to: https://huggingface.co/datasets/{dataset_name}')

## ✅ Agentic Workflow Complete!

The synthetic HR database has been successfully generated and is ready for use.

Users can access the dataset through the Hugging Face Hub at: [dougtrajano/hr-synthetic-database](https://huggingface.co/datasets/dougtrajano/hr-synthetic-database)

We can now load the dataset directly using the `datasets` library:

```python
from datasets import load_dataset

business_units = load_dataset('dougtrajano/hr-synthetic-database', 'business_units')
departments = load_dataset('dougtrajano/hr-synthetic-database', 'departments')
jobs = load_dataset('dougtrajano/hr-synthetic-database', 'jobs')
employees = load_dataset('dougtrajano/hr-synthetic-database', 'employees')
compensations = load_dataset('dougtrajano/hr-synthetic-database', 'compensations')
```