# Dataset Generation - AI Workflow

This notebook demonstrates how to generate a synthetic HR dataset for a company using an AI workflow.

The dataset generation workflow follows these main phases:

1. **Company Specification**: Convert the user's input into a structured company specification using an LLM.
2. **Demographic Ratios**: Generate demographic ratios from the company's characteristics using an LLM.
3. **Database Setup**: Create the database structure with the HR schema.
4. **Business Unit Processing**: For each business unit:
   - Add the business unit and its director job to the database.
   - Generate the director employee (education, compensation, and database storage).
   - Process all departments within the business unit.
5. **Department Processing**: For each department:
   - Add the department and all of its job roles to the database.
   - Generate the department's employees (manager and staff) in parallel batches.
   - Each employee generation determines education, calculates compensation, and stores the result in the database.
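
The nesting of phases 4 and 5 can be sketched in plain Python. This is only an illustration of the control flow, not the notebook's implementation: the `BusinessUnit` and `Department` classes, the stubbed `generate_employee`, and the list standing in for the database are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Department:
    name: str
    jobs: list[str]  # hypothetical: job titles to fill in this department


@dataclass
class BusinessUnit:
    name: str
    departments: list[Department] = field(default_factory=list)


def generate_employee(job: str, unit: str, db: list[dict]) -> None:
    """Stub: the real task also determines education and compensation."""
    db.append({"job": job, "unit": unit})


def run_workflow(units: list[BusinessUnit]) -> list[dict]:
    db: list[dict] = []  # stands in for the HR database created in phase 3
    for unit in units:  # phase 4: one director per business unit
        generate_employee("Director", unit.name, db)
        for dept in unit.departments:  # phase 5: manager and staff per department
            for job in dept.jobs:
                generate_employee(job, f"{unit.name}/{dept.name}", db)
    return db


units = [
    BusinessUnit("Retail", [Department("Stores", ["Manager", "Cashier", "Stocker"])]),
    BusinessUnit("Shared Services", [Department("HR", ["Manager", "Recruiter"])]),
]
employees = run_workflow(units)
print(len(employees))
```

In the actual workflow the inner loop runs in parallel batches rather than sequentially, and the company structure comes from the LLM-generated specification instead of a hand-written list.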

## Workflow Diagram

```mermaid
flowchart TD
    Start([User Input]) --> A[dataset_workflow]
    A --> B[get_company_spec<br/>LLM: gpt-4o]
    A --> C[get_demographic_ratios<br/>LLM: gpt-4o]
    B --> C
    A --> D[create_database]
    C --> D
    
    D --> E{For each<br/>Business Unit}
    E --> F[add_business_unit_to_db]
    F --> G[generate_employee<br/>Director]
    
    G --> H[get_education_fields<br/>LLM: gpt-5-nano]
    G --> I[get_employee_compensation<br/>LLM: gpt-5-nano]
    H --> J[add_employee_to_db]
    I --> J
    
    J --> K{For each<br/>Department}
    K --> L[add_department_to_db]
    L --> M[generate_department]
    
    M --> N[generate_employee<br/>Manager]
    N --> O[get_education_fields<br/>LLM: gpt-5-nano]
    N --> P[get_employee_compensation<br/>LLM: gpt-5-nano]
    O --> Q[add_employee_to_db]
    P --> Q
    
    Q --> R{For each<br/>Job/Employee}
    R --> S[generate_employee<br/>Staff]
    S --> T[get_education_fields<br/>LLM: gpt-5-nano]
    S --> U[get_employee_compensation<br/>LLM: gpt-5-nano]
    T --> V[add_employee_to_db]
    U --> V
    
    V --> R
    R --> K
    K --> E
    E --> End([Dataset Complete])
    
    style B fill:#ff9966,stroke:#333,stroke-width:2px
    style C fill:#ff9966,stroke:#333,stroke-width:2px
    style H fill:#ff9966,stroke:#333,stroke-width:2px
    style I fill:#ff9966,stroke:#333,stroke-width:2px
    style O fill:#ff9966,stroke:#333,stroke-width:2px
    style P fill:#ff9966,stroke:#333,stroke-width:2px
    style T fill:#ff9966,stroke:#333,stroke-width:2px
    style U fill:#ff9966,stroke:#333,stroke-width:2px
    
    style D fill:#6699ff,stroke:#333,stroke-width:2px
    style F fill:#6699ff,stroke:#333,stroke-width:2px
    style J fill:#6699ff,stroke:#333,stroke-width:2px
    style L fill:#6699ff,stroke:#333,stroke-width:2px
    style Q fill:#6699ff,stroke:#333,stroke-width:2px
    style V fill:#6699ff,stroke:#333,stroke-width:2px
    
    style G fill:#99ccff,stroke:#333,stroke-width:2px
    style M fill:#99ccff,stroke:#333,stroke-width:2px
    style N fill:#99ccff,stroke:#333,stroke-width:2px
    style S fill:#99ccff,stroke:#333,stroke-width:2px
```
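
The `create_database` and `add_*_to_db` nodes in the diagram write to the HR database, but the notebook does not show its schema. The following is a minimal sketch of what such a schema might look like, using an in-memory SQLite database. Every table name, column name, and sample value here is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical HR schema: business unit -> department -> job -> employee.
conn.executescript(
    """
    CREATE TABLE business_unit (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE department (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        business_unit_id INTEGER NOT NULL REFERENCES business_unit(id)
    );
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        department_id INTEGER REFERENCES department(id)
    );
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        job_id INTEGER NOT NULL REFERENCES job(id),
        education_field TEXT,
        salary REAL
    );
    """
)

# Insert in the order the diagram implies: unit, department, job, then employee.
unit_id = conn.execute(
    "INSERT INTO business_unit (name) VALUES (?)", ("Retail",)
).lastrowid
dept_id = conn.execute(
    "INSERT INTO department (name, business_unit_id) VALUES (?, ?)",
    ("Stores", unit_id),
).lastrowid
job_id = conn.execute(
    "INSERT INTO job (title, department_id) VALUES (?, ?)",
    ("Store Manager", dept_id),
).lastrowid
conn.execute(
    "INSERT INTO employee (job_id, education_field, salary) VALUES (?, ?, ?)",
    (job_id, "Business Administration", 85000.0),
)

# Walk the foreign keys back up from the employee to the business unit.
row = conn.execute(
    """
    SELECT bu.name, d.name, j.title
    FROM employee e
    JOIN job j ON j.id = e.job_id
    JOIN department d ON d.id = j.department_id
    JOIN business_unit bu ON bu.id = d.business_unit_id
    """
).fetchone()
print(row)
```

Inserting parents before children mirrors the ordering in the diagram, where job roles exist in the database before any employee that references them is generated.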

### Key Implementation Details

1. **LangGraph Framework**: Uses the `@task` and `@entrypoint` decorators for workflow orchestration.
2. **Composite Tasks**: `generate_employee` and `generate_department` each encapsulate multiple operations.
3. **Parallel Execution**: Employees are generated in parallel batches of 5 to reduce total run time.
4. **Job Management**: Job roles are added to the database before the employees who fill them are generated.
5. **AI Integration**: Uses GPT-4o for the company specification and demographic ratios, and GPT-5-nano for employee education and compensation.
6. **Memory Management**: Uses an `InMemorySaver` checkpointer, which keeps workflow state in memory for the lifetime of the process.
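
The batching described in item 3 can be illustrated with the standard library. This sketch uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for LangGraph task futures; the stubbed `generate_employee`, the `batched` helper, and the job titles are assumptions, and only the batch size of 5 comes from the notes above.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 5  # batch size stated in the implementation notes


def generate_employee(job: str) -> dict:
    """Stub for the real task, which calls the LLM and writes to the database."""
    return {"job": job}


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def generate_department(jobs: list[str]) -> list[dict]:
    employees: list[dict] = []
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for batch in batched(jobs, BATCH_SIZE):
            # Submit the whole batch, then wait for it before starting the next one.
            futures = [pool.submit(generate_employee, job) for job in batch]
            employees.extend(future.result() for future in futures)
    return employees


jobs = [f"Associate {i}" for i in range(12)]
staff = generate_department(jobs)
print(len(staff))
```

Waiting for each batch before submitting the next caps the number of concurrent LLM requests at the batch size, which is one plausible reason to batch rather than submit every employee at once.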

## Setup Environment

In [None]:
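# Add the project root to the import path before importing project modules
import pathlib
import sys

import mlflow
from dotenv import load_dotenv


project_root = pathlib.Path().resolve().parent.parent
sys.path.append(str(project_root))

# This import relies on project_root being on sys.path, so it must come after the append
from src.dataset.workflow import dataset_workflow  # noqa: E402

# Load environment variables from the .env file
load_dotenv(str(project_root / '.env'))

# Trace OpenAI and LangChain calls in MLflow
mlflow.openai.autolog()
mlflow.langchain.autolog()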

## Run AI Workflow

In [None]:
brief_description = (
    'A global retail and wholesale enterprise that operates a diverse '
    "network of physical stores and digital platforms. The company's activities span three main divisions: "
    'its domestic retail operations, international markets, and membership-based wholesale clubs. '
    'The company also has a shared services division that provides centralized corporate functions such as HR, Finance, IT, etc. '
    'Its store formats include large-scale supercenters, supermarkets, warehouse clubs, cash-and-carry outlets, '
    'and discount stores. In addition to its brick-and-mortar presence, it runs multiple eCommerce platforms and '
    'mobile applications across different countries, including regional online marketplaces and digital payment services. '
    "The company's offerings cover a broad range of consumer needs, with a particular strength in groceries, "
    'everyday essentials, and general merchandise.'
)

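# The checkpointer stores workflow state under this thread ID
config = {'configurable': {'thread_id': '1'}}


# Stream task events and overwrite a single progress line as each one arrives
for counter, step in enumerate(
    dataset_workflow.stream(input=brief_description, config=config, stream_mode='tasks'),
    start=1,
):
    print(f'Step {counter} - {step}', end='\r')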