# Dataset Generation - AI Workflow

This notebook demonstrates how to generate a synthetic HR dataset for a company using an AI workflow.

The dataset generation workflow follows these main phases:

1. **Company Specification**: Convert user input into a structured company specification using an LLM.
2. **Demographic Ratios**: Generate demographic ratios based on company characteristics using an LLM.
3. **Database Setup**: Create the database structure with the HR schema.
4. **Business Unit Processing**: For each business unit:
   - Add the business unit and its director job to the database.
   - Generate the director employee (including education, compensation, and storage).
   - Process all departments within the business unit.
5. **Department Processing**: For each department:
   - Add the department and all of its job roles to the database.
   - Generate the department's employees in parallel batches (manager + staff).
   - Each employee generation includes education determination, compensation calculation, and database storage.
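
The nesting in phases 4 and 5 can be easier to follow as a plain loop. The following is a minimal sketch of the traversal order only; the `BusinessUnit`, `Department`, and `JobRole` classes, the `plan_steps` helper, and the sample company are hypothetical illustrations, not the data models used by the actual workflow.

```python
from dataclasses import dataclass, field


@dataclass
class JobRole:
    title: str
    headcount: int


@dataclass
class Department:
    name: str
    roles: list[JobRole] = field(default_factory=list)


@dataclass
class BusinessUnit:
    name: str
    departments: list[Department] = field(default_factory=list)


def plan_steps(units: list[BusinessUnit]) -> list[str]:
    """Return the ordered steps that phases 4 and 5 imply (hypothetical helper)."""
    steps: list[str] = []
    for unit in units:
        # Phase 4: the business unit and its director come first.
        steps.append(f'add business unit: {unit.name}')
        steps.append(f'generate director: {unit.name}')
        for dept in unit.departments:
            # Phase 5: the department and its jobs are stored before any employee.
            steps.append(f'add department and jobs: {dept.name}')
            steps.append(f'generate manager: {dept.name}')
            for role in dept.roles:
                steps.append(f'generate {role.headcount} x {role.title}')
    return steps


# Made-up sample company, for illustration only.
company = [
    BusinessUnit('Retail', [Department('Stores', [JobRole('Cashier', 12), JobRole('Stocker', 7)])]),
]
for line in plan_steps(company):
    print(line)
```

The sketch lists steps sequentially; in the workflow itself, staff generation within a department runs in parallel batches.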

## Workflow Diagram

```mermaid
flowchart TD
    A[User Input: Company Description] --> B[TASK: get_company_spec]
    B --> C[Company Specification]
    C --> D[TASK: get_demographic_ratios]
    D --> E[Demographic Ratios]
    
    E --> F[TASK: create_database]
    F --> G[Database with HR Schema Created]
    
    C --> H{For each Business Unit}
    H --> I[TASK: add_business_unit_to_db]
    I --> I1[Add Director Job to DB]
    I1 --> I2[Add Business Unit to DB]
    
    I2 --> J[TASK: generate_employee - Director]
    E --> J
    J --> J1[Create Employee Object]
    J1 --> J2[TASK: get_education_fields]
    J2 --> J3[TASK: get_employee_compensation]
    J3 --> J4[TASK: add_employee_to_db]
    
    J4 --> K{For each Department in BU}
    K --> L[TASK: add_department_to_db]
    L --> L1[Add Manager Job to DB]
    L1 --> L2[Add All Department Jobs to DB]
    L2 --> L3[Add Department to DB]
    
    L3 --> M[TASK: generate_department]
    E --> M
    
    M --> N[TASK: generate_employee - Manager]
    N --> N1[Manager: Education + Compensation + DB Storage]
    
    M --> O[Parallel Processing: Staff Generation]
    O --> P{For each Job Role}
    P --> Q[Batch Processing<br/>5 employees at a time]
    Q --> R[TASK: generate_employee × Headcount]
    R --> R1[Staff: Education + Compensation + DB Storage]
    
    R1 --> S{More Job Roles?}
    S -->|Yes| P
    S -->|No| T{More Departments?}
    T -->|Yes| K
    T -->|No| U{More Business Units?}
    U -->|Yes| H
    U -->|No| V[Dataset Generation Completed]
    
    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style E fill:#f3e5f5
    style G fill:#e8f5e8
    style V fill:#fff3e0
    style B fill:#fff9c4
    style D fill:#fff9c4
    style F fill:#fff9c4
    style I fill:#fff9c4
    style J fill:#fff9c4
    style L fill:#fff9c4
    style M fill:#fff9c4
    style N fill:#fff9c4
    style R fill:#fff9c4
    style J2 fill:#fce4ec
    style J3 fill:#fce4ec
    style J4 fill:#e8f5e8
    style O fill:#f1f8e9
    style Q fill:#f1f8e9
```

### Legend

- **🟡 TASK**: LangGraph tasks (functions decorated with `@task`)
- **🔵 Input**: User input
- **🟠 Output**: Final result
- **🟣 Data Models**: Structured data objects
- **🟢 Database Operations**: Data storage operations
- **🌸 AI Sub-tasks**: LLM-powered sub-operations within composite tasks
- **🟢 Parallel Processing** (lighter green): Batch execution for performance
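
The "Batch Processing, 5 employees at a time" node can be sketched with the standard library alone. This is an illustration of the batching idea under stated assumptions, not the notebook's implementation: the real workflow runs LangGraph tasks, whereas this sketch uses a thread pool, and `generate_employee` here is a hypothetical stand-in that returns a small dict.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 5  # mirrors the "5 employees at a time" node in the diagram


def generate_employee(job_title: str, index: int) -> dict:
    """Hypothetical stand-in for the real generate_employee task."""
    return {'job_title': job_title, 'index': index}


def generate_staff(job_title: str, headcount: int) -> list[dict]:
    """Generate `headcount` employees, running at most BATCH_SIZE at once."""
    employees: list[dict] = []
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for start in range(0, headcount, BATCH_SIZE):
            batch = range(start, min(start + BATCH_SIZE, headcount))
            # Submit one batch, then wait for all of it before starting the next.
            futures = [pool.submit(generate_employee, job_title, i) for i in batch]
            employees.extend(f.result() for f in futures)
    return employees


staff = generate_staff('Cashier', 12)
print(len(staff))
```

Waiting for each batch before submitting the next caps the number of concurrent calls, which helps when each call hits a rate-limited LLM API.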

### Key Implementation Details

1. **LangGraph Framework**: Uses the `@task` and `@entrypoint` decorators for workflow orchestration
2. **Composite Tasks**: `generate_employee` and `generate_department` each encapsulate multiple operations
3. **Parallel Execution**: Employees are generated in batches of 5 for performance
4. **Job Management**: Job roles are added to the database before employee generation
5. **AI Integration**: Uses GPT-4o for company specs and GPT-5-nano for employee details
6. **Memory Management**: Uses the `InMemorySaver` checkpointer, which keeps workflow state in memory for the lifetime of the process
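
One plausible reason for point 4 is referential integrity: if employee rows reference job rows, the job must exist first. The sketch below illustrates this with an in-memory SQLite database. Both the choice of SQLite and the two-table schema are assumptions made for the example; the notebook does not state which database engine or schema the workflow uses.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite does not enforce foreign keys by default
# Hypothetical two-table schema, for illustration only.
conn.execute('CREATE TABLE job (id INTEGER PRIMARY KEY, title TEXT NOT NULL)')
conn.execute(
    'CREATE TABLE employee ('
    'id INTEGER PRIMARY KEY, name TEXT NOT NULL, '
    'job_id INTEGER NOT NULL REFERENCES job(id))'
)


def add_employee(name: str, job_id: int) -> bool:
    """Insert an employee; return False if the referenced job does not exist."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO employee (name, job_id) VALUES (?, ?)', (name, job_id))
        return True
    except sqlite3.IntegrityError:
        return False


before = add_employee('Alex', 1)  # job 1 has not been stored yet
with conn:
    conn.execute("INSERT INTO job (id, title) VALUES (1, 'Store Manager')")
after = add_employee('Alex', 1)  # same insert, now the job exists
print(before, after)
```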

## Setup Environment

In [None]:
# Add the project root to sys.path so that `src` can be imported
import pathlib
import sys

import mlflow
from dotenv import load_dotenv


project_root = pathlib.Path().resolve().parent.parent
sys.path.append(str(project_root))

# Load environment variables from the .env file
load_dotenv(str(project_root / '.env'))

# Import project modules only after the path and environment are set up
from src.dataset.workflow import dataset_workflow  # noqa: E402

# Enable MLflow autologging for OpenAI and LangChain calls
mlflow.openai.autolog()
mlflow.langchain.autolog()

## Run AI Workflow

In [None]:
brief_description = (
    'A global retail and wholesale enterprise that operates a diverse '
    "network of physical stores and digital platforms. The company's activities span three main divisions: "
    'its domestic retail operations, international markets, and membership-based wholesale clubs. '
    'The company also has a shared services division that provides centralized corporate functions such as HR, Finance, IT, etc. '
    'Its store formats include large-scale supercenters, supermarkets, warehouse clubs, cash-and-carry outlets, '
    'and discount stores. In addition to its brick-and-mortar presence, it runs multiple eCommerce platforms and '
    'mobile applications across different countries, including regional online marketplaces and digital payment services. '
    "The company's offerings cover a broad range of consumer needs, with a particular strength in groceries, "
    'everyday essentials, and general merchandise.'
)

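# The checkpointer keys saved workflow state by thread_id
config = {'configurable': {'thread_id': '1'}}


# Stream task events and overwrite a single progress line as they arrive
for counter, step in enumerate(
    dataset_workflow.stream(input=brief_description, config=config, stream_mode='tasks'),
    start=1,
):
    print(f'Step {counter} - {step}', end='\r')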
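
Printing every raw event can be hard to read on a long run, so a per-task tally may be a useful complement. The sketch below counts events by task name. It assumes each streamed event is a mapping with a `name` key; that shape is an assumption and should be checked against the installed LangGraph version. `tally_events` and the sample events are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable


def tally_events(events: Iterable[Any]) -> Counter:
    """Count streamed events per task name (assumes mapping events with a 'name' key)."""
    counts: Counter = Counter()
    for event in events:
        # Anything that is not a dict, or lacks 'name', is grouped under 'unknown'.
        name = event.get('name', 'unknown') if isinstance(event, dict) else 'unknown'
        counts[name] += 1
    return counts


# Hypothetical events standing in for the output of dataset_workflow.stream(...).
sample = [
    {'name': 'get_company_spec'},
    {'name': 'generate_employee'},
    {'name': 'generate_employee'},
    'unexpected',
]
print(tally_events(sample).most_common())
```

In the notebook, the same helper could be fed the generator returned by `dataset_workflow.stream(...)` in place of `sample`.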