# Fair Forge Generators - OpenAI Example

This notebook demonstrates how to use the Fair Forge generators module with **OpenAI** models to create synthetic test datasets.

## Overview

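The `BaseGenerator` class accepts any LangChain-compatible chat model, including OpenAI's `ChatOpenAI`, so the same generation code can target a different provider by swapping the model object. Given a set of context documents, the generator produces test queries grounded in their content.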

## Setup

Set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY="your-api-key"
```

Or create a `.env` file in the working directory, which `load_dotenv()` reads in the Imports cell:
```.env
OPENAI_API_KEY=your-api-key
```
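
Before creating a model, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch, assuming the variable name `OPENAI_API_KEY` used above; the helper `require_env` is hypothetical and not part of Fair Forge or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early with an actionable message instead of a later API error.
        raise RuntimeError(
            f"{name} is not set. Export it or add it to your .env file."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` surfaces a missing key before any request is sent.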

Install required dependencies:
```bash
uv venv
source .venv/bin/activate

# Install fair-forge (core package)
uv pip install alquimia-fair-forge

# Install the OpenAI LangChain integration, dotenv support, and JupyterLab
uv pip install langchain-openai python-dotenv jupyterlab

uv run jupyter lab
```

If you install packages from inside a running Jupyter session, **restart the kernel** so the newly installed packages can be imported.
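
To check what is installed in the kernel's environment, the standard library can report distribution versions. This is a small sketch using the package names from the install commands above; the helper name `installed_version` is an assumption for illustration. It reads metadata from disk, so a module that was already imported may still be stale until the kernel restarts.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names taken from the install commands in this notebook.
for name in ("alquimia-fair-forge", "langchain-openai", "python-dotenv"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```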

## Imports

In [None]:
import os
import json
from pathlib import Path
from dotenv import load_dotenv

from langchain_openai import ChatOpenAI

from fair_forge.generators import (
    BaseGenerator,
    create_markdown_loader,
    # Strategies for chunk selection
    SequentialStrategy,
    RandomSamplingStrategy,
)
from fair_forge.schemas import Dataset, Batch

# Load environment variables
load_dotenv()

print("Imports loaded successfully")

## Create Sample Content

Let's create a sample markdown document to generate test queries from:

In [None]:
sample_content = """# AI Assistant Guidelines

This document outlines best practices for building AI assistants.

## Core Principles

AI assistants should follow these core principles:

1. **Helpfulness**: Always aim to provide useful and accurate information
2. **Safety**: Never provide harmful or dangerous advice
3. **Honesty**: Be transparent about limitations and uncertainties
4. **Privacy**: Respect user privacy and data protection

## Response Quality

High-quality responses should:
- Be clear and concise
- Address the user's actual question
- Provide relevant context when needed
- Cite sources for factual claims

## Error Handling

When the assistant encounters errors or edge cases:
- Acknowledge the limitation clearly
- Suggest alternative approaches
- Never make up information
"""

# Save to file (explicit encoding keeps the result consistent across platforms)
sample_file = Path("./sample_guidelines.md")
sample_file.write_text(sample_content, encoding="utf-8")
print(f"Sample content saved to: {sample_file}")

## Create Context Loader

In [None]:
# Create markdown loader with hybrid chunking
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Preview the chunks
chunks = loader.load(str(sample_file))
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"- {chunk.chunk_id}: {len(chunk.content)} chars")
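
To make the two loader parameters more concrete, here is a rough sketch of what header-based chunking with a size cap can look like. It is not Fair Forge's implementation: the function `split_by_headers`, the character-based cap, and the fallback rule (packing paragraphs of an oversized section up to the cap) are all assumptions for illustration, and the sketch ignores fenced code blocks that might contain `#` lines.

```python
import re


def split_by_headers(text, header_levels=(1, 2, 3), max_chunk_size=2000):
    """Split markdown at headers of the given levels, then cap chunk size."""
    marks = "|".join("#" * level for level in header_levels)
    header = re.compile(rf"^(?:{marks})\s+\S", re.MULTILINE)

    # Section boundaries: every matching header start, plus the end of the text.
    starts = [m.start() for m in header.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

    chunks = []
    for section in filter(None, sections):
        if len(section) <= max_chunk_size:
            chunks.append(section)
            continue
        # Fallback for oversized sections: pack paragraphs up to the cap.
        # A single paragraph longer than the cap is kept whole in this sketch.
        current = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chunk_size:
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Under these assumptions, the sample document above would split at each `#` and `##` header, since every section is well below the 2000-character cap.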

## Create Generator with OpenAI Model

The `BaseGenerator` accepts any LangChain-compatible chat model. Here we use `ChatOpenAI` from `langchain-openai`.

In [None]:
# Create OpenAI model using LangChain
model = ChatOpenAI(
    model="gpt-4o-mini",  # Options: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
    temperature=0.7,
    max_tokens=2048,
)

# Create generator with the model
generator = BaseGenerator(
    model=model,
    use_structured_output=True,  # Use OpenAI's native structured output
)

print(f"Generator created with model: {model.model_name}")

## Generate Test Dataset

In [None]:
# Generate complete dataset
async def generate_dataset():
    print("Generating test dataset with OpenAI...\n")
    
    # generate_dataset returns list[Dataset]
    datasets = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=3,
        language="english",
    )
    
    # With default SequentialStrategy, we get one dataset
    dataset = datasets[0]
    
    print(f"Generated {len(datasets)} dataset(s):")
    print(f"  Session ID: {dataset.session_id}")
    print(f"  Total queries: {len(dataset.conversation)}\n")
    
    print("Generated queries:")
    for batch in dataset.conversation:
        difficulty = batch.agentic.get('difficulty', 'N/A')
        query_type = batch.agentic.get('query_type', 'N/A')
        print(f"  [{batch.qa_id}] ({difficulty}/{query_type})")
        print(f"    {batch.query}\n")
    
    return datasets

# Execute
datasets = await generate_dataset()
dataset = datasets[0]
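
The top-level `await` above works because Jupyter runs cells inside an event loop. In a plain Python script there is no running loop, so the coroutine has to be driven explicitly. The sketch below shows the pattern with a stand-in coroutine: `fake_generate` is hypothetical, only mimics the shape of an async call, and does not contact OpenAI or use Fair Forge.

```python
import asyncio


async def fake_generate(num_queries: int) -> list[str]:
    """Stand-in for an async generation call; returns placeholder queries."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return [f"query {i + 1}" for i in range(num_queries)]


def main() -> list[str]:
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(fake_generate(3))


if __name__ == "__main__":
    print(main())
```

In a script, the same pattern would wrap the notebook's own coroutine, for example `asyncio.run(generate_dataset())`.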

## Generate with Seed Examples

Guide the style of generated questions using seed examples:

In [None]:
async def generate_with_seeds():
    seed_examples = [
        "What should an AI assistant do when it encounters a harmful request?",
        "How can an assistant ensure its responses respect user privacy?",
        "What are the key characteristics of a high-quality AI response?",
    ]
    
    print("Generating with seed examples...\n")
    
    datasets = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=2,
        seed_examples=seed_examples,
    )
    
    dataset = datasets[0]
    print(f"Generated {len(dataset.conversation)} queries with seed examples")
    for batch in dataset.conversation[:3]:
        print(f"  - {batch.query}")
    
    return datasets

# Execute
datasets_with_seeds = await generate_with_seeds()

## Chunk Selection Strategies

Strategies control how chunks are selected and grouped during generation.
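
As a rough mental model, a strategy can be thought of as a function from the full chunk list to one or more groups of chunks, with one dataset generated per group. The sketch below is not Fair Forge's code. `select_sequential` and `select_random` are hypothetical names that mirror the behaviour described in this notebook: sequential selection yields a single group in document order, while random sampling yields `num_samples` groups of `chunks_per_sample` chunks each. Sampling without replacement inside a group is an assumption.

```python
import random


def select_sequential(chunks):
    """One group containing every chunk, in document order."""
    return [list(chunks)]


def select_random(chunks, num_samples, chunks_per_sample, seed=None):
    """Several groups, each a random sample drawn without replacement."""
    rng = random.Random(seed)  # a private RNG keeps runs reproducible
    pool = list(chunks)
    size = min(chunks_per_sample, len(pool))
    return [rng.sample(pool, size) for _ in range(num_samples)]


# Illustrative chunk identifiers; real chunk IDs come from the loader.
chunk_ids = ["intro", "core-principles", "response-quality", "error-handling"]
print(select_sequential(chunk_ids))
print(select_random(chunk_ids, num_samples=2, chunks_per_sample=2, seed=42))
```

Fixing `seed` makes the sampled groups repeatable between runs, which is the role of `seed=42` in the example that follows.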

### RandomSamplingStrategy

Randomly samples chunks multiple times to create diverse test datasets:

In [None]:
async def generate_with_random_sampling():
    """Generate multiple datasets using random chunk sampling."""
    
    strategy = RandomSamplingStrategy(
        num_samples=2,       # Create 2 datasets
        chunks_per_sample=2, # Each with 2 random chunks
        seed=42,             # For reproducibility
    )
    
    print(f"Strategy: {strategy}\n")
    
    datasets = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=2,
        selection_strategy=strategy,
    )
    
    print(f"Generated {len(datasets)} datasets:\n")
    for i, ds in enumerate(datasets):
        print(f"Dataset {i+1}: {len(ds.conversation)} queries")
        chunk_ids = {b.agentic.get('chunk_id', 'N/A') for b in ds.conversation}
        print(f"  Chunks: {chunk_ids}\n")
    
    return datasets

# Execute
random_datasets = await generate_with_random_sampling()

## Conversation Mode

Generate coherent multi-turn conversations in which each question builds on the previous one:

In [None]:
async def generate_conversations():
    """Generate coherent multi-turn conversations."""
    
    print("Generating conversations (each turn builds on the previous)...\n")
    
    datasets = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=3,  # 3-turn conversations
        conversation_mode=True,   # Enable conversation mode
    )
    
    dataset = datasets[0]
    print(f"Generated {len(dataset.conversation)} conversation turns:\n")
    
    # Group by chunk to show conversation flow
    current_chunk = None
    for batch in dataset.conversation:
        chunk_id = batch.agentic.get('chunk_id', 'N/A')
        turn_num = batch.agentic.get('turn_number', 0)
        
        if chunk_id != current_chunk:
            print(f"\n--- Conversation for: {chunk_id} ---")
            current_chunk = chunk_id
        
        print(f"  Turn {turn_num}: {batch.query}")
    
    return datasets

# Execute
conversation_datasets = await generate_conversations()
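
The loop above prints a new header whenever the chunk changes, which relies on turns for the same chunk being adjacent. If that ordering is not guaranteed, grouping explicitly is safer. This is a small sketch on plain dictionaries; only the `chunk_id` and `turn_number` keys come from this notebook, while the helper `group_turns` and the sample records are invented for illustration.

```python
from collections import defaultdict


def group_turns(records):
    """Group records by chunk_id and order each group by turn_number."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("chunk_id", "N/A")].append(record)
    return {
        chunk_id: sorted(turns, key=lambda r: r.get("turn_number", 0))
        for chunk_id, turns in grouped.items()
    }


# Invented sample records, deliberately out of order.
records = [
    {"chunk_id": "a", "turn_number": 2, "query": "And what about edge cases?"},
    {"chunk_id": "b", "turn_number": 1, "query": "How should errors be handled?"},
    {"chunk_id": "a", "turn_number": 1, "query": "What are the core principles?"},
]
for chunk_id, turns in group_turns(records).items():
    print(chunk_id, [t["query"] for t in turns])
```

With real data, each record could be built as `{**batch.agentic, "query": batch.query}` for every batch in `dataset.conversation`.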

## Save Generated Dataset

In [None]:
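# Save the first dataset from "Generate Test Dataset" to JSON.
# mode="json" asks Pydantic to emit JSON-safe values (e.g. datetimes as strings).
output_file = Path("./generated_tests_openai.json")
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(dataset.model_dump(mode="json"), f, indent=2)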

print(f"Dataset saved to: {output_file}")
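
Only one dataset is written above. When a strategy returns several datasets, one file per dataset is a simple layout. The sketch below works on plain dictionaries (the kind of structure `model_dump()` returns), so it runs without Fair Forge. The helper `save_all` and the `{prefix}_{index}.json` naming scheme are assumptions, not a Fair Forge convention.

```python
import json
import tempfile
from pathlib import Path


def save_all(dumped_datasets, directory, prefix="generated_tests"):
    """Write each dataset dict to its own JSON file and return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, data in enumerate(dumped_datasets, start=1):
        path = directory / f"{prefix}_{index}.json"
        # default=str is a blunt fallback for values json cannot encode.
        path.write_text(json.dumps(data, indent=2, default=str), encoding="utf-8")
        paths.append(path)
    return paths


# Demonstration with invented data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = [
        {"session_id": "s1", "conversation": []},
        {"session_id": "s2", "conversation": []},
    ]
    for path in save_all(demo, tmp):
        loaded = json.loads(path.read_text(encoding="utf-8"))
        print(path.name, loaded["session_id"])
```

For the datasets from the random sampling example, a call might look like `save_all([ds.model_dump() for ds in random_datasets], "./out")`.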

## Available OpenAI Models

| Model | Description | Best For |
|-------|-------------|----------|
| `gpt-4o` | Most capable, multimodal | Complex reasoning |
| `gpt-4o-mini` | Fast and cost-effective | General tasks |
| `gpt-4-turbo` | Previous flagship | Longer context |
| `gpt-3.5-turbo` | Fast and cheap | Simple tasks |

In [None]:
# Example: Use GPT-4o for higher quality generation
# model_gpt4 = ChatOpenAI(
#     model="gpt-4o",
#     temperature=0.5,  # Lower temperature for more focused output
# )
# generator_gpt4 = BaseGenerator(model=model_gpt4)

## Cleanup

In [None]:
# Clean up sample files
if sample_file.exists():
    sample_file.unlink()
if output_file.exists():
    output_file.unlink()
print("Cleanup completed")