# Fair Forge Generators - OpenAI Example

This notebook demonstrates how to use the Fair Forge generators module with **OpenAI** models to create synthetic test datasets.

## Overview

The `OpenAIGenerator` uses LangChain to call OpenAI chat models (for example GPT-4o, GPT-4o-mini, or GPT-3.5 Turbo) and generate test queries from context documents.

## Setup

Set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY="your-api-key"
```

Or create a `.env` file in the working directory; the notebook loads it with `load_dotenv()`:

```.env
OPENAI_API_KEY=your-api-key
```

Install required dependencies:
```bash
# Run from the repository root; the quotes keep shells such as zsh from expanding the brackets
uv pip install ".[generators]"
uv pip install langchain-openai python-dotenv
```
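
Once the key is configured, it can help to fail early with a clear message rather than wait for the first API call to be rejected. The check below is a minimal sketch that uses only the standard library; `require_api_key` is a hypothetical helper name, not part of Fair Forge.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key from the environment, or raise if it is missing or blank.

    Hypothetical helper for illustration; Fair Forge does not ship this function.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file")
    return value
```

In the notebook, calling `require_api_key()` after `load_dotenv()` would cover both the exported variable and the `.env` file.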

## Imports

In [None]:
import os
import json
from pathlib import Path
from dotenv import load_dotenv

from fair_forge.generators import (
    create_openai_generator,
    create_markdown_loader,
    OpenAIGenerator,
)
from fair_forge.schemas import Dataset, Batch

# Load environment variables
load_dotenv()

print("Imports loaded successfully")

## Create Sample Content

Create a sample Markdown document to use as the source for the generated test queries:

In [None]:
sample_content = """# AI Assistant Guidelines

This document outlines best practices for building AI assistants.

## Core Principles

AI assistants should follow these core principles:

1. **Helpfulness**: Always aim to provide useful and accurate information
2. **Safety**: Never provide harmful or dangerous advice
3. **Honesty**: Be transparent about limitations and uncertainties
4. **Privacy**: Respect user privacy and data protection

## Response Quality

High-quality responses should:
- Be clear and concise
- Address the user's actual question
- Provide relevant context when needed
- Cite sources for factual claims

## Error Handling

When the assistant encounters errors or edge cases:
- Acknowledge the limitation clearly
- Suggest alternative approaches
- Never make up information
"""

# Save to file
sample_file = Path("./sample_guidelines.md")
sample_file.write_text(sample_content, encoding="utf-8")
print(f"Sample content saved to: {sample_file}")

## Create Context Loader

In [None]:
# Create markdown loader with hybrid chunking
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Preview the chunks
chunks = loader.load(str(sample_file))
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"- {chunk.chunk_id}: {len(chunk.content)} chars")
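
The loader's internals are not shown in this notebook. As a rough mental model only, header-plus-size chunking can be pictured as splitting the document at the selected header levels and then slicing any section that exceeds the size limit. The function below is a standard-library sketch under that assumption; `split_markdown` is a hypothetical stand-in and not Fair Forge's implementation.

```python
import re


def split_markdown(text, max_chunk_size=2000, header_levels=(1, 2, 3)):
    """Split on the given header levels, then cap each section by size.

    Illustrative assumption about how such a loader could behave.
    """
    levels = "|".join("#" * n for n in sorted(header_levels))
    header = re.compile(rf"^(?:{levels})\s+\S")

    sections, current = [], []
    for line in text.splitlines():
        # Start a new section whenever a matching header line appears.
        if header.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        # Fall back to fixed-size slices when a section exceeds the limit.
        for start in range(0, len(section), max_chunk_size):
            chunks.append(section[start:start + max_chunk_size])
    return chunks
```

With the sample document and `max_chunk_size=2000`, every section fits in one slice, so this sketch would yield one chunk per header. The real loader may group or label chunks differently.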

## Create OpenAI Generator

The generator reads the API key from the `OPENAI_API_KEY` environment variable, so if you rely on a `.env` file, run the imports cell (which calls `load_dotenv()`) first.

In [None]:
# Create OpenAI generator with GPT-4o-mini (cost-effective)
generator = create_openai_generator(
    model_name="gpt-4o-mini",  # Options: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
    temperature=0.7,
    max_tokens=2048,
    use_structured_output=True,  # Use OpenAI's native structured output
)

print(f"OpenAI generator created with model: {generator.model_name}")

## Generate Test Dataset

In [None]:
# Generate complete dataset
async def generate_dataset():
    print("Generating test dataset with OpenAI...\n")
    
    dataset = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=3,
        language="english",
    )
    
    print("Generated dataset:")
    print(f"  Session ID: {dataset.session_id}")
    print(f"  Total queries: {len(dataset.conversation)}\n")
    
    print("Generated queries:")
    for batch in dataset.conversation:
        difficulty = batch.agentic.get('difficulty', 'N/A')
        query_type = batch.agentic.get('query_type', 'N/A')
        print(f"  [{batch.qa_id}] ({difficulty}/{query_type})")
        print(f"    {batch.query}\n")
    
    return dataset

# Execute
dataset = await generate_dataset()
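
Beyond printing each query, a quick tally by difficulty and query type shows whether the generated set is balanced. The sketch below assumes only what the cell above already relies on: each batch has an `agentic` dict that may contain `difficulty` and `query_type`. `summarize_batches` is a hypothetical helper name.

```python
from collections import Counter


def summarize_batches(batches):
    """Count batches per (difficulty, query_type) pair.

    Missing keys are reported as 'N/A', mirroring the printing loop above.
    """
    counts = Counter()
    for batch in batches:
        meta = getattr(batch, "agentic", None) or {}
        key = (meta.get("difficulty", "N/A"), meta.get("query_type", "N/A"))
        counts[key] += 1
    return counts
```

Calling `summarize_batches(dataset.conversation).most_common()` would then list the most frequent combinations first.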

## Generate with Seed Examples

Guide the style of generated questions using seed examples:

In [None]:
async def generate_with_seeds():
    seed_examples = [
        "What should an AI assistant do when it encounters a harmful request?",
        "How can an assistant ensure its responses respect user privacy?",
        "What are the key characteristics of a high-quality AI response?",
    ]
    
    print("Generating with seed examples...\n")
    
    dataset = await generator.generate_dataset(
        context_loader=loader,
        source=str(sample_file),
        assistant_id="my-assistant",
        num_queries_per_chunk=2,
        seed_examples=seed_examples,
    )
    
    print(f"Generated {len(dataset.conversation)} queries with seed examples")
    for batch in dataset.conversation[:3]:
        print(f"  - {batch.query}")
    
    return dataset

# Execute
dataset_with_seeds = await generate_with_seeds()
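
The `await` calls in these cells work because Jupyter allows top-level `await`. In a plain Python script there is no running event loop, so the coroutine has to be driven explicitly. A minimal sketch, with `fake_generate` as a hypothetical stand-in for `generate_dataset()` or `generate_with_seeds()`:

```python
import asyncio


async def fake_generate():
    # Stand-in coroutine; in a real script, call generate_dataset() instead.
    await asyncio.sleep(0)
    return "done"


def run_sync(coro):
    """Run a coroutine to completion from synchronous code.

    asyncio.run() raises RuntimeError if an event loop is already running,
    so this is meant for scripts, not notebook cells.
    """
    return asyncio.run(coro)
```

In a script, `dataset = run_sync(generate_dataset())` would replace the `await` line.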

## Save Generated Dataset

In [None]:
# Save dataset to JSON
output_file = Path("./generated_tests_openai.json")
with open(output_file, "w", encoding="utf-8") as f:
    # mode="json" converts values such as datetimes into JSON-safe types
    json.dump(dataset.model_dump(mode="json"), f, indent=2)

print(f"Dataset saved to: {output_file}")
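
To confirm the file is usable later, it can be read back and given a basic shape check. The sketch below assumes the dumped dataset is a JSON object with a `conversation` key, which matches the attribute used earlier but is not verified against the schema here; `load_dataset_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_dataset_json(path):
    """Read a saved dataset file and return it as a plain dict.

    Assumes a top-level object with a 'conversation' key.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict) or "conversation" not in data:
        raise ValueError("expected a JSON object with a 'conversation' key")
    return data
```

If `Dataset` is a Pydantic v2 model, as the `model_dump()` call suggests, `Dataset.model_validate(load_dataset_json(output_file))` should rebuild the typed object.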

## Available OpenAI Models

| Model | Description | Best For |
|-------|-------------|----------|
| `gpt-4o` | Most capable, multimodal | Complex reasoning |
| `gpt-4o-mini` | Fast and cost-effective | General tasks |
| `gpt-4-turbo` | Previous flagship | Longer context |
| `gpt-3.5-turbo` | Fast and cheap | Simple tasks |

In [None]:
# Example: use GPT-4o for higher-quality generation
# generator_gpt4 = create_openai_generator(
#     model_name="gpt-4o",
#     temperature=0.5,  # Lower temperature for more focused output
# )

## Cleanup

In [None]:
# Clean up sample files
if sample_file.exists():
    sample_file.unlink()
if output_file.exists():
    output_file.unlink()
print("Cleanup completed")
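
Manual cleanup is skipped if an earlier cell fails. One alternative, sketched here with the standard library only, is to keep scratch files in a temporary directory that is removed automatically. `write_in_temp_dir` is a hypothetical helper that only demonstrates the lifecycle.

```python
import tempfile
from pathlib import Path


def write_in_temp_dir(filename, content):
    """Write a file inside a temporary directory and report its lifecycle.

    Returns (existed inside the with-block, exists after the with-block).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / filename
        path.write_text(content, encoding="utf-8")
        existed_inside = path.exists()
    # The directory and everything in it are deleted when the block exits.
    return existed_inside, path.exists()
```

In the notebook, creating `sample_file` and `output_file` under such a directory would make the cleanup cell unnecessary.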