# Generators - OpenAI Example

This notebook demonstrates how to use the Fair Forge generators module with **OpenAI** models to create synthetic test datasets.

## Overview

The `BaseGenerator` class accepts any LangChain-compatible chat model, including OpenAI's `ChatOpenAI`. This provides a flexible interface for generating test queries from context documents.

## Installation

First, install Fair Forge and the OpenAI LangChain integration.

In [None]:
import sys
!uv pip install --python {sys.executable} --force-reinstall "$(ls ../../dist/*.whl)" langchain-openai -q

## Setup

Import the required modules and configure your OpenAI API key.

Get your API key from [OpenAI Platform](https://platform.openai.com/api-keys).

In [None]:
import os
import sys
import json
from pathlib import Path

sys.path.insert(0, os.path.dirname(os.getcwd()))

from langchain_openai import ChatOpenAI

from fair_forge.generators import (
    BaseGenerator,
    create_markdown_loader,
    RandomSamplingStrategy,
)
from fair_forge.schemas import Dataset

In [None]:
import getpass

OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")
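
For non-interactive runs (CI, scheduled jobs), an interactive prompt is not available. The sketch below is one possible way to prefer the `OPENAI_API_KEY` environment variable and fall back to `getpass` only when it is unset. The helper name `resolve_api_key` and its parameters are illustrative and not part of Fair Forge or LangChain.

```python
import getpass
import os


def resolve_api_key(env=os.environ, prompt=getpass.getpass):
    """Return the key from OPENAI_API_KEY if set, otherwise prompt for it.

    `env` and `prompt` are injectable so the helper can be exercised
    without touching the real environment or blocking on input.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    return prompt("Enter your OpenAI API key: ")
```

Calling `OPENAI_API_KEY = resolve_api_key()` would then behave like the cell above whenever the variable is not set.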

## Create Sample Content

Create a sample Markdown document to serve as the source for the generated test queries:

In [None]:
sample_content = """# AI Assistant Guidelines

This document outlines best practices for building AI assistants.

## Core Principles

AI assistants should follow these core principles:

1. **Helpfulness**: Always aim to provide useful and accurate information
2. **Safety**: Never provide harmful or dangerous advice
3. **Honesty**: Be transparent about limitations and uncertainties
4. **Privacy**: Respect user privacy and data protection

## Response Quality

High-quality responses should:
- Be clear and concise
- Address the user's actual question
- Provide relevant context when needed
- Cite sources for factual claims

## Error Handling

When the assistant encounters errors or edge cases:
- Acknowledge the limitation clearly
- Suggest alternative approaches
- Never make up information
"""

# Save to file (explicit encoding keeps the result identical across platforms)
sample_file = Path("./sample_guidelines.md")
sample_file.write_text(sample_content, encoding="utf-8")
print(f"Sample content saved to: {sample_file}")

## Create Context Loader

In [None]:
# Create markdown loader with hybrid chunking
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Preview the chunks
chunks = loader.load(str(sample_file))
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"- {chunk.chunk_id}: {len(chunk.content)} chars")
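
To make the idea of header-based chunking with a size cap concrete, here is a minimal standard-library sketch. It is not Fair Forge's implementation: the function name `split_markdown`, the character-based size limit, and the fixed-window fallback are all assumptions made for illustration, and the sketch does not skip `#` lines inside fenced code.

```python
import re

# A header is one to six '#' characters followed by whitespace and some text.
HEADER_RE = re.compile(r"^(#{1,6})\s+\S")


def split_markdown(text, header_levels=(1, 2, 3), max_chunk_size=2000):
    """Split markdown at headers of the given levels, then cap chunk length."""
    sections, current = [], []
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        # Start a new section when a header of a tracked level appears.
        if match and len(match.group(1)) in header_levels and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        # Oversized sections are cut into fixed-size character windows.
        for start in range(0, len(section), max_chunk_size):
            chunks.append(section[start:start + max_chunk_size])
    return chunks
```

A real loader would likely split oversized sections on paragraph or sentence boundaries instead of raw character offsets; the fixed window is used here only to keep the sketch short.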

## Create Generator with OpenAI Model

The `BaseGenerator` accepts any LangChain-compatible chat model. Here we use `ChatOpenAI` from `langchain-openai`.

In [None]:
# Create OpenAI model using LangChain
model = ChatOpenAI(
    model="gpt-4o-mini",  # Options: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
    api_key=OPENAI_API_KEY,
    temperature=0.7,
    max_tokens=2048,
)

# Create generator with the model
generator = BaseGenerator(
    model=model,
    use_structured_output=True,  # Use OpenAI's native structured output
)

print(f"Generator created with model: {model.model_name}")

## Generate Test Dataset

In [None]:
print("Generating test dataset with OpenAI...\n")

datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="my-assistant",
    num_queries_per_chunk=3,
    language="english",
)

# The default SequentialStrategy yields a single dataset
dataset = datasets[0]

print(f"Generated {len(datasets)} dataset(s):")
print(f"  Session ID: {dataset.session_id}")
print(f"  Total queries: {len(dataset.conversation)}\n")

print("Generated queries:")
for batch in dataset.conversation:
    difficulty = batch.agentic.get("difficulty", "N/A")
    query_type = batch.agentic.get("query_type", "N/A")
    print(f"  [{batch.qa_id}] ({difficulty}/{query_type})")
    print(f"    {batch.query}\n")
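
Once queries are generated, a quick tally of their metadata can show whether the set is balanced. The sketch below counts values for the `difficulty` and `query_type` keys used in the loop above; it operates on plain dictionaries, and the helper name `summarize_metadata` is hypothetical.

```python
from collections import Counter


def summarize_metadata(records, keys=("difficulty", "query_type")):
    """Count how often each value appears for the given metadata keys.

    `records` is an iterable of dict-like metadata objects; missing keys
    are counted under "N/A", mirroring the printing loop above.
    """
    records = list(records)  # allow generators to be traversed once per key
    return {
        key: Counter(record.get(key, "N/A") for record in records)
        for key in keys
    }
```

Assuming `batch.agentic` is a dictionary, as the `.get` calls above suggest, `summarize_metadata(b.agentic for b in dataset.conversation)` would produce the tally for the generated dataset.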

## Generate with Seed Examples

Guide the style of generated questions using seed examples:

In [None]:
seed_examples = [
    "What should an AI assistant do when it encounters a harmful request?",
    "How can an assistant ensure its responses respect user privacy?",
    "What are the key characteristics of a high-quality AI response?",
]

print("Generating with seed examples...\n")

datasets_with_seeds = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="my-assistant",
    num_queries_per_chunk=2,
    seed_examples=seed_examples,
)

dataset_seeds = datasets_with_seeds[0]
print(f"Generated {len(dataset_seeds.conversation)} queries with seed examples:")
for batch in dataset_seeds.conversation[:3]:
    print(f"  - {batch.query}")

## Chunk Selection Strategies

Strategies control how chunks are selected and grouped during generation.

### RandomSamplingStrategy

Randomly samples chunks multiple times to create diverse test datasets:

In [None]:
strategy = RandomSamplingStrategy(
    num_samples=2,       # Create 2 datasets
    chunks_per_sample=2, # Each with 2 random chunks
    seed=42,             # For reproducibility
)

print(f"Strategy: {strategy}\n")

random_datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="my-assistant",
    num_queries_per_chunk=2,
    selection_strategy=strategy,
)

print(f"Generated {len(random_datasets)} datasets:\n")
for i, ds in enumerate(random_datasets):
    print(f"Dataset {i+1}: {len(ds.conversation)} queries")
    chunk_ids = {b.agentic.get("chunk_id", "N/A") for b in ds.conversation}
    print(f"  Chunks: {chunk_ids}\n")
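
The effect of `num_samples`, `chunks_per_sample`, and `seed` can be illustrated with the standard library alone. This is a sketch of seeded sampling without replacement within each group, not the actual `RandomSamplingStrategy` code; the function name `sample_chunk_groups` and the choice to clamp the group size to the number of available chunks are assumptions.

```python
import random


def sample_chunk_groups(chunk_ids, num_samples, chunks_per_sample, seed=None):
    """Draw `num_samples` groups of distinct chunk IDs.

    A dedicated Random instance keeps the draws reproducible for a given
    seed without disturbing the global random state.
    """
    rng = random.Random(seed)
    chunk_ids = list(chunk_ids)
    # Never ask for more chunks than exist.
    group_size = min(chunks_per_sample, len(chunk_ids))
    return [rng.sample(chunk_ids, group_size) for _ in range(num_samples)]
```

Within a group no chunk repeats, but the same chunk may appear in more than one group, which is why the sets printed per dataset above can overlap.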

## Conversation Mode

Generate coherent multi-turn conversations in which each question builds on the previous one:

In [None]:
print("Generating conversations (each turn builds on the previous)...\n")

conversation_datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="my-assistant",
    num_queries_per_chunk=3,  # 3-turn conversations
    conversation_mode=True,   # Enable conversation mode
)

dataset_conv = conversation_datasets[0]
print(f"Generated {len(dataset_conv.conversation)} conversation turns:\n")

# Group by chunk to show conversation flow
current_chunk = None
for batch in dataset_conv.conversation:
    chunk_id = batch.agentic.get("chunk_id", "N/A")
    turn_num = batch.agentic.get("turn_number", 0)

    # Print a header whenever the conversation moves to a new chunk
    if chunk_id != current_chunk:
        print(f"\n--- Conversation for: {chunk_id} ---")
        current_chunk = chunk_id

    print(f"  Turn {turn_num}: {batch.query}")
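
The loop above relies on turns from the same chunk arriving next to each other. If that ordering is not guaranteed, turns can be grouped explicitly first. The sketch below works on plain dictionaries carrying the `chunk_id` and `turn_number` keys seen above; the helper name `group_turns_by_chunk` is hypothetical.

```python
from collections import defaultdict


def group_turns_by_chunk(turns):
    """Group turn metadata by chunk and order each group by turn number."""
    grouped = defaultdict(list)
    for turn in turns:
        grouped[turn.get("chunk_id", "N/A")].append(turn)
    return {
        chunk_id: sorted(items, key=lambda t: t.get("turn_number", 0))
        for chunk_id, items in grouped.items()
    }
```

Each value in the returned dictionary is one conversation, ordered by turn, regardless of how the input was interleaved.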

## Save Generated Dataset

In [None]:
# Save dataset to JSON; mode="json" converts non-JSON types (e.g. datetimes) to strings
output_file = Path("./generated_tests_openai.json")
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(dataset.model_dump(mode="json"), f, indent=2)

print(f"Dataset saved to: {output_file}")
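
A quick way to confirm the file was written correctly is to read it back. The sketch below assumes the saved JSON mirrors the attributes used earlier in this notebook, namely a top-level `conversation` list whose items carry a `query` field; the helper name `load_queries` is illustrative.

```python
import json
from pathlib import Path


def load_queries(path):
    """Read a saved dataset file and return its queries in order.

    Assumes a top-level "conversation" list of objects with a "query" key.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return [item["query"] for item in data.get("conversation", [])]
```

Comparing `load_queries(output_file)` with `[b.query for b in dataset.conversation]` would verify the round trip, provided it runs before the cleanup cell removes the file.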

## Available OpenAI Models

| Model | Description | Best For |
|-------|-------------|----------|
| `gpt-4o` | Most capable, multimodal | Complex reasoning |
| `gpt-4o-mini` | Fast and cost-effective | General tasks |
| `gpt-4-turbo` | Previous flagship | Longer context |
| `gpt-3.5-turbo` | Fast and cheap | Simple tasks |

## Cleanup

In [None]:
# Clean up sample files (missing_ok avoids an error if a file was never created)
sample_file.unlink(missing_ok=True)
output_file.unlink(missing_ok=True)
print("Cleanup completed")