# Generators - Groq Example

This notebook demonstrates how to use the Fair Forge generators module with **Groq Cloud** for ultra-fast synthetic test dataset generation.

## Overview

The `BaseGenerator` class accepts any LangChain-compatible chat model, including Groq's `ChatGroq`. This provides extremely fast inference for open-source LLMs.

### Why Groq?
- **Speed**: Up to 10x faster than traditional cloud providers, depending on model and workload
- **Cost**: Competitive pricing for high-volume usage
- **Models**: Access to popular open-source models (Llama 3, Mixtral, Gemma)

## Installation

First, install Fair Forge and the Groq LangChain integration.

In [None]:
import sys
!uv pip install --python {sys.executable} --force-reinstall "$(ls ../../dist/*.whl)" langchain-groq -q

## Setup

Import the required modules and configure your Groq API key.

Get your free API key from [Groq Console](https://console.groq.com/).

In [None]:
import os
import sys
import json
import time
from pathlib import Path

sys.path.insert(0, os.path.dirname(os.getcwd()))

from langchain_groq import ChatGroq

from fair_forge.generators import (
    BaseGenerator,
    create_markdown_loader,
    RandomSamplingStrategy,
)
from fair_forge.schemas import Dataset

In [None]:
import getpass

GROQ_API_KEY = getpass.getpass("Enter your Groq API key: ")
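
An interactive prompt is impractical for unattended runs such as CI. A minimal sketch of an environment-first lookup follows, assuming the key is exported as `GROQ_API_KEY` (the variable `langchain-groq` reads by default). `resolve_api_key` is a hypothetical helper, not part of Fair Forge.

```python
import getpass
import os
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass.getpass,
    var_name: str = "GROQ_API_KEY",
) -> str:
    """Return the key from the environment, prompting only if it is unset."""
    key = env.get(var_name, "").strip()
    if not key:
        # Fall back to an interactive prompt when the variable is missing or empty.
        key = prompt("Enter your Groq API key: ").strip()
    if not key:
        raise ValueError(f"No API key provided (set {var_name} or enter one).")
    return key
```

In the notebook this would replace the prompt with `GROQ_API_KEY = resolve_api_key()`.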

## Create Sample Content

Let's create a sample markdown document for testing:

In [None]:
sample_content = """# Machine Learning Fundamentals

This guide covers the basics of machine learning for beginners.

## Types of Machine Learning

Machine learning can be categorized into three main types:

### Supervised Learning
- Uses labeled training data
- Predicts outcomes based on input features
- Examples: Classification, Regression

### Unsupervised Learning
- Works with unlabeled data
- Discovers hidden patterns and structures
- Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
- Agent learns through interaction with environment
- Maximizes cumulative reward
- Examples: Game playing, Robotics

## Model Evaluation

Key metrics for evaluating ML models:

- **Accuracy**: Proportion of correct predictions
- **Precision**: True positives among predicted positives
- **Recall**: True positives among actual positives
- **F1 Score**: Harmonic mean of precision and recall

## Best Practices

1. Split data into train/validation/test sets
2. Use cross-validation for robust evaluation
3. Monitor for overfitting
4. Document your experiments
"""

# Save to file
sample_file = Path("./ml_fundamentals.md")
sample_file.write_text(sample_content)
print(f"Sample content saved to: {sample_file}")

## Create Context Loader

In [None]:
# Create markdown loader
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Preview chunks
chunks = loader.load(str(sample_file))
print(f"Created {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"- {chunk.chunk_id}: {len(chunk.content)} chars")
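
To make the `header_levels` and `max_chunk_size` parameters concrete, here is a rough standard-library sketch of header-based splitting. It illustrates the general technique only and is not Fair Forge's loader: `split_by_headers` and its return shape are hypothetical, and it ignores edge cases such as `#` lines inside code fences.

```python
import re

# A markdown ATX header: 1-6 '#' characters followed by whitespace and text.
HEADER_RE = re.compile(r"^(#{1,6})\s+\S")


def split_by_headers(text, header_levels=(1, 2, 3), max_chunk_size=2000):
    """Split markdown at headers of the given levels, then cap chunk size."""
    sections, current = [], []
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        # Start a new section whenever a header of a selected level appears.
        if match and len(match.group(1)) in header_levels and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        # Hard-split oversized sections so no chunk exceeds max_chunk_size.
        for start in range(0, len(section), max_chunk_size):
            chunks.append(section[start:start + max_chunk_size])
    return chunks


demo = "# A\nintro\n## B\nbody b\n### C\nbody c\n"
for chunk in split_by_headers(demo):
    print(repr(chunk))
```

Dropping level 3 from `header_levels` keeps the `### C` subsection inside the `## B` chunk, which is the trade-off between many small contexts and fewer, larger ones.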

## Create Generator with Groq Model

The `BaseGenerator` accepts any LangChain-compatible chat model. Here we use `ChatGroq` for ultra-fast generation.

In [None]:
# Create Groq model using LangChain
model = ChatGroq(
    model="llama-3.1-8b-instant",  # Fast model for demos
    api_key=GROQ_API_KEY,
    temperature=0.4,
    max_tokens=2048,
)

# Create generator with the model
generator = BaseGenerator(
    model=model,
    use_structured_output=True,
)

print(f"Generator created with model: {model.model_name}")

## Generate Test Dataset

The cell below times the full generation call, so you can measure Groq's latency on your own content.

In [None]:
print("Generating test dataset with Groq...\n")

start_time = time.perf_counter()  # monotonic clock, suited to measuring elapsed time

datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="ml-assistant",
    num_queries_per_chunk=3,
    language="english",
)

elapsed = time.perf_counter() - start_time

# The default SequentialStrategy returns a single dataset
dataset = datasets[0]

print(f"Generated {len(datasets)} dataset(s) in {elapsed:.2f} seconds:")
print(f"  Session ID: {dataset.session_id}")
print(f"  Total queries: {len(dataset.conversation)}\n")

print("Generated queries:")
for batch in dataset.conversation:
    difficulty = batch.agentic.get('difficulty', 'N/A')
    query_type = batch.agentic.get('query_type', 'N/A')
    print(f"  [{batch.qa_id}] ({difficulty}/{query_type})")
    print(f"    {batch.query}\n")
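
A quick tally of the metadata helps confirm that the generated set is reasonably balanced. The sketch below uses hypothetical stand-in records that mimic the `agentic` dictionaries printed above; with real data you would pass `[b.agentic for b in dataset.conversation]` instead.

```python
from collections import Counter


def tally(records, key):
    """Count how often each value of `key` appears, using 'N/A' when missing."""
    return Counter(record.get(key, "N/A") for record in records)


# Hypothetical stand-ins for batch.agentic; the values are illustrative only.
records = [
    {"difficulty": "easy", "query_type": "factual"},
    {"difficulty": "hard", "query_type": "comparative"},
    {"difficulty": "easy"},
]

for key in ("difficulty", "query_type"):
    print(key, dict(tally(records, key)))
```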

## Generate with Seed Examples

Guide the style of generated questions using seed examples:

In [None]:
seed_examples = [
    "What is the difference between supervised and unsupervised learning?",
    "How do you prevent overfitting in a machine learning model?",
    "When should you use precision vs recall as your primary metric?",
]

print("Generating with seed examples...\n")

datasets_with_seeds = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="ml-assistant",
    num_queries_per_chunk=2,
    seed_examples=seed_examples,
)

dataset_seeds = datasets_with_seeds[0]
print(f"Generated {len(dataset_seeds.conversation)} queries with seed examples:")
for batch in dataset_seeds.conversation[:5]:
    print(f"  - {batch.query}")

## Chunk Selection Strategies

Strategies control how chunks are selected and grouped during generation. By default, all chunks are processed sequentially into a single dataset.

### RandomSamplingStrategy

Randomly samples chunks multiple times to create diverse test datasets:

In [None]:
# Create a strategy that samples 3 random chunks, 2 times
strategy = RandomSamplingStrategy(
    num_samples=2,       # Create 2 datasets
    chunks_per_sample=3, # Each with 3 random chunks
    seed=42,             # For reproducibility
)

print(f"Strategy: {strategy}\n")

random_datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="ml-assistant",
    num_queries_per_chunk=2,
    selection_strategy=strategy,
)

print(f"Generated {len(random_datasets)} datasets:\n")
for i, ds in enumerate(random_datasets):
    print(f"Dataset {i+1}:")
    print(f"  Session: {ds.session_id[:8]}...")
    print(f"  Queries: {len(ds.conversation)}")
    chunk_ids = {b.agentic.get('chunk_id', 'N/A') for b in ds.conversation}
    print(f"  Chunks: {chunk_ids}\n")
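
The effect of `seed` can be illustrated with the standard library alone. This is a sketch of seeded sampling in general, assuming chunks are drawn without replacement within each dataset. It is not Fair Forge's implementation, and `sample_chunk_groups` is a hypothetical name.

```python
import random


def sample_chunk_groups(chunk_ids, num_samples, chunks_per_sample, seed=None):
    """Draw `num_samples` groups of distinct chunk IDs from a private RNG."""
    rng = random.Random(seed)  # isolated generator; global random state is untouched
    k = min(chunks_per_sample, len(chunk_ids))
    return [rng.sample(chunk_ids, k) for _ in range(num_samples)]


chunk_ids = [f"chunk_{i}" for i in range(6)]
first = sample_chunk_groups(chunk_ids, num_samples=2, chunks_per_sample=3, seed=42)
second = sample_chunk_groups(chunk_ids, num_samples=2, chunks_per_sample=3, seed=42)
print(first == second)  # True: the same seed reproduces the same groups
```

Fixing the seed makes a test run repeatable; omitting it yields a different selection on each run.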

## Conversation Mode

Instead of generating independent queries, conversation mode creates coherent multi-turn conversations where each question builds on the previous ones:

In [None]:
print("Generating conversations (each turn builds on the previous)...\n")

conversation_datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="ml-assistant",
    num_queries_per_chunk=3,  # 3-turn conversations
    conversation_mode=True,   # Enable conversation mode
)

dataset_conv = conversation_datasets[0]
print(f"Generated {len(dataset_conv.conversation)} conversation turns:\n")

# Group by chunk to show conversation flow
current_chunk = None
for batch in dataset_conv.conversation:
    chunk_id = batch.agentic.get('chunk_id', 'N/A')
    turn_num = batch.agentic.get('turn_number', 0)
    
    if chunk_id != current_chunk:
        print(f"\n--- Conversation for chunk: {chunk_id} ---")
        current_chunk = chunk_id
    
    print(f"  Turn {turn_num}: {batch.query}")
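
If you need the conversations as data rather than printed text, the turns can be grouped by chunk and ordered by turn number. A small sketch follows, assuming each turn carries `chunk_id` and `turn_number` as in the loop above; the tuples are hypothetical stand-ins for `(batch.agentic, batch.query)`.

```python
from collections import defaultdict


def group_turns(turns):
    """Map chunk_id -> list of queries ordered by turn_number."""
    grouped = defaultdict(list)
    for meta, query in turns:
        chunk_id = meta.get("chunk_id", "N/A")
        grouped[chunk_id].append((meta.get("turn_number", 0), query))
    return {
        chunk_id: [query for _, query in sorted(items, key=lambda item: item[0])]
        for chunk_id, items in grouped.items()
    }


# Hypothetical stand-in turns, deliberately out of order.
turns = [
    ({"chunk_id": "c1", "turn_number": 2}, "follow-up"),
    ({"chunk_id": "c1", "turn_number": 1}, "opener"),
    ({"chunk_id": "c2", "turn_number": 1}, "other"),
]
print(group_turns(turns))
```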

### Combined: Random Sampling + Conversation Mode

You can combine strategies with conversation mode to create diverse conversation-based test sets:

In [None]:
strategy = RandomSamplingStrategy(
    num_samples=2,
    chunks_per_sample=2,
    seed=42,
)

print("Generating 2 datasets with 2 random chunks each (conversation mode)...\n")

combined_datasets = await generator.generate_dataset(
    context_loader=loader,
    source=str(sample_file),
    assistant_id="ml-assistant",
    num_queries_per_chunk=2,  # 2-turn conversations
    selection_strategy=strategy,
    conversation_mode=True,
)

for i, ds in enumerate(combined_datasets):
    print(f"Dataset {i+1} ({len(ds.conversation)} turns):")
    for batch in ds.conversation[:4]:  # Show first 4 turns
        chunk = batch.agentic.get('chunk_id', 'N/A')[:15]
        turn = batch.agentic.get('turn_number', 0)
        print(f"  [{chunk}] T{turn}: {batch.query[:50]}...")
    print()
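
The three parameters multiply, which is worth checking before a large run. The helper below is a hypothetical back-of-the-envelope estimate, assuming every selected chunk yields exactly `num_queries_per_chunk` turns; treat the result as an upper bound.

```python
def expected_turns(num_samples, chunks_per_sample, num_queries_per_chunk, total_chunks):
    """Upper bound on generated turns across all datasets."""
    # A sample cannot contain more chunks than the document provides.
    per_dataset = min(chunks_per_sample, total_chunks) * num_queries_per_chunk
    return num_samples * per_dataset


# Settings from the cell above, with a hypothetical document of 6 chunks.
print(expected_turns(2, 2, 2, total_chunks=6))  # 8
```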

## Save Generated Dataset

In [None]:
# Save dataset to JSON
output_file = Path("./generated_tests_groq.json")
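with open(output_file, "w", encoding="utf-8") as f:
    # mode="json" converts non-JSON types (e.g. datetimes) into serializable values
    json.dump(dataset.model_dump(mode="json"), f, indent=2)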

print(f"Dataset saved to: {output_file}")
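
Reading the file back is a cheap way to confirm that it contains valid JSON. The sketch below uses a temporary file and a hypothetical payload. With the real output you would pass `output_file` and, assuming `Dataset` is a Pydantic v2 model, could rebuild the object with `Dataset.model_validate(...)`.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Load a JSON document from disk using an explicit encoding."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical payload shaped loosely like a saved dataset.
payload = {
    "session_id": "demo",
    "conversation": [{"qa_id": "q1", "query": "What is recall?"}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    restored = load_json(path)

print(restored == payload)
```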

## Available Groq Models

| Model | Description | Best For |
|-------|-------------|----------|
| `llama-3.1-70b-versatile` | Most capable Llama model | Complex reasoning |
| `llama-3.1-8b-instant` | Fast Llama model | Quick tasks |
| `mixtral-8x7b-32768` | Large context window | Long documents |
| `gemma2-9b-it` | Google's Gemma model | General tasks |

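Groq retires and replaces model IDs regularly, so some entries above may no longer be served. Check the [Groq models documentation](https://console.groq.com/docs/models) for the current list before choosing one.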

## Cleanup

In [None]:
# Clean up sample files
if sample_file.exists():
    sample_file.unlink()
if output_file.exists():
    output_file.unlink()
print("Cleanup completed")