## 🥬 TinyLettuce: Efficient Hallucination Detection with Small Models (Using Synthetic Data Generation)

<p align="center">
  <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/dev/assets/tinytinylettuce.png?raw=true" alt="TinyLettuce Detective" width="400"/>
  <br>
  <em>Small, task‑specialized encoders trained on synthetic data</em>
</p>


[![LettuceDetect](https://img.shields.io/badge/LettuceDetect-v0.1.8-green)](https://github.com/KRLabsOrg/LettuceDetect)
[![Python](https://img.shields.io/badge/Python-3.11+-blue)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)

## 🎯 Overview

**The Problem**: Training robust hallucination detection models requires large datasets of both correct and hallucinated responses. Manually creating such datasets is expensive and time-consuming.

**Our Solution**: LettuceDetect's synthetic data generation pipeline can generate realistic hallucinations from factual content.

### What This Notebook Demonstrates

1. **Answer-based Generation**: Inject specific error types into correct answers
2. **Batch Processing**: Efficient async generation for large datasets
3. **Training Integration**: Convert to formats ready for model training

### Key Benefits

- **Cost-effective**: Generate thousands of training samples at a fraction of manual annotation cost
- **Controllable**: Specify exact error types and intensity levels
- **Scalable**: Async batch processing for large-scale datasets

### Setup

Install LettuceDetect:
```bash
pip install lettucedetect
```

Then, install datasets and rich:
```bash
pip install datasets
pip install rich
```


In [1]:
# We recommend setting your OpenAI API key as an environment variable
# import os
# os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
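
A missing key usually only surfaces on the first API call, so it can help to fail early. The following is a minimal sketch, not part of LettuceDetect: `require_env` is a hypothetical helper, and it only assumes that the key is read from the `OPENAI_API_KEY` environment variable as described above.

```python
import os


def require_env(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for this notebook; it is not part of LettuceDetect.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the generator.")
    return value
```

Calling `require_env()` in the first cell turns a late authentication error into an immediate, readable one.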

### Generate Synthetic Data

In [5]:
# Initialize the generator
from lettucedetect import HallucinationGenerator

# The heart of the synthetic data generation pipeline is the HallucinationGenerator class
# GPT-5 requires temperature=1.0
generator = HallucinationGenerator(model="gpt-5", temperature=1.0)

In [6]:
# The generator can be used with any context-question-answer format
result = generator.generate(
    context=[
        "Ibuprofen is an NSAID that reduces inflammation and pain. The typical adult dose is 400-600mg every 6-8 hours, not exceeding 2400mg daily."
    ],
    question="What is the maximum daily dose of ibuprofen?",
    answer="The maximum daily dose of ibuprofen for adults is 2400mg.",
)

In [8]:
from rich.console import Console

console = Console()

console.print(result)

You can easily tune the error types and intensity to your needs.

Currently, the generator supports the following error types:
- factual = Change facts/entities
- temporal = Change dates, time periods
- numerical = Change numbers, quantities
- relational = Change relationships between entities
- contextual = Add unrelated context
- omission = Remove important details

Intensity is a float between 0 and 1, where 0 is hardly noticeable and 1 is very obvious.
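
Because every call to the generator costs an API request, it can be worth validating these two arguments locally first. The sketch below is a hypothetical helper, not part of the library: the error type names are copied from the list above, and the default of 0.3 mirrors the default intensity mentioned later in this notebook. Whether the generator performs its own validation is not assumed here.

```python
# Error type names copied from the list above; the helper itself is hypothetical.
SUPPORTED_ERROR_TYPES = (
    "factual",
    "temporal",
    "numerical",
    "relational",
    "contextual",
    "omission",
)


def check_generation_args(error_types=None, intensity=0.3):
    """Validate arguments locally before spending an API call on them."""
    if error_types is not None:
        unknown = [t for t in error_types if t not in SUPPORTED_ERROR_TYPES]
        if unknown:
            raise ValueError(f"Unsupported error types: {unknown}")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError(f"intensity must be between 0 and 1, got {intensity}")
    return error_types, intensity
```

For example, `check_generation_args(["numerical"], 0.1)` passes the values through unchanged, while a misspelled error type raises a `ValueError` before any request is made.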

In [9]:
# Let's try to generate numerical errors
result = generator.generate(
    context=[
        "Ibuprofen is an NSAID that reduces inflammation and pain. The typical adult dose is 400-600mg every 6-8 hours, not exceeding 2400mg daily."
    ],
    question="What is the maximum daily dose of ibuprofen?",
    answer="The maximum daily dose of ibuprofen for adults is 2400mg.",
    error_types=["numerical"],
)

console.print(result)

In [10]:
# Let's try with low intensity
result = generator.generate(
    context=[
        "Ibuprofen is an NSAID that reduces inflammation and pain. The typical adult dose is 400-600mg every 6-8 hours, not exceeding 2400mg daily."
    ],
    question="What is the maximum daily dose of ibuprofen?",
    answer="The maximum daily dose of ibuprofen for adults is 2400mg.",
    error_types=["numerical"],
    intensity=0.1,
)

In [11]:
console.print(result)

In [16]:
# Now let's try to generate factual errors
result = generator.generate(
    context=[
        "Ibuprofen is an NSAID that reduces inflammation and pain. The typical adult dose is 400-600mg every 6-8 hours, not exceeding 2400mg daily."
    ],
    question="What is the maximum daily dose of ibuprofen?",
    answer="The maximum daily dose of ibuprofen for adults is 2400mg.",
    error_types=["factual"],
    intensity=0.4,
)

console.print(result)

In [17]:
# Another example: temporal errors
result = generator.generate(
    context=[
        "Apollo 11 was the first crewed mission to land on the Moon, touching down on July 20, 1969. Neil Armstrong and Buzz Aldrin spent about 21 hours on the lunar surface."
    ],
    question="On what date did Apollo 11 land on the Moon?",
    answer="Apollo 11 landed on the Moon on July 20, 1969.",
    error_types=["temporal"],
    intensity=0.5,
)

console.print(result)

In [21]:
# Hallucinations can be generated in batch as well


async def generate_batch(contexts, questions, answers, error_types, intensity):
    generator = HallucinationGenerator(model="gpt-5-mini", temperature=1.0)
    results = await generator.generate_batch_async(
        contexts, questions, answers, error_types, intensity
    )
    return results


# Let's try to generate a batch of hallucinations
contexts = [
    "Ibuprofen is an NSAID that reduces inflammation and pain. The typical adult dose is 400-600mg every 6-8 hours, not exceeding 2400mg daily.",
    "Apollo 11 was the first crewed mission to land on the Moon, touching down on July 20, 1969. Neil Armstrong and Buzz Aldrin spent about 21 hours on the lunar surface.",
]
questions = [
    "What is the maximum daily dose of ibuprofen?",
    "On what date did Apollo 11 land on the Moon?",
]
answers = [
    "The maximum daily dose of ibuprofen for adults is 2400mg.",
    "Apollo 11 landed on the Moon on July 20, 1969.",
]
error_types = ["numerical", "temporal"]
intensity = 0.5

results = await generate_batch(contexts, questions, answers, error_types, intensity)
console.print(results)
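
The top-level `await` above works in a notebook because the notebook already runs an event loop. In a plain Python script the coroutine has to be started with `asyncio.run` instead. The sketch below shows that pattern with a stand-in coroutine, `fake_generate_batch`, which is hypothetical and makes no API calls; in a real script you would pass the notebook's `generate_batch` coroutine to `asyncio.run` in the same way.

```python
import asyncio


async def fake_generate_batch(contexts, questions, answers):
    """Stand-in for the notebook's generate_batch coroutine (no API calls)."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return [
        {"question": q, "original_answer": a}
        for q, a in zip(questions, answers)
    ]


def run_batch(contexts, questions, answers):
    """Run the batch coroutine from synchronous code."""
    # The three lists are matched by position, so they must have equal length.
    if not (len(contexts) == len(questions) == len(answers)):
        raise ValueError("contexts, questions and answers must have the same length")
    return asyncio.run(fake_generate_batch(contexts, questions, answers))
```

Checking the list lengths up front avoids silently dropping samples, since `zip` stops at the shortest input.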

## The rag-mini-BioASQ dataset

The rag-mini-BioASQ dataset is a RAG dataset of biomedical questions and answers, each paired with its supporting context passages.

We can use the HuggingFace `datasets` library to load the dataset.



In [25]:
def load_rag_mini_bioasq(split: str = "train", filter_min_words: int = 10):
    """Load rag-mini-bioasq dataset and prepare for generation."""
    try:
        from datasets import load_dataset
    except ImportError as e:
        raise ImportError("datasets package required. Install with: pip install datasets") from e

    # Load dataset
    qa_dataset = load_dataset("enelpol/rag-mini-bioasq", "question-answer-passages")
    corpus_dataset = load_dataset("enelpol/rag-mini-bioasq", "text-corpus")

    # Create corpus lookup
    corpus_lookup = {item["id"]: item["passage"] for item in corpus_dataset["test"]}

    # Process data
    processed_data = []
    for item in qa_dataset[split]:
        passage_ids = item["relevant_passage_ids"]
        context_passages = [corpus_lookup.get(pid, None) for pid in passage_ids]
        context_passages = [p for p in context_passages if p is not None]

        # Filter by answer length
        if len(item["answer"].split()) >= filter_min_words:
            processed_data.append(
                {
                    "question": item["question"],
                    "answer": item["answer"],
                    "context": context_passages,
                }
            )

    return processed_data


# Let's load the dataset
data = load_rag_mini_bioasq()

# Let's take a look at an example sample
console.print(data[3])
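
The join inside `load_rag_mini_bioasq` can be tried offline on toy rows before downloading anything. The sketch below is a simplified, hypothetical re-statement of that logic (`join_passages` is not a library function); it assumes only the field names used in the cell above.

```python
def join_passages(qa_rows, corpus_rows, min_words=10):
    """Attach passage text to each QA row, dropping IDs missing from the corpus."""
    lookup = {row["id"]: row["passage"] for row in corpus_rows}
    samples = []
    for row in qa_rows:
        if len(row["answer"].split()) < min_words:
            continue  # skip answers shorter than min_words words
        # Keep only the passage IDs that actually exist in the corpus.
        context = [lookup[pid] for pid in row["relevant_passage_ids"] if pid in lookup]
        samples.append(
            {"question": row["question"], "answer": row["answer"], "context": context}
        )
    return samples
```

Note that, as in the loader above, a row whose passage IDs are all missing still yields a sample with an empty `context` list, so you may want to filter those out before generation.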

In [26]:
# You can easily use the generator to generate hallucinations for the dataset
result = generator.generate(
    context=data[3]["context"],
    question=data[3]["question"],
    answer=data[3]["answer"],
)

console.print(result)

In [28]:
# You can easily convert this to the format LettuceDetect uses for training
from lettucedetect.detectors.prompt_utils import PromptUtils

train_data = []

# Add the non-hallucinated sample
train_data.append(
    {
        "prompt": PromptUtils.format_context(data[3]["context"], data[3]["question"], lang="en"),
        "answer": result["original_answer"],
        "labels": [],
        "split": "train",
        "task_type": "qa",
    }
)

hallucinated_labels = []
for part in result["hallucinated_parts"]:
    start = result["hallucinated_answer"].find(part)
    if start != -1:
        hallucinated_labels.append(
            {"start": start, "end": start + len(part), "label": "hallucinated"}
        )
# Add the hallucinated sample
train_data.append(
    {
        "prompt": PromptUtils.format_context(data[3]["context"], data[3]["question"], lang="en"),
        "answer": result["hallucinated_answer"],
        "labels": hallucinated_labels,
        "split": "train",
        "task_type": "qa",
    }
)

console.print(train_data)
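
The span lookup above uses `str.find`, which always returns the first occurrence, so a hallucinated part that appears twice in the answer would be mapped to the same span both times. The following is a minimal sketch of one way to handle that, assuming the `hallucinated_parts` are plain substrings of the hallucinated answer as in the cell above; `find_spans` is a hypothetical helper, not a LettuceDetect function.

```python
def find_spans(answer: str, parts: list[str]) -> list[dict]:
    """Map each hallucinated substring to a character span in the answer.

    Repeated parts are matched to successive occurrences instead of
    always the first one. Parts that cannot be found are skipped.
    """
    labels = []
    next_search_start = {}  # per-part offset where the next search begins
    for part in parts:
        if not part:
            continue
        start = answer.find(part, next_search_start.get(part, 0))
        if start == -1:
            continue
        end = start + len(part)
        labels.append({"start": start, "end": end, "label": "hallucinated"})
        next_search_start[part] = end
    return sorted(labels, key=lambda span: span["start"])
```

A quick sanity check is that `answer[span["start"]:span["end"]]` reproduces the original part for every returned span.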

## Save and train

Now you can save the data and train a model. First, let's save the data.

```python
import json

with open('train_data.json', 'w') as f:
    json.dump(train_data, f)
```
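
Before starting a training run, it can be useful to reload the file and check that every sample has the expected shape. The sketch below assumes only the five keys used in the samples built above and that label offsets index into `answer`; `validate_samples` and `save_and_reload` are hypothetical helpers, and the training script may apply its own checks.

```python
import json

REQUIRED_KEYS = {"prompt", "answer", "labels", "split", "task_type"}


def validate_samples(samples):
    """Check keys and label offsets; return the number of valid samples."""
    for i, sample in enumerate(samples):
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"sample {i} is missing keys: {sorted(missing)}")
        for label in sample["labels"]:
            # Each span must lie inside the answer string.
            if not 0 <= label["start"] < label["end"] <= len(sample["answer"]):
                raise ValueError(f"sample {i} has an out-of-range span: {label}")
    return len(samples)


def save_and_reload(samples, path):
    """Write samples as JSON and read them back to confirm a clean round trip."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Writing with `ensure_ascii=False` and an explicit UTF-8 encoding keeps non-ASCII characters in biomedical text readable in the saved file.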

Now you can train a model.

```bash
python scripts/train.py \
    --ragtruth-path train_data.json \
    --model-name jhu-clsp/ettin-encoder-68m \
    --output-dir output/hallucination_detector \
    --batch-size 4 \
    --epochs 6 \
    --learning-rate 1e-5 
```

**And that's it!** You have a hallucination detector that you can use to detect hallucinations in your data.


For the published models, we generated **1500** hallucinated samples from the rag-mini-bioasq dataset (3000 samples together with the non-hallucinated ones). We used the `gpt-oss-120b` model to generate the training data. We did not specify error types and used the default intensity of 0.3.

For the test set, we generated **300** hallucinated samples (600 samples together with the non-hallucinated ones). We used the `gpt-5` model here to ensure the quality of the hallucinations in the test set.

For large-scale generation, use our script:

```bash
python scripts/generate_synthetic_data.py \
    --dataset rag-mini-bioasq \
    --split train \
    --num-samples 100 \
    --model gpt-4o-mini \
    --output data/synthetic_train.json
```




## End-to-End Workflow

```bash
# Step 1: Generate synthetic training data
python scripts/generate_synthetic_data.py \
  --dataset rag-mini-bioasq \
  --num-samples 2000 \
  --model gpt-4o-mini \
  --batch-size 10 \
  --output data/synthetic_large.json

# Step 2: Train TinyLettuce model
python scripts/train.py \
  --ragtruth-path data/train_combined_large.json \
  --model-name jhu-clsp/ettin-encoder-17m \
  --output-dir output/tinylettuce_17m \
  --batch-size 8 \
  --epochs 3

# Step 3: Deploy on CPU for real-time inference
python scripts/start_api.py prod --model output/tinylettuce_17m
```

## Bonus



In [None]:
# We have implemented a triplet-based hallucination detection model that you can use
# the same way as the standard LettuceDetect models.

from lettucedetect.models.inference import HallucinationDetector
from lettucedetect.ragfactchecker import RAGFactChecker

detector = HallucinationDetector(
    method="rag_fact_checker",
)

fact_checker = RAGFactChecker()

In [35]:
# Get triplets for a sample
triplets = fact_checker.generate_triplets("The capital of France is Paris.")
console.print(triplets)

In [36]:
compare = fact_checker.analyze_text_pair(
    "France is a country in Europe.", "France is a country in Asia."
)
console.print(compare)

In [38]:
# You can use it for detecting hallucinations in your data
result = detector.predict(
    context="The capital of France is Paris.",
    question="What is the capital of France?",
    answer="The capital of France is Berlin.",
    output_format="detailed",
)
console.print(result)