# 2. Synthetic Data Generation: Building Training Signal from Documents

**Purpose:**

Before we change a model's weights, we need something to train on. In most enterprise engagements, that training data simply does not exist in a usable form. Customers have documents, institutional knowledge, and years of accumulated expertise encoded in PDFs, wikis, and policy manuals. What they almost never have is a structured dataset of questions and high-quality answers that a model can learn from directly.

This is the gap that Synthetic Data Generation fills.

SDG is not about fabricating information. It is about transforming existing knowledge into a format that training pipelines can consume. We take the documents we have already ingested, chunked, and validated through retrieval, and we use a language model to generate question-answer pairs that reflect the domain. The documents contain the knowledge. SDG extracts the signal and reshapes it into something a model can actually learn from.

Why not just use the documents directly? Because training a model requires examples of the behavior you want. A 200-page rulebook is not an example of behavior. It is reference material. The model needs to see questions asked and answered correctly, repeatedly, across the full surface area of the domain. That is what SDG produces.

There is an important dependency here. SDG quality is directly tied to everything we built earlier in this workshop. The ingestion pipeline determines whether the source text is clean. The chunking strategy determines whether the generated pairs are coherent or fragmented. If any of those upstream stages is broken, SDG will faithfully reproduce the damage: confident, well-structured question-answer pairs that are subtly wrong, carried forward into training.

This is why SDG comes after retrieval and evaluation, not before. It is not a shortcut. It is a manufacturing step that depends on the quality of every step that preceded it.

## 2.1 Install SDG Hub

`sdg_hub` is an open-source toolkit built by the Red Hat AI Innovation Team for building synthetic data generation pipelines. It is Apache 2.0 licensed and designed around two core concepts: **blocks** and **flows**. Blocks are composable processing units, each responsible for a single transformation, such as generating a question from a document chunk, producing an answer, or evaluating faithfulness. Flows chain blocks together into complete pipelines defined in YAML. You describe the sequence of transformations declaratively, and the framework handles orchestration, validation, and execution.

The library includes pre-built flows that encode tested, validated generation pipelines we can use out of the box rather than assembling one from scratch.
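To make the block and flow idea concrete before touching the library, here is a minimal sketch in plain Python. It does not use the `sdg_hub` API: the three functions are hypothetical stand-ins for LLM-backed blocks, and `run_flow` is an assumed name for "apply each block in order".

```python
from typing import Callable

# In this toy model a "block" is a function that takes a row and returns a new row.
Block = Callable[[dict], dict]

def generate_question(row: dict) -> dict:
    # Stand-in for an LLM call: build a question from the first sentence.
    first_sentence = row["document"].split(".")[0].strip()
    return {**row, "question": f"What does the document say about: {first_sentence}?"}

def generate_answer(row: dict) -> dict:
    # Stand-in for an LLM call: answer with the first sentence itself.
    first_sentence = row["document"].split(".")[0].strip()
    return {**row, "response": first_sentence + "."}

def check_faithfulness(row: dict) -> dict:
    # Toy check: the answer must appear verbatim in the source document.
    verdict = "YES" if row["response"] in row["document"] else "NO"
    return {**row, "faithfulness_judgment": verdict}

def run_flow(blocks: list[Block], row: dict) -> dict:
    # A "flow" is an ordered chain of blocks; each block sees the previous output.
    for block in blocks:
        row = block(row)
    return row

toy_flow = [generate_question, generate_answer, check_faithfulness]
result = run_flow(
    toy_flow,
    {"document": "Open Locks may only be tried once per lock. A failed attempt cannot be retried."},
)
print(result["question"])
print(result["faithfulness_judgment"])
```

The real library expresses the same chain declaratively in YAML and adds validation and orchestration on top. The point of the sketch is only that each step does one thing and passes an enriched row to the next.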

In [1]:
! pip install sdg-hub -q


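## 2.2 Load Endpoint Configuration

We load the endpoint and API key from the shared `config` module and apply `nest_asyncio` so that asynchronous calls can run inside the notebook's already-running event loop.

In [2]: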
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

import nest_asyncio
nest_asyncio.apply()

import logging
logging.getLogger("asyncio").setLevel(logging.CRITICAL)


Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...


## 2.3 Discover Available Flows and Blocks

With the library installed, we confirm that everything registered correctly by discovering the available flows and blocks.

In [3]:
from sdg_hub.core.flow import FlowRegistry, Flow
from sdg_hub.core.blocks import BlockRegistry
from datasets import Dataset

# Auto-discover all registered flows and blocks
FlowRegistry.discover_flows()
BlockRegistry.discover_blocks()

# See what shipped with the install.
# Each entry is a dict with an 'id' and a 'name', as the output below shows.
print("Available flows:")
for flow_info in FlowRegistry.list_flows():
    print(f"  - {flow_info}")

Available flows:
  - {'id': 'loud-dawn-245', 'name': 'RAG Evaluation Dataset Flow'}
  - {'id': 'clean-shadow-397', 'name': 'Advanced Japanese Document Grounded Question-Answer Generation Flow for Knowledge Tuning'}
  - {'id': 'mild-thunder-748', 'name': 'Detailed Summary Knowledge Tuning Dataset Generation Flow'}
  - {'id': 'stellar-peak-605', 'name': 'Document Based Knowledge Tuning Dataset Generation Flow'}
  - {'id': 'epic-jade-656', 'name': 'Extractive Summary Knowledge Tuning Dataset Generation Flow'}
  - {'id': 'heavy-heart-77', 'name': 'Key Facts Knowledge Tuning Dataset Generation Flow'}
  - {'id': 'mild-thunder-748-es', 'name': 'Detailed Summary Knowledge Tuning Dataset Generation Flow (Spanish)'}
  - {'id': 'stellar-peak-605-es', 'name': 'Document Based Knowledge Tuning Dataset Generation Flow (Spanish)'}
  - {'id': 'epic-jade-656-es', 'name': 'Extractive Summary Knowledge Tuning Dataset Generation Flow (Spanish)'}
  - {'id': 'heavy-heart-77-es', 'name': 'Key Facts Knowledge 

You should see several pre-built flows in the output, each listed with an `id` and a `name`. They include flows for RAG evaluation, document-grounded question-answer generation, and summary-based knowledge tuning, some with language-specific variants. If the output is empty or the import fails, check that the install completed without errors and that your Python environment is 3.10 or newer.
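If discovery comes back empty, a quick environment check can narrow down the cause. The sketch below uses only the standard library. The helper names `python_is_supported` and `installed_version` are hypothetical, and the 3.10 floor is taken from the note above.

```python
import sys
from importlib import metadata

def python_is_supported(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    # Compare only the (major, minor) part of the interpreter version.
    return tuple(version_info[:2]) >= minimum

def installed_version(package: str):
    # Return the installed distribution version, or None if it is not installed.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

print(f"Python OK: {python_is_supported()}")
print(f"sdg-hub version: {installed_version('sdg-hub') or 'not installed'}")
```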

## 2.4 Load a Pre-Built Q&A Generation Flow

We select the flow we will use for the rest of this section. The "Document Based Knowledge Tuning Dataset Generation Flow" takes a document chunk and generates grounded question-answer pairs from it. This is the same pattern a customer would follow: take ingested documents, produce structured training data.

In [4]:
flow_name = "Document Based Knowledge Tuning Dataset Generation Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

print(f"Loaded flow: {flow_name}")

Loaded flow: Document Based Knowledge Tuning Dataset Generation Flow


Before we configure the flow, let's see what it recommends for the model backend.

In [5]:
default_model = flow.get_default_model()
recommendations = flow.get_model_recommendations()
print(f"Default model: {default_model}")
print(f"Recommendations: {recommendations}")

Default model: openai/gpt-oss-120b
Recommendations: {'default': 'openai/gpt-oss-120b', 'compatible': ['meta-llama/Llama-3.3-70B-Instruct', 'microsoft/phi-4', 'mistralai/Mixtral-8x7B-Instruct-v0.1'], 'experimental': []}


The flow has a default model preference (`openai/gpt-oss-120b`), but we are not using it. We are routing through the same MaaS endpoint we have been using throughout the workshop. The model recommendations are informational: they tell you what the flow was tested against. In practice, any capable instruction-following model should work, though output quality will vary with the model.

## 2.5 Configure the Model Backend

SDG Hub does not host or download models. It connects to an external endpoint that serves the LLM. This is the same architectural pattern as everything else in this lab: the model runs somewhere else, and our code talks to it over an API.

The `openai/` prefix on the model name is required here for the same reason it was required on the judge in Section 1. SDG Hub uses `litellm` internally for API routing, and the prefix tells litellm to speak the OpenAI-compatible protocol against our MaaS endpoint.

In [6]:
flow.set_model_config(
    model="openai/microsoft-phi-4",
    api_base=endpoint_base,
    api_key=key,
)

print("Model configured: microsoft-phi-4 via MaaS")

Model configured: microsoft-phi-4 via MaaS
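A missing prefix or an empty key tends to surface only once generation starts, so it can help to validate the three settings first. This is a minimal sketch, not part of `sdg_hub`: `build_model_config` is a hypothetical helper, and the endpoint and key in the example are placeholders.

```python
def build_model_config(model_name: str, api_base: str, api_key: str) -> dict:
    """Validate the three settings and return them as keyword arguments."""
    if not api_base.strip():
        raise ValueError("api_base is empty; check config.py")
    if not api_key.strip():
        raise ValueError("api_key is empty; check config.py")
    prefix = "openai/"
    # Add the routing prefix once, even if the caller already included it.
    model = model_name if model_name.startswith(prefix) else prefix + model_name
    return {"model": model, "api_base": api_base, "api_key": api_key}

# Placeholder values for illustration only.
config = build_model_config("microsoft-phi-4", "https://example.invalid/v1", "sk-placeholder")
print(config["model"])  # openai/microsoft-phi-4
```

The returned keys match the keyword arguments used in the cell above, so the dictionary could be passed as `flow.set_model_config(**config)`.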


## 2.6 Discover the Dataset Schema

Every flow defines a contract: the shape of the data it expects as input. Before we build anything, we ask the flow what it needs.

In [7]:
requirements = flow.get_dataset_requirements()
print(requirements)

required_columns=['document', 'document_outline', 'domain', 'icl_document', 'icl_query_1', 'icl_query_2', 'icl_query_3'] optional_columns=[] min_samples=1 max_samples=None column_types={} description='Input dataset should contain documents with text content and domain classification. Each document should be substantial enough for meaningful question generation (minimum 100 words recommended). The flow generates three types of summaries: detailed (n=20), extractive (n=10), and key facts (n=50), each producing corresponding QA pairs designed to help LLMs internalize document knowledge for knowledge tuning.'


The output tells us exactly what columns the seed dataset must contain. This is intentional. Rather than guessing what the pipeline expects and debugging schema mismatches later, the flow declares its contract up front.
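The contract can also be checked programmatically before any tokens are spent. The sketch below hard-codes the required columns and the 100-word recommendation from the output above. The helper names are hypothetical, and the seed is represented as a plain dict of lists for illustration.

```python
REQUIRED_COLUMNS = [
    "document", "document_outline", "domain",
    "icl_document", "icl_query_1", "icl_query_2", "icl_query_3",
]

def missing_columns(seed: dict, required=REQUIRED_COLUMNS) -> list:
    # Columns the flow requires that the seed data does not provide.
    return [column for column in required if column not in seed]

def short_documents(seed: dict, min_words: int = 100) -> list:
    # Row indices whose document falls below the recommended word count.
    return [
        index
        for index, text in enumerate(seed.get("document", []))
        if len(text.split()) < min_words
    ]

# Illustrative seed with one column left out and a very short document.
seed = {
    "document": ["Open Locks may only be tried once per lock."],
    "document_outline": ["1. Thief Special Abilities"],
    "domain": ["Tabletop RPG Rules"],
    "icl_document": ["Open Locks allows the Thief to unlock a lock."],
    "icl_query_1": ["What happens if a Thief fails an Open Locks attempt?"],
    "icl_query_2": ["Can a Thief retry a failed Open Locks check immediately?"],
}
print("Missing columns:", missing_columns(seed))
print("Rows below the word recommendation:", short_documents(seed))
```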

## 2.7 Build the Seed Dataset

Now we build the input. The seed dataset is a single row: one document chunk, an outline describing its structure, the domain label, and a set of in-context learning (ICL) examples that show the flow what kind of questions we want generated.

The ICL examples matter. They are not training data. They are demonstrations. They show the generation model the style, specificity, and grounding level we expect in the output. Think of them as a few-shot prompt baked into the pipeline configuration.
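To see why the ICL fields behave like a few-shot prompt, consider how they might be assembled into one. The template below is purely illustrative and is an assumption: it is not the prompt the flow actually uses, and `build_few_shot_prompt` is a hypothetical name.

```python
def build_few_shot_prompt(row: dict) -> str:
    # Collect the three demonstration questions into a bulleted block.
    icl_questions = [row["icl_query_1"], row["icl_query_2"], row["icl_query_3"]]
    example_block = "\n".join(f"- {question}" for question in icl_questions)
    # The demonstration comes first, then the document we want questions about.
    return (
        f"Domain: {row['domain']}\n\n"
        f"Example document:\n{row['icl_document']}\n\n"
        f"Example questions:\n{example_block}\n\n"
        f"Now write questions in the same style for this document:\n{row['document']}\n"
    )

row = {
    "domain": "Tabletop RPG Rules",
    "icl_document": "Open Locks allows the Thief to unlock a lock without a proper key.",
    "icl_query_1": "What happens if a Thief fails an Open Locks attempt?",
    "icl_query_2": "Can a Thief retry a failed Open Locks check immediately?",
    "icl_query_3": "Does the Open Locks ability require a key?",
    "document": "(the chunk to generate questions from)",
}
print(build_few_shot_prompt(row))
```

Changing the demonstration questions changes the style of what comes back, which is why tuning the ICL examples per section is a lever worth knowing about.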

For the first run, we deliberately use a chunk from the beginning of the document. This part of the text is front matter: contributor credits, table of contents, formatting preamble, and structural boilerplate. It is not rich in rules or domain logic. We want to see what happens when the input is structurally clean but semantically thin.



In [8]:
# Load the Docling output
with open("Basic-Fantasy-RPG-Rules-r142.md", "r", encoding="utf-8") as f:
    full_text = f.read()

# First run: a chunk from early in the document (TOC-heavy, rule-light)
document_chunk = full_text[3000:6000]

print(f"Document chunk length: {len(document_chunk)} characters")
print(f"\nFirst 300 characters:\n{document_chunk[:300]}...")

Document chunk length: 3000 characters

First 300 characters:
   Schoonover, Jason   Brentlinger, Chris Wolfmeyer, Josh   Eaton, Audra Brentlinger, Tim McAfee, Ike Borden, Cody  Drebenstedt, Joseph  BierFauble, Emily Drebenstedt, John Lopez, Pedro  Pablo  Miron  Pozo, Robert Odom, Sergio   I.    Nemirovsky, Will    E.    Sanders, Brian Scalise, Timothy  J.    ...


In [9]:
import pandas as pd

dataset = pd.DataFrame({
    'document': [document_chunk],
    'document_outline': [
        '1. Character Abilities and Ability Scores; '
        '2. Hit Points and Hit Dice; '
        '3. Character Classes and Prime Requisites'
    ],
    'domain': ['Tabletop RPG Rules'],
    'icl_document': [
        'Open Locks allows the Thief to unlock a lock without a proper key. '
        'It may only be tried once per lock. If the attempt fails, the Thief '
        'must wait until they have gained another level of experience before '
        'trying again.'
    ],
    'icl_query_1': ['What happens if a Thief fails an Open Locks attempt?'],
    'icl_query_2': ['Can a Thief retry a failed Open Locks check immediately?'],
    'icl_query_3': ['Does the Open Locks ability require a key?'],
})

print(f"Dataset shape: {dataset.shape}")
print(f"Columns: {list(dataset.columns)}")

Dataset shape: (1, 7)
Columns: ['document', 'document_outline', 'domain', 'icl_document', 'icl_query_1', 'icl_query_2', 'icl_query_3']


## 2.8 Generate: Run 1 (TOC-Heavy Chunk)

The `RUN_LIVE` toggle works the same way as in Section 1. Set it to `True` for live generation, `False` to load saved results.

In [11]:
# Set to True to run generation live. Set to False to load saved results.
RUN_LIVE = True

In [12]:
import time
import json

if RUN_LIVE:
    print("Generating synthetic data from TOC-heavy chunk...")
    start_time = time.time()
    result_run1 = flow.generate(dataset)
    elapsed = time.time() - start_time

    print(f"Generation complete in {elapsed:.1f}s")
    print(f"Generated {len(result_run1)} QA pairs")
else:
    result_run1 = pd.read_csv("../prebuilt/sdg_run1_results.csv")
    print(f"Loaded {len(result_run1)} pre-built QA pairs from run 1")

Generating synthetic data from TOC-heavy chunk...


question_generation: 100%|██████████| 1/1 [00:03<00:00,  3.58s/req]


answer_generation: 100%|██████████| 6/6 [00:03<00:00,  1.66req/s]


eval_faithful_llm_chat: 100%|██████████| 6/6 [00:04<00:00,  1.46req/s]


Generation complete in 11.4s
Generated 5 QA pairs


In [None]:
for i in range(len(result_run1)):
    print(f"QA Pair #{i + 1}")
    print(f"  Question:     {result_run1['question'].iloc[i]}")
    print(f"  Answer:       {result_run1['response'].iloc[i][:150]}...")
    if 'faithfulness_judgment' in result_run1.columns:
        print(f"  Faithfulness: {result_run1['faithfulness_judgment'].iloc[i]}")
    print()


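Look at what the flow produced. The questions are structurally sound, the answers are fluent, and the faithfulness judgments may even be positive. The progress bars also show six answers generated but only five pairs returned, which suggests that one pair was dropped along the way, most likely by the faithfulness check. But ask yourself: are these questions useful for training a model on this domain?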

If the source chunk is mostly structural boilerplate, the generated questions will reflect that. The flow cannot manufacture domain depth that the input does not contain. This is the same principle we saw with chunking in Day 2: the quality ceiling is set upstream.
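One way to catch thin chunks before generation is a cheap screen on the text itself. The heuristic below is a crude sketch and an assumption of ours, not something `sdg_hub` provides: it measures the share of alphabetic tokens that are fully lowercase, on the theory that credits and headings are dominated by capitalized names while rules text is mostly ordinary prose.

```python
import re

def prose_ratio(chunk: str) -> float:
    """Fraction of alphabetic tokens that are entirely lowercase."""
    words = re.findall(r"[A-Za-z]+", chunk)
    if not words:
        return 0.0
    lowercase_count = sum(1 for word in words if word.islower())
    return lowercase_count / len(words)

# Short excerpts taken from the chunks shown in this section.
credits_text = "Schoonover, Jason   Brentlinger, Chris Wolfmeyer, Josh   Eaton, Audra Brentlinger"
rules_text = (
    "Open Locks allows the Thief to unlock a lock without a proper key. "
    "It may only be tried once per lock."
)

print(f"Credits excerpt: {prose_ratio(credits_text):.2f}")
print(f"Rules excerpt:   {prose_ratio(rules_text):.2f}")
```

A screen like this will miss plenty of cases (a table of contents written in lowercase would pass), so treat it as a first filter ahead of manual review rather than a quality measure.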

## 2.9 Iteration: Better Input, Better Output

Now we change only the source material: the document chunk and the outline that describes it. Instead of the TOC-heavy opening, we select a chunk rich in actual game rules. The Thief abilities section has explicit mechanics, tables, edge cases, and implicit logic. This is the kind of content that generates useful training signal.

Everything else stays the same. Same ICL examples, same domain label, same flow, same model. If the output improves, we know exactly why.

In [19]:
# Find the section on Thief abilities
search_term = "Open Locks"
idx = full_text.find(search_term)

if idx != -1:
    start = max(0, idx - 500)
    end = min(len(full_text), idx + 2500)
    better_chunk = full_text[start:end]
    print(f"Found '{search_term}' at position {idx}")
    print(f"Chunk range: {start} to {end} ({len(better_chunk)} characters)")
    print(f"\nFirst 500 characters:\n{better_chunk[:500]}...")
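else:
    # Stop here: later cells need better_chunk, which is only defined when the term is found
    raise ValueError(f"'{search_term}' not found in document")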

Found 'Open Locks' at position 57479
Chunk range: 56979 to 59979 (3000 characters)

First 500 characters:
se abilities, as determined by the GM. The GM may choose to make any of these rolls on behalf of the player to help maintain the proper state of uncertainty. Also   note   that   the   GM   may   apply situational   adjustments   (plus   or   minus   percentage points) as they see fit; for instance, it's obviously harder to climb a wall slick with slime than one that is dry, so the GM might apply a penalty of 20% for the slimy wall.

## BASIC FANTASY RPG

## Thief Abilities

|   Thief Level |   ...
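Note that the chunk above begins mid-word ("se abilities"), because a fixed character offset ignores paragraph boundaries. A minimal sketch of one alternative is to widen the window to the nearest blank lines on each side. `paragraph_window` is a hypothetical helper, and it assumes paragraphs in the markdown are separated by blank lines.

```python
def paragraph_window(text: str, term: str, before: int = 500, after: int = 2500) -> str:
    """Return a window around `term`, widened to whole paragraphs."""
    idx = text.find(term)
    if idx == -1:
        raise ValueError(f"{term!r} not found in document")
    start = max(0, idx - before)
    end = min(len(text), idx + after)
    # Move the start back to just after the previous blank line, if there is one.
    previous_break = text.rfind("\n\n", 0, start)
    start = previous_break + 2 if previous_break != -1 else 0
    # Move the end forward to the next blank line, if there is one.
    next_break = text.find("\n\n", end)
    end = next_break if next_break != -1 else len(text)
    return text[start:end]

sample = (
    "Intro paragraph.\n\n"
    "Thieves have special abilities. Open Locks lets a Thief open a lock.\n\n"
    "Closing paragraph."
)
print(paragraph_window(sample, "Open Locks", before=5, after=5))
```

Because the window only ever grows, the result can be longer than the requested `before + after` characters, which matters if the chunk size needs to stay comparable between runs.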


In [20]:
dataset_v2 = pd.DataFrame({
    'document': [better_chunk],
    'document_outline': [
        '1. Thief Special Abilities; '
        '2. Open Locks, Pick Pockets, and Remove Traps; '
        '3. Class Restrictions and Level Progression'
    ],
    'domain': ['Tabletop RPG Rules'],
    'icl_document': [
        'Open Locks allows the Thief to unlock a lock without a proper key. '
        'It may only be tried once per lock. If the attempt fails, the Thief '
        'must wait until they have gained another level of experience before '
        'trying again.'
    ],
    'icl_query_1': ['What happens if a Thief fails an Open Locks attempt?'],
    'icl_query_2': ['Can a Thief retry a failed Open Locks check immediately?'],
    'icl_query_3': ['Does the Open Locks ability require a key?'],
})

Notice what changed and what did not. The ICL document and queries, the domain label, the flow configuration, and the model are all the same. The only things we changed are the document chunk and the outline that describes it.

In [21]:
if RUN_LIVE:
    print("Generating with improved chunk...")
    start_time = time.time()
    result_run2 = flow.generate(dataset_v2)
    elapsed = time.time() - start_time

    print(f"Generation complete in {elapsed:.1f}s")
    print(f"Generated {len(result_run2)} QA pairs")
else:
    result_run2 = pd.read_csv("../prebuilt/sdg_run2_results.csv")
    print(f"Loaded {len(result_run2)} pre-built QA pairs from run 2")


Generating with improved chunk...


question_generation: 100%|██████████| 1/1 [00:03<00:00,  3.00s/req]


answer_generation: 100%|██████████| 8/8 [00:03<00:00,  2.02req/s]


eval_faithful_llm_chat: 100%|██████████| 8/8 [00:05<00:00,  1.58req/s]


Generation complete in 12.2s
Generated 8 QA pairs


## 2.10 Compare the Two Runs

In [22]:
print(f"{'='*70}")
print("COMPARISON: RUN 1 (TOC-HEAVY) vs. RUN 2 (RULE-RICH)")
print(f"{'='*70}")

print(f"\n  Run 1: {len(result_run1)} pairs from TOC-heavy chunk")
print(f"  Run 2: {len(result_run2)} pairs from rule-rich chunk")

print(f"\n  --- Run 2 Q&A Pairs ---")
for i in range(len(result_run2)):
    print(f"\n  QA Pair #{i + 1}")
    print(f"    Question:     {result_run2['question'].iloc[i]}")
    print(f"    Answer:       {result_run2['response'].iloc[i][:150]}...")
    if 'faithfulness_judgment' in result_run2.columns:
        print(f"    Faithfulness: {result_run2['faithfulness_judgment'].iloc[i]}")


COMPARISON: RUN 1 (TOC-HEAVY) vs. RUN 2 (RULE-RICH)

  Run 1: 5 pairs from TOC-heavy chunk
  Run 2: 8 pairs from rule-rich chunk

  --- Run 2 Q&A Pairs ---

  QA Pair #1
    Question:     What is the Thief's Open Locks ability score at level 8?
    Answer:       At level 8, the Thief's Open Locks ability score is 60. This information is found in the Thief Abilities table in the document, under the column for O...
    Faithfulness: YES

  QA Pair #2
    Question:     How does the Thief's Move Silently ability progress from level 1 to level 15?
    Answer:       Based on the provided document, the Thief's "Move Silently" ability progresses as follows from level 1 to level 15:

- Level 1: 25
- Level 2: 30
- Lev...
    Faithfulness: YES

  QA Pair #3
    Question:     Can a GM adjust the Thief's ability scores, and for what reason?
    Answer:       Yes, a Game Master (GM) can adjust a Thief's ability scores in certain situations. The document specifies that the GM may apply situational ad

The difference should be visible. The rule-rich chunk produces questions about specific mechanics: ability percentages, level progression, situational modifiers. These are the kinds of questions a customer would actually ask, and the kinds of answers the model needs to learn.

The TOC-heavy chunk, by contrast, tends to produce structural or definitional questions that the model can often already answer from general knowledge. Those pairs add volume but not depth.

This is the core lesson of this section: **SDG amplifies what you give it.** Good input produces useful training signal. Thin input produces filler. The generation model is not inventing domain knowledge. It is reshaping whatever knowledge the source chunk contains.

> **Facilitator note:** This is worth pausing on. Ask the room:
>
> "If we had generated 10,000 pairs from the TOC chunk and fine-tuned on them, what would we expect the model to learn?"
>
> The answer: very little that it did not already know. Volume without signal is not training data. It is noise shaped like training data.

## 2.11 Save and Export

In [23]:
if RUN_LIVE:
    # Save run 1 results
    result_run1.to_csv("../prebuilt/sdg_run1_results.csv", index=False)
    result_run1.to_csv("sdg_run1_results.csv", index=False)

    # Save run 2 results
    result_run2.to_csv("../prebuilt/sdg_run2_results.csv", index=False)
    result_run2.to_csv("sdg_run2_results.csv", index=False)

    # Also save run 2 as the primary export for downstream use
    keep_cols = ['question', 'response']
    if 'faithfulness_judgment' in result_run2.columns:
        keep_cols.append('faithfulness_judgment')
    if 'relevancy_score' in result_run2.columns:
        keep_cols.append('relevancy_score')

    qa_df = result_run2[keep_cols]
    qa_df.to_csv("synthetic_qa_pairs.csv", index=False)

    print(f"Saved run 1: {len(result_run1)} pairs to ../prebuilt/sdg_run1_results.csv")
    print(f"Saved run 2: {len(result_run2)} pairs to ../prebuilt/sdg_run2_results.csv")
    print(f"Exported {len(qa_df)} pairs to synthetic_qa_pairs.csv")
else:
    print("Using pre-built results. Nothing to save.")

Saved run 1: 5 pairs to ../prebuilt/sdg_run1_results.csv
Saved run 2: 8 pairs to ../prebuilt/sdg_run2_results.csv
Exported 8 pairs to synthetic_qa_pairs.csv
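The export above keeps every pair from run 2. In a real engagement you would typically filter on the faithfulness judgment and pull a sample for manual review before training. The sketch below shows that step on plain dictionaries. The helper names are hypothetical, the rows are placeholders rather than real results, and it assumes judgments are the strings `YES` and `NO`, as in the output shown earlier.

```python
import random

def keep_faithful(pairs: list[dict]) -> list[dict]:
    # Keep only the pairs that the faithfulness check marked YES.
    return [
        pair for pair in pairs
        if str(pair.get("faithfulness_judgment", "")).strip().upper() == "YES"
    ]

def review_sample(pairs: list[dict], k: int = 3, seed: int = 0) -> list[dict]:
    # Draw a reproducible random sample for a human reviewer.
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))

# Placeholder rows for illustration; these are not generated results.
pairs = [
    {"question": "Q1", "response": "A1", "faithfulness_judgment": "YES"},
    {"question": "Q2", "response": "A2", "faithfulness_judgment": "NO"},
    {"question": "Q3", "response": "A3", "faithfulness_judgment": "yes "},
    {"question": "Q4", "response": "A4"},
]

faithful = keep_faithful(pairs)
print(f"Kept {len(faithful)} of {len(pairs)} pairs")
for pair in review_sample(faithful, k=2):
    print(f"Review: {pair['question']}")
```

Pairs with no judgment at all are dropped here, which is the conservative choice. Whether that is right depends on how much you trust the flow's own filtering.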


## 2.12 What This Section Was Really About

We generated synthetic question-answer pairs. But that was the activity, not the lesson.

The lesson is that SDG is a manufacturing process, and like any manufacturing process, the quality of the output is determined by the quality of the input. The generation model is a tool. The flow is a tool. The real leverage is in what you feed them.

This connects directly to the escalation ladder from Day 2:

1. If your documents are poorly ingested, SDG produces garbage.
2. If your chunks are incoherent, SDG produces fragmented pairs.
3. If your source material is structurally clean but semantically thin, SDG produces filler.
4. If your source material is rich, specific, and well-structured, SDG produces training signal.

Every upstream decision compounds. This is not new. It is the same principle that governed ingestion, chunking, retrieval, and evaluation. SDG is just the next stage in the same pipeline.

The pairs we generated here are a starting point, not a finished dataset. In a real engagement, you would iterate across the full document, tune the ICL examples for different sections, filter on faithfulness scores, and review samples manually before training. The pipeline supports all of that. But the critical insight is simpler:

**If you want better training data, start with better source material and better chunking. Not a bigger model.**

**Transition to Section 3:**

"We now have structured, grounded training data generated from the customer's own documents. The next step is to use it. Section 3 takes these QA pairs and applies them to model adaptation: fine-tuning the model so that the knowledge gaps we identified in Sections 1 and 2 are addressed at the weight level."