# 🐶 Data Pre-Processing

This notebook walks through each stage of data pre-processing. Once an SDG seed dataset has been created, a user can run an SDG notebook to generate samples.

1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

***

In [2]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used
SOURCE_DOCUMENT_DIR = WORKSPACE_DIR / "source_documents"
SOURCE_DOCUMENT_DIR.mkdir(parents=True, exist_ok=True)

CONVERSION_OUTPUT_DIR = WORKSPACE_DIR / "conversion"
CONVERSION_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CHUNKING_OUTPUT_DIR = WORKSPACE_DIR / "chunking"
CHUNKING_OUTPUT_DIR.mkdir(exist_ok=True)

AUTHORING_OUTPUT_DIR = WORKSPACE_DIR / "authoring"
AUTHORING_OUTPUT_DIR.mkdir(exist_ok=True)

SEED_EXAMPLE_INPUT_DIR = WORKSPACE_DIR / "sdg_inputs"
SEED_EXAMPLE_INPUT_DIR.mkdir(exist_ok=True)

SEED_EXAMPLE_OUTPUT_DIR = WORKSPACE_DIR / "seed_examples"
SEED_EXAMPLE_OUTPUT_DIR.mkdir(exist_ok=True)

SDG_OUTPUT_DIR = WORKSPACE_DIR / "sdg"
SDG_OUTPUT_DIR.mkdir(exist_ok=True)

## Document Conversion

This section uses [Docling](https://github.com/docling-project/docling) to convert source documents (PDFs in this workspace) into Docling Documents. A Docling Document is the representation of a document after conversion, and it can be exported as JSON. The JSON output can then be consumed by later steps, such as Docling's chunking methods.

In [2]:
!pip install -qq docling

In [4]:
files = []

if SOURCE_DOCUMENT:
    files.append(Path(SOURCE_DOCUMENT))
else:
    print("***** WARNING! Only one file at a time is supported at this time.")
    files = list(SOURCE_DOCUMENT_DIR.rglob("*.pdf"))
    if not files:
        # Fail early with a clear message instead of an IndexError on files[0]
        raise FileNotFoundError(f"No PDF files found in {SOURCE_DOCUMENT_DIR}")
    print(f"***** Using {files[0]}")

print(f"Files to convert: {files}")

***** Using workspaces/default/source_documents/2502.01618v3.pdf
Files to convert: [PosixPath('workspaces/default/source_documents/2502.01618v3.pdf')]
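
The discovery step above only picks up PDFs. A minimal sketch of selecting several document types at once is shown below. The `SUPPORTED_SUFFIXES` set and the `select_supported` helper are hypothetical names introduced for illustration; the suffix list is an assumption and should be adjusted to the formats your converter is actually configured for.

```python
from pathlib import Path

# Hypothetical suffix list; an assumption, not taken from Docling or this notebook.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".html"}

def select_supported(paths, suffixes=SUPPORTED_SUFFIXES):
    """Return the paths whose suffix (case-insensitive) is in suffixes, sorted."""
    return sorted(p for p in paths if p.suffix.lower() in suffixes)
```

With this helper, the discovery line could become `files = select_supported(SOURCE_DOCUMENT_DIR.rglob("*"))`.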


Next, we set the configuration options for the conversion pipeline. The PDF conversion options set here are the defaults. More information about pipeline configuration can be found in the Docling documentation.

In [5]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

Finally, we convert each document to Docling JSON, provided it is a file type the converter supports.

In [6]:
import json

for file in files:
    doc = doc_converter.convert(source=file).document
    doc_dict = doc.export_to_dict()

    json_output_path = CONVERSION_OUTPUT_DIR / f"{file.stem}.json"
    with open(json_output_path, "w") as f:
        json.dump(doc_dict, f)
        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")

Path of JSON output is: /Users/astoyano/Documents/code/examples/notebooks/instructlab-knowledge/workspaces/default/conversion/2502.01618v3.json


## Chunking

The goal of chunking the converted documents is to give the teacher model small, logically coherent pieces of the source document from which to generate data.

This section performs chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this step is a Docling JSON file created by a Docling conversion, or a directory of Docling JSON files.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks.
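
To make the idea of merging undersized peer chunks under a token budget concrete, here is a toy sketch. It is not Docling's implementation: the `merge_undersized` function is hypothetical, tokens are approximated by whitespace-separated words, and document hierarchy is ignored entirely.

```python
def merge_undersized(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily merge adjacent chunks while the merged text fits in max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption;
    a real chunker would count tokens with the model's tokenizer).
    """
    merged: list[str] = []
    for chunk in chunks:
        candidate = f"{merged[-1]} {chunk}" if merged else chunk
        if merged and len(candidate.split()) <= max_tokens:
            # The previous chunk plus this one still fits: merge them.
            merged[-1] = candidate
        else:
            # Either this is the first chunk or the budget would be exceeded.
            merged.append(chunk)
    return merged
```

For example, with a budget of 4 words, `["a b", "c", "d e f g", "h"]` becomes `["a b c", "d e f g", "h"]`: the first two chunks fit together, while the last two cannot be combined without exceeding the budget.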

In [7]:
from docling.chunking import HybridChunker

chunker = HybridChunker() # TODO: expose configuration options

### Load and chunk the converted docling document

Next, let's convert the document we want to chunk into a Docling Document.

In [9]:
all_chunks = []
docs = []
for file in files:
    doc = doc_converter.convert(source=file)  # reuse the converter configured above
    docs.append(doc)
    
    chunk_iter = chunker.chunk(dl_doc=doc.document)
    chunk_objs = list(chunk_iter)
    chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]

    print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
    
    for chunk in chunks:
        c = dict(chunk=chunk, file=file.stem)
        all_chunks.append(c)

# TODO: support multiple files; save all chunks to a single file for review

Extracted 52 chunks from 2502.01618v3


### View the Chunks

To view the chunks, run the following cell. As you can see, the document is broken into small pieces, each carrying metadata derived from the document's structure.

In [8]:
#print(all_chunks)
print(chunks[0])

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
Isha Puri 1 Shivchander Sudalairaj 2 Guangxuan Xu 2 Kai Xu 2 Akash Srivastava 2 1 MIT CSAIL 2 Red Hat AI Innovation
c


### Save each chunk to its own text file

Each chunk is saved to an individual text file named `{docling-json-file-name}-{chunk #}.txt`. This naming format matters because the files serve as input to the create-sdg-seed-data step.

In [9]:
for i, chunk in enumerate(all_chunks):
    chunk_path = CHUNKING_OUTPUT_DIR / f"{chunk['file']}-{i}.txt"
    # use `f` for the handle so the `file` loop variable from earlier cells is not shadowed
    with open(chunk_path, "w") as f:
        f.write(chunk["chunk"])
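
Because later steps depend on the `{name}-{N}.txt` naming format, it can help to parse the names back into their parts, for example to sort chunks numerically rather than lexically. The sketch below uses a hypothetical `parse_chunk_name` helper; it assumes the chunk index is the final hyphen-separated component of the file stem.

```python
import re
from pathlib import Path

# Matches "<stem>-<index>"; the stem itself may contain hyphens and dots.
CHUNK_NAME = re.compile(r"^(?P<stem>.+)-(?P<index>\d+)$")

def parse_chunk_name(path: Path) -> tuple[str, int]:
    """Split a chunk file name into (source document stem, chunk index)."""
    match = CHUNK_NAME.match(path.stem)
    if match is None:
        raise ValueError(f"Not a chunk file name: {path.name}")
    return match["stem"], int(match["index"])
```

Passing it as a sort key, as in `sorted(CHUNKING_OUTPUT_DIR.glob("*.txt"), key=parse_chunk_name)`, orders chunk 2 before chunk 10.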

## Authoring

In [55]:
!pip install -qq docling-sdg

# TODO: replace with above after https://github.com/docling-project/docling-sdg/pull/31 merges
#!pip install -qq git+https://github.com/anastasds/docling-sdg@d15de2c5a81bfe166f66f412fc4b23728065f396

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [56]:
from docling_sdg.qa.utils import get_qa_chunks

filters = [
    lambda chunk: len(str(chunk.text)) > 500
]

dataset = {}
for doc in docs:
    print(f"Chunking and filtering document {doc.document.name}")

    chunks = list(chunker.chunk(dl_doc=doc.document))
    qa_chunks = list(get_qa_chunks(doc.document.name, chunks, filters))  # use this document's own chunks, not chunk_objs from the chunking cell
    dataset[doc.document.name] = qa_chunks
    
    print(f"Created dataset {doc.document.name} with {len(qa_chunks)} QA chunks")

Chunking and filtering document 2502.01618v3
Created dataset 2502.01618v3 with 45 QA chunks
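
The `filters` list above holds predicates over chunks. As a rough illustration of how such a list can be applied, the sketch below keeps a chunk only if every predicate accepts it. This AND semantics is an assumption about how `get_qa_chunks` treats its filters, and `ToyChunk` and `apply_filters` are hypothetical stand-ins, not part of docling-sdg.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ToyChunk:
    """Hypothetical stand-in for a chunk object with a `text` attribute."""
    text: str

def apply_filters(
    chunks: Iterable[ToyChunk],
    filters: Iterable[Callable[[ToyChunk], bool]],
) -> list[ToyChunk]:
    """Keep only the chunks that every filter accepts (logical AND)."""
    filters = list(filters)
    return [chunk for chunk in chunks if all(f(chunk) for f in filters)]
```

Adding a second predicate, such as one that rejects chunks consisting mostly of references, would then narrow the selection further without changing the calling code.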


### Initialize QA generator, supplying details for which model to use

`GenerateOptions` controls which model is used for QA generation through the `generate_options.provider` setting below. Three options are available:

* `LlmProvider.WATSONX` for watsonx
* `LlmProvider.OPENAI` for OpenAI
* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs

In [57]:
import os

from docling_sdg.qa.generate import Generator
from docling_sdg.qa.base import GenerateOptions, LlmProvider
from pydantic import SecretStr

generate_options = GenerateOptions(api_key="fake", project_id="project_id")
generate_options.provider = LlmProvider.OPENAI_LIKE
# Read the key from the environment instead of committing it to the notebook.
# LLM_API_KEY is an assumed variable name; "fake" is kept as the placeholder default.
generate_options.api_key = SecretStr(os.environ.get("LLM_API_KEY", "fake"))
generate_options.url = "https://mixtral-8x7b-instruct-v0-1-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"
generate_options.model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

### Configure subset selection

In [27]:
NUM_CHUNKS_TO_SELECT_FOR_AUTHORING = 5
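
Random selection is easier to review when it is reproducible and tolerant of small documents. The sketch below is one way to do that with the standard library; `select_chunks` and its `seed` parameter are hypothetical names, not part of this notebook or docling-sdg.

```python
import random

def select_chunks(chunks, k, seed=None):
    """Pick up to k chunks at random; a fixed seed makes the choice repeatable."""
    items = list(chunks)
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    # Cap k at the number of available chunks so sample() cannot raise ValueError.
    return rng.sample(items, min(k, len(items)))
```

Calling `select_chunks(chunks, NUM_CHUNKS_TO_SELECT_FOR_AUTHORING, seed=0)` twice returns the same selection, which makes generated QA pairs easier to compare across runs.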

### Run QA generation on selected chunks

In [62]:
import random #TODO: replace random sampling with subset selection

for doc, chunks in dataset.items(): # TODO: multiple file support
    generate_options.generated_file = AUTHORING_OUTPUT_DIR / f"qagen-{doc}.json" 
    gen = Generator(generate_options=generate_options)
    
    print(f"processing chunks that look like:\n{chunks[0].text}")
    # Cap the sample size so random.sample does not raise ValueError on short documents
    selected_chunks = random.sample(chunks, min(NUM_CHUNKS_TO_SELECT_FOR_AUTHORING, len(chunks)))
    print(f"Selected {len(selected_chunks)} contexts")

    Path.unlink(generate_options.generated_file, missing_ok=True)
    results = gen.generate_from_chunks(selected_chunks) # automatically saves to file
    
    print(f"{doc}: {results.status}")
    break

processing chunks that look like:
Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic sear

100%|███████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:20<00:00,  4.03s/it]

2502.01618v3: Status.SUCCESS





### Read generated QAs and restructure

In [63]:
import json
import yaml
from textwrap import wrap

qnas = {}
chunk_id_to_text = {}
with open(generate_options.generated_file, "rt") as f:
    for line in f.readlines():
        entry = json.loads(line)
        chunk_id = entry['chunk_id']
        if chunk_id not in chunk_id_to_text:
            chunk_id_to_text[chunk_id] = entry['context']
        if chunk_id not in qnas:
            qnas[chunk_id] = []
        qnas[chunk_id].append({'question': entry['question'], 'answer': entry['answer']})

print(f"Generated QA pairs for {len(qnas)} contexts")
print(list(qnas.values())[0])

Generated QA pairs for 5 contexts
[{'question': 'What is the name of the paper that presents a novel inference-time scaling approach for large language models?', 'answer': 'The name of the paper is not mentioned in the provided context.'}, {'question': 'What are the main advantages of the proposed inference-time scaling approach compared to existing methods?', 'answer': 'The proposed approach has a 4-16x better scaling rate over deterministic search counterparts on various mathematical reasoning tasks, and it can surpass GPT4o accuracy in only 4 rollouts for Qwen2.5-Math-1.5B-Instruct and scale to o1 level accuracy in only 32 rollouts for Qwen2.5Math-7B-Instruct.'}, {'question': 'Given the empirical evaluation results, how would you infer the computational cost of the proposed approach compared to the existing inference-time scaling methods?', 'answer': 'Based on the provided context, the computational cost of the proposed approach is significantly less than that of the existing method
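
The restructuring step groups one JSON record per line by `chunk_id`. A compact sketch of the same grouping, written as a reusable function over any iterable of lines, is shown below. The function name `group_qa_by_chunk` is hypothetical; the field names (`chunk_id`, `context`, `question`, `answer`) are the ones read by the cell above.

```python
import json
from collections import defaultdict

def group_qa_by_chunk(lines):
    """Group JSON Lines QA records by chunk_id.

    Returns (contexts, qnas): the context text per chunk, and the list of
    question/answer dicts per chunk. Blank lines are skipped.
    """
    contexts = {}
    qnas = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        chunk_id = entry["chunk_id"]
        # Keep the first context seen for each chunk.
        contexts.setdefault(chunk_id, entry["context"])
        qnas[chunk_id].append(
            {"question": entry["question"], "answer": entry["answer"]}
        )
    return contexts, dict(qnas)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which avoids reading the whole file into memory with `readlines()`.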

### Define metadata for qna.yaml

In [76]:
DOCUMENT_OUTLINE = "A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods"
DOMAIN = "artificial intelligence"

### Output qna.yaml

In [77]:
qna_output_path = AUTHORING_OUTPUT_DIR / "qna.yaml"

def str_presenter(dumper, data):
    """Emit multi-line or long strings as YAML literal block scalars."""
    if len(data.splitlines()) > 1:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    elif len(data) > 80:
        # wrap long single-line strings at 80 columns before emitting them
        data = "\n".join(wrap(data, 80))
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)

# to use with safe_dump:
yaml.representer.SafeRepresenter.add_representer(str, str_presenter)

class IndentedDumper(yaml.Dumper):
    """Dumper that indents list items under their parent key."""
    def increase_indent(self, flow=False, indentless=False):
        return super().increase_indent(flow, False)

data = {'seed_examples': []}
for chunk_id, context in chunk_id_to_text.items():
    data['seed_examples'].append({
        'context': context,
        'questions_and_answers': [
            {
                'question': example['question'],
                'answer': example['answer'],
            } for example in qnas[chunk_id]
        ]
    })

data['document_outline'] = DOCUMENT_OUTLINE
data['domain'] = DOMAIN

Path.unlink(qna_output_path, missing_ok=True) # shouldn't be necessary but was. jupyter caching thing?
with open(qna_output_path, 'w') as yaml_file:
    yaml.dump(data, yaml_file, Dumper=IndentedDumper, default_flow_style=False, sort_keys=False, width=80)

print("Done")

Done
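
The wrapping rule inside `str_presenter` can be examined on its own, without PyYAML. The sketch below isolates it in a hypothetical `wrap_for_block_scalar` helper; the 80-column default mirrors the value used above.

```python
from textwrap import wrap

def wrap_for_block_scalar(text: str, width: int = 80) -> str:
    """Wrap a long single-line string; leave short or multi-line strings as-is."""
    if len(text.splitlines()) > 1 or len(text) <= width:
        return text
    # Break at whitespace so that no line exceeds `width` characters.
    return "\n".join(wrap(text, width))
```

One consequence worth keeping in mind: the wrapped value contains real newline characters, so a context loaded back from `qna.yaml` will have line breaks where the original chunk text had spaces.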


### View generated qna.yaml

In [78]:
with open(qna_output_path) as yaml_file:
    print(yaml_file.read())

seed_examples:
  - context: |-
      Large language models (LLMs) have achieved significant performance gains via
      scaling up model sizes and/or data. However, recent evidence suggests
      diminishing returns from such approaches, motivating scaling the computation
      spent at inference time. Existing inference-time scaling methods, usually with
      reward models, cast the task as a search problem, which tends to be vulnerable
      to reward hacking as a consequence of approximation errors in reward models. In
      this paper, we instead cast inference-time scaling as a probabilistic inference
      task and leverage sampling-based techniques to explore the typical set of the
      state distribution of a state-space model with an approximate likelihood, rather
      than optimize for its mode directly. We propose a novel inference-time scaling
      approach by adapting particle-based Monte Carlo methods to this task. Our
      empirical evaluation demonstrates that our 
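
Before handing `qna.yaml` to the next stage, a quick structural check on the `data` dictionary can catch empty fields early. The sketch below is a hypothetical helper, `find_problems`; the `min_examples` and `min_pairs` thresholds are placeholder assumptions, not requirements stated in this notebook, so set them to whatever your downstream tooling expects.

```python
def find_problems(data, min_examples=5, min_pairs=3):
    """Return a list of human-readable problems found in a qna.yaml-style dict."""
    problems = []
    for key in ("seed_examples", "document_outline", "domain"):
        if not data.get(key):
            problems.append(f"missing or empty field: {key}")
    examples = data.get("seed_examples") or []
    if len(examples) < min_examples:
        problems.append(
            f"expected at least {min_examples} seed examples, found {len(examples)}"
        )
    for i, example in enumerate(examples):
        if not example.get("context"):
            problems.append(f"seed example {i}: empty context")
        pairs = example.get("questions_and_answers") or []
        if len(pairs) < min_pairs:
            problems.append(
                f"seed example {i}: expected at least {min_pairs} QA pairs, "
                f"found {len(pairs)}"
            )
    return problems
```

An empty list means the structure looks complete; it says nothing about the quality of the generated questions and answers, which still need human review.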

## Create Seed Dataset for SDG

This section combines the contents of `qna.yaml` with the chunks from the source document to create a seed dataset for the synthetic data generation process.

To run this section, you need a directory that contains N chunks named `{original-file-name}-{N}.txt` and a `qna.yaml` in the same directory.

This section writes a `seed_data.jsonl` file to `SEED_EXAMPLE_OUTPUT_DIR`.
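
The input requirements described here can be verified before building the dataset. The sketch below is a hypothetical pre-flight check, `check_seed_inputs`, that works on a list of file names; it only confirms that a `qna.yaml` and at least one `.txt` chunk are present, and does not inspect their contents.

```python
def check_seed_inputs(file_names):
    """Return a list of problems with the seed dataset input directory listing."""
    names = list(file_names)
    problems = []
    if "qna.yaml" not in names:
        problems.append("qna.yaml is missing")
    if not any(name.endswith(".txt") for name in names):
        problems.append("no chunk .txt files found")
    return problems
```

It can be called with the directory listing, for example `check_seed_inputs(p.name for p in SEED_EXAMPLE_INPUT_DIR.iterdir())`, once the chunk files and `qna.yaml` have been copied into place.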

In [79]:
!pip install -qq datasets transformers


In [80]:
import os
import shutil

from utils.create_seed_dataset import get_seed_dataset

src_files = os.listdir(CHUNKING_OUTPUT_DIR)

for file_name in src_files:
    full_file_name = os.path.join(CHUNKING_OUTPUT_DIR, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, SEED_EXAMPLE_INPUT_DIR)

shutil.copy(qna_output_path, SEED_EXAMPLE_INPUT_DIR)

seed_data = get_seed_dataset(SEED_EXAMPLE_INPUT_DIR)
seed_data.to_json(f'{SEED_EXAMPLE_OUTPUT_DIR}/seed_data.jsonl', orient='records', lines=True)

Map:   0%|          | 0/52 [00:00<?, ? examples/s]

Map:   0%|          | 0/260 [00:00<?, ? examples/s]

Filter:   0%|          | 0/260 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

924266