# 🐶 Data Pre-Processing: From source PDF to SDG-ready

This notebook goes through each of the stages of data pre-processing. Directory-based conventions are used to save intermediate results as a PDF is converted and chunked, QA generation is performed to create a `qna.yaml` file, and finally everything is combined into the inputs for SDG.

Once an SDG seed dataset is created, a user can run through an SDG notebook and generate samples.

**NOTE**: Starting the notebook using Python 3.11 is recommended. Python 3.12 and later are not yet supported.

1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

***

In [1]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT = None # to process a specific document, set its path here; otherwise, the entire source documents repository will be used
SOURCE_DOCUMENT_DIR = WORKSPACE_DIR / "source_documents"
SOURCE_DOCUMENT_DIR.mkdir(parents=True, exist_ok=True)

CONVERSION_OUTPUT_DIR = WORKSPACE_DIR / "conversion"
CONVERSION_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CHUNKING_OUTPUT_DIR = WORKSPACE_DIR / "chunking"
CHUNKING_OUTPUT_DIR.mkdir(exist_ok=True)

AUTHORING_OUTPUT_DIR = WORKSPACE_DIR / "authoring"
AUTHORING_OUTPUT_DIR.mkdir(exist_ok=True)

SEED_EXAMPLE_INPUT_DIR = WORKSPACE_DIR / "sdg_inputs"
SEED_EXAMPLE_INPUT_DIR.mkdir(exist_ok=True)

SEED_EXAMPLE_OUTPUT_DIR = WORKSPACE_DIR / "seed_examples"
SEED_EXAMPLE_OUTPUT_DIR.mkdir(exist_ok=True)

SDG_OUTPUT_DIR = WORKSPACE_DIR / "sdg"
SDG_OUTPUT_DIR.mkdir(exist_ok=True)
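
The directory convention set up above can also be expressed as a single helper. The following is a minimal sketch, not part of the notebook's utilities: the function name `make_workspace` is hypothetical, and the subdirectory names simply mirror the constants defined in the cell above.

```python
from pathlib import Path

# Subdirectory names mirror the constants defined in the setup cell.
STAGE_DIRS = (
    "source_documents",
    "conversion",
    "chunking",
    "authoring",
    "sdg_inputs",
    "seed_examples",
    "sdg",
)


def make_workspace(root: Path, name: str = "default") -> dict[str, Path]:
    """Create one directory per pipeline stage and return them keyed by stage name.

    `make_workspace` is a hypothetical helper used for illustration only.
    """
    workspace = Path(root) / name
    stage_paths = {}
    for stage in STAGE_DIRS:
        stage_path = workspace / stage
        # parents=True creates the root and workspace directories when missing.
        stage_path.mkdir(parents=True, exist_ok=True)
        stage_paths[stage] = stage_path
    return stage_paths
```

Passing `parents=True` everywhere removes the dependency on the order in which the directories are created.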

## Document Conversion

This section uses [Docling](https://github.com/docling-project/docling) to convert supported document types into a Docling Document. A Docling Document is the representation of the document after conversion, and it can be exported as JSON. The JSON output can then be used by later stages, such as chunking with Docling's chunking methods.

In [2]:
!pip install -qq docling

In [3]:
files = []

if SOURCE_DOCUMENT:
    files.append(Path(SOURCE_DOCUMENT))
else:
    print("***** WARNING! Only one file at a time is supported at this time.")
    files = list(SOURCE_DOCUMENT_DIR.rglob("*.pdf"))
    if not files:
        raise FileNotFoundError(f"No PDF files found in {SOURCE_DOCUMENT_DIR}")
    files = files[:1]  # only the first PDF found is processed for now
    print(f"***** Using {files[0]}")

print(f"Files to convert: {files}")

***** Using workspaces/default/source_documents/Advantag Savings.pdf
Files to convert: [PosixPath('workspaces/default/source_documents/Advantag Savings.pdf')]


### Configure Docling conversion pipeline

Next, we set the configuration options for the conversion pipeline. The PDF conversion options set here are the defaults.

For a complete reference on Docling conversion pipeline configuration, see [PdfPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PdfFormatOption](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS).

In [4]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)


Finally, we convert each document to Docling JSON, provided it is a file type that Docling can convert.

In [5]:
import json

for file in files:
    doc = doc_converter.convert(source=file).document
    doc_dict = doc.export_to_dict()

    json_output_path = CONVERSION_OUTPUT_DIR / f"{file.stem}.json"
    with open(json_output_path, "w") as f:
        json.dump(doc_dict, f)
        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")


Path of JSON output is: /Users/dawarner/Documents/PROJECTS/examples/notebooks/instructlab-knowledge/workspaces/default/conversion/Advantag Savings.json
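
Before moving on, it can be useful to confirm that each saved file is valid JSON. The sketch below uses only the standard library and assumes nothing about the Docling schema beyond each file holding a JSON object; `summarize_json_outputs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_json_outputs(output_dir: Path) -> dict[str, list[str]]:
    """Map each JSON file name in `output_dir` to its sorted top-level keys."""
    summary = {}
    for json_path in sorted(Path(output_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            content = json.load(f)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(content, dict):
            raise ValueError(f"{json_path.name} does not contain a JSON object")
        summary[json_path.name] = sorted(content)
    return summary
```

For example, `summarize_json_outputs(CONVERSION_OUTPUT_DIR)` would list the top-level keys of every file written by the conversion cell.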



## Chunking

The goal of chunking the converted documents is to give the teacher model small, logical pieces of the source document from which to generate data.

This notebook performs chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this step is a Docling JSON file created by a Docling conversion, or a directory of Docling JSON files.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these.
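
To build intuition for what a `max_tokens` budget and merging undersized chunks mean, here is a toy illustration. It is not Docling's implementation: it counts whitespace-separated words instead of using a real tokenizer, it ignores document hierarchy entirely, and the function names `count_tokens` and `merge_undersized` are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def merge_undersized(pieces: list[str], max_tokens: int) -> list[str]:
    """Greedily merge adjacent pieces while the merged text fits in `max_tokens`."""
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            # The previous chunk still has room, so append this piece to it.
            merged[-1] = f"{merged[-1]} {piece}"
        else:
            # Start a new chunk (a single oversized piece is kept as-is here).
            merged.append(piece)
    return merged
```

With a budget of four words, the pieces `["a b", "c", "d e f", "g"]` collapse into two chunks, because the third piece no longer fits into the first merged chunk.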

In [6]:
#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
#from transformers import AutoTokenizer

from docling.chunking import HybridChunker

#EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
#MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    #tokenizer=tokenizer,
    #merge_peers=True,  # whether to merge undersized chunks - defaults to True
)

### Load and chunk the converted docling document

Next, let's convert the document we want to chunk into a Docling Document.

In [7]:
all_chunks = []
docs = []
for file in files:
    doc = doc_converter.convert(source=file)  # reuse the converter configured above
    docs.append(doc)
    
    chunk_iter = chunker.chunk(dl_doc=doc.document)
    chunk_objs = list(chunk_iter)
    chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]

    print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
    
    for chunk in chunks:
        c = dict(chunk=chunk, file=file.stem)
        all_chunks.append(c)

# TODO: support multiple files save all chunks to single file for review


Extracted 6 chunks from Advantag Savings



### View the Chunks

To view the chunks, run the following cell. As you can see, the document is broken into small pieces, each prefixed with metadata, such as headings, derived from the document's structure.

In [8]:
#print(all_chunks)
print(chunks[1])

Interest rate policy
Interest rate and interest calculation
Your account has a variable interest rate, which means the interest rate may change. Interest rates for your account are based on balance tiers, are set at our discretion and may change at any time without notice.
To calculate interest, we apply a daily periodic rate to the collected balance in your account each day. The daily rate that applies will depend on which balance tier your end-of-day balance falls in. We compound and pay any interest earned monthly. Interest is only paid in whole cents and we use standard rounding rules to calculate the amount. This means that an amount less than half of one cent is rounded down to zero, and an amount of half of one cent or more is rounded up to the next whole cent.
Extra interest
Extra interest is earned automatically when an account owner is enrolled in the Preferred Rewards program. The interest rate will increase by at least 5%, 10% or 20% over the standard rate, and is based on 

### Save each chunk to its own text file

Each chunk is saved to an individual text file named `{docling-json-file-name}-{chunk #}.txt`. This naming format matters because the seed dataset step later in this notebook expects it as input.

In [9]:
for i, chunk in enumerate(all_chunks):
    chunk_path = CHUNKING_OUTPUT_DIR / f"{chunk['file']}-{i}.txt"
    with open(chunk_path, "w") as file:
        file.write(chunk["chunk"])
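
Because later steps rely on this naming convention, a small parser can help when checking the output directory. The sketch below is an illustration only: `parse_chunk_filename` is a hypothetical helper, and it splits on the last hyphen so that document names containing hyphens are still handled.

```python
from pathlib import Path


def parse_chunk_filename(path: Path) -> tuple[str, int]:
    """Split `{document-name}-{chunk #}.txt` into (document name, chunk number)."""
    path = Path(path)
    if path.suffix != ".txt":
        raise ValueError(f"Not a chunk text file: {path.name}")
    # rpartition splits on the LAST hyphen, so hyphens in the document name survive.
    name, sep, index = path.stem.rpartition("-")
    if not sep or not name or not index.isdecimal():
        raise ValueError(f"Unexpected chunk file name: {path.name}")
    return name, int(index)
```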

## Authoring

In [10]:
!pip install -qq docling-sdg

# TODO: replace with above after https://github.com/docling-project/docling-sdg/pull/31 merges
#!pip install -qq git+https://github.com/anastasds/docling-sdg@d15de2c5a81bfe166f66f412fc4b23728065f396


In [11]:
from docling_sdg.qa.utils import get_qa_chunks

filters = [
    lambda chunk: len(str(chunk.text)) > 500
]

dataset = {}
for doc in docs:
    print(f"Chunking and filtering document {doc.document.name}")

    chunks = list(chunker.chunk(dl_doc=doc.document))
    # use this document's own chunks rather than `chunk_objs` left over from the chunking cell
    qa_chunks = list(get_qa_chunks(doc.document.name, chunks, filters))
    dataset[doc.document.name] = qa_chunks
    
    print(f"Created dataset {doc.document.name} with {len(qa_chunks)} QA chunks")

Chunking and filtering document Advantag Savings
Created dataset Advantag Savings with 5 QA chunks
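
The `filters` list above holds predicates over chunks. As a rough illustration of the idea, and assuming a chunk is kept only when every predicate returns `True` (an assumption about the intent, not a statement about how `get_qa_chunks` works internally), the same pattern can be sketched with a stand-in chunk type. `FakeChunk`, `apply_filters`, and `longer_than_500_chars` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class FakeChunk:
    """Stand-in for a chunk object; only the `text` attribute is modeled."""
    text: str


def apply_filters(
    chunks: Iterable[FakeChunk],
    filters: list[Callable[[FakeChunk], bool]],
) -> Iterator[FakeChunk]:
    """Yield the chunks for which every filter returns True (assumed semantics)."""
    for chunk in chunks:
        if all(keep(chunk) for keep in filters):
            yield chunk


def longer_than_500_chars(chunk: FakeChunk) -> bool:
    """Same condition as the notebook's lambda: keep chunks over 500 characters."""
    return len(str(chunk.text)) > 500
```

Under that assumption, a length filter like this one would explain a document with six chunks yielding five QA chunks when one chunk is short.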


### Initialize QA generator, supplying details for which model to use

`GenerateOptions` controls which model is used for QA generation through `generate_options.provider`, which is set below. Three options are available:

* `LlmProvider.WATSONX` for watsonx
* `LlmProvider.OPENAI` for OpenAI
* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs

In [None]:
API_KEY = "VITE_LLM_API_KEY"  # the API access key for your account ( cannot be empty )
API_URL = "https://mixtral-8x7b-instruct-v0-1-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"  # the URL of your model's API
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1" # the name of your model

In [13]:
from docling_sdg.qa.generate import Generator
from docling_sdg.qa.base import GenerateOptions, LlmProvider
from pydantic import SecretStr

generate_options = GenerateOptions(project_id="project_id")
generate_options.provider = LlmProvider.OPENAI_LIKE
generate_options.api_key = SecretStr(API_KEY)
generate_options.url = API_URL
generate_options.model_id = MODEL_ID

### Configure subset selection

In [14]:
NUM_CHUNKS_TO_SELECT_FOR_AUTHORING = 5

### Run QA generation on selected chunks

In [15]:
import random #TODO: replace random sampling with subset selection

for doc, chunks in dataset.items(): # TODO: multiple file support
    generate_options.generated_file = AUTHORING_OUTPUT_DIR / f"qagen-{doc}.json"
    gen = Generator(generate_options=generate_options)
    print(f"processing chunks that look like:\n{chunks[0].text}")
    # random.sample raises ValueError if asked for more items than are available
    num_to_select = min(NUM_CHUNKS_TO_SELECT_FOR_AUTHORING, len(chunks))
    selected_chunks = random.sample(chunks, num_to_select)
    print(f"Selected {len(selected_chunks)} contexts")

    Path.unlink(generate_options.generated_file, missing_ok=True)
    results = gen.generate_from_chunks(selected_chunks) # automatically saves to file

    print(f"{doc}: {results.status}")
    break

processing chunks that look like:
FDIC
Coverage
This account is insured by the Federal Deposit Insurance Corporation (FDIC) and is backed by the U.S. government. The standard insurance amount is $250,000 per depositor, per insured bank, for each account ownership category.
Monthly Maintenance Fee
$8.00
each month
(We'll waive this fee for the first 6 months.)
You can avoid the Monthly Maintenance Fee when you meet ONE of the following requirements during each statement cycle:
· Maintain a minimum daily balance of $500 or more in your account, OR
· Ask us to link your account to your Bank of America Advantage Relationship Banking ® , Bank of America Advantage  with Tiered Interest Checking or Bank of America Advantage ® ® Regular Checking account (first 4 savings accounts), OR
· An owner of this account is under the age of 25 (fiduciary accounts don't qualify), OR
· Be a member of the Preferred Rewards program. Learn more at bankofamerica.com/preferred-rewards, or visit your local fina

100%|██████████| 5/5 [00:16<00:00,  3.37s/it]

Advantag Savings: Status.SUCCESS





### Read generated QAs and restructure

In [16]:
import json
import yaml
from textwrap import wrap

qnas = {}
chunk_id_to_text = {}
with open(generate_options.generated_file, "rt") as f:
    for line in f.readlines():
        entry = json.loads(line)
        chunk_id = entry['chunk_id']
        if chunk_id not in chunk_id_to_text:
            chunk_id_to_text[chunk_id] = entry['context']
        if chunk_id not in qnas:
            qnas[chunk_id] = []
        qnas[chunk_id].append({'question': entry['question'], 'answer': entry['answer']})

print(f"Generated QA pairs for {len(qnas)} contexts")
print(list(qnas.values())[0])

Generated QA pairs for 5 contexts
[{'question': 'What is the method used to calculate interest on the account?', 'answer': 'Interest is calculated by applying a daily periodic rate to the collected balance in the account each day, with the daily rate depending on the balance tier the end-of-day balance falls in. The interest is then compounded and paid monthly.'}, {'question': 'What are the different ways interest can be increased on the account?', 'answer': 'Interest can be increased through a variable interest rate based on balance tiers, or by enrollment in the Preferred Rewards program, which can increase the interest rate by at least 5%, 10%, or 20%.'}, {'question': 'If a customer is enrolled in the Preferred Rewards program, what would be the minimum increase in the interest rate?', 'answer': 'The interest rate would increase by at least 5% over the standard rate.'}]


### Define metadata for qna.yaml

In [17]:
DOCUMENT_OUTLINE = "Details about the Advantage Savings Account (ASA) and its features, benefits, and eligibility criteria."
DOMAIN = "retail banking"

### Output qna.yaml

In [18]:
qna_output_path = AUTHORING_OUTPUT_DIR / "qna.yaml"

# The following creates a data structure for outputting in the expected format for qna.yaml
# TODO: extract into utils library

def str_presenter(dumper, data):
    if len(data.splitlines()) > 1:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    elif len(data) > 80:
        data = "\n".join(wrap(data, 80))
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)

# to use with safe_dump:
yaml.representer.SafeRepresenter.add_representer(str, str_presenter)

class IndentedDumper(yaml.Dumper):
    def increase_indent(self, flow=False, indentless=False):
        return super().increase_indent(flow, False)

data = {'seed_examples': []}
for chunk_id, context in chunk_id_to_text.items():
    data['seed_examples'].append({
        'context': context,
        'questions_and_answers': [
            {
                'question': example['question'],
                'answer': example['answer'],
            } for example in qnas[chunk_id]
        ]
    })

data['document_outline'] = DOCUMENT_OUTLINE
data['domain'] = DOMAIN

Path.unlink(qna_output_path, missing_ok=True) # shouldn't be necessary but was. jupyter caching thing?
with open(qna_output_path, 'w') as yaml_file:
    yaml.dump(data, yaml_file, Dumper=IndentedDumper, default_flow_style=False, sort_keys=False, width=80)

print(f"qna.yaml saved to: {qna_output_path}")

qna.yaml saved to: workspaces/default/authoring/qna.yaml


### View generated qna.yaml

In [19]:
with open(qna_output_path) as yaml_file:
    print(yaml_file.read())

seed_examples:
  - context: |-
      Interest rate and interest calculation
      Your account has a variable interest rate, which means the interest rate may change. Interest rates for your account are based on balance tiers, are set at our discretion and may change at any time without notice.
      To calculate interest, we apply a daily periodic rate to the collected balance in your account each day. The daily rate that applies will depend on which balance tier your end-of-day balance falls in. We compound and pay any interest earned monthly. Interest is only paid in whole cents and we use standard rounding rules to calculate the amount. This means that an amount less than half of one cent is rounded down to zero, and an amount of half of one cent or more is rounded up to the next whole cent.
      Extra interest
      Extra interest is earned automatically when an account owner is enrolled in the Preferred Rewards program. The interest rate will increase by at least 5%, 10% or 20% 

### Revise QAs

Open the generated `qna.yaml` in your preferred text editor to review the quality of the generated questions and answers. If the generation step failed to generate three questions and answers for each of five contexts, supplement them by hand until the required number of QA pairs is reached.
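
The count check in this review can be partly automated. The sketch below works on the same dictionary structure that the notebook writes to `qna.yaml` (a `seed_examples` list whose items carry `questions_and_answers`) and reports where the counts fall short. `find_qna_gaps` is a hypothetical helper, and its default thresholds of five contexts and three pairs are taken from the paragraph above.

```python
def find_qna_gaps(data: dict, min_contexts: int = 5, min_pairs: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the counts are met."""
    problems = []
    examples = data.get("seed_examples", [])
    if len(examples) < min_contexts:
        problems.append(f"found {len(examples)} contexts, expected at least {min_contexts}")
    for index, example in enumerate(examples):
        pairs = example.get("questions_and_answers", [])
        if len(pairs) < min_pairs:
            problems.append(
                f"context {index} has {len(pairs)} QA pairs, expected at least {min_pairs}"
            )
    return problems
```

After loading the file with `yaml.safe_load`, calling `find_qna_gaps(data)` lists the contexts that still need attention. It checks counts only, so the quality of each question and answer still has to be reviewed by hand.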

## Create Seed Dataset for SDG

This step combines the contents of `qna.yaml` with the chunks from the source document to create a seed dataset for the synthetic data generation process.

To run this step, you need a directory that contains N chunks named `{original-file-name}-{N}.txt` and a `qna.yaml` in the same directory.

This step writes a `seed_data.jsonl` file to `SEED_EXAMPLE_OUTPUT_DIR`.
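
A quick pre-flight check can catch a missing `qna.yaml` or an empty chunk directory before the seed dataset is built. This is a minimal sketch under the naming convention described above; `check_seed_inputs` is a hypothetical helper and is not part of `utils.create_seed_dataset`.

```python
from pathlib import Path


def check_seed_inputs(input_dir: Path) -> list[str]:
    """Return a list of problems found in `input_dir`; empty means it looks ready."""
    input_dir = Path(input_dir)
    problems = []
    if not (input_dir / "qna.yaml").is_file():
        problems.append("qna.yaml is missing")
    # Chunk files are expected to look like `{original-file-name}-{N}.txt`.
    chunk_files = []
    for path in input_dir.glob("*.txt"):
        name, _, index = path.stem.rpartition("-")
        if name and index.isdecimal():
            chunk_files.append(path)
    if not chunk_files:
        problems.append("no chunk files named {original-file-name}-{N}.txt were found")
    return problems
```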

In [20]:
!pip install -qq datasets transformers


In [21]:
import os
import shutil

from utils.create_seed_dataset import get_seed_dataset

src_files = os.listdir(CHUNKING_OUTPUT_DIR)

for file_name in src_files:
    full_file_name = os.path.join(CHUNKING_OUTPUT_DIR, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, SEED_EXAMPLE_INPUT_DIR)

shutil.copy(qna_output_path, SEED_EXAMPLE_INPUT_DIR)

seed_data = get_seed_dataset(SEED_EXAMPLE_INPUT_DIR)
output_path = f'{SEED_EXAMPLE_OUTPUT_DIR}/seed_data.jsonl'
seed_data.to_json(output_path, orient='records', lines=True)

print(f"Generated {seed_data.data.num_rows} rows")
print(f"Results saved to: {output_path}")

Map: 100%|██████████| 6/6 [00:00<00:00, 2155.90 examples/s]
Map: 100%|██████████| 6/6 [00:00<00:00, 3200.13 examples/s]
Map: 100%|██████████| 6/6 [00:00<00:00, 3067.51 examples/s]
Map: 100%|██████████| 6/6 [00:00<00:00, 2706.29 examples/s]
Map: 100%|██████████| 6/6 [00:00<00:00, 2063.11 examples/s]
Map: 100%|██████████| 30/30 [00:00<00:00, 3182.65 examples/s]
Filter: 100%|██████████| 30/30 [00:00<00:00, 3546.98 examples/s]
Creating json

Generated 30 rows
Results saved to: workspaces/default/seed_examples/seed_data.jsonl





### Inspect the generated data

In [22]:
print(seed_data.data.table.slice(length=1))

pyarrow.Table
document_outline: string
document_title: string
domain: string
icl_document: string
icl_query_1: string
icl_response_1: string
icl_query_2: string
icl_response_2: string
icl_query_3: string
icl_response_3: string
document: string
----
document_outline: [["Details about the Advantage Savings Account (ASA) and its features, benefits,
and eligibility criteria."]]
document_title: [["Advantag Savings"]]
domain: [["retail banking"]]
icl_document: [["Interest rate and interest calculation
Your account has a variable interest rate, which means the interest rate may change. Interest rates for your account are based on balance tiers, are set at our discretion and may change at any time without notice.
To calculate interest, we apply a daily periodic rate to the collected balance in your account each day. The daily rate that applies will depend on which balance tier your end-of-day balance falls in. We compound and pay any interest earned monthly. Interest is only paid in whole cent
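
The saved file can also be inspected without the `datasets` library, since each line of a JSON Lines file is a standalone JSON object. A minimal sketch, with `read_jsonl_head` as a hypothetical helper name:

```python
import json
from itertools import islice
from pathlib import Path


def read_jsonl_head(path: Path, n: int = 1) -> list[dict]:
    """Parse and return the first `n` non-empty records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_empty, n)]
```

For instance, `read_jsonl_head(output_path)[0].keys()` should show the same column names as the table schema printed above, such as `document_outline` and `icl_document`.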

# Summary

To recap, given a source document in PDF format, this notebook:

1. Converted the document using Docling and saved it to JSON for inspection
2. Split the extracted text into chunks
3. Generated QA pairs for a subset of those chunks
4. Created a `qna.yaml` available for inspection and revision
5. Combined the chunks and `qna.yaml` to create a `seed_data.jsonl` for use with SDG

The next step is to use the resulting `seed_data.jsonl` for SDG, as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb).