# 🐶 Data Pre-Processing: From source PDF to SDG-ready

This notebook walks through each stage of data pre-processing. Directory-based conventions are used to save intermediate results as each PDF is converted and chunked, and as QA generation produces a `qna.yaml` file for each knowledge contribution. At the end, everything is combined into the inputs for synthetic data generation (SDG).

Once an SDG seed dataset is created, a user can run through an SDG notebook and generate samples.

**NOTE**: Starting the notebook using Python 3.11 is recommended. Python 3.12 and later are not yet supported.
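
A guard like the following can warn early when the kernel runs an unsupported interpreter. This is a minimal sketch: the helper name `is_supported_python` is hypothetical, and only the upper bound is checked because the note above only rules out 3.12 and later.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is older than Python 3.12."""
    # Compare only (major, minor); the note above excludes 3.12 and later.
    return tuple(version_info[:2]) < (3, 12)


if not is_supported_python():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is not yet supported by this notebook."
    )
```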

1. [Data Gathering](#Data-Gathering)
1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

***

TODO: Document how to organize directory structure, etc

Store your source documents under `workspaces/<workspace_name>/<contribution_name>/source_documents`.

In [1]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

contribution_dirs = []
# contribution_names = ["nfl"]  # ADD CONTRIBUTION NAMES HERE
contribution_names = ["nfl", "finance"]
contribution_metadata = [
    {"domain": "sports", "summary": "Official playing rules of the National Football League 2022"},
    {"domain": "finance", "summary": "Account information for a specific bank"},
]


for name in contribution_names:
    contribution_dir = WORKSPACE_DIR / name
    contribution_dirs.append(contribution_dir)

    # "chunks" matches the directory that the chunking and seed dataset steps use
    for subdir in ["source_documents", "conversion", "chunks", "authoring"]:
        (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)

TODO

Now that the directory structure is created, ensure your source documents are placed under the appropriate `<contribution_name>/source_documents` directory.
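
Before moving on, it can help to confirm that every contribution actually has at least one PDF in place. The helper below is a minimal sketch of such a check; the function name `contributions_missing_pdfs` is hypothetical and not part of the notebook's utilities.

```python
from pathlib import Path


def contributions_missing_pdfs(workspace_dir: Path, names: list[str]) -> list[str]:
    """Return the contribution names whose source_documents directory holds no PDF."""
    missing = []
    for name in names:
        source_dir = workspace_dir / name / "source_documents"
        # A contribution counts as missing if the directory is absent or has no *.pdf file.
        has_pdf = source_dir.is_dir() and any(source_dir.glob("*.pdf"))
        if not has_pdf:
            missing.append(name)
    return missing
```

With the variables defined above, `contributions_missing_pdfs(WORKSPACE_DIR, contribution_names)` returns an empty list once every contribution has a source PDF.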

## Data Gathering

TODO: Add documentation about domain and summary here, clear out second contribution example
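
Each entry in `contribution_names` is paired by position with a `domain` and `summary` entry in `contribution_metadata`, so the two lists need to stay aligned. The following sketch flags mismatches; the helper name `metadata_problems` and its message wording are hypothetical.

```python
REQUIRED_KEYS = ("domain", "summary")


def metadata_problems(names: list[str], metadata: list[dict]) -> list[str]:
    """Describe mismatches between contribution names and their metadata."""
    problems = []
    if len(names) != len(metadata):
        problems.append(f"{len(names)} contribution names but {len(metadata)} metadata entries")
    # zip() stops at the shorter list, so only paired entries are inspected.
    for name, entry in zip(names, metadata):
        for key in REQUIRED_KEYS:
            if not str(entry.get(key, "")).strip():
                problems.append(f"{name}: missing or empty '{key}'")
    return problems
```

An empty return value from `metadata_problems(contribution_names, contribution_metadata)` means every contribution has both fields filled in.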

In [13]:
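print("Files to convert:")
# Avoid naming the loop variable `dir`, which would shadow the Python built-in
for contribution_dir in contribution_dirs:
    files = list((contribution_dir / "source_documents").glob("*.pdf"))
    for file in files:
        print(file.resolve())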

Files to convert:
/Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/nfl/source_documents/2022-nfl-rulebook-final.pdf
/Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/finance/source_documents/BofA_InterestChecking_en_ADA.pdf
/Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/finance/source_documents/Advantag Savings.pdf


## Document Conversion

This notebook uses [Docling](https://github.com/docling-project/docling) to convert each source PDF into a Docling Document. A Docling Document is the representation of a document after conversion, and it can be exported as JSON. The JSON output of this step is the input to the chunking step later in this notebook.

In [14]:
!pip install -qq docling

### Configure Docling conversion pipeline

Next we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults.

For a complete reference on Docling conversion pipeline configuration, see [PdfPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PdfFormatOption](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS).

In [15]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

Finally, we convert every PDF in each contribution's `source_documents` directory into Docling JSON.

In [16]:
import json

json_files = []
for contribution_dir in contribution_dirs:
    files = list((contribution_dir / "source_documents").glob("*.pdf"))

    # Each contribution keeps its converted documents in its own "conversion" directory
    conversion_output_dir = contribution_dir / "conversion"
    conversion_output_dir.mkdir(parents=True, exist_ok=True)

    for file in files:
        doc = doc_converter.convert(source=file).document
        doc_dict = doc.export_to_dict()

        json_output_path = (conversion_output_dir / f"{file.stem}.json").resolve()
        with open(json_output_path, "w") as f:
            json.dump(doc_dict, f)

        print(f"Path of JSON output is: {json_output_path}")
        json_files.append(json_output_path)

Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/nfl/conversion/2022-nfl-rulebook-final.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/finance/conversion/BofA_InterestChecking_en_ADA.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/finance/conversion/Advantag Savings.json
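
A quick way to sanity-check the converted files is to count what each one contains. The sketch below is an illustration only: the helper names are hypothetical, and the top-level keys `texts`, `tables`, and `pictures` are assumptions about the exported Docling JSON layout, which is why they are read with `.get()`.

```python
import json
from pathlib import Path


def summarize_docling_json(doc_dict: dict) -> dict:
    """Count the items stored under a few top-level lists of a Docling export."""
    # Assumed key names; .get() keeps the sketch working if one is absent.
    return {key: len(doc_dict.get(key, [])) for key in ("texts", "tables", "pictures")}


def summarize_conversion_dir(conversion_dir: Path) -> dict:
    """Map each JSON file name in a conversion directory to its item counts."""
    summary = {}
    for json_path in sorted(conversion_dir.glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            summary[json_path.name] = summarize_docling_json(json.load(f))
    return summary
```

A document that reports zero `texts` is a likely sign that conversion did not extract anything useful.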


## Post-Conversion: Illuminator Analysis

The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues.

In [None]:
from utils.illuminator.analysis import analyze_docling_tables
from utils.illuminator.utils import generate_summary
from docling.datamodel.document import DoclingDocument

import json
from pathlib import Path

results = {}

# json_files holds the paths written by the conversion cell above
for path in json_files:
    with open(path, "r") as f:
        doc_dict = json.load(f)

    doc = DoclingDocument(**doc_dict)
    results[path] = analyze_docling_tables(doc)

summary_path = Path("illuminator_readable_summary.txt")

with open(summary_path, "w") as f:
    generate_summary(results, file=f)

print(f"✅ Post-conversion summary saved to: {summary_path.resolve()}")


Use the output of this post-conversion step to decide whether to exclude the content from chunking entirely or to edit it manually before chunking.


## Chunking

The goal of chunking the converted documents is to provide the teacher model with small, logical pieces of the source document from which to generate data.

In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this step is the Docling JSON files created by the conversion step above.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented-out code to configure these.
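
To build intuition for `max_tokens` and peer merging, the sketch below greedily merges adjacent undersized chunks while the combined size stays within a token budget. It is a simplified illustration and not Docling's implementation: the function name `merge_undersized` is hypothetical, and whitespace-separated words stand in for tokenizer tokens.

```python
def merge_undersized(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily merge neighbouring chunks while the result fits in max_tokens."""
    merged: list[str] = []
    for chunk in chunks:
        if merged:
            candidate = f"{merged[-1]} {chunk}"
            # Whitespace-separated words stand in for real tokenizer tokens.
            if len(candidate.split()) <= max_tokens:
                merged[-1] = candidate
                continue
        merged.append(chunk)
    return merged
```

With a budget of 4, `merge_undersized(["a b", "c", "d e f", "g"], 4)` yields two chunks, because each short chunk is absorbed by its predecessor until the budget would be exceeded.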

In [17]:
# from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
# from transformers import AutoTokenizer

from docling.chunking import HybridChunker

# EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
# MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    # tokenizer=tokenizer,
    # merge_peers=True,  # whether to merge undersized chunks - defaults to True
)

### Load and chunk the converted docling document

Next, let's load each converted document we want to chunk back into a Docling Document.

All chunks are saved to a JSON Lines file called `chunks.jsonl` in the `chunks` directory of your contribution. This file is one of the inputs further below when we create the seed dataset for SDG.

In [33]:
import json
from docling.document_converter import DocumentConverter

# One converter is reused for every file; here it only reloads Docling JSON
json_converter = DocumentConverter()

for contribution_dir in contribution_dirs:
    conversion_dir = contribution_dir / "conversion"
    converted_files = list(conversion_dir.glob("*.json"))
    chunking_output_dir = contribution_dir / "chunks"
    chunking_output_dir.mkdir(parents=True, exist_ok=True)
    all_chunks = []

    for file in converted_files:
        # reconvert the docling JSON for chunking
        doc = json_converter.convert(source=file)

        chunk_objs = list(chunker.chunk(dl_doc=doc.document))
        print(f"Extracted {len(chunk_objs)} chunks from {doc.document.name}")

        for chunk in chunk_objs:
            all_chunks.append(
                dict(
                    chunk=chunker.contextualize(chunk=chunk),
                    file=doc.document.name,
                    metadata=chunk.meta.export_json_dict(),
                )
            )

    # Write once per contribution, after every document has been chunked
    chunks_file_path = chunking_output_dir / "chunks.jsonl"
    with open(chunks_file_path, "w", encoding="utf-8") as f:
        for chunk in all_chunks:
            json.dump(chunk, f)
            f.write("\n")
    print(f"Path of chunks JSON is: {chunks_file_path.resolve()}")

Extracted 1798 chunks from 2022-nfl-rulebook-final
Path of chunks JSON is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/nfl/chunks/chunks.jsonl
Extracted 5 chunks from Advantag Savings
Extracted 10 chunks from BofA_InterestChecking_en_ADA
Path of chunks JSON is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/finance/chunks/chunks.jsonl
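
Because each line of `chunks.jsonl` records the source document in its `file` field, the file can be read back to confirm how many chunks each document contributed. The helper below is a minimal sketch; the name `chunks_per_file` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def chunks_per_file(chunks_path: Path) -> Counter:
    """Count how many chunks each source document contributed to a chunks.jsonl."""
    counts: Counter = Counter()
    with open(chunks_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate a trailing blank line
                counts[json.loads(line)["file"]] += 1
    return counts
```

For the finance contribution in the run above, the counts should match the 5 and 10 chunks reported for its two documents.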


### View the Chunks

In [21]:
chunk_gen = iter(all_chunks)

To view the chunks one by one, rerun the following cell. As you can see, the document is broken into small pieces, each with metadata based on the document's format. Note that `all_chunks` holds only the chunks of the last contribution processed.

In [23]:
print(next(chunk_gen)['chunk'])

Interest rate policy
Interest rate and interest calculation
Your account has a variable interest rate, which means the interest rate may change. Interest rates for your account are based on balance tiers, are set at our discretion and may change at any time without notice.
To calculate interest, we apply a daily periodic rate to the collected balance in your account each day. The daily rate that applies will depend on which balance tier your end-of-day balance falls in. We compound and pay any interest earned monthly. Interest is only paid in whole cents and we use standard rounding rules to calculate the amount. This means that an amount less than half of one cent is rounded down to zero, and an amount of half of one cent or more is rounded up to the next whole cent.
Extra interest
Extra interest is earned automatically when an account owner is enrolled in the Preferred Rewards program. The interest rate will increase by at least 5%, 10% or 20% over the standard rate, and is based on 

## Authoring

### Install docling-sdg

In [3]:
!pip install -qq docling-sdg

### Initialize QA generator model & Number of Seed examples

Set the OpenAI-compatible endpoint, API key, and model name for the QA generator model, as well as the number of seed examples you wish to generate for each contribution.

In [4]:
API_KEY = "YOUR_API_KEY"  # the API access key for your account (cannot be empty); do not commit a real key
API_URL = "https://mixtral-8x7b-instruct-v0-1-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com/v1"  # the URL of your model's API
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1" # the name of your model
NUM_SEED_EXAMPLES = 5
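
Rather than pasting credentials into the notebook, the settings can be read from environment variables. The sketch below shows one way to do that; the helper name `read_api_settings` and the variable names `SDG_API_KEY`, `SDG_API_URL`, and `SDG_MODEL_ID` are hypothetical, not names this notebook or its utilities define.

```python
import os


def read_api_settings(env=os.environ) -> dict:
    """Collect endpoint settings from environment variables (hypothetical names)."""
    api_key = env.get("SDG_API_KEY", "")
    if not api_key:
        # The generator needs a non-empty key, so fail before any request is made.
        raise ValueError("SDG_API_KEY is not set")
    return {
        "api_key": api_key,
        "api_url": env.get("SDG_API_URL", ""),
        "model_id": env.get("SDG_MODEL_ID", ""),
    }
```

The returned values could then be assigned to `API_KEY`, `API_URL`, and `MODEL_ID` so that no secret is stored in the notebook itself.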

In [5]:
from utils.qna_gen import generate_seed_examples

# Use a separate loop variable so the contribution_metadata list is not overwritten
for contribution_name, contribution_dir, metadata in zip(contribution_names, contribution_dirs, contribution_metadata):
    qna_output_path = generate_seed_examples(contribution_name,
                           contribution_dir,
                           metadata,
                           NUM_SEED_EXAMPLES,
                           API_KEY,
                           API_URL,
                           MODEL_ID)
    print(f"qna.yaml saved to: {qna_output_path}")


Filtering smaller chunks out of document 2022-nfl-rulebook-final


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:16<00:00,  3.38s/it]


Status for Q&A generation for nfl is: Status.SUCCESS
qna.yaml saved to: workspaces/default/nfl/authoring/qna.yaml
Filtering smaller chunks out of document Advantag Savings
Filtering smaller chunks out of document BofA_InterestChecking_en_ADA


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:17<00:00,  3.49s/it]

Status for Q&A generation for finance is: Status.SUCCESS
qna.yaml saved to: workspaces/default/finance/authoring/qna.yaml





### Revise QAs

Open the generated `qna.yaml` in your preferred text editor to check the quality of the generated questions and answers. If the generation step failed to generate three question-and-answer pairs for each of five contexts, add your own until the required number of QA pairs is reached.
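
The count requirement can also be checked programmatically once the file has been parsed into a dictionary, for example with `yaml.safe_load` from PyYAML. The sketch below assumes the parsed file has a top-level `seed_examples` list whose entries each carry a `questions_and_answers` list of `question`/`answer` pairs; that layout and the helper name `incomplete_contexts` are assumptions, so adjust them to the file you actually generated.

```python
def incomplete_contexts(qna: dict, min_contexts: int = 5, min_pairs: int = 3) -> list[str]:
    """List the ways a parsed qna.yaml falls short of the required QA counts."""
    problems = []
    seed_examples = qna.get("seed_examples") or []
    if len(seed_examples) < min_contexts:
        problems.append(f"only {len(seed_examples)} of {min_contexts} contexts present")
    for index, example in enumerate(seed_examples):
        pairs = example.get("questions_and_answers") or []
        # Count only pairs where both the question and the answer are filled in.
        complete = [p for p in pairs if p.get("question") and p.get("answer")]
        if len(complete) < min_pairs:
            problems.append(f"context {index}: {len(complete)} of {min_pairs} QA pairs")
    return problems
```

An empty list means the file meets the minimum counts; it says nothing about the quality of the pairs, which still needs a manual read.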

## Create Seed Dataset for SDG

This section combines the contents from the qna.yaml and the chunks from the source document to create a seed dataset for the synthetic data generation process.

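To run this step, each contribution needs a `chunks.jsonl` in its `chunks` directory and a `qna.yaml` in its `authoring` directory.

This step writes a `seed_data-<contribution_name>.jsonl` file to each contribution directory and a combined `seed_data.jsonl` file to the workspace directory.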

In [6]:
!pip install -qq datasets transformers

In [9]:
from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets

contribution_datasets = []
for contribution_dir, contribution_name in zip(contribution_dirs, contribution_names):
    chunks_dir = contribution_dir / "chunks"
    qna_dir = contribution_dir / "authoring"
    seed_data = get_seed_dataset(chunks_dir, qna_dir)
    output_path = f'{contribution_dir}/seed_data-{contribution_name}.jsonl'
    seed_data.to_json(output_path, orient='records', lines=True)
    contribution_datasets.append(seed_data)
    print(f"Intermediate results saved to: {output_path}")

final_seed_data = safe_concatenate_datasets(contribution_datasets)
output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'
final_seed_data.to_json(output_path, orient='records', lines=True)

print(f"Final seed data contains {final_seed_data.data.num_rows} rows")
print(f"Final seed data for SDG saved to: {output_path}")

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Map:   0%|          | 0/8990 [00:00<?, ? examples/s]

Filter:   0%|          | 0/8990 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/8 [00:00<?, ?ba/s]

Intermediate results saved to: workspaces/default/nfl/seed_data-nfl.jsonl


Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Filter:   0%|          | 0/25 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Filter:   0%|          | 0/50 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Intermediate results saved to: workspaces/default/finance/seed_data-finance.jsonl


Creating json from Arrow format:   0%|          | 0/8 [00:00<?, ?ba/s]

Seed data contains 9065 rows
Final results saved to: workspaces/default/seed_data.jsonl
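
In the run recorded above, the progress bars report 8990, 25, and 50 rows, which add up to the 9065 rows in the final dataset. A similar check can be made on the files themselves. The sketch below uses hypothetical helper names and assumes that concatenation neither drops nor duplicates rows.

```python
from pathlib import Path


def count_jsonl_rows(path: Path) -> int:
    """Count the non-empty lines (one record each) in a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def combined_matches_parts(combined: Path, parts: list[Path]) -> bool:
    """Check that the combined file has as many rows as all parts together."""
    return count_jsonl_rows(combined) == sum(count_jsonl_rows(p) for p in parts)
```

Here `combined` would be the workspace-level `seed_data.jsonl` and `parts` the per-contribution `seed_data-<contribution_name>.jsonl` files.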


### Inspect the seed data

In [None]:
print(seed_data.data.table.slice(length=1))

## Summary

To recap, given a source document in PDF format, this notebook:

1. Converted the document using Docling and saved it to JSON for inspection
2. Split the extracted text into chunks
3. Generated QA pairs for a subset of those chunks
4. Created a `qna.yaml` available for inspection and revision
5. Combined the chunks and `qna.yaml` to create a `seed_data.jsonl` for use with SDG

The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb).