# Extraction

In [4]:
%pip install --upgrade lancedb

Note: you may need to restart the kernel to use updated packages.





In [5]:
from docling.document_converter import DocumentConverter
from utils.sitemap import get_sitemap_urls

converter = DocumentConverter()

In [6]:
# --------------------------------------------------------------
# Basic PDF extraction
# --------------------------------------------------------------

result = converter.convert("https://arxiv.org/pdf/2408.09869")




In [3]:
document = result.document
markdown_output = document.export_to_markdown()
json_output = document.export_to_dict()

print(markdown_output)

<!-- image -->

## Docling Technical Report

Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research Rüschlikon, Switzerland

## Abstract

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

## 1 Introduction

Converting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variabi

In [12]:
# --------------------------------------------------------------
# Basic HTML extraction
# --------------------------------------------------------------

result = converter.convert("https://baldursgate.fandom.com/wiki/Bone_Blade")

document = result.document
markdown_output = document.export_to_markdown()
print(markdown_output)


# Bone Blade

- Edit source
- History
- Purge
- Talk (0)

Bone Blade
Baldur's Gate II (2001)
Content from the original Throne of Bhaal in the Baldur's Gate II campaign Shadows of Amn
Throne of Bhaal (2001)
Content from the original Baldur's Gate II campaign Throne of Bhaal
Baldur's Gate II: Enhanced Edition
Throne of Bhaal (2013)
Content from the Baldur's Gate II: Enhanced Edition campaign Throne of Bhaal

Record

Type: Weapon
Gender: Extra
Race: Sword
Class: Long sword
Alignment: LG NG CG LN TN CN LE NE CE

Involvement

Allegiance: Enemy
Area: Pocket Plane (area)
Place: Final Challenge Room
Area code: AR4500

Ability scores

| Level | Hit points | XP value |
|-------|------------|----------|
| 10    | 50         | 2500     |

| Str | Dex | Con |
|-----|-----|-----|
| 12  | 9   | 9   |

| Int | Wis | Cha |
|-----|-----|-----|
| 25  | 9   | 9   |

Total scores: 73

| Morale | Break | Recovery |
|--------|-------|----------|
| 15     | 0     | 0        |

Combat basics

| AC  | Base THAC0 | Base APR |
|-----|------------|----------|
| -10 | -2         | 2        |

Saving throws

| Dth | Wand | Poly | Brth | Spell |
|-----|------|------|------|-------|
| 13  | 15   | 14   | 16   | 16    |

Damage resistances [%]

| Cold | Elec. | Fire |
|------|-------|------|
| 100  | 100   | 80   |

Crush
Miss.
Pierce
# Chunking

In [2]:
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from dotenv import load_dotenv
# from openai import OpenAI
# from utils.tokenizer import OpenAITokenizerWrapper

load_dotenv()


False

In [3]:
# # Initialize OpenAI client (make sure you have OPENAI_API_KEY in your environment variables)
# client = OpenAI()


# tokenizer = OpenAITokenizerWrapper()  # Load our custom tokenizer for OpenAI
# MAX_TOKENS = 8191  # text-embedding-3-large's maximum context length


# --------------------------------------------------------------
# Extract the data
# --------------------------------------------------------------

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")


In [4]:

# --------------------------------------------------------------
# Apply hybrid chunking
# --------------------------------------------------------------

chunker = HybridChunker(
    # tokenizer=tokenizer,
    max_tokens=512
    # merge_peers=True,
)

chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)

len(chunks)

Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512). Running this sequence through the model will result in indexing errors


49
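
The `max_tokens=512` setting above is a per-chunk token budget. As a rough intuition only, the sketch below splits oversized pieces and then merges undersized neighbours against such a budget. It is a toy illustration, not docling's implementation: it counts whitespace-separated words instead of real tokenizer tokens, it ignores headings and document structure, and the helper names `split_to_budget` and `merge_small` are hypothetical.

```python
def split_to_budget(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def merge_small(pieces: list[str], max_tokens: int) -> list[str]:
    """Greedily merge adjacent pieces while the merged piece stays within budget."""
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1].split()) + len(piece.split()) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


# Stand-in "sections"; words play the role of tokens here (an assumption).
sections = ["a b c", "d e", "f g h i j k l", "m"]
budget = 5

pieces = [p for s in sections for p in split_to_budget(s, budget)]
chunks_demo = merge_small(pieces, budget)
print(chunks_demo)  # ['a b c d e', 'f g h i j', 'k l m']
```

Every resulting piece stays within the budget, which is the property the real chunker aims for with its own tokenizer.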

In [5]:
chunks

[DocChunk(text='Version 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨ uschlikon, Switzerland', meta=DocMeta(schema_name='docling_core.transforms.chunker.DocMeta', version='1.0.0', doc_items=[DocItem(self_ref='#/texts/2', parent=RefItem(cref='#/groups/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=283.31, t=511.978, r=328.69, b=503.426, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 11))]), DocItem(self_ref='#/texts/3', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=

# Embedding

In [14]:
from typing import List
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
import re

path = "https://baldursgate.fandom.com/wiki/Ordulinian"
converter = DocumentConverter()
result = converter.convert(path)

# --------------------------------------------------------------
# Apply hybrid chunking
# --------------------------------------------------------------

chunker = HybridChunker(
    # tokenizer=tokenizer,
    max_tokens=512
    # merge_peers=True,
)

chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)

len(chunks)
# --------------------------------------------------------------
# Create a LanceDB database and table
# --------------------------------------------------------------

# Create a LanceDB database
db = lancedb.connect("data/lancedb")

# Get the sentence-transformers embedding function (BAAI/bge-small-en-v1.5, running on CPU)
func = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")


# Define a simplified metadata schema
class ChunkMetadata(LanceModel):
    """
    You must order the fields in alphabetical order.
    This is a requirement of the Pydantic implementation.
    """

    filename: str | None
    page_numbers: List[int] | None
    title: str | None
    url: str | None


# Define the main Schema
class Chunks(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()  # type: ignore
    metadata: ChunkMetadata


table = db.create_table("baldurs_gate", schema=Chunks, mode="overwrite")

# --------------------------------------------------------------
# Prepare the chunks for the table
# --------------------------------------------------------------

# Create table with processed chunks
processed_chunks = [
    {
        "text": re.sub(r"\s+", " ", chunk.text).strip(),
        "metadata": {
            "filename": chunk.meta.origin.filename,
            "page_numbers": [
                page_no
                for page_no in sorted(
                    set(
                        prov.page_no
                        for item in chunk.meta.doc_items
                        for prov in item.prov
                    )
                )
            ]
            or [-1],
            "title": chunk.meta.headings[0] if chunk.meta.headings else '',
            "url": path
        },
    }
    for chunk in chunks
]



# --------------------------------------------------------------
# Add the chunks to the table (automatically embeds the text)
# --------------------------------------------------------------

table.add(data=processed_chunks)

# # --------------------------------------------------------------
# # Load the table
# # --------------------------------------------------------------

# table.to_pandas()
# table.count_rows()
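
The chunk preparation in the cell above does two small things per chunk: it collapses runs of whitespace in the text, and it collects the sorted, de-duplicated page numbers, falling back to `[-1]` when no provenance is present. The sketch below isolates that logic with stand-in objects so it runs without docling. The `SimpleNamespace` items only mimic the `prov` and `page_no` attributes used above, and the helper names `page_numbers` and `clean_text` are hypothetical.

```python
import re
from types import SimpleNamespace


def page_numbers(doc_items) -> list[int]:
    """Sorted unique page numbers across all items, or [-1] if none exist."""
    pages = sorted({prov.page_no for item in doc_items for prov in item.prov})
    return pages or [-1]


def clean_text(text: str) -> str:
    """Collapse any run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Stand-ins: items with page provenance (as a PDF would have) and without any.
items_with_pages = [
    SimpleNamespace(prov=[SimpleNamespace(page_no=2)]),
    SimpleNamespace(prov=[SimpleNamespace(page_no=1), SimpleNamespace(page_no=2)]),
]
items_without_pages = [SimpleNamespace(prov=[])]

print(page_numbers(items_with_pages))        # [1, 2]
print(page_numbers(items_without_pages))     # [-1]
print(clean_text("  Bone\n\nBlade \t +2 "))  # Bone Blade +2
```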



In [7]:
result.document

DoclingDocument(schema_name='DoclingDocument', version='1.3.0', name='Ordulinian', origin=DocumentOrigin(mimetype='text/html', binary_hash=10729628876958181298, filename='Ordulinian', uri=None), furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0'), RefItem(cref='#/pictures/0'), RefItem(cref='#/groups/0'), RefItem(cref='#/texts/81'), RefItem(cref='#/pictures/1'), RefItem(cref='#/texts/82'), RefItem(cref='#/pictures/2'), RefItem(cref='#/texts/83'), RefItem(cref='#/texts/84'), RefItem(cref='#/groups/15'), RefItem(cref='#/groups/30'), RefItem(cref='#/texts/171'), RefItem(cref='#/texts/172')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), groups=[UnorderedList(self_ref='#/groups/0', parent=RefItem(cref='#/body'), childr

In [15]:
query = "What is the performance of docling?"
actual = table.search(query).limit(3).to_list()

# Use `row` rather than `dict` so the built-in type is not shadowed
for row in actual:
    print(row["text"], row["metadata"], "\n\n\n")

Media Kit Contact {'filename': 'Ordulinian', 'page_numbers': [-1], 'title': 'Ordulinian', 'url': 'https://baldursgate.fandom.com/wiki/Ordulinian'} 



What is Fandom? About Careers Press Contact Terms of Use Privacy Policy Digital Services Act Global Sitemap Local Sitemap {'filename': 'Ordulinian', 'page_numbers': [-1], 'title': 'Ordulinian', 'url': 'https://baldursgate.fandom.com/wiki/Ordulinian'} 



Original BG image {'filename': 'Ordulinian', 'page_numbers': [-1], 'title': 'Ordulinian', 'url': 'https://baldursgate.fandom.com/wiki/Ordulinian'} 
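
All three hits above are navigation or footer fragments from the page rather than article content. One possible mitigation, sketched here under assumptions, is to drop such chunks before they are embedded. The word threshold and the `FOOTER_MARKERS` phrases are hypothetical choices taken from the strings printed above, and `looks_like_boilerplate` is an illustrative helper, not part of docling or LanceDB.

```python
# Hypothetical marker phrases, picked from the footer text seen in the results.
FOOTER_MARKERS = ("Terms of Use", "Privacy Policy", "Media Kit")


def looks_like_boilerplate(text: str, min_words: int = 5) -> bool:
    """Flag very short chunks and chunks containing a known footer phrase."""
    if len(text.split()) < min_words:
        return True
    return any(marker in text for marker in FOOTER_MARKERS)


candidate_texts = [
    "Media Kit Contact",
    "What is Fandom? About Careers Press Contact Terms of Use Privacy Policy",
    "Original BG image",
    "This is a hypothetical longer chunk that describes an item in enough words to keep.",
]

kept = [text for text in candidate_texts if not looks_like_boilerplate(text)]
print(kept)
```

A filter like this would be applied to the chunk texts before `table.add(...)`. It is a heuristic, so legitimate short chunks can be lost as well.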





# Pulling Data

In [1]:
import re
from typing import List
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from dotenv import load_dotenv
# from openai import OpenAI
# from utils.tokenizer import OpenAITokenizerWrapper

load_dotenv()


converter = DocumentConverter()

# --------------------------------------------------------------
# Apply hybrid chunking
# --------------------------------------------------------------

chunker = HybridChunker(
    # tokenizer=tokenizer,
    max_tokens=512
    # merge_peers=True,
)
# Create a LanceDB database
db = lancedb.connect("data/lancedb")

table = db.open_table('baldurs_gate')


def is_filename_in_table(filename: str) -> bool:
    """Return True if at least one stored chunk came from `filename`."""
    results = table.search().where(f"metadata.filename == '{filename}'").limit(1).to_list()
    return len(results) > 0


def add_document(path):
    try:
        result = converter.convert(path)
        filename = result.document.origin.filename

        if is_filename_in_table(filename):
            print(f"Document '{filename}' already exists in the table. Skipping.")
            return

        chunk_iter = chunker.chunk(dl_doc=result.document)
        chunks = list(chunk_iter)

        processed_chunks = [
            {
                "text": re.sub(r"\s+", " ", chunk.text).strip(),
                "metadata": {
                    "filename": chunk.meta.origin.filename,
                    "page_numbers": [
                        page_no
                        for page_no in sorted(
                            set(
                                prov.page_no
                                for item in chunk.meta.doc_items
                                for prov in item.prov
                            )
                        )
                    ]
                    or [-1],
                    "title": chunk.meta.headings[0] if chunk.meta.headings else '',
                    "url": path
                },
            }
            for chunk in chunks
        ]

        table.add(data=processed_chunks)
        print(f"Document '{filename}' added to the table.")
    except IndexError:
        print(path, ' had an IndexError.')





In [17]:
add_document(path)

Document 'Ordulinian' already exists in the table. Skipping.
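
The existence check builds its filter by placing the filename between single quotes, so a filename that itself contains a single quote would produce a malformed filter. The sketch below shows the standard SQL convention of doubling embedded single quotes. It assumes, without having verified it here, that the filter parser follows that convention, and the helper names `sql_string_literal` and `filename_filter` are hypothetical. It also shows that a percent-encoded name such as `Niklos%27s_Master` decodes to a name with a quote in it.

```python
from urllib.parse import unquote


def sql_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def filename_filter(filename: str) -> str:
    """Build the filter expression for a given filename."""
    return f"metadata.filename = {sql_string_literal(filename)}"


decoded = unquote("Niklos%27s_Master")

print(filename_filter("Ordulinian"))  # metadata.filename = 'Ordulinian'
print(decoded)                        # Niklos's_Master
print(filename_filter(decoded))       # metadata.filename = 'Niklos''s_Master'
```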


In [4]:
query = "What attributes does firebead have?"
actual = table.search(query).limit(1).to_list()

# Use `row` rather than `dict` so the built-in type is not shadowed
for row in actual:
    print(row, '\n\n\n')

{'text': '- A Book for Firebead', 'vector': [-0.059914715588092804, 0.06331981718540192, 0.043114952743053436, -0.051640018820762634, 0.018543966114521027, 0.026721898466348648, -0.04259528964757919, 0.02585526555776596, -0.07129446417093277, -0.005461807828396559, 0.0010402057087048888, -0.052312664687633514, -0.0058830673806369305, 0.028274772688746452, 0.023514069616794586, -0.002112439600750804, 0.055105432868003845, 0.025114726275205612, -0.07506255805492401, 0.009385259822010994, 0.04493577405810356, 0.015313226729631424, 0.015071861445903778, -0.06452623009681702, 0.04802769422531128, 0.030334236100316048, -0.012819980271160603, -0.0005620818701572716, -0.04454883188009262, -0.10673323273658752, 0.023144368082284927, 0.020086567848920822, 0.05138682574033737, -0.06189262866973877, -0.008381643332540989, 0.010795135982334614, 0.015123702585697174, 0.01029358059167862, 0.02643856778740883, 0.038394637405872345, 0.043086279183626175, 0.0856972262263298, -0.008958681486546993, -0.00

In [5]:
import requests

response = requests.get('https://baldursgate.fandom.com/sitemap-newsitemapxml-NS_0-id-46240-58007.xml').text

response



In [6]:
import xml.etree.ElementTree as ET

def extract_urls(xml_string):
    ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(xml_string)
    return [elem.text for elem in root.findall('.//ns:loc', ns)]
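
To see what `extract_urls` returns without fetching anything, the same function can be run on a small inline sitemap. The two `example.com` URLs below are made up for illustration. The namespace is the standard sitemap namespace already used above, and it has to be passed to `findall` because every element in a sitemap lives in it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(xml_string: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_string)
    return [elem.text for elem in root.findall(".//ns:loc", SITEMAP_NS)]


# Minimal, made-up sitemap used only for this demonstration.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/wiki/Page_One</loc></url>
  <url><loc>https://example.com/wiki/Page_Two</loc></url>
</urlset>
""".strip()

urls = extract_urls(sample_sitemap)
print(urls)  # ['https://example.com/wiki/Page_One', 'https://example.com/wiki/Page_Two']
```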

In [8]:
url_lst = extract_urls(response)
start = False
for url in url_lst:
    if 'DEMOGORG.ITM' in url:
        start = True
    if start:
        add_document(url)

Document 'DEMOGORG.ITM' already exists in the table. Skipping.
Document 'Herschel' added to the table.
Document 'Bartender' added to the table.
Document 'Slow_effect' added to the table.
Document 'Aerial_servant_(creature)' added to the table.
Document 'Mountain_bear' added to the table.
Document 'Blink_Dog' added to the table.
Document 'Mongo' added to the table.
Document 'Haste_effect' added to the table.
Document 'Brooch_of_the_Vagrant_Blades' added to the table.
Document 'Niklos%27s_Master' added to the table.
Document 'Ninjat%C5%8D_%2B3' added to the table.
Document 'Wakizashi_%2B2' added to the table.
Document 'Dragon_(Watchers_Keep_Final_Seal)' added to the table.
Document 'Azamantes' added to the table.
Document 'Kuo-toan_Wizard' added to the table.
Document 'Hobgoblin_Warrior' added to the table.
Document 'Hobgoblin_Shaman' added to the table.
Document 'Hobgoblin_Wizard' added to the table.
Document 'Hobgoblin_Archer' added to the table.
Document 'Mage_(Watchers_Keep_Final_Sea