# Parsing the most horrible PDFs!

This notebook shows how to parse some of the most horrible PDFs in existence.
We will use the granite-docling model. It is published as part of the Docling project and can be described as the "biggest gun" you can bring to the trickiest documents.


![Sample PDF](images/SCR-20251013-onxj.png)


The Docling project offers a wide range of tools and approaches for parsing documents in a more efficient and scalable way.

**This is all running locally**



In [1]:
# all the imports
from pathlib import Path
from pdf2image import convert_from_path
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from transformers.image_utils import load_image
from PIL.Image import Image
from typing import Any

In [2]:
# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M-mlx"
SHOW_IN_BROWSER = True
#SAMPLE_PDF = "gnarly_pdfs/Apollo_guidance_and_navigation.pdf"
SAMPLE_PDF = "docling/gnarly_pdfs/apollo-11-flight-plan.pdf"
OUTPUT_PATH = "./output"

In [3]:
# Load model and processor
model, processor = load(path_or_hf_repo=MODEL_PATH)
config: dict[Any, Any] = load_config(model_path=MODEL_PATH)

Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]


In [4]:
# Convert PDF page to PNG and load as image
# This is the ultima ratio! If everything else fails -> convert to image and feed straight to granite-docling
print("Converting PDF to PNG...")
# Render only the first 40 pages; one call is enough
images = convert_from_path(pdf_path=SAMPLE_PDF, first_page=1, last_page=40, dpi=200)
# display(images[0])

Converting PDF to PNG...
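
Rendering many pages at 200 dpi keeps every page image in memory at the same time. For longer PDFs it may help to render in smaller page ranges instead. The helper below is a minimal sketch of that idea. `page_batches` is a hypothetical name, not part of pdf2image. It only computes the `first_page`/`last_page` pairs that could be passed to `convert_from_path`.

```python
from collections.abc import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# Example: 40 pages rendered 16 at a time.
for first_page, last_page in page_batches(total_pages=40, batch_size=16):
    print(first_page, last_page)
    # Hypothetical use with the call from the cell above:
    # images = convert_from_path(pdf_path=SAMPLE_PDF, first_page=first_page,
    #                            last_page=last_page, dpi=200)
```

Each batch can then be processed and released before the next one is rendered, at the cost of calling the renderer several times.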


In [5]:
# this is the default prompt for granite-docling. Feel free to change it!
PROMPT = "Convert this page to docling"
formatted_prompt = apply_chat_template(processor, config, PROMPT, num_images=1)
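
The generation loop in the next cell stops streaming once the closing `</doctag>` tag appears. If the tokenizer ever splits that tag across several chunks, a check against only the latest chunk would miss it. The sketch below shows the accumulate-then-check pattern on a simulated stream. `collect_until` and `fake_stream` are hypothetical names used for illustration, and no model is involved.

```python
from collections.abc import Iterable


def collect_until(chunks: Iterable[str], stop: str = "</doctag>") -> str:
    """Concatenate streamed text chunks and stop once `stop` appears in the accumulated text."""
    collected = ""
    for chunk in chunks:
        collected += chunk
        end = collected.find(stop)
        if end != -1:
            # Drop anything generated after the closing tag.
            return collected[: end + len(stop)]
    return collected  # the stream ended without the stop marker


# Simulated stream in which the closing tag is split across two chunks.
fake_stream = ["<doctag>", "<text>hello</text>", "</doc", "tag>", "ignored"]
print(collect_until(fake_stream))
```

With the real model, the chunks would be the `token.text` values yielded by `stream_generate`.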

In [6]:
# THIS IS THE MAIN LOOP --> Generate DocTags output for all pages
# this is kind of ugly, because we manually iterate over the pages, but there was an issue with appending to the doctags object
print("Generating DocTags...\n")

for i, pil_image in enumerate(images):
    print(f"\nProcessing page {i+1}/{len(images)}...")
    
    # Generate DocTags for this page
    page_output = ""
    for token in stream_generate(
        model, processor, formatted_prompt, [pil_image], max_tokens=4096, verbose=False
    ):
        page_output += token.text
        # Check the accumulated text so a closing tag split across tokens is still detected
        if "</doctag>" in page_output:
            break
    
    # Create and save document for THIS page only
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([page_output], [pil_image])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name=f"Page {i+1}")
    
    # Save each page separately in a tmp folder within OUTPUT_PATH
    tmp_dir = Path(OUTPUT_PATH) / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    # HTML output
    tmp_output_path = tmp_dir / f"output_page_{i+1}.html"
    doc.save_as_html(tmp_output_path, image_mode=ImageRefMode.EMBEDDED)
    print(f"Page {i+1} saved to: {tmp_output_path}")
    # markdown output
    tmp_output_path = tmp_dir / f"output_page_{i+1}.md"
    doc.save_as_markdown(tmp_output_path)
    print(f"Page {i+1} saved to: {tmp_output_path}")
    # doctags output (the file contains DocTags markup even though it uses a .json extension)
    tmp_output_path = tmp_dir / f"output_page_{i+1}.json"
    doc.save_as_doctags(tmp_output_path)
    print(f"Page {i+1} saved to: {tmp_output_path}")

    # Print this page's markdown for debugging
    print(f"\nPage {i+1} Markdown:\n")
    print(doc.export_to_markdown())
    print("\n" + "="*50 + "\n")



Generating DocTags...


Processing page 1/40...
Page 1 saved to: output/tmp/output_page_1.html
Page 1 saved to: output/tmp/output_page_1.md
Page 1 saved to: output/tmp/output_page_1.json

Page 1 Markdown:

## Apollo 11

Flight Plan

Final - July 1, 1969

NATIONAL AERONAUTICS AND SPACE ADMINISTRATION

MANNED SPACECRAFT CENTER HOUSTON, TEXAS



Processing page 2/40...
Page 2 saved to: output/tmp/output_page_2.html
Page 2 saved to: output/tmp/output_page_2.md
Page 2 saved to: output/tmp/output_page_2.json

Page 2 Markdown:

Powered by TCPDF (www.tcpdf.org)



Processing page 3/40...
Page 3 saved to: output/tmp/output_page_3.html
Page 3 saved to: output/tmp/output_page_3.md
Page 3 saved to: output/tmp/output_page_3.json

Page 3 Markdown:

APOLLO 11 - FLIGHT PLAN



Processing page 4/40...
Page 4 saved to: output/tmp/output_page_4.html
Page 4 saved to: output/tmp/output_page_4.md
Page 4 saved to: output/tmp/output_page_4.json

Page 4 Markdown:

<!-- image -->



Processing page 5/40...
Page

AssertionError: bbox[0]<=bbox[2] => 555.744<=443.27200000000005
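
The run above stops on page 5 with a failed `bbox[0]<=bbox[2]` assertion, which presumably means the model produced a bounding box whose left edge lies to the right of its right edge. One way to keep a single bad page from aborting the whole run is to isolate each page in a `try`/`except` and collect the failures. The sketch below shows that pattern with a stand-in converter. `convert_pages` and `fake_convert` are hypothetical names, and the real per-page work (generation, `load_from_doctags`, saving) would go into the callback.

```python
from collections.abc import Callable, Iterable
from typing import Any


def convert_pages(
    pages: Iterable[Any],
    convert_page: Callable[[int, Any], None],
) -> dict[int, str]:
    """Run `convert_page` on every page and return {page_number: error message} for failures."""
    failures: dict[int, str] = {}
    for page_number, page in enumerate(pages, start=1):
        try:
            convert_page(page_number, page)
        except (AssertionError, ValueError) as exc:
            # Record the problem and move on instead of aborting the whole run.
            failures[page_number] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in converter that fails on one page, mimicking the bounding-box assertion.
def fake_convert(page_number: int, page: Any) -> None:
    if page == "bad":
        raise AssertionError("bbox[0]<=bbox[2]")


print(convert_pages(["ok", "bad", "ok"], fake_convert))
```

The returned dictionary lists the pages that need a second look, for example with a different prompt or a higher render resolution.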