# Extract text from PDFs

Extracting text from PDFs is challenging because these files may be scanned, have complex layouts, or contain non-text content such as images and tables. When building a dataset to benchmark embedding models, it is important to avoid text that is noisy, poorly formatted, or merged across section boundaries.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [pypdf](https://github.com/py-pdf/pypdf) (the successor to PyPDF2), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they struggle with complex layouts and with scanned documents, where each page is only an image. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.

Therefore, it is essential to ensure that the extracted text is clean and well-structured. Modern vision-capable LLMs are good at reading page images, so we can use them to extract text from PDFs in a form that closely matches how a human would read the document.
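
Not every page needs a vision model, though. A rough way to decide is to check how much text a plain extraction library returns for a page (for example from PyMuPDF's `page.get_text()`) and only fall back to the model when the result is empty or nearly empty. The sketch below works on plain strings; the name `needs_vision_model` and the 20-character threshold are illustrative assumptions, not part of any library.

```python
def needs_vision_model(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page's extracted text is empty or nearly empty.

    Hypothetical helper: scanned pages usually yield little or no text from a
    plain extraction library. The threshold is arbitrary and should be tuned.
    """
    # Count only non-whitespace characters so blank lines do not count as text
    visible_chars = sum(1 for ch in page_text if not ch.isspace())
    return visible_chars < min_chars


# Two pages without usable text and one page with a real text layer
pages = ["", "  \n ", "Chapter 1: Getting to know your Notebook PC"]
print([needs_vision_model(text) for text in pages])
```

A character count will not catch garbled text, so treat this as a first filter rather than a quality check.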

## Granite Docling 258M

### Using the transformers library

Start by initializing the model and processor.

In [1]:
import torch

from transformers import AutoProcessor, AutoModelForImageTextToText

model_name = "ibm-granite/granite-docling-258M"
device = "cuda" if torch.cuda.is_available() else "cpu"
granite_docling_processor = AutoProcessor.from_pretrained(model_name)
granite_docling_model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=model_name,
    dtype=torch.bfloat16,
).to(device)  # type: ignore

Prepare the input to be passed to the model.

In [2]:
from transformers.image_utils import load_image


# load_image accepts a local path, a URL, or a PIL image
image_path = "../images/test_ocr.png"
image = load_image(image=image_path)
# image = load_image(
#     "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png"
# )
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]

prompt = granite_docling_processor.apply_chat_template(
    conversation=messages, add_generation_prompt=True
)
inputs = granite_docling_processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(device)

Generate the output. The `doc_tags` variable holds the raw model output, decoded with special tokens kept so that the DocTags markup survives. Even with a GPU, this took around 5 minutes for a single image.

In [3]:
generated_ids = granite_docling_model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doc_tags = granite_docling_processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

In [5]:
print(doc_tags)

<doctag><page_header><loc_115><loc_27><loc_385><loc_34>Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>
<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>
<text><loc_41><loc_42><loc_239><loc_88>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.</text>
<text><loc_41><loc_89><loc_239><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨ıvely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflecte
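
Each element in the raw output is wrapped in a tag that names its type, followed by four `<loc_…>` tokens and the text. As a rough illustration, the sketch below pulls those pieces apart with a regular expression. It assumes that the four location values form a bounding box (left, top, right, bottom) and that an element never contains a nested element of the same type; `parse_doctags` and `DocTagElement` are names made up for this example. For real use, the `docling_core` conversion shown next is the safer path.

```python
import re
from typing import NamedTuple


class DocTagElement(NamedTuple):
    kind: str
    bbox: tuple[int, int, int, int]  # assumed order: left, top, right, bottom
    text: str


# Matches <kind><loc_a><loc_b><loc_c><loc_d>text</kind>
ELEMENT_PATTERN = re.compile(
    r"<(?P<kind>\w+)>"
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=kind)>",
    flags=re.DOTALL,
)


def parse_doctags(doc_tags: str) -> list[DocTagElement]:
    elements = []
    for match in ELEMENT_PATTERN.finditer(doc_tags):
        # Groups 2 to 5 are the four location values
        bbox = tuple(int(match.group(i)) for i in range(2, 6))
        elements.append(
            DocTagElement(match.group("kind"), bbox, match.group("text").strip())
        )
    return elements


sample = (
    "<doctag><page_header><loc_115><loc_27><loc_385><loc_34>"
    "Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>\n"
    "<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>\n"
)
for element in parse_doctags(sample):
    print(element.kind, element.bbox, element.text)
```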

Convert the DocTags into a `DoclingDocument` and render it as Markdown.

In [6]:
from IPython.display import Markdown, display
from docling_core.types.doc.document import DoclingDocument, DocTagsDocument


doctag_document = DocTagsDocument.from_doctags_and_image_pairs(
    doctags=[doc_tags], images=[image]
)
docling_document = DoclingDocument.load_from_doctags(
    doctag_document=doctag_document, document_name="Document"
)
extracted_text_markdown = docling_document.export_to_markdown()
display(Markdown(extracted_text_markdown))

while the kernel weights are structured as ( N$\_{slice}$ , N$\_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨ıvely, the fit would include 2 N$\_{slice}$ + 1 parameters: N$\_{slice}$ for the albedo values, N$\_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameter. Following Coulombe et al. (2025) we set an upper prior limit of 3 / 2 on all albedo slices as a fully Lambertian sphere ( A$\_{i}$ = 1 ) corresponds to a geometric albedo of A$\_{g}$ = 2 / 3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.

We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N$\_{slice}$ = 4 , 6 , 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.

## 2.5. Planetary Effective Temperature

Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partier &amp; Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b ( ∼ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.

We convert the fitted F$\_{p}$ / F$\_{∗}$ emission spectra to brightness temperature by wavelength,

$$T _ { b r i g h t } = \frac { h c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , p l a n e t } } + 1 \right ) \right ] ^ { - 1 } ,$$

where the planet's thermal emission is

$$B _ { \lambda , \text {planet} } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , \text {star} } \, .$$

There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz &amp; Cowan 2015;

Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λ$\_{max}$ λ$\_{min}$ B ( λ, T ) dλ/ ∫ ∞ 0 B ( λ, T ) dλ . The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 µ m] (red line); Hubble WFC3 [1.12-1.64 µ m] (dashed green line); NIRSpec G395H [2.7-5.15 µ m] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on our T$\_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.

line chart

<!-- image -->

Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F$\_{p}$ / F$\_{s}$ emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T$\_{eff}$ for a given orbital phase as

$$T _ { \text {eff} } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { \text {bright,} } , } { \sum _ { i = 1 } ^ { N } w _ { i } } ,$$

where w$\_{i}$ is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,

$$w _ { i } = \frac { \int _ { \lambda _ { i } } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,$$

with T$\_{est}$ representing an estimated effective temperature at the orbital phase of interest. When computing

Save the extracted text to a file so that we can review it later.

In [7]:
file_path = "../data/extracted_text/granite_docling_output.txt"
# The text contains non-ASCII characters (σ, µ), so set the encoding explicitly
with open(file_path, "w", encoding="utf-8") as f:
    f.write(extracted_text_markdown)

### Using llama.cpp

Inference through transformers was very slow for me, even with a GPU, so let's try `llama.cpp` instead. First, download the model in GGUF format from the [ggml-org repository](https://huggingface.co/ggml-org/granite-docling-258M-GGUF/tree/main) and choose a quantization. This model is small, so I went with half precision (f16).

Download two files, one for the language model, and the other for the vision encoder (mmproj). For the 16-bit model, the files are:

- `granite-docling-258M-f16.gguf`
- `mmproj-granite-docling-258M-f16.gguf`

After that, serve the model using [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md):

```bash
llama-server \
  --model ~/.cache/llama.cpp/granite-docling-258M-f16.gguf \
  --mmproj ~/.cache/llama.cpp/mmproj-granite-docling-258M-f16.gguf \
  --n-gpu-layers 999 \
  --ctx-size 4096 \
  --port 36912
```
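
The server needs a moment to load the model before it accepts requests. Below is a minimal sketch for waiting until it is ready, assuming the `/health` endpoint described in the llama-server README and the port used above. The name `wait_for_server` and the timeout values are arbitrary choices for this example.

```python
import time
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll base_url + /health until it answers with HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts and HTTP error statuses all land here
            # (urllib.error.URLError and HTTPError are subclasses of OSError)
            pass
        time.sleep(poll_s)
    return False


# Example usage (requires the server started above):
# wait_for_server("http://localhost:36912")
```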

Here we define a function that takes an image path as input, encodes the image in base64 format, and returns a data URI.

In [8]:
import base64
import mimetypes


def encode_image_to_data_uri(image_path: str) -> str:
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"

    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{encoded_string}"


path_to_image = "../images/test_ocr.png"
image_data_uri = encode_image_to_data_uri(path_to_image)
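
To sanity-check the result, a data URI can be split back into its MIME type and raw bytes. This is only an illustrative helper (the name `decode_data_uri` is hypothetical), and it handles just the `data:<mime>;base64,<payload>` form produced above.

```python
import base64


def decode_data_uri(data_uri: str) -> tuple[str, bytes]:
    # Everything before the first comma is the header, the rest is the payload
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a data URI of the form data:<mime>;base64,<payload>")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# The first eight bytes of every PNG file are a fixed signature
png_signature = b"\x89PNG\r\n\x1a\n"
uri = "data:image/png;base64," + base64.b64encode(png_signature).decode("utf-8")
mime_type, raw_bytes = decode_data_uri(uri)
print(mime_type, raw_bytes == png_signature)
```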

In [None]:
import json
import requests

# Don't forget to start the llama.cpp server!
LLAMA_SERVER_URL = "http://localhost:36912"


def extract_text_from_images(model_name: str, prompt: str, image_data_uri: str) -> str:
    url = f"{LLAMA_SERVER_URL}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_uri}},
            ],
        }
    ]

    data = {
        "messages": messages,
        "model": model_name,
        "max_tokens": 4096,
    }

    try:
        # A timeout keeps the call from hanging forever if the server stalls
        response = requests.post(
            url, headers=headers, data=json.dumps(data), timeout=600
        )
        response.raise_for_status()
        response_json = response.json()

        assistant_message = response_json["choices"][0]["message"]["content"]
        return assistant_message.strip()

    except Exception as e:
        print(f"An error occurred: {e}")
        return ""


model_to_use = "granite-docling-258M"
text_prompt = "Convert this page to docling."

doc_tags_llama_cpp = extract_text_from_images(model_to_use, text_prompt, image_data_uri)

Ah, look at the speed difference! It took only 5 seconds to process the same image instead of 5 minutes. The response still contains DocTags, so let's strip the tags and look at the plain text.

In [13]:
import re


def extract_inner_text(text_with_tags: str) -> str:
    return re.sub(r"<.*?>", "", text_with_tags, flags=re.DOTALL).strip()


extracted_text_llama_cpp = ""
for line in doc_tags_llama_cpp.splitlines():
    extracted_text_llama_cpp += extract_inner_text(text_with_tags=line) + "\n"

print(extracted_text_llama_cpp)

E NERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE
9
while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.
To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨¨vely, the fit would include 2 N$_{slice }$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameer. Following Coulombe et al. (2025) we set an upper prio
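
Stripping every tag keeps page furniture such as the running header and the page number `9` in the text, which is noise for an embedding model. One possible refinement, sketched below, removes whole elements of selected types before stripping the remaining tags. The helper name `extract_body_text` is made up for this example, and the tag list is an assumption: `page_header` appears in the raw output shown earlier, while `page_footer` is assumed to be its counterpart.

```python
import re

# Element types to drop entirely (assumed names; adjust to the tags you see)
FURNITURE_TAGS = ("page_header", "page_footer")


def extract_body_text(doc_tags: str) -> str:
    # Remove furniture elements together with the text inside them
    for tag in FURNITURE_TAGS:
        doc_tags = re.sub(rf"<{tag}>.*?</{tag}>", "", doc_tags, flags=re.DOTALL)
    # Remove the remaining tags but keep the text between them
    text = re.sub(r"<.*?>", "", doc_tags, flags=re.DOTALL)
    # Drop lines left empty by the removals
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


sample = (
    "<doctag><page_header><loc_115><loc_27><loc_385><loc_34>Energy Budget</page_header>\n"
    "<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>\n"
    "<text><loc_41><loc_42><loc_239><loc_88>while the kernel weights are structured</text>\n"
)
print(extract_body_text(sample))
```

An element whose closing tag is missing, for example because generation was cut off, is not removed as a unit; its tags are stripped and its text is kept.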

Save the extracted text to a file so that we can review it later. This writes to the same path as before, so it replaces the earlier output.

In [14]:
file_path = "../data/extracted_text/granite_docling_output.txt"
# The text contains non-ASCII characters (σ), so set the encoding explicitly
with open(file_path, "w", encoding="utf-8") as f:
    f.write(extracted_text_llama_cpp)

Now, let's use the same model on a scanned PDF. Each page is rendered to a PNG at 300 DPI and sent to the server as a data URI.

In [15]:
import pymupdf

from tqdm import tqdm

extracted_text_from_all_pages = ""
model_to_use = "granite-docling-258M"
text_prompt = "Convert this page to docling."

pdf_path = "../data/documents/rog_strix_gaming_notebook_pc_scanned_file.pdf"
pdf_document = pymupdf.open(pdf_path)

for page in tqdm(pdf_document, total=pdf_document.page_count):  # type: ignore
    pix = page.get_pixmap(dpi=300)
    image_bytes = pix.tobytes("png")
    encoded_string = base64.b64encode(image_bytes).decode("utf-8")
    data_uri = f"data:image/png;base64,{encoded_string}"

    extracted_text_from_page_with_tags = extract_text_from_images(
        model_name=model_to_use, prompt=text_prompt, image_data_uri=data_uri
    )
    extracted_text_from_page = extract_inner_text(
        text_with_tags=extracted_text_from_page_with_tags
    )
    extracted_text_from_all_pages += extracted_text_from_page + "\n"

print(extracted_text_from_all_pages.strip())

100%|██████████| 19/19 [00:26<00:00,  1.42s/it]

BC REVISED EDITIONVS DECEMBER 2024
ROG STRIX
GAMING NOTEBOOK PC
MORE INFO

Gaming
RCER CERRE RIE ERE RE
ASUS PROVIDES THIS MANUAL AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, IN NO EVENT SHALL ASUS, ITS DIRECTORS, OFFICERS, EMPLOYEES OR AGENTS BE LIABLE FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, LOSS OF BUSINESS, LOSS OF USE OR DATA, INTERRUPTION OF BUSINESS AND THE LIKE), EVEN IF ASUS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES ARISING FROM ANY DEFECT OR ERROR IN THIS MANUAL OR PRODUCT.
Products and corporate names appearing in this manual may or may not be registered trademarks or copyrights of their respective companies, and are used only for identification or explanation and to the owners' benefit, without intent to infringe.
SPECIFICATIONS AND INFORMATION CONTAINED IN THIS MA
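
Concatenating all pages into one string loses the page boundaries, which makes it harder to trace a chunk back to its source page. A small sketch of one alternative: collect the per-page strings in a list and join them with an explicit marker. The marker format and the name `join_pages` are arbitrary choices for this example.

```python
def join_pages(pages: list[str]) -> str:
    """Join per-page text, putting a visible page marker before each page."""
    parts = []
    for page_number, page_text in enumerate(pages, start=1):
        parts.append(f"--- page {page_number} ---\n{page_text.strip()}")
    return "\n\n".join(parts)


# Short stand-ins for the per-page strings collected in the loop above
pages = ["ROG STRIX\nGAMING NOTEBOOK PC\n", "MORE INFO"]
print(join_pages(pages))
```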




Save the extracted text to a file so that we can review it later. As before, this overwrites the file written in the previous step.

In [16]:
file_path = "../data/extracted_text/granite_docling_output.txt"
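# Set the encoding explicitly so the result does not depend on the platform default
with open(file_path, "w", encoding="utf-8") as f: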
    f.write(extracted_text_from_all_pages)