# Extract text from PDFs

Extracting text from PDFs is challenging because these files may be scanned, have complex layouts, or contain unstructured data such as images and tables. When building a dataset to benchmark embedding models, it is important to avoid text that is noisy, poorly formatted, or merged across sections.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [pypdf](https://github.com/py-pdf/pypdf) (formerly PyPDF2), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they are ineffective on scanned documents and unstructured content. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.

Therefore, it is essential that the extracted text is clean and well structured. Modern multimodal LLMs are good at reading images and understanding text, so we can use them to extract text from PDFs in a form that closely matches how a human would perceive the document.

## Gemma3 12B

### Using llama.cpp

Download the model in `GGUF` format from [Hugging Face](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf). Running a 12B model locally is resource-intensive, so make sure you have enough RAM and VRAM.

You need two files, one for the language model and one for the vision encoder (mmproj):

- `gemma-3-12b-it-q4_0.gguf`
- `mmproj-model-f16-12B.gguf`

Then serve the model with [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md):

```bash
llama-server \
  --model ~/.cache/llama.cpp/gemma-3-12b-it-q4_0.gguf \
  --mmproj ~/.cache/llama.cpp/mmproj-model-f16-12B.gguf \
  --n-gpu-layers 20 \
  --ctx-size 4096 \
  --port 36912
```
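Loading a 12B model can take a while, so a request sent right after launching the server may fail. The following is a minimal sketch for waiting until the server is ready, assuming it exposes the `/health` endpoint described in the llama-server README and listens on the port chosen above. The function name `wait_for_server` and its parameters are hypothetical and introduced here only for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll `<base_url>/health` until it answers 200 or `timeout_s` elapses.

    Assumption: the server returns HTTP 200 on /health once the model is
    loaded and an error status (or no connection at all) before that.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet, or still loading (non-2xx raises HTTPError,
            # which is a subclass of URLError).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_server("http://localhost:36912")` before the first request returns `True` once the server responds and `False` if it never becomes ready within the timeout.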

```python
import base64
import mimetypes


def encode_image_to_data_uri(image_path: str) -> str:
    """Read an image file and return it as a base64 `data:` URI."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"

    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{encoded_string}"


path_to_image = "../images/test_ocr.png"
image_data_uri = encode_image_to_data_uri(path_to_image)
```

On my RTX 4070, Gemma3 12B took 6 minutes to process this single image. That is slow, but let's look at the output.

```python
import requests

# Don't forget to start the llama.cpp server!
LLAMA_SERVER_URL = "http://localhost:36912"


def extract_text_from_images(model_name: str, prompt: str, image_data_uri: str) -> str:
    """Send the prompt and one image to the chat completions endpoint.

    Returns the model's reply, or an empty string if the request fails.
    """
    url = f"{LLAMA_SERVER_URL}/v1/chat/completions"

    # The prompt and the image travel together in a single user message.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_uri}},
            ],
        }
    ]

    data = {
        "messages": messages,
        "model": model_name,
        "max_tokens": 4096,
    }

    try:
        # `json=` serializes the payload and sets the Content-Type header.
        response = requests.post(url, json=data)
        response.raise_for_status()
        response_json = response.json()

        assistant_message = response_json["choices"][0]["message"]["content"]
        return assistant_message.strip()

    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # Network/HTTP errors, invalid JSON, or an unexpected response shape.
        print(f"An error occurred: {e}")
        return ""


model_to_use = "gemma-3-12b-it"
# Despite its name, this prompt is sent as the text part of the user message.
system_prompt = """You are an expert AI assistant, you are tasked with extracting the entire text from any PDF document. The document can be simple, complex, or even scanned, this shouldn't matter to you.

You will be given the entire PDF as input. Start examining the document page by page, when you come across text, extract it as is don't convert it into another format like HTML or Markdown. If you come across images, replace them with a very detailed description of the image while taking into consideration the context around it.

When you come across tables, describe them too like the image. The description should be very detailed and in a way that someone will understand the table without seeing it.

Make sure to keep the structure of the document, if there are sections, subsections, bullet points, or numbered lists, make sure to keep them as is. If there are any headers, footers, page numbers, remove them.

The final output should be a clean, well-structured text that represents the content of the entire PDF document as closely as possible to how a human would see it with their eyes when reading the document. Don't say anything else, just output the text you extracted from the PDF.

Here is the PDF:
"""

gemma_response = extract_text_from_images(
    model_name=model_to_use, prompt=system_prompt, image_data_uri=image_data_uri
)
```

```python
print(gemma_response)
```

ENERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE 9

This preconputation significantly affects our calculations, which is essential since the longitudinal sides are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

To address this, we follow a similar approach to our sinusoidal fits using emcee, but we increase the total number of walkers to 1000 and use 100 walkers. Nively, the fit would include 2N<sub>λ</sub> + 1 parameters. N<sub>λ</sub> refers to the number of walkers for the emission parameters and one additional scatter parameter. However, since night-side slices do not contribute to the reflected light component, we exclude these absorbed values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameter. Following Coulombe et al. (2023), we set an upper prior limit of 3/2 on all ellipsoidal shapes as a spherical shape (A<sub>1</sub> = 1), 

Let's compare the first paragraph across the following models:

- `Gemma3 12B`
- `Granite docling 258M`
- `Gemini 2.5 Pro`

**Gemini 2.5 Pro:** Perfectly extracted text.

ENERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE

while the kernel weights are structured as ($N_{albedo}$, $N_{time}$). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

---

**Gemma3 12B:** The opening clause is missing and several words are wrong (struck through below).

ENERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE 9

This ~~preconputation~~ significantly ~~affects~~ our calculations, which is essential since the longitudinal ~~sides~~ are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

---

**Granite docling 258M:** Extracted the text as cleanly as the much larger Gemini 2.5 Pro model.

ENERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE 9

while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.


`Granite docling 258M` performed surprisingly well, nearly matching the output of `Gemini 2.5 Pro`: in this paragraph the two differ only in one subscript ($N_{albedo}$ versus N$_{slice}$) and in whether the trailing page number is kept. `Gemma3 12B`, by contrast, dropped the opening clause and got several words wrong. IBM trained `Granite docling 258M` specifically for document understanding, which likely explains its strong result, whereas `Gemma3 12B` is a general-purpose model and is weaker at this specific task.
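The by-eye comparison above can also be checked mechanically. Here is a minimal sketch using the standard library's `difflib` to list word-level differences between a reference extraction and a candidate. It is not the method used to produce the comparison in this notebook; the helper name `word_substitutions` is hypothetical, and the two strings are the sentences shared by the Gemini 2.5 Pro and Gemma3 12B outputs shown earlier.

```python
from difflib import SequenceMatcher


def word_substitutions(reference: str, candidate: str) -> list[tuple[str, str]]:
    """Return (reference_span, candidate_span) pairs where the word sequences differ."""
    ref_words = reference.split()
    cand_words = candidate.split()
    # Compare word sequences rather than characters so each reported
    # difference is a whole word (or a run of adjacent words).
    matcher = SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2])))
    return diffs


# Sentences common to both outputs (the clause Gemma3 12B dropped is left out).
reference = (
    "This precomputation significantly accelerates our calculations, which is "
    "essential since the longitudinal slices are at least partially degenerate "
    "with one another. Consequently, the fits require more steps and walkers "
    "to ensure proper convergence."
)
gemma = (
    "This preconputation significantly affects our calculations, which is "
    "essential since the longitudinal sides are at least partially degenerate "
    "with one another. Consequently, the fits require more steps and walkers "
    "to ensure proper convergence."
)

for ref_span, cand_span in word_substitutions(reference, gemma):
    print(f"{ref_span!r} -> {cand_span!r}")
```

For these two strings, the reported differences are the same three words that are struck through in the Gemma3 12B output. The same helper could be pointed at any other pair of extractions, provided a trusted reference text is available.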