# Extract text from PDFs

Extracting text from PDFs is challenging because these files may be scanned, have complex layouts, or contain elements such as images and tables that do not map cleanly to plain text. When building a dataset to benchmark embedding models, it is important to avoid text that is noisy, poorly formatted, or merged across section boundaries.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [pypdf](https://github.com/py-pdf/pypdf) (formerly PyPDF2), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they struggle with complex layouts, and they return little or no text for scanned documents that lack a text layer. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.

Therefore, it is essential to ensure that the extracted text is clean and well-structured. Modern vision-capable LLMs are good at reading page images, so we can use them to extract text from PDFs in a form that closely matches how a human would read the document.
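The model used below receives images rather than PDF files, so each page has to be rendered to an image first. The following is a minimal sketch of that step, assuming PyMuPDF is installed; the function names, the `page_0001.png` naming scheme, and the 200 DPI default are illustrative choices, not part of the original workflow.

```python
from pathlib import Path


def page_image_path(output_dir: str, page_index: int) -> Path:
    """Build a zero-padded file name so that pages sort in reading order."""
    return Path(output_dir) / f"page_{page_index + 1:04d}.png"


def render_pdf_to_images(pdf_path: str, output_dir: str, dpi: int = 200) -> list[Path]:
    """Render every page of a PDF to a PNG file and return the file paths."""
    try:
        import pymupdf  # import name in recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # import name in older PyMuPDF releases

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    image_paths = []
    with pymupdf.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            # dpi=200 is an assumed default; raise it if small text is misread.
            pixmap = page.get_pixmap(dpi=dpi)
            image_path = page_image_path(output_dir, page_index)
            pixmap.save(str(image_path))
            image_paths.append(image_path)
    return image_paths
```

A call such as `render_pdf_to_images("paper.pdf", "pages")` (both names hypothetical) would write one PNG per page, each of which can then be encoded and sent to the model.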

## LightOnOCR 1B

### Using llama.cpp

Download the model in `GGUF` format from [Hugging Face](https://huggingface.co/ggml-org/LightOnOCR-1B-1025-GGUF). Running the model locally is resource-intensive, so make sure you have enough RAM and VRAM.

Two files are needed: one for the language model and one for the vision encoder (mmproj):

- `LightOnOCR-1B-1025-Q8_0.gguf`
- `mmproj-LightOnOCR-1B-1025-Q8_0.gguf`

After that, serve the model using [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md). The command below assumes both files were saved in `~/.cache/llama.cpp/`:

```bash
llama-server \
  --model ~/.cache/llama.cpp/LightOnOCR-1B-1025-Q8_0.gguf \
  --mmproj ~/.cache/llama.cpp/mmproj-LightOnOCR-1B-1025-Q8_0.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --port 36912
```
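Loading the model can take a moment, so it may help to wait until the server reports that it is ready before sending requests. The sketch below polls llama-server's `/health` endpoint, which responds with HTTP 200 once the model is loaded; the function name, timeout, and polling interval are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll the /health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet, or it answered with an error status
            # (for example while the model is still loading).
            pass
        time.sleep(poll_s)
    return False
```

With the command above, `wait_for_server("http://localhost:36912")` should return `True` once the server is ready to accept requests.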

```python
import base64
import mimetypes


def encode_image_to_data_uri(image_path: str) -> str:
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"

    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{encoded_string}"


path_to_image = "../images/test_ocr.png"
image_data_uri = encode_image_to_data_uri(path_to_image)
```
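A quick sanity check before sending a data URI to the server is to decode it again and confirm the MIME type and the leading bytes. This is only a sketch: `decode_data_uri` is a hypothetical helper, and the sample URI is built from the 8-byte PNG signature rather than a real image.

```python
import base64


def decode_data_uri(data_uri: str) -> tuple[str, bytes]:
    """Split a base64 data URI into its MIME type and raw bytes."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

sample_uri = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")
mime_type, raw = decode_data_uri(sample_uri)
print(mime_type, raw.startswith(PNG_SIGNATURE))  # image/png True
```

Running the same check on `image_data_uri` would reveal a file whose extension was not recognized, since its MIME type falls back to `application/octet-stream`.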

On my RTX 4070, it took 6 seconds to process the image with LightOnOCR 1B.

```python
import requests

# Don't forget to start the llama.cpp server!
LLAMA_SERVER_URL = "http://localhost:36912"


def extract_text_from_images(image_data_uri: str) -> str:
    """Send one image to the chat completions endpoint and return the transcribed text."""
    url = f"{LLAMA_SERVER_URL}/v1/chat/completions"

    # The request carries only the image; no text instruction is sent.
    messages = [
        {"role": "system", "content": ""},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_uri}},
            ],
        },
    ]

    data = {
        "messages": messages,
        "max_tokens": 4096,
    }

    try:
        # json= serializes the payload and sets the Content-Type header.
        # The 300-second timeout is an arbitrary upper bound; adjust as needed.
        response = requests.post(url, json=data, timeout=300)
        response.raise_for_status()
        response_json = response.json()

        assistant_message = response_json["choices"][0]["message"]["content"]
        return assistant_message.strip()

    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        print(f"An error occurred: {e}")
        return ""


model_response = extract_text_from_images(image_data_uri=image_data_uri)
```
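A PDF usually spans several pages, so the single-image call has to be repeated per page and the results joined. The sketch below assumes the pages already exist as image files with zero-padded names in one directory; `extract_document_text` and its parameters are hypothetical, and the per-page extraction is passed in as a callable so the loop can be tried without a running server.

```python
from pathlib import Path
from typing import Callable


def extract_document_text(
    image_dir: str,
    extract_page: Callable[[Path], str],
    pattern: str = "*.png",
    separator: str = "\n\n",
) -> str:
    """Run extract_page on every matching image, in name order, and join the text."""
    # Sorting by name gives reading order only if file names are zero-padded.
    page_paths = sorted(Path(image_dir).glob(pattern))
    page_texts = []
    for page_path in page_paths:
        text = extract_page(page_path)
        if text:  # skip pages where extraction failed and returned ""
            page_texts.append(text)
    return separator.join(page_texts)
```

With the functions defined in this notebook, the callable could be `lambda p: extract_text_from_images(encode_image_to_data_uri(str(p)))`. Keeping a blank line between pages avoids merging the last paragraph of one page with the first paragraph of the next.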

```python
print(model_response)
```

while the kernel weights are structured as ($N_{\rm slice}$, $N_{\rm time}$). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

To address this, we follow a similar approach to our sinusoidal fits using \texttt{emcee}, but we increase the total number of steps to 100,000 and use 100 walkers. Nively, the fit would include $2 N_{\rm slice} + 1$ parameters: $N_{\rm slice}$ for the albedo values, $N_{\rm slice}$ for the emission parameters, and one additional scatter parameter, $\sigma$. However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameter. Following \citet{Coulombe2025} we set an upper prior limit of 3/2 on all albedo sli