# Parse PDFs to markdown
---

Here we'll test different approaches to parsing PDFs of non-English electoral manifestos. The goal is to extract the text from each PDF and convert it to Markdown, so that it can be easily processed and analyzed.

## Setup

### Import libraries

In [None]:
import os
from pathlib import Path
from IPython.display import Markdown, display
import requests
from pdf2image import convert_from_path
from docling.document_converter import DocumentConverter
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from marker.config.parser import ConfigParser
import ollama
from google import genai
from google.genai import types
import httpx
from dotenv import load_dotenv
import base64
from openai import OpenAI
from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader
import nest_asyncio
import pymupdf4llm
from io import BytesIO
from pydantic import BaseModel

In [None]:
from polids.pdf_processing.openai import OpenAIPDFProcessor

### Set parameters

In [None]:
load_dotenv()

In [None]:
os.listdir()

In [None]:
os.chdir("..")

In [None]:
os.listdir()

In [None]:
human_annotated_data_path = Path("data/elections_portugal/2022/programs_md/")
# small PDF
pdf_url = (
    "https://partidolivre.pt/wp-content/uploads/2021/12/Programa_Eleitoral_2022.pdf"
)
human_annotated_md = human_annotated_data_path / "livre.md"
# very large PDF
# pdf_url = "https://iniciativaliberal.pt/wp-content/uploads/2022/01/Iniciativa-Liberal-Programa-Eleitoral-2022.pdf"
# human_annotated_md = human_annotated_data_path / "liberal.md"
llm_image_parsing_prompt = "Parse all of the text from this image. Convert it into a markdown format. Only write the content of the image, nothing else"
llm_pdf_parsing_promt = "Parse all of the text from this PDF. Convert it into a markdown format. Only write the content of the PDF, nothing else"

In [None]:
# LlamaParse needs nested async loop to run in a notebook
nest_asyncio.apply()

## Load data

In [None]:
markdown_content = human_annotated_md.read_text()
markdown_content

In [None]:
display(Markdown(markdown_content))

In [None]:
# Download the PDF
pdf_path = "downloaded_program.pdf"
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as a PDF
with open(pdf_path, "wb") as f:
    f.write(response.content)

In [None]:
test_pdf = Path("tests/data/test_electoral_program.pdf")

## Parse PDFs

### Implemented solution

In [None]:
pdf_parser = OpenAIPDFProcessor()
result = pdf_parser.process(pdf_path)
display(Markdown("\n\n---\n\n".join(result)))

While this LLM-powered PDF parser is slow, it gives us great output quality at a more reasonable price than other proprietary software.

### docling

In [None]:
converter = DocumentConverter()
result = converter.convert(pdf_url)
display(Markdown(result.document.export_to_markdown()))

Big issues found:
- If the text is laid out in columns, `docling` can merge text from separate columns together.
- Some text that is inside lists is cut from `docling`'s Markdown output. This seems to be an [open issue with the package](https://github.com/docling-project/docling/issues/913).

### marker

In [None]:
config = {
    "use_llm": True,  # use Gemini 2.0 Flash LLM to improve the output quality
    "gemini_api_key": os.environ.get("GOOGLE_API_KEY"),
    "paginate_output": True,  # separate pages in the markdown, using 48x "-" separator
    "output_format": "markdown",
}
config_parser = ConfigParser(config)
converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer(),
    llm_service=config_parser.get_llm_service(),
)
rendered = converter(pdf_path)
text, _, images = text_from_rendered(rendered)
display(Markdown(text))

This looks better than the `docling` output, with the main cons being:
- Longer startup overhead (a few minutes to download the models);
- It adds a paragraph break (`\n\n`) where the text flows from one column into the next column on its right.

Still, it handles even very large (>600 pages) PDFs well and in under 10 minutes.

Adding the LLM usage and paginating the outputs really makes this a great tool for parsing PDFs to markdown.
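
The column-boundary paragraph breaks noted above could be smoothed over in post-processing. Below is a minimal sketch of one heuristic, which is not something Marker itself provides: two paragraphs are joined when the first does not end in sentence-final punctuation and the second starts with a lowercase letter. `merge_column_breaks` is a hypothetical helper, and the heuristic will certainly miss or over-merge some cases.

```python
import re

# Hypothetical helper (not part of Marker): heuristic repair of paragraph
# breaks that were inserted mid-sentence at a column boundary.
_SENTENCE_END = (".", "!", "?", ":", ";")
# Headings, list items, table rows and block quotes are never merged into.
_BLOCK_START = re.compile(r"^(#|[-*+] |\d+\. |\||>)")


def merge_column_breaks(markdown: str) -> str:
    paragraphs = [p for p in re.split(r"\n{2,}", markdown) if p.strip()]
    merged: list[str] = []
    for para in paragraphs:
        prev = merged[-1] if merged else ""
        continues_sentence = (
            bool(prev)
            and not prev.rstrip().endswith(_SENTENCE_END)
            and not _BLOCK_START.match(prev.lstrip())
            and para.lstrip()[:1].islower()
        )
        if continues_sentence:
            # Glue the fragment back onto the unfinished sentence.
            merged[-1] = prev.rstrip() + " " + para.lstrip()
        else:
            merged.append(para)
    return "\n\n".join(merged)
```

It could be applied to `text` before any further analysis, ideally after spot-checking a few pages against the original PDF.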

In [None]:
text.split(48 * "-")  # split the text into pages
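
A plain `split` on the dashes can leave empty chunks and stray page markers behind. The sketch below is a slightly more defensive variant. It assumes each separator is an optional `{N}` page marker followed by exactly 48 dashes; that format is an assumption and should be checked against the actual output. `split_pages` is a hypothetical helper name.

```python
import re

# Assumed separator format: optional "{N}" page marker, then exactly 48 dashes.
PAGE_SEPARATOR = re.compile(r"(?:\{\d+\})?(?<!-)-{48}(?!-)")


def split_pages(paginated_markdown: str) -> list[str]:
    """Split paginated Markdown into per-page strings, dropping empty chunks."""
    chunks = PAGE_SEPARATOR.split(paginated_markdown)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Calling `split_pages(text)` would then give one Markdown string per page, under the separator assumption above.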

### Gemma 3

In [None]:
# Convert PDF pages to images (pdf2image needs a local file path, not a URL)
pdf_images = convert_from_path(pdf_path, dpi=300)

# Save images to files
for i, image in enumerate(pdf_images):
    image_path = f"page_{i + 1}.png"
    image.save(image_path, "PNG")
    print(f"Saved: {image_path}")
    if i > 2:
        break

In [None]:
res = ollama.chat(
    model="gemma3:4b",
    messages=[
        {
            "role": "user",
            "content": llm_image_parsing_prompt,
            "images": ["page_2.png"],
        }
    ],
)
display(Markdown(res["message"]["content"]))

In [None]:
res = ollama.chat(
    model="gemma3:12b",
    messages=[
        {
            "role": "user",
            "content": llm_image_parsing_prompt,
            "images": ["page_2.png"],
        }
    ],
    options={"temperature": 0, "num_ctx": 8000},
)
display(Markdown(res["message"]["content"]))

### Granite

In [None]:
res = ollama.chat(
    model="granite3.2-vision",
    messages=[
        {
            "role": "user",
            "content": llm_image_parsing_prompt,
            "images": ["page_2.png"],
        }
    ],
    options={"temperature": 0, "num_ctx": 8000},
)
display(Markdown(res["message"]["content"]))

### Llama 3.2

In [None]:
res = ollama.chat(
    model="llama3.2-vision",
    messages=[
        {
            "role": "user",
            "content": llm_image_parsing_prompt,
            "images": ["page_2.png"],
        }
    ],
    options={"temperature": 0, "num_ctx": 8000},
)
display(Markdown(res["message"]["content"]))

Llama 3.2 has EOF issues in Ollama.

### Gemini

I might skip this one for now, as it's likely more expensive than the options above. If one of them produces good enough output, an LLM API would be overkill.

In [None]:
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

In [None]:
# Retrieve the PDF bytes
doc_data = httpx.get(pdf_url).content
response = client.models.generate_content(
    model="gemini-2.0-flash-lite-001",
    contents=[
        types.Part.from_bytes(
            data=doc_data,
            mime_type="application/pdf",
        ),
        llm_pdf_parsing_promt,
    ],
)
display(Markdown(response.text))

Gemini 2.0 Flash Lite can't parse multiple columns correctly. It also seems to cut off before reaching the end of the PDF, although this might be solved by increasing the maximum output length or by uploading subsets of the PDF's pages.
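
Uploading the PDF in page batches is mostly a matter of index arithmetic. Below is a minimal sketch of that part only; `page_batches` and the batch size are hypothetical, and actually slicing the PDF into those pages would still need a PDF library.

```python
def page_batches(n_pages: int, batch_size: int) -> list[range]:
    """Return zero-based page index ranges that cover n_pages in order."""
    if n_pages < 0 or batch_size < 1:
        raise ValueError("n_pages must be >= 0 and batch_size >= 1")
    return [
        # The last batch is clipped so it never runs past the final page.
        range(start, min(start + batch_size, n_pages))
        for start in range(0, n_pages, batch_size)
    ]
```

Each range could then be rendered to a smaller PDF (or to images) and sent as its own request, with the per-batch outputs concatenated in order.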

In [None]:
# Retrieve the PDF bytes
doc_data = httpx.get(pdf_url).content
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents=[
        types.Part.from_bytes(
            data=doc_data,
            mime_type="application/pdf",
        ),
        llm_pdf_parsing_promt,
    ],
)
display(Markdown(response.text))

Gemini 2.0 Flash also struggles with multiple columns.

In [None]:
# Retrieve the PDF bytes
doc_data = httpx.get(pdf_url).content
response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",
    contents=[
        types.Part.from_bytes(
            data=doc_data,
            mime_type="application/pdf",
        ),
        llm_pdf_parsing_promt,
    ],
)
display(Markdown(response.text))

Gemini 2.5 Pro is facing overloading issues.

### GPT 4o

In [None]:
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

In [None]:
with open(pdf_path, "rb") as f:
    data = f.read()

base64_string = base64.b64encode(data).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "file",
                    "file": {
                        "filename": pdf_path,
                        "file_data": f"data:application/pdf;base64,{base64_string}",
                    },
                },
                {
                    "type": "text",
                    "text": "Parse all of the text from this image into a Markdown format.",
                },
            ],
        },
    ],
)

display(Markdown(completion.choices[0].message.content))

GPT 4o is refusing the prompt with "I'm sorry, I can't assist with that" or "I'm unable to directly parse the entire PDF content here".

In [None]:
def pdf_to_base64_images(
    pdf_path: Path, image_format: str = "PNG", dpi: int = 350
) -> list[str]:
    """
    Convert each page of a PDF to a base64 encoded image.

    Args:
        pdf_path (Path): Path to the PDF file
        image_format (str): Format to save the images as (PNG, JPEG, etc.)
        dpi (int): DPI resolution for the images

    Returns:
        list[str]: List of base64 encoded strings, one for each page
    """
    # Convert PDF to list of PIL Image objects
    images = convert_from_path(pdf_path, dpi=dpi)

    # Encode each image to base64
    base64_images = []
    for image in images:
        buffered = BytesIO()
        image.save(buffered, format=image_format)
        img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
        base64_images.append(img_str)

    return base64_images


pdf_images = pdf_to_base64_images(test_pdf)

In [None]:
class ParsedPDFText(BaseModel):
    text: str


completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini-2024-07-18",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Parse all of the text from this image into a Markdown format.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # pdf_to_base64_images defaults to PNG, so declare PNG here
                        "url": f"data:image/png;base64,{pdf_images[1]}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    response_format=ParsedPDFText,
    temperature=0,
    seed=42,
)

display(Markdown(completion.choices[0].message.parsed.text))

GPT 4o mini works very well if we use images instead of PDFs and apply structured outputs!
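
To run this image-based approach over every page rather than a single one, the message payload can be factored out. Below is a minimal sketch, assuming the same payload shape as the cell above; `build_page_messages` is a hypothetical helper, and the default MIME type assumes PNG, which is what `pdf_to_base64_images` produces by default.

```python
def build_page_messages(
    base64_image: str,
    prompt: str,
    mime_type: str = "image/png",
) -> list[dict]:
    """Build the chat `messages` payload for one base64-encoded page image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        # The data URL must declare the format the image was saved in.
                        "url": f"data:{mime_type};base64,{base64_image}",
                        "detail": "high",
                    },
                },
            ],
        }
    ]
```

Each element of `pdf_images` could be passed through this helper and sent with the same `client.beta.chat.completions.parse(...)` call, collecting one `ParsedPDFText` per page.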

### GPT 4.1 nano

In [None]:
pdf_images = pdf_to_base64_images(test_pdf)
completion = client.beta.chat.completions.parse(
    model="gpt-4.1-nano-2025-04-14",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Parse all of the text from this image into a Markdown format.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # pdf_to_base64_images defaults to PNG, so declare PNG here
                        "url": f"data:image/png;base64,{pdf_images[1]}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    response_format=ParsedPDFText,
    temperature=0,
    seed=42,
)

display(Markdown(completion.choices[0].message.parsed.text))

GPT 4.1 nano doesn't apply Markdown format properly.

### GPT 4.1 mini

In [None]:
pdf_images = pdf_to_base64_images(test_pdf)
completion = client.beta.chat.completions.parse(
    model="gpt-4.1-mini-2025-04-14",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Parse all of the text from this image into a Markdown format.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # pdf_to_base64_images defaults to PNG, so declare PNG here
                        "url": f"data:image/png;base64,{pdf_images[1]}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    response_format=ParsedPDFText,
    temperature=0,
    seed=42,
)

display(Markdown(completion.choices[0].message.parsed.text))

GPT 4.1 mini has better output quality than GPT 4o mini (fewer changes to the original text), at a similar speed and price!

### GPT 4.1

In [None]:
pdf_images = pdf_to_base64_images(test_pdf)
completion = client.beta.chat.completions.parse(
    model="gpt-4.1-2025-04-14",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Parse all of the text from this image into a Markdown format.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # pdf_to_base64_images defaults to PNG, so declare PNG here
                        "url": f"data:image/png;base64,{pdf_images[1]}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    response_format=ParsedPDFText,
    temperature=0,
    seed=42,
)

display(Markdown(completion.choices[0].message.parsed.text))

GPT 4.1 might be overkill, given that the mini version seems to work well enough.

### LlamaParse

Seems to work very well in premium mode, which costs $0.045 per page. There is also an auto mode, which uses premium mode only for pages that are judged to need it.

In [None]:
# set up parser
parser = LlamaParse(
    result_type="markdown",  # "markdown" and "text" are available
    # auto_mode=True,  # automatically choose the best model for the input; doesn't always work well though
    premium_mode=True,  # use premium models
)

# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=[pdf_path], file_extractor=file_extractor
).load_data()
display(Markdown("\n\n".join([doc.text for doc in documents])))

It sometimes messes up the order of the text when a page has multiple columns, and it can be expensive (e.g. $27 for 600 pages), although I also get $10 of free credit per month. A benefit here is that the output is separated by page, which can be useful for further processing.
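
The cost figures above follow from simple per-page arithmetic. Below is a minimal sketch, assuming the $0.045 premium price applies uniformly to every page and that the full $10 monthly credit is still unused. `llamaparse_cost` is a hypothetical helper, not part of the LlamaParse SDK.

```python
from decimal import Decimal


def llamaparse_cost(
    n_pages: int,
    price_per_page: Decimal = Decimal("0.045"),  # premium mode price quoted above
    free_credit: Decimal = Decimal("10"),  # assumed to be fully available
) -> Decimal:
    """Estimated out-of-pocket cost in USD after subtracting the free credit."""
    gross = price_per_page * n_pages
    # The credit can reduce the bill to zero but never below it.
    return max(gross - free_credit, Decimal("0"))
```

Under these assumptions, a 600-page program costs $27 before the credit and $17 after it.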

### PyMuPDF4LLM

In [None]:
md_text = pymupdf4llm.to_markdown(pdf_path)
display(Markdown(md_text))

The output quality here is disappointing: gaps between words, missing text, a disorganized layout, etc.

## Final notes

LlamaParse has the best output quality, but it's also the most expensive. Marker is a good alternative, with a slightly worse output quality but a much lower price. If I see that Marker's output quality starts to significantly harm the analysis, I might consider using LlamaParse instead.
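
The quality comparisons in this notebook are by eye. One possible way to put a rough number on them is to score each parser's output against the human-annotated Markdown. Below is a minimal sketch using word-level similarity from the standard library; `normalize` and `similarity` are hypothetical helpers, and this metric is only one of many options (it ignores formatting and is slow on very long documents).

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens, ignoring Markdown punctuation."""
    return re.findall(r"\w+", text.lower())


def similarity(parsed: str, reference: str) -> float:
    """Word-level similarity ratio in [0, 1] between parser output and reference."""
    matcher = difflib.SequenceMatcher(
        None, normalize(parsed), normalize(reference), autojunk=False
    )
    return matcher.ratio()
```

For example, `similarity(text, markdown_content)` would compare Marker's output with the reference loaded earlier in the notebook.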