# Parse PDFs to markdown
---

Here we test several approaches to parsing PDFs of non-English electoral manifestos. The goal is to extract the text from each PDF and convert it to markdown, so that it can be easily processed and analyzed.

## Setup

### Import libraries

In [None]:
import os
from pathlib import Path
from IPython.display import Markdown, display
import requests
from docling.document_converter import DocumentConverter
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

### Set parameters

In [None]:
ls

In [None]:
os.chdir("..")

In [None]:
ls

In [None]:
pdf_url = (
    "https://partidolivre.pt/wp-content/uploads/2021/12/Programa_Eleitoral_2022.pdf"
)
human_annotated_data_path = Path("data/portugal_2022/programs/")
human_annotated_md = human_annotated_data_path / "livre.md"

## Load data

In [None]:
markdown_content = human_annotated_md.read_text()
markdown_content

In [None]:
display(Markdown(markdown_content))

## Parse PDFs

### docling

In [None]:
converter = DocumentConverter()
result = converter.convert(pdf_url)
display(Markdown(result.document.export_to_markdown()))

Main issues found:
- When the text is laid out in columns, `docling` can struggle and merge text from separate columns.
- Some text inside lists is missing from `docling`'s Markdown output. This seems to be an [open issue with the package](https://github.com/docling-project/docling/issues/913).
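
One rough way to quantify the dropped list text is to check which lines of the human-annotated markdown never show up in a parser's output. The sketch below is only a heuristic: `normalize` and `missing_lines` are hypothetical helpers (not part of `docling` or `marker`), the sample strings are made up, and exact substring matching is an assumption that will flag lines the parser merely reworded or re-wrapped.

```python
import re


def normalize(line: str) -> str:
    """Drop leading heading/list markers and collapse whitespace (assumed normalization)."""
    line = re.sub(r"^\s*(#+|[-*+]|\d+\.)\s+", "", line)
    return re.sub(r"\s+", " ", line).strip().lower()


def missing_lines(reference_md: str, parsed_md: str) -> list[str]:
    """Return reference lines whose normalized text is absent from the parsed output."""
    parsed_text = " ".join(normalize(l) for l in parsed_md.splitlines())
    parsed_text = re.sub(r"\s+", " ", parsed_text)
    missing = []
    for line in reference_md.splitlines():
        norm = normalize(line)
        if norm and norm not in parsed_text:
            missing.append(line)
    return missing


# Made-up sample: the second list item is absent from the parsed output.
reference = "# Programa\n- Primeira medida\n- Segunda medida\n"
parsed = "## Programa\n\n- Primeira medida\n"
print(missing_lines(reference, parsed))
```

In this notebook, the same function could be called with `markdown_content` as the reference and a parser's markdown export as the parsed text.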

### marker

In [None]:
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
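# Download the PDF to a local file
pdf_path = "downloaded_program.pdf"
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as a PDF
with open(pdf_path, "wb") as f:
    f.write(response.content)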

# Pass the downloaded PDF to the converter
rendered = converter(pdf_path)
text, _, images = text_from_rendered(rendered)
display(Markdown(text))

This looks better than the `docling` output. The main drawbacks are:
- Longer overhead (a few minutes spent downloading language models);
- It inserts a paragraph break (`\n\n`) where the text flows from one column into the next column on the right.
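
The spurious `\n\n` at column switches could probably be cleaned up in post-processing. The sketch below is a heuristic, not something `marker` provides: `merge_column_breaks` is a hypothetical helper and the sample text is made up. It joins a paragraph to the previous one when the previous paragraph does not end in sentence-final punctuation and the next one starts with a lowercase letter. It will miss breaks that happen to fall before a capitalized word.

```python
import re

# Punctuation treated as a legitimate end of paragraph (an assumption).
SENTENCE_END = (".", "!", "?", ":", ";")
# Blocks starting with these are headings, list items, or tables: never merge into them.
STRUCTURAL_START = ("#", "-", "*", "|")


def merge_column_breaks(markdown: str) -> str:
    """Join paragraphs that look like one sentence split by a column switch."""
    blocks = re.split(r"\n{2,}", markdown.strip())
    merged: list[str] = []
    for block in blocks:
        prev = merged[-1] if merged else ""
        continues = (
            bool(prev)
            and not prev.rstrip().endswith(SENTENCE_END)
            and not prev.lstrip().startswith(STRUCTURAL_START)
            and block[:1].islower()
        )
        if continues:
            merged[-1] = prev.rstrip() + " " + block.lstrip()
        else:
            merged.append(block)
    return "\n\n".join(merged)


# Made-up sample with a paragraph break in the middle of a sentence.
sample = (
    "# Título\n\n"
    "O LIVRE defende uma sociedade\n\n"
    "mais justa e sustentável.\n\n"
    "Nova secção."
)
print(merge_column_breaks(sample))
```

Being conservative here is deliberate: wrongly merging two real paragraphs is harder to spot afterwards than leaving an extra break in place.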

### Gemini 2.0 Flash

I might skip this one for now, as it is likely more expensive than the two options above. If one of them produces good enough output, an LLM would be overkill.

## Final notes

