<a href="https://colab.research.google.com/github/GiovanniPasq/agentic-rag-for-dummies/blob/main/pdf_to_md.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PDF to Markdown Conversion

Before building your RAG system, you’ll need to convert your PDF documents into a machine-readable format. **Markdown** is a common choice, but you can also export to **JSON**, **TXT**, or another format, depending on your pipeline and requirements.

There are two main approaches you can use:

### Option 1: VLM-Based Conversion (Recommended for Complex Documents)

This approach uses a **Vision-Language Model (VLM)** to visually process each PDF page. It extracts text, preserves the original layout, and generates descriptions for visual elements such as images, tables, and charts.

#### Why choose this method?

- **Best for:** Complex layouts, multi-column documents, tables, charts, and image-heavy PDFs.  
- **Pros:** High accuracy and excellent handling of both text and non-text elements.  
- **Cons:** Higher computational cost and slower performance on large document batches.  
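
Part of that cost comes from render resolution: each page is sent to the model as an image, and PDF pages are measured in points (1/72 inch), so the pixel count grows with the square of the DPI. As a rough sketch (the helper name `rendered_size` is illustrative, and a real renderer may differ by a pixel), the following estimates the image size of an A4 page at a few resolutions:

```python
# PDF page dimensions are expressed in points; 1 point = 1/72 inch.
POINTS_PER_INCH = 72

def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Return the approximate (width, height) in pixels of a page rendered at `dpi`."""
    zoom = dpi / POINTS_PER_INCH  # e.g. 300 / 72 for a 300 DPI render
    return round(width_pt * zoom), round(height_pt * zoom)

# An A4 page is roughly 595 x 842 points.
for dpi in (72, 150, 300):
    width, height = rendered_size(595, 842, dpi)
    print(f"{dpi} DPI -> {width} x {height} px ({width * height / 1e6:.1f} megapixels)")
```

Lowering the DPI reduces upload size and processing time, at the risk of losing small text. Which trade-off is acceptable depends on your documents.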

The examples below use the **Google Gemini API** for VLM-based conversion. You can adapt this approach to other models or frameworks, such as **OpenAI**, **Ollama**, **Claude**, or **Hugging Face**, based on your tools and infrastructure.

For more information, see the official documentation:  
**[Gemini API – Image Understanding](https://ai.google.dev/gemini-api/docs/image-understanding)**
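
One way to keep the pipeline portable across providers is to isolate the model call behind a plain function. The sketch below is an assumption about how you might structure this, not part of any provider's API: `PageDescriber`, `convert_pages`, and `fake_describer` are hypothetical names, and the stand-in backend only reports the image size.

```python
from typing import Callable, Dict, Iterable

# Hypothetical type: any function that takes PNG bytes and returns markdown text.
PageDescriber = Callable[[bytes], str]

def convert_pages(page_images: Iterable[bytes], describe: PageDescriber) -> Dict[int, str]:
    """Run `describe` on each page image and collect the results by 1-based page number."""
    pages = {}
    for page_num, png_bytes in enumerate(page_images, start=1):
        pages[page_num] = describe(png_bytes)
    return pages

# Stand-in backend for illustration only; a real one would call Gemini, OpenAI, Ollama, etc.
def fake_describer(png_bytes: bytes) -> str:
    return f"(page image of {len(png_bytes)} bytes)"

result = convert_pages([b"abc", b"defgh"], fake_describer)
print(result)
```

Swapping providers then means writing one new function that accepts PNG bytes and returns markdown.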

In [None]:
# Customize this system prompt based on your PDF type (e.g., academic, technical, legal).
# This template is a general-purpose starting point; tweak the rules as needed for your use case.
SYSTEM_PROMPT = """You are an expert document parser specializing in converting PDF pages to markdown format.

**Your task:**
Extract ALL content from the provided page image and return it as clean, well-structured markdown.

**Text Extraction Rules:**
1. Preserve the EXACT text as written (including typos, formatting, special characters)
2. Maintain the logical reading order (top-to-bottom, left-to-right)
3. Preserve hierarchical structure using appropriate markdown headers (#, ##, ###)
4. Keep paragraph breaks and line spacing as they appear
5. Use markdown lists (-, *, 1.) for bullet points and numbered lists
6. Preserve text emphasis: **bold**, *italic*, `code`
7. For multi-column layouts, extract left column first, then right column

**Tables:**
- Convert all tables to markdown table format
- Preserve column alignment and structure
- Use | to separate columns and a --- row to separate the header from the body

**Mathematical Formulas:**
- Convert to LaTeX format: inline `$formula$`, display `$$formula$$`
- If LaTeX conversion is uncertain, describe the formula clearly

**Images, Diagrams, Charts:**
- Insert markdown image placeholder: `![Description](image)`
- Provide a detailed, informative description including:
  * Type of visual (photo, diagram, chart, graph, illustration)
  * Main subject or purpose
  * Key elements, labels, or data points
  * Colors, patterns, or notable visual features
  * Context or relationship to surrounding text
- For charts/graphs: mention axes, data trends, and key values
- For diagrams: describe components and their relationships

**Special Elements:**
- Footnotes: Use markdown footnote syntax `[^1]`
- Citations: Preserve as written
- Code blocks: Use triple backticks with language specification
- Quotes: Use `>` for blockquotes
- Links: Preserve as `[text](url)` if visible

**Quality Guidelines:**
- DO NOT add explanations, comments, or meta-information
- DO NOT skip or summarize content
- DO NOT invent or hallucinate text not present in the image
- DO NOT include "Here is the markdown..." or similar preambles
- Output ONLY the markdown content, nothing else

**Output Format:**
Return raw markdown with no wrapper, no code blocks, no explanations.
Start immediately with the page content.
""".strip()

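Even with the output rules above, a model may occasionally wrap its answer in a code fence. A small post-processing step can remove that wrapper before saving. This is a minimal sketch: `strip_markdown_fence` is a hypothetical helper, and it assumes the wrapper spans the whole response, so a page that legitimately begins with an untagged code block and ends with another would be misread.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed

# Opening fence (optionally tagged "markdown" or "md") at the very start,
# closing fence at the very end, everything in between captured.
_WRAPPER_RE = re.compile(
    rf"\A{FENCE}(?:markdown|md)?[ \t]*\n(.*?)\n?{FENCE}[ \t]*\Z",
    re.DOTALL,
)

def strip_markdown_fence(text: str) -> str:
    """Remove a single outer code fence if the whole response is wrapped in one."""
    text = text.strip()
    match = _WRAPPER_RE.match(text)
    return match.group(1).strip() if match else text

wrapped = f"{FENCE}markdown\n# Title\n\nBody text\n{FENCE}"
print(strip_markdown_fence(wrapped))
```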
In [None]:
!pip install PyMuPDF google-genai

In [None]:
import os
import fitz  # PyMuPDF
from google import genai
from google.genai import types

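def split_pdf_and_describe(pdf_path, api_key):
    """Render each PDF page to an image and ask Gemini to transcribe it as markdown.

    Returns a dict mapping 1-based page numbers to markdown strings.
    """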
    client = genai.Client(api_key=api_key)
    pdf_document = fitz.open(pdf_path)
    markdown_pages = {}

    for page_num in range(pdf_document.page_count):
        try:
            page = pdf_document[page_num]

            pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72)) # Convert page to high-res image (300 DPI)
            img_data = pix.tobytes("png")

            image = types.Part.from_bytes(data=img_data, mime_type="image/png")

            response = client.models.generate_content(
                config=types.GenerateContentConfig(
                    system_instruction=SYSTEM_PROMPT,
                    temperature=0.1
                ),
                model="gemini-2.0-flash",
                contents=["Convert this PDF page to clean, structured markdown.", image],
            )

            markdown_pages[page_num + 1] = response.text
            print(f"✓ Processed page {page_num + 1}/{pdf_document.page_count}")

        except Exception as e:
            print(f"✗ Error on page {page_num + 1}: {e}")
            markdown_pages[page_num + 1] = "Error processing page"

    pdf_document.close()
    return markdown_pages

def process_pdf_folder(folder_path, api_key):
    """Convert every PDF in folder_path and write one .md file per PDF to md_output/."""
    os.makedirs("md_output", exist_ok=True)

    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(folder_path, filename)
            pdf_name = os.path.splitext(filename)[0]
            markdown_pages = split_pdf_and_describe(pdf_path, api_key)

            with open(os.path.join("md_output", f"{pdf_name}.md"), 'w', encoding='utf-8') as f:
                f.write("\n\n".join([f"# Page {page_num}\n\n{content}" for page_num, content in markdown_pages.items()]))

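If your pipeline prefers JSON over Markdown, the per-page dictionary returned by `split_pdf_and_describe` can be serialized directly. The sketch below assumes a simple record layout (`page` and `markdown` keys) chosen for illustration; `pages_to_records` and `save_pages_as_json` are hypothetical helper names, and the sample dictionary is made-up data.

```python
import json
from pathlib import Path

def pages_to_records(markdown_pages: dict) -> list:
    """Turn {page_number: markdown} into a list of records ordered by page."""
    return [
        {"page": page_num, "markdown": content}
        for page_num, content in sorted(markdown_pages.items())
    ]

def save_pages_as_json(markdown_pages: dict, output_file: str) -> None:
    """Write the page records to a UTF-8 encoded JSON file."""
    records = pages_to_records(markdown_pages)
    Path(output_file).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

# Made-up sample in the same shape as the dictionary returned above.
sample_pages = {2: "Second page text", 1: "# Title"}
records = pages_to_records(sample_pages)
print(json.dumps(records, indent=2))
```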
### Option 2: OCR-Based Conversion

This approach uses **Optical Character Recognition (OCR)** to extract text and tables from PDF documents. It is particularly effective for **scanned PDFs** or files that do not contain selectable digital text.

A typical implementation uses **Docling** for OCR and table extraction, optionally combined with a **Vision-Language Model (VLM)** for image captioning. This setup recognizes text, preserves table structure, and generates descriptions for images and other visual elements.

#### Why choose this method?

- **Best for:** Scanned documents, image-based PDFs, and files without embedded text.  
- **Pros:** Reliable text extraction, excellent table structure preservation, and flexible integration with VLMs for image captions.  
- **Cons:** Slightly lower accuracy on complex layouts compared to full VLM-based parsing.  

**References and tools:**

- **Docling – Picture Description:** [https://docling-project.github.io/docling/examples/pictures_description](https://docling-project.github.io/docling/examples/pictures_description)  
- **PaddleOCR Repository:** [https://github.com/PaddlePaddle/PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)

Both tools can export results in **Markdown** format and optionally use a VLM to replace images with captions generated by the model.

For image captioning with **Docling**, you can plug in your own VLM.  
You can also adopt a **hybrid approach**: convert images to **Base64** with Docling and embed them directly in the Markdown file.  
Afterward, send the complete Markdown (including the Base64-encoded images) to **Gemini** or another VLM to generate higher-quality, context-aware captions.
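
The Base64 embedding step in the hybrid approach amounts to writing each image as a data URI inside a Markdown image tag. Docling's export options may already do this for you; the sketch below only illustrates the resulting format, and `image_to_markdown` is a hypothetical helper.

```python
import base64

def image_to_markdown(png_bytes: bytes, alt_text: str = "Image") -> str:
    """Return a Markdown image tag that embeds the PNG bytes as a Base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"![{alt_text}](data:image/png;base64,{encoded})"

# Placeholder bytes stand in for real PNG data here.
print(image_to_markdown(b"hello", alt_text="Figure 1"))
```

Keep in mind that Base64 inflates image data by roughly a third, so Markdown files with many embedded images grow quickly.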

In [None]:
!pip install docling

In [None]:
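from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, smolvlm_picture_description
from docling.document_converter import DocumentConverter, PdfFormatOption

def convert_pdfs_to_markdown(pdf_folder: str, output_folder: str):
    """Convert every PDF in pdf_folder to Markdown using Docling, with SmolVLM picture descriptions."""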
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_table_structure = True
    pipeline_options.do_picture_description = True
    pipeline_options.picture_description_options = smolvlm_picture_description
    pipeline_options.picture_description_options.prompt = (
        "Describe the image in detail, focusing on key elements and context, "
        "while being concise and accurate."
    )
    pipeline_options.images_scale = 2.0
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)

    pdf_files = list(Path(pdf_folder).glob("*.pdf"))

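    for pdf_file in pdf_files:
        try:
            # Run the Docling conversion pipeline configured above.
            result = converter.convert(str(pdf_file))
            doc = result.document

            # Export the parsed document to Markdown.
            markdown_content = doc.export_to_markdown()

            # Write one .md file per PDF, named after the source file.
            output_file = output_path / f"{pdf_file.stem}.md"
            output_file.write_text(markdown_content, encoding='utf-8')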

        except Exception as e:
            print(f"Error processing {pdf_file.name}: {e}")

    print(f"\nConversion complete! Output in '{output_folder}'")
