# Example notebook demonstrating the extract_markdown functionality
This notebook shows how to extract markdown from PDFs with `extract_markdown`, using different engines and image configurations.

Two extraction engines are supported (Mistral OCR and Docling), each with three image configurations: "embedded", "referenced", and "placeholder".

## Imports

In [None]:
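from dotenv import load_dotenv
from IPython.display import Markdown, display

from llm_synthesis.utils import extract_markdown

# Load environment variables (such as API keys) from a local .env file,
# overriding any values already set in the environment
load_dotenv(override=True)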

## Set Example File Path

In [8]:
# Absolute path on the author's machine; point this at a PDF on your own system
PDF_PATH = "/Users/siddharthbetala/Desktop/llm-synthesis/data/pdf_papers/s41586-021-03819-2.pdf"

## Configuration 1 - Mistral with Embedded Images

The "embedded" configuration embeds images directly in the markdown as base64 data URIs.

In [14]:
mistral_embedded = extract_markdown(
    pdf_path=PDF_PATH,
    engine="mistral",  # Use Mistral OCR engine
    image_mode="embedded",  # Embed images as data URIs
    save_markdown=False,  # Don't save to disk, just return content
)

In [None]:
display(Markdown(mistral_embedded))
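
Embedded output can be large, so it may help to check how many images were inlined and how big they are before rendering. The cell below is a minimal sketch, assuming the engine writes images in standard markdown syntax with a `data:image/...;base64,` target. The helper `summarize_embedded_images` and the sample string are hypothetical and not part of `llm_synthesis`.

```python
import base64
import re

# Matches markdown images whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,AAAA...)
# Assumption: the extraction engines emit this standard form.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:(?P<mime>image/[\w.+-]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def summarize_embedded_images(markdown: str) -> list[tuple[str, str, int]]:
    """Return (alt text, MIME type, decoded size in bytes) for each embedded image."""
    return [
        (m["alt"], m["mime"], len(base64.b64decode(m["data"])))
        for m in DATA_URI_IMAGE.finditer(markdown)
    ]


# Tiny hand-made sample so the cell runs without calling any engine.
sample_embedded = "Intro\n\n![fig1](data:image/png;base64,aGVsbG8=)\n\nMore text"
print(summarize_embedded_images(sample_embedded))
```

To inspect real output, pass `mistral_embedded` instead of the sample string.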

## Configuration 2 - Docling with Referenced Images

The "referenced" configuration saves images as separate files and references them from the markdown. The output markdown and images can be found in the results folder.

In [None]:
docling_referenced = extract_markdown(
    pdf_path=PDF_PATH,
    engine="docling",  # Use Docling engine
    image_mode="referenced",  # Save images as files and reference them
    root_dir=".",
    save_markdown=False,
)

In [None]:
docling_referenced[:2000]

In [None]:
display(Markdown(docling_referenced))
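
With referenced images, the markdown only renders correctly if the image files it points to are present. The cell below is a rough sketch for listing those targets and checking them on disk. It assumes plain `![alt](path)` syntax with paths relative to `root_dir`, which has not been verified against the engines. The helper `referenced_image_paths` and the sample string are hypothetical.

```python
import re
from pathlib import Path

# Matches markdown image syntax and captures the target: ![alt](target)
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def referenced_image_paths(markdown: str, root_dir: str = ".") -> list[Path]:
    """Collect image targets that point at files, skipping data URIs and URLs."""
    paths = []
    for target in MARKDOWN_IMAGE.findall(markdown):
        if target.startswith(("data:", "http://", "https://")):
            continue
        # Assumption: file targets are relative to root_dir.
        paths.append(Path(root_dir) / target)
    return paths


# Hand-made sample so the cell runs without calling any engine.
sample_referenced = "![a](images/fig_1.png)\n![b](data:image/png;base64,aGVsbG8=)"
for path in referenced_image_paths(sample_referenced):
    print(path, "exists" if path.exists() else "missing")
```

To check real output, pass `docling_referenced` and the same `root_dir` used in the extraction call.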

## Configuration 3 - Mistral with Placeholder Images

The "placeholder" configuration replaces images with placeholder text.

In [9]:
mistral_placeholder = extract_markdown(
    pdf_path=PDF_PATH,
    engine="mistral",  # Use Mistral OCR engine
    image_mode="placeholder",  # Replace images with placeholder text
    save_markdown=False,  # Don't save to disk, just return content
)

In [None]:
display(Markdown(mistral_placeholder))
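
The three cells above cover only part of the two-engine, three-mode grid. As a rough sketch, the remaining combinations could be run in a loop. The helper `extract_all` and the stand-in `fake_extract` are hypothetical. The sketch also assumes every engine accepts every image mode with only the keyword arguments shown, which may not hold (for example, "referenced" mode may also need `root_dir`).

```python
from itertools import product
from typing import Callable

ENGINES = ("mistral", "docling")
IMAGE_MODES = ("embedded", "referenced", "placeholder")


def extract_all(extract: Callable[..., str], pdf_path: str) -> dict[tuple[str, str], str]:
    """Run `extract` once per (engine, image_mode) pair and collect the markdown."""
    results = {}
    for engine, image_mode in product(ENGINES, IMAGE_MODES):
        results[(engine, image_mode)] = extract(
            pdf_path=pdf_path,
            engine=engine,
            image_mode=image_mode,
            save_markdown=False,
        )
    return results


# Stand-in extractor so the sketch runs without API keys or a PDF;
# in the notebook, pass extract_markdown instead.
def fake_extract(pdf_path, engine, image_mode, save_markdown):
    return f"# {engine} / {image_mode}"


for (engine, image_mode), markdown in extract_all(fake_extract, "paper.pdf").items():
    print(engine, image_mode, len(markdown))
```

Calling `extract_all(extract_markdown, PDF_PATH)` would then return a dictionary keyed by `(engine, image_mode)`, which makes it easy to compare output lengths across configurations.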