# Customizing PdfItDown

Once your application grows more complex, you may want to customize PdfItDown so that it can perform document-specific tasks tailored to your use case.

PdfItDown is designed to offer this flexibility: you can pass a custom `conversion_callback` to the `Converter` class constructor.

The `conversion_callback`, though, has to follow this function signature:

```python
def conversion_callback(
    input_path: str,
    output_path: str,
    title: str | None,
    overwrite: bool,
) -> str:  # the examples in this guide return the output path
    ...
```

While the parameter names can vary, their order must not change.

## Example 1: Turn Word Documents into FAQ Sheets

The use case for this example is simple: teams often drown in Google Docs and don't know where to start. PdfItDown can already turn `.docx` documents into PDFs, but here we will use custom logic to extract an abstract and a set of FAQs from the document and save both into the PDF. For this, we will use [LlamaExtract]() by LlamaIndex, together with LlamaParse to read the document.


In [4]:
# Install needed dependencies
! pip install -q "pdfitdown>=2.0.0b2" llama-cloud-services

In [6]:
# Let's set up the API key for LlamaCloud
# Get yours at https://cloud.llamaindex.ai !
import os
from getpass import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass("LlamaCloud API Key: ")

LlamaCloud API Key: ··········


In [None]:
# let's define the utilities for the FAQs and abstract extraction

from llama_cloud_services.extract import LlamaExtract, ExtractConfig, SourceText
from llama_cloud.types.extract_run import ExtractRun
from llama_cloud.types.extract_mode import ExtractMode
from llama_cloud_services.parse.utils import ResultType
from llama_cloud_services import LlamaParse
from typing import cast
from pydantic import BaseModel, Field


class QuestionAndAnswer(BaseModel):
    question: str = Field(description="Frequently asked question")
    answer: str = Field(description="Answer to the question")


class DocumentBrief(BaseModel):
    abstract: str = Field(description="Abstract of the document")
    faqs: list[QuestionAndAnswer] = Field(description="FAQs about the document")


extractor = LlamaExtract(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
extract_config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL,
)
parser = LlamaParse(api_key=os.getenv("LLAMA_CLOUD_API_KEY"), result_type=ResultType.MD)


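def extract_faqs_and_abstract(input_path: str) -> DocumentBrief:
    # Parse the source document into Markdown with LlamaParse
    result = parser.parse(file_path=input_path)
    docs = result.get_markdown_documents()
    text = "\n".join([doc.text for doc in docs])
    # Run structured extraction over the parsed text, using DocumentBrief as the schema
    extract_result = extractor.extract(
        data_schema=DocumentBrief,
        config=extract_config,
        files=SourceText(text_content=text),
    )
    # Validate the raw extraction payload against the Pydantic model
    data = cast(ExtractRun, extract_result).data
    return DocumentBrief.model_validate(data)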

In [None]:
# Now let's create the custom PDF converter
from markdown_pdf import MarkdownPdf, Section
from pathlib import Path


def conversion_callback(
    input_path: str, output_path: str, title: str | None = None, overwrite: bool = True
) -> str:
    # Refuse to replace an existing file unless overwriting was requested
    if Path(output_path).exists() and not overwrite:
        raise FileExistsError(f"File {output_path} already exists")
    # Extract the abstract and FAQs from the input document
    document_brief_data = extract_faqs_and_abstract(input_path=input_path)
    abstract = document_brief_data.abstract
    faqs = document_brief_data.faqs
    # Assemble the brief as Markdown: abstract first, then one heading per question
    content = f"# Brief for {input_path}\n\n## Abstract\n\n{abstract}\n\n## FAQs\n\n"
    for faq in faqs:
        content += f"### {faq.question}\n\n{faq.answer}\n\n"
    # Render the Markdown to PDF and set the document title
    pdf = MarkdownPdf(toc_level=0)
    pdf.add_section(Section(content))
    pdf.meta["title"] = title or f"Brief for {input_path}"
    pdf.save(output_path)
    return output_path
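
The Markdown assembly in the callback above can be exercised without calling LlamaCloud. The sketch below pulls that step into a hypothetical helper, `build_brief_markdown`, and uses a plain dataclass (`FaqItem`, also hypothetical) as a stand-in for `QuestionAndAnswer`. It only illustrates the layout of the brief; the sample strings are made up.

```python
from dataclasses import dataclass


@dataclass
class FaqItem:
    # Stand-in for QuestionAndAnswer, so the sketch needs no third-party packages
    question: str
    answer: str


def build_brief_markdown(source_name: str, abstract: str, faqs: list[FaqItem]) -> str:
    # Top-level heading and abstract, mirroring the layout used in the callback
    content = f"# Brief for {source_name}\n\n## Abstract\n\n{abstract}\n\n## FAQs\n\n"
    # One third-level heading per question, followed by its answer
    for faq in faqs:
        content += f"### {faq.question}\n\n{faq.answer}\n\n"
    return content


# Made-up sample values, for illustration only
markdown = build_brief_markdown(
    "strategic_plan.docx",
    "A three-year plan.",
    [FaqItem("Who owns the plan?", "The strategy team.")],
)
print(markdown)
```

Keeping this step separate from the extraction and rendering calls makes it possible to check the structure of the brief before spending API credits.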

In [24]:
# Now let's convert!
from pdfitdown.pdfconversion import Converter

converter = Converter(conversion_callback=conversion_callback)

In [25]:
converter.convert(
    file_path="data/strategic_plan.docx",
    output_path="strategic_plan_brief.pdf",
    overwrite=True,
    title="Strategic Plan Brief",
)

Started parsing the file under job_id fa2cfe74-0a56-4b04-ac22-2336b1334be8


'strategic_plan_brief.pdf'

## Example 2: Generate PDF documents from LLM responses

LLMs typically format their responses as Markdown. We can exploit this to turn those responses into PDF documents and build, for instance, a chatbot that delivers news reports in PDF format.

We will use [Linkup]() to fetch the news and [OpenAI Structured Output]() to produce the report.

In [None]:
# install needed dependencies

! pip install linkup-sdk openai

In [27]:
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
os.environ["LINKUP_API_KEY"] = getpass("Linkup API key: ")

OpenAI API key: ··········
Linkup API key: ··········


In [28]:
from linkup import LinkupClient, LinkupSourcedAnswer
from openai import OpenAI

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
linkup_client = LinkupClient(api_key=os.getenv("LINKUP_API_KEY"))

In [30]:
# let's define our news search utilities


def search_news(query: str) -> LinkupSourcedAnswer:
    response = linkup_client.search(
        query=query,
        depth="standard",
        output_type="sourcedAnswer",
        include_images=False,
        include_inline_citations=False,
    )
    return response
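
The callback later in this example flattens the search result into a single Markdown string before handing it to the LLM. Here is a minimal sketch of that step. It uses hypothetical dataclasses (`SourceStub`, `AnswerStub`) that only mimic the `answer`, `sources`, `url`, and `snippet` attributes used in this guide, so it runs without a Linkup API key. The helper name `format_search_context` is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SourceStub:
    # Mimics the two source attributes this guide reads
    url: str
    snippet: str


@dataclass
class AnswerStub:
    # Mimics the answer text and its list of sources
    answer: str
    sources: list[SourceStub] = field(default_factory=list)


def format_search_context(response: AnswerStub) -> str:
    # Each source becomes a bold URL followed by its snippet
    sources_md = "\n".join(
        f"**{source.url}**\n\n{source.snippet}\n" for source in response.sources
    )
    # Blank lines around the headings keep them from merging with the answer text
    return f"# Answer\n\n{response.answer}\n\n# Sources\n\n{sources_md}"


# Placeholder data, for illustration only
stub = AnswerStub(
    answer="Example answer.",
    sources=[SourceStub(url="https://example.com/a", snippet="Example snippet.")],
)
context = format_search_context(stub)
```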

In [29]:
# let's create the report generation functions


class NewsReport(BaseModel):
    title: str = Field(description="Title of the report")
    report: str = Field(description="News report on a given topic")
    sources: list[str] = Field(description="Sources for the report")


def llm_generate(prompt: str, information: str) -> NewsReport | None:
    result = openai_client.responses.parse(
        model="gpt-4.1",
        input=[
            {
                "role": "system",
                "content": "You are a reporter who is tasked, given the provided information and user prompt, to produce a comprehensive report, assigning it a title and listing the sources.",
            },
            {
                "role": "user",
                "content": f"Initial user prompt:\n\n'''\n{prompt}\n'''\n\nCollected information from the web:\n\n'''\n{information}\n'''\n\n",
            },
        ],
        text_format=NewsReport,
    )
    return result.output_parsed

In [31]:
# let's define the conversion callback, using a little trick: we take the first
# input parameter as the news query (not a file path)!


def conversion_callback(
    query: str, output_path: str, title: str | None = None, overwrite: bool = True
) -> str:
    if Path(output_path).exists() and not overwrite:
        raise FileExistsError(f"File {output_path} already exists")
    response = search_news(query=query)
    information = response.answer
    # Blank lines before "# Sources" keep the heading from being glued to the answer
    complete_info = (
        "# Answer\n\n"
        + information
        + "\n\n# Sources\n\n"
        + "\n".join(
            [f"**{source.url}**\n\n{source.snippet}\n\n" for source in response.sources]
        )
    )
    report = llm_generate(prompt=query, information=complete_info)
    if report is None:
        raise ValueError("LLM returned None")
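    # Join the sources outside the f-string: backslashes inside f-string
    # expressions are a SyntaxError before Python 3.12
    sources_md = "\n".join(f"- {source}" for source in report.sources)
    content = f"# {report.title}\n\n{report.report}\n\n## Sources\n\n{sources_md}"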
    pdf = MarkdownPdf(toc_level=0)
    pdf.add_section(Section(content))
    pdf.meta["title"] = title or report.title
    pdf.save(output_path)
    return output_path
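
The final Markdown for the report can also be assembled by a small pure function, which makes the layout easy to check before rendering. The sketch below uses a hypothetical helper, `render_report_markdown`. It joins the sources before building the f-string, since backslashes inside f-string expressions are only accepted from Python 3.12 onward, and it renders the sources as a bullet list, which is a formatting choice of this sketch.

```python
def render_report_markdown(title: str, report: str, sources: list[str]) -> str:
    # Build the bullet list first so no backslash appears inside an f-string expression
    sources_md = "\n".join(f"- {source}" for source in sources)
    return f"# {title}\n\n{report}\n\n## Sources\n\n{sources_md}"


# Placeholder values, for illustration only
content = render_report_markdown(
    "Example title",
    "Example body.",
    ["https://example.com/a", "https://example.com/b"],
)
```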

In [32]:
# Now let's convert!
converter = Converter(conversion_callback=conversion_callback)

In [33]:
converter.convert(
    file_path="Most recent breaking news about the EU",
    output_path="eu_news_report.pdf",
    overwrite=True,
    title="EU Breaking News",
)

'eu_news_report.pdf'
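
Both callbacks in this guide begin with the same overwrite check. One way to avoid repeating it is a small helper; `ensure_can_write` is a hypothetical name and not part of PdfItDown.

```python
import tempfile
from pathlib import Path


def ensure_can_write(output_path: str, overwrite: bool) -> None:
    # Refuse to replace an existing file unless the caller opted in
    if Path(output_path).exists() and not overwrite:
        raise FileExistsError(f"File {output_path} already exists")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "report.pdf"
    target.write_bytes(b"")  # simulate a file left over from a previous run
    ensure_can_write(str(target), overwrite=True)  # allowed: overwrite requested
    try:
        ensure_can_write(str(target), overwrite=False)
        blocked = False
    except FileExistsError:
        blocked = True
```

A callback could call this helper as its first statement, keeping the guard in one place if more callbacks are added later.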