# Agentic AI with Docling as MCP tool

### Overview

In this lab, we will introduce **Docling MCP**, a service that provides tools for document conversion, processing, and generation. It also has extensions for leveraging popular agentic frameworks.

It uses the Docling library to convert PDF documents into structured formats and provides a caching mechanism to improve performance. The service exposes functionality through a set of tools that can be called by client applications.
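
The caching idea can be illustrated with a minimal sketch. Everything in it is hypothetical: `fake_convert` stands in for a real Docling conversion, and `ConversionCache` is an illustrative name, not part of Docling MCP. It only shows the general pattern of keying a conversion result on a hash of the source so that repeated requests skip the expensive step.

```python
import hashlib


def fake_convert(source: str) -> str:
    """Hypothetical stand-in for an expensive document conversion."""
    return f"# Converted content of {source}"


class ConversionCache:
    """Illustrative in-memory cache keyed by a hash of the source string."""

    def __init__(self, convert=fake_convert):
        self._convert = convert
        self._store: dict[str, str] = {}
        self.misses = 0  # counts how often the converter actually ran

    def get(self, source: str) -> tuple[str, str]:
        """Return (document_key, content), converting only on a cache miss."""
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._convert(source)
        return key, self._store[key]


cache = ConversionCache()
key_a, _ = cache.get("https://arxiv.org/pdf/2408.09869")
key_b, _ = cache.get("https://arxiv.org/pdf/2408.09869")  # served from the cache
print(key_a == key_b, cache.misses)
```

A real service would likely also include conversion options in the key and bound the cache size; both are left out here for brevity.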

---


### Technologies We'll Use

Building on our previous labs, we will leverage:

1. **[Docling](https://docling-project.github.io/docling/)**: An open-source toolkit used to parse and convert documents.
2. **[MCP](https://modelcontextprotocol.io)**: The Model Context Protocol, an open standard for exposing tools to LLM applications.
3. **[Llama Stack](https://llama-stack.readthedocs.io/)**: A backend for building generative AI applications that exposes standard APIs.
4. **[OpenAI Agents SDK](https://openai.github.io/openai-agents-python/)**: A lightweight framework for building agentic AI apps.

---

#### Notebook dependencies

Create a virtual environment to run this notebook, for instance, with [uv](https://docs.astral.sh/uv/), and install the necessary dependencies:

In [1]:
!uv pip install llama-stack-client openai-agents

Using Python 3.12.9 environment at: /Users/dol/projects/tx25/techxchange2025-lab3640-docling/.venv
Audited 2 packages in 7ms


Import the necessary classes and methods:

In [2]:
import uuid
import logging

from rich.console import Console
from rich.markdown import Markdown

from agents import Agent, ModelSettings, Runner, SQLiteSession, set_trace_processors, set_tracing_disabled, ItemHelpers, FileSearchTool
from agents.mcp import MCPServerStreamableHttp, ToolFilterStatic
from agents.models.openai_provider import OpenAIProvider
from agents.run import RunConfig
from agents.items import ResponseFunctionToolCall, ResponseFileSearchToolCall
from agents.tracing.processors import BatchTraceProcessor, ConsoleSpanExporter
from openai import AsyncOpenAI
from llama_stack_client import LlamaStackClient

console = Console(width=100, soft_wrap=True)

Create the components that we will leverage for creating and running agents, from **Llama Stack** and **OpenAI Agents**.

Note that in this example we will use the Meta Llama 3.3 70B language model, but you can experiment with other models.

In [3]:
LLS_URL = "http://localhost:8321"
BASE_URL = f"{LLS_URL}/v1/openai/v1"
API_KEY = "none"
# MODEL_ID = "vllm/gpt-oss-120b"
MODEL_ID = "vllm/llama-3-3-70b"
# MODEL_ID = "vllm/Llama-4-Maverick"

In [4]:
# Llama Stack Client
lls_client = LlamaStackClient(base_url=LLS_URL)

# Model client
client = AsyncOpenAI(base_url=BASE_URL, api_key=API_KEY)

# Configure the OpenAI provider that uses our AsyncOpenAI client for Llama Stack
provider = OpenAIProvider(openai_client=client)

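# Disable tracing (there is no OpenAI backend to upload traces to).
# The console processor registered below only emits output if tracing is re-enabled.
set_tracing_disabled(True)
set_trace_processors([BatchTraceProcessor(exporter=ConsoleSpanExporter())])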

# Set up quiet logging
logging.getLogger("openai.agents").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)


In [5]:
instructions="""You are an assistant that uses external tools.  
Follow these rules strictly:

1. At each reasoning step, you may call **at most one tool**.  
2. If you need to use multiple tools, do it sequentially:  
   - Call exactly one tool.  
   - Wait for its result.  
   - Then think again and, if needed, call another tool.  
3. Never call two or more tools in the same response.  
4. If no tool is needed, just provide your reasoning or final answer.  
5. Treat a missing tool result as blocking — do not continue reasoning until you receive the output.

If you ever attempt to call more than one tool at once, your answer will be rejected.  
Always restrict yourself to **zero or one tool call per response**.

When a tool returns a base64 image, wrap it in Markdown so that it can be displayed, and take special care not to hallucinate any byte.
"""

### Agent definition

In the following block we define the Document Agent with the following settings:

1. Connection to the Docling MCP server.
2. Selection of the tools to use in Docling.
3. (Optional) Any additional tools, such as the built-in File Search.

The `run_agent()` function is called in the use cases below with different user prompts and tools.

In [None]:
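async def run_agent(
    queries: list[str] | None = None,
    extra_tools: list | None = None,
    allowed_tools: list[str] | None = None,
):
    # Avoid mutable default arguments; treat missing values as empty lists
    queries = queries or []
    extra_tools = extra_tools or []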
    async with MCPServerStreamableHttp(
        name="Docling MCP",
        params={
            "url": "http://localhost:8000/mcp",
            "timeout": 180.0,
        },
        client_session_timeout_seconds=180,
        tool_filter=ToolFilterStatic(allowed_tool_names=allowed_tools) if allowed_tools else None,
    ) as server:
        agent = Agent(
            name="Document Agent",
            model=MODEL_ID,
            instructions=instructions,
            model_settings=ModelSettings(
                temperature=0, top_p=0.9,
                parallel_tool_calls=False,
                tool_choice="required",
            ),
            mcp_servers=[server],
            tools=extra_tools,
        )
        session = SQLiteSession(str(uuid.uuid4()))
        print(f"Created session_id={session.session_id} for Agent({agent.name})")

        # user_queries = [instructions, *queries]
        user_queries = queries

        for prompt in user_queries:
            console.print(f"[cyan]👤 User> {prompt}[/cyan]")
            # Launch the agent runner
            result = Runner.run_streamed(
                agent,
                prompt,
                session=session,
                run_config=RunConfig(model_provider=provider),
            )

            # Print the events as they appear from the agent stream
            async for event in result.stream_events():
                # We'll ignore the raw responses event deltas
                if event.type == "raw_response_event":
                    continue
                # When the agent updates, print that
                elif event.type == "agent_updated_stream_event":
                    console.print(f"Agent updated: {event.new_agent.name}")
                    continue
                # When items are generated, print them
                elif event.type == "run_item_stream_event":
                    if event.item.type == "tool_call_item":
                        raw_item = event.item.raw_item
                        if isinstance(raw_item, ResponseFunctionToolCall):
                            console.print(f"[yellow]-- Tool was called: {raw_item.name}({raw_item.arguments.strip()})[/yellow]")
                    elif event.item.type == "tool_call_output_item":
                        console.print(f"[yellow]-- Tool output: {event.item.output}[/yellow]")
                    elif event.item.type == "message_output_item":
                        md = Markdown(ItemHelpers.text_message_output(event.item))
                        console.print("[green]🤖 Assistant>[/green]")
                        console.print(md)
                    else:
                        # Report any other item type without further handling
                        print(f"other event: {event.item.type}")


---

## 1. Interact with documents

In this part of the lab, you will use the agent you just built to work with existing documents. The agent leverages both the reasoning capabilities of the LLM and the Docling MCP tools to interpret, process, and retrieve document content.

### Document conversion and exports

The Docling MCP tools can convert documents and return their content to the agent's model, which can then reason about it.
A classic use case is **document summarization**.

In [7]:
await run_agent(
    queries=[
        "Convert the document on https://arxiv.org/pdf/2408.09869 and give me a summary of the document from its markdown content.",
    ],
    extra_tools=[],
    allowed_tools=[
        "convert_document_into_docling_document",
        "export_docling_document_to_markdown",
    ]
)


INFO:mcp.client.streamable_http:Received session ID: 1d6cc25d04a44dc886b73617dddf1ce4
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18


Created session_id=45d60d37-e8e6-4ace-b05f-5c0629e851ff for Agent(Document Agent)


### Agentic RAG

For large documents or, more generally, for a corpus of documents, we can combine Docling's document conversion with Agentic RAG.

During ingestion, Docling processes the textual components of the documents as well as more complex structures such as tables and figures.

Agentic RAG improves on standard RAG pipelines in two ways:
1. It lets the model rephrase the retrieval query.
2. It lets the reasoning loop decide whether the retrieved context is relevant and sufficient, or whether a second or third retrieval iteration is needed.
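
The two points above can be sketched as a simple loop. This is a toy illustration under stated assumptions: `retrieve` is a naive keyword matcher standing in for a vector search, and the `rephrase` and `is_sufficient` callables are hypothetical placeholders for decisions that the LLM makes in the actual agent.

```python
def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Toy keyword retriever: return passages sharing a word with the query."""
    words = set(query.lower().split())
    return [text for text in corpus if words & set(text.lower().split())]


def agentic_rag(question, corpus, rephrase, is_sufficient, max_rounds=3):
    """Retrieve, judge the context, and rephrase the query until satisfied."""
    query, context = question, []
    for round_no in range(1, max_rounds + 1):
        context = retrieve(query, corpus)
        if is_sufficient(question, context):
            return context, round_no
        query = rephrase(question, context)  # in practice, the LLM does this
    return context, max_rounds


corpus = ["figures include captions", "tables have rows"]
context, rounds = agentic_rag(
    "chart handling",
    corpus,
    rephrase=lambda question, ctx: "figures captions",  # hypothetical rewrite
    is_sufficient=lambda question, ctx: bool(ctx),  # hypothetical judgment
)
print(context, rounds)
```

The first query finds nothing, so the loop rephrases it and succeeds on the second round; a standard RAG pipeline would have stopped after the first, empty retrieval.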

In [8]:
# Create a vectordb index in Llama Stack
vdb_id_resp = lls_client.vector_dbs.register(
    vector_db_id=(vdb_name := f"testdb_{uuid.uuid4()}"),
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="milvus",
)
vdb_id = vdb_id_resp.identifier

# Instruct the agent to ingest the data in the index and use it for querying.
await run_agent(
    queries=[
        f"Ingest the document https://arxiv.org/pdf/2206.01062 into the vectordb {vdb_id} and answer the following questions: 1) how many pages were manually annotated?",
    ],
    extra_tools=[
        FileSearchTool(vector_store_ids=[vdb_id])
    ],
    allowed_tools=[
        "convert_document_into_docling_document",
        "insert_document_to_vectordb",
    ],
)


INFO:mcp.client.streamable_http:Received session ID: 21d459c020bd47abbcbe2d05161ca239
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18


Created session_id=35aa75ef-30d2-40f9-929d-17eb7edd0d60 for Agent(Document Agent)


---

## 2. Create a document

Docling MCP can also be used to have a model write a complex document directly in the **DoclingDocument** format, which can then be exported to many output formats (Markdown, HTML, LaTeX, etc.).

In the following example, we give the agent only the topic and the high‑level structure of the document. The agent is responsible for filling out the rest.

In [9]:
await run_agent(
    queries=[
        "Create a new Docling document with the title \"Open-Source Agentic AI\", a paragraph, and a list of the main applications. Show the result in valid markdown.",
    ],
    extra_tools=[],
    allowed_tools=[
        "create_new_docling_document",
        "add_title_to_docling_document",
        "add_section_heading_to_docling_document",
        "add_paragraph_to_docling_document",
        "open_list_in_docling_document",
        "close_list_in_docling_document",
        "add_list_items_to_list_in_docling_document",
        "add_table_in_html_format_to_docling_document",
        "export_docling_document_to_markdown",
    ]
)

INFO:mcp.client.streamable_http:Received session ID: 3dac9bcf18f54919a42cd9d7bff942a4
INFO:mcp.client.streamable_http:Negotiated protocol version: 2025-06-18


Created session_id=89216e19-8799-4efe-83fc-b956f1a257f1 for Agent(Document Agent)
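
The sequence of tool calls that the agent issues for this task can be pictured with a toy document builder. `ToyDocument` is a hypothetical stand-in written for illustration; it is not the DoclingDocument API, and its methods only loosely mirror the tool names allowed above.

```python
class ToyDocument:
    """Hypothetical stand-in for a document built through incremental calls."""

    def __init__(self) -> None:
        self._blocks: list[str] = []
        self._list_open = False

    def add_title(self, text: str) -> None:
        self._blocks.append(f"# {text}")

    def add_paragraph(self, text: str) -> None:
        self._blocks.append(text)

    def open_list(self) -> None:
        self._list_open = True

    def add_list_items(self, items: list[str]) -> None:
        if not self._list_open:
            raise RuntimeError("open_list() must be called before adding items")
        self._blocks.append("\n".join(f"- {item}" for item in items))

    def close_list(self) -> None:
        self._list_open = False

    def export_to_markdown(self) -> str:
        # Separate blocks with blank lines, as Markdown expects
        return "\n\n".join(self._blocks)


doc = ToyDocument()
doc.add_title("Open-Source Agentic AI")
doc.add_paragraph("A short introduction.")
doc.open_list()
doc.add_list_items(["Document conversion", "Agentic RAG"])
doc.close_list()
markdown = doc.export_to_markdown()
print(markdown)
```

Building the document step by step, one call at a time, matches the one-tool-per-response rule in the agent instructions.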


---

## Next Steps: Where to Go from Here

### Immediate actions

1. **Experiment with your documents**
   - Try documents with complex layouts
   - Test with technical diagrams and charts
   - Process multi-page reports with mixed content

2. **Connect more agents**
   - Try connecting more tools
   - Search the documents to ingest via metadata
   - Search the web for relevant documents
   - Extract information from the documents

3. **More ways to interact with tools**
   - Use the Llama Stack playground UI for chatting with the agents
   - Use other frameworks and ecosystems like Claude Desktop, BeeAI, etc.

---

## Resources for Continued Learning

### Official Documentation
- **[Docling Documentation](https://github.com/docling-project/docling)**: Latest features and updates

### Community Resources
- Join the Docling community on GitHub
- Share your implementations
- Contribute improvements back to the project

### Related Topics to Explore
- Document Layout Analysis
- Multimodal Embeddings
- Visual Question Answering
- Explainable AI Systems

---