# Information Extraction with Docling

This notebook demonstrates how to extract structured information from unstructured documents using [Docling](https://docling-project.github.io/docling/). You'll learn how to use different template formats to extract specific data fields from documents like invoices.

*Note: The extraction API is currently in beta and may change without prior notice.*

## Installation

First, we need to install the Docling package with VLM (Vision Language Model) support for information extraction:

In [None]:
%pip install -q docling[vlm] # Install the Docling package with VLM support

## Setup and Configuration

Let's import the necessary libraries and set up our environment:

In [None]:
from IPython import display
from pydantic import BaseModel, Field
from rich import print

### Sample Document

For this demonstration, we'll work with a sample invoice document. This will help us understand how information extraction works with real-world documents:

In [None]:
invoice_url = "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf"

display.IFrame(invoice_url, width="100%", height=600)

### Document Extractor Setup

Now let's configure the document extractor to handle PDF and image formats. The extractor is the main component that will process our documents and extract information using the templates we define:

In [None]:
from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor

extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])

## Information Extraction with Templates

Docling supports different template formats for [information extraction](https://docling-project.github.io/docling/examples/extraction/). Templates define the structure and data types of the information you want to extract from documents.

### Configure extraction templates

The code cell that follows the guidelines below defines one template in each supported format. Each template format has its own advantages and use cases:

### Template Selection Guidelines

- **String templates**: Simple JSON format, fastest to write and good for basic extractions
  - Best for: Quick prototyping, simple data structures, minimal setup
  - Example use case: Extracting just a few basic fields like invoice number and total


- **Dictionary templates**: Python dictionaries that integrate more naturally with Python code
  - Best for: Structured data with nested objects, better Python integration
  - Example use case: When you need nested data structures or complex field relationships


- **Pydantic model templates**: Recommended for production use with type validation
  - Best for: Production applications, type safety, complex data structures, documentation
  - Example use case: Enterprise applications where data validation and type safety are critical


- **Pydantic instance templates**: Useful when you need specific default values that override the model defaults
  - Best for: When you have contextual information that should be used as fallbacks
  - Example use case: Processing invoices from a specific vendor where you know the vendor name but want to extract it if present, or using a known invoice series number as a fallback
  - **Why this matters**: In our sample invoice, if the vendor name isn't clearly extractable but you're processing a batch from "WordPress Invoice Plugin", you can set that as the default while still allowing extraction to override it

In a later cell you'll choose which template format to use for the actual extraction.
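As a quick illustration of how the first two formats relate, the sketch below parses a string template with the standard `json` module and compares it with the equivalent dictionary. It only shows that both carry the same field-to-type mapping. It does not call Docling, and it makes no claim about how Docling parses string templates internally.

```python
import json

# The same two fields, written once as a JSON string and once as a dictionary.
string_template = '{"invoice_number": "string", "total": "float"}'
dict_template = {"invoice_number": "string", "total": "float"}

# Parsing the JSON string yields an ordinary Python dictionary.
parsed = json.loads(string_template)
same_fields = parsed == dict_template

print(f"Parsed string template: {parsed}")
print(f"Matches dictionary template: {same_fields}")
```

If the two formats describe the same fields, choosing between them is mostly a matter of convenience: a string is quick to type, while a dictionary is easier to build or modify in code.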

In [None]:
from typing import Optional

# String template format - simple JSON string
string_template = '{"invoice_number": "string", "total": "float"}'

# Dictionary template format - Python dictionary
dict_template = {"invoice_number": "string", "total": "float"}

# Pydantic model template format - class definition with validation
# Notice how we use Field() with examples and defaults; these can help guide the extraction.
class Invoice(BaseModel):
    invoice_number: str = Field(examples=["INV-001", "12345"])
    total: float = Field(default=10, examples=[100.0, 250.50])
    vendor_name: Optional[str] = Field(default=None, examples=["ACME Corp", "Tech Solutions Inc"])

pydantic_model_template = Invoice

# Pydantic instance template format - model instance with specific defaults
# This is useful when you have contextual information about the documents you're processing.
# In this example, imagine you're processing a batch of invoices from a specific context:
pydantic_instance_template = Invoice(
    invoice_number="WP-UNKNOWN",  # Fallback for WordPress invoice plugin documents
    total=0.0,                    # Safe default when total can't be extracted
    vendor_name="WordPress Invoice Plugin"  # Known vendor for this document batch
)

# Why use Pydantic instances? Consider this scenario with our sample invoice:
# - invoice_number and total will be extracted from the document if found
# - If vendor_name isn't clearly extractable, "WordPress Invoice Plugin" will be used
# - If invoice_number is missing, "WP-UNKNOWN" provides a meaningful fallback
# - This is very useful for batch processing where you have context about the document source

def get_extraction_template(template_name: str = "string"):
    """Get the configured extraction template based on name.

    Args:
        template_name: One of "string", "dict", "pydantic_model", or "pydantic_instance"
    
    Returns:
        Template for extraction
        
    Raises:
        ValueError: If template_name is not recognized
    """
    templates = {
        "string": string_template,
        "dict": dict_template,
        "pydantic_model": pydantic_model_template,
        "pydantic_instance": pydantic_instance_template
    }

    if template_name not in templates:
        raise ValueError(
            f"Unknown template name: '{template_name}'. "
            f"Choose from {list(templates.keys())}"
        )
    
    return templates[template_name]

# Tips for Better Extraction:
print("💡 Tips for Better Extraction:")
print("1. Use descriptive field names that clearly indicate what information you're looking for")
print("2. Provide examples in Pydantic Field definitions to guide the extraction")  
print("3. Specify appropriate data types (string, float, int, etc.) for better accuracy")
print("4. Use optional fields for data that might not always be present")
print("5. Test with different template formats to find what works best for your use case")
print("")
print("🔧 Pydantic Instance Use Case:")
print("In our sample invoice, the Pydantic instance template provides:")
print("- Known vendor fallback: 'WordPress Invoice Plugin' (useful if vendor name is unclear)")
print("- Meaningful invoice number fallback: 'WP-UNKNOWN' (better than generic defaults)")
print("- Safe total fallback: 0.0 (prevents errors if extraction fails)")
print("- Extracted data will still override these defaults when found in the document")

### Choose an extraction template

Next we choose the template format to be used for information extraction.

Each template format has different characteristics:

- **string**: Simple JSON format, fastest to write and good for basic extractions
- **dict**: Python dictionary, provides better integration with Python code  
- **pydantic_model**: Pydantic model class, recommended for production use with type validation
- **pydantic_instance**: Pydantic model instance, useful when you need specific default values

Just set `template_to_use` to one of the available template formats.

In [None]:
# Set the template to use (choose from: "string", "dict", "pydantic_model", "pydantic_instance")
template_to_use = "pydantic_model"

extraction_template = get_extraction_template(template_to_use)

print(f"✓ Using '{template_to_use}' template format")
print(f"Template: {extraction_template}")

## ✨ Information Extraction

Now we'll perform the information extraction using the selected template format. The extractor will analyze the invoice document and extract the structured information according to the template we configured.

In [None]:
# Perform information extraction using the selected template
print(f"Extracting information using '{template_to_use}' template...")

result = extractor.extract(invoice_url, template=extraction_template)

print("✓ Extraction completed successfully!")
print("Extracted data:")
print(result)

## Understanding the Results

The extraction results contain the structured data extracted from the document according to your selected template. The extractor uses vision-language models to understand the document content and map it to the requested fields.

### Interpreting Results

- **Successful extraction**: When the extractor finds the requested information, it will return the structured data in the format specified by your template
- **Missing fields**: Optional fields may be `None` or use default values if the information isn't found in the document
- **Data types**: Results will be converted to the specified types (string, float, int, etc.) when possible
- **Accuracy**: Extraction quality depends on document quality, how descriptive the field names are, and how complex the template is
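The points about missing fields and data types can be checked by validating extracted data against the Pydantic model. The sketch below is a minimal illustration that assumes the extracted fields are available as a plain dictionary. The `extracted_data` values are hypothetical stand-ins, not output from the sample invoice, and the way you obtain such a dictionary from the extraction result may change while the API is in beta.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    invoice_number: str = Field(examples=["INV-001", "12345"])
    total: float = Field(default=10, examples=[100.0, 250.50])
    vendor_name: Optional[str] = Field(default=None, examples=["ACME Corp"])


# Hypothetical extracted data: the total arrives as text, vendor_name is absent.
extracted_data = {"invoice_number": "INV-001", "total": "250.50"}

# Validation converts "250.50" to a float and fills vendor_name with its default.
invoice = Invoice.model_validate(extracted_data)
print(invoice)

# A required field without a default raises a ValidationError when it is missing.
missing = []
try:
    Invoice.model_validate({"total": 12.0})
except ValidationError as exc:
    missing = [error["loc"][0] for error in exc.errors()]
print(f"Missing required fields: {missing}")
```

Validating this way surfaces missing required fields as explicit errors rather than letting incomplete records flow silently into later processing steps.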

### Pydantic Instance Template Results Explained

If you chose the `pydantic_instance` template, you'll see how contextual defaults work in practice:

- **`invoice_number` and `total`**: These will be extracted from the actual invoice document if found
- **`vendor_name`**: If the vendor name isn't clearly visible or extractable from our sample invoice, the fallback "WordPress Invoice Plugin" will be used instead of `None`

This is useful when you have context about the documents being processed that is more relevant than the defaults declared in the model definition.

For example, in the extraction output above:
- `invoice_number` and `total` are set from the values extracted from the document
- If no clear `vendor_name` could be extracted, the default "WordPress Invoice Plugin" supplied by the instance is applied instead of the model's default `None`
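One way to picture this fallback behaviour is as a merge in which extracted values take priority over the instance defaults. The sketch below is only a mental model written with plain dictionaries. The helper `apply_fallbacks` and the extracted values are hypothetical, and this is not a description of how Docling implements the behaviour internally.

```python
def apply_fallbacks(defaults: dict, extracted: dict) -> dict:
    """Overlay extracted values on defaults, skipping fields that were not found (None)."""
    found = {key: value for key, value in extracted.items() if value is not None}
    return {**defaults, **found}


# Defaults taken from the Pydantic instance template in this notebook.
instance_defaults = {
    "invoice_number": "WP-UNKNOWN",
    "total": 0.0,
    "vendor_name": "WordPress Invoice Plugin",
}

# Hypothetical extraction output: number and total were found, the vendor was not.
extracted = {"invoice_number": "INV-001", "total": 100.0, "vendor_name": None}

merged = apply_fallbacks(instance_defaults, extracted)
print(merged)
```

In this model, only `vendor_name` falls back to the instance default, because it is the only field the hypothetical extraction left empty.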

### Experimenting with Different Templates

You can easily experiment with different template formats by going back to the template selection cell and changing the `template_to_use` variable, then re-running the extraction. Try comparing the `pydantic_model` vs `pydantic_instance` results to see how the contextual defaults affect the output.
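When comparing two runs, it can help to list only the fields that differ. The sketch below assumes the results of each run have already been reduced to plain dictionaries. The helper `diff_fields` and both sets of values are hypothetical and are not actual output from the sample invoice.

```python
def diff_fields(first: dict, second: dict) -> dict:
    """Return {field: (first_value, second_value)} for every field whose values differ."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


# Hypothetical outputs from a pydantic_model run and a pydantic_instance run.
model_run = {"invoice_number": "INV-001", "total": 100.0, "vendor_name": None}
instance_run = {
    "invoice_number": "INV-001",
    "total": 100.0,
    "vendor_name": "WordPress Invoice Plugin",
}

differences = diff_fields(model_run, instance_run)
print(differences)
```

An empty result would mean that both templates produced the same fields, which suggests the contextual defaults were not needed for that document.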

## Additional Resources

### Documentation
- [Docling Project Documentation](https://docling-project.github.io/docling/)
- [Pydantic Documentation](https://docs.pydantic.dev/latest/)
- [Open Data Hub Data Processing Repository](https://github.com/opendatahub-io/odh-data-processing)

### Next Steps
- Try extracting information from your own documents
- Experiment with more complex Pydantic models
- Explore batch processing of multiple documents
- Integrate extraction into your data processing pipelines

### Feedback and Contributions
We welcome feedback and contributions! Please visit the [ODH Data Processing repository](https://github.com/opendatahub-io/odh-data-processing) to:
- Report issues or bugs
- Suggest improvements
- Contribute examples and documentation
- Share your use cases