<img src="https://imagedelivery.net/Dr98IMl5gQ9tPkFM5JRcng/3e5f6fbd-9bc6-4aa1-368e-e8bb1d6ca100/Ultra" alt="Image description" width="160" />

<br/>

# Metadata Management in Contextual AI

This notebook provides a comprehensive guide to adding and utilizing metadata within the Contextual AI Platform for enhanced document retrieval and filtering.  

The notebook walks through manually adding metadata into the datastore. With this additional metadata, it is possible to have more control over document organization, retrieval filtering, and response generation.   

For more background on metadata management in the platform, see also our [documentation](https://docs.contextual.ai/api-reference/datastores-documents/get-document-metadata).

  
  
## Table of Contents

1. **Environment Setup** - API configuration and dependencies
2. **Datastore and Agent Initialization** - Creating datastores and configuring agents
3. **Metadata Configuration** - Setting metadata at ingest time and post-processing
4. **Retrieval with Metadata Filters** - Using metadata for precise document filtering
5. **Advanced Filtering Operations** - Complex queries with multiple operators
6. **Generation using Metadata** - Leveraging metadata in reranking and generation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/15-metadata-intro/metadata_intro.ipynb)

## 1. Environment Setup

### Prerequisites

- **API Key**: Obtain from your Contextual AI workspace dashboard
- **Python Client**: Required version 0.8.0 or higher
- **Sample Data**: Policy documents and knowledge base templates

**Security Note**: Store API keys in environment variables or secure key management systems.

In [None]:
!pip install "contextual-client>=0.8.0"

In [None]:
# Import required libraries
import os
import requests
import json
from pathlib import Path
from typing import List, Optional, Dict
from IPython.display import display, JSON
import pandas as pd
from contextual import ContextualAI

In [None]:
# Initialize Contextual AI client
# You can store the API key as an environment variable:
#os.environ["CONTEXTUAL_API_KEY"] = "key-YSVU"

client = ContextualAI(
    api_key= os.environ["CONTEXTUAL_API_KEY"]
)
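
If the environment variable is not set, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a loader that fails with a clearer message follows; the helper name `load_api_key` is hypothetical and not part of the Contextual AI client.

```python
import os


def load_api_key(var_name: str = "CONTEXTUAL_API_KEY") -> str:
    """Return the API key from the environment, or raise a descriptive error.

    Hypothetical helper for this notebook, not a Contextual AI client function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before creating the client."
        )
    return key
```

It can then be passed to the client as `ContextualAI(api_key=load_api_key())`.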

### Data Preparation

Download sample policy documents and knowledge base templates for demonstration purposes.

In [None]:
def fetch_file(filepath):
    """Download a sample file from the examples repository unless it already exists locally."""
    path = Path(filepath)
    if path.exists():
        return
    # Create the parent directory (e.g. data/) if needed
    path.parent.mkdir(parents=True, exist_ok=True)
    print(f"Fetching {filepath}")
    response = requests.get(
        f"https://raw.githubusercontent.com/ContextualAI/examples/main/15-metadata-intro/{filepath}",
        timeout=30,
    )
    if response.ok:
        path.write_bytes(response.content)
        print(f"Saved {filepath}")
    else:
        print(f"Failed to fetch {filepath} (HTTP {response.status_code})")

fetch_file('data/POL_EU-v3.md')
fetch_file('data/POL_US-v2.md')
fetch_file('data/KB_Template_DE.md')
fetch_file('data/KB_Template_EU.md')
fetch_file('data/KB_Template_US.md')

## 2. Datastore and Agent Initialization

### 2.1 Datastore Creation

Initialize a new datastore to serve as the repository for documents with metadata capabilities.

In [None]:
result = client.datastores.create(name="Metadata_Examples")
datastore_id = result.id
print(f"Datastore ID: {datastore_id}")

### 2.2 Agent Configuration

Create an agent to enable querying capabilities on the datastore.

In [None]:
app_response = client.agents.create(
    name="Demo Metadata",
    datastore_ids=[datastore_id]
)
agent_id= app_response.id
print(f"Agent ID created: {agent_id}")

## 3. Metadata Configuration

### Overview

The Contextual AI platform provides flexible metadata management with two primary configuration methods:

**Timing Options:**
- **At ingest time**: Set metadata during document upload
- **Post-processing**: Update metadata after document processing

**Configuration Parameters:**
- `in_chunks` (default: True): Include metadata in chunk content so it is available for reranking and generation
- `returned_in_response` (default: False): Include metadata in API responses
- `filterable` (default: True): Enable metadata-based filtering

This section demonstrates various metadata configuration strategies.


### Metadata Structure

Each metadata entry consists of:
- **Field**: The metadata key (e.g., 'language', 'region')
- **Value**: The corresponding value (e.g., 'EN', 'US')

**Limitations:**
- Maximum of 15 metadata fields per workspace (contact support for extensions)
- Maximum 2KB per metadata field value
- Case sensitivity varies by operator (see Advanced Filtering Operations section)

**Best Practices:**
- Metadata field names can be case sensitive, so using lowercase names throughout avoids mismatches
- Common metadata field names that you can reuse or adapt: author, category, company, custom_field_1, custom_field_2, date, description, document_id, document_title, document_type, filename, folder_path, industry, language, name, project_id, publication_date, region, source, status, subject, title, type, updated_at, upload_date, upload_timestamp, uploaded_by, url, user_id, user_name, version, year
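
The limits and the lowercase convention above can be checked locally before a document is uploaded. The following is a minimal sketch; `check_metadata` is a hypothetical helper, and because the 15-field limit applies per workspace, a per-document check like this one is only a first line of defense.

```python
MAX_FIELDS = 15              # field limit stated in the Limitations list
MAX_VALUE_BYTES = 2 * 1024   # 2KB limit per metadata field value


def check_metadata(custom_metadata: dict) -> list:
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper, not part of the Contextual AI client.
    """
    problems = []
    if len(custom_metadata) > MAX_FIELDS:
        problems.append(
            f"{len(custom_metadata)} fields exceeds the limit of {MAX_FIELDS}"
        )
    for field, value in custom_metadata.items():
        if field != field.lower():
            problems.append(f"field '{field}' is not lowercase")
        # Measure the UTF-8 size of the value as it would be serialized to text
        size = len(str(value).encode("utf-8"))
        if size > MAX_VALUE_BYTES:
            problems.append(
                f"field '{field}' value is {size} bytes, over {MAX_VALUE_BYTES}"
            )
    return problems
```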

### Available Metadata Fields

Sample metadata fields used in our demonstration documents:
- `date`: Document date
- `region`: Geographic region (US, EU, etc.)
- `language`: Language code (EN, DE, etc.)
- `entities`: Named entities in the document
- `version`: Document version

You can also create your own metadata fields for your documents.

### 3.1 Setting Metadata at Ingest Time

Demonstrate adding metadata during document upload with default configuration parameters.

In [None]:
# The metadata argument is passed as a JSON string, so serialize the dictionary first
metadata_string = json.dumps({
    "custom_metadata": {
        "region": "EU",
        "language": "EN"
    }
})

# Upload the file and attach the serialized metadata
with open('data/POL_EU-v3.md', 'rb') as f:
    ingestion_result = client.datastores.documents.ingest(
        datastore_id,
        file=f,
        metadata=metadata_string  # JSON string, not a dict
    )
    document_id = ingestion_result.id

Monitor document processing status. The document progresses through states: pending → processing → completed. Custom metadata becomes visible upon completion.

In [None]:
metadata = client.datastores.documents.metadata(datastore_id = datastore_id, document_id = document_id)
print("Document metadata:", metadata)

Note the metadata embedded in the document response: `custom_metadata={'region': 'EU', 'language': 'EN'}`

### Examining Metadata in Chunks

Query the datastore to observe how metadata is incorporated into chunk content. For this query to run successfully, wait until the document has been fully processed and its status is `completed`.
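
Waiting can be automated with a small polling loop. This is a sketch under assumptions: `wait_until_completed` is a hypothetical helper, it takes any callable that returns the current status string, and treating `failed` as a terminal state is an assumption not confirmed in this notebook.

```python
import time


def wait_until_completed(get_status, timeout_s: float = 300.0, poll_s: float = 5.0) -> str:
    """Poll get_status() until it returns a terminal status or the timeout expires.

    Hypothetical helper; 'failed' as a terminal status is an assumption.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Document still '{status}' after {timeout_s} seconds")
        time.sleep(poll_s)
```

Assuming the metadata response exposes a `status` attribute, a call could look like `wait_until_completed(lambda: client.datastores.documents.metadata(datastore_id=datastore_id, document_id=document_id).status)`.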

In [None]:
query = "During"

query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        "content": query,
        "role": "user"
    }],
    override_configuration={
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 1,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
    },
    include_retrieval_content_text=True,
    retrievals_only=False
)

query_result

In [None]:
query_result.retrieval_contents[0].content_text

The retrieved chunk includes metadata appended to the content:

```
'File Name: POL_EU-v3.html\nDocument Title: Refund Policy EU v3\nSection: Refund Policy EU v3\n\n\nEU customers: refunds within 30 days. Digital goods only if unused.\nMetadata: {\n\t"region": "EU",\n\t"language": "EN"\n}'
```

Metadata is automatically appended when `in_chunks` is set to True (default behavior).

### 3.2 Exploring Advanced Metadata Configuration

Let's add a new document, where we configure all the different metadata parameters to show how they work.

In [None]:
metadata_string = json.dumps({
    "custom_metadata": {
        "region": "US",
        "language": "EN",
        "version": "v2"
    },
    "custom_metadata_config": {
        "region": {
            "filterable": True,
            "in_chunks": True,
            "returned_in_response": True
        },
        "language": {
            "filterable": True,
            "in_chunks": True,
            "returned_in_response": True
        },

        "version": {
            "filterable": True,
            "in_chunks": False,
            "returned_in_response": False
        }
    },
})

with open('data/POL_US-v2.md', 'rb') as f:
    ingestion_result = client.datastores.documents.ingest(
        datastore_id,
        file=f,
        metadata=metadata_string  # JSON string, not a dict
    )
    document_id = ingestion_result.id

Verify the complete metadata configuration:

In [None]:
metadata = client.datastores.documents.metadata(datastore_id = datastore_id, document_id = document_id)
print("Document metadata:", metadata)

In [None]:
metadata.custom_metadata

Document-level metadata: `{'region': 'US', 'version': 'v2', 'language': 'EN'}`

### Examining Metadata in Chunks

Let's look at how the metadata is applied to chunks. The query below uses a metadata filter (`documents_filters`) to retrieve only chunks from documents whose `region` equals `US`. The `override_configuration` parameter is used here for demonstration purposes: it turns off reranking and the filter model so that the retrieved chunks are returned as-is.

In [None]:
query = "During"

query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        "content": query,
        "role": "user"
    }],
    documents_filters={
        "operator": "AND",
        "filters": [
            {"field": "region", "operator": "equals", "value": "US"}
        ]
    },
    override_configuration={
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 3,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
    },
    retrievals_only=True,
    include_retrieval_content_text=True,
)

query_result

Extract chunk-level metadata from the query response:

In [None]:
query_result.retrieval_contents[0].custom_metadata

Chunk-level metadata: `{'language': 'EN', 'region': 'US'}`

Note that `version` is excluded from the response metadata due to `returned_in_response: False` configuration.

In [None]:
query_result.retrieval_contents[0].content_text

Chunk content with embedded metadata:

```
File Name: POL_US-v2.html
Document Title: Refund Policy US v2
Section: Refund Policy US v2

US customers: refunds within 14 days. Subscriptions non-refundable.
Metadata: {
    "region": "US",
    "language": "EN"
}
```

The `version` field is absent from chunk content because `in_chunks: False` was specified.
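
To check programmatically which fields ended up in a chunk, the trailing `Metadata:` block can be parsed out of `content_text`. The sketch below relies on the text layout observed in the outputs of this notebook, which is an assumption rather than a documented contract; `metadata_from_chunk_text` is a hypothetical helper.

```python
import json


def metadata_from_chunk_text(content_text: str):
    """Return the dict that follows the last 'Metadata: ' marker, or None.

    Hypothetical helper; assumes the chunk layout shown in this notebook's outputs.
    """
    marker = "\nMetadata: "
    if marker not in content_text:
        return None
    raw = content_text.rsplit(marker, 1)[1]
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # The trailing block is not valid JSON, so give up rather than guess
        return None
    return parsed if isinstance(parsed, dict) else None
```

For example, `"version" in (metadata_from_chunk_text(query_result.retrieval_contents[0].content_text) or {})` would be expected to evaluate to `False` for the document ingested above.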

### 3.3 Updating Metadata

Metadata can also be updated after ingestion time. Let's change the version number for a document.

In [None]:
result = client.datastores.documents.set_metadata(
        datastore_id=datastore_id,
        document_id=document_id,
        custom_metadata={"version": 'v4'}
)

In [None]:
metadata = client.datastores.documents.metadata(datastore_id=datastore_id,
                        document_id=document_id)
print("Document metadata:", metadata.custom_metadata)

The output now shows the metadata has been updated to: `{'version': 'v4'}`
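
The output lists only `version`, which suggests that the update replaced the previously stored dictionary rather than merging into it. Assuming that is the behavior, a small helper can combine the existing fields with the changes before calling `set_metadata`. The name `merged_metadata` is hypothetical.

```python
def merged_metadata(existing, updates: dict) -> dict:
    """Return a new dict with updates applied on top of existing; inputs are not modified.

    Hypothetical helper, not part of the Contextual AI client.
    """
    combined = dict(existing or {})
    combined.update(updates)
    return combined


# Possible usage with the calls shown in this section (commented out because
# it needs a live client and an ingested document):
# current = client.datastores.documents.metadata(
#     datastore_id=datastore_id, document_id=document_id
# ).custom_metadata
# client.datastores.documents.set_metadata(
#     datastore_id=datastore_id,
#     document_id=document_id,
#     custom_metadata=merged_metadata(current, {"version": "v4"}),
# )
```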

### 3.4 Summary: Metadata Configuration Options

**Configuration Timing:**
- At ingest time during document upload
- Post-processing via metadata updates

**Configuration Parameters:**
- `in_chunks`: Controls metadata inclusion in chunk content
- `returned_in_response`: Controls metadata visibility in API responses
- `filterable`: Enables metadata-based query filtering
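
Writing the per-field configuration by hand gets repetitive as the number of fields grows. A minimal sketch of a builder that produces the same payload structure as section 3.2 follows; `build_metadata_payload` and its `hidden_fields` parameter are hypothetical names introduced for illustration.

```python
import json


def build_metadata_payload(custom_metadata: dict, hidden_fields=()) -> str:
    """Serialize metadata together with a per-field configuration.

    Fields listed in hidden_fields remain filterable but are kept out of
    chunk content and API responses. Hypothetical helper.
    """
    config = {}
    for field in custom_metadata:
        visible = field not in hidden_fields
        config[field] = {
            "filterable": True,
            "in_chunks": visible,
            "returned_in_response": visible,
        }
    return json.dumps({
        "custom_metadata": custom_metadata,
        "custom_metadata_config": config,
    })
```

Calling `build_metadata_payload({"region": "US", "language": "EN", "version": "v2"}, hidden_fields=("version",))` reproduces the configuration used for `POL_US-v2.md`.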

## 4. Retrieval with Metadata Filters

Metadata filtering enables precise control over document retrieval, improving query accuracy by narrowing the search scope to relevant documents based on metadata criteria.

### 4.1 Ingesting Documents with Rich Metadata

Prepare additional documents with comprehensive metadata for demonstration.

In [None]:
metadata_string = json.dumps({
    "custom_metadata": {
        "region": "US",
        "language": "EN",
        "date": "2025-02-10",
        "entities": '["REG:US", "TEMP:CUSTOMER", "PROC:REFUND", "PROD:SUBSCRIPTION"]'
    }
})

with open('data/KB_Template_US.md', 'rb') as f:
    ingestion_result = client.datastores.documents.ingest(
        datastore_id,
        file=f,
        metadata=metadata_string
    )

In [None]:
metadata_string = json.dumps({
    "custom_metadata": {
        "region": "EU",
        "language": "DE",
        "date": "2022-02-10",
        "entities": '["REG:DE", "LANG:DE", "TEMP:CUSTOMER", "PROC:WARRANTY"]'
    }
})

with open('data/KB_Template_DE.md', 'rb') as f:
    ingestion_result = client.datastores.documents.ingest(
        datastore_id,
        file=f,
        metadata=metadata_string
    )

In [None]:
metadata_string = json.dumps({
    "custom_metadata": {
        "region": "EU",
        "language": "EN",
        "date": "2024-02-10",
        "entities": '["REG:EU", "TEMP:CUSTOMER", "PROC:REFUND", "PROD:DIGITAL"]'
    }
})

with open('data/KB_Template_EU.md', 'rb') as f:
    ingestion_result = client.datastores.documents.ingest(
        datastore_id,
        file=f,
        metadata=metadata_string
    )

Monitor document processing status. Ensure all documents reach 'completed' status before proceeding.

In [None]:
# Retrieve all documents from the datastore
docs = client.datastores.documents.list(datastore_id=datastore_id)
doc_pairs = [(doc.id, doc.name, doc.status) for doc in docs.documents]
print("Document ID and Name pairs:")
for doc_id, name, status in doc_pairs:
    print(f"ID: {doc_id}, Name: {name}, Status: {status}")

### 4.2 Baseline Query Without Filters

Execute a general query to demonstrate the challenge of ambiguous responses without metadata filtering.

In [None]:
query = "What is the current refund window?"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        )

query_result.message.content

The response contains multiple region-specific policies, demonstrating the ambiguity when metadata filtering is not applied. The system returns policies for both US and EU regions, requiring manual disambiguation.

### 4.3 Targeted Query with Metadata Filters

Apply region and date filters to retrieve contextually relevant information for a US-based user requiring current policies.

In [None]:
query = "What is the current refund window?"
value1 = "US"
value2 = "2023-01-01"
metadata_field1 = "region"
metadata_field2 = "date"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters= {
        "operator": "AND",
        "filters": [
            { "operator": "AND",
            "filters": [
                {"field": metadata_field1, "operator": "equals", "value": value1},
                {"field": metadata_field2, "operator": "gt", "value": value2}
            ]
            }
       ]
        }
        )

query_result.message.content

The filtered response provides US-specific information only, demonstrating improved precision through metadata filtering. The response correctly focuses on current US policies dated after January 2023.
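
Nested filter dictionaries are easy to mistype. The sketch below shows small builder functions that produce the same `operator`/`filters` structure; the names `field_filter`, `all_of`, and `any_of` are hypothetical and not part of the client.

```python
def field_filter(field: str, operator: str, value=None) -> dict:
    """Build one leaf condition; value is omitted for operators such as 'exists'."""
    condition = {"field": field, "operator": operator}
    if value is not None:
        condition["value"] = value
    return condition


def all_of(*conditions: dict) -> dict:
    """Group conditions so that every one of them must match."""
    return {"operator": "AND", "filters": list(conditions)}


def any_of(*conditions: dict) -> dict:
    """Group conditions so that at least one of them must match."""
    return {"operator": "OR", "filters": list(conditions)}


# The region and date conditions from the query above, as a flat AND group
us_current = all_of(
    field_filter("region", "equals", "US"),
    field_filter("date", "gt", "2023-01-01"),
)
```

The result can be passed as `documents_filters=us_current`. The cell above wraps the same two conditions in one additional `AND` group, which should be logically equivalent.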

## 5. Advanced Filtering Operations

### 5.1 Available Filter Operators

| Operator | Description | Case Sensitivity | Example |
|----------|-------------|------------------|---------|
| `exists` | Checks if field has any value | N/A | `{"field": "tags", "operator": "exists"}` |
| `equals` | Exact match comparison | Case sensitive | `{"field": "status", "operator": "equals", "value": "active"}` |
| `notequals` | Negated exact match | Case sensitive | `{"field": "status", "operator": "notequals", "value": "inactive"}` |
| `wildcard` | Pattern matching with * | Lowercase only | `{"field": "code", "operator": "wildcard", "value": "*123"}` |
| `startswith` | Prefix matching | Case sensitive | `{"field": "title", "operator": "startswith", "value": "HR-"}` |
| `between` | Range comparison | N/A | `{"field": "date", "operator": "between", "value": ["2023-01-01", "2023-12-31"]}` |
| `containsany` | Array membership | Case sensitive | `{"field": "tags", "operator": "containsany", "value": ["policy", "HR"]}` |
| `gt`, `gte` | Greater than (or equal) | N/A | `{"field": "timestamp", "operator": "gt", "value": "2023-12-31"}` |
| `lt`, `lte` | Less than (or equal) | N/A | `{"field": "timestamp", "operator": "lte", "value": "2024-01-01"}` |
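
A filter condition can be sanity-checked against this table before it is sent. The following is a minimal sketch based only on the operators and value shapes listed above; `check_condition` is a hypothetical helper, and the platform may enforce additional rules.

```python
KNOWN_OPERATORS = {
    "exists", "equals", "notequals", "wildcard", "startswith",
    "between", "containsany", "gt", "gte", "lt", "lte",
}


def check_condition(condition: dict) -> list:
    """Return a list of problems for one leaf condition; empty means none found.

    Hypothetical helper derived from the operator table, not a client function.
    """
    operator = condition.get("operator")
    value = condition.get("value")
    if operator not in KNOWN_OPERATORS:
        return [f"unknown operator: {operator!r}"]
    if operator == "exists":
        return []  # 'exists' takes no value
    problems = []
    if value is None:
        problems.append(f"'{operator}' needs a value")
    elif operator == "between" and not (isinstance(value, list) and len(value) == 2):
        problems.append("'between' needs a two-element list")
    elif operator == "containsany" and not isinstance(value, list):
        problems.append("'containsany' needs a list")
    elif operator == "wildcard" and isinstance(value, str) and value != value.lower():
        problems.append("'wildcard' values should be lowercase")
    return problems
```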

### 5.2 Complex Query with OR Logic and Wildcard Matching

Demonstrate combining multiple filter conditions using OR logic and wildcard pattern matching. Note that wildcard operators require lowercase values.


In [None]:
query = "What is the current refund window?"

# Wildcard values must be lowercase; equals values must match the stored case exactly
metadata_field1 = "region"
operator1 = "equals"
value1 = "US"
metadata_field2 = "entities"
operator2 = "wildcard"
value2 = "*prod:digital*"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters= {
        "operator": "OR",
        "filters": [
                {"field": metadata_field1, "operator": operator1, "value": value1},
                {"field": metadata_field2, "operator": operator2, "value": value2},
            ]
        },
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 3,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
        },
        retrievals_only=True
        )
query_result

The query returns multiple retrievals, confirming that OR logic successfully matches documents meeting either condition.

### 5.3 Case Sensitivity in Filter Operations

Demonstrate the impact of case sensitivity on filter results. The `equals` operator is case-sensitive.

In [None]:
query = "What is the current refund window?"

metadata_field1 = "region"
operator1 = "equals"
value1 = "us"
metadata_field2 = "entities"
operator2 = "wildcard"
value2 = "*prod:digital*"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters= {
        # One OR group with both conditions. Repeating the "operator"/"filters"
        # keys in the same dict would silently keep only the last pair.
        "operator": "OR",
        "filters": [
                {"field": metadata_field1, "operator": operator1, "value": value1},
                {"field": metadata_field2, "operator": operator2, "value": value2},
            ]
        },
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 2,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
        },
        retrievals_only=True
        )

query_result

Result: Single retrieval returned. The lowercase "us" fails to match the stored value "US" due to case sensitivity of the `equals` operator.


### 5.4 Wildcard Operator Case Requirements

Test wildcard operator with uppercase values to demonstrate its lowercase requirement.

In [None]:
query = "What is the current refund window?"

metadata_field1 = "region"
operator1 = "equals"
value1 = "us"
metadata_field2 = "entities"
operator2 = "wildcard"
value2 = "*PROD:DIGITAL*"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters= {
        # One OR group with both conditions. Repeating the "operator"/"filters"
        # keys in the same dict would silently keep only the last pair.
        "operator": "OR",
        "filters": [
                {"field": metadata_field1, "operator": operator1, "value": value1},
                {"field": metadata_field2, "operator": operator2, "value": value2},
            ]
        },
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 2,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
        },
        retrievals_only=True
        )

query_result

Result: No retrievals returned.

**Key Learning:**
- `wildcard` operator requires lowercase values
- `equals` operator is case sensitive
- Carefully validate case requirements for each operator to ensure successful filtering

## 6. Generation using Metadata

Metadata integration extends throughout the generation pipeline, enhancing:
- Document reranking
- Filter model behavior
- Generation model context

### 6.1 Instruction-Following Reranker

Contextual AI's instruction-following reranker accepts custom prompts that can leverage metadata fields for enhanced relevance scoring.

#### Baseline Reranking

In [None]:
query = "What is the current refund window?"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": True,
        },
        retrievals_only=True,
        include_retrieval_content_text=True
        )

top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]
print(top_3_doc_names)

Default ranking order: `['KB_Template_US', 'KB_Template_EU', 'KB_Template_DE']`

#### Custom Reranking Instructions

Apply metadata-based prioritization through reranker instructions.

In [None]:
query = "What is the current refund window?"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": True,
        "rerank_instructions": "Prioritize PROC:WARRANTY chunks"
        },
        retrievals_only=True,
        include_retrieval_content_text=True
        )

top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]
print(top_3_doc_names)

Modified ranking order: `['KB_Template_EU', 'KB_Template_US', 'KB_Template_DE']`

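The ranking changes in response to the instruction, which shows that the reranker takes it into account. Note, however, that in the order shown here `KB_Template_DE`, the only document tagged `PROC:WARRANTY`, still ranks third, so verify the effect of reranker instructions against your own results.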

#### Temporal Prioritization

Demonstrate using date metadata for recency-based reranking.

In [None]:
query = "Customer email template"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": True,
      },
        retrievals_only=True,
        include_retrieval_content_text=True
        )

top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]
print(top_3_doc_names)

Baseline order: `['KB_Template_EU', 'KB_Template_US', 'KB_Template_DE']`

In [None]:
query = "Customer email template"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": True,
        "rerank_instructions": "Prioritize recent documents"
      },
        retrievals_only=True,
        include_retrieval_content_text=True
        )

top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]
print(top_3_doc_names)

Recency-prioritized order: `['KB_Template_US', 'KB_Template_EU', 'KB_Template_DE']`

The reranker successfully prioritizes documents based on date metadata, placing the most recent document (US, dated 2025-02-10) first.
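
Comparing two rankings by eye becomes error-prone with longer result lists. A minimal sketch of a comparison helper follows; `rank_changes` is a hypothetical name, and it assumes each document name appears once per list.

```python
def rank_changes(before: list, after: list) -> dict:
    """Map each name present in both lists to the number of positions it gained.

    Positive values mean the item moved up, negative values mean it moved down.
    Hypothetical helper; assumes names are unique within each list.
    """
    return {
        name: before.index(name) - after.index(name)
        for name in after
        if name in before
    }


# Rankings copied from the two outputs above
baseline = ['KB_Template_EU', 'KB_Template_US', 'KB_Template_DE']
recency = ['KB_Template_US', 'KB_Template_EU', 'KB_Template_DE']
print(rank_changes(baseline, recency))
# {'KB_Template_US': 1, 'KB_Template_EU': -1, 'KB_Template_DE': 0}
```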

### 6.2 Metadata in Generation

The generation model incorporates metadata from retrieved chunks into its responses. This example demonstrates how metadata enhances response generation.

In [None]:
query = "What is the current refund window?"

metadata_field1 = "region"
value1 = "US"

query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters= {
        "operator": "AND",
        "filters": [
            { "operator": "AND",
            "filters": [
                {"field": metadata_field1, "operator": "equals", "value": value1},
            ]
            }
       ]
        },
        override_configuration= {
        "enable_filter": False,
        "enable_rerank": False,
        "top_k_retrieved_chunks": 3,
        "lexical_alpha": 0.1,
        "semantic_alpha": 0.9,
        },
        include_retrieval_content_text=True
        )

query_result.message.content

The generated response incorporates the date metadata ("February 10, 2025") from the retrieved chunk, demonstrating how metadata enriches the generation output with contextual information.

#### Examining Metadata Source

Verify how metadata is accessible to the generation model through chunk content.

In [None]:
query_result.retrieval_contents[0].content_text

Retrieved chunk with embedded metadata:

```
File Name: KB_Template_US.html
Document Title: KB — Customer email template (US)
Section: KB — Customer email template (US)

Localized template for customer emails for United States Dear Customer, Thank you for shopping with us. For products purchased in the United States: Refunds are available for 30 days on physical goods. Refunds for digital goods are limited to 14 days and only if the product has not been activated. Subscriptions are generally non-refundable once started.
Metadata: {
    "region": "US",
    "language": "EN",
    "date": "2025-02-10",
    "entities": "["REG:US", "TEMP:CUSTOMER", "PROC:REFUND", "PROD:SUBSCRIPTION"]"
}
```

The metadata is embedded within the chunk content, making it available to the generation model for producing contextually aware responses.


## Summary

This notebook demonstrated comprehensive metadata utilization across the Contextual AI platform:

- **Configuration**: Set metadata at ingest time or post-processing with granular control
- **Retrieval**: Apply precise document filters using various operators with attention to case sensitivity
- **Generation**: Leverage metadata in reranking, filtering, and generation stages

Metadata provides powerful capabilities for improving retrieval accuracy and response relevance in production applications. This notebook added metadata manually; depending on your use case, you can automate metadata creation and ingestion into Contextual AI.

## Best Practices and Recommendations

1. **Test Case Sensitivity**: Validate filter behavior with different case combinations
2. **Plan Metadata Schema**: Design metadata fields before large-scale ingestion
3. **Monitor Performance**: Track both retrieval quality and latency when using metadata filters. Some operators, such as `wildcard`, can increase latency.
4. **Document Metadata Strategy**: Maintain clear documentation of metadata field meanings and usage

For additional resources and examples, refer to the [Contextual AI documentation](https://docs.contextual.ai/).
