## How to add metadata to your documents and filter searches
This notebook walks through how to upload metadata that provides extra information about the corpus you are ingesting with nv-ingest. It covers the requirements for the metadata file and the supported file types. We then go through the process of filtering searches, in this case on the metadata we provided.

The first step is to import the tools we will be using.

In [1]:
from nv_ingest_client.client import Ingestor
from nv_ingest_client.util.milvus import nvingest_retrieval
import pandas as pd



Next, we define the variables the client needs to connect to our pipeline.

In [2]:
model_name="nvidia/llama-3.2-nv-embedqa-1b-v2"
hostname="localhost"
collection_name = "nv_ingest_collection"
sparse = True

Now we create a dataframe with dummy metadata in it. The metadata can be ingested as either a dataframe or a file; the supported file types are JSON, CSV, and Parquet. If you supply a file, it is converted into a pandas dataframe for you. In this example, after we create the dataframe, we write it to a CSV file and use that file as part of the ingestion.

In [3]:
meta_df = pd.DataFrame(
    {
        "source": ["/raid/nv-ingest/data/woods_frost.pdf", "/raid/nv-ingest/data/multimodal_test.pdf"],
        "meta_a": ["alpha", "bravo"],
        "meta_b": [5, 10],
        "meta_c": [True, False],
        "meta_d": [10.0, 20.0]
    }
)
file_path = "./meta_df.csv"
meta_df.to_csv(file_path)

If you are supplying metadata during ingestion, you must supply three keyword arguments.

- `meta_dataframe` - Either a string with the path to the metadata file (to be loaded via pandas) or an already loaded dataframe.
- `meta_source_field` - A string naming the field that is used to match each metadata row to its document during ingestion.
- `meta_fields` - A list of strings naming the dataframe columns that are used as metadata for the corresponding documents.

All three parameters are required to enable metadata updates to the documents during ingestion.
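Before starting an ingestion job, it can help to confirm that the metadata actually lines up with the files being ingested. The sketch below is a hypothetical pre-flight check (`check_metadata` is not part of nv-ingest) that uses only the standard library. It assumes the metadata is available as CSV text and that the source field holds the same paths that are passed to `.files(...)`.

```python
import csv
import io


def check_metadata(csv_text, source_field, meta_fields, files):
    """Return a list of problems found; an empty list means the metadata looks usable.

    Hypothetical helper for illustration only, not an nv-ingest API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    problems = []

    # Every requested column must exist in the metadata header.
    for name in [source_field, *meta_fields]:
        if name not in columns:
            problems.append(f"missing column: {name}")

    # Every file to ingest should have a matching metadata row.
    if source_field in columns:
        known_sources = {row[source_field] for row in reader}
        for path in files:
            if path not in known_sources:
                problems.append(f"no metadata row for: {path}")

    return problems


# CSV text mirroring the dummy metadata used in this notebook.
sample_csv = (
    "source,meta_a,meta_b,meta_c,meta_d\n"
    "/raid/nv-ingest/data/woods_frost.pdf,alpha,5,True,10.0\n"
    "/raid/nv-ingest/data/multimodal_test.pdf,bravo,10,False,20.0\n"
)
files = [
    "/raid/nv-ingest/data/woods_frost.pdf",
    "/raid/nv-ingest/data/multimodal_test.pdf",
]

problems = check_metadata(
    sample_csv, "source", ["meta_a", "meta_b", "meta_c", "meta_d"], files
)
print(problems)  # [] when every column and file is accounted for
```

Returning a list of problems rather than raising on the first one makes it easier to fix a metadata file in a single pass.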


In [4]:
ingestor = ( 
    Ingestor(message_client_hostname=hostname)
    .files(["/raid/nv-ingest/data/woods_frost.pdf", "/raid/nv-ingest/data/multimodal_test.pdf"])
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        text_depth="page"
    )
    .embed(text=True, tables=True)
    .vdb_upload(
        collection_name=collection_name,
        milvus_uri=f"http://{hostname}:19530",
        sparse=sparse,
        minio_endpoint=f"{hostname}:9000",
        dense_dim=2048,
        meta_dataframe=file_path,
        meta_source_field="source",
        meta_fields=["meta_a", "meta_b", "meta_c", "meta_d"],
    )
)
results = ingestor.ingest_async().result()

'text' parameter is deprecated and will be ignored. Future versions will remove this argument.
'tables' parameter is deprecated and will be ignored. Future versions will remove this argument.


Once the ingestion is complete, the documents will have been uploaded to the vector database with the corresponding metadata stored in the `content_metadata` field. This is a JSON field that can be used in a filtered search. To use it, select one of the columns listed in `meta_fields` and filter on a value for that sub-field, as the example below does. More extensive filters can also be applied; refer to https://milvus.io/docs/use-json-fields.md#Query-with-filter-expressions for more information.
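Filter expressions are plain strings, so quoting mistakes are easy to make when values come from variables. The sketch below shows one way to build them programmatically. `json_filter` is a hypothetical helper, not part of nv-ingest or Milvus. It assumes that string values can be quoted with `json.dumps` and that booleans are written as `True`/`False`, as in the expressions used in this notebook. Joining two expressions with `and` follows the Milvus filter documentation linked above.

```python
import json


def json_filter(field, op, value, json_column="content_metadata"):
    """Build a filter string such as: content_metadata["meta_a"] == "alpha".

    Hypothetical helper for illustration only.
    """
    allowed_ops = {"==", "!=", "<", "<=", ">", ">="}
    if op not in allowed_ops:
        raise ValueError(f"unsupported operator: {op}")

    # bool must be checked before numbers, because bool is a subclass of int.
    if isinstance(value, bool):
        literal = "True" if value else "False"
    elif isinstance(value, str):
        literal = json.dumps(value)  # adds double quotes and escapes
    else:
        literal = repr(value)

    return f'{json_column}["{field}"] {op} {literal}'


expr_a = json_filter("meta_a", "==", "alpha")
expr_b = json_filter("meta_b", ">=", 5)
expr_c = json_filter("meta_c", "==", True)

# Two conditions combined into one expression.
combined = f"{expr_a} and {expr_b}"

print(expr_a)
print(combined)
```

The resulting string can be passed as the `_filter` argument in the retrieval calls shown in this notebook.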

In [5]:
queries = ["this is expensive"]
top_k = 5
q_results = []
for que in queries:
    q_results.append(
        nvingest_retrieval(
            [que],
            collection_name=collection_name,
            host=f"http://{hostname}:19530",
            embedding_endpoint=f"http://{hostname}:8012/v1",
            hybrid=sparse, top_k=top_k, model_name=model_name, gpu_search=False,
            _filter='content_metadata["meta_a"] == "alpha"',
        )
    )

print(f"{q_results}")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /opt/conda/envs/nv_ingest_runtime/lib/python3.12/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!


[data: [[{'id': 459164003456523110, 'distance': 0.016393441706895828, 'entity': {'text': 'Stopping by Woods on a Snowy Evening, By Robert Frost\r\nFigure 1: Snowy Woods\r\nWhose woods these are I think I know. His house is in the village though; He will not see me \r\nstopping here; To watch his woods fill up with snow. \r\nMy little horse must think it queer; To stop without a farmhouse near; Between the woods and \r\nfrozen lake; The darkest evening of the year. \r\nHe gives his harness bells a shake; To ask if there is some mistake. The only other sound’s the \r\nsweep; Of easy wind and downy flake. \r\nThe woods are lovely, dark and deep, But I have promises to keep, And miles to go before I \r\nsleep, And miles to go before I sleep.\r\nFrost’s Collections\r\nFigure 2: Robert Frost', 'source': {'source_name': '/raid/nv-ingest/data/woods_frost.pdf', 'source_id': '/raid/nv-ingest/data/woods_frost.pdf', 'source_location': '', 'source_type': 'PDF', 'collection_id': '', 'date_created': 
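The raw printout is hard to scan, so it can be useful to reduce each hit to its source and metadata values. The sketch below is a hypothetical helper (`summarize_hits` is not part of nv-ingest). It assumes the nesting seen in the printed output: one result per retrieval call, one list of hits per query, and each hit a dict-like object with `distance` and `entity` keys. It runs here against mock data shaped like that output, so treat it as a starting point and verify it against the real return type.

```python
def summarize_hits(q_results, meta_fields):
    """Flatten nested search results into one small dict per hit.

    Hypothetical helper; the nested shape is inferred from the printed output.
    """
    rows = []
    for result in q_results:        # one entry per retrieval call
        for hits in result:         # one list of hits per query
            for hit in hits:
                entity = hit["entity"]
                meta = entity.get("content_metadata", {})
                row = {
                    "source": entity.get("source", {}).get("source_name"),
                    "distance": hit["distance"],
                }
                # Copy only the user-supplied metadata fields of interest.
                for field in meta_fields:
                    row[field] = meta.get(field)
                rows.append(row)
    return rows


# Mock data shaped like the printed output (values shortened for readability).
mock_results = [
    [
        [
            {
                "id": 1,
                "distance": 0.016,
                "entity": {
                    "source": {"source_name": "/raid/nv-ingest/data/woods_frost.pdf"},
                    "content_metadata": {"meta_a": "alpha", "meta_b": 5},
                },
            }
        ]
    ]
]

summary = summarize_hits(mock_results, ["meta_a", "meta_b"])
for row in summary:
    print(row)
```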

The second filter expression uses the `meta_b` field. Because both documents have a `meta_b` value greater than or equal to 5, the filter matches every available chunk, so results can come from both `woods_frost.pdf` and `multimodal_test.pdf`.

In [6]:
queries = ["this is expensive"]
top_k = 5
q_results = []
for que in queries:
    q_results.append(
        nvingest_retrieval(
            [que],
            collection_name=collection_name,
            host=f"http://{hostname}:19530",
            embedding_endpoint=f"http://{hostname}:8012/v1",
            hybrid=sparse, top_k=top_k, model_name=model_name, gpu_search=False,
            _filter='content_metadata["meta_b"] >= 5',
        )
    )

print(f"{q_results}")

[data: [[{'id': 459164003456523124, 'distance': 0.016393441706895828, 'entity': {'text': 'This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost   Hammer - Powerdrill - Bluetooth speaker - Minifridge - Premium desk fan Dollars $- - $20.00 - $40.00 - $60.00 - $80.00 - $100.00 - $120.00 - $140.00 - $160.00 Cost    Chart 1', 'source': {'source_name': '/raid/nv-ingest/data/multimodal_test.pdf', 'source_id': '/raid/nv-ingest/data/multimodal_test.pdf', 'source_location': '', 'source_type': 'PDF', 'collection_id': '', 'date_created': '2025-07-08T19:00:47.222326', 'last_modified': '2025-07-08T19:00:47.222219', 'summary': '', 'partition_id': -1, 'access_level': -1}, 'content_metadata': {'content_url': '', 'content_metadata': {'type': 'structured', 'description': 'Structured chart extracted from PDF document.', 'page_number': 0, 'hierarchy': {'page_count': 3, 'page': 0, 'block': -1, 'line': -1, 'span': -1, 'nearby_objects': {'text': {'content': [], 'bbox': [], 'typ

In the next retrieval run, we create a filter expression for the `meta_c` field. We grab all available chunks whose `meta_c` value is `True`. The results retrieved will be from `woods_frost.pdf`.

In [7]:
queries = ["this is expensive"]
top_k = 5
q_results = []
for que in queries:
    q_results.append(
        nvingest_retrieval(
            [que],
            collection_name=collection_name,
            host=f"http://{hostname}:19530",
            embedding_endpoint=f"http://{hostname}:8012/v1",
            hybrid=sparse, top_k=top_k, model_name=model_name, gpu_search=False,
            _filter='content_metadata["meta_c"] == True',
        )
    )

print(f"{q_results}")

[data: [[{'id': 459164003456523110, 'distance': 0.016393441706895828, 'entity': {'source': {'source_name': '/raid/nv-ingest/data/woods_frost.pdf', 'source_id': '/raid/nv-ingest/data/woods_frost.pdf', 'source_location': '', 'source_type': 'PDF', 'collection_id': '', 'date_created': '2024-04-30T18:02:30', 'last_modified': '2024-04-30T18:02:32', 'summary': '', 'partition_id': -1, 'access_level': -1}, 'content_metadata': {'content_url': '', 'content_metadata': {'type': 'text', 'description': 'Unstructured text from PDF document.', 'page_number': 0, 'hierarchy': {'page_count': 2, 'page': 0, 'block': -1, 'line': -1, 'span': -1, 'nearby_objects': {'text': {'content': [], 'bbox': [], 'type': []}, 'images': {'content': [], 'bbox': [], 'type': []}, 'structured': {'content': [], 'bbox': [], 'type': []}}}, 'subtype': '', 'start_time': -1, 'end_time': -1, 'location': None, 'max_dimensions': None}, 'audio_metadata': None, 'text_metadata': {'text_type': 'page', 'summary': '', 'keywords': '', 'languag

In the following retrieval run, we construct a filter expression using the `meta_d` field and retrieve all available chunks that have a `meta_d` value of less than 20. This should correspond to the five chunks in `woods_frost.pdf`.

In [9]:
queries = ["this is expensive"]
top_k = 5
q_results = []
for que in queries:
    q_results.append(
        nvingest_retrieval(
            [que],
            collection_name=collection_name,
            host=f"http://{hostname}:19530",
            embedding_endpoint=f"http://{hostname}:8012/v1",
            hybrid=sparse, top_k=top_k, model_name=model_name, gpu_search=False,
            _filter='content_metadata["meta_d"] < 20',
        )
    )

print(f"{q_results}")

[data: [[{'id': 459164003456523110, 'distance': 0.016393441706895828, 'entity': {'content_metadata': {'content_url': '', 'content_metadata': {'type': 'text', 'description': 'Unstructured text from PDF document.', 'page_number': 0, 'hierarchy': {'page_count': 2, 'page': 0, 'block': -1, 'line': -1, 'span': -1, 'nearby_objects': {'text': {'content': [], 'bbox': [], 'type': []}, 'images': {'content': [], 'bbox': [], 'type': []}, 'structured': {'content': [], 'bbox': [], 'type': []}}}, 'subtype': '', 'start_time': -1, 'end_time': -1, 'location': None, 'max_dimensions': None}, 'audio_metadata': None, 'text_metadata': {'text_type': 'page', 'summary': '', 'keywords': '', 'language': 'en', 'text_location': [-1, -1, -1, -1], 'text_location_max_dimensions': [-1, -1]}, 'image_metadata': None, 'table_metadata': None, 'chart_metadata': None, 'error_metadata': None, 'info_message_metadata': None, 'debug_metadata': None, 'raise_on_failure': False, 'meta_a': 'alpha', 'meta_b': 5, 'meta_c': True, 'meta_
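To sanity-check which documents each filter in this notebook should match, the same comparisons can be evaluated locally against the dummy metadata. This is a minimal sketch in plain Python. It does not query Milvus and only mirrors the four comparisons used above. The `matching_sources` helper and the shortened file names are for illustration only.

```python
import operator

# The dummy metadata from this notebook, one dict per document.
rows = [
    {"source": "woods_frost.pdf", "meta_a": "alpha", "meta_b": 5, "meta_c": True, "meta_d": 10.0},
    {"source": "multimodal_test.pdf", "meta_a": "bravo", "meta_b": 10, "meta_c": False, "meta_d": 20.0},
]

# Comparison operators used by the filter expressions in this notebook.
OPS = {"==": operator.eq, ">=": operator.ge, "<": operator.lt}


def matching_sources(field, op, value):
    """Return the documents whose metadata satisfies `field op value`."""
    return [row["source"] for row in rows if OPS[op](row[field], value)]


expected = {
    'meta_a == "alpha"': matching_sources("meta_a", "==", "alpha"),
    "meta_b >= 5": matching_sources("meta_b", ">=", 5),
    "meta_c == True": matching_sources("meta_c", "==", True),
    "meta_d < 20": matching_sources("meta_d", "<", 20),
}

for label, sources in expected.items():
    print(f"{label} -> {sources}")
```

Only the `meta_b >= 5` filter matches both documents; the other three match `woods_frost.pdf` alone, which agrees with the retrieval results described above.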