# NV-Ingest: Python Client Quick Start Guide

This notebook provides a quick start guide to using the NV-Ingest Python API to create a client that interacts with a running NV-Ingest cluster. It will walk through the following:

- Define the task configuration for an NV-Ingest job
- Submit a job to the NV-Ingest cluster and retrieve completed results
- Investigate the multimodal extractions

First, specify a few parameters for connecting to the NV-Ingest cluster, along with a sample document to use in the examples.

In [None]:
import os

# client config
HTTP_HOST = os.environ.get('HTTP_HOST', "localhost")
HTTP_PORT = os.environ.get('HTTP_PORT', "7670")
TASK_QUEUE = os.environ.get('TASK_QUEUE', "morpheus_task_queue")

# minio config
MINIO_ACCESS_KEY = os.environ.get('MINIO_ACCESS_KEY', "minioadmin")
MINIO_SECRET_KEY = os.environ.get('MINIO_SECRET_KEY', "minioadmin")

# time to wait for job to complete
DEFAULT_JOB_TIMEOUT = 90

# sample input file and output directory
SAMPLE_PDF = "/workspace/nv-ingest/data/multimodal_test.pdf"
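
Values read with `os.environ.get` are strings, even when the default is used, so a numeric setting such as the port needs an explicit conversion before it can be treated as a number. The sketch below shows one way to do that; `read_port` is a hypothetical helper written for this guide and is not part of the NV-Ingest client.

```python
import os


def read_port(name: str, default: str) -> int:
    """Read a TCP port from the environment, falling back to `default`.

    Hypothetical helper for illustration; not part of nv_ingest_client.
    """
    raw = os.environ.get(name, default)
    port = int(raw)  # raises ValueError if the value is not numeric
    if not 0 < port < 65536:
        raise ValueError(f"{name}={raw!r} is not a valid TCP port")
    return port


# Same variable name and default as the configuration cell above.
http_port = read_port("HTTP_PORT", "7670")
```

Validating the value up front turns a malformed environment variable into an immediate, readable error instead of a connection failure later on.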

## The NV-Ingest Python Client

In [None]:
from base64 import b64decode
import time

from nv_ingest_client.client import Ingestor

from IPython import display

Each ingest job includes a set of tasks that define the operations performed during ingestion, so each job can carry its own ingestion instructions. Here we define a simple extraction-oriented job; the full list of supported tasks is below:

- `extract` : Performs multimodal extractions from a document, including text, images, and tables.
- `split` : Chunk the text into smaller chunks, useful for storing in a vector database for retrieval applications.
- `dedup` : Identifies duplicate images in a document that can be filtered out to remove data redundancy.
- `filter` : Filters out images that are likely not useful using some heuristics, including size and aspect ratio.
- `embed` : Computes an embedding for the extracted content using a [`nvidia/nv-embedqa-e5-v5`](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nv-embedqa-e5-v5) NVIDIA Inference Microservice (NIM) by default.
- `store` : Save the extracted tables or images to an S3 compliant object store like MinIO.
- `vdb_upload` : Save embeddings, chunks, and metadata to a Milvus vector database.

We'll use the `Ingestor` interface to chain together an extraction task and a deduplication task to ingest our sample PDF.

In [None]:
SAMPLE_PDF = "../../../data/multimodal_test.pdf"

ingestor = (
    Ingestor(message_client_hostname=HTTP_HOST)
    .files(SAMPLE_PDF)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        text_depth="document",
    ).dedup(
        content_type="image",
        filter=True,
    )
)

Submit the job to our NV-Ingest cluster. The `[0]` index selects the results for the first (and here only) submitted document.

In [None]:
generated_metadata = ingestor.ingest()[0]

## Explore the Outputs

Let's explore elements of the NV-Ingest output. As data flows through an NV-Ingest pipeline, a number of extractions and transformations are performed, and the enriched data is stored in a rich metadata hierarchy. The end result is a list of dictionaries, each representing one extracted piece of content. The elements most commonly pulled from one of these dictionaries are the extracted content and the text representation of that content. The next few cells demonstrate interacting with the metadata, pulling out these elements, and visualizing them. Note that a value of `-1` in a positional field means positional resolution is not applicable. Positive numbers represent valid positional data.

For a more complete description of metadata elements, view the data dictionary.

[https://github.com/NVIDIA/nv-ingest/blob/main/docs/content-metadata.md](https://github.com/NVIDIA/nv-ingest/blob/main/docs/content-metadata.md)
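
Before drilling into individual elements, it can help to see how many extractions of each type came back. The sketch below counts elements by `metadata['content_metadata']['type']`. The `sample_results` list is a hypothetical, heavily trimmed stand-in for `generated_metadata` so that the snippet runs without a cluster; real entries carry many more fields.

```python
from collections import Counter


def summarize_types(results: list) -> Counter:
    """Count extracted elements by their `content_metadata` type."""
    return Counter(
        item["metadata"]["content_metadata"]["type"] for item in results
    )


# Hypothetical stand-in for `generated_metadata`; only the keys used here.
sample_results = [
    {"metadata": {"content": "...", "content_metadata": {"type": "image"}}},
    {"metadata": {"content": "...", "content_metadata": {"type": "text"}}},
    {"metadata": {"content": "...", "content_metadata": {"type": "structured"}}},
    {"metadata": {"content": "...", "content_metadata": {"type": "structured"}}},
]

type_counts = summarize_types(sample_results)
print(dict(type_counts))  # {'image': 1, 'text': 1, 'structured': 2}
```

With a live job, calling `summarize_types(generated_metadata)` should give the same kind of overview for the sample PDF.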

In [None]:
def redact_metadata_helper(metadata: dict) -> dict:
    """A simple helper function to redact `metadata["content"]` to improve readability."""

    # Work on a shallow copy of the argument so the original metadata keeps its content.
    metadata_redact = metadata.copy()
    metadata_redact["content"] = "<---Redacted for readability--->"

    return metadata_redact

### Explore Output - Text

This cell depicts the full metadata hierarchy for a text extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, text in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source document that is the basis of the ingest job.
- `text_metadata` - Contains information about the text extraction, including detected language, among others - this section will only exist when `metadata['content_metadata']['type'] == 'text'`.

In [None]:
text_metadata = generated_metadata[3]["metadata"]
redact_metadata_helper(text_metadata)
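
The index `3` used above matches the sample PDF and is likely to differ for other documents. A more portable approach is to select an element by its `content_metadata` type. The sketch below uses a hypothetical `first_of_type` helper and a trimmed stand-in for `generated_metadata`; neither is part of the NV-Ingest client.

```python
from typing import Optional


def first_of_type(results: list, content_type: str) -> Optional[dict]:
    """Return the metadata of the first element with the given type, if any."""
    for item in results:
        metadata = item["metadata"]
        if metadata["content_metadata"]["type"] == content_type:
            return metadata
    return None


# Hypothetical stand-in for `generated_metadata`; only the keys used here.
sample_results = [
    {"metadata": {"content": "<base64>", "content_metadata": {"type": "image"}}},
    {"metadata": {"content": "Some text", "content_metadata": {"type": "text"}}},
]

sample_text_metadata = first_of_type(sample_results, "text")
print(sample_text_metadata["content"])  # Some text
```

Returning `None` when nothing matches lets the caller decide how to handle documents that contain no text.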

View the text extracted from the sample document.

In [None]:
text_metadata["content"]

### Explore Output - Tables

This cell depicts the full metadata hierarchy for a table extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, a base64 encoded image of the extracted table in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source and storage path of an extracted table in an S3 compliant object store.
- `table_metadata` - Contains the text representation of the table, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'structured'`.

Note, `table_metadata` will store chart and table extractions. They are distinguished by `metadata['content_metadata']['subtype']`.

In [None]:
table_metadata = generated_metadata[4]["metadata"]
redact_metadata_helper(table_metadata)
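
Because tables and charts share the `structured` type, grouping by `subtype` is a convenient way to separate them. The sketch below assumes the subtype values are the strings `"table"` and `"chart"`, which this guide does not state explicitly, and it runs against a hypothetical stand-in for `generated_metadata`.

```python
from collections import defaultdict


def structured_by_subtype(results: list) -> dict:
    """Group metadata of `structured` elements by their subtype."""
    groups = defaultdict(list)
    for item in results:
        content_metadata = item["metadata"]["content_metadata"]
        if content_metadata["type"] == "structured":
            groups[content_metadata["subtype"]].append(item["metadata"])
    return dict(groups)


# Hypothetical stand-in; subtype values "table" and "chart" are assumed.
sample_results = [
    {"metadata": {"content_metadata": {"type": "text", "subtype": ""}}},
    {"metadata": {"content_metadata": {"type": "structured", "subtype": "table"}}},
    {"metadata": {"content_metadata": {"type": "structured", "subtype": "chart"}}},
    {"metadata": {"content_metadata": {"type": "structured", "subtype": "table"}}},
]

groups = structured_by_subtype(sample_results)
print({subtype: len(items) for subtype, items in groups.items()})
```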

Visualize the table contained within the extracted metadata.

In [None]:
display.Image(b64decode(table_metadata["content"]))

View the corresponding text that maps to this table. This text could be embedded to support multimodal retrieval workflows.

In [None]:
table_metadata["table_metadata"]["table_content"]

### Explore Output - Charts

This cell depicts the full metadata hierarchy for a chart extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, a base64 encoded image of the extracted chart in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source and storage path of an extracted chart in an S3 compliant object store.
- `table_metadata` - Contains the text representation of the chart, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'structured'`.

Note, `table_metadata` will store chart and table extractions. They are distinguished by `metadata['content_metadata']['subtype']`.

In [None]:
chart_metadata = generated_metadata[7]["metadata"]
chart_metadata_redact = chart_metadata.copy()
chart_metadata_redact["content"] = "<---Redacted for readability--->"
chart_metadata_redact

Visualize the chart contained within the extracted metadata.

In [None]:
display.Image(b64decode(chart_metadata["content"]))

View the corresponding text that maps to this chart. This text could be embedded to support multimodal retrieval workflows.

In [None]:
chart_metadata["table_metadata"]["table_content"]
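
To feed extractions into an embedding step, each element needs a single text representation. Based on the fields described in this guide, that is `content` for text elements and `table_metadata['table_content']` for tables and charts. The helper below is a hypothetical sketch of that selection and is not the logic the `embed` task itself uses; it returns `None` for images because this guide does not describe a text field for them.

```python
from typing import Optional


def text_for_embedding(metadata: dict) -> Optional[str]:
    """Pick the text representation of one extracted element."""
    content_type = metadata["content_metadata"]["type"]
    if content_type == "text":
        return metadata["content"]
    if content_type == "structured":
        # Tables and charts both keep their text under `table_metadata`.
        return metadata["table_metadata"]["table_content"]
    return None  # e.g. images: no text field is assumed in this sketch


# Hypothetical, trimmed metadata dictionaries.
sample_text = {"content": "A dog on a beach.", "content_metadata": {"type": "text"}}
sample_chart = {
    "content": "<base64 image>",
    "content_metadata": {"type": "structured"},
    "table_metadata": {"table_content": "Year 2020 2021 Sales 10 12"},
}

print(text_for_embedding(sample_text))
print(text_for_embedding(sample_chart))
```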

### Explore Output - Images

This cell depicts the full metadata hierarchy for an image extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, a base64 encoded image extracted from the document in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source and storage path of an extracted image in an S3 compliant object store.
- `image_metadata` - Contains the image type, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'image'`.

In [None]:
img_metadata = generated_metadata[1]["metadata"]
redact_metadata_helper(img_metadata)

Visualize the image contained within the extracted metadata.

In [None]:
display.Image(b64decode(img_metadata["content"]))
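
When saving a decoded image to disk instead of displaying it, the file extension has to match the image encoding, which this guide does not specify. One option is to inspect the leading bytes of the decoded payload. The sketch below checks for the standard PNG and JPEG signatures; `sniff_image_format` is a hypothetical helper, and the sample payload is fabricated from the PNG signature rather than taken from a real extraction.

```python
from base64 import b64decode, b64encode

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(b64_content: str) -> str:
    """Guess the image format of a base64 payload from its leading bytes."""
    raw = b64decode(b64_content)
    if raw.startswith(PNG_SIGNATURE):
        return "png"
    if raw.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


# Fabricated payload: a PNG signature followed by placeholder bytes.
fake_png = b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
print(sniff_image_format(fake_png))  # png
```

With a live job, `sniff_image_format(img_metadata["content"])` could be used to choose the extension before writing `b64decode(img_metadata["content"])` to a file.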

### Optional: Expanded Task Configuration

This section illustrates usage of the remaining task types used when supporting retrieval workflows.

- `filter` : Filters out images that are likely not useful using some heuristics, including size and aspect ratio.
- `split` : Chunks the text into smaller pieces, useful for storing in a vector database for retrieval applications.
- `store` : Stores extracted content to an S3 compliant object store (MinIO by default) and updates the `source_metadata` with the corresponding stored location.
- `embed` : Computes an embedding for the extracted content using a [`nvidia/nv-embedqa-e5-v5`](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nv-embedqa-e5-v5) NVIDIA Inference Microservice (NIM) by default.
- `vdb_upload` : Inserts ingested content into a Milvus vector database to support retrieval use cases.
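
To make the size and aspect ratio heuristics of `filter` more concrete, the sketch below shows one plausible predicate using the thresholds that appear in the job specification in this section (`min_size=128`, `min_aspect_ratio=0.2`, `max_aspect_ratio=5.0`). It is an illustration only: `keep_image` is hypothetical, and comparing `min_size` against the shorter side in pixels is an assumption, so the service may measure size differently.

```python
def keep_image(
    width: int,
    height: int,
    min_size: int = 128,
    min_aspect_ratio: float = 0.2,
    max_aspect_ratio: float = 5.0,
) -> bool:
    """Return True if an image passes the illustrative size and shape checks."""
    if width <= 0 or height <= 0:
        return False
    # Assumption: `min_size` is compared against the shorter side in pixels.
    if min(width, height) < min_size:
        return False
    aspect_ratio = width / height
    return min_aspect_ratio <= aspect_ratio <= max_aspect_ratio


print(keep_image(640, 480))   # True: large enough, ratio about 1.33
print(keep_image(64, 64))     # False: below the minimum size
print(keep_image(2000, 200))  # False: ratio 10.0 exceeds the maximum
```

Checks like these tend to remove small icons and thin decorative banners, which rarely carry useful content for retrieval.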

Define the ingest job specification. Here the task configuration is expanded, but it requires the ancillary services (Embedding NIM, MinIO object store, and Milvus vector database) to be up and running in order to return metadata to the client.

In [None]:
ingestor = (
    Ingestor(message_client_hostname=HTTP_HOST)
    .files(SAMPLE_PDF)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
        text_depth="document",
    ).dedup(
        content_type="image",
        filter=True,
    ).filter(
        content_type="image",
        min_size=128,
        max_aspect_ratio=5.0,
        min_aspect_ratio=0.2,
        filter=True,
    ).split(
        split_by="word",
        split_length=300,
        split_overlap=10,
        max_character_length=5000,
        sentence_window_size=0,
    ).store(
        structured=True,
        images=True,
        store_method="minio",
        params={
            "access_key": MINIO_ACCESS_KEY, 
            "secret_key": MINIO_SECRET_KEY,
        }
    )
    .embed()
    .vdb_upload()
)
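
The `split_by`, `split_length`, and `split_overlap` arguments describe word-based chunking with overlap between neighboring chunks. The sketch below is a generic sliding-window illustration of that idea, not NV-Ingest's implementation, and it does not model `max_character_length` or `sentence_window_size`. The `split_words` function is hypothetical.

```python
def split_words(text: str, split_length: int = 300, split_overlap: int = 10) -> list:
    """Split text into chunks of `split_length` words that overlap by `split_overlap`."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("require split_length > 0 and 0 <= split_overlap < split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = split_words(demo_text, split_length=4, split_overlap=1)
print(demo_chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap repeats a few words at each chunk boundary, so a sentence that straddles two chunks is still partly visible in both.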

Submit the job and retrieve the results

In [None]:
generated_metadata = ingestor.ingest()[0]

Query the Milvus VDB

In [None]:
from nv_ingest_client.util.milvus import nvingest_retrieval

query = "What is the dog doing and where?"

nvingest_retrieval(
    [query],
    "nv_ingest_collection",
    hybrid=False,
    embedding_endpoint="http://localhost:8012/v1",
    model_name="nvidia/nv-embedqa-e5-v5",
    top_k=1,
    gpu_search=True,
)