# Multimodal RAG with LangChain

This cookbook shows how to use LangChain to query the text and tables that nv-ingest's PDF extraction tools pull out of a document.

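To start, we'll need to make sure the required dependencies are installed. The cells below import `nv_ingest_client`, `langchain_core`, `langchain_chroma`, and `langchain_nvidia_ai_endpoints`.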

Then, we'll use nv-ingest to parse an example PDF that contains text, tables, charts, and images. The nv-ingest microservice must be up and running at localhost:7670, along with the supporting NIMs. To set this up, follow the nv-ingest [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart). Once the microservice is ready, we can create a job with the nv-ingest Python client.
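Before creating a job, it can help to confirm that something is actually listening on the expected port. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper name, not part of the nv-ingest client, and a successful TCP connection only shows that the port is open, not that the service and its NIMs are healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 7670 is the port used in this cookbook; adjust it if your deployment differs.
    print(is_port_open("localhost", 7670))
```

If this prints `False`, the quickstart steps are the place to look before running the cells below.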

In [1]:
from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.message_clients.rest.rest_client import RestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask


from nv_ingest_client.util.file_processing.extract import extract_file_content
import logging
import time

logger = logging.getLogger("nv_ingest_client")

file_name = "../data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)

job_spec = JobSpec(
    document_type=file_type,
    payload=file_content,
    source_id=file_name,
    source_name=file_name,
    extended_options={
        "tracing_options": {
            "trace": True,
            "ts_send": time.time_ns()
        }
    },
)

Then we can add a task that extracts the text and tables from the example PDF, submit the job, and fetch the result.

In [2]:
extract_task = ExtractTask(
    document_type=file_type,
    extract_text=True,
    extract_images=False,
    extract_tables=True,
)


job_spec.add_task(extract_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)

job_id = client.add_job(job_spec)

client.submit_job(job_id, "morpheus_task_queue")

result = client.fetch_job_result(job_id, timeout=60)

In [3]:
result[0][0]

{'document_type': 'text',
 'metadata': {'content': 'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs. Section One\r\nThis is the first section of the document. It has some more placeholder text to show how \r\nthe document looks like.

Now we have the extraction results in the nv-ingest metadata format. We'll pull the extracted content out of each element and load it into LangChain documents: `text` elements go into `texts`, and `structured` elements go into `tables`.

In [4]:
from langchain_core.documents import Document

texts = []
tables = []
for element in result[0]:
    if element['document_type'] == 'text':
        texts.append(Document(element['metadata']['content']))
    elif element['document_type'] == 'structured':
        tables.append(Document(element['metadata']['table_metadata']['table_content']))
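To make the shape of the traversal above explicit, here is a small sketch that runs without nv-ingest or LangChain. The `mock_elements` list is a hand-written stand-in for `result[0]` that contains only the keys the cell above reads, and `split_elements` is a hypothetical helper that returns plain strings instead of `Document` objects. The third element uses a type this cookbook does not handle, to show that it is skipped.

```python
from typing import Any


def split_elements(elements: list[dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Separate text content from structured (table/chart) content."""
    texts: list[str] = []
    tables: list[str] = []
    for element in elements:
        metadata = element.get("metadata", {})
        if element.get("document_type") == "text":
            texts.append(metadata.get("content", ""))
        elif element.get("document_type") == "structured":
            # Guard against a missing or null table_metadata entry.
            table_metadata = metadata.get("table_metadata") or {}
            tables.append(table_metadata.get("table_content", ""))
        # Any other document_type is ignored, as in the cell above.
    return texts, tables


# Hand-written stand-in for result[0]; real elements carry many more fields.
mock_elements = [
    {"document_type": "text", "metadata": {"content": "Table 1 describes some animals."}},
    {
        "document_type": "structured",
        "metadata": {"table_metadata": {"table_content": "Dog Chasing a squirrel In the front yard"}},
    },
    {"document_type": "image", "metadata": {}},
]

text_chunks, table_chunks = split_elements(mock_elements)
print(len(text_chunks), len(table_chunks))  # prints: 1 1
```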

In [5]:
texts

[Document(metadata={}, page_content='TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs. Section One\r\nThis is the first section of the document. It has some more placeholder text to show how \r\nthe document looks like. The text is no

In [6]:
tables

[Document(metadata={}, page_content='locations. Animal Activity Place Giraffe Driving a car. At the beach Lion Putting on sunscreen At the park. Cat Jumping onto a laptop In a home office Dog Chasing a squirrel In the front yard'),
 Document(metadata={}, page_content='This chart shows some gadgets, and some very fictitious costs. >\\n7938.758 ext. Print & Maroon Bookshelf Fine Art Poems Collection dla Cemicon Diamtháhn | Gadgets and their cost\nSollywood for Coasters | 19875.075     t158.281 \n Hammer | 19871.55 \n Powerdrill | 12044.625 \n Bluetooth speaker | 7598.07 \n Minifridge | 9916.305 \n Premium desk    Hammer - Powerdrill - Bluetooth speaker - Minifridge - Premium desk fan Dollars $- - $20.00 - $40.00 - $60.00 - $80.00 - $100.00 - $120.00 - $140.00 - $160.00 Cost    Chart 1 - Gadgets and their cost'),
 Document(metadata={}, page_content='This table shows some popular colors that cars might come in. Car Color1 Color2 Color3 Coupe White Silver Flat Gray Sedan White Metallic Gray

Next, we'll set our NVIDIA API key and create a vector store that embeds and stores our text and table documents using an NVIDIA embedding model.

In [7]:
import os
from langchain_chroma import Chroma
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# TODO: Add your NVIDIA API key here
os.environ["NVIDIA_API_KEY"] = "<YOUR_NVIDIA_API_KEY>"

embedding = NVIDIAEmbeddings()
vectorstore = Chroma.from_documents(documents=(texts+tables), embedding=embedding)

Then, we'll create a retriever from our vector store that will allow us to retrieve our documents by semantic similarity, and an LLM to synthesize the final answer from the retrieved documents.

In [8]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

retriever = vectorstore.as_retriever()

llm = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")
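Retrieving by semantic similarity, as the retriever above does, amounts to ranking stored embedding vectors by how close they are to the query's embedding. The sketch below illustrates the idea with cosine similarity over made-up three-dimensional vectors. The vectors, labels, and the `top_k` helper are all hypothetical, real embeddings have far more dimensions, and the metric a vector store actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
    return ranked[:k]


# Made-up vectors standing in for embeddings of three extracted chunks.
toy_vectors = {
    "animals table": [0.9, 0.1, 0.0],
    "gadget chart": [0.1, 0.9, 0.1],
    "car colors table": [0.0, 0.2, 0.9],
}
# Made-up embedding for a question about the animals table.
query_vector = [0.8, 0.2, 0.1]

print(top_k(query_vector, toy_vectors, k=2))  # prints: ['animals table', 'gadget chart']
```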

Finally, we'll create a RAG chain that we can use to query our PDF in natural language.

In [9]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Keep the answer concise."
    "\n\n"
    "{context}"
    # Blank line so the question does not run into the end of the context
    "\n\n"
    "Question: {question}"
)

prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [10]:
rag_chain.invoke("What is the dog doing and where?")

'The dog is chasing a squirrel in the front yard.'
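The chain above passes the retriever's output, a list of `Document` objects, straight into the prompt, so the context is rendered as the list's string form, metadata included. A common alternative is to join only the page contents before they reach the prompt. The sketch below shows that step on its own. `Doc` is a stand-in dataclass so the example runs without LangChain, and `format_docs` is a conventional but hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the page content of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("Dog Chasing a squirrel In the front yard"),
    Doc("Cat Jumping onto a laptop In a home office"),
]
context = format_docs(docs)
print(context)
```

In the chain, this would most likely be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, assuming the installed LangChain version coerces a plain function into a runnable when it is piped after one.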

Alternatively, we can use the Milvus vector database packaged with nv-ingest. This requires pymilvus, and the Milvus, etcd, Attu, and MinIO microservices must be up and [running](https://github.com/NVIDIA/nv-ingest/blob/main/docs/deployment.md#launch-nv-ingest-micro-services).

In [None]:
%pip install -qU pymilvus langchain_milvus

In [11]:
from langchain_milvus import Milvus

vector_store = Milvus(
    embedding_function=embedding,
    connection_args={"uri": "http://localhost:19530"},
)

Then we'll load our documents into the new store.

In [12]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(texts+tables))]
vector_store.add_documents(documents=texts+tables, ids=uuids)

['2cfd802a-c655-495b-acc8-a51ec098bc95',
 '4c4fc62d-ae33-4f89-b0e0-667f702d5dbe',
 '8b9e6af9-7c48-4165-b9ab-116ed8a834e6',
 '45e6cd4e-0bee-47e9-bec1-3bacc7cb37f2',
 '71fe6278-99bf-48bb-8aae-85bad4370466']
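The IDs above come from `uuid4`, so they are random and change every time the cell is re-run. If stable IDs are preferable, one option is to derive each ID from the chunk's content with `uuid5`. This is only a sketch of that idea: `content_id` and `CHUNK_NAMESPACE` are hypothetical names, the namespace string is arbitrary, and whether a store rejects, overwrites, or duplicates a repeated ID depends on the vector store and its configuration.

```python
import uuid

# Arbitrary, fixed namespace for this example; any constant UUID would do.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "nv-ingest-cookbook.example")


def content_id(text: str) -> str:
    """Derive a deterministic UUID string from a chunk's text."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, text))


# Made-up chunk contents standing in for the page_content of texts + tables.
chunks = [
    "Dog Chasing a squirrel In the front yard",
    "Cat Jumping onto a laptop In a home office",
]
ids = [content_id(chunk) for chunk in chunks]

# The same content always maps to the same ID.
print(ids[0] == content_id(chunks[0]))  # prints: True
```

Note that two chunks with identical text would receive the same ID under this scheme, which may or may not be what you want.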

In [13]:
# Build the retriever from the Milvus store (vector_store), not the Chroma store (vectorstore)
retriever = vector_store.as_retriever()

In [14]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Keep the answer concise."
    "\n\n"
    "{context}"
    # Blank line so the question does not run into the end of the context
    "\n\n"
    "Question: {question}"
)

prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [15]:
rag_chain.invoke("What is the dog doing and where?")

'The dog is chasing a squirrel in the front yard.'