# Multimodal RAG using a simple AI Agent with complex PDF files

Let's start by loading the environment variables we need to use.

In [None]:
import sys
print(sys.executable)

import pydantic
print(pydantic.__version__)

In [177]:
import os
from dotenv import load_dotenv

load_dotenv()
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
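
`os.getenv` returns `None` when a variable is missing, so a misspelled key in `.env` is easy to overlook until a later cell fails. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail early if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file.")
    return value


# Hypothetical usage: LLAMA_CLOUD_API_KEY = require_env("LLAMA_CLOUD_API_KEY")
```

Failing at load time keeps the error message close to its cause instead of surfacing it during parsing.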

## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [178]:
MODEL = "llama3.2-vision:11b-instruct-q4_K_M"

In [179]:
print(MODEL)

llama3.2-vision:11b-instruct-q4_K_M


### Parsing the raw PDF with LlamaParse to get structured JSON output

In [6]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_parse import LlamaParse

# ConocoPhillips AIM (Analyst & Investor Meeting) presentation
pdf_file = "./2023-conocophillips-aim-presentation.pdf"

not_from_cache = False
parser_txt = LlamaParse(verbose=True, invalidate_cache=not_from_cache, result_type="text")
parser_md = LlamaParse(verbose=True, invalidate_cache=not_from_cache, result_type="markdown")

In [8]:
print("Parsing text...")
docs_text = parser_txt.load_data(pdf_file)
print("Parsing PDF file...")
md_json_objs = parser_md.get_json_result(pdf_file)
md_json_list = md_json_objs[0]["pages"]

Parsing text...
Started parsing the file under job_id 6703c119-5620-4e4b-9d64-4fdd25e2fb2c
Parsing PDF file...
Started parsing the file under job_id f51bdebf-3653-4264-ae88-c990c01cb515


### Print the JSON output of one page as an example

In [9]:
print(md_json_list[5]["md"])

# We Are Committed to Our Returns-Focused Value Proposition

# Triple Mandate

# Foundational Principles

# Clear and Consistent Priorities

Aligned to Business Realities

|S=|Sustain production|and pay dividend|
|---|---|---|
|Balance Sheet|Strength|Strength|

# MEET

# DELIVER

# PATHWAY DEMAND

# TRANSITION COMPETITIVE RETURNS

Annual dividend growth

Disciplined Investments
Peer-Leading Distributions
'A-rated balance sheet
# ACHIEVE NET-ZERO EMISSIONS AMBITION

# ESG

Excellence

&gt;30% of CFO shareholder payout

# Deliver Superior Returns Through Cycles

Disciplined investment to enhance returns

Scope Cash from operations (CFO) is a non-GAAP measure defined in the Appendix and 2 emissions on a gross operated and net equity basis.

ConocoPhillips


### Extract images as dicts from parser

In [70]:
image_dicts = parser_md.get_images(md_json_objs, download_path="llm_images")
# print one image dict as example
print(image_dicts[0])

> Images for page 1: [{'name': 'img_p0_1.png', 'height': 2250, 'width': 4000, 'x': 0, 'y': -1.2069939998582413e-05, 'original_width': 4000, 'original_height': 2250, 'ocr': [{'x': 210, 'y': 237, 'w': 673, 'h': 125, 'confidence': '0.9448544486543237', 'text': 'ConocoPhillips'}, {'x': 1376, 'y': 1567, 'w': 2423, 'h': 209, 'confidence': '0.9964726301497157', 'text': '2023 Analyst & Investor Meeting'}], 'path': 'llm_images\\f51bdebf-3653-4264-ae88-c990c01cb515-img_p0_1.png', 'job_id': 'f51bdebf-3653-4264-ae88-c990c01cb515', 'original_file_path': './2023-conocophillips-aim-presentation.pdf', 'page_number': 1}]
> Images for page 2: [{'name': 'img_p1_1.png', 'height': 2250, 'width': 4000, 'x': 0, 'y': -1.2069939998582413e-05, 'original_width': 4000, 'original_height': 2250, 'ocr': [{'x': 116, 'y': 95, 'w': 795, 'h': 157, 'confidence': '0.9882128832729178', 'text': 'Todays Agenda'}, {'x': 332, 'y': 370, 'w': 344, 'h': 124, 'confidence': '0.9999778319375243', 'text': 'Opening'}, {'x': 1535, 'y':

### Build Multimodal Index
In this section we build the multimodal index over the parsed deck.

We do this by creating one text node per page, with metadata that references the path of that page's image.

In this example we index only the text nodes for retrieval. Each text node carries both the parsed text and a reference to the page screenshot.

#### Get Text Nodes

In [71]:
from pathlib import Path

def create_image_index(image_dicts):
    """Map page numbers to image paths.

    The result has the following format:

    {
        1: [Path("path/to/image1"), Path("path/to/image2")],
        2: [Path("path/to/image3"), Path("path/to/image4")],
    }
    """
    image_index = {}

    for image_dict in image_dicts:
        page_number = image_dict["page_number"]
        image_path = Path(image_dict["path"])
        # setdefault creates the list the first time a page number is seen
        image_index.setdefault(page_number, []).append(image_path)

    return image_index
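
To make the mapping concrete, here is a small sketch that applies the same grouping to three hand-written image dicts. The file names are made up, and `group_images_by_page` is a stand-in for `create_image_index` so that the block runs by itself.

```python
from collections import defaultdict
from pathlib import Path


def group_images_by_page(image_dicts):
    """Stand-in for create_image_index: group image paths by 1-based page number."""
    grouped = defaultdict(list)
    for image_dict in image_dicts:
        grouped[image_dict["page_number"]].append(Path(image_dict["path"]))
    return dict(grouped)


# Hypothetical sample input; the real dicts come from parser_md.get_images(...)
sample_image_dicts = [
    {"page_number": 1, "path": "llm_images/page1_a.png"},
    {"page_number": 1, "path": "llm_images/page1_b.png"},
    {"page_number": 2, "path": "llm_images/page2_a.png"},
]

sample_index = group_images_by_page(sample_image_dicts)
print([p.name for p in sample_index[1]])  # prints ['page1_a.png', 'page1_b.png']
```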

In [95]:
from copy import deepcopy

from llama_index.core.schema import TextNode

# attach image and markdown metadata to the text nodes
def get_text_nodes(docs, json_dicts=None, image_dicts=None):
    """Split docs into one node per page, using "---" as the page separator."""
    nodes = []

    # Note: we assume that chunk i of the text output corresponds to page i + 1.
    # Images are matched to chunks by that page number.
    image_index = create_image_index(image_dicts) if image_dicts is not None else None

    md_texts = [d["md"] for d in json_dicts] if json_dicts is not None else None

    doc_chunks = [c for d in docs for c in d.text.split("---")]
    for idx, doc_chunk in enumerate(doc_chunks):
        page_num = idx + 1
        chunk_metadata = {"page_num": page_num}
        if image_index:
            # pages without extracted images get an empty list instead of a KeyError
            chunk_metadata["image_paths"] = [str(path) for path in image_index.get(page_num, [])]
        # guard against more text chunks than parsed markdown pages
        if md_texts is not None and idx < len(md_texts):
            chunk_metadata["parsed_text_markdown"] = md_texts[idx]
        chunk_metadata["parsed_text"] = doc_chunk
        node = TextNode(
            text=doc_chunk,
            metadata=chunk_metadata,
        )
        nodes.append(node)

    return nodes

In [96]:
# this will split into pages
text_nodes = get_text_nodes(docs_text, json_dicts=md_json_list, image_dicts=image_dicts)
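
The page split relies on the `split("---")` call in `get_text_nodes`, which treats `---` as the page separator in the text output. A minimal sketch of that step on a hand-written three-page string (the sample text is hypothetical):

```python
# Hypothetical text-mode output for a three-page document, with "---" between pages
sample_text = "Title slide\n---\nAgenda\n---\nClosing remarks"

# Enumerate from 1 so that chunk positions line up with 1-based page numbers
pages = {
    page_num: chunk.strip()
    for page_num, chunk in enumerate(sample_text.split("---"), start=1)
}

print(pages[2])  # prints Agenda
```

One caveat of this approach: if a page's own text contains `---`, it is split into extra chunks and every later page number shifts.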

In [None]:
print(text_nodes[6].get_content(metadata_mode="all"))

page_num: 7
image_paths: ['llm_images\\f51bdebf-3653-4264-ae88-c990c01cb515-img_p6_1.png']
parsed_text_markdown: # We Are Continuously Improving

|Return on Capital Employed|2016|2019|2022| |
|---|---|---|---|---|
| | |-4%|10%|27%|

Return of Capital
$1.11/share
$6.45/share
$11.73/share
Net Debt
$24B
$7B
$7B
Cash From Operations
$5B
$8B
$12B
$5B
$29B
$18B
Free Cash Flow
$40/BBL WTI
Resource
~10 BBOE
~15 BBOE
~20 BBOE
Production
1.6 MMBOED
1.3 MMBOED
1.7 MMBOED
Emissions Intensity?
~39
~36
~22
1 Defined in the Appendix and presented on a per-share basis using average outstanding diluted shares. 2 Gross operated GHG emissions (Scope 1 and 2), 2022 is a preliminary estimate.

Cash from operations (CFO), free cash flow (FCF), net debt and return on capital employed (ROCE) are non-GAAP measures. Definitions and reconciliations are included in the Appendix.

ConocoPhillips
parsed_text: We Are Continuously Improving
                                                                             

#### Build Index
Once the text nodes are ready, we feed them into the vector store index abstraction, which indexes the nodes in a simple in-memory vector store.

In [156]:
# set BAAI/bge-small-en-v1.5 as vector store embedding model 
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

vector_store_embedding = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

In [157]:
import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

index = None
if not os.path.exists("storage_nodes"):
    index = VectorStoreIndex(text_nodes, embed_model=vector_store_embedding)
    # save index to disk
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes")
else:
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes")
    # load index
    index = load_index_from_storage(storage_context, index_id="vector_index", embed_model=vector_store_embedding)
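
Conceptually, retrieval from an in-memory vector store comes down to ranking stored embeddings by their similarity to the query embedding. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity in plain Python. It is not LlamaIndex's actual implementation; the vectors and the `top_k` helper are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), node_id) for node_id, vec in stored.items()]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Toy "embeddings"; real ones from bge-small-en-v1.5 have far more dimensions
toy_store = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.0, 1.0, 0.0],
    "page_3": [0.9, 0.1, 0.0],
}

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))  # prints ['page_1', 'page_3']
```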

### Build Multimodal Query Engine
We now use LlamaIndex abstractions to build a custom query engine. A standard RAG query engine retrieves the text nodes and passes only their text to the response synthesis module. This custom query engine also loads the image referenced by each retrieved node and passes both the text and the images to the response synthesis module.

In [155]:
# use Llama 3.2 Vision (11B) through Ollama and run a quick sanity check
from llama_index.llms.ollama import Ollama

llm_model = Ollama(model=MODEL, request_timeout=500)
response = llm_model.complete("What is the capital of France?")
print(response)

The capital of France is Paris.


In [149]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from llama_index.core.prompts import PromptTemplate
from llama_index.core.base.response.schema import Response
from typing import Optional


QA_PROMPT_TMPL = """\
Use the image(s) information first and foremost. ONLY use the text/markdown information provided in the context
below if you can't understand the image(s).

---------------------
Context: {context_str}
---------------------
Given the context information and no prior knowledge, answer the query. Explain where you got the answer
from, note any discrepancies, and give your reasoning for the final answer.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)

class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes.
    Also takes in a prompt template and multimodal model.

    """

    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: Ollama

    def __init__(self, qa_prompt: Optional[PromptTemplate] = None, **kwargs) -> None:
        """Initialize."""
        super().__init__(qa_prompt=qa_prompt or QA_PROMPT, **kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageNode items from text nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=image_path))
            for n in nodes for image_path in n.metadata.get("image_paths", [])
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)

        image_docs = [image_node.node for image_node in image_nodes]
        # synthesize an answer from formatted text and images
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=image_docs
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            metadata={"text_nodes": nodes, "image_nodes": image_nodes},
        )
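
The text half of `custom_query` is ordinary string assembly: the retrieved node contents are joined with blank lines and substituted into the prompt template. A rough sketch with plain `str.format`, assuming a shortened stand-in for `QA_PROMPT_TMPL` and made-up node contents:

```python
# Shortened stand-in for QA_PROMPT_TMPL (the real template has more instructions)
template = "Context: {context_str}\nQuery: {query_str}\nAnswer: "

# Hypothetical contents of two retrieved text nodes
retrieved_texts = [
    "page_num: 7\n\nWe Are Continuously Improving",
    "page_num: 15\n\nCost of Supply",
]

# Join the node contents with blank lines, then fill in the template
context_str = "\n\n".join(retrieved_texts)
prompt = template.format(
    context_str=context_str,
    query_str="What changed between 2016 and 2022?",
)
print(prompt)
```

The images are not part of this string; the engine passes them to the model separately from the formatted prompt.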


In [150]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=5), multi_modal_llm=llm_model
)

In [172]:
# run a query
response = query_engine.custom_query("What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate.")
print(response)

Based on the text, the average cost of supply in 2016 was $40/BBL WTI. To convert this amount to Indian Rupees (INR) using the current exchange rate, I'll use an approximate conversion rate of 1 USD = 75 INR.

$40/BBL WTI ≈ ₹3000 per barrel

Note: Please keep in mind that currency exchange rates may fluctuate frequently and might not be up-to-date at the time of my response. For accurate conversions, please check current exchange rates.

This answer is derived from the text on page 15, which states:

"... Resource: ~ $40/BBL WTI ... Average Cost of Supply: $40/BBL WTI ..."


### Building a Multimodal Agent

In [133]:
# use Llama 3.1 through Ollama for tool calling, since Llama 3.2 Vision currently does not support it
# and run a quick sanity check
from llama_index.llms.ollama import Ollama

llm_model_tool_calling = Ollama(model="llama3.1")
response = llm_model_tool_calling.complete("Are you able to process image inputs?")
print(response)

I can understand text-based descriptions of images, but I don't have the ability to directly process or interpret visual image data. However, I can:

1. **Process text-to-image prompts**: You can describe an image using natural language, and I'll try to provide a textual description of what you're asking for.
2. **Understand image metadata**: If you provide me with the metadata associated with an image (e.g., EXIF data), I can help answer questions about it.
3. **Use text-based image descriptions**: If someone has described an image using natural language, I can process and respond to that description.

If you'd like to use a specific format or library for image processing, such as:

* OpenCV (Python)
* Pillow (Python)
* TensorFlow (TensorFlow.js)

Please let me know which one you're interested in working with. However, keep in mind that I won't be able to directly process images from the internet or your local file system.

Would you like to explore image processing using a specific l

In [None]:
import requests

from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionCallingAgentWorker

from llama_index.core.tools import FunctionTool
from pydantic import Field


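def currency_converter(from_currency_code: str = Field(
        description="Currency code to convert from (e.g., USD, INR, EUR)"
    ), to_currency_code: str = Field(
        description="Currency code to convert to (e.g., USD, INR, EUR)"
    ), amount: float = Field(
        description="Currency amount to convert"
    )) -> float:
    """Convert an amount between two currencies using current exchange rates."""
    # free API for currency exchange rates; the target currency is used as the base,
    # so each rate is the number of units of that currency per one unit of the target
    api_url = f"https://api.vatcomply.com/rates?base={to_currency_code}"

    # a timeout keeps the agent from hanging if the API is unreachable
    response = requests.get(api_url, timeout=30)
    data = response.json()

    if "error" in data:
        raise ValueError(data["error"])

    rates = data["rates"]
    # units of the source currency per one unit of the target currency
    conversion_factor = rates[from_currency_code]
    # dividing by that factor expresses the amount in the target currency
    converted_amount = float(amount) / conversion_factor
    return converted_amount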

# Tool for currency conversion
currency_converter_tool = FunctionTool.from_defaults(
    currency_converter,
    name="currency_converter_tool",
    description="Converts an amount from one currency code to another based on the current exchange rate. "
    "Takes the currency amount value, the currency code to convert from, and the currency code "
    "to convert to as input.",
)

# Tool for querying the engine to retrieve contextual information around user query
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="query_engine_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
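
The arithmetic in `currency_converter` can look backwards at first: the rates are requested with the target currency as the base, so the amount is divided by the rate rather than multiplied. A small offline sketch with hard-coded rates shows why this works. The rates below are made up for illustration (not live data), and `convert_with_rates` is a hypothetical helper.

```python
def convert_with_rates(amount, from_code, rates_with_target_base):
    """Convert `amount` using rates quoted per one unit of the target currency."""
    return amount / rates_with_target_base[from_code]


# Made-up rates with INR as the base: one rupee buys 0.0125 USD (so 1 USD = 80 INR)
sample_rates = {"USD": 0.0125, "EUR": 0.01}

# 40 USD divided by 0.0125 USD per INR is roughly 3200 INR
print(convert_with_rates(40, "USD", sample_rates))
```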

In [173]:
# Set up the agent for calling the currency conversion and query engine tools
agent = FunctionCallingAgentWorker.from_tools(
    [currency_converter_tool, query_engine_tool], llm=llm_model_tool_calling, verbose=True
).as_agent()

In [176]:
query = (
    "What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate."
)
response = agent.query(query)
print(response)

Added user message to memory: What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate.
=== Calling Function ===
Calling function: currency_converter_tool with args: {"amount": "1000", "from_currency_code": "USD", "to_currency_code": "INR"}
=== Function Output ===
87514.21412739714
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Average cost of supply in 2016"}
=== Function Output ===
I found the relevant information on page 2 of the text, which states:

"...Lower 48 unconventional production of approximately 1.5 million barrels per day (MBOED) with an average cost of supply around $30-40 per barrel."

However, this statement does not explicitly mention the year 2016.

I then looked at page 3, which mentions:

"Deep; Durable and Diverse Portfolio with Significant Growth Runway"

And on page 5, I found a graph showing Lower 48 Unconventional Production (MBOED) for the years 2008-2022. Unfortunately, there is