# Building a RAG Pipeline in OCR Mode: An Advanced Implementation That Mixes Text and Image Queries Using GPT-4o or Gemini 1.5 Flash

This cookbook shows how to use LlamaParse and a multimodal model (OpenAI's GPT-4o or, optionally, Google's Gemini 1.5 Flash) to query instruction manual PDFs, which consist mainly of images and diagrams showing how to assemble a product.

LlamaParse and multimodal LLMs can interpret these diagrams and translate them into textual instructions. With textual assistance, confusing visual instructions within the product manuals can be made easier to understand and interpret. Additionally, textual instructions can be helpful for those who are visually impaired.

## Install and Setup

Install LlamaIndex, download the data, and apply `nest_asyncio`.

In [None]:
%pip install llama-index llama-parse llama-index-multi-modal-llms-openai llama-index-multi-modal-llms-gemini llama-index-embeddings-gemini llama-index-llms-gemini git+https://github.com/openai/CLIP.git llama-index-postprocessor-cohere-rerank


In [None]:
!wget https://github.com/user-attachments/files/16461058/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip

!mkdir -p files

!cp /content/data/fredde_instruction_manual.pdf /content/files



In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up your OpenAI and LlamaCloud keys, plus the Cohere and Gemini keys if you plan to use the optional reranker or the Gemini models.

In [None]:
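import os

from google.colab import userdata

GEMINI_KEY = userdata.get('GEMINI_KEY')
COHERE_KEY = userdata.get('COHERE_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
LLAMA_CLOUD_API = userdata.get('LLAMA_CLOUD_API')

# Export the keys that LlamaParse and the OpenAI clients read from the environment
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["LLAMA_CLOUD_API_KEY"] = LLAMA_CLOUD_API

# Or, outside Colab, set the keys directly:
# os.environ["OPENAI_API_KEY"] = "xxx"
# os.environ["LLAMA_CLOUD_API_KEY"] = "x0"
# os.environ["COHERE_API_KEY"] = "xJ"
# os.environ["GEMINI_API_KEY"] = "xG"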
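Before running the cells that call external services, it can help to check that every key is present. The following is a minimal sketch, assuming the environment variable names from the commented alternative in the setup cell. The helper `missing_keys` and the split into required and optional keys are assumptions made for illustration, not part of any library:

```python
import os

# Variable names taken from the setup cell; adjust to the services you actually use.
REQUIRED_KEYS = ["OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY"]
OPTIONAL_KEYS = ["COHERE_API_KEY", "GEMINI_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print(f"Missing required keys: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.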

## Code Implementation

Set up LlamaParse. We parse the PDF files into markdown, using the GPT-4o multimodal model to interpret each page.

Configure the parser.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="You are given IKEA assembly instruction manuals",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o", #or gemini-1.5-flash
    show_progress=True,
    verbose=True,
    invalidate_cache=True,
    do_not_cache=True,
    num_workers=8,
    # Setting language
    language="en",
    # api_key=LLAMA_CLOUD_API,
)

In [None]:
import os

DATA_DIR = "files"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()

print(files[0])

files/fredde.pdf
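`get_data_files` returns every regular file in the directory. If the folder might also hold other file types, a variant that keeps only PDFs and returns them in a stable order could look like the sketch below. `get_pdf_files` is a hypothetical name, not part of the original cookbook code:

```python
from pathlib import Path


def get_pdf_files(data_dir="files"):
    """Return the PDF files directly inside data_dir as sorted path strings."""
    return sorted(
        str(p)
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The result can be passed to the parser in the same way as the list returned by `get_data_files`.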


Load data into docs, and save images from PDFs into `data_images` directory.

In [None]:
md_json_objs = parser.get_json_result(files)
image_dicts = parser.get_images(md_json_objs, download_path="data_images")

Parsing files:   0%|          | 0/1 [00:00<?, ?it/s]

Started parsing the file under job_id 7fff3598-5107-48dd-b4e1-da9ca3420f13


Parsing files: 100%|██████████| 1/1 [00:18<00:00, 18.44s/it]


> Image for page 1: [{'name': 'page-0.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 2: [{'name': 'page-1.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 3: [{'name': 'page-2.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 4: [{'name': 'page-3.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 5: [{'name': 'page-4.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 6: [{'name': 'page-5.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 7: [{'name': 'page-6.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 8: [{'name': 'page-7.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot'}]
> Image for page 9: [{'name': 'page-8.jpg', 'height': 0,
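The saved images are named by zero-based page index (`page-0.jpg` belongs to page 1). When ordering such files, sort on the numeric index rather than on the raw name, because a plain string sort places `page-10` before `page-2`. A small self-contained illustration, using made-up file names that follow a `<job_id>-page-N.jpg` pattern:

```python
import re

# Made-up names; real files carry the parsing job id as prefix.
names = ["job-page-10.jpg", "job-page-2.jpg", "job-page-0.jpg", "job-page-1.jpg"]


def page_index(name):
    """Extract the zero-based page index from a name ending in '-page-N.jpg'."""
    match = re.search(r"-page-(\d+)\.jpg$", name)
    return int(match.group(1)) if match else 0


lexicographic = sorted(names)            # string order: 0, 1, 10, 2
numeric = sorted(names, key=page_index)  # page order: 0, 1, 2, 10

print(lexicographic)
print(numeric)
```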

Create helper functions that build a list of `TextNode`s from the parsed markdown pages, each carrying the path of its page image in metadata, to feed into the `VectorStoreIndex`.

In [None]:
import re
from pathlib import Path
import typing as t
from llama_index.core.schema import TextNode


def get_page_number(file_name):
    """Gets page number of images using regex on file names"""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files


def get_text_nodes(md_json_objs, image_dir) -> t.List[TextNode]:
    """Creates one text node per parsed page, linked to that page's image"""

    nodes = []

    for result in md_json_objs:
        json_dicts = result["pages"]
        document_name = result["file_path"].split('/')[-1]

        print(json_dicts)

        docs = [doc["md"] for doc in json_dicts]  # markdown text of each page
        image_files = _get_sorted_image_files(image_dir)  # page images, sorted by page number

        for idx, doc in enumerate(docs):
            # one text node per page; the page's jpg is referenced through the metadata
            node = TextNode(
                text=doc,
                metadata={"image_path": str(image_files[idx]), "page_num": idx + 1, "document_name": document_name},
            )
            nodes.append(node)

    return nodes


text_nodes = get_text_nodes(md_json_objs, "data_images")

[{'page': 1, 'md': '# FREDDE\n\n![FREDDE Desk](https://www.ikea.com/us/en/images/products/fredde-desk-black__0736012_pe740925_s5.jpg)\n\n---\n\n![IKEA Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/IKEA_logo.svg/1200px-IKEA_logo.svg.png)\n\nDesign and Quality  \nIKEA of Sweden', 'images': [{'name': 'page-0.jpg', 'height': 0, 'width': 0, 'x': 0, 'y': 0, 'type': 'full_page_screenshot', 'path': 'data_images/7fff3598-5107-48dd-b4e1-da9ca3420f13-page-0.jpg', 'job_id': '7fff3598-5107-48dd-b4e1-da9ca3420f13', 'original_pdf_path': 'files/fredde.pdf', 'page_number': 1}], 'items': [{'type': 'heading', 'lvl': 1, 'value': 'FREDDE', 'md': '# FREDDE'}, {'type': 'text', 'value': '![FREDDE Desk](https://www.ikea.com/us/en/images/products/fredde-desk-black__0736012_pe740925_s5.jpg)\n\n---\n\n![IKEA Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/IKEA_logo.svg/1200px-IKEA_logo.svg.png)\n\nDesign and Quality  \nIKEA of Sweden', 'md': '![FREDDE Desk](https://www.ikea.com/u
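`get_text_nodes` pairs pages with images by list position, which works when a single document is parsed. Each page dict in the printed JSON also carries an `images` list with `path` and `page_number` fields, so the full-page screenshot can alternatively be looked up per page. The following is a sketch over plain dictionaries shaped like that output; `page_image_path` is a hypothetical helper and the sample data is invented for illustration:

```python
def page_image_path(page):
    """Return the path of the page's full-page screenshot, or None if absent."""
    for image in page.get("images", []):
        if image.get("type") == "full_page_screenshot":
            return image.get("path")
    return None


# Invented sample mirroring the structure of the parser's JSON result
sample_pages = [
    {
        "page": 1,
        "md": "# FREDDE",
        "images": [
            {
                "name": "page-0.jpg",
                "type": "full_page_screenshot",
                "path": "data_images/job-page-0.jpg",
                "page_number": 1,
            }
        ],
    },
    {"page": 2, "md": "Tools required", "images": []},
]

for page in sample_pages:
    print(page["page"], page_image_path(page))
```

Looking the image up per page avoids mixing screenshots from different manuals when several PDFs share one image directory.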

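Index the documents. The index is persisted to `storage_manuals`, so later runs load it from disk instead of embedding the nodes again.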

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
# For OpenAI GPT-4o
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI("gpt-4o")

# For Google Gemini 1.5 Flash
# from llama_index.embeddings.gemini import GeminiEmbedding
# from llama_index.llms.gemini import Gemini

# embed_model = GeminiEmbedding(
#     model_name="models/embedding-001", api_key=GEMINI_KEY
# )

# llm = Gemini(api_key=GEMINI_KEY, model_name="models/gemini-1.5-flash")

Settings.llm = llm
Settings.embed_model = embed_model

if not os.path.exists("storage_manuals"):
    index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    index.storage_context.persist(persist_dir="./storage_manuals")
else:
    ctx = StorageContext.from_defaults(persist_dir="./storage_manuals")
    index = load_index_from_storage(ctx)

retriever = index.as_retriever()
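Conceptually, a vector retriever embeds the query and returns the stored nodes whose embeddings are most similar to it. The toy sketch below illustrates that idea with cosine similarity over hand-written 3-dimensional vectors. It is not how `VectorStoreIndex` is implemented internally, and all names and numbers in it are invented:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query, best first."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented embeddings standing in for three manual pages
node_vecs = {
    "page-1": [1.0, 0.0, 0.0],
    "page-2": [0.7, 0.7, 0.0],
    "page-3": [0.0, 0.0, 1.0],
}

print(top_k([1.0, 0.1, 0.0], node_vecs, k=2))
```

The `k` argument plays the role that `similarity_top_k` plays when a retriever is created from the index.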

Create a custom query engine that uses GPT-4o's multimodal model.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.schema import NodeWithScore, MetadataMode, QueryBundle
from llama_index.core.base.response.schema import Response
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import ImageNode

from typing import Any, List, Optional, Tuple
from llama_index.core.postprocessor.types import BaseNodePostprocessor

QA_PROMPT_TMPL = """\
You are a chatbot that helps users get technical answers about an IKEA product manual.

Below we give the parsed text of the manual pages in markdown format, as well as the page images.

Markdown mode attempts to convert relevant diagrams into tables.

Use the image information first and foremost. ONLY use the text/markdown information
if you can't understand the image.

When you reply, do not send image links; give only a text explanation of what the images show.

Context:
---------------------
{context_str}
---------------------

Given the context information and not prior knowledge, answer the query using ONLY the context information. If you can't find the answer in the context, do NOT try to answer: reply that you don't know, and give a page and document name where the user can find a similar answer.
Give the page number and the document name where you found the answer, based on the context.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)

gpt_4o_mm = OpenAIMultiModal(model="gpt-4o")

# gemini_llm = GeminiMultiModal(
#         api_key=GEMINI_KEY, model_name="models/gemini-1.5-flash"
#     )

class MultimodalQueryEngine(CustomQueryEngine):
    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: OpenAIMultiModal
    # multi_modal_llm: GeminiMultiModal
    node_postprocessors: Optional[List[BaseNodePostprocessor]]

    def __init__(
        self,
        qa_prompt: PromptTemplate,
        retriever: BaseRetriever,
        multi_modal_llm: OpenAIMultiModal,
        # multi_modal_llm: GeminiMultiModal,
        node_postprocessors: Optional[List[BaseNodePostprocessor]] = None,
    ):
        super().__init__(
            qa_prompt=qa_prompt,
            retriever=retriever,
            multi_modal_llm=multi_modal_llm,
            # avoid a shared mutable default; fall back to an empty list
            node_postprocessors=node_postprocessors or [],
        )

    def custom_query(self, query_str: str):
        # retrieve most relevant nodes
        nodes = self.retriever.retrieve(query_str)

        # optionally filter or reorder the retrieved nodes (e.g. with a reranker)
        for postprocessor in self.node_postprocessors:
            nodes = postprocessor.postprocess_nodes(
                nodes, query_bundle=QueryBundle(query_str)
            )

        # create image nodes from the image associated with those nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=n.node.metadata["image_path"]))
            for n in nodes
        ]

        # create context string from parsed markdown text
        ctx_str = "\n\n".join(
            [r.node.get_content(metadata_mode=MetadataMode.LLM).strip() for r in nodes]
        )

        # prompt for the LLM
        fmt_prompt = self.qa_prompt.format(context_str=ctx_str, query_str=query_str)

        # use the multimodal LLM to interpret images and generate a response to the prompt
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[image_node.node for image_node in image_nodes],
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            # report the nodes actually used for this query, not the global node list
            metadata={"text_nodes": nodes, "image_nodes": image_nodes},
        )
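The prompt can ask for page numbers and document names because `MetadataMode.LLM` includes each node's metadata in the text returned by `get_content`. The sketch below imitates that with plain dictionaries so the resulting context is easy to inspect. The `key: value` layout is an approximation of LlamaIndex's default formatting, and `render_node`, the template, and the sample pages are hypothetical stand-ins rather than library code:

```python
def render_node(text, metadata):
    """Approximate a node's LLM-facing content: metadata lines, a blank line, then the text."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{metadata_str}\n\n{text}".strip()


# Invented pages with the same metadata keys that get_text_nodes sets
pages = [
    {
        "text": "Insert the screw into the frame.",
        "metadata": {"page_num": 5, "document_name": "fredde.pdf"},
    },
    {
        "text": "Contact IKEA if you are unsure.",
        "metadata": {"page_num": 2, "document_name": "fredde.pdf"},
    },
]

ctx_str = "\n\n".join(render_node(p["text"], p["metadata"]) for p in pages)

# Shortened template with the same two placeholders as QA_PROMPT_TMPL
template = "Context:\n{context_str}\n\nQuery: {query_str}\nAnswer: "
prompt = template.format(context_str=ctx_str, query_str="Which screw goes first?")
print(prompt)
```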

Create a query engine instance.

In [None]:
import os

from llama_index.postprocessor.cohere_rerank import CohereRerank

# Read the key from the environment if it is set there, otherwise use the Colab secret loaded earlier
api_key = os.environ.get("COHERE_API_KEY") or COHERE_KEY
cohere_rerank = CohereRerank(api_key=api_key, top_n=3, model="rerank-multilingual-v3.0")

# Reranking is disabled by default: pass node_postprocessors=[cohere_rerank]
# only if testing shows that it improves accuracy.
query_engine = MultimodalQueryEngine(
    qa_prompt=QA_PROMPT,
    retriever=index.as_retriever(similarity_top_k=9),
    multi_modal_llm=gpt_4o_mm,
    # multi_modal_llm=gemini_llm,
    node_postprocessors=[],
)
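The engine retrieves nine candidates (`similarity_top_k=9`), and a reranker configured with `top_n=3` would narrow them to the three most relevant. The general retrieve-then-rerank pattern is sketched below in plain Python. The word-overlap scorer is a deliberately crude, invented stand-in for a real reranking model such as Cohere's, used only so that the example runs offline:

```python
def overlap_score(query, text):
    """Toy relevance score: number of distinct query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def rerank(query, candidates, top_n=3):
    """Reorder candidates by score, highest first, and keep the best top_n."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]


# Invented candidate passages, as a first-stage retriever might return them
candidates = [
    "Insert the screw into the metal frame",
    "Contact IKEA if you are unsure",
    "Tighten the screw in the frame",
    "Design and Quality IKEA of Sweden",
]

print(rerank("tighten the screw", candidates, top_n=2))
```

Retrieving more candidates than are finally needed gives the second stage room to promote a passage that the embedding search ranked low.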


## Example Queries

In [None]:
from IPython.display import display, Markdown

response = query_engine.query("What parts are included in the Uppspel?")
display(Markdown(str(response)))

I don't have information about the parts included in the Uppspel. However, you can refer to the document "fredde.pdf" on page 2 for tools required and general assembly instructions, and other pages for specific steps and parts used in the assembly process.

In [None]:
response = query_engine.query("What does the FREDDE look like?")
display(Markdown(str(response)))

The FREDDE desk is a multi-functional desk with a modern design. It features a main desk surface with additional shelves and brackets for extra storage and organization. The desk has a sturdy frame with multiple levels, including a top shelf and side shelves. The design allows for efficient use of space, making it suitable for various activities such as working on a computer, studying, or gaming.

For more detailed visual instructions on assembling the FREDDE desk, you can refer to the following pages in the "fredde.pdf" document:

- Page 1
- Page 23
- Page 24
- Page 29
- Page 30
- Page 31
- Page 32

In [None]:
response = query_engine.query("What should I do if I'm confused with reading the manual?")
display(Markdown(str(response)))

If you are confused with reading the manual, you should contact IKEA for assistance. This information can be found on page 2 of the document "fredde.pdf".
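Because the engine returns the retrieved nodes as `source_nodes`, the pages behind an answer can also be listed programmatically instead of relying on the model's wording. The sketch below works on plain metadata dictionaries with the `page_num` and `document_name` keys set in `get_text_nodes`. `summarize_sources` is a hypothetical helper and the sample list is invented; with a real response, the list could be built as `[n.node.metadata for n in response.source_nodes]`:

```python
from collections import defaultdict


def summarize_sources(metadata_list):
    """Group page numbers by document name, with each page list sorted and deduplicated."""
    pages_by_doc = defaultdict(set)
    for metadata in metadata_list:
        pages_by_doc[metadata["document_name"]].add(metadata["page_num"])
    return {doc: sorted(pages) for doc, pages in pages_by_doc.items()}


# Invented metadata, shaped like the metadata attached to each text node
sample_metadata = [
    {"page_num": 5, "document_name": "fredde.pdf"},
    {"page_num": 2, "document_name": "fredde.pdf"},
    {"page_num": 5, "document_name": "fredde.pdf"},
]

print(summarize_sources(sample_metadata))
```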

You can also create an agent around the query engine and chat with the agent. (This is a beta feature, in case an even more advanced setup is wanted in the future.)

In [None]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.tools import QueryEngineTool

query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="query_engine_tool",
    description="Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data.",
)
agent = FunctionCallingAgentWorker.from_tools(
    [query_engine_tool], llm=llm, verbose=True
).as_agent()

In [None]:
response = agent.chat(
    "How do I assemble the Fredde, the first 3 steps?"
)
display(Markdown(str(response)))

Added user message to memory: How do I assemble the Fredde, the first 3 steps?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "first 3 steps to assemble the Fredde desk"}
=== Function Output ===
The first three steps to assemble the Fredde desk are as follows:

1. **Step 2** (Page 5, fredde.pdf):
   - Insert the screw (100181) into the hole of the metal frame.
   - Tighten the screw.
   - Ensure the screw is properly secured and not loose.

2. **Step 3** (Page 5, fredde.pdf):
   - Align the metal rod with the frame.
   - Insert the rod into the frame.
   - Rotate the rod to secure it in place.

3. **Step 18** (Page 15, fredde.pdf):
   - Insert 2x part 108430 into the designated slots.

For more detailed instructions, refer to pages 5 and 15 of the document "fredde.pdf".
=== LLM Response ===
Here are the first three steps to assemble the Fredde desk:

### Step 1
1. Insert the screw (100181) into the hole of the metal frame.
2. Tighten the screw.
3. Ens

Here are the first three steps to assemble the Fredde desk:

### Step 1
1. Insert the screw (100181) into the hole of the metal frame.
2. Tighten the screw.
3. Ensure the screw is properly secured and not loose.

### Step 2
1. Align the metal rod with the frame.
2. Insert the rod into the frame.
3. Rotate the rod to secure it in place.

### Step 3
1. Insert 2x part 108430 into the designated slots.

For more detailed instructions and visual aids, refer to pages 5 and 15 of the document "fredde.pdf".