# Comparing LlamaIndex and LlamaParse for Dense Document Question Answering on Vertex AI
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/doc_parsing_with_llamaindex_and_llamaparse.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fdoc_parsing_with_llamaindex_and_llamaparse.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/doc_parsing_with_llamaindex_and_llamaparse.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/doc_parsing_with_llamaindex_and_llamaparse.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>


| | |
|-|-|
| Author(s) | [Noa Ben-Efraim](https://github.com/noabenefraim/) |

## Overview
There are many ways to customize RAG pipelines by choosing how to ingest, parse, chunk, and retrieve your data. This notebook focuses on comparing different document parsing capabilities offered by LlamaIndex.

We will compare document parsing with LlamaIndex and LlamaParse on a 10-Q financial document, which is heavily populated with complex tables.

### Objectives
This notebook compares LlamaIndex and LlamaParse for ingesting and indexing a complex document.

You will complete the following tasks:
1. Ingest and parse a document using the LlamaIndex SimpleDirectoryReader, the LlamaIndex LangchainNodeParser, and the LlamaParse parser, with Gemini models.
2. Index your parsed document in a VectorStore.
3. Create a query engine for each parsing technique that can answer questions about the input document.
4. Compare results across LlamaIndex and LlamaParse.

### LlamaIndex
LlamaIndex is a foundational data framework for building LLM applications. A few of its main capabilities are:

+ Data Ingestion: Loads your data from various sources (documents, databases, APIs).   
+ Indexing: Structures your data into efficient formats for LLM retrieval (e.g., vector stores, tree structures).   
+ Querying: Enables you to ask questions or give instructions to the LLM, referencing your indexed data for answers.   
+ Integration: Connects with various LLMs, vector databases, and other tools.   
  

### LlamaParse
LlamaParse is a tool within the LlamaIndex ecosystem, focused on parsing complex documents:

+ PDFs: Handles PDFs with tables, charts, and other embedded elements that can be challenging for standard parsing.  
+ Semi-structured Data: Extracts structured information from documents that aren't fully formatted databases.   
+ Enhanced Retrieval: Works seamlessly with LlamaIndex to improve retrieval accuracy for complex documents.

## Getting Started

### Authenticate your notebook environment

This notebook expects the following resources to exist:
+ An initialized Google Cloud project
+ The Vertex AI API enabled
+ A LlamaParse API key ([request a key here](https://docs.cloud.llamaindex.ai/llamacloud/getting_started/api_key))

In [2]:
PROJECT_ID = "genai-noabe" # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
GCS_BUCKET = "llama_gcs_bucket"  # @param {type:"string"}
VS_INDEX_NAME = "llamaparse_doc_index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "llamaparse_doc_endpoint"  # @param {type:"string"}
DATA_FOLDER = "./data"  # @param {type:"string"}

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [3]:
PROJECT_ID = "genai-noabe"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Setting up the Environment
Install dependencies

In [4]:
%pip install google-cloud-aiplatform \
  llama-index \
  llama-index-core \
  llama-index-embeddings-vertex \
  llama-index-llms-vertex \
  llama-index-vector-stores-vertexaivectorsearch \
  llama-parse \
  langchain \
  langchain-community \
  nest-asyncio \
  termcolor -q

Note: you may need to restart the kernel to use updated packages.


Set up imports

In [5]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.base.response.schema import Response
from llama_index.core.extractors import KeywordExtractor, QuestionsAnsweredExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.llms.vertex import Vertex
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore
from llama_parse import LlamaParse

Generate credentials

In [6]:
import google.auth
import google.auth.transport.requests

# Load application default credentials; the refresh below obtains an access token
credentials = google.auth.default(quota_project_id=PROJECT_ID)[0]
request = google.auth.transport.requests.Request()
credentials.refresh(request)

Set up LlamaIndex settings to point to Gemini models.

In [7]:
gemini_embedding_model = VertexTextEmbedding("text-embedding-004", credentials=credentials)
llm = Vertex(model="gemini-pro", temperature=0.0, max_tokens=5000)

Settings.embed_model = gemini_embedding_model
Settings.llm = llm

If you installed new packages above, restart the kernel so they are picked up. After the restart, rerun the cells above.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [7]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Download sample data

For the remainder of the notebook we will examine Alphabet Inc.'s 10-Q for Q2 2024. A 10-Q is a quarterly financial report that is dense with tables of financial figures, which makes it a good candidate for investigating document parsing capabilities.

In [8]:
!mkdir -p {DATA_FOLDER}
!wget "https://abc.xyz/assets/ae/e9/753110054014b6de4d620a2853f6/goog-10-q-q2-2024.pdf" -P {DATA_FOLDER}

--2024-09-06 16:41:54--  https://abc.xyz/assets/ae/e9/753110054014b6de4d620a2853f6/goog-10-q-q2-2024.pdf
Resolving abc.xyz (abc.xyz)... 74.125.142.100, 74.125.142.113, 74.125.142.138, ...
Connecting to abc.xyz (abc.xyz)|74.125.142.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 648418 (633K) [application/pdf]
Saving to: ‘./data/goog-10-q-q2-2024.pdf.1’


2024-09-06 16:41:54 (30.3 MB/s) - ‘./data/goog-10-q-q2-2024.pdf.1’ saved [648418/648418]



## Document Parsing with LlamaIndex

This section ingests and parses the 10-Q using LlamaIndex tools, specifically SimpleDirectoryReader and LangchainNodeParser.

### Option 1: SimpleDirectoryReader
The SimpleDirectoryReader is the core file ingestion tool in LlamaIndex. It loads files from a local directory, picks a reader based on each file's extension, and converts the contents into `Document` objects for further processing and indexing.

In [9]:
reader = SimpleDirectoryReader("./data")
documents = reader.load_data(show_progress=True)
print(documents[0])

Loading files:   0%|          | 0/1 [00:00<?, ?file/s]

Loading files: 100%|██████████| 1/1 [00:05<00:00,  5.11s/file]

Doc ID: d569bf24-e740-4b11-abfc-32a380d709da
Text: UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington,
D.C. 20549 ___________________________________________________________
_____________________________ FORM 10-Q   ____________________________
____________________________________________________________ (Mark
One) ☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE
SECURITIES EXCHA...





In [10]:
# Index the parsed document
simpledirectory_index = VectorStoreIndex.from_documents(documents)

# Generate a query engine based on the SimpleDirectoryReader index
simple_query_engine = simpledirectory_index.as_query_engine(
    similarity_top_k=2
)
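
To make the `similarity_top_k=2` setting concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. The toy vectors and the `cosine_similarity` and `top_k` helpers are hypothetical stand-ins written for this illustration. The real query engine embeds the query with the configured embedding model and delegates the search to the index's vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings for three chunks of a document.
toy_chunks = {
    "balance_sheet": [0.9, 0.1, 0.0],
    "risk_factors": [0.0, 0.2, 0.9],
    "debt_table": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for chunk_id, score in top_k(toy_query, toy_chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

Only the two best-scoring chunks are passed to the LLM as context, so a small `similarity_top_k` keeps prompts short but relies on the parser having kept each table's figures together in one chunk.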

### Option 2: LangchainNodeParser with LlamaIndex
The `LangchainNodeParser` is a LlamaIndex node parser that wraps a LangChain text splitter, so any LangChain splitter can be used to chunk LlamaIndex documents into nodes. Here it wraps `RecursiveCharacterTextSplitter`, which splits on progressively finer separators (paragraphs, then lines, then words) until each chunk fits the configured size.

Key Features:
+ LangChain Integration: Reuses LangChain text splitters inside a LlamaIndex pipeline. The splitting is rule-based and involves no LLM call.
+ Node-Based Output: Converts each document into a list of text nodes that carry the source document's metadata and a link back to it, as the example node below shows.
+ Customization: Chunking behavior is controlled through the wrapped splitter's own parameters, such as `chunk_size` and `chunk_overlap`.
+ Flexibility: Can be combined with other LlamaIndex components, such as the SimpleDirectoryReader, to chunk and index loaded documents.
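
The node parser in the next cell wraps LangChain's `RecursiveCharacterTextSplitter`. As a rough illustration of the recursive splitting idea, the sketch below tries the coarsest separator first and falls back to finer ones only for pieces that are still too long. It is a simplified stand-in written for this notebook, not LangChain's implementation: the `recursive_split` function is hypothetical and it omits features such as chunk overlap.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue rose.\n\nCosts fell sharply this quarter.\n\nNet income improved."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Because the splitter only looks at characters, a wide financial table can be cut in the middle of a row, which is one reason parsing quality matters for table-heavy documents.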

In [11]:
parser = LangchainNodeParser(RecursiveCharacterTextSplitter())
langchain_nodes = parser.get_nodes_from_documents(documents)

In [12]:
# An example node generated by the LangchainNodeParser, with its associated metadata
langchain_nodes[0]

TextNode(id_='7eb25d0f-a527-47d8-9542-83111e6d7c43', embedding=None, metadata={'page_label': '1', 'file_name': 'goog-10-q-q2-2024.pdf', 'file_path': '/home/noabe/generative-ai/gemini/use-cases/document-processing/data/goog-10-q-q2-2024.pdf', 'file_type': 'application/pdf', 'file_size': 648418, 'creation_date': '2024-09-05', 'last_modified_date': '2024-07-24'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='d569bf24-e740-4b11-abfc-32a380d709da', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': 'goog-10-q-q2-2024.pdf', 'file_path': '/home/noabe/generative-ai/gemini/use-cases/document-processing/data/goog-10-q-q2-2024.pdf', 'file_type': 'application/pdf', 'file_size': 648418, 'c

In [13]:
# Index the document based on the LangChain nodes generated above
langchainparser_index = VectorStoreIndex(nodes=langchain_nodes)

# Create a query engine based on the LangchainNodeParser index
lg_query_engine = langchainparser_index.as_query_engine(
    similarity_top_k=2
)

## LlamaParse

LlamaParse is a hosted document parsing service in the LlamaIndex ecosystem. It takes unstructured or semi-structured documents, such as PDFs with dense tables, and returns their content as text or Markdown wrapped in LlamaIndex documents, ready for further processing and analysis.

A few key features include:

+ JSON Schema: Leverages the standardized JSON Schema format for more complex schemas.
+ Prompt Templates: Allows you to craft custom prompts to guide the language model's parsing behavior, offering greater control and adaptability.
+ LLM Selection: You have the flexibility to choose the specific LLM you want to use for parsing, enabling you to tailor the performance to your specific needs and budget.
+ Node-Based Output:
    + Structured Representation: The parsed output is organized into a hierarchy of nodes, each representing a piece of extracted information.
    + Nested Structure: Nodes can contain other nodes, allowing for the representation of complex relationships and nested data structures within the document.
    + Metadata: Nodes can also include additional metadata, such as confidence scores or source information, enriching the extracted data.
+ Integration with LlamaIndex: The documents returned by the parser plug into other LlamaIndex components, such as indexing and querying, so the extracted information can be retrieved and analyzed like any other LlamaIndex document.

#### Define a Parser

Here we will define a LlamaParse() parser with specific parsing instructions, and ingest the data.

In [14]:
parser = LlamaParse(
    parsing_instruction="You are a financial analyst working specifically with 10Q documents. Not all pages have titles. Try to reconstruct the dialogue spoken in a cohesive way.",
    api_key="",  # add your LlamaParse API key here
    result_type="text",  # "markdown" and "text" are available
    language='en',
    invalidate_cache=True
)

### Option 1 - LlamaParse with SimpleDirectoryReader

This is the apples-to-apples comparison with LlamaIndex. We use the SimpleDirectoryReader with LlamaParse as the PDF file extractor, and then load the resulting documents directly into a vector store for retrieval.

In [15]:
import nest_asyncio

# Allow nested event loops so the async parsing client can run inside the notebook
nest_asyncio.apply()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=["./data/goog-10-q-q2-2024.pdf"], file_extractor=file_extractor).load_data()

Started parsing the file under job_id 589d1115-c0c5-42be-994f-ab2195c30414
.

In [16]:
lp_simple = VectorStoreIndex.from_documents(documents)
lp_simple_engine = lp_simple.as_query_engine(
    similarity_top_k=2
)

### Option 2 - LlamaParse and Vertex AI Vector Search

This approach is more customized: the vector search backend is Vertex AI Vector Search, and metadata extracted from each chunk is embedded and stored in the search index alongside the text.

Metadata can improve retrieval in a RAG pipeline by giving each chunk extra context, such as the questions it can answer or its keywords. That context can be embedded together with the text to sharpen similarity search, or used to filter and rank results for queries with multiple criteria.

The following section will:
+ Parse the documents using LlamaParse
+ Extract metadata from documents returned from LlamaParse
+ Create metadata embeddings attached to each document
+ Create index in Vertex AI Vector Store
+ Query against the Vector Store

#### Parse data using LlamaParse

In [20]:
documents = parser.load_data("./data/goog-10-q-q2-2024.pdf")

Started parsing the file under job_id e47cedeb-2356-467e-80a6-0ae96253a5c9
.

#### Create Metadata from Nodes

Using extractors, we will generate metadata for each node. The metadata is generated with Gemini Pro and captures which questions the text can answer and which keywords are meaningful in the section. Each node's text and metadata are then embedded together with the text embedding model.

This metadata can serve as an additional lookup criterion during RAG-based search.
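
As a rough sketch of the two uses of metadata described here, the snippet below prepends metadata to a chunk's text before embedding and filters chunks by keyword. The helpers `content_with_metadata` and `filter_by_keyword`, the `excerpt_keywords` key, and the toy chunks are assumptions made for illustration. LlamaIndex formats metadata with its own templates, so check `nodes[i].metadata` for the keys your extractors actually produce.

```python
def content_with_metadata(text: str, metadata: dict[str, str]) -> str:
    """Prepend 'key: value' metadata lines to the chunk text."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}" if header else text


def filter_by_keyword(chunks: list[dict], keyword: str) -> list[dict]:
    """Keep only chunks whose extracted keywords mention the given keyword."""
    needle = keyword.lower()
    return [
        chunk
        for chunk in chunks
        if needle in chunk["metadata"].get("excerpt_keywords", "").lower()
    ]


# Hypothetical chunks; the texts and keywords are placeholders, not 10-Q content.
toy_chunks = [
    {
        "text": "Placeholder text about debt.",
        "metadata": {"excerpt_keywords": "debt, coupon rate, maturity"},
    },
    {
        "text": "Placeholder text about risks.",
        "metadata": {"excerpt_keywords": "risk factors, regulation"},
    },
]

debt_chunks = filter_by_keyword(toy_chunks, "coupon")
print(content_with_metadata(debt_chunks[0]["text"], debt_chunks[0]["metadata"]))
```

Embedding the combined string means a query phrased like one of the extracted questions or keywords can match a chunk even when the chunk's raw text uses different wording.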

In [21]:
extractors = [
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    KeywordExtractor(keywords=10, llm=llm),
]

In [22]:
# Run the metadata transformation pipeline
pipeline = IngestionPipeline(
    transformations=extractors,
)
nodes = await pipeline.arun(documents=documents, in_place=False)

  0%|          | 0/52 [00:00<?, ?it/s]

100%|██████████| 52/52 [00:59<00:00,  1.14s/it]
100%|██████████| 52/52 [00:21<00:00,  2.47it/s]


Example metadata that was generated:

In [23]:
print(nodes[1].metadata)

{'questions_this_excerpt_can_answer': "## 3 Questions this context can answer:\n\n1. **What was the total revenue of Alphabet Inc. for the six months ended June 30, 2024?** This information can be found in the Consolidated Statements of Income on page 6 of the 10-Q report.\n2. **What were the major factors contributing to the change in Alphabet Inc.'s net income between the three months ended June 30, 2023 and 2024?** This information can be found in the Management's Discussion and Analysis (MD&A) section of the 10-Q report, specifically in the section discussing the company's financial performance.\n3. **What are the key risks that Alphabet Inc. faces, as identified by the company itself?** This information can be found in the Risk Factors section of the 10-Q report, which outlines the potential challenges and uncertainties that could impact the company's future performance.\n\n## Higher-level summaries of surrounding context:\n\n* This document is the 10-Q report for Alphabet Inc. fo

In [24]:
# Generate an embedding for each node from its text plus the extracted metadata
for node in nodes:
    node_embedding = gemini_embedding_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

#### Load Nodes into Predefined Vector Store

The following section requires a pre-existing Vertex AI Vector Search index and endpoint. The vector store holds the embedding vectors of the ingested document chunks.

For instructions on creating a vector store, see the [LlamaIndex Vertex AI Vector Search demo](https://docs.llamaindex.ai/en/stable/examples/vector_stores/VertexAIVectorSearchDemo/).

In [25]:
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id="3234512520265859072",
    endpoint_id="137232245285978112",
    gcs_bucket_name=GCS_BUCKET,
)

# Only needs to run once
vector_store.add(nodes)

Upserting datapoints MatchingEngineIndex index: projects/726844691572/locations/us-central1/indexes/3234512520265859072
MatchingEngineIndex index Upserted datapoints. Resource name: projects/726844691572/locations/us-central1/indexes/3234512520265859072


['802119f0-ad67-47b6-88ec-72a16070efd6',
 'eed74f9c-a79c-4637-8551-fcca69a364b2',
 '99bf3120-060a-4315-824a-4243da5bccfc',
 '60e8fe11-8a4a-4a79-907e-0df349d2959b',
 'd99ab15d-c86a-43f8-953d-7fe3274a0cb3',
 '04c61bcd-6b05-4697-abeb-9c5c0011f555',
 '4b1de6c6-3d52-4227-9c39-6b57df93d16a',
 '21d30c9e-507a-43af-b786-58acc7755fc6',
 'c147838a-fa97-4b1c-8ea1-b94c305e05eb',
 '4b14653e-6430-4886-9c1b-091d2ad3ff2e',
 '674bcb86-9f7a-4f1d-8b30-8b4655b67ad8',
 '35ab5c0f-95fb-4717-9d77-2b227c5e1fd9',
 '234fc980-9ed1-436c-ac61-9bbed8007e16',
 'a187bbd9-bf6f-47ce-b295-ebab99031eea',
 'b8da92f6-4469-4059-a39a-107921a2e8dc',
 '4dd14e19-e2b5-4716-b9c2-413d24b743ee',
 'a2af9cdb-ee05-465a-b9bc-019e4119b76c',
 '34e6bab7-7e5b-478c-a21c-239b07678cb1',
 'ad207ca0-a4c3-4a7b-b9d0-1f39de1c9a55',
 '73f24325-2a41-4b5e-8c0e-fe49935566e6',
 'f0d70ef4-cc18-45f0-8005-a89a7927d123',
 'bb0ce131-d797-4e3c-8ea7-3112c3dd30bd',
 '2b8e7d27-c974-4d78-bd57-995a5aeba114',
 '7d508589-37e6-46a0-94a6-347ec131c00d',
 'd150bdfe-09e4-

#### Create an index over the Vector Store and query it

In [26]:
lp_index = VectorStoreIndex.from_vector_store(vector_store)
lp_query_engine = lp_index.as_query_engine(
    similarity_top_k=2
)

## Query Comparison between LlamaIndex and LlamaParse
Below are queries whose answers are found inside complex tables in the 10-Q document. Let's see how each approach compares.

In [27]:
queries = [
    "What are the total cash, cash equivalents, and marketable securities as of Dec 23 2023",
    "Total investments with fair value change reflected in other comprehensive income as of Dec 23 2023",
    "What is the corporate debt securities unrealized loss as of Dec 31 2023 for 12 months or greater?",
    "What is the coupon rate for total outstanding debt",
    "Provide the table of share repurchases"]

In [28]:
from termcolor import colored

def print_output(response: Response):
    print("Response:")
    print("-" * 80)
    print(colored(response.response, color="red"))
    print("-" * 80)
    print("Source Documents:")
    print("-" * 80)
    for source in response.source_nodes:
        print(f"Sample Text: {source.text[:100]}")
        print(f"Relevance score: {source.get_score():.3f}")
        print(f"File Name: {source.metadata.get('file_name')}")
        print(f"Page #: {source.metadata.get('page_label')}")
        print(f"File Path: {source.metadata.get('file_path')}")
        print("-" * 80)
    
def run_query(query_idx: int):
    query = queries[query_idx]
    print("Query: " + query)
    print(colored("LlamaIndex SimpleDirectoryReader response....\n", color="blue"))
    print_output(simple_query_engine.query(query))
    
    print(colored("LlamaIndex LangChainNodeParser on LlamaIndex response....\n", color="blue"))
    print_output(lg_query_engine.query(query))

    print(colored("LlamaParse Simple response....\n", color="blue"))
    print_output(lp_simple_engine.query(query))

    print(colored("LlamaParse on Vertex AI response....\n", color="blue"))
    print_output(lp_query_engine.query(query))
    print("###################################################\n\n")



In [29]:
run_query(query_idx=0)

Query: What are the total cash, cash equivalents, and marketable securities as of Dec 23 2023
LlamaIndex SimpleDirectoryReader response....

Response:
--------------------------------------------------------------------------------
The total cash, cash equivalents, and marketable securities as of December 31, 2023 is $110,916 million.
--------------------------------------------------------------------------------
Source Documents:
--------------------------------------------------------------------------------
Sample Text: PART I. FINANCIAL INFORMATION
ITEM 1. FINANCIAL STATEMENTS
Alphabet Inc.
CONSOLIDATED BALANCE SHEETS
Relevance score: 0.623
File Name: goog-10-q-q2-2024.pdf
Page #: 5
File Path: /home/noabe/generative-ai/gemini/use-cases/document-processing/data/goog-10-q-q2-2024.pdf
--------------------------------------------------------------------------------
Sample Text: or inputs that are based upon quoted prices for similar instruments in active markets. 
De

In [33]:
run_query(query_idx=1)

Query: Total investments with fair value change reflected in other comprehensive income as of Dec 23 2023
LlamaIndex SimpleDirectoryReader response....

Response:
--------------------------------------------------------------------------------
The total investments with fair value change reflected in other comprehensive income as of December 31, 2023 is $78,917 million.
--------------------------------------------------------------------------------
Source Documents:
--------------------------------------------------------------------------------
Sample Text: or inputs that are based upon quoted prices for similar instruments in active markets. 
Debt securit
Relevance score: 0.693
File Name: google_2024_q2.pdf
Page #: 13
File Path: /usr/local/google/home/noabe/projects/llama/llama_parse_rag/run_parse/data/google_2024_q2.pdf
--------------------------------------------------------------------------------
Sample Text: As of June 30, 2024
Fair Value 
HierarchyAdjusted 

In [34]:
run_query(query_idx=2)

Query: What is the corporate debt securities unrealized loss as of Dec 31 2023 for 12 months or greater?
LlamaIndex SimpleDirectoryReader response....

Response:
--------------------------------------------------------------------------------
The corporate debt securities unrealized loss as of Dec 31 2023 for 12 months or greater is $592 million.
--------------------------------------------------------------------------------
Source Documents:
--------------------------------------------------------------------------------
Sample Text: Debt  Securiti es
The following table summarizes the estimated fair value of investments in availabl
Relevance score: 0.704
File Name: google_2024_q2.pdf
Page #: 15
File Path: /usr/local/google/home/noabe/projects/llama/llama_parse_rag/run_parse/data/google_2024_q2.pdf
--------------------------------------------------------------------------------
Sample Text: Equity Investments
The carrying value of equity securities is measured as 

In [35]:
run_query(query_idx=3)

Query: What is the coupon rate for total outstanding debt
LlamaIndex SimpleDirectoryReader response....

Response:
--------------------------------------------------------------------------------
The coupon rate for total outstanding debt is 0.45% to 2.25% as of December 31, 2023 and 0.57% to 2.33% as of June 30, 2024.
--------------------------------------------------------------------------------
Source Documents:
--------------------------------------------------------------------------------
Sample Text: Long-Term Debt
 Total outstanding debt is summarized below (in millions, except percentages):
Maturi
Relevance score: 0.631
File Name: google_2024_q2.pdf
Page #: 22
File Path: /usr/local/google/home/noabe/projects/llama/llama_parse_rag/run_parse/data/google_2024_q2.pdf
--------------------------------------------------------------------------------
Sample Text: Debt  Securiti es
The following table summarizes the estimated fair value of investments in availabl
Rel

In [36]:
run_query(query_idx=4)

Query: Provide the table of share repurchases
LlamaIndex SimpleDirectoryReader response....

Response:
--------------------------------------------------------------------------------
## Share Repurchases

| Three Months Ended June 30, 2024 | Six Months Ended June 30, 2024 |
|---|---|
| **Shares** | **Amount** | **Shares** | **Amount** |
| Class A share repurchases | 19 million | 43 million | $6,615 million |
| Class C share repurchases | 73 million | 160 million | $25,040 million |
| **Total share repurchases** | **92 million** | **203 million** | **$31,655 million** |

**Notes:**

* Shares repurchased include unsettled repurchases as of June 30, 2024.
* For additional information, see Note 9 of the Notes to Consolidated Financial Statements included in Item 1 of this Quarterly Report on Form 10-Q.
--------------------------------------------------------------------------------
Source Documents:
-----------------------------------------------------------------------

## Observations

### Answer Key
| Query                                                                                                | Answer           | Citation page |
|------------------------------------------------------------------------------------------------------|------------------|---------------|
| "What are the total cash, cash equivalents, and marketable securities as of Dec 23 2023"             | $110,916 million | 5             |
| "Total investments with fair value change reflected in other comprehensive income as of Dec 23 2023" | $78,917 million  | 13            |
| "What is the corporate debt securities unrealized loss as of Dec 31 2023 for 12 months or greater?"  | $592 million     | 15            |
| "What is the coupon rate for total outstanding debt"                                                 | 0.45-2.25%       | 22            |
| "Provide the table of share repurchases"                                                             | Table            | 27 or 49      |

### Generated Answers
| Document Parsing Technique               | Query 1 | Query 2 | Query 3 | Query 4 | Query 5 |
|------------------------------------------|---------|---------|---------|---------|---------|
| LlamaIndex - SimpleDirectoryReader       | (✓)     | (✓)     | (✓)     | (✓)     | (✓)     |
| LlamaIndex - LangchainNodeParser         | (✓)     | (✓)     | (✓)     | (✓)     | (✓)     |
| LlamaParse - SimpleDirectoryReader       | (✓)     | (✓)     | (✓)     | (✓)     | (✓)     |
| LlamaParse - Vertex AI Vector Search     | (✓)     | (✓)     | (✓)     | (✓)     | (✓)     |
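
The checkmarks above were assigned by reading each response against the answer key. A first-pass automated check could look like the sketch below, which tests whether the expected figure appears in the response text after light normalization. The `normalize` and `contains_expected` helpers are hypothetical, and the expected answers and responses are taken from the answer key and the SimpleDirectoryReader outputs shown earlier.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop whitespace, dollar signs, and thousands separators."""
    return re.sub(r"[\s$,]", "", text.lower())


def contains_expected(response_text: str, expected: str) -> bool:
    """True if the expected figure appears in the response after normalization."""
    return normalize(expected) in normalize(response_text)


# (expected answer, recorded response) pairs from this notebook.
checks = [
    (
        "$110,916 million",
        "The total cash, cash equivalents, and marketable securities as of "
        "December 31, 2023 is $110,916 million.",
    ),
    (
        "$592 million",
        "The corporate debt securities unrealized loss as of Dec 31 2023 for "
        "12 months or greater is $592 million.",
    ),
    (
        "0.45-2.25%",
        "The coupon rate for total outstanding debt is 0.45% to 2.25% as of "
        "December 31, 2023 and 0.57% to 2.33% as of June 30, 2024.",
    ),
]

for expected, response_text in checks:
    print(f"{expected!r}: {contains_expected(response_text, expected)}")
```

Plain substring matching is deliberately crude. It misses the coupon-rate answer because the response writes the range as "0.45% to 2.25%", and it could also match a figure embedded in a longer number, so ranges and tables still need a manual review.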

## Conclusion

There are many ways to customize your data ingestion and retrieval pipelines for custom RAG applications. This notebook gave an overview of a handful of options that work in combination with Google's Gemini models.