<a href="https://colab.research.google.com/github/GenAIHub/genai-workshop/blob/main/02_RAG/02_ingestion_and_query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install boto3
!pip install pydantic==1.10.9
!pip install llama-index
!pip install llama-index-embeddings-bedrock==0.1.4
!pip install llama-index-llms-bedrock==0.1.5
!pip install llama-index-vector-stores-postgres==0.1.5
!pip install langchain==0.0.333
!pip install PyMuPDF
!pip install llama-index-readers-file

In [2]:
import fitz
import os
import requests

# Create the data directory if it doesn't exist
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

# Define the URL and the local file path
pdf_url = "https://d687lz8k56fia.cloudfront.net/sec-edgar-filings/0001018724/10-K/0001018724-23-000004/filing-details.pdf"
pdf_path = os.path.join(data_dir, "AMZN.pdf")

# Download the PDF file
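response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # stop here if the download failed instead of saving an error page
with open(pdf_path, "wb") as file:
    file.write(response.content)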

doc = fitz.open(pdf_path)
# doc = fitz.open("./data/your_file.pdf")

# Print the number of pages in the PDF
print(f"\nThe document has {doc.page_count} pages.\n")

# Concatenate the text of every page into a single string
full_pdf_content = ""
for page_number in range(doc.page_count):
    pg = doc.load_page(page_number)
    full_pdf_content += pg.get_text()

print(f"\n{full_pdf_content[:1000]}")


The document has 75 pages.


Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 ____________________________________
FORM 10-K
____________________________________ 
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2022
or
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from            to             .
Commission File No. 000-22513
____________________________________
AMAZON.COM, INC.
(Exact name of registrant as specified in its charter)
Delaware
 
91-1646860
(State or other jurisdiction of
incorporation or organization)
 
(I.R.S. Employer
Identification No.)
410 Terry Avenue North
Seattle, Washington 98109-5210
(206) 266-1000
(Address and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of the Act:
Title of E
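
The loop above concatenates every page into one string, so page boundaries are lost by the time the text is chunked. A minimal sketch, assuming a plain list of strings stands in for the `pg.get_text()` results, of recording where each page starts in the combined text. `join_pages` is a hypothetical helper for illustration, not part of PyMuPDF.

```python
def join_pages(page_texts):
    """Concatenate page texts and record the start offset of each page."""
    parts = []
    offsets = []  # offsets[i] is the index in the full text where page i begins
    position = 0
    for text in page_texts:
        offsets.append(position)
        parts.append(text)
        position += len(text)
    return "".join(parts), offsets


# Stand-in data for the per-page text extracted from the PDF (assumption)
sample_pages = ["Cover page\n", "Risk factors\n", "Financials\n"]
full_text, page_offsets = join_pages(sample_pages)
print(page_offsets)
```

With offsets like these, a character position in the combined text can be mapped back to a page number if page-level metadata is wanted later.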

In [3]:
from llama_index.core.node_parser.text.sentence import SentenceSplitter
from llama_index.core.schema import Document
import pprint as pp

chunk_size = 1024
chunk_overlap = 256

splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

docs = splitter.get_nodes_from_documents(
        [
            Document(
                text=full_pdf_content,
                metadata={
                    "page": 0,
                    #"file_name": f"{metadata_file_name}",
                    "source": "text",
                    "document_id": "some_id",
                },
                excluded_embed_metadata_keys=["page", "source", "document_id"],
                excluded_llm_metadata_keys=["page", "source", "document_id"],
            )
        ]
    )
pp.pprint(docs[0])

TextNode(id_='19426ee7-846b-49a1-b9c3-93f164aa536f', embedding=None, metadata={'page': 0, 'source': 'text', 'document_id': 'some_id'}, excluded_embed_metadata_keys=['page', 'source', 'document_id'], excluded_llm_metadata_keys=['page', 'source', 'document_id'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ae2e97b8-252d-4011-8dd1-ce041a5a325e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page': 0, 'source': 'text', 'document_id': 'some_id'}, hash='98b9dc512b239b5301a6757daafca0dcec035a1e353aa95f9e7dbb3bd3200805'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='f37e2c42-8cbb-47b4-8049-d7653e247e6e', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='87c52425f841133fe345569918a8eb358dec72055a105cfcb18c67e313c0c375')}, text='Table of Contents\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n ____________________________________\nFORM 10-K\n____________________________________ \n(Mark One)\n☒\nANNUAL REPORT PURSUANT TO SECT
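
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is not how `SentenceSplitter` works internally: the real splitter counts tokens and prefers sentence boundaries, whereas this hypothetical `sliding_window_chunks` helper just slices a list of words.

```python
def sliding_window_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into windows that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: ten placeholder words instead of the 10-K text
words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the tail of the previous one, which is why a larger overlap produces more chunks for the same document and therefore more embedding and LLM calls later in the pipeline.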

In [4]:
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock
from llama_index.core.llms import ChatMessage
from llama_index.core.settings import Settings
import boto3
import asyncio
import os


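# Placeholders: fill in your own AWS credentials before running this cell,
# and avoid committing real keys to version control.
os.environ["AWS_REGION"] = "us-east-1"
os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""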

embed_model_id = "amazon.titan-embed-text-v1"
llm_model_id = "anthropic.claude-3-haiku-20240307-v1:0"

BEDROCK_CLIENT = boto3.client("bedrock-runtime", region_name="us-east-1")

llm = Bedrock(
    model=llm_model_id,
    client=BEDROCK_CLIENT,
    aws_region_name="us-east-1",
    temperature=0.1,
    max_tokens=512
    )


embed_model = BedrockEmbedding(model=embed_model_id, client=BEDROCK_CLIENT)

Settings.llm = llm
Settings.embed_model = embed_model
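
The credential variables in the cell above are left as empty strings, so a quick check before calling Bedrock can save a confusing authentication error. A minimal sketch, assuming credentials are supplied only through these three environment variables (boto3 can also read them from other sources such as shared config files or IAM roles); `missing_aws_settings` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with the same placeholder values used in the notebook cell
example_env = {
    "AWS_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": "",
    "AWS_SECRET_ACCESS_KEY": "",
}
print(missing_aws_settings(example_env))
```

Calling `missing_aws_settings()` with no argument checks the real environment of the running kernel.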

In [5]:
from llama_index.core.ingestion.pipeline import IngestionPipeline
from llama_index.core.extractors import QuestionsAnsweredExtractor
import asyncio

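# Ask the LLM for two questions each chunk can answer and store them in the node metadata
pipeline = IngestionPipeline(
    transformations=[
        QuestionsAnsweredExtractor(questions=2),
    ]
)
# Top-level await works in a notebook; in a plain script, wrap the call in asyncio.run()
nodes = await pipeline.arun(nodes=docs)
pp.pprint(nodes[0])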

100%|██████████| 76/76 [04:09<00:00,  3.28s/it]  

TextNode(id_='19426ee7-846b-49a1-b9c3-93f164aa536f', embedding=None, metadata={'page': 0, 'source': 'text', 'document_id': 'some_id', 'questions_this_excerpt_can_answer': 'Based on the provided context, here are two questions that can be answered using the information given:\n\n1. What is the ticker symbol for Amazon.com, Inc.\'s common stock?\nThe context states that "Common Stock, par value $.01 per share" is traded under the ticker symbol "AMZN" on the Nasdaq Global Select Market.\n\n2. What was the aggregate market value of Amazon.com, Inc.\'s voting stock held by non-affiliates as of June 30, 2022?\nThe context provides this specific information, stating that the "Aggregate market value of voting stock held by non-affiliates of the registrant as of June 30, 2022" was "$944,744,113,598".\n\nHigher-level summary:\nThe provided context is the beginning of Amazon.com, Inc.\'s 2022 annual report (Form 10-K) filed with the U.S. Securities and Exchange Commission. It contains general inf




In [6]:
from llama_index.core import VectorStoreIndex

# Embed every node with Settings.embed_model and keep the vectors in an in-memory index
index = VectorStoreIndex(nodes)

print(index)

<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x7dbbbc1ebdc0>
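
The index stores one embedding per node, and at query time it returns the nodes whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity over toy two-dimensional vectors. It is an assumption-laden illustration, not LlamaIndex's implementation; `cosine_similarity`, `top_k`, and the vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k):
    """Return the ids of the k nodes most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), node_id)
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Toy embeddings standing in for the Titan vectors (assumption)
toy_embeddings = {
    "node_a": [1.0, 0.0],
    "node_b": [0.7, 0.7],
    "node_c": [0.0, 1.0],
}
result = top_k([1.0, 0.1], toy_embeddings, k=2)
print(result)
```

A small `k` keeps the prompt short but means the answer can only draw on a couple of chunks from the document.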


In [7]:
from llama_index.core import PromptTemplate
from IPython.display import Markdown, display

# define prompt viewing function
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))


qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

text_qa_template = PromptTemplate(qa_prompt_str)

query_engine = index.as_query_engine(
    similarity_top_k=2
    )

query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": text_qa_template}
)

prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)

**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the question: {query_str}



<br><br>

**Prompt Key**: response_synthesizer:refine_template<br>**Text:** <br>

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 


<br><br>
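
To see what the LLM actually receives, the `text_qa_template` string can be filled in by hand. This sketch uses plain `str.format` rather than LlamaIndex's `PromptTemplate`, and it assumes the retrieved chunks are joined with blank lines; `build_prompt` and the sample chunks are hypothetical.

```python
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def build_prompt(chunks, question):
    """Fill the QA template with retrieved chunks and the user question."""
    context_str = "\n\n".join(chunks)  # joining convention is an assumption
    return qa_prompt_str.format(context_str=context_str, query_str=question)


# Placeholder chunks standing in for the two retrieved nodes
prompt = build_prompt(
    ["chunk one text", "chunk two text"],
    "What is this document about?",
)
print(prompt)
```

Because only the retrieved chunks are placed in `{context_str}`, the answer reflects those chunks rather than the whole document.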

In [8]:
question = "What is this document about?"
response = query_engine.query(question)
print(response)

Based on the context provided, this document appears to be an excerpt from a risk factors section of a regulatory filing, likely an annual report or 10-K, for the company Amazon.

The key points from the context are:

1. It discusses how the seasonality of Amazon's business, particularly the holiday sales period, impacts its cash, cash equivalents, and marketable securities balances.

2. It outlines risks related to Amazon's relationships with third-party sellers on its platform, including liability for unlawful activities by sellers and the need to protect its intellectual property.

3. It describes risks associated with Amazon's commercial agreements, strategic alliances, and other business relationships, as well as risks related to its acquisitions and investments in other companies.

So in summary, this document is focused on detailing the various risks and challenges that Amazon faces in operating its e-commerce and cloud computing businesses, particularly around seasonality, thir