In [1]:
!wget "https://openreview.net/pdf?id=VtmBAGCN7o"

--2025-02-10 13:57:32--  https://openreview.net/pdf?id=VtmBAGCN7o
Resolving openreview.net (openreview.net)... 35.184.86.251
Connecting to openreview.net (openreview.net)|35.184.86.251|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16911937 (16M) [application/pdf]
Saving to: ‘pdf?id=VtmBAGCN7o.1’


2025-02-10 13:57:32 (34.4 MB/s) - ‘pdf?id=VtmBAGCN7o.1’ saved [16911937/16911937]



In [14]:
%pip install llama-index-embeddings-gemini

Collecting llama-index-embeddings-gemini
  Downloading llama_index_embeddings_gemini-0.3.1-py3-none-any.whl.metadata (697 bytes)
Downloading llama_index_embeddings_gemini-0.3.1-py3-none-any.whl (2.9 kB)
Installing collected packages: llama-index-embeddings-gemini
Successfully installed llama-index-embeddings-gemini-0.3.1


In [2]:
# Install the core package and the Gemini LLM integration in one step
%pip install llama-index llama-index-llms-gemini



In [3]:
import nest_asyncio

nest_asyncio.apply()

In [9]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["/content/Attention is all you need.pdf"]).load_data()

In [19]:
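import getpass
import os

GOOGLE_API_KEY = getpass.getpass("Enter your Google AI API key: ")
# Also export the key so Gemini clients that read the environment can find it
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY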

Enter your Google AI API key: ··········


In [10]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
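
To make the chunking step above concrete, here is a minimal sketch of sentence-aware splitting. It is a toy stand-in, not LlamaIndex's implementation: `split_into_chunks` is a hypothetical helper, it counts whitespace-separated words rather than tokens, and it ignores chunk overlap.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    Assumption: words stand in for tokens. A single sentence longer than
    chunk_size is kept whole as its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
for chunk in split_into_chunks(sample, chunk_size=6):
    print(repr(chunk))
```

Keeping sentences intact means a chunk never ends mid-sentence, which is the main reason to prefer a sentence-based splitter over a fixed-width one.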

In [15]:
# imports
from llama_index.embeddings.gemini import GeminiEmbedding

In [20]:
model_name = "models/embedding-001"

embed_model = GeminiEmbedding(
    model_name=model_name, api_key=GOOGLE_API_KEY, title="this is a document"
)


In [23]:
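from llama_index.core import Settings
from llama_index.llms.gemini import Gemini

# Use Gemini for generation; pass the key explicitly rather than relying on the environment
Settings.llm = Gemini(
    model="models/gemini-1.5-flash",
    api_key=GOOGLE_API_KEY,
)
# Reuse the embedding model configured above instead of wrapping it in a new GeminiEmbedding
Settings.embed_model = embed_model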

In [24]:
from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)

In [25]:
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

In [26]:
from llama_index.core.tools import QueryEngineTool


summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to documents"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the documents."
    ),
)

In [27]:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True
)

In [28]:
from llama_index.core.tools import QueryEngineTool


summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to MetaGPT"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the MetaGPT paper."
    ),
)

In [29]:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True
)
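
The router above picks exactly one tool per query based on the tool descriptions. The sketch below illustrates that single-selection pattern only. `Tool`, `keyword_selector`, and `route` are hypothetical names, and the selector scores word overlap instead of asking an LLM, which is what `LLMSingleSelector` does.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_selector(query: str, tools: list[Tool]) -> int:
    """Hypothetical selector: index of the tool whose description
    shares the most words with the query (first tool wins ties)."""
    query_words = tokenize(query)
    scores = [len(query_words & tokenize(t.description)) for t in tools]
    return scores.index(max(scores))


def route(query: str, tools: list[Tool]) -> tuple[str, str]:
    """Select one tool, run it, and report which tool answered."""
    chosen = tools[keyword_selector(query, tools)]
    return chosen.name, chosen.run(query)


tools = [
    Tool("summary", "summary or overview of the whole document",
         lambda q: "answer built from every chunk"),
    Tool("vector", "specific facts or details from the document",
         lambda q: "answer built from the closest chunks"),
]

print(route("Give me a summary of the document", tools))
print(route("What specific details are given", tools))
```

Because the description is the only signal the selector sees, wording it precisely matters more than anything else in the tool definition.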

In [30]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

Selecting query engine 0: The question asks for a summary of the document, which directly relates to summarization..
This paper introduces the Transformer, a novel neural network architecture for sequence transduction that relies solely on attention mechanisms, eliminating recurrence and convolutions.  Evaluated on machine translation tasks (English-to-German and English-to-French), the Transformer surpasses existing models in quality and parallelization, requiring significantly less training time.  Its effectiveness extends to other tasks, as demonstrated by successful application to English constituency parsing.  The Transformer's architecture comprises encoder and decoder stacks built from multi-head self-attention and point-wise feed-forward networks, incorporating residual connections and layer normalization.  The paper details the scaled dot-product attention mechanism, multi-head attention, positional encoding, and training procedures.  Experiments show the im

In [31]:
print(len(response.source_nodes))

15


In [32]:
response = query_engine.query("How is self attention calculated")
print(str(response))

Selecting query engine 1: The question asks for a specific detail from the MetaGPT paper, which is best addressed by retrieving specific context..
Self-attention is calculated by using multi-head attention where the queries, keys, and values all come from the same source—the output of the previous layer.  Each position attends to all positions in the previous layer.  In the decoder, this is modified to prevent positions from attending to subsequent positions to maintain the autoregressive property.  This is done by masking out illegal connections within the scaled dot-product attention mechanism.  The multi-head attention is computed as the concatenation of individual attention heads, each calculated as Attention(QWQ<sup>i</sup>, KWK<sup>i</sup>, VWV<sup>i</sup>), where Q, K, and V are the query, key, and value matrices, and WQ<sup>i</sup>, Wk<sup>i</sup>, and WV<sup>i</sup> are projection matrices.  The concatenated heads are then projected to produce the final outp
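
The answer above refers to the scaled dot-product attention mechanism, which computes softmax(QK^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch of that step for a single head. It assumes no masking and no learned projection matrices, and the function names are illustrative only.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors. In self-attention all three
    are derived from the same sequence.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# Self-attention on a toy two-token sequence: Q = K = V = X
X = [[1.0, 0.0], [0.0, 1.0]]
for row in scaled_dot_product_attention(X, X, X):
    print([round(x, 4) for x in row])
```

Each output row is a weighted average of the value vectors, with the largest weight on the token whose key best matches the query.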

In [33]:
print(len(response.source_nodes))

2
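
The two source-node counts above (15 for the summary engine, 2 for the vector engine) are consistent with how the two indexes retrieve: a summary index passes every node to the response synthesizer, while a vector index returns only the top-k most similar nodes, with k defaulting to 2 in LlamaIndex. The sketch below illustrates the difference with toy vectors. The helper names are hypothetical and the embeddings are made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, node_vecs, k=2):
    """Vector-index style: indices of the k nodes most similar to the query."""
    ranked = sorted(
        range(len(node_vecs)),
        key=lambda i: cosine(query_vec, node_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def retrieve_all(node_vecs):
    """Summary-index style: every node is used, regardless of the query."""
    return list(range(len(node_vecs)))


# Toy embeddings for four nodes (made-up values)
node_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

print(retrieve_top_k(query_vec, node_vecs))
print(len(retrieve_all(node_vecs)))
```

To widen the vector engine's context, `as_query_engine` accepts a `similarity_top_k` argument, for example `vector_index.as_query_engine(similarity_top_k=5)`.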
