In [1]:
import dotenv
%load_ext dotenv
%dotenv

In [2]:
import nest_asyncio
nest_asyncio.apply()

### Load Data

In [3]:
from llama_index.core import SimpleDirectoryReader

# load lora_paper.pdf documents
documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()

#### Create Document Chunks

In [4]:
from llama_index.core.node_parser import SentenceSplitter

# chunk_size of 1024 is a good default value
splitter = SentenceSplitter(chunk_size=1024)
# Create nodes from documents
nodes = splitter.get_nodes_from_documents(documents)
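
To make the chunking step concrete, here is a minimal pure-Python sketch of sentence-aware chunking. It is a simplified stand-in, not `SentenceSplitter`'s implementation: `split_into_chunks` is a hypothetical helper, and it measures `chunk_size` in characters where the real splitter counts tokens.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Assumption: sizes are measured in characters for simplicity.
    A single sentence longer than chunk_size still becomes its own chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "LoRA freezes the pre-trained weights. "
    "It trains low-rank matrices instead. "
    "This cuts trainable parameters."
)
toy_chunks = split_into_chunks(sample, chunk_size=80)
for chunk in toy_chunks:
    print(len(chunk), chunk)
```

Keeping sentences whole means a chunk never ends mid-thought, which tends to help both retrieval and summarization.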

#### Get Node Details

In [12]:
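from llama_index.core.schema import MetadataMode

# print the node text together with all of its metadata
node_content = nodes[1].get_content(metadata_mode=MetadataMode.ALL)
print(node_content)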

page_label: 2
file_name: lora_paper.pdf
file_path: datasets/lora_paper.pdf
file_type: application/pdf
file_size: 1609513
creation_date: 2024-05-10
last_modified_date: 2024-05-10

often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by optimizing rank decomposition matrice

In [10]:
# total number of nodes (chunks), not PDF pages
len(nodes)

38

#### Creating LLM and Embedding Models

In [13]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# LLM model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# embedding model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

#### Creating Indexes

In [1]:
from llama_index.core import SummaryIndex, VectorStoreIndex

# summary index
summary_index = SummaryIndex(nodes)
# vector store index
vector_index = VectorStoreIndex(nodes)


#### Creating Query Engines

In [15]:
# summary query engine
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)

# vector query engine
vector_query_engine = vector_index.as_query_engine()
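
The idea behind `response_mode="tree_summarize"` with `use_async=True` can be sketched as follows: chunks are summarized in groups, the group summaries are summarized again, and so on until one summary remains, with the independent calls at each level running concurrently. This is an illustration under stated assumptions, not LlamaIndex's code: `summarize` is a hypothetical stand-in for the LLM call, and `fan_in` is an invented parameter (the library groups text by what fits in the prompt).

```python
import asyncio


async def summarize(texts: list[str]) -> str:
    """Hypothetical stand-in for an LLM call: keep the first four words of each text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return " / ".join(" ".join(t.split()[:4]) for t in texts)


async def tree_summarize(chunks: list[str], fan_in: int = 2) -> str:
    """Summarize groups of chunks level by level until a single summary remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = list(chunks)
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        # Groups on the same level are independent, so they run concurrently.
        level = list(await asyncio.gather(*(summarize(g) for g in groups)))
    return level[0]


result = asyncio.run(tree_summarize(["a b c d e", "f g", "h i j k l m", "n"]))
print(result)
```

Inside a notebook, `await tree_summarize(...)` can be used directly in place of `asyncio.run(...)`.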

#### Query Tools

In [16]:
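from llama_index.core.tools import QueryEngineTool

# summary tool: wraps the summary query engine
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to the Lora paper."
    ),
)

# vector tool: wraps the vector query engine
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the Lora paper."
    ),
)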

#### Router Query Engine

In [17]:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True
)
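
Conceptually, a router does two things: a selector picks one tool index based on the query and the tool descriptions, and the router forwards the query to that tool. The sketch below shows that control flow in plain Python. It is not how `RouterQueryEngine` or `LLMSingleSelector` are implemented: `ToyTool`, `keyword_selector`, and `route` are hypothetical names, and a keyword rule stands in for the LLM's decision.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTool:
    description: str
    answer: Callable[[str], str]


def keyword_selector(query: str, tools: list[ToyTool]) -> int:
    """Hypothetical rule replacing the LLM: match 'summar' in query and description."""
    wants_summary = "summar" in query.lower()
    for index, tool in enumerate(tools):
        if ("summar" in tool.description.lower()) == wants_summary:
            return index
    return 0  # fall back to the first tool


def route(query: str, tools: list[ToyTool]) -> tuple[int, str]:
    """Select one tool, then forward the query to it."""
    index = keyword_selector(query, tools)
    return index, tools[index].answer(query)


toy_tools = [
    ToyTool(
        "Useful for summarization questions related to the Lora paper.",
        lambda q: "summary built from all nodes",
    ),
    ToyTool(
        "Useful for retrieving specific context from the Lora paper.",
        lambda q: "answer built from the most similar nodes",
    ),
]

print(route("What is the summary of the document?", toy_tools))
print(route("What is the long form of LoRA?", toy_tools))
```

Because the selector only sees the descriptions, their wording matters: vague or overlapping descriptions make it harder to pick the right engine.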

#### Testing Out The Router

In [21]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

Selecting query engine 0: The question is asking for a summary of the document, which is typically related to summarization questions..
The document discusses a novel adaptation strategy called LoRA for large language models like GPT-3, which involves freezing pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. This method significantly reduces the number of trainable parameters for downstream tasks, enabling efficient task-switching, reducing memory requirements, and maintaining high model quality without adding inference latency. The document also explores the effectiveness of LoRA through empirical investigations on various tasks and models like RoBERTa, DeBERTa, GPT-2, and GPT-3, comparing its performance to traditional fine-tuning and other adaptation methods. Additionally, it delves into the concept of low-rank structures in deep learning, the implications of low-rank updates in neur

The response above summarizes the whole LoRA paper: the router sent the query to the summary query engine, which builds its answer from every node in the index.

In [23]:
print(len(response.source_nodes))

38
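
The count of 38 equals `len(nodes)` from earlier, because the summary engine passes every node to the response synthesizer, whereas a vector engine keeps only the few nodes most similar to the query. A minimal sketch of that difference, assuming bag-of-words cosine similarity in place of real embeddings (`retrieve_all`, `retrieve_top_k`, and the toy node texts are hypothetical):

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': word counts instead of a learned vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_all(query: str, nodes: list[str]) -> list[str]:
    """Summary-style retrieval: every node is returned, whatever the query."""
    return list(nodes)


def retrieve_top_k(query: str, nodes: list[str], k: int = 2) -> list[str]:
    """Vector-style retrieval: keep only the k nodes most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, bag_of_words(n)), reverse=True)
    return ranked[:k]


toy_nodes = [
    "lora adds low rank matrices",
    "results on glue benchmark",
    "rank ablation study for lora",
    "related work on adapters",
]

print(len(retrieve_all("lora rank", toy_nodes)))
print(retrieve_top_k("lora rank", toy_nodes, k=2))
```

This is why summarization queries cost more LLM calls than targeted lookups: the work grows with the number of nodes rather than with `k`.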


In [24]:
print(response.source_nodes[9])

Node ID: 86a3311a-27d5-4260-b919-af058cbb99b2
Text: We report the overall (matched and mismatched) accuracy for
MNLI, Matthew’s correlation for CoLA, Pearson correlation for STS-B,
and accuracy for other tasks. Higher is better for all metrics. *
indicates numbers published in prior works. †indicates runs conﬁgured
in a setup similar to Houlsby et al. (2019) for a fair comparison.
Bias-only or Bi...
Score: None



In [19]:
response = query_engine.query("What is the long form of Lora?")
print(str(response))

Selecting query engine 1: The question is asking for the long form of Lora, which would require retrieving specific context from the Lora paper..
Low-Rank-Parametrized Update Matrices


#### Placing All Together Into One Function

In [25]:
async def create_router_query_engine(
    document_fp: str,
    verbose: bool = True,
) -> RouterQueryEngine:
    # load the document at the given file path
    documents = SimpleDirectoryReader(input_files=[document_fp]).load_data()
    
    # chunk_size of 1024 is a good default value
    splitter = SentenceSplitter(chunk_size=1024)
    # Create nodes from documents
    nodes = splitter.get_nodes_from_documents(documents)
    
    # LLM model
    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    # embedding model
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
    
    # summary index
    summary_index = SummaryIndex(nodes)
    # vector store index
    vector_index = VectorStoreIndex(nodes)
    
    # summary query engine
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )

    # vector query engine
    vector_query_engine = vector_index.as_query_engine()
    
    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_query_engine,
        description=(
            "Useful for summarization questions related to the Lora paper."
        ),
    )

    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_query_engine,
        description=(
            "Useful for retrieving specific context from the Lora paper."
        ),
    )
    
    
    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[
            summary_tool,
            vector_tool,
        ],
        verbose=verbose
    )
    
    
    return query_engine


In [27]:
query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))

The document introduces a novel adaptation strategy called LoRA for large language models like GPT-3, which involves freezing pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters for downstream tasks, enabling efficient task-switching, reducing memory requirements, and maintaining model quality without introducing inference latency. Empirical investigations highlight the efficacy of LoRA compared to traditional fine-tuning and other adaptation methods, showing that LoRA can outperform or match fine-tuning in model quality while requiring significantly fewer trainable parameters. The document also discusses various experiments and analyses related to deep learning models, focusing on adaptation methods like LoRA and their impact on model performance, exploring topics such as the correlation between LoRA modules, the effect of different rank

In [29]:
from utils import create_router_query_engine

query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))



Selecting query engine 0: Useful for summarization questions related to the Lora paper..
The document introduces LoRA, a novel adaptation strategy for large language models that involves freezing pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture. This method reduces the number of trainable parameters, allows for sharing pre-trained models across tasks, minimizes memory requirements, and enables quick task-switching without inference latency. Empirical investigations demonstrate that LoRA performs comparably or better than full fine-tuning on various language models like RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters. The document also explores the concept of rank-deficiency in language model adaptation and provides insights into the effectiveness of LoRA through experiments and analyses on different hyperparameters and tasks, highlighting its promising perfo