In [1]:
import dotenv
%load_ext dotenv
%dotenv

In [2]:
import nest_asyncio
nest_asyncio.apply()

In [5]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()

In [6]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
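
As a rough illustration of what sentence-aware splitting does, the sketch below packs whole sentences into chunks under a size budget. It is not the `SentenceSplitter` implementation: it counts whitespace-separated words instead of tokens, finds sentence boundaries with a naive regex, and ignores chunk overlap. `chunk_sentences` and `max_words` are hypothetical names introduced only for this example.

```python
import re


def chunk_sentences(text: str, max_words: int = 20) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Assumption: words stand in for tokens. A single sentence longer than
    max_words is kept whole and becomes its own oversized chunk.
    """
    # Naive sentence boundary: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk when this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "LoRA freezes the pre-trained weights. "
    "It trains low-rank matrices instead. "
    "This cuts trainable parameters."
)
chunks = chunk_sentences(sample, max_words=10)
for chunk in chunks:
    print(repr(chunk))
```

Keeping sentences intact means no chunk starts or ends mid-sentence, which tends to help both embedding quality and the readability of retrieved context.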

In [9]:
from llama_index.core.schema import MetadataMode

node_metadata = nodes[1].get_content(metadata_mode=MetadataMode.ALL)  # text plus all metadata
print(node_metadata)

page_label: 2
file_name: lora_paper.pdf
file_path: datasets/lora_paper.pdf
file_type: application/pdf
file_size: 1609513
creation_date: 2024-05-12
last_modified_date: 2024-05-12

often introduce inference latency (Houlsby et al., 2019; Rebufﬁ et al., 2017) by extending model
depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Ham-
bardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to
match the ﬁne-tuning baselines, posing a trade-off between efﬁciency and model quality.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned
over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the
change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed
Low-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural
network indirectly by optimizing rank decomposition matrice

In [10]:
len(nodes)

38

#### Create LLM And Embedding Model

In [11]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

#### Creating Indexes

In [12]:
from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes=nodes)
vecto_index = VectorStoreIndex(nodes=nodes)

#### Creating Query Engines

In [13]:
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)

vector_query_engine = vecto_index.as_query_engine()
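
The two engines differ mainly in which nodes reach the LLM: the summary engine passes along every node, while the vector engine keeps only the nodes whose embeddings are most similar to the query. The sketch below mimics that difference with hand-written toy vectors and cosine similarity. The function names, the 3-dimensional vectors, and the value of `k` are all assumptions made for illustration; real embeddings come from the embedding model and have many more dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_all(nodes: dict[str, list[float]]) -> list[str]:
    """Summary-style retrieval: every node is returned, regardless of the query."""
    return list(nodes)


def retrieve_top_k(nodes: dict[str, list[float]], query_vec: list[float], k: int = 2) -> list[str]:
    """Vector-style retrieval: rank nodes by similarity to the query and keep k."""
    ranked = sorted(nodes, key=lambda name: cosine(nodes[name], query_vec), reverse=True)
    return ranked[:k]


# Toy "embeddings" (hypothetical values), keyed by a label for each chunk.
toy_nodes = {
    "intro": [1.0, 0.0, 0.0],
    "method": [0.0, 1.0, 0.0],
    "datasets": [0.0, 0.1, 1.0],
    "results": [0.1, 0.0, 0.9],
}
query_vec = [0.0, 0.0, 1.0]

all_hits = retrieve_all(toy_nodes)
top_hits = retrieve_top_k(toy_nodes, query_vec, k=2)
print(all_hits)
print(top_hits)
```

This is why a summary-style engine suits whole-document questions, and a vector-style engine suits questions whose answer lives in a few passages.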

#### Query Tool

In [20]:
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarizing of the lora paper."
    )
)


vecto_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context related to the lora paper."
    )
)

#### Router Query Engine

In [21]:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

query_egine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vecto_tool],
    verbose=True,
)
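
Conceptually, a single-selector router reads each tool's description, picks exactly one tool, and forwards the query to it. The sketch below shows that control flow only. To stay runnable offline it replaces the LLM's choice with a keyword-overlap score, which is an assumption of this example and not how `LLMSingleSelector` decides. `Tool`, `select_tool`, and `route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    description: str
    run: Callable[[str], str]


def select_tool(query: str, tools: list[Tool]) -> int:
    """Return the index of the tool whose description shares the most words
    with the query. Stand-in for the LLM's selection step."""
    query_words = set(query.lower().split())
    scores = [len(query_words & set(t.description.lower().split())) for t in tools]
    return scores.index(max(scores))  # ties resolve to the first tool


def route(query: str, tools: list[Tool]) -> tuple[int, str]:
    """Select one tool, run the query through it, and report which was used."""
    idx = select_tool(query, tools)
    return idx, tools[idx].run(query)


tools = [
    Tool("summarize the whole paper", lambda q: "summary answer"),
    Tool("retrieve specific details and facts", lambda q: "retrieval answer"),
]

choice, answer = route("please summarize the paper", tools)
print(choice, answer)
```

Because the selection depends entirely on the descriptions, writing them to be distinct and specific is the main lever for getting queries routed correctly.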

In [26]:
response = query_egine.query("What is the lora paper about?")
print(str(response))

Selecting query engine 0: Useful for summarizing of the lora paper..
The LoRA paper introduces a novel adaptation strategy called Low-Rank Adaptation (LoRA) for large language models. LoRA aims to address the challenges of fine-tuning large models by freezing pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters for downstream tasks, leading to improved efficiency in terms of memory requirements and training throughput. LoRA has been shown to perform on-par or better than traditional fine-tuning methods on various language models like RoBERTa, DeBERTa, GPT-2, and GPT-3, while also avoiding additional inference latency. The paper provides empirical investigations into rank-deficiency in language model adaptation and offers a package for integrating LoRA with PyTorch models, along with implementations and model checkpoints for 

In [27]:
len(response.source_nodes)

38
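
The count of 38 equals `len(nodes)`: the summary engine used every chunk of the paper as a source. One common way to condense that many chunks is tree-style summarization, where groups of chunks are summarized, then the summaries are summarized, until a single answer remains. The sketch below shows that reduction with a stand-in summarizer that makes no LLM calls. `tree_summarize`, `fan_in`, and `fake_summarize` are hypothetical names, and grouping by a fixed count is a simplifying assumption (a real implementation would more likely group by how much text fits in the context window).

```python
from typing import Callable


def tree_summarize(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    fan_in: int = 4,
) -> str:
    """Reduce chunks bottom-up: summarize groups of fan_in items, then repeat
    on the resulting summaries until one string is left."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    level = list(chunks)
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


calls: list[int] = []


def fake_summarize(group: list[str]) -> str:
    # Stand-in for an LLM call: record the group size and return a placeholder.
    calls.append(len(group))
    return f"summary of {len(group)} items"


chunks = [f"chunk {i}" for i in range(38)]
final = tree_summarize(chunks, fake_summarize, fan_in=4)
print(final, len(calls))
```

With 38 chunks and groups of four, the levels shrink from 38 to 10 to 3 to 1, so the number of summarization calls grows roughly linearly with the number of chunks.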

In [24]:
response = query_egine.query("What eval datasets were used in the lora paper?")
print(str(response))

Selecting query engine 1: Retrieving specific context related to the lora paper would provide information on the evaluation datasets used..
The evaluation datasets used in the LoRA paper include MNLI, STS-B, WikiSQL, SAMSum, E2E NLG Challenge, DART, and WebNLG.


In [25]:
print(response.source_nodes[0].get_content())

MNLI-
ndescribes a subset with ntraining examples. We evaluate with the full validation set. LoRA
performs exhibits favorable sample-efﬁciency compared to other methods, including ﬁne-tuning.
To be concrete, let the singular values of Ui⊤
AUj
Bto beσ1,σ2,···,σpwherep= min{i,j}. We
know that the Projection Metric Ham & Lee (2008) is deﬁned as:
d(Ui
A,Uj
B) =√p−p∑
i=1σ2
i∈[0,√p]
23

