<a href="https://colab.research.google.com/github/Adityabhaskar685/langchain/blob/main/advanced_rag_parent_child.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parent-Child Document Retriver with Metadata Extraction

In [1]:
!pip install -qqq llama-index llama-hub langchain openai accelerate==0.21.0 bitsandbytes==0.40.2 transformers sentence_transformers InstructorEmbedding chromadb --q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.8/15.8 MB[0m [31m50.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.1/42.1 MB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.3/803.3 kB[0m [31m47.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m225.4/225.4 kB[0m [31m30.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m33.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m9.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m509.0/509.0 kB[0m [31m5

In [5]:
!pip install bitsandbytes --q

## Setup

1. In this section we will work with the QLoRA paper and create an initial set of nodes (chunk size 1024).
2. We will use Open Source LLM [`zephyr-7b-alpha`](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and embedding [`hkunlp/instructor-large`](https://huggingface.co/hkunlp/instructor-large)

In [8]:
import json
import torch
from pathlib import Path

# transformers
import bitsandbytes
from transformers import BitsAndBytesConfig

# llama index
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import download_loader, Document, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceSplitter
from llama_index.schema import IndexNode
from langchain.embeddings import HuggingFaceInstructEmbeddings
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

# Metadata Extraction
from llama_index.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

# db
import chromadb

DEVICE = 'cuda:0' if torch.cuda.is_available() else 'cpu'


## Load Data

In [9]:
PDFReader = download_loader('PDFReader')
loader = PDFReader()
docs = loader.load_data(file = Path('./QLoRa.pdf'))

In [13]:
print(docs[0].get_content())

QL ORA: Efficient Finetuning of Quantized LLMs
Tim Dettmers∗Artidoro Pagnoni∗Ari Holtzman
Luke Zettlemoyer
University of Washington
{dettmers,artidoro,ahai,lsz}@cs.washington.edu
Abstract
We present QLORA, an efficient finetuning approach that reduces memory us-
age enough to finetune a 65B parameter model on a single 48GB GPU while
preserving full 16-bit finetuning task performance. QLORAbackpropagates gradi-
ents through a frozen, 4-bit quantized pretrained language model into Low Rank
Adapters (LoRA). Our best model family, which we name Guanaco , outperforms
all previous openly released models on the Vicuna benchmark, reaching 99.3%
of the performance level of ChatGPT while only requiring 24 hours of finetuning
on a single GPU. QLORAintroduces a number of innovations to save memory
without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that
is information theoretically optimal for normally distributed weights (b) Double
Quantization to reduce the average memo

In [14]:
# combine all text
doc_text = "\n\n".join([d.get_content() for d in docs])
documents = [Document(text = doc_text)]

In [16]:
len(doc_text)

87804

## Chunking

In [19]:
node_parser = SentenceSplitter(chunk_size = 1024)


In [24]:
base_nodes = node_parser.get_nodes_from_documents(documents)

# set node ids to be a constant
for idx,node in enumerate(base_nodes):
  node.id_ = f"node-{idx}"

In [25]:
base_nodes[0].id_

'node-0'

In [26]:
# all node ids
for node in base_nodes:
  print(node.id_)

node-0
node-1
node-2
node-3
node-4
node-5
node-6
node-7
node-8
node-9
node-10
node-11
node-12
node-13
node-14
node-15
node-16
node-17
node-18
node-19
node-20
node-21
node-22
node-23
node-24
node-25
node-26
node-27
node-28
node-29
node-30


In [27]:
len(base_nodes)

31

# LLM (`zephyr-7b-alpha`)

In [28]:
from google.colab import userdata

# hugging face tokens
hf_token = userdata.get('HUGGINGFACE_TOKEN')

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.float16,
    bnb_4bit_quant_type = 'nf4',
    bnb_4bit_use_double_quant=True
)

def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith('<|system|>\n'):
    prompt = '<|system|>\n<s>\n' + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"
  return prompt

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt = PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window = 3900,
    max_new_tokens = 256,
    model_kwargs = {'quantization_config':quantization_config},
    generate_kwargs = {'temperature': 0.7, 'top_k': 50, 'top_p': 0.95},
    messages_to_prompt = messages_to_prompt,
    device_map = 'auto'
)





config.json:   0%|          | 0.00/628 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

In [29]:
# embedding
embed_model = HuggingFaceInstructEmbeddings(
    model_name = 'hkunlp/instructor-large', model_kwargs={'device':DEVICE}
)

# set your ServiceContext for all the next steps
service_context = ServiceContext.from_defaults(
    llm = llm,
    embed_model = embed_model
)

.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

2_Dense/config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/3.15M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/66.3k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.41k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

load INSTRUCTOR_Transformer
max_seq_length  512


## Baseline Retriver

In [50]:
base_index = VectorStoreIndex(base_nodes, service_context = service_context)
base_retriever = base_index.as_retriever(similarity_top_k = 2)

In [51]:
retrievals = base_retriever.retrieve(
    'Can you tell me about the Paged Optimizers?'
)


In [52]:
for n in retrievals:
  display_source_node(n, source_length = 1500)

**Node ID:** node-5<br>**Similarity:** 0.8639193985783434<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from st...<br>

**Node ID:** node-6<br>**Similarity:** 0.8246144101148493<br>**Text:** We provide more details in the results section for each particular setup to make the results more
readable. Full details in Appendix A.
QLoRA-AllQLoRA-FFN
QLoRA-AttentionAlpaca (ours)
Stanford-Alpaca
Model6061626364RougeL
bits
4
16
Figure 2: RougeL for LLaMA 7B models on the
Alpaca dataset. Each point represents a run with a
different random seed. We improve on the Stanford
Alpaca fully finetuned default hyperparameters to
construct a strong 16-bit baseline for comparisons.
Using LoRA on all transformer layers is critical to
match 16-bit performance.While paged optimizers are critical to do 33B/65B
QLORAtuning on a single 24/48GB GPU, we do
not provide hard measurements for Paged Optimiz-
ers since the paging only occurs when processing
mini-batches with long sequence lengths, which is
rare. We do, however, perform an analysis of the
runtime of paged optimizers for 65B models on
48GB GPUs and find that with a batch size of 16,
paged optimizers provide the same training speed
as regular optimizers. Future work should measure
and characterize under what circumstances slow-
downs occur from the paging process.
Default LoRA hyperparameters do not match 16-
bit performance When using the standard prac-
tice of applying LoRA to query and value attention
projection matrices [ 28], we are not able to replicate
full finetuning performance for large base models.
As shown in Figure 2 for LLaMA 7B finetuning on
Alpaca, we find that the most critical LoRA hyper-
parameter is how many L...<br>

In [35]:
query_engine_base = RetrieverQueryEngine.from_args(
    base_retriever, service_context = service_context
)

In [37]:
response = query_engine_base.query(
    'Can you tell me about the Paged Optimizers?'
)

print(str(response))



Paged Optimizers are a feature used in QLoRA to allocate paged memory for optimizer states. This allows for automatic page-to-page transfers between CPU and GPU memory for error-free GPU processing when the GPU runs out of memory. This feature works like regular memory paging between CPU RAM and disk. By using this feature, QLoRA can significantly reduce the required memory for finetuning models. However, the paging process only occurs when processing mini-batches with long sequence lengths, which is rare. The default LoRA hyperparameters do not match 16-bit performance, and the most critical LoRA hyperparameter is the number of LoRA adapters used in total. LoRA on all linear transformer block layers is required to match full finetuning performance. NormalFloat data type significantly improves bit-for-bit accuracy gains compared to regular 4-bit Floats, and double quantization allows for a more fine-grained control over the memory footprint to fit models of certain size into certain GP

In [39]:
llm.complete('can you tell me about the paged optimizers?').text




'Sure! Paged optimizers are a type of database optimization technique that is commonly used in relational databases. They are designed to improve the performance of database queries by optimizing the way data is read and written from disk.\n\nIn a traditional database, data is stored in a table format, and when a query is executed, the database engine reads the entire table from disk and returns the results. This can be very inefficient, especially for large tables, as it requires a lot of I/O operations and can result in slow query performance.\n\nPaged optimizers work by breaking up the table into smaller, more manageable chunks called pages. Each page contains a fixed amount of data, and when a query is executed, the database engine reads only the pages that are relevant to the query. This can significantly reduce the amount of I/O operations required, resulting in faster query performance.\n\nPaged optimizers also support indexing, which can further improve query performance by all

## Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunk

Now, we will build smaller chunks that will point to their bigger parent chunks.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [53]:
sub_nodes_sizes = [256, 512]
sub_node_parsers = [SentenceSplitter(chunk_size = c) for c in sub_nodes_sizes]

all_nodes = []
for base_node in base_nodes:
  for n in sub_node_parsers:
    sub_nodes = n.get_nodes_from_documents([base_node])
    sub_inodes = [
        IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
    ]
    all_nodes.extend(sub_inodes)

    # also add the original node to node
  original_node = IndexNode.from_text_node(base_node, base_node.node_id)
  all_nodes.append(original_node)


In [55]:
all_nodes_dict = {n.node_id: n for n in all_nodes}


In [59]:
all_nodes_dict.keys()

dict_keys(['12af07e6-03b3-4c0e-b1f4-dd761d8c0693', 'c8da260a-6be1-49aa-86ff-f541ce4df2fc', '0e06a550-21d8-44a6-bbf1-a2db0ba1014c', 'c62d6425-58ac-46ee-bec2-f2c8fdc36107', 'd276803e-b041-47ae-9166-41603f353708', '406bb222-19fd-4f90-9753-4cba3c15e3aa', 'c87f8727-7fb1-4dae-a69b-a260c0a7b5ec', '08143659-0d89-40ee-b52a-262f477713d1', '3612b493-4f80-454f-9602-fd89dc369490', '3b8366c3-590a-448c-95a9-004b6b4b5a4e', '5847708b-d570-4481-9db8-7b5c97c8aae8', 'c53de4cf-eeca-451a-95c6-2617faf59fd5', '6e4d7aa1-dcea-4bea-97ce-4a56eccac2fb', 'f19ab724-2d26-4dcb-8022-2ed1cc9791ea', 'node-0', 'cdd302c2-fc22-4a08-ab2b-42300d0e790d', '48bfb5d4-703d-4bf6-be17-072d2bd29f8a', '736be5a8-5f9b-4f3a-b567-7a4e19504587', 'ab293bc2-eb59-4ca5-963b-f1ef16eefc46', '49ec8a7e-1414-435a-a019-c5adf7d05da4', '9b5d0cfc-60f2-40eb-a3d2-1d3b9bc021be', '39e7f147-df60-466c-9a4a-1ab554b685aa', 'f7acbca5-8d58-4663-976f-9911a42b72aa', '8590e181-1300-4151-90b0-0f7472921851', 'b5e05c3f-797a-4be1-9163-72db9d1ba173', 'bc8c1d0d-ac79-47c4

In [61]:
all_nodes_dict['12af07e6-03b3-4c0e-b1f4-dd761d8c0693']

IndexNode(id_='12af07e6-03b3-4c0e-b1f4-dd761d8c0693', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='node-0', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='4a6ee2cb7ca90f21fb361feed9b31c26f0265d8710637c0dacf171b9be1d7735'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='c8da260a-6be1-49aa-86ff-f541ce4df2fc', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='1f45fb33a11dc4bcab75cf5163ebc5a2adc6174d555311953df0936350322e65')}, hash='602c9890b96e35a3ffa0c9f35c0d69b11275739f5fe0471fedeec5c531aca8e8', text='QL ORA: Efficient Finetuning of Quantized LLMs\nTim Dettmers∗Artidoro Pagnoni∗Ari Holtzman\nLuke Zettlemoyer\nUniversity of Washington\n{dettmers,artidoro,ahai,lsz}@cs.washington.edu\nAbstract\nWe present QLORA, an efficient finetuning approach that reduces memory us-\nage enough to finetune a 65B parameter model on a single 48GB GPU while\npreserving fu

In [64]:
for n in all_nodes[:10]:
  print(n.index_id)

node-0
node-0
node-0
node-0
node-0
node-0
node-0
node-0
node-0
node-0


## indexing the smaller chunks

In [66]:
vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context = service_context
)

vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k
 = 2)

When we perform retrieval, we want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same node.

In [70]:
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

In [71]:
nodes = retriever_chunk.retrieve(
    "Can you tell me about the Paged Optimizers?"
)
for node in nodes:
    display_source_node(node, source_length=2000)

[1;3;34mRetrieving with query id None: Can you tell me about the Paged Optimizers?
[0m[1;3;38;5;200mRetrieved node with id, entering: node-4
[0m[1;3;34mRetrieving with query id node-4: Can you tell me about the Paged Optimizers?
[0m[1;3;38;5;200mRetrieved node with id, entering: node-5
[0m[1;3;34mRetrieving with query id node-5: Can you tell me about the Paged Optimizers?
[0m

**Node ID:** node-4<br>**Similarity:** 0.8935850984244506<br>**Text:** For our data type, we
set the arbitrary range [−1,1]. As such, both the quantiles for the data type and the neural network
weights need to be normalized into this range.
The information theoretically optimal data type for zero-mean normal distributions with arbitrary
standard deviations σin the range [−1,1]is computed as follows: (1) estimate the 2k+ 1quantiles
of a theoretical N(0,1)distribution to obtain a k-bit quantile quantization data type for normal distri-
butions, (2) take this data type and normalize its values into the [−1,1]range, (3) quantize an input
weight tensor by normalizing it into the [−1,1]range through absolute maximum rescaling.
Once the weight range and data type range match, we can quantize as usual. Step (3) is equivalent to
rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data
type. More formally, we estimate the 2kvalues qiof the data type as follows:
qi=1
2
QXi
2k+ 1
+QXi+ 1
2k+ 1
, (4)
where QX(·)is the quantile function of the standard normal distribution N(0,1). A problem for
a symmetric k-bit quantization is that this approach does not have an exact representation of zero,
which is an important property to quantize padding and other zero-valued elements with no error. To
4

ensure a discrete zeropoint of 0and to use all 2kbits for a k-bit datatype, we create an asymmetric
data type by estimating the quantiles qiof two ranges qi:2k−1for the negative part and 2k−1+ 1for
the positive part and then we unify these sets of qiand remove one of the two zeros that occurs in both
sets. We term the resulting data type that has equal expected number of values in each quantization bin
k-bit NormalFloat (NFk), since the data type is information-theoretically optimal for zero-centered
normally distributed data. The exact values of this data type can be found in Appendix E.
Double Quantization We introduce Double Quantization (DQ), the process of quantizing the
quantization constants for add...<br>

**Node ID:** node-5<br>**Similarity:** 0.8929982170962448<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from storage WNF4to computation data type
WBF16to calculate the derivative∂X
∂Win BFloat16 precision.
To summarize, QLORAhas one storage data type (usually 4-bit NormalFloat) and a computation
data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type
to perform the forward and backward pass, but we only compute weight gradients for the LoRA
parameters which use 16-bit BrainFloat.
4 QLoRA vs. Standard Finetuning
We have discussed how QLoRA works and how it can signi...<br>

In [72]:
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context = service_context
)

In [74]:
response = query_engine_chunk.query(
    'what is qlora'
)
print(str(response))

[1;3;34mRetrieving with query id None: what is qlora
[0m[1;3;38;5;200mRetrieved node with id, entering: node-5
[0m[1;3;34mRetrieving with query id node-5: what is qlora
[0m[1;3;38;5;200mRetrieved node with id, entering: node-1
[0m[1;3;34mRetrieving with query id node-1: what is qlora
[0m



QLORA is a method introduced in the text that aims to significantly reduce the required memory for finetuning models. It involves using 4-bit NormalFloat as a quantization data type, which is an information theoretically optimal quantization data type for normally distributed data. Double Quantization is also used, which quantizes the quantization constants, saving an average of about 0.37 bits per parameter. Paged Optimizers are used to avoid gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length. QLORA has been shown to reduce the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline.


In [75]:
response = query_engine_chunk.query(
    'can you tell me about the Paged Optimizers ?'
)
print(str(response))

[1;3;34mRetrieving with query id None: can you tell me about the Paged Optimizers ?
[0m[1;3;38;5;200mRetrieved node with id, entering: node-4
[0m[1;3;34mRetrieving with query id node-4: can you tell me about the Paged Optimizers ?
[0m[1;3;38;5;200mRetrieved node with id, entering: node-5
[0m[1;3;34mRetrieving with query id node-5: can you tell me about the Paged Optimizers ?
[0mYes, according to the given context information, Paged Optimizers are a feature used in the QLORA model to allocate paged memory for optimizer states. This feature allows for error-free GPU processing in scenarios where the GPU occasionally runs out of memory. The feature works like regular memory paging between CPU RAM and the disk, and the optimizer states are automatically evicted to CPU RAM when the GPU runs out of memory and paged back into GPU memory when needed in the optimizer update step. This helps to conserve memory and improve the efficiency of the model.


## Metadata References: Summaries + Generated Questions referring to a bigger chunk

Now, we will add some additional context that references the source node.

This additional context includes summaries as well as generated questions.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [76]:
import nest_asyncio
nest_asyncio.apply()

In [78]:
extractors = [
    SummaryExtractor(summaries = ['self'], llm = llm, show_progress = True),
    QuestionsAnsweredExtractor(question = 1, llm = llm ,show_progress = True)
]

In [80]:
# metadata extractor across base nodes, get back dictionary
metadata_dicts = []
for extractor in extractors:
  metadata_dicts.extend(extractor.extract(base_nodes))


100%|██████████| 31/31 [08:36<00:00, 16.65s/it]  
100%|██████████| 31/31 [09:54<00:00, 19.18s/it]  


In [112]:
#all nodes consists of source nodes, along with metadata

import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts[:31]):
  if idx == 0:
    inode_q = IndexNode(
        text = metadata_dicts[idx + 31]['questions_this_excerpt_can_answer'],
        index_id = base_nodes[idx].node_id,
    )
  else:
    node_q = IndexNode(
        text = metadata_dicts[idx + 30]['questions_this_excerpt_can_answer'],
        index_id = base_nodes[idx].node_id,
    )
  inode_s = IndexNode(
      text = d['section_summary'],
      index_id = base_nodes[idx].node_id
  )
  all_nodes.extend([inode_q, inode_s])

In [114]:
all_nodes_dict = {n.node_id: n for n in all_nodes}

In [115]:
vector_index_metadata = VectorStoreIndex(all_nodes, service_context = service_context)
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k = 2)

In [118]:
retriever_metadata = RecursiveRetriever(
    'vector',
    retriever_dict = {"vector" : vector_retriever_metadata},
    node_dict = all_nodes_dict,
    verbose =True
)

In [119]:
nodes = retriever_metadata.retrieve(
    "Can you tell me about the Paged Optimizers?"
)
for node in nodes:
    display_source_node(node, source_length=2000)

[1;3;34mRetrieving with query id None: Can you tell me about the Paged Optimizers?
[0m[1;3;38;5;200mRetrieving text node: On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows

**Node ID:** node-5<br>**Similarity:** 0.8639193985783434<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from storage WNF4to computation data type
WBF16to calculate the derivative∂X
∂Win BFloat16 precision.
To summarize, QLORAhas one storage data type (usually 4-bit NormalFloat) and a computation
data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type
to perform the forward and backward pass, but we only compute weight gradients for the LoRA
parameters which use 16-bit BrainFloat.
4 QLoRA vs. Standard Finetuning
We have discussed how QLoRA works and how it can signi...<br>

**Node ID:** node-5<br>**Similarity:** 0.842482059158431<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from storage WNF4to computation data type
WBF16to calculate the derivative∂X
∂Win BFloat16 precision.
To summarize, QLORAhas one storage data type (usually 4-bit NormalFloat) and a computation
data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type
to perform the forward and backward pass, but we only compute weight gradients for the LoRA
parameters which use 16-bit BrainFloat.
4 QLoRA vs. Standard Finetuning
We have discussed how QLoRA works and how it can signi...<br>

In [120]:
query_engine_metadata = RetrieverQueryEngine.from_args(
    retriever_metadata, service_context=service_context
)

In [121]:
response = query_engine_metadata.query(
    "Can you tell me about the Paged Optimizers?"
)
print(str(response))

[1;3;34mRetrieving with query id None: Can you tell me about the Paged Optimizers?
[0m[1;3;38;5;200mRetrieving text node: On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows



Yes, according to the given context information, Paged Optimizers are a feature used in the QLoRA approach to optimize the memory footprint of finetuning models. This feature automatically pages memory between the CPU and GPU for error-free GPU processing in scenarios where the GPU occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM and the disk. In QLoRA, paged memory is allocated for the optimizer states, which are automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step.


In [122]:
# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("QLoRa_knowledge_database")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
vector_index_metadata_db = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    service_context=service_context
)

DuplicateIDError: Expected IDs to be unique, found duplicates of: 729c0c6a-7846-440a-97fe-efb1a7aa79ee

In [132]:
# compare output
baseline_retriever = 'Paged Optimizers are a feature used in QLoRA to allocate paged memory for optimizer states. This allows for automatic page-to-page transfers between CPU and GPU memory for error-free GPU processing when the GPU runs out of memory. This feature works like regular memory paging between CPU RAM and disk. By using this feature, QLoRA can significantly reduce the required memory for finetuning models. However, the paging process only occurs when processing mini-batches with long sequence lengths, which is rare. The default LoRA hyperparameters do not match 16-bit performance, and the most critical LoRA hyperparameter is the number of LoRA adapters used in total. LoRA on all linear transformer block layers is required to match full finetuning performance. NormalFloat data type significantly improves bit-for-bit accuracy gains compared to regular 4-bit Floats, and double quantization allows for a more fine-grained control over the memory footprint to fit models of certain size into certain GPUs.'
chunk_retriever_retriever = 'Yes, according to the given context information, Paged Optimizers are a feature used in the QLORA model to allocate paged memory for optimizer states. This feature allows for error-free GPU processing in scenarios where the GPU occasionally runs out of memory. The feature works like regular memory paging between CPU RAM and the disk, and the optimizer states are automatically evicted to CPU RAM when the GPU runs out of memory and paged back into GPU memory when needed in the optimizer update step. This helps to conserve memory and improve the efficiency of the model.'
metadata_retriever = 'Yes, according to the given context information, Paged Optimizers are a feature used in the QLoRA approach to optimize the memory footprint of finetuning models. This feature automatically pages memory between the CPU and GPU for error-free GPU processing in scenarios where the GPU occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM and the disk. In QLoRA, paged memory is allocated for the optimizer states, which are automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step.'

In [133]:
import pandas as pd
df = pd.DataFrame()

df.loc[0, 'baseline_retriever'] = baseline_retriever
df.loc[0, 'chunk_reference_retriever'] = chunk_retriever_retriever
df.loc[0, 'metadata_retriever'] = metadata_retriever
df

Unnamed: 0,baseline_retriever,chunk_reference_retriever,metadata_retriever
0,Paged Optimizers are a feature used in QLoRA t...,"Yes, according to the given context informatio...","Yes, according to the given context informatio..."
