# Improving RAG Performance

This notebook walks through practical techniques to improve the performance of Retrieval-Augmented Generation (RAG) systems:

- Updating Chunk Size
- Re-Ranking Retrieved Chunks
- Query Transformations
- Fine-Tuning with LoRA and QLoRA

## Updating Chunk Size

Chunk size strongly affects retrieval accuracy and contextual relevance. Chunks that are too small lose the surrounding context; chunks that are too large dilute the relevant passage with unrelated text, which can lead to off-topic or hallucinated answers. Research such as [“Rethinking Chunk Size for Long-Document Retrieval”](https://arxiv.org/abs/2505.21700) suggests that shorter chunks (~64–128 tokens) work better for factual, pinpoint queries, while larger chunks (512–1024 tokens) help with questions that need broader context.
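
To make the trade-off concrete, here is a minimal sketch of a fixed-size chunker with optional overlap. It is illustrative only: the function name `chunk_tokens` is hypothetical, and it counts whitespace-separated words as "tokens", whereas a real pipeline would normally count tokens with the embedding model's own tokenizer.

```python
def chunk_tokens(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Toy document of ten "tokens": w0 ... w9
text = " ".join(f"w{i}" for i in range(10))
small = chunk_tokens(text, chunk_size=4, overlap=1)  # more, shorter chunks
large = chunk_tokens(text, chunk_size=8)             # fewer, longer chunks
print(len(small), len(large))
```

Indexing the same corpus at several chunk sizes and comparing retrieval quality on a fixed set of questions is one way to choose a value empirically.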

### Toolkit & Examples
LlamaIndex's [Response Evaluation](https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5) module can be used to identify an optimal chunk size for your dataset empirically.

[NVIDIA’s blog](https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/) also reports experiments linking chunking strategy to retrieval accuracy.

## Re-Ranking Retrieved Chunks

Initial retrieval (e.g., via vector search) may miss the most contextually relevant passages. A cross-encoder or LLM-based re-ranker can reorder those top-k candidates to surface higher-quality context before generation.

### Toolkit & Examples
You can use a cross-encoder such as `cross-encoder/ms-marco-MiniLM-L-6-v2` (loaded through the sentence-transformers library) to compute a relevance score for each query-document pair, then sort the documents by score.

In [None]:
# Example with SentenceTransformers CrossEncoder
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is the sick leave policy?"
retrieved_docs = [
    "Employees are entitled to 5 days of paid sick leave...",
    "The company hosts an annual wellness seminar...",
    "You need a doctor’s note for medical leave."
]

scores = model.predict([[query, doc] for doc in retrieved_docs])
reranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)

for doc, score in reranked:
    print(f"Score: {score:.2f} | Doc: {doc}")
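
In a pipeline, re-ranking is usually followed by keeping only the best few candidates for the prompt. The sketch below wraps that reorder-then-truncate step in a helper that takes any scoring function. The names `rerank` and `overlap_score` are hypothetical, and the word-overlap scorer is only a toy stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int,
) -> list[tuple[str, float]]:
    """Score every (query, doc) pair, sort by score descending, keep top_n."""
    scores = score_fn([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def overlap_score(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer (assumption): fraction of query words found in the doc."""
    scores = []
    for query, doc in pairs:
        q_terms = set(query.lower().split())
        d_terms = set(doc.lower().split())
        scores.append(len(q_terms & d_terms) / max(len(q_terms), 1))
    return scores


docs = [
    "The company hosts an annual wellness seminar",
    "Employees get paid sick leave",
    "sick leave policy requires a doctor note",
]
top = rerank("sick leave policy", docs, overlap_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f} | {doc}")
```

Because `CrossEncoder.predict` also takes a list of query-document pairs, passing `model.predict` as `score_fn` should give the cross-encoder version of the same step.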

## Query Transformations

User questions can be vague or colloquial. Rewriting them into clearer, semantically enriched queries can improve retrieval precision.



### Toolkit & Examples
A simple LLM-based rephraser (e.g., a FLAN-T5 model) can rewrite queries before they are sent to the retriever.

In [None]:
# Using an LLM to rephrase a question
from transformers import pipeline

rephraser = pipeline("text2text-generation", model="google/flan-t5-base")
query = "What do I do if I'm sick?"
rephrased = rephraser(f"Rephrase the question: {query}", max_length=64, do_sample=False)

print("Original:", query)
print("Rephrased:", rephrased[0]['generated_text'])
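
One practical detail is what to do when the rewriter returns nothing useful. The sketch below shows one possible way to wire a rewriter in front of a retriever, falling back to the original query when the rewrite is empty. Everything here is hypothetical: `retrieve_with_rewrite`, the lookup-table rewriter, and the keyword retriever are toy stand-ins for an LLM rephraser and a vector store.

```python
from typing import Callable


def retrieve_with_rewrite(
    query: str,
    rewrite_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Rewrite the query, then retrieve; fall back to the original if empty."""
    rewritten = rewrite_fn(query).strip()
    effective = rewritten if rewritten else query
    return effective, retrieve_fn(effective)


# Toy stand-ins (assumptions) so the example runs without a model or index.
REWRITES = {
    "What do I do if I'm sick?": "What is the procedure for reporting sick leave?",
}
CORPUS = [
    "Report sick leave to your manager before 9am.",
    "The cafeteria opens at noon.",
]


def toy_rewrite(query: str) -> str:
    return REWRITES.get(query, "")


def toy_retrieve(query: str) -> list[str]:
    # Keep documents that share at least two words with the query.
    terms = set(query.lower().strip("?").split())
    return [
        doc for doc in CORPUS
        if len(terms & set(doc.lower().strip(".").split())) >= 2
    ]


used, hits = retrieve_with_rewrite("What do I do if I'm sick?", toy_rewrite, toy_retrieve)
print("Query used:", used)
print("Hits:", hits)
```

In this contrived corpus the colloquial question shares only one word with the relevant document, so the rewritten query is what lets the keyword retriever find it.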

## Fine-Tuning LLMs with LoRA / QLoRA

Fully fine-tuning an LLM is expensive and resource-heavy. LoRA (Low-Rank Adaptation) freezes the pretrained weights and adds small trainable low-rank matrices alongside them, drastically reducing the number of trainable parameters while maintaining performance.
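
A quick calculation shows where the savings come from. For a frozen weight matrix of shape `d_out × d_in`, LoRA trains two matrices of shapes `d_out × r` and `r × d_in`, so the trainable count is `r * (d_in + d_out)` instead of `d_out * d_in`. The dimensions below are illustrative round numbers, not taken from any specific model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in        # every entry of W is trainable
    lora = r * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora


# Illustrative 4096 x 4096 projection with rank 8 (assumed sizes)
full, lora = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r=8):       {lora:,} parameters")
print(f"reduction factor: {full // lora}x")
```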

QLoRA builds on this by quantizing the frozen base model to 4 bits, which reduces the memory footprint further and makes it possible to fine-tune large models on limited hardware. QLoRA retains near-full performance while using far less memory.
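
As a rough back-of-the-envelope estimate, the memory needed for the model weights alone scales with bits per parameter. The sketch below assumes a round 7-billion-parameter model and deliberately ignores quantization constants, activations, the LoRA adapters, and optimizer state, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumed round figure for a "7B" model
fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```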

The example below applies LoRA using the PEFT library.

In [None]:
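import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b-instruct"

# Load in bf16 if the GPU supports it, else fp16 on GPU, else fall back to fp32
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
elif torch.cuda.is_available():
    dtype = torch.float16
else:
    dtype = torch.float32

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(model_name)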

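lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling factor (update is scaled by lora_alpha / r)
    target_modules=["query_key_value"],  # Falcon's fused Q/K/V attention projection
    lora_dropout=0.05,                   # dropout applied on the LoRA path
    bias="none",                         # leave bias terms frozen
    task_type=TaskType.CAUSAL_LM
)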

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()