# Retrieval-Augmented Generation (RAG) using LlamaIndex
## Introduction

[**LlamaIndex**](https://github.com/run-llama/llama_index) is a widely used RAG framework. **LLMLingua** and **LongLLMLingua** have been integrated into the [LlamaIndex pipeline](https://github.com/run-llama/llama_index), which makes LLMLingua-related techniques more convenient to use in RAG scenarios.

More specifically, [**LongLLMLinguaPostprocessor**](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/postprocessor/longllmlingua.py#L16) can be used as a **Postprocessor** in **LlamaIndex**. Its arguments are consistent with those of the [**PromptCompressor**](https://github.com/microsoft/LLMLingua/blob/main/llmlingua/prompt_compressor.py) in [**LLMLingua**](https://github.com/microsoft/LLMLingua),
so you can call both the compression algorithms in LLMLingua and the question-aware prompt compression method in LongLLMLingua.

For example:
```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder
        "dynamic_context_compression_ratio": 0.4, # enable dynamic compression ratio
    },
)
```

Retrieval-Augmented Generation (RAG) is a powerful and popular technique for supplying specialized knowledge to large language models (LLMs). However, traditional RAG methods tend to produce increasingly long prompts, sometimes exceeding **40k** tokens, which can result in high financial and latency costs. Moreover, the lower information density of such prompts can degrade LLM performance, as in the "lost in the middle" issue.

<center><img width="800" src="../images/LongLLMLingua_Motivation.png"></center>

To address this, we propose [**LongLLMLingua**](https://aclanthology.org/2024.acl-long.91/), which tackles the low information density problem in long-context scenarios via prompt compression, making it particularly suitable for RAG tasks. The main idea is a two-stage compression process, shown by the <font color='red'>**red line**</font>, which significantly improves on the original curve:

- Coarse-grained compression through document-level perplexity.
- Fine-grained compression of the remaining text using token-level perplexity.
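
The two stages listed above can be sketched with a toy example. This is only an illustration under stated assumptions: the function names `coarse_filter` and `fine_filter` are hypothetical, the importance scores are hand-written numbers standing in for perplexity-based scores, and whitespace splitting stands in for a real tokenizer. It is not LLMLingua's implementation.

```python
from typing import List


def coarse_filter(docs: List[str], doc_scores: List[float], budget: int) -> List[int]:
    """Stage 1 (coarse): keep the highest-scoring documents that fit the token budget.

    Returns the indices of the kept documents in their original order.
    """
    ranked = sorted(range(len(docs)), key=lambda i: doc_scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        n_tokens = len(docs[i].split())  # assumption: whitespace tokens
        if used + n_tokens > budget:
            continue  # this document does not fit, try the next one
        kept.append(i)
        used += n_tokens
    return sorted(kept)


def fine_filter(tokens: List[str], token_scores: List[float], keep_ratio: float) -> List[str]:
    """Stage 2 (fine): keep the most informative tokens, preserving their order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: token_scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]


# Hypothetical documents and scores (higher means more relevant to the question).
docs = ["the cat sat", "paul went to RISD", "stocks fell today"]
doc_scores = [0.1, 0.9, 0.3]
kept_docs = coarse_filter(docs, doc_scores, budget=7)

# Hypothetical per-token scores for the most relevant document.
tokens = docs[1].split()
kept_tokens = fine_filter(tokens, [0.8, 0.5, 0.1, 0.9], keep_ratio=0.5)

print(kept_docs)    # [1, 2]
print(kept_tokens)  # ['paul', 'RISD']
```

In the real method the scores come from a small language model rather than from hand-written numbers; the sketch only shows how a document-level pass followed by a token-level pass shrinks the prompt.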

Instead of fighting against positional effects, we aim to use them to our advantage through document reordering, as illustrated by the <font color='green'>**green line**</font>. In this approach, the most critical passages are placed at the beginning and the end of the context. Furthermore, the entire process becomes more **cost-effective and faster** because only about **1/4** of the original context needs to be processed.
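
As a rough illustration of the reordering idea, the sketch below places the highest-scoring passages at the two ends of the context and the lowest-scoring ones in the middle. The function name `reorder_edges_first` and the scores are hypothetical, and this is not necessarily how LLMLingua's `reorder_context` option is implemented.

```python
from typing import List


def reorder_edges_first(passages: List[str], scores: List[float]) -> List[str]:
    """Put the most important passages at the start and end of the context."""
    ranked = sorted(zip(passages, scores), key=lambda p: p[1], reverse=True)
    front, back = [], []
    for rank, (passage, _score) in enumerate(ranked):
        # Alternate between the front and the back of the context.
        (front if rank % 2 == 0 else back).append(passage)
    # Reverse the back half so the second-best passage ends up last.
    return front + back[::-1]


# Hypothetical passages with importance scores (higher is more important).
ordered = reorder_edges_first(["A", "B", "C", "D", "E"], [5, 4, 3, 2, 1])
print(ordered)  # ['A', 'C', 'E', 'D', 'B']
```

Here the two most important passages, `A` and `B`, sit at the beginning and the end, while the least important passage, `E`, lands in the middle, where models tend to pay the least attention.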

## Installation

In [1]:
%pip install -q llmlingua llama-index llama-index-llms-openai llama-index-postprocessor-longllmlingua

Note: you may need to restart the kernel to use updated packages.


### Setup Data

Next, we demonstrate LongLLMLingua on the **Paul Graham essay** dataset in a LlamaIndex pipeline, where it effectively alleviates the "lost in the middle" issue.

In [2]:
!wget -q "https://www.dropbox.com/s/f6bmb19xdg0xedm/paul_graham_essay.txt?dl=1" -O paul_graham_essay.txt

In [3]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [4]:
# load documents
documents = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"]).load_data()

In [5]:
index = VectorStoreIndex.from_documents(documents)

In [6]:
retriever = index.as_retriever(similarity_top_k=10)

In [7]:
# question = "What did the author do growing up?"
# question = "What did the author do during his time in YC?"
question = "Where did the author go for art school?"

In [8]:
# Ground-truth Answer
answer = "RISD"

In [9]:
contexts = retriever.retrieve(question)

In [10]:
context_list = [n.get_content() for n in contexts]
len(context_list)

10

### Response to the Original Prompt

In [11]:
# Response to the original (uncompressed) prompt
from llama_index.llms.openai import OpenAI


llm = OpenAI(model="gpt-3.5-turbo-16k")
prompt = "\n\n".join(context_list + [question])

response = llm.complete(prompt)
print(str(response))

The author went to the Rhode Island School of Design (RISD) for art school.
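
To compare a response against the ground-truth `answer` defined earlier, a simple case-insensitive substring check is often enough for a short factual question like this one. The helper name `contains_answer` is hypothetical, and the response text below is copied from the output above so the snippet runs without calling the LLM.

```python
def contains_answer(response_text: str, expected: str) -> bool:
    """Return True if the expected answer appears in the response, ignoring case."""
    return expected.strip().lower() in response_text.lower()


# Response text copied from the cell output above.
response_text = (
    "The author went to the Rhode Island School of Design (RISD) for art school."
)
print(contains_answer(response_text, "RISD"))  # True
```

The same check can be applied to the responses produced from the compressed prompt to confirm that compression did not lose the answer.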


### Response to the Compressed Prompt

In [12]:
# Setup LLMLingua

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    device_map="cpu",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder
        "dynamic_context_compression_ratio": 0.3,  # enable dynamic compression ratio
    },
)


The following cells show how to compose a `retriever`, a `prompt compressor`, and a `query engine` into a **RAG** pipeline.

#### Method One: Call Step-by-Step

In [None]:
retrieved_nodes = retriever.retrieve(question)
synthesizer = CompactAndRefine()

In [None]:
from llama_index.core.schema import QueryBundle

# These are the steps RetrieverQueryEngine performs internally, spelled out for clarity:
# 1. postprocess (compress) the retrieved nodes, 2. synthesize the response
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)


In [None]:
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])

original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)

print(compressed_contexts)
print()
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compression Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")

next Rtm's advice hadn' included anything that. I wanted to do something completely different, so I decided I'd paint. I wanted to how good I could get if I focused on it. the day after stopped on YC, I painting. I was rusty and it took a while to get back into shape, but it was at least completely engaging.1]

I wanted to back RISD, was now broke and RISD was very expensive so decided job for a year and return RISD the fall. I got one at Interleaf, which made software for creating documents. You like Microsoft Word? Exactly That was I low end software tends to high. Interleaf still had a few years to live yet. []

 the Accademia wasn't, and my money was running out, end year back to the
 lot the color class I tookD, but otherwise I was basically myself to do that for in993 I dropped I aroundidence bit then my friend Par did me a big A rent-partment building New York. Did I want it Itt more my place, and York be where the artists. wanted [For when you that ofs you big painting of this 
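
The ratio printed by the cell above is the original token count divided by the compressed token count, with a small epsilon to avoid division by zero. The helper below makes that arithmetic explicit; the function name `compression_ratio` is hypothetical and the token counts are illustrative numbers, not results from this run.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int, eps: float = 1e-5) -> float:
    """Return how many times smaller the compressed prompt is than the original."""
    return original_tokens / (compressed_tokens + eps)


# Illustrative numbers only: a prompt cut to a quarter of its size.
ratio = compression_ratio(1200, 300)
print(f"{ratio:.2f}x")  # 4.00x
```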

In [None]:
response = synthesizer.synthesize(question, new_retrieved_nodes)

In [None]:
print(str(response))

The author went to RISD for art school.


#### Method Two: End-to-End Call

In [None]:
retriever_query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[node_postprocessor]
)

In [None]:
response = retriever_query_engine.query(question)

In [None]:
print(str(response))

The author went to RISD for art school.
