<a href="https://colab.research.google.com/github/githubpradeep/notebooks/blob/main/LLMLingua_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Retrieval-Augmented Generation (RAG)

<a target="_blank" href="https://colab.research.google.com/github/microsoft/LLMLingua/blob/main/examples/RAG.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Retrieval-Augmented Generation (RAG) is a powerful and popular technique that supplies large language models (LLMs) with specialized knowledge. However, traditional RAG pipelines tend to produce increasingly long prompts, sometimes exceeding **40k** tokens, which raises both financial and latency costs. Moreover, the lower information density of such prompts can degrade LLM performance, as in the "lost in the middle" issue.

<center><img width="800" src="https://github.com/microsoft/LLMLingua/blob/main/images/LongLLMLingua_Motivation.png?raw=1"></center>

To address this, we propose [**LongLLMLingua**](https://arxiv.org/abs/2310.06839), which tackles the low information density of long contexts through prompt compression, making it particularly suitable for RAG tasks. The main idea is a two-stage compression process, shown by the <font color='red'>**red line**</font>, which significantly improves on the original curve:

- Coarse-grained compression using document-level perplexity;
- Fine-grained compression of the remaining text using token-level perplexity.
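
The shape of this two-stage pipeline can be sketched in a few lines. This is only a toy illustration, not the LongLLMLingua implementation: the word-overlap score in `doc_score` and the fixed frequency table `COMMON_WORD_PROB` are hypothetical stand-ins for the document-level and token-level perplexities that a small language model would provide, and all function names are made up for this example.

```python
import math

# Hypothetical unigram probabilities for a few common words; any word not
# listed is treated as rare. A real system would query a language model.
COMMON_WORD_PROB = {"the": 0.05, "of": 0.03, "in": 0.02, "is": 0.02, "has": 0.01, "as": 0.01}


def doc_score(document, question):
    """Toy relevance score: number of question words that appear in the document."""
    return len(set(document.lower().split()) & set(question.lower().split()))


def token_surprisal(token):
    """Toy stand-in for token perplexity: rarer words count as more informative."""
    return -math.log(COMMON_WORD_PROB.get(token.lower(), 1e-4))


def coarse_filter(documents, question, keep):
    """Stage 1: keep the `keep` documents most relevant to the question."""
    ranked = sorted(documents, key=lambda d: doc_score(d, question), reverse=True)
    return ranked[:keep]


def fine_filter(text, keep_ratio):
    """Stage 2: keep the most informative tokens, in their original order."""
    tokens = text.split()
    n_keep = max(1, int(len(tokens) * keep_ratio))
    by_surprisal = sorted(
        range(len(tokens)), key=lambda i: token_surprisal(tokens[i]), reverse=True
    )
    return " ".join(tokens[i] for i in sorted(by_surprisal[:n_keep]))


question = "how many members does opec have"
documents = [
    "the weather in vienna is mild",
    "opec has 14 members as of 2018",
    "oil prices rose sharply last year",
]
kept = coarse_filter(documents, question, keep=1)
compressed = [fine_filter(d, keep_ratio=0.6) for d in kept]
print(compressed)
```

The first stage discards whole documents cheaply, so the more expensive token-level pass only runs on text that is likely to matter.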

Instead of fighting positional effects, we exploit them through document reordering, as illustrated by the <font color='green'>**green line**</font>: the most critical passages are placed at the beginning and the end of the context. The entire process also becomes **faster and more cost-effective**, since only **1/4** of the original context has to be handled.
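
One simple way to place the most critical passages at both ends of the context is to alternate between the front and the back of the output while walking a relevance-ranked list. The helper below is a hypothetical sketch of that idea and makes no claim about how the library's `reorder_context` option is implemented.

```python
def place_at_edges(ranked):
    """Reorder items ranked most-important-first so the best ones sit at both ends.

    Items at even ranks fill the front, items at odd ranks fill the back,
    which leaves the least important items in the middle.
    """
    front, back = [], []
    for rank, item in enumerate(ranked):
        (front if rank % 2 == 0 else back).append(item)
    return front + back[::-1]


# "a" is the most relevant passage, "e" the least relevant one.
print(place_at_edges(["a", "b", "c", "d", "e"]))
```

With five passages, the two most relevant ones end up first and last, and the least relevant one lands in the middle.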

### NaturalQuestions Multi-document QA

Next, we demonstrate LongLLMLingua on the NaturalQuestions dataset and show that it alleviates the "lost in the middle" issue. This dataset closely resembles real-world RAG scenarios: the Contriever retrieval system first recalls 20 relevant documents (1 ground-truth document and 19 related documents), and each question is then answered from a prompt composed of these 20 documents.

The original dataset can be found at https://github.com/nelson-liu/lost-in-the-middle/tree/main/qa_data.

In [None]:
# Install dependencies.
## Lost in the middle
!git clone https://github.com/nelson-liu/lost-in-the-middle
%cd lost-in-the-middle
!echo "xopen" > requirements.txt && pip install -e .
## LLMLingua
!pip install llmlingua

In [2]:
!pip install accelerate bitsandbytes -q

In [None]:
!pip install -e .

In [4]:
!pip install openai==0.28 -q

In [11]:
# Restart the session (runtime) here so the newly installed packages are picked up.

In [23]:
# Use the OpenAI API; set your API key below.
import openai
openai.api_key = ""
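
Rather than pasting the key into the notebook, it can be read from the environment. The following is a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; both the variable name and the `load_api_key` helper are illustrative and not part of the original notebook.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, after `import openai`:
# openai.api_key = load_api_key()
```

Failing early with a clear message avoids a less obvious authentication error on the first API call.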

### Setup Data

In [2]:
import json
from xopen import xopen
from copy import deepcopy
from tqdm import tqdm
from lost_in_the_middle.prompting import (
    Document,
    get_closedbook_qa_prompt,
    get_qa_prompt,
)

datasets = []
path = "./lost-in-the-middle/qa_data/20_total_documents/nq-open-20_total_documents_gold_at_9.jsonl.gz"
with xopen(path) as f:
    for ii, jj in tqdm(enumerate(f), total=2655):
        input_example = json.loads(jj)
        question = input_example["question"]
        documents = []
        for ctx in deepcopy(input_example["ctxs"]):
            documents.append(Document.from_dict(ctx))

        prompt = get_qa_prompt(
            question,
            documents,
            mention_random_ordering=False,
            query_aware_contextualization=False,
        )

        c = prompt.split("\n\n")
        instruction, question = c[0], c[-1]
        demonstration = "\n".join(c[1:-1])
        datasets.append({"id": ii, "instruction": instruction, "demonstration": demonstration, "question": question, "answer": input_example["answers"]})

100%|██████████| 2655/2655 [00:02<00:00, 998.47it/s] 


In [3]:
# select an example from NaturalQuestions
instruction, demonstration_str, question, answer = [datasets[23][key] for key in ["instruction", "demonstration", "question", "answer"]]

In [4]:
# Ground-truth Answer
answer

['14']

### Response to the Original Prompt (Incorrect)

In [5]:
# Response to the original prompt (incorrect: the model answers 15, the ground truth is 14)
prompt = "\n\n".join([instruction, demonstration_str, question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-8cuSuNcOhlyJdNj33TuktInvNQEWj",
    "object": "chat.completion",
    "created": 1704284208,
    "model": "gpt-3.5-turbo-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "As of the provided search results, OPEC has 15 member countries."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 2897,
        "completion_tokens": 15,
        "total_tokens": 2912
    },
    "system_fingerprint": null
}
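
The response above states 15 members, while the ground-truth answer is 14. Such a comparison can be automated with a small check. The sketch below is a hypothetical helper (`contains_answer` is not part of the notebook, and it is not the metric shipped with the lost-in-the-middle repository); it matches whole tokens after lowercasing and stripping punctuation, so that "14" is not found inside "2014".

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into whitespace tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def contains_answer(response_text, answers):
    """Return True if any gold answer occurs as a contiguous token span in the response."""
    response_tokens = normalize(response_text)
    for answer in answers:
        answer_tokens = normalize(answer)
        if not answer_tokens:
            continue
        span = len(answer_tokens)
        for start in range(len(response_tokens) - span + 1):
            if response_tokens[start:start + span] == answer_tokens:
                return True
    return False


original_response = "As of the provided search results, OPEC has 15 member countries."
print(contains_answer(original_response, ["14"]))
```

In the notebook, the text to check would come from `response["choices"][0]["message"]["content"]` and the gold answers from the `answer` variable.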


### Response to the Compressed Prompt (Correct at 10x Compression)

In [6]:
# Setup LLMLingua
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor()



In [8]:
# 6x Compression

compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=500,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    **request_data,
)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)

{
    "compressed_prompt": "Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).\n\nDocument [10](Title: OPEC Organization of the Petroleum Exporting Countries (OPEC, /\u02c8o\u028ap\u025bk/ OH-pek, or OPEP in several other languages) is an intergovernmental organization of 14 nations as of February 2018, founded in 1960 in Baghdad by the first five members (Iran, Iraq, Kuwait, Saudi Arabia, and Venezuela), and headquartered since 1965 in Vienna, Austria. As of 2016, the 14 countries accounted for an estimated 44 percent of global oil production and 73 percent of the world's \"proven\" oil reserves, giving OPEC a major influence on global oil prices that were previously determined by American-dominated multinational oil companies.\n\nDocument1](Title: OPE OPE lost its newest members, who had in mid-1970s E withd in December 192, because it was unwilling to pay annual US$2 million membership fee felt that it neede

In [10]:
# 10x Compression
compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=100,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    **request_data,
)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)

{
    "compressed_prompt": "Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).\n\n0Title: OPECization of Petroleum Exporting Count (OPEC, /\u02c8o\u028ap\u025bk OH-pekP in other) is an intergovernmental14 as February 218 founded in 1960 in Baghdad by fiveIran Iraq, Kuwait, Saudi Arab, and Venezuela), headquartered since 965 in, Austria. of the4ed an estimated4 percent of production and 3 percent of the world's \"proven\" oil res OPEC on global by Americandominatedin companies.\n\n5](Title: OPEC) OPEC lost its two newest members, who had joined in the mid-1970s. Ecuador withdrew in December 1992, because it was unwilling to pay the annual US$2 million membership fee and felt that it needed to produce more oil than it was allowed under the OPEC quota, although it rejoined in October 2007. Similar concerns prompted Gabon to suspend membership in January 1995; it rejoined in July 2016. Iraq has remained a member of

In [13]:
# Install dependency.
!pip install -q llama-index

In [None]:
!pip install -U openai

In [14]:
!wget "https://www.dropbox.com/s/f6bmb19xdg0xedm/paul_graham_essay.txt?dl=1" -O paul_graham_essay.txt


In [5]:
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
)

In [6]:
# load documents
documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

In [7]:
index = VectorStoreIndex.from_documents(documents)

In [8]:
retriever = index.as_retriever(similarity_top_k=10)

In [9]:
# question = "What did the author do growing up?"
# question = "What did the author do during his time in YC?"
question = "Where did the author go for art school?"

In [11]:
answer = "RISD"
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
len(context_list)

10

In [12]:
# The response from original prompt
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-16k")
prompt = "\n\n".join(context_list + [question])

response = llm.complete(prompt)
print(str(response))

The author went to the Rhode Island School of Design (RISD) for art school.


In [14]:
# Setup LLMLingua
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)


In [15]:
retrieved_nodes = retriever.retrieve(question)
synthesizer = CompactAndRefine()

In [16]:
from llama_index.indices.query.schema import QueryBundle

# outline steps in RetrieverQueryEngine for clarity:
# postprocess (compress), synthesize
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)

In [17]:
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])

original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)

print(compressed_contexts)
print()
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compression Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")

What should I do next? Rtm's advice hadn't included anything about that. I wanted to do something completely different, so I decided I'd paint. I wanted to see how good I could get if I focused on it. So the day after I stopped working on YC, I started painting. I was rusty and it took a while to get back into shape, but it was at least completely engaging. [18]

Our Ulivi, was a guy. He could see I worked hard, and gave me, wrote down in a sort of pass each student But Accademia wasn't me anything Italian, and my money was running out, so at the end of the first year I back to US

I wanted back to RISD, but I was now broke and RISD very expensive decided to a job for year return RISD the I got one at called Interleaf, which made software. You Microsoft Word? Exactly That was learned end software tends to high. But Interleaf still had a few years to live. [] in ID, but was basically myself to I for free99 I out around my friend Nancy Parmet did big Aled building in York becomingant. Di

In [18]:
response = synthesizer.synthesize(question, new_retrieved_nodes)

In [19]:
print(str(response))

The author went to RISD for art school.


In [20]:
retriever_query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[node_postprocessor]
)

In [21]:
response = retriever_query_engine.query(question)

In [22]:
print(str(response))

The author went to RISD for art school.
