
# Longer ≠ Better: Why RAG Still Matters


Retrieval-Augmented Generation (RAG) emerged as a solution to the context window limitations of early large language models, allowing selective information retrieval when token constraints prevented processing entire datasets. Now that models like Gemini 1.5 can handle [millions of tokens](https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/), we can test whether RAG is still a necessary tool for providing context in the era of LLMs with million-token context windows.

**Background**


*   RAG was developed as a workaround for token constraints in LLMs
*   RAG allowed selective information retrieval to avoid context window limitations
*   New models like Gemini 1.5 can handle millions of tokens
*   As token limits increase, the need for selective retrieval diminishes
*   Future applications may process massive datasets without external databases
*   RAG may become obsolete as models handle more information directly


Let's test how well models with large context windows perform compared to RAG.




## Architecture

* **RAG**: We're using Elasticsearch with semantic text search enabled; the search results are supplied to the LLM (in this case Gemini) as context.

* **LLM**: We're providing the documents directly to the LLM (in this case Gemini) as context, up to a maximum of 1M tokens.

## Methodology

To compare performance between RAG and LLM full context, we're going to work with a mix of technical articles and documentation. For the full-context strategy, all the articles and documentation are passed to the LLM as context.

To determine whether an answer is correct, we're going to ask both systems ***What is the title of the article?*** For this we're going to run 2 sets of tests:

1. Run a **textual** query to find an exact extract of a document and identify the article it belongs to. Compare RAG and LLM performance.
2. Run a **semantic** query to find a sentence that is semantically equivalent to one in a document. Compare RAG and LLM performance.

To compare both technologies we're going to measure:
- Accuracy
- Time
- Cost

# Setup

We set up the Python libraries we're going to use:
*   **Elasticsearch** - To run queries against Elasticsearch
*   **LangChain**     - Interface to the LLM


We also read the API keys needed to work with both components.

In [None]:
%pip install elasticsearch langchain langchain-core langchain-groq langchain-community matplotlib langchain-google-genai -q

### Import libraries and set the Elasticsearch and Google AI API keys

In [71]:
import os
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from getpass import getpass

from elasticsearch import Elasticsearch, helpers
from langchain.callbacks import get_openai_callback
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI


os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google AI API key: ")
os.environ["ES_API_KEY"] = getpass("Elasticsearch API Key: ")
os.environ["ES_URL"] = getpass("Elasticsearch URL: ")


index_name = "technical-articles"

### Elasticsearch client

In [46]:
es_client = Elasticsearch(
    os.environ["ES_URL"],
    api_key=os.environ["ES_API_KEY"],
    request_timeout=120,
)

### Defining LLM

In [72]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

Function to calculate the cost of an LLM call:

In [49]:
# Function to calculate cost of LLM with input and output cost per million tokens
def calculate_cost(
    input_price=0.10, output_price=0.40, input_tokens=0, output_tokens=0
):
    input_total_cost = (input_tokens / 1_000_000) * input_price
    output_total_cost = (output_tokens / 1_000_000) * output_price

    return input_total_cost + output_total_cost
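
To get a feel for the scale involved, the sketch below applies the same per-million-token arithmetic to two requests. The token counts are made-up round numbers chosen for illustration, not measurements from this notebook, and `estimate_cost` is a hypothetical stand-in for the function above so the example runs on its own.

```python
# Same arithmetic as calculate_cost, repeated so the example is self-contained.
def estimate_cost(input_tokens, output_tokens, input_price=0.10, output_price=0.40):
    """Return the cost of one request, given prices per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price
    output_cost = (output_tokens / 1_000_000) * output_price
    return input_cost + output_cost


# Hypothetical token counts, chosen only to illustrate the difference in scale.
rag_cost = estimate_cost(input_tokens=2_000, output_tokens=50)
full_context_cost = estimate_cost(input_tokens=1_000_000, output_tokens=50)

print(f"RAG-sized prompt:    ${rag_cost:.6f}")
print(f"Full-context prompt: ${full_context_cost:.6f}")
print(f"Ratio: {full_context_cost / rag_cost:.0f}x")
```

Because the input term dominates when the prompt is large, the cost of the full-context strategy grows almost linearly with the number of tokens sent.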

# 1. Index working files

For this test, we're going to index a mix of 303 documents with technical articles and documentation. These documents will be the source of information for both tests.

## Create and populate index

To implement RAG, we're including a semantic_text field in the mappings so we can run semantic queries in Elasticsearch, along with the regular text field.

We're also pushing the documents to the "technical-articles" index.

### Creating index


In [None]:
if not es_client.indices.exists(index=index_name):
    # Define a simple mapping for text documents
    mappings = {
        "mappings": {
            "properties": {
                "text": {"type": "text", "copy_to": "semantic_text"},
                "meta_description": {"type": "keyword", "copy_to": "semantic_text"},
                "title": {"type": "keyword", "copy_to": "semantic_text"},
                "imported_at": {"type": "date"},
                "url": {"type": "keyword"},
                "semantic_text": {
                    "type": "semantic_text",
                },
            }
        }
    }

    es_client.indices.create(index=index_name, body=mappings)

    print(f"Created index '{index_name}'")

### Populating index

Indexing documents into Elasticsearch using the Bulk API

In [None]:
file_path = "dataset.json"

actions = []

with open(file_path, "r") as f:
    documents = json.load(f)
    for doc in documents:
        document = {
            "_index": index_name,
            "_source": {
                "text": doc["text"],
                "url": doc["url"],
                "title": doc["title"],
                "meta_description": doc["meta_description"],
                "imported_at": datetime.now(),
            },
        }

        actions.append(document)


res = helpers.bulk(es_client, actions)

print("documents indexed", res)
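
For larger datasets, one possible variation is to produce the bulk actions lazily instead of collecting them in a list first. The sketch below shows that idea with a generator; `generate_actions` and the sample documents are hypothetical names and data introduced here for illustration, and no Elasticsearch connection is needed to run it.

```python
from datetime import datetime, timezone


def generate_actions(documents, index_name):
    """Yield one bulk action per document instead of building a list up front."""
    imported_at = datetime.now(timezone.utc)
    for doc in documents:
        yield {
            "_index": index_name,
            "_source": {
                "text": doc["text"],
                "url": doc["url"],
                "title": doc["title"],
                "meta_description": doc["meta_description"],
                "imported_at": imported_at,
            },
        }


# Tiny made-up sample standing in for dataset.json.
sample_docs = [
    {"text": "body A", "url": "https://example.com/a", "title": "A", "meta_description": "desc A"},
    {"text": "body B", "url": "https://example.com/b", "title": "B", "meta_description": "desc B"},
]

sample_actions = list(generate_actions(sample_docs, "technical-articles"))
print(len(sample_actions), sample_actions[0]["_source"]["title"])
```

Since helpers.bulk accepts any iterable of actions, the generator could be passed to it directly, for example `helpers.bulk(es_client, generate_actions(documents, index_name))`.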

# 2. Run Comparisons


## Test 1: Textual Query

### Query string used to retrieve results from Elasticsearch

In [75]:
query_str = """
Let’s now create a test.js file and install our mock client: Now, add a mock for semantic search: We can now create a test for our code, making sure that the Elasticsearch part will always return the same results: Let’s run the tests.
"""

We extracted a paragraph from the article **Elasticsearch in JavaScript the proper way, part II**; we will use it as input to retrieve the results from Elasticsearch.

Results will be stored in the textual_rag_summary variable.


### RAG strategy (Textual)

#### Executing Match Phrase Search

This is the query we're going to use to retrieve the results from Elasticsearch using match phrase search capabilities. We will pass the query_str as input to the match phrase search.

In [None]:
textual_rag_summary = {}  # Variable to store results

start_time = time.time()

es_query = {
    "query": {"match_phrase": {"text": {"query": query_str}}},
    "_source": ["title"],
    "highlight": {
        "pre_tags": [""],
        "post_tags": [""],
        "fields": {"title": {}, "text": {}},
    },
    "size": 10,
}

response = es_client.search(index=index_name, body=es_query)
hits = response["hits"]["hits"]

textual_rag_summary["time"] = (
    time.time() - start_time
)  # save time taken to run the query
textual_rag_summary["es_results"] = hits  # save hits

print("ELASTICSEARCH RESULTS: \n", json.dumps(hits, indent=4))

This template gives the LLM the instructions to answer the question and the context to do so. At the end of the prompt we're asking for the title of the article.

The prompt template will be the same for all tests.

In [78]:
# LLM prompt template
template = """
  Instructions:

  - You are an assistant for question-answering tasks.
  - Answer questions truthfully and factually using only the context presented.
  - If you don't know the answer, just say that you don't know, don't make up an answer.
  - Use markdown format for code examples.
  - You are correct, factual, precise, and reliable.
  - Answer

  Context:
  {context}

  Question:
  {question}.

  What is the title of the article?
"""

#### Run results through LLM

Results from Elasticsearch will be provided as context to the LLM for us to get the result we need.

In [None]:
start_time = time.time()

prompt = ChatPromptTemplate.from_template(template)

context = ""

for hit in hits:
    # For semantic_text matches, we need to extract the text from the highlighted field
    if "highlight" in hit:
        highlighted_texts = []

        for values in hit["highlight"].values():
            highlighted_texts.extend(values)

        context += f"{hit['_source']['title']}\n"
        context += "\n --- \n".join(highlighted_texts)

# Use LangChain for the LLM part
chain = prompt | llm | StrOutputParser()

printable_prompt = prompt.format(context=context, question=query_str)
print("PROMPT WITH CONTEXT AND QUESTION:\n ", printable_prompt)  # Print prompt

with get_openai_callback() as cb:
    response = chain.invoke({"context": context, "question": query_str})

# Save results
textual_rag_summary["answer"] = response
textual_rag_summary["total_time"] = (time.time() - start_time) + textual_rag_summary[
    "time"
]  # Sum of time taken to run the semantic search and the LLM
textual_rag_summary["tokens_sent"] = cb.prompt_tokens
textual_rag_summary["cost"] = calculate_cost(
    input_tokens=cb.prompt_tokens, output_tokens=cb.completion_tokens
)

print("LLM Response:\n ", response)
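
The loop that turns hits into a context string is repeated in both RAG tests, so it could be factored into a helper. The following is a minimal sketch under that assumption: `build_context` is a hypothetical name, the hits are made-up dictionaries with the same shape as the search response, and, unlike the loop above, it puts a blank line between hits so one article's text does not run into the next title.

```python
def build_context(hits, separator="\n --- \n"):
    """Turn Elasticsearch hits into one context string: title, then highlights."""
    blocks = []
    for hit in hits:
        highlight = hit.get("highlight")
        if not highlight:
            continue  # hits without highlighted fragments are skipped
        fragments = [text for values in highlight.values() for text in values]
        blocks.append(f"{hit['_source']['title']}\n" + separator.join(fragments))
    # A blank line between hits keeps the articles visually separate.
    return "\n\n".join(blocks)


# Made-up hits with the same shape as the search response.
fake_hits = [
    {"_source": {"title": "Article A"}, "highlight": {"text": ["first fragment", "second fragment"]}},
    {"_source": {"title": "Article B"}},  # no highlight: ignored
    {"_source": {"title": "Article C"}, "highlight": {"title": ["Article C"], "text": ["only fragment"]}},
]

sample_context = build_context(fake_hits)
print(sample_context)
```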

### LLM strategy (Textual)



#### Match all query

To provide context to the LLM, we're going to retrieve it from the documents indexed in Elasticsearch. Since the maximum context is 1 million tokens, this amounts to 303 documents.

In [None]:
textual_llm_summary = {}  # Variable to store results

start_time = time.time()

es_query = {"query": {"match_all": {}}, "sort": [{"title": "asc"}], "size": 303}

es_results = es_client.search(index=index_name, body=es_query)
hits = es_results["hits"]["hits"]

# Save results
textual_llm_summary["es_results"] = hits
textual_llm_summary["time"] = time.time() - start_time

print("ELASTICSEARCH RESULTS: \n", json.dumps(hits, indent=4))
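
How many documents fit in a given token budget depends on the model's tokenizer. As a rough sketch only, the snippet below estimates it with a characters-per-token rule of thumb; the value of 4 characters per token is an assumption rather than a property of Gemini, and `select_within_budget` is a hypothetical helper, so treat the result as an approximation.

```python
def select_within_budget(texts, max_tokens=1_000_000, chars_per_token=4):
    """Return how many leading texts fit in the budget and the estimated tokens used.

    chars_per_token is a rough rule of thumb, not the model's real tokenizer.
    """
    used = 0
    count = 0
    for text in texts:
        estimated = len(text) // chars_per_token + 1
        if used + estimated > max_tokens:
            break  # the next document would exceed the budget
        used += estimated
        count += 1
    return count, used


# Made-up documents of 400, 800 and 4,000 characters, with a small budget.
sample_texts = ["a" * 400, "b" * 800, "c" * 4_000]
fit_count, fit_tokens = select_within_budget(sample_texts, max_tokens=500)
print(f"{fit_count} documents fit, about {fit_tokens} estimated tokens")
```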

#### Run results through LLM

As in the previous step, we're going to provide the context to the LLM and ask for the answer.

In [None]:
start_time = time.time()

prompt = ChatPromptTemplate.from_template(template)
# Use LangChain for the LLM part
chain = prompt | llm | StrOutputParser()

# Use the same context that is sent to the LLM below (all retrieved documents)
printable_prompt = prompt.format(context=hits, question=query_str)
print("PROMPT:\n ", printable_prompt)  # Print prompt

with get_openai_callback() as cb:
    response = chain.invoke({"context": hits, "question": query_str})

# Save results
textual_llm_summary["answer"] = response
textual_llm_summary["total_time"] = (time.time() - start_time) + textual_llm_summary[
    "time"
]  # Sum of time taken to run the match_all query and the LLM
textual_llm_summary["tokens_sent"] = cb.prompt_tokens
textual_llm_summary["cost"] = calculate_cost(
    input_tokens=cb.prompt_tokens, output_tokens=cb.completion_tokens
)

print("LLM Response:\n ", response)  # Print LLM response

## Test 2: Semantic Query

### RAG strategy (Non-textual)



In [84]:
query_str = "This article explains how to improve code reliability. It includes techniques for error handling, and running applications without managing servers."

For the second test we're going to use a semantic query to retrieve the results from Elasticsearch. For that, we wrote a short synopsis of the article **Elasticsearch in JavaScript the proper way, part II** as query_str and provided it as input to RAG.

#### Executing semantic search

This is the query we're going to use to retrieve the results from Elasticsearch using semantic search capabilities. We will pass the query_str as input to the semantic search.

In [None]:
semantic_rag_summary = {}  # Variable to store results

start_time = time.time()

es_query = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "bool": {
                                "should": [
                                    {
                                        "multi_match": {
                                            "query": query_str,
                                            "fields": ["text", "title"],
                                        }
                                    },
                                    {"match_phrase": {"text": {"query": query_str}}},
                                ]
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "semantic": {
                                "field": "semantic_text",
                                "query": query_str,
                            }
                        }
                    }
                },
            ],
            "rank_window_size": 50,
            "rank_constant": 20,
        }
    },
    "_source": ["title"],
    "highlight": {
        "pre_tags": [""],
        "post_tags": [""],
        "fields": {"title": {}, "text": {}},
    },
    "size": 10,
}


response = es_client.search(index=index_name, body=es_query)
hits = response["hits"]["hits"]

semantic_rag_summary["time"] = (
    time.time() - start_time
)  # save time taken to run the query
semantic_rag_summary["es_results"] = hits  # save hits

print("ELASTICSEARCH RESULTS: \n", json.dumps(hits, indent=4))
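
The rrf retriever combines the lexical and semantic result lists using reciprocal rank fusion. The sketch below illustrates the standard RRF formula in plain Python, where each document scores the sum of 1 / (rank_constant + rank) over the lists it appears in. It is a simplified illustration with made-up document IDs, not Elasticsearch's implementation, and it leaves out rank_window_size.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=20):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Ranks start at 1; a document absent from a list contributes nothing for it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up result lists: one from the lexical retriever, one from the semantic one.
lexical = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_b", "doc_d", "doc_a"]

fused = rrf_fuse([lexical, semantic])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear near the top of both lists end up ahead of documents that rank first in only one, which is why the hybrid query can surface an article that matches the synopsis in meaning but shares few exact words with it.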

#### Run results through LLM
Now results from Elasticsearch will be provided as context to the LLM for us to get the result we need.

In [None]:
start_time = time.time()

prompt = ChatPromptTemplate.from_template(template)

context = ""

for hit in hits:
    # For semantic_text matches, we need to extract the text from the highlighted field
    if "highlight" in hit:
        highlighted_texts = []

        for values in hit["highlight"].values():
            highlighted_texts.extend(values)

        context += f"{hit['_source']['title']}\n"
        context += "\n --- \n".join(highlighted_texts)

# Use LangChain for the LLM part
chain = prompt | llm | StrOutputParser()

printable_prompt = prompt.format(context=context, question=query_str)
print("PROMPT:\n ", printable_prompt)  # Print prompt

with get_openai_callback() as cb:
    response = chain.invoke({"context": context, "question": query_str})

# Save results
semantic_rag_summary["answer"] = response
semantic_rag_summary["total_time"] = (time.time() - start_time) + semantic_rag_summary[
    "time"
]  # Sum of time taken to run the semantic search and the LLM
semantic_rag_summary["tokens_sent"] = cb.prompt_tokens
semantic_rag_summary["cost"] = calculate_cost(
    input_tokens=cb.prompt_tokens, output_tokens=cb.completion_tokens
)

print("LLM Response:\n ", response)

### LLM strategy (Non-textual)



#### Match all query

To provide context to the LLM, we're going to retrieve it from the documents indexed in Elasticsearch. Since the maximum context is 1 million tokens, this amounts to 303 documents.

In [None]:
semantic_llm_summary = {}  # Variable to store results

start_time = time.time()

es_query = {"query": {"match_all": {}}, "sort": [{"title": "asc"}], "size": 303}
es_llm_context = es_client.search(index=index_name, body=es_query)

hits = es_llm_context["hits"]["hits"]

print("ELASTICSEARCH RESULTS: \n", json.dumps(hits, indent=4))

# Save results
semantic_llm_summary["es_results"] = hits
semantic_llm_summary["time"] = time.time() - start_time

#### Run results through LLM

As in the previous step, we're going to provide the context to the LLM and ask for the answer.

In [None]:
start_time = time.time()

prompt = ChatPromptTemplate.from_template(template)
# Use LangChain for the LLM part
chain = prompt | llm | StrOutputParser()

# Use the same context that is sent to the LLM below (all retrieved documents)
printable_prompt = prompt.format(context=hits, question=query_str)
print("PROMPT:\n ", printable_prompt)  # Print prompt

with get_openai_callback() as cb:
    response = chain.invoke({"context": hits, "question": query_str})

print(response)

# Save results
semantic_llm_summary["answer"] = response
semantic_llm_summary["total_time"] = (time.time() - start_time) + semantic_llm_summary[
    "time"
]  # Sum of time taken to run the match_all query and the LLM
semantic_llm_summary["tokens_sent"] = cb.prompt_tokens
semantic_llm_summary["cost"] = calculate_cost(
    input_tokens=cb.prompt_tokens, output_tokens=cb.completion_tokens
)


print("LLM Response:\n ", response)  # Print LLM response

# 3. Printing results

### Printing results

Now we're going to print the results of both tests in a dataframe.

In [None]:
df1 = pd.DataFrame(
    [
        {
            "Strategy": "Textual RAG",
            "Answer": textual_rag_summary["answer"],
            "Tokens Sent": textual_rag_summary["tokens_sent"],
            "Time": textual_rag_summary["total_time"],
            "LLM Cost": textual_rag_summary["cost"],
        },
        {
            "Strategy": "Textual LLM",
            "Answer": textual_llm_summary["answer"],
            "Tokens Sent": textual_llm_summary["tokens_sent"],
            "Time": textual_llm_summary["total_time"],
            "LLM Cost": textual_llm_summary["cost"],
        },
    ]
)


df2 = pd.DataFrame(
    [
        {
            "Strategy": "Semantic RAG",
            "Answer": semantic_rag_summary["answer"],
            "Tokens Sent": semantic_rag_summary["tokens_sent"],
            "Time": semantic_rag_summary["total_time"],
            "LLM Cost": semantic_rag_summary["cost"],
        },
        {
            "Strategy": "Semantic LLM",
            "Answer": semantic_llm_summary["answer"],
            "Tokens Sent": semantic_llm_summary["tokens_sent"],
            "Time": semantic_llm_summary["total_time"],
            "LLM Cost": semantic_llm_summary["cost"],
        },
    ]
)

print("Textual Query DF")
display(df1)

print("Semantic Query DF")
display(df2)
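
Accuracy is one of the metrics listed in the methodology, but the tables above leave it to the reader to judge each answer. One way to check it automatically is sketched below: a loose, case-insensitive test for whether the expected title appears in the answer. `normalize` and `is_correct` are hypothetical helpers, and the sample answers are made up rather than taken from the runs above.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_correct(answer, expected_title):
    """Return True when the expected title appears as whole words in the answer."""
    return f" {normalize(expected_title)} " in f" {normalize(answer)} "


expected_title = "Elasticsearch in JavaScript the proper way, part II"

# Made-up answers standing in for the values of the "Answer" column.
sample_answers = {
    "Strategy A": 'The title of the article is "Elasticsearch in JavaScript the proper way, part II".',
    "Strategy B": "I don't know.",
}

for strategy, answer in sample_answers.items():
    verdict = "correct" if is_correct(answer, expected_title) else "incorrect"
    print(strategy, "->", verdict)
```

A check like this could be applied to the Answer column of each dataframe to add a correctness column alongside tokens, time, and cost.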

### Printing charts

For better visualization of the results, we're going to print bar charts with the number of tokens sent, the response time, and the cost by strategy.

In [None]:
df_combined = pd.concat([df1, df2])

plt.figure(figsize=(10, 5))
df_combined.plot(kind="bar", x="Strategy", y="Tokens Sent", legend=False, ax=plt.gca())
plt.title("Tokens Sent by Strategy")
plt.yscale("log")

plt.figure(figsize=(10, 5))
df_combined.plot(kind="bar", x="Strategy", y="Time", legend=False, ax=plt.gca())
plt.title("Response Time by Strategy")
plt.ylabel("Time (seconds)")


plt.figure(figsize=(10, 5))
df_combined.plot(kind="bar", x="Strategy", y="LLM Cost", legend=False, ax=plt.gca())
plt.title("Cost by Strategy")
plt.yscale("log")
plt.ylabel("Cost")

# Clean resources

As an optional step, we're going to delete the index from Elasticsearch.

In [None]:
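# Ignore "bad request" and "not found" responses, e.g. if the index was already deleted
es_client.options(ignore_status=[400, 404]).indices.delete(index=index_name)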

Comments on the textual query:


On RAG:
1.   RAG was able to find the correct result
2.   RAG searched the full index, and its response time was similar to that of the LLM with partial context


On LLM:
1. The LLM was unable to find the correct result
2. Only partial context was provided, since the limit was 1M tokens (about 303 documents)
3. The time to provide a result was similar to RAG
4. The cost is much higher than RAG, since far more tokens are sent






### Reference

https://codingscape.com/blog/llms-with-largest-context-windows