# Back to Basics for RAG

[video](https://parlance-labs.com/education/rag/jo.html)

## Demystifying RAG

- It is basically "stuffing text into the LLM prompt".
    - Typical use case, Q&A or search:
        - Ask an open-ended question.
        - Retrieve content related to this question.
        - Use this content as context by stuffing it into the LLM prompt.
        - Result: the LLM response will be "grounded" in this context.
        - This is not hallucination-free, but it may improve the accuracy of the generated answer.

- Not necessarily tied to Q&A or search.
    - Example: a labeller. Retrieve positive and negative examples from the dataset and present them in the prompt, so that the LLM labels the remaining items following those examples.
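
The "stuffing" step can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: `build_prompt`, the `max_chars` budget and the instruction wording are all assumptions made for the example.

```python
def build_prompt(question: str, passages: list[str], max_chars: int = 2000) -> str:
    """Stuff retrieved passages into a prompt, stopping at a rough character budget."""
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # assumed policy: drop lower-ranked passages once the budget is spent
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages, already ranked by the retrieval step.
passages = [
    "BM25 is a sparse ranking function.",
    "Dense retrieval uses text embeddings.",
]
prompt = build_prompt("What is BM25?", passages)
```

The returned string is what would be sent to the LLM; the retrieval step that produces `passages` is out of scope here.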

## Architecture

- Orchestration
- Evaluation
- Prompt
- LLM
- State (retrieval sources):
    - File, Search Engine, Vector Database, Database, NumPy

## Evaluating Information Retrieval Systems

- Query
- Retrieval process
- Result: ranked list of documents
- Evaluation: relevance of the ranking with respect to the query.
    - Metrics: binary (relevant or not) or non-binary (graded relevance).
    - Benchmarks: TREC, MS MARCO, BEIR, ...
    - Critical: build your own relevance dataset.
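
To make the binary versus non-binary distinction concrete, here is a small sketch of one metric of each kind: precision@k for binary labels and nDCG@k for graded labels. The notes do not say which metrics the talk used, so these two are assumptions chosen as common examples. The nDCG variant below uses the graded label directly as gain; other variants use `2**label - 1`.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Binary metric: fraction of the top-k results that are labelled relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gains at lower ranks are discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, graded, k):
    """Graded metric: DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [graded.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(graded.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgements for one query: 2 = highly relevant, 1 = relevant, 0 = not relevant.
graded = {"d1": 2, "d2": 1, "d3": 0}
ranked = ["d3", "d1", "d7", "d2"]  # system output; "d7" is unjudged and counted as 0

p_at_2 = precision_at_k(ranked, {"d1", "d2"}, k=2)
ndcg_at_3 = ndcg_at_k(ranked, graded, k=3)
```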

## Relevance dataset

- If we have real queries from production, spend time labelling results of those queries.
- Otherwise, ask an LLM to generate realistic synthetic queries.
- Ideally use a static collection: no additions while evaluating and comparing, to keep results consistent.
- LLM judges can be used to evaluate results.
    - Need to find a prompt that makes the judge's ratings correlate with human judgement.
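
One way to check whether a judge prompt tracks human judgement is to label a small sample by hand and measure agreement with the judge's labels. The sketch below uses Cohen's kappa, which is an assumption on my part (the notes only say the two should correlate); the label lists are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label everywhere
    return (observed - expected) / (1 - expected)


# Hypothetical binary relevance labels for six (query, document) pairs.
human_labels = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 1, 1, 0]  # output of one candidate judge prompt

kappa = cohens_kappa(human_labels, judge_labels)
```

Comparing `kappa` across candidate prompts on the same labelled sample gives a rough way to pick the prompt that agrees best with humans.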

## Representational approaches

- Avoid scoring all documents for every query (e.g., infeasible at web scale).
- Index the documents ahead of time and score only a subset.
- Approaches:
    - Sparse, using inverted index.
        - Top-k retrieval algorithms: WAND, ...
        - Supervised (SPLADE) or unsupervised (TF-IDF)
    - Dense:
        - Vector index
        - Accelerated search (approximate)
        - Supervised via transfer learning (text embedding)
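
The idea of scoring only a subset can be sketched with a toy inverted index: only documents that share at least one term with the query become candidates. Whitespace tokenisation and the function names are assumptions for the example. Top-k algorithms such as WAND prune further by skipping candidates that cannot reach the top k; that pruning is not implemented here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def candidate_docs(index, query):
    """Return only the documents that share at least one term with the query."""
    found = set()
    for term in query.lower().split():
        found |= index.get(term, set())
    return found


# Hypothetical three-document collection.
docs = {
    "d1": "sparse retrieval with an inverted index",
    "d2": "dense retrieval with a vector index",
    "d3": "gradient boosted decision trees",
}
index = build_inverted_index(docs)
to_score = candidate_docs(index, "inverted index")  # "d3" is never scored
```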

## Dense representations

- Encoder / Transformer style:
    - Tokenize input text into discrete vocabulary.
    - Look up the pretrained, learned representation of each token.
    - Feed these into the encoder. For each token, get an output vector.
    - Pooling stage: average all vectors into a single vector representation.
        - Weakest aspect: averaging dilutes the signal and gives low precision. Need a shrinking mechanism?
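
The pooling stage itself is simple, which is part of why it loses information. A minimal sketch of mean pooling follows, assuming per-token output vectors are already available; the `attention_mask` that excludes padding tokens is an assumption borrowed from common practice, not something stated in the notes.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the per-token output vectors, skipping positions masked out as padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask excludes every token")
    return [t / count for t in total]


# Hypothetical encoder outputs for three token positions; the last one is padding.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
```

Two tokens that point in very different directions end up as one vector in between, which illustrates the dilution mentioned above.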

- Baseline / benchmark: BM25
    - Can avoid spectacular failures, e.g., those caused by out-of-vocabulary problems.
    - BM25 can also act as a strong baseline, e.g., for long-context use cases.
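
For reference, a compact BM25 scorer is sketched below. It assumes whitespace tokenisation, a non-empty collection, the commonly used defaults `k1=1.2` and `b=0.75`, and the IDF form `log(1 + (N - n + 0.5) / (n + 0.5))`; real search engines differ in details. The example query contains a rare identifier to show how exact term matching handles tokens an embedding model may never have seen.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in a small, non-empty collection against the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical collection; "xk42" stands in for jargon or an identifier.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "error code xk42 in module",
]
scores = bm25_scores("xk42 error", docs)
```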

## Hybrid approaches

- Dense + sparse representations: the sparse side helps overcome the fixed-vocabulary issues of dense models.
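
The notes do not say how the two result lists should be combined. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the talk's recommendation. It works on ranks only, so it sidesteps the fact that sparse and dense scores are on different scales. The constant `k=60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents found by both retrievers ("d1" and "d3") rise above those found by only one.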

## Chunking

- Dense representations should not be built from texts with more than 256 tokens for high-precision search (longer texts may be fine for other applications, such as classification).
    - The embedding models haven't been trained on longer texts.
    - Topics drift in longer texts.
- You need to chunk if the text is longer than 256 tokens,
    - but there is no need to do it on a per-row basis if you have the right stack.
    - We can index multiple vectors per row.
    - This avoids repeating the same metadata.
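
A rough sketch of "multiple vectors per row": the document is split into chunks of at most 256 tokens, each chunk gets its own vector, and all vectors live in one row next to a single copy of the metadata. Everything named here is hypothetical: `toy_embed` stands in for a real embedding model, whitespace splitting stands in for the model's tokenizer, the text is assumed non-empty, and scoring a row by its best-matching chunk is one possible choice among several.

```python
VOCAB = ["rag", "bm25", "chunk"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Stand-in for an embedding model: counts of a few known words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def chunk_tokens(tokens, max_tokens=256):
    """Split a token list into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def build_row(doc_id, title, text, embed=toy_embed, max_tokens=256):
    """One row per document: metadata stored once, one vector per chunk."""
    tokens = text.split()  # whitespace split stands in for the model's tokenizer
    chunks = [" ".join(chunk) for chunk in chunk_tokens(tokens, max_tokens)]
    return {"id": doc_id, "title": title, "vectors": [embed(c) for c in chunks]}


def score_row(row, query_vector):
    """Score a row by its best-matching chunk (maximum dot product)."""
    return max(
        sum(q * v for q, v in zip(query_vector, vector)) for vector in row["vectors"]
    )


# Hypothetical 600-token document with the interesting words in the middle.
text = " ".join(["filler"] * 300 + ["bm25"] * 2 + ["filler"] * 298)
row = build_row("doc-1", "Notes on retrieval", text)
best = score_row(row, toy_embed("bm25"))
```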

## Other considerations

- Combining GBDT (Gradient-Boosted Decision Trees) with neural features is quite effective.
    - GBDT produces sparse vectors, with one feature per leaf in each tree.
    - A feature is 1 if the input ends up in that leaf, and 0 otherwise.
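
The leaf encoding can be illustrated without a trained model. In the sketch below, two hand-written functions stand in for trees of a fitted GBDT; the input features `bm25` and `cosine` and all thresholds are made up for the example. Each tree contributes a one-hot block marking the leaf the input lands in.

```python
def tree_a(x):
    """Toy tree with 3 leaves: split on the BM25 score, then on cosine similarity."""
    if x["bm25"] < 5.0:
        return 0
    return 1 if x["cosine"] < 0.5 else 2


def tree_b(x):
    """Toy tree with 2 leaves: split on cosine similarity only."""
    return 0 if x["cosine"] < 0.7 else 1


# Each entry: (function returning the leaf index, number of leaves in that tree).
TREES = [(tree_a, 3), (tree_b, 2)]


def leaf_features(x, trees=TREES):
    """Concatenate one one-hot block per tree, marking the leaf the input ends in."""
    features = []
    for leaf_of, n_leaves in trees:
        one_hot = [0] * n_leaves
        one_hot[leaf_of(x)] = 1
        features.extend(one_hot)
    return features


features = leaf_features({"bm25": 7.2, "cosine": 0.8})
```

With many trees the resulting vector is long and mostly zeros, which is the sparsity the notes refer to.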

## FAQ

- What kind of metadata is useful to consider in RAG?
    - Consider authoritative sources for this domain (e.g., doctors in health domain), rather than just "drag up Reddit text".
    - Title and other metadata. It depends on use case.
- Calibration of different indices.
    - Different document indices are not aligned in terms of similarity scores.
    - The goal is to have confidence scores for how likely the recommendation is to be good.
    - Answer: it is difficult. It might be a learning task, e.g., use GBDT to combine the scores.
        - But then you need to build training data to train the model.
        - An LLM can be used to generate this data.
- Efficacy of re-rankers.
    - They can help, but can make the response slower.
- Combining similarity with interaction data
    - It becomes a learning-to-rank problem.
        - The type of interaction is used as the label.
        - You can use GBDT, including the semantic score as one more feature.
- Jason Liu's blog on the value of generating structured summaries and reports for decision makers instead of RAG.
- Future progress
    - Use models with larger vocabularies, beyond BERT trained on 2018 data
    - Not excited about using longer contexts because they provide lower precision.
- Query search expansion, query understanding
    - BM25 + reranker
- Use case with lots of jargon and out-of-vocabulary words
    - Need to use a hybrid approach (keyword search + embedding), since embeddings handle such terms poorly.
    - Sometimes we need to ignore the embedding altogether.
- ColBERT
    - Yes, with pretraining or larger vocabularies.
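
The latency trade-off of re-rankers mentioned in this FAQ is usually managed with a two-stage pipeline: a cheap first stage (for example BM25) selects a small candidate set, and the expensive re-ranker only sees those candidates. The sketch below is an illustration under assumptions: both scoring functions are toy stand-ins, and the call counter exists only to show how many times the expensive scorer runs.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         first_stage_k=100, final_k=10):
    """Rank everything with the cheap scorer, then re-rank only the top candidates."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:first_stage_k]
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]


def term_overlap(query, doc):
    """Toy first stage: number of distinct query terms present in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


calls = {"n": 0}


def toy_reranker(query, doc):
    """Hypothetical stand-in for a slow model; here it simply prefers shorter documents."""
    calls["n"] += 1
    return -len(doc)


docs = [
    "rag retrieval basics",
    "cooking pasta at home",
    "retrieval augmented generation with rag and bm25",
    "gardening tips",
]
top = retrieve_then_rerank("rag retrieval", docs, term_overlap, toy_reranker,
                           first_stage_k=2, final_k=1)
```

Here the re-ranker is called twice rather than four times; `first_stage_k` is the knob that trades quality against response time.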

## Further reading

- [Systematically improving RAG applications](https://www.youtube.com/watch?v=RrDBV6odPKo)
- [Beyond the basics of Retrieval for Augmenting Generation](https://www.youtube.com/watch?v=0nA5QG3087g)
- [Modern Information Retrieval Evaluation in the RAG Era](https://www.youtube.com/watch?v=Trps2swgeOg)
- [evaluating RAGs](https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/)
- [6 RAG Evals](https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/)
- [RAG is dead](https://pashpashpash.substack.com/p/why-i-no-longer-recommend-rag-for)
- [RAG is not dead](https://hamel.dev/notes/llm/rag/not_dead.html)
- [LangChain's retrieval](https://docs.langchain.com/oss/python/langchain/retrieval)
- [OpenAI's retrieval](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb?ref=blog.langchain.com)
- [RAG Theoretical concepts](https://www.youtube.com/watch?v=rhZgXNdhWDY)
- [Building and Evaluating Advanced RAG Applications - DeepLearning.AI course](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/)