# Back to Basics for RAG

[video](https://parlance-labs.com/education/rag/jo.html)

## Demystifying RAG

- It is basically "stuffing text into the LLM prompt".
    - Typical use case: Q&A or search:
        - Ask open-ended question.
        - Retrieve content related to this question
        - Use this content as context by stuffing it into the LLM prompt
        - Result: the LLM response will be "grounded" in this context.
        - It is not hallucination-free, but it can improve the accuracy of the generated answer.

- Not necessarily tied to Q&A or search.
    - Example: a labeller. Retrieve positive and negative examples from a dataset and present them in the prompt so the LLM labels the remaining items following those examples.
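
The "stuffing" step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the word-overlap `retrieve` function, the `build_prompt` helper, the prompt wording, and the sample documents are all assumptions made for the example, and the final call to an LLM is left out.

```python
def retrieve(question, documents, k=2):
    """Toy retriever (assumption): rank documents by words shared with the question."""
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the top-k documents that share at least one word with the question
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, context_docs):
    """Stuff the retrieved text into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical mini-corpus
documents = [
    "BM25 is a ranking function based on term frequency and document length.",
    "Vector databases store dense embeddings for approximate search.",
    "Paris is the capital of France.",
]
question = "What is the capital of France?"
retrieved = retrieve(question, documents)
prompt = build_prompt(question, retrieved)
print(prompt)  # this string is what would be sent to the LLM
```

Swapping `retrieve` for a search engine or a vector index leaves `build_prompt` unchanged, which is why the retrieval source can be treated as a separate component.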

## Architecture

- Orchestration
- Evaluation
- Prompt
- LLM
- State (retrieval sources):
    - File, Search Engine, Vector Database, Database, Numpy

## Evaluating Information Retrieval Systems

- Query
- Retrieval process
- Result: ranked list of documents
- Evaluation: relevance of the ranking with respect to the query.
    - Metrics: binary, non-binary
    - Benchmarks: TREC, MS MARCO, BEIR, ...
    - Critical: build your own relevance dataset
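
To make the binary versus non-binary distinction concrete, here is a small sketch of three common ranking metrics. The notes do not name specific metrics, so the choice of precision@k, reciprocal rank, and nDCG is an assumption, as are the document ids and relevance labels. The nDCG variant below uses the graded label directly as the gain; other formulations use `2**gain - 1`.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Binary metric: fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """Binary metric: 1 / rank of the first relevant result, 0 if none is found."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, graded, k):
    """Non-binary metric: DCG normalised by the DCG of the ideal ordering."""
    gains = [graded.get(d, 0) for d in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(graded.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking and labels for a single query
ranked = ["d3", "d1", "d7", "d2"]
binary_relevant = {"d1", "d2"}
graded_relevance = {"d1": 3, "d2": 1, "d3": 0}

p_at_2 = precision_at_k(ranked, binary_relevant, k=2)
rr = reciprocal_rank(ranked, binary_relevant)
ndcg = ndcg_at_k(ranked, graded_relevance, k=4)
print(p_at_2, rr, round(ndcg, 3))
```

Averaging these values over every query in a relevance dataset gives one number per metric for comparing retrieval configurations.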

## Relevance dataset

- If we have real queries from production, spend time labelling results of those queries.
- Otherwise, ask an LLM to generate realistic synthetic queries.
- Ideally use a static collection: no additions while evaluating and comparing, to keep results consistent.
- Use LLM judges to evaluate results.
    - Need to find an appropriate prompt so that the judge correlates with human judgement.
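
One way to check whether a judge prompt tracks human judgement is to label the same query-document pairs both ways and measure agreement. The sketch below uses Cohen's kappa, which is a choice made for this example and not something the notes prescribe. The label lists are made-up illustrative data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both raters pick the same label by chance
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical relevance labels (1 = relevant, 0 = not relevant) for 8 pairs
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
judge_labels = [1, 1, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")
```

Re-running this after each prompt change shows whether the judge is moving closer to or further from the human labels.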

## Representational approaches

- Avoid scoring all documents (e.g., at web scale).
- Index the documents and score only a subset.
- Approaches:
    - Sparse, using inverted index.
        - Top-k retrieval algorithms: WAND, ...
        - Supervised (SPLADE) or unsupervised (TF-IDF)
    - Dense:
        - Vector index
        - Accelerated search (approximate)
        - Supervised via transfer learning (text embedding)
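
The sparse approach can be illustrated with a toy inverted index and unsupervised TF-IDF scoring. This is a sketch under simplifying assumptions: whitespace tokenisation, raw term frequency, `log(N / df)` as the IDF, and a hypothetical three-document corpus. It does not implement WAND or any other top-k pruning algorithm; it only shows that documents sharing no term with the query are never scored.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs, k=3):
    """Score only the documents found in the postings lists of the query terms."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical corpus
docs = {
    "d1": "sparse retrieval uses an inverted index",
    "d2": "dense retrieval uses a vector index",
    "d3": "cats sleep most of the day",
}
index = build_index(docs)
results = search("inverted index", index, n_docs=len(docs))
print(results)  # d3 never appears: it shares no term with the query
```

A dense retriever follows the same pattern with a vector index in place of the postings lists and an approximate nearest-neighbour search in place of the term lookup.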

## Dense representations

- Encoder / Transformer style:
    - Tokenize input text into discrete vocabulary.
    - Pretrained learned representations of each token.
    - Feed into encoder. For each token get an output vector.
    - Pooling stage: average all token vectors into a single vector representation.
        - Weakest aspect: averaging dilutes the representation and lowers precision. Need a shrinking mechanism?

- Baseline / benchmark: BM25
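
The pooling stage described above can be sketched without a real encoder. The token vectors below are hand-written stand-ins for encoder output, and the attention mask that skips padding positions is an assumption borrowed from common practice, not something the notes specify. The example is built so that two different token sequences pool to vectors pointing in the same direction, which is one way to see the dilution problem.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no non-padding tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-ins for encoder output: one 2-dimensional vector per token position
text_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last position is padding
mask_a = [1, 1, 0]
text_b = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]]
mask_b = [1, 1, 1]

pooled_a = mean_pool(text_a, mask_a)
pooled_b = mean_pool(text_b, mask_b)
similarity = cosine_similarity(pooled_a, pooled_b)
print(pooled_a, pooled_b, similarity)
```

Because per-token detail is averaged away like this, comparing a dense model against a BM25 baseline on your own relevance dataset is a useful sanity check.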

## Further reading

- [Why fine-tuning is dead](https://www.youtube.com/watch?v=h1c_jmk97Ss)
- [Systematically improving RAG applications](https://www.youtube.com/watch?v=RrDBV6odPKo)
- [Beyond the basics of Retrieval for Augmenting Generation](https://www.youtube.com/watch?v=0nA5QG3087g)
- [Modern Information Retrieval Evaluation in the RAG Era](https://www.youtube.com/watch?v=Trps2swgeOg)
- [Context Rot: When Long Context Fails](https://www.youtube.com/watch?v=3s_N60u0jEY)
- [Evaluating RAGs](https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/)
- [6 RAG Evals](https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/)
- [RAG is dead](https://pashpashpash.substack.com/p/why-i-no-longer-recommend-rag-for)
- [RAG is not dead](https://hamel.dev/notes/llm/rag/not_dead.html)