# Retrieval-Augmented Generation (RAG) Pipeline

## What is RAG?
RAG (Retrieval-Augmented Generation) was introduced by Lewis et al. (2020) in the paper *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*.
Paper: https://arxiv.org/abs/2005.11401

**Core idea**:
- Combine a retrieval model (such as BM25 or DPR) with a generator such as BART or GPT-2.
- This lets the model generate answers informed by relevant textual evidence.

## RAG components
- **Retriever**: given a query, searches for relevant documents and selects the top-k.
- **Generator**: generates an answer conditioned on the query plus the retrieved documents.
- **Fusion**: how the retrieved documents are combined during generation, for example by concatenation or by averaging logits.

In this notebook we simplify by using BM25 and GPT-2, with no additional training.

## RAG equation

$$
P(y \mid q) = \sum_{i=1}^k P(d_i \mid q)\; P(y \mid q, d_i)
$$

- $P(d_i \mid q)$: probability that document $d_i$ is relevant to the query, as given by the retriever.  
- $P(y \mid q, d_i)$: probability of generating the sequence $y$ conditioned on the query $q$ and the document $d_i$, as estimated by the generator.

**Explanation**: 

First the retriever assigns each document a weight according to its relevance, then the generator produces the answer using each document as context. Finally, RAG combines these contributions in a weighted sum over the $k$ retrieved documents, so that generation is anchored directly to the retrieved evidence.
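
As a small numeric illustration of the weighted sum, the sketch below normalizes raw retriever scores with a softmax and then marginalizes over the documents. The scores and the per-document generation probabilities are made-up values, and the softmax normalization is an assumption made for illustration. Note that the pipeline in this notebook concatenates the retrieved documents into a single prompt rather than computing this sum.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw retriever scores into a distribution P(d_i | q)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_likelihood(doc_probs: List[float], gen_probs: List[float]) -> float:
    """P(y | q) = sum_i P(d_i | q) * P(y | q, d_i)."""
    return sum(p_d * p_y for p_d, p_y in zip(doc_probs, gen_probs))


# Hypothetical values for k = 3 retrieved documents.
retriever_scores = [2.0, 1.0, 0.0]
doc_probs = softmax(retriever_scores)  # P(d_i | q)
gen_probs = [0.30, 0.10, 0.05]         # P(y | q, d_i), made up

p_y = marginal_likelihood(doc_probs, gen_probs)
print(f"P(y | q) = {p_y:.4f}")
```

Because the weights sum to 1, the result always lies between the smallest and the largest $P(y \mid q, d_i)$, pulled toward the documents the retriever ranks highest.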



## Pipeline structure in this notebook

The corpus is stored in plain text files, with each line treated as one 'document'. Two additional files, queries and references, are used mainly to test the model.

- `corpus.txt`: collection of documents.
- `queries.txt`: questions to answer.
- `references.txt`: expected answers (used to compute BLEU).

**Modules**:
- `preprocess_index.py`: tokenizes the documents and indexes them with BM25.
- `retriever.py`: retrieves the top-k documents for a query.
- `generator.py`: generates an answer using GPT-2.
- `evaluator.py`: computes BLEU for each answer.


## Indexing and tokenizing documents

In [1]:
from typing import List
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi
import nltk

nltk.download("punkt")


def load_documents(file_path: str) -> List[str]:
    """Read a text file and return its non-empty lines, stripped."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def preprocess_documents(docs: List[str]) -> List[List[str]]:
    """Lowercase and tokenize each document."""
    return [word_tokenize(doc.lower()) for doc in docs]


def build_bm25(tokenized_docs: List[List[str]]) -> BM25Okapi:
    """Build a BM25 index over the tokenized documents."""
    return BM25Okapi(tokenized_docs)


Each line is cleaned with `strip()`, lowercased, and tokenized; a BM25 index is then built over the resulting token lists.
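
To make the BM25 index less of a black box, here is a minimal pure-Python sketch of BM25 scoring. It is not the `rank_bm25` implementation: the IDF variant, the `k1` and `b` values, and the whitespace tokenizer (in place of `word_tokenize`) are assumptions chosen to keep the example self-contained, so the absolute scores will not match `BM25Okapi`.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive (an assumed variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Three sentences taken from the corpus used in this notebook.
corpus = [
    "BM25 is an information retrieval algorithm based on term frequency.",
    "Artificial intelligence is transforming many industries.",
    "Information retrieval is key to search engines.",
]
tokenized = [doc.lower().split() for doc in corpus]
scores = bm25_scores("what is bm25".split(), tokenized)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

The rare term `bm25` carries most of the score, while the common term `is` contributes little because it appears in every document.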

## Retrieving the top-k documents for a query

In [4]:
from typing import List
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi


class BM25Retriever:
    def __init__(self, bm25: BM25Okapi, original_docs: List[str], k: int):
        self.bm25 = bm25
        self.original_docs = original_docs
        self.k = k

    def retrieve(self, query: str) -> List[str]:
        tokenized_query = word_tokenize(query.lower())
        return self.bm25.get_top_n(tokenized_query, self.original_docs, n=self.k)


This retrieval component takes a query and returns the top-k most relevant documents according to BM25. The query goes through the same lowercasing and tokenization as the corpus.
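
Conceptually, top-k retrieval is a sort over per-document scores. The sketch below shows that step in isolation; the helper name `top_k` and the toy scores are hypothetical, and this is not how `rank_bm25` implements `get_top_n` internally. Unlike `get_top_n`, it also returns the scores, which helps spot weakly related documents that are included only to fill the k slots.

```python
from typing import List, Tuple


def top_k(scores: List[float], docs: List[str], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring documents with their scores."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for three documents.
docs = ["doc a", "doc b", "doc c"]
scores = [0.2, 1.3, 0.7]

selected = top_k(scores, docs, k=2)
print(selected)  # [('doc b', 1.3), ('doc c', 0.7)]
```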

## Generating answers with GPT-2

In [5]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch


class GPT2Generator:
    def __init__(self, max_tokens=50, temperature=0.7, top_p=0.8):
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.model.eval()
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.top_p = top_p

    def generate(self, query: str, docs: list) -> str:
        # Build a simple QA-style prompt; only the last piece needs an f-string.
        prompt = "Context:\n- " + "\n- ".join(docs) + f"\nQuestion: {query}\nAnswer:"
        inputs = self.tokenizer(
            prompt, return_tensors="pt", truncation=True, max_length=512
        )
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=self.max_tokens,
                do_sample=False,
                eos_token_id=self.tokenizer.eos_token_id,
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True).strip()


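The class loads the `gpt2` model and its tokenizer from Hugging Face (`transformers`) and stores the following generation hyperparameters:

- max_tokens: maximum number of new tokens to generate.
- temperature: controls sampling randomness.
- top_p: nucleus (top-p) sampling threshold.

`generate` builds a prompt with 'Context', 'Question' and 'Answer' sections, truncates it to at most 512 tokens, and decodes greedily (`do_sample=False`), picking the most probable token at each step. Because sampling is disabled and `temperature` and `top_p` are never passed to `model.generate`, those two settings have no effect here. The decoded string also includes the prompt itself, so every generated response begins with the 'Context' block.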
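
Since the decoded text echoes the prompt and GPT-2 tends to keep going with invented follow-up questions, one possible post-processing step is to strip the prompt and cut at the next `Question:` marker. The helper `extract_answer` below is hypothetical and not part of `generator.py`; it assumes the decoded text starts with the exact prompt string, which may not hold if the prompt was truncated. An alternative is to slice the output tensor at the prompt length before decoding.

```python
def extract_answer(generated: str, prompt: str, stop: str = "\nQuestion:") -> str:
    """Keep only the first answer that follows the prompt."""
    # Drop the echoed prompt when the decoded text starts with it.
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Cut at the first follow-up question invented by the model.
    cut = text.find(stop)
    if cut != -1:
        text = text[:cut]
    return text.strip()


# Prompt and output reproduced from the first top-1 run in this notebook.
prompt = (
    "Context:\n"
    "- BM25 is an information retrieval algorithm based on term frequency.\n"
    "Question: What is BM25 and how is it used in NLP?\n"
    "Answer:"
)
generated = prompt + (
    " BM25 is a term retrieval algorithm based on term frequency.\n"
    "Question: What is the difference between BM25 and NLP?\n"
    "Answer: BM25 is a term retrieval algorithm based on term frequency.\n"
    "Question: What is the difference between N"
)

answer = extract_answer(generated, prompt)
print(answer)  # BM25 is a term retrieval algorithm based on term frequency.
```

Scoring only this extracted answer, rather than the full echoed text, would make the BLEU comparison against the reference more meaningful.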

## Evaluation

In [6]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize


class Evaluator:
    def compute_bleu(self, references: list, candidates: list) -> float:
        tokenized_refs = [[word_tokenize(ref.lower())] for ref in references]
        tokenized_cands = [word_tokenize(cand.lower()) for cand in candidates]
        chencherry = SmoothingFunction()
        return sentence_bleu(
            tokenized_refs[0], tokenized_cands[0], smoothing_function=chencherry.method1
        )


The BLEU score is computed between an answer generated by the model (candidate) and a reference answer (reference).

Both answers are lowercased and tokenized, and smoothing (`method1`) keeps BLEU from collapsing to 0 when there are no matching 4-grams. One limitation: although the method accepts lists, it only scores the first reference against the first candidate.
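
The core quantity behind BLEU is the modified (clipped) n-gram precision. The sketch below computes it for a single reference; it is a simplified illustration, not NLTK's implementation, and it leaves out the brevity penalty, the geometric mean over n, and smoothing. The example sentences are the classic toy case of a degenerate, repetitive candidate.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference: List[str], candidate: List[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()

p1 = modified_precision(reference, candidate, 1)  # 2 clipped matches out of 7
p2 = modified_precision(reference, candidate, 2)  # no bigram overlap
print(p1, p2)
```

Clipping is what penalizes repetition, and a zero precision at any order is what would drive unsmoothed BLEU to 0, which is the case the smoothing function guards against.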


## Running the full pipeline

In [7]:
documents = load_documents("../src/pipeline_rag/data/corpus.txt")
preprocessed_docs = preprocess_documents(documents)
bm25 = build_bm25(preprocessed_docs)

# load queries and references
queries = load_documents("../src/pipeline_rag/data/queries.txt")
references = load_documents("../src/pipeline_rag/data/references.txt")

# different top-k values
top_k_values = [1, 2, 3, 5]

results = []

generator = GPT2Generator(max_tokens=50, temperature=0.7, top_p=0.8)
evaluator = Evaluator()

for query, reference in zip(queries, references):
    print(f"\n=== Consulta: {query} ===")
    for k in top_k_values:
        print(f"\n--- Top-{k} Documentos ---")
        retriever = BM25Retriever(bm25, documents, k)
        top_docs = retriever.retrieve(query)

        for i, doc in enumerate(top_docs):
            print(f"[{i+1}] {doc}")

        response = generator.generate(query, top_docs)

        print("\n--- Respuesta Generada ---")
        print(response)

        bleu_score = evaluator.compute_bleu([reference], [response])
        print("\n--- Evaluación BLEU ---")
        print(f"BLEU: {bleu_score:.4f}")

        results.append([query, k, response, reference, bleu_score])


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



=== Consulta: What is BM25 and how is it used in NLP? ===

--- Top-1 Documentos ---
[1] BM25 is an information retrieval algorithm based on term frequency.


--- Respuesta Generada ---
Context:
- BM25 is an information retrieval algorithm based on term frequency.
Question: What is BM25 and how is it used in NLP?
Answer: BM25 is a term retrieval algorithm based on term frequency.
Question: What is the difference between BM25 and NLP?
Answer: BM25 is a term retrieval algorithm based on term frequency.
Question: What is the difference between N

--- Evaluación BLEU ---
BLEU: 0.0177

--- Top-2 Documentos ---
[1] BM25 is an information retrieval algorithm based on term frequency.
[2] Artificial intelligence is transforming many industries.


--- Respuesta Generada ---
Context:
- BM25 is an information retrieval algorithm based on term frequency.
- Artificial intelligence is transforming many industries.
Question: What is BM25 and how is it used in NLP?
Answer: BM25 is a tool for the management of information. It is used to manage information in a way that is not possible with traditional information retrieval systems.
- BM25 is a tool for the management of information. It is used to manage information in

--- Evaluación BLEU ---
BLEU: 0.0171

--- Top-3 Documentos ---
[1] BM25 is an information retrieval algorithm based on term frequency.
[2] Artificial intelligence is transforming many industries.
[3] Information retrieval is key to search engines.


--- Respuesta Generada ---
Context:
- BM25 is an information retrieval algorithm based on term frequency.
- Artificial intelligence is transforming many industries.
- Information retrieval is key to search engines.
Question: What is BM25 and how is it used in NLP?
Answer: BM25 is a search engine that can be used to search for information. It is a search engine that can be used to search for information. It is a search engine that can be used to search for information. It is a search engine that can

--- Evaluación BLEU ---
BLEU: 0.0143

--- Top-5 Documentos ---
[1] BM25 is an information retrieval algorithm based on term frequency.
[2] Artificial intelligence is transforming many industries.
[3] Information retrieval is key to search engines.
[4] Ranking algorithms determine the relevance of a document to a query.
[5] Natural language processing enables machines to understand human language.


--- Respuesta Generada ---
Context:
- BM25 is an information retrieval algorithm based on term frequency.
- Artificial intelligence is transforming many industries.
- Information retrieval is key to search engines.
- Ranking algorithms determine the relevance of a document to a query.
- Natural language processing enables machines to understand human language.
Question: What is BM25 and how is it used in NLP?
Answer: BM25 is a search engine that can be used to search for keywords in a document.
Question: What is the difference between BM25 and NLP?
Answer: BM25 is a search engine that can be used to search for keywords in

--- Evaluación BLEU ---
BLEU: 0.0177

=== Consulta: How does GPT-2 generate text? ===

--- Top-1 Documentos ---
[1] Language models like GPT-2 can autonomously generate coherent text.


--- Respuesta Generada ---
Context:
- Language models like GPT-2 can autonomously generate coherent text.
Question: How does GPT-2 generate text?
Answer:
- The GPT-2 language model is a set of models that can generate text.
- The GPT-2 language model is a set of models that can generate text. - The GPT-2 language model is a set of

--- Evaluación BLEU ---
BLEU: 0.0042

--- Top-2 Documentos ---
[1] Language models like GPT-2 can autonomously generate coherent text.
[2] GPT-2 was trained on a large corpus of internet text.


--- Respuesta Generada ---
Context:
- Language models like GPT-2 can autonomously generate coherent text.
- GPT-2 was trained on a large corpus of internet text.
Question: How does GPT-2 generate text?
Answer:
- The corpus of internet text is a large corpus of text.
- The corpus of internet text is a large corpus of text.
- The corpus of internet text is a large corpus of text.
- The corpus of internet text is

--- Evaluación BLEU ---
BLEU: 0.0065

--- Top-3 Documentos ---
[1] Language models like GPT-2 can autonomously generate coherent text.
[2] GPT-2 was trained on a large corpus of internet text.
[3] Ranking algorithms determine the relevance of a document to a query.


--- Respuesta Generada ---
Context:
- Language models like GPT-2 can autonomously generate coherent text.
- GPT-2 was trained on a large corpus of internet text.
- Ranking algorithms determine the relevance of a document to a query.
Question: How does GPT-2 generate text?
Answer:
- The GPT-2 algorithm generates text based on the following:
- The text is a set of words that are related to the text.
- The text is a set of words that are related to the text.
- The

--- Evaluación BLEU ---
BLEU: 0.0071

--- Top-5 Documentos ---
[1] Language models like GPT-2 can autonomously generate coherent text.
[2] GPT-2 was trained on a large corpus of internet text.
[3] Ranking algorithms determine the relevance of a document to a query.
[4] Natural language processing enables machines to understand human language.
[5] Information retrieval is key to search engines.


--- Respuesta Generada ---
Context:
- Language models like GPT-2 can autonomously generate coherent text.
- GPT-2 was trained on a large corpus of internet text.
- Ranking algorithms determine the relevance of a document to a query.
- Natural language processing enables machines to understand human language.
- Information retrieval is key to search engines.
Question: How does GPT-2 generate text?
Answer:
- The GPT-2 algorithm generates text based on the following:
- The text is generated by a machine learning algorithm.
- The text is generated by a human.
- The text is generated by a machine learning algorithm.

--- Evaluación BLEU ---
BLEU: 0.0058

=== Consulta: Why is information retrieval important? ===

--- Top-1 Documentos ---
[1] Information retrieval is key to search engines.


--- Respuesta Generada ---
Context:
- Information retrieval is key to search engines.
Question: Why is information retrieval important?
Answer: Information retrieval is important because it allows us to understand the information we are looking for.
Question: What is the difference between information retrieval and search engines?
Answer: Information retrieval is a way to understand the information we are looking for.
Question

--- Evaluación BLEU ---
BLEU: 0.0063

--- Top-2 Documentos ---
[1] Information retrieval is key to search engines.
[2] BM25 is an information retrieval algorithm based on term frequency.


--- Respuesta Generada ---
Context:
- Information retrieval is key to search engines.
- BM25 is an information retrieval algorithm based on term frequency.
Question: Why is information retrieval important?
Answer: Information retrieval is important because it allows us to search for information that is not available to us.
Question: What is the difference between information retrieval and search engines?
Answer: Information retrieval is a search engine that uses a search engine to search for

--- Evaluación BLEU ---
BLEU: 0.0071

--- Top-3 Documentos ---
[1] Information retrieval is key to search engines.
[2] BM25 is an information retrieval algorithm based on term frequency.
[3] Artificial intelligence is transforming many industries.


--- Respuesta Generada ---
Context:
- Information retrieval is key to search engines.
- BM25 is an information retrieval algorithm based on term frequency.
- Artificial intelligence is transforming many industries.
Question: Why is information retrieval important?
Answer: Information retrieval is key to search engines.
- BM25 is an information retrieval algorithm based on term frequency.
- Artificial intelligence is transforming many industries.
Question: Why is information retrieval important?
Answer: Information retrieval is key to search engines

--- Evaluación BLEU ---
BLEU: 0.0067

--- Top-5 Documentos ---
[1] Information retrieval is key to search engines.
[2] BM25 is an information retrieval algorithm based on term frequency.
[3] Artificial intelligence is transforming many industries.
[4] Ranking algorithms determine the relevance of a document to a query.
[5] Natural language processing enables machines to understand human language.

--- Respuesta Generada ---
Context:
- Info

## Interpreting the results

| Component | Result | Interpretation |
|-----------|--------|----------------|
| BM25 retrieval | Retrieves the key documents well | Correct |
| GPT-2 generation | Repetitive, vague, poorly targeted | Limits BLEU |
| BLEU evaluation | Low scores overall | Expected given the textual differences |

This experiment shows that:

- BM25 works well as a simple retriever.
- GPT-2 is not ideal for QA tasks without fine-tuning.
- BLEU is useful as a metric, but very strict in a setup like this one, where the generator is an off-the-shelf GPT-2.
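
To compare the top-k settings more systematically, the `results` list built in the loop can be averaged per k. The sketch below assumes each row has the shape `[query, k, response, reference, bleu]`, as in the loop; the helper name `mean_bleu_by_k` is hypothetical. The BLEU values are copied from the first two queries in the output (the third is left out because its top-5 output is truncated), with responses and references replaced by empty strings for brevity.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_bleu_by_k(results: List[list]) -> Dict[int, float]:
    """Average the BLEU score over queries for each value of k."""
    by_k = defaultdict(list)
    for _query, k, _response, _reference, bleu in results:
        by_k[k].append(bleu)
    return {k: mean(scores) for k, scores in sorted(by_k.items())}


q1 = "What is BM25 and how is it used in NLP?"
q2 = "How does GPT-2 generate text?"
results = [
    [q1, 1, "", "", 0.0177], [q1, 2, "", "", 0.0171],
    [q1, 3, "", "", 0.0143], [q1, 5, "", "", 0.0177],
    [q2, 1, "", "", 0.0042], [q2, 2, "", "", 0.0065],
    [q2, 3, "", "", 0.0071], [q2, 5, "", "", 0.0058],
]

summary = mean_bleu_by_k(results)
for k, score in summary.items():
    print(f"k={k}: mean BLEU = {score:.4f}")
```

With only two queries, the differences between k values are too small to support a conclusion; the aggregation is mainly useful once more queries are evaluated.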

## Conclusions

- BM25 is effective for queries that are well represented in the corpus, but limited when semantic interpretation is required.
- GPT-2 is a capable generator, but it is not specialized for QA. Fine-tuning or more tightly controlled prompts are needed to improve accuracy.
- BLEU serves as the required formal metric, but it does not reflect well the semantic quality of answers generated by models like GPT-2.
- A natural next step is to improve the retriever's semantic coverage: this would yield better `top-k` results and give GPT-2 more relevant material from which to generate better answers.