# Retrieval-Augmented Generation - On Research Papers

## Please note: this notebook uses the latest version of LangChain
### Chunking Technique - RecursiveCharacterTextSplitter with 800 chunk size and 200 overlap

### Embedding used - OpenAI's text-embedding-3-small - commonly used as it's fast https://docs.langchain.com/langsmith/semantic-search#steps

### Cross-encoder used - ms-marco-MiniLM-L-6-v2 - reference https://huggingface.co/cross-encoder - I believe HuggingFaceCrossEncoder is deprecated in the latest version of LangChain, so the model is loaded through sentence_transformers' CrossEncoder instead

### Vector store and retriever used - FAISS - lightweight and doesn't require sign-up, unlike hosted options such as Pinecone

In [None]:
# !pip install langchain langchain-community langchain-openai faiss-cpu pypdf sentence_transformers

In [10]:
import os
import numpy as np
import json
import glob
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

from sentence_transformers import CrossEncoder

In [22]:
openai_api_key = "<API_KEY>"

os.environ["OPENAI_API_KEY"] = openai_api_key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [12]:
DATA_DIR = "RAG_papers"

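def load_all_pdfs(path):
    """Load every PDF in `path`, one Document per page, tagged with its file name."""
    docs = []
    # Sort so file order (and therefore chunk_index) is stable across runs
    for pdf_file in sorted(glob.glob(os.path.join(path, "*.pdf"))):
        loader = PyPDFLoader(pdf_file)
        pages = loader.load()
        for p in pages:
            # Replace the full path with the bare file name for readable citations
            p.metadata["source"] = os.path.basename(pdf_file)
        docs.extend(pages)
    return docs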

raw_docs = load_all_pdfs(DATA_DIR)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200
)

chunked_docs = splitter.split_documents(raw_docs)

# Attach chunk indices
for idx, doc in enumerate(chunked_docs):
    doc.metadata["chunk_index"] = idx
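
To make the `chunk_size` / `chunk_overlap` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); `sliding_window_chunks` is a hypothetical helper for illustration only, and the sample text is made up.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Made-up sample: 2000 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(sample)

# The last 200 characters of one chunk reappear at the start of the next
print(len(demo_chunks), demo_chunks[0][-200:] == demo_chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of storing some text twice.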

In [18]:
chunked_docs[:2]

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '1706.03762v7.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1', 'chunk_index': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\n

In [24]:
emb = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = FAISS.from_documents(chunked_docs, emb)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 12}  # over-fetch candidates; the reranker later keeps the best 5
)
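
Conceptually, the similarity retriever embeds the query and returns the k stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search over toy vectors. It assumes squared Euclidean distance as the metric and uses made-up 3-dimensional vectors; `squared_l2` and `top_k_nearest` are hypothetical helpers, not part of FAISS or LangChain.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_nearest(query_vec, doc_vecs, k=12):
    """Return the indices of the k vectors in doc_vecs closest to query_vec."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: squared_l2(query_vec, doc_vecs[i]))
    return ranked[:k]

# Toy "embeddings" (assumed values, for illustration only)
toy_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k_nearest(toy_query, toy_vecs, k=2)
print(nearest)
```

A real index avoids scanning every vector in pure Python, but the ranking idea is the same.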

In [70]:
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # "BAAI/bge-reranker-v2-m3" was a bit slow during testing


In [74]:
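def rerank_with_crossencoder(query, docs, top_k=5, batch_size=16):
    """Score each (query, chunk) pair with the cross-encoder and keep the top_k chunks."""
    if not docs:
        return []
    pairs = [[query, d.page_content] for d in docs]
    scores = reranker.predict(pairs, batch_size=batch_size)
    # argsort is ascending, so reverse it to put the highest score first
    order = np.argsort(scores)[::-1]
    return [docs[i] for i in order[:top_k]]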
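
The score-sort-truncate pattern used for reranking can be tried without downloading a model by swapping in a stand-in scorer. The sketch below uses a hypothetical `overlap_score` (the fraction of query words that appear in the passage) in place of the cross-encoder. It only illustrates the control flow and says nothing about the scores a real cross-encoder would produce; the passages are made up.

```python
def overlap_score(query, passage):
    """Stand-in relevance score: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def rerank_with_scorer(query, passages, scorer, top_k=5):
    """Score every passage against the query and return the top_k, best first."""
    scored = [(scorer(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]

# Made-up passages for illustration
toy_passages = [
    "the decoder attends to encoder outputs",
    "each encoder layer has two sub-layers",
    "bleu scores on translation tasks",
]

reranked = rerank_with_scorer("encoder layer sub-layers", toy_passages,
                              overlap_score, top_k=2)
print(reranked)
```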

In [76]:
def format_docs(docs, max_chars=900):
    """Build numbered context blocks for the prompt, plus a map from citation number to source."""
    blocks = []
    citation_map = {}
    for i, d in enumerate(docs, start=1):
        txt = d.page_content[:max_chars]
        md = d.metadata
        blocks.append(f"[{i}] Source: {md['source']} | chunk={md['chunk_index']} \n{txt}")
        citation_map[i] = {
            "source": md["source"],
            "chunk_index": md["chunk_index"],
            "excerpt": txt
        }
    return "\n\n---\n\n".join(blocks), citation_map


def generate_answer(question, docs):
    context, citation_map = format_docs(docs)
    
    prompt = f"""
You are a factual research assistant. 
Use ONLY the context blocks below to answer. 
Use inline citations like [1][2]. 
If unknown, say "Not answerable from documents."

Context:
{context}

Question: {question}

Answer with citations, then add a "SOURCES" section.
"""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    output = llm.invoke(prompt).content
    return output, citation_map

In [82]:
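def answer_query(question):
    # Stage 1: fetch the k=12 nearest chunks from FAISS
    initial_docs = retriever.invoke(question)
    # Stage 2: rerank with the cross-encoder and keep the best 5
    top_docs = rerank_with_crossencoder(question, initial_docs, top_k=5)

    # Stage 3: generate a cited answer from the reranked chunks
    answer, sources = generate_answer(question, top_docs)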
    print("\n\n--- RAG Answer ---")
    print(answer)
    print("\n\n--- SOURCES FROM PIPELINE ---")
    print(json.dumps(sources, indent=2))
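
Because the prompt asks for inline citations like [1][2], one optional sanity check is to confirm that every cited number maps to a context block that was actually passed to the model. The sketch below is an assumption about how such a check could look; `find_unknown_citations` is a hypothetical helper that is not called anywhere in this notebook, and the answer text and map are made up.

```python
import re

def find_unknown_citations(answer_text, citation_map):
    """Return cited numbers (e.g. 7 for "[7]") that have no entry in citation_map."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer_text)}
    return sorted(cited - set(citation_map))

# Made-up answer and citation map for illustration
toy_answer = "Each layer has two sub-layers [1][2]. See also [7]."
toy_map = {1: {"source": "a.pdf"}, 2: {"source": "a.pdf"}}

unknown = find_unknown_citations(toy_answer, toy_map)
print(unknown)
```

An empty list means every inline citation points at a block from the pipeline; a non-empty list flags numbers the model made up.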

### Test Question 1

In [94]:
question = "What are the two sub-layers in each encoder layer of the Transformer model?"
answer_query(question)



--- RAG Answer ---
The two sub-layers in each encoder layer of the Transformer model are:

1. A multi-head self-attention mechanism.
2. A position-wise fully connected feed-forward network.

Additionally, residual connections and layer normalization are applied around each of these sub-layers [1][2].

**SOURCES**
[1] Source: 1706.03762v7.pdf | chunk=13  
[2] Source: 1706.03762v7.pdf | chunk=14


--- SOURCES FROM PIPELINE ---
{
  "1": {
    "source": "1706.03762v7.pdf",
    "chunk_index": 13,
    "excerpt": "Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, posi

### Test Question 2

In [96]:
question = "What are the main components of RAG model, and how do they interact?"
answer_query(question)



--- RAG Answer ---
The RAG (Retrieval-Augmented Generation) model consists of two main components: a retriever and a generator. 

1. **Retriever (pη(z | x))**: This component is responsible for returning distributions over text passages given a query (x). It retrieves the top-K documents that are relevant to the input query, which are then used to inform the generation process.

2. **Generator (pθ(y i | x, z, y 1:i−1))**: This component generates the output sequence based on the retrieved documents. It treats the retrieved document as a latent variable and produces the output sequence probability by marginalizing over the top-K documents retrieved by the retriever. The generator uses the retrieved documents to generate each token in the output sequence, taking into account the previously generated tokens.

The interaction between these components occurs in the following way: the retriever first identifies the most relevant documents for a given query, and then the generator uses thes

### Test Question 3

In [98]:
question = "Explain how positional encoding is implemented in Transformers and why it is necessary."
answer_query(question)



--- RAG Answer ---
Positional encoding in Transformers is implemented by adding sinusoidal functions to the input embeddings at the bottom of the encoder and decoder stacks. This is necessary because the Transformer architecture does not inherently understand the order of the input tokens, as it processes all tokens simultaneously. The positional encodings provide information about the relative or absolute position of tokens in the sequence, allowing the model to learn to attend to tokens based on their positions.

The specific implementation uses sine and cosine functions of different frequencies, defined as follows:

- For even dimensions: \( P E(pos, 2i) = \sin(pos / 10000^{2i/d_{model}}) \)
- For odd dimensions: \( P E(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}}) \)

Here, \( pos \) is the position of the token in the sequence, and \( i \) is the dimension index. This choice of sinusoidal functions allows the model to easily learn to attend by relative positions, as the position

### Test Question 4

In [100]:
question = "Describe the concept of multi-headed attention in the transformer architecture. Why is it beneficial?"
answer_query(question)



--- RAG Answer ---
Multi-headed attention is a key component of the Transformer architecture that allows the model to jointly attend to information from different representation subspaces at various positions. Instead of using a single attention mechanism with dmodel-dimensional keys, values, and queries, multi-headed attention projects these inputs h times with different learned linear transformations. This results in h parallel attention layers, or heads, which can capture diverse aspects of the input data simultaneously [4].

The benefits of multi-headed attention include:
1. **Enhanced Representation**: By attending to different parts of the input sequence through multiple heads, the model can learn richer representations. Each head can focus on different features or relationships within the data, which improves the model's ability to understand complex patterns [4].
2. **Reduced Averaging Effects**: In a single attention head, averaging can inhibit the model's ability to capture

### Test Question 5

In [102]:
question = "What is few-shot learning, and how does GPT-3 implement it during inference?"
answer_query(question)



--- RAG Answer ---
Few-shot learning (FS) refers to a setting where a model is provided with a few demonstrations of a task at inference time without any weight updates allowed. In the context of GPT-3, this means that the model can perform tasks based on a limited number of examples presented to it during the inference phase, relying solely on its pre-trained knowledge and the context provided by these demonstrations [4]. GPT-3, which is an autoregressive language model with 175 billion parameters, implements few-shot learning by taking these demonstrations and generating responses based on them, achieving strong performance across various natural language processing (NLP) tasks such as translation, question-answering, and reasoning tasks [5].

However, there are limitations to few-shot learning in GPT-3, including uncertainty about whether the model learns new tasks "from scratch" at inference time or simply recognizes tasks it has encountered during training. This ambiguity highli