# RAG Pipeline Demo Notebook
## Week 4: Retrieval-Augmented Generation with arXiv Papers

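This notebook demonstrates the retrieval side of the RAG pipeline end to end: downloading papers, chunking and embedding them, building a FAISS index, and inspecting the results for example queries.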

## 1. Setup and Installation

First, install the required packages:

In [None]:
# Install required packages
!pip install sentence-transformers faiss-cpu pymupdf arxiv tqdm fastapi uvicorn numpy pandas

## 2. Import Libraries

In [1]:
from rag_pipeline import ArXivRAGPipeline, download_arxiv_papers
from pathlib import Path
import json
from typing import List, Dict
import pandas as pd


## 3. Download arXiv Papers

Download 50 recent cs.CL (Computation and Language) papers from arXiv:

In [2]:
# Download papers (about one minute in the run shown below; time varies with network speed)
pdf_paths = download_arxiv_papers(
    category='cs.CL',
    max_results=50,
    output_dir='arxiv_papers'
)

print(f"Downloaded {len(pdf_paths)} papers")
print(f"First few papers: {[Path(p).stem for p in pdf_paths[:5]]}")

Downloading 50 papers from cs.CL...


Downloading: 100%|██████████| 50/50 [01:00<00:00,  1.20s/it]

Downloaded 50 papers to arxiv_papers
Downloaded 50 papers
First few papers: ['2511.10645v1', '2511.10643v1', '2511.10628v1', '2511.10621v1', '2511.10618v1']





## 4. Initialize and Build the RAG Pipeline

In [3]:
# Initialize the pipeline
pipeline = ArXivRAGPipeline(model_name='all-MiniLM-L6-v2')

# Process all papers
pipeline.process_papers(
    pdf_paths=pdf_paths,
    chunk_size=512,
    overlap=50
)

Loading embedding model: all-MiniLM-L6-v2

Processing 50 papers...


Processing PDFs: 100%|██████████| 50/50 [00:04<00:00, 11.78it/s]


Generating embeddings for 1074 chunks...


Batches: 100%|██████████| 34/34 [00:25<00:00,  1.32it/s]

Building FAISS index with dimension 384...
Index built with 1074 vectors

Pipeline ready with 1074 chunks from 50 papers
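
The `chunk_size=512` and `overlap=50` arguments suggest a sliding window in which consecutive chunks share 50 units of text. The internals of `process_papers()` are not shown in this notebook, so the following is only a minimal sketch of that idea, assuming whitespace-separated words as the unit. `chunk_words` is a hypothetical helper, not part of `rag_pipeline`.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of whitespace-separated words.

    Hypothetical helper: the real pipeline may count tokens or characters instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy example with tiny numbers so the overlap is easy to see
demo = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(demo)
```

In the toy example each chunk starts with the last word of the previous one, which is the role the 50-unit overlap plays at full scale: text that straddles a chunk boundary still appears intact in at least one chunk.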





## 5. Save the Index for Future Use

In [4]:
# Save the index and data
pipeline.save_index(
    index_path='faiss_index.bin',
    chunks_path='chunks.json',
    metadata_path='metadata.json'
)

print("\nIndex saved successfully!")
print(f"Total chunks indexed: {len(pipeline.chunks)}")
print(f"Total papers processed: {len(set(m['paper'] for m in pipeline.metadata))}")

Saved index to faiss_index.bin
Saved chunks to chunks.json
Saved metadata to metadata.json

Index saved successfully!
Total chunks indexed: 1074
Total papers processed: 50


## 6. Example Queries - Retrieval Testing

Let's test the retrieval system with various queries related to NLP and machine learning:

In [5]:
# Define some example queries
example_queries = [
    "What are the latest advances in attention mechanisms for transformers?",
    "How do large language models handle multilingual text?",
    "What are the common evaluation metrics for machine translation?",
    "Explain the architecture of BERT and its variants",
    "What techniques are used for few-shot learning in NLP?",
    "How does reinforcement learning apply to language generation?",
    "What are the challenges in neural machine translation?",
    "Describe methods for improving model interpretability"
]

print(f"Testing with {len(example_queries)} example queries...\n")

Testing with 8 example queries...
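
The query cells that follow all repeat the same printing loop. One way to avoid the repetition, sketched here under the assumption that every result carries the `rank`, `distance`, `chunk`, and `metadata` keys used in those cells, is a small formatting helper. `format_results` and the sample data are illustrative only.

```python
from typing import Any, Dict, List


def format_results(results: List[Dict[str, Any]], preview_chars: int = 300) -> str:
    """Render search results in the same layout the notebook prints by hand."""
    lines = []
    for result in results:
        lines.append(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
        lines.append(f"Paper: {result['metadata']['paper']}")
        # chunk_id is only printed when the metadata carries it
        if "chunk_id" in result["metadata"]:
            lines.append(f"Chunk ID: {result['metadata']['chunk_id']}")
        lines.append(f"\nText Preview (first {preview_chars} chars):")
        lines.append(result["chunk"][:preview_chars] + "...")
        lines.append("-" * 80)
    return "\n".join(lines)


# Made-up result, shaped like the dictionaries used in this notebook
sample = [{
    "rank": 1,
    "distance": 0.5,
    "chunk": "example chunk text",
    "metadata": {"paper": "0000.00000v1", "chunk_id": 0},
}]
formatted = format_results(sample, preview_chars=7)
print(formatted)
```

With the real pipeline, each query cell would reduce to `print(format_results(pipeline.search(query, k=3)))`.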



### Query 1: Attention Mechanisms

In [6]:
query = example_queries[0]
print(f"Query: {query}\n")
print("=" * 80)

results = pipeline.search(query, k=3)

for result in results:
    print(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
    print(f"Paper: {result['metadata']['paper']}")
    print(f"Chunk ID: {result['metadata']['chunk_id']}")
    print(f"\nText Preview (first 300 chars):")
    print(result['chunk'][:300] + "...")
    print("-" * 80)

Query: What are the latest advances in attention mechanisms for transformers?


[Rank 1] Distance: 1.0285
Paper: 2511.10618v1
Chunk ID: 21

Text Preview (first 300 chars):
vj) = A(qi, kl, vl) ∀i, j, l and necessarily that output activations from the attention layer are identical across all token embeddings. As norms and feedforward layers in the transformer are identical between token indices, outputs will be identical as well. This provides a clear reason as to why a...
--------------------------------------------------------------------------------

[Rank 2] Distance: 1.0932
Paper: 2511.10566v1
Chunk ID: 12

Text Preview (first 300 chars):
that this distinctive behavior across a wide range of model architectures, spanning several vision and language datasets. Overall, our findings uncover a crucial connection on how layer normalization impacts learning and memorization in transformer models, with its broader impacts discussed in Appen...
--------------------------------------------------------------------------------

### Query 2: Multilingual Models

In [7]:
query = example_queries[1]
print(f"Query: {query}\n")
print("=" * 80)

results = pipeline.search(query, k=3)

for result in results:
    print(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
    print(f"Paper: {result['metadata']['paper']}")
    print(f"\nText Preview (first 300 chars):")
    print(result['chunk'][:300] + "...")
    print("-" * 80)

Query: How do large language models handle multilingual text?


[Rank 1] Distance: 0.7551
Paper: 2511.10229v1

Text Preview (first 300 chars):
separability in LangGPS es- sentially reflects the model’s own multilingual modeling capability. As multilingual capacities vary across models, LangGPS requires model-specific data selection, lacking the simplicity of a one-size-fits-all solution. Secondly, our ex- periments were conducted on LLaMA-...
--------------------------------------------------------------------------------

[Rank 2] Distance: 0.7719
Paper: 2511.10229v1

Text Preview (first 300 chars):
t-SNE (Van der Maaten and Hinton 2008) to visual- ize the representations of 200 sentences sampled from XNLI in parallel across English, Chinese and Arabic. As shown in Figure 5 (1) (2) (3), multilingual SFT leads to clearer linguistic boundaries and more compact clusters in the model’s representati...
--------------------------------------------------------------------------------

[Rank 3] ... (output truncated)

### Query 3: Evaluation Metrics

In [8]:
query = example_queries[2]
print(f"Query: {query}\n")
print("=" * 80)

results = pipeline.search(query, k=3)

for result in results:
    print(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
    print(f"Paper: {result['metadata']['paper']}")
    print(f"\nText Preview (first 300 chars):")
    print(result['chunk'][:300] + "...")
    print("-" * 80)

Query: What are the common evaluation metrics for machine translation?


[Rank 1] Distance: 0.7956
Paper: 2511.10338v1

Text Preview (first 300 chars):
a threshold of 0.4 was identified as a coverage gap. For each gap, we used SERP API to fetch and scrape new doc- uments. These documents are then used as a source for the document grounded generations. Table 25 shows broad as well as specific topic distributions (%) of the synthetic data generated. ...
--------------------------------------------------------------------------------

[Rank 2] Distance: 0.8092
Paper: 2511.10591v1

Text Preview (first 300 chars):
the final weight. BERTScore: An embedding-based metric that measures the semantic similarity between the gen- erated and reference texts. ROUGE-L: A recall-oriented metric that mea- sures the longest common subsequence. 6 Results and Discussion 6.1 Performance Comparison Table 2 presents a comparati...
--------------------------------------------------------------------------------

### Query 4: BERT Architecture

In [9]:
query = example_queries[3]
print(f"Query: {query}\n")
print("=" * 80)

results = pipeline.search(query, k=3)

for result in results:
    print(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
    print(f"Paper: {result['metadata']['paper']}")
    print(f"\nText Preview (first 300 chars):")
    print(result['chunk'][:300] + "...")
    print("-" * 80)

Query: Explain the architecture of BERT and its variants


[Rank 1] Distance: 1.0059
Paper: 2511.10577v1

Text Preview (first 300 chars):
dual-encoder architecture—semantic and syntactic channels—which we enhance through DeBERTa integration. 3.2 Model Integration and Adjustments We evaluate the effect of replacing BERT in D2E2S with three DeBERTa vari- ants: DeBERTa V3-Base (86M parameters), DeBERTa V3-Large (304M parame- ters), and D...
--------------------------------------------------------------------------------

[Rank 2] Distance: 1.0228
Paper: 2511.10441v1

Text Preview (first 300 chars):
tests whether organizational benefits gen- eralize across alternation types rather than being phenomenon-specific optimizations. C Model Implementation C.1 Encoder Comparison We employ BERT (bert-base-multilingual-cased) as our primary encoder to emphasize contributions attributable to data organiza...
--------------------------------------------------------------------------------

[Rank 3] ... (output truncated)

### Query 5: Few-Shot Learning

In [10]:
query = example_queries[4]
print(f"Query: {query}\n")
print("=" * 80)

results = pipeline.search(query, k=3)

for result in results:
    print(f"\n[Rank {result['rank']}] Distance: {result['distance']:.4f}")
    print(f"Paper: {result['metadata']['paper']}")
    print(f"\nText Preview (first 300 chars):")
    print(result['chunk'][:300] + "...")
    print("-" * 80)

Query: What techniques are used for few-shot learning in NLP?


[Rank 1] Distance: 0.8441
Paper: 2511.10354v1

Text Preview (first 300 chars):
Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateu...
--------------------------------------------------------------------------------

[Rank 2] Distance: 0.8516
Paper: 2511.10192v1

Text Preview (first 300 chars):
prompt-based methods have become a crucial technique for improving performance. A typical prompt provided to an LLM generally includes multiple components: an instruction, database schema information, and the NL question posed by the user. To enhance the model’s generalization ability across differe...
--------------------------------------------------------------------------------

[Rank 3] ... (output truncated)

## 7. Comprehensive Results Summary

Let's create a summary table showing all queries and their top results:

In [11]:
# Collect results for all queries
summary_data = []

for query in example_queries[:5]:  # First 5 queries for the report
    results = pipeline.search(query, k=3)
    
    for result in results:
        summary_data.append({
            'Query': query[:60] + "...",
            'Rank': result['rank'],
            'Paper': result['metadata']['paper'][:30],
            'Distance': f"{result['distance']:.4f}",
            'Chunk Preview': result['chunk'][:100] + "..."
        })

# Create DataFrame
df_summary = pd.DataFrame(summary_data)
print("\n" + "=" * 80)
print("RETRIEVAL SUMMARY TABLE")
print("=" * 80)
print(df_summary.to_string(index=False))


RETRIEVAL SUMMARY TABLE
                                                          Query  Rank        Paper Distance                                                                                           Chunk Preview
What are the latest advances in attention mechanisms for tra...     1 2511.10618v1   1.0285 vj) = A(qi, kl, vl) ∀i, j, l and necessarily that output activations from the attention layer are id...
What are the latest advances in attention mechanisms for tra...     2 2511.10566v1   1.0932 that this distinctive behavior across a wide range of model architectures, spanning several vision a...
What are the latest advances in attention mechanisms for tra...     3 2511.10628v1   1.1397 compared to standard LayerNorm (Ba et al., 2016), particularly for large-scale language models (Taka...
      How do large language models handle multilingual text?...     1 2511.10229v1   0.7551 separability in LangGPS es- sentially reflects the model’s own multilingual modeling capability. As
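
The summary cell appends `"..."` to every query, which is why the second query, shorter than 60 characters, still ends in an ellipsis in the table. A small sketch of a helper that adds the ellipsis only when text was actually cut follows. `truncate` is a hypothetical name, not something defined in `rag_pipeline`.

```python
def truncate(text: str, limit: int, suffix: str = "...") -> str:
    """Return text unchanged if it fits, otherwise cut it and append suffix."""
    if len(text) <= limit:
        return text
    return text[:limit] + suffix


short_query = "How do large language models handle multilingual text?"
long_query = "What are the latest advances in attention mechanisms for transformers?"

print(truncate(short_query, 60))
print(truncate(long_query, 60))
```

In the summary loop this would replace `query[:60] + "..."` with `truncate(query, 60)`, and likewise for the chunk preview.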

## 8. Analyze Retrieval Quality

Compute the average top-3 distance for each query, then count how many distinct papers the five queries retrieved:

In [12]:
# Run each query once and reuse the results for both analyses
results_by_query = {
    query: pipeline.search(query, k=3) for query in example_queries[:5]
}

# Average distance of the top-3 results for each query
distance_analysis = {
    query[:50]: sum(r['distance'] for r in results) / len(results)
    for query, results in results_by_query.items()
}

print("\nAverage Retrieval Distances by Query:")
print("(Lower distance = better match)\n")
for query, avg_dist in distance_analysis.items():
    print(f"{query}... : {avg_dist:.4f}")

# Paper coverage: distinct papers retrieved vs. papers in the index
unique_papers_retrieved = {
    result['metadata']['paper']
    for results in results_by_query.values()
    for result in results
}
total_papers = len({m['paper'] for m in pipeline.metadata})

print("\nPapers Coverage:")
print(f"Total papers in index: {total_papers}")
print(f"Unique papers retrieved: {len(unique_papers_retrieved)}")
print(f"Coverage: {len(unique_papers_retrieved) / total_papers * 100:.1f}%")


Average Retrieval Distances by Query:
(Lower distance = better match)

What are the latest advances in attention mechanis... : 1.0871
How do large language models handle multilingual t... : 0.7674
What are the common evaluation metrics for machine... : 0.8214
Explain the architecture of BERT and its variants... : 1.0534
What techniques are used for few-shot learning in ... : 0.8970

Papers Coverage:
Total papers in index: 50
Unique papers retrieved: 10
Coverage: 20.0%
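
The distances above are easier to interpret once mapped to cosine similarity. If the index is a FAISS `IndexFlatL2`, which reports squared Euclidean distance, and the embeddings are unit-normalized, then d = ||a - b||^2 = 2 - 2·cos(a, b), so cos(a, b) = 1 - d/2. Both conditions are assumptions here, since the index type and any normalization are set inside `rag_pipeline` and are not shown. The sketch checks the identity on toy vectors in pure Python; `l2sq_to_cosine` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def l2sq_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Check the identity on two toy unit vectors
a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 1.0, 0.5])
assert math.isclose(l2sq_to_cosine(squared_l2(a, b)), cosine(a, b))

# Lowest and highest average distances reported above, mapped to cosine similarity
for d in (0.7674, 1.0871):
    print(f"distance {d:.4f} -> cosine {l2sq_to_cosine(d):.4f}")
```

Under those assumptions, a distance near 1.0 corresponds to a cosine similarity near 0.5, which suggests the attention and BERT queries found only loosely related chunks in this 50-paper sample.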


## 9. Experiment with Different Parameters

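The cell below lists chunking configurations worth trying. It does not rebuild the index itself, so comparing them means rerunning `process_papers()` with each setting: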

In [13]:
# Note: This is for experimentation - you would need to rebuild the index
# with different parameters

chunk_size_experiments = [
    {"size": 256, "overlap": 25, "description": "Smaller chunks, more precise"},
    {"size": 512, "overlap": 50, "description": "Medium chunks (current)"},
    {"size": 1024, "overlap": 100, "description": "Larger chunks, more context"}
]

print("Chunking Strategy Options:")
print("\nCurrent configuration: 512 tokens with 50 token overlap")
print("\nAlternative configurations to experiment with:")
for exp in chunk_size_experiments:
    print(f"  - {exp['size']} tokens, {exp['overlap']} overlap: {exp['description']}")

print("\nTo experiment with different chunk sizes:")
print("1. Modify chunk_size and overlap parameters in process_papers()")
print("2. Rebuild the index")
print("3. Compare retrieval quality metrics")

Chunking Strategy Options:

Current configuration: 512 tokens with 50 token overlap

Alternative configurations to experiment with:
  - 256 tokens, 25 overlap: Smaller chunks, more precise
  - 512 tokens, 50 overlap: Medium chunks (current)
  - 1024 tokens, 100 overlap: Larger chunks, more context

To experiment with different chunk sizes:
1. Modify chunk_size and overlap parameters in process_papers()
2. Rebuild the index
3. Compare retrieval quality metrics
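
Before rebuilding, it can help to estimate how each configuration changes the index size. For a document of n units, a window of size s that advances by s - o units produces about ceil((n - o) / (s - o)) chunks. The sketch below applies that to the three configurations; the 10,000-unit document length is an arbitrary illustration rather than a measurement from this corpus, and `estimate_chunks` is a hypothetical helper. Separately, the model card for `all-MiniLM-L6-v2` states that input longer than 256 word pieces is truncated by default, so if `chunk_size` counts tokens, the tail of 512- and 1024-token chunks may not influence their embeddings. That is worth checking against `rag_pipeline` before comparing configurations.

```python
import math


def estimate_chunks(n_units: int, size: int, overlap: int) -> int:
    """Number of sliding-window chunks for a document of n_units."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if n_units <= 0:
        return 0
    if n_units <= size:
        return 1
    # Each chunk after the first contributes (size - overlap) new units
    return math.ceil((n_units - overlap) / (size - overlap))


configs = [(256, 25), (512, 50), (1024, 100)]  # (size, overlap) from the cell above
doc_length = 10_000  # arbitrary illustrative length, not measured from the corpus

for size, overlap in configs:
    count = estimate_chunks(doc_length, size, overlap)
    print(f"size={size:4d} overlap={overlap:3d} -> {count} chunks")
```

Halving the chunk size roughly doubles the number of vectors to embed and store, which is the main cost to weigh against the finer retrieval granularity.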


## 10. Load Previously Saved Index

Load the saved index into a fresh pipeline instance and run a test query to confirm that it works:

In [14]:
# Create a new pipeline instance
loaded_pipeline = ArXivRAGPipeline(model_name='all-MiniLM-L6-v2')

# Load the saved index
loaded_pipeline.load_index(
    index_path='faiss_index.bin',
    chunks_path='chunks.json',
    metadata_path='metadata.json'
)

# Verify it works
test_query = "transformer architecture"
results = loaded_pipeline.search(test_query, k=2)

print(f"Test query: '{test_query}'")
print(f"\nTop result from loaded index:")
print(f"Paper: {results[0]['metadata']['paper']}")
print(f"Distance: {results[0]['distance']:.4f}")
print(f"Preview: {results[0]['chunk'][:150]}...")

Loading embedding model: all-MiniLM-L6-v2
Loaded index with 1074 vectors
Loaded 1074 chunks
Test query: 'transformer architecture'

Top result from loaded index:
Paper: 2511.10566v1
Distance: 0.9588
Preview: Eq. (33), we obtain z′ j = x′ j + FFN(x′ j), zj = xj + MHSA(xj). Hence, we can write gz′ i (from Eq. (10)) for the ith layer as follows: gz′ i = ∂L ∂z...
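
A loaded index is only useful if vector i still corresponds to chunk i and metadata entry i. The log above reports 1074 vectors and 1074 chunks; a check along these lines can make that alignment explicit. It is a sketch that assumes `chunks.json` and `metadata.json` each hold a JSON list, which this notebook does not confirm. `check_sidecars` is a hypothetical helper, and the demo uses temporary files rather than the real ones.

```python
import json
import tempfile
from pathlib import Path


def check_sidecars(chunks_path: str, metadata_path: str, n_vectors: int) -> None:
    """Raise ValueError if chunk, metadata, and vector counts disagree."""
    chunks = json.loads(Path(chunks_path).read_text(encoding="utf-8"))
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    if not (len(chunks) == len(metadata) == n_vectors):
        raise ValueError(
            f"Misaligned index: {n_vectors} vectors, "
            f"{len(chunks)} chunks, {len(metadata)} metadata entries"
        )


# Demo with temporary files standing in for chunks.json and metadata.json
with tempfile.TemporaryDirectory() as tmp:
    chunks_file = Path(tmp) / "chunks.json"
    metadata_file = Path(tmp) / "metadata.json"
    chunks_file.write_text(json.dumps(["chunk a", "chunk b"]), encoding="utf-8")
    metadata_file.write_text(
        json.dumps([{"paper": "p1", "chunk_id": 0}, {"paper": "p1", "chunk_id": 1}]),
        encoding="utf-8",
    )

    # Matching counts pass silently
    check_sidecars(str(chunks_file), str(metadata_file), n_vectors=2)

    # A vector count that disagrees with the files is reported
    try:
        check_sidecars(str(chunks_file), str(metadata_file), n_vectors=3)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(mismatch_detected)
```

With a real FAISS index, the vector count would come from the index's `ntotal` attribute.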


## 11. Conclusion

This notebook demonstrated:
- Building the retrieval stage of a RAG pipeline end to end
- Processing 50 arXiv papers into searchable chunks
- Using FAISS for efficient similarity search
- Testing retrieval quality with example queries