# Week 4: Retrieval-Augmented Generation (RAG) with arXiv Papers
This week marks a major shift in your AI agent's capabilities: you’ll build the foundation for a Retrieval-Augmented Generation (RAG) system tailored to scientific research. Rather than relying on an LLM’s memory alone, RAG architectures allow your agent to search a structured knowledge base and generate grounded, document-aware answers.

Your task is to create a RAG pipeline using recent arXiv cs.CL papers, converting them into searchable chunks, embedding them, and indexing them with FAISS. You’ll then implement a simple query interface that takes a user question, retrieves the top relevant chunks, and displays them for further processing.

This is also the start of your agent’s private research knowledge base: a semantic index that you’ll evolve into a full-featured hybrid database in Week 5.


## 📚 Learning Objectives

* Understand the components of a Retriever-Reader QA pipeline.
* Explore document chunking strategies (e.g., sections vs. sliding windows) and their impact on retrieval performance.
* Index scientific text using vector embeddings and FAISS.
* Build and query a semantic index via a FastAPI endpoint that returns relevant passages.


In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:42:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0


Create rag_env virtual environment

(base) C:\Users\ch939>conda create -n rag_env python=3.10

Activate the virtual environment

(base) C:\Users\ch939>conda activate rag_env

Installing sentence-transformers is slow; it can take around 10 minutes.
(rag_env) C:\Users\ch939>conda install sentence-transformers

(rag_env) C:\Users\ch939>conda install arxiv


(rag_env) C:\Users\ch939>conda install fastapi

(rag_env) C:\Users\ch939>conda install uvicorn

(rag_env) C:\Users\ch939>conda install nest-asyncio

(rag_env) C:\Users\ch939>conda install jupyter

(rag_env) C:\Users\ch939>conda install numpy

(rag_env) C:\Users\ch939>conda install tqdm

(rag_env) C:\Users\ch939>conda install -c pytorch faiss-gpu

Install FAISS-GPU.

Run the command below from the Anaconda Prompt with rag_env activated. Running the installation from inside this Jupyter notebook did not work.

conda install -c pytorch faiss-gpu

Install Pytorch in the rag_env virtual environment:
(rag_env) C:\Users\ch939>pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121




Libraries installed while implementing the steps below:

(rag_env) C:\Users\ch939>conda install -c conda-forge pymupdf   (this did not work)

(rag_env) C:\Users\ch939>pip install pymupdf

(rag_env) C:\Users\ch939>conda install -c conda-forge pypdf2

In VS Code, use Ctrl+Shift+P to switch the Jupyter kernel.

Load the API key

In [1]:
# mod4env.py
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

# Load environment variables from .env file
load_dotenv()

# Access the OpenAI API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Optional: Validate that the key is loaded
if not OPENAI_API_KEY:
    raise ValueError("OpenAI API key not found. Please set it in the .env file.")

Verify Python Version

!python --version

Verify GPU Availability

In [3]:
import torch
print("CUDA Available:", torch.cuda.is_available())
print("CUDA Version:", torch.version.cuda)
print("Device Count:", torch.cuda.device_count())
print("Current Device:", torch.cuda.current_device())
print("Device Name:", torch.cuda.get_device_name(0))

CUDA Available: True
CUDA Version: 12.8
Device Count: 1
Current Device: 0
Device Name: NVIDIA GeForce RTX 4070 SUPER


Verify that FAISS sees the GPU:

In [4]:
import faiss
print("FAISS GPU Available:", hasattr(faiss, "GpuIndexFlatL2"))

FAISS GPU Available: True


In [5]:
!python --version

Python 3.10.18


In [6]:
list_of_chunks = [
    "This is the first text chunk.",
    "Here is the second text chunk, which is a bit longer than the first one.",
    "Finally, this is the third text chunk, and it contains some more information."
]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(list_of_chunks)  # embeds each text chunk into a 384-d vector



   (Alternatively, you can use a Hugging Face Transformer model and apply pooling manually to get chunk embeddings.)
   
5. **Indexing with FAISS:** Build a FAISS index of the chunk embeddings. For example, use a simple index like `IndexFlatL2` with the same dimensionality as your embeddings. Add all chunk vectors to the index (e.g., `index.add(np.array(embeddings))`).
6. **Notebook Demo:** Create a notebook where a user query is embedded and passed to the index (`index.search(query_embedding, k)`) to retrieve the top-3 matching chunks. Display the original chunk text for these results.
7. **FastAPI Service:** Build a simple FastAPI app. Define an endpoint (e.g. `@app.get("/search")`) that accepts a query parameter `q`. In the handler, embed `q`, perform the FAISS search, and return the top passages as JSON. (For example, a FastAPI endpoint can accept a question and return relevant documents.)
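
To make steps 5 and 6 more concrete: `IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The sketch below reproduces that behaviour in plain Python on a toy set of 2-d vectors. `flat_l2_search` is a hypothetical helper written for illustration, not part of FAISS, and it is only meant to show what `index.search` returns (a list of distances and a list of row indices, nearest first).

```python
import heapq
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Exact k-nearest-neighbour search by squared L2 distance (hypothetical helper)."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    top = heapq.nsmallest(k, scored)  # smallest distance first
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

A real index does the same comparison over all chunk embeddings at once, which is why the returned indices can be used directly to look up the original chunk text.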



## Starter Code Snippets

Below are skeleton code templates. Fill in the details (indicated by comments or ellipses).

**Download arXiv Papers (cs.CL):**

In [7]:
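import os
import arxiv

# Make sure the target directory exists before downloading into it
os.makedirs("data/pdfs", exist_ok=True)

client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.CL",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

papers = []
for result in client.results(search):
    # Name each file after the arXiv ID at the end of the entry URL
    result.download_pdf(dirpath="data/pdfs/", filename=f"{result.entry_id.split('/')[-1]}.pdf")
    papers.append(result)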

**Data Extraction (PDF → Text):**

Check that fitz (PyMuPDF) is installed:

In [12]:

import fitz # PyMuPDF
print("PyMuPDF Version:", fitz.__version__)
print("PyMuPDF Docstring:")
print(fitz.__doc__)


PyMuPDF Version: 1.26.3
PyMuPDF Docstring:
PyMuPDF 1.26.3: Python bindings for the MuPDF 1.26.3 library (rebased implementation).
Python 3.10 running on win32 (64-bit).



The extract_text_from_pdf function

In [39]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Open a PDF and extract all text as a single string.
    """
    pages = []
    # The context manager closes the document when extraction finishes
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = page.get_text()  # get raw text from page
            # (Optional) clean page_text here (remove headers/footers)
            pages.append(page_text)
    return "\n".join(pages)


The alternative extract_text_from_pdf function

In [None]:
def extract_text_from_pdf(pdf_path: str) -> str:
    # PyPDF2 is no longer developed under that name; its successor is pypdf
    from PyPDF2 import PdfReader
    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:  # skip pages with no extractable text
            pages.append(page_text)
    return "\n".join(pages)  # keep a line break between pages


Try one PDF first:

In [36]:
pdf_path = "C:\\Users\\ch939\\Downloads\\LLMBootCampCodes\\Week4\\data\\pdfs\\2508.09337v1.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)

Graphical Abstract
Decoding Neural Emotion Patterns through Natural Language Pro-
cessing Embeddings
Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa
Rahimi Azghadi
arXiv:2508.09337v1  [cs.CL]  12 Aug 2025

 
2

Highlights
Decoding Neural Emotion Patterns through Natural Language Pro-
cessing Embeddings
Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa
Rahimi Azghadi
• This study introduces a computational framework for directly mapping
natural language emotional content to brain regions without requiring
neuroimaging.
• The integration of semantic embeddings and neuro-anatomical map-
ping successfully differentiated between healthy and depressed popula-
tions through distinct limbic activation patterns.
• The framework demonstrated high spatial specificity by accurately map-
ping twenty-seven discrete emotions to neuro-anatomically plausible
brain regions.
• Regional assignment patterns showed strong consistency with estab-
lished neuroimagin
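
The extracted text above contains words hyphenated across line breaks ("Pro-" / "cessing") and a stray page number on its own line. One possible cleanup step, sketched below, re-joins such words and drops number-only lines. `clean_page_text` is a hypothetical helper, and the two regular expressions are assumptions about what counts as noise; genuinely hyphenated compounds that happen to break at the hyphen would also be merged, so inspect the results on a few papers before relying on it.

```python
import re


def clean_page_text(page_text: str) -> str:
    """Light cleanup for text extracted from a PDF page (hypothetical helper)."""
    # Re-join words hyphenated across a line break: "Pro-\ncessing" -> "Processing"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", page_text)
    # Drop lines that contain nothing but a page number
    lines = [ln for ln in text.split("\n") if not re.fullmatch(r"\s*\d+\s*", ln)]
    return "\n".join(lines)


sample = "Natural Language Pro-\ncessing Embeddings\n2\nHighlights"
print(clean_page_text(sample))
```

A function like this could be called on `page_text` at the "(Optional) clean page_text here" comment inside `extract_text_from_pdf`.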

**Chunking Logic (Sliding Window):**

In [41]:
from typing import List

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> List[str]:
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(tokens), step):
        chunk = tokens[i:i + max_tokens]
        chunks.append(" ".join(chunk))
    return chunks
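
Two edge cases are worth knowing about in the function above. It splits on whitespace, so its "tokens" are words rather than model tokens. If `overlap` is greater than or equal to `max_tokens`, the step becomes zero or negative and the loop either raises or yields nothing. It can also emit a short final chunk that is already fully contained in the previous one. The variant below is a minimal sketch that guards against both; the name `chunk_text_checked` is hypothetical.

```python
from typing import List


def chunk_text_checked(text: str, max_tokens: int = 512, overlap: int = 50) -> List[str]:
    """Sliding-window chunking over whitespace-separated words (hypothetical variant)."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = " ".join(str(n) for n in range(10))
print(chunk_text_checked(words, max_tokens=5, overlap=2))
# ['0 1 2 3 4', '3 4 5 6 7', '6 7 8 9']
```

Each chunk repeats the last two words of the previous one, which is the overlap that keeps sentences spanning a boundary retrievable from either side.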




Store all chunks in a list with metadata (source file, index):

In [42]:
import os
all_chunks = []
for txt_file in os.listdir("data/texts"):
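    # Read each text file and close it afterwards (assumes the files were saved as UTF-8)
    with open(f"data/texts/{txt_file}", encoding="utf-8") as f:
        text = f.read()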
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "source": txt_file,
            "chunk_id": i
        })
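
The loop above reads `.txt` files from `data/texts`, but the notebook does not show the step that writes them. One way to bridge the gap is sketched below: run a PDF-to-text function over every file in `data/pdfs` and save each result as a text file. `pdfs_to_text_files` is a hypothetical helper, and it takes the extractor as a parameter so that either version of `extract_text_from_pdf` can be plugged in.

```python
import os
from typing import Callable, List


def pdfs_to_text_files(
    pdf_dir: str,
    text_dir: str,
    extract_fn: Callable[[str], str],
) -> List[str]:
    """Run extract_fn on every PDF in pdf_dir and save the result as a .txt file."""
    os.makedirs(text_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        text = extract_fn(os.path.join(pdf_dir, name))
        out_path = os.path.join(text_dir, os.path.splitext(name)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(out_path)
    return written


# Example call, assuming the notebook's extract_text_from_pdf is defined:
# pdfs_to_text_files("data/pdfs", "data/texts", extract_text_from_pdf)
```

Because the files are written as UTF-8 here, they should be read back with `encoding="utf-8"` as well, which matters on Windows where the default encoding differs.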

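Save the chunks with pickle:

In [34]:
import os
import pickle

os.makedirs("chunks", exist_ok=True)  # make sure the output directory exists
with open("chunks/chunks.pkl", "wb") as f:
    pickle.dump(all_chunks, f)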

**Generate Embeddings**

In [49]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [chunk["text"] for chunk in all_chunks]
embeddings = model.encode(texts, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')  # FAISS requires float32

Batches:   0%|          | 0/34 [00:00<?, ?it/s]

**Build FAISS Index (optional)**

In [50]:
import faiss

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)  # or IndexIVFFlat for large scale

# Use GPU (if available)
if faiss.get_num_gpus() > 0:
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(embeddings)

# Save index
faiss.write_index(index, "index/faiss_index.bin")

RuntimeError: Error in void __cdecl faiss::write_index(const struct faiss::Index *,struct faiss::IOWriter *,int) at D:\bld\faiss-split_1734665116635\work\faiss\impl\index_write.cpp:858: don't know how to serialize this type of index

The error occurs because faiss.write_index() can only serialize CPU indices, and at this point index is a GPU index.

To fix this, copy the index back to the CPU before saving it.

In [51]:
import os
import faiss

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)

# Use GPU (if available)
if faiss.get_num_gpus() > 0:
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

# Add embeddings
index.add(embeddings)

# --- Move index back to CPU before saving ---
if faiss.get_num_gpus() > 0:
    index = faiss.index_gpu_to_cpu(index)  # This converts GPU index back to CPU

# Now save the index (only CPU indices can be saved)
os.makedirs("index", exist_ok=True)  # make sure the output directory exists
faiss.write_index(index, "index/faiss_index.bin")

When you want to load and use the index later (especially on GPU), do:

In [54]:
# Load on CPU
index = faiss.read_index("index/faiss_index.bin")

# Optionally move to GPU
if faiss.get_num_gpus() > 0:
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)


**FAISS Indexing and Search:**


In [52]:

import faiss
import numpy as np

# Assume embeddings is a 2D numpy array of shape (num_chunks, dim)
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)  # using a simple L2 index
index.add(np.array(embeddings))  # add all chunk vectors

# Example: search with a random placeholder vector of shape [1, dim];
# replace it with a real query embedding to get meaningful results
query_embedding = np.random.rand(1, dim).astype('float32')
k = 3
distances, indices = index.search(query_embedding, k)
# indices[0] holds the top-k chunk indices
print("Distances:", distances[0])
print("Indices:", indices[0])

Distances: [125.76947 126.49865 126.55414]
Indices: [239 848 207]


**Query & Retrieve**

In [55]:
query = "How do large language models handle reasoning?"
query_embedding = model.encode([query])
query_embedding = np.array(query_embedding).astype('float32')

distances, indices = index.search(query_embedding, k=3)

# distances[0] is ordered by rank, not by chunk index, so pair each
# retrieved index with its own distance instead of indexing distances[0][i]
for i, d in zip(indices[0], distances[0]):
    print(f"Score: {1/(1+d):.3f}")
    print(f"Text: {all_chunks[i]['text'][:200]}...\n")

Score: 0.611
Text: abs/2505.02686. Xiaomi, L.-C.-T. 2025. MiMo-VL Technical Report. arXiv, abs/2506.03569. Xu, F.; Hao, Q.; Zong, Z.; Wang, J.; Zhang, Y.; Wang, J.; Lan, X.; Gong, J.; Ouyang, T.; Meng, F.; Shao, C.; Yan...

Score: 0.589
Text: Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianr...

Score: 0.556
Text: Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study Mahdi Dhaini  [0000−0002−7831−3141], Juraj Vladika[0000−0002−4941−9166], Ege Erdogan[0000−0001−6170−...
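
The score printed above, `1/(1+d)`, is a convenience mapping from a squared L2 distance `d` to the range (0, 1]; it is not a cosine similarity. If the embeddings are unit-length, the two are related by `d = 2 - 2*cos`, so `cos = 1 - d/2`. The sketch below checks this on two toy vectors. Unit-length embeddings are an assumption here: one way to get them is to pass `normalize_embeddings=True` to `model.encode`. The helper names are hypothetical.

```python
from typing import Sequence


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_l2_squared(d: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - d / 2.0


# Two orthogonal unit vectors: cosine similarity 0, squared distance 2
u = [1.0, 0.0]
v = [0.0, 1.0]
d = l2_squared(u, v)
print(cosine_from_l2_squared(d), round(1 / (1 + d), 3))  # 0.0 0.333
```

Both scores rank results in the same order, so the choice only affects how the numbers read in a report.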



**FastAPI Endpoint**

This is a semantic search API built with FastAPI. It uses sentence embeddings and FAISS (Facebook AI Similarity Search) for efficient similarity search.

In [56]:
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle

app = FastAPI()

# Load globally
model = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.read_index("index/faiss_index.bin")
with open("chunks/chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)

@app.get("/search")
async def search(q: str):
    query_vec = model.encode([q])
    query_vec = np.array(query_vec).astype('float32')
    D, I = index.search(query_vec, k=3)
    results = []
    # Pair each retrieved index with its own distance
    for dist, idx in zip(D[0], I[0]):
        results.append({
            "score": float(1 / (1 + dist)),  # approximate similarity
            "text": all_chunks[idx]["text"],
            "source": all_chunks[idx]["source"]
        })
    return {"query": q, "results": results}
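
The response-building logic can also live in a small function of its own, which keeps the score and index pairing in one place and lets you test it without loading the model or the index. The sketch below is one way to do that; `build_results` is a hypothetical helper. It also skips index `-1`, which FAISS returns for slots it could not fill when the index holds fewer than `k` vectors.

```python
from typing import Any, Dict, List, Sequence


def build_results(
    distances: Sequence[float],
    indices: Sequence[int],
    chunks: Sequence[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each retrieved index with its own distance and skip empty slots."""
    results = []
    for dist, idx in zip(distances, indices):
        if idx < 0:  # FAISS uses -1 when it has fewer than k results
            continue
        results.append({
            "score": float(1 / (1 + dist)),
            "text": chunks[int(idx)]["text"],
            "source": chunks[int(idx)]["source"],
        })
    return results


demo_chunks = [
    {"text": "chunk A", "source": "a.txt"},
    {"text": "chunk B", "source": "b.txt"},
]
print(build_results([0.0, 1.0, 3.4e38], [1, 0, -1], demo_chunks))
```

Inside the endpoint this would be called as `build_results(D[0], I[0], all_chunks)`.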



**FastAPI Route Skeleton:**


This script sets up a FastAPI endpoint that performs semantic search using a FAISS vector index and a text embedding model.

In [58]:

from fastapi import FastAPI
import numpy as np

app = FastAPI()

# Assumes `model`, `index`, and `all_chunks` from the earlier cells are already loaded

@app.get("/search")
async def search(q: str):
    """
    Receive a query 'q', embed it, retrieve top-3 passages, and return them.
    """
    # Embed the query 'q' using your embedding model
    query_vector = model.encode([q])[0]  # shape: [dim]
    # Perform FAISS search (FAISS expects a float32 array of shape [1, dim])
    k = 3
    distances, indices = index.search(np.array([query_vector], dtype="float32"), k)
    # Retrieve the corresponding chunks ('indices' has shape [1, k])
    results = []
    for idx in indices[0]:
        results.append(all_chunks[idx])
    return {"query": q, "results": results}



## Deliverables

* **Code Notebook / Script:** Complete code for the RAG pipeline (PDF extraction, chunking, embedding, indexing, retrieval).
* **Data & Index:** The FAISS index file and the processed chunks from the 50 papers (e.g., as JSON or pickled objects).
* **Retrieval Report:** A brief report showing at least 5 example queries and the top-3 retrieved passages for each, to demonstrate system performance.
* **FastAPI Service:** The FastAPI app code (e.g. `main.py`) and instructions on how to run it. The `/search` endpoint should be demonstrable (e.g. returning top-3 passages in JSON for sample queries).

## Student Exploration Tips

* Experiment with different chunk sizes and overlaps. Smaller chunks (∼250 tokens) often give more precise retrieval, while larger chunks include more context.
* Try different embedding models (e.g. using `'all-mpnet-base-v2'` or `'paraphrase-MiniLM-L6-v2'`) to see how retrieval results change.
* Implement a simple reranking step: for example, after retrieving candidates with FAISS, re-score them with a cross-encoder model for finer ranking.
* Use metadata: consider filtering or weighting chunks by paper metadata (e.g. year, authors, keywords) to improve relevance if needed.
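
The reranking tip can be prototyped without committing to a particular model. The sketch below re-scores a list of retrieved passages with any function that takes a query and a passage and returns a number. `rerank` and `word_overlap` are hypothetical names, and the word-overlap scorer is only a toy stand-in; in practice `score_fn` could wrap a cross-encoder, for example the `predict` method of `sentence_transformers.CrossEncoder`.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Re-score retrieved passages and return the best top_k, highest score first."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "A. Author, B. Author, C. Author",
    "large language models and chain-of-thought reasoning",
    "benchmarks for language models",
]
for score, passage in rerank("language models reasoning", candidates, word_overlap, top_k=2):
    print(f"{score:.2f}  {passage}")
```

A common pattern is to retrieve more candidates than needed from FAISS (say 20) and keep only the top 3 after reranking, so that author lists and bibliography chunks can be pushed out of the final results.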
