# **SciPy Tutorial 2025 RAG**

# **PART 1: LLM Inference Setup**
---
Before we explore the power of Retrieval-Augmented Generation, let’s first set up our LLM inference endpoint. For this tutorial, we’ll be using an open-source LLM.


**Step 1: Launch a GPU instance**

**Nebari**: If you’re using the Nebari platform, be sure to select a GPU instance.









Differences: CPU vs. GPU

| Aspect            | CPU                                                         | GPU                                                      |
|-------------------|-------------------------------------------------------------|----------------------------------------------------------|
| **Function**      | Generalized component that handles main processing functions of a server | Specialized component that excels at parallel computing   |
| **Processing**    | Designed for serial instruction processing                  | Designed for parallel instruction processing             |
| **Design**        | Fewer, more powerful cores                                  | More cores than CPUs, but less powerful than CPU cores   |
| **Best suited for** | General purpose computing applications                    | High-performance computing applications                  |
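
To confirm that the session actually sees a GPU before loading the model, a quick check like the one below can help. It is a minimal sketch that assumes PyTorch is the backend; `pick_device` is a hypothetical helper name, not part of the tutorial code.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Running on:", pick_device())
```

If this prints `cpu` on a machine that should have a GPU, check the instance type before continuing.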



**Step 2: Instantiating a Text-Generation Pipeline with a Chat-Style Prompt**

In [1]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-2b-instruct")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.1-2b-instruct")

Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.51s/it]


Pipelines are an easy way to use models for inference. They offer a simple API for many tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering. Here we use the text-generation task.

In [2]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.1,
    return_full_text=False, # don't return the prompt itself
)

Device set to use cuda:0


Next, wrap the inference call in a minimal helper function that fills a prompt template with a user-supplied context and question.

In [3]:
def prompt_template(context: str, question: str):
    """
    context: supporting document or knowledge snippet
    question: user’s query
    """
    # build a prompt that clearly separates context from the question
    prompt = f"""
    You are an expert question-answering assistant in a RAG (Retrieval-Augmented Generation) system.
    Use only the information in the CONTEXT to ANSWER the QUESTION.
    CONTEXT:
    {context.strip()}
    QUESTION:
    {question.strip()}
    ANSWER:
    """
    out = pipe(prompt, max_new_tokens=100, truncation=True, do_sample=True)[0]
    return out["generated_text"]
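
It can be useful to inspect the exact prompt text before sending it to the model. The sketch below separates prompt construction from inference so the prompt can be printed and checked on its own. `build_prompt` is a hypothetical helper, not part of the tutorial code, and it builds the prompt line by line so that the indentation of the source code does not leak into the prompt.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble the RAG prompt text without calling the model."""
    lines = [
        "You are an expert question-answering assistant in a RAG "
        "(Retrieval-Augmented Generation) system.",
        "Use only the information in the CONTEXT to ANSWER the QUESTION.",
        "CONTEXT:",
        context.strip(),
        "QUESTION:",
        question.strip(),
        "ANSWER:",
    ]
    return "\n".join(lines)


preview = build_prompt(
    "Paris is the capital of France.",
    "What is the capital of France?",
)
print(preview)
```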

**Without Context**

Without a defined knowledge context, the LLM may hallucinate and provide inaccurate information.

In [4]:
user_question = "What are the canvas dimensions of “Les Demoiselles d’Avignon,” and what subject does the painting depict?"
prompt_template("",user_question)

'\nThe painting "Les Demoiselles d’Avignon" by Pablo Picasso has a canvas dimension of 73 x 53 centimeters. The subject of the painting is a group of prostitutes, often referred to as "the dancers" or "the courtesans," depicted in a raw and primitive style, marking a significant departure from traditional portraiture. This work is considered a precursor to Cubism and'

**With Context**

With a clearly defined, fact-based context, the LLM can answer this question precisely.

In [5]:
context_input = """
In July 1907, Pablo Picasso unveiled “Les Demoiselles d’Avignon” in his Paris studio.
This groundbreaking canvas (243 cm × 233 cm) depicts five nude female figures with angular,
fragmented forms and faces inspired by African and Iberian masks.
By abandoning traditional single-point perspective, Picasso flattened the pictorial space
and presented multiple viewpoints simultaneously.
The painting’s radical departure from realistic representation laid the groundwork for the
Cubist movement, which Picasso and Georges Braque would develop further in 1908–1914.
"""
user_question = "What are the canvas dimensions of “Les Demoiselles d’Avignon,” and what subject does the painting depict?"
prompt_template(context_input,user_question)

' The canvas dimensions of “Les Demoiselles d’Avignon” are 243 cm (width) × 233 cm (height). The painting depicts five nude female figures.'

# **PART 2: Load Data**


---



In this tutorial, we’ll use 100 scientific papers as our knowledge base. These are real arXiv papers from computer science and AI research, forming a subset of the [SPIQA](https://huggingface.co/datasets/google/spiqa) dataset.
Navigate to the `Data` folder in your terminal with `cd Data`, then unzip the downloaded file by running `unzip scientific_papers.zip`.

In [6]:
from pathlib import Path
# find parent path
current_path = Path.cwd()
root_path = current_path.parents[1]
print("parent path:", root_path)
# specify data file path
folder_path = root_path/"ScipyTutorial2025_RAG/Data/scientific_papers"
print("file path:",folder_path)

parent path: /home/siyulilyqian@gmail.com
file path: /home/siyulilyqian@gmail.com/ScipyTutorial2025_RAG/Data/scientific_papers


In [7]:
import glob
import os
txt_files = glob.glob(os.path.join(folder_path, '*.txt'))
# Read them into a dict, keep track of file names
documents_dict = {}
for fp in txt_files:
    with open(fp, 'r', encoding='utf-8') as f:
        documents_dict[os.path.basename(fp)] = f.read()

In [8]:
from langchain.schema import Document
# Convert each entry in documents_dict into a Document object
docs = [
    Document(page_content=content,metadata={"source": filename})
    for filename, content in documents_dict.items()
]
print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 100


# **PART 3: RAG**



---



# **3.1 Chunking**

Chunking refers to the process of splitting a larger document into smaller, more manageable “chunks” of text before embedding and retrieval.
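
To make the idea concrete, here is a minimal pure-Python sketch of fixed-length chunking with optional overlap. `chunk_fixed` is a hypothetical helper written for illustration; it is not how the LangChain splitters are implemented.

```python
def chunk_fixed(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters, sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_fixed(sample, 4))     # ['abcd', 'efgh', 'ij']
print(chunk_fixed(sample, 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Notice that the window boundaries ignore words and sentences entirely, which is the main weakness of fixed-length chunking.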




In [10]:
%%time
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="", chunk_size=2000,chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)} chunks created.")

2004 chunks created.
CPU times: user 7.85 s, sys: 0 ns, total: 7.85 s
Wall time: 7.85 s


**Question 1:** What observations did you make about fixed-length chunking, and which alternative chunking method would you like to explore next?

In [91]:
%%time
# Text-structure-based splitting:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)} chunks created.")

2678 chunks created.
CPU times: user 92.6 ms, sys: 2.98 ms, total: 95.6 ms
Wall time: 94 ms


In [27]:
# you can check the chunk content here
texts[0].page_content

'In contextual bandits, the objective is to select an action A𝐴A, guided by contextual information X𝑋X, to maximize the resulting outcome Y𝑌Y. This paradigm is prevalent in many real-world applications such as healthcare, personalized recommendation systems, or online advertising [1, 2, 3]. The objective is to perform actions, such as prescribing medication or recommending items, which lead to desired outcomes like improved patient health or higher click-through rates. Nonetheless, updating the policy presents challenges, as naïvely implementing a new, untested policy may raise ethical or financial concerns. For instance, prescribing a drug based on a new policy poses risks, as it may result in unexpected side effects. As a result, recent research [4, 5, 6, 7, 8, 9, 10, 11] has concentrated on evaluating the performance of new policies (target policy) using only existing data that was generated using the current policy (behaviour policy). This problem is known as Off-Policy Evaluation 

In [19]:
from sentence_transformers import SentenceTransformer
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [22]:
%%time
from langchain_experimental.text_splitter import SemanticChunker
text_splitter = SemanticChunker(
    hf, breakpoint_threshold_type="percentile"
)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)} chunks created.")

1018 chunks created.
CPU times: user 4min 33s, sys: 1.05 s, total: 4min 34s
Wall time: 4min 15s


**Question 2:** Measure each chunking strategy's processing latency. Which method runs the fastest, and which one is the slowest? Why is that?

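In the runs above, `RecursiveCharacterTextSplitter` was the fastest (about 94 ms), `CharacterTextSplitter` took about 7.9 s, and semantic chunking was by far the slowest (about 4 min 15 s). Semantic chunking splits text where the semantic similarity between neighboring sentences drops, so it has to run the embedding model over every sentence before it can pick breakpoints.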
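
To see where that processing time goes, here is a minimal sketch of percentile-based semantic chunking. It is an illustration under stated assumptions, not the `SemanticChunker` implementation: `embed` is a toy word-count stand-in for a real embedding model, and the percentile is computed with a simple nearest-rank rule.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy stand-in for a sentence-embedding model: plain word counts.
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[tok] * b[tok] for tok in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], percentile: float = 95) -> list[str]:
    """Start a new chunk wherever the distance to the next sentence is unusually large."""
    if len(sentences) < 2:
        return list(sentences)
    vectors = [embed(s) for s in sentences]  # one embedding call per sentence: the costly step
    distances = [1 - cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    ranked = sorted(distances)
    threshold = ranked[max(0, math.ceil(percentile / 100 * len(ranked)) - 1)]
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr softly.",
    "Cats sleep all day.",
    "Stocks fell sharply.",
    "Stocks rose later.",
]
print(semantic_chunks(sentences, percentile=50))
```

With a real encoder, the `embed` call is a neural-network forward pass per sentence, which is why this strategy is so much slower than character-based splitting.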

# **3.2 Embedding**


Embedding and indexing are the steps that turn text chunks into a searchable vector database. **Embedding** converts pieces of text into high-dimensional numeric vectors that capture their semantic meaning.
**Indexing** stores those vectors in a specialized data structure—or “index”—that supports fast similarity search.
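
The two steps can be sketched in a few lines of plain Python. This is a toy illustration, assuming word counts as the "embedding" and a brute-force scan as the "index"; `embed` and `TinyIndex` are hypothetical names, and a library such as FAISS replaces the scan with far more efficient data structures.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalized word counts (a stand-in for a neural encoder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


class TinyIndex:
    """Stores (vector, text) pairs and answers top-k queries by brute force."""

    def __init__(self) -> None:
        self._items: list[tuple[dict[str, float], str]] = []

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        q = embed(query)
        # For normalized vectors, the dot product equals cosine similarity.
        scored = [
            (sum(w * vec.get(tok, 0.0) for tok, w in q.items()), text)
            for vec, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = TinyIndex()
for chunk in [
    "bandits select actions from context",
    "diffusion models generate images",
    "contextual bandits maximize reward",
]:
    index.add(chunk)

top = index.search("contextual bandits", k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```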

Feel free to explore the wide range of embedding models available on Hugging Face.


In [28]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(texts, hf)

# **3.3 Retrieval**

Retrieval refers to the process of finding and fetching the most relevant chunks (pieces of text) from your indexed knowledge base to serve as context for your LLM.

In [75]:
%%time
retrieved_chunks = vectorstore.similarity_search("What challenge do temporal tracking and forecasting tasks illustrate in machine learning?",k=2)

CPU times: user 19 ms, sys: 92 μs, total: 19.1 ms
Wall time: 17.3 ms


In [176]:
# Alternatively, you can initialize the retriever using a different method.
vector_retriever = vectorstore.as_retriever(
    # optional parameters:
    search_kwargs={"k": 2}
)

In [177]:
# check source document
retrieved_chunks[0].metadata

{'source': '2311.06428v2.txt'}

**Question 1:** What code changes are needed to add both a similarity-score threshold and metadata-based filtering on top of your standard top-k chunk retriever in a RAG pipeline?

In [174]:
%%time
# Use a metadata filter to search within a specific document or source.
metadata_filter = {"source": "2311.06428v2.txt"}
retrieved_chunks = vectorstore.similarity_search("What challenge do temporal tracking and forecasting tasks illustrate in machine learning?",\
                                                 k=2, \
                                                 filter=metadata_filter)
retrieved_chunks[0].metadata

CPU times: user 18.6 ms, sys: 1.02 ms, total: 19.6 ms
Wall time: 18 ms


{'source': '2311.06428v2.txt'}

In [73]:
%%time
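# Add a similarity-score threshold; results come back as (Document, score) pairs
retrieved_chunks = vectorstore.similarity_search_with_relevance_scores(
    "What challenge do temporal tracking and forecasting tasks illustrate in machine learning?",
    k=2,
    score_threshold=0.9,  # spelled `score_threshold`; a misspelled keyword is not applied
)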

CPU times: user 17.3 ms, sys: 2.06 ms, total: 19.4 ms
Wall time: 17.7 ms
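
Conceptually, both filters are simple checks applied to each candidate. The sketch below shows that logic on plain Python data. It assumes each hit is a `(metadata, score)` pair where a higher score means more similar; `filter_hits` and the file names are hypothetical and only serve the illustration.

```python
def filter_hits(hits, min_score=None, metadata_filter=None):
    """Keep (metadata, score) pairs that pass an optional score floor and metadata match."""
    kept = []
    for metadata, score in hits:
        if min_score is not None and score < min_score:
            continue  # below the similarity threshold
        if metadata_filter and any(
            metadata.get(key) != value for key, value in metadata_filter.items()
        ):
            continue  # metadata does not match the filter
        kept.append((metadata, score))
    return kept


hits = [
    ({"source": "paper_a.txt"}, 0.92),
    ({"source": "paper_b.txt"}, 0.95),
    ({"source": "paper_a.txt"}, 0.40),
]
print(filter_hits(hits, min_score=0.9, metadata_filter={"source": "paper_a.txt"}))
```

A strict threshold can leave you with fewer than k results, or none at all, so it is worth checking how many chunks survive before building the prompt.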




**Question 2:** What steps are required to plug a sparse retriever into your RAG workflow, replacing the default dense retriever?

In [54]:
%%time
from langchain_community.retrievers import BM25Retriever
# Note: this indexes the full documents in `docs`; pass `texts` instead to index the chunks
sparse_retriever = BM25Retriever.from_documents(docs,k=2)

CPU times: user 213 ms, sys: 9.98 ms, total: 223 ms
Wall time: 221 ms


In [71]:
%%time
question = "What challenge do temporal tracking and forecasting tasks illustrate in machine learning?"
retrieved_chunks = sparse_retriever.invoke(question)

CPU times: user 1.35 ms, sys: 17 μs, total: 1.36 ms
Wall time: 1.34 ms
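
For intuition about what a sparse retriever scores, here is a minimal sketch of one common BM25 variant in plain Python. It is not the implementation behind `BM25Retriever`; `bm25_scores` is a hypothetical helper, tokenization is a plain whitespace split, and `k1=1.5`, `b=0.75` are conventional defaults chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "temporal tracking and forecasting",
    "image segmentation with transformers",
    "forecasting energy demand",
]
scores = bm25_scores("temporal forecasting", corpus)
print(scores)
```

Because scoring relies on exact term matches and precomputed counts, no neural model runs at query time, which is consistent with the very low latency measured above.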


# **3.4 Evaluation**

**Question 3:** Is there a quick way to evaluate your retrieval results? Hint: use the metadata.

### Evaluating Retrieval Results

In [156]:
# first, load the evaluation questions so we can run through all of them at once
# specify data file path
import pandas as pd
folder_path = root_path/"ScipyTutorial2025_RAG/Data/RAG_QA.json"
print("file path:",folder_path)
qa_df = pd.read_json(folder_path)
qa_df.shape

file path: /home/siyulilyqian@gmail.com/ScipyTutorial2025_RAG/Data/RAG_QA.json


(60, 3)

In [179]:
%%time
# Function to run retrieval and extract info
def retrieve_info(question):
    # Replace the retriever with the one you believe works best for this use case
    docs = vectorstore.similarity_search(question, k=2)
   
    
    # Grab text chunks and metadata from top k
    retrieved_texts = [doc.page_content for doc in docs]
    retrieved_sources = [doc.metadata.get("source", "") for doc in docs]
    
    # Join texts to reformat them before sending them to the LLM
    joined_texts = "\n".join(retrieved_texts)
    
    return pd.Series({
        "retrieved_texts": joined_texts,
        "retrieved_sources": retrieved_sources
    })

# Apply retrieval to all questions
retrieval_results = qa_df["question"].apply(retrieve_info)

# Concatenate new columns to the original DataFrame
qa_df_retrieved_result = pd.concat([qa_df, retrieval_results], axis=1)

CPU times: user 741 ms, sys: 3.88 ms, total: 745 ms
Wall time: 743 ms


In [180]:
# Check if ground truth source is in retrieved list
qa_df_retrieved_result["correct_retrieval"] = qa_df_retrieved_result.apply(
    lambda row: row["source"] in row["retrieved_sources"],
    axis=1
)

# Compute Recall@k
retrieval_accuracy = qa_df_retrieved_result["correct_retrieval"].mean()
print(f"Retriever Recall@2: {retrieval_accuracy:.2%}")

Retriever Recall@2: 63.33%
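
The metric itself is easy to reproduce without pandas. The sketch below is a minimal version, assuming one ground-truth source per question (in that case Recall@k is the same as the hit rate); `recall_at_k` and the file names are hypothetical.

```python
def recall_at_k(expected_sources, retrieved_sources, k):
    """Fraction of questions whose ground-truth source appears in the top-k retrieved sources."""
    hits = sum(
        expected in retrieved[:k]
        for expected, retrieved in zip(expected_sources, retrieved_sources)
    )
    return hits / len(expected_sources)


expected = ["a.txt", "b.txt", "c.txt"]
retrieved = [
    ["a.txt", "x.txt"],  # hit at rank 1
    ["y.txt", "b.txt"],  # hit at rank 2
    ["x.txt", "y.txt"],  # miss
]
print(f"Recall@1: {recall_at_k(expected, retrieved, 1):.2%}")
print(f"Recall@2: {recall_at_k(expected, retrieved, 2):.2%}")
```

Slicing with `retrieved[:k]` keeps the metric honest when a retriever returns more than k items.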


### Evaluating Response Results

In [127]:
qa_df_retrieved_result.head()

|   | question | answer | source | retrieved_texts | retrieved_sources | correct_retrieval |
|---|----------|--------|--------|-----------------|-------------------|-------------------|
| 0 | What is One of the promises of AI? | One of the promises of AI is to enhance human ... | 2311.01007v2.txt | This work builds on our previous work in [MSS2... | [2311.01007v2.txt, 2311.01007v2.txt] | True |
| 1 | What is Sound source localization (SSL)? | Sound source localization (SSL) is the task of... | 2311.01052v2.txt | Sound source localization (SSL) is the task of... | [2311.01052v2.txt, 2311.01052v2.txt] | True |
| 2 | What is Markov Decision Process (MDP)? | Markov Decision Process (MDP) is defined by a ... | 2311.01075v1.txt | Markov Decision Process (MDP) is defined by a ... | [2311.01075v1.txt, 2311.02194v1.txt] | True |
| 3 | What is Considering the multimodal nature of t... | Considering the multimodal nature of the PNG t... | 2311.01091v1.txt | Considering the multimodal nature of the PNG t... | [2311.01091v1.txt, 2311.13574v1.txt] | True |
| 4 | What is To enable a safer and more accurate sy... | To enable a safer and more accurate system, on... | 2311.01106v1.txt | Classification:For classification problems, mi... | [2311.03570v1.txt, 2311.03570v1.txt] | False |


In [155]:
# first, ask the LLM to generate a response from the retrieved context
qa_df_retrieved_result["llm_response"] = qa_df_retrieved_result.apply(
    lambda row: prompt_template(row["retrieved_texts"], row["question"]),
    axis=1
)

In [125]:
from sentence_transformers import SentenceTransformer, util
# use a separate name so the LLM's `model` variable is not overwritten
similarity_model = SentenceTransformer("all-mpnet-base-v2")

def compute_cosine_similarity(text_a, text_b):
    # Embed two texts and calculate the similarity score
    emb_a = similarity_model.encode(text_a, convert_to_tensor=True)
    emb_b = similarity_model.encode(text_b, convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()
    return score

In [160]:
qa_df_retrieved_result["response_score"] = qa_df_retrieved_result.apply(
    lambda row: compute_cosine_similarity(row["llm_response"],row["answer"]),
    axis=1
)

In [163]:
def evaluation_template(question: str, reference_answer: str, llm_answer: str):
    # Use an LLM to evaluate how well the llm_answer matches the reference_answer
    prompt = f"""
    You are an expert question-answering assistant.
    
    You are given a question and two answers:
    Question:
    {question}
    Reference Answer:
    {reference_answer}
    LLM-Generated Answer:
    {llm_answer}
    
    Evaluate how similar and correct the LLM's answer is compared to the reference answer.
    Score it from 0 to 1.
    Output the numeric score only.
    """
    out = pipe(prompt, max_new_tokens=100, truncation=True, do_sample=True)[0]
    return out["generated_text"]

In [165]:
%%time
qa_df_retrieved_result["response_score_llm_judge"] = qa_df_retrieved_result.apply(
    # argument order: question, reference answer, LLM-generated answer
    lambda row: evaluation_template(row["question"], row["answer"], row["llm_response"]),
    axis=1
)

CPU times: user 4min 9s, sys: 4.71 s, total: 4min 14s
Wall time: 4min 14s
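
The judge returns generated text rather than a number, so the scores need parsing before you can average them. Here is a minimal sketch, assuming the first number in the text is the score and that anything outside the 0 to 1 range should be treated as invalid; `parse_score` is a hypothetical helper.

```python
import re
from typing import Optional


def parse_score(generated_text: str) -> Optional[float]:
    """Extract the first number from the judge's output; return None if it is missing or out of range."""
    match = re.search(r"\d*\.?\d+", generated_text)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 1.0 else None


for raw in [" 0.8", "Score: 1", "n/a", "8 out of 10"]:
    print(repr(raw), "->", parse_score(raw))
```

Rows that come back as `None` are worth inspecting by hand, since they show where the judge ignored the requested output format.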


# **3.4 Gradio App**


A Gradio app is a Python-powered interface that lets users interactively demo and test models through customizable input and output components.

With your RAG pipeline in place, you’re all set to start chatting with your LLM-powered assistant!









In [None]:
def retrieve(question):
  # swap in your own retriever here
  chunks = vectorstore.similarity_search(question,k=2)
  # collect the text of each retrieved chunk
  chunk_texts = [chunk.page_content for chunk in chunks]
  # join them into one Markdown block, separated by horizontal rules
  context = "\n\n---\n\n".join(chunk_texts)
  return context

In [None]:
import gradio as gr
def rag_chat(question: str):
    # 1) get context
    context = retrieve(question)
    # 2) generate answer
    answer = prompt_template(context,question)
    # return both to the UI
    return context, answer
# ── 3) Build and launch the app ──
iface = gr.Interface(
    fn=rag_chat,
    inputs=gr.Textbox(lines=2, placeholder="Ask anything…"),
    outputs=[
        gr.Markdown(label="Retrieved Context"),
        gr.Textbox(label="Answer")
    ],
    title="Simple RAG Demo",
    description="Enter a question, see the retrieved context, and the LLM's answer."
)

if __name__ == "__main__":
    iface.launch(share=True)

# **3.5 Advanced Section**




## **3.5.1 Hybrid Retrieval**

Hybrid retrieval combines traditional keyword-based search (e.g., BM25) with vector-based semantic search to surface results that are both lexically and conceptually relevant.
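
One widely used way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). The sketch below is a plain-Python illustration, not a statement about how any particular library combines results; `reciprocal_rank_fusion` is a hypothetical helper, document IDs are placeholder letters, and `c=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, c=60):
    """Merge several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (c + rank); documents found by several lists add up.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


sparse_ranking = ["A", "B", "C"]  # e.g. from a keyword retriever
dense_ranking = ["C", "A", "D"]   # e.g. from a vector retriever
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))  # ['A', 'C', 'B', 'D']
```

RRF only uses rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.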

In [151]:
from langchain.retrievers import EnsembleRetriever


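# Note: EnsembleRetriever's documented fields are `retrievers`, `weights`, `c`, and `id_key`.
# `strategy` and `k` are kept from the original cell, but they are not documented fields and
# appear to be ignored, so each underlying retriever's own k (2 here) decides how many
# documents come back, and the merged list can hold more than 2.
hybrid_retriever = EnsembleRetriever(
    retrievers=[sparse_retriever,vector_retriever],
    strategy="merge",
    k=2,
)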

# Function to run retrieval and extract info
def retrieve_info(question):
    # Replace the retriever with the one you believe works best for this use case
    docs = hybrid_retriever.invoke(question)
    
    # Grab text chunks and metadata from top k
    retrieved_texts = [doc.page_content for doc in docs]
    retrieved_sources = [doc.metadata.get("source", "") for doc in docs]
    
    # Join texts to reformat them before sending them to the LLM
    joined_texts = "\n".join(retrieved_texts)
    
    return pd.Series({
        "retrieved_texts": joined_texts,
        "retrieved_sources": retrieved_sources
    })

# Apply retrieval to all questions
retrieval_results = qa_df["question"].apply(retrieve_info)

# Concatenate new columns to the original DataFrame
qa_df_retrieved_result = pd.concat([qa_df, retrieval_results], axis=1)
# Check if ground truth source is in retrieved list
qa_df_retrieved_result["correct_retrieval"] = qa_df_retrieved_result.apply(
    lambda row: row["source"] in row["retrieved_sources"],
    axis=1
)

# Compute Recall@k
retrieval_accuracy = qa_df_retrieved_result["correct_retrieval"].mean()
print(f"Retriever Recall@2: {retrieval_accuracy:.2%}")

Retriever Recall@2: 78.33%


**Question:** Which combination method does this hybrid retriever use?

## **3.5.2 Cross-Encoder Reranker**

A reranker is a secondary model that takes the top-N candidates from an initial retrieval stage and assigns them more precise relevance scores to produce a refined ranking.

In this section, we’ve provided the code for a cross-encoder reranker. Feel free to explore it and try out different models.
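
The two-stage pattern is independent of the model used for scoring. Here is a minimal sketch with a toy scorer standing in for the cross-encoder; `rerank` and `word_overlap` are hypothetical helpers, and word overlap is only a placeholder for a learned relevance score.

```python
def rerank(question, candidates, score_fn, top_k=2):
    """Rescore every candidate against the question and keep the best top_k."""
    scored = [(score_fn(question, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def word_overlap(question, text):
    # Toy relevance score: number of distinct words shared by question and text.
    return len(set(question.lower().split()) & set(text.lower().split()))


candidates = [
    "markov decision process definition",
    "sound source localization estimates where a sound originates",
    "localization of objects in images",
]
print(rerank("what is sound source localization", candidates, word_overlap, top_k=2))
```

A reranker can only reorder what the first stage retrieved, so the initial pool should be larger than the final `top_k`.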










In [169]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# load the reranker under its own names so the LLM's `tokenizer` and `model` are not overwritten
reranker_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = reranker_model.to("cuda:0" if torch.cuda.is_available() else "cpu")
reranker_model.eval()

def cross_encoder_rerank(question: str, doc: str) -> float:
    # score a single (question, document) pair; a higher logit means more relevant
    pairs = [[question, doc]]
    with torch.no_grad():
        inputs = reranker_tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512,
        ).to(reranker_model.device)
        scores = reranker_model(**inputs).logits.view(-1).float()
    return scores.item()

**Question**: Plug the reranker into your current RAG pipeline. Is the reranker’s result better than the initial retrieval result?

In [185]:
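# Note: `k` is not a documented EnsembleRetriever field (its fields are `retrievers`,
# `weights`, `c`, and `id_key`), so the size of the candidate pool is most likely set by
# each underlying retriever's own k. To rerank a pool of about 10, raise k on
# sparse_retriever and vector_retriever when you create them.
hybrid_retriever = EnsembleRetriever(
    retrievers=[sparse_retriever,vector_retriever],
    strategy="merge",
    k=10,
)
def retrieve_info(question: str, top_k: int = 2) -> pd.Series:
    # 1. get an initial pool of candidates from the hybrid retriever
    docs = hybrid_retriever.invoke(question)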

    # 2. rerank each hit with your cross-encoder
    scored = []
    for doc in docs:
        score = cross_encoder_rerank(question, doc.page_content)
        scored.append((doc, score))

    # 3. sort by descending rerank score and truncate to your final top_k
    scored = sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
    docs = [doc for doc, _ in scored]

    # 4. extract for your downstream pipeline
    retrieved_texts = [d.page_content for d in docs]
    retrieved_sources = [d.metadata.get("source","") for d in docs]

    return pd.Series({
        "retrieved_texts": "\n".join(retrieved_texts),
        "retrieved_sources": retrieved_sources
    })

In [None]:
# Apply retrieval to all questions
retrieval_results = qa_df["question"].apply(retrieve_info)

# Concatenate new columns to the original DataFrame
qa_df_retrieved_result = pd.concat([qa_df, retrieval_results], axis=1)
# Check if ground truth source is in retrieved list
qa_df_retrieved_result["correct_retrieval"] = qa_df_retrieved_result.apply(
    lambda row: row["source"] in row["retrieved_sources"],
    axis=1
)

# Compute Recall@k
retrieval_accuracy = qa_df_retrieved_result["correct_retrieval"].mean()
print(f"Retriever Recall@2: {retrieval_accuracy:.2%}")