# **SciPy Tutorial 2025 RAG**

# **PART 1: LLM Inference Setup**
---
Before we explore the power of Retrieval-Augmented Generation, let’s first set up our LLM inference endpoint. For this tutorial, we’ll be using an open-source LLM.


**Step 1: Launch a GPU instance**

**Nebari**: If you’re using the Nebari platform, be sure to select a GPU instance.

Differences: CPU vs. GPU

| Aspect              | CPU                                                                     | GPU                                                     |
|---------------------|-------------------------------------------------------------------------|---------------------------------------------------------|
| **Function**        | General-purpose component that handles a server’s main processing tasks | Specialized component that excels at parallel computing |
| **Processing**      | Designed for serial instruction processing                              | Designed for parallel instruction processing            |
| **Design**          | Fewer, more powerful cores                                              | Many more cores, each less powerful than a CPU core     |
| **Best suited for** | General-purpose computing applications                                  | High-performance computing applications                 |
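
Once the instance is running, it can be worth confirming that a GPU is actually visible before loading a model. The following is a minimal sketch, assuming PyTorch is the installed backend; the helper name `describe_accelerator` is made up for illustration.

```python
def describe_accelerator() -> str:
    """Report which device PyTorch would use, without failing if torch is missing."""
    try:
        import torch  # assumption: PyTorch is the backend behind transformers here
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU visible, running on CPU"


print(describe_accelerator())
```

If the message reports CPU only, the model in Step 2 should still load, but generation will likely be much slower.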



**Step 2: Instantiating a Text-Generation Pipeline with a Chat-Style Prompt**

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-2b-instruct")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.1-2b-instruct")

Fetching 2 files: 100%|██████████| 2/2 [00:57<00:00, 28.61s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00,  5.98s/it]


Pipelines are an easy way to use models for inference. They offer a simple API for many tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, question answering, and the text generation used here.

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.1,
    return_full_text=False, # don't return the prompt itself
)

Device set to use cuda:0


Wrap the LLM inference call in a minimal helper function that fills a prompt template, so that users can supply their own context.

In [None]:
def prompt_template(context: str, question: str):
    """
    context: supporting document or knowledge snippet
    question: user’s query
    """
    # build a prompt that clearly separates context from the question
    prompt = f"""
    You are an expert question-answering assistant in a RAG (Retrieval-Augmented Generation) system.
    Use only the information in the CONTEXT to ANSWER the QUESTION.
    CONTEXT:
    {context.strip()}
    QUESTION:
    {question.strip()}
    ANSWER:
    """
    out = pipe(prompt, max_new_tokens=100, truncation=True, do_sample=True)[0]
    return out["generated_text"]
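
Because `prompt_template` builds the prompt and calls the model in one step, the final prompt text is never shown. As a sketch, the string-building part could be pulled into its own function so it can be printed and checked without a GPU. `build_prompt` is a hypothetical name, not part of the tutorial code, and it omits the leading indentation that the triple-quoted string above carries.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble the RAG prompt as plain text (hypothetical helper, no model call)."""
    lines = [
        "You are an expert question-answering assistant in a RAG (Retrieval-Augmented Generation) system.",
        "Use only the information in the CONTEXT to ANSWER the QUESTION.",
        "CONTEXT:",
        context.strip(),
        "QUESTION:",
        question.strip(),
        "ANSWER:",
    ]
    # One instruction or field per line, with no stray indentation
    return "\n".join(lines)


print(build_prompt("Paris is the capital of France.", "What is the capital of France?"))
```

Printing the prompt this way makes it easy to see exactly what the model receives when the context is empty versus filled in.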

**Without Context**

Without a defined knowledge context, the LLM may hallucinate and provide inaccurate information.

In [None]:
user_question = "What are the canvas dimensions of “Les Demoiselles d’Avignon,” and what subject does the painting depict?"
prompt_template("",user_question)

'\nThe painting "Les Demoiselles d’Avignon" by Pablo Picasso has canvas dimensions of 73 x 53 cm (28.7 x 21 inches). The subject of the painting is a group of prostitutes, often referred to as "the whore of Babylon," depicted in a brothel setting. This work is considered a groundbreaking piece in the development of Cubism, as it'

**With Context**

With a clearly defined, fact-based context, the LLM can answer this question precisely.

In [None]:
context_input = """
In July 1907, Pablo Picasso unveiled “Les Demoiselles d’Avignon” in his Paris studio.
This groundbreaking canvas (243 cm × 233 cm) depicts five nude female figures with angular,
fragmented forms and faces inspired by African and Iberian masks.
By abandoning traditional single-point perspective, Picasso flattened the pictorial space
and presented multiple viewpoints simultaneously.
The painting’s radical departure from realistic representation laid the groundwork for the
Cubist movement, which Picasso and Georges Braque would develop further in 1908–1914.
"""
user_question = "What are the canvas dimensions of “Les Demoiselles d’Avignon,” and what subject does the painting depict?"
prompt_template(context_input,user_question)

' The canvas dimensions of “Les Demoiselles d’Avignon” are 243 cm (width) × 233 cm (height). The painting depicts five nude female figures.'

# **PART 2: Load Data**


---



In this tutorial, we’ll use 100 scientific papers as our knowledge base. These are real arXiv papers from computer science and AI research, forming a subset of the [SPIQA](https://huggingface.co/datasets/google/spiqa) dataset.
Unzip the downloaded data by running `unzip scientific_papers.zip` in your terminal.

In [None]:
from pathlib import Path
# Find the root path (two levels above the current working directory)
current_path = Path.cwd()
root_path = current_path.parents[1]
print("parent path:", root_path)
# Specify the data folder path
folder_path = root_path / "ScipyTutorial2025_RAG" / "Data" / "scientific_papers"
print("file path:", folder_path)

parent path: /home/siyulilyqian@gmail.com
file path: /home/siyulilyqian@gmail.com/ScipyTutorial2025_RAG/Data/scientific_papers


In [None]:
import glob
import os
txt_files = glob.glob(os.path.join(folder_path, '*.txt'))
# Read them into a dict, keep track of file names
documents_dict = {}
for fp in txt_files:
    with open(fp, 'r', encoding='utf-8') as f:
        documents_dict[os.path.basename(fp)] = f.read()

In [None]:
from langchain.schema import Document
# Convert each entry in documents_dict into a Document object
docs = [
    Document(page_content=content,metadata={"source": filename})
    for filename, content in documents_dict.items()
]
print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 100


# **PART 3: RAG**



---



# **3.1 Chunking**

Chunking refers to the process of splitting a larger document into smaller, more manageable “chunks” of text before embedding and retrieval.




In [None]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="", chunk_size=2000,chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)} of chunks are created.")

2004 of chunks are created.
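
To make the fixed-length strategy concrete, here is a library-free sketch of character chunking with an optional overlap. It illustrates the idea only and is not the implementation inside `CharacterTextSplitter`; `chunk_text` and its parameters are names chosen for this example.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into fixed-size character chunks, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

With `overlap=0`, consecutive chunks share no characters, as with `chunk_overlap=0` above. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when a sentence is cut at a chunk boundary.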


**Question 1:** What observations did you make about fixed-length chunking, and which alternative chunking method would you like to explore next?

In [None]:
# Code Here

**Question 2:** Measure each chunking strategy's processing latency. Which method runs the fastest, and which one is the slowest? Why is that?

In [None]:
# Code Here

# **3.2 Embedding**


Embedding and indexing are the steps that turn text chunks into a searchable vector database. **Embedding** converts pieces of text into high-dimensional numeric vectors that capture their semantic meaning.
**Indexing** stores those vectors in a specialized data structure—or “index”—that supports fast similarity search.
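
The idea behind similarity search can be sketched in plain Python. The three-dimensional vectors below are made up for illustration (real embedding models produce hundreds of dimensions), and the brute-force loop stands in for what a vector index does far more efficiently. Cosine similarity is one common choice of metric, not necessarily the one your index uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "index": text mapped to a made-up embedding vector
toy_index = {
    "chunk about cats": [0.9, 0.1, 0.0],
    "chunk about dogs": [0.7, 0.6, 0.1],
    "chunk about taxes": [0.0, 0.1, 0.9],
}


def search(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


print(search([1.0, 0.0, 0.0], toy_index, k=2))
# ['chunk about cats', 'chunk about dogs']
```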

Feel free to explore the wide range of embedding models available on Hugging Face.


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cuda'}  # run the embedding model on the GPU
encode_kwargs = {'normalize_embeddings': False}  # keep the raw, unnormalized vectors
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [None]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(texts, hf)

# **3.3 Retrieval**

Retrieval refers to the process of finding and fetching the most relevant chunks (pieces of text) from your indexed knowledge base to serve as context for your LLM.

In [None]:
retrieved_chunks = vectorstore.similarity_search("What challenge do temporal tracking and forecasting tasks illustrate in machine learning?",k=2)

In [None]:
# check source document
retrieved_chunks[0].metadata

**Question 1:** What code changes are needed to add both a similarity-score threshold and metadata-based filtering on top of your standard “top-k chunk” retriever in a RAG pipeline?

In [None]:
### enter code here



**Question 2:** What steps are required to plug a sparse retriever into your RAG workflow, replacing the default dense retriever?

In [None]:
from langchain_community.retrievers import BM25Retriever
### enter code here

**Question 3:** Is there a quick way to evaluate your retrieval results? Hint: use the chunk metadata.

In [None]:
## enter code here

# **3.4 Gradio App**


A Gradio app is a Python-powered interface that lets users interactively demo and test models through customizable input and output components.

With your RAG pipeline in place, you’re all set to start chatting with your LLM-powered assistant!









In [None]:
def retrieve(question):
    #### swap in your retriever here ####
    chunks = vectorstore.similarity_search(question, k=2)
    # collect the text of each retrieved chunk
    chunk_texts = [chunk.page_content for chunk in chunks]
    # join the chunks into one Markdown block, separated by horizontal rules
    context = "\n\n---\n\n".join(chunk_texts)
    return context

In [None]:
import gradio as gr
def rag_chat(question: str):
    # 1) get context
    context = retrieve(question)
    # 2) generate answer
    answer = prompt_template(context,question)
    # return both to the UI
    return context, answer
# ── 3) Build and launch the app ──
iface = gr.Interface(
    fn=rag_chat,
    inputs=gr.Textbox(lines=2, placeholder="Ask anything…"),
    outputs=[
        gr.Markdown(label="Retrieved Context"),
        gr.Textbox(label="Answer")
    ],
    title="Simple RAG Demo",
    description="Enter a question, see the retrieved context, and the LLM's answer."
)

if __name__ == "__main__":
    iface.launch(share=True)

# **3.5 Advanced Section**




## **3.5.1 Hybrid Retrieval**

Hybrid retrieval combines traditional keyword-based search (e.g., BM25) with vector-based semantic search to surface results that are both lexically and conceptually relevant.
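
One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), where each document earns `weight / (c + rank)` from every list it appears in. The sketch below shows the technique in plain Python. The function name, the document ids, and the constant `c = 60` (a conventional default) are assumptions for illustration, and this is not a claim about how any particular library retriever combines its results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document ids into one ranking."""
    if weights is None:
        weights = [1.0] * len(rankings)  # treat every retriever equally
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of a list contribute more
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_ranking = ["a", "b", "c"]  # e.g. from a sparse retriever
vector_ranking = ["c", "a", "d"]   # e.g. from a dense retriever
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['a', 'c', 'b', 'd']
```

Because only ranks are used, the raw scores of the two retrievers never need to be on the same scale.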

In [None]:
from langchain.retrievers import EnsembleRetriever
## code here

**Question:** Which combination method does this hybrid retriever use?

## **3.5.2 Cross-Encoder Reranker**

A reranker is a secondary model that takes the top-N candidates from an initial retrieval stage and assigns them more precise relevance scores to produce a refined ranking.
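
The two-stage flow can be sketched independently of any model: take a candidate list, score each candidate against the question, then keep the best few. In the sketch below, `overlap_score` is a deliberately crude stand-in scorer (shared-word count) so the example runs without downloading anything; in practice a cross-encoder score would take its place. All names here are hypothetical.

```python
from typing import Callable, List


def overlap_score(question: str, doc: str) -> float:
    """Stand-in relevance score: number of distinct words shared by question and doc."""
    question_words = set(question.lower().split())
    doc_words = set(doc.lower().split())
    return float(len(question_words & doc_words))


def rerank(
    question: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 2,
) -> List[str]:
    """Sort candidates by score_fn (highest first) and keep the top_n."""
    ordered = sorted(candidates, key=lambda doc: score_fn(question, doc), reverse=True)
    return ordered[:top_n]


candidates = ["cats sleep a lot", "dogs bark loudly", "why do cats purr"]
print(rerank("why do cats purr", candidates, overlap_score, top_n=2))
# ['why do cats purr', 'cats sleep a lot']
```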

In this section, we’ve provided the code for a cross-encoder reranker. Feel free to explore it and try out different models.










In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the reranker under its own names so the Part 1 LLM `tokenizer` and `model` are not overwritten
reranker_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = reranker_model.to("cuda:0" if torch.cuda.is_available() else "cpu")
reranker_model.eval()  # inference mode (disables dropout)

def cross_encoder_rerank(question: str, doc: str) -> float:
    """Return the raw relevance logit for one (question, document) pair; higher means more relevant."""
    pairs = [[question, doc]]
    with torch.no_grad():
        inputs = reranker_tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512,
        ).to(reranker_model.device)
        scores = reranker_model(**inputs).logits.view(-1).float()
    return scores.item()


**Question**: Plug the reranker into your current RAG pipeline. Is the reranker’s result better than the initial retrieval result?

In [None]:
## code here