# SciPy Tutorial 2025 RAG

# **0. Prerequisites: LLM Inference Setup**
---
Before we explore the power of Retrieval-Augmented Generation, let’s first set up our LLM inference endpoint.




***Use an Open-Source LLM***




**Step 1: Install Required Packages**

These packages cover both the LLM setup in this section and the RAG sections that follow.

In [None]:
!pip install transformers accelerate huggingface-hub langchain langchain_huggingface langchain_community langchain-text-splitters sentence-transformers faiss-cpu rank_bm25 gradio

**Step 2: Set Up Google Colab**



Open Google Colab and choose Runtime → Change runtime type. Select a GPU as the hardware accelerator and enable the High-RAM option if it is available to you.
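
After changing the runtime, a quick check like the one below can confirm which device the notebook will actually use. It is a minimal sketch: `pick_device` is a hypothetical helper name, not part of the tutorial's code, and it falls back to `"cpu"` when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is visible to PyTorch, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` on a GPU runtime, the runtime type was probably not saved or the session needs a restart.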









Differences: CPU vs. GPU

| Aspect            | CPU                                                         | GPU                                                      |
|-------------------|-------------------------------------------------------------|----------------------------------------------------------|
| **Function**      | Generalized component that handles main processing functions of a server | Specialized component that excels at parallel computing   |
| **Processing**    | Designed for serial instruction processing                  | Designed for parallel instruction processing             |
| **Design**        | Fewer, more powerful cores                                  | More cores than CPUs, but less powerful than CPU cores   |
| **Best suited for** | General purpose computing applications                    | High-performance computing applications                  |



**Step 3: Setup HuggingFace Token**



1. Go to your Hugging Face account’s [Settings → Access Tokens](https://huggingface.co/settings/tokens) page.
2. Click “New token”, give it a name, and select the “Read” scope (sufficient for this tutorial).
3. Copy the generated token and save it in Colab’s Secrets panel under the name `HF_Token`, which is the name the next cell reads.

In [1]:
from google.colab import userdata
import os
from huggingface_hub import login
os.environ["HF_TOKEN"] = userdata.get("HF_Token")
login(token=os.environ["HF_TOKEN"], new_session=False)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


**Step 4: Instantiating a Text-Generation Pipeline with a Chat-Style Prompt**

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-2b-instruct")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.1-2b-instruct")

Pipelines are an easy way to use models for inference. They offer a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering. Here we use the text-generation task.

In [36]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.1,
    return_full_text=False, # don't return the prompt itself
)

Device set to use cpu


Integrate the LLM inference workflow into a minimal RAG helper function that lets users supply their own context.

In [47]:
def rag_generate(context: str, question: str):
    """
    context: supporting document or knowledge snippet
    question: user’s query
    """
    # build a prompt that clearly separates context from the question
    prompt = f"""
    You are an expert question-answering assistant in a RAG (Retrieval-Augmented Generation) system.
    Use only the information in the CONTEXT to ANSWER the QUESTION.
    CONTEXT:
    {context.strip()}
    QUESTION:
    {question.strip()}
    ANSWER:
    """
    out = pipe(prompt, max_new_tokens=100, truncation=True, do_sample=True)[0]
    return out["generated_text"]

**WITH Context**

With a clearly defined, fact-based context, the LLM can answer this question precisely.

In [48]:
context_input = """
In July 1907, Pablo Picasso unveiled “Les Demoiselles d’Avignon” in his Paris studio.
This groundbreaking canvas (243 cm × 233 cm) depicts five nude female figures with angular,
fragmented forms and faces inspired by African and Iberian masks.
By abandoning traditional single-point perspective, Picasso flattened the pictorial space
and presented multiple viewpoints simultaneously.
The painting’s radical departure from realistic representation laid the groundwork for the
Cubist movement, which Picasso and Georges Braque would develop further in 1908–1914.
"""
user_question = "What are the canvas dimensions of “Les Demoiselles d’Avignon,” and what subject does the painting depict?"

rag_generate(context_input,user_question)

' The canvas dimensions of “Les Demoiselles d’Avignon” are 243 cm (width) × 233 cm (height). The painting depicts five nude female figures.'

**WITHOUT Context**

Without a defined knowledge context, the LLM may hallucinate and provide inaccurate information.

In [49]:
rag_generate("",user_question)

'\nThe painting "Les Demoiselles d’Avignon" by Pablo Picasso has canvas dimensions of 73 x 53 centimeters. The subject of the painting is a group of prostitutes, often referred to as "The Bather" and "The Nude," depicted in a raw and primitive style, marking a significant departure from traditional European art. This work is considered a precursor to Cubism and is renown'

# **1. Load Data**



---



In [52]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [53]:
import glob
import os
# Find all .txt files in that folder
folder_path = '/content/drive/MyDrive/Scipy_Data/selected_files_scipy'
txt_files = glob.glob(os.path.join(folder_path, '*.txt'))
# Read them into a dict, keep track of file names
documents_dict = {}
for fp in txt_files:
    with open(fp, 'r', encoding='utf-8') as f:
        documents_dict[os.path.basename(fp)] = f.read()

In [54]:
# Document lives in langchain_core; the splitter is imported in the chunking section
from langchain_core.documents import Document
docs = [
    Document(page_content=content,metadata={"source": filename})
    for filename, content in documents_dict.items()
]
print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 100


# **2. Chunking**


---



In [None]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"{len(texts)} chunks were created.")

**Question 1:** What observations did you make about fixed-length chunking, and which alternative chunking method would you like to explore next?

In [None]:
# Code Here
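
One alternative to try for Question 1 is a sliding window with overlap, so that text cut at a chunk boundary still appears intact in the neighbouring chunk. The sketch below is plain Python for illustration only; `chunk_with_overlap` and its parameters are hypothetical names, and it works on characters rather than tokens or sentences.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_with_overlap(sample, chunk_size=4, overlap=2))
```

The same idea can be tried in the notebook by raising the `chunk_overlap` argument that the splitter call above currently sets to 0.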

**Question 2:** Measure each chunking strategy's processing latency. Which method runs the fastest, and which one is the slowest? Why is that?

In [None]:
# Code Here
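
For Question 2, a small timing helper keeps the comparison fair by measuring every strategy the same way. This is a sketch under assumptions: `time_call` is a hypothetical helper, it reports the best of a few runs to reduce noise, and `str.split` merely stands in for a real chunking call.

```python
import time
from typing import Any, Callable


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> tuple[Any, float]:
    """Run fn(*args, **kwargs) `repeats` times; return the last result and the best wall-clock time."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in the notebook this could be a splitter's split_documents call.
chunks, seconds = time_call(str.split, "one two three four")
print(f"{len(chunks)} pieces in {seconds:.6f} s")
```

In the notebook, a call such as `time_call(text_splitter.split_documents, docs)` would time the splitter defined above.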

# **3. Indexing**


---



Feel free to explore the wide range of embedding models available on Hugging Face.


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [None]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(texts, hf)
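
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with a brute-force Euclidean search over toy 2-D vectors. It is an illustration in plain Python, not how FAISS is implemented, and `top_k_nearest` and the toy vectors are hypothetical.

```python
import math


def top_k_nearest(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, distance) pairs with the smallest Euclidean distance to query."""
    scored = [(name, math.dist(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


toy_index = {
    "chunk_a": [0.0, 1.0],
    "chunk_b": [1.0, 0.0],
    "chunk_c": [0.9, 0.1],
}
print(top_k_nearest([1.0, 0.0], toy_index, k=2))
```

Real embeddings have hundreds of dimensions, which is why a dedicated index is used instead of a Python loop.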

# **4. Retrieval**


---



In [None]:
retrieved_chunks = vectorstore.similarity_search("What challenge do temporal tracking and forecasting tasks illustrate in machine learning?",k=2)

In [None]:
# check source document
retrieved_chunks[0].metadata

{'source': '2311.06428v2.txt'}

**Question 1:** What code changes are needed to add both a similarity-score threshold and metadata-based filtering on top of your standard “top-k chunk” retriever in a RAG pipeline?

In [None]:
### enter code here
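
As a starting point for Question 1, the post-filtering logic can be prototyped independently of the vector store. The sketch below assumes each hit is a `(text, metadata, distance)` tuple where a lower distance means a closer match; the `Hit` alias, `filter_hits`, the file names, and the sample distances are made up for illustration. With the FAISS store used here, `similarity_search_with_score` returns `(Document, score)` pairs that could be adapted to this shape.

```python
from typing import Optional

Hit = tuple[str, dict, float]  # (chunk text, metadata, distance)


def filter_hits(hits: list[Hit], max_distance: float, allowed_sources: Optional[set[str]] = None) -> list[Hit]:
    """Keep hits within max_distance whose metadata["source"] is allowed (if a set is given)."""
    kept = []
    for text, metadata, distance in hits:
        if distance > max_distance:
            continue  # too far from the query to be trusted
        if allowed_sources is not None and metadata.get("source") not in allowed_sources:
            continue  # comes from a document we want to exclude
        kept.append((text, metadata, distance))
    return kept


hits = [
    ("chunk one", {"source": "paper_a.txt"}, 0.42),
    ("chunk two", {"source": "paper_b.txt"}, 0.95),
    ("chunk three", {"source": "paper_b.txt"}, 0.51),
]
print(filter_hits(hits, max_distance=0.6, allowed_sources={"paper_b.txt"}))
```

A sensible threshold depends on the embedding model and distance metric, so it is worth inspecting a few real scores before fixing a value.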



**Question 2:** What steps are required to plug a sparse retriever into your RAG workflow, replacing the default dense retriever?

In [None]:
from langchain_community.retrievers import BM25Retriever
### enter code here
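
A sparse retriever ranks chunks by term overlap instead of embedding similarity. To make that concrete, here is a minimal BM25 scorer in plain Python. It is a teaching sketch with whitespace tokenization and the common defaults `k1=1.5` and `b=0.75` chosen as assumptions, not the implementation behind `BM25Retriever`.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in corpus against query with Okapi BM25 (non-negative IDF variant)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "dense retrievers embed text into vectors",
    "sparse retrievers such as bm25 match exact terms",
    "gradio builds simple web demos",
]
scores = bm25_scores("bm25 sparse terms", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

Because scoring depends on exact tokens, sparse retrieval tends to do well on rare keywords and identifiers, and less well on paraphrased questions.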

**Question 3:** Is there a quick way to evaluate your retrieval results? Hint: use the chunk metadata.

In [None]:
## enter code here
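
One quick check for Question 3 is a source hit rate: for questions whose source document is known, count how often that document's file name appears in the metadata of the retrieved chunks. The sketch below assumes a `retrieve_sources` callable that returns the `source` values of the top-k chunks; that callable, `hit_rate`, and the toy data are hypothetical.

```python
from typing import Callable


def hit_rate(labelled: dict[str, str], retrieve_sources: Callable[[str], list[str]]) -> float:
    """Fraction of questions whose expected source appears among the retrieved sources."""
    if not labelled:
        return 0.0
    hits = sum(
        1 for question, expected in labelled.items()
        if expected in retrieve_sources(question)
    )
    return hits / len(labelled)


# Toy stand-in for a real retriever: maps a question to the sources it "retrieved".
fake_results = {
    "q1": ["paper_a.txt", "paper_b.txt"],
    "q2": ["paper_c.txt", "paper_a.txt"],
}
labelled = {"q1": "paper_a.txt", "q2": "paper_b.txt"}
print(hit_rate(labelled, lambda q: fake_results[q]))
```

With the vector store from this notebook, the callable could be `lambda q: [c.metadata["source"] for c in vectorstore.similarity_search(q, k=2)]`.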

# **5. Gradio App**


---



With your RAG pipeline in place, you’re all set to start chatting with your LLM-powered assistant!









In [51]:
def retrieve(question):
  #### swap in your own retriever here ####
  chunks = vectorstore.similarity_search(question, k=2)
  # collect the text of each retrieved chunk
  chunk_texts = [chunk.page_content for chunk in chunks]
  # join the chunks into one Markdown block, separated by horizontal rules
  context = "\n\n---\n\n".join(chunk_texts)
  return context

In [None]:
import gradio as gr
def rag_chat(question: str):
    # 1) get context
    context = retrieve(question)
    # 2) generate answer
    answer = rag_generate(context,question)
    # return both to the UI
    return context, answer
# ── 3) Build and launch the app ──
iface = gr.Interface(
    fn=rag_chat,
    inputs=gr.Textbox(lines=2, placeholder="Ask anything…"),
    outputs=[
        gr.Markdown(label="Retrieved Context"),
        gr.Textbox(label="Answer")
    ],
    title="Simple RAG Demo",
    description="Enter a question, see the retrieved context, and the LLM's answer."
)

if __name__ == "__main__":
    iface.launch()

# **6. Advanced Section**


---



## **6.1 Hybrid Retriever**

In [None]:
from langchain.retrievers import EnsembleRetriever
## code here

**Question:** Which combination method does this ensemble/hybrid retriever use?
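
When investigating this question, it helps to know what a rank-based fusion looks like. The sketch below implements reciprocal rank fusion (RRF), one common way to merge ranked lists from a dense and a sparse retriever. It is offered for comparison only: the function name and the constant `c=60` are assumptions, and whether `EnsembleRetriever` works this way is left for you to verify in its documentation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Merge ranked lists of document ids; each appearance adds weight / (c + rank)."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # highest fused score first; sorted() is stable, so ties keep first-seen order
    return sorted(fused, key=fused.get, reverse=True)


dense = ["d1", "d2", "d3"]
sparse = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([dense, sparse], weights=[0.5, 0.5]))
```

RRF only looks at ranks, so it can combine retrievers whose raw scores are on different scales.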

## **6.2 Cross-Encoder Reranker**

In this section, we’ve provided the code for a cross-encoder reranker. Feel free to explore it and try out different models.










In [43]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# load the reranker under its own names so the LLM's `tokenizer` and `model` from Step 4 are not overwritten
reranker_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
reranker_model = reranker_model.to("cuda:0" if torch.cuda.is_available() else "cpu")
reranker_model.eval()

def cross_encoder_rerank(question: str, doc: str) -> float:
    """Return the reranker's relevance logit for one (question, doc) pair; higher means more relevant."""
    pairs = [[question, doc]]
    with torch.no_grad():
        inputs = reranker_tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512,
        ).to(reranker_model.device)
        scores = reranker_model(**inputs).logits.view(-1).float()
    return scores.item()


**Question**: Plug the reranker into your current RAG pipeline. Is the reranker’s result better than the initial retrieval result?

In [None]:
## code here
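
One way to approach this is to over-retrieve, score every candidate against the question, and keep only the best few. The sketch below shows that pattern with a generic scoring callable. `rerank`, `top_n`, and the word-overlap stand-in scorer are hypothetical; in the notebook the scorer would be `cross_encoder_rerank` from the cell above.

```python
from typing import Callable


def rerank(question: str, docs: list[str], score_fn: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    """Return the top_n docs ordered by score_fn(question, doc), highest first."""
    scored = [(score_fn(question, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(question: str, doc: str) -> float:
    """Stand-in scorer: number of distinct words shared by question and doc."""
    return float(len(set(question.lower().split()) & set(doc.lower().split())))


candidates = [
    "gradio builds web demos",
    "a reranker scores each question and document pair",
    "a cross-encoder reads the question and document together",
]
print(rerank("how does a reranker score a question", candidates, overlap_score, top_n=2))
```

To compare against the initial retrieval, retrieve more chunks than you need (for example a larger `k`), rerank them, and check whether the top results change.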

# **References**

https://aws.amazon.com/compare/the-difference-between-gpus-cpus/