# RAG: Retrieval-Augmented Generation

RAG is needed because large language models don’t have access to up-to-date or private information and can hallucinate when they lack facts. By retrieving relevant documents at query time and grounding the generation on them, RAG makes answers more accurate.


### I. Data Indexing
1. Parse the source documents (PDF) into plain text
2. Split the text into chunks
3. Vectorize text chunks and store in vector database

---

### II. Data Retrieval & Generation

1. Vectorize the user query
2. Find the Top-K most similar chunks
3. Send the user query and the Top-K chunks to an auto-regressive LLM
4. Obtain the response
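
The "Find Top-K" step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming cosine similarity as the ranking measure and using made-up 3-dimensional vectors. The helper names `cosine_similarity` and `top_k` are hypothetical and are not part of the notebook, where the vector database does this work.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings", invented for illustration only.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.2, 0.0]

print(top_k(query_vec, chunk_vecs, k=2))  # → [1, 0]
```

A real vector store typically avoids scoring every chunk by using an approximate nearest-neighbour index, but the ranking idea is the same.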



![RAG flow](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*FhMJ8OE_PoeOyeAavYjzlw.png)

In [28]:
import os
import glob
import tiktoken
import numpy as np
import gradio as gr
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import SystemMessage, HumanMessage

In [2]:
MODEL = "gpt-4.1-nano"
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")


OpenAI API Key exists and begins sk-proj-


## Parse the PDF files

In [3]:
import fitz  # PyMuPDF

# Parse the two PDFs separately
with fitz.open("ImageXpress Nano User Guide.pdf") as nano_pdf:
    text_nano = ""
    for page in nano_pdf:
        text_nano += page.get_text("text") + "\n"

with fitz.open("MetaXpress Image Acquisition Guide v5.0.pdf") as metaxpress_pdf:
    text_meta = ""
    for page in metaxpress_pdf:
        text_meta += page.get_text("text") + "\n"

print(f"Nano guide: {len(text_nano):,} chars")
print(f"MetaXpress guide: {len(text_meta):,} chars")

Nano guide: 241,196 chars
MetaXpress guide: 92,342 chars


In [6]:
print(text_meta[5000:6000])

 . . . . . . . . . . . . . . . . . 78

5020933 A
5 
1
Introduction
The MetaXpress® High Content Image Acquisition & Analysis Software 
is divided into two major parts: 
•
Acquisition, which involves configuring settings, acquiring 
images, and storing plate data in a database. For information 
about image acquisition, see the user guide for the 
ImageXpress® Micro Widefield High Content Screening System 
or the ImageXpress® Ultra Confocal High Content Screening 
System. Both user guides are provided on the MetaXpress 
Software installation media and are available in the Molecular 
Devices knowledge base at 
http://www.moleculardevices.com/support.html.
•
Analysis, which consists of selecting, measuring, assessing, and 
managing acquired images and plate data.
This manual describes the general analysis workflow:

Introduction
6 
5020933 A 
Obtaining Support
Molecular Devices provides a wide range of support for the MetaXpress 
Software:
1.
Documentation — Check the manuals that are incl

## Divide Documents into chunks

In [7]:
encoding = tiktoken.encoding_for_model(MODEL)
tokens = encoding.encode(text_nano)
token_count = len(tokens)
print(f"Total tokens for {MODEL}: {token_count:,}")

Total tokens for gpt-4.1-nano: 52,933
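
The token count above covers only the Nano guide (`text_nano`). Combined with the character count printed earlier, it gives a rough characters-per-token ratio, which helps translate a character-based chunk size into tokens. The sketch below copies both figures from the cell outputs. The 1000-character chunk size is the value passed to the splitter in the next cell, and the resulting estimate is only approximate because the ratio varies across the text.

```python
# Figures copied from the cell outputs above (Nano guide only).
nano_chars = 241_196
nano_tokens = 52_933

chars_per_token = nano_chars / nano_tokens

# The splitter in the next cell measures chunk_size in characters.
chunk_size_chars = 1000
approx_tokens_per_chunk = chunk_size_chars / chars_per_token

print(f"{chars_per_token:.2f} chars/token")                           # → 4.56 chars/token
print(f"~{approx_tokens_per_chunk:.0f} tokens per 1000-char chunk")   # → ~219 tokens per 1000-char chunk
```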


In [8]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = splitter.create_documents(
    texts=[text_nano, text_meta],
    metadatas=[
        {"source": "ImageXpress Nano User Guide"},
        {"source": "MetaXpress Image Acquisition Guide"}
    ]
)

print(f"Total chunks: {len(chunks)}")
print(f"First chunk source: {chunks[0].metadata['source']}")
print(f"First chunk content:\n{chunks[0].page_content[:200]}")

Total chunks: 480
First chunk source: ImageXpress Nano User Guide
First chunk content:
5058342 B
October 2017
ImageXpress® Nano 
Automated Imaging System
With MetaXpress Software
User Guide
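
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks), and the function name `sliding_window_chunks` is hypothetical. It only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.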


## Vectorize chunks and store in the vector DB (Chroma)

Vectorizing the text chunks places both queries and documents in the same semantic space, which lets us efficiently find chunks that are similar in meaning rather than just keyword-matched.
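
"Similar" here is usually measured with cosine similarity or a distance between the query vector $q$ and a chunk vector $d$. The relation below is standard vector algebra. Whether the embedding model returns unit-length vectors, and which distance the vector store uses by default, are assumptions worth checking against the library documentation.

```latex
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
\qquad\text{and, if } \lVert q \rVert = \lVert d \rVert = 1:\qquad
\lVert q - d \rVert^{2} = 2 - 2\,(q \cdot d)
```

So for unit-length embeddings, ranking by smallest squared Euclidean distance gives the same order as ranking by largest cosine similarity.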

In [9]:
import shutil

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

db_name = "nano_manual_chroma_db"

if os.path.exists(db_name):
    shutil.rmtree(db_name)

# Use from_documents so the metadata is preserved
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=db_name
)

print(f"Vector store created with {vector_store._collection.count()} documents")

Vector store created with 480 documents


In [13]:
collection = vector_store._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"Embedding dimensions for the all-MiniLM-L6-v2 model: {dimensions}")

Embedding dimensions for the all-MiniLM-L6-v2 model: 384


## Data Retrieval & Generation

In [14]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0, model_name=MODEL)

In [18]:
from IPython.display import display, Markdown

In [None]:
# Try a query without vector retrieval
Markdown(llm.invoke(
    "Explain how MetaXpress software obtain z-series 2D projection Image."
).content)

MetaXpress software obtains z-series 2D projection images through a process called z-stack imaging combined with image projection techniques. Here's an overview of how this process works:

1. **Acquisition of Z-Stack Images:**
   - The software controls a microscope equipped with a motorized stage and a focus drive.
   - It captures a series of 2D images (slices) at different focal depths along the z-axis, creating a z-stack. Each slice corresponds to a specific z-position within the sample.

2. **Image Processing and Alignment:**
   - The individual z-slices are processed to correct for any drift or misalignment.
   - This ensures that the images are properly registered for accurate projection.

3. **Projection to 2D:**
   - The z-stack images are combined into a single 2D projection image.
   - MetaXpress offers various projection methods, such as:
     - **Maximum Intensity Projection (MIP):** Takes the brightest pixel value from each stack position, highlighting the most intense features.
     - **Average Intensity Projection:** Computes the average pixel value across all slices.
     - **Sum Projection:** Adds pixel values across slices, useful for quantifying total signal.
   - The choice of projection method depends on the specific imaging goal.

4. **Generation of the Z-Projection Image:**
   - The software processes the selected projection method to produce a single 2D image that summarizes the 3D information contained in the z-stack.
   - This projection facilitates easier analysis and visualization of structures within the sample.

**In summary:** MetaXpress software acquires a series of images at different depths (z-stack), then applies a projection algorithm (like maximum intensity projection) to generate a 2D image that represents the combined information from the entire z-series. This approach allows researchers to analyze complex 3D structures in a simplified 2D format.

In [None]:
# Now try with retrieval
retriever = vector_store.as_retriever(search_kwargs={"k": 10})
retriever.invoke("Explain how MetaXpress software obtain z-series 2D projection Image.")

[Document(id='ae0e7758-2ca6-40a7-ae33-9bc482101e74', metadata={'source': 'ImageXpress Nano User Guide'}, page_content='2D\nProjection\nImage\nAvailable for all Z Series acquisition options other than Single Plane indicates how\nthe resulting 2D projection image is generated.\nBest Focus is the default value. The MetaXpress Software estimates the regions\nof best focus in the image stack to within one-tenth pixel accuracy along the Z\naxis. Two resolution grid sizes are used to enhance the criterion of focus through\nthe stack.\nMaximum is recommended only for fluorescence. For each corresponding pixel\nposition in the images, the pixel that has the highest intensity value out of all the\nplanes is determined, and this is the value that is output to the resulting image.\nMinimum is recommended only for transmitted light. For each corresponding\npixel position in the images, the pixel that has the lowest intensity value out of all\nthe planes is determined, and this is the value that is 

In [24]:
SYSTEM_PROMPT_TEMPLATE = """You are a helpful AI assistant specialized in answering questions based on provided documentation.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know. Don't try to make up an answer.
{context}
"""

In [25]:
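def answer_question(question: str, history):
    """Answer `question` from retrieved chunks; `history` is supplied by gr.ChatInterface but unused."""
    # Retrieve the top-k chunks and join them into a single context string
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    # The context goes into the system prompt; the question is sent as the human message
    system_prompt = SYSTEM_PROMPT_TEMPLATE.format(context=context)
    response = llm.invoke([SystemMessage(content=system_prompt), HumanMessage(content=question)])
    return response.content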
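
The function above joins only `page_content`, so the `source` metadata attached during chunking never reaches the model. One possible variation, sketched below, tags each chunk with its source so the model can tell the two manuals apart. `Doc` is a hypothetical stand-in for the retrieved document objects, `format_context` is an illustrative helper, and the sample strings are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs):
    """Join chunks into one context string, tagging each with its source."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n".join(parts)

# Placeholder chunks for illustration.
docs = [
    Doc("Best Focus is the default value.", {"source": "ImageXpress Nano User Guide"}),
    Doc("Analysis consists of selecting and measuring images.", {"source": "MetaXpress Image Acquisition Guide"}),
]
print(format_context(docs))
```

In the notebook, the same idea would mean replacing the `"\n\n".join(...)` line with a call to a helper like this one.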

In [29]:
gr.ChatInterface(answer_question).launch()

* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.



