diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md new file mode 100644 index 000000000..598baad28 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/1_rag.md @@ -0,0 +1,126 @@ +--- +title: Understanding RAG on Grace–Blackwell (GB10) +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## What is RAG? + +This module provides the conceptual foundation for how Retrieval-Augmented Generation operates on the ***Grace–Blackwell*** (GB10) platform before you begin building the system in the next steps. + +**Retrieval-Augmented Generation (RAG)** combines information retrieval with language-model generation. +Instead of relying solely on pre-trained weights, a RAG system retrieves relevant text from a document corpus and passes it to a language model to create factual, context-aware responses. + +Typical pipeline: + +User Query ─> Embedding ─> Vector Search ─> Context ─> Generation ─> Answer + +Each stage in this pipeline plays a distinct role in transforming a user’s question into an accurate, context-aware response: + +* ***Embedding model*** (e.g., E5-base-v2): Converts text into dense numerical vectors. +* ***Vector database*** (e.g., FAISS): Searches for semantically similar chunks. +* ***Language model*** (e.g., Llama 3.1 8B Instruct – GGUF Q8_0): Generates an answer conditioned on retrieved context. + +More information about RAG system and the challenges of building them can be found in this [learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/copilot-extension/1-rag/) + + +## Why Grace–Blackwell (GB10)? + +The Grace–Blackwell (GB10) platform combines Arm-based Grace CPUs with NVIDIA Blackwell GPUs, forming a unified architecture optimized for large-scale AI workloads. + +Its unique CPU–GPU co-design and Unified Memory enable seamless data exchange, making it an ideal foundation for Retrieval-Augmented Generation (RAG) systems that require both fast document retrieval and high-throughput language model inference. + +The GB10 platform integrates: +- ***Grace CPU (Arm v9.2)*** – 20 cores (10 × Cortex-X925 + 10 × Cortex-A725) +- ***Blackwell GPU*** – CUDA 13.0 Tensor Core architecture +- ***Unified Memory (128 GB NVLink-C2C)*** – Shared address space between CPU and GPU. The shared NVLink-C2C interface allows both processors to access the same 128 GB Unified Memory region without copy operations — a key feature validated later in Module 4. + +Benefits for RAG: +- ***Hybrid execution*** – Grace CPU efficiently handles embedding, indexing, and API orchestration. +- ***GPU acceleration*** – Blackwell GPU performs token generation with low latency. +- ***Unified memory*** – Eliminates CPU↔GPU copy overhead; tensors and document vectors share the same memory region. +- ***Open-source friendly*** – Works natively with PyTorch, FAISS, Transformers, and FastAPI. 
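+
+To make this flow concrete before looking at the full architecture, here is a minimal sketch of the retrieve-then-generate loop in Python. It is illustrative only: the two-sentence corpus, the index, and the endpoint URL are assumptions for this sketch (a llama.cpp server listening on port 8000, as used later in this Learning Path), and the real, validated pipeline is built step by step in the next modules.
+
+```python
+# Minimal retrieve-then-generate sketch (illustrative only).
+# Assumes E5-base-v2 is downloadable from Hugging Face, FAISS is installed,
+# and a llama.cpp REST server is already running on http://127.0.0.1:8000.
+import numpy as np
+import faiss
+import requests
+from sentence_transformers import SentenceTransformer
+
+embedder = SentenceTransformer("intfloat/e5-base-v2")   # Embedding model (runs on the Grace CPU)
+
+docs = [
+    "The Grace CPU handles embedding, indexing, and orchestration.",
+    "The Blackwell GPU accelerates token generation through llama.cpp.",
+]
+index = faiss.IndexFlatL2(768)                          # 768 = E5-base-v2 embedding dimension
+index.add(np.asarray(embedder.encode(docs), dtype=np.float32))
+
+def answer(question: str) -> str:
+    q_vec = np.asarray(embedder.encode([question]), dtype=np.float32)
+    _, ids = index.search(q_vec, 1)                     # Retrieve the most similar chunk
+    context = docs[ids[0][0]]
+    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
+    resp = requests.post("http://127.0.0.1:8000/completion",
+                         json={"prompt": prompt, "n_predict": 64})
+    return resp.json().get("content", "")
+
+print(answer("Which processor generates the tokens?"))
+```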
+ +## Conceptual Architecture + +``` + ┌─────────────────────────────────────┐ + │ User Query │ + └──────────────┬──────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ Embedding (E5) │ + │ → FAISS (CPU) │ + └────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ Context Builder │ + │ (Grace CPU) │ + └────────────────────┘ + │ + ▼ + ┌───────────────────────────────────────────────┐ + │ llama.cpp (GGUF Model, Q8_0) │ + │ -ngl 40 --ctx-size 8192 │ + │ Grace CPU + Blackwell GPU (split compute) │ + └───────────────────────────────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ FastAPI Response │ + └────────────────────┘ + +``` + +To make the concept concrete, this learning path will later demonstrate a small **engineering assistant** example. +The assistant retrieves technical references (e.g., datasheet, programming guide or application note) and generates helpful explanations for software developers. +This use case illustrates how a RAG system can provide **real, contextual knowledge** without retraining the model. + +| **Stage** | **Technology / Framework** | **Hardware Execution** | **Function** | +|------------|-----------------------------|--------------------------|---------------| +| **Document Processing** | pypdf, text preprocessing scripts | Grace CPU | Converts PDFs and documents into plain text, performs cleanup and segmentation. | +| **Embedding Generation** | E5-base-v2 via sentence-transformers | Grace CPU | Transforms text into semantic vector representations for retrieval. | +| **Semantic Retrieval** | FAISS + LangChain | Grace CPU | Searches the vector index to find the most relevant text chunks for a given query. | +| **Text Generation** | llama.cpp REST Server (GGUF model) | Blackwell GPU + Grace CPU | Generates natural language responses using the Llama 3 model, accelerated by GPU inference. | +| **Pipeline Orchestration** | Python (RAG Query Script) | Grace CPU | Coordinates embedding, retrieval, and generation via REST API calls. | +| **Unified Memory Architecture** | Unified LPDDR5X Shared Memory | Grace CPU + Blackwell GPU | Enables zero-copy data sharing between CPU and GPU for improved latency and efficiency. | + + +## Prerequisites Check + +In the following content, I am using [EdgeXpert](https://ipc.msi.com/product_detail/Industrial-Computer-Box-PC/AI-Supercomputer/EdgeXpert-MS-C931), a product from [MSI](https://www.msi.com/index.php). + +Before proceeding, verify that your GB10 system meets the following: + +Run the following commands to confirm your hardware environment: + +```bash +# Check Arm CPU architecture +lscpu | grep "Architecture" + +# Confirm visible GPU and driver version +nvidia-smi +``` + +Expected output: +- ***Architecture***: aarch64 +- ***CUDA Version***: 13.0 (or later) +- ***Driver Version***: 580.95.05 + +{{% notice Note %}} +If your software version is lower than the one mentioned above, it’s recommended to upgrade the driver before proceeding with the next steps. +{{% /notice %}} + +## Wrap-up + +In this module, you explored the foundational concepts of **Retrieval-Augmented Generation (RAG)** and how it benefits from the **Grace–Blackwell (GB10)** architecture. +You examined how the **Grace CPU** and **Blackwell GPU** collaborate through **Unified Memory**, enabling seamless data sharing and hybrid execution for AI workloads. + +With the conceptual architecture and hardware overview complete, you are now ready to begin hands-on implementation. 
+In the next module, you will **set up the development environment**, install the required dependencies, and verify that both the **E5-base-v2** embedding model and **Llama 3.1 8B Instruct** LLM run correctly on the **Grace–Blackwell** platform. + +This marks the transition from **theory to practice** — moving from RAG concepts to building your own **hybrid CPU–GPU pipeline** on Grace–Blackwell. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md new file mode 100644 index 000000000..13b660c40 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/2_rag_steup.md @@ -0,0 +1,489 @@ +--- +title: Setting Up and Validating the RAG Foundation +weight: 3 +layout: "learningpathall" +--- + +## Setting Up and Validating the RAG Foundation + +In the previous session, you verified that your **DGX Spark (GB10)** system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. + +This module prepares the software and data foundation that enables the RAG workflow in later stages. + +In this module, you will: +- Set up and validate the core environment for the RAG pipeline. +- Load and test the **E5-base-v2** embedding model. +- Build a local **FAISS** index for document retrieval. +- Prepare and verify the **Llama 3.1 8B Instruct** model for text generation. +- Confirm GPU acceleration and overall system readiness. + +## Step 1 - Create The Development Environment + +```bash +# Create and activate a virtual environment +cd ~ +python3 -m venv rag-venv +source rag-venv/bin/activate + +# Upgrade pip and install base dependencies +pip install --upgrade pip +pip install torch --index-url https://download.pytorch.org/whl/cpu +pip install transformers==4.46.2 sentence-transformers==2.7.0 faiss-cpu langchain==1.0.5 \ + langchain-community langchain-huggingface huggingface_hub \ + pypdf tqdm numpy +``` + +**Why these packages?** +These libraries provide the essential building blocks of the RAG system: +- **sentence-transformers** — used for text embedding with the E5-base-v2 model. +- **faiss-cpu** — enables efficient similarity search for document retrieval. Since this pipeline runs on the Grace CPU, the CPU version of FAISS is sufficient — GPU acceleration is not required for this stage. +- **LangChain** — manages data orchestration between embedding, retrieval, and generation. +- **huggingface_hub** — handles model download and authentication. +- **pypdf** — extracts and processes text content from documents. +- **tqdm** — provide progress visualization. + + +Check installation: +```bash +python - <<'EOF' +import faiss, transformers +print("FAISS version:", faiss.__version__) +print("FAISS GPU:", faiss.get_num_gpus() > 0) +EOF +``` + +The output confirms that FAISS is running in CPU mode (FAISS GPU: False), which is expected for this setup. +``` +FAISS version: 1.12.0 +FAISS GPU: False +``` + +## Step 2 – Model Preparation + +Download and organize the models required for the **GB10 Local RAG Pipeline**: + +- **LLM (Large Language Model)** — llama-3-8b-instruct for text generation. +- **Embedding Model** — E5-base-v2 for document vectorization. + +Both models will be stored locally under the `~/models` directory for offline operation. 
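+
+The Q8_0 GGUF weights alone are roughly 8–9 GB, so before downloading, it is worth confirming that enough free disk space is available (the prerequisites assume at least 15 GB):
+
+```bash
+# Check free disk space in your home directory before downloading the models
+df -h ~
+```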
+
+```bash
+mkdir -p ~/models && cd ~/models
+
+# Log in with your Hugging Face token
+hf auth login
+hf download intfloat/e5-base-v2 --local-dir ~/models/e5-base-v2
+
+# Download the GGUF version of the Llama 3.1 8B model to avoid a local conversion step
+wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -P ~/models/Llama-3.1-8B-gguf
+```
+
+### Verify the **E5-base-v2** model
+
+Run the following Python script to verify that the **E5-base-v2** model loads correctly and can generate embeddings.
+
+```python
+from sentence_transformers import SentenceTransformer
+import numpy as np
+import os
+
+model_path = os.path.expanduser("~/models/e5-base-v2")
+print(f"Loading model from: {model_path}")
+
+try:
+    model = SentenceTransformer(model_path)
+    sentences = [
+        "Arm processors are designed for high efficiency.",
+        "The Raspberry Pi uses Arm cores for its SoC."
+    ]
+    embeddings = model.encode(sentences)
+
+    if isinstance(embeddings, np.ndarray) and embeddings.shape[0] == len(sentences):
+        print(" Model loaded and embeddings generated successfully.")
+        print("Embedding shape:", embeddings.shape)
+        print("First vector snippet:", np.round(embeddings[0][:10], 4))
+    else:
+        print(" Model loaded, but embedding output seems incorrect.")
+except Exception as e:
+    print(f" Model failed to load or generate embeddings: {e}")
+```
+
+The expected output confirms that the E5-base-v2 model can generate embeddings successfully:
+```
+ Model loaded and embeddings generated successfully.
+Embedding shape: (2, 768)
+First vector snippet: [-0.012 -0.0062 -0.0008 -0.0014 0.026 -0.0066 -0.0173 0.026 -0.0238
+ -0.0455]
+```
+
+Interpreting the E5-base-v2 result:
+
+- ***Test sentences***: The two example sentences confirm that the model can process text input and generate embeddings correctly. If this step succeeds, the model’s tokenizer, encoder, and PyTorch runtime on the Grace CPU are all working together properly.
+- ***Embedding shape (2, 768)***: The two sentences were converted into two 768-dimensional embedding vectors — 768 is the hidden dimension size of this model.
+- ***First vector snippet***: Displays the first 10 values of the first embedding vector. Each number represents a learned feature extracted from the text.
+
+A successful output confirms that the ***E5-base-v2 embedding model*** is functional and ready for use on the Grace CPU.
+
+
+### Verify the **Llama 3.1 8B** model
+
+Next, verify the GGUF model.
+
+The **llama.cpp** runtime will be used for text generation.
+Ensure that both the **CPU** and **GPU** builds have been installed by following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/).
+
+Perform a quick verification test on `Meta-Llama-3.1-8B-Instruct-Q8_0.gguf`:
+
+```bash
+cd ~/llama.cpp/build-gpu
+
+./bin/llama-cli \
+  -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
+  -p "Hello from RAG user" \
+  -ngl 40 --n-predict 64
+```
+
+You should see the model load successfully and print a short generated sentence, for example:
+
+```
+Hello from this end! What brings you to this chat? Do you have any questions or topics you'd like to discuss? I'm here to help!
+```
+
+Next, check the ***REST server***, which the RAG pipeline will use in the next module.
+ +```bash +./bin/llama-server \ + -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \ + -ngl 40 --ctx-size 8192 \ + --port 8000 \ + --host 0.0.0.0 +``` + +Use another terminal in the same machine to do the health checking: + +```bash +curl http://127.0.0.1:8000/completion \ + -H "Content-Type: application/json" \ + -d '{"prompt": "Explain why unified memory improves CPU–GPU collaboration.", "n_predict": 64}' +``` + +A short JSON payload containing a coherent explanation generated by the model. + +{{% notice Note %}} +To test remote access from another machine, replace `127.0.0.1` with the GB10 IP address. +{{% /notice %}} + + +## Step 3 – Prepare a Sample Document Corpus + +Prepare the text corpus that your **RAG system** will use for retrieval and reasoning. +This stage converts your raw knowledge documents into clean, chunked text segments that can later be **vectorized and indexed** by FAISS. + +### Create a workspace and data folder +We’ll use a consistent directory layout so later scripts can find your data easily. + +```bash +mkdir -p ~/rag && cd ~/rag +mkdir pdf text +``` + +List all the source PDF URLs into a file, one per line. +In this learning path, we collect all of Raspberry Pi datasheet links into file called `datasheet.txt` + +``` +https://datasheets.raspberrypi.com/cm/cm1-and-cm3-datasheet.pdf +https://datasheets.raspberrypi.com/cm/cm3-plus-datasheet.pdf +https://datasheets.raspberrypi.com/cm4/cm4-datasheet.pdf +https://datasheets.raspberrypi.com/cm4io/cm4io-datasheet.pdf +https://datasheets.raspberrypi.com/cm4s/cm4s-datasheet.pdf +https://datasheets.raspberrypi.com/pico/pico-2-datasheet.pdf +https://datasheets.raspberrypi.com/pico/pico-datasheet.pdf +https://datasheets.raspberrypi.com/picow/pico-2-w-datasheet.pdf +https://datasheets.raspberrypi.com/picow/pico-w-datasheet.pdf +https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf +https://datasheets.raspberrypi.com/rp2350/rp2350-datasheet.pdf +https://datasheets.raspberrypi.com/rpi4/raspberry-pi-4-datasheet.pdf +``` + +Use `wget` to batch download all of pdf into `~/rag/pdf` +```bash +wget -P ~/rag/pdf -i datasheet.txt +``` + +### Convert PDF into txt file + +Then, create a python file `pdf2text.py` + +```python +from pypdf import PdfReader +import glob, os + +pdf_root = os.path.expanduser("~/rag/pdf") +txt_root = os.path.expanduser("~/rag/text") +os.makedirs(txt_root, exist_ok=True) + +count = 0 +for file in glob.glob(os.path.join(pdf_root, "**/*.pdf"), recursive=True): + print(f"File processing {file}") + try: + reader = PdfReader(file) + text = "\n".join(page.extract_text() or "" for page in reader.pages) + + rel_path = os.path.relpath(file, pdf_root) + txt_path = os.path.join(txt_root, os.path.splitext(rel_path)[0] + ".txt") + os.makedirs(os.path.dirname(txt_path), exist_ok=True) + + with open(txt_path, "w", encoding="utf-8") as f: + f.write(text) + + count += 1 + print(f"Converted: {file} -> {txt_path}") + + except Exception as e: + print(f"Error processing {file}: {e}") + +print(f"\nTotal converted PDFs: {count}") +print(f"Output directory: {txt_root}") +``` + +The resulting text files will form the base corpus for semantic retrieval in later steps. + +Run the Python script to convert all PDFs into text files. + +```bash +python pdf2text.py +``` + +This script converts all PDFs into text files for later embedding. 
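+
+Datasheet PDFs often extract with page headers, footers, and long runs of blank lines. If you want slightly cleaner chunks, you can run an optional whitespace-normalization pass over `~/rag/text` before indexing. The sketch below is an optional suggestion, not a required step; the file name `clean_text.py` is only an assumed example.
+
+```python
+# Optional cleanup pass over the extracted text files (suggested name: clean_text.py).
+# It only normalizes whitespace; skip it if your extractions are already clean.
+import glob, os, re
+
+txt_root = os.path.expanduser("~/rag/text")
+
+for path in glob.glob(os.path.join(txt_root, "**/*.txt"), recursive=True):
+    with open(path, "r", encoding="utf-8", errors="ignore") as f:
+        text = f.read()
+    text = re.sub(r"[ \t]+\n", "\n", text)    # strip trailing spaces on each line
+    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
+    with open(path, "w", encoding="utf-8") as f:
+        f.write(text)
+    print(f"Cleaned: {os.path.relpath(path, txt_root)}")
+```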
+ +### Verify your corpus +You should now see something like this in your folder: +```bash +find ~/rag/text/ -type f -name "*.txt" -exec cat {} + | wc -l +``` + +It will show how many line in total. + + +## Step 4 – Build an Embedding and Search Index + +Convert your prepared text corpus into **vector embeddings** and store them in a **FAISS index** for efficient semantic search. + +This stage enables your RAG pipeline to retrieve the most relevant text chunks when users ask questions. + +| **Component** | **Role** | +|--------------|------------------------------| +| **SentenceTransformer (E5-base-v2)** | Generates vector embeddings for each text chunk | +| **LangChain + FAISS** | Stores and searches embeddings efficiently | +| **RecursiveCharacterTextSplitter** | Splits long documents into manageable text chunks | + +Use **E5-base-v2** to encode the documents and create a FAISS vector index. + +### Create the FAISS builder script + +Save the following as `build_index.py` in `~/rag` + +```bash +mkdir -p ~/rag/faiss_index +``` + +The embedding process (about 10 minutes on CPU) will batch every 100 text chunks for progress logging. + +```python +import os, glob +from tqdm import tqdm + +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_community.vectorstores import FAISS +from langchain_core.documents import Document +from langchain_text_splitters import RecursiveCharacterTextSplitter + +# Paths +data_dir = os.path.expanduser("~/rag/text") +model_dir = os.path.expanduser("~/models/e5-base-v2") +index_dir = os.path.expanduser("~/rag/faiss_index") + +os.makedirs(index_dir, exist_ok=True) + +# Load embedding model (CPU only) +embedder = HuggingFaceEmbeddings( + model_name=model_dir, + model_kwargs={"device": "cpu"} +) + +print(f" Embedder loaded on: {embedder._client.device}") +print(f" Model path: {model_dir}") + +# Collect and split all text files (recursive) +docs = [] +splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100) + +print("\n Scanning and splitting text files...") +for path in glob.glob(os.path.join(data_dir, "**/*.txt"), recursive=True): + with open(path, "r", encoding="utf-8", errors="ignore") as f: + text = f.read() + if not text.strip(): + continue + rel_path = os.path.relpath(path, data_dir) + for chunk in splitter.split_text(text): + docs.append(Document(page_content=chunk, metadata={"source": rel_path})) + +print(f" Total chunks loaded: {len(docs)}") + +# Prepare inputs for embedding +texts = [d.page_content for d in docs] +metadatas = [d.metadata for d in docs] + +""" +# Full embedding with progress logging every 100 chunks +print("\n Embedding text chunks (batch log every 100)...") +embeddings = [] +for i, chunk in enumerate(texts): + embedding = embedder.embed_documents([chunk])[0] + embeddings.append(embedding) + if (i + 1) % 100 == 0 or (i + 1) == len(texts): + print(f" Embedded {i + 1} / {len(texts)} chunks") +""" +# Batch embedding +embeddings = [] +batch_size = 16 +for i in range(0, len(texts), batch_size): + batch_texts = texts[i:i+batch_size] + batch_embeddings = embedder.embed_documents(batch_texts) + embeddings.extend(batch_embeddings) + print(f" Embedded {i + len(batch_texts)} / {len(texts)}") + +# Pair (text, embedding) for FAISS +text_embeddings = list(zip(texts, embeddings)) + +print("\n Saving FAISS index...") +db = FAISS.from_embeddings( + text_embeddings, + embedder, + metadatas=metadatas +) +db.save_local(index_dir) +print(f"\n FAISS index saved to: {index_dir}") +``` + + +**Run it:** +```bash +python 
build_index.py +``` + +The script will process the corpus, load approximately 6,000 text chunks, and save the resulting FAISS index to: `~/rag/faiss_index` + +You will find two of files inside. +- ***index.faiss*** + - A binary file that stores the vector index built using ***FAISS***. + - It contains the actual embeddings and data structures used for ***efficient similarity search*** (e.g., L2 distance, cosine). + - This file enables fast retrieval of nearest neighbors for any given query vector. +- ***index.pkl*** + - A ***Pickle*** file that stores metadata and original document chunks. + - It maps each vector in index.faiss back to its ***text content and source info*** (e.g., file name). + - Used by LangChain to return human-readable results along with context. + +You can verify the FAISS index using the following script. + +```python +import os +from langchain_community.vectorstores import FAISS +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_core.documents import Document + +model_path = os.path.expanduser("~/models/e5-base-v2") +index_path = os.path.expanduser("~/rag/faiss_index") + +embedder = HuggingFaceEmbeddings(model_name=model_path) +db = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True) + +query = "raspberry pi 4 power supply" +results = db.similarity_search(query, k=3) + +for i, r in enumerate(results, 1): + print(f"\nResult {i}") + print(f"Source: {r.metadata.get('source')}") + print(r.page_content[:300], "...") + +query = "Use SWD debug Raspberry Pi Pico" +results = db.similarity_search(query, k=3) + +for i, r in enumerate(results, 4): + print(f"\nResult {i}") + print(f"Source: {r.metadata.get('source')}") + print(r.page_content[:300], "...") +``` + +The results will look like the following: + +``` +Result 1 +Source: cm4io-datasheet.txt +Raspberry Pi Compute Module 4 IO Board. We recommend budgeting 9W for CM4. +If you want to supply an external +5V supply to the board, e.g. via J20 or via PoE J9, then we recommend that L5 be +removed. Removing L5 will prevent the on-board +5V and +3.3V supplies from starting up and +5V coming out of ... + +Result 2 +Source: cm4io-datasheet.txt +power the CM4. There is also an on-board +12V to +3.3V DC-DC converter PSU which is only used for the PCIe slot. The ++12V input feeds the +12V PCIe slot, the external PSU connector and the fan connector directly. If these aren’t being +used then a wider input supply is possible (+7.5V to +28V). +With ... + +Result 3 +Source: cm4io-datasheet.txt +that Raspberry Pi 4 Model B has, and for general usage you should refer to the Raspberry Pi 4 Model B documentation . +The significant difference between CM4IO and Raspberry Pi 4 Model B is the addition of a single PCIe socket. The +CM4IO has been designed as both a reference design for CM4 or to be u ... + +Result 4 +Source: pico-datasheet.txt +mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code +without any button presses. The SWD port can also be used to interactively debug code running on the RP2040. +Raspberry Pi Pico Datasheet +Chapter 1. About Raspberry Pi Pico 4 +Getting started ... + +Result 5 +Source: pico-2-datasheet.txt +mass storage device), or the standard Serial Wire Debug (SWD) port can reset the system and load and run code +without any button presses. The SWD port can also be used to interactively debug code running on the RP2350. + TIP +Getting started with Raspberry Pi Pico-series walks through loading progra ... 
+ +Result 6 +Source: pico-w-datasheet.txt +without any button presses. The SWD port can also be used to interactively debug code running on the RP2040. +Getting started with Pico W +The Getting started with Raspberry Pi Pico-series book walks through loading programs onto the +board, and shows how to install the C/C++ SDK and build the example ... +``` + +The execution of `check_index.py` confirmed that your local ***FAISS vector index*** is functioning correctly for semantic search tasks. + +You performed two distinct queries targeting different product lines within the Raspberry Pi ecosystem: ***Raspberry Pi 4 power supply*** and ***Raspberry Pi Pico SWD debugging***. + +- For the first query, ***raspberry pi 4 power supply***, the system returned three highly relevant results, all sourced from the `cm4io-datasheet.txt` file. These passages provided technical guidance on power requirements, supply voltage ranges, and hardware configurations specific to the Compute Module 4 IO Board. This indicates that the embeddings captured the correct semantic intent, and that the FAISS index correctly surfaced content even when specific keywords like ***power supply*** appeared in varied contexts. + +- For the second query, ***Use SWD debug Raspberry Pi Pico***, the search retrieved top results from all three relevant datasheets in the Pico family: `pico-datasheet.txt`, `pico-2-datasheet.txt`, and `pico-w-datasheet.txt`. +The extracted passages consistently explained how the ***Serial Wire Debug (SWD)*** port allows developers to reset the system, load and run code without manual input, and perform interactive debugging on the RP2040 or RP2350 microcontrollers. This demonstrates that your chunking and indexing pipeline accurately retained embedded debugging context, and that metadata mapping correctly links each result to its original source document. + +This process validates that your system can perform semantic retrieval on technical documents — a core capability of any RAG application. + +In summary, both semantic queries were successfully answered using your local vector store, validating that the indexing, embedding, metadata, and retrieval components of your RAG backend are working correctly in a CPU-only configuration. + + +| **Stage** | **Technology** | **Hardware Execution** | **Function** | +|------------|----------------|------------------------|---------------| +| Document Processing | pypdf, python-docx | Grace CPU | Text extraction | +| Embedding | E5-base-v2 (sentence-transformers) | Grace CPU | Vectorization | +| Retrieval | FAISS + LangChain | Grace CPU | Semantic search | +| Generation | llama.cpp REST Server | Blackwell GPU + Grace CPU | Text generation | +| Orchestration | Python RAG Script | Grace CPU | Pipeline control | +| Unified Memory | NVLink-C2C | Shared | Zero-copy data exchange | + +At this point, your environment is fully configured and validated. +You have confirmed that the E5-base-v2 embedding model, FAISS index, and Llama 3.1 8B model are all functioning correctly. + +In the next module, you will integrate all these validated components into a full **Retrieval-Augmented Generation (RAG)** pipeline, combining CPU-based retrieval and GPU-accelerated generation on the ***Grace–Blackwell (GB10)*** platform. 
\ No newline at end of file
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md
new file mode 100644
index 000000000..e71d91f38
--- /dev/null
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/3_rag_pipeline.md
@@ -0,0 +1,199 @@
+---
+title: Implementing the RAG Pipeline
+weight: 4
+layout: "learningpathall"
+---
+
+## Integrating Retrieval and Generation
+
+In the previous modules, you prepared the environment, validated the ***E5-base-v2*** embedding model, and verified that the ***Llama 3.1 8B*** Instruct model runs successfully on the ***Grace–Blackwell (GB10)*** platform.
+
+In this module, you will bring all components together to build a complete ***Retrieval-Augmented Generation*** (RAG) workflow.
+This stage connects the ***CPU-based retrieval and indexing*** with ***GPU-accelerated language generation***, creating an end-to-end system capable of answering technical questions using real documentation data.
+
+Building upon the previous modules, you will now:
+- Connect the **E5-base-v2** embedding model and FAISS vector index.
+- Integrate the **llama.cpp** REST server for GPU-accelerated inference.
+- Execute a complete **Retrieval-Augmented Generation** (RAG) workflow for end-to-end question answering.
+
+### Step 1 – Start the llama.cpp REST Server
+
+Before running the RAG query script, ensure the LLM server is active.
+
+```bash
+cd ~/llama.cpp/build-gpu/
+./bin/llama-server \
+  -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
+  -ngl 40 --ctx-size 8192 \
+  --port 8000 --host 0.0.0.0
+```
+
+Verify the server status from another terminal:
+```bash
+curl http://127.0.0.1:8000/health
+```
+
+Expected output:
+```
+{"status":"ok"}
+```
+
+
+### Step 2 – Create the RAG Query Script
+
+This script performs the full pipeline:
+
+***query*** → ***embedding*** → ***retrieval*** → ***context assembly*** → ***generation***
+
+Save the following script as `rag_query_rest.py` under `~/rag/`.
+
+```python
+import os
+import requests
+from langchain_community.vectorstores import FAISS
+from langchain_huggingface import HuggingFaceEmbeddings
+
+# --- Paths ---
+index_path = os.path.expanduser("~/rag/faiss_index")
+model_path = os.path.expanduser("~/models/e5-base-v2")
+LLAMA_URL = "http://127.0.0.1:8000/completion"
+
+# --- Load Embedding Model & FAISS Index ---
+embedder = HuggingFaceEmbeddings(model_name=model_path, model_kwargs={"device": "cpu"})
+db = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True)
+
+def rag_query(question, top_k=3, max_new_tokens=256):
+    # Step 1: Retrieve documents
+    results = db.similarity_search(question, k=top_k)
+    context = "\n\n".join([r.page_content for r in results])
+
+    print("\nRetrieved sources:")
+    for i, r in enumerate(results, 1):
+        print(f"{i}. {r.metadata.get('source', 'unknown')}")
+
+    # Step 2: Construct prompt
+    prompt = f"""You are a helpful engineering assistant.
+Use the following context to answer the question.
+
+Context:
+{context}
+
+Question:
+{question}
+
+Answer:"""
+
+    # Step 3: Call llama.cpp REST Server
+    payload = {"prompt": prompt, "n_predict": max_new_tokens, "temperature": 0.2}
+    try:
+        resp = requests.post(LLAMA_URL, json=payload, timeout=300)
+        data = resp.json()
+        return data.get("content", data)
+    except Exception as e:
+        print(f"llama.cpp server error or invalid response: {e}")
+
+if __name__ == "__main__":
+    answer = rag_query("How many CPU core inside the RaspberryPi 4?")
+    # answer = rag_query("On the Raspberry Pi 4, which GPIOs have a default pull-down (pull low) configuration? Please specify the source and the section of the datasheet where this information can be found.")
+    print("\n=== RAG Answer ===\n")
+    print(answer)
+```
+
+### Step 3 – Execute the RAG Query Script
+
+Run the Python script to ask the first question defined in the script, ***How many CPU core inside the RaspberryPi 4?***
+
+```bash
+python rag_query_rest.py
+```
+
+You will receive an answer similar to the following.
+
+```
+Retrieved sources:
+1. cm4-datasheet.txt
+2. raspberry-pi-4-datasheet.txt
+3. cm4s-datasheet.txt
+
+=== RAG Answer ===
+
+ 4
+The Raspberry Pi 4 has 4 CPU cores.
+```
+
+The retrieved context referenced three datasheets and produced the correct answer: "4".
+
+Next, ask a more Raspberry Pi 4 hardware-specific question, for example the default pull setting of GPIO12. Change the query line to:
+
+`answer = rag_query("On Raspberry Pi 4, what's the default pull of GPIO12?")`
+
+Run the script again:
+
+```
+Retrieved sources:
+1. cm3-plus-datasheet.txt
+2. raspberry-pi-4-datasheet.txt
+3. cm4s-datasheet.txt
+
+=== RAG Answer ===
+
+ Low
+Step 1: The question asks about the default pull state of GPIO12 on a Raspberry Pi 4.
+Step 2: To answer this question, we need to refer to the provided table, which lists the default pin pull state and available alternate GPIO functions for the Raspberry Pi 4.
+Step 3: Specifically, we are looking for the default pull state of GPIO12. We can find this information in the table by locating the row corresponding to GPIO12.
+Step 4: The table shows that GPIO12 has a default pull state of Low.
+Step 5: Therefore, the default pull of GPIO12 on a Raspberry Pi 4 is Low.
+
+The final answer is: $\boxed{Low}$
+```
+
+For a more detailed query, comment out the active question line and uncomment the second one already provided in the script:
+
+`answer = rag_query("On the Raspberry Pi 4, which GPIOs have a default pull-down (pull low) configuration? Please specify the source and the section of the datasheet where this information can be found.")`
+
+```
+Retrieved sources:
+1. raspberry-pi-4-datasheet.txt
+2. cm4-datasheet.txt
+3. cm3-plus-datasheet.txt
+
+=== RAG Answer ===
+
+The GPIOs with a default pull-down (pull low) configuration are:
+- GPIO 9 (SPI0 MISO)
+- GPIO 10 (SPI0 MOSI)
+- GPIO 11 (SPI0 SCLK)
+- GPIO 12 (PWM0)
+- GPIO 13 (PWM1)
+- GPIO 14 (TXD0)
+- GPIO 15 (RXD0)
+- GPIO 16 (FL0)
+- GPIO 17 (FL1)
+- GPIO 19 (PCM FS)
+
+Source: Table 5: Raspberry Pi 4 GPIO Alternate Functions, section 5.1.2 GPIO Alternate Functions.
+```
+
+This demonstrates that the RAG system correctly retrieved relevant sources and generated the right answer using both CPU retrieval and GPU inference.
+
+You can check section 5.1.2 of the Raspberry Pi 4 datasheet PDF to verify the result.
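+
+If you want to spot-check retrieval quality across several questions at once, you can reuse the `rag_query()` function in a small loop. The sketch below is an optional helper (a suggested name would be `rag_batch_check.py`): it assumes you run it from `~/rag` so that `rag_query_rest.py` is importable, that the llama.cpp server is still running, and that the question list is only an example.
+
+```python
+# Optional batch spot-check (suggested name: rag_batch_check.py, run from ~/rag).
+# Assumes rag_query_rest.py sits in the same directory and llama-server is running.
+from rag_query_rest import rag_query
+
+questions = [
+    "How many CPU cores are inside the Raspberry Pi 4?",
+    "What input voltage range does the CM4 IO Board accept?",
+    "Which debug interface does the Raspberry Pi Pico expose?",
+]
+
+for q in questions:
+    print("=" * 60)
+    print("Question:", q)
+    print("Answer:", rag_query(q, top_k=3, max_new_tokens=128))
+```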
+
+### Step 4 – CPU–GPU Utilization Observation
+
+Following the previous [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/), you can also install `htop` and `nvtop` to observe CPU and GPU utilization.
+
+![image1 CPU–GPU Utilization screenshot](rag_utilization.jpeg "CPU–GPU Utilization")
+
+The figure above illustrates how the ***Grace CPU*** and ***Blackwell GPU*** collaborate during ***RAG*** execution.
+On the left, the GPU utilization graph shows a clear spike reaching ***96%***, indicating that the llama.cpp inference engine is actively generating tokens on the GPU.
+Meanwhile, on the right, the htop panel shows multiple Python processes (rag_query_rest.py) running on a single Grace CPU core, maintaining around 93% per-core utilization.
+
+This demonstrates the hybrid execution model of the RAG pipeline:
+- The Grace CPU handles embedding computation, FAISS retrieval, and orchestration of REST API calls.
+- The Blackwell GPU performs heavy matrix multiplications for LLM token generation.
+- Both operate concurrently within the same Unified Memory space, eliminating data copy overhead between CPU and GPU.
+
+You have now connected all components of the RAG pipeline on the ***Grace–Blackwell*** (GB10) platform.
+The ***Grace CPU*** handled ***embedding*** and ***FAISS retrieval***, while the ***Blackwell GPU*** generated answers efficiently via the llama.cpp REST server.
+
+With the RAG pipeline now complete, the next module focuses on Unified Memory behavior: you will observe how the CPU and GPU share data seamlessly within the same memory space.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md
new file mode 100644
index 000000000..fb2c79097
--- /dev/null
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/4_rag_memory_observation.md
@@ -0,0 +1,209 @@
+---
+title: Observing Unified Memory Collaboration
+weight: 5
+layout: "learningpathall"
+---
+
+## Observing Unified Memory Collaboration
+
+In this module, you will monitor how the ***Grace CPU*** and ***Blackwell GPU*** share data through Unified Memory during RAG execution.
+
+You will start from an idle system state, then progressively launch the model server and run a query, while monitoring both system memory and GPU activity from separate terminals.
+
+Through these real-time observations, you will verify that the Grace–Blackwell Unified Memory architecture enables zero-copy data sharing — allowing both processors to access the same memory space without moving data.
+
+
+| **Terminal** | **Observation Target** | **Purpose** |
+|----------------------|------------------------|----------------------------------------------------|
+| `Monitor Terminal 1` | System memory usage | Observe memory allocation changes as processes run |
+| `Monitor Terminal 2` | GPU activity | Track GPU utilization, power draw, and temperature |
+
+### Step 1 – Experiment Preparation
+
+Ensure the RAG pipeline is stopped before starting the observation.
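+
+A quick way to confirm that nothing is left running from the previous module (the process names below are the ones used earlier in this Learning Path):
+
+```bash
+# List any leftover llama-server or RAG script processes
+pgrep -af "llama-server|rag_query_rest" || echo "No RAG processes running"
+
+# If anything is listed, stop the server before continuing
+pkill -f llama-server
+```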
+
+#### Monitor Terminal 1 – System Memory Observation
+
+```bash
+while true; do
+  echo -n "$(date '+[%Y-%m-%d %H:%M:%S]') "
+  free -h | grep Mem: | awk '{printf "used=%s free=%s available=%s\n", $3, $4, $7}'
+  sleep 1
+done
+```
+
+Example Output:
+```
+[2025-11-07 22:34:24] used=3.5Gi free=106Gi available=116Gi
+[2025-11-07 22:34:25] used=3.5Gi free=106Gi available=116Gi
+[2025-11-07 22:34:26] used=3.5Gi free=106Gi available=116Gi
+[2025-11-07 22:34:27] used=3.5Gi free=106Gi available=116Gi
+```
+
+**Field Explanation:**
+- `used` — Total memory currently utilized by all active processes.
+- `free` — Memory not currently allocated or reserved by the system.
+- `available` — Memory immediately available for new processes, accounting for reclaimable cache and buffers.
+
+#### Monitor Terminal 2 – GPU Status Observation
+
+```bash
+sudo stdbuf -oL nvidia-smi --loop-ms=1000 \
+  --query-gpu=timestamp,utilization.gpu,utilization.memory,power.draw,temperature.gpu,memory.used \
+  --format=csv,noheader,nounits
+```
+
+Example Output (the columns follow the order of the fields passed to `--query-gpu`):
+```
+2025/11/07 22:38:05.114, 0, 0, 4.43, 36, [N/A]
+2025/11/07 22:38:06.123, 0, 0, 4.46, 36, [N/A]
+2025/11/07 22:38:07.124, 0, 0, 4.51, 36, [N/A]
+2025/11/07 22:38:08.124, 0, 0, 4.51, 36, [N/A]
+```
+
+**Field Output Explanation**:
+| **Field** | **Description** | **Interpretation** |
+|----------------------|---------------------------|-----------------------------------------------------------------------------|
+| `timestamp` | Time of data sampling | Used to align GPU metrics with memory log timestamps |
+| `utilization.gpu` | GPU compute activity | Peaks during token generation |
+| `utilization.memory` | GPU DRAM controller usage | Stays at 0% — Unified Memory bypasses the GDDR controller |
+| `power.draw` | GPU power consumption | Rises during inference, falls after completion |
+| `temperature.gpu` | GPU temperature (°C) | Slightly increases during workload, confirming GPU activity |
+| `memory.used` | GPU VRAM usage | GB10 does not include separate VRAM; all data resides within Unified Memory |
+
+
+### Step 2 – Launch the llama-server
+
+Now, start the `llama.cpp` REST server again in your original terminal, following the same flow as in the previous module:
+
+```bash
+cd ~/llama.cpp/build-gpu/
+./bin/llama-server \
+  -m ~/models/Llama-3.1-8B-gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
+  -ngl 40 --ctx-size 8192 \
+  --port 8000 --host 0.0.0.0
+```
+
+Observe both monitoring terminals:
+
+Monitor Terminal 1
+```
+[2025-11-07 22:50:27] used=3.5Gi free=106Gi available=116Gi
+[2025-11-07 22:50:28] used=3.9Gi free=106Gi available=115Gi
+[2025-11-07 22:50:29] used=11Gi free=98Gi available=108Gi
+[2025-11-07 22:50:30] used=11Gi free=98Gi available=108Gi
+[2025-11-07 22:50:31] used=11Gi free=98Gi available=108Gi
+[2025-11-07 22:50:32] used=12Gi free=97Gi available=106Gi
+[2025-11-07 22:50:33] used=12Gi free=97Gi available=106Gi
+```
+
+Monitor Terminal 2
+```
+2025/11/07 22:50:27.836, 0, 0, 4.39, 35, [N/A]
+2025/11/07 22:50:28.836, 0, 0, 6.75, 36, [N/A]
+2025/11/07 22:50:29.837, 6, 0, 11.47, 36, [N/A]
+2025/11/07 22:50:30.837, 7, 0, 11.51, 36, [N/A]
+2025/11/07 22:50:31.838, 6, 0, 11.50, 36, [N/A]
+2025/11/07 22:50:32.839, 0, 0, 11.90, 36, [N/A]
+2025/11/07 22:50:33.840, 0, 0, 10.85, 36, [N/A]
+```
+
+| **Terminal** | **Observation** | **Behavior** |
+|--------------------|------------------------------------------------------|-------------------------------------------------|
+| Monitor Terminal 1 | used increases by ~8 GiB | Model weights loaded 
into shared Unified Memory | +| Monitor Terminal 2 | utilization.gpu momentarily spikes, power.draw rises | GPU initialization and model mapping | + + +This confirms the model is resident in Unified Memory — visible by increased system RAM, but not as GPU VRAM usage. + + +## Step 3 – Execute the RAG Query + +In another terminal (or background session), run: + +```bash +python3 ~/rag/rag_query_rest.py +``` + +Monitor Terminal 1 +``` +[2025-11-07 22:53:56] used=12Gi free=97Gi available=106Gi +[2025-11-07 22:53:57] used=12Gi free=97Gi available=106Gi +[2025-11-07 22:53:58] used=12Gi free=97Gi available=106Gi +[2025-11-07 22:53:59] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:00] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:01] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:02] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:03] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:04] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:05] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:06] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:07] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:08] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:09] used=13Gi free=96Gi available=106Gi +[2025-11-07 22:54:10] used=12Gi free=97Gi available=106Gi +[2025-11-07 22:54:11] used=12Gi free=97Gi available=106Gi +``` + +Monitor Terminal 2 +``` +2025/11/07 22:53:56.010, 0, 0, 11.24, 41, [N/A] +2025/11/07 22:53:57.010, 0, 0, 11.22, 41, [N/A] +2025/11/07 22:53:58.011, 0, 0, 11.20, 41, [N/A] +2025/11/07 22:53:59.012, 0, 0, 11.19, 41, [N/A] +2025/11/07 22:54:00.012, 0, 0, 11.33, 41, [N/A] +2025/11/07 22:54:01.013, 0, 0, 11.89, 41, [N/A] +2025/11/07 22:54:02.014, 96, 0, 31.53, 44, [N/A] +2025/11/07 22:54:03.014, 96, 0, 31.93, 45, [N/A] +2025/11/07 22:54:04.015, 96, 0, 31.98, 45, [N/A] +2025/11/07 22:54:05.015, 96, 0, 32.11, 46, [N/A] +2025/11/07 22:54:06.016, 96, 0, 32.01, 46, [N/A] +2025/11/07 22:54:07.016, 96, 0, 32.03, 46, [N/A] +2025/11/07 22:54:08.017, 96, 0, 32.14, 47, [N/A] +2025/11/07 22:54:09.017, 95, 0, 32.17, 47, [N/A] +2025/11/07 22:54:10.018, 0, 0, 28.87, 45, [N/A] +2025/11/07 22:54:11.019, 0, 0, 11.83, 44, [N/A] +``` + +| **Timestamp** | **GPU Utilization** | **GPU Power** | **System Memory (used)** | **Interpretation** | +|---------------|---------------------|---------------|--------------------------|-------------------------------------------------------| +| 22:53:58 | 0% | 11 W | 12 Gi | System idle | +| 22:54:02 | 96% | 32 W | 13 Gi | GPU performing generation while CPU handles retrieval | +| 22:54:09 | 96% | 32 W | 13 Gi | Unified Memory data sharing in progress | +| 22:54:10 | 0% | 12 W | 12 Gi | Query completed, temporary buffers released | + + +The GPU executes compute kernels (utilization.gpu ≈ 96%) without reading from GDDR or PCIe. + +Hence, `utilization.memory=0` and `memory.used=[N/A]` are the clearest signs that data sharing, not data copying, is happening. + +### Observe and Interpret Unified Memory Behavior: + +This experiment confirms the Grace–Blackwell Unified Memory architecture in action: +- CPU and GPU share the same address space. +- No data transfers occur via PCIe. +- Memory activity remains stable while GPU utilization spikes. + +Data doesn’t move — computation moves to the data. + +The Grace CPU orchestrates retrieval, and the Blackwell GPU performs generation, +both operating within the same Unified Memory pool. 
+ +### Summary of Unified Memory Behavior + +| **Observation** | **Unified Memory Explanation** | +|----------------------------------------------------|----------------------------------------------------------| +| Memory increases once (during model loading) | Model weights are stored in shared Unified Memory | +| Slight memory increase during query execution | CPU temporarily stores context; GPU accesses it directly | +| GPU power increases during computation | GPU cores are actively performing inference | +| No duplicated allocation or data transfer observed | Data is successfully shared between CPU and GPU | + + +In this learning path, you have successfully implemented a ***Retrieval-Augmented Generation*** (RAG) pipeline on the ***Grace–Blackwell*** (GB10) platform and observed how the ***Grace CPU*** and ***Blackwell GPU*** operate together within the same ***Unified Memory*** space — sharing data seamlessly, without duplication or explicit data movement. + +Through this hands-on experiment, you confirmed that: +- The Grace CPU efficiently handles retrieval, embedding, and orchestration tasks. +- The Blackwell GPU accelerates generation using data directly from Unified Memory. +- The system memory and GPU activity clearly demonstrate zero-copy data sharing. + +This exercise highlights how the Grace–Blackwell architecture simplifies hybrid AI development — enabling data to stay in place while computation moves to it, reducing complexity and improving efficiency for next-generation Arm-based AI systems. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md new file mode 100644 index 000000000..498f86e1a --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_index.md @@ -0,0 +1,58 @@ +--- +title: End-to-End RAG Pipeline on Grace–Blackwell (GB10) + +draft: true +cascade: + draft: true + +minutes_to_complete: 60 + +who_is_this_for: This learning path is designed for developers and engineers who want to understand and implement a Retrieval-Augmented Generation (RAG) pipeline optimized for the Grace–Blackwell (GB10) platform. It is ideal for those interested in exploring how Arm-based Grace CPUs manage local document retrieval and orchestration, while Blackwell GPUs accelerate large language model inference through the open-source llama.cpp REST Server. By the end, learners will understand how to build an efficient hybrid CPU–GPU RAG system that leverages Unified Memory for seamless data sharing between computation layers. + +learning_objectives: + - Understand how a RAG system combines document retrieval and language model generation. + - Deploy a hybrid CPU–GPU RAG pipeline on the GB10 platform using open-source tools. + - Use the llama.cpp REST Server for GPU-accelerated inference with CPU-managed retrieval. + - Build a reproducible RAG application that demonstrates efficient hybrid computing. + +prerequisites: + - One NVIDIA DGX Spark system with at least 15 GB of available disk space. + - Follow the previous [Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/) to install both the CPU and GPU builds of llama.cpp. 
+ +author: Odin Shen + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Cortex-X + - Cortex-A +operatingsystems: + - Linux +tools_software_languages: + - Python + - C++ + - Bash + - llama.cpp + +further_reading: + - resource: + title: Nvidia DGX Spark + link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/ + type: website + - resource: + title: Nvidia DGX Spark Playbooks + link: https://github.com/NVIDIA/dgx-spark-playbooks + type: documentation + - resource: + title: Arm Learning Path + link: https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/ + type: Learning Path + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_next-steps.md b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_next-steps.md new file mode 100644 index 000000000..c3db0de5a --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_rag/rag_utilization.jpeg b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/rag_utilization.jpeg new file mode 100644 index 000000000..aba6e6d84 Binary files /dev/null and b/content/learning-paths/laptops-and-desktops/dgx_spark_rag/rag_utilization.jpeg differ