content/learning-paths/servers-and-cloud-computing/rag/_demo.md (13 changes: 8 additions & 5 deletions)
@@ -1,18 +1,21 @@
---
title: Run a llama.cpp chatbot powered by Arm Kleidi technology
weight: 2

overview: |
This Arm learning path shows how to use a single c4a-highcpu-72 Google Axion instance -- powered by an Arm Neoverse CPU -- to build a simple "Token as a Service" RAG-enabled server, used below to provide a chatbot to serve a small number of concurrent users.
This Learning Path shows you how to use a c4a-highcpu-72 Google Axion instance powered by an Arm Neoverse CPU to build a simple Token-as-a-Service (TaaS) RAG-enabled server that you can then use to provide a chatbot to serve a small number of concurrent users.

This architecture would be suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines. It enables semantic search over chunked documents using FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced by contributing the latest Arm Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
This architecture is suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines.

It enables semantic search over chunked documents using the FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced with its own Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.

Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, then follow the learning path to build your own Generative AI service on Arm Neoverse.
Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, and then follow the Learning Path to build your own Generative AI service on Arm Neoverse.


demo_steps:
- Type & send a message to the chatbot.
- Type and send a message to the chatbot.
- Receive the chatbot's reply, including references from RAG data.
- View stats showing how well Google Axion runs LLMs.
- View performance statistics demonstrating how well Google Axion runs LLMs.

diagram: config-diagram-dark.png
diagram_blowup: config-diagram.png
@@ -1,6 +1,6 @@
---
title: Deploy a RAG-based LLM backend server
weight: 3
weight: 4

layout: learningpathall
---
@@ -1,6 +1,6 @@
---
title: The RAG Chatbot and its Performance
weight: 5
weight: 6

layout: learningpathall
---
@@ -15,9 +15,9 @@ http://[your instance ip]:8501

{{% notice Note %}}

To access the links you may need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they may introduce security vulnerabilities.
To access the links you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they might introduce security vulnerabilities.

For an Axion instance, this can be done as follows from the gcloud cli:
For an Axion instance, you can do this from the gcloud cli:

gcloud compute firewall-rules create allow-my-ip \
--direction=INGRESS \
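The remaining flags of this command are collapsed in the diff. As a minimal sketch, assuming you only need to reach the Streamlit frontend on port 8501 and want to restrict access to your own address (the port and source range below are illustrative, not taken from the Learning Path), a complete rule might look like this:

```bash
# Minimal sketch: port and source range are illustrative assumptions.
# Replace <your-public-ip> with the address you connect from.
gcloud compute firewall-rules create allow-my-ip \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8501 \
    --source-ranges=<your-public-ip>/32
```

When you are done testing, you can remove the rule again with `gcloud compute firewall-rules delete allow-my-ip`.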
@@ -43,7 +43,7 @@ Follow these steps to create a new index:
5. Enter a name for your vector index.
6. Click the **Create Index** button.

Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
Upload the Cortex-M processor comparison document, which can be downloaded from [the Arm developer website](https://developer.arm.com/documentation/102787/latest/).

You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:

@@ -56,15 +56,15 @@ After creating the index, you can switch to the **Load Existing Store** option a
Follow these steps:

1. Switch to the **Load Existing Store** option in the sidebar.
2. Select the index you created. It should be auto-selected if it's the only one available.
2. Select the index you created. It should be auto-selected if it is the only one available.

This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
This allows you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:

![RAG_IMG2](rag_img2.png)

## Interact with the LLM

You can now start asking various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.
You can now start issuing various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.

Follow these steps:

@@ -73,7 +73,7 @@ Follow these steps:

![RAG_IMG3](rag_img3.png)

While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This provides insights into the processing speed and efficiency of the LLM.

![RAG_IMG4](rag_img4.png)
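If you want to trigger these metrics without going through the web frontend, you can also send a request to the backend directly. The sketch below assumes the backend exposes llama.cpp's OpenAI-compatible HTTP API on port 8080; the endpoint, port, and payload are assumptions and may differ from the backend used in this Learning Path:

```bash
# Assumed endpoint: llama.cpp's OpenAI-compatible server listening on port 8080.
# Each request is also logged in the backend terminal together with its timing statistics.
curl http://[your instance ip]:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Which Cortex-M processors include TrustZone?"}
        ],
        "stream": false
      }'
```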

@@ -1,6 +1,6 @@
---
title: Deploy RAG-based LLM frontend server
weight: 4
weight: 5

layout: learningpathall
---
@@ -2,15 +2,15 @@
# User change
title: "Set up a RAG based LLM Chatbot"

weight: 2 # 1 is first, 2 is second, etc.
weight: 3

# Do not modify these elements
layout: "learningpathall"
---

## Before you begin

This learning path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
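If you do not already have a suitable instance, the sketch below shows one way to create one with the gcloud CLI. The instance name, zone, image family, and disk settings are illustrative assumptions; adjust them to your project and confirm C4A availability and supported disk types in your region.

```bash
# Illustrative sketch: create an Arm-based C4A instance running Ubuntu 22.04 LTS.
# Name, zone, and disk settings are assumptions; adjust them for your project.
gcloud compute instances create rag-chatbot \
    --zone=us-central1-a \
    --machine-type=c4a-standard-64 \
    --image-family=ubuntu-2204-lts-arm64 \
    --image-project=ubuntu-os-cloud \
    --boot-disk-type=hyperdisk-balanced \
    --boot-disk-size=32GB
```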

## Overview

@@ -100,7 +100,7 @@ Download the Hugging Face model:
wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
```

## Build llama.cpp & Quantize the Model
## Build llama.cpp and Quantize the Model

Navigate to your home directory:

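The build and quantization commands themselves are collapsed in this diff. As a rough sketch, assuming the standard upstream llama.cpp CMake workflow and the llama-quantize tool (verify the exact commands against the full Learning Path), the sequence looks roughly like this:

```bash
cd ~

# Clone and build llama.cpp with CMake. This is the standard upstream workflow,
# assumed here; the Learning Path may pin a specific release or add extra build flags.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j "$(nproc)"

# Requantizing is only needed if you start from a higher-precision GGUF file;
# the ggml-model-Q4_K_M.gguf downloaded earlier is already quantized to Q4_K_M.
# The input and output paths below are illustrative.
./build/bin/llama-quantize ../ggml-model-f16.gguf ../ggml-model-Q4_K_M.gguf Q4_K_M
```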