diff --git a/content/learning-paths/servers-and-cloud-computing/rag/_demo.md b/content/learning-paths/servers-and-cloud-computing/rag/_demo.md
index 19b3a9c2e5..ca62fbf8e4 100644
--- a/content/learning-paths/servers-and-cloud-computing/rag/_demo.md
+++ b/content/learning-paths/servers-and-cloud-computing/rag/_demo.md
@@ -1,18 +1,21 @@
 ---
 title: Run a llama.cpp chatbot powered by Arm Kleidi technology
+weight: 2
 overview: |
-  This Arm learning path shows how to use a single c4a-highcpu-72 Google Axion instance -- powered by an Arm Neoverse CPU -- to build a simple "Token as a Service" RAG-enabled server, used below to provide a chatbot to serve a small number of concurrent users.
+  This Learning Path shows you how to use a c4a-highcpu-72 Google Axion instance powered by an Arm Neoverse CPU to build a simple Token-as-a-Service (TaaS) RAG-enabled server that you can then use to provide a chatbot that serves a small number of concurrent users.
 
-  This architecture would be suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines. It enables semantic search over chunked documents using FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced by contributing the latest Arm Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
+  This architecture is suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines.
+
+  It enables semantic search over chunked documents using the FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced with its own Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
 
-  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, then follow the learning path to build your own Generative AI service on Arm Neoverse.
+  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, and then follow the Learning Path to build your own Generative AI service on Arm Neoverse.
 
 demo_steps:
-  - Type & send a message to the chatbot.
+  - Type and send a message to the chatbot.
   - Receive the chatbot's reply, including references from RAG data.
-  - View stats showing how well Google Axion runs LLMs.
+  - View performance statistics demonstrating how well Google Axion runs LLMs.
 
 diagram: config-diagram-dark.png
 diagram_blowup: config-diagram.png
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/backend.md b/content/learning-paths/servers-and-cloud-computing/rag/backend.md
index de50065fc6..220a5b3356 100644
--- a/content/learning-paths/servers-and-cloud-computing/rag/backend.md
+++ b/content/learning-paths/servers-and-cloud-computing/rag/backend.md
@@ -1,6 +1,6 @@
 ---
 title: Deploy a RAG-based LLM backend server
-weight: 3
+weight: 4
 layout: learningpathall
 ---
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md b/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
index 2ad984a4f5..1cd6eb3488 100644
--- a/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
+++ b/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
@@ -1,6 +1,6 @@
 ---
 title: The RAG Chatbot and its Performance
-weight: 5
+weight: 6
 layout: learningpathall
 ---
 
@@ -15,9 +15,9 @@
 http://[your instance ip]:8501
 
 {{% notice Note %}}
-To access the links you may need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they may introduce security vulnerabilities.
+To access the links you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they might introduce security vulnerabilities.
 
-For an Axion instance, this can be done as follows from the gcloud cli:
+For an Axion instance, you can do this from the gcloud CLI:
 
 gcloud compute firewall-rules create allow-my-ip \
    --direction=INGRESS \
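The hunk above is cut off at the hunk boundary partway through the `gcloud` command. For reference, a complete rule that opens the frontend port only to a single trusted address might look like the sketch below; the port (8501, the Streamlit frontend referenced earlier in this file) and the placeholder source IP are assumptions to adapt to your deployment:

```bash
# Allow inbound TCP on port 8501 (Streamlit frontend) from one trusted IP only.
# Replace 203.0.113.7/32 with your own public IP address.
gcloud compute firewall-rules create allow-my-ip \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8501 \
    --source-ranges=203.0.113.7/32
```

Scoping `--source-ranges` to a single /32 keeps the instance from being exposed to the open internet while you test.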
@@ -43,7 +43,7 @@ Follow these steps to create a new index:
 5. Enter a name for your vector index.
 6. Click the **Create Index** button.
 
-Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
+Upload the Cortex-M processor comparison document, which you can download from [the Arm developer website](https://developer.arm.com/documentation/102787/latest/).
 
 You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:
 
@@ -56,15 +56,15 @@ After creating the index, you can switch to the **Load Existing Store** option a
 Follow these steps:
 
 1. Switch to the **Load Existing Store** option in the sidebar.
-2. Select the index you created. It should be auto-selected if it's the only one available.
+2. Select the index you created. It should be auto-selected if it is the only one available.
 
-This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
+This allows you to use the uploaded document for generating contextually relevant responses. Refer to the image below for guidance:
 
 ![RAG_IMG2](rag_img2.png)
 
 ## Interact with the LLM
 
-You can now start asking various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.
+You can now start issuing various queries to the LLM using the prompt in the web application. The responses are streamed both to the frontend and the backend server terminal.
 
 Follow these steps:
 
@@ -73,7 +73,7 @@ Follow these steps:
 
 ![RAG_IMG3](rag_img3.png)
 
-While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
+While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This provides insights into the processing speed and efficiency of the LLM.
 
 ![RAG_IMG4](rag_img4.png)
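At the interaction step described above, you can also exercise the model without the web frontend. Assuming the backend exposes llama.cpp's standard `llama-server` completion API on its default port 8080 (an assumption about this deployment, not something the page states), a direct request looks like this, and the same performance statistics appear in the backend terminal:

```bash
# Send one completion request straight to a llama.cpp llama-server instance.
# Host and port are assumptions; adjust them to match your backend.
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Summarize the Cortex-M55 in one sentence.", "n_predict": 64}'
```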
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/frontend.md b/content/learning-paths/servers-and-cloud-computing/rag/frontend.md
index 51cb4eb33a..fd72eaa099 100644
--- a/content/learning-paths/servers-and-cloud-computing/rag/frontend.md
+++ b/content/learning-paths/servers-and-cloud-computing/rag/frontend.md
@@ -1,6 +1,6 @@
 ---
 title: Deploy RAG-based LLM frontend server
-weight: 4
+weight: 5
 layout: learningpathall
 ---
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md b/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md
index 38babb632c..b375428528 100644
--- a/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md
+++ b/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md
@@ -2,7 +2,7 @@
 
 # User change
 title: "Set up a RAG based LLM Chatbot"
-weight: 2 # 1 is first, 2 is second, etc.
+weight: 3
 
 # Do not modify these elements
 layout: "learningpathall"
@@ -10,7 +10,7 @@ layout: "learningpathall"
 
 ## Before you begin
 
-This learning path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
+This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
 
 ## Overview
 
@@ -100,7 +100,7 @@ Download the Hugging Face model:
 ```bash
 wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
 ```
 
-## Build llama.cpp & Quantize the Model
+## Build llama.cpp and Quantize the Model
 
 Navigate to your home directory:
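The final hunk ends where the Learning Path moves on to building llama.cpp and quantizing the model. As a rough sketch of what those steps typically look like with upstream llama.cpp (the repository URL, CMake invocation, and binary path below follow current upstream defaults and are assumptions, not the Learning Path's exact commands):

```bash
# Clone and build llama.cpp with a Release build; on Arm Neoverse the CPU
# backend picks up the Arm-optimized (NEON/SVE) kernels automatically.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"

# The ggml-model-Q4_K_M.gguf downloaded earlier is already quantized;
# llama-quantize is only needed when starting from higher-precision weights:
./build/bin/llama-quantize ggml-model-f16.gguf ggml-model-Q4_K_M.gguf Q4_K_M
```

Q4_K_M 4-bit quantization is what lets the 8-billion-parameter Llama 3.1 model fit comfortably in the instance's memory while keeping CPU inference fast.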