diff --git a/content/learning-paths/servers-and-cloud-computing/rag/_index.md b/content/learning-paths/servers-and-cloud-computing/rag/_index.md
new file mode 100644
index 0000000000..745e9ac2a3
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/_index.md
@@ -0,0 +1,39 @@
+---
+title: Deploy a RAG-based Chatbot with llama-cpp-python using KleidiAI on Arm Servers
+
+minutes_to_complete: 45
+
+who_is_this_for: This Learning Path is for software developers, ML engineers, and those looking to deploy production-ready LLM chatbots with RAG capabilities, knowledge base integration, and performance optimization for the Arm architecture.
+
+learning_objectives:
+    - Set up llama-cpp-python optimized for Arm servers.
+    - Implement RAG architecture using the FAISS vector database.
+    - Optimize model performance through 4-bit quantization.
+    - Build a web interface for document upload and chat.
+    - Monitor and analyze inference performance metrics.
+
+prerequisites:
+    - Basic understanding of Python and ML concepts.
+    - Familiarity with REST APIs and web services.
+    - Basic knowledge of vector databases.
+    - Understanding of LLM fundamentals.
+
+author_primary: Nobel Chowdary Mandepudi
+
+### Tags
+skilllevels: Advanced
+armips:
+    - Neoverse
+subjects: LLM
+operatingsystems:
+    - Linux
+tools_software_languages:
+    - Python
+    - Streamlit
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/_review.md b/content/learning-paths/servers-and-cloud-computing/rag/_review.md
new file mode 100644
index 0000000000..df0d24aaf6
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/_review.md
@@ -0,0 +1,45 @@
+---
+review:
+    - questions:
+        question: >
+            What is the primary purpose of using RAG in an LLM chatbot?
+        answers:
+            - To reduce the size of the model.
+            - To enhance the chatbot's responses with contextually relevant information.
+            - To increase the training speed of the model.
+            - To simplify the deployment process.
+        correct_answer: 2
+        explanation: >
+            RAG (Retrieval Augmented Generation) enhances the chatbot's responses by retrieving and incorporating contextually relevant information from a vector database.
+
+    - questions:
+        question: >
+            Which framework is used to create the web interface for the RAG-based LLM server?
+        answers:
+            - Django.
+            - Flask.
+            - Streamlit.
+            - FastAPI.
+        correct_answer: 3
+        explanation: >
+            Streamlit is used to create the web interface for the RAG-based LLM server, allowing users to interact with the backend.
+
+    - questions:
+        question: >
+            What is the role of FAISS in the RAG-based LLM server?
+        answers:
+            - To train the LLM model.
+            - To store and retrieve vectorized documents.
+            - To handle HTTP requests.
+            - To manage user authentication.
+        correct_answer: 2
+        explanation: >
+            FAISS is used to store and retrieve vectorized documents, enabling the RAG-based LLM server to provide contextually relevant responses.
+
+# ================================================================================
+#       FIXED, DO NOT MODIFY
+# ================================================================================
+title: "Review"                 # Always the same title
+weight: 6                       # Set to always be larger than the content in this path
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/backend.md b/content/learning-paths/servers-and-cloud-computing/rag/backend.md
new file mode 100644
index 0000000000..e5601f06e5
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/backend.md
@@ -0,0 +1,196 @@
+---
+title: Deploy a RAG-based LLM backend server
+weight: 3
+
+layout: learningpathall
+---
+
+## Backend Script for RAG-based LLM Server
+With the virtual environment activated, create a file named `backend.py` with the following content. This script integrates the LLM with the FAISS vector database for RAG:
+
+```python
+import os
+import time
+import logging
+from flask import Flask, request, jsonify
+from flask_cors import CORS
+from langchain_community.vectorstores import FAISS
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.llms import LlamaCpp
+from langchain_core.callbacks import StreamingStdOutCallbackHandler
+from langchain_core.prompts import PromptTemplate
+from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
+from langchain_text_splitters import HTMLHeaderTextSplitter, CharacterTextSplitter
+from langchain.schema.runnable import RunnablePassthrough
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.runnables import ConfigurableField
+
+# Configure logging
+logging.getLogger('watchdog').setLevel(logging.ERROR)
+logger = logging.getLogger(__name__)
+
+# Initialize Flask app
+app = Flask(__name__)
+CORS(app)
+
+# Configure paths
+BASE_PATH = "/home/ubuntu"
+TEMP_DIR = os.path.join(BASE_PATH, "temp")
+VECTOR_DIR = os.path.join(BASE_PATH, "vector")
+MODEL_PATH = os.path.join(BASE_PATH, "models/llama3.1-8b-instruct.Q4_0_arm.gguf")
+
+# Ensure directories exist
+os.makedirs(TEMP_DIR, exist_ok=True)
+os.makedirs(VECTOR_DIR, exist_ok=True)
+
+# Token streaming: print each token as it arrives and time the full response
+class StreamingCallback(StreamingStdOutCallbackHandler):
+    def __init__(self):
+        super().__init__()
+        self.tokens = []
+        self.start_time = None
+
+    def on_llm_start(self, *args, **kwargs):
+        self.start_time = time.time()
+        self.tokens = []
+        print("\nLLM Started generating response...", flush=True)
+
+    def on_llm_new_token(self, token: str, **kwargs):
+        self.tokens.append(token)
+        print(token, end="", flush=True)
+
+    def on_llm_end(self, *args, **kwargs):
+        end_time = time.time()
+        duration = end_time - self.start_time
+        print(f"\nLLM finished generating response in {duration:.2f} seconds", flush=True)
+
+def format_docs(docs):
+    return "\n\n".join(doc.page_content for doc in docs).replace("Context:", "").strip()
+
+# Vector database creation API
+@app.route('/create_vectordb', methods=['POST'])
+def create_vectordb():
+    try:
+        data = request.json
+        vector_name = data['vector_name']
+        chunk_size = int(data['chunk_size'])
+        doc_type = data['doc_type']
+        vector_path = os.path.join(VECTOR_DIR, vector_name)
+
+        # Process document
+        chunk_overlap = 30
+        if doc_type == "PDF":
+            loader = DirectoryLoader(TEMP_DIR, glob='*.pdf', loader_cls=PyPDFLoader)
+            docs = loader.load()
+        elif doc_type == "HTML":
+            url = data['url']
+            splitter = HTMLHeaderTextSplitter([
+                ("h1", "Header 1"), ("h2", "Header 2"),
+                ("h3", "Header 3"), ("h4", "Header 4")
+            ])
+            docs = splitter.split_text_from_url(url)
+        else:
+            return jsonify({"error": "Unsupported document type"}), 400
+
+        # Create vectorstore
+        text_splitter = CharacterTextSplitter(
+            chunk_size=chunk_size,
+            chunk_overlap=chunk_overlap
+        )
+        split_docs = text_splitter.split_documents(docs)
+        embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
+        vectorstore = FAISS.from_documents(documents=split_docs, embedding=embedding)
+        vectorstore.save_local(vector_path)
+
+        return jsonify({"status": "success", "path": vector_path})
+    except Exception as e:
+        logger.exception("Error creating vector database")
+        return jsonify({"error": str(e)}), 500
+
+# Query API
+@app.route('/query', methods=['POST'])
+def query():
+    try:
+        data = request.json
+        question = data['question']
+        vector_path = data.get('vector_path')
+        use_vectordb = data.get('use_vectordb', False)
+
+        # Initialize LLM
+        callbacks = [StreamingCallback()]
+        model = LlamaCpp(
+            model_path=MODEL_PATH,
+            temperature=0.1,
+            max_tokens=1024,
+            n_batch=2048,
+            callbacks=callbacks,
+            n_ctx=10000,
+            n_threads=64,        # Match the vCPU count of your instance (64 on r8g.16xlarge)
+            n_threads_batch=64   # Threads used for prompt (batch) processing
+        )
+
+        # Create chain
+        if use_vectordb and vector_path:
+            embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
+            vectorstore = FAISS.load_local(vector_path, embedding, allow_dangerous_deserialization=True)
+            retriever = vectorstore.as_retriever().configurable_fields(
+                search_kwargs=ConfigurableField(id="search_kwargs")
+            ).with_config({"search_kwargs": {"k": 5}})
+
+            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+            You are a helpful assistant. Use the following context to answer the question.
+            Context: {context}
+            Question: {question}
+            Answer: <|eot_id|>"""
+
+            prompt = PromptTemplate(template=template, input_variables=["context", "question"])
+            chain = (
+                {"context": retriever | format_docs, "question": RunnablePassthrough()}
+                | prompt
+                | model
+                | StrOutputParser()
+            )
+        else:
+            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+            Question: {question}
+            Answer: <|eot_id|>"""
+
+            prompt = PromptTemplate(template=template, input_variables=["question"])
+            chain = {"question": RunnablePassthrough()} | prompt | model | StrOutputParser()
+
+        # Generate response
+        response = chain.invoke(question)
+        return jsonify({"answer": response})
+    except Exception as e:
+        logger.exception("Error processing query")
+        return jsonify({"error": str(e)}), 500
+
+# File Upload API
+@app.route('/upload_file', methods=['POST'])
+def upload_file():
+    try:
+        file = request.files['file']
+        if file and file.filename.endswith('.pdf'):
+            filename = os.path.join(TEMP_DIR, "uploaded.pdf")
+            file.save(filename)
+            return jsonify({"status": "success", "path": filename})
+        return jsonify({"error": "Invalid file"}), 400
+    except Exception as e:
+        logger.exception("Error uploading file")
+        return jsonify({"error": str(e)}), 500
+
+if __name__ == '__main__':
+    app.run(host='0.0.0.0', port=5000, debug=True)
+```
+
+## Run the Backend Server
+
+You are now ready to run the backend server for the RAG Chatbot.
+Use the following command in a terminal to start the backend server:
+
+```bash
+python3 backend.py
+```
+
+You should see output similar to the image below when the backend server starts successfully:
+![backend](backend_output.png)
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/backend_output.png b/content/learning-paths/servers-and-cloud-computing/rag/backend_output.png
new file mode 100644
index 0000000000..67de5f2e4a
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/backend_output.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md b/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
new file mode 100644
index 0000000000..8e659b8a41
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
@@ -0,0 +1,76 @@
+---
+title: The RAG Chatbot and its Performance
+weight: 5
+
+layout: learningpathall
+---
+
+## Access the Web Application
+
+Open the web application in your browser using either the local URL or the external URL. The external address shown here is an example, so replace it with the public IP address of your own instance:
+
+```bash
+http://localhost:8501 or http://75.101.253.177:8501
+```
+
+## Upload a PDF File and Create a New Index
+
+Now you can upload a PDF file in the web browser by selecting the **Create New Store** option.
+
+Follow these steps to create a new index:
+
+1. Open the web browser and navigate to the Streamlit frontend.
+2. In the sidebar, select **Create New Store** under the **Vector Database** section.
+3. By default, **PDF** is the source type selected.
+4. Upload your PDF file using the file uploader.
+5. Enter a name for your vector index.
+6. Click the **Create Index** button.
+
+Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
+
+You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:
+
+![RAG_IMG1](rag_img1.png)
+
+## Load Existing Store
+
+After creating the index, you can switch to the **Load Existing Store** option and then select the index you created earlier. Initially, it will be the only available index and will be auto-selected.
+
+Follow these steps:
+
+1. Switch to the **Load Existing Store** option in the sidebar.
+2. Select the index you created. It should be auto-selected if it's the only one available.
+
+This will allow you to use the uploaded document for generating contextually relevant responses. Refer to the image below for guidance:
+
+![RAG_IMG2](rag_img2.png)
+
+## Interact with the LLM
+
+You can now start asking the LLM questions using the prompt field in the web application. The responses will be streamed both to the frontend and the backend server terminal.
+
+Follow these steps:
+
+1. Enter your query in the prompt field of the web application.
+2. Submit the query to receive a response from the LLM.
+
+![RAG_IMG3](rag_img3.png)
+
+While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
+
+![RAG_IMG4](rag_img4.png)
+
+## Observe Performance Metrics
+
+As shown in the image above, the RAG LLM Chatbot completed the generation in 4.65 seconds, processing and generating a total of `1183` tokens.
+
+Taken together, that is roughly 254 tokens per second across prompt processing and generation combined.
+
+## Further Interaction and Custom Applications
+
+You can continue to ask follow-up prompts and observe the performance metrics in the backend terminal.
+
+This setup demonstrates how you can build a range of applications on top of an LLM backend connected to RAG, generating text grounded in your own documents.
+This Learning Path serves as a guide and example of RAG-based LLM inference on Arm CPUs, highlighting the performance gains from the optimized backend.
+
+
+
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/frontend.md b/content/learning-paths/servers-and-cloud-computing/rag/frontend.md
new file mode 100644
index 0000000000..ebcfc0d232
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/frontend.md
@@ -0,0 +1,142 @@
+---
+title: Deploy a RAG-based LLM frontend server
+weight: 4
+
+layout: learningpathall
+---
+
+## Frontend Script for RAG-based LLM Server
+
+After activating the virtual environment in a new terminal, you can use the following `frontend.py` script to upload PDF documents and interact with the backend. This script uses the Streamlit framework to create a web interface for the RAG-based LLM server.
+
+Create a `frontend.py` script with the following content:
+
+```python
+import os
+import requests
+import time
+import streamlit as st
+from PIL import Image
+from typing import Dict, Any
+
+# Configure paths and URLs
+BASE_PATH = "/home/ubuntu"
+API_URL = "http://localhost:5000"
+
+# Page config
+st.set_page_config(
+    page_title="LLM RAG on Arm Neoverse CPU"
+)
+
+# Title
+st.title("LLM RAG on Arm Neoverse CPU")
+
+# Sidebar
+with st.sidebar:
+    st.write("## Model Settings")
+    model = st.selectbox('Select LLM', ["llama3.1-8b-instruct.Q4_0_arm.gguf"])
+    use_vectordb = st.checkbox("Use Vector Database")
+
+# Initialize session state
+if 'messages' not in st.session_state:
+    st.session_state.messages = []
+if 'vectordb_path' not in st.session_state:
+    st.session_state.vectordb_path = None
+
+# Vector Database Creation
+if use_vectordb:
+    st.sidebar.write("## Vector Database")
+    # First select vector store type
+    vector_store = st.sidebar.selectbox("Vector Storage Type", ["FAISS"])
+
+    # Then select action
+    action = st.sidebar.radio("Action", ["Create New Store", "Load Existing Store"])
+
+    if action == "Create New Store":
+        source = st.sidebar.radio("Source", ["PDF"])
+        if source == "PDF":
+            uploaded_file = st.sidebar.file_uploader("Upload PDF", type="pdf")
+            if uploaded_file:
+                files = {'file': uploaded_file}
+                response = requests.post(f"{API_URL}/upload_file", files=files)
+                if response.ok:
+                    st.sidebar.success("File uploaded successfully")
+
+        db_name = st.sidebar.text_input("Vector Index Name")
+        if st.sidebar.button("Create Index"):
+            response = requests.post(
+                f"{API_URL}/create_vectordb",
+                json={
+                    "vector_name": db_name,
+                    "chunk_size": 400,
+                    "doc_type": "PDF"
+                }
+            )
+            if response.ok:
+                st.session_state.vectordb_path = response.json()['path']
+                st.sidebar.success("Vector Index created!")
+            else:
+                st.sidebar.error("Failed to create vector Index")
+
+    else:  # Load Existing
+        # Define dbs up front so the chat interface check below never sees an undefined name
+        vector_dir = os.path.join(BASE_PATH, "vector")
+        dbs = []
+        if os.path.exists(vector_dir):
+            # Get all directories that contain FAISS index files
+            for root, dirs, files in os.walk(vector_dir):
+                if "index.faiss" in files:  # Check for FAISS index file
+                    # Get relative path from vector_dir
+                    rel_path = os.path.relpath(root, vector_dir)
+                    dbs.append(rel_path)
+            if dbs:
+                selected_db = st.sidebar.selectbox("Select Index", dbs)
+                st.session_state.vectordb_path = os.path.join(vector_dir, selected_db)
+                st.sidebar.success(f"Loaded index: {selected_db}")
+            else:
+                st.sidebar.warning("No existing indexes found. Please create a new one.")
+        else:
+            # Create vector directory if it doesn't exist
+            os.makedirs(vector_dir, exist_ok=True)
+            st.sidebar.warning("No indexes found. Please create a new one.")
+
+# Chat interface
+if use_vectordb and action == "Load Existing Store" and dbs:
+    if prompt := st.chat_input("Ask a question"):
+        st.session_state.messages.append({"role": "user", "content": prompt})
+
+        # Display messages
+        for msg in st.session_state.messages:
+            with st.chat_message(msg["role"]):
+                st.write(msg["content"])
+
+        # Get response
+        with st.chat_message("assistant"):
+            response = requests.post(
+                f"{API_URL}/query",
+                json={
+                    "question": prompt,
+                    "vector_path": st.session_state.vectordb_path,
+                    "use_vectordb": use_vectordb
+                }
+            )
+
+            if response.ok:
+                answer = response.json()['answer']
+                st.write(answer)
+                st.session_state.messages.append({"role": "assistant", "content": answer})
+            else:
+                st.error("Failed to get response from the model")
+```
+
+## Run the Frontend Server
+
+You are now ready to run the frontend server for the RAG Chatbot.
+Use the following command in a new terminal to start the Streamlit frontend server:
+
+```bash
+python3 -m streamlit run frontend.py
+```
+
+You should see output similar to the image below when the frontend server starts successfully:
+![frontend](frontend_output.png)
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/frontend_output.png b/content/learning-paths/servers-and-cloud-computing/rag/frontend_output.png
new file mode 100644
index 0000000000..e3ff01cbb7
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/frontend_output.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_img1.png b/content/learning-paths/servers-and-cloud-computing/rag/rag_img1.png
new file mode 100644
index 0000000000..cabc4a9a0b
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/rag_img1.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_img2.png b/content/learning-paths/servers-and-cloud-computing/rag/rag_img2.png
new file mode 100644
index 0000000000..107af1e657
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/rag_img2.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_img3.png b/content/learning-paths/servers-and-cloud-computing/rag/rag_img3.png
new file mode 100644
index 0000000000..424c0af431
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/rag_img3.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_img4.png b/content/learning-paths/servers-and-cloud-computing/rag/rag_img4.png
new file mode 100644
index 0000000000..ce98c60722
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/rag/rag_img4.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md b/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md
new file mode 100644
index 0000000000..9551af2cb1
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md
@@ -0,0 +1,136 @@
+---
+# User change
+title: "Set up a RAG-based LLM Chatbot"
+
+weight: 2 # 1 is first, 2 is second, etc.
+
+# Do not modify these elements
+layout: "learningpathall"
+---
+
+## Before you begin
+
+This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores and 8 GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton4 r8g.16xlarge instance.
+
+## Overview
+
+In this Learning Path, you learn how to build a Retrieval Augmented Generation (RAG) chatbot using llama-cpp-python, a Python binding for llama.cpp that enables efficient LLM inference on Arm CPUs.
+
+The tutorial demonstrates how to integrate the FAISS vector database with the Llama-3.1-8B model for document retrieval, while leveraging llama-cpp-python's optimized C++ backend for high-performance inference.
+
+This architecture enables the chatbot to combine the model's generative capabilities with contextual information retrieved from your documents, all optimized for Arm-based systems.
+
+## Install dependencies
+
+Install the following packages on your Arm-based server instance:
+
+```bash
+sudo apt update
+sudo apt install python3-pip python3-venv cmake -y
+```
+
+## Create a requirements file
+
+```bash
+vim requirements.txt
+```
+
+Add the following dependencies to your `requirements.txt` file:
+
+```python
+# Core LLM & RAG Components
+langchain==0.1.16
+langchain_community==0.0.38
+langchainhub==0.1.20
+
+# Vector Database & Embeddings
+faiss-cpu
+sentence-transformers
+
+# Document Processing
+pypdf
+PyPDF2
+lxml
+
+# API and Web Interface
+flask
+requests
+flask_cors
+streamlit
+
+# Environment and Utils
+argparse
+python-dotenv==1.0.1
+```
+
+## Install Python Dependencies
+
+Create a virtual environment:
+```bash
+python3 -m venv rag-env
+```
+
+Activate the virtual environment:
+```bash
+source rag-env/bin/activate
+```
+
+Install the required libraries using pip:
+```bash
+pip install -r requirements.txt
+```
+## Install llama-cpp-python
+
+Install the `llama-cpp-python` package, which includes the KleidiAI-optimized llama.cpp backend, using the following command:
+
+```bash
+pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+```
+
+## Download the Model
+
+In your home directory, create a directory called `models` and navigate to it:
+```bash
+mkdir models
+cd models
+```
+
+Download the model from Hugging Face:
+```bash
+wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
+```
+
+## Build llama.cpp & Quantize the Model
+
+Navigate to your home directory:
+
+```bash
+cd ~
+```
+
+Clone the source repository for llama.cpp:
+
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+```
+
+By default, `llama.cpp` builds for CPU only on Linux and Windows. You do not need to provide any extra switches to build it for the Arm CPU that you run it on.
+
+Run `cmake` to build it:
+
+```bash
+cd llama.cpp
+mkdir build
+cd build
+cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
+cmake --build . -v --config Release -j `nproc`
+```
+
+The `llama.cpp` binaries are now built in the `bin` directory.
+
+Run the following command to requantize the model to `Q4_0`. The `--allow-requantize` flag is needed because the downloaded file is already quantized (`Q4_K_M`):
+
+```bash
+cd bin
+./llama-quantize --allow-requantize ../../../models/ggml-model-Q4_K_M.gguf ../../../models/llama3.1-8b-instruct.Q4_0_arm.gguf Q4_0
+```
\ No newline at end of file
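
After the quantize step in `rag_llm.md`, it can help to confirm that the output file was written and looks like a GGUF model before `backend.py` tries to load it. The sketch below is not part of the files in this change. It assumes the standard GGUF header layout (the four magic bytes `GGUF` followed by a little-endian 32-bit version number), and `read_gguf_header` is a hypothetical helper name chosen for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def read_gguf_header(path):
    """Return (is_gguf, version) for the file at `path`.

    Hypothetical helper for illustration only: it inspects the 4-byte magic
    and the little-endian uint32 version that follows, nothing else.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    # Too short or wrong magic: not a GGUF file
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return False, None
    (version,) = struct.unpack("<I", header[4:8])
    return True, version
```

For example, calling `read_gguf_header("/home/ubuntu/models/llama3.1-8b-instruct.Q4_0_arm.gguf")` (the `MODEL_PATH` used in `backend.py`) should return `True` plus the format version for a valid file. A `False` result suggests the quantize command did not complete or wrote to a different location.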