ArmDeveloperEcosystem · pareenaverma · Jan 17, 2025 · Jan 14, 2025 · Jan 15, 2025
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/_index.md b/content/learning-paths/servers-and-cloud-computing/rag/_index.md
@@ -0,0 +1,39 @@
+---
+title: Deploy a RAG-based Chatbot with llama-cpp-python using KleidiAI on Arm Servers
+
+minutes_to_complete: 45
+
+who_is_this_for: This Learning Path is for software developers, ML engineers, and those looking to deploy production-ready LLM chatbots with RAG capabilities, knowledge base integration, and performance optimization for Arm Architecture.
+
+learning_objectives:
+    - Set up llama-cpp-python optimized for Arm servers.
+    - Implement RAG architecture using the FAISS vector database.
+    - Optimize model performance through 4-bit quantization.
+    - Build a web interface for document upload and chat.
+    - Monitor and analyze inference performance metrics.
+
+prerequisites:
+    - Basic understanding of Python and ML concepts.
+    - Familiarity with REST APIs and web services.
+    - Basic knowledge of vector databases.
+    - Understanding of LLM fundamentals.
+
+author_primary: Nobel Chowdary Mandepudi
+
+### Tags
+skilllevels: Advanced
+armips:
+    - Neoverse
+subjects: LLM
+operatingsystems:
+    - Linux
+tools_software_languages:
+    - Python
+    - Streamlit
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/_review.md b/content/learning-paths/servers-and-cloud-computing/rag/_review.md
@@ -0,0 +1,45 @@
+---
+review:
+    - questions:
+        question: >
+            What is the primary purpose of using RAG in an LLM chatbot?
+        answers:
+            - To reduce the size of the model.
+            - To enhance the chatbot's responses with contextually-relevant information.
+            - To increase the training speed of the model.
+            - To simplify the deployment process.
+        correct_answer: 2
+        explanation: >
+            RAG (Retrieval Augmented Generation) enhances the chatbot's responses by retrieving and incorporating contextually-relevant information from a vector database.
+
+    - questions:
+        question: >
+            Which framework is used to create the web interface for the RAG-based LLM server?
+        answers:
+            - Django.
+            - Flask.
+            - Streamlit.
+            - FastAPI.
+        correct_answer: 3
+        explanation: >
+            Streamlit is used to create the web interface for the RAG-based LLM server, allowing users to interact with the backend.
+
+    - questions:
+        question: >
+            What is the role of FAISS in the RAG-based LLM server?
+        answers:
+            - To train the LLM model.
+            - To store and retrieve vectorized documents.
+            - To handle HTTP requests.
+            - To manage user authentication.
+        correct_answer: 2
+        explanation: >
+            FAISS is used to store and retrieve vectorized documents, enabling the RAG-based LLM server to provide contextually relevant responses.
+
+# ================================================================================
+#       FIXED, DO NOT MODIFY
+# ================================================================================
+title: "Review"                 # Always the same title
+weight: 6                      # Set to always be larger than the content in this path
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+---
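The retrieval step that the review answers attribute to FAISS can be illustrated without the library. The following is a minimal sketch, not FAISS itself: it assumes cosine similarity as the ranking metric (a FAISS index may rank by a different distance), and the three-dimensional vectors and document titles are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=5):
    """Return the texts of the k stored vectors most similar to query_vec.

    store is a list of (text, vector) pairs; a real vector database
    replaces this linear scan with an index.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical toy "embeddings"; real ones come from an embedding model.
store = [
    ("Cortex-M0 overview", [1.0, 0.0, 0.0]),
    ("Cortex-M7 overview", [0.9, 0.1, 0.0]),
    ("Unrelated text", [0.0, 0.0, 1.0]),
]

print(top_k([1.0, 0.05, 0.0], store, k=2))
```

The texts returned by a step like this are what the backend joins into the `{context}` slot of the prompt before the LLM generates an answer.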
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/backend.md b/content/learning-paths/servers-and-cloud-computing/rag/backend.md
@@ -0,0 +1,196 @@
+---
+title: Deploy a RAG-based LLM backend server
+weight: 3
+
+layout: learningpathall
+---
+
+## Backend Script for RAG-based LLM Server
+Once the virtual environment is activated, create a `backend.py` script using the following content. This script integrates the LLM with the FAISS vector database for RAG:
+
+```python
+import os
+import time
+import logging
+from flask import Flask, request, jsonify
+from flask_cors import CORS
+from langchain_community.vectorstores import FAISS
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.llms import LlamaCpp
+from langchain_core.callbacks import StreamingStdOutCallbackHandler
+from langchain_core.prompts import PromptTemplate
+from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
+from langchain_text_splitters import HTMLHeaderTextSplitter, CharacterTextSplitter
+from langchain_core.runnables import RunnablePassthrough
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.runnables import ConfigurableField
+
+# Configure logging
+logging.getLogger('watchdog').setLevel(logging.ERROR)
+logger = logging.getLogger(__name__)
+
+# Initialize Flask app
+app = Flask(__name__)
+CORS(app)
+
+# Configure paths
+BASE_PATH = "/home/ubuntu"
+TEMP_DIR = os.path.join(BASE_PATH, "temp")
+VECTOR_DIR = os.path.join(BASE_PATH, "vector")
+MODEL_PATH = os.path.join(BASE_PATH, "models/llama3.1-8b-instruct.Q4_0_arm.gguf")
+
+# Ensure directories exist
+os.makedirs(TEMP_DIR, exist_ok=True)
+os.makedirs(VECTOR_DIR, exist_ok=True)
+
+# Token Streaming
+class StreamingCallback(StreamingStdOutCallbackHandler):
+    def __init__(self):
+        super().__init__()
+        self.tokens = []
+        self.start_time = None
+
+    def on_llm_start(self, *args, **kwargs):
+        self.start_time = time.time()
+        self.tokens = []
+        print("\nLLM Started generating response...", flush=True)
+
+    def on_llm_new_token(self, token: str, **kwargs):
+        self.tokens.append(token)
+        print(token, end="", flush=True)
+
+    def on_llm_end(self, *args, **kwargs):
+        end_time = time.time()
+        duration = end_time - self.start_time
+        print(f"\nLLM finished generating response in {duration:.2f} seconds", flush=True)
+
+def format_docs(docs):
+    return "\n\n".join(doc.page_content for doc in docs).replace("Context:", "").strip()
+
+# Vectordb creating API
+@app.route('/create_vectordb', methods=['POST'])
+def create_vectordb():
+    try:
+        data = request.json
+        vector_name = data['vector_name']
+        chunk_size = int(data['chunk_size'])
+        doc_type = data['doc_type']
+        vector_path = os.path.join(VECTOR_DIR, vector_name)
+
+        # Process document
+        chunk_overlap = 30
+        if doc_type == "PDF":
+            loader = DirectoryLoader(TEMP_DIR, glob='*.pdf', loader_cls=PyPDFLoader)
+            docs = loader.load()
+        elif doc_type == "HTML":
+            url = data['url']
+            splitter = HTMLHeaderTextSplitter([
+                ("h1", "Header 1"), ("h2", "Header 2"),
+                ("h3", "Header 3"), ("h4", "Header 4")
+            ])
+            docs = splitter.split_text_from_url(url)
+        else:
+            return jsonify({"error": "Unsupported document type"}), 400
+
+        # Create vectorstore
+        text_splitter = CharacterTextSplitter(
+            chunk_size=chunk_size,
+            chunk_overlap=chunk_overlap
+        )
+        split_docs = text_splitter.split_documents(docs)
+        embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
+        vectorstore = FAISS.from_documents(documents=split_docs, embedding=embedding)
+        vectorstore.save_local(vector_path)
+
+        return jsonify({"status": "success", "path": vector_path})
+    except Exception as e:
+        logger.exception("Error creating vector database")
+        return jsonify({"error": str(e)}), 500
+
+# Query API
+@app.route('/query', methods=['POST'])
+def query():
+    try:
+        data = request.json
+        question = data['question']
+        vector_path = data.get('vector_path')
+        use_vectordb = data.get('use_vectordb', False)
+
+        # Initialize LLM
+        callbacks = [StreamingCallback()]
+        model = LlamaCpp(
+            model_path=MODEL_PATH,
+            temperature=0.1,
+            max_tokens=1024,
+            n_batch=2048,
+            callbacks=callbacks,
+            n_ctx=10000,
+            n_threads=64,
+            n_threads_batch=64
+        )
+
+        # Create chain
+        if use_vectordb and vector_path:
+            embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
+            vectorstore = FAISS.load_local(vector_path, embedding, allow_dangerous_deserialization=True)
+            retriever = vectorstore.as_retriever().configurable_fields(
+                search_kwargs=ConfigurableField(id="search_kwargs")
+            ).with_config({"search_kwargs": {"k": 5}})
+
+            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+            You are a helpful assistant. Use the following context to answer the question.
+            Context: {context}
+            Question: {question}
+            Answer: <|eot_id|>"""
+
+            prompt = PromptTemplate(template=template, input_variables=["context", "question"])
+            chain = (
+                {"context": retriever | format_docs, "question": RunnablePassthrough()}
+                | prompt
+                | model
+                | StrOutputParser()
+            )
+        else:
+            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+            Question: {question}
+            Answer: <|eot_id|>"""
+
+            prompt = PromptTemplate(template=template, input_variables=["question"])
+            # Map the raw question string to the prompt's "question" variable.
+            chain = {"question": RunnablePassthrough()} | prompt | model | StrOutputParser()
+
+        # Generate response
+        response = chain.invoke(question)
+        return jsonify({"answer": response})
+    except Exception as e:
+        logger.exception("Error processing query")
+        return jsonify({"error": str(e)}), 500
+
+# File Upload API
+@app.route('/upload_file', methods=['POST'])
+def upload_file():
+    try:
+        file = request.files['file']
+        if file and file.filename.endswith('.pdf'):
+            filename = os.path.join(TEMP_DIR, "uploaded.pdf")
+            file.save(filename)
+            return jsonify({"status": "success", "path": filename})
+        return jsonify({"error": "Invalid file"}), 400
+    except Exception as e:
+        logger.exception("Error uploading file")
+        return jsonify({"error": str(e)}), 500
+
+if __name__ == '__main__':
+    # debug=True enables Flask's reloader and interactive debugger; use it for development only.
+    app.run(host='0.0.0.0', port=5000, debug=True)
+```
+
+## Run the Backend Server
+
+You are now ready to run the backend server for the RAG Chatbot.
+Use the following command in a terminal to start the backend server:
+
+```bash
+python3 backend.py
+```
+
+You should see output similar to the image below when the backend server starts successfully:
+![backend](backend_output.png)
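The JSON endpoints defined in `backend.py` can also be exercised without the web frontend. The sketch below only builds the requests, using the Python standard library. It assumes the backend is reachable at `http://localhost:5000`; the `chunk_size` default of 400 and the `cortex_m` store name are hypothetical values chosen for illustration.

```python
import json
import urllib.request

# Assumption: backend.py is running locally on the port it sets in app.run().
BACKEND_URL = "http://localhost:5000"

def build_json_request(endpoint, payload):
    """Build (but do not send) a JSON POST request for the backend."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BACKEND_URL}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def create_vectordb_request(vector_name, chunk_size=400, doc_type="PDF", url=None):
    """Request for /create_vectordb; chunk_size=400 is an illustrative default."""
    payload = {
        "vector_name": vector_name,
        "chunk_size": chunk_size,
        "doc_type": doc_type,
    }
    if doc_type == "HTML":
        payload["url"] = url  # the backend reads 'url' only for HTML sources
    return build_json_request("/create_vectordb", payload)

def query_request(question, vector_path=None):
    """Request for /query; RAG is enabled only when a vector_path is given."""
    payload = {
        "question": question,
        "vector_path": vector_path,
        "use_vectordb": vector_path is not None,
    }
    return build_json_request("/query", payload)

def send(request, timeout=600):
    """Send a prepared request and decode the JSON reply (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Hypothetical store name under the VECTOR_DIR used by backend.py.
req = query_request("Which Cortex-M processors support Helium?",
                    vector_path="/home/ubuntu/vector/cortex_m")
print(req.full_url)
print(req.data.decode("utf-8"))
```

With the server running, `send(req)` returns the decoded JSON reply, which contains an `answer` key on success or an `error` key on failure.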
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/backend_output.png b/content/learning-paths/servers-and-cloud-computing/rag/backend_output.png
diff --git a/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md b/content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
@@ -0,0 +1,76 @@
+---
+title: The RAG Chatbot and its Performance
+weight: 5
+
+layout: learningpathall
+---
+
+## Access the Web Application
+
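+Open the web application in your browser using either the local URL or the external URL. The external address below is an example; replace it with the public IP address of your own server: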
+
+```bash
+http://localhost:8501 or http://75.101.253.177:8501
+```
+
+## Upload a PDF File and Create a New Index
+
+Now you can upload a PDF file in the web browser by selecting the **Create New Store** option.
+
+Follow these steps to create a new index:
+
+1. Open the web browser and navigate to the Streamlit frontend.
+2. In the sidebar, select **Create New Store** under the **Vector Database** section.
+3. By default, **PDF** is the source type selected.
+4. Upload your PDF file using the file uploader.
+5. Enter a name for your vector index.
+6. Click the **Create Index** button.
+
+Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
+
+You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:
+
+![RAG_IMG1](rag_img1.png)
+
+## Load Existing Store
+
+After creating the index, you can switch to the **Load Existing Store** option and then select the index you created earlier. Initially, it will be the only available index and will be auto-selected.
+
+Follow these steps:
+
+1. Switch to the **Load Existing Store** option in the sidebar.
+2. Select the index you created. It should be auto-selected if it's the only one available.
+
+This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
+
+![RAG_IMG2](rag_img2.png)
+
+## Interact with the LLM
+
+You can now query the LLM using the prompt field in the web application. The responses are streamed to both the frontend and the backend server terminal.
+
+Follow these steps:
+
+1. Enter your query in the prompt field of the web application.
+2. Submit the query to receive a response from the LLM.
+
+![RAG_IMG3](rag_img3.png)
+
+While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
+
+![RAG_IMG4](rag_img4.png)
+
+## Observe Performance Metrics
+
+As shown in the image above, the RAG LLM Chatbot completed the generation in 4.65 seconds, processing and generating a total of `1183` tokens.
+
+This demonstrates the efficiency and speed of the RAG LLM Chatbot in handling queries and generating responses.
+
+## Further Interaction and Custom Applications
+
+You can continue to ask follow-up prompts and observe the performance metrics in the backend terminal.
+
+This setup demonstrates how you can build applications on an LLM backend connected to RAG to generate text grounded in specific documents. This Learning Path serves as a guide and example of RAG-based LLM inference on Arm CPUs and of the performance that the Arm-optimized setup delivers.
+
+
+
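The two figures reported in the performance section can be combined into a single throughput number. This is a rough sketch that assumes both figures describe the same request; because the token count covers prompt processing as well as generation, the result is a combined rate, not generation speed alone.

```python
def tokens_per_second(total_tokens, duration_seconds):
    """Combined throughput: tokens handled per second of wall-clock time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return total_tokens / duration_seconds

# Figures quoted above: 1183 tokens in 4.65 seconds.
rate = tokens_per_second(1183, 4.65)
print(f"{rate:.1f} tokens/s")  # prints "254.4 tokens/s"
```

Recording this number for each follow-up prompt gives a simple way to compare runs with different documents or settings.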