39 changes: 39 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/rag/_index.md
@@ -0,0 +1,39 @@
---
title: Deploy a RAG-based Chatbot with llama-cpp-python using KleidiAI on Arm Servers

minutes_to_complete: 45

who_is_this_for: This Learning Path is for software developers, ML engineers, and anyone looking to deploy production-ready LLM chatbots with RAG capabilities, knowledge base integration, and performance optimization on the Arm architecture.

learning_objectives:
- Set up llama-cpp-python optimized for Arm servers.
- Implement RAG architecture using the FAISS vector database.
- Optimize model performance through 4-bit quantization.
- Build a web interface for document upload and chat.
- Monitor and analyze inference performance metrics.

prerequisites:
- Basic understanding of Python and ML concepts.
- Familiarity with REST APIs and web services.
- Basic knowledge of vector databases.
- Understanding of LLM fundamentals.

author_primary: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
- Neoverse
subjects: LLM
operatingsystems:
- Linux
tools_software_languages:
- Python
- Streamlit

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
45 changes: 45 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/rag/_review.md
@@ -0,0 +1,45 @@
---
review:
    - questions:
        question: >
            What is the primary purpose of using RAG in an LLM chatbot?
        answers:
            - To reduce the size of the model.
            - To enhance the chatbot's responses with contextually-relevant information.
            - To increase the training speed of the model.
            - To simplify the deployment process.
        correct_answer: 2
        explanation: >
            RAG (Retrieval Augmented Generation) enhances the chatbot's responses by retrieving and incorporating contextually-relevant information from a vector database.
    - questions:
        question: >
            Which framework is used to create the web interface for the RAG-based LLM server?
        answers:
            - Django.
            - Flask.
            - Streamlit.
            - FastAPI.
        correct_answer: 3
        explanation: >
            Streamlit is used to create the web interface for the RAG-based LLM server, allowing users to interact with the backend.
    - questions:
        question: >
            What is the role of FAISS in the RAG-based LLM server?
        answers:
            - To train the LLM model.
            - To store and retrieve vectorized documents.
            - To handle HTTP requests.
            - To manage user authentication.
        correct_answer: 2
        explanation: >
            FAISS is used to store and retrieve vectorized documents, enabling the RAG-based LLM server to provide contextually relevant responses.
# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
title: "Review" # Always the same title
weight: 6 # Set to always be larger than the content in this path
layout: "learningpathall" # All files under learning paths have this same wrapper
---
196 changes: 196 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/rag/backend.md
@@ -0,0 +1,196 @@
---
title: Deploy a RAG-based LLM backend server
weight: 3

layout: learningpathall
---

## Backend Script for RAG-based LLM Server
Once the virtual environment is activated, create a `backend.py` script using the following content. This script integrates the LLM with the FAISS vector database for RAG:

```python
import os
import time
import logging
from flask import Flask, request, jsonify
from flask_cors import CORS
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import HTMLHeaderTextSplitter, CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import ConfigurableField, RunnablePassthrough

# Configure logging
logging.getLogger('watchdog').setLevel(logging.ERROR)
logger = logging.getLogger(__name__)

# Initialize Flask app
app = Flask(__name__)
CORS(app)

# Configure paths
BASE_PATH = "/home/ubuntu"
TEMP_DIR = os.path.join(BASE_PATH, "temp")
VECTOR_DIR = os.path.join(BASE_PATH, "vector")
MODEL_PATH = os.path.join(BASE_PATH, "models/llama3.1-8b-instruct.Q4_0_arm.gguf")

# Ensure directories exist
os.makedirs(TEMP_DIR, exist_ok=True)
os.makedirs(VECTOR_DIR, exist_ok=True)

# Token Streaming
class StreamingCallback(StreamingStdOutCallbackHandler):
    def __init__(self):
        super().__init__()
        self.tokens = []
        self.start_time = None

    def on_llm_start(self, *args, **kwargs):
        self.start_time = time.time()
        self.tokens = []
        print("\nLLM Started generating response...", flush=True)

    def on_llm_new_token(self, token: str, **kwargs):
        self.tokens.append(token)
        print(token, end="", flush=True)

    def on_llm_end(self, *args, **kwargs):
        end_time = time.time()
        duration = end_time - self.start_time
        print(f"\nLLM finished generating response in {duration:.2f} seconds", flush=True)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs).replace("Context:", "").strip()

# Vectordb creating API
@app.route('/create_vectordb', methods=['POST'])
def create_vectordb():
    try:
        data = request.json
        vector_name = data['vector_name']
        chunk_size = int(data['chunk_size'])
        doc_type = data['doc_type']
        vector_path = os.path.join(VECTOR_DIR, vector_name)

        # Process document
        chunk_overlap = 30
        if doc_type == "PDF":
            loader = DirectoryLoader(TEMP_DIR, glob='*.pdf', loader_cls=PyPDFLoader)
            docs = loader.load()
        elif doc_type == "HTML":
            url = data['url']
            splitter = HTMLHeaderTextSplitter([
                ("h1", "Header 1"), ("h2", "Header 2"),
                ("h3", "Header 3"), ("h4", "Header 4")
            ])
            docs = splitter.split_text_from_url(url)
        else:
            return jsonify({"error": "Unsupported document type"}), 400

        # Create vectorstore
        text_splitter = CharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
        vectorstore = FAISS.from_documents(documents=split_docs, embedding=embedding)
        vectorstore.save_local(vector_path)

        return jsonify({"status": "success", "path": vector_path})
    except Exception as e:
        logger.exception("Error creating vector database")
        return jsonify({"error": str(e)}), 500

# Query API
@app.route('/query', methods=['POST'])
def query():
    try:
        data = request.json
        question = data['question']
        vector_path = data.get('vector_path')
        use_vectordb = data.get('use_vectordb', False)

        # Initialize LLM
        callbacks = [StreamingCallback()]
        model = LlamaCpp(
            model_path=MODEL_PATH,
            temperature=0.1,
            max_tokens=1024,
            n_batch=2048,
            callbacks=callbacks,
            n_ctx=10000,
            n_threads=64,
            n_threads_batch=64
        )

        # Create chain
        if use_vectordb and vector_path:
            # RAG path: retrieve the top 5 chunks and pass them in as context
            embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
            vectorstore = FAISS.load_local(vector_path, embedding, allow_dangerous_deserialization=True)
            retriever = vectorstore.as_retriever().configurable_fields(
                search_kwargs=ConfigurableField(id="search_kwargs")
            ).with_config({"search_kwargs": {"k": 5}})

            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant. Use the following context to answer the question.
Context: {context}
Question: {question}
Answer: <|eot_id|>"""

            prompt = PromptTemplate(template=template, input_variables=["context", "question"])
            chain = (
                {"context": retriever | format_docs, "question": RunnablePassthrough()}
                | prompt
                | model
                | StrOutputParser()
            )
        else:
            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Question: {question}
Answer: <|eot_id|>"""

            prompt = PromptTemplate(template=template, input_variables=["question"])
            # Map the raw question string onto the prompt's "question" variable.
            # (RunnablePassthrough.assign expects a dict input, not a string.)
            chain = {"question": RunnablePassthrough()} | prompt | model | StrOutputParser()

        # Generate response
        response = chain.invoke(question)
        return jsonify({"answer": response})
    except Exception as e:
        logger.exception("Error processing query")
        return jsonify({"error": str(e)}), 500

# File Upload API
@app.route('/upload_file', methods=['POST'])
def upload_file():
    try:
        file = request.files['file']
        if file and file.filename.endswith('.pdf'):
            filename = os.path.join(TEMP_DIR, "uploaded.pdf")
            file.save(filename)
            return jsonify({"status": "success", "path": filename})
        return jsonify({"error": "Invalid file"}), 400
    except Exception as e:
        logger.exception("Error uploading file")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
```

## Run the Backend Server

You are now ready to run the backend server for the RAG Chatbot.
Use the following command in a terminal to start the backend server:

```bash
python3 backend.py
```

You should see output similar to the image below when the backend server starts successfully:
![backend](backend_output.png)
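
Once the server is up, you may want to check the `/query` route before starting the frontend. The following is a minimal sketch, not part of the original Learning Path: it uses only the Python standard library, assumes the backend is reachable at `http://localhost:5000`, and the helper names `build_query_payload`, `build_request`, and `send` are hypothetical. The JSON keys (`question`, `use_vectordb`, `vector_path`) are the ones `backend.py` reads.

```python
import json
import urllib.request

# Assumption: backend.py is running on the same machine on port 5000.
BACKEND_URL = "http://localhost:5000"


def build_query_payload(question, vector_path=None):
    """Build the JSON body that the /query route in backend.py reads."""
    payload = {"question": question, "use_vectordb": vector_path is not None}
    if vector_path is not None:
        payload["vector_path"] = vector_path
    return payload


def build_request(path, payload):
    """Wrap a payload in a JSON POST request for the given backend route."""
    return urllib.request.Request(
        BACKEND_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and decode the JSON reply (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `send(build_request("/query", build_query_payload("What is Arm Neoverse?")))` should return a dictionary with an `answer` key. If the backend responds with an error status, `urlopen` raises `urllib.error.HTTPError` instead of returning the error body.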
76 changes: 76 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/rag/chatbot.md
@@ -0,0 +1,76 @@
---
title: The RAG Chatbot and its Performance
weight: 5

layout: learningpathall
---

## Access the Web Application

Open the web application in your browser using either the local URL or the external URL:

```bash
http://localhost:8501 or http://<your-server-public-ip>:8501
```

## Upload a PDF File and Create a New Index

Now you can upload a PDF file in the web browser by selecting the **Create New Store** option.

Follow these steps to create a new index:

1. Open the web browser and navigate to the Streamlit frontend.
2. In the sidebar, select **Create New Store** under the **Vector Database** section.
3. By default, **PDF** is the source type selected.
4. Upload your PDF file using the file uploader.
5. Enter a name for your vector index.
6. Click the **Create Index** button.

For this example, upload the Cortex-M processor comparison document, which you can download from [the Arm Developer website](https://developer.arm.com/documentation/102787/latest/).

You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:

![RAG_IMG1](rag_img1.png)
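
Behind the sidebar, creating an index comes down to a JSON request to the backend's `/create_vectordb` route (after the PDF has been sent to `/upload_file` as the form field `file`). The sketch below only builds and validates that JSON body, using the keys `backend.py` reads. The helper name `build_create_vectordb_payload` and the default `chunk_size` of 400 are assumptions made for illustration; the frontend may use different values.

```python
def build_create_vectordb_payload(vector_name, chunk_size=400, doc_type="PDF", url=None):
    """Build the JSON body expected by the /create_vectordb route.

    chunk_size=400 is an illustrative default, not a value from the Learning Path.
    """
    if doc_type not in ("PDF", "HTML"):
        # backend.py rejects any other document type with HTTP 400
        raise ValueError("doc_type must be 'PDF' or 'HTML'")
    if doc_type == "HTML" and not url:
        # backend.py reads data['url'] only for HTML sources
        raise ValueError("an HTML source needs a url")
    chunk_size = int(chunk_size)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    payload = {
        "vector_name": vector_name,
        "chunk_size": chunk_size,
        "doc_type": doc_type,
    }
    if doc_type == "HTML":
        payload["url"] = url
    return payload


# Example: the body for a PDF-backed index named "cortex_m"
print(build_create_vectordb_payload("cortex_m"))
```

Validating on the client side keeps obviously bad requests from reaching the backend, where a missing key would surface as a generic HTTP 500 error.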

## Load Existing Store

After creating the index, you can switch to the **Load Existing Store** option and then select the index you created earlier. Initially, it will be the only available index and will be auto-selected.

Follow these steps:

1. Switch to the **Load Existing Store** option in the sidebar.
2. Select the index you created. It should be auto-selected if it's the only one available.

This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:

![RAG_IMG2](rag_img2.png)
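
To see which stores are available outside the web interface, you can inspect the vector directory directly. This is a sketch under two assumptions: the stores live under `/home/ubuntu/vector` (the `VECTOR_DIR` in `backend.py`), and each store is a folder containing the `index.faiss` file that LangChain's `FAISS.save_local` writes by default. The helper name `list_vector_stores` is hypothetical, and the frontend may discover stores differently.

```python
import os

# Matches VECTOR_DIR in backend.py (BASE_PATH + "/vector").
VECTOR_DIR = "/home/ubuntu/vector"


def list_vector_stores(vector_dir=VECTOR_DIR):
    """Return the names of sub-folders that contain a saved FAISS index."""
    if not os.path.isdir(vector_dir):
        return []
    stores = []
    for name in sorted(os.listdir(vector_dir)):
        index_file = os.path.join(vector_dir, name, "index.faiss")
        if os.path.isfile(index_file):
            stores.append(name)
    return stores


print(list_vector_stores())
```

A returned name joined with `VECTOR_DIR` gives the `vector_path` value that the backend's `/query` route expects.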

## Interact with the LLM

You can now query the LLM using the prompt field in the web application. Responses are streamed to both the frontend and the backend server terminal.

Follow these steps:

1. Enter your query in the prompt field of the web application.
2. Submit the query to receive a response from the LLM.

![RAG_IMG3](rag_img3.png)

While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.

![RAG_IMG4](rag_img4.png)

## Observe Performance Metrics

As shown in the image above, the RAG LLM chatbot completed generation in 4.65 seconds and handled a total of `1183` tokens, counting both processed and generated tokens.

These two figures give a concrete measure of how quickly the chatbot handles a query and generates a response on the Arm server.
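
To compare runs, it helps to turn the duration and token count into a single throughput figure. The snippet below is a small sketch of that arithmetic; the function name `tokens_per_second` is hypothetical, and it assumes the reported total covers both prompt and generated tokens, so the result is a combined rate rather than pure generation speed.

```python
def tokens_per_second(total_tokens, duration_seconds):
    """Combined throughput: tokens handled divided by wall-clock seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return total_tokens / duration_seconds


# Figures from the run shown above: 1183 tokens in 4.65 seconds.
rate = tokens_per_second(1183, 4.65)
print(f"{rate:.1f} tokens/s")  # prints "254.4 tokens/s"
```

Substitute the values from your own backend terminal to track how changes such as thread count or context size affect throughput.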

## Further Interaction and Custom Applications

You can continue to ask follow-up prompts and observe the performance metrics in the backend terminal.

This setup shows how you can build applications on top of an LLM backend connected to RAG, generating text grounded in your own documents. This Learning Path serves as a guide and working example of RAG-based LLM inference on Arm CPUs and the performance you can expect from the optimized stack.


