8 changes: 8 additions & 0 deletions docs/ai-models/_category_.json
@@ -0,0 +1,8 @@
{
"label": "Serve AI models",
"position": 5,
"link": {
"type": "generated-index",
"description": "Serve open-source AI models via web APIs."
}
}
@@ -1,6 +1,6 @@
{
"label": "User Guide",
"position": 5,
"label": "Embeddings",
"position": 1,
"link": {
"type": "generated-index"
}
66 changes: 66 additions & 0 deletions docs/ai-models/embeddings/index.md
@@ -0,0 +1,66 @@
---
sidebar_position: 1
---

# Working with embedding models

Embedding models compute vectors from text inputs. These vectors can then be used as a search index
for semantic search in a vector database.
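
Once you have embedding vectors, semantic search is essentially a nearest-neighbor lookup over them. The snippet below is a minimal sketch of the idea: it ranks two candidate sentences against a query by cosine similarity, using short hypothetical placeholder vectors in place of real model output.

```
# Hypothetical 3-dimensional vectors; a real embedding model (such as the
# 1536-dimension model used in this guide) produces much longer vectors.
query_vec = [0.1, 0.9, 0.2]
docs = {
    "The cat sat on the mat.": [0.12, 0.85, 0.25],
    "Stock markets fell sharply.": [0.9, 0.05, 0.1],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Rank the documents from most to least similar to the query.
for text, vec in sorted(docs.items(), key=lambda kv: -cosine(query_vec, kv[1])):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
```

A vector database performs the same ranking at scale, using approximate nearest-neighbor indexes instead of a linear scan.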

### Step 1: Install WasmEdge

First off, you'll need WasmEdge, a high-performance, lightweight, and extensible WebAssembly (Wasm) runtime optimized for server-side and edge computing. To install WasmEdge along with the necessary plugin for AI inference, open your terminal and execute the following command:

```
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```

This command fetches and runs the WasmEdge installation script, which automatically installs WasmEdge along with the WASI-NN plugin required for running AI models.

### Step 2: Download the embedding model

Next, you'll need to obtain a model file. For this tutorial, we're focusing on the **GTE Qwen2 1.5B** model, a top-rated text embedding model built on Qwen2. It generates vectors of 1,536 dimensions. The steps are generally applicable to other embedding models too. Use the following command to download the model file.

```
curl -LO https://huggingface.co/second-state/gte-Qwen2-1.5B-instruct-GGUF/resolve/main/gte-Qwen2-1.5B-instruct-Q5_K_M.gguf
```

### Step 3: Download a portable API server app

Next, you need an application that can serve the model through an OpenAI-compatible API.
The [LlamaEdge api server app](https://github.com/LlamaEdge/LlamaEdge/tree/main/llama-api-server) is a lightweight, cross-platform Wasm app that works on any device
you might have. Just download the compiled binary app.

```
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```

> The LlamaEdge apps are written in Rust and compiled to portable Wasm. That means they can run across devices and OSes without any change to the binary apps. You can simply download and run the compiled wasm apps regardless of your platform.

### Step 4: Start the API server

Start the API server with the following command. Notice that the context size of this particular embedding model is
32k and the prompt template is `embedding`.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gte-Qwen2-1.5B-instruct-Q5_K_M.gguf llama-api-server.wasm --model-name gte-qwen2-1.5b --ctx-size 32768 --batch-size 8192 --ubatch-size 8192 --prompt-template embedding
```

### Step 5: Use the /embeddings API

You can now send embedding requests to it using the OpenAI-compatible `/embeddings` API endpoint.

```
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "The food was delicious and the waiter..."
}'
```

The response looks like the following.

```
{"object":"list","data":[{"index":0,"object":"embedding","embedding":[0.02968290634,0.04592291266,0.05229084566,-0.001912750886,-0.01647545397,0.01744602434,0.008423444815,0.01363539882,-0.005849621724,-0.004947130103,-0.02326701023,0.1068811566,0.01074867789, ... 0.005662892945,-0.01796873659,0.02428019233,-0.0333112292]}],"model":"gte-qwen2-1.5b","usage":{"prompt_tokens":9,"completion_tokens":0,"total_tokens":9}}
```
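
Since the endpoint is OpenAI-compatible, any OpenAI client library should be able to call it. Below is a minimal sketch using the official `openai` Python package, assuming the server from Step 4 is running locally; the `api_key` value is a placeholder because the local server does not check it.

```
from openai import OpenAI

# Point the client at the local LlamaEdge server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.embeddings.create(
    model="gte-qwen2-1.5b",  # the --model-name given in Step 4
    input="The food was delicious and the waiter...",
)

vector = resp.data[0].embedding
print(len(vector))  # expected to print 1536 for this model
```

The returned vectors can then be inserted into a vector database and queried by similarity, as outlined at the top of this guide.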

33 changes: 33 additions & 0 deletions docs/ai-models/index.md
@@ -0,0 +1,33 @@
---
sidebar_position: 1
---

# Introduction

LlamaEdge is a versatile platform supporting multiple types of AI models. The most common use of LlamaEdge is to
stand up API servers that can replace OpenAI as your application's backend.

## 🤖 Large Language Models (LLM)
Explore the LLM capabilities
➔ [Get Started with LLM](/docs/category/llm)

## 👁️ Multimodal Vision
Work with vision-language models like Llava and Qwen-VL
➔ [Get Started with Multimodal](/docs/category/multimodal)

## 👁️ Embeddings
Work with embedding models for vector and semantic search
➔ [Get Started with Embeddings](/docs/category/embeddings)

## 🎙️ Speech to Text
Run speech-to-text models like Whisper
➔ [Get Started with Speech to Text](/docs/category/speech-to-text)

## 🗣️ Text to Speech
Convert text to speech using models like GPT-SoVITS and Piper
➔ [Get Started with Text to Speech](/docs/category/text-to-speech)

## 🎨 Text to Image
Generate images using models like Stable Diffusion and FLUX
➔ [Get Started with Text-to-Image](/docs/category/text-to-image)

@@ -96,5 +96,5 @@ docker push secondstate/qwen-2-0.5b-allminilm-2:latest

## What's next

Use the container as a drop-in replacement for the OpenAI API for your favorite agent app or framework! [See some examples here](openai-api/intro.md).
Use the container as a drop-in replacement for the OpenAI API for your favorite agent app or framework! [See some examples here](../llama-nexus/openai-api/intro.md).

File renamed without changes.
@@ -40,24 +40,51 @@ curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llam

> The LlamaEdge apps are written in Rust and compiled to portable Wasm. That means they can run across devices and OSes without any change to the binary apps. You can simply download and run the compiled wasm apps regardless of your platform.

### Step 4: Chat with the chatbot UI

The `llama-api-server.wasm` is a web server with an OpenAI-compatible API. You still need HTML files for the chatbot UI.
Download and unzip the HTML UI files as follows.
### Step 4: Use the API

Start the web server by running the `llama-api-server.wasm` app in WasmEdge.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-api-server.wasm -p llama-3-chat
```

The `llama-api-server.wasm` app is a web server.
You can call its OpenAI-compatible `/chat/completions` API endpoint directly.

```
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Try to be as brief as possible."}, {"role":"user", "content": "Where is the capital of Texas?"}]}'
```

The response looks like the following.

```
{"id":"chatcmpl-5f0b5247-7afc-45f8-bc48-614712396a05","object":"chat.completion","created":1751945744,"model":"Mistral-Small-3.1-24B-Instruct-2503-Q5_K_M","choices":[{"index":0,"message":{"content":"The capital of Texas is Austin.","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":38,"completion_tokens":8,"total_tokens":46}}
```
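
The same endpoint also works with standard OpenAI client libraries. Here is a minimal sketch using the official `openai` Python package; the `api_key` value is a placeholder, and the `model` name is an assumption, so check the server's `/v1/models` endpoint for the name it actually reports.

```
from openai import OpenAI

# Point the client at the local LlamaEdge server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="default",  # placeholder; use the name listed by /v1/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Try to be as brief as possible."},
        {"role": "user", "content": "Where is the capital of Texas?"},
    ],
)

print(resp.choices[0].message.content)
```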

### Step 5: Chat with the chatbot UI

The Chatbot UI is a web app that can interact with the OpenAI-compatible `/chat/completions` API to
provide a human-friendly chatbot in your browser.

Download and unzip the HTML and JS files for the Chatbot UI as follows.

```
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
```

Then, start the web server.
Restart the web server to serve those HTML and JS files.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-api-server.wasm -p llama-3-chat
```

Go to `http://localhost:8080` on your computer to access the chatbot UI on a web page!

Congratulations! You have now started an LLM app on your own device. But if you are interested in running an agentic app beyond the simple chatbot, you will need to start an API server for this LLM along with the embedding model. Check out [this guide on how to do it](/docs/user-guide/openai-api/intro.md)!
Congratulations! You have now started an LLM app on your own device.

@@ -14,7 +14,7 @@ In this tutorial, we will show you a simple Python program that allows a local L

## Prerequisites

Follow [this guide](/docs/user-guide/openai-api/intro.md) to start an LlamaEdge API server.
Follow [this guide](quick-start-llm.md) to start an LlamaEdge API server.
For this example, we need an open-source model that is capable of tool calling.
The Llama 3.1 8B model is a good choice. Let's download the model file.

@@ -27,14 +27,12 @@ Then start the LlamaEdge API server for this model as follows.
```
wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
llama-api-server.wasm \
--model-alias default,embedding \
--model-name Meta-Llama-3.1-8B-Instruct-Q5_K_M,nomic-embed \
--prompt-template llama-3-tool,embedding \
--batch-size 128,8192 \
--ubatch-size 128,8192 \
--ctx-size 8192,8192
--model-name Meta-Llama-3.1-8B-Instruct-Q5_K_M \
--prompt-template llama-3-tool \
--batch-size 128 \
--ubatch-size 128 \
--ctx-size 8192
```

Note the `llama-3-tool` prompt template. It formats user queries and LLM responses, including the JSON messages for tool calls, into the structure the model is fine-tuned to follow.
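
To make the round trip concrete, here is a minimal sketch of a tool-call request using the official `openai` Python package. The `get_current_weather` tool definition is a hypothetical example, not something provided by LlamaEdge; the `api_key` value is a placeholder.

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-used")

# A hypothetical tool definition in the standard OpenAI "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct-Q5_K_M",
    messages=[{"role": "user", "content": "What is the weather in Austin?"}],
    tools=tools,
)

# If the model decided to call the tool, the parsed JSON arguments appear here.
print(resp.choices[0].message.tool_calls)
```
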
@@ -56,7 +56,7 @@ pip install -r requirements.txt
Set the environment variables for the API server and model name we just set up.

```
export OPENAI_MODEL_NAME="llama-3-groq-8b"
export OPENAI_MODEL_NAME="Meta-Llama-3.1-8B-Instruct-Q5_K_M"
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
```

File renamed without changes.
8 changes: 0 additions & 8 deletions docs/developer-guide/_category_.json

This file was deleted.

156 changes: 0 additions & 156 deletions docs/developer-guide/create-embeddings-collection.md

This file was deleted.

7 changes: 0 additions & 7 deletions docs/developer-guide/multimodal-app.md

This file was deleted.
