8 changes: 8 additions & 0 deletions docs/ai-models/_category_.json
@@ -0,0 +1,8 @@
{
"label": "Serve AI models",
"position": 5,
"link": {
"type": "generated-index",
"description": "Serve open-source AI models via web APIs."
}
}
@@ -1,6 +1,6 @@
{
"label": "User Guide",
"position": 5,
"label": "Embeddings",
"position": 1,
"link": {
"type": "generated-index"
}
66 changes: 66 additions & 0 deletions docs/ai-models/embeddings/index.md
@@ -0,0 +1,66 @@
---
sidebar_position: 1
---

# Working with embedding models

Embedding models compute vectors from text inputs. These vectors can then be used as a search index
for semantic search in a vector database.
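
Once you have embedding vectors, semantic search is essentially a nearest-neighbor lookup over them. The snippet below is a minimal sketch of the idea: it ranks two candidate sentences against a query by cosine similarity, using short hypothetical placeholder vectors in place of real model output.

```
# Hypothetical 3-dimensional vectors; a real embedding model (such as the
# 1536-dimension model used in this guide) produces much longer vectors.
query_vec = [0.1, 0.9, 0.2]
docs = {
    "The cat sat on the mat.": [0.12, 0.85, 0.25],
    "Stock markets fell sharply.": [0.9, 0.05, 0.1],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Rank the documents from most to least similar to the query.
for text, vec in sorted(docs.items(), key=lambda kv: -cosine(query_vec, kv[1])):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
```

A vector database performs the same ranking at scale, using approximate nearest-neighbor indexes instead of a linear scan.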

### Step 1: Install WasmEdge

First off, you'll need WasmEdge, a high-performance, lightweight, and extensible WebAssembly (Wasm) runtime optimized for server-side and edge computing. To install WasmEdge along with the necessary plugin for AI inference, open your terminal and execute the following command:

```
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```

This command fetches and runs the WasmEdge installation script, which automatically installs WasmEdge along with the WASI-NN plugin required for running AI models.

### Step 2: Download the embedding model

Next, you'll need to obtain a model file. For this tutorial, we're focusing on the **GTE Qwen2 1.5B** model, a top-rated text embedding model built on Qwen2. It generates vectors of 1,536 dimensions. The steps are generally applicable to other embedding models too. Use the following command to download the model file.

```
curl -LO https://huggingface.co/second-state/gte-Qwen2-1.5B-instruct-GGUF/resolve/main/gte-Qwen2-1.5B-instruct-Q5_K_M.gguf
```

### Step 3: Download a portable API server app

Next, you need an application that can serve the model through an OpenAI-compatible API.
The [LlamaEdge api server app](https://github.com/LlamaEdge/LlamaEdge/tree/main/llama-api-server) is a lightweight, cross-platform Wasm app that works on any device
you might have. Just download the compiled binary app.

```
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```

> The LlamaEdge apps are written in Rust and compiled to portable Wasm. That means they can run across devices and OSes without any change to the binary apps. You can simply download and run the compiled wasm apps regardless of your platform.

### Step 4: Start the API server

Start the API server with the following command. Notice that the context size of this particular embedding model is
32k and the prompt template is `embedding`.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gte-Qwen2-1.5B-instruct-Q5_K_M.gguf llama-api-server.wasm --model-name gte-qwen2-1.5b --ctx-size 32768 --batch-size 8192 --ubatch-size 8192 --prompt-template embedding
```

### Step 5: Use the /embeddings API

You can now send embedding requests to it using the OpenAI-compatible `/embeddings` API endpoint.

```
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "The food was delicious and the waiter..."
}'
```

The response looks like the following.

```
{"object":"list","data":[{"index":0,"object":"embedding","embedding":[0.02968290634,0.04592291266,0.05229084566,-0.001912750886,-0.01647545397,0.01744602434,0.008423444815,0.01363539882,-0.005849621724,-0.004947130103,-0.02326701023,0.1068811566,0.01074867789, ... 0.005662892945,-0.01796873659,0.02428019233,-0.0333112292]}],"model":"gte-qwen2-1.5b","usage":{"prompt_tokens":9,"completion_tokens":0,"total_tokens":9}}
```
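
Since the endpoint is OpenAI-compatible, any OpenAI client library should be able to call it. Below is a minimal sketch using the official `openai` Python package, assuming the server from Step 4 is running locally; the `api_key` value is a placeholder because the local server does not check it.

```
from openai import OpenAI

# Point the client at the local LlamaEdge server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.embeddings.create(
    model="gte-qwen2-1.5b",  # the --model-name given in Step 4
    input="The food was delicious and the waiter...",
)

vector = resp.data[0].embedding
print(len(vector))  # expected to print 1536 for this model
```

The returned vectors can then be inserted into a vector database and queried by similarity, as outlined at the top of this guide.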

33 changes: 33 additions & 0 deletions docs/ai-models/index.md
@@ -0,0 +1,33 @@
---
sidebar_position: 1
---

# Introduction

LlamaEdge is a versatile platform supporting multiple types of AI models. The most common use of LlamaEdge is to
stand up API servers that can replace OpenAI as your application's backend.

## 🤖 Large Language Models (LLM)
Explore the LLM capabilities
➔ [Get Started with LLM](/docs/category/llm)

## 👁️ Multimodal Vision
Work with vision-language models like Llava and Qwen-VL
➔ [Get Started with Multimodal](/docs/category/multimodal)

## 👁️ Embeddings
Work with embedding models for vector and semantic search
➔ [Get Started with Embeddings](/docs/category/embeddings)

## 🎙️ Speech to Text
Run speech-to-text models like Whisper
➔ [Get Started with Speech to Text](/docs/category/speech-to-text)

## 🗣️ Text to Speech
Convert text to speech using models like GPT-SoVITS and Piper
➔ [Get Started with Text to Speech](/docs/category/text-to-speech)

## 🎨 Text to Image
Generate images using models like Stable Diffusion and FLUX
➔ [Get Started with Text-to-Image](/docs/category/text-to-image)

@@ -96,5 +96,5 @@ docker push secondstate/qwen-2-0.5b-allminilm-2:latest

## What's next

Use the container as a drop-in replacement for the OpenAI API for your favorite agent app or framework! [See some examples here](openai-api/intro.md).
Use the container as a drop-in replacement for the OpenAI API for your favorite agent app or framework! [See some examples here](../llama-nexus/openai-api/intro.md).

File renamed without changes.
@@ -40,24 +40,51 @@ curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llam

> The LlamaEdge apps are written in Rust and compiled to portable Wasm. That means they can run across devices and OSes without any change to the binary apps. You can simply download and run the compiled wasm apps regardless of your platform.

### Step 4: Chat with the chatbot UI

The `llama-api-server.wasm` is a web server with an OpenAI-compatible API. You still need HTML files for the chatbot UI.
Download and unzip the HTML UI files as follows.
### Step 4: Use the API

Start the web server by running the `llama-api-server.wasm` app in WasmEdge.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-api-server.wasm -p llama-3-chat
```

The `llama-api-server.wasm` app is a web server.
You can call its OpenAI-compatible `/chat/completions` API endpoint directly.

```
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Try to be as brief as possible."}, {"role":"user", "content": "Where is the capital of Texas?"}]}'
```

The response looks like the following.

```
{"id":"chatcmpl-5f0b5247-7afc-45f8-bc48-614712396a05","object":"chat.completion","created":1751945744,"model":"Mistral-Small-3.1-24B-Instruct-2503-Q5_K_M","choices":[{"index":0,"message":{"content":"The capital of Texas is Austin.","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":38,"completion_tokens":8,"total_tokens":46}}
```
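
The same endpoint also works with standard OpenAI client libraries. Here is a minimal sketch using the official `openai` Python package; the `api_key` value is a placeholder, and the `model` name is an assumption, so check the server's `/v1/models` endpoint for the name it actually reports.

```
from openai import OpenAI

# Point the client at the local LlamaEdge server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="default",  # placeholder; use the name listed by /v1/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Try to be as brief as possible."},
        {"role": "user", "content": "Where is the capital of Texas?"},
    ],
)

print(resp.choices[0].message.content)
```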

### Step 5: Chat with the chatbot UI

The Chatbot UI is a web app that can interact with the OpenAI-compatible `/chat/completions` API to
provide a human-friendly chatbot in your browser.

Download and unzip the HTML and JS files for the Chatbot UI as follows.

```
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
```

Then, start the web server.
Restart the web server to serve those HTML and JS files.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-api-server.wasm -p llama-3-chat
```

Go to `http://localhost:8080` on your computer to access the chatbot UI on a web page!

Congratulations! You have now started an LLM app on your own device. But if you are interested in running an agentic app beyond the simple chatbot, you will need to start an API server for this LLM along with the embedding model. Check out [this guide on how to do it](/docs/user-guide/openai-api/intro.md)!
Congratulations! You have now started an LLM app on your own device.

@@ -14,7 +14,7 @@ In this tutorial, we will show you a simple Python program that allows a local L

## Prerequisites

Follow [this guide](/docs/user-guide/openai-api/intro.md) to start an LlamaEdge API server.
Follow [this guide](quick-start-llm.md) to start an LlamaEdge API server.
For this example, we need an open-source model that is capable of tool calling.
The Llama 3.1 8B model is a good choice. Let's download the model file.

@@ -27,14 +27,12 @@ Then start the LlamaEdge API server for this model as follows.
```
wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
llama-api-server.wasm \
--model-alias default,embedding \
--model-name Meta-Llama-3.1-8B-Instruct-Q5_K_M,nomic-embed \
--prompt-template llama-3-tool,embedding \
--batch-size 128,8192 \
--ubatch-size 128,8192 \
--ctx-size 8192,8192
--model-name Meta-Llama-3.1-8B-Instruct-Q5_K_M \
--prompt-template llama-3-tool \
--batch-size 128 \
--ubatch-size 128 \
--ctx-size 8192
```

Note the `llama-3-tool` prompt template. It formats user queries and LLM responses, including the JSON messages for tool calls, into the structure the model is fine-tuned to follow.
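
To make the round trip concrete, here is a minimal sketch of a tool-call request using the official `openai` Python package. The `get_current_weather` tool definition is a hypothetical example, not something provided by LlamaEdge; the `api_key` value is a placeholder.

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-used")

# A hypothetical tool definition in the standard OpenAI "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct-Q5_K_M",
    messages=[{"role": "user", "content": "What is the weather in Austin?"}],
    tools=tools,
)

# If the model decided to call the tool, the parsed JSON arguments appear here.
print(resp.choices[0].message.tool_calls)
```
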
@@ -56,7 +56,7 @@ pip install -r requirements.txt
Set the environment variables for the API server and model name we just set up.

```
export OPENAI_MODEL_NAME="llama-3-groq-8b"
export OPENAI_MODEL_NAME="Meta-Llama-3.1-8B-Instruct-Q5_K_M"
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
```

File renamed without changes.
8 changes: 0 additions & 8 deletions docs/developer-guide/_category_.json

This file was deleted.

156 changes: 0 additions & 156 deletions docs/developer-guide/create-embeddings-collection.md

This file was deleted.

7 changes: 0 additions & 7 deletions docs/developer-guide/multimodal-app.md

This file was deleted.
