
Overview

This AI worker ("workerbee") operates at a high level, responding to model inference and training (fine-tuning) requests for specific models.

Currently we only support Llama 2, via the llama-cpp-python library, because it is fast, compatible with any platform and GPU, feature-complete, and handles model splitting. Stable Diffusion will likely be the next target.
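For orientation, here is a minimal sketch of the inference call the worker makes through llama-cpp-python; the model path and GPU offload setting are placeholder assumptions, not project configuration:

from llama_cpp import Llama

# Load a GGUF model from disk; path and quantization are placeholders.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=-1)

# llama-cpp-python mirrors the OpenAI chat schema, so the worker can pass
# a request's message list straight through.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}]
)
print(result["choices"][0]["message"]["content"])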

Workflow

graph TD

subgraph "Many Workers"
    W1[Connect to QueenBee URL via WebSockets]
    W2[Send a Registration Message]
    W3[Listen for Inference & Training Websocket Requests]
    W4[Reply with OpenAI Formatted Response]
    W1 --> W2
    W2 --> W3
    W3 --> W4
end

subgraph "One QueenBee"
    S1[Listen for Worker Connections]
    S2[Accumulate List of Active Workers]
    S3[Listen for Inference, Training REST Requests]
    S4[Pick an Available Worker]
    S5[Send Request Over a Websocket]
    S6[Translate Reply Into REST Response]
    S7[Complete Any Billing]
    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S5 --> S6
    S6 --> S7
end

subgraph "Many Clients"
    C1[Make REST Requests Against the QueenBee]
end

%%% Interactions %%%
W1 -.-> S1
W2 -.-> S2
C1 -.-> S3
S5 -.-> W3
W4 -.-> S6
S6 -.-> C1

Worker Protocol

Each worker communicates with the QueenBee through WebSockets and performs specific tasks, such as listening for inference and training requests and replying with OpenAI-formatted responses.

Steps

1. Establish WebSocket Connection

  • Action: Connect to [queenbee url] via WebSockets.
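A minimal sketch of this step using the third-party websockets package (the URL is a placeholder, not the real QueenBee endpoint):

import asyncio
import websockets  # pip install websockets

QUEENBEE_URL = "wss://queenbee.example.com/worker"  # placeholder

async def run_worker():
    # Step 1: open a persistent WebSocket connection to the QueenBee.
    async with websockets.connect(QUEENBEE_URL) as ws:
        ...  # registration and request handling happen over ws

asyncio.run(run_worker())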

2. Send Registration Message

  • Action: Send a JSON-formatted registration message through the established WebSocket connection (a sketch follows the field list).

  • Registration Message Fields:

    • ln_url (str): Lightning URL (address) for receiving payments.
    • auth_key (str): Optional token for control panel connection.
    • cpu_count (int): Number of CPUs available.
    • disk_space (int): Amount of disk space available (in GB).
    • vram (int): Amount of VRAM available (in MB).
    • nv_gpu_count (Optional[int], default: None): Count of Nvidia GPUs.
    • nv_driver_version (Optional[str], default: None): Version of Nvidia GPU driver.
    • nv_gpus (Optional[List[GpuInfo]], default: []): Information for each Nvidia GPU.
    • cl_driver_version (Optional[str], default: None): Version of OpenCL driver.
    • cl_gpus (Optional[List[GpuInfo]], default: []): Information for each OpenCL-compatible GPU.
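A sketch of composing and sending this message: the CPU and disk values are probed live with standard-library helpers, while the Lightning address and GPU fields are placeholders copied from the example below (a real worker would query the GPU driver):

import json
import os
import shutil

def registration_message() -> dict:
    disk = shutil.disk_usage("/")
    return {
        "ln_url": "worker@getalby.com",          # placeholder Lightning address
        "auth_key": "",                          # optional control-panel token
        "cpu_count": os.cpu_count(),
        "disk_space": disk.free // (1024 ** 3),  # free disk space in GB
        "vram": 2048,                            # MB; placeholder value
        "nv_gpu_count": 1,                       # placeholder GPU fields
        "nv_driver_version": "465.19.01",
        "nv_gpus": [{"name": "NVIDIA Tesla K80", "memory": 11441}],
    }

async def register(ws):
    # Step 2: send the registration frame over the open connection.
    await ws.send(json.dumps(registration_message()))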

3. Listen for Requests

  • Action: Listen for JSON-formatted requests over the WebSocket; each request contains two main members (see the sketch after the field list).

  • Request Fields:

    • openai_url (str): Always /v1/chat/completion. Can be ignored for now.
    • openai_req (dict): Contains the model name and a list of messages (formatted similarly to regular OpenAI requests).
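Continuing the sketch, the receive side of the worker loop decodes each incoming frame and hands the openai_req payload to the model:

import json

async def recv_request(ws) -> dict:
    # Step 3: block until the QueenBee forwards a job, then decode it.
    req = json.loads(await ws.recv())
    # openai_url is always /v1/chat/completion today, so it is not routed on.
    return req["openai_req"]  # {"model": ..., "messages": [...]}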

4. Reply with OpenAI-Formatted Response

  • Action: Send back a JSON-formatted response that follows the OpenAI API schema (a sketch follows the field list).

  • Response Fields:

    • choices: Array of completion choices; contains one or more entries depending on the request parameters.
    • usage: Token-usage metadata: prompt, completion, and total token counts.
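Since llama-cpp-python already emits OpenAI-shaped completions, its output can be sent back as-is. If the backend does not, the worker can assemble the reply by hand, as in this sketch (the token counts must come from the backend's tokenizer):

def make_response(text: str, prompt_tokens: int, completion_tokens: int) -> dict:
    # Step 4: wrap the generated text in the OpenAI chat-completion schema.
    return {
        "choices": [
            {
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
                "index": 0,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }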

Example Messages

Registration Message

{
  "ln_url": "simulx@getalby.com",
  "cpu_count": 8,
  "disk_space": 500,
  "vram": 2048,
  "nv_gpu_count": 1,
  "nv_driver_version": "465.19.01",
  "nv_gpus": [
    {
      "name": "NVIDIA Tesla K80",
      "memory": 11441
    }
  ]
}

Request Message

{
  "openai_url": "/v1/chat/completion",
  "openai_req": {
    "model": "TheBloke/CoolModelv2:Q4_K_M",
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }
}

Response Message

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 31,
    "total_tokens": 87
  }
}
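
On the QueenBee side, a reply like this can be unpacked before being translated into the REST response and billed; note that total_tokens is simply the sum of the prompt and completion counts (56 + 31 = 87). A minimal sketch:

def extract_reply(response: dict) -> tuple[str, int]:
    # Pull the assistant's text and the billable token total out of a reply.
    content = response["choices"][0]["message"]["content"]
    return content, response["usage"]["total_tokens"]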