[DOC][llm-api-gateway] Port LLM Gateway design and API examples into the monorepo #20

@sbaum1994

Report needed documentation

The LLM Gateway design currently lives outside the monorepo. Customers need the high-level design, component diagram, CLI behavior, and API examples available with the rest of the self-managed NVCF documentation.

The documentation should explain the LLM-optimized invocation path without requiring readers to inspect implementation notes or internal design artifacts. It should show how LLM traffic enters the LLM API Gateway, routes through the LLM Request Router, reaches the LLM Worker Gateway pod, and streams responses from the inference engine.
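
To make the routing step concrete for reviewers, here is a minimal sketch of model-aware worker selection with a routing key. It is an illustration only: this issue does not specify the LLM Request Router's actual algorithm, and the `Worker` type, its fields, and `select_worker` are hypothetical names. The sketch assumes a stable hash of the routing key is used to choose among healthy workers that serve the requested model.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    """Hypothetical view of a registered worker; not a real NVCF type."""
    worker_id: str
    models: frozenset
    healthy: bool = True


def select_worker(workers, model, routing_key):
    """Pick a healthy worker serving `model`; equal routing keys map to the same worker."""
    # Sort candidates so the choice does not depend on registration order.
    candidates = sorted(
        (w for w in workers if w.healthy and model in w.models),
        key=lambda w: w.worker_id,
    )
    if not candidates:
        return None  # the caller would report "no routable workers"
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]
```

A plain modulo hash like this reshuffles keys whenever the worker set changes; a real router would more likely use consistent hashing or a load-aware policy.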

Describe the documentation you'd like

Port the current LLM Gateway design and diagram into the monorepo docs at a high level.

The documentation should cover:

  • The dedicated LLM invocation path alongside the existing HTTP invocation and gRPC proxy paths.
  • Component responsibilities for LLM API Gateway, LLM Request Router, and the LLM Worker Gateway pod.
  • A source-controlled architecture diagram that shows control-plane services, compute-plane worker sidecars, inference engine pods, and request/streaming flow.
  • Function creation and deployment flow for functionType: "LLM".
  • Model exposure through the existing function models field and LLM-specific config such as supported URIs, tokenizer, token rate limit, and routing method.
  • Authentication flow through the NVCF API gRPC LLM auth endpoints for client invocation and worker registration.
  • Routing behavior, including routing key usage, model-aware worker selection, and streaming over the worker path.
  • Token-aware rate limiting, expected 429 behavior, and quota-related response headers where supported.
  • Deployment topology for self-managed NVCF, including the llm-api-gateway and llm-request-router services, LLM route, and worker-side sidecars injected by NVCA.
  • Telemetry expectations for latency, token usage, worker health, routing overhead, and distributed tracing.
  • Current constraints, including single-cluster beta behavior where applicable and the dependency on PKI work for secure multi-cluster QUIC transport.
  • Troubleshooting guidance for auth failures, no routable workers, rate limiting, streaming interruption, and worker credential issues.
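
As a possible illustration for the token-aware rate limiting item above, the following is a minimal token-bucket sketch in which each request is charged by its token count rather than counted as one call. It is not the gateway's implementation: the class name, parameters, and the idea of deriving a retry delay from the refill rate are assumptions made for this example.

```python
class TokenBudget:
    """Hypothetical token bucket that charges requests by LLM token count."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = float(capacity)          # maximum tokens that can accumulate
        self.refill_per_sec = float(refill_per_sec)
        self.available = float(capacity)         # start with a full budget
        self.last_refill = 0.0                   # timestamp of the last refill, in seconds

    def try_consume(self, tokens, now):
        """Return (allowed, retry_after_seconds) for a request costing `tokens`."""
        # Restore budget for the time elapsed since the last call.
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_sec)
        self.last_refill = now
        if tokens > self.capacity:
            return False, None  # can never fit in the budget; retrying will not help
        if tokens <= self.available:
            self.available -= tokens
            return True, 0.0
        # Denied: a server would answer 429 and could advertise this delay.
        return False, (tokens - self.available) / self.refill_per_sec
```

Passing `now` explicitly keeps the sketch deterministic; a real service would read a monotonic clock and would also need to reconcile estimated prompt tokens with the tokens actually generated.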

The NVCF CLI must also be extended so users can create, update, deploy, and inspect LLM functions without hand-authoring raw API payloads for the LLM-specific fields.

The API documentation should include examples for:

  • Creating an LLM function.
  • Creating an LLM function version.
  • Updating LLM model config such as token rate limit or routing method.
  • Known limitations on functionality.
  • Streaming responses and handling client disconnects.
  • Interpreting common 429, 503, and auth error responses.
  • An end-to-end sample.
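
For the error-handling example, something along these lines may be a useful starting point. It is a hedged sketch of client-side interpretation of 429, 503, and auth failures. The function name and the returned action labels are hypothetical, and it assumes the gateway sends the standard HTTP `Retry-After` header, which this issue does not confirm.

```python
def interpret_error(status, headers):
    """Map an HTTP error response to a (action, delay_seconds) pair."""
    # Header names are case-insensitive, so normalize before lookup.
    headers = {k.lower(): v for k, v in headers.items()}
    if status in (401, 403):
        # Auth failures are not retryable until credentials are fixed.
        return ("fix_credentials", None)
    if status in (429, 503):
        raw = headers.get("retry-after")
        try:
            delay = float(raw) if raw is not None else None
        except ValueError:
            delay = None  # Retry-After may also be an HTTP date; not parsed here
        return ("retry_later", delay)
    return ("fail", None)
```

When no delay is available, a client would typically fall back to exponential backoff with jitter instead of retrying immediately.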

Steps taken to search for needed documentation

N/A

Describe alternatives you've considered

N/A

Additional context

Suggested acceptance criteria:

  • Monorepo docs include the LLM Gateway design.
  • The design page includes a source-controlled diagram derived from the current design.
  • Docs describe LLM API Gateway, LLM Request Router, and LLM Worker Gateway pod responsibilities.
  • Docs explain the functionType: "LLM" lifecycle and LLM-specific model config.
  • nvcf-cli supports the LLM function workflow without requiring users to hand-author raw payloads.
  • API docs include copy-pastable examples for create, update, invoke, streaming, and common error handling.
  • The samples folder includes an end-to-end sample, linked in docs.
  • The docs call out current self-managed constraints and link to related PKI or multi-cluster follow-up work where appropriate.

Labels: documentation (Improvements or additions to documentation)