Add vLLM production deployment notes#112

Merged: chris-colinsky merged 1 commit into main from feature/docs-polish-vllm-production-notes on Jun 1, 2026
@chris-colinsky chris-colinsky commented Jun 1, 2026

Summary

Second of five docs-polish items. Extends docs/model-providers/vllm.md with cross-cutting gotchas surfaced in real production work.

Tool-calling section additions:

  • --tool-call-parser family table verified against vLLM's docs: Llama 3.x (llama3_json), Llama 4 (llama4_pythonic), Mistral (mistral), Hermes / Qwen 2.5 (hermes), Qwen3 / Qwen3-Coder (qwen3_xml), DeepSeek V3 (deepseek_v3), GPT-OSS (openai)
  • Explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma 2 / 3 / CodeGemma (no parser ships; only the specialized FunctionGemma 270M has one). Heads off three predictable user confusions.
  • Qwen3-VL caveat: vLLM's docs don't currently document a parser for Qwen3-VL specifically; check release notes before assuming the Qwen3 row carries over.
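
As a rough illustration of how the family table might be applied, the sketch below maps a model family to its parser and builds a `vllm serve` command line. The parser names are the ones listed above; the family keys, the `vllm_serve_args` helper, and the placeholder model name are hypothetical and not part of OA or vLLM. Families with no documented parser (mainstream Gemma, Qwen3-VL) are left out on purpose so the lookup fails loudly.

```python
# Family-to-parser mapping taken from the list above.
# The dictionary keys are illustrative labels, not official identifiers.
TOOL_CALL_PARSERS = {
    "llama-3": "llama3_json",
    "llama-4": "llama4_pythonic",
    "mistral": "mistral",
    "hermes": "hermes",
    "qwen2.5": "hermes",
    "qwen3": "qwen3_xml",
    "qwen3-coder": "qwen3_xml",
    "deepseek-v3": "deepseek_v3",
    "gpt-oss": "openai",
}


def vllm_serve_args(model: str, family: str) -> list:
    """Build an argv list for `vllm serve` with tool calling enabled.

    Raises ValueError for families without a listed parser, so an
    unsupported model is caught before the server starts.
    """
    parser = TOOL_CALL_PARSERS.get(family)
    if parser is None:
        raise ValueError(f"no tool-call parser listed for family {family!r}")
    return [
        "vllm", "serve", model,
        "--enable-auto-tool-choice",
        "--tool-call-parser", parser,
    ]


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real checkpoint.
    print(" ".join(vllm_serve_args("your-org/your-model", "qwen3")))
```

Raising on an unknown family rather than falling back to a default parser mirrors the "not supported here" callouts: a wrong parser tends to fail quietly at request time, which is harder to diagnose than a startup error.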

New "Production deployment" H2:

  • VLLM_HTTP_TIMEOUT_KEEP_ALIVE: the stock 5s uvicorn keep-alive closes idle connections that the OA-side httpx pool still holds, which surfaces as ProviderUnavailable; widen it to roughly 300s. The same rule applies behind a reverse proxy.
  • systemd unit skeleton: structural pattern, no model-specific paths; uses EnvironmentFile so units ship across hosts.
  • Throughput knobs (--max-model-len, --max-num-seqs, --gpu-memory-utilization) framed OA-side: when fan-out concurrency exceeds vLLM's cap, expect ProviderRateLimit.retry_after; wrap the LLM-calling node in RetryMiddleware.
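
One way the keep-alive setting and the EnvironmentFile pattern could fit together is sketched below: per-host settings are rendered into the `KEY=value` format that a systemd `EnvironmentFile=` directive reads, so the unit itself stays host-agnostic. The `render_env_file` helper is hypothetical, and the sketch assumes simple values without whitespace or quoting needs.

```python
def render_env_file(settings: dict) -> str:
    """Render settings as KEY=value lines for a systemd EnvironmentFile.

    Assumes plain values (no spaces, quotes, or newlines); anything more
    complex would need quoting that this sketch does not attempt.
    """
    lines = []
    for key, value in sorted(settings.items()):
        text = str(value)
        if any(ch.isspace() for ch in text):
            raise ValueError(f"value for {key} needs quoting: {text!r}")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 300 is the "roughly 300s" keep-alive value suggested above.
    host_settings = {"VLLM_HTTP_TIMEOUT_KEEP_ALIVE": 300}
    print(render_env_file(host_settings), end="")
```

Keeping host-specific values in a generated file means the same unit can be copied between hosts and only the environment file changes.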

Docs-only; no code or test changes.

Test plan

  • uv run pytest -q (1080 pass, no regressions)
  • ruff check + ruff format --check clean
  • No em dashes in new content (matches the polish convention from PR #111, "Add three patterns from graduated agent shapes")
  • Local mkdocs serve and confirm the new sections render correctly under Model Providers → Self-hosted vLLM

Out of scope

  • Items 3-5 of the docs-polish sweep (deep docs sweep, README pointer-block thinning, version on homepage). Separate PRs.
  • Sweeping the pre-existing 16 em dashes elsewhere in vllm.md. Falls under item 3's broader docs sweep.

Extend docs/model-providers/vllm.md with cross-cutting gotchas
surfaced in real production work. The "Tool calling" section grows
a --tool-call-parser family table (verified against vLLM's docs:
Llama 3.x, Llama 4, Mistral, Hermes, Qwen3, DeepSeek V3, GPT-OSS)
plus explicit not-supported callouts for Anthropic / Gemini
(proprietary cloud) and mainstream Gemma (no parser ships).

A new "Production deployment" H2 covers the three gotchas:

- VLLM_HTTP_TIMEOUT_KEEP_ALIVE: vLLM's stock 5s uvicorn keep-alive
  lapses pooled OA-side httpx connections and surfaces as
  ProviderUnavailable; widen to roughly 300s. Includes the
  reverse-proxy variant of the same rule.
- systemd unit skeleton: structural, no model-specific paths; uses
  EnvironmentFile so the unit ships across hosts.
- Throughput knobs (--max-model-len, --max-num-seqs,
  --gpu-memory-utilization) framed OA-side: when fan-out
  concurrency exceeds the cap, expect ProviderRateLimit; wrap
  the LLM-calling node in RetryMiddleware.
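
The retry behaviour described in the last item can be approximated with a generic loop that honours a server-supplied delay. This is only a sketch of the idea, not OA's RetryMiddleware: the `RateLimited` exception and the `call_with_retry` helper are hypothetical stand-ins, and the assumption is that the rate-limit error carries an optional `retry_after` value in seconds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider rate-limit error."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    fallback_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited up to max_attempts times.

    Waits for retry_after when the error provides it; otherwise uses
    exponential backoff starting at fallback_delay seconds.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = fallback_delay * 2 ** (attempt - 1)
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a real graph the project's own middleware would take this role.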

Docs-only; no code or test changes. CHANGELOG bullet added under
[Unreleased] ### Added.
Copilot AI review requested due to automatic review settings June 1, 2026 19:50
Copilot AI left a comment

Pull request overview

Docs-only PR extending docs/model-providers/vllm.md with a tool-call-parser family table (plus Gemma/Gemini/Qwen3-VL caveats) and a new "Production deployment" section covering keep-alive timeouts, a systemd unit skeleton, and throughput knobs.

Changes:

  • Replace single-model parser example with a model-family table and explicit caveats for proprietary/Gemma/Qwen3-VL cases.
  • Add a "Production deployment" H2 covering VLLM_HTTP_TIMEOUT_KEEP_ALIVE, a systemd unit skeleton, and --max-model-len/--max-num-seqs/--gpu-memory-utilization knobs.
  • CHANGELOG entry summarizing the additions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| docs/model-providers/vllm.md | Adds parser-family table, model-family caveats, and a production deployment section. |
| CHANGELOG.md | Adds Unreleased Added entry describing the new docs sections. |

@chris-colinsky chris-colinsky merged commit c1b2f23 into main Jun 1, 2026
7 checks passed
@chris-colinsky chris-colinsky deleted the feature/docs-polish-vllm-production-notes branch June 1, 2026 19:52
2 participants