Add vLLM production deployment notes by chris-colinsky · Pull Request #112 · LunarCommand/openarmature-python

chris-colinsky · 2026-06-01T19:50:32Z

Summary

Second of five docs-polish items. Extends docs/model-providers/vllm.md with cross-cutting gotchas surfaced in real production work.

Tool-calling section additions:

--tool-call-parser family table verified against vLLM's docs: Llama 3.x (llama3_json), Llama 4 (llama4_pythonic), Mistral (mistral), Hermes / Qwen 2.5 (hermes), Qwen3 / Qwen3-Coder (qwen3_xml), DeepSeek V3 (deepseek_v3), GPT-OSS (openai)
Explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma 2 / 3 / CodeGemma (no parser ships; only the specialized FunctionGemma 270M has one). Heads off three predictable user confusions.
Qwen3-VL caveat: vLLM's docs don't currently document a parser for Qwen3-VL specifically; check release notes before assuming the Qwen3 row carries over.
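As a minimal sketch of how the parser table above could be used when assembling a launch command, the mapping below copies the parser names listed in this PR. The family keys and the `tool_calling_args` helper are hypothetical, chosen only for illustration; `--enable-auto-tool-choice` and `--tool-call-parser` are vLLM server flags. Families without a parser (mainstream Gemma, for example) are rejected rather than guessed at.

```python
# Model family -> --tool-call-parser value, copied from the table in this PR.
# The dictionary keys are illustrative labels, not identifiers vLLM recognizes.
TOOL_CALL_PARSERS = {
    "llama3": "llama3_json",
    "llama4": "llama4_pythonic",
    "mistral": "mistral",
    "hermes": "hermes",
    "qwen2.5": "hermes",
    "qwen3": "qwen3_xml",
    "qwen3-coder": "qwen3_xml",
    "deepseek-v3": "deepseek_v3",
    "gpt-oss": "openai",
}


def tool_calling_args(family: str) -> list:
    """Return the vLLM CLI arguments that enable tool calling for a family.

    Hypothetical helper: raises ValueError for families with no listed
    parser instead of falling back to a default.
    """
    parser = TOOL_CALL_PARSERS.get(family.lower())
    if parser is None:
        raise ValueError(f"no tool-call parser listed for family {family!r}")
    return ["--enable-auto-tool-choice", "--tool-call-parser", parser]
```

Failing loudly on an unknown family mirrors the "not supported here" callouts: a missing row should stop the deployment, not silently pick a neighbouring parser.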

New "Production deployment" H2:

VLLM_HTTP_TIMEOUT_KEEP_ALIVE: vLLM's stock 5s uvicorn keep-alive closes idle connections that OA-side httpx still holds in its pool, so the next request on one of them surfaces as ProviderUnavailable; widen the keep-alive to roughly 300s. The same rule applies behind a reverse proxy.
systemd unit skeleton: structural pattern, no model-specific paths; uses EnvironmentFile so units ship across hosts.
Throughput knobs (--max-model-len, --max-num-seqs, --gpu-memory-utilization) framed OA-side: when fan-out concurrency exceeds vLLM's cap, expect ProviderRateLimit.retry_after; wrap the LLM-calling node in RetryMiddleware.
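The two OA-facing points above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: `build_vllm_launch`, `call_with_retry`, and the `RateLimited` exception are hypothetical stand-ins (the last one for an error like ProviderRateLimit that carries a `retry_after` hint), and the numeric defaults for the throughput flags are placeholders rather than recommendations. The environment variable and the three flags are the ones named in this PR.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical stand-in for a rate-limit error with a retry hint."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited")
        self.retry_after = retry_after


def build_vllm_launch(model, keep_alive_s=300, max_model_len=8192,
                      max_num_seqs=64, gpu_memory_utilization=0.9):
    """Return (env, argv) for a vLLM server process.

    The defaults other than keep_alive_s are placeholder values.
    """
    # Widen the server keep-alive so pooled client connections stay valid.
    env = {"VLLM_HTTP_TIMEOUT_KEEP_ALIVE": str(keep_alive_s)}
    argv = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return env, argv


def call_with_retry(fn: Callable, max_attempts: int = 4,
                    base_delay: float = 0.5, sleep=time.sleep):
    """Call fn, retrying on RateLimited.

    Waits for the server-provided retry_after when present, otherwise
    uses exponential backoff starting at base_delay seconds.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            backoff = base_delay * 2 ** (attempt - 1)
            sleep(exc.retry_after if exc.retry_after is not None else backoff)
```

In the actual library the retry behaviour would come from wrapping the LLM-calling node in RetryMiddleware; the loop here only shows the idea of honouring the retry hint when fan-out concurrency exceeds the server's cap.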

Docs-only; no code or test changes.

Test plan

uv run pytest -q (1080 pass, no regressions)
ruff check + ruff format --check clean
No em dashes in new content (matches the polish convention from PR #111, "Add three patterns from graduated agent shapes")
Run mkdocs serve locally and confirm the new sections render correctly under Model Providers → Self-hosted vLLM

Out of scope

Items 3-5 of the docs-polish sweep (deep docs sweep, README pointer-block thinning, version on homepage). Separate PRs.
Sweeping the 16 pre-existing em dashes elsewhere in vllm.md. That falls under item 3's broader docs sweep.

Extend docs/model-providers/vllm.md with cross-cutting gotchas surfaced in real production work.

The "Tool calling" section grows a --tool-call-parser family table (verified against vLLM's docs: Llama 3.x, Llama 4, Mistral, Hermes, Qwen3, DeepSeek V3, GPT-OSS) plus explicit not-supported callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no parser ships).

A new "Production deployment" H2 covers the three gotchas:

- VLLM_HTTP_TIMEOUT_KEEP_ALIVE: vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as ProviderUnavailable; widen to roughly 300s. Includes the reverse-proxy variant of the same rule.
- systemd unit skeleton: structural, no model-specific paths; uses EnvironmentFile so the unit ships across hosts.
- Throughput knobs (--max-model-len, --max-num-seqs, --gpu-memory-utilization) framed OA-side: when fan-out concurrency exceeds the cap, expect ProviderRateLimit; wrap the LLM-calling node in RetryMiddleware.

Docs-only; no code or test changes. CHANGELOG bullet added under [Unreleased] ### Added.

Copilot

Pull request overview

Docs-only PR extending docs/model-providers/vllm.md with a tool-call-parser family table (plus Gemma/Gemini/Qwen3-VL caveats) and a new "Production deployment" section covering keep-alive timeouts, a systemd unit skeleton, and throughput knobs.

Changes:

Replace single-model parser example with a model-family table and explicit caveats for proprietary/Gemma/Qwen3-VL cases.
Add a "Production deployment" H2 covering VLLM_HTTP_TIMEOUT_KEEP_ALIVE, a systemd unit skeleton, and --max-model-len/--max-num-seqs/--gpu-memory-utilization knobs.
CHANGELOG entry summarizing the additions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
docs/model-providers/vllm.md	Adds parser-family table, model-family caveats, and a production deployment section.
CHANGELOG.md	Adds Unreleased Added entry describing the new docs sections.

Copilot AI review requested due to automatic review settings June 1, 2026 19:50

Copilot started reviewing on behalf of chris-colinsky June 1, 2026 19:50

Copilot AI reviewed Jun 1, 2026

chris-colinsky merged commit c1b2f23 into main Jun 1, 2026
7 checks passed

chris-colinsky deleted the feature/docs-polish-vllm-production-notes branch June 1, 2026 19:52

chris-colinsky mentioned this pull request Jun 1, 2026

Add PyPI and spec version shields to homepage #115

Merged

5 tasks
