docs: add 8 new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration (#416) by mesutoezdil · Pull Request #426 · Project-HAMi/website

mesutoezdil · 2026-05-29T21:02:47Z

Adds 8 new FAQ entries to docs/faq/faq.md covering the three topic areas defined in the issue. All questions were sourced from the research compiled in #415.

New entries

GPU virtualization model

How does HAMi enforce GPU memory and compute limits? Explains the libvgpu.so CUDA API interception mechanism, what it covers, and what it does not (DinD, direct driver API calls). Links to GPU Virtualization.
HAMi vGPU vs NVIDIA MIG. Side-by-side comparison table covering hardware requirements, isolation mechanism, enforcement strength, granularity, and dynamic reconfiguration. Guidance on when to use each.
Why does nvidia-smi inside a container show less memory than the host? Explains that this is intentional - libvgpu.so intercepts memory query calls and returns the allocated limit.
Why is my gpumem limit not enforced? Covers the four root causes: CUDA_DISABLE_CONTROL, Docker-in-Docker, direct NVML/driver API calls, and misconfigured container runtime.
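
To make the interception idea in the entries above concrete, here is a minimal toy model in Python. It is not HAMi code and does not reflect how libvgpu.so is implemented; `HostGPU`, `LimitedView`, and `OutOfMemory` are hypothetical names invented for illustration. It only shows the general pattern: a layer that sits in front of the device counts allocations against a limit, answers memory queries with that limit, and cannot see calls that bypass it.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


class OutOfMemory(Exception):
    """Raised when an allocation would exceed the available budget."""


@dataclass
class HostGPU:
    """Stand-in for the physical device: tracks total and used bytes only."""
    total: int
    used: int = 0

    def alloc(self, size: int) -> None:
        if self.used + size > self.total:
            raise OutOfMemory("host GPU exhausted")
        self.used += size

    def mem_info(self) -> tuple[int, int]:
        # Returns (free, total) as seen on the host.
        return self.total - self.used, self.total


class LimitedView:
    """Hypothetical interception layer between an application and the GPU."""

    def __init__(self, gpu: HostGPU, limit: int) -> None:
        self.gpu = gpu
        self.limit = limit
        self.used = 0  # bytes allocated through this layer only

    def alloc(self, size: int) -> None:
        # Check the per-container limit before forwarding the call.
        if self.used + size > self.limit:
            raise OutOfMemory("container limit exceeded")
        self.gpu.alloc(size)
        self.used += size

    def mem_info(self) -> tuple[int, int]:
        # Report the allocated limit instead of the host total.
        return self.limit - self.used, self.limit
```

In this sketch, an allocation made through `LimitedView.alloc` is counted and capped, while a call made directly on `HostGPU.alloc` never touches the counter. That is the toy analogue of the bypass cases the FAQ entry lists: code paths that do not go through the intercepting layer are not subject to its limit.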

Scheduling interaction

Does HAMi replace kube-scheduler or run alongside it? Explains the extender model, the MutatingWebhook schedulerName assignment, and the impact on non-HAMi pods (none). Includes a note on multi-replica leader election.
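
The schedulerName assignment described above can be sketched as a small Python function that builds a JSON Patch for a pod. This is an illustrative sketch, not HAMi's webhook code: the scheduler name `hami-scheduler` and the resource names in `HAMI_RESOURCES` are assumptions made for the example and are not taken from this PR, and `requests_hami_resource` and `scheduler_patch` are hypothetical helper names.

```python
# Assumed values for illustration only; check your deployment's actual names.
HAMI_SCHEDULER = "hami-scheduler"
HAMI_RESOURCES = {"nvidia.com/gpu", "nvidia.com/gpumem", "nvidia.com/gpucores"}


def requests_hami_resource(pod: dict) -> bool:
    """Return True if any container sets a limit on an assumed HAMi resource."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if limits.keys() & HAMI_RESOURCES:
            return True
    return False


def scheduler_patch(pod: dict) -> list[dict]:
    """Build a JSON Patch (RFC 6902) that sets spec.schedulerName.

    Pods that request none of the assumed resources get an empty patch,
    so they stay with whatever scheduler they already use.
    """
    if not requests_hami_resource(pod):
        return []
    return [{"op": "add", "path": "/spec/schedulerName", "value": HAMI_SCHEDULER}]
```

A pod whose container limits include `nvidia.com/gpumem` would receive a one-operation patch, while a pod with only CPU and memory limits would receive an empty list, which mirrors the "no impact on non-HAMi pods" point in the entry.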

Ecosystem integration

HAMi with vLLM multi-GPU tensor parallelism. Documents the NCCL segfault issue (CUDA_DEVICE_MEMORY_SHARED_CACHE per-container, fixed in v2.7.0), single-GPU usage, and Volcano multi-pod setup. Links to issues #1764 and #1853.
HAMi with NVIDIA GPU Operator and DCGM. Explains the device plugin conflict and how to disable GPU Operator's device plugin. Notes that DCGM Exporter is unaffected.
Prometheus and Grafana monitoring. Covers the metrics endpoint, key metric names, scrape config, and importing the bundled static/grafana/gpu-dashboard.json dashboard.

Closes #416.
Refs #415.

…-HAMi#416)

Adds entries covering the three topic areas defined in the issue:

GPU virtualization model:
- How HAMi enforces limits via libvgpu.so CUDA interception
- HAMi vGPU vs NVIDIA MIG comparison and decision guide
- Why nvidia-smi shows less memory inside container than on host
- Why gpumem limits are not enforced (CUDA_DISABLE_CONTROL, DinD, direct driver API calls, misconfigured container runtime)

Scheduling interaction:
- Whether HAMi replaces or extends kube-scheduler (extender model)

Ecosystem integration:
- HAMi with vLLM multi-GPU tensor parallelism (tp>1 NCCL fix in v2.7)
- HAMi with NVIDIA GPU Operator and DCGM metrics
- Prometheus and Grafana monitoring setup with bundled dashboard JSON

Each entry follows the existing FAQ format: direct answer in the first sentence, supporting detail, links to relevant doc pages. All internal links use the correct ./path format for the faq/faq.md URL depth.

Sourced from issue Project-HAMi#415 research output.

Closes Project-HAMi#416.

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

hami-robot · 2026-05-29T21:02:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mesutoezdil
Once this PR has been reviewed and has the lgtm label, please assign windsonsea for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2026-05-29T21:02:53Z

✅ Deploy Preview for project-hami ready!

Name	Link
🔨 Latest commit	`6f27399`
🔍 Latest deploy log	https://app.netlify.com/projects/project-hami/deploys/6a1a0394c4363b00086d81f0
😎 Deploy Preview	https://deploy-preview-426--project-hami.netlify.app


Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

Replace incorrect LD_PRELOAD claim with accurate /etc/ld.so.preload hostPath mount mechanism, matching docs/core-concepts/gpu-virtualization.md.

Update vLLM tensor parallelism section: full support for vLLM > 0.18 landed in v2.9.0 (CHANGELOG), not v2.7.0 as previously stated.

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

hami-robot Bot added the dco-signoff: yes label May 29, 2026

hami-robot Bot requested review from archlitchi and wawa0210 May 29, 2026 21:02

hami-robot Bot added the size/L label May 29, 2026

mesutoezdil added 2 commits May 29, 2026 23:18

fix: remove unverified scheduler.replicas Helm value reference in FAQ

472952b

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

mesutoezdil wants to merge 3 commits into Project-HAMi:master from mesutoezdil:docs/faq-entries-416
