docs: add 8 new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration (#416) #426

Open
mesutoezdil wants to merge 3 commits into Project-HAMi:master from mesutoezdil:docs/faq-entries-416

@mesutoezdil mesutoezdil commented May 29, 2026

Adds 8 new FAQ entries to docs/faq/faq.md covering the three topic areas defined in the issue. All questions were sourced from the research compiled in #415.

New entries

GPU virtualization model

  • How does HAMi enforce GPU memory and compute limits? Explains the libvgpu.so CUDA API interception mechanism, what it covers, and what it does not (DinD, direct driver API calls). Links to GPU Virtualization.
  • HAMi vGPU vs NVIDIA MIG. Side-by-side comparison table covering hardware requirements, isolation mechanism, enforcement strength, granularity, and dynamic reconfiguration. Guidance on when to use each.
  • Why does nvidia-smi inside a container show less memory than the host? Explains that this is intentional: libvgpu.so intercepts memory query calls and returns the container's allocated limit rather than the physical total.
  • Why is my gpumem limit not enforced? Covers the four root causes: CUDA_DISABLE_CONTROL, Docker-in-Docker, direct NVML/driver API calls, and misconfigured container runtime.
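
The interception idea behind these four entries can be illustrated with a toy model. This is a conceptual sketch only, not HAMi's implementation: `HostGPU`, `InterceptedGPU`, and `OutOfMemory` are hypothetical names, and the real mechanism is a shared library (libvgpu.so) hooking CUDA calls, not Python. It shows why a query made through the interposed layer reports the allocated limit, and why a caller that reaches the driver directly is not constrained.

```python
from dataclasses import dataclass
from typing import Tuple


class OutOfMemory(Exception):
    """Raised when an allocation cannot be satisfied."""


@dataclass
class HostGPU:
    """Hypothetical stand-in for the real driver and physical device."""
    total_mib: int
    used_mib: int = 0

    def mem_info(self) -> Tuple[int, int]:
        # Returns (free, total) in MiB, like a driver memory query.
        return self.total_mib - self.used_mib, self.total_mib

    def alloc(self, size_mib: int) -> None:
        if size_mib > self.total_mib - self.used_mib:
            raise OutOfMemory("physical GPU memory exhausted")
        self.used_mib += size_mib


class InterceptedGPU:
    """Hypothetical interposer sitting between the application and the driver."""

    def __init__(self, host: HostGPU, limit_mib: int) -> None:
        self.host = host
        self.limit_mib = min(limit_mib, host.total_mib)
        self.container_used_mib = 0

    def mem_info(self) -> Tuple[int, int]:
        # Report the container's limit instead of the physical total.
        return self.limit_mib - self.container_used_mib, self.limit_mib

    def alloc(self, size_mib: int) -> None:
        # Reject the request before it ever reaches the driver.
        if self.container_used_mib + size_mib > self.limit_mib:
            raise OutOfMemory("container gpumem limit exceeded")
        self.host.alloc(size_mib)
        self.container_used_mib += size_mib
```

In this model, `InterceptedGPU(HostGPU(24576), 4096).mem_info()` reports a 4096 MiB device, while code holding a reference to the `HostGPU` object itself can still allocate past the limit. That second path is the analogue of the bypass cases listed above.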

Scheduling interaction

  • Does HAMi replace kube-scheduler or run alongside it? Explains the extender model, the MutatingWebhook schedulerName assignment, and the impact on non-HAMi pods (none). Includes a note on multi-replica leader election.
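
The webhook behaviour described in this entry can be sketched as a pure function over a pod manifest. This is an illustration under stated assumptions, not HAMi's webhook code: the resource names in `HAMI_RESOURCES` and the scheduler name in `HAMI_SCHEDULER` are assumed defaults and may differ in a given installation, and `requests_hami_resource` and `mutate` are hypothetical helpers.

```python
import copy

# Assumed defaults; a real installation may configure different names.
HAMI_RESOURCES = {"nvidia.com/gpu", "nvidia.com/gpumem", "nvidia.com/gpucores"}
HAMI_SCHEDULER = "hami-scheduler"


def requests_hami_resource(pod: dict) -> bool:
    """Return True if any container sets a limit on a HAMi-managed resource."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if HAMI_RESOURCES.intersection(limits):
            return True
    return False


def mutate(pod: dict) -> dict:
    """Route GPU pods to the HAMi scheduler; leave every other pod untouched."""
    if not requests_hami_resource(pod):
        return pod
    patched = copy.deepcopy(pod)  # do not modify the caller's manifest
    patched["spec"]["schedulerName"] = HAMI_SCHEDULER
    return patched
```

A pod with no HAMi resource limits comes back unchanged and is scheduled by the default kube-scheduler, which is the "no impact on non-HAMi pods" point made above.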

Ecosystem integration

  • HAMi with vLLM multi-GPU tensor parallelism. Documents the NCCL segfault issue (CUDA_DEVICE_MEMORY_SHARED_CACHE per-container, fixed in v2.7.0), single-GPU usage, and Volcano multi-pod setup. Links to issues #1764 and #1853.
  • HAMi with NVIDIA GPU Operator and DCGM. Explains the device plugin conflict and how to disable GPU Operator's device plugin. Notes that DCGM Exporter is unaffected.
  • Prometheus and Grafana monitoring. Covers the metrics endpoint, key metric names, scrape config, and importing the bundled static/grafana/gpu-dashboard.json dashboard.
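
As a rough illustration of what a scrape of such a metrics endpoint yields, the sketch below parses a small sample in the Prometheus text exposition format and derives a usage ratio. The metric names in `SAMPLE` are hypothetical placeholders, not HAMi's actual metric names, and the parser assumes lines carry no trailing timestamp.

```python
from typing import Dict

# Hypothetical sample payload; real metric names and labels will differ.
SAMPLE = """\
# HELP example_vgpu_memory_used_bytes Hypothetical per-container usage.
# TYPE example_vgpu_memory_used_bytes gauge
example_vgpu_memory_used_bytes{pod="a",gpu="0"} 1073741824
example_vgpu_memory_limit_bytes{pod="a",gpu="0"} 4294967296
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comment lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes no timestamp: the value is the last space-separated field.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def usage_ratio(samples: Dict[str, float], labels: str) -> float:
    """Fraction of the allocated limit in use for one label set."""
    used = samples["example_vgpu_memory_used_bytes" + labels]
    limit = samples["example_vgpu_memory_limit_bytes" + labels]
    return used / limit
```

For the sample above, `usage_ratio(parse_samples(SAMPLE), '{pod="a",gpu="0"}')` evaluates to 0.25. A Grafana panel would normally compute the same ratio with a query over the scraped series rather than in application code.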

Closes #416.
Refs #415.

…-HAMi#416)

Adds entries covering the three topic areas defined in the issue:

GPU virtualization model:
- How HAMi enforces limits via libvgpu.so CUDA interception
- HAMi vGPU vs NVIDIA MIG comparison and decision guide
- Why nvidia-smi shows less memory inside container than on host
- Why gpumem limits are not enforced (CUDA_DISABLE_CONTROL, DinD,
  direct driver API calls, misconfigured container runtime)

Scheduling interaction:
- Whether HAMi replaces or extends kube-scheduler (extender model)

Ecosystem integration:
- HAMi with vLLM multi-GPU tensor parallelism (tp>1 NCCL fix in v2.7)
- HAMi with NVIDIA GPU Operator and DCGM metrics
- Prometheus and Grafana monitoring setup with bundled dashboard JSON

Each entry follows the existing FAQ format: direct answer in the first
sentence, supporting detail, links to relevant doc pages. All internal
links use the correct ./path format for the faq/faq.md URL depth.

Sourced from issue Project-HAMi#415 research output.
Closes Project-HAMi#416.

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

hami-robot Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mesutoezdil
Once this PR has been reviewed and has the lgtm label, please assign windsonsea for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify Bot commented May 29, 2026

Deploy Preview for project-hami ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 6f27399 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/project-hami/deploys/6a1a0394c4363b00086d81f0 |
| 😎 Deploy Preview | https://deploy-preview-426--project-hami.netlify.app |

hami-robot Bot added the size/L label May 29, 2026
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Replace incorrect LD_PRELOAD claim with accurate /etc/ld.so.preload
hostPath mount mechanism, matching docs/core-concepts/gpu-virtualization.md.

Update vLLM tensor parallelism section: full support for vLLM > 0.18
landed in v2.9.0 (CHANGELOG), not v2.7.0 as previously stated.

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Development

Successfully merging this pull request may close these issues.

[docs/faq] Write new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration
