[NVIDIA] fix: H100+H200 CoreWeave fix #308

Merged: cquil11 merged 4 commits into main from coreweave-fix on Dec 8, 2025

cquil11 (Collaborator) commented on Dec 8, 2025

After the upgrade to vLLM 0.11.2, the container appears to access /dev/shm/sagemaker_sessions (a path inside the container). For some reason, on the CoreWeave cluster (H100 + H200) the same filesystem is mounted on every run, so subsequent runs hit permission errors on that path. The workaround is to mount /dev/shm/sagemaker_sessions to a temporary directory on /mnt/vast.
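The workaround described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `make_session_mount`, the `SHM_MOUNT_BASE` environment variable, and the `SRC:DST` mount-spec format (as accepted by, for example, Docker's `-v` flag) are all assumptions. The idea is that each run gets a fresh host directory, so a later run never inherits files owned by an earlier one.

```python
import os
import tempfile

# Container-side path named in the PR description.
CONTAINER_PATH = "/dev/shm/sagemaker_sessions"


def make_session_mount(base_dir: str) -> str:
    """Create a fresh per-run directory under base_dir and return a SRC:DST mount spec.

    Hypothetical helper: the actual launch scripts in this PR may build
    the mount differently.
    """
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp creates a uniquely named directory readable and writable
    # only by the current user, so runs do not share state.
    host_dir = tempfile.mkdtemp(prefix="sagemaker_sessions_", dir=base_dir)
    return f"{host_dir}:{CONTAINER_PATH}"


if __name__ == "__main__":
    # SHM_MOUNT_BASE is an assumed variable name; on the cluster it would
    # point at a directory under /mnt/vast.
    base = os.environ.get("SHM_MOUNT_BASE", tempfile.gettempdir())
    print(make_session_mount(base))
```

The printed string can then be passed to whatever mount option the container launcher accepts. Using a unique directory per run trades some disk clutter for isolation; stale directories would need separate cleanup.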

cquil11 requested a review from a team as a code owner on Dec 8, 2025, 15:53
cquil11 merged commit 620f05c into main on Dec 8, 2025 (18 checks passed)
cquil11 deleted the coreweave-fix branch on Dec 8, 2025, 17:30
Oseltamivir pushed a commit that referenced this pull request Dec 9, 2025
* add shm path mount

* add fix to h200 cw

* change path

* change path back
cquil11 added the NVIDIA label on Apr 8, 2026
cquil11 changed the title from "fix: H100+H200 CoreWeave fix" to "[NVIDIA] fix: H100+H200 CoreWeave fix" on Apr 8, 2026