[Bug]: Windows GPU OOM error #210

@RWL-Dittrich

Description

Steps to Reproduce

I'm trying to cluster two Mac minis and an RTX 4090 (24 GB) to run GPT-OSS 120B.
Somehow the RTX 4090 gets assigned more memory than it can hold.

I ran `parallax install` and `parallax check` to make sure my environment is up to date and correctly set up.

Expected Behavior

I expect the model to be sharded according to the amount of memory each node has.
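
In other words, layers should be split proportionally to each node's memory, so the 24 GB card gets a smaller shard than the 64 GB Mac minis. A minimal sketch of that expectation (not Parallax's actual scheduler; node names, memory sizes, and the 36-layer count are illustrative assumptions):

```python
def split_layers(total_layers: int, mem_gib: dict[str, int]) -> dict[str, int]:
    """Assign layers to nodes proportionally to memory (largest-remainder rounding)."""
    total_mem = sum(mem_gib.values())
    raw = {n: total_layers * m / total_mem for n, m in mem_gib.items()}
    alloc = {n: int(r) for n, r in raw.items()}
    leftover = total_layers - sum(alloc.values())
    # Hand the remaining layers to the nodes with the largest fractional share.
    for n in sorted(raw, key=lambda n: raw[n] - alloc[n], reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc

# Hypothetical cluster: 24 GB GPU plus two 64 GB Mac minis, 36 layers total.
print(split_layers(36, {"rtx4090": 24, "mac_mini_1": 64, "mac_mini_2": 64}))
# -> {'rtx4090': 6, 'mac_mini_1': 15, 'mac_mini_2': 15}
```

Under this scheme the 4090 would never be asked to hold 31 GB of weights, which is what the log below shows happening.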

Actual Behavior

Parallax crashes with the following stack trace:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Nov 10 09:03:46.891 [ERROR] launch.py:175             CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/root/parallax/src/parallax/launch.py", line 127, in <module>
    executor = Executor.create_from_args(args, gradient_server=gradient_server)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/server/executor.py", line 289, in create_from_args
    return cls(**create_executor_config(args, gradient_server))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/server/executor.py", line 116, in __init__
    self.model_runner, self.config, self.tokenizer = initialize_sgl_model_runner(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/sglang/model_runner.py", line 296, in initialize_sgl_model_runner
    model_runner = ParallaxModelRunner(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/sglang/model_runner.py", line 76, in __init__
    super().__init__(
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 312, in __init__
    self.initialize(min_per_gpu_memory)
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 384, in initialize
    self.load_model()
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 739, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 590, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 586, in __init__
    self.model = GptOssModel(
                 ^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 507, in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
                                                    ^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 560, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 562, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 509, in <lambda>
    lambda idx, prefix: decoder_layer_type(
                        ^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 416, in __init__
    self.mlp = GptOssSparseMoeBlock(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 139, in __init__
    self.experts = experts_type(
                   ^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 212, in __init__
    self.quant_method.create_weights(
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py", line 313, in create_weights
    torch.zeros(
  File "/root/parallax/venv/lib/python3.12/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Nov 10 09:03:46.892 [INFO] server.py:723             Leave scheduler: 12D3KooWGG6rkFt9c33wriGrnDquef4bkhWSwZTzRfmBue2jCpMy
[rank0]:[W1110 09:03:53.902193591 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[INFO] Successfully joined the distributed inference cluster.
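
The log shows 31.04 GiB of PyTorch allocations attempted on a 23.99 GiB card, so this is an over-sized shard, not just fragmentation. A sketch of a pre-flight guard a worker might run before loading its shard; the `shard_fits` helper and the 90% headroom are my assumptions, not Parallax behaviour:

```python
import os

# The workaround the error message itself suggests; it eases fragmentation
# but cannot create memory that isn't there.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def shard_fits(shard_bytes: int, free_bytes: int, headroom: float = 0.9) -> bool:
    """True if the shard fits within `headroom` of the GPU's free memory.

    `free_bytes` would come from torch.cuda.mem_get_info() on a real node;
    it is a plain parameter here so the check is testable without a GPU.
    """
    return shard_bytes <= int(free_bytes * headroom)

gib = 1024**3
# 31 GiB of weights (as in the log) against a 24 GiB card:
print(shard_fits(31 * gib, 24 * gib))  # -> False
```

If the scheduler ran a check like this before committing a shard, the node could be skipped or given fewer layers instead of crashing mid-load.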

Version

7ff90da

Environment & Context

  • I'm using the latest version.
  • I have searched existing issues.

Labels

bug (Something isn't working)