[Bug]: Windows GPU OOM error #210

@RWL-Dittrich

Description

Steps to Reproduce

I'm trying to cluster two Mac minis and an RTX 4090 (24 GB) to run GPT-OSS 120B.
Somehow the RTX 4090 gets assigned more memory than it can hold.

I ran `parallax install` and `parallax check` to make sure my environment is up to date and correctly set up.

Expected Behavior

I expect the model to be sharded according to the amount of memory each node has.
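
In other words, layers should be split proportionally to each node's memory, so the 24 GB card gets a smaller shard than the 64 GB Mac minis. A minimal sketch of that expectation (not Parallax's actual scheduler; node names, memory sizes, and the 36-layer count are illustrative assumptions):

```python
def split_layers(total_layers: int, mem_gib: dict[str, int]) -> dict[str, int]:
    """Assign layers to nodes proportionally to memory (largest-remainder rounding)."""
    total_mem = sum(mem_gib.values())
    raw = {n: total_layers * m / total_mem for n, m in mem_gib.items()}
    alloc = {n: int(r) for n, r in raw.items()}
    leftover = total_layers - sum(alloc.values())
    # Hand the remaining layers to the nodes with the largest fractional share.
    for n in sorted(raw, key=lambda n: raw[n] - alloc[n], reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc

# Hypothetical cluster: 24 GB GPU plus two 64 GB Mac minis, 36 layers total.
print(split_layers(36, {"rtx4090": 24, "mac_mini_1": 64, "mac_mini_2": 64}))
# -> {'rtx4090': 6, 'mac_mini_1': 15, 'mac_mini_2': 15}
```

Under this scheme the 4090 would never be asked to hold 31 GB of weights, which is what the log below shows happening.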

Actual Behavior

Parallax crashes with the following stack trace:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Nov 10 09:03:46.891 [ERROR] launch.py:175             CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/root/parallax/src/parallax/launch.py", line 127, in <module>
    executor = Executor.create_from_args(args, gradient_server=gradient_server)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/server/executor.py", line 289, in create_from_args
    return cls(**create_executor_config(args, gradient_server))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/server/executor.py", line 116, in __init__
    self.model_runner, self.config, self.tokenizer = initialize_sgl_model_runner(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/sglang/model_runner.py", line 296, in initialize_sgl_model_runner
    model_runner = ParallaxModelRunner(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/parallax/sglang/model_runner.py", line 76, in __init__
    super().__init__(
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 312, in __init__
    self.initialize(min_per_gpu_memory)
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 384, in initialize
    self.load_model()
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 739, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 590, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 586, in __init__
    self.model = GptOssModel(
                 ^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 507, in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
                                                    ^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 560, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 562, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 509, in <lambda>
    lambda idx, prefix: decoder_layer_type(
                        ^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 416, in __init__
    self.mlp = GptOssSparseMoeBlock(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py", line 139, in __init__
    self.experts = experts_type(
                   ^^^^^^^^^^^^^
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 212, in __init__
    self.quant_method.create_weights(
  File "/root/parallax/venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py", line 313, in create_weights
    torch.zeros(
  File "/root/parallax/venv/lib/python3.12/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 31.04 GiB is allocated by PyTorch, and 126.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Nov 10 09:03:46.892 [INFO] server.py:723             Leave scheduler: 12D3KooWGG6rkFt9c33wriGrnDquef4bkhWSwZTzRfmBue2jCpMy
[rank0]:[W1110 09:03:53.902193591 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[INFO] Successfully joined the distributed inference cluster.
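
The log shows 31.04 GiB of PyTorch allocations attempted on a 23.99 GiB card, so this is an over-sized shard, not just fragmentation. A sketch of a pre-flight guard a worker might run before loading its shard; the `shard_fits` helper and the 90% headroom are my assumptions, not Parallax behaviour:

```python
import os

# The workaround the error message itself suggests; it eases fragmentation
# but cannot create memory that isn't there.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def shard_fits(shard_bytes: int, free_bytes: int, headroom: float = 0.9) -> bool:
    """True if the shard fits within `headroom` of the GPU's free memory.

    `free_bytes` would come from torch.cuda.mem_get_info() on a real node;
    it is a plain parameter here so the check is testable without a GPU.
    """
    return shard_bytes <= int(free_bytes * headroom)

gib = 1024**3
# 31 GiB of weights (as in the log) against a 24 GiB card:
print(shard_fits(31 * gib, 24 * gib))  # -> False
```

If the scheduler ran a check like this before committing a shard, the node could be skipped or given fewer layers instead of crashing mid-load.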

Version

7ff90da

Environment & Context

  • I'm using the latest version.
  • I have searched existing issues.

Labels

bug (Something isn't working)