[Bug] RegisteredMemory not properly destroyed #533

Open
@liangyuRain

Description

Hi, the following code causes a GPU out-of-memory (OOM) error on Hopper with NVLS enabled. I am using the latest main branch.

from mscclpp import Transport, TcpBootstrap, Communicator
from mscclpp._mscclpp import Context, RawGpuBuffer
import cupy as cp
cp.cuda.Device(0).use()
bootstrap = TcpBootstrap.create(0, 1)
bootstrap.initialize(bootstrap.create_unique_id(), 60)
comm = Communicator(bootstrap)
for i in range(100):
    if i % 10 == 0:
        print(f"{i=}", flush=True)
    mem = RawGpuBuffer(2 ** 30)
    reg = comm.register_memory(mem.data(), mem.bytes(), Transport.CudaIpc)
    del reg, mem

Output:

i=0
i=10
i=20
i=30
i=40
i=50
i=60
i=70
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
mscclpp._mscclpp.CuError: (2, 'Call to result failed./.../mscclpp/src/gpu_utils.cc:128 (Cu failure: out of memory)')
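
A quick sanity check on the numbers in this output, as a sketch rather than a diagnosis: the traceback points at line 4 of the loop, which is the RawGpuBuffer allocation, and the last progress line printed is i=70. If no buffer were ever released, the number of 1 GiB buffers still resident at the failure would be between 70 and 79. The 80 GB device capacity used below is an assumption for illustration; the report only says "hopper".

```python
GIB = 2 ** 30
buffer_bytes = 2 ** 30  # size passed to RawGpuBuffer in the loop above

# "i=70" was printed and "i=80" was not, so the failing iteration index
# lies in 70..79. If nothing is released, iteration i starts with i
# buffers still resident on the device.
min_resident = 70 * buffer_bytes
max_resident = 79 * buffer_bytes
print(f"resident at failure: {min_resident // GIB} to {max_resident // GIB} GiB")

# ASSUMPTION (not stated in the report): a device with 80 GB = 80e9 bytes.
assumed_capacity = 80 * 10 ** 9
print(f"assumed capacity: {assumed_capacity / GIB:.1f} GiB")

# Under that assumption the capacity falls inside the 70..79 GiB window,
# which is consistent with every registered buffer staying allocated.
consistent = min_resident <= assumed_capacity <= max_resident
print(f"consistent with a full leak: {consistent}")
```

If the device has a different capacity, only the last comparison changes; the 70 to 79 GiB window follows from the output alone.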

The loop completes without error if the memory is not registered (that is, if the comm.register_memory call is removed). Could you please check if it can be reproduced on your side?
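
To compare the registered and unregistered cases with a number instead of waiting for the OOM, one option is to measure how much free device memory is lost per iteration. The helper below is a minimal sketch: leaked_bytes_per_iteration and FakeDevice are hypothetical names introduced here, and FakeDevice is only a stand-in so the example runs without a GPU.

```python
import gc


def leaked_bytes_per_iteration(step, get_free_bytes, iterations=10):
    """Run step() repeatedly and return the average drop in free memory per call."""
    gc.collect()  # collect any unreachable Python objects before measuring
    before = get_free_bytes()
    for _ in range(iterations):
        step()
    gc.collect()
    after = get_free_bytes()
    return (before - after) / iterations


class FakeDevice:
    """Hypothetical stand-in for a GPU; it only tracks a free-byte counter."""

    def __init__(self, capacity):
        self.free = capacity

    def leaky_step(self):
        self.free -= 2 ** 30  # allocate 1 GiB and never release it

    def clean_step(self):
        self.free -= 2 ** 30  # allocate 1 GiB ...
        self.free += 2 ** 30  # ... and release it again


dev = FakeDevice(80 * 2 ** 30)
leaky = leaked_bytes_per_iteration(dev.leaky_step, lambda: dev.free)
clean = leaked_bytes_per_iteration(dev.clean_step, lambda: dev.free)
print(f"leaky: {leaky / 2 ** 30:.1f} GiB/iter, clean: {clean / 2 ** 30:.1f} GiB/iter")
```

On a real device, get_free_bytes could be lambda: cp.cuda.runtime.memGetInfo()[0], which returns the free bytes reported by the CUDA runtime, and step would be the loop body from the snippet above, once with and once without the register_memory call. A result near 2 ** 30 bytes per iteration in the registered case only would point at the registration path.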

Activity

Binyang2014 commented on May 25, 2025

Contributor

We can reproduce this issue. Let me figure out the reason.

linked a pull request that will close this issue on May 25, 2025
        [Bug] RegisteredMemory not properly destroyed · Issue #533 · microsoft/mscclpp