Missing error handling for status in get_memory_limit_from_device_id #39

@lizhangmai

Hi, I noticed a potential issue in nvmath/internal/utils.py at line 230, in the function get_memory_limit_from_device_id.
Currently, the function looks like this:

def get_memory_limit_from_device_id(memory_limit: int | float | str, device_id: int) -> int:
    with device_ctx(device_id):
        status, _, total_memory = cbr.cudaMemGetInfo()
        return _get_memory_limit(memory_limit, total_memory)

The problem is that the returned status is never checked. In some cases (e.g., depending on the installed cuda-bindings version), I got status=35 with total_memory=None, and that None was then passed straight into _get_memory_limit. I worked around it by switching to a compatible cuda-bindings version, but it seems safer for the library to handle non-zero statuses explicitly.

I suggest adding error handling for status, for example:

def get_memory_limit_from_device_id(memory_limit: int | float | str, device_id: int) -> int:
    with device_ctx(device_id):
        status, _, total_memory = cbr.cudaMemGetInfo()
        if status != 0 or total_memory is None:
            raise RuntimeError(
                f"cudaMemGetInfo failed with status {status}, total_memory={total_memory}"
            )
        return _get_memory_limit(memory_limit, total_memory)

This way, users will get a clear exception instead of unexpected None values being passed to _get_memory_limit.
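
To show the check in isolation, without a GPU or cuda-bindings installed, here is a minimal sketch. The helper name check_mem_info and the sample tuples are hypothetical; they only stand in for the (status, free, total) result that cbr.cudaMemGetInfo() returned in my case, and are not part of nvmath.

```python
from typing import Optional, Tuple


def check_mem_info(result: Tuple[int, Optional[int], Optional[int]]) -> int:
    """Validate a (status, free, total) tuple and return total memory in bytes.

    Hypothetical helper for illustration only; not an nvmath API.
    """
    status, _free, total_memory = result
    # int() lets this work for plain ints and for IntEnum-style status codes.
    if int(status) != 0 or total_memory is None:
        raise RuntimeError(
            f"cudaMemGetInfo failed with status {status}, total_memory={total_memory}"
        )
    return total_memory


# Stand-in values (assumptions): a successful call and the failure I observed.
ok_result = (0, 4 * 1024**3, 8 * 1024**3)
bad_result = (35, None, None)

print(check_mem_info(ok_result))

try:
    check_mem_info(bad_result)
except RuntimeError as exc:
    print(exc)
```

With the check in place, the failing case stops at the call site with the status code in the message, rather than surfacing later inside _get_memory_limit.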
