
FSDP: _optimizer_has_flat_params only checks first parameter group #17817

Closed · schmidt-ai opened this issue Jun 12, 2023 · 1 comment · Fixed by #17914
Labels: bug (Something isn't working), strategy: fsdp (Fully Sharded Data Parallel), ver: 2.0.x
Milestone: 2.0.x
Comments

schmidt-ai (Contributor) commented Jun 12, 2023

Bug description

The function _optimizer_has_flat_params only checks the first parameter group for _fsdp_flattened parameters. There is an edge case where the first parameter group has no _fsdp_flattened parameters but subsequent groups do. It would be a small change to this function to check all groups in optimizer.param_groups:

def _optimizer_has_flat_params(optimizer: Optimizer) -> bool:
    _FSDP_FLATTENED = "_fsdp_flattened"
    if _TORCH_GREATER_EQUAL_1_13:
-        return any(getattr(param, _FSDP_FLATTENED, False) for param in optimizer.param_groups[0]["params"])
+        return any(getattr(param, _FSDP_FLATTENED, False) for group in optimizer.param_groups for param in group["params"])

    from torch.distributed.fsdp import FlatParameter
-    return any(isinstance(param, FlatParameter) for param in optimizer.param_groups[0]["params"])
+    return any(isinstance(param, FlatParameter) for group in optimizer.param_groups for param in group["params"])

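For illustration, here is a minimal sketch of the edge case (an assumption for demonstration, not a real FSDP setup: it manually sets the _fsdp_flattened attribute that FSDP applies to flattened parameters on torch >= 1.13). The first parameter group holds no flattened parameters while the second does, so the current check misses them:

import torch
from torch.optim import SGD

# Simulated edge case: first param group has no FSDP-flattened params, a later group does.
plain = torch.nn.Parameter(torch.zeros(4))
flat = torch.nn.Parameter(torch.zeros(4))
flat._fsdp_flattened = True  # attribute FSDP sets on flattened parameters (simulated here)

optimizer = SGD([{"params": [plain]}, {"params": [flat]}], lr=1e-2)

# Current check: only inspects the first group -> False, even though group 1 has a flat param
print(any(getattr(p, "_fsdp_flattened", False) for p in optimizer.param_groups[0]["params"]))

# Proposed check: inspects all groups -> True
print(any(getattr(p, "_fsdp_flattened", False) for g in optimizer.param_groups for p in g["params"]))
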
What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment
Python version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.178-162.673.amzn2.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 525.85.12
[pip3] numpy==1.24.3
[pip3] pytorch-lightning==2.0.3
[pip3] sagemaker-pytorch-training==2.8.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchdata==0.6.1
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2

More info

No response

cc @awaelchli @carmocca

schmidt-ai added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Jun 12, 2023
awaelchli (Member) commented Jun 12, 2023

Hi @schmidt-ai, thanks for reporting.
A PR for this would be very welcome. Are you open to contributing this fix?

awaelchli added the strategy: fsdp (Fully Sharded Data Parallel) label, removed the needs triage label, and added this to the 2.0.x milestone on Jun 12, 2023