Weight tying + FSDP = out of bounds #257

Open
carmocca opened this issue Mar 5, 2024 · 4 comments
Labels: bug (Something isn't working), distributed

carmocca (Contributor) commented Mar 5, 2024

🐛 Bug

To Reproduce

Code:

import os
import torch
import torch.distributed as tdist
import thunder
from thunder.tests.litgpt_model import GPT, Config

if __name__ == "__main__":
    tdist.init_process_group(backend="nccl")
    LOCAL_RANK = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", LOCAL_RANK)
    torch.set_default_device(device)

    config = Config(block_size=256, padded_vocab_size=32000, n_layer=1, n_head=3, head_size=24, n_embd=144, rotary_percentage=1.0, parallel_residual=False, bias=False, norm_class_name='RMSNorm', mlp_class_name='LLaMAMLP', intermediate_size=384)
    with device:
        model = GPT(config)

    model.transformer.wte.weight = model.lm_head.weight

    model = thunder.distributed.fsdp(model)
    model = thunder.jit(model, executors=["torch"])

    input_ids = torch.randint(1, 30010, (128, 256), dtype=torch.long, device=device)
    logits = model(input_ids)
    print(logits.shape)

Run with:

CUDA_LAUNCH_BLOCKING=1 torchrun --nproc-per-node 2 --local-ranks-filter 0 repro.py

Error:

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]:[E506 07:38:19.598156204 ProcessGroupNCCL.cpp:1432] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6f779017 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6f728cd3 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6fb791f8 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22126926 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d2212b2a8 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1de (0x7f7d221322de in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d221341bc in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6eeb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7d70986ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7d70a18850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Removing any one of the following makes the problem disappear:

  • FSDP
  • a high enough input_ids value (30010 in the example)
  • weight tying
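
A minimal sketch of the failure mode (CPU-only and illustrative; this is not Thunder code): once the tied embedding weight is sharded along dim 0, input ids that are valid for the full vocabulary fall outside the local shard, which is exactly the `srcIndex < srcSelectDimSize` assert above.

import torch
import torch.nn.functional as F

full_vocab, n_embd, world_size = 32000, 144, 2
weight = torch.randn(full_vocab, n_embd)
local_shard = weight.chunk(world_size, dim=0)[0]  # rank 0 keeps rows [0, 16000)

ids = torch.tensor([[30010]])  # valid for the full table, not for the local shard
print(F.embedding(ids, weight).shape)  # torch.Size([1, 1, 144])
try:
    F.embedding(ids, local_shard)
except IndexError as e:
    # On CPU this raises immediately; on CUDA the same out-of-bounds condition
    # fires the device-side assert in Indexing.cu.
    print("out of bounds:", e)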

cc @carmocca @awaelchli @crcrpar

Borda transferred this issue from another repository on Apr 23, 2024
kevinstephano removed their assignment on May 2, 2024
kevinstephano (Collaborator) commented:

I don't think this is an nvFuser issue; the standalone nvFuser repro does not fail. I wonder if it was just the place where the CUDA error first got caught. On an H100, I am seeing a different error with NCCL.

W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778]
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] *****************************************
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] *****************************************
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/repro.py", line 13, in <module>
[rank0]:     config = Config(block_size=256, padded_vocab_size=32000, n_layer=6, n_head=6, head_size=48, n_embd=288, rotary_percentage=1.0, parallel_residual=False, bias=False, _norm_class='RMSNorm', _mlp_class='LLaMAMLP', intermediate_size=768)
[rank0]: TypeError: Config.__init__() got an unexpected keyword argument '_norm_class'
[rank0]:[W502 04:04:17.154035416 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
W0502 04:04:17.734000 140711848431488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1344 closing signal SIGTERM
E0502 04:04:18.298000 140711848431488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1343) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-02_04:04:17
  host      : viking-prod-229.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1343)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

carmocca (Contributor, Author) commented May 6, 2024

You are correct, Kevin. This is not an nvFuser issue. The code was also using some removed arguments; I updated the description.

carmocca changed the title from "Weight tying + FSDP = nvfuser internal error" to "Weight tying + FSDP = out of bounds" on May 6, 2024
carmocca added the bug label on May 6, 2024
kshitij12345 self-assigned this on May 23, 2024
kshitij12345 (Collaborator) commented May 27, 2024

There are two problems at play here:

  1. With jit(fsdp(model)), we incorrectly shard the shared parameter twice, which leads to the index-out-of-bounds error (the indexed dimension ends up smaller than expected). The patch below fixes this problem.
diff --git a/thunder/distributed/__init__.py b/thunder/distributed/__init__.py
index c9aa00a..5ae1554 100644
--- a/thunder/distributed/__init__.py
+++ b/thunder/distributed/__init__.py
@@ -13,6 +13,7 @@ from functools import partial
 
 import torch
 import torch.distributed as tdist
+from torch.utils.weak import WeakTensorKeyDictionary
 
 import thunder.core.utils as utils
 from thunder.core.proxies import DDPType
@@ -559,6 +560,9 @@ def _shard_params(
         local_rank = int(os.environ["LOCAL_RANK"])
         device = torch.device("cuda", local_rank)
 
+    # In case there is weight/param sharing, we don't want to shard the same param
+    # multiple times. We use `sharded_params` to keep track of already sharded param to avoid resharding it.
+    sharded_params = WeakTensorKeyDictionary()
     # We will definitely change the sharding logic in the future
     for module_name, submodule in module.named_modules():
         # Materialize meta-parameters on-device if necessary.
@@ -581,7 +585,10 @@ def _shard_params(
         # Note [FSDP Sharding]
         # All internal code will assume that the parameters are sharded on the first dimension
         for param_name, param in submodule.named_parameters(recurse=False, prefix=module_name):
+            if param in sharded_params:
+                continue
             _shard_param(param, global_rank, world_size, param_name, allow_padding_for_fsdp=allow_padding_for_fsdp)
+            sharded_params[param] = True
 
 
 def _shard_param(
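
As a standalone illustration of the guard in the patch (a toy module, not the Thunder API): WeakTensorKeyDictionary keys on tensor identity, so a parameter registered under two names is only visited once.

import torch
from torch.utils.weak import WeakTensorKeyDictionary

class TiedToy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = torch.nn.Embedding(8, 4)
        self.lm_head = torch.nn.Linear(4, 8, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying

model = TiedToy()
seen = WeakTensorKeyDictionary()
for name, param in model.named_parameters(remove_duplicate=False):
    if param in seen:
        print("skip (already sharded):", name)
        continue
    seen[param] = True
    print("shard:", name)
# prints "shard: wte.weight" followed by "skip (already sharded): lm_head.weight"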

NOTE: fsdp(jit(model)) works fine because it refers to the parameter from the original model, creates a shallow copy, and shards the shallow copy:

for pn, p in submodule.named_parameters(recurse=False, prefix=module_name):
    # if we don't have an override or it is just the original, do create a copy
    if thunder_model._overrides.get(pn, p) is p:
        thunder_model._overrides[pn] = copy.copy(p)
    # we collect shapes and devices because we do not know if other transforms also change it...
    old_shape = thunder_model._overrides[pn].shape
    _shard_param(thunder_model._overrides[pn], global_rank, world_size, pn, allow_padding_for_fsdp=True)
    new_shape = thunder_model._overrides[pn].shape
    sharded_params[pn] = (old_shape, new_shape, thunder_model._overrides[pn].device)

  2. With the fix above, the snippet in the repro works, but in the execution trace we end up creating two copies of the tensor:
  # idx: "cuda:0 i64[128, 256]"
  # tos1: "cuda:0 f32[256, 24]"
  # t_lm_head_weight: "cuda:0 f32[16000, 144]"
  p2 = torch_all_gather_prim_impl(t_lm_head_weight, _torch_distributed_distributed_c10d_ProcessGroup_0, True)  # p2: "FUTURE cuda:0 f32[32000, 144]"
  p20 = torch_all_gather_prim_impl(t_transformer_wte_weight, _torch_distributed_distributed_c10d_ProcessGroup_0, True)  # p20: "FUTURE cuda:0 f32[32000, 144]"

where torch_all_gather_prim_impl is the following snippet, which creates a new output tensor for each call:

def _all_gather_prim_impl(
    a: torch.Tensor,
    /,
    group: torch.distributed.ProcessGroup,
    do_async: Number,
) -> torch.Tensor | tuple[torch.distributed.distributed_c10d.Work, torch.Tensor]:
    out: torch.Tensor = torch.empty((group.size() * a.shape[0],) + a.shape[1:], dtype=a.dtype, device=a.device)
    do_async: bool = bool(do_async)
    handle: None | torch.distributed.distributed_c10d.Work = torch.distributed.all_gather_into_tensor(
        out, a, group, do_async
    )
    if do_async:
        return handle, out
    return out

To tackle 2, I think we need to add some notion of aliasing. This is related to in-place support (#145), which also has to consider aliasing.
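
One possible direction, sketched only (this is not Thunder's actual trace machinery): group the gathered parameters by storage so that aliased tensors map to a single all_gather output.

import torch

def group_by_storage(named_tensors: dict[str, torch.Tensor]) -> dict[int, list[str]]:
    """Map a storage pointer to every parameter name that aliases it."""
    groups: dict[int, list[str]] = {}
    for name, t in named_tensors.items():
        groups.setdefault(t.untyped_storage().data_ptr(), []).append(name)
    return groups

# With weight tying, transformer.wte.weight and lm_head.weight land in the same
# group, so a trace could emit one all_gather and reuse its result for both names.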

IvanYashchuk (Collaborator) commented:

Could you please submit your fix for 1? It's a perfect solution to this problem.

For 2, I think Thunder JIT could recognize these situations and pass just one tensor to the computational trace.
