Weight tying + FSDP = out of bounds #257

Open
carmocca opened this issue Mar 5, 2024 · 4 comments
Labels: bug (Something isn't working), distributed

carmocca (Contributor) commented Mar 5, 2024

🐛 Bug

To Reproduce

Code:

import os
import torch
import torch.distributed as tdist
import thunder
from thunder.tests.litgpt_model import GPT, Config

if __name__ == "__main__":
    tdist.init_process_group(backend="nccl")
    LOCAL_RANK = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", LOCAL_RANK)
    torch.set_default_device(device)

    config = Config(block_size=256, padded_vocab_size=32000, n_layer=1, n_head=3, head_size=24, n_embd=144, rotary_percentage=1.0, parallel_residual=False, bias=False, norm_class_name='RMSNorm', mlp_class_name='LLaMAMLP', intermediate_size=384)
    with device:
        model = GPT(config)

    model.transformer.wte.weight = model.lm_head.weight

    model = thunder.distributed.fsdp(model)
    model = thunder.jit(model, executors=["torch"])

    input_ids = torch.randint(1, 30010, (128, 256), dtype=torch.long, device=device)
    logits = model(input_ids)
    print(logits.shape)

Run with:

CUDA_LAUNCH_BLOCKING=1 torchrun --nproc-per-node 2 --local-ranks-filter 0 repro.py

Error:

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [313,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]:[E506 07:38:19.598156204 ProcessGroupNCCL.cpp:1432] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6f779017 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6f728cd3 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6fb791f8 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22126926 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d2212b2a8 in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1de (0x7f7d221322de in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d221341bc in /home/carlos/nightly-env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6eeb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7d70986ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7d70a18850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Removing any one of the following makes the problem disappear:

  • FSDP
  • a high enough input_ids value (30010 in the example)
  • weight tying
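
A minimal sketch of the failure mode (CPU-only and illustrative; this is not Thunder code): once the tied embedding weight is sharded along dim 0, input ids that are valid for the full vocabulary fall outside the local shard, which is exactly the `srcIndex < srcSelectDimSize` assert above.

import torch
import torch.nn.functional as F

full_vocab, n_embd, world_size = 32000, 144, 2
weight = torch.randn(full_vocab, n_embd)
local_shard = weight.chunk(world_size, dim=0)[0]  # rank 0 keeps rows [0, 16000)

ids = torch.tensor([[30010]])  # valid for the full table, not for the local shard
print(F.embedding(ids, weight).shape)  # torch.Size([1, 1, 144])
try:
    F.embedding(ids, local_shard)
except IndexError as e:
    # On CPU this raises immediately; on CUDA the same out-of-bounds condition
    # fires the device-side assert in Indexing.cu.
    print("out of bounds:", e)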

cc @carmocca @awaelchli @crcrpar

Borda transferred this issue from another repository on Apr 23, 2024
kevinstephano removed their assignment on May 2, 2024
kevinstephano (Collaborator) commented:

I don't think this is an nvFuser issue; the standalone nvFuser repro does not fail. I wonder if it was just the place where the CUDA error first got caught. On an H100, I am seeing a different error with NCCL.

W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778]
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] *****************************************
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0502 04:04:05.809000 140711848431488 torch/distributed/run.py:778] *****************************************
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/repro.py", line 13, in <module>
[rank0]:     config = Config(block_size=256, padded_vocab_size=32000, n_layer=6, n_head=6, head_size=48, n_embd=288, rotary_percentage=1.0, parallel_residual=False, bias=False, _norm_class='RMSNorm', _mlp_class='LLaMAMLP', intermediate_size=768)
[rank0]: TypeError: Config.__init__() got an unexpected keyword argument '_norm_class'
[rank0]:[W502 04:04:17.154035416 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
W0502 04:04:17.734000 140711848431488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1344 closing signal SIGTERM
E0502 04:04:18.298000 140711848431488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1343) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-02_04:04:17
  host      : viking-prod-229.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1343)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

carmocca (Contributor, Author) commented May 6, 2024

You are correct, Kevin. This is not an nvFuser issue. The code was also using some removed arguments; I updated the description.

carmocca changed the title from "Weight tying + FSDP = nvfuser internal error" to "Weight tying + FSDP = out of bounds" on May 6, 2024
carmocca added the bug label on May 6, 2024
kshitij12345 self-assigned this on May 23, 2024
kshitij12345 (Collaborator) commented May 27, 2024

There are two problems at play here:

  1. With jit(fsdp(model)), we incorrectly shard the shared parameter twice, which leads to the index-out-of-bounds error (the indexed dimension ends up smaller than expected). The patch below fixes this problem.
diff --git a/thunder/distributed/__init__.py b/thunder/distributed/__init__.py
index c9aa00a..5ae1554 100644
--- a/thunder/distributed/__init__.py
+++ b/thunder/distributed/__init__.py
@@ -13,6 +13,7 @@ from functools import partial
 
 import torch
 import torch.distributed as tdist
+from torch.utils.weak import WeakTensorKeyDictionary
 
 import thunder.core.utils as utils
 from thunder.core.proxies import DDPType
@@ -559,6 +560,9 @@ def _shard_params(
         local_rank = int(os.environ["LOCAL_RANK"])
         device = torch.device("cuda", local_rank)
 
+    # In case there is weight/param sharing, we don't want to shard the same param
+    # multiple times. We use `sharded_params` to keep track of already sharded param to avoid resharding it.
+    sharded_params = WeakTensorKeyDictionary()
     # We will definitely change the sharding logic in the future
     for module_name, submodule in module.named_modules():
         # Materialize meta-parameters on-device if necessary.
@@ -581,7 +585,10 @@ def _shard_params(
         # Note [FSDP Sharding]
         # All internal code will assume that the parameters are sharded on the first dimension
         for param_name, param in submodule.named_parameters(recurse=False, prefix=module_name):
+            if param in sharded_params:
+                continue
             _shard_param(param, global_rank, world_size, param_name, allow_padding_for_fsdp=allow_padding_for_fsdp)
+            sharded_params[param] = True
 
 
 def _shard_param(
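
As a standalone illustration of the guard in the patch (a toy module, not the Thunder API): WeakTensorKeyDictionary keys on tensor identity, so a parameter registered under two names is only visited once.

import torch
from torch.utils.weak import WeakTensorKeyDictionary

class TiedToy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = torch.nn.Embedding(8, 4)
        self.lm_head = torch.nn.Linear(4, 8, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying

model = TiedToy()
seen = WeakTensorKeyDictionary()
for name, param in model.named_parameters(remove_duplicate=False):
    if param in seen:
        print("skip (already sharded):", name)
        continue
    seen[param] = True
    print("shard:", name)
# prints "shard: wte.weight" followed by "skip (already sharded): lm_head.weight"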

NOTE: fsdp(jit(model)) works fine because it refers to the parameter from the original model, creates a shallow copy, and shards the shallow copy:

for pn, p in submodule.named_parameters(recurse=False, prefix=module_name):
    # if we don't have an override or it is just the original, do create a copy
    if thunder_model._overrides.get(pn, p) is p:
        thunder_model._overrides[pn] = copy.copy(p)
    # we collect shapes and devices because we do not know if other transforms also change it...
    old_shape = thunder_model._overrides[pn].shape
    _shard_param(thunder_model._overrides[pn], global_rank, world_size, pn, allow_padding_for_fsdp=True)
    new_shape = thunder_model._overrides[pn].shape
    sharded_params[pn] = (old_shape, new_shape, thunder_model._overrides[pn].device)

  2. With the fix above, the snippet in the repro works, but in the execution trace we end up creating two copies of the tensor:
  # idx: "cuda:0 i64[128, 256]"
  # tos1: "cuda:0 f32[256, 24]"
  # t_lm_head_weight: "cuda:0 f32[16000, 144]"
  p2 = torch_all_gather_prim_impl(t_lm_head_weight, _torch_distributed_distributed_c10d_ProcessGroup_0, True)  # p2: "FUTURE cuda:0 f32[32000, 144]"
  p20 = torch_all_gather_prim_impl(t_transformer_wte_weight, _torch_distributed_distributed_c10d_ProcessGroup_0, True)  # p20: "FUTURE cuda:0 f32[32000, 144]"

where torch_all_gather_prim_impl is the following snippet, which creates a new output tensor for each call:

def _all_gather_prim_impl(
    a: torch.Tensor,
    /,
    group: torch.distributed.ProcessGroup,
    do_async: Number,
) -> torch.Tensor | tuple[torch.distributed.distributed_c10d.Work, torch.Tensor]:
    out: torch.Tensor = torch.empty((group.size() * a.shape[0],) + a.shape[1:], dtype=a.dtype, device=a.device)
    do_async: bool = bool(do_async)
    handle: None | torch.distributed.distributed_c10d.Work = torch.distributed.all_gather_into_tensor(
        out, a, group, do_async
    )
    if do_async:
        return handle, out
    return out

To tackle 2, I think we need to add some notion of aliasing. This is related to in-place support (#145), which also has to consider aliasing.
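
One possible direction, sketched only (this is not Thunder's actual trace machinery): group the gathered parameters by storage so that aliased tensors map to a single all_gather output.

import torch

def group_by_storage(named_tensors: dict[str, torch.Tensor]) -> dict[int, list[str]]:
    """Map a storage pointer to every parameter name that aliases it."""
    groups: dict[int, list[str]] = {}
    for name, t in named_tensors.items():
        groups.setdefault(t.untyped_storage().data_ptr(), []).append(name)
    return groups

# With weight tying, transformer.wte.weight and lm_head.weight land in the same
# group, so a trace could emit one all_gather and reuse its result for both names.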

IvanYashchuk (Collaborator) commented:

Could you please submit your fix for 1? It's a perfect solution to this problem.

For 2, I think Thunder JIT could recognize these situations and pass just one tensor to the computational trace.
