
Enable TORCH_NCCL_AVOID_RECORD_STREAMS=1 by default #512

Merged: 17 commits from fix-420 into Lightning-AI:main on Jun 5, 2024

Conversation

IvanYashchuk (Collaborator) commented on Jun 3, 2024:

This PR enables the magic environment variable by default for Thunder; the change is restricted to Thunder only.
This magic environment variable is supposed to fix a problem with allocator thrashing when using collectives from PyTorch's NCCL backend.

I have tested performance with the command provided by @parthmannan in #420

torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-13b-hf --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --bucketing_mode none --micro_batch_size 1 --global_batch_size 8

and this PR gives a ~2.11x performance improvement (1517 ms -> 716 ms).

Fixes #420.
Fixes #477.

cc @parthmannan

(Review thread on thunder/core/trace.py: outdated, resolved)
t-vi (Collaborator) commented on Jun 3, 2024:

Does this work, though? From the PyTorch source it seems that it needs to be set when the process group is created:

https://github.com/pytorch/pytorch/blob/f343f98710dfa7305a873f558086c595a3c3d3d4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L776

IvanYashchuk (Collaborator, Author) commented on Jun 3, 2024:

> Does this work, though? From the PyTorch source it seems that it needs to be set when the process group is created:
>
> https://github.com/pytorch/pytorch/blob/f343f98710dfa7305a873f558086c595a3c3d3d4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L776

Yes, getCvarBool parses the environment variable with std::getenv: https://github.com/pytorch/pytorch/blob/f343f98710dfa7305a873f558086c595a3c3d3d4/torch/csrc/distributed/c10d/Utils.hpp#L160
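
For a quick sanity check of what the C++ side will see, a minimal sketch (not part of this PR): Python's os.environ writes through to the same process environment that std::getenv reads, so the value can be inspected and set from Python.

import os

# Whatever value is set here (or inherited from the launching shell) is what
# std::getenv("TORCH_NCCL_AVOID_RECORD_STREAMS") returns inside
# ProcessGroupNCCL, because os.environ writes through to the C environment.
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
print(os.environ.get("TORCH_NCCL_AVOID_RECORD_STREAMS", "<unset>"))  # prints "1"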

lantiga (Collaborator) commented on Jun 3, 2024:

As a side note, in relation to pytorch/pytorch#76861 (comment):

It would be interesting to understand what happens if we automatically insert wait_stream at the end of a region (or the whole model iteration), instead of disabling record streams entirely.
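
For illustration, manual synchronization of that kind would look roughly like the sketch below, using the public torch.cuda stream API (the matmul is a hypothetical stand-in for a collective; Thunder does not model streams today, as noted in the reply that follows):

import torch

x = torch.randn(1024, 1024, device="cuda")
side_stream = torch.cuda.Stream()

# Make the side stream wait for the default stream that produced `x`.
side_stream.wait_stream(torch.cuda.current_stream())

# Run the work (a stand-in for an NCCL collective) on the side stream.
with torch.cuda.stream(side_stream):
    out = x @ x

# Instead of calling out.record_stream(side_stream), make the default stream
# wait for the side stream at the end of the region, so the caching allocator
# can reuse the memory without per-tensor stream tracking.
torch.cuda.current_stream().wait_stream(side_stream)
print(out.sum())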

IvanYashchuk (Collaborator, Author):

> As a side note, in relation to pytorch/pytorch#76861 (comment):
>
> It would be interesting to understand what happens if we automatically insert wait_stream at the end of a region (or the whole model iteration), instead of disabling record streams entirely.

Thunder doesn't know anything about CUDA streams today.

lantiga (Collaborator) commented on Jun 3, 2024:

> > It would be interesting to understand what happens if we automatically insert wait_stream at the end of a region (or the whole model iteration), instead of disabling record streams entirely.
>
> Thunder doesn't know anything about CUDA streams today.

Fair point

IvanYashchuk (Collaborator, Author):

Setting the environment variable from the command line or in thunder/__init__.py results in consistently better performance, suggesting that the current approach in this PR doesn't quite work:

TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-13b-hf --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --bucketing_mode none --micro_batch_size 1 --global_batch_size 8

Average iter time: 722.36 ms

IvanYashchuk (Collaborator, Author):

> Setting the environment variable from the command line or in thunder/__init__.py results in consistently better performance, suggesting that the current approach in this PR doesn't quite work:
>
> Average iter time: 722.36 ms

Maybe the backward call is not affected by this decorator because it runs on a separate thread, and setting the env variable in thunder/__init__.py works because then the side C++ thread of PyTorch's Autograd engine inherits the variable value.

IvanYashchuk (Collaborator, Author):

> Does this work, though? From the PyTorch source it seems that it needs to be set when the process group is created:
>
> https://github.com/pytorch/pytorch/blob/f343f98710dfa7305a873f558086c595a3c3d3d4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L776

Right, and this needs to be set at process group creation, which in typical scripts means before the init_process_group(backend="nccl") call.
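
In user scripts that ordering looks roughly like this (a minimal sketch, not the code from this PR):

import os
import torch.distributed as dist

# ProcessGroupNCCL reads the variable when the process group is constructed,
# so it has to be in the environment before init_process_group runs.
os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")

dist.init_process_group(backend="nccl")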

(Review thread on thunder/__init__.py: outdated, resolved)
IvanYashchuk requested a review from @crcrpar on June 3, 2024.
parthmannan (Collaborator):

This would help fix a lot of the performance issues we have been seeing with large models and large batch sizes, where memory thrashing occurs. Thanks for working on this, Ivan 🚀

t-vi enabled auto-merge (squash) on June 4, 2024.
t-vi (Collaborator) commented on Jun 4, 2024:

Seems like legit CI failures in distributed. This seems to be a common pattern

  File "/__w/1/s/thunder/core/module.py", line 142, in no_sync
    _sync_grads(self)
  File "/__w/1/s/thunder/distributed/__init__.py", line 142, in _sync_grads
    with tdist.distributed_c10d._coalescing_manager(group=process_group, async_ops=True) as cm:
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2079, in _coalescing_manager
    cm.append(work)  # type: ignore[possibly-undefined]
UnboundLocalError: local variable 'work' referenced before assignment

in these failures:

FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_torch_bucket_size_in_mb_0_dataset_size_1 - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_fsdp_with_no_sync_grad_accumulation_executor_torch_bucketing_block_zero2 - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_fsdp_with_no_sync_grad_accumulation_executor_nvfuser_bucketing_block_zero2 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_nvfuser_bucket_size_in_mb_25_dataset_size_2 - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_fsdp_with_no_sync_grad_accumulation_executor_nvfuser_bucketing_block_zero3 - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_nvfuser_bucket_size_in_mb_25_dataset_size_1 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_torch_bucket_size_in_mb_25_dataset_size_2 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_fsdp_with_no_sync_grad_accumulation_executor_torch_bucketing_block_zero3 - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_nvfuser_bucket_size_in_mb_0_dataset_size_1 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_nvfuser_bucket_size_in_mb_0_dataset_size_2 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_torch_bucket_size_in_mb_25_dataset_size_1 - RuntimeError: Process 1 exited with error code 10 and exception:
FAILED thunder/tests/distributed/test_ddp.py::CompileDDPTest::test_ddp_with_no_sync_grad_accumulation_executor_torch_bucket_size_in_mb_0_dataset_size_2 - RuntimeError: Process 0 exited with error code 10 and exception:
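
For context, the UnboundLocalError above is a plain Python failure mode: a name assigned only on one branch inside the coalescing context manager is referenced unconditionally on exit. A minimal reproduction of the pattern, independent of PyTorch (all names here are hypothetical):

from contextlib import contextmanager

@contextmanager
def coalescing(ops_queued: bool):
    # Mirrors the shape of _coalescing_manager: `work` is only assigned when
    # there is something to launch, but it is used unconditionally on exit.
    yield
    if ops_queued:
        work = "launched collective"  # stand-in for the real work handle
    print(work)  # UnboundLocalError when ops_queued is False

with coalescing(ops_queued=False):
    pass  # raises UnboundLocalError: local variable 'work' referenced before assignment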

IvanYashchuk (Collaborator, Author):

It's unfortunate that the error comes from the PyTorch source code and is not reproducible with my build. I'll fix it.

t-vi (Collaborator) left a comment:


Thank you @IvanYashchuk

t-vi merged commit 23da3c1 into Lightning-AI:main on Jun 5, 2024.
36 checks passed
lantiga (Collaborator) commented on Jun 5, 2024:

Awesome @IvanYashchuk!

IvanYashchuk deleted the fix-420 branch on June 5, 2024.