Make wrapIndexOnce check async, avoid DtoH sync on index_put_ #125952

Closed
wants to merge 2 commits

Conversation

@ezyang ezyang (Contributor, Author) commented May 10, 2024

[ghstack-poisoned]

pytorch-bot bot commented May 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125952

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit bb0289b with merge base 96a5698:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request May 10, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 20ad93d07612a3326ecaebaa3a7b2a5b45c15f54
Pull Request resolved: #125952
@ezyang ezyang requested review from lezcano, ngimel and eqy May 10, 2024 20:06
@ezyang ezyang added the release notes: python_frontend and topic: bug fixes labels May 10, 2024
@lezcano lezcano (Collaborator) left a comment

We already do that within canDispatchToMaskedFill:

static std::tuple<bool, Tensor> canDispatchToMaskedFill(const Tensor& self, const torch::List<c10::optional<at::Tensor>>& indices,
    const Tensor& value) {
  if (!(value.numel() == 1 && value.device().is_cpu())) {
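For illustration, here is a minimal Python sketch (mine, not from the PR) of the two paths that check distinguishes; it assumes a CUDA device is available:

import torch

x = torch.zeros(8, device="cuda")
mask = torch.tensor([True, False] * 4, device="cuda")

# A single boolean mask plus a one-element CPU value can take the masked_fill
# fast path, which never needs to read device data back on the host.
x[mask] = 1.0

# With a CUDA value tensor the check quoted above rejects the fast path, and
# the general index_put_ path runs instead.
x.index_put_((mask,), torch.tensor(2.0, device="cuda"))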

@ezyang ezyang (Contributor, Author) commented May 10, 2024

OK, that's useful, because the internal user tested this and also said it did not fix it.

@ezyang ezyang (Contributor, Author) commented May 10, 2024

I'm out of time right now, but here is the repro:

import logging
from dataclasses import dataclass
from datetime import datetime

import torch
import torch._inductor.config as inductor_config


logger = logging.getLogger(__name__)
TIME_FORMAT_STR: str = "%b_%d_%H_%M_%S"


@dataclass
class BenchmarkConfig:
    batch_size: int = 256
    enable_bf16: bool = True
    enable_pt2: bool = True
    device = "cuda:0"
    d_in = 2048


class SimpleModel(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear1 = torch.nn.Linear(dim, dim, bias=False)
        self.ts_encoding_params_dict = torch.nn.Parameter(
            torch.empty(
                [
                    2000,
                    dim,
                ]
            ).uniform_(-0.01, 0.01)
        )
        self.linear2 = torch.nn.Linear(dim, dim, bias=False)

    def forward(
        self,
        x,
        num_object,
        user_event_ts_buckets,
    ):
        emb = self.linear1(x)
        # user_event_ts_encoding = self.ts_encoding_params_dict[user_event_ts_buckets, :]
        user_event_ts_encoding = self.ts_encoding_params_dict.index_select(
            0, user_event_ts_buckets
        )
        emb = emb + user_event_ts_encoding
        res = self.linear2(emb)
        return res


def create_model_input(benchmark_config: BenchmarkConfig):
    batch_size = benchmark_config.batch_size
    d_in = benchmark_config.d_in
    device = benchmark_config.device

    dtype = torch.bfloat16 if benchmark_config.enable_bf16 else torch.float32

    x = torch.rand(
        batch_size * 1000, d_in, dtype=dtype, device=device
    ).requires_grad_()  # assuming seq_len_per_example is max_length // 2
    num_object = torch.tensor(
        [1000] * batch_size,
        dtype=torch.int,
        device=device,
    )

    user_event_ts_buckets = torch.randint(
        0,
        2000,
        (1000 * batch_size,),
        dtype=torch.int,
        device=device,
    )

    return (
        x,
        num_object,
        user_event_ts_buckets,
    )


def run_first_model_once(model, input):
    pred = model(*input)
    pred[0].sum().backward()


def single_run_benchmark():
    benchmark_config = BenchmarkConfig()
    model_input = create_model_input(benchmark_config)
    model = SimpleModel(benchmark_config.d_in)

    if benchmark_config.enable_bf16:
        model = model.to(dtype=torch.bfloat16)

    if benchmark_config.enable_pt2:
        inductor_config.decompose_mem_bound_mm = True
        inductor_config.trace.enabled = True
        model = torch.compile(model)
        model = model.to(benchmark_config.device)
        print("Start compiling model.")
        run_first_model_once(model, model_input)
    else:
        model = model.to(benchmark_config.device)

    # trace
    with torch.profiler.profile(with_flops=True) as profiler:
        for _ in range(5):
            run_first_model_once(model, model_input)

    trace_file_prefix = "{}".format(
        datetime.now().strftime(TIME_FORMAT_STR),
    )

    return


def main() -> None:
    single_run_benchmark()
    print("done")


if __name__ == "__main__":
    main()  # pragma: no cover
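As a rough sketch of how the sync would show up (my addition, not part of the original repro; it reuses the model, model_input, and run_first_model_once defined by the script above), one can profile a single iteration and look for blocking device-to-host traffic:

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    run_first_model_once(model, model_input)

# A DtoH sync typically shows up as a Memcpy DtoH entry paired with a
# cudaStreamSynchronize / cudaMemcpy call in the profile.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))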

lezcano added a commit that referenced this pull request May 10, 2024
The previous fix was not general enough.

Fixes #125952

[ghstack-poisoned]
lezcano added a commit that referenced this pull request May 10, 2024
The previous fix was not general enough.

Fixes #125952

ghstack-source-id: d1a956c9c514d10ffa99b4589ea8db0c5b74b46d
Pull Request resolved: #125973
@lezcano lezcano (Collaborator) commented May 10, 2024

I put up a fix, but I was not able to test whether it works (my triton version is acting up with the repro). Mind checking if it fixes the issue?

@ezyang ezyang closed this May 11, 2024
@ezyang ezyang reopened this May 11, 2024
@ezyang ezyang (Contributor, Author) commented May 11, 2024

OK, I got a better stacktrace here:

#12 at::_ops::item::call(at::Tensor const&) from ??:0
#13 long at::Tensor::item<long>() const from ??:0
#14 at::native::computeLinearIndex(at::Tensor const&, c10::ArrayRef<at::Tensor>, bool) [clone .isra.0] from tmpxft_00061ce6_00000000-6_Indexing.cudafe1.cpp:0
#15 at::native::makeLinearIndex(at::Tensor, c10::IListRef<at::OptionalTensorRef>, bool) [clone .constprop.0] from tmpxft_00061ce6_00000000-6_Indexing.cudafe1.cpp:0
#16 at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<std::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) from ??:0
#17 at::native::_index_put_impl_(at::Tensor&, c10::List<std::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) from ??:0

@ezyang ezyang (Contributor, Author) commented May 11, 2024

It's this one:

static Tensor wrapIndexOnce(const Tensor & index, int64_t dim, int64_t dim_size, bool check_range=true) {
  // we don't need to check range in backward - if there were out of bounds indices forward should already have errored out
  if (index.numel() != 0 && check_range) {
    TORCH_INTERNAL_ASSERT(0);
    auto max_idx = index.max().item<int64_t>();
    auto min_idx = index.min().item<int64_t>();
    if (max_idx >= dim_size) {
      TORCH_CHECK_INDEX(false, "index ", max_idx, " is out of bounds for dimension ", dim, " with size ", dim_size);
    }
    if (min_idx < -dim_size) {
      TORCH_CHECK_INDEX(false, "index ", min_idx, " is out of bounds for dimension ", dim, " with size ", dim_size);
    }
  }
  return index.remainder(dim_size);
}

and the reason it doesn't sync in eager is because, dun dun dun, eager uses a special API to skip the range check.
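For context, here is a small Python sketch (my illustration in terms of the public torch API, not the PR's C++ change) of why the range check syncs and what an async check looks like:

import torch

idx = torch.randint(0, 2000, (1000,), device="cuda")

# index.max().item<int64_t>() in the snippet above reduces on the GPU and then
# copies the scalar back to the host, blocking the CPU until the GPU catches up.
host_max = idx.max().item()

# An asynchronous alternative keeps the check on the device: the comparison
# result stays on the GPU and a kernel-side assert fires only if it is false.
torch._assert_async(idx.max() < 2000)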

[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 11, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 0fe1f51c815d8d135172d7f6e15bdb22e5339d48
Pull Request resolved: #125952
@ezyang ezyang changed the title from "Do not use masked_fill if it would incur DtoH sync" to "Make wrapIndexOnce check async, avoid DtoH sync on index_put_" May 11, 2024
@ezyang ezyang added the ciflow/trunk label May 11, 2024
@ezyang ezyang (Contributor, Author) commented May 11, 2024

Updated with a fix that is actually confirmed to work.

@ezyang ezyang requested a review from lezcano May 12, 2024 12:47
@lezcano lezcano (Collaborator) left a comment

Sounds good if it fixes the problem. Could you add a regression test?

Comment on lines +338 to +339
at::_assert_async(index.max() < dim_size);
at::_assert_async(index.min() >= -dim_size);
Collaborator commented:

I didn't know we had these working in the end. How does compile interpret these?

Contributor Author commented:

This is the CUDA kernel, so compile doesn't really interact with it in a nontrivial way; you just end up hitting the assert at runtime.

@ezyang ezyang (Contributor, Author) commented May 13, 2024

I don't really see how to add a regression test. To check whether we're doing a DtoH sync, we need some way of detecting that such a sync has happened, but there's no facility for programmatically determining this.

@ezyang ezyang (Contributor, Author) commented May 13, 2024

@pytorchbot merge -i

@lezcano lezcano (Collaborator) commented May 13, 2024

I figured one way would be to try to CUDA-graph the relevant code and check that the capture succeeds.
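A rough sketch of that idea (my own illustration, not a test from this PR; the shapes and the accumulate flag are arbitrary). CUDA graph capture errors out if the captured region synchronizes with the host, so a successful capture implies index_put_ no longer triggers a DtoH sync:

import torch

dst = torch.zeros(100, device="cuda")
idx = torch.randint(0, 100, (32,), device="cuda")
src = torch.ones(32, device="cuda")

# Warm up on a side stream, as required before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    dst.index_put_((idx,), src, accumulate=True)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Capture raises an error if index_put_ performs a device-to-host sync.
    dst.index_put_((idx,), src, accumulate=True)
g.replay()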

@pytorchmergebot pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 3 checks: Lint / Test collect_env (with_torch), Lint / Test collect_env (without_torch), Lint / Test collect_env (older_python_version)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

tinglvv pushed a commit to tinglvv/pytorch that referenced this pull request May 14, 2024
Labels
ci-td-distributed, ciflow/trunk, Merged, release notes: python_frontend, topic: bug fixes