
added mse_loss #218

Merged: 41 commits merged into Lightning-AI:main on Apr 25, 2024
Conversation

@k223kim (Contributor) commented Apr 18, 2024

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #174.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

YES!

@k223kim (Contributor, Author) commented Apr 22, 2024

Hi @mruberry!
Just wanted to follow up on the mse_loss issue and ask some specific questions about the implementation. Before anything else: without any DecorateInfo in mse_loss_opinfo, the current implementation produces the following errors:

FAILED thunder/tests/test_grad.py::test_vjp_correctness_mse_loss_torch_cpu_float64 - AssertionError: Scalars are not close!
FAILED thunder/tests/test_grad.py::test_phantom_grad_vs_torch_consistency_mse_loss_torch_cpu_bfloat16 - RuntimeError: "mse_cpu" not implemented for 'BFloat16'
FAILED thunder/tests/test_grad.py::test_phantom_grad_vs_torch_consistency_mse_loss_torch_cpu_float16 - RuntimeError: "mse_backward_cpu_out" not implemented for 'Half'
FAILED thunder/tests/test_grad.py::test_phantom_grad_vs_torch_consistency_mse_loss_torch_cpu_float32 - AssertionError: Tensor-likes are not close!
FAILED thunder/tests/test_grad.py::test_phantom_grad_vs_torch_consistency_mse_loss_torch_cpu_float64 - AssertionError: Tensor-likes are not close!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_mse_loss_torch_cpu_bfloat16 - RuntimeError: "mse_cpu" not implemented for 'BFloat16'
  1. I am not sure whether these errors are due to my implementation itself or to PyTorch not supporting something. For the "not implemented" errors, I am tempted to resolve them with the following DecorateInfo:
test_directives=(
    DecorateInfo(
        pytest.mark.skip,
        "test_core_vs_torch_consistency",
        dtypes=(datatypes.bfloat16,),
        devicetypes=(devices.DeviceType.CPU,),
    ),
    DecorateInfo(
        pytest.mark.skip,
        "test_phantom_grad_vs_torch_consistency",
        dtypes=(datatypes.bfloat16, datatypes.float16,),
        devicetypes=(devices.DeviceType.CPU,),
    ),
),

How do I know if PyTorch supports this or not?
2. Regarding the other assertion errors: since they occur even for float64, I am assuming this is because of my implementation. I suspect the implementation of _mse_loss_backward_impl in executors/torchex.py is incorrect in some way, but again, I am not sure :( Any suggestions or comments would be appreciated!

@mruberry mruberry requested a review from nikitaved April 22, 2024 19:40
Outdated review comments (resolved) on: thunder/clang/__init__.py, thunder/core/utils.py, thunder/tests/opinfos.py, thunder/torch/__init__.py
@mruberry (Collaborator) commented:

> Without any DecorateInfo in mse_loss_opinfo, the current implementation has the following errors: […] For the "not implemented" errors, I am tempted to resolve them with the following DecorateInfo: […] How do I know if PyTorch supports this or not?

These skips make a lot of sense. I wouldn't worry about supporting grad by adding mse_loss_backward and a custom grad formula transform in this PR. @IvanYashchuk may suggest otherwise, and he's our grad expert, though.
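
(A quick way to answer the "does PyTorch support this?" question is to call the eager op directly with the dtype in question. A minimal probe sketch, assuming a plain torch install: if eager PyTorch raises the same "not implemented" error, the skip is justified rather than a sign of a bug in this PR.)

import torch
import torch.nn.functional as F

# Probe which CPU dtypes eager PyTorch supports for mse_loss forward and backward.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    a = torch.randn(4, 4, dtype=dtype, requires_grad=True)
    b = torch.randn(4, 4, dtype=dtype)
    try:
        F.mse_loss(a, b).backward()  # forward + backward in eager mode
        print(dtype, "supported on CPU")
    except RuntimeError as err:
        print(dtype, "not supported on CPU:", err)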

> Regarding the other assertion errors: since they occur even for float64, I am assuming this is because of my implementation. […] I suspect the implementation of _mse_loss_backward_impl in executors/torchex.py is incorrect in some way.

This may be an interesting follow-up issue for a separate PR that targets optimizing this operation's grad formula, but I think the failure may disappear from this PR if it doesn't add the grad formula.
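
(For whoever picks up that follow-up PR: the reference gradient of mse_loss with reduction="mean" is just 2 * (a - b) / a.numel(), so a custom backward can be checked against eager autograd in a few lines. A hedged sketch, not the PR's actual _mse_loss_backward_impl:)

import torch
import torch.nn.functional as F

def mse_loss_backward_reference(grad_output, a, b, reduction="mean"):
    # d/da of (a - b)**2 is 2 * (a - b); "mean" additionally divides by the element count.
    grad = 2.0 * (a - b)
    if reduction == "mean":
        grad = grad / a.numel()
    return grad_output * grad

a = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
b = torch.randn(3, 5, dtype=torch.float64)
F.mse_loss(a, b).backward()
expected = mse_loss_backward_reference(torch.tensor(1.0, dtype=torch.float64), a.detach(), b)
assert torch.allclose(a.grad, expected)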

@mruberry (Collaborator) commented:

Cool PR, @k223kim!

Let's take this PR out of "draft"; I think it's ready for review!

I added some inline comments, in particular about better understanding what's happening with the broadcasting. I'm curious to hear more about that!

I suggest removing the custom grad formula from this PR and seeing if that simplifies the testing. A separate PR can optimize the gradient computation.

Outdated review comments (resolved) on: thunder/clang/__init__.py, thunder/core/utils.py, thunder/torch/__init__.py, thunder/tests/opinfos.py
@mruberry (Collaborator) left a review comment:


This looks really good, @k223kim. The remaining changes are mostly clean-up and should be simple to make. There is a test that's failing:

FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_mse_loss_nvfuser_cuda_float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 32 (3.1%)
Greatest absolute difference: 0.00048828125 at index (1, 9) (up to 1e-05 allowed)
Greatest relative difference: 0.0014553070068359375 at index (1, 9) (up to 0.001 allowed)

and that's OK; the deviation is pretty small. This PR just needs to update the OpInfo to account for it by increasing that test's tolerance. An example of how this is done can be found here:

Essentially, a decorator is specified in the OpInfo and it is automatically added to the generated test. In this case it looks like the decorator needs to set

 custom_comparator(partial(assert_close, atol=1e-3, rtol=1e-2)),

If that doesn't work we can increase the test tolerance even more. Let me know if you have any questions about this!

@k223kim (Contributor, Author) commented Apr 24, 2024

Hi @mruberry! Thanks for the detailed comments :) they're helping me a lot! I can add the following to opinfos.py:

DecorateInfo(
    custom_comparator(partial(assert_close, atol=1e-3, rtol=1e-2)),
    executors=("nvfuser",),
    dtypes=(datatypes.float16,),
)

However, my concern is that I am not able to reproduce the error you have shown above. Would this be an issue? (On my end, all tests are passing.)

@mruberry (Collaborator) commented:

> I can add the following DecorateInfo with custom_comparator to opinfos.py. […] However, my concern is that I am not able to reproduce the error you have shown above. Would this be an issue? (On my end, all tests are passing.)

I understand, and that's OK; we want to see the CI passing, and the CI machines and your local hardware may have differences that cause a small precision discrepancy.

@k223kim (Contributor, Author) commented Apr 24, 2024

Hey @mruberry!
One last question: regarding the custom grad formula for mse_loss, what would be the next steps? Should I submit a separate PR for that, or is there a better way to work on it?
(cc @IvanYashchuk)

@mruberry (Collaborator) commented Apr 24, 2024

> Regarding the custom grad formula for mse_loss, what would be the next steps? Should I submit a separate PR for that, or is there a better way to work on it?

Absolutely, a separate PR would be great. The reason to add a custom grad formula would be to improve performance, so it'd be nice if that PR showed the performance of mse_loss forward->backward before and after the change, so we can be sure it actually improves performance.

(Benchmarking with CUDA devices can be a little tricky, but basically you want to run the operations, sync the CUDA device with the CPU, and measure that time.)
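
(A minimal sketch of that pattern, assuming a CUDA device is available and timing the eager F.mse_loss forward->backward; a real benchmark would compare this against the thunder-compiled version before and after the change:)

import time
import torch
import torch.nn.functional as F

a = torch.randn(4096, 4096, device="cuda", requires_grad=True)
b = torch.randn(4096, 4096, device="cuda")

def fwd_bwd():
    F.mse_loss(a, b).backward()

for _ in range(10):          # warmup so one-time CUDA costs are excluded
    fwd_bwd()

torch.cuda.synchronize()     # make sure pending kernels finish before starting the clock
start = time.perf_counter()
for _ in range(100):
    fwd_bwd()
torch.cuda.synchronize()     # wait for the timed kernels before reading the clock
elapsed_ms = (time.perf_counter() - start) / 100 * 1e3
print(f"mse_loss forward+backward: {elapsed_ms:.3f} ms per iteration")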

@mruberry (Collaborator) left a review comment:


Cool! Nice work, @k223kim!

@mruberry enabled auto-merge (squash) on April 24, 2024, 20:39
@mruberry (Collaborator) commented:

@t-vi @Borda This may require some special push to merge. The test failure is flaky and unrelated.

@mruberry merged commit 4d9fa60 into Lightning-AI:main on Apr 25, 2024
37 of 39 checks passed
Successfully merging this pull request may close these issues: mse_loss (#174).

3 participants