-
Notifications
You must be signed in to change notification settings - Fork 21.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reset dynamo cache before each test #126586
base: gh/shunting314/147/base
Are you sure you want to change the base?
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126586
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (5 Unrelated Failures)As of commit 9472bcf with merge base 5905207 (): FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
ghstack-source-id: 94cbf126e5c128577d388ecf24d362d8594c1e00 Pull Request resolved: #126586
# reset dynamo cache to avoid issues like | ||
# https://github.com/pytorch/pytorch/issues/125967#issuecomment-2118483919 | ||
# which depends on test order. | ||
torch._dynamo.reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely remember that this causes a perf hit on CI testing, so it couldn't be rolled out before. @zou3519 would know more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do need to do this (reset dynamo cache). I rolled out this change to all tests in the dynamo shard, but didn't do the tests in the inductor shard. I'm not sure about the performance impact on inductor tests, but it is hopefully not bad because we do not run that many inductor tests.
@shunting314 can you get the tests to pass in this PR so that we can see the before/after test timing numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea. 2 of the failures are due to previously fail tests now pass. I removed the file marking the test failure.
For the other failures, I'm not able to repro. E.g., for https://github.com/pytorch/pytorch/actions/runs/9135411752/job/25122927855 , I run
PYTORCH_TEST_WITH_INDUCTOR=1 python test/functorch/test_ops.py -k test_extremal_numerics_l1_loss_cpu
And it passes on my devserver..
Any suggestions for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might have to do with python version. Try to skip these for now. Also, the keep-going
label will make all tests run (after a re-push) -- it's possible you have more failures than just these two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the PR is clean now (the previous failed tests shows up as flaky and non-blocking in CI), I'll merge it.
Thanks for letting me know the 'keep-going' label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the PR is clean now (the previous failed tests shows up as flaky and non-blocking in CI), I'll merge it.
Yup go for it
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
hmm, the test failures blocks the merge even though CI is green before I try to merge. I'll skip those tests for now. |
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
Fix #125967 . The test actually fail for empty 4D or 5D tensors when checking for memory format. I'm not exactly sure what recent inductor change cause the failure, but it may be not that important to maintain strides for an empty tensor. (?) I just skip the check for empty tensor. Pull Request resolved: #126593 Approved by: https://github.com/ezyang ghstack dependencies: #126586
@pytorchbot revert -m "broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 https://hud.pytorch.org/pytorch/pytorch/commit/43f2f43eb3b6d8cbe8eb7f45acb50376092f1a16 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495" -c nosignal |
@shunting314 your PR has been successfully reverted. |
@clee2000 is this due to land race? The test is broken even without my PR stacks. If yes, anything else you want me to do before I reland? |
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
ghstack-source-id: c4c01ae7c17caafd1514e3ea659dd1af00bd84b4 Pull Request resolved: #126586
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
ghstack-source-id: 969d13a2e533bc9511ff23624fe0af9ce200b485 Pull Request resolved: #126586
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test [ghstack-poisoned]
ghstack-source-id: ecbcf9f6925d3f87129cef5d640588add6a77acc Pull Request resolved: #126586
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@pytorchbot drci |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@pytorchbot revert -m "Broke tons of tests, see https://hud.pytorch.org/pytorch/pytorch/commit/bd24991f461476036d6ba20fed92651c7e46ef7c " -c ignoredsignal |
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit bd24991. Reverted #126586 on behalf of https://github.com/malfet due to Broke tons of tests, see https://hud.pytorch.org/pytorch/pytorch/commit/bd24991f461476036d6ba20fed92651c7e46ef7c ([comment](#126586 (comment)))
@shunting314 your PR has been successfully reverted. |
In pytorch#125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests. This PR clear dynamo cache before each unit test so we get more deterministic result for unit test Pull Request resolved: pytorch#126586 Approved by: https://github.com/jansel
This reverts commit bd24991. Reverted pytorch#126586 on behalf of https://github.com/malfet due to Broke tons of tests, see https://hud.pytorch.org/pytorch/pytorch/commit/bd24991f461476036d6ba20fed92651c7e46ef7c ([comment](pytorch#126586 (comment)))
Stack from ghstack (oldest at bottom):
In #125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests.
This PR clear dynamo cache before each unit test so we get more deterministic result for unit test