
Fix the gradient_clip_algorithm has no effect issue. #6928

Merged
merged 8 commits into Lightning-AI:master on Apr 14, 2021

Conversation

ceshine
Contributor

@ceshine ceshine commented Apr 9, 2021

What does this PR do?

It contains some necessary changes to make gradient_clip_algorithm actually work (fixes #6920).

Also, I added a temporary workaround for #6807 to make the test cases work. (I can remove it so that this PR does only one thing.) EDIT: removed this workaround for now, since it surfaced four errors that are unrelated to this PR.

Notes:

  1. TPUAccelerator.clip_gradients does not implement clipping by value. Passing gradient_clip_algorithm="value" should raise an exception.
  2. Updated the test cases test_gradient_clipping_by_value and test_gradient_clipping_by_value_fp16. They now clip the gradients to a maximum of 1e-5 and check that the maximum gradient value in the result is almost equal to 1e-5 (this threshold is small enough that there should always be some pre-clipping gradients larger than it). A minimal sketch of the clipping dispatch is included after this list.
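
For context, here is a minimal sketch of the kind of dispatch this fix relies on; the function name and signature are illustrative rather than the actual Lightning accelerator code, with "norm" and "value" mapping onto the two standard torch.nn.utils helpers:

import torch

def clip_gradients(parameters, clip_val, gradient_clip_algorithm="norm"):
    # Illustrative dispatch between the two clipping algorithms.
    if clip_val <= 0:
        return
    if gradient_clip_algorithm == "value":
        # Clamp each gradient element to [-clip_val, clip_val].
        torch.nn.utils.clip_grad_value_(parameters, clip_value=clip_val)
    elif gradient_clip_algorithm == "norm":
        # Rescale gradients so that their total norm is at most clip_val.
        torch.nn.utils.clip_grad_norm_(parameters, max_norm=clip_val)
    else:
        raise ValueError(f"Unknown gradient_clip_algorithm: {gradient_clip_algorithm!r}")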

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Apr 9, 2021

Hello @ceshine! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-10 15:10:55 UTC

@codecov

codecov bot commented Apr 9, 2021

Codecov Report

Merging #6928 (3d3f7cc) into master (3baac71) will decrease coverage by 5%.
The diff coverage is 57%.

@@           Coverage Diff           @@
##           master   #6928    +/-   ##
=======================================
- Coverage      92%     86%    -5%     
=======================================
  Files         194     194            
  Lines       12347   12553   +206     
=======================================
- Hits        11322   10856   -466     
- Misses       1025    1697   +672     

@ceshine
Contributor Author

ceshine commented Apr 9, 2021

The main test that is failing now is tests/models/test_horovod.py::test_horovod_multi_optimizer, which I don't believe has anything to do with this PR. From what I have observed, this test fails somewhat randomly (for example, the first check succeeded once but failed in the next run), and I don't think changing the TPU test case would affect the horovod tests.

The #6807 issue is really interfering with the testing inside Trainer.fit calls. I tried enabling a workaround to see whether the errors raised were related to gradient clipping; they weren't, but there is no guarantee after disabling it. I would be happy to fix any errors raised by these changes once #6807 has been fixed.

Comment on lines 941 to 942
assert abs(round(grad_max.item(), 6) - grad_clip_val) < 1e-6, \
f"Gradient max value {grad_max} != grad_clip_val {grad_clip_val} ."
Contributor

In rare cases, this test will be a problem if the gradient values are all smaller than the gradient clipping value.

Contributor Author

Yeah, I'm aware. I'd argue that the possibility of that happening is really low (since the threshold is 1e-5); at least I haven't encountered it in my local testing. If you want to make it even lower, we can set the threshold to 1e-10 or 1e-13 (within the fp16 range).

This is the approach I came up with to distinguish clipping by norm from clipping by value. I'm open to better ideas, of course.

Contributor Author

@ceshine ceshine Apr 10, 2021

If you really want to prevent that false-positive case from happening, we can add an if statement before the assertion to make sure the minimum gradient is larger than the threshold (this might create some false negatives, though).

EDIT: this solution would need to get the gradients before clipping, which is not possible in the current test setup.

Comment on lines 1021 to 1022
assert abs(round(grad_max.item(), 6) - grad_clip_val) < 1e-6, \
f"Gradient max value {grad_max} != grad_clip_val {grad_clip_val} ."
Contributor

same

@awaelchli awaelchli added the bug Something isn't working label Apr 10, 2021
@awaelchli awaelchli modified the milestones: 1.2.x, 1.3 Apr 10, 2021
@ceshine
Contributor Author

ceshine commented Apr 10, 2021

I've changed the clipping threshold in the test cases to 1e-10 and max_steps to 1 (matching test_gradient_clipping) to address @dhkim0225's concern.

I don't think a false positive (the test failing when it shouldn't) would happen in this setup. Even if one did, I'd argue that the problem lies in the BoringModel, because a test setup should not create a situation where all gradients are almost zero after just one backward pass.
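
For reference, a minimal standalone sketch of the kind of check described above; the tiny linear model stands in for BoringModel, and this is not the actual test code:

import torch

model = torch.nn.Linear(32, 2)  # hypothetical stand-in for BoringModel
model(torch.randn(8, 32)).sum().backward()

grad_clip_val = 1e-10
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=grad_clip_val)

# After clipping by value, the largest absolute gradient should sit at the clip
# value, assuming at least one pre-clip gradient exceeded it (with a threshold
# this small, that is practically always the case after one backward pass).
grad_max = max(p.grad.abs().max() for p in model.parameters())
assert abs(grad_max.item() - grad_clip_val) < 1e-6, \
    f"Gradient max value {grad_max} != grad_clip_val {grad_clip_val}."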

Contributor

@tchaton tchaton left a comment

LGTM !

@kaushikb11 kaushikb11 changed the title from "Fix the gradient_clip_algorithm has no effect issue. (#6920)" to "Fix the gradient_clip_algorithm has no effect issue." Apr 14, 2021
@kaushikb11 kaushikb11 merged commit 24d0295 into Lightning-AI:master Apr 14, 2021
@carmocca
Member

Hi @ceshine, quick question

TPUAccelerator.clip_gradients does not implement clipping by value. Passing gradient_clip_algorithm="value" should raise an exception.

Is there any reason why we can't just use torch.nn.utils.clip_grad_value_? Do you have more info?

@ceshine
Contributor Author

ceshine commented Apr 15, 2021

Hi @ceshine, quick question

TPUAccelerator.clip_gradients does not implement clipping by value. Passing gradient_clip_algorithm="value" should raise an exception.

Is there any reason why we can't just use torch.nn.utils.clip_grad_value_? Do you have more info?

I'm not really familiar with XLA, but I think it is for the same reason that xla_clip_grad_norm_ is used in TPUAccelerator.

The original #6123 implementation did not even have a gradient_clip_algorithm argument in TPUAccelerator, which would create problems when we use this argument in the training loop. I merely added that argument and made sure to let users know that only the "norm" algorithm has been implemented for TPU.
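
To illustrate, a rough sketch of that argument check; the names here are illustrative, and the real TPUAccelerator delegates norm clipping to an XLA-aware helper rather than the plain torch call used below:

import torch

def tpu_clip_gradients(parameters, clip_val, gradient_clip_algorithm="norm"):
    if clip_val <= 0:
        return
    if gradient_clip_algorithm != "norm":
        # Only norm clipping is implemented for TPU, so anything else errors out.
        raise ValueError(
            f"Only gradient_clip_algorithm='norm' is supported on TPU, got {gradient_clip_algorithm!r}"
        )
    # Stand-in for the XLA-aware norm clipping used on TPU.
    torch.nn.utils.clip_grad_norm_(parameters, max_norm=clip_val)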

Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Trainer(gradient_clip_algorithm='value') has no effect (from #6123)
8 participants