Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear list states (i.e. delete their contents), not reassign the default [] #2493

Conversation

dominicgkerr
Copy link
Contributor

@dominicgkerr dominicgkerr commented Apr 6, 2024

Rather than overwriting list states (with []), call .clear() to correctly (more robustly?) delete/free Tensor elements. Previous behaviour results in CPU (possibly GPU also) memory leak

What does this PR do?

Fixes #2492 (maybe #2481 also)

Before submitting
  • Was this discussed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--2493.org.readthedocs.build/en/2493/

…behaviour produced memory leak from list[Tensor] states
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, can we also have a test to cover this edge-case?

…enced, and hence not automatically garbage collected). Fixed failing test (want to check list state, but assigned Tensor)
@dominicgkerr
Copy link
Contributor Author

nice, can we also have a test to cover this edge-case?

Absolutely! Added 1fa7077 to help illustrate the issue (and how the fix helps)

Copy link
Member

@justusschock justusschock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR and that's certainly a great find.
That being said, I am actually not sure what's the desired behavior here.

I can see good reasons for both ways.

I.e. It might be very confusing for people if they keep a reference to things why this one suddenly empty, whereas with just instantiating a new list we're doing everything correctly on our side and still give the user the possibility to retain a reference to the old state if they wish to. So this would only lead to a memory leak in user code, not on our side.

@dominicgkerr
Copy link
Contributor Author

Interesting - I hadn't thought of that scenario being desirable...

I would argue though, the exiting behaviour is certainly unexpected (according to the current documentation). My reading of .add_state is that users (intentionally) defer the memory/device management entirely to Metric, and wouldn't use it if they needed something more complex (like persistent references.) Similarly, I naturally expect .reset to return the Metric to its original states, AND to tidy-up any intermediate values calculated by .update (which are currently leaked.)

In my specific use-case, I'm fitting a LighntningModule to a very large dataset. I have a setup very similar to [1], except I'm using a custom Metric with a list state - over the course of a large number of training steps, I consistently see my CPU memory usage increase (as a result of #2492) until the python process crashes.

If you're keen to support the ability to retain references to Metric states, would introducing a (say) free_on_reset: bool keyword argument to .add_state (that conditionally deletes/clears states before reassigning the default value) be preferable? Setting free_on_reset=True by could enable the new behaviour by default, while still allowing users to revert to the existing behaviour whn they require something more complex?

@justusschock
Copy link
Member

justusschock commented Apr 8, 2024

I think you have some valid arguments there. I'm fine making an opinionated move here. Could you just add that to the documentation and refer people to use copy/deepcopy if the want to retain the states?

…t care must be taken when referencing them
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 9, 2024
@dominicgkerr
Copy link
Contributor Author

Brilliant, thanks! I've updated the docstring/documentation - let me know if it needs rewording/expanding/etc...

Unfortunately I couldn't get make docs to build locally (as I couldn't pip install lai-sphinx-theme), so just waiting on the CI to double check my formatting is correct

Thanks again

@Borda
Copy link
Member

Borda commented Apr 10, 2024

@SkafteNicki thoughts?

@SkafteNicki SkafteNicki added this to the v1.3.x milestone Apr 12, 2024
@SkafteNicki SkafteNicki enabled auto-merge (squash) April 12, 2024 14:49
@Borda
Copy link
Member

Borda commented Apr 12, 2024

@dominicgkerr there are too many failing tests, could you pls have look...

auto-merge was automatically disabled April 13, 2024 13:44

Head branch was pushed to by a user without write access

@dominicgkerr
Copy link
Contributor Author

@SkafteNicki Nice (5758977) - was just looking at the .forward logic!

@SkafteNicki
Copy link
Member

@dominicgkerr you are welcome :)
Hopefully all tests should pass now

Copy link

codecov bot commented Apr 13, 2024

Codecov Report

Merging #2493 (3f013cf) into master (581c444) will increase coverage by 0%.
The diff coverage is 100%.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #2493   +/-   ##
======================================
  Coverage      69%     69%           
======================================
  Files         307     307           
  Lines       17396   17404    +8     
======================================
+ Hits        11981   11989    +8     
  Misses       5415    5415           

dominicgkerr added 3 commits April 13, 2024 22:39
…ces to avoid memory leakage"

This reverts commit ef27215.
…checking .reset clears memory allocated during update (memory should be allowed to grow, as long as discarded safely)
@dominicgkerr
Copy link
Contributor Author

I think I've got things working (better?) now... @SkafteNicki / 5758977 was a huge help (thanks!), as I hadn't spotted the caching behaviour inside .forward / .update before.

Rather than simply using deepcopy (which worked in all but one edge case), I added a ._copy_state_dict helper method to .detach().clone() Tensor / list[Tensor] states inserted into the cache. Unfortunately this caused tests/unittests/bases/test_metric.py:test_constant_memory_on_repeat_init to fail, but I found (in ef27215) that removing the .clone() calls, fixed it. But, as the resulting cache might now contain references to Tensor values, I was suspicious that it somehow leak memory...

After some digging, I came to the conclusion that test_constant_memory_on_repeat_init wasn't actually testing memory leakage during initialisation, but actually ensure repeated .forward calls didn't (significantly) increase the metric's memory. As an illustration, moving x = torch.randn(10000).cuda() inside the test's for loop (where a constant Tensor couldn't be automatically referenced by python/pytorch in subsequent calls), actually caused it to fail...

I claim, memory should be allowed to increase here (as users might legitimately want to collect lots of observations during fitting, and combine them inside .compute), as long as the allocated memory is eventually freed by .reset. I've added a new test (test_freed_memory_on_reset) check this.

I appreciate changing (failing) tests to pass the CI is pretty suspect - hopefully the above explains why I did (let me know if not!)

@mergify mergify bot added the ready label Apr 14, 2024
@mergify mergify bot removed the has conflicts label Apr 15, 2024
@stancld stancld enabled auto-merge (squash) April 16, 2024 08:41
@stancld stancld merged commit 5259c22 into Lightning-AI:master Apr 16, 2024
61 of 62 checks passed
@dominicgkerr dominicgkerr deleted the bugfix/2492-clear-list-states-not-reassign branch April 16, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation ready
Projects
None yet
Development

Successfully merging this pull request may close these issues.

list states leak (Tensor) memory
5 participants