Clear `list` states (i.e. delete their contents), not reassign the default `[]` #2493

dominicgkerr · 2024-04-06T22:46:59Z

Rather than overwriting list states (with []), call .clear() to correctly (more robustly?) delete/free their Tensor elements. The previous behaviour results in a CPU (possibly also GPU) memory leak.
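A minimal sketch of the difference, using plain Python objects in place of Tensors. The variable names are illustrative only and are not torchmetrics internals; the point is just that rebinding a name leaves the old list (and its items) alive for anyone else holding it, whereas `.clear()` empties the shared list itself.

```python
# Plain Python objects stand in for Tensors; no torchmetrics code is used here.
state = [bytearray(10), bytearray(10)]
alias = state  # e.g. a cache or user variable that holds the same list object

# Reassigning: the name `state` now points at a new list,
# but the old list is still reachable through `alias`.
state = []
kept_after_reassign = len(alias)

# Clearing: start again from the shared list and empty it in place,
# so every holder of the list sees it emptied.
state = alias
state.clear()
kept_after_clear = len(alias)

print(kept_after_reassign, kept_after_clear)
```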

What does this PR do?

Fixes #2492 (maybe #2481 also)

Before submitting

Was this discussed via a Github issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
~~Did you make sure to update the docs?~~
Did you write any new necessary tests?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

📚 Documentation preview 📚: https://torchmetrics--2493.org.readthedocs.build/en/2493/

Borda

nice, can we also have a test to cover this edge-case?

dominicgkerr · 2024-04-07T01:15:53Z

> nice, can we also have a test to cover this edge-case?

Absolutely! Added 1fa7077 to help illustrate the issue (and how the fix helps)

justusschock

Thanks for your PR and that's certainly a great find.
That being said, I am actually not sure what's the desired behavior here.

I can see good reasons for both ways.

For example, it might be very confusing for people who keep a reference to a state to find that it is suddenly empty, whereas by just instantiating a new list we're doing everything correctly on our side and still give the user the possibility to retain a reference to the old state if they wish to. So this would only lead to a memory leak in user code, not on our side.

dominicgkerr · 2024-04-07T14:51:28Z

Interesting - I hadn't thought of that scenario being desirable...

I would argue, though, that the existing behaviour is certainly unexpected (according to the current documentation). My reading of .add_state is that users (intentionally) defer the memory/device management entirely to Metric, and wouldn't use it if they needed something more complex (like persistent references). Similarly, I naturally expect .reset to return the Metric to its original state, AND to tidy up any intermediate values calculated by .update (which are currently leaked).

In my specific use-case, I'm fitting a LightningModule to a very large dataset. I have a setup very similar to [1], except I'm using a custom Metric with a list state. Over the course of a large number of training steps, I consistently see my CPU memory usage increase (as a result of #2492) until the Python process crashes.

If you're keen to support the ability to retain references to Metric states, would introducing a (say) free_on_reset: bool keyword argument to .add_state (that conditionally deletes/clears states before reassigning the default value) be preferable? Defaulting to free_on_reset=True would enable the new behaviour by default, while still allowing users to revert to the existing behaviour when they require something more complex.

[1] https://lightning.ai/docs/torchmetrics/stable/pages/lightning.html#logging-torchmetrics
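The proposed flag could look roughly like the sketch below. This only illustrates the suggestion: `ToyMetric` is a hypothetical stand-in class, and `free_on_reset` is the keyword floated in this comment, not an existing torchmetrics argument.

```python
import copy


class ToyMetric:
    """Hypothetical stand-in for a metric; not the torchmetrics Metric class."""

    def __init__(self):
        self._defaults = {}
        self._free_on_reset = {}

    def add_state(self, name, default, free_on_reset=True):
        # `free_on_reset` is the proposed (hypothetical) keyword argument.
        self._defaults[name] = copy.deepcopy(default)
        self._free_on_reset[name] = free_on_reset
        setattr(self, name, copy.deepcopy(default))

    def reset(self):
        for name, default in self._defaults.items():
            current = getattr(self, name)
            # Optionally empty list states in place before restoring the default.
            if self._free_on_reset[name] and isinstance(current, list):
                current.clear()
            setattr(self, name, copy.deepcopy(default))


metric = ToyMetric()
metric.add_state("freed", [])
metric.add_state("kept", [], free_on_reset=False)
metric.freed.append(1)
metric.kept.append(2)

# Outside references taken before reset, as a user might hold them.
freed_ref, kept_ref = metric.freed, metric.kept
metric.reset()
```

With this sketch, `freed_ref` is emptied by `reset`, while `kept_ref` still holds its item because that state opted out.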

justusschock · 2024-04-08T14:56:31Z

I think you have some valid arguments there. I'm fine making an opinionated move here. Could you just add that to the documentation and refer people to use copy/deepcopy if they want to retain the states?
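As a rough illustration of that advice, assuming a metric whose `reset` clears its list state in place: `ListStateMetric` below is a hypothetical toy class, and nested Python lists stand in for Tensors. A plain reference follows the state and ends up empty, while a `copy.deepcopy` taken before `reset` survives.

```python
import copy


class ListStateMetric:
    """Hypothetical toy metric with a single list state."""

    def __init__(self):
        self.values = []

    def update(self, item):
        self.values.append(item)

    def reset(self):
        # Empties the list in place, so outside references see it emptied too.
        self.values.clear()


metric = ListStateMetric()
metric.update([1.0, 2.0])
metric.update([3.0])

live_view = metric.values                # same list object as the state
snapshot = copy.deepcopy(metric.values)  # independent copy of the list and its items

metric.reset()
```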

dominicgkerr · 2024-04-09T00:24:02Z

Brilliant, thanks! I've updated the docstring/documentation - let me know if it needs rewording/expanding/etc...

Unfortunately I couldn't get `make docs` to build locally (as I couldn't `pip install lai-sphinx-theme`), so I'm just waiting on the CI to double-check that my formatting is correct.

Thanks again

Borda · 2024-04-10T08:16:59Z

@SkafteNicki thoughts?

Borda · 2024-04-12T17:17:55Z

@dominicgkerr there are too many failing tests, could you pls have a look...

dominicgkerr · 2024-04-13T15:11:49Z

@SkafteNicki Nice (5758977) - was just looking at the .forward logic!

SkafteNicki · 2024-04-13T15:13:19Z

@dominicgkerr you are welcome :)
Hopefully all tests should pass now

codecov · 2024-04-13T20:05:35Z

Codecov Report

Merging #2493 (3f013cf) into master (581c444) will increase coverage by 0%.
The diff coverage is 100%.

Additional details and impacted files

@@          Coverage Diff           @@
##           master   #2493   +/-   ##
======================================
  Coverage      69%     69%           
======================================
  Files         307     307           
  Lines       17396   17404    +8     
======================================
+ Hits        11981   11989    +8     
  Misses       5415    5415

dominicgkerr · 2024-04-13T22:21:18Z

I think I've got things working (better?) now... @SkafteNicki / 5758977 was a huge help (thanks!), as I hadn't spotted the caching behaviour inside .forward / .update before.

Rather than simply using deepcopy (which worked in all but one edge case), I added a ._copy_state_dict helper method to .detach().clone() Tensor / list[Tensor] states inserted into the cache. Unfortunately this caused tests/unittests/bases/test_metric.py:test_constant_memory_on_repeat_init to fail, but I found (in ef27215) that removing the .clone() calls fixed it. But, as the resulting cache might now contain references to Tensor values, I was suspicious that it might somehow leak memory...

After some digging, I came to the conclusion that test_constant_memory_on_repeat_init wasn't actually testing memory leakage during initialisation, but was instead ensuring that repeated .forward calls didn't (significantly) increase the metric's memory. As an illustration, moving x = torch.randn(10000).cuda() inside the test's for loop (where a constant Tensor couldn't be automatically referenced by Python/PyTorch in subsequent calls) actually caused it to fail...

I claim that memory should be allowed to increase here (as users might legitimately want to collect lots of observations during fitting, and combine them inside .compute), as long as the allocated memory is eventually freed by .reset. I've added a new test (test_freed_memory_on_reset) to check this.
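The idea behind such a check can be sketched without torch, using weak references to see whether accumulated items are still alive after `reset`. This is not the PR's `test_freed_memory_on_reset`: `CachingMetric`, `Blob`, and the `_cache` attribute are hypothetical stand-ins, with `_cache` playing the role of any second reference to the state list.

```python
import gc
import weakref


class Blob:
    """Stand-in for a Tensor; plain objects support weak references."""


class CachingMetric:
    """Hypothetical toy metric whose cache shares the state list."""

    def __init__(self, clear_in_place):
        self.values = []
        self.clear_in_place = clear_in_place
        self._cache = None

    def update(self, item):
        self.values.append(item)
        self._cache = self.values  # second reference to the same list

    def reset(self):
        if self.clear_in_place:
            self.values.clear()  # items are released for every holder
        else:
            self.values = []     # old list stays alive through _cache


def survivors_after_reset(clear_in_place):
    metric = CachingMetric(clear_in_place)
    refs = []
    for _ in range(3):
        blob = Blob()
        refs.append(weakref.ref(blob))
        metric.update(blob)
    del blob  # drop the loop variable's own reference to the last item
    metric.reset()
    gc.collect()
    # Count items that are still alive while the metric itself still exists.
    return sum(ref() is not None for ref in refs)


leaked = survivors_after_reset(clear_in_place=False)
freed = survivors_after_reset(clear_in_place=True)
```

Memory grows during `update` in both cases; only the in-place clear lets all of it go at `reset`.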

I appreciate changing (failing) tests to pass the CI is pretty suspect - hopefully the above explains why I did (let me know if not!)

Clear (i.e. delete) list state items, not simply overwrite. Previous behaviour produced memory leak from list[Tensor] states

11df0eb

dominicgkerr requested review from SkafteNicki, justusschock, Borda and lantiga as code owners April 6, 2024 22:47

Borda approved these changes Apr 6, 2024

Added test to check list states elements are deleted (even when referenced, and hence not automatically garbage collected). Fixed failing test (want to check list state, but assigned Tensor)

1fa7077

dominicgkerr requested a review from stancld as a code owner April 7, 2024 01:14

justusschock reviewed Apr 7, 2024

Updated documentation - highlighted reset clears list states, and that care must be taken when referencing them

4b5c099

github-actions bot added the documentation Improvements or additions to documentation label Apr 9, 2024

Add missing method (sphinx) role

8bf151b

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

64fd4d2

Borda assigned SkafteNicki Apr 10, 2024

justusschock approved these changes Apr 11, 2024

mergify bot and others added 2 commits April 11, 2024 14:22

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

b991b3b

changelog

82f808b

SkafteNicki added this to the v1.3.x milestone Apr 12, 2024

SkafteNicki approved these changes Apr 12, 2024

SkafteNicki enabled auto-merge (squash) April 12, 2024 14:49

dominicgkerr and others added 3 commits April 13, 2024 00:39

Remove failing testcode example (fixing introduces too much complexity)

65b02fa

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

b9dcc8b

Merge branch 'bugfix/2492-clear-list-states-not-reassign' of github.com:dominicgkerr/torchmetrics into bugfix/2492-clear-list-states-not-reassign

5565524

auto-merge was automatically disabled April 13, 2024 13:44
Head branch was pushed to by a user without write access

dominicgkerr and others added 2 commits April 13, 2024 14:47

Linting - Line break docstring

6241a6b

copy internal states in forward

5758977

dominicgkerr and others added 6 commits April 13, 2024 18:17

Detach Tensor | list[Tensor] state values before copying.

c9d2a86

[pre-commit.ci] auto fixes from pre-commit.com hooks

e1872de

for more information, see https://pre-commit.ci

Use 'typing' type hints

afdd4c5

Merge remote-tracking branch 'origin/bugfix/2492-clear-list-states-not-reassign' into bugfix/2492-clear-list-states-not-reassign # Conflicts: # src/torchmetrics/metric.py

76cc0a1

DO not clone (when caching) Tensor states, but retain references to avoid memory leakage

ef27215

[pre-commit.ci] auto fixes from pre-commit.com hooks

e104587

for more information, see https://pre-commit.ci

dominicgkerr added 3 commits April 13, 2024 22:39

Revert "DO not clone (when caching) Tensor states, but retain references to avoid memory leakage" This reverts commit ef27215.

21c7970

Added mypy type-hinting requirement/recommendation

51975a8

Moved update from test checking .__init__ memory leakage. Added test checking .reset clears memory allocated during update (memory should be allowed to grow, as long as discarded safely)

5954d02

mergify bot added the ready label Apr 14, 2024

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

cd11bb0

mergify bot added the has conflicts label Apr 15, 2024

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

dc14e5e

mergify bot added has conflicts and removed has conflicts labels Apr 15, 2024

Merge branch 'master' into bugfix/2492-clear-list-states-not-reassign

925a3b1

mergify bot removed the has conflicts label Apr 15, 2024

Fix unused loop control variable for pre-commit

3f013cf

stancld approved these changes Apr 16, 2024

stancld enabled auto-merge (squash) April 16, 2024 08:41

stancld merged commit 5259c22 into Lightning-AI:master Apr 16, 2024
61 of 62 checks passed

dominicgkerr deleted the bugfix/2492-clear-list-states-not-reassign branch April 16, 2024 19:18
