
New modular metric interface #2528

Merged
merged 42 commits into master from new_metric_interface on Aug 26, 2020

Conversation

@SkafteNicki (Member) commented Jul 6, 2020

What does this PR do?

Fixes #3069

This is a proposal for how an extension of the modular interface for the
metrics package could look. What our interface is missing is the option to do computations
after the ddp sync. Consider the following example for RMSE:

20 samples, 2 devices, 10 samples on each device.
With our current setup, each device would return a metric value, which then
gets synced.

The first device returns (assume its sum of squared errors is 200)
sqrt(1/10 * 200) = 4.47
The second device returns (assume its sum of squared errors is 100)
sqrt(1/10 * 100) = 3.16
After the ddp sync the value that is returned would be
(4.47 + 3.16) / 2 = 3.815
But the correct value is:
sqrt(1/20 * 300) = 3.873

It is possible to arrive at the correct result if each device instead returns
a *metric state*, i.e. a collection of values that can be synced and from which the correct
result can be derived (in the above case, the number of samples and the
sum of squared errors). Thus, we need to extend the modular interface so
that we (or the user) can do calculations after the ddp sync, as the sketch below illustrates.
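
Here is a minimal plain-Python sketch of the difference (the numbers mirror the example above; this is an illustration only, not code from this PR):

import math

# per-device metric state from the example above:
# sum of squared errors and number of samples on each of the two devices
sse = [200.0, 100.0]
n = [10, 10]

# current behaviour: each device computes its own RMSE, then the values are averaged
naive = sum(math.sqrt(s / k) for s, k in zip(sse, n)) / 2   # ~= 3.82

# proposed behaviour: sync the state (sse, n) first, then compute once
correct = math.sqrt(sum(sse) / sum(n))                      # ~= 3.87

print(naive, correct)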

This PR therefore proposes to go from our decorator-oriented modular interface
to a hook-based interface. All hooks are optional, so the user only needs
to implement forward if they inherit from either TensorMetric or NumpyMetric.

Hooks to add:

  • input_convert -> converts the input to the correct type
  • output_convert -> converts the output of forward
  • ddp_sync -> performs the ddp sync
  • compute -> the missing part, which enables computations after the ddp sync (see the sketch after this list)
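
As a rough sketch (illustrative only, not the final API from this PR), a metric could then return its state from forward and derive the final value in compute. The hook signature below mirrors the torch.nn.Module forward-hook convention used in the Accumulate example later in this thread, and the TensorMetric name argument is assumed from the existing interface:

import torch
from pytorch_lightning.metrics.metric import TensorMetric  # existing base class

class RMSE(TensorMetric):
    def __init__(self):
        super().__init__(name="rmse")  # assumed constructor argument

    def forward(self, pred: torch.Tensor, target: torch.Tensor):
        # return the metric *state*: sum of squared errors and sample count
        sse = torch.sum((pred - target) ** 2)
        n = torch.tensor(pred.numel(), dtype=pred.dtype, device=pred.device)
        return sse, n

    @staticmethod
    def compute(module, input, output):
        # runs as a forward hook once the state has been summed across processes
        sse, n = output
        return torch.sqrt(sse / n)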

Note: this PR only implements the hooks; I still need to go over each metric
and fix those where the compute hook is needed.
That will be done in a follow-up PR, since this one is already extensive enough.

Tagging @justusschock and @Borda for opinion

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

mergify bot (Contributor) commented Jul 6, 2020

This pull request is now in conflict... :(

mergify bot requested a review from a team July 6, 2020 11:32
@justusschock (Member) left a comment

Overall I really like these changes!

I feel like we should automatically sync the output of forward whenever we are on ddp, and then call aggregate on it. We could also think about defaults (i.e. aggregate could be a simple mean by default).

So the next step would be to revisit all metrics for reduction?

Edit: This should also make it simpler to pickle metrics, since we only need to make sure the converters can be pickled :)
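
A hypothetical sketch of such a default (the function name and shape convention here are assumptions, not part of this PR): once the per-process outputs have been gathered into a tensor with a leading world-size dimension, the default aggregation would simply be a mean over that dimension:

import torch

def default_aggregate(gathered: torch.Tensor) -> torch.Tensor:
    # gathered has shape (world_size, *metric_shape); reduce with a plain mean
    return gathered.mean(dim=0)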

Review thread on pytorch_lightning/metrics/metric.py (outdated, resolved)
mergify bot requested a review from a team July 6, 2020 12:10
Borda added the "feature" (Is an improvement or enhancement) and "Important" labels Jul 6, 2020
Borda added this to "in Progress" in the Metrics package project via automation Jul 6, 2020
Borda added this to the 0.9.0 milestone Jul 6, 2020
@SkafteNicki (Member, Author)

@justusschock thanks, it took me a while to figure out how to keep this as close to native pytorch as possible while still being expressive enough to support the features we need.
I think the first step is to change the backend to accommodate the features that we need and then do the actual implementation of the features.

For aggregation over multiple batches, one way to achieve this is to introduce a new Accumulate class that inserts a hook just before the compute hook and accumulates the states:

from torch import nn

class Accumulate(nn.Module):
    """Wraps a metric and keeps a running metric state across batches."""

    def __init__(self, base_metric):
        super().__init__()
        self.base_metric = base_metric
        # remove the compute hook (registered last), insert save_state, then re-add compute
        self.base_metric._forward_hooks.popitem(last=True)
        self.base_metric.register_forward_hook(self.save_state)
        self.base_metric.register_forward_hook(self.base_metric.compute)
        self.state = None

    def forward(self, *args, **kwargs):
        return self.base_metric(*args, **kwargs)

    def save_state(self, module, input, output):
        # forward-hook signature: accumulate the metric state element-wise
        if self.state is None:
            self.state = output
        else:
            state = [s + o for s, o in zip(self.state, output)]
            self.state = state
            output = state
        return output

    def reset(self):
        self.state = None


# Use case
metric = Accumulate(Metric(...))
for (data, target) in dataloader:
    pred = model(data)
    val = metric(pred, target)  # val will be the accumulated value

codecov bot commented Jul 6, 2020

Codecov Report

Merging #2528 into master will decrease coverage by 9%.
The diff coverage is 88%.

@@           Coverage Diff            @@
##           master   #2528     +/-   ##
========================================
- Coverage      90%     81%     -9%     
========================================
  Files          81      84      +3     
  Lines        7858    9321   +1463     
========================================
+ Hits         7034    7518    +484     
- Misses        824    1803    +979     

@justusschock (Member)

@SkafteNicki I see your point. But isn't it basically the same whether you accumulate across nodes or across different batches on the same node? Handling both the same way would probably avoid some code duplication.

We should have a chat on Slack about the ideal integration once we have finished the metrics API.

@SkafteNicki (Member, Author)

@justusschock syncing between nodes and accumulating across batches on the same node probably is the same operation, so I agree it should be handled in roughly the same way. The only difference is that accumulation on the same node should probably be a feature the user can enable/disable (i.e. an accumulate flag in the metric __init__); a rough sketch of that flag follows below.

For now, I will rename the ddp_sync hook to aggregate as you proposed, and it will still only support syncing between nodes. Then we can discuss the remaining parts on Slack.
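
A rough sketch of what such a flag could look like (the class fragment, the flag name, and the helper below are hypothetical, not code from this PR):

from torch import nn

class Metric(nn.Module):
    def __init__(self, name: str, accumulate: bool = False):
        super().__init__()
        self.name = name
        self.accumulate = accumulate  # if True, keep a running state across batches
        self._state = None

    def _update_state(self, new_state):
        # keep a running element-wise sum of the metric state when accumulation is enabled
        if not self.accumulate or self._state is None:
            self._state = new_state
        else:
            self._state = [s + n for s, n in zip(self._state, new_state)]
        return self._state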

mergify bot (Contributor) commented Jul 8, 2020

This pull request is now in conflict... :(

mergify bot (Contributor) commented Jul 21, 2020

This pull request is now in conflict... :(

@justusschock (Member)

@SkafteNicki How is it going with this PR?

mergify bot (Contributor) commented Aug 5, 2020

This pull request is now in conflict... :(

mergify bot (Contributor) commented Aug 7, 2020

This pull request is now in conflict... :(

mergify bot requested a review from a team August 25, 2020 19:25
Review thread on CHANGELOG.md (outdated, resolved)
mergify bot requested a review from a team August 25, 2020 19:33
mergify bot requested a review from a team August 25, 2020 21:09
mergify bot requested a review from a team August 25, 2020 21:12
@Borda (Member) left a comment

LGTM 🐰

Metrics package automation moved this from in Progress to in Review Aug 26, 2020
Borda requested a review from rohitgr7 August 26, 2020 08:27
Borda added the "ready" (PRs ready to be merged) label Aug 26, 2020
@Borda (Member) commented Aug 26, 2020

@SkafteNicki some metric test seems to be hanging...

Borda removed the "ready" (PRs ready to be merged) label Aug 26, 2020
@SkafteNicki (Member, Author)

@Borda which tests are hanging? I cannot figure it out from the drone details.

@Borda (Member) commented Aug 26, 2020

> @Borda which tests are hanging? I cannot figure it out from the drone details.

see this build - http://35.192.60.23/PyTorchLightning/pytorch-lightning/8999
the last passing test was tests/metrics/test_metrics.py::test_metric[metric1] PASSED [ 37%], so the next one hung...

@SkafteNicki (Member, Author)

@Borda fixed the bug, we can merge this now :]

Borda added the "ready" (PRs ready to be merged) label Aug 26, 2020
Borda merged commit 17d8773 into Lightning-AI:master Aug 26, 2020
Metrics package automation moved this from in Review to Done Aug 26, 2020
SkafteNicki mentioned this pull request Sep 18, 2020
SkafteNicki deleted the new_metric_interface branch October 8, 2020 14:21
Labels
"feature" (Is an improvement or enhancement), "ready" (PRs ready to be merged)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

metrics.sklearns don't work with DDP: Tensors must be CUDA and dense
6 participants