
Lose performance between 0.6.0 and 0.7.1 #1136

Closed
mpariente opened this issue Mar 13, 2020 · 53 comments
Assignees
Labels
help wanted (Open to be worked on)

Comments

@mpariente
Contributor

🐛 Bug

When I train exactly the same model with pl 0.7.1, I get worse performance than with pl 0.6.0.
I did a fresh install of Asteroid with both versions and ran exactly the same script on the same hardware.
I get significantly worse performance with pl 0.7.1.
Are there any known issues I should be aware of? In the meantime, I'll have to downgrade to 0.6.0.

Environment

PL 0.6.0

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Debian GNU/Linux 10 (buster)
GCC version: (Debian 8.3.0-6) 8.3.0
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] pytorch-lightning==0.6.0
[pip3] torch==1.4.0
[pip3] torchvision==0.4.2
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.14 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] torch 1.3.1 pypi_0 pypi
[conda] torchvision 0.4.2 pypi_0 pypi

Diff between 0.6.0 and 0.7.1 envs

diff env_0.7 env_0.6

19c19
< [pip3] pytorch-lightning==0.7.1
---
> [pip3] pytorch-lightning==0.6.0
@mpariente added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 13, 2020
@awaelchli
Member

Could you be more precise about what you mean by performance?
It could mean:

  • Your final test loss is worse with 0.7.1 compared to 0.6.0
  • Training epochs take longer to finish in 0.7.1 vs. 0.6.0
  • or other things.

@mpariente
Contributor Author

mpariente commented Mar 13, 2020 via email

@Borda
Member

Borda commented Mar 13, 2020

@mpariente could you pls give us some numbers?

@Borda added the information needed label and removed the bug (Something isn't working) label on Mar 13, 2020
@mpariente
Contributor Author

Yes. We minimize the negative signal-to-distortion ratio (SDR), which is widely used in speech separation. I report loss values here; they are negative and lower is better.
With 0.6.0 we reached -18 dB; with 0.7.1 we couldn't do better than -15 dB. People publish for a 1 dB improvement, so this is a huge difference.
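
For context, here is a minimal sketch of this kind of loss, assuming the scale-invariant SDR variant on zero-mean (batch, time) tensors; it is an illustration, not necessarily the exact Asteroid loss:

```python
import torch

def neg_si_sdr(est, target, eps=1e-8):
    """Negative scale-invariant SDR in dB; lower (more negative) is better."""
    # Project the estimate onto the target to get the scaled reference.
    dot = torch.sum(est * target, dim=-1, keepdim=True)
    target_energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    scaled_target = dot / target_energy * target
    noise = est - scaled_target
    ratio = torch.sum(scaled_target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -(10 * torch.log10(ratio + eps)).mean()

loss = neg_si_sdr(torch.randn(4, 16000), torch.randn(4, 16000))
```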

I'm re-running the experiments with exactly the same version of the code, same libraries, same hardware; the only difference is Lightning. This is a screenshot of the ongoing runs. We can already see a 2 dB difference between the runs:
Screenshot from 2020-03-13 14-19-08

@Borda
Member

Borda commented Mar 13, 2020

Just came across another comment about slower performance: #525 (comment)

@gwichern

I also noticed a similar issue with my code after upgrading from 0.6 to 0.7.1, so I tried running the MNIST example and confirmed the performance difference (both of my environments used torch==1.4.0 and torchvision==0.5.0).

Screen Shot 2020-03-13 at 11 35 06 AM

Orange curve is version 0.6, pink curve is version 0.7.1

@awaelchli
Member

If it even happens with MNIST, we have to find the bug asap!
I thought there were tests in place to ensure the performance does not change, but it seems this is not the case.

@awaelchli
Member

Just to be sure, @mpariente and @gwichern did you make the training deterministic before running this comparison?

@williamFalcon changed the title from "Loose performance between 0.6.0 and 0.7.1" to "Lose performance between 0.6.0 and 0.7.1" on Mar 13, 2020
@williamFalcon
Contributor

there are test cases to make sure performance doesn't change. please rerun using the exact same seed and only change the versions.

@gwichern

In my case, yes. The pytorch-lightning MNIST example sets both the numpy and torch seeds to 2334. Everything is the same except the environments.
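
For reference, a minimal sketch of that seeding, assuming only the numpy and torch seeds are set as described (the cuDNN flags are an extra determinism hint, not part of the example):

```python
import numpy as np
import torch

SEED = 2334  # seed used by the MNIST example, per the comment above
np.random.seed(SEED)
torch.manual_seed(SEED)
# Optional extra determinism knobs (an assumption, not part of the example):
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```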

@williamFalcon
Contributor

williamFalcon commented Mar 13, 2020

I'm not sure what you mean here.
You still reach the same accuracy, just at different times:

image

@williamFalcon
Contributor

@gwichern can you post a colab here? we can test it there.

@gwichern

Agreed on the point about the accuracy reaching the same value, but I just ran pl_examples/basic_examples/cpu_template.py in two different environments (passing --hidden_dim 500 from the command line in both cases). The seed is set to the same value in both versions, so I would have expected the results to match more closely.

@mpariente
Contributor Author

mpariente commented Mar 13, 2020

Just to be sure, @mpariente and @gwichern did you make the training deterministic before running this comparison?

No, my bad. But I did run each experiment twice, and there is a consistent difference between 0.6.0 and 0.7.1.
I'll rerun it tonight with the same seed on both trials.

@ethanwharris
Member

ethanwharris commented Mar 13, 2020

It looks like the MNIST example doesn't set shuffle=True in the dataloader; that could be the cause of the poor MNIST performance. @mpariente is the data being shuffled in your case?
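
For illustration, a minimal sketch of the fix being suggested, with an explicitly shuffled training DataLoader (the data path and batch size are assumptions):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
# Shuffle the training data each epoch; validation/test loaders stay unshuffled.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```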

@williamFalcon
Contributor

OK, yeah... I'm finding the same. Let's dig into this a bit.
The problem is that the core logic didn't really change, though.
Might be dataloader related.

@williamFalcon
Contributor

@PyTorchLightning/core-contributors

@awaelchli
Member

The @pl.data_loader decorator got removed in 0.7, could it be related? (I can't test right now.)
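
A rough sketch of the change being referred to (a plain class stands in for a LightningModule; the exact decorator behaviour across versions is an assumption here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class MyModule:  # stands in for a LightningModule, for illustration only
    def __init__(self):
        self.train_set = TensorDataset(torch.randn(64, 10), torch.zeros(64))

    # In 0.6.x this hook was decorated with @pl.data_loader;
    # in 0.7.x the decorator is gone and the plain method is used directly.
    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32, shuffle=True)
```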

@mpariente
Contributor Author

mpariente commented Mar 13, 2020

I haven't tried to find the cause of this yet; I thought I should report it first.

@mpariente is the data being shuffled in your case?

Yes, it is.

@williamFalcon
Contributor

Might be the refresh rate of the progress bar. Maybe that also changed the update frequency of the loggers by mistake.

@mpariente
Contributor Author

Might be the refresh rate of the progress bar. Maybe that also changed the update frequency of the loggers by mistake.

I had considered this possibility as well, but for a given epoch (while training), the results are significantly degraded.

@williamFalcon
Contributor

williamFalcon commented Mar 13, 2020

Using this colab (https://colab.research.google.com/drive/1NUrJ7LZqblKW_OIpiGYVaOLGJ2l_tFxs)

0.6.0

At the end of 1 epoch.

image

0.7.1

image

Removing the decorators has no effect

image

Using the new epoch_end signature

image
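
(A hedged sketch of what the new epoch_end signature looks like around 0.7.x; the hook name and return format are stated from memory, not verified against this exact version.)

```python
import torch

class MyModule:  # stands in for a LightningModule, for illustration only
    # Pre-0.7 style (assumed): def validation_end(self, outputs): ...
    # 0.7.x style: aggregate the per-batch outputs at the end of the epoch.
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
        return {"val_loss": avg_loss}
```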

When I decrease the refresh rate, it gets closer to the 0.6.0 value (actually, a lower loss than 0.6.0):

image

So, I don't see a huge difference here. Mind playing with it for a bit?

@williamFalcon
Contributor

williamFalcon commented Mar 13, 2020

And when running exactly the same model, including an environment reset, we get an exact curve match (across 6 epochs)...

image

@mpariente
Contributor Author

We'll try to see if we get the same problem when setting seeds.
Thanks for taking the time to look into it.
If it persists, I'll try to set up a reproducible example, but the dataset we use is not open, which makes things complicated.

@williamFalcon
Contributor

@mpariente I was thinking about this: maybe it has to do with the DistributedSampler, if you're using that? Since we now inject it automatically, your effective batch size may be different now, and so with the same learning rate you won't get the best results.

It might be that you have to readjust your learning rate (from this graph, I would lower it).
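
To make the effective-batch-size point concrete, a small sketch (the numbers and the linear LR-scaling rule are illustrative assumptions, not a prescription):

```python
per_process_batch_size = 32   # batch size each DistributedSampler shard sees
num_processes = 4             # e.g. number of GPUs under ddp
base_lr = 1e-3                # learning rate tuned for a single process

effective_batch_size = per_process_batch_size * num_processes  # 128
scaled_lr = base_lr * num_processes  # one common linear-scaling heuristic
print(effective_batch_size, scaled_lr)
```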

@mpariente
Contributor Author

Hmm, we use dp only, not ddp; would this still apply?
Thanks for taking a look again

@mpariente
Contributor Author

mpariente commented Mar 28, 2020

Well, this persists.
I finally took the time to reproduce the issue, on another architecture.
Now the training is deterministic. Here are the tensorboard logs (grey: pl 0.7.1, orange: pl 0.6.0; nothing else changes between the envs):
Screenshot from 2020-03-28 21-28-30

Training is not over, but the differences are already non-negligible.
I also have a video of the first 10 epochs if you'd like.

The script to reproduce is here, but the training dataset is under license.

Info that might be useful: the distributed backend is 'dp'.

@williamFalcon
Contributor

williamFalcon commented Mar 28, 2020

OK, awesome, I will look into this.
And this was introduced in 0.7.1?

@mpariente I'm looking into this today and tomorrow, and will push a fix if I find something.
In the meantime I'm also going to put together a few architectures to prove correctness going forward, so we know if we mess something up.

It's weird, because the tests do check a specific performance goal.

@williamFalcon self-assigned this on Mar 28, 2020
@mpariente
Contributor Author

IIRC, 0.7.0 was not backward compatible because of pl.data_loader, so I couldn't test it.

It's weird because the tests do test a specific performance goal

I know you're doing the best you can about this, no worries.

For now, both architectures involved LSTMs; did you change anything about BPTT?
I'm going to try with a ConvNet and see if it changes as well.

@williamFalcon
Contributor

williamFalcon commented Mar 28, 2020

Hmm... I don't think we did, but that's good to know. Maybe it is RNN related.
Why don't you do the sample in this colab so we can unify efforts here (ping me your email on our Slack so I can give you access)?

I want to create the following tests:

  1. MNIST using MLPs.
  2. CIFAR-10 using CNNs.
  3. an RNN example, maybe sequence classification, or whatever the equivalent simple test is for RNNs.
  4. a VAE.
  5. That's probably enough.

@mpariente
Contributor Author

I tried two convolutional architectures and the training and validation curves are a perfect match.
At first sight it seems to be RNN related, which would be good news.

@mpariente
Contributor Author

Any update on that please?

@williamFalcon
Contributor

Will do an RNN test. However, we now have a parity test between pure PyTorch and Lightning with convnets in continuous integration; the test forces a match across trials to 5 decimal places.

I'll add an RNN test as well.
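
A rough sketch of the kind of parity check being described (the structure and tolerance here are assumptions, not the actual CI test):

```python
def assert_parity(lightning_losses, vanilla_losses, decimals=5):
    """Fail if the two loss trajectories differ beyond the given precision."""
    for pl_loss, pt_loss in zip(lightning_losses, vanilla_losses):
        assert round(abs(pl_loss - pt_loss), decimals) == 0, (
            f"parity broken: {pl_loss} vs {pt_loss}"
        )

# Identical trajectories pass; a mismatch in the 5th decimal place would fail.
assert_parity([0.912345, 0.554321], [0.912345, 0.554321])
```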

@mpariente
Contributor Author

Sounds great!
Could you give me a link to the parity tests you mentioned, please?

@williamFalcon
Contributor

This runs on every PR to make sure no PR breaks parity.
We do have a bit of a speed difference with PyTorch, but we're looking into it. It looks related to logging, tqdm and TensorBoard.

So, speed-wise it's not a fair comparison, because the pure PyTorch version has no logging or any of that, whereas Lightning does.

@mpariente
Contributor Author

Oh, I didn't see the PR; I thought you'd ping this issue with it.
About the RNN: did you decide on the task? Do you want a fake dataset or something real? I can put together an averaging-RNN example if needed.

@williamFalcon
Contributor

williamFalcon commented Apr 2, 2020

Yeah, that would be super helpful! Maybe the addition task is a good dataset to test?

Can do the colab here:
https://colab.research.google.com/drive/1qvQdkiTfCeHot6Db9OI1acXqqn3qZdYO#scrollTo=dTzj2fH6I1Mn
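
A minimal sketch of the addition ("adding problem") task mentioned above, in case it helps bootstrap the colab (shapes and sizes are illustrative assumptions):

```python
import torch

def adding_problem_batch(batch_size=64, seq_len=50):
    """Each sequence carries random values plus two marked positions;
    the target is the sum of the two marked values."""
    values = torch.rand(batch_size, seq_len)
    markers = torch.zeros(batch_size, seq_len)
    for i in range(batch_size):
        idx = torch.randperm(seq_len)[:2]  # pick exactly two positions to mark
        markers[i, idx] = 1.0
    x = torch.stack([values, markers], dim=-1)       # (batch, seq_len, 2)
    y = (values * markers).sum(dim=1, keepdim=True)  # (batch, 1)
    return x, y

x, y = adding_problem_batch()
```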

@awaelchli
Member

awaelchli commented Apr 3, 2020

I took the code from #1351 and ran it for 10 epochs and 3 runs on CPU first (because @mpariente also has no GPU), and I noticed that there is a performance gap between 0.6.0 and 0.7.1 in the third decimal place.
https://colab.research.google.com/drive/1yek1fUkIEmJgt9iI7pnr3HWFNgrf14pr
Not sure if it helps.

@mpariente
Contributor Author

Thanks for looking into this, could you grant me access to the colab please?

Did you also try on GPU?

@awaelchli
Member

Try again, I had sharing turned off.
No, Colab doesn't want to give me a GPU for some reason; that's why I tried CPU.

@mpariente
Contributor Author

OK, I can reproduce the same results as you, and I checked that the pure PyTorch vanilla_loop also passes, which it does.
I'm trying something else; I'll let you know if it turns anything up.

@mpariente
Contributor Author

@williamFalcon, have you seen this?
I think the parity tests are not as good as they could be: if configure_optimizers does something under the hood, it won't have any impact, because the same optimizers are used in the pure PyTorch case as well.

I took the code from #1351 and ran it for 10 epochs and 3 runs on CPU first (because @mpariente also has no GPU), and I noticed that there is a performance gap between 0.6.0 and 0.7.1 in the third decimal place.
https://colab.research.google.com/drive/1yek1fUkIEmJgt9iI7pnr3HWFNgrf14pr
Not sure if it helps.

And the results @awaelchli mentioned show exactly this, right? The parity test passes but the performance is different; how can we explain that?

@jeremyjordan
Contributor

We should probably also include truncated_bptt in the parity test.
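
For context, a plain-PyTorch sketch of truncated BPTT (not the Lightning hook itself; the window size and loss are illustrative assumptions): the long sequence is processed in chunks and the hidden state is detached between chunks so gradients only flow within a chunk.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
x = torch.randn(4, 100, 8)          # (batch, time, features)

hidden = None
for chunk in x.split(25, dim=1):    # 25-step truncation window
    out, hidden = rnn(chunk, hidden)
    loss = out.pow(2).mean()        # dummy loss, for illustration only
    opt.zero_grad()
    loss.backward()
    opt.step()
    hidden = tuple(h.detach() for h in hidden)  # cut the graph between chunks
```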

@mpariente
Contributor Author

Sorry to ping you @williamFalcon, but this is not resolved.

@williamFalcon
Contributor

We have parity tests now with an exact performance match...
Can you provide a colab where you can reproduce this behavior?

@mpariente
Contributor Author

We have parity tests now with an exact performance match...
Can you provide a colab where you can reproduce this behavior?

I would not call it an exact performance match, actually: performance matches between pytorch-lightning and torch, true. But if you change the version of pytorch-lightning, the performance changes, and it shouldn't, because the seeds are exactly the same.

So something is happening under the hood. See those lines for example: in 0.6.0, the train dataloader is also changed by Lightning, right?

I don't think these parity tests are as valuable as they should be.

@williamFalcon
Contributor

williamFalcon commented Apr 13, 2020

The performance comparison has to be against pure PyTorch, because that's the bound for speed and accuracy. Comparing across Lightning versions makes no sense.

Again, I can't help without a real example that breaks on Colab. Every other time anyone has brought up a performance difference, they've ended up finding a bug in their own code.

Happy to fix if something is broken, but we need tangible proof to find a possible problem.

@mpariente
Contributor Author

The performance comparison has to be against pure PyTorch, because that's the bound for speed and accuracy. Comparing across Lightning versions makes no sense.

But I don't think it qualifies as pure PyTorch; everything still comes from a LightningModule.

Again, I can't help without a real example that breaks on Colab. Every other time anyone has brought up a performance difference, they've ended up finding a bug in their own code.

Happy to fix if something is broken, but we need tangible proof to find a possible problem.

I understand. I'll try to build an example that fails next week. Thanks again.

@williamFalcon
Contributor

It's literally the same code. It's like saying 2 and (2) are different, haha. It's written this way for convenience, because the PyTorch code is exactly the same...

@mpariente
Contributor Author

I've tried 0.7.5 against 0.6.0 and got the same results on several of our architectures, so we'll finally upgrade and get all the new features you've integrated 😀
Thanks again for looking into it; I'm closing this.
