Trainer.fit() multiple times #9636

Closed
ninginthecloud opened this issue Sep 22, 2021 · 22 comments · Fixed by #9749
Labels: docs (Documentation related), loops (Related to the Loop API)

Comments

@ninginthecloud
Contributor

ninginthecloud commented Sep 22, 2021

🐛 Bug

Context

I noticed in the unit test case test_dataloaders_reset_and_attach in test_dataloaders.py that trainer.fit() was called twice with different train_dataloaders. (code pointer)

    model = BoringModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=1)

    # 1st fit
    trainer.fit(model, train_dataloaders=dataloader_0, val_dataloaders=dataloader_1)
    assert trainer.train_dataloader.loaders.dataset is dataloader_0.dataset
    assert trainer.val_dataloaders[0].dataset is dataloader_1.dataset
    # 2nd fit
    trainer.fit(model, train_dataloaders=dataloader_2, val_dataloaders=dataloader_3)
    assert trainer.train_dataloader.loaders.dataset is dataloader_2.dataset
    assert trainer.val_dataloaders[0].dataset is dataloader_3.dataset

The original test case succeeds under the current implementation because the train/val dataloaders are reattached before fit_loop.run() is called (code pointer):

# reload data when needed
model = self.lightning_module

self.reset_train_val_dataloaders(model)

self.fit_loop.trainer = self
self.fit_loop.run()

However, the second fit_loop can never actually run: the first fit_loop properly stops once max_epochs or max_steps is reached, at which point fit_loop.done = True, which in turn makes fit_loop.skip = True (code pointer). Without initializing a new trainer, the second fit_loop run is simply skipped (code pointer).
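
For illustration, a minimal sketch of the behaviour (TinyModel and the dataloader names are placeholders, not the BoringModel from the test):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from pytorch_lightning import LightningModule, Trainer


    class TinyModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch[0]).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    dataloader_a = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)
    dataloader_b = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)

    model = TinyModel()
    trainer = Trainer(max_epochs=1, limit_train_batches=2)

    trainer.fit(model, train_dataloaders=dataloader_a)  # runs one epoch, after which fit_loop.done is True
    trainer.fit(model, train_dataloaders=dataloader_b)  # dataloader_b is attached, but fit_loop.skip makes this a no-op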

Discussion:

Do we allow users to start multiple trainer.fit() calls with train_dataloaders? I understand the need for trainer.validate()/test()/predict(), but I think allowing trainer.fit() to be called multiple times could be a problem. One edge case I can think of for calling trainer.fit() multiple times is when trainer.fit() is interrupted by an early-stopping condition and fit is resumed with different training data. At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

Pitch

  • Update this test case by removing the second fit call
  • Document the change

Environment

  • PyTorch Lightning Version (e.g., 1.3.0):
  • PyTorch Version (e.g., 1.8)
  • Python version:
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @Borda @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud

ninginthecloud added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Sep 22, 2021
@ninginthecloud
Contributor Author

ninginthecloud commented Sep 22, 2021

This issue is related to the fix of issue #9502
cc: @carmocca and @tchaton

ninginthecloud added this to To do in Sprint Q3-7: 20 Sep - 1 Oct via automation on Sep 22, 2021
@carmocca
Contributor

Calling fit() multiple times needs to be supported. However, there are several bugs related to it currently.

Do we allow users to start multiple trainer.fit() calls with train_dataloaders?

Yes

At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

I don't think this is unexpected as long as you know that the trainer state does not get reset on the next fit call

carmocca moved this from To do to To be approved in Sprint Q3-7: 20 Sep - 1 Oct on Sep 22, 2021
@tchaton
Contributor

tchaton commented Sep 22, 2021

This issue is related to the fix of issue #9502
cc: @carmocca and @tchaton

Hey @ninginthecloud,

If you have reached the total max_epochs, what do you expect the right behaviour to be: load the new dataloaders or keep the previous ones? I would align more with the latter, which is the current behaviour.
Therefore, the test might have to be updated so that it doesn't complete on the first fit.

@awaelchli as he designed those tests.

Best,
T.C

@awaelchli
Member

awaelchli commented Sep 22, 2021

@ninginthecloud I introduced the tests here in the original issue #8442. The problem was that the reset methods would not reset if a dataloader was already attached. This is problematic if we exit fit early and want to run a second fit with a new dataloader. Or, more importantly, when we run trainer.fit() with one validation dataloader and then trainer.validate(), we expect it to reload the validation dataloader instead of using the old one.

IMO it's not really a question of whether we support calling fit() in a sequence or not. It's more about the reset logic and being able to switch the dataloaders if we want to, instead of Lightning silently keeping the old one attached. That's what the tests are mainly about.
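
As an illustration, the kind of sequence these tests protect looks roughly like this (val_dl_old / val_dl_new, train_dl, model and trainer are placeholders along the lines of the sketch earlier in the thread, and the module is assumed to define a validation_step):

    # fit with one validation dataloader ...
    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl_old)

    # ... later, validate with a different one: the reset logic should attach val_dl_new
    # rather than silently keeping val_dl_old
    trainer.validate(model, dataloaders=val_dl_new)
    # after this call, trainer.val_dataloaders should reference val_dl_new, not val_dl_old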

@ninginthecloud
Contributor Author

Hi @awaelchli, thanks for the comment and for giving me more context about the problem!

when we run trainer.fit() with one validation dataloader and then trainer.validate() we expect it to reload the validation dataloader instead of using the old one.
I think this logic makes sense, and the current behavior also aligns with it, because EvaluationLoop has a different stopping condition: it stops when num_val_batches or num_sanity_val_batches is reached.

I think the issue I raised is about when fit_loop starts and stops (controlled by fit_loop.skip and fit_loop.done).

Let's reuse your example: if we run trainer.fit(train_dataloader1, val_dataloader1) and then another trainer.fit(train_dataloader2, val_dataloader2), the second trainer.fit() does not run fit_loop, even though train_dataloader2 and val_dataloader2 are loaded, because fit_loop already reached max_epochs/max_steps during the first fit.

As @tchaton suggested, I would also lean towards conserving the previous fit result, but we need to warn users that the second fit_loop does not start. Additionally, do we want the second fit_loop to be able to validate on the new val_dataloader without training on the new train_dataloader?

@ninginthecloud
Contributor Author

ninginthecloud commented Sep 22, 2021

Calling fit() multiple times needs to be supported. However, there are several bugs related to it currently.

Do we allow users to start multiple trainer.fit() calls with train_dataloaders?

Yes

At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

I don't think this is unexpected as long as you know that the trainer state does not get reset on the next fit call

For the second fit, users pass a new train_dataloader or val_dataloader and may expect fit_loop to run with the new input. But the current logic makes fit_loop just silently stop, which could be confusing to users who call fit() multiple times.

@carmocca
Contributor

If the stopping condition in the loops has been met, training will not continue. This is unrelated to the dataloaders; the value for max_epochs needs to be changed. I'm fine with adding a warning in that case.
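
For example, something along these lines (a rough sketch reusing the placeholder model and dataloaders from the sketch earlier in the thread; it assumes max_epochs lives on trainer.fit_loop and can be raised between calls):

    trainer = Trainer(max_epochs=5)
    trainer.fit(model, train_dataloaders=dataloader_a)   # stops once epoch 5 is reached, so fit_loop.done is True

    trainer.fit_loop.max_epochs += 5                     # raise the limit so the stopping condition is no longer met
    trainer.fit(model, train_dataloaders=dataloader_b)   # this second call now actually trains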

@ninginthecloud
Contributor Author

Thanks for all the valuable discussion~ Since this change won't impact any existing behavior and we just want to add a warning for users when fit_loop does not start, it's not a particularly urgent PR. Do you think I can mark this issue as a good first issue for open-source community members to pick up? cc: @carmocca, @awaelchli and @ananthsub

ninginthecloud added the good first issue (Good for newcomers) label on Sep 23, 2021
@zhaoedf

zhaoedf commented Sep 26, 2021

Well, I am working on incremental learning, and using trainer.fit() multiple times is an urgent need for me.
So it would be nice if PL could support such behavior; otherwise I will have to create a new trainer for every new fit.
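
In code, that workaround is roughly (task_dataloaders is a placeholder for one DataLoader per task, and the same module is reused so the weights carry over):

    for task_dataloader in task_dataloaders:
        trainer = Trainer(max_epochs=5)   # a fresh Trainer per task, so its fit_loop starts from scratch
        trainer.fit(model, train_dataloaders=task_dataloader)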

@tchaton
Contributor

tchaton commented Sep 27, 2021

Dear @zhaoedf,

You might want to check this out: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash/image/classification/integrations/baal/loop.py#L31

Lightning 1.5 would support Loop Customisation as an early feature. Here is how to use it: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash_examples/integrations/baal/image_classification_active_learning.py#L42

Best, T.C

@jinyoung-lim
Contributor

jinyoung-lim commented Sep 28, 2021

I can take a stab at this. Based on discussions above, it seems like a warning when fit_loop does not start would be useful!
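
A minimal sketch of what such a guard could look like (purely an illustration of the idea, not the actual change in the PR; the placement and message wording are made up):

    from pytorch_lightning.utilities import rank_zero_warn

    # hypothetical check at the point where the fit loop decides to skip its run
    if self.skip:
        rank_zero_warn(
            "`Trainer.fit()` was called, but the fit loop is already done"
            " (`max_epochs`/`max_steps` was reached in a previous call)."
            " Increase the limits or create a new Trainer to train further."
        )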

@ninginthecloud
Contributor Author

Cool, @jinyoung-lim, do you want to work on this issue? Let me know, and I will assign it to you, thanks~

@ninginthecloud
Contributor Author

@tchaton I'm interested in learning more about the Loop Customization feature! Is there a plan or doc associated with it? Specifically, what's the goal of having a customized loop vs. extending the current base loop in Lightning?

@jinyoung-lim
Contributor

Cool, @jinyoung-lim, do you want to work on this issue? Let me know, and I will assign it to you, thanks~

Yes! Feel free to assign to me :).

@ninginthecloud
Contributor Author

Thanks for working on this issue @jinyoung-lim 😃! Feel free to tag me if you have any questions~

@jinyoung-lim
Contributor

jinyoung-lim commented Sep 29, 2021

Hi all - I raised my initial PR (linked above). Would love some feedback!

Also, some test cases (listed below) don't seem to be passing, and I'm not sure how to go about debugging them since there isn't much context.

  1. azure-pipelines / PL.pytorch-lightning IPU build failing on
E       AttributeError: type object 'TrainerDataLoadingMixin' has no attribute 'replace_sampler'

which seems like a rebase error.

  2. CircleCI Checks / ci-tests TPU-tests failing on (could not find an apparent error):
Deploy the job on the kubernetes cluster

cc. @ninginthecloud, @carmocca, @awaelchli, @tchaton

Update 09/30/21: All checks passed except the ones that need a maintainer to approve running workflows. No more errors (1 and 2 above).

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

You might want to check this out: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash/image/classification/integrations/baal/loop.py#L31

Lightning 1.5 would support Loop Customisation as an early feature. Here is how to use it: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash_examples/integrations/baal/image_classification_active_learning.py#L42

Best, T.C

Could you provide me with more info? I can't find anything about an "Active loop" in the lightning-flash docs.
Besides, does lightning-flash work well with pytorch-lightning? Is it compatible? Is there any other way for me to customize my loop? I don't think the ActiveLearningLoop will help me, since I am working on incremental learning.

@tchaton
Contributor

tchaton commented Sep 29, 2021

Dear @zhaoedf,

Lightning Flash is built on top of Lightning and built by the same team, so yes it should work fine :)

Here is the documentation about it: https://lightning-flash.readthedocs.io/en/latest/integrations/baal.html

The ActiveLearning Loop is just an example to get inspiration from.

You might want to check this library for incremental learning: https://github.com/lebrice/Sequoia

Best,
T.C

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

Lightning Flash is built on top of Lightning and built by the same team, so yes it should work fine :)

Here is the documentation about it: https://lightning-flash.readthedocs.io/en/latest/integrations/baal.html

The ActiveLearning Loop is just an example to get inspiration from.

You might want to check this library for incremental learning: https://github.com/lebrice/Sequoia

Best, T.C

That is for active learning, right? Can I directly use it for incremental learning? If not, how can I customize my own loop to fit the needs of incremental learning?

@tchaton
Contributor

tchaton commented Sep 29, 2021

Dear @zhaoedf,

Yes, this is for ActiveLearning.

Yes, I believe it could be adapted for incremental learning. Here is the main loop within the Sequoia codebase: https://github.com/lebrice/Sequoia/blob/c2174cde7370fc42e6ebc35ce07ada571b2b265d/sequoia/settings/assumptions/incremental.py#L176

Here is the in-progress doc for Loop Customization: https://131437-178626720-gh.circle-artifacts.com/0/html/extensions/loops.html

And its PR: #9609

I hope it helps; any feedback is welcome.

Best,
T.C
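
For the incremental-learning case discussed above, a plain driver around repeated fit() calls is also an option while the loop API is still in progress (a rough sketch; fit_incrementally, task_dataloaders and epochs_per_task are made-up names, and it relies on fit_loop.max_epochs being mutable as mentioned earlier in the thread):

    def fit_incrementally(trainer, model, task_dataloaders, epochs_per_task=5):
        """Sketch: train the same LightningModule on a sequence of tasks with one Trainer."""
        for task_id, train_dl in enumerate(task_dataloaders):
            # raise the limit so fit_loop.done becomes False again before each new task
            trainer.fit_loop.max_epochs = (task_id + 1) * epochs_per_task
            trainer.fit(model, train_dataloaders=train_dl)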

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

Yes, this is for ActiveLearning.

Yes, I believe it could be adapted for incremental learning. Here is the main loop within the Sequoia codebase: https://github.com/lebrice/Sequoia/blob/c2174cde7370fc42e6ebc35ce07ada571b2b265d/sequoia/settings/assumptions/incremental.py#L176

Here is the in-progress doc for Loop Customization: https://131437-178626720-gh.circle-artifacts.com/0/html/extensions/loops.html

And its PR: #9609

I hope it helps; any feedback is welcome.

Best, T.C

Wow, these links really help me a lot. I will look into them and get back to you if I have any problems.
Thanks a lot!

tchaton moved this from To do to To be approved in Lightning Sprint on Nov 15, 2021
carmocca removed the help wanted (Open to be worked on) and good first issue (Good for newcomers) labels on Feb 3, 2022
carmocca added the loops (Related to the Loop API) label on Feb 3, 2022
carmocca removed this from To be approved in Lightning Sprint on Feb 3, 2022
carmocca added the docs (Documentation related) label and removed the bug (Something isn't working) label on Feb 3, 2022
@GF-Huang

So, how do I fit multiple times? First I train for 5 epochs, then it finishes, but I think it's not enough (the loss is still high) and I want to fit for more epochs.
