Trainer.fit() multiple times #9636

Closed
ninginthecloud opened this issue Sep 22, 2021 · 22 comments · Fixed by #9749
Labels: docs (Documentation related), loops (Related to the Loop API)

Comments

@ninginthecloud
Contributor

ninginthecloud commented Sep 22, 2021

🐛 Bug

Context

I noticed in the unit test case test_dataloaders_reset_and_attach in test_dataloaders.py that trainer.fit() was called twice with different train_dataloaders. (code pointer)

    model = BoringModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=1)

    # 1st fit
    trainer.fit(model, train_dataloaders=dataloader_0, val_dataloaders=dataloader_1)
    assert trainer.train_dataloader.loaders.dataset is dataloader_0.dataset
    assert trainer.val_dataloaders[0].dataset is dataloader_1.dataset
    # 2nd fit
    trainer.fit(model, train_dataloaders=dataloader_2, val_dataloaders=dataloader_3)
    assert trainer.train_dataloader.loaders.dataset is dataloader_2.dataset
    assert trainer.val_dataloaders[0].dataset is dataloader_3.dataset

The original test case succeeds under the current implementation because the train/val dataloaders are reattached before fit_loop.run() is called (code pointer):

# reload data when needed
model = self.lightning_module

self.reset_train_val_dataloaders(model)

self.fit_loop.trainer = self
self.fit_loop.run()

However, the second fit_loop can never actually run: the first fit_loop properly stops once max_epochs or max_steps is reached, at which point fit_loop.done = True, which in turn makes fit_loop.skip = True (code pointer). Without initializing a new trainer, the second fit_loop run is simply skipped (code pointer).
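
For illustration, a minimal sketch of the behaviour (TinyModel and the dataloader names are placeholders, not the BoringModel from the test):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from pytorch_lightning import LightningModule, Trainer


    class TinyModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch[0]).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    dataloader_a = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)
    dataloader_b = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)

    model = TinyModel()
    trainer = Trainer(max_epochs=1, limit_train_batches=2)

    trainer.fit(model, train_dataloaders=dataloader_a)  # runs one epoch, after which fit_loop.done is True
    trainer.fit(model, train_dataloaders=dataloader_b)  # dataloader_b is attached, but fit_loop.skip makes this a no-op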

Discussion:

Do we allow users to start multiple trainer.fit() calls with train_dataloaders? I understand the need for trainer.validate()/test()/predict(), but I think allowing trainer.fit() to be called multiple times could be a problem. One edge case I can think of for calling trainer.fit() multiple times is when trainer.fit() is interrupted by an early-stopping condition and fit is resumed with different training data. At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

Pitch

  • Update this test case by removing the second fit call
  • Document the change

Environment

  • PyTorch Lightning Version (e.g., 1.3.0):
  • PyTorch Version (e.g., 1.8)
  • Python version:
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @Borda @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud

ninginthecloud added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Sep 22, 2021
@ninginthecloud
Contributor Author

ninginthecloud commented Sep 22, 2021

This issue is related to the fix of issue #9502
cc: @carmocca and @tchaton

ninginthecloud added this to To do in Sprint Q3-7: 20 Sep - 1 Oct via automation on Sep 22, 2021
@carmocca
Contributor

Calling fit() multiple times needs to be supported. However, there are several bugs related to it currently.

Do we allow users to start multiple trainer.fit() calls with train_dataloaders?

Yes

At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

I don't think this is unexpected as long as you know that the trainer state does not get reset on the next fit call

carmocca moved this from To do to To be approved in Sprint Q3-7: 20 Sep - 1 Oct on Sep 22, 2021
@tchaton
Contributor

tchaton commented Sep 22, 2021

This issue is related to the fix of issue #9502
cc: @carmocca and @tchaton

Hey @ninginthecloud,

If you have reached the total max_epochs, what do you expect the right behaviour to be: load the new dataloaders or keep the previous ones? I would align more with the latter, which is the current behaviour.
Therefore, the test might have to be updated so that it doesn't complete on the first fit.

@awaelchli as he designed those tests.

Best,
T.C

@awaelchli
Member

awaelchli commented Sep 22, 2021

@ninginthecloud I introduced the tests here in the original issue #8442. The problem was that the reset methods would not reset if a dataloader was already attached. This is problematic if we exit fit early and want to run a second fit with a new dataloader. Or, more importantly, when we run trainer.fit() with one validation dataloader and then trainer.validate(), we expect it to reload the validation dataloader instead of using the old one.

IMO it's not really a question of whether we support calling fit() in a sequence or not. It's more about the reset logic and being able to switch the dataloaders if we want to, instead of Lightning silently keeping the old one attached. That's what the tests are mainly about.
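
As an illustration, the kind of sequence these tests protect looks roughly like this (val_dl_old / val_dl_new, train_dl, model and trainer are placeholders along the lines of the sketch earlier in the thread, and the module is assumed to define a validation_step):

    # fit with one validation dataloader ...
    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl_old)

    # ... later, validate with a different one: the reset logic should attach val_dl_new
    # rather than silently keeping val_dl_old
    trainer.validate(model, dataloaders=val_dl_new)
    # after this call, trainer.val_dataloaders should reference val_dl_new, not val_dl_old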

@ninginthecloud
Contributor Author

Hi @awaelchli, thanks for the comment and for giving me more context about the problem!

when we run trainer.fit() with one validation dataloader and then trainer.validate() we expect it to reload the validation dataloader instead of using the old one.
I think this logic makes sense, and the current behavior also aligns with it, because EvaluationLoop has a different stopping condition: it stops when num_val_batches or num_sanity_val_batches is reached.

I think the issue I raised is about when fit_loop starts and stops (controlled by fit_loop.skip and fit_loop.done).

Let's reuse your example: if we run trainer.fit(train_dataloader1, val_dataloader1) and then another trainer.fit(train_dataloader2, val_dataloader2), the second trainer.fit() does not run fit_loop, even though train_dataloader2 and val_dataloader2 are loaded, because fit_loop already reached max_epochs/max_steps during the first fit.

As @tchaton suggested, I would also lean towards conserving the previous fit result, but we need to warn users that the second fit_loop does not start. Additionally, do we want the second fit_loop to be able to validate on the new val_dataloader without training on the new train_dataloader?

@ninginthecloud
Contributor Author

ninginthecloud commented Sep 22, 2021

Calling fit() multiple times needs to be supported. However, there are several bugs related to it currently.

Do we allow users to start multiple trainer.fit() calls with train_dataloaders?

Yes

At least, we need to document this or add a warning so that users are aware that fit_loop did not actually start.

I don't think this is unexpected as long as you know that the trainer state does not get reset on the next fit call

For the second fit, users pass a new train_dataloader or val_dataloader and may expect fit_loop to run with the new input. But the current logic makes fit_loop just silently stop, which could be confusing to users who call fit() multiple times.

@carmocca
Contributor

If the stopping condition in the loops has been met, training will not continue. This is unrelated to the dataloaders; the value for max_epochs needs to be changed. I'm fine with adding a warning in that case.
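
For example, something along these lines (a rough sketch reusing the placeholder model and dataloaders from the sketch earlier in the thread; it assumes max_epochs lives on trainer.fit_loop and can be raised between calls):

    trainer = Trainer(max_epochs=5)
    trainer.fit(model, train_dataloaders=dataloader_a)   # stops once epoch 5 is reached, so fit_loop.done is True

    trainer.fit_loop.max_epochs += 5                     # raise the limit so the stopping condition is no longer met
    trainer.fit(model, train_dataloaders=dataloader_b)   # this second call now actually trains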

@ninginthecloud
Contributor Author

Thanks for all the valuable discussion~ Since this change won't impact any existing behavior and we just want to add a warning for users when fit_loop does not start, it's not a particularly urgent PR. Do you think I can mark this issue as a good first issue for open-source community members to pick up? cc: @carmocca, @awaelchli and @ananthsub

ninginthecloud added the good first issue (Good for newcomers) label on Sep 23, 2021
@zhaoedf

zhaoedf commented Sep 26, 2021

Well, I am working on incremental learning, and using trainer.fit() multiple times is an urgent need for me.
So it would be nice if PL could support such behavior; otherwise I will have to create a new trainer for every new fit.
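
In code, that workaround is roughly (task_dataloaders is a placeholder for one DataLoader per task, and the same module is reused so the weights carry over):

    for task_dataloader in task_dataloaders:
        trainer = Trainer(max_epochs=5)   # a fresh Trainer per task, so its fit_loop starts from scratch
        trainer.fit(model, train_dataloaders=task_dataloader)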

@tchaton
Contributor

tchaton commented Sep 27, 2021

Dear @zhaoedf,

You might want to check this out: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash/image/classification/integrations/baal/loop.py#L31

Lightning 1.5 would support Loop Customisation as an early feature. Here is how to use it: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash_examples/integrations/baal/image_classification_active_learning.py#L42

Best, T.C

@jinyoung-lim
Contributor

jinyoung-lim commented Sep 28, 2021

I can take a stab at this. Based on discussions above, it seems like a warning when fit_loop does not start would be useful!
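
A minimal sketch of what such a guard could look like (purely an illustration of the idea, not the actual change in the PR; the placement and message wording are made up):

    from pytorch_lightning.utilities import rank_zero_warn

    # hypothetical check at the point where the fit loop decides to skip its run
    if self.skip:
        rank_zero_warn(
            "`Trainer.fit()` was called, but the fit loop is already done"
            " (`max_epochs`/`max_steps` was reached in a previous call)."
            " Increase the limits or create a new Trainer to train further."
        )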

@ninginthecloud
Contributor Author

Cool, @jinyoung-lim, do you want to work on this issue? Let me know, and I will assign it to you, thanks~

@ninginthecloud
Contributor Author

@tchaton I'm interested in learning more about the Loop Customization feature! Is there a plan or doc associated with it? Specifically, what's the goal of having a customized loop vs. extending the current base loop in Lightning?

@jinyoung-lim
Contributor

Cool, @jinyoung-lim, do you want to work on this issue? Let me know, and I will assign it to you, thanks~

Yes! Feel free to assign to me :).

@ninginthecloud
Contributor Author

Thanks for working on this issue @jinyoung-lim 😃! Feel free to tag me if you have any questions~

@jinyoung-lim
Contributor

jinyoung-lim commented Sep 29, 2021

Hi all - I raised my initial PR (linked above). Would love some feedback!

Also, some test cases (listed below) don't seem to be passing, and I'm not sure how to go about debugging them since there isn't much context.

  1. azure-pipelines / PL.pytorch-lightning IPU build failing on
E       AttributeError: type object 'TrainerDataLoadingMixin' has no attribute 'replace_sampler'

which seems like a rebase error.

  2. CircleCI Checks / ci-tests TPU-tests failing on (could not find an apparent error):
Deploy the job on the kubernetes cluster

cc. @ninginthecloud, @carmocca, @awaelchli, @tchaton

Update 09/30/21: All checks passed except the ones that need a maintainer to approve running workflows. No more errors (1 and 2 above).

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

You might want to check this out: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash/image/classification/integrations/baal/loop.py#L31

Lightning 1.5 would support Loop Customisation as an early feature. Here is how to use it: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash_examples/integrations/baal/image_classification_active_learning.py#L42

Best, T.C

Could you provide me with more info? I can't find anything about an "Active loop" in the lightning-flash docs.
Besides, does lightning-flash work well with pytorch-lightning? Is it compatible? Is there any other way for me to customize my loop? I don't think the ActiveLearningLoop will help me, since I am working on incremental learning.

@tchaton
Contributor

tchaton commented Sep 29, 2021

Dear @zhaoedf,

Lightning Flash is built on top of Lightning and built by the same team, so yes it should work fine :)

Here is the documentation about it: https://lightning-flash.readthedocs.io/en/latest/integrations/baal.html

The ActiveLearning Loop is just an example to get inspiration from.

You might want to check this library for incremental learning: https://github.com/lebrice/Sequoia

Best,
T.C

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

Lightning Flash is built on top of Lightning and built by the same team, so yes it should work fine :)

Here is the documentation about it: https://lightning-flash.readthedocs.io/en/latest/integrations/baal.html

The ActiveLearning Loop is just an example to get inspiration from.

You might want to check this library for incremental learning: https://github.com/lebrice/Sequoia

Best, T.C

That is for active learning, right? Can I directly use it for incremental learning? If not, how can I customize my own loop to fit the needs of incremental learning?

@tchaton
Contributor

tchaton commented Sep 29, 2021

Dear @zhaoedf,

Yes, this is for ActiveLearning.

Yes, I believe it could be adapted for incremental learning. Here is the main loop within the Sequoia codebase: https://github.com/lebrice/Sequoia/blob/c2174cde7370fc42e6ebc35ce07ada571b2b265d/sequoia/settings/assumptions/incremental.py#L176

Here is the in-progress doc for Loop Customization: https://131437-178626720-gh.circle-artifacts.com/0/html/extensions/loops.html

And its PR: #9609

I hope it helps; any feedback is welcome.

Best,
T.C
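
For the incremental-learning case discussed above, a plain driver around repeated fit() calls is also an option while the loop API is still in progress (a rough sketch; fit_incrementally, task_dataloaders and epochs_per_task are made-up names, and it relies on fit_loop.max_epochs being mutable as mentioned earlier in the thread):

    def fit_incrementally(trainer, model, task_dataloaders, epochs_per_task=5):
        """Sketch: train the same LightningModule on a sequence of tasks with one Trainer."""
        for task_id, train_dl in enumerate(task_dataloaders):
            # raise the limit so fit_loop.done becomes False again before each new task
            trainer.fit_loop.max_epochs = (task_id + 1) * epochs_per_task
            trainer.fit(model, train_dataloaders=train_dl)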

@zhaoedf

zhaoedf commented Sep 29, 2021

Dear @zhaoedf,

Yes, this is for ActiveLearning.

Yes, I believe it could be adapted for incremental learning. Here is the main loop within the Sequoia codebase: https://github.com/lebrice/Sequoia/blob/c2174cde7370fc42e6ebc35ce07ada571b2b265d/sequoia/settings/assumptions/incremental.py#L176

Here is the in-progress doc for Loop Customization: https://131437-178626720-gh.circle-artifacts.com/0/html/extensions/loops.html

And its PR: #9609

I hope it helps; any feedback is welcome.

Best, T.C

Wow, these links really help me a lot. I will look into them and get back to you if I have any problems.
Thanks a lot!

tchaton moved this from To do to To be approved in Lightning Sprint on Nov 15, 2021
carmocca removed the help wanted (Open to be worked on) and good first issue (Good for newcomers) labels on Feb 3, 2022
carmocca added the loops (Related to the Loop API) label on Feb 3, 2022
carmocca removed this from To be approved in Lightning Sprint on Feb 3, 2022
carmocca added the docs (Documentation related) label and removed the bug (Something isn't working) label on Feb 3, 2022
@GF-Huang

So, how do I fit multiple times? First I train for 5 epochs, then it finishes, but I think it's not enough (the loss is still high) and I want to fit for more epochs.
