Trainer.fit() multiple times #9636

Comments
Calling
Yes, I don't think this is unexpected as long as you know that the trainer state does not get reset on the next `fit()` call.
Hey @ninginthecloud, if you have reached the total max_epochs, what do you expect the right behaviour to be: load the new dataloaders or conserve the previous ones? I would align more with the latter, which is the current behaviour. cc @awaelchli as he designed those tests. Best,
@ninginthecloud I introduced the tests in the original issue #8442. The problem was that the reset methods would not reset if a dataloader was already attached. This is problematic if we exit fit early and want to run a second fit with a new dataloader, or more importantly, when we run `trainer.validate()` or `trainer.test()` with a new dataloader. IMO it's not really a question of whether we support calling fit() in a sequence or not. It's more about the reset logic and that we are able to switch the dataloaders if we want to, instead of Lightning silently keeping the old one attached. That's what the tests are mainly about.
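A minimal sketch of the pattern those tests protect, paraphrasing rather than copying the tests from #8442; `model` and `trainer` are assumed to be an already-constructed LightningModule and Trainer:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

loader_a = DataLoader(TensorDataset(torch.randn(8, 32), torch.randn(8, 1)))
loader_b = DataLoader(TensorDataset(torch.randn(8, 32), torch.randn(8, 1)))

trainer.fit(model, train_dataloaders=loader_a)
trainer.fit(model, train_dataloaders=loader_b)  # reset logic must attach loader_b,
                                                # not silently keep loader_a
trainer.validate(model, dataloaders=loader_b)   # same requirement for the eval loops
```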
Hi, @awaelchli Thanks for the comment and for giving me more context about the problem!
I think the current issue I mentioned is related to when the fit_loop starts and stops (controlled by fit_loop.skip and fit_loop.done). Let's reuse your example: what if we run `fit()` a second time after the first run has already reached its stopping condition? As @tchaton suggested, I'm also aligned more with conserving the previous fit result, but we need to warn users that the second fit_loop does not start. Additionally, do we want the second fit_loop to be able to validate with the newly passed val_dataloader without training on the new train_dataloader?
For the second fit, users who pass a new train_dataloader or val_dataloader may expect the fit_loop to run with the new input. But the current logic makes the fit_loop silently stop, which could be confusing to users who call fit() multiple times.
If the stopping condition in the loops has been met, training will not continue. This is unrelated to the dataloaders. The value for
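For illustration, here is a simplified sketch of the skip logic described in this thread; the property names come from the issue, but the bodies are illustrative assumptions rather than Lightning's actual implementation:

```python
class FitLoopSketch:
    """Schematic of the done/skip relationship; not Lightning's FitLoop."""

    max_epochs = 1
    current_epoch = 1  # e.g. after a completed first fit

    @property
    def done(self) -> bool:
        # Stopping condition: the epoch budget is exhausted.
        return self.current_epoch >= self.max_epochs

    @property
    def skip(self) -> bool:
        # A loop that is already done is skipped on the next run(),
        # regardless of which dataloaders were just attached.
        return self.done
```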
Thanks all for the valuable discussion~ Since this change won't impact any existing behavior and we just want to add a warning for users when the fit_loop does not start, it's not an urgent PR. Do you think I can set this issue to be
Well, I am working on incremental learning, and using trainer.fit() multiple times is an urgent need for me.
Dear @zhaoedf, You might want to check this out: https://github.com/PyTorchLightning/lightning-flash/blob/master/flash/image/classification/integrations/baal/loop.py#L31 Lightning 1.5 will support Loop Customisation as an early feature. Best,
I can take a stab at this. Based on discussions above, it seems like a warning when fit_loop does not start would be useful!
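As a rough illustration of the proposal (not the actual PR diff), the check could look something like this; where exactly it should live inside the Trainer is an open question:

```python
from pytorch_lightning.utilities import rank_zero_warn


def warn_if_fit_loop_skipped(trainer) -> None:
    # Hypothetical helper: warn when fit() will be a no-op because the
    # loop's stopping condition was already met on a previous fit.
    if trainer.fit_loop.skip:
        rank_zero_warn(
            "`trainer.fit()` was called, but the fit loop's stopping condition "
            "(e.g. `max_epochs`/`max_steps`) is already met, so no training will "
            "run. Increase the limit or create a new Trainer to fit again."
        )
```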
Cool, @jinyoung-lim, do you want to work on this issue? Let me know, and I will assign it to you, thanks~
@tchaton I'm interested in knowing more about the Loop Customization feature! Is there a plan or doc associated with it? Specifically, what's the goal of having a customized loop vs. extending the current base loop in Lightning?
Yes! Feel free to assign to me :).
Thanks for working on this issue @jinyoung-lim 😃! Feel free to tag me if you have any questions~
Hi all - I raised my initial PR (linked above). Would love some feedback! Also, some test cases (specified below) don't seem to be passing, and I'm not sure how to go about debugging them as there isn't much context.
which seems like a rebase error.
cc @ninginthecloud, @carmocca, @awaelchli, @tchaton Update 09/30/21: All checks passed except for the ones that need a maintainer to approve running workflows. No more errors (1, 2 above).
Could you provide me with more info? I can't find anything about "Active Loop" in the lightning-flash docs.
Dear @zhaoedf, Lightning Flash is built on top of Lightning and built by the same team, so yes, it should work fine :) Here is the documentation about it: https://lightning-flash.readthedocs.io/en/latest/integrations/baal.html The ActiveLearning Loop is just an example to get you inspired. You might want to check out this library for incremental learning: https://github.com/lebrice/Sequoia Best,
That is for active learning, right? Can I directly use it for incremental learning? If not, how can I customize my own loop to fit the needs of incremental learning?
Dear @zhaoedf, Yes, this is for active learning, and yes, I believe it could be adapted for incremental learning. Here is the main loop within the Sequoia codebase: https://github.com/lebrice/Sequoia/blob/c2174cde7370fc42e6ebc35ce07ada571b2b265d/sequoia/settings/assumptions/incremental.py#L176 Here is the in-progress doc for Loop Customization: https://131437-178626720-gh.circle-artifacts.com/0/html/extensions/loops.html And its PR: #9609 I hope this helps, and any feedback is welcome. Best,
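To make this concrete, here is a hedged sketch of an incremental-learning outer loop built on the `done`/`reset`/`advance` hooks from the in-progress Loop Customization docs linked above; the `tasks` input and the way the next dataloader is attached are illustrative assumptions, not the Flash or Sequoia implementation:

```python
from pytorch_lightning.loops import Loop


class IncrementalLearningLoop(Loop):
    """Runs the trainer's inner fit loop once per task (sketch only)."""

    def __init__(self, fit_loop, tasks):
        super().__init__()
        self.fit_loop = fit_loop  # the Trainer's original FitLoop
        self.tasks = tasks        # hypothetical: one train DataLoader per task
        self.task_idx = 0

    @property
    def done(self) -> bool:
        # Finish once every task has been trained on.
        return self.task_idx >= len(self.tasks)

    def reset(self) -> None:
        self.task_idx = 0

    def advance(self) -> None:
        # Assumption for illustration: attach the next task's data, then
        # run one full fit cycle over it with the inner loop.
        self.trainer.train_dataloader = self.tasks[self.task_idx]
        self.fit_loop.run()
        self.task_idx += 1


# Per the linked docs, a custom loop replaces the trainer's fit loop:
# trainer.fit_loop = IncrementalLearningLoop(trainer.fit_loop, tasks)
```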
Wow, these links really help me a lot. I will look into them and get back to you if I have any problems.
So, how to
🐛 Bug
Context
I noticed in the unit test case `test_dataloaders_reset_and_attach` in `test_dataloaders.py` that `trainer.fit()` was called twice with different train_dataloaders (code pointer). The original test case succeeds under the current implementation because the train/val_dataloaders are reattached before `fit_loop.run()` is called (code pointer).
However, the second fit_loop can never properly run: the first fit_loop properly stops when either `max_epochs` or `max_steps` is reached, at which point `fit_loop.done = True`, which in turn makes `fit_loop.skip = True` (code pointer). Without initializing a new trainer, the second fit_loop run is simply skipped (code pointer).
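For reference, here is a minimal, self-contained sketch that reproduces the behaviour described above; the toy module and random data are illustrative stand-ins, not the test from the issue:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    return DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=16)


model = ToyModel()
trainer = Trainer(max_epochs=1)
trainer.fit(model, train_dataloaders=loader())  # trains one epoch; fit_loop.done becomes True
trainer.fit(model, train_dataloaders=loader())  # new loader is attached, but the run is skipped
```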
Discussion
Do we allow users to call `trainer.fit()` multiple times with different train_dataloaders? I understand the need for `trainer.validate()/test()/predict()`, but I think allowing `trainer.fit()` to be called multiple times could be a problem. One edge case I can think of for calling `trainer.fit()` multiple times is when the first fit is interrupted by an early stopping condition and fitting is resumed with different training data. At the very least, we need to document this or add a warning so that users are aware that the fit_loop did not actually start.
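Continuing from the repro sketch above (reusing its `model` and `loader`), here are two possible ways around the skip; whether mutating `trainer.fit_loop.max_epochs` is a supported pattern long-term is an assumption, not a documented guarantee:

```python
from pytorch_lightning import Trainer

# Option 1: raise the stopping condition on the existing loop so that
# fit_loop.done becomes False again before the second fit.
trainer.fit_loop.max_epochs += 1
trainer.fit(model, train_dataloaders=loader())

# Option 2: build a fresh Trainer, whose fit loop starts from a clean state.
new_trainer = Trainer(max_epochs=1)
new_trainer.fit(model, train_dataloaders=loader())
```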
Pitch
Environment

- How you installed PyTorch (`conda`, `pip`, source):
- Output of `torch.__config__.show()`:

Additional context
cc @Borda @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud