
Pytorch-lightning AimLogger is finalized after fit, breaking sessions with fit and test routines #3097

Open
labrunhosarodrigues opened this issue Feb 1, 2024 · 4 comments
Labels: type / question

labrunhosarodrigues commented Feb 1, 2024

❓ Question

I am setting up a PyTorch Lightning experiment and using the AimLogger object to log training/validation losses, as well as test results.
However, while tracking during trainer.fit works perfectly, it breaks when trainer.test tries to load the model (TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file' is the final exception thrown).

My workaround was to disable the logger.finalize() call in the fit-loop teardown routine, but that shouldn't be necessary.

Does this behavior result from some change in how pytorch-lightning handles teardown that was not tracked by AimStack? Or is there anything I should set to prevent this from happening?

I am using a remote repository, but I have confirmed that there is no issue if I use a local one, which makes me think that some lock-timeout settings may be at play here...

Here is a snippet representing my setup:

import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

model = Model()  # subclass of pl.LightningModule
dataset = DataModule()  # subclass of pl.LightningDataModule
logger = AimLogger(
    repo="aim://<server_address>:53800",
    experiment="experiment"
)
callbacks = AimCallbacks()  # subclass of pl.callbacks.Callback

trainer = pl.Trainer(
    log_every_n_steps=20,
    logger=logger,
    callbacks=callbacks
)

trainer.fit(model, datamodule=dataset)
# everything works fine until here
trainer.test(ckpt_path="best", datamodule=dataset)  # raises the Timeout TypeError unless the logger's fit teardown is disabled
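
For reference, the workaround can also be expressed without editing the installed Lightning package, by making finalize a no-op on a logger subclass and finalizing explicitly at the end. This is a rough sketch building on the snippet above (DeferredAimLogger and finish are hypothetical names; the finalize signature follows the Lightning Logger API):

from aim.pytorch_lightning import AimLogger

class DeferredAimLogger(AimLogger):
    """Hypothetical subclass: swallow the automatic finalize() that
    Lightning issues during fit teardown, so the remote run stays open."""

    def finalize(self, status: str = "") -> None:
        pass  # no-op: keep the Aim run (and its remote lock) alive

    def finish(self, status: str = "success") -> None:
        # Call this yourself once fit *and* test are done.
        super().finalize(status)

logger = DeferredAimLogger(
    repo="aim://<server_address>:53800",
    experiment="experiment"
)
trainer = pl.Trainer(log_every_n_steps=20, logger=logger, callbacks=callbacks)
trainer.fit(model, datamodule=dataset)
trainer.test(ckpt_path="best", datamodule=dataset)
logger.finish()  # finalize only after the whole session is complete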

I am also adding the traceback I got:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/project_path/experiments/data_generation.py", line 59, in <module>
    main()
  File "/home/user/project_path/experiments/data_generation.py", line 55, in main
    train_model()
  File "/home/user/project_path/experiments/data_generation.py", line 50, in train_model
    print(trainer.logger.experiment)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
           ^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 80, in experiment
    self._run = Run(
                ^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
                                        ^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
labrunhosarodrigues added the type / question label on Feb 1, 2024
Michael-Tanzer commented

Hi, how did you solve this?

Did you have to change the teardown function within lightning?

    def _teardown(self) -> None:
        """This is the Trainer's internal teardown, unrelated to the `teardown` hooks in LightningModule and
        Callback; those are handled by :meth:`_call_teardown_hook`."""
        self.strategy.teardown()
        loop = self._active_loop
        # loop should never be `None` here but it can because we don't know the trainer stage with `ddp_spawn`
        if loop is not None:
            loop.teardown()
        self._logger_connector.teardown()
        self._signal_connector.teardown()

disabling the penultimate line (self._logger_connector.teardown())?
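
For example, would something like the monkeypatch below work instead of editing the installed package? (Just a sketch: it copies the private _teardown above minus the logger call, so it is brittle across pytorch-lightning versions.)

import pytorch_lightning as pl

def _teardown_keep_logger(self) -> None:
    # Same as Trainer._teardown above, minus the logger teardown.
    self.strategy.teardown()
    loop = self._active_loop
    if loop is not None:
        loop.teardown()
    # skipped: self._logger_connector.teardown()
    self._signal_connector.teardown()

pl.Trainer._teardown = _teardown_keep_logger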


Michael-Tanzer commented Apr 12, 2024

Patch notes seem to say that this is fixed (https://aimstack.readthedocs.io/en/latest/generated/CHANGELOG.html#feb-7-2024-fixes), but my code hangs when it reaches the test loop when using Aim. No error is raised, but the run is marked as finished as soon as the training loop ends, and the code then hangs when trying to log anything from the test loop.

labrunhosarodrigues (Author) commented

> Hi, how did you solve this?
>
> Did you have to change the teardown function within lightning? [...] disabling the penultimate line (self._logger_connector.teardown())?

Hi, yes, exactly. That solution is not great, though, as it forces me to manually tear down the logger once the entire process is complete.
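
Roughly, the end of my script now looks like this (finalize(status) is the standard Lightning logger hook that the patched teardown no longer calls automatically):

trainer.fit(model, datamodule=dataset)
trainer.test(ckpt_path="best", datamodule=dataset)

# With the logger teardown disabled, finalize the run manually
# once the entire fit + test session is complete.
trainer.logger.finalize("success")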

Michael-Tanzer commented

Looks like #3134 will fix this
