No training loss logged + val metrics logged incorrectly when using lightning as a backend #18849
Unanswered
NickleDave asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi 👋
I develop a domain-specific library and recently we switched to using lightning as a backend.
We have realized that our logging is not working correctly though (we're using the default Tensorboard logger).
vocalpy/vak#726
Our calls to `self.log` with the loss in the `training_step` do not get saved in the events file, and all the metrics we log on `validation_step` appear to be incorrect--see first attached image.

![image](https://private-user-images.githubusercontent.com/11934090/277658591-0425be69-70a9-41c3-ae84-297f5b04bce8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MDA5MTcsIm5iZiI6MTcxODkwMDYxNywicGF0aCI6Ii8xMTkzNDA5MC8yNzc2NTg1OTEtMDQyNWJlNjktNzBhOS00MWMzLWFlODQtMjk3ZjViMDRiY2U4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDE2MjMzN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQzZWZmNzU0ZDI2ZmQzMmM1YjEyYmRmMjE0YzFlYmY1YjgyY2IzNzhlODU1OGM2MDAxOTk5YmM2MDA2ZjI5ZjkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.K0NJ-E8dns3-VcAUVK9kve37uLDEbUhbCCVM3bORUiQ)

However, if I run the exact same model in a script then everything gets logged correctly (tested with this notebook)--see second + third attached image.
![image](https://private-user-images.githubusercontent.com/11934090/277658621-a3dec6f1-be3a-4355-a9f9-2e75e7361030.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MDA5MTcsIm5iZiI6MTcxODkwMDYxNywicGF0aCI6Ii8xMTkzNDA5MC8yNzc2NTg2MjEtYTNkZWM2ZjEtYmUzYS00MzU1LWE5ZjktMmU3NWU3MzYxMDMwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDE2MjMzN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTE1ZTRkY2NiYmU1MDkyMzRmNzZiNDJjY2Q1MTU3OTk4MWMyMTRjYTEzZmQ0NTI0MWQ4YTc1YmIyNDkyZjM5YWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.f7hNTbmJxV4NQFTUNTH2_Lz494DIjgbUkq2Wr_v4PSg)
![image](https://private-user-images.githubusercontent.com/11934090/277658647-cbf6c4ce-e6a3-4161-901e-c1d79abe47bd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MDA5MTcsIm5iZiI6MTcxODkwMDYxNywicGF0aCI6Ii8xMTkzNDA5MC8yNzc2NTg2NDctY2JmNmM0Y2UtZTZhMy00MTYxLTkwMWUtYzFkNzlhYmU0N2JkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDE2MjMzN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZlODVjZTg0YTJkNGE2NGIyMGYxNjQ4MGIwZjRjMGFkNTdkNjEwYmRmY2QwNzc0YmY1YWJkOGI4NTgxYmUzMjAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.dzQSGjkwm9OG7VtJFZkXVxWpqQMkfBMkh1jZNpaAUto)
Things that might matter:

- We call `type` directly to make a new sub-class. I know this system is a bit convoluted--we're trying to make it as easy as possible for someone to declare a model--but I'm not sure why it would break logging specifically? I can tell from the progress bar that training is working as expected and the model converges, even though the logs are incorrect.
- We run everything through our library's `__main__` function (i.e., `vak train` calls `vak.__main__.main`). But according to the `Trainer` docstring I would guess this is a good practice? It also means that the DDP strategy doesn't work for us, but that's another issue...

Any guesses as to why running with our library as the main process messes up lightning's logging?
My best guess is that it's related to the sub-classing: the script above removes that sub-classing, and logging then works correctly.
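To make the sub-classing concrete, here is a minimal sketch of building a sub-class with `type` and checking that its hooks resolve as expected. All names are hypothetical: `BaseModel` is a plain-Python stand-in for `LightningModule`, and `make_model_class` is not the actual vak code. It does not reproduce the logging problem; it only shows the kind of check that could rule the generated class in or out.

```python
class BaseModel:
    """Plain-Python stand-in for a framework base class (hypothetical)."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        # Record the value so we can inspect what was "logged".
        self.logged[name] = value

    def training_step(self, batch):
        raise NotImplementedError


def make_model_class(name, definition):
    """Build a sub-class of BaseModel with `type`.

    Copies every non-dunder attribute of `definition` into the new class.
    """
    namespace = {
        key: value
        for key, value in vars(definition).items()
        if not key.startswith("__")
    }
    return type(name, (BaseModel,), namespace)


class MyModelDefinition:
    """What a user might write to declare a model (hypothetical)."""

    def training_step(self, batch):
        loss = sum(batch) / len(batch)
        self.log("train_loss", loss)
        return loss


MyModel = make_model_class("MyModel", MyModelDefinition)
model = MyModel()
loss = model.training_step([1.0, 2.0, 3.0])

# The generated class should behave like a hand-written sub-class:
# the hook lives on MyModel itself, and calling it reaches BaseModel.log.
print(MyModel.__mro__)
print("training_step" in vars(MyModel))
print(model.logged)
```

With the real classes, the equivalent check would be to confirm that `training_step` and `validation_step` on the generated class are the functions you expect, and that `self.log` inside them is the base class's method rather than something shadowed during class creation.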