Model weights are being reset to zero on Windows #19537
Comments
@BramVanroy Thanks for reporting. I am not familiar with the COMET library, but if I can reproduce it (hopefully without needing Windows), I can probably help debug this. Is pip installing and copying this code snippet all I need to reproduce this?
Yes, pip install and the code snippet should be enough, but I don't think this occurs on non-Windows systems, so I fear you won't be able to reproduce it on another OS. (Which might explain why it has flown under the radar.)
Thanks. Could you also share the PyTorch version that was installed when you ran into this issue?
Sure, I ran it with
Ok no luck trying to reproduce on Linux with
I get
It will be a day or two before I can check on a Windows machine.
Hey @BramVanroy I confirm your observations on Windows. How the dataloader workers get created (which is when we call
There is a small section about this in the PyTorch DataLoader docs:
Since this is a limitation of the operating system and a PyTorch DataLoader implementation detail, Lightning can't really do anything here. The same would happen if the code were written in plain PyTorch. A possible workaround for you is to set
in this line here: It would be slower in general, but not noticeably so in your specific code example. For COMET in general, if they want to support prediction on Windows, the code could perhaps be changed either to not run the model in the collate function or to load the model checkpoint inside the dataloader workers (inside the collate function). There may also be a memory-sharing trick that avoids this problem, but I don't know of one. I hope this is somewhat clearer now and that the workaround is useful to you. I'm closing the issue because there is no action that can be taken in PyTorch Lightning at this moment (that I am aware of).
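The second suggestion above (loading the model inside the dataloader workers) could look roughly like the sketch below. It is not taken from COMET or PyTorch Lightning: `LazyModelCollate`, `load_model`, and `load_toy_model` are hypothetical names, and the toy "model" is a stand-in for restoring a real checkpoint. The idea is that the collate callable carries only a recipe for loading the model, so nothing model-sized is transferred when the callable is pickled for a worker process.

```python
from typing import Any, Callable, List, Optional


class LazyModelCollate:
    """Collate callable that loads its model on first use in the current process.

    Hypothetical sketch: `load_model` stands in for whatever restores the
    checkpoint; it is not an API of COMET or PyTorch Lightning.
    """

    def __init__(self, load_model: Callable[[], Any]) -> None:
        self._load_model = load_model
        self._model: Optional[Any] = None  # filled lazily, once per process

    def __getstate__(self) -> dict:
        # Drop any loaded model when the callable is pickled, so a worker
        # process reloads it instead of receiving a transferred copy.
        state = self.__dict__.copy()
        state["_model"] = None
        return state

    def __call__(self, batch: List[Any]) -> List[Any]:
        if self._model is None:
            self._model = self._load_model()
        return [self._model(sample) for sample in batch]


def load_toy_model() -> Callable[[float], float]:
    """Stand-in for loading a checkpoint; top-level so that it can be pickled."""
    return abs
```

An instance such as `LazyModelCollate(load_toy_model)` would then be passed as the `collate_fn` argument of a `DataLoader`. The cost is one model load per worker, which may or may not be acceptable for large checkpoints.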
Thanks a lot @awaelchli! I would swear that I tested that, but my memory does not serve me well. Thanks a lot for going further down the rabbit hole. The linked PR should hopefully fix the default behavior on Windows. Thanks again!
Great, no problem! The PR you created looks very good!
Bug description
In the highly popular evaluation metric for machine translation, COMET, an issue has been raised where the predictions are always zero on Windows. I can confirm this with a fresh installation of the library in a new environment (`pip install unbabel-comet`) on Windows and this snippet:
The output will be all zeroes:
I also get the following PyTorch warning but I am not sure if it is relevant:
When trying to debug this, I went down a LONG rabbit hole and found that during the prediction loop, the model's weights seem to get set to zero. I do not fully understand how exactly (no time to look into this further atm), but it occurs here when calling `iter` on the data fetcher:
pytorch-lightning/src/lightning/pytorch/loops/prediction_loop.py
Lines 179 to 181 in e1e8770
To verify, replace those lines with this
And execute the script above. If you replace those lines with this to check the weights of the final layer, you'll see that the first print contains the normal weights, but the second print, after `data_fetcher.setup`, gives all zeroes.
I am stumped as to why this would happen here when setting up a data fetcher, but PL has so many moving parts that are interconnected that it is very hard for me to debug this further. Again, this seems to have been reported only on Windows.
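For anyone repeating this check, the before/after comparison can be wrapped in a small helper so it is easy to move around while bisecting. This is only a sketch, not the code used above: `watch_weights` and `read_weights` are hypothetical names, and the accessor is deliberately generic (it returns plain floats) so the helper does not depend on any particular library.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


def all_zero(values: List[float]) -> bool:
    """True when the list is non-empty and every entry equals zero."""
    return len(values) > 0 and all(v == 0 for v in values)


@contextmanager
def watch_weights(read_weights: Callable[[], List[float]]) -> Iterator[Dict[str, bool]]:
    """Record whether the weights changed, or became all zero, inside the block.

    `read_weights` is a hypothetical accessor supplied by the caller. For a
    PyTorch layer it could be `lambda: layer.weight.detach().flatten().tolist()`.
    """
    report: Dict[str, bool] = {}
    before = list(read_weights())  # copy, so in-place changes are detected
    try:
        yield report
    finally:
        after = list(read_weights())
        report["changed"] = before != after
        report["zero_before"] = all_zero(before)
        report["zero_after"] = all_zero(after)
```

Wrapping the suspect call in `with watch_weights(...) as report:` and printing `report` afterwards shows whether that call is where the weights change.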
Running on Windows 10, Python 3.10, PL 2.2.0.post0.
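Since the behaviour appears tied to the operating system, it may help to include the default multiprocessing start method alongside the versions above when reporting similar problems. The helper below is a sketch using only the standard library; `environment_report` is a hypothetical name. Windows supports only the `spawn` start method, while other platforms may default to `fork` or `forkserver`.

```python
import multiprocessing
import platform
import sys
from typing import Dict


def environment_report() -> Dict[str, str]:
    """Collect the platform details most relevant to worker-process issues."""
    return {
        "platform": sys.platform,
        "python": platform.python_version(),
        # get_all_start_methods() lists the platform default first, and
        # reading it does not fix the start method for the process.
        "default_start_method": multiprocessing.get_all_start_methods()[0],
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```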
cc @justusschock @awaelchli