pseudo dice goes to zero a little after running nnUNetTrainerDiceLoss #1395
Comments
The dice loss alone is pretty shitty, that's why nnU-Net always uses it in combination with CE. I suggest you use the default first and only then start to make changes. This makes it easier to see when and why things go south.
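For illustration, here is a generic PyTorch sketch of a combined soft-Dice + cross-entropy loss, showing the kind of combination referred to above. This is not nnU-Net's actual DC_and_CE_loss implementation; shapes, weighting, and smoothing are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, smooth: float = 1e-5):
    # logits: (B, C, ...) raw network outputs, target: (B, ...) integer labels
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    target_onehot = F.one_hot(target, num_classes=logits.shape[1])
    # move the class axis next to the batch axis: (B, ..., C) -> (B, C, ...)
    target_onehot = target_onehot.movedim(-1, 1).float()

    spatial_dims = tuple(range(2, logits.ndim))
    intersection = (probs * target_onehot).sum(spatial_dims)
    denom = probs.sum(spatial_dims) + target_onehot.sum(spatial_dims)
    dice = (2 * intersection + smooth) / (denom + smooth)

    # equal weighting of the two terms, as a simple default
    return ce + (1 - dice.mean())
```

The CE term gives a well-behaved gradient even when a class is absent from a patch, which is one reason the pure Dice loss tends to be less stable on its own.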
Hey @mahanpro, did Fabian's response answer your question satisfactorily, or do you still have any questions? If not, I would close this issue. Best,
I got the same problem, but I used the default loss function of nnU-Net v2. Out of 5 folds, the problem appeared in 3 folds, where the loss values worsened and the pseudo dice got stuck at zero afterward. Please see the attached training progress plots. The issue arose very quickly in one fold (see the second image). My training command relied on default parameters for everything. Specifically, my training command is
Note: I am not sure whether I should open a new issue on GitHub, but I think appending to this old one may be better, since we can probably get further comments from the OP and help each other identify the root cause of the issue.
To me this looks like a training instability due to NaNs. Do you by any chance get NaNs somewhere? That will break the model. NaNs are usually caused by overflows/underflows due to large values in mixed precision training. Should there be NaNs, you could try to train with full precision (fp32), which will make the training more resilient against NaNs. Could you also provide me with information on how large your dataset is? If it's fairly large, you might sample some problematic cases that could induce this instability. Maybe @constantinulrich or @FabianIsensee can also chime in with what the most common causes of training instability are, as they have trained significantly more models than me.
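If you want to check for NaNs directly, a minimal, generic PyTorch sketch like the following can help. It is not part of nnU-Net; where you call it (e.g. once per training step) is up to you, and the function names are placeholders.

```python
import torch

def has_bad_values(t: torch.Tensor) -> bool:
    # True if the tensor contains NaN or +/-Inf
    return not torch.isfinite(t).all().item()

def check_step(network: torch.nn.Module, output: torch.Tensor, loss: torch.Tensor):
    # Call after the backward pass to inspect outputs, loss, and gradients.
    if has_bad_values(output):
        print("NaN/Inf detected in network output")
    if has_bad_values(loss):
        print("NaN/Inf detected in loss")
    for name, p in network.named_parameters():
        if p.grad is not None and has_bad_values(p.grad):
            print(f"NaN/Inf detected in gradient of {name}")
```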
Thank you for your response. However, I did not see anything about NaNs in the log files; I attached them here. If there is anything somewhat unusual about the data, I think the radiologist (or a medical technician) marked some outer regions of the CT with -3000 or some value like that. (Sorry that I cannot share the CT data; I am not allowed to do so.) I can change those values to -1000 to make them more similar to a common air region in a CT image, but I think that should not break the model this badly. There must be something else. But if you think I should try changing those -3000 HU values to -1000, please let me know. dataset_fingerprint_for_debugging.txt
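Should that replacement be worth trying, a minimal sketch could look like the following, assuming the CT is loaded with SimpleITK; the file names and the -2000 threshold are placeholders, not values from this dataset.

```python
import SimpleITK as sitk

img = sitk.ReadImage("case_0001.nii.gz")    # placeholder file name
arr = sitk.GetArrayFromImage(img)

# map the out-of-range marker values (e.g. -3000) to ordinary air (-1000 HU)
arr[arr <= -2000] = -1000

out = sitk.GetImageFromArray(arr)
out.CopyInformation(img)                    # keep spacing/origin/direction
sitk.WriteImage(out, "case_0001_fixed.nii.gz")
```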
Thanks for providing the files and giving a lot of context. Unfortunately, I don't know what exactly could cause this issue. What you could try is lowering the learning rate or disabling fp16 to increase numerical stability, but this should not be necessary for your use case. Maybe @FabianIsensee can help?
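As a rough sketch of how one might lower the learning rate without touching the core code, a custom trainer subclass along these lines is one option. The module path and the initial_lr attribute are assumptions based on the nnU-Net v2 code layout, and the class name is made up; this is not an official trainer.

```python
# save inside the nnU-Net v2 trainer package so -tr can discover it (assumption)
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer

class nnUNetTrainerLowLR(nnUNetTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # assumed default is 1e-2; lower it for more numerical stability
        self.initial_lr = 1e-3
```

Training would then be started with -tr nnUNetTrainerLowLR instead of the default trainer, keeping everything else unchanged.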
Hey @pinyotae, after consulting colleagues, @seziegler pointed me to the
Cheers,
Hey all, just pinging me won't work these days, unfortunately. There is too much going on 🙃
Hi @TaWald and @FabianIsensee,
Thank you, everybody. I tested with all five folds, and all Dice scores seem normal. By the way, do the authors want help with improving the documentation? I found several typos, and I think there are examples that should be added to help users/developers use nnU-Net more effectively. Perhaps I can help pinpoint the typos and write more examples for the documentation. Thank you again to the nnU-Net dev team.
If you have some specific examples in mind, feel free to open another issue to discuss them in more depth. In general, the examples to get nnUNet working were better in v1, and we are aware that v2 is not as nicely explained. Having said this, we are of course happy to add better documentation.
Hi Fabian @FabianIsensee,
First of all, thank you for your great code and implementation. It is much appreciated.
I have a segmentation task and I want to use only the Dice loss, so I pass -tr nnUNetTrainerDiceLoss to nnUNetv2_train. What happens is that the pseudo dice drops to zero dramatically after a few epochs, like 10-20 or so. I run the 3d_fullres configuration and also made sure that masks and images are aligned in the dataset.
I also tried to manipulate some of the parameters, like the fold number and -num_gpus, but it wasn't effective.
Update! The novel thing that just happened is that I submitted the same job with exactly the same parameters, but this time it did not go to zero!! I just renamed the dataset from Dataset054_TCIA to Dataset055_TCIA. I am wondering what the cause of this phenomenon is. Is it related to the patches drawn from the validation set?