
pseudo dice goes to zero a little after running nnUNetTrainerDiceLoss #1395

Closed

mahanpro opened this issue Apr 11, 2023 · 11 comments
Labels: help wanted (Extra attention is needed)

mahanpro commented Apr 11, 2023

Hi Fabian @FabianIsensee,

First of all, thank you for your great code and implementation. It is much appreciated.

I have a segmentation task and I want to use only the Dice loss, so I pass -tr nnUNetTrainerDiceLoss to nnUNetv2_train. What happens is that the pseudo Dice drops to zero after a few epochs, around 10-20 or so.

I ran the 3d_fullres configuration and made sure that the masks and images are aligned in the dataset.
I also tried changing some of the parameters, such as the fold number and -num_gpus, but it wasn't effective.
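For reference, a command of this form looks roughly like the following (the fold number is a placeholder; the -tr syntax matches the nnUNetv2_train calls shown later in this thread):

nnUNetv2_train Dataset054_TCIA 3d_fullres 0 -tr nnUNetTrainerDiceLoss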

Update! The strange thing that just happened is that I submitted the same job with exactly the same parameters, but this time it did not go to zero!! I only renamed the dataset from Dataset054_TCIA to Dataset055_TCIA.
I am wondering what causes this. Is it related to the patches drawn from the validation set?

mahanpro changed the title from "pseudo dice goes to zero a little while after running nnUNetTrainerDiceLoss" to "pseudo dice goes to zero a little after running nnUNetTrainerDiceLoss" on Apr 11, 2023
FabianIsensee (Member) commented:

The Dice loss alone is pretty shitty, which is why nnU-Net always uses it in combination with CE. I suggest you use the default first and only then start making changes. This makes it easier to see when and why things go south.
The fact that it worked sometimes and not other times could be related to luck in the weight initialization. nnU-Net does not use seeding.
Best,
Fabian
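For intuition on why the combination matters, here is a minimal sketch of a combined Dice + cross-entropy loss in PyTorch. This is an illustrative simplification, not nnU-Net's actual loss implementation: CE provides dense, well-behaved per-voxel gradients, while the Dice term directly optimizes overlap.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target_onehot, smooth=1e-5):
    # logits: (N, C, ...spatial); target_onehot: same shape, one-hot encoded
    probs = torch.softmax(logits, dim=1)
    spatial_dims = tuple(range(2, logits.ndim))
    intersection = (probs * target_onehot).sum(spatial_dims)
    denominator = probs.sum(spatial_dims) + target_onehot.sum(spatial_dims)
    dice = (2 * intersection + smooth) / (denominator + smooth)
    return 1 - dice.mean()

def dice_ce_loss(logits, target_indices, target_onehot):
    # target_indices: (N, ...spatial) integer class labels for cross-entropy
    return soft_dice_loss(logits, target_onehot) + F.cross_entropy(logits, target_indices)

Used alone, the Dice term tends to be less stable, which is what the default trainer avoids by adding CE.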

TaWald (Contributor) commented Aug 9, 2023

Hey @mahanpro,

did Fabian's response answer your question satisfactorily, or do you still have any questions? If not, I would close this issue.

Best,
Tassilo

TaWald added the stale (No activity in a long time) label on Aug 15, 2023
TaWald closed this as completed on Aug 21, 2023
pinyotae commented:

I got the same problem, but I used the default loss function of nnU-Net v2. Out of 5 folds, the problem occurred in 3, where the loss values worsened and the pseudo Dice got stuck at zero afterward. Please see the attached training progress plots. The issue arose very quickly in one fold (see the second image).

My training relied on default parameters for everything. Specifically, the command was
nnUNetv2_train Dataset001_TreatmentRegion 3d_lowres 0

Note: I am not sure whether I should open a new issue on GitHub, but I think appending to this old one may be better in that we can probably get further comments from the OP and help each other identify the root cause of the issue.

[attached: two training progress plots (progress.png)]

TaWald reopened this on Sep 18, 2023
TaWald (Contributor) commented Sep 18, 2023

To me this looks like a training instability due to NaNs. Do you by any chance get NaNs somewhere? This would break the model.
Can you check whether there are NaNs in the training_log.txt? It is located in the same folder as progress.png.

NaNs are usually caused by overflows/underflows due to large values in mixed-precision training. Should there be NaNs, you could try training with full precision (fp32), which will make the training more resilient against them.
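If you want a quick check, a small script like the one below will flag NaN/Inf entries in the log (just a sketch; the filename is a placeholder for the actual training_log_*.txt in your results folder):

import re
from pathlib import Path

# Scan an nnU-Net training log for NaN/Inf entries (illustrative helper, not part of nnU-Net itself).
log_file = Path("training_log.txt")
pattern = re.compile(r"\bnan\b|\binf\b", re.IGNORECASE)
for lineno, line in enumerate(log_file.read_text().splitlines(), start=1):
    if pattern.search(line):
        print(f"line {lineno}: {line.strip()}")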

Could you also provide some info on how large your dataset is? If it's fairly large, you might be sampling some problematic cases that could induce this instability.

Maybe @constantinulrich or @FabianIsensee can also chime in on the most common causes of training instability, as they have trained significantly more models than I have.

pinyotae commented:

Thank you for your response. However, I did not see anything about NaNs in the log files; I attached them here.
Regarding the dataset size, there are only around 60 CT images, and I believe none of them is larger than 512x512x600 (x, y, z).
I attached the fingerprint and the training plans too, just in case they help.

If there is anything somewhat unusual about the data, it is that the radiologist (or a medical technician) marked some outer regions of the CT with -3000 or some value like that. (Sorry that I cannot share the CT data; I am not allowed to do so.)

I can change those values to -1000 to make them more similar to a common air region in a CT image, but I think that should not break the model this badly; there must be something else. But if you think I should try changing those -3000 HU values to -1000, please let me know.
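If you do decide to try that, a sketch along these lines would rewrite the marked regions (purely illustrative; the filenames and the -1024 threshold are assumptions, and whether those values are related to the instability is not established here):

import SimpleITK as sitk

# Clamp unusually low padding values (e.g. -3000 HU) to a typical air value (-1000 HU)
# before re-running preprocessing. Filenames are placeholders.
img = sitk.ReadImage("case_0000.nii.gz")
arr = sitk.GetArrayFromImage(img)      # numpy copy of the voxel data
arr[arr < -1024] = -1000               # map the marked outer region to air
out = sitk.GetImageFromArray(arr)
out.CopyInformation(img)               # preserve spacing/origin/direction
sitk.WriteImage(out, "case_0000_clamped.nii.gz")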

dataset_fingerprint_for_debugging.txt
nnUNetPlans_for_debugging.txt (https://github.com/MIC-DKFZ/nnUNet/files/12647099/nnUNetPlans_for_debugging.txt)
training_log_2023_9_15_21_38_42.txt
training_log_2023_9_16_17_36_26.txt

TaWald (Contributor) commented Oct 7, 2023

Thanks for providing the files and for giving a lot of context. Unfortunately, I don't know what exactly could cause this issue. What you could try is lowering the learning rate or disabling fp16 to increase numerical stability, but this should not be necessary for your use case. Maybe @FabianIsensee can help?

TaWald added the help wanted (Extra attention is needed) label and removed the stale (No activity in a long time) label on Oct 11, 2023
TaWald (Contributor) commented Oct 24, 2023

Hey @pinyotae,

after consulting colleagues, @seziegler pointed me to nnUNetTrainerDiceCELoss_noSmooth, which was recommended by Fabian here for v1 and also works in v2.
It would be nice if you tried it and reported whether it worked for you!
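For intuition on what the _noSmooth variant changes, here is a toy example (my own sketch, not nnU-Net's code) of how a nonzero smooth term scores a patch whose ground truth contains no foreground; whether this is the mechanism behind the instability seen here is not established:

import torch

def soft_dice(pred_fg, gt_fg, smooth):
    # Soft Dice for a single foreground channel (illustrative only).
    intersection = (pred_fg * gt_fg).sum()
    denominator = pred_fg.sum() + gt_fg.sum()
    return (2 * intersection + smooth) / (denominator + smooth)

gt = torch.zeros(8, 8)                            # patch with no foreground at all
single_voxel = torch.zeros(8, 8)
single_voxel[0, 0] = 1.0                          # prediction with one foreground voxel

print(soft_dice(torch.zeros(8, 8), gt, smooth=1e-5))   # -> 1.0: predicting nothing scores perfectly
print(soft_dice(single_voxel, gt, smooth=1e-5))        # -> ~1e-5: any predicted foreground is punished hard

As the name suggests, the _noSmooth trainer drops this smooth term from the Dice part of the loss, which changes how such foreground-free patches are scored.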

Cheers,
Tassilo

FabianIsensee (Member) commented:

Hey all, just pinging me won't work these days, unfortunately. There is too much going on 🙃
nnUNetTrainerDiceCELoss_noSmooth should do the trick (hopefully). Your symptoms don't quite match the prototypical situation in which it helps though. Definitely worth a try.
@TaWald fp32 training is no longer accessible in v2, as we have never had problems with mixed precision so far and didn't feel like keeping fp32 alive. Over/underflow is unlikely because there is gradient clipping. NaNs would be recognizable in progress.png because they cannot be plotted, so the lines would stop existing or have holes.
CT values are unlikely to be a problem. Before we investigate further, we should wait for the _noSmooth results.
Best,
Fabian

pinyotae commented:

Hi @TaWald and @FabianIsensee,
Thank you so much. It seems nnUNetTrainerDiceCELoss_noSmooth has solved the issue in a fold that had the problem before. I will need to confirm with all five folds, so please leave this issue open a little longer. When I finish all five folds, I will let you know again so that we can close the issue as solved (or look for another solution, but I hope not).


pinyotae commented Nov 3, 2023

Thank you, everybody. I tested all five folds, and all Dice scores seem normal.
Since some users may not understand the -tr flag and may want to see the command I used to solve the issue, here is an example from my work (I was confused about how to use -tr for a long while too):
nnUNetv2_train Dataset001_TreatmentRegion 3d_lowres 0 -tr nnUNetTrainerDiceCELoss_noSmooth --c

By the way, would the authors like help with improving the documentation? I found several typos, and I think there are examples that should be added to help users and developers use nnU-Net more effectively. Perhaps I can help pinpoint the typos and write more examples for the documentation. Thank you again to the nnU-Net dev team.

TaWald (Contributor) commented Nov 9, 2023

If you have some specific examples in mind, feel free to open another issue to discuss them in more depth. In general, the examples for getting nnU-Net working were better in v1, and we are aware that v2 is not as nicely explained. Having said this, we are of course happy to add a better README.md :) If you are enthusiastic, feel free to create new ones with some examples and open a pull request.
If you want to discuss this further, you can open another issue too.
