
pseudo dice goes to zero a little after running nnUNetTrainerDiceLoss #1395

Closed

mahanpro opened this issue Apr 11, 2023 · 11 comments
Labels: help wanted (Extra attention is needed)

mahanpro commented Apr 11, 2023

Hi Fabian @FabianIsensee,

First of all, thank you for your great code and implementation. It is much appreciated.

I have a segmentation task and I want to use only the Dice loss, so I pass -tr nnUNetTrainerDiceLoss to nnUNetv2_train. What happens is that the pseudo Dice drops to zero after a few epochs, around 10-20 or so.

I ran the 3d_fullres configuration and made sure that the masks and images are aligned in the dataset.
I also tried changing some of the parameters, such as the fold number and -num_gpus, but it wasn't effective.
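For reference, a command of this form looks roughly like the following (the fold number is a placeholder; the -tr syntax matches the nnUNetv2_train calls shown later in this thread):

nnUNetv2_train Dataset054_TCIA 3d_fullres 0 -tr nnUNetTrainerDiceLoss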

Update! The strange thing that just happened is that I submitted the same job with exactly the same parameters, but this time it did not go to zero!! I only renamed the dataset from Dataset054_TCIA to Dataset055_TCIA.
I am wondering what causes this. Is it related to the patches drawn from the validation set?

mahanpro changed the title from "pseudo dice goes to zero a little while after running nnUNetTrainerDiceLoss" to "pseudo dice goes to zero a little after running nnUNetTrainerDiceLoss" on Apr 11, 2023
FabianIsensee (Member) commented:

The Dice loss alone is pretty shitty, which is why nnU-Net always uses it in combination with CE. I suggest you use the default first and only then start making changes. This makes it easier to see when and why things go south.
The fact that it worked sometimes and not other times could be related to luck in the weight initialization. nnU-Net does not use seeding.
Best,
Fabian
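For intuition on why the combination matters, here is a minimal sketch of a combined Dice + cross-entropy loss in PyTorch. This is an illustrative simplification, not nnU-Net's actual loss implementation: CE provides dense, well-behaved per-voxel gradients, while the Dice term directly optimizes overlap.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target_onehot, smooth=1e-5):
    # logits: (N, C, ...spatial); target_onehot: same shape, one-hot encoded
    probs = torch.softmax(logits, dim=1)
    spatial_dims = tuple(range(2, logits.ndim))
    intersection = (probs * target_onehot).sum(spatial_dims)
    denominator = probs.sum(spatial_dims) + target_onehot.sum(spatial_dims)
    dice = (2 * intersection + smooth) / (denominator + smooth)
    return 1 - dice.mean()

def dice_ce_loss(logits, target_indices, target_onehot):
    # target_indices: (N, ...spatial) integer class labels for cross-entropy
    return soft_dice_loss(logits, target_onehot) + F.cross_entropy(logits, target_indices)

Used alone, the Dice term tends to be less stable, which is what the default trainer avoids by adding CE.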

TaWald (Contributor) commented Aug 9, 2023

Hey @mahanpro,

did Fabian's response answer your question satisfactorily, or do you still have any questions? If not, I would close this issue.

Best,
Tassilo

TaWald added the stale (No activity in a long time) label on Aug 15, 2023
TaWald closed this as completed on Aug 21, 2023
pinyotae commented:

I got the same problem, but I used the default loss function of nnU-Net v2. Out of 5 folds, the problem occurred in 3, where the loss values worsened and the pseudo Dice got stuck at zero afterward. Please see the attached training progress plots. The issue arose very quickly in one fold (see the second image).

My training relied on default parameters for everything. Specifically, the command was
nnUNetv2_train Dataset001_TreatmentRegion 3d_lowres 0

Note: I am not sure whether I should open a new issue on GitHub, but I think appending to this old one may be better in that we can probably get further comments from the OP and help each other identify the root cause of the issue.

[attached: two training progress plots (progress.png)]

TaWald reopened this on Sep 18, 2023
TaWald (Contributor) commented Sep 18, 2023

To me this looks like a training instability due to NaNs. Do you by any chance get NaNs somewhere? This would break the model.
Can you check whether there are NaNs in the training_log.txt? It is located in the same folder as progress.png.

NaNs are usually caused by overflows/underflows due to large values in mixed-precision training. Should there be NaNs, you could try training with full precision (fp32), which will make the training more resilient against them.
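If you want a quick check, a small script like the one below will flag NaN/Inf entries in the log (just a sketch; the filename is a placeholder for the actual training_log_*.txt in your results folder):

import re
from pathlib import Path

# Scan an nnU-Net training log for NaN/Inf entries (illustrative helper, not part of nnU-Net itself).
log_file = Path("training_log.txt")
pattern = re.compile(r"\bnan\b|\binf\b", re.IGNORECASE)
for lineno, line in enumerate(log_file.read_text().splitlines(), start=1):
    if pattern.search(line):
        print(f"line {lineno}: {line.strip()}")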

Could you also provide some info on how large your dataset is? If it's fairly large, you might be sampling some problematic cases that could induce this instability.

Maybe @constantinulrich or @FabianIsensee can also chime in on the most common causes of training instability, as they have trained significantly more models than I have.

pinyotae commented:

Thank you for your response. However, I did not see anything about NaNs in the log files; I attached them here.
Regarding the dataset size, there are only around 60 CT images, and I believe none of them is larger than 512x512x600 (x, y, z).
I attached the fingerprint and the training plans too, just in case they help.

If there is anything somewhat unusual about the data, it is that the radiologist (or a medical technician) marked some outer regions of the CT with -3000 or some value like that. (Sorry that I cannot share the CT data; I am not allowed to do so.)

I can change those values to -1000 to make them more similar to a common air region in a CT image, but I think that should not break the model this badly; there must be something else. But if you think I should try changing those -3000 HU values to -1000, please let me know.
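If you do decide to try that, a sketch along these lines would rewrite the marked regions (purely illustrative; the filenames and the -1024 threshold are assumptions, and whether those values are related to the instability is not established here):

import SimpleITK as sitk

# Clamp unusually low padding values (e.g. -3000 HU) to a typical air value (-1000 HU)
# before re-running preprocessing. Filenames are placeholders.
img = sitk.ReadImage("case_0000.nii.gz")
arr = sitk.GetArrayFromImage(img)      # numpy copy of the voxel data
arr[arr < -1024] = -1000               # map the marked outer region to air
out = sitk.GetImageFromArray(arr)
out.CopyInformation(img)               # preserve spacing/origin/direction
sitk.WriteImage(out, "case_0000_clamped.nii.gz")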

dataset_fingerprint_for_debugging.txt
nnUNetPlans_for_debugging.txt (https://github.com/MIC-DKFZ/nnUNet/files/12647099/nnUNetPlans_for_debugging.txt)
training_log_2023_9_15_21_38_42.txt
training_log_2023_9_16_17_36_26.txt

TaWald (Contributor) commented Oct 7, 2023

Thanks for providing the files and for giving a lot of context. Unfortunately, I don't know what exactly could cause this issue. What you could try is lowering the learning rate or disabling fp16 to increase numerical stability, but this should not be necessary for your use case. Maybe @FabianIsensee can help?

TaWald added the help wanted (Extra attention is needed) label and removed the stale (No activity in a long time) label on Oct 11, 2023
TaWald (Contributor) commented Oct 24, 2023

Hey @pinyotae,

after consulting colleagues, @seziegler pointed me to nnUNetTrainerDiceCELoss_noSmooth, which was recommended by Fabian here for v1 and also works in v2.
It would be nice if you tried it and reported whether it worked for you!
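For intuition on what the _noSmooth variant changes, here is a toy example (my own sketch, not nnU-Net's code) of how a nonzero smooth term scores a patch whose ground truth contains no foreground; whether this is the mechanism behind the instability seen here is not established:

import torch

def soft_dice(pred_fg, gt_fg, smooth):
    # Soft Dice for a single foreground channel (illustrative only).
    intersection = (pred_fg * gt_fg).sum()
    denominator = pred_fg.sum() + gt_fg.sum()
    return (2 * intersection + smooth) / (denominator + smooth)

gt = torch.zeros(8, 8)                            # patch with no foreground at all
single_voxel = torch.zeros(8, 8)
single_voxel[0, 0] = 1.0                          # prediction with one foreground voxel

print(soft_dice(torch.zeros(8, 8), gt, smooth=1e-5))   # -> 1.0: predicting nothing scores perfectly
print(soft_dice(single_voxel, gt, smooth=1e-5))        # -> ~1e-5: any predicted foreground is punished hard

As the name suggests, the _noSmooth trainer drops this smooth term from the Dice part of the loss, which changes how such foreground-free patches are scored.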

Cheers,
Tassilo

FabianIsensee (Member) commented:

Hey all, just pinging me won't work these days, unfortunately. There is too much going on 🙃
nnUNetTrainerDiceCELoss_noSmooth should do the trick (hopefully). Your symptoms don't quite match the prototypical situation in which it helps though. Definitely worth a try.
@TaWald fp32 training is no longer accessible in v2, as we have never had problems with mixed precision so far and didn't feel like keeping fp32 alive. Over/underflow is unlikely because there is gradient clipping. NaNs would be recognizable in progress.png because they cannot be plotted, so the lines would stop existing or have holes.
CT values are unlikely to be a problem. Before we investigate further, we should wait for the _noSmooth results.
Best,
Fabian

pinyotae commented:

Hi @TaWald and @FabianIsensee,
Thank you so much. It seems nnUNetTrainerDiceCELoss_noSmooth has solved the issue in a fold that had the problem before. I will need to confirm with all five folds, so please leave this issue open a little longer. When I finish all five folds, I will let you know again so that we can close the issue as solved (or look for another solution, but I hope not).


pinyotae commented Nov 3, 2023

Thank you, everybody. I tested all five folds, and all Dice scores seem normal.
Since some users may not understand the -tr flag and may want to see the command I used to solve the issue, here is an example from my work (I was confused about how to use -tr for a long while too):
nnUNetv2_train Dataset001_TreatmentRegion 3d_lowres 0 -tr nnUNetTrainerDiceCELoss_noSmooth --c

By the way, would the authors like help with improving the documentation? I found several typos, and I think there are examples that should be added to help users and developers use nnU-Net more effectively. Perhaps I can help pinpoint the typos and write more examples for the documentation. Thank you again to the nnU-Net dev team.

TaWald (Contributor) commented Nov 9, 2023

If you have some specific examples in mind, feel free to open another issue to discuss them in more depth. In general, the examples for getting nnU-Net working were better in v1, and we are aware that v2 is not as nicely explained. Having said this, we are of course happy to add a better README.md :) If you are enthusiastic, feel free to create new ones with some examples and open a pull request.
If you want to discuss this further, you can open another issue too.
