-
It is probably because embeddings sit between CLIP and the diffuser. The training most likely deals with small values that produce large effects, and the way the loss is computed is probably not entirely appropriate for tracking that. I assume there is a lot of horizontal drift in the values after a while. I'd suggest implementing TensorBoard logging to investigate this further.
-
Thank you~ this greatly inspired me.
-
The largest mystery I have encountered when training my Textual Inversion embeddings/Hypernetworks is that although there is no obvious trend in the training loss curve, the images produced undoubtedly fit the training dataset better and better. That makes loss not a useful metric for evaluating the success of a training session.
Is the loss actually decreasing, just hidden by the inherent noise? Or is there a more robust metric we can construct by considering more variables? If we can answer these questions, we can search for training hyperparameters and HN architectures efficiently instead of relying on folklore.
What does the loss mean?
The loss as defined in Eq. 3 of the LDM paper is

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \big\|_2^2\Big]$$
The losses used in TIs/HNs are variants of this. In plain English, it is the expected squared error between the injected noise and the noise predicted by the denoiser. This is the number we are trying to minimize in the training process.
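To make that concrete, here is a minimal sketch of one loss evaluation in PyTorch, assuming a diffusers-style scheduler. The names `unet`, `z0` (VAE-encoded latents), and `cond` (text-encoder output) are placeholders of mine, not the WebUI's actual internals:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

def noise_prediction_loss(unet, z0, cond):
    # Sketch, not the WebUI's code: unet, z0, cond are assumed to exist.
    # Draw a random timestep in [0, 999] and random Gaussian noise.
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    # Forward-diffuse the clean latents to noise level t.
    zt = scheduler.add_noise(z0, eps, t)
    # Squared error between the injected noise and the predicted noise.
    eps_pred = unet(zt, t, encoder_hidden_states=cond).sample
    return F.mse_loss(eps_pred, eps)
```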
There's a problem though: the loss can only be estimated, and the sample space is huge. Even if we keep the dataset small, we still have to pick a random timestep $t \in [0, 999]$ (which decides the noise level) and a random latent noise $\epsilon$.
In the WebUI, the reported loss is the average over the last 32 steps. This number alone is not very meaningful, as the underlying distribution has high variance.
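A lower-variance estimate is possible if we control the sampling ourselves: average over a fixed grid of timesteps and fixed noise seeds, so every measurement uses identical samples. A sketch building on `noise_prediction_loss` above (the function and grid sizes are my own choices, not anything from the WebUI):

```python
@torch.no_grad()
def estimate_loss(unet, z0, cond, n_timesteps=50, n_seeds=4):
    # Fixed timestep grid and fixed noise seeds: every call measures the
    # same samples, so the estimate is comparable across checkpoints.
    ts = torch.linspace(0, 999, n_timesteps).long().to(z0.device)
    g = torch.Generator(device=z0.device)
    total = 0.0
    for t in ts:
        for seed in range(n_seeds):
            g.manual_seed(seed)
            eps = torch.randn(z0.shape, generator=g, device=z0.device)
            tt = t.expand(z0.shape[0])
            zt = scheduler.add_noise(z0, eps, tt)
            eps_pred = unet(zt, tt, encoder_hidden_states=cond).sample
            total += F.mse_loss(eps_pred, eps).item()
    return total / (n_timesteps * n_seeds)
```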
My setup
For training:
- Prompt templates: `[filewords] [name]` and `[filewords]`
- … every epoch

For evaluation:
Overall loss
Perhaps unsurprisingly, the loss is highly correlated with the noise level. Sudden bumps in the loss curve during training might just be unlucky streaks of low timesteps.
Some statistics:
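To gather statistics like these, one can log the sampled timestep alongside each step's loss and then bucket by timestep. A sketch, where the `loss_log.csv` file and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical log produced by patching the training loop to record the
# sampled timestep next to each step's loss.
df = pd.read_csv("loss_log.csv")  # columns: step, timestep, loss

# Mean loss per timestep bucket; a strong trend here means bumps in the
# raw loss curve mostly reflect which timesteps happened to be drawn.
buckets = pd.cut(df["timestep"], bins=10)
print(df.groupby(buckets, observed=True)["loss"].agg(["mean", "std", "count"]))
```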
Per-image loss
With the dataset entry fixed, the loss forms surprisingly smooth curves over timesteps. It is also clear that some images are learned better than others.
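Curves like these can be reproduced by sweeping the timestep for a single image while keeping the noise fixed. A sketch reusing the placeholder names from the blocks above:

```python
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_loss_curve(unet, z0, cond, seed=0):
    # One image, one fixed noise sample, loss as a function of timestep.
    g = torch.Generator(device=z0.device).manual_seed(seed)
    eps = torch.randn(z0.shape, generator=g, device=z0.device)
    ts, losses = range(0, 1000, 10), []
    for t in ts:
        tt = torch.full((z0.shape[0],), t, dtype=torch.long, device=z0.device)
        zt = scheduler.add_noise(z0, eps, tt)
        eps_pred = unet(zt, tt, encoder_hidden_states=cond).sample
        losses.append(F.mse_loss(eps_pred, eps).item())
    plt.plot(list(ts), losses)
    plt.xlabel("timestep")
    plt.ylabel("loss")
    plt.show()
```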
Effect of trained steps
Now this is the weird part. Despite collecting so many samples, no obvious decrease in the loss is observed as training progresses. The difference between the mean losses is on the order of the standard error.
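One way to squeeze more signal out of such a comparison: evaluate two checkpoints on identical (image, timestep, noise) triples, so most of the sampling noise cancels in the difference. A sketch of such a paired comparison (the function name and print format are mine):

```python
import numpy as np
from scipy import stats

def compare_checkpoints(losses_a, losses_b):
    # Per-sample losses from two checkpoints, measured on the SAME
    # (image, timestep, noise) triples so the comparison is paired.
    diff = np.asarray(losses_a) - np.asarray(losses_b)
    sem = diff.std(ddof=1) / np.sqrt(len(diff))
    t_stat, p = stats.ttest_rel(losses_a, losses_b)
    print(f"mean difference {diff.mean():.5f} +/- {sem:.5f}, p = {p:.3f}")
```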
Conclusion
Honestly, I don't really know. I was expecting to see some tiny but measurable decrease in loss, but there doesn't seem to be any. Hopefully someone will be inspired by my crude exploration to find better metrics and develop a quantitative approach to training.