Training code for the Adversarial Diffusion Distillation (ADD) not available? #238

Open
Mohan2351999 opened this issue Dec 8, 2023 · 19 comments

Comments

@Mohan2351999

I was not able to find the code for the ADD training mechanism. When will the code be released?

@Mohan2351999 Mohan2351999 changed the title Training code for the Adversarial Diffusion Distillation(ADD) Training code for the Adversarial Diffusion Distillation(ADD) not available? Dec 8, 2023
@tnickMoxuan

Looking forward to the release of the training code.

@m-muaz

m-muaz commented Dec 12, 2023

Same question. Is the training code planned to be released soon?

@jon-chuang

Actually, if you look at the ADD paper, they train StyleGAN-T++ for 2M iterations at batch size 2048 on 128 A100s. This suggests the project had a budget that allowed for ~100K USD experiments. So I highly doubt an ordinary person is going to be able to replicate their results, even with the training code available.

It is probably more appropriate to think of the ADD model as training an SD model almost from scratch. The problem it learns is much harder than LCM's: it has to go from noise straight to a highly polished image.

LCM never manages to do that, as the original training process of SD is not designed for few-step denoising, so my hypothesis is that ADD has to learn a lot of new "concepts".

@Mohan2351999
Author

@jon-chuang, thanks for your feedback. I tried to implement a training mechanism similar to what ADD is doing, but it seems to have a lot of instability in training and doesn't yield good images. Looking at the paper, though, I think they mention training for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)?

@fingerk28

@Mohan2351999, have you achieved good results with your ADD training? I've also tried training ADD, but the images generated after a few training steps looked terrible, like those from a failed GAN training.

@Mohan2351999
Author

@fingerk28 I was getting similar images, which become complete noise with longer training; that is probably due to instability in the training. I still face the issue of 'nan' in the grad_norm of the discriminator while training. Please let me know if you find any success with your training. Thanks.
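
For anyone hitting the same thing, this is a minimal sketch of the standard PyTorch diagnostics I mean: anomaly detection to surface the op that first produces the NaN, plus clipping and a finiteness check on the discriminator gradient norm before stepping. The tiny model and tensors are stand-ins, not our actual setup:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the snippet runs; swap in the real discriminator and batches.
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
real = torch.randn(8, 3, 64, 64)
fake = torch.randn(8, 3, 64, 64)

torch.autograd.set_detect_anomaly(True)  # slow; enable only while debugging

# Hinge-style discriminator loss (one common GAN choice).
loss_d = torch.relu(1.0 - disc(real)).mean() + torch.relu(1.0 + disc(fake)).mean()
loss_d.backward()

# Clip and inspect the gradient norm before the optimizer step.
grad_norm = torch.nn.utils.clip_grad_norm_(disc.parameters(), max_norm=1.0)
if torch.isfinite(grad_norm):
    opt_d.step()
else:
    print("non-finite discriminator grad norm; skipping this step")
opt_d.zero_grad(set_to_none=True)
```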

@jon-chuang

jon-chuang commented Jan 11, 2024

I think they mention that they train for 4k iterations on batch size 128 right(details in ADD paper, page 6, table 1)?

Ok, you're right, colour me surprised. I expected Stability AI (and all major for-profit labs) to withhold details like that.

but it seems to have a lot of instability in training,

I have the same result (and others I've talked to have reported the same).

But GAN training is generally very hard to tune.

I still face the issue of 'nan' in the grad_norm of the discriminator while training.

I think in the ADD paper they mention using R1 gradient penalty as regularization. I have yet to try this.
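
For reference, a minimal sketch of R1 (a gradient penalty applied to real samples only, from Mescheder et al., "Which Training Methods for GANs do actually Converge?"); `disc` and the default `gamma` here are placeholders, not the paper's actual discriminator or weight:

```python
import torch
import torch.nn as nn

def r1_penalty(disc: nn.Module, real: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """R1 regularization: (gamma / 2) * E[ ||grad_x D(x)||^2 ] over real samples."""
    real = real.detach().requires_grad_(True)
    out = disc(real).sum()  # sum so autograd.grad yields per-sample input grads
    (grad,) = torch.autograd.grad(out, real, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(dim=1).mean()

# Usage: add to the discriminator loss (often lazily, every k steps).
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
real = torch.randn(4, 3, 64, 64)
loss_d = r1_penalty(disc, real)  # + the usual adversarial terms
```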

@jon-chuang

Btw @Mohan2351999 do shoot me an email at chuang dot jon at gmail dot com if you want to chat about this more offline. I'm quite determined to have this ADD training succeed.

@Mohan2351999
Author

Hi @jon-chuang, thanks for your answers. I have already tried including the R1 gradient penalty, but I still couldn't get rid of the "nan" in the gradient norm for the discriminator.

Thanks for sharing your contact, I will send you an email soon.

@YangPanHZAU

@Mohan2351999 @jon-chuang I have also tried to reproduce ADD recently, and I have some doubts about the training data. Is it the LAION dataset? Will the quality of the training data have a significant impact on adversarial training?

@MqLeet

MqLeet commented Jan 17, 2024

@jon-chuang @Mohan2351999 Hi, have you obtained good generation results? I used the training method of ADD, but the generated images have color issues, such as oversaturation...

Just like this: [image: example of an oversaturated generation]

And I don't know what the problem is...

@leffff

leffff commented Feb 13, 2024

Hey there!
While the code for ADD is still unpublished, I have started working on my own implementation.
In a couple of weeks I will be able to train and test my model. For my tests I have trained my own (toy) UNet on the food101 dataset, and I will distill it further.

Will be glad to receive any comments and pieces of advice on my work!

https://github.com/leffff/adversarial-diffusion-distillation/

@digbangbang

Hi, the paper says that the number of steps of the teacher model is set to 1. I think this is unreasonable. I tried to run ADD experiments with DDPM on CIFAR10: when the teacher samples with a single step, the result is an image of completely random noise. Or is their teacher model already sufficient to generate higher-quality images in one step?
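
For reference, this is what a single teacher step computes under the usual epsilon-prediction DDPM parameterization (a sketch; `teacher` and `alphas_cumprod` are placeholders). Starting from pure noise (t near T), this one-step x0 estimate is essentially blur/noise, which matches what I observed:

```python
import torch

def one_step_x0(teacher, x_t, t, alphas_cumprod):
    """Single-step x0 estimate from an epsilon-prediction teacher:
    x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t)."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # cumulative alpha at timestep t
    eps_hat = teacher(x_t, t)                    # teacher's noise prediction
    return (x_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```

Note that, as far as I can tell from Sec. 3 of the paper, the teacher is applied to a re-noised student sample rather than to pure noise, so t can be well below T, where this one-step estimate is far more informative.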

@jonaskohler

You're right, the single-step teacher is quite useless. You can see this from Table 1d by comparing the first and second rows.

@leffff

leffff commented Feb 22, 2024

[two screenshots from the ADD paper]

Here are screenshots from the paper, showing that they do only 1 teacher step. In my opinion this is unreasonable: we force the student to produce samples of the best possible quality in 4 steps instead of all the steps of the teacher, meaning the teacher should take more steps.

But imagine the teacher takes fewer steps than the student. This would mean the generation quality of the teacher is worse than the student's. Then why do we want the student's predictions to be as close as possible to the teacher's?

I do not understand this point yet.

In this video https://www.youtube.com/watch?v=ZxPQtXu1Wbw the author says that the teacher makes 1000 steps.

@digbangbang

@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results on code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10. The training time may be longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

@leffff

leffff commented Feb 22, 2024

@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results on code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10. The training time may be longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

I will soon change my UNet and dataset and switch to either ImageNet or CIFAR10! If I succeed I will let you know! Waiting for your results :)

@leffff

leffff commented Feb 28, 2024

Okay I've figured out the answer.

The main contribution to distillation is made by the discriminator, while the teacher is there to prevent overfitting, and this is the reason the teacher only does 1 step.
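
A minimal sketch of how the two terms could fit together in the student update (all names and the weight are placeholders, not the paper's exact choices):

```python
import torch
import torch.nn.functional as F

def student_loss(fake_logits, student_x0, teacher_x0, lam=2.5):
    """Student objective = adversarial term + weighted distillation term.
    fake_logits: discriminator outputs on the student's samples.
    teacher_x0: the teacher's single-step reconstruction (kept detached).
    lam=2.5 is only an illustrative default, not a tuned value."""
    adv = -fake_logits.mean()  # hinge-style generator loss
    distill = F.mse_loss(student_x0, teacher_x0.detach())
    return adv + lam * distill
```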

@jonaskohler

@leffff Thanks for the explanation! Did you uncover any training hacks that were not mentioned in the paper? And are you getting good results for a single step?
