Training code for the Adversarial Diffusion Distillation(ADD) not available? #238
Comments
Looking forward to an update on the training code. |
Same question. Is the training code planned to be released soon? |
Actually, if you look at the ADD paper, they train StyleGAN-T++ for 2M iterations at batch size 2048 on 128 A100s. This suggests that the project had a budget that allows for ~100K USD experiments. So I highly doubt the ordinary person is going to be able to replicate their result, even with the training code available. It is probably more appropriate to think of the ADD model as training an SD model almost from scratch. The problem it learns is much harder than LCM - they have to go from noise straight to a highly polished image. LCM never manages to do that as the original training process of SD is not designed to do few-step denoising, so my hypothesis is that ADD has to learn a lot of new "concepts". |
@jon-chuang, thanks for your feedback. I tried to implement a training mechanism similar to what ADD does, but it seems to have a lot of instability in training and doesn't yield good images. Looking at the paper, though, I think they mention training for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)? |
@Mohan2351999, have you achieved good results with your ADD training? I've also tried training an ADD model, but the images generated after a bit of training looked terrible, like those from a failed GAN run. |
@fingerk28 I was getting similar images, which collapse into complete noise with longer training; that could be due to instability in the training. I still face the issue of 'nan' in the grad_norm of the discriminator while training. Please let me know if you find any success with your training. Thanks. |
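A common stopgap for that 'nan' grad_norm issue (not from the ADD paper, just a generic GAN-training guard; the function name and shapes below are my own) is to clip the discriminator's global gradient norm and skip the optimizer step entirely whenever any gradient is non-finite, so one bad batch doesn't poison the weights. A minimal numpy sketch:

```python
import numpy as np

def guarded_step(params, grads, lr=1e-4, max_norm=1.0):
    """Clip the global gradient norm; skip the update if any grad is non-finite."""
    if not all(np.all(np.isfinite(g)) for g in grads):
        return params, False  # skip this step instead of applying nan/inf grads
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    new_params = [p - lr * scale * g for p, g in zip(params, grads)]
    return new_params, True
```

In PyTorch the same idea is `torch.nn.utils.clip_grad_norm_` on the discriminator parameters plus a `torch.isfinite` check on the returned norm before calling `optimizer.step()`.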
Ok, you're right, colour me surprised. I expected Stability AI (and all major for-profit labs) to withhold details like that.
I have the same result (and others I've talked to have reported the same). But GAN training is generally very hard to tune.
I think in the ADD paper they mention using R1 gradient penalty as regularization. I have yet to try this. |
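For reference, the R1 penalty regularizes the discriminator on real data only: R1 = (γ/2)·E_x[‖∇_x D(x)‖²]. Here is a toy numpy sketch using a tiny tanh-MLP discriminator whose input-gradient can be written analytically; the architecture and shapes are purely illustrative, not the paper's discriminator (which uses autograd to get ∇_x D in practice):

```python
import numpy as np

def discriminator(x, w1, w2):
    # toy MLP discriminator: one scalar logit per sample
    h = np.tanh(x @ w1)      # (n, hidden)
    return h @ w2            # (n,)

def input_gradient(x, w1, w2):
    # analytic dD/dx for the toy MLP above
    h = np.tanh(x @ w1)          # (n, hidden)
    dh = (1.0 - h ** 2) * w2     # chain rule through tanh, w2 broadcast
    return dh @ w1.T             # (n, dim)

def r1_penalty(x_real, w1, w2, gamma=10.0):
    # R1 = (gamma / 2) * E[ ||grad_x D(x)||^2 ] over real samples
    g = input_gradient(x_real, w1, w2)
    return 0.5 * gamma * np.mean(np.sum(g ** 2, axis=1))
```

The penalty is added to the discriminator loss on real batches only; in a real setup you would get `input_gradient` from `torch.autograd.grad` instead of writing it by hand.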
Btw @Mohan2351999 do shoot me an email at |
Hi @jon-chuang, thanks for your answers. I have already tried including the R1 gradient penalty, but still couldn't get rid of the "nan" in the gradient norm of the discriminator. Thanks for sharing your contact; I will send you an email soon. |
@Mohan2351999 @jon-chuang I have also tried to reproduce ADD recently, and I have some doubts about the training data. Is it the LAION dataset? Will the quality of the training data have a significant impact on adversarial training? |
@jon-chuang @Mohan2351999 Hi, have you obtained good generation results? I used the training method of ADD, but the generated images have color issues, such as oversaturation... And I don't know what the problem is... |
Hey, there! Will be glad to receive any comments and pieces of advice on my work! https://github.com/leffff/adversarial-diffusion-distillation/ |
Hi, the paper says that the number of sampling steps of the teacher model is set to 1, which I think is unreasonable. I tried to run ADD experiments with a DDPM on CIFAR10: when I sample from the teacher model with a single step, the result is a picture of completely random noise. Or is their teacher model already strong enough to generate high-quality images in one step? |
You're right, the single-step teacher is quite useless on its own. You can see this in Table 1 d) by comparing the first and second rows. |
Here are screenshots from the paper showing they do only 1 teacher step, which is in my opinion unreasonable: we force the student to produce samples of the best quality possible in 4 steps instead of all the steps of the teacher, which means the teacher should make more steps. But imagine the teacher makes fewer steps than the student. Then the generation quality of the teacher is worse than the student's, so why would we want the student's predictions to be as close as possible to the teacher's? I do not understand this point yet. (In this video https://www.youtube.com/watch?v=ZxPQtXu1Wbw the author says that the teacher makes 1000 steps.) |
@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results from your code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10; the training may take longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro! |
I will soon change my UNet and dataset and switch to either ImageNet or CIFAR10! If I succeed I will inform you! Waiting for your results :) |
Okay, I've figured out the answer. The main contribution to distillation is made by the discriminator, while the teacher is there to prevent overfitting, and that is the reason the teacher only does 1 step. |
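To make that split concrete: the student objective in the paper combines an adversarial term with the one-step-teacher distillation term, roughly L = L_adv + λ·L_distill (the paper's default weight is λ = 2.5, if I recall correctly). Below is a toy numpy sketch of that combination; `toy_denoiser`, the noising coefficient `alpha`, and all shapes are stand-ins I made up for illustration, not the actual teacher UNet or schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_noisy, w):
    # stand-in for the frozen teacher's one-step denoised prediction
    return x_noisy @ w

def add_student_loss(x_student, fake_logits, teacher_w, lam=2.5, alpha=0.7):
    # re-noise the student's sample, then take exactly ONE teacher step
    noise = rng.standard_normal(x_student.shape)
    x_noisy = alpha * x_student + np.sqrt(1.0 - alpha ** 2) * noise
    x_teacher = toy_denoiser(x_noisy, teacher_w)
    distill = np.mean((x_student - x_teacher) ** 2)   # pull student toward teacher
    # non-saturating generator loss on the discriminator's logits for fakes
    adv = np.mean(np.logaddexp(0.0, -fake_logits))
    return adv + lam * distill
```

With only one teacher step the distillation term is a weak, regularizing pull, which matches the reading above that the adversarial term does the heavy lifting.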
@leffff Thanks for the explanation! Did you uncover any training hacks that were not mentioned in the paper? And are you getting good results for a single step? |
I was not able to find the code for the ADD training mechanism. When will the code be released?