Issue in using training AnimateLCM SVD #22
Comments
Hi, thanks for the interest!
I would also say that the default hyper-parameters in the training script are not carefully tailored and are likely sub-optimal. For example, using
I just found a typo in the code at this line, and I have fixed it. I hope it did not mislead you.
Amazing! It started to look good after fixing the typo. Thank you :)
Awesome! Very glad to hear that :D
Hey @habibian, just uploaded an example dataset.py. In addition to that, I would recommend freezing all the convolutional layers when training, because the convolutional layers seem to be more vulnerable to fine-tuning. Hope this helps with performance.
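A minimal sketch of one way to freeze the convolutional layers (a reader-added illustration, not the repository's exact code; it assumes a diffusers-style `unet` object and uses a simple name-matching heuristic):

```python
import torch

# Freeze everything first, then re-enable only the non-convolutional weights.
unet.requires_grad_(False)

trainable_params = []
for name, param in unet.named_parameters():
    # Heuristic: any parameter whose name contains "conv" (conv_in, conv_out,
    # ResBlock convs, down/upsampler convs) stays frozen.
    if "conv" in name:
        continue
    param.requires_grad = True
    trainable_params.append(param)

optimizer = torch.optim.AdamW(trainable_params, lr=1e-6)
```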
Thanks for the response @G-U-N. Regarding freezing the convolutional layers, do you mean the ones in the ResBlocks? Is that part of your implementation, or do I need to implement it myself? Thanks!
Hey @G-U-N, thanks for the input. Following your suggestion, I kept the convolutional layers frozen during training.

I actually observe some improvements in training with this modification:

- Convs frozen: 20k iter, cfg=1, inference steps = 4
- All fine-tuned: 20k iter, cfg=1, inference steps = 4

However, my trained models still have much lower quality than the SVD checkpoint you have released. Here are some more test examples to give you an idea of how poor the quality of my replications is. So I wonder whether you trained the released SVD checkpoint the same way I am doing here, or whether there are some differences, e.g., in code, data, etc.? Thanks a lot for your guidance and support in replicating your excellent work :)
Hey @habibian, I would say there's not too much difference. The only difference is that I tried to freeze more weights at the beginning of training instead of fully fine-tuning. I didn't do much ablation on that due to my limited GPU resources. What about trying this:

```python
for name, para in unet.named_parameters():
    if "transformer_block" in name and "temporal_transformer_block" not in name:
        para.requires_grad = True
        parameters_list.append(para)
```

Again, I would recommend logging the generated videos at resolution 1024 x 576. You will not get ideal results at low resolutions even if you train the model successfully. LMK if you get better results.
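For reference, a minimal, reader-added sketch of such a validation pass, assuming the diffusers StableVideoDiffusionPipeline and a hypothetical conditioning image path (this is not the repository's actual logging code, and the scheduler configuration should match how the model was distilled):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
# pipe.unet.load_state_dict(...)  # load your fine-tuned UNet weights here

image = load_image("test_image.png")  # hypothetical conditioning image
frames = pipe(
    image,
    width=1024,
    height=576,
    num_inference_steps=4,    # 4-step sampling
    min_guidance_scale=1.0,   # cfg = 1, i.e. classifier-free guidance effectively disabled
    max_guidance_scale=1.0,
).frames[0]
export_to_video(frames, "sample_1024x576.mp4", fps=7)
```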
Hi @habibian, just checking in to see if you have any updates. Hope everything is going well on your end!
Hey @G-U-N, thanks for the suggestion and your great support here, much appreciated! Following your last suggestion, instead of fine-tuning everything except the ResBlocks, I now fine-tune only the spatial transformer blocks. Here are the comparisons:

- Fine-tuning all except ResBlocks: 20k iter, cfg=1, inference steps = 4
- Fine-tuning spatial_transformer_blocks: 20k iter, cfg=1, inference steps = 4

And here are the 1024 x 576 generated videos using my trained checkpoint, compared to your released checkpoint:

- Fine-tuning spatial_transformer_blocks: 20k iter, cfg=1, inference steps = 4
- Your released AnimateLCM-SVD-xt-1.1 checkpoint: ? iter, cfg=1, inference steps = 4

As you see, there is still a gap in generation quality, and I am not sure how it can be reduced. Is the released checkpoint trained for 50K iterations? Was any particular multi-stage training or lr scheduling involved? Thanks :)
Hey @habibian. Very glad to see the improvement! And I really appreciate the detailed visual ablations. I actually conducted the training in two stages.
Additionally, some more iterations at larger resolutions will help enhance the performance. Hope this leads to better results!
Hey @G-U-N, great, thanks for the elaboration. I will follow this multi-stage training and get back to you with the results. For that, could you please describe the details of the large-resolution training a bit more? More specifically:
Thanks!
The details: Training videos: bilinearly interpolated WebVid-2M. If you have another video dataset with larger resolution, that would be great.
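A minimal, reader-added sketch of the kind of bilinear upsampling meant here (the frame-tensor shape and target resolution are assumptions, not the repository's exact preprocessing):

```python
import torch
import torch.nn.functional as F

def upsample_frames(frames: torch.Tensor, height: int = 576, width: int = 1024) -> torch.Tensor:
    """Bilinearly resize a video clip of shape (num_frames, channels, H, W)."""
    return F.interpolate(frames, size=(height, width), mode="bilinear", align_corners=False)

# Example: a 14-frame low-resolution WebVid-2M clip at 320x576 -> 576x1024.
clip = torch.randn(14, 3, 320, 576)
clip_up = upsample_frames(clip)
print(clip_up.shape)  # torch.Size([14, 3, 576, 1024])
```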
Hi, I think that in stage 2 the UNet should be initialized from the UNet weights saved in stage 1, while the target UNet and teacher UNet should be initialized from the Stability SVD-xt weights. Am I right? The code does not seem to support this, though.
Hey @ersanliqiao. You should load the unet and target unet from your fine-tuned weights and initialize the teacher unet with the Stability weights. Try this at this line:

```python
from safetensors.torch import load_file

finetuned_weight = load_file("xxx.safetensors", "cpu")
unet.load_state_dict(finetuned_weight)
target_unet.load_state_dict(finetuned_weight)
del finetuned_weight
```
thank you!!
hi @habibian
hi @dreamyou070, I needed to retrain AnimateLCM on a different UNet, in order to run faster than the standard SVD architecture.
Hi @G-U-N, thanks for your great open-source work. I have some questions about the loss weighting when training SVD-LCM (codes), where the weights are defined here:
This formulation seems a bit different from the representation of λ_n in the arXiv paper. I'd like to know whether the formulation used in the code is based on any reference paper or is just a heuristic setting.
Hey @haohang96. Yes, I would say the choice of weights is quite heuristic and hard to analyze explicitly. Most of the designs are heuristic and likely sub-optimal.
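Purely for illustration (reader-added; this is neither the repository's actual weighting nor the paper's λ_n), here is how a heuristic timestep-dependent weight typically enters a consistency-distillation loss; the SNR-style form below is just one common choice:

```python
import torch

def snr_weight(sigmas: torch.Tensor) -> torch.Tensor:
    # Hypothetical heuristic: weight each sample by 1 + 1/sigma^2 (roughly "SNR + 1"),
    # so low-noise timesteps are emphasized.
    return 1.0 + 1.0 / sigmas**2

def weighted_consistency_loss(pred, target, sigmas, huber_c=0.001):
    # Pseudo-Huber distance between the online and target consistency outputs,
    # averaged with the per-sample heuristic weights above.
    dist = torch.sqrt((pred - target) ** 2 + huber_c**2) - huber_c
    w = snr_weight(sigmas).view(-1, *([1] * (pred.dim() - 1)))
    return (w * dist).mean()
```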
@habibian Hi, have you obtained results similar to the released AnimateLCM-SVD-xt? I fine-tuned the spatial transformer layers for 30k iterations, and the results appear as blurry as what you've shown above.
Trainable parameters are set as follows:

```python
unet.requires_grad_(False)
parameters_list = []

# Customize the parameters that need to be trained; if necessary, you can uncomment them yourself.
for name, para in unet.named_parameters():
    if args.training_stage == 1:
        # Stage 1: 30k iterations with only the spatial transformer blocks tuned, learning rate 1e-6.
        # At this stage, an 80 GB GPU should be able to train at resolution 768x448.
        if "temporal_transformer_blocks" not in name and "transformer_blocks" in name:
            para.requires_grad = True
            parameters_list.append(para)
    elif args.training_stage == 2:
        # Stage 2: 50k iterations with only the temporal transformer blocks tuned, learning rate 3e-7
        # (the temporal weights of SVD are relatively large and vulnerable).
        # At this stage, an 80 GB GPU should be able to train at resolution 1024x576.
        if "temporal_transformer_blocks" in name:
            para.requires_grad = True
            parameters_list.append(para)
```
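Downstream, `parameters_list` then feeds the optimizer; a reader-added sketch (the per-stage learning rates follow the comments above, and the weight decay matches the command in the original post):

```python
import torch

# Stage 1 uses lr 1e-6 (spatial blocks); stage 2 uses lr 3e-7 (temporal blocks).
lr = 1e-6 if args.training_stage == 1 else 3e-7
optimizer = torch.optim.AdamW(parameters_list, lr=lr, weight_decay=1e-3)
```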
Thanks for the great work, and also for releasing the training script train_svd_lcm.py. I am trying to reproduce the results using the provided train_svd_lcm.py, but after half of the training (20,000 / 50,000 iterations) I don't see any improvement in either the loss value or the generation quality (training on a single A100 on WebVid2M). Could you please confirm whether I should set the hyper-parameters as follows?
```bash
accelerate launch train_svd_lcm.py \
    --pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=576 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-6 --lr_warmup_steps=1000 \
    --seed=123 \
    --adam_weight_decay=1e-3 \
    --mixed_precision="fp16" \
    --N=40 \
    --validation_steps=500 \
    --enable_xformers_memory_efficient_attention \
    --gradient_checkpointing \
    --output_dir="outputs"
```
In the current train_svd_lcm.py, the model is trained at 576x320 resolution, which is much lower than the standard SVD resolution of 1024x576. Wouldn't this cause a problem, since the normal (non-LCM) SVD suffers when generating lower-resolution videos? Any input is much appreciated :)