Issue in using training AnimateLCM SVD #22

habibian · 2024-05-03T15:46:58Z

Thanks for the great work, also for releasing the training script train_svd_lcm.py.

I am trying to reproduce the results using the provided train_svd_lcm.py, but after half of the training (20,000 / 50,000 itrs) don't see any improvement neither in loss value nor generation qualities (training on a single A100 on WebVid2M).

Could you please confirm if Ishould set the hyper-params as follows?

accelerate launch train_svd_lcm.py \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=50000 \
--width=576 \
--height=320 \
--checkpointing_steps=1000 --checkpoints_total_limit=1 \
--learning_rate=1e-6 --lr_warmup_steps=1000 \
--seed=123 \
--adam_weight_decay=1e-3 \
--mixed_precision="fp16" \
--N=40 \
--validation_steps=500 \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--output_dir="outputs" \

In the current train_svd_lcm.py, the model is being trained on 576x320 resolutions, which is much lower than the standard SVD, i.e., 1024x572. Would not this cause a problem as normal (non LCM) SVD suffer from generating lower resolution videos?

Any input is much appreciated :)

The text was updated successfully, but these errors were encountered:

G-U-N · 2024-05-03T15:58:56Z

Hi, thanks for the interest!

I set the default resolution 576x 320 because I find that even trained with lower resolution, it still works to accelerate the generation of SVD. The generation quality of SVD on resolution 576 x 320 is not ideal, and therefore even trained with lower resolutions, you'd better log the generation results in the resolution of 1024x576. Also, for better results, using 1024x576 for training is definitely a better choice when the resource is sufficient.
Also, I would recommend slightly decreasing the learning rate since you trained the model with only one GPU.

G-U-N · 2024-05-03T16:10:42Z

Also would say that the default hyper-parameters applied in the training script are not carefully tailored and should just be sub-optimal. For example, using EMA should generally increase the generation stability.

habibian · 2024-05-03T16:17:32Z

Thanks for the swift response :)

I am now switching to 4xA100 and unfortunately still see vague blobs like the attachment. Curious to know at what iterations should I expect the generations start looking like a video? :)

Thanks!

G-U-N · 2024-05-03T16:36:40Z

The results uploaded seem to be abnormal. It should not flash like this with unnatural colour. Here's what I obtained trained on 576x320.

Training beginning, 0-iter, cfg = 1, inference step = 4

10k iter, cfg=1, inference step = 4

The devices are 8 A 800, and the batch size is set to 8 without gradient accumulation.

G-U-N · 2024-05-03T16:57:18Z

I just found the code at this line was a typo, and I fixed it. Just hope it did not mislead you.

habibian · 2024-05-03T17:13:00Z

Amazing! It start to look good after fixing the typo.

ThanQ :)

G-U-N · 2024-05-03T17:14:39Z

Awesome! Very glad to hear that : D.

habibian · 2024-05-05T07:35:51Z

Hey Fu-Yun,

After fixing the typo, I have been training the model on 8xA100s, which should be exactly like your setting then. Unfortunately, I still can't match your generations:

Training beginning, 0-iter, cfg = 1, inference step = 4

10k iter, cfg=1, inference step = 4

20k iter, cfg=1, inference step = 4

Any suggestion on why this is happening?

I suspect it might be from the data. Currently I am training on WebVid2M-train (results_2M_train.csv with 2.5M videos) without any particular subsampling (based on resolution, content, etc.). Could you please elaborate a bit your training data?

Also, my dataloader does not do any particular transformation/augmentation except for normalizing pixel values to [-1, 1]. Would be great if you can share your WebVid dataloader if there is any particular detail missing.

Again, thanks a lot for your great contribution :)

G-U-N · 2024-05-05T11:57:55Z

Hey @habibian, just uploaded an example dataset.py.

In addition to that, I would recommend freezing all the convolutional layers when training because convolution layers seem to be more vulnerable for fine-tuning.

Hope this will help for better performance.

habibian · 2024-05-05T12:23:00Z

Thanks for the response @G-U-N .

Regarding the freezing the convolutional layers, do you mean the ones in ResBlocks? Is is part of your implementation, or I need to implement it?

Thanks!

G-U-N · 2024-05-05T13:03:04Z

Hi @habibian,

Yes, the ResBlocks. That was not implemented in the training script. But it should be easy to achieve that through modifying this line.

habibian · 2024-05-06T06:13:12Z

Hey @G-U-N ,

Thanks for the input. Following your suggestion, I kept conv layers in resblocks frozen during the training as:

    for name, para in unet.named_parameters():
        # freeze resnet convs as suggested in https://github.com/G-U-N/AnimateLCM/issues/22#issuecomment-2094802365
        if 'conv' in name and not ('conv_in' in name or 'conv_out' in name):
            para.requires_grad = False
        else:
            para.requires_grad = True
            parameters_list.append(para)

I actually observe some improvements in training with this modification as:

Convs Frozen: 20k iter, cfg=1, inference step = 4

All Finetuned: 20k iter, cfg=1, inference step = 4

However, I still see my trained models to have much lower quality compared to the SVD checkpoint that you guys have released SVD checkpoint: . Here are some more test examples to give you some idea about how poor the quality of my replications are. So wonder if you have trained SVD checkpoint as I am doing here, or maybe there are some differences, i.e., in code, data, etc?

Thanks a lot for your guidance and support in replicating your excellent work :)

Convs Frozen: 20k iter, cfg=1, inference step = 4

G-U-N · 2024-05-06T07:07:56Z

Hey @habibian , I would say there's no too much difference. The only difference is that I tried to freeze more weights at the beginning of training instead of fully fine-tuing. I didn't do too much ablation on that due to my limited GPU resources.

What about trying this:

 for name, para in unet.named_parameters():
    if "transformer_block" in name and "temporal_transformer_block" not in name:
        para.requires_grad = True
        parameters_list.append(para)

Again, I would recommend logging the generated videos in resolution 1024 x 576. You will not get ideal results on low resolutions even if you train the model successfully.

LMK if you get better results.

G-U-N · 2024-05-07T07:33:22Z

Hi @habibian, just checking in to see if you have any updates. Hope everything is going well on your end!

habibian · 2024-05-07T12:39:20Z

Hey @G-U-N

Thanks for the suggestion and your great support here, much appreciated!

Following your last suggestion, instead of finetuning all except resblocks I am now only finetuning spatial_transformer_blocks that is actually improving the results as follows:

Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4