This is the official code for C2G2: Controllable Co-speech Gesture Generation.
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly when generating high-fidelity, comprehensive gestures. Moreover, these methods lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2: a speaker-specific decoder that generates speaker-related, real-length skeletons, and a repainting strategy for flexible gesture generation/editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of C2G2 compared with several state-of-the-art baselines.
- Clone this repository and install the required packages:

  ```
  git clone https://github.com/C2G2-Gesture/C2G2.git
  cd C2G2
  pip install -r requirements.txt
  ```
- Download the pretrained vqvae, latent_diffusion, and SRD checkpoints from here and put them in a path of your choice.
-
Download pretrained fasttext model from here and put
crawl-300d-2M-subword.bin
andcrawl-300d-2M-subword.vec
atdata/fasttext/
. -
- Download the auto-encoder used for FGD evaluation:
  - For the TED Gesture dataset, we use the pretrained auto-encoder provided by Yoon et al. (the checkpoint in the `train_h36m_gesture_autoencoder` folder) for better reproducibility. Save it as `output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin`.
  - For the TED Expressive dataset, the pretrained auto-encoder is provided here. Save it as `output/TED_Expressive_output/AE-cos1e-3/checkpoint_best.bin`.
- Download the two datasets from ted_expressive and ted_gesture (the original pyarrow-based datasets are no longer well supported), then put them in `data/ted_expressive_pickle` and `data/ted_pickle`, respectively.
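After the download steps above, the data and checkpoint paths referenced in this README should look like the layout below (the pretrained vqvae / latent_diffusion / SRD checkpoints can go in any path you choose, so they are not shown):

```
C2G2/
├── data/
│   ├── fasttext/
│   │   ├── crawl-300d-2M-subword.bin
│   │   └── crawl-300d-2M-subword.vec
│   ├── ted_expressive_pickle/
│   └── ted_pickle/
└── output/
    ├── train_h36m_gesture_autoencoder/
    │   └── gesture_autoencoder_checkpoint_best.bin
    └── TED_Expressive_output/
        └── AE-cos1e-3/
            └── checkpoint_best.bin
```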
```
# Train the VQ-VAE
python scripts/train_vqvae_expressive.py --config=config/pose_diffusion_expressive.yml

# Train the latent diffusion model
python scripts/train_expressive_latent.py --config=config/pose_diffusion_expressive.yml

# Finetune the SRD to generate real-length, speaker-specific skeletons
python scripts/train_vqvae_expressive_cond.py --config=config/pose_diffusion_expressive.yml
```
The third and fourth positional arguments control whether to use the real speaker identity and whether to apply repainting, respectively. When real identity is enabled, make sure an SRD checkpoint is loaded from the correct path.
```
# Metric evaluation (normalized / real-length)
python scripts/test_expressive.py eval latent_diffusion False(/True) False

# Metric evaluation for the VQ-VAE
python scripts/test_expressive.py eval vqvae False False

# Synthesize normalized short videos
python scripts/test_expressive.py short latent_diffusion False False

# Synthesize real-length short videos
python scripts/test_expressive.py short latent_diffusion True False

# Synthesize normalized long videos
python scripts/test_expressive.py long latent_diffusion False False

# Synthesize real-length long videos
python scripts/test_expressive.py long latent_diffusion True False
```
Repainting is applied to the last 6 frames of a short sequence, and to the last 6 frames of the final full step when generating long sequences.
The current code only supports repainting toward the target sequence in the dataset; you can modify the code to supply other poses and repaint any part you want.
```
# Synthesize short videos with repainting
python scripts/test_expressive.py short latent_diffusion False(/True) True

# Synthesize long videos with repainting
python scripts/test_expressive.py long latent_diffusion False(/True) True
```
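The idea behind repainting can be sketched as follows: at each denoising step, a frame-level mask decides which frames are taken from the model's sample (the repainted region, e.g. the last 6 frames) and which are kept from the reference sequence. This is a minimal NumPy illustration of that blending, not the repository's actual implementation; the shapes (34 frames, 126-dim poses) are hypothetical.

```python
import numpy as np

def repaint_blend(x_gen, x_ref, edit_mask):
    """Take model frames where edit_mask is 1; keep reference frames where it is 0."""
    return edit_mask * x_gen + (1.0 - edit_mask) * x_ref

T, D = 34, 126                        # hypothetical: 34 frames of 126-dim poses
rng = np.random.default_rng(0)
x_gen = rng.standard_normal((T, D))   # denoised sample from the diffusion model
x_ref = rng.standard_normal((T, D))   # reference sequence (noised to the same step)

mask = np.zeros((T, 1))
mask[-6:] = 1.0                       # repaint only the last 6 frames
x_out = repaint_blend(x_gen, x_ref, mask)
```

In the real pipeline this blend is repeated at every reverse-diffusion step, so the regenerated frames stay consistent with the kept frames.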
Generate co-speech keypoints on local images, given 4 frames of seed gesture and 1 frame of identity reference.
The required folder structure is shown in demos/clip1. The current implementation requires 4 frames for both the seed gesture and the speaker identity; this will be fixed later.
```
python scripts/generate_visual.py <seed_folder> <identity_folder>

# For example:
python scripts/generate_visual.py demos/data/clip1 demos/data/clip1
```
If you find our work useful, please kindly cite as:
```
@article{ji2023c2g2,
  title={C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model},
  author={Ji, Longbin and Wei, Pengfei and Ren, Yi and Liu, Jinglin and Zhang, Chen and Yin, Xiang},
  journal={arXiv preprint arXiv:2308.15016},
  year={2023}
}
```
- The codebase is developed based on Gesture Generation from Trimodal Context by Yoon et al., HA2G by Liu et al., and DiffGesture by Zhu et al.