This is the official code for C2G2: Controllable Co-speech Gesture Generation.
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly when generating high-fidelity, comprehensive gestures. Moreover, these methods lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2: a speaker-specific decoder that generates speaker-related, real-length skeletons, and a repainting strategy for flexible gesture generation/editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of C2G2 compared with several state-of-the-art baselines.
- Clone this repository and install the required packages:

  ```
  git clone https://github.com/C2G2-Gesture/C2G2.git
  cd C2G2
  pip install -r requirements.txt
  ```
- Download the pretrained vqvae, latent_diffusion, and SRD checkpoints from here and put them in a path of your choice.
-
Download pretrained fasttext model from here and put
crawl-300d-2M-subword.bin
andcrawl-300d-2M-subword.vec
atdata/fasttext/
. -
- Download the auto-encoder used for FGD evaluation:
  - For the TED Gesture dataset, we use the pretrained auto-encoder provided by Yoon et al. (the checkpoint in the `train_h36m_gesture_autoencoder` folder) for better reproducibility. Save it as `output/train_h36m_gesture_autoencoder/gesture_autoencoder_checkpoint_best.bin`.
  - For the TED Expressive dataset, the pretrained auto-encoder is provided here. Save it as `output/TED_Expressive_output/AE-cos1e-3/checkpoint_best.bin`.
- Download the two datasets from ted_expressive and ted_gesture (the original pyarrow-based datasets are no longer well supported), then put them in `data/ted_expressive_pickle` and `data/ted_pickle`, respectively.
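After the download steps above, the data and checkpoint paths referenced in this README should look like the layout below (the pretrained vqvae / latent_diffusion / SRD checkpoints can go in any path you choose, so they are not shown):

```
C2G2/
├── data/
│   ├── fasttext/
│   │   ├── crawl-300d-2M-subword.bin
│   │   └── crawl-300d-2M-subword.vec
│   ├── ted_expressive_pickle/
│   └── ted_pickle/
└── output/
    ├── train_h36m_gesture_autoencoder/
    │   └── gesture_autoencoder_checkpoint_best.bin
    └── TED_Expressive_output/
        └── AE-cos1e-3/
            └── checkpoint_best.bin
```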
```
# Train the VQ-VAE
python scripts/train_vqvae_expressive.py --config=config/pose_diffusion_expressive.yml

# Train the latent diffusion model
python scripts/train_expressive_latent.py --config=config/pose_diffusion_expressive.yml

# Finetune the SRD to generate real-length, speaker-specific skeletons
python scripts/train_vqvae_expressive_cond.py --config=config/pose_diffusion_expressive.yml
```
The third and fourth positional arguments control whether to use the real speaker identity and whether to apply repainting, respectively. When real identity is enabled, make sure an SRD checkpoint is loaded from the correct path.
```
# Metric evaluation (normalized / real-length)
python scripts/test_expressive.py eval latent_diffusion False(/True) False

# Metric evaluation for the VQ-VAE
python scripts/test_expressive.py eval vqvae False False

# Synthesize normalized short videos
python scripts/test_expressive.py short latent_diffusion False False

# Synthesize real-length short videos
python scripts/test_expressive.py short latent_diffusion True False

# Synthesize normalized long videos
python scripts/test_expressive.py long latent_diffusion False False

# Synthesize real-length long videos
python scripts/test_expressive.py long latent_diffusion True False
```
Repainting is applied to the last 6 frames of a short sequence, and to the last 6 frames of the final full step when generating long sequences.
The current code only supports repainting toward the target sequence in the dataset; you can modify the code to supply other poses and repaint any part you want.
```
# Synthesize short videos with repainting
python scripts/test_expressive.py short latent_diffusion False(/True) True

# Synthesize long videos with repainting
python scripts/test_expressive.py long latent_diffusion False(/True) True
```
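The idea behind repainting can be sketched as follows: at each denoising step, a frame-level mask decides which frames are taken from the model's sample (the repainted region, e.g. the last 6 frames) and which are kept from the reference sequence. This is a minimal NumPy illustration of that blending, not the repository's actual implementation; the shapes (34 frames, 126-dim poses) are hypothetical.

```python
import numpy as np

def repaint_blend(x_gen, x_ref, edit_mask):
    """Take model frames where edit_mask is 1; keep reference frames where it is 0."""
    return edit_mask * x_gen + (1.0 - edit_mask) * x_ref

T, D = 34, 126                        # hypothetical: 34 frames of 126-dim poses
rng = np.random.default_rng(0)
x_gen = rng.standard_normal((T, D))   # denoised sample from the diffusion model
x_ref = rng.standard_normal((T, D))   # reference sequence (noised to the same step)

mask = np.zeros((T, 1))
mask[-6:] = 1.0                       # repaint only the last 6 frames
x_out = repaint_blend(x_gen, x_ref, mask)
```

In the real pipeline this blend is repeated at every reverse-diffusion step, so the regenerated frames stay consistent with the kept frames.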
Generate co-speech keypoints on local images, given 4 frames of seed gesture and 1 frame of identity reference.
The required folder structure is shown in demos/clip1. The current implementation requires 4 frames for both the seed gesture and the speaker identity; this will be fixed later.
```
python scripts/generate_visual.py <seed_folder> <identity_folder>

# For example:
python scripts/generate_visual.py demos/data/clip1 demos/data/clip1
```
If you find our work useful, please kindly cite as:
```
@article{ji2023c2g2,
  title={C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model},
  author={Ji, Longbin and Wei, Pengfei and Ren, Yi and Liu, Jinglin and Zhang, Chen and Yin, Xiang},
  journal={arXiv preprint arXiv:2308.15016},
  year={2023}
}
```
- The codebase is developed based on Gesture Generation from Trimodal Context by Yoon et al., HA2G by Liu et al., and DiffGesture by Zhu et al.