Implementation of our paper XmDA: Cross-modality Data Augmentation for End-to-End Sign Language Translation.
We propose a novel Cross-modality Data Augmentation (XmDA) approach to improve end-to-end SLT performance. The main idea of XmDA is to transfer the powerful gloss-to-text translation capabilities (unimodal, i.e., text-to-text) to end-to-end sign language translation (cross-modal, i.e., video-to-text). Specifically, XmDA integrates two techniques: Cross-modality Mix-up and Cross-modality Knowledge Distillation (KD). Cross-modality Mix-up combines sign language video features with gloss embeddings extracted from the gloss-to-text teacher model to generate mixed-modal augmented samples. Cross-modality KD uses the diversified spoken language texts generated by the gloss-to-text teacher models to soften the target labels, further diversifying and enhancing the augmented samples.
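For intuition, here is a minimal PyTorch sketch of the Mix-up idea, assuming the gloss embeddings have already been aligned to the sign-feature frames (the alignment step is omitted); the function and parameter names (`cross_modality_mixup`, `mix_prob`) are illustrative, not this repo's API:

```python
import torch

def cross_modality_mixup(sign_feats, gloss_embeds, mix_prob=0.5):
    """Mix sign video features with gloss embeddings per position.

    sign_feats:   (T, D) sign video features.
    gloss_embeds: (T, D) gloss embeddings from the gloss-to-text
                  teacher, assumed here to be already aligned to the
                  sign frames (alignment is omitted in this sketch).
    mix_prob:     probability of taking the gloss embedding at a
                  position (hypothetical parameter name).
    """
    # Per-position binary mask, broadcast over the feature dimension.
    mask = (torch.rand(sign_feats.size(0), 1,
                       device=sign_feats.device) < mix_prob).float()
    return mask * gloss_embeds + (1.0 - mask) * sign_feats
```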

Figure 1: The overall framework of cross-modality data augmentation methods for SLT in this work. Components in gray indicate frozen parameters.
We evaluate the proposed XmDA approach on end-to-end SLT performance and analyze the sign representation distributions on the PHOENIX-2014T dataset.
- This code is based on Sign Language Transformers, modified to implement Cross-modality KD and Cross-modality Mix-up.
- For the baseline end-to-end SLT, you can use Sign Language Transformers.
- For the gloss-to-text teacher models, you can follow PGen or use the original text-to-text Joey NMT framework.
- To put them together, we extend the Sign Language Transformers framework with Joey NMT so that the resulting model can forward gloss-to-text and mixup-to-text (i.e., `forward_type` in `[sign, gloss, mixup]`); see the sketch below.
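A minimal sketch of how such a `forward_type` dispatch could look, reusing `cross_modality_mixup` from above; the function name and arguments here are hypothetical, not the actual signatures in this repo:

```python
def build_encoder_input(sign_feats, gloss_embeds,
                        forward_type="sign", mix_prob=0.5):
    """Select which representation feeds the shared translation encoder."""
    if forward_type == "sign":    # end-to-end SLT path (video-to-text)
        return sign_feats
    if forward_type == "gloss":   # teacher path (text-to-text)
        return gloss_embeds
    if forward_type == "mixup":   # mixed-modal augmented path
        return cross_modality_mixup(sign_feats, gloss_embeds, mix_prob)
    raise ValueError(f"unknown forward_type: {forward_type!r}")
```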
- Create the environment following Sign Language Transformers.
- Reproduce PGen to obtain multiple references as sentence-level guidance from the gloss-to-text teachers (or use `forward_type = gloss`); see the KD sketch after this list.
- Reproduce SMKD to pre-process the sign videos.
- Pre-process the dataset and put it into `./data/DATA-NAME/` (see https://github.com/neccam/slt for the format).
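The teacher references then serve as soft, sequence-level supervision. Below is a minimal sketch of how Cross-modality KD training pairs could be assembled; the names (`kd_training_pairs`, `teacher_texts`) are illustrative, not this repo's API:

```python
def kd_training_pairs(sign_feats, gold_text, teacher_texts):
    """Sequence-level KD sketch: pair each sign video with the gold
    translation plus the diversified translations produced by the
    gloss-to-text teachers (e.g., obtained via PGen), so the student
    trains on softened, multi-reference targets."""
    pairs = [(sign_feats, gold_text)]                  # ground-truth target
    pairs += [(sign_feats, t) for t in teacher_texts]  # teacher targets
    return pairs
```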
Then start training:

```
python -m signjoey train_XmDA configs/Sign_XmDA.yaml
```
Note that the default data directory is `./data`. If you downloaded the data to somewhere else, you need to update the `data_path` parameter in your config file.
- Initial code release.
- Release pre-processed dataset.
- Share extensive qualitative and quantitative results & config files to generate them.
Please cite the paper below if you use this code in your research:
```bibtex
@article{ye2023cross,
  title   = {Cross-modality Data Augmentation for End-to-End Sign Language Translation},
  author  = {Ye, Jinhui and Jiao, Wenxiang and Wang, Xing and Tu, Zhaopeng and Xiong, Hui},
  journal = {arXiv preprint arXiv:2305.11096},
  year    = {2023}
}
```