Implementation of our paper XmDA: Cross-modality Data Augmentation for End-to-End Sign Language Translation.
We propose a novel Cross-modality Data Augmentation (XmDA) approach to improve end-to-end SLT performance. The main idea of XmDA is to transfer the powerful gloss-to-text translation capabilities (unimodal, i.e., text-to-text) to end-to-end sign language translation (cross-modal, i.e., video-to-text). Specifically, XmDA integrates two techniques: Cross-modality Mix-up and Cross-modality Knowledge Distillation (KD). Cross-modality Mix-up combines sign language video features with gloss embeddings extracted from the gloss-to-text teacher model to generate mixed-modal augmented samples. Cross-modality KD uses the diversified spoken language texts generated by the gloss-to-text teacher models to soften the target labels, further diversifying and enhancing the augmented samples.
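For intuition, here is a minimal PyTorch sketch of the Mix-up idea, assuming the gloss embeddings have already been aligned to the sign-feature frames (the alignment step is omitted); the function and parameter names (`cross_modality_mixup`, `mix_prob`) are illustrative, not this repo's API:

```python
import torch

def cross_modality_mixup(sign_feats, gloss_embeds, mix_prob=0.5):
    """Mix sign video features with gloss embeddings per position.

    sign_feats:   (T, D) sign video features.
    gloss_embeds: (T, D) gloss embeddings from the gloss-to-text
                  teacher, assumed here to be already aligned to the
                  sign frames (alignment is omitted in this sketch).
    mix_prob:     probability of taking the gloss embedding at a
                  position (hypothetical parameter name).
    """
    # Per-position binary mask, broadcast over the feature dimension.
    mask = (torch.rand(sign_feats.size(0), 1,
                       device=sign_feats.device) < mix_prob).float()
    return mask * gloss_embeds + (1.0 - mask) * sign_feats
```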

Figure 1: The overall framework of cross-modality data augmentation methods for SLT in this work. Components in gray indicate frozen parameters.
We evaluate the proposed XmDA approach on end-to-end SLT performance and analyze the sign representation distributions on the PHOENIX-2014T dataset.
- This code is based on Sign Language Transformers, modified to implement Cross-modality KD and Cross-modality Mix-up.
- For the baseline end-to-end SLT, you can use Sign Language Transformers.
- For the gloss-to-text teacher models, you can follow PGen or use the original text-to-text Joey NMT framework.
- To put them together, we extend the Sign Language Transformers framework with Joey NMT so that the resulting model can forward gloss-to-text and mixup-to-text (i.e., `forward_type` in `[sign, gloss, mixup]`); see the sketch below.
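A minimal sketch of how such a `forward_type` dispatch could look, reusing `cross_modality_mixup` from above; the function name and arguments here are hypothetical, not the actual signatures in this repo:

```python
def build_encoder_input(sign_feats, gloss_embeds,
                        forward_type="sign", mix_prob=0.5):
    """Select which representation feeds the shared translation encoder."""
    if forward_type == "sign":    # end-to-end SLT path (video-to-text)
        return sign_feats
    if forward_type == "gloss":   # teacher path (text-to-text)
        return gloss_embeds
    if forward_type == "mixup":   # mixed-modal augmented path
        return cross_modality_mixup(sign_feats, gloss_embeds, mix_prob)
    raise ValueError(f"unknown forward_type: {forward_type!r}")
```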
- Create the environment following Sign Language Transformers.
- Reproduce PGen to obtain multiple references as sentence-level guidance from the gloss-to-text teachers (or use `forward_type = gloss`); see the KD sketch after this list.
- Reproduce SMKD to pre-process the sign videos.
- Pre-process the dataset and put it into `./data/DATA-NAME/` (see https://github.com/neccam/slt for the format).
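The teacher references then serve as soft, sequence-level supervision. Below is a minimal sketch of how Cross-modality KD training pairs could be assembled; the names (`kd_training_pairs`, `teacher_texts`) are illustrative, not this repo's API:

```python
def kd_training_pairs(sign_feats, gold_text, teacher_texts):
    """Sequence-level KD sketch: pair each sign video with the gold
    translation plus the diversified translations produced by the
    gloss-to-text teachers (e.g., obtained via PGen), so the student
    trains on softened, multi-reference targets."""
    pairs = [(sign_feats, gold_text)]                  # ground-truth target
    pairs += [(sign_feats, t) for t in teacher_texts]  # teacher targets
    return pairs
```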
Then start training:

```
python -m signjoey train_XmDA configs/Sign_XmDA.yaml
```
Note that the default data directory is `./data`. If you downloaded the data to somewhere else, you need to update the `data_path` parameter in your config file.
- Initial code release.
- Release pre-processed dataset.
- Share extensive qualitative and quantitative results & config files to generate them.
Please cite the paper below if you use this code in your research:
```bibtex
@article{ye2023cross,
  title   = {Cross-modality Data Augmentation for End-to-End Sign Language Translation},
  author  = {Ye, Jinhui and Jiao, Wenxiang and Wang, Xing and Tu, Zhaopeng and Xiong, Hui},
  journal = {arXiv preprint arXiv:2305.11096},
  year    = {2023}
}
```