FBK-fairseq-ST

This code is not maintained any more! Please follow the new repository that is currently maintained.

FBK-fairseq-ST is an adaptation of FAIR's fairseq for direct speech translation. This repo is no longer active; please refer to the fork that is currently under development.

This software has been used for the experiments of several of our publications, including Adapting Transformer to End-to-End Spoken Language Translation (see the Citation section below).

It also implements the speech translation model proposed in End-to-End Automatic Speech Translation of Audiobooks and the Gaussian distance penalty introduced in Self-Attentional Acoustic Models.

The pre-trained models for those papers can be found here and the respective dictionaries can be found here.

At the bottom of this file you can find the official documentation of this fairseq-py version.

Requirements and Installation

This code requires PyTorch (version >= 0.4.0 at the time of development). Please follow the installation instructions here: https://github.com/pytorch/pytorch#installation.

After PyTorch is installed, you can install fairseq with:

git clone git@github.com:mattiadg/FBK-Fairseq-ST.git
cd FBK-Fairseq-ST

pip install -r requirements.txt
python setup.py build develop
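
To quickly verify that the build is usable, a short Python check like the following can help (this is only a sanity check; it assumes the package installs under the upstream name fairseq):

# Minimal installation sanity check (module names assumed as in upstream fairseq).
import torch
import fairseq  # should import without errors after `python setup.py build develop`

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # training requires at least one visible GPU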

Preprocessing: Data preparation

To reproduce our experiments, the textual side of the data must first be tokenized and then split into characters. For tokenization we used the Moses scripts:

$moses_scripts/tokenizer/tokenizer.perl -l $LANG < $INPUT_FILE | $moses_scripts/tokenizer/deescape-special-chars.perl > $INPUT_FILE.tok
bash FBK-fairseq-st/scripts/word_level2char_level.sh $INPUT_FILE.tok
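
The character-level conversion itself is performed by word_level2char_level.sh. Purely as an illustration of the idea (not a replacement for the script, and the word-boundary marker used here is an assumption), the transformation looks roughly like this:

# Illustrative sketch of character-level splitting; the real conversion is done by word_level2char_level.sh.
WORD_SEP = "_"  # hypothetical word-boundary marker; the script may use a different symbol

def to_char_level(line):
    # e.g. "wir denken" -> "w i r _ d e n k e n"
    tokens = []
    for i, word in enumerate(line.strip().split()):
        if i > 0:
            tokens.append(WORD_SEP)
        tokens.extend(word)  # extending with a string appends its characters
    return " ".join(tokens)

if __name__ == "__main__":
    import sys
    for line in sys.stdin:
        print(to_char_level(line))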

Preprocessing: binarizing data

As of now, the only supported audio formats are .npz and .h5.

python preprocess.py -s <source_language> -t <target_language> --format <h5 | npz> --inputtype audio \
	--trainpref <path_to_train_data> [[--validpref <path_to_validation_data>] \
	[--testpref <path_to_test_data>]] --destdir <path to output folder>

Remember that the audio file and the corresponding text file must have the same name and be in the same folder, e.g. if you want to binarize IWSLT en-de data in foo/bar/train_iwslt, an example of the file structure is as follows:

foo
|---- bar
	|----- train_iwslt
			  |---- my_data.npz
			  |---- my_data.de

In this example --trainpref should then be foo/bar/train_iwslt/my_data. The same holds for --validpref and --testpref, i.e. the test and valid (dev) sets can live in different folders, but within each split the audio file and the corresponding text file must have the same name.
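
Since each split is located through a single prefix, a small check such as the following hypothetical helper (not part of the repository) can catch naming or placement mistakes before running preprocess.py:

# Hypothetical helper: verify that a split prefix points to matching audio and text files.
import os
import sys

def check_split(prefix, tgt_lang, audio_ext="npz"):
    """E.g. check_split('foo/bar/train_iwslt/my_data', 'de') expects my_data.npz and my_data.de."""
    expected = ["{}.{}".format(prefix, audio_ext), "{}.{}".format(prefix, tgt_lang)]
    missing = [path for path in expected if not os.path.isfile(path)]
    if missing:
        sys.exit("missing file(s): {}".format(", ".join(missing)))
    print("ok:", prefix)

if __name__ == "__main__":
    # usage: python check_prefix.py <prefix> <target_language> [npz|h5]
    check_split(*sys.argv[1:])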

Training a new model

This is the minimum required command to train a seq2seq model for ST.

NOTE: training on CPU is not supported; CUDA_VISIBLE_DEVICES must be set, otherwise all available GPUs will be used.

python train.py <path_to_folder_with_binary_data> \
	--arch {fconv | transformer | ast_seq2seq | ber2transf | ...} \
	--save-dir <path_where_to_store_model_checkpoints> --task translation --audio-input

The path to the binarized data should point to the FOLDER, e.g. if your data is in foo/bar/{train-en-de.bin, train-en-de.idx, ...}, then <path_to_folder_with_binary_data> should be foo/bar.

Specific architecture variants can be selected with --arch, e.g. transformer_iwslt_fbk is a variant of the transformer created by us; the differences are in the number of layers, dropout value, etc. The registered architectures can also be listed programmatically, as sketched after the list below.

Available architectures:
			transformer, transformer_iwslt_fbk,
			transformer_iwslt_de_en, transformer_wmt_en_de,
			transformer_vaswani_wmt_en_de_big,
			transformer_vaswani_wmt_en_fr_big,
			transformer_wmt_en_de_big,
			transformer_wmt_en_de_big_t2t, ast_seq2seq,
			ber2transf, transf2ber, fconv,
			fconv_iwslt_de_en, fconv_wmt_en_ro,
			fconv_wmt_en_de, fconv_wmt_en_fr (default: fconv)
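
If you are unsure which architectures are registered in your checkout, they can also be listed from Python; this assumes the registry is exposed under the upstream fairseq name ARCH_MODEL_REGISTRY:

# Assumes the upstream fairseq registry name; prints every registered --arch value.
from fairseq.models import ARCH_MODEL_REGISTRY

for arch_name in sorted(ARCH_MODEL_REGISTRY):
    print(arch_name)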

The architecture used for IWSLT2018 is ast_seq2seq. To reproduce our IWSLT2018 result on the IWSLT en-de TED corpus, the following command should be used (please substitute the bracketed values accordingly):

CUDA_VISIBLE_DEVICES=[gpu id] python train.py [path to binarized IWSLT data] \
    --clip-norm 5 \
    --max-sentences 32 \
    --max-tokens 100000 \
    --save-dir [output folder] \
    --max-epoch 150 \
    --lr 0.001 \
    --lr-shrink 1.0 \
    --min-lr 1e-08 \
    --dropout 0.2 \
    --lr-schedule fixed \
    --optimizer adam \
    --arch ast_seq2seq \
    --decoder-attention True \
    --seed 666 \
    --task translation \
    --skip-invalid-size-inputs-valid-test \
    --sentence-avg \
    --attention-type general \
    --learn-initial-state \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1

To reproduce the results on MuST-C of the paper "Adapting Transformer to End-to-End Spoken Language Translation", run the following (on 4 GPUs):


CUDA_VISIBLE_DEVICES=[gpu id] python train.py [path to binarized MuST-C data] \
    --clip-norm 20 \
    --max-sentences 8 \
    --max-tokens 12000 \
    --save-dir [output folder] \
    --max-epoch 100 \
    --lr 5e-3 \
    --min-lr 1e-08 \
    --dropout 0.1 \
    --lr-schedule inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr 3e-4 \
    --optimizer adam \
    --arch speechconvtransformer_big \
    --task translation \
    --audio-input \
    --max-source-positions 1400 --max-target-positions 300 \
    --update-freq 16 \
    --skip-invalid-size-inputs-valid-test \
    --sentence-avg \
    --distance-penalty {log,gauss} \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1

Architecture-specific parameters can be specified through command-line arguments (highest priority) or through code. Code must be added at the end of a model file in fairseq/models/ using the @register_model_architecture decorator. The possible parameters to change can be found in the add_args method of every model file.
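
For example, a new named variant of the transformer could be registered at the end of fairseq/models/transformer.py with a sketch like the one below (the variant name and the overridden values are purely illustrative; register_model_architecture and base_architecture are already available in that file):

# Sketch only: registering a custom architecture variant at the end of a model file.
# base_architecture fills in the remaining defaults; command-line arguments keep the highest priority.
@register_model_architecture('transformer', 'transformer_fbk_example')
def transformer_fbk_example(args):
    args.encoder_layers = getattr(args, 'encoder_layers', 4)  # illustrative override
    args.decoder_layers = getattr(args, 'decoder_layers', 4)  # illustrative override
    args.dropout = getattr(args, 'dropout', 0.3)              # illustrative override
    base_architecture(args)

The variant can then be selected at training time with --arch transformer_fbk_example.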

Generation: translating audio

python generate.py <path_to_binarized_data_FOLDER> --path \
	<path_to_checkpoint_to_use> --task translation --audio-input \
	[[--gen-subset valid] [--beam 5] [--batch 32] \
	[--quiet] [--skip-invalid-size-inputs-valid-test]]

With the --quiet flag, only the translations (i.e. the model's hypotheses) are printed to stdout; no probabilities or reference translations are shown. Remove the --quiet flag for more verbose output.

NOTE: translations are generated in length order (shortest samples first), so the output order is not the same as in the original [test | dev] set. This is an optimization trick applied by fairseq at generation time. As a consequence, the original reference translations are usually not suitable for computing the BLEU score: a reference file sorted in the same way must be used. It is IMPORTANT to note that this sorted reference file depends on the --batch value used to generate the hypotheses, because the length-based output order also depends on that value; a separate reference file must therefore be created for every --batch value used.

A reference file can be created by generating the translations without the --quiet flag, redirecting stdout to a file, passing that file to the script sort-sentences.py, and then converting the result back to words:

python sort-sentences.py $TRANSLATION 5 > $TRANSLATION.sort
sh extract_words.sh $TRANSLATION.sort
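
Alternatively, since in verbose mode generate.py prefixes every line with the sample id (T-<id> for references and H-<id> for hypotheses, as shown in the scoring example of the official documentation below), hypotheses and references can be re-paired by id instead of keeping one sorted reference file per --batch value. A rough sketch, assuming that tab-separated output format:

# Sketch: re-pair hypotheses and references by sample id from generate.py verbose output.
# Assumed line formats:
#   T-<id>\t<reference>
#   H-<id>\t<score>\t<hypothesis>
import sys

hyps, refs = {}, {}
with open(sys.argv[1]) as f:  # file containing the redirected stdout of generate.py (without --quiet)
    for line in f:
        if line.startswith("H-"):
            sample, _, text = line.rstrip("\n").split("\t", 2)
            hyps[int(sample[2:])] = text
        elif line.startswith("T-"):
            sample, text = line.rstrip("\n").split("\t", 1)
            refs[int(sample[2:])] = text

with open("hyp.txt", "w") as h, open("ref.txt", "w") as r:
    for i in sorted(hyps):
        h.write(hyps[i] + "\n")
        r.write(refs.get(i, "") + "\n")

Both output files are still at the character level at this point; convert them back to words (e.g. with extract_words.sh) before computing BLEU.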

For all other aspects, please refer to the official fairseq-py documentation.

Note: the official fairseq-py documentation does not cover audio processing, and it evolves with upstream fairseq development, so it may be incompatible with our version.

Citation

If you use this software for your research, then please cite it as:

@article{di2019adapting,
  title={Adapting Transformer to End-to-End Spoken Language Translation},
  author={Di Gangi, Mattia A and Negri, Matteo and Turchi, Marco},
  journal={Proc. Interspeech 2019},
  pages={1133--1137},
  year={2019}
}

Acknowledgment

This codebase is part of a project financially supported by an Amazon ML Grant.

======================================

The following was the official fairseq-py documentation when we began developing FBK-fairseq (August 2018).

Introduction

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks. It provides reference implementations of various sequence-to-sequence models.

Fairseq features:

  • multi-GPU (distributed) training on one machine or across multiple machines
  • fast beam search generation on both CPU and GPU
  • large mini-batch training even on a single GPU via delayed updates
  • fast half-precision floating point (FP16) training
  • extensible: easily register new models, criterions, and tasks

We also provide pre-trained models for several benchmark translation and language modeling datasets.


Requirements and Installation

Currently fairseq requires PyTorch version >= 0.4.0. Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run.

After PyTorch is installed, you can install fairseq with:

pip install -r requirements.txt
python setup.py build develop

Getting Started

The full documentation contains instructions for getting started, training new models and extending fairseq with new model types and tasks.

Pre-trained Models

We provide the following pre-trained models and pre-processed, binarized test sets:

Translation

Description | Dataset | Model | Test set(s)
Convolutional (Gehring et al., 2017) | WMT14 English-French | download (.tar.bz2) | newstest2014: download (.tar.bz2); newstest2012/2013: download (.tar.bz2)
Convolutional (Gehring et al., 2017) | WMT14 English-German | download (.tar.bz2) | newstest2014: download (.tar.bz2)
Convolutional (Gehring et al., 2017) | WMT17 English-German | download (.tar.bz2) | newstest2014: download (.tar.bz2)
Transformer (Ott et al., 2018) | WMT14 English-French | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2)
Transformer (Ott et al., 2018) | WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2)
Transformer (Edunov et al., 2018; WMT'18 winner) | WMT'18 English-German | download (.tar.bz2) | See NOTE in the archive

Language models

Description | Dataset | Model | Test set(s)
Convolutional (Dauphin et al., 2017) | Google Billion Words | download (.tar.bz2) | download (.tar.bz2)
Convolutional (Dauphin et al., 2017) | WikiText-103 | download (.tar.bz2) | download (.tar.bz2)

Stories

Description | Dataset | Model | Test set(s)
Stories with Convolutional Model (Fan et al., 2018) | WritingPrompts | download (.tar.bz2) | download (.tar.bz2)

Usage

Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:

$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/model.pt \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

Join the fairseq community

License

fairseq(-py) is BSD-licensed. The license applies to the pre-trained models as well. We also provide an additional patent grant.

Credits

This is a PyTorch version of fairseq, a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross.
