A text-only implementation of fairseq trained on the VaTeX dataset's English (EN) and Simplified Chinese (ZH) caption datasets.
This model reaches an average BLEU4 score of 16.76.
This model relies on pre-built shell scripts for installation. These scripts will run without necessary modification, but modifications can be made to optimize or customize installation.
Fairseq-VaTeX-ZH-EN requires several packages. All are installed by scripts, but it may be helpful to keep track of installations by this repository.
- Python 3.0 or greater
- fairseq
- NVIDIA apex
- CUDA 10.2
- nltk
- punkt
- sacremoses
- jieba
- subword-nmt
To install Fairseq-VaTeX-ZH-EN
, clone this repository and cd
into it. Run the prepare.sh
script to install the above requirements list and format folders. Depending on shell permissions, the script may need to be elevated with chmod 755
.
git clone https://github.com/BraedenLB/Fairseq-VaTeX-ZH-EN.git
cd Fairseq-VaTeX-ZH-EN
#chmod 755 prepare.sh
bash prepare.sh
This implementation of Fairseq trains on text-only caption data from the VaTeX dataset. The VaTeX datasets (Training Set v1.0
, Validaiton Set v1.0
, Public Test Set v1.1
) are downloaded from the VaTeX GitHub, tokenized, and processed in the preprocess.sh
script.
In the VaTeX dataset, the final 5 captions of each video are parallel translations. These parallel captions can be selected for use, but the full dataset is used by default. Additionally, the Public Test Set v1.1
set does not contain any Simplified Chinese (ZH), so 10,000 English (EN) and Chinese (ZH) captions are taken from the Validaiton Set v1.0
set to test the model.
The vatex_preprocess.py
script has optional arguments to modify dataset usage:
-f True
[True/False] if True, all captions. If False, utilize only parallel translations-t 10000
[int] number of captions from validation set to be used for testing (maximum 29,999)
NOTE: The small size of the validation and testing sets may raise warnings from Fairseq, but the model will still train and test.
The raw text JSON files are downloaded with wget
and tokenized with the vatex_preprocess.py
script in the preprocess.sh
script. English (EN) captions are tokenized using nltk's word_tokenize
. Chinese (ZH) captions are tokenized using jieba's lcut
. The jieba model utilized a pretrained, built-in dictionary to tokenize characters.
Tokenized .en
and .zh
files are created for each set train
, val
, and test
and saved in /vatex/tok
.
- mosesdecoder was also tested as a replacement for nltk, bit it was found that it unecessarily increased the complexity of tokenization.
Byte Pair Encoding (BPE) is used to encode tokenized caption data into a format readable by Fariseq. Because the BPE algorithm is language-independent, the same processing is done on both English (EN) and Chinese (ZH) datasets. The following implementations of BPE were tested:
The implemented method uses a hybrid of seq2seq and Chinese-English NMT, which both implement subword-nmt. Other tested methods were faster or more thorough, but Fairseq expects data encoded by the subword-nmt
package.
To preprocess the VaTeX data, run the preprocess.sh
script. Depending on shell permissions, the script may need to be elevated with chmod 755
.
#chmod 755 preprocess.sh
bash preprocess.sh
Once preprocessing has completed, define the locations of BPE files.
Variable | Location |
---|---|
TRAIN | ../Fairseq-VaTeX-ZH-EN/vatex/bpe/train.bpe10000 |
VAL | ../Fairseq-VaTeX-ZH-EN/vatex/bpe/val.bpe10000 |
TEST | ../Fairseq-VaTeX-ZH-EN/vatex/bpe/test.bpe10000 |
DATADIR | ../Fairseq-VaTeX-ZH-EN/vatex/data |
MODELS | ../Fairseq-VaTeX-ZH-EN/models |
Then, run fairseq-preprocess
with the following arguments:
cd ../Fairseq-VaTeX-ZH-EN
fairseq-preprocess --source-lang zh --target-lang en --trainpref $TRAIN --validpref $VAL --destdir $DATADIR --testpref $TEST --workers 20
This model trains using the lightconv_wmt_zh_en_big
ZH-EN model architecture.
To train the model, run fairseq-train
with the arguments below. Depending on system settings, directory variables (e.g., $MODELS
) may need to be replaced with directory strings (e.g., "../Fairseq-VaTeX-ZH-EN"
).
cd ../Fairseq-VaTeX-ZH-EN
fairseq-train $DATADIR --task translation --arch lightconv_wmt_zh_en_big --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 600 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --eval-bleu --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --max-epoch 10 --save-dir $MODELS --save-interval 5
Once the model has been trained, it can be tested with fairseq-generate
with the arguments below. Depending on system settings, directory variables (e.g., $DATADIR
) may need to be replaced with directory strings (e.g., "../Fairseq-VaTeX-ZH-EN/vatex/data"
).
cd ../Fairseq-VaTeX-ZH-EN
fairseq-generate $DATADIR --task translation --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' --path $MODELS/checkpoint_best.pt