This repo contains the code for Causality for Machine Translation.
stores the scripts to run NMT
stores the code to generate MT data between cause and effect
data of five language pairs with clear translation
BPE-parsed data/raw_datasets
data of five language pairs with clear translation direction and topic
BPE-parsed data/raw_topic_datasets, filtered with length ratio 1.5.
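The README does not say how the length ratio is measured. The following is a minimal sketch of one common interpretation, assuming the ratio is computed on whitespace-separated tokens and applied in both directions. The function name `filter_length_ratio` and the file names in the usage comment are hypothetical and are not part of this repository.

```shell
# Hypothetical helper (not from this repo): keep only sentence pairs whose
# token counts differ by at most a factor of max_ratio in either direction.
# Assumption: "length" means the number of whitespace-separated tokens.
filter_length_ratio() {
  local src="$1"
  local tgt="$2"
  local max_ratio="${3:-1.5}"
  # Join the two files line by line with a tab, then filter with awk.
  paste "$src" "$tgt" | awk -F'\t' -v r="$max_ratio" '
    {
      ns = split($1, a, " ")   # token count on the source side
      nt = split($2, b, " ")   # token count on the target side
      if (ns > 0 && nt > 0 && ns <= r * nt && nt <= r * ns) print
    }'
}

# Example call (file names are placeholders):
# filter_length_ratio train.en train.es 1.5 > train.filtered.tsv
```

The output is tab-separated (source, then target), so the two sides can be split again with `cut -f1` and `cut -f2`.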
Copy and run the following:
# $DIR_ABS_PATH="a path that you prefer"
# 1. prepare conda environment
conda create -n fairseq python==3.7 --yes
source activate fairseq
cd code/tools/fairseq
pip install --editable ./
# return to the repository root so the relative paths below resolve
cd ../../..
pip install -r code/requirements.txt
pip install sacrebleu==1.5.1
mkdir -p data/run_nmt/ssl
To run the supervised experiments, first prepare the supervised training data with causal:anti-causal ratios of 0:100, 25:75, 50:50, 75:25, and 100:0. The following is an example of preparing en-es data.
# Translation from En to Es
bash code/ 0 100 0 0 0 data/bpe_datasets
bash code/ 0 75 25 0 0 data/bpe_datasets
bash code/ 0 50 50 0 0 data/bpe_datasets
bash code/ 0 25 75 0 0 data/bpe_datasets
bash code/ 0 0 100 0 0 data/bpe_datasets
# Translation from Es to En
bash code/ 0 100 0 1 0 data/bpe_datasets
bash code/ 0 75 25 1 0 data/bpe_datasets
bash code/ 0 50 50 1 0 data/bpe_datasets
bash code/ 0 25 75 1 0 data/bpe_datasets
bash code/ 0 0 100 1 0 data/bpe_datasets
Here, bash code/ {lang_ix} {cau_ratio} {ant_ratio} {mt_dir} {if_ssl} {dataset}
- `lang_ix` (language pair index) can be `0`
- `cau_ratio` (causal ratio) is the percentage of causal training data in the mix; it can be a float between 0 and 100. Causal and anti-causal are defined according to the alphabetical order of the languages, e.g. for en-es, `en->es` is causal and `es->en` is anti-causal
- `ant_ratio` (anti-causal ratio) is the percentage of anti-causal training data in the mix, similar to `cau_ratio`
- `mt_dir` (machine translation direction) can be `0` (alphabetical order, e.g. from en to es) or `1` (reverse alphabetical order, e.g. from es to en)
- `if_ssl` (if using SSL to train) can be `0` (no ssl), `1` (back-translation)
- `dataset` is the location of the dataset used

The commands will make directories like data/run_nmt/ssl/all_100_0_en-es_ssl0
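The ten preparation commands follow a regular pattern, so they can be generated in a loop. The sketch below is a dry run that only prints each command together with the folder it is expected to create. `PREPARE_SCRIPT` is a placeholder because the script path is truncated after `code/` in this README, and `run_dir` is a hypothetical helper. The folder naming is inferred from the example `all_100_0_en-es_ssl0`; that `mt_dir=1` flips the pair to `es-en` in the folder name is an assumption.

```shell
# Placeholder: set PREPARE_SCRIPT to the actual preparation script under code/.
PREPARE_SCRIPT="${PREPARE_SCRIPT:-code/}"

# Hypothetical helper (not from this repo): expected output folder for a run.
# Assumes lang_ix 0 is the en-es pair and that mt_dir=1 flips the pair name.
run_dir() {
  local cau_ratio="$1"
  local ant_ratio="$2"
  local mt_dir="$3"
  local if_ssl="$4"
  local pair="en-es"
  if [ "$mt_dir" -eq 1 ]; then
    pair="es-en"
  fi
  echo "data/run_nmt/ssl/all_${cau_ratio}_${ant_ratio}_${pair}_ssl${if_ssl}"
}

# Dry run over both directions and the five integer mixing ratios.
for mt_dir in 0 1; do
  for cau_ratio in 100 75 50 25 0; do
    ant_ratio=$((100 - cau_ratio))   # integer ratios only in this sketch
    echo "bash $PREPARE_SCRIPT 0 $cau_ratio $ant_ratio $mt_dir 0 data/bpe_datasets"
    echo "  -> $(run_dir "$cau_ratio" "$ant_ratio" "$mt_dir" 0)"
  done
done
```

Printing the commands first makes it easy to check the argument order before launching anything.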
The next step is to run the experiments:
# Supervised MT, translating from en to es; data composition `en-to-es`:`es-to-en` = 100:0
bash code/ 0 100_0 0 0 1 0,1,2,3,4,5,6,7
Here, bash code/ {lang_ix} {directory} {mt_dir} {if_ssl} {hyp} {CUDA_VISIBLE_DEVICES}
- `lang_ix` (language pair index) can be `0`
- `directory` is the directory used, e.g. `{100_0}_en-es_ssl0` => directory = `100_0`
- `mt_dir` (machine translation direction) can be `0` (alphabetical order, e.g. from en to es) or `1` (reverse alphabetical order, e.g. from es to en)
- `if_ssl` (if using SSL to train) can be `0` (no ssl) or `1`
- `hyp` (hyperparameter) is the index of the current hyperparameter set
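The `{directory}` argument is the ratio part of a prepared folder name. As a minimal sketch, assuming folders are named `all_{cau_ratio}_{ant_ratio}_{pair}_ssl{n}` as in `data/run_nmt/ssl/all_100_0_en-es_ssl0`, the argument can be recovered with shell parameter expansion. `directory_arg` is a hypothetical helper, not a script in this repository.

```shell
# Hypothetical helper (not from this repo): derive the {directory} argument
# from a prepared folder path.
# Assumes the folder name has the form all_{cau}_{ant}_{pair}_ssl{n}.
directory_arg() {
  local name
  name="$(basename "$1")"        # e.g. all_100_0_en-es_ssl0
  name="${name#all_}"            # drop the "all_" prefix
  echo "$name" | cut -d_ -f1,2   # keep the two ratio fields
}

directory_arg data/run_nmt/ssl/all_100_0_en-es_ssl0   # prints 100_0
```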
Finally, test the supervised model with code/
bash code/ 0 100-0 0 0 1 0
Here, bash code/ {lang_ix} {directory} {mt_dir} {if_ssl} {hyp} {CUDA_VISIBLE_DEVICES}
For the semi-supervised experiments, step one is again data preparation. The following is an example for the en-es pair.
bash code/ 0 data/bpe_datasets
Here, bash code/ {lang_ix} {dataset}
- `lang_ix` (language pair index) can be `0`
- `dataset` is the location of the dataset used
This script constructs all folders for self-training (e.g. data/run_nmt/ssl/all_50_50_en-es_ssl1, data/run_nmt/ssl/all_50_50_es-en_ssl1) and back-translation (e.g. data/run_nmt/ssl/all_50_50_en-es_ssl4, data/run_nmt/ssl/all_50_50_es-en_ssl4). The labeled data in the semi-supervised experiments is always a 50%:50% mix of causal and anti-causal data.
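Before training, it can help to confirm that the preparation step created the four folders named above. The sketch below lists them and reports which ones exist. `expected_ssl_dirs` is a hypothetical helper, and the `ssl1`/`ssl4` suffixes are taken from the example paths in this README rather than from the scripts themselves.

```shell
# Hypothetical helper (not from this repo): list the self-training (ssl1) and
# back-translation (ssl4) folders expected for one language pair.
expected_ssl_dirs() {
  local first="$1"
  local second="$2"
  local ssl pair
  for ssl in 1 4; do
    for pair in "${first}-${second}" "${second}-${first}"; do
      echo "data/run_nmt/ssl/all_50_50_${pair}_ssl${ssl}"
    done
  done
}

# Report which of the expected folders are present (run from the repo root).
expected_ssl_dirs en es | while read -r dir; do
  if [ -d "$dir" ]; then
    echo "ok:      $dir"
  else
    echo "missing: $dir"
  fi
done
```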
Then run self-training (ST) and back-translation (BT) with the following commands:
# SSL by self-training: supervised data (50% en->es) : (50% es->en), unlabeled data 100% en->es (English side only), translating from en to es
bash code/ 0 50_50 0 1 0,1,2,3,4,5,6,7
Here, bash code/ {lang_ix} {directory} {mt_dir} {hyp} {CUDA_VISIBLE_DEVICES}
Self-training can be tested with:
bash code/ 0 50-50 0 2 1 0
Here, bash code/ {lang_ix} {directory} {mt_dir} {if_ssl} {hyp} {CUDA_VISIBLE_DEVICES}
# SSL by back-translation: supervised data (50% en->es) : (50% es->en), unlabeled data 100% es->en (Spanish side only), translating from en to es
bash code/ 0 50_50 0 4 1 0,1,2,3,4,5,6,7
Here, bash code/ {lang_ix} {directory} {mt_dir} {if_ssl} {hyp} {CUDA_VISIBLE_DEVICES}
Back-translation can be tested with:
bash code/ 0 50-50 0 4 1 0
Here, bash code/ {lang_ix} {directory} {mt_dir} {if_ssl} {hyp} {CUDA_VISIBLE_DEVICES}