This is the project page for the TensorFlow implementation of the paper "Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer", to be presented at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
Copyright 2020 Superb AI, Inc.
The code, cache files, and the models are all released under the Apache 2.0 license.
Authors: Channy Hong, Jaeyeon Lee, Jung Kwon Lee.
Paper: AAAI-20 link
Overview blog post: Medium
Script for training the ISR Encoder. Requires monolingual corpora cache files for training.
Prerequisites:
The following cache files saved in the 'data_dir' directory:
- Monolingual corpora sentences cache files, as "mc_##.npy" (e.g. "mc_en.npy") where ## corresponds to the ISO 639-1 Code of each language in 'train_languages'; refer to the Parsing and Caching Scripts section below.
- (If do_mid_train_eval) XNLI dev examples cache file, as "DEV.npy"; refer to Parsing and Caching Scripts section below.
(If do_mid_train_eval) The following model files in the 'mid_train_eval_nli_model_path' directory (the trailing 'nli_solver' is the model name, not part of the directory):
- English NLI solver model files, as "nli_solver.meta", "nli_solver.index", and "nli_solver.data-00000-of-00001" (note that 'mid_train_eval_nli_target_language' should be fixed as English when using this NLI solver).
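Before launching training, it can help to sanity-check that the expected cache files are present and well-formed. A minimal sketch, assuming each "mc_##.npy" stores a 2-D float array of sentence embeddings whose width matches 'embedding_size' (the language-to-code mapping below is an assumption matching the languages used in our study):

```python
import os
import numpy as np

def check_mc_caches(data_dir, train_languages, embedding_size=768):
    """Verify that a monolingual corpus cache exists for every training
    language and that each stores a (num_sentences, embedding_size) array."""
    # Assumed ISO 639-1 codes for the study's training languages.
    iso = {"English": "en", "Spanish": "es", "German": "de",
           "Chinese": "zh", "Arabic": "ar"}
    for lang in train_languages:
        path = os.path.join(data_dir, "mc_%s.npy" % iso[lang])
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        arr = np.load(path)
        assert arr.ndim == 2 and arr.shape[1] == embedding_size, arr.shape
        print("%s: %d sentences" % (path, arr.shape[0]))
```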
python train_isr.py \
--data_dir=data \
--output_dir=outputs/isr_training_model \
--train_languages=English,Spanish,German,Chinese,Arabic \
--embedding_size=768 \
--train_batch_size=32 \
--Dis_Gen_train_ratio=10 \
--Dis_learning_rate=0.00001 \
--Gen_learning_rate=0.00001 \
--lambda_Dis_cls=1.0 \
--lambda_Dis_gp=1.0 \
--lambda_Gen_cls=10.0 \
--lambda_Gen_rec=1.0 \
--lambda_Gen_isr=1.0 \
--beta1=0.5 \
--beta2=0.999 \
--num_train_epochs=100 \
--save_checkpoints_steps=5000 \
--log_losses=True \
--do_mid_train_eval=True \
--run_mid_train_eval_steps=5000 \
--mid_train_eval_nli_target_language=English \
--mid_train_eval_nli_model_path=nli_solver_path/nli_solver
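With the corpus size used in our study (392,702 sentences per language, matching 'output_num_examples' in the Parsing and Caching Scripts section) and the batch size above, the checkpoint cadence works out roughly as follows; an illustrative back-of-the-envelope calculation, assuming one training step per batch:

```python
num_examples = 392702            # sentences per monolingual corpus
train_batch_size = 32
save_checkpoints_steps = 5000

steps_per_epoch = num_examples // train_batch_size
print("steps per epoch:", steps_per_epoch)        # 12271
print("checkpoints per epoch:",
      steps_per_epoch // save_checkpoints_steps)  # 2
```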
Script to access the TensorBoard logs of the various losses (and, if do_mid_train_eval, mid-train evaluation accuracies) to help decide when to halt training of the ISR Encoder. In our study, we stopped training when the generator seemed reasonably capable of generating sentences in the correct target domain (classification task accuracy) without losing the semantics of the original sentence (English NLI solving accuracy).
tensorboard --port=6006 --logdir=outputs/isr_training_model
Code for training a classifier on top of the fixed ISR Encoder. Requires NLI training examples (mostly available in a high-resource language, i.e. English) for training.
Prerequisites:
The following cache files saved in the 'data_dir' directory:
- NLI training examples cache file(s), as "bse_##.npy" (e.g. "bse_en.npy") where ## corresponds to the ISO 639-1 Code of each language in 'xnli_train_languages'; refer to the Parsing and Caching Scripts section below. In principle, NLI training examples from multiple languages can be used jointly for training the classifier on top of ISR (though the underlying assumption is that only English training examples are widely available currently).
- (if do_mid_train_eval) XNLI dev examples cache file, as "DEV.npy".
The following files in the 'isr_encoder_dir' directory:
- The ISR Encoder model files output from the ISR Encoder Training section above, as "isr_encoder.meta", "isr_encoder.index", and "isr_encoder.data-00000-of-00001". Alternatively, the ISR Encoder trained during our study can be downloaded here: "isr_encoder.meta", "isr_encoder.index", and "isr_encoder.data-00000-of-00001".
- The language reference file, as "language_reference.json". The language reference file corresponding to our study's ISR Encoder can be downloaded here: "language_reference.json".
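The language reference file records which languages the ISR Encoder was trained with. A minimal sketch of loading it (the exact key/value layout shown in the docstring is an assumption; inspect the downloaded file to confirm):

```python
import json

def load_language_reference(path):
    """Load the language reference used by the ISR Encoder.
    Assumed layout: {"en": 0, "es": 1, ...} mapping ISO 639-1 codes
    to integer indices -- verify against the actual file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```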
python train_classifier.py \
--data_dir=data \
--isr_encoder_dir=isr_encoder_dir \
--isr_encoder_name=isr_encoder \
--output_dir=outputs/custom_output_model_name \
--xnli_train_languages=English \
--embedding_size=768 \
--train_batch_size=32 \
--dropout_rate=0.5 \
--learning_rate=0.00005 \
--beta1=0.9 \
--beta2=0.999 \
--num_train_epochs=100 \
--save_checkpoints_steps=5000 \
--log_losses=True \
--do_mid_train_eval=True \
--mid_train_xnli_eval_languages=English,Spanish,German,Chinese,Arabic \
--run_mid_train_eval_steps=5000 \
--mid_train_eval_batch_size=32
Script to access the TensorBoard logs of the classifier loss and training accuracy (and, if do_mid_train_eval, evaluation accuracies on dev examples).
tensorboard --port=6006 --logdir=outputs/custom_output_model_name
1. Download the Wikipedia dump of the language of interest (.XML file).
2. Use WikiExtractor to extract and clean text from the XML file, outputting a file (e.g. "wiki_00") in the "AA" folder within the 'output' directory. The "100G" 'bytes' parameter in our sample usage ensures that only one file is output (rather than the text being broken up into multiple files):
Prerequisites:
- The downloaded dump file (e.g. "en_dump.xml") in the current directory.
python WikiExtractor.py \
--output=en_extracted \
--bytes=100G \
en_dump.xml
3. Run mc_custom_extraction.py on the once-extracted file to perform custom extraction and cleanup, outputting a .txt file.
Prerequisites:
- The once-extracted dump file renamed to its ISO 639-1 Code (e.g. for the once-extracted English dump file, renamed from "wiki_00" to "en") in the 'source_file_path' directory.
python mc_custom_extraction.py \
--source_file_path=once_extracted \
--output_dir=custom_extracted \
--language=en \
--char_count_lower_bound=4 \
--char_count_upper_bound=385 \
--output_num_examples=392702
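The char-count bounds above filter out sentences that are too short or too long before the corpus is capped at 'output_num_examples'. An illustrative sketch of that filtering step (not the script itself; mc_custom_extraction.py also performs other cleanup):

```python
def filter_sentences(lines, lower=4, upper=385, max_examples=392702):
    """Keep sentences whose character count falls within [lower, upper],
    up to max_examples of them -- mirroring the flags above."""
    kept = []
    for line in lines:
        s = line.strip()
        if lower <= len(s) <= upper:
            kept.append(s)
            if len(kept) == max_examples:
                break
    return kept
```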
The monolingual corpora .txt files used in our study can be downloaded here:
mc_en.txt
mc_es.txt
mc_de.txt
mc_zh.txt
mc_ar.txt
4. Run bse_cache.py to produce cache files.
Prerequisites:
- bert-as-service installed.
- BERT-Base, Multilingual Cased model (refer to BERT Multilingual GitHub page for more details) saved in the 'bert_dir' directory.
- The custom extracted .txt file in the 'data_dir' directory, as "mc_##.txt" where ## corresponds to the ISO 639-1 Code of the text.
python bse_cache.py \
--data_dir=custom_extracted \
--language=English \
--data_type=mc \
--output_dir=bse_cache_files \
--bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12
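The cache produced by this step is, in effect, one fixed-size embedding per line of the input .txt, saved as a NumPy array. An illustrative sketch of that round trip with a stand-in encoder (the real script uses bert-as-service to obtain BERT sentence embeddings; a random encoder is substituted here only so the sketch runs without a BERT server, and the exact cache layout is an assumption):

```python
import numpy as np

def cache_sentences(txt_path, npy_path, encode):
    """Encode each non-empty line of txt_path and save the stacked
    embeddings as npy_path (the assumed mc_##.npy layout)."""
    with open(txt_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    embeddings = encode(sentences)  # expected shape: (len(sentences), 768)
    np.save(npy_path, embeddings.astype(np.float32))
    return embeddings.shape
```

With a bert-as-service server running, 'encode' would be the client's encode function instead of a stand-in.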
The monolingual corpora cache files used in our study can be downloaded here:
mc_en.npy
mc_es.npy
mc_de.npy
mc_zh.npy
mc_ar.npy
1. Download the XNLI dev and test examples ("xnli.dev.tsv" and "xnli.test.tsv") from the XNLI project page. Also download the XNLI machine-translated training examples, which include the original English MNLI training examples (as "multinli.train.en.tsv").
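The XNLI files are tab-separated with a header row. A minimal sketch of pulling out the fields the caching step needs (the column names 'language', 'gold_label', 'sentence1', and 'sentence2' follow the XNLI release; verify against your download):

```python
import csv

def read_xnli_tsv(path, language=None):
    """Yield (premise, hypothesis, label) triples from an XNLI-style TSV,
    optionally restricted to one language (ISO 639-1 code)."""
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if language is None or row["language"] == language:
                yield row["sentence1"], row["sentence2"], row["gold_label"]
```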
2. Run bse_cache.py to produce cache files.
Prerequisites:
- bert-as-service installed.
- BERT-Base, Multilingual Cased model saved in the 'bert_dir' directory.
- The English MNLI training examples file in the 'data_dir' directory, as "multinli.train.en.tsv".
python bse_cache.py \
--data_dir=xnli_data \
--language=English \
--data_type=mnli \
--output_dir=bse_cache_files \
--bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12
The English MNLI training examples cache file used in our study can be downloaded here: bse_en.npy
Prerequisites:
- bert-as-service installed.
- BERT-Base, Multilingual Cased model saved in the 'bert_dir' directory.
- The XNLI dev examples file in the 'data_dir' directory, as "xnli.dev.tsv".
python bse_cache.py \
--data_dir=xnli_data \
--data_type=dev \
--output_dir=bse_cache_files \
--bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12
The XNLI dev examples cache file used in our study can be downloaded here: DEV.npy
Prerequisites:
- bert-as-service installed.
- BERT-Base, Multilingual Cased model saved in the 'bert_dir' directory.
- The XNLI test examples file in the 'data_dir' directory, as "xnli.test.tsv".
python bse_cache.py \
--data_dir=xnli_data \
--data_type=test \
--output_dir=bse_cache_files \
--bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12
The XNLI test examples cache file used in our study can be downloaded here: TEST.npy