ISREncoder

This is the project page for the TensorFlow implementation of the paper "Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer", presented at the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).

Copyright 2020 Superb AI, Inc.
The code, cache files, and the models are all released under the Apache 2.0 license.
Authors: Channy Hong, Jaeyeon Lee, Jung Kwon Lee.

Paper: AAAI-20 link
Overview blog post: Medium


ISR Encoder Training

Script for training the ISR Encoder. Requires monolingual corpora cache files for training.

Prerequisites:

The following cache files saved in the 'data_dir' directory:

  • Monolingual corpora sentences cache files, as "mc_##.npy" (e.g. "mc_en.npy"), where ## is the ISO 639-1 code of each language in 'train_languages'; refer to the Parsing and Caching Scripts section below.
  • (If do_mid_train_eval) The XNLI dev examples cache file, as "DEV.npy"; refer to the Parsing and Caching Scripts section below.

(If do_mid_train_eval) The following model files at the 'mid_train_eval_nli_model_path' path (the trailing 'nli_solver' is the model name, not part of the directory):

  • The NLI solver checkpoint files, as "nli_solver.meta", "nli_solver.index", and "nli_solver.data-00000-of-00001".

python train_isr.py \
  --data_dir=data \
  --output_dir=outputs/isr_training_model \
  --train_languages=English,Spanish,German,Chinese,Arabic \
  --embedding_size=768 \
  --train_batch_size=32 \
  --Dis_Gen_train_ratio=10 \
  --Dis_learning_rate=0.00001 \
  --Gen_learning_rate=0.00001 \
  --lambda_Dis_cls=1.0 \
  --lambda_Dis_gp=1.0 \
  --lambda_Gen_cls=10.0 \
  --lambda_Gen_rec=1.0 \
  --lambda_Gen_isr=1.0 \
  --beta1=0.5 \
  --beta2=0.999 \
  --num_train_epochs=100 \
  --save_checkpoints_steps=5000 \
  --log_losses=True \
  --do_mid_train_eval=True \
  --run_mid_train_eval_steps=5000 \
  --mid_train_eval_nli_target_language=English \
  --mid_train_eval_nli_model_path=nli_solver_path/nli_solver
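
Before launching training, it can help to verify that the required cache files are present and mutually consistent. A minimal sketch, assuming each "mc_##.npy" holds a NumPy array of BERT sentence embeddings of shape [num_sentences, embedding_size] (the exact array layout is defined by bse_cache.py):

import os
import numpy as np

data_dir = "data"
train_language_codes = ["en", "es", "de", "zh", "ar"]  # English, Spanish, German, Chinese, Arabic
embedding_size = 768  # must match the --embedding_size flag

for code in train_language_codes:
    path = os.path.join(data_dir, "mc_{}.npy".format(code))
    assert os.path.exists(path), "missing monolingual corpus cache: " + path
    embeddings = np.load(path)
    print(code, embeddings.shape, embeddings.dtype)
    # Assumption: one fixed-size sentence embedding per row.
    assert embeddings.shape[-1] == embedding_size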

Tensorboard

Script to access the Tensorboard logs of the various losses (and, if do_mid_train_eval, the mid-train evaluation accuracies) to help decide when to halt training of the ISR Encoder. In our study, we stopped training once the generator seemed reasonably capable of generating sentences in the correct target domain (classification task accuracy) without losing the semantics of the original sentence (English NLI solving accuracy).

tensorboard --port=6006 --logdir=outputs/isr_training_model

Classifier Training

Code for training a classifier on top of a fixed ISR Encoder. Requires NLI training examples (mostly available in a high-resource language, i.e. English) for training.

Prerequisites:

The following cache files saved in the 'data_dir' directory:

  • NLI training examples cache file(s), as "bse_##.npy" (e.g. "bse_en.npy"), where ## is the ISO 639-1 code of each language in 'xnli_train_languages'; refer to the Parsing and Caching Scripts section below. In principle, NLI training examples from multiple languages can be used jointly for training the classifier on top of ISR (though the underlying assumption is that only English training examples are widely available at present).
  • (If do_mid_train_eval) The XNLI dev examples cache file, as "DEV.npy".

The following files in the 'isr_encoder_dir' directory:

  • The ISR Encoder model files output by the ISR Encoder Training section above, as "isr_encoder.meta", "isr_encoder.index", and "isr_encoder.data-00000-of-00001". Alternatively, the ISR Encoder trained during our study can be downloaded here: "isr_encoder.meta", "isr_encoder.index", and "isr_encoder.data-00000-of-00001".
  • The language reference file, as "language_reference.json". The language reference file corresponding to our study's ISR Encoder can be downloaded here: "language_reference.json"
python train_classifier.py \
  --data_dir=data \
  --isr_encoder_dir=isr_encoder_dir \
  --isr_encoder_name=isr_encoder \
  --output_dir=outputs/custom_output_model_name \
  --xnli_train_languages=English \
  --embedding_size=768 \
  --train_batch_size=32 \
  --dropout_rate=0.5 \
  --learning_rate=0.00005 \
  --beta1=0.9 \
  --beta2=0.999 \
  --num_train_epochs=100 \
  --save_checkpoints_steps=5000 \
  --log_losses=True \
  --do_mid_train_eval=True \
  --mid_train_xnli_eval_languages=English,Spanish,German,Chinese,Arabic \
  --run_mid_train_eval_steps=5000 \
  --mid_train_eval_batch_size=32
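
The isr_encoder.* files above form a standard TensorFlow 1.x checkpoint, so they can be restored on their own as a sanity check before training the classifier. A minimal sketch that only imports the saved graph and lists a few operations (the tensor names inside the graph are not documented here, so nothing beyond loading is assumed):

import tensorflow as tf  # TensorFlow 1.x

isr_encoder_prefix = "isr_encoder_dir/isr_encoder"  # --isr_encoder_dir + --isr_encoder_name

with tf.Session() as sess:
    # Rebuild the graph from isr_encoder.meta and load the trained weights.
    saver = tf.train.import_meta_graph(isr_encoder_prefix + ".meta")
    saver.restore(sess, isr_encoder_prefix)
    # Print the first few operation names to confirm the graph loaded.
    for op in tf.get_default_graph().get_operations()[:10]:
        print(op.name)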

Tensorboard

Script to access the Tensorboard logs of the classifier loss and training accuracy (and if do_mid_train_eval, evaluation accuracies on dev examples).

tensorboard --port=6006 --logdir=outputs/custom_output_model_name

Parsing and Caching Scripts

Producing a monolingual corpora cache file from a Wikipedia dump

1. Download the Wikipedia dump of the language of interest (.XML file).

2. Use WikiExtractor to extract and clean text from the XML file, outputting a file (e.g. "wiki_00") in the "AA" folder within the 'output' directory. The "100G" 'bytes' parameter in our sample usage ensures that only one file is output (rather than the text being split across multiple files):

Prerequisites:

  • The downloaded dump file (e.g. "en_dump.xml") in the current directory.
python WikiExtractor.py \
  --output=en_extracted \
  --bytes=100G \
  en_dump.xml

3. Run mc_custom_extraction.py on the once-extracted file to perform custom extraction and cleanup, outputting a .txt file.

Prerequisites:

  • The once-extracted dump file, renamed to its ISO 639-1 code (e.g. the once-extracted English dump file renamed from "wiki_00" to "en"), in the 'source_file_path' directory.
python mc_custom_extraction.py \
  --source_file_path=once_extracted \
  --output_dir=custom_extracted \
  --language=en \
  --char_count_lower_bound=4 \
  --char_count_upper_bound=385 \
  --output_num_examples=392702
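
The exact cleanup heuristics live in mc_custom_extraction.py; purely as an illustration of what the flags above imply, the step amounts to keeping sentences whose character count falls within the bounds and capping the output at 'output_num_examples' lines. A rough sketch (not the script's actual logic; file names follow the study's conventions):

# Illustrative only: approximates the character-count filtering implied by the flags.
char_count_lower_bound = 4
char_count_upper_bound = 385
output_num_examples = 392702

kept = []
with open("once_extracted/en", encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if char_count_lower_bound <= len(sentence) <= char_count_upper_bound:
            kept.append(sentence)
        if len(kept) == output_num_examples:
            break

with open("custom_extracted/mc_en.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")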

The monolingual corpora .txt files used in our study can be downloaded here:
mc_en.txt
mc_es.txt
mc_de.txt
mc_zh.txt
mc_ar.txt

4. Run bse_cache.py to produce cache files.

Prerequisites:

  • The custom-extracted corpus .txt file output by mc_custom_extraction.py (e.g. "mc_en.txt") in the 'data_dir' directory.
  • The pretrained BERT-Base, Multilingual Cased model files in the 'bert_dir' directory.
python bse_cache.py \
  --data_dir=custom_extracted \
  --language=English \
  --data_type=mc \
  --output_dir=bse_cache_files \
  --bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12
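
To confirm the caching step worked, you can check that the number of cached embeddings matches the number of sentences in the source corpus. A small sketch, again assuming one embedding row per input sentence and the file names used above:

import numpy as np

# Count non-empty lines in the custom-extracted corpus.
with open("custom_extracted/mc_en.txt", encoding="utf-8") as f:
    num_sentences = sum(1 for line in f if line.strip())

# Load the cache produced by bse_cache.py.
embeddings = np.load("bse_cache_files/mc_en.npy")
print(embeddings.shape)
assert embeddings.shape[0] == num_sentences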

The monolingual corpora cache files used in our study can be downloaded here:
mc_en.npy
mc_es.npy
mc_de.npy
mc_zh.npy
mc_ar.npy

Producing an NLI examples cache file from the XNLI dataset

1. Download the XNLI dev and test examples ("xnli.dev.tsv" and "xnli.test.tsv") from the XNLI project page. Also download the XNLI machine-translated training examples, which include the original English MNLI training examples (as "multinli.train.en.tsv").
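
The downloaded files are plain tab-separated text with a header row, so you can inspect them before caching to confirm they downloaded correctly. A quick sketch using pandas (the column names differ between the dev/test files and the translated training file, so the sketch just prints whatever header is present; 'xnli_data' matches the 'data_dir' used below):

import pandas as pd

for name in ["xnli.dev.tsv", "xnli.test.tsv", "multinli.train.en.tsv"]:
    # quoting=3 (csv.QUOTE_NONE) avoids choking on stray quote characters.
    df = pd.read_csv("xnli_data/" + name, sep="\t", quoting=3)
    print(name, df.shape)
    print(list(df.columns))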

2. Run bse_cache.py to produce cache files.

English MNLI training examples

Prerequisites:

  • The "multinli.train.en.tsv" file in the 'data_dir' directory.
  • The pretrained BERT-Base, Multilingual Cased model files in the 'bert_dir' directory.
python bse_cache.py \
  --data_dir=xnli_data \
  --language=English \
  --data_type=mnli \
  --output_dir=bse_cache_files \
  --bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12

The English MNLI training examples cache file used in our study can be downloaded here: bse_en.npy

XNLI dev examples

Prerequisites:

  • The "xnli.dev.tsv" file in the 'data_dir' directory.
  • The pretrained BERT-Base, Multilingual Cased model files in the 'bert_dir' directory.
python bse_cache.py \
  --data_dir=xnli_data \
  --data_type=dev \
  --output_dir=bse_cache_files \
  --bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12

The XNLI dev examples cache file used in our study can be downloaded here: DEV.npy

XNLI test examples

Prerequisites:

  • The "xnli.test.tsv" file in the 'data_dir' directory.
  • The pretrained BERT-Base, Multilingual Cased model files in the 'bert_dir' directory.
python bse_cache.py \
  --data_dir=xnli_data \
  --data_type=test \
  --output_dir=bse_cache_files \
  --bert_dir=../pretrained_models/multi_cased_L-12_H-768_A-12

The XNLI test examples cache file used in our study can be downloaded here: TEST.npy
