NeuAligner

NeuAligner is a tool that extracts word alignments from contextualized word embeddings and lets you fine-tune those embeddings on parallel corpora for word alignment.

Input format

Inputs should be tokenized, with one sentence pair per line: a source-language sentence and its target-language translation, separated by (|||). You can see some examples in the examples folder.
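
As a rough illustration of this format, the sketch below builds and parses one input line. It is not part of NeuAligner: the helper names format_pair and parse_pair are hypothetical, the sample sentences are made up, and the single spaces around the separator are an assumption (the parser itself only relies on the ||| token).

```python
SEPARATOR = " ||| "  # assumption: the ||| delimiter is surrounded by single spaces


def format_pair(src_tokens, tgt_tokens):
    """Join two lists of tokens into one input line (hypothetical helper)."""
    return " ".join(src_tokens) + SEPARATOR + " ".join(tgt_tokens)


def parse_pair(line):
    """Split one input line back into (source tokens, target tokens)."""
    parts = line.rstrip("\n").split("|||")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one '|||' separator: {line!r}")
    src, tgt = parts
    # Tokens are whitespace-separated because the input is already tokenized.
    return src.split(), tgt.split()


line = format_pair(["Das", "ist", "gut", "."], ["This", "is", "good", "."])
print(line)
print(parse_pair(line))
```

A check like the one in parse_pair can catch lines with a missing or repeated separator before they reach training or alignment.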

Fine-tuning on parallel data

If parallel data is available, you can fine-tune your contextualized embedding model on it. The following example fine-tunes multilingual BERT (we found this hyperparameter setting to be fairly robust):

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 python run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_tlm \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 2000 \
    --max_steps 20000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE \
    --overwrite_output_dir
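
To make the batch-related flags above concrete, here is a small back-of-the-envelope sketch. It assumes a single visible GPU and that --max_steps counts optimizer updates, as is common in the HuggingFace training scripts; neither point is confirmed by this README, so treat the numbers as estimates.

```python
# Values copied from the fine-tuning command above.
per_gpu_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 20000

# Assumption: CUDA_VISIBLE_DEVICES=0 exposes exactly one GPU.
n_gpu = 1

# Sentence pairs contributing to each optimizer update.
effective_batch_size = per_gpu_train_batch_size * n_gpu * gradient_accumulation_steps

# Assumption: max_steps counts optimizer updates, not forward passes.
pairs_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"sentence pairs processed in {max_steps} steps: {pairs_seen}")
```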

Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 python run_align.py \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32 \

This produces outputs in the i-j Pharaoh format. A pair i-j indicates that the ith word of the source sentence is aligned to the jth word of the target sentence (both indices are zero-based).
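
The following sketch shows one way to read such an output line back into word pairs. The helper names parse_pharaoh and aligned_words are hypothetical, and the alignment string is hand-written for illustration rather than actual model output.

```python
def parse_pharaoh(line):
    """Turn 'i-j i-j ...' into a list of (source_index, target_index) pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(src_tokens, tgt_tokens, alignment_line):
    """Map zero-based index pairs back to the words they refer to."""
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_pharaoh(alignment_line)]


src = "Das ist gut .".split()
tgt = "This is good .".split()
# Hand-written alignment for illustration only, not produced by run_align.py.
print(aligned_words(src, tgt, "0-0 1-1 2-2 3-3"))
```

Each line of the output file corresponds to the sentence pair on the same line of the data file, so the two files can be read in parallel.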

You can also set MODEL_NAME_OR_PATH to the path of your fine-tuned model.

Acknowledgements

Some of the code is borrowed from HuggingFace Transformers.
