TRAVIS: (multilingual) Multiword Expression identification model

This repository contains the code for the TRAVIS model built for the PARSEME Shared Task 2020 on semi-supervised identification of verbal multiword expressions. TRAVIS is a fully feature-independent model that relies only on contextual embeddings. The model ranked second in the open track of the shared task; see (Kurfali, 2020) for details.

TRAVIS is a tool for performing multi-word expression (MWE) identification via fine-tuning language models.

Shared Task Data

The shared task data used to train and evaluate TRAVIS can be obtained via:

    git clone https://gitlab.com/parseme/sharedtask-data.git

Pre-processing

TRAVIS models MWE identification as a token classification task. It therefore expects input files in the standard format of one "token \t label" pair per line, with empty lines separating training instances. The .cupt files can be converted to this format using:

    python utils/preprocess.py sharedtask-data_path/1.2 output_path target_language[optional]

The preprocessing script writes the corresponding csv files to output_path.
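To make the expected input format concrete, here is a minimal sketch of a reader for "token \t label" files in which empty lines separate instances. It is not the loader in data_loader.py; the function name `read_token_label_lines` and the labels in the sample ("O", "VID") are assumptions chosen for illustration, and the actual label set produced by utils/preprocess.py may differ.

```python
from typing import Iterable, List, Tuple

# One instance is a list of (token, label) pairs.
Sentence = List[Tuple[str, str]]


def read_token_label_lines(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token/label lines into instances.

    An empty (or whitespace-only) line closes the current instance.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: finish the instance collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so the label is always the final field.
        token, label = line.rsplit("\t", 1)
        current.append((token, label))
    # The file may not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the labels are placeholders, not TRAVIS's label scheme.
sample = "He\tO\nkicked\tVID\nthe\tVID\nbucket\tVID\n\nShe\tO\nslept\tO\n"
parsed = read_token_label_lines(sample.splitlines())
print(len(parsed), parsed[0][1])
```

The same function can be applied to an open file handle, since iterating over a file yields lines.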

Training

Here is a sample script to train a model using TRAVIS:

    python run_ner.py \
      --data_dir path_to_data \
      --lang "tr" \
      --model_name_or_path dbmdz/bert-base-turkish-128k-cased \
      --output_dir path_to_output_dir \
      --max_seq_length 256 \
      --num_train_epochs 3 \
      --per_gpu_train_batch_size 16 \
      --per_gpu_eval_batch_size 32 \
      --do_train \
      --do_eval \
      --overwrite_output_dir

BERT fine-tuning is known to be prone to variance; you may therefore want to fine-tune the model several times with different seeds (see train.sh). In the shared task, we submitted the predictions of the model that performed best on the development set out of four runs.
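The selection step can be sketched as follows, assuming you have collected one development-set score per seed yourself. The function name `select_best_run` is hypothetical, and the scores below are made-up placeholders, not results from the paper; how the scores are read from each run's output is left out because it depends on what run_ner.py writes.

```python
from typing import Dict


def select_best_run(dev_scores: Dict[int, float]) -> int:
    """Return the seed whose run scored highest on the development set.

    If several seeds tie, the first one in the dict's order is returned.
    """
    if not dev_scores:
        raise ValueError("no runs to choose from")
    return max(dev_scores, key=dev_scores.get)


# Placeholder scores for four runs (seed -> dev score); NOT real results.
dev_scores = {1: 0.61, 2: 0.64, 3: 0.59, 4: 0.63}
best_seed = select_best_run(dev_scores)
print(f"use the model trained with seed {best_seed}")
```

Only the model from the selected run is then used for prediction on the test file.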

See run_experiments.sh to run all the experiments described in the paper.

Prediction

To run a trained model on your validation file, you can use the following command (predict.sh):

    python run_ner.py \
      --data_dir path_to_data \
      --model_name_or_path path_to_trained_model \
      --output_dir path_to_trained_model \
      --max_seq_length 512 \
      --num_train_epochs 3 \
      --per_gpu_eval_batch_size 32 \
      --do_predict \
      --predict_file path_to_target_file

The model's predictions are saved to the "predictions" folder. If --predict_file is not provided, prediction is run on the original test file, which is assumed to be in --data_dir.

The predictions on the shared task files can be converted back to the original .cupt format using:

    python utils/postprocess.py prediction_file_path original_cupt_file_path output_dir

Publications

If you use this resource, please consider citing:

@inproceedings{kurfali2020travis,
  title={TRAVIS at PARSEME Shared Task 2020: How good is (m)BERT at seeing the unseen?},
  author={Kurfali, Murathan},
  booktitle={International Conference on Computational Linguistics (COLING), Barcelona, Spain (Online), December 13, 2020},
  pages={136--141},
  year={2020}
}
