
Task 2: Idiomaticity Representation

Task 2 tests models' ability to accurately represent sentences regardless of whether or not they contain idiomatic expressions. This is tested using Semantic Text Similarity (STS): the metric is the Spearman rank correlation between the STS scores a model outputs for sentences containing idiomatic expressions and the scores it outputs for the same sentences with the idiomatic expressions replaced by non-idiomatic paraphrases (which capture the correct meaning of the MWEs).
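
For intuition, here is a minimal sketch of the metric using made-up similarity scores (this is only an illustration; the actual scoring is handled by task2Evaluation.py, described below):

```python
# Toy illustration of the Task 2 metric (made-up numbers, not real model output).
from scipy.stats import spearmanr

# STS scores a model assigns to sentence pairs whose sentences contain an MWE...
sims_with_idiom = [0.81, 0.40, 0.65, 0.92]
# ...and the scores it assigns to the same pairs after replacing the MWE with its paraphrase.
sims_with_paraphrase = [0.78, 0.45, 0.60, 0.95]

# A model that represents idioms well should rank the pairs the same way in both settings.
correlation, _ = spearmanr(sims_with_idiom, sims_with_paraphrase)
print(f"Spearman rank correlation: {correlation:.3f}")
```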

We perform all training 5 times with different random seeds and pick the best performing model.

Please see the paper for more details on the task.

Table of Contents

  • Adding Idiom Tokens to 🤗 Transformers Models
  • Creating Sentence Transformers models
  • Creating the Evaluation Data
  • Generating Pre-Training Data
  • Subtask A - Pre-Training for Idiom Representation
  • Subtask B - Fine-Tuning for Idiom Representation
  • Pre-Trained and Fine-Tuned Models

Adding Idiom Tokens to 🤗 Transformers Models

Since we explore the impact of tokenizing MWEs as single tokens (the idiom principle), we first ensure that these tokens are added to pre-trained language models.

This is done using scripts in the Tokenize folder.

  • downloadModels.py will download the required model from 🤗 Transformers.
  • updateVocab.py updates the vocabulary of the model. (This uses the "unused" tokens, so it currently only works for BERT and mBERT; use tokenizer.add_tokens, as described here, for a generic solution. See the sketch after this list.)
  • tokenCheck.py will run a check to ensure that the tokenizer now tokenizes idioms with a single token.
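
For reference, here is a minimal sketch of the generic tokenizer.add_tokens approach mentioned above, together with the kind of single-token check that tokenCheck.py performs. The underscore-joined token format, the example MWEs and the output path are assumptions for illustration only:

```python
# Sketch: add single-token MWEs via tokenizer.add_tokens (the generic alternative
# to the "unused token" trick in updateVocab.py).
from transformers import AutoModel, AutoTokenizer

mwe_tokens = ["big_fish", "wet_blanket"]  # hypothetical single-token forms of MWEs

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(mwe_tokens)               # register the new tokens
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

# The kind of check tokenCheck.py performs: each MWE should map to exactly one token.
for mwe in mwe_tokens:
    pieces = tokenizer.tokenize(f"He turned out to be a {mwe} after all .")
    assert pieces.count(mwe) == 1, pieces

tokenizer.save_pretrained("bert-base-uncased-idioms")  # hypothetical output path
model.save_pretrained("bert-base-uncased-idioms")
```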

Creating Sentence Transformers models

We use Sentence Transformers to generate sentence embeddings that can be compared using cosine similarity.

We modify the original package to allow it to handle the updated tokenization. Please install the version provided with this repository.

Here are the steps to create a Sentence Transformer Model:
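
In outline, this amounts to the following (a minimal sketch of the standard sentence-transformers STS training recipe; "bert-base-uncased-idioms" and the two training pairs are placeholders, not the data used in the paper, and the scripts in this repository are the authoritative version):

```python
# Sketch: turn the idiom-token checkpoint into a Sentence Transformer and train it
# on ordinary STS pairs so that cosine similarity between embeddings is meaningful.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

word_embedding_model = models.Transformer("bert-base-uncased-idioms", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Ordinary STS data: sentence pairs with a gold similarity scaled to [0, 1].
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays the guitar."], label=0.9),
    InputExample(texts=["A man is playing a guitar.", "A cat sits on a mat."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("sent-trans-bert-idioms")  # hypothetical output path
```

Remember to install the modified sentence-transformers package provided with this repository rather than the upstream release, so the idiom tokens are handled correctly.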

Creating the Evaluation Data

Since this task requires models to be self-consistent, we need to create the evaluation data (or format it for use with our models) using a model that outputs semantic text similarity (such as the one trained above).

This is done using scripts in the folder CreateEvaluationData.

  • Start with the evaluation data available in the "NoResults" folders for EN and PT. These folders contain additional information regarding tokenization (for "select" tokenize and "all" tokenize) and similarities (which is what we need to ensure consistency). This data is created using the script createEvalData.py, but running that script yourself is NOT recommended, as it might generate a slightly different dataset depending on your random number generator.
  • Run predictSentSims.py (with the STS model created above) to generate sentence similarities (a sketch of this step follows this list).
  • Run runGlueEval.sh with the model used to identify idioms to differentiate between "all" tokenized and "select" tokenized (we use the one-shot model from Task 1 A).
  • Run combineCreateFinalEvalData.py to generate the final evaluation data.
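
For reference, a minimal sketch of the similarity-prediction step (conceptually what predictSentSims.py does) is shown below; the model path and the example sentence pair are placeholders:

```python
# Sketch: score sentence pairs with the STS-trained Sentence Transformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sent-trans-bert-idioms")  # hypothetical path from the step above

pairs = [
    ("It was raining cats and dogs all night.", "It was raining heavily all night."),
]

for sent_a, sent_b in pairs:
    emb_a, emb_b = model.encode([sent_a, sent_b], convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(emb_a, emb_b).item()
    print(f"{similarity:.3f}\t{sent_a}\t{sent_b}")
```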

Generating Pre-Training Data

This step is only required for Subtask A.

The processed pre-training data is available for both English and Portuguese.

Extract Data from Common Crawl

This step is only required when not using the pre-training data made available above.

We obtain pre-training data from the Common Crawl News corpus. This can be done using the scripts in the ProcessCommonCrawl folder.

  • processCCNews.py will download CC News from 2020 and store relevant files.
  • createPreTrainData.py will create the data required for pre-training, along with additional files containing the information needed to generate the "select" tokenize pre-trained model (please see the paper for details). A conceptual sketch of the "all replace" idea follows this list.
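
To illustrate the "all replace" idea, here is a minimal, hypothetical sketch of rewriting MWE occurrences as single tokens; the underscore token format and the example sentence are assumptions, and the real data is produced from CC News by the scripts above:

```python
# Conceptual sketch of the "all replace" strategy: every occurrence of an MWE in the
# raw text is rewritten as its single-token form.
import re

mwes = ["big fish", "wet blanket"]

def replace_all(sentence, mwe_list):
    for mwe in mwe_list:
        single_token = mwe.replace(" ", "_")
        sentence = re.sub(re.escape(mwe), single_token, sentence, flags=re.IGNORECASE)
    return sentence

print(replace_all("He is a big fish in a small pond.", mwes))
# -> He is a big_fish in a small pond.
```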

Preparing Pre-train data

This step (also) is only required when not using the pre-training data made available above.

  • The output of the previous step should result in the following files: all_replace_data.txt, classification_sents.csv, no_replace_data.txt and vocab_update.txt.
  • Split all_replace_data.txt into train and eval. (We use split -l 400000 for English and split -l 4000 for PT.)
  • Run runGlue.sh to generate predictions of which usages are idiomatic (these are used to generate the 'select' data).
  • Run createReplaceByPrediction.py to use the predictions above to generate 'select' replaced data for pre-training (see the sketch after this list).
  • Split select_replace_data.txt into train and eval. (We use split -l 400000 for English and split -l 4000 for PT.)
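
A conceptual sketch of the "select" replacement step (what createReplaceByPrediction.py does, under an assumed input format of sentence, MWE and predicted label) is shown below:

```python
# Conceptual sketch of "select" replacement: keep the single-token form only where the
# classifier predicted idiomatic usage. The (sentence, MWE, label) format and the label
# convention are assumptions; see createReplaceByPrediction.py for the real file formats.
IDIOMATIC = 1  # assumed label id for "idiomatic usage"

def select_replace(rows):
    for sentence, mwe, prediction in rows:
        if int(prediction) == IDIOMATIC:
            sentence = sentence.replace(mwe, mwe.replace(" ", "_"))
        yield sentence

rows = [
    ("He is a big fish in a small pond.", "big fish", 1),   # idiomatic usage -> replace
    ("She caught a big fish in the lake.", "big fish", 0),  # literal usage   -> keep as-is
]
for line in select_replace(rows):
    print(line)
```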

Subtask A - Pre-Training for Idiom Representation

Once the evaluation data and pre-training data have been created and the models have been modified to include single tokens for idioms, these scripts can be used for pre-training and evaluation.

Pre-Training

Converting to Sentence Transformer Models

Each of the pre-trained models must be converted to Sentence Transformer models by training them on STS data so their output embeddings can be compared using cosine similarity. This can be done using steps described in the section Creating Sentence Transformers models above.

We do this five times with different seeds and pick the model that performs the best on the ordinary STS dataset used to train Sentence Transformers (which does NOT contain any information on the MWEs we work with).
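
A minimal sketch of this model-selection step might look as follows; the candidate model paths and the toy development examples are placeholders:

```python
# Sketch: pick the best of the five seed runs by their score on the ordinary STS
# development set (which contains no MWE-specific information).
from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays the guitar."], label=0.9),
    InputExample(texts=["A man is playing a guitar.", "A cat sits on a mat."], label=0.1),
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="sts-dev")

candidate_paths = [f"sent-trans-bert-idioms-seed{seed}" for seed in range(5)]  # hypothetical paths
scores = {path: evaluator(SentenceTransformer(path)) for path in candidate_paths}
best_path = max(scores, key=scores.get)
print("Best model:", best_path, "score:", scores[best_path])
```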

Evaluation

You can evaluate the pre-trained representations of MWEs using scripts in the folder Task2/SubtaskA-Pre_Train/Evaluation.

  • We test each of the best models from the previous steps using the common script for task 2 evaluation (task2Evaluation.py).
  • You can run all the tests (default model, default model with special MWE tokenization, and models pre-trained with the "all" and "select" pre-training data) using the script SubtaskA-Pre_Train/Evaluation/eval.sh. Be sure to update the paths of the models. (Please see the paper for an explanation of each of these four variations.)

Subtask B - Fine-Tuning for Idiom Representation

Fine-tuning models to better represent idioms also requires creating (or formatting) training data, in much the same way as the evaluation data was created and formatted. This section describes the steps required to format the training data, train the models and, finally, evaluate them.

Create Fine-Tuning Data

Fine-tuning data can be created using the scripts in the folder Task2/SubtaskB-Fine_Tune/CreateFineTuneData.

  • createFineTuneData.py extracts data from the raw json files and also creates the files needed for predicting idiomaticity (required for "all" tokenized and "select" tokenized) and sentence similarity (required for ensuring self-consistency).
  • predictSentSims.py will predict sentence similarity. This script uses a Sentence Transformers model with idiom tokens added (see the sections Creating Sentence Transformers models and Adding Idiom Tokens to 🤗 Transformers Models).
  • Run runGlueForTrainData.sh with the model used to identify idioms to differentiate between "all" tokenized and "select" tokenized (we use the one-shot model from Task 1 A).
  • combineCreateFinalTrainData.py combines all the different files and creates the final training data for all three variations (no tokenization change, idioms always replaced with new tokens, idioms replaced by new tokens only when we identify the usage as idiomatic).

Fine-Tuning

The data created above can now be used to train a Sentence Transformer model.

IMPORTANT: We must start with a model that is already trained on the non-idiomatic STS data as described in the section Creating Sentence Transformers models above. The model must be able to handle the special tokens that we use for idioms.

The script Task2/SubtaskB-Fine_Tune/FineTune/stsTrainer.py can be used to perform this fine-tuning for all variations (no special tokenization, "select" tokenization, and "all" tokenization).
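
A minimal sketch of this fine-tuning step (not the full stsTrainer.py, which also handles the three tokenization variations) might look as follows; the model path, the single toy pair and its label are placeholders:

```python
# Sketch: continue training an STS-trained Sentence Transformer (which already knows the
# idiom tokens) on the idiomatic STS fine-tuning data created above.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sent-trans-bert-idioms")  # hypothetical STS-trained model

train_examples = [
    InputExample(
        texts=["He kicked_the_bucket last year.", "He died last year."],
        label=1.0,  # illustrative; the real labels come from the self-consistency step above
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("sent-trans-bert-idioms-finetuned")  # hypothetical output path
```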

Evaluation

The models trained above, in all three variations (no special tokenization, all idioms tokenized ("all"), and only those instances of idioms identified as idiomatic tokenized ("select")), can be evaluated using the same evaluation script: Task2/Evaluation/task2Evaluation.py

The following shell script provides all the required commands: Task2/SubtaskB-Fine_Tune/Evaluation/evalTask2B.sh

Pre-Trained and Fine-Tuned Models

The following models associated with Task 2 are publicly available. When training models, we train each model 5 times with a different random seed and pick the best-performing one (available here).

NOTE: Sentence Transformer models can't be used directly via the 🤗 Transformers link. They need to be downloaded to local disk (using git clone) before being used. Please remember to use git lfs!

| No. | 🤗 Transformers Name | Lang | Subtask | Details |
| --- | --- | --- | --- | --- |
| 1 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedNoPreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens. No additional pre-training. |
| 2 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedALLReplacePreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "ALL Replace" strategy. |
| 3 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedSelectReplacePreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "Select Replace" strategy. |
| 4 | harish/AStitchInLanguageModels-Task2_EN_SentTransTokenizedNoPreTrain | EN | A | Model No. 1 above converted to Sentence Transformer model with STS training |
| 5 | harish/AStitchInLanguageModels-Task2_EN_SentTransALLReplacePreTrain | EN | A | Model No. 2 above converted to Sentence Transformer model with STS training |
| 6 | harish/AStitchInLanguageModels-Task2_EN_SentTransSelectReplacePreTrain | EN | A | Model No. 3 above converted to Sentence Transformer model with STS training |
| 7 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedNoPreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens. No additional pre-training. |
| 8 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedALLReplacePreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "ALL Replace" strategy. |
| 9 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedSelectReplacePreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "Select Replace" strategy. |
| 10 | harish/AStitchInLanguageModels-Task2_PT_SentTransTokenizedNoPreTrain | PT | A | Model No. 7 above converted to Sentence Transformer model with (PT) STS training |
| 11 | harish/AStitchInLanguageModels-Task2_PT_SentTransALLReplacePreTrain | PT | A | Model No. 8 above converted to Sentence Transformer model with (PT) STS training |
| 12 | harish/AStitchInLanguageModels-Task2_PT_SentTransSelectReplacePreTrain | PT | A | Model No. 9 above converted to Sentence Transformer model with (PT) STS training |
| 13 | harish/AStitchInLanguageModels-Task2_EN_SentTransDefaultFineTuned | EN | B | Sentence Transformer with default tokenization fine tuned on idiomatic STS data |
| 14 | harish/AStitchInLanguageModels-Task2_EN_SentTransAllTokenizedFineTuned | EN | B | Sentence Transformer with special idiom tokenization fine tuned on idiomatic STS data tokenized using the "ALL replace" strategy. |
| 15 | harish/AStitchInLanguageModels-Task2_EN_SentTransSelectTokenizedFineTuned | EN | B | Sentence Transformer with special idiom tokenization fine tuned on idiomatic STS data tokenized using the "Select replace" strategy. |
| 16 | harish/AStitchInLanguageModels-Task2_PT_SentTransDefaultFineTuned | PT | B | Sentence Transformer with default tokenization fine tuned on idiomatic (PT) STS data |
| 17 | harish/AStitchInLanguageModels-Task2_PT_SentTransAllTokenizedFineTuned | PT | B | Sentence Transformer with special idiom tokenization fine tuned on (PT) idiomatic STS data tokenized using the "ALL replace" strategy. |
| 18 | harish/AStitchInLanguageModels-Task2_PT_SentTransSelectTokenizedFineTuned | PT | B | Sentence Transformer with special idiom tokenization fine tuned on (PT) idiomatic STS data tokenized using the "Select replace" strategy. |