
Task 2: Idiomaticity Representation

Task 2 tests models' ability to accurately represent sentences regardless of whether or not they contain idiomatic expressions. This is tested using Semantic Text Similarity (STS): the metric is the Spearman rank correlation between the STS scores a model outputs for sentences containing idiomatic expressions and the scores it outputs for the same sentences with the idiomatic expressions replaced by non-idiomatic paraphrases (which capture the correct meaning of the MWEs).
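
For intuition, here is a minimal sketch of the metric using made-up similarity scores (this is only an illustration; the actual scoring is handled by task2Evaluation.py, described below):

```python
# Toy illustration of the Task 2 metric (made-up numbers, not real model output).
from scipy.stats import spearmanr

# STS scores a model assigns to sentence pairs whose sentences contain an MWE...
sims_with_idiom = [0.81, 0.40, 0.65, 0.92]
# ...and the scores it assigns to the same pairs after replacing the MWE with its paraphrase.
sims_with_paraphrase = [0.78, 0.45, 0.60, 0.95]

# A model that represents idioms well should rank the pairs the same way in both settings.
correlation, _ = spearmanr(sims_with_idiom, sims_with_paraphrase)
print(f"Spearman rank correlation: {correlation:.3f}")
```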

We perform all training 5 times with different random seeds and pick the best performing model.

Please see the paper for more details on the task.

Table of Contents

  • Adding Idiom Tokens to 🤗 Transformers Models
  • Creating Sentence Transformers models
  • Creating the Evaluation Data
  • Generating Pre-Training Data
  • Subtask A - Pre-Training for Idiom Representation
  • Subtask B - Fine-Tuning for Idiom Representation
  • Pre-Trained and Fine-Tuned Models

Adding Idiom Tokens to 🤗 Transformers Models

Since we explore the impact of tokenizing MWEs as single tokens (the idiom principle), we first ensure that these tokens are added to pre-trained language models.

This is done using scripts in the Tokenize folder.

  • downloadModels.py will download the required model from 🤗 Transformers.
  • updateVocab.py updates the vocabulary of the model. (This uses the "unused" tokens, so it currently only works for BERT and mBERT; use tokenizer.add_tokens, as described here, for a generic solution. See the sketch after this list.)
  • tokenCheck.py will run a check to ensure that the tokenizer now tokenizes idioms with a single token.
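
For reference, here is a minimal sketch of the generic tokenizer.add_tokens approach mentioned above, together with the kind of single-token check that tokenCheck.py performs. The underscore-joined token format, the example MWEs and the output path are assumptions for illustration only:

```python
# Sketch: add single-token MWEs via tokenizer.add_tokens (the generic alternative
# to the "unused token" trick in updateVocab.py).
from transformers import AutoModel, AutoTokenizer

mwe_tokens = ["big_fish", "wet_blanket"]  # hypothetical single-token forms of MWEs

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(mwe_tokens)               # register the new tokens
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

# The kind of check tokenCheck.py performs: each MWE should map to exactly one token.
for mwe in mwe_tokens:
    pieces = tokenizer.tokenize(f"He turned out to be a {mwe} after all .")
    assert pieces.count(mwe) == 1, pieces

tokenizer.save_pretrained("bert-base-uncased-idioms")  # hypothetical output path
model.save_pretrained("bert-base-uncased-idioms")
```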

Creating Sentence Transformers models

We use Sentence Transformers to generate sentence embeddings that can be compared using cosine similarity.

We modify the original package to allow it to handle the updated tokenization. Please install the version provided with this repository.

Here are the steps to create a Sentence Transformer Model:
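
In outline, this amounts to the following (a minimal sketch of the standard sentence-transformers STS training recipe; "bert-base-uncased-idioms" and the two training pairs are placeholders, not the data used in the paper, and the scripts in this repository are the authoritative version):

```python
# Sketch: turn the idiom-token checkpoint into a Sentence Transformer and train it
# on ordinary STS pairs so that cosine similarity between embeddings is meaningful.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

word_embedding_model = models.Transformer("bert-base-uncased-idioms", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Ordinary STS data: sentence pairs with a gold similarity scaled to [0, 1].
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays the guitar."], label=0.9),
    InputExample(texts=["A man is playing a guitar.", "A cat sits on a mat."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("sent-trans-bert-idioms")  # hypothetical output path
```

Remember to install the modified sentence-transformers package provided with this repository rather than the upstream release, so the idiom tokens are handled correctly.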

Creating the Evaluation Data

Since this task requires models to be self-consistent, we need to create the evaluation data (or format it for use with our models) using a model that outputs semantic text similarity (such as the one trained above).

This is done using scripts in the folder CreateEvaluationData.

  • Start with the evaluation data available in the "NoResults" folders for EN and PT. These folders contain additional information regarding tokenization (for "select" tokenize and "all" tokenize) and similarities (which is what we need to ensure consistency). This data is created using the script createEvalData.py, but running that script yourself is NOT recommended, as it might generate a slightly different dataset depending on your random number generator.
  • Run predictSentSims.py (with the STS model created above) to generate sentence similarities (a sketch of this step follows this list).
  • Run runGlueEval.sh with the model used to identify idioms to differentiate between "all" tokenized and "select" tokenized (we use the one-shot model from Task 1 A).
  • Run combineCreateFinalEvalData.py to generate the final evaluation data.
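
For reference, a minimal sketch of the similarity-prediction step (conceptually what predictSentSims.py does) is shown below; the model path and the example sentence pair are placeholders:

```python
# Sketch: score sentence pairs with the STS-trained Sentence Transformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sent-trans-bert-idioms")  # hypothetical path from the step above

pairs = [
    ("It was raining cats and dogs all night.", "It was raining heavily all night."),
]

for sent_a, sent_b in pairs:
    emb_a, emb_b = model.encode([sent_a, sent_b], convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(emb_a, emb_b).item()
    print(f"{similarity:.3f}\t{sent_a}\t{sent_b}")
```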

Generating Pre-Training Data

This step is only required for Subtask A.

The processed pre-training data is available for both English and Portuguese.

Extract Data from Common Crawl

This step is only required when not using the pre-training data made available above.

We obtain pre-training data from the Common Crawl News corpus. This can be done using the scripts in the ProcessCommonCrawl folder.

  • processCCNews.py will download CC News from 2020 and store relevant files.
  • createPreTrainData.py will create the data required for pre-training, along with additional files containing the information needed to generate the "select" tokenize pre-trained model (please see the paper for details). A conceptual sketch of the "all replace" idea follows this list.
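
To illustrate the "all replace" idea, here is a minimal, hypothetical sketch of rewriting MWE occurrences as single tokens; the underscore token format and the example sentence are assumptions, and the real data is produced from CC News by the scripts above:

```python
# Conceptual sketch of the "all replace" strategy: every occurrence of an MWE in the
# raw text is rewritten as its single-token form.
import re

mwes = ["big fish", "wet blanket"]

def replace_all(sentence, mwe_list):
    for mwe in mwe_list:
        single_token = mwe.replace(" ", "_")
        sentence = re.sub(re.escape(mwe), single_token, sentence, flags=re.IGNORECASE)
    return sentence

print(replace_all("He is a big fish in a small pond.", mwes))
# -> He is a big_fish in a small pond.
```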

Preparing Pre-train data

This step (also) is only required when not using the pre-training data made available above.

  • The output of the previous step should result in the following files: all_replace_data.txt, classification_sents.csv, no_replace_data.txt and vocab_update.txt.
  • Split all_replace_data.txt into train and eval. (We use split -l 400000 for English and split -l 4000 for PT.)
  • Run runGlue.sh to generate predictions of which usages are idiomatic (these are used to generate the 'select' data).
  • Run createReplaceByPrediction.py to use the predictions above to generate 'select' replaced data for pre-training (see the sketch after this list).
  • Split select_replace_data.txt into train and eval. (We use split -l 400000 for English and split -l 4000 for PT.)
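
A conceptual sketch of the "select" replacement step (what createReplaceByPrediction.py does, under an assumed input format of sentence, MWE and predicted label) is shown below:

```python
# Conceptual sketch of "select" replacement: keep the single-token form only where the
# classifier predicted idiomatic usage. The (sentence, MWE, label) format and the label
# convention are assumptions; see createReplaceByPrediction.py for the real file formats.
IDIOMATIC = 1  # assumed label id for "idiomatic usage"

def select_replace(rows):
    for sentence, mwe, prediction in rows:
        if int(prediction) == IDIOMATIC:
            sentence = sentence.replace(mwe, mwe.replace(" ", "_"))
        yield sentence

rows = [
    ("He is a big fish in a small pond.", "big fish", 1),   # idiomatic usage -> replace
    ("She caught a big fish in the lake.", "big fish", 0),  # literal usage   -> keep as-is
]
for line in select_replace(rows):
    print(line)
```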

Subtask A - Pre-Training for Idiom Representation

Once the evaluation data and pre-training data have been created and the models have been modified to include single tokens for idioms, these scripts can be used for pre-training and evaluation.

Pre-Training

Converting to Sentence Transformer Models

Each of the pre-trained models must be converted to Sentence Transformer models by training them on STS data so their output embeddings can be compared using cosine similarity. This can be done using steps described in the section Creating Sentence Transformers models above.

We do this five times with different seeds and pick the model that performs the best on the ordinary STS dataset used to train Sentence Transformers (which does NOT contain any information on the MWEs we work with).
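
A minimal sketch of this model-selection step might look as follows; the candidate model paths and the toy development examples are placeholders:

```python
# Sketch: pick the best of the five seed runs by their score on the ordinary STS
# development set (which contains no MWE-specific information).
from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays the guitar."], label=0.9),
    InputExample(texts=["A man is playing a guitar.", "A cat sits on a mat."], label=0.1),
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="sts-dev")

candidate_paths = [f"sent-trans-bert-idioms-seed{seed}" for seed in range(5)]  # hypothetical paths
scores = {path: evaluator(SentenceTransformer(path)) for path in candidate_paths}
best_path = max(scores, key=scores.get)
print("Best model:", best_path, "score:", scores[best_path])
```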

Evaluation

You can evaluate the pre-trained representations of MWEs using scripts in the folder Task2/SubtaskA-Pre_Train/Evaluation.

  • We test each of the best models from the previous steps using the common script for task 2 evaluation (task2Evaluation.py).
  • You can run all the tests (default model, default model with special MWE tokenization, and models pre-trained with the "all" and "select" pre-training data) using the script SubtaskA-Pre_Train/Evaluation/eval.sh. Be sure to update the paths of the models. (Please see the paper for an explanation of each of these four variations.)

Subtask B - Fine-Tuning for Idiom Representation

Fine-tuning models to better represent idioms also requires creating (or formatting) training data, in much the same way as the evaluation data was created and formatted. This section describes the steps required to format the training data, train the models and, finally, evaluate them.

Create Fine-Tuning Data

Fine-tuning data can be created using the scripts in the folder Task2/SubtaskB-Fine_Tune/CreateFineTuneData.

  • createFineTuneData.py extracts data from the raw json files and also creates the files needed for predicting idiomaticity (required for "all" tokenized and "select" tokenized) and sentence similarity (required for ensuring self-consistency).
  • predictSentSims.py will predict sentence similarity. This script uses a Sentence Transformers model with idiom tokens added (see the sections Creating Sentence Transformers models and Adding Idiom Tokens to 🤗 Transformers Models).
  • Run runGlueForTrainData.sh with the model used to identify idioms to differentiate between "all" tokenized and "select" tokenized (we use the one-shot model from Task 1 A).
  • combineCreateFinalTrainData.py combines all the different files and creates the final training data for all three variations (no tokenization change, idioms always replaced with new tokens, idioms replaced by new tokens only when we identify the usage as idiomatic).

Fine-Tuning

The data created above can now be used to train a Sentence Transformer model.

IMPORTANT: We must start with a model that is already trained on the non-idiomatic STS data as described in the section Creating Sentence Transformers models above. The model must be able to handle the special tokens that we use for idioms.

The script Task2/SubtaskB-Fine_Tune/FineTune/stsTrainer.py can be used to perform this fine-tuning for all variations (no special tokenization, "select" tokenization, and "all" tokenization).
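
A minimal sketch of this fine-tuning step (not the full stsTrainer.py, which also handles the three tokenization variations) might look as follows; the model path, the single toy pair and its label are placeholders:

```python
# Sketch: continue training an STS-trained Sentence Transformer (which already knows the
# idiom tokens) on the idiomatic STS fine-tuning data created above.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sent-trans-bert-idioms")  # hypothetical STS-trained model

train_examples = [
    InputExample(
        texts=["He kicked_the_bucket last year.", "He died last year."],
        label=1.0,  # illustrative; the real labels come from the self-consistency step above
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("sent-trans-bert-idioms-finetuned")  # hypothetical output path
```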

Evaluation

The models trained above, in all three variations (no special tokenization, all idioms tokenized ("all"), and only those instances of idioms identified as idiomatic tokenized ("select")), can be evaluated using the same evaluation script: Task2/Evaluation/task2Evaluation.py

The following shell script provides all the required commands: Task2/SubtaskB-Fine_Tune/Evaluation/evalTask2B.sh

Pre-Trained and Fine-Tuned Models

The following models associated with Task 2 are publicly available. When training models, we train each model 5 times with a different random seed and pick the best-performing one (available here).

NOTE: Sentence Transformer models can't be used directly via the 🤗 Transformers link. They need to be downloaded to local disk (using git clone) before being used. Please remember to use git lfs!

| No. | 🤗 Transformers Name | Lang | Subtask | Details |
| --- | --- | --- | --- | --- |
| 1 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedNoPreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens. No additional pre-training. |
| 2 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedALLReplacePreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "ALL Replace" strategy. |
| 3 | harish/AStitchInLanguageModels-Task2_EN_BERTTokenizedSelectReplacePreTrain | EN | A | BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "Select Replace" strategy. |
| 4 | harish/AStitchInLanguageModels-Task2_EN_SentTransTokenizedNoPreTrain | EN | A | Model No. 1 above converted to Sentence Transformer model with STS training |
| 5 | harish/AStitchInLanguageModels-Task2_EN_SentTransALLReplacePreTrain | EN | A | Model No. 2 above converted to Sentence Transformer model with STS training |
| 6 | harish/AStitchInLanguageModels-Task2_EN_SentTransSelectReplacePreTrain | EN | A | Model No. 3 above converted to Sentence Transformer model with STS training |
| 7 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedNoPreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens. No additional pre-training. |
| 8 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedALLReplacePreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "ALL Replace" strategy. |
| 9 | harish/AStitchInLanguageModels-Task2_PT_mBERTTokenizedSelectReplacePreTrain | PT | A | Multilingual BERT Base with tokenizer updated to handle MWEs as single tokens and additionally pre-trained using the "Select Replace" strategy. |
| 10 | harish/AStitchInLanguageModels-Task2_PT_SentTransTokenizedNoPreTrain | PT | A | Model No. 7 above converted to Sentence Transformer model with (PT) STS training |
| 11 | harish/AStitchInLanguageModels-Task2_PT_SentTransALLReplacePreTrain | PT | A | Model No. 8 above converted to Sentence Transformer model with (PT) STS training |
| 12 | harish/AStitchInLanguageModels-Task2_PT_SentTransSelectReplacePreTrain | PT | A | Model No. 9 above converted to Sentence Transformer model with (PT) STS training |
| 13 | harish/AStitchInLanguageModels-Task2_EN_SentTransDefaultFineTuned | EN | B | Sentence Transformer with default tokenization fine tuned on idiomatic STS data |
| 14 | harish/AStitchInLanguageModels-Task2_EN_SentTransAllTokenizedFineTuned | EN | B | Sentence Transformer with special idiom tokenization fine tuned on idiomatic STS data tokenized using the "ALL replace" strategy. |
| 15 | harish/AStitchInLanguageModels-Task2_EN_SentTransSelectTokenizedFineTuned | EN | B | Sentence Transformer with special idiom tokenization fine tuned on idiomatic STS data tokenized using the "Select replace" strategy. |
| 16 | harish/AStitchInLanguageModels-Task2_PT_SentTransDefaultFineTuned | PT | B | Sentence Transformer with default tokenization fine tuned on idiomatic (PT) STS data |
| 17 | harish/AStitchInLanguageModels-Task2_PT_SentTransAllTokenizedFineTuned | PT | B | Sentence Transformer with special idiom tokenization fine tuned on (PT) idiomatic STS data tokenized using the "ALL replace" strategy. |
| 18 | harish/AStitchInLanguageModels-Task2_PT_SentTransSelectTokenizedFineTuned | PT | B | Sentence Transformer with special idiom tokenization fine tuned on (PT) idiomatic STS data tokenized using the "Select replace" strategy. |