Named Entity Recognition for Idiomaticity Detection

Official repository for the paper "NER4ID at SemEval-2022 Task 2: Named Entity Recognition for Idiomaticity Detection".


Please consider citing our work if you use data and/or code from this repository.

BibTeX

@inproceedings{tedeschi-navigli-2022-ner4id,
    title = "{NER}4{ID} at {S}em{E}val-2022 Task 2: Named Entity Recognition for Idiomaticity Detection",
    author = "Tedeschi, Simone  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.semeval-1.25",
    doi = "10.18653/v1/2022.semeval-1.25",
    pages = "204--210",
    abstract = "Idioms are lexically-complex phrases whose meaning cannot be derived by compositionally interpreting their components. Although the automatic identification and understanding of idioms is essential for a wide range of Natural Language Understanding tasks, they are still largely under-investigated. This motivated the organization of the SemEval-2022 Task 2, which is divided into two multilingual subtasks: one about idiomaticity detection, and the other about sentence embeddings. In this work, we focus on the first subtask and propose a Transformer-based dual-encoder architecture to compute the semantic similarity between a potentially-idiomatic expression and its context and, based on this, predict idiomaticity. Then, we show how and to what extent Named Entity Recognition can be exploited to reduce the degree of confusion of idiom identification systems and, therefore, improve performance. Our model achieves 92.1 F1 in the one-shot setting and shows strong robustness towards unseen idioms achieving 77.4 F1 in the zero-shot setting. We release our code at https://github.com/Babelscape/ner4id.",
}

System Overview

In a nutshell, NER4ID is a high-performing idiom identification system that: i) uses Named Entity Recognition (NER) to pre-identify non-idiomatic expressions, and ii) exploits a novel Transformer-based dual-encoder architecture to compute the semantic similarities between the remaining potentially-idiomatic expressions and their contexts and, based on these, predict idiomaticity.

The following figure shows the overall architecture of our NER4ID system:

[Figure: overall architecture of the NER4ID system]
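To make the dual-encoder idea concrete, below is a minimal sketch that encodes the potentially-idiomatic expression and its context with the same BERT encoder, mean-pools the token embeddings, and derives an idiomaticity score from their cosine similarity. The model name, pooling strategy and scoring choice are illustrative assumptions, not the released NER4ID configuration.

```python
# Minimal sketch of the dual-encoder similarity idea (illustrative only):
# encode the potentially-idiomatic MWE and its context with the same BERT
# encoder, pool the token embeddings, and use their cosine similarity as an
# (inverse) idiomaticity signal. Model name, pooling and scoring are
# assumptions, not the released NER4ID configuration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)

def idiomaticity_score(mwe: str, context: str) -> float:
    """Low MWE-context similarity suggests an idiomatic usage."""
    sim = torch.cosine_similarity(embed(mwe), embed(context), dim=0)
    return 1.0 - sim.item()  # higher score = more likely idiomatic

print(idiomaticity_score("big fish", "He is a big fish in local politics."))
```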


Data

The datasets used to train and evaluate our NER4ID system are those provided by the SemEval-2022 Task 2 organizers. Each entry contains a multi-word expression (MWE) in context, and the aim of the system is to determine whether the MWE is used with a literal or idiomatic meaning in that context. Datasets are provided for three languages: English, Portuguese and Galician.

Additionally, two different settings are available: zero-shot and one-shot. In the "zero-shot" setting, the MWEs (potentially idiomatic phrases) in the training set are completely disjoint from those in the test and development sets. In the "one-shot" setting, the training set additionally includes one positive and one negative example for each MWE in the test and development sets.

The datasets are available in the /data folder.
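As a quick way to get a feel for the data, the sketch below loads one split with pandas and looks at the label distribution per language. The file name and the column names ("MWE", "Language", "Label") are assumptions about the CSV layout rather than guaranteed fields; check the actual headers of the files in /data before relying on them.

```python
# Hedged sketch of inspecting one of the splits in /data with pandas.
# The file name and the "MWE", "Language" and "Label" column names are
# assumptions about the CSV layout; verify the real headers in /data.
import pandas as pd

df = pd.read_csv("data/train_zero_shot.csv")      # hypothetical file name
print(df.columns.tolist())                        # verify the real schema

# Each row pairs an MWE in context with a binary idiomatic/literal label.
print(df.groupby("Language")["Label"].value_counts())
print(df[["MWE", "Label"]].head())
```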


Implementation

We implemented our idiom identification system with PyTorch using the Transformers library to load the weights of a BERT-based model.

We fine-tuned our idiom identification system for 100 epochs with a Mean-Squared Error loss criterion, adopting an early stopping strategy with a patience value of 20, the Adam optimizer and a learning rate of 10⁻⁵, as is standard when fine-tuning the weights of a pretrained language model. Additionally, we set δ=0 and use a batch size of 32 with 4 steps of gradient accumulation.
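The snippet below sketches how these hyper-parameters could be wired together in PyTorch: MSE loss, Adam with a 10⁻⁵ learning rate, a batch size of 32 with 4 gradient-accumulation steps, and early stopping with patience 20. It is a schematic skeleton under assumptions, not the released training script; the tiny linear model, the random data and the dev-loss selection criterion are placeholders.

```python
# Schematic skeleton wiring together the hyper-parameters above. The tiny
# model and random data are placeholders; this is NOT the released NER4ID
# training script, only an illustration of the training configuration.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)                    # stand-in for the dual encoder
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.rand(256, 1)),
    batch_size=32, shuffle=True,
)
dev_x, dev_y = torch.randn(64, 16), torch.rand(64, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = torch.nn.MSELoss()
accumulation_steps, patience = 4, 20
best_dev_loss, epochs_without_improvement = float("inf"), 0

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(train_loader):
        loss = criterion(model(inputs), targets) / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:  # gradient accumulation
            optimizer.step()
            optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        dev_loss = criterion(model(dev_x), dev_y).item()  # dev-set criterion
    if dev_loss < best_dev_loss:
        best_dev_loss, epochs_without_improvement = dev_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:        # early stopping
            break
```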

To identify entities, we employed wikineural-multilingual-ner, a Multilingual BERT (mBERT) model fine-tuned on the WikiNEuRal dataset. We compare systems by means of their Macro F1 scores, as specified by the competition rules.
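The WikiNEuRal checkpoint is publicly available on the Hugging Face Hub, so entity pre-identification can be sketched with the standard transformers pipeline as below. Treating an MWE that overlaps a detected entity as non-idiomatic is our paraphrase of the pre-filtering idea, not the exact released logic.

```python
# Sketch of entity pre-identification with the public WikiNEuRal checkpoint.
# Flagging an MWE as non-idiomatic when it overlaps a detected entity is a
# paraphrase of the pre-filtering idea, not the exact released logic.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="Babelscape/wikineural-multilingual-ner",
    aggregation_strategy="simple",
)

sentence = "Big Apple Records released the album last year."
entities = ner(sentence)
entity_spans = [e["word"] for e in entities]

mwe = "Big Apple"
# If the potentially-idiomatic expression matches a named entity,
# it can be pre-labelled as non-idiomatic and skipped by the dual encoder.
is_entity = any(mwe.lower() in span.lower() for span in entity_spans)
print(entities, is_entity)
```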

We provide a Python Notebook that illustrates all the modules that characterize the NER4ID system. For ease of use, the notebook simplifies the full system in the following ways:

  • Instead of using BERT-base-cased for English and BERT-base-portuguese-cased for Portuguese and Galician, we use a single BERT-base-multilingual-cased model;
  • Rather than ensembling the predictions of 9 model checkpoints, we consider only the predictions of the best model;
  • To identify entities, we rely on the commonly-used spaCy NER tagger (see the sketch below).
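A minimal sketch of that spaCy-based pre-identification is shown below; the en_core_web_sm model name is just one common choice, and the overlap check is illustrative rather than the notebook's exact logic.

```python
# Minimal sketch of spaCy-based entity pre-identification, as used in the
# simplified notebook. The en_core_web_sm model is one common choice
# (python -m spacy download en_core_web_sm); the overlap check is illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def overlaps_entity(mwe: str, sentence: str) -> bool:
    """Return True if the MWE overlaps a named entity in the sentence."""
    doc = nlp(sentence)
    return any(
        mwe.lower() in ent.text.lower() or ent.text.lower() in mwe.lower()
        for ent in doc.ents
    )

print(overlaps_entity("big apple", "She moved to the Big Apple in 2019."))
```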

For further implementation details, please refer to the paper.


License

NER4ID is licensed under the CC BY-NC-SA 4.0 license. The text of the license can be found here.


Acknowledgments

We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon2020 research and innovation programme (http://mousse-project.org/).
