MuLaN: Multilingual Label propagatioN for Word Sense Disambiguation

MuLaN (Multilingual Label propagatioN, IJCAI 2020) is a label propagation technique tailored to WSD and capable of automatically producing sense-tagged training datasets in multiple languages. Simply put, by jointly leveraging contextualized word embeddings and the multilingual information enclosed in knowledge bases, MuLaN projects sense information from a source tagged corpus in language L1 towards a target unlabelled one in language L2, possibly different from L1.

If you find either our code or our released datasets useful in your work, please cite us with:

@inproceedings{ijcai2020-531,
  title     = {Mu{L}a{N}: Multilingual Label propagatio{N} for Word Sense Disambiguation},
  author    = {Barba, Edoardo and Procopio, Luigi and Campolungo, Niccolò and Pasini, Tommaso and Navigli, Roberto},
  booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on
               Artificial Intelligence, {IJCAI-20}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},             
  pages     = {3837--3844},
  year      = {2020},
  month     = {7},
  doi       = {10.24963/ijcai.2020/531},
  url       = {https://doi.org/10.24963/ijcai.2020/531},
}

Released Datasets

We release the datasets we referenced within our paper, providing silver-tagged data in German, Spanish, French and Italian:

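| Dataset      | # sentences | # instances | # distinct lemmas | # distinct synsets | # distinct senses | % of transferred synsets |
|--------------|-------------|-------------|-------------------|--------------------|-------------------|--------------------------|
| SemCor + WNG | 154835      | 722812      | 59073             | 69404              | 91274             | 100.0                    |
| mulan-de     | 207801      | 245173      | 19108             | 18676              | 21776             | 26.91                    |
| mulan-es     | 262391      | 452584      | 30383             | 42618              | 56252             | 61.41                    |
| mulan-fr     | 228757      | 310756      | 21218             | 24600              | 28701             | 35.44                    |
| mulan-it     | 279320      | 415761      | 28244             | 32489              | 43559             | 46.81                    |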

Generating a new corpus

We also release our transferring code, so that you can generate new tagged corpora.

Environment Setup

  1. Install conda
  2. Run setup.sh to set up the environment
    bash setup.sh
    
  3. Set up the data folder so that it looks like the following:
    $ tree -L 2 ~/mulan/data
    ├── bn2wn.txt
    ├── mapped-datasets
    │   ├── sample-source
    │   └── sample-target
    └── wsd-datasets
        ├── SemCor
        └── WNGT
    That is, you have to create the bn2wn.txt file. To do so, you may use the code and instructions in SapienzaNLP/mwsd-datasets: the section BabelNet To WordNet mapping produces exactly this file. The file should look like the following:
    <babelnet-id> <\t> <first-associated-wordnet-id> <\t> <second-associated-wordnet-id> ...
    
  4. Set up the vocabs folder so that it looks like the following:
    $ tree -L 1 ~/mulan/vocabs
    ├── ...
    ├── lemma2synsets.<desired-language>.txt
    └── ...
    Once again, you may use the code and instructions in SapienzaNLP/mwsd-datasets to generate the mappings from lemmas to the possible BabelNet synsets in the desired languages (section Build the Inventory): take the file inventory.<language>.withgold.txt (we suggest sticking to the WordNet subgraph and using the -s wn option for most cases), rename it to lemma2synsets.<desired-language>.txt and place it in the vocabs/ folder. This file should look like the following:
    <lemma>#<pos> <\t> <first-associated-babelnet-id> <\t> <second-associated-babelnet-id> ...
    
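Both mapping files share the same shape: a key followed by tab-separated identifiers. The following is a minimal Python sketch for loading either of them, assuming that `<\t>` in the templates above stands for a single tab character and that each line holds one key. `load_mapping` is a hypothetical helper, not part of the MuLaN code base, and apart from bn:00015267n the identifiers in the sample are made up (as is the `#n` / `#v` spelling of the POS tag).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_mapping(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse lines of the form `key<TAB>value1<TAB>value2...`.

    Intended for both bn2wn.txt (BabelNet id -> WordNet ids) and
    lemma2synsets.<language>.txt (lemma#pos -> BabelNet ids), under the
    assumption that fields are separated by single tab characters.
    """
    mapping: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        key, *values = line.split("\t")
        # Repeated keys are merged; empty fields are dropped.
        mapping[key].extend(v for v in values if v)
    return dict(mapping)


# Illustrative lines only: the second and third synset ids are invented.
sample = ["cane#n\tbn:00015267n\tbn:00000001n\n", "avere#v\tbn:00000002v\n"]
lemma2synsets = load_mapping(sample)
print(lemma2synsets["cane#n"])
```

With a real file, the same function can be fed an open file handle, e.g. `load_mapping(open("vocabs/lemma2synsets.it.txt", encoding="utf-8"))`.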

Projection Flow

All the transferring code is organized using a "gun firing" analogy. This metaphor was not meant to stick around; rather, it simply made it easier to talk and reason about the various steps. However, we ended up getting attached to it:

  1. load: MuLaN takes as input a source and a target corpus, vectorizes them and stores them in the vectorization/ folder (section 3.1 in the paper)
  2. spot targets: MuLaN computes, for each annotated instance in the source corpus, a list of possible projections onto the target one. These transfer coordinates are stored in the coordinates/ folder (section 3.2)
  3. aim (or, informally, compute firing priorities): MuLaN filters the proposed transfers (with a backward check) and assigns a globally consistent score to each pair (x, y), where x is an annotated source instance and y is a target instance onto which MuLaN proposed transferring x. The output of this step is stored inside the coordinates/ folder (first part of section 3.3)
  4. fire: MuLaN uses the coordinates file produced at the previous step to automatically generate a new annotated corpus (second part of section 3.3)
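
To give a feel for the spot targets and aim steps, here is a toy Python sketch. It is not the MuLaN implementation: it assumes that candidates are found by cosine similarity between instance vectors and that the backward check keeps a pair (x, y) only when x is also among the nearest source instances of y. Both are simplifying assumptions (the knowledge-base information MuLaN also relies on is left out entirely); sections 3.2 and 3.3 of the paper describe the actual criteria. All names and vectors below are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query: Vector, pool: Dict[str, Vector], k: int) -> List[str]:
    """Ids of the k vectors in `pool` most similar to `query`."""
    ranked = sorted(pool, key=lambda i: cosine(query, pool[i]), reverse=True)
    return ranked[:k]


def propose_transfers(
    source: Dict[str, Vector], target: Dict[str, Vector], k: int = 1
) -> List[Tuple[str, str, float]]:
    """Return (source id, target id, score) triples, best score first."""
    transfers = []
    for x, x_vec in source.items():
        # "spot targets": candidate target instances for x
        for y in nearest(x_vec, target, k):
            # backward check (assumed form): x must be a neighbour of y too
            if x in nearest(target[y], source, k):
                transfers.append((x, y, cosine(x_vec, target[y])))
    # "aim": rank the surviving pairs by score
    return sorted(transfers, key=lambda t: t[2], reverse=True)


# Toy 2-dimensional vectors standing in for contextualized embeddings.
source = {"dog@d1.s1": [1.0, 0.1], "bank@d1.s2": [0.1, 1.0]}
target = {"cane@sample:1.1": [0.9, 0.2], "ho@sample:1.1": [0.6, 0.4]}
for x, y, score in propose_transfers(source, target):
    print(x, "->", y, round(score, 3))
```

In this toy data only the dog/cane pair survives: the nearest target of "bank" is "ho", but the nearest source of "ho" is "dog", so the backward check discards that proposal.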

Folder Structure

$ tree -L 1 ~/mulan
├── mulan
├── cache
├── coordinates
├── data
├── README.md
├── transfer
├── vectorization
└── vocabs
  • mulan: code directory
  • cache: where some intermediate data structures (i.e. LevelDB databases) are saved
  • coordinates: where coordinates are saved
  • data: datasets and mappings
  • transfer: where transfer results are saved
  • vectorization: where vectorization results are saved
  • vocabs: where vocabs are saved
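
Before launching the pipeline, it can be handy to check that this layout is in place. The sketch below is a hypothetical helper (not shipped with MuLaN) that lists the top-level folders from the tree above that are missing under a given root. Some of these folders may well be created by the scripts themselves, so a missing entry is not necessarily an error.

```python
from pathlib import Path
from typing import List

# Folder names taken from the tree listing above.
EXPECTED_DIRS = [
    "mulan",
    "cache",
    "coordinates",
    "data",
    "transfer",
    "vectorization",
    "vocabs",
]


def missing_dirs(root: Path) -> List[str]:
    """Return the expected top-level folders that do not exist under `root`."""
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    root = Path.home() / "mulan"  # assumed checkout location, as in the listing
    for name in missing_dirs(root):
        print(f"missing folder: {root / name}")
```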

Input Data Preparation

MuLaN's pipeline takes two corpora as input (a source and a target), each specified through the Corpus enum in mulan/corpora.py, with the associated data stored in data/mapped-datasets in our predefined input format:

$ head -1 ~/mulan/data/mapped-datasets/sample-source/data.txt 
d1.s1 <\t> I I PRON X <\t> have have VERB X <\t> a a DET X <\t> dog dog NOUN bn:00015267n
$ head -1 ~/mulan/data/mapped-datasets/sample-target/0.txt
sample:1.1 <\t> Io io PRON X <\t> ho avere VERB X <\t> un un DET X <\t> cane cane NOUN X

MuLaN only accepts input in this format. Most likely, you'll need to preprocess your data into this structure; to POS-tag and lemmatize the file, we suggest using Stanza.
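
The following Python sketch shows one way to read and write such lines. It assumes that `<\t>` stands for a tab character, that each token field holds exactly four space-separated parts (form, lemma, POS tag, label) and that X marks an unannotated token, as in the samples above. `Token`, `parse_line`, `format_line` and `tag_with_stanza` are hypothetical names, not part of MuLaN. The Stanza helper uses the default pipeline for the given language and does not handle tokens that contain spaces.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    label: str  # BabelNet synset id, or "X" when the token is not annotated


def parse_line(line: str) -> Tuple[str, List[Token]]:
    """Split one input line into its sentence id and its tokens."""
    sentence_id, *fields = line.rstrip("\n").split("\t")
    return sentence_id, [Token(*field.split(" ")) for field in fields]


def format_line(sentence_id: str, tokens: List[Token]) -> str:
    """Inverse of parse_line: build one tab-separated input line."""
    return "\t".join([sentence_id] + [" ".join(token) for token in tokens])


def tag_with_stanza(text: str, lang: str) -> List[List[Token]]:
    """POS-tag and lemmatize raw text, leaving every token unlabelled ("X")."""
    # Imported lazily: needs `pip install stanza` and a downloaded model,
    # e.g. via stanza.download(lang).
    import stanza

    nlp = stanza.Pipeline(lang=lang)  # default processors for the language
    doc = nlp(text)
    return [
        [Token(w.text, w.lemma or w.text, w.upos, "X") for w in sentence.words]
        for sentence in doc.sentences
    ]


line = "d1.s1\tI I PRON X\thave have VERB X\ta a DET X\tdog dog NOUN bn:00015267n"
sentence_id, tokens = parse_line(line)
print(sentence_id, [t.lemma for t in tokens if t.label != "X"])
```

A target file could then be produced by calling `format_line` on each sentence returned by `tag_with_stanza`, with sentence ids of your own choosing.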

Output Format

As the output format, we follow the scheme introduced in SemEval 2013 task 12 and later chosen as the standard WSD format for the evaluation framework presented in Raganato et al., 2017.

A dataset consists of 2 files:

  • a .xml file, storing the actual sentences and marking tagged instances (i.e. tokens) with an id
  • a .txt file, mapping instance ids to the actual labels (synsets in our case)

For example:

$ tree -L 1 ~/mulan/transfer/SAMPLE_SOURCE_MBERT-SAMPLE_TARGET_MBERT/
├── transfer.data.xml
└── transfer.gold.key.txt
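
As a rough illustration of how the two files fit together, the Python sketch below joins them on the instance id. The element and attribute names (`wf`, `instance`, `id`, `lemma`, `pos`) follow the layout of the Raganato et al. evaluation framework, and we assume here that MuLaN's output uses the same names. The sample content and ids are hand-written for the example and are not real MuLaN output.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hand-written sample in the assumed layout (illustrative only).
SAMPLE_XML = """<corpus lang="it">
  <text id="d000">
    <sentence id="d000.s000">
      <wf lemma="io" pos="PRON">Io</wf>
      <wf lemma="avere" pos="VERB">ho</wf>
      <wf lemma="un" pos="DET">un</wf>
      <instance id="d000.s000.t000" lemma="cane" pos="NOUN">cane</instance>
    </sentence>
  </text>
</corpus>
"""
SAMPLE_KEYS = "d000.s000.t000 bn:00015267n\n"


def read_keys(text: str) -> Dict[str, List[str]]:
    """Parse key lines of the form `<instance-id> <label> [<label> ...]`."""
    keys: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if line.strip():
            instance_id, *labels = line.split()
            keys[instance_id] = labels
    return keys


def tagged_instances(
    xml_text: str, keys: Dict[str, List[str]]
) -> List[Tuple[str, str, str, List[str]]]:
    """Return (instance id, surface form, lemma, labels) for each instance."""
    root = ET.fromstring(xml_text)
    return [
        (
            inst.get("id"),
            inst.text,
            inst.get("lemma"),
            keys.get(inst.get("id"), []),  # empty list if the id has no key
        )
        for inst in root.iter("instance")
    ]


for row in tagged_instances(SAMPLE_XML, read_keys(SAMPLE_KEYS)):
    print(row)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`, and the key file can be read with `Path(path).read_text(encoding="utf-8")`.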

Running the projection code

You can create a new corpus either using our fine-grained scripts:

PYTHONPATH=mulan/ python mulan/transfer/1_retrieve_targets_manifesto.py --language <language>
PYTHONPATH=mulan/ python mulan/transfer/2_load.py <source-corpus-enum>
PYTHONPATH=mulan/ python mulan/transfer/2_load.py <target-corpus-enum>
PYTHONPATH=mulan/ python mulan/transfer/3_spot_targets.py --source-enum <source-corpus-enum> --target-enum <target-corpus-enum> --coordinates-folder <output-coordinates-folder>
PYTHONPATH=mulan/ python mulan/transfer/4_compute_priorities.py --source-enum <semcor-enum> --target-enum <target-corpus-enum> --coordinates-folder <output-coordinates-folder>
PYTHONPATH=mulan/ python mulan/transfer/5_word_fire.py --language <language> --name <name> --coordinates <coordinates-path>,<source-corpus-enum>,<target-corpus-enum> --coordinates <coordinates-path>,<source-corpus-enum>,<target-corpus-enum> --output-folder <output-folder>

or with a simpler:

bash pipeline.sh <target-language> <source-corpus-enum> <target-corpus-enum>

It may be necessary to manually edit the code in order to change some hyperparameters (e.g. the encoder). We plan on moving our code to an AllenNLP-like structure, with JSON configurations given as input, so as to better support hyperparameter modifications; however, we do not have an ETA for this yet.

Example

We provide a simple example you may use to better understand the intermediate stages (and, obviously, to check that everything is working correctly):

  • source (located at data/mapped-datasets/sample-source): I have a (dog, bn:00015267n)
  • target (located at data/mapped-datasets/sample-target): Io ho un cane

Running:

bash pipeline.sh it SAMPLE_SOURCE_MBERT SAMPLE_TARGET_MBERT

will project the sense of dog in I have a dog towards cane in Io ho un cane, producing:

  • intermediate files in vectorization and coordinates

  • final result in transfer
