supWSD

supWSD is a supervised word sense disambiguation system. Its flexible framework allows users to combine different preprocessing modules, select feature extractors, and choose which classifier to use. SupWSD is lightweight and has a small memory footprint; it is configured through a simple XML file.

  • SupWSD Toolkit
  • SupWSD Java API
  • SupWSD Python API
  • Online demo interface

Trained Models

Trained models, based on different features and corpora, are available for download. All models use WordNet 3.0 as their sense inventory. The word embeddings used can be downloaded here [2.6 GB]; the vocab file can be downloaded here [159 MB].

Training data: SemCor, with Stanford CoreNLP as preprocessor
  • models using surrounding words, PoS tags of surrounding words, and local collocations as features.
  • models using surrounding words, PoS tags of surrounding words, local collocations, and word embeddings (integrated using exponential decay) as features.
  • models using PoS tags of surrounding words, local collocations, and word embeddings (integrated using exponential decay) as features.
Training data: SemCor+OMSTI, with Stanford CoreNLP as preprocessor
  • models using surrounding words, PoS tags of surrounding words, and local collocations as features.
  • models using surrounding words, PoS tags of surrounding words, local collocations, and word embeddings (integrated using exponential decay) as features.
  • models using PoS tags of surrounding words, local collocations, and word embeddings (integrated using exponential decay) as features.
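The "exponential decay" integration mentioned above combines the embeddings of the surrounding words into a single feature vector, weighting each vector by its distance from the target word. The exact decay constants used by supWSD are not documented here; the sketch below (with a hypothetical helper class `EmbeddingStrategies`) assumes a weight of exp(-|d|) for EXP and 1/|d| for FRA, purely for illustration:

```java
import java.util.Arrays;

// Illustration of the three word_embeddings strategies (AVG, FRA, EXP).
// Decay functions are assumptions for the sketch, not supWSD's exact formulas.
public class EmbeddingStrategies {

    interface Decay { double weight(int distance); }

    // AVG: plain centroid of all surrounding-word vectors.
    static double[] average(double[][] vectors) {
        double[] out = new double[vectors[0].length];
        for (double[] v : vectors)
            for (int i = 0; i < out.length; i++) out[i] += v[i];
        for (int i = 0; i < out.length; i++) out[i] /= vectors.length;
        return out;
    }

    // FRA: weight each vector by 1/|d| (distances are nonzero, since the
    // target word itself is excluded).
    static double[] fractional(double[][] vectors, int[] distances) {
        return weighted(vectors, distances, d -> 1.0 / Math.abs(d));
    }

    // EXP: weight decays exponentially with distance, here exp(-|d|).
    static double[] exponential(double[][] vectors, int[] distances) {
        return weighted(vectors, distances, d -> Math.exp(-Math.abs(d)));
    }

    // Normalized weighted centroid of the vectors.
    static double[] weighted(double[][] vectors, int[] distances, Decay decay) {
        double[] out = new double[vectors[0].length];
        double total = 0;
        for (int j = 0; j < vectors.length; j++) {
            double w = decay.weight(distances[j]);
            total += w;
            for (int i = 0; i < out.length; i++) out[i] += w * vectors[j][i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= total;
        return out;
    }

    public static void main(String[] args) {
        double[][] vecs = {{1, 0}, {0, 1}}; // embeddings of two context words
        int[] dist = {-1, 2};               // positions relative to the target
        System.out.println(Arrays.toString(average(vecs)));          // [0.5, 0.5]
        System.out.println(Arrays.toString(exponential(vecs, dist))); // first word dominates
    }
}
```

With exponential decay, the word at distance -1 receives weight e^-1 and the word at distance 2 receives e^-2, so the closer word contributes roughly twice as much to the resulting feature vector.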

Files & Directories

Name: Description
config: BabelNet configuration folder, containing the properties files.
lib: BabelNet libraries folder, containing all the necessary .jar files.
models: folder containing the models of the preprocessing components (openNLP only).
resources: folder containing the lexical semantic resources (sense inventories).
src/main/java/it/uniroma1/lcl/supWSD: source code package.
LICENSE: license file.
README: this file.
pom.xml: Maven configuration file.
supWSD.xml: supWSD configuration file.
supWSD.xsd: schema definition for supWSD.xml. It describes the elements of supWSD.xml and verifies that each item adheres to its element description.

Install

  1. Download the jar file from the site (link will shortly follow).
  2. Move the jar to a directory of your choice (for example, supWSD).
  3. Copy the following files and directories into the supWSD directory: supWSD.xml, supWSD.xsd, config, models, resources.

Requirements

  1. This software requires Java 8 (JRE 1.8) or higher.
  2. Since supWSD uses JWNL to access WordNet, you must define the path of the WordNet dictionary. In resources/wndictionary/prop.xml you will find the line <param name="dictionary_path" value="dict" />: the value "dict" specifies the path of the WordNet dictionary.
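For instance, if the WordNet dictionary were installed under /usr/local/WordNet-3.0/dict (a hypothetical path; substitute your own), the line in resources/wndictionary/prop.xml would become:

```xml
<!-- dictionary_path must point at the directory containing WordNet's data files -->
<param name="dictionary_path" value="/usr/local/WordNet-3.0/dict" />
```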

Quick Start

Assume the jar file has been moved to a directory named "supWSD".

To train a model, type in a shell opened in this directory:
java -jar supWSD.jar train supWSD.xml train.xml train.keys
train: function name.
supWSD.xml: configuration file.
train.xml: training corpus file.
train.keys: keys file.

To test a file, type in a shell opened in this directory:
java -jar supWSD.jar test supWSD.xml test.xml test.keys
test: function name.
supWSD.xml: configuration file.
test.xml: test file.
test.keys: keys file.

Configuration

The supWSD configuration file lets you tune the entire disambiguation process:

Tag [attributes]: Meaning
working_directory: the location where the results (models, stats and scores) will be saved.
parser [mns]: dataset parser; supWSD provides six parser types: lexical, senseval, semeval7, semeval13, semeval15 and plain. You can also implement and integrate a new parser. mns (Model Name System) is the path to the file containing the lexelt information of the test instances.
preprocessing: within this tag you set the components of the preprocessing pipeline. For each component you can specify the model to apply using the model attribute. The simple component performs string splitting using the value of the model attribute. You can also implement and integrate a new component using the factory method pattern. To bypass a phase, set the value of the child tag to none.
splitter [model]: component used for sentence splitting: stanford, open_nlp, simple, none.
tokenizer [model]: component used for tokenization: stanford, open_nlp, penn_tree_bank, simple, none.
tagger [model]: component used for part-of-speech tagging: stanford, open_nlp, tree_tagger, simple, none.
lemmatizer [model]: component used for lemmatization: stanford, open_nlp, jwnl, tree_tagger, simple, none.
dependency_parser [model]: component used for dependency parsing: stanford, none.
features: which features to extract. To disable an extractor, set the tag's value to false.
pos_tags [cutoff]: cutoff: remove all elements whose frequency is below the threshold.
local_collocations [cutoff, sequences]: sequences: file containing one extraction sequence per line (default: "-2 \n -1 \n 1 \n 2 \n -2,-1 \n -1,1 \n 1,2 \n -3,-1 \n -2,1 \n -1,2 \n 1,3").
surrounding_words [cutoff, stopwords, window]: stopwords: file containing a list of stop words, one per line (default: SurroundingStopWordsFilter.DEFAULT_FILTERS); window: number of sentences (on each side) around the current sentence from which words are extracted. Set -1 to extract all words.
word_embeddings [cache, strategy, vectors, vocab, window]: cache: vector cache size as a fraction of the number of vectors [0-1]. strategy: AVG (average) computes the centroid of the embeddings of all surrounding words; FRA (fractional decay) weights vectors by their distance from the target word; EXP (exponential decay) makes the weight of the vectors decay exponentially with distance. window: number of words (on each side) around the target word from which embeddings are extracted. vectors: file containing word embeddings, one vector per line (word vector). vocab: file containing vocabulary words, one per line (word frequency); required only when the cache size is less than 1.
syntactic_relations: extract different types of syntactic relations depending on the POS of the word.
classifier: liblinear or libsvm; the classifier trains a model for each annotated word. The model is then used to classify test instances.
writer: all: export results to a single file; single: generate a file for each test instance; plain: create a plain text file, one sentence per line, with senses and probabilities for the disambiguated words.
sense_inventory [dict]: the sense inventory used for the test instances: wordnet, babelnet or none. For WordNet you must set the dict attribute to the path of the WordNet dictionary.
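To make the table above concrete, here is a minimal configuration sketch. The element nesting, attribute placement, and paths shown are assumptions for illustration only; the authoritative schema is supWSD.xsd, which should be consulted for the exact structure:

```xml
<!-- Hypothetical sketch; the real element layout is defined by supWSD.xsd -->
<supWSD>
  <working_directory>output</working_directory>
  <parser mns="test.mns">semeval7</parser>
  <preprocessing>
    <splitter>stanford</splitter>
    <tokenizer>stanford</tokenizer>
    <tagger>stanford</tagger>
    <lemmatizer>stanford</lemmatizer>
    <dependency_parser>none</dependency_parser> <!-- bypass this phase -->
  </preprocessing>
  <features>
    <pos_tags cutoff="1">true</pos_tags>
    <local_collocations cutoff="1">true</local_collocations>
    <surrounding_words cutoff="1" window="-1">true</surrounding_words>
    <word_embeddings strategy="EXP" window="10" cache="1"
                     vectors="embeddings.txt">false</word_embeddings>
  </features>
  <classifier>liblinear</classifier>
  <writer>all</writer>
  <sense_inventory dict="resources/wndictionary/dict">wordnet</sense_inventory>
</supWSD>
```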

References

If you use this system, please cite the following paper:

@InProceedings{papandreaetal:EMNLP2017Demos,
  author    = {Papandrea, Simone  and  Raganato, Alessandro  and  Delli Bovi, Claudio},
  title     = {SupWSD: A Flexible Toolkit for Supervised Word Sense Disambiguation},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {103--108}
}