
WUGs

Scripts to process Word Usage Graphs (WUGs).

If you use this software for academic research, please cite the papers listed under "BibTex" below.

Find WUG data sets on the WUGsite.

Usage

Under scripts/ we provide a pipeline for creating and clustering graphs and for extracting data from them (e.g. change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:

bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh

For the alternative pipeline with multiple possible clustering algorithms (Correlation Clustering, Weighted Stochastic Block Model, Chinese Whispers, Louvain method) and custom plotting functionalities, instead run:

bash -e scripts/run_uug2.sh

There are two scripts for external use with the DURel annotation tool, which allow you to specify the input directory and other parameters on the command line (find usage examples in test.sh):

bash -e scripts/run_system.sh $dir ...
bash -e scripts/run_system2.sh $dir ...

Attention: The pipeline modifies graphs iteratively, i.e., each run depends on the previous one; the scripts therefore delete previously written data to avoid this dependence. Important: The scripts use simple test parameters; to improve the clustering, load parameters_opt.sh in run_uug.sh or run_usg.sh.

We recommend running the scripts within a Python Anaconda environment. You have two options:

  1. Create and activate the conda environment yourself, and then install the required packages with conda env update --file packages.yml.
  2. Run source install_packages.sh. This will create the conda environment and install all required packages.

Both installation options were tested on Linux. You can test if your installation is working by running

bash -e test.sh

After installation, please check whether pygraphviz was installed correctly. There have been recurring errors with pygraphviz installation across operating systems. If an error occurs, you can check this page for solutions. On Linux, installing graphviz through the package manager is recommended.

Description

  • data2join.py: joins annotated data
  • data2annotators.py: extracts mapping from users to (anonymized) annotators
  • data2agr.py: computes agreement on full data
  • use2graph.py: adds uses to graph
  • sense2graph.py: adds senses to graph, for usage-sense graphs
  • sense2node.py: adds sense annotation data to nodes, if available
  • judgments2graph.py: adds judgments to graph
  • graph2cluster.py: clusters graph
  • extract_clusters.py: extracts clusters from graph
  • graph2stats.py: extracts statistics from graph, including change scores
  • graph2plot.py: plots interactive graph in 2D

Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning and clustering procedures.

For annotating and plotting your own graphs we recommend using the DURel Tool.

Additional scripts and data

  • misc/usim2data.sh: downloads USim data and converts it to WUG format
  • misc/make_release.sh: creates data for publication from pipeline output (compare the format of the published data sets on the WUGsite)

Input

For usage-usage graphs:

  • uses: find examples at test_uug/data/*/uses.csv
  • judgments: find examples at test_uug/data/*/judgments.csv

For usage-sense graphs:

  • uses: find examples at test_usg/data/*/uses.csv
  • senses: find examples at test_usg/data/*/senses.csv
  • judgments: find examples at test_usg/data/*/judgments.csv

Note: The column 'identifier' in each uses.csv should identify each word usage uniquely across all words.
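To illustrate the uniqueness requirement, the following sketch checks for identifier collisions across several uses.csv files. It is not part of the repository's scripts; the function name is hypothetical.

```python
import csv

def check_unique_identifiers(paths):
    """Return the set of 'identifier' values that occur more than once
    across the given tab-separated uses.csv files."""
    seen = set()
    duplicates = set()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                ident = row["identifier"]
                if ident in seen:
                    duplicates.add(ident)
                seen.add(ident)
    return duplicates
```

An empty result means the identifiers are globally unique and the files are safe to feed into the pipeline.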

Input Format

The uses.csv files must contain one use per line, with the following tab-separated fields specified in the header:

<lemma>\t<pos>\t<date>\t<grouping>\t<identifier>\t<description>\t<context>\t<indexes_target_token>\t<indexes_target_sentence>\n

The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with further information, such as language, lemmatization, etc.

Find information on the individual fields below:

  • lemma: the lemma form of the target word in the respective word use
  • pos: the POS tag if available (else put a space character)
  • date: the date of the use if available (else put a space character)
  • grouping: any string assigning uses to groups (e.g. time periods, corpora or dialects)
  • identifier: an identifier unique to each use across lemmas. We recommend the format filename-sentenceno-tokenno
  • description: any additional information on the use if available (else put a space character)
  • context: the text of the use. This will be shown to annotators.
  • indexes_target_token: the character indexes of the target token in context (Python list ranges as used in slicing, e.g. 17:25)
  • indexes_target_sentence: the character indexes of the target sentence (containing the target token) in context (e.g. 0:30 if the context contains only one sentence, or 10:45 if it contains additional surrounding sentences). The part of the context beyond the specified character range will be marked as background in gray.
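As an illustration of the field semantics above, this sketch reads a tab-separated uses.csv and recovers the target token and target sentence from the slice-style index ranges. The function name is ours, not from the repository.

```python
import csv

def read_uses(path):
    """Read a tab-separated uses.csv and extract the target token and
    target sentence from each context via the character-index ranges."""
    uses = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            # 'indexes_target_token' is a Python slice range such as '17:25'
            start, end = map(int, row["indexes_target_token"].split(":"))
            row["target_token"] = row["context"][start:end]
            # Likewise for the sentence containing the target token
            s_start, s_end = map(int, row["indexes_target_sentence"].split(":"))
            row["target_sentence"] = row["context"][s_start:s_end]
            uses.append(row)
    return uses
```

Note that the ranges are character offsets into context, not token positions, so they can be applied directly with Python slicing.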

The judgments.csv files must contain one use pair judgment per line, with the following tab-separated fields specified in the header:

<identifier1>\t<identifier2>\t<annotator>\t<judgment>\t<comment>\t<lemma>\n

The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with further information, such as the round of annotation, etc.

Find information on the individual fields below:

  • identifier1: identifier of the first use in the use pair (must correspond to identifier in uses.csv)
  • identifier2: identifier of the second use in the use pair
  • annotator: annotator name
  • judgment: annotator judgment on graded scale (e.g. 1 for unrelated, 4 for identical)
  • comment: the annotator's comment (if any)
  • lemma: the lemma form of the target word in both uses
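The judgments above can be aggregated into edge weights, for example by taking the median judgment per use pair (one common choice for WUG-style graphs). A sketch, not part of the repository's scripts:

```python
import csv
from collections import defaultdict
from statistics import median

def median_judgments(path):
    """Read a tab-separated judgments.csv and aggregate the annotator
    judgments for each use pair into a median edge weight."""
    pair_judgments = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Sort the pair so (a, b) and (b, a) map to the same edge
            pair = tuple(sorted((row["identifier1"], row["identifier2"])))
            pair_judgments[pair].append(float(row["judgment"]))
    return {pair: median(values) for pair, values in pair_judgments.items()}
```

The resulting mapping from use pairs to weights corresponds to the weighted edges of a usage-usage graph before clustering.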

Further reading

Find further research on WUGs in these papers:

BibTex

@inproceedings{Schlechtweg2021dwug,
 title = {{DWUG}: A large Resource of Diachronic Word Usage Graphs in Four Languages},
 author = {Schlechtweg, Dominik  and Tahmasebi, Nina  and Hengchen, Simon  and Dubossarsky, Haim  and McGillivray, Barbara},
 booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
 publisher = {Association for Computational Linguistics},
 address = {Online and Punta Cana, Dominican Republic},
 pages = {7079--7091},
 url = {https://aclanthology.org/2021.emnlp-main.567},
 year = {2021}
}

@phdthesis{Schlechtweg2023measurement,
  author  = "Schlechtweg, Dominik",
  title   = "Human and Computational Measurement of Lexical Semantic Change",
  school  = "University of Stuttgart",
  address = "Stuttgart, Germany",
  url     = {http://dx.doi.org/10.18419/opus-12833},
  year    = 2023
}

License

Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0).