Scripts to process Word Usage Graphs (WUGs).
If you use this software for academic research, please cite these papers:
- Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
- Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Find WUG data sets on the WUGsite.
Under `scripts/` we provide a pipeline for creating and clustering graphs and for extracting data from them (e.g., change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with
chmod 755 scripts/*.sh
Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:
bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh
For the alternative pipeline with multiple possible clustering algorithms (Correlation Clustering, Weighted Stochastic Block Model, Chinese Whispers, Louvain method) and custom plotting functionalities, instead run:
bash -e scripts/run_uug2.sh
There are two scripts for external use with the DURel annotation tool, allowing you to specify the input directory and other parameters on the command line (find usage examples in `test.sh`):
bash -e scripts/run_system.sh $dir ...
bash -e scripts/run_system2.sh $dir ...
Attention: these scripts modify graphs iteratively, i.e., the current run depends on the previous run; the script deletes previously written data to avoid this dependence. Important: the scripts use simple test parameters; to improve the clustering, load `parameters_opt.sh` in `run_uug.sh` or `run_usg.sh`.
We recommend running the scripts within a Python Anaconda environment. You have two options:
- Create and activate the conda environment yourself, then install the required packages with `conda env update --file packages.yml`.
- Run `source install_packages.sh`. This will create the conda environment and install all required packages.
Both installation options were tested on Linux. You can test if your installation is working by running
bash -e test.sh
After installation, please check whether pygraphviz was installed correctly. There have been recurring errors with pygraphviz installation across operating systems. If an error occurs, you can check this page for solutions. On Linux, installing graphviz through the package manager is recommended.
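A quick way to verify the pygraphviz installation from within the active environment is a short import check; this snippet is just a sketch and not part of the repository:

```python
# Sanity check: can pygraphviz be imported in the current environment?
try:
    import pygraphviz
    status = "pygraphviz OK, version " + pygraphviz.__version__
except ImportError as e:
    status = "pygraphviz missing: " + str(e)
print(status)
```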
The pipeline consists of the following scripts:
- `data2join.py`: joins annotated data
- `data2annotators.py`: extracts mapping from users to (anonymized) annotators
- `data2agr.py`: computes agreement on full data
- `use2graph.py`: adds uses to graph
- `sense2graph.py`: adds senses to graph, for usage-sense graphs
- `sense2node.py`: adds sense annotation data to nodes, if available
- `judgments2graph.py`: adds judgments to graph
- `graph2cluster.py`: clusters graph
- `extract_clusters.py`: extracts clusters from graph
- `graph2stats.py`: extracts statistics from graph, including change scores
- `graph2plot.py`: plots interactive graph in 2D
Please find the parameters for the current optimized WUG versions in `parameters_opt.sh`. Note that the parameters for the SemEval versions in `parameters_semeval.sh` will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning and clustering procedures.
For annotating and plotting your own graphs, we recommend using the DURel Tool.
- `misc/usim2data.sh`: downloads USim data and converts it to WUG format
- `misc/make_release.sh`: creates data for publication from pipeline output (compare to the format of published data sets on the WUGsite)
For usage-usage graphs:
- uses: find examples at `test_uug/data/*/uses.csv`
- judgments: find examples at `test_uug/data/*/judgments.csv`
For usage-sense graphs:
- uses: find examples at `test_usg/data/*/uses.csv`
- senses: find examples at `test_usg/data/*/senses.csv`
- judgments: find examples at `test_usg/data/*/judgments.csv`
Note: The column 'identifier' in each `uses.csv` should identify each word usage uniquely across all words.
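This uniqueness requirement can be checked with a short script before annotation; the function name and file paths below are hypothetical, not part of the pipeline:

```python
import csv
from collections import Counter

# Sketch: collect all values of the 'identifier' column across a set of
# uses.csv files and report any that occur more than once.
def check_unique_identifiers(paths):
    counts = Counter()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                counts[row["identifier"]] += 1
    return sorted(i for i, c in counts.items() if c > 1)
```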
The `uses.csv` files must contain one use per line, with the following fields specified as header and separated by `\t`:
`<lemma>\t<pos>\t<date>\t<grouping>\t<identifier>\t<description>\t<context>\t<indexes_target_token>\t<indexes_target_sentence>\n`
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as language, lemmatization, etc.
Find information on the individual fields below:
- lemma: the lemma form of the target word in the respective word use
- pos: the POS tag if available (else put space character)
- date: the date of the use if available (else put space character)
- grouping: any string assigning uses to groups (e.g. time-periods, corpora or dialects)
- identifier: an identifier unique to each use across lemmas. We recommend using the format `filename-sentenceno-tokenno`
- description: any additional information on the use if available (else put space character)
- context: the text of the use. This will be shown to annotators.
- indexes_target_token: the character indexes of the target token in `context` (Python list ranges as used in slicing, e.g. `17:25`)
- indexes_target_sentence: the character indexes of the target sentence (containing the target token) in `context` (e.g. `0:30` if the context contains only one sentence, or `10:45` if it contains additional surrounding sentences). The part of the context beyond the specified character range will be marked as background in gray.
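The two index fields can be illustrated with a hypothetical use; the context string and index values below are invented:

```python
# Hypothetical use entry showing how the index fields address
# substrings of 'context' via Python slicing.
context = "And taking bread, he gave thanks."
indexes_target_token = "11:16"     # spans the token "bread"
indexes_target_sentence = "0:33"   # the whole (single) sentence

start, end = map(int, indexes_target_token.split(":"))
print(context[start:end])  # -> bread
```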
The `judgments.csv` files must contain one use pair judgment per line, with the following fields specified as header and separated by `\t`:
`<identifier1>\t<identifier2>\t<annotator>\t<judgment>\t<comment>\t<lemma>\n`
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as the round of annotation, etc.
Find information on the individual fields below:
- identifier1: identifier of the first use in the use pair (must correspond to an identifier in `uses.csv`)
- identifier2: identifier of the second use in the use pair
- annotator: annotator name
- judgment: annotator judgment on graded scale (e.g. 1 for unrelated, 4 for identical)
- comment: the annotator's comment (if any)
- lemma: the lemma form of the target word in both uses
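A minimal sketch of assembling one such judgment row in the expected tab-separated format, including the trailing empty line; the identifiers, annotator name, and judgment value are invented for illustration:

```python
# Build a minimal judgments.csv content string: tab-separated header,
# one judgment row, and one empty line at the end of the file.
header = ["identifier1", "identifier2", "annotator",
          "judgment", "comment", "lemma"]
row = ["file1-12-3", "file2-4-7", "annotator1", "4", " ", "bread"]

lines = ["\t".join(header), "\t".join(row), ""]  # "" yields the empty last line
content = "\n".join(lines) + "\n"
print(content)
```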
Find further research on WUGs in these papers:
- Bill Noble, Francesco Periti, Nina Tahmasebi. 2024. Improving Word Usage Graphs with Edge Induction. In Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change.
- Dominik Schlechtweg, Shafqat Mumtaz Virk, Pauline Sander, Emma Sköldberg, Lukas Theuer Linke, Tuo Zhang, Nina Tahmasebi, Jonas Kuhn, Sabine Schulte im Walde. 2024. The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
- Mariia Fedorova, Andrey Kutuzov, Nikolay Arefyev, Dominik Schlechtweg. 2024. Enriching Word Usage Graphs with Cluster Definitions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation.
- Jing Chen, Emmanuele Chersoni, Dominik Schlechtweg, Jelena Prokic, Chu-Ren Huang. 2023. ChiWUG: A Graph-based Evaluation Dataset for Chinese Lexical Semantic Change Detection. In Proceedings of the 4th International Workshop on Computational Approaches to Historical Language Change 2023 (LChange'23).
- Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.
- Anna Aksenova, Ekaterina Gavrishina, Elisei Rykov, Andrey Kutuzov. 2022. RuDSI: Graph-based Word Sense Induction Dataset for Russian. In Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing.
- Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.
- Gioia Baldissin, Dominik Schlechtweg, Sabine Schulte im Walde. 2022. DiaWUG: A Dataset for Diatopic Lexical Semantic Variation in Spanish. In Proceedings of the 13th Language Resources and Evaluation Conference.
- Dominik Schlechtweg, Enrique Castaneda, Jonas Kuhn, Sabine Schulte im Walde. 2021. Modeling Sense Structure in Word Usage Graphs with the Weighted Stochastic Block Model. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics.
- Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Serge Kotchourko. 2021. Optimizing Human Annotation of Word Usage Graphs in a Realistic Simulation Environment. Bachelor thesis.
- Benjamin Tunc. 2021. Optimierung von Clustering von Wortverwendungsgraphen. Bachelor thesis.
@inproceedings{Schlechtweg2021dwug,
title = {{DWUG}: A large Resource of Diachronic Word Usage Graphs in Four Languages},
author = {Schlechtweg, Dominik and Tahmasebi, Nina and Hengchen, Simon and Dubossarsky, Haim and McGillivray, Barbara},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
address = {Online and Punta Cana, Dominican Republic},
pages = {7079--7091},
url = {https://aclanthology.org/2021.emnlp-main.567},
year = {2021}
}
@phdthesis{Schlechtweg2023measurement,
author = {Schlechtweg, Dominik},
title = {Human and Computational Measurement of Lexical Semantic Change},
school = {University of Stuttgart},
address = {Stuttgart, Germany},
url = {http://dx.doi.org/10.18419/opus-12833},
year = {2023}
}
Creative Commons Attribution No Derivatives 4.0 International.