Estimating POS Annotation Consistency of Different Treebanks in a Language

Documentation

For documentation, please refer to Chapter 4 of the thesis document.

Abstract

We introduce a new symmetric metric (called θ_pos) that utilises the non-symmetric KL_cpos³ metric to compare the annotation consistency of different treebanks of a given language that are annotated under the same guideline. We can set a maximal value threshold for this metric so that a pair of treebanks can be considered harmonious in their annotation consistency if their θ_pos value surpasses this threshold. For the calculation of the threshold, we estimate the effects of

the size variation, and
the generic distribution of data

in the treebanks of the considered pair on the θ_pos metric. The estimations are based on data from treebanks of distinct language families, making the threshold language-independent. The metric is extensible to any given guideline. We demonstrate its utility by listing the treebanks in Universal Dependencies version 2.5 (UDv2.5) data that are annotated consistently with other treebanks of the same language.
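As a rough illustration of the non-symmetric building block mentioned above, the sketch below computes KL_cpos³ following the definition in Rosa and Žabokrtský (2015): a KL divergence between POS trigram distributions, where trigrams unseen in the source are given a count of 1. It is not taken from the scripts in this repository and does not implement θ_pos itself. The function names and the sentence-boundary padding are assumptions made for the example.

```python
import math
from collections import Counter


def upos_trigrams(sentences):
    """Count POS trigrams over sentences given as lists of UPOS tags."""
    counts = Counter()
    for tags in sentences:
        # Sentence-boundary padding is an assumption of this sketch.
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts


def kl_cpos3(tgt_counts, src_counts):
    """KL divergence of the target trigram distribution from the source one.

    Trigrams unseen in the source get a count of 1 so the logarithm
    stays finite, as described by Rosa and Žabokrtský (2015).
    """
    tgt_total = sum(tgt_counts.values())
    src_total = sum(src_counts.values())
    score = 0.0
    for trigram, count in tgt_counts.items():
        f_tgt = count / tgt_total
        f_src = src_counts.get(trigram, 1) / src_total
        score += f_tgt * math.log(f_tgt / f_src)
    return score
```

Because `kl_cpos3(a, b)` and `kl_cpos3(b, a)` generally differ, the direction of comparison matters, which is the motivation for a symmetric metric such as θ_pos.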

Included Files

docs/*: Files used for documentation
scripts/*: Scripts used in the experiment. Refer here for details.
treebanks_to_compare.tsv: File generated when treebanks_to_compare.sh is run on UDv2.5 data.
UDv2.5_scores.tsv: Contains θ_pos scores for different treebanks across different languages in UDv2.5. Treebanks marked with a False value in the treebanks_to_compare.tsv file are not included in the computation of the scores.

Using This Module

To get started with the module, clone this repository to your system, and then run the commands below as required:

make getdata

Downloads the required dependencies using the requirements.txt file and the UDv2.5 data using the link here, and then prepares working copies of the treebanks in the current directory. This target is also invoked by every other command to make sure the data is correctly in place.

make all_scores

Computes θ_pos scores for all the treebank combinations in UDv2.5 in TSV format. Results are stored in the UDv2.5_scores.tsv file.

make size_control

Computes θ_pos and coverage scores for the Czech-PDT and Estonian-EDT data to study how θ_pos varies with dataset size. Results are stored in the size_control directory.

make get_trigrams

Computes how the POS trigrams vary with dataset size for the Czech-PDT and Estonian-EDT data. Generates the unique_trigrams directory and plots the results therein. The generated plots are saved in the docs directory.
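The idea behind this target can be sketched as counting how many distinct UPOS trigrams have been seen as more sentences of a treebank are read. The code below is a minimal sketch under that assumption, not the repository's own script. The function names and the `step` parameter are hypothetical, and sentences are read from CoNLL-U lines, where UPOS is the fourth column.

```python
def read_upos_sentences(lines):
    """Yield one list of UPOS tags per sentence from CoNLL-U formatted lines."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if tags:
                yield tags
                tags = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tags.append(cols[3])  # UPOS is the fourth CoNLL-U column
    if tags:
        yield tags


def unique_trigram_growth(sentences, step):
    """Return (sentences_seen, unique_trigram_count) after every `step` sentences."""
    seen = set()
    growth = []
    for n, tags in enumerate(sentences, start=1):
        for i in range(len(tags) - 2):
            seen.add(tuple(tags[i:i + 3]))
        if n % step == 0:
            growth.append((n, len(seen)))
    return growth
```

Passing an open CoNLL-U file to `read_upos_sentences` and the result to `unique_trigram_growth` gives a list of points that could be plotted against dataset size.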

make genre_control

Computes how the θ_pos score varies across different genres in the Polish-LFG and Finnish-TDT data. Generates the genre_control directory.

make genres_additive

Computes how the θ_pos score varies as genres are added, using data from the Polish-LFG dataset. Generates the genre_control/genres_additive directory.

make clean

Removes the generated files and folders in the directory.

Results on UDv2.5 Data

References

Rudolf Rosa and Zdeněk Žabokrtský. KLcpos3 - a Language Similarity Measure for Delexicalized Parser Transfer. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 243–249, Beijing, China, July 2015. Association for Computational Linguistics. doi: https://doi.org/10.3115/v1/P15-2040. URL https://www.aclweb.org/anthology/P15-2040.
Prafulla Kalapatap, N. N. Tejas, Siddharth Dalmia, Prakhar Gupta, Bhaswant Inguva, and Aruna Malapati. A Novel Similarity Measure: Voronoi Audio Similarity for Genre Classification. International Journal of Intelligent Systems Technologies and Applications, 16(4):309–318, January 2017. ISSN 1740-8865. doi: https://doi.org/10.1504/IJISTA.2017.088054.
Elias Pampalk, Arthur Flexer, and Gerhard Widmer. Improvements of Audio-Based Music Similarity and Genre Classificaton. In ISMIR, volume 5, pages 634–637. London, UK, 2005.
Ricardo Casañ-Pitarch. A Proposal for Genre Analysis: The AMS model. In Chelo Vargas-Sierra, editor, Professional and Academic Discourse: an Interdisciplinary Perspective, volume 2 of EPiC Series in Language and Linguistics, pages 235–246. EasyChair, 2017. doi: https://doi.org/10.29007/hbg9. URL https://easychair.org/publications/paper/b6rp
Douglas Biber. A typology of English texts. Linguistics, 27(1):3–44, 1989. ISSN 0024-3949. doi: https://doi.org/10.1515/ling.1989.27.1.3.
Douglas Biber. Variation across speech and writing. Cambridge University Press, 1991.
Douglas Biber. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press, 1995.
Francis Heylighen and Jean-Marc Dewaele. Formality of Language: definition, measurement and behavioral determinants. Interner Bericht, Center “Leo Apostel”, Vrije Universiteit Brüssel, 4, 1999.
Alejandro Mosquera and Paloma Moreda Pozo. The Use of Metrics for Measuring Informality Levels in Web 2.0 Texts. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, 2011. URL https://www.aclweb.org/anthology/W11-4523.
Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2019. URL http://hdl.handle.net/11234/1-3105.
