
ICS02: 7. Using Treebanks

Gabriel Bodard edited this page Feb 25, 2019 · 13 revisions

Sunoikisis Digital Classics, Spring 2019

Session 7. Using treebanked corpora: Universal Dependencies

Thursday Feb 21, 16:00 UK = 18:00 EET

Convenors: Timo Korkiakangas (Helsinki) & Marco Passarotti (UCSC, Milan)

YouTube link: https://youtu.be/EFWxTfkdzVA

Slides: Korkiakangas: Querying Perseus Treebanks; Passarotti: Universal Dependencies

Session outline

The objective of this session is two-fold. First, it introduces Universal Dependencies (UD). Second, it gives a practical example of how to query treebanks to answer simple research questions.

  1. UD (http://universaldependencies.org/) is one of the most notable ongoing projects in computational linguistics. The project, run by contributors from the research community, aims to create a collection of dependency treebanks for different languages, built according to a cross-linguistically consistent annotation style that is meant to complement (but not to replace) the language- and treebank-specific schemes. Since the first set of guidelines appeared in 2014, UD has published a new release of the treebank collection roughly every six months. Version 2 (v2), which introduced a new set of guidelines, was released in March 2017. The current version at the time of this session is 2.3 (November 2018), which includes 129 treebanks covering 76 languages. The session will introduce the basic aspects of the UD v2 annotation style as well as the format of the source data.

  2. There are several software tools that can be used to query dependency treebanks. We present a use case that illustrates a simplified treebank query, from setting up the research question to interpreting the results. Exploiting all the levels of linguistic annotation requires a powerful query syntax. We will use the PML Tree Query extension (http://ufal.mff.cuni.cz/tred/documentation/ar01-toc.html) of the TrEd Treebank Editor (https://ufal.mff.cuni.cz/tred/) on data from the Latin Dependency Treebanks. The annotation of the Latin Dependency Treebanks is based on the Perseus Guidelines (https://github.com/PerseusDL/treebank_data/blob/master/v1/latin/docs/guidelines.pdf), not on UD. Because not all treebanks have yet been converted to Universal Dependencies, one has to work with various types of treebank annotation, which means understanding the annotation principles underlying each of them. After the session, the student will understand the logical steps involved in querying treebanks for study and research purposes.
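
The source data format mentioned in point 1 is CoNLL-U: plain text with one token per line and ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. As a rough illustration only, the Python sketch below reads such data and runs a very simple query (subjects attached to a verb). The sample sentence and its annotation are invented for this example and are not taken from any treebank; the function names `read_conllu` and `find_relation` are hypothetical. For real work, PML-TQ or an established CoNLL-U library would be the more likely choice.

```python
from typing import Dict, List, Tuple

# The ten column names of the CoNLL-U format used for UD releases.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented Latin sentence, annotated by hand for illustration only.
SAMPLE = (
    "# text = Puella rosam amat.\n"
    "1\tPuella\tpuella\tNOUN\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_\n"
    "2\trosam\trosa\tNOUN\t_\tCase=Acc|Number=Sing\t3\tobj\t_\t_\n"
    "3\tamat\tamo\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

Token = Dict[str, str]


def read_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if token["id"].isdigit():
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


def find_relation(sentence: List[Token], deprel: str,
                  head_upos: str) -> List[Tuple[str, str]]:
    """Return (dependent, head) form pairs for one relation and head POS."""
    by_id = {tok["id"]: tok for tok in sentence}
    pairs = []
    for tok in sentence:
        head = by_id.get(tok["head"])  # None for the root (head "0")
        if tok["deprel"] == deprel and head and head["upos"] == head_upos:
            pairs.append((tok["form"], head["form"]))
    return pairs


sentences = read_conllu(SAMPLE)
subjects = find_relation(sentences[0], "nsubj", "VERB")
print(subjects)
```

The same two steps, loading the annotation and then stating a constraint on a node and its head, are what a dedicated query language expresses far more compactly, which is why a tool such as PML-TQ is preferable once queries involve several nodes or annotation levels.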

Seminar readings

  • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of LREC-2016, pp. 1659-1666: http://www.lrec-conf.org/proceedings/lrec2016/pdf/348_Paper.pdf
  • González Saavedra, B. and Passarotti, M. 2018. "Using Tectogrammatical Annotation for Studying Actors and Actions in Sallust's Bellum Catilinae." The Prague Bulletin of Mathematical Linguistics 111, 5-28: https://ufal.mff.cuni.cz/pbml/111/art-saavedra-passarotti.pdf (pay attention to research setting and querying rather than to technical details)

Further reading

Essay title

Exercise

  1. Run the on-line web application of the NLP pipeline UDPipe (http://lindat.mff.cuni.cz/services/udpipe/) on a couple of texts in different languages. Evaluate the results manually.
  2. OR Try your hand at PML-TQ in the PML-TQ web client, where the Perseus Latin Treebank and the Index Thomisticus Treebank are available in UD, at http://lindat.mff.cuni.cz/services/pmltq/#!/home (instructions at http://ufal.mff.cuni.cz/pmltqdoc/doc/pmltq_tutorial_web_client.html).
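
For the manual evaluation in exercise 1, one possible way to summarise your findings is to correct the parser output by hand and then count how many tokens were given the right head (unlabelled attachment score, UAS) and the right head plus relation label (labelled attachment score, LAS). These metrics are not prescribed by the exercise; the sketch below is only a suggestion, the function name `attachment_scores` is hypothetical, and the "parsed" values are invented rather than real UDPipe output.

```python
from typing import List, Tuple

# One (head index, dependency relation) pair per token, in sentence order.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two analyses of the same token sequence."""
    if len(gold) != len(predicted):
        raise ValueError("both analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose head matches; LAS also requires the same label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return head_hits / total, full_hits / total


# Hand-corrected analysis (assumed gold) versus an invented parser output
# in which the second token has the right head but the wrong label.
gold = [(3, "nsubj"), (3, "obj"), (0, "root"), (3, "punct")]
parsed = [(3, "nsubj"), (3, "nsubj"), (0, "root"), (3, "punct")]

uas, las = attachment_scores(gold, parsed)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Comparing the two scores across languages gives a quick first impression of where the parser struggles, but looking at the individual errors remains the point of the exercise.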