# Universal Dependencies

In this section, we will dive deeper into Universal Dependencies, the framework which we have already encountered in connection with syntactic parsing and morphological analysis in [Part II](../part_ii/03_basic_nlp.ipynb) and [Part III](01_multilingual_nlp.ipynb).

After reading through this section, you should:

- understand the goals of Universal Dependencies as a project
- understand the key assumptions concerning linguistic structures in Universal Dependencies
- understand the basics of Universal Dependencies as an annotation schema
- know how to leverage the annotations provided by Universal Dependencies

## A brief introduction to Universal Dependencies as a project

[Universal Dependencies](https://universaldependencies.org/introduction.html) is a collaborative project that seeks to develop a common framework for describing the structure of diverse languages ([de Marneffe et al. 2021](https://doi.org/10.1162/coli_a_00402)). More specifically, the project seeks to enable the systematic description of grammatical structures and morphological features across languages, which in turn makes it possible to compare languages with each other.

The goal, broad applicability across diverse languages, lends the project the epithet "Universal", whereas the term "Dependencies" refers to the way the proposed annotation schema describes syntactic structures, which is explained in more detail below.

Linguistic corpora that contain annotations for syntactic structures are often called *treebanks*, because syntactic structures are generally represented using tree structures. In this context, then, a treebank is simply a collection of syntactic trees that have been consistently annotated using the Universal Dependencies annotation schema.

The number of treebanks annotated using the Universal Dependencies schema has grown steadily over the years (for an overview of treebanks covering 90 languages, see Nivre et al. [2020](https://aclanthology.org/2020.lrec-1.497/)). The design and creation of such treebanks have been documented in detail for various languages, such as Finnish (Haverinen et al. [2014](https://link.springer.com/article/10.1007/s10579-013-9244-1)), Wolof (Dione [2019](https://aclanthology.org/W19-8003/)) and Hindi/Urdu (Bhat et al. [2017](https://link.springer.com/chapter/10.1007/978-94-024-0881-2_24)).

To better understand the effort behind Universal Dependencies as a project, one should acknowledge that developing a consistent annotation schema for describing the structure of diverse languages, such as Finnish, Wolof and Hindi/Urdu, is far from trivial.

As pointed out in de Marneffe et al. ([2021: 302–303](https://doi.org/10.1162/coli_a_00402)), the Universal Dependencies annotation schema is a compromise between several criteria:

    -

## Some basic assumptions behind Universal Dependencies

- An introduction to UD: https://doi.org/10.1162/coli_a_00402
- Criticism of UD from the DG community: https://www.glossa-journal.org/article/id/5124/

- head and dependents
- phrasal units: nominals, clauses and modifiers

The description of linguistic structures in the Universal Dependencies framework revolves around three types of phrasal units: **nominals**, **clauses** and **modifiers**.

To put it simply, nominals are used for representing things, whereas clauses are used for representing events. Modifiers, in turn, can be used to describe both nominals and clauses more specifically.

### Nominals

Let's begin exploring Universal Dependencies by focusing on nominals.

(nominal groups in SFL: https://doi.org/10.1080/00437956.2021.1957545)
(English noun phrases: https://doi.org/10.1017/CBO9780511627699)

To get started, we import spaCy and load a medium-sized language model for English, and store the model under the variable `nlp`.

In [3]:
# Import the spaCy library
import spacy

# Use the load() function to load a medium-sized language model for English.
# Store the language model under the variable 'nlp'.
nlp = spacy.load('en_core_web_md')

Next, we import the *displacy* module from spaCy to visualise syntactic dependencies, as we learned in [Part II](../part_ii/03_basic_nlp.ipynb#Syntactic-parsing).

In [4]:
# Import the displacy module from spaCy
from spacy import displacy

We then define a string, "A large green bird", feed it to the language model under `nlp`, and store the resulting *Doc* object under the variable `nominal_group`.

In [5]:
# Feed a string to the language model; store the result under the variable 'nominal_group'
nominal_group = nlp('A large green bird')

Next, we use the `render()` function to draw the syntactic dependencies between the *Tokens* in the *Doc* object `nominal_group`.

By passing the string `dep` to the argument `style`, we explicitly instruct *displacy* to visualise the syntactic dependencies (because *displacy* can also visualise [named entities](../part_ii/03_basic_nlp.ipynb#Named-entity-recognition)).

In [6]:
# Render the syntactic dependencies using the render() function from displacy
displacy.render(nominal_group, style='dep')

This gives us a visualisation of the syntactic dependencies between the four *Tokens*.

Three arcs lead out from the noun "bird" and point towards the *Tokens* "A", "large" and "green". This means that the noun "bird" acts as the **head**, whereas the three other *Tokens* are the **dependents** of this head.

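These dependencies are further specified by syntactic relations, which are given by the label below each arc. Note that the label set used by spaCy's English models overlaps with, but is not identical to, the relations defined in Universal Dependencies: a direct object, for example, is labelled `dobj` by spaCy but `obj` in Universal Dependencies.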

In this case, the head noun "bird" has two adjectival modifiers (`amod`), "large" and "green", and a determiner (`det`), "A".

If we loop over the *Tokens* in the *Doc* object `nominal_group` and print out the dependency tag of each *Token*, which is available under the attribute `dep_`, we can see that the head noun has the dependency tag `ROOT`.

In other words, the syntactic dependencies that describe this nominal group are built around the noun.

In [7]:
# Loop over each Token in the Doc object 'nominal_group'
for token in nominal_group:
    
    # Print out each Token and its dependency tag
    print(token, token.dep_)

A det
large amod
green amod
bird ROOT
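
The relation between a head and its dependents can also be written out as plain data. The following is a minimal sketch that does not call spaCy: the list `parse` is typed in by hand from the output above, the head positions are added by hand, and the helper name `dependents_of` is made up for this illustration. As in spaCy, the root is assumed to act as its own head.

```python
# Hand-written copy of the parse shown above: (token, dependency tag, index of head).
# The head indices are an assumption of this sketch; the root points to itself.
parse = [
    ("A", "det", 3),
    ("large", "amod", 3),
    ("green", "amod", 3),
    ("bird", "ROOT", 3),
]


def dependents_of(head_index, parse):
    """Return (token, tag) pairs for every token whose head is at head_index."""
    return [
        (token, tag)
        for index, (token, tag, head) in enumerate(parse)
        # Skip the head itself, because the root points to its own position
        if head == head_index and index != head_index
    ]


# Print the dependents of the noun "bird", which sits at index 3
for token, tag in dependents_of(3, parse):
    print(token, tag)
```

In spaCy itself, the same information is available through the attributes `head` and `children` of each *Token*.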


This also leads us to the next phrasal unit, namely modifiers.

### Modifiers

### Clauses

In [22]:
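# Feed a clause to the language model; store the result under the variable 'clause'
clause = nlp('I saw a large green bird.')

# Render the syntactic dependencies using the render() function from displacy
displacy.render(clause, style='dep')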

In [17]:
# Loop over each Token in the Doc object 'clause'
for token in clause:

    # Print out each Token and its dependency tag
    print(token, token.dep_)

I nsubj
saw ROOT
a det
large amod
green amod
bird dobj
. punct


## Understanding the annotation schema

- universal POS tags
- universal morphological features
- syntactic relations
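
Universal Dependencies writes morphological features as `Name=Value` pairs joined by a vertical bar, with an underscore standing for an empty feature set. As a minimal sketch, such a string can be split into a dictionary using plain Python. The function name `parse_feats` and the example feature string are illustrative assumptions, not output from the language model used in this section, and features with several comma-separated values are kept as a single string.

```python
def parse_feats(feats):
    """Split a UD feature string such as 'Tense=Past|VerbForm=Fin' into a dict."""
    # An underscore or an empty string means that no features are given
    if feats in ("", "_"):
        return {}
    # Each feature has the form Name=Value; features are separated by '|'
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative feature string for a finite past-tense verb such as "saw"
features = parse_feats("Mood=Ind|Tense=Past|VerbForm=Fin")

# Look up the value of a single feature
print(features["Tense"])
```

For a spaCy *Token*, a comparable dictionary is available via `token.morph.to_dict()`.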

## Making the most of Universal Dependencies

- the more you know, the further you go: finding patterns
- evaluating parsers trained using UD corpora: labeled attachment score

In [23]:
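# Loop over each Token in the Doc object 'clause'
for token in clause:

    # Check whether the Token acts as a direct object
    if token.dep_ == 'dobj':

        # Collect the Token and all of its descendants, in document order
        descendants = list(token.subtree)

        # Get the indices at which the subtree starts and ends
        start, end = descendants[0].i, descendants[-1].i + 1

        # Slice the Doc into a Span and render its syntactic dependencies
        displacy.render(clause[start:end], style='dep')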
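
The labeled attachment score (LAS) mentioned in the list above is commonly defined as the proportion of tokens for which a parser predicts both the correct head and the correct dependency relation, whereas the unlabeled attachment score (UAS) only checks the head. The sketch below computes both scores using plain Python. The function name `attachment_scores` is hypothetical, the head indices in `gold` have been filled in by hand, and the analysis in `predicted` is invented for illustration rather than produced by a parser.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equally long lists of (head, relation) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty analysis")

    # A head is correct if the predicted head index matches the gold one
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # A labeled attachment is correct if both the head and the relation match
    correct_labeled = sum(g == p for g, p in zip(gold, predicted))

    return correct_heads / len(gold), correct_labeled / len(gold)


# Hand-written gold analysis for "I saw a large green bird": one
# (head index, relation) pair per token, with the root pointing to itself
gold = [(1, "nsubj"), (1, "ROOT"), (5, "det"), (5, "amod"), (5, "amod"), (1, "dobj")]

# Invented parser output: "large" has the wrong head, "green" the wrong relation
predicted = [(1, "nsubj"), (1, "ROOT"), (5, "det"), (4, "amod"), (5, "compound"), (1, "dobj")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS: {uas:.2f}, LAS: {las:.2f}")
```

Because a wrong head also makes the labeled attachment wrong, LAS can never be higher than UAS for the same analysis.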