# Working with discourse-level annotations

So far, we have applied various natural language processing techniques, such as part-of-speech tagging and syntactic parsing, to plain text.

In this section, we turn towards pre-existing linguistic annotations, focusing especially on annotations that target phenomena above the level of a clause.

After reading through this section, you should:

 - understand the CoNLL-U annotation schema
 - know how to load CoNLL-U annotated corpora into spaCy
 - know how to visualise discourse structures using spaCy

## Introducing the CoNLL-U annotation schema

CoNLL-X is an annotation schema for describing linguistic features of diverse languages, which was originally developed to facilitate collaboration on so-called shared tasks (see e.g. Nissim et al. [2017](https://doi.org/10.1162/COLI_a_00304)) in research on natural language processing (Buchholz and Marsi [2006](https://www.aclweb.org/anthology/W06-2920)).

[CoNLL-U](https://universaldependencies.org/format.html) is a further development of this annotation schema for the Universal Dependencies formalism, which was introduced in [Part II](../notebooks/part_ii/03_basic_nlp.ipynb#Syntactic-parsing).

The CoNLL-U annotation schema is used for distributing linguistic corpora in projects that build on the Universal Dependencies formalism.

One can find, for example, CoNLL-U annotated corpora for ancient languages such as Akkadian (Luukko et al. [2020](https://www.aclweb.org/anthology/2020.tlt-1.11)) and Coptic (Zeldes and Abrams [2018](https://www.aclweb.org/anthology/W18-6022)).

### The components of the CoNLL-U annotation schema

CoNLL-U annotations are distributed as plain text files (see [Part II](../notebooks/part_ii/01_basic_text_processing.ipynb#Computers-and-text)).

The annotation files contain three types of lines: **comment lines**, **word lines** and **blank lines**.

**Comment lines** precede word lines and start with a hash character (#). These lines can be used to provide metadata about the sentence.

Each **word line** contains annotations for a single word or token.

These annotations are provided using the following fields, separated by tabulator characters:

```console
ID	FORM	LEMMA	UPOS	XPOS	FEATS	HEAD	DEPREL	DEPS	MISC
```

 1. `ID`: Word index
 2. `FORM`: The form of a word or punctuation symbol
 3. `LEMMA`: Lemma or the base form of a word
 4. `UPOS`: Universal part-of-speech tag
 5. `XPOS`: Language-specific part-of-speech tag
 6. `FEATS`: Morphological features
 7. `HEAD`: Syntactic head of the current word
 8. `DEPREL`: Universal dependency relation to the HEAD
 9. `DEPS`: Enhanced dependency graph in the form of a list of head-deprel pairs
 10. `MISC`: Any additional annotations
 
Finally, a **blank line** is used to separate sentences.
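To make the structure concrete, the following sketch splits a small, hand-written CoNLL-U fragment into sentences using only the Python standard library. The example sentences, their annotations and the helper name `read_conllu` are made up for illustration. This is a rough approximation of what a dedicated parser does, not a complete implementation of the format: multiword tokens and empty nodes, for instance, are not handled.

```python
# The ten field names defined by the CoNLL-U format, in order
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A hand-written example with two short sentences (not from a real corpus)
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = example-2\n"
    "# text = Stop!\n"
    "1\tStop\tstop\tVERB\tVB\tMood=Imp\t0\troot\t_\tSpaceAfter=No\n"
    "2\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
)


def read_conllu(text):
    """Split a CoNLL-U string into sentences (hypothetical helper)."""
    sentences = []
    current = {"metadata": {}, "tokens": []}

    for line in text.splitlines():
        if line.startswith("#"):
            # Comment line: store 'key = value' pairs as metadata
            key, _, value = line[1:].partition("=")
            current["metadata"][key.strip()] = value.strip()
        elif line.strip():
            # Word line: pair the ten tab-separated values with field names
            current["tokens"].append(dict(zip(FIELDS, line.split("\t"))))
        elif current["tokens"]:
            # Blank line: the current sentence is complete
            sentences.append(current)
            current = {"metadata": {}, "tokens": []}

    # Keep the last sentence if the text does not end with a blank line
    if current["tokens"]:
        sentences.append(current)

    return sentences


parsed = read_conllu(SAMPLE)
print(len(parsed), parsed[0]["metadata"]["text"], parsed[0]["tokens"][0]["upos"])
```

Each sentence ends up as a dictionary with its metadata and a list of tokens, in which every token maps the ten field names to their values.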

## Interacting with CoNLL-U annotations in Python

Let's start by importing [conllu](https://github.com/EmilStenstrom/conllu/), a small Python library for parsing CoNLL-U annotations into data structures native to Python.

In [None]:
# Import the conllu library
import conllu

We then open a plain text file with annotations in the CoNLL-U format from the [Georgetown University Multilayer Corpus](https://corpling.uis.georgetown.edu/gum/) (GUM; see Zeldes [2017](http://dx.doi.org/10.1007/s10579-016-9343-x)), read its contents using the `read()` method and store the result under the variable `annotations`.

In [None]:
# Open the plain text file for reading; assign under 'data'. The 'with'
# statement closes the file automatically once its contents have been read.
with open("data/GUM_whow_parachute.conllu", "r", encoding="utf-8") as data:

    # Read the file contents and assign under 'annotations'
    annotations = data.read()

# Check the type of the resulting object
type(annotations)

This gives us a Python string object. Let's print out the first 1000 characters of this string.

In [None]:
# Print the first 1000 characters of the string under 'annotations'
print(annotations[:1000])

As you can see, the string object contains comment lines prefixed with a hash, followed by word lines with annotations for various fields. 

An underscore `_` is used to indicate fields with empty or missing values on the word lines. 

In the GUM corpus, the final field `MISC` contains values such as `Discourse` and `Entity` that provide annotations for discourse relations and entities such as events and objects.
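The `MISC` field packs several annotations into a single string, using a vertical bar to separate key-value pairs and an equals sign between each key and its value. A minimal sketch of unpacking such a field is given below. The helper name `parse_misc` and the example value are hypothetical and only approximate what appears in the corpus.

```python
def parse_misc(field):
    """Turn a MISC string into a dictionary (hypothetical helper)."""
    # An underscore marks an empty field
    if field == "_":
        return {}

    pairs = {}

    for item in field.split("|"):
        # Split at the first equals sign only; keys without a value map to None
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else None

    return pairs


# A made-up MISC value in the style described above
misc = parse_misc("Discourse=preparation:1->11|Entity=(event-1")

print(misc["Discourse"])
print(parse_misc("_"))
```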

The question is how to extract all this information programmatically from a *string* object.

This is where the `conllu` module comes in handy, because its `parse()` function is capable of extracting information from CoNLL-U formatted strings.

In [None]:
# Use the parse() function to parse the annotations; store under 'sentences'
sentences = conllu.parse(annotations)

The `parse()` function returns a Python list populated by *TokenList* objects. This object type is native to the conllu library.

Let's examine the first item in the list `sentences`.

In [None]:
sentences[0]

This gives us a *TokenList* object.

To start with, the information given on the comment lines of the CoNLL-U file is stored under the `metadata` attribute.

In [None]:
# Get the metadata for the first item in the list
sentences[0].metadata

This shows that the GUM corpus contains four types of metadata for each sentence: `newdoc_id` for document identifier, `sent_id` for sentence identifier, `text` for plain text and `s_type` for sentence type or mood (Zeldes and Simonson [2016](https://www.aclweb.org/anthology/W16-1709): 69).

Superficially, the object stored under the `metadata` attribute looks like a Python dictionary, but is actually a conllu *Metadata* object.

This object, however, behaves just like a Python dictionary in the sense that it consists of key and value pairs, which are accessed just like those in a Python dictionary.

For example, to retrieve the sentence type (or mood), use the key `s_type` to access the *Metadata* object.

In [None]:
# Get the sentence type under 's_type'
sentences[0].metadata['s_type']

This returns the string `inf`, which corresponds to infinitive.
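Because the metadata can be read like a dictionary, tallying sentence types across a corpus takes only a few lines. The sketch below uses plain dictionaries with invented values as stand-ins for the *Metadata* objects, so that it runs without the corpus. With real data, one would loop over `sentences` and read the `metadata` attribute of each item instead.

```python
from collections import Counter

# Plain dictionaries standing in for Metadata objects; the values are invented
metadata_items = [
    {"sent_id": "doc-1", "s_type": "inf"},
    {"sent_id": "doc-2", "s_type": "imp"},
    {"sent_id": "doc-3", "s_type": "decl"},
    {"sent_id": "doc-4", "s_type": "imp"},
]

# Count how often each sentence type occurs; use .get() in case a key is missing
s_types = Counter(item.get("s_type", "unknown") for item in metadata_items)

# Print the sentence types, starting from the most frequent
print(s_types.most_common())
```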

Coming back to the *TokenList* object, as the name suggests, the items in a *TokenList* consist of individual *Token* objects.

Let's access the first *Token* object `[0]` in the first *TokenList* object `[0]` under `sentences`. 

In [None]:
# Get the first token in the first sentence
sentences[0][0]

Just like the *Metadata* object above, the *Token* object is a dictionary-like object with keys and values.

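As you can see, the dictionary under the key `misc` holds information about discourse relations, which explicate how the pieces of text relate to each other.

To collect this information, we can loop over every *Token* in every sentence, keep track of each *Token*'s position in the document, and store the position and relation of every *Token* that carries a `Discourse` annotation.

In [None]:
# Set up a counter for tracking the position of each token in the document
counter = 0

# Set up lists for the positions of annotated tokens and their relations
discourse_units = []
relations = []

# Loop over each sentence in the list 'sentences'
for sentence in sentences:

    # Loop over each token in the sentence
    for token in sentence:

        # Update the counter; the first token is at position 1
        counter += 1

        # Check if the 'misc' field contains a discourse annotation
        if token['misc'] is not None and 'Discourse' in token['misc']:

            # Store the position of the token that carries the annotation
            discourse_units.append(counter)

            # Split the annotation at the first colon into the relation name
            # and the identifiers of the discourse units; store both
            relation, edus = token['misc']['Discourse'].split(':', 1)
            relations.append((relation, edus))

# Add the position of the final token to mark the end of the last unit
discourse_units.append(counter)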
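The value stored under `Discourse` can be broken down further. The following sketch assumes values of the form `relation:source->target`, with the root unit written as `ROOT:source`. This format is an assumption based on the split at the colon used above, and the helper name `parse_discourse` and the example values are hypothetical.

```python
def parse_discourse(value):
    """Split a Discourse value into relation, source and target (hypothetical helper)."""
    # Separate the relation name from the unit identifiers at the first colon
    relation, _, units = value.partition(":")

    # Separate the source unit from the target unit, if a target is given
    source, arrow, target = units.partition("->")

    # The root unit has no target, which is represented here as None
    return relation, int(source), int(target) if arrow else None


print(parse_discourse("preparation:1->11"))
print(parse_discourse("ROOT:11"))
```

Returning the identifiers as integers makes it straightforward to look up which unit a relation points to, provided that the identifiers in the corpus are indeed plain numbers.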
                        

## Converting CoNLL-U annotations into spaCy *Docs*

spaCy provides a convenience function, `conllu_to_docs()`, which allows converting CoNLL-U annotated data into spaCy *Doc* objects.

Let's start by importing the function from the `training.converters` submodule. We also import the class for the *Doc* object.

In [None]:
# Import the 'conllu_to_docs' function and the Doc class
from spacy.training.converters import conllu_to_docs
from spacy.tokens import Doc

The `conllu_to_docs()` function takes a Python string object as input.

We pass the string object `annotations` that contains CoNLL-U annotations to the function, and set the argument `no_print` to `True` to prevent the `conllu_to_docs()` function from printing status messages.

The function returns a Python generator object, which means we must cast the output into a list to examine its contents.

In [None]:
# Provide the string object under 'annotations' to the 'conllu_to_docs' function. 
# Set 'no_print' to True and cast the result into a Python list; store under 'docs'.
docs = list(conllu_to_docs(annotations, no_print=True))

# Get the first two items in the resulting list
docs[:2]

This gives us a list with *Doc* objects. 

By default, the `conllu_to_docs()` function groups every ten sentences in the CoNLL-U file into a single spaCy *Doc* object. The size of these groups is controlled by the argument `n_sents`.
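The grouping behaviour, and the reason for casting the result into a list, can be illustrated without spaCy. The generator below is a simplified stand-in written for this example, not spaCy's actual implementation. The function name `group_sentences` and the parameter `size` are hypothetical.

```python
def group_sentences(sentences, size=10):
    """Yield successive groups of at most 'size' sentences (hypothetical helper)."""
    for start in range(0, len(sentences), size):
        yield sentences[start:start + size]


# Twenty-five placeholder sentences
placeholder_sentences = [f"sentence {i}" for i in range(25)]

# Calling the function returns a generator, which produces nothing until consumed
groups = group_sentences(placeholder_sentences)

# Casting the generator into a list makes the groups available for inspection
groups = list(groups)

# Print the number of sentences in each group
print([len(group) for group in groups])
```

The last group is shorter than the others, because the number of sentences is not a multiple of the group size.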

This, however, is not an optimal solution: it would make more sense to store each document in its own *Doc* object than to split it into arbitrary groups of sentences.

To fix this, we can use the `from_docs()` method of the *Doc* object to combine the *Doc* objects in the list `docs`. 

In [None]:
# Combine Doc objects in the list 'docs' into a single Doc; assign under 'doc'
doc = Doc.from_docs(docs)

# Check variable type and length
type(doc), len(doc)

This gives us a single spaCy *Doc* object with 890 *Tokens*.

If we loop over the first eight *Tokens* in the *Doc* object `doc` and print out their linguistic annotations, the result shows that the information from the CoNLL-U annotations has been carried over to the *Doc* object.

In [None]:
# Loop over the first 8 Tokens using the range() function
for token_ix in range(0, 8):
    
    # Use the current number under 'token_ix' to fetch a Token from the Doc.
    # Assign the Token object under the variable 'token'. 
    token = doc[token_ix]
    
    # Print the Token and its linguistic annotations
    print(token, token.tag_, token.pos_, token.morph, token.dep_, token.head)

TODO: Make a note about the missing properties, e.g. (`noun_chunks`).