# Working with discourse-level annotations

In the previous sections, we applied various natural language processing techniques to plain text in order to create linguistic annotations for it.

In this section, we turn towards using pre-existing linguistic annotations, focusing especially on annotations that target phenomena above the level of a clause.

After reading through this section, you should:

 - know the basics of the CoNLL-U annotation schema
 - know how to create a spaCy *Doc* object manually
 - know how to annotate *Spans* in a *Doc* object using *SpanGroups*
 - know how to load CoNLL-U annotated corpora into spaCy

## Introducing the CoNLL-U annotation schema

CoNLL-X is an annotation schema for describing linguistic features of diverse languages (Buchholz and Marsi [2006](https://www.aclweb.org/anthology/W06-2920)), which was originally developed to facilitate collaboration on so-called shared tasks in research on natural language processing (see e.g. Nissim et al. [2017](https://doi.org/10.1162/COLI_a_00304)).

[CoNLL-U](https://universaldependencies.org/format.html) is a further development of this annotation schema for the Universal Dependencies formalism, which was introduced in [Part II](../notebooks/part_ii/03_basic_nlp.ipynb#Syntactic-parsing). This annotation schema is commonly used for distributing linguistic corpora in projects that build on the Universal Dependencies formalism.

In addition to numerous modern languages, one can find, for example, CoNLL-U annotated corpora for ancient languages such as Akkadian (Luukko et al. [2020](https://www.aclweb.org/anthology/2020.tlt-1.11)) and Coptic (Zeldes and Abrams [2018](https://www.aclweb.org/anthology/W18-6022)).

### The components of the CoNLL-U annotation schema

CoNLL-U annotations are distributed as plain text files (see [Part II](../part_ii/01_basic_text_processing.ipynb#Computers-and-text)).

The annotation files contain three types of lines: **comment lines**, **word lines** and **blank lines**.

**Comment lines** precede word lines and start with a hash character (#). These lines can be used to provide metadata about the sentence.

Each **word line** contains annotations for a single word or token in a sentence.

These annotations are provided using the following fields, separated by tabulator characters:

```console
ID	FORM	LEMMA	UPOS	XPOS	FEATS	HEAD	DEPREL	DEPS	MISC
```

 1. `ID`: Word index
 2. `FORM`: The form of a word or punctuation symbol
 3. `LEMMA`: Lemma or the base form of a word
 4. `UPOS`: Universal part-of-speech tag
 5. `XPOS`: Language-specific part-of-speech tag
 6. `FEATS`: Morphological features
 7. `HEAD`: Syntactic head of the current word
 8. `DEPREL`: Universal dependency relation to the HEAD
 9. `DEPS`: [Enhanced dependency relations](https://universaldependencies.org/u/overview/enhanced-syntax.html)
 10. `MISC`: Any additional annotations
 
Finally, a **blank line** is used to separate sentences.
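To make the three line types concrete, here is a minimal sketch that reads one sentence using nothing but Python's string methods. The sentence is made up for illustration, and the names `FIELDS`, `sample` and `read_sentence()` are hypothetical: they do not come from any library.

```python
# Names of the ten CoNLL-U fields, in the order they appear on a word line
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A made-up sentence for illustration (not taken from any corpus)
sample = (
    "# sent_id = example-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)


def read_sentence(block):
    """Split one sentence block into comment strings and word-line dictionaries."""
    comments, word_lines = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            # Comment line: drop the hash and the surrounding whitespace
            comments.append(line[1:].strip())
        elif line.strip():
            # Word line: pair each tab-separated value with its field name
            word_lines.append(dict(zip(FIELDS, line.split("\t"))))
    return comments, word_lines


comments, word_lines = read_sentence(sample)
print(comments)
print([(w["id"], w["form"], w["upos"], w["deprel"]) for w in word_lines])
```

In practice, a dedicated parser is a safer choice than splitting lines by hand, because it also deals with details such as multiword tokens and empty fields.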

### Interacting with CoNLL-U annotations in Python

To explore CoNLL-U annotations using Python, let's start by importing [conllu](https://github.com/EmilStenstrom/conllu/), a small library for parsing CoNLL-U annotations into data structures native to Python.

In [None]:
# Import the conllu library
import conllu

We then open a plain text file with annotations in the CoNLL-U format from the [Georgetown Multilayer Corpus](https://corpling.uis.georgetown.edu/gum/) (GUM; see Zeldes [2017](http://dx.doi.org/10.1007/s10579-016-9343-x)), read its contents using the `read()` method and store the result under the variable `annotations`.

In [None]:
# Open the plain text file for reading; the 'with' statement closes
# the file automatically once its contents have been read
with open("data/GUM_whow_parachute.conllu", "r", encoding="utf-8") as data:

    # Read the file contents and assign under 'annotations'
    annotations = data.read()

# Check the type of the resulting object
type(annotations)

This gives us a Python string object. Let's print out the first 1000 characters of this string.

In [None]:
# Print the first 1000 characters of the string under 'annotations'
print(annotations[:1000])

As you can see, the string object contains comment lines prefixed with a hash, followed by word lines with annotations for various fields. 

An underscore `_` is used to indicate fields with empty or missing values on the word lines. 

In the GUM corpus, the final field `MISC` contains values such as `Discourse` and `Entity` that provide annotations for discourse relations and entities such as events and objects.
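To see how a single `MISC` field is organised, consider the following sketch. It assumes that the annotations in the field are separated by vertical bars and written as `Key=Value` pairs. The helper `parse_misc()` is hypothetical, and the sample value is illustrative rather than copied from the corpus.

```python
def parse_misc(field):
    """Turn a raw MISC field into a dictionary; an underscore means no annotations."""
    if field == "_":
        return None
    pairs = {}
    for item in field.split("|"):
        # Split each annotation at the first equals sign
        key, _, value = item.partition("=")
        # Keys without a value are stored with None
        pairs[key] = value or None
    return pairs


print(parse_misc("Discourse=preparation:1->11|SpaceAfter=No"))
print(parse_misc("_"))
```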

Here the question is: how to extract all this information programmatically from a *string* object?

This is where the `conllu` module comes in handy, because its `parse()` function is capable of extracting information from CoNLL-U formatted strings.

In [None]:
# Use the parse() function to parse the annotations; store under 'sentences'
sentences = conllu.parse(annotations)

The `parse()` function returns a Python list populated by *TokenList* objects. This object type is native to the conllu library.

Let's examine the first item in the list `sentences`.

In [None]:
sentences[0]

This gives us a *TokenList* object.

To start with, the information contained in the comment lines in the CoNLL-U schema is provided under the `metadata` attribute.

In [None]:
# Get the metadata for the first item in the list
sentences[0].metadata

This shows that the GUM corpus uses the comment lines to provide four types of metadata for each sentence: `newdoc_id` for document identifier, `sent_id` for sentence identifier, `text` for plain text and `s_type` for sentence type or mood (Zeldes & Simonson [2016](https://www.aclweb.org/anthology/W16-1709): 69).

Superficially, the object stored under the `metadata` attribute looks like a Python dictionary, but the object is actually a conllu *Metadata* object.

This object, however, behaves just like a Python dictionary in the sense that it consists of key and value pairs, which are accessed just like those in a dictionary.

For example, to retrieve the sentence type (or mood), use the key `s_type` to access the *Metadata* object.

In [None]:
# Get the sentence type under 's_type'
sentences[0].metadata['s_type']

This returns the string `inf`, which corresponds to infinitive.

Coming back to the *TokenList* object, as the name suggests, the items in a *TokenList* consist of individual *Token* objects.

Let's access the first *Token* object `[0]` in the first *TokenList* object `[0]` under `sentences`. 

In [None]:
# Get the first token in the first sentence
sentences[0][0]

Just like the *Metadata* object above, the *Token* object is a dictionary-like object with keys and values.

The dictionary under the key `misc` holds information about discourse relations, which describe how parts of a text relate to each other using [Rhetorical Structure Theory](https://www.sfu.ca/rst) (Mann & Thompson [1988](https://doi.org/10.1515/text.1.1988.8.3.243)).

In this case, the annotation states that a relation named **preparation** holds between units 1 and 11.

These units correspond to *elementary discourse units* instead of words or sentences in the document. 

As such, they define an additional level of *segmentation*, which seeks to capture units of discourse that are placed in various relations to one another.
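The relation annotation described above can be unpacked with ordinary string methods. The sketch below assumes the layout `relation:source->target` seen in this file. The helper `parse_relation()` is hypothetical, and the value `ROOT:2` is an assumed example of a unit without a target.

```python
def parse_relation(value):
    """Split a value such as 'preparation:1->11' into name, source and target."""
    # The relation name precedes the colon; the unit identifiers follow it
    name, _, units = value.partition(":")
    # The source and target units are separated by an arrow
    source, arrow, target = units.partition("->")
    # Without an arrow there is no target, so None is returned in its place
    return name, int(source), int(target) if arrow else None


print(parse_relation("preparation:1->11"))
print(parse_relation("ROOT:2"))
```

Note that the identifiers are returned as they appear in the annotation, that is, counting from one.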

## Adding discourse-level annotations to *Doc* objects

In [None]:
# Set up a variable with value 0 that we will use for counting
# the Tokens that we process
counter = 0

# Set up placeholder lists for the information that we will extract
# from the CoNLL-U annotations. These lists will be used to create 
# a spaCy Doc object below.
words = []
spaces = []
sent_starts = []

# We use these lists to keep track of sentences, discourse units
# and the relations that hold between them.
discourse_units = []
sent_types = []
relations = []

We use the value stored under the variable `counter` to keep track of the boundaries for sentences and elementary discourse units as we loop over the *TokenList* objects stored in the list `sentences`. 

In [None]:
# Loop over each TokenList object
for sentence in sentences:
    
    # When we begin looping over a new sentence, set the value of
    # the variable 'is_start' to True.
    is_start = True
    
    # Add the sentence type to the list 'sent_types'
    sent_types.append(sentence.metadata['s_type'])
        
    # Proceed to loop over the Tokens in the TokenList object
    for token in sentence:
        
        # Use the key 'form' to retrieve the annotations for the 
        # Token and append it to the placeholder list.
        words.append(token['form'])
        
        # Check if this Token begins a sentence by evaluating whether
        # the variable 'is_start' is True.
        if is_start:
            
            # If the Token starts a sentence, add value True to the list
            # 'sent_starts'.
            sent_starts.append(True)
            
            # Set the variable 'is_start' to False until the next sentence
            # starts and the variable is set to True again.
            is_start = False
        
        # If the variable 'is_start' is False
        else:
            
            # Append value 'False' to the list 'sent_starts'
            sent_starts.append(False)
        
        # Check if the key 'misc' contains anything, and if the dictionary
        # under 'misc' has the key 'Discourse', proceed to the code block below
        if token['misc'] is not None and 'Discourse' in token['misc']:
            
            # The presence of the key 'Discourse' indicates the beginning
            # of a new elementary discourse unit; add its index to the list
            # 'discourse_units'.
            discourse_units.append(counter)
            
            # Unpack the relationship definition; start by splitting the
            # relation name from the elementary discourse units. Assign
            # the resulting objects under 'relation' and 'edus'.
            relation, edus = token['misc']['Discourse'].split(':')
            
            # Try to split the relation annotation into two parts
            try:
                
                # Split at the '->' string and assign to 'source'
                # and 'target', respectively.
                source, target = edus.split('->')
                
                # Subtract 1 from both 'source' and 'target', because
                # the identifiers used in the GUM corpus are not
                # zero-indexed, but spaCy spans that correspond to
                # elementary discourse units will be. Also cast the
                # numbers into integers.
                source, target = int(source) - 1, int(target) - 1
            
            # The root node of the RST tree will not have a target,
            # which raises a ValueError since there is only one item.
            except ValueError:
                
                # With no '->' to split at, 'edus' holds only the identifier
                # of the source unit. Cast the whole string into an integer
                # and subtract 1 as explained above; set 'target' to None.
                source, target = int(edus) - 1, None
                
            # Compile the relation definition into a three tuple and
            # append to the list 'relations'.
            relations.append((relation, source, target))
            
        # Check if the current Token is followed by a whitespace. If this is
        # not the case, e.g. for a Token that precedes a punctuation mark,
        # the dictionary under the 'misc' key holds the key 'SpaceAfter'
        # with the value 'No'.
        if token['misc'] is not None and token['misc'].get('SpaceAfter') == 'No':

            # Append the Boolean value 'False' to the list 'spaces'.
            # Note the missing quotation marks: this is a Boolean value
            # (True / False)
            spaces.append(False)

        # In all other cases, the Token is followed by a space.
        else:

            # Append True to the list of spaces
            spaces.append(True)
        
        # Update the counter as we finish looping over a Token object
        counter += 1

This collects the information needed for creating a spaCy *Doc* object, together with the discourse-level annotations that we add to the *Doc* object afterwards.
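The lists `words` and `spaces` must stay aligned item by item, because each Boolean value states whether the word at the same index is followed by a space. The following library-free sketch illustrates the idea with made-up sample data. The helper `rebuild_text()` is hypothetical and only mimics how the two lists combine into running text.

```python
def rebuild_text(words, spaces):
    """Join words, adding a space after each word flagged True in 'spaces'."""
    # Misaligned lists would shift every later annotation, so fail early
    if len(words) != len(spaces):
        raise ValueError("'words' and 'spaces' must have the same length")
    return "".join(word + (" " if space else "")
                   for word, space in zip(words, spaces))


# Made-up sample data: only 'plane' and the full stop lack a following space
sample_words = ["Jump", "out", "of", "the", "plane", "."]
sample_spaces = [True, True, True, True, False, False]

print(rebuild_text(sample_words, sample_spaces))
```

A check like this is a quick way to confirm that the collected lists reproduce the text given in the sentence metadata.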

Typically, we would create a spaCy *Doc* object by passing some text to a *Language* object, as shown in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).

In this case, however, we need to preserve the tokens defined in the CoNLL-U annotations, because this information is needed to align the discourse-level annotations correctly for both sentences and elementary discourse units.

In other words, we cannot take the risk that spaCy tokenises the text differently, because this would result in misaligned annotations for sentences and elementary discourse units.

Hence we create a spaCy *Doc* object manually by importing the *Doc* class. We also load a small language model for English and store it under the variable `nlp`.

In [None]:
# Import the Doc class and the spaCy library
from spacy.tokens import Doc
import spacy

# Load a small language model for English; store under 'nlp'
nlp = spacy.load('en_core_web_sm')

Now we can use the *Doc* class to create a *Doc* object manually by providing the information in the lists `words`, `spaces` and `sent_starts` that we just created as input.

In addition, we must pass a *Vocabulary* object to the `vocab` argument to associate the *Doc* with a given language.

In [None]:
# Create a spaCy Doc object "manually"; assign under the variable 'doc'
doc = Doc(vocab=nlp.vocab, 
          words=words, 
          spaces=spaces,
          sent_starts=sent_starts
          )

This gives us a spaCy *Doc* object with *Tokens* and sentence boundaries.

In [None]:
# Retrieve Tokens up to index 8 from the Doc object
doc[:8]

As you can see, spaCy has successfully assigned the input data to *Token* objects.

The sentence boundaries, in turn, are used to define the sentences under the attribute `sents`.

In [None]:
# Retrieve the first five sentences in the Doc object
list(doc.sents)[:5]
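To illustrate how a list of Boolean values can define sentence boundaries, the sketch below groups words into sentences in plain Python. The sample data is made up and the helper `group_sentences()` is hypothetical; it only mirrors the logic of the `sent_starts` list and is not how spaCy implements sentence segmentation internally.

```python
def group_sentences(words, sent_starts):
    """Group words into sentences; a True flag opens a new sentence."""
    sentences = []
    for word, starts_sentence in zip(words, sent_starts):
        # Open a new sentence at each True flag (and for the very first word)
        if starts_sentence or not sentences:
            sentences.append([])
        # Add the word to the sentence that is currently open
        sentences[-1].append(word)
    return sentences


# Made-up sample data with two sentences
sample_words = ["Check", "the", "gear", ".", "Then", "jump", "."]
sample_starts = [True, False, False, False, True, False, False]

print(group_sentences(sample_words, sample_starts))
```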

However, because we discarded the linguistic information contained in the CoNLL-U annotations, the attributes of the *Token* objects in our *Doc* object are empty.

Let's fetch the fine-grained part-of-speech tag for the first *Token* in the *Doc* object.

In [None]:
# Get the fine-grained part-of-speech tag for Token at index 0
doc[0].tag_

As this shows, the `tag_` attribute is empty.

We can create these annotations afterwards by passing the *Doc* object stored under `doc` to various components of the *Language* object.

These components are accessible under the attribute `pipeline`, as we learned in [Part II](../part_ii/04_basic_nlp_continued.ipynb#Modifying-spaCy-pipelines).

Let's loop over the components of the `pipeline` and apply them to the *Doc* object under `doc`.

In [None]:
# Loop over the name / component pairs under the 'pipeline' attribute
# of the Language object 'nlp'.
for name, component in nlp.pipeline:
    
    # Use a formatted string to print out the 'name' of the component
    print(f"Now applying component {name} ...")
    
    # Feed the existing Doc object to the component and store the updated
    # annotations under the variable of the same name ('doc').
    doc = component(doc)

If we now examine the attribute `tag_` of the first *Token* object, the fine-grained part-of-speech tag has been added to the *Token*.

In [None]:
# Get the fine-grained part-of-speech tag for Token at index 0
doc[0].tag_

Furthermore, we have access to additional linguistic annotations produced by spaCy, such as noun phrases under the attribute `noun_chunks`.

In [None]:
# Get the first five noun phrases in the Doc object
list(doc.noun_chunks)[:5]

Our manually-defined sentence boundaries, however, remain the same!

In [None]:
# Get the first five sentences in the Doc object
list(doc.sents)[:5]

### Adding information on sentence mood

In [None]:
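# Import the SpanGroup and Span classes
from spacy.tokens.span_group import SpanGroup
from spacy.tokens import Span

In [None]:
# Register a custom attribute 'mood' for Span objects; the default value is None
Span.set_extension('mood', default=None)

In [None]:
# Create a SpanGroup named 'sentences' from the sentences in the Doc object
sent_group = SpanGroup(doc=doc, name="sentences", spans=list(doc.sents))

In [None]:
# Assign the SpanGroup under the key 'sentences' in the 'spans' attribute
doc.spans['sentences'] = sent_group

In [None]:
# Loop over pairs of sentence types and Spans in the SpanGroup
for mood, span in zip(sent_types, doc.spans['sentences']):
    
    # Store the sentence type under the custom attribute 'mood'
    span._.mood = mood

In [None]:
# Get the sentence type for the Span at index 4
doc.spans['sentences'][4]._.mood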
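Once each sentence carries its type, the annotations can be summarised or used for filtering. The following sketch shows the pattern in plain Python with made-up sentences; the labels `inf`, `imp` and `decl` are assumed sample values. With the *Doc* object, the same logic would operate on the `mood` attribute of the *Spans* stored under `doc.spans['sentences']`.

```python
from collections import Counter

# Made-up sentence types and sentences for illustration (not taken from GUM)
sample_types = ["inf", "imp", "imp", "decl"]
sample_sentences = ["How to jump", "Check the gear.",
                    "Board the plane.", "The door opens."]

# Count how often each sentence type occurs
type_counts = Counter(sample_types)

# Pair each sentence with its type and keep the imperative ones
imperatives = [sentence for mood, sentence
               in zip(sample_types, sample_sentences) if mood == "imp"]

print(type_counts)
print(imperatives)
```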

### Adding information on discourse relations

In [None]:
# Create a placeholder list to hold slices of the Doc object that correspond
# to discourse units.
edu_spans = []

# Proceed to loop over discourse unit boundaries using Python's range() function.
# This will give us numbers, which we use to index the 'discourse_units' list that
# contains the indices that mark the beginning of a discourse unit.
for i in range(len(discourse_units)):
    
    # Try to execute the following code block
    try:
        
        # Get the current item in the list 'discourse_units' and the next item; assign
        # under variables 'start' and 'end'.
        start, end = discourse_units[i], discourse_units[i + 1]
    
    # If the next item is not available, because we've reached the final item in the list,
    # this will raise an IndexError that we catch here.
    except IndexError:
        
        # Assign the start of the discourse unit as usual, set the length of the Doc 
        # object as the value for 'end' to mark the end point of the discourse unit. 
        start, end = discourse_units[i], len(doc)

    # Use the 'start' and 'end' variables to slice the Doc object; append the
    # resulting Span object to the list 'edu_spans'.
    edu_spans.append(doc[start:end])

In [None]:
# Get the first five Spans in the list 'edu_spans'
edu_spans[:5]
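The pairing of start and end indices can also be written without a `try` / `except` block. The sketch below is an alternative formulation using made-up indices. The helper `edu_boundaries()` is hypothetical: it pairs each start index with the next one and closes the final unit at the total number of Tokens.

```python
def edu_boundaries(starts, total):
    """Pair each start index with the next one; the last unit ends at 'total'."""
    # Shift the list of start indices by one and append the overall length
    ends = starts[1:] + [total]
    return list(zip(starts, ends))


# Three made-up discourse units in a document of ten Tokens
print(edu_boundaries([0, 3, 7], 10))
```

Each resulting pair could then be used to slice a *Doc* object in the same way as the variables `start` and `end` above.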

In [None]:
# Register three custom attributes for Span objects, which correspond to
# elementary discourse unit id, the id of the element acting as the target,
# and the name of the relation.
Span.set_extension('edu_id', default=None)
Span.set_extension('target_id', default=None)
Span.set_extension('relation', default=None)

In [None]:
# Create a SpanGroup object from the Spans in the 'edu_spans' list
edu_group = SpanGroup(doc=doc, name="edus", spans=edu_spans)

# Assign the SpanGroup under the key 'edus'
doc.spans['edus'] = edu_group

In [None]:
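# Loop over the three-tuples in the list 'relations', unpacking each into
# the relation name and the indices of the source and target units
for rel_name, source, target in relations:
    
    # Use the index under 'source' to fetch the Span for the discourse unit
    # and store the annotations under its custom attributes
    doc.spans['edus'][source]._.edu_id = source
    doc.spans['edus'][source]._.target_id = target
    doc.spans['edus'][source]._.relation = rel_name

In [None]:
# Get the name of the relation for the discourse unit at index 5
doc.spans['edus'][5]._.relation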
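With source and target identifiers in place, the relations can be queried, for instance to find the root unit or the units attached to a given target. The following sketch uses made-up three-tuples of the form (relation, source, target); the names `sample_relations`, `by_source` and `attached_to()` are hypothetical.

```python
# Made-up relations in the form (relation name, source index, target index)
sample_relations = [("preparation", 0, 10),
                    ("ROOT", 1, None),
                    ("elaboration", 2, 1)]

# Map each source unit to its relation name and target unit
by_source = {source: (name, target)
             for name, source, target in sample_relations}

# Units without a target act as roots
roots = [source for source, (_, target) in by_source.items()
         if target is None]


def attached_to(target_id):
    """Return the indices of the units whose target is 'target_id'."""
    return sorted(source for source, (_, target) in by_source.items()
                  if target == target_id)


print(roots)
print(attached_to(1))
```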

## Converting CoNLL-U annotations into *Doc* objects

If you do not need to enrich spaCy objects with additional information, but simply wish to convert CoNLL-U annotations into *Doc* objects, spaCy provides a convenience function named `conllu_to_docs()` for this purpose.

Let's start by importing the function from the `training` submodule, as this function is mainly used for loading CoNLL-U annotated data for training language models. We also import the class for the *Doc* object.

In [None]:
# Import the 'conllu_to_docs' function and the Doc class
from spacy.training.converters import conllu_to_docs
from spacy.tokens import Doc

The `conllu_to_docs()` function takes a Python string object as input.

We pass the string object `annotations` that contains CoNLL-U annotations to the function, and set the argument `no_print` to `True` to prevent the `conllu_to_docs()` function from printing status messages.

The function returns a Python generator object, which we must cast into a list to examine its contents.

In [None]:
# Provide the string object under 'annotations' to the 'conllu_to_docs' function. 
# Set 'no_print' to True and cast the result into a Python list; store under 'docs'.
docs = list(conllu_to_docs(annotations, no_print=True))

# Get the first two items in the resulting list
docs[:2]

This gives us a list with *Doc* objects. By default, the `conllu_to_docs()` function groups every ten sentences in the CoNLL-U file into a single *Doc* object.

This arbitrary grouping is not optimal, however: it would make more sense to store each document in a single *Doc* object.

To do so, we can use the `from_docs()` method of the *Doc* object to combine the *Doc* objects in the list `docs`. 

In [None]:
# Combine Doc objects in the list 'docs' into a single Doc; assign under 'doc'
doc = Doc.from_docs(docs)

# Check variable type and length
type(doc), len(doc)

This gives us a single spaCy *Doc* object with 890 *Tokens*.
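Combining all groups suits a file that holds one document. If a file held several documents, one option would be to split the raw string before converting it. The sketch below assumes that every document begins with a comment line starting with `# newdoc`; the helper `split_documents()` and the sample string are hypothetical.

```python
def split_documents(conllu_text):
    """Split a CoNLL-U string into one string per document."""
    documents, current = [], []
    for line in conllu_text.splitlines(keepends=True):
        # A '# newdoc' comment closes the previous document, if there is one
        if line.startswith("# newdoc") and current:
            documents.append("".join(current))
            current = []
        current.append(line)
    # Store the final document
    if current:
        documents.append("".join(current))
    return documents


# Made-up sample with two one-sentence documents
sample = (
    "# newdoc id = doc_a\n"
    "# sent_id = doc_a-1\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "\n"
    "# newdoc id = doc_b\n"
    "# sent_id = doc_b-1\n"
    "1\tBye\tbye\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "\n"
)

parts = split_documents(sample)
print(len(parts))
```

Each of the resulting strings could then be converted and combined on its own, which would yield one *Doc* object per document.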

If we loop over the first eight *Tokens* in the *Doc* object `doc` and print out their linguistic annotations, the result shows that the information from the CoNLL-U annotations has been carried over to the *Doc* object.

In [None]:
# Loop over the first 8 Tokens using the range() function
for token_ix in range(0, 8):
    
    # Use the current number under 'token_ix' to fetch a Token from the Doc.
    # Assign the Token object under the variable 'token'. 
    token = doc[token_ix]
    
    # Print the Token and its linguistic annotations
    print(token, token.tag_, token.pos_, token.morph, token.dep_, token.head)

However, if we attempt to retrieve the noun phrases in the *Doc* object, which are available under the attribute `noun_chunks`, spaCy will raise an error.

In [None]:
list(doc.noun_chunks)

This raises an error, because the *Doc* that we created using the `conllu_to_docs()` function does not have a *Language* and a *Vocabulary* associated with it.

The noun phrases are created using language-specific rules from syntactic parses, but spaCy does not know which language it is working with.

Because the language of a *Doc* cannot be defined manually, we must use a trick involving the *DocBin* object that we learned about in [Part II](../part_ii/04_basic_nlp_continued.ipynb#Writing-processed-texts-to-disk).

The *DocBin* is a special object type for writing spaCy annotations to disk.

In [None]:
# Import the DocBin object from the 'tokens' submodule
from spacy.tokens import DocBin

# Create an empty DocBin object
doc_bin = DocBin()

# Add the current Doc to the DocBin
doc_bin.add(doc)

Instead of writing the *DocBin* object to disk, we simply retrieve the *Doc* objects from the *DocBin* using the `get_docs()` method, which requires a *Vocabulary* object as input to the `vocab` argument.

The *Vocabulary* is used to associate the *Doc* objects with a given language.

The `get_docs()` method returns a generator, which we must cast into a list.

In [None]:
# Use the 'get_docs' method to retrieve the Docs from the DocBin
docs = list(doc_bin.get_docs(vocab=nlp.vocab))

If we now examine the *Doc* object, which is naturally the first and only item in the list, and retrieve its attribute `noun_chunks`, we can also get the noun phrases.

In [None]:
list(docs[0].noun_chunks)[:10]