# Computational linguistics

We are going to use [spaCy](https://spacy.io), a popular library for NLP.

In [2]:
# !python -m spacy download en_core_web_md
import spacy
nlp = spacy.load('en_core_web_md') # load a medium sized model for English



## Relationship extraction

Understanding a text requires understanding both the individual words and the relationships between them. We have already discussed the meaning of individual words (embeddings and the similarity measures built on them, as well as POS tagging, which reveals the part of speech of a given word), but we have said little about the relationships between words.

The relations between the words in a sentence are governed by grammar, which lets us understand how the ideas mentioned in the sentence relate to each other. Research in natural language processing has proposed dependency trees (also called dependency parse trees), which represent the grammatical dependencies between words as a tree. The root of this tree is most often the main verb of the sentence. Edges between nodes in a dependency tree are labeled with relation names.

Dependency trees for arbitrary sentences can be generated and visualized at: https://explosion.ai/demos/displacy

The labels on the edges of the tree are described at: https://nlp.stanford.edu/software/dependencies_manual.pdf in Chapter 2.
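
To make the structure concrete, a dependency tree can be stored as one row per word: the word, the label of the edge to its parent, and the index of the parent. The sketch below is plain Python and does not call spaCy; the rows are copied by hand from the parse of the sample sentence used in this section, and the names `ROWS`, `find_root`, and `labeled_edges` are illustrative only.

```python
# Each row: (index, word, dependency label, index of the head word).
# As in spaCy, the root is treated as its own head.
ROWS = [
    (0, "The", "det", 3),
    (1, "quick", "amod", 3),
    (2, "brown", "amod", 3),
    (3, "fox", "nsubj", 4),
    (4, "jumps", "ROOT", 4),
    (5, "over", "prep", 4),
    (6, "the", "det", 8),
    (7, "lazy", "amod", 8),
    (8, "dog", "pobj", 5),
    (9, ".", "punct", 4),
]


def find_root(rows):
    """Return the single row labeled ROOT."""
    roots = [row for row in rows if row[2] == "ROOT"]
    if len(roots) != 1:
        raise ValueError("expected exactly one ROOT per sentence")
    return roots[0]


def labeled_edges(rows):
    """Return (head word, label, child word) for every non-root row."""
    words = {index: word for index, word, _, _ in rows}
    return [
        (words[head], dep, word)
        for index, word, dep, head in rows
        if head != index
    ]


print("root:", find_root(ROWS)[1])
for head, dep, child in labeled_edges(ROWS):
    print(f"{head} --{dep}--> {child}")
```

Every word except the root contributes exactly one labeled edge, which is why the structure is a tree.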

We can visualize the dependency tree (no labels on node connections) using spaCy and NLTK. Run the following code to observe the result:

In [3]:
from nltk import Tree # Helper object to visualize the tree

doc = nlp("The quick brown fox jumps over the lazy dog. Mary met Mike.") # Sample sentences to process

def to_nltk_tree(node): # build the tree recursively
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.text, [to_nltk_tree(child) for child in node.children])
    else:
        return node.text

for sent in doc.sents:
    print(sent)
    print("-----------------------------------")
    to_nltk_tree(sent.root).pretty_print() # create a tree and provide a beautiful visualization ;)
    print("\n\n\n")

The quick brown fox jumps over the lazy dog.
-----------------------------------
        jumps                    
  ________|______________         
 |        |             over     
 |        |              |        
 |       fox            dog      
 |    ____|_____      ___|____    
 .  The quick brown the      lazy





Mary met Mike.
-----------------------------------
     met     
  ____|____   
Mary Mike  . 
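
The NLTK rendering above omits the edge labels. As a rough alternative, the sketch below prints an indented tree with the label next to each word. It is plain Python working on hand-copied rows for the first sample sentence rather than on spaCy tokens, and `render` is a hypothetical helper name.

```python
# Rows copied by hand: (index, word, label, head index); the root is its own head.
# The index of each row equals its position in the list.
ROWS = [
    (0, "The", "det", 3),
    (1, "quick", "amod", 3),
    (2, "brown", "amod", 3),
    (3, "fox", "nsubj", 4),
    (4, "jumps", "ROOT", 4),
    (5, "over", "prep", 4),
    (6, "the", "det", 8),
    (7, "lazy", "amod", 8),
    (8, "dog", "pobj", 5),
    (9, ".", "punct", 4),
]


def render(rows, index=None, depth=0):
    """Return the lines of an indented, labeled dependency tree."""
    if index is None:
        # Start from the row labeled ROOT.
        index = next(i for i, _, dep, _ in rows if dep == "ROOT")
    _, word, dep, _ = rows[index]
    lines = ["  " * depth + f"{word} ({dep})"]
    for child, _, _, head in rows:
        if head == index and child != index:  # skip the root's self-loop
            lines.extend(render(rows, child, depth + 1))
    return lines


print("\n".join(render(ROWS)))
```

Within spaCy itself, `displacy.render(doc, style="dep")` draws the same information as labeled arcs.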







What can a dependency tree be useful for?
We can use such a tree, for example, to simplify sentences, to discover relationships between sentence elements, or to find out which part of the text an emotionally charged phrase refers to ("I really like these grandma's country dumplings, but I don't despise a good kebab either" => I like dumplings, I don't despise kebab).

Let's use the dependency tree to create a simplified representation of a sentence containing a relation (verb) and the arguments of this relation in the form `relation(argument1, argument2, ...)`.

**Task: Simple Relationship Extraction Using Dependency Tree**

Using the attributes of the tokens created by spaCy after calling `nlp()` (https://spacy.io/api/token#attributes), create a CSV-like (space-separated) representation with the following attributes (columns):
<ol>
<li>identifier of the word in the document</li>
<li>word text</li>
<li>dependency tree label on "parent" connection</li>
<li>parent text from dependency tree</li>
<li>a list of children from the dependency tree</li>
</ol>

The expected result:

<pre>
0 The det fox []
1 quick amod fox []
2 brown amod fox []
3 fox nsubj jumps [The, quick, brown]
4 jumps ROOT jumps [fox, over, .]
5 over prep jumps [dog]
6 the det dog []
7 lazy amod dog []
8 dog pobj over [the, lazy]
9 . punct jumps []


10 Mary nsubj met []
11 met ROOT met [Mary, Mike, .]
12 Mike dobj met []
13 . punct met []
</pre>

In [10]:
def relationship_extraction(doc):
    """Print one row per token: index, text, dependency label, head text, children."""
    for sent in doc.sents:
        for token in sent:
            index = token.i  # position of the token in the whole document
            word = token.text
            dep = token.dep_  # label of the edge to the parent
            head = token.head.text  # the root is its own head
            children = [child.text for child in token.children]
            print(f"{index} {word} {dep} {head} {children}")
        print()  # blank line between sentences

doc = nlp("The quick brown fox jumps over the lazy dog. Mary met Mike.") # Sample sentences
relationship_extraction(doc)

0 The det fox []
1 quick amod fox []
2 brown amod fox []
3 fox nsubj jumps ['The', 'quick', 'brown']
4 jumps ROOT jumps ['fox', 'over', '.']
5 over prep jumps ['dog']
6 the det dog []
7 lazy amod dog []
8 dog pobj over ['the', 'lazy']
9 . punct jumps []

10 Mary nsubj met []
11 met ROOT met ['Mary', 'Mike', '.']
12 Mike npadvmod met []
13 . punct met []



We see that the most important word is the verb "jumps" (the root of the dependency tree, labeled ROOT).
We also see that the words are grouped accordingly: the children of 'fox' are ['The', 'quick', 'brown'], the terms that describe what this fox is like (and similarly for 'dog').
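
The grouping can be taken one step further by collecting every word below a given node, not only its direct children; spaCy exposes this as `Token.subtree`. The sketch below reproduces the idea in plain Python on hand-copied rows, so `subtree_words` is an illustrative helper and not a spaCy function.

```python
# Rows copied by hand: (index, word, label, head index); the root is its own head.
ROWS = [
    (0, "The", "det", 3),
    (1, "quick", "amod", 3),
    (2, "brown", "amod", 3),
    (3, "fox", "nsubj", 4),
    (4, "jumps", "ROOT", 4),
    (5, "over", "prep", 4),
    (6, "the", "det", 8),
    (7, "lazy", "amod", 8),
    (8, "dog", "pobj", 5),
    (9, ".", "punct", 4),
]


def subtree_words(rows, index):
    """Return the word at `index` plus all its descendants, in sentence order."""
    members = {index}
    changed = True
    while changed:  # keep adding rows whose head is already a member
        changed = False
        for i, _, _, head in rows:
            if head in members and i not in members:
                members.add(i)
                changed = True
    return [word for i, word, _, _ in rows if i in members]


print(subtree_words(ROWS, 3))  # everything below "fox"
print(subtree_words(ROWS, 5))  # everything below "over"
```

The subtree of 'fox' gives the whole subject phrase, while the subtree of 'over' gives the whole prepositional phrase.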


**Task: Relationship extraction**

Knowing how to retrieve dependency tree information from Token objects in spaCy, write a parsing function that, for each sentence processed by spaCy, extracts the name of the most important relation (the verb marked as ROOT) and the arguments of this relation (subject and object) from the generated dependency tree.

<ol>
<li>The relation should be stored in the variable predicate.</li>
<li>The subject, defined here as the token connected to the ROOT by the relation 'nsubj', should be stored in the variable subj.</li>
<li>The object can be defined, for example, as the element connected to the ROOT by the relation 'dobj'. If the ROOT has no 'dobj' child but has a child connected by the relation 'prep' (a preposition attached to the verb), then the object is the token linked to this preposition by the relation 'pobj'. In this second case, where the object is reachable only through a preposition attached directly to the ROOT, the preposition should be appended to the string stored in the predicate variable (for simplicity, assume that the preposition always follows the verb). Store the object in the variable obj.</li>
</ol>
To understand how the object relation works, look at the expected output of this task and the dependency tree generated in the first snippet of this section.

The expected result:
<pre>
jumps over(fox, dog)
met(Mary, Mike)
</pre>

While the second example, met(Mary, Mike), is straightforward, the first one needs more care: the word 'jumps' is the relation, but the root has no direct object (no 'dobj'). Instead we have the preposition `over`, which in turn is connected to the expected object ('dog'). We therefore append the preposition to the predicate name, replacing jumps with jumps over, and take as the object the element connected to the preposition by the relation 'pobj': dog.

In [20]:
doc = nlp("The quick brown fox jumps over the lazy dog. Mary met Mike.") # sample sentences to process

def parse(sent):
    root = sent.root

    predicate = root.text
    subj = None
    obj = None

    for child in root.children:
        if child.dep_ == "nsubj":
            subj = child.text
            break

    for child in root.children:
        if child.dep_ == "dobj":
            obj = child.text
            break

    if obj is None:
        for child in root.children:
            if child.dep_ == "prep":
                prep = child
                for grandchild in prep.children:
                    if grandchild.dep_ == "pobj":
                        predicate = f"{predicate} {prep.text}"
                        obj = grandchild.text
                        break
            if obj is not None:
                break
    print("{pred}({subj}, {obj})".format(pred=predicate, subj=subj, obj=obj))
    
for sent in doc.sents:
    parse(sent)

jumps over(fox, dog)
met(Mary, None)
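
The second line differs from the expected `met(Mary, Mike)` because, in the run shown earlier, the model labeled "Mike" as `npadvmod` rather than `dobj`, so the `dobj` lookup finds nothing. One possible workaround, offered only as an assumption and not as part of the task definition, is to accept a small set of fallback labels when neither `dobj` nor `prep`/`pobj` yields an object. The sketch below shows the idea in plain Python on hand-copied rows (re-indexed per sentence); `extract`, `FALLBACK_OBJECT_LABELS`, and the choice of fallback label are hypothetical.

```python
# Rows copied by hand from the output above: (index, word, label, head index).
JUMPS = [
    (0, "The", "det", 3), (1, "quick", "amod", 3), (2, "brown", "amod", 3),
    (3, "fox", "nsubj", 4), (4, "jumps", "ROOT", 4), (5, "over", "prep", 4),
    (6, "the", "det", 8), (7, "lazy", "amod", 8), (8, "dog", "pobj", 5),
    (9, ".", "punct", 4),
]
MET = [
    (0, "Mary", "nsubj", 1), (1, "met", "ROOT", 1),
    (2, "Mike", "npadvmod", 1), (3, ".", "punct", 1),
]

# Assumption: labels we are willing to treat as an object if nothing better is found.
FALLBACK_OBJECT_LABELS = ("npadvmod",)


def children(rows, index):
    """Rows whose head is `index`, excluding the root's self-loop."""
    return [row for row in rows if row[3] == index and row[0] != index]


def first_with_label(rows, index, labels):
    """First child of `index` whose label is in `labels`, or None."""
    for row in children(rows, index):
        if row[2] in labels:
            return row
    return None


def extract(rows):
    """Return (predicate, subject, object) following the rules of the task."""
    root = next(row for row in rows if row[2] == "ROOT")
    predicate = root[1]
    subj = first_with_label(rows, root[0], ("nsubj",))
    obj = first_with_label(rows, root[0], ("dobj",))
    if obj is None:
        # No direct object: look for prep -> pobj and extend the predicate.
        for prep in children(rows, root[0]):
            if prep[2] != "prep":
                continue
            obj = first_with_label(rows, prep[0], ("pobj",))
            if obj is not None:
                predicate = f"{predicate} {prep[1]}"
                break
    if obj is None:
        # Last resort (assumption): accept a fallback label attached to the root.
        obj = first_with_label(rows, root[0], FALLBACK_OBJECT_LABELS)
    subj_text = subj[1] if subj is not None else None
    obj_text = obj[1] if obj is not None else None
    return predicate, subj_text, obj_text


for rows in (JUMPS, MET):
    print("{}({}, {})".format(*extract(rows)))
```

A fallback like this trades precision for coverage: it recovers "Mike" here, but it may also pick up genuine adverbial modifiers in other sentences.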


## Optional: BRAT annotation tool
Download the BRAT annotation tool from GitHub (https://github.com/nlplab/brat) and try to run the annotation server locally. This server can be used to mark spans of text and the relations between them. You need to download the package, unpack it, and install it via `sudo ./install.sh`. Then, after creating an account, you should be able to run the server via `python standalone.py`. Remember that you need to log in after starting the server (there is a button in the top-right corner).

Datasets are stored in folders, and the files to annotate are plain text files. Annotations are also serialized as text files sharing the same file name, but ending with the `.ann` extension.
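
As a rough illustration of what such a file can contain, the sketch below builds standoff-style lines for the sentence "Mary met Mike.": two text-bound annotations (`T` lines with a type, character offsets, and the covered text) and one binary relation (`R` line). The type names `Person` and `Met` and the helper names are made up for this example; the types actually available depend on the task configuration, and the BRAT documentation remains the authoritative description of the format.

```python
TEXT = "Mary met Mike."


def text_bound(ann_id, ann_type, text, phrase):
    """Build a text-bound line: ID, TAB, type and offsets, TAB, covered text."""
    start = text.index(phrase)  # first occurrence only, enough for this sketch
    end = start + len(phrase)   # the end offset is exclusive
    return f"{ann_id}\t{ann_type} {start} {end}\t{phrase}"


def relation(ann_id, rel_type, arg1, arg2):
    """Build a binary relation line that refers to two text-bound IDs."""
    return f"{ann_id}\t{rel_type} Arg1:{arg1} Arg2:{arg2}"


lines = [
    text_bound("T1", "Person", TEXT, "Mary"),
    text_bound("T2", "Person", TEXT, "Mike"),
    relation("R1", "Met", "T1", "T2"),
]
ann = "\n".join(lines) + "\n"  # content one could save next to the text file as .ann
print(ann)
```

The relation line mirrors the `relation(argument1, argument2)` form used earlier in this section, with the arguments given as annotation IDs instead of words.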

The whole configuration of a task is stored in config files, of which `annotation.conf` is the most important; it should be placed in the same folder as the dataset (e.g., https://github.com/nlplab/brat/blob/master/example-data/tutorials/bio/annotation.conf). There are also config files for visuals (`visual.conf`, e.g., https://github.com/nlplab/brat/blob/master/example-data/tutorials/bio/visual.conf), where we can define colors for our annotations, and for keyboard shortcuts (`kb_shortcuts.conf`, e.g., https://github.com/nlplab/brat/blob/master/example-data/tutorials/bio/kb_shortcuts.conf).

More information about the tool can be found on the website: https://brat.nlplab.org
