# 7: Dependency parsing

## 1. Converting tabular format to NLTK DependencyGraph format

Let's start by importing the NLTK data structure for dependency trees:

In [1]:
from nltk.parse import DependencyGraph

Let's then build a tabular parse tree for the sentence "I saw the cat". The NLTK tabular format consists of four columns:

1. word form
1. POS tag
1. ID of head word (the syntactic head of the sentence has head ID 0)
1. Dependency relation

Both the POS tag and dependency relation need to be written in uppercase.

In [2]:
tree='''I   PRON   2   NSUBJ
saw   VERB   0   ROOT
the   DET    4   DET
cat   NOUN   2   OBJ
'''

We can then convert this to a `DependencyGraph`. Its `tree()` method gives a bracketed representation of the sentence, `(HEAD DEP1 ... DEPN)`, where `HEAD` is the head word of a construction and `DEP1`, ..., `DEPN` are its dependents. The dependents may in turn have their own dependents.

In [3]:
dg = DependencyGraph(tree)
print(dg.tree())

(saw I (cat the))
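
To make the mapping from the table to the bracketed form concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation; the helper names `parse_table` and `bracketed` are made up for illustration, and the sketch assumes exactly four whitespace-separated columns per line.

```python
def parse_table(table):
    """Parse four-column lines (word, tag, head ID, relation) into row dicts."""
    rows = []
    # Token IDs are implicit: the first line is token 1, the second token 2, ...
    for address, line in enumerate(table.strip().splitlines(), start=1):
        word, tag, head, rel = line.split()
        rows.append({"address": address, "word": word, "tag": tag,
                     "head": int(head), "rel": rel})
    return rows


def bracketed(rows, address):
    """Render the subtree rooted at `address` as (HEAD DEP1 ... DEPN)."""
    word = rows[address - 1]["word"]
    children = [r["address"] for r in rows if r["head"] == address]
    if not children:
        return word  # a leaf is printed without brackets
    parts = [word] + [bracketed(rows, child) for child in children]
    return "(" + " ".join(parts) + ")"


table = """I   PRON   2   NSUBJ
saw   VERB   0   ROOT
the   DET    4   DET
cat   NOUN   2   OBJ
"""

rows = parse_table(table)
# The syntactic head of the sentence is the token whose head ID is 0.
root = next(r["address"] for r in rows if r["head"] == 0)
print(bracketed(rows, root))
```

The sketch prints the same bracketed string as `dg.tree()` above, which shows that the head column alone determines the tree shape.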


## 2. Using dependency treebanks

NLTK offers a large dependency-annotated treebank, which has been automatically converted from the Penn Treebank. This treebank **is not** in UD format, but it can still be useful if we want to train dependency parsers.

Let's start by downloading the treebank:

In [4]:
import nltk
nltk.download("dependency_treebank")

[nltk_data] Downloading package dependency_treebank to
[nltk_data]     /Users/lxy/nltk_data...
[nltk_data]   Package dependency_treebank is already up-to-date!


True

Let's print the first sentence:

In [5]:
from nltk.corpus import dependency_treebank

print(dependency_treebank.sents()[0])


['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']


We can then print the dependency tree for the first sentence:

In [6]:
graph = dependency_treebank.parsed_sents()[0]

print(graph)

defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x7fca1812bd30>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'ROOT': [8]}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'NNP',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '',
                 'head': 2,
                 'lemma': 'Pierre',
                 'rel': '',
                 'tag': 'NNP',
                 'word': 'Pierre'},
             2: {'address': 2,
                 'ctag': 'NNP',
                 'deps': defaultdict(<class 'list'>, {'': [1, 3, 6, 7]}),
                 'feats': '',
                 'head': 8,
                 'lemma': 'Vinken',
                 'rel': '',
                 'tag': 'N

We can access individual tokens by their ID number:

In [7]:
print(graph.get_by_address(2))

{'address': 2, 'word': 'Vinken', 'lemma': 'Vinken', 'ctag': 'NNP', 'tag': 'NNP', 'feats': '', 'head': 8, 'deps': defaultdict(<class 'list'>, {'': [1, 3, 6, 7]}), 'rel': ''}


This also gives us access to the children in the tree:

In [8]:
for address in graph.get_by_address(2)['deps']['']:
    print(graph.get_by_address(address))

{'address': 1, 'word': 'Pierre', 'lemma': 'Pierre', 'ctag': 'NNP', 'tag': 'NNP', 'feats': '', 'head': 2, 'deps': defaultdict(<class 'list'>, {}), 'rel': ''}
{'address': 3, 'word': ',', 'lemma': ',', 'ctag': ',', 'tag': ',', 'feats': '', 'head': 2, 'deps': defaultdict(<class 'list'>, {}), 'rel': ''}
{'address': 6, 'word': 'old', 'lemma': 'old', 'ctag': 'JJ', 'tag': 'JJ', 'feats': '', 'head': 2, 'deps': defaultdict(<class 'list'>, {'': [5]}), 'rel': ''}
{'address': 7, 'word': ',', 'lemma': ',', 'ctag': ',', 'tag': ',', 'feats': '', 'head': 2, 'deps': defaultdict(<class 'list'>, {}), 'rel': ''}
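
The same `deps` field can be followed repeatedly to collect a whole subtree rather than only the direct children. The sketch below assumes a plain dictionary from address to node, shaped like the nodes printed above. The function name `descendants` and the toy data for "I saw the cat" are illustrative and not part of NLTK.

```python
def descendants(nodes, address):
    """Return the addresses of all tokens below `address`, in ascending order."""
    found = []
    stack = [address]
    while stack:
        node = nodes[stack.pop()]
        # `deps` maps a relation label to a list of child addresses.
        for child_addresses in node["deps"].values():
            for child in child_addresses:
                found.append(child)
                stack.append(child)
    return sorted(found)


# Toy nodes for "I saw the cat", written by hand in the shape NLTK uses.
toy_nodes = {
    0: {"address": 0, "word": None, "head": None, "deps": {"ROOT": [2]}},
    1: {"address": 1, "word": "I", "head": 2, "deps": {}},
    2: {"address": 2, "word": "saw", "head": 0,
        "deps": {"NSUBJ": [1], "OBJ": [4]}},
    3: {"address": 3, "word": "the", "head": 4, "deps": {}},
    4: {"address": 4, "word": "cat", "head": 2, "deps": {"DET": [3]}},
}

print(descendants(toy_nodes, 2))
print([toy_nodes[a]["word"] for a in descendants(toy_nodes, 0)])
```

With the treebank graph from above, the same function should work on `graph.nodes`, the address-to-node mapping that `print(graph)` displays.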


## 3. Dependency parsing using SpaCy

NLTK doesn't really provide a good dependency parser that we could apply to our own sentences, but we can get one from the SpaCy toolkit.

Start by running the following commands in the terminal:
```
pip3 install spacy
python3 -m spacy download en_core_web_sm
```

We can now use SpaCy to dependency parse some input sentences. 

**Note!** The dependency labels are not UD 2.0 relations. SpaCy's English models use an older, Stanford-style label set, so some relations look a bit different. For example, we've got "dobj" for direct objects instead of "obj".

In [9]:
import spacy

sentence = "foxes like cats but they don't really like dogs"
nlp = spacy.load("en_core_web_sm")

sent = nlp(sentence)
for token in sent:
    print(((token,token.i),(token.head, token.head.i),token.dep_))

((foxes, 0), (foxes, 0), 'ROOT')
((like, 1), (foxes, 0), 'prep')
((cats, 2), (like, 1), 'pobj')
((but, 3), (foxes, 0), 'cc')
((they, 4), (like, 8), 'nsubj')
((do, 5), (like, 8), 'aux')
((n't, 6), (like, 8), 'neg')
((really, 7), (like, 8), 'advmod')
((like, 8), (foxes, 0), 'conj')
((dogs, 9), (like, 8), 'dobj')
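
Each printed row pairs a token with its head, and the `ROOT` token is listed as its own head. As a small illustration of how to use these head pointers, the sketch below computes how many arcs separate each token from the root. The rows are copied by hand from the output above, and the helper name `depth` is made up for this example.

```python
# (index, word, head index, relation), copied from the output above.
rows = [
    (0, "foxes", 0, "ROOT"),
    (1, "like", 0, "prep"),
    (2, "cats", 1, "pobj"),
    (3, "but", 0, "cc"),
    (4, "they", 8, "nsubj"),
    (5, "do", 8, "aux"),
    (6, "n't", 8, "neg"),
    (7, "really", 8, "advmod"),
    (8, "like", 0, "conj"),
    (9, "dogs", 8, "dobj"),
]

# Map every token index to the index of its head.
heads = {index: head for index, _, head, _ in rows}


def depth(index, heads):
    """Count the arcs between a token and the root of the sentence."""
    steps = 0
    while heads[index] != index:  # the root is the token that is its own head
        index = heads[index]
        steps += 1
    return steps


for index, word, _, rel in rows:
    print(word, rel, depth(index, heads))
```

With a real parse, the `heads` dictionary could be built as `{token.i: token.head.i for token in sent}` instead of being typed in by hand.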


We can also visualize parse trees using SpaCy:

In [10]:
from spacy import displacy

displacy.render(sent, style="dep")
        

## 4. Dependency parsing using SpaCy

Following the example above, we can parse the sentence "the gray cat meowed" using SpaCy and print the parse tree:

In [11]:
sentence = "the gray cat meowed"
nlp = spacy.load("en_core_web_sm")

sent = nlp(sentence)

for token in sent:
    print(((token,token.i),(token.head, token.head.i),token.dep_))

((the, 0), (cat, 2), 'det')
((gray, 1), (cat, 2), 'amod')
((cat, 2), (meowed, 3), 'nsubj')
((meowed, 3), (meowed, 3), 'ROOT')


And visualize the dependency tree:

In [12]:
displacy.render(sent, style="dep")
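
The two conventions seen so far differ in indexing and in how the root is marked: the tabular format from section 1 counts tokens from 1 and gives the root head ID 0, while SpaCy counts from 0 and makes the root its own head. The following sketch converts between them. The function name `to_tabular` is hypothetical, the head indices and relations are copied from the output above, and the POS tags are filled in by hand for illustration (with a real parse they could be read from `token.pos_`).

```python
# (word, POS tag, head index, relation). Head indices and relations come from
# the SpaCy output above; the POS tags are assumptions added by hand.
rows = [
    ("the", "DET", 2, "det"),
    ("gray", "ADJ", 2, "amod"),
    ("cat", "NOUN", 3, "nsubj"),
    ("meowed", "VERB", 3, "ROOT"),
]


def to_tabular(rows):
    """Build a four-column table: word, POS tag, head ID, relation."""
    lines = []
    for index, (word, tag, head, rel) in enumerate(rows):
        # A token that is its own head is the root and gets head ID 0.
        # Every other head index is shifted from 0-based to 1-based.
        head_id = 0 if head == index else head + 1
        lines.append(f"{word}\t{tag.upper()}\t{head_id}\t{rel.upper()}")
    return "\n".join(lines) + "\n"


print(to_tabular(rows))
```

Tags and relations are uppercased to follow the convention stated in section 1.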

## 5. Tree conversion

Write a function `spacy_to_dict` which converts your SpaCy dependency tree into a dictionary, where the keys are the ID numbers of the words in your sentence and the values are sets of child IDs. You should add a root element with ID 0 whose only child is the syntactic head of the entire sentence.

You should be able to build the function using the following properties of SpaCy trees:

1. You can access the children of a token in a SpaCy tree using `token.children`.
1. For any token, `token.i` gives its ID number (this also applies to child tokens in `token.children`). Note! SpaCy trees do not have a separate `ROOT` element, and indexing for regular word forms starts at 0. You will therefore need to add one to your indices.
1. Unlike NLTK dependency trees, SpaCy trees mark the syntactic head of the entire sentence as a word that is its own head.

For example, the output of `spacy_to_dict` for `sent` should be:

```
{0:set([4]), 1:set(), 2:set(), 3:set([1,2]), 4:set([3])}
```

In [13]:
def spacy_to_dict(spacy_tree):
    d = {}
    for token in spacy_tree:
        # SpaCy indices start at 0, so shift by one to keep ID 0 free for the root element
        d[token.i + 1] = set()
        if token.head.i == token.i:  # the head of the sentence is its own head
            d[0] = {token.i + 1}
        for c in token.children:  # add every token that depends on this word
            d[token.i + 1].add(c.i + 1)
    return d

spacy_to_dict(sent)

{1: set(), 2: set(), 3: {1, 2}, 4: {3}, 0: {4}}
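
A dictionary in this format is easy to sanity-check. The sketch below tests whether every ID can be reached from the root element 0 exactly once, which is what we expect from a well-formed dependency tree. The function name `is_tree` and the two broken examples are made up for illustration.

```python
def is_tree(children):
    """Check that every ID is reached from the root element 0 exactly once."""
    seen = set()
    stack = [0]
    while stack:
        node = stack.pop()
        if node in seen:
            return False  # reached twice: a cycle or a second parent
        seen.add(node)
        stack.extend(children.get(node, ()))
    # Every key of the dictionary must have been visited.
    return seen == set(children)


tree = {1: set(), 2: set(), 3: {1, 2}, 4: {3}, 0: {4}}
unattached = {0: {2}, 1: set(), 2: set()}  # token 1 has no head
cyclic = {0: {1}, 1: {2}, 2: {1}}          # tokens 1 and 2 point at each other

print(is_tree(tree))
print(is_tree(unattached))
print(is_tree(cyclic))
```

The first dictionary is the output of `spacy_to_dict(sent)` shown above; the other two are deliberately malformed.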

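To see how `spacy_to_dict` arrives at its output, we can look at the sentence and print each token's index, its head's index, and its children:

In [22]: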
sent

the gray cat meowed

In [21]:
for token in sent:
    print(token.i)
    print("head", token.head.i)
    for c in token.children:
        print('children', c)

0
head 2
1
head 2
2
head 3
children the
children gray
3
head 3
children cat
