This notebook is for working with [treebank data](https://github.com/gregorycrane/gAGDT/tree/master/data/xml). The schema is described [here](https://github.com/gregorycrane/gAGDT).

We will use [Stanza](https://stanfordnlp.github.io/stanza/index.html) to train a dependency parser on our data, which we can then use to create exercises (such as finding the subject or object of a verb).

---
TODO:
- [ ] Use Stanza to [train](https://stanfordnlp.github.io/stanza/training.html) a dependency parser on our treebank data
- [ ] Use our parser to create exercises (e.g. finding the subject or object of verbs)

---
## Installations

In [None]:
!pip install stanza

# Import the package
import stanza

***
## Format all the treebank data correctly

### Notes on Formatting

- Stanza expects its annotations to be stored in [CoNLL-U](https://universaldependencies.org/format.html) format. More info on the UD guidelines can be found [here](https://universaldependencies.org/guidelines.html).

- According to the note [here](https://github.com/stanfordnlp/stanza#batching-to-maximize-pipeline-speed), the pipeline runs faster if we concatenate documents together and separate them with a blank line (`\n\n`).
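
The concatenation described in the note above can be sketched in a few lines. This is a minimal illustration, not part of Stanza: `join_documents` is a hypothetical helper name, and the sample strings are placeholders rather than treebank text.

```python
def join_documents(documents):
    """Join raw documents into one string, separated by a blank line."""
    # Drop empty documents and surrounding whitespace so that the only
    # blank lines in the result are the document separators.
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return "\n\n".join(cleaned)


# Placeholder documents (assumption: each string is one document).
docs = ["Sing, goddess, the wrath of Achilles.", "  Tell me, Muse, of the man.  ", ""]
batch = join_documents(docs)
print(batch)
```

The joined string can then be passed to a pipeline in a single call instead of looping over the documents one at a time.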

In [7]:
# Code to convert treebank data goes here...
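
As a starting point for the conversion, here is a rough sketch that reads treebank XML and writes the ten tab-separated CoNLL-U columns. It assumes each `<sentence>` contains `<word>` elements with `id`, `form`, `lemma`, `postag`, `relation` and `head` attributes; the sample XML is made up for illustration. The treebank's own `postag` and `relation` values are copied through unchanged (into XPOS and DEPREL), so mapping them to UD tags and labels is still left to do.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed treebank layout (not copied from the data).
SAMPLE_XML = """<treebank>
  <sentence id="1" document_id="sample" subdoc="1.1">
    <word id="1" form="menin" lemma="menis" postag="n-s---fa-" relation="OBJ" head="2"/>
    <word id="2" form="aeide" lemma="aeido" postag="v2spma---" relation="PRED" head="0"/>
    <word id="3" form="thea" lemma="thea" postag="n-s---fv-" relation="ExD" head="2"/>
  </sentence>
</treebank>"""


def sentence_to_conllu(sentence):
    """Render one <sentence> element as a CoNLL-U sentence block."""
    lines = ["# sent_id = " + sentence.get("id", "_")]
    for word in sentence.iter("word"):
        columns = [
            word.get("id") or "_",        # ID
            word.get("form") or "_",      # FORM
            word.get("lemma") or "_",     # LEMMA
            "_",                          # UPOS (not derived here)
            word.get("postag") or "_",    # XPOS (treebank tag, unchanged)
            "_",                          # FEATS (not derived here)
            word.get("head") or "_",      # HEAD (0 marks the root)
            word.get("relation") or "_",  # DEPREL (treebank label, not UD)
            "_",                          # DEPS
            "_",                          # MISC
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines)


def treebank_to_conllu(xml_text):
    """Convert a treebank XML string; each sentence ends with a blank line."""
    root = ET.fromstring(xml_text)
    return "".join(sentence_to_conllu(s) + "\n\n" for s in root.iter("sentence"))


conllu = treebank_to_conllu(SAMPLE_XML)
print(conllu)
```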

---
## Train our Parser

The tutorial [here](https://github.com/stanfordnlp/stanza-train) outlines these steps:

1. create a folder for holding training data, cd into that folder
2. run the following commands:
   ```
   git clone git@github.com:stanfordnlp/stanza.git
   cp config/config.sh stanza/scripts/config.sh
   cp config/xpos_vocab_factory.py stanza/stanza/models/pos/xpos_vocab_factory.py
   cd stanza
   source scripts/config.sh
   ```
3. modify the `config.sh` script (which is used to set environment variables)

NOTE: `stanzatrain/config/xpos_vocab_factory.py` is set up for the UD_English-TEST treebank and must be modified before it can be used with a different treebank.

--- 
## Use our Parser

### Construct a Processor
By default, a pipeline includes all processors (tokenization, multi-word expansion, part-of-speech tagging, lemmatization, dependency parsing and named entity recognition). This can be changed with the `processors` argument.

Info on dependency parsing can be found [here](https://stanfordnlp.github.io/stanza/depparse.html).

According to the [tutorial](https://github.com/stanfordnlp/stanza-train#initializing-processors-with-trained-models), we just need to provide the path to our model file when constructing the pipeline.

NOTE: the code below is only a placeholder. It loads a tokenizer model, not a dependency parser trained on our data.

In [None]:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_model_path='saved_models/tokenize/en_test_tokenizer.pt')
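
For the dependency parser, the same `<processor>_model_path` keyword pattern should apply. The sketch below only builds the keyword arguments, so it runs without Stanza installed; `custom_model_kwargs` is a hypothetical helper, and the parser path is an assumed output location, not a file that exists yet.

```python
def custom_model_kwargs(model_paths):
    """Turn {'depparse': 'x.pt'} into {'depparse_model_path': 'x.pt'}."""
    return {f"{name}_model_path": path for name, path in model_paths.items()}


# Hypothetical location; replace with wherever training saved the parser.
kwargs = custom_model_kwargs({"depparse": "saved_models/depparse/en_test_parser.pt"})
print(kwargs)

# With Stanza installed and the model file present, the pipeline could be
# built like this (depparse also needs the tokenize, pos and lemma processors):
#
#   import stanza
#   nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse", **kwargs)
```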

In [None]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en')

# Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
print("Building a Chinese pipeline...")
zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=False, use_gpu=False)

In [None]:
# Annotate Text 

# Processing English text
en_doc = en_nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
print(type(en_doc))

# Processing Chinese text
zh_doc = zh_nlp("达沃斯世界经济论坛是每年全球政商界领袖聚在一起的年度盛事。")
print(type(zh_doc))

# Print information about a word
word = en_doc.sentences[0].words[0]
print(word)
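
Once a sentence is parsed, a subject/object exercise could be generated along these lines. This is a sketch under assumptions: it works on plain dictionaries with `id`, `text`, `upos`, `head` and `deprel` keys (which is what `[w.to_dict() for w in sentence.words]` is expected to give), the sample sentence is hand-written rather than real parser output, and the relation sets use UD labels. A parser trained on unconverted treebank labels would need different sets.

```python
# UD relation labels (assumption: swap these for the treebank's own labels
# if the parser is trained on unconverted data).
SUBJECT_RELATIONS = {"nsubj", "nsubj:pass", "csubj"}
OBJECT_RELATIONS = {"obj", "iobj"}

# Hand-written parse of "The poet sings the wrath." for illustration only.
SAMPLE_SENTENCE = [
    {"id": 1, "text": "The", "upos": "DET", "head": 2, "deprel": "det"},
    {"id": 2, "text": "poet", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
    {"id": 3, "text": "sings", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 4, "text": "the", "upos": "DET", "head": 5, "deprel": "det"},
    {"id": 5, "text": "wrath", "upos": "NOUN", "head": 3, "deprel": "obj"},
    {"id": 6, "text": ".", "upos": "PUNCT", "head": 3, "deprel": "punct"},
]


def verb_arguments(words, subject_rels=SUBJECT_RELATIONS, object_rels=OBJECT_RELATIONS):
    """List each verb in a sentence with its direct subjects and objects."""
    exercises = []
    for verb in words:
        if verb.get("upos") != "VERB":
            continue
        # A dependent is any word whose head points at this verb's id.
        dependents = [w for w in words if w.get("head") == verb["id"]]
        exercises.append({
            "verb": verb["text"],
            "subjects": [w["text"] for w in dependents if w.get("deprel") in subject_rels],
            "objects": [w["text"] for w in dependents if w.get("deprel") in object_rels],
        })
    return exercises


print(verb_arguments(SAMPLE_SENTENCE))
# [{'verb': 'sings', 'subjects': ['poet'], 'objects': ['wrath']}]
```

Each returned entry holds the answer key for one question of the form "what is the subject (or object) of this verb?".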

## Misc...

These [quizzes](https://github.com/gregorycrane/Homerica/tree/master/quizzes) contain every morphological paradigm in Iliad 1.