# EUROLAN 2019

## Prepare for the training

### 1. Merge corpora

The `/data/corpus` directory contains two corpora:
1. the manually annotated corpus, and
2. the rest of the training corpus.

Before the model can use them, the two corpora must be merged into a single file. The utility script `utils/merge.py` does this:

In [None]:
! python3 ./utils/merge.py \
    --manually-annotated /data/corpus/team_annotation.xml \
    --training-corpus /data/corpus/rest_of_corpus.xml \
    --output-corpus /data/corpus/train.xml
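
The internals of `utils/merge.py` are not shown here, so the following is only a minimal sketch of what merging two XML corpora can look like. It assumes both files share the same root element and keep their sentences as direct children of that root; the tag names `treebank` and `sentence` and the function `merge_xml_corpora` are hypothetical and may not match the real corpus or script.

```python
import xml.etree.ElementTree as ET


def merge_xml_corpora(first_xml: str, second_xml: str) -> str:
    """Append every child of the second root to the first root; return the merged XML."""
    first_root = ET.fromstring(first_xml)
    second_root = ET.fromstring(second_xml)
    # Assumption: both corpora use the same root tag (hypothetical: <treebank>).
    if first_root.tag != second_root.tag:
        raise ValueError("root elements differ: %s vs %s" % (first_root.tag, second_root.tag))
    # Copy the children (hypothetical: <sentence> elements) in their original order.
    first_root.extend(list(second_root))
    return ET.tostring(first_root, encoding="unicode")


# Tiny in-memory stand-ins for the two corpus files.
manual = '<treebank><sentence id="m1"/></treebank>'
rest = '<treebank><sentence id="r1"/><sentence id="r2"/></treebank>'

merged = merge_xml_corpora(manual, rest)
merged_ids = [sentence.get("id") for sentence in ET.fromstring(merged)]
print(merged_ids)
```

The manually annotated sentences come first in the merged output, followed by the rest of the corpus in its original order.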

### 2. Convert the training corpus into `CoNLL-U` format

The merged corpus is in `xml` format, but `NLP-Cube` requires `CoNLL-U` format. To convert the corpus, we pass it through the converter application.

#### 2.1. Create directories required by the converter

The converter requires an input directory and an output directory. Let's create them in `/tmp/`:

In [None]:
! mkdir -p /tmp/xml /tmp/conllu

#### 2.2. Move the corpus into the `/tmp/xml` directory

In [None]:
! mv /data/corpus/train.xml /tmp/xml
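
If the shell escapes are not available, steps 2.1 and 2.2 can also be done from Python. The sketch below is an equivalent under that assumption; the helper name `stage_corpus` is made up for illustration, and the demo runs in a throwaway temporary directory rather than in `/tmp/xml` and `/data/corpus`.

```python
import shutil
import tempfile
from pathlib import Path


def stage_corpus(corpus_file: Path, work_root: Path) -> Path:
    """Create xml/ and conllu/ under work_root and move corpus_file into xml/."""
    xml_dir = work_root / "xml"
    conllu_dir = work_root / "conllu"
    # exist_ok=True makes the step safe to re-run, like `mkdir -p`.
    xml_dir.mkdir(parents=True, exist_ok=True)
    conllu_dir.mkdir(parents=True, exist_ok=True)
    target = xml_dir / corpus_file.name
    shutil.move(str(corpus_file), str(target))
    return target


# Demo in a scratch directory so nothing on the real file system is touched.
with tempfile.TemporaryDirectory() as scratch:
    scratch_path = Path(scratch)
    demo_corpus = scratch_path / "train.xml"
    demo_corpus.write_text("<treebank/>", encoding="utf-8")

    staged = stage_corpus(demo_corpus, scratch_path / "work")
    staged_ok = staged.exists() and not demo_corpus.exists()
    conllu_ok = (scratch_path / "work" / "conllu").is_dir()

print(staged.name, staged_ok, conllu_ok)
```

For the paths used in this notebook, the call would be `stage_corpus(Path("/data/corpus/train.xml"), Path("/tmp"))`.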

#### 2.3. Invoke the converter app

In [None]:
! java -Dfile.encoding=utf-8 -jar /work/bin/TreeBankAnnotatorToConllU.jar /tmp/xml /tmp/conllu

Let's look at the first lines of the converted file:

In [None]:
! head /tmp/conllu/train.conll
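
For orientation when reading that output: CoNLL-U stores one token per line in ten tab-separated columns, uses lines starting with `#` for sentence-level comments, and closes each sentence with a blank line. A minimal reader can be sketched as follows. The function name `read_conllu` is an assumption, the sample sentence is made up for illustration and is not taken from the corpus, and multiword-token and empty-node lines are kept as ordinary rows.

```python
# Column names as defined by the CoNLL-U format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ..."; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 columns, got %d: %r" % (len(columns), line))
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        # Tolerate a missing blank line after the last sentence.
        sentences.append(current)
    return sentences


# Made-up sample sentence, for illustration only.
sample = (
    "# text = Ana are mere.\n"
    "1\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tare\tavea\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tmere\tmăr\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

parsed = read_conllu(sample)
print(len(parsed), [token["form"] for token in parsed[0]])
```

Calling `read_conllu` on the contents of `/tmp/conllu/train.conll` and checking that no `ValueError` is raised is a quick sanity check on the converter output before training.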

#### 2.4. Move the converted corpus back to the `/data/corpus` directory

In [None]:
! mv /tmp/conllu/train.conll /data/corpus

## Train the model

In [None]:
! python3 /work/NLP-Cube/cube/main.py \
    --train=parser \
    --train-file=/data/corpus/train.conll \
    --dev-file=/data/corpus/train.conll \
    --embeddings /data/cc.ro.300.vec \
    --store /model/1.0/parser \
    --batch-size 1000 \
    --set-mem 8000 \
    --autobatch \
    --patience 1
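
Note that the command above passes the same file as `--train-file` and `--dev-file`, so any validation score is computed on data the model has already seen. If a held-out development set is wanted, one way to get it is to split the converted corpus at sentence boundaries. The sketch below is one such split, not part of `NLP-Cube`; the function name `split_conllu`, the 10% default, and the fixed seed are assumptions, and it expects `\n` line endings.

```python
import random


def split_conllu(conllu_text, dev_fraction=0.1, seed=0):
    """Split CoNLL-U text into (train_text, dev_text) at sentence boundaries."""
    # Sentences are separated by blank lines.
    blocks = [block.strip("\n") for block in conllu_text.split("\n\n") if block.strip()]
    if len(blocks) < 2:
        raise ValueError("need at least two sentences to split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(blocks)
    # Keep at least one sentence on each side of the split.
    dev_size = min(max(1, int(len(blocks) * dev_fraction)), len(blocks) - 1)
    dev_blocks, train_blocks = blocks[:dev_size], blocks[dev_size:]

    def render(sentence_blocks):
        # Every sentence, including the last, ends with a blank line.
        return "".join(block + "\n\n" for block in sentence_blocks)

    return render(train_blocks), render(dev_blocks)


# Ten one-token dummy sentences, for illustration only.
demo = "".join(
    "# sent_id = %d\n1\tw%d\tw%d\tX\t_\t_\t0\troot\t_\t_\n\n" % (i, i, i)
    for i in range(10)
)

train_text, dev_text = split_conllu(demo, dev_fraction=0.2)
print(train_text.count("# sent_id"), dev_text.count("# sent_id"))
```

The two resulting strings can be written to separate files and passed to `--train-file` and `--dev-file` respectively.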

## References

1. [NLP-Cube: End-to-End Raw Text Processing With Neural Networks](http://www.aclweb.org/anthology/K18-2017), Tiberiu Boroș, Stefan Daniel Dumitrescu, Ruxandra Burtica. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, pp. 171–179, October 2018.

```
@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}
```

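2. [Learning Word Vectors for 157 Languages](https://arxiv.org/abs/1802.06893), Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.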
```
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
```

3. [A Romanian Dependency Treebank](http://www.ijcla.bahripublications.com/2015-2/IJCLA-2015-2-pp-083-103-A-Romanian-Dependency-Treebank.pdf), Cătălina Mărănduc, Cenel-Augusto Perez. International Journal of Computational Linguistics and Applications, vol. 6, no. 2, pp. 83–103, July-December 2015.
```
@article{DBLP:journals/ijcla/MaranducP15,
  author    = {Catalina Maranduc and
               Cenel{-}Augusto Perez},
  title     = {A Romanian Dependency Treebank},
  journal   = {Int. J. Comput. Linguistics Appl.},
  volume    = {6},
  number    = {2},
  pages     = {83--103},
  year      = {2015},
  url       = {http://www.ijcla.bahripublications.com/2015-2/IJCLA-2015-2-pp-083-103-A-Romanian-Dependency-Treebank.pdf},
  timestamp = {Fri, 07 Apr 2017 20:51:02 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/ijcla/MaranducP15},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```