End-to-end Parser for Eastern Armenian
This repository contains the tools needed to parse raw Eastern Armenian text. It has a script,
run.sh, which takes raw text as input and produces a CoNLL-U file with lemmas, morphological features, part-of-speech tags, and dependency trees.
- The parser segments the text into sentences and tokenizes them using ArmTreeBank's Tokenizer module.
- Lemmatization, POS tagging, and dependency parsing are performed by a neural network called COMBO, developed and open-sourced by Piotr Rybak and Alina Wroblewska from the Institute of Computer Science, Polish Academy of Sciences. If you use this network, please cite their paper.
- We have trained COMBO on the training set of the ArmTDP treebank from UD v2.3.
- The accuracy of the parser is far from perfect: it has been trained on only ~500 sentences. The table below shows its accuracy on the test set of the same treebank.
|Dependency parsing (Labelled attachment score)||55.25%|
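For readers unfamiliar with the metric, labelled attachment score (LAS) is the share of tokens whose predicted head and dependency label both match the gold annotation. The snippet below is a minimal sketch of that definition, not the evaluation script used to produce the number above. The function name `labelled_attachment_score` and the toy data are made up for illustration, and it assumes gold and predicted sentences share the same tokenization.

```python
from typing import List, Tuple

# Each token is represented as (head index, dependency label).
Arc = Tuple[int, str]


def labelled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Fraction of tokens whose head AND label both match the gold arc.

    Assumes both lists describe the same tokens in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy example (hand-written, not real parser output):
# the third token has the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "punct")]

print(labelled_attachment_score(gold, pred))  # 0.75
```

A token with the correct head but the wrong label counts as an error, which is why LAS is never higher than the unlabelled attachment score.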
Visualization of the current parser
The model is hosted on DigitalOcean: https://parser.yerevann.com/
Instructions (for end-to-end parsing)
- Make sure you have all the requirements installed
pip install -r requirements.txt
- Clone the repo (to get the submodules, don't forget to include the --recursive flag)
git clone --recursive https://github.com/Armtreebank/End-to-end-Parser.git
- Run the following command to get the .conllu file with predictions for every sentence of the input
python3 predict.py --model_path path_to_model.pkl --input_path sample.txt --output_path sample.conllu
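Once the output file exists, it can be inspected with a few lines of Python. The sketch below relies only on the standard CoNLL-U layout (ten tab-separated columns per token, `#` comment lines, blank lines between sentences); it is not part of this repository. The helper name `read_conllu` and the inline sample sentence are hypothetical and hand-written for illustration, not actual parser output.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# Hand-written sample in CoNLL-U layout (illustrative only).
sample = (
    "# text = Նա գնաց։\n"
    "1\tՆա\tնա\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tգնաց\tգնալ\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t։\t։\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(sample.splitlines()))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

To read the real output instead of the inline sample, pass an open file handle, for example `read_conllu(open("sample.conllu", encoding="utf-8"))`. All values are kept as strings, so `head` must be converted with `int()` before being used as an index.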
Instructions (for COMBO training)
cd COMBO
python3 -m src.main --mode autotrain --train train_data_path.conllu --valid valid_data_path.conllu --model model.pkl --force_trees