Skip to content
End-to-end Parser for Eastern Armenian (tokenization, lemmatization, POS tagging, dependency parsing)
Python HTML CSS JavaScript
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
COMBO
Tokenizer
static
.ebignore
.gitignore
Procfile
README.md
application.py
model.pkl
predict.py
requirements.txt

README.md

End-to-end Parser for Eastern Armenian

This repository contains necessary tools to parse raw Eastern Armenian text. It has a script, run.sh, which takes raw text as an input and produces a CoNLL-U file with lemmas, morphological features, part-of-speech tags and dependency trees.

  • The parser segments the text into sentences and tokenizes them using ArmTreeBank's Tokenizer module.
  • Lemmatization, POS tagging and dependency parsing is performed by a neural network called COMBO, which is developed and open-sourced by Piotr Rybak and Alina Wroblewska from Institute of Computer Science, Polish Academy of Sciences. If you use this network, please cite their paper.
  • We have trained COMBO on the training set of the ArmTDP treebank from UD v2.3.
  • The accuracy of the parser is far from perfect. It has been trained only on ~500 sentences. The table below shows the accuracy on the test set of the same treebank.
Metric Accuracy
Lemmatization 88.05%
Part-of-speech tagging 85.07%
Morphological features 70.21%
Dependency parsing (Labelled attachment score) 55.25%

Visualization of the current parser

The model is hosted on DigitalOcean: https://parser.yerevann.com/

alt text

Instructions (for End to end parsing)

  • Make sure you have all the requirements installed
pip install -r requirements.txt
  • Clone the repo (to get the submodules don't forget to include the --recursive flag)
git clone --recursive https://github.com/Armtreebank/End-to-end-Parser.git
  • Run the following command to get the .conllu file with predictions for every sentence of the input
python3 predict.py --model_path path_to_model.pkl --input_path sample.txt --output_path sample.conllu

Instructions (for COMBO training)

cd COMBO
python3 -m src.main --mode autotrain --train train_data_path.conllu --valid valid_data_path.conllu --model model.pkl --force_trees

Acknowledgements

This project is supported by ANSEF grant Lingu-5008 and ISTC Research Grant.

You can’t perform that action at this time.