Correct grammatical and spelling errors in Ukrainian texts

BonySmoke/grammar-tag

General Information

Train and run the Seq2Tag model for Grammatical Error Correction (GEC) in Ukrainian.

Installation

Currently, there is no PyPI package for this project, but I hope to add it soon!

First, install Poetry. Then run poetry install in the root of the project. This installs all required dependencies.

Training

At the moment, there is no CLI command to train the model.
However, you can do it directly from code:

from ua_gec import Corpus
from gec.seq2tag import Seq2TagManager

# You can pass any custom list of documents that is compatible with the annotation format of the UA-GEC Python package
corpus = Corpus(partition="all", annotation_layer="gec-only")
seq2tag = Seq2TagManager(corpus=corpus, min_error_occurrence=3)
seq2tag.train()
seq2tag.push()  # log in to your HuggingFace account first
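
The min_error_occurrence argument is not documented here. One plausible reading, which is an assumption and not confirmed by this project, is that transformation tags seen fewer than that many times in the training data are dropped so that the tag vocabulary stays small. A minimal sketch of that idea follows; filter_rare_tags and the $KEEP fallback tag are hypothetical names and not part of the gec package:

```python
from collections import Counter
from typing import List


def filter_rare_tags(
    tag_sequences: List[List[str]],
    min_occurrence: int = 3,
    fallback: str = "$KEEP",  # hypothetical "leave the token unchanged" tag
) -> List[List[str]]:
    """Replace tags that occur fewer than min_occurrence times with fallback."""
    # Count every tag across all sequences.
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    # Keep frequent tags as they are and map rare ones to the fallback tag.
    return [
        [tag if counts[tag] >= min_occurrence else fallback for tag in seq]
        for seq in tag_sequences
    ]
```

Under this reading, a higher threshold yields a smaller tag vocabulary at the cost of ignoring rare corrections.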

Accuracy & Performance

The model was trained only on the GEC part of the UA-GEC dataset.
It reaches an F0.5 score of 0.6707 on the UNLP 2023 Shared Task in Grammatical Error Correction for Ukrainian. The model is not intended for production use, but it serves as a foundation for training larger models using synthetic data.
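
For readers unfamiliar with the metric, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the standard formula only; it is not the shared task's evaluation script, and the function name f_beta is chosen here for illustration:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    With beta < 1, precision counts for more than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With beta = 0.5, a system with high precision and modest recall scores higher than one with the two values swapped, which suits error correction, where a wrong edit is usually worse than a missed one.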

Because the model predicts a transformation tag for each token instead of rewriting the text, inference is fast. Correcting the UA-GEC test dataset (1509 documents) with 3 stages takes only ~82 seconds on a single GPU.
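
To illustrate why tagging is cheap, here is a minimal sketch of how per-token tags could be applied and how correction could be repeated over several stages. The tag names ($KEEP, $DELETE, $REPLACE_x, $APPEND_x) follow a common convention for tag-based GEC and are assumptions, as is the reading of "stages" as repeated correction passes; the project's actual tag vocabulary and loop may differ. The predict_tags callable stands in for the trained model:

```python
from typing import Callable, List

# Hypothetical tag vocabulary; the real model may use different names.
KEEP, DELETE = "$KEEP", "$DELETE"
REPLACE_PREFIX, APPEND_PREFIX = "$REPLACE_", "$APPEND_"


def apply_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply one transformation tag to each token and return the new tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    out: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == KEEP:
            out.append(token)
        elif tag == DELETE:
            continue  # drop the token
        elif tag.startswith(REPLACE_PREFIX):
            out.append(tag[len(REPLACE_PREFIX):])
        elif tag.startswith(APPEND_PREFIX):
            out.extend([token, tag[len(APPEND_PREFIX):]])
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


def correct(
    tokens: List[str],
    predict_tags: Callable[[List[str]], List[str]],
    max_stages: int = 3,
) -> List[str]:
    """Run up to max_stages tagging passes, stopping early if nothing changes."""
    for _ in range(max_stages):
        tags = predict_tags(tokens)
        if all(tag == KEEP for tag in tags):
            break
        tokens = apply_tags(tokens, tags)
    return tokens
```

Each pass is a single classification per token, so the cost grows with the number of tokens and stages and does not depend on generating the output text token by token.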

Interface

We use Gradio to interact with the model. The interface expects a Seq2Tag model to explain predictions.

To start the web interface, please run poetry run gradio interface.py.
