<a href="https://colab.research.google.com/github/TartuNLP/grammar-worker/blob/main/GEC_and_spell_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spell-checking and Grammatical Error Correction Demo

This notebook demonstrates [https://koodivaramu.eesti.ee/tartunlp/corrector](https://koodivaramu.eesti.ee/tartunlp/corrector), a tool that corrects Estonian text with spell-checking and grammatical error correction (GEC) models.

## Setup

Clone the repository, install the dependencies, and download the models. When running outside of Colab, it is advisable to create a Python 3.10 environment first.

In [None]:
! git clone https://github.com/TartuNLP/grammar-worker.git
%cd grammar-worker
! apt-get install swig3.0
! pip install -r requirements.txt
! python -c "import nltk; nltk.download('punkt')"

fatal: destination path 'grammar-worker' already exists and is not an empty directory.
/content/grammar-worker
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig3.0 is already the newest version (3.0.12-2.2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting git+https://github.com/TartuNLP/fairseq.git@mtee-0.1.0 (from -r requirements.txt (line 7))
  Cloning https://github.com/TartuNLP/fairseq.git (to revision mtee-0.1.0) to /tmp/pip-req-build-356kaow4
  Running command git clone --filter=blob:none --quiet https://github.com/TartuNLP/fairseq.git /tmp/pip-req-build-356kaow4
  Running command git checkout -q 1a6f364b8af6e746dd1fc623c8cf670a0be5b696
  Resolved https://github.com/TartuNLP/fairseq.git to commit 1a6f364b8af6e746dd1fc623c8cf670a0be5b696
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ...

In [None]:
! git lfs install

# GEC models

! git clone https://huggingface.co/tartuNLP/en-et-de-cs-nelb models/tartuNLP/en-et-de-cs-nelb
! git clone https://huggingface.co/tartuNLP/GEC-noisy-nmt-ut models/tartuNLP/GEC-noisy-nmt-ut
! git clone https://huggingface.co/tartuNLP/GEC-synthetic-pretrain-ut-ft models/tartuNLP/GEC-synthetic-pretrain-ut-ft

# Spell models

! git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_model_6000000_lines models/Jaagup/etnc19_reference_corpus_model_6000000_lines
! git clone https://huggingface.co/Jaagup/etnc19_web_2019 models/Jaagup/etnc19_web_2019
! git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_6000000_web_2019_600000 models/Jaagup/etnc19_reference_corpus_6000000_web_2019_600000


Updated git hooks.
Git LFS initialized.
Cloning into 'models/tartuNLP/en-et-de-cs-nelb'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 38 (delta 11), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (38/38), 1.38 MiB | 8.19 MiB/s, done.
Cloning into 'models/tartuNLP/GEC-noisy-nmt-ut'...
remote: Enumerating objects: 19, done.
remote: Total 19 (delta 0), reused 0 (delta 0), pack-reused 19
Unpacking objects: 100% (19/19), 63.25 KiB | 4.52 MiB/s, done.
Cloning into 'models/tartuNLP/GEC-synthetic-pretrain-ut-ft'...
remote: Enumerating objects: 21, done.
remote: Total 21 (delta 0), reused 0 (delta 0), pack-reused 21
Unpacking objects: 100% (21/21), 63.46 KiB | 4.88 MiB/s, done.
Filtering content: 100% (4/4), 828.89 MiB | 36.44 MiB/s, done.
Cloning into 'models/Jaagup/etnc19_reference_corpus_model_6000000_lines'...
remote: Enumerating objects: 12, done.

## Models in action
You can use the speller alone, the GEC model alone, or both together.

In [None]:
from pprint import pprint
from dataclasses import asdict
from gec_worker import GEC, read_gec_config
from gec_worker import Speller, read_speller_config
from gec_worker.dataclasses import Request
from gec_worker import MultiCorrector


### Loading the models


Three GEC models are available:

* `en-et-de-cs-nelb`: second-iteration model with the highest precision and recall of the three (preferred)
* `GEC-synthetic-pretrain-ut-ft`: slightly higher precision and lower recall
* `GEC-noisy-nmt-ut`: slightly higher recall and lower precision

**We suggest using the `en-et-de-cs-nelb` model.** It is a No Error Left Behind (NELB) model based on the No Language Left Behind (NLLB) translation model, and it performs significantly better than the other two.
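
To make the precision and recall trade-off concrete, here is a minimal sketch that scores two imaginary correctors against a set of reference edits. It assumes edit-level scoring, which is a common way to evaluate GEC systems but is not confirmed by this notebook as the method behind the comparison above. The edit sets and the `precision_recall` helper are invented for illustration.

```python
def precision_recall(proposed, gold):
    """Edit-level precision and recall for one corrector."""
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical reference edits (source word, corrected word).
gold = {("Ükss", "Üks"), ("väega", "väga"), ("vikase", "vigane"), ("olema", "on")}

# A cautious corrector: few edits, all of them right.
cautious = {("Ükss", "Üks"), ("väega", "väga")}

# An eager corrector: more edits, one of them wrong.
eager = {("Ükss", "Üks"), ("väega", "väga"), ("vikase", "vigane"), ("lause", "lausa")}

cautious_scores = precision_recall(cautious, gold)
eager_scores = precision_recall(eager, gold)
print("cautious:", cautious_scores)
print("eager:   ", eager_scores)
```

The cautious corrector never makes a wrong edit but misses half of the errors, while the eager one finds more errors at the cost of one bad edit.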

In [None]:
# Let's load the highest-performing model

gec_config = read_gec_config('models/GEC-nelb-1.3b.yaml')
gec = GEC(gec_config)




Three spell-checking models are available:

* `etnc19_reference_corpus_model_6000000_lines`: highest recall, lowest precision
* `etnc19_web_2019`: highest precision, lowest recall
* `etnc19_reference_corpus_6000000_web_2019_600000`: precision and recall in between the other two

In [None]:
# Let's load the model with the highest recall

spell_config = read_speller_config('models/spell_etnc19_reference_corpus_model_6000000_lines.yaml')
speller = Speller(spell_config)


### Preparing input data

Wrap the input string in a `Request` object.

In [None]:
source_text = 'Ükss väega vikase lause olema see'
#source_text = 'See onn üks väega viggane lause'
request = Request(text=source_text, language='et')


### Spell-checking

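Only applying the speller. The output shown below was produced with the alternative sentence that is commented out in the previous cell (`'See onn üks väega viggane lause'`).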

In [None]:
response = speller.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'See on üks väga vigane lause',
 'corrections': [{'replacements': [{'value': 'on'}],
                  'span': {'end': 7, 'start': 4, 'value': 'onn'}},
                 {'replacements': [{'value': 'väga vigane'}],
                  'span': {'end': 25, 'start': 12, 'value': 'väega viggane'}}],
 'original_text': 'See onn üks väega viggane lause',
 'status': 'OK',
 'status_code': 200}


'See on üks väga vigane lause'
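
The `corrections` list can also be applied by hand, for example when only some of the suggestions should be accepted. The following is a minimal sketch. It assumes the response has been turned into a plain dict with `asdict`, and that `start` and `end` are character offsets into `original_text`, which is consistent with the output above. `apply_corrections` is a hypothetical helper, not part of `gec_worker`.

```python
def apply_corrections(original_text, corrections):
    """Apply the first replacement of every correction to the original text."""
    text = original_text
    # Work from the end of the string so that earlier offsets stay valid.
    ordered = sorted(corrections, key=lambda c: c["span"]["start"], reverse=True)
    for correction in ordered:
        span = correction["span"]
        replacement = correction["replacements"][0]["value"]
        text = text[:span["start"]] + replacement + text[span["end"]:]
    return text


# Values copied from the speller output above.
response_dict = {
    "original_text": "See onn üks väega viggane lause",
    "corrections": [
        {"replacements": [{"value": "on"}],
         "span": {"start": 4, "end": 7, "value": "onn"}},
        {"replacements": [{"value": "väga vigane"}],
         "span": {"start": 12, "end": 25, "value": "väega viggane"}},
    ],
}

rebuilt = apply_corrections(response_dict["original_text"], response_dict["corrections"])
print(rebuilt)
```

Filtering the `corrections` list before calling the helper gives a partially corrected text.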

### Grammatical error correction

Only applying the GEC model.

In [None]:
response = gec.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väga vigane lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks väga vigane'}],
                  'span': {'end': 17,
                           'start': 0,
                           'value': 'Ükss väega vikase'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väga vigane lause on see.'
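
Before highlighting corrections in a user interface, it can be useful to confirm that every span really points at the text it claims to cover. The sketch below runs that check on the GEC output above. It assumes the dict layout produced by `asdict`, and `check_spans` is a hypothetical helper rather than a library function.

```python
def check_spans(response_dict):
    """Return (start, end, ok) per correction; ok means the span matches the text."""
    text = response_dict["original_text"]
    report = []
    for correction in response_dict["corrections"]:
        span = correction["span"]
        ok = text[span["start"]:span["end"]] == span["value"]
        report.append((span["start"], span["end"], ok))
    return report


# Values copied from the GEC output above.
gec_response = {
    "original_text": "Ükss väega vikase lause olema see",
    "corrections": [
        {"replacements": [{"value": "Üks väga vigane"}],
         "span": {"start": 0, "end": 17, "value": "Ükss väega vikase"}},
        {"replacements": [{"value": "on see."}],
         "span": {"start": 24, "end": 33, "value": "olema see"}},
    ],
}

span_report = check_spans(gec_response)
for start, end, ok in span_report:
    print(start, end, "ok" if ok else "MISMATCH")
```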

### Spell-checking and GEC

To control the order in which the correctors are applied, create a `MultiCorrector` and add the speller and the GEC corrector to it in the desired order.

In [None]:
multi_corrector = MultiCorrector()
multi_corrector.add_corrector(speller)
multi_corrector.add_corrector(gec)


In [None]:
response = multi_corrector.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väga vigane lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks väga vigane'}],
                  'span': {'end': 17,
                           'start': 0,
                           'value': 'Ükss väega vikase'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väga vigane lause on see.'
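
Why the order matters can be illustrated with a toy chain. This is only a sketch of the general idea of sequential correction, not the actual `MultiCorrector` implementation. `ToyChain`, `toy_speller`, and `toy_gec` are hypothetical stand-ins that work on plain strings, and the grammar rule is deliberately simplistic.

```python
class ToyChain:
    """Hypothetical stand-in: applies text -> text correctors in insertion order."""

    def __init__(self):
        self._correctors = []

    def add_corrector(self, corrector):
        self._correctors.append(corrector)

    def process(self, text):
        for corrector in self._correctors:
            text = corrector(text)
        return text


def toy_speller(text):
    # Toy dictionary lookup that fixes known misspellings word by word.
    fixes = {"Ükss": "Üks", "väega": "väga", "vikase": "vigane"}
    return " ".join(fixes.get(word, word) for word in text.split())


def toy_gec(text):
    # Toy grammar rule that only matches correctly spelled input.
    return text.replace("vigane lause olema see", "vigane lause on see.")


source = "Ükss väega vikase lause olema see"

speller_first = ToyChain()
speller_first.add_corrector(toy_speller)
speller_first.add_corrector(toy_gec)

gec_first = ToyChain()
gec_first.add_corrector(toy_gec)
gec_first.add_corrector(toy_speller)

print(speller_first.process(source))
print(gec_first.process(source))
```

In this toy setup the grammar rule only fires once the spelling has been fixed, so running the speller first yields the fully corrected sentence while the reverse order leaves `olema` in place.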

## Comparing the models

The three GEC models and the three spell-checking models behave differently. Here are some examples.

### Three GEC models

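The `GEC-noisy-nmt-ut` model exhibits higher error correction capability but is prone to confusion, while the `GEC-synthetic-pretrain-ut-ft` model is more stable but corrects fewer errors. In the example below, `en-et-de-cs-nelb` also corrects `veade` to `vigade`, which `GEC-synthetic-pretrain-ut-ft` leaves unchanged.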

In [None]:
gec_config_nelb = read_gec_config('models/GEC-nelb-1.3b.yaml')
gec_nelb = GEC(gec_config_nelb)

model_config_sp = read_gec_config('models/GEC-synthetic-pretrain-ut-ft.yaml')
gec_sp = GEC(model_config_sp)

model_config_nmt = read_gec_config('models/GEC-noisy-nmt-ut.yaml')
gec_nmt = GEC(model_config_nmt)

In [None]:
source_text_longer = 'Gramatikliste veade parantamine on põõnev ülessanne. Ükss väega vikase lause olema see. Mudel oskama selles ikka parandusi teha.'
request_longer = Request(text=source_text_longer, language='et')


In [None]:
response_sp = gec_sp.process_request(request_longer)
#pprint(asdict(response_sp))
response_sp.corrected_text


'Grammatiliste veade parandamine on põnev ülesanne. Üks väga vigane lause on see. Mudel oskab selles ikka parandusi teha.'

In [None]:
response_nmt = gec_nmt.process_request(request_longer)
#pprint(asdict(response_nmt))
response_nmt.corrected_text


In [None]:
response_nelb = gec_nelb.process_request(request_longer)
#pprint(asdict(response_nelb))
response_nelb.corrected_text

'Grammatiliste vigade parandamine on põnev ülesanne. Üks väga vigane lause on see. Mudel oskab selles ikka parandusi teha.'
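
Differences between two model outputs are easy to miss by eye. A minimal sketch using the standard library's `difflib` lists the words on which two outputs disagree. The two strings are copied from the `GEC-synthetic-pretrain-ut-ft` and `en-et-de-cs-nelb` outputs above, and `word_diff` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def word_diff(before, after):
    """Return (before_words, after_words) pairs for every non-matching region."""
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


output_sp = ("Grammatiliste veade parandamine on põnev ülesanne. "
             "Üks väga vigane lause on see. Mudel oskab selles ikka parandusi teha.")
output_nelb = ("Grammatiliste vigade parandamine on põnev ülesanne. "
               "Üks väga vigane lause on see. Mudel oskab selles ikka parandusi teha.")

differences = word_diff(output_sp, output_nelb)
print(differences)
```

Because the comparison is alignment-based, it also copes with outputs that insert or drop words.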

### Three spellers

The `etnc19_reference_corpus_model_6000000_lines` model finds more spelling mistakes, but its corrections are not always accurate. The `etnc19_web_2019` model, on the other hand, leaves more mistakes in the text but makes fewer incorrect edits. The `etnc19_reference_corpus_6000000_web_2019_600000` model falls between the two.

In [None]:
source_text_spell = 'Õikekiria veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel osgab seda ikla parandada.'
request_spell = Request(text=source_text_spell, language='et')


In [None]:
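# NB: the models are large and Colab memory is limited, so keep an eye on RAM usage.

# Reuses the `speller` instance created in "Loading the models" rather than loading a new model.
speller_ref_web = speller # spelling.Spelling("etnc19_reference_corpus_6000000_web_2019_600000/etnc19_reference_corpus_6000000_web_2019_600000.bin")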
response = speller_ref_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õigekirja teade parandamine on põnev ülessanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [None]:
speller_ref_config = read_speller_config('models/spell_etnc19_reference_corpus_model_6000000_lines.yaml')
speller_ref = Speller(speller_ref_config)

response = speller_ref.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õigekirja teade parandamine on põnev ülesanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [None]:
speller_web_config = read_speller_config('models/spell_etnc19_web_2019.yaml')
speller_web = Speller(speller_web_config)

response = speller_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õikekiria veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel oskab seda ikka parandada.'
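
How aggressively each speller edits can be summarised by counting the tokens it changed. The sketch below does this for the three outputs above, keyed by the variable names used in the cells. It relies on the assumption that a speller never changes the number of whitespace-separated tokens, which holds for these three outputs but is not guaranteed in general. `changed_tokens` is a hypothetical helper.

```python
def changed_tokens(source, corrected):
    """Pair up whitespace tokens by position and return the pairs that differ."""
    src_tokens, out_tokens = source.split(), corrected.split()
    if len(src_tokens) != len(out_tokens):
        raise ValueError("token counts differ; use an alignment-based diff instead")
    return [(s, o) for s, o in zip(src_tokens, out_tokens) if s != o]


source = ("Õikekiria veade parantamine on põnev ülessanne. "
          "Ükss väega vikane lause on see. Mudel osgab seda ikla parandada.")

# Outputs copied from the three speller cells above.
outputs = {
    "speller_ref_web": ("Õigekirja teade parandamine on põnev ülessanne. "
                        "Üks väega vigane lause on see. Mudel oskab seda ikla parandada."),
    "speller_ref": ("Õigekirja teade parandamine on põnev ülesanne. "
                    "Üks väega vigane lause on see. Mudel oskab seda ikla parandada."),
    "speller_web": ("Õikekiria veade parantamine on põnev ülessanne. "
                    "Ükss väega vikane lause on see. Mudel oskab seda ikka parandada."),
}

counts = {name: len(changed_tokens(source, text)) for name, text in outputs.items()}
for name, count in counts.items():
    print(f"{name}: {count} tokens changed")
```

A higher count is not automatically better: the changed pairs returned by `changed_tokens` still need to be inspected, since some edits (such as `veade` to `teade`) replace one error with another.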