<a href="https://colab.research.google.com/github/TartuNLP/grammar-worker/blob/gec-and-spell/GEC_and_spell_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spell-checking and Grammatical Error Correction Demo

This demo shows how to use [https://koodivaramu.eesti.ee/tartunlp/corrector](https://koodivaramu.eesti.ee/tartunlp/corrector), which corrects Estonian text using spell-checking and grammatical error correction (GEC) models.

## Setup

Clone the repo, install dependencies and download models.

In [1]:
! git clone https://koodivaramu.eesti.ee/tartunlp/corrector.git 
%cd corrector
! pip install -r requirements.txt
! python -c "import nltk; nltk.download(\"punkt\")"


Cloning into 'corrector'...
remote: Enumerating objects: 273, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 273 (delta 71), reused 122 (delta 71), pack-reused 151
Receiving objects: 100% (273/273), 80.09 KiB | 942.00 KiB/s, done.
Resolving deltas: 100% (137/137), done.
/content/corrector
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/TartuNLP/fairseq.git@mtee-0.1.0 (from -r requirements.txt (line 7))
  Cloning https://github.com/TartuNLP/fairseq.git (to revision mtee-0.1.0) to /tmp/pip-req-build-e6trtnnk
  Running command git clone --filter=blob:none --quiet https://github.com/TartuNLP/fairseq.git /tmp/pip-req-build-e6trtnnk
  Running command git checkout -q 1a6f364b8af6e746dd1fc623c8cf670a0be5b696
  Resolved https://github.com/TartuNLP/fairseq.git to commit 1a6f364b8af6e746dd1fc623c8cf670a0be5b696
  Running command git

In [2]:
! git lfs install

# GEC models

! cd models && git clone https://huggingface.co/TartuNLP/GEC-noisy-nmt-ut
! cd models && git clone https://huggingface.co/TartuNLP/GEC-synthetic-pretrain-ut-ft

# Spell models

! cd models/spellmodels && git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_model_6000000_lines
! cd models/spellmodels && git clone https://huggingface.co/Jaagup/etnc19_web_2019
! cd models/spellmodels && git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_6000000_web_2019_600000


Updated git hooks.
Git LFS initialized.
Cloning into 'GEC-noisy-nmt-ut'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 19 (delta 5), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (19/19), 63.25 KiB | 7.03 MiB/s, done.
Cloning into 'GEC-synthetic-pretrain-ut-ft'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 21 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (21/21), 63.46 KiB | 6.35 MiB/s, done.
Filtering content: 100% (4/4), 828.89 MiB | 70.29 MiB/s, done.
Cloning into 'etnc19_reference_corpus_model_6000000_lines'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 12 (delta 2), reused 0 (delta 0), pack-reused 0
Unp

## Models in action
You can apply the speller alone, the GEC model alone, or both in sequence.

In [None]:
from pprint import pprint
from dataclasses import asdict
from gec_worker import GEC, read_model_config
from gec_worker import spelling
from gec_worker.dataclasses import Response, Request
from gec_worker import multiple_corrections


### Loading the models


Two GEC models are available:

* `GEC-synthetic-pretrain-ut-ft`: slightly higher precision and lower recall (preferred)
* `GEC-noisy-nmt-ut`: slightly higher recall and lower precision

In [4]:
# Let's first load the second model

model_config = read_model_config('models/GEC-noisy-nmt-ut-config.yaml')
gec = GEC(model_config)


Three spell-checking models are available:

* `etnc19_reference_corpus_model_6000000_lines`: highest recall, lowest precision
* `etnc19_web_2019`: highest precision, lowest recall
* `etnc19_reference_corpus_6000000_web_2019_600000`: average precision, average recall
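The precision and recall labels used for the models can be made concrete with a small calculation. The sketch below is illustrative only: the `precision_recall` helper and the gold and proposed edit sets are invented for this example and are not evaluation results for any of the listed models. Precision is the share of proposed edits that are correct, and recall is the share of needed edits that were proposed.

```python
from typing import Set, Tuple

# An edit is (start, end, replacement); all edits below are hypothetical.
Edit = Tuple[int, int, str]


def precision_recall(proposed: Set[Edit], gold: Set[Edit]) -> Tuple[float, float]:
    """Return (precision, recall) of proposed edits against gold edits."""
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical reference edits for one sentence.
gold = {(0, 4, "Üks"), (5, 10, "väga"), (11, 17, "vigane"), (24, 29, "on")}

# A cautious corrector: few edits, all of them correct.
cautious = {(0, 4, "Üks"), (24, 29, "on")}

# An eager corrector: more edits, some of them wrong.
eager = {(0, 4, "Üks"), (5, 10, "vägeva"), (11, 17, "vigane"),
         (18, 23, "laused"), (24, 29, "on")}

cautious_scores = precision_recall(cautious, gold)
eager_scores = precision_recall(eager, gold)
print("cautious:", cautious_scores)
print("eager:", eager_scores)
```

In this toy setup the cautious corrector never makes a wrong edit but misses half of the needed ones, while the eager corrector finds more of them at the cost of some wrong edits.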

In [5]:
# Load the last model first; this can take a few minutes

speller = spelling.Spelling("etnc19_reference_corpus_6000000_web_2019_600000/etnc19_reference_corpus_6000000_web_2019_600000.bin")


### Preparing input data

Wrap the input string in a `Request` object.

In [6]:
source_text="Ükss väega vikase lause olema see"
request = Request(text=source_text, language='et')


### Spell-checking

Apply only the speller.

In [7]:
response = speller.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väega vigase lause olema see',
 'corrections': [{'replacements': [{'value': 'Üks'}],
                  'span': {'end': 4, 'start': 0, 'value': 'Ükss'}},
                 {'replacements': [{'value': 'vigase'}],
                  'span': {'end': 17, 'start': 11, 'value': 'vikase'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väega vigase lause olema see'
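The `span` offsets in the response above can be used to rebuild the corrected text from the original. The following is a minimal sketch, assuming that `start` and `end` are Python character offsets with an exclusive `end` (which matches the printed output) and that the first entry in `replacements` is the one to apply. The `apply_corrections` helper is hypothetical and not part of `gec_worker`.

```python
# Data copied from the speller response shown above.
original_text = "Ükss väega vikase lause olema see"
corrections = [
    {"replacements": [{"value": "Üks"}],
     "span": {"start": 0, "end": 4, "value": "Ükss"}},
    {"replacements": [{"value": "vigase"}],
     "span": {"start": 11, "end": 17, "value": "vikase"}},
]


def apply_corrections(text, corrections):
    """Apply the first replacement of each correction (hypothetical helper)."""
    result = text
    # Work from the rightmost span so that earlier offsets stay valid.
    ordered = sorted(corrections, key=lambda c: c["span"]["start"], reverse=True)
    for correction in ordered:
        span = correction["span"]
        # Sanity check: the span must point at the text it claims to cover.
        assert text[span["start"]:span["end"]] == span["value"]
        replacement = correction["replacements"][0]["value"]
        result = result[:span["start"]] + replacement + result[span["end"]:]
    return result


rebuilt = apply_corrections(original_text, corrections)
print(rebuilt)
```

Applying the spans from right to left avoids having to shift offsets when a replacement is shorter or longer than the text it replaces.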

### Grammatical error correction

Apply only the GEC model.

In [8]:
response = gec.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks vägeva vikase lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks vägeva'}],
                  'span': {'end': 10, 'start': 0, 'value': 'Ükss väega'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks vägeva vikase lause on see.'

### Spell-checking and GEC

To control the order in which the correctors are applied, create a model list with the `MultipleCorrections` class, then add the speller and the GEC corrector in the order they should run.

In [9]:
model_list = multiple_corrections.MultipleCorrections()
model_list.add_corrector(speller)
model_list.add_corrector(gec)


In [10]:
response = model_list.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väga vigane lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks väga vigane'}],
                  'span': {'end': 17,
                           'start': 0,
                           'value': 'Ükss väega vikase'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väga vigane lause on see.'

## Comparing the models

The two GEC models and the three spell-checking models behave differently. The examples below illustrate some of the differences.

### Two GEC models

The GEC-noisy-nmt-ut model exhibits higher error correction capability but is prone to confusion, while the GEC-synthetic-pretrain-ut-ft model is more stable but corrects fewer errors.

In [11]:
model_config_sp = read_model_config('models/GEC-synthetic-pretrain-ut-ft-config.yaml')
gec_sp = GEC(model_config_sp)

model_config_nmt = read_model_config('models/GEC-noisy-nmt-ut-config.yaml')
gec_nmt = GEC(model_config_nmt)

In [12]:
source_text_longer = "Gramatikliste veade parantamine on põõnev ülessanne. Ükss väega vikase lause olema see. Mudel oskama selles ikka parandusi luua."
request_longer = Request(text=source_text_longer, language='et')


In [13]:
response_sp = gec_sp.process_request(request_longer)
#pprint(asdict(response_sp))
response_sp.corrected_text


'Grammatiliste veade parandamine on põnev ülesanne. Üks väga vigane lause on see. Mudel oskab selles ikka parandusi luua.'

In [14]:
response_nmt = gec_nmt.process_request(request_longer)
#pprint(asdict(response_nmt))
response_nmt.corrected_text


'Grammatiliste vigade parandamine on põdev ülesseamine. Üks vägeva vikase lause on see. Mudel oskab selles ikka parandusi teha.'
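To see exactly where the two outputs diverge, the strings can be compared token by token. This is a small sketch using the standard library's `difflib`; the two strings are copied from the cell outputs above, and splitting on whitespace is a simplifying assumption rather than the tokenization the models use.

```python
from difflib import SequenceMatcher

# Outputs copied from the two cells above.
output_sp = ("Grammatiliste veade parandamine on põnev ülesanne. "
             "Üks väga vigane lause on see. "
             "Mudel oskab selles ikka parandusi luua.")
output_nmt = ("Grammatiliste vigade parandamine on põdev ülesseamine. "
              "Üks vägeva vikase lause on see. "
              "Mudel oskab selles ikka parandusi teha.")

tokens_sp = output_sp.split()
tokens_nmt = output_nmt.split()

# Collect every stretch of tokens where the two outputs disagree.
differences = []
matcher = SequenceMatcher(a=tokens_sp, b=tokens_nmt, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        differences.append((" ".join(tokens_sp[i1:i2]),
                            " ".join(tokens_nmt[j1:j2])))

for sp_part, nmt_part in differences:
    print(f"{sp_part!r} -> {nmt_part!r}")
```

Each printed pair shows the `GEC-synthetic-pretrain-ut-ft` wording on the left and the `GEC-noisy-nmt-ut` wording on the right.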

### Three spellers


In [21]:
source_text_spell = "Gramatikliste veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel osgab seda ikla parandada."
request_spell = Request(text=source_text_spell, language='et')


In [20]:
# using the model we loaded previously, etnc19_reference_corpus_6000000_web_2019_600000

speller_ref_web = speller # spelling.Spelling("etnc19_reference_corpus_6000000_web_2019_600000/etnc19_reference_corpus_6000000_web_2019_600000.bin")
response = speller_ref_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Dramaatiliste teade parandamine on põnev ülessanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [19]:
speller_ref = spelling.Spelling("etnc19_reference_corpus_model_6000000_lines/etnc19_reference_corpus_model_6000000_lines.bin")
response = speller_ref.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Dramaatiliste teade parandamine on põnev ülesanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [18]:
speller_web = spelling.Spelling("etnc19_web_2019/etnc19_web_2019.bin")
response = speller_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Gramatikliste veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel oskab seda ikka parandada.'
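A rough way to compare how eagerly each speller edits is to count the tokens it changed relative to the input. The sketch below uses the input and the three outputs shown above; the `changed_tokens` helper is hypothetical, and the position-by-position comparison relies on every output having the same number of whitespace-separated tokens as the input, which holds for these strings.

```python
source = ("Gramatikliste veade parantamine on põnev ülessanne. "
          "Ükss väega vikane lause on see. Mudel osgab seda ikla parandada.")

# Speller outputs copied from the cells above, keyed by model name.
outputs = {
    "etnc19_reference_corpus_6000000_web_2019_600000":
        "Dramaatiliste teade parandamine on põnev ülessanne. "
        "Üks väega vigane lause on see. Mudel oskab seda ikla parandada.",
    "etnc19_reference_corpus_model_6000000_lines":
        "Dramaatiliste teade parandamine on põnev ülesanne. "
        "Üks väega vigane lause on see. Mudel oskab seda ikla parandada.",
    "etnc19_web_2019":
        "Gramatikliste veade parantamine on põnev ülessanne. "
        "Ükss väega vikane lause on see. Mudel oskab seda ikka parandada.",
}


def changed_tokens(before, after):
    """Return (before, after) pairs for every token that was changed."""
    before_tokens, after_tokens = before.split(), after.split()
    # Position-by-position comparison only works when token counts match.
    assert len(before_tokens) == len(after_tokens)
    return [(b, a) for b, a in zip(before_tokens, after_tokens) if b != a]


change_counts = {name: len(changed_tokens(source, text))
                 for name, text in outputs.items()}
for name, count in change_counts.items():
    print(f"{name}: {count} changed tokens")
```

A changed token is not necessarily a correct one, so these counts hint at how readily each model edits rather than measuring precision or recall.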