<a href="https://colab.research.google.com/github/TartuNLP/grammar-worker/blob/gec-and-spell/GEC_and_spell_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spell-checking and Grammatical Error Correction Demo

This notebook demonstrates the corrector at [https://koodivaramu.eesti.ee/tartunlp/corrector](https://koodivaramu.eesti.ee/tartunlp/corrector), which corrects Estonian text using spell-checking and grammatical error correction (GEC) models.

## Setup

Clone the repository, install the dependencies, and download the models. When running outside of Colab, it is advisable to create a Python 3.10 environment first.

In [1]:
! git clone https://github.com/TartuNLP/grammar-worker.git
%cd grammar-worker
! git checkout gec-and-spell
! apt-get install swig3.0
! pip install -r requirements.txt
! python -c "import nltk; nltk.download(\"punkt\")"


Cloning into 'grammar-worker'...
remote: Enumerating objects: 359, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 359 (delta 58), reused 55 (delta 23), pack-reused 239
Receiving objects: 100% (359/359), 108.96 KiB | 9.08 MiB/s, done.
Resolving deltas: 100% (183/183), done.
/content/grammar-worker
Branch 'gec-and-spell' set up to track remote branch 'gec-and-spell' from 'origin'.
Switched to a new branch 'gec-and-spell'
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig3.0
0 upgraded, 1 newly installed, 0 to remove and 24 not upgraded.
Need to get 1,109 kB of archives.
After this operation, 5,555 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 swig3.0 amd64 3.0.12-2.2ubuntu1 [1,109 kB]
Fetched 1,109 kB in 0

In [2]:
! git lfs install

# GEC models

! git clone https://huggingface.co/tartuNLP/GEC-noisy-nmt-ut models/tartuNLP/GEC-noisy-nmt-ut
! git clone https://huggingface.co/tartuNLP/GEC-synthetic-pretrain-ut-ft models/tartuNLP/GEC-synthetic-pretrain-ut-ft

# Spell models

! git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_model_6000000_lines models/Jaagup/etnc19_reference_corpus_model_6000000_lines
! git clone https://huggingface.co/Jaagup/etnc19_web_2019 models/Jaagup/etnc19_web_2019
! git clone https://huggingface.co/Jaagup/etnc19_reference_corpus_6000000_web_2019_600000 models/Jaagup/etnc19_reference_corpus_6000000_web_2019_600000


Updated git hooks.
Git LFS initialized.
Cloning into 'models/tartuNLP/GEC-noisy-nmt-ut'...
remote: Enumerating objects: 19, done.
remote: Total 19 (delta 0), reused 0 (delta 0), pack-reused 19
Unpacking objects: 100% (19/19), 63.25 KiB | 6.32 MiB/s, done.
Filtering content: 100% (3/3), 725.10 MiB | 31.82 MiB/s, done.
Cloning into 'models/tartuNLP/GEC-synthetic-pretrain-ut-ft'...
remote: Enumerating objects: 21, done.
remote: Total 21 (delta 0), reused 0 (delta 0), pack-reused 21
Unpacking objects: 100% (21/21), 63.46 KiB | 3.53 MiB/s, done.
Filtering content: 100% (4/4), 828.89 MiB | 33.40 MiB/s, done.
Cloning into 'models/Jaagup/etnc19_reference_corpus_model_6000000_lines'...
remote: Enumerating objects: 12, done.
remote: Total 12 (delta 0), reused 0 (delta 0), pack-reused 12
Unpacking objects: 100% (12/12), 1.39 KiB | 713.00 KiB/s, done.
Cloning into 'models/Jaagup/etnc19_web_2019'...
remote: Enumerating objects: 6, done.
remote: Total 6 (delta 0), reused 0 (delt

## Models in action
You can apply only the speller, only the GEC model, or both together.

In [3]:
from pprint import pprint
from dataclasses import asdict
from gec_worker import GEC, read_gec_config
from gec_worker import Speller, read_speller_config
from gec_worker.dataclasses import Request
from gec_worker import MultiCorrector


### Loading the models


Two GEC models are available:

* `GEC-synthetic-pretrain-ut-ft`: slightly higher precision and lower recall (preferred)
* `GEC-noisy-nmt-ut`: slightly higher recall and lower precision

In [4]:
# Load the second model (GEC-noisy-nmt-ut) first

gec_config = read_gec_config('models/GEC-noisy-nmt-ut.yaml')
gec = GEC(gec_config)


Three spell-checking models are available:

* `etnc19_reference_corpus_model_6000000_lines`: highest recall, lowest precision
* `etnc19_web_2019`: highest precision, lowest recall
* `etnc19_reference_corpus_6000000_web_2019_600000`: average precision, average recall

In [5]:
# Load the last model first; this takes longer (a few minutes)

spell_config = read_speller_config('models/spell_etnc19_reference_corpus_6000000_web_2019_600000.yaml')
speller = Speller(spell_config)


### Preparing input data

Wrap the input string in a `Request` object.

In [6]:
source_text = "Ükss väega vikase lause olema see"
request = Request(text=source_text, language='et')


### Spell-checking

Only applying the speller.

In [7]:
response = speller.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väega vigase lause olema see',
 'corrections': [{'replacements': [{'value': 'Üks'}],
                  'span': {'end': 4, 'start': 0, 'value': 'Ükss'}},
                 {'replacements': [{'value': 'vigase'}],
                  'span': {'end': 17, 'start': 11, 'value': 'vikase'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väega vigase lause olema see'
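
The `corrections` list in the response above carries character offsets into `original_text`. As a minimal sketch of how a client might consume it, the helper below rebuilds the corrected text from the spans. The function name `apply_corrections` is hypothetical and not part of `gec_worker`; the sketch also assumes that spans do not overlap, that offsets count Unicode code points, and that the first replacement of each span is the one to apply.

```python
from typing import Any, Dict, List


def apply_corrections(original: str, corrections: List[Dict[str, Any]]) -> str:
    """Rebuild corrected text by applying the first replacement of each span.

    Hypothetical helper for illustration; assumes non-overlapping spans.
    """
    result = original
    # Work from the rightmost span so that earlier offsets stay valid.
    ordered = sorted(corrections, key=lambda c: c["span"]["start"], reverse=True)
    for correction in ordered:
        span = correction["span"]
        # Sanity check: the span must point at the text it claims to replace.
        if original[span["start"]:span["end"]] != span["value"]:
            raise ValueError(f"span mismatch at offset {span['start']}")
        replacement = correction["replacements"][0]["value"]
        result = result[:span["start"]] + replacement + result[span["end"]:]
    return result


# Response dictionary copied from the speller output shown above.
response_dict = {
    "corrected_text": "Üks väega vigase lause olema see",
    "corrections": [
        {"replacements": [{"value": "Üks"}],
         "span": {"end": 4, "start": 0, "value": "Ükss"}},
        {"replacements": [{"value": "vigase"}],
         "span": {"end": 17, "start": 11, "value": "vikase"}},
    ],
    "original_text": "Ükss väega vikase lause olema see",
}

rebuilt = apply_corrections(response_dict["original_text"],
                            response_dict["corrections"])
print(rebuilt)
```

Applying the spans from right to left avoids having to shift offsets after each replacement, which matters here because `Ükss` and `Üks` differ in length.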

### Grammatical error correction

Only applying the GEC model.

In [8]:
response = gec.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks vägeva vikase lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks vägeva'}],
                  'span': {'end': 10, 'start': 0, 'value': 'Ükss väega'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks vägeva vikase lause on see.'

### Spell-checking and GEC

To control the order in which the correctors are applied, create a `MultiCorrector` instance and add the speller and the GEC corrector to it in the desired order.

In [9]:
multi_corrector = MultiCorrector()
multi_corrector.add_corrector(speller)
multi_corrector.add_corrector(gec)


In [10]:
response = multi_corrector.process_request(request)
pprint(asdict(response))
response.corrected_text


{'corrected_text': 'Üks väga vigane lause on see.',
 'corrections': [{'replacements': [{'value': 'Üks väga vigane'}],
                  'span': {'end': 17,
                           'start': 0,
                           'value': 'Ükss väega vikase'}},
                 {'replacements': [{'value': 'on see.'}],
                  'span': {'end': 33, 'start': 24, 'value': 'olema see'}}],
 'original_text': 'Ükss väega vikase lause olema see',
 'status': 'OK',
 'status_code': 200}


'Üks väga vigane lause on see.'
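
The order of the correctors can change the result, because each corrector sees the output of the previous one. The sketch below illustrates this with a toy pipeline. Everything in it is hypothetical: `ToyChain`, `toy_speller`, and `toy_gec` are stand-ins with hard-coded rules, not the `gec_worker` implementation of `MultiCorrector`, and the outputs are not those of the real models.

```python
from typing import Callable, List

# A corrector here is just a function from text to text (an assumption
# made for this sketch; the real correctors work on Request objects).
Corrector = Callable[[str], str]


class ToyChain:
    """Hypothetical stand-in that applies correctors in insertion order."""

    def __init__(self) -> None:
        self.correctors: List[Corrector] = []

    def add_corrector(self, corrector: Corrector) -> None:
        self.correctors.append(corrector)

    def process(self, text: str) -> str:
        # Each corrector receives the output of the previous one.
        for corrector in self.correctors:
            text = corrector(text)
        return text


def toy_speller(text: str) -> str:
    """Word-level fixes from a tiny hard-coded table."""
    fixes = {"Ükss": "Üks", "vikase": "vigase"}
    return " ".join(fixes.get(word, word) for word in text.split(" "))


def toy_gec(text: str) -> str:
    """Phrase-level rule that only fires on correctly spelled input."""
    return text.replace("vigase lause olema see", "vigane lause on see.")


source = "Ükss väega vikase lause olema see"

speller_then_gec = ToyChain()
speller_then_gec.add_corrector(toy_speller)
speller_then_gec.add_corrector(toy_gec)

gec_then_speller = ToyChain()
gec_then_speller.add_corrector(toy_gec)
gec_then_speller.add_corrector(toy_speller)

speller_first = speller_then_gec.process(source)
gec_first = gec_then_speller.process(source)
print(speller_first)
print(gec_first)
```

In this toy setup the grammar rule only matches once the spelling has been fixed, so running the speller first yields more corrections than the reverse order.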

## Comparing the models

The two GEC models and three spell-checking models behave differently. The examples below illustrate some of these differences.

### Two GEC models

The `GEC-noisy-nmt-ut` model exhibits higher error correction capability but is prone to confusion, while the `GEC-synthetic-pretrain-ut-ft` model is more stable but corrects fewer errors.

In [11]:
model_config_sp = read_gec_config('models/GEC-synthetic-pretrain-ut-ft.yaml')
gec_sp = GEC(model_config_sp)

model_config_nmt = read_gec_config('models/GEC-noisy-nmt-ut.yaml')
gec_nmt = GEC(model_config_nmt)

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



INFO:stanza:Loading these models for language: et (Estonian):
| Processor | Package |
-----------------------
| tokenize  | edt     |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Done loading processors!


In [12]:
source_text_longer = "Gramatikliste veade parantamine on põõnev ülessanne. Ükss väega vikase lause olema see. Mudel oskama selles ikka parandusi luua."
request_longer = Request(text=source_text_longer, language='et')


In [13]:
response_sp = gec_sp.process_request(request_longer)
#pprint(asdict(response_sp))
response_sp.corrected_text


'Grammatiliste veade parandamine on põnev ülesanne. Üks väga vigane lause on see. Mudel oskab selles ikka parandusi luua.'

In [14]:
response_nmt = gec_nmt.process_request(request_longer)
#pprint(asdict(response_nmt))
response_nmt.corrected_text


'Grammatiliste vigade parandamine on põdev ülesseamine. Üks vägeva vikase lause on see. Mudel oskab selles ikka parandusi teha.'
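
To see where the two GEC outputs disagree, a word-level diff can help. The following is a minimal sketch using the standard-library `difflib`; the helper name `word_diff` is hypothetical, and the two strings are copied from the outputs above.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_diff(first: str, second: str) -> List[Tuple[str, str]]:
    """Return (first, second) phrase pairs where the two texts disagree."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep every non-matching region (replace, insert, or delete).
        if tag != "equal":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


# Outputs copied from the two cells above.
text_sp = ("Grammatiliste veade parandamine on põnev ülesanne. "
           "Üks väga vigane lause on see. "
           "Mudel oskab selles ikka parandusi luua.")
text_nmt = ("Grammatiliste vigade parandamine on põdev ülesseamine. "
            "Üks vägeva vikase lause on see. "
            "Mudel oskab selles ikka parandusi teha.")

differences = word_diff(text_sp, text_nmt)
for sp_part, nmt_part in differences:
    print(f"synthetic-pretrain: {sp_part!r:25} noisy-nmt: {nmt_part!r}")
```

The diff only shows where the models disagree; deciding which variant is correct still requires a reference or a human reader.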

### Three spellers

The `etnc19_reference_corpus_model_6000000_lines` model finds more spelling mistakes, but its corrections are not always accurate. The `etnc19_web_2019` model, on the other hand, leaves more mistakes in the text but makes fewer incorrect edits. The `etnc19_reference_corpus_6000000_web_2019_600000` model falls between the two.

In [None]:
source_text_spell = "Õikekiria veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel osgab seda ikla parandada."
request_spell = Request(text=source_text_spell, language='et')


In [None]:
# NB! The models are large and Colab memory is limited, so monitor RAM usage.

# Reuse the etnc19_reference_corpus_6000000_web_2019_600000 speller loaded earlier.
speller_ref_web = speller
response = speller_ref_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õigekirja teade parandamine on põnev ülessanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [None]:
speller_ref_config = read_speller_config('models/spell_etnc19_reference_corpus_model_6000000_lines.yaml')
speller_ref = Speller(speller_ref_config)

response = speller_ref.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õigekirja teade parandamine on põnev ülesanne. Üks väega vigane lause on see. Mudel oskab seda ikla parandada.'

In [None]:
speller_web_config = read_speller_config('models/spell_etnc19_web_2019.yaml')
speller_web = Speller(speller_web_config)

response = speller_web.process_request(request_spell)
#pprint(asdict(response))
response.corrected_text


'Õikekiria veade parantamine on põnev ülessanne. Ükss väega vikane lause on see. Mudel oskab seda ikka parandada.'
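
A rough way to compare how aggressively each speller edits is to count the words it changed. The sketch below does this for the three outputs above. The helper `count_changed_words` is hypothetical, and it assumes that the speller keeps the number of whitespace-separated words unchanged, which holds for these three outputs.

```python
from typing import Dict

source = ("Õikekiria veade parantamine on põnev ülessanne. "
          "Ükss väega vikane lause on see. "
          "Mudel osgab seda ikla parandada.")

# Outputs copied from the three speller cells above.
outputs: Dict[str, str] = {
    "etnc19_reference_corpus_model_6000000_lines":
        "Õigekirja teade parandamine on põnev ülesanne. "
        "Üks väega vigane lause on see. "
        "Mudel oskab seda ikla parandada.",
    "etnc19_reference_corpus_6000000_web_2019_600000":
        "Õigekirja teade parandamine on põnev ülessanne. "
        "Üks väega vigane lause on see. "
        "Mudel oskab seda ikla parandada.",
    "etnc19_web_2019":
        "Õikekiria veade parantamine on põnev ülessanne. "
        "Ükss väega vikane lause on see. "
        "Mudel oskab seda ikka parandada.",
}


def count_changed_words(original: str, corrected: str) -> int:
    """Count positions where the corrected word differs from the original."""
    original_words, corrected_words = original.split(), corrected.split()
    if len(original_words) != len(corrected_words):
        raise ValueError("word counts differ; positional comparison is not meaningful")
    return sum(o != c for o, c in zip(original_words, corrected_words))


changed = {name: count_changed_words(source, text) for name, text in outputs.items()}
for name, count in changed.items():
    print(f"{name}: {count} words changed")
```

The count measures edits, not accuracy: a changed word is not necessarily a correct fix, so this is only a proxy for the recall and precision trade-off described above.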