<a href="https://colab.research.google.com/github/LukasStankevicius/Towards-Lithuanian-Grammatical-Error-Correction/blob/main/Supplementary_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is supplementary code for our work **Towards Lithuanian Grammatical Error Correction**, which will be presented at the [11th Computer Science On-line Conference 2022](https://csoc.openpublish.eu/).

# Contents:
* [Simple usage](#simple_usage)
* [Advanced usage](#advances_usage)
* [Automatic evaluation](#evaluation)
* [How we trained the tokenizer](#tokenizer)
* [How we trained the model](#training_model)
 * [Optimizer and scheduler](#opt)
 * [Data](#data)
 * [Final training script](#final)




Install libraries that we will need in this notebook:

In [None]:
! pip install transformers



# Simple usage<a name='simple_usage'></a>

In [None]:
from transformers import pipeline
name= "LukasStankevicius/ByT5-Lithuanian-gec-100h"
my_pipeline = pipeline(task="text2text-generation", model=name, framework="pt")


Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [None]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

The corrected text can be obtained by:

In [None]:
corrected_text = my_pipeline(text)[0]["generated_text"]
print(corrected_text)

Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.
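To see which words the model changed, the input and the output can be compared word by word. The following is a minimal sketch using Python's standard `difflib`. The helper name `word_edits` is hypothetical and not part of the original notebook, and the two short strings are fragments of the example above.

```python
import difflib

def word_edits(original: str, corrected: str):
    """Return (before, after) pairs for the word spans that differ."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted spans
            edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits

# Fragments of the example text (input) and of the model output
original = "Kątik iš Amerikės. Taippat tvarkingas."
corrected = "Ką tik iš Amerikės. Taip pat tvarkingas."

edits = word_edits(original, corrected)
for before, after in edits:
    print(f"{before!r} -> {after!r}")
```

Splitting on whitespace keeps punctuation attached to words, which is enough for a quick inspection but too coarse for a proper evaluation.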


# Advanced usage<a name='advances_usage'></a>

In [None]:
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

name= "LukasStankevicius/ByT5-Lithuanian-gec-100h"
tokenizer = ByT5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
def decode(x):
    return tokenizer.decode(x, skip_special_tokens=True)

Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [None]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

And generation parameters ([documentation](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate), [explanation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)):

In [None]:
g_kwargs = dict(max_length=1024, num_beams=1, min_length=15)
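ByT5 works on UTF-8 bytes rather than on subword tokens, so `max_length` and `min_length` are counted in bytes (plus special tokens), not in characters or words. Lithuanian letters with diacritics take two bytes each, so a text is usually somewhat longer in tokens than in characters. The sketch below illustrates this with plain Python. The `+ 3` offset and the end-of-sequence id `1` mimic how `ByT5Tokenizer` is understood to map bytes to ids; treat them as an assumption, not as the tokenizer's implementation.

```python
sample = "Kątik iš Amerikės"

n_chars = len(sample)                  # number of Unicode characters
n_bytes = len(sample.encode("utf-8"))  # number of UTF-8 bytes

# Assumption: ids 0..2 are reserved for special tokens, so each byte is
# shifted by 3, and an end-of-sequence id (1) is appended.
byt5_like_ids = [b + 3 for b in sample.encode("utf-8")] + [1]

print(n_chars, n_bytes, len(byt5_like_ids))
```

With `max_length=1024`, the output is therefore capped at roughly a thousand bytes of text, so longer documents would have to be split before correction.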

The corrected text can be obtained by:

In [None]:
input_dict = tokenizer([text], return_tensors='pt')
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.tolist()))[0]

'Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.'

If you process a lot of text, you can take advantage of a GPU (if you have one). Obtain the corrected text with:

In [None]:
input_dict = {key:value.to("cuda:0") for key, value in input_dict.items()}
model = model.to("cuda:0")
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.cpu().tolist()))[0]
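The cell above assumes that a CUDA device is present and fails otherwise. A minimal sketch of a fallback is shown below. `pick_device` is a hypothetical helper, not part of the original notebook; it relies only on `torch.cuda.is_available()` and returns `"cpu"` when PyTorch is missing or no GPU is visible.

```python
import importlib.util

def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch
    return "cuda:0" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The returned string can then replace the hard-coded `"cuda:0"`, for example `model = model.to(device)` and `value.to(device)` for the input tensors.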

# Preprocessing


In [None]:
import os

user = "LukasStankevicius"
repo = "Towards-Lithuanian-Grammatical-Error-Correction"

# remove local directory if it already exists
if os.path.isdir(repo):
    !rm -rf {repo}

!git clone https://github.com/{user}/{repo}.git

from fixes import NormalizeKabutes, other_fixes, DeleteSpaceBeforePunctuation, AddSpaceAfterPoint, AddSpaceBefore_m_d


## Correct some common error patterns

In [None]:
# `uy` is not defined in this section; it is assumed to be a pandas DataFrame with a string column 'text'
uy['text'] = uy['text'].str.normalize("NFKC")  # Unicode compatibility normalization
uy['text'] = other_fixes(uy['text'])
uy['text'] = NormalizeKabutes().replace(uy['text'])
uy['text'] = AddSpaceBefore_m_d().replace(uy['text'])
uy['text'] = AddSpaceAfterPoint().replace(uy['text'])
uy['text'] = DeleteSpaceBeforePunctuation().replace(uy['text'])
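The helper classes imported from `fixes` are not shown in this notebook. As a rough illustration of what such rule-based cleaning can look like, here is a minimal sketch that works on plain strings and uses only `unicodedata` and `re`. The function names and regular expressions are hypothetical stand-ins, not the implementations from the repository.

```python
import re
import unicodedata

def normalize_unicode(text: str) -> str:
    # NFKC folds compatibility characters, e.g. a no-break space becomes a plain space
    return unicodedata.normalize("NFKC", text)

def add_space_after_point(text: str) -> str:
    # Insert a space when a period is directly followed by an uppercase letter
    return re.sub(r"\.(?=[A-ZĄČĘĖĮŠŲŪŽ])", ". ", text)

def delete_space_before_punctuation(text: str) -> str:
    # Remove whitespace that precedes common punctuation marks
    return re.sub(r"\s+([,.;:!?])", r"\1", text)

sample = "Labas\u00a0rytas .Kaip sekasi ?"
cleaned = sample
# Apply the steps in the same order as the cell above
for step in (normalize_unicode, add_space_after_point, delete_space_before_punctuation):
    cleaned = step(cleaned)

print(cleaned)
```

Rules like these are deliberately narrow: the uppercase lookahead leaves decimal numbers such as `3.5` untouched, but abbreviations and other edge cases would need their own handling.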

## Filter the text samples based on some statistical distributions