<a href="https://colab.research.google.com/github/LukasStankevicius/Towards-Lithuanian-Grammatical-Error-Correction/blob/main/Supplementary_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is supplementary code for our work **Towards Lithuanian Grammatical Error Correction**, which will be presented at the [11th Computer Science On-line Conference 2022](https://csoc.openpublish.eu/).

Here you can find:
* how to use our model;
* how we prepared the dataset;
* how we trained the model.

# Contents:
* [Simple usage](#simple_usage)
* [Advanced usage](#advances_usage)
* [Automatic evaluation](#evaluation)
* [How we trained the tokenizer](#tokenizer)
* [How we trained the model](#training_model)
 * [Optimizer and scheduler](#opt)
 * [Data](#data)
 * [Final training script](#final)




Install the libraries needed in this notebook:

In [1]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml

# Usage<a name='simple_usage'></a>

In [2]:
from transformers import pipeline
name = "LukasStankevicius/ByT5-Lithuanian-gec-100h"
my_pipeline = pipeline(task="text2text-generation", model=name, framework="pt")

# Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'
corrected_text = my_pipeline(text)[0]["generated_text"]
print(corrected_text)

Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.
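To see at a glance which words the model changed, the input and the corrected output can be compared word by word. The sketch below uses only the standard library's `difflib`; the helper name `word_changes` is hypothetical and not part of the project code.

```python
import difflib


def word_changes(original, corrected):
    """Return (before, after) pairs for every span of words that differs."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing word spans back into readable strings
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


for before, after in word_changes("Kątik iš Amerikos", "Ką tik iš Amerikos"):
    print(f"{before!r} -> {after!r}")
```

Calling `word_changes(text, corrected_text)` on the example above would list each corrected span in the same way.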


## Advanced usage<a name='advances_usage'></a>

In [3]:
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

name = "LukasStankevicius/ByT5-Lithuanian-gec-100h"
tokenizer = ByT5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

def decode(x):
    # Convert generated token ids back to text, dropping special tokens
    return tokenizer.decode(x, skip_special_tokens=True)

Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [4]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

And generation parameters ([documentation](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate), [explanation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)):

In [5]:
g_kwargs = dict(max_length=1024, num_beams=1, min_length=15)
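ByT5 operates on UTF-8 bytes rather than on words or subwords, so `max_length` and `min_length` are counted in bytes (plus a few special tokens), and Lithuanian letters with diacritics cost two bytes each. Texts longer than the limit need to be split before correction. The following is a minimal sketch of one way to do that, assuming that splitting at sentence-ending punctuation is acceptable; `utf8_len`, `chunk_by_bytes` and the `budget` parameter are hypothetical names, not part of the project or of `transformers`.

```python
import re
from typing import List


def utf8_len(text: str) -> int:
    """Number of UTF-8 bytes, roughly the number of ByT5 tokens."""
    return len(text.encode("utf-8"))


def chunk_by_bytes(text: str, budget: int = 1000) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `budget` bytes.

    A single sentence longer than `budget` is kept as its own (oversized) chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and utf8_len(candidate) > budget:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(utf8_len("ą"))  # a single Lithuanian letter can take two bytes
print(chunk_by_bytes("Ką tik iš Amerikos. Taip pat tvarkingas. Labas!", budget=30))
```

Each chunk could then be corrected separately and the results joined with spaces.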

The corrected text can be obtained by:

In [6]:
input_dict = tokenizer([text], return_tensors='pt')
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.tolist()))[0]

'Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.'

If you have a lot of text to process, you can take advantage of a GPU (if you have one). Obtain the corrected text with:

In [None]:
input_dict = {key:value.to("cuda:0") for key, value in input_dict.items()}
model = model.to("cuda:0")
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.cpu().tolist()))[0]

# How we did it


### Dummy dataframe

In [50]:
import pandas as pd

# Given the following text from https://www.diktantas.lt/news/diktantas-tekstas-miline:
text = 'Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės.'
df = pd.DataFrame([[text]], columns=['text'])
df.style.set_properties(**{'min-width': '200px'})

| | text |
|---|---|
| 0 | Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės. |


### Download additional code from the project

In [38]:
import os
user = "LukasStankevicius"
repo = "Towards-Lithuanian-Grammatical-Error-Correction"
target_dir = 'source'
# remove local directory if it already exists
if os.path.isdir(target_dir):
    !rm -rf {target_dir}
!git clone https://github.com/{user}/{repo}.git

!mv /content/{repo}/ /content/{target_dir}

Cloning into 'Towards-Lithuanian-Grammatical-Error-Correction'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 36 (delta 12), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (36/36), done.


## Preprocessing (subsection 3.1 of our paper)
We fix common mistakes automatically, filter out strange text, and deduplicate the text items.
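The punctuation fixes used in the next cell are imported from the project's `source.fixes` module and are not shown in this notebook. As a rough illustration of what such fixes can look like, here is a minimal sketch with plain regular expressions. The function names and patterns are assumptions made for this example and are not the project's implementation, which may handle more cases.

```python
import re


def add_space_before_m_d(text):
    # "25d." -> "25 d." (also "2022m." -> "2022 m."), assuming a digit
    # directly followed by the abbreviation "m." or "d."
    return re.sub(r"(\d)([md]\.)", r"\1 \2", text)


def add_space_after_point(text):
    # "Labas.Kaip" -> "Labas. Kaip": a period directly followed by an
    # uppercase letter (this naive rule would also split "A.B.")
    return re.sub(r"\.(?=[A-ZĄČĘĖĮŠŲŪŽ])", ". ", text)


def delete_space_before_punctuation(text):
    # "varlės , buožgalviai :" -> "varlės, buožgalviai:"
    return re.sub(r"\s+([,.:;!?])", r"\1", text)


print(add_space_before_m_d("25d."))
print(add_space_after_point("Labas.Kaip sekasi?"))
print(delete_space_before_punctuation("varlės , buožgalviai :"))
```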

In [51]:
from source.fixes import NormalizeKabutes, other_fixes, DeleteSpaceBeforePunctuation, AddSpaceAfterPoint, AddSpaceBefore_m_d
from source.filters import my_filter


df['text'] = df['text'].str.normalize("NFKC")
df['text'] = other_fixes(df['text'])

print('fixing kabutes')
df['text'] = NormalizeKabutes().replace(df['text'])

print('fixing 25d. -> 25 d.')
df['text'] = AddSpaceBefore_m_d().replace(df['text'])

print('fixing Labas.Kaip sekasi? -> Labas. Kaip sekasi?')
df['text'] = AddSpaceAfterPoint().replace(df['text'])

print('fixing varlės , buožalviai : -> varlės, buožgalviai:')
df['text'] = DeleteSpaceBeforePunctuation().replace(df['text'])

df = my_filter(df, min_characters=20, min_lithuanian_fraction=0.98, min_fraction_of_spaces_to_non_spaces=0.02)

df.drop_duplicates('text', inplace=True)


2022-03-06 13:59:38,561 - source.filters - INFO - We start with 1 rows

2022-03-06 13:59:38,568 - source.filters - INFO - Filtering by length removed 0 rows

2022-03-06 13:59:38,573 - source.filters - INFO - Filtering by how lithuanian removed 0 rows more

2022-03-06 13:59:38,577 - source.filters - INFO - Filtering by fraction of spaces to non spaces removed 0 rows even more

2022-03-06 13:59:38,580 - source.filters - INFO - Now we are left with 1 rows. From initial only  100.00 % remains.


fixing kabutes
fixing 25d. -> 25 d.
fixing Labas.Kaip sekasi? -> Labas. Kaip sekasi?
fixing varlės , buožalviai : -> varlės, buožgalviai:
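The log above shows three filtering steps: by length, by "how Lithuanian" the text is, and by the fraction of spaces to non-spaces. The project's `my_filter` is not reproduced here. The sketch below only illustrates the idea under loud assumptions: "how Lithuanian" is approximated as the share of letters that belong to the Lithuanian alphabet, and all helper names are hypothetical. The thresholds are the ones passed to `my_filter` above.

```python
# The 32 letters of the Lithuanian alphabet (no q, w, x)
LITHUANIAN_LETTERS = set("aąbcčdeęėfghiįyjklmnoprsštuųūvzž")


def lithuanian_fraction(text):
    """Share of alphabetic characters that are Lithuanian letters (an assumption)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in LITHUANIAN_LETTERS for ch in letters) / len(letters)


def space_fraction(text):
    """Ratio of whitespace characters to non-whitespace characters."""
    spaces = sum(ch.isspace() for ch in text)
    non_spaces = len(text) - spaces
    return spaces / non_spaces if non_spaces else 0.0


def keep_text(text, min_characters=20, min_lithuanian_fraction=0.98,
              min_fraction_of_spaces_to_non_spaces=0.02):
    """Apply the three filters in the order reported by the log."""
    return (len(text) >= min_characters
            and lithuanian_fraction(text) >= min_lithuanian_fraction
            and space_fraction(text) >= min_fraction_of_spaces_to_non_spaces)


def filter_and_deduplicate(texts):
    """Keep texts that pass all filters, dropping exact duplicates."""
    seen, kept = set(), []
    for text in texts:
        if keep_text(text) and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

With pandas, the same predicate could be applied as `df[df['text'].map(keep_text)].drop_duplicates('text')`.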


## Generating synthetic mistakes (4.1 subsection in our paper)

### First, download the GitHub Typo Corpus file used to calculate typographical error statistics

In [46]:
url = "https://github-typo-corpus.s3.amazonaws.com/data/github-typo-corpus.v1.0.0.jsonl.gz"
! wget {url}

import gzip
import shutil
gz_file = 'github-typo-corpus.v1.0.0.jsonl.gz'

with gzip.open(gz_file, 'rb') as f_in:
    with open('github-typo-corpus.v1.0.0.jsonl', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

!rm {gz_file}

--2022-03-06 13:51:19--  https://github-typo-corpus.s3.amazonaws.com/data/github-typo-corpus.v1.0.0.jsonl.gz
Resolving github-typo-corpus.s3.amazonaws.com (github-typo-corpus.s3.amazonaws.com)... 52.216.147.75
Connecting to github-typo-corpus.s3.amazonaws.com (github-typo-corpus.s3.amazonaws.com)|52.216.147.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43769081 (42M) [application/x-gzip]
Saving to: ‘github-typo-corpus.v1.0.0.jsonl.gz’


2022-03-06 13:51:24 (11.5 MB/s) - ‘github-typo-corpus.v1.0.0.jsonl.gz’ saved [43769081/43769081]



### Corrupting!!!

In [52]:
from source.mistake_generator import generate_mistakes
from source.typos import Typo

frac=0.02  # roughly 2% of characters are corrupted

t = Typo(corpus='github', weight=frac*100)  # may take a while the first time
df['corrupted'] = generate_mistakes(df['text'], frac=frac).apply(t.generate_errors)

loading precomputed: github_init_stats_qwerty.pickle


  series = series.str.replace(r"\s", lambda x: x[0] if random() > frac else "")
  series = series.str.replace(r"\B", lambda x: x[0] if random() > frac else " ")


In [53]:
df.style.set_properties(**{'min-width': '200px'})

| | text | corrupted |
|---|---|---|
| 0 | Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės. | Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, tingdė jų syvus, bet nuo skaus mo trūkčiojo tik motersKūna's. Šerk šno ji nematė, tačeu jautė, kaipgysuose krauj as sbragsėdamas virsta ledo kristalėlisaic. Jos plaukai tarsi jūržolės pl aikstėsi ant pagalvės. |
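The two `series.str.replace` lines echoed in the cell output above hint at how space errors are produced: each whitespace character is dropped with probability `frac`, and a space is inserted at non-boundary positions (`\B`) with the same probability. A standalone sketch of that idea with the standard library follows. The function name `corrupt_spaces` and the `seed` parameter are assumptions added here for reproducibility, and the project's `generate_mistakes` does more than this.

```python
import random
import re


def corrupt_spaces(text, frac, seed=None):
    """Delete whitespace and insert spurious spaces, each with probability `frac`."""
    rng = random.Random(seed)
    # Drop each whitespace character with probability frac
    text = re.sub(r"\s", lambda m: "" if rng.random() < frac else m[0], text)
    # \B matches empty positions that are not word boundaries, e.g. inside
    # a word; insert a space there with probability frac
    text = re.sub(r"\B", lambda m: " " if rng.random() < frac else "", text)
    return text


print(corrupt_spaces("kaip gyslose kraujas spragsėdamas virsta", frac=0.1, seed=0))
```

Only spaces are added or removed, so stripping all spaces from the input and from the output yields the same string.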


### If you need speed
Corrupting with typo statistics is extremely slow, so we used the `datasets` library, which handles multiprocessing conveniently.

In [None]:
# ! pip install datasets
# from datasets import load_from_disk

# import pickle
# def ff(x, rank, frac):
#     t = Typo(corpus='github', weight=frac*100)
#     r = {'corrupted': generate_mistakes(pd.Series(x), frac=frac).apply(t.generate_errors).tolist()}
#     time_print = pd.Timestamp.now().strftime('%m-%d-%H-%M-%S')
#     # save statistics of induced typos:
#     with open(f"typo_statistics/{time_print}_{rank}.pickle", 'wb') as f:
#         pickle.dump(t.mistakes_generated, f)
#     return r

# ds = load_from_disk('my_dataset')

# ds = ds.map(ff, input_columns='text', writer_batch_size=2**18, batched=True, batch_size=2**18,
#             fn_kwargs={'frac': frac}, num_proc=16, with_rank=True)

## Training