<a href="https://colab.research.google.com/github/LukasStankevicius/Towards-Lithuanian-Grammatical-Error-Correction/blob/main/Supplementary_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is supplementary code for our work **Towards Lithuanian Grammatical Error Correction**, which will be presented at the [11th Computer Science On-line Conference 2022](https://csoc.openpublish.eu/).
<a id='main' name="main"></a>
Here you can find:
* [how to use our model](#usage);
* [how we prepared the dataset](#how);
 * [preprocessing;](#preprocessing)
 * [synthetic mistakes;](#synthetic)
* [how we trained the model.](#training)




Install libraries that we will need in this notebook:

In [1]:
! pip install transformers datasets


[back to top](#main)
# Usage<a name='usage' id='usage'></a>

In [1]:
from transformers import pipeline
name = "LukasStankevicius/ByT5-Lithuanian-gec-100h"
my_pipeline = pipeline(task="text2text-generation", model=name, framework="pt")

# Example text with mistakes, taken from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'
corrected_text = my_pipeline(text)[0]["generated_text"]
print(corrected_text)

Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.


## Advanced usage<a name='advances_usage'></a>

In [2]:
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

name = "LukasStankevicius/ByT5-Lithuanian-gec-100h"
tokenizer = ByT5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
def decode(x):
    return tokenizer.decode(x, skip_special_tokens=True)

Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [3]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

And the following generation parameters ([documentation](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate), [explanation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)):

In [4]:
g_kwargs = dict(max_length=1024, num_beams=1, min_length=15)

The corrected text can be obtained by:

In [5]:
input_dict = tokenizer([text], return_tensors='pt')
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.tolist()))[0]

'Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.'

If you process a lot of text, you can take advantage of a GPU, if you have one. Obtain the corrected text with:

In [7]:
# input_dict = {key:value.to("cuda:0") for key, value in input_dict.items()}
# model = model.to("cuda:0")
# output = model.generate(**input_dict, **g_kwargs)
# list(map(decode, output.cpu().tolist()))[0]

[back to top](#main)
# How we did it<a name='how' id='how'></a>


#### Dummy dataframe
A single-row dataframe, used only so that this toy example is quick to understand and run.

In [8]:
import pandas as pd

# Given the following text from https://www.diktantas.lt/news/diktantas-tekstas-miline:
text = 'Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės.'
df = pd.DataFrame([[text]], columns=['text'])
df.style.set_properties(**{'min-width': '200px'})  # 'min-width' is the valid CSS property name

Unnamed: 0,text
0,"Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės."


### Download additional code from the project

In [None]:
import os
user = "LukasStankevicius"
repo = "Towards-Lithuanian-Grammatical-Error-Correction"
target_dir = 'source'
# remove local directory if it already exists
if os.path.isdir(target_dir):
    !rm -rf {target_dir}
!git clone https://github.com/{user}/{repo}.git

!mv /content/{repo}/ /content/{target_dir}

[back to top](#main)
## Preprocessing (3.1 subsection in our paper)<a name='preprocessing' id='preprocessing'></a>
Here we automatically fix common mistakes, filter out strange text, and remove duplicate text items.

In [10]:
from source.fixes import NormalizeKabutes, other_fixes, DeleteSpaceBeforePunctuation, AddSpaceAfterPoint, AddSpaceBefore_m_d
from source.filters import my_filter


df['text'] = df['text'].str.normalize("NFKC")
df['text'] = other_fixes(df['text'])

print('fixing kabutes')
df['text'] = NormalizeKabutes().replace(df['text'])

print('fixing 25d. -> 25 d.')
df['text'] = AddSpaceBefore_m_d().replace(df['text'])

print('fixing Labas.Kaip sekasi? -> Labas. Kaip sekasi?')
df['text'] = AddSpaceAfterPoint().replace(df['text'])

print('fixing varlės , buožalviai : -> varlės, buožgalviai:')
df['text'] = DeleteSpaceBeforePunctuation().replace(df['text'])

df = my_filter(df, min_characters=20, min_lithuanian_fraction=0.98, min_fraction_of_spaces_to_non_spaces=0.02)

df.drop_duplicates('text', inplace=True)


2022-03-06 15:09:36,132 - source.filters - INFO - We start with 1 rows

2022-03-06 15:09:36,135 - source.filters - INFO - Filtering by length removed 0 rows

2022-03-06 15:09:36,139 - source.filters - INFO - Filtering by how lithuanian removed 0 rows more

2022-03-06 15:09:36,142 - source.filters - INFO - Filtering by fraction of spaces to non spaces removed 0 rows even more

2022-03-06 15:09:36,143 - source.filters - INFO - Now we are left with 1 rows. From initial only  100.00 % remains.


fixing kabutes
fixing 25d. -> 25 d.
fixing Labas.Kaip sekasi? -> Labas. Kaip sekasi?
fixing varlės , buožalviai : -> varlės, buožgalviai:
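
The helper classes used above live in the cloned `source` package, so their internals are not visible in this notebook. As a rough, standard-library-only illustration of what rule-based fixes and filters of this kind might look like, here is a minimal sketch. The regular expressions, the function names `simple_fixes` and `keep_text`, and the alphabet-based "how Lithuanian" check are assumptions made for illustration, not the project's actual implementation; only the threshold values mirror the `my_filter` call above.

```python
import re
import unicodedata

# Hypothetical approximations of the fixes printed above; NOT the code from source/fixes.py.
def simple_fixes(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    # 25d. -> 25 d.
    text = re.sub(r"(\d)([md]\.)", r"\1 \2", text)
    # Labas.Kaip -> Labas. Kaip (period directly followed by an uppercase letter)
    text = re.sub(r"\.(?=[A-ZĄČĘĖĮŠŲŪŽ])", ". ", text)
    # varlės , buožgalviai : -> varlės, buožgalviai:
    text = re.sub(r"\s+([,.:;!?])", r"\1", text)
    return text

# The 32 letters of the Lithuanian alphabet (lowercase).
LITHUANIAN = set("aąbcčdeęėfghiįyjklmnoprsštuųūvzž")

# Hypothetical stand-in for my_filter, applied to one string instead of a dataframe.
def keep_text(text: str, min_characters: int = 20,
              min_lithuanian_fraction: float = 0.98,
              min_fraction_of_spaces_to_non_spaces: float = 0.02) -> bool:
    if len(text) < min_characters:
        return False
    letters = [c.lower() for c in text if c.isalpha()]
    if not letters:
        return False
    if sum(c in LITHUANIAN for c in letters) / len(letters) < min_lithuanian_fraction:
        return False
    spaces = text.count(" ")
    non_spaces = len(text) - spaces
    return non_spaces > 0 and spaces / non_spaces >= min_fraction_of_spaces_to_non_spaces

print(simple_fixes("Labas.Kaip sekasi?"))
print(keep_text("Švito. Ažūrinės speigo adatėlės smigo į medžių šakas."))
```

Applying the fixes before the filter, as in the cell above, means the filter sees already normalized text.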


[back to top](#main)
## Generating synthetic mistakes (4.1 subsection in our paper)<a name='synthetic' id='synthetic'></a>

### First, download the GitHub Typo Corpus file used to calculate typographical error statistics

In [46]:
url = "https://github-typo-corpus.s3.amazonaws.com/data/github-typo-corpus.v1.0.0.jsonl.gz"
! wget {url}

import gzip
import shutil
gz_file = 'github-typo-corpus.v1.0.0.jsonl.gz'

with gzip.open(gz_file, 'rb') as f_in:
    with open('github-typo-corpus.v1.0.0.jsonl', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

!rm {gz_file}

--2022-03-06 13:51:19--  https://github-typo-corpus.s3.amazonaws.com/data/github-typo-corpus.v1.0.0.jsonl.gz
Resolving github-typo-corpus.s3.amazonaws.com (github-typo-corpus.s3.amazonaws.com)... 52.216.147.75
Connecting to github-typo-corpus.s3.amazonaws.com (github-typo-corpus.s3.amazonaws.com)|52.216.147.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43769081 (42M) [application/x-gzip]
Saving to: ‘github-typo-corpus.v1.0.0.jsonl.gz’


2022-03-06 13:51:24 (11.5 MB/s) - ‘github-typo-corpus.v1.0.0.jsonl.gz’ saved [43769081/43769081]
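
If you only want to inspect the corpus, a gzipped JSON Lines file can also be read record by record without decompressing it to disk first. The sketch below uses only the standard library; the helper name `iter_jsonl_gz` is hypothetical and no field names of the corpus are assumed.

```python
import gzip
import json
from typing import Iterator

def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example (requires the downloaded .gz file to still be present):
# first_record = next(iter_jsonl_gz("github-typo-corpus.v1.0.0.jsonl.gz"))
# print(first_record.keys())
```

Whether `Typo(corpus='github')` expects the decompressed file is not shown in this notebook, so the cell above remains the safe route for running the rest of it.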



#### Dummy dataset
Corrupting text using typo statistics is extremely slow, so we used the `datasets` library, which handles multiprocessing conveniently.

In [29]:
from datasets import Dataset
dataset = Dataset.from_pandas(df, preserve_index =False)
dataset

Dataset({
    features: ['text'],
    num_rows: 1
})

### Splitting into chunks
If a text sequence is longer than 1024 bytes, it has to be truncated and its tail is thrown away. To avoid losing that text, we split long sequences into chunks (here of at most 700 characters) instead.

In [30]:
def chunk_examples(examples, n_max):
    chunks = []
    for sentence in examples:
        chunks += [sentence[i:i + n_max] for i in range(0, len(sentence), n_max)]
    return {'text': chunks}

dataset = dataset.map(chunk_examples, batched=True, input_columns='text', 
                      fn_kwargs={'n_max': 700})
dataset

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['text'],
    num_rows: 1
})
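
Note that ByT5 works on UTF-8 bytes, while `chunk_examples` slices by characters. Lithuanian letters with diacritics take two bytes each in UTF-8, so a chunk's byte length is usually somewhat larger than its character count. The sketch below, with the hypothetical helpers `chunk_text` and `utf8_len`, shows one way to check both lengths for each chunk.

```python
def chunk_text(text: str, n_max: int) -> list:
    """Split text into consecutive pieces of at most n_max characters
    (the same slicing as in chunk_examples, for a single string)."""
    return [text[i:i + n_max] for i in range(0, len(text), n_max)]

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

sample = "Švito. Ažūrinės speigo adatėlės smigo į medžių šakas."
for piece in chunk_text(sample, n_max=20):
    # characters, bytes, chunk
    print(len(piece), utf8_len(piece), repr(piece))
```

A check like this can help confirm that a chosen character limit keeps chunks under the byte limit for your own data.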

### Corrupting!!!


In [31]:
from datasets import load_from_disk
from source.mistake_generator import generate_mistakes
from source.typos import Typo
import pickle

def ff(x, rank, frac):
    t = Typo(corpus='github', weight=frac*100) # takes very long the first time
    r = {'corrupted': generate_mistakes(pd.Series(x), frac=frac).apply(t.generate_errors).tolist()}
    time_print = pd.Timestamp.now().strftime('%m-%d-%H-%M-%S')
    # save statistics of induced typos:
    with open(f"typo_statistics_{time_print}_{rank}.pickle", 'wb') as f:
        pickle.dump(t.mistakes_generated, f)
    return r

frac=0.02  # roughly 2% of characters are corrupted

# We had num_proc=16 to corrupt in a reasonable time
dataset = dataset.map(ff, input_columns='text', writer_batch_size=2**18, batched=True, batch_size=2**18,
            fn_kwargs={'frac': frac}, num_proc=1, with_rank=True)
dataset

  0%|          | 0/1 [00:00<?, ?ba/s]

loading precomputed: github_init_stats_qwerty.pickle


  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/16 [00:00<?, ?it/s]

  series = series.str.replace(r"\s", lambda x: x[0] if random() > frac else "")
  series = series.str.replace(r"\B", lambda x: x[0] if random() > frac else " ")


Dataset({
    features: ['text', 'corrupted'],
    num_rows: 1
})
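
The two `series.str.replace` lines echoed in the output above hint at how whitespace errors are introduced: existing whitespace is deleted, and spaces are inserted at non-word-boundary positions (regex `\B`), each with a probability of roughly `frac`. The standalone sketch below reproduces only this pair of steps on a plain string, to make the mechanism concrete. The function names and the seeded `random.Random` are assumptions for illustration; the project's `generate_mistakes` and `Typo.generate_errors` evidently do more, since the corrupted example also contains changed letters.

```python
import random
import re

def drop_spaces(text: str, frac: float, rng: random.Random) -> str:
    """Delete each whitespace character with a probability of roughly frac."""
    return re.sub(r"\s", lambda m: m[0] if rng.random() > frac else "", text)

def insert_spaces(text: str, frac: float, rng: random.Random) -> str:
    """Insert a space, with a probability of roughly frac, at each position that
    is not a word boundary (inside a word, or between two non-word characters)."""
    return re.sub(r"\B", lambda m: m[0] if rng.random() > frac else " ", text)

rng = random.Random(0)  # seeded only to make this sketch repeatable
text = "Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės."
print(insert_spaces(drop_spaces(text, 0.02, rng), 0.02, rng))
```

With `frac=0.02`, most short sentences come out unchanged, which is why the effect is easier to see on longer texts or with a larger `frac`.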

In [32]:
dataset[0]

{'corrupted': "Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, tingdė jų syvus, bet nuo skaus mo trūkčiojo tik motersKūna's. Šerk šno ji nematė, tačeu jautė, kaipgysuose krauj as sbragsėdamas virsta ledo kristalėlisaic. Jos plaukai tarsi jūržolės pl aikstėsi ant pagalvės.",
 'text': 'Švito. Ažūrinės speigo adatėlės smigo į medžių šakas, stingdė jų syvus, bet nuo skausmo trūkčiojo tik moters kūnas. Šerkšno ji nematė, tačiau jautė, kaip gyslose kraujas spragsėdamas virsta ledo kristalėliais. Jos plaukai tarsi jūržolės plaikstėsi ant pagalvės.'}
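
To see exactly which characters were changed, the clean and corrupted strings can be compared with `difflib` from the standard library. This is only an inspection aid, not part of the pipeline; the helper name `list_edits` is hypothetical and the two strings are a short fragment of the example above.

```python
import difflib

def list_edits(clean: str, corrupted: str) -> list:
    """Return (operation, clean_fragment, corrupted_fragment) for every span that differs."""
    matcher = difflib.SequenceMatcher(a=clean, b=corrupted, autojunk=False)
    return [(tag, clean[i1:i2], corrupted[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

clean = "bet nuo skausmo trūkčiojo"
corrupted = "bet nuo skaus mo trūkčiojo"
for edit in list_edits(clean, corrupted):
    print(edit)
```

Each tuple names the operation (`insert`, `delete`, or `replace`) together with the affected fragments on both sides.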

[back to top](#main)
## Training<a name='training' id='training'></a>

#### Dummy dataset

In [33]:
from datasets import concatenate_datasets
dataset = concatenate_datasets([dataset]*10000)
dataset

Dataset({
    features: ['text', 'corrupted'],
    num_rows: 10000
})

In [34]:
from transformers import T5ForConditionalGeneration, ByT5Tokenizer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

checkpoint = "google/byt5-small"
output_dir = "my_output_dir"
batch_size = 128
max_length = 1024

dataset = dataset.train_test_split(test_size=0.001, shuffle=True, seed=42)
dd2 = dataset['test'].train_test_split(test_size=0.5)
dataset['test'], dataset['valid'] = dd2['test'], dd2['train']
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'corrupted'],
        num_rows: 9990
    })
    test: Dataset({
        features: ['text', 'corrupted'],
        num_rows: 5
    })
    valid: Dataset({
        features: ['text', 'corrupted'],
        num_rows: 5
    })
})

### Tokenizing

In [35]:
tokenizer = ByT5Tokenizer.from_pretrained(checkpoint)

def preprocess_function(examples):
    model_inputs = tokenizer(examples["corrupted"], max_length=max_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["text"], max_length=max_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = dataset.map(preprocess_function, batched=True, remove_columns=['text', 'corrupted'],
                      desc='tokenizing')
dataset

tokenizing:   0%|          | 0/10 [00:00<?, ?ba/s]

tokenizing:   0%|          | 0/1 [00:00<?, ?ba/s]

tokenizing:   0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9990
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5
    })
})
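
ByT5 has no learned subword vocabulary: it tokenizes the UTF-8 bytes of the text. In the Hugging Face `ByT5Tokenizer`, ids 0, 1 and 2 are, to the best of my knowledge, reserved for the padding, end-of-sequence and unknown tokens, so each byte value is shifted by 3 and an end-of-sequence id is appended. The sketch below imitates that mapping without loading the tokenizer. Treat the offset and the special ids as assumptions and verify them against the output of `tokenizer(...)` before relying on them; the function names are hypothetical.

```python
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2   # assumed special-token ids
OFFSET = 3                          # assumed shift applied to raw byte values

def byt5_like_ids(text: str) -> list:
    """Map text to byte-level ids: UTF-8 byte value + OFFSET, followed by EOS."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]

def byt5_like_decode(ids: list) -> str:
    """Invert byt5_like_ids, skipping special ids and anything outside the byte range."""
    data = bytes(i - OFFSET for i in ids if OFFSET <= i < OFFSET + 256)
    return data.decode("utf-8", errors="ignore")

ids = byt5_like_ids("Švito")
print(len("Švito"), len(ids), ids)
print(byt5_like_decode(ids))
```

Under this scheme the tokenized length is the UTF-8 byte count plus one, which is why `max_length=1024` acts as a limit in bytes rather than in characters or words.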

In [37]:
# YOU NEED CUDA TO RUN THIS WITHOUT CRASHING
%load_ext tensorboard
%tensorboard --logdir runs

model = T5ForConditionalGeneration.from_pretrained(checkpoint)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True)

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=128//batch_size,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_first_step=True,
    log_level='info',
    seed=0,
    dataloader_drop_last=True,
    dataloader_num_workers=1,
    remove_unused_columns=False,
    #     optim='adafactor',
    adafactor=True,
    lr_scheduler_type='constant',
    # skip_memory_metrics=False,
    predict_with_generate=True,
    generation_max_length=max_length,
    generation_num_beams=1,
    # no_cuda=True,
    report_to='none',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='epoch'
)

trainer = Seq2SeqTrainer(
    model=model, args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer)

trainer.train()
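
The expression `gradient_accumulation_steps=128//batch_size` keeps the number of examples per optimizer step at 128 when the per-device batch size has to be lowered to fit GPU memory. The arithmetic can be sketched as follows; the function name `accumulation_steps` is hypothetical, and the validation is stricter than the integer division used above, which would silently give 0 for a batch size larger than 128.

```python
def accumulation_steps(target_batch: int, per_device_batch: int, n_devices: int = 1) -> int:
    """Number of gradient accumulation steps needed to reach target_batch examples per update."""
    per_step = per_device_batch * n_devices
    if per_step <= 0 or target_batch % per_step != 0:
        raise ValueError("target_batch must be a positive multiple of per_device_batch * n_devices")
    return target_batch // per_step

for per_device in (128, 32, 8):
    steps = accumulation_steps(128, per_device)
    # per-device batch, accumulation steps, examples per optimizer update
    print(per_device, steps, per_device * steps)
```

For example, with `batch_size = 32` on a single device, four forward and backward passes are accumulated before each parameter update.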