This is supplementary code for our work **Towards Lithuanian Grammatical Error Correction**, which will be presented at the [11th Computer Science On-line Conference 2022](https://csoc.openpublish.eu/).

# Contents:
* [Simple usage](#simple_usage)
* [Advanced usage](#advances_usage)
* [Automatic evaluation](#evaluation)
* [How we trained the tokenizer](#tokenizer)
* [How we trained the model](#training_model)
  * [Optimizer and scheduler](#opt)
  * [Data](#data)
  * [Final training script](#final)




Install libraries that we will need in this notebook:

In [1]:
! pip install transformers


Import

# Simple usage<a name='simple_usage'></a>

In [2]:
from transformers import pipeline
name= "LukasStankevicius/ByT5-Lithuanian-gec-100h"
my_pipeline = pipeline(task="text2text-generation", model=name, framework="pt")


Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [3]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

The corrected text can be obtained by:

In [4]:
corrected_text = my_pipeline(text)[0]["generated_text"]
print(corrected_text)

Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.
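To see what the model changed, the input and the output can be compared word by word. The following is a minimal sketch using the standard-library `difflib`; the helper name `list_corrections` is hypothetical and not part of the original notebook, and the strings are a short excerpt of the example above.

```python
from difflib import SequenceMatcher


def list_corrections(original, corrected):
    """Return (before, after) pairs for every word span that differs."""
    src, dst = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


# short excerpt of the input and of the corrected output
before = "Kątik iš Amerikės. Taippat tvarkingas."
after = "Ką tik iš Amerikės. Taip pat tvarkingas."
for old, new in list_corrections(before, after):
    print(f"{old!r} -> {new!r}")
```

Comparing whole words keeps the report readable; a character-level diff would be an equally valid choice for spelling errors.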


# Advanced usage<a name='advances_usage'></a>

In [5]:
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

name= "LukasStankevicius/ByT5-Lithuanian-gec-100h"
tokenizer = ByT5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
def decode(x):
    return tokenizer.decode(x, skip_special_tokens=True)

Given the following text from https://www.diktantas.lt/pasitikrink-lietuviu-kalbos-zinias:

In [6]:
text = 'Sveiki pardodu tvarkyngą "Audi" firmos automobylį. Kątik iš Amerikės. Viena savininka prižiurietas ir mylietas Automobylis. Dar turu patobulintą „Mersedes“ su automatinia greičių pavara už 4000 evrų (iš Amerikės). Taippat tvarkingas.'

And the following generation parameters ([documentation](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate), [explanation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)):

In [7]:
g_kwargs = dict(max_length=1024, num_beams=1, min_length=15)
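ByT5 works on UTF-8 bytes rather than on words or subwords, so `max_length` and `min_length` are effectively counted in bytes (plus special tokens). Lithuanian letters such as `ą` or `ž` take two bytes each, so a text needs more tokens than it has characters. The sketch below estimates the encoded length without loading the tokenizer; `byt5_length` is a hypothetical helper, and it assumes that a single end-of-sequence token is appended.

```python
def byt5_length(text):
    """Estimate the number of ByT5 tokens: one per UTF-8 byte plus one EOS token."""
    return len(text.encode("utf-8")) + 1


sample = "Taip pat tvarkingą automobilį."
# number of characters vs. estimated number of tokens
print(len(sample), byt5_length(sample))
```

If the estimate for an input approaches `max_length`, the generated correction may be cut off, so longer texts are probably best split into smaller pieces first.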

The corrected text can be obtained by:

In [8]:
input_dict = tokenizer([text], return_tensors='pt')
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.tolist()))[0]

'Sveiki parduodu tvarkingą „Audi“ firmos automobilį. Ką tik iš Amerikės. Viena savininkas prižiūrintas ir mylimas automobilis. Dar turiu patobulintą „Mersedes“ su automatine greičių pavara už 4000 eurų (iš Amerikės). Taip pat tvarkingas.'

If you process a lot of text, you can take advantage of a GPU (if you have one). Obtain the corrected text with:

In [None]:
input_dict = {key:value.to("cuda:0") for key, value in input_dict.items()}
model = model.to("cuda:0")
output = model.generate(**input_dict, **g_kwargs)
list(map(decode, output.cpu().tolist()))[0]

# Preprocessing
## Correct some common error patterns

In [None]:
import os

user = "LukasStankevicius"
repo = "Towards-Lithuanian-Grammatical-Error-Correction"

# remove local directory if it already exists
if os.path.isdir(repo):
    !rm -rf {repo}

!git clone https://github.com/{user}/{repo}.git

In [None]:
from fixes import NormalizeKabutes, other_fixes, DeleteSpaceBeforePunctuation, AddSpaceAfterPoint, AddSpaceBefore_m_d



In [None]:
# `uy` is assumed to be a pandas DataFrame holding the raw samples in its 'text' column
uy['text'] = uy['text'].str.normalize("NFKC")
uy['text'] = other_fixes(uy['text'])
uy['text'] = NormalizeKabutes().replace(uy['text'])
uy['text'] = AddSpaceBefore_m_d().replace(uy['text'])
uy['text'] = AddSpaceAfterPoint().replace(uy['text'])
uy['text'] = DeleteSpaceBeforePunctuation().replace(uy['text'])
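The `fixes` module itself lives in the cloned repository and is not reproduced here. As an illustration only, the sketch below shows how two such normalizations could be written with regular expressions on a plain string. The function names and patterns are assumptions and may differ from the actual implementations.

```python
import re
import unicodedata


def delete_space_before_punctuation(text):
    """Remove whitespace that directly precedes , . ; : ! or ?"""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


def add_space_after_point(text):
    """Insert a space when a period is glued between a lowercase and a capital letter."""
    return re.sub(r"(?<=[a-ząčęėįšųūž])\.(?=[A-ZĄČĘĖĮŠŲŪŽ])", ". ", text)


raw = "Labas rytas .Kaip sekasi ?"
clean = unicodedata.normalize("NFKC", raw)  # same Unicode normalization as above
clean = delete_space_before_punctuation(clean)
clean = add_space_after_point(clean)
print(clean)  # Labas rytas. Kaip sekasi?
```

The order matters in this sketch: the stray space before the period is removed first, which exposes the missing space after it.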

## Filter the text samples based on some statistical distributions

# Automatic evaluation<a name='evaluation'></a>
We evaluated summaries with [ROUGE](https://www.aclweb.org/anthology/W04-1013/). It measures *n-gram* overlap between the reference and generated texts. However, it should not be trusted completely, because the same meaning can be expressed with different words (*n-grams*). Still, it is close to the best available option that is both automated and fast. Lithuanian is rich in inflectional word endings, so we also "helped" ROUGE by stemming words.
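To make the n-gram overlap concrete, here is a minimal sketch of the ROUGE-1 F-measure computed by hand on whitespace tokens. It is a simplification for illustration (lowercasing only, no stemming) and not the `rouge_score` implementation; the function name and the example sentences are made up.

```python
from collections import Counter


def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure between two whitespace-tokenized strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# three of the four unigrams match in both directions
print(rouge1_f("katė miega ant sofos", "katė miega ant kilimo"))  # 0.75
```

Without stemming, inflected forms such as `sofos` and `sofa` count as different unigrams, which is the reason for the stemmer.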


Combining ROUGE with a Lithuanian stemmer:

In [None]:
import Stemmer  # provided by the PyStemmer package
from rouge_score import rouge_scorer


class MyStemmer:
    def __init__(self):
        self.stemmer = Stemmer.Stemmer('lithuanian')

    def stem(self, token):
        return self.stemmer.stemWord(token)


class MyRougeScorer(rouge_scorer.RougeScorer):
    # I rewrite init to have different stemmer
    def __init__(self, rouge_types, use_stemmer=False):
        self.rouge_types = rouge_types
        self._stemmer = MyStemmer() if use_stemmer else None

Now, given the gold reference and generated summary:

In [None]:
ground_truth = "Kai Lietuva dar buvo okupuota ir mūsų šalies krepšininkai privalėjo žaisti TSRS rinktinėje, keli jų buvo ryškūs lyderiai."
generated_text = "Lietuvos krepšinio federacijos (LKF) prezidento Arvydo Sabonio rezultatyvumo vidurkis yra aukščiausias tarp visų Sovietų Sąjungos rinktinėje atstovavusių žaidėjų, skaičiuojant tuos, kurie sužaidė bent po 50 oficialių rungtynių."

Let's calculate ROUGE:

In [None]:
rouge_types = ['rouge1', 'rouge2', 'rougeL']
scorer = MyRougeScorer(rouge_types, use_stemmer=True)
score = scorer.score(ground_truth, generated_text)
print({s:score[s].fmeasure for s in rouge_types})

{'rouge1': 0.20689655172413793, 'rouge2': 0.03571428571428572, 'rougeL': 0.1724137931034483}


We monitored training by calculating ROUGE for 4096 validation pairs and noticed that after 250000 training steps our model started to overfit.


# How we trained the tokenizer<a name='tokenizer'></a>


Now we need a very big text file. Suppose we have one with over 1000000 lines, named `"my_big_text_file.txt"`. Be warned that the following code requires a lot of memory (you can reduce the number of sampled lines by lowering `input_sentence_size`) and can take several hours.

In [None]:
import sentencepiece as spm

default_kwargs = {
    "model_type": 'unigram', "pad_id": 0, "eos_id": 1, "unk_id": 2, "bos_id": -1, "pad_piece": '<pad>',
    "eos_piece": '</s>',
    "unk_piece": '<unk>', "input_sentence_size": 1000000, "max_sentencepiece_length": 64, "add_dummy_prefix": True
}
# more options are here: https://github.com/google/sentencepiece/blob/master/doc/options.md
spm.SentencePieceTrainer.train(
    input="my_big_text_file.txt",
    model_prefix="my_new_tokenizer",
    vocab_size=32000,
    split_by_whitespace=True,
    **default_kwargs
)
# normalization_rule_name=nmt_nfkc_cf if you want to lowercase

Now that our sentencepiece model is trained, let's wrap it in a `T5Tokenizer` from the `transformers` library:

In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer("my_new_tokenizer.model", do_lower_case=False)
tokenizer._add_tokens(new_tokens=[f"<extra_id_{i}>" for i in range(100)] + ['</s>', '<pad>', '<unk>'],
                      special_tokens=True)
tokenizer.save_pretrained("MyNewT5Tokenizer")

So now you can load your trained tokenizer with:

In [None]:
tokenizer = T5Tokenizer.from_pretrained("MyNewT5Tokenizer")

# How we trained the model<a name='training_model'></a>

## Optimizer and scheduler<a name='opt'></a>
We used the [T5](https://arxiv.org/abs/1910.10683) transformer model. It was originally trained with the [Adafactor](https://arxiv.org/abs/1804.04235) optimizer. We used it with 10 000 warm-up steps followed by an inverse square root learning rate schedule. All of this is handled inside the optimizer, so we create `Dummy`, a placeholder learning rate scheduler that does nothing.

In [None]:
class Dummy:
    def step(self):
        return 1

    def get_last_lr(self):
        return [1]

    def state_dict(self):
        return {"dummy_key": 1}

    def load_state_dict(self, state_dict):
        pass

    def get_lr(self):
        return [1]
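For reference, the internal schedule can be sketched in a few lines. With `relative_step=True` and `warmup_init=True`, the relative step size in the `transformers` Adafactor implementation should be `min(1e-6 * step, 1 / sqrt(step))`; the two terms meet at step 10 000, which is where the warm-up ends. The function below is an illustrative re-implementation, not library code, and it omits the additional per-parameter scaling applied when `scale_parameter=True`.

```python
import math


def relative_step_size(step):
    """Linear warm-up (1e-6 * step) capped by inverse-square-root decay."""
    return min(1e-6 * step, 1.0 / math.sqrt(step))


# the step size grows until step 10 000 and decays afterwards
for step in (1, 1_000, 10_000, 40_000, 250_000):
    print(step, relative_step_size(step))
```

Because the optimizer computes this value itself on every update, an external scheduler has nothing left to do, which is why a placeholder is enough.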

## Data<a name='data'></a>
Our training corpus was a text file of over 6 GB and was too big to load into Colab RAM. So we:  
1. encoded it with our trained tokenizer, so each string was converted to a list of integers from 0 to 32000;  
2. since the largest value is 32000, changed the type of our lists to numpy arrays of type `uint16`, which can hold integers from 0 to 65535.

These "tricks" enabled us to load our pandas dataframe into Colab memory without memory errors.
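The saving from 16-bit storage can be illustrated without any third-party library. The sketch below uses the standard-library `array` module with type code `'H'` (an unsigned 16-bit integer on common platforms) as a stand-in for numpy's `uint16`; the token ids are example values only, and the list size is a rough estimate.

```python
import sys
from array import array

token_ids = [902, 22, 835, 1881, 3502, 401, 4, 1] * 64  # example ids, all below 65536

as_list = list(token_ids)
as_uint16 = array("H", token_ids)  # 'H' = unsigned short, 2 bytes per item

# rough size of the list: its pointer table plus the distinct int objects
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in set(as_list))
packed_bytes = as_uint16.itemsize * len(as_uint16)
print(list_bytes, packed_bytes)

# ids above 65535 do not fit into 16 bits and are rejected
try:
    array("H", [70000])
except OverflowError:
    print("70000 does not fit into 16 bits")
```

This only works because the vocabulary size stays below 65536; a larger vocabulary would need a 32-bit type.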

For illustration, we construct a toy dataset:

In [None]:
import numpy as np
import pandas as pd
from transformers import T5Tokenizer

# this will produce 10 rows with exactly the same line
df = pd.DataFrame.from_records(data=[("Čia yra naujienų straipsnio pagrindinė dalis.","O čia yra santrauka.")]*10, columns=["main", "summary"])
# load tokenizer
tokenizer = T5Tokenizer.from_pretrained("LukasStankevicius/t5-base-lithuanian-news-summaries-175")
# encode and reduce memory footprint with uint16 dtype
for col in ["main", "summary"]:
  df[col] = df[col].apply(tokenizer.encode, max_length=512, truncation=True)
  df[col] = df[col].apply(np.asarray, dtype=np.uint16)
# this will produce 400 000 rows with exactly the same line
df = pd.concat([df for i in range(40000)])
# shuffle rows
df = df.sample(frac=1)
# split to train and valid parts
df.iloc[:-4096].to_pickle("my_pandas_train_dataframe_pickle.gz")
df.iloc[-4096:].to_pickle("my_pandas_valid_dataframe_pickle.gz")

df.head()

| | main | summary |
|---|---|---|
| 8 | [902, 22, 835, 1881, 3502, 401, 4, 1] | [133, 211, 22, 1992, 892, 26, 4, 1] |
| 1 | [902, 22, 835, 1881, 3502, 401, 4, 1] | [133, 211, 22, 1992, 892, 26, 4, 1] |
| 5 | [902, 22, 835, 1881, 3502, 401, 4, 1] | [133, 211, 22, 1992, 892, 26, 4, 1] |
| 6 | [902, 22, 835, 1881, 3502, 401, 4, 1] | [133, 211, 22, 1992, 892, 26, 4, 1] |
| 5 | [902, 22, 835, 1881, 3502, 401, 4, 1] | [133, 211, 22, 1992, 892, 26, 4, 1] |


The following are the dataset class (loads the pairs) and the collator class (combines individual pairs into batches):

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset


class My_Dataset(Dataset):
    def __init__(self, pickle_path):
        df = pd.read_pickle(pickle_path)
        self.examples = list(zip(df["main"], df["summary"]))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

class MyCollator:
    """
    This collator is used for already encoded strings. It only truncates and pads.
    """

    def __init__(self, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, list_of_tuples):
        train_x, train_y = zip(*list_of_tuples)
        # truncate
        train_x, train_y = [seq[: self.max_length] for seq in train_x], [seq[: self.max_length] for seq in train_y]

        # first the targets
        n_items = len(train_y)
        tt = self.tokenizer.pad({"input_ids": train_y}, padding=True,
                                return_tensors="pt", return_attention_mask=True)

        decoder_input_ids = torch.cat((torch.zeros(size=(n_items, 1), dtype=torch.int64), tt['input_ids']), axis=1)
        decoder_attention_mask = torch.cat((torch.ones(size=(n_items, 1), dtype=torch.int64), tt['attention_mask']),
                                           axis=1)

        decoder_input_ids = decoder_input_ids[:, :-1]  # one item was prepended, so drop the last one to keep the length
        decoder_attention_mask = decoder_attention_mask[:, :-1]

        # now inputs
        inputs_dict = self.tokenizer.pad({"input_ids": train_x},  padding=True, return_tensors="pt",
                                         return_attention_mask=True)
        # finally combine the two
        return {"decoder_input_ids": decoder_input_ids, "decoder_attention_mask": decoder_attention_mask,
                "labels": tt['input_ids'], **inputs_dict}

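The core of the collator is padding the targets and shifting them one position to the right to obtain the decoder inputs. Below is a dependency-free sketch of that step on plain Python lists, assuming that pad id 0 doubles as the decoder start token (as in T5); the function name `pad_and_shift` is hypothetical.

```python
def pad_and_shift(targets, pad_id=0):
    """Pad targets to equal length and build right-shifted decoder inputs."""
    width = max(len(seq) for seq in targets)
    labels = [list(seq) + [pad_id] * (width - len(seq)) for seq in targets]
    # prepend the start token and drop the last position to keep the length
    decoder_input_ids = [[pad_id] + row[:-1] for row in labels]
    return labels, decoder_input_ids


labels, decoder_input_ids = pad_and_shift([[133, 211, 1], [22, 1]])
print(labels)             # [[133, 211, 1], [22, 1, 0]]
print(decoder_input_ids)  # [[0, 133, 211], [0, 22, 1]]
```

At every position the decoder thus sees the previous target token and is trained to predict the current one.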
## Final training script<a name='final'></a>
You will definitely need a GPU here.



In [None]:
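from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from transformers.optimization import Adafactor

output_dir = "output_directory_for_my_model"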

kwargs = TrainingArguments(
    fp16=True, per_device_train_batch_size=4, gradient_accumulation_steps=32,
    num_train_epochs=30, output_dir=output_dir, evaluation_strategy="steps", 
    per_device_eval_batch_size=4, max_grad_norm=None, logging_steps=2000, 
    save_steps=5000, eval_steps=2000, dataloader_num_workers=1, adafactor=True
)

tokenizer = T5Tokenizer.from_pretrained("LukasStankevicius/t5-base-lithuanian-news-summaries-175")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

trainer = Trainer(
    train_dataset=My_Dataset("my_pandas_train_dataframe_pickle.gz"), 
    eval_dataset=My_Dataset("my_pandas_valid_dataframe_pickle.gz"),
    model=model, data_collator=MyCollator(tokenizer), tokenizer=tokenizer,
    args=kwargs, 
    optimizers=(Adafactor((param for param in model.parameters() if param.requires_grad),
                           relative_step=True, warmup_init=True), Dummy()))

trainer.train()
trainer.save_model(output_dir)
trainer.state.save_to_json(output_dir + "/trainer_state.json")
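The batch settings above interact: gradients from 32 mini-batches of 4 examples are accumulated before each optimizer update. A small sketch of the arithmetic, assuming a single GPU and the toy dataset built earlier (400 000 rows minus 4096 validation rows):

```python
import math

per_device_batch = 4
accumulation_steps = 32
n_devices = 1  # assumption: a single GPU
train_rows = 400_000 - 4096  # toy dataset without the validation part
epochs = 30

effective_batch = per_device_batch * accumulation_steps * n_devices
updates_per_epoch = math.ceil(train_rows / effective_batch)
# effective batch size, optimizer updates per epoch, updates in total
print(effective_batch, updates_per_epoch, updates_per_epoch * epochs)
```

With these numbers, `logging_steps=2000` and `eval_steps=2000` trigger roughly every two thirds of an epoch.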