# GEC model training

This notebook is used to train a grammatical error correction model and generate test set predictions for the "Recent Advances in Natural Language Generation" course project.

Important things to note:

* Pay attention to code comments saying `!!! TODO !!!` as these indicate parts that you may need to update.
* Before training your model, make sure that GPUs are enabled for this notebook:
  - Navigate to Edit→Notebook Settings
  - Select GPU from the Hardware Accelerator drop-down

This notebook is based on: https://github.com/huggingface/notebooks/blob/master/examples/summarization-tf.ipynb

In [1]:
! pip install transformers[sentencepiece]
! pip install datasets
! pip install rouge-score nltk
! pip install jiwer



In [2]:
from datasets import load_metric
from datasets import Dataset
from google.colab import files
import numpy as np
import nltk
import time
import transformers
from transformers import AutoTokenizer
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from tqdm.auto import tqdm
from typing import Optional, Sequence

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Loading the dataset

In [4]:
def read_dataset(path: str, contains_targets: bool = True) -> Dataset:
  """Reads in data from a text/tsv file and returns a Huggingface dataset.

  Each line of the file should contain a source-target pair separated by a tab
  when `contains_targets` is True. When it's False, each line should contain
  just a source (used for the test set that doesn't have targets).
  """
  sources = []
  target_lists = []
  with open(path) as f:
    for line in f:
      line = line.rstrip('\n')
      if contains_targets:
        source, target = line.split('\t')
      else:
        source, target = line, ''
      sources.append(source)
      target_lists.append([target])
  features = {"sentence": sources, "corrections": target_lists}
  return Dataset.from_dict(features)


def read_dataset2(path: str, contains_targets: bool = True) -> Dataset:
  """Reads in data from a text/tsv file and returns a Huggingface dataset.

  Unlike `read_dataset`, each line is expected to contain the corrected
  sentence first and the corrupted sentence second, separated by a tab, when
  `contains_targets` is True. Malformed lines are skipped. When it's False,
  each line should contain just a source sentence.
  """
  corrections = []
  sentences = []
  with open(path, 'rb') as f:
    for line in f:
      line = line.rstrip(b'\n')
      line = line.rstrip(b'"\r')  # Drop trailing quotes / carriage returns.
      try:
        if contains_targets:
          correction, sentence = line.split(b'\t')
          correction = correction.decode()
          if correction[0] == '"':  # Drop a leading quote.
            correction = correction[1:]
          sentence = sentence.decode()
        else:
          correction, sentence = '', line.decode()
      except (ValueError, IndexError):
        # Wrong number of columns, undecodable bytes or an empty first column.
        continue
      corrections.append([correction])
      sentences.append(sentence)
  features = {"sentence": sentences, "corrections": corrections}
  return Dataset.from_dict(features)
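
As a quick illustration of the file format `read_dataset` expects, the sketch below parses a few made-up tab-separated lines into the same `sentence` / `corrections` layout using plain Python lists. The example sentences and the `parse_lines` helper are hypothetical and only meant to show the shape of the data, not to replace the functions above.

```python
from typing import Dict, List


def parse_lines(lines: List[str], contains_targets: bool = True) -> Dict[str, list]:
  """Mirrors the parsing in `read_dataset`, minus the `Dataset` wrapper."""
  sources, target_lists = [], []
  for line in lines:
    line = line.rstrip('\n')
    if contains_targets:
      source, target = line.split('\t')
    else:
      source, target = line, ''
    sources.append(source)
    target_lists.append([target])  # One-element list of corrections.
  return {"sentence": sources, "corrections": target_lists}


# Hypothetical training lines: "<source>\t<target>".
train = parse_lines(["She go home .\tShe goes home .\n",
                     "I has a cat .\tI have a cat .\n"])
# Hypothetical test lines without targets.
test = parse_lines(["He like tea .\n"], contains_targets=False)

print(train["sentence"][0])     # She go home .
print(train["corrections"][0])  # ['She goes home .']
print(test["corrections"][0])   # ['']
```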

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Download a small sample of the cLang-8 dataset 
# (full data at: https://github.com/google-research-datasets/clang8).
! wget https://emalmi.kapsi.fi/gec/clang8_en_10k_sample.tsv
! wget https://emalmi.kapsi.fi/gec/clang8_en_100_test_sample.tsv

--2022-04-11 14:06:28--  https://emalmi.kapsi.fi/gec/clang8_en_10k_sample.tsv
Resolving emalmi.kapsi.fi (emalmi.kapsi.fi)... 91.232.155.81, 2001:67c:1be8:1337::443
Connecting to emalmi.kapsi.fi (emalmi.kapsi.fi)|91.232.155.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1081663 (1.0M) [text/tab-separated-values]
Saving to: ‘clang8_en_10k_sample.tsv.5’


2022-04-11 14:06:28 (5.22 MB/s) - ‘clang8_en_10k_sample.tsv.5’ saved [1081663/1081663]

--2022-04-11 14:06:28--  https://emalmi.kapsi.fi/gec/clang8_en_100_test_sample.tsv
Resolving emalmi.kapsi.fi (emalmi.kapsi.fi)... 91.232.155.81, 2001:67c:1be8:1337::443
Connecting to emalmi.kapsi.fi (emalmi.kapsi.fi)|91.232.155.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10069 (9.8K) [text/tab-separated-values]
Saving to: ‘clang8_en_100_test_sample.tsv.5’


2022-04-11 14:06:28 (343 MB/s) - ‘clang8_en_100_test_sample.tsv.5’ saved [10069/10069]



In [7]:
# !!! TODO !!! Instead of this cLang-8 training data, you should upload the
# dataset you've generated yourself. You can use `uploaded = files.upload()` to
# upload a locally stored dataset.
DatasetAll = read_dataset2('./Dataset1302050.tsv')
# Shuffle the dataset.
DatasetAll = DatasetAll.shuffle(seed=42)

# Keep 5% of the rows for training and 0.01% for validation.
dataDict = DatasetAll.train_test_split(train_size=0.05, test_size=0.0001)
training_set = dataDict['train']
validation_set = dataDict['test']

In [8]:
training_set

Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 102216
})

An example element from the training set:

In [9]:
training_set[984]

{'corrections': ['Sakal Arrondissement is an arrondissement of the Louga Department in the Louga Region of Senegal.'],
 'sentence': 'Sakal Arrondissement is an before can Arrondissement of the Louga Department ins the Louga Region of Senegal.'}

Load a metric to evaluate your model during training. The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [10]:
metric = load_metric("wer")
metric

Metric(name: "wer", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Compute WER score of transcribed segments against references.

Args:
    references: List of references for each speech input.
    predictions: List of transcriptions to score.
    concatenate_texts (bool, default=False): Whether to concatenate all input texts or compute WER iteratively.

Returns:
    (float): the word error rate

Examples:

    >>> predictions = ["this is the prediction", "there is an other sample"]
    >>> references = ["this is the reference", "there is another one"]
    >>> wer = datasets.load_metric("wer")
    >>> wer_score = wer.compute(predictions=predictions, references=references)
    >>> print(wer_score)
    0.5
""", stored examples: 0)

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings.
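
To make the number returned by the metric less opaque, here is a minimal pure-Python sketch of a word error rate: the total number of word-level edits divided by the total number of reference words. The helper names (`edit_distance`, `word_error_rate`) are hypothetical and this is not the library's implementation, but on the example from the metric's usage text it gives the same value.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
  """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
  prev = list(range(len(hyp) + 1))
  for i, ref_word in enumerate(ref, start=1):
    curr = [i]
    for j, hyp_word in enumerate(hyp, start=1):
      cost = 0 if ref_word == hyp_word else 1
      curr.append(min(prev[j] + 1,          # deletion
                      curr[j - 1] + 1,      # insertion
                      prev[j - 1] + cost))  # substitution or match
    prev = curr
  return prev[-1]


def word_error_rate(predictions: List[str], references: List[str]) -> float:
  """Total word edits divided by total reference words."""
  errors = sum(edit_distance(r.split(), p.split())
               for p, r in zip(predictions, references))
  total_words = sum(len(r.split()) for r in references)
  return errors / total_words


example_predictions = ["this is the prediction", "there is an other sample"]
example_references = ["this is the reference", "there is another one"]
example_wer = word_error_rate(example_predictions, example_references)
print(example_wer)  # 0.5
```

Lower is better: a value of 0 means every prediction matches its reference word for word.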

## Loading the pre-trained model

Evaluate your model with `t5-small`. Additionally, you're welcome to try out other models / model sizes.

In [11]:
model_checkpoint = 't5-small'

# Other pre-trained checkpoints you may want to try:
# model_checkpoint = 'google/t5-v1_1-small'
# model_checkpoint = 'google/t5-v1_1-base'

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs, converts the tokens to their corresponding IDs in the pretrained vocabulary, puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on one sentence or a pair of sentences:

In [13]:
tokenizer("I have travelled around most of European countries and I be able to understanding , talking and writin highly three languages : Spanish , English and French .")

{'input_ids': [27, 43, 3, 21043, 300, 167, 13, 1611, 1440, 11, 27, 36, 3, 179, 12, 1705, 3, 6, 2508, 11, 3, 210, 13224, 29, 1385, 386, 8024, 3, 10, 5093, 3, 6, 1566, 11, 2379, 3, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [14]:
with tokenizer.as_target_tokenizer():
  print(tokenizer(["Hello, this is a sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than the given `max_length` is truncated to that length. Padding is dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.

In [15]:
max_input_length = 128
max_target_length = 128


def preprocess_function(examples):
  inputs = [doc for doc in examples["sentence"]]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

  # Setup the tokenizer for targets
  with tokenizer.as_target_tokenizer():
    labels = [corr[0] for corr in examples["corrections"]]
    labels = tokenizer(labels, max_length=max_target_length, truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
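
The truncation and per-batch padding described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the token ids are made up, `PAD_ID` is an assumed padding id, and truncation is a plain slice, whereas a real tokenizer also takes care of special tokens.

```python
from typing import List

PAD_ID = 0  # Assumed padding id for this toy example.


def truncate(ids: List[int], max_length: int) -> List[int]:
  """Keeps at most `max_length` token ids."""
  return ids[:max_length]


def pad_batch(batch: List[List[int]]) -> List[List[int]]:
  """Pads every sequence to the longest sequence in this batch only."""
  longest = max(len(ids) for ids in batch)
  return [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]


# Hypothetical token-id sequences of lengths 3, 5, 2 and 9.
toy_dataset = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32], list(range(41, 50))]
toy_dataset = [truncate(ids, max_length=6) for ids in toy_dataset]

toy_batches = [toy_dataset[0:2], toy_dataset[2:4]]
padded = [pad_batch(batch) for batch in toy_batches]
print([len(batch[0]) for batch in padded])  # [5, 6]
```

The first batch is padded to length 5 only, even though the second batch needs length 6, which is the saving that per-batch padding provides.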

In [16]:
tokenized_training_set = training_set.map(preprocess_function, batched=True)
tokenized_validation_set = validation_set.map(preprocess_function, batched=True)

  0%|          | 0/103 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [17]:
# Default maximum length used by `model.generate`.
model.config.max_length = 128

The results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The library is normally smart enough to detect when the function you pass to `map` has changed, in which case the cached data is not reused. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
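
To illustrate what a batched `map` does conceptually, the sketch below applies a function to slices of a column dictionary and concatenates the results. `batched_map` and `count_words` are hypothetical helpers written for this illustration, not part of 🤗 Datasets; the batch size of 1000 is assumed to be the library default.

```python
import math
from typing import Callable, Dict


def batched_map(columns: Dict[str, list],
                fn: Callable[[Dict[str, list]], Dict[str, list]],
                batch_size: int = 1000) -> Dict[str, list]:
  """Applies `fn` to slices of `batch_size` rows and concatenates the results."""
  num_rows = len(next(iter(columns.values())))
  output: Dict[str, list] = {}
  for start in range(0, num_rows, batch_size):
    batch = {name: values[start:start + batch_size]
             for name, values in columns.items()}
    for name, values in fn(batch).items():
      output.setdefault(name, []).extend(values)
  return output


def count_words(batch: Dict[str, list]) -> Dict[str, list]:
  # Receives a dict of lists, just like `preprocess_function` does.
  return {"num_words": [len(s.split()) for s in batch["sentence"]]}


toy_columns = {"sentence": ["a b c", "d e", "f", "g h i j", "k"]}
toy_result = batched_map(toy_columns, count_words, batch_size=2)
print(toy_result["num_words"])  # [3, 2, 1, 4, 1]

# With 1000 rows per batch, 102216 training rows need 103 calls,
# which matches the `0/103` progress bar shown earlier.
num_calls = math.ceil(102216 / 1000)
print(num_calls)  # 103
```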

## Fine-tuning the model

In [18]:
# !!! TODO !!! You're welcome to tune the following hyper-parameters but it's
# also fine to use the following default values.
batch_size = 80
learning_rate = 1e-3
# learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 2

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [19]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model,
                                       return_tensors="tf")
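
As a rough picture of what such a collator produces, the sketch below pads inputs and labels separately, each to the longest length in the batch. It assumes a pad token id of 0 (as for T5) and a label padding value of -100, and it returns plain lists rather than tensors. The `collate` helper and the token ids are hypothetical; the real collator does more (for example, it can also prepare decoder inputs).

```python
from typing import Dict, List

PAD_TOKEN_ID = 0     # Assumed pad id of the tokenizer (0 for T5).
LABEL_PAD_ID = -100  # Value used to mask padded label positions.


def collate(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
  """Pads inputs and labels separately to their longest length in the batch."""
  max_in = max(len(f["input_ids"]) for f in features)
  max_lab = max(len(f["labels"]) for f in features)
  batch = {"input_ids": [], "attention_mask": [], "labels": []}
  for f in features:
    n_in, n_lab = len(f["input_ids"]), len(f["labels"])
    batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
    batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
    batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
  return batch


toy_features = [
    {"input_ids": [8774, 6, 48, 1], "labels": [100, 19, 1]},
    {"input_ids": [100, 1], "labels": [100, 19, 430, 7142, 1]},
]
toy_batch = collate(toy_features)
print(toy_batch["input_ids"][1])       # [100, 1, 0, 0]
print(toy_batch["attention_mask"][1])  # [1, 1, 0, 0]
print(toy_batch["labels"][0])          # [100, 19, 1, -100, -100]
```

Padding labels with a value that is not a valid token id lets the loss skip those positions later on.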

In [20]:
validation_set

Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 205
})

Now we convert our input datasets to TF datasets using this collator. There's a built-in method for this: `to_tf_dataset()`. Make sure to specify the collator we just created as our `collate_fn`!

Computing the `WER` metric can be slow because it requires the model to generate outputs token by token. This stays manageable here because the validation set contains only 205 examples.

In [21]:
tf_training_set = tokenized_training_set.to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
tf_validation_set = tokenized_validation_set.to_tf_dataset(
    batch_size=8,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)

Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally - we can train on this as our loss value simply by not specifying a loss when we `compile()`.

In [22]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
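
For intuition about the kind of loss a sequence-to-sequence model computes internally, here is a minimal sketch of a token-level cross-entropy that skips label positions marked with -100. It is written in plain Python on toy numbers; `masked_cross_entropy` is a hypothetical helper, and the exact reduction used inside the library may differ.

```python
import math
from typing import List

IGNORE_INDEX = -100  # Label value marking padded positions.


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
  """Mean negative log-likelihood over positions whose label is not masked."""
  losses = []
  for scores, label in zip(logits, labels):
    if label == IGNORE_INDEX:
      continue  # Padded position: contributes nothing to the loss.
    log_norm = math.log(sum(math.exp(s) for s in scores))
    losses.append(log_norm - scores[label])
  return sum(losses) / len(losses)


# Toy example: 3 positions, vocabulary of 2 tokens, last position is padding.
toy_logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
toy_labels = [0, 1, IGNORE_INDEX]
toy_loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(toy_loss, 4))  # 0.6931
```

The two unmasked positions have uniform scores over two tokens, so each contributes log(2), and the padded position is ignored entirely.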


Now we can train our model. We add a `KerasMetricCallback` to compute advanced metrics. Many common NLP metrics, such as WER or ROUGE, are hard to fit into a compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer and on calling arbitrary Python functions. The `KerasMetricCallback` wraps a metric function and outputs its metrics as training progresses.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.

In [23]:
def metric_fn(eval_predictions):
  predictions, labels = eval_predictions
  for prediction in predictions:
    prediction[prediction < 0] = tokenizer.pad_token_id  # Replace masked label tokens
  decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  for label in labels:
    label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
  result = {"wer": metric.compute(predictions=decoded_predictions,
                                  references=decoded_labels)}
  # Add mean generated length
  prediction_lens = [
      np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
  ]
  result["gen_len"] = np.mean(prediction_lens)

  return result
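
The reason `metric_fn` replaces negative ids before decoding can be shown with a toy vocabulary: -100 marks padded label positions and is not a valid token id, so it has to be swapped for the pad id, which the decoder then skips as a special token. Everything in this sketch (`TOY_VOCAB`, `toy_decode`, the ids) is hypothetical and only mimics the behaviour of a real tokenizer.

```python
from typing import Dict, List

PAD_ID = 0  # Assumed pad id for this toy vocabulary.
EOS_ID = 1  # Assumed end-of-sequence id.
TOY_VOCAB: Dict[int, str] = {2: "She", 3: "goes", 4: "home", 5: "."}


def toy_decode(ids: List[int]) -> str:
  """Joins known tokens, skipping the special pad and end-of-sequence ids."""
  return " ".join(TOY_VOCAB[i] for i in ids if i not in (PAD_ID, EOS_ID))


label_ids = [2, 3, 4, 5, 1, -100, -100]  # -100 marks padded label positions.
# -100 is not a valid vocabulary id, so swap it for the pad id first.
cleaned = [PAD_ID if i < 0 else i for i in label_ids]
decoded = toy_decode(cleaned)
print(decoded)  # She goes home .
```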

In [24]:
%%time

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=tf_validation_set, predict_with_generate=True
)
callbacks = [metric_callback]

model.fit(
    tf_training_set, validation_data=tf_validation_set, epochs=num_train_epochs,
    callbacks=callbacks
)



Epoch 1/2
Epoch 2/2
CPU times: user 35min 39s, sys: 7min, total: 42min 39s
Wall time: 56min 11s


In [25]:
model

<transformers.models.t5.modeling_tf_t5.TFT5ForConditionalGeneration at 0x7f5a24b1bad0>

In [26]:
def run_inference(model, dataset, tokenizer):
  all_sources = []
  all_predictions = []
  for i, batch in enumerate(dataset):
    print(f'Batch: {i}')
    attention_mask = batch["attention_mask"]
    predictions = model.generate(batch['input_ids'],
                                 attention_mask=attention_mask)
    decoded_predictions = tokenizer.batch_decode(predictions,
                                                 skip_special_tokens=True)
    all_predictions.extend(decoded_predictions)
    all_sources.extend(tokenizer.batch_decode(batch['input_ids'],
                                              skip_special_tokens=True))
  return all_sources, all_predictions

In [27]:
sources, predictions = run_inference(model, tf_validation_set, tokenizer)
for source, pred in zip(sources, predictions):
  print(f'Source:     {source}')
  print(f'Prediction: {pred}\n')

Batch: 0
Batch: 1
Batch: 2
Batch: 3
Batch: 4
Batch: 5
Batch: 6
Batch: 7
Batch: 8
Batch: 9
Batch: 10
Batch: 11
Batch: 12
Batch: 13
Batch: 14
Batch: 15
Batch: 16
Batch: 17
Batch: 18
Batch: 19
Batch: 20
Batch: 21
Batch: 22
Batch: 23
Batch: 24
Batch: 25
Source:     more 1950) Yurus (10.
Prediction: 1950) Yurus (10.

Source:     His failed sugar beet mill, which had was known for many yehars as Folly"", was ""Ross' demolished in 1908.
Prediction: His failed sugar beet mill, which had been known for many years as ""Ross' demolished in 1908.

Source:     The works of (for example Stratonice, Méhul 1792; Arioadnt, most 1799), Cherubini (Lodoska, 1791; Médée, 1797; Les dexu journées, 1800) and Le Sueur (La caverne, 1793) in particular show the influence of serious French operat, especially Gluck, and a willingness to taken ont previously taboo subjects (e.g. incest in Méhul's ons few Mélidore et Phros
Prediction: The works of (for example Stratonice, Méhul 1792; Ariodnt, 1799), Cherubini (Lodos

## Download and preprocess the BEA test data

The original dataset comes from:
https://www.cl.cam.ac.uk/research/nl/bea2019st/data/wi+locness_v2.1.bea19.tar.gz

In [28]:
! wget https://emalmi.kapsi.fi/gec/ABCN.test.bea19.orig

--2022-04-11 15:06:00--  https://emalmi.kapsi.fi/gec/ABCN.test.bea19.orig
Resolving emalmi.kapsi.fi (emalmi.kapsi.fi)... 91.232.155.81, 2001:67c:1be8:1337::443
Connecting to emalmi.kapsi.fi (emalmi.kapsi.fi)|91.232.155.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 437326 (427K)
Saving to: ‘ABCN.test.bea19.orig.3’


2022-04-11 15:06:00 (2.01 MB/s) - ‘ABCN.test.bea19.orig.3’ saved [437326/437326]



In [29]:
inference_batch_size = 512
bea_dataset = read_dataset('ABCN.test.bea19.orig', contains_targets=False)
tokenized_bea_dataset = bea_dataset.map(preprocess_function, batched=True)
tf_bea_dataset = tokenized_bea_dataset.to_tf_dataset(
    batch_size=inference_batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator)

  0%|          | 0/5 [00:00<?, ?ba/s]

In [34]:
%%time
# Run inference
sources, predictions = run_inference(model, tf_bea_dataset, tokenizer)

print('Sample predictions:\n')
for source, pred in list(zip(sources, predictions))[20:40]:
  print(f'{source}\n{pred}\n')

Batch: 0
Batch: 1
Batch: 2
Batch: 3
Batch: 4
Batch: 5
Batch: 6
Batch: 7
Batch: 8
Sample predictions:

What I'm sure about is the fact that I'm going to need an oficial certificate in order to prove that I've studied the language.
What I am sure about is the fact that I am going to need an official certificate in order to prove that I have studied the language.

I study English because I love it and I 'd like to speak very fluently and watch a movie without the subtitle.
I study English because I love it and I'd like to speak fluently and watch a movie without the subtitle.

I understand perfectly when a native speaker is talking but when I have to talk in English it's not so easy.
I understand perfectly when a native speaker is talking but when I have to talk in English it's not so easy.

I'm very motivated and I'm going to continue with my studies, but I need to have clear where I'm going.
I am very motivated and I am going to continue with my studies, but I need to have clear where I

## Download the predictions

Download the predictions locally as a zip file and upload them to: https://competitions.codalab.org/competitions/20228#participate

NB: You need to sign up to the site before you can upload the predictions.

In [31]:
import zipfile
import datetime

pred_fname = f'bea_predictions_{datetime.datetime.now().isoformat()}.txt'
with open(pred_fname, 'w') as fout:
  for pred in predictions:
    fout.write(f'{pred}\n')

zip_fname = pred_fname + '.zip'
# Use a context manager so the archive is closed (and fully written) before
# it is downloaded.
with zipfile.ZipFile(zip_fname, mode='w') as zf:
  zf.write(pred_fname)
files.download(zip_fname)
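
Before uploading, it can be worth reading the archive back to confirm that it contains one prediction per line. The following is a minimal sketch using made-up predictions and a temporary directory; the file name and the `example_predictions` list are hypothetical stand-ins for the notebook's own variables.

```python
import os
import tempfile
import zipfile

# Hypothetical predictions; in the notebook these come from `run_inference`.
example_predictions = ["I have a cat .", "She goes home ."]

with tempfile.TemporaryDirectory() as tmp_dir:
  pred_path = os.path.join(tmp_dir, "example_predictions.txt")
  with open(pred_path, "w") as fout:
    for pred in example_predictions:
      fout.write(f"{pred}\n")

  zip_path = pred_path + ".zip"
  # The context manager closes the archive so that it is fully written.
  with zipfile.ZipFile(zip_path, mode="w") as zf:
    zf.write(pred_path, arcname=os.path.basename(pred_path))

  # Read the archive back and split its single member into lines.
  with zipfile.ZipFile(zip_path) as zf:
    archived_names = zf.namelist()
    archived_lines = zf.read(archived_names[0]).decode().splitlines()

print(archived_names)       # ['example_predictions.txt']
print(len(archived_lines))  # 2
```

If the number of archived lines differs from the number of input sentences, a prediction containing a line break is a likely cause.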
