# Learning to Fine-Tune
***
[Alejandro Ciuba](https://alejandrociuba.github.io)
***
Adapted from: https://medium.com/@radicho/fine-tuning-the-multilingual-t5-model-from-huggingface-with-keras-f7f619ec5cfe and https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-for-text2text-and-building-a-demo-with-streamlit-c72009631887
**GOAL:** Fine-tune `google/mt5-small` as a spellchecker that maps each noisy `Text` entry to its corrected `Expected` form.

In [1]:
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, 
                          AutoTokenizer, 
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

import evaluate
import nltk

import numpy as np

In [2]:
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


***
## Preprocessing

In [3]:
dataset = load_dataset("csv", data_files="datasets/train.csv")["train"].shuffle(seed=42)
type(dataset)

datasets.arrow_dataset.Dataset

In [4]:
# The "train" split was already selected above, so column_names is a flat list
dataset.column_names

['Id', 'Language', 'Text', 'Expected']

In [5]:
def preprocess(examples):
    '''
    Tokenizes the inputs and targets to feed to the model.
    '''

    padding = "max_length"
    max_length = 200

    inputs = list(examples["Text"])
    targets = list(examples["Expected"])

    # Pad or truncate every sequence to exactly max_length tokens.
    # Attention memory grows quadratically with sequence length, so max_length caps it.
    # No return_tensors here: Dataset.map stores plain lists of ints.
    tokens = tokenizer(inputs, max_length=max_length, padding=padding, truncation=True)
    labels = tokenizer(targets, max_length=max_length, padding=padding, truncation=True)

    # Replace padding in the labels with -100 so the loss ignores those positions
    tokens["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return tokens

train_dataset = dataset.map(preprocess, batched=True)

Map:   0%|          | 0/500000 [00:00<?, ? examples/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Map: 100%|██████████| 500000/500000 [01:07<00:00, 7419.92 examples/s]


In [19]:
train_dataset[10]

{'Id': 'ca2db162-eb21-4e4a-b92b-77b9ecb2da60',
 'Language': 'en',
 'Text': 'Furtner develozment funding was raised fron privatx equity sources anc in 2001 Airbus, Cathay Pacific and VRockwell Collins became shareholders.',
 'Expected': 'Further development funding was raised from private equity sources and in 2001 Airbus, Cathay Pacific and Rockwell Collins became shareholders.',
 'input_ids': [19159, 92033, 269, 110879, 360, 1143, 259, 80088, 639, 52220,
  345, 197807, 11144, 329, 259, 265, 75887, 259, 46337, 461,
  297, 281, 6003, 259, 94982, 261, 3942, 15265, 35979, 305,
  434, 52278, 8311, 97079, 330, 41428, 8364, 88091, 260, 1,
  0, 0, 0, ... (padding zeros; output truncated)

***
## Preparing for Fine-Tuning

In [12]:
batch_size = 8
name = "mt5-practice-spellchecker"
output_dir = f"models/{name}"  # avoid shadowing the built-in dir()

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
)
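
As a quick sanity check on these settings, the step counts can be worked out by hand. The sketch below assumes a single GPU and no gradient accumulation (neither is set above), and takes the 500,000 examples reported by the `Map` progress bar. The variable names are illustrative only.

```python
import math

num_examples = 500_000   # rows reported by the Map progress bar
batch_size = 8           # per_device_train_batch_size
num_devices = 1          # assumption: one GPU
grad_accum = 1           # assumption: default gradient_accumulation_steps
num_epochs = 1           # num_train_epochs

steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices * grad_accum))
total_steps = steps_per_epoch * num_epochs

eval_steps, save_steps = 100, 200
num_evals = total_steps // eval_steps   # evaluations over the whole run
num_saves = total_steps // save_steps   # checkpoints written (only 1 kept on disk)

print(total_steps, num_evals, num_saves)
```

Under these assumptions the run is 62,500 optimizer steps, which agrees with the total shown by the training progress bar. Because `save_steps` is a multiple of `eval_steps`, every checkpoint coincides with an evaluation, which `load_best_model_at_end=True` relies on.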

In [13]:
# Load our metric
metric = evaluate.load("bleu")

In [14]:
# Compute the BLEU score used as the evaluation metric (not the training loss)
def compute_metrics(eval_pred):

    preds, labels = eval_pred

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # -100 marks ignored label positions; swap it for the pad token so decoding works
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    # metric.compute also returns non-scalar entries (e.g. a list of precisions), so round only the score
    return {"bleu": round(result["bleu"], 4)}

In [15]:
data_collocator=DataCollatorForSeq2Seq(tokenizer=tokenizer)
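
`DataCollatorForSeq2Seq` pads each batch to its longest sequence and, by default, uses `-100` as the padding value for labels. The sketch below is a pure-Python illustration of that idea, not the library's implementation: `pad_batch` is a hypothetical helper and the token IDs are made up (0 stands in for the pad ID and 1 for end-of-sequence, as in the tokenized example above).

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch (hypothetical helper)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        # Inputs are padded with the tokenizer's pad ID and masked out of attention
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Labels are padded with -100 so those positions are skipped by the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

# Made-up token IDs, each sequence ending in the end-of-sequence ID 1
features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 22, 1]},
    {"input_ids": [14, 1], "labels": [23, 24, 25, 26, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Since `preprocess` already pads every example to `max_length`, the collator has nothing left to pad in this notebook; per-batch padding only saves memory if padding is deferred to the collator.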

In [16]:
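# NOTE: evaluation_strategy="steps" needs an eval_dataset; none is passed here,
# so the first evaluation (step 100) would fail unless a held-out split is added.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=data_collocator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)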

***
## Fine-Tuning

In [17]:
trainer.train()

  0%|          | 0/62500 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.49 GiB (GPU 0; 7.78 GiB total capacity; 6.06 GiB already allocated; 1.04 GiB free; 6.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
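
The 1.49 GiB allocation in the error is about the size of one float32 logits tensor for a full batch. The arithmetic below is a rough check, not a profile of the run. It assumes the mT5 vocabulary has 250,112 entries (a value not shown anywhere in this notebook) and that the failing tensor has shape (batch, sequence, vocabulary); `logits_gib` is a hypothetical helper.

```python
batch_size = 8        # per_device_train_batch_size
seq_len = 200         # max_length used when padding the targets
vocab_size = 250_112  # assumption: vocabulary size of google/mt5-small
bytes_per_value = 4   # assumption: logits held in float32

def logits_gib(batch, length, vocab=vocab_size, nbytes=bytes_per_value):
    """Size in GiB of a (batch, length, vocab) tensor."""
    return batch * length * vocab * nbytes / 2**30

full = logits_gib(batch_size, seq_len)
half_batch = logits_gib(batch_size // 2, seq_len)
print(f"{full:.2f} GiB at batch size {batch_size}")
print(f"{half_batch:.2f} GiB at batch size {batch_size // 2}")
```

If these assumptions hold, lowering `per_device_train_batch_size` or padding to a shorter length shrinks this tensor proportionally, which may be enough to fit on a GPU of this size.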