In [16]:
# Dependencies Install
#%pip install -q -U transformers datasets evaluate sacrebleu wandb accelerate huggingface_hub peft trl bitsandbytes scipy



In [1]:
import gc
gc.collect()

73

In [2]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# Translation

Parameters Definition

In [3]:
source_lang, source_lang_iso = "Spanish", "spa"
target_lang, target_lang_iso = "Wayuu", "guc" # or pbb, Paez
base_model = "t5-base" # or t5-small, t5-large, google/mt5-base, facebook/bart-large, etc

In [6]:
# Login to Hugging Face
from huggingface_hub import login
login()

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/broomva/.cache/huggingface/token
Login successful


In [4]:
# Login to Weights and Biases
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mbroomva[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/grupo8/.netrc


True

## Dataset Processing

In [7]:
from datasets import load_dataset

dataset = load_dataset(f"Broomva/translation_{target_lang_iso}_{source_lang_iso}")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 76676
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 19170
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 23962
    })
})
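
For orientation, the split sizes printed above work out to roughly a 64/16/20 division between train, validation and test. The snippet below only redoes that arithmetic from the `num_rows` values in the output; the `split_sizes` and `shares` names are illustrative.

```python
# Row counts copied from the DatasetDict output above.
split_sizes = {"train": 76676, "validation": 19170, "test": 23962}

total = sum(split_sizes.values())
# Percentage of all rows that falls in each split, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)   # 119808
print(shares)  # {'train': 64.0, 'validation': 16.0, 'test': 20.0}
```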

Then take a look at an example:

In [8]:
dataset["train"][100]

{'id': 44238,
 'translation': {'guc': "saashin tü wiwüliakat nayeena eeka sulu'u nuluwataaya maleiwa na wayuu oonookana sümaa nünüiki alateetkat mapeena nm",
  'spa': 'la biblia dice que dios acabara con ellos y un nuevo cielo el reino de dios gobernara sobre una nueva tierra la humanidad obediente'}}

## Preprocess

In [9]:
from transformers import AutoTokenizer

checkpoint = base_model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
prefix = f"translate {source_lang} to {target_lang}: "

def preprocess_function(examples):
    # The records are keyed by ISO code ("spa", "guc"), not by language name.
    inputs = [prefix + example[source_lang_iso] for example in examples["translation"]]
    targets = [example[target_lang_iso] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=512, truncation=True)
    return model_inputs
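
As a quick check of what the model will see, the sketch below applies the same prefixing to a small hand-written batch shaped like the record shown earlier. The `batch` contents (including the placeholder target strings) and the `build_pairs` helper are illustrative assumptions, not part of the dataset or of any library.

```python
source_lang, source_lang_iso = "Spanish", "spa"
target_lang, target_lang_iso = "Wayuu", "guc"
prefix = f"translate {source_lang} to {target_lang}: "

# Hypothetical mini-batch. With batched=True, map() passes a dict of lists,
# so "translation" is a list of {iso_code: text} dicts.
batch = {
    "id": [0, 1],
    "translation": [
        {"spa": "hola", "guc": "<wayuu text 1>"},
        {"spa": "la biblia dice", "guc": "<wayuu text 2>"},
    ],
}

def build_pairs(examples):
    """Return (prefixed source strings, target strings), looked up by ISO code."""
    inputs = [prefix + ex[source_lang_iso] for ex in examples["translation"]]
    targets = [ex[target_lang_iso] for ex in examples["translation"]]
    return inputs, targets

inputs, targets = build_pairs(batch)
print(inputs[0])  # translate Spanish to Wayuu: hola
```

Looking the text up by language name (`"Spanish"`) instead of ISO code would raise a `KeyError` on records of this shape.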

In [11]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map: 100%|███████████████████████████████████████████████████████████████████████████| 103111/103111 [00:08<00:00, 12162.03 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████| 25778/25778 [00:02<00:00, 11572.70 examples/s]


Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
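
To make the saving concrete, here is a toy sketch that counts padding tokens under both schemes. It is not the `DataCollatorForSeq2Seq` implementation; the token ids, the padding id `0`, and the helper names are assumptions chosen for illustration.

```python
# Made-up token-id sequences of different lengths (not real tokenizer output).
sequences = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14, 15, 16, 17, 18, 19]]
PAD_ID = 0      # assumption: 0 is the padding id
GLOBAL_MAX = 8  # length every row gets if the whole dataset is padded up front

def pad_batch(batch, pad_id=PAD_ID):
    """Pad each sequence in a batch to that batch's longest length."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

def count_pads(rows, pad_id=PAD_ID):
    """Count padding tokens across all rows."""
    return sum(row.count(pad_id) for row in rows)

# Static padding: every row is padded to GLOBAL_MAX.
static_rows = [seq + [PAD_ID] * (GLOBAL_MAX - len(seq)) for seq in sequences]

# Dynamic padding: two batches of two, each padded to its own longest row.
batches = [sequences[0:2], sequences[2:4]]
dynamic_rows = [row for batch in batches for row in pad_batch(batch)]

static_pads = count_pads(static_rows)
dynamic_pads = count_pads(dynamic_rows)
print(static_pads, dynamic_pads)  # 17 9
```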

In [12]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)


## Evaluate

In [14]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
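
The label clean-up in `compute_metrics` can be seen in isolation with a toy array. In this sketch the label ids are invented and `PAD_TOKEN_ID = 0` is an assumption standing in for `tokenizer.pad_token_id`; the `-100` value is the collator's default label padding, which has to be swapped back to a real token id before decoding.

```python
import numpy as np

IGNORE_INDEX = -100  # label padding that the loss ignores
PAD_TOKEN_ID = 0     # assumption for this sketch; use tokenizer.pad_token_id in practice

# Invented label ids for two padded target sequences.
labels = np.array([
    [37, 12, 1, -100, -100],
    [8, 1, -100, -100, -100],
])

# Keep real ids, replace ignored positions with the pad id so decoding works.
restored = np.where(labels != IGNORE_INDEX, labels, PAD_TOKEN_ID)
print(restored.tolist())  # [[37, 12, 1, 0, 0], [8, 1, 0, 0, 0]]

# sacrebleu takes a list of references per prediction, hence the nesting
# that postprocess_text applies to the labels.
decoded_labels = ["  texto uno ", "texto dos"]
references = [[label.strip()] for label in decoded_labels]
print(references)  # [['texto uno'], ['texto dos']]
```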

## Train

In [15]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [17]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [18]:
new_model_name = f'{base_model}-translation-{source_lang_iso}-{target_lang_iso}'

In [None]:
from transformers import EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir=f"./results/{new_model_name}",
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=5,
    num_train_epochs=10,
    predict_with_generate=True,
    push_to_hub=True,
    load_best_model_at_end = True,
    report_to="wandb",
    warmup_steps=10,
    logging_steps=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],  # keep the test split held out from early stopping and model selection
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)

trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




In [None]:
trainer.push_to_hub()
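
The trainer above is configured with `EarlyStoppingCallback(early_stopping_patience=1)`, so `num_train_epochs=10` is an upper bound rather than a fixed schedule. The sketch below mimics a patience rule on a made-up list of per-epoch validation losses, assuming the monitored metric is the validation loss (lower is better). It is a simplified illustration, not the `EarlyStoppingCallback` source, and the `epochs_run` helper and loss values are invented.

```python
def epochs_run(val_losses, patience=1):
    """Return (epochs completed, index of best epoch) under a patience rule."""
    best = float("inf")
    best_epoch = -1
    bad_evals = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, best_epoch, bad_evals = loss, epoch, 0
        else:
            # No improvement: stop once the counter reaches the patience.
            bad_evals += 1
            if bad_evals >= patience:
                return epoch + 1, best_epoch
    return len(val_losses), best_epoch

made_up_losses = [2.1, 1.7, 1.5, 1.6, 1.4]
print(epochs_run(made_up_losses, patience=1))  # (4, 2)
print(epochs_run(made_up_losses, patience=2))  # (5, 4)
```

With a patience of 1, a single noisy evaluation ends training even though a later epoch would have improved; a larger patience trades extra compute for robustness to that noise.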

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Come up with some text you'd like to translate. For T5, you need to prefix your input depending on the task you're working on. For translation from Spanish to Wayuu, use the same prefix that was used during training, as shown below:

In [26]:
text = "translate Spanish to Wayuu: hola"

The simplest way to try out your fine-tuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for translation with your model, and pass your text to it:

In [27]:
from transformers import pipeline

translator = pipeline("translation", model=f'Broomva/{new_model_name}')
translator(text)

[{'translation_text': 'eesü'}]

You can also run the steps that the `pipeline` performs manually if you'd like:

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [28]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(f'Broomva/{new_model_name}')
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [29]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(f'Broomva/{new_model_name}')
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
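
The `generate()` call above samples rather than decoding deterministically (`do_sample=True`, `top_k=30`, `top_p=0.95`), so its output can vary from run to run and need not match the pipeline result. As a rough illustration of what those two filters do, the sketch below applies top-k and then nucleus (top-p) filtering to a made-up next-token distribution. It is a simplified pure-Python rendering, not the transformers implementation, and the token names, probabilities, and `top_k_top_p_filter` helper are invented.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose cumulative probability reaches top_p, and renormalise the rest."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented next-token distribution over five tokens.
toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.04, "e": 0.01}
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # ['a', 'b', 'c']

# Sampling from the filtered distribution: different seeds give different draws.
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
samples = [rng.choices(tokens, weights=weights)[0] for _ in range(5)]
print(samples)
```

For repeatable translations, dropping `do_sample=True` from the `generate()` call is one option to consider.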

Decode the generated token ids back into text:

In [30]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"jo'uya"