<a href="https://colab.research.google.com/github/HamdanXI/nlp_adventure/blob/main/T5_Translation_Gloss_to_Text%20WORKING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!huggingface-cli login

In [1]:
!pip install "transformers[sentencepiece,torch]" datasets sacrebleu



In [6]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration

raw_datasets = load_dataset("aslg_pc12")
checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

raw_datasets = raw_datasets.rename_column("text", "labels")

def tokenize_function(example):
    source = example["gloss"]
    target = example["labels"]

    # Tokenize source and target as plain Python lists, without padding;
    # the data collator pads each batch dynamically at training time.
    tokenized_source = tokenizer(source, truncation=True, max_length=512)
    tokenized_target = tokenizer(target, truncation=True, max_length=512)

    return {
        "input_ids": tokenized_source["input_ids"],
        "attention_mask": tokenized_source["attention_mask"],
        "labels": tokenized_target["input_ids"],
    }

tokenized_datasets = raw_datasets.map(tokenize_function)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)
tokenized_datasets["validation"] = tokenized_datasets.pop("test")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Map:   0%|          | 0/87710 [00:00<?, ? examples/s]
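
The cell above leaves every example unpadded and relies on `DataCollatorForSeq2Seq` to pad each batch. As a rough illustration of that idea, here is a minimal pure-Python sketch. It is not the library's implementation: `pad_batch` and the toy token ids are hypothetical, and it assumes right padding, T5's pad token id of 0, and the collator's default label padding value of -100 (which the loss ignores).

```python
# Illustrative sketch of dynamic padding; NOT the transformers implementation.
LABEL_PAD_ID = -100  # assumed: default label_pad_token_id of DataCollatorForSeq2Seq


def pad_batch(features, pad_token_id=0):
    """Right-pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        # Label padding uses -100 so padded positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * lab_pad)
    return batch


# Toy features with made-up token ids (hypothetical, not real T5 ids).
toy_features = [
    {"input_ids": [11, 12, 13, 1], "attention_mask": [1, 1, 1, 1], "labels": [7, 8, 1]},
    {"input_ids": [21, 1], "attention_mask": [1, 1], "labels": [9, 1]},
]
padded = pad_batch(toy_features)
print(padded)
```

Padding per batch rather than to a fixed 512 tokens keeps short gloss sentences from wasting compute on padding.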

In [7]:
from transformers import TrainingArguments
from transformers import Trainer

training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [8]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.7977
1000,0.4943
1500,0.4195
2000,0.3743
2500,0.3424
3000,0.3314
3500,0.3214
4000,0.3036
4500,0.2898
5000,0.2771


TrainOutput(global_step=29604, training_loss=0.21976070857631114, metrics={'train_runtime': 8462.3324, 'train_samples_per_second': 27.985, 'train_steps_per_second': 3.498, 'total_flos': 1.49410672939008e+16, 'train_loss': 0.21976070857631114, 'epoch': 3.0})
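
As a sanity check on the numbers in `TrainOutput`, the step count and throughput can be recomputed by hand. This is a sketch under stated assumptions: a single device, the `TrainingArguments` default batch size of 8, the default 3 epochs, and a 10% validation split rounded up. Only the dataset size (87710) and the runtime come from the outputs above.

```python
import math

n_examples = 87710                    # from the Map progress bar above
n_test = math.ceil(n_examples / 10)   # assumed: 10% validation split, rounded up
n_train = n_examples - n_test

batch_size = 8   # assumed: TrainingArguments default, single device
epochs = 3       # matches 'epoch': 3.0 in TrainOutput

# Each epoch needs one step per (possibly partial) batch.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

train_runtime = 8462.3324  # seconds, from TrainOutput
samples_per_second = round(n_train * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(n_train, steps_per_epoch, total_steps)
print(samples_per_second, steps_per_second)
```

Under these assumptions the recomputed values agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.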

In [9]:
trainer.save_model()

In [13]:
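# Trainer.push_to_hub() has no `repo_name` parameter, the likely cause of the TypeError
# this cell originally raised; push the model and tokenizer to a named repo directly.
trainer.model.push_to_hub("t5-gloss-to-text")
tokenizer.push_to_hub("t5-gloss-to-text")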