
# **Fine-tuning the T5 (transformer) model on a different dataset**
---


## 1. Prepare the environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# install PyTorch with GPU acceleration
# (see https://pytorch.org/get-started/locally/ )
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu114

# install sentencepiece (subword tokenizer) along with fairseq and its config libraries
!pip install omegaconf hydra-core fairseq sentencepiece

# install huggingface libraries
!pip install transformers datasets evaluate

# install additional packages
!pip install protobuf==3.20.3
!pip install absl-py rouge_score nltk
!pip install numpy

# install jupyter if you run code in notebook
!pip install jupyter
!pip install datasets transformers==4.28.0
!pip install --upgrade accelerate
!pip install bert_score
!pip install transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu114
Collecting omegaconf
  Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)
Collecting hydra-core
  Downloading hydra_core-1.3.2-py3-none-any.whl (154 kB)
Collecting fairseq
  Downloading fairseq-0.12.2.tar.gz (9.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64

In [3]:
import os
# Drive path used to save checkpoints and load the trained model
content="/content/drive/MyDrive/HTK"

## **2. Load the dataset**
*The T5 model was originally pre-trained on the Colossal Clean Crawled Corpus (C4) dataset.* Details: https://jmlr.org/papers/volume21/20-074/20-074.pdf

C4 dataset link: https://www.tensorflow.org/datasets/catalog/c4?hl=vi

In this notebook, our group uses the xlsum (XL-Sum) dataset to fine-tune the model.

Because of hardware limits, our group trains on a reduced dataset: the code below keeps only the first 80% of each split.

In [22]:
from datasets import load_dataset,DatasetDict

# load the first 80% of each split: "train", "test", "validation"
ds_train = load_dataset("csebuetnlp/xlsum", name="english",split="train[:80%]")
ds_test = load_dataset("csebuetnlp/xlsum", name="english",split="test[:80%]")
ds_val = load_dataset("csebuetnlp/xlsum", name="english",split="validation[:80%]")

#create dataset dictionary
ds = DatasetDict({"train":ds_train,"test":ds_test,"validation":ds_val})
ds



DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 245218
    })
    test: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 9228
    })
    validation: Dataset({
        features: ['id', 'url', 'title', 'summary', 'text'],
        num_rows: 9228
    })
})

In [5]:
ds["train"][0]

{'id': 'uk-wales-56321577',
 'url': 'https://www.bbc.com/news/uk-wales-56321577',
 'title': 'Weather alert issued for gale force winds in Wales',
 'summary': 'Winds could reach gale force in Wales with stormy weather set to hit the whole of the country this week.',

## **3. Prepare the model**

### **3.1 Prepare the T5 tokenizer**

In [6]:
from transformers import AutoTokenizer
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small",use_fast=False)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

In [23]:
def tokenize_sample_data(data):
  # The longest input and the longest label are 14536 and 215 tokens, respectively.
  # Truncate both to shorter lengths to save memory and compute time
  # and to avoid out-of-memory errors.
  input_feature = t5_tokenizer(data["text"], truncation=True, max_length=1024)
  label = t5_tokenizer(data["summary"], truncation=True, max_length=128)
  return {
    "input_ids": input_feature["input_ids"],
    "attention_mask": input_feature["attention_mask"],
    "labels": label["input_ids"],
  }

tokenized_ds = ds.map(
  tokenize_sample_data,
  remove_columns=["id", "url", "title", "summary", "text"],
  batched=True,
  batch_size=128)

tokenized_ds

Map:   0%|          | 0/245218 [00:00<?, ? examples/s]

Map:   0%|          | 0/9228 [00:00<?, ? examples/s]

Map:   0%|          | 0/9228 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 245218
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9228
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9228
    })
})
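
The `max_length` caps used in `tokenize_sample_data` trade memory for lost text. A minimal sketch of how one might check that trade-off is shown here, assuming a list of per-sample token counts is available. The helper name `truncation_coverage` and the example lengths are hypothetical and were not measured on XL-Sum.

```python
def truncation_coverage(token_lengths, max_length):
    """Return the fraction of samples whose token count fits within max_length."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    fits = sum(1 for n in token_lengths if n <= max_length)
    return fits / len(token_lengths)


# Hypothetical token counts for six articles (illustration only)
example_lengths = [310, 480, 950, 1024, 2048, 14536]

# Share of articles that would pass through untruncated at each cap
print(truncation_coverage(example_lengths, 1024))
print(truncation_coverage(example_lengths, 512))
```

In practice the lengths would come from tokenizing a sample of the dataset without truncation; a cap that covers most samples loses little information while bounding memory use.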

### **3.2 Chuẩn bị model**

T5 models on Hugging Face come in several sizes:


*   T5-small (https://huggingface.co/t5-small):
  *   Total parameters: **60M**
  *   Supported languages (NLP): **English, French, Romanian, German**
*   T5-base (https://huggingface.co/t5-base):
  *   Total parameters: **220M**
  *   Supported languages (NLP): **English, French, Romanian, German**
*   T5-large (https://huggingface.co/t5-large):
  *   Total parameters: **770M**
  *   Supported languages (NLP): **English, French, Romanian, German**

Because the base and large variants have too many parameters and would exceed the available memory, our group decided to use t5-small, which fits our hardware.
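
To make the size comparison concrete, here is a rough sketch that estimates the memory taken by the weights alone, using the rounded parameter counts listed above and assuming 4 bytes per parameter (float32). The helper name `weight_memory_mb` is hypothetical. Training needs considerably more than this, since gradients, optimizer state, and activations are not counted.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Approximate size of the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Rounded parameter counts from the list above
sizes = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
}

for name, n in sizes.items():
    print(f"{name}: ~{weight_memory_mb(n):.0f} MB in float32")
```

The estimate for t5-small is in line with the size of the `model.safetensors` file downloaded in the next cell.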

In [8]:
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# see https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/configuration#transformers.PretrainedConfig
# load the pretrained config and override the default generation settings
t5_config = AutoConfig.from_pretrained(
  "t5-small",
  max_length=128,
  # Exponential penalty on sequence length in beam-based generation:
  # length_penalty > 0.0 promotes longer sequences, length_penalty < 0.0 encourages shorter ones.
  length_penalty=0.6,
  no_repeat_ngram_size=2,
  num_beams=15,
)
model = (AutoModelForSeq2SeqLM
         .from_pretrained("t5-small", config=t5_config)
         .to(device))

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
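
The effect of `length_penalty` can be illustrated with a small sketch. It assumes the usual beam-search normalization, where a hypothesis score is its summed log-probability divided by `length ** length_penalty`. The function `beam_score` and the two hypotheses are made up for illustration, and the exact length used by the library may differ between versions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score of a beam hypothesis (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses: (summed log-probability, number of tokens)
short_hyp = (-4.0, 4)
long_hyp = (-7.0, 10)

for penalty in (0.0, 0.6, 1.0):
    s = beam_score(*short_hyp, penalty)
    l = beam_score(*long_hyp, penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f} long={l:.3f} -> {winner}")
```

With no penalty the raw log-probability favors the shorter hypothesis; as the penalty grows, longer hypotheses are penalized less per token and can overtake it.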

In [9]:
from transformers import DataCollatorForSeq2Seq
# see https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
  t5_tokenizer,
  model=model,
  return_tensors="pt")
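
As a rough picture of what the collator does to a batch, the sketch below pads inputs with the pad token id (0 in the T5 vocabulary) and labels with -100, the value ignored by the loss. It is a simplified pure-Python illustration, not the library's implementation: the real `DataCollatorForSeq2Seq` also returns tensors and, when given the model, prepares decoder inputs. The helper name `pad_batch` and the token ids are hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sample in a batch to the longest input and the longest label."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * lab_pad)
    return batch


# Two made-up samples of different lengths
features = [
    {"input_ids": [11, 12, 13, 1], "attention_mask": [1, 1, 1, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "attention_mask": [1, 1], "labels": [22, 23, 24, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

The -100 padding in the labels is why the evaluation code has to swap -100 back to the pad token id before decoding.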

### **3.3 Create a ROUGE-based evaluation function**
Reference: https://huggingface.co/spaces/evaluate-metric/rouge

In [10]:
import evaluate
import numpy as np
import torch
from nltk.tokenize import RegexpTokenizer
rouge_metric = evaluate.load("rouge")
# define function for custom tokenization
def tokenize_sentence(arg):
  encoded_arg = t5_tokenizer(arg)
  return t5_tokenizer.convert_ids_to_tokens(encoded_arg.input_ids)

# define function to get ROUGE scores with custom tokenization
def metrics_func(eval_arg):
  preds, labels = eval_arg
  # Replace -100 (the label padding value added by the data collator) with the pad token id so the labels can be decoded
  labels = np.where(labels != -100, labels, t5_tokenizer.pad_token_id)
  # Convert id tokens to text
  text_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
  text_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)
  # Make sure every text ends with punctuation, then put each sentence on its own line (\n) for rougeLsum scoring
  text_preds = [(p if p.endswith(("!", "! ", "?", "? ", ".",";",",","'","\"")) else p + ".") for p in text_preds]
  text_labels = [(l if l.endswith(("!", "! ", "?", "? ", ".",";",",","'","\"")) else l + ".") for l in text_labels]
  sent_tokenizer_en = RegexpTokenizer(u'[^!! ?? .]*[!! ?? :.,;"\']')
  text_preds = ["\n".join(np.char.strip(sent_tokenizer_en.tokenize(p))) for p in text_preds]
  text_labels = ["\n".join(np.char.strip(sent_tokenizer_en.tokenize(l))) for l in text_labels]
  # compute ROUGE score with custom tokenization
  rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

  rouge_scores = rouge_metric.compute(
    predictions=text_preds,
    references=text_labels,
    tokenizer=tokenize_sentence
  )
  # Report each score as a percentage rounded to two decimals
  return {rn: round(rouge_scores[rn] * 100, 2) for rn in rouge_names}

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
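
For intuition about what the metric measures, here is a simplified sketch of ROUGE-1 F1 as unigram overlap between a prediction and a reference. It assumes plain whitespace tokenization and lowercasing, so it is not the `rouge_score` implementation used by `evaluate` and will not reproduce its numbers. The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two texts (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 applies the same idea to bigrams, while ROUGE-L and ROUGE-Lsum are based on the longest common subsequence.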

In [11]:
from torch.utils.data import DataLoader

sample_dataloader = DataLoader(
  tokenized_ds["test"].with_format("torch"),
  collate_fn=data_collator,
  batch_size=5)
for batch in sample_dataloader:
  with torch.no_grad():
    preds = model.generate(
      batch["input_ids"].to(device),
      num_beams=15,
      num_return_sequences=1,
      no_repeat_ngram_size=1,
      remove_invalid_values=True,
      max_length=128,
    )
  labels = batch["labels"]
  break

# Evaluate the pretrained model before fine-tuning
metrics_func([preds, labels])



{'rouge1': 31.66, 'rouge2': 5.57, 'rougeL': 18.69, 'rougeLsum': 52.51}
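
The `generate` call above passes `no_repeat_ngram_size=1`, while the config set earlier uses 2. The sketch below illustrates the general idea behind this option under a simplified reading of it: a token is banned if emitting it would repeat an n-gram that already occurs in the output. The function `banned_next_tokens` and the token ids are hypothetical, and this is not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = [5, 6, 5]  # made-up token ids
print(banned_next_tokens(generated, 2))  # bans only repeats of an existing bigram
print(banned_next_tokens(generated, 1))  # bans every token seen so far
```

Under this reading, a size of 1 prevents any token from appearing twice in a summary, which is much stricter than a size of 2; it may be worth comparing both settings on the validation set.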

In [12]:

from transformers import Seq2SeqTrainingArguments
# directory on Drive where training checkpoints are written
summary = os.path.join(content, "t5-small-summarize")
os.makedirs(summary, exist_ok=True)

# config argument on training see:https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
  output_dir = summary,
  log_level = "error",
  num_train_epochs = 10,
  learning_rate = 5e-4,
  lr_scheduler_type = "linear",
  warmup_steps = 90,
  optim = "adafactor",
  weight_decay = 0.01,
  per_device_train_batch_size = 2,
  per_device_eval_batch_size = 1,
  gradient_accumulation_steps = 16,
  evaluation_strategy = "steps",
  eval_steps = 100,
  predict_with_generate=True,
  generation_max_length = 128,
  save_steps =1000,
  logging_steps = 10,
  push_to_hub = False
)
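
The arguments above interact: the batch size seen by each optimizer step is the per-device batch times the gradient accumulation steps, and the learning rate follows a linear warmup and decay. The sketch below works through the numbers, assuming a single GPU and that partial accumulation windows are rounded down, so the exact step count reported by the Trainer may differ slightly. The helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_rows = 245218                      # rows in the training split shown earlier
batch = effective_batch_size(2, 16)      # per_device_train_batch_size, gradient_accumulation_steps
steps_per_epoch = train_rows // batch    # approximate optimizer steps per epoch
total_steps = steps_per_epoch * 10       # num_train_epochs = 10
print(batch, steps_per_epoch, total_steps)

# Learning rate at the start, end of warmup, a late step, and the final step
for step in (0, 90, 68000, total_steps):
    print(step, linear_lr(step, 5e-4, 90, total_steps))
```

With these settings, `eval_steps = 100` evaluates many times per epoch, which is one reason the evaluation set is kept very small during training.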

### **3.4 Train the model**

In [None]:
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
  model = model,
  args = training_args,
  data_collator = data_collator,
  compute_metrics = metrics_func,
  train_dataset = tokenized_ds["train"],
  eval_dataset = tokenized_ds["validation"].select(range(20)),
  tokenizer = t5_tokenizer,
)

# First run: trainer.train()
# Later runs: resume from the checkpoint saved at step 68000
trainer.train(resume_from_checkpoint="/content/drive/MyDrive/HTK/t5-small-summarize/checkpoint-68000")

| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 68100 | 2.0949 | 2.088825 | 40.47 | 20.6 | 34.6 | 56.6 |
| 68200 | 2.1242 | 2.091639 | 41.34 | 21.54 | 35.83 | 56.72 |
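
Because Colab sessions can be interrupted, training is resumed from a saved checkpoint rather than restarted. Instead of hard-coding the step number, one could look up the newest `checkpoint-<step>` folder in the output directory. The sketch below does this with the standard library only; the function name `latest_checkpoint` is hypothetical, and the demo runs on a temporary folder with made-up checkpoint names.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically so checkpoint-9000 sorts before checkpoint-68000
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a temporary folder (not the real Drive directory)
with tempfile.TemporaryDirectory() as tmp:
    for step in (1000, 9000, 68000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(tmp)))
```

The `transformers` library ships a similar helper, `transformers.trainer_utils.get_last_checkpoint`, and `trainer.train(resume_from_checkpoint=True)` resumes from the last checkpoint found in `output_dir`.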


## **4. Save the training results**

In [14]:
import os
from transformers import AutoModelForSeq2SeqLM

# save the fine-tuned model to Drive
save_dir = os.path.join(content, "trained_t5_small_summarize")
os.makedirs(save_dir, exist_ok=True)
# unwrap the model first if it was wrapped (e.g. by DataParallel)
model_to_save = trainer.model.module if hasattr(trainer.model, "module") else trainer.model
model_to_save.save_pretrained(save_dir)


In [15]:
import os
from transformers import AutoModelForSeq2SeqLM
# load local model
model = (AutoModelForSeq2SeqLM
         .from_pretrained(os.path.join(content,"trained_t5_small_summarize"))
         .to(device))

## **5. Prediction and testing**

In [16]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [17]:
from torch.utils.data import DataLoader

# Predict with test data (first 5 rows)
sample_dataloader = DataLoader(
  tokenized_ds["test"].with_format("torch"),
  collate_fn=data_collator,
  batch_size=5)
for batch in sample_dataloader:
  with torch.no_grad():
    preds = model.generate(
      batch["input_ids"].to(device),
      num_beams=15,
      num_return_sequences=1,
      no_repeat_ngram_size=1,
      remove_invalid_values=True,
      max_length=128,
    )
  labels = batch["labels"]
  break

# Replace -100 (see above)
labels = np.where(labels != -100, labels, t5_tokenizer.pad_token_id)

# Convert id tokens to text
text_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
text_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)



In [18]:
# Show result
print("***** Input's Text *****")
print(ds["test"][1]["text"])
print("***** Summary Text (True Value) *****")
print(text_labels[1])
print("***** Summary Text (Generated Text) *****")
print(text_preds[1])

***** Input's Text *****
But Eluned Morgan conceded that it would be "difficult for us to stop" from a legal point of view. Her comments were criticised by a Labour AM. Alun Davies said threatening legal action "sounds like the last breath before you're thrown out of the pub". Mr Davies said he was not convinced the Welsh Government would "have a leg to stand on" in trying to shape international trade deals after Brexit. Following Donald Trump's comments during last week's trade visit that the NHS would be "on the table" in any future trade talks between the UK and the USA, Eluned Morgan said there was "absolutely no prospect whatsoever of us allowing the Welsh NHS to be part of any negotiation." The US President then rowed back on his initial comments following criticism from a number of MPs. Asked about her response to President Trump's remarks as she gave evidence to the Assembly's Brexit committee on Monday, Ms Morgan said "legally, it would be difficult for us to stop because we d

## **6. Testing the model after training**

In [19]:
inpt_text="Prime Minister Pham Minh Chinh wants to look into the possibility of developing a standard, high-speed railway that connects Vietnam and China. At a meeting with Chinese President Xi Jinping at the Great Hall of the People in China's Beijing on Tuesday, Chinh requested China to bolster the progress of opening the market for Vietnam's agricultural products, and create opportunities for Vietnam to open more offices to promote trade in China. Chinh also wants China to give more quota for Vietnamese goods in transit to a third country by China’s railway, as well as looking into cooperation possibilities for a standard, high-speed railway connecting the two countries. Chinh also commended Chinese businesses for expanding high-quality investments into Vietnam, wishing for more exchanges between both countries and to contribute to a resilient social foundation for bilateral relations. Chinh stressed that a stable and long-term relations with China has always been a strategic choice of foremost priority for Vietnam's external relations. In response, Xi said China has always regarded Vietnam as a priority when it comes to its diplomatic policies, and wants to bolster the relationship between the two countries and the two Parties.Xi said China is ready to maintain strategic exchanges with Vietnam, import goods from Vietnam and bolster railway, road and border infrastructure connectivity. Xi commends Vietnam for participating in China's global initiatives and promoting peace, cooperation and development in the region and the world."

In [20]:

input_feature = t5_tokenizer(inpt_text, truncation=True, max_length=1024, return_tensors="pt")
preds = model.generate(
      input_feature["input_ids"].to(device),
      num_beams=15,
      num_return_sequences=1,
      no_repeat_ngram_size=1,
      remove_invalid_values=True,
      max_length=128,
    )

In [21]:
text_pr = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
print("***** Input's Text *****")
print(inpt_text)
print("***** Summary Text (Generated Text) *****")
print(text_pr)

***** Input's Text *****
Prime Minister Pham Minh Chinh wants to look into the possibility of developing a standard, high-speed railway that connects Vietnam and China. At a meeting with Chinese President Xi Jinping at the Great Hall of the People in China's Beijing on Tuesday, Chinh requested China to bolster the progress of opening the market for Vietnam's agricultural products, and create opportunities for Vietnam to open more offices to promote trade in China. Chinh also wants China to give more quota for Vietnamese goods in transit to a third country by China’s railway, as well as looking into cooperation possibilities for a standard, high-speed railway connecting the two countries. Chinh also commended Chinese businesses for expanding high-quality investments into Vietnam, wishing for more exchanges between both countries and to contribute to a resilient social foundation for bilateral relations. Chinh stressed that a stable and long-term relations with China has always been a st