# Text summarization

In this notebook, I will load a Transformer (LLM: T5 [Text-To-Text Transfer Transformer]) from Hugging Face to summarize text from a dataset (NLP area). In order to do it, we will apply a fine-tuning.

In [12]:
# Libraries
from datasets import Dataset, load_dataset
import torch
# T5Tokenizer -> to token data
# T5ForConditionalGeneration -> the model that will be load
# TrainingArguments, Trainer -> to train the model
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer

In [13]:
# We will work with CPU
print(torch.cuda.is_available()) 

False


## Data

**Step 1**: Load the data 

In [14]:
dataset = load_dataset("xsum", trust_remote_code=True)

In [None]:
def preprocess_function(examples, tokenizer = T5Tokenizer.from_pretrained("t5-small"), max_input_length=512, max_target_length=128):
    inputs_doc = ["summarize: " + str(doc) for doc in examples["document"]]
    model_inputs = tokenizer(inputs_doc, padding="max_length", max_length=max_input_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        inputs_summ = [str(s) for s in examples["summary"]]
        labels = tokenizer(inputs_summ, padding="max_length", max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [16]:
# To define the data
tam_train, tam_val = 1000, 200
small_train_dataset = dataset["train"].select(range(tam_train))
small_val_dataset = dataset["validation"].select(range(tam_val))

In [17]:
# To tokenize the data
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_val = small_val_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 1000/1000 [00:01<00:00, 548.66 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 556.39 examples/s]


## Transformer

In [18]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [19]:
#fp16=True if GPU
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,
    fp16=False
)

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=T5Tokenizer
)

  trainer = Trainer(


In [21]:
# To train the model
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss


TypeError: PreTrainedTokenizerBase.save_pretrained() missing 1 required positional argument: 'save_directory'

In [None]:
# Model evaluation
sample = dataset["validation"][0]["document"]
input_text = "summarize: " + sample
input_ids = T5Tokenizer(input_text, return_tensors="pt", truncation=True).input_ids

# Generar el resumen
summary_ids = model.generate(input_ids, max_length=128)
print("Resumen generado:\n", T5Tokenizer.decode(summary_ids[0], skip_special_tokens=True))

In [None]:
print("Resumen real:\n", dataset["validation"][0]["summary"])

In [None]:
model.save_pretrained("t5-small-summarizer")
T5Tokenizernizer.save_pretrained("t5-small-summarizer")