<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/summarization/Summerazation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, the cnn_dailymail dataset is used to fine-tune a T5 model for the summarization task.

## Installing & Importing Necessary Libraries

The libraries required for this notebook are downloaded and installed below.

In [1]:
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate
!pip install -q rouge_score
!pip install -q sentencepiece

In [2]:
import transformers
import datasets
import numpy as np
import torch
import evaluate
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Dataset
I used the cnn_dailymail dataset (version 3.0.0). To keep training time short, only the first 2,000 examples of the training split are loaded.

In [3]:
raw_datasets = datasets.load_dataset("cnn_dailymail", "3.0.0", split= "train[:2000]")



In [4]:
raw_datasets = raw_datasets.train_test_split(test_size= .1)

In [5]:
raw_datasets["validation"] = raw_datasets.pop("test")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 1800
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 200
    })
})

As mentioned earlier, a T5 model is fine-tuned for the task. The checkpoint used here is T5 v1.1 base.

In [6]:
checkpoint = "google/t5-v1_1-base"

## Preprocessing the Dataset

In [7]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

In [8]:
max_input_length = 512
max_target_length = 64

In [9]:
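def preprocess(example):
    # Tokenize the articles, truncating them to max_input_length tokens.
    model_inputs = tokenizer(example["article"], 
                             max_length= max_input_length,
                             truncation= True,)
    # Tokenize the reference summaries, truncating them to max_target_length tokens.
    labels = tokenizer(example["highlights"], 
                       max_length= max_target_length,
                       truncation= True,)
    model_inputs["labels"] = labels["input_ids"]
    # label_mask is not an argument of the model's forward method; the Trainer ignores it.
    model_inputs["label_mask"] = labels["attention_mask"]
    return model_inputs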

In [10]:
tokenized_datasets = raw_datasets.map(preprocess,
                                      batched= True)

In [11]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels', 'label_mask'],
        num_rows: 1800
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels', 'label_mask'],
        num_rows: 200
    })
})

## Metric

In [12]:
metric = evaluate.load("rouge")
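
To make the numbers reported by the metric easier to read, here is a minimal sketch of the idea behind ROUGE-1: the F1 score of unigram overlap between a prediction and a reference. It is a simplification for illustration only. The function name `rouge1_f1` is hypothetical, tokenization is plain lowercased whitespace splitting, and there is no stemming, so its scores will not exactly match those of `evaluate.load("rouge")`.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 (hypothetical helper, not the rouge_score implementation)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Five of the six tokens match in both directions here, so precision, recall, and F1 are all 5/6.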

## Model & Configuration

In [13]:
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [14]:
batch_size = 8
num_train_epochs = 3
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = "t5-cnn3-fine-tuned"
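
The arithmetic behind `logging_steps` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, and takes the 1800-example train split from above. Under those assumptions, setting `logging_steps` to the number of batches per epoch means the training loss is logged once per epoch.

```python
import math

num_train_examples = 1800  # size of the train split shown earlier
batch_size = 8
num_train_epochs = 3

# Number of optimizer steps per epoch (assuming one device, no gradient accumulation).
steps_per_epoch = math.ceil(num_train_examples / batch_size)
# Total optimizer steps over the whole run.
total_steps = steps_per_epoch * num_train_epochs
# Same expression as in the cell above.
logging_steps = num_train_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)
```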

In [15]:
args = transformers.Seq2SeqTrainingArguments(
    output_dir= model_name,
    evaluation_strategy= "epoch",
    learning_rate= 5e-5,
    per_device_train_batch_size= batch_size,
    per_device_eval_batch_size= batch_size,
    weight_decay= .01,
    save_total_limit= 1,
    num_train_epochs= num_train_epochs,
    predict_with_generate= True,  # generate summaries during evaluation so ROUGE can be computed
    logging_steps = logging_steps,
    save_strategy="epoch"
)

In [16]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode the generated token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens= True)
    # Replace -100 (label padding ignored by the loss) with the pad token id so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens= True)
    # ROUGE-Lsum expects one sentence per line.
    decoded_preds = ["\n".join(nltk.tokenize.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.tokenize.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(
        predictions= decoded_preds,
        references= decoded_labels,
        use_stemmer= True,
    )
    # Report the scores as percentages.
    result = {k: round((v * 100), 4) for k, v in result.items()}
    return result
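
The `np.where` step in `compute_metrics` can be seen in isolation. In this small sketch the token ids are made up for illustration, and `pad_token_id = 0` is an assumption that matches T5's pad token id. Positions holding -100 are swapped for the pad id, which the tokenizer can then decode and skip as a special token.

```python
import numpy as np

pad_token_id = 0  # assumed here; T5 uses 0 as its pad token id

# Two label sequences padded to the same length with -100 (illustrative ids).
labels = np.array([
    [37, 1712, 1, -100, -100],
    [8, 1, -100, -100, -100],
])

# Keep real token ids, replace every -100 with the pad token id.
restored = np.where(labels != -100, labels, pad_token_id)
print(restored)
```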

In [17]:
data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer= tokenizer,
    model= model,
)
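
As a rough picture of what the collator does with each batch, the sketch below pads inputs to the longest sequence in the batch and pads labels with -100 so the loss ignores those positions. It is a simplified pure-Python stand-in, not the library's implementation: `pad_batch` is a hypothetical name, the token ids are made up, and the real `DataCollatorForSeq2Seq` also returns tensors and, when given the model, prepares decoder inputs.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Hypothetical helper: dynamic padding of one batch of tokenized examples."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input_len - len(f["input_ids"])
        n_label_pad = max_label_len - len(f["labels"])
        # Inputs are padded with the pad token; the mask marks real tokens with 1.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_input_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_input_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [9, 10, 11, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the tokenization step above truncates but does not pad.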

In [18]:
tokenized_datasets = tokenized_datasets.remove_columns(raw_datasets["train"].column_names)

## Defining Trainer & Training Model

In [19]:
trainer = transformers.Seq2SeqTrainer(
    model,
    args,
    train_dataset= tokenized_datasets["train"],
    eval_dataset= tokenized_datasets["validation"],
    data_collator= data_collator,
    tokenizer= tokenizer,
    compute_metrics = compute_metrics,
)

In [20]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: label_mask. If label_mask are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1800
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 675
  Number of trainable parameters = 247577856
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


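| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum |
|-------|---------------|-----------------|--------|--------|--------|-----------|
| 1 | 6.6546 | 2.49903 | 21.6841 | 7.0373 | 16.962 | 20.6638 |
| 2 | 3.2528 | 2.325507 | 12.817 | 4.5456 | 10.4818 | 12.186 |
| 3 | 3.1066 | 2.315136 | 7.7011 | 3.1949 | 6.44 | 7.3525 |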


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: label_mask. If label_mask are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8
Saving model checkpoint to t5-cnn3-fine-tuned/checkpoint-225
Configuration saved in t5-cnn3-fine-tuned/checkpoint-225/config.json
Model weights saved in t5-cnn3-fine-tuned/checkpoint-225/pytorch_model.bin
tokenizer config file saved in t5-cnn3-fine-tuned/checkpoint-225/tokenizer_config.json
Special tokens file saved in t5-cnn3-fine-tuned/checkpoint-225/special_tokens_map.json
Copy vocab file to t5-cnn3-fine-tuned/checkpoint-225/spiece.model

TrainOutput(global_step=675, training_loss=4.338005913628472, metrics={'train_runtime': 992.5924, 'train_samples_per_second': 5.44, 'train_steps_per_second': 0.68, 'total_flos': 3697689703219200.0, 'train_loss': 4.338005913628472, 'epoch': 3.0})

## Model Usage

In [22]:
summarizer = transformers.pipeline("summarization", model= "/content/t5-cnn3-fine-tuned/checkpoint-675")

loading configuration file /content/t5-cnn3-fine-tuned/checkpoint-675/config.json
Model config T5Config {
  "_name_or_path": "/content/t5-cnn3-fine-tuned/checkpoint-675",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 32128
}

In [27]:
text = raw_datasets["validation"][3]["article"]

In [28]:
# Note: text[:512] keeps the first 512 characters of the article, not 512 tokens.
summarizer(text[:512])

[{'summary_text': ', Illinois, Sen. Barack Obama says "change has come to America" Sen. Barack'}]