<a href="https://colab.research.google.com/github/ProGenei/GhadeerNoohT5/blob/main/Text_Summarization_2_Pytorch_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a Model for Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `UBC-NLP/AraT5-base` if you face any problem you can use `google-t5/t5-small` but the first one is the correct one for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/news_articles_df` dataset. Ensure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [1]:
!pip install transformers datasets evaluate rouge_score

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading evaluate-0.4.3-py3-none-any.whl (84 k

In [2]:
from datasets import load_dataset

In [3]:
billsum = (load_dataset('CUTD/news_articles_df', split='train')
        .train_test_split(train_size=800, test_size=200))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


news_articles_df.csv:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8378 [00:00<?, ? examples/s]

In [4]:
billsum

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 200
    })
})

In [5]:
train_ds = billsum['train']

In [6]:
billsum['train'][6]

{'Unnamed: 0': 755,
 'summarizer': 'وتتوزع ميزانية المجلس بين 28،081 مليون دينار بالنسبة لنفقات التصرف، و1،365 مليون دينار بالنسبة لنفقات التنمية. \nتمت خلال الجلسة العامة المنعقدة اليوم السبت المصادقة على مشروع ميزانية مجلس نواب الشعب لسنة 2017، بموافقة 177 نائبا واحتفاظ 17 نائبا بأصواتهم.',
 'text': 'تمت خلال الجلسه العامه المنعقده اليوم السبت المصادقه مشروع ميزانيه مجلس نواب الشعب لسنه بموافقه نائبا واحتفاظ نائبا باصواتهم وتم ضبط نفقات التصرف والتنميه بمشروع ميزانيه المجلس للسنه القادمه حدود مليون مقابل مليون مرسمه سنه بانخفاض قدره مليون يعادل نسبه بالمائه وتتوزع ميزانيه المجلس مليون بالنسبه لنفقات التصرف مليون بالنسبه لنفقات التنميه'}

In [7]:
billsum['train'].features

{'Unnamed: 0': Value(dtype='int64', id=None),
 'summarizer': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None)}

In [8]:
train_ds

Dataset({
    features: ['Unnamed: 0', 'summarizer', 'text'],
    num_rows: 800
})

In [9]:
billsum

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 200
    })
})

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the gevin model checkpoint.

In [10]:
billsum['test']

Dataset({
    features: ['Unnamed: 0', 'summarizer', 'text'],
    num_rows: 200
})

In [11]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [12]:
prefix = "summarizer: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, padding='max_length',max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summarizer"], max_length=128,padding='max_length', truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [13]:
train = billsum['train'].map(preprocess_function, batched=True, remove_columns=['Unnamed: 0', 'summarizer', 'text'])

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

In [14]:
test = billsum['test'].map(preprocess_function, batched=True, remove_columns=['Unnamed: 0', 'summarizer', 'text'])

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

{'Unnamed: 0': Value(dtype='int64', id=None),

 'summarizer': Value(dtype='string', id=None),

 'text': Value(dtype='string', id=None)}

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [15]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [16]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

In [17]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer,Trainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [18]:
import evaluate

rouge = evaluate.load("rouge")

import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [19]:
training_args = Seq2SeqTrainingArguments(
    output_dir="'/results",
    eval_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
)



In [20]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [21]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=100, training_loss=1.6423846435546876, metrics={'train_runtime': 69.8902, 'train_samples_per_second': 11.447, 'train_steps_per_second': 1.431, 'total_flos': 216546882355200.0, 'train_loss': 1.6423846435546876, 'epoch': 1.0})

## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [47]:
text = "summarizer: هذا النص هو مثال لنص يمكن أن يستبدل في نفس المساحة، لقد تم توليد هذا النص من مولد النص العربى، حيث يمكنك أن تولد مثل هذا النص أو العديد من النصوص الأخرى إضافة إلى زيادة عدد الحروف التى يولدها التطبيق. إذا كنت تحتاج إلى عدد أكبر من الفقرات يتيح لك مولد النص العربى زيادة عدد الفقرات كما تريد، النص لن يبدو مقسما ولا يحوي أخطاء لغوية، مولد النص العربى مفيد لمصممي المواقع على وجه الخصوص، حيث يحتاج العميل فى كثير من الأحيان أن يطلع على صورة حقيقية لتصميم الموقع. ومن هنا وجب على المصمم أن يضع نصوصا مؤقتة على التصميم ليظهر للعميل الشكل كاملاً،دور مولد النص العربى أن يوفر على المصمم عناء البحث عن نص بديل لا علاقة له بالموضوع الذى يتحدث عنه التصميم فيظهر بشكل لا يليق. هذا النص يمكن أن يتم تركيبه على أي تصميم دون مشكلة فلن يبدو وكأنه نص منسوخ، غير منظم، غير منسق، أو حتى غير مفهوم. لأنه مازال نصاً بديلاً ومؤقتاً. هذا النص هو مثال لنص يمكن أن يستبدل في نفس المساحة، لقد تم توليد هذا النص من مولد النص العربى، حيث يمكنك أن تولد مثل هذا النص أو العديد من النصوص الأخرى إضافة إلى زيادة عدد الحروف التى يولدها التطبيق."

In [49]:
from transformers import pipeline

summarizer = pipeline("summarization", tokenizer=checkpoint, model=checkpoint)
summarizer(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': ' ,  and  () () . : ;  = . .'}]

In [50]:
summarizer(text)

[{'summary_text': ' ,  and  () () . : ;  = . .'}]

In [42]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(text, return_tensors="tf").input_ids

In [43]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint, from_pt=True)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [44]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'                                                 '