<a href="https://colab.research.google.com/github/Khuzamaalk/T5_BootCamp/blob/main/M_Text_Summarization_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a Model for Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `UBC-NLP/AraT5-base` for both the model and the tokenizer. If you run into problems with it, you can fall back to `google-t5/t5-small`, but `UBC-NLP/AraT5-base` is the intended choice.
- **Dataset**: Use the `CUTD/news_articles_df` dataset. Make sure to load and preprocess it correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.
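A minimal sketch of the 80/20 split described above, written in plain Python so it runs without downloading anything. The helper name `split_indices` and the seed value are illustrative assumptions, not part of the assignment. With the Hugging Face `datasets` library, `Dataset.train_test_split(test_size=0.2, seed=...)` performs the same kind of shuffled split directly on the dataset.

```python
import random

def split_indices(num_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) after a reproducible shuffle."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    test_size = round(num_rows * test_fraction)
    return indices[test_size:], indices[:test_size]

# 8378 is the row count reported for the train split in this notebook.
train_idx, test_idx = split_indices(8378)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting matters when the file is ordered (for example by date or topic), since a purely sequential cut could put one kind of article only in the test set.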

In [None]:
!pip install datasets


In [None]:
#from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

from datasets import load_dataset

dataset = load_dataset('CUTD/news_articles_df')


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 8378
    })
})

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(dataset['train'])
df.head()

Unnamed: 0.1,Unnamed: 0,summarizer,text
0,0,\nأشرف رئيس الجمهورية الباجي قايد السبسي اليوم...,اشرف رئيس الجمهوريه الباجي قايد السبسي اليوم ب...
1,1,"\nتحصل كتاب ""المصحف وقراءاته"" الذي ألفه باحثون...",تحصل كتاب المصحف وقراءاته الفه باحثون تونسيون ...
2,2,تونس حاضرة من جهة أخرى ستكون تونس حاضرة في قائ...,احتضن جناح تونس القريه الدوليه للافلام بمدينه ...
3,3,واستأجرت صاحبة المشروع المحامية والكاتبة سيران...,شهدت برلين الجمعه افتتاح مسجد فريد نوعه الاقل ...
4,4,\nنعت وزارة الشّؤون الثّقافيّة المنشد الصّوفي ...,نعت وزاره المنشد عز بن محمود انتقل جوار يوم تن...


In [None]:
print(f'text column:\n {df["text"]}')
print("\n----------------------------------------\n")
print(f'summarizer column:\n {df["summarizer"]}')

text column:
 0       اشرف رئيس الجمهوريه الباجي قايد السبسي اليوم ب...
1       تحصل كتاب المصحف وقراءاته الفه باحثون تونسيون ...
2       احتضن جناح تونس القريه الدوليه للافلام بمدينه ...
3       شهدت برلين الجمعه افتتاح مسجد فريد نوعه الاقل ...
4       نعت وزاره المنشد عز بن محمود انتقل جوار يوم تن...
                              ...                        
8373    تاجل الاضراب العام قطاع الصحه مقررا تنفيذه الي...
8374    كشف الناشطان كريم نوار وعفيف زقيه اشرفا عمليه ...
8375    فرقه الابحاث والتفتيش للحرس الوطني بطبلبه ولاي...
8376    قرر الاهالي بمناطق هيشر وعين القارصي والغولايث...
8377    تمكنت وحدات الحرس الوطني بمحطه الاستخلاص ببرج ...
Name: text, Length: 8378, dtype: object

----------------------------------------

summarizer column:
 0       \nأشرف رئيس الجمهورية الباجي قايد السبسي اليوم...
1       \nتحصل كتاب "المصحف وقراءاته" الذي ألفه باحثون...
2       تونس حاضرة من جهة أخرى ستكون تونس حاضرة في قائ...
3       واستأجرت صاحبة المشروع المحامية والكاتبة سيران...
4       \nنعت وزار

In [None]:
df = df.drop(columns= 'Unnamed: 0')
df.head()

Unnamed: 0,summarizer,text
0,\nأشرف رئيس الجمهورية الباجي قايد السبسي اليوم...,اشرف رئيس الجمهوريه الباجي قايد السبسي اليوم ب...
1,"\nتحصل كتاب ""المصحف وقراءاته"" الذي ألفه باحثون...",تحصل كتاب المصحف وقراءاته الفه باحثون تونسيون ...
2,تونس حاضرة من جهة أخرى ستكون تونس حاضرة في قائ...,احتضن جناح تونس القريه الدوليه للافلام بمدينه ...
3,واستأجرت صاحبة المشروع المحامية والكاتبة سيران...,شهدت برلين الجمعه افتتاح مسجد فريد نوعه الاقل ...
4,\nنعت وزارة الشّؤون الثّقافيّة المنشد الصّوفي ...,نعت وزاره المنشد عز بن محمود انتقل جوار يوم تن...


## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the given model checkpoint.

In [None]:
from transformers import AutoTokenizer

checkpoint = "UBC-NLP/AraT5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [None]:
prefix = "summarize: "

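def preprocess(texts):
    # Prepend the T5 task prefix to every article in the batch.
    inputs = [prefix + doc for doc in texts['text']]

    # Tokenize the articles (inputs) and the reference summaries (targets), truncating long sequences.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=texts['summarizer'], max_length=512, truncation=True)

    # The label IDs are what the model is trained to generate.
    model_inputs['labels'] = labels['input_ids']

    return model_inputs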

In [None]:
tokenized_data = dataset.map(preprocess, batched = True)
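
To make the shape of the data concrete, here is a small sketch of what a batched preprocessing function receives and returns. The whitespace-based `toy_tokenize` is a stand-in assumption for the real tokenizer (which produces subword IDs, not words), and the tiny length limits are chosen only so that truncation is visible.

```python
PREFIX = "summarize: "

def toy_tokenize(text, max_length):
    """Stand-in tokenizer: split on whitespace and truncate to max_length."""
    return text.split()[:max_length]

def preprocess_batch(batch, max_input_len=6, max_target_len=3):
    """Mirror the notebook's preprocess(): prefix inputs, tokenize, attach labels."""
    inputs = [PREFIX + doc for doc in batch["text"]]
    return {
        "input_ids": [toy_tokenize(doc, max_input_len) for doc in inputs],
        "labels": [toy_tokenize(s, max_target_len) for s in batch["summarizer"]],
    }

# With batched=True, map() passes a dict of column name -> list of values.
batch = {
    "text": ["the council approved the new budget after a long debate",
             "rain is expected"],
    "summarizer": ["council approves budget after debate", "rain expected"],
}
out = preprocess_batch(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

Note that the prefix counts toward the input length limit, so it uses up part of the budget before truncation is applied.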

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
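
Dynamic padding means each batch is padded only to the length of its own longest sequence, rather than to a global maximum. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption for illustration. Labels are padded with -100, the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, which marks positions the loss should ignore.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(
            f["labels"] + [label_pad_token_id] * (max_lab - len(f["labels"]))
        )
    return batch

features = [
    {"input_ids": [11, 12, 13, 14], "labels": [7, 8]},
    {"input_ids": [21, 22], "labels": [9, 10, 11]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```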

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at UBC-NLP/AraT5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [None]:
from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

In [None]:
# Split sizes: 70% train, 15% validation, 15% test
train_size = int(0.7 * len(tokenized_data["train"]))
validation_size = int(0.15 * len(tokenized_data["train"]))
test_size = int(0.15 * len(tokenized_data["train"]))

# Update the tokenized data with the new splits (rows are taken in order, without shuffling)
tokenized_data = {
    "train": tokenized_data["train"].select(range(train_size)),
    "validation": tokenized_data["train"].select(range(train_size, train_size + validation_size)),
    "test": tokenized_data["train"].select(
        range(train_size + validation_size, train_size + validation_size + test_size)
    ),
}

In [None]:
batch_size=4
tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_data["validation"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [None]:
model.compile(optimizer=optimizer)
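
A quick check of the arithmetic behind the 70/15/15 split and the batch size used in this step, based on the 8378 rows reported when the dataset was loaded. The variable names `unused` and `full_train_batches` are illustrative. Because each size is truncated with `int()`, the three parts do not quite cover the whole dataset.

```python
num_rows = 8378  # rows in the dataset's train split
batch_size = 4   # value used when building the tf.data pipelines

train_size = int(0.7 * num_rows)
validation_size = int(0.15 * num_rows)
test_size = int(0.15 * num_rows)

# Rows not assigned to any split because of int() truncation.
unused = num_rows - (train_size + validation_size + test_size)
# Number of complete training batches per epoch.
full_train_batches = train_size // batch_size

print(train_size, validation_size, test_size, unused, full_train_batches)
```

The number of full batches per epoch is a useful figure when estimating training time or when setting up a learning-rate schedule that needs a total step count.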

## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tf_train_set,
    eval_dataset=tf_test_set,
    tokenizer=tokenizer,
    data_collator=data_collator,
)



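AttributeError: 'TFT5ForConditionalGeneration' object has no attribute 'to'

**Note:** `Seq2SeqTrainer` expects a PyTorch model and tries to move it to a device with `model.to(...)`. The model here was loaded with `TFAutoModelForSeq2SeqLM`, so the trainer cannot be created. Training in Step 8 therefore uses Keras `model.fit` instead.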

## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [None]:
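# trainer.train()  # not runnable: `trainer` could not be created for the TF model (see the AttributeError in Step 7)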

In [None]:
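# Monitor the validation split during training; the test split stays held out for final evaluation.
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=1)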



<tf_keras.src.callbacks.History at 0x797709e2a770>

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.
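
The tokenize, generate, decode sequence described above can be sketched as a small helper. The function name `summarize` and the generation settings (`num_beams=4`, `max_new_tokens=64`) are illustrative assumptions, not values from this notebook. The tokenizer and model are passed in as arguments, so the same code can be used with the fine-tuned objects from the earlier steps. The `summarize: ` prefix is added explicitly because the training inputs carried it.

```python
def summarize(text, tokenizer, model, prefix="summarize: ",
              max_input_length=512, max_new_tokens=64, num_beams=4):
    """Tokenize one document, generate a summary, and decode it to text."""
    inputs = tokenizer(
        prefix + text,
        return_tensors="tf",  # the notebook uses the TensorFlow model class
        truncation=True,
        max_length=max_input_length,
    )
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    # generate() returns a batch; take the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage with the objects defined in this notebook (not run here):
# print(summarize(some_article_text, tokenizer, model))
```

Calling the model directly like this makes the prefix and length limits explicit, which can help when debugging poor or repetitive output.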

In [None]:
model.save_pretrained("my_model")

In [None]:
tokenizer.save_pretrained("/content/my_model")

('/content/my_model/tokenizer_config.json',
 '/content/my_model/special_tokens_map.json',
 '/content/my_model/spiece.model',
 '/content/my_model/added_tokens.json',
 '/content/my_model/tokenizer.json')

In [None]:
text = '''أتشعرُ أنّك مرهقٌ جداً يا فتى؟ متعبٌ من كلّ شيءٍ، وساخطٌ على كلّ شيءْ، تبدُو لِي كذلك، وعيناكَ الضيّقتانِ، تزيدانِ من حدّتكْ، كلّما اكتملتْ تلكَ العقدةُ الّتي تعلُو وجهكْ.

اهدأ، فأنا أستطيعُ أنْ أتفهّم غضبكْ ونقمتكَ على الحياةِ كلّها، وأنتَ تجلسُ كلّ صّباحٍ في هذهِ الزاويةِ المعتمةِ منْ هذا الكوكبِ المقفرِ، تنتظرُ منْ يمرُّ من هُنا راغباً في مسحِ حذائهِ.

تشعرُ بالخجلْ أليسَ كذلك؟ أو ربّما تشعرُ أنّك مطحونٌ في ركنٍ منسيٍ من هذا الكونْ، تشعرُ بالرّغبةِ في البكاءْ، كلّما ناولكَ أحدهمْ نظيرَ عملكْ، أنا أفهمكْ حقاً، لكنّي أفهمُ أيضاً أنّنا لا نختارُ ما نحنُ عليهْ، بينما نستطيعُ تغييرهُ بأيدينا مسْتقبلاً، أنتَ تبْنِي نفسكْ، فلا تسْتهنْ بكلّ الّذي تقُومْ به الآن.

غداً حينَ ستكبرْ، ستدركُ أنّك قدْ صقلتَ الرّجولة فيكَ مبكراً جداً، وأنّ الطّفولةَ الّتي حُرمتَ جنّتها، ستعوّضُ برجولةٍ مكتملةٍ وقادرةٍ على مواجهةِ صُعوباتِ الحياة، أنتَ تصنعُ من نفسكَ الآن رجلاً، وقليلُون جداً همُ الرجالُ على هذا الكوكبْ.

يا صّغيرِي، لا تخجلْ منْ نفسكْ أبداً، فأنتَ الآن درسٌ للعالمِ كلّه.'''

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="/content/my_model")
summarizer(text)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at /content/my_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': 'طعنة طعنة، Palaistemptemptemptemptemp ميناء ميناءtemptemptemptemptemp ميناء،،'}]