<a href="https://colab.research.google.com/github/Asma-Ahmed-Aqil-AL-Zubaidi/Tuwaiq_Academy_week_8/blob/main/Text_Summarization_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a Model for Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `UBC-NLP/AraT5-base` for both the model and the tokenizer. If you run into problems with it, you may fall back to `google-t5/t5-small`, but `UBC-NLP/AraT5-base` is the intended checkpoint.
- **Dataset**: You will be using the `CUTD/news_articles_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.
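
As a rough illustration of what a seeded 80/20 hold-out split does, here is a minimal pure-Python sketch. The helper name `split_indices` is hypothetical, and the row count of 8378 is an assumption taken from the sizes reported by the `Map` progress bars later in this notebook (6702 + 1676). It will not reproduce the exact rows chosen by scikit-learn, only the sizes and the idea of a reproducible shuffle.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    n_test = math.ceil(n_rows * test_fraction)  # round the test share up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    return indices[n_test:], indices[:n_test]


# 8378 rows is an assumption derived from the notebook's own outputs.
train_idx, test_idx = split_indices(8378)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters because the evaluation numbers are only comparable across runs when the same rows are held out each time.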

In [36]:
!pip install datasets



In [37]:
!pip install --upgrade torch



In [38]:
!pip install transformers datasets scikit-learn



In [39]:
!pip install --upgrade accelerate





In [41]:
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

In [42]:
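# Download the news dataset from the Hugging Face Hub.
dataset = load_dataset("CUTD/news_articles_df")

# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
train_data, test_data = train_test_split(dataset['train'], test_size=0.2, random_state=42)

# Rebuild `Dataset` objects from the two parts so that `.map()` can be used later.
train_dataset = Dataset.from_dict(train_data)
test_dataset = Dataset.from_dict(test_data)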

In [43]:
train_dataset[:3]

{'Unnamed: 0': [5677, 664, 4366],
 'summarizer': ['\nنظمت منطقة الحرس الوطني بالمحرس ولاية صفاقس يوم أمس حملة أمنيّة كبرى إستثنائيّة، بحسب بيان لوزارة الداخلية. - إحباط عمليّات تهريب على متن 05 سيّارات كانت محمّلة ببضائع مهرّبة تتمثل في: - 3100 لترا من المحروقات. وأسفرت الحملة عن تحقيق النتائج التالية: - إلقاء القبض على 17 شخصا مفتش عنهم.',
  '\nأكد وزير الفلاحة والموارد المائية والصيد البحري\xa0سمير الطيب انه تم تحديد تسعيرة اللتر الواحد من زيت الزيتون ب8 دنانير\xa0فقط بعد الاتفاق بين الوزارة و الديوان الوطني للزيت.',
  'كما تم الاحتفاظ بـ 3 مشتبه بهم آخرين (تتراوح أعمارهم بين 22 و 40 سنة) وحجز مبلغ مالي قدره 8060 دينار و الدراجة النارية المستعملة في العملية . \nأعلنت وزارة الداخلية اليوم الثلاثاء أنه تم إيقاف الأشخاص المتورطين في قتل شخص بعد محاولة سلبه أمواله في جريمة وقعت في شارع آلان سافاري بتونس العاصمة. وتم الاحتفاظ بالمشتبه الرئيسي (18سنة) الذي أكد ما جاء بأطوار القضية واعترف بطعنه للمتضرر على مستوى رجله مما أدى لاحقا لوفاته.'],
 'text': ['نظمت منطقه الحرس الوطني بالمحرس ولايه 

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the given model checkpoint.

In [44]:
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base")



## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.
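
To make the shape of the batched preprocessing concrete, here is a minimal sketch that uses a toy whitespace tokenizer instead of the real one. `toy_tokenize` and `preprocess` are hypothetical names, and splitting on whitespace is only a stand-in for subword tokenization; the point is the prefix, the per-field truncation limits, and the `labels` field.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    # Hypothetical stand-in for a subword tokenizer: split on whitespace,
    # then truncate to at most `max_length` tokens.
    return text.split()[:max_length]


def preprocess(batch, max_input_len=512, max_target_len=150):
    # Inputs get the task prefix; targets are tokenized without it.
    inputs = [PREFIX + doc for doc in batch["text"]]
    return {
        "input_ids": [toy_tokenize(doc, max_input_len) for doc in inputs],
        "labels": [toy_tokenize(s, max_target_len) for s in batch["summarizer"]],
    }


batch = {"text": ["a b c d e"], "summarizer": ["a b"]}
out = preprocess(batch, max_input_len=4, max_target_len=1)
print(out)
```

Note that the prefix counts toward the input length limit, so very long articles lose slightly more of their tail than the limit alone suggests.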

In [45]:
train_dataset.column_names

['Unnamed: 0', 'summarizer', 'text']

In [46]:
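def preprocess_function(examples):
    # Prepend the T5-style task prefix to every article.
    inputs = ["summarize: " + doc for doc in examples["text"]]

    # Tokenize the articles, truncating to at most 512 tokens.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize the reference summaries as targets (at most 150 tokens).
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples["summarizer"], max_length=150, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs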


train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/6702 [00:00<?, ? examples/s]



Map:   0%|          | 0/1676 [00:00<?, ? examples/s]

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.
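
The idea of dynamic padding can be sketched in plain Python: each batch is padded only up to its own longest sequence, and label padding uses `-100` so that padded positions are ignored by the loss. The function `pad_batch` is a hypothetical illustration, not the library's implementation, and a pad token id of `0` is an assumption about the tokenizer.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_lab_pad = max_lab - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [8]},
    {"input_ids": [9], "labels": [10, 11]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which usually saves memory and compute.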

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [47]:
model = T5ForConditionalGeneration.from_pretrained("UBC-NLP/AraT5-base")

# Helper that converts every weight tensor to a contiguous memory layout
def make_contiguous(model):
    for param in model.parameters():
        if not param.is_contiguous():
            param.data = param.data.contiguous()

# Make the model's parameters contiguous in memory
make_contiguous(model)



In [48]:
print("Preparing the tool to handle the data...")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


Preparing the tool to handle the data...




## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [50]:
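training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate once per epoch (named `eval_strategy` in newer transformers releases)
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most 3 checkpoints on disk
    num_train_epochs=1,
    predict_with_generate=True,  # call generate() during evaluation
)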



## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [51]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [52]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,7.7154,7.122348


TrainOutput(global_step=1676, training_loss=8.885829006003878, metrics={'train_runtime': 820.8838, 'train_samples_per_second': 8.164, 'train_steps_per_second': 2.042, 'total_flos': 1576515858923520.0, 'train_loss': 8.885829006003878, 'epoch': 1.0})
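
The step count and throughput in the output above can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation (neither is set in the training arguments), and takes the 6702 training rows from the `Map` progress bar in Step 3.

```python
import math

train_rows = 6702      # training examples reported by the Map progress bar
batch_size = 4         # per_device_train_batch_size
epochs = 1             # num_train_epochs
runtime_s = 820.8838   # train_runtime from the TrainOutput above

# One optimizer step per batch; the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

steps_per_second = total_steps / runtime_s
samples_per_second = train_rows * epochs / runtime_s

print(total_steps)  # 1676
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The result matches `global_step=1676`, which confirms that the run covered exactly one pass over the training split.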

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [82]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


T5ForConditionalGeneration(
  (shared): Embedding(110080, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(110080, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo

In [85]:
sample_text ="كشف خبير مصري عن بعض التفاصيل التي تسربت من نتائج تحقيقات أحداث 7 أكتوبر الماضي"

In [86]:
# Alternative, longer sample (disabled):
#sample_text = "تابع الخبير المصري أن المسؤولين الإسرائيليين، تواصلا هاتفيا، واتفقا على أن ما يحدث عبارة عن مجرد تدريبات لحماس، ولذلك لم يتخذا أي إجراء، كما لم يرفعا حالة التأهب للقوات مثلما هو معتاد في هذا الأمر، مؤكدا أن الجيش الإسرائيلي فوجئ في الصباح التالي بالهجوم واكتشف مقتل العديد من جنوده وكان رد فعله بطيئاً، حيث اقتصر على تحريك المروحيات الأباتشي التي قتلت عدداً كبيراً من المشاركين في حفل موسيقي إسرائيلي خلال هروبهم"  # the text we want to summarize

# Tokenize the sample with the same "summarize: " prefix used during training.
inputs = tokenizer("summarize: " + sample_text, return_tensors="pt", max_length=512, truncation=True)

# Move the tensors to the model's device, then generate with the default settings.
inputs = {key: value.to(device) for key, value in inputs.items()}
summary_ids = model.generate(inputs["input_ids"])

In [87]:
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

أكد أن تصريح تصريح تصريح تصريح في تصريح في تصريح في تصريح في تصريح في تصريح في تصريح في
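
The output above loops over the same two words, which is a common failure of greedy decoding, especially when the model is undertrained. One standard remedy is n-gram blocking, which `generate` exposes through its `no_repeat_ngram_size` argument (often combined with `num_beams` and `max_new_tokens`). The sketch below illustrates the idea in plain Python; `banned_next_tokens` is a hypothetical helper, not the library's code, and the English tokens stand in for the repeated Arabic words.

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# "statement in statement" has already produced the bigram ("statement", "in"),
# so with n=2 the next token may not be "in" again.
history = ["statement", "in", "statement"]
print(banned_next_tokens(history, 2))
```

During decoding, the scores of the banned tokens are suppressed before the next token is picked, so the loop cannot continue in the same way. Whether this yields a useful summary here is untested; with a validation loss near 7, more training is likely needed as well.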
