In [1]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [2]:
!pip install PyArabic
!pip install nltk

Collecting PyArabic
  Downloading PyArabic-0.6.15-py3-none-any.whl.metadata (10 kB)
Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
Installing collected packages: PyArabic
Successfully installed PyArabic-0.6.15


# Fine-tuning a Model for a Summarization Task

In this task, you will load, preprocess, and fine-tune an mT5 model on a dataset of Arabic dialogues for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `yalsaffar/mt5-small-Arabic-Summarization` for both the model and the tokenizer. If you run into problems with it, you can fall back to `google-t5/t5-small`, but the first checkpoint is the intended one.
- **Dataset**: You will be using the `CUTD/arabic_dialogue_df` dataset. Make sure to load and preprocess it correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [3]:
from datasets import load_dataset

#Load the dataset
dataset = load_dataset("CUTD/arabic_dialogue_df")

#Split the dataset into training and testing sets
dataset = dataset["train"].train_test_split(test_size=0.2)
train_dataset = dataset["train"]
test_dataset = dataset["test"]
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


arabic_dialogue_df.csv:   0%|          | 0.00/16.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/15000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'dialogue', 'summary'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['Unnamed: 0', 'dialogue', 'summary'],
        num_rows: 3000
    })
})
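Because `train_test_split` shuffles before splitting, the rows that land in each split change from run to run unless a seed is fixed (the method accepts a `seed` argument). The sketch below is a plain-Python illustration of a seeded 80/20 hold-out, not the `datasets` implementation. `holdout_split` is a hypothetical helper used only to show the idea and the resulting 12,000/3,000 split sizes.

```python
import random

def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # same seed -> same shuffle
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# 15,000 rows, as reported when the train split was generated
train_idx, test_idx = holdout_split(15000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the helper again with the same seed returns the same indices, which is what makes later evaluation numbers comparable between runs.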

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the given model checkpoint.

In [4]:
from transformers import AutoTokenizer

#Initialize a tokenizer
tokenizer = AutoTokenizer.from_pretrained("yalsaffar/mt5-small-Arabic-Summarization")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/833 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/416 [00:00<?, ?B/s]

## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [5]:
import re
import pyarabic.araby as araby
import nltk

# Combined preprocessing function
def preprocess_text(text):
    #Removing links (URLs):
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    #Removing special characters and punctuation:
    text = re.sub(r'[^\w\s]', '', text)

    #Removing Arabic diacritics (Tashkeel) and elongated letters (Tatweel):
    text = araby.strip_tashkeel(text)
    text = araby.strip_tatweel(text)

    #Normalizing Hamza:
    text = araby.normalize_hamza(text)

    return text
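The order of the cleaning steps matters slightly. In Python's `re`, `\w` does not match combining marks, so the punctuation pattern `[^\w\s]` already removes Arabic diacritics before `strip_tashkeel` runs, while tatweel (U+0640) counts as a letter and is only removed by `strip_tatweel`. The standard-library sketch below reproduces just the two regex steps to show this. `regex_clean` is a hypothetical name, and the PyArabic calls are deliberately left out.

```python
import re

def regex_clean(text):
    """The two regex steps of preprocess_text, without the PyArabic calls."""
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'[^\w\s]', '', text)
    return text

MEEM = "\u0645"     # an Arabic letter
FATHA = "\u064E"    # an Arabic diacritic (combining mark)
TATWEEL = "\u0640"  # the elongation character

with_link = regex_clean("see https://example.com now!")
with_fatha = regex_clean(MEEM + FATHA)
with_tatweel = regex_clean(MEEM + TATWEEL)

print(repr(with_link))          # URL and "!" removed, both spaces kept
print(with_fatha == MEEM)       # the diacritic is stripped by the regex
print(TATWEEL in with_tatweel)  # tatweel survives the regex step
```

Note that removing a URL leaves the spaces on both sides of it in place, so the cleaned text can contain doubled spaces.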

In [6]:
#Define a preprocessing function
def preprocess_function(examples):
    # Apply the combined preprocessing on dialogues and summaries
    inputs = ["summarize: " + preprocess_text(doc) for doc in examples["dialogue"]]
    summaries = [preprocess_text(summary) for summary in examples["summary"]]

    # Tokenize inputs and summaries
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=summaries, max_length=128, truncation=True)

    # Labels will be the tokenized summaries
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Preprocess the dataset
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [7]:
from transformers import DataCollatorForSeq2Seq

# Define the data collator.
# `model` expects a model object, not a checkpoint name, and the model is only
# loaded in Step 5, so the argument is omitted here.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
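To make "dynamically pads inputs and labels" concrete, here is a simplified plain-Python stand-in for what a seq2seq collator does per batch. It is not the library implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption (it is the usual value for T5-family tokenizers). Labels are padded with -100, the collator's default label pad value, so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # -100 is skipped by the cross-entropy loss, so padding is not scored
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

batch = pad_batch([
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "labels": [4, 3, 1]},
])
print(batch)
```

Padding to the longest example in each batch, rather than to a fixed `max_length`, keeps short batches small and saves computation.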

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [8]:
from transformers import AutoModelForSeq2SeqLM

#Load the Pretrained Model
model = AutoModelForSeq2SeqLM.from_pretrained("yalsaffar/mt5-small-Arabic-Summarization")

config.json:   0%|          | 0.00/896 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/206 [00:00<?, ?B/s]

## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [9]:
#Define Training Arguments
from transformers import Seq2SeqTrainingArguments

#training configuration with parameters like learning rate, batch size, and number of epochs
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
)



## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [10]:
from transformers import Seq2SeqTrainer

#Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [11]:
#Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,3.236,2.672254


Non-default generation parameters: {'max_length': 128, 'num_beams': 15, 'length_penalty': 0.6, 'no_repeat_ngram_size': 2}


TrainOutput(global_step=750, training_loss=3.1659915364583333, metrics={'train_runtime': 230.2427, 'train_samples_per_second': 52.119, 'train_steps_per_second': 3.257, 'total_flos': 6494797687357440.0, 'train_loss': 3.1659915364583333, 'epoch': 1.0})
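The numbers in `TrainOutput` follow directly from the configuration. As a quick sanity check, assuming a single device and no gradient accumulation (neither is set in the training arguments above), the step count and throughput can be recomputed by hand:

```python
import math

train_rows = 12000      # size of the training split
batch_size = 16         # per_device_train_batch_size
epochs = 1              # num_train_epochs
runtime_s = 230.2427    # train_runtime reported in TrainOutput

steps = math.ceil(train_rows / batch_size) * epochs
samples_per_second = round(train_rows * epochs / runtime_s, 3)
steps_per_second = round(steps / runtime_s, 3)

print(steps, samples_per_second, steps_per_second)
```

These values agree with `global_step=750`, `train_samples_per_second=52.119`, and `train_steps_per_second=3.257`. With only 750 optimizer steps at a learning rate of 2e-5, the model has seen each example once, so more epochs are a natural first thing to try.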

In [13]:
# Save the model
model.save_pretrained('/content/model')

# Save the tokenizer
tokenizer.save_pretrained('/content/model')

Non-default generation parameters: {'max_length': 128, 'num_beams': 15, 'length_penalty': 0.6, 'no_repeat_ngram_size': 2}


('/content/model/tokenizer_config.json',
 '/content/model/special_tokens_map.json',
 '/content/model/spiece.model',
 '/content/model/added_tokens.json',
 '/content/model/tokenizer.json')

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [17]:
# Use a pipeline as a high-level helper
from transformers import pipeline

sample_text = """
يحكي أن مجموعة من الأرانب الجميلة كانوا يعيشون معا في الغابة وكانوا دائما يتعاونون معا في جلب الطعام ويقتسموه معا وكانوا يساعدون بعضهم البعض في كافة الأمور،
وكانوا يحبون بعضهم حب شديد، وفي أحد الأيام هجم ثعلب كبير على الارانب وقال لهم سوف اعيش معكم واكون سيد هذا الغابة وانتم ستكونون عبيد لي تخدموني وتحضروا لي الطعام والشراب،
ومن سيخلف اوامري سوف اكله.\n
وبعد أن كانت حياة الارانب مليئة بالفرحة والسعادة والامل أصبحوا في غاية الحزن والتعب حيث كان الثعلب يعزبهم ويضربهم ويجعلهم يخدمونه طوال اليوم ويحرمهم من الطعام،
ومن يعترض على الظلم يقوم الثعلب بتعذيبه وحبسه، ظل الارانب على هذا الحال فترة طويلة حتي أصبحوا ضعفاء ليس لديهم القدرة على ذل الثعلب لهم.\n
انتصار الارانب\n
وفي يوم خرج الثعلب من الغابة لكي يتنزه مع اصدقائه الثعالب فتجمع الارانب مع بعضهم واتفقوا على أن يتوحدوا ويقفوا في وجه الثعلب كي يرحل عنهم ويعيشون في سلام وفرح مثلما كانوا.\n
وبالفعل تجمع جميع الارانب الكبار منهم والصغار ولم يتخلف أحد وقاموا بعمل خطة للتخلص من الثعلب، حيث قاموا بنصب الشباك على أبواب الغابة التي سيدخل منها الثعلب،
وانقسموا الى مجموعات واختبوا خلف اشجار الغابة لكي يراقبوا الثعلب حين يدخل في الفخ.\n
وبالفعل وقع الثعلب في الفخ وفرح الارانب كثيرا وقاموا بحبسه فظل يتوسل لهم أن يرحل خارج مدينتهم فتركوه يرحل وبذلك انتصر الخير وعاد الارانب يعيشون في سعادة وتعاون من جديد.\n
الارنب الشقي\n
يحكي أن ارنب كان يعيش مع والدته وفي يوم طلب الارنب من والدته أن يذهب ليلعب في الغابة فرفضت امه خوفا عليه من الثعلب،
فاستغل الارنب الصغير عدم انتباه امه واسرع في الخروج من المنزل، و ظل يلعب في الغابة وهو سعيد وظل يشم الورود فراه الثعلب وظل يطارده لكي يستغل الوقت المناسب للهجوم عليه.\n
واستغل الثعلب توقف الارنب الصغير عن اللعب وجلوسه اسفل شجرة فحاول الهجوم عليه ولكن الارنب الصغير تمكن من الهرب واختبي في جحر صغير وظل مختبئ فيه حتى ابتعد الثعلب
وعاد إلى البيت مسرعا ، فوجد امه تبكي عليه من شدة القلق وعندما رأته حضنته بقوة وقالت له أين كنت يا بني؟.\n
فحكي الارنب الصغير لأمه ما حدث معه واعتذر لها لأنه لم يسمع كلامها وخرج من البيت دون أن يخبرها وظل يقبل يدها،
فقالت له الام انا اسمحك ولكن لا تعيد الأمر مرة اخرى.
"""

summ = pipeline("text2text-generation", model="/content/model", tokenizer='/content/model')

result = summ(sample_text)

print(result)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'generated_text': 'هجم ثعلب كبير على مجموعة من الأرانب الجميلة التي كانوا يعيشون معا في الغابة'}]
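Two things differ between training and the inference call above. The training inputs were cleaned with `preprocess_text` and prefixed with "summarize: ", while the sample text is passed raw, and the warning notes that the pipeline runs on CPU because no `device` was given. A hedged sketch of a wrapper that applies the same preparation is shown below. `summarize` is a hypothetical helper, and the stand-in summarizer exists only so the example runs without the fine-tuned model. Whether this changes the generated summary has not been checked here.

```python
def summarize(text, summarizer, clean=lambda t: t, prefix="summarize: "):
    """Prepare text the way the training inputs were prepared, then summarize."""
    prepared = prefix + clean(text)
    return summarizer(prepared)[0]["generated_text"]

# Stand-in with the same output shape as a text2text-generation pipeline
def echo_summarizer(prompt):
    return [{"generated_text": prompt[:40]}]

demo = summarize("  some dialogue text  ", echo_summarizer, clean=str.strip)
print(demo)

# With the notebook's own objects (not run here):
# summ = pipeline("text2text-generation", model="/content/model",
#                 tokenizer="/content/model", device=0)  # device=0: first GPU
# print(summarize(sample_text, summ, clean=preprocess_text))
```

Passing the pipeline in as an argument keeps the preparation logic separate from the model, so the same helper works with any callable that returns a list of `{"generated_text": ...}` dictionaries.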
