
# Fine-tuning a Model for Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of Arabic dialogues for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `UBC-NLP/AraT5-base` for both the model and the tokenizer. If you run into problems with it, you may fall back to `google-t5/t5-small`, but `UBC-NLP/AraT5-base` is the intended choice.
- **Dataset**: Use the `CUTD/arabic_dialogue_df` dataset. Make sure to load and preprocess it correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [None]:
! pip install datasets



In [None]:
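from datasets import load_dataset
# Slice the "train" split: first 80% for training, last 20% held out for testing.
dataset = load_dataset("CUTD/arabic_dialogue_df", split='train[:80%]')
test_dataset = load_dataset("CUTD/arabic_dialogue_df", split='train[80%:]')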
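
Step 1 asks for a 20% test split. Slicing by position takes the test rows from one end of the file, so a shuffled split is often preferred. The following is a minimal sketch of a seeded shuffle-and-split in plain Python; `split_indices`, `test_fraction`, and `seed` are illustrative names, not part of the `datasets` API.

```python
import random

def split_indices(num_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) after a seeded shuffle of the row indices."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    num_test = int(num_rows * test_fraction)
    # The first `num_test` shuffled indices form the test set, the rest train.
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

In the `datasets` library, `Dataset.train_test_split(test_size=0.2, seed=42)` performs an equivalent shuffled split and returns a `DatasetDict` with `train` and `test` keys.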

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the given model checkpoint.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# This notebook loads the title-generation variant of AraT5 rather than `UBC-NLP/AraT5-base`.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")





## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [None]:
dataset

Dataset({
    features: ['Unnamed: 0', 'dialogue', 'summary'],
    num_rows: 3000
})

In [None]:
def preprocess_data(examples):
  # Tokenize the dialogues (model inputs), truncating anything beyond 1024 tokens.
  model_inputs = tokenizer(examples["dialogue"], max_length=1024, truncation=True)

  # Tokenize the summaries as targets; `text_target` replaces the deprecated `as_target_tokenizer()`.
  labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

In [None]:
# Drop all original columns (including 'Unnamed: 0') so only the tokenized fields remain.
train_data = dataset.map(preprocess_data, batched=True, remove_columns=dataset.column_names)
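
Step 3 mentions an optional "summarize:" prefix, which the preprocessing function above does not add. The following is a minimal sketch of that step in plain Python. `add_prefix` is a hypothetical helper, and whether the prefix actually helps with this checkpoint is an assumption to verify empirically.

```python
def add_prefix(texts, prefix="summarize: "):
    """Prepend the task prefix to every input; missing entries become empty text."""
    return [prefix + (text or "") for text in texts]

prefixed = add_prefix(["مرحبا", None])
print(prefixed)
```

Inside `preprocess_data`, the result of `add_prefix(examples["dialogue"])` would be passed to the tokenizer in place of the raw dialogues.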

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [None]:
from transformers import DataCollatorForSeq2Seq

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding=True,
)
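
To make "dynamically pads inputs and labels" concrete, here is a simplified sketch in plain Python of what such a collator does for one batch. It is not the library's implementation. The function name `pad_batch` is illustrative, `pad_token_id=0` is an assumption (it is the T5 convention), and `-100` is the default label padding value used by `DataCollatorForSeq2Seq`.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_input - len(f["input_ids"])
        n_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # The mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # -100 is ignored by PyTorch's cross-entropy loss, so padded label
        # positions do not contribute to the training loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch

batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [8]},
    {"input_ids": [9], "labels": [10, 11]},
])
print(batch)
```

Because the padding length is computed per batch, short batches stay short, which saves memory compared with padding every example to a fixed `max_length`.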

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")

## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [None]:
# Evaluation is left at its default ("no"), so only training runs.
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=1
)

# Seq2SeqTrainer is the Trainer subclass intended for sequence-to-sequence models.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=data_collator
)



OutOfMemoryError: CUDA out of memory. Tried to allocate 324.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 15877 has 14.74 GiB memory in use. Of the allocated memory 13.71 GiB is allocated by PyTorch, and 923.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
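
If the GPU memory in the error above is held by earlier runs in the same session, restarting the runtime is the first thing to try. Beyond that, out-of-memory errors are commonly addressed by lowering the per-device batch size or the input `max_length`, and by using gradient accumulation to keep the effective batch size unchanged. The following sketch shows only the arithmetic, assuming a single GPU. `gradient_accumulation_steps` is a real `TrainingArguments` parameter; the helper names are illustrative.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_examples, per_device_batch,
                              accumulation_steps, num_devices=1):
    """Approximate optimizer updates per pass over the data (rounded up)."""
    per_update = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_examples / per_update)

# The notebook's setting: batch size 4, no accumulation.
print(effective_batch_size(4, 1))
# Same effective batch size, but only one example in memory per forward pass.
print(effective_batch_size(1, 4))
# Optimizer updates for the 3000 training rows under the second setting.
print(optimizer_steps_per_epoch(3000, 1, 4))
```

Under these assumptions, `per_device_train_batch_size=1` with `gradient_accumulation_steps=4` gives the same number of updates per epoch as the original configuration while holding fewer activations in memory at once.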

## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [None]:
trainer.train()

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [None]:
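# English: "I went to the market, then I hit a cat with my car and it died, and I was deeply saddened."
text = "ذهبت الى السوق ثم صدمت قط بسيارتي ومات وحزنت حزنا شديدا"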

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Summary:", decoded_output)


Generated Summary: رجل يغامر بسيارة ب100 مليون دولار
