# 📘 Welcome to the Arabic Text Summarization Project

## Overview
Welcome, students 👩‍🎓👨‍🎓, to our exciting journey into the world of Natural Language Processing (NLP)! In this project, we'll be delving into the fascinating task of text summarization with a focus on the Arabic language 📚. Our goal is to develop a model that can efficiently summarize Arabic text, making it easier to grasp the essence of large documents quickly 🚀.

## Project Objectives
- **Understanding Text Summarization**: Learn the fundamentals of how text summarization works 📝.
- **Exploring NLP Models**: Get hands-on experience with advanced NLP models like AraGPT2 🤖.
- **Model Fine-Tuning and Training**: Discover how to fine-tune pre-trained models on a custom dataset for specific tasks like summarization 🧠.
- **Practical Application**: Apply your knowledge to build a model that can summarize Arabic texts 🌐.

## Dataset
We'll be using a custom dataset of Arabic texts and their summaries 📖. This dataset will allow us to train our model to understand and generate concise summaries.

We generated this dataset using ChatGPT 😜
If you've read this sentence, send me a message.




## ⚠️ **Important: Use GPU Runtime** ⚠️

To ensure this notebook functions correctly and efficiently, it is **crucial to use a GPU runtime**. Follow these steps to enable GPU acceleration:

1. **Open Runtime settings**: At the top of the page, click on `Runtime` in the menu bar. 🔄

2. **Change the runtime type**: In the dropdown menu, select `Change runtime type`. 🛠️

3. **Select GPU as the hardware accelerator**: In the dialog that appears, under `Hardware accelerator`, choose `GPU T4` from the dropdown menu. 🖥️

4. **Save the settings**: Click `Save` to apply the changes. 💾

By enabling GPU, the computations in this notebook will be significantly faster, especially for tasks like training neural networks, processing large datasets, or performing complex calculations.


## PART1: Load AraGPT2

Using the link below, learn how to load araGPT2 base model.

https://huggingface.co/aubmindlab/aragpt2-base

In [None]:
!pip install arabert

Collecting arabert
  Downloading arabert-1.0.1-py3-none-any.whl (179 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting PyArabic (from arabert)
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.4/126.4 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting farasapy (from arabert)
  Downloading farasapy-0.0.14-py3-none-any.whl (11 kB)
Collecting emoji==1.4.2 (from arabert)
  Downloading emoji-1.4.2.tar.gz (184 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m185.0/185.0 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... [?25l[?25hdone
  Created wheel for emoji: filename=emoji-1.4.2-py3-none-any.whl size=186460 sha256=90c624f3535be0439dfab0f1b2179416be25a

In [None]:
from transformers import GPT2TokenizerFast, pipeline
from transformers import GPT2LMHeadModel
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor

In [None]:
#TODO: Complete this cell
MODEL_NAME= 'aubmindlab/aragpt2-base'  # indicates the use of a GPT-2 model trained by the "AubMindLab" on Arabic language data.
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text="الجزائر بلد"
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation",model=model,tokenizer=tokenizer)

#feel free to try different decoding settings
generation_pipeline(text,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty = 3.0,
    no_repeat_ngram_size = 3)[0]['generated_text']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/553M [00:00<?, ?B/s]

Some weights of the model checkpoint at aubmindlab/aragpt2-base were not used when initializing GPT2LMHeadModel: ['ln_f.weight', 'ln_f.bias']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at aubmindlab/aragpt2-base and are newly initialized: ['emb_norm.weight', 'emb_norm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.50M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

'الجزائر بلد الى و بشكلجب انون هي القيمة ، ولا انها شكل أنها هى فيه العلامة شبه البيان انهرم الأم بصورة كافة هو قيمة فينإ أن عن الأسرة تماما الاسرة النظام نصفأن بأكملهالا�بن شكله الأصل والمدر هما جميع الممثلةالإ على علاء بأنحيات وكذلك عبرله تعتبرلتالات والاسرة الأفراد فهي نظ وكل ى ممثلة كما المستقبل إلى عليه وبقية جيهان وتلك ربيتى يعتبرذا وتعتبر وهو فى له مختلف الافراد فيصلنت وج الشكل كل أيضا وباقي بالتأكيد العائلة كبير 7 27 نور نوراالت 11 ون منه نظام ايضا القمة مثل كلها جاالس القاعدةج حتى ضوء كثيرة 2 تلكث او فإن وهي الاصل ويتم بأكملها بقية الدولة لذلك آ أنه الله وسائر منهم عموما عمر نتمنى الملك مريم بالاضافة 23انته علية المراجعالل� الامة منذ الملاك جمعاء بوصول شعب بواسطةتي بما الل 2017 البند لة كيت كاملا فإنها مجموع فهو كذلك [ين لانعأت بالكامل نه منها الشعب والو 3ح يعكس وبعض'

### Print AraGPT Model and analyze the architecture

# TODO: print AraGPT2

In [None]:
# Print model information
print(f"Model Name: {MODEL_NAME}")
print(f"Model Type: {type(model)}")
print(f"Tokenizer Type: {type(tokenizer)}")
print(f"Model Configuration: {model.config}")

# Print model architecture details
print(f"Model Architecture: {model}")

Model Name: aubmindlab/aragpt2-base
Model Type: <class 'arabert.aragpt2.grover.modeling_gpt2.GPT2LMHeadModel'>
Tokenizer Type: <class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>
Model Configuration: GPT2Config {
  "_name_or_path": "aubmindlab/aragpt2-base",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 0,
  "do_sample": true,
  "embd_pdrop": 0.1,
  "eos_token_id": 0,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_length": 50,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "no_repeat_ngram_size": 3,
  "num_beams": 5,
  "reorder_and_upcast_attn": false,
  "repetition_penalty": 3.0,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summar

## PART2: Fine-tuning

To fine-tune AraGPT2 for text summarization, we use the file `arabic_texts_summaries.csv`

#### *Fine-tuning Steps:*


1.   Load datasets and split it into train/test
2.   Create Datalaoders of train and val.
3.   Resize model embeddings for new tokenizer length.
4.   Fine-tuning model by passing train data and evaluating it on val data during training.
5.   Store the tokenizer and fine-tuned model.
6.   Generate summaries for test set which is not used during fine tune.



In [27]:
from utils_data import *
from utils_tokenizer import *
from train import *

In [2]:
max_length = 1000
sum_length = 50
split_probability = 0.7

In [10]:
#train, val, test = process_data("data/arabic_texts_summaries.csv",max_length , sum_length, split_probability)
train, val, test = process_data("arabic_texts_summaries.csv",max_length , sum_length, split_probability)

# Affichage des informations sur les parties
print("Taille de la partie Training:", len(train))
print("Taille de la partie Validation:", len(val))
print("Taille de la partie Test:", len(test))

train size: 15
val size: 17
test size: 18
test head:
                                                text  \
5  تدور أحداث هذا النص حول مهرجان ثقافي. يبدأ الن...   

                                           summary  text_len  
5  الاحتفال بمهرجان ثقافي يعرض فنون وثقافات متنوعة        46  
Taille de la partie Training: 15
Taille de la partie Validation: 17
Taille de la partie Test: 18


In [11]:
# Add token to AraGPT2 tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('aubmindlab/aragpt2-base')

special_tokens = {'bos_token':'<BOS>', 'eos_token':'<EOS>', 'pad_token':'<PAD>', 'additional_special_tokens':['<SUMMARIZE>']}
tokenizer.add_special_tokens(special_tokens)

print('tokenizer len: {}'.format(len(tokenizer)))

ignore_idx = tokenizer.pad_token_id


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


vocab.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.50M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

tokenizer len: 64004


In [12]:
# TODO: apply tokenizer
import os

tokenizer_dir ="tokenizer_path_save"
if not os.path.exists(tokenizer_dir):
  os.makedirs(tokenizer_dir) # Create output directory if needed

max_seq_len = 768
tokenizer.save_pretrained(tokenizer_dir)
tokenizer_len = len(tokenizer)
print('ignore_index: {}'.format(ignore_idx))
print('max_len: {}'.format(max_seq_len))

train, val, test = tokenize_dataset(tokenizer, train, val, test, max_seq_len) # Fix tokenize_dataset function in utils_tokenizer and call it

# Afficher de nouveau les informations sur les parties
print("Taille de la partie Training:", len(train))
print("Taille de la partie Validation:", len(val))
print("Taille de la partie Test:", len(test))

ignore_index: 64002
max_len: 768
Taille de la partie Training: 15
Taille de la partie Validation: 17
Taille de la partie Test: 18


In [15]:
#Generate train/val/test files
#save tokenized data
out_dir="tokenizer_data"
processed_set= "dataset"
data_dir = os.path.join(out_dir, processed_set)
if not os.path.exists(data_dir):
  os.makedirs(data_dir) # Create output directory if needed
file = os.path.join(data_dir,"train.csv")
train.to_csv(file, index=False)

file = os.path.join(data_dir,"val.csv")
val.to_csv(file, index=False)

file = os.path.join(data_dir,"test.csv")
test.to_csv(file, index=False)

In [16]:
# TODO: Visualize train and explain each column
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv('tokenizer_data/dataset/train.csv')

# Display basic information about the DataFrame
print(df.info())

#Display the content of train.csv file
print("\n Display the content of train.csv file \n")
print(df)

#for index, row in df.iterrows():
#   print(row)

#for colonne in df.columns:
#    print(f"Contenu de la colonne '{colonne}':")
#    print(df[colonne])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text_len   15 non-null     int64 
 1   encodings  15 non-null     object
dtypes: int64(1), object(1)
memory usage: 368.0+ bytes
None

 Display the content of train.csv file 

    text_len                                          encodings
0         49  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
1         52  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
2         46  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
3         46  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
4         46  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
5         49  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
6         49  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
7         46  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...
8         49  {'input_ids': [64000, 8908, 5368, 542, 3499, 1...


In [21]:
# TODO: Data Loaders
# Fix code in utils_data.py

import torch
train_dataset, val_dataset= get_gpt2_dataset(train, val) # call function get_gpt2_dataset

b = train_dataset.__getitem__(0) # check one data row

train_dataloader = DataLoader(train_dataset, sampler = RandomSampler(train_dataset), batch_size = 1)
val_dataloader = DataLoader(val_dataset, sampler = SequentialSampler(val_dataset), batch_size = 1)

train_loader_len =len(train_dataloader)
print(f"The length of train_loader is {train_loader_len}")


The length of train_loader is 15


In [28]:
config = {
    "out_dir": "./output_directory",
    "training_models": "./training_models_directory",
    "final_model": "my_final_model.pth",
}

# fine tune pretrained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_dir =  'aubmindlab/aragpt2-base'

train = Train(device, model_dir, tokenizer_len, ignore_idx, train_loader_len, config)
train.train_model(train_dataloader, val_dataloader)


AttributeError: 'dict' object has no attribute 'out_dir'