# Fine-tuning

To fine-tune AraGPT2 for text summarization, we use the dataset file `arabic_texts_summaries.csv`.

#### *Fine-tuning Steps:*


1.   Load the dataset and split it into train, validation, and test sets.
2.   Create DataLoaders for the train and validation sets.
3.   Resize the model embeddings to match the new tokenizer length.
4.   Fine-tune the model on the train data, evaluating it on the validation data during training.
5.   Save the tokenizer and the fine-tuned model.
6.   Generate summaries for the test set, which is not used during fine-tuning.
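
Step 3 matters because adding special tokens makes the tokenizer larger than the embedding matrix of the pretrained checkpoint. The sketch below illustrates the resize on a tiny, randomly initialised GPT-2 so that it runs without downloading any weights. The configuration values are arbitrary assumptions, not AraGPT2's real sizes; in the notebook the call would presumably be `model.resize_token_embeddings(len(tokenizer))`.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model (assumed sizes, NOT the AraGPT2 configuration).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
new_vocab_size = old_vocab_size + 4  # room for four added special tokens

# Grow the input (and tied output) embedding matrix to the new vocabulary size.
model.resize_token_embeddings(new_vocab_size)
resized_rows = model.get_input_embeddings().weight.shape[0]

print(old_vocab_size, "->", resized_rows)
```

Without the resize, any token id at or above the old vocabulary size would index past the end of the embedding matrix.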



In [None]:
! git clone https://github.com/HoussamEddineBoukhalfa/Text-Summarization.git

Cloning into 'Text-Summarization'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 36 (delta 8), reused 29 (delta 5), pack-reused 0
Receiving objects: 100% (36/36), 19.13 KiB | 4.78 MiB/s, done.
Resolving deltas: 100% (8/8), done.


In [None]:
!pip install arabert

Collecting arabert
  Downloading arabert-1.0.1-py3-none-any.whl (179 kB)
Collecting PyArabic (from arabert)
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
Collecting farasapy (from arabert)
  Downloading farasapy-0.0.14-py3-none-any.whl (11 kB)
Collecting emoji==1.4.2 (from arabert)
  Downloading emoji-1.4.2.tar.gz (184 kB)
  Preparing metadata (setu

In [None]:
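from transformers import GPT2TokenizerFast, pipeline
# aragpt2-base uses the standard GPT-2 architecture, so the transformers class is used here.
# The Grover-based class (arabert.aragpt2.grover.modeling_gpt2.GPT2LMHeadModel) is only
# needed for the large and mega checkpoints; importing it here would shadow this import.
from transformers import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor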


In [None]:
from proj.src.utils_data import *
from proj.src.utils_tokenizer import *
from proj.src.train import *

In [None]:
max_length = 512
sum_length = 100
split_probability = 0.3

In [None]:
train, val, test = process_data("proj/data/arabic_texts_summaries.csv", max_length, sum_length, split_probability)

train size: 35
val size: 7
test size: 8
test head:
                                                 text  \
37  تدور أحداث هذا النص حول رحلة بحرية. يبدأ النص ...   

                            summary  text_len  
37  مغامرة بحرية تستكشف عجائب البحر        46  
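
The sizes printed above (35 / 7 / 8 out of 50 rows) are consistent with holding out 30% of the data and dividing the held-out part roughly in half. The implementation of `process_data` is not shown here, so the following is only a sketch of one split that reproduces those sizes; `split_indices` and its `seed` argument are hypothetical names.

```python
import random

def split_indices(n_rows, holdout_fraction, seed=0):
    """Hypothetical helper: shuffle row indices and return (train, val, test)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_holdout = round(n_rows * holdout_fraction)  # rows kept out of training
    n_val = n_holdout // 2                        # first half of the held-out rows

    holdout, train = indices[:n_holdout], indices[n_holdout:]
    val, test = holdout[:n_val], holdout[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(50, 0.3)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so the test rows stay unseen across reruns.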


In [None]:
# Add special tokens to the AraGPT2 tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('aubmindlab/aragpt2-base')

# Register BOS/EOS/PAD markers plus a custom <SUMMARIZE> token
special_tokens = {'bos_token': '<BOS>', 'eos_token': '<EOS>', 'pad_token': '<PAD>', 'additional_special_tokens': ['<SUMMARIZE>']}
tokenizer.add_special_tokens(special_tokens)

print('tokenizer len: {}'.format(len(tokenizer)))

# Pad token id, reused later as the ignore index
ignore_idx = tokenizer.pad_token_id


tokenizer len: 64004
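
The four added tokens suggest that each training example is laid out as a single sequence, with `<SUMMARIZE>` marking where the summary starts. How `tokenize_dataset` actually assembles the sequence is not shown, so the layout below is an assumption and both helper functions are hypothetical.

```python
BOS, EOS, SUMMARIZE = "<BOS>", "<EOS>", "<SUMMARIZE>"

def build_training_text(text, summary):
    """Assumed training layout: <BOS> text <SUMMARIZE> summary <EOS>."""
    return f"{BOS}{text}{SUMMARIZE}{summary}{EOS}"

def build_prompt(text):
    """Assumed inference prompt: the model continues after <SUMMARIZE>."""
    return f"{BOS}{text}{SUMMARIZE}"

example = build_training_text("article text", "short summary")
prompt = build_prompt("article text")
print(example)
print(prompt)
```

With this layout, generation at test time can stop as soon as the model emits `<EOS>`.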


In [None]:
import os

tokenizer_dir = "tokenizer_path_save"
os.makedirs(tokenizer_dir, exist_ok=True)  # Create the output directory if needed

max_seq_len = 768
tokenizer.save_pretrained(tokenizer_dir)
tokenizer_len = len(tokenizer)
print('ignore_index: {}'.format(ignore_idx))
print('max_len: {}'.format(max_seq_len))

train, val, test = tokenize_dataset(tokenizer, train, val, test, max_seq_len)


ignore_index: 64002
max_len: 768
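
Each tokenized row carries `input_ids` and an `attention_mask`, which implies that sequences are brought to a common length. The sketch below shows one way to pad or truncate to `max_len` with the pad id 64002 printed above. `pad_or_truncate` is a hypothetical helper, not the notebook's `tokenize_dataset`.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return fixed-length input_ids plus a mask of 1 (real token) / 0 (padding)."""
    ids = list(token_ids[:max_len])        # drop anything beyond max_len
    n_pad = max_len - len(ids)             # number of padding positions to add
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = pad_or_truncate([5, 6, 7], max_len=5, pad_id=64002)
long = pad_or_truncate(list(range(10)), max_len=5, pad_id=64002)
print(short)
print(long)
```

Because the pad id doubles as the ignore index, padded positions can be excluded from the loss when the labels are a copy of `input_ids`.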


In [None]:
# Generate train/val/test files
# Save the tokenized data
out_dir = "tokenizer_data"
processed_set = "dataset"
data_dir = os.path.join(out_dir, processed_set)
os.makedirs(data_dir, exist_ok=True)  # Create the output directory if needed

file = os.path.join(data_dir, "train.csv")
train.to_csv(file, index=False)

file = os.path.join(data_dir, "val.csv")
val.to_csv(file, index=False)

file = os.path.join(data_dir, "test.csv")
test.to_csv(file, index=False)

In [None]:
train['encodings']

48    [input_ids, attention_mask]
47    [input_ids, attention_mask]
20    [input_ids, attention_mask]
2     [input_ids, attention_mask]
32    [input_ids, attention_mask]
3     [input_ids, attention_mask]
0     [input_ids, attention_mask]
19    [input_ids, attention_mask]
26    [input_ids, attention_mask]
34    [input_ids, attention_mask]
33    [input_ids, attention_mask]
24    [input_ids, attention_mask]
49    [input_ids, attention_mask]
10    [input_ids, attention_mask]
28    [input_ids, attention_mask]
9     [input_ids, attention_mask]
22    [input_ids, attention_mask]
40    [input_ids, attention_mask]
35    [input_ids, attention_mask]
15    [input_ids, attention_mask]
18    [input_ids, attention_mask]
45    [input_ids, attention_mask]
4     [input_ids, attention_mask]
39    [input_ids, attention_mask]
27    [input_ids, attention_mask]
46    [input_ids, attention_mask]
21    [input_ids, attention_mask]
36    [input_ids, attention_mask]
31    [input_ids, attention_mask]
16    [input_i

In [None]:
train

    text_len                    encodings
48        49  [input_ids, attention_mask]
47        49  [input_ids, attention_mask]
20        49  [input_ids, attention_mask]
2         49  [input_ids, attention_mask]
32        49  [input_ids, attention_mask]
3         49  [input_ids, attention_mask]
0         49  [input_ids, attention_mask]
19        52  [input_ids, attention_mask]
26        49  [input_ids, attention_mask]
34        46  [input_ids, attention_mask]


## The columns

In [None]:
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

train_dataset, val_dataset = get_gpt2_dataset(train, val)

b = train_dataset[1]  # Check one data row

# Shuffle the training rows; keep the validation rows in their original order
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=1)
val_dataloader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=1)

train_loader_len = len(train_dataset)
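
The dataset class returned by `get_gpt2_dataset` is not shown. As a rough illustration of what such a wrapper might look like, the sketch below defines a hypothetical `EncodedSummaryDataset` over toy encodings and feeds it to a `DataLoader` in the same way as above. The class name and the toy data are assumptions.

```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler

class EncodedSummaryDataset(Dataset):
    """Hypothetical wrapper around a list of {'input_ids', 'attention_mask'} dicts."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        item = self.encodings[idx]
        # Convert the Python lists to tensors so the default collate can stack them.
        return {
            "input_ids": torch.tensor(item["input_ids"]),
            "attention_mask": torch.tensor(item["attention_mask"]),
        }

# Toy, already padded encodings (made-up ids, not real AraGPT2 tokens).
toy_encodings = [
    {"input_ids": [1, 2, 3, 0], "attention_mask": [1, 1, 1, 0]},
    {"input_ids": [4, 5, 0, 0], "attention_mask": [1, 1, 0, 0]},
    {"input_ids": [6, 7, 8, 9], "attention_mask": [1, 1, 1, 1]},
]

toy_dataset = EncodedSummaryDataset(toy_encodings)
toy_loader = DataLoader(toy_dataset, sampler=RandomSampler(toy_dataset), batch_size=1)

batch = next(iter(toy_loader))
print(batch["input_ids"].shape, len(toy_loader))
```

With `batch_size = 1`, the number of batches per epoch equals the number of rows, which is why the dataset length can stand in for the loader length.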