## Introduction

This notebook explores the performance of the AraGPT2 Mega model (`aubmindlab/aragpt2-mega`), a variant of the AraGPT2 model hosted on Hugging Face, when fine-tuned on the texts of Sheikh Al-Islam Ibn Taymiyyah, may Allah have mercy on him. The primary goal of this experiment is to evaluate how well AraGPT2 Mega can adapt to and generate text in the style of Sheikh Al-Islam.

In this notebook, we use several configurations to keep the training process affordable. The model is fine-tuned with LoRA (Low-Rank Adaptation), which freezes the pretrained weights and trains only small low-rank matrices added to selected layers, so only a small fraction of the parameters is updated. Additionally, the base model is loaded with 4-bit quantization (the NF4 data type), which reduces its memory footprint and computational requirements, usually with only a modest loss in quality.
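
To give a feel for why LoRA and 4-bit loading reduce the cost, the sketch below counts trainable parameters for one linear layer with and without LoRA, and estimates weight storage at different bit widths. The layer shape and the one-billion-parameter figure are made-up round numbers for illustration, not AraGPT2 Mega's actual dimensions, and the memory estimate ignores quantization constants, activations, and optimizer state.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in a full weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def weight_bytes(n_params: int, bits: int) -> int:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits // 8


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out, r = 1024, 1024, 4
full = dense_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, r)
print(f"full layer: {full} params, LoRA factors: {lora} params ({lora / full:.2%})")

# Hypothetical model size of one billion parameters.
n_params = 1_000_000_000
print(f"16-bit weights: {weight_bytes(n_params, 16) / 1e9:.1f} GB")
print(f" 4-bit weights: {weight_bytes(n_params, 4) / 1e9:.1f} GB")
```

The rank `r=4` matches the value used in the `LoraConfig` cell later in this notebook; everything else in the sketch is an assumption.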

The notebook is intended to provide a preliminary assessment of AraGPT2 Mega's capabilities when working with classical Arabic texts. By fine-tuning the model on these texts, we aim to gain insights into its potential for generating content that aligns with the style and substance of Sheikh Al-Islam Ibn Taymiyyah's writings.

This experiment is part of an ongoing effort to understand and refine the model's performance, and the results will help inform future improvements and applications.

In [None]:
! pip install -U transformers accelerate bitsandbytes
! pip install datasets huggingface_hub peft arabert



In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

from datasets import load_dataset
from arabert.preprocess import ArabertPreprocessor

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
import torch

In [None]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4',
                                # note the spelling: bnb_4bit_use_double_quant (unrecognized kwargs are ignored with a warning)
                                bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True)

check_point = 'aubmindlab/aragpt2-mega'

model = AutoModelForCausalLM.from_pretrained(check_point, trust_remote_code=True, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(check_point, trust_remote_code=True)
ara_preprocess = ArabertPreprocessor(check_point)

raw_dataset = load_dataset('ahmadAlrabghi/Ibn-Taymiyyahs-works-shamilah')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
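
As a rough intuition for what loading weights in 4 bits means, here is a toy blockwise quantizer. It is a simplified stand-in, not NF4: it uses uniformly spaced integer levels with an absmax scale per block, whereas NF4 uses non-uniform levels based on a normal distribution, which this sketch does not reproduce. The function names and the sample values are hypothetical.

```python
def quantize_block(values, n_bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    levels = 2 ** (n_bits - 1) - 1              # 7 for 4 bits, so codes lie in [-7, 7]
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, n_bits=4):
    """Map integer codes back to approximate float values."""
    levels = 2 ** (n_bits - 1) - 1
    return [c / levels * scale for c in codes]


# Made-up weights standing in for one block of a weight matrix.
block = [0.12, -0.50, 0.33, 0.05, -0.27, 0.41]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
print("max abs error:", round(max_err, 4))
```

Each weight is stored as a small integer code plus one shared scale per block, which is where the memory saving comes from; the rounding error is the price paid for it.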


## Fixing the Tokenizer


Issue:

The tokenizer doesn't have a dedicated pad token, so the pad token ID ends up equal to the EOS token ID.

Sharing one ID for both roles can cause strange behaviour during fine-tuning.

So we need to add a padding token manually and make sure its ID is unique.
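
One way the shared ID can hurt is through label masking. As far as I understand it, `DataCollatorForLanguageModeling(mlm=False)` copies `input_ids` into `labels` and replaces every position equal to the pad token ID with `-100`, which the loss ignores. The sketch below imitates that rule in plain Python; `make_labels` and the token IDs `11, 12, 13` are hypothetical, and it assumes the sequence contains a genuine EOS token.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking every position equal to pad_token_id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 0          # the EOS ID this notebook reports for the tokenizer
NEW_PAD_ID = 64000  # the ID the new [PAD] token receives in this notebook

# One sequence: three content tokens, a real EOS, then two padding slots.
shared = [11, 12, 13, EOS_ID, EOS_ID, EOS_ID]          # pad token == eos token
unique = [11, 12, 13, EOS_ID, NEW_PAD_ID, NEW_PAD_ID]  # dedicated pad token

labels_shared = make_labels(shared, pad_token_id=EOS_ID)
labels_unique = make_labels(unique, pad_token_id=NEW_PAD_ID)

print(labels_shared)  # the real EOS is masked together with the padding
print(labels_unique)  # the real EOS stays a training target
```

With a shared ID the model never receives a loss signal for producing EOS, so it may not learn where a text should stop; a dedicated pad token avoids that.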

In [None]:
print(f'tokenizer length before adding the padding token: {len(tokenizer)}')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # add the padding token
print(f'tokenizer length after adding the padding token: {len(tokenizer)}')

tokenizer length before adding the padding token: 64000
tokenizer length after adding the padding token: 64001


In [None]:
# resize the model's embedding matrix to the new tokenizer length so the added padding token gets an embedding
model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.pad_token_id  # update the pad token ID in model.config

In [None]:
# Verify the changes:
print(f'padding token ID in the tokenizer: {tokenizer.pad_token_id}')
print(f'padding token ID in the model config {model.config.pad_token_id}')
print(f'the padding token from the tokenizer: {tokenizer.pad_token}')
print(f'eos token ID in the model: {model.config.eos_token_id}, eos token ID in the tokenizer {tokenizer.eos_token_id}')

padding token ID in the tokenizer: 64000
padding token ID in the model config 64000
the padding token from the tokenizer: [PAD]
eos token ID in the model: 0, eos token ID in the tokenizer 0


In [None]:
# I considered setting the padding side to 'left' because we are dealing with the Arabic language :)
# tokenizer.padding_side

# When I did that, I faced some problems during training.
# The default value here is 'right', so I will stick with it.
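
A small sketch of what the padding side controls may help here. It refers to the position of pad tokens in the token sequence, not to the writing direction of the script, so right padding works for Arabic text as well. `pad_batch` is a hypothetical helper written for illustration; it is not the tokenizer's implementation.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-ID lists to the longest one and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        if side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        elif side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            raise ValueError("side must be 'right' or 'left'")
    return input_ids, attention_mask


PAD_ID = 64000  # ID of the [PAD] token added in this notebook
batch = [[5, 6, 7], [8]]  # made-up token IDs

ids_r, mask_r = pad_batch(batch, PAD_ID, side="right")
ids_l, mask_l = pad_batch(batch, PAD_ID, side="left")
print(ids_r, mask_r)
print(ids_l, mask_l)
```

Right padding keeps every sequence starting at position 0, which is the usual choice when training a causal language model; left padding is more commonly used for batched generation.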

Now our model and tokenizer are ready to work!

In [None]:
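def tokenizer_function(examples):
    # clean each text with the AraBERT preprocessor before tokenizing
    cleaned_text = [ara_preprocess.preprocess(text) for text in examples['text']]
    # pad to the longest sequence in the batch and truncate anything beyond 1024 tokens
    tokenized_data = tokenizer(cleaned_text, padding=True, max_length=1024, truncation=True)
    return tokenized_data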

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
train_dataset = raw_dataset['train'].map(tokenizer_function, batched=True)
eval_dataset = raw_dataset['test'].map(tokenizer_function, batched=True)

In [None]:
train_dataset

Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 244955
})

In [None]:
# use just 20% of the data that we have,
# because training the model on all of the text requires a lot of computational resources, and I am still testing the model and the data
sub_train = train_dataset.select(range(int(train_dataset.num_rows * 0.20)))
sub_train

Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 48991
})
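
Note that `select(range(...))` takes the first 20% of the rows in their stored order. If the dataset happens to be ordered by book (an assumption, I have not checked), that subset covers only the earliest works. The sketch below contrasts the two ways of picking indices using only the standard library; both helper names are hypothetical.

```python
import random


def first_fraction_indices(num_rows, fraction):
    """Indices of the leading fraction of rows, as select(range(...)) would use."""
    return list(range(int(num_rows * fraction)))


def random_fraction_indices(num_rows, fraction, seed=0):
    """A reproducible random sample of the same size, drawn from all rows."""
    k = int(num_rows * fraction)
    return sorted(random.Random(seed).sample(range(num_rows), k))


num_rows = 244955  # size of the training split shown above
head = first_fraction_indices(num_rows, 0.20)
sampled = random_fraction_indices(num_rows, 0.20, seed=42)

print(len(head), "rows, last index", head[-1])
print(len(sampled), "rows, last index", sampled[-1])
```

With the `datasets` library, a comparable effect should be achievable by calling `train_dataset.shuffle(seed=42)` before `select`.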

In [None]:
# list the model's layer names so we can apply LoRA to the critical ones
for name, module in model.named_modules():
    print(name)


transformer
transformer.wte
transformer.wpe
transformer.emb_norm
transformer.drop
transformer.h
transformer.h.0
transformer.h.0.ln_1
transformer.h.0.attn
transformer.h.0.attn.c_attn
transformer.h.0.attn.c_proj
transformer.h.0.attn.attn_dropout
transformer.h.0.attn.resid_dropout
transformer.h.0.ln_2
transformer.h.0.mlp
transformer.h.0.mlp.c_fc
transformer.h.0.mlp.c_proj
transformer.h.0.mlp.act
transformer.h.0.mlp.dropout
transformer.h.1
transformer.h.1.ln_1
transformer.h.1.attn
transformer.h.1.attn.c_attn
transformer.h.1.attn.c_proj
transformer.h.1.attn.attn_dropout
transformer.h.1.attn.resid_dropout
transformer.h.1.ln_2
transformer.h.1.mlp
transformer.h.1.mlp.c_fc
transformer.h.1.mlp.c_proj
transformer.h.1.mlp.act
transformer.h.1.mlp.dropout
transformer.h.2
transformer.h.2.ln_1
transformer.h.2.attn
transformer.h.2.attn.c_attn
transformer.h.2.attn.c_proj
transformer.h.2.attn.attn_dropout
transformer.h.2.attn.resid_dropout
transformer.h.2.ln_2
transformer.h.2.mlp
transformer.h.2.mlp.c_f

In [None]:
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=4, lora_alpha=32, lora_dropout=0.1, bias='none', task_type='CAUSAL_LM',
                         # these names target the self-attention layers in aragpt2-mega; they may vary from one model to another
                         # note: names are matched by suffix, so 'c_proj' also selects mlp.c_proj
                         target_modules = ["c_attn", "c_proj"])
model = get_peft_model(model, lora_config)
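
To see which layers the `target_modules` list actually selects, the sketch below applies a suffix rule to a few of the module names printed earlier. `matches_target` is a simplified imitation of how I understand PEFT to match a list of target names; it is not PEFT's own code. It also computes the LoRA scaling factor `lora_alpha / r` implied by the config.

```python
def matches_target(module_name, targets):
    """True if the module name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


# A few module names copied from the listing printed above.
module_names = [
    "transformer.h.0.ln_1",
    "transformer.h.0.attn.c_attn",
    "transformer.h.0.attn.c_proj",
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
]
targets = ["c_attn", "c_proj"]

selected = [name for name in module_names if matches_target(name, targets)]
for name in selected:
    print(name)

# Scaling applied to the low-rank update in standard LoRA.
r, lora_alpha = 4, 32
scaling = lora_alpha / r
print("scaling:", scaling)
```

Under this rule `c_proj` picks up the MLP output projection in addition to the attention output projection. If only the attention layers are wanted, more specific names such as `attn.c_proj` could be tried.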

In [None]:
training_args = TrainingArguments(learning_rate=2e-6, per_device_train_batch_size=64,
                                  num_train_epochs=3, output_dir='outputs',
                                  gradient_accumulation_steps=2, warmup_steps=round(500*0.2), # 100 warmup steps, roughly 25-30% of the first epoch; ideally keep it closer to 20%
                                  optim='paged_adamw_8bit', per_device_eval_batch_size=4, push_to_hub=True, hub_model_id = 'ahmadAlrabghi/AraGPT2-IbnTaymiyyah' )
trainer = Trainer(model=model, tokenizer=tokenizer, data_collator=data_collator,
                  args=training_args, train_dataset=sub_train, eval_dataset=eval_dataset)

model.config.use_cache = False # just while training
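
The relationship between batch size, gradient accumulation, and warmup can be checked with a little arithmetic. The sketch below assumes a single GPU and mirrors my understanding of how `Trainer` counts optimizer steps; the total it yields agrees with the `checkpoint-1149` directory that appears in the zip output further down. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_rows, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    batches_per_epoch = math.ceil(num_rows / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(
    num_rows=48991,        # size of sub_train
    per_device_batch=64,
    grad_accum=2,
    epochs=3,
)
warmup_steps = round(500 * 0.2)

print("effective batch size:", 64 * 2)
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print(f"warmup share of first epoch: {warmup_steps / steps_per_epoch:.1%}")
print(f"warmup share of whole run:   {warmup_steps / total_steps:.1%}")
```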

In [15]:
trainer.train()
trainer.push_to_hub()

  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss |
|------|---------------|
| 500  | 3.8436        |
| 1000 | 3.6861        |



CommitInfo(commit_url='https://huggingface.co/ahmadAlrabghi/AraGPT2-IbnTaymiyyah/commit/2c8d27bf37f8bb46f5f2855382f791470729ea90', commit_message='End of training', commit_description='', oid='2c8d27bf37f8bb46f5f2855382f791470729ea90', pr_url=None, pr_revision=None, pr_num=None)
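
The training losses logged at steps 500 and 1000 are easier to interpret as perplexities. The sketch below assumes the logged value is the mean token-level cross-entropy in nats, which is my understanding of what `Trainer` reports for a causal language model.

```python
import math

# Training loss values copied from the log above.
logged_loss = {500: 3.8436, 1000: 3.6861}

perplexity = {step: math.exp(loss) for step, loss in logged_loss.items()}
for step, ppl in perplexity.items():
    print(f"step {step}: loss {logged_loss[step]:.4f}, perplexity {ppl:.1f}")
```

These are training-set figures from a short run, so they only show that the loss is moving in the right direction; an evaluation loss on the test split would be a better basis for comparison.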

In [16]:
from google.colab import drive
drive.mount('/content/drive')  # mount Google Drive so files can be copied into it
# drive.flush_and_unmount()  # run after copying to flush pending writes and unmount

Mounted at /content/drive


In [18]:
# zip the file to download it
! zip -r outputs.zip outputs

  adding: outputs/ (stored 0%)
  adding: outputs/checkpoint-1149/ (stored 0%)
  adding: outputs/checkpoint-1149/optimizer.pt (deflated 17%)
  adding: outputs/checkpoint-1149/vocab.json (deflated 75%)
  adding: outputs/checkpoint-1149/adapter_model.safetensors (deflated 41%)
  adding: outputs/checkpoint-1149/training_args.bin (deflated 51%)
  adding: outputs/checkpoint-1149/merges.txt (deflated 77%)
  adding: outputs/checkpoint-1149/README.md (deflated 66%)
  adding: outputs/checkpoint-1149/tokenizer.json (deflated 80%)
  adding: outputs/checkpoint-1149/adapter_config.json (deflated 51%)
  adding: outputs/checkpoint-1149/tokenizer_config.json (deflated 76%)
  adding: outputs/checkpoint-1149/scheduler.pt (deflated 56%)
  adding: outputs/checkpoint-1149/rng_state.pth (deflated 25%)
  adding: outputs/checkpoint-1149/trainer_state.json (deflated 56%)
  adding: outputs/checkpoint-1149/special_tokens_map.json (deflated 48%)
  adding: outputs/checkpoint-1149/added_tokens.json (stored 0%)
  add
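
For readers who prefer to stay in Python rather than shell out to `zip`, the same kind of archive can be produced with the standard library. This is a minimal sketch run on a throwaway directory; `zip_directory` and the `checkpoint-demo` folder are hypothetical and stand in for the real `outputs` directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(directory, archive_stem):
    """Zip `directory` (keeping its own name as the top-level folder) into <archive_stem>.zip."""
    directory = Path(directory)
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(directory.parent), base_dir=directory.name,
    )


# Demonstrate on a temporary directory instead of the real training outputs.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "outputs" / "checkpoint-demo"
    demo_dir.mkdir(parents=True)
    (demo_dir / "adapter_config.json").write_text("{}")

    archive = zip_directory(Path(tmp) / "outputs", Path(tmp) / "outputs")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

In the notebook itself this would correspond to something like `zip_directory('outputs', 'outputs')` run from `/content`.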

In [19]:
 ! pwd

/content


In [26]:
# download the archive to the local machine through the browser
from google.colab import files
files.download('/content/outputs.zip')


In [28]:
# send the file to google drive
! cp /content/outputs.zip /content/drive/MyDrive