# Building the Final Model for Al-Tadmuriyah

In this notebook, I build the final model for Al-Tadmuriyah using the best hyperparameter configuration found by the grid search in the previous notebooks. The goal is to fine-tune `aubmindlab/aragpt2-large` with 4-bit quantization and LoRA adapters so that it generates text close to the language style of Shaykh al-Islam Ibn Taymiyyah in Al-Tadmuriyah.

With these hyperparameters, I aim to get the best performance the grid search found on this small, specialized dataset (81 rows), so that the model becomes proficient at predicting and generating text in the context of Al-Tadmuriyah. Once trained, the model is uploaded to the Hugging Face Hub, where it is available for further experimentation and use by the broader community.

In this notebook, you will find:

- The implementation of the best-performing hyperparameter combination.
- The fine-tuning process for the Al-Tadmuriyah model.
- The steps for uploading the final trained model to the Hugging Face Hub.

In [1]:
! pip install -U transformers accelerate BitsAndBytes datasets
! pip install huggingface-hub arabert peft

Collecting BitsAndBytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)

In [2]:
# Sign in so the model can be uploaded to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from datasets import load_dataset
import torch
from arabert.preprocess import ArabertPreprocessor

In [4]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16)

check_point = 'aubmindlab/aragpt2-large'
arabert_prep = ArabertPreprocessor(model_name=check_point)
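
To make the effect of `bnb_4bit_use_double_quant=True` more concrete, the sketch below estimates how many extra bits per weight the quantization constants cost, with and without double quantization. The block sizes are assumptions taken from the description in the QLoRA paper (64 weights per scaling constant, 256 constants per second-level constant); they are not read from `bnb_config`, and the helper name `overhead_bits_per_weight` is hypothetical.

```python
# Rough estimate of the storage overhead of quantization constants, in bits per weight.
# The block sizes below are assumptions based on the QLoRA paper, not values
# taken from this notebook's bnb_config.

WEIGHT_BLOCK = 64      # assumed: number of weights sharing one scaling constant
CONSTANT_BLOCK = 256   # assumed: number of constants sharing one second-level constant


def overhead_bits_per_weight(double_quant: bool) -> float:
    """Extra bits stored per weight on top of the 4-bit code itself."""
    if not double_quant:
        # One 32-bit scaling constant per block of weights.
        return 32 / WEIGHT_BLOCK
    # Constants are stored in 8 bits, plus one 32-bit constant per block of constants.
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


plain = overhead_bits_per_weight(double_quant=False)
double = overhead_bits_per_weight(double_quant=True)
print(f"without double quantization: {4 + plain:.3f} bits per weight")
print(f"with double quantization:    {4 + double:.3f} bits per weight")
```

Under these assumptions, double quantization saves a little under 0.4 bits per weight, which adds up for a checkpoint of this size.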

In [5]:
model = AutoModelForCausalLM.from_pretrained(check_point, quantization_config=bnb_config,
                                             trust_remote_code=True)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(check_point, trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

configuration_aragpt2.py:   0%|          | 0.00/11.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/aubmindlab/aragpt2-large:
- configuration_aragpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_aragpt2.py:   0%|          | 0.00/83.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/aubmindlab/aragpt2-large:
- modeling_aragpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/3.20G [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.50M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.52M [00:00<?, ?B/s]



In [6]:
print(f'tokenizer length before adding the padding token: {len(tokenizer)}')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # add a dedicated padding token
print(f'tokenizer length after adding the padding token: {len(tokenizer)}')

tokenizer length before adding the padding token: 64000
tokenizer length after adding the padding token: 64001


In [7]:
# Resize the model's embedding matrix to match the tokenizer so the new padding token gets an embedding
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id  # set the pad token ID in model.config

In [8]:
# Verify the changes:
print(f'padding token ID in the tokenizer: {tokenizer.pad_token_id}')
print(f'padding token ID in the model config {model.config.pad_token_id}')
print(f'the padding token from the tokenizer: {tokenizer.pad_token}')
print(f'eos token ID in the model: {model.config.eos_token_id}, eos token ID in the tokenizer {tokenizer.eos_token_id}')
print(f'the tokenizer len:{len(tokenizer)}, the model input embedding layer:{model.get_input_embeddings()}')
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.pad_token, tokenizer.eos_token

padding token ID in the tokenizer: 64000
padding token ID in the model config 64000
the padding token from the tokenizer: [PAD]
eos token ID in the model: 0, eos token ID in the tokenizer 0
the tokenizer len:64001, the model input embedding layer:Embedding(64001, 1280)


In [9]:
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # Since GPT-2 is a causal language model
    inference_mode=False,          # Set to True if only doing inference
    r=64,                           # Rank of the LoRA matrices
    lora_alpha=128,                 # Scaling factor
    lora_dropout=0.0,              # Dropout probability
    target_modules=[
        "attn.c_attn",  # Self-attention projection (q, k, v)
        "attn.c_proj",  # Self-attention output projection
        "mlp.c_fc",     # MLP intermediate projection
        "mlp.c_proj"    # MLP output projection
    ]
)
model = get_peft_model(model, lora_config)
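
As a sanity check on the LoRA configuration, the sketch below counts the trainable parameters it implies. The hidden size of 1280 comes from the `Embedding(64001, 1280)` output above; the number of transformer blocks (36) and the 4x feed-forward width are assumptions about the architecture that this notebook does not show, and the helper `lora_params` is a hypothetical name.

```python
# Count the trainable LoRA parameters implied by r=64 on the four target modules.
# HIDDEN is taken from the embedding shape printed earlier; N_LAYERS and MLP_WIDTH
# are assumptions about the base model, not values shown in this notebook.

HIDDEN = 1280
N_LAYERS = 36            # assumed number of transformer blocks
MLP_WIDTH = 4 * HIDDEN   # assumed GPT-2 style feed-forward width
RANK = 64
LORA_ALPHA = 128

# (input features, output features) of each targeted projection
TARGETS = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),  # fused q, k, v projection
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_fc": (HIDDEN, MLP_WIDTH),
    "mlp.c_proj": (MLP_WIDTH, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * N_LAYERS
size_mb = total * 4 / 1e6          # size if stored as 32-bit floats
scaling = LORA_ALPHA / RANK        # factor applied to the low-rank update

print(f"per layer: {per_layer:,}  total: {total:,}  ~{size_mb:.0f} MB  scaling: {scaling}")
```

Under these assumptions the adapter has about 47.2 million trainable parameters, or roughly 189 MB in 32-bit floats, which is in line with the size of the `adapter_model.safetensors` file shown in the upload output near the end of this notebook.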

In [10]:
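def tokenizer_function(examples):
    # Clean each text with the AraBERT preprocessor, then tokenize the whole batch
    cleaned_text = [arabert_prep.preprocess(text) for text in examples['combined']]
    return tokenizer(cleaned_text, padding=True, truncation=True, max_length=1024)

# mlm=False selects causal language modeling (labels are built from input_ids)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)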
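
The plain-Python sketch below illustrates what the collator does with `mlm=False`: the labels are a copy of `input_ids`, and every position holding the padding token ID is replaced by -100 so the loss ignores it. It also shows why a dedicated `[PAD]` token is useful compared with reusing the end-of-sequence token. The token IDs in the example sequences are made up, and `make_labels` is a hypothetical helper, not part of the library.

```python
# Sketch of label construction for causal language modeling.
# Labels copy input_ids; padding positions become -100 so the loss skips them.

IGNORE_INDEX = -100
PAD_ID = 64000   # ID of the [PAD] token added earlier in this notebook
EOS_ID = 0       # eos token ID printed earlier in this notebook


def make_labels(input_ids, pad_id):
    """Return labels with every padding position masked out."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]


# Made-up sequence padded with the dedicated [PAD] token:
padded_with_pad = [17, 523, 9, EOS_ID, PAD_ID, PAD_ID]
labels_pad = make_labels(padded_with_pad, pad_id=PAD_ID)

# The same sequence if the eos token were reused for padding:
padded_with_eos = [17, 523, 9, EOS_ID, EOS_ID, EOS_ID]
labels_eos = make_labels(padded_with_eos, pad_id=EOS_ID)

print(labels_pad)  # the real eos token stays in the labels
print(labels_eos)  # the real eos token is masked together with the padding
```

With a separate padding token, any end-of-sequence token in the data still contributes to the loss; when padding reuses the eos ID, it is masked along with the padding.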

In [11]:
from datasets import load_dataset
raw_dataset = load_dataset('ahmadAlrabghi/al_tadmoreyyah')
print(raw_dataset)
train_dataset = raw_dataset.select_columns(['combined'])
train_dataset = train_dataset.map(tokenizer_function, batched=True)
print(train_dataset)

README.md:   0%|          | 0.00/531 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/220k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/81 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'page', 'title', 'text', 'cleaned_text', 'len_cleand', 'combined', 'len_combined'],
        num_rows: 81
    })
})


Map:   0%|          | 0/81 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['combined', 'input_ids', 'attention_mask'],
        num_rows: 81
    })
})


In [15]:
! rm -r sample_data ahmadAlrabghi logs.zip logs


rm: cannot remove 'sample_data': No such file or directory
rm: cannot remove 'logs.zip': No such file or directory


In [16]:
lr = 2e-4
batch_size=2
grad_accum_steps=2
weight_decay=0.0

lora_alpha=128
lora_r=64
lora_dropout=0
datasize = 81
warm_up  = datasize / batch_size  # about one epoch of batches; rounded when passed as warmup_steps

output_dir = f"./models/al_tadmoreyyah_lr{lr}_bs{batch_size}_wa{weight_decay}_ga{grad_accum_steps}_r{lora_r}_alpha{lora_alpha}_dropout{lora_dropout}"

# Save the logs to a run-specific directory so they can be inspected later in TensorBoard:
logging_dir = f"./logs/al_tadmoreyyah_lr{lr}_bs{batch_size}_wa{weight_decay}_ga{grad_accum_steps}_r{lora_r}_alpha{lora_alpha}_dropout{lora_dropout}"



training_args = TrainingArguments(
    output_dir='ahmadAlrabghi/al_tadmoreyyah_model_public',
    per_device_train_batch_size=batch_size,
    num_train_epochs=60,
    weight_decay=weight_decay,
    learning_rate=lr,
    gradient_accumulation_steps=grad_accum_steps,
    logging_dir=logging_dir,
    logging_steps=10,
    logging_strategy='steps',
    save_steps=20,
    warmup_steps=round(warm_up),
    optim='paged_adamw_8bit',
    push_to_hub=False,
    hub_model_id='ahmadAlrabghi/al_tadmoreyyah_model_public',
    report_to='tensorboard',
    # log_level='info', # uncomment for more output info
    evaluation_strategy='no'
)



trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    data_collator=data_collator,
    tokenizer=tokenizer
)
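
The sketch below shows the learning-rate curve these arguments would produce, assuming the default linear scheduler (the notebook does not set `lr_scheduler_type`): a linear ramp from 0 to `lr` over the warm-up steps, then a linear decay to 0. The total of 1200 steps is the `global_step` reported by this run, and `lr_at` is a hypothetical helper written for illustration.

```python
# Linear warm-up followed by linear decay, assuming the default "linear" scheduler.

BASE_LR = 2e-4
WARMUP_STEPS = round(81 / 2)   # 40: Python rounds 40.5 to the nearest even integer
TOTAL_STEPS = 1200             # global_step reported by this training run


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to BASE_LR
        return BASE_LR * step / WARMUP_STEPS
    # Decay linearly from BASE_LR to 0 over the remaining steps
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 20, 40, 620, 1200):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Because the warm-up covers only 40 of the 1200 updates, almost the whole run is spent in the decay phase.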



In [17]:
trainer.train()

  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss |
| --- | --- |
| 10 | 2.5915 |
| 20 | 2.6542 |
| 30 | 2.3571 |
| 40 | 2.3026 |
| 50 | 2.0788 |
| 60 | 2.3052 |
| 70 | 2.3361 |
| 80 | 2.0877 |
| 90 | 1.9095 |
| 100 | 1.9047 |


TrainOutput(global_step=1200, training_loss=0.3708080008625984, metrics={'train_runtime': 7575.0205, 'train_samples_per_second': 0.642, 'train_steps_per_second': 0.158, 'total_flos': 1.358651951259648e+16, 'train_loss': 0.3708080008625984, 'epoch': 58.53658536585366})
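
The `global_step=1200` and `epoch` of about 58.54 in the output above can be reproduced with a little arithmetic. The sketch assumes that the leftover batch at the end of each epoch does not trigger an extra optimizer update, which matches the numbers reported by this run; other versions of the `Trainer` may count that trailing batch differently.

```python
import math

DATASET_SIZE = 81
BATCH_SIZE = 2
GRAD_ACCUM = 2
EPOCHS = 60

# 81 rows in batches of 2 gives 41 batches per epoch (the last one has a single row)
batches_per_epoch = math.ceil(DATASET_SIZE / BATCH_SIZE)

# Assumption: only complete groups of GRAD_ACCUM batches count as an update
updates_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)

total_updates = updates_per_epoch * EPOCHS

# Epoch value implied by the number of batches consumed over all updates
reported_epoch = total_updates * GRAD_ACCUM / batches_per_epoch

# Number of rows contributing to each optimizer update
effective_batch = BATCH_SIZE * GRAD_ACCUM

print(batches_per_epoch, updates_per_epoch, total_updates, reported_epoch, effective_batch)
```

Under this assumption, the 60 requested epochs correspond to 1200 updates with an effective batch size of 4, and the reported epoch value stays slightly below 60.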

In [18]:
# Remove the [PAD] token and shrink the embeddings back to the original vocabulary size before saving the model
vocab = tokenizer.get_vocab()
if '[PAD]' in vocab:
    vocab.pop('[PAD]')

new_vocab = list(vocab.keys())

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer.name_or_path,
    vocab=new_vocab, trust_remote_code=True
)

model.config.pad_token_id = None
model.resize_token_embeddings(len(tokenizer))



Embedding(64000, 1280)

In [22]:
# model.save_pretrained('ahmadAlrabghi/al_tadmoreyyah_model_public_adapter')
model.push_to_hub('ahmadAlrabghi/al_tadmoreyyah_model_public_adapter',
                  tags=
                   [
                       f'lr:{lr}', f'epochs:{int(training_args.num_train_epochs)}', f'lora-dropout:{lora_dropout}', f'train-batch:{batch_size}',
                       f'optim: 8bit-adam', f'weight-decay:{weight_decay}', f'gradient_accumulation_steps:{grad_accum_steps}',
                       f'lora-r:{lora_r}', f'lora-alpha:{lora_alpha}'
                       ]
                  )

IsADirectoryError: [Errno 21] Is a directory: 'ahmadAlrabghi/al_tadmoreyyah_model_public_adapter'

In [None]:
trainer.push_to_hub(
    model_name='ahmadAlrabghi/al_tadmoreyyah_model_public',
    # add tags to differentiate between the models
    tags=
        [
            f'lr:{lr}', f'epochs:{int(training_args.num_train_epochs)}', f'lora-dropout:{lora_dropout}', f'train-batch:{batch_size}',
            f'optim: 8bit-adam', f'weight-decay:{weight_decay}', f'gradient_accumulation_steps:{grad_accum_steps}',
            f'lora-r:{lora_r}', f'lora-alpha:{lora_alpha}'
            ]
    )

adapter_model.safetensors:   0%|          | 0.00/189M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.24k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/ahmadAlrabghi/al_tadmoreyyah_model_public/commit/f76527376dbbf609d0a7b08997879f750f6c9223', commit_message='End of training', commit_description='', oid='f76527376dbbf609d0a7b08997879f750f6c9223', pr_url=None, pr_revision=None, pr_num=None)

In [23]:
! zip -r logs.zip logs
from google.colab import files
files.download('logs.zip')

  adding: logs/ (stored 0%)
  adding: logs/al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0/ (stored 0%)
  adding: logs/al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0/events.out.tfevents.1726308167.e78eac08e8b9.413.1 (deflated 68%)

