### The goal of this relatively small notebook is to take a step in a larger project.

Here, I am showing the results of the Grid Search applied to models trained on the texts from Al-Tadmoreyyah by Shaykh al-Islam Ibn Taymiyyah. This is part of a broader project aimed at training a model to become proficient in understanding and generating texts across all the works of Shaykh al-Islam.

## Specific Objectives:
*Show performance*: Display the performance of some of the best hyperparameter combinations from the grid search on Al-Tadmoreyyah.

*Analyze changes*: Highlight how changes in the most critical hyperparameters affect the model’s performance.

In [1]:
import tensorboard
import os 
import re
import shutil

%load_ext tensorboard

In [45]:
source_path = 'logs'


hyper_dict = {'lr': [5e-5, 2e-4],
              'bs': [8, 4, 2],
              'wa': [0.01, 0.0],
              'ga': [2],
              'r': [32, 64],
              'alpha': [64, 128],
              'dropout': [0.0]}

# The original string we want to modify
fixed_para = 'al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0.0'
params = fixed_para.split('_')[2:]
fixed_para, hyper_dict

('al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0.0',
 {'lr': [5e-05, 0.0002],
  'bs': [8, 4, 2],
  'wa': [0.01, 0.0],
  'ga': [2],
  'r': [32, 64],
  'alpha': [64, 128],
  'dropout': [0.0]})
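As a rough cross-check on the size of the search space, the full grid implied by `hyper_dict` can be enumerated with `itertools.product`. This is only a sketch: the helper `grid_run_names` is hypothetical, and it assumes that every run directory follows the same naming pattern as `fixed_para` (prefix, then each parameter name immediately followed by its value). Under that assumption the grid contains 2 × 3 × 2 × 1 × 2 × 2 × 1 = 48 configurations, which agrees with the 48 models mentioned later in this notebook.

```python
from itertools import product

hyper_dict = {'lr': [5e-5, 2e-4],
              'bs': [8, 4, 2],
              'wa': [0.01, 0.0],
              'ga': [2],
              'r': [32, 64],
              'alpha': [64, 128],
              'dropout': [0.0]}

fixed_para = 'al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0.0'


def grid_run_names(grid, prefix='al_tadmoreyyah'):
    """Build one run name per point of the grid (hypothetical helper).

    Assumes names look like '<prefix>_<key><value>_<key><value>...'.
    """
    keys = list(grid)
    names = []
    for values in product(*(grid[k] for k in keys)):
        parts = [f"{k}{v}" for k, v in zip(keys, values)]
        names.append("_".join([prefix] + parts))
    return names


all_runs = grid_run_names(hyper_dict)
print(len(all_runs))           # size of the full grid
print(fixed_para in all_runs)  # the reference run is one point of the grid
```

The cells below do not use the full grid. They keep `fixed_para` as the reference run and vary one hyperparameter at a time.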

### Code Explanation

This script is designed to generate variations of a given fixed parameter string (`fixed_para`) by modifying one parameter at a time based on a set of predefined values stored in `hyper_dict`. The goal is to change each parameter (like learning rate `lr`, batch size `bs`, etc.) to every possible value in the dictionary, while keeping the other parameters constant.

#### Key Steps:

1. **Extract Parameters:**
   - The code first splits the `fixed_para` string to isolate each parameter and its corresponding value. For example, `lr0.0002` becomes `lr` and `0.0002`.

2. **Avoid Repetition:**
   - Before modifying a parameter, the code checks if the new value is the same as the current value in `fixed_para`. If it is, the code skips that value to avoid redundant combinations.

3. **Generate Combinations:**
   - For each parameter (e.g., `lr`, `bs`), the code loops over its possible values in `hyper_dict`. If the new value is different from the current one, it replaces the old value in `fixed_para` and stores the new string.

4. **Output:**
   - All unique combinations are printed and stored in a list called `combinations`. Each combination reflects the original string but with one parameter altered.

This ensures that the script generates only valid, non-repetitive variations of the original string.
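The steps above can also be sketched in a slightly more defensive form. This is not the code used in this notebook, only an illustration: the helpers `parse_run_name` and `one_at_a_time` are hypothetical names, and the sketch assumes every token in the run name is a run of letters followed by a numeric value. Unlike a plain string comparison, it compares values numerically, so a reference run written as `lr5e-05` is also handled correctly.

```python
import re

hyper_dict = {'lr': [5e-5, 2e-4],
              'bs': [8, 4, 2],
              'wa': [0.01, 0.0],
              'ga': [2],
              'r': [32, 64],
              'alpha': [64, 128],
              'dropout': [0.0]}

fixed_para = 'al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha128_dropout0.0'


def parse_run_name(name, prefix='al_tadmoreyyah_'):
    """Split a run name into an ordered list of (parameter, value-string) pairs."""
    pairs = []
    for token in name[len(prefix):].split('_'):
        # Leading letters are the parameter name; the rest is its value.
        match = re.fullmatch(r'([A-Za-z]+)(.+)', token)
        pairs.append((match.group(1), match.group(2)))
    return pairs


def one_at_a_time(name, grid, prefix='al_tadmoreyyah_'):
    """Return run names that differ from `name` in exactly one parameter."""
    pairs = parse_run_name(name, prefix)
    variants = []
    for i, (key, current) in enumerate(pairs):
        for value in grid.get(key, []):
            # Numeric comparison: '5e-05' and 0.00005 count as the same value.
            if float(value) == float(current):
                continue
            changed = pairs[:i] + [(key, str(value))] + pairs[i + 1:]
            variants.append(prefix + '_'.join(k + v for k, v in changed))
    return variants


variants = one_at_a_time(fixed_para, hyper_dict)
for run_name in variants:
    print(run_name)
print(len(variants))
```

Rebuilding the name from parsed pairs, rather than splitting the string on the token being replaced, avoids surprises if one token happens to be a substring of another.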


In [17]:
combinations = []

# Loop that changes the value of one parameter at a time
for param in params:
    # Extract the parameter name (e.g. lr, bs, wa)
    para = re.split(r'\d+', param)[0]
    
    # The current value of this parameter in the original string
    current_value = re.findall(r'\d+\.?\d*', param)[0]
    
    # Make sure the parameter exists in the dictionary
    if para in hyper_dict:
        for value in hyper_dict[para]:
            # Check whether the new value is the same as the current value
            if str(value) == current_value:
                continue  # Skip if the values match
            
            # Modify the string
            # First part, before the parameter
            start = fixed_para.split(param)[0] + para + str(value)
            # Last part, after the parameter
            last = fixed_para.split(param)[1]
            # Build the new string
            combination = start + last
            # print('combination:', combination)
            combinations.append(combination)

# Print all generated combinations
print("\nAll unique combinations generated:")
for combo in combinations:
    print(combo)
print(len(combinations))


All unique combinations generated:
al_tadmoreyyah_lr5e-05_bs2_wa0.0_ga2_r64_alpha128_dropout0.0
al_tadmoreyyah_lr0.0002_bs8_wa0.0_ga2_r64_alpha128_dropout0.0
al_tadmoreyyah_lr0.0002_bs4_wa0.0_ga2_r64_alpha128_dropout0.0
al_tadmoreyyah_lr0.0002_bs2_wa0.01_ga2_r64_alpha128_dropout0.0
al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r32_alpha128_dropout0.0
al_tadmoreyyah_lr0.0002_bs2_wa0.0_ga2_r64_alpha64_dropout0.0
6


In [83]:
import shutil
import os

# Destination directory
destination = r'F:\language_model_project\al_tadmorehhay_model\test_hyper_parameters'

# Ensure the destination directory exists
os.makedirs(destination, exist_ok=True)

# Loop through each run name and copy its log directory to the destination
for dir_name in combinations:
    # Full path of the source run directory inside the logs folder
    dir_path = os.path.join(r'F:\language_model_project\al_tadmorehhay_model\logs', dir_name)
    # Form the full path in the destination
    dest_dir = os.path.join(destination, dir_name)
    
    # Copy the entire directory tree (files and subdirectories).
    # dirs_exist_ok=True (Python 3.8+) allows this cell to be re-run safely.
    shutil.copytree(dir_path, dest_dir, dirs_exist_ok=True)
    
    print(f"Copied {dir_path} to {dest_dir}")

Copied F:\language_model_project\al_tadmorehhay_model\logs\al_tadmoreyyah_lr5e-05_bs2_wa0.0_ga2_r64_alpha128_dropout0.0 to F:\language_model_project\al_tadmorehhay_model\test_hyper_parameters\al_tadmoreyyah_lr5e-05_bs2_wa0.0_ga2_r64_alpha128_dropout0.0
Copied F:\language_model_project\al_tadmorehhay_model\logs\al_tadmoreyyah_lr0.0002_bs8_wa0.0_ga2_r64_alpha128_dropout0.0 to F:\language_model_project\al_tadmorehhay_model\test_hyper_parameters\al_tadmoreyyah_lr0.0002_bs8_wa0.0_ga2_r64_alpha128_dropout0.0
Copied F:\language_model_project\al_tadmorehhay_model\logs\al_tadmoreyyah_lr0.0002_bs4_wa0.0_ga2_r64_alpha128_dropout0.0 to F:\language_model_project\al_tadmorehhay_model\test_hyper_parameters\al_tadmoreyyah_lr0.0002_bs4_wa0.0_ga2_r64_alpha128_dropout0.0
Copied F:\language_model_project\al_tadmorehhay_model\logs\al_tadmoreyyah_lr0.0002_bs2_wa0.01_ga2_r64_alpha128_dropout0.0 to F:\language_model_project\al_tadmorehhay_model\test_hyper_parameters\al_tadmoreyyah_lr0.0002_bs2_wa0.01_ga2_r64_

In [107]:
# The idea here is to show how changing one of the critical hyperparameters affects the model's performance.
# I checked the original logs and came to the conclusion that these are the most critical ones.
# You can check the original logs (48 models) yourself if you want.
%tensorboard --logdir test_hyper_parameters

### Conclusion: Specialization of the Model for *Tadmuriyyah*

Through iterative experimentation with hyperparameters, the model's ability to specialize in the *Tadmuriyyah* text has been significantly enhanced. By increasing the learning rate and the rank (`r`) and lowering `alpha`, the model has demonstrated an improved capacity to internalize the unique linguistic and stylistic features of *Tadmuriyyah*. This adjustment of hyperparameters has allowed the model to deliberately overfit to the training data, capturing the intricate patterns present in the text.

As a result, the model has evolved into a highly specialized tool, tailored specifically for *Tadmuriyyah*, offering more accurate predictions and better text generation that aligns closely with the distinct features of the book. These findings highlight the importance of fine-tuning hyperparameters to achieve a deep understanding of the text, making the model an invaluable resource for future analyses and applications related to *Tadmuriyyah*.

In [5]:
%tensorboard --logdir logs --port 6007

kill: 6006: No such process


In [None]:
First notebook (data extraction):
## Introduction

In this notebook, I am working with a classic Islamic text, *Al-Tadmuriyah* by Sheikh al-Islam Ibn Taymiyyah, which I have obtained from the Al-Shamila library in HTML format. The purpose of this notebook is to extract relevant text and information from this HTML file and then clean the data for further use. The extraction process will involve using Beautiful Soup to parse the HTML code, allowing me to retrieve important elements such as the text of the book, page numbers, and the book title.

The data extracted from this text contains some inconsistencies and issues that must be addressed through data cleaning techniques. Once the data is cleaned and properly formatted, it will be saved as a CSV file. This cleaned data will later be uploaded to the Hugging Face datasets for further use in another notebook, where it will be fed into a large language model for analysis and exploration.
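The cleaning and export step described above might look roughly like the following sketch. It does not reproduce the actual cleaning rules used in that notebook, which are not listed here. The helpers `clean_text` and `save_pages`, the column names `page` and `text`, and the two cleaning rules (removing the tatweel elongation character and collapsing whitespace) are all assumptions chosen for illustration.

```python
import csv
import re

TATWEEL = "\u0640"  # Arabic elongation character, often removed during normalization


def clean_text(text):
    """Apply two illustrative cleaning rules (assumed, not taken from the project)."""
    text = text.replace(TATWEEL, "")   # drop tatweel characters
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace and newlines
    return text.strip()


def save_pages(pages, path):
    """Write (page_number, text) pairs to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["page", "text"])
        for number, text in pages:
            cleaned = clean_text(text)
            if cleaned:                # skip pages that are empty after cleaning
                writer.writerow([number, cleaned])


# Small demonstration on a made-up string
print(repr(clean_text("  first   line \n second line  ")))
```

Writing the file with `encoding="utf-8"` and `newline=""` keeps Arabic text intact and avoids blank rows on Windows when the CSV is read back.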

This process is part of a broader project aimed at preserving and making accessible the writings of influential scholars in the digital age, particularly through the use of advanced natural language processing techniques.


Second notebook (building the baseline model, then performing the grid search after seeing that the baseline model performed poorly):
## Project Overview


This project aims to fine-tune the AraGPT2-large model on classical Islamic texts, specifically focusing on the book Al-Tadmuriyah by Shaykh al-Islam Ibn Taymiyyah. The goal is to have the model become highly specialized in predicting and generating text within this specific book, even if it leads to overfitting. Overfitting is acceptable and even encouraged for this task, as the primary objective is for the model to accurately represent and generate the unique language style and content of *Al-Tadmuriyah*.


To achieve this, I have followed these steps:

1. Data Collection: The text of Al-Tadmuriyah was scraped and processed for training. The code used for this scraping can be found on my [GitHub repository](#), and the processed dataset is available on Hugging Face datasets for public access.
   
2. Model Selection: I used the AraGPT2-large model from Hugging Face, a powerful language model for Arabic. The model was quantized to 4-bit precision to reduce computational load while still retaining its performance potential.

3. Training and Fine-Tuning: I trained the model for 50 epochs, initially observing poor performance. To address this, I performed a grid search over several hyperparameters to identify the best configuration for this task.

4. Model Evaluation and Logs: The results from different configurations are logged and visualized using TensorBoard. All models generated from the grid search are available on my [Hugging Face repository](#).

### What to Expect

In this notebook, you will find:
- Details on the data preparation process, including how the dataset was scraped and preprocessed.
- The fine-tuning steps applied to AraGPT2-large on Al-Tadmuriyah.
- A discussion of the initial training results, followed by the grid search for optimal hyperparameters.
- Insights into the performance of the model across different configurations, along with links to the trained models.
- Logs and visualizations of the training process using TensorBoard.



