# Introduction
In this notebook, we'll see how to fine-tune a Transformers model on a language modeling task to generate Arabic text. We cover one language modeling task:


- Causal language modeling: the model has to predict the next token in the sentence, so the label for each position is the token that follows it (the inputs shifted by one position). To make sure the model does not cheat, its attention computations are masked so that tokens cannot attend to tokens to their right, as this would result in label leakage.
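The masking and label shifting described above can be illustrated with a small pure-Python sketch. It is not how Transformers implements attention; the helper names `causal_mask` and `next_token_pairs` and the token IDs are made up for this illustration.

```python
def causal_mask(seq_len):
    """Entry [i][j] is 1 if position i may attend to position j, else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Pair the prefix visible at each position with the token to predict."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]


tokens = [11, 22, 33, 44]  # hypothetical token IDs
mask = causal_mask(len(tokens))
pairs = next_token_pairs(tokens)

# Each row only has ones up to and including its own position.
for row in mask:
    print(row)

# Each prefix is paired with the token that follows it.
for prefix, target in pairs:
    print(prefix, "->", target)
```

Every row of the mask is zero to the right of the diagonal, which is what prevents a position from seeing the token it is asked to predict.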


Install the required libraries

In [None]:
# Install (or upgrade) everything in one step instead of installing, uninstalling, and reinstalling
!pip install --upgrade datasets accelerate "transformers[sentencepiece]"

Import the required libraries

In [None]:
import math

import transformers
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Preparing the dataset

Load a corpus and split it into training and validation sets. The corpus is available at the following link: https://sourceforge.net/projects/ksucca-corpus/files/

##### Split the corpus into training and test sets

In [None]:
# Load the text document as a single string
# (assumes the corpus file is UTF-8 encoded, which is typical for Arabic text)
with open('/content/sample_data/aa1.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into a list of lines using the newline character as the delimiter
lines = text.split('\n')

# Split the lines array into training and test datasets
train_lines, test_lines = train_test_split(lines, test_size=0.2, random_state=42)

# Join the training and test datasets into strings
train_text = '\n'.join(train_lines)
test_text = '\n'.join(test_lines)

# Write the training and test datasets to separate UTF-8 text files
with open('/content/sample_data/train.txt', 'w', encoding='utf-8') as f:
    f.write(train_text)

with open('/content/sample_data/test.txt', 'w', encoding='utf-8') as f:
    f.write(test_text)

Load the dataset

In [None]:
datasets = load_dataset("text", data_files={"train": '/content/sample_data/train.txt', "validation": '/content/sample_data/test.txt'})

# Causal Language modeling

Causal language modeling is a natural language processing task in which the model predicts the most likely next token given the preceding context. The "causal" part of the term refers to the fact that the model generates text in a forward direction: each prediction depends only on the tokens that come before it in the sequence.
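As a rough illustration of forward, token-by-token generation, the sketch below runs a greedy decoding loop over a hand-written score table. The table `NEXT_TOKEN_SCORES`, its vocabulary, and the function `generate_greedy` are hypothetical. A real causal language model conditions on the whole prefix, whereas this toy table only looks at the last token.

```python
# Hypothetical next-token scores keyed by the last token of the context.
NEXT_TOKEN_SCORES = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "sleeps": {"</s>": 1.0},
}


def generate_greedy(prompt, max_new_tokens=10):
    """Append the highest-scoring next token until the end marker or the limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # no continuation known for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":
            break  # end-of-sequence marker
        tokens.append(next_token)
    return tokens


generated = generate_greedy(["<s>"])
print(generated)
```

Each new token is chosen using only what has already been generated, which is the same left-to-right dependency the fine-tuned model relies on.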

Specify the model checkpoint (AraGPT2, a GPT-2 model pretrained on Arabic text)

In [None]:
model_checkpoint = "aubmindlab/aragpt2-base"

Load the tokenizer

In [None]:
# use_fast=True requests the fast tokenizer implementation, which is quicker and supports additional features.
# GPT-2 style tokenizers are based on byte-level byte-pair encoding (BPE).
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
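To give a feel for byte-level BPE, here is a toy sketch of a single merge step over the UTF-8 bytes of a short Arabic string. It is only an illustration: the real tokenizer loads merges learned on a large corpus, and the helper names and the new symbol ID 256 are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair, new_symbol):
    """Replace every non-overlapping occurrence of pair with new_symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


text = "الله الله"
byte_symbols = list(text.encode("utf-8"))  # byte-level: start from raw bytes
pair = most_frequent_pair(byte_symbols)
merged = merge_pair(byte_symbols, pair, 256)  # 256 = first ID beyond the byte range

print(len(byte_symbols), "symbols before the merge")
print(len(merged), "symbols after the merge")
```

Repeating this step many times builds a vocabulary in which frequent byte sequences, including whole Arabic words, become single tokens.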

Define the tokenization function

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Tokenize the dataset

In [None]:
# batched=True: map() passes examples to tokenize_function in batches (1000 examples per batch by default)
# rather than one at a time, so the tokenizer processes many texts per call.
# num_proc=4: map() uses 4 processes to parallelize the tokenization.
# remove_columns=["text"]: drop the raw text column and keep only the tokenizer outputs.
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
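The effect of batched=True can be sketched without the datasets library. Below, `map_batched` is a hypothetical stand-in for the batching that map() performs, and `toy_tokenize` is a fake tokenizer that emits one number per word. Neither is part of any real API.

```python
def toy_tokenize(batch):
    """Stand-in tokenizer: one 'token ID' per word (here, the word length)."""
    return {"input_ids": [[len(word) for word in text.split()] for text in batch["text"]]}


def map_batched(rows, fn, batch_size=1000):
    """Call fn once per batch of rows and collect the per-row outputs."""
    outputs = []
    for start in range(0, len(rows), batch_size):
        batch = {"text": rows[start : start + batch_size]}  # a dict of columns
        outputs.extend(fn(batch)["input_ids"])
    return outputs


rows = ["a bb ccc", "dd", "e ff"]
token_rows = map_batched(rows, toy_tokenize, batch_size=2)
print(token_rows)
```

The function receives a dictionary whose values are lists (one entry per example), which is why tokenize_function indexes `examples["text"]` and passes the whole list to the tokenizer.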

Preprocessing function that groups the texts

- The group_texts function receives a batch of tokenized examples as a dictionary of columns (here input_ids and attention_mask), where each column holds one list of values per example. It concatenates the lists in each column, drops the remainder that does not fill a complete block, and splits the result into chunks of exactly 128 tokens. It then adds a labels column that is a copy of input_ids.

- The labels are a plain copy because causal language models in Transformers shift the labels internally: the prediction at position i is compared with the token at position i + 1. Chunking the inputs and labels identically ensures that every label lines up with its own input chunk.

- Note that the map call below passes batch_size=512, so group_texts receives 512 examples at a time (the default for map is 1000). The remainder is therefore dropped once per batch of 512 examples, which makes the concatenated tokens of each batch a multiple of the block size.

In [None]:
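block_size = 128  # number of tokens per training example


def group_texts(examples):
    # Concatenate the token lists of every example, column by column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so the total length is a multiple of block_size.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # The model shifts the labels internally, so labels are a copy of input_ids.
    result["labels"] = result["input_ids"].copy()
    return result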

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=512,
    num_proc=4,
)
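To see what the grouping step does, the following sketch runs the same logic on a tiny hand-made batch. It uses a block size of 4 instead of 128 so the result is easy to inspect, and takes the block size as a parameter (an assumption made for this example). The token IDs are invented.

```python
def group_texts(examples, block_size):
    """Concatenate each column, drop the remainder, and split into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three tokenized "lines" of different lengths (10 tokens in total).
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

The ten tokens become two blocks of four, block boundaries ignore the original line boundaries, and the last two tokens are dropped as the remainder.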

Load the model

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

Define the training arguments

In [None]:
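model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",  # output directory for checkpoints
    # Evaluate at the end of every epoch. Newer transformers releases name this
    # argument eval_strategy; use that name if evaluation_strategy is rejected.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=10,
)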
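As a rough guide to what these settings imply, the sketch below estimates the number of optimization steps and the learning rate over time. It assumes the Trainer defaults of a per-device batch size of 8 and a linear decay schedule without warmup, a single device, and no gradient accumulation. The figure of 1000 training blocks is hypothetical, as are the helper names.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_decay_lr(step, total_steps, base_lr=2e-5):
    """Learning rate under linear decay to zero with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# Hypothetical dataset size: 1000 blocks of 128 tokens.
steps = total_training_steps(num_examples=1000, batch_size=8, num_epochs=10)
print("total steps:", steps)
for step in (0, steps // 2, steps):
    print(step, linear_decay_lr(step, steps))
```

Under these assumptions the learning rate starts at 2e-5 and reaches zero at the last step, so later epochs make progressively smaller updates.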

Train the model

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
trainer.train()

Evaluate the model and compute its perplexity

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 38.32
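Perplexity is the exponential of the mean cross-entropy loss per token, so the reported value of 38.32 corresponds to an evaluation loss of roughly 3.65. The sketch below shows the relationship on made-up token probabilities. The function `perplexity` is written for this illustration and is not part of Transformers.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)
    return math.exp(mean_loss)


# A model that assigns probability 1/4 to every correct token is as
# uncertain as a uniform choice among 4 options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(f"{uniform:.2f}")

# Loss that corresponds to the perplexity reported above.
reported_loss = math.log(38.32)
print(f"{reported_loss:.3f}")
```

Read this way, a perplexity of 38.32 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 38 tokens at each position.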


Save the model and tokenizer

In [None]:
model.save_pretrained("/content/sample_data/sa")
tokenizer.save_pretrained("/content/sample_data/sa")

Reload the saved model and tokenizer

In [None]:
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained('/content/sample_data/sa')
tokenizer = AutoTokenizer.from_pretrained('/content/sample_data/sa')

Create a pipeline to generate text

In [None]:
# max_length caps the total length of the output (prompt plus generated text), in tokens.
# It is passed as a keyword argument: `config` expects a model configuration, not generation settings.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=800)

# Testing Sample

In [None]:
print(pipe('جعل الله الكعبة البيت الحرام قياما')[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


جعل الله الكعبة البيت الحرام قياماوالذين آمنوا وعملوا الصالحات وأقاموا الصلاة وآتوا الزكاة واجتنبوا ما حرم الله ورسوله وأولئك هم الفاسقونفبأي آلاء ربكما تكذبانيا أيها الذين آمنوا اتقوا الله واعلموا أن الله غفور رحيم


In [None]:
print(pipe("قال قد وقع عليكم من ربكم رجس وغضب ")[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


قال قد وقع عليكم من ربكم رجس وغضب يحيق بكم إن كنتم صادقينولقد أرسلنا نوحا إلى قومه فقال يا قوم اعبدوا الله ولا تتبعوا أهواءهم وأطيعوا ما أنزل إليكم من التوراة والإنجيل ولو كرهتم أن تقولوا


In [None]:
print(pipe("والذين آمنوا وعملوا الصالحات وآمنوا بما نزل على")[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


والذين آمنوا وعملوا الصالحات وآمنوا بما نزل على قلوبهم من ربهم فأولئك هم الفائزونقالوا يا موسى ادع لنا ربك أن لا يهدي القوم الظالمينوقال رب إني أخاف عليكم عذاب يوم عظيمفإذا جاءتك رسلنا قالوا آمنا بالله واليوم
