## Masked Language Model for __Moroccan Arabic Wikipedia__ (aryRoBERTa<sub>BASE</sub>)

### * Environment Setup:

In [1]:
import os, torch, warnings 
from transformers import logging

logging.set_verbosity_warning()
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "True"
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

### * Hugging Face Setup:

In [2]:
from huggingface_hub import login

! git config --global credential.helper store

arywiki_hf_token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' # Use your huggingface token here
login(token=arywiki_hf_token, add_to_git_credential=True)

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


### * Create Hugging Face Repository:

In [3]:
! huggingface-cli repo create aryRoBERTa -y # Create a new repo on your huggingface account

git version 2.25.1
git-lfs/2.9.2 (GitHub; linux amd64; go 1.13.5)

You are about to create SaiedAlshahrani/aryRoBERTa

Your repo now lives at:
  https://huggingface.co/SaiedAlshahrani/aryRoBERTa

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/SaiedAlshahrani/aryRoBERTa



In [4]:
! git clone https://huggingface.co/SaiedAlshahrani/aryRoBERTa # Clone the new repo from your huggingface account

Cloning into 'aryRoBERTa'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), 421 bytes | 140.00 KiB/s, done.


### * Train Byte-level Tokenizer:

In [5]:
from tokenizers import ByteLevelBPETokenizer

wiki_corpus = 'arywiki-20230101-pages-articles-processed.txt' # Use your preprocessed Wikipedia Corpus here

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(
    files=wiki_corpus, 
    vocab_size=52_000, min_frequency=2, 
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>']
)
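
For intuition, byte-level BPE starts from the UTF-8 bytes of the text and repeatedly merges the most frequent adjacent pair into a new symbol. The toy loop below is a simplified sketch of that idea, not the `tokenizers` implementation: it skips pre-tokenization, the byte-to-character mapping, and `min_frequency`, and the helper names `most_frequent_pair` and `merge_pair` are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most common adjacent pair of symbols across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_symbol):
    """Replace every occurrence of `pair` in `seq` with `new_symbol`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

words = ["آسفي", "آسفي", "مدينة"]                       # toy corpus
sequences = [list(w.encode("utf-8")) for w in words]   # start from raw bytes (0..255)
next_id = 256                                          # first id available for a merged symbol
for _ in range(3):                                     # three merges; real training runs until vocab_size
    pair = most_frequent_pair(sequences)
    sequences = [merge_pair(s, pair, next_id) for s in sequences]
    next_id += 1

print([len(s) for s in sequences])                     # sequences get shorter as merges are learned
```

Each Arabic letter occupies two bytes in UTF-8, so early merges tend to rebuild whole letters before longer units appear.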






In [6]:
tokenizer.save_model('aryRoBERTa')

['aryRoBERTa/vocab.json', 'aryRoBERTa/merges.txt']

In [7]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("aryRoBERTa", max_length=512, padding='max_length', truncation=True)

### * Initialize aryRoBERTa Model for MLM:

In [8]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000, max_position_embeddings=514,
    num_attention_heads=12, num_hidden_layers=6, type_vocab_size=1
)
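
The value `max_position_embeddings=514` looks odd next to a 512-token limit. RoBERTa-style models number positions starting after the padding index, so 512 real tokens need position ids up to 513. The sketch below mimics that numbering in plain Python; the function name `roberta_position_ids` is hypothetical, and `padding_idx=1` assumes the default `<pad>` id.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Non-pad tokens count up from padding_idx + 1; pad tokens keep padding_idx."""
    position_ids, running = [], 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            running += 1
            position_ids.append(running + padding_idx)
    return position_ids

full_sequence = [0] + [5] * 510 + [2]      # <s> + 510 arbitrary tokens + </s> = 512 ids
positions = roberta_position_ids(full_sequence)
print(min(positions), max(positions))      # 2 513
```

With the largest position id at 513, the embedding table needs 514 rows (indices 0 through 513).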

In [9]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [10]:
print(f"# Number of Trainable Parameters: {format(model.num_parameters(),',d')}")

# Number of Trainable Parameters: 83,504,416
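
The reported count can be reproduced by hand from the configuration. The sketch below assumes the `RobertaConfig` defaults that the cell above does not override (`hidden_size=768`, `intermediate_size=3072`), and assumes the output projection of the MLM head shares its weights with the word embeddings, so only its bias is counted. The function name is hypothetical.

```python
def roberta_mlm_param_count(vocab_size=52_000, hidden=768, layers=6,
                            intermediate=3072, max_positions=514, type_vocab=1):
    layer_norm = 2 * hidden                                  # weight + bias
    # Word, position, and token-type embeddings, plus one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    dense = hidden * hidden + hidden                         # square linear layer with bias
    attention = 4 * dense + layer_norm                       # Q, K, V, output projection
    feed_forward = ((hidden * intermediate + intermediate)   # up-projection
                    + (intermediate * hidden + hidden)       # down-projection
                    + layer_norm)
    lm_head = dense + layer_norm + vocab_size                # tied decoder: only the bias is new
    return embeddings + layers * (attention + feed_forward) + lm_head

print(f"{roberta_mlm_param_count():,}")                      # 83,504,416
```

Roughly 40M of the 83.5M parameters sit in the embedding table, which is typical for a small encoder with a 52,000-token vocabulary.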


### * Prepare Moroccan Arabic Corpus:

In [11]:
with open(wiki_corpus, 'r', encoding='utf-8') as f: 
    arywiki_corpus = f.read().split('\n')

print(f'# Total Number of Samples: {format(len(arywiki_corpus),",d")}')

# Total Number of Samples: 4,674
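
This count is one higher than the `Num examples = 4673` that the trainer reports later. A likely explanation, stated here as an assumption, is that the corpus file ends with a newline: `split('\n')` then yields a trailing empty string, whereas the line-by-line dataset skips blank lines. A toy illustration:

```python
toy_corpus = "first article\nsecond article\nthird article\n"   # ends with a newline

naive_samples = toy_corpus.split("\n")                            # keeps a trailing ''
clean_samples = [line for line in toy_corpus.splitlines() if line.strip()]

print(len(naive_samples), len(clean_samples))                     # 4 3
```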


In [12]:
import pandas as pd

arywiki_20230101 = pd.DataFrame(data={"text": arywiki_corpus})
arywiki_20230101.to_csv("aryRoBERTa/Moroccan_Arabic_Wikipedia__aryRoBERTa.csv", sep=',',index=False)
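
One thing to keep in mind about this file: a CSV written with a header starts with the column name, and fields containing commas are quoted. If the file is later read one line at a time rather than parsed as CSV, the header becomes a sample and the quotes stay in the text. The sketch below uses the standard-library `csv` module as a stand-in for `to_csv` (an assumption made to keep the example dependency-free) and made-up rows.

```python
import csv
import io

rows = ["first article", "second article, with a comma"]          # made-up samples

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["text"])                                         # header row
for row in rows:
    writer.writerow([row])

raw_lines = buffer.getvalue().splitlines()                        # what a line-by-line reader sees
parsed = [r["text"] for r in csv.DictReader(io.StringIO(buffer.getvalue()))]

print(raw_lines)   # header and quoting are still present
print(parsed)      # CSV parsing recovers the original samples
```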

### * Push Moroccan Arabic Dataset to Hugging Face Hub: 

In [13]:
from datasets import load_dataset

dataset_to_hub = load_dataset("text", data_files={"train": 'aryRoBERTa/Moroccan_Arabic_Wikipedia__aryRoBERTa.csv'})
dataset_to_hub.push_to_hub("SaiedAlshahrani/Moroccan_Arabic_Wikipedia__aryRoBERTa") # Push dataset to your huggingface account

Using custom data configuration default-cdba445ecbdf3e88


Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-cdba445ecbdf3e88/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-cdba445ecbdf3e88/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


Pushing split train to the Hub.



### * Tokenize Moroccan Arabic Dataset:

In [14]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, 
    file_path=wiki_corpus, block_size=128
)

In [15]:
dataset[1]

{'input_ids': tensor([    0, 22270, 13419,   408,   819,  3121,   863,   530,  1016,   434,
           696,  1016,  1324, 10217,  6088, 43889, 10530,  2758,   502,   605,
           280,  9645,  1153, 10115,  7718,   529,   721,   307,  1016,   863,
         21863,   360,  7718, 10478,   570,  2140, 25929,  1016,   428,   459,
           930,  2561,  2621, 45570,  2017,   323,   570, 10183,  8771,  1309,
           557,   422,   427,   360,   560,   599,   516,   601,   350,   621,
           364,   404,   630,   617,   380,   618,   350,   357,   581,   531,
           619,   620,   583,   622,   350,   364,   628,   550,   574,   350,
           625,   364,   404,   629,   626,   360,   607,   350,   364,   339,
           398,   580,   380,   496,   339,   470,   467,   473,   350,   364,
           339,   398,   349,   612,   573,   380,   496,   339,   470,   467,
           473,   391,   392,   336,   434,   696,  1297,   336,  1163,   443,
             2])}

In [16]:
tokenizer.decode(dataset[1]["input_ids"])

'<s>آسفي بالأمازيغية هي مدينة مغربية جات إقليم آسفي جهة مراكش آسفي معروفة بالفخار والحوت وخصوصا السردين ومكنيين عليها حاضرة المحيط الحطة ديال آسفي جات كاطل على المحيط الأطلسي بين الجديدة والصويرة آسفي كاين بزاف دالبني قديم وتاريخي وهي من بين المدن القديمة المغرب ساكنين فيها واحد على حسب لإحصاء لعام تعليم نسبة لأمية اس ما كايعرفوش يقراو ولا يكتبو نسبة كان قاريين فوق انوي تانوي جامعة اقتصاد نسبة اس شيطين يقدرو يخدمو نسبة لبطالة اس ما خدامينش تايقلبو على خدمة نسبة اس اللي خدامين ولة ولا لعاطلين اللي سبق ليهوم خدمو نسبة اس اللي خدامين في لقطاع لخاص ولا لعاطلين اللي سبق ليهوم خدمو عيون لكلام تصنيف جهة مراكش أسفي تصنيف مدون لمغريب</s>'

### * Create Data Collator for Language Modeling:

In [17]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
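
To make `mlm_probability=0.15` concrete: the collator selects about 15% of the non-special tokens as prediction targets; of those, roughly 80% are replaced by `<mask>`, 10% by a random token, and 10% are left unchanged, while every unselected position gets the label `-100` so the loss ignores it. The function below is a plain-Python sketch of that rule, not the library code. The name `mask_tokens` and the ids (`<mask>` = 4, specials = 0, 1, 2, following the order of the `special_tokens` list used when training the tokenizer) are assumptions.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, labels = list(input_ids), [-100] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Special tokens are never selected; others are selected with mlm_probability
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id                       # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

ids = [0] + list(range(10, 110)) + [2]             # <s> + 100 toy tokens + </s>
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2}, rng=random.Random(1))
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

Because the selection is redrawn each time a batch is built, the same sentence is masked differently in every epoch.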

### * Train aryRoBERTa Model from Scratch:

In [18]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    push_to_hub=True, hub_model_id="aryRoBERTa",  # hub_model_id replaces the deprecated push_to_hub_model_id
    output_dir="aryRoBERTa", evaluation_strategy="no",
    auto_find_batch_size=True, num_train_epochs=5,
    learning_rate=1e-4, save_total_limit=3,
    adam_epsilon=1e-6, weight_decay=0.01,
    adam_beta1=0.9, adam_beta2=0.98,
    per_device_train_batch_size=128,
    logging_steps=35, save_steps=35,
    prediction_loss_only=False,
    report_to="tensorboard",
    data_seed=24, seed=42,
)

trainer = Trainer(
    model=model, 
    args=training_args,
    train_dataset=dataset, 
    data_collator=data_collator
)

/notebooks/aryRoBERTa is already a clone of https://huggingface.co/SaiedAlshahrani/aryRoBERTa. Make sure you pull the latest changes with `repo.git_pull()`.


In [19]:
history = trainer.train()

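# Show the final training metrics as a one-row table
# (DataFrame.append and Styler.hide_index were removed in pandas 2.0)
train = pd.DataFrame([history.metrics])
train.style.hide(axis="index")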

***** Running training *****
  Num examples = 4673
  Num Epochs = 5
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 185
  Number of trainable parameters = 83504416
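
The step count in this log follows directly from the dataset size, batch size, and number of epochs. A quick check using only the numbers printed above (the last batch of each epoch is partial, hence the ceiling):

```python
import math

num_examples, batch_size, epochs, save_steps = 4673, 128, 5, 35

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps)   # 37 185
print(checkpoints)                    # [35, 70, 105, 140, 175]
print(checkpoints[-3:])               # the three kept under save_total_limit=3
```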


| Step | Training Loss |
|---|---|
| 35 | 9.5984 |
| 70 | 7.9889 |
| 105 | 7.4388 |
| 140 | 7.2044 |
| 175 | 7.1812 |


Saving model checkpoint to aryRoBERTa/checkpoint-35
Configuration saved in aryRoBERTa/checkpoint-35/config.json
Model weights saved in aryRoBERTa/checkpoint-35/pytorch_model.bin
Saving model checkpoint to aryRoBERTa/checkpoint-70
Configuration saved in aryRoBERTa/checkpoint-70/config.json
Model weights saved in aryRoBERTa/checkpoint-70/pytorch_model.bin
Saving model checkpoint to aryRoBERTa/checkpoint-105
Configuration saved in aryRoBERTa/checkpoint-105/config.json
Model weights saved in aryRoBERTa/checkpoint-105/pytorch_model.bin
Saving model checkpoint to aryRoBERTa/checkpoint-140
Configuration saved in aryRoBERTa/checkpoint-140/config.json
Model weights saved in aryRoBERTa/checkpoint-140/pytorch_model.bin
Deleting older checkpoint [aryRoBERTa/checkpoint-35] due to args.save_total_limit
Saving model checkpoint to aryRoBERTa/checkpoint-175
Configuration saved in aryRoBERTa/checkpoint-175/config.json
Model weights saved in aryRoBERTa/checkpoint-175/pytorch_model.bin
Deleting older chec

| train_runtime | train_samples_per_second | train_steps_per_second | total_flos | train_loss | epoch |
|---|---|---|---|---|---|
| 194.2472 | 120.285 | 0.952 | 774708261150720.0 | 7.8334 | 5.0 |


### * Push aryRoBERTa Model to Hugging Face Hub:

In [20]:
trainer.push_to_hub("SaiedAlshahrani/aryRoBERTa") # Push the trained model to your huggingface account; the positional argument is used as the commit message

Saving model checkpoint to aryRoBERTa
Configuration saved in aryRoBERTa/config.json
Model weights saved in aryRoBERTa/pytorch_model.bin
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


To https://huggingface.co/SaiedAlshahrani/aryRoBERTa
   25d16b6..4d05cd9  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}}
To https://huggingface.co/SaiedAlshahrani/aryRoBERTa
   4d05cd9..8314f05  main -> main



'https://huggingface.co/SaiedAlshahrani/aryRoBERTa/commit/4d05cd9715addd67feabcce448894814de9d16cb'

### * Save aryRoBERTa Model Locally:

In [21]:
trainer.save_model("aryRoBERTa")

Saving model checkpoint to aryRoBERTa
Configuration saved in aryRoBERTa/config.json
Model weights saved in aryRoBERTa/pytorch_model.bin
Saving model checkpoint to aryRoBERTa
Configuration saved in aryRoBERTa/config.json
Model weights saved in aryRoBERTa/pytorch_model.bin
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}}


### * Test aryRoBERTa Using Transformers Pipeline:

In [22]:
import logging
logging.disable(logging.WARNING)

from transformers import pipeline

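def mask_filler(prompt, top_k=None, targets=None):
    """Fill the <mask> token in `prompt` using the locally saved aryRoBERTa model."""
    # Note: the pipeline is rebuilt (and the model reloaded) on every call
    fill = pipeline(
        'fill-mask', 
        model='aryRoBERTa', 
        tokenizer='aryRoBERTa', 
        top_k=top_k, targets=targets
    )
    return fill(prompt)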

In [23]:
mask_filler('تقع دولة المغرب في قارة <mask>', top_k=10, targets=tokenizer.tokenize(' إفريقيا'))

[{'score': 0.0010255323722958565,
  'token': 844,
  'token_str': ' إفريقيا',
  'sequence': 'تقع دولة المغرب في قارة إفريقيا'}]
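
The `score` is a probability for the target token at the `<mask>` position. The sketch below assumes the pipeline applies a softmax over the whole vocabulary and then reads off the entries for the requested targets, without renormalising over them. The logits and the target id are made up; only the reported score and the vocabulary size come from the cells above.

```python
import math

def softmax(logits):
    peak = max(logits)                                 # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 0.5, -1.0, 0.0, 1.0]                # hypothetical logits, one per vocabulary id
probs = softmax(toy_logits)
target_ids = [3]                                       # hypothetical id of the target token
target_scores = {i: probs[i] for i in target_ids}      # read off, not renormalised

reported_score = 0.0010255323722958565                 # from the output above
uniform_score = 1 / 52_000                             # a model with no preference at all
print(f"{reported_score / uniform_score:.1f}x the uniform probability")
```

A score of about 0.001 is small in absolute terms but well above the uniform baseline, which is consistent with a model trained for only 185 steps.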