## This notebook pre-trains AraElectra on a domain-specific dataset so that the model can later be used for a downstream task, such as question answering.
### The code uses TensorFlow 1.x

In [4]:
from src.pretraining.preprocess import ArabertPreprocessor
import tensorflow as tf
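Because the scripts below rely on TensorFlow 1.x APIs, it can help to check the installed version before running anything. The following is a minimal sketch; the helper name `is_tf1` is hypothetical and not part of the repository.

```python
def is_tf1(version: str) -> bool:
    """Return True when a version string such as '1.15.0' belongs to the 1.x series."""
    # Only the major component (the text before the first dot) is inspected.
    return version.split(".")[0] == "1"


# Hypothetical usage inside the notebook, after `import tensorflow as tf`:
#   assert is_tf1(tf.__version__), "The pretraining scripts expect TensorFlow 1.x"
```

Keeping the check as a plain function on a version string means it can be exercised without importing TensorFlow itself.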



### Download the pretrained TensorFlow model from Hugging Face

In [2]:
!mkdir -p araelectra-base-discriminator
!wget https://huggingface.co/aubmindlab/araelectra-base-discriminator/resolve/main/tf1_model.tar.gz -O araelectra-base-discriminator/tf1_model.tar.gz
!tar -xvf araelectra-base-discriminator/tf1_model.tar.gz -C araelectra-base-discriminator/
!rm araelectra-base-discriminator/tf1_model.tar.gz

--2022-10-11 22:09:32--  https://huggingface.co/aubmindlab/araelectra-base-discriminator/resolve/main/tf1_model.tar.gz
Resolving huggingface.co (huggingface.co)... 2600:1f18:147f:e850:6f3d:1caa:26e9:1d53, 2600:1f18:147f:e800:3e44:323f:9748:7184, 2600:1f18:147f:e800:8b16:ea06:2538:561f, ...
Connecting to huggingface.co (huggingface.co)|2600:1f18:147f:e850:6f3d:1caa:26e9:1d53|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/aubmindlab/araelectra-base-discriminator/cbca4d0cedf32683a99a235494e946ba11a373095f4040260e005948c88f2af1?response-content-disposition=attachment%3B%20filename%3D%22tf1_model.tar.gz%22&Expires=1665778174&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL2F1Ym1pbmRsYWIvYXJhZWxlY3RyYS1iYXNlLWRpc2NyaW1pbmF0b3IvY2JjYTRkMGNlZGYzMjY4M2E5OWEyMzU0OTRlOTQ2YmExMWEzNzMwOTVmNDA0MDI2MGUwMDU5NDhjODhmMmFmMT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPWF0dGFjaG1lbnQ
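The extraction step of the shell commands above can also be done from Python, which is convenient when the notebook runs somewhere `tar` is not available. This is a sketch using only the standard library; the function name `extract_archive` is hypothetical, and the archive is assumed to have been downloaded already.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list:
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Hypothetical usage, mirroring the paths used in the cell above:
#   extract_archive("araelectra-base-discriminator/tf1_model.tar.gz",
#                   "araelectra-base-discriminator")
```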

### Preprocess the dataset with farasa_segmentation

In [5]:
model_name = "aubmindlab/araelectra-base-discriminator"
sample_text = "dataset/pretraining-dataset/iskan.txt"
sample_text_output_after_farasa_segmentation = "dataset/pretraining-dataset/iskan_farasa_segmentation.txt"

arabert_prep = ArabertPreprocessor(model_name=model_name)

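# Read the raw corpus, one sample per line (explicit UTF-8, since the text is Arabic)
with open(sample_text, "r", encoding="utf-8") as f:
    data = [d.strip("\n") for d in f.readlines()]

# Write the preprocessed samples, one per line
with open(sample_text_output_after_farasa_segmentation, "w", encoding="utf-8") as f:
    for sample in data:
        f.write(arabert_prep.preprocess(sample) + "\n")

# Here is what the first sample looks like after preprocessing
print(arabert_prep.preprocess(data[0]))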

يعود تاريخ وزارة الإسكان والتخطيط العمراني إلى العام 1975 ، عندما أصدر صاحب السمو الشيخ عيسى بن سلمان آل خليفة أمير دولة البحرين – طيب الله ثراه .
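For a larger corpus, reading the whole file into a list may not be practical. The sketch below streams the file line by line instead. The function name `preprocess_file` is hypothetical, and the preprocessing step is passed in as a callable so that the helper does not depend on `ArabertPreprocessor` directly.

```python
from typing import Callable


def preprocess_file(src_path: str, dst_path: str, preprocess: Callable[[str], str]) -> int:
    """Apply `preprocess` to every line of src_path, write the results to dst_path,
    and return the number of lines processed."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip only the trailing newline, as the cell above does.
            dst.write(preprocess(line.rstrip("\n")) + "\n")
            count += 1
    return count


# Hypothetical usage with the names defined in the cell above:
#   preprocess_file(sample_text,
#                   sample_text_output_after_farasa_segmentation,
#                   arabert_prep.preprocess)
```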


### Create the pretraining data for AraElectra. This converts the dataset into the expected ".tfrecord" format.

In [7]:
!python src/pretraining/create_pretraining_data.py \
  --input_file=./dataset/pretraining-dataset/iskan_farasa_segmentation.txt \
  --output_file=./dataset/pretraining-dataset/iskan.tfrecord \
  --vocab_file=araelectra-base-discriminator/tf-araelectra-base/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5




W1011 22:22:55.196787 140434982973632 module_wrapper.py:139] From src/pretraining/create_pretraining_data.py:488: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1011 22:22:55.196874 140434982973632 module_wrapper.py:139] From src/pretraining/create_pretraining_data.py:488: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.




INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:  ./dataset/pretraining-dataset/iskan_farasa_segmentation.txt
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow:  ./dataset/pretraining-dataset/iskan.tfrecord

INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] وقد [MASK] اوامر صاحب السمو الملكي الامير [MASK] بن حمد ال خليفة ولي العهد ناي [MASK] القا ##يد الاعلى [MASK] ##يب . الاول لري [MASK] مجلس [MASK] المتتالية بتوزيع [MASK] الاسكانية منذ منتصف عام 2016 ، والمساحات ##ضيف مكتسب ##ا اسكان ##يا جديدا . [SEP] الاسكان [UNUSED_1316
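The flags `--masked_lm_prob` and `--max_predictions_per_seq` together determine how many `[MASK]` positions each example receives. The sketch below assumes that `create_pretraining_data.py` keeps the formula used by the reference BERT data-creation script, which its flags suggest but which is not confirmed here; the function name is hypothetical.

```python
def num_masked_tokens(num_tokens: int, masked_lm_prob: float, max_predictions_per_seq: int) -> int:
    """Number of positions selected for prediction in one sequence.

    Assumption: same rule as the reference BERT script, i.e. a fraction of the
    tokens, at least one, capped at max_predictions_per_seq.
    """
    return min(max_predictions_per_seq, max(1, int(round(num_tokens * masked_lm_prob))))


# With the values used above (max_seq_length=128, masked_lm_prob=0.15, cap of 20):
print(num_masked_tokens(128, 0.15, 20))  # → 19
```

Under this assumption, a cap of 20 is just above the 19 positions a full 128-token sequence would request, so the cap rarely truncates the selection.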

### Pre-train/fine-tune AraElectra on a specific domain

In [10]:
!python src/pretraining/run_pretraining.py \
  --input_file=dataset/pretraining-dataset/iskan.tfrecord \
  --output_dir=pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=araelectra-base-discriminator/config.json \
  --init_checkpoint=araelectra-base-discriminator/tf-araelectra-base/model.ckpt \
  --train_batch_size=2 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=50 \
  --num_warmup_steps=10 \
  --learning_rate=5e-5 






W1011 22:28:55.030264 139694655488192 module_wrapper.py:139] From src/pretraining/run_pretraining.py:496: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1011 22:28:55.030861 139694655488192 module_wrapper.py:139] From src/pretraining/run_pretraining.py:496: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.




The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Using config: {'_model_dir': 'pretraining_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placemen
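With `--num_train_steps=50` and `--num_warmup_steps=10`, the learning rate is not constant over the run. The sketch below assumes that `run_pretraining.py` uses the schedule from the reference BERT optimizer (linear warmup followed by linear decay to zero); this is an assumption based on the flag names, and the function name is hypothetical.

```python
def learning_rate_at(step: int, init_lr: float = 5e-5,
                     num_train_steps: int = 50, num_warmup_steps: int = 10) -> float:
    """Learning rate at a given global step.

    Assumption: linear warmup to init_lr over num_warmup_steps, then linear
    decay to 0 at num_train_steps, as in the reference BERT optimization code.
    """
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)  # the decay is clipped at the final step
    return init_lr * (1.0 - step / num_train_steps)


# Print the schedule at a few steps of the 50-step run configured above.
for s in (0, 5, 10, 30, 50):
    print(s, learning_rate_at(s))
```

In this scheme the decay is computed from step 0, so the rate right after warmup is already below the nominal 5e-5. With a batch size of 2, the 50 steps cover 100 training sequences in total, which is enough for a smoke test of the pipeline rather than meaningful domain adaptation.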

### Convert the TensorFlow checkpoint to a PyTorch model for later fine-tuning on downstream tasks, e.g., question answering.

In [11]:

!transformers-cli convert --model_type bert \
  --tf_checkpoint pretraining_output/model.ckpt-50 \
  --config araelectra-base-discriminator/config.json \
  --pytorch_dump_output araelectra-base-discriminator/pytorch_model.bin

Building PyTorch model from configuration: BertConfig {
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "generator_hidden_size": 0.33333,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}

Converting TensorFlow checkpoint from /home/mohammed/AraBERT-QuestionAnswering/pretraining_output/model.ckpt-50
Loading TF weight bert/embeddings/LayerNorm/beta with shape [768]
Loading TF weight bert/embeddings/LayerNorm/beta/adam_m with shape [768]
Loading TF weight bert/embeddings/LayerNorm/beta/adam_v with shape
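The log lists optimizer slots such as `adam_m` and `adam_v` alongside the model weights. These are read from the checkpoint but are not needed in the PyTorch model. The sketch below shows one way to separate the two kinds of variable names. It is an illustration and not the converter's own code; the names `OPTIMIZER_PARTS` and `is_model_weight` are hypothetical.

```python
# Path components that mark optimizer or bookkeeping state rather than model weights.
OPTIMIZER_PARTS = {
    "adam_m",
    "adam_v",
    "AdamWeightDecayOptimizer",
    "AdamWeightDecayOptimizer_1",
    "global_step",
}


def is_model_weight(variable_name: str) -> bool:
    """Return True when no component of the TF variable path is optimizer state."""
    return not any(part in OPTIMIZER_PARTS for part in variable_name.split("/"))


names = [
    "bert/embeddings/LayerNorm/beta",
    "bert/embeddings/LayerNorm/beta/adam_m",
    "bert/embeddings/LayerNorm/beta/adam_v",
]
print([n for n in names if is_model_weight(n)])  # → ['bert/embeddings/LayerNorm/beta']
```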