## Tutorial 1: Arabic Tokenizer



---

This notebook is an attempt to simplify some natural language processing concepts for interested Arabic speakers. We apply several NLP techniques to Arabic data and explain the related algorithms and methods.


---


- This series of tutorials explains how to use HuggingFace tools for Arabic natural language processing. This first tutorial focuses on the tokenizers from HuggingFace.

- Most of this notebook was written by the HuggingFace team. I made slight modifications to adapt some experiments and ideas to Arabic natural language processing.

In [2]:
#@title
%%html
<div style="background-color: pink;">
  This notebook is based on the work of Hugging Face and <a href="https://github.com/aditya-malte">Aditya Malte</a> <br> in this <a href="https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb">Notebook</a>, with slight modifications to test different tokenizer algorithms on an Arabic corpus.
</div>



### Tokenization 


In [None]:
## The first step is to install the tokenizers library from HuggingFace, which takes a single line.

!pip install tokenizers==0.9.2

Collecting tokenizers==0.9.2
  Downloading https://files.pythonhosted.org/packages/7c/a5/78be1a55b2ac8d6a956f0a211d372726e2b1dd2666bb537fea9b03abd62c/tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 2.8MB/s 
Installing collected packages: tokenizers
Successfully installed tokenizers-0.9.2


In [None]:
## The second step is to import the libraries needed for this exercise.

import torch
import numpy as np
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer,BertWordPieceTokenizer #,SentencePieceBPETokenizer,CharBPETokenizer
from tokenizers.decoders import ByteLevel

from tokenizers.processors import BertProcessing


### Download an Arabic Corpus

The corpus was downloaded from the following website: [open parallel corpus](http://opus.nlpl.eu/)


The data is part of the MultiUN corpus, a collection of translated documents from the United Nations. For more information, see these two papers:

* Eisele, A. and Chen, Y., 2010, May. MultiUN: A Multilingual Corpus from United Nation Documents. In LREC.

* J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

In [None]:
## This function downloads the Arabic data from the OPUS website, extracts it,
## and returns the path of the extracted file so that we can load it later.
## The file is stored under /content/ in the Colab session.

def download_Dataset():
  """Download the dataset and return the paths of the extracted text files.

    Returns
    -------
    paths : list of str
        the paths of all .txt files found under /content/
    """
  ## First, download the dataset from the OPUS website
  !wget https://object.pouta.csc.fi/OPUS-MultiUN/v1/mono/ar.txt.gz

  ## Then, extract the dataset
  !gzip -d /content/ar.txt.gz
  
  ## Finally, collect the paths of the extracted data by searching the /content/ folder
  ## for any file with a .txt extension.
  ## Important note: if the folder holds more than one file with a .txt extension,
  ## the paths list will contain all of them.
  paths = [str(x) for x in Path("/content/").glob("**/*.txt")]
  return paths
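
The note about multiple `.txt` files can be checked on a scratch directory. The sketch below is illustrative only: the helper `find_text_files` and the file names are made up for this example, and it searches a temporary folder rather than `/content/`.

```python
import tempfile
from pathlib import Path

def find_text_files(root):
    """Return the sorted paths of every .txt file under root, searched recursively."""
    return sorted(str(p) for p in Path(root).glob("**/*.txt"))

# Hypothetical layout: two .txt files (one nested) and one file that is not .txt.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ar.txt").write_text("نص عربي", encoding="utf-8")
    (root / "sample_data").mkdir()
    (root / "sample_data" / "README.txt").write_text("notes", encoding="utf-8")
    (root / "ar.txt.gz").write_bytes(b"")
    found = find_text_files(root)
    found_names = [Path(p).name for p in found]

# Both .txt files are returned, so a tokenizer would be trained on both.
print(found_names)
```

If only the corpus should be used for training, filtering the list by file name before passing it on is one simple option.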

In [None]:
## Call download_Dataset and store the returned list of file paths in paths

paths=download_Dataset()
paths

--2020-10-19 07:36:16--  https://object.pouta.csc.fi/OPUS-MultiUN/v1/mono/ar.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 615366163 (587M) [application/gzip]
Saving to: ‘ar.txt.gz’


2020-10-19 07:36:43 (22.2 MB/s) - ‘ar.txt.gz’ saved [615366163/615366163]



['/content/ar.txt']
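
Before training on a file of this size, it can help to look at a few lines of it. The following is a minimal sketch, not part of the original notebook: `preview_lines` is a hypothetical helper, and the demo runs on a tiny stand-in file instead of the real corpus.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path

def preview_lines(path, n=3):
    """Return the first n lines of a .txt or .txt.gz file, without trailing newlines."""
    # Pick the gzip reader for compressed files, the plain reader otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # islice stops after n lines, so the whole file is never loaded.
        return [line.rstrip("\n") for line in islice(handle, n)]

# Demo on a small stand-in file; on Colab you could pass "/content/ar.txt" instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write("السطر الأول\nالسطر الثاني\nالسطر الثالث\nالسطر الرابع\n")
    first_two = preview_lines(sample, n=2)

print(first_two)
```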

### Training Tokenizer

In [None]:

def train_tokenizer(tokenizer,paths,strr="arabic"):
  """Train a tokenizer with one of the HuggingFace tokenizers implementations.

    Parameters
    ----------
    tokenizer : tokenizers.implementations
        A tokenizer implementation that has not been trained yet
    paths : list
        a list that holds the paths of the dataset files
    strr : str, optional
        a name used as the base name for the saved vocabulary files
    Returns
    -------
    tokenizer
        the trained tokenizer
    """

  ## Train the tokenizer on the given files with the hyperparameters set below.
  ## The tokenizer learns a subword vocabulary using one of the implementations
  ## in the tokenizers library (ByteLevelBPETokenizer, BertWordPieceTokenizer, etc.).
  ## The tokenizer argument determines which implementation is used.

  ## train() takes the following arguments; any that are left out fall back to their default value, if there is one.
  ## files=paths : the path or list of paths (Union[str, List[str]]) of the dataset used to build the tokenizer.
  ## vocab_size=52_000 : the target size of the vocabulary; training stops once the vocabulary reaches this size.
  ## min_frequency=2 : the minimum number of times a pair of symbols must occur in the corpus to be merged into a new token.
  ## special_tokens : the special tokens that are added to the vocabulary.
  tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
      "<bos>", 
      "<pad>",
      "<eos>",
      "<unk>",
      "<mask>",
  ])

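  # Save the vocabulary files to the current directory, using strr as the file name prefix.
  # save_model() writes the model files; save() expects the path of a single JSON file instead.
  tokenizer.save_model(".", strr)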
  
  return tokenizer
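
To make the roles of `vocab_size` and `min_frequency` more concrete, here is a toy version of the BPE training loop in plain Python. It is a simplified sketch for illustration, not the implementation used by the `tokenizers` library: the function `learn_bpe_merges` and the tiny word list are assumptions made for this example, and `num_merges` stands in for the budget that `vocab_size` imposes.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges, min_frequency=2):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        # Stop once the most frequent pair is rarer than min_frequency.
        if freq < min_frequency:
            break
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges

toy_words = ["الرياض", "الرياض", "الدمام", "الى"]
toy_merges = learn_bpe_merges(toy_words, num_merges=5)
# The pair (ا, ل) occurs four times, more than any other pair, so it is merged first.
print(toy_merges[0])
```

Raising `min_frequency` above the count of the most frequent pair stops training before any merge is learned, which is the same cut-off effect the parameter has on rare pairs in a real corpus.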



### Byte Level BPE Tokenizer

In [None]:
# Initialize a byte-level BPE tokenizer, as introduced by OpenAI with GPT-2

tokenizer = ByteLevelBPETokenizer()
# Start training the tokenizer over the corpus
BPE_tokenizer=train_tokenizer(tokenizer,paths)

In [None]:

# BertProcessing takes the end-of-sequence pair (sep) first and the start-of-sequence pair (cls) second.
BPE_tokenizer._tokenizer.post_processor = BertProcessing(
    ("<eos>", BPE_tokenizer.token_to_id("<eos>")),
    ("<bos>", BPE_tokenizer.token_to_id("<bos>")),
)
BPE_tokenizer.enable_truncation(max_length=512)

Decoder=ByteLevel()

encoded_tokens= BPE_tokenizer.encode("سافر محمد وخالد إلى الرياض سويا").tokens
print(encoded_tokens)

# Decode each byte-level token back to readable text; decode() expects a list of tokens.
tokens=[ Decoder.decode([tok]) for tok in encoded_tokens] 
print(tokens)

['<bos>', 'Ø³', 'Ø§ÙģØ±', 'ĠÙħØŃÙħØ¯', 'ĠÙĪØ®', 'Ø§ÙĦØ¯', 'ĠØ¥ÙĦÙī', 'ĠØ§ÙĦØ±ÙĬØ§Ø¶', 'ĠØ³ÙĪÙĬØ§', '<eos>']
['<bos>', 'س', 'افر', ' محمد', ' وخ', 'الد', ' إلى', ' الرياض', ' سويا', '<eos>']
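
The first printed list looks garbled because a byte-level tokenizer works on UTF-8 bytes and displays each byte as a stand-in printable character. The sketch below rebuilds that byte-to-character table in the style published with GPT-2. It is an independent reimplementation for illustration, and the helper names (`to_byte_level`, `from_byte_level`) are assumptions, not functions of the `tokenizers` library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    extra = 0
    # Every remaining byte (spaces, control bytes, ...) is shifted to code points >= 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_byte_level(text):
    """Show text the way a byte-level tokenizer displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_byte_level(token):
    """Turn a byte-level token back into readable text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8", errors="replace")

# The letter س is the two UTF-8 bytes D8 B3, which display as the two characters Ø and ³.
print(to_byte_level("س"))
# A leading space (byte 0x20) displays as Ġ, as in the tokens printed above.
print(from_byte_level("ĠÙħØŃÙħØ¯"))
```

Each Arabic letter takes two bytes in UTF-8, which is why every letter appears as a pair of characters in the raw token strings.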


### Bert Word Piece Tokenizer

In [None]:
# Initialize a tokenizer
Bert_tokenizer = BertWordPieceTokenizer()


In [None]:
BertTokenizer=train_tokenizer(Bert_tokenizer,paths)

In [None]:
# BertProcessing takes the end-of-sequence pair (sep) first and the start-of-sequence pair (cls) second.
BertTokenizer._tokenizer.post_processor = BertProcessing(
    ("<eos>", BertTokenizer.token_to_id("<eos>")),
    ("<bos>", BertTokenizer.token_to_id("<bos>")),
)
BertTokenizer.enable_truncation(max_length=512)

print( BertTokenizer.encode("سافر محمد وخالد إلى الشرقية سويا").tokens )

['<bos>', 'سافر', 'محمد', 'وخال', '##د', 'الى', 'الشرقية', 'سويا', '<eos>']
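
The split of `وخالد` into `وخال` and `##د` follows from how WordPiece segments a word: it repeatedly takes the longest vocabulary entry that matches from the current position, and marks non-initial pieces with `##`. The sketch below shows that greedy lookup on a toy vocabulary. The function `wordpiece_tokenize` and the vocabulary are assumptions made for this illustration, not the library's code or the vocabulary learned above.

```python
def wordpiece_tokenize(word, vocab, unk_token="<unk>", prefix="##"):
    """Split one word into WordPiece units by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary: it also contains و and ##خالد, the linguistically natural split.
toy_vocab = {"سافر", "محمد", "وخال", "##د", "و", "##خالد"}
pieces = wordpiece_tokenize("وخالد", toy_vocab)
print(pieces)
```

Because the lookup is greedy, the longer prefix `وخال` wins even though the vocabulary also allows the conjunction `و` followed by the name. This is one reason subword splits for Arabic do not always line up with morphological boundaries.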
