# ArbEngVec

Word Embeddings (WE) are getting increasingly popular and widely applied in many Natural Language Processing (NLP) applications due to their effectiveness in capturing semantic properties of words; Machine Translation (MT), Information Retrieval (IR) and Information Extraction (IE) are among such areas.  In this project , we propose an open source ArbEngVec which provides several Arabic-English cross-lingual word embedding models.  To train our bilingual models, we use a large dataset with more than 93 million pairs of Arabic-English parallel sentences. 

## Evaluation

WE perform both extrinsic and intrinsic evaluations for the different word embedding model variants. The extrinsic evaluation assesses the performance of models on the cross-language Semantic Textual Similarity (STS), while the intrinsic evaluation is based on the Word Translation (WT) task.


## Models 

All model variants with GenSim format can be found here: https://drive.google.com/open?id=1S2ugc8pZshYD3mTAkShKoA7He4Ih4gnS

All model variants with Binary format can be found here: https://drive.google.com/open?id=1guk6kNVoJFuaq1_zFRaEKBPZjz7LhOuj



## Visualisation : Random shuffle with SkipGram Model


![Random shuffle with SkipGram](https://raw.githubusercontent.com/Raki22/ArbEngVec/master/random_skip_plot.png)


## Citation

In further research usage of this script please use this citation:


@inproceedings{lachraf-etal-2019-arbengvec,
    title = "{A}rb{E}ng{V}ec : {A}rabic-{E}nglish Cross-Lingual Word Embedding Model",
    author = "Lachraf, Raki  and
      Nagoudi, El Moatez Billah  and
      Ayachi, Youcef  and
      Abdelali, Ahmed  and
      Schwab, Didier",
    booktitle = "Proceedings of the Fourth Arabic Natural Language Processing Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4605",
    doi = "10.18653/v1/W19-4605",
    pages = "40--48",
}


<h1 style= color:red;><b>Data Set</b> </h1>
<p>Corpus: MultiUN</p>
<p>Content: The MultiUN parallel corpus is extracted from the United Nations Website</p>
<p>Sentences: 20.3M</p>
<p>Link hugging face: <a href="https://huggingface.co/datasets/Helsinki-NLP/un_pc/viewer/ar-fr">Link to data set</a> </p>



<h1 style = color:red => <b>Imports</b><h1>

In [None]:

from datasets import load_dataset

# Arabic preprocessing
from arabic_preprocess import arabic_preprocesser

# NLTK is a leading platform for building Python programs to work with human language data. 
import nltk


#Library detecting the language used
import langid

# For calculating the duration of training
import time

# For stop word removal
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Convert a document into a list of tokens.
from gensim.utils import simple_preprocess

# To shuffle the list of words randomly
from random import shuffle

# Word2Vec
from gensim.models import Word2Vec

#To retrieve number of rows
from datasets import load_dataset_builder


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulazizalmakhdhoub/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulazizalmakhdhoub/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<h1 style= color:red> <b>Data set preperations</b><h1>

In [2]:
def data_set_preperations():
    # Use this dataset
    ds = load_dataset("Helsinki-NLP/un_pc",split='train', data_dir ="ar-fr",streaming = True)
    return ds




<h1 style= color:red> <b>identifying ideal sentence  to keep the model as clean as possible</b><h1>


In [3]:
def identify_ideal_sentence(ar,fr):

    return langid.classify(ar)[0] == 'ar' and langid.classify(fr)[0] =='fr'

<h1 style= color:red> <b>Resume logic </b><h1>
<p> when colab runtime unexpectedly disconnects...the last index can be retrieved from output 
and affected to stopped_count argument so the training can restart where it left off (simply by passing already trained pairs) 
</p>

In [4]:
def Resume_logic(stopped_count,dataset):
    # generator version of dataset
    g = (iter(dataset))
    for i in range(0,stopped_count):
        next(g)
    return g


<h1 style= color:red> <b>Stop word removal </b><h1>


In [5]:
def stopWordsRemover(ar_list,fr_list):
  ar_nS = []
  fr_nS = []

  ar_stopwords_list = stopwords.words('arabic')
  fr_stopwords_list = stopwords.words('french')
 
  for word in ar_list:
    if word not in ar_stopwords_list:
      ar_nS.append(word)

  
  for word in fr_list:
    if word not in fr_stopwords_list:
      fr_nS.append(word)
  
  return {"ar":ar_nS,"fr":fr_nS}

<h1 style= color:red> <b>random shuffle </b><h1>


In [None]:
def random_shuffle(ar, fr):
    # clean Arabic first
    ar_clean = arabic_preprocesser(ar)

    # Arabic list of words
    ar_w_list = simple_preprocess(ar_clean)

    # French list of words
    fr_w_list = simple_preprocess(fr)

    dic_ar_fr = stopWordsRemover(ar_w_list, fr_w_list)
    temp = dic_ar_fr['ar'] + dic_ar_fr['fr']
    shuffle(temp)
    return temp


In [None]:
def random_shuffle(ar,fr):

    ar_string_cleaned = arabic_preprocesser(ar)

    # Arabic list of words
    ar_w_list = simple_preprocess(ar)
    # French list of words
    fr_w_list = simple_preprocess(fr)

    dic_ar_fr = stopWordsRemover(ar_w_list,fr_w_list)


    temp = dic_ar_fr['ar'] + dic_ar_fr['fr']
    shuffle(temp)
    return temp
    


<h1 style= color:red> <b>Training</b><h1>


In [7]:
def trainer(modelLocation,re_train,stopped_count =0):
    ds = data_set_preperations()
    #number of rows
    pairsNumber = 20281645
    g = Resume_logic(stopped_count,ds)
    
    documents = []
    start = time.time()
    for i in range(0, 33000):
        row =next(g)['translation']
        ar = row['ar']
        fr = row['fr']
        if(not identify_ideal_sentence(ar,fr)):
            pass
        else: 
            documents.append((random_shuffle(ar,fr)))
    if (re_train == 0):
        print("creating model")
        model = Word2Vec(documents, vector_size = 300, window = 5, min_count = 10, workers = 4, sg = 0)
        model.save(modelLocation)
        print("sentence {}: model initialized and trained on the suitable part of first 33000 sentence pairs, vocab now holds {} words".format(i + 1 + stopped_count, len(model.wv)))
    elif (re_train ==1):
        print("loading model")
        model = Word2Vec.load(modelLocation)
        model.build_vocab(corpus_iterable = documents, update = True)
        model.train(documents,total_examples=len(documents),epochs=10)
        model.save(modelLocation)
        print("sentence {}: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds {} words".format(i + 1 + stopped_count,len(model.wv)))

    documents = []
    for i in range(0, pairsNumber - 33000 - stopped_count):
        row =next(g)['translation']
        ar = row['ar']
        fr = row['fr']
        if(not identify_ideal_sentence(ar,fr)):
            pass
        else: 
            documents.append((random_shuffle(ar,fr)))
            if(len(documents)==33000):
                model.build_vocab(corpus_iterable  = documents, update = True)
                model.train(documents,total_examples=len(documents),epochs=10)
                model.save(modelLocation)
                print("sentence {}: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds {} words".format(i + 1 + 33000 + stopped_count,len(model.wv)))
                documents = []
    model.build_vocab(corpus_iterable = documents,update=True)
    model.train(documents,total_examples=len(documents),epochs=10)
    model.save(modelLocation)
    print("sentence {}: model trained on the remaining suitable sentence pairs, vocab now holds {} words".format(i + 1 + 33000 + stopped_count, len(model.wv)))
    end = time.time()

    print("DONE :)")
    print("time spent in traning (in seconds): {}".format(end-start))


<h1 style= color:red> <b>Test</b><h1>


In [8]:
trainer("randomshuffle_5window_skipgram_300size.model",re_train = 0,stopped_count =0)

Resolving data files:   0%|          | 0/18 [00:00<?, ?it/s]

creating model
sentence 33000: model initialized and trained on the suitable part of first 33000 sentence pairs, vocab now holds 10536 words
sentence 71544: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 15570 words
sentence 111404: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 17621 words
sentence 147925: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 20047 words
sentence 183395: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 21682 words
sentence 222110: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 22823 words
sentence 259839: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 24034 words
sentence 301947: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab 

'(ProtocolError('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')), '(Request ID: 1d3ba7cf-002a-4511-87d3-51fb3613744b)')' thrown while requesting GET https://huggingface.co/datasets/Helsinki-NLP/un_pc/resolve/99a0348ce711ec6b30d5f18d70c0d30c971011c7/ar-fr/train-00006-of-00018.parquet
Retrying in 1s [Retry 1/5].


sentence 6947873: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 74377 words
sentence 6983127: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 74567 words
sentence 7025223: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 74926 words
sentence 7064941: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 75085 words
sentence 7100980: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 75200 words
sentence 7137870: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 75263 words
sentence 7175513: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now holds 75321 words
sentence 7211663: model loaded and trained on the suitable part of the other 33000 sentence pairs, vocab now ho