## Integrate [AraVec](https://github.com/bakrianoo/aravec) with [spaCy](https://spacy.io/)

This notebook demonstrates how to integrate an [AraVec](https://github.com/bakrianoo/aravec) model with [spaCy](https://spacy.io/).

## Outline

- Install/Load the required modules
- Load AraVec
- Export the vectors in word2vec format and gzip the file
- Initialize the spaCy model using AraVec vectors
- Run your AraVec spaCy model
- Test the Model

## Install/Load the required modules

In [2]:
# !pip install gensim spacy nltk

In [3]:
import gensim
import re
import spacy

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','&quot;','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
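    # remove tashkeel (Arabic diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel, "", text)

    # collapse elongation: any run of a repeated character is cut down to two
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)

    # reduce doubled waw, yaa and alef to a single letter
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')

    # apply the character substitutions listed in `search` and `replace`
    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])

    # trim leading and trailing whitespace
    text = text.strip()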

    return text
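
The two regular-expression steps inside `clean_str` can be tried in isolation. The following is a minimal sketch using only the standard library; the helper names `strip_tashkeel` and `collapse_elongation` are hypothetical and not part of AraVec, and the sample words are written as escapes so the combining marks stay visible.

```python
import re

# Same patterns as in `clean_str`; the helper names are illustrative only.
TASHKEEL = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
ELONGATION = re.compile(r'(.)\1+')


def strip_tashkeel(text):
    """Remove Arabic diacritics (fatha, damma, kasra, sukun, ...)."""
    return TASHKEEL.sub("", text)


def collapse_elongation(text):
    """Cut any run of a repeated character down to exactly two."""
    return ELONGATION.sub(r"\1\1", text)


# "madrasa" (school) written with four diacritics
vowelled = "\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629"
bare = strip_tashkeel(vowelled)
print(len(vowelled), "->", len(bare))  # 9 code points in, 5 letters out

# "jamil" (beautiful) with the yaa stretched to four repetitions
stretched = "\u062C\u0645" + "\u064A" * 4 + "\u0644"
print(len(stretched), "->", len(collapse_elongation(stretched)))  # 7 -> 5
```

Note that `clean_str` goes one step further than this sketch: after collapsing runs to two characters, it reduces doubled waw, yaa and alef to a single letter.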

## Load AraVec
Download a model from the [AraVec Repository](https://github.com/bakrianoo/aravec), then follow the steps below to load it.

In [5]:
# Download via terminal commands
!wget "https://bakrianoo.sfo2.digitaloceanspaces.com/aravec/full_grams_cbow_100_twitter.zip"
# !unzip "full_grams_cbow_100_twitter.zip"

--2021-01-16 18:53:21--  https://bakrianoo.sfo2.digitaloceanspaces.com/aravec/full_grams_cbow_100_twitter.zip
Resolving bakrianoo.sfo2.digitaloceanspaces.com (bakrianoo.sfo2.digitaloceanspaces.com)... 138.68.32.225
Connecting to bakrianoo.sfo2.digitaloceanspaces.com (bakrianoo.sfo2.digitaloceanspaces.com)|138.68.32.225|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-01-16 18:53:22 ERROR 404: Not Found.



In [10]:
# load the AraVec model
model = gensim.models.Word2Vec.load("full_grams_cbow_100_twitter.mdl")
print("We've",len(model.wv.index2word),"vocabularies")

We've 1476715 vocabularies
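
The `index2word` attribute used above is the gensim 3.x name; gensim 4.0 renamed it to `index_to_key`. A small compatibility helper, sketched here on the assumption that only those two names need handling, lets the same cell run on either version. The helper name `vocabulary` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def vocabulary(wv):
    """Return the word list of a KeyedVectors-like object.

    Tries the gensim 4.x attribute first and falls back to the 3.x one.
    """
    keys = getattr(wv, "index_to_key", None)
    if keys is None:
        keys = wv.index2word
    return keys


# Stand-ins for `model.wv`, so the sketch runs without gensim installed
old_style = SimpleNamespace(index2word=["a", "b", "c"])
new_style = SimpleNamespace(index_to_key=["a", "b"])

print(len(vocabulary(old_style)), len(vocabulary(new_style)))
```

With the real model, the count would be obtained as `len(vocabulary(model.wv))`.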


## Export the vectors in word2vec format and gzip the file

In [0]:
# make a directory called "spacyModel"
%mkdir spacyModel

In [0]:
# export the vectors in word2vec format to the directory
model.wv.save_word2vec_format("./spacyModel/aravec.txt")

In [0]:
# using `gzip` to compress the .txt file
!gzip ./spacyModel/aravec.txt
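
The exported file is plain text: a header line holding the vocabulary size and the vector dimension, followed by one line per word. The sketch below writes a toy file in that layout and compresses it with Python's `gzip` module instead of the shell command. The file names and the two toy vectors are made up for illustration.

```python
import gzip
import os
import shutil
import tempfile

# Toy vectors standing in for the AraVec vocabulary (illustrative values only)
vectors = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}

workdir = tempfile.mkdtemp()
txt_path = os.path.join(workdir, "toy.txt")
gz_path = txt_path + ".gz"

# word2vec text layout: "<vocab_size> <dim>" header, then "<word> <v1> ... <vn>"
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(f"{len(vectors)} 3\n")
    for word, vec in vectors.items():
        f.write(word + " " + " ".join(str(x) for x in vec) + "\n")

# Compress the file (unlike the `gzip` command, this keeps the uncompressed copy)
with open(txt_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

# Read the header back from the compressed file
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())

print(vocab_size, dim)
```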

## Initialize the spaCy model using AraVec vectors

- This creates a folder called `spacy.aravec.model` in your current working directory.
- This step may take several minutes to complete.
- `init-model` is the spaCy v2 command; spaCy v3 replaced it with `init vectors`.

In [14]:
!python -m spacy init-model ar spacy.aravec.model --vectors-loc ./spacyModel/aravec.txt.gz

[2K[38;5;2m✔ Successfully created model[0m
1476715it [01:04, 22923.32it/s]
✔ Loaded vectors from spacyModel/aravec.txt.gz
✔ Sucessfully compiled vocab
1476903 entries, 1476715 vectors


## Run your AraVec spaCy model


In [0]:
# load the AraVec spaCy model
nlp = spacy.load("./spacy.aravec.model/")

In [0]:
# Define the preprocessing class: it wraps the tokenizer so that
# every input text is normalized with `clean_str` before tokenization
class Preprocessor:
    def __init__(self, tokenizer, **cfg):
        self.tokenizer = tokenizer

    def __call__(self, text):
        preprocessed = clean_str(text)
        return self.tokenizer(preprocessed)

In [0]:
# Replace the pipeline's tokenizer with the `Preprocessor` wrapper
nlp.tokenizer = Preprocessor(nlp.tokenizer)
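
The wrapper pattern used here can be illustrated without spaCy: a callable object cleans the text first and then delegates to whatever tokenizer it was given. In the sketch below, `CleaningTokenizer`, `toy_clean` and `toy_tokenizer` are hypothetical stand-ins for `Preprocessor`, `clean_str` and `nlp.tokenizer`.

```python
class CleaningTokenizer:
    """Callable that normalizes text, then hands it to the wrapped tokenizer."""

    def __init__(self, tokenizer, clean):
        self.tokenizer = tokenizer
        self.clean = clean

    def __call__(self, text):
        return self.tokenizer(self.clean(text))


def toy_clean(text):
    # Stand-in for `clean_str`: collapse whitespace and lowercase
    return " ".join(text.split()).lower()


def toy_tokenizer(text):
    # Stand-in for `nlp.tokenizer`: split on single spaces
    return text.split(" ")


tokenize = CleaningTokenizer(toy_tokenizer, toy_clean)
print(tokenize("  Hello\tWorld  "))
```

Because the cleaning happens inside the tokenizer call, every text passed to `nlp(...)` is normalized the same way, without the caller having to remember to call `clean_str` first.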

## Test the Model

In [22]:
# Test your model
nlp("قطة").vector

array([ 0.6214019 ,  2.664876  , -2.4490244 , -0.13141291,  1.0106287 ,
        1.4277642 , -0.6019407 , -0.37155798,  2.2610269 , -0.51503485,
       -1.7400011 ,  1.4599515 ,  1.3110927 ,  0.4506139 ,  1.1511235 ,
       -2.3989084 ,  0.0108205 , -0.93597263,  0.20742278,  2.7626824 ,
       -0.21789424, -2.6269352 , -0.033042  , -2.0458148 ,  1.4766251 ,
       -2.589866  , -1.7341375 , -1.5589778 ,  0.57571614,  4.2727513 ,
        0.02701492,  1.77316   , -1.1816478 ,  0.24516247, -0.04227808,
        0.57215565,  3.2628767 , -1.2422727 ,  1.2351261 ,  1.7213373 ,
       -2.2107098 ,  3.334359  ,  0.835815  ,  0.27691752, -0.61994714,
        2.0607152 , -0.33151346,  2.132865  , -1.1516991 , -0.39679298,
       -2.1682317 ,  1.5982645 , -1.1571178 ,  1.3672193 , -0.81996626,
        0.5634883 ,  0.8571397 ,  1.2602032 ,  1.5811064 , -2.6346667 ,
       -0.21950944, -1.7665412 ,  1.3162723 , -0.9176698 , -0.5075662 ,
       -0.6396452 , -0.57308793,  2.6602883 ,  1.466169  , -0.54

In [24]:
egypt = nlp("مصر")
tunisia = nlp("تونس")
apple = nlp("تفاح")

print("egypt Vs. tunisia = ", egypt.similarity(tunisia))
print("egypt Vs. apple = ", egypt.similarity(apple))

egypt Vs. tunisia =  0.8277813628455393
egypt Vs. apple =  0.09689554003631644
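
By default, spaCy's `similarity` is the cosine similarity between the two objects' vectors. The sketch below computes that measure by hand on toy vectors; the function name `cosine_similarity` and the sample values are illustrative and are not AraVec output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (illustrative values only)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as `a`
c = [0.0, 0.0, 5.0]   # orthogonal to `a`

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

With the model loaded above, `cosine_similarity(egypt.vector, tunisia.vector)` should closely match `egypt.similarity(tunisia)`.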


## Done!

Congratulations, your AraVec model is now running on spaCy.