# Description + Algorithm

---

**This is AAQAD: the Alexu Arabic Question-Answer Dataset**

**The data collection procedure is almost the same as that of the SQUAD 2.0 dataset**

**The main objective of this dataset is to address the MRQA problem in Arabic**

**V3.0 is the FINAL VERSION of the AAQAD Automatic Generator and does not contain any testing cells**

---

                                            ALGORITHM  
            
            For each article in SQUAD 2.0:

                 Open the article's English Wikipedia page.

                 If an Arabic version of this page exists:

                            Translate English page using Google Translate

                            Find matching translated paragraphs with the Arabic page

                            For each matched paragraph in each article:

                                      Save this paragraph (Arabic version from Arabic Wikipedia page)

                                      For each Question in SQUAD 2.0 on this paragraph:

                                                  Translate it with its answer(s) using Google Translate

                                                  Save it with the corresponding paragraph in JSON format

                 Else:
                            Skip this article (it will not be included in AAQAD)
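
The outline above can be sketched as a single loop. This is only an illustration of the control flow, not the notebook's implementation: `build_dataset` and its three callable parameters are hypothetical names, and the real fetching, translation, and matching steps are defined in the cells below.

```python
def build_dataset(articles, fetch_arabic_paragraphs, translate, match_paragraphs):
    """Assemble SQuAD-style records following the outline above.

    All three callables are hypothetical stand-ins supplied by the caller:
      fetch_arabic_paragraphs(title) -> list of Arabic paragraphs ([] if no page)
      translate(text)                -> machine-translated Arabic text
      match_paragraphs(translated, arabic) -> {english_index: arabic_paragraph}
    """
    dataset = []
    for article in articles:
        arabic = fetch_arabic_paragraphs(article["title"])
        if not arabic:
            continue  # no Arabic page: the article is skipped
        translated = [translate(p["context"]) for p in article["paragraphs"]]
        paragraphs = []
        for index, ar_context in match_paragraphs(translated, arabic).items():
            qas = []
            for qa in article["paragraphs"][index]["qas"]:
                qas.append({
                    "question": translate(qa["question"]),
                    "answers": [translate(a["text"]) for a in qa["answers"]],
                })
            # the saved context is the paragraph from the Arabic page itself
            paragraphs.append({"context": ar_context, "qas": qas})
        if paragraphs:
            dataset.append({"title": article["title"], "paragraphs": paragraphs})
    return dataset
```

Passing the steps in as callables keeps the sketch runnable without network access; any stub with the same shape can be substituted while experimenting.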




# Imports + Packages install

In [0]:
!pip install googletrans

!pip install munkres
  



In [0]:
from google.colab import files
import difflib
import json
import operator
import random
import re
import time
import unicodedata
from urllib.request import Request, urlopen

import gensim
import numpy as np
import requests
from bs4 import BeautifulSoup
from googletrans import Translator
from munkres import Munkres, print_matrix
from nltk import ngrams
from nltk import TreebankWordTokenizer
from nltk import WordPunctTokenizer
from nltk import WhitespaceTokenizer
from scipy import spatial
from textblob.base import BaseTokenizer

# Implementation Functions

## Retrieve Arabic Wikipedia paragraphs (if they exist)

In [0]:
'''
Given an article title from SQUAD 2.0:
  Check whether an Arabic page exists and retrieve its Arabic paragraphs
  Returns:
    boolean for Arabic page existence
    the Arabic title
    the Arabic paragraphs
'''

def get_arabic_paragraphs(title):
  
  arabic_page_exists = True
  
  html = requests.get('https://ar.wikipedia.org/wiki/'+title).text
  soup = BeautifulSoup(html, "html.parser")
  ar_paragraphs = [p.get_text() for p in soup.find_all("p")]
  ar_title = ''
   
  # check whether the Arabic Wikipedia page is missing for the given title
  if len(ar_paragraphs) == 1 and "هذه الصفحة خالية" in ar_paragraphs[0]:
    
    arabic_page_exists = False  
   
  # the Arabic Wikipedia page exists
  else:
    
    #get arabic title 
    ar_title = [h1.get_text() for h1 in soup.find_all("h1")]
    ar_title =  ar_title[0]
    
    #reformat arabic paragraphs
    for i in range(len(ar_paragraphs)):
#       ar_paragraphs[i] = unicodedata.normalize("NFD", ar_paragraphs[i])
      ar_paragraphs[i] = re.sub(r'(\[(\d+)\])|(\[بحاجة لمصدر\])', '', ar_paragraphs[i])
            
  return arabic_page_exists, ar_title, ar_paragraphs
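
As a small illustration of the cleanup step, the sketch below applies the same regular expression to a sample sentence. The helper name `strip_citations` and the sample text are made up for this example.

```python
import re

# Same pattern as in get_arabic_paragraphs: numeric footnote markers such as
# [12] and the Arabic "citation needed" tag.
CITATION_PATTERN = re.compile(r'(\[(\d+)\])|(\[بحاجة لمصدر\])')

def strip_citations(paragraph):
    """Remove footnote markers and 'citation needed' tags from a paragraph."""
    return CITATION_PATTERN.sub('', paragraph)

sample = "القاهرة عاصمة مصر[1] وأكبر مدنها[بحاجة لمصدر]."
print(strip_citations(sample))
```

Bracketed text that is neither a number nor the citation tag is left untouched.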

## Translate paragraphs to Arabic

In [0]:
## dictionary to convert arabic digits to english

eastern_to_western = {"٠":"0","١":"1","٢":"2","٣":"3","٤":"4","٥":"5","٦":"6","٧":"7","٨":"8","٩":"9",
                      "0":"0","1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9",
                      '²':'','₂':''}


## regex pattern to eliminate non translatable words

regex = re.compile('[^a-zA-Z ]')


In [0]:
## translate a given paragraph to arabic

def translate_to_arabic(paragraph):
  
  # delay 2 seconds before calling the translation service
  time.sleep(2)
  
  while True:
    try:
      
      translator = Translator() 
      translatedParagraph = translator.translate(paragraph, dest='ar')
      
      break
    except Exception:
      # a bare "except:" would also swallow KeyboardInterrupt and make
      # this retry loop impossible to interrupt
      print("Removing non-translatable words from paragraph")
      
      # keep only Latin letters and spaces, wait, then retry
      paragraph = regex.sub('', paragraph)
      time.sleep(2)
      
  
  # replace Eastern Arabic digits with Western ones; digit-like characters
  # missing from the table (e.g. '³') are kept as-is instead of raising KeyError
  translatedParagraph = list(translatedParagraph.text)
   
  for i in range(0,len(translatedParagraph)):
    if translatedParagraph[i].isdigit():
      translatedParagraph[i] = eastern_to_western.get(translatedParagraph[i], translatedParagraph[i])

  translatedParagraph = "".join(translatedParagraph)
  
  
  return translatedParagraph
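
The digit replacement at the end of `translate_to_arabic` can also be written with `str.translate`. The following is a minimal sketch of that alternative, not the notebook's code: `normalize_digits` and `DIGIT_TABLE` are hypothetical names.

```python
# Map Eastern Arabic digits to Western digits and delete the two
# superscript/subscript characters that the lookup table maps to ''.
EASTERN_DIGITS = "٠١٢٣٤٥٦٧٨٩"
DIGIT_TABLE = str.maketrans(EASTERN_DIGITS, "0123456789", "²₂")

def normalize_digits(text):
    """Return text with Eastern Arabic digits converted to Western digits."""
    return text.translate(DIGIT_TABLE)

print(normalize_digits("عام ١٩٥٢"))
```

Characters that are not in the table pass through unchanged, so no per-character lookup can fail.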

## Similarity Pretrained Model

**This Source Code is taken from:**
1. https://github.com/bakrianoo/aravec
2. https://github.com/adhaamehab/textblob-ar

### **Download Pretrained Word Embedding Model**

In [0]:
### downloading a pretrained arabic word embedding model
!wget -qq https://bakrianoo.sfo2.digitaloceanspaces.com/aravec/full_grams_cbow_300_wiki.zip -P ./data

### unzip the pretrained model  
!unzip -qq data/full_grams_cbow_300_wiki.zip -d ./data

### **utilities.py Module**

In [0]:
# =========================
# ==== Helper Methods =====

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','&quot;','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
    #remove tashkeel
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel,"", text)
    
    
    #remove elongation (runs of a repeated character)
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    
    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])
    
    #trim    
    text = text.strip()

    return text

def get_vec(n_model,dim, token):
    vec = np.zeros(dim)
    exist = False
    is_vec = False
    if token not in n_model.wv:
        _count = 0
        is_vec = True
        for w in token.split("_"):
            if w in n_model.wv:
                _count += 1
                vec += n_model.wv[w]
        if _count > 0:
            vec = vec / _count
            exist = True
    else:
        vec = n_model.wv[token]
        exist = True
    
    return vec,exist

def calc_vec(pos_tokens, neg_tokens, n_model, dim):
    vec = np.zeros(dim)
    for p in pos_tokens:
        # get_vec returns (vector, exists); only the vector is accumulated
        vec += get_vec(n_model,dim,p)[0]
    for n in neg_tokens:
        vec -= get_vec(n_model,dim,n)[0]
    
    return vec   

## -- Retrieve all ngrams for a text in between a specific range
def get_all_ngrams(text, nrange=3):
    text = re.sub(r'[\,\.\;\(\)\[\]\_\+\#\@\!\?\؟\^]', ' ', text)
    tokens = [token for token in text.split(" ") if token.strip() != ""]
    ngs = []
    for n in range(2,nrange+1):
        ngs += [ng for ng in ngrams(tokens, n)]
    return ["_".join(ng) for ng in ngs if len(ng)>0 ]

## -- Retrieve all ngrams for a text in a specific n
def get_ngrams(text, n=2):
    text = re.sub(r'[\,\.\;\(\)\[\]\_\+\#\@\!\?\؟\^]', ' ', text)
    tokens = [token for token in text.split(" ") if token.strip() != ""]
    ngs = [ng for ng in ngrams(tokens, n)]
    return ["_".join(ng) for ng in ngs if len(ng)>0 ]

## -- keep only the tokens that exist in a specific model
def get_existed_tokens(tokens, n_model):
    return [tok for tok in tokens if tok in n_model.wv ]
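
The fallback in `get_vec` (averaging the parts of an out-of-vocabulary n-gram joined by `_`) can be illustrated without loading the pretrained model. In this sketch a plain dictionary stands in for `n_model.wv`; `lookup_vector` and the toy two-dimensional vectors are assumptions made for the example.

```python
def lookup_vector(embeddings, dim, token):
    """Return (vector, found) for a token.

    If the joined n-gram is not in the vocabulary, average the vectors of
    its '_'-separated parts that are; if none are, return a zero vector.
    """
    if token in embeddings:
        return list(embeddings[token]), True
    parts = [embeddings[w] for w in token.split("_") if w in embeddings]
    if not parts:
        return [0.0] * dim, False
    averaged = [sum(values) / len(parts) for values in zip(*parts)]
    return averaged, True

# Toy vocabulary: the bigram below is absent, but both of its words exist.
toy = {"جامعة": [1.0, 0.0], "القاهرة": [0.0, 1.0]}
print(lookup_vector(toy, 2, "جامعة_القاهرة"))
```

The second element of the result matters to callers such as `avg_feature_vector`, which count only tokens that were actually found.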


### **tokenizer.py Module**

In [0]:
class NLTKTreebankWordTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return TreebankWordTokenizer().tokenize(text)

class NLTKWordPunctTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return WordPunctTokenizer().tokenize(text)


class NLTKWhitespaceTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return WhitespaceTokenizer().tokenize(text)


### **similarity.py Module**

In [0]:

class TextSimilarity:

    def __init__(self):
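        try:
            self.model = gensim.models.Word2Vec.load('data/full_grams_cbow_300_wiki.mdl')
        except FileNotFoundError as err:
            # re-raise with a message that says what is missing
            raise FileNotFoundError(
                "Pretrained model not found at data/full_grams_cbow_300_wiki.mdl; "
                "run the download cell first") from err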
            
    def avg_feature_vector(self, sentence, num_features=300):
        words = NLTKWordPunctTokenizer().tokenize(clean_str(sentence))
        feature_vec = np.zeros((num_features, ), dtype='float32')
        n_words = 0
        for word in words:
            word_vect,exist = get_vec(n_model=self.model, dim=num_features, token=word)
            feature_vec = np.add(feature_vec, word_vect)
            if exist:
              n_words += 1
        if (n_words > 0):
            feature_vec = np.divide(feature_vec, n_words)
        return feature_vec

    def similarity(self, sentence1, sentence2):
        vec1, vec2 = self.avg_feature_vector(sentence1), self.avg_feature_vector(sentence2)
        return self.cosine_similarity(vec1, vec2)

    def cosine_similarity(self, vec1, vec2):
        return 1 - spatial.distance.cosine(vec1, vec2)
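
When none of a sentence's tokens is in the vocabulary, `avg_feature_vector` returns an all-zero vector, and the cosine of a zero vector is undefined, so the similarity may come out as NaN. One possible guard is sketched below in plain Python; `safe_cosine_similarity` is a hypothetical helper, and returning 0.0 for zero vectors is an assumption rather than the notebook's behavior.

```python
import math

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: treat "no known tokens" as no similarity
    return dot / (norm1 * norm2)

print(safe_cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(safe_cosine_similarity([0.0, 0.0], [1.0, 0.0]))
```

Treating the undefined case as zero similarity keeps such sentences from ever passing a threshold check.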

### **Build Model**

In [0]:
sim = TextSimilarity()
# takes around 12 second (macbook pro 2017) to load the pretrained word2vec



### **Difflib built-in Similarity Method**

In [0]:
def difflib_similarity(paragraph1, paragraph2):
  
  sequence = difflib.SequenceMatcher(a = paragraph1, b = paragraph2, autojunk= False)
  difference = sequence.ratio()
  
  return difference  
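
For intuition about the scores this function produces, here is a short standalone sketch. `SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T is the total length of both strings; the helper name `ratio` is only for this example.

```python
import difflib

def ratio(a, b):
    """Character-level similarity in [0, 1], as in difflib_similarity."""
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()

print(ratio("1952", "1952"))   # identical strings
print(ratio("1952 ", "1952"))  # a trailing space lowers the score: 2*4/9
print(ratio("abcd", "bcde"))   # three shared characters: 2*3/8
```

The trailing-space case is relevant below, because `find_answer` builds each candidate span with a space after every word before comparing it.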

## **Retrieve Similar Arabic Paragraphs**

In [0]:
'''
Given translated Arabic paragraphs from the English Wikipedia page,
  retrieve the most similar Arabic paragraphs
  from all paragraphs of the Arabic Wikipedia page

  Uses the Hungarian assignment algorithm
  via the Munkres library implementation

Returns a dictionary:
  key: index of the original English paragraph in SQUAD (same order)
  value: the corresponding Arabic paragraph
'''

## Similarity Threshold
PARAG_SIM_THRESHOLD = 0.78


def get_similar_ar_paragraphs(en_paragraphs, wiki_ar_paragraphs):
  
  # translate the English paragraphs using Google Translate
  translated_ar_paragraphs = []
  for p in en_paragraphs:
    translated_ar_paragraphs.append(translate_to_arabic(p))
  
  
  # Prepare 2D cost matrix:
  #  rows: translated ar paragraphs
  #  cols: wiki ar paragraphs
  rows = len(translated_ar_paragraphs)
  cols = len(wiki_ar_paragraphs)
  cost_matrix = [[0.0 for i in range(cols)] for j in range(rows)] 
  
  for i in range (0,rows):
    for j in range (0,cols):
      #subtract the similarity from 1 to turn this into a minimization problem
      similarity = sim.similarity(translated_ar_paragraphs[i], wiki_ar_paragraphs[j])
      cost_matrix[i][j] = (1.0 - similarity)   
  
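  ## Optimization (currently disabled):
  #  the cost matrix could be shrunk by deleting rows whose costs are all
  #  above (1.0 - threshold); that line is kept commented out below
  cost_matrix = np.array(cost_matrix)
  #rows
  #cost_matrix = cost_matrix[~np.all(cost_matrix > (1.0-PARAG_SIM_THRESHOLD), axis=1)]
  # NaN costs (undefined cosine similarity) become the maximum cost 1.0;
  # the default of 0.0 would make them look like perfect matches
  # (the nan keyword requires NumPy >= 1.17)
  cost_matrix = np.nan_to_num(cost_matrix, nan=1.0)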
  cost_matrix = cost_matrix.tolist()
    
  # apply Munkres' algorithm on cost_matrix
  results = {}
  m = Munkres()
  indexes = m.compute(cost_matrix) 
  
  for row, column in indexes:
      if (-1*(cost_matrix[row][column] - 1) ) >= PARAG_SIM_THRESHOLD:
        results[row] = wiki_ar_paragraphs[column]
        
  return results
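
To show what the assignment step does, the sketch below solves the same problem on a tiny similarity matrix by brute force. This is a stand-in for `Munkres`, which is the library actually used above, and is practical only for very small matrices; `match_by_assignment` is a hypothetical name and the sketch assumes there are no more rows than columns.

```python
from itertools import permutations

def match_by_assignment(similarity, threshold=0.78):
    """Assign each row to a distinct column so that the total cost
    (1 - similarity) is minimal, then keep pairs at or above the threshold.

    Brute force over permutations: illustration only, assumes rows <= cols.
    """
    rows = len(similarity)
    cols = len(similarity[0]) if rows else 0
    if rows == 0 or cols == 0:
        return {}
    if rows > cols:
        raise ValueError("this sketch assumes rows <= cols")
    best_cols, best_cost = None, float("inf")
    for perm in permutations(range(cols), rows):
        cost = sum(1.0 - similarity[r][c] for r, c in enumerate(perm))
        if cost < best_cost:
            best_cols, best_cost = perm, cost
    return {r: c for r, c in enumerate(best_cols) if similarity[r][c] >= threshold}

# Row 0 is most similar to column 0, but giving column 0 to row 1 yields a
# better overall assignment.
sim_matrix = [[0.90, 0.80, 0.10],
              [0.85, 0.20, 0.30]]
print(match_by_assignment(sim_matrix))
```

A greedy row-by-row choice would pair row 0 with column 0 and leave row 1 without an acceptable match, which is why an optimal assignment is used instead.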

## **Finding correct answer from Arabic Paragraphs**

In [0]:
buck2uni = {"'": u"\u0621", # hamza-on-the-line
            "|": u"\u0622", # madda
            ">": u"\u0623", # hamza-on-'alif
            "&": u"\u0624", # hamza-on-waaw
            "<": u"\u0625", # hamza-under-'alif
            "}": u"\u0626", # hamza-on-yaa'
            "*": u"\u0630", # dhaal
            "$": u"\u0634", # shiin
            "_": u"\u0640", # taTwil
            "~": u"\u0651", # shaddah
            "`": u"\u0670", # dagger 'alif
            "{": u"\u0671", # waSla
            "A": u"\u0627",
            "a": u"\u064E",
            "B": u"\u0628",
            "b": u"\u0628",
            "C": u"\u0643",
            "c": u"\u062B",
            "D": u"\u062F",
            "d": u"\u062F",
            "E": u"\u0639",
            "e": u"\u0650",
            "F": u"\u0641",
            "f": u"\u0641",
            "G": u"\u062C",
            "g": u"\u062C",
            "H": u"\u062D",
            "h": u"\u0647",
            "I": u"\u0625",
            "i": u"\u0650",
            "J": u"\u0686",
            "j": u"\u0686",
            "K": u"\u0643",
            "k": u"\u0643",
            "L": u"\u0644",
            "l": u"\u0644",
            "M": u"\u0645",
            "m": u"\u0645",
            "N": u"\u0646",
            "n": u"\u0646",
            "O": u"\u0648",
            "o": u"\u0619",
            "P": u"\u0628",
            "p": u"\u0628",
            "Q": u"\u0642",
            "q": u"\u0642",
            "R": u"\u0631",
            "r": u"\u0631",
            "S": u"\u0635",
            "s": u"\u0633",
            "T": u"\u0637",
            "t": u"\u062A",
            "U": u"\u0653",
            "u": u"\u0653",
            "V": u"\u06A4",
            "v": u"\u06A4",
            "W": u"\u0648",
            "w": u"\u0648",
            "X": u"\u062E",
            "x": u"\u062E",
            "Y": u"\u0649",
            "y": u"\u064A",
            "Z": u"\u0638",
            "z": u"\u0632",
            "ch": u"\u0634"
            
}


In [0]:
'''
    Transliterate a string using the buck2uni table.
    With reverse=1 (the default), Latin characters are mapped to Arabic
    Unicode characters. With reverse=0, Arabic characters are mapped back
    to Latin ones.
'''

def transString(string, reverse=1):
    
    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
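
Because `transString` applies `str.replace` once per table entry in insertion order, a two-character key such as `"ch"` is tried only after `"c"` and `"h"` have already been replaced, so it never matches. A single-pass alternative for one-character keys is sketched below with `str.translate`; `TOY_TABLE` is a made-up four-letter subset for illustration, not the full table.

```python
# str.maketrans accepts a dict whose keys are single characters; every
# character is looked up exactly once, so replacements cannot cascade.
TOY_TABLE = str.maketrans({"s": "س", "l": "ل", "A": "ا", "m": "م"})

def transliterate(text, table=TOY_TABLE):
    """Map each Latin character through the table in a single pass."""
    return text.translate(table)

print(transliterate("slAm"))
```

Multi-character keys cannot be expressed in such a table and would need a separate step, for example a regular expression applied first.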

In [0]:
'''
 Find the best-matching answer, if one exists, in a given Arabic paragraph.
  Uses the similarity function to compare the translated answer with
  candidate answer spans from the Arabic paragraph.
  
  Returns:
    boolean: whether an answer exists
    the answer text
    the answer start index
'''


ANS_SIM_THRESHOLD = 0.6

def find_answer(ar_paragraph,translated_question,translated_answer,english_answer):
  
  ## required to return
  answer_exist = True
  correct_answer = ''
  answer_start_index = 0
  ar_paragraph = ar_paragraph.strip()   
  
  ## calculate the similarity of each sentence in the paragraph
  #  with the translated question and save them into a dict
  sentence_dic = {}
  ar_sentences = ar_paragraph.split('.')
  #remove empty entries from array
  ar_sentences = list(filter(None, ar_sentences))
   
  for  similar_sentence in ar_sentences:
    similarity = sim.similarity(translated_question,similar_sentence)
    if similar_sentence != '':
      sentence_dic[similar_sentence]= similarity
    
  #retrieve at most the 3 most similar sentences
  # (duplicate sentences share one dict key, so count the dict entries
  #  rather than the list to avoid calling max() on an empty dict)
  sorted_sentence = []
  iterations = min(3, len(sentence_dic))
  
  for i in range(0,iterations):
      sorted_sentence.append(max(sentence_dic.items(), key=operator.itemgetter(1)))
      sentence_dic.pop(max(sentence_dic.items(), key=operator.itemgetter(1))[0], None)
      

  if translated_answer.isdigit() :
    ## if the translated answer is a digit string
    
    max_difflib = -1
    #loop on max 3 similar sentences
    for  q in sorted_sentence:  
      ## retrieve exact similar word from similar_sentences
      # q[0] -> accessing the key (sentence) of each array entry
      words = q[0].split(' ')
      for i in range(0,len(words)):
        for j in range(i,len(words)):
          temp_ans = ''
          num_words = j-i+1
          l = i 
          for k in range(0,num_words):
            temp_ans += words[l] + ' '
            l += 1          
          difflib_sim = difflib_similarity(temp_ans,translated_answer)
          if difflib_sim > max_difflib:
            max_difflib = difflib_sim
            correct_answer = temp_ans            
    
    # retrieve answer start index
    correct_answer = correct_answer.strip() 
    answer_start_index = ar_paragraph.find(correct_answer)
    
    
    if max_difflib < 0.5:
      answer_exist = False
       
    
  else:
    ## if the translated answer is NOT a digit string
    sim_dic = {}
    
    max_sim_answer = -1
    for  q in sorted_sentence:
      ## retrieve exact similar word from similar_sentences
      words = q[0].split(' ')

      for i in range(0,len(words)):
        for j in range(i,len(words)):

          temp_ans = ''
          num_words = j-i+1
          l = i 
          for k in range(0,num_words):
            temp_ans += words[l] + ' '
            l += 1
          siml = sim.similarity(translated_answer,temp_ans)
          if (temp_ans != ' ') & (temp_ans != ''):
            sim_dic[temp_ans]= siml
          if siml > max_sim_answer:
            max_sim_answer = siml
            correct_answer = temp_ans
    
    temp_correct_ans = correct_answer
    
    if max_sim_answer >= ANS_SIM_THRESHOLD:
      # compare the (up to) 3 most similar candidate answers with difflib
      sorted_x = []
      for i in range(0,min(3,len(sim_dic))):
        sorted_x.append(max(sim_dic.items(), key=operator.itemgetter(1)))
        sim_dic.pop(max(sim_dic.items(), key=operator.itemgetter(1))[0], None)

      max_difflib = -1
      for e in sorted_x:

        if e[1] >= ANS_SIM_THRESHOLD:
          difflib_sim = difflib_similarity(e[0],translated_answer)
          if difflib_sim > max_difflib:
            max_difflib = difflib_sim
            correct_answer = e[0]

      if max_difflib < 0.5:
        correct_answer = temp_correct_ans


      # retrieve answer start index
      correct_answer = correct_answer.strip() 
      answer_start_index = ar_paragraph.find(correct_answer)
    
    else:
    ## not a digit answer, and the best similarity is below the threshold
    
      ## Issue 5: the answer appears untranslated (in English) in the Arabic paragraph
      max_difflib = -1
      for  q in sorted_sentence:
        ## retrieve exact similar word from similar_sentences
        words = q[0].split(' ')
        for i in range(0,len(words)):
          for j in range(i,len(words)):
            temp_ans = ''
            num_words = j-i+1
            l = i 
            for k in range(0,num_words):
              temp_ans += words[l] + ' '
              l += 1
            # compare with the ENGLISH answer
            difflib_sim = difflib_similarity(temp_ans,english_answer)
            if difflib_sim > max_difflib:
              max_difflib = difflib_sim
              correct_answer = temp_ans

      if max_difflib >= 0.5:
        # retrieve answer start index
        correct_answer = correct_answer.strip() 
        answer_start_index = ar_paragraph.find(correct_answer)

      else:
        ## Issue 6: transliterated answer, e.g. ‘Destiny’s child’ -> ‘دستنى شايلد’

        Transliterating = transString(english_answer)
        max_difflib = -1
        for  q in sorted_sentence:
          ## retrieve exact similar word from similar_sentences
          words = q[0].split(' ')
          for i in range(0,len(words)):
            for j in range(i,len(words)):
              temp_ans = ''
              num_words = j-i+1
              l = i 
              for k in range(0,num_words):
                temp_ans += words[l] + ' '
                l += 1
              difflib_sim = difflib_similarity(temp_ans,Transliterating)              
              if difflib_sim > max_difflib:
                max_difflib = difflib_sim
                correct_answer = temp_ans                

        correct_answer = correct_answer.strip() 
        answer_start_index = ar_paragraph.find(correct_answer) 

        if max_difflib < 0.4:
          answer_exist = False

  # handle if final answer is empty        
  if len(correct_answer) < 1:
    answer_exist = False
    
    
  return  answer_exist, correct_answer, answer_start_index 
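
Each branch of `find_answer` repeats the same inner search: enumerate every contiguous word span of a sentence and keep the one most similar to a target string. A compact standalone sketch of that search follows. `best_span` is a hypothetical helper that scores with `difflib` only and joins words without the trailing space the original adds, so its scores differ slightly from the notebook's.

```python
import difflib

def best_span(sentence, target):
    """Return (span, score) for the contiguous word span most similar to target."""
    words = sentence.split()
    best, best_score = "", -1.0
    for i in range(len(words)):
        for j in range(i, len(words)):
            candidate = " ".join(words[i:j + 1])
            score = difflib.SequenceMatcher(
                a=candidate, b=target, autojunk=False).ratio()
            if score > best_score:  # strict '>' keeps the first best span
                best, best_score = candidate, score
    return best, best_score

paragraph = "born in 1952 in Cairo"
span, score = best_span(paragraph, "1952")
start = paragraph.find(span)  # first occurrence of the span in the paragraph
print(span, score, start)
```

The number of spans grows quadratically with sentence length, which is one reason the search is restricted to the three sentences most similar to the question.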

## Save data into JSON File

In [0]:
def save_JSON(data,count):
  
  with open('./data/AAQAD-v1.0.json', 'w') as outfile:
    json.dump(data, outfile)
  
#   # download the file on device after collecting each 5 parags
#   if (count != 0) & ((count % 5) == 0):
#     files.download('/content/data/AAQAD-v1.0.json')
  
  return
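
By default `json.dump` escapes non-ASCII characters as `\uXXXX` sequences, so Arabic text in the saved file is valid JSON but hard to read by eye. If a human-readable file is preferred, a variant along the following lines could be used. `save_json_utf8` and the temporary path are assumptions for this sketch, not part of the notebook.

```python
import json
import os
import tempfile

def save_json_utf8(data, path):
    """Write data as UTF-8 JSON, keeping Arabic characters unescaped."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, ensure_ascii=False)

# Write a small sample to a temporary directory and read the raw text back.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
save_json_utf8({"title": "مصر"}, path)
with open(path, encoding="utf-8") as infile:
    raw = infile.read()
print(raw)
```

Both forms load back to the same Python objects; the difference is only in how the file looks on disk.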

# Main Function [ Run Here ]

**To avoid a shutdown due to cell inactivity:**

Run the following code in the browser console to keep the session from disconnecting.
Press Ctrl + Shift + I to open the inspector view.

Then go to the console and enter this code:

      function ClickConnect(){
      console.log("Working"); 
      document.querySelector("colab-toolbar-button#connect").click() 
      }
      setInterval(ClickConnect,60000)


It clicks the connect button every 60 seconds, which prevents the session from disconnecting.


In [0]:
### download SQUAD dataset on colab (found in 'Files/data' section)

#training set 
!wget -qq https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -P ./data
  
#dev set
!wget -qq https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -P ./data
  

In [0]:
def AQQAD_generator():
  
  #### dataset dictionary declaration
  # Will be used to store final JSON format of the dataset
  AAQAD_dic = {}
  AAQAD_dic["version"] = "v1.0"
  AAQAD_dic['data'] = []
  Question_ID = 1
  min_num_questions = 3
  
  #### Load the SQUAD training set (change the path to use the dev set)
  with open('./data/train-v2.0.json') as json_file:  
    file = json.load(json_file)
    
  ## loop on each article...
  en_article_count = -1
  ar_article_count = 0
  for article in file["data"][410:411]:
    en_article_count += 1
    print("##################################################")
    print("Article no ",en_article_count, " : ",article["title"])
        
    #extract all arabic paragraphs(if exists) from article title
    arabic_page_exist, article_ar_title, article_ar_paragraphs = get_arabic_paragraphs(article["title"])

    if arabic_page_exist:
      print("Arabic page exits !")
      article_paragraphs = []
      valid_article = False
      article_initiazlized = False
      
      ## send all english paragraphs along with the arabic ones 
      #  to be compared and to retrieve similarities between them...
      print("Collecting English paragraphs...")
      article_en_paragraphs = []
      for parag in article["paragraphs"]:
        article_en_paragraphs.append(parag["context"])      
      
      #find similarity between en_parags & wiki_ar parags
      print("Finding Similar Arabic paragraphs...")
      similar_parags_dic = get_similar_ar_paragraphs(article_en_paragraphs,article_ar_paragraphs)
      
      # if there are matching paragraphs, iterate over them to
      # retrieve their questions and answers
      if len(similar_parags_dic) > 0:
        ##NOTE: article["paragraphs"][parag index] -> parag
        
        ## loop over the questions of each matched paragraph
        # key: parag index - val: corresponding ar paragraph
        print("Looping on ",len(similar_parags_dic)," matched paragraphs...")
        for key,val in similar_parags_dic.items():
          print("\nParagraph no: ",key," was found in arabic")
          print("\nQuestions:")
          ques_count = -1
          
          # boolean to check for at least 1 valid question for the current paragraph
          valid_questions = False
          qas = []
          
          for ques in article["paragraphs"][key]["qas"]:
            ques_count += 1
            print("\nquestion no: ",ques_count)

            # translate question to arabic
            translated_ques = translate_to_arabic(ques["question"])

            # check whether the first answer (or plausible answer) exists in the matched Arabic paragraph
            print("Cheking Answer")

            if ques["is_impossible"] == False:
              translated_ans = translate_to_arabic(ques["answers"][0]["text"])
              en_ans = ques["answers"][0]["text"]
            else:
              #plausible answers can be empty
              if len(ques["plausible_answers"]) > 0:
                translated_ans = translate_to_arabic(ques["plausible_answers"][0]["text"])
                en_ans = ques["plausible_answers"][0]["text"]
              else:
                en_ans = ""
            
            if en_ans != "":
              #search for the correct answer in the Arabic paragraph
              ans_exist, correct_ans, ans_index_start = find_answer(val,translated_ques,translated_ans,en_ans)
              if ans_exist:
                print("Answer exists for Question no: ",ques_count)
                if valid_questions == False:
                  valid_questions = True

                # add the valid answer to the appropriate dictionaries
                ans_details = {'text': correct_ans,
                              'answer_start':ans_index_start}
                if ques["is_impossible"] == False:
                  qas.append({'question' : translated_ques,
                              'id': Question_ID,
                              'answers': [ans_details],
                              'is_impossible': False})                  
                else:
                  qas.append({'plausible_answers':[ans_details],
                              'question' : translated_ques,
                              'id': Question_ID,
                              'answers': [],
                              'is_impossible': True})
                Question_ID += 1  
            
              else:
                print("Answer do not exist for Question no: ",ques_count)
            
            else:
              #handle empty answers when is impossible = true
              if ques["is_impossible"] == True:
                qas.append({'plausible_answers':[],
                              'question' : translated_ques,
                              'id': Question_ID,
                              'answers': [],
                              'is_impossible': True})
                Question_ID += 1
              
          #check for valid questions & minimum # of allowed questions to approve current paragraph
          if (valid_questions == True) & (len(qas) >= min_num_questions):
            valid_article = True            
            article_paragraphs.append({'qas': qas, 'context' : val })
          else:
            # roll back the IDs consumed by this rejected paragraph
            # (one ID was taken per entry in qas)
            Question_ID = Question_ID - len(qas)
      
     ## add all paragraphs to the correct article in AAQAD_dic
      if valid_article == True:
            
        # initialize the article once
        if article_initiazlized == False :
          AAQAD_dic['data'].append({'title' : article_ar_title,
                                        'paragraphs': []})
          article_initiazlized = True
        
        AAQAD_dic['data'][ar_article_count]['paragraphs'] = article_paragraphs
        ar_article_count += 1
        
        # Write current valid article To JSON file
        save_JSON(AAQAD_dic,ar_article_count)        
        
        # delay for 2 seconds between each article
        time.sleep(2)      
        
    else:
      print("Arabic page do NOT exist")
    
    print("####################################################")
  
  #Saving the latest updated version
  save_JSON(AAQAD_dic,ar_article_count)
#   files.download('/content/data/AAQAD-v1.0.json')
#   print("\nLastest version downloaded !")
  
  return
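
Since every `answer_start` comes from `str.find`, a quick consistency check can confirm that each stored answer really sits at its recorded offset in the context. The sketch below assumes the dictionary layout built by the generator above; `find_misaligned_answers` is a hypothetical helper and the sample record is invented.

```python
def find_misaligned_answers(dataset):
    """Return the ids of questions whose answer text is not at answer_start."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"] + qa.get("plausible_answers", []):
                    start, text = ans["answer_start"], ans["text"]
                    # str.find returns -1 when the text was not found
                    if start < 0 or context[start:start + len(text)] != text:
                        bad_ids.append(qa["id"])
                        break
    return bad_ids

context = "القاهرة عاصمة مصر"
sample = {"version": "v1.0", "data": [{"title": "مصر", "paragraphs": [{
    "context": context,
    "qas": [
        {"id": 1, "question": "س١", "is_impossible": False,
         "answers": [{"text": "مصر", "answer_start": context.find("مصر")}]},
        {"id": 2, "question": "س٢", "is_impossible": True, "answers": [],
         "plausible_answers": [{"text": "غائب", "answer_start": -1}]},
    ]}]}]}
print(find_misaligned_answers(sample))
```

Running such a check on the saved JSON is a cheap way to spot records that would confuse a span-extraction model during training.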

In [0]:
# RUN HERE
AQQAD_generator()


##################################################
Article no  0  :  Egypt
Arabic page exits !
Collecting English paragraphs...
Finding Similar Arabic paragraphs...
paragraph to translate:
Egypt (i/ˈiːdʒɪpt/; Arabic: مِصر‎ Miṣr, Egyptian Arabic: مَصر Maṣr, Coptic: Ⲭⲏⲙⲓ Khemi), officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and southwest corner of Asia, via a land bridge formed by the Sinai Peninsula. It is the world's only contiguous Eurafrasian nation. Most of Egypt's territory of 1,010,408 square kilometres (390,000 sq mi) lies within the Nile Valley. Egypt is a Mediterranean country. It is bordered by the Gaza Strip and Israel to the northeast, the Gulf of Aqaba to the east, the Red Sea to the east and south, Sudan to the south and Libya to the west.
paragraph to translate:
Egypt has one of the longest histories of any modern country, arising in the tenth millennium BC as one of the world's first nation states. Considered



Looping on  57  matched paragraphs...

Paragraph no:  0  was found in arabic

Questions:

question no:  0
Cheking Answer
Answer exists for Question no:  0

question no:  1
Cheking Answer
Answer do not exist for Question no:  1

question no:  2
Cheking Answer
Answer do not exist for Question no:  2

question no:  3
Cheking Answer
Answer exists for Question no:  3

question no:  4
Cheking Answer
Answer do not exist for Question no:  4

Paragraph no:  1  was found in arabic

Questions:

question no:  0
Cheking Answer
Answer exists for Question no:  0

question no:  1
Cheking Answer
Answer exists for Question no:  1

question no:  2
Cheking Answer
Answer do not exist for Question no:  2

question no:  3
Cheking Answer
Answer do not exist for Question no:  3

question no:  4
Cheking Answer
Answer do not exist for Question no:  4

Paragraph no:  2  was found in arabic

Questions:

question no:  0
Cheking Answer
Answer exists for Question no:  0

question no:  1
Cheking Answer
Answer exists f