# Description + Algorithm

---

**This is AAQAD: the Alexu Arabic Question-Answer Dataset.**

**The data collection procedure is almost the same as that of the SQuAD 2.0 dataset.**

**The main objective of this dataset is to address the MRQA problem in Arabic.**

---

                                            ALGORITHM  
            
            For each article in SQUAD 2.0:

                 Open article’s wikipedia English page.

                 If an Arabic version of this page exists:

                            Translate English page using Google Translate

                            Find matching translated paragraphs with the Arabic page

                            For each matched paragraph in each article:

                                      Save this paragraph (Arabic version from Arabic Wikipedia page)

                                      For each Question in SQUAD 2.0 on this paragraph:

                                                  Translate it with its answer(s) using Google Translate

                                                  Save it with the corresponding paragraph in JSON format

                 Else:
                            Skip this article (it will not be included in AAQAD)




# Imports + Packages install

In [0]:
!pip install googletrans

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/fd/f0/a22d41d3846d1f46a4f20086141e0428ccc9c6d644aacbfd30990cf46886/googletrans-2.4.0.tar.gz
Building wheels for collected packages: googletrans
  Building wheel for googletrans (setup.py) ... done
  Created wheel for googletrans: filename=googletrans-2.4.0-cp36-none-any.whl size=15776 sha256=d765345c911f2754a155091923eb5b7afbee0f3610c516c93838302280cbd131
  Stored in directory: /root/.cache/pip/wheels/50/d6/e7/a8efd5f2427d5eb258070048718fa56ee5ac57fd6f53505f95
Successfully built googletrans
Installing collected packages: googletrans
Successfully installed googletrans-2.4.0


In [0]:
import requests
from bs4 import BeautifulSoup
import unicodedata
from googletrans import Translator
import difflib
import json
import re
import numpy as np
from nltk import ngrams
from nltk import TreebankWordTokenizer
from nltk import WordPunctTokenizer
from nltk import WhitespaceTokenizer
from textblob.base import BaseTokenizer
import gensim
from scipy import spatial
import operator

# Implementation Functions

## **Retrieve Arabic Wikipedia paragraphs (if they exist)**

In [0]:
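## Given an article title from SQuAD 2.0,
# check whether an Arabic page exists and retrieve its Arabic paragraphs.

# Returns a boolean for Arabic page existence,
# the Arabic title (empty string if the page does not exist),
# and the Arabic paragraphs.

def get_arabic_paragraphs(title):
  
  arabic_page_exists = True
  # default value so the return statement works when no Arabic page exists
  ar_title = ""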
  
  html = requests.get('https://ar.wikipedia.org/wiki/'+title).text
  soup = BeautifulSoup(html, "html.parser")
  ar_paragraphs = [p.get_text() for p in soup.find_all("p")]
   
  #check if the arabic wikipedia page does not exist for the given title
  if len(ar_paragraphs) == 1 and "هذه الصفحة خالية" in ar_paragraphs[0]:
    
    arabic_page_exists = False  
   
  #if the arabic wikipedia page exists
  else:
    
    #get arabic title 
    ar_title = [h1.get_text() for h1 in soup.find_all("h1")]
    ar_title =  ar_title[0]
    
    #reformat arabic paragraphs
    for i in range(len(ar_paragraphs)):
      ar_paragraphs[i] = unicodedata.normalize("NFD", ar_paragraphs[i])
      ar_paragraphs[i] = re.sub(r'(\[(\d+)\])|(\[بحاجة لمصدر\])', '', ar_paragraphs[i])
            
  return arabic_page_exists, ar_title, ar_paragraphs
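
As a small illustration of the clean-up step above, the sketch below applies the same citation-stripping pattern to a sample string without any network access. The helper name `strip_citation_markers` is hypothetical and is not part of the notebook.

```python
import re

# Same pattern as in get_arabic_paragraphs: numeric references such as [12]
# and the Arabic "citation needed" tag.
CITATION_PATTERN = re.compile(r'(\[(\d+)\])|(\[بحاجة لمصدر\])')

def strip_citation_markers(text):
    """Hypothetical helper: remove Wikipedia reference markers from text."""
    return CITATION_PATTERN.sub('', text)

print(strip_citation_markers("born in 1981[4] in Houston[12]."))
```

Brackets that do not contain only digits (for example `[note]`) are left untouched by this pattern.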

### **Testing**

In [0]:
#testing

#check, ar_parag =  get_arabic_paragraphs("Wikipedia:Articles_in_many_other_languages_but_not_on_English_Wikipedia")

check, ar_title, ar_parag = get_arabic_paragraphs("Beyoncé")

print(ar_title)
# print(len(ar_parag))

# for p in ar_parag:
#   print(p)
#   print("##########################################")

# print("##########################################")
# for p in en_parag:
#   print(p)
#   print("##########################################")

بيونسيه


## Translate English paragraphs to Arabic

In [0]:
## dictionary to convert arabic digits to english

eastern_to_western = {"٠":"0","١":"1","٢":"2","٣":"3","٤":"4","٥":"5","٦":"6","٧":"7","٨":"8","٩":"9",
                      "0":"0","1":"1","2":"2","3":"3","4":"4","5":"5","6":"6","7":"7","8":"8","9":"9"}

In [0]:
## translate a given paragraph to arabic

def translate_to_arabic(paragraph):
  
  translator = Translator()
  translatedParagraph = translator.translate(paragraph, dest='ar')
  
  #replace arabic digits with english ones
  translatedParagraph = list(translatedParagraph.text)
   
  for i in range(0,len(translatedParagraph)):
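    if translatedParagraph[i].isdigit():
      # .get() leaves digit-like characters that are not in the dictionary (e.g. "²") unchanged
      translatedParagraph[i] = eastern_to_western.get(translatedParagraph[i], translatedParagraph[i])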

  translatedParagraph = "".join(translatedParagraph)
  
  return translatedParagraph
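
The digit replacement can also be written with a translation table. The following is a minimal sketch, assuming only the ten Arabic-Indic digits (U+0660 to U+0669) need converting; `to_western_digits` is a hypothetical name, not a function from the notebook.

```python
# Build the ten Arabic-Indic digits from their code points (U+0660..U+0669)
# and map them to the ASCII digits 0-9.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, "0123456789")

def to_western_digits(text):
    """Hypothetical helper: convert Arabic-Indic digits, leave all else as-is."""
    return text.translate(DIGIT_TABLE)

sample = "year " + chr(0x0661) * 4   # four Arabic-Indic "1" digits
print(to_western_digits(sample))
```

Because `str.translate` ignores characters that are not in the table, no per-character `isdigit()` check is needed.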

### **Testing**

In [0]:
## testing

p = "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say; born September 4, 1981)[4] is an American singer, songwriter and actress. Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as lead singer of the R&B girl-group Destiny's Child, one of the best-selling girl groups in history. Their hiatus saw the release of her first solo album, Dangerously in Love (2003), which debuted at number one on the US Billboard 200 chart and earned her five Grammy Awards.[5] The album also featured the US Billboard Hot 100 number-one singles 'Crazy in Love' and 'Baby Boy'."

# p = "Beyoncé Giselle Knowl 20"

translate_to_arabic(p)


'بيونسي جيزيل نولز كارتر (/ biːˈjɒnseɪ / bee-YON- say ؛ من مواليد 4 سبتمبر 1981) [4] مغنية وكاتب أغاني وممثلة أمريكية. ولدت ونشأت في هيوستن ، تكساس ، بيونسيه في مختلف مسابقات الغناء والرقص كطفل. صعدت إلى الشهرة في أواخر التسعينيات كمغنية رئيسية لمجموعة ديستينيز تشايلد للبنات ، إحدى أفضل مجموعات الفتيات مبيعًا في التاريخ. شهدت الفجوة الخاصة بهم إصدار ألبومها الفردي الأول ، Dangerious in Love (2003) ، الذي ظهر لأول مرة في المرتبة الأولى على مخطط Billboard 200 الأمريكي وحاز على جوائز Grammy الخمس. [5] وضم الألبوم أيضًا أغنية US Billboard Hot 100 الفردية الأولى "Crazy in Love" و "Baby Boy".'

In [0]:
## testing arabic to english digit conversion

s = " حكم الدولة العثمانية من سنة ١١١١ حتى سنة ٣٣٣٣"
s = list(s)

for i in range(0,len(s)):
  if s[i].isdigit():
    s[i] = ''.join([eastern_to_western[c] for c in s[i]])
    
s = "".join(s)

print(s)

 حكم الدولة العثمانية من سنة 1111 حتى سنة 3333


## **Similarity Pretrained Model**

**This Source Code is taken from:**
1. https://github.com/bakrianoo/aravec
2. https://github.com/adhaamehab/textblob-ar

### **Download Pretrained Word Embedding Model**

In [0]:
### downloading a pretrained arabic word embedding model
!wget -qq https://bakrianoo.sfo2.digitaloceanspaces.com/aravec/full_grams_cbow_300_wiki.zip -P ./data


In [0]:
### unzip the pretrained model  
!unzip -qq data/full_grams_cbow_300_wiki.zip -d ./data

### **utilities.py Module**

In [0]:
# =========================
# ==== Helper Methods =====

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','&quot;','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
    #remove tashkeel
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel,"", text)
    
    #remove longation
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    
    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])
    
    #trim    
    text = text.strip()

    return text

def get_vec(n_model,dim, token):
    # returns (vector, exist); exist is False when no part of the token is in the vocabulary
    vec = np.zeros(dim)
    exist = False
    if token not in n_model.wv:
        # fall back to averaging the vectors of the parts of an n-gram token
        _count = 0
        for w in token.split("_"):
            if w in n_model.wv:
                _count += 1
                vec += n_model.wv[w]
        if _count > 0:
            vec = vec / _count
            exist = True
    else:
        vec = n_model.wv[token]
        exist = True
    
    return vec,exist

def calc_vec(pos_tokens, neg_tokens, n_model, dim):
    vec = np.zeros(dim)
    for p in pos_tokens:
        # get_vec returns (vector, exist); only the vector is accumulated
        vec += get_vec(n_model,dim,p)[0]
    for n in neg_tokens:
        vec -= get_vec(n_model,dim,n)[0]
    
    return vec   

## -- Retrieve all ngrams for a text in between a specific range
def get_all_ngrams(text, nrange=3):
    text = re.sub(r'[\,\.\;\(\)\[\]\_\+\#\@\!\?\؟\^]', ' ', text)
    tokens = [token for token in text.split(" ") if token.strip() != ""]
    ngs = []
    for n in range(2,nrange+1):
        ngs += [ng for ng in ngrams(tokens, n)]
    return ["_".join(ng) for ng in ngs if len(ng)>0 ]

## -- Retrieve all ngrams for a text in a specific n
def get_ngrams(text, n=2):
    text = re.sub(r'[\,\.\;\(\)\[\]\_\+\#\@\!\?\؟\^]', ' ', text)
    tokens = [token for token in text.split(" ") if token.strip() != ""]
    ngs = [ng for ng in ngrams(tokens, n)]
    return ["_".join(ng) for ng in ngs if len(ng)>0 ]

## -- filter the existed tokens in a specific model
def get_existed_tokens(tokens, n_model):
    return [tok for tok in tokens if tok in n_model.wv ]
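
To show what the n-gram helpers above produce, here is a dependency-free sketch of the same idea. The function names are hypothetical; the underscore-joined form mirrors how `get_vec` splits a token on `_`.

```python
def word_ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def all_word_ngrams(tokens, nrange=3):
    """Bigrams up to nrange-grams, the same range get_all_ngrams uses."""
    result = []
    for n in range(2, nrange + 1):
        result += word_ngrams(tokens, n)
    return result

print(all_word_ngrams(["a", "b", "c", "d"]))
```

A token list shorter than `n` simply yields no n-grams of that size.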


### **tokenizer.py Module**

In [0]:
class NLTKTreebankWordTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return TreebankWordTokenizer().tokenize(text)

class NLTKWordPunctTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return WordPunctTokenizer().tokenize(text)


class NLTKWhitespaceTokenizer(BaseTokenizer):

    def tokenize(self, text):
        return WhitespaceTokenizer().tokenize(text)


### **similarity.py Module**

In [0]:

class TextSimilarity:

    def __init__(self):
        # raises FileNotFoundError (with the missing path) if the pretrained
        # model has not been downloaded and unzipped into ./data
        self.model = gensim.models.Word2Vec.load('data/full_grams_cbow_300_wiki.mdl')
            
    def avg_feature_vector(self, sentence, num_features=300):
        words = NLTKWordPunctTokenizer().tokenize(clean_str(sentence))
        feature_vec = np.zeros((num_features, ), dtype='float32')
        n_words = 0
        for word in words:
            word_vect,exist = get_vec(n_model=self.model, dim=num_features, token=word)
            feature_vec = np.add(feature_vec, word_vect)
            if exist:
              n_words += 1
        if (n_words > 0):
            feature_vec = np.divide(feature_vec, n_words)
        return feature_vec

    def similarity(self, sentence1, sentence2):
        vec1, vec2 = self.avg_feature_vector(sentence1), self.avg_feature_vector(sentence2)
        return self.cosine_similarity(vec1, vec2)

    def cosine_similarity(self, vec1, vec2):
        return 1 - spatial.distance.cosine(vec1, vec2)
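
When a sentence contains no in-vocabulary words, `avg_feature_vector` returns the zero vector, and a cosine computed against a zero vector involves a division by zero (in some SciPy versions this surfaces as a `RuntimeWarning` and a NaN result). The sketch below is one possible guard, written in plain Python for illustration; `safe_cosine_similarity` is a hypothetical name, and returning `0.0` for a zero vector is an assumption rather than the notebook's behaviour.

```python
import math

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: "no known words" is treated as "no similarity", not NaN.
        return 0.0
    return dot / (norm1 * norm2)

print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(safe_cosine_similarity([0.0, 0.0], [1.0, 2.0]))
```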

### **Build Model**

In [0]:
sim = TextSimilarity()
# takes around 12 seconds (MacBook Pro 2017) to load the pretrained word2vec model



In [0]:
## testing similarity

sent1 = u'الإرهابي الصالح هي رواية خيال سياسي للكاتبة دوريس ليسينج. ظهرت أول طبعة للرواية في سبتمبر من عام 1985 للناشرين جوناثان كيب في المملكة المتحدة وألفريد أ'

# sent2 = u'روايه الكاتبه دوريس ليسينج هي روايه خيال سياسي ظهرت في سبتمبر 1985 بعنوان الارهابي الصالح وتم نشرها عن طريق جوناثان كيب والفريد أ في انجلترا'

sent2 = u'الكوكبة هي مجموعة من النجوم التي تكون شكلا أو صورة، وهي تدل على المنطقة التي تظهر فيها مجموعة محدودة من النجوم. وقد قسم الاتحاد الفلكي الدولي في عام 1930 السماء إلى 88 كوكبة، وذلك لتوحيد أشكال الكوكبات وعددها بعد أن كانت تتخيلها كل من الحضارات القديمة بشكل مختلف.'


sim.similarity(sent1, sent2)

0.34852882960983866

### **Difflib built-in Similarity Method**

In [0]:
def difflib_similarity(paragraph1, paragraph2):
  
  sequence = difflib.SequenceMatcher(a = paragraph1, b = paragraph2, autojunk= False)
  difference = sequence.ratio()
  
  return difference  
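
For reference, `SequenceMatcher.ratio()` is computed as twice the number of matched characters divided by the total length of both strings. The short example below illustrates this on two toy strings; `char_similarity` is a hypothetical stand-in for the function above.

```python
import difflib

def char_similarity(a, b):
    """Ratio in [0, 1]: 2 * matched characters / total characters."""
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()

print(char_similarity("abcd", "abcd"))  # identical strings
print(char_similarity("abcd", "bcde"))  # the strings share the block "bcd"
```

In the second call three characters match out of eight in total, so the ratio is 2 * 3 / 8.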

## **Retrieve Similar Arabic Paragraphs**

In [0]:
## Given a translated Arabic paragraph, retrieve the most
#  similar paragraph from all of the Arabic Wikipedia
#  paragraphs of the article

## returns:
#  a boolean indicating whether a corresponding Arabic paragraph exists
#  the text of the best-matching Arabic paragraph

def get_similar_ar_paragraph(translated_ar_paragraph, ar_paragraphs):
  
  is_similar = True
  max_similarity = -1
  correct_ar_paragraph = ""
  
  for parag in ar_paragraphs:
    siml = sim.similarity(translated_ar_paragraph,parag)  
       
    if siml > max_similarity:
        max_similarity = siml
        correct_ar_paragraph = parag
                               
  #Threshold for similarity => 0.7 (70%)
  if max_similarity < 0.7:
    is_similar = False                 
                              
                               
  return is_similar, correct_ar_paragraph
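
The matching step is an arg-max over candidates followed by a threshold check. The sketch below shows that pattern with the scorer and threshold passed in as parameters. `best_match` and `char_ratio` are hypothetical names, and the character-level scorer is only a stand-in so the example runs without the embedding model.

```python
import difflib

def best_match(query, candidates, score, threshold):
    """Return (found, best_candidate); found is False below the threshold."""
    best, best_score = "", -1.0
    for candidate in candidates:
        s = score(query, candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best_score >= threshold, best

def char_ratio(a, b):
    # Stand-in scorer for illustration only; the notebook uses word embeddings.
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()

found, match = best_match("the cat sat",
                          ["a dog ran", "the cat sat down"],
                          char_ratio, threshold=0.7)
print(found, match)
```

Passing the threshold explicitly keeps the value in one place, which avoids comments and code drifting apart.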

## **Finding correct answer from Arabic Paragraphs**

In [0]:
######################### COMPARING ALL SENTENCES ###############################


### finding exact matching answer -if exists- given an arabic paragraph
## uses similarity function to compare translated answer with possible
## answer from the arabic paragraphs...

ANS_SIM_THRESHOLD = 0.8

def find_answer(ar_paragraph,translated_answer):
  
  ## required to return
  correct_answer = ''
  answer_exist = True
  answer_start_index = 0
  sim_dic = {}
  
  ## find most similar answer in each sentence in paragraph
  ar_sentences = ar_paragraph.split('.')
  max_sim_answer = -1
  for  similar_sentence in ar_sentences:
    ## retrieve exact similar word from similar_sentences
    words = similar_sentence.split(' ')

    for i in range(0,len(words)):
      for j in range(i,len(words)):

        # candidate answer: words i..j joined by spaces
        # (the trailing space is stripped before the answer is returned)
        temp_ans = ' '.join(words[i:j+1]) + ' '

        siml = sim.similarity(translated_answer,temp_ans)
        if temp_ans != ' ':
          sim_dic[temp_ans]= siml
        if siml > max_sim_answer:
          max_sim_answer = siml
          correct_answer = temp_ans
      
#   print("correct: ",correct_answer)
  temp_correct_ans = correct_answer
  if max_sim_answer < ANS_SIM_THRESHOLD:
    answer_exist = False
  
  else: 
    
    # compare the (up to) 3 highest-similarity candidates with difflib;
    # slicing avoids an error when fewer than 3 candidates exist
    sorted_x = sorted(sim_dic.items(), key=operator.itemgetter(1), reverse=True)[:3]
      
    max_difflib = -1
    for e in sorted_x:
   
      if e[1] >= ANS_SIM_THRESHOLD:
        difflib_sim = difflib_similarity(e[0],translated_answer)
        if difflib_sim > max_difflib:
          max_difflib = difflib_sim
          correct_answer = e[0]
    
    if max_difflib < 0.5:
      correct_answer = temp_correct_ans
    
   
    
    # retrieve answer start index
    correct_answer = correct_answer.strip() 
    answer_start_index = ar_paragraph.find(correct_answer)    
  
  
  return  answer_exist, correct_answer, answer_start_index
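
The search above enumerates every contiguous run of words and later recovers the start index with `str.find`, which returns the first occurrence of the text (or -1 if the stripped text is not found verbatim). As a sketch of an alternative, the generator below yields each candidate together with its character offset. `candidate_spans` and its `max_words` parameter are hypothetical, and unlike `find_answer` this sketch does not split the paragraph into sentences first.

```python
import re

def candidate_spans(paragraph, max_words=None):
    """Yield (text, start_index) for every contiguous run of words."""
    # Each match records where a whitespace-delimited word starts and ends.
    words = [(m.start(), m.end()) for m in re.finditer(r'\S+', paragraph)]
    for i in range(len(words)):
        last = len(words) if max_words is None else min(len(words), i + max_words)
        for j in range(i, last):
            start, end = words[i][0], words[j][1]
            yield paragraph[start:end], start

spans = list(candidate_spans("one two  three"))
for text, start in spans:
    print(start, repr(text))
```

Limiting `max_words` bounds the number of similarity calls, which otherwise grows quadratically with the number of words.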

  

In [0]:

# ######################### NOT USED ###############################


# ### finding exact matching answer -if exists- given an arabic paragraph
# ## uses similarity function to compare translated answer with possible
# ## answer from the arabic paragraphs...

# ANS_SIM_THRESHOLD = 0.8

# def find_answer(ar_paragraph,translated_answer):
  
#   ## required to return
#   correct_answer = ''
#   answer_exist = True
#   answer_start_index = 0
#   sim_dic = {}
  
#   ## retrieve max similarity sentence with the answer
#   max_similarity = -1
#   similar_sentence = ''
#   ar_sentences = ar_paragraph.split('.')
  
#   for  s in ar_sentences:
#     #calculate similarity
#     siml = sim.similarity(translated_answer,s)
#     if siml > max_similarity:
#       similar_sentence = s
#       max_similarity = siml

#   ## retrieve exact similar word from similar_sentence 
#   max_sim_answer = -1
#   words = similar_sentence.split(' ')
  
#   for i in range(0,len(words)):
#     for j in range(i,len(words)):
      
#       temp_ans = ''
#       num_words = j-i+1
#       l = i 
#       for k in range(0,num_words):
#         temp_ans += words[l] + ' '
#         l += 1
      
#       siml = sim.similarity(translated_answer,temp_ans)
#       if temp_ans != ' ':
#         sim_dic[temp_ans]= siml
#       if siml > max_sim_answer:
#         max_sim_answer = siml
#         correct_answer = temp_ans
      
  
#   temp_correct_ans = correct_answer
#   if max_sim_answer < ANS_SIM_THRESHOLD:
#     answer_exist = False
  
#   else: 
    
#     # compare highest 3 similarity answers with difflib
#     sorted_x = []
#     for i in range(0,3):
#       sorted_x.append(max(sim_dic.items(), key=operator.itemgetter(1)))
#       sim_dic.pop(max(sim_dic.items(), key=operator.itemgetter(1))[0], None)
      
#     max_difflib = -1
#     for e in sorted_x:
   
#       if e[1] >= ANS_SIM_THRESHOLD:
#         difflib_sim = difflib_similarity(e[0],translated_answer)
#         if difflib_sim > max_difflib:
#           max_difflib = difflib_sim
#           correct_answer = e[0]
    
#     if max_difflib < 0.5:
#       correct_answer = temp_correct_ans
    
   
    
#     # retrieve answer start index
#     correct_answer = correct_answer.strip() 
#     answer_start_index = ar_paragraph.find(correct_answer)    
  
  
#   return  answer_exist, correct_answer, answer_start_index

  

### **Testing**

In [0]:
par = u"بيونسي جيزيل نولز-كارتر (من مواليد 4 سبتمبر، 1981)، المعروفة باسم بيونسي. ولدت ونشأت في هيوستن بولاية تكساس، هي مغنية وممثلة أميركية حائزة على 23 جائزة غرامي.غنت في مسابقات غناء ورقص مختلفة عندما كانت طفلة، أصبحت مشهورة في أواخر التسعينات كمغنية آر أند بي (رئيسية) للفرقة الغنائية النسائية دستنيز تشايلد. والتي أديرت من قِبل والدها ماثيو نولز، وأصبحت الفرقة واحدة من الأكثر مبيعاً في العالم من الفرق النسائية على الإطلاق. وقد شهد إنفصال الفرقة المؤقت صدور ألبوم بيونسي الأول Dangerously in Love دانجيروسلي إن لوف (2003)، والذي أنشأها بأن تكون فنان منفرد ناجح في العالم؛ بيعت منه 16 مليون نسخة، حصل على خمسة جوائز غرامي وتضمن الأغاني التي وصلت إلى قمة الرسم البياني الأمريكي بيلبورد هوت 100 كريزي إن لوف و بيبي بوي"


ans = u"المعروفة باسم بيونسي"

# ans = ""


ans_exist, correct_ans, index = find_answer(par,ans)

print(ans_exist)
print(correct_ans)
print(index)


# print(par[52],par[53],par[54],par[55])




correct:  المعروفة باسم بيونسي 
True
المعروفة باسم بيونسي
52


In [0]:
x = {"adel": 0.22222222222222222222222222222222222, "sandra": 0.444444444444444444444444444444, "rimon": 0.3222222222222222222222222222222222,"hamada":1.0}

# sorted_x = sorted(x.items(), key=operator.itemgetter(1), reverse=True)

# for e in sorted_x:
#   print(e)

  
#   max(x.items(), key=operator.itemgetter(1))

('hamada', 1.0)
('sandra', 0.4444444444444444)
('rimon', 0.32222222222222224)


## Save data into JSON File

In [0]:
def save_JSON(data):
  
  with open('./data/AAQAD-v1.0.json', 'w') as outfile:
    json.dump(data, outfile)
    
  return
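
By default `json.dump` escapes non-ASCII characters as `\uXXXX` sequences, so the Arabic text in the saved file is valid but not human-readable. The sketch below compares the default with `ensure_ascii=False` on a tiny sample record (illustrative data, not taken from the dataset file).

```python
import json

record = {"title": "بيونسيه", "paragraphs": []}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # Arabic characters kept as-is

print(escaped)
print(readable)
```

Both forms load back to the same data. When writing readable output to a file, it is safest to open the file with `encoding='utf-8'`.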

# Main Function [ Run Here ]

### **Code**

In [0]:
### download SQUAD dataset on colab (found in 'Files/data' section)

#training set 
!wget -qq https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -P ./data
  
#dev set
!wget -qq https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -P ./data
  

In [0]:
def AQQAD_generator():
  
  #### dataset dictionary declaration
  # Will be used to store the final JSON format of the dataset
  AAQAD_dic = {}
  AAQAD_dic["version"] = "v1.0"
  AAQAD_dic['data'] = []
  Question_ID = 1
  
  #### Load the SQuAD 2.0 training set
  with open('./data/train-v2.0.json') as json_file:  
    file = json.load(json_file)
    
  ## loop on each article...
  article_count = -1
  for article in file["data"][0:1]:
    article_count += 1
    print("##################################################")
    print("Article no ",article_count, " : ",article["title"])
        
    # boolean to check whether this is a valid article,
    # i.e. one with at least 1 valid paragraph
    valid_article = False

    #extract all arabic paragraphs(if exists) from article title
    arabic_page_exist, ar_title, ar_paragraphs = get_arabic_paragraphs(article["title"])

    if arabic_page_exist:
      print("Arabic page exits !")
      article_paragraphs = []

      article_initiazlized = False

      ## loop on each paragraph for the current article... 
      # (context section + QAS sections)
      parag_count = -1
      for parag in article["paragraphs"]: 
        parag_count += 1
        print("\nPragraph no: ",parag_count)
        
        #translate current english paragraph in SQUAD 2.0 into arabic
        translated_ar_paragraph = translate_to_arabic(parag["context"])

        ## find the most similar arabic paragraph for the translated paragraph
        # Similarity must be the maximum and above the threshold set in get_similar_ar_paragraph
        similar_parag_exist, correct_ar_parag = get_similar_ar_paragraph(translated_ar_paragraph,ar_paragraphs)
        if similar_parag_exist:
          #remove matched arabic paragraph from ar_list
          #ar_paragraphs.remove(correct_ar_parag)

          if article_initiazlized == False :
            AAQAD_dic['data'].append({'title' : ar_title,
                                      'paragraphs': []})
            article_initiazlized = True

          print("Paragraph no: ",parag_count," was found in arabic")
#             print(correct_ar_parag)

          ## loop on questions for the current paragraph
          print("Questions:")
          ques_count = -1
          # boolean to check for at least 1 valid question for the current paragraph
          valid_questions = False
          qas = []
          for ques in parag["qas"]:
            ques_count += 1
            print("\nquestion no: ",ques_count)

            # translate question to arabic
            translated_ques = translate_to_arabic(ques["question"])

            # find if first answer(or plausible answer) exist in the arabic matched paragraph
            print("Cheking Answer")

#             print("is impossible -> ",type(ques["is_impossible"]))
            if ques["is_impossible"] == False:
              translated_ans = translate_to_arabic(ques["answers"][0]["text"])
            else:
              translated_ans = translate_to_arabic(ques["plausible_answers"][0]["text"])

            #search for the existence of the correct answer in the arabic paragraph
            ans_exist, correct_ans, ans_index_start = find_answer(correct_ar_parag,translated_ans)
            if ans_exist:
              print("Answer exists for Question no: ",ques_count)
              if valid_questions == False:
                valid_questions = True

              # adding the valid answer to the appropriate dictionaries
              ans_details = {'text': correct_ans,
                            'answer_start':ans_index_start}
              if ques["is_impossible"] == False:
                qas.append({'question' : translated_ques,
                            'id': Question_ID,
                            'answers': [ans_details],
                            'is_impossible': False})                  
              else:
                qas.append({'plausible_answers':[ans_details],
                            'question' : translated_ques,
                            'id': Question_ID,
                            'answers': [],
                            'is_impossible': True})
              Question_ID += 1  
                     
            else:
              print("Answer do not exist for Question no: ",ques_count)
          
          if valid_questions == True:
                valid_article = True
                article_paragraphs.append({'qas': qas, 'context' : correct_ar_parag })
                

        else:
          print("Paragraph no: ",parag_count," was NOT found in arabic")

        print("--------------------------------------------------------------------")

        
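      ## add all paragraphs to the current article in AAQAD_dic
      # (the current article is always the last entry appended to 'data';
      #  indexing by article_count breaks once an earlier article is skipped)
      if valid_article == True:
        AAQAD_dic['data'][-1]['paragraphs'] = article_paragraphs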
    
    else:
      print("Arabic page do NOT exist in Arabic")


    print("##################################################")
  
  
  #### Write prepared dataset (dic) into a JSON file
  save_JSON(AAQAD_dic)
  
  return
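
Since `answer_start` is recovered with `str.find`, it is worth verifying the generated file before using it. The following is a minimal sketch of such a check, assuming the SQuAD-style layout built above; `check_answer_offsets` is a hypothetical helper and the sample data is made up for illustration.

```python
def check_answer_offsets(dataset):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # impossible questions keep their spans under 'plausible_answers'
                answers = qa["answers"] + qa.get("plausible_answers", [])
                for ans in answers:
                    start = ans["answer_start"]
                    end = start + len(ans["text"])
                    if start < 0 or context[start:end] != ans["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids

sample = {"data": [{"title": "t", "paragraphs": [{
    "context": "alpha beta gamma",
    "qas": [
        {"id": 1, "question": "q1", "is_impossible": False,
         "answers": [{"text": "beta", "answer_start": 6}]},
        {"id": 2, "question": "q2", "is_impossible": False,
         "answers": [{"text": "gamma", "answer_start": -1}]},
    ]}]}]}

print(check_answer_offsets(sample))
```

An empty result means every stored answer can be sliced out of its context at the recorded offset.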

### **Testing**

In [0]:
#testing 

AQQAD_generator()


##################################################
Article no  0  :  Beyoncé
Arabic page exits !

Pragraph no:  0
Paragraph no:  0  was NOT found in arabic
--------------------------------------------------------------------

Pragraph no:  1
Paragraph no:  1  was found in arabic
Questions:

question no:  0
Cheking Answer




Answer do not exist for Question no:  0

question no:  1
Cheking Answer
Answer do not exist for Question no:  1

question no:  2
Cheking Answer
Answer do not exist for Question no:  2

question no:  3
Cheking Answer
Answer exists for Question no:  3

question no:  4
Cheking Answer
Answer do not exist for Question no:  4

question no:  5
Cheking Answer
Answer do not exist for Question no:  5

question no:  6
Cheking Answer
Answer do not exist for Question no:  6

question no:  7
Cheking Answer
Answer do not exist for Question no:  7

question no:  8
Cheking Answer
Answer do not exist for Question no:  8

question no:  9
Cheking Answer
Answer exists for Question no:  9

question no:  10
Cheking Answer
Answer do not exist for Question no:  10

question no:  11
Cheking Answer
Answer do not exist for Question no:  11
--------------------------------------------------------------------

Pragraph no:  2
Paragraph no:  2  was found in arabic
Questions:

question no:  0
Cheking Answer
Answer do

In [0]:
# import json

# qas = []

# ans_details = {'text': '20 years old',
#                'answer_start':'50'}

# qas.append({'question' : 'how old are you',
#             'id': '1',
#             'answers': [ans_details],
#             'is_impossible':'false'})

# data = {}
# data['test'] = qas
# with open('./data/test.json', 'w') as outfile:
#     json.dump(data, outfile)


p = ['a','3','fd']

for e in p:
  print (p.index(e))

0
1
2
