# **Question and Answering with Transformers**
Question and answering is one of the most diverse and fast-moving areas of development in the world of transformers. In this notebook we will summarize a few of the key topics that we will later cover.

# **Types of Model**
**Open-Domain vs Reading Comprehension** - in QA we will often find that models will recieve a question, and (sometimes) extract an answer from a context. In the cases where this context is provided to the model alongside the question, eg:

{
    'question': 'What field of study has a variety of unusual contexts?',
    'context': 'The term "matter" is used throughout physics in a bewildering variety of contexts: for example, one refers to "condensed matter physics", "elementary matter", "partonic" matter, "dark" matter, "anti"-matter, "strange" matter, and "nuclear" matter. In discussions of matter and antimatter, normal matter has been referred to by Alfvén as koinomatter (Gk. common matter). It is fair to say that in physics, there is no broad consensus as to a general definition of matter, and the term "matter" usually is used in conjunction with a specifying modifier.'
}

# **we need to install transformers library for training Arabic transformer based model for Question and Answering**

In [1]:
! pip install transformers datasets huggingface_hub

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
[K     |████████████████████████████████| 4.2 MB 5.0 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.2.1-py3-none-any.whl (342 kB)
[K     |████████████████████████████████| 342 kB 49.9 MB/s 
[?25hCollecting huggingface_hub
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
[K     |████████████████████████████████| 84 kB 3.0 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 48.5 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 36.9 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.w

In [2]:
from google.colab import drive
import pandas as pd
drive.mount("/content/drive")

Mounted at /content/drive


**Lets open up the training data and confirm that is shares the same format as the training data.**

In [3]:
import os
import json

with open(os.path.join("/content/drive/MyDrive/Dataset/train.json"), 'rb') as tr, open(os.path.join("/content/drive/MyDrive/Dataset/test.json"), 'rb') as te:
    train_dataset = json.load(tr)
    val_dataset = json.load(te)

As before, the JSON structure contains a top-level 'data' key which contains a list of groups, where each group is a topic. We can take a look at a few examples from the start and end of the dataset.

In [4]:
train_dataset['data'][0]['paragraphs'][0]

{'context': 'السعودية أو (رسميًا: المملكة العربية السعودية) هي دولة عربية، وتعد أكبر دولة في الشرق الأوسط وتقع تحديدًا في الجنوب الغربي من قارة آسيا وتشكل الجزء الأكبر من شبه الجزيرة العربية إذ تبلغ مساحتها حوالي مليوني كيلومتر مربع.',
 'qas': [{'answers': [{'answer_start': 21,
     'text': 'المملكة العربية السعودية)'}],
   'id': '333772766499',
   'question': ' - أي دولة هي أكبر دولة في الشرق الأوسط؟ ال'},
  {'answers': [{'answer_start': 109, 'text': 'الجنوب الغربي'}],
   'id': '262204981583',
   'question': ' - أين تقع المملكة العربية السعودية في آسيا؟ ال'},
  {'answers': [{'answer_start': 194, 'text': 'حوالي مليوني كيلومتر مربع.'}],
   'id': '809283218984',
   'question': ' - ما هي مساحة الجزء الأكبر من شبه الجزيرة العربية؟ ال'}]}

In [5]:
val_dataset['data'][0]['paragraphs'][0]

{'context': 'حمزة بن عبد المطلب الهاشمي القرشي صحابي من صحابة رسول الإسلام محمد، وعمُّه وأخوه من الرضاعة وأحد وزرائه الأربعة عشر، وهو خير أعمامه لقوله: «خَيْرُ إِخْوَتِي عَلِيٌّ، وَخَيْرُ أَعْمَامِي حَمْزَةُ رَضِيَ اللَّهُ عَنْهُمَا».',
 'qas': [{'answers': [{'answer_start': 34,
     'text': 'صحابي من صحابة رسول الإسلام محمد، وعمُّه وأخوه من الرضاعة وأحد وزرائه الأربعة عشر،'}],
   'id': '621723207492',
   'question': 'من هو حمزة بن عبد المطلب؟'},
  {'answers': [{'answer_start': 166, 'text': 'وَخَيْرُ أَعْمَامِي'}],
   'id': '189105393656',
   'question': 'بما وصفه رسول الله؟'},
  {'answers': [{'answer_start': 139, 'text': '«خَيْرُ إِخْوَتِي عَلِيٌّ،'}],
   'id': '662616978980',
   'question': 'بما وصف رسول الله على ؟'}]}

In [6]:
train_dataset = train_dataset['data']
val_dataset = val_dataset['data']

We need to change the Dictionary file into a suitable file so that we will be able to train a model on it

In [7]:

context = {"Context":'',"answers":''}
train_dict = {}
qas = {}

for i in range(len(train_dataset)):
  # print(i)
  # print(train_dataset[i]['paragraphs'][0]['context'])
  # context["Context"].append(train_dataset[i]['paragraphs'][0]['context'])
  # context['answers'].append(train_dataset[i]['paragraphs'][0]['qas'][0]['answers'])
  if "Context" not in train_dict:
    train_dict["id"] = []
    train_dict["Context"] = []
    train_dict["answers"] = []
    train_dict["question"] = []

  train_dict["id"].append(train_dataset[i]['paragraphs'][0]['qas'][0]['id']) 
  train_dict["Context"].append(train_dataset[i]['paragraphs'][0]['context'])
  train_dict["answers"].append(train_dataset[i]['paragraphs'][0]['qas'][0]['answers'][0])
  train_dict["question"].append(train_dataset[i]['paragraphs'][0]['qas'][0]['question'])


In [8]:

context = {"Context":'',"answers":''}
test_dict = {}
qas = {}

for i in range(len(val_dataset)):
  # print(i)
  # print(train_dataset[i]['paragraphs'][0]['context'])
  # context["Context"].append(train_dataset[i]['paragraphs'][0]['context'])
  # context['answers'].append(train_dataset[i]['paragraphs'][0]['qas'][0]['answers'])
  if "Context" not in test_dict:
    test_dict["id"] = []
    test_dict["Context"] = []
    test_dict["answers"] = []
    test_dict["question"] = []

  test_dict["id"].append(str(val_dataset[i]['paragraphs'][0]['qas'][0]['id'])) 
  test_dict["Context"].append(val_dataset[i]['paragraphs'][0]['context'])
  test_dict["answers"].append(val_dataset[i]['paragraphs'][0]['qas'][0]['answers'][0])
  test_dict["question"].append(val_dataset[i]['paragraphs'][0]['qas'][0]['question'])


In [9]:
train_dict['Context'][0:5]

['السعودية أو (رسميًا: المملكة العربية السعودية) هي دولة عربية، وتعد أكبر دولة في الشرق الأوسط وتقع تحديدًا في الجنوب الغربي من قارة آسيا وتشكل الجزء الأكبر من شبه الجزيرة العربية إذ تبلغ مساحتها حوالي مليوني كيلومتر مربع.',
 'مِصرَ أو (رسمياً: جُمهورِيّةُ مِصرَ العَرَبيّةِ) هي دولة عربية تقع في الركن الشمالي الشرقي من قارة أفريقيا، ولديها امتداد آسيوي، حيث تقع شبه جزيرة سيناء داخل قارة آسيا فهي دولة عابرة للقارات، قُدّر عدد سكانها بـ104 مليون نسمة، ليكون ترتيبها الثالثة عشر بين دول العالم بعدد السكان والأكثر سكانا عربيا.',
 'أَبُو القَاسِم مُحَمَّد بنِ عَبد الله بنِ عَبدِ المُطَّلِب (22 أبريل 571 - 8 يونيو 632) يُؤمن المسلمون بأنَّه رسول الله إلى الإنس والجن؛ ليعيدهم إلى توحيد الله وعبادته شأنه شأن كل الأنبياء والمُرسَلين، وهو خاتمهم، وأُرسِل للنَّاس كافَّة، ويؤمنون أيضا بأنّه أشرف المخلوقات وسيّد البشر، كما يعتقدون فيه العِصمة.',
 'المَغْرِب أو (رسمياً: المَمْلَكَةُ المَغْرِبِيَّة)  (بالأمازيغية: ⵍⵎⵖⵔⵉⴱ ⵏ ⵜⴰⴳⵍⴷⵉⵜ: وتنطق لمغريب) هي دولة عربية تقع في أقصى غرب شمال أفريقيا عاصمتها الربا

In [10]:
train_dict['question'][0:5]

[' - أي دولة هي أكبر دولة في الشرق الأوسط؟ ال',
 ' - أين تقع مصر؟ ال',
 ' - من هو رسول الله؟ ال',
 ' - أين يقع المغرب؟ ال',
 'من اسس الدولة العثمانية؟']

In [11]:
train_dict['answers'][0:5]

[{'answer_start': 21, 'text': 'المملكة العربية السعودية)'},
 {'answer_start': 63, 'text': 'تقع في الركن الشمالي الشرقي من قارة أفريقيا،'},
 {'answer_start': 0,
  'text': 'أَبُو القَاسِم مُحَمَّد بنِ عَبد الله بنِ عَبدِ المُطَّلِب'},
 {'answer_start': 112, 'text': 'تقع في أقصى غرب شمال أفريقيا'},
 {'answer_start': 221, 'text': 'عثمان الأول بن أرطغرل،'}]

In [12]:
train_dict['answers'][0]['answer_start']

21

In [13]:
train_dict['id'][0:5]

['333772766499', '762977773921', '39922158955', '143079297992', '310583692508']

In [14]:
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModel.from_pretrained("asafaya/bert-base-arabic")

Downloading:   0%|          | 0.00/62.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/491 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/326k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/424M [00:00<?, ?B/s]

Some weights of the model checkpoint at asafaya/bert-base-arabic were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
import transformers

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [16]:
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128

In [17]:
from datasets import Dataset
my_dict = {"id": train_dict["id"],
           "context":train_dict['Context'],
           "question":train_dict['question'],
           "answers":train_dict['answers']}
datasets = Dataset.from_dict(my_dict)

my_dict_test = {"id": test_dict["id"],
           "context":test_dict['Context'],
           "question":test_dict['question'],
           "answers":test_dict['answers']}
datasets_test = Dataset.from_dict(my_dict_test)

In [18]:
datasets

Dataset({
    features: ['id', 'context', 'question', 'answers'],
    num_rows: 74
})

In [19]:
datasets_test

Dataset({
    features: ['id', 'context', 'question', 'answers'],
    num_rows: 71
})

In [20]:
for i, example in enumerate(datasets):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets[i]

In [21]:
len(tokenizer(example["question"], example["context"])["input_ids"])

88

In [22]:
len(
    tokenizer(
        example["question"],
        example["context"],
        max_length=max_length,
        truncation="only_second",
    )["input_ids"]
)

88

In [23]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
)

In [24]:
[len(x) for x in tokenized_example["input_ids"]]

[88]

In [25]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] ما هو اسم الباراسيتامول بالانجليزية ؟ [SEP] الپاراسیتامول ( بالانجليزية : paracetamol ) او الاسيتامينوفين ، وهو الاسم المعتمد في الولايات المتحدة ( بالانجليزية : acetaminophen ) ، او الخلنجول ( لفظ منحوت من خلــي نـشادري الـجــاوول ) او خلي نشادري الجاوول هو مسكن وخافض للحرارة واسع الاستخدام. [SEP]


In [26]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 2), (3, 5), (6, 9), (10, 15), (15, 18), (18, 21), (21, 23), (24, 35), (35, 36), (0, 0), (0, 2), (2, 3), (3, 6), (6, 7), (7, 8), (8, 11), (11, 13), (14, 15), (15, 26), (26, 27), (28, 31), (31, 34), (34, 36), (36, 37), (37, 39), (39, 40), (41, 43), (44, 49), (49, 52), (52, 55), (55, 58), (58, 59), (60, 63), (64, 69), (70, 77), (78, 80), (81, 89), (90, 97), (98, 99), (99, 110), (110, 111), (112, 114), (114, 116), (116, 118), (118, 121), (121, 123), (123, 125), (125, 126), (126, 127), (128, 130), (131, 136), (137, 140), (141, 143), (144, 145), (145, 148), (149, 152), (152, 154), (155, 157), (158, 161), (163, 166), (168, 169), (170, 172), (173, 177), (178, 179), (182, 185), (185, 186), (187, 190), (190, 194), (194, 195), (196, 198), (199, 205), (208, 213), (213, 218), (221, 224), (225, 227), (228, 230), (233, 235), (236, 240), (241, 243), (243, 245), (245, 246), (247, 250), (250, 254), (255, 259), (260, 269), (269, 270), (0, 0)]


In [27]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(
    tokenizer.convert_ids_to_tokens([first_token_id])[0],
    example["question"][offsets[0] : offsets[1]],
)

ما ما


In [28]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


In [29]:
answers = example["answers"]
print(answers)

{'answer_start': 28, 'text': 'Paracetamol)'}


In [30]:
answers = example["answers"]
start_char = answers["answer_start"]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (
    offsets[token_start_index][0] <= start_char
    and offsets[token_end_index][1] >= end_char
):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while (
        token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char
    ):
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

21 21


In [31]:
print(
    tokenizer.decode(
        tokenized_example["input_ids"][0][start_position : end_position + 1]
    )
)
print(answers["text"][0])

par
P


In [32]:

pad_on_right = tokenizer.padding_side == "right"

In [33]:
def prepare_train_features(examples):
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized_examples.sequence_ids(i)

        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        if answers["answer_start"] == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [34]:
features = prepare_train_features(datasets[:5])
features = prepare_train_features(datasets_test[:5])

In [35]:
tokenized_datasets = datasets.map(
    prepare_train_features, batched=True, remove_columns=datasets.column_names
)
tokenized_datasets_test = datasets_test.map(
    prepare_train_features, batched=True, remove_columns=datasets.column_names
)


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

## Building the model

In [36]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("asafaya/bert-base-arabic")

Downloading:   0%|          | 0.00/520M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at asafaya/bert-base-arabic and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
push_to_hub_model_id = "ics-arabert-qa"
learning_rate = 2e-5
num_train_epochs = 100
weight_decay = 0.01
batch_size = 8

In [89]:
len(tokenized_datasets['input_ids'])

74

In [90]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

In [91]:
train_set = tokenized_datasets.to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
validation_set = tokenized_datasets_test.to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [92]:
from transformers import create_optimizer

total_train_steps = (len(tokenized_datasets) // batch_size) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=learning_rate, num_warmup_steps=0, num_train_steps=total_train_steps
)

In [93]:

import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [94]:

from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard

# push_to_hub_callback = PushToHubCallback(
#     output_dir="./qa_model_save",
#     tokenizer=tokenizer,
#     hub_model_id=push_to_hub_model_id,
# )

# tensorboard_callback = TensorBoard(log_dir="./qa_model_save/logs")

# callbacks = [tensorboard_callback, push_to_hub_callback]

model.fit(
    train_set,
    validation_data=validation_set,
    epochs=num_train_epochs,
    # callbacks=callbacks,
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7fd5305d6fd0>

In [95]:
model.save_pretrained('./final_ics472-aqa-project')

##Evaluation

In [45]:
batch = next(iter(validation_set))
output = model.predict_on_batch(batch)
output.keys()


odict_keys(['loss', 'start_logits', 'end_logits'])

In [46]:
output.start_logits.shape, output.end_logits.shape

((8, 384), (8, 384))

In [47]:
import numpy as np

np.argmax(output.start_logits, -1), np.argmax(output.end_logits, -1)

(array([ 0, 10, 71,  0, 59, 50, 19, 11]),
 array([ 0, 51,  0, 54,  8,  0, 19, 21]))

In [48]:
n_best_size = 20

In [49]:
import numpy as np

start_logits = output.start_logits[0]
end_logits = output.end_logits[0]

start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if (
            start_index <= end_index
        ):  
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "",  
                }
            )

In [50]:
def prepare_validation_features(examples):

    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )


    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):

        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [51]:
validation_features = datasets_test.map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets_test.column_names,
)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [52]:
validation_dataset = validation_features.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [53]:
raw_predictions = model.predict(validation_dataset)

In [54]:
max_answer_length = 30

In [55]:
start_logits = output.start_logits[0]
end_logits = output.end_logits[0]
offset_mapping = validation_features[0]["offset_mapping"]

context = datasets_test[0]["context"]

start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:

        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue

        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if (
            start_index <= end_index
        ):  
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char:end_char],
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[
    :n_best_size
]
valid_answers

[{'score': 3.926443, 'text': 'ابي من صحابة'},
 {'score': 3.8427482, 'text': 'ابي من صحابة رسول الإسلام محمد، وعمُّه وأخ'},
 {'score': 3.822568, 'text': 'صحابي من صحابة'},
 {'score': 3.8190947, 'text': 'ابي'},
 {'score': 3.7388732, 'text': 'صحابي من صحابة رسول الإسلام محمد، وعمُّه وأخ'},
 {'score': 3.7152195, 'text': 'صحابي'},
 {'score': 3.6859488,
  'text': ': «خَيْرُ إِخْوَتِي عَلِيٌّ، وَخَيْرُ أَعْمَامِي حَمْزَةُ رَضِيَ اللَّه'},
 {'score': 3.657933,
  'text': 'خير أعمامه لقوله: «خَيْرُ إِخْوَتِي عَلِيٌّ، وَخَيْرُ أَعْمَامِي حَمْزَةُ رَضِيَ اللَّه'},
 {'score': 3.6548944, 'text': 'صح'},
 {'score': 3.626093, 'text': 'ابي من صحابة رسول الإسلام محمد، وعمُّه'},
 {'score': 3.6230273,
  'text': 'ابي من صحابة رسول الإسلام محمد، وعمُّه وأخوه من الرضاعة وأحد وزرائه الأربعة عشر، وهو خير'},
 {'score': 3.6218736, 'text': 'خير'},
 {'score': 3.5765266, 'text': 'ابي من صحابة رسول'},
 {'score': 3.527875, 'text': 'ابي من'},
 {'score': 3.522218, 'text': 'صحابي من صحابة رسول الإسلام محمد، وعمُّه'},
 {'

In [56]:
datasets_test[0]["answers"]

{'answer_start': 34,
 'text': 'صحابي من صحابة رسول الإسلام محمد، وعمُّه وأخوه من الرضاعة وأحد وزرائه الأربعة عشر،'}

In [60]:
import collections
squad_v2 = False
examples = datasets_test
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

In [61]:
from tqdm.auto import tqdm


def postprocess_qa_predictions(
    examples,
    features,
    all_start_logits,
    all_end_logits,
    n_best_size=20,
    max_answer_length=30,
):

    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)


    predictions = collections.OrderedDict()


    print(
        f"Post-processing {len(examples)} example predictions split into {len(features)} features."
    )


    for example_index, example in enumerate(tqdm(examples)):

        feature_indices = features_per_example[example_index]

        min_null_score = None  
        valid_answers = []

        context = example["context"]

        for feature_index in feature_indices:

            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]

            offset_mapping = features[feature_index]["offset_mapping"]

            cls_index = features[feature_index]["input_ids"].index(
                tokenizer.cls_token_id
            )
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            start_indexes = np.argsort(start_logits)[
                -1 : -n_best_size - 1 : -1
            ].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:

                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue

                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[
                0
            ]
        else:

            best_answer = {"text": "", "score": 0.0}

        
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = (
                best_answer["text"] if best_answer["score"] > min_null_score else ""
            )
            predictions[example["id"]] = answer

    return predictions

In [62]:
final_predictions = postprocess_qa_predictions(
    datasets_test,
    validation_features,
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
)
print(final_predictions.items())


Post-processing 71 example predictions split into 73 features.


  0%|          | 0/71 [00:00<?, ?it/s]

odict_items([('621723207492', 'صحابي من صحابة رسول الإسلام محمد، وعمُّه وأخوه من الرضاعة وأحد وزرائه الأربعة عشر، وهو'), ('809758390695', 'بالإضافة إلى أنه خامس أكبر قمرٍ طبيعيٍ في المجموعة الشمسية. فهو'), ('147293444932', 'مدينة مقدسة لدى المسلمين، بها المسجد الحرام، والكعبة التي تعد قبلة المسلمين في صلاتهم. تقع'), ('990778242153', 'الأبراج'), ('415269827302', '(2'), ('784255729308', 'طيف التوحد أو مختلف اضطرابات النمو المتفشية، هو اضطراب النمو العصبي الذي'), ('124880607612', 'Manchester United Football Club) ويعرف'), ('550131195698', 'شاعرعربي من مكانة رفيعة، بَرز في فترةِ الجاهلية، ويُعد'), ('361673467364', '" وهي'), ('25304657387', 'إنْكِلْترا (بالإنجليزية: Eng'), ('517594478761', '(1165 - 1227م) . وهو'), ('538945350605', '(ميلاد: 6 نوفمبر 1494 بطرابزون، وفاة: 7 سبتمبر 1566 بسيكتوار)،'), ('168251134990', 'مراكز أو أقسام أو مراكز وأقسام معاً، المراكز الإدارية توجد في المحافظات التي بها ريف، وين'), ('4664484146', '1945 في مدينة'), ('507285299079', 'هي جمهورية فيدرالية وبلد غير ساحلي 

In [65]:
from datasets import load_metric
metric = load_metric("squad_v2" if squad_v2 else "squad")


In [95]:
if squad_v2:
    formatted_predictions = [
        {"id": k, "prediction_text": v, "no_answer_probability": 0.0}
        for k, v in final_predictions.items()
    ]
else:
    formatted_predictions = [
        {"id": k, "prediction_text": v} for k, v in final_predictions.items()
    ]
references = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in datasets_test
]
metric.compute(predictions=formatted_predictions, references=references)

In [None]:
print(datasets_test['answers'])

In [None]:
from transformers import TFAutoModelForQuestionAnswering

tess = TFAutoModelForQuestionAnswering.from_pretrained("./final_ics472-aqa-project")

In [None]:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = TFAutoModelForQuestionAnswering.from_pretrained("./ics472-aqa-project")

text = r"""

تقع سويسرا في قلب القارة الأوروبية وتحيط بها خمس دول، وهي ألمانيا من الشمال وإيطاليا من الجنوب والنمسا وإمارة ليختنشتاين من الشرق وفرنسا من الغرب، وليست لها منافذ بحرية وتبلغ مساحتها حوالي 41300 كيلومترا مربعا. وتتكون سويسرا من ثلاث مناطق جغرافية وهي: سلسلة جبال الألب، التي تمتد في الجنوب وتغطي حوالي ثلثي مساحة البلاد ويبلغ ارتفاع أعلى قممها "بونتا دوفور" 4638م. ثم هناك سلسلة جبال جورا والتي تمتد على شكل هلال في غرب وشمال البلاد، وتمثل الحد الفاصل بين سويسرا وفرنسا وتغطي نحو 12٪ من المساحة الكلية، ويبلغ ارتفاع أعلى قممها "كريت دو لا نيج" 1718 م، وبين هاتين المجموعتين من السلاسل الجبلية، تمتد منطقة الهضبة السهلية التي تضم معظم المدن والقرى السويسرية.
"""

questions = [
      "كم دولة على الحدود مع سويسرا"
      ,"كم دولة تحيط بسويسرا؟"
      ,"ما مساحتها"
      ,"ما هي المساحة الاجمالية لسويسرا؟"
      ,"أي دولة تقع جنوب سويسرا؟"
      ,"ما هي الدول الواقعة شرق سويسرا؟"
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]

    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

In [68]:
ve = []
ke = []
for k, v in final_predictions.items():
  ke = k
  ve = v

In [82]:
print(type(ke[0]))

<class 'str'>
