**This code shows how to use `transformers` to fine-tune the extractive QA model `ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA`.**

**This code is an adaptation of https://github.com/zeyadahmed10/Arabic-MRC by Zeyad Ahmed.**




Dataset: CGSQuAD

Dataset format: CSV


* Dataset file name: CGSQuAD.csv
* Necessary columns: title, context, question, answer, answer_start, is_impossible, count, ID
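
Before running the pipeline, it can help to confirm that the CSV header carries the columns listed above. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration and are not part of the original notebook, and the in-memory sample stands in for the real file.

```python
import csv
import io

# Column names taken from the "Necessary columns" list.
REQUIRED_COLUMNS = ["title", "context", "question", "answer",
                    "answer_start", "is_impossible", "count", "ID"]

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]

# Toy in-memory header standing in for CGSQuAD.csv (hypothetical content).
sample = io.StringIO(
    "title,context,question,answer,answer_start,is_impossible,count\n")
print(missing_columns(sample))  # the ID column is missing in this sample
```

For the real file, the same function could be called on `open('/content/CGSQuAD.csv', newline='', encoding='utf-8')`; the encoding is an assumption.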





#Install dependencies

In [None]:
!pip install transformers
!pip install preprocess
!pip install arabert
!pip install datasets



In [None]:
#!rm -r Arabic-MRC
!git clone https://github.com/SubaieiFatemah/Arabic-MRC/

fatal: destination path 'Arabic-MRC' already exists and is not an empty directory.


#Apply dataset script on CGSQuAD

In [None]:
from arabert.preprocess import ArabertPreprocessor
import pandas as pd
data = pd.read_csv('/content/CGSQuAD.csv')
data

Unnamed: 0,title,context,question,answer,answer_start,is_impossible,count,ID
0,تنظيم كلية الدراسات العليا,رسالة كلية الدراسات العليا هي العمل المخطط اله...,ما هي رسالة كلية الدراسات العليا؟,العمل المخطط الهادف الى المساهمة في تنمية إمكا...,30,False,1,56be85543aeaaa14008c9063
1,تنظيم كلية الدراسات العليا,رسالة كلية الدراسات العليا هي العمل المخطط اله...,ما هي غاية كلية الدراسات العليا؟,اتاحة فرص تعليم,159,False,2,56be85543aeaaa14008c9065
2,تنظيم كلية الدراسات العليا,رسالة كلية الدراسات العليا هي العمل المخطط اله...,ما هي أهداف كلية الدراسات العليا؟,اتاحة فرص تعليم,159,False,3,56be85543aeaaa14008c9066
3,تنظيم كلية الدراسات العليا,رسالة كلية الدراسات العليا هي العمل المخطط اله...,ما هي مهام كلية الدراسات العليا؟,الموافقة على برامج الدراسات العليا ووضع الأنظمة,278,False,4,56bf6b0f3aeaaa14008c9601
4,تنظيم كلية الدراسات العليا,رسالة كلية الدراسات العليا هي العمل المخطط اله...,ما هي واجبات كلية الدراسات العليا؟,الموافقة على برامج الدراسات العليا ووضع الأنظمة,278,False,5,56bf6b0f3aeaaa14008c9602
...,...,...,...,...,...,...,...,...
1499,أسئلة شائعة,يكون موعد التقديم على كلية الدراسات العليا عاد...,ما هو عنوان البريد؟,اسم المنطقة التي يتواجد بها مكتب البريد,12310,False,1500,56cc5fd66d243a140015ef53
1500,أسئلة شائعة,يكون موعد التقديم على كلية الدراسات العليا عاد...,ما هو العنوان البريدي؟,اسم المنطقة التي يتواجد بها مكتب البريد,12310,False,1501,56cc5fd66d243a140015ef54
1501,أسئلة شائعة,يكون موعد التقديم على كلية الدراسات العليا عاد...,ما هو الرمز البريدي؟,رقم خاص يرمز الى بريد المنطقة,12368,False,1502,56cccb3c62d2951400fa64be
1502,أسئلة شائعة,يكون موعد التقديم على كلية الدراسات العليا عاد...,ما معنى الرمز البريدي؟,رقم خاص يرمز الى بريد المنطقة,12368,False,1503,56cccb3c62d2951400fa64bf


Shorten each context to the sentence that contains the answer plus up to n sentences before and after it (n = 5 by default).

In [None]:
import pandas as pd
import re
import heapq
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

def extract_relevant_sentences(text, answer, answer_start, n_sentences=5):
    # answer_start is accepted but not used: the answer is located by
    # searching each sentence for its text.
    sentences = text.split('. ')  # use '. ' as the sentence delimiter

    # Default to the full context if no single sentence contains the answer
    start_index = 0
    end_index = len(sentences)

    for i, sentence in enumerate(sentences):
        if answer.lower() in sentence.lower():
            # Keep up to n_sentences on each side of the matching sentence
            start_index = max(0, i - n_sentences)
            end_index = min(len(sentences), i + n_sentences + 1)
            break

    # Rebuild the context from the selected sentences
    new_context = '. '.join(sentences[start_index:end_index])

    if not new_context.endswith('.'):  # restore the final period
        new_context += '.'

    return new_context

data['context'] = data.apply(lambda row: extract_relevant_sentences(row['context'], row['answer'], row['answer_start']), axis=1)#for each context, call the function
data.to_csv("modified_csv_file.csv", index=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
data['context'][1000]

'اسكان طلبة الدراسات العليا دون مقابل او رسوم. شروط اسكان طلبة الدراسات العليا ان يكون الطالب مسجلا بدوام كامل ولا تقيم اسرته في الكويت. نعم يجوز اسكان طلبة الدراسات العليا ممن تقيم اسرهم داخل الكويت في ضوء دراسة حالتهم الاجتماعية. شروط حضور المهمات العلمية لطالب الدراسات العليا ان يكون مسجلا ومستمرا كطالب نظامي بدوام كامل وان لا يقل معدله المتوسط عن 3.5 نقاط وقد اجتاز 15 وحدة دراسية على الاقل وبتوصية من مشرفه. عدد المهمات العلمية التي يستطيع الطالب حضورها هي مهمة واحدة فقط خلال مدة دراسته. نفقات المهمة العلمية هي نفقات الاشتراك وتذاكر سفر على الدرجة السياحية وبدل سفر يومي قدره 30 دينار كويتي. تسري احكام اللائحة اعتبارا من الفصل الاول للعام الجامعي التالي لاعتمادها من مجلس الجامعة. قيمة اشراف ومناقشة المشرف على المشروع هي 100 دينار كويتي للاشراف و100 دينار كويتي للمناقشة. قيمة مناقشة رئيس اللجنة على المشروع هي 100 دينار كويتي. قيمة مناقشة المناقش على المشروع هي 100 دينار كويتي. قيمة مكافأة اشراف ومناقشة المشرف الرئيس على الاطروحة هي 400 دينار كويتي للاشراف و100 دينار كويتي للمناقشة.'
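
The effect of the sentence window is easier to see on a small input. The sketch below re-implements the same idea in a hypothetical helper, `sentence_window`, on a toy English string (both are assumptions for illustration). It also shows that the answer's character offset changes after trimming, which is why the stored `answer_start` cannot be reused as-is; later in this notebook `arabert_preprocess` recomputes it with `context.find(...)`.

```python
def sentence_window(text, answer, n_sentences=1):
    """Keep the sentence containing `answer` plus up to n neighbours per side."""
    sentences = text.split('. ')
    for i, sentence in enumerate(sentences):
        if answer.lower() in sentence.lower():
            start = max(0, i - n_sentences)
            end = min(len(sentences), i + n_sentences + 1)
            break
    else:
        # Answer not found in any single sentence: keep the full text.
        start, end = 0, len(sentences)
    window = '. '.join(sentences[start:end])
    return window if window.endswith('.') else window + '.'

text = "A one. B two. C three. D four. E five."
answer = "three"
trimmed = sentence_window(text, answer, n_sentences=1)

print(trimmed)               # B two. C three. D four.
print(text.find(answer))     # offset in the original context
print(trimmed.find(answer))  # offset in the trimmed context
```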

In [None]:
# The context-trimmed CGSQuAD data is split into /content/cgsqa-train.json, /content/cgsqa-val.json, and /content/cgsqa-test.json
!python /content/Arabic-MRC/Translator/translation2dataset.py /content/modified_csv_file.csv

(1504, 8)
Index(['title', 'context', 'question', 'answer', 'answer_start',
       'is_impossible', 'count', 'ID'],
      dtype='object')
(1052, 8) (226, 8) (226, 8)
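
The exact output of `translation2dataset.py` is not shown here, but the reader function later in this notebook walks `data` → `paragraphs` → `qas` → `answers`, so the JSON files presumably follow a SQuAD-style nesting. The sketch below builds one such record from a single row; `row_to_squad` and the sample strings are hypothetical, and the `title` key is an assumption borrowed from the SQuAD layout.

```python
import json

def row_to_squad(title, context, question, answer, answer_start, qa_id):
    """Wrap one QA row in the nested layout that the reader traverses."""
    return {
        "title": title,  # assumed key; the reader does not access it
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": str(qa_id),  # the reader converts this with int(...)
                "question": question,
                "answers": [{"text": answer, "answer_start": answer_start}],
            }],
        }],
    }

context = "The graduate college offers study opportunities."
answer = "study opportunities"
article = row_to_squad("Sample title", context,
                       "What does the college offer?",
                       answer, context.find(answer), 1)
dataset = {"data": [article]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```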


#Preprocess CGSQuAD

In [None]:
import os
import shutil
from collections import Counter
import numpy as np
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, ElectraForQuestionAnswering, DataCollatorWithPadding,BertModel, ElectraForSequenceClassification, ElectraModel
from arabert.preprocess import ArabertPreprocessor
import matplotlib.pyplot as plt
import seaborn as sns
import csv
torch.manual_seed(3407)

<torch._C.Generator at 0x7c5950f1aeb0>

In [None]:
def add_end_index(answer, context):
  # Sets answer['answer_end']; returns False if the span matches the context, True otherwise
  text = answer['text']
  start_idx = answer['answer_start']
  end_idx = start_idx + len(text)
  answer['answer_end'] = end_idx
  if text == context[start_idx:end_idx]:
    return False
  # The annotated start may be one or two characters too far right; try shifting left by i
  for i in range(1,3):
    if text == context[start_idx-i:end_idx-i]:
      answer['answer_end'] = end_idx-i
      answer['answer_start'] = start_idx-i
      return False
  return True
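
The span check can be illustrated in isolation. The sketch below is a hypothetical variant, `align_span`, that returns the corrected `(start, end)` pair instead of mutating a dictionary; the English context and the `max_shift` parameter are assumptions for illustration.

```python
def align_span(text, context, start, max_shift=2):
    """Return (start, end) of `text` in `context`, trying small left shifts.

    Returns None when no shift in 0..max_shift makes the span match.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        if s < 0:
            break  # never index before the beginning of the context
        if context[s:s + len(text)] == text:
            return s, s + len(text)
    return None

context = "Graduate students may attend one conference."
true_start = context.find("one")

print(align_span("one", context, true_start))      # exact offset
print(align_span("one", context, true_start + 2))  # offset two characters too far right
print(align_span("two", context, 5))               # text absent: no match
```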

In [None]:
def arabert_preprocess(context,question, answer, arabert_prep):
    answer['text'] = arabert_prep.preprocess(answer['text'])
    context = arabert_prep.preprocess(context)
    question = arabert_prep.preprocess(question)
    res = context.find(answer['text'])
    if res !=-1:
        answer['answer_start'] = res
    return context, question, answer, res

In [None]:
def Read_Dataset(path,arabert_prep):
  contexts =[]
  answers =[]
  questions =[]
  IDs= []
  cnt = 0
  with open(path) as f:
    my_dict = json.load(f)
    for article in my_dict['data']:
      for passage in article['paragraphs']:
        context = passage['context']
        for qa in passage['qas']:
          question = qa['question']
          access = 'answers'
          for answer in qa[access]:
            context,question, answer, res =  arabert_preprocess(context,question, answer, arabert_prep)
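            flag = add_end_index(answer, context) # True when the answer span does not match the context
            cnt = cnt + flag # count mismatched spans
            flag = False # keep every example, including mismatched ones
            if not flag: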
              contexts.append(context)
              answers.append(answer)
              questions.append(question)
              IDs.append(int(qa['id']))
  return contexts,questions,answers,IDs

In [None]:
model_name = "araelectra-base-discriminator"
arabert_prep = ArabertPreprocessor(model_name=model_name)#custom class
cgsqa_span_train_contexts, cgsqa_span_train_questions, cgsqa_span_train_answers, cgsqa_span_train_ids = Read_Dataset('/content/cgsqa-train.json', arabert_prep)
cgsqa_val_contexts, cgsqa_val_questions, cgsqa_val_answers, cgsqa_val_ids = Read_Dataset('/content/cgsqa-val.json', arabert_prep)
cgsqa_test_contexts, cgsqa_test_questions, cgsqa_test_answers, cgsqa_test_ids = Read_Dataset('/content/cgsqa-test.json', arabert_prep)

In [None]:
len(cgsqa_test_answers)+len(cgsqa_span_train_answers)+len(cgsqa_val_answers)

1504

#Apply tokenization

In [None]:
#Creating the tokenizer. HF class
model_name =  "aubmindlab/araelectra-base-discriminator"
araelectra_tokenizer = AutoTokenizer.from_pretrained(model_name,do_lower_case=False)
span_train_encodings = araelectra_tokenizer(cgsqa_span_train_questions, cgsqa_span_train_contexts, truncation=True)#, return_offsets_mapping=True
val_encodings = araelectra_tokenizer(cgsqa_val_questions, cgsqa_val_contexts, truncation=True, return_offsets_mapping=True)
test_encodings = araelectra_tokenizer(cgsqa_test_questions, cgsqa_test_contexts, truncation=True,  return_offsets_mapping=True)

In [None]:
val_offset = val_encodings['offset_mapping']
del val_encodings['offset_mapping']

test_offset = test_encodings['offset_mapping']
del test_encodings['offset_mapping']

In [None]:
val_ids_to_idx = {k:i for i,k in enumerate(cgsqa_val_ids)}
test_ids_to_idx = {k:i for i,k in enumerate(cgsqa_test_ids)}

In [None]:
def index_to_token_position(encodings , answers):
  start_positions = list()
  end_positions = list()
  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start'], 1))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'], 1))
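    # If the answer start was truncated away, fall back to model_max_length
    if start_positions[-1] is None:
      start_positions[-1] = araelectra_tokenizer.model_max_length
    # answer_end is exclusive and may land on a space; step back until a token is found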
    itt = 1
    while end_positions[-1] is None:
      end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end']-itt, 1)
      itt = itt + 1
  encodings.update({'start_positions': torch.tensor(start_positions), 'end_positions': torch.tensor(end_positions)})
  encodings['start_positions'] = encodings['start_positions'].view(len(answers), 1)
  encodings['end_positions'] = encodings['end_positions'].view(len(answers), 1)
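
The step-back loop exists because the end offset is exclusive and may point at whitespace, which belongs to no token. The sketch below is a pure-Python stand-in that mimics the lookup with a list of `(start, end)` offsets; it is not the Hugging Face implementation, and the function names and offsets are hypothetical.

```python
def char_to_token(offsets, char_index):
    """Return the index of the token whose [start, end) span covers char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # whitespace, or past the (possibly truncated) text

def end_token(offsets, answer_end):
    """Walk back from an exclusive end offset until a token is found."""
    position = answer_end
    while position >= 0:
        token = char_to_token(offsets, position)
        if token is not None:
            return token
        position -= 1
    return None  # no token at or before answer_end

# Hypothetical offsets for the context "one task only", split on whitespace.
offsets = [(0, 3), (4, 8), (9, 13)]
answer_start, answer_end = 4, 8  # the span "task"; index 8 is the space after it

print(char_to_token(offsets, answer_start))  # 1
print(char_to_token(offsets, answer_end))    # None
print(end_token(offsets, answer_end))        # 1
```

Unlike an unbounded `while` loop, this version stops at position zero, so an answer that was truncated away yields `None` instead of looping.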

In [None]:
index_to_token_position(span_train_encodings, cgsqa_span_train_answers)
index_to_token_position(val_encodings, cgsqa_val_answers)
index_to_token_position(test_encodings, cgsqa_test_answers)

In [None]:
val_encodings['IDs'] = cgsqa_val_ids
test_encodings['IDs'] = cgsqa_test_ids

In [None]:
print(val_encodings.keys())
print(test_encodings.keys())
print(span_train_encodings.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions', 'IDs'])
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions', 'IDs'])
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])


#Use data loader

In [None]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

In [None]:
class cgsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

span_train_dataset = cgsDataset(span_train_encodings)
val_dataset = cgsDataset(val_encodings)
test_dataset = cgsDataset(test_encodings)

In [None]:
data_collator = DataCollatorWithPadding(araelectra_tokenizer)

In [None]:
span_train_loader = DataLoader(span_train_dataset, batch_size=8, shuffle= True, collate_fn = data_collator)
val_loader = DataLoader(val_dataset, batch_size =8, shuffle = True, collate_fn = data_collator)
test_loader = DataLoader(test_dataset, batch_size = 8, shuffle = True, collate_fn = data_collator)
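
Because the encodings were created without padding, sequences in a batch have different lengths, and the collator pads each batch to its longest member. The sketch below is a pure-Python stand-in for that behaviour, not the library code; `pad_batch` and the token ids are hypothetical, and `pad_id=0` is assumed to match the `padding_idx=0` reported in the model printout.

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * gap)
        # Padded positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
    return batch

features = [
    {"input_ids": [2, 15, 27, 3], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [2, 41, 3], "attention_mask": [1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[2, 15, 27, 3], [2, 41, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```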

#Fine-tune and validate the model on CGSQuAD

In [None]:
model_name1="ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA"

In [None]:
QA_AraElectra = ElectraForQuestionAnswering.from_pretrained(model_name1)

In [None]:
def get_raw_preds(data_loader, model,ids_to_index,offset,contexts, max_answer_length, n_best_size):
  model.eval()
  imd_predictions,script_predictions = dict(), dict()
  with torch.no_grad():
    total_loss = 0.0
    total_predictions = dict()
    #no_probs_pred = dict()
    loop = tqdm(data_loader, leave=True)
    for batch_idx, batch in enumerate(loop):
      tokens = batch['input_ids'].to(device)
      masks = batch['attention_mask'].to(device)
      tokens_type = batch['token_type_ids'].to(device)
      gt_start = batch['start_positions'].to(device)
      gt_end = batch['end_positions'].to(device)
      IDs = batch['IDs'].to(device)
      outputs = model(tokens, masks, tokens_type, start_positions=gt_start, end_positions=gt_end)
      #calculating loss
      loss = outputs.loss
      #update average total loss
      total_loss = total_loss + ((1 / (batch_idx + 1)) * (loss.item() - total_loss))
      #calculating f1 score and EM
      #curr_batch_size = tokens.shape[0]
      post_raw_preds(IDs, outputs.start_logits, outputs.end_logits, ids_to_index, offset, contexts,max_answer_length, n_best_size, imd_predictions, script_predictions )
    model.train()
    return imd_predictions,script_predictions
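
The `total_loss` update is an incremental mean: after batch k (counting from 1), the new mean is the old mean plus (loss − old mean) / k, which avoids storing every batch loss. A minimal sketch with hypothetical loss values shows that it agrees with the ordinary average; `running_mean` is a name introduced here for illustration.

```python
def running_mean(values):
    """Incrementally average values with the same update used for total_loss."""
    mean = 0.0
    for index, value in enumerate(values):
        mean = mean + (1 / (index + 1)) * (value - mean)
    return mean

losses = [1.5, 0.5, 1.0, 3.0]  # hypothetical per-batch losses
print(running_mean(losses))
print(sum(losses) / len(losses))
```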

In [None]:
def post_raw_preds(IDs, total_start_logits, total_end_logits,ids_to_index,offset,contexts, max_answer_length, n_best_size,
 imd_predictions,script_predictions ):
    total_start_logits = total_start_logits.cpu().numpy()
    total_end_logits = total_end_logits.cpu().numpy()
    IDs = IDs.cpu().numpy()
    for i in range(IDs.shape[0]):
        offset_mapping = offset[ids_to_index[IDs[i].squeeze()]]
        # Each ID is mapped back to its example index so that the offsets and
        # context of the matching example are used.
        context = contexts[ids_to_index[IDs[i].squeeze()]]
        start_logits = total_start_logits[i]
        end_logits = total_end_logits[i]
        # Gather the indices the best start/end logits:
        start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
        end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
        valid_answers = []
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                # to part of the input_ids that are not in the context.
                if (
                    start_index >= len(offset_mapping)
                    or end_index >= len(offset_mapping)
                    or offset_mapping[start_index] is None
                    or offset_mapping[end_index] is None
                ):
                    continue
                # Don't consider answers with a length that is either < 0 or > max_answer_length.
                if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                    continue
                if start_index <= end_index: # Note: this does not yet check that the span lies inside the context rather than the question
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        if len(valid_answers) ==0:
            valid_answers.append({"text":"", "score":""})

        valid_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        imd_predictions[str(IDs[i].squeeze())] = valid_answer
        script_predictions[str(IDs[i].squeeze())] = valid_answer['text']
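
The core of the post-processing is a search over the top-n start and end positions for the best-scoring valid pair. The sketch below isolates that step in plain Python, without the offset lookup; `best_span` and the logits are hypothetical, and the small `n_best` and `max_answer_length` values are chosen only to keep the example readable.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_length=3):
    """Pick the highest-scoring (start, end) pair among the top-n candidates."""
    def top(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            if e < s or e - s + 1 > max_answer_length:
                continue  # reversed or overly long span
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best  # (score, start_index, end_index), or None if nothing is valid

# Hypothetical logits for a five-token input.
start_logits = [0.1, 4.0, 0.2, 3.5, 0.0]
end_logits = [0.0, 0.3, 5.0, 0.1, 4.5]
print(best_span(start_logits, end_logits))
```

Here the pair (1, 4) is rejected for exceeding the length limit and (3, 2) for being reversed, leaving (1, 2) as the winner over (3, 4).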

In [None]:
def get_preds(total_preds, data_path):
    log_path = '/content/'
    preds_path = os.path.join(log_path, 'preds1')
    if not os.path.exists(preds_path):
        os.mkdir(preds_path)
    text_preds_path = os.path.join(preds_path, 'preds1.json')
    with open(text_preds_path, "w") as jsonFile:
        json.dump(total_preds, jsonFile)

    !python /content/Arabic-MRC/evaluatev2.py /content/cgsqa-val.json /content/preds1/preds1.json electra  --out-file /content/preds1

    if log_path:
        with open(os.path.join(preds_path, 'res.csv')) as f:
            DictReader_obj = csv.DictReader(f)
            lastrow = None
            for item in DictReader_obj:
                lastrow = dict(item)
        return lastrow
    return 1

In [None]:
def span_train(model,start_epoch, num_epochs, optimizer, train_loader, val_loader):#optimizer,max_compined_metric,
  model.train()
  for epoch in range(start_epoch,num_epochs):
    total_loss = 0.0
    loop = tqdm(train_loader, leave=True)
    for batch_idx, batch in enumerate(loop):
      tokens = batch['input_ids'].to(device)
      masks = batch['attention_mask'].to(device)
      tokens_type = batch['token_type_ids'].to(device)
      gt_start = batch['start_positions'].to(device)
      gt_end = batch['end_positions'].to(device)
      outputs = model(tokens, masks, tokens_type, start_positions=gt_start, end_positions=gt_end)
      loss = outputs.loss
      loss = 2*loss
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      total_loss = total_loss + ((1 / (batch_idx + 1)) * (loss.item() - total_loss))
      loop.set_description(f'Epoch {epoch}')
      loop.set_postfix(loss=loss.item())

    imd_preds, script_preds = get_raw_preds(val_loader, model,val_ids_to_idx,val_offset,cgsqa_val_contexts, 30, 10)
    result_dict = get_preds(script_preds, '/content/cgsqa-val.json')
    '''
    checkpoint = {
            'epoch': epoch + 1,
            'result_dict':result_dict,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
    '''
  return model

In [None]:
QA_AraElectra = ElectraForQuestionAnswering.from_pretrained(model_name1)
span_num_epochs = 3
span_learning_rate = 3e-5
span_optimizer = torch.optim.AdamW(QA_AraElectra.parameters(), lr=span_learning_rate, weight_decay=1e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
QA_AraElectra.to(device)

ElectraForQuestionAnswering(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(64000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0-11): 12 x ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerN

In [None]:
span_trained_model = span_train(QA_AraElectra,0, span_num_epochs, span_optimizer, span_train_loader, val_loader)#span_optimizer,0.0,

  0%|          | 0/132 [00:00<?, ?it/s]You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Epoch 0: 100%|██████████| 132/132 [01:02<00:00,  2.10it/s, loss=1.32]
100%|██████████| 29/29 [00:04<00:00,  6.45it/s]


{
  "exact": 88.05309734513274,
  "f1": 93.77490749172163,
  "total": 226,
  "HasAns_exact": 88.05309734513274,
  "HasAns_f1": 93.77490749172163,
  "HasAns_total": 226
}


Epoch 1: 100%|██████████| 132/132 [01:05<00:00,  2.01it/s, loss=0.0428]
100%|██████████| 29/29 [00:04<00:00,  5.99it/s]


{
  "exact": 91.59292035398231,
  "f1": 95.47018001664017,
  "total": 226,
  "HasAns_exact": 91.59292035398231,
  "HasAns_f1": 95.47018001664017,
  "HasAns_total": 226
}


Epoch 2: 100%|██████████| 132/132 [01:04<00:00,  2.03it/s, loss=0.301]
100%|██████████| 29/29 [00:04<00:00,  6.13it/s]


{
  "exact": 92.92035398230088,
  "f1": 96.49975516125957,
  "total": 226,
  "HasAns_exact": 92.92035398230088,
  "HasAns_f1": 96.49975516125957,
  "HasAns_total": 226
}
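
The `exact` and `f1` keys suggest SQuAD-style metrics. Assuming the usual definitions, exact match checks whether prediction and reference are identical, and F1 is the harmonic mean of token-level precision and recall. The sketch below uses plain whitespace tokenization; the actual `evaluatev2.py` script may normalize text differently, and the function names and sample strings are hypothetical.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1 if the whitespace-tokenized strings are identical, else 0."""
    return int(prediction.split() == reference.split())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "100 Kuwaiti dinars for supervision"  # hypothetical model output
reference = "100 Kuwaiti dinars"                   # hypothetical gold answer

print(exact_match(prediction, reference))
print(token_f1(prediction, reference))
```

A prediction that contains the whole reference plus extra words scores zero on exact match but keeps partial F1 credit, which is consistent with the F1 values above being higher than the exact-match values.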


In [None]:
span_trained_model.save_pretrained('Model1')
araelectra_tokenizer.save_pretrained('Model1')

('Model1/tokenizer_config.json',
 'Model1/special_tokens_map.json',
 'Model1/vocab.txt',
 'Model1/added_tokens.json',
 'Model1/tokenizer.json')

#Test and evaluate the model on CGSQuAD

In [None]:
span_trained_model = ElectraForQuestionAnswering.from_pretrained('Model1')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
span_trained_model.to(device)

In [None]:
imd_preds, script_preds = get_raw_preds(test_loader, span_trained_model, test_ids_to_idx, test_offset, cgsqa_test_contexts, 30, 10)

100%|██████████| 29/29 [00:04<00:00,  6.18it/s]


In [None]:
print(imd_preds)

{'18': {'score': 10.368954, 'text': 'رسالة بحثية يعدها الطالب بتوجيه المشرف الاكاديمي'}, '125': {'score': 10.414878, 'text': 'ويتم اختياره من قبل عميد كلية الدراسات العليا'}, '182': {'score': 12.677702, 'text': 'سنتان ويجوز تمديدها'}, '153': {'score': 17.14149, 'text': 'خلال مشرفه الاكاديمي'}, '13': {'score': 12.878348, 'text': 'عن طريق عميد كلية الدراسات العليا'}, '198': {'score': 17.025555, 'text': 'اجتياز جميع متطلبات البرنامج بمعدل متوسط لا يقل عن 3'}, '110': {'score': 15.506573, 'text': 'الفصل الاول للعام الجامعي التالي لاعتمادها'}, '145': {'score': 17.534016, 'text': 'المشرف والمشرف المساعد ورئيس اللجنة والممتحنين'}, '88': {'score': 17.608189, 'text': 'ناجح أو راسب'}, '40': {'score': 14.685641, 'text': '500 دينار كويتي'}, '53': {'score': 15.75659, 'text': 'يقوم بتقديم تقريرا مكتوبا'}, '212': {'score': 14.253257, 'text': 'عن طريق تقديم الاقتراح الى لجنة البرنامج'}, '51': {'score': 14.811602, 'text': 'سنتين'}, '8': {'score': 9.200519, 'text': 'متطلبات البرنامج نفسه ومن الممكن اضافة

In [None]:
print(script_preds)

{'18': 'رسالة بحثية يعدها الطالب بتوجيه المشرف الاكاديمي', '125': 'ويتم اختياره من قبل عميد كلية الدراسات العليا', '182': 'سنتان ويجوز تمديدها', '153': 'خلال مشرفه الاكاديمي', '13': 'عن طريق عميد كلية الدراسات العليا', '198': 'اجتياز جميع متطلبات البرنامج بمعدل متوسط لا يقل عن 3', '110': 'الفصل الاول للعام الجامعي التالي لاعتمادها', '145': 'المشرف والمشرف المساعد ورئيس اللجنة والممتحنين', '88': 'ناجح أو راسب', '40': '500 دينار كويتي', '53': 'يقوم بتقديم تقريرا مكتوبا', '212': 'عن طريق تقديم الاقتراح الى لجنة البرنامج', '51': 'سنتين', '8': 'متطلبات البرنامج نفسه ومن الممكن اضافة متطلبات اضافية', '142': 'يقدم الطالب مشروع تفصيلي لبحث الدكتوراة', '147': 'سنة دراسية واحدة', '168': 'لاحتسابه وقت التحكيم', '159': 'يحددها مجلس الجامعة', '3': 'ولا يمكن التقديم على أكثر من برنامج مختلف في نفس الوقت', '10': 'ولا تحتسب فترات انقطاع الدراسة', '28': 'لاحتسابه وقت التحكيم', '121': '6 وحدات دراسية بحد اقصى', '176': '3 سنوات', '210': 'مهمة واحدة فقط خلال مدة دراسته', '175': '8 و 12 وحدة دراسية في كل ف

In [None]:
def get_preds2(total_preds):
  preds_path = os.path.join('preds2')
  if not os.path.exists(preds_path):
    os.mkdir(preds_path)
  text_preds_path = os.path.join(preds_path, 'preds2.json')
  with open(text_preds_path, "w") as jsonFile:
    json.dump(total_preds, jsonFile)
  !python /content/Arabic-MRC/evaluatev2.py /content/cgsqa-test.json /content/preds2/preds2.json electra --out-file /content/preds2

In [None]:
qa_result_dict = get_preds2(script_preds)

{
  "exact": 92.47787610619469,
  "f1": 96.14696509323463,
  "total": 226,
  "HasAns_exact": 92.47787610619469,
  "HasAns_f1": 96.14696509323463,
  "HasAns_total": 226
}


In [None]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
span_trained_model.push_to_hub("AraELECTRA-CGSQuAD-QA-Model1")
araelectra_tokenizer.push_to_hub("AraELECTRA-CGSQuAD-QA-Tokenizer1")

CommitInfo(commit_url='https://huggingface.co/FatemahAlsubaiei/AraELECTRA-CGSQuAD-QA-Tokenizer1/commit/2244c0544136c9e18092646d0d29abed440d6138', commit_message='Upload tokenizer', commit_description='', oid='2244c0544136c9e18092646d0d29abed440d6138', pr_url=None, pr_revision=None, pr_num=None)