DataScienceUIBK/ArabicaQA

ArabicaQA is a robust dataset designed to support and advance the development of Arabic Question Answering (QA) systems. This dataset encompasses a wide range of question types, including both Machine Reading Comprehension (MRC) and Open-Domain questions, catering to various aspects of QA research and application. The dataset is structured to facilitate training, validation, and testing of Arabic QA models.

Demo

Try Our Demo here

Requirements

# for inference
pip install torch==1.5.1
pip install faiss-cpu==1.7.3
pip install transformers==3.0.0

Using AraDPR

To use our AraDPR for question answering, follow the steps below:

Step 1: Clone AraDPR Repository

First, download the AraDPR model by cloning the repository:

git clone https://huggingface.co/abdoelsayed/AraDPR

After cloning, move the AraDPR model directory to DPR/Model within your project structure.

Step 2: Clone AraDPR Index

Next, download the DPR index required for running AraDPR:

git clone https://huggingface.co/abdoelsayed/AraDPR_index

Once downloaded, move the AraDPR index directory to DPR/DPR_index within your project structure.

Step 3: Wikipedia Data

Next, download the Wikipedia passages TSV file:

TSV

Once downloaded, move wikiAr.tsv to the wiki directory within your project structure.

Step 4: Running Inference

With the AraDPR model and index in place, you can run inference to answer questions. Edit the inference.py script to include your questions or use the example provided in the script.

To run the inference, execute:

python inference.py

Step 5: Review Results

The results of your inference will be saved in result.json. Open this file to review the answers provided by the AraDPR model to your questions.
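The exact schema of result.json is defined by inference.py, but a minimal sketch for loading and inspecting it might look like this (the question and answers field names are assumptions; adjust them to match what the script actually writes):

```python
import json


def load_results(path):
    """Load the inference output and return (question, top answer) pairs.

    Assumes a list of records with "question" and "answers" keys;
    a record with no answers yields None as its top answer.
    """
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    return [(r.get("question"), (r.get("answers") or [None])[0]) for r in results]
```

This keeps review simple: print the pairs, or filter for questions whose top answer is None to spot failures.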

Dataset Overview

ArabicaQA is divided into several segments to address different QA challenges:

  • Machine Reading Comprehension (MRC): Contains questions with provided context paragraphs and specified answers. It includes both answerable and unanswerable questions to mimic real-world scenarios where some questions may not have straightforward answers.
  • Open-Domain QA: Designed for scenarios where questions are asked in an open context, encouraging models to retrieve relevant information from a broad dataset.
  • Retriever Training Data: Offers structured data to train retriever models, which are crucial for identifying relevant context or documents from a large corpus.

Dataset Statistics

| Category             | Training | Validation | Test   |
| -------------------- | -------- | ---------- | ------ |
| MRC (with answers)   | 62,186   | 13,483     | 13,426 |
| MRC (unanswerable)   | 2,596    | 561        | 544    |
| Open-Domain          | 62,057   | 13,475     | 13,414 |
| Open-Domain (Human)  | 58,676   | 12,715     | 12,592 |

Download Links

MRC Dataset

Structured as JSON files, the MRC dataset includes train.json, val.json, and test.json for training, validation, and testing phases, respectively, along with a metadata CSV file.

  • Data Structure:

{
  "data": [
    {
      "title": "",
      "paragraphs": [
        {
          "context": "",
          "qas": [
            {
              "question": "",
              "id": "",
              "answers": [
                {
                  "answer_start": 0,
                  "text": ""
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
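The MRC files follow the SQuAD-style layout shown above. A minimal sketch for iterating over a split, assuming an empty answers list marks an unanswerable question:

```python
import json


def iter_mrc_examples(mrc):
    """Yield (context, question, answers) triples from a SQuAD-style dict."""
    for article in mrc["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa["question"], qa["answers"]


def count_answerable(mrc):
    """Count (answerable, unanswerable) questions in one split,
    treating an empty answers list as unanswerable."""
    answerable = unanswerable = 0
    for _, _, answers in iter_mrc_examples(mrc):
        if answers:
            answerable += 1
        else:
            unanswerable += 1
    return answerable, unanswerable
```

Load train.json with json.load and pass the resulting dict to either helper; the counts should match the MRC rows of the statistics table above.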

Open-Domain QA Dataset

Available in both JSON and JSONL formats, this part of the dataset is annotated by humans for realistic QA scenarios.

  • Data Structure:

[
    {
        "question_id": "",
        "answer_id": "",
        "question": "",
        "answer": ""
    }
]
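Since this split ships in both JSON and JSONL, a small loader that handles either format may be convenient (a sketch; it dispatches on the file extension):

```python
import json


def load_open_domain(path):
    """Load an Open-Domain QA split from either JSON or JSONL.

    Each record is expected to carry question_id, answer_id,
    question, and answer keys, as in the structure above.
    """
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # JSONL: one record per non-empty line
            return [json.loads(line) for line in f if line.strip()]
        # Plain JSON: a single top-level list
        return json.load(f)
```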

Retriever Training Data

This section provides datasets for training retrieval models, crucial for efficient information extraction and context identification.

  • Data Structure:

[
    {
        "question": "...",
        "answers": ["...", "...", "..."],
        "positive_ctxs": [
            {
                "title": "...",
                "text": "..."
            }
        ],
        "negative_ctxs": ["..."],
        "hard_negative_ctxs": ["..."]
    }
]
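Before training, it can help to sanity-check that every record matches this structure. A small validator sketch (the field names come from the structure above; the title/text check on positive contexts is an assumption about what a trainer needs):

```python
REQUIRED_KEYS = {"question", "answers", "positive_ctxs",
                 "negative_ctxs", "hard_negative_ctxs"}


def validate_retriever_example(example):
    """Return True if a retriever-training record has all expected
    fields and every positive context carries a title and text."""
    if REQUIRED_KEYS - example.keys():
        return False  # at least one top-level field is missing
    return all({"title", "text"} <= ctx.keys()
               for ctx in example["positive_ctxs"])
```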

Retriever Data Output

This section contains outputs from the retrieval models, showcasing the effectiveness of different retrieval strategies (DPR and BM25) in context selection.

  • Data Structure:

[
    {
        "question": "...",
        "answers": ["...", "..."],
        "ctxs": [
            {
                "id": "...",
                "title": "...",
                "text": "...",
                "score": "...",
                "has_answer": true|false
            }
        ]
    }
]
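With the has_answer flag, computing top-k retrieval accuracy (the standard metric for comparing retrievers such as DPR and BM25) is straightforward. A sketch, assuming each ctxs list is already sorted by retrieval score:

```python
def top_k_accuracy(results, k):
    """Fraction of questions whose top-k retrieved contexts contain
    the answer, judged by the has_answer flag."""
    hits = sum(any(ctx["has_answer"] for ctx in r["ctxs"][:k])
               for r in results)
    return hits / len(results)
```

Sweeping k (e.g. 1, 5, 20, 100) over the same output file yields the usual accuracy-at-k curve for a retriever.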

Wikipedia Data

  • Data Structure: tab-separated columns id, text, title
  • Wikipedia: TSV
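A sketch for reading the passage file with the standard csv module, assuming a DPR-style header row of id, text, title (drop the next(reader) call if the file has none):

```python
import csv


def read_passages(path):
    """Read a tab-separated passage file into a list of dicts.

    Assumes the first row is a header naming the columns
    (id, text, title).
    """
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)  # skip the id/text/title header row
        return [dict(zip(header, row)) for row in reader]
```

For very long Wikipedia passages you may need to raise the parser's field size cap first via csv.field_size_limit.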

Training AraDPR

Training code will be available soon.

Citation

If you find this code or data useful, please consider citing our paper:

@misc{abdallah2024arabicaqa,
      title={ArabicaQA: A Comprehensive Dataset for Arabic Question Answering}, 
      author={Abdelrahman Abdallah and Mahmoud Kasem and Mahmoud Abdalla and Mohamed Mahmoud and Mohamed Elkasaby and Yasser Elbendary and Adam Jatowt},
      year={2024},
      eprint={2403.17848},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

ArabicaQA: Comprehensive Dataset for Arabic Question Answering accepted at SIGIR 2024
