PyThaiNLP/thaiqa_squad

annotations_creators: expert-generated
language_creators: found
languages: th
licenses: cc-by-nc-sa-3.0
multilinguality: monolingual
size_categories: 1K<n<10K
source_datasets: extended|other-thaiqa
task_categories: question-answering
task_ids: extractive-qa, open-domain-qa

Dataset Card for thaiqa-squad

Dataset Description

Dataset Summary

thaiqa_squad is an open-domain, extractive question answering dataset in SQuAD format, with 4,000 questions in the train split and 74 in the dev (validation) split. NECTEC originally created it from Wikipedia articles, and PyThaiNLP adapted it to SQuAD format.
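For readers unfamiliar with the format, the following is a minimal sketch of one SQuAD-style record and a check that each answer sits at its stated character offset. The field names follow the common flattened SQuAD layout (`id`, `context`, `question`, `answers` with `text` and `answer_start`); that this dataset uses exactly these names is an assumption, and the record itself is invented for illustration rather than taken from the data.

```python
# Hypothetical record in the flattened SQuAD layout; NOT taken from the dataset.
example = {
    "id": "0",
    "context": "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",
    "question": "เมืองหลวงของประเทศไทยคือเมืองอะไร",
    "answers": {"text": ["กรุงเทพมหานคร"], "answer_start": [0]},
}


def answer_spans_match(record):
    """Return True if every answer text is found at its stated character offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_spans_match(example))
```

A check like this is useful in extractive QA because the answer must be a literal substring of the context, located by character offset rather than by token index.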

Supported Tasks and Leaderboards

extractive question answering

Languages

Thai

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

                          train        valid
# questions               4000         74
# avg words in context    1186.740750  1016.459459
# avg words in question   14.325500    12.743243
# avg words in answer     3.279750     4.608108
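The card does not say which tokenizer produced these word counts. As a rough sketch of how such averages could be computed, the function below takes the tokenizer as a parameter. The function name, the record layout, and the toy records are assumptions for illustration; whitespace splitting stands in for a real Thai word tokenizer (for example, PyThaiNLP's `word_tokenize` could be passed instead).

```python
from statistics import mean


def average_word_counts(records, tokenize):
    """Average token counts for contexts, questions, and answers."""
    return {
        "context": mean(len(tokenize(r["context"])) for r in records),
        "question": mean(len(tokenize(r["question"])) for r in records),
        "answer": mean(
            len(tokenize(text)) for r in records for text in r["answers"]["text"]
        ),
    }


# Toy, pre-segmented records (invented); str.split is only a placeholder
# because Thai text is not normally separated by spaces.
records = [
    {"context": "a b c d", "question": "a b", "answers": {"text": ["c"]}},
    {"context": "e f", "question": "e f g h", "answers": {"text": ["e f"]}},
]
stats = average_word_counts(records, str.split)
print(stats)
```

Different tokenizers segment Thai differently, so results from this sketch should not be expected to reproduce the table exactly.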

Dataset Creation

Curation Rationale

PyThaiNLP created thaiqa_squad as a SQuAD-format version of thaiqa. thaiqa comes from the 2nd "Question answering program from Thai Wikipedia" of the National Software Contest 2020.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Wikipedia authors wrote the contexts; NECTEC produced the questions and answer annotations.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

NECTEC

Personal and Sensitive Information

All content is from Wikipedia. No personal or sensitive information is expected to be included.

Considerations for Using the Data

Social Impact of Dataset

  • Supports open-domain, extractive question answering in Thai

Discussion of Biases

[More Information Needed]

Other Known Limitations

  • Each context includes <doc> tags at its start and end
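One way to work around this limitation is to strip the tags before training and shift the answer offsets by the length of the removed opening tag. The sketch below assumes the opening tag may carry attributes (for example `<doc id="...">`) and that the closing tag is `</doc>`; the exact tag shape is not documented here, and `strip_doc_tags` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: an opening <doc ...> tag (attributes optional) at the very start
# and a closing </doc> tag at the very end of the context.
_OPEN_TAG = re.compile(r"^\s*<doc\b[^>]*>")
_CLOSE_TAG = re.compile(r"</doc>\s*$")


def strip_doc_tags(context, answer_starts):
    """Remove leading/trailing <doc> tags and shift answer offsets to match."""
    match = _OPEN_TAG.match(context)
    prefix_len = match.end() if match else 0
    # Only the prefix changes offsets; removing the trailing tag does not.
    body = _CLOSE_TAG.sub("", context[prefix_len:])
    shifted = [start - prefix_len for start in answer_starts]
    return body, shifted


# Invented example context, not taken from the dataset.
context = '<doc id="1">Bangkok is the capital of Thailand.</doc>'
start = context.index("Bangkok")
clean, starts = strip_doc_tags(context, [start])
print(clean)
print(clean[starts[0]:starts[0] + len("Bangkok")])
```

A negative shifted offset would indicate an answer that pointed inside the opening tag itself; such records would need separate handling.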

Additional Information

Dataset Curators

NECTEC for the original thaiqa. SQuAD formatting by PyThaiNLP.

Licensing Information

CC-BY-NC-SA 3.0

Citation Information

[More Information Needed]