# Data preprocessing for [MLQA](https://github.com/facebookresearch/MLQA?tab=readme-ov-file) [dataset](https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip)

In [1]:
import json
import random 

# Check data architecture

## English data

In [2]:
file_path = './data/MLQA/test/test-context-en-question-en.json'

with open(file_path, 'r', encoding='utf-8') as f:
    data_en = json.load(f)

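def collect_keys(obj, depth=0, keys_per_level=None):
    """Collect the set of dict keys seen at each nesting depth (lists do not add depth)."""
    # Default to None: a mutable default dict would be shared between separate calls.
    if keys_per_level is None:
        keys_per_level = {}
    if isinstance(obj, dict):
        keys_per_level.setdefault(depth, set())
        for k, v in obj.items():
            keys_per_level[depth].add(k)
            collect_keys(v, depth + 1, keys_per_level)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, depth, keys_per_level)

    return keys_per_level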

keys_per_level = collect_keys(data_en)

for level, keys in keys_per_level.items():
    print(f"Level {level}: {', '.join(keys)}")

Level 0: version, data
Level 1: paragraphs, title
Level 2: context, qas
Level 3: question, id, answers
Level 4: text, answer_start


- version
- data
    - title
    - paragraphs 
        - context 
        - qas
            - id
            - question
            - answers
                - text
                - answer_start

Each article (`title`) may contain multiple paragraphs; each paragraph holds one `context`, and one `context` may have multiple QA pairs.

```
{'title': 'Cell culture',
 'paragraphs': 
      [ {'context': 'An established or immortalized of the telomerase gene....',
         'qas': 
          [{
            'question': 'What thing composes the line?',
            'answers': 
              [{
                'text': 'cell', 
                'answer_start': 31
              }],
            'id': '037e8929e7e4d2f949ffbabd10f0f860499ff7c9'
          }]
        },
        {'context': 'The 19th-century English physiologist Sydney Ringer developed salt solutions......',
         'qas': 
          [{
            'question': 'When did Roux remove some of his medullary plate?',
            'answers': 
              [{
                'text': '1885', 
                'answer_start': 232
              }],
            'id': '4b36724f3cbde7c287bde512ff09194cbba7f932'
           },
           {
            'question': 'When were cell culture techniques significantly advanced?',
            'answers': 
              [{
                'text': 'the 1940s and 1950s', 
                'answer_start': 677
              }],
            'id': 'c8acddd587c933917a0a09a214aee83c30764a0d'
          }]
        }
      ]
}
```
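
The nesting above can be walked with three loops. The following is a minimal sketch on a made-up toy record (the text, ids, and the helper name `iter_qas` are illustrative assumptions, not part of MLQA); it flattens the structure into `(title, context, qa)` rows.

```python
from typing import Dict, Iterator, Tuple

# Toy dataset shaped like the MLQA files; all values here are made up.
sample = {
    "version": "toy",
    "data": [
        {
            "title": "Cell culture",
            "paragraphs": [
                {
                    "context": "Cells are grown under controlled conditions.",
                    "qas": [
                        {"id": "q1", "question": "What is grown?",
                         "answers": [{"text": "Cells", "answer_start": 0}]},
                        {"id": "q2", "question": "Under what conditions?",
                         "answers": [{"text": "controlled conditions",
                                      "answer_start": 22}]},
                    ],
                }
            ],
        }
    ],
}


def iter_qas(dataset: Dict) -> Iterator[Tuple[str, str, Dict]]:
    """Yield (title, context, qa) for every QA pair in the nested dict."""
    for article in dataset["data"]:              # level 1: title, paragraphs
        for paragraph in article["paragraphs"]:  # level 2: context, qas
            for qa in paragraph["qas"]:          # level 3: question, id, answers
                yield article["title"], paragraph["context"], qa


rows = list(iter_qas(sample))
for title, context, qa in rows:
    print(title, "|", qa["id"], "|", qa["question"])
```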

## Add `context_id` to each `context`

In [3]:
context_id = 0
for article in data_en["data"]:
    for paragraph in article["paragraphs"]:
        paragraph["context_id"] = context_id
        for qa in paragraph["qas"]:
            qa["context_id"] = context_id
        context_id += 1

data_en["data"][0]["paragraphs"][0] 

{'context': 'In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation and Recovery Act (which governs hand

In [4]:
all_contexts_en = []
all_qas_en = []

for item in data_en["data"]:
    for paragraph in item["paragraphs"]:
        all_contexts_en.append(paragraph["context"])
        all_qas_en.extend(paragraph["qas"])

In [5]:
all_contexts_en[0], all_qas_en[0]
# print(json.dumps(all_qas_en, indent=2))

('In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation and Recovery Act (which governs handling of dan

In [6]:
print(f"For English data:\n\tNumber of Contexts: {len(all_contexts_en)}")
print(f"\tNumber of QA pairs: {len(all_qas_en)}")

For English data:
	Number of Contexts: 9916
	Number of QA pairs: 11590


## Chinese data

In [7]:
file_path = './data/MLQA/test/test-context-zh-question-zh.json'

with open(file_path, 'r', encoding='utf-8') as f:
    data_zh = json.load(f)

# print(f"Total num: {len(data_zh)}")
# data_zh
    
keys_per_level = collect_keys(data_zh)
for level, keys in keys_per_level.items():
    print(f"Level {level}: {', '.join(keys)}")

Level 0: version, data
Level 1: paragraphs, title
Level 2: context, qas
Level 3: question, id, answers
Level 4: text, answer_start


In [8]:
context_id = 0
for article in data_zh["data"]:
    for paragraph in article["paragraphs"]:
        paragraph["context_id"] = context_id
        for qa in paragraph["qas"]:
            qa["context_id"] = context_id
        context_id += 1

data_zh["data"][0]["paragraphs"][0] 

{'context': '在电路学里，电动势（英语：electromotive force，缩写为emf）表征一些电路元件供应电能的特性。这些电路元件称为「电动势源」。电化电池、太阳能电池、燃料电池、热电装置、发电机等等，都是电动势源。电动势源所供应的能量每单位电荷是其电动势。假设，电荷',
 'qas': [{'question': '各电化电池都能提供电动势？',
   'answers': [{'text': '电化电池', 'answer_start': 71}],
   'id': '465f3fb044b5c50a78a2e2f9bc94c424d1f7d039',
   'context_id': 0}],
 'context_id': 0}

In [9]:
all_contexts_zh = []
all_qas_zh = []

for item in data_zh["data"]:
    for paragraph in item["paragraphs"]:
        all_contexts_zh.append(paragraph["context"])
        all_qas_zh.extend(paragraph["qas"])

In [10]:
all_contexts_zh[0], all_qas_zh[0]
# print(json.dumps(all_qas_zh, indent=2))

('在电路学里，电动势（英语：electromotive force，缩写为emf）表征一些电路元件供应电能的特性。这些电路元件称为「电动势源」。电化电池、太阳能电池、燃料电池、热电装置、发电机等等，都是电动势源。电动势源所供应的能量每单位电荷是其电动势。假设，电荷',
 {'question': '各电化电池都能提供电动势？',
  'answers': [{'text': '电化电池', 'answer_start': 71}],
  'id': '465f3fb044b5c50a78a2e2f9bc94c424d1f7d039',
  'context_id': 0})

In [11]:
print(f"For Chinese data:\n\tNumber of Contexts: {len(all_contexts_zh)}")
print(f"\tNumber of QA pairs: {len(all_qas_zh)}")

For Chinese data:
	Number of Contexts: 4546
	Number of QA pairs: 5137


## Check if all zh data overlap with en data

In [12]:
ids_en = [item['id'] for item in all_qas_en]
ids_zh = [item['id'] for item in all_qas_zh]
len(ids_en), len(ids_zh)

(11590, 5137)

In [13]:
ids_en_set = set(ids_en)  # set lookup avoids scanning the whole list for every id
count = sum(1 for qa_id in ids_zh if qa_id in ids_en_set)
count

5137

```
Summary:

1. Dataset architecture:
- version
- data
    - title
    - paragraphs 
        - context 
        - qas
            - id
            - question
            - answers
                - text
                - answer_start

Each article (title) may contain multiple paragraphs; each paragraph holds one context, and one context may have multiple QA pairs.

For English data:
	Number of Contexts: 9916
	Number of QA pairs: 11590

For Chinese data:
	Number of Contexts: 4546
	Number of QA pairs: 5137

All 5137 Chinese QAs can be matched to the English dataset by their IDs.
```
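
Because the IDs are shared across languages, Chinese and English QA pairs can be aligned with a dictionary lookup. This is a minimal sketch on made-up toy records (the variable names and question texts are illustrative assumptions); with the real data, `all_qas_en` and `all_qas_zh` would take the place of the toy lists.

```python
# Toy QA lists; the ids and questions are made up for illustration.
qas_en_toy = [
    {"id": "a1", "question": "Who analyzed the biopsies?"},
    {"id": "b2", "question": "Where is the river?"},
]
qas_zh_toy = [
    {"id": "a1", "question": "谁分析了活检样本？"},
]

# Index the English side by id, then look up each Chinese QA.
en_by_id = {qa["id"]: qa for qa in qas_en_toy}
pairs = [(en_by_id[qa["id"]], qa) for qa in qas_zh_toy if qa["id"] in en_by_id]
unmatched = [qa["id"] for qa in qas_zh_toy if qa["id"] not in en_by_id]

for en, zh in pairs:
    print(en["id"], "|", en["question"], "|", zh["question"])
print("Unmatched zh ids:", unmatched)
```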

# Check data quality

## Manual check of the quality of QA data

In [14]:
# Sample one random article for manual inspection (no hard-coded upper index)
random.choice(data_zh['data'])

{'title': '美洲水鼬',
 'paragraphs': [{'context': '美洲水鼬（学名：Neovison vison，英语： American mink）是鼬科的一个半水栖物种，原产自北美洲，但由于人类的活动，其分布范围已经扩张至欧洲、南美洲的诸多地区。由于分布范围较广，美洲水鼬被国际自然保护联盟认定为无危物种。 自从海貂绝种以来，美洲水鼬成为水鼬属唯一一个现存的物种。美洲水鼬是一种肉食动物，进食大鼠、鱼类、 甲壳动物、蛙类和鸟类。在欧洲，由于是外来物种，它已经被归类为入侵物种——人们认为它与欧洲水鼬、比利牛斯鼬鼹和水䶄种群数量的减少有着直接的关系。此外，美洲水鼬也经常被人为饲养以取其毛皮。它是世界上最常见的生产毛皮的动物，在经济活动中的重要性超过了银狐、紫貂、貂属和臭鼬。',
   'qas': [{'question': '美洲水鼬以什么为食？',
     'answers': [{'text': '大鼠、鱼类、 甲壳动物、蛙类和鸟类', 'answer_start': 164}],
     'id': 'd5fefdb71fa18739d95edb69d76c690594ff3a4d',
     'context_id': 1427},
    {'question': '为什么美国鼬现在在欧洲出现了？',
     'answers': [{'text': 'American mink）是鼬科的一个半水栖物种，原产自北美洲，但由于人类的活动，其分布范围已经扩张至欧洲、南美洲的诸多地区。',
       'answer_start': 27}],
     'id': '2bcab480460c1f37b0f1d867fa238a8b516b89b5',
     'context_id': 1427}],
   'context_id': 1427}]}
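
Reading samples by hand can be complemented by a mechanical check that each answer text really sits at its `answer_start` offset. The sketch below assumes `answer_start` is a character offset into `context` (as in SQuAD-style files) and uses a made-up toy paragraph; `find_misaligned` is a hypothetical helper name. It only catches offset errors, not answers that are well aligned but semantically wrong.

```python
def find_misaligned(paragraph):
    """Return ids of QAs whose answer text is not found at answer_start."""
    context = paragraph["context"]
    bad_ids = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            span = context[start:start + len(answer["text"])]
            if span != answer["text"]:
                bad_ids.append(qa["id"])
                break  # one bad answer is enough to flag this QA
    return bad_ids


# Toy paragraph; the second QA has a deliberately wrong offset.
toy_paragraph = {
    "context": "The mink eats fish and frogs.",
    "qas": [
        {"id": "ok", "question": "What does it eat?",
         "answers": [{"text": "fish and frogs", "answer_start": 14}]},
        {"id": "off", "question": "Which animal?",
         "answers": [{"text": "mink", "answer_start": 5}]},
    ],
}

print(find_misaligned(toy_paragraph))
```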

```
# Mark 1 for an incorrect answer, 0 for correct.
10000,00000
00000,00000
10100,10000
00000,00000
00100,00000

error rate: 5/50 = 10%

{'title': '赤色黎明',
 'paragraphs': [{'context': '《赤色黎明》（英语：Red Dawn）是由丹·布拉德利执导的一部2012年美国战争片。剧本由卡尔·埃尔斯沃斯和杰里米·帕斯（Jeremy Passmore）改编自1984年同名电影。演员阵容有克里斯·海姆斯沃斯、乔希·佩克、乔什·哈切森、阿德琳妮·帕里奇、伊莎贝尔·卢卡斯、康纳·克鲁斯和杰弗里·迪恩·摩根。影片聚焦于一群帮助家乡抵御北朝鲜入侵的年轻人。',
   'qas': [{'question': '他们想保卫哪个国家？',
     'answers': [{'text': '北朝鲜', 'answer_start': 167}],
     'id': 'd221b071d496de0aa07a11addfa5202f30edaa4c'}]}]}
```
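
The error rate follows directly from the tally: count the `1` marks and divide by the number of samples. A short sketch of that arithmetic, with the tally copied from the block above:

```python
# Tally copied from the manual check: 1 = incorrect answer, 0 = correct.
tally = """
10000,00000
00000,00000
10100,10000
00000,00000
00100,00000
"""

# Keep only the 0/1 marks, ignoring commas and newlines.
marks = [int(ch) for ch in tally if ch in "01"]
errors = sum(marks)
rate = errors / len(marks)

print(f"{errors} errors in {len(marks)} samples: {rate:.0%}")
# prints "5 errors in 50 samples: 10%"
```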

## Add `title` to each `question` & `context`
The data is grouped by topic (`title`), so a question taken out of its article can be too broad or ambiguous (for example, "他们想保卫哪个国家？", "Which country do they want to defend?"). The title is therefore attached to each question so that it can be understood on its own.

In [15]:
all_qas_en[:5]

[{'question': 'Who analyzed the biopsies?',
  'answers': [{'text': 'Rutgers University biochemists', 'answer_start': 457}],
  'id': 'a4968ca8a18de16aa3859be760e43dbd3af3fce9',
  'context_id': 0},
 {'question': 'who represented robert frost and walter kasza in their suit?',
  'answers': [{'text': 'George Washington University law professor Jonathan Turley',
    'answer_start': 218}],
  'id': 'f251ea56c4f1aa1df270137f7e6d89c0cc1b6ef4',
  'context_id': 0},
 {'question': 'What was the law suit against Groom about',
  'answers': [{'text': 'the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation and Recovery Act (which governs handling of dangerous materials)',
    'answer_start': 826}],
  'id': '04ecd5555635bc05fd2f379d1b9027edd663cebf',
  'context_id': 0},
 {'question': 'what did the complainants alleged happen to them?',
  'answers': [{'text': 'had sustained skin, liver, and respiratory injuries',
    'answer_start': 607

In [16]:
qas_with_title_en = []

for item in data_en["data"]:
    title = item["title"] 
    for paragraph in item["paragraphs"]:
        for qa in paragraph["qas"]:
            # Option 1: prefix the title to the question text
            # qa['question'] = f"{title}: {qa['question']}"
            # Option 2 (used here): store the title under a new `title` key
            qa['title'] = title
            qas_with_title_en.append(qa)  

# print(json.dumps(qas_with_title_en, indent=2))
assert len(qas_with_title_en) == len(all_qas_en)
qas_with_title_en[:5]

[{'question': 'Who analyzed the biopsies?',
  'answers': [{'text': 'Rutgers University biochemists', 'answer_start': 457}],
  'id': 'a4968ca8a18de16aa3859be760e43dbd3af3fce9',
  'context_id': 0,
  'title': 'Area 51'},
 {'question': 'who represented robert frost and walter kasza in their suit?',
  'answers': [{'text': 'George Washington University law professor Jonathan Turley',
    'answer_start': 218}],
  'id': 'f251ea56c4f1aa1df270137f7e6d89c0cc1b6ef4',
  'context_id': 0,
  'title': 'Area 51'},
 {'question': 'What was the law suit against Groom about',
  'answers': [{'text': 'the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation and Recovery Act (which governs handling of dangerous materials)',
    'answer_start': 826}],
  'id': '04ecd5555635bc05fd2f379d1b9027edd663cebf',
  'context_id': 0,
  'title': 'Area 51'},
 {'question': 'what did the complainants alleged happen to them?',
  'answers': [{'text': 'had sustain

In [17]:
all_qas_zh[:5]

[{'question': '各电化电池都能提供电动势？',
  'answers': [{'text': '电化电池', 'answer_start': 71}],
  'id': '465f3fb044b5c50a78a2e2f9bc94c424d1f7d039',
  'context_id': 0},
 {'question': '哪水体有助土地如此多产？',
  'answers': [{'text': '楚河', 'answer_start': 36}],
  'id': '1aee17dd937cc1043e3ff47c38396541fc3409e5',
  'context_id': 1},
 {'question': '它用来写什么类型的记录？',
  'answers': [{'text': '法律、行政和私人记录', 'answer_start': 90}],
  'id': 'c1100f360fed1386068a5dc584b875cc9aefb60a',
  'context_id': 2},
 {'question': '在哪帝国期间凯提文广受使用？',
  'answers': [{'text': '莫卧儿帝国期间', 'answer_start': 28}],
  'id': '89325aff92794352bde6c064b6160e601aed56b6',
  'context_id': 3},
 {'question': '爱丽丝怎样恢复她原来的身高？',
  'answers': [{'text': '经过一番努力', 'answer_start': 224}],
  'id': '9fd571d90b8081f45cfd263c961c131c257634c2',
  'context_id': 4}]

In [18]:
qas_with_title_zh = []

for item in data_zh["data"]:
    title = item["title"] 
    for paragraph in item["paragraphs"]:
        for qa in paragraph["qas"]:
            # Option 1: prefix the title to the question text
            # qa['question'] = f"{title}: {qa['question']}"
            # Option 2 (used here): store the title under a new `title` key
            qa['title'] = title
            qas_with_title_zh.append(qa)  
            
# print(json.dumps(qas_with_title_zh, indent=2))
assert len(qas_with_title_zh) == len(all_qas_zh)
qas_with_title_zh[:5]

[{'question': '各电化电池都能提供电动势？',
  'answers': [{'text': '电化电池', 'answer_start': 71}],
  'id': '465f3fb044b5c50a78a2e2f9bc94c424d1f7d039',
  'context_id': 0,
  'title': '電動勢'},
 {'question': '哪水体有助土地如此多产？',
  'answers': [{'text': '楚河', 'answer_start': 36}],
  'id': '1aee17dd937cc1043e3ff47c38396541fc3409e5',
  'context_id': 1,
  'title': '楚河州'},
 {'question': '它用来写什么类型的记录？',
  'answers': [{'text': '法律、行政和私人记录', 'answer_start': 90}],
  'id': 'c1100f360fed1386068a5dc584b875cc9aefb60a',
  'context_id': 2,
  'title': '凱提文'},
 {'question': '在哪帝国期间凯提文广受使用？',
  'answers': [{'text': '莫卧儿帝国期间', 'answer_start': 28}],
  'id': '89325aff92794352bde6c064b6160e601aed56b6',
  'context_id': 3,
  'title': '凱提文'},
 {'question': '爱丽丝怎样恢复她原来的身高？',
  'answers': [{'text': '经过一番努力', 'answer_start': 224}],
  'id': '9fd571d90b8081f45cfd263c961c131c257634c2',
  'context_id': 4,
  'title': '爱丽丝梦游仙境'}]

**Add `id` and `title` to `Context`**

English

In [19]:
all_contexts_with_idTitle_en = []
context_id = 0
for item in data_en["data"]:
    title = item["title"] 
    for paragraph in item["paragraphs"]:
        all_contexts_with_idTitle_en.append({"id": context_id, "title": title, "context": paragraph["context"]})
        context_id += 1
        
print("Total=",len(all_contexts_with_idTitle_en))
all_contexts_with_idTitle_en[:5]

Total= 9916


[{'id': 0,
  'title': 'Area 51',
  'context': 'In 1994, five unnamed civilian contractors and the widows of contractors Walter Kasza and Robert Frost sued the USAF and the United States Environmental Protection Agency. Their suit, in which they were represented by George Washington University law professor Jonathan Turley, alleged they had been present when large quantities of unknown chemicals had been burned in open pits and trenches at Groom. Biopsies taken from the complainants were analyzed by Rutgers University biochemists, who found high levels of dioxin, dibenzofuran, and trichloroethylene in their body fat. The complainants alleged they had sustained skin, liver, and respiratory injuries due to their work at Groom, and that this had contributed to the deaths of Frost and Kasza. The suit sought compensation for the injuries they had sustained, claiming the USAF had illegally handled toxic materials, and that the EPA had failed in its duty to enforce the Resource Conservation an

Chinese

In [20]:
all_contexts_with_idTitle_zh = []
context_id = 0
for item in data_zh["data"]:
    title = item["title"] 
    for paragraph in item["paragraphs"]:
        all_contexts_with_idTitle_zh.append({"id": context_id, "title": title, "context": paragraph["context"]})
        context_id += 1
        
all_contexts_with_idTitle_zh[:5]

[{'id': 0,
  'title': '電動勢',
  'context': '在电路学里，电动势（英语：electromotive force，缩写为emf）表征一些电路元件供应电能的特性。这些电路元件称为「电动势源」。电化电池、太阳能电池、燃料电池、热电装置、发电机等等，都是电动势源。电动势源所供应的能量每单位电荷是其电动势。假设，电荷'},
 {'id': 1,
  'title': '楚河州',
  'context': '楚河州包括有整个楚河河谷及邻近的山脉与峡谷。河谷的黑土非常肥沃，而且被从楚河引来的河水灌溉着。当地的农业生产计有：小麦、玉蜀黍、甜菜、马铃薯、紫花苜蓿及各种不同品种的蔬菜及水果。在苏联统治期间，省内有不少农产品加工及其他工业，使省内涌现多个新市镇，如：托克马克、坎特（Kant）及卡拉巴尔塔（Kara-Balta）等。相对于国内其他省份，本州的人口成份比较复杂，计有：俄罗斯人、乌克兰人、东干人（中国回民的后裔）、朝鲜人及德国人等。'},
 {'id': 2,
  'title': '凱提文',
  'context': '凯提文(Kaithi，कैथी)，也叫做Kayathi或Kayasthi，是历史上的一种文字，曾广泛用于北印度，主要是以前的西北行省和Oudh（今天的北方邦）和比哈尔。它曾用于书写法律、行政和私人记录。Unicode技术委员会已经接受了在Unicode标准中编码凯提文的提案，范围是U+11080-110CF。'},
 {'id': 3,
  'title': '凱提文',
  'context': '用凯提文记录的文档可追溯到至少16世纪。这种文字广泛用在莫卧儿帝国期间。在1880年代英属印度期间，这种文字被认可为比哈尔邦法庭上的官方文字。尽管一般而言凯提文曾在某些地区比城文更加广泛使用，它现在已经失去了竞争力。'},
 {'id': 4,
  'title': '爱丽丝梦游仙境',
  'context': '第五章：毛毛虫的建议（Advice from a Caterpillar）爱丽丝见到一棵蘑菇，上面坐著一条蓝色的毛虫。他抽著水烟，向爱丽丝探问起来。爱丽丝回应他，自己正在个性转变期之中，时常心绪不宁，她甚至连一首诗都记不起来。毛虫离开之前，告诉了她蘑菇的秘密：吃其中一半会使她变高，吃另一半会使她变矮。于是，她
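
Since each context's `id` is assigned in the same traversal order as the `context_id` stored on the QA pairs, a QA can be joined back to its passage with a dictionary. A minimal sketch on made-up toy records (the texts and variable names are illustrative assumptions):

```python
# Toy records mirroring the two output shapes; the texts are made up.
contexts_toy = [
    {"id": 0, "title": "Area 51", "context": "First passage."},
    {"id": 1, "title": "Cell culture", "context": "Second passage."},
]
qas_toy = [
    {"id": "q1", "context_id": 1, "title": "Cell culture",
     "question": "What is described?"},
]

# Index contexts by id so each QA can find its passage directly.
context_by_id = {c["id"]: c for c in contexts_toy}

joined = []
for qa in qas_toy:
    ctx = context_by_id[qa["context_id"]]
    # The title on the QA should agree with the title on its context.
    assert ctx["title"] == qa["title"]
    joined.append({"question": qa["question"], "context": ctx["context"]})

print(joined)
```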

# Output data

In [21]:
with open('./data/Context_EN.json', 'w') as file:
    json.dump(all_contexts_with_idTitle_en, file, indent=2)

with open('./data/Context_ZH.json', 'w') as file:
    json.dump(all_contexts_with_idTitle_zh, file, indent=2)

In [22]:
with open('./data/QA_EN.json', 'w') as file:
    json.dump(qas_with_title_en, file, indent=2)

with open('./data/QA_ZH.json', 'w') as file:
    json.dump(qas_with_title_zh, file, indent=2)
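
With the default settings, `json.dump` escapes non-ASCII characters, so the Chinese files contain `\uXXXX` sequences rather than readable text. The sketch below shows the difference on a small record; passing `ensure_ascii=False` (together with `encoding='utf-8'` on `open`) is one possible way to keep the output human-readable, and both forms load back to the same data.

```python
import json

record = {"id": 0, "title": "電動勢"}

escaped = json.dumps(record)                       # default: ASCII-only output
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both serializations decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```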