# Overview: Datasets for Medical Question Answering

In this notebook, we present various datasets used for Medical Question Answering. Each section below introduces one dataset and gives instructions and code for downloading and inspecting it.

# Preparation

Run the cell below to enable access to Google Drive. When prompted, click the link and authorize access to the Google Drive of the desired account.

In [None]:
### Google Colab Mount Drive ###

# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive

/content/drive/MyDrive


# HeadQA
[HeadQA](https://aghie.github.io/head-qa/) is a set of multiple-choice questions covering Medicine, Nursing, Psychology, Chemistry, Pharmacology, and Biology. Questions come from exams to access a specialized position in the Spanish healthcare system. The dataset can be downloaded from [huggingface datasets](https://huggingface.co/datasets/head_qa). Details of loading and inspecting HeadQA are shown below.

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

The questions and answers are available in both Spanish and English. The default language is Spanish.

To load the Spanish version, use `headqa = load_dataset("head_qa")`.

To load the English version, use `headqa = load_dataset("head_qa", "en")`.

In this example, we use the English version.

In [None]:
headqa = load_dataset("head_qa", "en")

Downloading:   0%|          | 0.00/1.51k [00:00<?, ?B/s]

Downloading and preparing dataset head_qa/en (download: 1.67 MiB, generated: 2.65 MiB, post-processed: Unknown size, total: 4.31 MiB) to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2...


Downloading:   0%|          | 0.00/1.75M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset head_qa downloaded and prepared to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

The `headqa` object itself is a [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets. For each key, the value is a [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset).

In [None]:
headqa

DatasetDict({
    train: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 2657
    })
    test: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 2742
    })
    validation: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 1366
    })
})

To view an actual data instance, select one of the splits and then specify an index.

In [None]:
# display the first training data instance
headqa['train'][0] 

{'answers': [{'aid': 1, 'atext': 'They are all or nothing.'},
  {'aid': 2, 'atext': 'They are hyperpolarizing.'},
  {'aid': 3, 'atext': 'They can be added.'},
  {'aid': 4, 'atext': 'They spread long distances.'},
  {'aid': 5, 'atext': 'They present a refractory period.'}],
 'category': 'biology',
 'image': '',
 'name': 'Cuaderno_2013_1_B',
 'qid': 1,
 'qtext': 'The excitatory postsynaptic potentials:',
 'ra': 3,
 'year': '2013'}

To get a better sense of what the data looks like, the following function will show some examples picked randomly from the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
# randomly choose 3 test data instances
show_random_elements(headqa["test"], 3) 

| | name | year | category | qid | qtext | ra | image | answers |
|---|---|---|---|---|---|---|---|---|
| 0 | Cuaderno_2017_1_E | 2017 | nursery | 81 | A patient with a venous ulcer in the lower limbs has characteristic symptomatology and clinical manifestations. Which of the following responses is not characteristic of this situation ?: | 3 | | [{'aid': 1, 'atext': 'Thick and hardened skin.'}, {'aid': 2, 'atext': 'Significant edema'}, {'aid': 3, 'atext': 'Intermittent claudication.'}, {'aid': 4, 'atext': 'Normal pulses'}] |
| 1 | Cuaderno_2017_1_P | 2017 | psychology | 67 | Studies on sleep in subjects complaining of insomnia show: | 2 | | [{'aid': 1, 'atext': 'That most overestimate the amount of time he actually sleeps.'}, {'aid': 2, 'atext': 'That most underestimate the amount of time that actually sleeps.'}, {'aid': 3, 'atext': 'That most accurately estimate the amount of time he actually sleeps.'}, {'aid': 4, 'atext': 'That the majority estimates with accuracy the amount of time that sleeps only during the siesta.'}] |
| 2 | Cuaderno_2016_1_E | 2016 | nursery | 54 | With respect to critical thinking, which of the following terms used by Richard Paul as characteristics of critical thinkers is INCORRECT. The critical thinkers are: | 3 | | [{'aid': 1, 'atext': 'Humble.'}, {'aid': 2, 'atext': 'Realistic'}, {'aid': 3, 'atext': 'Reagents'}, {'aid': 4, 'atext': 'Good communicators.'}] |


In each example, the question text is contained in the field `qtext`. The `answers` field is a list of dictionaries, each with two keys: `aid` contains the index of the choice and `atext` contains the text of the choice. The field `ra` contains the index of the right answer. \\
The following function helps to better visualize each question.

In [None]:
def show_one(example):
    print(f"Question: {example['qtext']}")
    # the number of choices varies between questions (4 or 5 in the samples above),
    # so iterate over the answers instead of hard-coding five indices
    for answer in example['answers']:
        print(f"  {answer['aid']} - {answer['atext']}")
    print(f"\nGround truth: option {example['ra']}")

In [None]:
show_one(headqa["train"][0])

Question: The excitatory postsynaptic potentials:
  1 - They are all or nothing.
  2 - They are hyperpolarizing.
  3 - They can be added.
  4 - They spread long distances.
  5 - They present a refractory period.

Ground truth: option 3
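
Since `ra` holds the `aid` of the correct choice rather than a list position, looking the answer up by `aid` avoids off-by-one mistakes. The snippet below is a minimal sketch: `get_right_answer` is a hypothetical helper (not part of the `datasets` API), and the example dictionary is copied from the first training instance shown above.

```python
def get_right_answer(example):
    """Return the text of the choice whose `aid` equals `ra`, or None if there is none."""
    for answer in example["answers"]:
        if answer["aid"] == example["ra"]:
            return answer["atext"]
    return None

# Example copied from headqa['train'][0] (only the fields used here)
example = {
    "qtext": "The excitatory postsynaptic potentials:",
    "ra": 3,
    "answers": [
        {"aid": 1, "atext": "They are all or nothing."},
        {"aid": 2, "atext": "They are hyperpolarizing."},
        {"aid": 3, "atext": "They can be added."},
        {"aid": 4, "atext": "They spread long distances."},
        {"aid": 5, "atext": "They present a refractory period."},
    ],
}

print(get_right_answer(example))
```

The same helper can be applied to any split, for instance `get_right_answer(headqa["test"][0])`.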


# BioASQ
[BioASQ](http://bioasq.org) organizes challenges on biomedical semantic indexing and question answering (QA). The challenges include a variety of tasks, but in this section we focus only on QA. Among [all challenges](http://bioasq.org/participate/challenges), the two relevant tasks are Task 9b: Biomedical Semantic QA and BioASQ Task Synergy: Biomedical Semantic QA for COVID-19.

**Task 9b: Biomedical Semantic QA** \\
[Task 9b](http://participants-area.bioasq.org/general_information/Task9b/) uses a benchmark QA dataset with four types of questions: \\


1. **Yes/no questions**: These are questions that, strictly speaking, require "yes" or "no" answers, though of course in practice longer answers will often be desirable. For example, "Do CpG islands colocalise with transcription start sites?" is a yes/no question.
2. **Factoid questions**: These are questions that, strictly speaking, require a particular entity name (e.g., of a disease, drug, or gene), a number, or a similar short expression as an answer, though again a longer answer may be desirable in practice. For example, "Which virus is best known as the cause of infectious mononucleosis?" is a factoid question.
3. **List questions**: These are questions that, strictly speaking, require a list of entity names (e.g., a list of gene names), numbers, or similar short expressions as an answer; again, in practice additional information may be desirable. For example, "Which are the Raf kinase inhibitors?" is a list question.
4. **Summary questions**: These are questions that do not belong in any of the previous categories and can only be answered by producing a short text summarizing the most prominent relevant information. For example, "What is the treatment of infectious mononucleosis?" is a summary question. \\

We will inspect the dataset below.



In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

In [None]:
%cd BioASQ-training9b/

/content/drive/MyDrive/BioASQ-training9b


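Inspect the README file. The 3742 questions it describes are distributed as follows: 1091 factoid, 1033 yesno, 899 summary, 719 list. Note that the JSON file loaded below contains 3743 questions, of which 1092 are factoid.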

In [None]:
!cat README

== Data purpose ==

The data are intended to be used as training and development data for BioASQ 9, which will take place during 2021.
There is one file containing the data:
 - training9b.json


The file contains the data of the first seven editions of the challenge: 3742 questions [1] with their relevant documents, snippets, concepts and RDF triples, exact and ideal answers.
For more information about the format of the data as well as the instructions for participating at BioASQ please consult: http://participants-area.bioasq.org/general_information/Task9b/

Differences with BioASQ-training8b.json 
	- 499 new questions added from BioASQ8
		- The question with id 5e30e689fbd6abf43b00003a had identical body with 5880e417713cbdfd3d000001. All relevant elements from both questions are available in the merged question with id 5880e417713cbdfd3d000001.



		
== Citing BioASQ ==
When using this data please cite our previous work:

An overview of the BIOASQ large-scale biomedical semantic ind

Load data from json file.

In [None]:
import json
data_file = "training9b.json"
# use a context manager so the file handle is closed after loading
with open(data_file) as f:
    data = json.load(f)

Inspect the structure of the JSON file. It contains one key, 'questions', whose value is a list of 3743 questions.

In [None]:
print(data.keys())
print("Total number of questions: ", len(data['questions']))

dict_keys(['questions'])
Total number of questions:  3743


In [None]:
type(data['questions'])

list

Inspect structure of any question. 



In [None]:
data['questions'][0]

{'body': 'Is Hirschsprung disease a mendelian or a multifactorial disorder?',
 'concepts': ['http://www.disease-ontology.org/api/metadata/DOID:10487',
  'http://www.nlm.nih.gov/cgi/mesh/2015/MB_cgi?field=uid&exact=Find+Exact+Term&term=D006627',
  'http://www.nlm.nih.gov/cgi/mesh/2015/MB_cgi?field=uid&exact=Find+Exact+Term&term=D020412',
  'http://www.disease-ontology.org/api/metadata/DOID:11372'],
 'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/15858239',
  'http://www.ncbi.nlm.nih.gov/pubmed/15829955',
  'http://www.ncbi.nlm.nih.gov/pubmed/6650562',
  'http://www.ncbi.nlm.nih.gov/pubmed/12239580',
  'http://www.ncbi.nlm.nih.gov/pubmed/21995290',
  'http://www.ncbi.nlm.nih.gov/pubmed/23001136',
  'http://www.ncbi.nlm.nih.gov/pubmed/15617541',
  'http://www.ncbi.nlm.nih.gov/pubmed/8896569',
  'http://www.ncbi.nlm.nih.gov/pubmed/20598273'],
 'id': '55031181e9bde69634000014',
 'ideal_answer': ["Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the developme

Inspect the keys of a question. Different question types have different sets of keys.

In [None]:
data['questions'][0].keys()

dict_keys(['body', 'documents', 'ideal_answer', 'concepts', 'type', 'id', 'snippets'])
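
To see how the key sets differ, one option is to collect the union of keys per question type. This is a minimal sketch on toy data: `keys_by_type` and `toy_questions` are hypothetical names, and the toy entries only mimic the field names visible in the outputs of this section. On the real data, `keys_by_type(data['questions'])` would be the call to try.

```python
from collections import defaultdict

def keys_by_type(questions):
    """Map each question type to the union of keys seen on questions of that type."""
    keys = defaultdict(set)
    for q in questions:
        keys[q["type"]].update(q.keys())
    return dict(keys)

# Toy stand-in for data['questions'] (contents are placeholders)
toy_questions = [
    {"type": "summary", "body": "...", "ideal_answer": ["..."]},
    {"type": "factoid", "body": "...", "ideal_answer": ["..."], "exact_answer": ["..."]},
]

for qtype, keys in keys_by_type(toy_questions).items():
    print(qtype, sorted(keys))
```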

Check distribution of question types.

In [None]:
from collections import Counter
type_distribution = Counter([x['type'] for x in data['questions']])

In [None]:
type_distribution

Counter({'factoid': 1092, 'list': 719, 'summary': 899, 'yesno': 1033})

The following function returns one or more randomly chosen examples of a specified question type. Use it to further explore the content of each question.

In [None]:
import random

def show_random_question(dataset, qtype="factoid", num_examples=1):
  # indices of all questions of the requested type
  candidates = [i for i, q in enumerate(dataset) if q['type'] == qtype]
  # check against the number of questions of this type; checking against
  # len(dataset) could leave the original retry loop running forever
  assert num_examples <= len(candidates), "Can't pick more elements than there are questions of this type."
  # sample without replacement so that no question is returned twice
  picks = random.sample(candidates, num_examples)
  picked_questions = [dataset[pick] for pick in picks]
  return picked_questions

In [None]:
show_random_question(data['questions'], "factoid")

[{'body': 'What is the ubiquitin proteome?',
  'concepts': ['http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D020543',
   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D014452',
   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D040901',
   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D054875',
   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D025801',
   'http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016567',
   'http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0031386',
   'http://www.nlm.nih.gov/cgi/mesh/2014/MB_cgi?field=uid&exact=Find+Exact+Term&term=D057149',
   'http://www.uniprot.org/uniprot/UBIQ_CERCA'],
  'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/22178446',
   'http://www.ncbi.nlm.nih.gov/pubmed/23743150',
   'http://www.ncbi.nlm.nih.gov/pubmed/23764619'],
  

The following function will return a list of questions of a particular type.

In [None]:
type_num_dic = {'factoid': 1092, 'list': 719, 'summary': 899, 'yesno': 1033}
def get_question_of_type(dataset, qtype):
  questions = [q for q in dataset if q['type'] == qtype]
  assert len(questions) == type_num_dic[qtype]
  return questions

Get all questions of 'factoid' type and inspect a random element.

In [None]:
qtype = 'factoid'
selected_questions = get_question_of_type(data['questions'], qtype) 
show_random_question(selected_questions, qtype, 1)

[{'body': 'What type of mutation is causing the industrial melanism phenotype in peppered moths?',
  'documents': ['http://www.ncbi.nlm.nih.gov/pubmed/27251284',
   'http://www.ncbi.nlm.nih.gov/pubmed/12298233',
   'http://www.ncbi.nlm.nih.gov/pubmed/12140267'],
  'exact_answer': ['transposable element insertion'],
  'id': '58a877cf38c171fb5b000004',
  'ideal_answer': ['The mutation event giving rise to industrial melanism in Britain was the insertion of a large, tandemly repeated, transposable element into the first intron of the gene cortex.'],
  'snippets': [{'beginSection': 'title',
    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/27251284',
    'endSection': 'title',
    'offsetInBeginSection': 0,
    'offsetInEndSection': 85,
    'text': 'The industrial melanism mutation in British peppered moths is a transposable element.'},
   {'beginSection': 'abstract',
    'document': 'http://www.ncbi.nlm.nih.gov/pubmed/27251284',
    'endSection': 'abstract',
    'offsetInBeginSection': 
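
Each question carries its supporting evidence in the `snippets` list, so a common preprocessing step is to flatten a question into one row per snippet. The following is a rough sketch under that assumption: `snippet_rows` is a hypothetical helper, and the toy question reuses only fields visible in the output above.

```python
def snippet_rows(question):
    """Yield one (question body, snippet text, source document URL) tuple per snippet."""
    for snippet in question.get("snippets", []):
        yield question["body"], snippet["text"], snippet["document"]

# Toy question built from the fields shown in the output above
toy_question = {
    "body": "What type of mutation is causing the industrial melanism phenotype in peppered moths?",
    "snippets": [
        {
            "document": "http://www.ncbi.nlm.nih.gov/pubmed/27251284",
            "beginSection": "title",
            "text": "The industrial melanism mutation in British peppered moths is a transposable element.",
        }
    ],
}

rows = list(snippet_rows(toy_question))
print(rows)
```

Questions without a `snippets` key simply yield no rows, which is why the helper uses `.get` with an empty default.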

**Task Synergy** 

[Task Synergy](http://participants-area.bioasq.org/general_information/TaskSynergy/)

In [None]:
# similar to Task9b

# MedQuAD 
[MedQuAD](https://github.com/abachaa/MedQuAD) includes 47,457 medical question-answer pairs created
from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.

Link to [Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4) for more information on dataset construction.

First, clone the MedQuAD GitHub repository. The repository contains 12 folders, each of which contains questions from one of the medical resources. Each folder holds multiple XML files. We will demonstrate how to extract relevant information from these XML files below.

In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

/content/drive/My Drive


In [None]:
!git clone https://github.com/abachaa/MedQuAD.git

Cloning into 'MedQuAD'...
remote: Enumerating objects: 11301, done.
remote: Total 11301 (delta 0), reused 0 (delta 0), pack-reused 11301
Receiving objects: 100% (11301/11301), 11.00 MiB | 3.72 MiB/s, done.
Resolving deltas: 100% (6803/6803), done.
Checking out files: 100% (11276/11276), done.


In [None]:
%cd MedQuAD/

/content/drive/My Drive/MedQuAD


We will show an example of parsing an xml file using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#id17). <br>

Install BeautifulSoup together with `lxml`, which the `'xml'` parser used below requires.

In [None]:
!pip install beautifulsoup4 lxml



The following function parses one XML file specified by `filename`.

In [None]:
from bs4 import BeautifulSoup

def parse_one_file(filename):
  with open(filename, 'r') as f:
    data = f.read()
  soup = BeautifulSoup(data, 'xml')
  info_dic = {}
  # parse Document tag
  document_id = soup.Document['id']
  source = soup.Document['source']
  url = soup.Document['url']
  info_dic['document_id'] = document_id
  info_dic['source'] = source
  info_dic['url'] = url
  # parse focus
  focus = soup.Focus.string # or soup.Focus.contents[0]
  info_dic['focus'] = focus
  # parse semantic group
  semantic_group = soup.SemanticGroup.string
  info_dic['semantic_group'] = semantic_group
  QA_pairs = []
  # parse QA pairs
  for QAPair in soup.find_all(pid=True):
    qid = QAPair.Question['qid']
    qtype = QAPair.Question['qtype']
    question = QAPair.Question.string
    answer = QAPair.Answer.string
    QA_pairs.append([qid, qtype, question, answer])
  info_dic['QA_pairs'] = QA_pairs
  return info_dic

The following function prints info_dic in a more readable format. 

In [None]:
def show_example(info_dic):
    keys = ['document_id', 'focus', 'semantic_group', 'source']
    output = ""
    for k in keys:
      output += "{:17}{}\n".format(k + ':', info_dic[k])
    output += "QA_pairs:\n"
    for qid, qtype, question, answer in info_dic['QA_pairs']:
      # .string is None when the <Answer> tag is empty or has nested markup
      answer = ' '.join((answer or '').split())
      output += "id: {:17}qtype: {}\n     Question:\t{}\n     Answer:\t{}\n".format(qid, qtype, question, answer)
    return output

In [None]:
example_file = "./1_CancerGov_QA/0000001_1.xml"
example_info_dic = parse_one_file(example_file)
print(show_example(example_info_dic))

document_id:     0000001_1
focus:           Adult Acute Lymphoblastic Leukemia
semantic_group:  Disorders
source:          CancerGov
QA_pairs:
id: 0000001_1-1      qtype: information
     Question:	What is (are) Adult Acute Lymphoblastic Leukemia ?
     Answer:	Key Points - Adult acute lymphoblastic leukemia (ALL) is a type of cancer in which the bone marrow makes too many lymphocytes (a type of white blood cell). - Leukemia may affect red blood cells, white blood cells, and platelets. - Previous chemotherapy and exposure to radiation may increase the risk of developing ALL. - Signs and symptoms of adult ALL include fever, feeling tired, and easy bruising or bleeding. - Tests that examine the blood and bone marrow are used to detect (find) and diagnose adult ALL. - Certain factors affect prognosis (chance of recovery) and treatment options. Adult acute lymphoblastic leukemia (ALL) is a type of cancer in which the bone marrow makes too many lymphocytes (a type of white blood cell). Adul

The following code parses all XML files in a specified directory.

In [None]:
import os
data_dir = './1_CancerGov_QA'  # renamed from `dir`, which shadows a Python builtin
# get all files in the specified directory, sorted so that indices are reproducible
files = sorted(f for f in os.listdir(data_dir) if os.path.isfile(os.path.join(data_dir, f)))
print(files)
dir_dicts = []
for f in files:
  info_dic = parse_one_file(os.path.join(data_dir, f))
  dir_dicts.append(info_dic)
# inspect one parsed dict in the directory (index 15 is an arbitrary choice)
print(show_example(dir_dicts[15]))

['0000001_1.xml', '0000001_2.xml', '0000001_3.xml', '0000001_4.xml', '0000001_5.xml', '0000001_6.xml', '0000001_7.xml', '0000003_1.xml', '0000003_2.xml', '0000003_3.xml', '0000003_4.xml', '0000003_5.xml', '0000003_6.xml', '0000004_1.xml', '0000004_2.xml', '0000004_3.xml', '0000004_4.xml', '0000004_5.xml', '0000004_6.xml', '0000004_7.xml', '0000005_1.xml', '0000005_2.xml', '0000006_1.xml', '0000006_2.xml', '0000006_3.xml', '0000006_4.xml', '0000006_5.xml', '0000006_6.xml', '0000006_7.xml', '0000006_8.xml', '0000006_9.xml', '0000007_1.xml', '0000007_2.xml', '0000007_3.xml', '0000007_4.xml', '0000007_5.xml', '0000009_1.xml', '0000009_2.xml', '0000010_1.xml', '0000013_1.xml', '0000013_2.xml', '0000013_2_1.xml', '0000013_2_2.xml', '0000013_2_3.xml', '0000013_2_4.xml', '0000013_2_5.xml', '0000013_2_6.xml', '0000013_3.xml', '0000013_3_1.xml', '0000013_3_2.xml', '0000013_3_3.xml', '0000013_3_4.xml', '0000014_1.xml', '0000014_2.xml', '0000014_3.xml', '0000014_4.xml', '0000015_1.xml', '0000016_1
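
To get an overview of the whole collection rather than a single folder, one can count QA pairs per folder. The sketch below deliberately swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` so that it runs without extra packages. `count_qa_pairs` is a hypothetical helper, and the sample XML is a made-up stand-in. The sample assumes only what the parser above relies on: QA pairs are elements with a `pid` attribute.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter

def count_qa_pairs(root_dir):
    """Count QA pairs per top-level folder under root_dir."""
    counts = Counter()
    for folder in sorted(os.listdir(root_dir)):
        folder_path = os.path.join(root_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for name in sorted(os.listdir(folder_path)):
            if not name.endswith(".xml"):
                continue
            tree = ET.parse(os.path.join(folder_path, name))
            # Assumption: every QA pair is an element carrying a `pid` attribute
            counts[folder] += sum(1 for el in tree.iter() if "pid" in el.attrib)
    return counts

# Made-up sample file (not copied from MedQuAD) to demonstrate the helper
sample_xml = """<Document id="0000001_1" source="CancerGov" url="http://example.org">
<Focus>Example focus</Focus>
<QAPairs>
<QAPair pid="1"><Question qid="0000001_1-1" qtype="information">Q1?</Question><Answer>A1</Answer></QAPair>
<QAPair pid="2"><Question qid="0000001_1-2" qtype="symptoms">Q2?</Question><Answer>A2</Answer></QAPair>
</QAPairs>
</Document>"""

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "1_CancerGov_QA"))
    with open(os.path.join(tmp, "1_CancerGov_QA", "0000001_1.xml"), "w") as f:
        f.write(sample_xml)
    demo_counts = count_qa_pairs(tmp)

print(demo_counts)
```

From inside the cloned `MedQuAD` folder, `count_qa_pairs('.')` would be the corresponding call.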

# LiveQA
[LiveQA](https://github.com/abachaa/LiveQA_MedicalTask_TREC2017), or TREC-2017 LiveQA: Medical Question Answering Task, focuses on consumer health question answering. Details of data creation can be found in the [paper](https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf). There are 634 question-answer pairs for training and 104 for testing.
An additional 2,479 judged answers are available with MedQuAD.


In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

Clone the GitHub repository.

In [None]:
!git clone https://github.com/abachaa/LiveQA_MedicalTask_TREC2017.git

In [None]:
%cd LiveQA_MedicalTask_TREC2017/

/content/drive/My Drive/LiveQA_MedicalTask_TREC2017


In [None]:
%cd TrainingDatasets/

/content/drive/My Drive/LiveQA_MedicalTask_TREC2017/TrainingDatasets


The following function parses an entire training file. It returns a list of dictionaries, each of which corresponds to one of the 200 questions.

In [None]:
from bs4 import BeautifulSoup
def parse_data(filename):
  with open(filename, 'r') as f:
    data = f.read()
  soup = BeautifulSoup(data, 'xml')
  info_dic = []
  # training_questions = soup.find_all('NLM-QUESTION') # cannot use soup.NLM-QUESTION because of hyphen
  # get list of questions
  training_questions = soup.find_all(questionid=True)
  print("Number of questions: ", len(training_questions))
  for example_q in training_questions:
    questionid = example_q['questionid']
    subject = example_q.SUBJECT.string
    message = example_q.MESSAGE.string
    # get list of all sub-questions
    sub_questions = example_q.find_all("SUB-QUESTION") 
    sub_q_dic = []
    for s in sub_questions:
      subqid = s['subqid']
      focus = s.FOCUS.string
      qtype = s.TYPE.string
      answers = s.find_all('ANSWER')
      answer_dics = []
      for a in answers:
        answer_dics.append({'answerid': a['answerid'], 'pairid': a['pairid'], 'atext': a.string})
      sub_q_dic.append({'subqid': subqid, 'focus': focus, 'qtype': qtype, 'answers': answer_dics})
    info_dic.append({'questioniid': questionid, 'subject': subject, 'message': message, 'sub-questions': sub_q_dic})
  return info_dic

In [None]:
filename = 'TREC-2017-LiveQA-Medical-Train-1.xml'
train_data = parse_data(filename)
# inspect 
train_data[0]

Number of questions:  200


{'message': 'Literature on Cardiac amyloidosis.  Please let me know where I can get literature on Cardiac amyloidosis.  My uncle died yesterday from this disorder.  Since this is such a rare disorder, and to honor his memory, I would like to distribute literature at his funeral service.  I am a retired NIH employee, so I am familiar with the campus in case you have literature at NIH that I can come and pick up.  Thank you ',
 'questioniid': 'Q1',
 'sub-questions': [{'answers': [{'answerid': 'Q1-S1-A1',
     'atext': 'Cardiac amyloidosis is a disorder caused by deposits of an abnormal protein (amyloid) in the heart tissue. These deposits make it hard for the heart to work properly.',
     'pairid': '1'},
    {'answerid': 'Q1-S1-A2',
     'atext': 'The term "amyloidosis" refers not to a single disease but to a collection of diseases in which a protein-based infiltrate deposits in tissues as beta-pleated sheets. The subtype of the disease is determined by which protein is depositing; alth
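
The nested result is convenient for inspection, but a flat list with one row per answer is often easier to work with, for example when loading it into a DataFrame. The sketch below assumes the dictionary layout produced by `parse_data` above, including its `'questioniid'` key spelling. `flatten_liveqa` is a hypothetical helper and the toy input is abbreviated placeholder data.

```python
def flatten_liveqa(parsed_questions):
    """Turn the nested output of parse_data into one row per (sub-question, answer)."""
    rows = []
    for q in parsed_questions:
        for sub in q["sub-questions"]:
            for ans in sub["answers"]:
                rows.append({
                    "questionid": q["questioniid"],  # key spelled as in parse_data above
                    "subqid": sub["subqid"],
                    "focus": sub["focus"],
                    "qtype": sub["qtype"],
                    "answerid": ans["answerid"],
                    "atext": ans["atext"],
                })
    return rows

# Abbreviated placeholder input in the same shape as train_data
toy_train_data = [{
    "questioniid": "Q1",
    "subject": "placeholder subject",
    "message": "placeholder message",
    "sub-questions": [{
        "subqid": "Q1-S1",
        "focus": "placeholder focus",
        "qtype": "placeholder type",
        "answers": [
            {"answerid": "Q1-S1-A1", "pairid": "1", "atext": "first answer"},
            {"answerid": "Q1-S1-A2", "pairid": "2", "atext": "second answer"},
        ],
    }],
}]

flat_rows = flatten_liveqa(toy_train_data)
print(len(flat_rows), flat_rows[0]["answerid"])
```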

# MEDIQA2019
The [MEDIQA2019](https://github.com/abachaa/MEDIQA2019) challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). There is one task for each of these three areas. In this section, we focus on [task 3 QA](https://github.com/abachaa/MEDIQA2019/tree/master/MEDIQA_Task3_QA).

In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
%cd ..

/content/drive/My Drive


In [None]:
!git clone https://github.com/abachaa/MEDIQA2019.git

In [None]:
%cd MEDIQA2019

/content/drive/My Drive/MEDIQA2019


In [None]:
%cd MEDIQA_Task3_QA

/content/drive/My Drive/MEDIQA2019/MEDIQA_Task3_QA


Task description: <br>
1) filter/classify the provided answers (1: correct, 0: incorrect) <br>
2) re-rank the answers <br>
Dataset: <br>
TrainingSet1 contains 104 consumer health questions covering different types of questions about diseases and drugs, and the associated answers. <br>
TrainingSet2 contains 104 simple questions about the most frequent diseases, and the associated answers.

The following function parses train/val/test xml files.

In [None]:
from bs4 import BeautifulSoup
def parse_file(filename):
  with open(filename, 'r') as f:
    data = f.read()
  soup = BeautifulSoup(data, 'xml')
  questions = soup.find_all('Question')
  question_dic = []
  for q in questions:
    QID = q['QID']
    # use .string; .STRING would look for a child tag named STRING and return None
    QuestionText = q.QuestionText.string
    AnswerList = q.AnswerList.find_all('Answer')
    answer_list_dic = []
    for answer in AnswerList:
      AID = answer['AID']
      SystemRank = answer['SystemRank']
      ReferenceRank = answer['ReferenceRank']
      ReferenceScore = answer['ReferenceScore']
      AnswerURL = answer.AnswerURL.string
      AnswerText = answer.AnswerText.string
      answer_list_dic.append({'AID': AID, 'SystemRank': SystemRank, 'ReferenceRank': ReferenceRank,
                             'ReferenceScore': ReferenceScore, 'AnswerURL': AnswerURL, 'AnswerText': AnswerText})
    # append once per question, after all of its answers have been collected
    question_dic.append({'QID': QID, 'QuestionText': QuestionText, 'Answer_dics': answer_list_dic})
  # return only after every question has been processed
  return question_dic

In [None]:
filename = './MEDIQA_Task3_QA/MEDIQA2019-Task3-QA-TrainingSet1-LiveQAMed.xml'
question_dic = parse_file(filename)

In [None]:
# inspect keys of first example
question_dic[0].keys()

dict_keys(['QID', 'QuestionText', 'Answer_dics'])

In [None]:
# inspect Answers for first example
question_dic[0]['Answer_dics']

[{'AID': '1_Answer1',
  'AnswerText': "Noonan syndrome: Noonan syndrome is a genetic disorder that prevents normal development in various parts of the body. A person can be affected by Noonan syndrome in a wide variety of ways. These include unusual facial characteristics, short stature, heart defects, other physical problems and possible developmental delays. Noonan syndrome is caused by a genetic mutation and is acquired when a child inherits a copy of an affected gene from a parent (dominant inheritance). It can also occur as a spontaneous mutation, meaning there's no family history involved. Management of Noonan syndrome focuses on controlling the disorder's symptoms and complications. Growth hormone may be used to treat short stature in some people with Noonan syndrome. Signs and symptoms of Noonan syndrome vary greatly among individuals and may be mild to severe. Characteristics may be related to the specific gene containing the mutation. Facial appearance is one of the key clini
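
The two sub-tasks (classify, then re-rank) can be illustrated directly on the parsed answers. Note that the XML attributes arrive as strings, so they need converting to integers before sorting or thresholding. This is only a sketch: `rank_and_label` is a hypothetical helper, the threshold `min_correct_score=3` is an assumption that should be checked against the task description, and the toy ranks and scores are invented.

```python
def rank_and_label(answer_dics, min_correct_score=3):
    """Sort answers by reference rank and attach a binary correctness label."""
    # int() matters: as strings, "10" would sort before "2"
    ranked = sorted(answer_dics, key=lambda a: int(a["ReferenceRank"]))
    return [
        {
            "AID": a["AID"],
            "rank": int(a["ReferenceRank"]),
            # Assumption: scores at or above the threshold count as correct (1)
            "label": int(int(a["ReferenceScore"]) >= min_correct_score),
        }
        for a in ranked
    ]

# Invented values in the same shape as question_dic[0]['Answer_dics']
toy_answers = [
    {"AID": "1_Answer1", "ReferenceRank": "2", "ReferenceScore": "3"},
    {"AID": "1_Answer2", "ReferenceRank": "10", "ReferenceScore": "1"},
    {"AID": "1_Answer3", "ReferenceRank": "1", "ReferenceScore": "4"},
]

for row in rank_and_label(toy_answers):
    print(row)
```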

# Medication_QA_MedInfo2019
[Medication_QA_MedInfo2019](https://github.com/abachaa/Medication_QA_MedInfo2019) is the gold standard corpus for medication question answering. The dataset consists of 674 question-answer pairs with annotations of the question focus, type, and the answer source. 

In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

In [None]:
!git clone https://github.com/abachaa/Medication_QA_MedInfo2019.git

Cloning into 'Medication_QA_MedInfo2019'...
remote: Enumerating objects: 18, done.
remote: Total 18 (delta 0), reused 0 (delta 0), pack-reused 18
Unpacking objects: 100% (18/18), done.


In [None]:
%cd Medication_QA_MedInfo2019/

/content/drive/MyDrive/Medication_QA_MedInfo2019


The dataset is stored in an Excel sheet. Load it into a DataFrame using pandas and inspect the first 3 rows of data. <br>


In [None]:
import pandas as pd
file = 'MedInfo2019-QA-Medications.xlsx'
df = pd.read_excel(file)
df.head(3)

| | Question | Focus (Drug) | Question Type | Answer | Section Title | URL |
|---|---|---|---|---|---|---|
| 0 | how does rivatigmine and otc sleep medicine in... | rivastigmine | Interaction | tell your doctor and pharmacist what prescript... | What special precautions should I follow? | https://medlineplus.gov/druginfo/meds/a602009.... |
| 1 | how does valium affect the brain | Valium | Action | Diazepam is a benzodiazepine that exerts anxio... | CLINICAL PHARMACOLOGY | https://dailymed.nlm.nih.gov/dailymed/drugInfo... |
| 2 | what is morphine | morphine | Information | Morphine is a pain medication of the opiate fa... | | https://en.wikipedia.org/wiki/Morphine |


In [None]:
print("Number of examples: ", len(df))
print("Column: ", df.columns)

Number of examples:  690
Column:  Index(['Question', 'Focus (Drug)', 'Question Type', 'Answer', 'Section Title',
       'URL'],
      dtype='object')


In [None]:
# summary of the dataset
df.describe()

| | Question | Focus (Drug) | Question Type | Answer | Section Title | URL |
|---|---|---|---|---|---|---|
| count | 690 | 689 | 690 | 689 | 617 | 677 |
| unique | 651 | 515 | 37 | 652 | 262 | 591 |
| top | what does memantine look like | marijuana | Information | No answers | DOSAGE AND ADMINISTRATION | https://medlineplus.gov/marijuana.html |
| freq | 4 | 14 | 112 | 8 | 60 | 8 |


In [None]:
# check for duplicated rows
df[df.duplicated(keep=False)]

| | Question | Focus (Drug) | Question Type | Answer | Section Title | URL |
|---|---|---|---|---|---|---|
| 405 | does marijuana use lead to negative health out... | marijuana | Side effects | Marijuana can cause problems with memory, lear... | Summary | https://medlineplus.gov/marijuana.html |
| 432 | does marijuana use lead to negative health out... | marijuana | Side effects | Marijuana can cause problems with memory, lear... | Summary | https://medlineplus.gov/marijuana.html |
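
Once duplicated rows have been identified, a natural next step is to drop them and look at the distribution of question types. The sketch below uses a small hand-made DataFrame (`toy_df`, a hypothetical stand-in with shortened answers) so that it runs without the Excel file. On the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hand-made stand-in for df; the first and last rows are exact duplicates
toy_df = pd.DataFrame({
    "Question": ["what is morphine", "how does valium affect the brain", "what is morphine"],
    "Question Type": ["Information", "Action", "Information"],
    "Answer": ["Morphine is a pain medication ...", "Diazepam is a benzodiazepine ...",
               "Morphine is a pain medication ..."],
})

# keep the first occurrence of each fully identical row
deduped = toy_df.drop_duplicates().reset_index(drop=True)
# number of remaining questions per question type
type_counts = deduped["Question Type"].value_counts()

print(len(toy_df), "->", len(deduped))
print(type_counts)
```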


# BiQA
[BiQA](https://github.com/lasigeBioTM/BiQA) provides scientific question answering corpora generated from Q&A forums (StackExchange and Reddit), covering Biology, Medical Sciences, and Nutrition.


In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

/content/drive/My Drive


In [None]:
!git clone https://github.com/lasigeBioTM/BiQA.git

Cloning into 'BiQA'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 64 (delta 31), reused 43 (delta 18), pack-reused 0
Unpacking objects: 100% (64/64), done.


In [None]:
%cd BiQA/

/content/drive/My Drive/BiQA


In [None]:
%cd april2020

/content/drive/My Drive/BiQA/april2020


In [None]:
import pandas as pd
filename = 'biology_202004.csv'
df = pd.read_csv(filename)
df.head(5)

| | question_id | answer_id | question_text | question_score | pmid | pmtitle |
|---|---|---|---|---|---|---|
| 0 | 21216 | 21219 | Why do I only breathe out of one nostril? | 286 | 7876041 | EEG changes during forced alternate nostril br... |
| 1 | 56476 | 56498 | Why are so few foods blue? | 190 | 11598230 | Why leaves turn red in autumn. The role of ant... |
| 2 | 30116 | 30126 | Does DNA have the equivalent of IF-statements,... | 153 | 15922833 | Transcriptional interference--a crash course. |
| 3 | 937 | 939 | How many times did terrestrial life emerge fro... | 149 | 15535883 | A genomic timescale of prokaryote evolution: i... |
| 4 | 937 | 939 | How many times did terrestrial life emerge fro... | 149 | 20204349 | The influence of different land uses on the st... |
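
As rows 3 and 4 show, a question appears once per linked PubMed article, so grouping by `question_id` recovers the list of articles for each question. The following is a minimal sketch on a hand-made frame: `biqa_toy` is a hypothetical stand-in that reuses three `question_id`/`pmid` pairs from the output above.

```python
import pandas as pd

# Stand-in for df, using id/pmid pairs from the table above
biqa_toy = pd.DataFrame({
    "question_id": [21216, 937, 937],
    "pmid": [7876041, 15535883, 20204349],
})

# one entry per question, holding all PubMed IDs linked to it
pmids_per_question = biqa_toy.groupby("question_id")["pmid"].apply(list).to_dict()
print(pmids_per_question)
```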


# MASHQA
[MASHQA](https://github.com/mingzhu0527/MASHQA)

Download the data zip file from the GitHub repository and upload it to Google Drive.

In [None]:
# If you are in the folder of another dataset, uncomment and run the following command
# %cd ..

/content/drive/My Drive


In [None]:
%cd mashqa_data/

/content/drive/My Drive/mashqa_data


In [None]:
!ls

test_webmd_squad_v2_consec.json   train_webmd_squad_v2_full.json
test_webmd_squad_v2_full.json	  val_webmd_squad_v2_consec.json
train_webmd_squad_v2_consec.json  val_webmd_squad_v2_full.json


In [None]:
import json
data_file = 'train_webmd_squad_v2_consec.json'
# use a context manager so the file handle is closed after loading
with open(data_file) as f:
    data = json.load(f)

In [None]:
# inspect keys 
data.keys()

dict_keys(['version', 'data'])

In [None]:
data['version']

'2.0'

In [None]:
print(type(data['data']))
print(type(data['data'][0]))
print(data['data'][0].keys())
print('Title:\n', data['data'][0]['title'])
paragraphs = data['data'][0]['paragraphs']
paragraphs[0]

<class 'list'>
<class 'dict'>
dict_keys(['title', 'paragraphs'])
Title:
 https://www.webmd.com/eye-health/understanding-glaucoma-treatment


{'context': "Treatment of open-angle glaucoma -- the most common form of the disease -- requires lowering the eye's pressure by increasing the drainage of aqueous humor fluid or decreasing the production of that fluid. Medications can accomplish both of these goals. Surgery and laser treatments are directed at improving the eye's aqueous drainage. If not diagnosed early, open-angle glaucoma may significantly damage vision and even cause blindness. That is why it's so important to have your eye doctor test you regularly for glaucoma. Once diagnosed, glaucoma is usually controlled with eye drops that reduce eye pressure. Glaucoma is a life-long condition and needs continual follow-up with your eye doctor. Both drugs and surgery have high rates of success in treating chronic open-angle glaucoma, but you can help yourself by carefully following the doctor's treatment plan. Some patients may find it difficult to follow a regimen involving two or three different eye drops. Be candid and tell
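
The file names suggest a SQuAD v2.0-style layout, in which each paragraph holds a list of questions next to its `context`. The sketch below walks such a structure. Only `data`, `title`, `paragraphs`, and `context` are confirmed by the outputs above. The keys `qas`, `question`, `answers`, and `text` are assumptions borrowed from the SQuAD format and should be verified against the actual file. `iter_qa` and `toy_data` are hypothetical.

```python
def iter_qa(squad_like):
    """Yield (title, question, answer_texts) for every question in a SQuAD-style dict."""
    for article in squad_like["data"]:
        for paragraph in article["paragraphs"]:
            # Assumption: questions live under "qas", as in SQuAD v2.0
            for qa in paragraph.get("qas", []):
                answers = [a.get("text") for a in qa.get("answers", [])]
                yield article["title"], qa.get("question"), answers

# Placeholder data in the assumed layout
toy_data = {
    "version": "2.0",
    "data": [{
        "title": "https://www.webmd.com/eye-health/understanding-glaucoma-treatment",
        "paragraphs": [{
            "context": "placeholder context",
            "qas": [{"question": "placeholder question?",
                     "answers": [{"text": "placeholder answer"}]}],
        }],
    }],
}

qa_rows = list(iter_qa(toy_data))
print(qa_rows)
```

The `.get` calls keep the loop from failing if a paragraph or question lacks one of the assumed keys.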

# EPIC QA
[EPICQA](https://bionlp.nlm.nih.gov/epic_qa/)
aims to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related coronaviruses, and the recommended response to the pandemic.

The challenge has two tasks: <br>
1) Expert QA: In Task A, teams are provided with a set of questions asked by experts and are asked to provide a ranked list of expert-level answers to each question. In Task A, answers should provide information that is useful to researchers, scientists, or clinicians. <br>  
2) Consumer QA: In Task B, teams are provided with a set of questions asked by consumers and are asked to provide a ranked list of consumer-friendly answers to each question. In Task B, answers should be understandable by the general public. <br>

In [None]:
# json file
# document collection 1.4GB

# emrQA: A Large Corpus for Question Answering on Electronic Medical Records
[emrQA](https://github.com/panushri25/emrQA)

Access to the data requires registration.

# HealthQA
[GitHub](https://github.com/mingzhu0527/HAR)

Access to the data must be requested by email.