# BioRED

This notebook explores the BioRED relation extraction dataset, introduced in https://doi.org/10.1093/bib/bbac282.

**Note**: This dataset is sourced from https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/, the official source linked in the paper.

In [1]:
import json

from pprint import pprint

Load the dataset and inspect its structure.

In [None]:
def load_dataset(dataset_path):
    """Read one BioC JSON split from disk and return the parsed object."""
    with open(dataset_path, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    return dataset

In [5]:
ds_train = load_dataset("/home/bt19d200/Ayaan/raw-datasets/BIORED/Train.BioC.JSON")
ds_dev = load_dataset("/home/bt19d200/Ayaan/raw-datasets/BIORED/Dev.BioC.JSON")
ds_test = load_dataset("/home/bt19d200/Ayaan/raw-datasets/BIORED/Test.BioC.JSON")

In [14]:
print("Dataset keys:", ds_train.keys())
for key, val in ds_train.items():
    print(f"{key} type: {type(val)}")

Dataset keys: dict_keys(['source', 'date', 'key', 'documents'])
source type: <class 'str'>
date type: <class 'str'>
key type: <class 'str'>
documents type: <class 'list'>


The documents are stored as a list under the 'documents' key; the remaining keys hold metadata strings.

In [15]:
ds_train = ds_train['documents']
ds_dev = ds_dev['documents']
ds_test = ds_test['documents']
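Once each split is reduced to a list of documents, a quick size summary helps when comparing splits. The following is a minimal sketch, not part of the original notebook: `split_summary` and `toy_docs` are hypothetical names, and the toy document only mimics the layout this dataset uses (documents with `passages`, per-passage `annotations`, and document-level `relations`).

```python
def split_summary(documents):
    """Count documents, passages, entity mentions, and relations in one split."""
    return {
        "documents": len(documents),
        "passages": sum(len(doc["passages"]) for doc in documents),
        "mentions": sum(
            len(passage["annotations"])
            for doc in documents
            for passage in doc["passages"]
        ),
        # .get guards against documents without a 'relations' key (an assumption).
        "relations": sum(len(doc.get("relations", [])) for doc in documents),
    }


# Hypothetical toy data in the same shape as a BioRED document.
toy_docs = [
    {
        "id": "toy-1",
        "passages": [
            {"text": "Insulin and diabetes", "annotations": [{"id": "0"}, {"id": "1"}]},
            {"text": "We study insulin.", "annotations": [{"id": "2"}]},
        ],
        "relations": [{"id": "R0"}],
    }
]

toy_summary = split_summary(toy_docs)
print(toy_summary)
```

On the real data, the same function could be called as `split_summary(ds_train)`, `split_summary(ds_dev)`, and `split_summary(ds_test)`.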

In [17]:
pprint(ds_train[0])

{'id': '10491763',
 'passages': [{'annotations': [{'id': '0',
                                'infons': {'identifier': '3175',
                                           'type': 'GeneOrGeneProduct'},
                                'locations': [{'length': 27, 'offset': 0}],
                                'text': 'Hepatocyte nuclear factor-6'},
                               {'id': '1',
                                'infons': {'identifier': 'D003924',
                                           'type': 'DiseaseOrPhenotypicFeature'},
                                'locations': [{'length': 16, 'offset': 74}],
                                'text': 'type II diabetes'},
                               {'id': '2',
                                'infons': {'identifier': '3630',
                                           'type': 'GeneOrGeneProduct'},
                                'locations': [{'length': 7, 'offset': 140}],
                                'text': 'insulin'}],
          

Verify whether entity offsets index into the title and abstract joined by a single space.

In [23]:
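def verify_offset(example):
    """Check that the first title annotation's offset recovers its text."""
    # Note: only the first annotation of the title passage is inspected.
    # Annotations in the abstract are the ones whose offsets depend on the separator.
    ent = example['passages'][0]['annotations'][0]
    offset = ent['locations'][0]
    text = example['passages'][0]['text'] + ' ' + example['passages'][1]['text']
    return text[offset['offset']: offset['offset'] + offset['length']] == ent['text']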

In [24]:
print(verify_offset(ds_train[0]))

True
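The check above looks at a single title mention. A more thorough variant would test every annotation in both passages, since only abstract mentions are affected by the separator. The sketch below is one way to do that under two assumptions: offsets index into the passage texts joined by one space, and each annotation has a single contiguous location. `find_offset_mismatches` and `toy_example` are hypothetical names introduced for illustration.

```python
def find_offset_mismatches(example, separator=" "):
    """Return (id, expected text, found text) for annotations whose offsets fail.

    Assumes document-level offsets index into the passage texts joined by
    `separator`, and that each annotation has one contiguous location.
    """
    full_text = separator.join(p["text"] for p in example["passages"])
    mismatches = []
    for passage in example["passages"]:
        for ann in passage["annotations"]:
            loc = ann["locations"][0]
            found = full_text[loc["offset"]: loc["offset"] + loc["length"]]
            if found != ann["text"]:
                mismatches.append((ann["id"], ann["text"], found))
    return mismatches


# Hypothetical toy document; offsets were computed for a one-space separator.
toy_example = {
    "passages": [
        {
            "text": "Insulin and diabetes",
            "annotations": [
                {"id": "0", "text": "Insulin", "locations": [{"offset": 0, "length": 7}]},
                {"id": "1", "text": "diabetes", "locations": [{"offset": 12, "length": 8}]},
            ],
        },
        {
            "text": "We study insulin.",
            "annotations": [
                {"id": "2", "text": "insulin", "locations": [{"offset": 30, "length": 7}]},
            ],
        },
    ]
}

print(find_offset_mismatches(toy_example))                # no mismatches with a space
print(find_offset_mismatches(toy_example, separator=""))  # abstract mention is shifted
```

Running it over a whole split, for example `sum(len(find_offset_mismatches(ex)) for ex in ds_train)`, would show whether the one-space assumption holds for every mention.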


Count entity repeats: the maximum number of mentions of one entity identifier within a document, and the maximum number of locations (offsets) per mention.

In [27]:
def max_ent_counts(dataset):
    """Return the max locations per mention and max mentions per entity identifier."""
    max_count, max_offsets = 0, 0
    for ex in dataset:
        ent_count = {}
        # Two passages per document: title and abstract.
        entities = ex['passages'][0]['annotations'] + ex['passages'][1]['annotations']
        for ent in entities:
            ent_id = ent['infons']['identifier']
            ent_count[ent_id] = ent_count.get(ent_id, 0) + 1
            max_offsets = max(max_offsets, len(ent['locations']))

        # Largest number of mentions of a single identifier in this document.
        max_count = max(max_count, max(ent_count.values(), default=0))

    return {'max_offsets': max_offsets, 'max_repeats': max_count}

In [28]:
print("Entity counts:")
print("Train:")
pprint(max_ent_counts(ds_train))
print()
print("Test:")
pprint(max_ent_counts(ds_test))

Entity counts:
Train:
{'max_offsets': 1, 'max_repeats': 29}

Test:
{'max_offsets': 1, 'max_repeats': 17}


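Each mention has a single location (`max_offsets` is 1), but one entity identifier is mentioned up to 29 times within a single training document (17 in test). A mention-level end-to-end RE formulation therefore cannot be applied directly, because one entity can correspond to many mentions in a document.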
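One way to work at the entity level instead is to group mentions by their `identifier`. The sketch below assumes that mentions sharing an identifier refer to the same entity and that each mention has a single location. `group_mentions_by_entity` and `toy_example` are hypothetical names, not part of the original notebook.

```python
from collections import defaultdict


def group_mentions_by_entity(example):
    """Map each entity identifier to the list of its mentions in one document."""
    entities = defaultdict(list)
    for passage in example["passages"]:
        for ann in passage["annotations"]:
            loc = ann["locations"][0]  # assumes one location per mention
            entities[ann["infons"]["identifier"]].append(
                {
                    "text": ann["text"],
                    "type": ann["infons"]["type"],
                    "offset": loc["offset"],
                    "length": loc["length"],
                }
            )
    return dict(entities)


# Hypothetical toy document with one gene mentioned twice.
toy_example = {
    "passages": [
        {
            "annotations": [
                {"infons": {"identifier": "3630", "type": "GeneOrGeneProduct"},
                 "locations": [{"offset": 0, "length": 7}], "text": "Insulin"},
                {"infons": {"identifier": "D003924", "type": "DiseaseOrPhenotypicFeature"},
                 "locations": [{"offset": 12, "length": 8}], "text": "diabetes"},
            ]
        },
        {
            "annotations": [
                {"infons": {"identifier": "3630", "type": "GeneOrGeneProduct"},
                 "locations": [{"offset": 30, "length": 7}], "text": "insulin"},
            ]
        },
    ]
}

grouped = group_mentions_by_entity(toy_example)
for identifier, mentions in grouped.items():
    print(identifier, [m["text"] for m in mentions])
```

Grouping this way also exposes the textual variations of an entity (here "Insulin" and "insulin") alongside their offsets.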

In [29]:
def type_statistics(ds):
    """Count entity mentions per entity type and relations per relation type."""
    ent_types = {}
    rel_types = {}
    for ex in ds:
        # Two passages per document: title and abstract.
        entities = ex['passages'][0]['annotations'] + ex['passages'][1]['annotations']
        for ent in entities:
            ent_type = ent['infons']['type']
            ent_types[ent_type] = ent_types.get(ent_type, 0) + 1

        for rel in ex['relations']:
            rel_type = rel['infons']['type']
            rel_types[rel_type] = rel_types.get(rel_type, 0) + 1

    return {'ent_stats': ent_types, 'rel_stats': rel_types}
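Raw counts can be hard to compare across splits of different sizes, so it may help to normalize them into shares. This is a minimal sketch: `type_shares` is a hypothetical helper, and the counts in `toy_counts` are made up for illustration rather than taken from the dataset.

```python
def type_shares(counts):
    """Convert a {label: count} mapping into {label: fraction of total}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical counts, not BioRED statistics.
toy_counts = {"Association": 6, "Bind": 3, "Conversion": 1}
toy_shares = type_shares(toy_counts)

for label, share in sorted(toy_shares.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:.1%}")
```

Applied to the notebook's data, this could be called as `type_shares(type_statistics(ds_train)['rel_stats'])`.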

In [30]:
print("Train stats:")
pprint(type_statistics(ds_train))
print()

print("Test stats:")
pprint(type_statistics(ds_test))

Train stats:
{'ent_stats': {'CellLine': 103,
               'ChemicalEntity': 2853,
               'DiseaseOrPhenotypicFeature': 3646,
               'GeneOrGeneProduct': 4430,
               'OrganismTaxon': 1429,
               'SequenceVariant': 890},
 'rel_stats': {'Association': 2192,
               'Bind': 61,
               'Comparison': 28,
               'Conversion': 3,
               'Cotreatment': 31,
               'Drug_Interaction': 11,
               'Negative_Correlation': 763,
               'Positive_Correlation': 1089}}

Test stats:
{'ent_stats': {'CellLine': 50,
               'ChemicalEntity': 754,
               'DiseaseOrPhenotypicFeature': 917,
               'GeneOrGeneProduct': 1180,
               'OrganismTaxon': 393,
               'SequenceVariant': 241},
 'rel_stats': {'Association': 635,
               'Bind': 9,
               'Comparison': 6,
               'Conversion': 1,
               'Cotreatment': 14,
               'Drug_Interaction': 2,
      