# CDR Dataset

This notebook explores the CDR Relation Extraction dataset from https://doi.org/10.1093/database/baw068.

**Note**: This dataset is sourced from the BioC website (https://bioc.sourceforge.net/; also available at https://github.com/JHnlp/BioCreative-V-CDR-Corpus) because the source linked in the paper is no longer available.

In [11]:
import json

from bioc import biocxml, biocjson
from pprint import pprint

In [12]:
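def load_file(path):
    """Load a BioC XML collection and return it as plain dicts and lists."""
    with open(path, 'r') as fp:
        collection = biocxml.load(fp)

    # Round-trip through BioC JSON to get built-in Python types.
    json_string = biocjson.dumps(collection)
    return json.loads(json_string)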

In [13]:
train_path = "/home/bt19d200/Ayaan/raw-datasets/BC5CDR/CDR_Data/CDR.Corpus.v010516/CDR_TrainingSet.BioC.xml"
val_path = "/home/bt19d200/Ayaan/raw-datasets/BC5CDR/CDR_Data/CDR.Corpus.v010516/CDR_DevelopmentSet.BioC.xml"
test_path = "/home/bt19d200/Ayaan/raw-datasets/BC5CDR/CDR_Data/CDR.Corpus.v010516/CDR_TestSet.BioC.xml"

ds_train = load_file(train_path)
ds_val = load_file(val_path)
ds_test = load_file(test_path)

#### Verify Keys

In [17]:
pprint(ds_train.keys())

dict_keys(['bioctype', 'source', 'date', 'key', 'version', 'infons', 'documents'])


In [19]:
pprint(ds_train['bioctype'])

'BioCCollection'


In [20]:
pprint(ds_train['source'])

'PubTator'


In [21]:
pprint(ds_train['date'])

'0/0/0'


In [22]:
pprint(ds_train['key'])

'PubTator.key'


In [23]:
pprint(ds_train['version'])

'1.0'


In [24]:
pprint(ds_train['infons'])

{}


In [25]:
pprint(ds_train['documents'])

[{'annotations': [],
  'bioctype': 'BioCDocument',
  'id': '227508',
  'infons': {},
  'passages': [{'annotations': [{'id': '0',
                                 'infons': {'MESH': 'D009270',
                                            'type': 'Chemical'},
                                 'locations': [{'length': 8, 'offset': 0}],
                                 'text': 'Naloxone'},
                                {'id': '1',
                                 'infons': {'MESH': 'D003000',
                                            'type': 'Chemical'},
                                 'locations': [{'length': 9, 'offset': 49}],
                                 'text': 'clonidine'}],
                'bioctype': 'BioCPassage',
                'infons': {'type': 'title'},
                'offset': 0,
                'relations': [],
                'sentences': [],
                'text': 'Naloxone reverses the antihypertensive effect of '
                        'clonidine.'},
          

The 'documents' key contains the data.

#### Explore the dataset

In [26]:
ds_train = ds_train['documents']
ds_val = ds_val['documents']
ds_test = ds_test['documents']

In [28]:
example = ds_train[0]

pprint(example)

{'annotations': [],
 'bioctype': 'BioCDocument',
 'id': '227508',
 'infons': {},
 'passages': [{'annotations': [{'id': '0',
                                'infons': {'MESH': 'D009270',
                                           'type': 'Chemical'},
                                'locations': [{'length': 8, 'offset': 0}],
                                'text': 'Naloxone'},
                               {'id': '1',
                                'infons': {'MESH': 'D003000',
                                           'type': 'Chemical'},
                                'locations': [{'length': 9, 'offset': 49}],
                                'text': 'clonidine'}],
               'bioctype': 'BioCPassage',
               'infons': {'type': 'title'},
               'offset': 0,
               'relations': [],
               'sentences': [],
               'text': 'Naloxone reverses the antihypertensive effect of '
                       'clonidine.'},
              {'annotations': [

Check how the dataset differs from standard general-domain datasets.

In [33]:
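def clarify_ent_info(dataset):
    """Summarise how entity annotations deviate from one span per entity.

    max_count: most annotations sharing one MESH ID within a single document.
    max_offsets: most locations attached to a single annotation.
    max_offset_index: index of the first document that reaches max_offsets.
    """
    max_count, max_offsets = 1, 1
    max_offset_index = None
    for i, example in enumerate(dataset):
        passages = example['passages']
        # Assumes two passages per document (title, then abstract).
        entities = passages[0]['annotations'] + passages[1]['annotations']
        ent_count = {}
        for ent in entities:
            ent_id = ent['infons']['MESH']
            ent_count[ent_id] = ent_count.get(ent_id, 0) + 1
            offsets = len(ent['locations'])
            if offsets > max_offsets:
                max_offsets = offsets
                max_offset_index = i

        # default=0 guards against a document without annotations.
        max_count = max(max_count, max(ent_count.values(), default=0))

    return {'max_offsets': max_offsets, 'max_count': max_count, 'max_offset_index': max_offset_index}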

In [40]:
print("--- Dataset Entity Peculiarities ---")
print("Train:")
pprint(clarify_ent_info(ds_train))
print()
print("Validation:")
pprint(clarify_ent_info(ds_val))
print()
print("Test:")
pprint(clarify_ent_info(ds_test))

--- Dataset Entity Peculiarities ---
Train:
{'max_count': 29, 'max_offset_index': 8, 'max_offsets': 2}

Validation:
{'max_count': 17, 'max_offset_index': 6, 'max_offsets': 2}

Test:
{'max_count': 22, 'max_offset_index': 19, 'max_offsets': 2}


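General-domain datasets usually have a single mention/span per entity. Here, the same MESH ID is annotated up to 29 times in one training document, and a single annotation can have two locations.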
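
A `max_offsets` of 2 means that some annotations are made of two non-contiguous spans. The following is a minimal sketch for listing such annotations and recovering their pieces from the passage text. It assumes the document structure printed earlier and that location offsets are document-level, as is usual in BioC. `discontinuous_mentions` is a hypothetical helper, and `toy_doc` is a toy document in the corpus format, not real data.

```python
def discontinuous_mentions(doc):
    """Yield (annotation id, text, pieces) for annotations with several locations."""
    for passage in doc["passages"]:
        base = passage["offset"]  # assumption: locations use document-level offsets
        text = passage["text"]
        for ann in passage["annotations"]:
            locs = ann["locations"]
            if len(locs) > 1:
                pieces = []
                for loc in locs:
                    start = loc["offset"] - base
                    pieces.append(text[start:start + loc["length"]])
                yield ann["id"], ann["text"], pieces


# Toy document: one passage holding a contiguous and a two-part annotation.
toy_doc = {
    "passages": [
        {
            "offset": 0,
            "text": "Ocular and auditory toxicity",
            "annotations": [
                {"id": "0", "text": "Ocular and auditory toxicity",
                 "locations": [{"offset": 0, "length": 28}]},
                {"id": "1", "text": "Ocular toxicity",
                 "locations": [{"offset": 0, "length": 6},
                               {"offset": 20, "length": 8}]},
            ],
        }
    ]
}

result = list(discontinuous_mentions(toy_doc))
print(result)  # [('1', 'Ocular toxicity', ['Ocular', 'toxicity'])]
```

On the real data, the same helper could be run over `ds_train` to see how common two-part annotations are.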

In [None]:
pprint(ds_train[8])

{'annotations': [],
 'bioctype': 'BioCDocument',
 'id': '2234245',
 'infons': {},
 'passages': [{'annotations': [{'id': '0',
                                'infons': {'CompositeRole': 'CompositeMention',
                                           'MESH': 'D014786|D006311',
                                           'type': 'Disease'},
                                'locations': [{'length': 28, 'offset': 0}],
                                'text': 'Ocular and auditory toxicity'},
                               {'id': '1',
                                'infons': {'CompositeRole': 'IndividualMention',
                                           'MESH': 'D014786',
                                           'type': 'Disease'},
                                'locations': [{'length': 6, 'offset': 0},
                                              {'length': 8, 'offset': 20}],
                                'text': 'Ocular toxicity'},
                               {'id': '2',
           

Looking at the example, we can observe the following:
- Some entities have two locations while others have one (both single and multi-word alike)
- Some have 'CompositeRole' in their 'infons' while others do not
- Some have 'IndividualMention' as their 'CompositeRole' while others have 'CompositeMention'
- Several annotations have overlapping offsets
- The same entity occurs multiple times in a document, sometimes with different textual representations
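
The overlap observation can be checked programmatically. Below is a rough sketch, assuming each annotation has an `id` and a list of `locations` with `offset` and `length`, as in the printed example. `overlapping_pairs` is a hypothetical helper, and the third toy annotation is invented to show a non-overlapping case.

```python
from itertools import combinations


def overlapping_pairs(passage):
    """Return sorted pairs of annotation ids whose character spans intersect."""
    spans = []
    for ann in passage["annotations"]:
        for loc in ann["locations"]:
            start = loc["offset"]
            spans.append((start, start + loc["length"], ann["id"]))

    pairs = set()
    for (s1, e1, id1), (s2, e2, id2) in combinations(spans, 2):
        # Half-open intervals [s, e) overlap when each starts before the other ends.
        if id1 != id2 and s1 < e2 and s2 < e1:
            pairs.add((id1, id2))
    return sorted(pairs)


# Toy passage: annotation "1" lies inside "0"; annotation "2" is separate.
toy_passage = {
    "annotations": [
        {"id": "0", "locations": [{"offset": 0, "length": 28}]},
        {"id": "1", "locations": [{"offset": 0, "length": 6},
                                  {"offset": 20, "length": 8}]},
        {"id": "2", "locations": [{"offset": 40, "length": 15}]},
    ]
}

overlaps = overlapping_pairs(toy_passage)
print(overlaps)  # [('0', '1')]
```

The pairwise comparison is quadratic in the number of spans, which should be acceptable for a single title or abstract.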

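Entity spans with a 'CompositeRole' come from composite mentions, where one span clubs together distinct entities. In the example above, the 'CompositeMention' span 'Ocular and auditory toxicity' carries two MESH IDs separated by a '|' ('D014786|D006311'), while the 'IndividualMention' span 'Ocular toxicity' picks out one of them using two locations. We have to decide how to incorporate these spans into the train, validation, and test data.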
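
One possible way to handle such spans is to split the MESH field on '|' and to treat 'CompositeMention' spans separately from the rest. This is only a sketch of one option, not a settled preprocessing decision; `split_mesh_ids` and `is_composite` are hypothetical helper names, and the two annotations are trimmed copies of the example above.

```python
def split_mesh_ids(annotation):
    """Return the list of MESH IDs attached to one annotation."""
    return annotation["infons"]["MESH"].split("|")


def is_composite(annotation):
    """True when the span is marked as a composite mention."""
    return annotation["infons"].get("CompositeRole") == "CompositeMention"


composite = {
    "infons": {"CompositeRole": "CompositeMention",
               "MESH": "D014786|D006311",
               "type": "Disease"},
    "text": "Ocular and auditory toxicity",
}
individual = {
    "infons": {"CompositeRole": "IndividualMention",
               "MESH": "D014786",
               "type": "Disease"},
    "text": "Ocular toxicity",
}

print(split_mesh_ids(composite))   # ['D014786', 'D006311']
print(split_mesh_ids(individual))  # ['D014786']

# One option: keep only spans that map to a single entity.
kept = [ann["text"] for ann in (composite, individual) if not is_composite(ann)]
print(kept)  # ['Ocular toxicity']
```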

Because the same entity can have multiple textual representations, it is difficult to use this dataset with end-to-end RE models that operate at the mention level.
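
To see how varied the textual representations are, the mentions in a document can be grouped by MESH ID. The sketch below assumes the annotation structure printed earlier. `surface_forms_by_entity` is a hypothetical helper, and the lowercase 'naloxone' mention in the second toy passage is invented for illustration.

```python
from collections import defaultdict


def surface_forms_by_entity(doc):
    """Map each MESH ID in a document to its sorted, distinct mention texts."""
    forms = defaultdict(set)
    for passage in doc["passages"]:
        for ann in passage["annotations"]:
            forms[ann["infons"]["MESH"]].add(ann["text"])
    return {mesh_id: sorted(texts) for mesh_id, texts in forms.items()}


# Toy document: the title annotations follow the first training example;
# the second passage is hypothetical.
toy_doc = {
    "passages": [
        {"annotations": [
            {"infons": {"MESH": "D009270", "type": "Chemical"}, "text": "Naloxone"},
            {"infons": {"MESH": "D003000", "type": "Chemical"}, "text": "clonidine"},
        ]},
        {"annotations": [
            {"infons": {"MESH": "D009270", "type": "Chemical"}, "text": "naloxone"},
        ]},
    ]
}

forms = surface_forms_by_entity(toy_doc)
print(forms)
# {'D009270': ['Naloxone', 'naloxone'], 'D003000': ['clonidine']}
```

Entities with more than one entry in their list are the ones a mention-level model would have to link together.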

In [36]:
def label_statistics(ds):
    """Count entity annotations per type and relations per label."""
    ent_types = {}
    rel_types = {}
    for ex in ds:
        passages = ex['passages']
        # Assumes two passages per document (title, then abstract).
        entities = passages[0]['annotations'] + passages[1]['annotations']
        for ent in entities:
            ent_type = ent['infons']['type']
            ent_types[ent_type] = ent_types.get(ent_type, 0) + 1

        # Relations are annotated at the document level, not per passage.
        for rel in ex['relations']:
            rel_type = rel['infons']['relation']
            rel_types[rel_type] = rel_types.get(rel_type, 0) + 1

    return {'ent_stats': ent_types, 'rel_stats': rel_types}

In [41]:
print("--- Dataset Label Statistics ---")
print("Train:")
pprint(label_statistics(ds_train))
print()
print("Validation:")
pprint(label_statistics(ds_val))
print()
print("Test:")
pprint(label_statistics(ds_test))

--- Dataset Label Statistics ---
Train:
{'ent_stats': {'Chemical': 5207, 'Disease': 4363}, 'rel_stats': {'CID': 1038}}

Validation:
{'ent_stats': {'Chemical': 5352, 'Disease': 4421}, 'rel_stats': {'CID': 1012}}

Test:
{'ent_stats': {'Chemical': 5394, 'Disease': 4534}, 'rel_stats': {'CID': 1066}}
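
The `rel_stats` counts come from each document's `relations` list, where `infons['relation']` holds the label. A rough sketch of pulling out the related pairs follows. The `Chemical` and `Disease` infon keys are an assumption about the corpus format (they do not appear in the output above), so check them against `ds_train[0]['relations']` before relying on this. `cid_pairs` is a hypothetical helper and the IDs in the toy document are placeholders.

```python
def cid_pairs(doc):
    """Return (chemical ID, disease ID) tuples for the CID relations of one document."""
    pairs = []
    for rel in doc["relations"]:
        infons = rel["infons"]
        if infons.get("relation") == "CID":
            # Assumption: the related entities are stored under these two keys.
            pairs.append((infons.get("Chemical"), infons.get("Disease")))
    return pairs


# Toy document with placeholder IDs, not real corpus entries.
toy_doc = {
    "relations": [
        {"id": "R0",
         "infons": {"relation": "CID",
                    "Chemical": "CHEM_TOY_1",
                    "Disease": "DIS_TOY_1"}},
    ]
}

pairs = cid_pairs(toy_doc)
print(pairs)  # [('CHEM_TOY_1', 'DIS_TOY_1')]
```

Because the pairs refer to entity IDs rather than to individual mentions, the relations are entity-level within a document.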
