# Testing the disease highlighting ability of DNorm

2015-06-17 Tong Shu Li

Now that we know tmChem does quite well at identifying chemicals, we will see how well DNorm does at identifying diseases.

In [13]:
from collections import defaultdict

In [1]:
import sys

In [2]:
sys.path.append("/home/toby/Code/util")
from file_util import read_file

### Strip the BioCreative data down for processing:

In [3]:
def process_data(infile, outfile):
    """
    Writes one "pmid<TAB>title abstract" line per paper,
    skipping the annotation and relation lines.
    """
    with open(outfile, "w") as out:
        title = ""
        pmid = ""
        for line in read_file(infile):
            vals = line.split("|")
            if len(vals) == 3:
                if vals[1] == "t":
                    pmid = vals[0]
                    title = vals[2]
                elif vals[1] == "a":
                    out.write("{0}\t{1}\n".format(pmid, title + " " + vals[2]))
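
As a rough illustration of what this stripping step produces, the sketch below applies the same title/abstract merge to a few in-memory lines instead of a file. The helper name `merge_title_abstract` and the sample record are made up for this example; the `pmid|t|title` and `pmid|a|abstract` layout is taken from the code above.

```python
def merge_title_abstract(lines):
    """Yield 'pmid<TAB>title abstract' rows from pipe-delimited lines.

    Hypothetical helper: same logic as process_data, without file I/O.
    """
    pmid, title = "", ""
    for line in lines:
        vals = line.split("|")
        if len(vals) != 3:
            continue  # annotation, relation, or blank line
        if vals[1] == "t":
            pmid, title = vals[0], vals[2]
        elif vals[1] == "a":
            yield "{0}\t{1}".format(pmid, title + " " + vals[2])


# Made-up record, only meant to show the shape of the data.
sample = [
    "12345|t|A made-up title.",
    "12345|a|A made-up abstract.",
    "12345\t2\t9\tmade-up\tDisease\tD000000",
    "",
]
rows = list(merge_title_abstract(sample))
```

Only the title and abstract lines survive, joined by a single space, so each paper becomes one tab-separated row.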

In [4]:
process_data("data/training/CDR_TrainingSet.txt",
             "data/dnorm/dnorm_training_input.txt")

In [5]:
process_data("data/development/CDR_DevelopmentSet.txt",
             "data/dnorm/dnorm_development_input.txt")

### Run DNorm:

In [6]:
%%bash

mv data/dnorm/dnorm_*_input.txt ~/Code/DNorm/DNorm-0.0.6/inloc
cd ~/Code/DNorm/DNorm-0.0.6

for fin in inloc/dnorm_*_input.txt;
do
    fname=`basename $fin`;
    
    outpath="outloc/${fname/input/output}";
    
    ./RunDNorm.sh config/banner_NCBIDisease_TEST.xml data/CTD_diseases.tsv output/simmatrix_NCBIDisease_e4.bin $fin $outpath
done

# move everything back to the original directory
mv inloc/dnorm_*_input.txt ~/Research/Projects/biocreativeV/data/dnorm
mv outloc/dnorm_*_output.txt ~/Research/Projects/biocreativeV/data/dnorm

Creating index
Not adding alternate name ADRENOCORTICOTROPIC HORMONE DEFICIENCY to concept OMIM:201400 because it is the primary name of a parent
Not adding alternate name ANIRIDIA to concept MESH:C536372 because it is the primary name of a parent
Not adding alternate name ABDOMINAL AORTIC ANEURYSM to concept OMIM:100070 because it is the primary name of a parent
Not adding alternate name ANEURYSM, ABDOMINAL AORTIC to concept OMIM:100070 because it is the primary name of a parent
Not adding alternate name DYSEQUILIBRIUM SYNDROME to concept OMIM:224050 because it is the primary name of a parent
Not adding alternate name CADASIL to concept OMIM:125310 because it is the primary name of a parent
Not adding alternate name 18-HYDROXYLASE DEFICIENCY to concept OMIM:203400 because it is the primary name of a parent
Not adding alternate name CRANIOFRONTONASAL DYSPLASIA to concept OMIM:304110 because it is the primary name of a parent
Not adding alternate name CRIGLER-NAJJAR SYNDROME to concept 

### Representations of the data:

In [7]:
class Annotation:
    def __init__(self, uid, stype, text, start, stop):
        if uid.startswith("MESH:"):
            uid = uid[5 : ]
        
        self.uid = uid
        self.stype = stype.lower()
        assert self.stype in ["chemical", "disease"]
        self.text = text
        self.start = int(start)
        self.stop = int(stop)
        assert self.start < self.stop
        
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.__dict__ == other.__dict__
            
        return False
    
    def __ne__(self, other):
        return not self.__eq__(other)
        
    def output(self):
        print self.uid
        print self.start
        print self.stop
        print self.text
        print

In [8]:
class Relation:
    def __init__(self, drug, disease):
        assert drug != "-1"
        assert disease != "-1"
        self.drug = drug
        self.disease = disease
        
    def output(self):
        print self.drug, self.disease

In [9]:
def make_annotations(annotations):
    """
    Annotations with an identifier of -1 or with
    no known identifier are ignored because they
    never show up in a relationship.
    
    Ignored for comparison too, for the above
    reason.
    """
    chemicals = []
    diseases = []
    
    for group in annotations:
        if group[5] != "-1":
            res = Annotation(group[5], group[4], group[3], group[1], group[2])
            if res.stype == "chemical":
                chemicals.append(res)
            else:
                diseases.append(res)
                
    return (chemicals, diseases)

def make_relations(relations):
    res = []
    for group in relations:
        res.append(Relation(group[2], group[3]))
    
    return res
        
class Paper:
    def __init__(self, pmid, title, abstract, annotations, relations):
        self.pmid = pmid
        self.title = title
        self.abstract = abstract
        
        self.chemicals, self.diseases = make_annotations(annotations)
        self.relations = make_relations(relations)
        
    def output(self):
        print self.pmid
        # Paper stores chemicals and diseases separately; there is no self.annotations
        print len(self.chemicals) + len(self.diseases), len(self.relations)

In [10]:
def parse_input(loc, fname):
    """
    Parses the given input file and returns a list
    of Paper objects.
    """
    papers = []

    counter = 0
    annotations = []
    relations = []
    for i, line in enumerate(read_file(fname, loc)):
        if len(line) == 0:
            # time to finish up this paper and prepare a new one
            papers.append(Paper(pmid, title, abstract, annotations, relations))

            counter = 0

            annotations = []
            relations = []
        else:
            if 0 <= counter <= 1:
                vals = line.split('|')
                assert len(vals) == 3
            else:
                vals = line.split('\t')

            if counter == 0:
                assert vals[1] == 't'
                pmid = vals[0]            
                title = vals[2]
            elif counter == 1:
                assert vals[1] == 'a'
                abstract = vals[2]
            elif len(vals) == 4:
                relations.append(vals)
            else:
                assert 5 <= len(vals) <= 7, pmid
                # 5 fields means it determined that the text span
                # was a chemical, but could not assign an identifier
                
                # 7 means it was a mistake in the original input (extra tab)
                # 6 is the ideal output
                
                if len(vals) == 5:
                    vals.append("-1")
                
                annotations.append(vals) # 6 or 7 fields

            counter += 1
            
    return papers

In [23]:
def parse_dnorm_output(loc, fname):
    """
    Returns the set of identifier namespaces (the text
    before the ":") that appear in DNorm's output.
    """
    res = defaultdict(list)  # not filled in yet
    prefixes = set()
    for line in read_file(fname, loc):
        vals = line.split('\t')

        if len(vals) == 4:
            vals.append("-1") # no known identifier

        assert len(vals) == 5

        if ":" in vals[4]:
            prefixes.add(vals[4][ : vals[4].index(":")])
        # TODO: give this pmid an extra annotation

    return prefixes
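
The function above only collects identifier namespaces. A minimal sketch of the per-PMID grouping it hints at could look like the following. It assumes the five columns are pmid, start, stop, text and identifier, which matches the field counts checked above but is not confirmed by DNorm's documentation here. The name `group_dnorm_rows` and the second sample row are invented for illustration.

```python
from collections import defaultdict


def group_dnorm_rows(lines):
    """Group DNorm rows by PMID.

    Assumed column order: pmid, start, stop, text, identifier.
    """
    by_pmid = defaultdict(list)
    for line in lines:
        vals = line.rstrip("\n").split("\t")
        if len(vals) == 4:
            vals.append("-1")  # no known identifier
        if len(vals) != 5:
            raise ValueError("unexpected number of fields: %r" % (line,))
        pmid, start, stop, text, uid = vals
        by_pmid[pmid].append((int(start), int(stop), text, uid))
    return by_pmid


sample = [
    "6386793\t354\t372\tdepressive illness\tOMIM:309200",
    "6386793\t10\t17\tmade-up",  # invented row with no identifier
]
by_pmid = group_dnorm_rows(sample)
```

Keeping the identifier as a plain string leaves the choice of namespace handling (MESH versus OMIM) to whatever comparison step comes next.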
    

In [24]:
parse_dnorm_output("data/dnorm", "dnorm_training_output.txt")

{'MESH', 'OMIM'}

For PMID 6386793 of the training set, DNorm correctly found the disease "depressive illness" at indices 354 to 372 and assigned it the OMIM identifier 309200. It did not assign a MeSH term. However, in the official training data, the same text span is assigned the MeSH identifier D003866. Therefore, if we compared on identifiers as well as text spans, DNorm would be counted as having missed this term. Should we compare on text locations only?


It looks like the evaluations will be based on identifiers, since that is how the relationships are given in the gold data, and this is also what the provided evaluation toolkit uses. Therefore we will need some way to ensure that DNorm can find the correct MeSH identifiers.
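
To make the difference between the two comparison modes concrete, here is a small sketch that scores the "depressive illness" example both ways. It is not the official evaluation toolkit's scoring, only a plain set-based precision and recall. The function name `precision_recall` and the tuple layout are assumptions for this example.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / float(len(predicted)) if predicted else 0.0
    recall = true_pos / float(len(gold)) if gold else 0.0
    return precision, recall


# (start, stop, identifier) for the example discussed above.
gold = [(354, 372, "D003866")]
dnorm = [(354, 372, "OMIM:309200")]

# Compare on text locations only.
span_scores = precision_recall(
    [(start, stop) for start, stop, _ in dnorm],
    [(start, stop) for start, stop, _ in gold],
)

# Compare on identifiers only.
id_scores = precision_recall(
    [uid for _, _, uid in dnorm],
    [uid for _, _, uid in gold],
)
```

On spans alone the annotation counts as a hit, while on identifiers it counts as both a false positive and a miss, which is why mapping DNorm's output to MeSH matters for the final score.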

### Grab the gold standard data:

In [7]:
gold_training = parse_input("data/training", "CDR_TrainingSet.txt")

In [8]:
gold_development = parse_input("data/development", "CDR_DevelopmentSet.txt")

### Grab DNorm's output:

In [9]:
dnorm_training = parse_input("data/dnorm", "dnorm_training_output.txt")


AssertionError: 

The assertion most likely fails because `parse_input` expects the gold-standard layout, where each paper starts with a `pmid|t|title` line, while DNorm's output is tab-separated.

In [10]:
a = "Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man resulted in profound depression of the activity of the sinoatrial and atrioventricular nodal pacemakers. The patient had no apparent associated conditions which might have predisposed him to the development of bradyarrhythmias; and, thus, this probably represented a true idiosyncrasy to lidocaine."

In [11]:
a

'Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man resulted in profound depression of the activity of the sinoatrial and atrioventricular nodal pacemakers. The patient had no apparent associated conditions which might have predisposed him to the development of bradyarrhythmias; and, thus, this probably represented a true idiosyncrasy to lidocaine.'

In [12]:
len(a)

383

In [13]:
a[11:25]

' administratio'

In [14]:
a[140:188]

'sinoatrial and atrioventricular nodal pacemakers'