# Testing the chemical highlighting abilities of tmChem

2015-06-16 Tong Shu Li<br>
Last updated: 2015-08-17 

Our crowdsourcing approach relies on every chemical mention in the original raw text being annotated. Here we test how well tmChem can annotate chemicals.

In [1]:
from __future__ import division

In [2]:
import sys

In [3]:
sys.path.append("/home/toby/Code/util")
from file_util import read_file

In [4]:
from src.data_model import *

In [5]:
from collections import Counter
from itertools import islice

### We first take the BioCreative V data and strip it down to the original raw text:

Training data:

In [4]:
with open("data/tmchem/tmchem_training.txt", "w") as out:
    for line in read_file("data/training/CDR_TrainingSet.txt"):
        if len(line) == 0:
            out.write("\n")
        elif len(line) > 0 and "|" in line:
            vals = line.split("|")
            if vals[1] in ["t", "a"]:
                out.write("{0}\n".format("|".join(vals)))

Development data:

In [5]:
with open("data/tmchem/tmchem_development.txt", "w") as out:
    for line in read_file("data/development/CDR_DevelopmentSet.txt"):
        if len(line) == 0:
            out.write("\n")
        elif len(line) > 0 and "|" in line:
            vals = line.split("|")
            if vals[1] in ["t", "a"]:
                out.write("{0}\n".format("|".join(vals)))
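
Both cells apply the same filter, so it can be sketched as a single helper run on an in-memory sample. The function name `strip_to_raw_text` and the sample records are made up for illustration. Only the title (`|t|`) and abstract (`|a|`) lines of the PubTator layout are kept, along with the blank lines that separate documents.

```python
def strip_to_raw_text(lines):
    """Keep only title/abstract lines and blank separators (hypothetical helper)."""
    kept = []
    for line in lines:
        if len(line) == 0:
            # blank line separating two documents
            kept.append("")
        elif "|" in line:
            vals = line.split("|")
            if vals[1] in ["t", "a"]:
                kept.append(line)
    return kept

# Made-up sample: title, abstract, then a tab-separated annotation line.
sample = [
    "1|t|Aspirin and headache.",
    "1|a|Aspirin relieved the headache.",
    "1\t0\t7\tAspirin\tChemical\tD001241",
    "",
    "2|t|Another title.",
    "2|a|Another abstract.",
]

raw = strip_to_raw_text(sample)
```

The annotation line is dropped because it contains no `|`, which leaves tmChem with unannotated text to work on.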

### Run tmChem:

In [6]:
% mv data/tmchem/tmchem_*.txt ~/Code/tmChem/tmChem.M2.ver02/input

In [7]:
% cd ~/Code/tmChem/tmChem.M2.ver02

/home/toby/Code/tmChem/tmChem.M2.ver02


In [8]:
% pwd

u'/home/toby/Code/tmChem/tmChem.M2.ver02'

In [9]:
! perl tmChem.pl -i input -o output Model/All.Model

Input format: PubTator
Running tmChem on 500 docs in tmchem_development.txt ... Finished in 61 seconds. 
Input format: PubTator
Running tmChem on 500 docs in tmchem_training.txt ... Finished in 63 seconds. 


In [10]:
% mv output/*.tmChem ~/Research/Projects/biocreativeV/crowdbefree/crowd_only/data/tmchem

In [11]:
% cd ~/Research/Projects/biocreativeV/crowdbefree/crowd_only

/home/toby/Research/Projects/biocreativeV/crowdbefree/crowd_only


In [12]:
% pwd

u'/home/toby/Research/Projects/biocreativeV/crowdbefree/crowd_only'

### Grab tmChem's output:

In [6]:
tmchem_training = parse_input("data/tmchem", "tmchem_training.txt.tmChem", is_gold = False, return_format = "list")

In [7]:
tmchem_development = parse_input("data/tmchem", "tmchem_development.txt.tmChem", is_gold = False, return_format = "list")

### Grab the gold standard data:

In [8]:
gold_training = parse_input("data/training", "CDR_TrainingSet.txt")

In [9]:
gold_development = parse_input("data/development", "CDR_DevelopmentSet.txt")

---

### Acronym resolver

If an abbreviation for a chemical appears in parentheses immediately after a term that tmChem has identified as a chemical, then replace the id of every occurrence of that abbreviation, from that point on, with the id of the full term.

In [40]:
import copy

In [41]:
def acronym_resolver(dataset):
    """
    Given a parsed dataset, tries to resolve acronyms.

    If an annotation whose has_mesh flag is set is immediately followed
    by a parenthesized annotation, i.e. "full name (ABBR)", then that
    annotation and every later one with the same text receive the id of
    the full name. The dataset is modified in place and returned.
    """
    for paper in dataset:
        used = [False] * len(paper.annotations)

        full_text = "{0} {1}".format(paper.title, paper.abstract)

        changes = 0
        for i, definition in enumerate(paper.annotations[:-1]):
            if not used[i] and definition.has_mesh:
                next_annot = paper.annotations[i+1]

                # the abbreviation must start two characters after the
                # definition ends (a space and "(") and be followed by ")"
                if (next_annot.start == 2 + definition.stop
                    and full_text[next_annot.start - 1] == "("
                    and full_text[next_annot.stop] == ")"):

                    # found an acronym definition
                    used[i] = True

                    for j, annot in enumerate(islice(paper.annotations, i+1, None)):
                        if annot.text == next_annot.text and annot.uid != definition.uid:
                            annot.uid = definition.uid
                            used[i + 1 + j] = True

                            changes += 1

        # print the number of changed annotations for each affected paper
        if changes > 0:
            print changes

    return dataset




In [43]:
training_copy = copy.deepcopy(tmchem_training)
changed_training = acronym_resolver(training_copy)

In [45]:
dev_copy = copy.deepcopy(tmchem_development)
changed_dev = acronym_resolver(dev_copy)

2
9
5
4
9
8
8
4
4
21
5
1
1
3
2
3
1
5
2
1
1
4
2
5
3
2
6
2
2
10
9
10
6
5
1
6
4
3
3
7
1
1
2
15
6
10
2
1
4
5
4
9
4
3
5
12
3
6
2
4
9
10
3
1
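
To make the offset test in `acronym_resolver` concrete, here is a toy walk-through on one made-up sentence. The `Annot` class is only a stand-in for the notebook's annotation objects, the ids are placeholder strings, and the stop offset is assumed to be exclusive, which is what the `full_text[next_annot.stop] == ")"` check implies.

```python
class Annot(object):
    """Toy stand-in for an annotation (not the notebook's real class)."""
    def __init__(self, text, start, uid):
        self.text = text
        self.start = start
        self.stop = start + len(text)  # exclusive end offset (assumed)
        self.uid = uid

full_text = "Effects of dopamine (DA). DA was measured."

definition = Annot("dopamine", full_text.find("dopamine"), "ID_FULL")
first_da = Annot("DA", full_text.find("DA"), "ID_OTHER")
second_da = Annot("DA", full_text.find("DA", first_da.stop), "ID_OTHER")
annotations = [definition, first_da, second_da]

# Same test as in acronym_resolver: "<definition> (<abbreviation>)"
is_definition = (first_da.start == 2 + definition.stop
                 and full_text[first_da.start - 1] == "("
                 and full_text[first_da.stop] == ")")

changes = 0
if is_definition:
    # propagate the id of the full term to every matching abbreviation
    for annot in annotations[1:]:
        if annot.text == first_da.text and annot.uid != definition.uid:
            annot.uid = definition.uid
            changes += 1
```

Both "DA" annotations end up with the id of "dopamine", so `changes` is 2 for this toy paper.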


### Look at the performance of tmChem:

In [46]:
def results(program_output, gold_data):
    TP = 0
    FP = 0
    sum_chemicals = 0
    
    stuff = []
    for prog_data, gold_std in zip(program_output, gold_data):
        assert prog_data.pmid == gold_std.pmid
        
        tp = 0
        fp = 0
        for prog_annot in prog_data.annotations:
            if prog_annot.stype == "chemical":
                for gold_annot in gold_std.annotations:
                    if gold_annot.stype == "chemical":
                        if (prog_annot.text == gold_annot.text
                            and prog_annot.start == gold_annot.start
                            and prog_annot.stop == gold_annot.stop):
                            
                            #assert prog_annot.uid == gold_annot.uid, "{0} {1}".format(prog_annot, gold_annot)
                            # treat non mesh differently
                            
                            p = set(filter(lambda v: v.uid_type == "MESH", prog_annot.uid))
                            g = set(filter(lambda v: v.uid_type == "MESH", gold_annot.uid))
                            
                            if p == g:
                                tp += 1
                            else:
                                fp += 1
                                stuff.append(prog_annot.text)
                               
                            
        for gold_annot in gold_std.annotations:
            if gold_annot.stype == "chemical":
                sum_chemicals += 1
                
        TP += tp
        FP += fp
        
    recall = TP / sum_chemicals
    precision = TP / (TP + FP)
    
    f_score = 2 * precision * recall / (precision + recall)
    
    print "F score:", f_score

    print "recall: {0}".format(TP / sum_chemicals)
    print "precision: {0}".format(TP / (TP + FP))

    print "TP: {0}".format(TP)
    print "FP: {0}".format(FP)
    print "all gold annotations: {0}".format(sum_chemicals)
    return stuff

In [47]:
stuff = results(tmchem_training, gold_training)

F score: 0.99014111734
recall: 0.984432058428
precision: 0.995916780089
TP: 5122
FP: 21
all gold annotations: 5203
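
The scores printed above follow directly from the three counts. A small sketch that recomputes them (the helper name `precision_recall_f` is not part of the notebook):

```python
def precision_recall_f(tp, fp, n_gold):
    """Recompute precision, recall and F score from raw counts."""
    recall = tp / float(n_gold)
    precision = tp / float(tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts reported for tmChem on the training set.
precision, recall, f_score = precision_recall_f(tp=5122, fp=21, n_gold=5203)
```

Here recall is 5122 / 5203 and precision is 5122 / (5122 + 21); the F score is their harmonic mean.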


In [48]:
stuff

['MFL',
 'MFL regimen',
 'MFL regimen',
 'MFL regimen',
 'CPA',
 'CPA',
 'CPA',
 'CPA',
 'CPA',
 'CPA',
 'alendronate',
 'alendronate sodium',
 'alendronate',
 'alkylating agents',
 'alkylating agents',
 'alkylating agents',
 'alkylating agents',
 'PO2',
 'DA',
 'HVA',
 'DA']

In [49]:
stuff = results(tmchem_development, gold_development)

F score: 0.877668077104
recall: 0.872825883673
precision: 0.88256429652
TP: 4667
FP: 621
all gold annotations: 5347


In [50]:
kek = results(changed_training, gold_training)

F score: 0.99014111734
recall: 0.984432058428
precision: 0.995916780089
TP: 5122
FP: 21
all gold annotations: 5203


In [51]:
kek = results(changed_dev, gold_development)

F score: 0.935213916314
recall: 0.93005423602
precision: 0.940431164902
TP: 4973
FP: 315
all gold annotations: 5347


# The acronym resolver improves the dev-set F-score by about 0.06 (0.878 to 0.935)! Include it in the final version!

In PMID 12615818, tmChem gives the abbreviation "CPA" a different id from "cyproterone acetate". How can we fix this?

In [23]:
def results(program_output, gold_std_data):
    TP = 0
    FP = 0
    sum_chemicals = 0
    for p_data, gold_std in zip(program_output, gold_std_data):
        assert p_data.pmid == gold_std.pmid

        # check tmChem's output against the gold standard

        tp = 0
        fp = 0
        for annot in p_data.chemicals:
            if annot in gold_std.chemicals:
                tp += 1
            else:
                fp += 1

        sum_chemicals += len(gold_std.chemicals)
        TP += tp
        FP += fp
        
    recall = TP / sum_chemicals
    precision = TP / (TP + FP)
    
    f_score = 2 * precision * recall / (precision + recall)
    
    print "F score:", f_score

    print "recall: {0}".format(TP / sum_chemicals)
    print "precision: {0}".format(TP / (TP + FP))

    print "TP: {0}".format(TP)
    print "FP: {0}".format(FP)
    print "all gold annotations: {0}".format(sum_chemicals)

In [24]:
results(tmchem_training, gold_training)

F score: 0.98931416359
recall: 0.985294117647
precision: 0.993367147874
TP: 5092
FP: 34
all gold annotations: 5168


In [25]:
results(tmchem_development, gold_development)

F score: 0.871685893544
recall: 0.812028657617
precision: 0.940803844474
TP: 4307
FP: 271
all gold annotations: 5304


## In conclusion, tmChem does well at identifying the chemicals in a piece of text. Recall is noticeably lower on the development set than on the training set, but is still high.

If we really were worried about recall, then we could always add more concept recognizers to drive up total recall.
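
As a rough sketch of that idea: take the union of the spans returned by several recognizers, so that a chemical counts as found if any one of them highlights it. The recognizer outputs and the `(start, stop, text)` tuple layout below are invented for illustration and are not the output of any real tool.

```python
def merge_spans(*recognizer_outputs):
    """Union of (start, stop, text) spans from several recognizers, sorted by position."""
    merged = set()
    for spans in recognizer_outputs:
        merged.update(spans)
    return sorted(merged)

def span_recall(found, gold):
    """Fraction of gold spans that appear exactly in the found spans."""
    return len(set(found) & set(gold)) / float(len(gold))

# Made-up gold standard and recognizer outputs.
gold = [(0, 7, "aspirin"), (20, 22, "DA"), (30, 38, "dopamine")]
recognizer_a = [(0, 7, "aspirin"), (30, 38, "dopamine")]
recognizer_b = [(0, 7, "aspirin"), (20, 22, "DA")]

combined = merge_spans(recognizer_a, recognizer_b)
```

A union can never lower recall, but every recognizer also contributes its own false positives, so precision may drop.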