# Testing the chemical highlighting abilities of tmChem

Tong Shu Li<br>
Created on: 2015-06-16<br>
Last updated: 2015-08-21

Our crowdsourcing approach relies on exhaustively annotating every chemical mention in the original raw text. Here we test how well tmChem can annotate chemicals.

In [1]:
from src.lingpipe.file_util import read_file
from src.data_model import parse_input

### Strip the gold standard down to raw text:

In [2]:
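def strip_file(fin_loc, fout_loc):
    """Copy only the title/abstract lines and blank separators, dropping annotations."""
    with open(fout_loc, "w") as fout:
        for line in read_file(fin_loc):
            if len(line) == 0:
                # blank lines separate documents, so keep them
                fout.write("\n")
            elif "|" in line and line.split("|")[1] in ("t", "a"):
                # title ("t") or abstract ("a") line
                fout.write("{0}\n".format(line))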
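
As a quick illustration of the filter in `strip_file`, the sketch below applies the same rule to a few in-memory lines. The sample record is invented for illustration (the PMID, text, and annotation row are hypothetical), and `keep_line` is a helper name introduced here, not part of the project. It assumes the PubTator layout in which title and abstract lines look like `PMID|t|...` and `PMID|a|...`, while annotation rows are tab-separated.

```python
def keep_line(line):
    """Return True for blank separators and title/abstract lines."""
    if len(line) == 0:
        return True
    return "|" in line and line.split("|")[1] in ("t", "a")

# Hypothetical PubTator-style record, invented for illustration only.
sample = [
    "12345|t|Aspirin-induced asthma.",
    "12345|a|We describe a patient who developed asthma after taking aspirin.",
    "12345\t0\t7\tAspirin\tChemical\tX001",  # annotation row: no "|" so it is dropped
    "",
]

stripped = [line for line in sample if keep_line(line)]
for line in stripped:
    print(repr(line))
```

Only the title line, the abstract line, and the blank separator survive, which is the raw text tmChem is given as input.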

In [3]:
strip_file("data/training/CDR_TrainingSet.txt", "data/tmchem/tmchem_training.txt")

In [4]:
strip_file("data/development/CDR_DevelopmentSet.txt", "data/tmchem/tmchem_development.txt")

### Run tmChem:

In [5]:
%%bash

# move things to the correct directory
cur_path=$(pwd)
cp data/tmchem/tmchem_*.txt src/tmChem.M2.ver02/input
cd src/tmChem.M2.ver02

# run tmChem
perl tmChem.pl -i input -o output Model/All.Model

# move results back
rm input/*.txt
mv output/*.tmChem "$cur_path/data/tmchem"
cd "$cur_path"

Input format: PubTator
Running tmChem on 500 docs in tmchem_training.txt ... Finished in 61 seconds. 
Input format: PubTator
Running tmChem on 500 docs in tmchem_development.txt ... Finished in 63 seconds. 


### Read the gold standard:

In [6]:
gold_train = parse_input("data/training", "CDR_TrainingSet.txt")

In [7]:
gold_dev = parse_input("data/development", "CDR_DevelopmentSet.txt")

### Read tmChem's output, with and without acronym resolution:

In [8]:
tmchem_train_raw = parse_input("data/tmchem", "tmchem_training.txt.tmChem",
                              is_gold = False, return_format = "list", fix_acronyms = False)

In [9]:
tmchem_train_fixed = parse_input("data/tmchem", "tmchem_training.txt.tmChem",
                              is_gold = False, return_format = "list", fix_acronyms = True)

In [10]:
tmchem_dev_raw = parse_input("data/tmchem", "tmchem_development.txt.tmChem",
                              is_gold = False, return_format = "list", fix_acronyms = False)

In [11]:
tmchem_dev_fixed = parse_input("data/tmchem", "tmchem_development.txt.tmChem",
                              is_gold = False, return_format = "list", fix_acronyms = True)

### Examine performance against gold standard:

In [12]:
def results(program_output, gold_data):
    TP = 0
    FP = 0
    sum_chemicals = 0
    
    for prog_data, gold_std in zip(program_output, gold_data):
        assert prog_data.pmid == gold_std.pmid
        
        tp = 0
        fp = 0
        for prog_annot in prog_data.annotations:
            if prog_annot.stype == "chemical":
                for gold_annot in gold_std.annotations:
                    # compare only against gold chemical annotations
                    if gold_annot.stype == "chemical":
                        if (prog_annot.text == gold_annot.text
                            and prog_annot.start == gold_annot.start
                            and prog_annot.stop == gold_annot.stop):
                            
                            p = {v for v in prog_annot.uid if v.uid_type == "MESH"}
                            g = {v for v in gold_annot.uid if v.uid_type == "MESH"}
                            
                            if p == g:
                                tp += 1
                            else:
                                fp += 1
                            
        for gold_annot in gold_std.annotations:
            if gold_annot.stype == "chemical":
                sum_chemicals += 1
                
        TP += tp
        FP += fp
        
    recall = TP / sum_chemicals
    precision = TP / (TP + FP)
    
    f_score = 2 * precision * recall / (precision + recall)
    
    print("F score:", f_score)

    print("recall: {0}".format(recall))
    print("precision: {0}".format(precision))

    print("TP: {0}".format(TP))
    print("FP: {0}".format(FP))
    print("all gold annotations: {0}".format(sum_chemicals))
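
To make the counting rule in `results` concrete, here is a minimal sketch on toy data. The `Uid` and `Annot` namedtuples are hypothetical stand-ins for the project's data model (only the field names used by `results` are mirrored), the identifiers are placeholders, and `count_matches` is a name introduced here. The sketch restricts the comparison to gold chemical annotations, which is assumed to be the intent.

```python
from collections import namedtuple

# Hypothetical stand-ins for the project's data model.
Uid = namedtuple("Uid", ["uid_type", "uid"])
Annot = namedtuple("Annot", ["stype", "text", "start", "stop", "uid"])

def count_matches(prog_annots, gold_annots):
    """Count TP/FP for one document: only predictions whose span matches gold are scored."""
    tp = fp = 0
    for p in prog_annots:
        if p.stype != "chemical":
            continue
        for g in gold_annots:
            if g.stype != "chemical":
                continue
            if (p.text, p.start, p.stop) == (g.text, g.start, g.stop):
                p_ids = {u for u in p.uid if u.uid_type == "MESH"}
                g_ids = {u for u in g.uid if u.uid_type == "MESH"}
                if p_ids == g_ids:
                    tp += 1   # same span, same MeSH identifiers
                else:
                    fp += 1   # same span, different MeSH identifiers
    return tp, fp

# Placeholder identifiers, invented for illustration only.
gold = [
    Annot("chemical", "aspirin", 0, 7, (Uid("MESH", "X001"),)),
    Annot("chemical", "lithium", 20, 27, (Uid("MESH", "X002"),)),
]
prog = [
    Annot("chemical", "aspirin", 0, 7, (Uid("MESH", "X001"),)),   # exact match
    Annot("chemical", "lithium", 20, 27, (Uid("MESH", "X999"),)), # span match, wrong id
    Annot("chemical", "sodium", 40, 46, (Uid("MESH", "X003"),)),  # no gold span
]

tp, fp = count_matches(prog, gold)
print("TP:", tp, "FP:", fp)
```

In this toy case the third prediction has no gold annotation with identical text and offsets, so it is counted as neither a true nor a false positive. Under this rule, precision computed as TP / (TP + FP) reflects only predictions whose span matched a gold annotation exactly.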

### tmChem results without acronym resolution

In [13]:
results(tmchem_train_raw, gold_train)

F score: 0.9924608544364972
recall: 0.9867384201422257
precision: 0.9982500486097609
TP: 5134
FP: 9
all gold annotations: 5203


In [14]:
results(tmchem_dev_raw, gold_dev)

F score: 0.8776680771039022
recall: 0.8728258836730877
precision: 0.8825642965204236
TP: 4667
FP: 621
all gold annotations: 5347


### tmChem results with acronym resolution

In [15]:
results(tmchem_train_fixed, gold_train)

F score: 0.9924608544364972
recall: 0.9867384201422257
precision: 0.9982500486097609
TP: 5134
FP: 9
all gold annotations: 5203


In [16]:
results(tmchem_dev_fixed, gold_dev)

F score: 0.9348377997179126
recall: 0.9296801945015897
precision: 0.9400529500756429
TP: 4971
FP: 317
all gold annotations: 5347


## Results

tmChem performs very well on the training set (F-score 0.992). Performance on the development set is lower, but still respectable (F-score 0.878 without acronym resolution). Acronym resolution makes no difference for the training set, but raises the development-set F-score from 0.878 to 0.935, an improvement of about 0.06.
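
The development-set figures can be rechecked directly from the counts printed above. The sketch below recomputes precision, recall, and F-score from TP, FP, and the number of gold annotations; `prf` is a helper name introduced here for illustration and is not part of the notebook's code.

```python
def prf(tp, fp, n_gold):
    """Return (precision, recall, F-score) from raw counts."""
    recall = tp / n_gold
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts taken from the development-set outputs above.
dev_raw = prf(tp=4667, fp=621, n_gold=5347)    # without acronym resolution
dev_fixed = prf(tp=4971, fp=317, n_gold=5347)  # with acronym resolution

gain = dev_fixed[2] - dev_raw[2]
print("F (raw):   {0:.4f}".format(dev_raw[2]))
print("F (fixed): {0:.4f}".format(dev_fixed[2]))
print("gain:      {0:.4f}".format(gain))
```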