# NER processing of BioCreative V Task 3 evaluation dataset

Tong Shu Li<br>
Created on: Monday 2015-08-17<br>
Last updated: 2015-08-21

tmChem and DNorm are used to annotate the final evaluation dataset for BioCreative V task 3.

One newline ("\n") was added by hand to the final line of the CDR_TestSet.PubTator.txt file in order to ensure that tmChem and DNorm would work properly.
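
The manual fix above could also be scripted. The following is only a minimal sketch, not what was actually done for this dataset: `ensure_trailing_newline` is a hypothetical helper name, and it assumes Unix line endings.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Hypothetical helper (not part of tmChem or DNorm).
    Returns True if a newline was appended, False otherwise.
    """
    if os.path.getsize(path) == 0:
        return False  # an empty file has no last line to terminate

    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)      # look at the final byte only
        if f.read(1) == b"\n":
            return False             # already newline-terminated
        f.seek(0, os.SEEK_END)       # reposition explicitly before writing
        f.write(b"\n")
        return True
```

Calling it on a copy of <code>CDR_TestSet.PubTator.txt</code> before running the tools would leave an already terminated file untouched.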

The original input file is saved at <code>data/final_eval/orig_data/CDR_TestSet.PubTator.txt</code>.

### Run file through tmChem to annotate chemicals:

In [1]:
%%bash

# copy the input into tmChem's input folder
cur_path=$(pwd)
cp data/final_eval/orig_data/CDR_TestSet.PubTator.txt src/tmChem.M2.ver02/input/CDR_TestSet.txt
cd src/tmChem.M2.ver02

# run tmChem
perl tmChem.pl -i input -o output Model/All.Model

# move results back
mv output/*.tmChem "$cur_path/data/final_eval/tmchem"
cd "$cur_path"

Input format: PubTator
Running tmChem on 500 docs in CDR_TestSet.txt ... Finished in 68 seconds.


One empty line at the end of the tmChem-processed file was deleted. If the newline had not been added to the original input file, tmChem would not have processed the last abstract properly.
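
Removing the extra empty line can also be expressed in code. This is a hedged sketch rather than the procedure actually used: `collapse_trailing_blank_lines` is a hypothetical name, and it assumes the goal is a file in which the last record is followed by exactly one empty line, with Unix line endings.

```python
def collapse_trailing_blank_lines(text):
    """Return `text` ending with exactly one empty line after the last record.

    Hypothetical helper. Any run of trailing newlines is replaced by two
    newline characters: one ends the last content line, and one forms the
    single empty line that closes the final record.
    """
    stripped = text.rstrip("\n")
    if not stripped:
        return ""  # nothing but newlines: there is no record to terminate
    return stripped + "\n\n"
```

Only the end of the text is touched, so the empty lines that separate records elsewhere in the file are preserved.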

---

### Run DNorm:

Make an <code>inloc</code> and an <code>outloc</code> folder inside DNorm's folder to hold our input and output data files.
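
The two folders can be created by hand or from Python. A minimal sketch, assuming Python 3 (for `exist_ok`); `make_io_dirs` is a hypothetical helper name and the folder names default to the ones used in this notebook.

```python
import os


def make_io_dirs(dnorm_path, names=("inloc", "outloc")):
    """Create the given sub-folders under `dnorm_path` if they are missing.

    Hypothetical helper. Returns the list of folder paths.
    """
    paths = []
    for name in names:
        path = os.path.join(dnorm_path, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder exists
        paths.append(path)
    return paths
```

For this notebook the call would be `make_io_dirs("src/DNorm-0.0.7")`, run from the project root.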

In [2]:
%%bash

cur_path=$(pwd)
dnorm_path="$cur_path/src/DNorm-0.0.7"
ab3p_loc="$cur_path/src/Ab3P-v1.5"

# copy the test set into DNorm's input folder
cp data/final_eval/orig_data/CDR_TestSet.PubTator.txt "$dnorm_path/inloc/CDR_TestSet_input.txt"
cd "$dnorm_path"

# run DNorm on every input file, writing the matching *_output.txt to outloc
for fin in inloc/*_input.txt
do
    fname=$(basename "$fin")
    outpath="outloc/${fname/input/output}"

    ./ApplyDNorm.sh config/banner_BC5CDR_UMLS2013AA_SAMPLE.xml data/CTD_diseases-2015-06-04.tsv output/simmatrix_BC5CDR_e4_TRAINDEV.bin "$ab3p_loc" TEMP "$fin" "$outpath"
done

# move everything back to the original directory
mv inloc/*_input.txt "$cur_path/data/final_eval/dnorm"
mv outloc/*_output.txt "$cur_path/data/final_eval/dnorm"

Creating index
Not adding alternate name Alpha-1 Antitrypsin Deficiency to concept MESH:C566273 because it is the primary name of a parent
Not adding alternate name Anemia, Hypoplastic Congenital to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name Anemias, Hypoplastic Congenital to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name Congenital Anemia, Hypoplastic to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name Congenital Anemias, Hypoplastic to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name Hypoplastic Congenital Anemia to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name Hypoplastic Congenital Anemias to concept MESH:D029503 because it is the primary name of a parent
Not adding alternate name ANIRIDIA to concept MESH:C536372 because it is the primary name of a parent
Not adding alt

### Combine outputs of DNorm and tmChem into one file:

In [3]:
from src.lingpipe.file_util import read_file

In [4]:
tmchem_fname = "data/final_eval/tmchem/CDR_TestSet.txt.tmChem"
dnorm_fname = "data/final_eval/dnorm/CDR_TestSet_output.txt"

In [5]:
output_fname = "data/final_eval/CDR_annotated_testset.txt"

In [6]:
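def read_output(fname):
    """Parse a PubTator-format file into {pmid: (title, abstract, concepts)}."""
    res = dict()

    counter = 0  # number of lines read so far in the current record
    pmid = -1
    title = ""
    abstract = ""
    concepts = []
    for line in read_file(fname):
        if len(line) == 0:
            # a blank line ends the current record; a repeated blank line
            # must not overwrite the record that was just stored
            if counter > 0:
                res[pmid] = (title, abstract, concepts)
            counter = 0
            concepts = []
        else:
            if 0 <= counter <= 1:
                # first two lines: "pmid|t|title" and "pmid|a|abstract";
                # split at most twice so a "|" inside the text is kept
                vals = line.split("|", 2)
                assert len(vals) == 3

                pmid = int(vals[0])

                if vals[1] == "t":
                    title = vals[2]
                elif vals[1] == "a":
                    abstract = vals[2]
            else:
                # every remaining line of the record is an annotation
                concepts.append(line)

            counter += 1

    # store the last record if the file does not end with a blank line
    if counter > 0:
        res[pmid] = (title, abstract, concepts)

    return res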

In [7]:
chem = read_output(tmchem_fname)
dise = read_output(dnorm_fname)

In [8]:
assert set(chem.keys()) == set(dise.keys())

In [9]:
pmids = set(chem.keys())

In [10]:
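with open(output_fname, "w") as fout:
    # sorted() makes the order of documents in the output reproducible
    for pmid in sorted(pmids):
        assert chem[pmid][0] == dise[pmid][0] # title same
        assert chem[pmid][1] == dise[pmid][1] # abstract same
        
        title = chem[pmid][0]
        abstract = chem[pmid][1]
        
        # chemical annotations first, then disease annotations
        concepts = chem[pmid][2] + dise[pmid][2]
        
        fout.write("{0}|t|{1}\n".format(pmid, title))
        fout.write("{0}|a|{1}\n".format(pmid, abstract))
        # skip the annotation block if empty, to avoid an extra blank line
        if concepts:
            fout.write("{0}\n".format("\n".join(concepts)))
        fout.write("\n")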
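
The combined list for each document places every tmChem line before every DNorm line. If a downstream consumer prefers annotations ordered by position in the text, the list could be sorted before writing. This is only a sketch under an assumption: that each annotation line is tab-separated with the start and end character offsets in the second and third columns, as in PubTator annotation lines. `sort_concepts` is a hypothetical helper and is not used anywhere in this notebook.

```python
def sort_concepts(concepts):
    """Order annotation lines by their (start, end) character offsets.

    Hypothetical helper. Assumes tab-separated lines of the form
    pmid, start, end, mention, type, identifier. A line whose second or
    third column is not an integer raises ValueError.
    """
    def offsets(line):
        cols = line.split("\t")
        return (int(cols[1]), int(cols[2]))

    return sorted(concepts, key=offsets)
```

Because `sorted` is stable, two annotations with identical offsets keep their original relative order, so a chemical line still precedes a disease line at the same span.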