# Part 1: Extracting Chemical Named Entities

## 1. Obtaining Data

**ChemDNER Corpus v1.0**

The ChemDNER corpus consists of 10,000 PubMed abstracts and their corresponding label sets of named chemical entities. This data set is [publicly available](http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/) and can be downloaded directly using the shell script:

    load_data.sh

## 2. Extracting Candidate Mentions
For the extraction step, our goal is to achieve the highest recall possible. In cases where we have labeled data, it's easy to get a recall estimate for our extraction pipeline.

In [1]:
import umls
import operator
from functools import reduce  # built in on Python 2, must be imported on Python 3
from lexicons.matchers import * 
from utils import replace_penn_tb_tags
from ddlite import *
from datasets import *

In [2]:
# parse our corpus, saving CoreNLP results at cache_path
parser = SentenceParser()
corpus = ChemdnerCorpus('datasets/chemdner_corpus/', parser=parser, cache_path="/tmp/")

# load all training documents and collapse all sentences into a single list
pmids = [pmid for pmid in corpus.cv["training"].keys()]
documents = {pmid:corpus[pmid]["sentences"] for pmid in pmids}
sentences = reduce(lambda x,y:x+y, documents.values())

# load gold annotation tags
annotations = [corpus.annotations[pmid] for pmid in pmids if pmid in corpus.annotations]
annotations = reduce(lambda x,y:x+y, annotations)
annotations = [a.text for a in annotations]

print("%d PubMed abstracts" % len(documents))
print("%d true chemical entity mentions" % len(annotations))
word_n = sum([len(sent.words) for sent in sentences])
print("%d tokens" % word_n)

3500 PubMed abstracts
29478 true chemical entity mentions
809867 tokens


### Using Matchers

The easiest way to identify candidates is through simple string matching using a dictionary of known entity names. Curating good lexicons can take some time, so we use pre-existing dictionaries provided by the *tmChem* tagger and a UMLS dictionary of all *Substance* semantic types (see the UMLS notebook for instructions on how to create arbitrary dictionaries).
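
To make the idea of dictionary matching concrete, here is a minimal sketch in plain Python. It is not the `DictionaryMatch` implementation from `ddlite`; the function name `dictionary_match`, the `max_ngram` parameter, and the toy lexicon and sentence are all assumptions made up for illustration.

```python
def dictionary_match(tokens, lexicon, max_ngram=3, ignore_case=True):
    """Return (start, end, text) for every token n-gram found in the lexicon.

    Hypothetical helper for illustration only, not part of ddlite.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    entries = set(norm(term) for term in lexicon)
    spans = []
    for start in range(len(tokens)):
        # try every n-gram of up to max_ngram tokens beginning at `start`
        last = min(start + max_ngram, len(tokens))
        for end in range(start + 1, last + 1):
            text = " ".join(tokens[start:end])
            if norm(text) in entries:
                spans.append((start, end, text))
    return spans


# toy lexicon and sentence (assumed data, not taken from the corpus)
lexicon = ["aspirin", "graphene oxide", "sodium chloride"]
tokens = "Graphene oxide and aspirin were dissolved in water".split()

matches = dictionary_match(tokens, lexicon)
print(matches)  # [(0, 2, 'Graphene oxide'), (3, 4, 'aspirin')]
```

Matching on space-joined tokens is what makes this approach sensitive to tokenization: an entry only matches if the tokenizer splits the text the same way the lexicon spells it.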

In [9]:
regex_fnames = ["datasets/regex/chemdner/patterns.txt"]

# dictionaries from tmChem & the UMLS
dict_fnames = ["datasets/dictionaries/chemdner/mention_chemical.txt",
              "datasets/dictionaries/chemdner/chebi.txt",
              "datasets/dictionaries/chemdner/addition.txt",
              "datasets/dictionaries/umls/substance-sab-all.txt"]

chemicals = []
for fname in dict_fnames:
    chemicals += [line.strip().split("\t")[0] for line in open(fname,"rU").readlines()]
    
regexes = []
for fname in regex_fnames:
    regexes += [line.strip() for line in open(fname,"rU").readlines()]   

# create matchers and extract candidates
extr1 = DictionaryMatch('C', chemicals, ignore_case=True)
extr2 = AllUpperNounsMatcher('C')
extr3 = RegexMatch('C', regexes[0], ignore_case=True)
extr4 = RegexMatch('C', regexes[1], ignore_case=False)
extr5 = RegexMatch('C', regexes[2], ignore_case=False)
matcher = MultiMatcher(extr1, extr2, extr3, extr4, extr5)

candidates = Entities(sentences, matcher)

# Crude recall estimate (ignores actual span match and tokenization problems)
mentions = [" ".join(replace_penn_tb_tags([e.words[i] for i in e.idxs])) for e in candidates]
gold_mentions = [term for term in annotations]

for m in mentions:
    if m in gold_mentions:
        gold_mentions.remove(m)
tp = len(annotations) - len(gold_mentions)

print("Found %d candidate entities" % len(candidates))
print("Candidates: %.2f%% of all tokens" % (len(candidates)/float(word_n) * 100) )
print("Annotations %.2f%% of all tokens" % (len(annotations)/float(word_n) * 100) )

print("~recall: %.2f (%d/%d)" % (float(tp) / len(annotations), tp, len(annotations)))

Found 189538 candidate entities
Candidates: 23.40% of all tokens
Annotations 3.64% of all tokens
~recall: 0.73 (21498/29478)
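
The crude recall estimate treats candidates and gold annotations as bags of strings: each gold occurrence can be matched by at most one candidate occurrence. A minimal sketch of the same computation using `collections.Counter` is shown here; the function name `crude_recall` and the toy inputs are assumptions for illustration. Multiset intersection avoids the repeated `list.remove` calls, which scan the gold list once per candidate.

```python
from collections import Counter


def crude_recall(candidate_mentions, gold_mentions):
    """String-level recall that ignores spans (hypothetical helper)."""
    cand = Counter(candidate_mentions)
    gold = Counter(gold_mentions)
    # multiset intersection keeps min(count in cand, count in gold) per string
    tp = sum((cand & gold).values())
    total = sum(gold.values())
    recall = float(tp) / total if total else 0.0
    return tp, total, recall


# toy data (assumed): "glucose" is found twice out of three gold occurrences,
# and "Ca(2+)" is missed because the candidate string is "Ca2+"
tp, total, recall = crude_recall(
    ["Ca2+", "glucose", "glucose", "water"],
    ["glucose", "glucose", "glucose", "Ca(2+)"],
)
print("~recall: %.2f (%d/%d)" % (recall, tp, total))  # ~recall: 0.50 (2/4)
```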


### Error Analysis
For distant supervision, we want our candidate set to have high recall. We currently fall short for chemical named entities. If we look at the gold standard annotations, we can see why our string matching misses some entities. Note how parentheses and other tokenization issues result in many missed entities.

In [10]:
# count how often each gold mention is absent from the candidate set
mentions = {term:1 for term in mentions}
missed = [term for term in annotations if term not in mentions]
missed = {term:missed.count(term) for term in missed}

# print missed mentions, most frequent first
for term, count in sorted(missed.items(), key=operator.itemgetter(1), reverse=True):
    print("%s: %d" % (term, count))

Ca(2+): 120
(1)H: 62
(13)C: 40
K(+): 39
Na(+): 35
H(2)O(2): 34
CO(2): 34
MeHg: 31
TiO(2): 27
Mg(2+): 22
Cr(VI): 22
Ca²⁺: 20
WC: 20
Arg: 19
As(V): 19
Cu(2+): 19
NiO: 18
Mn(2+): 18
Res: 16
O₃: 15
(15)N: 15
H(2)O: 15
aryl hydrocarbon: 15
NO(2): 15
phospho: 14
ZnS: 14
acyl: 14
PFAAs: 13
Zn(2+): 13
SnO2: 13
Pb(2+): 13
As(2)O(3): 13
poly(ethylene glycol): 12
(99m)Tc: 12
DMA(III): 11
(-)-reboxetine: 11
As(III): 11
Fe(II): 11
CCl(4): 11
C(60): 10
NAD(+): 10
(-)-carvone: 10
Al(3+): 10
organochlorine: 10
Cr(III): 10
Cd(2+): 10
Mn(3+): 9
Fe(III): 9
polyplexes: 9
ClFn(+): 9
α-syn12 peptide: 9
SrRan: 9
PEG-FA: 9
cytosines: 9
H(2)S: 9
Zr(IV): 9
1,25(OH)(2)D(3): 9
N2: 9
graphene oxide: 8
O(2): 8
steroidal saponins: 8
Rib: 8
H2: 8
Ca2+: 8
PyH(0): 8
phenolic acids: 8
TAGs: 8
(13) C: 8
Zn(II): 8
E2-3,4-Q: 8
graphenes: 8
Sal B: 8
vitamin D(3): 8
CP[c]Ph: 8
d-GalN: 7
W: 7
cannabisin B: 7
ferrocenyl: 7
LiFePO(4): 7
(Ga,Mn)As: 7
[(18)F]FDG: 7
oxyphytosterol: 7
EETs: 7
NiCl(2): 7
(-)-cocaine: 7
(129)Xe: 7
...
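
Many of the missed mentions above differ from dictionary or tokenizer output only in bracket placement, whitespace, or Unicode superscripts. One possible way to reduce such mismatches is to normalize both candidate and gold strings before comparing them. The sketch below is an assumption, not part of the pipeline in this notebook: `normalize_mention` is a hypothetical helper written for Python 3, and the example token sequences are guesses at how a tokenizer might split these mentions.

```python
import re

# Penn Treebank bracket tokens and the characters they stand for
PTB_BRACKETS = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

# Unicode superscript/subscript digits and signs mapped to ASCII
SUPER_SUB = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻₀₁₂₃₄₅₆₇₈₉₊₋",
                     "0123456789+-0123456789+-"))


def normalize_mention(text):
    """Map a mention to a whitespace-free ASCII-bracket form (hypothetical)."""
    for tag, char in PTB_BRACKETS.items():
        text = text.replace(tag, char)
    text = "".join(SUPER_SUB.get(ch, ch) for ch in text)
    # drop all whitespace so "Ca ( 2 + )" and "Ca(2+)" compare equal
    return re.sub(r"\s+", "", text)


print(normalize_mention("Ca -LRB- 2 + -RRB-"))  # Ca(2+)
print(normalize_mention("Ca²⁺"))                # Ca2+
```

Normalization has to be applied to both sides of the comparison, since it also removes spaces inside multi-word names. It does not unify every variant: `Ca²⁺` becomes `Ca2+`, which is still a different string from `Ca(2+)`.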

In [6]:
# dump candidates to a pickle
candidates.dump_candidates("/tmp/chemdner_candidate_mentions.pkl")

### Example Best-in-class Tagger

The winning system in the 2013 BioCreative IV CHEMDNER task was tmChem, which used two linear-chain conditional random fields (CRFs) with different tokenization approaches and feature sets.

| Model       | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| Model 1     | 0.8595    | 0.8721 | 0.8657 |
| Model 2     | **0.8909**    | 0.8575 | **0.8739** |
| Heuristic Combination     | 0.8516    | 0.8906 | 0.8706 |
| Highest Recall | 0.7672    | **0.9212** | 0.8372 |

Leaman, Robert, Chih-Hsuan Wei, and Zhiyong Lu. ["tmChem: a high performance approach for chemical named entity recognition and normalization."](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331693/) J. Cheminformatics 7.S-1 (2015): S3.
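
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The short sketch below recomputes it from the reported precision and recall values; the function name `f1_score` and the dictionary layout are assumptions for illustration. Small differences in the fourth decimal place are expected, because the table's precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Heuristic Combination": (0.8516, 0.8906, 0.8706),
    "Highest Recall": (0.7672, 0.9212, 0.8372),
}

for name, (p, r, f1) in reported.items():
    print("%-22s computed F1 = %.4f, reported F1 = %.4f" % (name, f1_score(p, r), f1))
```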