# Part 1: Extracting Chemical Named Entities

## 1. Obtaining Data

**ChemDNER Corpus v1.0**

The ChemDNER corpus consists of 10,000 PubMed abstracts and their corresponding label sets of named chemical entities. This data set is [publicly available](http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/) and can be downloaded directly using the shell script:

    load_data.sh

## 2. Loading Documents
For the extraction step, our goal is the highest possible recall. When labeled data is available, it is easy to estimate the recall of our extraction pipeline.

In [1]:
%load_ext autoreload
%autoreload 2

import re
import sys
import umls
import codecs
import operator
import itertools
from ddlite import *
from datasets import *
from utils import unescape_penn_treebank
from lexicons import AllUpperNounsMatcher,RuleTokenizedDictionaryMatch

In [2]:
parser = SentenceParser()
corpus = ChemdnerCorpus('datasets/chemdner_corpus/', parser=parser, 
                        cache_path="examples/cache/")

# ChemDNER has pre-defined cross-validation folds
dev_set = corpus.cv["development"].keys()

# load development-set documents and collapse all sentences into a single list
documents = {doc_id:(corpus[doc_id]["sentences"],corpus[doc_id]["tags"]) for doc_id in dev_set}
sentences, gold_entities = zip(*documents.values())
sentences = list(itertools.chain.from_iterable(sentences))
gold_entities = list(itertools.chain.from_iterable(gold_entities))

# summary statistics
gold_entity_n = len(list(itertools.chain.from_iterable(gold_entities)))
word_n = sum([len(sent.words) for sent in sentences])
print("%d PubMed abstracts" % len(documents))
print("%d ChemDNER gold entities" % gold_entity_n)
print("%d tokens" % word_n)


3500 PubMed abstracts
29526 ChemDNER gold entities
804690 tokens


## 3. Building Matchers

The easiest way to identify candidates is through simple string matching using a dictionary of known entity names. Curating good lexicons can take some time, so we use pre-existing dictionaries provided by the *tmChem* tagger and a UMLS dictionary of all *Substance* semantic types (see the UMLS notebook for instructions on how to create arbitrary dictionaries). The goal of matching is to achieve as high a recall as possible. In real-world applications, we can't compute true recall, so it's important to aim for broad coverage.

- **DictionaryMatch** Match to an existing dictionary of known entity names.

- **RegexMatch** Match words according to simple regular expressions. Here we just match Greek letters and simple patterns of the form -3,4- which tend to indicate chemical names.

- **RuleTokenizedDictionaryMatch** Match a dictionary under a different tokenization scheme (in this case we provide a whitespace tokenizer). The resulting labels are mapped back into our primary CoreNLP token space.

- **AllUpperNounsMatcher** (From the Gene Tagger example) Identify all uppercase nouns in text. 
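
To make the first two matcher types concrete, here is a minimal, self-contained sketch of dictionary and regex matching over a token list. It is not the ddlite implementation: the function names `dictionary_match` and `regex_match_tokens`, the `max_len` parameter, and the example sentence and lexicon are all assumptions made for illustration. Only the two regular expressions are taken from this notebook.

```python
# -*- coding: utf-8 -*-
import re

def dictionary_match(tokens, lexicon, max_len=4, ignore_case=True):
    """Return (start, end) token spans whose space-joined text is in the lexicon."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    terms = set(norm(t) for t in lexicon)
    spans = []
    for start in range(len(tokens)):
        # try every n-gram of up to max_len tokens beginning at `start`
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if norm(" ".join(tokens[start:end])) in terms:
                spans.append((start, end))
    return spans

def regex_match_tokens(tokens, pattern):
    """Return single-token spans for every token the pattern matches."""
    rx = re.compile(pattern)
    return [(i, i + 1) for i, tok in enumerate(tokens) if rx.search(tok)]

# Hypothetical sentence and lexicon, for illustration only
tokens = ("Treatment with sodium arsenite and β-carotene reduced "
          "3,4-dihydroxyphenylacetic acid").split()
lexicon = ["Sodium arsenite", "acid"]

dict_spans = dictionary_match(tokens, lexicon)
greek_spans = regex_match_tokens(
    tokens, u"[αβΓγΔδεϝζηΘθικΛλμνΞξοΠπρΣστυΦφχΨψΩω]+[-]+[A-Za-z]+")
locant_spans = regex_match_tokens(tokens, r"([-]*(\d[,]*)+[-])")

for name, spans in [("dictionary", dict_spans), ("greek", greek_spans),
                    ("locant", locant_spans)]:
    print(name, [" ".join(tokens[s:e]) for s, e in spans])
```

Each matcher only proposes spans. Overlapping or spurious candidates are expected at this stage because the goal is recall rather than precision.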

In [7]:
# tokenizer for matching within raw sentence text
def rule_tokenizer(s):
    s = re.sub("([,?!:;] )",r" \1",s)
    s = re.sub("([.]$)",r" .",s)
    return s.split()

# dictionaries from tmChem & the UMLS
dict_fnames = ["datasets/dictionaries/chemdner/mention_chemical.txt",
               "datasets/dictionaries/chemdner/chebi.txt",
               "datasets/dictionaries/chemdner/addition.txt",
               "datasets/dictionaries/umls/substance-sab-all.txt",
               "datasets/dictionaries/chemdner/train.chemdner.vocab.txt"]

chemicals = []
for fname in dict_fnames:
    # the first tab-separated column holds the entity name
    with open(fname, "rU") as fp:
        chemicals += [line.strip().split("\t")[0] for line in fp]
chemicals = {term:1 for term in chemicals}

# create matchers and extract candidates
extr1 = DictionaryMatch('C', chemicals, ignore_case=True)
extr2 = RuleTokenizedDictionaryMatch('C', chemicals, ignore_case=True, 
                                     tokenizer=rule_tokenizer)
extr3 = RegexMatch('C',"[αβΓγΔδεϝζηΘθικΛλμνΞξοΠπρΣστυΦφχΨψΩω]+[-]+[A-Za-z]+", 
                   ignore_case=True)
extr4 = RegexMatch('C', r"([-]*(\d[,]*)+[-])", ignore_case=True)
extr5 = AllUpperNounsMatcher('C')
matcher = MultiMatcher(extr1,extr2,extr3,extr4,extr5)
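
Matching under a second tokenization only helps if the matched span can be projected back onto the primary tokens. One common way to do this is through character offsets, sketched below. This is an assumption about how such a mapping could work, not a description of what `RuleTokenizedDictionaryMatch` does internally. The helper names `char_offsets` and `map_span` are hypothetical, and the primary tokens are written as literal characters, whereas CoreNLP would emit escaped forms such as `-LRB-` for parentheses.

```python
def char_offsets(tokens, text):
    """Locate each token in the raw text and return its (start, end) offsets."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return offsets

def map_span(span, src_offsets, dst_offsets):
    """Map a (start, end) token span from one tokenization onto the indices
    of all tokens in another tokenization that overlap the same characters."""
    c0 = src_offsets[span[0]][0]
    c1 = src_offsets[span[1] - 1][1]
    return [i for i, (s, e) in enumerate(dst_offsets) if s < c1 and e > c0]

text = "NaAsO(2) was added"
ws_tokens = text.split()                                  # whitespace tokenization
primary_tokens = ["NaAsO", "(", "2", ")", "was", "added"]  # assumed primary tokenization

ws_offsets = char_offsets(ws_tokens, text)
primary_offsets = char_offsets(primary_tokens, text)

# "NaAsO(2)" is one whitespace token but four primary tokens
mapped = map_span((0, 1), ws_offsets, primary_offsets)
print([primary_tokens[i] for i in mapped])
```

The overlap test (`s < c1 and e > c0`) keeps any primary token that shares at least one character with the matched span, which fits a recall-oriented candidate generator.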

## 4. Extracting Candidate Entities
Once we have matchers, we want to generate and store our candidate entity set for later use in learning. (Note this can take a long time, which is why you should precompute candidates before moving to the learning stage.)

In [8]:
candidates = Entities(sentences, matcher)

# Crude recall estimate (ignores actual span match)
mentions = [" ".join(unescape_penn_treebank([e.words[i] for i in e.idxs])) for e in candidates]
gold_mentions = list(zip(*itertools.chain.from_iterable(gold_entities))[0])

for m in mentions:
    if m in gold_mentions:
        gold_mentions.remove(m)
tp = gold_entity_n - len(gold_mentions)

print("Found %d candidate entities" % len(candidates))
print("Candidates: %.2f%% of all tokens" % (len(candidates)/float(word_n) * 100) )
print("Annotations %.2f%% of all tokens" % (gold_entity_n/float(word_n) * 100) )
print("~recall: %.2f (%d/%d)" % (float(tp) / gold_entity_n, tp, gold_entity_n))

candidates.dump_candidates("examples/candidates.pkl")

Found 189276 candidate entities
Candidates: 23.52% of all tokens
Annotations 3.67% of all tokens
~recall: 0.74 (21771/29526)
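
The crude recall estimate above removes each matched mention from a list, which costs a linear scan per candidate. The same count can be obtained as a multiset intersection with `collections.Counter`. The sketch below uses a hypothetical helper `crude_recall` and toy mention lists, and like the original estimate it compares strings only and ignores span positions.

```python
from collections import Counter

def crude_recall(candidate_mentions, gold_mentions):
    """String-level recall: each gold mention is matched at most once."""
    cand = Counter(candidate_mentions)
    gold = Counter(gold_mentions)
    # `&` keeps the minimum count per string, i.e. the multiset intersection
    matched = sum((cand & gold).values())
    total = sum(gold.values())
    recall = float(matched) / total if total else 0.0
    return matched, total, recall

# Toy data for illustration only
cand = ["aspirin", "aspirin", "DNA", "glucose"]
gold = ["aspirin", "glucose", "glucose", "NaAsO(2)"]

tp, n, recall = crude_recall(cand, gold)
print("~recall: %.2f (%d/%d)" % (recall, tp, n))
```

The unmatched gold mentions, which the error analysis needs, are available as `gold - cand` on the two counters.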


### Error Analysis
For distant supervision, we want our candidate set to have high recall. We currently fall short for chemical named entities. If we look at the gold standard annotations, we can see why our string matching misses some entities. Note how parentheses and other tokenization issues result in many missed entities.

In [9]:
# What are we missing due to tokenization errors?
regexes = [re.compile("[αβΓγΔδεϝζηΘθικΛλμνΞξοΠπρΣστυΦφχΨψΩω]+[-]+[A-Za-z]+")]
regexes += [re.compile(r"([-]*(\d[,]*)+[-])")]

def regex_match(t):
    for regex in regexes:
        if regex.search(t):
            return True
    return False

tokenization_errors = [term for term in gold_mentions if term in chemicals or regex_match(term)]
tokenization_errors = {term:tokenization_errors.count(term) for term in tokenization_errors}
oov_errors = [term for term in gold_mentions if term not in tokenization_errors]
oov_errors = {term:oov_errors.count(term) for term in oov_errors}

print("Est. Tokenization Errors: %d" % (sum(tokenization_errors.values())))
print("Est. Out-of-vocabulary Errors: %d" % (sum(oov_errors.values())))

Est. Tokenization Errors: 3841
Est. Out-of-vocabulary Errors: 3914


We see that almost half of our errors (3,841 of 7,755) stem from tokenization issues. If we fixed all of them, we would reach ~0.87 recall on the development set. Looking at the OOV mentions we missed, there is considerable room for refining the regular expressions to identify mentions like NaAsO(2) or FeSe, which consist solely of element symbols.
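
The ~0.87 figure follows directly from the counts printed in the cells above. The short check below hard-codes those reported numbers (the variable names are chosen for illustration) and recomputes the split of misses and the recall we would reach if every tokenization error were fixed.

```python
# Counts copied from the notebook output above
gold_entity_n = 29526        # gold entities in the development set
tp = 21771                   # gold mentions recovered by the matchers
tokenization_errors = 3841   # misses that are in a dictionary or match a regex
oov_errors = 3914            # misses with no dictionary or regex support

missed = gold_entity_n - tp
# the two error categories partition the misses
assert missed == tokenization_errors + oov_errors

current_recall = float(tp) / gold_entity_n
tokenization_share = float(tokenization_errors) / missed
recall_if_fixed = float(tp + tokenization_errors) / gold_entity_n

print("current recall: %.2f" % current_recall)
print("share of misses due to tokenization: %.1f%%" % (100 * tokenization_share))
print("recall if tokenization errors were fixed: %.2f" % recall_if_fixed)
```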

In [10]:
# print out our out of vocabulary terms
for term, count in sorted(oov_errors.items(), key=operator.itemgetter(1), reverse=True):
    print("%s: %d" % (term, count))

DMAs(V): 14
CP-778 875: 12
RuBPY: 11
GnIH: 11
Δ(9)-THC: 11
Cramb816: 10
NaAsO(2): 10
25(OH)D(3): 10
nC(60): 10
EVn-50: 10
CPF: 9
BzP: 8
iron-sulfur: 8
NSC 710305: 8
FeSe: 8
acyl glucuronides: 8
Cisp: 8
4'G-RSV: 8
Polycalcium: 7
PEI 423: 7
SCN(-): 7
6His: 7
l-Pro: 7
ENs: 7
rGO: 7
TZDs: 7
[(18)F]DPA-714: 7
dihydrotanshinone I: 7
furanic: 7
(177)Lu-DOTA-GGNle-CycMSHhex: 7
Co(3)O(4): 7
alkamides: 7
O(3): 7
(+)-naloxone: 7
(-)-cannabidiol: 7
PF(6)(-): 6
(-)-NPA: 6
(-)-xanthatin: 6
Co(II)(Ch): 6
PEO-PPO-PEO: 6
(19Z)-HCA: 6
PPy: 6
stearoyl-PEG-polySDM: 6
Lu AF21934: 6
BpyAla: 6
pentacyclic triterpenoids: 6
D-fagomine: 6
FLC: 6
1,25(OH)(2)D: 6
Forum: 6
DMAs(III): 6
Pyrimidyn: 6
cyclic-AMP: 6
Ang II: 6
Bzf: 5
LNG: 5
Gd-PSQ: 5
Au-Ag: 5
cycloartane triterpenoids: 5
N-methylconiine: 5
HO: 5
CoQ10: 5
HOD(+): 5
ertugliflozin: 5
PEG12: 5
TFs: 5
enterolignans: 5
BrO3 (-): 5
ZnPd: 5
aromadendrine: 5
h-BN: 5
CuSO(4): 5
αKG: 5
CrB4: 5
TiO 2: 5
(68) Ga: 5
ONOO(-): 5
(+)-sattabacin: 5
AlClPc: 5
DOM: 5
Ag-A
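
Several of the missed mentions listed above, such as NaAsO(2), FeSe, and CuSO(4), are built entirely from element symbols. A possible refinement is a regular expression that accepts two or more element symbols, each with an optional count, followed by an optional charge. The sketch below is one way to write it, not part of the original pipeline. `ELEMENTS` is a deliberately small, illustrative subset of the periodic table, and the names `FORMULA` and `looks_like_formula` are hypothetical.

```python
import re

# Illustrative subset of element symbols (assumption: extend as needed)
ELEMENTS = ["H", "B", "C", "N", "O", "F", "Na", "Mg", "Al", "Si", "P", "S",
            "Cl", "K", "Ca", "Ti", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
            "As", "Se", "Br", "Pd", "Ag", "I", "Pt", "Au"]

# Longest symbols first so that "Na" is tried before "N"
SYMBOL = "|".join(sorted(ELEMENTS, key=len, reverse=True))
# One element with an optional count written as "(2)" or "2"
UNIT = r"(?:%s)(?:\(\d+\)|\d+)?" % SYMBOL
# Optional trailing charge such as "(-)" or "(2+)"
CHARGE = r"(?:\(\d*[+-]\))?"
FORMULA = re.compile(r"(?:%s){2,}%s" % (UNIT, CHARGE))

def looks_like_formula(term):
    """True if the whole term is a run of two or more element units."""
    return FORMULA.fullmatch(term) is not None

for term in ["NaAsO(2)", "FeSe", "CuSO(4)", "Co(3)O(4)", "SCN(-)",
             "PF(6)(-)", "CrB4", "ZnPd", "DOM", "Forum", "O(3)"]:
    print("%-10s %s" % (term, looks_like_formula(term)))
```

Requiring at least two element units avoids firing on single capital letters, at the cost of missing single-element mentions such as O(3). Abbreviations that happen to spell element symbols (CPF, HO) also match, so a pattern like this is best treated as a high-recall candidate generator whose output is filtered later.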

### Example Best-in-class Tagger

The winning system in the 2013 BioCreative IV CHEMDNER task was tmChem, which used two linear-chain conditional random fields (CRFs) with different tokenization approaches and feature sets.

| Model       | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| Model 1     | 0.8595    | 0.8721 | 0.8657 |
| Model 2     | **0.8909**    | 0.8575 | **0.8739** |
| Heuristic Combination     | 0.8516    | 0.8906 | 0.8706 |
| Highest Recall | 0.7672    | **0.9212** | 0.8372 |

Leaman, Robert, Chih-Hsuan Wei, and Zhiyong Lu. ["tmChem: a high performance approach for chemical named entity recognition and normalization."](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331693/) J. Cheminformatics 7.S-1 (2015): S3.
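
For reference, F1 is the harmonic mean of precision and recall, so the last column of the table can be recomputed from the first two. The check below copies the reported values from the table. The helper `f1_score` and the `reported` dictionary are names chosen for illustration. Small differences in the fourth decimal place are expected because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above
reported = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Heuristic Combination": (0.8516, 0.8906, 0.8706),
    "Highest Recall": (0.7672, 0.9212, 0.8372),
}

for name, (p, r, f1) in sorted(reported.items()):
    print("%-22s reported=%.4f recomputed=%.4f" % (name, f1, f1_score(p, r)))
```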