# Part 1: Extracting Chemical Named Entities

## 1. Obtaining Data

**ChemDNER Corpus v1.0**

The ChemDNER corpus consists of 10,000 PubMed abstracts and their corresponding label sets of named chemical entities. This data set is [publicly available](http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/) and can be downloaded directly using the shell script:

    load_data.sh

## 2. Loading Documents
For the extraction step, our goal is to get the highest recall possible. In cases where we have labeled data, it's easy to get a recall estimate for our extraction pipeline.

In [17]:
%load_ext autoreload
%autoreload 2

import re
import sys
import codecs
import operator
import itertools
from ddlite import *
from datasets import *
from utils import unescape_penn_treebank
from lexicons import AllUpperNounsMatcher, RuleTokenizedDictionaryMatch



In [25]:
parser = SentenceParser()
corpus = ChemdnerCorpus('datasets/chemdner_corpus/', parser=parser, 
                        cache_path="examples/cache/chemdner/")

# ChemDNER has pre-defined cross-validation folds -- use the first 100 training docs
dev_set = sorted(corpus.cv["training"].keys())[:100]

# load training documents and collapse all sentences into a single list
documents = {doc_id:(corpus[doc_id]["sentences"],corpus[doc_id]["tags"]) for doc_id in dev_set}
sentences, gold_entities = zip(*documents.values())
sentences = list(itertools.chain.from_iterable(sentences))
gold_entities = list(itertools.chain.from_iterable(gold_entities))

# summary statistics
gold_entity_n = len(list(itertools.chain.from_iterable(gold_entities)))
word_n = sum([len(sent.words) for sent in sentences])
print("%d PubMed abstracts" % len(documents))
print("%d ChemDNER gold entities" % gold_entity_n)
print("%d tokens" % word_n)

100 PubMed abstracts
743 ChemDNER gold entities
21529 tokens


## 3. Building Matchers

The easiest way to identify candidates is through simple string matching using a dictionary of known entity names. Curating good lexicons can take some time, so we use pre-existing dictionaries provided by the *tmChem* tagger and a UMLS dictionary of all *Substance* semantic types (see the UMLS notebook for instructions on how to create arbitrary dictionaries). The goal of matching is to achieve as high a recall as possible. In real-world applications, we can't compute true recall, so it's important to aim for broad coverage.

- **DictionaryMatch** Match to an existing dictionary of known entity names.

- **RegexMatch** Match words according to simple regular expressions. Here we just match Greek letters and simple patterns of the form -3,4- which tend to indicate chemical names.

- **RuleTokenizedDictionaryMatch** Match a dictionary under a different tokenization scheme (in this case we provide a whitespace tokenizer). The resulting labels are mapped back into our primary CoreNLP token offset space.

- **AllUpperNounsMatcher** (From the Gene Tagger example) Identify all uppercase nouns in text. 
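
To make the first two matcher types concrete, here is a minimal standalone sketch of dictionary and regex matching over a token list. It is not the ddlite implementation: the function names `dictionary_token_match` and `regex_token_match`, the `max_len` parameter, and the example sentence are all assumptions made for illustration.

```python
import re

def dictionary_token_match(tokens, lexicon, max_len=4, ignore_case=True):
    """Return (start, end) token spans whose space-joined text is in the lexicon.

    Hypothetical helper, not part of ddlite.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    terms = set(norm(t) for t in lexicon)
    spans = []
    for start in range(len(tokens)):
        # try every n-gram of up to max_len tokens starting at `start`
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if norm(" ".join(tokens[start:end])) in terms:
                spans.append((start, end))
    return spans

def regex_token_match(tokens, pattern):
    """Return single-token spans whose text contains a match for the pattern."""
    rgx = re.compile(pattern)
    return [(i, i + 1) for i, tok in enumerate(tokens) if rgx.search(tok)]

tokens = "Treatment with sodium chloride and 3,4-dihydroxyphenylalanine".split()
lexicon = ["Sodium chloride", "sodium"]

dict_spans = dictionary_token_match(tokens, lexicon)
regex_spans = regex_token_match(tokens, r"\d(,\d)*-")

# union of all matcher outputs, analogous in spirit to combining matchers
all_spans = sorted(set(dict_spans) | set(regex_spans))
for start, end in all_spans:
    print("%d-%d: %s" % (start, end, " ".join(tokens[start:end])))
```

Note that overlapping spans ("sodium" and "sodium chloride") are both kept. For candidate generation that is usually acceptable, since the goal at this stage is recall rather than precision.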

In [26]:
# tokenizer for matching within raw sentence text
def rule_tokenizer(s):
    s = re.sub("([,?!:;] )",r" \1",s)
    s = re.sub("([.]$)",r" .",s)
    return s.split()

# load dictionaries 
dict_fnames = ["datasets/dictionaries/chemdner/mention_chemical.txt",
               "datasets/dictionaries/chemdner/chebi.txt",
               "datasets/dictionaries/chemdner/addition.txt",
               "datasets/dictionaries/umls/substance-sab-all.txt"]

chemicals = []
for fname in dict_fnames:
    chemicals += [line.strip().split("\t")[0] for line in codecs.open(fname,"rU","utf-8").readlines()]

# remove stopwords
fname = "datasets/dictionaries/chemdner/stopwords.txt"
stopwords = {line.strip().split("\t")[0]:1 for line in open(fname,"rU").readlines()}
chemicals = {term:1 for term in chemicals if term not in stopwords}.keys()

# create matchers and extract candidates
extr1 = DictionaryMatch('C', chemicals, ignore_case=True)
extr2 = RuleTokenizedDictionaryMatch('C', chemicals, ignore_case=True, tokenizer=rule_tokenizer)
extr3 = RegexMatch('C',"[αβΓγΔδεϝζηΘθικΛλμνΞξοΠπρΣστυΦφχΨψΩω]+[-]+[A-Za-z]+", ignore_case=True)
extr4 = RegexMatch('C', "([-]*(\d[,]*)+[-])", ignore_case=True)
extr5 = RegexMatch('C',"[αβΔδη]+", ignore_case=True)
extr6 = AllUpperNounsMatcher('C')

In [27]:
matcher = MultiMatcher(extr1, extr2, extr3, extr4, extr5, extr6)

## 4. Extracting Candidate Entities
Once we have matchers, we want to generate and store our candidate entity set for later use in learning. (Note this can take a long time, which is why you should precompute candidates before moving to the learning stage.)

In [28]:
candidates = Entities(sentences, matcher)

# Crude recall estimate (ignores actual span match)
mentions = [" ".join(unescape_penn_treebank([e.words[i] for i in e.idxs])) for e in candidates]
gold_mentions = list(zip(*itertools.chain.from_iterable(gold_entities))[0])

for m in mentions:
    if m in gold_mentions:
        gold_mentions.remove(m)
tp = gold_entity_n - len(gold_mentions)

print("Found %d candidate entities" % len(candidates))
print("Candidates: %.2f%% of all tokens" % (len(candidates)/float(word_n) * 100))
print("Annotations %.2f%% of all tokens" % (gold_entity_n/float(word_n) * 100))
print("~recall: %.2f (%d/%d)" % (float(tp) / gold_entity_n, tp, gold_entity_n))

candidates.dump_candidates("examples/cache/chem-candidates.pkl")

Found 2971 candidate entities
Candidates: 13.80% of all tokens
Annotations 3.45% of all tokens
~recall: 0.72 (535/743)
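
The crude recall estimate treats mentions as a bag of strings: each gold mention counts as found if an unused candidate with the same text exists. A sketch of the same idea using `collections.Counter` is shown below. The function name `crude_recall` and the toy mention lists are assumptions for illustration; the counting logic is intended to match the remove-from-list loop above, while avoiding repeated list scans.

```python
from collections import Counter

def crude_recall(candidate_mentions, gold_mentions):
    """String-level recall that ignores span offsets (hypothetical helper).

    A gold string occurring n times is credited at most as many times
    as the same string occurs among the candidates.
    """
    cand = Counter(candidate_mentions)
    gold = Counter(gold_mentions)
    tp = sum(min(n, cand[m]) for m, n in gold.items())
    total = sum(gold.values())
    recall = tp / float(total) if total else 0.0
    return tp, total, recall

# toy data, not taken from the corpus
toy_candidates = ["GABA", "GABA", "cAMP", "water"]
toy_gold = ["GABA", "cAMP", "cAMP", "CO(2)"]

tp, total, recall = crude_recall(toy_candidates, toy_gold)
print("~recall: %.2f (%d/%d)" % (recall, tp, total))
```

Because offsets are ignored, a candidate in one sentence can be credited against a gold mention in another, so this number should be read as an optimistic estimate.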


In [32]:
candidates[23].render()

### Error Analysis
For distant supervision, we want our candidate set to have high recall. We currently fall short for chemical named entities. If we look at the gold standard annotations, we can see why our string matching misses some entities. Note how parentheses and other tokenization issues result in many missed entities.
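
The following sketch illustrates how bracket splitting can break an exact string comparison. The tokenizer here is a toy stand-in written for this example, not CoreNLP, and the whitespace-stripping normalization is only one possible workaround, not something this pipeline is known to do.

```python
import re

def toy_tokenize(text):
    """Toy tokenizer (assumption): split brackets and sentence punctuation
    into separate tokens, loosely imitating Penn Treebank style output."""
    text = re.sub(r"([()\[\]])", r" \1 ", text)
    text = re.sub(r"([,;:.])(\s|$)", r" \1\2", text)
    return text.split()

def strip_ws(s):
    """Remove all whitespace so tokenized and raw mentions can be compared."""
    return re.sub(r"\s+", "", s)

sentence = "Levels of CO(2) and Fe(II) were measured."
tokens = toy_tokenize(sentence)
print(tokens)

gold = "CO(2)"
candidate = " ".join(tokens[2:6])   # the four tokens covering the mention

print(candidate == gold)                        # exact comparison fails
print(strip_ws(candidate) == strip_ws(gold))    # whitespace-insensitive comparison succeeds
```

The mention is present in the sentence, but once the tokens are rejoined with spaces it no longer equals the gold string, which is the kind of miss counted as a tokenization error in the cell that follows.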

In [33]:
# What are we missing due to tokenization errors?
regexes = [re.compile("[αβΓγΔδεϝζηΘθικΛλμνΞξοΠπρΣστυΦφχΨψΩω]+[-]+[A-Za-z]+")]
regexes += [re.compile("([-]*(\d[,]*)+[-])")]
regexes += [re.compile("[αβΔδη]+")]

# regular expression matches
def regex_match(t):
    for regex in regexes:
        if regex.search(t):
            return True
    return False
            
tokenization_errors = [term for term in gold_mentions if term in chemicals or regex_match(term)]
tokenization_errors = {term:tokenization_errors.count(term) for term in tokenization_errors}
oov_errors = [term for term in gold_mentions if term not in tokenization_errors]
oov_errors = {term:oov_errors.count(term) for term in oov_errors}

print("Est. Tokenization Errors: %d" % (sum(tokenization_errors.values())))
print("Est. Out-of-vocabulary Errors: %d" % (sum(oov_errors.values())))

Est. Tokenization Errors: 83
Est. Out-of-vocabulary Errors: 125


We see that about 40% of our errors (83 of 208) stem from tokenization issues. If we fixed all of those errors, we would have ~0.83 recall on the development set ((535 + 83) / 743). If we look at the OOV mentions we missed, we see there is considerable room for refining regular expressions to identify mentions like NaAsO(2) or FeSe, which consist only of element symbols.
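
As one possible refinement, a regular expression can be built from the periodic table to catch tokens made up purely of element symbols with optional counts or charges. This is a sketch under assumptions: the names `ELEMENTS`, `FORMULA`, and `looks_like_formula` are hypothetical, and the requirement of at least two element units is a design choice made here to avoid matching ordinary words such as "In" or "As".

```python
import re

# Symbols of the 118 named elements
ELEMENTS = (
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni "
    "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I "
    "Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt "
    "Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
    "Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og"
).split()

# Two-letter symbols first, so "Na" is tried before "N"
symbol = "|".join(sorted(ELEMENTS, key=len, reverse=True))

# One unit = element symbol + optional count ("2") or bracketed count/charge ("(2)", "(2+)")
unit = r"(?:%s)(?:\d+|\(\d+[+-]?\))?" % symbol

# A formula-like token = two or more units and nothing else (case-sensitive)
FORMULA = re.compile(r"^(?:%s){2,}$" % unit)

def looks_like_formula(token):
    """Return True if the token consists only of element units (hypothetical helper)."""
    return FORMULA.match(token) is not None

for tok in ["NaAsO(2)", "FeSe", "CO(2)", "GABA", "Al", "HOW"]:
    print("%-10s %s" % (tok, looks_like_formula(tok)))
```

The pattern accepts NaAsO(2), FeSe, and CO(2) and rejects GABA, but it also accepts all-caps words that happen to spell out single-letter symbols, such as "HOW" (H, O, W). Single elements such as "Al" are deliberately not matched. Like the other matchers, it trades precision for recall and relies on the later learning stage to filter false positives.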

In [34]:
# print out our out of vocabulary terms
for term, count in sorted(oov_errors.items(), key=operator.itemgetter(1), reverse=True):
    print("%s: %d" % (term, count))

CO(2): 10
PCAHs: 6
Thiovit: 4
H2 O2: 4
withanolide A: 4
GABA: 4
steroidal saponins: 3
furostanol: 3
RAP: 3
mimulone B: 2
Glaucogenin E: 2
cAMP: 2
vicinal diol: 2
Al: 2
pyrimidinedione: 2
PLA: 2
Ser: 2
Fe(II): 2
C-geranylated flavonoids: 2
fluoro-indomethacin: 2
thiobarbituric: 1
Tween-80: 1
ixoroside: 1
phosphoinositol(3,4)P2: 1
ethoxyresorufine: 1
decanoate salt: 1
SDS: 1
Nepetanudoside B: 1
Zn(2+): 1
gamma-amino-butyric-acid: 1
sodium dodecyl sulphate polyacrylamide: 1
tetrabrombisphenol A: 1
benzoflouroanthene: 1
Tomentomimulol: 1
oxygenated monoterpenes: 1
betulon aldehyde: 1
N-acetylgalactosamine: 1
nickel-sulfate: 1
arachidonic (C20 : 4ω-6) and eicosapentaenoic (C20 : 5ω-3) acids: 1
polyhydroxyl: 1
Polymethoxylated flavones: 1
methanolic potassium hydroxide: 1
poly (ADP-ribose): 1
phosphoinositol(3,4,5)P3: 1
Trans-cyclo-(D-tryptophanyl-L-tyrosyl): 1
vitamin E: 1
poly(D,L-lactide): 1
acetal triterpenes: 1
3,3'-di-O-methylellagic acid: 1
tomentomimulol: 1
Grignard reagent: 1
phosph

### Example Best-in-class Tagger

The winning system in the 2013 BioCreative IV CHEMDNER task was tmChem, which used two linear-chain conditional random fields (CRFs) with different tokenization approaches and feature sets.

| Model       | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| Model 1     | 0.8595    | 0.8721 | 0.8657 |
| Model 2     | **0.8909**    | 0.8575 | **0.8739** |
| Heuristic Combination     | 0.8516    | 0.8906 | 0.8706 |
| Highest Recall | 0.7672    | **0.9212** | 0.8372 |

Leaman, Robert, Chih-Hsuan Wei, and Zhiyong Lu. ["tmChem: a high performance approach for chemical named entity recognition and normalization."](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331693/) J. Cheminformatics 7.S-1 (2015): S3.
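
For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes F1 from the precision and recall columns of the table. The helper name `f1_score` and the dictionary layout are assumptions for illustration; small differences in the fourth decimal place are expected because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above
reported = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Heuristic Combination": (0.8516, 0.8906, 0.8706),
    "Highest Recall": (0.7672, 0.9212, 0.8372),
}

for name, (p, r, f1) in sorted(reported.items()):
    print("%-22s reported F1 = %.4f, recomputed F1 = %.4f" % (name, f1, f1_score(p, r)))
```

The highest-recall configuration gives up roughly ten points of precision relative to the two single models, which is the same trade-off candidate extraction makes deliberately.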