# Tagging Chemical Named Entities with ddlite

## Introduction

As with the gene tagging example, this task is broken down into several steps:

1. Obtain and parse input data (ChemDNER Corpus PubMed abstracts)
2. Extract candidates for tagging
3. Generate features
4. Write distant supervision rules
5. Learn the tagging model

## 1. Obtaining Data

**ChemDNER Corpus v1.0**

The ChemDNER corpus is used to evaluate systems for identifying chemical names in biomedical literature. The corpus consists of 10,000 PubMed abstracts and their corresponding label sets of named chemical entities. The data is broken down as follows:

* 3500 Training
* 3500 Development
* 3000 Evaluation (Testing)

This data set is [publicly available](http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/) and can be downloaded directly using the shell script below. The script downloads and extracts all files into your dd-biolib datasets directory.

    load_data.sh
    

## 2. Extracting Candidate Mentions

In [1]:
import umls
from ddlite import *
from datasets import *

Once we have our data, we need to preprocess our documents. The SentenceParser class is a wrapper for CoreNLP and parses documents into tagged sentences (i.e., words, lemmas, POS tags, dependency parents, and dependency labels). This takes a bit of time, so the ChemdnerCorpus object will cache parsed files to disk after a document is accessed for the first time.  

**Note**: This is a standard preprocessing step for all DeepDive applications. Theoretically, you could swap out the CoreNLP parser for something else here if you wished, e.g., [spacy.io](https://spacy.io) (faster, less accurate), [cTAKES](http://ctakes.apache.org) or [NLP4J](https://github.com/emorynlp/nlp4j) (for clinical text tagging).
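
The parse-once-then-cache behaviour described above can be pictured with a small standalone sketch. This is not the `ChemdnerCorpus` implementation: the class name `CachedParser`, the pickle file layout, and the toy parser are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class CachedParser(object):
    """Hypothetical wrapper: parse a document once, then reuse the pickled result."""

    def __init__(self, parse_fn, cache_dir):
        self.parse_fn = parse_fn    # any callable: raw text -> parsed sentences
        self.cache_dir = cache_dir
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)

    def get(self, doc_id, text):
        path = os.path.join(self.cache_dir, "%s.pkl" % doc_id)
        if os.path.exists(path):
            # Cache hit: skip the expensive parse entirely
            with open(path, "rb") as f:
                return pickle.load(f)
        parsed = self.parse_fn(text)
        with open(path, "wb") as f:
            pickle.dump(parsed, f)
        return parsed


calls = []  # records how often the (slow) parser actually runs


def toy_parse(text):
    """Stand-in for a real parser such as CoreNLP: whitespace tokens per sentence."""
    calls.append(text)
    return [sentence.split() for sentence in text.split(". ")]


cache = CachedParser(toy_parse, tempfile.mkdtemp())
first = cache.get("doc1", "Aspirin inhibits COX. It is an NSAID")
second = cache.get("doc1", "Aspirin inhibits COX. It is an NSAID")
```

The second call returns the same sentences without invoking the parser again, which is why only the first access to a document is slow.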

The ChemDNER corpus defines cross-validation sets, so let's create a dictionary of the first 10 training documents as well as a list of all true chemical entity mentions in those documents.

In [76]:
parser = SentenceParser(absolute_path=True)
corpus = ChemdnerCorpus('datasets/chemdner_corpus/', parser=parser)

# load the first 10 training documents and collapse all sentences into a single list
pmids = list(corpus.cv["training"].keys())[:10]
documents = {pmid:corpus[pmid]["sentences"] for pmid in pmids}
sentences = reduce(lambda x,y:x+y, documents.values())

# load gold annotation tags
annotations = [corpus.annotations[pmid] for pmid in pmids if pmid in corpus.annotations]
annotations = reduce(lambda x,y:x+y, annotations)
annotations = [a.text for a in annotations]
print("%d true chemical entity mentions" % len(annotations))

85 true chemical entity mentions


### Dictionary Matching

The easiest way to identify candidates is through simple string matching using a dictionary of known entity names. Curating good lexicons can take some time, so we use pre-existing dictionaries provided by the *tmChem* tagger.

In [64]:
# dictionaries from tmChem
dict_fnames = ["datasets/dictionaries/chemdner/mention_chemical.txt",
              "datasets/dictionaries/chemdner/chebi.txt",
              "datasets/dictionaries/chemdner/addition.txt"]
chemicals = []
for fname in dict_fnames:
    chemicals += [line.split("\t")[0] for line in open(fname,"rU").readlines()]

extractor = DictionaryMatch('C', chemicals, ignore_case=True)

candidates = Entities(extractor, sentences)
mentions1 = [" ".join([e.words[i] for i in e.idxs]) for e in candidates.entities]

# Crude estimate of recall: compares mention strings only and ignores span positions
m = len([term for term in annotations if term in mentions1])

print("Found %d candidate entities" % len(candidates.entities))
print("Recall: %.2f" % (float(m) / len(annotations)))

Found 167 candidate entities
Recall: 0.65
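
To make the idea of dictionary matching concrete outside of ddlite, here is a minimal sketch of a case-insensitive, longest-match-first scan over a tokenized sentence. It is not the `DictionaryMatch` implementation; the function name, the `max_len` parameter, and the greedy non-overlapping strategy are assumptions chosen for illustration.

```python
def dictionary_match(tokens, lexicon, max_len=4, ignore_case=True):
    """Return (start, end) token spans (end exclusive) whose text is in the lexicon.

    Greedy longest-match-first scan, so matched spans never overlap.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    terms = set(norm(t) for t in lexicon)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at position i first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = norm(" ".join(tokens[i:i + n]))
            if phrase in terms:
                spans.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


lexicon = ["cholesterol", "amino acid", "chloroquine"]
tokens = "Chloroquine lowered cholesterol and amino acid levels".split()
spans = dictionary_match(tokens, lexicon)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
```

Because matching operates on tokens, a lexicon entry is only found when the tokenizer splits the text the same way the entry is written.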


### UMLS Matching
Straight dictionary matching recovers only about 65% of the gold mentions in our small subset. Rather than using existing dictionary files, we can interface directly with the UMLS and find string matches by semantic type. Note that this approach only uses ontologies curated by the UMLS, which we specify using the source_vocab parameter. By default, UmlsMatch uses all curated ontologies, which results in large candidate entity sets. Here we restrict to 3 ontologies that provide reasonable coverage: RxNorm, SNOMED CT, and MeSH (Medical Subject Headings).

In [65]:
extractor = umls.UmlsMatch('C', semantic_types=["Substance"], 
                           source_vocab=["RXNORM","SNOMEDCT_US","MSH"], ignore_case=True)

candidates = Entities(extractor, sentences)
mentions2 = [" ".join([e.words[i] for i in e.idxs]) for e in candidates.entities]

m = len([term for term in annotations if term in mentions2])

print("Found %d candidate entities" % len(candidates.entities))
print("Recall: %.2f" % (float(m) / len(annotations)))

Found 212 candidate entities
Recall: 0.65


In [67]:
# Merging candidates doesn't really lead to much improvement
mentions = mentions1 + mentions2
m = len([term for term in annotations if term in mentions])
print("Found %d candidate entities" % len(mentions))
print("Recall: %.2f" % (float(m) / len(annotations)))


Found 379 candidate entities
Recall: 0.68
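
The recall figures above compare mention strings only, and the merged list counts a mention twice when both extractors find it. A stricter evaluation compares exact spans. The sketch below assumes each mention can be represented as a `(doc_id, char_start, char_end)` tuple; that representation and the toy data are hypothetical, not what ddlite exposes.

```python
def span_scores(gold, predicted):
    """Exact-span precision and recall over (doc_id, start, end) tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that match a gold span exactly
    precision = float(tp) / len(predicted) if predicted else 0.0
    recall = float(tp) / len(gold) if gold else 0.0
    return precision, recall


# Toy data: three gold spans and the output of two extractors
gold = [("doc1", 0, 11), ("doc1", 30, 41), ("doc2", 5, 8)]
pred_dict = [("doc1", 0, 11), ("doc1", 50, 57)]
pred_umls = [("doc1", 0, 11), ("doc2", 5, 8), ("doc2", 20, 27)]

# A set union removes candidates found by both extractors
merged = set(pred_dict) | set(pred_umls)
precision, recall = span_scores(gold, merged)
```

Using sets means a candidate proposed by both extractors is counted once, and a candidate only counts as correct when its boundaries agree with the gold annotation.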


### Error Analysis
Recall remains modest (0.68 after merging), and merging candidates provides little improvement. Looking at the gold standard annotations shows why our string matching misses some entities.

In [72]:
missed = [term for term in annotations if term not in mentions]
missed = {term:missed.count(term) for term in missed}
for term in sorted(missed,key=len):
    print("%s : %d" % (term,missed[term]))

NAD : 1
(1)H : 1
(31)P : 1
C2H3OH : 1
hydroxy : 4
UVI2008 : 1
Hydroxy : 1
4-methyl : 1
carboxyl : 1
chloroquine : 1
cholesterol : 2
l-amino acid : 2
mallotophenone : 1
organochlorine : 3
carbonyl di-imidazole : 1
dimeric phloroglucinols : 2
mallotojaponins B (1) and C (2) : 1
(22R)-hydroxylanosta-7,9(11),24-trien-3-one : 1
6,8-dihydroxy-3-methyl-3,4-dihydroisocoumarin : 1


**Parsing Problems** CoreNLP isn't trained on biomedical text by default, so tokenization breaks complicated chemical names that contain hyphens or parentheses. In each pair below, the first line is the gold mention and the second is its tokenized form.

* 6,8-dihydroxy-3-methyl-3,4-dihydroisocoumarin
* 6,8-dihydroxy-3-methyl-3 ,4 - dihydroisocoumarin

* (22R)-hydroxylanosta-7,9(11),24-trien-3-one
* ( 22R ) - hydroxylanosta-7 ,9 ( 11 ) ,24 - trien-3-one

*tmChem* addresses this issue by applying regular expressions to the original (unparsed) text. A RegexMatch class that handles these cases will be implemented shortly.
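
As a rough sketch of what matching on the raw text could look like, the snippet below runs a single regular expression over an untokenized string. The pattern is an illustrative assumption, not one of tmChem's expressions and not the planned RegexMatch class: it accepts any whitespace-delimited chunk that contains a digit, then a hyphen, then a run of at least three letters.

```python
import re

# Illustrative pattern only: digit ... hyphen ... three or more letters,
# all inside one whitespace-delimited chunk of the raw text.
SYSTEMATIC_NAME = re.compile(r"\S*\d\S*-\S*[A-Za-z]{3,}\S*")


def find_systematic_names(text):
    """Return (start, end, surface) character spans matched on the raw text."""
    results = []
    for m in SYSTEMATIC_NAME.finditer(text):
        surface = m.group().rstrip(".,;:")  # drop trailing sentence punctuation
        results.append((m.start(), m.start() + len(surface), surface))
    return results


raw = ("Compounds such as 6,8-dihydroxy-3-methyl-3,4-dihydroisocoumarin and "
       "(22R)-hydroxylanosta-7,9(11),24-trien-3-one were isolated.")
hits = find_systematic_names(raw)
names = [surface for _, _, surface in hits]
```

Because the match runs before tokenization, both names come back intact. A pattern this loose also accepts strings such as "2-fold", so it is best treated as a high-recall candidate generator whose false positives are filtered later.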


## 3. Generating Features
After we get our recall as high as possible, we need to generate features for each mention so that we can distinguish true entity mentions from false ones. In ddlite, this step is simple and automated; remember that the goal of ddlite is rapid prototyping of distant supervision rules, not feature engineering!

In [74]:
candidates.extract_features()
print("Extracted {} features for each of {} mentions".format(*candidates.feats.shape))

Extracted 6627 features for each of 212 mentions
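
ddlite generates its features automatically, and the exact feature templates are not shown here. To give a feel for what "features for each mention" can mean, the following sketch builds a few binary string features from a mention and its context window and arranges them into a small matrix. The function name and the feature templates are hypothetical and are not ddlite's.

```python
def mention_features(tokens, start, end, window=2):
    """Binary string features for the mention tokens[start:end] (end exclusive)."""
    feats = set()
    mention = tokens[start:end]
    feats.add("WORDS_" + "_".join(w.lower() for w in mention))
    feats.add("LENGTH_%d" % len(mention))
    if any(ch.isdigit() for w in mention for ch in w):
        feats.add("HAS_DIGIT")
    # Context words up to `window` positions to the left and right
    for k in range(1, window + 1):
        if start - k >= 0:
            feats.add("LEFT_%d_%s" % (k, tokens[start - k].lower()))
        if end - 1 + k < len(tokens):
            feats.add("RIGHT_%d_%s" % (k, tokens[end - 1 + k].lower()))
    return feats


tokens = "Serum cholesterol levels were reduced".split()
candidate_spans = [(1, 2), (0, 1)]  # hypothetical candidate mentions

rows = [mention_features(tokens, s, e) for s, e in candidate_spans]
vocab = sorted(set().union(*rows))  # one column per distinct feature
matrix = [[1 if f in row else 0 for f in vocab] for row in rows]
```

Each row of `matrix` describes one candidate. With many candidates and richer templates, the number of columns grows quickly, which is consistent with the thousands of features reported above.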


## 4. Distant Supervision Rules


## 5. Learning the tagging model

### Example Best-in-class Tagger

The winning system in the 2013 BioCreative IV CHEMDNER task was tmChem, which used two linear-chain conditional random fields (CRFs) with different tokenization approaches and feature sets.

| Model       | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| Model 1     | 0.8595    | 0.8721 | 0.8657 |
| Model 2     | **0.8909**    | 0.8575 | **0.8739** |
| Heuristic Combination     | 0.8516    | 0.8906 | 0.8706 |
| Highest Recall | 0.7672    | **0.9212** | 0.8372 |

Leaman, Robert, Chih-Hsuan Wei, and Zhiyong Lu. ["tmChem: a high performance approach for chemical named entity recognition and normalization."](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331693/) J. Cheminformatics 7.S-1 (2015): S3.
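
The F1 column in the table above is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the reported precision and recall values; the numbers are copied from the table, and only the helper function is new.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Heuristic Combination": (0.8516, 0.8906, 0.8706),
    "Highest Recall": (0.7672, 0.9212, 0.8372),
}

for name, (p, r, f1) in sorted(reported.items()):
    print("%-22s computed F1 = %.4f (reported %.4f)" % (name, f1_score(p, r), f1))
```

The recomputed values can differ from the reported ones in the fourth decimal place, because the precision and recall in the table are themselves rounded.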