# Part 2: Training a Chemical Named Entity Tagger

This notebook requires a pickle file of pre-generated candidate entities. Please refer to Part 1 in `ChmeicalExtraction.ipynb` for details on how that file is created.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import cPickle
import itertools
import numpy as np
import matplotlib
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from ddlite import *
from datasets import *
from lexicons import AllUpperNounsMatcher,RuleTokenizedDictionaryMatch

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (18,6)

## 1. Load Precomputed Candidates
Since generating our initial candidate set takes some time, we load a snapshot of all entities identified in our Part 1 notebook.

In [2]:
candidates = Entities("examples/cache/chem-candidates.pkl")
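
The snapshot idea can be illustrated with a generic load-or-compute pattern. This is only a minimal sketch: the internals of `Entities` are not shown in this notebook, and `load_or_compute`, `slow_candidate_extraction`, and the temporary cache file are hypothetical names introduced for illustration.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the object pickled at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, protocol=2)
    return result

def slow_candidate_extraction():
    # Hypothetical stand-in for the expensive extraction step from Part 1
    return [("aspirin", (0, 1)), ("carboxylic acid", (4, 6))]

cache_file = os.path.join(tempfile.mkdtemp(), "toy-candidates.pkl")

# First call runs the extraction and writes the snapshot
first = load_or_compute(cache_file, slow_candidate_extraction)
# Second call reads the snapshot; the compute function is never invoked
second = load_or_compute(cache_file, lambda: None)
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.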

## 2. Feature Generation
The `CandidateModel` object processes our extracted entity candidates. Since the `Entities` object defines a feature generation method, features are created automatically when we initialize a `CandidateModel` object.

In [3]:
model = CandidateModel(candidates)
msg = "Extracted {} features for each of {} mentions"
print msg.format(model.num_feats(), model.num_candidates())

Extracted 69084 features for each of 2971 mentions
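
To give a feel for why the feature count is so much larger than the number of mentions, here is a minimal sketch of binary context-window features. It is not the feature generator that `Entities` actually uses; `window_features` and the toy sentence are assumptions made up for illustration.

```python
def window_features(tokens, start, end, n=2):
    """Binary features for the mention tokens[start:end] and its n-token context."""
    feats = set()
    feats.add("MENTION_" + "_".join(tokens[start:end]).lower())
    for i in range(max(0, start - n), start):
        feats.add("LEFT_" + tokens[i].lower())
    for i in range(end, min(len(tokens), end + n)):
        feats.add("RIGHT_" + tokens[i].lower())
    return feats

tokens = "Patients were treated with aspirin and ibuprofen daily".split()
mentions = [(4, 5), (6, 7)]  # token spans (end exclusive) for aspirin and ibuprofen

per_mention = [window_features(tokens, s, e) for s, e in mentions]

# One shared column index per distinct feature across all mentions
feature_index = {f: j for j, f in enumerate(sorted(set().union(*per_mention)))}

# Sparse representation: the active column indices for each mention
rows = [sorted(feature_index[f] for f in feats) for feats in per_mention]
```

Every mention shares the same columns but activates only a handful of them, so the resulting matrix is very sparse.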


## 2. Ground Truth Data


### Annotation with MindTagger
Often we lack ground truth (or "gold") annotated data for our labeling task. In order to evaluate our labeling functions and learning results, we'll create a small set of ground truth labels for some candidates using MindTagger. MindTagger highlights each candidate in the sentence in which it appears. We set the response to yes if the candidate is a mention of a chemical, and no otherwise. If you aren't sure, you can abstain from labeling. In a real application, we would likely want to tag more than 20 candidates.

In [4]:
model.open_mindtagger(num_sample=20, width='100%', height=1200)

Making sure MindTagger is installed. Hang on!


#### Gold Standard Data
For the CHEMDNER corpus, we actually have labeled training data, so let's load our gold labels and use those to evaluate our system.

In [22]:
corpus = ChemdnerCorpus('datasets/chemdner_corpus/', parser=SentenceParser(), 
                        cache_path="examples/cache/chemdner/")

dev_set = sorted(corpus.cv["development"].keys())[:250]
documents = {doc_id:(corpus[doc_id]["sentences"],corpus[doc_id]["tags"]) for doc_id in dev_set}
sentences, gold_entities = zip(*documents.values())
#sentences = list(itertools.chain.from_iterable(sentences))
#gold_entities = list(itertools.chain.from_iterable(gold_entities))

for x in gold_entities:
    print x
    break

#model.add_mindtagger_tags()
#gold = np.zeros((model.num_candidates()))
#gold[np.array([48,49,50,51,52,53,54,55,56,58,59,60,61,62,63,64,65,66,68,69,70,
#     71,72,73,74,75,76,78,79,80,81,82,83,84,85,86,88,89,90,91])] = np.array([
#     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,1,1,1,1,1,1,1,-1,1,-1,-1,1,
#     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,1])
#model.set_gold_labels(gold)
#model.set_holdout()

[[('triterpenoids', (1, 2))], [('pentacyclic triterpenoid', (2, 4)), ('taraxer-14-ene', (11, 12)), ('carboxylic acid', (18, 20))], [('2\xce\xb1,3\xce\xb1-dihydroxytaraxer-14-en-28-oic acid', (6, 11))], [('2\xce\xb1,3\xce\xb1-diacetyltaraxer-14-en-28-oic acid', (3, 9)), ('2\xce\xb1,3\xce\xb1-di-O-carbonyl-2\xce\xb1,3\xce\xb1-dihydroxytaraxer-14-en-28-oic acid', (9, 19)), ('2\xce\xb1,3\xce\xb1-dipropionyltaraxer-14-en-28-oic acid', (18, 25))], [('3\xce\xb2-hydroxytaraxer-14-en-28-oic acid', (6, 10)), ('aleuritolic acid', (9, 16))], [('maprounic acid', (3, 6)), ('aleuritolic acid', (5, 8))], [('maprounic acid', (6, 10)), ('p-bromobenzyl acetylmaprounate', (12, 15))], [('3\xce\xb1-hydroxytaraxer-14-en-28-oic acid', (9, 13)), ('isoaleuritolic acid', (12, 19)), ('3\xce\xb1-acetyltaraxer-14-en-28-oic acid acetate', (18, 23)), ('aleuritolic acid acetate', (23, 30))], [], [('3-oxo-taraxer-14-ene', (10, 13)), ('taraxerone', (15, 18)), ('\xce\xb2-sitosterol', (20, 23)), ('stigmasterol', (24, 29))
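
Gold annotations in this shape (one list of `(text, (start, end))` tuples per sentence) can be turned into candidate labels by matching spans. The sketch below assumes exact span matching and a hypothetical candidate layout of `(sentence_index, (start, end))`; `label_candidates` is not part of ddlite.

```python
def label_candidates(candidates, gold_doc):
    """Return +1 for candidates whose span exactly matches a gold span, else -1.

    candidates: list of (sentence_index, (start, end)) tuples (hypothetical layout)
    gold_doc:   list of sentences, each a list of (text, (start, end)) tuples
    """
    gold_spans = set()
    for sent_idx, entities in enumerate(gold_doc):
        for _text, span in entities:
            gold_spans.add((sent_idx, tuple(span)))
    return [1 if (s, tuple(span)) in gold_spans else -1 for s, span in candidates]

# First two sentences copied from the output above, plus one sentence without entities
gold_doc = [
    [("triterpenoids", (1, 2))],
    [("pentacyclic triterpenoid", (2, 4)), ("taraxer-14-ene", (11, 12)),
     ("carboxylic acid", (18, 20))],
    [],
]

# Hypothetical candidates: two match gold spans, two do not
toy_candidates = [(0, (1, 2)), (1, (2, 4)), (1, (5, 6)), (2, (0, 1))]
labels = label_candidates(toy_candidates, gold_doc)
```

Exact matching is the strictest choice; a partial-overlap criterion would label more candidates as positive.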

## 3. Labeling Functions
We want to create a set of labeling functions, each of which weakly predicts a class label for a candidate mention.

In [None]:
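def post_window(m, key, n=3):
    """True if `key` is a lemma of the mention or of the n tokens after it."""
    s = list(m.idxs)
    last = np.max(s)
    # Number of positions from the last mention token to the end of the sentence
    b = len(m.lemmas) - last
    s.extend([last + i for i in range(1, min(b, n + 1))])
    return key in [m.lemmas[i] for i in s]

def pre_window(m, key, n=3):
    """True if `key` is a lemma of the mention or of the n tokens before it."""
    s = list(m.idxs)
    b = np.min(s)
    # min(b, n) + 1 lets the window reach index 0, the first token of the sentence
    s.extend([b - i for i in range(1, min(b, n) + 1)])
    return key in [m.lemmas[i] for i in s]

def LF_mutation(m):
    # Vote 1 if any mention token has the lemma 'treat' as its dependency parent;
    # the - 1 treats dep_parents as 1-indexed. Otherwise return 0.
    return 1 if 'treat' in [m.lemmas[m.dep_parents[i] - 1] for i in m.idxs] else 0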
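
To see how several labeling functions combine, the sketch below applies two toy functions to a few mentions and collects their votes in a matrix with one row per mention and one column per function. It assumes the convention that 1 votes for the positive class, -1 votes against it, and 0 abstains. `Mention`, `lemmas_before`, `LF_dose_context`, `LF_all_digits`, and `apply_lfs` are hypothetical names for illustration, not ddlite API.

```python
from collections import namedtuple

# Hypothetical stand-in for a candidate: only the fields this sketch needs
Mention = namedtuple("Mention", ["lemmas", "idxs"])

def lemmas_before(m, n=3):
    """Lemmas of up to n tokens immediately preceding the mention."""
    start = min(m.idxs)
    return m.lemmas[max(0, start - n):start]

def LF_dose_context(m):
    # Hypothetical rule: a dosage unit shortly before the mention suggests a chemical
    return 1 if "mg" in lemmas_before(m) else 0

def LF_all_digits(m):
    # Hypothetical rule: a purely numeric mention is not a chemical name
    return -1 if all(m.lemmas[i].isdigit() for i in m.idxs) else 0

def apply_lfs(mentions, lfs):
    """Return one row per mention and one column per labeling function."""
    return [[lf(m) for lf in lfs] for m in mentions]

lemmas = ["patient", "receive", "50", "mg", "aspirin", "daily"]
mentions = [
    Mention(lemmas, [4]),  # "aspirin"
    Mention(lemmas, [2]),  # "50"
    Mention(lemmas, [0]),  # "patient"
]

label_matrix = apply_lfs(mentions, [LF_dose_context, LF_all_digits])
```

Rows of all zeros, like the one for "patient" here, mark candidates that no labeling function covers, which is a useful signal that more functions are needed.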