# Disease Tagging: Annotating using gold data

Here, we'll match gold annotations to the corresponding (i.e. equal or superset) candidates, and then label those candidates as true disease mentions!

In [None]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()

In [None]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

In [None]:
from snorkel.models import CandidateSet

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
cs

## First: Loading External Evaluation Labels

Loading in external annotations can be a bit messier, since these external annotations could be in any format.  Here, we'll provide an example of how to use the `ExternalAnnotationsLoader` helper class to make this a bit simpler.

**Note that most of the code below is custom code just for this particular example's external annotations format;** we start, however, by creating the loader helper.  Note that we use it to create a `CandidateSet` (named "CDR Development Candidates -- Gold") and an `AnnotationKey` (named "CDR Development Labels -- Gold") for the annotations we load.

Note in particular that we need to define a new candidate set because _the external annotations we load might cover mentions that are not in our candidate set._
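
As a toy illustration of that point (the `(doc_id, char_start, char_end)` tuples below are hypothetical stand-ins, not Snorkel objects), a gold annotation can fall outside the extracted candidates, so it needs a set of its own:

```python
# Hypothetical (doc_id, char_start, char_end) tuples standing in for spans
extracted = {('doc1', 0, 6), ('doc1', 10, 31)}
gold = {('doc1', 10, 31), ('doc2', 4, 12)}

# Gold annotations with no identical extracted candidate
only_in_gold = gold - extracted
```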

In [None]:
from snorkel.loaders import ExternalAnnotationsLoader

loader = ExternalAnnotationsLoader(session, Disease, 
                                   'CDR Development Candidates -- Gold',
                                   'CDR Development Labels -- Gold',
                                   expand_candidate_set=True)

Next, we use custom scripts to extract this particular type of annotation.  **The details of these scripts are mostly left out as they are particular to this example (see `tutorial/utils.py`).**

The key part is that we need to form a _dictionary of `TemporaryContexts`_ to pass into the loader:

In [None]:
from utils import get_docs_xml, get_CID_unary_mentions
from snorkel.models import Document, TemporarySpan
import os
ROOT = os.environ['SNORKELHOME'] + '/tutorials/disease_tagging/data/'

def load_BioC_disease_labels(loader, file_name):
    # Get all the annotated Pubtator documents as XML trees
    doc_xmls = get_docs_xml(ROOT + file_name)
    for doc_id, doc_xml in doc_xmls.iteritems():
    
        # Get the corresponding Document object
        stable_id = "%s::document:0:0" % doc_id
        doc       = session.query(Document).filter(Document.stable_id == stable_id).first()
        if doc is not None:
        
            # Use custom script + loader to add
            for d in get_CID_unary_mentions(doc_xml, doc, 'Disease'):
                loader.add(d)

load_BioC_disease_labels(loader, 'CDR_DevelopmentSet.BioC.xml')
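
Since the helper scripts are not shown, here is a rough, self-contained sketch of the kind of extraction they might perform. It is not the contents of `utils.py`: the sample XML is hand-written in an assumed BioC-like layout, and `iter_mentions` is a hypothetical name. It only uses the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in an assumed BioC-like layout (not taken from the CDR files)
SAMPLE_XML = """
<collection>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>Aspirin may cause gastric ulcers.</text>
      <annotation id="0">
        <infon key="type">Chemical</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="18" length="14"/>
        <text>gastric ulcers</text>
      </annotation>
    </passage>
  </document>
</collection>
"""

def iter_mentions(xml_string, mention_type):
    """Yield (doc_id, char_start, char_end, text) for annotations of one type."""
    root = ET.fromstring(xml_string)
    for doc in root.iter('document'):
        doc_id = doc.findtext('id')
        for ann in doc.iter('annotation'):
            # An annotation's type is stored in an <infon key="type"> child
            types = [i.text for i in ann.findall('infon') if i.get('key') == 'type']
            if mention_type not in types:
                continue
            for loc in ann.findall('location'):
                start = int(loc.get('offset'))
                # Inclusive end offset: a convention chosen for this sketch
                end = start + int(loc.get('length')) - 1
                yield doc_id, start, end, ann.findtext('text')

mentions = list(iter_mentions(SAMPLE_XML, 'Disease'))
```

In the tutorial itself, each extracted mention would additionally have to be turned into the temporary context objects that `loader.add` expects; that step depends on Snorkel internals and is omitted here.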

We've created a candidate set and a corresponding set of labels:

In [None]:
from snorkel.models import Label

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates -- Gold').one()
print len(cs)
print session.query(Label).filter(Label.key == loader.annotation_key).count()

Now we'll load the rest of the annotations:

In [None]:
for set_name in ['Training', 'Test']:
    loader = ExternalAnnotationsLoader(session, Disease, 
                                       'CDR %s Candidates -- Gold' % set_name,
                                       'CDR %s Labels -- Gold' % set_name,
                                       expand_candidate_set=True)
    load_BioC_disease_labels(loader, 'CDR_%sSet.BioC.xml' % set_name)
    print session.query(Label).filter(Label.key == loader.annotation_key).count()

## Now, we'll match the gold candidates with ours & annotate!

In [None]:
from sqlalchemy.orm.exc import NoResultFound
from snorkel.models import AnnotationKeySet, AnnotationKey, Span

for name in ['Training', 'Development', 'Test']:
    
    # Load gold candidates
    gold = session.query(CandidateSet).filter(CandidateSet.name == 'CDR %s Candidates -- Gold' % name).one()
    print "Gold candidates:", len(gold)
    
    # Load NP-chunk candidates
    candidates = session.query(CandidateSet).filter(CandidateSet.name == 'CDR %s Candidates' % name).one()
    print "NP Candidates:", len(candidates)
    
    # Create / load a label key set
    # (catch only the "no row found" case so that other errors still surface)
    try:
        label_key_set = session.query(AnnotationKeySet).filter(AnnotationKeySet.name == '%s Labels' % name).one()
    except NoResultFound:
        label_key_set = AnnotationKeySet(name='%s Labels' % name)
        session.add(label_key_set)
        session.commit()
    
    # Create / load a label key
    try:
        label_key = session.query(AnnotationKey).filter(AnnotationKey.name == 'Gold NP-Chunk Label').one()
    except NoResultFound:
        label_key = AnnotationKey(name='Gold NP-Chunk Label')
        session.add(label_key)
    
    # Add label key to label key set
    if label_key not in label_key_set.keys:
        label_key_set.append(label_key)

    session.commit()
    
    seen  = set()
    for g in gold:
    
        # Get the candidates in our NP candidate set which are in the same sentence
        ds = session.query(Disease).join(Span)\
            .filter(Disease.sets.contains(candidates))\
            .filter(Span.parent == g.disease.parent).all()
    
        # Check for the superset candidate which contains the gold span
        for d in ds:
        
            # Note that a small number of candidates contain more than one gold mention.
            # We handle this heuristically: each candidate is labeled at most once.
            if g.disease in d.disease and d not in seen:
                label = Label(key=label_key, candidate=d, value=1)
                session.add(label)
                seen.add(d)
                break

    session.commit()
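
To make the matching heuristic easier to follow outside of a database session, here is a minimal sketch on toy data. `ToySpan`, `contains`, and `match_gold` are hypothetical names introduced for illustration, and the sketch assumes that span containment means "same sentence, and the outer character range covers the inner one", with inclusive offsets.

```python
from collections import namedtuple

# Hypothetical stand-in for a span: sentence id plus inclusive character offsets
ToySpan = namedtuple('ToySpan', ['sent_id', 'char_start', 'char_end'])

def contains(outer, inner):
    """True if `outer` is in the same sentence as `inner` and covers it."""
    return (outer.sent_id == inner.sent_id
            and outer.char_start <= inner.char_start
            and outer.char_end >= inner.char_end)

def match_gold(gold_spans, candidate_spans):
    """Return the candidates to label positive, each at most once."""
    labeled = []
    seen = set()
    for g in gold_spans:
        for c in candidate_spans:
            # Take the first not-yet-labeled candidate that contains the gold span
            if contains(c, g) and c not in seen:
                labeled.append(c)
                seen.add(c)
                break
    return labeled

gold_spans = [ToySpan(0, 12, 15), ToySpan(0, 18, 31), ToySpan(1, 5, 9)]
candidate_spans = [ToySpan(0, 10, 31), ToySpan(1, 0, 3)]
labeled = match_gold(gold_spans, candidate_spans)
```

Here the first candidate contains two gold spans but is labeled only once, and the gold span in sentence 1 has no containing candidate, so it produces no label. Gold mentions that are missed this way cap the recall measurable against these labels.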