# Disease Tagging Tutorial

In this example, we'll be writing an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This five-part tutorial walks through the process of constructing a model that classifies each _candidate_ disease mention as either true (i.e., it really is a mention of a disease) or false.

## Part III: Creating or Loading Evaluation Labels

In [None]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()

## Part III(a): Creating Evaluation Labels in the `Viewer`

We repeat our definition of the `Disease` `Candidate` subclass from Part II.

In [None]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])
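As a rough mental model of what this definition gives us, the sketch below builds a comparable one-argument type in plain Python. It is only a conceptual stand-in: `make_candidate_type` and the example span tuple are hypothetical, and nothing here uses Snorkel or its database-backed classes.

```python
from collections import namedtuple


def make_candidate_type(class_name, arg_names):
    """Hypothetical stand-in: build a simple record type with named arguments."""
    return namedtuple(class_name, arg_names)


# A unary candidate type with a single argument called 'disease'
Disease = make_candidate_type('Disease', ['disease'])

# Made-up span: (char_start, char_end, text)
cand = Disease(disease=(10, 18, 'diabetes'))
print(cand.disease)
```

The point is only that a `Disease` candidate wraps exactly one span, accessed by the argument name `disease`.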

## Loading the development `CandidateSet`

We will start by viewing the development `CandidateSet` we created in Part II in the `Viewer`.

First we reload the development `CandidateSet`.

In [None]:
from snorkel.models import CandidateSet

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates').one()
cs

## Labeling the `CandidateSet` in the `Viewer`

We create a `Viewer` to annotate the candidates manually. The call below passes it only the first 300 candidates (`cs[:300]`).

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(cs[:300], session, annotator_name="Tutorial Part III User")
else:
    sv = None

We now open the Viewer.

You can mark each `Candidate` as true or false. Remember that <span style="color:red">red</span> denotes the first argument; a `Disease` candidate has only this one argument, the disease mention. Try it!

These labels are automatically saved in the database backend and can be accessed using the annotator's name ('Tutorial Part III User') as the `AnnotationKey`.

In [None]:
sv

## Part III(b): Loading External Evaluation Labels

Loading external annotations can be a bit messier, since they could be in any format. Here, we show how to use the `ExternalAnnotationsLoader` helper class to make this simpler.

**Note that most of the code below is custom code specific to this example's external annotation format.** We start, however, by creating the loader helper, which we use to create a `CandidateSet` (named 'CDR Development Candidates -- Gold') and an `AnnotationKey` (named 'CDR Development Labels -- Gold') for the annotations we load.

Note in particular that we need to define a new candidate set because _the external annotations we load might cover candidates that are not in our existing candidate set._

In [None]:
from snorkel.loaders import ExternalAnnotationsLoader

loader = ExternalAnnotationsLoader(session, Disease, 
                                   'CDR Development Candidates -- Gold',
                                   'CDR Development Labels -- Gold',
                                   expand_candidate_set=True)
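The reason for a separate gold candidate set can be illustrated with plain Python sets. This is a minimal sketch with made-up spans, written as `(document id, char_start, char_end)` tuples. It does not use Snorkel and says nothing about how the loader works internally.

```python
# Spans our own candidate extraction produced (made-up values)
extracted = {("doc1", 0, 8), ("doc1", 25, 33), ("doc2", 4, 12)}

# Spans annotated in the external gold data (made-up values)
gold = {("doc1", 0, 8), ("doc1", 40, 52), ("doc2", 4, 12)}

# Gold mentions that candidate extraction never proposed
missing = gold - extracted

# Gold mentions that do line up with an extracted candidate
covered = gold & extracted

print(sorted(missing))
print(len(covered))
```

Any span in `missing` has no counterpart in the existing candidate set, so its gold label would have nothing to attach to unless the gold annotations get a candidate set of their own.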

Next, we use custom scripts to extract this particular type of annotation. **The details of these scripts are mostly left out, as they are specific to this example (see `tutorial/utils.py`).**

The key part is that we need to form a _dictionary of `TemporaryContexts`_ to pass into the loader:

In [None]:
from utils import get_docs_xml, get_CID_unary_mentions
from snorkel.models import Document, TemporarySpan
import os
ROOT = os.environ['SNORKELHOME'] + '/tutorials/disease_tagging/data/'

def load_BioC_disease_labels(loader, file_name):
    # Get all the annotated Pubtator documents as XML trees
    doc_xmls = get_docs_xml(ROOT + file_name)
    for doc_id, doc_xml in doc_xmls.items():
    
        # Get the corresponding Document object
        stable_id = "%s::document:0:0" % doc_id
        doc       = session.query(Document).filter(Document.stable_id == stable_id).first()
        if doc is not None:
        
            # Use custom script + loader to add
            for d in get_CID_unary_mentions(doc_xml, doc, 'Disease'):
                loader.add(d)

load_BioC_disease_labels(loader, 'CDR_DevelopmentSet.BioC.xml')
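For readers curious about what the elided helper scripts have to do, here is a minimal, self-contained sketch of pulling disease mentions out of BioC-style XML with the standard library. It is not the code in `utils.py`: the `disease_mentions` function and the sample document are hypothetical, and the sketch assumes each annotation has a `type` infon, one `location` element with `offset` and `length`, and a `text` element. The stable ID is built with the same `"%s::document:0:0"` pattern used above.

```python
import xml.etree.ElementTree as ET

# Made-up BioC-style document for illustration only
SAMPLE_BIOC = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>Lithium may cause tremor in some patients.</text>
      <annotation id="0">
        <infon key="type">Chemical</infon>
        <location offset="0" length="7"/>
        <text>Lithium</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="18" length="6"/>
        <text>tremor</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def disease_mentions(bioc_xml, mention_type='Disease'):
    """Yield (stable_id, char_start, char_end, text) per matching annotation.

    char_end is exclusive (char_start + length).
    """
    root = ET.fromstring(bioc_xml)
    for doc in root.iter('document'):
        stable_id = "%s::document:0:0" % doc.findtext('id')
        for ann in doc.iter('annotation'):
            # The annotation type is stored in the infon whose key is "type"
            ann_type = next(
                (i.text for i in ann.findall('infon') if i.get('key') == 'type'),
                None,
            )
            if ann_type != mention_type:
                continue
            loc = ann.find('location')
            start = int(loc.get('offset'))
            end = start + int(loc.get('length'))
            yield stable_id, start, end, ann.findtext('text')


mentions = list(disease_mentions(SAMPLE_BIOC))
print(mentions)
```

Only the annotation typed `Disease` is kept; the `Chemical` annotation is skipped, mirroring the `'Disease'` argument passed to `get_CID_unary_mentions` above.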

We've created a candidate set and a corresponding set of labels:

In [None]:
from snorkel.models import Label

cs = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Development Candidates -- Gold').one()
print(len(cs))
print(session.query(Label).filter(Label.key == loader.annotation_key).count())

Now we'll load the rest of the annotations, for the training and test sets:

In [None]:
for set_name in ['Training', 'Test']:
    loader = ExternalAnnotationsLoader(session, Disease, 
                                       'CDR %s Candidates -- Gold' % set_name,
                                       'CDR %s Labels -- Gold' % set_name,
                                       expand_candidate_set=True)
    load_BioC_disease_labels(loader, 'CDR_%sSet.BioC.xml' % set_name)
    print(session.query(Label).filter(Label.key == loader.annotation_key).count())