# Disease Tagging Tutorial

In this example, we'll write an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This five-part tutorial walks through the process of constructing a model to classify _candidate_ disease mentions as either true (the span really is a mention of a disease) or false.

## Part II: `Candidate` Extraction

In [None]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()

## Loading the `Corpus`

First, we will load the `Corpus` that we preprocessed in Part I:

In [None]:
from snorkel.models import Corpus

corpus = session.query(Corpus).filter(Corpus.name == 'CDR Training').one()
corpus

Next, we collect each `Sentence` in the `Corpus` into a `set`.

In [None]:
sentences = set()
for document in corpus:
    for sentence in document.sentences:
        sentences.add(sentence)

## Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates). This schema must be a subclass of `Candidate`, which we create with the helper function `candidate_subclass`.

Here we'll define a unary _disease relation mention_ which encapsulates a `Span` of text.  Note that this function will create the table in the database backend if it does not exist:

In [None]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])
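As a rough analogy only, the idea of building a class from a name and a list of field names can be sketched with the standard library's `namedtuple`. This is not how `candidate_subclass` is implemented, and it does not touch the database; the names `DiseaseSketch` and `ChemicalDiseaseSketch` are hypothetical and chosen so they do not overwrite the `Disease` class defined above.

```python
from collections import namedtuple

# Analogy only: a class factory that takes a class name and a list of field names.
# A unary mention has a single field.
DiseaseSketch = namedtuple('DiseaseSketch', ['disease'])

# A hypothetical binary relation would simply list two fields.
ChemicalDiseaseSketch = namedtuple('ChemicalDiseaseSketch', ['chemical', 'disease'])

# In this sketch the fields hold plain strings rather than `Span` objects.
mention = DiseaseSketch(disease='hypertension')
print(mention.disease)
```

The number of field names corresponds to the arity of the relation: one field for a unary mention, two for a binary relation.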

## Writing a basic `CandidateExtractor`

In [None]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import RegexMatchSpan

ngrams = Ngrams(n_max=8)

# Define a noun phrase matcher
NP_RGX = r'^(((JJ|VBN|VBD|RB) )*((NN(P|S)?|POS) )*NN(P|S)?|JJ)$'
np_matcher = RegexMatchSpan(attrib='pos_tags', rgx=NP_RGX, longest_match_only=True)

ce = CandidateExtractor(Disease, [ngrams], [np_matcher])
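To make the matcher's behavior more concrete, here is a minimal standalone sketch that uses only Python's `re` module. It assumes that `RegexMatchSpan` joins a span's `pos_tags` with single spaces before applying the regex, which is an assumption about the library's internals rather than something shown in this tutorial. The helper names `ngram_spans` and `noun_phrase_spans`, the sample sentence, and its tags are all hypothetical. The sketch does not reproduce `longest_match_only`, so nested matches are all kept.

```python
import re

NP_RGX = r'^(((JJ|VBN|VBD|RB) )*((NN(P|S)?|POS) )*NN(P|S)?|JJ)$'
NP_PATTERN = re.compile(NP_RGX)

def ngram_spans(n_tokens, n_max):
    """Return (start, end) index pairs for every contiguous span of 1..n_max tokens."""
    return [(i, i + n)
            for i in range(n_tokens)
            for n in range(1, n_max + 1)
            if i + n <= n_tokens]

def noun_phrase_spans(words, pos_tags, n_max=8):
    """Keep the n-grams whose space-joined POS tags match the noun phrase regex."""
    matches = []
    for start, end in ngram_spans(len(words), n_max):
        if NP_PATTERN.match(' '.join(pos_tags[start:end])):
            matches.append(' '.join(words[start:end]))
    return matches

# Hypothetical sentence with hand-assigned Penn Treebank tags.
sample_words = ['Severe', 'hypertension', 'causes', 'headaches']
sample_tags = ['JJ', 'NN', 'VBZ', 'NNS']
print(noun_phrase_spans(sample_words, sample_tags))
# ['Severe', 'Severe hypertension', 'hypertension', 'headaches']
```

Any span containing the verb tag `VBZ` is rejected, while adjective-plus-noun sequences and bare nouns pass.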

## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `extract` with the contexts to extract from, a name for the `CandidateSet` that will contain the results, and the current session.

In [None]:
%time c = ce.extract(sentences, 'CDR Training Candidates', session)
print("Number of candidates: {}".format(len(c)))

### Saving the extracted candidates

In [None]:
session.add(c)
session.commit()

### Reloading the candidates

In [None]:
from snorkel.models import CandidateSet
c = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
c

### Repeating for development and test corpora
We will rerun the same operations for the other two CDR corpora: development and test. All we do for each is load in the `Corpus` object, collect the `Sentence` objects, and run them through the `CandidateExtractor`.

In [None]:
for corpus_name in ['CDR Development', 'CDR Test']:
    corpus = session.query(Corpus).filter(Corpus.name == corpus_name).one()
    sentences = set()
    for document in corpus:
        for sentence in document.sentences:
            sentences.add(sentence)
    
    %time c = ce.extract(sentences, corpus_name + ' Candidates', session)
    session.add(c)
session.commit()

Next, in Part III, we will annotate some candidates with labels so that we can evaluate performance.