# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll be writing an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  At the core of this task will be constructing a model to classify _candidate_ CDR mentions as either true or false.

## Part II: `Candidate` Extraction

In [None]:
%load_ext autoreload
%autoreload 2
import os

# Note: We run automated tests on this tutorial to make sure that it is always up to date!
# However, certain interactive components cannot currently be tested automatically, so they are
# wrapped in if statements that check the variable below and are skipped during automated testing:
AUTOMATED_TESTING = os.environ.get('TESTING') is not None

from snorkel import SnorkelSession
session = SnorkelSession()

## Loading the `Corpus`

First, we will load the `Corpus` that we preprocessed in Part I:

In [None]:
from snorkel.models import Corpus

corpus = session.query(Corpus).filter(Corpus.name == 'CDR Training').one()
corpus

Next, we collect each `Sentence` in the `Corpus` into a `set`.

In [None]:
sentences = set()
for document in corpus:
    for sentence in document.sentences:
        sentences.add(sentence)

## Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`; we can manually define this class, or use a helper function (similar in spirit to the Python `collections.namedtuple` function).

Here we'll define a _chemical-disease relation mention candidate class_ which is composed of two named contexts, corresponding to a _chemical mention_ and a _disease mention_.  Note that this function will create the table if it does not exist:

In [None]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])
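
To make the `namedtuple` analogy concrete, here is a minimal sketch using only the standard library. It is not how `candidate_subclass` is implemented (that helper also creates a database table, which this sketch does not); the name `ChemicalDiseaseTuple` and the example strings are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in: a plain record type with two named fields,
# mirroring the two named contexts of the ChemicalDisease candidate class.
ChemicalDiseaseTuple = namedtuple('ChemicalDiseaseTuple', ['chemical', 'disease'])

# Fields are accessed by name, in the order given in the schema.
example = ChemicalDiseaseTuple(chemical='lithium', disease='tremor')
print(example.chemical)
print(example.disease)
```

As with the real candidate class, the order of the field names fixes which argument is the chemical and which is the disease.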

## Writing a basic `CandidateExtractor`

Next, we'll write a basic function to extract **candidate relation mentions** from the corpus.  For this first attempt, we'll just write a function that checks for matches against several dictionaries at the _entity mention level_--i.e. looking for candidate chemical and disease mentions--and then considers any co-occurring pairs in the same sentence as candidate relation mentions.

We'll use some precomputed disease and chemical dictionaries (see `tutorial/data/dicts/compile_dictionaries.py` for details):

In [None]:
# Load the dictionaries
ROOT = '%s/tutorial/data/dicts/' % os.environ['SNORKELHOME']

def load_phrases(filename):
    # Read one dictionary entry per line, closing the file when done
    with open(ROOT + filename, 'rb') as f:
        return f.read().strip().split('\n')

disease_phrases   = load_phrases('disease_phrases.txt')
disease_acronyms  = load_phrases('disease_acronyms.txt')
chemical_phrases  = load_phrases('chemical_phrases.txt')
chemical_acronyms = load_phrases('chemical_acronyms.txt')

We turn the dictionaries into a candidate extractor in three steps.

First, we define a child context space for our sentences. The particular context space we use here will consider all unigrams, bigrams, and trigrams that appear in a sentence as potential candidates to pass on to the matchers described in the next step.

In [None]:
from snorkel.candidates import Ngrams

ngrams = Ngrams(n_max=3)
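
As a rough illustration of what "all unigrams, bigrams, and trigrams" means, the following pure-Python sketch enumerates the token spans of a sentence. It is not Snorkel's `Ngrams` implementation; the function `ngram_spans` and the example tokens are hypothetical.

```python
def ngram_spans(tokens, n_max=3):
    """Yield (start, end) token offsets for every n-gram with 1 <= n <= n_max."""
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            end = start + n
            if end <= len(tokens):
                yield (start, end)

# Hypothetical tokenized sentence
tokens = ['aspirin', 'induced', 'gastric', 'ulcer']
spans = list(ngram_spans(tokens, n_max=3))

# Print the text of each candidate span
for start, end in spans:
    print(' '.join(tokens[start:end]))
```

The number of spans grows roughly linearly with sentence length for a fixed `n_max`, which is why the matchers in the next step are needed to filter them down.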

Second, we define matchers to filter the child contexts based on the dictionaries. Those potential candidates that pass the filters will be materialized as candidate relations.

In [None]:
from snorkel.matchers import DictionaryMatch, Union

# Define a matcher for diseases
disease_matcher = Union(
    DictionaryMatch(d=disease_phrases, ignore_case=True),
    DictionaryMatch(d=disease_acronyms, ignore_case=False),
    longest_match_only=True
)

# Define a matcher for chemicals
chem_matcher = Union(
    DictionaryMatch(d=chemical_phrases, ignore_case=True),
    DictionaryMatch(d=chemical_acronyms, ignore_case=False),
    longest_match_only=True
)
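
The idea behind these matchers can be sketched in plain Python: keep only the spans whose text appears in a dictionary, then optionally discard matches that are contained in a longer match. This is a simplified sketch under those assumptions, not Snorkel's actual matching logic; `dictionary_match`, `longest_only`, and the example data are hypothetical.

```python
def dictionary_match(tokens, spans, phrases, ignore_case=True):
    """Keep the spans whose text appears in the dictionary."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    lookup = set(norm(p) for p in phrases)
    return [(s, e) for (s, e) in spans if norm(' '.join(tokens[s:e])) in lookup]

def longest_only(spans):
    """Drop any span that is strictly contained in another matched span."""
    def contains(outer, inner):
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]
    return [sp for sp in spans if not any(contains(other, sp) for other in spans)]

# Hypothetical sentence and all of its 1- to 3-gram spans
tokens = ['Chronic', 'renal', 'failure', 'after', 'NSAID', 'use']
spans = [(i, j) for i in range(len(tokens))
         for j in range(i + 1, min(i + 3, len(tokens)) + 1)]

phrases = ['renal failure', 'chronic renal failure']  # matched case-insensitively
acronyms = ['NSAID']                                  # matched case-sensitively

matched = dictionary_match(tokens, spans, phrases, ignore_case=True)
kept = longest_only(matched)
acronym_hits = dictionary_match(tokens, spans, acronyms, ignore_case=False)
```

Here `matched` contains both "renal failure" and "Chronic renal failure", while `kept` retains only the longer of the two. Acronyms are matched case-sensitively because lowercasing short strings tends to produce spurious hits against ordinary words.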

Third, we combine the candidate class, child context spaces, and matchers into an extractor.

In [None]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(ChemicalDisease, [ngrams, ngrams], [chem_matcher, disease_matcher])
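
Conceptually, the extractor pairs every matched chemical mention with every matched disease mention in the same sentence. The sketch below shows that pairing step only, assuming the spans have already been matched; it is not Snorkel's implementation, and `candidate_pairs` and the example spans are hypothetical.

```python
from itertools import product

def candidate_pairs(chemical_spans, disease_spans):
    """Pair every chemical mention with every disease mention in one sentence."""
    return list(product(chemical_spans, disease_spans))

# Hypothetical (start, end) token spans found in a single sentence
chemical_spans = [(0, 1), (5, 6)]
disease_spans = [(2, 4)]

pairs = candidate_pairs(chemical_spans, disease_spans)
for chemical, disease in pairs:
    print((chemical, disease))
```

The chemical always comes first in each pair, mirroring the order of the context spaces and matchers passed to the extractor.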

## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `extract` with the contexts to extract from, a name for the `CandidateSet` that will contain the results, and the current session.

In [None]:
%time c = ce.extract(sentences, 'CDR Training Candidates', session)
print "Number of candidates:", len(c)

### Saving the extracted candidates

In [None]:
session.add(c)
session.commit()

### Reloading the candidates

In [None]:
from snorkel.models import CandidateSet
c = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
c

## Using the `Viewer` to inspect candidates

Next, we'll use the `Viewer` class--here, specifically, the `SentenceNgramViewer`--to inspect the data.

It is important to note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
if not AUTOMATED_TESTING:
    sv = SentenceNgramViewer(c[:300], session, annotator_name="Tutorial Part 2 User")
else:
    sv = None

The cell above instantiates the `Viewer` with the first 300 candidates; the `Viewer` object takes care of indexing them by sentence.

Now we render the `Viewer`. <span style="color:red">Red</span> denotes the first argument (chemical) and <span style="color:blue">blue</span> denotes the second (disease).

In [None]:
sv

Note that we can **navigate using the provided buttons**, or **using the keyboard (hover over buttons to see controls)**, highlight candidates (even if they overlap), and also **apply binary labels** (more on where to use this later!).  In particular, note that **the Viewer is synced dynamically with the notebook**, so that we can for example get the `Candidate` that is currently selected. Try it out!

In [None]:
if not AUTOMATED_TESTING:
    print sv.get_selected()

## Composing fancier `CandidateExtractor`s

We can additionally try to increase our candidate recall using more of the `Matcher` operators and their functionalities.  For example, we can turn on **Porter stemming** in our dictionary matcher; Porter stemming is an aggressive rules-based method for normalizing word endings.  Another option is to allow candidates to be subspans of each other by setting `longest_match_only=False` (note that this must be done in the outermost `Matcher`).

We can also use the `Concat` and `RegexMatchEach` operators to find candidate mentions composed of an _adjective followed by a term matching our diseases dictionary_.  Note in particular that we set `left_required=False` so that exact matches to our dictionary (with no adjective prepended) will still work:

In [None]:
from snorkel.matchers import Concat, RegexMatchEach

disease_matcher = Union(
    Concat(
        RegexMatchEach(rgx=r'JJ*', attrib='poses'),
        DictionaryMatch(d=disease_phrases, stemmer='porter'),
        left_required=False),
    DictionaryMatch(d=disease_acronyms, ignore_case=False),
    longest_match_only=False)
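
The effect of `left_required=False` can be illustrated with a simplified pure-Python sketch: a span matches if it is a dictionary term on its own, or a single adjective followed by a dictionary term. This is an assumption-laden simplification (one adjective token, exact tag `JJ`, no stemming) rather than Snorkel's `Concat` logic; `concat_match` and the example data are hypothetical.

```python
def concat_match(tokens, pos_tags, span, phrases, left_required=False):
    """Match an adjective (POS tag 'JJ') followed by a dictionary phrase.

    With left_required=False, a bare dictionary phrase also matches.
    """
    start, end = span
    lookup = set(p.lower() for p in phrases)
    text = ' '.join(tokens[start:end]).lower()
    # Bare dictionary term, allowed only when the left part is optional
    if not left_required and text in lookup:
        return True
    # Otherwise, treat the first token as the adjective and match the rest
    if end - start >= 2 and pos_tags[start] == 'JJ':
        rest = ' '.join(tokens[start + 1:end]).lower()
        return rest in lookup
    return False

# Hypothetical tagged sentence fragment
tokens = ['severe', 'hepatitis', 'developed']
pos_tags = ['JJ', 'NN', 'VBD']
phrases = ['hepatitis']

with_adjective = concat_match(tokens, pos_tags, (0, 2), phrases)
bare_term = concat_match(tokens, pos_tags, (1, 2), phrases)
bare_term_strict = concat_match(tokens, pos_tags, (1, 2), phrases, left_required=True)
```

With `left_required=True`, the bare term "hepatitis" would no longer match, which would lower recall relative to the plain dictionary matcher.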

We then create a new `CandidateExtractor` with the new `Matcher` for diseases, keeping everything else the same.

In [None]:
ce_fancy = CandidateExtractor(ChemicalDisease, [ngrams, ngrams], [chem_matcher, disease_matcher])

In [None]:
%time c = ce_fancy.extract(sentences, 'CDR Training Candidates -- Fancy', session)
print len(c)
session.add(c)
session.commit()

However, this generates a fairly large candidate set, so for the rest of the tutorial we'll use our initial, simpler set.

### Repeating for development and test corpora
We will rerun the same operations for the other two CDR corpora: development and test. All we do for each is load in the `Corpus` object, collect the `Sentence` objects, and run them through the `CandidateExtractor`.

In [None]:
for corpus_name in ['CDR Development', 'CDR Test']:
    corpus = session.query(Corpus).filter(Corpus.name == corpus_name).one()
    sentences = set()
    for document in corpus:
        for sentence in document.sentences:
            sentences.add(sentence)
    
    %time c = ce.extract(sentences, corpus_name + ' Candidates', session)
    session.add(c)
session.commit()

Next, in Part III, we will annotate some candidates with labels so that we can evaluate performance.