# Disease Tagging Tutorial

In this example, we'll write an application to extract *mentions of* diseases from PubMed abstracts, using annotations from the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/).  This five-part tutorial walks through the process of constructing a model to classify _candidate_ disease mentions as either true (i.e., the candidate really is a mention of a disease) or false.

## Part II: `Candidate` Extraction

In [1]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading the `Corpus`

First, we will load the `Corpus` that we preprocessed in Part I:

In [2]:
from snorkel.models import Corpus

corpus = session.query(Corpus).filter(Corpus.name == 'CDR Training').one()
corpus

Corpus (CDR Training)

Next, we collect each `Sentence` in the `Corpus` into a `set`.

In [3]:
sentences = set()
for document in corpus:
    for sentence in document.sentences:
        sentences.add(sentence)

## Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`, and we define it using a helper function.

Here we'll define a unary _disease relation mention_ which encapsulates a `Span` of text.  Note that this function will create the table in the database backend if it does not exist:

In [4]:
from snorkel.models import candidate_subclass

Disease = candidate_subclass('Disease', ['disease'])

In [5]:
from snorkel.candidates import Ngrams

ngrams = Ngrams(n_max=8, split_tokens=['-', '/'])

In [6]:
from snorkel.matchers import RegexMatchSpan

np_matcher = RegexMatchSpan(attrib='pos_tags', rgx='(NN\s*)+')

In [7]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(Disease, [ngrams], [np_matcher])

In [8]:
%time c = ce.extract(sentences, 'CDR Training Candidates', session)
print "Number of candidates:", len(c)


CPU times: user 43.5 s, sys: 223 ms, total: 43.7 s
Wall time: 43.6 s
Number of candidates: 21550


In [9]:
d = c[1]
d

Disease(Span("atropine", parent=1781, chars=[58,65], words=[11,11]))

In [10]:
d.get_contexts()

(Span("atropine", parent=1781, chars=[58,65], words=[11,11]),)

In [13]:
d.get_cids()

(3,)

In [12]:
d.disease_cid = 3

In [14]:
session.add(c)
session.commit()

## Writing a basic `CandidateExtractor`

Next, we'll write a basic function to extract **candidate relation mentions** from the corpus.  For this first attempt, we'll just write a function that checks for matches against several dictionaries at the _entity mention level_--i.e. looks for candidate chemical and disease mentions--and then considers any co-occurring pairs in the same sentence as candidate relation mentions.

We'll use some precomputed dictionaries of disease names, disease abbreviations, and body parts (see `tutorial/data/dicts/compile_dictionaries.py` for details).

In [None]:
# Load the dictionaries
import pandas as pd
ROOT = 'data/dicts/'
diseases   = set(pd.read_csv(ROOT + 'disease_names.csv', header=None, index_col=0, encoding='utf-8').dropna()[1])
abbrvs  = set(pd.read_csv(ROOT + 'disease_abbrvs.csv', header=None, index_col=0, encoding='utf-8').dropna()[1])
body_parts  = set(pd.read_csv(ROOT + 'body_parts.csv', header=None, index_col=0, encoding='utf-8').dropna()[1])

We turn the dictionaries into a candidate extractor in three steps.

First, we define a child context space for our sentences.

In [None]:
from snorkel.candidates import Ngrams

ngrams = Ngrams(n_max=8, split_tokens=['-', '/'])
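
To make the idea of a child context space more concrete, here is a plain-Python sketch of enumerating every run of up to `n_max` consecutive tokens in a sentence. It does not use Snorkel and is not how `Ngrams` is implemented (for one thing, it ignores `split_tokens`); `enumerate_ngrams` and the example sentence are hypothetical and only for illustration.

```python
def enumerate_ngrams(tokens, n_max):
    """Return (start, end, text) for every run of 1..n_max consecutive tokens.

    `start` and `end` are inclusive token indices.
    """
    spans = []
    for start in range(len(tokens)):
        # Stop at the sentence boundary or after n_max tokens, whichever comes first.
        for end in range(start, min(start + n_max, len(tokens))):
            spans.append((start, end, " ".join(tokens[start:end + 1])))
    return spans


# Hypothetical tokenized sentence, for illustration only.
example_tokens = ["acute", "renal", "failure", "occurred"]
example_spans = enumerate_ngrams(example_tokens, n_max=2)
```

Each of these spans is a potential candidate; the matchers defined next decide which ones to keep.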

Next, we define two basic `DictionaryMatch` matchers to filter the child contexts based on the dictionaries.

In [None]:
from snorkel.matchers import DictionaryMatch

#
# DICTIONARIES
#
longest_match_only = True
dict_diseases = DictionaryMatch(d=diseases, ignore_case=True, 
                                longest_match_only=longest_match_only)
dict_abbrvs = DictionaryMatch(d=abbrvs, ignore_case=False, 
                              longest_match_only=longest_match_only)
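
The sketch below illustrates, in plain Python, what the two options above are meant to do. It assumes that `ignore_case` lowercases both the dictionary and the text, and that `longest_match_only` drops any match nested inside a longer match. The helper `dictionary_matches`, its `n_max` parameter, and the toy data are hypothetical; this is not Snorkel's implementation.

```python
def dictionary_matches(tokens, dictionary, n_max=3, ignore_case=True,
                       longest_match_only=True):
    """Return inclusive (start, end) token spans whose text is in `dictionary`."""
    if ignore_case:
        dictionary = set(term.lower() for term in dictionary)
    matches = []
    for start in range(len(tokens)):
        for end in range(start, min(start + n_max, len(tokens))):
            text = " ".join(tokens[start:end + 1])
            if ignore_case:
                text = text.lower()
            if text in dictionary:
                matches.append((start, end))
    if longest_match_only:
        # Drop any match that lies inside another, different match.
        matches = [m for m in matches
                   if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                              for o in matches)]
    return matches


# Toy data, for illustration only.
toy_dictionary = {"renal failure", "failure", "anemia"}
toy_tokens = ["Renal", "failure", "and", "anemia"]
toy_matches = dictionary_matches(toy_tokens, toy_dictionary)
```

Here "failure" on its own is discarded because it sits inside the longer match "Renal failure". Case-sensitive matching, as used for the abbreviations above, would instead miss the capitalized "Renal failure" entirely.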

We also build a third `DictionaryMatch` out of all stem words for later use.

In [None]:
keep = ["disease", "diseases", "syndrome", "syndromes", "disorder", 
        "disorders", "damage", "infection", "bleeding"]
stems = diseases | abbrvs | set(keep)
disease_stems = DictionaryMatch(d=stems, ignore_case=True, 
                                longest_match_only=longest_match_only)

Some diseases that we want to tag have common patterns indicating disease subtypes. We use a `Concat` matcher to match consecutive spans matched by its component matchers.

In [None]:
from snorkel.matchers import Concat

type_names = ['type', 'class', 'factor']
type_nums = ['i', 'ii', 'iii', 'iv', 'v', 'vi', '1a', 'iid', 'a', 'b', 'c', 'd']
type_nums += map(unicode,range(1,10))

types = Concat(DictionaryMatch(d=type_names),
               DictionaryMatch(d=type_nums))

disease_types_left = Concat(types, disease_stems)
disease_types_right = Concat(disease_stems, types)
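
As a rough mental model of concatenation, the following plain-Python sketch treats a matcher as a predicate over a list of tokens and accepts a span if it can be split into a left part and a right part that each match. The helpers `make_dictionary_predicate` and `concat`, and the small term lists, are hypothetical; Snorkel's `Concat` has further options that are not modeled here.

```python
def make_dictionary_predicate(terms):
    """Build a predicate that is true when the joined tokens are in `terms`."""
    lowered = set(t.lower() for t in terms)
    return lambda tokens: " ".join(tokens).lower() in lowered


def concat(left, right, left_required=True):
    """Build a predicate matching a left part immediately followed by a right part."""
    def predicate(tokens):
        # With left_required=False, the right part alone is also accepted.
        if not left_required and right(tokens):
            return True
        # Try every split point between two tokens.
        return any(left(tokens[:i]) and right(tokens[i:])
                   for i in range(1, len(tokens)))
    return predicate


# Small illustrative term lists, not the tutorial's dictionaries.
is_type_name = make_dictionary_predicate(["type", "class", "factor"])
is_type_num = make_dictionary_predicate(["i", "ii", "1", "2"])
is_stem = make_dictionary_predicate(["diabetes", "disease"])

is_type = concat(is_type_name, is_type_num)    # e.g. "type 2"
is_typed_disease = concat(is_stem, is_type)    # e.g. "diabetes type 2"
```

Because the result of `concat` is itself a predicate, concatenations can be nested, which is what makes patterns such as "stem followed by type" possible.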

We can make complex patterns with `Concat` and `DictionaryMatch` matchers.

In [None]:
disease_pattern = ["disease", "diseases", "syndrome", "syndromes", "disorder", "disorders", "damage", "infection", 
       "lesion", "lesions", "impairment", "impairments", "failure", "failures", "occlusion", "occlusions", 
       "dysfunction", "dysfunctions", "toxicity", "injury", "carcinoma", "carcinomas", "thrombosis", "cancer", 
       "cancers", "block", "pain"]

timestamp = ["end-stage", "acute", "chronic", "congestive"]

conjunction = ["and", "or", "and/or"]

stemmer='porter'
body_disease = Concat(Concat(DictionaryMatch(d=body_parts, longest_match_only=longest_match_only, stemmer=stemmer), 
                             DictionaryMatch(d=conjunction, longest_match_only=longest_match_only)), 
                      Concat(DictionaryMatch(d=timestamp, longest_match_only=longest_match_only), 
                             Concat(DictionaryMatch(d=body_parts, longest_match_only=longest_match_only, stemmer=stemmer), 
                                    DictionaryMatch(d=disease_pattern, longest_match_only=longest_match_only, stemmer=stemmer)), left_required=False), left_required=False)

We create a `Union` of the `Matcher` objects to produce a final `Matcher` that accepts any input accepted by at least one of its component matchers.

In [None]:
from snorkel.matchers import Union

disease_matcher = Union(disease_types_left, disease_types_right, dict_diseases, dict_abbrvs, body_disease,
                        longest_match_only=longest_match_only)
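
Conceptually, a union is a logical OR over its components. The plain-Python sketch below shows that idea with simple string predicates; `union`, `in_names`, and `in_abbrvs` are hypothetical names, and the `longest_match_only` filtering passed to `Union` above is not modeled.

```python
def union(*predicates):
    """Build a predicate that is true when at least one component is true."""
    return lambda text: any(p(text) for p in predicates)


# Toy component matchers, for illustration only.
in_names = lambda text: text.lower() in {"anemia", "renal failure"}   # case-insensitive
in_abbrvs = lambda text: text in {"CKD", "MI"}                        # case-sensitive

is_disease = union(in_names, in_abbrvs)
```

A span rejected by every component, such as "atropine" here, is rejected by the union as well.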

Finally, we combine the candidate class, child context space, and matcher into an extractor.

In [None]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(Disease, [ngrams], [disease_matcher])

## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `extract` with the contexts to extract from, a name for the `CandidateSet` that will contain the results, and the current session.

In [None]:
%time c = ce.extract(sentences, 'CDR Training Candidates', session)
print "Number of candidates:", len(c)

### Saving the extracted candidates

In [None]:
session.add(c)
session.commit()

### Reloading the candidates

In [None]:
from snorkel.models import CandidateSet
c = session.query(CandidateSet).filter(CandidateSet.name == 'CDR Training Candidates').one()
c

## Using the `Viewer` to inspect candidates

Next, we'll use the `Viewer` class--here, specifically, the `SentenceNgramViewer`--to inspect the data.

Note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.
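
To make the recall goal concrete, here is a small sketch of how recall over extracted candidates could be computed, assuming we had a set of gold disease mentions to compare against. The function `candidate_recall` and all span values are made up for illustration; they are not taken from the CDR data.

```python
def candidate_recall(extracted, gold):
    """Fraction of gold mentions that appear among the extracted candidates."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(extracted)) / float(len(gold))


# Made-up (sentence_id, char_start, char_end) spans, for illustration only.
gold_spans = [(1, 0, 12), (1, 30, 35), (2, 5, 9)]
extracted_spans = [(1, 0, 12), (1, 14, 20), (1, 30, 35), (2, 40, 48)]
recall = candidate_recall(extracted_spans, gold_spans)
```

The two extracted spans that are not gold mentions do not lower recall; they only lower precision. A gold mention that is never extracted, however, cannot be recovered by any later step.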

First, we instantiate the `Viewer` object, which groups the input `Candidate` objects by `Sentence`:

In [None]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(c[:300], session, annotator_name="Tutorial Part 2 User")
else:
    sv = None

Next, we render the `Viewer`.

In [None]:
sv

Note that we can **navigate using the provided buttons**, or **using the keyboard (hover over buttons to see controls)**, highlight candidates (even if they overlap), and also **apply binary labels** (more on where to use this later!).  In particular, note that **the Viewer is synced dynamically with the notebook**, so that we can for example get the `Candidate` that is currently selected. Try it out!

In [None]:
if 'CI' not in os.environ:
    print sv.get_selected()

### Repeating for development and test corpora
We will rerun the same operations for the other two CDR corpora: development and test. All we do for each is load in the `Corpus` object, collect the `Sentence` objects, and run them through the `CandidateExtractor`.

In [None]:
for corpus_name in ['CDR Development', 'CDR Test']:
    corpus = session.query(Corpus).filter(Corpus.name == corpus_name).one()
    sentences = set()
    for document in corpus:
        for sentence in document.sentences:
            sentences.add(sentence)
    
    %time c = ce.extract(sentences, corpus_name + ' Candidates', session)
    session.add(c)
session.commit()

Next, in Part 3, we will annotate some candidates with labels so that we can evaluate performance.