# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At the core of this task is constructing a model that classifies _candidate_ CDR mentions as either true or false.

## Part I: Preprocessing

**Before starting, make sure to run the `download_data.sh` script!**

In this notebook, we'll preprocess several documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data--which we refer to as _contexts_--as well as extracting standard linguistic features from each context.

In this example, we will extract two types of contexts, represented as `Context` subclasses: `Document` objects and constituent `Sentence` objects.  In particular, we'll do this using [CoreNLP](http://stanfordnlp.github.io/CoreNLP/), which will also extract a number of standard linguistic features that will be used downstream.

All of this preprocessed input data will be saved to a database. In Snorkel, if no database is specified, a SQLite database is created by default, so no setup is needed here!

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

## Loading the `Corpus`

First, we will load and preprocess the corpus, storing it for convenience in a `Corpus` object.

### Configuring a `DocParser`

We'll start by configuring a `DocParser` to read in PubMed abstracts from [PubTator](http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/index.cgi), where they are stored along with "gold" (i.e. hand-annotated by experts) *chemical* and *disease mention* annotations. We'll use the `XMLMultiDocParser` class, which allows us to use [XPath queries](https://en.wikipedia.org/wiki/XPath) to specify the relevant sections of the XML format.

_Note that we are concatenating text from the title and abstract together with newlines for simplicity, but if we wanted to, we could easily extend the `DocParser` classes to preserve information about document structure._

In [None]:
from snorkel.parser import XMLMultiDocParser

xml_parser = XMLMultiDocParser(
    path='data/CDR_TrainingSet.BioC.xml',
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()')
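
To make the three XPath queries above more concrete, here is a minimal sketch of the same selection logic using only Python's standard library. It is not how `XMLMultiDocParser` is implemented; the sample XML is a hypothetical, heavily trimmed BioC-style snippet, and `iter_docs` is an illustrative helper name. Note also that `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the sketch selects the `text` elements and reads their `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed BioC-style sample (not taken from the CDR corpus).
SAMPLE_XML = """
<collection>
  <document>
    <id>0001</id>
    <passage><infon key="type">title</infon><text>Example title.</text></passage>
    <passage><infon key="type">abstract</infon><text>Example abstract text.</text></passage>
  </document>
  <document>
    <id>0002</id>
    <passage><text>Another title.</text></passage>
    <passage><text>Another abstract.</text></passage>
  </document>
</collection>
"""

def iter_docs(xml_string):
    """Yield (doc_id, text) pairs, joining each document's passages with newlines."""
    root = ET.fromstring(xml_string.strip())
    # Counterpart of doc='.//document': one result per document element.
    for doc in root.findall('.//document'):
        # Counterpart of id='.//id/text()'.
        doc_id = doc.find('.//id').text
        # Counterpart of text='.//passage/text/text()': title and abstract passages.
        passages = [node.text for node in doc.findall('.//passage/text')]
        yield doc_id, '\n'.join(passages)

docs = list(iter_docs(SAMPLE_XML))
for doc_id, text in docs:
    print('%s: %r' % (doc_id, text))
```

Joining the passages with newlines mirrors the simplification noted above: the title and abstract end up in one text field, and the boundary between them is not preserved as structure.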

### Creating a `SentenceParser`

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to provide annotations--part-of-speech tags, dependency parse structure, lemmatized word forms, etc.--for these sentences. Here we use the default `SentenceParser` class.

In [None]:
from snorkel.parser import SentenceParser

sent_parser = SentenceParser()
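
As a rough illustration of the document-to-sentence hierarchy this step produces, the sketch below splits a text into sentences and tokens with two naive regular expressions. It is a toy stand-in only: `ToyDocument` and `ToySentence` are hypothetical names, not Snorkel classes, and the real `SentenceParser` relies on CoreNLP, which handles abbreviations, biomedical terms, and linguistic annotations that this sketch ignores.

```python
import re

class ToySentence(object):
    """Toy sentence context: raw text, position in the document, and tokens."""
    def __init__(self, text, position):
        self.text = text
        self.position = position
        # Words (optionally hyphenated) and punctuation marks become separate tokens.
        self.words = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

class ToyDocument(object):
    """Toy document context that owns its constituent sentences."""
    def __init__(self, name, text):
        self.name = name
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        parts = re.split(r'(?<=[.!?])\s+', text.strip())
        self.sentences = [ToySentence(p, i) for i, p in enumerate(parts) if p]

toy_doc = ToyDocument('toy-doc', 'Aspirin was given daily. No bleeding was observed.')
for toy_sent in toy_doc.sentences:
    print('%d %s' % (toy_sent.position, toy_sent.words))
```

A rule this simple would mis-split on abbreviations such as "i.v." or "et al.", which is one reason to delegate the step to a full NLP pipeline.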

### Pre-processing & loading the `Corpus`

Finally, we'll put this all together using a `CorpusParser` object, which will execute the parsers and store the results as a `Corpus`:

In [None]:
from snorkel.parser import CorpusParser

cp = CorpusParser(xml_parser, sent_parser)
%time corpus = cp.parse_corpus(name='CDR Training', session=session)

Note that the printed stats are a property of the `Corpus` object, and can be printed again via the `corpus.stats()` method.

In [None]:
doc = corpus.documents[0]
doc

In [None]:
sent = doc.sentences[0]
print(sent)        # the sentence context itself
print(sent.words)  # its tokens
print(sent.poses)  # its part-of-speech tags

### Saving the `Corpus`
Finally, we persist the parsed corpus in Snorkel's database backend:

In [None]:
session.add(corpus)
session.commit()

### Repeating for development and test corpora
We will rerun the same operations for the other two CDR corpora: development and test. All we do is change the path that the `XMLMultiDocParser` uses.

In [None]:
cp.doc_parser.path = 'data/CDR_DevelopmentSet.BioC.xml'
%time corpus = cp.parse_corpus(name='CDR Development', session=session)
session.add(corpus)

In [None]:
cp.doc_parser.path = 'data/CDR_TestSet.BioC.xml'
%time corpus = cp.parse_corpus(name='CDR Test', session=session)
session.add(corpus)

In [None]:
session.commit()

Next, in Part II, we will look at how to extract `Candidate` relations from our saved `Corpus`.