# Tutorial, Part I: Candidate Extraction

In this example, we'll write an application to extract **disease-year relationships** from homemade tables, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). At its core, we will be constructing a model to classify _disease-year relation mentions_ as either true or false. To do this, we first need a set of such candidates. In this notebook, we'll use `Snorkel` utilities to extract disease and year candidates and pair them into relation candidates.

## Loading the Corpus

First, we will load and pre-process the corpus, storing it for convenience in a `Corpus` object.

### Configuring a table parser

We'll start by creating an `HTMLTableParser` to read the HTML tables.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ddlite_parser import HTMLTableParser
html_parser = HTMLTableParser(path='data/diseases_table.xml')

In [3]:
from ddlite_parser import TableParser
table_parser = TableParser()

In [4]:
from ddlite_parser import Corpus
%time corpus = Corpus(html_parser, table_parser)

Parsing documents...
Parsing contexts...
CPU times: user 56.9 ms, sys: 8.23 ms, total: 65.1 ms
Wall time: 92.5 ms


In [5]:
corpus.get_docs()[0]

Document(id=0, file='diseases_table.xml', text='<html lang="en">  <table>    <tr>      <th>Disease</th>      <th>Location</th>      <th>Year</th>    </tr>    <tr>      <th>Polio</th>      <td>New York</td>      <td>1914</td>    </tr>    <tr>      <th>Chicken Pox</th>      <td>Boston</td>      <td>2001</td>    </tr>    <tr>      <th>Scurvy</th>      <td>Annapolis</td>      <td>1901</td>    </tr>    <caption>      Table 1: Infectious diseases and where to find them.    </caption>  </table>  <table>    <tr>      <th>Problem</th>      <th>Cause</th>      <th>Cost</th>    </tr>    <tr>      <th>Arthritis</th>      <td>Pokemon Go</td>      <td>Free</td>    </tr>    <tr>      <th>Yellow Fever</th>      <td>Unicorns</td>      <td>$17.75</td>    </tr>    <tr>      <th>Hypochondria</th>      <td>Fear</td>      <td>$100</td>    </tr>    <caption>      Table 2: Three ways to get sick and pay for it.    </caption>  </table></html>', attribs=None)
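To make the table structure in the document above more concrete, here is a minimal sketch that walks an HTML string and records every cell together with its table, row, and column position. It uses only the standard library. `CellCollector` and `parse_cells` are hypothetical names chosen for illustration, and this is not a description of how `ddlite_parser` works internally. Nested tables are not handled.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class CellCollector(HTMLParser):
    """Hypothetical helper: collects <th>/<td> cells with their positions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cells = []
        self.table = -1   # index of the current table
        self.row = -1     # row index within the current table
        self.col = -1     # column index within the current row
        self._tag = None  # 'th' or 'td' while inside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table += 1
            self.row = -1
        elif tag == 'tr':
            self.row += 1
            self.col = -1
        elif tag in ('th', 'td'):
            self.col += 1
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._tag == tag:
            self.cells.append({
                'table': self.table,
                'row_num': self.row,
                'col_num': self.col,
                'html_tag': tag,
                'text': ''.join(self._text).strip(),
            })
            self._tag = None


def parse_cells(html):
    """Return a list of cell dictionaries in document order."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


sample = ('<table><tr><th>Disease</th><th>Year</th></tr>'
          '<tr><th>Polio</th><td>1914</td></tr></table>')
cells = parse_cells(sample)
```

Each entry in `cells` carries the same kind of positional information (row, column, tag) that the parsed corpus exposes on its cells.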

In [12]:
from load_dictionaries import load_disease_dictionary

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print("Loaded %s disease phrases!" % len(diseases))

Loaded 507899 disease phrases!


In [40]:
from ddlite_candidates import Ngrams, EntityExtractor
from ddlite_matchers import DictionaryMatch

# Define a candidate space
ngrams = Ngrams(n_max=3)

# Define a matcher
matcher = DictionaryMatch(d=diseases, longest_match_only=False)

# Define extractor
disease_extractor = EntityExtractor(ngrams, matcher)


Note that we set `longest_match_only=False`, which means that we _will_ also consider subsequences of phrases that match our dictionary. For example, both "Yellow Fever" and the shorter "Fever" inside it are kept as candidates.
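The interplay between the n-gram candidate space and the dictionary matcher can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `ddlite_candidates` or `ddlite_matchers` implementation: the function names `ngrams` and `dictionary_match` are hypothetical, matching is assumed to be case-insensitive, and "longest match only" is assumed to mean dropping any match that lies inside another match.

```python
def ngrams(words, n_max=3):
    """Yield (start, end, phrase) for each contiguous span of at most n_max words."""
    for start in range(len(words)):
        for end in range(start, min(start + n_max, len(words))):
            yield start, end, ' '.join(words[start:end + 1])


def dictionary_match(words, dictionary, n_max=3, longest_match_only=False):
    """Keep the n-grams whose text appears in the dictionary.

    Assumption: matching ignores case. With longest_match_only=True, a match
    is dropped when its word span is contained in another match's span.
    """
    lowered = set(phrase.lower() for phrase in dictionary)
    matches = [(s, e, p) for s, e, p in ngrams(words, n_max)
               if p.lower() in lowered]
    if longest_match_only:
        matches = [m for m in matches
                   if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                              for o in matches)]
    return matches


toy_dictionary = ['yellow fever', 'fever']
all_matches = dictionary_match(['Yellow', 'Fever'], toy_dictionary)
longest_only = dictionary_match(['Yellow', 'Fever'], toy_dictionary,
                                longest_match_only=True)
```

With this toy dictionary, `all_matches` contains both "Yellow Fever" (words 0 to 1) and "Fever" (word 1), while `longest_only` keeps just "Yellow Fever".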

The `Ngrams` operator is applied over our `Sentence` objects and returns `Ngram` objects, which the matcher then filters. We apply both operators over the sentences in the corpus and store the results in a `Candidates` object for convenience. Before doing so, we can inspect the cells of the first sentence:

In [41]:
corpus.get_sentences()[0].cells

[Cell(id='0-0-0', doc_id=0, doc_name='diseases_table.xml', sent_id=0, words=[u'Disease'], lemmas=[u'disease'], poses=[u'NN'], dep_parents=[0], dep_labels=[u'ROOT'], char_offsets=[0], text=u'Disease', cell_id=0, row_num=0, col_num=0, html_tag='th'),
 Cell(id='0-0-1', doc_id=0, doc_name='diseases_table.xml', sent_id=0, words=[u'Location'], lemmas=[u'Location'], poses=[u'NNP'], dep_parents=[0], dep_labels=[u'ROOT'], char_offsets=[0], text=u'Location', cell_id=1, row_num=0, col_num=1, html_tag='th'),
 Cell(id='0-0-2', doc_id=0, doc_name='diseases_table.xml', sent_id=0, words=[u'Year'], lemmas=[u'year'], poses=[u'NN'], dep_parents=[0], dep_labels=[u'ROOT'], char_offsets=[0], text=u'Year', cell_id=2, row_num=0, col_num=2, html_tag='th'),
 Cell(id='0-0-3', doc_id=0, doc_name='diseases_table.xml', sent_id=0, words=[u'Polio'], lemmas=[u'Polio'], poses=[u'NNP'], dep_parents=[0], dep_labels=[u'ROOT'], char_offsets=[0], text=u'Polio', cell_id=3, row_num=1, col_num=0, html_tag='th'),
 Cell(id='0-0-

In [42]:
from ddlite_candidates import Candidates
%time c = Candidates(disease_extractor, corpus.get_sentences())
c.get_candidates()

Extracting candidates...
CPU times: user 2.46 ms, sys: 578 µs, total: 3.04 ms
Wall time: 2.54 ms


[<Ngram("Disease", id=0-0-0:0-6, chars=[0,6], words=[0,0]),
 <Ngram("Polio", id=0-0-3:0-4, chars=[0,4], words=[0,0]),
 <Ngram("Chicken Pox", id=0-0-6:0-10, chars=[0,10], words=[0,1]),
 <Ngram("Yellow Fever", id=0-1-6:0-11, chars=[0,11], words=[0,1]),
 <Ngram("Arthritis", id=0-1-3:0-8, chars=[0,8], words=[0,0]),
 <Ngram("Problem", id=0-1-0:0-6, chars=[0,6], words=[0,0]),
 <Ngram("Scurvy", id=0-0-9:0-5, chars=[0,5], words=[0,0]),
 <Ngram("Hypochondria", id=0-1-9:0-11, chars=[0,11], words=[0,0]),
 <Ngram("Fever", id=0-1-6:7-11, chars=[7,11], words=[1,1])]

In [43]:
# Define another matcher: a dictionary of the years 1800 through 2015, as strings
years = [str(x) for x in range(1800,2016)]
matcher = DictionaryMatch(d=years, longest_match_only=False)

# Define extractor
year_extractor = EntityExtractor(ngrams, matcher)

%time c = Candidates(year_extractor, corpus.get_sentences())
c.get_candidates()

Extracting candidates...
CPU times: user 31.5 ms, sys: 1.16 ms, total: 32.7 ms
Wall time: 32.5 ms


[<Ngram("1914", id=0-0-5:0-3, chars=[0,3], words=[0,0]),
 <Ngram("1901", id=0-0-11:0-3, chars=[0,3], words=[0,0]),
 <Ngram("2001", id=0-0-8:0-3, chars=[0,3], words=[0,0])]

In [54]:
from ddlite_candidates import RelationExtractor

# Combine the two entity extractors into a single relation extractor
disease_year_extractor = RelationExtractor([disease_extractor, year_extractor])
%time c = Candidates(disease_year_extractor, corpus.get_sentences())
c.get_candidates()

Extracting candidates...
CPU times: user 14.4 ms, sys: 3.92 ms, total: 18.3 ms
Wall time: 15.7 ms


[Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Disease", id=0-0-0:0-6),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Polio", id=0-0-3:0-4),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("2001", id=0-0-8:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("1901", id=0-0-11:0-3)>,
 Relation<Ngram("Chicken Pox", id=0-0-6:0-10),Ngram("1914", id=0-0-5:0-3)>,
 Relation<Ngram("Scurvy", id=0-0-9:0-5),Ngram("1914", id=0-0-5:0-3)>]
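The relation candidates above look like every disease match paired with every year match from the same table. A minimal sketch of that pairing follows, assuming the matches are grouped by table (which is how the ids above appear to be organised). `pair_candidates` is a hypothetical helper, not part of `ddlite_candidates`, and the toy inputs reuse the phrases printed earlier.

```python
from itertools import product


def pair_candidates(first_by_context, second_by_context):
    """Pair every match from the first extractor with every match from the
    second extractor, but only within the same context (here: a table)."""
    relations = []
    for context_id, firsts in sorted(first_by_context.items()):
        seconds = second_by_context.get(context_id, [])
        relations.extend(product(firsts, seconds))
    return relations


# Entity matches grouped by table index, taken from the outputs above
diseases_by_table = {
    0: ['Disease', 'Polio', 'Chicken Pox', 'Scurvy'],
    1: ['Problem', 'Arthritis', 'Yellow Fever', 'Fever', 'Hypochondria'],
}
years_by_table = {0: ['1914', '2001', '1901']}

relations = pair_candidates(diseases_by_table, years_by_table)
```

The second table contributes no relations because it contains no year matches, so the result is 4 diseases times 3 years, or 12 pairs. Note that this candidate set is deliberately over-inclusive: pairs such as ("Disease", "2001") are expected, and sorting true from false relation mentions is the job of the downstream model.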