# Episode 1: The Phantom [Table] Menace

This notebook is an in-house demonstration of the new classes created to perform (still relatively basic) entity extraction on tables. It assumes an input file in XHTML format, a strict form of HTML that is also well-formed XML, allowing for easy display (as HTML) and safe tree traversal (as XML).

### Candidate Extraction

First, import the 'HTMLParser' class to read HTML tables:

In [1]:
from snorkel.parser import HTMLParser
html_parser = HTMLParser(path='data/diseases.xhtml')

The 'TableParser' class divides the HTML document into cells, adding a 'cell_id' attribute to each cell for future traversal, and creating 'Cell' objects that carry attributes such as row number, column number, HTML tag, HTML attributes, and any tags/attributes on a cell's ancestors in the table.

In [97]:
from snorkel.parser import TableParser
table_parser = TableParser()
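
To make the decoration step concrete, here is a minimal sketch of the idea using only the standard library. It is not the `TableParser` implementation: the function name `decorate_cells`, the record keys, the 0-based row/column numbering, and the toy table are all assumptions made for illustration. The toy input also omits the XHTML namespace, which in a real file would prefix every tag name, and nested tables are ignored.

```python
import xml.etree.ElementTree as ET

# Toy input (hypothetical); a real XHTML file usually declares a namespace.
XHTML = """<html><body><table align="left">
<tr><th>Disease</th><th>Location</th></tr>
<tr><td>Polio</td><td>Spine</td></tr>
</table></body></html>"""


def decorate_cells(root):
    """Add a cell_id attribute to every th/td and return one record per cell."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    cells = []
    for table in root.iter("table"):
        for row_num, tr in enumerate(table.iter("tr")):
            col_num = 0
            for cell in tr:
                if cell.tag not in ("th", "td"):
                    continue
                # Walk up the tree to collect ancestor tags and attributes.
                anc_tags, anc_attrs = [], []
                node = parent.get(cell)
                while node is not None:
                    anc_tags.append(node.tag)
                    anc_attrs.extend(
                        "%s=%s" % kv for kv in sorted(node.attrib.items()))
                    node = parent.get(node)
                record = {
                    "cell_id": len(cells),
                    "row_num": row_num,
                    "col_num": col_num,
                    "tag": cell.tag,
                    "attrs": dict(cell.attrib),
                    "anc_tags": anc_tags,
                    "anc_attrs": anc_attrs,
                    "text": "".join(cell.itertext()).strip(),
                }
                # Decorate the tree itself so later queries can find the cell.
                cell.set("cell_id", str(record["cell_id"]))
                cells.append(record)
                col_num += 1
    return cells


root = ET.fromstring(XHTML)
cells = decorate_cells(root)
```

After this runs, each cell in the tree can be located again by its `cell_id` attribute, and each record holds the positional and ancestor information that the tabular features later in this notebook are built from.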

As usual, pass these to a Corpus object for digestion.

In [98]:
from snorkel.parser import Corpus
%time corpus = Corpus(html_parser, table_parser)

Parsing documents...
Parsing contexts...
CPU times: user 85.7 ms, sys: 24.3 ms, total: 110 ms
Wall time: 140 ms


Load the good ol' disease dictionary for recognizing disease names.

In [99]:
from load_dictionaries import load_disease_dictionary

# Load the disease phrase dictionary
diseases = load_disease_dictionary()
print "Loaded %s disease phrases!" % len(diseases)

Loaded 507899 disease phrases!


Here we use a new CandidateSpace object, CellNgrams. It inherits from Ngrams, and ensures that the Table context object is broken up into cells before being passed into the usual routine for pulling out Ngrams.

In [100]:
from snorkel.candidates import CellNgrams
from snorkel.matchers import DictionaryMatch

# Define a candidate space
cell_ngrams = CellNgrams(n_max=3)

# Define a matcher
disease_matcher = DictionaryMatch(d=diseases, longest_match_only=False)
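
As a rough illustration of what a per-cell candidate space and a dictionary matcher do together, the sketch below enumerates every n-gram of up to `n_max` whitespace-separated tokens in a cell's text, with inclusive character offsets, and keeps those found in a phrase set. The helper names `cell_ngrams` and `dictionary_match`, the whitespace tokenization, and the case-insensitive comparison are assumptions for illustration, not Snorkel's actual behavior.

```python
import re


def cell_ngrams(text, n_max=3):
    """Yield (phrase, char_start, char_end) for each n-gram of <= n_max tokens.

    Offsets are inclusive, in the style of the chars=[start,end] output.
    """
    tokens = [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]
    for i in range(len(tokens)):
        for j in range(i, min(i + n_max, len(tokens))):
            start, end = tokens[i][0], tokens[j][1]
            yield text[start:end + 1], start, end


def dictionary_match(ngrams, dictionary):
    """Keep every n-gram whose lowercased phrase is in the dictionary."""
    phrases = set(p.lower() for p in dictionary)
    return [g for g in ngrams if g[0].lower() in phrases]


toy_diseases = ["yellow fever", "fever"]  # hypothetical stand-in dictionary
matches = dictionary_match(cell_ngrams("Yellow Fever", n_max=3), toy_diseases)
# matches == [("Yellow Fever", 0, 11), ("Fever", 7, 11)]
```

Because every matching n-gram is kept rather than only the longest one, overlapping phrases such as "Yellow Fever" and "Fever" both survive, which is consistent with the candidates printed further down.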

Passing the CandidateSpace, Matcher, and Context objects to a Candidates object performs the extraction, and we see that a number of disease CellNgrams are returned.

In [101]:
from snorkel.candidates import Candidates
%time c = Candidates(cell_ngrams, disease_matcher, corpus.get_sentences())
for candy in c.get_candidates(): print candy

Extracting candidates...
CPU times: user 2.45 ms, sys: 580 µs, total: 3.03 ms
Wall time: 2.65 ms
<CellNgram("Disease", id=0-0-0:0-6, chars=[0,6], (row,col)=(0,1), tag=th)
<CellNgram("Polio", id=0-0-3:0-4, chars=[0,4], (row,col)=(1,1), tag=th)
<CellNgram("Chicken Pox", id=0-0-6:0-10, chars=[0,10], (row,col)=(2,1), tag=th)
<CellNgram("Yellow Fever", id=0-1-6:0-11, chars=[0,11], (row,col)=(2,1), tag=th)
<CellNgram("Location", id=0-0-1:0-7, chars=[0,7], (row,col)=(0,3), tag=th)
<CellNgram("Arthritis", id=0-1-3:0-8, chars=[0,8], (row,col)=(1,1), tag=th)
<CellNgram("Problem", id=0-1-0:0-6, chars=[0,6], (row,col)=(0,1), tag=th)
<CellNgram("Scurvy", id=0-0-9:0-5, chars=[0,5], (row,col)=(3,1), tag=th)
<CellNgram("Hypochondria", id=0-1-9:0-11, chars=[0,11], (row,col)=(3,1), tag=th)
<CellNgram("Fever", id=0-1-6:7-11, chars=[7,11], (row,col)=(2,1), tag=th)


### Feature Extraction

We can then examine the basic tabular features on a given CellNgram, which can be added to the feature index Snorkel is accustomed to using:

In [102]:
features = c.extract_features()
print features.shape

(10, 16)
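
The shape above presumably reads as 10 candidates by 16 distinct features. A minimal sketch of how string features can be indexed into such a matrix follows; the function `build_feature_matrix` is hypothetical, and a dense list of lists is used only to keep the example dependency-free (a real feature index would more likely be stored as a sparse matrix).

```python
def build_feature_matrix(candidate_features):
    """Map per-candidate feature strings to a 0/1 matrix plus a column index."""
    index = {}
    for feats in candidate_features:
        for feat in feats:
            # Assign the next free column the first time a feature is seen.
            index.setdefault(feat, len(index))
    matrix = [[0] * len(index) for _ in candidate_features]
    for row, feats in enumerate(candidate_features):
        for feat in feats:
            matrix[row][index[feat]] = 1
    return matrix, index


# Two toy candidates that differ only in their row number.
candidate_features = [
    ["TABLE_ROW_NUM_0", "TABLE_COL_NUM_1", "TABLE_HTML_TAG_th"],
    ["TABLE_ROW_NUM_1", "TABLE_COL_NUM_1", "TABLE_HTML_TAG_th"],
]
matrix, index = build_feature_matrix(candidate_features)
shape = (len(matrix), len(index))  # (2, 4): 2 candidates, 4 distinct features
```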


In [104]:
for feat in c.get_candidates()[0]._get_features():
    print feat

TREEDLIB_features_to come
DDLIB_features_to come
TABLE_ROW_NUM_0
TABLE_COL_NUM_1
TABLE_HTML_TAG_th
TABLE_HTML_ANC_TAG_tr
TABLE_HTML_ANC_TAG_tbody
TABLE_HTML_ANC_TAG_table
TABLE_HTML_ANC_TAG_body
TABLE_HTML_ANC_ATTR_align=left
TABLE_HTML_ANC_ATTR_size=5
TABLE_HTML_ANC_ATTR_font=blue
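
The `TABLE_` features in this listing follow a simple prefix-plus-value pattern. The sketch below shows one way such strings could be generated from a cell record; the function `table_features` and the dictionary layout of `cell` are assumptions for illustration, not the actual `_get_features` code.

```python
def table_features(cell):
    """Yield tabular feature strings for one cell record (a plain dict here)."""
    yield "TABLE_ROW_NUM_%d" % cell["row_num"]
    yield "TABLE_COL_NUM_%d" % cell["col_num"]
    yield "TABLE_HTML_TAG_%s" % cell["tag"]
    for tag in cell["anc_tags"]:
        yield "TABLE_HTML_ANC_TAG_%s" % tag
    for attr in cell["anc_attrs"]:
        yield "TABLE_HTML_ANC_ATTR_%s" % attr


# Hypothetical record whose values mirror the listing above.
cell = {
    "row_num": 0,
    "col_num": 1,
    "tag": "th",
    "anc_tags": ["tr", "tbody", "table", "body"],
    "anc_attrs": ["align=left", "size=5", "font=blue"],
}
features = list(table_features(cell))
```

With this record, `features` contains the ten `TABLE_` strings shown in the listing, in the same order.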


TA-DA! It's magic. Next up (Friday?): scooping up ngrams/features from spans of cells (like "all ngrams above me in the table" or "all ngrams within 2 cells of me") using simple XPath queries on the decorated XHTML tree.
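
A rough sketch of what such span queries might look like, using the limited XPath subset in Python's standard `xml.etree.ElementTree`. Everything here is an assumption: the `row` and `col` attributes on each cell, the helper names `cells_above` and `cells_within`, and the choice of "within k cells" to mean at most k rows and k columns away. Since ElementTree's XPath has no numeric comparisons, the row/column arithmetic is done in Python.

```python
import xml.etree.ElementTree as ET

# Toy decorated table (hypothetical row/col attributes on every cell).
XHTML = """<table>
<tr><th row="0" col="0">Disease</th><th row="0" col="1">Location</th></tr>
<tr><td row="1" col="0">Polio</td><td row="1" col="1">Spine</td></tr>
<tr><td row="2" col="0">Scurvy</td><td row="2" col="1">Gums</td></tr>
</table>"""


def cells_above(root, cell):
    """Cells in the same column as `cell`, in rows above it."""
    same_col = root.findall(".//*[@col='%s']" % cell.get("col"))
    return [c for c in same_col if int(c.get("row")) < int(cell.get("row"))]


def cells_within(root, cell, k):
    """Other cells at most k rows and k columns away from `cell`."""
    row, col = int(cell.get("row")), int(cell.get("col"))
    nearby = []
    for other in root.findall(".//*[@row]"):
        if other is cell:
            continue
        d_row = abs(int(other.get("row")) - row)
        d_col = abs(int(other.get("col")) - col)
        if max(d_row, d_col) <= k:
            nearby.append(other)
    return nearby


root = ET.fromstring(XHTML)
target = root.find(".//*[@row='2'][@col='0']")  # the "Scurvy" cell
above = [c.text for c in cells_above(root, target)]       # ["Disease", "Polio"]
near = [c.text for c in cells_within(root, target, 1)]    # ["Polio", "Spine", "Gums"]
```

A library with full XPath 1.0 support would allow the row comparison to live inside the query itself; the split used here just keeps the example within the standard library.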