## Part II: `Candidate` Extraction

In [1]:
%load_ext autoreload
%autoreload 2
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///semparse'

from snorkel import SnorkelSession
session = SnorkelSession()

In [2]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

In [3]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher

ngrams         = Ngrams(n_max=3)
person_matcher = PersonMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(Spouse, 
                                    [ngrams, ngrams], [person_matcher, person_matcher],
                                    symmetric_relations=False)
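
To give an intuition for what this configuration asks for, here is a minimal pure-Python sketch, not Snorkel's implementation. The helpers `ngram_spans` and `candidate_pairs` are hypothetical, the matcher step is skipped (the person mentions are supplied by hand), and the sketch assumes that `symmetric_relations=False` means each unordered pair of mentions is emitted only once.

```python
from itertools import combinations


def ngram_spans(tokens, n_max=3):
    """Yield (start, end) word indices, end inclusive, for every span of 1..n_max tokens."""
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            end = start + n - 1
            if end < len(tokens):
                yield (start, end)


def candidate_pairs(mentions):
    """Pair each mention with every later one, so no pair appears in both orders."""
    return list(combinations(mentions, 2))


# Hypothetical example sentence; names are made up for illustration.
tokens = ["Alice", "Smith", "married", "Bob", "Jones"]
spans = list(ngram_spans(tokens, n_max=3))

# Assumption: a person matcher kept only these two spans.
person_mentions = [(0, 1), (3, 4)]
pairs = candidate_pairs(person_mentions)

print(len(spans))
print(pairs)
```

Every span of up to three tokens is a potential mention, so the matcher, not the n-gram generator, is what keeps the candidate count manageable.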

In [4]:
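def number_of_people(sentence):
    """Count maximal runs of consecutive 'PERSON' NER tags in a sentence."""
    active_sequence = False  # True while we are inside a run of PERSON tags
    count = 0
    for tag in sentence.ner_tags:
        if tag == 'PERSON' and not active_sequence:
            # A new run of PERSON tags starts here
            active_sequence = True
            count += 1
        elif tag != 'PERSON' and active_sequence:
            active_sequence = False
    return count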
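
The counting logic above can be checked without a database. The sketch below is an equivalent formulation using `itertools.groupby`; `FakeSentence` and `count_person_mentions` are hypothetical names introduced only for illustration, and the stand-in models nothing but the `ner_tags` attribute.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a Snorkel Sentence: only ner_tags is modeled.
FakeSentence = namedtuple("FakeSentence", ["ner_tags"])


def count_person_mentions(sentence):
    """Count maximal runs of consecutive 'PERSON' tags."""
    # groupby collapses each run of equal adjacent tags into one group.
    return sum(1 for tag, _ in groupby(sentence.ner_tags) if tag == "PERSON")


s = FakeSentence(ner_tags=["PERSON", "PERSON", "O", "PERSON", "O", "O", "PERSON"])
print(count_person_mentions(s))
```

Note that two different people mentioned back to back, with no non-`PERSON` token between them, are counted as a single run by both versions.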

In [5]:
from snorkel.models import Document
docs = session.query(Document).order_by(Document.name).all()
print "Total Documents: {}".format(len(docs))

Total Documents: 21


In [6]:
import os
import csv

labeled_docs = set()
with open(os.environ['SNORKELHOME'] + '/tutorials/semparse/data/articles_dev.tsv') as tsvin:
    reader = csv.reader(tsvin, delimiter='\t')
    for row in reader:
        doc = row[0]
        labeled_docs.add(doc)
print "Labeled documents: {}".format(len(labeled_docs))

Labeled documents: 71


In [7]:
from snorkel.models import Document
import random

train_sents = set()
dev_sents = set()
for doc in docs:
    if doc.name in labeled_docs:
        sents = dev_sents
    else:
        sents = train_sents
    for s in doc.sentences:
        if number_of_people(s) < 5:
            sents.add(s)

# filtered_sents = 0
# train_sents = set()
# dev_sents   = set()
# test_sents  = set()
# unlabeled_sents = set()
# splits = [0, 1, 0] # train, dev, test
# for i, doc in enumerate(docs):
#     for s in doc.sentences:
#         if number_of_people(s) < 5:
#             if doc.name in labeled_docs:
#                 r = random.random()
#                 if r < splits[0]:
#                     train_sents.add(s)
#                 elif r < (splits[0] + splits[1]):
#                     dev_sents.add(s)
#                 else:
#                     test_sents.add(s)
#             else:
#                 unlabeled_sents.add(s)
#         else:
#             filtered_sents += 1

In [8]:
print "Train sentences: %d" % len(train_sents)
print "Dev sentences: %d" % len(dev_sents)
# print "Test sentences: %d" % len(test_sents)
# print "Unlabeled sentences: %d" % len(unlabeled_sents)
# print "Filtered sentences: %d" % filtered_sents

Train sentences: 416
Dev sentences: 32


## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `apply` with the contexts to extract from and the `split` that the resulting candidates belong to. The loop below does this once for the training sentences and once for the dev sentences.

In [13]:
for i, sents in enumerate([train_sents, dev_sents]): #, test_sents, unlabeled_sents]):
    %time cand_extractor.apply(sents, split=i, parallelism=1, clear=True)
    print "Number of candidates: %d" % session.query(Spouse).filter(Spouse.split == i).count()
    print

Clearing existing...
Running UDF...

CPU times: user 1.34 s, sys: 35.6 ms, total: 1.38 s
Wall time: 1.39 s
Number of candidates: 122

Clearing existing...
Running UDF...

CPU times: user 161 ms, sys: 8.82 ms, total: 170 ms
Wall time: 181 ms
Number of candidates: 5



Here we assigned each set of `Candidates` to a split through the `split` argument: the training sentences get `split=0` and the dev sentences get `split=1`. Recall that we refer to train / dev / test as splits 0 / 1 / 2.
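
The split convention (train / dev / test as 0 / 1 / 2) can be sketched with plain Python data. This is an illustration only: `SPLIT_NAMES`, `assign_splits`, and `count_by_split` are hypothetical, and dictionaries stand in for the rows Snorkel stores in its database.

```python
# Hypothetical mapping from split index to a readable name.
SPLIT_NAMES = {0: "train", 1: "dev", 2: "test"}


def assign_splits(sentence_sets):
    """Tag every item with the index of the set it came from."""
    rows = []
    for split, sents in enumerate(sentence_sets):
        for s in sorted(sents):
            rows.append({"sentence": s, "split": split})
    return rows


def count_by_split(rows):
    """Count rows per split, keyed by the readable split name."""
    counts = {}
    for row in rows:
        name = SPLIT_NAMES[row["split"]]
        counts[name] = counts.get(name, 0) + 1
    return counts


# Made-up sentence identifiers: three for train, one for dev.
rows = assign_splits([{"s1", "s2", "s3"}, {"s4"}])
counts = count_by_split(rows)
print(counts)
```

Filtering on the stored split index, as the query `Spouse.split == i` does in the cell above, is then enough to recover each subset.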

Note also that, again, we could have set `parallelism` to a value greater than 1 to execute in parallel, if we had a non-SQLite database set up. We retrieve the candidates we just extracted in the next section.

## Using the `Viewer` to inspect candidates

Next, we'll use the `Viewer` class--here, specifically, the `SentenceNgramViewer`--to inspect the data.

Note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.
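
To make the recall-over-precision goal concrete, here is a small sketch that scores two made-up candidate sets against hand-written gold pairs. The function `precision_recall` and all of the data are illustrative assumptions, not part of Snorkel or of this dataset.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of an extracted set against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    precision = true_pos / float(len(extracted)) if extracted else 0.0
    recall = true_pos / float(len(gold)) if gold else 0.0
    return precision, recall


# Hypothetical gold spouse pairs and two hypothetical extraction strategies.
gold = {("A", "B"), ("C", "D")}
broad = {("A", "B"), ("C", "D"), ("A", "C"), ("B", "D")}  # over-generates
narrow = {("A", "B")}                                      # misses a true pair

broad_scores = precision_recall(broad, gold)
narrow_scores = precision_recall(narrow, gold)
print(broad_scores)
print(narrow_scores)
```

The broad set has lower precision but finds every true pair, which is the behavior we want at this stage: a true candidate that is never extracted cannot be recovered later, while false candidates can still be filtered out downstream.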

First, we instantiate the `Viewer` object, which groups the input `Candidate` objects by `Sentence`:

In [10]:
from snorkel.viewer import SentenceNgramViewer

dev_cands = session.query(Spouse).filter(Spouse.split == 1).all()
sv = SentenceNgramViewer(dev_cands[:300], session)

Next, we render the `Viewer`:

In [11]:
sv

In [12]:
if 'CI' not in os.environ:
    print unicode(sv.get_selected())

Spouse(Span("nNukri Revishvili", sentence=17, chars=[4,20], words=[3,4]), Span("Strachan", sentence=17, chars=[82,89], words=[16,16]))


Next, in Part III, we will annotate some candidates with labels so that we can evaluate performance.