# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part II: `Candidate` Extraction

In [1]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()



## Loading the `Corpus`

First, we will load the `Corpus` that we preprocessed in Part I:

In [2]:
from snorkel.models import Corpus

corpus = session.query(Corpus).filter(Corpus.name == 'News Training').one()
corpus

Corpus (News Training)

Next, we collect each `Sentence` in the `Corpus` into a `set`.

At this point, we _filter out_ sentences that mention five or more people. We count people as contiguous sequences of tokens tagged as person names by [CoreNLP](http://stanfordnlp.github.io/CoreNLP/), the tool on which our `SentenceParser` is built.

In [3]:
def number_of_people(sentence):
    active_sequence = False
    count = 0
    for tag in sentence.ner_tags:
        if tag == 'PERSON' and not active_sequence:
            active_sequence = True
            count += 1
        elif tag != 'PERSON' and active_sequence:
            active_sequence = False
    return count
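
As a cross-check on the counting logic above, here is a minimal sketch of the same idea using `itertools.groupby`. It takes a plain list of tags rather than a `Sentence`, and the name `count_person_runs` and the sample tags are made up for illustration; they are not part of Snorkel.

```python
from itertools import groupby

def count_person_runs(ner_tags):
    """Count maximal runs of consecutive 'PERSON' tags."""
    # groupby collapses each run of equal adjacent tags into one group,
    # so every 'PERSON' group corresponds to one contiguous person mention.
    return sum(1 for tag, _ in groupby(ner_tags) if tag == 'PERSON')

# Hypothetical tag sequence: a two-token name followed by two one-token names.
tags = ['PERSON', 'PERSON', 'O', 'O', 'PERSON', 'O', 'PERSON']
print(count_person_runs(tags))  # 3
```

Two adjacent but distinct names with no token between them would be counted as one person by either version, which is acceptable for a coarse filter like this one.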

In [4]:
sentences = set()
for document in corpus:
    for sentence in document.sentences:
        if number_of_people(sentence) < 5:
            sentences.add(sentence)

## Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`, and we define it using a helper function.

Here we'll define a binary _spouse relation mention_ which connects two `Span` objects of text.  Note that this function will create the table in the database backend if it does not exist:

In [5]:
from snorkel.models import candidate_subclass

Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

## Writing a basic `CandidateExtractor`

Next, we'll write a basic function to extract **candidate spouse relation mentions** from the corpus.  The `SentenceParser` we used in Part I is built on [CoreNLP](http://stanfordnlp.github.io/CoreNLP/), which performs _named entity recognition_ for us.

We will extract `Candidate` objects of the `Spouse` type by identifying, for each `Sentence`, all pairs of ngrams (up to trigrams) that were tagged as people.

First, we define a child context space for our sentences: all n-grams of up to three tokens.

In [6]:
from snorkel.candidates import Ngrams

ngrams = Ngrams(n_max=3)
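
To make the size of this space concrete, the sketch below enumerates every n-gram of up to three tokens in a short token list. It is a plain-Python illustration, not Snorkel's `Ngrams` implementation; `ngram_spans` and the sample tokens are hypothetical. Spans are written as inclusive word indices, the same convention as the `words=[...]` field that `Span` objects print.

```python
def ngram_spans(tokens, n_max=3):
    """Yield (start, end) word indices, end inclusive, for each n-gram with n <= n_max."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield (start, start + n - 1)

# Hypothetical tokenized sentence.
tokens = ['Alice', 'Smith', 'met', 'Bob']
for start, end in ngram_spans(tokens):
    print(' '.join(tokens[start:end + 1]))
```

Four tokens give 4 unigrams, 3 bigrams, and 2 trigrams, so the space grows roughly linearly with sentence length for a fixed `n_max`.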

Next, we use a `PersonMatcher` to enforce that candidate relations are composed of pairs of spans that were tagged as people by the `SentenceParser`.

In [7]:
from snorkel.matchers import PersonMatcher

person_matcher = PersonMatcher(longest_match_only=True)
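
The sketch below illustrates one plausible reading of `longest_match_only=True`: among all n-grams whose tokens are tagged as people, keep only those not contained in a longer match. This is an assumption about the flag's behavior based on its name, not Snorkel's actual code, and `person_spans` is a hypothetical helper operating on a plain list of tags.

```python
def person_spans(ner_tags, n_max=3, longest_match_only=True):
    """Return (start, end) spans, end inclusive, whose tokens are all tagged 'PERSON'."""
    matches = []
    # Visit longer n-grams first so shorter spans can be checked against kept matches.
    for n in range(n_max, 0, -1):
        for start in range(len(ner_tags) - n + 1):
            end = start + n - 1
            if any(tag != 'PERSON' for tag in ner_tags[start:end + 1]):
                continue
            contained = any(s <= start and end <= e for s, e in matches)
            if longest_match_only and contained:
                continue  # drop sub-spans of a match we already kept
            matches.append((start, end))
    return sorted(matches)

# Hypothetical tags: a two-token name, a non-person token, a one-token name.
tags = ['PERSON', 'PERSON', 'O', 'PERSON']
print(person_spans(tags))                            # [(0, 1), (3, 3)]
print(person_spans(tags, longest_match_only=False))  # [(0, 0), (0, 1), (1, 1), (3, 3)]
```

Under this reading, a full name is proposed as a single span rather than also as its first and last name separately, which keeps the number of candidate pairs down.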

Finally, we combine the candidate class, child context space, and matcher into an extractor.

In [8]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(Spouse, [ngrams, ngrams], [person_matcher, person_matcher],
                        symmetric_relations=False, nested_relations=False, self_relations=False)
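
The three boolean flags control which pairs of matched spans become candidates. The following sketch shows one way such a pairing step could work, assuming the flags mean what their names suggest; it is not taken from the Snorkel source, and `candidate_pairs` and the sample spans are hypothetical.

```python
def candidate_pairs(spans, symmetric_relations=False,
                    nested_relations=False, self_relations=False):
    """Enumerate ordered pairs of (start, end) spans under the three flags."""
    def contains(outer, inner):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    pairs = []
    for i, a in enumerate(spans):
        for j, b in enumerate(spans):
            if not self_relations and a == b:
                continue  # skip a span paired with itself
            if not nested_relations and a != b and (contains(a, b) or contains(b, a)):
                continue  # skip pairs where one span contains the other
            if not symmetric_relations and i > j:
                continue  # keep only one ordering of each pair
            pairs.append((a, b))
    return pairs

# Hypothetical person spans in one sentence, as inclusive word indices.
spans = [(0, 0), (9, 10), (14, 14)]
print(candidate_pairs(spans))
# [((0, 0), (9, 10)), ((0, 0), (14, 14)), ((9, 10), (14, 14))]
```

With all three flags off, as in the cell above, three person mentions yield three candidates; turning on `symmetric_relations` in this sketch would double that by also emitting each pair in reverse order.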

## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling `extract` with the contexts to extract from, a name for the `CandidateSet` that will contain the results, and the current session.

In [9]:
%time c = ce.extract(sentences, 'News Training Candidates', session)
print "Number of candidates:", len(c)


CPU times: user 2min 13s, sys: 4.62 s, total: 2min 17s
Wall time: 2min 22s
Number of candidates: 4698


### Saving the extracted candidates

In [10]:
session.add(c)
session.commit()

### Reloading the candidates

In [11]:
from snorkel.models import CandidateSet
c = session.query(CandidateSet).filter(CandidateSet.name == 'News Training Candidates').one()
c

Candidate Set (News Training Candidates)

## Using the `Viewer` to inspect candidates

Next, we'll use a `Viewer` (here, specifically, the `SentenceNgramViewer`) to inspect the data.

Note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.

First, we instantiate the `Viewer` object, which groups the input `Candidate` objects by `Sentence`:

In [12]:
from snorkel.viewer import SentenceNgramViewer

# NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# You should ignore this!
import os
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(c[:300], session, annotator_name="Tutorial Part 2 User")
else:
    sv = None


Next, we render the `Viewer`.

In [13]:
sv



Note that we can **navigate using the provided buttons** or **using the keyboard (hover over buttons to see controls)**, highlight candidates (even if they overlap), and **apply binary labels** (more on where to use this later!). In particular, **the Viewer is synced dynamically with the notebook**, so we can, for example, get the `Candidate` that is currently selected. Try it out!

In [14]:
if 'CI' not in os.environ:
    print unicode(sv.get_selected())

Spouse(Span("Hayes", parent=13272, chars=[0,4], words=[0,0]), Span("Stephanie Moseley", parent=13272, chars=[55,71], words=[9,10]))


### Repeating for development and test corpora
We will rerun the same operations for the other two news corpora: development and test. All we do for each is load in the `Corpus` object, collect the `Sentence` objects, and run them through the `CandidateExtractor`.

In [15]:
for corpus_name in ['News Development', 'News Test']:
    corpus = session.query(Corpus).filter(Corpus.name == corpus_name).one()
    sentences = set()
    for document in corpus:
        for sentence in document.sentences:
            if number_of_people(sentence) < 5:
                sentences.add(sentence)
    
    %time c = ce.extract(sentences, corpus_name + ' Candidates', session)
    session.add(c)
session.commit()


CPU times: user 7.63 s, sys: 283 ms, total: 7.92 s
Wall time: 8.17 s

CPU times: user 8.57 s, sys: 360 ms, total: 8.93 s
Wall time: 9.52 s


Next, in Part III, we will annotate some candidates with labels so that we can evaluate performance.