# Intro. to Snorkel: Extracting Spouse Relations from the News
## Part 1: Writing Pattern-based Labeling Functions


In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
import numpy as np
from snorkel.models import Candidate
from snorkel import SnorkelSession

session = SnorkelSession()



Snorkel requires that we formally define a type for our candidate.

In [4]:
from snorkel.models import candidate_subclass
try:
    Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
except:
    print>>sys.stderr,"Info: Candidate type already defined"

Info: Candidate type already defined


## Using a _development set_ of human-labeled data

Here, we use the phrase "development set" to refer to a set of examples (a subset of our training set) that we label by hand and use to develop and refine labeling functions.  Unlike the _test set_, which we do not look at and reserve for final evaluation, we can inspect the development set while writing labeling functions.

In [15]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

## 1. Creating Labeling Functions

In Snorkel, the primary way we provide training signal to the end extraction model is by writing **labeling functions (LFs)**, as opposed to hand-labeling massive training sets.  We'll go through some examples for our spouse extraction task below.

A labeling function isn't anything special. It's just a Python function that accepts a `Candidate` as its input argument and returns `1` if it says the `Candidate` should be marked as true, `-1` if it says the `Candidate` should be marked as false, and `0` if it doesn't know how to vote and abstains. In practice, many labeling functions are unipolar: they label only `1`s and `0`s, or only `-1`s and `0`s.
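As a minimal sketch of this contract, the functions below vote on a plain list of tokens rather than on a real `Candidate`. The names `lf_married_keyword`, `lf_sibling_keyword`, and `lf_relationship` are hypothetical and exist only for illustration; they are not part of Snorkel.

```python
# Hypothetical stand-in: each "candidate" here is just the list of tokens
# between two person mentions, not a Snorkel Candidate object.

def lf_married_keyword(tokens):
    """Unipolar positive LF: votes 1 or abstains (0)."""
    return 1 if "married" in tokens else 0

def lf_sibling_keyword(tokens):
    """Unipolar negative LF: votes -1 or abstains (0)."""
    return -1 if {"brother", "sister"} & set(tokens) else 0

def lf_relationship(tokens):
    """Bipolar LF: may vote 1, vote -1, or abstain (0)."""
    if "husband" in tokens or "wife" in tokens:
        return 1
    if "colleague" in tokens:
        return -1
    return 0

toy_lfs = (lf_married_keyword, lf_sibling_keyword, lf_relationship)
examples = [
    ["and", "his", "wife"],
    ["and", "her", "brother"],
    ["met", "with"],
]

# One list of votes per example, in the order of toy_lfs.
votes = [[lf(tokens) for lf in toy_lfs] for tokens in examples]
print(votes)
```

Each function stays cheap and narrow on purpose: an LF that abstains most of the time is still useful as long as its non-zero votes carry some signal.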

Recall that our goal is to ultimately train a high-performance classification model that predicts which of our `Candidate`s are true mentions of spouse relations.  It turns out that we can do this by writing potentially low-quality labeling functions!

## Helper functions

These are Python helper functions that you can apply to candidates to return objects that are helpful during LF development.

You can (and should!) write your own helper functions to help write LFs.

In [None]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

### Example Use:

In [17]:
candidates = session.query(Candidate).filter(Candidate.split == 0).all()[0:5]

In [18]:
from viz import display_candidate
candidates = session.query(Candidate).filter(Candidate.split == 0).all()[0:5]
for c in candidates:
    display_candidate(c)

### Candidates and Spans
When applied directly to a `Candidate` object, `get_left_tokens` returns the tokens to the left of the leftmost argument in the candidate pair, and `get_right_tokens` returns the tokens to the right of the rightmost argument. The number of tokens returned is set with the `window` parameter.
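The windowing idea can be sketched on a plain token list. This is only an illustration under simplifying assumptions: `left_window` and `right_window` are hypothetical helpers, not Snorkel's implementation, and the sentence and token indices are made up.

```python
# Hypothetical stand-ins for illustration; these are not Snorkel's helpers.

def left_window(tokens, start, window=3):
    """Return up to `window` tokens immediately to the left of index `start`."""
    return tokens[max(0, start - window):start]

def right_window(tokens, end, window=3):
    """Return up to `window` tokens immediately to the right of index `end`."""
    return tokens[end + 1:end + 1 + window]

sentence = "Last year Alice Smith married her partner Bob Jones in Boston".split()
person1 = (2, 3)  # first and last token index of "Alice Smith"
person2 = (7, 8)  # first and last token index of "Bob Jones"

left = left_window(sentence, person1[0], window=2)      # left of the first argument
right = right_window(sentence, person2[1], window=2)    # right of the second argument
between = sentence[person1[1] + 1:person2[0]]           # tokens between the arguments

print(left)
print(right)
print(between)
```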

In [7]:
print "Candidate LEFT tokens:   \t", list(get_left_tokens(c,window=2))
print "Candidate RIGHT tokens:  \t", list(get_right_tokens(c,window=2))
print "Candidate BETWEEN tokens:\t", get_text_between(c)

Candidate LEFT tokens:   	[u'games', u'with']
Candidate RIGHT tokens:  	[u'\\', u'n']
Candidate BETWEEN tokens:	 which states, "written to be understandable by kids as young as 10 to 12 years old, although it is great for anyone of any age who has never programmed before":\n\n\nInvent with 


We can also apply these helper functions to `Span` objects. These are the arguments of the `Spouse` relation, which is defined by the type definition at the beginning of this notebook.

In [24]:
idx = 1
print candidates[idx].person1.get_span().split()[-1]
print candidates[idx].person2.get_span().split()[-1]


Hughes
Morgan


In [26]:
# Tip: in IPython/Jupyter, appending ?? to a name shows its source code, e.g.:
# get_text_between??

In [8]:
print "Person1 LEFT tokens:  \t", list(get_left_tokens(c.person1,window=2))
print "Person1 RIGHT tokens: \t", list(get_right_tokens(c.person1,window=2))

print "Person2 LEFT tokens:  \t", list(get_left_tokens(c.person2,window=2))
print "Person2 RIGHT tokens: \t", list(get_right_tokens(c.person2,window=2))

Person1 LEFT tokens:  	[u'games', u'with']
Person1 RIGHT tokens: 	[u'which', u'states']
Person2 LEFT tokens:  	[u'ninvent', u'with']
Person2 RIGHT tokens: 	[u'\\', u'n']


# Sandbox

Write your labeling functions below:

In [10]:
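# Words that suggest a non-spouse relationship
other = {'boyfriend', 'girlfriend'}

def LF_wife_in_sentence(c):
    # Vote true if "wife" appears between the two person mentions; otherwise abstain.
    #return 1 if 'wife' in c.get_parent().words else 0
    return 1 if "wife" in get_between_tokens(c) else 0

def LF_other_relationship(c):
    # Vote false if any word in `other` appears between the mentions; otherwise abstain.
    return -1 if len(other.intersection(get_between_tokens(c))) > 0 else 0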

## Evaluating Labeling Functions

### Individual LF Statistics
One simple thing we can do is quickly test a labeling function on our development set (or any other set) without saving anything to the database. For example, we can easily get every candidate to which an LF assigns a non-zero label:

In [11]:
def eval_lf(lf, split, gold=None):
    labeled = []
    cands = session.query(Spouse).filter(Spouse.split == split).order_by(Candidate.id).all()
    for i,c in enumerate(cands):
        if lf(c) != 0:
            if gold is not None and gold.size != 0:
                labeled.append((c, gold[i,0]))
            else:
                labeled.append(c)
    print("Number labeled:", len(labeled))
    return labeled

In [12]:
labeled = eval_lf(LF_wife_in_sentence, 1)

('Number labeled:', 4)


We can then easily put this into the Viewer to see individual candidates.

In [13]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)


Alternatively, we can view candidates en masse.
Warning: this is slow for very large candidate sets, so use it with caution.

In [16]:
for c,label in eval_lf(LF_wife_in_sentence, 1, L_gold_dev):
    display_candidate(c, label=label)

('Number labeled:', 4)


For later convenience, we group the labeling functions into a list (`LFs`) when we apply them in Section 2 below.

### Formal Metrics

In [19]:
from snorkel.lf_helpers import test_LF
tp, fp, tn, fn = test_LF(session, LF_wife_in_sentence, split=1, annotator_name='gold')

Scores (Un-adjusted)
Pos. class accuracy: 1.0
Neg. class accuracy: 0.0
Precision            0.5
Recall               1.0
F1                   0.667
----------------------------------------
TP: 2 | FP: 2 | TN: 0 | FN: 0
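The scores above can be reproduced by hand from the four counts. The sketch below assumes the standard definitions of precision, recall, F1, and per-class accuracy; it does not call Snorkel and is not taken from its source.

```python
# Counts reported for LF_wife_in_sentence on the development set.
tp, fp, tn, fn = 2, 2, 0, 0

# float() keeps the division exact under both Python 2 and Python 3.
precision = tp / float(tp + fp)      # share of positive votes that are correct
recall = tp / float(tp + fn)         # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

pos_class_accuracy = tp / float(tp + fn)
neg_class_accuracy = tn / float(tn + fp)

print("Precision: %.3f" % precision)
print("Recall:    %.3f" % recall)
print("F1:        %.3f" % f1)
```

With only four labeled candidates, these numbers are very coarse: a single additional false positive would move precision from 0.5 to 0.4.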



## 2. Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database.  We'll do this using the `LabelAnnotator` class, a UDF which we will again run with `UDFRunner`.  **Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.**  We start by setting up the class:

In [27]:
LFs = [
    LF_wife_in_sentence,
    LF_other_relationship
]

In [28]:
from snorkel.annotations import LabelAnnotator
labeler = LabelAnnotator(lfs=LFs)

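Note that the candidates are read from the database: `split=0` selects the training candidates, so the size of the data is not visible in the code itself. The shape of the returned label matrix reveals it, with one row per candidate and one column per labeling function.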

In [29]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train.shape

Clearing existing...
Running UDF...

CPU times: user 24.4 s, sys: 341 ms, total: 24.7 s
Wall time: 25.2 s


(4781, 2)

In [30]:
L_train = labeler.load_matrix(session, split=0)
L_train.shape

(4781, 2)
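Conceptually, the label matrix has one row per candidate and one column per labeling function, which is why its second dimension is 2 here. The following sketch builds such a matrix in plain Python under simplifying assumptions: the candidates are token lists, `lf_wife`, `lf_other`, and `apply_lfs` are hypothetical names, and nothing is stored in a database.

```python
# Hypothetical stand-ins: candidates are token lists, and the LFs are toy
# keyword checks rather than the notebook's Snorkel-based functions.

def lf_wife(tokens):
    return 1 if "wife" in tokens else 0

def lf_other(tokens):
    return -1 if {"boyfriend", "girlfriend"} & set(tokens) else 0

def apply_lfs(lfs, candidates):
    """Build a dense label matrix: one row per candidate, one column per LF."""
    return [[lf(c) for lf in lfs] for c in candidates]

toy_candidates = [
    ["and", "his", "wife"],
    ["and", "her", "boyfriend"],
    ["spoke", "with"],
]

L_toy = apply_lfs([lf_wife, lf_other], toy_candidates)
toy_shape = (len(L_toy), len(L_toy[0]))
print(L_toy)
print(toy_shape)
```

Most entries in a real label matrix are zero because each LF abstains on most candidates, which is consistent with the low coverage values typical of keyword-based LFs.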

In [None]:
#L_train.get_candidate(session, 0)

In [None]:
#L_train.get_key(session, 0)

We can also view statistics about the resulting label matrix.

* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a *conflicting* non-zero label for.
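These three definitions can be checked on a small hand-made matrix. The function `toy_lf_stats` below is a hypothetical pure-Python sketch of the definitions as stated here; it is not Snorkel's implementation, and the matrix values are invented for illustration.

```python
def toy_lf_stats(L):
    """Return (coverage, overlap, conflict) per LF for a dense label matrix.

    L is a list of rows (one per candidate); each row holds one label in
    {-1, 0, 1} per labeling function.
    """
    n_rows = float(len(L))
    n_lfs = len(L[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == 0:
                continue  # this LF abstained on this candidate
            covered += 1
            # Non-zero labels emitted by the other LFs on the same candidate.
            others = [row[k] for k in range(n_lfs) if k != j and row[k] != 0]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append((covered / n_rows, overlaps / n_rows, conflicts / n_rows))
    return stats

# Rows are candidates, columns are three toy LFs.
L_example = [
    [1, 0, 1],
    [1, -1, 0],
    [0, -1, 0],
    [0, 0, 0],
]

example_stats = toy_lf_stats(L_example)
for j, (cov, ovl, con) in enumerate(example_stats):
    print("LF %d: coverage=%.2f overlap=%.2f conflict=%.2f" % (j, cov, ovl, con))
```

By construction, conflict can never exceed overlap, and overlap can never exceed coverage, which is a quick sanity check on any reported statistics.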

In [31]:
L_train.lf_stats(session)

| | j | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|
| LF_wife_in_sentence | 0 | 0.02489 | 0.000209 | 0.000209 |
| LF_other_relationship | 1 | 0.004602 | 0.000627 | 0.000627 |
