<img align="left" src="imgs/fonduer-logo.png" width="100px" style="margin-right:20px">

# Tutorial: Providing Supervision using Labeling Functions

## Running locally?

If you're running this tutorial interactively on your own machine, you'll need to create a new PostgreSQL database named `intro_supervision`.

If a database named `intro_supervision` already exists in your PostgreSQL instance, uncomment the first line below to drop it. If you have not yet downloaded our database snapshots, do so by executing `./download_data.sh` in the intro tutorial directory.

In [1]:
# ! dropdb --if-exists intro_supervision
! createdb intro_supervision
! psql intro_supervision < data/intro_supervision.sql > /dev/null

## Providing Supervision by Writing Labeling Functions

In this tutorial, you will learn what a labeling function (LF) is and how to write one by leveraging Fonduer's [library of labeling function helpers](http://fonduer.readthedocs.io/en/latest/user/lf_helpers.html).

At a high level, a labeling function is a simple Python function that takes a candidate (a part and numerical value, in these intro tutorials) as input, and returns a label for the input candidate. Labels can be one of these values: {-1, 0, 1}. A label of -1 signifies that a candidate is False, 0 is a way to abstain from voting, and +1 labels the candidate as True.
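As a minimal sketch of this idea, the snippet below defines two toy labeling functions over a stand-in candidate type. `ToyCandidate`, its fields, and the threshold of 1000 are hypothetical and exist only for illustration; real Fonduer candidates are richer objects, as shown later in this tutorial.

```python
from dataclasses import dataclass

TRUE, ABSTAIN, FALSE = 1, 0, -1


# Hypothetical stand-in for a candidate (not a Fonduer class).
@dataclass
class ToyCandidate:
    part: str
    value: int
    row_words: tuple


def lf_storage_keyword(c):
    """Vote True when 'storage' appears in the row, otherwise abstain."""
    return TRUE if "storage" in c.row_words else ABSTAIN


def lf_value_range(c):
    """Vote False when the value is implausibly large, otherwise abstain."""
    return FALSE if c.value > 1000 else ABSTAIN


cands = [
    ToyCandidate("2N3904", 150, ("storage", "temperature")),
    ToyCandidate("2N3904", 5000, ("collector", "current")),
]

# One row of votes per candidate, one column per labeling function.
votes = [[lf(c) for lf in (lf_storage_keyword, lf_value_range)] for c in cands]
print(votes)  # [[1, 0], [0, -1]]
```

Each labeling function only votes when its pattern applies and abstains otherwise, which is why most entries in a label matrix are typically 0.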

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
import logging

# Configure logging for Fonduer
logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

ATTRIBUTE = "intro_supervision"
conn_string = 'postgres://localhost:5432/' + ATTRIBUTE

from fonduer import Meta

session = Meta.init(conn_string).Session()

from fonduer import candidate_subclass
from fonduer import mention_subclass

Part = mention_subclass("Part")
Attr = mention_subclass("Attr")
PartAttr = candidate_subclass("PartAttr", [Part, Attr])

[INFO] fonduer.meta - Validating postgres://localhost:5432/intro_supervision as a connection string...
[INFO] fonduer.meta - Connecting None to localhost:5432/intro_supervision
[INFO] fonduer.meta - Initializing the storage schema


## I. Background

### Using a Development Set to Evaluate our Supervision
For convenience in error analysis and evaluation, we have already annotated the dev and test sets for this tutorial, and we'll now load them using an externally-defined helper function.

Loading and saving external "gold" labels can be a bit messy, but is often a critical part of development, especially when gold labels are expensive and/or time-consuming to obtain. `Fonduer` stores all manually annotated labels in a **stable** format (called `StableLabel`s), which is somewhat independent from the rest of `Fonduer`'s data model. These labels are not deleted when you delete the candidates, corpus, or any other objects, and they can be recovered even if the rest of the data changes or is deleted.

Our general procedure with external labels is to load them into the `StableLabel` table, then use `Fonduer`'s helpers to load them into the main data model from there. If interested in example implementation details, please see the script we now load:

In [3]:
from hardware_utils import load_hardware_labels

gold_file = 'data/hardware_tutorial_gold.csv'
load_hardware_labels(session, PartAttr, gold_file, ATTRIBUTE, annotator_name='gold')




### Loading Candidates

Next, we load our training candidates by issuing a SQLAlchemy query for the `PartAttr` candidate class we defined during candidate generation.

In [4]:
train_cands = sorted(session.query(PartAttr).all())

print("Number of training candidates:", len(train_cands))

Number of training candidates: 2623


## Writing Labeling Functions

Supervision can come from different sources, such as patterns or heuristics. Fonduer uses labeling functions to encode this supervision, which helps distinguish true candidates from false ones. In this notebook, we describe how to use the Fonduer API to express supervision via signals from different modalities.

The full list of functions that you can use is documented here:

http://fonduer.readthedocs.io/en/latest/user/data_model_utils.html

In [5]:
from fonduer.utils.data_model_utils import *

### Recall: what's in a candidate:

In [6]:
cand = train_cands[0]

Let's take a look at the part number first:

In [7]:
print("part object:                     ", cand.part)
print("part text:                       ", cand.part.span.get_span())
print("part sentence object:            ", cand.part.span.sentence)
print("part sentence text:              ", cand.part.span.sentence.text)
print("check if part is in a table:     ", cand.part.span.sentence.is_tabular())
print("check if part has in visual info:", cand.part.span.sentence.is_visual())

part object:                      Part(Span("2N3904", sentence=9643, chars=[24,29], words=[3,3]))
part text:                        2N3904
part sentence object:             Sentence (Doc: 'AUKCS04635-1', Sec: 0, Par: 10, Idx: 10, Text: 'Complementary pair with 2N3904')
part sentence text:               Complementary pair with 2N3904
check if part is in a table:      False
check if part has in visual info: True


Then, we can look at the `attr`, which is the numerical value paired with the part:

In [8]:
print("attr object:                     ", cand.attr)
print("attr text:                       ", cand.attr.span.get_span())
print("attr sentence object:            ", cand.attr.span.sentence)
print("attr sentence text:              ", cand.attr.span.sentence.text)
print("check if attr is in a table:     ", cand.attr.span.sentence.is_tabular())
print("check if attr has in visual info:", cand.attr.span.sentence.is_visual())

attr object:                      Attr(Span("150", sentence=15773, chars=[0,2], words=[0,0]))
attr text:                        150
attr sentence object:             Sentence (Doc: 'AUKCS04635-1', Table: 0, Row: 6, Col: 2, Index: 60, Text: '150')
attr sentence text:               150
check if attr is in a table:      True
check if attr has in visual info: True


### Example 1: Write a labeling function to check if two mentions in one candidate are on the same page.
If they are, label the candidate True; otherwise, label it False.

In [9]:
def LF_same_page(c):
    return 1 if same_page(c) else -1

In [10]:
# Sanity check: the previous labeling function should pass the following test.
true_candidate = train_cands[81]
false_candidate = train_cands[10]

if (LF_same_page(true_candidate) == 1 and LF_same_page(false_candidate) == -1):
    print("You passed!")
else:
    print("Try again.")

You passed!


### Example 2: Write a labeling function based on your insight of the data.

For example, inspecting several documents may reveal that storage temperatures are typically listed inside a table where the row header contains the word "storage". This intuitive pattern can be directly expressed as a labeling function. Similarly, the word "temperature" is an obvious positive signal.


In [11]:
def LF_storage_row(c):
    return 1 if 'storage' in get_row_ngrams(c.attr) else 0

def LF_temperature_row(c):
    return 1 if 'temperature' in get_row_ngrams(c.attr) else 0

### Example 3: Write a labeling function based on alignment information.

In [12]:
def LF_collector_aligned(c):
    return -1 if overlap(
        ['collector', 'collector-current', 'collector-base', 'collector-emitter'],
        list(get_aligned_ngrams(c.attr))) else 0

def LF_current_aligned(c):
    return -1 if overlap(
        ['current', 'dc', 'ic'],
        list(get_aligned_ngrams(c.attr))) else 0
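To make the keyword-matching pattern in these two functions concrete, here is a small self-contained sketch. `has_overlap` is a hypothetical stand-in written for this example (it is not Fonduer's `overlap` helper), and the aligned n-grams are passed in as plain lists rather than computed from a candidate.

```python
def has_overlap(keywords, ngrams):
    """Return True if at least one keyword appears among the n-grams."""
    return not set(keywords).isdisjoint(ngrams)


def lf_current_aligned_toy(aligned_ngrams):
    """Vote False (-1) when a 'current' keyword is aligned, otherwise abstain."""
    return -1 if has_overlap(["current", "dc", "ic"], aligned_ngrams) else 0


print(lf_current_aligned_toy(["collector", "current", "ic"]))  # -1
print(lf_current_aligned_toy(["storage", "temperature"]))      # 0
```

A negative vote on aligned "current" or "collector" keywords encodes the intuition that such values describe a different attribute from the one we are extracting.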

We can then collect all of these labeling functions in a list which we will provide to Fonduer as supervision signals.

In [13]:
LFs = [
    LF_same_page,
    LF_storage_row,
    LF_temperature_row,
    LF_collector_aligned,
    LF_current_aligned
]

### Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. We'll do this using the `LabelAnnotator` class, a `UDF` which we will again run with `UDFRunner`. Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.

By default, `labeler.apply` will drop the existing table of labeling functions and the label values for each candidate. However, this behavior can be controlled by the following parameters to improve iteration performance and reduce redundant computation:
- `split` defines which set to operate on (e.g. train, dev, or test).
- `clear` can be `True` or `False`, and is `True` by default. When set to `False`, the labeling function table is not dropped, and the behavior of `labeler.apply` is defined by the following two parameters.
- `update_keys` can be `True` or `False`. When `True`, the keys (one per labeling function) are updated according to the set of labeling functions provided to the function. This should be set to `True` if new labeling functions are added. When `False`, no new LFs are evaluated and the keys of existing LFs remain the same.
- `update_values` can be `True` or `False`. This defines how to resolve conflicts. When `True`, the values assigned to each candidate are updated to the new values when in conflict. This should be set to `True` if labeling function logic is edited, even though the name of the labeling function remains the same. When `False`, the existing labels assigned to each candidate are used, and newly computed labels are ignored.
- `parallelism` is the amount of parallelism to use when labeling.

With this in mind, we set `clear=True` when we first apply our labeling functions, which ensures that the table is created and initialized with the proper keys and values.

In future iterations, we would typically set `clear=False, update_keys=True, update_values=True` so that we can simply update the set of LFs and their values without recreating the entire table. We will see how this is used later in the tutorial.
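The interaction of `clear`, `update_keys`, and `update_values` can be illustrated with a toy model. The sketch below is not Fonduer's implementation: `apply_lfs` and the dictionary-based `store` are hypothetical, and it assumes that keys are only ever added (never removed) and that a missing label under an existing key is filled in.

```python
def apply_lfs(store, lfs, candidates, clear=True,
              update_keys=False, update_values=False):
    """Toy model: `store` maps LF name -> {candidate: label}."""
    if clear:
        store.clear()                       # drop the whole "table"
        update_keys = update_values = True  # and rebuild it from scratch
    for lf in lfs:
        name = lf.__name__
        if name not in store:
            if not update_keys:
                continue                    # unseen LFs are not evaluated
            store[name] = {}
        for c in candidates:
            if update_values or c not in store[name]:
                store[name][c] = lf(c)      # overwrite, or fill a missing label
    return store


def lf_positive(x):
    return 1 if x > 0 else 0


def lf_negative(x):
    return -1 if x < 0 else 0


cands = [-2, 3]

# First run: clear=True creates the table.
store = apply_lfs({}, [lf_positive], cands)

# Later iteration: add an LF without recreating the table.
apply_lfs(store, [lf_positive, lf_negative], cands,
          clear=False, update_keys=True, update_values=True)
print(store)  # {'lf_positive': {-2: 0, 3: 1}, 'lf_negative': {-2: -1, 3: 0}}

# A stale label survives when update_values=False ...
kept = apply_lfs({"lf_positive": {-2: 1, 3: 1}}, [lf_positive], cands,
                 clear=False)
print(kept)  # {'lf_positive': {-2: 1, 3: 1}}

# ... and is recomputed when update_values=True.
refreshed = apply_lfs({"lf_positive": {-2: 1, 3: 1}}, [lf_positive], cands,
                      clear=False, update_values=True)
print(refreshed)  # {'lf_positive': {-2: 0, 3: 1}}
```

In this toy model, `update_values=False` is only safe when the logic of existing labeling functions has not changed, which mirrors the guidance given for the real parameters.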

In [14]:
from fonduer import LabelAnnotator

labeler = LabelAnnotator(PartAttr, lfs = LFs)

In [15]:
L_dev = labeler.apply_existing(split=0)

[INFO] fonduer.utils.udf - Clearing existing...
[INFO] fonduer.utils.udf - Running UDF...



[INFO] fonduer.supervision.annotations - Copying partattr_label to postgres
[INFO] fonduer.supervision.annotations - b'COPY 2452\n'


### Labeling Function Metrics

Next, we can view insights provided by Fonduer to better understand the quality and coverage of our labeling functions.

To view statistics about the resulting label matrix, we provide several metrics for evaluating labeling functions:
* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a conflicting non-zero label for.
* **TP** is the number of True Positive candidates, or true candidates which were correctly labeled as True.
* **FP** is the number of False Positive candidates, or false candidates which were incorrectly labeled as True.
* **FN** is the number of False Negative candidates, or true candidates which were incorrectly labeled as False.
* **TN** is the number of True Negative candidates, or false candidates which were correctly labeled as False.
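The first three definitions can be checked by hand on a tiny label matrix. The following is a plain-Python sketch under the definitions listed above, not Fonduer's own implementation; the matrix `L` and the helper `lf_summary` are made up for illustration.

```python
# Rows are candidates, columns are labeling functions; entries are in {-1, 0, 1}.
L = [
    [1,  0, 1],
    [1, -1, 0],
    [1,  0, 0],
    [0,  0, 0],
]


def lf_summary(L, j):
    """Return (coverage, overlap, conflict) for the labeling function in column j."""
    n = len(L)
    # Candidates that LF j labels (non-zero vote).
    labeled = [row for row in L if row[j] != 0]
    # ... for which some other LF also emits a non-zero label.
    overlaps = [row for row in labeled
                if any(v != 0 for k, v in enumerate(row) if k != j)]
    # ... for which some other LF emits a different non-zero label.
    conflicts = [row for row in labeled
                 if any(v != 0 and v != row[j]
                        for k, v in enumerate(row) if k != j)]
    return len(labeled) / n, len(overlaps) / n, len(conflicts) / n


for j in range(3):
    print(j, lf_summary(L, j))
# 0 (0.75, 0.5, 0.25)
# 1 (0.25, 0.25, 0.25)
# 2 (0.25, 0.25, 0.0)
```

Note that all three fractions are taken over the full set of candidates, so overlap and conflict can never exceed coverage.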

In addition, because we have already loaded the gold labels, we can view the empirical accuracy of these labeling functions when compared to our gold labels:

In [16]:
from fonduer import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=0)
%time L_dev.lf_stats(L_gold_dev)

CPU times: user 376 ms, sys: 0 ns, total: 376 ms
Wall time: 374 ms


| | j | Coverage | Overlaps | Conflicts | TP | FP | FN | TN | Empirical Acc. |
|---|---|---|---|---|---|---|---|---|---|
| LF_collector_aligned | 0 | 0.036297 | 0.036297 | 0.00367 | 0 | 0 | 0 | 89 | 1.0 |
| LF_current_aligned | 1 | 0.219005 | 0.219005 | 0.076672 | 0 | 0 | 0 | 537 | 1.0 |
| LF_same_page | 2 | 1.0 | 0.29894 | 0.140294 | 0 | 339 | 0 | 2113 | 0.861746 |
| LF_temperature_row | 3 | 0.079935 | 0.079935 | 0.063622 | 0 | 196 | 0 | 0 | 0.0 |
| LF_storage_row | 4 | 0.073817 | 0.073817 | 0.061175 | 0 | 181 | 0 | 0 | 0.0 |
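As a quick consistency check on these numbers, the coverage and empirical accuracy columns can be recomputed from the TP/FP/FN/TN counts. This sketch assumes that empirical accuracy is the fraction of labeled candidates whose label agrees with gold, (TP + TN) / (TP + FP + FN + TN), and that the total of 2452 candidates can be read off `LF_same_page`, which has coverage 1.0; both are inferences from the table rather than documented behavior.

```python
# (TP, FP, FN, TN) counts copied from the table above.
counts = {
    "LF_collector_aligned": (0, 0, 0, 89),
    "LF_current_aligned": (0, 0, 0, 537),
    "LF_same_page": (0, 339, 0, 2113),
    "LF_temperature_row": (0, 196, 0, 0),
    "LF_storage_row": (0, 181, 0, 0),
}

# LF_same_page labels every candidate (coverage 1.0), so 339 + 2113 is the total.
n_candidates = 339 + 2113

stats = {}
for name, (tp, fp, fn, tn) in counts.items():
    labeled = tp + fp + fn + tn              # candidates with a non-zero label
    coverage = round(labeled / n_candidates, 6)
    accuracy = round((tp + tn) / labeled, 6)  # assumed definition, see above
    stats[name] = (coverage, accuracy)
    print(f"{name}: coverage={coverage}, accuracy={accuracy}")
```

Under these assumptions the recomputed values agree with the Coverage and Empirical Acc. columns to six decimal places.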
