<img align="left" src="imgs/fonduer-logo.png" width="100px" style="margin-right:20px">

# Tutorial: Providing Supervision using Labeling Functions

## Running locally?

If you're running this tutorial interactively on your own machine, you'll need to create a new PostgreSQL database named `intro_supervision`.

If a database named `intro_supervision` already exists in your PostgreSQL instance, uncomment the first line below to drop it. If you have not yet downloaded the database snapshots, run `./download_data.sh` in the intro tutorial directory.

In [1]:
#! dropdb --if-exists intro_supervision
! createdb intro_supervision
! psql intro_supervision < data/intro_supervision.sql > /dev/null

## Providing Supervision by Writing Labeling Functions

In this tutorial, you will learn what a labeling function (LF) is and how to write one by leveraging Fonduer's [data model utilities](https://fonduer.readthedocs.io/en/stable/user/data_model_utils.html).

At a high level, a labeling function is a simple Python function that takes a candidate (a part and numerical value, in these intro tutorials) as input and returns a label for it. In this tutorial, labels take one of three integer values, defined below as constants: `ABSTAIN = 0` abstains from voting, `FALSE = 1` labels the candidate as False, and `TRUE = 2` labels the candidate as True.
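
As a minimal sketch of this contract, the toy example below replaces a real Fonduer candidate with a hypothetical `ToyCandidate` tuple and uses the integer label constants that this tutorial defines further down (`ABSTAIN = 0`, `FALSE = 1`, `TRUE = 2`). The `unit` field, the function name, and the unit lists are illustrative assumptions, not part of Fonduer.

```python
from collections import namedtuple

# Label constants, matching the ones defined later in this tutorial.
ABSTAIN, FALSE, TRUE = 0, 1, 2

# Hypothetical stand-in for a candidate: a part number, a numeric string,
# and the unit printed next to the number (an assumed field for this sketch).
ToyCandidate = namedtuple("ToyCandidate", ["part", "attr", "unit"])


def LF_toy_unit(c):
    """Vote on a candidate using only the unit next to the value."""
    if c.unit == "V":
        return TRUE      # a voltage unit supports a voltage attribute
    if c.unit in {"A", "mA", "W"}:
        return FALSE     # current or power units contradict it
    return ABSTAIN       # no unit information: do not vote


candidates = [
    ToyCandidate("2N3904", "150", "V"),
    ToyCandidate("2N3904", "200", "mA"),
    ToyCandidate("2N3904", "625", ""),
]
labels = [LF_toy_unit(c) for c in candidates]
print(labels)
```

Each labeling function may abstain whenever its heuristic does not apply, which lets many narrow heuristics be combined later.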

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys
import logging

# Configure logging for Fonduer
logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

ATTRIBUTE = "intro_supervision"
conn_string = f'postgresql://localhost:5432/{ATTRIBUTE}'

from fonduer import Meta

session = Meta.init(conn_string).Session()

from fonduer.candidates.models import candidate_subclass, mention_subclass

Part = mention_subclass("Part")
Attr = mention_subclass("Attr")
PartAttr = candidate_subclass("PartAttr", [Part, Attr])

[INFO] fonduer.meta - Connecting user:None to localhost:5432/intro_supervision
[INFO] fonduer.meta - Initializing the storage schema


## I. Background

### Using a Development Set to Evaluate our Supervision
For convenience in error analysis and evaluation, we have already annotated the dev and test set for this tutorial, and we'll now load it using an externally-defined helper function. If you're interested in the example implementation details, please see the script we now load:

In [3]:
from hardware_utils import load_hardware_labels

gold_file = 'data/hardware_tutorial_gold.csv'
load_hardware_labels(session, PartAttr, gold_file, ATTRIBUTE, annotator_name='gold')

Loading 2533 candidate labels




GoldLabels created: 2533


### Loading Candidates

Next, we can get our training candidates by issuing a SQLAlchemy query for the `PartAttr` candidate class we defined during candidate generation.

In [4]:
train_cands = sorted(session.query(PartAttr).all())

print(f"Number of training candidates: {len(train_cands)}")

Number of training candidates: 2533


## Writing Labeling Functions 

Supervision can come from different sources, such as patterns or heuristics. Fonduer uses labeling functions to encode this supervision, which is then used to distinguish true candidates from false ones. In this notebook, we describe how to use the Fonduer API to express supervision via signals from different modalities.

The full list of functions that you can use is documented here:

https://fonduer.readthedocs.io/en/stable/user/data_model_utils.html

In [5]:
from fonduer.utils.data_model_utils import *

### Recall: what's in a candidate:

In [6]:
cand = train_cands[0]

Let's take a look at part number first:

In [7]:
print(f"part object:                      {cand.part}")
print(f"part text:                        {cand.part.context.get_span()}")
print(f"part sentence object:             {cand.part.context.sentence}")
print(f"part sentence text:               {cand.part.context.sentence.text}")
print(f"check if part is in a table:      {cand.part.context.sentence.is_tabular()}")
print(f"check if part has in visual info: {cand.part.context.sentence.is_visual()}")

part object:                      Part(SpanMention("2N3904", sentence=5396, chars=[24,29], words=[3,3]))
part text:                        2N3904
part sentence object:             Sentence (Doc: 'AUKCS04635-1', Sec: 0, Par: 10, Idx: 10, Text: 'Complementary pair with 2N3904')
part sentence text:               Complementary pair with 2N3904
check if part is in a table:      False
check if part has in visual info: True


Then, we can look at the `attr`, which is the number representing the maximum collector-emitter voltage:

In [8]:
print(f"attr object:                      {cand.attr}")
print(f"attr text:                        {cand.attr.context.get_span()}")
print(f"attr sentence object:             {cand.attr.context.sentence}")
print(f"attr sentence text:               {cand.attr.context.sentence.text}")
print(f"check if attr is in a table:      {cand.attr.context.sentence.is_tabular()}")
print(f"check if attr has in visual info: {cand.attr.context.sentence.is_visual()}")

attr object:                      Attr(SpanMention("150", sentence=13054, chars=[0,2], words=[0,0]))
attr text:                        150
attr sentence object:             Sentence (Doc: 'AUKCS04635-1', Table: 0, Row: 6, Col: 2, Index: 58, Text: '150')
attr sentence text:               150
check if attr is in a table:      True
check if attr has in visual info: True


### Example 1: Write a labeling function to check if the two mentions in a candidate are on the same page.
If they are, label the candidate True; otherwise, label it False.

In [9]:
ABSTAIN = 0
FALSE = 1
TRUE = 2

In [10]:
def LF_same_page(c):
    return TRUE if same_page(c) else FALSE

In [11]:
# Sanity check: the previous labeling function should pass the following test.
true_candidate = train_cands[81]
false_candidate = train_cands[10]

if (LF_same_page(true_candidate) == TRUE and LF_same_page(false_candidate) == FALSE):
    print("You passed!")
else:
    print("Try again.")

Try again.


### Example 2: Write a labeling function based on your insight of the data.

For example, inspecting several documents may reveal that storage temperatures are typically listed inside a table where the row header contains the word "storage". This intuitive pattern can be directly expressed as a labeling function. Similarly, the word "temperature" is an obvious positive signal.


In [12]:
def LF_storage_row(c):
    return TRUE if 'storage' in get_row_ngrams(c.attr) else ABSTAIN

def LF_temperature_row(c):
    return TRUE if 'temperature' in get_row_ngrams(c.attr) else ABSTAIN

### Example 3: Write a labeling function based on alignment information.

In [13]:
def LF_collector_aligned(c):
    return FALSE if overlap(
        ['collector', 'collector-current', 'collector-base', 'collector-emitter'],
        list(get_aligned_ngrams(c.attr))) else ABSTAIN

def LF_current_aligned(c):
    ngrams = get_aligned_ngrams(c.attr)
    return FALSE if overlap(
        ['current', 'dc', 'ic'],
        list(ngrams)) else ABSTAIN
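
The `overlap` helper used above comes from Fonduer's data model utilities. A plausible reading, assumed here rather than taken from the library source, is that it reports whether two collections share at least one element. The stand-in `toy_overlap` below sketches that behavior on hand-written n-grams, so the voting logic can be tried without a database. All names prefixed with `toy_` and the sample n-gram lists are hypothetical.

```python
ABSTAIN, FALSE, TRUE = 0, 1, 2


def toy_overlap(a, b):
    """Hypothetical stand-in: True when the two collections share any element."""
    return not set(a).isdisjoint(b)


def toy_LF_current_aligned(aligned_ngrams):
    """Mirror LF_current_aligned, but receive the aligned n-grams directly."""
    if toy_overlap(["current", "dc", "ic"], aligned_ngrams):
        return FALSE
    return ABSTAIN


# Hand-written n-grams standing in for what an aligned row might contain.
current_row = ["collector", "current", "ic", "200", "ma"]
voltage_row = ["collector-emitter", "voltage", "vceo", "40", "v"]

vote_current = toy_LF_current_aligned(current_row)
vote_voltage = toy_LF_current_aligned(voltage_row)
print(vote_current, vote_voltage)
```

The first row mentions `current` and `ic`, so the function votes `FALSE`; the second row shares no keyword with the list, so it abstains.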

We can then collect all of these labeling functions in a list which we will provide to Fonduer as supervision signals.

In [14]:
LFs = [
    LF_same_page,
    LF_storage_row,
    LF_temperature_row,
    LF_collector_aligned,
    LF_current_aligned
]

### Applying the Labeling Functions

Next, we need to actually run the LFs over all of our training candidates, producing a set of `Labels` and `LabelKeys` (just the names of the LFs) in the database. We'll do this using the `Labeler`. Note that this will delete any existing `Labels` and `LabelKeys` for this candidate set.

View the API provided by the `Labeler` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/supervision.html#fonduer.supervision.Labeler).
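
Conceptually, applying labeling functions yields a matrix with one row per candidate and one column per LF, with the LF names serving as column keys. The sketch below builds such a matrix in plain Python for toy inputs. It is an illustration only and does not reproduce how `Labeler` stores or parallelizes its work; the toy LFs and the dictionary-based candidates are hypothetical.

```python
ABSTAIN, FALSE, TRUE = 0, 1, 2


def LF_toy_has_volt(c):
    """Vote TRUE when the row n-grams contain a volt unit."""
    return TRUE if "v" in c["row"] else ABSTAIN


def LF_toy_has_current(c):
    """Vote FALSE when the row n-grams mention current."""
    return FALSE if "current" in c["row"] else ABSTAIN


toy_lfs = [LF_toy_has_volt, LF_toy_has_current]

# Each toy candidate carries only the n-grams of its (assumed) table row.
toy_cands = [
    {"row": ["collector-emitter", "voltage", "v"]},
    {"row": ["collector", "current", "ma"]},
    {"row": ["storage", "temperature"]},
]

# One row per candidate, one column per labeling function.
label_matrix = [[lf(c) for lf in toy_lfs] for c in toy_cands]
keys = [lf.__name__ for lf in toy_lfs]

print(keys)
for row in label_matrix:
    print(row)
```

Most entries in a real label matrix are abstentions, which is why the matrix returned by the labeler is later converted with `todense()` before being inspected.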

In [15]:
from fonduer.supervision import Labeler

labeler = Labeler(session, [PartAttr])

%time labeler.apply(split=0, lfs=[LFs], train=True)
%time L_train = labeler.get_label_matrices([train_cands])

[INFO] fonduer.supervision.labeler - Clearing Labels (split 0)
[INFO] fonduer.utils.udf - Running UDF...




CPU times: user 7.72 s, sys: 108 ms, total: 7.83 s
Wall time: 9.35 s
CPU times: user 3.93 s, sys: 160 ms, total: 4.09 s
Wall time: 5.2 s


### Labeling Function Metrics

Next, we can view insights provided by Fonduer to better understand the quality and coverage of our labeling functions.

To summarize the resulting label matrix, several metrics are available for evaluating labeling functions:
* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function also emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a conflicting non-zero label for.
* **TP** is the number of True Positive candidates, or true candidates which were correctly labeled as True.
* **FP** is the number of False Positive candidates, or false candidates which were incorrectly labeled as True.
* **FN** is the number of False Negative candidates, or true candidates which were incorrectly labeled as False.
* **TN** is the number of True Negative candidates, or false candidates which were correctly labeled as False.
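
The first three definitions can be sketched directly on a small dense label matrix. The function `lf_stats` and the toy matrix below are hypothetical and written for illustration; they follow the definitions in the list above and are not the implementation used by the analysis library.

```python
ABSTAIN = 0


def lf_stats(label_matrix, j):
    """Return (coverage, overlap, conflict) for column j of a dense label matrix."""
    n = len(label_matrix)
    covered = overlapped = conflicted = 0
    for row in label_matrix:
        vote = row[j]
        if vote == ABSTAIN:
            continue  # this LF did not label the candidate
        covered += 1
        # Non-abstain votes cast by the other labeling functions on this candidate.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlapped += 1
        if any(v != vote for v in others):
            conflicted += 1
    return covered / n, overlapped / n, conflicted / n


# Rows are candidates, columns are labeling functions (0 = abstain).
toy_matrix = [
    [2, 2, 0],
    [2, 1, 0],
    [0, 1, 1],
    [2, 0, 0],
]

stats_lf0 = lf_stats(toy_matrix, 0)
stats_lf2 = lf_stats(toy_matrix, 2)
print(stats_lf0, stats_lf2)
```

By construction, conflict can never exceed overlap, and overlap can never exceed coverage, which is a quick consistency check when reading a summary table.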

In addition, because we have already loaded the gold labels, we can view the empirical accuracy of these labeling functions when compared to our gold labels:

In [16]:
from fonduer.supervision import get_gold_labels
L_gold_dev = get_gold_labels(session, [train_cands], annotator_name='gold')

In [17]:
from metal import analysis

analysis.lf_summary(L_train[0], lf_names=labeler.get_keys(), Y=L_gold_dev[0].todense().reshape(-1,).tolist()[0])

| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| LabelKey (LF_collector_aligned) | 0 | 1 | 0.015002 | 0.015002 | 0.003553 | 38 | 0 | 1.0 |
| LabelKey (LF_current_aligned) | 1 | 1 | 0.16818 | 0.16818 | 0.07501 | 426 | 0 | 1.0 |
| LabelKey (LF_same_page) | 2 | [1, 2] | 0.90683 | 0.240032 | 0.135413 | 1958 | 339 | 0.852416 |
| LabelKey (LF_storage_row) | 3 | 2 | 0.070272 | 0.070272 | 0.058824 | 0 | 178 | 0.0 |
| LabelKey (LF_temperature_row) | 4 | 2 | 0.071852 | 0.071852 | 0.060403 | 0 | 182 | 0.0 |
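
The Coverage and Emp. Acc. columns can be cross-checked from the Correct and Incorrect counts. The short calculation below assumes that every non-abstain vote is counted as either Correct or Incorrect and that all 2533 training candidates have a gold label; under those assumptions, coverage is the number of labeled candidates divided by 2533, and empirical accuracy is Correct divided by the number labeled.

```python
n_candidates = 2533  # number of training candidates reported earlier

# (correct, incorrect) counts copied from the summary above.
counts = {
    "LF_collector_aligned": (38, 0),
    "LF_current_aligned": (426, 0),
    "LF_same_page": (1958, 339),
    "LF_storage_row": (0, 178),
    "LF_temperature_row": (0, 182),
}

stats = {}
for name, (correct, incorrect) in counts.items():
    labeled = correct + incorrect          # candidates this LF voted on
    coverage = labeled / n_candidates      # fraction of all candidates labeled
    accuracy = correct / labeled           # fraction of its votes that match gold
    stats[name] = (coverage, accuracy)
    print(f"{name:22s} coverage={coverage:.6f} emp_acc={accuracy:.6f}")
```

Read this way, `LF_same_page` votes on roughly 91% of candidates and agrees with the gold labels about 85% of the time, while the two row-based LFs never agree with the gold labels on this dataset and are worth revisiting.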
