# Tables in Snorkel: Extracting Attributes from Spec Sheets

## Part III: Loading Evaluation Labels

In [1]:
%load_ext autoreload
%autoreload 2

from snorkel import SnorkelSession
session = SnorkelSession()



In [2]:
from snorkel.models import candidate_subclass

Part_Temp = candidate_subclass('Part_Temp', ['part','temp'])

## Loading Labels for the Training `CandidateSet`

We now provide gold labels for some of our candidates. The learning algorithm does not need these labels, but they let us assess the accuracy of our labeling functions and of the overall system in later notebooks.

Our gold data is at the _entity_ level rather than the _mention_ level. For example, we may know that (BC548, -55) is a true relation, but not whether the particular mentions of BC548 and -55 in a document are the ones that would lead a human to conclude that the relation holds. The mapping from gold entities to candidates is therefore imperfect, but sufficient for our needs. We give a candidate a label of `True` if it matches an entity in the gold set of relations.
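The entity-level matching described above can be sketched in plain Python. This is only an illustration, not the code in `hardware_utils`: the functions `normalize` and `label_candidates`, the tuple representation of a candidate, and the case-insensitive comparison are all assumptions made for the example.

```python
# Hypothetical sketch of entity-level gold matching; none of these names
# come from Snorkel or hardware_utils.

def normalize(part, temp):
    """Canonicalize a (part, temp) pair so surface variations compare equal."""
    return (part.strip().upper(), str(temp).strip())


def label_candidates(candidates, gold_entities):
    """Return the candidates whose (part, temp) pair appears in the gold set.

    candidates: iterable of (doc_name, part_text, temp_text) mention tuples.
    gold_entities: iterable of (part, temp) pairs known to be true relations.
    """
    gold = set(normalize(p, t) for p, t in gold_entities)
    return [c for c in candidates if normalize(c[1], c[2]) in gold]


# Illustrative data: (BC548, -55) is the example relation from the text;
# the other values are made up.
gold_entities = [("BC548", "-55")]
candidates = [
    ("doc1", "BC548", "-55"),  # matches the gold entity
    ("doc1", "bc548", "-55"),  # second mention of the same entity, also positive
    ("doc1", "BC548", "150"),  # not in the gold set, so it gets no label
]
positives = label_candidates(candidates, gold_entities)
```

In this sketch both mentions of (BC548, -55) receive a positive label, even though a human annotator might accept only one of them as evidence for the relation. That is the imperfection of mapping entity-level gold data onto mention-level candidates.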

First, we load the `CandidateSet`.

In [3]:
from snorkel.models import CandidateSet
candidates = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Training Candidates').one()
print "%s contains %d Candidates" % (candidates, len(candidates))

Candidate Set (Hardware Training Candidates) contains 6571 Candidates


Next, we define a loader object. We give it the session, the type of `Candidates` being annotated, and the names we would like to give to the gold `CandidateSet` and `Labels`. We also set `expand_candidate_set=True`, which tells the loader to expand the gold `CandidateSet` (i.e., we are not simply adding annotations to a `CandidateSet` that is already defined).

In [4]:
from snorkel.loaders import ExternalAnnotationsLoader
loader = ExternalAnnotationsLoader(session, Part_Temp, 
                                   'Hardware Training Candidates -- Gold',
                                   'Hardware Training Labels -- Gold',
                                   expand_candidate_set=True)

We now load the labels.

In [5]:
import os
from hardware_utils import load_hardware_labels

# Gold annotations CSV, located under $SNORKELHOME
filename = os.path.join(os.environ['SNORKELHOME'],
                        'tutorials/tables/data/hardware/hardware_gold.csv')
%time load_hardware_labels(loader, candidates, filename, ['part','temp'], gold_attrib='stg_temp_min')


CPU times: user 4min 41s, sys: 8.19 s, total: 4min 49s
Wall time: 4min 59s
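The body of `load_hardware_labels` is not shown in this notebook. As a rough sketch of what selecting one gold attribute from a CSV might look like, the example below assumes that each row has the form (doc, part, attribute, value). That row layout, the function `read_gold_entities`, and the sample rows are assumptions for illustration, not a description of `hardware_gold.csv` or of the actual helper.

```python
import csv
import io


def read_gold_entities(csv_file, attrib):
    """Collect (part, value) pairs for one attribute from a gold CSV.

    Assumes each row is (doc, part, attribute, value). This layout is an
    assumption made for illustration, not a documented format.
    """
    gold = set()
    for row in csv.reader(csv_file):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        doc, part, attr, val = row
        if attr == attrib:
            gold.add((part.upper(), val))
    return gold


# Made-up sample rows standing in for the real gold file.
sample = io.StringIO(
    u"doc1,BC548,stg_temp_min,-55\n"
    u"doc1,BC548,stg_temp_max,150\n"
)
gold = read_gold_entities(sample, "stg_temp_min")
```

Only the `stg_temp_min` row survives the filter here, which mirrors the role the `gold_attrib='stg_temp_min'` argument appears to play in the call above.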


In [6]:
from snorkel.models import Label

train = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Training Candidates').one()
train_gold = session.query(CandidateSet).filter(
    CandidateSet.name == 'Hardware Training Candidates -- Gold').one()
print "%d/%d Candidates have positive Labels" % (len(train_gold), len(train))
print "%d Labels loaded" % session.query(Label).filter(
    Label.key == loader.annotation_key).count()

5507/6571 Candidates have positive Labels
5507 Labels loaded


## Repeat for the Development `CandidateSet`

In [7]:
for set_name in ['Development']:
    candidate_set_name = 'Hardware %s Candidates' % set_name
    candidates = session.query(CandidateSet).filter(
        CandidateSet.name == candidate_set_name).one()
    loader = ExternalAnnotationsLoader(session, Part_Temp, 
                                       'Hardware %s Candidates -- Gold' % set_name,
                                       'Hardware %s Labels -- Gold' % set_name,
                                       expand_candidate_set=True)
    %time load_hardware_labels(loader, candidates, filename, ['part','temp'], gold_attrib='stg_temp_min')
    candidates_gold = session.query(CandidateSet).filter(
        CandidateSet.name == candidate_set_name + ' -- Gold').one()
    print "%d/%d Candidates in %s have positive Labels" % (
        len(candidates_gold), len(candidates), candidates)
    print "%d Labels loaded" % session.query(Label).filter(
        Label.key == loader.annotation_key).count()


CPU times: user 1.23 s, sys: 84 ms, total: 1.31 s
Wall time: 1.36 s
57/57 Candidates in Candidate Set (Hardware Development Candidates) have positive Labels
57 Labels loaded


Next, in Part IV, we will automatically generate features for the `Candidates`.