# Virus-Host Species Relation Extraction
## Notebook 2
### UC Davis Epicenter for Disease Dynamics

In [10]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# To use a database other than SQLite, uncomment the line below.
# A non-SQLite database is needed for parallel execution, among other things.
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
from snorkel.models import candidate_subclass

VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

In [12]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold')
L_gold_dev

<1849x1 sparse matrix of type '<class 'numpy.int32'>'
	with 10 stored elements in Compressed Sparse Row format>

## Part I: Writing Labeling Functions

Labeling functions encode our heuristics and weak supervision signals to generate (noisy) labels for our training candidates.

In Snorkel, the primary way to provide training signal to the end extraction model is to write **labeling functions (LFs)**, rather than hand-labeling a massive training set.

A labeling function is just a Python function that accepts a `Candidate` and returns `1` to mark the `Candidate` as true, `-1` to mark the `Candidate` as false, and `0` to abstain from labeling the `Candidate`.
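
To make this contract concrete, here is a minimal sketch that runs without Snorkel. `StubSentence` and `StubCandidate` are hypothetical stand-ins that mimic only the `get_parent().words` access used in this notebook, and `LF_example_keyword` and `LF_example_negation` are illustrative heuristics, not functions from this analysis.

```python
# Hypothetical stand-ins: they mimic only the `get_parent().words` access
# pattern used in this notebook and are NOT Snorkel classes.
class StubSentence:
    def __init__(self, text):
        self.words = text.split()


class StubCandidate:
    def __init__(self, text):
        self._parent = StubSentence(text)

    def get_parent(self):
        return self._parent


def LF_example_keyword(c):
    """Vote true (1) if 'isolated' is a token in the sentence, else abstain (0)."""
    return 1 if 'isolated' in c.get_parent().words else 0


def LF_example_negation(c):
    """Vote false (-1) if 'not' is a token in the sentence, else abstain (0)."""
    return -1 if 'not' in c.get_parent().words else 0


c = StubCandidate("The virus was isolated from bats")
votes = [LF_example_keyword(c), LF_example_negation(c)]
print(votes)
```

Each LF votes independently, so a single candidate can collect a mix of `1`, `-1`, and `0` votes from different functions.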

In [13]:
# Labeling functions
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

In [14]:
# Rule-based LFs using keyword matching on the sentence tokens

def LF_related(c):
    return 1 if 'related' in c.get_parent().words else 0

def LF_isolated(c):
    return 1 if 'isolated' in c.get_parent().words else 0

def LF_detected(c):
    return 1 if 'detected' in c.get_parent().words else 0
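
The three functions above differ only in the keyword they test, so the same pattern could be generated by a small factory. This is a sketch under assumptions: `make_keyword_LF` is a hypothetical helper, not part of Snorkel, and the `SimpleNamespace` candidate only imitates `get_parent().words` so that the block runs on its own.

```python
from types import SimpleNamespace


def make_keyword_LF(keyword, label=1):
    """Build an LF that returns `label` when `keyword` is a sentence token, else 0."""
    def lf(c):
        return label if keyword in c.get_parent().words else 0
    # Give each generated function a distinct name so the LFs can be told apart.
    lf.__name__ = 'LF_' + keyword
    return lf


keyword_LFs = [make_keyword_LF(k) for k in ('related', 'isolated', 'detected')]

# Stub candidate (assumption): exposes only get_parent().words.
sentence = SimpleNamespace(words=['Virus', 'was', 'detected', 'in', 'bats'])
candidate = SimpleNamespace(get_parent=lambda: sentence)

votes = {lf.__name__: lf(candidate) for lf in keyword_LFs}
print(votes)
```

Because the test is membership in a token list, matching is exact and case-sensitive: a sentence starting with `Isolated` would not trigger the `isolated` keyword.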
    

In [15]:
labeled = []
for c in session.query(VirusHost).filter(VirusHost.split == 0).all():
    if LF_related(c) != 0 or LF_isolated(c) != 0 or LF_detected(c) != 0:
        labeled.append(c)
print("Number labeled:", len(labeled))

Number labeled: 252


In [16]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)

SentenceNgramViewer(cids=[[[16, 17, 18, 19, 20, 21], [81], [93]], [[9, 10, 11, 12, 13, 14, 15], [89, 90, 91], …

In [17]:
# Running the LFs
from snorkel.annotations import LabelAnnotator
LFs = [
    LF_related, LF_isolated, LF_detected
]
labeler = LabelAnnotator(lfs=LFs)

In [18]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...


100%|████████████████████████████████| 1849/1849 [00:11<00:00, 168.09it/s]


Wall time: 11 s


<1849x3 sparse matrix of type '<class 'numpy.int32'>'
	with 282 stored elements in Compressed Sparse Row format>
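
The label matrix has one row per candidate and one column per LF, with a non-zero entry wherever an LF voted. Its 282 stored entries exceed the 252 labeled candidates counted earlier, which is consistent with some candidates being labeled by more than one LF, and 252 of 1849 candidates is roughly 13.6% coverage. The sketch below computes the same quantities on a small made-up matrix; it uses plain Python lists rather than the sparse matrix returned by `labeler.apply`, and the toy values are illustrative only.

```python
# Toy label matrix (made-up values): rows are candidates, columns are LFs.
# 1 = true, -1 = false, 0 = abstain.
L_toy = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

n_candidates = len(L_toy)
n_lfs = len(L_toy[0])

# Non-zero entries: what a sparse matrix normally reports as stored elements.
n_votes = sum(1 for row in L_toy for v in row if v != 0)

# Candidates that received at least one non-abstain vote.
n_labeled = sum(1 for row in L_toy if any(v != 0 for v in row))

# Fraction of candidates each LF labels (per-LF coverage).
lf_coverage = [
    sum(1 for row in L_toy if row[j] != 0) / n_candidates
    for j in range(n_lfs)
]

print("votes:", n_votes, "labeled:", n_labeled)
print("overall coverage:", n_labeled / n_candidates)
print("per-LF coverage:", lf_coverage)
```

In the toy matrix, the third candidate receives two votes, which is why the vote count is larger than the number of labeled candidates.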