# Virus-Host Species Relation Extraction
## Notebook 2
### UC Davis Epicenter for Disease Dynamics

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

In [2]:
from snorkel.models import candidate_subclass

VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

## Part I: Writing Labeling Functions

Labeling functions encode our heuristics and weak supervision signals to generate (noisy) labels for our training candidates.

In Snorkel, our primary interface through which we provide training signal to the end extraction model we are training is by writing **labeling functions (LFs)** (as opposed to hand-labeling massive training sets). 

A labeling function is just a Python function that accepts a `Candidate` and returns `1` to mark the `Candidate` as true, `-1` to mark the `Candidate` as false, and `0` to abstain from labeling the `Candidate`.

In [3]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, 
    get_right_tokens, 
    get_between_tokens,
    get_text_between, 
    get_tagged_text,
    rule_regex_search_tagged_text,
    rule_regex_search_btw_AB,
    rule_regex_search_btw_BA,
    rule_regex_search_before_A,
    rule_regex_search_before_B,
)

In [4]:
# Testing labels
train_cands = session.query(VirusHost).filter(VirusHost.split == 0).order_by(VirusHost.id).all()

In [5]:
# choose a candidate for testing; can test different functions and regex rules here
print(train_cands[140])
sentence = train_cands[140].get_parent()  # get one exmaple sentence to test
document = sentence.get_parent()
print(sentence)
print("Candidate LEFT tokens:   \t", list(get_left_tokens(train_cands[140],window=15)))
print("Candidate RIGHT tokens:  \t", list(get_right_tokens(train_cands[140],window=15)))
print("Candidate BETWEEN tokens:\t", get_text_between(train_cands[140]))
print("Get tagged text:\t", get_tagged_text(train_cands[140]))

VirusHost(Span("b'WNV'", sentence=8153, chars=[115,117], words=[18,18]), Span("b'White stork'", sentence=8153, chars=[196,206], words=[34,35]))
Sentence(Document Hubalek-2008-Serologic survey of potential ver.pdf,91,b'DETAILED COMPARISON OF RECIPROCAL PRN\xef\xbf\xbdT90 TITERS AGAINST WEST NILE AND USUTU VIRUSES No. Species Age Locality Date WNV titer USUV titer 124 White stork Juvenile WRCa Aug. 23, 2006 20\xe2\x80\x9340 20\xe2\x80\x9340 163 White stork Juvenile WRC Aug. 23, 2006 20 20 126 White stork Juvenile WRC Sept. 13, 2006 320 20 127 White stork Juvenile WRC Sept. 13, 2006 160 20 128 White stork Juvenile WRC Sept. 13, 2006 20 \xef\xbf\xbd20 6551 Mute swan Juvenile Bilew Oct. 11, 2006 20 \xef\xbf\xbd20 6552 Mute swan Juvenile Bilew Oct. 11, 2006 20 20 8 Common tern Juvenile Siemian\xc3\xb3wka July 27, 2006 20 20 9 Common tern Juvenile Siemian\xc3\xb3wka July 27, 2006 20\xe2\x80\x9340 40 11 Black-headed gull Adult Siemian\xc3\xb3wka July 27, 2006 40 40 13 Black-headed gull Juvenil

In [6]:
# Text Pattern based labeling functions, which look for certain keywords

# List to parenthetical
def ltp(x):
    return '(' + '|'.join(x) + ')'


# --------------------------------

# Positive LFs:

detect = {'detect', 'detects', 'detected', 'detecting', 'detection', 'detectable'}
detect_l = ['detect', 'detects', 'detected', 'detecting', 'detection', 'detectable']
infect = {'infect', 'infects', 'infected', 'infecting', 'infection'}
isolate = {'isolate', 'isolates', 'isolated', 'isolating', 'isolation'}
other_verbs = {
    'transmit(ted)?', 'found', 'find(ings)?', 'affect(s|ed|ing)?', 'confirm(s|ed|ing)?', 'relat(ed|es|e|ing|ion)?', 'recovered', 'identified', 'collected'
}
misc = {
    'seropositive', 'seropositivity', 'positive', 'host(s)?', 'prevalen(ce|t)?', 'case(s)?', 'ELISA', 'titer', 'viremia', 'antibod(y|ies)?', 'antigen', 'exposure', 'PCR', 'polymerase chain reaction', 'RNA', 'DNA', 'nucleotide', 'sequence', 'evidence', 'common', 'success(fully)?', 'extract(ed)?', 'PFU', '(PFU)', 'plaque-forming unit', 'suscept', 'probably', 'probable', 'high(er)?'
}

causal = ['caus(es|ed|e|ing|ation)?', 'induc(es|ed|e|ing)?', 'associat(ed|ing|es|e|ion)?']

positive = {'detect', 'detects', 'detected', 'detecting', 'detection', 'detectable', 'infect', 'infects', 'infected', 'infecting', 'infection', 'isolate', 'isolates', 'isolated', 'isolating', 'isolation'}
positive_l = ['detect', 'detects', 'detected', 'detecting', 'detection', 'detectable', 'infect', 'infects', 'infected', 'infecting', 'infection', 'isolate', 'isolates', 'isolated', 'isolating', 'isolation']

# negative words
negative = {
    'negative (antibodies)?', 'seronegative', 'seronegativity', 'negate', 'not', 'Not', '\bno\b', '\bNo\b', '(titer(s)?\W+(?:\w+\W+){1,6}?less than)', 'titers against', 'none', 'resist', 'never'
}
negative_l = [
    'negative (antibodies)?', 'negate', 'not', 'Not', '\bno\b', '\bNo\b', '(titer(s)?\W+(?:\w+\W+){1,6}?less than)', 'titers against', 'none', 'resist', 'never'
]
neg_rgx = r'|'.join(negative_l)

# search nearby words for negatives, returns True if negative word found:
def neg_nearby(c):  
    if (len(negative.intersection(get_between_tokens(c))) > 0):
        return True
    elif (len(negative.intersection(get_left_tokens(c, window=10))) > 0):
        return True
    elif (len(negative.intersection(get_right_tokens(c, window=10))) > 0):
        return True
    else:
        return False


# words like detect 
def LF_detect(c):
    if (len(detect.intersection(get_between_tokens(c))) > 0) and not neg_nearby(c):
        return 1
    elif (len(detect.intersection(get_left_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(detect.intersection(get_left_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(detect.intersection(get_right_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(detect.intersection(get_right_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    else:
        return 0
    
def LF_infect(c):
    if len(infect.intersection(get_between_tokens(c))) > 0  and not neg_nearby(c):
        return 1
    elif len(infect.intersection(get_left_tokens(c[0], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(infect.intersection(get_left_tokens(c[1], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(infect.intersection(get_right_tokens(c[0], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(infect.intersection(get_right_tokens(c[1], window=20))) > 0 and not neg_nearby(c):
        return 1
    else:
        return 0
    
    # Words like 'isolated'
def LF_isolate(c):
    if len(isolate.intersection(get_between_tokens(c))) > 0 and not neg_nearby(c):
        return 1
    elif len(isolate.intersection(get_left_tokens(c[0], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(isolate.intersection(get_left_tokens(c[1], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(isolate.intersection(get_right_tokens(c[0], window=20))) > 0 and not neg_nearby(c):
        return 1
    elif len(isolate.intersection(get_right_tokens(c[1], window=20))) > 0 and not neg_nearby(c):
        return 1
    else:
        return 0

        
def LF_misc(c):
    if (len(misc.intersection(get_between_tokens(c))) > 0) and not neg_nearby(c):
        return 1
    elif (len(misc.intersection(get_left_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(misc.intersection(get_left_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(misc.intersection(get_right_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(misc.intersection(get_right_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    else:
        return 0
    
# terms like 'virus A caused disease in host B'
def LF_v_cause_h(c):
    return 1 if (
        re.search(r'{{A}}.{0,50} ' + ltp(causal) + '.{0,50}{{B}}', get_tagged_text(c), re.I)
        and not re.search('{{A}}.{0,50}(not|no|negative).{0,20}' + ltp(causal) + '.{0,50}{{B}}', get_tagged_text(c), re.I)
    ) else 0

# if candidates are nearby and check for negative words
def LF_v_h(c):
    return 1 if (
        re.search(r'{{A}}.{0,200}{{B}}', get_tagged_text(c), re.I)
        and not re.search(neg_rgx, get_tagged_text(c), re.I)
    ) else 0

def LF_h_v(c):
    return 1 if (
        re.search(r'{{B}}.{0,250}{{A}}', get_tagged_text(c), re.I)
        and not re.search(neg_rgx, get_tagged_text(c), re.I)
    ) else 0

# positive verbs (detect, infect, isolate)
def LF_positive(c):
    if (len(positive.intersection(get_between_tokens(c))) > 0) and not neg_nearby(c):
        return 1
    elif (len(positive.intersection(get_left_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(positive.intersection(get_left_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(positive.intersection(get_right_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(positive.intersection(get_right_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    else:
        return 0
    
def LF_positive2(c):
    return 1 if (
        re.search(r'{{A}}.{0,100} ' + ltp(positive_l) + '.{0,100}{{B}}', get_tagged_text(c), re.I)
        and not re.search('{{A}}.{0,100}(not|no|negative).{0,20}' + ltp(positive_l) + '.{0,100}{{B}}', get_tagged_text(c), re.I)
    ) else 0

def LF_other_verbs(c):
    if (len(other_verbs.intersection(get_between_tokens(c))) > 0) and not neg_nearby(c):
        return 1
    elif (len(other_verbs.intersection(get_left_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(other_verbs.intersection(get_left_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(other_verbs.intersection(get_right_tokens(c[0], window=20))) > 0) and not neg_nearby(c):
        return 1
    elif (len(other_verbs.intersection(get_right_tokens(c[1], window=20))) > 0) and not neg_nearby(c):
        return 1
    else:
        return 0

    
def LF_percents(c):
    return 1 if (
        re.search(r'{{A}}.{0,100}' + '\d+%|\d+(.)?\d+%' + '.{0,100}{{B}}', get_tagged_text(c), re.I) 
        and not re.search('(none|not|\W+no\W+|humidity)', get_text_between(c), re.I)
    ) else 0


# -----------------------------------

# Negative LFs:

# Uncertain pairs
uncertain = ['combin', 'possible', 'unlikely']

def LF_uncertain(c):
    return rule_regex_search_before_A(c, ltp(uncertain) + '.*', -1)

# if candidate pair is too far apart (between 200-5000 characs apart), mark as negative
def LF_far_v_h(c):
    return rule_regex_search_btw_AB(c, '.{250,5000}', -1)

def LF_far_h_v(c):
    return rule_regex_search_btw_BA(c, '.{250,5000}', -1)

def LF_neg_h(c):
    return -1 if re.search(neg_rgx + '.{0,100}{{B}}', get_tagged_text(c), flags=re.I) else 0

def LF_neg_assertions(c):
    if (len(negative.intersection(get_between_tokens(c))) > 0): 
        return -1
    elif (len(negative.intersection(get_left_tokens(c[0], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_left_tokens(c[1], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_right_tokens(c[0], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_right_tokens(c[1], window=20))) > 0):
        return -1
    else:
        return 0


In [7]:
# Distant Supervision LFs
# Compare candidates with a database of known virus-host pairs (from Virus-Host Database)

import bz2

# Function to remove special characters from text
def strip_special(s):
    return ''.join(c for c in s if ord(c) < 128)

# Read in known pairs and save as set of tuples
with bz2.BZ2File('virushostdb.tar.bz2', 'rb') as f:
    known_pairs = set(
        tuple(strip_special(x.decode('utf-8')).strip().split('\t')) for x in f.readlines()
    )

def LF_distant_supervision(c):
    v, h = c.virus.get_span(), c.host.get_span()
    return 1 if (v,h) in known_pairs else 0

In [8]:
# list of all LFs
LFs = [
     LF_detect, LF_infect, LF_isolate, LF_positive, LF_positive2, LF_misc, LF_v_cause_h, LF_v_h, LF_h_v, LF_other_verbs, LF_uncertain, LF_far_v_h, LF_far_h_v, LF_neg_h, LF_distant_supervision, LF_neg_assertions
]

In [9]:
# To label and view LFs for testing
labeled = []
for c in session.query(VirusHost).filter(VirusHost.split == 0).all():
    for function in LFs:
        if function(c) != 0:
            if c not in labeled:
                labeled.append(c)
print("Number labeled:", len(labeled))

Number labeled: 3585


In [10]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)

<IPython.core.display.Javascript object>

SentenceNgramViewer(cids=[[[690, 1530, 1948, 2414, 2473], [2002, 2003], [416]], [[753, 1632], [60, 61], [213]]…

## Part II: Applying Labeling Functions

We run the LFs over all training candidates, producing a set of Labels (Virus and Host) and LabelKeys (the names of the LFs) in the database.

In [11]:
# set up the label annotator class
from snorkel.annotations import LabelAnnotator
labeler = LabelAnnotator(lfs=LFs)

In [12]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...

Wall time: 1min 22s


<3631x16 sparse matrix of type '<class 'numpy.int32'>'
	with 7396 stored elements in Compressed Sparse Row format>

Note that the returned matrix is a special subclass of the `scipy.sparse.csr_matrix` class

In [13]:
# get the candidate names and positions of any candidate in the set
L_train.get_candidate(session, 0) 

VirusHost(Span("b'FBS'", sentence=9371, chars=[114,116], words=[25,25]), Span("b'mice'", sentence=9371, chars=[25,28], words=[6,6]))

In [14]:
# get the LabelKey (the name of the LF used to identify the candidate)
L_train.get_key(session, 0)

LabelKey (LF_detect)

Viewing statistics about the resulting label matrix:

* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction candidates that the labeling function emits a non-zero label for and that another labeling function emits a non-zero label for.
* **Conflict** is the fraction candidates that the labeling function emits a non-zero label for and that another labeling function emits a *conflicting* non-zero label for.

In [15]:
L_train.lf_stats(session)

Unnamed: 0,j,Coverage,Overlaps,Conflicts
LF_detect,0,0.061966,0.061966,0.017075
LF_infect,1,0.05453,0.05453,0.016524
LF_isolate,2,0.070504,0.070504,0.024511
LF_positive,3,0.165795,0.165795,0.047645
LF_positive2,4,0.046268,0.046268,0.008813
LF_misc,5,0.254751,0.247315,0.155605
LF_v_cause_h,6,0.006059,0.006059,0.001102
LF_v_h,7,0.259157,0.138529,0.003305
LF_h_v,8,0.182319,0.066648,0.0
LF_other_verbs,9,0.062793,0.062242,0.022308


## Part III: Checking Against Gold Labels (Hand Labeled Set)
- Run the labeler on the development set
- Load in some external labels:

### Load Gold Labels
Gold labels are a _small_ set of examples (here, a subset of our training set) which we label by hand and use to help us develop and refine labeling functions. Unlike the _test set_, which we do not look at and use for final evaluation, we can inspect the development set while writing labeling functions.

In [16]:
from util_virushost import load_external_labels

%time missed = load_external_labels(session, VirusHost, annotator_name = 'gold', split=1)

AnnotatorLabels created: 0
Wall time: 3.19 s


In [17]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name = "gold", split=1)
L_gold_dev

<175x1 sparse matrix of type '<class 'numpy.int32'>'
	with 114 stored elements in Compressed Sparse Row format>

In [18]:
%time L_dev = labeler.apply_existing(split=1)

Clearing existing...
Running UDF...

Wall time: 3.75 s


In [19]:
# Label Matrix Empirical Accuracies

L_dev.lf_stats(session, labels=L_gold_dev.toarray().ravel())

  ac = (tp+tn) / (tp+tn+fp+fn)


Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.
LF_detect,0,0.08,0.08,0.0,8,0,0,0,1.0
LF_infect,1,0.057143,0.057143,0.005714,10,0,0,0,1.0
LF_isolate,2,0.005714,0.005714,0.0,0,0,0,0,
LF_positive,3,0.142857,0.142857,0.005714,18,0,0,0,1.0
LF_positive2,4,0.051429,0.051429,0.0,5,0,0,0,1.0
LF_misc,5,0.234286,0.234286,0.085714,30,0,0,0,1.0
LF_v_cause_h,6,0.011429,0.011429,0.0,2,0,0,0,1.0
LF_v_h,7,0.365714,0.154286,0.0,45,0,0,0,1.0
LF_h_v,8,0.274286,0.171429,0.0,43,0,0,0,1.0
LF_other_verbs,9,0.262857,0.24,0.12,12,0,0,0,1.0


In [20]:
print('Number of Labeling Functions used: ', len(LFs))

Number of Labeling Functions used:  16


#### Iterating on Labeling Function Design:
When writing labeling functions, you will want to iterate on the process outlined above several times. You should focus on tuning individual LFs, based on emprical accuracy metrics, and adding new LFs to improve coverage.

In [21]:
### See Notebook Part 3 for Generative Model Training