# Virus-Host Species Relation Extraction
## Notebook 2
### UC Davis Epicenter for Disease Dynamics

In [21]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()



In [22]:
from snorkel.models import candidate_subclass

VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

## Part I: Writing Labeling Functions

Labeling functions encode our heuristics and weak supervision signals to generate (noisy) labels for our training candidates.

In Snorkel, the primary way we provide training signal to the end extraction model is by writing **labeling functions (LFs)**, as opposed to hand-labeling massive training sets.

A labeling function is just a Python function that accepts a `Candidate` and returns `1` to mark the `Candidate` as true, `-1` to mark the `Candidate` as false, and `0` to abstain from labeling the `Candidate`.
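To make the three return values concrete, here is a minimal sketch that does not use Snorkel at all. It assumes a candidate can be stood in for by a plain sentence string; `LF_toy_infects` is a hypothetical name used only for illustration and is not one of the LFs defined in this notebook.

```python
import re

# Hypothetical stand-in: a plain sentence string plays the role of a Candidate.
def LF_toy_infects(sentence):
    """Return 1 (true), -1 (false), or 0 (abstain) for a sentence."""
    if re.search(r'\b(not|no)\b', sentence, re.I):
        return -1  # an explicit negation suggests a false pair
    if re.search(r'\binfect\w*', sentence, re.I):
        return 1   # an "infect" word suggests a true pair
    return 0       # no evidence either way: abstain

print(LF_toy_infects("JEV infects swine"))           # 1
print(LF_toy_infects("JEV was not found in swine"))  # -1
print(LF_toy_infects("JEV and swine were studied"))  # 0
```

The real LFs below follow the same shape, but read their text from the `Candidate` through the `snorkel.lf_helpers` functions.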

In [23]:
import re
from snorkel.lf_helpers import (
    get_left_tokens, 
    get_right_tokens, 
    get_between_tokens,
    get_text_between, 
    get_tagged_text,
    rule_regex_search_tagged_text,
    rule_regex_search_btw_AB,
    rule_regex_search_btw_BA,
    rule_regex_search_before_A,
    rule_regex_search_before_B,
)

In [24]:
# Testing labels
train_cands = session.query(VirusHost).filter(VirusHost.split == 0).order_by(VirusHost.id).all()

In [25]:
# choose a candidate for testing
print(train_cands[250])
sentence = train_cands[250].get_parent()
document = sentence.get_parent()
print(sentence)
print("Candidate LEFT tokens:   \t", list(get_left_tokens(train_cands[250],window=20)))
print("Candidate RIGHT tokens:  \t", list(get_right_tokens(train_cands[250],window=20)))
print("Candidate BETWEEN tokens:\t", get_text_between(train_cands[250]))

VirusHost(Span("b'(TRBV)'", sentence=8347, chars=[794,799], words=[182,184]), Span("b'swine'", sentence=8347, chars=[969,973], words=[229,229]))
Sentence(Document Hubalek-2012-Tick-borne viruses in Europe.pdf,0,b'"REVIEW Tick-borne viruses in Europe Zdenek Hub\xc3\xa1lek & Ivo Rudolf Received: 23 January 2012 /Accepted: 20 March 2012 /Published online: 18 April 2012 # Springer-Verlag 2012 Abstract The aim of this review is to present briefly back- ground information on 27 tick-borne viruses (\xe2\x80\x9ctiboviruses\xe2\x80\x9d) that have been detected in Europe, viz flaviviruses tick- borne encephalitis (TBEV), louping-ill (LIV), Tyuleniy (TYUV), and Meaban (MEAV); orthobunyaviruses Bahig (BAHV) and Matruh (MTRV); phleboviruses Grand Arbaud (GAV), Ponteves (PTVV), Uukuniemi (UUKV), Zaliv Terpeniya (ZTV), and St. Abb\'s Head (SAHV); nairoviruses Soldado (SOLV), Puffin Island (PIV), Avalon (AVAV), Clo Mor (CMV), Crimean-Congo hemorrhagic fever (CCHFV); bunyavirus Bhanja (BHAV); coltiviru

In [26]:
# Text Pattern based labeling functions, which look for certain keywords

# List to parenthetical
def ltp(x):
    return '(' + '|'.join(x) + ')'

# --------------------------------

# Positive LFs:

detect = {'detect', 'detects', 'detected', 'detecting', 'detection', 'detectable'}
detect_l = ['detect', 'detects', 'detected', 'detecting', 'detection', 'detectable']
infect = {'infect', 'infects', 'infected', 'infecting', 'infection'}
isolate = {'isolate', 'isolates', 'isolated', 'isolating', 'isolation'}
other_verbs = {
    'transmit(ted)?', 'found', 'find', 'findings', 'remove(d)?', 'affect(s|ed|ing)?', 'confirm(s|ed|ing)?', 'relat(ed|es|e|ing|ion)?', 'recovered', 'identified',
}
misc = {
    'seropositive', 'seropositivity', 'positive', 'host(s)?', 'prevalen(ce|t)?', 'case(s)?', 'ELISA', 'titer', 'viremia', 'antibod(y|ies)?', 'antigen', 'exposure', 'PCR', 'polymerase chain reaction', 'RNA', 'DNA', 'nucleotide', 'sequence', 'evidence', 'common', 'success', 'successfully', 'extract', 'PFU', '(PFU)', 'plaque-forming unit'
}

causal = ['caus(es|ed|e|ing|ation)?', 'induc(es|ed|e|ing)?', 'associat(ed|ing|es|e|ion)?']

positive = {'detect', 'detects', 'detected', 'detecting', 'detection', 'detectable', 'infect', 'infects', 'infected', 'infecting', 'infection', 'isolate', 'isolates', 'isolated', 'isolating', 'isolation'}
positive_l = ['detect', 'detects', 'detected', 'detecting', 'detection', 'detectable', 'infect', 'infects', 'infected', 'infecting', 'infection', 'isolate', 'isolates', 'isolated', 'isolating', 'isolation']
# NOTE: set.intersection compares these entries to tokens literally;
# only neg_rgx (below) interprets them as regular expressions.
negative = {
    r'negative (antibodies)?', 'negate', r'\W+not\W+', r'\W+no\W+', r'(titer\W+(?:\w+\W+){1,6}?less than)', 'none', 'resist'
}
negative_l = [
    r'negative (antibodies)?', 'negate', r'\W+not\W+', r'\W+no\W+', r'(titer\W+(?:\w+\W+){1,6}?less than)', 'none', 'resist'
]
neg_rgx = r'|'.join(negative_l)



# words like detect 
def LF_detect(c):
    if (len(detect.intersection(get_between_tokens(c))) > 0) and (len(negative.intersection(get_between_tokens(c))) == 0): 
        return 1
    elif (len(detect.intersection(get_left_tokens(c[0], window=20))) > 0) and (len(negative.intersection(get_left_tokens(c[0], window=20))) == 0):
        return 1
    elif (len(detect.intersection(get_left_tokens(c[1], window=20))) > 0) and (len(negative.intersection(get_left_tokens(c[1], window=20))) == 0):
        return 1
    elif (len(detect.intersection(get_right_tokens(c[0], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[0], window=20))) == 0)):
        return 1
    elif (len(detect.intersection(get_right_tokens(c[1], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[1], window=20))) == 0)):
        return 1
    else:
        return 0
    
def LF_detect2(c):
    return 1 if (
        re.search(r'{{A}}.{0,50}' + ltp(detect_l) + '.{0,50}{{B}}', get_tagged_text(c), re.I) 
        and not re.search('{{A}}.{0,50}(not|no).{0,20}' + ltp(detect_l) + '.{0,50}{{B}}', get_tagged_text(c), re.I)
    ) else 0

    
def LF_infect(c):
    if len(infect.intersection(get_between_tokens(c))) > 0: 
        return 1
    elif len(infect.intersection(get_left_tokens(c[0], window=20))) > 0:
        return 1
    elif len(infect.intersection(get_left_tokens(c[1], window=20))) > 0:
        return 1
    elif len(infect.intersection(get_right_tokens(c[0], window=20))) > 0:
        return 1
    elif len(infect.intersection(get_right_tokens(c[1], window=20))) > 0:
        return 1
    else:
        return 0
    
# Words like 'isolated'
def LF_isolate(c):
    if len(isolate.intersection(get_between_tokens(c))) > 0: 
        return 1
    elif len(isolate.intersection(get_left_tokens(c[0], window=20))) > 0:
        return 1
    elif len(isolate.intersection(get_left_tokens(c[1], window=20))) > 0:
        return 1
    elif len(isolate.intersection(get_right_tokens(c[0], window=20))) > 0:
        return 1
    elif len(isolate.intersection(get_right_tokens(c[1], window=20))) > 0:
        return 1
    else:
        return 0

        
def LF_misc(c):
    if (len(misc.intersection(get_between_tokens(c))) > 0) and not (re.search(neg_rgx, get_tagged_text(c), re.I)): 
        return 1
    elif (len(misc.intersection(get_left_tokens(c[0], window=20))) > 0) and not (re.search(neg_rgx, get_tagged_text(c), re.I)): 
        return 1
    elif (len(misc.intersection(get_left_tokens(c[1], window=20))) > 0) and not (re.search(neg_rgx, get_tagged_text(c), re.I)): 
        return 1
    elif (len(misc.intersection(get_right_tokens(c[0], window=20))) > 0) and not (re.search(neg_rgx, get_tagged_text(c), re.I)): 
        return 1
    elif (len(misc.intersection(get_right_tokens(c[1], window=20))) > 0) and not (re.search(neg_rgx, get_tagged_text(c), re.I)): 
        return 1
    else:
        return 0
    
# Words like 'caused'
def LF_v_cause_h(c):
    return 1 if (
        re.search(r'{{A}}.{0,50} ' + ltp(causal) + '.{0,50}{{B}}', get_tagged_text(c), re.I)
        and not re.search('{{A}}.{0,50}(not|no|negative).{0,20}' + ltp(causal) + '.{0,50}{{B}}', get_tagged_text(c), re.I)
    ) else 0

# if candidates are nearby and check for negative words
def LF_v_h(c):
    return 1 if (
        re.search(r'{{A}}.{0,250}{{B}}', get_tagged_text(c), re.I)
        and not re.search(neg_rgx, get_tagged_text(c), re.I)
    ) else 0

def LF_h_v(c):
    return 1 if (
        re.search(r'{{B}}.{0,250}{{A}}', get_tagged_text(c), re.I)
        and not re.search(neg_rgx, get_tagged_text(c), re.I)
    ) else 0

# positive verbs (detect, infect, isolate)
def LF_positive(c):
    if (len(positive.intersection(get_between_tokens(c))) > 0) and (len(negative.intersection(get_between_tokens(c))) == 0): 
        return 1
    elif (len(positive.intersection(get_left_tokens(c[0], window=20))) > 0)  and (len(negative.intersection(get_left_tokens(c[0], window=15))) == 0):
        return 1
    elif (len(positive.intersection(get_left_tokens(c[1], window=20))) > 0) and (len(negative.intersection(get_left_tokens(c[1], window=20))) == 0):
        return 1
    elif (len(positive.intersection(get_right_tokens(c[0], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[0], window=20))) == 0)):
        return 1
    elif (len(positive.intersection(get_right_tokens(c[1], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[1], window=20))) == 0)):
        return 1
    else:
        return 0
    
def LF_positive2(c):
    return 1 if (
        re.search(r'{{A}}.{0,100} ' + ltp(positive_l) + '.{0,100}{{B}}', get_tagged_text(c), re.I)
        and not re.search('{{A}}.{0,100}(not|no|negative).{0,20}' + ltp(positive_l) + '.{0,100}{{B}}', get_tagged_text(c), re.I)
    ) else 0

def LF_other_verbs(c):
    if (len(other_verbs.intersection(get_between_tokens(c))) > 0) and (len(negative.intersection(get_between_tokens(c))) == 0): 
        return 1
    elif (len(other_verbs.intersection(get_left_tokens(c[0], window=20))) > 0)  and (len(negative.intersection(get_left_tokens(c[0], window=20))) == 0):
        return 1
    elif (len(other_verbs.intersection(get_left_tokens(c[1], window=20))) > 0) and (len(negative.intersection(get_left_tokens(c[1], window=20))) == 0):
        return 1
    elif (len(other_verbs.intersection(get_right_tokens(c[0], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[0], window=20))) == 0)):
        return 1
    elif (len(other_verbs.intersection(get_right_tokens(c[1], window=20))) > 0) and ((len(negative.intersection(get_right_tokens(c[1], window=20))) == 0)):
        return 1
    else:
        return 0

    
def LF_percents(c):
    # A percentage such as "12%" or "12.5%" between the virus and host mentions
    return 1 if (
        re.search(r'{{A}}.{0,100}\d+(\.\d+)?%.{0,100}{{B}}', get_tagged_text(c), re.I)
        and not re.search(r'(none|not|\W+no\W+|humidity)', get_text_between(c), re.I)
    ) else 0


# -----------------------------------

# Negative LFs:

# Uncertain pairs
uncertain = ['combin', 'possible', 'unlikely']

def LF_uncertain(c):
    return rule_regex_search_before_A(c, ltp(uncertain) + '.*', -1)

# if candidate pair is too far apart (250-5000 characters apart), mark as negative
def LF_far_v_h(c):
    return rule_regex_search_btw_AB(c, '.{250,5000}', -1)

def LF_far_h_v(c):
    return rule_regex_search_btw_BA(c, '.{250,5000}', -1)

def LF_neg_h(c):
    return -1 if re.search('(none|not|no) .{0,100}{{B}}', get_tagged_text(c), flags=re.I) else 0

def LF_neg_assertions(c):
    if (len(negative.intersection(get_between_tokens(c))) > 0): 
        return -1
    elif (len(negative.intersection(get_left_tokens(c[0], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_left_tokens(c[1], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_right_tokens(c[0], window=20))) > 0):
        return -1
    elif (len(negative.intersection(get_right_tokens(c[1], window=20))) > 0):
        return -1
    else:
        return 0
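One detail of the keyword sets is worth illustrating. `set.intersection` compares strings literally, so an entry written as a regular expression (for example `'transmit(ted)?'`) only matches a token with exactly that spelling. The sketch below uses a hand-written token list instead of a Snorkel candidate, and `regex_hits` is a hypothetical helper, not part of `snorkel.lf_helpers`.

```python
import re

tokens = ['virus', 'was', 'transmitted', 'to', 'swine']
pattern_set = {'transmit(ted)?', 'found'}

# Literal comparison: no token is spelled 'transmit(ted)?', so nothing matches.
literal_hits = pattern_set.intersection(tokens)

# Hypothetical helper: treat each entry as a regular expression instead.
def regex_hits(patterns, tokens):
    compiled = [re.compile(p, re.I) for p in patterns]
    return [t for t in tokens if any(rx.fullmatch(t) for rx in compiled)]

print(literal_hits)                     # set()
print(regex_hits(pattern_set, tokens))  # ['transmitted']
```

Under this reading, plain-word entries are suited to the set-intersection LFs, while pattern-style entries only take effect in the regex-based LFs.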


In [27]:
# Distant Supervision LFs
# Compare candidates with a database of known virus-host pairs (from Virus-Host Database)

import bz2

# Function to remove special characters from text
def strip_special(s):
    return ''.join(c for c in s if ord(c) < 128)

# Read in known pairs and save as set of tuples
with bz2.BZ2File('virushostdb.tar.bz2', 'rb') as f:
    known_pairs = set(
        tuple(strip_special(x.decode('utf-8')).strip().split('\t')) for x in f.readlines()
    )

def LF_distant_supervision(c):
    v, h = c.virus.get_span(), c.host.get_span()
    return 1 if (v,h) in known_pairs else 0
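The lookup above is an exact string match, so differences in case or spacing between the text span and the database entry cause a miss. A possible variant, sketched here with a two-entry toy set, normalizes both sides first. The names `normalize`, `toy_known_pairs` and `lookup` are assumptions made for this illustration.

```python
def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return ' '.join(name.lower().split())

# Toy stand-in for known_pairs, using two entries of the same form.
toy_known_pairs = {
    ('Peanut mottle virus', 'Pisum sativum'),
    ('Bush viper reovirus', 'Atheris squamigera'),
}
normalized_pairs = {(normalize(v), normalize(h)) for v, h in toy_known_pairs}

def lookup(virus, host):
    """Return 1 if the normalized (virus, host) pair is known, else 0 (abstain)."""
    return 1 if (normalize(virus), normalize(host)) in normalized_pairs else 0

print(lookup('peanut mottle virus', 'Pisum  sativum'))  # 1
print(lookup('JEV', 'human'))                           # 0
```

Whether normalization helps depends on how names appear in the corpus; abbreviations such as `JEV` would still need a separate synonym table.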

In [28]:
known_pairs

{('Salmonella phage vB_SnwM_CGG4-1',
  'Salmonella enterica subsp. enterica serovar Newport'),
 ('Pseudomonas phage vB_Pae_PS44', 'Pseudomonas aeruginosa'),
 ('Peanut mottle virus', 'Pisum sativum'),
 ('Leonurus mosaic virus', 'Lamiaceae'),
 ('East African cassava mosaic Malawi virus', 'Manihot esculenta'),
 ('Mycobacterium phage Validus', 'Mycolicibacterium smegmatis'),
 ('Zostera marina amalgavirus 2', 'Zostera marina'),
 ('Bush viper reovirus', 'Atheris squamigera'),
 ('Chaetoceros tenuissimus RNA virus type-II', 'Chaetoceros tenuissimus'),
 ('Burkholderia phage KS10', 'Burkholderia cenocepacia'),
 ('New World begomovirus associated satellite DNA isolate 404N1',
  'Sidastrum micranthum'),
 ('Mycobacterium virus Eureka', 'Mycolicibacterium smegmatis MC2 155'),
 ('Wenzhou hepe-like virus 2', 'root'),
 ('Acidovorax phage ACP17', 'Acidovorax citrulli'),
 ('Stenotrophomonas phage PSH1', 'Stenotrophomonas'),
 ('Escherichia phage PBECO 4', 'Escherichia coli O157'),
 ('Carnation mottle viru

In [29]:
# list of all LFs
LFs = [
     LF_detect, LF_detect2, LF_infect, LF_isolate, LF_positive, LF_positive2, LF_misc, LF_v_cause_h, LF_v_h, LF_h_v, LF_other_verbs, LF_uncertain, LF_far_v_h, LF_far_h_v, LF_neg_h, LF_distant_supervision, LF_neg_assertions
]

In [30]:
# Collect the training candidates that receive at least one non-zero label, for viewing
labeled = [
    c for c in session.query(VirusHost).filter(VirusHost.split == 0).all()
    if any(lf(c) != 0 for lf in LFs)
]
print("Number labeled:", len(labeled))

Number labeled: 3397


In [31]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)


## Part II: Applying Labeling Functions

We run the LFs over all training candidates, producing a set of `Labels` (the LF outputs for each `VirusHost` candidate) and `LabelKeys` (the names of the LFs) in the database.

In [32]:
# set up the label annotator class
from snorkel.annotations import LabelAnnotator
labeler = LabelAnnotator(lfs=LFs)

In [33]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...


100%|█████████████████████████████████████| 3823/3823 [00:18<00:00, 208.59it/s]


Wall time: 18.5 s


<3823x17 sparse matrix of type '<class 'numpy.int32'>'
	with 5733 stored elements in Compressed Sparse Row format>

Note that the returned matrix is a special subclass of the `scipy.sparse.csr_matrix` class, with helper methods such as `get_candidate`, `get_key`, and `lf_stats` that are used below.

In [34]:
# get the candidate (spans and positions) for row 0 of the label matrix
L_train.get_candidate(session, 0) 

VirusHost(Span("b'JEV'", sentence=12925, chars=[172,174], words=[31,31]), Span("b'human'", sentence=12925, chars=[44,48], words=[8,8]))

In [35]:
# get the LabelKey for column 0 (the name of the LF that produced that column)
L_train.get_key(session, 0)

LabelKey (LF_detect)

Viewing statistics about the resulting label matrix:

* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function also emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a *conflicting* non-zero label for.
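The three definitions can be checked by hand on a small dense label matrix. The following is a sketch under the definitions stated above, not Snorkel's own implementation; `lf_summary` and the toy matrix `L` are hypothetical.

```python
# Toy label matrix: rows are candidates, columns are LFs.
L = [
    [ 1,  0, -1],
    [ 1,  1,  0],
    [ 0,  0,  0],
    [ 0, -1, -1],
]

def lf_summary(L):
    """Return one (coverage, overlap, conflict) tuple per LF column."""
    n, m = len(L), len(L[0])
    stats = []
    for j in range(m):
        covered = overlap = conflict = 0
        for row in L:
            if row[j] == 0:
                continue  # this LF abstained on the candidate
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlap += 1   # some other LF also labeled it
            if any(v != row[j] for v in others):
                conflict += 1  # some other LF disagreed
        stats.append((covered / n, overlap / n, conflict / n))
    return stats

for j, s in enumerate(lf_summary(L)):
    print(j, s)
# 0 (0.5, 0.5, 0.25)
# 1 (0.5, 0.5, 0.0)
# 2 (0.5, 0.5, 0.25)
```

All three values are fractions of the full candidate set, which is why conflict can never exceed overlap and overlap can never exceed coverage.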

In [36]:
L_train.lf_stats(session)

| LF | j | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|
| LF_detect | 0 | 0.072979 | 0.072979 | 0.021711 |
| LF_detect2 | 1 | 0.007847 | 0.007847 | 0.0 |
| LF_infect | 2 | 0.057546 | 0.057546 | 0.012032 |
| LF_isolate | 3 | 0.077688 | 0.077688 | 0.026419 |
| LF_positive | 4 | 0.187026 | 0.187026 | 0.051269 |
| LF_positive2 | 5 | 0.046037 | 0.045776 | 0.002354 |
| LF_misc | 6 | 0.118755 | 0.118755 | 0.023019 |
| LF_v_cause_h | 7 | 0.005755 | 0.005493 | 0.000785 |
| LF_v_h | 8 | 0.291133 | 0.150405 | 0.005231 |
| LF_h_v | 9 | 0.182056 | 0.065917 | 0.000262 |


## Part III: Checking Against Gold Labels (Hand Labeled Set)
- Run the labeler on the development set
- Load the external (hand-labeled) gold labels

### Load Gold Labels
Gold labels are a _small_ set of examples (here, a subset of our training set) which we label by hand and use to help us develop and refine labeling functions. Unlike the _test set_, which we do not look at and reserve for final evaluation, the development set can be inspected while writing labeling functions.

In [37]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name = "gold", split=1)
L_gold_dev

<430x1 sparse matrix of type '<class 'numpy.int32'>'
	with 126 stored elements in Compressed Sparse Row format>

In [38]:
%time L_dev = labeler.apply_existing(split=1)

Clearing existing...
Running UDF...


100%|███████████████████████████████████████| 430/430 [00:02<00:00, 163.07it/s]


Wall time: 2.71 s


In [39]:
# Label Matrix Empirical Accuracies

L_dev.lf_stats(session, labels=L_gold_dev.toarray().ravel())

| LF | j | Coverage | Overlaps | Conflicts | TP | FP | FN | TN | Empirical Acc. |
|---|---|---|---|---|---|---|---|---|---|
| LF_detect | 0 | 0.032558 | 0.032558 | 0.0 | 6 | 3 | 0 | 0 | 0.666667 |
| LF_detect2 | 1 | 0.006977 | 0.006977 | 0.0 | 2 | 1 | 0 | 0 | 0.666667 |
| LF_infect | 2 | 0.025581 | 0.025581 | 0.0 | 7 | 0 | 0 | 0 | 1.0 |
| LF_isolate | 3 | 0.002326 | 0.002326 | 0.0 | 0 | 1 | 0 | 0 | 0.0 |
| LF_positive | 4 | 0.060465 | 0.060465 | 0.0 | 13 | 4 | 0 | 0 | 0.764706 |
| LF_positive2 | 5 | 0.023256 | 0.023256 | 0.0 | 4 | 1 | 0 | 0 | 0.8 |
| LF_misc | 6 | 0.148837 | 0.148837 | 0.055814 | 15 | 1 | 0 | 0 | 0.9375 |
| LF_v_cause_h | 7 | 0.004651 | 0.004651 | 0.0 | 2 | 0 | 0 | 0 | 1.0 |
| LF_v_h | 8 | 0.286047 | 0.086047 | 0.0 | 31 | 11 | 0 | 0 | 0.738095 |
| LF_h_v | 9 | 0.262791 | 0.083721 | 0.002326 | 36 | 8 | 0 | 0 | 0.818182 |
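The empirical accuracy in the rows above is consistent with (TP + TN) / (TP + FP + FN + TN), counted only over candidates where the LF did not abstain and a gold label exists. The sketch below reproduces the `LF_detect` row from made-up vectors with the same counts. It assumes gold labels are coded `1`/`-1` with `0` for unlabeled; `lf_confusion` is a hypothetical helper, not a Snorkel function.

```python
def lf_confusion(lf_output, gold):
    """Count TP/FP/FN/TN for one LF against gold labels (0 means no label)."""
    counts = {'TP': 0, 'FP': 0, 'FN': 0, 'TN': 0}
    for pred, truth in zip(lf_output, gold):
        if pred == 0 or truth == 0:
            continue  # skip abstentions and candidates without a gold label
        if pred == 1:
            counts['TP' if truth == 1 else 'FP'] += 1
        else:
            counts['FN' if truth == 1 else 'TN'] += 1
    scored = sum(counts.values())
    acc = (counts['TP'] + counts['TN']) / scored if scored else float('nan')
    return counts, acc

# Made-up vectors matching the LF_detect counts: 6 TP, 3 FP, one abstain.
lf_output = [1] * 9 + [0]
gold = [1] * 6 + [-1] * 3 + [1]

counts, acc = lf_confusion(lf_output, gold)
print(counts)          # {'TP': 6, 'FP': 3, 'FN': 0, 'TN': 0}
print(round(acc, 6))   # 0.666667
```

An LF that only ever returns `1` or `0` can never produce FN or TN counts, which matches the zeros in those columns for the positive LFs listed.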


#### Iterating on Labeling Function Design:
When writing labeling functions, you will want to iterate on the process outlined above several times. Focus on tuning individual LFs based on their empirical accuracy metrics, and on adding new LFs to improve coverage.

In [40]:
### See Notebook Part 3 for Generative Model Training