# Extracting Virus-Host Relation Mentions
### UC Davis Epicenter for Disease Dynamics


- Input files required to work:
    - documents: 'pdfs.tsv'
    - host/species names: 'domestic_names.csv', 'ictv_animals.csv', 'ictv_viruses.csv', 'virus_abbrev.csv', 'virus_abbrev_noparen.csv'

## Part I: Preprocessing the Text Corpus

In [32]:
import numpy as np
import pandas as pd

In [33]:
import os
from pathlib import Path

In [34]:
# Load Snorkel
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

n_docs = 500 if 'CI' in os.environ else 2591

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [35]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence
from snorkel.parser import TSVDocPreprocessor

### First: reading in the documents using a document preprocessor

The PDF documents have been converted to a single .tsv file in which each line holds a document name and the document content, separated by a tab. The document preprocessor reads the documents from this file.
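
As a rough illustration of that layout, the sketch below writes a two-row file in the same name-tab-text shape and reads it back with the standard `csv` module. The file name `example_docs.tsv` and both sample rows are made up for illustration, and this is not meant to reflect how `TSVDocPreprocessor` is implemented internally.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: (document name, document text)
rows = [
    ("doc_001", "A coronavirus was isolated from a dromedary."),
    ("doc_002", "No viral RNA was detected in the samples."),
]

# Write one document per line: name <TAB> text
path = os.path.join(tempfile.mkdtemp(), "example_docs.tsv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read the file back into a {name: text} mapping
with open(path, newline="", encoding="utf-8") as f:
    docs = {name: text for name, text in csv.reader(f, delimiter="\t")}

print(len(docs), "documents read")
```

Document text that itself contains tabs or newlines would need escaping or quoting before it is written in this shape.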

In [36]:
doc_preprocessor = TSVDocPreprocessor('pdfs.tsv', max_docs=n_docs)

### Running a `CorpusParser`

We use spaCy, an NLP preprocessing tool, to split the documents into sentences and tokens.

In [37]:
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)

Clearing existing...
Running UDF...

Wall time: 1min 41s


In [38]:
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 39
Sentences: 14435


### Import dictionaries for entity matching

We create matchers from predefined dictionaries to find virus and host names in the text. The dictionaries cover full names, abbreviations, and acronyms, and are drawn from the ICTV virus classification and the IUCN list of animal species.

In [39]:
# Create a list of animal host names from the first three columns
domestic_names = pd.read_csv('domestic_names.csv')
names1 = domestic_names.iloc[:,0]
names2 = domestic_names.iloc[:,1]
names3 = domestic_names.iloc[:,2]
# Series.append was removed in pandas 2.0; pd.concat is the equivalent
names_list = pd.concat([names1, names2, names3]).tolist()
# Extra host terms added by hand
names_list.append("dromedary")
names_list.append("Peking")

In [40]:
ictv_animals = pd.read_csv('ictv_animals.csv')
#print('Total animal names:', ictv_animals.count().sum()) # total number of animal names in the dict
ictv_series = ictv_animals.stack().reset_index().iloc[:,2]
ictv_list = ictv_series.tolist()

In [41]:
# Abbreviate a two-word (genus + species) name to "G. Species";
# names with any other number of words are returned unchanged
def name(s):
    words = s.split()
    if len(words) == 2:
        genus, species = words
        # first letter of the genus, capitalized, followed by the species
        return genus[0].upper() + '. ' + species.title()
    return s

In [42]:
ictv_list2 = [name(s) for s in ictv_list] # shortened species names list
animals_list = list(set(names_list + ictv_list + ictv_list2))

In [43]:
# Create a list of virus names
ictv_viruses = pd.read_csv('ictv_viruses.csv')
# Create copies of the species names with all digits removed (e.g. a trailing type number)
ictv_viruses['Species2'] = ictv_viruses['Species'].str.replace(r'\d+', '', regex=True)

In [44]:
ictv_v_series = ictv_viruses.stack().reset_index().iloc[:,2].drop_duplicates()
virus_list = ictv_v_series.tolist()

In [45]:
virus_abbrev = pd.read_csv('virus_abbrev.csv', header = None)
virus_abbrev_noparen = pd.read_csv('virus_abbrev_noparen.csv', header = None)

In [46]:
virus_list = virus_list + virus_abbrev.iloc[:,0].tolist() + virus_abbrev_noparen.iloc[:,0].tolist()

In [47]:
# Clean up white space
animals_list = [animal.strip() for animal in animals_list]
virus_list = [virus.strip() for virus in virus_list]

In [48]:
print('Number of virus names to match:', len(virus_list))
print('Number of host names to match:', len(animals_list))

Number of virus names to match: 6392
Number of host names to match: 69877


## Part II: Candidate Extraction

The next step is to extract candidates from the text. A `candidate` in Snorkel is the object we want to make a prediction on. In our case, the candidates are pairs of virus and host species mentions that occur in the same sentence. Our task will be to predict which pairs are correctly described as linked in the text.
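
To make the idea of a pair-valued candidate concrete, here is a minimal sketch in plain Python. The mention strings are hypothetical, and the sketch only shows that every virus mention in a sentence is paired with every host mention; it does not use Snorkel's `Candidate` classes.

```python
from itertools import product

# Hypothetical mentions already found in one sentence
virus_mentions = ["MERS-CoV", "rabies virus"]
host_mentions = ["dromedary", "bat", "dog"]

# Every (virus, host) pairing in the sentence is a candidate
candidates = list(product(virus_mentions, host_mentions))
print(len(candidates))
```

Many of these pairings will not describe a real virus-host link, which is exactly what the downstream model is asked to decide.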

In [49]:
from snorkel.matchers import DictionaryMatch
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.models import candidate_subclass

In [50]:
# Define the candidate schema to extract (virus-host pair). This is a subclass of candidate and is defined using a helper function. The VirusHost mention connects two Spans of text and creates the table in the database backend.
VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

### Writing a basic `CandidateExtractor`

* `CandidateExtractor` extracts **candidate Virus-Host relation mentions** from the corpus.

* We extract `Candidates` by identifying, for each `Sentence`, all pairs of n-grams (up to 10-grams, matching `n_max=10` in the code below) that were tagged by the matchers. (An n-gram is a span of text made up of n tokens; a token is roughly a string of contiguous characters between two spaces.)

<br>

We do this with three objects:

* A `ContextSpace` defines the "space" of all candidates we could potentially consider; in this case we use the `Ngrams` subclass and look at all n-grams up to 10 tokens long.

* A `Matcher` heuristically filters the candidates we use; here, `DictionaryMatch` keeps only n-grams found in the virus and host name lists.

* A `CandidateExtractor` combines these into a single extraction step.
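
The first two objects can be sketched in plain Python as follows. The helper names `iter_ngrams` and `dictionary_match`, the example sentence, and the case-insensitive comparison are all assumptions made for illustration; Snorkel's `Ngrams` and `DictionaryMatch` are not implemented this way line for line.

```python
def iter_ngrams(tokens, n_max):
    """Yield (start, end, text) for every span of 1..n_max tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])

def dictionary_match(tokens, dictionary, n_max=3):
    """Return the n-grams whose lowercased text appears in the dictionary."""
    entries = {d.lower() for d in dictionary}
    return [(s, e, text) for s, e, text in iter_ngrams(tokens, n_max)
            if text.lower() in entries]

# Hypothetical sentence, split on whitespace
tokens = "Rabies virus was detected in a domestic dog".split()
virus_hits = dictionary_match(tokens, ["rabies virus"])
host_hits = dictionary_match(tokens, ["dog", "domestic dog"])
print(virus_hits)
print(host_hits)
```

Note that both "domestic dog" and the shorter "dog" inside it are returned as host matches, so overlapping spans can each produce a candidate.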

In [51]:
# Define the dictionary matchers, define the candidate extractor
ngrams = Ngrams(n_max=10)
virus_matcher = DictionaryMatch(d = virus_list)
animals_matcher = DictionaryMatch(d = animals_list, stemmer = 'porter')
cand_extractor = CandidateExtractor(VirusHost, [ngrams, ngrams], [virus_matcher, animals_matcher], nested_relations = True)

### Split the docs into 3 sets: training, development, and testing sets
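
The split is made per document, by the document's position in the name-ordered list modulo 10, so all sentences from one document land in the same set. The sketch below isolates that rule; the helper name `assign_split` is hypothetical, and the count of 39 is the document total reported earlier in this notebook.

```python
from collections import Counter

def assign_split(doc_index):
    """Map a document's position to a split: 0=train, 1=dev, 2=test."""
    remainder = doc_index % 10
    if remainder == 8:
        return 1
    if remainder == 9:
        return 2
    return 0

# Tally how many of 39 documents fall into each split
counts = Counter(assign_split(i) for i in range(39))
print(counts[0], counts[1], counts[2])
```

For 39 documents this gives 32 training, 4 development, and 3 test documents, roughly an 80/10/10 split.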

In [52]:
from snorkel.models import Document

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)

In [53]:
# Number of sentences per set
print(len(train_sents))
print(len(dev_sents))
print(len(test_sents))

11869
1759
807


In [54]:
%%time
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(VirusHost).filter(VirusHost.split == i).count())

Clearing existing...
Running UDF...

Number of candidates: 1808
Clearing existing...
Running UDF...

Number of candidates: 629
Clearing existing...
Running UDF...

Number of candidates: 218
Wall time: 1min 54s


In [55]:
print("Number of training candidates:", session.query(VirusHost).filter(VirusHost.split == 0).count())
print("Number of development candidates:", session.query(VirusHost).filter(VirusHost.split == 1).count())
print("Number of test candidates:", session.query(VirusHost).filter(VirusHost.split == 2).count())
print("Total candidates extracted:", session.query(VirusHost).count())

Number of training candidates: 1808
Number of development candidates: 629
Number of test candidates: 218
Total candidates extracted: 2655


## Part III: Writing Labeling Functions

Labeling functions encode our heuristics and weak supervision signals to generate (noisy) labels for our training candidates.

In Snorkel, writing **labeling functions (LFs)** is the primary way we provide training signal to the end extraction model, as opposed to hand-labeling massive training sets.

A labeling function is just a Python function that accepts a `Candidate` and returns `1` to mark the `Candidate` as true, `-1` to mark it as false, or `0` to abstain from labeling it.
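
The three return values can be tried out without a database by using a stand-in candidate. In the sketch below, `make_candidate`, `LF_example_keyword`, and `LF_example_negation` are hypothetical names, and the stand-in only mimics the `get_parent().words` access used by the labeling functions in this notebook; it is not a Snorkel `Candidate`.

```python
from types import SimpleNamespace

def make_candidate(sentence):
    """Build a stand-in candidate whose parent exposes a `words` list."""
    parent = SimpleNamespace(words=sentence.split())
    return SimpleNamespace(get_parent=lambda: parent)

def LF_example_keyword(c):
    # Vote true if the cue word appears in the sentence, otherwise abstain
    return 1 if 'isolated' in c.get_parent().words else 0

def LF_example_negation(c):
    # Vote false if the sentence contains an explicit "not", otherwise abstain
    return -1 if 'not' in c.get_parent().words else 0

c1 = make_candidate("The virus was isolated from bats")
c2 = make_candidate("The virus was not found in bats")
print(LF_example_keyword(c1), LF_example_negation(c1))
print(LF_example_keyword(c2), LF_example_negation(c2))
```

Each function is free to abstain on most candidates; the label model later combines the votes that are cast.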

In [56]:
# Labeling functions
import re
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

In [57]:
def LF_related(c):
    return 1 if 'related' in c.get_parent().words else 0

def LF_isolated(c):
    return 1 if 'isolated' in c.get_parent().words else 0

def LF_detected(c):
    return 1 if 'detected' in c.get_parent().words else 0
    

In [58]:
labeled = []
for c in session.query(VirusHost).filter(VirusHost.split == 0).all():
    if LF_related(c) != 0 or LF_isolated(c) != 0 or LF_detected(c) != 0:
        labeled.append(c)
print("Number labeled:", len(labeled))

Number labeled: 250


In [59]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(labeled, session)


In [60]:
# Running the LFs
from snorkel.annotations import LabelAnnotator
LFs = [
    LF_related, LF_isolated, LF_detected
]
labeler = LabelAnnotator(lfs=LFs)

In [61]:
np.random.seed(1701)
%time L_train = labeler.apply(split=0)
L_train

Clearing existing...
Running UDF...

Wall time: 15.7 s


<1808x3 sparse matrix of type '<class 'numpy.int32'>'
	with 280 stored elements in Compressed Sparse Row format>

In [62]:
L_train.get_candidate(session, 0)

VirusHost(Span("b'USA'", sentence=475, chars=[117,119], words=[20,20]), Span("b''", sentence=475, chars=[233,232], words=[41,41]))

In [63]:
L_train.get_key(session, 0)

LabelKey (LF_related)

Getting statistics about the resulting label matrix:

* **Coverage** is the fraction of candidates that the labeling function emits a non-zero label for.
* **Overlap** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function also emits a non-zero label for.
* **Conflict** is the fraction of candidates that the labeling function emits a non-zero label for and that another labeling function emits a *conflicting* non-zero label for.
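
These three definitions can be checked by hand on a small label matrix. The sketch below uses a made-up 4x3 matrix and a hypothetical helper `lf_summary` that follows the definitions above directly; it is not Snorkel's `lf_stats` implementation.

```python
# Hypothetical label matrix: one row per candidate, one column per LF
L = [
    [1,  0,  0],
    [1,  1,  0],
    [0, -1,  1],
    [0,  0,  0],
]

def lf_summary(L, j):
    """Return (coverage, overlap, conflict) for labeling function j."""
    n = len(L)
    # Rows where LF j cast a non-zero vote
    covered = [row for row in L if row[j] != 0]
    # Of those, rows where some other LF also voted
    overlaps = [row for row in covered
                if any(v != 0 for k, v in enumerate(row) if k != j)]
    # Of those, rows where some other LF cast a different non-zero vote
    conflicts = [row for row in covered
                 if any(v != 0 and v != row[j]
                        for k, v in enumerate(row) if k != j)]
    return len(covered) / n, len(overlaps) / n, len(conflicts) / n

for j in range(3):
    print(j, lf_summary(L, j))
```

In this toy matrix only the third candidate produces a conflict, because one labeling function votes `-1` and another votes `1` on it.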

In [65]:
L_train.lf_stats(session)

| Labeling function | j | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|
| LF_related | 0 | 0.042588 | 0.011615 | 0.0 |
| LF_isolated | 1 | 0.07135 | 0.010509 | 0.0 |
| LF_detected | 2 | 0.040929 | 0.007743 | 0.0 |
