# Virus-Host Species Relation Extraction
## Notebook 1
### UC Davis Epicenter for Disease Dynamics


- Input files required:
    - documents: 'pdfs_big.tsv' (88 papers; the earlier 'pdfs.tsv' held 39 papers)
    - virus/host names: 'domestic_names.csv', 'ictv_animals.csv', 'ictv_viruses.csv', 'virus_abbrev.csv'

## Part I: Preprocessing the Text Corpus

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
import os
from pathlib import Path

In [3]:
# Load Snorkel
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

n_docs = 500

In [4]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence
from snorkel.parser import TSVDocPreprocessor

### Reading in the documents using a document preprocessor

The original papers were stored as PDF documents. They have been combined and converted into a single .tsv file with one document per row: the document name, a tab, then the document content. The document preprocessor reads the documents from this file.
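As a rough illustration of the name-tab-content layout, the sketch below writes two made-up documents to a temporary file and reads them back. The helper `to_tsv_line`, the document names, and the texts are hypothetical and not part of the original pipeline. It assumes that stray tabs and newlines inside a document must be collapsed so that each row contains exactly one separator.

```python
import re
import tempfile
from pathlib import Path


def to_tsv_line(doc_name, doc_text):
    """Return one 'name<TAB>content' row (hypothetical helper).

    Whitespace runs inside the text (including tabs and newlines) are
    collapsed to single spaces so the only tab on the row is the separator.
    """
    clean_name = re.sub(r"\s+", "_", doc_name.strip())
    clean_text = re.sub(r"\s+", " ", doc_text).strip()
    return f"{clean_name}\t{clean_text}"


# Made-up documents, for illustration only.
docs = {
    "paper_001": "Rabies virus was isolated\nfrom a domestic dog.",
    "paper_002": "Avian influenza\twas detected in mallards.",
}

tsv_path = Path(tempfile.mkdtemp()) / "example_docs.tsv"
with open(tsv_path, "w", encoding="utf-8") as f:
    for doc_name, doc_text in docs.items():
        f.write(to_tsv_line(doc_name, doc_text) + "\n")

# Read the file back: every row should split into exactly two fields.
with open(tsv_path, encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
```

Collapsing whitespace before writing is one simple way to keep each document on a single row; other escaping schemes would work as well.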

In [5]:
#doc_preprocessor = TSVDocPreprocessor('pdfs.tsv', max_docs=n_docs) # old file (39 papers)
doc_preprocessor = TSVDocPreprocessor('pdfs_big.tsv', max_docs=n_docs) # new files (88 papers)

### Running a `CorpusParser`

We use spaCy, an NLP library, to split the documents into sentences and tokens.

In [6]:
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)

Clearing existing...
Running UDF...


 18%|███████▏                                 | 88/500 [00:23<01:46,  3.87it/s]


Wall time: 29.8 s


In [7]:
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 88
Sentences: 19011


### Import dictionaries for entity matching

We create matcher functions from pre-defined dictionaries to match virus and host names in the text. The dictionaries contain full names, abbreviations, and acronyms. The names come from the ICTV virus classification and the IUCN list of animal species.

In [8]:
# Create a list of animal host names 
domestic_names = pd.read_csv('domestic_names.csv')
names_list = domestic_names.iloc[:,0].tolist()

In [9]:
ictv_animals = pd.read_csv('ictv_animals.csv')
#print('Total animal names:', ictv_animals.count().sum()) # total number of animal names in the dictionary
ictv_series = ictv_animals.stack().reset_index().iloc[:,2]
ictv_list = ictv_series.tolist()

In [10]:
# Abbreviate a two-word (genus + species) name to "G. Species",
# e.g. "canis lupus" -> "C. Lupus". Any other name is returned unchanged.
def name(s):
    words = s.split()
    if len(words) == 2:
        genus, species = words
        # capital first letter of the genus, then the title-cased species name
        return genus[0].upper() + '. ' + species.title()
    return s

In [11]:
ictv_list2 = [name(s) for s in ictv_list] # shortened species names list
animals_list = list(set(names_list + ictv_list + ictv_list2))

In [12]:
# Create a list of virus names
ictv_viruses = pd.read_csv('ictv_viruses.csv')
# create copies of the virus names with any digits removed (e.g. a trailing strain number)
ictv_viruses['Species2'] = ictv_viruses['Species'].str.replace(r'\d+', '', regex=True)

In [13]:
ictv_v_series = ictv_viruses.stack().reset_index().iloc[:,2].drop_duplicates()
virus_list = ictv_v_series.tolist()

In [14]:
virus_abbrev = pd.read_csv('virus_abbrev.csv', header = None)
virus_list = virus_list + virus_abbrev.iloc[:,0].tolist() 

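# Remove terms we don't want to match (case-insensitive comparison).
# Build a new list instead of calling remove() inside the loop:
# removing items from a list while iterating over it skips elements.
dont_want2 = {'bat', 'langur', 'mcp', 'con', 'spf', '(spf)', 'his', 'pfu', '(pfu)', '(nsp)', 'mal', 'ifa', '(ifa)', 'wrc', '(wrc)', 'fitc', '(fitc)'}
virus_list = [v for v in virus_list if v.lower() not in dont_want2]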

In [15]:
# Clean up white space and remove any empty strings
animals_list = [animal.strip() for animal in animals_list]
animals_list = list(filter(None, animals_list))
virus_list = [virus.strip() for virus in virus_list]
virus_list = list(filter(None, virus_list))

In [16]:
# search the lists for any unwanted terms:
r = re.compile("mal", flags=re.IGNORECASE)
animals_list2 = []
for a in animals_list:
    if len(a) < 10:       
        animals_list2.append(a)
new_list = list(filter(r.match, animals_list2))
print(new_list)

['mallard', 'Malbrouck', 'Maleo', 'Malia', 'Mala', 'Malvasía', 'Mallard', 'Mal']


In [17]:
# remove terms we don't want to match
animals_list.remove('Mal')
animals_list.remove('Ou')
animals_list.remove('Marta')

In [18]:
print('Virus terms to match:', len(virus_list))
print('Host terms to match:', len(animals_list))

Virus terms to match: 8473
Host terms to match: 69881


## Part II: Candidate Extraction

The next step is to extract candidates from the text. A `Candidate` in Snorkel is the object we want to make a prediction on. In our case, the candidates are pairs of virus and host species mentions. Our task will be to predict which pairs are correctly described as linked in the text.

In [19]:
from snorkel.matchers import DictionaryMatch
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.models import candidate_subclass

In [20]:
# Define the candidate schema to extract (virus-host pair).
# VirusHost is a subclass of Candidate, defined using a helper function.
# It connects two Spans of text and creates the table in the database backend.
VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

### Writing a basic `CandidateExtractor`

* `CandidateExtractor` extracts **candidate virus-host relation mentions** from the corpus.

* We will extract `Candidates` by identifying, for each `Sentence`, all pairs of n-grams (up to 10-grams) that were tagged as a virus and a host. (An n-gram is a span of text made up of n consecutive tokens, so a 10-gram has 10 tokens. A token is roughly a word or punctuation mark, as produced by the parser.)

<br>

We do this with three objects:

* A `ContextSpace` defines the "space" of all candidates we even potentially consider; in this case we use the `Ngrams` subclass and look for all n-grams up to 10 tokens long.

* A `Matcher` heuristically filters this space down to the spans we keep; here, `DictionaryMatch` keeps the n-grams found in the virus and host name lists.

* A `CandidateExtractor` combines this all together
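To make the three roles concrete, here is a minimal pure-Python sketch of the same idea. It does not use Snorkel and is not how Snorkel is implemented. The function names (`ngrams`, `match_spans`, `extract_pairs`), the example sentence, and the term lists are all hypothetical. It assumes whitespace tokenization and case-insensitive dictionary lookup, both simplifications of what the parser and matchers do.

```python
from itertools import product


def ngrams(tokens, n_max):
    """Yield (start, end, text) for every span of 1..n_max consecutive tokens."""
    for start in range(len(tokens)):
        last_end = min(start + n_max, len(tokens))
        for end in range(start + 1, last_end + 1):
            yield start, end, " ".join(tokens[start:end])


def match_spans(tokens, dictionary, n_max):
    """Keep only the n-grams whose text appears in the dictionary (ignoring case)."""
    terms = {term.lower() for term in dictionary}
    return [span for span in ngrams(tokens, n_max) if span[2].lower() in terms]


def extract_pairs(sentence, virus_terms, host_terms, n_max=10):
    """Return every (virus mention, host mention) pair found in one sentence."""
    tokens = sentence.split()  # simplification: split on whitespace only
    viruses = match_spans(tokens, virus_terms, n_max)
    hosts = match_spans(tokens, host_terms, n_max)
    # Pair every virus span with every host span, skipping identical spans.
    return [(v[2], h[2]) for v, h in product(viruses, hosts) if v[:2] != h[:2]]


# Made-up sentence and term lists, for illustration only.
sentence = "Rabies virus was detected in a domestic dog and a red fox"
pairs = extract_pairs(
    sentence,
    virus_terms=["rabies virus"],
    host_terms=["domestic dog", "dog", "red fox"],
)
print(pairs)
```

Note that overlapping host matches ("domestic dog" and "dog") each produce their own pair in this sketch, which is one reason the number of candidates can be much larger than the number of true relations.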

In [21]:
# Define the dictionary matchers, define the candidate extractor
ngrams = Ngrams(n_max=10)
virus_matcher = DictionaryMatch(d = virus_list)
animals_matcher = DictionaryMatch(d = animals_list)
cand_extractor = CandidateExtractor(VirusHost, [ngrams, ngrams], [virus_matcher, animals_matcher], nested_relations = True)

### Split the docs into 3 sets: training, development, and testing sets

In [22]:
from snorkel.models import Document

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)

In [23]:
# Number of sentences per set
print(len(train_sents))
print(len(dev_sents))
print(len(test_sents))

15388
1915
1708
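The split is made per document, not per sentence: documents are sorted by name, and the position of each document modulo 10 decides its set. The sketch below reproduces only that rule. The function name `assign_split` is hypothetical, and the document count of 88 is taken from the parsing output earlier in this notebook.

```python
from collections import Counter


def assign_split(doc_index):
    """Map a document's position in the sorted list to a split name."""
    remainder = doc_index % 10
    if remainder == 8:
        return "dev"
    if remainder == 9:
        return "test"
    return "train"


n_documents = 88  # number of parsed documents reported above
split_counts = Counter(assign_split(i) for i in range(n_documents))
print(split_counts)
```

For 88 documents this rule gives 72 training, 8 development, and 8 test documents, roughly an 80/10/10 split. Because whole documents are assigned to a set, the sentence counts per set depend on document length and need not follow the same proportions.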


In [24]:
%%time
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(VirusHost).filter(VirusHost.split == i).count())

Clearing existing...
Running UDF...


100%|███████████████████████████████████| 15388/15388 [00:32<00:00, 469.15it/s]


Number of candidates: 3808
Clearing existing...
Running UDF...


100%|█████████████████████████████████████| 1915/1915 [00:04<00:00, 478.74it/s]


Number of candidates: 430
Clearing existing...
Running UDF...


100%|█████████████████████████████████████| 1708/1708 [00:04<00:00, 415.07it/s]


Number of candidates: 542
Wall time: 41 s


In [25]:
print("Number of training candidates:", session.query(VirusHost).filter(VirusHost.split == 0).count())
print("Number of development candidates:", session.query(VirusHost).filter(VirusHost.split == 1).count())
print("Number of test candidates:", session.query(VirusHost).filter(VirusHost.split == 2).count())
print("Total candidates extracted:", session.query(VirusHost).count())

Number of training candidates: 3808
Number of development candidates: 430
Number of test candidates: 542
Total candidates extracted: 4780


In [26]:
cand_extracted = []
for c in session.query(VirusHost).filter(VirusHost.split == 1).all():
    cand_extracted.append(c)
print("Development set candidates extracted:", len(cand_extracted))

Development set candidates extracted: 430


In [27]:
# viewing and hand labeling 100 candidates (indices 150-249) of the development set

from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(cand_extracted[150:250], session, height = 350)


In [28]:
# SentenceNgramViewer can be used to hand-label data for the gold label set:
# export the SQLite table to CSV format and make sure util_virushost.py points to the file location.

## Part III: Import gold labels (hand labels) to check performance

The hand-labeled set is used to evaluate the quality of the model. Hand labels can be created manually in the SentenceNgramViewer cell by clicking the checkmark or the x to mark a candidate as positive or negative.

In [29]:
from util_virushost import load_external_labels

%time missed = load_external_labels(session, VirusHost, annotator_name = 'gold', split=1)

AnnotatorLabels created: 126
Wall time: 2.77 s


In [30]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name = 'gold', split=1)
L_gold_dev

<430x1 sparse matrix of type '<class 'numpy.int32'>'
	with 126 stored elements in Compressed Sparse Row format>
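The matrix above has 126 stored entries for 430 development candidates, so most candidates carry no hand label. As a hedged sketch of how such gold labels might later be used to score predictions, the example below computes precision, recall, and F1 while skipping unlabeled candidates. It assumes the label convention 1 = positive, -1 = negative, 0 = unlabeled. The function `score_against_gold` and the label values are made up for illustration and are not part of Snorkel or of this project's code.

```python
def score_against_gold(predicted, gold):
    """Compare predictions with gold labels, ignoring unlabeled (0) candidates.

    Assumed convention: 1 = positive, -1 = negative, 0 = no gold label.
    """
    tp = fp = fn = 0
    for pred, true in zip(predicted, gold):
        if true == 0:
            continue  # no hand label for this candidate
        if pred == 1 and true == 1:
            tp += 1
        elif pred == 1 and true == -1:
            fp += 1
        elif pred == -1 and true == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only.
gold_labels = [1, -1, 0, 1, -1, 1]
predictions = [1, 1, 1, -1, -1, 1]
precision, recall, f1 = score_against_gold(predictions, gold_labels)
```

In this toy example two of the three predicted positives with a gold label are correct, and two of the three gold positives are recovered, so precision and recall are both 2/3.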

### Next steps: Developing Labeling Functions in Notebook 2