# Virus-Host Species Relation Extraction
## Notebook 1
### UC Davis Epicenter for Disease Dynamics


- Input files required to run this notebook:
    - documents: 'pdfs.tsv'
    - host/species names: 'domestic_names.csv', 'ictv_animals.csv', 'ictv_viruses.csv', 'virus_abbrev.csv'

## Part I: Preprocessing the Text Corpus

In [74]:
import numpy as np
import pandas as pd

In [75]:
import os
from pathlib import Path

In [76]:
# Load Snorkel
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

n_docs = 500

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [77]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence
from snorkel.parser import TSVDocPreprocessor

### Reading in the documents using a document preprocessor

The PDF documents have been converted to a single .tsv file in which each line holds a document name and the document content, separated by a tab. The document preprocessor reads the documents from this file.
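
As a rough illustration of that layout, the sketch below writes two made-up documents to a temporary file and reads them back as (name, content) pairs using only the standard library. The file contents and the `read_tsv_docs` helper are hypothetical; `TSVDocPreprocessor` does its own parsing.

```python
import csv
import os
import tempfile

def read_tsv_docs(path, max_docs=None):
    """Return (doc_name, doc_text) pairs from a two-column, tab-separated file."""
    docs = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            if len(row) != 2:
                continue  # skip lines that do not have exactly two fields
            docs.append((row[0], row[1]))
            if max_docs is not None and len(docs) >= max_docs:
                break
    return docs

# Two made-up documents, one per line: name <TAB> content
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write("doc_001\tInfluenza A virus was isolated from domestic ducks.\n")
    tmp.write("doc_002\tMERS coronavirus circulates in dromedary camels.\n")
    tmp_path = tmp.name

example_docs = read_tsv_docs(tmp_path, max_docs=1)
os.remove(tmp_path)
print(example_docs)
```

The `max_docs` argument in this sketch plays the same role as the `max_docs=n_docs` cap passed to the preprocessor.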

In [78]:
doc_preprocessor = TSVDocPreprocessor('pdfs.tsv', max_docs=n_docs)

### Running a `CorpusParser`

We use spaCy, an NLP library, to split the documents into sentences and tokens.
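
To make "sentences and tokens" concrete, here is a deliberately naive splitter based on regular expressions. It is not spaCy's algorithm and will mis-split abbreviations such as "et al."; the function names and the sample text are made up for illustration.

```python
import re

def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace (a crude stand-in for a real splitter)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_tokens(sentence):
    """Separate runs of word characters from individual punctuation marks."""
    return re.findall(r'\w+|[^\w\s]', sentence)

text = "Rabies virus infects bats. It also infects dogs!"
for sent in naive_sentences(text):
    print(naive_tokens(sent))
```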

In [79]:
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)

Clearing existing...
Running UDF...


  8%|██▊                                 | 39/500 [01:05<11:31,  1.50s/it]


Wall time: 1min 13s


In [80]:
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 39
Sentences: 14435


### Import dictionaries for entity matching

We create matcher functions from predefined dictionaries to match virus and host names in the text. The dictionaries cover full names, abbreviations, and acronyms, and are drawn from the ICTV virus classification and the IUCN list of animal species.
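
The idea of dictionary matching can be sketched with plain Python sets. This is a minimal illustration that assumes case-insensitive exact matching; the entries and the `build_lookup` and `is_match` helpers are made up and are not part of Snorkel.

```python
def build_lookup(names):
    """Normalize names (trim, lowercase) into a set for fast membership tests."""
    return {n.strip().lower() for n in names if n and n.strip()}

def is_match(phrase, lookup):
    """Return True if the normalized phrase is one of the dictionary entries."""
    return phrase.strip().lower() in lookup

# Made-up entries: a full name, an acronym, and an abbreviated binomial
virus_lookup = build_lookup(["Middle East respiratory syndrome coronavirus", "MERS-CoV", " "])
host_lookup = build_lookup(["Camelus dromedarius", "C. Dromedarius", "dromedary"])

print(is_match("mers-cov", virus_lookup))
print(is_match("C. dromedarius", host_lookup))
print(is_match("camel", host_lookup))
```

Because the lookup is exact, "camel" does not match unless it is listed explicitly, which is why the dictionaries need to include every variant that should be recognized.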

In [117]:
# Create a list of animal host names 
domestic_names = pd.read_csv('domestic_names.csv')
names1 = domestic_names.iloc[:,0]
names2 = domestic_names.iloc[:,1]
names3 = domestic_names.iloc[:,2]
names_list = pd.concat([names1, names2, names3])  # Series.append was removed in pandas 2.0
names_list = names_list.tolist()
names_list.append("dromedary")
names_list.append("Peking")

In [118]:
ictv_animals = pd.read_csv('ictv_animals.csv')
#print('Total animal names:', ictv_animals.count().sum()) # total number of animal names in the dictionary
ictv_series = ictv_animals.stack().reset_index().iloc[:,2]
ictv_list = ictv_series.tolist()

In [119]:
# Abbreviate a two-word (genus + species) name to the genus initial plus the species name
def name(s):
    words = s.split()
    if len(words) == 2:
        # capital first letter of the genus, then the title-cased species name
        return words[0][0].upper() + '. ' + words[-1].title()
    # names that are not exactly two words are returned unchanged
    return s

In [120]:
ictv_list2 = [name(s) for s in ictv_list] # shortened species names list
animals_list = list(set(names_list + ictv_list + ictv_list2))
dont_want = ['Once', 'Ounce', 'Mal']
animals_list = [a for a in animals_list if a not in dont_want]

In [121]:
# Create a list of virus names
ictv_viruses = pd.read_csv('ictv_viruses.csv')
# create copies of the virus names with all digits removed (a raw string avoids an invalid-escape warning)
ictv_viruses['Species2'] = ictv_viruses['Species'].str.replace(r'\d+', '', regex=True)
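
The effect of the `\d+` pattern can be checked on a few strings with the standard `re` module. The species names below are placeholders chosen for illustration, and the `.strip()` call stands in for the whitespace clean-up that the notebook performs in a later cell.

```python
import re

# Placeholder species names for illustration
species = ["Human mastadenovirus 5", "Enterovirus A71", "Rabies lyssavirus"]

# Remove every run of digits, then trim the whitespace left behind
without_digits = [re.sub(r'\d+', '', s).strip() for s in species]
print(without_digits)
```

Note that the pattern removes digits anywhere in the name, not only at the end, so a name like the second one loses the digits that follow its letter.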

In [122]:
ictv_v_series = ictv_viruses.stack().reset_index().iloc[:,2].drop_duplicates()
virus_list = ictv_v_series.tolist()

In [123]:
virus_abbrev = pd.read_csv('virus_abbrev.csv', header = None)
virus_list = virus_list + virus_abbrev.iloc[:,0].tolist()

In [124]:
# Clean up white space and remove any empty strings
animals_list = [animal.strip() for animal in animals_list]
animals_list = list(filter(None, animals_list))
virus_list = [virus.strip() for virus in virus_list]
virus_list = list(filter(None, virus_list))

In [125]:
# search the list for unwanted terms:
#import re
#r = re.compile("inia")
#new_list = list(filter(r.match, animals_list))
#print(new_list)

In [126]:
print('Number of virus names to match:', len(virus_list))
print('Number of host names to match:', len(animals_list))

Number of virus names to match: 6659
Number of host names to match: 69875


## Part II: Candidate Extraction

The next step is to extract candidates from the text. A `candidate` in Snorkel is the object we want to make a prediction on. In our case, the candidates are pairs of virus and host species mentions. Our task will be to predict which pairs are correctly described as linked in the text.

In [127]:
from snorkel.matchers import DictionaryMatch
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.models import candidate_subclass

In [128]:
# Define the candidate schema to extract (virus-host pair). This is a subclass of candidate and is defined using a helper function. The VirusHost mention connects two Spans of text and creates the table in the database backend.
VirusHost = candidate_subclass('VirusHost', ['virus', 'host'])

### Writing a basic `CandidateExtractor`

* `CandidateExtractor` is a basic function to extract **candidate Virus-Host relation mentions** from the corpus.

* We will extract `Candidates` by identifying, for each `Sentence`, all pairs of n-grams (up to 10-grams, matching `n_max=10` in the code below) that were tagged. (An n-gram is a span of text made up of n tokens; a token is roughly a string of contiguous characters between two spaces.)

<br>

We do this with three objects:

* A `ContextSpace` defines the "space" of all candidates we even potentially consider; in this case we use the `Ngrams` subclass and look for all n-grams up to 10 words long.

* A `Matcher` heuristically filters the candidates we use; here, a `DictionaryMatch` keeps only the n-grams found in the virus or host dictionaries.

* A `CandidateExtractor` combines these pieces.
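
The three roles above can be sketched in plain Python: enumerate n-grams, keep those found in a dictionary, and pair the matches within a sentence. This is a simplified illustration under stated assumptions (whitespace tokens, case-insensitive exact matching, hypothetical helper names), not Snorkel's implementation.

```python
from itertools import product

def ngram_spans(tokens, n_max):
    """Yield (start, end) token offsets for every n-gram with 1 <= n <= n_max."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            yield start, end

def matched_spans(tokens, lookup, n_max):
    """Keep only the n-grams whose lowercased text is in the dictionary."""
    return [(s, e) for s, e in ngram_spans(tokens, n_max)
            if ' '.join(tokens[s:e]).lower() in lookup]

def extract_pairs(tokens, virus_lookup, host_lookup, n_max=3):
    """Pair every matched virus span with every matched host span in one sentence."""
    viruses = matched_spans(tokens, virus_lookup, n_max)
    hosts = matched_spans(tokens, host_lookup, n_max)
    return [(' '.join(tokens[vs:ve]), ' '.join(tokens[hs:he]))
            for (vs, ve), (hs, he) in product(viruses, hosts)]

# Made-up sentence and dictionaries
tokens = "Rabies virus was detected in red foxes and dogs".split()
pairs = extract_pairs(tokens, {"rabies virus"}, {"red foxes", "dogs"})
print(pairs)
```

Every virus mention is paired with every host mention in the same sentence, so many candidates will be false pairs; deciding which ones express a real link is the prediction task.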

In [129]:
# Define the dictionary matchers, define the candidate extractor
ngrams = Ngrams(n_max=10)
virus_matcher = DictionaryMatch(d = virus_list)
animals_matcher = DictionaryMatch(d = animals_list)
cand_extractor = CandidateExtractor(VirusHost, [ngrams, ngrams], [virus_matcher, animals_matcher], nested_relations = True)

### Split the docs into 3 sets: training, development, and testing sets

In [130]:
from snorkel.models import Document

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)
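
The modulo rule above assigns whole documents, not individual sentences, to a split: roughly 80% of documents go to training, 10% to development, and 10% to testing. A small sketch with placeholder indices shows how that plays out for 39 documents, the number parsed earlier; `assign_split` is a hypothetical helper that mirrors the loop.

```python
from collections import Counter

def assign_split(doc_index):
    """Mirror the rule above: 0 = train, 1 = dev, 2 = test."""
    if doc_index % 10 == 8:
        return 1
    if doc_index % 10 == 9:
        return 2
    return 0

# 39 placeholder document indices
split_counts = Counter(assign_split(i) for i in range(39))
print(dict(split_counts))
```

Because documents differ in length, the sentence counts per split need not follow the same proportions as the document counts.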

In [131]:
# Number of sentences per set
print(len(train_sents))
print(len(dev_sents))
print(len(test_sents))

11869
1759
807


In [132]:
%%time
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(VirusHost).filter(VirusHost.split == i).count())

Clearing existing...
Running UDF...


100%|██████████████████████████████| 11869/11869 [00:17<00:00, 662.40it/s]


Number of candidates: 1063
Clearing existing...
Running UDF...


100%|████████████████████████████████| 1759/1759 [00:02<00:00, 703.51it/s]


Number of candidates: 218
Clearing existing...
Running UDF...


100%|██████████████████████████████████| 807/807 [00:01<00:00, 496.56it/s]


Number of candidates: 100
Wall time: 22.2 s


In [133]:
print("Number of training candidates:", session.query(VirusHost).filter(VirusHost.split == 0).count())
print("Number of development candidates:", session.query(VirusHost).filter(VirusHost.split == 1).count())
print("Number of test candidates:", session.query(VirusHost).filter(VirusHost.split == 2).count())
print("Total candidates extracted:", session.query(VirusHost).count())

Number of training candidates: 1063
Number of development candidates: 218
Number of test candidates: 100
Total candidates extracted: 1381


In [134]:
cand_extracted = []
for c in session.query(VirusHost).filter(VirusHost.split == 0).all():
    cand_extracted.append(c)
print("Number extracted:", len(cand_extracted))

Number extracted: 1063


In [135]:
from snorkel.viewer import SentenceNgramViewer

SentenceNgramViewer(cand_extracted, session)


In [136]:
# SentenceNgramViewer can be used to hand-label data for the gold label set.
# Export the SQLite table to CSV format and make sure util_virushost.py points to the file location.
# The gold labels are then saved and can be loaded as shown in the next section.

## Part III: Import gold labels (hand labels) to check performance

The hand-labeled set is used to evaluate the quality of the model.
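
One common way to measure quality against hand labels is precision, recall, and F1 for the positive class. The sketch below assumes labels of +1 (linked) and -1 (not linked) and uses made-up label lists; it is an illustration of the metric, not the evaluation code used in later notebooks.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for the positive class, with labels in {-1, +1}."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == -1 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up gold and predicted labels for five candidates
gold = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, 1, 1]
print(precision_recall_f1(gold, predicted))
```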

In [137]:
from util_virushost import load_external_labels

%time missed = load_external_labels(session, VirusHost, annotator_name = 'gold')

AnnotatorLabels created: 5
Wall time: 281 ms


In [138]:
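from snorkel.annotations import load_gold_labels
# No `split` argument is passed, so the matrix covers the 1063 training candidates despite the `_dev` name
L_gold_dev = load_gold_labels(session, annotator_name = 'gold')
L_gold_dev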

<1063x1 sparse matrix of type '<class 'numpy.int32'>'
	with 5 stored elements in Compressed Sparse Row format>

### Next steps: Developing Labeling Functions in Notebook 2