# Phenotype acronym extraction

This module looks at both tables and text to identify acronyms used to refer to phenotypes. These acronyms are then used to further expand SNP/phenotype relations.

## Preparations

We start by configuring Jupyter and setting up our environment.

In [2]:
%load_ext autoreload
%autoreload 2

import sys
import cPickle
import numpy as np
import sqlalchemy

# set the paths to snorkel and gwaskb
sys.path.append('../snorkel-tables')
sys.path.append('../src')
sys.path.append('../src/crawler')

# set up the directory with the input papers
abstract_dir = '../data/db/papers'

# set up matplotlib
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,4)

# create a Snorkel session
from snorkel import SnorkelSession
session = SnorkelSession()



## Load corpus

We load our usual corpus of GWAS papers.

### Tables corpus

Part of the acronyms are found in tables. We parse these like in the other table-based modules.

In [3]:
from snorkel.parser import XMLMultiDocParser

xml_parser = XMLMultiDocParser(
    path=abstract_dir,
    doc='./*',
    text='.//table',
    id='.//article-id[@pub-id-type="pmid"]/text()',
    keep_xml_tree=True)

In [4]:
from snorkel.parser import CorpusParser, OmniParser
from snorkel.models import Corpus

# parses tables into rows, cols, cells...
table_parser = OmniParser(timeout=1000000)

try:
    table_corpus = session.query(Corpus).filter(Corpus.name == 'GWAS Table Corpus').one()
except:
    cp = CorpusParser(xml_parser, table_parser)
    %time table_corpus = cp.parse_corpus(name='GWAS Table Corpus', session=session)
    session.add(table_corpus)
    session.commit()

print 'Loaded corpus of %d documents' % len(table_corpus)

Loaded corpus of 589 documents


### Text copus

We also seek mentions of acronyms in the paper text.

The following parser extracts sentences from each paper abstract, title, and the first 5 paragraphs.

In [5]:
from snorkel.parser import SentenceParser
from snorkel.parser import CorpusParser
from snorkel.models import Corpus

from extractor.parser import UnicodeXMLDocParser, GWASXMLDocParser

xml_parser = GWASXMLDocParser(
    path=abstract_dir,
    doc='./*',
    title='.//front//article-title//text()',
    abstract='.//abstract//p//text()',
    n_par=5,
    id='.//article-id[@pub-id-type="pmid"]/text()',
    keep_xml_tree=True)

sent_parser = SentenceParser()

try:
    text_corpus = session.query(Corpus).filter(Corpus.name == 'GWAS Text Corpus').one()
except:
    cp = CorpusParser(xml_parser, sent_parser)
    %time text_corpus = cp.parse_corpus(name='GWAS Text Corpus', session=session)
    session.add(text_corpus)
    session.commit()

print 'Loaded corpus of %d documents' % len(text_corpus)

Loaded corpus of 589 documents


## Candidate extraction

Next, we generate candidates from both tables and text.

### From phenotype / acronym tables

Many papers have tables with an acronym column and a phenotype column. In this section, we extract candidates from these tables.

We define matchers for cells whose header contains a word that is indicative of a phenotype or an acronym.

In [6]:
# create a Snorkel class for the relation we will extract
from snorkel.models import candidate_subclass
AcroPhenRel = candidate_subclass('AcroPhenRel', ['acro','phen'])

# Define a candidate space
from snorkel.candidates import TableCells
cells = TableCells()

# Create a list of possible words that could denote phenotypes
acro_words = ['abbreviation', 'acronym', 'phenotype']
phen_words = ['trait', 'phenotype', 'description']

# Define matchers
from snorkel.matchers import CellNameDictionaryMatcher
phen_matcher = CellNameDictionaryMatcher(axis='col', d=phen_words, n_max=3, ignore_case=True)
acro_matcher = CellNameDictionaryMatcher(axis='col', d=acro_words, n_max=3, ignore_case=True)

# we will be looking only at aligned cells
from snorkel.throttlers import AlignmentThrottler
row_align_filter = AlignmentThrottler(axis='row', infer=True)

# create the candidate extractor
from snorkel.candidates import CandidateExtractor
ce1 = CandidateExtractor(AcroPhenRel, [cells, cells], [acro_matcher, phen_matcher], throttler=row_align_filter)

# collect that cells that will be searched for candidates
tables = [table for doc in table_corpus.documents for table in doc.tables]

We are now ready to perform relation extraction.

In [7]:
from snorkel.models import CandidateSet

try:
    tab_rels = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Tables Set').one()
except:
    %time tab_rels = ce1.extract(tables, 'AcroPhenRel Tables Set', session)
    
print "%s relations extracted, e.g." % len(tab_rels)
for cand in tab_rels[:10]:
    print cand

526 relations extracted, e.g.
AcroPhenRel(Span("MP:0000284", parent=477868, chars=[0,9], words=[0,1]), Span("double outlet right ventricle", parent=477869, chars=[0,28], words=[0,3]))
AcroPhenRel(Span("heart right ventricle hypertrophy", parent=477885, chars=[0,32], words=[0,3]), Span("MP:0000276", parent=477884, chars=[0,9], words=[0,1]))
AcroPhenRel(Span("abnormal heart right atrium morphology", parent=477829, chars=[0,37], words=[0,4]), Span("MP:0003922", parent=477828, chars=[0,9], words=[0,1]))
AcroPhenRel(Span("MP:0002625", parent=477876, chars=[0,9], words=[0,1]), Span("heart left ventricle hypertrophy", parent=477877, chars=[0,31], words=[0,3]))
AcroPhenRel(Span("right pulmonary isomerism", parent=477917, chars=[0,24], words=[0,2]), Span("MP:0000531", parent=477916, chars=[0,9], words=[0,1]))
AcroPhenRel(Span("MP:0009570", parent=477852, chars=[0,9], words=[0,1]), Span("abnormal right lung morphology", parent=477853, chars=[0,29], words=[0,3]))
AcroPhenRel(Span("MP:0002766", pa

### From table phrases

Another way of definining acronyms is in text, e.g. as in "Body Mass Index (BMI)". We are now going to extract such candidates from phrases that are found in paper tables.

In [8]:
# Define a candidate space
from snorkel.candidates import OmniNgrams
ngrams3 = OmniNgrams(n_max=3)
ngrams8 = OmniNgrams(n_max=8)

# Define matchers
from snorkel.matchers import RegexMatchSpan
phen_matcher = RegexMatchSpan(rgx=r'.+ \([a-zA-Z0-9_-]{1,10}[\);]')
acro_matcher = RegexMatchSpan(rgx=r'\([a-zA-Z0-9_-]{1,10}[\);]')

# We only look at phenotype and acronym matches that overlap
from snorkel.throttlers import ContainmentThrottler, WordLengthThrottler, CombinedThrottler
containment_filter = ContainmentThrottler()
# length_filter = WordLengthThrottler(op='max', idx=1, lim=15)
# ovl_len_filter = CombinedThrottler([overlap_filter, length_filter])

# create the candidate extractor
from snorkel.candidates import CandidateExtractor
ce2 = CandidateExtractor(AcroPhenRel, [ngrams3, ngrams8], [acro_matcher, phen_matcher], self_relations=True, nested_relations=True, throttler=containment_filter)

Let us now extract these candidates.

In [9]:
from snorkel.models import CandidateSet

try:
    txt_tab_rels = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Tables Set 2').one()
except:
    %time txt_tab_rels = ce2.extract(table_corpus.documents, 'AcroPhenRel Tables Set 2', session)
    
print "%s relations extracted, e.g." % len(txt_tab_rels)
for cand in txt_tab_rels[:10]:
    print cand

6545 relations extracted, e.g.
AcroPhenRel(Span("(SE)", parent=488275, chars=[5,8], words=[1,3]), Span("Beta (SE)", parent=488275, chars=[0,8], words=[0,3]))
AcroPhenRel(Span("(SE)", parent=488282, chars=[5,8], words=[1,3]), Span("Beta (SE)", parent=488282, chars=[0,8], words=[0,3]))
AcroPhenRel(Span("(SE)", parent=488277, chars=[5,8], words=[1,3]), Span("Beta (SE)", parent=488277, chars=[0,8], words=[0,3]))
AcroPhenRel(Span("(SNFs)", parent=469891, chars=[12,17], words=[2,4]), Span("GWAS Sample (SNFs)", parent=469891, chars=[0,17], words=[0,4]))
AcroPhenRel(Span("(SNFs)", parent=469891, chars=[12,17], words=[2,4]), Span("Sample (SNFs)", parent=469891, chars=[5,17], words=[1,4]))
AcroPhenRel(Span("(Unrelateds)", parent=469892, chars=[12,23], words=[2,4]), Span("GWAS Sample (Unrelateds)", parent=469892, chars=[0,23], words=[0,4]))
AcroPhenRel(Span("(Unrelateds)", parent=469892, chars=[12,23], words=[2,4]), Span("Sample (Unrelateds)", parent=469892, chars=[5,23], words=[1,4]))
AcroPhenRe

### From text

Finally, we repeat the same extraction process for candidates that are found in text.

In [10]:
# Define a candidate space
from snorkel.candidates import Ngrams
ngrams3 = Ngrams(n_max=3)
ngrams8 = Ngrams(n_max=8)

# Define matchers
from snorkel.matchers import RegexMatchSpan
phen_matcher = RegexMatchSpan(rgx=r'.+ \([a-zA-Z0-9_-]{1,10}[\);]')
acro_matcher = RegexMatchSpan(rgx=r'\([a-zA-Z0-9_-]{1,10}[\);]')

# We only look at phenotype and acronym matches that overlap
from snorkel.throttlers import ContainmentThrottler, WordLengthThrottler, CombinedThrottler
containment_filter = ContainmentThrottler()
# length_filter = WordLengthThrottler(op='max', idx=1, lim=15)
# ovl_len_filter = CombinedThrottler([overlap_filter, length_filter])

# create the candidate extractor
from snorkel.candidates import CandidateExtractor
ce3 = CandidateExtractor(AcroPhenRel, [ngrams3, ngrams8], [acro_matcher, phen_matcher], self_relations=True, nested_relations=True, throttler=containment_filter)

We extract the candidates.

In [11]:
from snorkel.models import CandidateSet

try:
    txt_txt_rels = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Text Set5').one()
except:
    sentences = [s for doc in text_corpus for s in doc.sentences]
    %time txt_txt_rels = ce3.extract(sentences, 'AcroPhenRel Text Set5', session)
    
print "%s relations extracted, e.g." % len(txt_txt_rels)
for cand in txt_txt_rels[:10]:
    print cand

30512 relations extracted, e.g.
AcroPhenRel(Span("(aBMD)", parent=526044, chars=[276,281], words=[36,38]), Span("bone mineral density (aBMD)", parent=526044, chars=[255,281], words=[33,38]))
AcroPhenRel(Span("(aBMD)", parent=526044, chars=[276,281], words=[36,38]), Span("mineral density (aBMD)", parent=526044, chars=[260,281], words=[34,38]))
AcroPhenRel(Span("(aBMD)", parent=526044, chars=[276,281], words=[36,38]), Span("density (aBMD)", parent=526044, chars=[268,281], words=[35,38]))
AcroPhenRel(Span("(aBMD)", parent=526044, chars=[276,281], words=[36,38]), Span("with areal bone mineral density (aBMD)", parent=526044, chars=[244,281], words=[31,38]))
AcroPhenRel(Span("(aBMD)", parent=526044, chars=[276,281], words=[36,38]), Span("areal bone mineral density (aBMD)", parent=526044, chars=[249,281], words=[32,38]))
AcroPhenRel(Span("(GWA)", parent=526044, chars=[198,202], words=[23,25]), Span("genome-wide association (GWA)", parent=526044, chars=[174,202], words=[21,25]))
AcroPhenRel(Sp

### Combining the results

Finally, we merge all the candiates into a single set.

In [12]:
from snorkel.models import CandidateSet

try:
    rels = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Canidates').one()
except:
    rels = CandidateSet(name='AcroPhenRel Canidates')
    for c in tab_rels: rels.append(c)
    for c in txt_tab_rels: rels.append(c)
    for c in txt_txt_rels: rels.append(c)

    session.add(rels)
    session.commit()

print '%d candidates in total' % len(rels)

37583 candidates in total


Write down our results (for debugging)

In [12]:
with open('acronyms.tmp.tsv', 'w') as f:
    for rel in rels:
        pmid = rel[0].parent.document.name
        try:
            out_str = '%s\t%s\t%s\n' % (pmid, unicode(rel[1].get_span()), unicode(rel[0].get_span()))
        except:
            print (pmid, unicode(rel[1].get_span()), rel[0].get_span())
        f.write(out_str.encode("UTF-8"))

## Creating a gold set

It will be helpful to have a list of gold labels against which to evaluate the accuracy of our system.

We are going to load here a list of candidates that we have previously labeled by hand.

In [13]:
annotations = dict()
with open('util/acronyms.anotated.txt') as f:
    text = f.read()
    for line in text.split('\r'):
        doc_id, str1, str2, res = line.strip().split('\t')
        res = 1 if int(res) == 1 else -1
        annotations[(doc_id, str2, str1)] = res

The format of this file is: pmid, phenotype, acronym, label. We originally generated it from 100 random candidates.

## Learning the correctness of relations extracted from tables

Next, we are going to use a machine learning classifier to identify correct acronyms amond our set of candidates.

First, we are going to train a classifier for candidates that have been extracted from tables (that had a phenotype and an acronym column).

### Creating training and test sets

We first split data into an (unlabeled) training set (since we will use unsupervised risk estimation to train a candidate on it), and a dev/test set.

In [93]:
# helper fn
def r2id(r):
    doc_id = r[0].parent.document.name
    str1, str2 = r[0].get_span(), r[1].get_span()
    return (doc_id, str1, str2)

try:
    tab_train_c = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Table Training Candidates').one()
    tab_devtest_c = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Table Dev/Test Candidates').one()
except:
    # delete any previous sets with that name
    session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Table Training Candidates').delete()
    session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Table Dev/Test Candidates').delete()

    # helpers/config
    frac_test = 0.5

    # initialize the new sets
    tab_train_c = CandidateSet(name='AcroPhenRel Table Training Candidates')
    tab_devtest_c = CandidateSet(name='AcroPhenRel Table Dev/Test Candidates')

    # choose a random subset for the labeled set
    n_test = len(tab_rels) * frac_test
    test_idx = set(np.random.choice(len(tab_rels), size=(n_test,), replace=False))

    # add to the sets
    for i, c in enumerate(tab_rels):
        if i in test_idx:
            tab_devtest_c.append(c)
        elif r2id(c) in annotations:
            tab_devtest_c.append(c)
        else:
            tab_train_c.append(c)

    # save the results
    session.add(tab_train_c)
    session.add(tab_devtest_c)
    session.commit()

print 'Initialized %d training and %d dev/testing candidates' % (len(tab_train_c), len(tab_devtest_c))
print "Positive labels in dev/test set: %s" % len([c for c in tab_devtest_c if annotations.get(r2id(c),0)==1])
print "Negative labels in dev/test set: %s" % len([c for c in tab_devtest_c if annotations.get(r2id(c),0)==-1])

Initialized 238 training and 288 dev/testing candidates
Positive labels in dev/test set: 36
Negative labels in dev/test set: 21


### Labeling functions

Following the data programming approach, we define set of labeling functions. We will learn their accuracy via unsupervised learning and use them for classifying candidates.

In [16]:
def LF1_digits(m):
    txt = m[1].get_span()
    frac_num = len([ch for ch in txt if ch.isdigit()]) / float(len(txt))
    return -1 if frac_num > 0.5 else +1
def LF1_short(m):
    txt = m[1].get_span()
    return -1 if len(txt) < 5 else 0

LF_tables = [LF1_digits, LF1_short]

We compute the LFs's on our training set.

In [17]:
from snorkel.annotations import LabelManager
label_manager = LabelManager()

try:
    %time L_tab_train = label_manager.load(session, tab_train_c, 'AcroPhenRel Table Training LF Labels')
except sqlalchemy.orm.exc.NoResultFound:
    %time L_tab_train = label_manager.create(session, tab_train_c, 'AcroPhenRel Table Training LF Labels', f=LF_tables)

CPU times: user 664 ms, sys: 54.8 ms, total: 718 ms
Wall time: 867 ms


And we learn their accuracy.

In [18]:
from snorkel.learning import NaiveBayes

tab_model = NaiveBayes()
tab_model.train(L_tab_train, n_iter=10000, rate=1e-2)

because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.



Training marginals (!= 0.5):	238
Features:			2
Begin training for rate=0.01, mu=1e-06
	Learning epoch = 0	Gradient mag. = 0.342480
	Learning epoch = 250	Gradient mag. = 0.182756
	Learning epoch = 500	Gradient mag. = 0.099294
	Learning epoch = 750	Gradient mag. = 0.057097
	Learning epoch = 1000	Gradient mag. = 0.041849
	Learning epoch = 1250	Gradient mag. = 0.053292
	Learning epoch = 1500	Gradient mag. = 0.088556
	Learning epoch = 1750	Gradient mag. = 0.148214
	Learning epoch = 2000	Gradient mag. = 0.226156
	Learning epoch = 2250	Gradient mag. = 0.281349
	Learning epoch = 2500	Gradient mag. = 0.262219
	Learning epoch = 2750	Gradient mag. = 0.190201
	Learning epoch = 3000	Gradient mag. = 0.118443
	Learning epoch = 3250	Gradient mag. = 0.067976
	Learning epoch = 3500	Gradient mag. = 0.037414
	Learning epoch = 3750	Gradient mag. = 0.020197
	Learning epoch = 4000	Gradient mag. = 0.010847
	Learning epoch = 4250	Gradient mag. = 0.005864
	Learning epoch = 4500	Gradient mag. = 0.003234
	Learnin

### Candidate classification

Finally, we classify the entire set of candidates. We start by applying the labelling functions.

In [84]:
from snorkel.annotations import LabelManager
label_manager = LabelManager()

try:
    %time L_tab_all = label_manager.load(session, tab_rels, 'AcroPhenRel Table LF Labels')
except sqlalchemy.orm.exc.NoResultFound:
    %time L_tab_all = label_manager.create(session, tab_rels, 'AcroPhenRel Table LF Labels', f=LF_tables)

CPU times: user 830 ms, sys: 18.6 ms, total: 849 ms
Wall time: 861 ms


We use the model to predict which ones are correct.

In [94]:
scores = tab_model.odds_unw(L_tab_all)
tab_acronyms = [r2id(c) for (c, s) in zip(tab_rels, scores) if s > 0]

print 'Identified %d acronyms predicted to be correct, e.g.' % len(tab_acronyms)
print tab_acronyms[:10]

Identified 262 acronyms predicted to be correct, e.g.
[(u'24068947', u'MP:0000284', u'double outlet right ventricle'), (u'24068947', u'MP:0002625', u'heart left ventricle hypertrophy'), (u'24068947', u'MP:0009570', u'abnormal right lung morphology'), (u'24068947', u'MP:0002766', u'situs inversus'), (u'24068947', u'MP:0010429', u'abnormal heart left ventricle outflow tract morphology'), (u'24068947', u'MP:0000531', u'right pulmonary isomerism'), (u'24068947', u'MP:0000276', u'heart right ventricle hypertrophy'), (u'24068947', u'MP:0000542', u'left-sided isomerism'), (u'24068947', u'MP:0003922', u'abnormal heart right atrium morphology'), (u'24068947', u'MP:0009569', u'abnormal left lung morphology')]


## Learning the correctness of relations extracted from text

Next, we repeat our classification procedure on relations that have been extracted from phrases.

We start by creating the set of all phrase relations.

In [20]:
from snorkel.models import CandidateSet

try:
    txt_rels = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Text Canidates').one()
except:
    txt_rels = CandidateSet(name='AcroPhenRel Text Canidates')
    for c in txt_tab_rels: txt_rels.append(c)
    for c in txt_txt_rels: txt_rels.append(c)

    session.add(txt_rels)
    session.commit()

print 'Collected %d candidates from phrases' % len(txt_rels)

Collected 37057 candidates from phrases


### Creating training and test sets

We first split data into an (unlabeled) training set (since we will use unsupervised risk estimation to train a candidate on it), and a dev/test set.

In [21]:
# helper
def r2id(r):
    doc_id = r[0].parent.document.name
    str1, str2 = r[0].get_span(), r[1].get_span()
    return (doc_id, str1, str2)
    
try:
    txt_train_c = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Phrase Training Candidates').one()
    txt_devtest_c = session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Phrase Dev/Test Candidates').one()
except:
    # delete any previous sets with that name
    session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Phrase Training Candidates').delete()
    session.query(CandidateSet).filter(CandidateSet.name == 'AcroPhenRel Phrase Dev/Test Candidates').delete()

    # helpers/config
    frac_test = 0.5

    # initialize the new sets
    txt_train_c = CandidateSet(name='AcroPhenRel Phrase Training Candidates')
    txt_devtest_c = CandidateSet(name='AcroPhenRel Phrase Dev/Test Candidates')

    # choose a random subset for the labeled set
    n_test = len(txt_rels) * frac_test
    test_idx = set(np.random.choice(len(txt_rels), size=(n_test,), replace=False))

    # add to the sets
    for i, c in enumerate(txt_rels):
        if i in test_idx:
            txt_devtest_c.append(c)
        elif r2id(c) in annotations:
            txt_devtest_c.append(c)
        else:
            txt_train_c.append(c)

    # save the results
    session.add(txt_train_c)
    session.add(txt_devtest_c)
    session.commit()

print 'Initialized %d training and %d dev/testing candidates' % (len(txt_train_c), len(txt_devtest_c))
print "Positive labels in dev/test set: %s" % len([c for c in txt_devtest_c if annotations.get(r2id(c),0)==1])
print "Negative labels in dev/test set: %s" % len([c for c in txt_devtest_c if annotations.get(r2id(c),0)==-1])

Initialized 18456 training and 18601 dev/testing candidates
Positive labels in dev/test set: 28
Negative labels in dev/test set: 125


### Labelling functions

Following the data programming approach, we define set of labeling functions. We will learn their accuracy via unsupervised learning and use them for classifying candidates.

In [106]:
import re
from bs4 import BeautifulSoup as soup
from snorkel.lf_helpers import get_left_tokens, left_text

# helper fn
def r2id(r):
    doc_id = r[0].parent.document.name
    str1, str2 = r[0].get_span(), r[1].get_span()
    acro = str1[1:-1]
    phen = str2.split(' (')[0]
    return (doc_id, acro, phen)

# positive LFs
def LF_acro_matches(m):
    _, acro, phen = r2id(m)
    words = phen.strip().split()
    if len(acro) == len(words):
        w_acro = ''.join([w[0] for w in words])
        if w_acro.lower() == acro.lower():
            return +1
    return 0

def LF_acro_matches_with_dashes(m):
    _, acro, phen = r2id(m)
    words = re.split(' |-', phen)
    if len(acro) == len(words) and len(words) > 0:
        w_acro = ''.join([w[0] for w in words if w])
        if w_acro.lower() == acro.lower():
            return +1
    return 0

def LF_acro_first_letter(m):
    _, acro, phen = r2id(m)
    if not any(l.islower() for l in phen): return 0
    words = phen.strip().split()
    if len(acro) <= len(words):
        if words[0].lower() == acro[0].lower():
            return +1
    return 0

def LF_acro_prefix(m):
    _, acro, phen = r2id(m)
    phen = phen.replace('-', '')
    if phen[:2].lower() == acro[:2].lower():
        return +5
    return 0

def LF_acro_matches_last_letters(m):
    _, acro, phen = r2id(m)
    words = phen.strip().split()
#     prev_words = m.span1.pre_window(d=1) + words
    prev_words = left_text(m[1], window=1) + words
    w_prev_acro = ''.join([w[0] for w in prev_words])
    if w_prev_acro.lower() == acro.lower(): return 0
    for r in (1,2):
        new_acro = acro[r:]
        if len(new_acro) < 3: continue
        if len(new_acro) == len(words):
            w_acro = ''.join([w[0] for w in words])
            if w_acro.lower() == new_acro.lower():
                return +1
    return 0

def LF_full_cell(m):
    """If only phrase in cell is A B C (XYZ), then it's correct"""
    if not hasattr(m[1].parent, 'cell'): return 0
    _, acro, phen = r2id(m)
#     if not phen[0].lower() == acro[0].lower(): return 0
    cell = m[1].parent.cell
    txt_cell = soup(cell.text).text if cell.text is not None else ''
    txt_span = m[1].get_span()
    return 1 if cell.text == txt_span or txt_cell == txt_span else 0
#     return 1 if m[1].parent.cell.text == m[1].get_span() else 0

def LF_start(m):
    punc = ',.;!?()\'"'
    if hasattr(m[1].parent, 'cell'): return 0 # this is only for when we're within a sentence
    if m[1].get_word_start() == 0 or any(c in punc for c in left_text(m[1], window=1)):
        _, acro, phen = r2id(m)
        if phen[0].lower() == acro[0].lower(): 
            return +1
    return 0

LF_txt_pos = [LF_acro_matches, LF_acro_matches_with_dashes, LF_acro_first_letter, LF_acro_prefix, LF_acro_matches_last_letters, LF_full_cell, LF_start]

# negative LFs
def LF_digits(m):
    txt = m[1].get_span()
    frac_num = len([ch for ch in txt if ch.isdigit()]) / float(len(txt))
    return -1 if frac_num > 0.5 else +1

def LF_short(m):
    _, acro, phen = r2id(m)
    return -1 if len(acro) == 1 else 0

def LF_lc(m):
    _, acro, phen = r2id(m)
    return -1 if all(l.islower() for l in acro) else 0

def LF_uc(m):
    _, acro, phen = r2id(m)
    return -2 if not any(l.islower() for l in phen) else 0

def LF_punc(m):
    _, acro, phen = r2id(m)
    punc = ',.;!?()'
    return -1 if any(c in punc for c in phen) else 0
    

LF_txt_neg = [LF_digits, LF_short, LF_lc, LF_uc, LF_punc]

LF_txt = LF_txt_pos + LF_txt_neg

We compute the LFs on our training set.

In [107]:
from snorkel.annotations import LabelManager
label_manager = LabelManager()

try:
    %time L_txt_train = label_manager.load(session, txt_train_c, 'AcroPhenRel Phrase Training LF Labels k')
except sqlalchemy.orm.exc.NoResultFound:
    %time L_txt_train = label_manager.create(session, txt_train_c, 'AcroPhenRel Phrase Training LF Labels k', f=LF_txt)

Generating annotations for 18456 candidates...
Loading sparse Label matrix...
CPU times: user 4min 57s, sys: 2.51 s, total: 5min
Wall time: 5min 1s


And we learn their accuracy.

In [108]:
from snorkel.learning import NaiveBayes

txt_model = NaiveBayes()
txt_model.train(L_txt_train, n_iter=10000, rate=1e-1)

Training marginals (!= 0.5):	18456
Features:			12
Begin training for rate=0.1, mu=1e-06
	Learning epoch = 0	Gradient mag. = 0.297661
	Learning epoch = 250	Gradient mag. = 0.179091
	Learning epoch = 500	Gradient mag. = 0.080737
	Learning epoch = 750	Gradient mag. = 0.041660
	Learning epoch = 1000	Gradient mag. = 0.022234
	Learning epoch = 1250	Gradient mag. = 0.013254
	Learning epoch = 1500	Gradient mag. = 0.009804
	Learning epoch = 1750	Gradient mag. = 0.008692
	Learning epoch = 2000	Gradient mag. = 0.008353
	Learning epoch = 2250	Gradient mag. = 0.008261
	Learning epoch = 2500	Gradient mag. = 0.008269
	Learning epoch = 2750	Gradient mag. = 0.008336
	Learning epoch = 3000	Gradient mag. = 0.008454
	Learning epoch = 3250	Gradient mag. = 0.008627
	Learning epoch = 3500	Gradient mag. = 0.008874
	Learning epoch = 3750	Gradient mag. = 0.009231
	Learning epoch = 4000	Gradient mag. = 0.009771
	Learning epoch = 4250	Gradient mag. = 0.010644
	Learning epoch = 4500	Gradient mag. = 0.012181
	Learn

### Candidate classification

Finally, we classify the entire set of candidates. We start by applying the labelling functions.

In [110]:
from snorkel.annotations import LabelManager
label_manager = LabelManager()

try:
    %time L_txt_all = label_manager.load(session, txt_rels, 'AcroPhenRel Table LF Labels f')
except sqlalchemy.orm.exc.NoResultFound:
    %time L_txt_all = label_manager.create(session, txt_rels, 'AcroPhenRel Table LF Labels f', f=LF_txt)

Generating annotations for 37057 candidates...
Loading sparse Label matrix...
CPU times: user 9min 37s, sys: 4.64 s, total: 9min 42s
Wall time: 9min 33s


We use the model to predict which ones are correct.

In [111]:
scores = txt_model.odds_unw(L_txt_all)
txt_acronyms = [r2id(c) for (c, s) in zip(txt_rels, scores) if s > 0]

print 'Identified %d acronyms predicted to be correct, e.g.' % len(txt_acronyms)
print txt_acronyms[:10]

Identified 24544 acronyms predicted to be correct, e.g.
[(u'24465473', u'SE', u'Beta'), (u'24465473', u'SE', u'Beta'), (u'24465473', u'SE', u'Beta'), (u'23958962', u'SNFs', u'GWAS Sample'), (u'23958962', u'SNFs', u'Sample'), (u'23958962', u'Unrelateds', u'GWAS Sample'), (u'23958962', u'Unrelateds', u'Sample'), (u'23958962', u'Unrelateds', u'Sample'), (u'23958962', u'Unrelateds', u'Replication Sample'), (u'19081515', u'vCJD', u'p')]


### Store the predicted candidates

In [102]:
acronyms = tab_acronyms + txt_acronyms
print '%d acronyms resolved' % len(acronyms)

# store relations to annotate
with open('results/nb-output/acronyms.extracted.all.tsv', 'w') as f:
    for doc_id, str1, str2 in acronyms:
        if doc_id.endswith('-doc'): doc_id = doc_id[:-4]
        try:
            out = u'{}\t{}\t{}\n'.format(doc_id, unicode(str2), str1)
            f.write(out.encode("UTF-8"))
        except:
            print 'ERROR:', str1, str2

4451 acronyms resolved
