# DeepDiveLite (DDL) Demo

In [472]:
%load_ext autoreload
%autoreload 2

import os, sys, re, cPickle
from ddlite import *


## Raw input -> Sentences

As a first step, we load a set of documents as raw strings:

In [399]:
docs = [open('raw/%s' % fn).read() for fn in os.listdir('raw')]
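
The one-liner above leaves every file handle for the garbage collector to close, and it depends on whatever order `os.listdir` happens to return. A minimal sketch of a more defensive loader follows; `load_docs` is a hypothetical helper name, not part of `ddlite`:

```python
import os

def load_docs(directory):
    """Read every regular file in `directory` and return the contents as strings."""
    docs = []
    # Sort the names so document indices are stable across runs;
    # os.listdir makes no guarantee about ordering.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path) as f:
            docs.append(f.read())
    return docs
```

With this helper, `docs = load_docs('raw')` would take the place of the list comprehension, and indices such as `docs[:20]` would refer to the same files on every run.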

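Next, we transform these document strings into a _list of lists_ of DDL `Sentence` objects, one inner list per document. We use the `SentenceParser.parse` method to parse the documents (the cells below call it through `parser`, presumably a `SentenceParser` instance created beforehand), which by default also runs a variety of NLP preprocessing steps: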

In [None]:
%time parsed_docs = parse_docs_multicore(docs[:10])

In [410]:
%%time
sents = []
for i,doc in enumerate(docs[:20]):
    print "Parsing doc %s..." % i
    sents.append(list(parser.parse(doc)))

Parsing doc 0...
Parsing doc 1...
Parsing doc 2...
Parsing doc 3...
Parsing doc 4...
Parsing doc 5...
Parsing doc 6...
Parsing doc 7...
Parsing doc 8...
Parsing doc 9...
Parsing doc 10...
Parsing doc 11...
Parsing doc 12...
Parsing doc 13...
Parsing doc 14...
Parsing doc 15...
Parsing doc 16...
Parsing doc 17...
Parsing doc 18...
Parsing doc 19...
CPU times: user 1.09 s, sys: 481 ms, total: 1.57 s
Wall time: 8min 14s


In [412]:
for i,sent in enumerate(sents):
    print i, len(sent)

0 1
1 1
2 7
3 1
4 1
5 91
6 1
7 1
8 1
9 1
10 1
11 1
12 2
13 1
14 243
15 185
16 237
17 1
18 1
19 1


In [408]:
%time sents = list(parser.parse_docs(docs[:20]))

CPU times: user 1.06 s, sys: 200 ms, total: 1.26 s
Wall time: 7min 48s
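
The timings above show about one second of CPU time against roughly eight minutes of wall time, which suggests (the source does not say so) that the Python process spends most of its time waiting on work done elsewhere. If that is the case, a thread pool is one way to overlap the waiting across documents. The sketch below is written for Python 3 (`concurrent.futures` needs a backport on Python 2). `parse_docs_threaded` and `toy_parse` are hypothetical names, this is not how `parse_docs_multicore` is implemented, and whether `parser.parse` is safe to call from several threads is an open assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_docs_threaded(docs, parse_fn, max_workers=4):
    """Apply parse_fn to each document concurrently, preserving input order."""
    def parse_to_list(doc):
        # Materialize inside the worker: if parse_fn is a generator,
        # the real work only happens when it is consumed.
        return list(parse_fn(doc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `docs`,
        # regardless of which call finishes first.
        return list(pool.map(parse_to_list, docs))

def toy_parse(doc):
    # Hypothetical stand-in for parser.parse: split on periods.
    return (s.strip() for s in doc.split('.') if s.strip())

parsed = parse_docs_threaded(["A b. C d.", "E f."], toy_parse)
```

The result has the same list-of-lists shape as `sents`, so the downstream cells would not need to change.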


Since parsing and preprocessing is likely the slowest part of the process (around 8 minutes of wall time for 20 documents in the runs above), we save the processed `Sentence` objects to disk:

In [None]:
# Use a context manager so the file is flushed and closed after writing
with open('saved_sents.pkl', 'wb') as f:
    cPickle.dump(sents, f)
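
To reuse the saved sentences in a later session, they have to be loaded back. The round trip can be sketched as follows, using a toy nested list in place of `sents`; `save_pickle` and `load_pickle` are hypothetical helper names:

```python
import os
import pickle  # on Python 2, `import cPickle as pickle` is the faster equivalent
import tempfile

def save_pickle(obj, path):
    # Binary mode is required; the context manager closes the file.
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

# Toy stand-in for `sents`: one inner list per document
toy_sents = [[('Although', 'IN')], [('BMP', 'NNP'), ('signaling', 'NNP')]]
path = os.path.join(tempfile.mkdtemp(), 'toy_sents.pkl')
save_pickle(toy_sents, path)
restored = load_pickle(path)
```

Two general properties of pickle matter here: classes are stored by reference, so the module defining `Sentence` must be importable when loading, and unpickling can execute arbitrary code, so only files from a trusted source should be loaded.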

For now, we'll pick a single, arbitrarily chosen sentence to work with:

In [21]:
sent = sents[15][4]; sent

Sentence(words=[u'Although', u'the', u'BMPR-II', u'tail', u'is', u'not', u'involved', u'in', u'BMP', u'signaling', u'via', u'Smad', u'proteins', u'mutations', u'truncating', u'this', u'domain', u'are', u'present', u'in', u'patients', u'with', u'primary', u'pulmonary', u'hypertension', u'PPH'], lemmas=[u'although', u'the', u'bmpr-ii', u'tail', u'is', u'not', u'involv', u'in', u'bmp', u'signal', u'via', u'smad', u'protein', u'mutat', u'truncat', u'thi', u'domain', u'are', u'present', u'in', u'patient', u'with', u'primari', u'pulmonari', u'hypertens', u'pph'], poses=[u'IN', u'DT', u'JJ', u'NN', u'VBZ', u'RB', u'VBN', u'IN', u'NNP', u'NNP', u'IN', u'NNP', u'NNS', u'NNS', u'VBG', u'DT', u'NN', u'VBP', u'JJ', u'IN', u'NNS', u'IN', u'JJ', u'JJ', u'NN', u'NNP'], dep_parents=[7, 4, 4, 7, 7, 7, 19, 10, 10, 7, 13, 13, 10, 19, 14, 17, 15, 19, 0, 21, 19, 25, 25, 25, 21, 25], dep_labels=[u'mark', u'det', u'amod', u'nsubjpass', u'auxpass', u'neg', u'advcl', u'case', u'compound', u'nmod', u'case', u'c

## Candidate Extraction

First, we load dictionaries of gene and phenotype names; these are the entities that we want to extract relations over:

In [22]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
genes = [line.rstrip().split('\t')[1] for line in open('dicts/ensembl_genes.tsv')]
# Keep only gene names longer than 3 characters
genes = [g for g in genes if len(g) > 3]

# Schema is: HPO_ID | NAME | TYPE (exact, lemma)
phenos = [line.rstrip().split('\t')[1] for line in open('dicts/pheno_terms.tsv')]
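
Both dictionaries are read the same way: take the second tab-separated column and optionally drop short names. A small sketch of that pattern, assuming the three-column schema given in the comments above; `load_names` is a hypothetical helper and the rows are made up for illustration:

```python
def load_names(lines, min_len=0):
    """Return the NAME column (second tab-separated field) of each line."""
    names = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[1]
        # Same test as `len(g) > 3` when min_len=3
        if len(name) > min_len:
            names.append(name)
    return names

# Made-up rows, not taken from the real dictionary files
toy_rows = ["ID1\tBMPR2\tcanonical\n", "ID2\tAB\trefseq\n", "\n"]
toy_genes = load_names(toy_rows, min_len=3)
```

Unlike the list comprehensions above, this version skips a blank line rather than raising `IndexError` on it. Any iterable of lines works, including an open file object.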

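Next, we define the type of relation we want to look for. To do this, we define a DDL `Relations` operator, which is built from two `Entity`-type operators (here two `DictionaryMatch` operators, one for genes and one for phenotypes) and applied to a list of sentences: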

In [389]:
rels = Relations(
    DictionaryMatch('G', genes, ignore_case=False), 
    DictionaryMatch('P', phenos), 
    [sent])
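
To make the idea concrete, here is a toy sketch of dictionary matching followed by pairing. It is not `ddlite`'s implementation: `match_spans`, `candidate_pairs`, the `max_len` cutoff, and the two small dictionaries are all assumptions made for illustration.

```python
from itertools import product

def match_spans(words, dictionary, ignore_case=True, max_len=3):
    """Return (start, end) token spans whose joined text is in `dictionary`."""
    if ignore_case:
        lookup = set(d.lower() for d in dictionary)
    else:
        lookup = set(dictionary)
    spans = []
    for start in range(len(words)):
        # Try every phrase of 1..max_len tokens beginning at `start`
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            phrase = ' '.join(words[start:end])
            if ignore_case:
                phrase = phrase.lower()
            if phrase in lookup:
                spans.append((start, end))
    return spans

def candidate_pairs(words, dict_a, dict_b, ignore_case_a=True, ignore_case_b=True):
    """Pair every match from dict_a with every match from dict_b."""
    spans_a = match_spans(words, dict_a, ignore_case=ignore_case_a)
    spans_b = match_spans(words, dict_b, ignore_case=ignore_case_b)
    return list(product(spans_a, spans_b))

words = ['the', 'BMPR-II', 'tail', 'in', 'patients', 'with',
         'primary', 'pulmonary', 'hypertension']
toy_gene_dict = ['BMPR-II']                                  # made-up dictionary
toy_pheno_dict = ['pulmonary hypertension', 'hypertension']  # made-up dictionary
pairs = candidate_pairs(words, toy_gene_dict, toy_pheno_dict, ignore_case_a=False)
```

Each pair is one candidate relation: a gene span and a phenotype span from the same sentence. Overlapping phenotype matches produce separate candidates in this sketch.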

We can also render a visualization of a relation and its context within the sentence:

In [390]:
rels.relations[1].render()

## Distant Supervision

We can create **_rule functions_** using a variety of helper attributes and tools from both `ddlite` and `treedlib`. **These functions must return values $\in\{-1,0,1\}$.**

In [391]:
def rule_1(r):
    return 1 if 'mutat' in r.lemmas else 0

def rule_2(r):
    # Escape the braces so they are clearly matched as literal characters
    return 1 if re.search(r'\{\{G\}\}.*in patients with.*\{\{P\}\}', r.tagged_sent) else 0

def rule_3(r):
    return 1 if len(r.e2_idxs) > 1 else -1

rules = [rule_1, rule_2, rule_3]

In [393]:
rels.apply_rules(rules)
rels.rules

array([[ 1.,  1.],
       [ 0.,  1.],
       [-1.,  1.]])
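
The output above has one row per rule and one column per candidate relation. A minimal sketch of how such a matrix could be assembled is shown below. It is not `ddlite`'s `apply_rules`: the candidates are plain dicts with made-up contents, and the `toy_rule_*` functions only mimic the three rules defined earlier.

```python
import re

def apply_rules(rules, candidates):
    """Return a rules-by-candidates matrix (list of rows) of labels in {-1, 0, 1}."""
    matrix = []
    for rule in rules:
        row = []
        for cand in candidates:
            label = rule(cand)
            if label not in (-1, 0, 1):
                raise ValueError('%s returned %r' % (rule.__name__, label))
            row.append(label)
        matrix.append(row)
    return matrix

def toy_rule_1(c):
    return 1 if 'mutat' in c['lemmas'] else 0

def toy_rule_2(c):
    pattern = r'\{\{G\}\}.*in patients with.*\{\{P\}\}'
    return 1 if re.search(pattern, c['tagged_sent']) else 0

def toy_rule_3(c):
    return 1 if len(c['e2_idxs']) > 1 else -1

# Made-up candidates, for illustration only
toy_candidates = [
    {'lemmas': ['mutat', 'in', 'patient'],
     'tagged_sent': 'mutations in {{G}} are rare in {{P}}',
     'e2_idxs': [5]},
    {'lemmas': ['mutat', 'patient'],
     'tagged_sent': '{{G}} mutations are present in patients with {{P}}',
     'e2_idxs': [7, 8]},
]
toy_matrix = apply_rules([toy_rule_1, toy_rule_2, toy_rule_3], toy_candidates)
```

Validating the return value at this point catches a rule that accidentally returns `True`/`False` or `None` before the labels reach the learning step.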

In [397]:
rels.get_rule_priority_vote_accuracy([1, 1])

1.0
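
The source does not define "priority vote". One plausible reading, and purely an assumption here, is that each candidate takes the label of the first rule (in list order) that returns a nonzero value, and that the argument `[1, 1]` holds the true labels of the two candidates. The sketch below implements that reading; `priority_vote` and `accuracy` are hypothetical names.

```python
def priority_vote(matrix):
    """For each candidate (column), take the first nonzero label in rule order."""
    n_cands = len(matrix[0]) if matrix else 0
    votes = []
    for j in range(n_cands):
        vote = 0  # stays 0 if every rule abstains
        for row in matrix:
            if row[j] != 0:
                vote = row[j]
                break
        votes.append(vote)
    return votes

def accuracy(votes, truth):
    """Fraction of candidates whose vote equals the true label."""
    correct = sum(1 for v, t in zip(votes, truth) if v == t)
    return correct / float(len(truth))

matrix = [[1, 1], [0, 1], [-1, 1]]  # the rule matrix shown earlier
votes = priority_vote(matrix)
acc = accuracy(votes, [1, 1])
```

Under this reading the first rule decides both candidates, so the third rule's `-1` on the first candidate is never consulted. That agrees with the `1.0` reported above, although agreement on one example does not confirm that this is what the library computes.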

## Feature Extraction

Feature extraction takes a single call, although custom `treedlib` feature sets can be passed in as well:

In [366]:
rels.extract_features()
rels.F

<95x2 sparse matrix of type '<type 'numpy.float64'>'
	with 187 stored elements in Compressed Sparse Row format>

## Learning

Here we use a very simple learning method and implementation:

In [368]:
rels.learn_feats_and_weights(sample=True, verbose=True)

Learning epoch = 0
Learning epoch = 100
Learning epoch = 200
Learning epoch = 300
Learning epoch = 400
Learning epoch = 500
Learning epoch = 600
Learning epoch = 700
Learning epoch = 800
Learning epoch = 900
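
The notebook does not say which method `learn_feats_and_weights` uses. Purely as an illustration of what a very simple learner with this kind of epoch log could look like, here is plain logistic regression trained by stochastic gradient descent in pure Python. The function names, learning rate, and toy data are all assumptions, and nothing here is taken from `ddlite`.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, n_epochs=1000, lr=0.1, seed=0, verbose=False):
    """X: list of feature lists; y: labels in {-1, 1}. Returns the weight list."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for epoch in range(n_epochs):
        if verbose and epoch % 100 == 0:
            print('Learning epoch = %d' % epoch)
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            # Gradient of log(1 + exp(-y*z)) w.r.t. w is -y * sigmoid(-y*z) * x
            g = -y[i] * sigmoid(-y[i] * z)
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
    return w

# Toy data: feature 0 marks a positive example, feature 1 a negative one
toy_X = [[1.0, 0.0], [0.0, 1.0]]
toy_y = [1, -1]
toy_w = train_logreg(toy_X, toy_y)
```

After training, the weight on the first feature is positive and the weight on the second is negative, so the sign of the dot product recovers the toy labels.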


  
