# TreeDLib

Given a set of **dictionaries** $D_1,...,D_n$, we wish to generate a set of XPath / regex expressions that yield features...

In [37]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('src')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Dependency paths

I want to build some simple dep path / tree viewer code...

In [59]:
from util import load_sentences
sents = load_sentences('test/test1.parsed.tsv')

In [60]:
print sents[1]

SentenceInput(doc_id=1, sent_id=1, text='Autosomal dominant polycystic kidney disease is the most common human monogenic disorder and is caused by mutations in the PKD1 or PKD2 genes.', words=['Autosomal', 'dominant', 'polycystic', 'kidney', 'disease', 'is', 'the', 'most', 'common', 'human', 'monogenic', 'disorder', 'and', 'is', 'caused', 'by', 'mutations', 'in', 'the', 'PKD1', 'or', 'PKD2', 'genes', '.'], lemmas=['autosomal', 'dominant', 'polycystic', 'kidney', 'disease', 'be', 'the', 'most', 'common', 'human', 'monogenic', 'disorder', 'and', 'be', 'cause', 'by', 'mutation', 'in', 'the', 'pkd1', 'or', 'pkd2', 'gene', '.'], poses=['JJ', 'JJ', 'JJ', 'NN', 'NN', 'VBZ', 'DT', 'RBS', 'JJ', 'JJ', 'JJ', 'NN', 'CC', 'VBZ', 'VBN', 'IN', 'NNS', 'IN', 'DT', 'NN', 'CC', 'NN', 'NNS', '.'], ners=['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], word_idxs=[0, 10, 19, 30, 37, 45, 48, 52, 57, 64, 70, 80, 89, 93, 96, 103, 106, 116,

Note that some of the dependency parses have **multiple root nodes**; my guess is that only _one_ of them is the real root node, and the rest are 'duds', i.e. roots _without children_.

I'll rely on this hypothesis as a workaround for now, but I should check the documentation of the [CoreNLP dep parser options](http://stanfordnlp.github.io/CoreNLP/depparse.html) for how to deal with this properly...
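A minimal sketch of that workaround, assuming the parse is available as a list of 1-based parent indices with `0` marking a root. The name `dep_parents` and this encoding are assumptions for illustration, not something taken from `SentenceInput` or `DepTree`:

```python
def find_real_root(dep_parents):
    """Pick the single root that has children.

    dep_parents[i - 1] is the (1-based) parent index of token i,
    and 0 marks a root. This encoding is an assumption.
    """
    roots = [i for i, p in enumerate(dep_parents, start=1) if p == 0]
    if len(roots) == 1:
        return roots[0]
    has_children = set(dep_parents)
    candidates = [r for r in roots if r in has_children]
    if len(candidates) != 1:
        # The "one real root, the rest are duds" hypothesis does not hold here.
        raise ValueError("expected exactly one root with children, got %r" % candidates)
    return candidates[0]


# Token 2 is a root with children (tokens 1 and 3); token 4 is a childless 'dud' root.
real_root = find_real_root([2, 0, 2, 0])
```

Raising on the ambiguous case, rather than silently picking the first root, makes it obvious when the hypothesis fails on real data.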

In [221]:
from dep_path import DepTree
dts = [DepTree(s) for s in sents]  # a list, so it can be indexed below

In [222]:
dt = dts[1]
dt.render_tree()

In [240]:
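# NOTE: `root` is the XML tree built in cell In [227] further down.
# Ancestor words of each mention, in document order (sentence root first)
p1 = root.xpath(".//node[@cid='P1']/ancestor::node/@word")
p2 = root.xpath(".//node[@cid='G1']/ancestor::node/@word")
# Words that appear among the ancestors of both mentions
common = set(p1).intersection(p2)
b1 = []
b2 = []
# Walk up from P1 until the first shared ancestor (inclusive)
for n in reversed(p1):
    b1.append(n)
    if n in common:
        break
# Walk up from G1 until the first shared ancestor (exclusive)
for n in reversed(p2):
    if n in common:
        break
    b2.append(n)
# Path: up from P1 to the shared ancestor, then down towards G1
print b1 + b2[::-1]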


['disease', 'disorder', 'caused', 'mutations']
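The cell above compares nodes by their `word` attribute, which could go wrong when a word repeats in the sentence (this one contains 'is' twice). Below is a sketch of the same idea that compares nodes by identity instead. It uses the standard library's `xml.etree.ElementTree` with a parent map rather than lxml's `ancestor::` axis, and the toy tree and the name `path_between` are made up for illustration; they are not the real parse or part of `dep_path`:

```python
import xml.etree.ElementTree as ET

# Hypothetical toy tree, NOT the actual CoreNLP parse of the sentence.
toy_root = ET.fromstring(
    "<node word='disorder'>"
    "<node word='disease' cid='P1'><node word='kidney'/></node>"
    "<node word='caused'><node word='mutations'><node word='genes'>"
    "<node word='PKD1' cid='G1'/>"
    "</node></node></node>"
    "</node>"
)


def path_between(root, a, b):
    """Return the nodes on the path from a to b via their lowest common ancestor."""
    parent = {child: p for p in root.iter() for child in p}

    def chain(n):
        # n followed by all of its ancestors, nearest first
        out = [n]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return out

    up_a, up_b = chain(a), chain(b)
    idx_b = {n: j for j, n in enumerate(up_b)}
    for i, n in enumerate(up_a):
        if n in idx_b:
            # a .. common ancestor, then back down to b
            return up_a[:i + 1] + up_b[:idx_b[n]][::-1]
    raise ValueError("nodes are not in the same tree")


a = toy_root.find(".//node[@cid='P1']")
b = toy_root.find(".//node[@cid='G1']")
path_words = [n.get('word') for n in path_between(toy_root, a, b)]
```

On this toy tree, `path_words` runs from 'disease' up to 'disorder' and back down to 'PKD1'.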


# Getting features with TreeDLib

Let's start by assuming we have an XML tree as input, of the form generated by `DepTree.to_xml_str()`.

Let's first mark the candidate mentions in the above example: the disease phrase _Autosomal dominant polycystic kidney disease_ as `P1`, and the two genes **PKD1** and **PKD2** as `G1` and `G2` respectively:

In [227]:
import lxml.etree as et
root = dt.to_xml()
root.findall(".//node[@word='Autosomal']")[0].set('cid', 'P1')
root.findall(".//node[@word='dominant']")[0].set('cid', 'P1')
root.findall(".//node[@word='polycystic']")[0].set('cid', 'P1')
root.findall(".//node[@word='kidney']")[0].set('cid', 'P1')
root.findall(".//node[@word='disease']")[0].set('cid', 'P1')
root.findall(".//node[@word='PKD1']")[0].set('cid', 'G1')
root.findall(".//node[@word='PKD2']")[0].set('cid', 'G2')

In [195]:
len(root.findall(".//node[@cid='G2']/../node"))

3

In [220]:
res = root.xpath(".//node[@word='kidney']/preceding-sibling::node/@lemma")
print len(res)
print res

3
['autosomal', 'dominant', 'polycystic']


Now let's try executing a simple feature:

In [266]:
from get_features import Get, Mention, Left, Right, Between
for f in Get(Mention('P1'), 'lemma').apply(root):
    print f

LEMMA-MENTION[disease]
LEMMA-MENTION[autosomal]
LEMMA-MENTION[dominant]
LEMMA-MENTION[polycystic]
LEMMA-MENTION[kidney]
LEMMA-MENTION[disease_autosomal]
LEMMA-MENTION[autosomal_dominant]
LEMMA-MENTION[dominant_polycystic]
LEMMA-MENTION[polycystic_kidney]
LEMMA-MENTION[disease_autosomal_dominant]
LEMMA-MENTION[autosomal_dominant_polycystic]
LEMMA-MENTION[dominant_polycystic_kidney]
LEMMA-MENTION[disease_autosomal_dominant_polycystic]
LEMMA-MENTION[autosomal_dominant_polycystic_kidney]
LEMMA-MENTION[disease_autosomal_dominant_polycystic_kidney]
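The output above looks like every contiguous n-gram over the mention's lemmas, in the order the nodes were returned. A minimal sketch that reproduces this listing is shown below; `ngram_features` is a hypothetical helper, not the actual implementation in `get_features`:

```python
def ngram_features(tokens, prefix="LEMMA-MENTION", max_n=None):
    """Enumerate all contiguous n-grams of tokens as feature strings."""
    max_n = len(tokens) if max_n is None else max_n
    feats = []
    for n in range(1, max_n + 1):               # n-gram length
        for i in range(len(tokens) - n + 1):    # start position
            feats.append("%s[%s]" % (prefix, "_".join(tokens[i:i + n])))
    return feats


# Lemmas in the order they appear in the output above
feats = ngram_features(['disease', 'autosomal', 'dominant', 'polycystic', 'kidney'])
```

For a mention of $k$ tokens this yields $k(k+1)/2$ features (15 for $k = 5$), so a `max_n` cap may be worth having for long mentions.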


In [267]:
for f in Get(Between(Mention('P1'), Mention('G2')), 'lemma').apply(root):
    print f

LEMMA-BETWEEN-MENTION-MENTION[disease_disorder_cause_mutation_pkd1]


# Notes

## TreeDLib features

Some interesting ones:
* Anything involving **counting / aggregation**
* In a similar vein, **_path length_**

Some good basic features _not_ involving external dictionaries
* Siblings & subsets of sibling set / path to
* Parents & subsets of the parent set / path to
    * Candidate direct parent
    * Candidate sentence ROOT
* **Path between** (relationships)

As for features _involving a dictionary_, do we ever want anything other than the path from the candidate to, e.g., a keyword (KW)...?
* Path from one candidate entity to a KW
* Indicator of whether KW is on path between two entities
* Indicator of whether KW is in (sentence, paragraph, doc)
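The two indicator features in this list can be sketched as plain set-membership checks over lemmas. The function name, feature names, and keyword set below are all hypothetical and only illustrate the idea; the path lemmas are taken from the `Between` output above:

```python
def keyword_indicators(path_lemmas, sentence_lemmas, keywords, dict_name="KW"):
    """Return indicator features for dictionary keywords on a path / in a sentence."""
    kws = set(k.lower() for k in keywords)
    feats = []
    if any(l in kws for l in path_lemmas):
        feats.append("%s-ON-PATH" % dict_name)       # hypothetical feature name
    if any(l in kws for l in sentence_lemmas):
        feats.append("%s-IN-SENTENCE" % dict_name)   # hypothetical feature name
    return feats


path_lemmas = ['disease', 'disorder', 'cause', 'mutation', 'pkd1']
sentence_lemmas = ['autosomal', 'dominant', 'polycystic', 'kidney', 'disease', 'be',
                   'the', 'most', 'common', 'human', 'monogenic', 'disorder', 'and',
                   'be', 'cause', 'by', 'mutation', 'in', 'the', 'pkd1', 'or', 'pkd2',
                   'gene', '.']
indicators = keyword_indicators(path_lemmas, sentence_lemmas, ['cause', 'mutation'])
```

The paragraph- and document-level indicators would follow the same pattern with a wider lemma list.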

## Integrating dictionaries

Let's start with a simple example from DDLIB: the path from a candidate mention to a keyword.