# TreeDLib

Given a set of **dictionaries** $D_1,\ldots,D_n$, we wish to generate a set of XPath / regex expressions that yield features...

In [37]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('src')


## Dependency paths

I want to build some simple code for viewing dependency paths / trees...

In [59]:
from util import load_sentences
sents = load_sentences('test/test1.parsed.tsv')

In [60]:
print(sents[1])

SentenceInput(doc_id=1, sent_id=1, text='Autosomal dominant polycystic kidney disease is the most common human monogenic disorder and is caused by mutations in the PKD1 or PKD2 genes.', words=['Autosomal', 'dominant', 'polycystic', 'kidney', 'disease', 'is', 'the', 'most', 'common', 'human', 'monogenic', 'disorder', 'and', 'is', 'caused', 'by', 'mutations', 'in', 'the', 'PKD1', 'or', 'PKD2', 'genes', '.'], lemmas=['autosomal', 'dominant', 'polycystic', 'kidney', 'disease', 'be', 'the', 'most', 'common', 'human', 'monogenic', 'disorder', 'and', 'be', 'cause', 'by', 'mutation', 'in', 'the', 'pkd1', 'or', 'pkd2', 'gene', '.'], poses=['JJ', 'JJ', 'JJ', 'NN', 'NN', 'VBZ', 'DT', 'RBS', 'JJ', 'JJ', 'JJ', 'NN', 'CC', 'VBZ', 'VBN', 'IN', 'NNS', 'IN', 'DT', 'NN', 'CC', 'NN', 'NNS', '.'], ners=['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], word_idxs=[0, 10, 19, 30, 37, 45, 48, 52, 57, 64, 70, 80, 89, 93, 96, 103, 106, 116,

Note that some of the dependency parses have **multiple root nodes**; however, I am guessing that only _one_ of them is the real root node and that the rest are 'duds', i.e. nodes _without children_.

I'll rely on this hypothesis to work around the problem for now, but I should check the documentation of the [CoreNLP dep parser options](http://stanfordnlp.github.io/CoreNLP/depparse.html) for how to deal with it properly...
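
As a minimal sketch of that workaround (not the code in `dep_path`), the function below picks the one root that has children. It assumes the parse is available as a plain list of 1-based parent ids with `0` marking a root, which matches the `dep_parent` / `id` attributes in the XML output further down; the function name and the toy input are hypothetical.

```python
def pick_real_root(dep_parents):
    """Return the 1-based id of the root token that has at least one child.

    dep_parents[i] is the 1-based parent id of token i + 1; 0 marks a root.
    (This list-of-ints layout is an assumption made for illustration.)
    """
    roots = [i + 1 for i, p in enumerate(dep_parents) if p == 0]
    if len(roots) == 1:
        # Only one root: nothing to disambiguate, even if it has no children.
        return roots[0]
    has_children = set(p for p in dep_parents if p != 0)
    real = [r for r in roots if r in has_children]
    if len(real) != 1:
        # The "one real root, the rest are duds" hypothesis does not hold here.
        raise ValueError("expected exactly one root with children, got %r" % (real,))
    return real[0]


# Toy parse: token 2 is the real root, token 4 is a childless extra root.
toy_parents = [2, 0, 2, 0]
real_root = pick_real_root(toy_parents)
```

Raising an error when the hypothesis fails, rather than silently picking the first root, makes it easy to see how often the guess is wrong on real data.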

In [94]:
from dep_path import DepTree
# list() keeps dts indexable if map() returns an iterator (Python 3)
dts = list(map(DepTree, sents))

In [97]:
dt = dts[1]
import xml.etree.ElementTree as et
root = dt.to_xml()

In [98]:
dt.to_xml_str()

'<node dep_parent="0" id="12" lemma="disorder" ner="O" pos="NN" word="disorder" word_idx="80"><node dep_label="nsubj" dep_parent="12" id="5" lemma="disease" ner="O" pos="NN" word="disease" word_idx="37"><node dep_label="amod" dep_parent="5" id="1" lemma="autosomal" ner="O" pos="JJ" word="Autosomal" word_idx="0" /><node dep_label="amod" dep_parent="5" id="2" lemma="dominant" ner="O" pos="JJ" word="dominant" word_idx="10" /><node dep_label="amod" dep_parent="5" id="3" lemma="polycystic" ner="O" pos="JJ" word="polycystic" word_idx="19" /><node dep_label="nn" dep_parent="5" id="4" lemma="kidney" ner="O" pos="NN" word="kidney" word_idx="30" /></node><node dep_label="cop" dep_parent="12" id="6" lemma="be" ner="O" pos="VBZ" word="is" word_idx="45" /><node dep_label="det" dep_parent="12" id="7" lemma="the" ner="O" pos="DT" word="the" word_idx="48" /><node dep_label="amod" dep_parent="12" id="9" lemma="common" ner="O" pos="JJ" word="common" word_idx="57"><node dep_label="advmod" dep_parent="9" 

In [57]:
dts[1].render_tree()

## TreeDLib features

Some interesting ones:
* Anything involving **counting / aggregation**
* In a similar vein, **_path length_**

Some good basic features _not_ involving external dictionaries:
* Siblings & subsets of sibling set / path to
* Parents & subsets of the parent set / path to
    * Candidate direct parent
    * Candidate sentence ROOT
* **Path between** (relationships)
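
The parent and sibling features above map fairly directly onto XPath. The sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree` on a hand-written toy tree that only imitates the attribute layout of the `to_xml_str()` output; the toy XML and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written toy tree imitating the <node .../> layout shown earlier.
TOY_XML = (
    '<node id="12" word="disorder" dep_parent="0">'
    '<node id="5" word="disease" dep_label="nsubj" dep_parent="12">'
    '<node id="3" word="polycystic" dep_label="amod" dep_parent="5" />'
    '<node id="4" word="kidney" dep_label="nn" dep_parent="5" />'
    '</node>'
    '<node id="6" word="is" dep_label="cop" dep_parent="12" />'
    '</node>'
)
root = ET.fromstring(TOY_XML)

cid = "4"  # id of the candidate token ("kidney")

# Candidate's direct parent: select the candidate, then step up with "..".
parent = root.find(".//node[@id='%s']/.." % cid)

# Candidate's siblings: all children of that parent, minus the candidate.
siblings = [
    n for n in root.findall(".//node[@id='%s']/../node" % cid)
    if n.get("id") != cid
]

# The sentence ROOT is simply the top element of the tree.
sentence_root = root
```

Note that `.//node` only matches descendants of the start element, so this lookup returns `None` when the candidate is the root itself.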

As for features _involving a dictionary_, do we ever want anything other than the path from the candidate to, e.g., a keyword (KW)?
* Path from one candidate entity to a KW
* Indicator of whether KW is on path between two entities
* Indicator of whether KW is in (sentence, paragraph, doc)
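
A rough sketch of the indicator features, plus the path length mentioned earlier, is given below. It assumes a single-rooted tree given as a list of 1-based parent ids (`0` for the root) alongside a list of lemmas; the toy sentence, the keyword set, and all names are made up for illustration.

```python
def path_to_root(i, parents):
    """Token ids from i up to the root, following 1-based parent ids (0 = root)."""
    path = [i]
    while parents[i - 1] != 0:
        i = parents[i - 1]
        path.append(i)
    return path


def path_between(a, b, parents):
    """Token ids on the tree path from a to b, inclusive (assumes one root)."""
    up_a, up_b = path_to_root(a, parents), path_to_root(b, parents)
    on_b = set(up_b)
    # First ancestor of a that is also an ancestor of b: the lowest common ancestor.
    lca = next(t for t in up_a if t in on_b)
    return up_a[:up_a.index(lca) + 1] + up_b[:up_b.index(lca)][::-1]


# Toy data: gene -> mutation -> cause (root) <- disease <- kidney
LEMMAS = ["gene", "mutation", "cause", "kidney", "disease"]
PARENTS = [2, 3, 0, 5, 3]
KEYWORDS = {"cause"}  # hypothetical dictionary

path = path_between(1, 5, PARENTS)   # entity 1 = "gene", entity 2 = "disease"
path_length = len(path) - 1          # number of edges on the path
# Only tokens strictly between the two entities count as "on the path".
kw_on_path = any(LEMMAS[t - 1] in KEYWORDS for t in path[1:-1])
kw_in_sentence = any(lemma in KEYWORDS for lemma in LEMMAS)
```

The paragraph- and document-level indicators would follow the same pattern as `kw_in_sentence`, applied to a larger span of lemmas.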

## Integrating dictionaries

Let's start with a simple example from DDLIB: the path from a candidate mention to a keyword.
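
One possible shape for such a feature over the XML tree is sketched below. This is not DDLIB's implementation: the toy XML, the `KW_PATH[...]` feature format, and the function names are all assumptions, and the sketch simply takes the first keyword node in document order rather than the nearest one.

```python
import xml.etree.ElementTree as ET

# Hand-written toy tree: gene -> mutation -> cause (root) <- disease <- kidney
TOY_XML = (
    '<node id="3" lemma="cause" dep_parent="0">'
    '<node id="2" lemma="mutation" dep_label="nsubj" dep_parent="3">'
    '<node id="1" lemma="gene" dep_label="nn" dep_parent="2" />'
    '</node>'
    '<node id="5" lemma="disease" dep_label="dobj" dep_parent="3">'
    '<node id="4" lemma="kidney" dep_label="nn" dep_parent="5" />'
    '</node>'
    '</node>'
)
root = ET.fromstring(TOY_XML)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """The node itself followed by its ancestors, ending at the root."""
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def kw_path_feature(cand_id, keywords):
    """Lemma path from the candidate to the first keyword node, or None."""
    cand = next(n for n in root.iter("node") if n.get("id") == cand_id)
    up_c = ancestors(cand)
    for kw in root.iter("node"):
        if kw is not cand and kw.get("lemma") in keywords:
            up_k = ancestors(kw)
            # Lowest common ancestor of candidate and keyword.
            lca = next(n for n in up_c if n in up_k)
            path = up_c[:up_c.index(lca) + 1] + up_k[:up_k.index(lca)][::-1]
            return "KW_PATH[" + " ".join(n.get("lemma") for n in path) + "]"
    return None


feature = kw_path_feature("1", {"disease"})
```

Using dependency labels instead of lemmas along the path would give a more general feature; which of the two works better is an open question here.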