# TreeDLib

In [348]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('src')



## Simple demo: Generating DDLib Mention features

As a first simple demo, let's generate (almost all of) the mention-level features produced by [ddlib](http://deepdive.stanford.edu/doc/basics/gen_feats.html).

First, let's load a few sample sentences and convert one of them to XML format for testing (and also visualize it, for kicks!)

In [349]:
from util import load_sentences, tag_candidate
from dep_path import DepTree
# Build a list (not a lazy map object) so that indexing below works in Python 2 and 3
dts = [DepTree(s) for s in load_sentences('test/test1.parsed.tsv')]
dt = dts[1]
dt.render_tree()

Next, let's tag some of the nodes as **candidates** by giving them a non-null **cid** (using an extremely hacky method for now...)

In [484]:
root = dt.to_xml()
tag_candidate(root, ['Autosomal', 'dominant', 'polycystic', 'kidney', 'disease'], 'P1')
tag_candidate(root, ['PKD1'], 'G1')
tag_candidate(root, ['PKD2'], 'G2')
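
To make the idea of tagging concrete, here is a minimal sketch of what setting a **cid** on matching nodes could look like. It is not the actual `tag_candidate` implementation: the helper name `tag_by_words`, the toy XML, and the `word` attribute name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Toy sentence tree; element and attribute names are assumed, not TreeDLib's real schema.
TOY_XML = """
<sentence>
  <node word="mutations">
    <node word="PKD1"/>
    <node word="PKD2"/>
  </node>
</sentence>
"""

def tag_by_words(root, words, cid):
    """Set cid on every <node> whose word is in `words`; return the number tagged."""
    wanted = set(words)
    tagged = 0
    for node in root.iter("node"):
        if node.get("word") in wanted:
            node.set("cid", cid)
            tagged += 1
    return tagged

toy_root = ET.fromstring(TOY_XML.strip())
n_tagged = tag_by_words(toy_root, ["PKD1"], "G1")
tagged_words = [n.get("word") for n in toy_root.iter("node") if n.get("cid") == "G1"]
print(tagged_words)
```

Matching on surface words alone tags every occurrence of a word in the sentence, which is one reason such an approach is only suitable for a demo.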

In [485]:
# Loading the feature template lib
from feature_template import Indicator, Mention, Left, Right, Between, Keyword, RgxIndicator

Let's start by looking at a simple **feature template**, which gets the **POS tags** of the nodes comprising a mention:

In [488]:
Indicator(Mention('P1'), 'pos')

<POS-MENTION, XPaths='set(["//node[@cid='P1']/@pos"])', subsets=None>

We see that the templates are compositional. We can _apply_ them to the root node of an XML tree (we use a convenience method, `apply_and_print`, which just prints the results of `apply` one per line):

In [491]:
Indicator(Mention('P1'), 'pos').apply_and_print(root)

POS-MENTION[NN_JJ_JJ_JJ_NN]
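
The XPath in the template's repr suggests that applying it amounts to selecting an attribute from every node tagged with the given cid. The following sketch reproduces that idea on a toy tree using only the standard library. The XML layout, the `mention_attrib` helper, and the underscore-joining are assumptions for illustration, not TreeDLib's implementation.

```python
import xml.etree.ElementTree as ET

# Toy tree with the mention head first, then its modifiers (assumed layout).
TOY_XML = """
<sentence>
  <node word="disease" pos="NN" cid="P1">
    <node word="Autosomal" pos="JJ" cid="P1"/>
    <node word="dominant" pos="JJ" cid="P1"/>
    <node word="polycystic" pos="JJ" cid="P1"/>
    <node word="kidney" pos="NN" cid="P1"/>
  </node>
  <node word="PKD1" pos="NN" cid="G1"/>
</sentence>
"""

def mention_attrib(root, cid, attrib):
    """Join the given attribute of all nodes carrying `cid`, in document order."""
    # ElementTree cannot select attributes via XPath (no /@pos), so we
    # select the nodes and read the attribute in Python instead.
    nodes = root.findall(".//node[@cid='%s']" % cid)
    return "_".join(n.get(attrib) for n in nodes)

toy_root = ET.fromstring(TOY_XML.strip())
pos_feature = "POS-MENTION[%s]" % mention_attrib(toy_root, "P1", "pos")
print(pos_feature)
```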


Or, all the n-grams of length up to 3 comprising the mention:

In [492]:
Indicator(Mention('P1', subsets=3), 'lemma').apply_and_print(root)

LEMMA-MENTION[disease]
LEMMA-MENTION[autosomal]
LEMMA-MENTION[dominant]
LEMMA-MENTION[polycystic]
LEMMA-MENTION[kidney]
LEMMA-MENTION[disease_autosomal]
LEMMA-MENTION[autosomal_dominant]
LEMMA-MENTION[dominant_polycystic]
LEMMA-MENTION[polycystic_kidney]
LEMMA-MENTION[disease_autosomal_dominant]
LEMMA-MENTION[autosomal_dominant_polycystic]
LEMMA-MENTION[dominant_polycystic_kidney]
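
The output above is consistent with enumerating all contiguous n-grams of length 1 through 3 over the mention's lemmas. A minimal sketch of that enumeration follows; the function name `ngrams_up_to` is hypothetical, and whether `subsets=3` is implemented this way inside the library is an assumption.

```python
def ngrams_up_to(tokens, max_n):
    """Return all contiguous n-grams of length 1..max_n, shortest first."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out

# Lemmas in the same order as they appear in the output above.
lemmas = ["disease", "autosomal", "dominant", "polycystic", "kidney"]
features = ["LEMMA-MENTION[%s]" % g for g in ngrams_up_to(lemmas, 3)]
for f in features:
    print(f)
```

For a mention of 5 tokens this yields 5 + 4 + 3 = 12 features, so the feature count grows linearly with mention length for a fixed maximum n.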


Another example: the lemmas of the _siblings_ to the right of the mention:

In [493]:
Indicator(Right(Mention('G2')), 'lemma').apply_and_print(root)

LEMMA-RIGHT-OF-MENTION[gene]
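
As a rough illustration of what "siblings to the right" means in an XML tree, the sketch below finds the nodes that share a parent with the mention and come after it. The toy XML and the `right_siblings` helper are assumptions; the library presumably expresses this as an XPath rather than in Python.

```python
import xml.etree.ElementTree as ET

# Toy tree: three sibling nodes under one head (assumed layout).
TOY_XML = """
<sentence>
  <node lemma="mutation">
    <node lemma="the"/>
    <node lemma="pkd2" cid="G2"/>
    <node lemma="gene"/>
  </node>
</sentence>
"""

def right_siblings(root, node):
    """Return the siblings that follow `node` under the same parent."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]

toy_root = ET.fromstring(TOY_XML.strip())
mention = toy_root.find(".//node[@cid='G2']")
right_lemmas = [n.get("lemma") for n in right_siblings(toy_root, mention)]
print(right_lemmas)
```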


Another example: the dep-labels **between** the mention and a keyword from a dictionary:

In [495]:
d = ['disorder', 'disease']
Indicator(Between(Keyword(d), Mention('G1')), 'dep_label').apply_and_print(root)

DEP_LABEL-BETWEEN-KEYWORD-MENTION[nsubj_conj_and_agent_prep_in]
DEP_LABEL-BETWEEN-KEYWORD-MENTION[conj_and_agent_prep_in]
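
A "between" feature needs the path connecting two nodes in the dependency tree. One common way to compute it is to walk up from both nodes to their lowest common ancestor. The sketch below does this on a toy tree stored as child-to-parent pointers. The tree, the labels, and the function names are made up for illustration and do not reproduce the parse above or TreeDLib's own method.

```python
def ancestors(parent, node):
    """Return [node, parent(node), ..., root]."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def path_between(parent, a, b):
    """Return the nodes on the tree path from a to b, inclusive."""
    up, down = ancestors(parent, a), ancestors(parent, b)
    down_set = set(down)
    common = next(n for n in up if n in down_set)  # lowest common ancestor
    return up[:up.index(common) + 1] + down[:down.index(common)][::-1]

# Toy dependency tree (assumed): child -> parent, and child -> label of that edge.
parent = {"disorder": "caused", "mutations": "caused", "PKD1": "mutations"}
dep_label = {"disorder": "nsubj", "mutations": "agent", "PKD1": "prep_in"}

path = path_between(parent, "disorder", "PKD1")
# Every node on the path except the common ancestor contributes its incoming edge label.
labels = [dep_label[n] for n in path if n in dep_label and n != "caused"]
feature = "DEP_LABEL-BETWEEN-KEYWORD-MENTION[%s]" % "_".join(labels)
print(path)
print(feature)
```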


We can generate all of the DDLib features fairly cleanly (except for the counting ones, which are left for later; see `basic_features.py`):

In [497]:
from basic_features import get_mention_features, get_relation_features
for f in get_mention_features('G2', d, root):
    print(f)

POS-MENTION[NN]
NER-MENTION[O]
LEMMA-MENTION[pkd2]
WORD-MENTION[PKD2]
STARTS_WITH_CAPITAL
LEMMA-LEFT-OF-MENTION[the]
LEMMA-RIGHT-OF-MENTION[gene]
WORD-KEYWORD[disorder]
WORD-KEYWORD[disease]
LEMMA-BETWEEN-MENTION-KEYWORD[pkd1_mutation_cause_disorder]
LEMMA-BETWEEN-MENTION-KEYWORD[pkd1_mutation_cause]
DEP_LABEL-BETWEEN-MENTION-KEYWORD[conj_or_prep_in_agent_conj_and_nsubj]
DEP_LABEL-BETWEEN-MENTION-KEYWORD[conj_or_prep_in_agent_conj_and]


And the relation features:

In [499]:
for f in get_relation_features('P1', 'G1', d, root):
    print(f)

LEMMA-BETWEEN-MENTION-MENTION[polycystic_dominant_autosomal_disease_disorder_cause_mutation]
WORD-BETWEEN-MENTION-MENTION[polycystic_dominant_Autosomal_disease_disorder_caused_mutations]
STARTS_WITH_CAPITAL_1
STARTS_WITH_CAPITAL_2
LEMMA-BETWEEN-MENTION-MENTION[polycystic]
LEMMA-BETWEEN-MENTION-MENTION[dominant]
LEMMA-BETWEEN-MENTION-MENTION[autosomal]
LEMMA-BETWEEN-MENTION-MENTION[disease]
LEMMA-BETWEEN-MENTION-MENTION[disorder]
LEMMA-BETWEEN-MENTION-MENTION[cause]
LEMMA-BETWEEN-MENTION-MENTION[mutation]
LEMMA-BETWEEN-MENTION-MENTION[polycystic_dominant]
LEMMA-BETWEEN-MENTION-MENTION[dominant_autosomal]
LEMMA-BETWEEN-MENTION-MENTION[autosomal_disease]
LEMMA-BETWEEN-MENTION-MENTION[disease_disorder]
LEMMA-BETWEEN-MENTION-MENTION[disorder_cause]
LEMMA-BETWEEN-MENTION-MENTION[cause_mutation]
LEMMA-BETWEEN-MENTION-MENTION[polycystic_dominant_autosomal]
LEMMA-BETWEEN-MENTION-MENTION[dominant_autosomal_disease]
LEMMA-BETWEEN-MENTION-MENTION[autosomal_disease_disorder]
LEMMA-BETWEEN-MENTION-M

# Notes

## TreeDLib features

Some interesting ones:
* Anything involving **counting / aggregation**
* In a similar vein, **_path length_**

Some good basic features _not_ involving external dictionaries:
* Siblings & subsets of sibling set / path to
* Parents & subsets of the parent set / path to
    * Candidate direct parent
    * Candidate sentence ROOT
* **Path between** (relationships)

As for features _involving a dictionary_, do we ever want anything other than the path from the candidate to, e.g., a keyword (KW)?
* Path from one candidate entity to a KW
* Indicator of whether KW is on path between two entities
* Indicator of whether KW is in (sentence, paragraph, doc)
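
As a sketch of how the counting-style ideas above (path length, keyword on the path) might be turned into features once a path is available, consider the following. The feature names and the `path_features` helper are hypothetical, and the path is a toy list of lemmas rather than real library output.

```python
def path_features(path_lemmas, keywords):
    """Derive simple count/indicator features from a path given as a list of lemmas."""
    kws = set(keywords)
    hits = [lemma for lemma in path_lemmas if lemma in kws]
    return [
        "PATH-LENGTH[%d]" % len(path_lemmas),        # number of nodes on the path
        "KW-ON-PATH[%d]" % (1 if hits else 0),       # indicator: any keyword on the path
        "KW-COUNT-ON-PATH[%d]" % len(hits),          # how many path nodes are keywords
    ]

toy_path = ["pkd1", "mutation", "cause", "disorder"]
feats = path_features(toy_path, ["disorder", "disease"])
for f in feats:
    print(f)
```

Raw path lengths would likely need bucketing in practice so that rare lengths do not each become their own sparse feature.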