# TreeDLib

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
%load_ext sql


In [18]:
#from treedlib import *
# Note: reloading for submodules doesn't work, so we load directly here
from treedlib.util import *
from treedlib.structs import *
from treedlib.templates import *
from treedlib.features import *
import lxml.etree as et

We define three classes of operators:
* _NodeSets:_ $S : 2^T \to 2^T$
* _Indicators:_ $I : 2^T \to \{0,1\}^F$
* _Combinators:_ $C : \{0,1\}^F \times \{0,1\}^F \to \{0,1\}^F$

where $T$ is a given input tree, and $F$ is the dimension of the feature space.
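
To make the three signatures concrete, here is a minimal sketch on a toy tree. It is not treedlib's implementation: the functions `children`, `indicator` and `combine` are hypothetical names, it uses the standard-library `xml.etree.ElementTree` rather than `lxml`, and a binary vector in $\{0,1\}^F$ is represented sparsely as the set of its active feature names.

```python
import xml.etree.ElementTree as ET

# Toy tree for illustration; attribute names mirror the XML used later in this notebook.
TREE = ET.fromstring(
    '<node word="causes" pos="VBZ">'
    '<node word="gene" pos="NN"/>'
    '<node word="syndrome" pos="NN"/>'
    '</node>'
)

def children(nodes):
    """NodeSet (hypothetical): map a set of nodes to the set of their children."""
    return [c for n in nodes for c in n]

def indicator(nodes, attrib):
    """Indicator (hypothetical): map a node set to the set of active feature names."""
    return {"{0}[{1}]".format(attrib.upper(), n.get(attrib)) for n in nodes}

def combine(feats_a, feats_b):
    """Combinator (hypothetical): conjoin every pair of active features."""
    return {a + "+" + b for a in feats_a for b in feats_b}

root_feats = indicator([TREE], "word")            # features of the root node
child_feats = indicator(children([TREE]), "pos")  # features of its children
combined = combine(root_feats, child_feats)
print(sorted(combined))
```

Both children share the tag `NN`, so the indicator over them activates a single feature, and the combinator yields one conjunction.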

## Genomics debugging

In [5]:
%sql postgresql://ajratner@localhost:6432/genomics_ajratner

u'Connected: ajratner@genomics_ajratner'

In [6]:
res = %sql SELECT words FROM sentences_input WHERE doc_id = '20142852' AND section_id = 'Body.0' AND sent_id = 107;
print ' '.join(res[0][0].split('|^|'))

1 rows affected.
The dHMN V/CMT2D phenotype has subsequently been shown to be more commonly due to mutations in the _ BSCL2 _ gene , -LRB- -RRB- which usually causes Silver syndrome -LRB- spastic legs and distal amyotrophy of the upper limbs -RRB- but can present -LRB- in 33 % of cases -RRB- with just amyotrophy of the upper limbs .


In [7]:
res = %sql SELECT words, lemmas, poses, ners, dep_paths AS "dep_labels", dep_parents FROM sentences_input WHERE doc_id = '20142852' AND section_id = 'Body.0' AND sent_id = 107;
rows = [dict((k, v.split('|^|')) for k,v in dict(row).iteritems()) for row in res]
xts = map(corenlp_to_xmltree, rows)

1 rows affected.
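
The cell above splits each `|^|`-delimited column into a token list and hands the result to `corenlp_to_xmltree`. As a rough sketch of what such a conversion involves (not the library's actual code), the snippet below nests one `<node>` per token under its dependency parent. The helper names are hypothetical, and it assumes `dep_parents` is 1-indexed with `0` marking the root, which is how the `dep_parent` attributes in the XML examples later in this notebook appear to be encoded.

```python
import xml.etree.ElementTree as ET

def split_columns(row, sep="|^|"):
    """Split every delimited column of a row into a list of tokens."""
    return {k: v.split(sep) for k, v in row.items()}

def build_tree(cols):
    """Nest one <node> per token under its dependency parent.

    Assumption: dep_parents is 1-indexed and 0 marks the root.
    """
    nodes = [
        ET.Element("node", word=w, word_idx=str(i))
        for i, w in enumerate(cols["words"])
    ]
    root = None
    for node, parent in zip(nodes, cols["dep_parents"]):
        p = int(parent)
        if p == 0:
            root = node
        else:
            nodes[p - 1].append(node)
    return root

# Made-up three-token row, for illustration only.
row = {"words": "BSCL2|^|causes|^|syndrome", "dep_parents": "2|^|0|^|2"}
root = build_tree(split_columns(row))
print(ET.tostring(root).decode())
```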


In [8]:
%%sql
SELECT gene_wordidxs, pheno_wordidxs
FROM genepheno_causation
WHERE doc_id = '20142852' AND section_id = 'Body.0' AND sent_id = 107;

5 rows affected.


gene_wordidxs,pheno_wordidxs
[18],"[34, 35, 36, 37, 38]"
[18],"[33, 34, 35, 36, 37, 38]"
[18],"[52, 53, 54, 55, 56]"
[18],[52]
[18],"[33, 34]"


In [9]:
gidxs = [18]
pidxs = [52, 53, 54, 55, 56]

In [10]:
xt = xts[0]
xt.render_tree(highlight=[gidxs, pidxs])

In [11]:
Indicator(Between(Mention(0), Mention(1)), 'word').print_apply(xt.root, [gidxs, pidxs])

WORD:BETWEEN-MENTION-and-MENTION[gene_causes_present]


In [12]:
op = (Indicator(Between(Mention(0), Mention(1)), a) for a in ['word','dep_path'])
hasattr(op, '__iter__')

True

In [13]:
c = Compile([(Indicator(Between(Mention(0), Mention(1)), a) for a in ['word','dep_path'])])

In [17]:
c

<Indicator:word:BETWEEN-MENTION-and-MENTION, xpath="//*[{0}][1]/ancestor-or-self::*[count(. | //*[{1}][1]/ancestor-or-self::*) = count(//*[{1}][1]/ancestor-or-self::*)][1]/descendant-or-self::*[(count(.//*[{0}]) = count(//*[{0}])) or (count(.//*[{1}]) = count(//*[{1}]))]">
<Indicator:dep_path:BETWEEN-MENTION-and-MENTION, xpath="//*[{0}][1]/ancestor-or-self::*[count(. | //*[{1}][1]/ancestor-or-self::*) = count(//*[{1}][1]/ancestor-or-self::*)][1]/descendant-or-self::*[(count(.//*[{0}]) = count(//*[{0}])) or (count(.//*[{1}]) = count(//*[{1}]))]">
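
The compiled XPath is dense. As far as I can tell, it finds the lowest common ancestor of the two mentions and then keeps the nodes on the path connecting them. The sketch below expresses that reading procedurally on a toy tree; `ancestors` and `between` are hypothetical helpers, not part of treedlib, and the code assumes that each mention is a single node and that neither mention is the root.

```python
import xml.etree.ElementTree as ET

# Toy tree for illustration only.
TREE = ET.fromstring(
    '<node word="causes" word_idx="2">'
    '<node word="gene" word_idx="1"><node word="BSCL2" word_idx="0"/></node>'
    '<node word="present" word_idx="3"><node word="amyotrophy" word_idx="4"/></node>'
    '</node>'
)

def ancestors(root, target_idx):
    """Return the chain of nodes from the root down to the parent of the target."""
    def walk(node, path):
        if node.get("word_idx") == target_idx:
            return path
        for child in node:
            found = walk(child, path + [node])
            if found is not None:
                return found
        return None
    return walk(root, [])

def between(root, idx_a, idx_b):
    """Nodes on the tree path between two mentions, excluding the mentions."""
    up_a, up_b = ancestors(root, idx_a), ancestors(root, idx_b)
    # Length of the shared prefix; its last node is the lowest common ancestor.
    shared = 0
    while shared < min(len(up_a), len(up_b)) and up_a[shared] is up_b[shared]:
        shared += 1
    lca = up_a[shared - 1]
    # Climb from mention A to the LCA, then descend towards mention B.
    return up_a[shared:][::-1] + [lca] + up_b[shared:]

path_words = [n.get("word") for n in between(TREE, "0", "4")]
print("_".join(path_words))
```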

In [19]:
print_gen(get_relation_features(xt.root, gidxs, pidxs))

WORD:BETWEEN-MENTION-and-MENTION[gene_causes_present]
LEMMA:BETWEEN-MENTION-and-MENTION[gene_cause_present]
POS:BETWEEN-MENTION-and-MENTION[NN_VBZ_VB]
NER:BETWEEN-MENTION-and-MENTION[O_O_DATE]
DEP_LABEL:BETWEEN-MENTION-and-MENTION[prep_in_rcmod_conj_but]
WORD:BETWEEN-MENTION-and-MENTION[gene_causes]
WORD:BETWEEN-MENTION-and-MENTION[causes_present]
WORD:BETWEEN-MENTION-and-MENTION[gene_causes_present]
LEMMA:BETWEEN-MENTION-and-MENTION[gene_cause]
LEMMA:BETWEEN-MENTION-and-MENTION[cause_present]
LEMMA:BETWEEN-MENTION-and-MENTION[gene_cause_present]
POS:BETWEEN-MENTION-and-MENTION[NN_VBZ]
POS:BETWEEN-MENTION-and-MENTION[VBZ_VB]
POS:BETWEEN-MENTION-and-MENTION[NN_VBZ_VB]
NER:BETWEEN-MENTION-and-MENTION[O_O]
NER:BETWEEN-MENTION-and-MENTION[O_DATE]
NER:BETWEEN-MENTION-and-MENTION[O_O_DATE]
DEP_LABEL:BETWEEN-MENTION-and-MENTION[prep_in_rcmod]
DEP_LABEL:BETWEEN-MENTION-and-MENTION[rcmod_conj_but]
DEP_LABEL:BETWEEN-MENTION-and-MENTION[prep_in_rcmod_conj_but]
LEMMA:FILTER-BY(pos=VB):BETWEEN-MENT
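
Judging from the output above, each attribute yields the full between-mention sequence followed by its contiguous 2-grams and 3-grams. A minimal sketch of that n-gram expansion, assuming this reading is right (the helper `ngrams` is a hypothetical name, not a treedlib function):

```python
def ngrams(tokens, n_min=2, n_max=3):
    """Return all contiguous n-grams, shortest first, joined with underscores."""
    return [
        "_".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

path = ["gene", "causes", "present"]
feats = ["WORD:BETWEEN-MENTION-and-MENTION[{0}]".format(g) for g in ngrams(path)]
for f in feats:
    print(f)
```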

# TODO: Why not working?  Highlight mentions in chart & text!

* Fix this!
* _Features to add:_
    * `INV_`?
    * Parent
    * lemma+dep_path
    * candidates in between?
    * Better way to do siblings, when siblings have children...?
    * Modifiers before e.g. "We investigated whether..." / NEGATIONS (see Johannes's email / list)

## Debugging pipeline

We debug here, which also demonstrates the current general procedure for debugging treedlib on examples stored in a SQL database (e.g. from DeepDive).

In [None]:
%sql postgresql://ajratner@localhost:5432/deepdive_spouse

In [None]:
%%sql 
SELECT sentence_text
FROM sentences 
WHERE doc_id = '79205745-b593-4b98-8a94-da6b8238fefc' AND sentence_index = 32;

In [None]:
res = %sql SELECT tokens AS "words", lemmas, pos_tags, ner_tags, dep_types AS "dep_labels", dep_tokens AS "dep_parents" FROM sentences WHERE doc_id = '79205745-b593-4b98-8a94-da6b8238fefc' AND sentence_index = 32;
xts = map(corenlp_to_xmltree, res)

In [None]:
xt = xts[0]
xt.render_tree(highlight=[[21,22], [33,34]])

In [None]:
print_gen(get_relation_features(xt.root, [21,22], [33,34]))

## Feature focus: Preceding statements which nullify or negate meaning

Examples:
> _Ex1:_ To investigate whether mutations in the SURF1 gene are a cause of Charcot-Marie-Tooth -LRB- CMT -RRB- disease

> _Ex2:_ To investigate the genetic effect of a new mutation found in exon 17 of the myophosphorylase -LRB- PYGM -RRB- gene as a cause of McArdle disease -LRB- also known as type 5 glycogenosis -RRB-.

Notes:
* These seem to mostly be **_modifiers of the primary verb_**?
    * We are only sampling from a limited set of patterns of sentences (due to narrow DSR set) currently...
* Modifiers in general...?
* _I know how RNNs claim to / do handle this phenomenon..._
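
One possible shape for such a feature, sketched under heavy assumptions: walk from the root towards a mention and emit a feature whenever an ancestor's lemma belongs to a list of hedging cues. The cue list `HEDGE_LEMMAS`, the feature name `HEDGE-ANCESTOR` and the function `hedge_features` are all hypothetical, and the tree is an abbreviated version of Ex1 that keeps only the path to the SURF1 node.

```python
import xml.etree.ElementTree as ET

# Hypothetical cue list, for illustration only.
HEDGE_LEMMAS = {"investigate", "examine", "test", "determine"}

def hedge_features(root, mention_idx):
    """Emit one feature per hedging lemma found among the mention's ancestors."""
    def walk(node, seen):
        if node.get("word_idx") == mention_idx:
            return seen
        lemma = node.get("lemma")
        here = seen + [lemma] if lemma in HEDGE_LEMMAS else seen
        for child in node:
            found = walk(child, here)
            if found is not None:
                return found
        return None
    cues = walk(root, []) or []
    return ["HEDGE-ANCESTOR[{0}]".format(c) for c in cues]

# Abbreviated Ex1: investigate -> cause -> mutation -> gene -> SURF1
EX1 = ET.fromstring(
    '<node lemma="investigate" word_idx="1">'
    '<node lemma="cause" word_idx="10">'
    '<node lemma="mutation" word_idx="3">'
    '<node lemma="gene" word_idx="7"><node lemma="surf1" word_idx="6"/></node>'
    '</node></node></node>'
)
ex1_feats = hedge_features(EX1, "6")
print(ex1_feats)
```

This only covers cues that dominate the mention in the parse; negations or modifiers attached as siblings would need a different traversal.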

In [None]:
ex1_id = ('24027061', 'Abstract.0', 1)
ex1_raw="""
<node dep_parent="0" lemma="investigate" ner="O" pos="VB" word="investigate" word_idx="1"><node dep_parent="2" dep_path="aux" lemma="to" ner="O" pos="TO" word="To" word_idx="0"/><node dep_parent="2" dep_path="ccomp" lemma="cause" ner="O" pos="NN" word="cause" word_idx="10"><node dep_parent="11" dep_path="mark" lemma="whether" ner="O" pos="IN" word="whether" word_idx="2"/><node dep_parent="11" dep_path="nsubj" lemma="mutation" ner="O" pos="NNS" word="mutations" word_idx="3"><node dep_parent="4" dep_path="prep_in" lemma="gene" ner="O" pos="NN" word="gene" word_idx="7"><node dep_parent="8" dep_path="det" lemma="the" ner="O" pos="DT" word="the" word_idx="5"/><node dep_parent="8" dep_path="nn" lemma="surf1" ner="O" pos="NN" word="SURF1" word_idx="6"/></node></node><node dep_parent="11" dep_path="cop" lemma="be" ner="O" pos="VBP" word="are" word_idx="8"/><node dep_parent="11" dep_path="det" lemma="a" ner="O" pos="DT" word="a" word_idx="9"/><node dep_parent="11" dep_path="prep_of" lemma="Charcot-Marie-Tooth" ner="O" pos="NNP" word="Charcot-Marie-Tooth" word_idx="12"/><node dep_parent="11" dep_path="dep" lemma="disease" ner="O" pos="NN" word="disease" word_idx="16"><node dep_parent="17" dep_path="appos" lemma="CMT" ner="O" pos="NNP" word="CMT" word_idx="14"/></node></node></node>
"""
xt1 = XMLTree(et.fromstring(ex1_raw))
ex2_id = ('15262743', 'Abstract.0', 1)
ex2_raw="""
<node dep_parent="0" lemma="investigate" ner="O" pos="VB" word="investigate" word_idx="1"><node dep_parent="2" dep_path="aux" lemma="to" ner="O" pos="TO" word="To" word_idx="0"/><node dep_parent="2" dep_path="dobj" lemma="effect" ner="O" pos="NN" word="effect" word_idx="4"><node dep_parent="5" dep_path="det" lemma="the" ner="O" pos="DT" word="the" word_idx="2"/><node dep_parent="5" dep_path="amod" lemma="genetic" ner="O" pos="JJ" word="genetic" word_idx="3"/><node dep_parent="5" dep_path="prep_of" lemma="mutation" ner="O" pos="NN" word="mutation" word_idx="8"><node dep_parent="9" dep_path="det" lemma="a" ner="O" pos="DT" word="a" word_idx="6"/><node dep_parent="9" dep_path="amod" lemma="new" ner="O" pos="JJ" word="new" word_idx="7"/><node dep_parent="9" dep_path="vmod" lemma="find" ner="O" pos="VBN" word="found" word_idx="9"><node dep_parent="10" dep_path="prep_in" lemma="exon" ner="O" pos="NN" word="exon" word_idx="11"><node dep_parent="12" dep_path="num" lemma="17" ner="NUMBER" pos="CD" word="17" word_idx="12"/><node dep_parent="12" dep_path="prep_of" lemma="gene" ner="O" pos="NN" word="gene" word_idx="19"><node dep_parent="20" dep_path="det" lemma="the" ner="O" pos="DT" word="the" word_idx="14"/><node dep_parent="20" dep_path="nn" lemma="myophosphorylase" ner="O" pos="NN" word="myophosphorylase" word_idx="15"/><node dep_parent="20" dep_path="nn" lemma="pygm" ner="O" pos="NN" word="PYGM" word_idx="17"/></node></node><node dep_parent="10" dep_path="prep_as" lemma="cause" ner="O" pos="NN" word="cause" word_idx="22"><node dep_parent="23" dep_path="det" lemma="a" ner="O" pos="DT" word="a" word_idx="21"/><node dep_parent="23" dep_path="prep_of" lemma="disease" ner="O" pos="NN" word="disease" word_idx="25"><node dep_parent="26" dep_path="nn" lemma="McArdle" ner="PERSON" pos="NNP" word="McArdle" word_idx="24"/><node dep_parent="26" dep_path="vmod" lemma="know" ner="O" pos="VBN" word="known" word_idx="28"><node dep_parent="29" dep_path="advmod" lemma="also" ner="O" 
pos="RB" word="also" word_idx="27"/><node dep_parent="29" dep_path="prep_as" lemma="glycogenosis" ner="O" pos="NN" word="glycogenosis" word_idx="32"><node dep_parent="33" dep_path="nn" lemma="type" ner="O" pos="NN" word="type" word_idx="30"/><node dep_parent="33" dep_path="num" lemma="5" ner="NUMBER" pos="CD" word="5" word_idx="31"/></node></node></node></node></node></node></node></node>
"""
xt2 = XMLTree(et.fromstring(ex2_raw))

In [None]:
xt1.render_tree()
xt2.render_tree()

### Testing XML speeds

How do these two approaches compare?
* Parse to XML via this Python code, store the result as a string, then parse from the string at runtime
* Parse to XML at runtime via this Python code

In [None]:
# Map sentence to xmltree
%time xts = map(corenlp_to_xmltree, rows)

In [None]:
# Pre-process to xml string
xmls = [xt.to_str() for xt in map(corenlp_to_xmltree, rows)]

# Parse @ runtime using lxml
%time roots = map(et.fromstring, xmls)
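
Outside IPython, where `%time` is unavailable, the same comparison can be approximated with `timeit`. The sketch below is self-contained and therefore uses a toy `build` function and the standard-library `xml.etree.ElementTree` as stand-ins for `corenlp_to_xmltree` and `lxml`; absolute timings will not transfer to the real pipeline.

```python
import timeit
import xml.etree.ElementTree as ET

def build(words, parents):
    """Toy stand-in: build a nested tree from tokens and 1-indexed parents (0 = root)."""
    nodes = [ET.Element("node", word=w) for w in words]
    root = None
    for node, p in zip(nodes, parents):
        if p == 0:
            root = node
        else:
            nodes[p - 1].append(node)
    return root

words, parents = ["BSCL2", "causes", "syndrome"], [2, 0, 2]

# Option 1: pre-process to an XML string, then parse it at runtime.
xml_str = ET.tostring(build(words, parents))
t_parse = timeit.timeit(lambda: ET.fromstring(xml_str), number=2000)

# Option 2: build the tree from the raw arrays at runtime.
t_build = timeit.timeit(lambda: build(words, parents), number=2000)

print("build: {0:.4f}s  parse: {1:.4f}s".format(t_build, t_parse))
```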

### Table example

In [None]:
# Some wishful thinking...
table_xml = """
<div class="table-wrapper">
    <h3>Causal genomic relationships</h3>
    <table>
        <tr><th>Gene</th><th>Variant</th><th>Phenotype</th></tr>
        <tr><td>ABC</td><td><i>AG34</i></td><td>Headaches during defecation</td></tr>
        <tr><td>BDF</td><td><i>CT2</i></td><td>Defecation during headaches</td></tr>
        <tr><td>XYG</td><td><i>AT456</i></td><td>Defecasomnia</td></tr>
    </table>
</div>
"""
from IPython.display import display_html, HTML
display_html(HTML(table_xml))
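
Since the table markup is well-formed XML, the same tree machinery could in principle be pointed at it. As a speculative sketch (nothing here exists in treedlib), the snippet below pairs each cell with its column header, which is the kind of structural context a table feature would need. It uses the standard-library `xml.etree.ElementTree` and a two-row copy of the table above.

```python
import xml.etree.ElementTree as ET

table_xml = """
<div class="table-wrapper">
    <h3>Causal genomic relationships</h3>
    <table>
        <tr><th>Gene</th><th>Variant</th><th>Phenotype</th></tr>
        <tr><td>ABC</td><td><i>AG34</i></td><td>Headaches during defecation</td></tr>
        <tr><td>BDF</td><td><i>CT2</i></td><td>Defecation during headaches</td></tr>
    </table>
</div>
"""

root = ET.fromstring(table_xml)
header = [th.text for th in root.iter("th")]

records = []
for tr in root.iter("tr"):
    # itertext() flattens nested markup such as <i>...</i> inside a cell.
    cells = ["".join(td.itertext()) for td in tr.findall("td")]
    if cells:  # the header row has no <td> children
        records.append(dict(zip(header, cells)))

for rec in records:
    print(rec)
```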