# Relation extraction
* Goal: find meaningful associations between pairs of named entities
* Co-occurrences will already tell us something
    * Entity pairs appearing in the same context (e.g. sentences) often tend to be somehow related
    * Type of association will remain a mystery
    * Lots of false positives, naturally
* Usually we have a predefined set of relation types we are interested in
    * Specific to our domain of interest
    * E.g. located-at/in, subsidiary, works-at
    * We did some dep-search patterns for this with limited success as you may remember

# RE as a classification task

* First we generate all relevant entity pairs
    * Usually from a single sentence
    * Filter: if we are looking for works-at relations, it makes no sense to generate ORG-ORG pairs etc.
* For each pair a classification makes a decision: relation exists (and type) or no relation
* Which pairs to consider here: Frank and James worked for Comcast, not for Time Warner.
* Features should now capture some information about the association of the given entity pair in the sentence

# Lets give it a try with our familiar TFIDF vectors

* Our data consists of biomedical publications and we want to find out where different types of bacteria live in
    * i.e. find the relations between bacteria and habitat entities
    * Take a look: [http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-18845825](http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-18845825)
    * Or here: [http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-19175621](http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-19175621)
* Only one relation type, so our problem is a binary classification task
* Would it make sense to see what words appear between the given entities?

* This time around, our data will come as XML, so we'll learn how to deal with that
    * under every sentence, of our interest will be `entity`, `interaction`, `token` and `dependency`
    

```
    <sentence charOffset="142-319" id="BB_EVENT_16.d34.s1" tail=" " text="Of the 104 isolations of Salmonella sp. from egg pulp, 97 were obtained from strontium chloride M broth, 42 from strontium selenite broth and 57 from strontium selenite A broth.">
      <entity charOffset="0-437" given="True" headOffset="171-176" id="BB_EVENT_16.d34.s1.e1" origId="BB-event-1016123.T2" origOffset="142-579" text="Of the 104 isolations of Salmonella sp. from egg pulp, 97 were obtained from strontium chloride M broth, 42 from strontium selenite broth and 57 from strontium selenite A broth. The results suggest that the first medium may be used more successfully than bi-selenite based media for enrichment and subsequent detection of salmonellae in egg products; however, the growth of S. pullorum was not satisfactory in strontium chloride M broth." type="Paragraph" />
      <entity charOffset="25-35" given="True" headOffset="25-35" id="BB_EVENT_16.d34.s1.e6" origId="BB-event-1016123.T7" origOffset="167-177" text="Salmonella" type="Bacteria" />
      <entity charOffset="45-53" given="True" headOffset="49-53" id="BB_EVENT_16.d34.s1.e7" origId="BB-event-1016123.T8" origOffset="187-195" text="egg pulp" type="Habitat" />
      <entity charOffset="45-48" given="True" headOffset="45-48" id="BB_EVENT_16.d34.s1.e8" origId="BB-event-1016123.T9" origOffset="187-190" text="egg" type="Habitat" />
      <entity charOffset="77-103" given="True" headOffset="98-103" id="BB_EVENT_16.d34.s1.e9" origId="BB-event-1016123.T10" origOffset="219-245" text="strontium chloride M broth" type="Habitat" />
      <entity charOffset="113-137" given="True" headOffset="132-137" id="BB_EVENT_16.d34.s1.e10" origId="BB-event-1016123.T11" origOffset="255-279" text="strontium selenite broth" type="Habitat" />
      <entity charOffset="150-176" given="True" headOffset="171-176" id="BB_EVENT_16.d34.s1.e11" origId="BB-event-1016123.T12" origOffset="292-318" text="strontium selenite A broth" type="Habitat" />
      <interaction directed="True" e1="BB_EVENT_16.d34.s1.e6" e1Role="Bacteria" e2="BB_EVENT_16.d34.s1.e11" e2Role="Location" id="BB_EVENT_16.d34.s1.i2" origId="BB-event-1016123.R3" type="Lives_In" />
      <interaction directed="True" e1="BB_EVENT_16.d34.s1.e6" e1Role="Bacteria" e2="BB_EVENT_16.d34.s1.e7" e2Role="Location" id="BB_EVENT_16.d34.s1.i3" origId="BB-event-1016123.R4" type="Lives_In" />
      <interaction directed="True" e1="BB_EVENT_16.d34.s1.e6" e1Role="Bacteria" e2="BB_EVENT_16.d34.s1.e10" e2Role="Location" id="BB_EVENT_16.d34.s1.i4" origId="BB-event-1016123.R5" type="Lives_In" />
      <interaction directed="True" e1="BB_EVENT_16.d34.s1.e6" e1Role="Bacteria" e2="BB_EVENT_16.d34.s1.e9" e2Role="Location" id="BB_EVENT_16.d34.s1.i5" origId="BB-event-1016123.R6" type="Lives_In" />
      <analyses>
        <tokenization ProteinNameSplitter="True" date="03.04.16 19:54:21" source="TEES" tokenizer="McCC">
          <token POS="IN" charOffset="0-2" headScore="0" id="bt_0" text="Of" />
          <token POS="DT" charOffset="3-6" headScore="1" id="bt_1" text="the" />
          <token POS="CD" charOffset="7-10" headScore="1" id="bt_2" text="104" />
          <token POS="NNS" charOffset="11-21" headScore="2" id="bt_3" text="isolations" />
          <token POS="IN" charOffset="22-24" headScore="0" id="bt_4" text="of" />
          <token POS="FW" charOffset="25-35" headScore="1" id="bt_5" text="Salmonella" />
          <token POS="FW" charOffset="36-39" headScore="2" id="bt_6" text="sp." />
          <token POS="IN" charOffset="40-44" headScore="0" id="bt_7" text="from" />
          <token POS="NN" charOffset="45-48" headScore="1" id="bt_8" text="egg" />
          <token POS="NN" charOffset="49-53" headScore="2" id="bt_9" text="pulp" />
          <token POS="," charOffset="53-54" headScore="1" id="bt_10" text="," />
          <token POS="CD" charOffset="55-57" headScore="1" id="bt_11" text="97" />
          <token POS="VBD" charOffset="58-62" headScore="1" id="bt_12" text="were" />
          <token POS="VBN" charOffset="63-71" headScore="1" id="bt_13" text="obtained" />
          <token POS="IN" charOffset="72-76" headScore="0" id="bt_14" text="from" />
          <token POS="NN" charOffset="77-86" headScore="1" id="bt_15" text="strontium" />
          <token POS="NN" charOffset="87-95" headScore="1" id="bt_16" text="chloride" />
          <token POS="NN" charOffset="96-97" headScore="1" id="bt_17" text="M" />
          <token POS="NN" charOffset="98-103" headScore="2" id="bt_18" text="broth" />
          <token POS="," charOffset="103-104" headScore="1" id="bt_19" text="," />
          <token POS="CD" charOffset="105-107" headScore="1" id="bt_20" text="42" />
          <token POS="IN" charOffset="108-112" headScore="0" id="bt_21" text="from" />
          <token POS="NN" charOffset="113-122" headScore="1" id="bt_22" text="strontium" />
          <token POS="NN" charOffset="123-131" headScore="1" id="bt_23" text="selenite" />
          <token POS="NN" charOffset="132-137" headScore="2" id="bt_24" text="broth" />
          <token POS="CC" charOffset="138-141" headScore="0" id="bt_25" text="and" />
          <token POS="CD" charOffset="142-144" headScore="1" id="bt_26" text="57" />
          <token POS="IN" charOffset="145-149" headScore="0" id="bt_27" text="from" />
          <token POS="NN" charOffset="150-159" headScore="1" id="bt_28" text="strontium" />
          <token POS="NN" charOffset="160-168" headScore="1" id="bt_29" text="selenite" />
          <token POS="NN" charOffset="169-170" headScore="1" id="bt_30" text="A" />
          <token POS="NN" charOffset="171-176" headScore="2" id="bt_31" text="broth" />
          <token POS="." charOffset="176-177" headScore="1" id="bt_32" text="." />
        </tokenization>
        <parse ProteinNameSplitter="True" date="03.04.16 19:54:21" parser="McCC" pennstring="(S1 (S (S (PP (IN Of) (NP (NP (DT the) (CD 104) (NNS isolations)) (PP (IN of) (NP (FW Salmonella) (FW sp.))) (PP (IN from) (NP (NN egg) (NN pulp))))) (, ,) (NP (CD 97)) (VP (VBD were) (VP (VBN obtained) (PP (IN from) (NP (NP (NN strontium) (NN chloride) (NN M) (NN broth)) (, ,) (NP (NP (NP (CD 42)) (PP (IN from) (NP (NN strontium) (NN selenite) (NN broth)))) (CC and) (NP (NP (CD 57)) (PP (IN from) (NP (NN strontium) (NN selenite) (NN A) (NN broth)))))))))) (. .)))" source="TEES" stanford="ok" stanfordDate="03.04.16 19:59:27" stanfordSource="TEES" tokenizer="McCC">
          <dependency id="sd_0" t1="bt_3" t2="bt_1" type="det" />
          <dependency id="sd_1" t1="bt_3" t2="bt_2" type="num" />
          <dependency id="sd_2" t1="bt_13" t2="bt_3" type="prep_of" />
          <dependency id="sd_3" t1="bt_6" t2="bt_5" type="nn" />
          <dependency id="sd_4" t1="bt_3" t2="bt_6" type="prep_of" />
          <dependency id="sd_5" t1="bt_9" t2="bt_8" type="nn" />
          <dependency id="sd_6" t1="bt_3" t2="bt_9" type="prep_from" />
          <dependency id="sd_7" t1="bt_13" t2="bt_10" type="punct" />
          <dependency id="sd_8" t1="bt_13" t2="bt_11" type="nsubjpass" />
          <dependency id="sd_9" t1="bt_13" t2="bt_12" type="auxpass" />
          <dependency id="sd_10" t1="bt_18" t2="bt_15" type="nn" />
          <dependency id="sd_11" t1="bt_18" t2="bt_16" type="nn" />
          <dependency id="sd_12" t1="bt_18" t2="bt_17" type="nn" />
          <dependency id="sd_13" t1="bt_13" t2="bt_18" type="prep_from" />
          <dependency id="sd_14" t1="bt_18" t2="bt_19" type="punct" />
          <dependency id="sd_15" t1="bt_18" t2="bt_20" type="appos" />
          <dependency id="sd_16" t1="bt_24" t2="bt_22" type="nn" />
          <dependency id="sd_17" t1="bt_24" t2="bt_23" type="nn" />
          <dependency id="sd_18" t1="bt_20" t2="bt_24" type="prep_from" />
          <dependency id="sd_19" t1="bt_18" t2="bt_26" type="appos" />
          <dependency id="sd_20" t1="bt_20" t2="bt_26" type="conj_and" />
          <dependency id="sd_21" t1="bt_31" t2="bt_28" type="nn" />
          <dependency id="sd_22" t1="bt_31" t2="bt_29" type="nn" />
          <dependency id="sd_23" t1="bt_31" t2="bt_30" type="nn" />
          <dependency id="sd_24" t1="bt_26" t2="bt_31" type="prep_from" />
          <dependency id="sd_25" t1="bt_13" t2="bt_32" type="punct" />
          <phrase begin="0" charOffset="0-54" end="10" id="bp_0" type="PP" />
          <phrase begin="0" charOffset="0-177" end="32" id="bp_1" type="S" />
          <phrase begin="0" end="33" id="bp_2" type="S" />
          <phrase begin="0" end="33" id="bp_3" type="S1" />
          <phrase begin="1" charOffset="3-24" end="4" id="bp_4" type="NP" />
          <phrase begin="1" charOffset="3-54" end="10" id="bp_5" type="NP" />
          <phrase begin="4" charOffset="22-44" end="7" id="bp_6" type="PP" />
          <phrase begin="5" charOffset="25-44" end="7" id="bp_7" type="NP" />
          <phrase begin="7" charOffset="40-54" end="10" id="bp_8" type="PP" />
          <phrase begin="8" charOffset="45-54" end="10" id="bp_9" type="NP" />
          <phrase begin="11" charOffset="55-62" end="12" id="bp_10" type="NP" />
          <phrase begin="12" charOffset="58-177" end="32" id="bp_11" type="VP" />
          <phrase begin="13" charOffset="63-177" end="32" id="bp_12" type="VP" />
          <phrase begin="14" charOffset="72-177" end="32" id="bp_13" type="PP" />
          <phrase begin="15" charOffset="77-104" end="19" id="bp_14" type="NP" />
          <phrase begin="15" charOffset="77-177" end="32" id="bp_15" type="NP" />
          <phrase begin="20" charOffset="105-112" end="21" id="bp_16" type="NP" />
          <phrase begin="20" charOffset="105-141" end="25" id="bp_17" type="NP" />
          <phrase begin="20" charOffset="105-177" end="32" id="bp_18" type="NP" />
          <phrase begin="21" charOffset="108-141" end="25" id="bp_19" type="PP" />
          <phrase begin="22" charOffset="113-141" end="25" id="bp_20" type="NP" />
          <phrase begin="26" charOffset="142-149" end="27" id="bp_21" type="NP" />
          <phrase begin="26" charOffset="142-177" end="32" id="bp_22" type="NP" />
          <phrase begin="27" charOffset="145-177" end="32" id="bp_23" type="PP" />
          <phrase begin="28" charOffset="150-177" end="32" id="bp_24" type="NP" />
        </parse>
      </analyses>
    </sentence>
```

In [89]:
from xml.etree import cElementTree as ET
from sklearn.svm import LinearSVC, SVC
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import normalize
from gzip import GzipFile
from itertools import product, tee
import scipy
import networkx as nx

#parsing xml is actually quite easy
def open_xml(input_xml):
    if input_xml.endswith('.gz'):
        with GzipFile(input_xml) as xml_file:
            tree = ET.parse(xml_file)
    else:
        tree = ET.parse(input_xml)
    return tree
tree_tmp=open_xml("data/BB_EVENT_16-train.xml")
print(tree_tmp)

<xml.etree.ElementTree.ElementTree object at 0x7f6546ceb400>


In [34]:
for sentence in tree_tmp.findall('.//sentence'):
        bacteria = sentence.findall('entity[@type="Bacteria"]')
        for b in bacteria:
            print(b.attrib["text"])
            print(b.attrib)
            print()

Salmonellae
{'given': 'True', 'origId': 'BB-event-1016123.T4', 'charOffset': '103-114', 'origOffset': '103-114', 'type': 'Bacteria', 'id': 'BB_EVENT_16.d34.s0.e3', 'headOffset': '103-114', 'text': 'Salmonellae'}

Salmonella
{'given': 'True', 'origId': 'BB-event-1016123.T7', 'charOffset': '25-35', 'origOffset': '167-177', 'type': 'Bacteria', 'id': 'BB_EVENT_16.d34.s1.e6', 'headOffset': '25-35', 'text': 'Salmonella'}

salmonellae
{'given': 'True', 'origId': 'BB-event-1016123.T14', 'charOffset': '144-155', 'origOffset': '464-475', 'type': 'Bacteria', 'id': 'BB_EVENT_16.d34.s2.e13', 'headOffset': '144-155', 'text': 'salmonellae'}

S. pullorum
{'given': 'True', 'origId': 'BB-event-1016123.T17', 'charOffset': '196-207', 'origOffset': '516-527', 'type': 'Bacteria', 'id': 'BB_EVENT_16.d34.s2.e16', 'headOffset': '199-207', 'text': 'S. pullorum'}

Streptococcus pyogenes
{'given': 'True', 'origId': 'BB-event-10658649.T3', 'charOffset': '59-81', 'origOffset': '59-81', 'type': 'Bacteria', 'id': 'BB

In [44]:
def generate_pairs(tree, features='tfidf', vectorizer=None, verbose=False):
    """
    Generates all pairs for relation classification.
    """
    # make sure we have index for every word in the data
    if features == 'tfidf' and vectorizer == None:
        vectorizer = TfidfVectorizer(ngram_range=(1,3))
        documents = [d.get('text') for d in tree.findall('document')]
        vectorizer.fit(documents)
    
    labels = []
    feature_matrix = []
    
    for sentence in tree.findall('.//sentence'):
        bacteria = sentence.findall('entity[@type="Bacteria"]')
        hab = sentence.findall('entity[@type="Habitat"]') + sentence.findall('entity[@type="Geographical"]')
        for i, (b, h) in enumerate(product(bacteria, hab)):
            # gather the labels, true/false
            if sentence.find('interaction[@e1="%s"][@e2="%s"][@type="Lives_In"]' % (b.get('id'), h.get('id'))) != None:
                labels.append(1)
                if verbose:
                    print(text_between(b,h,sentence))
            else:
                labels.append(0)
            
            if features == 'tfidf':
                feature_matrix.append(tfidf_features(b, h, sentence, vectorizer))
            else:
                feature_matrix.append(parse_features(b, h, sentence))
    
    if features != 'tfidf':
        if vectorizer == None:
            vectorizer = DictVectorizer()
            vectorizer.fit(feature_matrix)
        feature_matrix = vectorizer.transform(feature_matrix)
    
    return labels, scipy.sparse.vstack(feature_matrix), vectorizer

print(generate_pairs(tree_tmp,verbose=True))

detection in 
detection in 
sp. from 
sp. from egg pulp, 97 were obtained from 


in 
in 
strains colonizing 




isolates from 
among 
were isolated by 
isolated from the 
cyanobacterium 
gastrointestinal pathogen 
pathogen 
pathogen 

grown in 
grown in tryptic soy broth and 
grown in tryptic soy broth and nutrient broth to 
grown in tryptic soy broth and nutrient broth to 
grown in tryptic soy broth and nutrient broth to apple and 
grown in tryptic soy broth and nutrient broth to apple and 




to attach to 
to attach to 
to attach to lettuce and 
to attach to lettuce and 
 but not NB, supported capsule production by 
 supported capsule production by 



to 
to 
and Lb. gasseri K 7 cells to 
and Lb. gasseri K 7 cells to 
cells to 
cells to 





pathogen 

is an opportunistic pathogen and a major cause of 
 we showed that 


utilizes different regulatory systems and adhesins in attachment to biotic and abiotic surfaces and that QS is a main regulatory pathway in adhesion to an 
util

Our tfidf features for the words occurring between the given entities:

In [41]:
def text_between(bacteria, habitat, sentence):
    b_beg, b_end = bacteria.get('charOffset').split(',')[0].split('-')
    h_beg, h_end = habitat.get('charOffset').split(',')[0].split('-')

    if b_beg < h_beg:
        text_b = sentence.get('text')[int(b_end)+1:int(h_beg)]
    else:
        text_b = sentence.get('text')[int(h_end)+1:int(b_beg)]
    return text_b


def tfidf_features(bacteria, habitat, sentence, vectorizer):
    """
    Builds tfidf vectors for the words between the entities.
    """    
    return vectorizer.transform([text_between(bacteria,habitat,sentence)])

Lets see it in action!

In [56]:
print("Relation extraction with TFIDF vectors")

train_tree = open_xml('data/BB_EVENT_16-train.xml')
train_labels, train_features, train_vectorizer = generate_pairs(train_tree)

print("Number of features: %s" % train_features.shape[1])

devel_tree = open_xml('data/BB_EVENT_16-devel.xml')
devel_labels, devel_features = generate_pairs(devel_tree, vectorizer=train_vectorizer)[:2]

print("Devel set: %s examples, %s positive, %s negative" % (len(devel_labels), devel_labels.count(1), devel_labels.count(0)))

baseline = DummyClassifier(strategy='uniform')
baseline.fit(train_features, train_labels)
print('Random baseline accuracy: %.3f, f-score: %.3f' % (baseline.score(devel_features, devel_labels)*100,
                                                  metrics.f1_score(devel_labels, baseline.predict(devel_features))*100))

print('All positive baseline accuracy: %.3f, f-score: %.3f' % (metrics.accuracy_score(devel_labels, [1]*len(devel_labels))*100,
                                                  metrics.f1_score(devel_labels, [1]*len(devel_labels))*100))

for c in range(-15, 15):
    classifier = LinearSVC(C=2**c)
    classifier.fit(train_features, train_labels)
    pred = classifier.predict(devel_features)
    print("C: 2^%s  Accuracy: %.3f  F-score: %.3f P: %.3f R: %.3f" % (c, metrics.accuracy_score(devel_labels, pred)*100,
                                                      metrics.f1_score(devel_labels, pred)*100,
                                                      metrics.precision_score(devel_labels,pred)*100,
                                                     metrics.recall_score(devel_labels,pred)*100))

Relation extraction with TFIDF vectors
Number of features: 21746
Devel set: 506 examples, 173 positive, 333 negative
Random baseline accuracy: 51.581, f-score: 41.446
All positive baseline accuracy: 34.190, f-score: 50.957
C: 2^-15  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-14  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-13  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-12  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-11  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-10  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-9  Accuracy: 65.810  F-score: 0.000 P: 0.000 R: 0.000
C: 2^-8  Accuracy: 65.810  F-score: 7.487 P: 50.000 R: 4.046
C: 2^-7  Accuracy: 67.194  F-score: 16.162 P: 64.000 R: 9.249
C: 2^-6  Accuracy: 68.972  F-score: 25.592 P: 71.053 R: 15.607
C: 2^-5  Accuracy: 69.368  F-score: 29.864 P: 68.750 R: 19.075
C: 2^-4  Accuracy: 69.565  F-score: 33.621 P: 66.102 R: 22.543
C: 2^-3  Accuracy: 69.565  F-score: 35.

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


C: 2^6  Accuracy: 70.158  F-score: 40.316 P: 63.750 R: 29.480
C: 2^7  Accuracy: 70.158  F-score: 41.245 P: 63.095 R: 30.636
C: 2^8  Accuracy: 69.960  F-score: 33.913 P: 68.421 R: 22.543
C: 2^9  Accuracy: 69.170  F-score: 25.000 P: 74.286 R: 15.029
C: 2^10  Accuracy: 68.182  F-score: 21.463 P: 68.750 R: 12.717
C: 2^11  Accuracy: 68.379  F-score: 17.526 P: 80.952 R: 9.827
C: 2^12  Accuracy: 66.798  F-score: 33.333 P: 53.165 R: 24.277
C: 2^13  Accuracy: 67.194  F-score: 18.627 P: 61.290 R: 10.983
C: 2^14  Accuracy: 67.194  F-score: 17.000 P: 62.963 R: 9.827


Accuracy is pretty poor metric in this case: the class distribution is not uniform and in the end we are interested only in the positive class (Lives_In relation).

F-score not familiar? Check this: [https://en.wikipedia.org/wiki/F1_score](https://en.wikipedia.org/wiki/F1_score)

# Dependency parses to the rescue!
* The same semantic relation of two entities can be expressed in limitless ways
* Just looking at the words between the entities won't cut it
* Lets have a look at this sentence and assume we are interested in the relation between ATR and Nor1 proteins:
<img src="figs/parse_path.png">
* In linear order there are 12 tokens between these entities. Most likely the same word sequence won't appear anywhere else in the whole biomedical literature.
* On the other hand looking at the shortest dependency path between the same entities has only 3 tokens separating them
    * Using dependencies tends to densify our feature space
    * Dependency types are also a strong indicator of 

# Lets get back to our task, this time with a bigger hammer

In [85]:
def parse_features(bacteria, habitat, sentence):
    """
    Builds simple dependency path features for the pair.
    """
    b_head_token = sentence.find('.//token[@charOffset="%s"]' % bacteria.get('headOffset'))
    h_head_token = sentence.find('.//token[@charOffset="%s"]' % habitat.get('headOffset'))

    graph, token_dict = _build_graph(sentence)

    path = nx.shortest_path(graph, source=b_head_token.get('id'), target=h_head_token.get('id'))
    
    edges = []
    for t1, t2 in pairwise(path):
        edge = graph.get_edge_data(t1,t2)
        if edge['direction'][0] == t1:
            direction = '>'
        else:
            direction = '<'
        
        edges.append((token_dict[t1], token_dict[t2], direction, edge['type']))


    features = {}
    for e in edges:
        #print(e[0].get("text"),e[1].get("text"),e[2],e[3])
        features[e[2]+e[3]] = 1 # Directed dependency unigram
        features[e[3]] = 1 # Undirected dep unigram
    
        features['W_'+e[1].get('text')] = 1 # Word unigrams along the dep path
    #features={}
    # Dependency bigrams
    for i in range(len(edges)-1):
        path_string = '.'.join(p[3] for p in edges[i:i+1])
        features['D_ngram_:'+path_string] = 1

    # Dependency trigrams
    for i in range(len(edges)-2):
        path_string = '.'.join(p[3] for p in edges[i:i+2])
        features['D_ngram_:'+path_string] = 1

    #print(features)
    return features

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def _build_graph(sentence):
    """
    Builds a graph from the syntactic parse.
    """
    graph = nx.Graph() # undirected graph, since we want to find undirected shortest paths
    
    analyses = sentence.find('analyses')
    tokenization = analyses.find('tokenization')
    tokens = tokenization.findall('token')
    
    token_dict = {t.attrib['id']: t for t in tokens}
    
    for token in tokens:
        graph.add_node(token.attrib['id'])
        
    parses = analyses.find('parse')
    dependencies = parses.findall('dependency')
    for d in dependencies:
        graph.add_edge(d.attrib['t1'], d.attrib['t2'], id=d.attrib['id'], type=d.attrib['type'],
                       direction=(d.attrib['t1'], d.attrib['t2']))

    return graph, token_dict

In [86]:
print("Relation extraction with parse features")
train_labels, train_features, train_vectorizer = generate_pairs(train_tree, features='parse')
devel_labels, devel_features = generate_pairs(devel_tree, vectorizer=train_vectorizer, features='parse')[:2]

print("Number of features: %s" % train_features.shape[1])

for c in range(-15, 15):
    classifier = LinearSVC(C=2**c)
    classifier.fit(train_features, train_labels)
    pred = classifier.predict(devel_features)
    print("C: 2^%s  Accuracy: %.3f  F-score: %.3f P: %.3f R:%.3f" % (c, metrics.accuracy_score(devel_labels, pred)*100,
                                                      metrics.f1_score(devel_labels, pred)*100,
                                                      metrics.precision_score(devel_labels,pred)*100,
                                                     metrics.recall_score(devel_labels,pred)*100))


Relation extraction with parse features
Number of features: 1081
C: 2^-15  Accuracy: 66.996  F-score: 17.734 P: 60.000 R:10.405
C: 2^-14  Accuracy: 66.996  F-score: 21.596 P: 57.500 R:13.295
C: 2^-13  Accuracy: 67.589  F-score: 26.786 P: 58.824 R:17.341
C: 2^-12  Accuracy: 67.787  F-score: 37.066 P: 55.814 R:27.746
C: 2^-11  Accuracy: 65.415  F-score: 36.364 P: 49.020 R:28.902
C: 2^-10  Accuracy: 67.194  F-score: 44.295 P: 52.800 R:38.150
C: 2^-9  Accuracy: 67.787  F-score: 48.580 P: 53.472 R:44.509
C: 2^-8  Accuracy: 67.787  F-score: 52.478 P: 52.941 R:52.023
C: 2^-7  Accuracy: 68.775  F-score: 56.111 P: 54.011 R:58.382
C: 2^-6  Accuracy: 69.763  F-score: 58.311 P: 55.155 R:61.850
C: 2^-5  Accuracy: 69.763  F-score: 59.631 P: 54.854 R:65.318
C: 2^-4  Accuracy: 70.553  F-score: 62.469 P: 55.357 R:71.676
C: 2^-3  Accuracy: 69.368  F-score: 61.153 P: 53.982 R:70.520
C: 2^-2  Accuracy: 68.972  F-score: 59.432 P: 53.738 R:66.474
C: 2^-1  Accuracy: 67.194  F-score: 56.771 P: 51.659 R:63.006

In [87]:
print(train_vectorizer.get_feature_names())

['<abbrev', '<advcl', '<advmod', '<agent', '<amod', '<appos', '<ccomp', '<conj_and', '<csubj', '<dep', '<dobj', '<hyphen', '<infmod', '<nn', '<nsubj', '<nsubjpass', '<num', '<parataxis', '<partmod', '<pobj', '<prep', '<prep_after', '<prep_against', '<prep_among', '<prep_as', '<prep_at', '<prep_before', '<prep_by', '<prep_due_to', '<prep_during', '<prep_followed_by', '<prep_for', '<prep_from', '<prep_in', '<prep_in_addition_to', '<prep_of', '<prep_on', '<prep_throughout', '<prep_to', '<prep_with', '<prep_within', '<prepc_for', '<prepc_in', '<prepc_with', '<purpcl', '<rcmod', '<xcomp', '<xsubj', '>abbrev', '>advcl', '>advmod', '>agent', '>amod', '>appos', '>ccomp', '>conj_and', '>conj_but', '>csubj', '>dep', '>dobj', '>hyphen', '>infmod', '>nn', '>npadvmod', '>nsubj', '>nsubjpass', '>parataxis', '>partmod', '>pobj', '>poss', '>prep_against', '>prep_among', '>prep_as', '>prep_before', '>prep_between', '>prep_by', '>prep_for', '>prep_from', '>prep_in', '>prep_including', '>prep_inside', '>

In [88]:
import numpy
classifier = LinearSVC(C=2**-4)
classifier.fit(train_features, train_labels)
for feature_index in numpy.argsort(-classifier.coef_[0])[:10]:
    print(train_vectorizer.get_feature_names()[feature_index])


W_human
W_pathogen
>prep_in
W_surfaces
W_PNS
D_ngram_:prep_of.prep_to
W_studies
D_ngram_:amod.dobj
W_populations
W_isolates
