# Relation extraction
* Goal: find meaningful associations between named entities
* Co-occurrences will already tell us something
    * Entity pairs appearing in the same context (e.g. sentences) often tend to be somehow related
    * Type of association will remain a mystery
* Usually we have a predefined set of relation types we are interested in
    * Specific to our domain of interest
    * E.g. located-at/in, subsidiary, works-at

# RE as a classification task
* First we generate all relevant entity pairs
    * Usually from a single sentence
    * Note: if we are looking for works-at relations, it makes no sense to generate ORG-ORG pairs etc.
* For each pair a classification makes a decision: relation exists (and type) or no relation
* Which pairs to consider here: Frank and James worked for Comcast, not for Time Warner.
* Features should now capture some information about the association of the given entity pair in the sentence

# Lets give it a try with our familiar TFIDF vectors
* Our data consists of biomedical publications and we want to find out where different types of bacteria live in
    * i.e. find the relations between bacteria and habitat entities
    * Take a look: [http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-18845825](http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-18845825)
    * Or here: [http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-19175621](http://evexdb.org/curation/brat/#/bb2016/train_brat/BB-event-19175621)
* Only one relation type, so our problem is a binary classification task
* Would it make sense to see what words appear between the given entities?

In [54]:
from xml.etree import cElementTree as ET
from sklearn.svm import LinearSVC, SVC
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import normalize
from gzip import GzipFile
from itertools import product, tee, izip
import scipy
import networkx as nx

def open_xml(input_xml):
    if input_xml.endswith('.gz'):
        with GzipFile(input_xml) as xml_file:
            tree = ET.parse(xml_file)
    else:
        tree = ET.parse(input_xml)
    return tree

def generate_pairs(tree, features='tfidf', vectorizer=None):
    """
    Generates all pairs for relation classification.
    """
    if features == 'tfidf' and vectorizer == None:
        vectorizer = TfidfVectorizer(ngram_range=(1,3))
        documents = [d.get('text') for d in tree.findall('document')]
        vectorizer.fit(documents)
    
    labels = []
    feature_matrix = []
    
    for sentence in tree.findall('.//sentence'):
        bacteria = sentence.findall('entity[@type="Bacteria"]')
        hab = sentence.findall('entity[@type="Habitat"]') + sentence.findall('entity[@type="Geographical"]')
        for i, (b, h) in enumerate(product(bacteria, hab)):
            #import pdb; pdb.set_trace()
            if sentence.find('interaction[@e1="%s"][@e2="%s"][@type="Lives_In"]' % (b.get('id'), h.get('id'))) != None:
                labels.append(1)
            else:
                labels.append(0)

            if features == 'tfidf':
                feature_matrix.append(tfidf_features(b, h, sentence, vectorizer))
            else:
                feature_matrix.append(parse_features(b, h, sentence))
    
    if features != 'tfidf':
        if vectorizer == None:
            vectorizer = DictVectorizer()
            vectorizer.fit(feature_matrix)
        feature_matrix = vectorizer.transform(feature_matrix)
    
    return labels, scipy.sparse.vstack(feature_matrix), vectorizer

Our tfidf features for the words occurring between the given entities:

In [55]:
def tfidf_features(bacteria, habitat, sentence, vectorizer):
    """
    Builds tfidf vectors for the words between the entities.
    """
    b_beg, b_end = bacteria.get('charOffset').split(',')[0].split('-')
    h_beg, h_end = habitat.get('charOffset').split(',')[0].split('-')

    if b_beg < h_beg:
        text_between = sentence.get('text')[int(b_end)+1:int(h_beg)]
    else:
        text_between = sentence.get('text')[int(h_end)+1:int(b_beg)]
    
    return vectorizer.transform([text_between])

Lets see it in action!

In [56]:
print "Relation extraction with TFIDF vectors"

train_tree = open_xml('BB_EVENT_16-train.xml')
train_labels, train_features, train_vectorizer = generate_pairs(train_tree)

print "Number of features: %s" % train_features.shape[1]

devel_tree = open_xml('BB_EVENT_16-devel.xml')
devel_labels, devel_features = generate_pairs(devel_tree, vectorizer=train_vectorizer)[:2]

print "Devel set: %s examples, %s positive, %s negative" % (len(devel_labels), devel_labels.count(1), devel_labels.count(0))

baseline = DummyClassifier(strategy='uniform')
baseline.fit(train_features, train_labels)
print 'Random baseline accuracy: %.3f, f-score: %.3f' % (baseline.score(devel_features, devel_labels)*100,
                                                  metrics.f1_score(devel_labels, baseline.predict(devel_features))*100)

print 'All positive baseline accuracy: %.3f, f-score: %.3f' % (metrics.accuracy_score(devel_labels, [1]*len(devel_labels))*100,
                                                  metrics.f1_score(devel_labels, [1]*len(devel_labels))*100)

for c in range(-15, 15):
    classifier = LinearSVC(C=2**c, random_state=42)
    classifier.fit(train_features, train_labels)
    pred = classifier.predict(devel_features)
    print "C: 2^%s  Accuracy: %.3f  F-score: %.3f" % (c, metrics.accuracy_score(devel_labels, pred)*100,
                                                      metrics.f1_score(devel_labels, pred)*100)

Relation extraction with TFIDF vectors
Number of features: 21746
Devel set: 506 examples, 173 positive, 333 negative
Random baseline accuracy: 47.628, f-score: 37.762
All positive baseline accuracy: 34.190, f-score: 50.957
C: 2^-15  Accuracy: 65.810  F-score: 0.000
C: 2^-14  Accuracy: 65.810  F-score: 0.000
C: 2^-13  Accuracy: 65.810  F-score: 0.000
C: 2^-12  Accuracy: 65.810  F-score: 0.000
C: 2^-11  Accuracy: 65.810  F-score: 0.000
C: 2^-10  Accuracy: 65.810  F-score: 0.000
C: 2^-9  Accuracy: 65.810  F-score: 0.000
C: 2^-8  Accuracy: 65.810  F-score: 7.487
C: 2^-7  Accuracy: 67.194  F-score: 16.162
C: 2^-6  Accuracy: 68.972  F-score: 25.592
C: 2^-5  Accuracy: 69.368  F-score: 29.864
C: 2^-4  Accuracy: 69.565  F-score: 33.621
C: 2^-3  Accuracy: 69.565  F-score: 35.833
C: 2^-2  Accuracy: 69.763  F-score: 38.554
C: 2^-1  Accuracy: 69.368  F-score: 39.216
C: 2^0  Accuracy: 68.972  F-score: 37.450
C: 2^1  Accuracy: 69.170  F-score: 38.095
C: 2^2  Accuracy: 67.984  F-score: 37.209
C: 2^3  

Accuracy is pretty poor metric in this case: the class distribution is not uniform and in the end we are interested only in the positive class (Lives_In relation).

F-score not familiar? Check this: [https://en.wikipedia.org/wiki/F1_score](https://en.wikipedia.org/wiki/F1_score)

# Dependency parses to the rescue!
* The same semantic relation of two entities can be expressed in limitless ways
* Just looking at the words between the entities won't cut it
* Lets have a look at this sentence and assume we are interested in the relation between ATR and Nor1 proteins:
<img src="figs/parse_path.png">
* In linear order there are 12 tokens between these entities. Most likely the same word sequence won't appear anywhere else in the whole biomedical literature.
* On the other hand looking at the shortest dependency path between the same entities has only 3 tokens separating them
    * Using dependencies tends to densify our feature space
    * Dependency types are also a strong indicator of 

# Lets get back to our task, this time with a bigger hammer

In [40]:
def parse_features(bacteria, habitat, sentence):
    """
    Builds simple dependency path features for the pair.
    """
    b_head_token = sentence.find('.//token[@charOffset="%s"]' % bacteria.get('headOffset'))
    h_head_token = sentence.find('.//token[@charOffset="%s"]' % habitat.get('headOffset'))

    graph, token_dict = _build_graph(sentence)

    path = nx.shortest_path(graph, source=b_head_token.get('id'), target=h_head_token.get('id'))
    
    edges = []
    for t1, t2 in pairwise(path):
        edge = graph.edge[t1][t2]
        if edge['direction'][0] == t1:
            direction = '>'
        else:
            direction = '<'
        
        edges.append((token_dict[t1], token_dict[t2], direction, edge['type']))


    features = {}
    for e in edges:
        features[e[2]+e[3]] = 1 # Undirected dependency unigram
        features[e[3]] = 1 # Directed dep unigram
    """
        features['W_'+e[1].get('text')] = 1 # Word unigrams along the dep path

    # Dependency bigrams
    for i in range(len(edges)-1):
        path_string = '.'.join(p[3] for p in edges[i:i+1])
        features['D_ngram_:'+path_string] = 1

    # Dependency trigrams
    for i in range(len(edges)-2):
        path_string = '.'.join(p[3] for p in edges[i:i+2])
        features['D_ngram_:'+path_string] = 1
    """

    return features

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def _build_graph(sentence):
    """
    Builds a graph from the syntactic parse.
    """
    graph = nx.Graph() # undirected graph, since we want to find undirected shortest paths
    
    analyses = sentence.find('analyses')
    tokenization = analyses.find('tokenization')
    tokens = tokenization.findall('token')
    
    token_dict = {t.attrib['id']: t for t in tokens}
    
    for token in tokens:
        graph.add_node(token.attrib['id'])
        
    parses = analyses.find('parse')
    dependencies = parses.findall('dependency')
    for d in dependencies:
        graph.add_edge(d.attrib['t1'], d.attrib['t2'], id=d.attrib['id'], type=d.attrib['type'],
                       direction=(d.attrib['t1'], d.attrib['t2']))

    return graph, token_dict

In [47]:
print "Relation extraction with parse features"
train_labels, train_features, train_vectorizer = generate_pairs(train_tree, features='parse')
devel_labels, devel_features = generate_pairs(devel_tree, vectorizer=train_vectorizer, features='parse')[:2]

print "Number of features: %s" % train_features.shape[1]

for c in range(-15, 15):
    classifier = LinearSVC(C=2**c, random_state=42)
    classifier.fit(train_features, train_labels)
    pred = classifier.predict(devel_features)
    print "C: 2^%s  Accuracy: %.3f  F-score: %.3f" % (c, metrics.accuracy_score(devel_labels, pred)*100,
                                                      metrics.f1_score(devel_labels, pred)*100)


Relation extraction with parse features
Number of features: 163
C: 2^-15  Accuracy: 67.589  F-score: 21.154
C: 2^-14  Accuracy: 68.577  F-score: 26.047
C: 2^-13  Accuracy: 68.577  F-score: 32.340
C: 2^-12  Accuracy: 69.170  F-score: 37.097
C: 2^-11  Accuracy: 67.391  F-score: 38.662
C: 2^-10  Accuracy: 67.984  F-score: 43.750
C: 2^-9  Accuracy: 67.787  F-score: 46.557
C: 2^-8  Accuracy: 67.787  F-score: 48.903
C: 2^-7  Accuracy: 68.182  F-score: 52.226
C: 2^-6  Accuracy: 67.984  F-score: 53.179
C: 2^-5  Accuracy: 67.391  F-score: 53.782
C: 2^-4  Accuracy: 67.589  F-score: 56.614
C: 2^-3  Accuracy: 67.391  F-score: 56.693
C: 2^-2  Accuracy: 66.008  F-score: 53.763
C: 2^-1  Accuracy: 65.020  F-score: 52.291
C: 2^0  Accuracy: 64.229  F-score: 51.733
C: 2^1  Accuracy: 65.217  F-score: 53.439
C: 2^2  Accuracy: 65.613  F-score: 54.211
C: 2^3  Accuracy: 64.625  F-score: 52.520
C: 2^4  Accuracy: 64.032  F-score: 52.105
C: 2^5  Accuracy: 64.229  F-score: 48.433
C: 2^6  Accuracy: 64.625  F-score

Without showing a single word to the classifier we ended up with an F-score of 56.7

In [42]:
print train_vectorizer.get_feature_names()

['<abbrev', '<advcl', '<advmod', '<agent', '<amod', '<appos', '<ccomp', '<conj_and', '<csubj', '<dep', '<dobj', '<hyphen', '<infmod', '<nn', '<nsubj', '<nsubjpass', '<num', '<parataxis', '<partmod', '<pobj', '<prep', '<prep_after', '<prep_against', '<prep_among', '<prep_as', '<prep_at', '<prep_before', '<prep_by', '<prep_due_to', '<prep_during', '<prep_followed_by', '<prep_for', '<prep_from', '<prep_in', '<prep_in_addition_to', '<prep_of', '<prep_on', '<prep_throughout', '<prep_to', '<prep_with', '<prep_within', '<prepc_for', '<prepc_in', '<prepc_with', '<purpcl', '<rcmod', '<xcomp', '<xsubj', '>abbrev', '>advcl', '>advmod', '>agent', '>amod', '>appos', '>ccomp', '>conj_and', '>conj_but', '>csubj', '>dep', '>dobj', '>hyphen', '>infmod', '>nn', '>npadvmod', '>nsubj', '>nsubjpass', '>parataxis', '>partmod', '>pobj', '>poss', '>prep_against', '>prep_among', '>prep_as', '>prep_before', '>prep_between', '>prep_by', '>prep_for', '>prep_from', '>prep_in', '>prep_including', '>prep_inside', '>

# TEES
* Feature engineering for relation extraction is painful
* Luckily there are free tools available
* One of them is [TEES (Turku Event Extraction System)](http://jbjorne.github.io/TEES/)
    * Warning: comes bundled with tools specifically trained for biomedical domain, use your own parser etc. for other purposes
    * Uses the same XML format seen in this example

In [48]:
print "TEES system"
TEES_tree = open_xml('devel-pred.xml.gz')
TEES_labels, TEES_features = generate_pairs(TEES_tree, vectorizer=train_vectorizer, features='parse')[:2]
print 'TEES accuracy: %.3f, f-score: %.3f' % (metrics.accuracy_score(devel_labels, TEES_labels)*100,
                                              metrics.f1_score(devel_labels, TEES_labels)*100)

TEES system
TEES accuracy: 77.075, f-score: 68.478


OK, no competition here... Then again, TEES uses ~40K features for this task.

# EVEX
* Biomedical events extracted from the whole biomedical literature
    * Imagine an event as a collection of pairwise relations between genes and proteins
* Website: [http://www.evexdb.org/](http://www.evexdb.org/)
* The fun begins when you start looking into things on a large-scale:
<img src="figs/network.png">
* This is an automatically produced gene regulatory network with 13K human genes and 236K relations

# Bonus: so, what are these events?
* Have a look at this sentence: Kuluttajatuotteita kotiin, puutarhaan ja ulkoiluun valmistavan Fiskarsin liikevaihto kasvoi heinä–syyskuussa 62 prosenttia verrattuna vuoden takaiseen ajankohtaan.
* Rough translation: ... Fiskars increased its revenue by 62% compared to the previous year during Q3.
* Relevant entities here are Fiskars (organization), 62% (amount), Q3 (time)
* How would you represent this information structurally? Binary relations?

# What next?
* Annotating text-level mentions of entity relations is extremely labor-intensive
* In many cases we just don't have the training data for supervised relation classification
* Next week we'll look into unsupervised or distantly supervised alternatives