# Open Information Extraction
* Goal: finding any kind of (not predefined) relations between entities
* In practice these relations are usually triples (entity1, relation, entity2)
* Existing methods vary in how the relation (or the lack of one) is discovered
* Many systems rely on finding relation phrases between the entities
    * Let's have a look: http://openie.allenai.org/
* Simple solution:
    * Train a classifier (for instance a CRF) to tag relation phrases between entities
    * How do we get training data for any type of relation?
        * No practical way of manually annotating enough data
        * Let's use heuristics:
            * Find all co-occurring entity pairs and look at their shortest dependency paths
            * If the path matches certain criteria (e.g. it is not too long), assume the pair is a positive example
            * Otherwise assume the pair is negative, i.e. there is no relation between the entities
    * If we do not use lexical features, the classifier will hopefully generalize to any type of relation
        * POS tags, chunks and dependencies are still valid features
* Issues with OpenIE
    * Relations are identified by the detected phrases, so two phrases have to be identical for their information to be aggregated (not strictly true, since we have learned how to measure the similarity between phrases)
    * Many of the relations are completely irrelevant:
        * [http://openie.allenai.org/search?arg1=Othello&rel=&arg2=Shakespeare&corpora=](http://openie.allenai.org/search?arg1=Othello&rel=&arg2=Shakespeare&corpora=)
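
The heuristic for generating training data can be sketched roughly as follows. This is only a toy illustration: the dependency edges, the use of word forms as node identifiers and the `max_len` threshold are assumptions made for the example, not the criteria of any particular OpenIE system.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected dependency tree."""
    neighbours = {}
    for head, dep in edges:
        neighbours.setdefault(head, []).append(dep)
        neighbours.setdefault(dep, []).append(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the entities are not connected

def heuristic_label(edges, entity1, entity2, max_len=4):
    """Return 1 (assumed positive) if the shortest path is short enough, else 0."""
    path = shortest_path(edges, entity1, entity2)
    return int(path is not None and len(path) - 1 <= max_len)

# Toy (head, dependent) edges for "Shakespeare wrote Othello in London"
edges = [("wrote", "Shakespeare"), ("wrote", "Othello"), ("wrote", "London")]
print(shortest_path(edges, "Shakespeare", "Othello"))
print(heuristic_label(edges, "Shakespeare", "Othello"))
```

A real system would work on token indices rather than word forms (a word can occur twice in a sentence) and would probably use more criteria than the path length alone.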

# Aggregating information: relational clustering
* Sometimes we know that a pair of entities co-occurs multiple times. Could we use all of this information to decide the relation?
* We can surely try...
* In the next example we are interested in location entities (countries, cities etc.)
* Let's gather all **surface patterns** of the relations for our entity pairs
    * In this example we've chosen lexicalized dependency paths, e.g. *&lt;nmod sijaita >nmod* (sijaita = located in)
    * Bag-of-words might do the trick just as well.
* If we fill a matrix with this data we end up with entity pairs as rows and surface patterns as columns
    * If a pair of entities occurs in a given pattern, that cell in the matrix will have value 1 (or frequency etc.)
    * What we end up with is basically a feature vector for each pair -> let's cluster them
* Now our relations are based on the aggregated information of all surface patterns, so we can't pinpoint a single textual mention that explicitly describes the relation (it might not even exist)
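
To make the matrix concrete, here is a minimal sketch in plain Python. The mentions are made-up toy data in the same (entity1, entity2, surface pattern) form; the names `mentions`, `pair_features` and `matrix` are chosen only for this illustration.

```python
from collections import defaultdict

# Toy (entity1, entity2, surface pattern) mentions, invented for the example
mentions = [
    ("Pori", "Oulu", "<nmod tulla >nmod"),
    ("Oulu", "Pori", "<nmod tulla >nmod"),
    ("Suomi", "Japani", "<nmod olla >nmod"),
    ("Pori", "Oulu", "<nmod olla >nmod"),
]

# entity pair -> {surface pattern: 1}
pair_features = defaultdict(dict)
for e1, e2, pattern in mentions:
    pair_features[(e1, e2)][pattern] = 1

# Rows are entity pairs, columns are surface patterns
pairs = sorted(pair_features)
patterns = sorted({p for feats in pair_features.values() for p in feats})
matrix = [[pair_features[pair].get(pat, 0) for pat in patterns] for pair in pairs]

print(patterns)
for pair, row in zip(pairs, matrix):
    print("{} {}".format(pair, row))
```

Each row of `matrix` is the feature vector of one entity pair, which is what a clustering algorithm would receive as input.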

We already know how to extract named entities (NER lectures) and we also know how to find the surface patterns (feature generation for supervised relation extraction). Here we have it all pulled together:

In [64]:
import pickle
import numpy as np

from collections import Counter, defaultdict

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

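print "Reading data"
# Each item is an (entity1, entity2, surface pattern) triple; pickles should be opened in binary mode
pairs = pickle.load(open('pairs.pkl', 'rb'))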

rel_counter = Counter([p[2] for p in pairs])
pair_counter = Counter([(p[0][3], p[1][3]) for p in pairs])

print "Unique relations: %s" % len(rel_counter.keys())
print "Unique pairs: %s" % len(pair_counter.keys())
print "Pairs total: %s" % len(pairs)

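# Keep location-location pairs whose surface pattern and entity pair both occur at least 10 times,
# skipping conjunction paths and paths with no word between the entities
loc_pairs = [p for p in pairs if p[0][1] == 'loc' and p[1][1] == 'loc' and rel_counter[p[2]] >= 10
             and pair_counter[(p[0][3], p[1][3])] >= 10 and 'conj' not in p[2] and len(p[2].split(' ')) >= 3]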

for pair in loc_pairs[:10]:
    print pair[0][3].decode('utf-8'), pair[1][3].decode('utf-8'), pair[2].decode('utf-8')


Reading data
Unique relations: 2163262
Unique pairs: 2681141
Pairs total: 3985380
Eurooppa Afrikka <nmod:poss maa <dobj muuttaa >xcomp kaltainen >nmod:poss maa >nmod:poss
Afrikka Eurooppa <nmod:poss maa <nmod:poss kaltainen <xcomp muuttaa >dobj maa >nmod:poss
Tampere Pohjois-Satakunnan <nmod:poss markkina-alue <nmod:poss paikallis#radio >nmod radio >appos oy >compound:nn viestintä >nmod:poss
Pohjois-Satakunnan Tampere <nmod:poss viestintä <compound:nn oy <appos radio <nmod paikallis#radio >nmod:poss markkina-alue >nmod:poss
Pori Oulu <nmod tulla >nmod
Oulu Pori <nmod tulla >nmod
Slovakia Puola <nmod pitää >nmod
Puola Slovakia <nmod pitää >nmod
Suomi Japani <nmod olla >nmod
Japani Suomi <nmod olla >nmod


* *loc_pairs* is a filtered list of entity pairs collected from a large text corpus (Thanks Jenna!):
    * only location entities are included
    * only surface patterns and entity pairs which occur at least 10 times are included
    * only surface patterns with one or more words between the entities are included
    * no surface patterns containing conjunctions are included
* These filters are arbitrary, the result of trial and error!

In [76]:
print "Clustering..."
pair_dict = defaultdict(dict)

for p in loc_pairs:
    pair_dict[(p[0][3], p[1][3])][p[2]]=1

pair_list = pair_dict.keys()
vectorizer = DictVectorizer()
features = normalize(vectorizer.fit_transform(pair_dict.values()))

print "Feature matrix shape: %s, %s" % features.shape

n_clusters = 20
c = MiniBatchKMeans(n_clusters=n_clusters)

clusters = c.fit_predict(features)

print "Done!"

Clustering...
Feature matrix shape: 1268, 764
Done!


In [67]:
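# Distance from every entity pair (row) to every cluster center (column)
distances=c.transform(features)
# For each cluster, the indices of the 10 entity pairs closest to its center (in no particular order)
nearest=np.argpartition(distances,10,axis=0)[:10]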

print nearest.shape

for cluster_num,cluster in enumerate(nearest.T):
    print "**************", cluster_num,"***********"
    for pair_id in cluster:
        print ">> ",pair_id,">> ", pair_list[pair_id][0], '\t', pair_list[pair_id][1]

(10, 20)
************** 0 ***********
>>  413 >>  Pori 	Jyväskylä
>>  47 >>  Helsinki 	vannas#joki
>>  631 >>  Islanti 	Grönlanti
>>  1232 >>  Konala 	Helsinki
>>  1058 >>  Afganistan 	Kiina
>>  980 >>  Venäjä 	Norja
>>  59 >>  Grönlanti 	Islanti
>>  850 >>  Hämeenlinna 	Hämeenlinna
>>  43 >>  Pohjanmaa 	Helsinki
>>  1019 >>  Savonlinna 	Punkaharju
************** 1 ***********
>>  295 >>  Puola 	Norja
>>  225 >>  Norja 	Puola
>>  1235 >>  Tanska 	Viro
>>  121 >>  Viro 	Latvia
>>  888 >>  Saksa 	Viro
>>  886 >>  Latvia 	Puola
>>  148 >>  Viro 	Saksa
>>  1140 >>  Saksa 	Latvia
>>  744 >>  Norja 	Latvia
>>  140 >>  Viro 	Norja
************** 2 ***********
>>  561 >>  Yhdysvalta 	Meksiko
>>  1123 >>  Helsinki 	Berliini
>>  1042 >>  Vaasa 	Kokkola
>>  1043 >>  Syyria 	Jordania
>>  9 >>  suomi#lahti 	Helsinki
>>  610 >>  Berliini 	Helsinki
>>  697 >>  Yhdysvalta 	Intia
>>  494 >>  Kokkola 	Vaasa
>>  1175 >>  Intia 	Yhdysvalta
>>  925 >>  Helsinki 	suomi#lahti
************** 3 ***********
>> 

How do we make any sense of these clusters? Let's look at the feature weights of the cluster centers, as before.

In [68]:
print "Important features:"

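# Column index -> surface pattern; fetched once instead of on every print
feature_names = vectorizer.get_feature_names()

for i in range(n_clusters):
    print "**************", i,"***********"
    # The five surface patterns with the largest weights in this cluster center
    for ii in np.argsort(-c.cluster_centers_[i])[:5]:
        print feature_names[ii]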

Important features:
************** 0 ***********
<nmod olla >nmod
<nmod suunnata >nmod
<nmod alue >nmod:poss
<nmod:poss alue >nmod
<nmod sijaita >nmod
************** 1 ***********
<nmod mukana >nmod
<nmod juontaa >nmod
<nmod osa >nmod
<nsubj miehittää >dobj
<appos pää#kaupunki >nmod:poss
************** 2 ***********
<nmod siirtyä >nmod
<nmod sijaita >nmod
<nmod:poss raja >nmod
<nmod pelata >nmod
<appos pää#kaupunki >nmod:poss
************** 3 ***********
<nmod:poss mestari >nmod:poss
<nmod tutkia >nmod
<dobj kohdata >nsubj
<nmod mennä >nsubj
<nmod suuntautua >nmod
************** 4 ***********
<nsubj tulla >nmod
<nmod ilmestyä >nmod
<appos maa#kunta >nmod:poss
>nsubj:cop kaupunki >nmod:poss
<dobj liittää >nmod
************** 5 ***********
<nmod jatkua >nmod
<appos ESB >appos
<nmod:poss suur#lähetystö >appos
<nmod:poss suur#lähetystö >nmod
<nmod:poss suur#lähetystö >nmod:poss
************** 6 ***********
<name Tanska >name
<nmod pitää >dobj
<dobj pitää >nmod
<nsubj olla >nmod
<nsubj tarj

These surface patterns should then be used to decide a relation type for each cluster.

Issues:
* As always, we don't really know the optimal number of clusters
* Clusters are mutually exclusive: each entity pair is assigned to exactly one cluster, so it can have only one relation

# A step further: distant supervision
* In many cases we have an existing knowledge base of relations:
    * Wikipedia/wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page
    * Freebase: https://www.freebase.com/
* However, we have no connection between these known relations and their textual mentions
* Distant supervision: any attempt at aligning these two sources of information
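
In its simplest form the alignment is just a lookup: every textual mention of an entity pair is tagged with whatever relation the knowledge base lists for that pair. The sketch below uses a made-up knowledge base and made-up mentions; `knowledge_base`, `mentions` and `align` are names invented for this example.

```python
# Hypothetical knowledge base: (entity1, entity2) -> relation
knowledge_base = {
    ("Norja", "Oslo"): "LOCATED_IN",
    ("Viro", "Tallinna"): "LOCATED_IN",
}

# Hypothetical textual mentions: (entity1, entity2, surface pattern)
mentions = [
    ("Norja", "Oslo", "<appos pää#kaupunki >nmod:poss"),
    ("Pori", "Oulu", "<nmod tulla >nmod"),
    ("Viro", "Tallinna", "<nmod sijaita >nmod"),
]

def align(mentions, knowledge_base):
    """Attach the knowledge-base relation (or None) to every textual mention."""
    return [(e1, e2, pattern, knowledge_base.get((e1, e2)))
            for e1, e2, pattern in mentions]

aligned = align(mentions, knowledge_base)
for row in aligned:
    print(row)
```

Note that a mention tagged this way does not necessarily express the relation; the label only says that the pair is known to have it.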

# Latent feature model
* Let's consider a matrix similar to the one we used in relational clustering
* Adding relation data from existing knowledge bases is trivial: we create a new column for each known relation (in addition to the surface pattern columns)
* Naturally these knowledge bases are not exhaustive, otherwise there would be no point in trying to find related entity pairs
* Thus the goal is to create a model to fill in the blanks in our matrix
* One way to achieve this is a latent feature model, where we learn a feature vector for each entity pair and for each relation
    * The probability of a single cell in the matrix being equal to 1, i.e. of a relation holding for a given entity pair, is modelled by taking the dot product of the two latent feature vectors and passing the result through a logistic function
* Side note: the logistic function aside, this is actually a matrix factorization M ≈ AV, where M is our original matrix (entity pairs as rows, relations as columns), A holds the latent feature vectors of the entity pairs as its rows and V those of the relations as its columns
* Since the model can only combine the vectors through the dot product, an entity pair that has both relations X and Y pulls the feature vectors of X and Y towards each other; likewise, entity pairs that share relations end up with similar vectors
* How do we train a model like this then?
    * The naive approach is to use the known relations as positives and the unknown ones as negatives
        * This is obviously unrealistic, as a relation missing from the data may still hold
        * It still gives us a working model, but training may be sensitive to e.g. the proportion of sampled negatives
    * State-of-the-art systems use ranking objectives instead, i.e. a known relation should have a higher predicted probability than an unknown one.
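
A minimal sketch of the latent feature model with the naive training objective (observed cells as positives, randomly sampled unobserved cells as negatives), written in plain Python. The toy matrix, the dimensionality, the learning rate and all function names are assumptions for illustration; this is not the ranking objective mentioned above.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(pair_vec, rel_vec):
    """Probability that the cell (entity pair, relation) equals 1."""
    return sigmoid(sum(a * v for a, v in zip(pair_vec, rel_vec)))

def sgd_step(pair_vec, rel_vec, label, lr=0.1):
    """One in-place gradient step on the log loss of a single cell."""
    err = predict(pair_vec, rel_vec) - label  # derivative of the loss w.r.t. the dot product
    for k in range(len(pair_vec)):
        a, v = pair_vec[k], rel_vec[k]
        pair_vec[k] -= lr * err * v
        rel_vec[k] -= lr * err * a

def train(observed, n_pairs, n_rels, dim=5, epochs=200, seed=0):
    """Observed cells are positives, sampled unobserved cells are negatives."""
    rng = random.Random(seed)
    pair_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_pairs)]
    rel_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_rels)]
    cells = sorted(observed)
    for _ in range(epochs):
        for pair, rel in cells:
            sgd_step(pair_vecs[pair], rel_vecs[rel], 1.0)
            neg_rel = rng.randrange(n_rels)  # sample one candidate negative
            if (pair, neg_rel) not in observed:
                sgd_step(pair_vecs[pair], rel_vecs[neg_rel], 0.0)
    return pair_vecs, rel_vecs

# Toy matrix: columns 0 and 1 are surface patterns, column 2 is a knowledge-base relation
observed = {(0, 0), (0, 2), (1, 0), (2, 1)}
pair_vecs, rel_vecs = train(observed, n_pairs=3, n_rels=3)

# Score of the blank cell (pair 1, relation 2), which we would like the model to fill in
print(predict(pair_vecs[1], rel_vecs[2]))
```

On a matrix this small the scores are only illustrative. Note also that the blank cell itself gets sampled as a negative during training, which is exactly the weakness of the naive objective.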

# Enough theory, this is how it is done in practice
* Again we have a data matrix similar to the one before
* This time we add a new column *located_in*
    * We use a small list of countries and their capitals as our distant supervision: for a set of (country, capital) pairs we set *located_in* to 1; for all other pairs it remains 0
* Let's build the model in Keras
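
Adding the knowledge-base column amounts to setting one extra feature for the known pairs. A rough sketch with made-up data follows; `pair_features`, `capitals` and `add_relation_column` are hypothetical names, and the real preprocessing behind *distant_train_data.pkl* is not shown in these notes.

```python
# Hypothetical feature dicts: entity pair -> {surface pattern: 1}
pair_features = {
    ("Norja", "Oslo"): {"<appos pää#kaupunki >nmod:poss": 1},
    ("Italia", "Rooma"): {"<appos pää#kaupunki >nmod:poss": 1},
    ("Pori", "Oulu"): {"<nmod tulla >nmod": 1},
}

# Hypothetical distant supervision list of (country, capital) pairs
capitals = [("Norja", "Oslo"), ("Ruotsi", "Tukholma")]

def add_relation_column(pair_features, known_pairs, relation="LOCATED_IN"):
    """Set the relation column to 1 for every known pair that occurs in the data."""
    for pair in known_pairs:
        if pair in pair_features:
            pair_features[pair][relation] = 1
    return pair_features

add_relation_column(pair_features, capitals)
print(sorted(pair_features[("Norja", "Oslo")]))
```

Pairs on the list that never occur in the text (here Ruotsi, Tukholma) add nothing, and pairs that share surface patterns with a known pair (here Italia, Rooma) are the blanks the model is supposed to fill in.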

Let's import everything we need, plus a few things we don't:

In [69]:
import codecs
import os
import sys
import cPickle as pickle
import random

import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential, Graph
from keras.layers.core import Dense, Dropout, Activation, TimeDistributedDense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import model_from_json

from collections import defaultdict, Counter

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import DictVectorizer

In [70]:
print "Loading data"
train_data = pickle.load(open('./distant_train_data.pkl', 'rb'))
pair_count, rel_count, pair_ids, rel_ids, distant_pairs, train_data = train_data
reverse_pair_ids = {i: p for p, i in pair_ids.items()}

print ''
print "Unique pairs: %s, unique relations: %s" % (len(pair_ids), len(rel_ids))
print "Distant pairs in final set: %s"  % len(set(pair_ids.keys()).intersection(set([(p[0][3], p[1][3] ) for p in distant_pairs])))


Loading data

Unique pairs: 7693, unique relations: 3863
Distant pairs in final set: 11


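Below is our model implemented in Keras, using the Graph API of early Keras versions. Despite their names, the input *e1* takes entity pair ids and *e2* takes relation ids; the two 50-dimensional embeddings are merged with a dot product and passed through a single sigmoid unit.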

In [71]:
model = Graph()
        
model.add_input(name='e1', input_shape=(1, ), dtype='int')
model.add_input(name='e2', input_shape=(1, ), dtype='int')

model.add_node(Embedding(pair_count, 50, input_length=1, trainable=True), input='e1', name='e1e')
model.add_node(Embedding(rel_count, 50, input_length=1, trainable=True), input='e2', name='e2e')

model.add_node(TimeDistributedDense(1, activation='sigmoid'), name='outd', inputs=['e1e', 'e2e'], merge_mode='dot', concat_axis=-1)

model.add_output(name='out', input='outd')

print "Building model"
model.compile(loss={'out':'binary_crossentropy',},
              optimizer='adam')
print "Done!"

Building model
Done!


Time to train it!

In [72]:
print "Training"
model.fit(train_data, verbose=1, nb_epoch=6, batch_size=1024, validation_split=0.2)
print "Done!"

Training
Train on 800160 samples, validate on 200040 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Done!


Did we actually learn something? Let's get some predictions!

In [73]:
print "Predicting"
# Score every entity pair against the single relation LOCATED_IN
test_pairs = np.asarray([[i] for p, i in pair_ids.items()])
test_rels = np.asarray([[rel_ids['LOCATED_IN']]*test_pairs.shape[0]]).T

pred = model.predict({'e1': test_pairs, 'e2':test_rels})
pred_list = pred['out'][:,0,0]

dis_pairs = set([(d[0][3], d[1][3]) for d in distant_pairs])
res = [reverse_pair_ids[test_pairs[i][0]] for i in np.argsort(-pred_list)]
print "Done!"

print "Top hits with training pairs included (***):"

for r in res[:20]:
    if r in dis_pairs:
        in_train = '***'
    else:
        in_train = ''
    print r[0], r[1].decode('utf-8'), in_train

Predicting
Done!
Top hits with training pairs included (***):
Kuuba Havanna ***
Intia New Delhi 
Israel Jerusalem 
Puola Varsova ***
Afganistan Kabul 
Italia Rooma 
Norja Oslo ***
Latvia Riika ***
Irlanti Dublin 
Kiina Peking ***
Bulgaria Sofia 
Belgia Bryssel 
Saksa Berliini ***
Ranska Pariisi 
Itä-Suomen Pohjois-Karjalan 
Portugali Lissabon ***
Oulu Kannus 
Ruotsi Tukholma ***
Viro Tallinna ***
Kannus Oulu 


**Aaaaand we are done!**