# Open Information Extraction


Here will be implementing an *open information extraction* system, almost entirely from scratch. Information extraction takes a body of freeform text and extracts the contained information in a computer interpretable form. The word *open* simply means that the text/facts are arbitrary, so it will work with any input rather than a specific domain (e.g. legal texts).

As an example, given the input:

> "Trolls really don't like the sun."
  

you may extract the "fact":
```
('Trolls', 'do not like', 'the sun')
```

The approach is based on the paper "*Identifying Relations for Open Information Extraction*", by Fader, Soderland & Etzioni.

The steps of the system are as follows:
*  Tokenise and split on sentences
1. Part of speech tagging - token level
2. Part of speech tagging - sentence level
3. Named entity resolution
4. Relation extraction


*  Summarise "*20,000 leagues under the seas*" by Jules Verne *(provided)*

A simple NLP library, called `ogonek`, is used. It has some basic functionality that we will require

Its documentation can be found below in a markdown cell.

## Ogonek

A tiny NLP library, that contains exactly the functionality I don't want you to implement for this coursework!



### Tokenisation and sentence splitting
`ogonek.Tokenise()`

A class that tokenises some text and splits it into sentences. Construct an instance with `tokens = ogonek.Tokenise('My text')`; it then has the same interface as a list of lists:
* `len(tokens)`: Number of extracted sentences (not words)
* `tokens[i]`: Sentence i, where i ranges from 0 to one less than `len(tokens)`. A sentence is a list of tokens.



### Word vectors
`ogonek.Glove()`

Constructing a `glove = ogonek.Glove()` object loads a heavily pruned Glove word vectors from the file `baby_glove.zip` into memory, and will then translate tokens into word vectors. Note that it automatically lowercases any token it is handed, so you don't need to. Has the following interface:
* `glove.len_vec()` - Returns the length of the word vectors; should be 300.
* `len(glove)` - Returns how many word vectors it knows of.
* `token in glove` - Returns `True` if it has a word vector for that token, `False` otherwise.
* `glove[token]` - Returns the word vector for the given token; raises an error if it does not have one.
* `glove.decode(token)` - Returns the word vector for the given token, but if the word vector is unknown returns a vector of zeros instead (silent failure).
* `glove.decodes(list of tokens)` - Returns a list of word vectors, one for each token. Has the same silent failure behaviour as `decode`.



### Groningen Meaning Bank dataset
`ogonek.GMB()`

Provides access to the Groningen Meaning Bank dataset, which is supplied in the file `ner_dataset.csv`. Replicates the interface of the tokenisation system as far as it can. Construct with `gmb = ogonek.GMB()`; has the following interface:
* `len(gmb)`: Number of sentences (not words) in data set
* `gmb[i]`: Sentence i, where i ranges from 0 to one less than `len(gmb)`. A sentence is a list of tokens.
* `gmb.pos(i)`: A list of POS tags that match with sentence i. Note that these are the full Penn Treebank tags (not the reduced set used below).
* `gmb.ner(i)`: A list of named entities that match with sentence i. Using outside-inside scheme.



### Pretty printing

`ogonek.aligned_print(*)` takes multiple lists and prints them out, aligning them so that all elements in position 0 of all lists are aligned vertically (extra space added as required), and then elements in position 1 and so on. For showing tags and a sentence with everything aligned. Also does word wrap and colour coding.

In [1]:
%matplotlib inline

import time
import string
import re

import numpy as np
import matplotlib.pyplot as plt
import numpy

import ogonek

import warnings
warnings.filterwarnings("ignore")

## Useful variables

In [2]:
mb = ogonek.GMB()

In [3]:
# Dictionary giving descriptions of the reduced part of speech tags...
rpos_desc = {'C' : 'Coordinating conjunction',
             '0' : 'Cardinal number',
             'D' : 'Determiner',
             'E' : 'Existential there',
             'I' : 'Preposition or subordinating conjunction',
             'J' : 'Adjective',
             'N' : 'Noun',
             'P' : 'Predeterminer',
             'S' : 'Possessive ending',
             'M' : 'Pronoun',
             'R' : 'Adverb',
             'Z' : 'Particle',
             'T' : 'to',
             'V' : 'Verb',
             'A' : 'Anything else',
             '.' : 'All punctuation'}



# Reduced list of part of speech tags as a list...
num_to_rpos = ['C', '0', 'D', 'E', 'I', 'J', 'N', 'P',
               'S', 'M', 'R', 'Z', 'T', 'V', 'A', '.']



# Dictionary that maps a reduced part of speech
# tag to it's index in the above list; useful for vectors/matrices etc...
rpos_to_num = {'C' : 0,
               '0' : 1,
               'D' : 2,
               'E' : 3,
               'I' : 4,
               'J' : 5,
               'N' : 6,
               'P' : 7,
               'S' : 8,
               'M' : 9,
               'R' : 10,
               'Z' : 11,
               'T' : 12,
               'V' : 13,
               'A' : 14,
               '.' : 15}



# Dictionary that maps the full part of speech tags to the reduced set...
pos_to_rpos = {'CC' : 'C',
               'CD' : '0',
               'DT' : 'D',
               'EX' : 'E',
               'FW' : 'A',
               'IN' : 'I',
               'JJ' : 'J',
               'JJR' : 'J',
               'JJS' : 'J',
               'LS' : 'A',
               'MD' : 'A',
               'NN' : 'N',
               'NNS' : 'N',
               'NNP' : 'N',
               'NNPS' : 'N',
               'PDT' : 'P',
               'POS' : 'S',
               'PRP' : 'M',
               'PRP$' : 'M',
               'RB' : 'R',
               'RBR' : 'R',
               'RBS' : 'R',
               'RP' : 'Z',
               'SYM' : 'A',
               'TO' : 'T',
               'UH' : 'A',
               'VB' : 'V',
               'VBD' : 'V',
               'VBG' : 'V',
               'VBN' : 'V',
               'VBP' : 'V',
               'VBZ' : 'V',
               'WDT' : 'D',
               'WP' : 'M',
               'WP$' : 'S',
               'WRB' : 'R',
               '-' : '.',
               'LRB' : '.',
               'RRB' : '.',
               '``' : '.',
               '"' : '.',
               '.' : '.',
               ',' : '.',
               ';' : '.',
               ':' : '.',
               '$' : '.'}    



## Load book, tokenise and split on sentences
The below code reads in the book, chops it down to just the text of the book, and then tokenises it using the provided `ogonek` library.

In [4]:
# Loop file, only keeping lines between indicators...
lines = []
record = False

with open('20,000 Leagues Under the Seas.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        if record:
            if line.startswith('***END OF THE PROJECT GUTENBERG'):
                break
      
            lines.append(line)
    
        else:
            if line.startswith('***START OF THE PROJECT GUTENBERG'):
                record = True

text = ''.join(lines)


# Tokenise...
under_the_seas = ogonek.Tokenise(text)


# Print 10 random sentences to check it worked...
numpy.random.seed(0)

for i in range(10):
    toks = numpy.random.choice(under_the_seas)
    print('{:02d}. {}'.format(i+1, ' '.join(toks)))


01. It was the regime of verticality .
02. Now then , the tides are not strong in the Pacific , and if you can not unballast the Nautilus , which seems impossible to me , I do not see how it will float off . "
03. Captain Nemo left the cave , and we climbed back up the bank of shellfish in the midst of these clear waters not yet disturbed by divers at work .
04. Likewise the pilothouse and the beacon housing were withdrawn into the hull until they lay exactly flush with it .
05. Instead of digging all around the Nautilus , which would have entailed even greater difficulties , Captain Nemo had an immense trench outlined on the ice , eight meters from our port quarter .
06. We would not go five miles without bumping into a fellow countryman .
07. The oars , mast , and sail are in the skiff .
08. Under existing conditions some ten men at the most should be enough to operate it . "
09. Nobody appeared on our arrival .
10. We gasped .


# Part of speech tagging - token level

The goal here is to train a classifier that indicates which of the part of speech tags (the reduced set provided above) each word is. For this initial approach we're going to treat words (tokens) individually, without context. For features the Glove word vectors are going to be used (provided by `ogonek.Glove()`).

Instead of training a single classifier a slight modification of a random kitchen sink for each part of speech tag is going to be used. Specifically, a logistic random kitchen sink that indicates the probability that the word should be labelled with the associated tag. This is a *one vs all* classifier - you have a classifier for every tag, run them all on each word, and then select the tag with the highest probability (it's inconsistent - they won't sum to 1!). A logistic random kitchen sink is simply a normal kitchen sink that is pushed through a sigmoid function (in neural network terms, the final layer has a non-linearity),
$$\operatorname{Sig}(z) = \frac{1}{1 + e^{-z}}$$
such that the final binary classifier is
$$P(\textrm{tag}) = \operatorname{Sig}\left(\sum_{k \in K} \alpha_k \phi\left(\vec{x} \cdot \vec{w}_k\right)\right)$$
For the cost function we will be maximising the log likelihood of the dataset. This will require gradient descent; Nestorov, including backtracking line search to select the initial step size, to get all marks. We will be using 300 random features, in addition to the 300 provided by glove (total of 601 - bias term is the +1), as that keeps the resulting data matrix during training small enough that it completes reasonably quickly.

The Groningen Meaning Bank dataset has been provided; it can be accessed via the class `ogonek.GMB`. It includes lots of sentences, each as a list of tokens, plus part of speech tags as a list aligned with the sentence.
Source: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

In [5]:
# Load word vectors; in a seperate cell as this takes a couple seconds...
glove = ogonek.Glove()


# Groningen Meaning Bank dataset - a set of sentences each tagged
# with part of speech and named entity recognitiuon tags...
gmb = ogonek.GMB()
print('GMB sentences = {}'.format(len(gmb)))
print()


# Print out 5 random sentences from GMB with POS and NER tags, to illustrate the data...
numpy.random.seed(1)
for _ in range(5):
    i = numpy.random.randint(len(gmb))
    ogonek.aligned_print(gmb[i], gmb.pos(i), gmb.ner(i))


GMB sentences = 47959

[0mThe U.S.  space agency is  making final preparations to launch the first 
[31mDT  NNP   NN    NN     VBZ VBG    JJ    NNS          TO VB     DT  JJ    
[34mO   B-geo O     O      O   O      O     O            O  O      O   O     
[0m
[0mdirect space probe to the distant planet of Pluto . 
[31mJJ     NN    NN    TO DT  JJ      NN     IN NNP   . 
[34mO      O     O     O  O   O       O      O  B-geo O 
[0m
[0mOn Monday , the freighter Torgelow was hijacked off the eastern coast of 
[31mIN NNP    , DT  NN        NNP      VBD VBN      IN  DT  JJ      NN    IN 
[34mO  B-tim  O O   O         B-art    O   O        O   O   O       O     O  
[0m
[0mSomalia . 
[31mNNP     . 
[34mB-geo   O 
[0m
[0mChile and Bolivia are associate members . 
[31mNNP   CC  NNP     VBP JJ        NNS     . 
[34mB-gpe O   B-gpe   O   O         O       O 
[0m
[0mVenezuela has freed 11 Colombian soldiers who had been detained after entering 
[31mNNP       VBZ VBN   CD JJ   

In [6]:
#create training data
def createTrainingSet():
    #create list of arrays representing the word vectors of 300 
    gloveList = []
    #create list of tags (ys)
    posWordList = []
    sentenceNum = 0
    
    for sentence in gmb:
        sentencePosList = []
        sentencePosList = gmb.pos(sentenceNum)
        sentenceNum += 1

        wordNum = 0
        for word in sentence:
        #find glove vector for each word
            if word in glove:
                
                gloveList.append(glove[word])

                pos = sentencePosList[wordNum]

                posWordList.append(rpos_to_num[pos_to_rpos[pos]])
                wordNum+=1

    return gloveList,posWordList
#glovelist corresponds to x
#posWordList corresponds to y
gloveList,posWordList = createTrainingSet()

#create training set as 0.3 of data
trainingX = gloveList[:(int(0.3*len(gloveList)))]
trainingY = posWordList[:(int(0.3*len(posWordList)))]


In [7]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

#kitchen sink
#x is the glovek vector len 300
#k is the feature vector len 300 + 1 bias

def returnX (x, K = 300): # Extend x with K new features . . . 
    #create kitchen sink
    #create extended x vectors. contain 300 random value vector, 300 dim glove word vector and one bias term
    x = np.array(x)
    #Random vectors
    w = numpy.random.standard_normal((K, x.shape[1])) 

    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    #extended vectors
    
    ex = numpy.append(x,nf, axis=1)
    
    #add bias term                            
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)

    return w, ex


                                
def prediction(ex, alpha):
    #make prediction using alpha
    pred = ex@alpha
    
    pred = np.clip(sigmoid(pred),1e-3,1-1e-3)

    return pred
                                
def costFunction(pred,y): 
    
    #cost function is minimising the negative likelihood 
    pred = np.array(pred)

    cost = -(y@numpy.log(pred)+(1-y)@numpy.log(1-pred))
    
    return cost

def gradient(actual,pred,ex):
    #calculate gradient of cost function
    diff = (pred-actual)
    gradient = ex.T@diff
    #(actual - pred)*ex.T
    return gradient
    


In [8]:
#create extended x matrix and return corresponding w
w, ex = returnX(trainingX)

In [9]:
def backtrack(ex,y,alpha):
    #backtracking algorithm to calculate step size
    pred = prediction(ex,alpha)
    mu = 0.2
    beta = 0.8
    #starting step size
    step = 0.001
    #initalise first alpha
    alphaZer = alpha
    
    #initialised gradients of first alpha
    alphaZerGrad = gradient(y,pred,ex)
    
    #f is cost function

    f = lambda alpha:costFunction(prediction(ex,alpha),y)
    i = 0
    
    #keep making step size smaller until it meets condition
    while f(alphaZer - step*alphaZerGrad)>=f(alphaZer)-step*mu*numpy.linalg.norm(alphaZerGrad)**2:
        step = beta*step
        i+= 1
    print('step size from back tracking {} '.format(step))
    return step
    


In [10]:
def nesterov(ex, y):
    #nesterov gradient descent algorithm
    costs = []
    #initialise alpha vector
    alpha = numpy.random.normal(0, 1, ex.shape[1])
    #calculate gradient
    delta_f = lambda alpha: gradient(y, prediction(ex, alpha),ex)

    # Initialise variables

    #initialise step size using
    step = backtrack(ex,y,alpha) 
    
#     step = 1e-5
    
    lam = 0.9

    velocity = 0
    converged = False
    for i in range(256):
        # update velocity gradient and use to update alpha
        
        velocityNew = lam * velocity + step * delta_f(alpha - lam * velocity)
        
        alpha = alpha - velocityNew
        velocity = velocityNew

        # Get cost and check convergence
        costs.append(costFunction(prediction(ex, alpha),y))
        
        
        #check for convergence
        if len(costs) >= 8 and numpy.all(numpy.isclose(costs[-8:], costs[-1])):
            converged = True
            break

    return alpha, costs[-1], prediction(ex, alpha)

In [11]:
# A test/train split - train with [0:split], test with [split:len(gmb)]
split = int(len(gmb) * 0.3) # Have a lot of data, and don't want you waiting around too long to train!
print('Using {} sentences for training'.format(split))


def train_tag_model(tag):    
#train each tag and store probability to dictionary

    tagNum = rpos_to_num[tag]
    
    start = time.time()
    
    #create training y for each tag. make 1 if tag corresponds to training tag else make 0
    y = np.copy(trainingY)
    
    for i in range(y.shape[0]):
        if y[i] == tagNum:
            y[i] = 1
        else:
            y[i] = 0
    #y[y!=tagNum]=int(0)
    #y[y==tagNum]=int(1)
    
    
    alpha, cost, finalPrediction = nesterov(ex, y)

    
    end = time.time()
    
    print('  (took {:g} seconds)'.format(end-start))
    
    return alpha
    

    
   
 
# Code to train a model for each reduced POS tag...
rpos_model = {}
for tag in rpos_desc:
    print('Training {}'.format(tag))
    rpos_model[tag] = train_tag_model(tag)

alphaDict = rpos_model 

Using 14387 sentences for training
Training C
  (took 77.3458 seconds)
Training 0
  (took 85.6905 seconds)
Training D
  (took 76.3827 seconds)
Training E
  (took 77.4351 seconds)
Training I
  (took 68.995 seconds)
Training J
  (took 68.9006 seconds)
Training N
  (took 69.1477 seconds)
Training P
  (took 69.7048 seconds)
Training S
  (took 69.0622 seconds)
Training M
  (took 68.4559 seconds)
Training R
  (took 76.0234 seconds)
Training Z
  (took 71.8655 seconds)
Training T
  (took 69.9556 seconds)
Training V
  (took 71.6595 seconds)
Training A
  (took 76.1969 seconds)
Training .
  (took 70.936 seconds)


In [12]:
# next we will train the models above, then estimate POS tags...
def token_pos(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    Basically calls the models for each tag and selects the tag with the
    highest probability."""
    
    #create ex word vector for each word in sentence. length is num words by 601

    xList = []
    wordsList = []
    for word in sentence:
        x = glove.decode(word)
        xList.append(x)
    x = np.array(xList)
 
    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    ex = numpy.append(x,nf, axis=1)
    
                                
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)
    
    predictionArray = np.zeros([len(sentence),16])
    
    #calculate prediction for each tag using alphas calculated 
    #returns matrix of number of words by 16
    for tag, alpha in alphaDict.items():
        pred = prediction(ex,alpha)
        tagNum = rpos_to_num[tag]
        predictionArray[:,tagNum] = pred
    
    #take maximum value of tag as the final prediction and append to a list for each word

    wordPos = []
    for i in range(predictionArray.shape[0]):

        bestProb = np.argmax(predictionArray[i])
        
        rpos = num_to_rpos[bestProb]

        wordPos.append(rpos)



    return wordPos

In [13]:
# Code to test the performance of POS tagger...
correct = 0
tested = 0
pershown = 0

# can finish earlier if taking too long
stop_percent = 100

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = token_pos(gmb[i])

    truth = gmb.pos(i)
    
    for g,t in zip(guess, truth):
        if g==pos_to_rpos[t]:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 93.0%
  (took 87.3445 seconds)




# Part of speech tagging - sentence level

While the previous step works very well we need POS tags to be super accurate, as everything else depends on them. We will now introduce context. This is done by calculating transition probabilities between tags and solving a Markov random chain using the forward-backwards algorithm to find the maximum a posteriori (MAP) POS tag assignment for the entire sentence. The adjacency matrix contains $\log P(\textrm{second pos tag} | \textrm{first pos tag})$.


In [14]:
def CreateAdjacencies():
    
    #create matrix which calculates the adjacent probabilityies.
    
    #initialise matrix
    adMat = np.ones([16,16])
    
    #loop through all words and count the number of times one tag follows another
    wordNum = 0
    for i in range(len(posWordList)-1):
            
        word = posWordList[i]
        nextWord = posWordList[i+1]

        adMat[word,nextWord]+=1
        
        
    #normalise by taking percentage for word
    posSum = np.sum(adMat,axis=1).reshape(-1,1)
    

    adMat = adMat/posSum
 
    #return log of the matrix
    adMat = np.log(adMat)
    
    return adMat

admat = CreateAdjacencies()   



In [15]:
def emissionMatrix(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    Basically calls the models for each tag and selects the tag with the
    highest probability."""
    
    
    #create emission matrix using part one alphas to predict tag for each word without context

    xList = []
    wordsList = []
    for word in sentence:

        x = glove.decode(word)
        xList.append(x)
    x = np.array(xList)

    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    ex = numpy.append(x,nf, axis=1)
    
                                
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)
    
    #calculate probability of word for each tag
    predictionArray = np.zeros([len(sentence),16])
    
    for tag, alpha in alphaDict.items():
        pred = prediction(ex,alpha)
        tagNum = rpos_to_num[tag]
        predictionArray[:,tagNum] = pred
    
    sumPred = np.sum(predictionArray,axis=1).reshape(-1,1)
    
    predictionArray = predictionArray/sumPred
   
    
    return predictionArray



In [16]:
def sentence_pos(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    A more advanced version of token_pos that uses neighbours as well."""
    
   #use forward backward algorithm to calculate the conditional probability of tag given the rest of the words in the sentence
    
    fwd = []
    f_prev = {}
    #return emission matrix
    emMatrix = emissionMatrix(sentence)
    emMatrix = np.log(emMatrix)
    
    forwards = np.zeros([len(sentence),len(rpos_to_num)])
    #loops through time step and states
    
    #tags are states and words are observable
    for t, word in enumerate(sentence):
        f_curr = {}
    
        if t == 0:
            # base case for the forward part

            forwards[t] = emMatrix[t]

        else:
            
            for key, s  in rpos_to_num.items():
                prev_f_sum = np.zeros(len(rpos_to_num))
                for key, sPrev in rpos_to_num.items():
                    prev_f_sum[sPrev] = forwards[t-1,sPrev]+admat[sPrev,s]+emMatrix[t,s]

                forwards[t,s] = numpy.max(prev_f_sum)



    #backwards
    backwards = np.zeros([len(sentence),len(rpos_to_num)])
    
    for t in reversed(range(len(sentence))):
    #for st in rpos_desc:

        if t == len(sentence)-1:
            # base case uses the forwards 

            backwards[t] = forwards[t]

        else:
            
            for key, s  in rpos_to_num.items():
                prev_f_sum = np.zeros(len(rpos_to_num))
                for key, sPrev in rpos_to_num.items():
                    prev_f_sum[sPrev] = backwards[t+1,sPrev]+admat[s,sPrev]+emMatrix[t,s]

                backwards[t,s] = numpy.max(prev_f_sum)
    
    #return highest index of backward as prediction for the tag
    finalPredIndex = numpy.argmax(backwards, axis=1)
    
    finalPred = []
    
    for i in finalPredIndex:
        finalPred.append(num_to_rpos[i])
        
        
    return finalPred
    
    


In [17]:
# Code to test the performance of your improved POS tagger...
correct = 0
tested = 0
pershown = 0
stop_percent = 100

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = sentence_pos(gmb[i])
    truth = gmb.pos(i)
    
    for g,t in zip(guess, truth):
        if g==pos_to_rpos[t]:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 93.6%
  (took 496.841 seconds)


# Named entity recognition

The next step is to identify names, that is the entities that "facts" may apply to. While training a further classifier does work (same as above, inc. dynamic programming) there would be little point in repeating the exercise. Instead, a simple rule based approach using *regular expressions* is going to be used.


Given part of speech tagging a name can be defined as:
* An optional *determiner*, e.g. *the* (1 or none)
* An arbitrary number of *adjectives* (could be none)
* A single *noun*


In [18]:
def sentence_ner(sentence, pos):
    """Given a sentence as a list of tokens and it's part of speech tags
    this returns a list of the same length with True wherever it thinks
    there is a name."""
    
    #return array of true and false for each word if it follows the rules above
    ret = [False] * len(sentence)
    
    #regular expression for names
    regex = 'D?J*N'
    
    #create string of tags
    joinPos = ''.join(pos)
    #check for regex matches in the string
    for match in re.finditer(regex, joinPos):
        start = match.start()
        end = match.end()
        #for all matches make value true 
        for a in range(start,end):
            ret[a] = True
    

    
    return ret


In [19]:
# Code to test the performance of the NER tagger...
correct = 0
tested = 0
pershown = 0
stop_percent = 100 # If you want faster feedback you can reduce this

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = sentence_ner(gmb[i], [pos_to_rpos[p] for p in gmb.pos(i)])
    truth = [ner!='O' for ner in gmb.ner(i)]
    
    for g,t in zip(guess, truth):
        if g==t:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 63.4%
  (took 0.532609 seconds)


# Relation extraction

This is where the paper "*Identifying Relations for Open Information Extraction*" comes in, specifically one of its novel contributions. It extracts relations using this procedure:
1. Find relation text by matching a human-designed pattern to the POS tags
2. Identify the named entities to the left and right of the relation text.
3. Generate the relation tuple (left named entity, relation text, right named entity)

(all previous approaches found names then relations - turns out it works much better the other way around)

Relation text is identified as:
`(Ve (Wo* Pa)?)+`
where
* `Ve = Verb Particle? Adverb?`
* `Wo = Noun | Adjective | Adverb | Pronoun | Determiner`
* `Pa = Preposition or subordinating conjunction | Particle`
* `| =` or, so either of the options
* `? =` optional
* `+ =` at least one, but can be many
* `* =` an arbitrary number of repetitions, including the option for none.

We will need to convert the above rules into a regular expression

In [20]:
def extract(sentence):
    """Given a sentence, as a list of tokens, this returns a list of all relations
    extracted from the sentence. Each relation is a tuple with three entries:
    (named entity one, relation, named entity two)"""
    pos = sentence_pos(sentence)

    ner = sentence_ner(sentence, pos)
    
    #create regex expression for relation
    ret = []
    Ve = '(VZ?R?)'
    Wo = '(N|J|R|M|D)'
    Pa = '(I|Z)'
    regex = '({}({}*{})?)+'.format(Ve,Wo,Pa)
    
    #create string of tags
    joinPos = ''.join(pos)
    
    for match in re.finditer(regex, joinPos):
        

        start = match.start()
        
        end = match.end()
        
        relation = sentence[start:end]
        
        #check value before and after the relation string to check if the named entity before or after
        leftEnt = []
        leftInd = []
        
        #check if tags before the relation are named entities
        for i in reversed(range(0,start)):
            if ner[i] ==True:
                leftInd.append(i)
                
            else:
                break
        leftInd = np.sort(leftInd)
        
        
        for j in range(len(leftInd)):
            
            leftEnt.append(sentence[j])
            
        #check if tags after the relation are named entities
        rightEnt = []
     
        for i in range(end,len(sentence)):
            
            
            if ner[i] ==True:
                rightEnt.append(sentence[i])
            else:
                break
   
        if len(leftInd) == 0 or len(rightEnt)==0:
            break
        else:
            #create tuple of named left, named right and relation if left and right exist
            retTup = (' '.join(leftEnt),' '.join(relation),' '.join(rightEnt))
           
            ret.append(retTup)
        
    
    return ret


In [21]:
# Small test of the above...
tests = ['London is full of pigeons.',
         'In 1781 William Herschel discovered Uranus', # 1
         "Trolls really don't like the sun.",
         'Giant owls would enjoy eatting people.',
         "Dragons collect gold, but they don't make microprocessors."] # 2

# 1. Seems to miss William - misclassified it, at least with the model answer.
# 2. Should extract two facts, first sensible, second absurd.

for sentence in tests:
    print(sentence)
    tokens = ogonek.Tokenise(sentence)
    
    rels = extract(tokens[0])
    for rel in rels:
        print('  ' + ' -- '.join(rel))
    print()
    

London is full of pigeons.
  London -- is full of -- pigeons

In 1781 William Herschel discovered Uranus
  In 1781 William -- discovered -- Uranus

Trolls really don't like the sun.

Giant owls would enjoy eatting people.

Dragons collect gold, but they don't make microprocessors.
  Dragons -- collect -- gold



## 20000 relations under the seas

While the above may have tested each step of the system, the below code runs it on the book "*20,000 leagues under the seas*" by Jules Verne (widely considered to be the first science fiction book, and full of fairly dubious claims about what goes on underwater).

In [None]:
for index in range(len(under_the_seas)):
    sentence = under_the_seas[index]
    rels = extract(sentence)
    
    if len(rels)>0:
        print(' '.join(sentence))
        for rel in rels:
            print('  ' + ' -- '.join(rel))
        print()