# Open Information Extraction


Here will be implementing an *open information extraction* system, almost entirely from scratch. Information extraction takes a body of freeform text and extracts the contained information in a computer interpretable form. The word *open* simply means that the text/facts are arbitrary, so it will work with any input rather than a specific domain (e.g. legal texts).

As an example, given the input:

> "Trolls really don't like the sun."
  

you may extract the "fact":
```
('Trolls', 'do not like', 'the sun')
```

The approach is based on the paper "*Identifying Relations for Open Information Extraction*", by Fader, Soderland & Etzioni.

The steps of the system are as follows:
*  Tokenise and split on sentences
1. Part of speech tagging - token level
2. Part of speech tagging - sentence level
3. Named entity resolution
4. Relation extraction


*  Summarise "*20,000 leagues under the seas*" by Jules Verne *(provided)*

A simple NLP library, called `ogonek`, is used. It has some basic functionality that we will require

Its documentation can be found below in a markdown cell.

## Ogonek

A tiny NLP library, that contains exactly the functionality I don't want you to implement for this coursework!



### Tokenisation and sentence splitting
`ogonek.Tokenise()`

A class that tokenises some text and splits it into sentences. Construct an instance with `tokens = ogonek.Tokenise('My text')`; it then has the same interface as a list of lists:
* `len(tokens)`: Number of extracted sentences (not words)
* `tokens[i]`: Sentence i, where i ranges from 0 to one less than `len(tokens)`. A sentence is a list of tokens.



### Word vectors
`ogonek.Glove()`

Constructing a `glove = ogonek.Glove()` object loads a heavily pruned Glove word vectors from the file `baby_glove.zip` into memory, and will then translate tokens into word vectors. Note that it automatically lowercases any token it is handed, so you don't need to. Has the following interface:
* `glove.len_vec()` - Returns the length of the word vectors; should be 300.
* `len(glove)` - Returns how many word vectors it knows of.
* `token in glove` - Returns `True` if it has a word vector for that token, `False` otherwise.
* `glove[token]` - Returns the word vector for the given token; raises an error if it does not have one.
* `glove.decode(token)` - Returns the word vector for the given token, but if the word vector is unknown returns a vector of zeros instead (silent failure).
* `glove.decodes(list of tokens)` - Returns a list of word vectors, one for each token. Has the same silent failure behaviour as `decode`.



### Groningen Meaning Bank dataset
`ogonek.GMB()`

Provides access to the Groningen Meaning Bank dataset, which is supplied in the file `ner_dataset.csv`. Replicates the interface of the tokenisation system as far as it can. Construct with `gmb = ogonek.GMB()`; has the following interface:
* `len(gmb)`: Number of sentences (not words) in data set
* `gmb[i]`: Sentence i, where i ranges from 0 to one less than `len(gmb)`. A sentence is a list of tokens.
* `gmb.pos(i)`: A list of POS tags that match with sentence i. Note that these are the full Penn Treebank tags (not the reduced set used below).
* `gmb.ner(i)`: A list of named entities that match with sentence i. Using outside-inside scheme.



### Pretty printing

`ogonek.aligned_print(*)` takes multiple lists and prints them out, aligning them so that all elements in position 0 of all lists are aligned vertically (extra space added as required), and then elements in position 1 and so on. For showing tags and a sentence with everything aligned. Also does word wrap and colour coding.

In [1]:
%matplotlib inline

import time
import string
import re

import numpy as np
import matplotlib.pyplot as plt
import numpy

import ogonek

import warnings
warnings.filterwarnings("ignore")

## Useful variables

In [2]:
mb = ogonek.GMB()

In [3]:
# Dictionary giving descriptions of the reduced part of speech tags...
rpos_desc = {'C' : 'Coordinating conjunction',
             '0' : 'Cardinal number',
             'D' : 'Determiner',
             'E' : 'Existential there',
             'I' : 'Preposition or subordinating conjunction',
             'J' : 'Adjective',
             'N' : 'Noun',
             'P' : 'Predeterminer',
             'S' : 'Possessive ending',
             'M' : 'Pronoun',
             'R' : 'Adverb',
             'Z' : 'Particle',
             'T' : 'to',
             'V' : 'Verb',
             'A' : 'Anything else',
             '.' : 'All punctuation'}



# Reduced list of part of speech tags as a list...
num_to_rpos = ['C', '0', 'D', 'E', 'I', 'J', 'N', 'P',
               'S', 'M', 'R', 'Z', 'T', 'V', 'A', '.']



# Dictionary that maps a reduced part of speech
# tag to it's index in the above list; useful for vectors/matrices etc...
rpos_to_num = {'C' : 0,
               '0' : 1,
               'D' : 2,
               'E' : 3,
               'I' : 4,
               'J' : 5,
               'N' : 6,
               'P' : 7,
               'S' : 8,
               'M' : 9,
               'R' : 10,
               'Z' : 11,
               'T' : 12,
               'V' : 13,
               'A' : 14,
               '.' : 15}



# Dictionary that maps the full part of speech tags to the reduced set...
pos_to_rpos = {'CC' : 'C',
               'CD' : '0',
               'DT' : 'D',
               'EX' : 'E',
               'FW' : 'A',
               'IN' : 'I',
               'JJ' : 'J',
               'JJR' : 'J',
               'JJS' : 'J',
               'LS' : 'A',
               'MD' : 'A',
               'NN' : 'N',
               'NNS' : 'N',
               'NNP' : 'N',
               'NNPS' : 'N',
               'PDT' : 'P',
               'POS' : 'S',
               'PRP' : 'M',
               'PRP$' : 'M',
               'RB' : 'R',
               'RBR' : 'R',
               'RBS' : 'R',
               'RP' : 'Z',
               'SYM' : 'A',
               'TO' : 'T',
               'UH' : 'A',
               'VB' : 'V',
               'VBD' : 'V',
               'VBG' : 'V',
               'VBN' : 'V',
               'VBP' : 'V',
               'VBZ' : 'V',
               'WDT' : 'D',
               'WP' : 'M',
               'WP$' : 'S',
               'WRB' : 'R',
               '-' : '.',
               'LRB' : '.',
               'RRB' : '.',
               '``' : '.',
               '"' : '.',
               '.' : '.',
               ',' : '.',
               ';' : '.',
               ':' : '.',
               '$' : '.'}    



## Load book, tokenise and split on sentences
The below code reads in the book, chops it down to just the text of the book, and then tokenises it using the provided `ogonek` library.

In [4]:
# Loop file, only keeping lines between indicators...
lines = []
record = False

with open('20,000 Leagues Under the Seas.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        if record:
            if line.startswith('***END OF THE PROJECT GUTENBERG'):
                break
      
            lines.append(line)
    
        else:
            if line.startswith('***START OF THE PROJECT GUTENBERG'):
                record = True

text = ''.join(lines)


# Tokenise...
under_the_seas = ogonek.Tokenise(text)


# Print 10 random sentences to check it worked...
numpy.random.seed(0)

for i in range(10):
    toks = numpy.random.choice(under_the_seas)
    print('{:02d}. {}'.format(i+1, ' '.join(toks)))


01. It was the regime of verticality .
02. Now then , the tides are not strong in the Pacific , and if you can not unballast the Nautilus , which seems impossible to me , I do not see how it will float off . "
03. Captain Nemo left the cave , and we climbed back up the bank of shellfish in the midst of these clear waters not yet disturbed by divers at work .
04. Likewise the pilothouse and the beacon housing were withdrawn into the hull until they lay exactly flush with it .
05. Instead of digging all around the Nautilus , which would have entailed even greater difficulties , Captain Nemo had an immense trench outlined on the ice , eight meters from our port quarter .
06. We would not go five miles without bumping into a fellow countryman .
07. The oars , mast , and sail are in the skiff .
08. Under existing conditions some ten men at the most should be enough to operate it . "
09. Nobody appeared on our arrival .
10. We gasped .


# Part of speech tagging - token level

The goal here is to train a classifier that indicates which of the part of speech tags (the reduced set provided above) each word is. For this initial approach we're going to treat words (tokens) individually, without context. For features the Glove word vectors are going to be used (provided by `ogonek.Glove()`).

Instead of training a single classifier a slight modification of a random kitchen sink for each part of speech tag is going to be used. Specifically, a logistic random kitchen sink that indicates the probability that the word should be labelled with the associated tag. This is a *one vs all* classifier - you have a classifier for every tag, run them all on each word, and then select the tag with the highest probability (it's inconsistent - they won't sum to 1!). A logistic random kitchen sink is simply a normal kitchen sink that is pushed through a sigmoid function (in neural network terms, the final layer has a non-linearity),
$$\operatorname{Sig}(z) = \frac{1}{1 + e^{-z}}$$
such that the final binary classifier is
$$P(\textrm{tag}) = \operatorname{Sig}\left(\sum_{k \in K} \alpha_k \phi\left(\vec{x} \cdot \vec{w}_k\right)\right)$$
For the cost function we will be maximising the log likelihood of the dataset. This will require gradient descent; Nestorov, including backtracking line search to select the initial step size, to get all marks. We will be using 300 random features, in addition to the 300 provided by glove (total of 601 - bias term is the +1), as that keeps the resulting data matrix during training small enough that it completes reasonably quickly.

The Groningen Meaning Bank dataset has been provided; it can be accessed via the class `ogonek.GMB`. It includes lots of sentences, each as a list of tokens, plus part of speech tags as a list aligned with the sentence.
Source: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

In [5]:
# Load word vectors; in a seperate cell as this takes a couple seconds...
glove = ogonek.Glove()


# Groningen Meaning Bank dataset - a set of sentences each tagged
# with part of speech and named entity recognitiuon tags...
gmb = ogonek.GMB()
print('GMB sentences = {}'.format(len(gmb)))
print()


# Print out 5 random sentences from GMB with POS and NER tags, to illustrate the data...
numpy.random.seed(1)
for _ in range(5):
    i = numpy.random.randint(len(gmb))
    ogonek.aligned_print(gmb[i], gmb.pos(i), gmb.ner(i))


GMB sentences = 47959

[0mThe U.S.  space agency is  making final preparations to launch the first 
[31mDT  NNP   NN    NN     VBZ VBG    JJ    NNS          TO VB     DT  JJ    
[34mO   B-geo O     O      O   O      O     O            O  O      O   O     
[0m
[0mdirect space probe to the distant planet of Pluto . 
[31mJJ     NN    NN    TO DT  JJ      NN     IN NNP   . 
[34mO      O     O     O  O   O       O      O  B-geo O 
[0m
[0mOn Monday , the freighter Torgelow was hijacked off the eastern coast of 
[31mIN NNP    , DT  NN        NNP      VBD VBN      IN  DT  JJ      NN    IN 
[34mO  B-tim  O O   O         B-art    O   O        O   O   O       O     O  
[0m
[0mSomalia . 
[31mNNP     . 
[34mB-geo   O 
[0m
[0mChile and Bolivia are associate members . 
[31mNNP   CC  NNP     VBP JJ        NNS     . 
[34mB-gpe O   B-gpe   O   O         O       O 
[0m
[0mVenezuela has freed 11 Colombian soldiers who had been detained after entering 
[31mNNP       VBZ VBN   CD JJ   

In [6]:
#create training data
def createTrainingSet():
    #create list of arrays representing the word vectors of 300 
    gloveList = []
    #create list of tags (ys)
    posWordList = []
    sentenceNum = 0
    
    for sentence in gmb:
        sentencePosList = []
        sentencePosList = gmb.pos(sentenceNum)
        sentenceNum += 1

        wordNum = 0
        for word in sentence:
        #find glove vector for each word
            if word in glove:
                
                gloveList.append(glove[word])

                pos = sentencePosList[wordNum]

                posWordList.append(rpos_to_num[pos_to_rpos[pos]])
                wordNum+=1

    return gloveList,posWordList
#glovelist corresponds to x
#posWordList corresponds to y
gloveList,posWordList = createTrainingSet()

#create training set as 0.3 of data
trainingX = gloveList[:(int(0.3*len(gloveList)))]
trainingY = posWordList[:(int(0.3*len(posWordList)))]


In [7]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

#kitchen sink
#x is the glovek vector len 300
#k is the feature vector len 300 + 1 bias

def returnX (x, K = 300): # Extend x with K new features . . . 
    #create kitchen sink
    #create extended x vectors. contain 300 random value vector, 300 dim glove word vector and one bias term
    x = np.array(x)
    #Random vectors
    w = numpy.random.standard_normal((K, x.shape[1])) 

    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    #extended vectors
    
    ex = numpy.append(x,nf, axis=1)
    
    #add bias term                            
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)

    return w, ex


                                
def prediction(ex, alpha):
    #make prediction using alpha
    pred = ex@alpha
    
    pred = np.clip(sigmoid(pred),1e-3,1-1e-3)

    return pred
                                
def costFunction(pred,y): 
    
    #cost function is minimising the negative likelihood 
    pred = np.array(pred)

    cost = -(y@numpy.log(pred)+(1-y)@numpy.log(1-pred))
    
    return cost

def gradient(actual,pred,ex):
    #calculate gradient of cost function
    diff = (pred-actual)
    gradient = ex.T@diff
    #(actual - pred)*ex.T
    return gradient
    


In [8]:
#create extended x matrix and return corresponding w
w, ex = returnX(trainingX)

In [9]:
def backtrack(ex,y,alpha):
    #backtracking algorithm to calculate step size
    pred = prediction(ex,alpha)
    mu = 0.2
    beta = 0.8
    #starting step size
    step = 0.001
    #initalise first alpha
    alphaZer = alpha
    
    #initialised gradients of first alpha
    alphaZerGrad = gradient(y,pred,ex)
    
    #f is cost function

    f = lambda alpha:costFunction(prediction(ex,alpha),y)
    i = 0
    
    #keep making step size smaller until it meets condition
    while f(alphaZer - step*alphaZerGrad)>=f(alphaZer)-step*mu*numpy.linalg.norm(alphaZerGrad)**2:
        step = beta*step
        i+= 1
    print('step size from back tracking {} '.format(step))
    return step
    


In [10]:
def nesterov(ex, y):
    #nesterov gradient descent algorithm
    costs = []
    #initialise alpha vector
    alpha = numpy.random.normal(0, 1, ex.shape[1])
    #calculate gradient
    delta_f = lambda alpha: gradient(y, prediction(ex, alpha),ex)

    # Initialise variables

    #initialise step size using
    step = backtrack(ex,y,alpha) 
    
#     step = 1e-5
    
    lam = 0.9

    velocity = 0
    converged = False
    for i in range(256):
        # update velocity gradient and use to update alpha
        
        velocityNew = lam * velocity + step * delta_f(alpha - lam * velocity)
        
        alpha = alpha - velocityNew
        velocity = velocityNew

        # Get cost and check convergence
        costs.append(costFunction(prediction(ex, alpha),y))
        
        
        #check for convergence
        if len(costs) >= 8 and numpy.all(numpy.isclose(costs[-8:], costs[-1])):
            converged = True
            break

    return alpha, costs[-1], prediction(ex, alpha)

In [11]:
# A test/train split - train with [0:split], test with [split:len(gmb)]
split = int(len(gmb) * 0.3) # Have a lot of data, and don't want you waiting around too long to train!
print('Using {} sentences for training'.format(split))


def train_tag_model(tag):    
#train each tag and store probability to dictionary

    tagNum = rpos_to_num[tag]
    
    start = time.time()
    
    #create training y for each tag. make 1 if tag corresponds to training tag else make 0
    y = np.copy(trainingY)
    
    for i in range(y.shape[0]):
        if y[i] == tagNum:
            y[i] = 1
        else:
            y[i] = 0
    #y[y!=tagNum]=int(0)
    #y[y==tagNum]=int(1)
    
    
    alpha, cost, finalPrediction = nesterov(ex, y)

    
    end = time.time()
    
    print('  (took {:g} seconds)'.format(end-start))
    
    return alpha
    

    
   
 
# Code to train a model for each reduced POS tag...
rpos_model = {}
for tag in rpos_desc:
    print('Training {}'.format(tag))
    rpos_model[tag] = train_tag_model(tag)

alphaDict = rpos_model 

Using 14387 sentences for training
Training C
  (took 77.3458 seconds)
Training 0
  (took 85.6905 seconds)
Training D
  (took 76.3827 seconds)
Training E
  (took 77.4351 seconds)
Training I
  (took 68.995 seconds)
Training J
  (took 68.9006 seconds)
Training N
  (took 69.1477 seconds)
Training P
  (took 69.7048 seconds)
Training S
  (took 69.0622 seconds)
Training M
  (took 68.4559 seconds)
Training R
  (took 76.0234 seconds)
Training Z
  (took 71.8655 seconds)
Training T
  (took 69.9556 seconds)
Training V
  (took 71.6595 seconds)
Training A
  (took 76.1969 seconds)
Training .
  (took 70.936 seconds)


In [12]:
# next we will train the models above, then estimate POS tags...
def token_pos(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    Basically calls the models for each tag and selects the tag with the
    highest probability."""
    
    #create ex word vector for each word in sentence. length is num words by 601

    xList = []
    wordsList = []
    for word in sentence:
        x = glove.decode(word)
        xList.append(x)
    x = np.array(xList)
 
    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    ex = numpy.append(x,nf, axis=1)
    
                                
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)
    
    predictionArray = np.zeros([len(sentence),16])
    
    #calculate prediction for each tag using alphas calculated 
    #returns matrix of number of words by 16
    for tag, alpha in alphaDict.items():
        pred = prediction(ex,alpha)
        tagNum = rpos_to_num[tag]
        predictionArray[:,tagNum] = pred
    
    #take maximum value of tag as the final prediction and append to a list for each word

    wordPos = []
    for i in range(predictionArray.shape[0]):

        bestProb = np.argmax(predictionArray[i])
        
        rpos = num_to_rpos[bestProb]

        wordPos.append(rpos)



    return wordPos

In [13]:
# Code to test the performance of POS tagger...
correct = 0
tested = 0
pershown = 0

# can finish earlier if taking too long
stop_percent = 100

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = token_pos(gmb[i])

    truth = gmb.pos(i)
    
    for g,t in zip(guess, truth):
        if g==pos_to_rpos[t]:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 93.0%
  (took 87.3445 seconds)




# Part of speech tagging - sentence level

While the previous step works very well we need POS tags to be super accurate, as everything else depends on them. We will now introduce context. This is done by calculating transition probabilities between tags and solving a Markov random chain using the forward-backwards algorithm to find the maximum a posteriori (MAP) POS tag assignment for the entire sentence. The adjacency matrix contains $\log P(\textrm{second pos tag} | \textrm{first pos tag})$.


In [14]:
def CreateAdjacencies():
    
    #create matrix which calculates the adjacent probabilityies.
    
    #initialise matrix
    adMat = np.ones([16,16])
    
    #loop through all words and count the number of times one tag follows another
    wordNum = 0
    for i in range(len(posWordList)-1):
            
        word = posWordList[i]
        nextWord = posWordList[i+1]

        adMat[word,nextWord]+=1
        
        
    #normalise by taking percentage for word
    posSum = np.sum(adMat,axis=1).reshape(-1,1)
    

    adMat = adMat/posSum
 
    #return log of the matrix
    adMat = np.log(adMat)
    
    return adMat

admat = CreateAdjacencies()   



In [15]:
def emissionMatrix(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    Basically calls the models for each tag and selects the tag with the
    highest probability."""
    
    
    #create emission matrix using part one alphas to predict tag for each word without context

    xList = []
    wordsList = []
    for word in sentence:

        x = glove.decode(word)
        xList.append(x)
    x = np.array(xList)

    nf = numpy.sin(numpy.einsum('ef,gf->eg', x , w))
    
    ex = numpy.append(x,nf, axis=1)
    
                                
    ones = np.ones(x.shape[0])[:,None]
    ex = numpy.append(ones,ex, axis=1)
    
    #calculate probability of word for each tag
    predictionArray = np.zeros([len(sentence),16])
    
    for tag, alpha in alphaDict.items():
        pred = prediction(ex,alpha)
        tagNum = rpos_to_num[tag]
        predictionArray[:,tagNum] = pred
    
    sumPred = np.sum(predictionArray,axis=1).reshape(-1,1)
    
    predictionArray = predictionArray/sumPred
   
    
    return predictionArray



In [16]:
def sentence_pos(sentence):
    """Given a sentence, as a list of tokens, this should return part of
    speech tags, as a list of strings (the codes in the rpos_desc dictionary).
    A more advanced version of token_pos that uses neighbours as well."""
    
   #use forward backward algorithm to calculate the conditional probability of tag given the rest of the words in the sentence
    
    fwd = []
    f_prev = {}
    #return emission matrix
    emMatrix = emissionMatrix(sentence)
    emMatrix = np.log(emMatrix)
    
    forwards = np.zeros([len(sentence),len(rpos_to_num)])
    #loops through time step and states
    
    #tags are states and words are observable
    for t, word in enumerate(sentence):
        f_curr = {}
    
        if t == 0:
            # base case for the forward part

            forwards[t] = emMatrix[t]

        else:
            
            for key, s  in rpos_to_num.items():
                prev_f_sum = np.zeros(len(rpos_to_num))
                for key, sPrev in rpos_to_num.items():
                    prev_f_sum[sPrev] = forwards[t-1,sPrev]+admat[sPrev,s]+emMatrix[t,s]

                forwards[t,s] = numpy.max(prev_f_sum)



    #backwards
    backwards = np.zeros([len(sentence),len(rpos_to_num)])
    
    for t in reversed(range(len(sentence))):
    #for st in rpos_desc:

        if t == len(sentence)-1:
            # base case uses the forwards 

            backwards[t] = forwards[t]

        else:
            
            for key, s  in rpos_to_num.items():
                prev_f_sum = np.zeros(len(rpos_to_num))
                for key, sPrev in rpos_to_num.items():
                    prev_f_sum[sPrev] = backwards[t+1,sPrev]+admat[s,sPrev]+emMatrix[t,s]

                backwards[t,s] = numpy.max(prev_f_sum)
    
    #return highest index of backward as prediction for the tag
    finalPredIndex = numpy.argmax(backwards, axis=1)
    
    finalPred = []
    
    for i in finalPredIndex:
        finalPred.append(num_to_rpos[i])
        
        
    return finalPred
    
    


In [17]:
# Code to test the performance of your improved POS tagger...
correct = 0
tested = 0
pershown = 0
stop_percent = 100

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = sentence_pos(gmb[i])
    truth = gmb.pos(i)
    
    for g,t in zip(guess, truth):
        if g==pos_to_rpos[t]:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 93.6%
  (took 496.841 seconds)


# Named entity recognition

The next step is to identify names, that is the entities that "facts" may apply to. While training a further classifier does work (same as above, inc. dynamic programming) there would be little point in repeating the exercise. Instead, a simple rule based approach using *regular expressions* is going to be used.


Given part of speech tagging a name can be defined as:
* An optional *determiner*, e.g. *the* (1 or none)
* An arbitrary number of *adjectives* (could be none)
* A single *noun*


In [18]:
def sentence_ner(sentence, pos):
    """Given a sentence as a list of tokens and it's part of speech tags
    this returns a list of the same length with True wherever it thinks
    there is a name."""
    
    #return array of true and false for each word if it follows the rules above
    ret = [False] * len(sentence)
    
    #regular expression for names
    regex = 'D?J*N'
    
    #create string of tags
    joinPos = ''.join(pos)
    #check for regex matches in the string
    for match in re.finditer(regex, joinPos):
        start = match.start()
        end = match.end()
        #for all matches make value true 
        for a in range(start,end):
            ret[a] = True
    

    
    return ret


In [19]:
# Code to test the performance of the NER tagger...
correct = 0
tested = 0
pershown = 0
stop_percent = 100 # If you want faster feedback you can reduce this

start = time.time()
for i in range(split, len(gmb)):
    percent = int(100 * (i - split) / (len(gmb) - split))
    if percent>pershown:
        pershown = percent
        print('\r{: 3d}%'.format(percent), end='')
    
    if percent>=stop_percent:
        break
    
    guess = sentence_ner(gmb[i], [pos_to_rpos[p] for p in gmb.pos(i)])
    truth = [ner!='O' for ner in gmb.ner(i)]
    
    for g,t in zip(guess, truth):
        if g==t:
            correct += 1
        tested += 1
end = time.time()

print()
print('Percentage correct = {:.1f}%'.format(100 * correct / tested))
print('  (took {:g} seconds)'.format(end - start))


 99%
Percentage correct = 63.4%
  (took 0.532609 seconds)


# Relation extraction

This is where the paper "*Identifying Relations for Open Information Extraction*" comes in, specifically one of its novel contributions. It extracts relations using this procedure:
1. Find relation text by matching a human-designed pattern to the POS tags
2. Identify the named entities to the left and right of the relation text.
3. Generate the relation tuple (left named entity, relation text, right named entity)

(all previous approaches found names then relations - turns out it works much better the other way around)

Relation text is identified as:
`(Ve (Wo* Pa)?)+`
where
* `Ve = Verb Particle? Adverb?`
* `Wo = Noun | Adjective | Adverb | Pronoun | Determiner`
* `Pa = Preposition or subordinating conjunction | Particle`
* `| =` or, so either of the options
* `? =` optional
* `+ =` at least one, but can be many
* `* =` an arbitrary number of repetitions, including the option for none.

We will need to convert the above rules into a regular expression

In [20]:
def extract(sentence):
    """Given a sentence, as a list of tokens, this returns a list of all relations
    extracted from the sentence. Each relation is a tuple with three entries:
    (named entity one, relation, named entity two)"""
    pos = sentence_pos(sentence)

    ner = sentence_ner(sentence, pos)
    
    #create regex expression for relation
    ret = []
    Ve = '(VZ?R?)'
    Wo = '(N|J|R|M|D)'
    Pa = '(I|Z)'
    regex = '({}({}*{})?)+'.format(Ve,Wo,Pa)
    
    #create string of tags
    joinPos = ''.join(pos)
    
    for match in re.finditer(regex, joinPos):
        

        start = match.start()
        
        end = match.end()
        
        relation = sentence[start:end]
        
        #check value before and after the relation string to check if the named entity before or after
        leftEnt = []
        leftInd = []
        
        #check if tags before the relation are named entities
        for i in reversed(range(0,start)):
            if ner[i] ==True:
                leftInd.append(i)
                
            else:
                break
        leftInd = np.sort(leftInd)
        
        
        for j in range(len(leftInd)):
            
            leftEnt.append(sentence[j])
            
        #check if tags after the relation are named entities
        rightEnt = []
     
        for i in range(end,len(sentence)):
            
            
            if ner[i] ==True:
                rightEnt.append(sentence[i])
            else:
                break
   
        if len(leftInd) == 0 or len(rightEnt)==0:
            break
        else:
            #create tuple of named left, named right and relation if left and right exist
            retTup = (' '.join(leftEnt),' '.join(relation),' '.join(rightEnt))
           
            ret.append(retTup)
        
    
    return ret


In [21]:
# Small test of the above...
tests = ['London is full of pigeons.',
         'In 1781 William Herschel discovered Uranus', # 1
         "Trolls really don't like the sun.",
         'Giant owls would enjoy eatting people.',
         "Dragons collect gold, but they don't make microprocessors."] # 2

# 1. Seems to miss William - misclassified it, at least with the model answer.
# 2. Should extract two facts, first sensible, second absurd.

for sentence in tests:
    print(sentence)
    tokens = ogonek.Tokenise(sentence)
    
    rels = extract(tokens[0])
    for rel in rels:
        print('  ' + ' -- '.join(rel))
    print()
    

London is full of pigeons.
  London -- is full of -- pigeons

In 1781 William Herschel discovered Uranus
  In 1781 William -- discovered -- Uranus

Trolls really don't like the sun.

Giant owls would enjoy eatting people.

Dragons collect gold, but they don't make microprocessors.
  Dragons -- collect -- gold



## 20000 relations under the seas

While the above may have tested each step of the system, the below code runs it on the book "*20,000 leagues under the seas*" by Jules Verne (widely considered to be the first science fiction book, and full of fairly dubious claims about what goes on underwater).

In [23]:
for index in range(len(under_the_seas)):
    sentence = under_the_seas[index]
    rels = extract(sentence)
    
    if len(rels)>0:
        print(' '.join(sentence))
        for rel in rels:
            print('  ' + ' -- '.join(rel))
        print()

6. At Full Steam 7. A Whale of Unknown Species 8. " Mobilis in Mobili " 9. The Tantrums of Ned Land 10. The Man of the Waters 11. The Nautilus 12. Everything through Electricity 13. Some Figures 14. The Black Current 15. An Invitation in Writing 16. Strolling the Plains 17. An Underwater Forest 18. Four Thousand Leagues Under the Pacific 19. Vanikoro 20. The Torres Strait 21. Some Days Ashore 22. The Lightning Bolts of Captain Nemo 23. " Aegri Somnia " 24. The Coral Realm SECOND PART 1. The Indian Ocean 2. A New Proposition from Captain Nemo 3. A Pearl Worth Ten Million 4. The Red Sea 5. Arabian Tunnel 6. The Greek Islands 7. The Mediterranean in Forty - Eight Hours 8. The Bay of Vigo 9. A Lost Continent 10. The Underwater Coalfields 11. The Sargasso Sea 12. Sperm Whales and Baleen Whales 13. The Ice Bank 14. The South Pole 15. Accident or Incident ?
  6. At Full Steam 7. A -- Everything through -- Electricity 13. Some Figures 14.

16. Shortage of Air 17. From Cape Horn to the Amazon 1

No transoceanic navigational undertaking has been conducted with more ability , no business dealings have been crowned with greater success .
  No transoceanic navigational undertaking -- has been conducted with -- more ability
  No transoceanic navigational -- have been crowned with -- greater success

Its paddle wheels were churning the sea with perfect steadiness .
  Its paddle -- were churning the sea with -- perfect steadiness

At 4:17 in the afternoon , during a high tea for passengers gathered in the main lounge , a collision occurred , scarcely noticeable on the whole , affecting the Scoti 's hull in that quarter a little astern of its port paddle wheel .
  At -- gathered in -- the main lounge

Fortunately this compartment did not contain the boilers , because their furnaces would have been abruptly extinguished .
  Fortunately this compartment -- did not contain -- the boilers

My departure for France was set for early May .
  My -- was set for -- early May

So only two possib

" All right then , my fine harpooner , if vertebrates several hundred meters long and proportionate in bulk live at such depths , their surface areas make up millions of square centimeters , and the pressure they undergo must be assessed in billions of kilograms .
  " All -- make up millions of -- square

Calculate , then , how much resistance of bone structure and strength of constitution they'd need in order to withstand such pressures ! "
  Calculate , -- need in -- order

" the Canadian replied , unconsciously echoing a famous catchphrase of the scientist Arago .
  " -- echoing a famous catchphrase of -- the scientist Arago

The Scoti 's accident was undeniable .
  The -- was -- undeniable

Now then , this hole did not make itself , and since it had not resulted from underwater rocks or underwater machines , it must have been caused by the perforating tool of some animal .
  Now then -- did not make -- itself

But one of them , the captain of the Monroe , knew that Ned Land had shi

The narwhale seemed motionless .
  The narwhale -- seemed -- motionless

The Abraham Lincoln stayed at half steam , advancing cautiously so as not to awaken its adversary .
  The Abraham Lincoln -- stayed at -- half steam

The frigate approached without making a sound , stopped two cable lengths from the animal and coasted .
  The frigate -- approached without making -- a sound

Barely twenty feet separated him from the motionless animal .
  Barely -- separated him from -- the motionless animal

Would Commander Farragut put a longboat to sea ?
  Would Commander Farragut -- put -- a longboat

My clothes were weighing me down .
  My -- were -- weighing

My mouth was filling with water .
  My -- was filling with -- water

" If master would oblige me by leaning on my shoulder , master will swim with much greater ease . "
  " If master -- me by -- leaning

But our circumstances were no less dreadful .
  But our -- were -- no less dreadful

Conseil had coolly reasoned out this hypothesis and

There silver - plated dinnerware gleamed under rays pouring from light fixtures in the ceiling , whose glare was softened and tempered by delicately painted designs .
  There -- pouring from -- light fixtures

In the center of this room stood a table , richly spread .
  In the -- stood -- a table

Captain Nemo indicated the place I was to occupy .
  Captain Nemo -- indicated -- the place

Feel free to sample all of these foods .
  Feel -- all of -- these foods

Your mattress was made from the ocea 's softest eelgrass .
  Your -- was made from -- the ocea

The sea is simply the vehicle for a prodigious , unearthly mode of existence ; it is simply movement and love ; it is living infinity , as one of your poets put it .
  The sea -- is simply the vehicle for -- a prodigious

The sea is a vast pool of nature .
  The sea -- is a vast pool of -- nature

Our globe began with the sea , so to speak , and who can say we wo not end with it !
  Our globe -- began with -- the sea

The sea does not

Then a door opened into the galley , 3 meters long and located between the vesse 's huge storage lockers .
  Then a -- opened into -- the galley

There , even more powerful and obedient than gas , electricity did most of the cooking .
  There -- did most of -- the cooking

Next to this galley was a bathroom , conveniently laid out , with faucets supplying hot or cold water at will .
  Next to -- was -- a bathroom

After the galley came the cre 's quarters , 5 meters long .
  After the -- came -- the cre

At the far end stood a fourth watertight bulkhead , separating the cre 's quarters from the engine room .
  At -- stood -- a fourth watertight bulkhead

In any event , every morning we sanitize the ship by ventilating it in the open air .
  In any -- we sanitize the ship by -- ventilating

Could its transmission have been immeasurably increased by some unknown system of levers ?
  Could -- have been -- immeasurably
  Could -- increased by -- some unknown system

It noticeably takes the

Little by little , the mists were dispersed under the action of the su 's rays .
  Little by -- were dispersed under -- the action

The radiant orb cleared the eastern horizon .
  The radiant orb -- cleared -- the eastern horizon

Under its gaze , the sea caught on fire like a trail of gunpowder .
  Under its -- caught on -- fire

Five days passed in this way with no change in our situation .
  Five -- passed in -- this way

The note was worded as follows :
  The note -- was worded as follows -- :

Professor Aronnax Aboard the Nautilus November 16 , 1867 Captain Nemo invites Professor Aronnax on a hunting trip that will take place tomorrow morning in his Crespo Island forests .
  Professor Aronnax -- Aboard -- the Nautilus November

" If Captain Nemo does sometimes go ashore , " I told them , " at least he only picks desert islands ! "
  " If -- does sometimes go -- ashore

Ned Land shook his head without replying ; then he and Conseil left me .
  Ned Land -- shook his head without -- 

Not a single object was visible past ten paces .
  Not a single -- was -- visible past ten paces

A wall of superb rocks stood before us , imposing in its sheer mass :
  A wall -- stood before -- us

a pile of gigantic stone blocks , an enormous granite cliffside pitted with dark caves but not offering a single gradient we could climb up .
  a pile of gigantic -- pitted with -- dark caves

Captain Nem 's companion picked up the animal , loaded it on his shoulder , and we took to the trail again .
  Captain -- picked up -- the animal

Heavy clouds passed above us , forming and fading swiftly .
  Heavy clouds -- passed above -- us

This incident did not interrupt our walk .
  This incident -- did not interrupt -- our walk

For two hours we were sometimes led over plains of sand , sometimes over prairies of seaweed that were quite arduous to cross .
  For -- we were sometimes led over -- plains

Luckily these voracious animals have poor eyesight .
  Luckily these voracious animals -- have

The Nautilus drew near Wailea Bay , an unlucky place for Englan 's Captain Dillon , who was the first to shed light on the longstanding mystery surrounding the disappearance of ships under the Count de La PΘrouse .
  The Nautilus -- drew near -- Wailea Bay

These mollusks belonged to the species known by name as Ostrea lamellosa , whose members are quite common off Corsica .
  These mollusks -- known by -- name
  These mollusks -- are quite common off -- Corsica

On December 25 the Nautilus navigated amid the island group of the New Hebrides , which the Portuguese seafarer Queir≤s discovered in 1606 , which Commander Bougainville explored in 1768 , and to which Captain Cook gave its current name in 1773. This group is chiefly made up of nine large islands and forms a 120 - league strip from the north - northwest to the south - southeast , lying between latitude 2 degrees and 15 degrees south , and between longitude 164 degrees and 168 degrees .
  On December 25 the -- discovered in -- 

The situation was indeed dangerous , but as if by magic , the Nautilus seemed to glide right down the middle of these rampaging reefs .
  The situation -- was -- indeed

The Nautilus drew near this island , which I can see to this day with its remarkable fringe of screw pines .
  The Nautilus -- drew near -- this island

The Nautilus had just struck a reef , and it remained motionless , listing slightly to port .
  The Nautilus -- had just struck -- a reef

To the south and east , heads of coral were already on display , left uncovered by the ebbing waters .
  To -- were already on -- display
  To -- uncovered by -- the ebbing waters

However , the ship had not suffered in any way , so solidly joined was its hull .
  However , -- had not suffered in -- any way

" No , Professor Aronnax , the Nautilus is not consigned to perdition .
  " No -- is not -- consigned

Now then , the tides are not strong in the Pacific , and if you can not unballast the Nautilus , which seems impossible to me

The morning shadows were lifting .
  The morning -- shadows were -- lifting

The island was soon on view through the dissolving mists , first its beaches , then its summits .
  The island -- was soon on -- view

They obviously were true Papuans , men of fine stock , athletic in build , forehead high and broad , nose large but not flat , teeth white .
  They -- were -- true Papuans

Their woolly , red - tinted hair was in sharp contrast to their bodies , which were black and glistening like those of Nubians .
  Their woolly -- was in -- sharp contrast

So the skiff did not leave shipside that day , much to the displeasure of Mr. Land who could not complete his provisions .
  So the -- did not leave shipside that -- day

" For two hours our fishing proceeded energetically but without bringing up any rarities .
  " For two -- proceeded -- energetically

Our dragnet was filled with Midas abalone , harp shells , obelisk snails , and especially the finest hammer shells I had seen to that day

Conseil was eager to accept , and this time the Canadian proved perfectly amenable to going with us .
  Conseil -- was -- eager

A coral is a unit of tiny animals assembled over a polypary that is brittle and stony in nature .
  A coral -- is a unit of -- tiny animals
  A coral -- assembled over -- a polypary

Our path was bordered by hopelessly tangled bushes , formed from snarls of shrubs all covered with little star - shaped , white - streaked flowers .
  Our path -- was bordered by -- hopelessly tangled bushes

Sheer chance had placed me in the presence of the most valuable specimens of this zoophyte .
  Sheer chance -- had placed me in -- the presence

This coral was the equal of those fished up from the Mediterranean off the Barbary Coast or the shores of France and Italy .
  This coral -- was the equal of -- those fished

Actual petrified thickets and long alcoves from some fantastic school of architecture kept opening up before our steps .
  Actual -- kept opening up before -- 

Several times we used our slanting fins , which internal levers could set at an oblique angle to our waterline .
  Several times -- we used -- our slanting fins

Its masting was visible for an instant , but it could not have seen the Nautilus because we were lying too low in the water .
  Its -- was visible for -- an instant

" But modern science has not endorsed these designations , and this mollusk is now known by the name argonaut .
  " But -- has not endorsed -- these designations
  " But -- is now known by -- the name argonaut

They belonged to that species of argonaut covered with protuberances and exclusive to the seas near India .
  They -- covered with -- protuberances

These graceful mollusks were swimming backward by means of their locomotive tubes , sucking water into these tubes and then expelling it .
  These graceful mollusks -- were swimming backward by -- means

" The argonaut is free to leave its shell , " I told Conseil , " but it never does . "
  " The -- is -- free

Captain Nemo left the cave , and we climbed back up the bank of shellfish in the midst of these clear waters not yet disturbed by divers at work .
  Captain Nemo -- left -- the cave

The shallows drew noticeably closer to the surface of the sea , and soon , walking in only a meter of water , my head passed well above the level of the ocean .
  The shallows -- drew -- noticeably

But this lofty plateau measured only a few fathoms , and soon we reentered Our Element .
  But this lofty -- measured only -- a few fathoms

A stone cut in the shape of a sugar loaf , which he gripped between his feet while a rope connected it to his boat , served to lower him more quickly to the ocean floor .
  A stone -- cut in -- the shape

This diver did not see us .
  This diver -- did not see -- us

His movements were systematically executed , and for half an hour no danger seemed to threaten him .
  His -- were -- systematically

A gigantic shadow appeared above the poor diver .
  A gigantic -- shadow ap

" In that case , captain , " Conseil said in all seriousness , " on the offchance that this creature might be the last of its line , would not it be advisable to spare its life , in the interests of science ? "
  " -- said in -- all seriousness

Just then , as mute and emotionless as ever , seven crewmen climbed onto the platform .
  Just -- climbed onto -- the platform

Six rowers sat on the thwarts , and the coxswain took the tiller .
  Six -- sat on -- the thwarts
  Six rowers -- took -- the tiller

The skiff pulled clear , and carried off by its six oars , it headed swiftly toward the dugong , which by then was floating two miles from the Nautilus .
  The skiff -- pulled -- clear

Harpoons used for hunting whales are usually attached to a very long rope that pays out quickly when the wounded animal drags it with him .
  Harpoons -- used for -- hunting whales

His body leaning slightly back , Ned Land brandished his harpoon with expert hands .
  His body -- brandished his harpoon wi

There , in place of natural wonders , the watery mass offered some thrilling and dreadful scenes to my eyes .
  There , in -- offered -- some thrilling

Compared to the vast liquid plains of the Pacific , the Mediterranean is a mere lake , but it is an unpredictable lake with fickle waves , today kindly and affectionate to those frail single - masters drifting between a double ultramarine of sky and water , tomorrow bad - tempered and turbulent , agitated by the winds , demolishing the strongest ships beneath sudden waves that smash down with a headlong wallop .
  Compared to -- is -- a mere lake

One of these boats made a dreadful first impression :
  One of -- made -- a dreadful

How many lives were dashed in this shipwreck !
  How many -- were dashed in -- this shipwreck !

How many victims were swept under the waves !
  How many -- were swept under -- the waves !

Meanwhile , briskly unconcerned , the Nautilus ran at full propeller through the midst of these ruins .
  Meanwhile , -

A few sails were on the horizon , no doubt ships going as far as Cape São Roque to find favorable winds for doubling the Cape of Good Hope .
  A few sails -- were on -- the horizon

The sky was overcast .
  The sky -- was -- overcast

A squall was on the way .
  A squall -- was on -- the way

" So far you've visited the ocean depths only by day and under sunlight .
  " -- visited the ocean depths only by -- day

In a few moments we had put on our equipment .
  In a few -- we had put on -- our equipment

The waters were profoundly dark , but Captain Nemo pointed to a reddish spot in the distance , a sort of wide glow shimmering about two miles from the Nautilus .
  The waters -- were -- profoundly

After half an hour of walking , the seafloor grew rocky .
  After half -- grew -- rocky

I glimpsed piles of stones covered by a couple million zoophytes and tangles of algae .
  I -- covered by -- a couple

Those piles of stones just mentioned were laid out on the ocean floor with a distinct

But will master tell me why this huge smelter suspended operations , and how it is that an oven was replaced by the tranquil waters of a lake ? "
  But will -- tell me why -- this huge smelter
  But will master -- suspended -- operations

Then the waters of the Atlantic rushed inside the mountain .
  Then the -- rushed inside -- the mountain

There ensued a dreadful struggle between the elements of fire and water , a struggle ending in King Neptun 's favor .
  There ensued -- ending in -- King Neptun

The gradients got steeper and narrower .
  The gradients -- got -- steeper

At an elevation of about thirty meters , the nature of the terrain changed without becoming any easier .
  At an -- changed without becoming -- any easier

Pudding stones and trachyte gave way to black basaltic rock :
  Pudding -- gave -- way

Then , among this basaltic rock , there snaked long , hardened lava flows inlaid with veins of bituminous coal and in places covered by wide carpets of sulfur .
  Then -- co

" You've never fished these seas , Ned ? "
  " -- never -- fished these seas

" So the southern right whale is still unknown to you .
  " So the southern -- is still -- unknown

And if one of these animals went from the Bering Strait to the Davis Strait , it is quite simply because ther 's some passageway from the one sea to the other , either along the coasts of Canada or Siberia . "
  And if -- went from -- the Bering Strait

the Canadian asked , tipping me a wink .
  the -- me -- a wink

Brandishing an imaginary harpoon , his hands positively trembled .
  Brandishing -- positively -- trembled

Those animals are only members of the genus Balaenoptera furnished with dorsal fins , and like sperm whales , they're generally smaller than the bowhead whale . "
  Those animals -- are only members of -- the genus Balaenoptera
  Those animals are -- furnished with -- dorsal fins

People mistake them for islets .
  People mistake -- them for -- islets

) People claim these animals can circle a

Even though the surface of the sea has solidified into ice , its lower strata are still open , thanks to that divine justice that puts the maximum density of salt water one degree above its freezing point .
  Even though -- has solidified into -- ice

" The Nautilus has huge air tanks ; we will fill them up and they will supply all the oxygen we need . "
  " The -- has -- huge air tanks

" Captain Nemo was right .
  " Captain -- was -- right

Near four o'clock Captain Nemo informed me that the platform hatches were about to be closed .
  Near four o'clock -- informed me that -- the platform hatches

The weather was fair , the skies reasonably clear , the cold quite brisk , namely - 12 degrees centigrade ; but after the wind had lulled , this temperature did not seem too unbearable .
  The weather -- was -- fair

This operation was swiftly executed because the fresh ice was still thin .
  This operation -- was swiftly executed because -- the fresh ice

The main ballast tanks were filled

The Nautilus had gone a few more miles during the night .
  The Nautilus -- had gone a few more miles during -- the night

The sky was growing brighter .
  The sky -- was growing -- brighter

Mists were rising from the cold surface of the water .
  Mists -- were rising from -- the cold surface

Captain Nemo headed toward the peak , which he no doubt planned to make his observatory .
  Captain Nemo -- headed toward -- the peak

For a man out of practice at treading land , the captain scaled the steepest slopes with a supple agility I could not equal , and which would have been envied by hunters of Pyrenees mountain goats .
  For a -- scaled the steepest slopes with -- a supple agility

jets of liquid rising like hundreds of magnificent bouquets .
  jets -- rising like -- hundreds

Captain Nemo had brought a spyglass with a reticular eyepiece , which corrected the su 's refraction by means of a mirror , and he used it to observe the orb sinking little by little along a very extended diag

Since conditions inside were universally unbearable , how eagerly , how happily , we put on our diving suits to take our turns working !
  Since -- inside were universally -- unbearable

Arms grew weary , hands were rubbed raw , but who cared about exhaustion , what difference were wounds ?
  Arms -- grew -- weary
  Arms -- were -- rubbed

Life - sustaining air reached our lungs !
  Life - -- reached -- our lungs !

Captain Nemo set the example and was foremost in submitting to this strict discipline .
  Captain Nemo -- set -- the example

Only two meters separated us from the open sea .
  Only -- separated us from -- the open sea

But the shi 's air tanks were nearly empty .
  But the -- were nearly -- empty

Headaches and staggering fits of dizziness made me reel like a drunk .
  Headaches -- made me reel like -- a drunk

My companions were experiencing the same symptoms .
  My -- were experiencing -- the same symptoms

That day , the sixth of our imprisonment , Captain Nemo conclude

In this locality a number of sea turtles were sleeping on the surface of the waves .
  In this -- were sleeping on -- the surface

In truth , this animal is a living fishhook , promising wealth and happiness to the greenest fisherman in the business .
  In truth -- is -- a living fishhook

Their tenacity was so great , they would rip apart rather than let go .
  Their -- was so -- great

In this way we caught several loggerheads , reptiles a meter wide and weighing 200 kilos .
  In this -- we caught -- several loggerheads

This fishing ended our stay in the waterways of the Amazon , and that evening the Nautilus took to the high seas once more .
  This fishing -- ended our stay in -- the waterways

For six months we had been prisoners aboard the Nautilus .
  For -- we had been -- prisoners
  For -- aboard -- the Nautilus

If this measure proved fruitless , it could arouse the captai 's suspicions , make our circumstances even more arduous , and jeopardize the Canadia 's plans .
  If th

The Nautilus kept descending .
  The Nautilus -- kept -- descending

In despair , poor Ned went into seclusion like Captain Nemo .
  In despair -- went into -- seclusion

How many casualties have been caused by these opaque mists !
  How many -- have been caused by -- these opaque mists !

How many collisions have occurred with these reefs , where the breaking surf is covered by the noise of the wind !
  How many -- have occurred with -- these reefs

These banks are the result of marine sedimentation , an extensive accumulation of organic waste brought either from the equator by the Gulf Strea 's current , or from the North Pole by the countercurrent of cold water that skirts the American coast .
  These banks -- are the result of -- marine sedimentation
  These banks -- brought either from -- the equator

Here , too , erratically drifting chunks collect from the ice breakup .
  Here -- collect from -- the ice breakup

Here a huge boneyard forms from fish , mollusks , and zoophytes dyi

His chest was heaving , swelling with sobs .
  His -- was -- heaving

Voices were answering each other hurriedly .
  Voices -- were answering -- each other hurriedly

Could a more frightening name have rung in our ears under more frightening circumstances ?
  Could a more frightening -- have rung in -- our ears

What crashes from the waters breaking against sharp rocks on the seafloor , where the hardest objects are smashed , where tree trunks are worn down and worked into " a shaggy fur , " as Norwegians express it !
  What crashes -- breaking against -- sharp rocks

The Nautilus defended itself like a human being .
  The Nautilus -- defended itself like -- a human

Its steel muscles were cracking .
  Its steel -- were -- cracking

The nuts gave way , and ripped out of its socket , the skiff was hurled like a stone from a sling into the midst of the vortex .
  The nuts -- gave -- way

My head struck against an iron timber , and with this violent shock I lost consciousness .
  My -- st