# Week 2: Parsing, Relation Extraction and Open Information Extraction
### COMP61332: Text Mining, Department of Computer Science, University of Manchester (Riza Batista-Navarro and Viktor Schlegel)


In this lab session, you will try out some Python code based on the **spaCy** library (https://spacy.io/) for the NLP tasks discussed in the Week 2 Lecture, as well as an application of NLP (Open Information Extraction or Open IE).
After this session, you should be able to:
- apply **part-of-speech (POS) tagging** on text
- apply **dependency parsing** on text
- develop rules for **extracting relations** from text
- develop rules for **extracting Open Information Extraction (Open IE) triples**
- explore and visualise knowledge extracted by Open IE in the form of a graph (optional)

You are provided with three text files (drawn from https://en.wikipedia.org/wiki/Timeline_of_historic_inventions), each containing a list of inventions (from the 1700s, 1800s and 1900s), that you can use for experimentation.

<font color="red">NOTE: If you are using Google Colab, upload the three text files using its in-built file browser.</font>


## Preparation of necessary packages

In [None]:
# Loading
!pip install spacy==3.0
!python -m spacy download en_core_web_sm

import spacy
from spacy.lang.en import English
from spacy.pipeline import Sentencizer


import en_core_web_sm

nlp = spacy.load('en_core_web_sm')

from spacy import displacy



## File loading

The function below takes as parameter the path to a file containing plain text.

In [None]:
import codecs

def load_file(path):
    file_text = ''
    # Open the file as UTF-8; the with-block closes it automatically
    with codecs.open(path, 'r', encoding='utf-8') as file:
        for line in file:
            # Text cleaning: remove newline characters so the text becomes one continuous string
            line = line.replace('\n', '')
            file_text = file_text + line
    return file_text



## Sentence segmentation

The code below is the same as the one we used in Week 1. Customise the list of sentence delimiters if necessary.

In [None]:
# Create a new NLP pipeline, specifying English as the language of interest so that English models are loaded.
nlp = English()

config = {"punct_chars": ["."]}

# Add the component to the pipeline.
sentencizer = nlp.add_pipe('sentencizer', config=config)

# Load the contents of a text file; change the parameter to use another/your own text file.
text = load_file('1800s.txt')

# The following line applies the pipeline (so far only sentence segmentation) on the given text, and stores the result in annotations.
annotations = nlp(text)

# Check the result of sentence segmentation.
sents_list = []
for sent in annotations.sents:
    sents_list.append(sent.text.strip())
    
# Check how many sentences were produced
print('Number of sentences: ', len(sents_list))

print('SENTENCE NO.\tSENTENCE:')
for i, sent in enumerate(sents_list):
    print(i, '\t', sent)



## Dependency parsing (with in-built POS tagging)

### One example for easier visualisation/debugging

The code below applies a dependency parser on a sentence. For each token, the following attributes are printed: 
- token.text (the token text)
- token.lemma_ (the base form of the token)
- token.pos_ (the POS tag according to the Universal Dependencies scheme; see https://universaldependencies.org/u/pos/)
- token.tag_ (the POS tag according to the Penn Treebank; see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- token.dep_ (the dependency type)
- child.text:child.dep_ (a list of dependents; the text and dependency type are displayed for each dependent)

Moreover, a **visualisation** of the tree is displayed, which can be helpful later, when you try to write your own rules.

In [None]:
# Example 1
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("Nicolas Appert invents the canning process for food.")
for token in doc:
    print(token.text + '\t' + token.lemma_ + '\t' + token.pos_ + '\t' + token.tag_ + '\t' + token.dep_ + '\t' + str([child.text + ':' + child.dep_ for child in token.children]))

displacy.render(doc, style='dep', jupyter=True)

### <font color='red'>Below, write down any observations you have based on the results of dependency parsing. For example, what kind of dependencies should you follow if you want to reach names of inventors or inventions from the main verb of a sentence?</font>
<br>
<br>
<br>
<br>
<br>
<br>

## Relation extraction

### Extracting relations using rules that process the results of dependency parsing

The code below uses very basic rules to extract **inventor-invention relations**.

In [None]:
# Create a class that will store every inventor-invention relation that we extract
class Relation:
    inventor = ''
    invention = ''

In [None]:
relations = []

for sent in sents_list:
    doc = nlp(sent)
    for token in doc:
        # For now, check if the root of the tree is a verb that is a variant of "invent"
        if token.dep_ == 'ROOT' and token.lemma_ == 'invent':
            r = Relation()
            dependents = token.children
            for d in dependents:
                if d.dep_ == 'dobj':
                    r.invention = d.text
                elif d.dep_ == 'nsubj':
                    r.inventor = d.text
            if r.inventor != '' and r.invention != '':
                print('Sentence:', sent)
                print(r.inventor, '-', r.invention)
                relations.append(r)

for r in relations:
    print(r.inventor, '-', r.invention)

### <font color='red'>The code below is the same as the one above. How will you extend the rules in order to improve the extracted relations? For example: (1) how can you handle verbs that are synonymous to "invent"? (2) How will you handle sentences written in the passive voice, e.g., "X was invented by Y"? </font>
### <font color='red'>You can test your idea and code using the above sample code "Example 1" with the sentence "The canning process for food was invented by Nicolas Appert.".</font>
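
One possible direction is sketched below on plain Python data rather than on spaCy objects, so that it runs without a model. It keeps a set of lemmas treated as synonyms of "invent", and for the passive voice it takes the passive subject (`nsubjpass`) as the invention and the object (`pobj`) of the `agent` preposition as the inventor. These labels are the ones the English spaCy models typically assign to passive constructions, but verify them on your own parses with Example 1. The `Node` class, the `INVENT_LEMMAS` set and the `extract_relation` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical stand-in for a spaCy token: text, lemma, dependency label, children
@dataclass
class Node:
    text: str
    lemma: str
    dep: str
    children: List["Node"] = field(default_factory=list)

# Assumed set of verbs treated as synonyms of "invent"; extend as needed
INVENT_LEMMAS = {"invent", "develop", "devise", "create", "patent"}

def extract_relation(root: Node) -> Optional[Tuple[str, str]]:
    """Return (inventor, invention) for an active or passive sentence, else None."""
    if root.dep != "ROOT" or root.lemma not in INVENT_LEMMAS:
        return None
    inventor, invention = "", ""
    for child in root.children:
        if child.dep == "nsubj":          # active voice: subject is the inventor
            inventor = child.text
        elif child.dep == "dobj":         # active voice: object is the invention
            invention = child.text
        elif child.dep == "nsubjpass":    # passive voice: subject is the invention
            invention = child.text
        elif child.dep == "agent":        # passive voice: "by" introduces the inventor
            for grandchild in child.children:
                if grandchild.dep == "pobj":
                    inventor = grandchild.text
    if inventor and invention:
        return inventor, invention
    return None

# Hand-built parse of "The process was invented by Appert."
passive = Node("invented", "invent", "ROOT", [
    Node("process", "process", "nsubjpass"),
    Node("was", "be", "auxpass"),
    Node("by", "by", "agent", [Node("Appert", "Appert", "pobj")]),
])
print(extract_relation(passive))
```

With spaCy, the same checks would be applied to `token.lemma_`, `token.dep_` and `token.children`.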

In [None]:
relations = []

for sent in sents_list:
    doc = nlp(sent)
    for token in doc:
        # For now, check if the root of the tree is a verb that is a variant of "invent"
        if token.dep_ == 'ROOT' and token.lemma_ == 'invent':
            r = Relation()
            dependents = token.children
            for d in dependents:
                if d.dep_ == 'dobj':
                    r.invention = d.text
                elif d.dep_ == 'nsubj':
                    r.inventor = d.text
            if r.inventor != '' and r.invention != '':
                print('Sentence:', sent)
                print(r.inventor, '-', r.invention)
                relations.append(r)

for r in relations:
    print(r.inventor, '-', r.invention)

### <font color='red'>Summarise below how you extended the code above to improve the extracted relations. In writing your new rules, what other verbs and/or sentence constructions did you consider?</font>
<br>
<br>
<br>
<br>
<br>
<br>

## Open Information Extraction

### Adapting the code above to generate ARG1-PREDICATE-ARG2 triples instead of relations

We can easily adapt the relation extraction code above to extract **ARG1-PREDICATE-ARG2 triples** (for Open Information Extraction) instead of relations. Note that for Open IE, we are interested in any kind of triples (not just those pertaining to inventors and their inventions).

In [None]:
# Example 2
# Create a class that will store every ARG1-PREDICATE-ARG2 triple that we extract
class Triple:
    arg1 = ''
    predicate = ''
    arg2 = ''

In [None]:
triples = []

for sent in sents_list:
    doc = nlp(sent)
    for token in doc:
        # For now, assume that the root of the tree is the predicate
        if token.dep_ == 'ROOT':
            t = Triple()
            # Store the lemmatised form (to normalise)
            t.predicate = token.lemma_
            dependents = token.children
            for d in dependents:
                if d.dep_ == 'dobj':
                    t.arg2 = d.text
                elif d.dep_ == 'nsubj':
                    t.arg1 = d.text
            if t.arg1 != '' and t.arg2 != '':
                print('Sentence:', sent)
                print(t.arg1, '-', t.predicate, '-', t.arg2)
                triples.append(t)

for t in triples:
    print(t.arg1, '-', t.predicate, '-', t.arg2)


### <font color='red'>The code below is the same as the one above. How will you extend it to improve the extracted triples? For example, can you try to extract full names of people (rather than just their last names)?</font>
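
As a starting point for multi-word arguments, the sketch below prepends `compound` modifiers to a head word. It works on hand-built (text, dependency label) pairs instead of spaCy tokens so that it runs without a model; with spaCy you could build the same list by iterating over `token.lefts` and reading each child's `dep_`. The function name `full_span` is hypothetical, and the assumption that first names attach to last names with the label `compound` should be checked against your own parses.

```python
from typing import List, Tuple

def full_span(head_text: str, left_children: List[Tuple[str, str]]) -> str:
    """Prepend `compound` modifiers (in sentence order) to the head word."""
    parts = [text for text, dep in left_children if dep == "compound"]
    parts.append(head_text)
    return " ".join(parts)

# Hand-built example: "Nicolas" is assumed to depend on "Appert" as `compound`
print(full_span("Appert", [("Nicolas", "compound")]))

# "the" is a determiner (`det`), so this rule leaves it out
print(full_span("process", [("the", "det"), ("canning", "compound")]))
```

Other labels (for example adjectival modifiers) could be added to the rule in the same way, at the risk of producing longer and noisier arguments.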

In [None]:
triples = []

for sent in sents_list:
    doc = nlp(sent)
    for token in doc:
        # For now, assume that the root of the tree is the predicate
        if token.dep_ == 'ROOT':
            t = Triple()
            # Store the lemmatised form (to normalise)
            t.predicate = token.lemma_
            dependents = token.children
            for d in dependents:
                if d.dep_ == 'dobj':
                    t.arg2 = d.text
                elif d.dep_ == 'nsubj':
                    t.arg1 = d.text
            if t.arg1 != '' and t.arg2 != '':
                print('Sentence:', sent)
                print(t.arg1, '-', t.predicate, '-', t.arg2)
                triples.append(t)

for t in triples:
    print(t.arg1, '-', t.predicate, '-', t.arg2)


### <font color='red'>Summarise below how you extended the code above to improve the extracted Open IE triples.</font>
<br>
<br>
<br>
<br>
<br>
<br>

## Graph representation of knowledge extracted by Open IE

The code below builds a graph representation based on a list of triples extracted by Open IE.

In [None]:
!pip install networkx
import networkx as nx

# Function for assigning unique IDs to arguments of triples
# Returns a dictionary of IDs and the corresponding argument names
def normalise_arguments(triples):
    global arg_id
    # arg_id below serves as the ID
    arg_id = 1
    argument_dict = dict()
    for triple in triples:
        argument_dict[triple.arg1] = arg_id
        arg_id += 1
        argument_dict[triple.arg2] = arg_id
        arg_id += 1
    return argument_dict
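
Note that when the same argument occurs in several triples, the function above overwrites its ID with a newer one, so IDs remain unique but are not consecutive. If you would rather have each argument keep the first ID it receives, a minimal sketch could look like the following. `SimpleTriple` is a hypothetical stand-in for the `Triple` class, and `assign_argument_ids` is an illustrative name, not part of the lab code.

```python
from collections import namedtuple

# Hypothetical stand-in for the Triple class defined earlier
SimpleTriple = namedtuple("SimpleTriple", ["arg1", "predicate", "arg2"])

def assign_argument_ids(triples):
    """Map each distinct argument to an ID, keeping the first ID it was given."""
    argument_ids = {}
    for triple in triples:
        for argument in (triple.arg1, triple.arg2):
            # setdefault only stores a new ID if the argument is not yet known
            argument_ids.setdefault(argument, len(argument_ids) + 1)
    return argument_ids

example = [
    SimpleTriple("Volta", "invent", "battery"),
    SimpleTriple("Volta", "study", "electricity"),
]
print(assign_argument_ids(example))
```

Here "Volta" appears twice but is assigned a single ID, so both triples end up attached to the same node.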


In [None]:
# Function for generating the graph
from typing import List
# Note: to_graph must be called before running the cells below, because it sets the global argument_dict that they rely on.
def to_graph(triples: List[Triple]) -> nx.Graph:
    global argument_dict
    argument_dict = normalise_arguments(triples)
    # Unlike arguments, predicates do not need to be unique
    predicate_id = arg_id
    graph = nx.Graph()
    for triple in triples:
        graph.add_node(argument_dict[triple.arg1], name=triple.arg1, type='subject')
        graph.add_node(predicate_id, name=triple.predicate, type='predicate')
        graph.add_node(argument_dict[triple.arg2], name=triple.arg2, type='object')
        graph.add_edge(argument_dict[triple.arg1], predicate_id)
        graph.add_edge(argument_dict[triple.arg2], predicate_id)
        predicate_id += 1
    return graph


The code below is for generating a visualisation of a given graph. 

In [None]:
!pip install pyvis
import os
import shutil
import tempfile
from pyvis.network import Network
from typing import List

class PyVisPrinter:
    """Class to visualise a (serialized) dataset entry."""

    def __init__(self, path=None):
        # Use the given path if provided; otherwise create a temporary directory
        self.path = path or tempfile.mkdtemp(prefix='vis-')
        
    def clean(self):
        shutil.rmtree(self.path)

    def print_graph(self, graph: nx.Graph, filename):

        vis = Network(bgcolor="#222222",
                      width='100%',
                      font_color="white", notebook=True)
        
        for idx, (node, data) in enumerate(graph.nodes(data=True)):
            vis.add_node(
                node,
                title=data['name'],
                label=data['name'],
                color='yellow' if data['type'] == 'predicate' else 'green' if data['type'] == 'subject' else 'blue'
            )

        for i, (source, target) in enumerate(graph.edges()):
            if source in vis.get_nodes() and target in vis.get_nodes():
                vis.add_edge(source, target)
            else:
                print("Warning: source or target not in graph!")

        name = os.path.join(self.path, filename)
        return vis
    

The code below will generate and display an HTML file depicting the graph. 
<font color="red">NOTE: If you are using Google Colab, the generated HTML will fail to display from within Google Colab. However, you can download the generated HTML file and open it in a separate browser tab/window.</font>

In [None]:
# Generate a graph based on a list of triples
graph = to_graph(triples)

# Generate an HTML file with the graph visualisation and display it
p = PyVisPrinter()
v = p.print_graph(graph, 'my_graph.html')
v.show('my_graph.html')

Below is a very basic way to query the graph.
The code will return a tree traversed using depth-first search (DFS) with 'pen' as a starting point (answering the question "Who invented the pen?").
Try other queries by changing the starting point (e.g., change 'pen' to 'Volta' to answer the question "What did Volta invent?").
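
To see what such a traversal visits, here is a minimal sketch of depth-first search over a plain adjacency dictionary, independent of networkx. The toy graph and the function name `dfs_order` are made up for illustration, and the order in which neighbours are visited may differ from what `nx.dfs_tree` produces.

```python
def dfs_order(adjacency, start):
    """Return the nodes reachable from `start` in depth-first order."""
    visited = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.append(node)
        # Push neighbours in reverse so the first-listed neighbour is explored first
        for neighbour in reversed(adjacency.get(node, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return visited

# Toy graph: two argument nodes linked through a predicate node, as in to_graph
toy_graph = {
    "battery": ["invent#1"],
    "invent#1": ["Volta", "battery"],
    "Volta": ["invent#1"],
}
print(dfs_order(toy_graph, "battery"))
```

Because the graph is undirected, the traversal reaches every node connected to the starting point, so an argument shared by several triples pulls all of them into the result.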

In [None]:
successors = nx.dfs_tree(graph, argument_dict['pen'])
for s in successors:
    print (graph.nodes[s])