## **Information (Triple) Extraction using Python and spaCy**

To build our knowledge graphs, we need to extract information from our corpus in the form of triples.
An example of a triple:

**(subject, relation, object)**

"Education minister Mr.Peter decides to close down schools"

Here the triple will be: 

(Peter, close, schools)

Here, 'Peter' and 'schools' are the two entities, and they are related to each other by the term 'close'.

When plotting a knowledge graph, 'Peter' and 'schools' become the nodes and 'close' becomes the edge (relation) between them.
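As a minimal sketch of how such triples could feed a graph, the snippet below stores each triple in a small named tuple and groups the edges by subject node. The names `Triple` and `build_graph` are illustrative assumptions, not part of this project's code.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    object: str


def build_graph(triples):
    """Map each subject node to its outgoing (relation, object) edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.relation, t.object))
    return dict(graph)


triples = [Triple("Peter", "close", "schools")]
graph = build_graph(triples)
print(graph)  # {'Peter': [('close', 'schools')]}
```

A plain dictionary is enough to show the node/edge idea; a dedicated graph library could replace it later without changing the triple format.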

# METHODS

## 1. Extracting only the important part of a long sentence using Rule-Based Matching

Let's try to extract hypernym-hyponym pairs by using patterns/rules.

* A hypernym is a word with a broad meaning.
* A hyponym is a word that names a sub-category of a hypernym.

For example, spoon is a hyponym of cutlery (the hypernym).

In our case, the NI Assembly can act as a hypernym and its ministers as hyponyms.

Hypernym-hyponym extraction is very important for creating our knowledge graphs because it brings an element of hierarchy to them.

In [604]:
# DEPENDENCIES
import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 200)

# load spaCy model
nlp = spacy.load("en_core_web_sm")

Let's consider an example:

"GDP in developing countries such as Vietnam will continue growing at a high rate."

Although the above sentence carries two distinct pieces of information:

"Countries such as Vietnam" &

"Vietnam's GDP growing at a high rate"

we focus only on the most important takeaway, which is:

"Countries such as Vietnam"

This lets our knowledge graph learn that Vietnam is a country and should be connected to the Country node.

Segmenting a long sentence into separate sentences and then extracting the triples will be done in a later part of our project using well-structured models.

### Pattern: X such as Y “Hearst Patterns”

In [605]:
# sample text 
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 


# create a spaCy object 
doc = nlp(text)

To be able to pull out the desired information from the above sentence, it is really important to understand its syntactic structure – things like the subject, object, modifiers, and parts-of-speech (POS) in the sentence.

We can easily explore these syntactic details in the sentence by using spaCy:

In [606]:
# print token, dependency, POS tag 
for tok in doc: 
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

GDP --> nsubj --> NOUN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> SCONJ
Vietnam --> pobj --> PROPN
will --> aux --> VERB
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT


Have a look around the terms “such” and “as”. They are preceded by a noun (“countries”), and after them we have a proper noun (“Vietnam”) that acts as a hyponym.

So, let’s create the required pattern using the dependency tags and the POS tags:

In [607]:
#define the pattern 
pattern = [{'POS':'NOUN'}, 
           {'LOWER': 'such'}, 
           {'LOWER': 'as'}, 
           {'POS': 'PROPN'} ]  #proper noun

Let’s extract the pattern from the text:

In [608]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 
# spaCy v2 signature; in spaCy v3 use matcher.add("matching_1", [pattern])
matcher.add("matching_1", None, pattern) 

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print("Sentence extracted from indices" ,matches[0][1], "to --->", matches[0][2])
print(span.text)

Sentence extracted from indices 3 to ---> 7
countries such as Vietnam


It works perfectly. However, if we could get “developing countries” instead of just “countries”, then the output would make more sense.

So, we will now also capture the modifier of the noun just before “such as” by using the code below:

In [609]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

# spaCy v2 signature; in spaCy v3 use matcher.add("matching_1", [pattern])
matcher.add("matching_1", None, pattern)
matches = matcher(doc)

span = doc[matches[0][1]:matches[0][2]]

print("Sentence extracted from indices" ,matches[0][1], "to --->", matches[0][2])
print(span.text)

Sentence extracted from indices 2 to ---> 7
developing countries such as Vietnam


Here, “developing countries” is the hypernym and “Vietnam” is the hyponym. Both of them are semantically related.

Note: The key ‘OP’: ‘?’ in the pattern above means that the modifier (‘amod’) can occur once or not at all.
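The matched span is still plain text. One possible way to turn it into a triple is to split it on the phrase "such as". The sketch below assumes the span contains that phrase; the helper name `span_to_triple` and the relation label `is_a` are illustrative choices, not something spaCy provides.

```python
def span_to_triple(span_text):
    """Split an 'X such as Y' span into a (hyponym, relation, hypernym) triple."""
    hypernym, sep, hyponym = span_text.partition(" such as ")
    if not sep:
        return None  # the span does not contain the pattern
    return (hyponym.strip(), "is_a", hypernym.strip())


print(span_to_triple("developing countries such as Vietnam"))
# ('Vietnam', 'is_a', 'developing countries')
```

Putting the hyponym first gives an edge that points from the specific node to the more general one, which is what adds hierarchy to the graph.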

In a similar manner, we can get several pairs from any piece of text:

* Fruits such as apples
* Cars such as Ferrari
* Flowers such as rose


Similarly we can create patterns to extract only the relevant information from a sentence.

We can focus on building a set of text-patterns that can be employed to extract meaningful information from text. These patterns are popularly known as “Hearst Patterns”.

The following patterns can be used for our project:
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/hearst_patterns-768x435.png)
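For a quick look at the "X such as Y" pattern without loading a language model, a regular expression can serve as a rough stand-in. This is a simplification and not the spaCy `Matcher` approach used above: it has no access to POS or dependency tags, so it simply treats at most one word before the head noun as its modifier. The names `SUCH_AS` and `hearst_such_as` are assumptions made for this sketch.

```python
import re

# "<optional modifier> <head word> such as <single-word hyponym>"
SUCH_AS = re.compile(r"\b((?:\w+\s)?\w+) such as (\w+)")


def hearst_such_as(text):
    """Return (hypernym, hyponym) pairs for every 'X such as Y' occurrence."""
    return [(m.group(1), m.group(2)) for m in SUCH_AS.finditer(text)]


sentence = "GDP in developing countries such as Vietnam will continue growing at a high rate."
print(hearst_such_as(sentence))                 # [('developing countries', 'Vietnam')]
print(hearst_such_as("Fruits such as apples"))  # [('Fruits', 'apples')]
```

Because the regex cannot tell an adjective from any other word, it will also pick up a preceding verb or preposition as a "modifier"; the tag-based patterns avoid this.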

## 2. Subtree Matching for Relation Extraction

**It is difficult to build patterns that generalize well across different sentences.** 

To enhance the rule-based methods for relation/information extraction, we should try to understand the dependency structure of the sentences.

Let’s take a sample text and build its dependency tree:

### Dependency graphs

Let's consider a sentence:

"Instagram was acquired by facebook." 



In [610]:
text_1 = "Instagram was acquired by facebook." 
doc_1 = nlp(text_1) 

for tok in doc_1: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Instagram --> nsubjpass --> PROPN
was --> auxpass --> AUX
acquired --> ROOT --> VERB
by --> agent --> ADP
facebook --> pobj --> PROPN
. --> punct --> PUNCT


In [611]:
text_1 = "Instagram was acquired by facebook." 

# Plot the dependency graph 
doc_1 = nlp(text_1) 
displacy.render(doc_1, style='dep',jupyter=True)

If you look at the entities in the sentence (Instagram and Facebook), they are related by the term ‘acquired’.

Here, Facebook is the acquirer and Instagram is the entity being acquired, so in the triple Facebook takes the subject position and Instagram the object position.

The triple should be in the form:

(facebook, acquired, instagram)

Now consider this statement:

"Instagram, a photo/video sharing platform, was acquired by Facebook."

Its dependency graph will look something like this:

In [612]:
text_2 = "Instagram, a photo/video sharing platform, was acquired by Facebook."
doc_2 = nlp(text_2) 

for tok in doc_2: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Instagram --> nsubjpass --> PROPN
, --> punct --> PUNCT
a --> det --> DET
photo --> nmod --> NOUN
/ --> punct --> SYM
video --> compound --> NOUN
sharing --> compound --> NOUN
platform --> appos --> NOUN
, --> punct --> PUNCT
was --> auxpass --> AUX
acquired --> ROOT --> VERB
by --> agent --> ADP
Facebook --> pobj --> PROPN
. --> punct --> PUNCT


In [613]:
text_2 = "Instagram, a photo/video sharing platform, was acquired by Facebook."

# Plot the dependency graph 
doc_2 = nlp(text_2) 
displacy.render(doc_2, style='dep',jupyter=True)

We have to check which dependency paths are common across multiple sentences. This method is known as subtree matching.

In both of the above sentences, the dependency tag for "Instagram" is nsubjpass, which marks a passive subject (both sentences are in the passive voice). The other entity, "facebook", is tagged pobj (the object of the preposition "by"), and the term "acquired" is the ROOT of the sentence, which means it connects the subject and the object.

Let’s define a function to perform subtree matching:

In [614]:
def subtree_matcher(doc): 
  x = ''  # object
  y = ''  # passive subject
  
  # iterate through all the tokens in the input sentence 
  for tok in doc: 
    # extract the passive subject (nsubjpass, csubjpass) 
    if "subjpass" in tok.dep_: 
      y = tok.text 
      
    # extract the object (dobj, pobj, ...) 
    if tok.dep_.endswith("obj"): 
      x = tok.text 
      
  return y, x
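A note on testing dependency labels: comparing the result of `str.find(...)` with `True` only succeeds when the match happens to start at index 1, which is the case for `nsubjpass` but not in general. A membership test with `in` expresses the intent directly. The helper name `is_passive_subject` below is an assumption made for this illustration.

```python
# str.find returns the index of the first match, or -1 if there is none
print("nsubjpass".find("subjpass"))  # 1
print("nsubj".find("subjpass"))      # -1
print("subjpass".find("subjpass"))   # 0

# `== True` holds only for index 1, because True == 1 in Python
print("nsubjpass".find("subjpass") == True)  # True
print("subjpass".find("subjpass") == True)   # False, although the substring is present


def is_passive_subject(dep_label):
    """True for labels such as 'nsubjpass' or 'csubjpass'."""
    return "subjpass" in dep_label


print(is_passive_subject("nsubjpass"))  # True
print(is_passive_subject("nsubj"))      # False
```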

Let's try to extract the information:

In [615]:
text_1 = "Instagram was acquired by facebook." 
doc_1 = nlp(text_1) 

for tok in doc_1: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Instagram --> nsubjpass --> PROPN
was --> auxpass --> AUX
acquired --> ROOT --> VERB
by --> agent --> ADP
facebook --> pobj --> PROPN
. --> punct --> PUNCT


In [616]:
subtree_matcher(doc_1)

('Instagram', 'facebook')

In [617]:
text_2 = "Instagram, a photo/video sharing platform, was acquired by Facebook." 

doc_2 = nlp(text_2) 
subtree_matcher(doc_2)

('Instagram', 'Facebook')

Here we can see that, even though the second sentence has more words, our function still extracts the subject and object accurately.

### ACTIVE SENTENCE AND PASSIVE SENTENCE

The sentences we discussed above are in the passive voice.

* Passive sentence: the subject (Instagram) receives the action (acquired), and the performer of the action appears after "by": "Instagram was acquired by facebook."
* Active sentence: the subject performs the action. For example: "Facebook acquired Instagram."

Let's check how our function handles active sentences:



In [618]:
text_3 = "Facebook acquired instagram." 
doc_3 = nlp(text_3) 
subtree_matcher(doc_3)

('', 'instagram')

This output is wrong: we expected 'Facebook' in the subject position, but the function returned an empty string there and found only 'instagram', the object.

Now, let's compare the dependency and POS tags of a passive sentence and an active sentence:

In [619]:
# PASSIVE SENTENCE
text_1 = "Instagram was acquired by facebook." 
doc_1 = nlp(text_1) 

for tok in doc_1: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Instagram --> nsubjpass --> PROPN
was --> auxpass --> AUX
acquired --> ROOT --> VERB
by --> agent --> ADP
facebook --> pobj --> PROPN
. --> punct --> PUNCT


In [620]:
# ACTIVE SENTENCE
text_3 = "Facebook acquired instagram." 
doc_3 = nlp(text_3) 

for tok in doc_3: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Facebook --> nsubj --> PROPN
acquired --> ROOT --> VERB
instagram --> dobj --> NOUN
. --> punct --> PUNCT


It turns out that the grammatical functions (subject and object) of the terms 'Facebook' and 'Instagram' are interchanged in the active voice. In addition, the dependency tag for the subject has changed from ‘nsubjpass’ to ‘nsubj’, which indicates that the sentence is in the active voice.

We can use this property to modify our subtree matching function. The new function is given below:

In [621]:
def subtree_matcher_1(doc):
  # the sentence is passive if any dependency tag contains "subjpass"
  is_passive = any("subjpass" in tok.dep_ for tok in doc)

  x = ''  # entity performing the action
  y = ''  # entity acted upon

  if is_passive:
    for tok in doc:
      # the passive subject is the entity acted upon
      if "subjpass" in tok.dep_:
        y = tok.text

      # the object (e.g. pobj after "by") performs the action
      if tok.dep_.endswith("obj"):
        x = tok.text
  
  # otherwise the sentence is treated as active
  else:
    for tok in doc:
      if tok.dep_.endswith("subj"):
        x = tok.text

      if tok.dep_.endswith("obj"):
        y = tok.text

  return x,y

In [622]:
subtree_matcher_1(doc_1)

('facebook', 'Instagram')

In [623]:
subtree_matcher_1(doc_3)

('Facebook', 'instagram')

With our updated function, the subject-object pairs are extracted accurately in both the passive and the active case.
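The functions so far return only the two entities, while a triple also needs the relation. One possible extension, sketched below, takes the ROOT token as the relation. To stay runnable without a language model, it works on plain `(text, dep)` pairs copied from the tag listings printed earlier; the function name `extract_triple` and this input format are assumptions made for the sketch.

```python
def extract_triple(tokens):
    """tokens: list of (text, dep) pairs for one parsed sentence."""
    passive = any("subjpass" in dep for _, dep in tokens)
    subj = rel = obj = ""
    for text, dep in tokens:
        if dep == "ROOT":
            rel = text
        elif dep.endswith("subj") or "subjpass" in dep:
            subj = text
        elif dep.endswith("obj"):
            obj = text
    # in a passive sentence the grammatical subject is the entity acted upon
    if passive:
        subj, obj = obj, subj
    return subj, rel, obj


passive_tokens = [("Instagram", "nsubjpass"), ("was", "auxpass"), ("acquired", "ROOT"),
                  ("by", "agent"), ("facebook", "pobj"), (".", "punct")]
active_tokens = [("Facebook", "nsubj"), ("acquired", "ROOT"),
                 ("instagram", "dobj"), (".", "punct")]

print(extract_triple(passive_tokens))  # ('facebook', 'acquired', 'Instagram')
print(extract_triple(active_tokens))   # ('Facebook', 'acquired', 'instagram')
```

Using the ROOT as the relation is a simplification: in a sentence such as "Mr.Peter decides to close down schools", the ROOT would be "decides" rather than "close", so longer sentences need a more careful choice of verb.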

In [633]:
subtree_matcher_1(nlp("Education minister Mr.Peter decides to close down schools"))

('Peter', 'schools')

In [628]:
subtree_matcher_1(nlp("Northern Ireland figths the virus efficiently."))

('Ireland', 'virus')

We can see that **"Northern Ireland"** was not captured in full: only "Ireland" was returned as the subject, and the modifier "Northern" was dropped.

In [631]:
text = "Northern Ireland figths the virus efficiently." 
doc = nlp(text) 

for tok in doc: 
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Northern --> compound --> PROPN
Ireland --> nsubj --> PROPN
figths --> ROOT --> VERB
the --> det --> DET
virus --> dobj --> NOUN
efficiently --> advmod --> ADV
. --> punct --> PUNCT


Let's modify our function in order to capture the "compound" modifier as well:

In [629]:
def subtree_matcher_2(doc):
  # the sentence is passive if any dependency tag contains "subjpass"
  is_passive = any("subjpass" in tok.dep_ for tok in doc)

  x = ''  # entity performing the action
  y = ''  # entity acted upon

  if is_passive:
    for i,tok in enumerate(doc):
      if "subjpass" in tok.dep_:
        y = tok.text
        # prepend a compound modifier, e.g. "Northern" in "Northern Ireland";
        # i > 0 keeps doc[i-1] from wrapping around to the last token
        if i > 0 and doc[i-1].dep_ == "compound":
          y = " ".join((doc[i-1].text, tok.text))

      if tok.dep_.endswith("obj"):
        x = tok.text
  
  # otherwise the sentence is treated as active
  else:
    for i,tok in enumerate(doc):
      if tok.dep_.endswith("subj"):
        x = tok.text
        if i > 0 and doc[i-1].dep_ == "compound":
          x = " ".join((doc[i-1].text, tok.text))

      if tok.dep_.endswith("obj"):
        y = tok.text

  return x,y

In [635]:
subtree_matcher_2(nlp("Northern Ireland figths the virus efficiently."))

('Northern Ireland', 'virus')
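The check above looks at only one token to the left, so a name with several compound parts would still be cut short. A possible generalisation is to keep walking left while the preceding tokens are tagged `compound`. The sketch below uses plain `(text, dep)` pairs instead of a spaCy `Doc`; the first list copies the tags printed above, while the second list is hand-annotated for illustration and is not parser output. The name `with_compounds` is an assumption.

```python
def with_compounds(tokens, i):
    """Return token i joined with the run of 'compound' tokens directly before it."""
    start = i
    while start > 0 and tokens[start - 1][1] == "compound":
        start -= 1
    return " ".join(text for text, _ in tokens[start:i + 1])


ni_tokens = [("Northern", "compound"), ("Ireland", "nsubj"), ("figths", "ROOT"),
             ("the", "det"), ("virus", "dobj"), ("efficiently", "advmod"), (".", "punct")]
print(with_compounds(ni_tokens, 1))  # Northern Ireland

# hand-annotated example with two compound tokens (hypothetical tags)
nyc_tokens = [("New", "compound"), ("York", "compound"), ("City", "nsubj"),
              ("reopens", "ROOT"), ("schools", "dobj")]
print(with_compounds(nyc_tokens, 2))  # New York City
```

The `start > 0` condition also covers a subject at the very beginning of the sentence, where looking one token back would otherwise wrap around to the end of the list.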