## Entity and Relation Extraction From Unstructured Text for Building an SVO/SPO Pipeline


A desirable capability for automatic knowledge graph construction is Entity and Relation extraction. In this article/colab from the EKGF Technical Pillar we will demonstrate a few approaches you may find useful in your efforts to convert unstructured data into SVO/SPO triples.

There are a few open source frameworks available that one can use to get started on the conversion.

For this exercise we will primarily use the CoreNLP and spaCy frameworks. We will also briefly demonstrate NLTK for parsing and Haystack for entity extraction.

To run our examples, we first need to install a few libraries, which are necessary for our safari below.

In many of the examples we print only a subset of the results so that they read better, and the output of some cells is suppressed with %%capture. Change the print_limit variable to a greater number as needed.



In [None]:
%%capture
#install libraries used in this proof of concept
!pip install --upgrade pip
!pip install spacy -U
!pip install PyYAML -U
!pip install thinc -U
!pip install textacy -U
!pip install stanza -U
!pip install bs4 -U
#!pip install owlready2 -U
!pip install -qU transformers==4.13 sentence-transformers
!pip install rdflib -U
!pip install spacy_conll -U
#!pip install ipypublish[sphinx]
#!pip install stanford_openie -U 

# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack
!pip install grpcio-tools==1.34.1
# Install the latest master of Haystack and install the version of torch that works with the colab GPUs
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install nltk -U

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Below we will use an article from the Federal Reserve Bank of Richmond as the source text for parsing Entities and Relations into SVO/SPO triples.

We will use the Beautiful Soup library to parse the article. While you should be able to use this approach for any html based article, you will need to alter the "find" directive to look for an html snippet with the desired content to be processed as a table and text.
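To make the role of the "find" directive concrete, here is a minimal sketch of the same idea: pull the text out of the first `<div>` that carries a given attribute value, and fail loudly when no such element exists. It deliberately uses Python's built-in `html.parser` instead of Beautiful Soup so that it runs without extra installs. The names `DivTextExtractor` and `extract_div_text` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose attribute matches (illustrative helper)."""

    def __init__(self, attr_name, attr_value):
        super().__init__()
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.depth = 0        # 0 means we are outside the target <div>
        self.found = False    # set once the target <div> has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # nested <div> inside the target: track depth so we know when the target closes
            self.depth += 1
        elif not self.found and dict(attrs).get(self.attr_name) == self.attr_value:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, attr_name, attr_value):
    """Return the text of the matching <div>, or raise if the selector matches nothing."""
    parser = DivTextExtractor(attr_name, attr_value)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError(f"no <div> with {attr_name}={attr_value!r}; adjust the selector")
    return "".join(parser.chunks)
```

When a site changes its markup, an explicit error such as the one raised here is usually easier to diagnose than a failure further down the pipeline.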

Citation/reference:
How Charlotte’s Banks Responded to COVID-19
By Elizabeth Medlin, The Federal Reserve Bank of Richmond
(Views expressed in this article are those of the authors and not necessarily those of their respective Reserve Banks or the Federal Reserve System.)

In [None]:
# Processing English text
 
import requests
from bs4 import BeautifulSoup
import pprint 
# citation/reference: 
#https://www.richmondfed.org/publications/research/coronavirus/economic_impact_covid-19_11-10-21
#How Charlotte’s Banks Responded to COVID-19
#By Elizabeth Medlin - The Federal Reserve Bank of Richmond (
#Views expressed in this article are those of the authors and not necessarily those of their respective Reserve Banks or the Federal Reserve System.)
URL = "https://www.richmondfed.org/publications/research/coronavirus/economic_impact_covid-19_11-10-21"
r = requests.get(URL)
  
soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
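# Change the selector below to suit your needs; Richmond Fed articles currently put their content in the following <div>
table = soup.find('div', attrs = {'data-component':'Rich Text'}) 
if table is None:
    # soup.find returns None when nothing matches, which would otherwise fail on get_text()
    raise ValueError("Article <div> not found; adjust the soup.find() selector")

article_text = table.get_text()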
with open('econ_fed_covid.txt', 'w') as f:
  f.write(article_text)


Next we will download and initialize NLTK, CoreNLP, and Stanza for our CoreNLP examples.

In [None]:
%%capture
import nltk

nltk.download('all')

# Set the CORENLP_HOME environment variable to point to the installation location for CoreNLP
corenlp_dir = './corenlp'
import os
os.environ["CORENLP_HOME"] = corenlp_dir

# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
# below can be used for shutdown
#!wget "localhost:9001/shutdown?key=`cat /tmp/corenlp.shutdown.server0`" -O -
import stanza

print("Downloading English model...")
stanza.download('en')
from stanza.server import CoreNLPClient

stanza.install_corenlp(dir=corenlp_dir)

CoreNLP is from the Stanford NLP Group. We use the Stanza Python library as an interface to CoreNLP. It doesn't require Java programming, but it does require the CoreNLP distribution to be installed. Stanza can download CoreNLP for you and, by default, installs it in a folder in the user's home directory. If instead you need to install CoreNLP in an offline manner, it is possible to download the files separately and specify the location using the CORENLP_HOME environment variable.

At the time of this article/colab, relation extraction requires running CoreNLP as a separate server outside of Stanza and using its OpenIE interface, which will be seen later.
Note that many Python libraries start the CoreNLP server automatically. That said, we ran into cases where the server did not start, so we launch it explicitly below. Port 9001 should be used instead of 9000 because of a conflict with Colab.
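Since the server does not always come up, it can help to poll the port before sending any requests. The following is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and the timeout and interval values are assumptions to adapt.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9001)` could be called after launching the server and before creating a client. An accepted connection only shows that the process is listening; the first annotation request may still be slow while models load.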

In [None]:
corenlp_dir = './corenlp'
import os
os.environ["CORENLP_HOME"] = corenlp_dir
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# Start the CoreNLP server in the background on port 9001
# (the command string has no placeholders, so no .format() call is needed)
get_ipython().system_raw(
    'java -mx8g -cp "./corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -server_id server0 -threads 5 &'
)


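Initialize the Stanza pipeline in Python. Note that stanza.Pipeline runs Stanza's own neural models; the CoreNLP server is accessed separately through CoreNLPClient.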

In [None]:
%%capture
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en')


Run Stanza to extract entities. Note that only the first print_limit rows (including the header row) will be printed below. Feel free to alter as need be.

Some, but not all, of the entity types can be found here for reference:
https://www.nltk.org/book/ch07.html#tab-ne-types
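The same entity is often mentioned many times in one article, so a quick frequency summary can be more readable than the raw mention list. A minimal sketch, assuming the mentions have already been reduced to `(text, type)` pairs (for instance from `ent.text` and `ent.type`); `summarize_mentions` is a hypothetical helper name.

```python
from collections import Counter


def summarize_mentions(mentions):
    """Count how often each (text, type) pair occurs, most frequent first."""
    counts = Counter((text.strip(), ent_type) for text, ent_type in mentions)
    return counts.most_common()


# Stand-in data shaped like the entity table; not produced by running a model here
sample = [
    ("COVID-19", "EVENT"),
    ("Charlotte", "GPE"),
    ("North Carolina", "GPE"),
    ("JPMorgan Chase", "ORG"),
    ("Charlotte", "GPE"),
    ("U.S. Bank", "ORG"),
    ("Charlotte", "GPE"),
]
for (text, ent_type), n in summarize_mentions(sample):
    print(f"{n:>3}  {ent_type:<6} {text}")
```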

In [None]:
# Processing English text
en_doc = en_nlp(article_text)
from tabulate import tabulate

ent_list=[]
ent_list.append(['Mention Text','Type','Start','End'])
for ent in en_doc.ents:
    ent_list.append([ent.text,ent.type, ent.start_char, ent.end_char])

print_limit=20
print(tabulate(ent_list[:print_limit], headers='firstrow', tablefmt='fancy_grid'))

╒═══════════════════╤══════════╤═════════╤═══════╕
│ Mention Text      │ Type     │   Start │   End │
╞═══════════════════╪══════════╪═════════╪═══════╡
│ COVID-19          │ EVENT    │      22 │    30 │
├───────────────────┼──────────┼─────────┼───────┤
│ Charlotte         │ GPE      │      97 │   106 │
├───────────────────┼──────────┼─────────┼───────┤
│ North Carolina    │ GPE      │     108 │   122 │
├───────────────────┼──────────┼─────────┼───────┤
│ JPMorgan Chase    │ ORG      │     124 │   138 │
├───────────────────┼──────────┼─────────┼───────┤
│ Charlotte         │ GPE      │     227 │   236 │
├───────────────────┼──────────┼─────────┼───────┤
│ U.S. Bank         │ ORG      │     291 │   300 │
├───────────────────┼──────────┼─────────┼───────┤
│ Charlotte         │ GPE      │     338 │   347 │
├───────────────────┼──────────┼─────────┼───────┤
│ Corporate Trust   │ ORG      │     363 │   378 │
├───────────────────┼──────────┼─────────┼───────┤
│ Fifth Third Bank  │ ORG      

Save the parsed document to a CoNLL file that could be converted to RDF and back:
https://github.com/acoli-repo/conll-rdf/issues/84

We plan on delving into this capability in a subsequent part. The link above shows how an external Java program can take the CoNLL file as input. The information can then be stored as RDF with associated ontologies.

The CoNLL conversion to RDF can be executed following this guide: https://github.com/acoli-repo/conll-rdf#getting-started
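To get a feel for what such a file contains before handing it to another tool, here is a minimal sketch of a reader for the CoNLL-U layout (ten tab-separated columns per token, `#` comment lines, and a blank line between sentences). The function name `read_conllu` is hypothetical, and the sketch assumes a well-formed file.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Group token rows into sentences; each token becomes a dict keyed by column name."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences
```

A file could then be read with `read_conllu(open("output.conllu", encoding="utf-8"))`. All values stay strings, so `head` would need an `int()` conversion before being used as an index.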

In [None]:
from stanza.utils.conll import CoNLL
CoNLL.write_doc2conll(en_doc, "output.conllu")




Example of printing sentence dependencies

Citation/Reference:
https://colab.research.google.com/github/stanfordnlp/stanza/blob/master/demo/Stanza_Beginners_Guide.ipynb

In [None]:
# Print the dependencies of the first print_limit sentences in the doc object
# Format - (Token, Index of head, Nature of dependency)
# Index starts from 1, 0 is reserved for ROOT
print_limit=2
for i, sentence in enumerate(en_doc.sentences[:print_limit]):
     print(*[f'sentence: {i+1}\tid: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sentence.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' for word in sentence.words], sep='\n')


sentence: 1	id: 1	word: Prior	head id: 7	head: pandemic	deprel: case
sentence: 1	id: 2	word: to	head id: 1	head: Prior	deprel: fixed
sentence: 1	id: 3	word: the	head id: 7	head: pandemic	deprel: det
sentence: 1	id: 4	word: COVID	head id: 7	head: pandemic	deprel: compound
sentence: 1	id: 5	word: -	head id: 6	head: 19	deprel: punct
sentence: 1	id: 6	word: 19	head id: 4	head: COVID	deprel: nummod
sentence: 1	id: 7	word: pandemic	head id: 12	head: rolling	deprel: obl
sentence: 1	id: 8	word: ,	head id: 12	head: rolling	deprel: punct
sentence: 1	id: 9	word: several	head id: 10	head: banks	deprel: amod
sentence: 1	id: 10	word: banks	head id: 12	head: rolling	deprel: nsubj
sentence: 1	id: 11	word: were	head id: 12	head: rolling	deprel: aux
sentence: 1	id: 12	word: rolling	head id: 0	head: root	deprel: root
sentence: 1	id: 13	word: out	head id: 12	head: rolling	deprel: compound:prt
sentence: 1	id: 14	word: major	head id: 16	head: plans	deprel: amod
sentence: 1	id: 15	word: expansion	head id: 16

Haystack provides another way to extract entities, using a pretrained BERT model. Handling anything more than common cases will likely require training a model of your own.
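BERT-based taggers operate on word pieces, so one mention can come back as several fragments (for example 'CO', '##D' and '19' for COVID-19). A minimal sketch of one possible post-processing step follows: fragments of the same entity group are merged when the source text between them contains no whitespace, and the merged word is re-read from the text by offset. The helper `merge_fragments` is hypothetical, and the dict keys are assumed to match those printed by the extractor (`start`, `end`, `entity_group`, `score`, `word`).

```python
def merge_fragments(entities, text):
    """Merge same-group entity fragments that belong to one whitespace-delimited token."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if prev is not None and prev["entity_group"] == ent["entity_group"]:
            gap = text[prev["end"]:ent["start"]]
            if not any(ch.isspace() for ch in gap):
                # same token: extend the previous span and keep the weakest score
                prev["end"] = max(prev["end"], ent["end"])
                prev["score"] = min(prev["score"], ent["score"])
                prev["word"] = text[prev["start"]:prev["end"]]
                continue
        merged.append(dict(ent))  # copy so the caller's dicts are not modified
    return merged
```

This heuristic is deliberately simple: two distinct entities of the same type joined by a hyphen or slash would also be merged, so treat it as a starting point.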

In [None]:
from haystack.nodes import TextConverter
from haystack.nodes import PreProcessor
from haystack.nodes import EntityExtractor
from haystack.document_stores import ElasticsearchDocumentStore
import pprint

entity_extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER")

entities = entity_extractor.extract(text=article_text)
print_limit=30
for ent_print in entities[:print_limit]:
    pprint.pprint(ent_print)



INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.
INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0



{'end': 24,
 'entity_group': 'MISC',
 'score': 0.99746394,
 'start': 22,
 'word': 'CO'}
{'end': 27,
 'entity_group': 'MISC',
 'score': 0.80356276,
 'start': 26,
 'word': '##D'}
{'end': 30,
 'entity_group': 'MISC',
 'score': 0.87813723,
 'start': 28,
 'word': '19'}
{'end': 106,
 'entity_group': 'LOC',
 'score': 0.99827015,
 'start': 97,
 'word': 'Charlotte'}
{'end': 122,
 'entity_group': 'LOC',
 'score': 0.99937963,
 'start': 108,
 'word': 'North Carolina'}
{'end': 138,
 'entity_group': 'ORG',
 'score': 0.99823684,
 'start': 124,
 'word': 'JPMorgan Chase'}
{'end': 236,
 'entity_group': 'LOC',
 'score': 0.9985196,
 'start': 227,
 'word': 'Charlotte'}
{'end': 300,
 'entity_group': 'ORG',
 'score': 0.9978406,
 'start': 291,
 'word': 'U. S. Bank'}
{'end': 347,
 'entity_group': 'LOC',
 'score': 0.99868613,
 'start': 338,
 'word': 'Charlotte'}
{'end': 378,
 'entity_group': 'ORG',
 'score': 0.9854084,
 'start': 363,
 'word': 'Corporate Trust'}
{'end': 484,
 'entity_group': 'ORG',
 'score': 0.9

Extract SVO/SPO using OpenIE from CoreNLP
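OpenIE tends to return many overlapping variants of one statement, for example one triple with the subject "banks" and another with "several banks". A minimal sketch of one way to thin these out keeps only triples that are not contained, field by field, in a longer triple. It assumes each triple is a dict with `subject`, `relation` and `object` keys; `prune_subsumed` is a hypothetical helper.

```python
FIELDS = ("subject", "relation", "object")


def prune_subsumed(triples):
    """Drop exact duplicates and triples whose every field is a substring of another triple's."""
    unique = []
    for t in triples:
        if t not in unique:
            unique.append(t)

    def covers(big, small):
        return all(small[k] in big[k] for k in FIELDS)

    # after de-duplication no two triples are equal, so mutual coverage cannot occur
    return [t for t in unique
            if not any(u is not t and covers(u, t) for u in unique)]
```

Substring containment is crude (it would treat "bank" as part of "riverbank"), so a token-level comparison may be preferable for real data.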

In [None]:

# for a contrast in models/pipelines try switching to kbp to see a different model used https://stanfordnlp.github.io/CoreNLP/kbp.html

#def extract_triples(text, annotators=["openie"], properties={}):
#def extract_triples(text, annotators=["tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp"], properties={}):

#uncomment below for kbp
#def extract_triples(text, annotators=["kbp"], properties={}):
#comment below for kbp
def extract_triples(text, annotators=["tokenize,ssplit,pos,depparse,ner,natlog,openie"], properties={}):
    with CoreNLPClient(
        annotators=annotators, properties=properties,  endpoint='http://localhost:9001', be_quiet=True , start_server='DONT_START'
    ) as client:
        ann = client.annotate(text)
        triples = []
        for sentence in ann.sentence:
            #comment below for kbp
            for triple in sentence.openieTriple:
            #uncomment below for kbp
            #for triple in sentence.kbpTriple:
                #print(triple)
                triples.append(
                    {
                        "subject": triple.subject,
                        "relation": triple.relation,
                        "object": triple.object,
                    }
                )

    return triples


Save results into ttl file using rdflib
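Passing a URL-quoted phrase directly to `URIRef` produces a relative IRI such as `several+banks`. One possible alternative is to mint absolute IRIs under a namespace you control. The sketch below uses only the standard library and writes N-Triples lines by hand instead of going through rdflib; the base `http://example.org/kg/`, the `entity`/`relation` path segments, and the helper names are all assumptions for illustration.

```python
from urllib.parse import quote

BASE = "http://example.org/kg/"  # hypothetical namespace, replace with your own


def mint_iri(kind, phrase, base=BASE):
    """Turn a free-text phrase into an absolute IRI in angle brackets."""
    # collapse whitespace to underscores, then percent-encode everything else that is unsafe
    slug = quote("_".join(phrase.split()), safe="")
    return f"<{base}{kind}/{slug}>"


def to_ntriples(triples, base=BASE):
    """Serialize subject/relation/object dicts as N-Triples, one statement per line."""
    lines = []
    for t in triples:
        s = mint_iri("entity", t["subject"], base)
        p = mint_iri("relation", t["relation"], base)
        o = mint_iri("entity", t["object"], base)
        lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)
```

N-Triples is a subset of Turtle, so the resulting text can be loaded by an RDF library for further processing.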

In [None]:
from rdflib import Graph, URIRef, BNode, Literal
extracted_triples = extract_triples(article_text)

from urllib.parse import urlencode, quote_plus, quote
g = Graph()
for rel in extracted_triples:
      g.add((URIRef(quote_plus(rel['subject'])), URIRef(quote_plus(rel['relation'])), URIRef(quote_plus(rel['object']))))                
 
v = g.serialize(destination='openie_example.ttl', format='ttl')
print_limit=30
for ent_print in extracted_triples[:print_limit]:
    pprint.pprint(ent_print)


{'object': 'Prior COVID 19 pandemic',
 'relation': 'were rolling out expansion plans to',
 'subject': 'banks'}
{'object': 'major expansion plans',
 'relation': 'were rolling out',
 'subject': 'several banks'}
{'object': 'COVID 19 pandemic',
 'relation': 'were rolling out expansion plans to',
 'subject': 'several banks'}
{'object': 'COVID 19 pandemic',
 'relation': 'were rolling out expansion plans to',
 'subject': 'banks'}
{'object': 'expansion plans',
 'relation': 'were rolling out',
 'subject': 'several banks'}
{'object': 'major expansion plans in Charlotte',
 'relation': 'were rolling out',
 'subject': 'several banks'}
{'object': 'expansion plans in Charlotte',
 'relation': 'were rolling out',
 'subject': 'banks'}
{'object': 'expansion plans in Charlotte',
 'relation': 'were rolling out',
 'subject': 'several banks'}
{'object': 'expansion plans',
 'relation': 'were rolling out',
 'subject': 'banks'}
{'object': 'major expansion plans',
 'relation': 'were rolling out',
 'subject': 'ba

Yet another example of triple extraction, this time using CoreNLP and NLTK to build a sentence triplet.
Citation/Reference: https://github.com/kj-lai/SentenceTriplet

In [None]:
# reference https://github.com/kj-lai/SentenceTriplet
import nltk, pandas as pd, numpy as np
from nltk.parse.corenlp import CoreNLPParser, CoreNLPDependencyParser
from nltk.tree import ParentedTree
dep_parser = CoreNLPDependencyParser(url='http://localhost:9001')
pos_tagger = CoreNLPParser(url='http://localhost:9001' ,  tagtype='pos')

def triplet_extraction (input_sent, output=['parse_tree','spo','result']):
    # Parse the input sentence with Stanford CoreNLP Parser
    pos_type = pos_tagger.tag(input_sent.split())
    parse_tree, = ParentedTree.convert(list(pos_tagger.parse(input_sent.split()))[0])
    dep_type, = ParentedTree.convert(dep_parser.parse(input_sent.split()))
    # Extract subject, predicate and object
    subject = extract_subject(parse_tree)
    predicate = extract_predicate(parse_tree)
    objects = extract_object(parse_tree)
    if 'parse_tree' in output:
        print('---Parse Tree---')
        parse_tree.pretty_print()
    if 'spo' in output:
        print('---Subject---')
        print(subject)
        print('---Predicate---')
        print(predicate)
        print('---Object---')
        print(objects)
    if 'result' in output:
        print('---Result---')
        print(' '.join([subject[0], predicate[0], objects[0]]))

def extract_subject (parse_tree):
    # Extract the first noun found in NP_subtree
    subject = []
    for s in parse_tree.subtrees(lambda x: x.label() == 'NP'):
        for t in s.subtrees(lambda y: y.label().startswith('NN')):
            output = [t[0], extract_attr(t)]
            # Avoid empty or repeated values
            if output != [] and output not in subject:
                subject.append(output) 
    if len(subject) != 0: return subject[0] 
    else: return ['']

def extract_predicate (parse_tree):
    # Extract the deepest (last) verb found in VP_subtree
    output, predicate = [],[]
    for s in parse_tree.subtrees(lambda x: x.label() == 'VP'):
        for t in s.subtrees(lambda y: y.label().startswith('VB')):
            output = [t[0], extract_attr(t)]
            if output != [] and output not in predicate:    
                predicate.append(output)
    if len(predicate) != 0: return predicate[-1]
    else: return ['']

def extract_object (parse_tree):
    # Extract the first noun or first adjective in NP, PP, ADP siblings of VP_subtree
    objects, output, word = [],[],[]
    for s in parse_tree.subtrees(lambda x: x.label() == 'VP'):
        for t in s.subtrees(lambda y: y.label() in ['NP','PP','ADP']):
            if t.label() in ['NP','PP']:
                for u in t.subtrees(lambda z: z.label().startswith('NN')):
                    word = u          
            else:
                for u in t.subtrees(lambda z: z.label().startswith('JJ')):
                    word = u
            if len(word) != 0:
                output = [word[0], extract_attr(word)]
            if output != [] and output not in objects:
                objects.append(output)
    if len(objects) != 0: return objects[0]
    else: return ['']

def extract_attr (word):
    attrs = []     
    # Search among the word's siblings
    if word.label().startswith('JJ'):
        for p in word.parent(): 
            if p.label() == 'RB':
                attrs.append(p[0])
    elif word.label().startswith('NN'):
        for p in word.parent():
            if p.label() in ['DT','PRP$','POS','JJ','CD','ADJP','QP','NP']:
                attrs.append(p[0])
    elif word.label().startswith('VB'):
        for p in word.parent():
            if p.label() == 'ADVP':
                attrs.append(p[0])
    # Search among the word's uncles
    if word.label().startswith('NN') or word.label().startswith('JJ'):
        for p in word.parent().parent():
            if p.label() == 'PP' and p != word.parent():
                attrs.append(' '.join(p.flatten()))
    elif word.label().startswith('VB'):
        for p in word.parent().parent():
            if p.label().startswith('VB') and p != word.parent():
                attrs.append(' '.join(p.flatten()))
    return attrs

In [None]:
ent_text = nltk.sent_tokenize(article_text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
print_limit=10
for sentence in ent_text[:print_limit]:
    triplet_extraction(sentence)

---Parse Tree---
                                                             S                                                                         
        _____________________________________________________|_______________________________________________________________________   
       |                               |           |                      VP                                                         | 
       |                               |           |          ____________|____________________                                      |  
       PP                              |           |         |                                 VP                                    | 
   ____|________                       |           |         |       __________________________|____                                 |  
  |             PP                     |           |         |      |     |                         NP                               | 
  |     ________|____       

Use spaCy for entity extraction

In [None]:
import spacy

spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')
doc = nlp(article_text)

print_limit=30
for entity in doc.ents[:print_limit]:
  print(entity.label_, ' | ', entity.text)

for tok in doc[:print_limit]:
  print(tok.text, "...", tok.dep_)


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
GPE  |  Charlotte
GPE  |  North Carolina
ORG  |  JPMorgan Chase
GPE  |  Charlotte
ORG  |  U.S. Bank
GPE  |  Charlotte
ORG  |  Corporate Trust
ORG  |  Fifth Third Bank
CARDINAL  |  20
GPE  |  Charlotte
ORG  |  Truist Bank
ORG  |  BB&T Bank
ORG  |  SunTrust Bank
GPE  |  Charlotte
ORG  |  Truist
ORDINAL  |  sixth
GPE  |  the United States
GPE  |  Charlotte
GPE  |  the United States
DATE  |  March 2020
DATE  |  several weeks
DATE  |  the months
DATE  |  June 2020
ORG  |  the Federal Reserve
CARDINAL  |  34
MONEY  |  $100 billion
DATE  |  2007-2009
ORG  |  Fed
CARDINAL  |  three
ORG  |  COVID-19 recession

         ... dep
Prior ... advmod
to ... prep
the ... det
COVID-19 ... compound
pandemic ... pobj
, ... punct
several ... amod
banks ... nsubj
were ... aux
rolling ... ROOT
out ... prt
major ... amod
expansion ... compound
plans ... dobj
in ... prep
Charlotte ... pobj
, ... pu

Define functions that extract SVO triples using spaCy alone, without requiring a separate library such as textacy.
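The core idea behind dependency-based SVO extraction can be shown on plain tuples before reading the full implementation: for each head word, pair its subject children with its object children. This is a minimal sketch only. It uses `(text, dep, head_index)` tuples as a stand-in for spaCy tokens, the head indices in the sample are assumed for illustration, and `simple_svos` is a hypothetical name. It ignores negation, passives, conjunctions and phrase expansion, all of which the full code handles.

```python
from collections import defaultdict

SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd"}


def simple_svos(tokens):
    """tokens: list of (text, dep, head_index); the root points at its own index."""
    children = defaultdict(list)
    for i, (_, _, head) in enumerate(tokens):
        if head != i:
            children[head].append(i)

    svos = []
    for head, kids in children.items():
        subs = [tokens[k][0] for k in kids if tokens[k][1] in SUBJECT_DEPS]
        objs = [tokens[k][0] for k in kids if tokens[k][1] in OBJECT_DEPS]
        # every subject of a head is paired with every object of the same head
        svos.extend((s, tokens[head][0], o) for s in subs for o in objs)
    return svos


# "several banks were rolling out major expansion plans"
sample_tokens = [
    ("several", "amod", 1),
    ("banks", "nsubj", 3),
    ("were", "aux", 3),
    ("rolling", "ROOT", 3),
    ("out", "prt", 3),
    ("major", "amod", 7),
    ("expansion", "compound", 7),
    ("plans", "dobj", 3),
]
print(simple_svos(sample_tokens))
```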

In [None]:
# ref:
# https://github.com/rock3125/enhanced-subject-verb-object-extraction/blob/master/subject_verb_object_extract.py
# Copyright 2017 Peter de Vocht
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import en_core_web_sm
from collections.abc import Iterable

# use spacy small model
nlp = en_core_web_sm.load()

# dependency markers for subjects
SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
# dependency markers for objects
OBJECTS = {"dobj", "dative", "attr", "oprd"}
# POS tags that will break adjoining items
BREAKER_POS = {"CCONJ", "VERB"}
# words that are negations
NEGATIONS = {"no", "not", "n't", "never", "none"}


# does dependency set contain any coordinating conjunctions?
def contains_conj(depSet):
    return "and" in depSet or "or" in depSet or "nor" in depSet or \
           "but" in depSet or "yet" in depSet or "so" in depSet or "for" in depSet


# get subs joined by conjunctions
def _get_subs_from_conjunctions(subs):
    more_subs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_subs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(more_subs) > 0:
                more_subs.extend(_get_subs_from_conjunctions(more_subs))
    return more_subs


# get objects joined by conjunctions
def _get_objs_from_conjunctions(objs):
    more_objs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_objs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(more_objs) > 0:
                more_objs.extend(_get_objs_from_conjunctions(more_objs))
    return more_objs


# find sub dependencies
def _find_subs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        subs = [tok for tok in head.lefts if tok.dep_ == "SUB"]
        if len(subs) > 0:
            verb_negated = _is_negated(head)
            subs.extend(_get_subs_from_conjunctions(subs))
            return subs, verb_negated
        elif head.head != head:
            return _find_subs(head)
    elif head.pos_ == "NOUN":
        return [head], _is_negated(tok)
    return [], False


# is the tok set's left or right negated?
def _is_negated(tok):
    parts = list(tok.lefts) + list(tok.rights)
    for dep in parts:
        if dep.lower_ in NEGATIONS:
            return True
    return False


# get all the verbs on tokens with negation marker
def _find_svs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs


# get grammatical objects for a given set of dependencies (including passive sentences)
def _get_objs_from_prepositions(deps, is_pas):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or
                         (tok.pos_ == "PRON" and tok.lower_ == "me") or
                         (is_pas and tok.dep_ == 'pobj')])
    return objs


# get objects from the dependencies using the attribute dependency
def _get_objs_from_attrs(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(_get_objs_from_prepositions(rights, is_pas))
                    if len(objs) > 0:
                        return v, objs
    return None, None


# xcomp; open complement - verb has no subject
def _get_obj_from_xcomp(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(_get_objs_from_prepositions(rights, is_pas))
            if len(objs) > 0:
                return v, objs
    return None, None


# get all functional subjects adjacent to the verb passed in
def _get_all_subs(v):
    verb_negated = _is_negated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(_get_subs_from_conjunctions(subs))
    else:
        foundSubs, verb_negated = _find_subs(v)
        subs.extend(foundSubs)
    return subs, verb_negated


# find the main verb - or any aux verb if we can't find it
def _find_verbs(tokens):
    verbs = [tok for tok in tokens if _is_non_aux_verb(tok)]
    if len(verbs) == 0:
        verbs = [tok for tok in tokens if _is_verb(tok)]
    return verbs


# is the token a verb?  (excluding auxiliary verbs)
def _is_non_aux_verb(tok):
    return tok.pos_ == "VERB" and (tok.dep_ != "aux" and tok.dep_ != "auxpass")


# is the token a verb?  (including auxiliary verbs)
def _is_verb(tok):
    return tok.pos_ == "VERB" or tok.pos_ == "AUX"


# return the verb to the right of this verb in a CCONJ relationship if applicable
# returns a tuple, first part True|False and second part the modified verb if True
def _right_of_verb_is_conj_verb(v):
    # rights is a generator
    rights = list(v.rights)

    # VERB CCONJ VERB (e.g. he beat and hurt me)
    if len(rights) > 1 and rights[0].pos_ == 'CCONJ':
        for tok in rights[1:]:
            if _is_non_aux_verb(tok):
                return True, tok

    return False, v


# get all objects for an active/passive sentence
def _get_all_objs(v, is_pas):
    # rights is a generator
    rights = list(v.rights)

    objs = [tok for tok in rights if tok.dep_ in OBJECTS or (is_pas and tok.dep_ == 'pobj')]
    objs.extend(_get_objs_from_prepositions(rights, is_pas))

    #potentialNewVerb, potentialNewObjs = _get_objs_from_attrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potential_new_verb, potential_new_objs = _get_obj_from_xcomp(rights, is_pas)
    if potential_new_verb is not None and potential_new_objs is not None and len(potential_new_objs) > 0:
        objs.extend(potential_new_objs)
        v = potential_new_verb
    if len(objs) > 0:
        objs.extend(_get_objs_from_conjunctions(objs))
    return v, objs


# return true if the sentence is passive - at the moment a sentence is assumed passive if it has an auxpass verb
def _is_passive(tokens):
    for tok in tokens:
        if tok.dep_ == "auxpass":
            return True
    return False


# resolve a 'that' where/if appropriate
def _get_that_resolution(toks):
    for tok in toks:
        if 'that' in [t.orth_ for t in tok.lefts]:
            return tok.head
    return None


# simple stemmer using lemmas
def _get_lemma(word: str):
    tokens = nlp(word)
    if len(tokens) == 1:
        return tokens[0].lemma_
    return word


# print information for displaying all kinds of things of the parse tree
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])


# expand an obj / subj np using its chunk
def expand(item, tokens, visited):
    if item.lower_ == 'that':
        temp_item = _get_that_resolution(tokens)
        if temp_item is not None:
            item = temp_item

    parts = []

    if hasattr(item, 'lefts'):
        for part in item.lefts:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    parts.append(item)

    if hasattr(item, 'rights'):
        for part in item.rights:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    if hasattr(parts[-1], 'rights'):
        for item2 in parts[-1].rights:
            if item2.pos_ == "DET" or item2.pos_ == "NOUN":
                if item2.i not in visited:
                    visited.add(item2.i)
                    parts.extend(expand(item2, tokens, visited))
            break

    return parts


# convert a list of tokens to a string
def to_str(tokens):
    if isinstance(tokens, Iterable):
        return ' '.join([item.text for item in tokens])
    else:
        return ''


# find verbs and their subjects / objects to create SVOs, detect passive/active sentences
def findSVOs(tokens):
    svos = []
    is_pas = _is_passive(tokens)
    verbs = _find_verbs(tokens)
    visited = set()  # recursion detection
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            isConjVerb, conjV = _right_of_verb_is_conj_verb(v)
            if isConjVerb:
                v2, objs = _get_all_objs(conjV, is_pas)
                for sub in subs:
                    for obj in objs:
                        objNegated = _is_negated(obj)
                        if is_pas:  # reverse object / subject for passive
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v2.lemma_ if verbNegated or objNegated else v2.lemma_, to_str(expand(sub, tokens, visited))))
                        else:
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v2.lower_ if verbNegated or objNegated else v2.lower_, to_str(expand(obj, tokens, visited))))
            else:
                v, objs = _get_all_objs(v, is_pas)
                for sub in subs:
                    if len(objs) > 0:
                        for obj in objs:
                            objNegated = _is_negated(obj)
                            if is_pas:  # reverse object / subject for passive
                                svos.append((to_str(expand(obj, tokens, visited)),
                                             "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            else:
                                svos.append((to_str(expand(sub, tokens, visited)),
                                             "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                    else:
                        # no obj - just return the SV parts
                        svos.append((to_str(expand(sub, tokens, visited)),
                                     "!" + v.lower_ if verbNegated else v.lower_,))

    return svos
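

Note that `findSVOs` calls `_is_passive(tokens)` once for everything it is given and uses that single flag to decide whether to swap subject and object. Assuming `_is_passive` reports True as soon as any token is passive (its body is not shown here), one passive clause in a long text may flip the order for every sentence. A minimal sketch of one way around this is to run the extractor sentence by sentence. The name `svos_by_sentence` and the stand-in objects are hypothetical and only there so the sketch runs without a spaCy model.

```python
from types import SimpleNamespace


def svos_by_sentence(doc, extractor):
    """Apply `extractor` to each sentence of `doc` and concatenate the results."""
    svos = []
    for sent in doc.sents:  # spaCy Doc objects expose sentences as `doc.sents`
        svos.extend(extractor(sent))
    return svos


# Stand-ins (assumptions for illustration): a fake document with two
# "sentences" and a fake extractor that just lowercases the words.
def fake_extractor(sent):
    return [tuple(word.lower() for word in sent)]


fake_doc = SimpleNamespace(sents=[["Banks", "reopened"], ["Demand", "increased"]])
result = svos_by_sentence(fake_doc, fake_extractor)
print(result)
```

With the notebook's own objects this would be called as `svos_by_sentence(nlp(article_text), findSVOs)`, assuming the helper functions only iterate over the tokens they receive.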


In [None]:
#from subject_verb_object_extract import findSVOs, nlp
tokens = nlp(article_text)
svos = findSVOs(tokens)
print_limit=30
for svo_print in svos[:print_limit]:
    pprint.pprint(svo_print)

('major expansion plans in', 'roll', 'several banks')
('retail banking', 'foray', 'JPMorgan Chase')
('private , corporate banking', 'include', 'its other offerings in')
('a strong foothold in via its Trust venture', 'have', 'U.S. Bank')
('20 new branches in the Charlotte area', 'open', 'Fifth Third Bank')
('Charlotte', 'choose', 'Truist Bank , result ,')
('its headquarters', 'choose', 'Truist Bank , result ,')
('the largest bank in', 'create', 'The Truist merger')
('assets', 'create', 'The Truist merger')
('Charlotte (', 'hit', 'COVID-19')
('March 2020', 'hit', 'COVID-19')
('orders in for several weeks', 'hit', 'COVID-19')
('home', 'stay', 'orders in for')
('Charlotte banks', 'responded')
('home', 'work', 'employees')
('demand', 'increased')
('many branches', 'reopened')
('their offices', 'return', 'bank employees')
('the results of a " stress test " for with assets of',
 'conduct',
 'the Federal Reserve')
('the results of', 'release', 'the Federal Reserve')
('various recessionary scen
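

The tuples returned by `findSVOs` are not uniform: a verb without an object yields a 2-tuple, and a negated verb or object is marked by a `!` prefix on the verb. The following is a minimal sketch of normalizing them into fixed-width records before loading them into a graph. `Triple` and `normalize_svo` are hypothetical names, the fields simply follow tuple position, and the negated sample is made up for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Triple(NamedTuple):
    """Hypothetical record type; not part of the extraction code above."""
    subject: str
    predicate: str
    object: Optional[str]
    negated: bool


def normalize_svo(svo: Sequence[str]) -> Triple:
    """Turn one findSVOs tuple (2 or 3 strings) into a Triple."""
    first, verb = svo[0], svo[1]
    last = svo[2] if len(svo) > 2 else None  # SV-only tuples have no object
    negated = verb.startswith("!")           # "!" marks a negated verb/object
    return Triple(first, verb[1:] if negated else verb, last, negated)


samples = [
    ("Charlotte banks", "responded"),        # taken from the output above
    ("the bank", "!open", "new branches"),   # made-up negated example
]
triples = [normalize_svo(s) for s in samples]
for triple in triples:
    print(triple)
```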

Textacy can also extract SVO/SPO triples, but it pulls in many more dependencies.
For filtering the results, please refer to the quickstart guide:
https://textacy.readthedocs.io/en/0.12.0/quickstart.html


In [None]:
import textacy
from textacy import extract

textacy_svos = textacy.extract.subject_verb_object_triples(doc)
print_limit=30

textacy_svos_list = list(textacy_svos)          

for svo_print in textacy_svos_list[:print_limit]:
    pprint.pprint(svo_print)

SVOTriple(subject=[banks], verb=[were, rolling], object=[expansion, plans])
SVOTriple(subject=[JPMorgan, Chase], verb=[planned], object=[to, foray, into, retail, banking, to, complement, its, other, offerings, in, Charlotte, ,, including, private, ,, corporate, and, commercial, banking])
SVOTriple(subject=[U.S., Bank], verb=[had], object=[foothold])
SVOTriple(subject=[U.S., Bank], verb=[wanted], object=[to, deepen, its, retail, presence, in, the, area, by, opening, retail, branches])
SVOTriple(subject=[Fifth, Third, Bank], verb=[planned], object=[to, open, 20, new, branches, in, the, Charlotte, area])
SVOTriple(subject=[Truist, Bank], verb=[chose], object=[Charlotte])
SVOTriple(subject=[Truist, merger], verb=[created], object=[bank])
SVOTriple(subject=[COVID-19], verb=[hit], object=[Charlotte, United, States])
SVOTriple(subject=[Federal, Reserve], verb=[conducted], object=[results])
SVOTriple(subject=[Federal, Reserve], verb=[released], object=[results])
SVOTriple(subject=[test], verb=
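

Each `SVOTriple` holds lists of tokens rather than strings, so a small conversion step is usually needed before the triples can be compared with the `findSVOs` output or written to a graph. A minimal sketch follows; `triple_to_text` is a hypothetical helper, and the namedtuple and `tok` stand-ins are there only so the sketch runs without textacy or a spaCy model.

```python
from collections import namedtuple
from types import SimpleNamespace

# Stand-in with the same field names as textacy's SVOTriple (assumption for
# illustration); with real data, pass items of textacy_svos_list instead.
SVOTriple = namedtuple("SVOTriple", ["subject", "verb", "object"])


def tok(text):
    """Stand-in for a spaCy token: only the `.text` attribute is used."""
    return SimpleNamespace(text=text)


def triple_to_text(triple):
    """Join each token list of an SVOTriple-like object into one string."""
    return tuple(" ".join(t.text for t in part) for part in triple)


demo = SVOTriple(
    subject=[tok("Truist"), tok("Bank")],
    verb=[tok("chose")],
    object=[tok("Charlotte")],
)
demo_text = triple_to_text(demo)
print(demo_text)  # ('Truist Bank', 'chose', 'Charlotte')
```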

Example Java command to start the CoreNLP server manually (not required)

In [None]:
#!java -mx8g -cp "./corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -server_id server0 -threads 5 &

To shut down the CoreNLP server, call its shutdown endpoint with the key stored in the server's key file

In [None]:
!wget "localhost:9001/shutdown?key=`cat /tmp/corenlp.shutdown.server0`" -O -

--2022-06-27 03:01:36--  http://localhost:9001/shutdown?key=fbo3js50rt4jui0b1s96e2b350
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:9001... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [text/plain]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s               Shutdown successful!

2022-06-27 03:01:36 (3.35 MB/s) - written to stdout [21/21]
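

The same shutdown request can be sent from Python instead of `wget`. This is a minimal sketch using only the standard library; the function names are hypothetical, and the key file path, host and port are taken from the commands above.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen


def shutdown_url(key, host="localhost", port=9001):
    """Build the shutdown URL used by the wget command above."""
    return f"http://{host}:{port}/shutdown?{urlencode({'key': key})}"


def shutdown_corenlp(key_file="/tmp/corenlp.shutdown.server0",
                     host="localhost", port=9001, timeout=10):
    """Read the shutdown key from `key_file` and call the endpoint."""
    key = Path(key_file).read_text().strip()
    with urlopen(shutdown_url(key, host, port), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call shutdown_corenlp() when a server is running.
print(shutdown_url("example-key"))
```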



# Final Thoughts
We hope you have enjoyed this introductory safari through SVO/SPO entity and relation extraction.

# Bibliography

Mining an economic news article using pre-trained language models - Oliver Batey

https://towardsdatascience.com/mining-an-economic-news-article-using-pre-trained-language-models-f75af041ecf0
https://github.com/oliver-batey/Text-Mining/blob/main/keyterm_extraction.py


How Charlotte’s Banks Responded to COVID-19
By Elizabeth Medlin - The Federal Reserve Bank of Richmond
(Views expressed in this article are those of the authors and not necessarily those of their respective Reserve Banks or the Federal Reserve System.)

https://www.richmondfed.org/publications/research/coronavirus/economic_impact_covid-19_11-10-21


In [None]:
%%capture
# NOTE: please download this notebook and upload to the /content folder - this is a manual step!
# https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab
!jupyter nbconvert --to html /content/FEDReviewCLTBanksCovid.ipynb
!jupyter nbconvert --to latex /content/FEDReviewCLTBanksCovid.ipynb
!apt install texlive-xetex texlive-fonts-recommended texlive-generic-recommended
!jupyter nbconvert --to pdf /content/FEDReviewCLTBanksCovid.ipynb
!git clone https://github.com/hakimel/reveal.js.git
!jupyter nbconvert /content/FEDReviewCLTBanksCovid.ipynb --to slides --reveal-prefix reveal.js 
