In [1]:
import pandas as pd
import json

pd.options.display.max_rows = 600
pd.set_option('display.max_colwidth', 800)

In [2]:
base_dir = '/home/adiga/my_work/entity-relation-repo/data/'
in_file_path = base_dir + 'parsed_data_vp.jl'

In [3]:
in_data = []
# Read the JSON Lines file; the context manager closes the handle afterwards
with open(in_file_path, 'r') as in_file:
    for line in in_file:
        in_data.append(json.loads(line))

df_relations = pd.DataFrame(in_data)

In [4]:
# TODO: clean the text. Examples of patterns that need handling:
# /\n\t\t\t
# KPMG’s 
# \n\n
# Mumbai: Indian banking services, apart ...
# service — key staff will be...
# “We request all our...suspended,”
# Better communication (is needed) to let...
# 40 per cent, 40%, 40 percent etc
# could be short-lived and...
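
The patterns listed above could be handled with a few regular expressions. The following is only a minimal sketch: the function name `clean_text` and every rule in it (dropping datelines, removing parenthetical asides, normalizing percentages, straightening quotes) are assumptions about the intended cleanup, not something the notebook specifies.

```python
import re

# Hypothetical helper: the rules below are guesses at the intended cleanup.
_QUOTES = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})

def clean_text(text):
    # Straighten curly quotes and apostrophes (e.g. KPMG’s -> KPMG's)
    text = text.translate(_QUOTES)
    # Drop a leading dateline such as "Mumbai: " or "NEW DELHI: "
    text = re.sub(r'^\s*[A-Z][A-Za-z ]{1,30}:\s*', '', text)
    # Remove parenthetical asides such as "(is needed)"
    text = re.sub(r'\s*\([^)]*\)', '', text)
    # Normalize "40 per cent", "40 percent" and "40 %" to "40%"
    text = re.sub(r'(\d+(?:\.\d+)?)\s*(?:per\s?cent|%)', r'\1%', text)
    # Turn a spaced dash into a comma
    text = re.sub(r'\s+[—–]\s+', ', ', text)
    # Collapse newlines, tabs and repeated spaces
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text('Mumbai: Indian banking services, apart'))
print(clean_text('40 per cent, 40%, 40 percent'))
```

Note that the parenthetical rule would also strip abbreviations such as "(CoC)", which may or may not be desirable for relation extraction.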

In [5]:
# df_relations[['0_sub','0_verb','0_obj']]

In [7]:
df_relations[['0_noun1','0_verb','0_noun2','text']]

Unnamed: 0,0_noun1,0_verb,0_noun2,text
0,,,,NEW DELHI:
1,resolution,professionals (,RP’s) to run,"Lenders have appointed resolution professionals (RP’s) to run the insolvency proceedings of Ballarpur Industries Limited (BILT) and KSK Mahanadi Power, kicking off the process to find buyers for the two companies, as per people aware of the matter.\n\n"
2,,,,KPMG’s Anuj Jain has been appointed to run the insolvency proceedings of BILT.
3,,,,Sumit Binani will manage the insolvency process for KSK Mahanadi Power.\n\n
4,,,,Jain is also the RP of Jaypee Infratech.
5,,,,Jain and Binani did not respond to requests for comment.\n\n
6,,,,"As per insolvency rules, a committee of creditors (CoC) is formed once an application for insolvency is admitted against a company at the NCLT."
7,,,,The CoC subsequently appoints a resolution professional or an administrator to run the company on their behalf till a buyer is found.\n\n
8,,,,Ballarpur Industries owes an IDBI Bank-led consortium of lenders Rs. 2000 crore that it has not been able to repay.
9,,,,"KSK Mahanadi Power has defaulted on loan repayments of Rs. 15,000 crore.\n\n"


In [7]:
import spacy
import textacy
import en_core_web_sm
from spacy import displacy
sp_core_nlp = en_core_web_sm.load()

In [8]:
index = 43
sent = df_relations.loc[index,:].text
doc = sp_core_nlp(sent)
displacy.render(doc, style="dep")

In [9]:
texts = df_relations.sample(40).text

In [10]:
pattern = r'(<AUX>*<VERB>?[<ADP><PART><ADV>]*<VERB>+<ADP>?)'

for sent in texts:
    print(sent)
    doc = sp_core_nlp(sent)
    verb_phrases = textacy.extract.pos_regex_matches(doc, pattern)
    with doc.retokenize() as retokenizer:
        n_chks = [chk for chk in doc.noun_chunks]
        prev = None
        for nc in n_chks:
            if prev is None:
                prev = nc
                continue
            # If the two noun chunks are adjacent to each other, merge them together.
            # Note: prev is never advanced, so only a chunk that directly follows
            # the first noun chunk is ever merged.
            if prev.end == nc.start:
                retokenizer.merge(doc[prev.start:nc.end])
    # Whatever is left between the noun chunks carries the relationships
    print('NC:', [nc for nc in doc.noun_chunks])
    print('VP:', [vp for vp in verb_phrases])
    print({tk.text: tk.pos_ for tk in doc if len(tk.text) <= 4})
    print('-'*30)

Our employees are also facing the same challenges that you are and so we are asking for your help too.” 


NC: [Our employees, the same challenges, that, you, we, your help]
VP: [are also facing, are asking for]
{'we': 'PRON', 'help': 'NOUN', 'the': 'DET', '\n\n': 'SPACE', 'your': 'PRON', 'you': 'PRON', 'and': 'CCONJ', 'same': 'ADJ', 'are': 'AUX', 'so': 'ADV', 'that': 'PRON', 'too': 'ADV', 'Our': 'PRON', '”': 'PUNCT', 'also': 'ADV', 'for': 'ADP', '.': 'PUNCT'}
------------------------------
The microfinance sector’s gross loan portfolio grew 24% year-on-year to Rs 2.11 lakh crore at the end of December 2019, despite the economic slowdown and repayment crises in pockets of Assam, Karnataka and Maharashtra.
NC: [The microfinance sector’s gross loan portfolio, year, the end, December, the economic slowdown, repayment crises, pockets, Assam, Karnataka, Maharashtra]
VP: [’s, grew, crore at]
{'lakh': 'ADJ', 'the': 'DET', '’s': 'VERB', 'to': 'ADP', 'year': 'NOUN', 'and': 'CCONJ', 'at': 'ADP',



NC: [the wake, the COVID-19 outbreak, the Hotel & Restaurant Association, Eastern India, hotels, all categories, Kolkata, isolation rooms, guests, they, the country]
VP: [has asked, to keep, are coming from]
{'&': 'CCONJ', 'wake': 'NOUN', 'from': 'ADP', 'In': 'ADP', 'to': 'PART', 'has': 'AUX', 'of': 'ADP', 'keep': 'VERB', 'in': 'ADP', 'are': 'AUX', ',': 'PUNCT', 'they': 'PRON', 'for': 'ADP', '.': 'PUNCT', 'all': 'DET', 'the': 'DET', 'or': 'CCONJ'}
------------------------------
Sumit Binani will manage the insolvency process for KSK Mahanadi Power.


NC: [Sumit Binani, the insolvency process, KSK Mahanadi Power]
VP: [will manage]
{'\n\n': 'SPACE', 'the': 'DET', 'will': 'AUX', 'KSK': 'PROPN', 'for': 'ADP', '.': 'PUNCT'}
------------------------------
The central government and state governments have decided to lock down 75 districts across the country till March 31 where Covid-19 cases have been reported.


NC: [The central government, state governments, 75 districts, the country, March



NC: [Mr Mukherjee’s guidance, us, investors, Uttrayan founder-cum managing director Kartick Biswas]
VP: [will help, negotiating with, said]
{'with': 'ADP', '-': 'PUNCT', 'will': 'AUX', 'us': 'PRON', 'said': 'VERB', 'help': 'VERB', '’s': 'PROPN', '\n\n': 'SPACE', 'Mr': 'PROPN', 'cum': 'NOUN', '”': 'PUNCT', ',': 'PUNCT', '.': 'PUNCT'}
------------------------------
Several banks have decided to stagger their work force under business continuity plans based on government guidelines and a standard operating procedure.
NC: [Several banks, their work force, business continuity plans, government guidelines, a standard operating procedure]
VP: [have decided to stagger, based on]
{'have': 'AUX', 'work': 'NOUN', 'a': 'DET', 'to': 'PART', 'on': 'ADP', 'and': 'CCONJ', '.': 'PUNCT'}
------------------------------
The Confederation of Indian Industry has urged the government to exempt services such as e-commerce including delivery boys and supply chain vendors, food processing warehousing and IT-ITe



NC: [Public, these modes, digital payment, the convenience, their homes, online channels, cash, which, places, money, bills, the central bank, a statement]
VP: [can use, avoid using, may require going to, crowded, for sending, paying, said in]
{'said': 'VERB', 'to': 'ADP', 'cash': 'NOUN', 'bank': 'NOUN', 'and': 'CCONJ', 'may': 'AUX', 'last': 'ADJ', '\n\n': 'SPACE', 'the': 'DET', 'from': 'ADP', 'use': 'VERB', ',': 'PUNCT', 'week': 'NOUN', 'or': 'CCONJ', 'a': 'DET', '“': 'PUNCT', 'of': 'ADP', 'in': 'ADP', 'for': 'ADP', '”': 'PUNCT', 'can': 'AUX', '.': 'PUNCT'}
------------------------------
The rotation of staff attending office and those advised to work from home will be made once in two days.


NC: [The rotation, staff, office, home, two days]
VP: [attending, advised, to work from, will be made]
{'from': 'ADP', 'home': 'NOUN', 'work': 'VERB', 'will': 'AUX', 'to': 'PART', 'two': 'NUM', 'The': 'DET', 'and': 'CCONJ', 'of': 'ADP', 'made': 'VERB', 'in': 'ADP', 'be': 'AUX', '\n\n': 'SPACE', 
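
The merge loop in the cell above aims to join noun chunks that sit directly next to each other. One way to collapse whole runs of touching chunks is sketched below on plain `(start, end)` token offsets; the helper name `merge_adjacent` is hypothetical, and this is an illustration of the idea rather than the notebook's actual behavior.

```python
def merge_adjacent(spans):
    """Collapse runs of touching (start, end) spans into single spans.

    `spans` is assumed to be sorted by start offset, with `end` exclusive,
    as is the case for spaCy noun chunks.
    """
    merged = []
    for start, end in spans:
        if merged and merged[-1][1] == start:
            # This span begins where the previous one ended: extend the run
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

print(merge_adjacent([(0, 2), (2, 5), (5, 6), (8, 9)]))
```

With spaCy, each merged pair could then be passed as `doc[start:end]` to `retokenizer.merge`. Building non-overlapping runs first matters because, to my understanding, the retokenizer rejects overlapping spans.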



In [24]:
import traceback

def comes_before(tok1, tok2):
    #print(tok1, ',', tok2)
    return tok1.start < tok2.start

def get_node_edge_pairs(doc, verb_phrases, noun_chunks):
    # If either of the lists is empty, there is nothing to do here
    if len(verb_phrases) == 0 or len(noun_chunks) == 0:
        return None

    vp_i = nc_i = 0
    node_edge_node_list = []
    start_node = end_node = edge = None
    while True:
        # Both lists still have unseen items
        if vp_i < len(verb_phrases) and nc_i < len(noun_chunks):
            if comes_before(verb_phrases[vp_i], noun_chunks[nc_i]):
                start_tok = verb_phrases[vp_i]  
                visited = False
                while(vp_i < len(verb_phrases) and comes_before(verb_phrases[vp_i], noun_chunks[nc_i])):
                    vp_i += 1
                    visited = True

                # Update vp_i if it had entered the loop
                if visited:
                    vp_i -= 1
                end_tok = verb_phrases[vp_i]
                
                # Mark this as edge
                edge = doc[start_tok.start:end_tok.end]
                # Move to next verb phrase
                vp_i += 1

            else:
                start_tok = noun_chunks[nc_i]
                visited = False
                while(nc_i < len(noun_chunks) and comes_before(noun_chunks[nc_i], verb_phrases[vp_i])):
                    nc_i += 1
                    visited = True

                # Update nc_i if it had entered the loop
                if visited:
                    nc_i -= 1   
                end_tok = noun_chunks[nc_i]

                # Identify the start node, edge and end node
                if start_node is None:
                    start_node = doc[start_tok.start:end_tok.end]
                else:
                    end_node = doc[start_tok.start:end_tok.end]
                    if edge is not None:
                        # Found a node-edge-node triple here, reset the markers
                        node_edge_node_list.append((start_node, edge, end_node))
                        start_node = end_node
                        edge = None
                    else:
                        print('Triplet list so far:{}'.format(node_edge_node_list))
                        print('Something wrong! edge is not set {}'.format(doc))
                        print('start_node {} end_node {}'.format(start_node, end_node))
                # Move to next noun chunk
                nc_i += 1

            # End of inner if-else
        else:
            # If either of the list has been consumed
            if vp_i == len(verb_phrases):
                # Verb phrases have been consumed, noun chunks are available
                # Remaining noun chunks will be the end_node if edge is already set
                end_node = doc[noun_chunks[nc_i].start:]
                if edge is not None:
                    # Found a node-edge-node triple here, reset the markers
                    node_edge_node_list.append((start_node, edge, end_node))
                    start_node = end_node
                    edge = None
                break
            else:
                # Noun chunks have been consumed, verb phrases are available
                print('Not sure what to do with unused VP: {}'.format(verb_phrases[vp_i:]))
                # Create an edge using the first un-used verb phrase
                unused_vp = verb_phrases[vp_i]
                edge = doc[unused_vp.start:unused_vp.end]
                
                # Create a dummy end node
                end_node = None
                # Create and add a triplet
                node_edge_node_list.append((start_node, edge, end_node))
                start_node = end_node
                edge = None
                
                break
    # End of while loop
    print('Triplets:', node_edge_node_list)
    return node_edge_node_list
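
The interleaving idea behind `get_node_edge_pairs` can be illustrated on plain `(start, end)` token offsets. The sketch below is a simplified variant, not a drop-in replacement: the name `triples` is hypothetical, and unlike the function above it neither extends the last node to the end of the sentence nor emits a triple for a trailing verb phrase.

```python
from itertools import groupby

def triples(noun_spans, verb_spans):
    """Return (node, edge, node) offset triples from interleaved spans."""
    # Tag each span with its kind and order everything by position
    tagged = sorted([(s, e, 'N') for s, e in noun_spans] +
                    [(s, e, 'V') for s, e in verb_spans])
    # Collapse each run of same-kind spans into one covering span
    runs = []
    for kind, group in groupby(tagged, key=lambda t: t[2]):
        group = list(group)
        runs.append((kind, (group[0][0], group[-1][1])))
    # Runs now alternate between N and V; keep every N-V-N window
    out = []
    for i in range(len(runs) - 2):
        (k1, a), (k2, b), (k3, c) = runs[i:i + 3]
        if (k1, k2, k3) == ('N', 'V', 'N'):
            out.append((a, b, c))
    return out

# "ATMs will be operating across the country ."
print(triples([(0, 1), (5, 7)], [(1, 5)]))
```

Each offset pair maps back to text with `doc[start:end]`.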

In [51]:
def get_truncated_noun_chunks(noun_chunks, verb_phrases):
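    # Identify overlap between noun chunks and verb phrases.
    # Only the noun chunk and verb phrase at the same list index are compared;
    # tokens shared at the start are stripped from the noun chunk.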
    truncated_noun_chunks = {}
    for i,nc_vp in enumerate(zip(noun_chunks, verb_phrases)):
        nc,vp = nc_vp
        if nc.start == vp.start:
            while len(nc) > 0 and len(vp) > 0 and nc.start == vp.start:
                nc = nc[1:]
                vp = vp[1:]
            truncated_noun_chunks[i] = nc

    # Replace with the truncated noun_chunk
    for ind,nc in truncated_noun_chunks.items():
        noun_chunks[ind] = nc
    # Identify the empty noun chunks and drop them
    noun_chunks = [nc for nc in noun_chunks if len(nc) > 0]
    return noun_chunks

In [50]:
for sent in texts:
    doc = sp_core_nlp(sent)
    verb_phrases = list(textacy.extract.pos_regex_matches(doc, pattern))
    noun_chunks = list(doc.noun_chunks)
    print(doc)
    print('NC:', noun_chunks)
    print('VP:', verb_phrases)
    noun_chunks = get_truncated_noun_chunks(noun_chunks, verb_phrases)
    print('Truncated_NC:', noun_chunks)
    node_edge_node_list = get_node_edge_pairs(doc, verb_phrases, noun_chunks)
    print('-'*30)
    



Our employees are also facing the same challenges that you are and so we are asking for your help too.” 


NC: [Our employees, the same challenges, that, you, we, your help]
VP: [are also facing, are asking for]
Truncated_NC: [Our employees, the same challenges, that, you, we, your help]
Triplets: [(Our employees, are also facing, the same challenges that you are and so we), (the same challenges that you are and so we, are asking for, your help too.” 

)]
------------------------------
The microfinance sector’s gross loan portfolio grew 24% year-on-year to Rs 2.11 lakh crore at the end of December 2019, despite the economic slowdown and repayment crises in pockets of Assam, Karnataka and Maharashtra.
NC: [The microfinance sector’s gross loan portfolio, year, the end, December, the economic slowdown, repayment crises, pockets, Assam, Karnataka, Maharashtra]
VP: [’s, grew, crore at]
Truncated_NC: [The microfinance sector’s gross loan portfolio, year, the end, December, the economic slowd



ATMs will be operating across the country.


NC: [ATMs, the country]
VP: [will be operating across]
Truncated_NC: [ATMs, the country]
Triplets: [(ATMs, will be operating across, the country.

)]
------------------------------
The company further noted that "it will continue to assess the situation and will consider resumption of its business operations at an appropriate time".


NC: [The company, it, the situation, resumption, its business operations, an appropriate time]
VP: [further noted, will continue to assess, will consider]
Truncated_NC: [The company, it, the situation, resumption, its business operations, an appropriate time]
Triplets: [(The company, further noted, it), (it, will continue to assess, the situation), (the situation, will consider, resumption of its business operations at an appropriate time".

)]
------------------------------
“We are coordinating and communicating all instances of trouble these companies are facing with secretaries at the state level to take act

