In [1]:
import spacy
from spacy import displacy 
from spacy.matcher import Matcher 

import visualise_spacy_tree
from IPython.display import Image, display

nlp = spacy.load('en_core_web_sm')

In [2]:
import glob, re
import pandas as pd

from tqdm.auto import tqdm

tqdm.pandas()


# Information Extraction

The task of Information Extraction (IE) involves extracting meaningful information from unstructured text data and presenting it in a structured format.

* Reference: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/) 
* Supplementary: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/?utm_source=blog&utm_medium=nlp-project-information-extraction)
* NLP Learning Path: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2020/01/learning-path-nlp-2020/?utm_source=blog&utm_medium=nlp-project-information-extraction)

Information Extraction (IE) is a crucial cog in the field of Natural Language Processing (NLP) and linguistics. It’s widely used for tasks such as Question Answering Systems, Machine Translation, Entity Extraction, Event Extraction, Named Entity Linking, Coreference Resolution, Relation Extraction, etc.

In information extraction, there is an important concept of triples.

> A triple represents a couple of entities and a relation between them. For example, (Obama, born, Hawaii) is a triple in which ‘Obama’ and ‘Hawaii’ are the related entities, and the relation between them is ‘born’.
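
As a rough illustration, a triple can be held in a small record type. The `Triple` class and its field names are assumptions made for this sketch; they are not part of spaCy or of the referenced articles.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject, relation, object) record; the field names are illustrative."""
    subject: str
    relation: str
    obj: str


# The example triple from the text above
triple = Triple(subject="Obama", relation="born", obj="Hawaii")

# Triples unpack like ordinary tuples, which makes them easy to store as rows
subject, relation, obj = triple
print(f"{subject} --{relation}--> {obj}")
```

Keeping the relation as its own field makes it simple to filter a collection of triples by relation later on.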

## Different approaches to Information Extraction

![ietypes](../meta/IE_types.webp)

* In Traditional Information Extraction, the relations to be extracted are pre-defined.

* In Open Information Extraction, the relations are not pre-defined. The system is free to extract any relations it comes across while going through the text data.

### Different Approaches to Traditional Information Extraction

* Rule-based Approach: We define a set of rules for the syntax and other grammatical properties of a natural language and then use these rules to extract information from text.
* Supervised: Let’s say we have a sentence S. It has two entities E1 and E2. Now, the supervised machine learning model has to detect whether there is any relation (R) between E1 and E2. So, in a supervised approach, the task of relation extraction turns into the task of relation detection. The main drawback of this approach is that it needs a lot of labeled data to train a model.
* Semi-supervised: When we don’t have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns that can be used to extract more relations from the text.
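
The semi-supervised idea can be sketched with plain regular expressions. This is a toy illustration under stated assumptions: the seed pair, the four-sentence corpus and the helper names `learn_patterns` and `apply_patterns` are all invented here, and real systems score and filter patterns far more carefully.

```python
import re

# Hypothetical seed pair and toy corpus, invented for this sketch
seeds = [("Obama", "Hawaii")]
corpus = [
    "Obama was born in Hawaii.",
    "Gandhi was born in Porbandar.",
    "Einstein was born in Ulm.",
    "Tagore wrote Gitanjali.",
]


def learn_patterns(seeds, sentences):
    """Collect the text that connects each seed pair inside a sentence."""
    patterns = set()
    for first, second in seeds:
        for sentence in sentences:
            match = re.search(re.escape(first) + r"(.+?)" + re.escape(second), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Use each learned connector to find new pairs of capitalised words."""
    pairs = set()
    for connector in patterns:
        regex = re.compile(r"([A-Z]\w+)" + re.escape(connector) + r"([A-Z]\w+)")
        for sentence in sentences:
            for first, second in regex.findall(sentence):
                pairs.add((first, second))
    return pairs


patterns = learn_patterns(seeds, corpus)
new_pairs = apply_patterns(patterns, corpus) - set(seeds)
print(patterns)
print(sorted(new_pairs))
```

The seed sentence yields the connector " was born in ", which then matches two further sentences and leaves the unrelated one alone.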

# Information Extraction using spaCy


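We all know that sentences are made up of words belonging to different Parts of Speech (POS). Traditional English grammar lists eight of them: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. spaCy uses the finer-grained Universal POS tag set, which is why tags such as `DET` and `AUX` appear in the output below.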

The POS determines how a specific word functions in meaning in a given sentence. For example, take the word “right”. In the sentence, “The boy was awarded chocolate for giving the right answer”, “right” is used as an adjective. Whereas, in the sentence, “You have the right to say whatever you want”, “right” is treated as a noun.

This goes to show that the POS tag of a word carries a lot of significance when it comes to understanding the meaning of a sentence. And we can leverage it to extract meaningful information from our text.

In [3]:
text = "This is a sample sentence."

doc = nlp(text)

for token in doc:
    print(token.text,'->',token.pos_)

This -> DET
is -> AUX
a -> DET
sample -> NOUN
sentence -> NOUN
. -> PUNCT


If we wanted to extract the nouns from a sentence, we could look at the POS tag of each token, using the attribute `.pos_`, and keep the tokens tagged `NOUN`.

In [4]:
for token in doc:
    if token.pos_ == 'NOUN':
        print(token.text)

sample
sentence


It was that easy to extract words based on their POS tags. But sometimes extracting information purely based on the POS tags is not enough. Have a look at the sentence below:

In [5]:
text = "The children love cream biscuits"

doc = nlp(text)

for token in doc:
    print(token.text,'->',token.pos_)

The -> DET
children -> NOUN
love -> VERB
cream -> NOUN
biscuits -> NOUN


If I want to extract the subject and the object from this sentence, I can’t do that based on POS tags alone: “children” and “biscuits” are both tagged `NOUN`. For that, I need to look at how the words are related to each other. These relations are called dependencies.

We can make use of spaCy’s displacy visualizer that displays the word dependencies in a graphical manner:

In [6]:
text = "The children love cream biscuits"

doc = nlp(text)

displacy.render(doc, style='dep',jupyter=True)

This directed graph is known as a dependency graph. It represents the relations between different words of a sentence.

Each word is a node in the dependency graph. The relationship between words is denoted by the edges. For example, “The” is a determiner here, “children” is the subject of the sentence, “biscuits” is the object of the sentence, and “cream” is a compound modifier that gives us more information about the object.

The arrows carry a lot of significance here:

* Each arrow starts at a head word and points to a word that depends on it
* The word at the arrowhead is referred to as the child node of the word at the arrow’s origin. For example, “children” is the child node of “love”
* The word which has no incoming arrow is called the root node of the sentence
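
The head, child and root relations can be sketched without spaCy by writing the parse down by hand. The `heads` list below is a hand-written encoding of the description above, not parser output, and `children_of` is a helper invented for this sketch.

```python
# Hand-written parse of "The children love cream biscuits";
# each token stores the index of its head (None marks the root).
tokens = ["The", "children", "love", "cream", "biscuits"]
heads = [1, 2, None, 4, 2]


def children_of(index):
    """Return the words whose incoming arrow starts at tokens[index]."""
    return [tokens[i] for i, head in enumerate(heads) if head == index]


# The root is the only token without an incoming arrow
root = next(i for i, head in enumerate(heads) if head is None)

print("root:", tokens[root])
for i, word in enumerate(tokens):
    print(word, "->", children_of(i))
```

In spaCy the same information is exposed per token through `token.head` and `token.children`.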

Just as spaCy tokens have the attribute `.pos_` for the POS tag, they have the attribute `.dep_` for the dependency label of a token:

In [7]:
print('Subjects:')        
print(*[token.text for token in doc if token.dep_ == 'nsubj'])
print()
print('Objects:')        
print(*[token.text for token in doc if token.dep_ == 'dobj'])

Subjects:
children

Objects:
biscuits


Using POS tags and dependency tags, we can look for relationships between different entities in a sentence. For example, in the sentence “The cat perches on the window sill”, the subject “cat” and the prepositional object “window sill” are related by the preposition “on”. We can look for such relationships, and many more, to extract meaningful information from our text data.

# United Nations General Debate Corpus

In [8]:
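import os

folders = glob.glob('UNGDC-1970-2018/Converted sessions/Session*')

rows = []
for folder in folders:
    # Each session folder holds one speech per country; keep India's (IND)
    speech = glob.glob(os.path.join(folder, 'IND*.txt'))
    if not speech:
        continue
    # 'utf-8-sig' also drops the byte-order mark that some files start with
    with open(speech[0], encoding='utf-8-sig') as f:
        text = f.read()
    # File names are assumed to follow <Country>_<Session>_<Year>.txt
    parts = os.path.splitext(os.path.basename(speech[0]))[0].split('_')
    rows.append({'Session': parts[-2], 'Year': parts[-1],
                 'Country': parts[0], 'Speech': text})

# A list (not a set) keeps the column order deterministic
df = pd.DataFrame(rows, columns=['Session', 'Year', 'Country', 'Speech'])
df.head()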

| | Session | Year | Country | Speech |
|---|---|---|---|---|
| 0 | 25 | 1970 | IND | 40.\t Mr. President, I offer you our congratul... |
| 1 | 26 | 1971 | IND | 38.\tMr. President, on behalf of the people of... |
| 2 | 27 | 1972 | IND | Mr. President, I offer you on behalf of India ... |
| 3 | 28 | 1973 | IND | 122.\tMr. President, I bring to you and to al... |
| 4 | 29 | 1974 | IND | Mr. President, I have already had occasion to ... |
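
The loop above reads the country, session and year out of each file name. A possible way to do this independently of the operating system's path separator is `pathlib`. The sketch assumes file names of the form `<Country>_<Session>_<Year>.txt`, as the splits above suggest. The folder name `Session 25 - 1970` and the helper `parse_speech_path` are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def parse_speech_path(path):
    """Split '<country>_<session>_<year>.txt' into its three fields."""
    country, session, year = path.stem.split("_")
    return {"Country": country, "Session": session, "Year": year}


# Hypothetical paths in both separator styles; only the file name is used
posix_path = PurePosixPath(
    "UNGDC-1970-2018/Converted sessions/Session 25 - 1970/IND_25_1970.txt")
windows_path = PureWindowsPath(
    r"UNGDC-1970-2018\Converted sessions\Session 25 - 1970\IND_25_1970.txt")

print(parse_speech_path(posix_path))
print(parse_speech_path(windows_path))
```

Both calls return the same fields, so the parsing no longer depends on whether the path uses `/` or `\`.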


In [9]:
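def clean(text):
    text = str(text)
    # Drop paragraph numbers such as "40.<tab>" (the dot must be escaped)
    text = re.sub(r'[0-9]+\.\t', '', text)
    # Join wrapped lines
    text = re.sub(r'\n ', '', text)
    text = re.sub(r'\n', ' ', text)
    # Drop possessive endings, hyphens, dashes and double quotes
    text = re.sub(r"'s", '', text)
    text = re.sub(r'-', ' ', text)
    text = re.sub(r'— ', '', text)
    text = re.sub(r'"', '', text)
    # Remove the full stop from titles so it is not read as a sentence end
    text = re.sub(r'Mr\.', 'Mr', text)
    text = re.sub(r'Mrs\.', 'Mrs', text)
    # Remove bracketed asides such as "(...)" or "[...]"
    text = re.sub(r'[\(\[].*?[\)\]]', '', text)

    return text

df['Speech_clean'] = df['Speech'].apply(clean)
df.head()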

| | Session | Year | Country | Speech | Speech_clean |
|---|---|---|---|---|---|
| 0 | 25 | 1970 | IND | 40.\t Mr. President, I offer you our congratul... | Mr President, I offer you our congratulations... |
| 1 | 26 | 1971 | IND | 38.\tMr. President, on behalf of the people of... | Mr President, on behalf of the people of India... |
| 2 | 27 | 1972 | IND | Mr. President, I offer you on behalf of India ... | Mr President, I offer you on behalf of India o... |
| 3 | 28 | 1973 | IND | 122.\tMr. President, I bring to you and to al... | Mr President, I bring to you and to all our c... |
| 4 | 29 | 1974 | IND | Mr. President, I have already had occasion to ... | Mr President, I have already had occasion to c... |
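
One detail worth checking in a pattern like the paragraph-number one in `clean` is whether the dot is escaped. An unescaped `.` matches any character, so the pattern can remove more than intended. The sample string below is invented for this sketch.

```python
import re

sample = "40.\tMr. President, in the 1970s\twe met."

# Unescaped dot: '.' matches any character, so "1970s<tab>" is removed as well
loose = re.sub('[0-9]+.\t', '', sample)

# Escaped dot: only digits followed by a literal full stop and a tab are removed
strict = re.sub(r'[0-9]+\.\t', '', sample)

print(repr(loose))
print(repr(strict))
```

The loose pattern swallows the year, while the strict pattern removes only the leading paragraph number.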


In [10]:
def sentences(text):
    # Split on full stops and question marks
    return re.split(r'[.?]', text)

df['sent'] = df['Speech_clean'].apply(sentences)
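
Splitting on `.` and `?` is simple, but it has limits that are easy to see on a small sample. The sample string below is invented for this sketch, and the clean-up step at the end is one possible option rather than part of the original pipeline.

```python
import re

sample = "Is this fair? We think not. Growth was 2.5 percent."

pieces = re.split('[.?]', sample)
print(pieces)
# ['Is this fair', ' We think not', ' Growth was 2', '5 percent', '']

# Dropping empty pieces and trimming whitespace is a cheap clean-up step
tidy = [piece.strip() for piece in pieces if piece.strip()]
print(tidy)
```

The decimal point in "2.5" is treated as a sentence boundary and the final full stop leaves an empty string behind. This is why a later step may want to filter very short fragments.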

In [11]:
row_list = []

for i in range(len(df)):
    for sent in df.loc[i,'sent']:
        # One row per sentence, with its year and word count
        row_list.append({'Year': df.loc[i,'Year'],
                         'Sent': sent,
                         'Len': len(sent.split())})

df2 = pd.DataFrame(row_list)
df2.head()

| | Year | Sent | Len |
|---|---|---|---|
| 0 | 1970 | Mr President, I offer you our congratulations... | 21 |
| 1 | 1970 | You represent Norway, a country which can tak... | 17 |
| 2 | 1970 | Your personal qualifications and your family ... | 13 |
| 3 | 1970 | I should also like to express our appreciatio... | 19 |
| 4 | 1970 | I would also repeat our admiration for U Than... | 18 |


# Information Extraction using spaCy

## Rule on Noun-Verb-Noun Phrases

A sentence generally contains a subject (noun), an action (verb), and an object (noun). The rest of the words give us additional information about these entities. Therefore, we can leverage this basic structure to extract the main bits of information from the sentence.

In [12]:
def rule1(text):
    doc = nlp(text)
    sent = []

    for token in doc:
        if token.pos_ == 'VERB':
            # Subjects sit to the left of the verb
            for subj in token.lefts:
                if (subj.dep_ in ['nsubj','nsubjpass']) and (subj.pos_ in ['NOUN','PROPN','PRON']):
                    # Direct objects sit to the right of the verb
                    for obj in token.rights:
                        if (obj.dep_ == 'dobj') and (obj.pos_ in ['NOUN','PROPN']):
                            # Build "subject verb-lemma object" for each pair
                            sent.append(subj.text + ' ' + token.lemma_ + ' ' + obj.text)

    return sent

In [13]:
for sent in df2['Sent'].sample(10, random_state=19).values:
    rule = rule1(sent)
    if rule != []:
        print(sent, '->', rule)
        print('\n\n')

 Industrialized countries with planned economies, which do not formally belong to the international monetary system but participate in the global activities of commerce and technological exchange, also face problems of production and renovation -> ['countries face problems']



 The entry of these two countries into the United Nations has taken this Organization one step closer to its goal of universality -> ['entry take Organization']



 The two United Nations Development Decades, one of the 1960s and the other of the 1970s, and a series of protracted negotiations, have proved sterile exercises, belying the hopes that had been raised that inequity between nations need not be an inexorable law and that, for reasons as much economic as ethical, the rich should assist the poor -> ['Decades prove exercises']





## Rule on Adjective Noun Structure

With the previous rule, the extracted information did not feel complete. This is because many nouns have an adjective or a word with a compound dependency that augments their meaning. Extracting these along with the noun gives us better information about the subject and the object.

In [14]:
def rule2(text):
    doc = nlp(text)
    sent = []

    for token in doc:
        if (token.pos_ == 'NOUN') and (token.dep_ in ['dobj','pobj','nsubj','nsubjpass']):
            # Adjectives and compound words attached directly to the noun
            modifiers = [subtoken.text for subtoken in token.children
                         if (subtoken.pos_ == 'ADJ') or (subtoken.dep_ == 'compound')]
            # Keep the noun only when it has at least one modifier
            if modifiers:
                sent.append(' '.join(modifiers + [token.text]))
    return sent

In [15]:
for sent in df2['Sent'].sample(10, random_state=19).values:
    rule = rule2(sent)
    if rule != []:
        print(sent, '->', rule)
        print('\n\n')

 There is renewed awareness of the continued relevance of his message of non violence and tolerance -> ['non violence']



 Industrialized countries with planned economies, which do not formally belong to the international monetary system but participate in the global activities of commerce and technological exchange, also face problems of production and renovation -> ['Industrialized countries', 'international monetary system', 'global activities']



 There can be no durable peace in West Asia without a just and comprehensive settlement, based on the realization by the Palestinian people of their inalienable right to self determination and the recognition of the rights of all States in the region, including Palestine and Israel, to live in peace and security within internationally recognized borders -> ['just settlement', 'Palestinian people', 'inalienable right', 'self determination']



 The bulk of the global military expenditure of $1 trillion a year is accounted for by a handful

## Combining both rules

In [16]:
def rule2_mod(doc, index):
    # Collect the adjectives attached to the token at position `index`.
    # The parsed doc is passed in so the text is not parsed again.
    phrase = ''
    for subtoken in doc[index].children:
        if subtoken.pos_ == 'ADJ':
            phrase += ' ' + subtoken.text
    return phrase

def rule1_mod(text):
    doc = nlp(text)
    sent = []

    for token in doc:
        if token.pos_ == 'VERB':
            # Subjects sit to the left of the verb
            for subj in token.lefts:
                if (subj.dep_ in ['nsubj','nsubjpass']) and (subj.pos_ in ['NOUN','PROPN','PRON']):
                    head = rule2_mod(doc, subj.i) + ' ' + subj.text + ' ' + token.lemma_
                    # Direct objects sit to the right of the verb
                    for obj in token.rights:
                        if (obj.dep_ == 'dobj') and (obj.pos_ in ['NOUN','PROPN']):
                            sent.append(head + rule2_mod(doc, obj.i) + ' ' + obj.text)

    return sent

In [17]:
for sent in df2['Sent'].sample(10, random_state=19).values:
    rule = rule1_mod(sent)
    if rule != []:
        print(sent, '->', rule)
        print('\n\n')

 Industrialized countries with planned economies, which do not formally belong to the international monetary system but participate in the global activities of commerce and technological exchange, also face problems of production and renovation -> [' countries face problems']



 The entry of these two countries into the United Nations has taken this Organization one step closer to its goal of universality -> [' entry take Organization']



 The two United Nations Development Decades, one of the 1960s and the other of the 1970s, and a series of protracted negotiations, have proved sterile exercises, belying the hopes that had been raised that inequity between nations need not be an inexorable law and that, for reasons as much economic as ethical, the rich should assist the poor -> [' Decades prove sterile exercises']





## Rule on Prepositions

Prepositions tell us how one thing relates to another, often in space or time. For example, in the sentence “The people of India believe in the principles of the United Nations”, the preposition “of” links “people” to “India” and “principles” to “the United Nations”. Clearly, extracting phrases that include prepositions will give us a lot of information from the sentence.

In [18]:
def rule3(text):
    doc = nlp(text)
    sent = []
    
    for token in doc:
        if token.pos_=='ADP':
            phrase = ''
            if token.head.pos_=='NOUN':
                phrase += token.head.text
                phrase += ' '+token.text
                for right_tok in token.rights:
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                        phrase += ' '+right_tok.text
                
                if len(phrase)>2:
                    sent.append(phrase)
                
    return sent

In [19]:
for sent in df2['Sent'].sample(10, random_state=19).values:
    rule = rule3(sent)
    if rule != []:
        print(sent, '->', rule)
        print('\n\n')

 There is renewed awareness of the continued relevance of his message of non violence and tolerance -> ['awareness of relevance', 'relevance of message', 'message of violence']



 Industrialized countries with planned economies, which do not formally belong to the international monetary system but participate in the global activities of commerce and technological exchange, also face problems of production and renovation -> ['countries with economies', 'activities of exchange', 'problems of production']



 The entry of these two countries into the United Nations has taken this Organization one step closer to its goal of universality -> ['entry of countries', 'entry into Nations', 'goal of universality']



 There can be no durable peace in West Asia without a just and comprehensive settlement, based on the realization by the Palestinian people of their inalienable right to self determination and the recognition of the rights of all States in the region, including Palestine and Israel,

## Modified Rule

In [20]:
def rule0(doc, index):
    # Prefix the token at `index` with its compound and adjectival modifiers.
    # The parsed doc is passed in so the text is not parsed again.
    token = doc[index]
    entity = ''

    for sub_tok in token.children:
        if sub_tok.dep_ in ['compound','amod']:
            entity += sub_tok.text + ' '

    return entity + token.text

def rule3_mod(text):
    doc = nlp(text)
    sent = []

    for token in doc:
        # Prepositions whose head is a noun, e.g. "awareness of ..."
        if (token.pos_ == 'ADP') and (token.head.pos_ == 'NOUN'):
            phrase = rule0(doc, token.head.i) + ' ' + token.text

            # Nouns to the right of the preposition, with their modifiers
            for right_tok in token.rights:
                if right_tok.pos_ in ['NOUN','PROPN']:
                    phrase += ' ' + rule0(doc, right_tok.i)

            sent.append(phrase)

    return sent

In [21]:
for sent in df2['Sent'].sample(10, random_state=19).values:
    rule = rule3_mod(sent)
    if rule != []:
        print(sent, '->', rule)
        print('\n\n')

 There is renewed awareness of the continued relevance of his message of non violence and tolerance -> ['renewed awareness of continued relevance', 'continued relevance of message', 'message of non violence']



 Industrialized countries with planned economies, which do not formally belong to the international monetary system but participate in the global activities of commerce and technological exchange, also face problems of production and renovation -> ['Industrialized countries with planned economies', 'global activities of exchange', 'problems of production']



 The entry of these two countries into the United Nations has taken this Organization one step closer to its goal of universality -> ['entry of countries', 'entry into United Nations', 'goal of universality']



 There can be no durable peace in West Asia without a just and comprehensive settlement, based on the realization by the Palestinian people of their inalienable right to self determination and the recognition of th