## This tutorial shows rule-based information extraction using Hearst patterns (introduced by Marti Hearst)

### Terms used
- Triples - A representation of two entities and the relation between them, for example (Usain Bolt, won, Gold Medal).
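
A triple can be held in a small immutable record. The sketch below is one possible representation; the `Triple` type and its field names are illustrative assumptions, not something the rest of this notebook relies on.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject, relation, object) fact; the field names are illustrative."""
    subject: str
    relation: str
    obj: str


# The example triple from the definition above.
fact = Triple(subject="Usain Bolt", relation="won", obj="Gold Medal")
print(fact.subject, "|", fact.relation, "|", fact.obj)
```

Because a `NamedTuple` is still a tuple, it compares equal to the plain `(x, relation, y)` form used elsewhere.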

### Approaches to perform information extraction automatically

- Rule-Based Approach - We define a set of grammatical rules and use them to extract information from text.
- Supervised - A model is trained on labeled data and is then used to extract entities and the relations between them.
- Unsupervised - When we don't have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns, which are then used to extract more relations from the text.
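
To make the seed-based idea more concrete, here is a minimal sketch that derives candidate patterns from seed pairs by collecting the text found between the two entities. The function name `induce_patterns` and the plain substring search are assumptions made for illustration; a real system would work on tokens and score the patterns before using them.

```python
def induce_patterns(seeds, sentences):
    """Collect the text found between each seed pair as a candidate pattern."""
    patterns = set()
    for x, y in seeds:
        for sentence in sentences:
            start = sentence.find(x)
            end = sentence.find(y)
            # Keep only sentences that mention both entities, with x before y.
            if start != -1 and end != -1 and start + len(x) <= end:
                patterns.add(sentence[start + len(x):end].strip())
    return patterns


seeds = [("Salesforce", "Tableau")]
sentences = ["Salesforce recently acquired Tableau.",
             "Uber acquired Careem."]
print(induce_patterns(seeds, sentences))
```

Only the first sentence contains the seed pair, so it alone contributes a pattern; the second sentence is the kind of text such a pattern could later be applied to.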


In [1]:
import warnings
warnings.filterwarnings('ignore')  # suppress warnings in the notebook output

  


In [2]:
import re
import string
import nltk
import spacy
import pandas as pd
import numpy as np
import math

from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [3]:
nlp = spacy.load('en_core_web_sm')

In [40]:
def create_document(text):
    doc = nlp(text)
    return doc

In [39]:
def show_token_depedency_tags_and_pos_tags(doc):
    for token in doc:
        print(token.text,'-',token.dep_,'-',token.pos_)

In [44]:
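def apply_pattern_in_document(doc,pattern):
    matcher = Matcher(nlp.vocab)
    # spaCy v2 signature; on spaCy v3 use matcher.add('pattern_1', [pattern])
    matcher.add('pattern_1',None,pattern)
    matches = matcher(doc)   # list of (match_id, start, end) tuples
    if not matches:
        print('No match found')
        return
    matched_span = doc[matches[0][1]:matches[0][2]]   # first match only
    print(matched_span.text)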
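
`apply_pattern_in_document` prints only `matches[0]`. When a pattern contains optional tokens, the matcher can return several overlapping spans, so it may be useful to pick the longest one explicitly. The sketch below works on the plain `(match_id, start, end)` tuples the matcher returns; the helper name `longest_match` and the sample tuples are hypothetical.

```python
def longest_match(matches):
    """Return the (match_id, start, end) tuple covering the most tokens, or None."""
    if not matches:
        return None
    return max(matches, key=lambda m: m[2] - m[1])


# Hypothetical matcher output: two overlapping spans ending at the same token.
matches = [(1, 3, 7), (1, 2, 7)]
print(longest_match(matches))
```

The selected tuple can then be sliced out of the document with `doc[start:end]`, as the function above does.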

### Pattern: X such as Y

In [45]:
text = 'GDP in developing countries such as Vietnam will continue growing at a high rate.'

In [46]:
doc = create_document(text)
show_token_depedency_tags_and_pos_tags(doc)

GDP - nsubj - NOUN
in - prep - ADP
developing - amod - VERB
countries - pobj - NOUN
such - amod - ADJ
as - prep - ADP
Vietnam - pobj - PROPN
will - aux - VERB
continue - ROOT - VERB
growing - xcomp - VERB
at - prep - ADP
a - det - DET
high - amod - ADJ
rate - pobj - NOUN
. - punct - PUNCT


In [47]:
# 'OP': '?' makes the token optional (it may match zero or one time)
pattern = [{'DEP':'amod', 'OP':'?'},
           {'POS':'NOUN'},
           {'LOWER':'such'},
           {'LOWER':'as'},
           {'POS':'PROPN'}]

In [48]:
apply_pattern_in_document(doc,pattern)

developing countries such as Vietnam
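
One way to read the `'OP': '?'` quantifier is that a single pattern stands for several fixed token sequences, one with the optional token and one without. The sketch below expands a pattern into those variants in plain Python. `expand_optional` is a hypothetical helper written for illustration; it handles only `'?'` and does not reproduce how spaCy matches internally.

```python
from itertools import product


def expand_optional(pattern):
    """List every fixed token sequence a pattern with 'OP': '?' can stand for."""
    choices = []
    for spec in pattern:
        required = {k: v for k, v in spec.items() if k != 'OP'}
        if spec.get('OP') == '?':
            choices.append([None, required])   # token may be absent or present
        else:
            choices.append([required])
    return [[tok for tok in combo if tok is not None]
            for combo in product(*choices)]


pattern = [{'DEP': 'amod', 'OP': '?'},
           {'POS': 'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

for variant in expand_optional(pattern):
    print(variant)
```

For the sentence above, the variant without the modifier corresponds to "countries such as Vietnam" and the variant with it to "developing countries such as Vietnam".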


### Pattern: X and/or other Y

In [53]:
text = "Here is how you can keep your car or other vehicles clean."

In [54]:
doc = create_document(text)
show_token_depedency_tags_and_pos_tags(doc)

Here - advmod - ADV
is - ROOT - VERB
how - advmod - ADV
you - nsubj - PRON
can - aux - VERB
keep - ccomp - VERB
your - poss - DET
car - dobj - NOUN
or - cc - CCONJ
other - amod - ADJ
vehicles - conj - NOUN
clean - oprd - ADJ
. - punct - PUNCT


In [55]:
# 'and' and 'or' are alternatives, so match exactly one of them
pattern = [{'DEP':'amod', 'OP':"?"}, 
           {'POS':'NOUN'}, 
           {'LOWER': {'IN': ['and', 'or']}}, 
           {'LOWER': 'other'}, 
           {'POS': 'NOUN'}] 

In [56]:
apply_pattern_in_document(doc,pattern)

car or other vehicles


### Pattern: X, including Y

In [57]:
text = "Eight people, including two children, were injured in the explosion"

In [58]:
doc = create_document(text)
show_token_depedency_tags_and_pos_tags(doc)

Eight - nummod - NUM
people - nsubjpass - NOUN
, - punct - PUNCT
including - prep - VERB
two - nummod - NUM
children - pobj - NOUN
, - punct - PUNCT
were - auxpass - VERB
injured - ROOT - VERB
in - prep - ADP
the - det - DET
explosion - pobj - NOUN


In [59]:
pattern = [{'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}, 
           {'IS_PUNCT': True}, 
           {'LOWER': 'including'}, 
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}] 

In [60]:
apply_pattern_in_document(doc,pattern)

Eight people, including two children


### Pattern: X, especially Y

In [61]:
text = "A healthy eating pattern includes fruits, especially whole fruits."

In [63]:
doc = create_document(text)
show_token_depedency_tags_and_pos_tags(doc)

A - det - DET
healthy - amod - ADJ
eating - compound - NOUN
pattern - nsubj - NOUN
includes - ROOT - VERB
fruits - dobj - NOUN
, - punct - PUNCT
especially - advmod - ADV
whole - amod - ADJ
fruits - appos - NOUN
. - punct - PUNCT


In [64]:
pattern = [{'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}, 
           {'IS_PUNCT':True}, 
           {'LOWER': 'especially'}, 
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}]

In [65]:
apply_pattern_in_document(doc,pattern)

fruits, especially whole fruits
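
The matched spans are still plain text. To turn them into triples, the span can be split on the connective, keeping track of which side holds the more general term. The sketch below is a simplified string-based version: the names `GENERAL_FIRST`, `GENERAL_LAST` and `span_to_triple`, and the relation label `is_a`, are assumptions made for illustration.

```python
import re

# Connectives where the general term comes first: "X such as Y", "X, including Y".
GENERAL_FIRST = re.compile(r",?\s+(?:such as|including|especially)\s+")
# Connectives where the general term comes last: "X or other Y".
GENERAL_LAST = re.compile(r"\s+(?:and|or)\s+other\s+")


def span_to_triple(span_text):
    """Turn a matched span into a (specific, 'is_a', general) triple, or None."""
    parts = GENERAL_FIRST.split(span_text, maxsplit=1)
    if len(parts) == 2:
        general, specific = parts
        return (specific, 'is_a', general)
    parts = GENERAL_LAST.split(span_text, maxsplit=1)
    if len(parts) == 2:
        specific, general = parts
        return (specific, 'is_a', general)
    return None


for span_text in ["developing countries such as Vietnam",
                  "car or other vehicles",
                  "Eight people, including two children",
                  "fruits, especially whole fruits"]:
    print(span_to_triple(span_text))
```

Numerals and modifiers such as "Eight" or "two" are kept as-is here; stripping them would be a separate clean-up step.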


## Subtree matching for relation extraction

In [76]:
def create_dependency_tree(doc):
    displacy.render(doc, style='dep',jupyter=True)

In [82]:
def get_entities(doc):
    # A passive sentence has a passive subject (nsubjpass / csubjpass)
    is_passive = any('subjpass' in token.dep_ for token in doc)

    x = ''
    y = ''

    # When several tokens qualify, the last one in the sentence wins
    if is_passive:                            # for passive sentence
        for token in doc:
            if 'subjpass' in token.dep_:
                y = token.text                # grammatical subject is acted upon
            if token.dep_.endswith('obj'):
                x = token.text                # object of "by" is the acting entity
    else:                                    # for active sentence
        for token in doc:
            if token.dep_.endswith('subj'):
                x = token.text
            if token.dep_.endswith('obj'):
                y = token.text
    return x,y

In [83]:
text = "Tableau was recently acquired by Salesforce." 
doc = create_document(text)
create_dependency_tree(doc)
get_entities(doc)

('Salesforce', 'Tableau')

In [84]:
text = "Careem, a ride-hailing major in the middle east, was acquired by Uber."
doc = create_document(text)
create_dependency_tree(doc)
get_entities(doc)

('Uber', 'Careem')

In [85]:
text = "Salesforce recently acquired Tableau."
doc = create_document(text)
create_dependency_tree(doc)
get_entities(doc)

('Salesforce', 'Tableau')
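
`get_entities` returns only the two entities. To complete the triple, the relation can be read from the `ROOT` token of the parse. The sketch below shows this on plain `(text, dep)` pairs so that it runs without a model. `build_triple` is a hypothetical helper, and the dependency labels in the two token lists are assumed to be what the parser would produce for these sentences; they were not copied from notebook output.

```python
def build_triple(tokens):
    """tokens: list of (text, dep) pairs; returns (subject, relation, object)."""
    passive = any('subjpass' in dep for _, dep in tokens)
    # Use the ROOT token (usually the main verb) as the relation.
    relation = next((text for text, dep in tokens if dep == 'ROOT'), '')
    first = second = ''
    for text, dep in tokens:
        if dep.endswith('subjpass' if passive else 'subj'):
            first = text
        elif dep.endswith('obj'):
            second = text
    # In a passive sentence the grammatical subject is the thing acted upon.
    return (second, relation, first) if passive else (first, relation, second)


# Assumed parses for the two sentences used above.
passive_tokens = [('Tableau', 'nsubjpass'), ('was', 'auxpass'),
                  ('recently', 'advmod'), ('acquired', 'ROOT'),
                  ('by', 'agent'), ('Salesforce', 'pobj'), ('.', 'punct')]
active_tokens = [('Salesforce', 'nsubj'), ('recently', 'advmod'),
                 ('acquired', 'ROOT'), ('Tableau', 'dobj'), ('.', 'punct')]

print(build_triple(passive_tokens))
print(build_triple(active_tokens))
```

Under these assumed parses, both the active and the passive sentence yield the same triple.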