# spaCy Hearst Patterns
---

In this experiment we test the utility of Hearst Patterns for detecting the ingroup and outgroup of a text.

For this experiment the spaCy Matcher is used, with code adapted from: https://github.com/mmichelsonIF/hearst_patterns_python/blob/master/hearstPatterns/hearstPatterns.py

Hypernym relations are semantic relationships between two concepts: C1 is a hypernym of C2 if C1 categorizes C2 (e.g. “instrument” is a hypernym of “piano”). For this research, the phrase "America has enemies, such as Al Qaeda and the Taliban" would return `[('Al Qaeda', 'enemy'), ('the Taliban', 'enemy')]`. In this example, the categorising term 'enemy' is a hypernym of both 'Al Qaeda' and 'the Taliban'; conversely, 'Al Qaeda' and 'the Taliban' are hyponyms of 'enemy'. Using this technique, hypernym terms could be classified as ingroup or outgroup, and the named entities identified as their hyponyms could then be assigned to the corresponding group.
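
To make the idea concrete, the following is a minimal sketch of a single "such as" pattern using only the standard `re` module. It is not the adapted code used in this notebook: the pattern, the function name `find_such_as` and the splitting rule are assumptions made for illustration, and no lemmatisation is applied, so the hypernym is returned as 'enemies' rather than 'enemy'.

```python
import re

# Illustrative pattern only: "<hypernym>, such as <hyponym>, <hyponym> and <hyponym>"
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")

# Separators inside the enumeration: commas (optionally followed by and/or)
# or a bare "and" / "or" between two items.
SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def find_such_as(sentence):
    """Return (hyponym, hypernym) pairs for the first 'such as' pattern found."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    hypernym = match.group(1)
    hyponyms = SEPARATOR.split(match.group(2))
    return [(h.strip(), hypernym) for h in hyponyms if h.strip()]


pairs = find_such_as("America has enemies, such as Al Qaeda and the Taliban.")
print(pairs)
```

A regular expression of this kind only sees the single word before "such as", which is one reason for working with noun chunks from a parser instead.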

## Set Up the spaCy Pipeline

In [1]:
%%time

import spacy

nlp = spacy.load("en_core_web_md")

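# keep only the components needed for Hearst Pattern detection
for component in nlp.pipe_names:
    if component not in ["tagger", "parser", "ner"]:
        nlp.remove_pipe(component)

# merge each named entity into a single token (spaCy v2 API)
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)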

Wall time: 26.4 s


## Create Dataset of Political Speeches from George Bush, Osama bin Laden and Martin Luther King

In [2]:
%%time

import os
import importlib
import cndobjects
importlib.reload(cndobjects)


dirpath = r'C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset'

orators = cndobjects.Dataset(dirpath)

orators.summarise()
#orators["bush"].summarise()

Wall time: 1.24 s


| Ref | Name | Text Count | Word Count | File Size |
|---|---|---|---|---|
| hitler | Adolf Hitler | 0.0 | 0.0 | 56.0 |
| bush | George Bush | 14.0 | 143936.0 | 56.0 |
| tolstoy | Leo Tolstoy | 0.0 | 0.0 | 56.0 |
| king | Martin Luther King | 5.0 | 122815.0 | 56.0 |
| laden | Osama bin Laden | 5.0 | 77440.0 | 56.0 |
| Totals | | 24.0 | 344191.0 | 280.0 |


In [133]:
%%time
from IPython.display import clear_output
import importlib
import pandas as pd
import hpspacy, hpregex
from tqdm import tqdm
importlib.reload(hpspacy)
importlib.reload(hpregex)


class hp_Analysis(object):
    
    """
    Class for quantitatively analysing the Hearst Pattern methods.
    
    Iterates over each Text() object within each Orator() of the Dataset() and records
    the number of hyponym matches for each technique.
    
    tested techniques are:
    - regex patterns
    - spacy patterns (bronze): uses stopword list for noun chunking
    - spacy patterns (silver): uses modifier dependency for noun_chunking
    - spacy patterns (gold): uses stopword list compiled from pattern rule names
    
    input: Dataset Object
    output: dataframe in the following format:
    
            | orator          | ... |
    regex   | hyponyms count  | ... |
    bronze  | hyponyms count  | ... |
    silver  | hyponyms count  | ... |
    gold    | hyponyms count  | ... |
    """
    
    def __init__(self, data = None):
        
        if isinstance(data, cndobjects.Dataset):
            self.orators = data
            self.results = {"regex" : [], "bronze" : [], "silver" : [], "gold" : []}
            self.orator_list = None # display names of the Orator() objects containing Texts()
            self.hyponym_list = None # dict of hyponyms for each Orator(), keyed by ref then method
            self.quant = None
            
            self.summarise()
        
        else:
            print("input not of type Dataset")

    def summarise(self):
        
        self.orator_list = [orator.name.title() for orator in self.orators if len(orator) > 0] 
        self.hyponym_list = {orator.ref : {method : [] for method in self.results} for orator in self.orators if len(orator) > 0}
        
        # initialise regex Hearst Pattern detection
        hr = hpregex.HearstPatterns(nlp, extended=True, merge = False)
        
        #iterate over each orator in the dataset
        for orator in self.orators:

            # dict for storing the results
            results_count = {key: 0 for key in self.results.keys()}

            # check for Orator() objects containing no speeches
            if len(orator) > 0:
                print(f'analysing: {orator.name}')

                for text in tqdm(orator):

                    # get results for regex analysis
                    try:
                        hyps = hr.find_hyponyms(str(text))
                        
                        self.hyponym_list[orator.ref]["regex"].extend(hyps)
                        
                        results_count["regex"] += len(hyps)
                    except Exception:
                        print("not analysed (regex): ", text.reference)

                    # get results for each of the spaCy techniques
                    for grade in list(self.results.keys())[1:]:

                        try:
                            h = hpspacy.hearst_patterns(nlp, extended = True, predicatematch = grade)
                            hyps = h.find_hyponyms(str(text))
                            self.hyponym_list[orator.ref][grade].extend(hyps)
                            results_count[grade] += len(hyps)
                        except Exception:
                            print(f"not analysed ({grade}): {text.reference}")

                for key, value in results_count.items():
                    self.results[key].append(value)
                    
                clear_output(wait=True)
                
        self.quant = pd.DataFrame(self.results, index = self.orator_list)
        

output = hp_Analysis(orators)

Wall time: 1min 57s
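
The tallying step can also be illustrated in isolation. The sketch below is an assumption about how the per-method counts might be derived from a structure shaped like `hyponym_list`; the function name `count_matches` and the miniature data are hypothetical. It also reports unique pairs, since the same pair can be matched more than once across an orator's texts.

```python
# Hypothetical miniature of hyponym_list: orator ref -> method -> list of pairs.
hyponym_list = {
    "bush": {
        "regex": [
            ("american citizen", "all foreign national"),
            ("american citizen", "all foreign national"),
            ("courage", "value"),
        ],
        "gold": [("courage", "value")],
    },
}


def count_matches(hyponym_list):
    """Return {method: {orator: {"total": n, "unique": m}}}."""
    counts = {}
    for orator, methods in hyponym_list.items():
        for method, pairs in methods.items():
            counts.setdefault(method, {})[orator] = {
                "total": len(pairs),        # every match, duplicates included
                "unique": len(set(pairs)),  # distinct (hyponym, hypernym) pairs
            }
    return counts


counts = count_matches(hyponym_list)
print(counts)
```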


In [135]:
display(output.hyponym_list)

{'bush': {'regex': [('who', 'citizen'),
   ('who', 'police officer'),
   ('an exceptional man', 'passenger'),
   ('alQaeda', 'loosely affiliate terrorist organization'),
   ('woman', 'civilian'),
   ('child', 'civilian'),
   ('theEgyptianIslamicJihad', 'country'),
   ('theIslamicMovementofUzbekistan', 'country'),
   ('Afghanistan', 'place'),
   ('american citizen', 'all foreign national'),
   ('Egypt', 'muslim country'),
   ('SaudiArabia', 'muslim country'),
   ('Jordan', 'muslim country'),
   ('the will', 'every value'),
   ('theUnitedStates', 'a hostile regime'),
   ('terrorism', 'a threat'),
   ('Afghanistan', 'a terrorist base'),
   ('Canada', 'close friend'),
   ('Australia', 'close friend'),
   ('Germany', 'close friend'),
   ('France', 'close friend'),
   ('force', 'the operation'),
   ('american citizen', 'all foreign national'),
   ('none', 'demand'),
   ('cave', 'entrenched hide place'),
   ('goodness', 'evil'),
   ('courage', 'value'),
   ('honor', 'value'),
   ('people', 'g

In [182]:
%%time
import importlib
import pandas as pd
import hpspacy, hpregex

importlib.reload(hpspacy)

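# `TEXTS` from the original cell is not defined in this notebook; as an assumption,
# the orator references are taken from the Dataset instead (orators with no texts are skipped)
orator_list = [orator.ref for orator in orators if len(orator) > 0]

# the spaCy techniques under comparison
grades = ["bronze", "silver", "gold"]

# one row per technique, one column per orator
df = pd.DataFrame(index = grades)

for orator in orator_list:
    results = []
    
    for grade in grades:
        h = hpspacy.hearst_patterns(nlp, extended = True, predicatematch = grade)
        result = len(h.find_hyponyms(str(orators[orator])))
        results.append(result)
    
    df[orator] = results

display(df)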

| Method | bush | king | laden |
|---|---|---|---|
| bronze | 49 | 11 | 42 |
| silver | 41 | 8 | 26 |
| gold | 49 | 12 | 44 |


Wall time: 1min 43s


In [175]:
docs = ["There are works by such authors as Herrick, Goldsmith, and Shakespeare.",
        "There were bruises, lacerations, or other injuries were not prevalent.",
        "common law countries, including Canada, Australia, and England enjoy toast.",
        "Many countries, especially France, England and Spain also enjoy toast.",
        "There are such benefits as postharvest losses reduction, food increase and soil fertility improvement."
       ]

for doc in docs:
    print(h.find_hyponyms(doc))
    print('----------')


[('Herrick', 'author', 'such'), ('Goldsmith', 'author', 'such'), ('Shakespeare', 'author', 'such')]
----------
[('laceration', 'injury', 'other'), ('bruise', 'injury', 'other')]
----------
[('Canada', 'common law countrie', 'include'), ('Australia', 'common law countrie', 'include'), ('England', 'common law countrie', 'include')]
----------
[('France', 'many countrie', 'especially'), ('England', 'many countrie', 'especially'), ('Spain', 'many countrie', 'especially')]
----------
[('postharvest losses reduction', 'benefit', 'such'), ('food increase', 'benefit', 'such'), ('soil fertility improvement', 'benefit', 'such')]
----------
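
The hypernym 'many countrie' in the output above still carries the quantifier 'many'. One way such modifiers might be removed is with a small stopword list applied to the front of the noun chunk. The sketch below is an assumption for illustration only: the list `MODIFIERS` and the function `strip_modifiers` are hypothetical and are not the stopword list used by `hpspacy`.

```python
# Hypothetical modifier list; the stopword list used by hpspacy is not shown here.
MODIFIERS = {"many", "other", "some", "such", "all", "every", "a", "an", "the"}


def strip_modifiers(chunk):
    """Drop leading modifier words from a noun chunk, always keeping at least one word."""
    words = chunk.split()
    while len(words) > 1 and words[0].lower() in MODIFIERS:
        words.pop(0)
    return " ".join(words)


chunks = ["many countries", "all foreign national", "a hostile regime", "the"]
cleaned = [strip_modifiers(chunk) for chunk in chunks]
print(cleaned)
```

A fixed list like this cannot recognise modifiers it has not seen, which is where a dependency-based approach may generalise better.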


## Initial Test of Hearst Pattern Detection Object

The first sentence contains a 'first' relationship, where the hypernym precedes its hyponyms.

The second sentence contains both a 'first' and a 'last' relationship; in a 'last' relationship the hypernym follows its hyponyms.
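
The difference between the two relationship types can be sketched with two simplified regular expressions. This is an illustration under stated assumptions rather than the matcher used in this notebook: the names `FIRST`, `LAST` and `find_relations` are hypothetical, the input is assumed to be a short phrase rather than a full sentence, and only one relationship per phrase is handled.

```python
import re

# 'first': the hypernym comes before the cue word and the hyponyms follow it.
FIRST = re.compile(
    r"^(?P<hyper>.+?), (?:especially|particularly|such as|including) (?P<hypos>.+)$"
)
# 'last': the hyponyms come first and the hypernym follows "and other" / "or other".
LAST = re.compile(r"^(?P<hypos>.+?),? (?:and|or) other (?P<hyper>.+)$")

SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def find_relations(phrase):
    """Return (hyponym, hypernym, position) triples for a single-pattern phrase."""
    for position, pattern in (("first", FIRST), ("last", LAST)):
        match = pattern.match(phrase)
        if match:
            hypernym = match.group("hyper")
            hyponyms = [h.strip() for h in SEPARATOR.split(match.group("hypos")) if h.strip()]
            return [(hyponym, hypernym, position) for hyponym in hyponyms]
    return []


first = find_relations("terrorist groups, especially the Taliban and al Qaeda")
last = find_relations("the Taliban, al Qaeda and other terrorist groups")
print(first)
print(last)
```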

In [176]:
docs = [
    "We are hunting for terrorist groups, particularly the Taliban and al Qaeda",
    "We are hunting for the IRA, ISIS, al Qaeda and some other terrorist groups, especially the Taliban, Web Scientists and particularly Southampton University"
]

def show_hyps(o):
    
    for i, text in enumerate(o):
        print(i, "#####")
        print(h.find_hyponyms(text))
        print('-----')

show_hyps(docs)

0 #####
[('the Taliban', 'terrorist group', 'particularly'), ('al Qaeda', 'terrorist group', 'particularly')]
-----
1 #####
[('al Qaeda', 'terrorist group', 'some_other'), ('the IRA', 'terrorist group', 'some_other'), ('ISIS', 'terrorist group', 'some_other'), ('the Taliban', 'terrorist group', 'especially'), ('web scientist', 'terrorist group', 'especially'), ('Southampton University', 'terrorist group', 'especially')]
-----


## Test with a Larger Number of Sentences

In [8]:
%%time

# create a list of docs
docs = [
    "Forty-four percent of patients with uveitis had one or more identifiable signs or symptoms, such as red eye, ocular pain, visual acuity, or photophobia, in order of decreasing frequency.",
    "Other close friends, including Canada, Australia, Germany and France, have pledged forces as the operation unfolds.",
    "The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.",
    "Terrorist groups like al Qaeda depend upon the aid or indifference of governments.",
    "This new law that I sign today will allow surveillance of all communications used by terrorists, including e-mails, the Internet, and cell phones.",
    "From this day forward, any nation that continues to harbor or support terrorism will be regarded by the United States as a hostile regime.",
    "We are looking out for the Taliban, al Qaeda and other terrorist groups",
    "We are looking out for al Qaeda and other terrorist groups, especially the Taliban and the muppets"
]

for doc in docs:
    print(h.find_hyponyms(doc))

[('red eye', 'symptom', 'such_as'), ('ocular pain', 'symptom', 'such_as'), ('visual acuity', 'symptom', 'such_as'), ('photophobia', 'symptom', 'such_as')]
[('Canada', 'close friend', 'include'), ('Australia', 'close friend', 'include'), ('Germany', 'close friend', 'include'), ('France', 'close friend', 'include')]
[('al Qaeda', 'loosely affiliated terrorist organization', 'know_as')]
[('al Qaeda', 'terrorist group', 'like')]
[('e-mail', 'terrorist', 'include'), ('the internet', 'terrorist', 'include'), ('cell phone', 'terrorist', 'include')]
[]
[('al Qaeda', 'terrorist group', 'other'), ('the Taliban', 'terrorist group', 'other')]
[('al Qaeda', 'terrorist group', 'other'), ('the Taliban', 'terrorist group', 'especially'), ('the muppet', 'terrorist group', 'especially')]
Wall time: 220 ms
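
Triples in the form returned above could then be grouped by hypernym and assigned to an ingroup or outgroup. The following is a minimal sketch of that step, assuming a hand-written seed lexicon; `SEED_LABELS` and `label_entities` are hypothetical names, and the labels themselves are illustrative rather than results of this experiment.

```python
from collections import defaultdict

# Triples in the (hyponym, hypernym, pattern) form shown in the output above.
triples = [
    ("Canada", "close friend", "include"),
    ("Australia", "close friend", "include"),
    ("al Qaeda", "terrorist group", "like"),
    ("the Taliban", "terrorist group", "other"),
    ("red eye", "symptom", "such_as"),
]

# Hypothetical seed lexicon mapping hypernyms to a group label.
SEED_LABELS = {"close friend": "ingroup", "terrorist group": "outgroup"}


def label_entities(triples, seed_labels):
    """Collect hyponyms under the label of their hypernym.

    Hyponyms whose hypernym is not in the lexicon are filed under 'unlabelled'.
    """
    groups = defaultdict(set)
    for hyponym, hypernym, _pattern in triples:
        groups[seed_labels.get(hypernym, "unlabelled")].add(hyponym)
    return {label: sorted(members) for label, members in groups.items()}


labels = label_entities(triples, SEED_LABELS)
print(labels)
```

Because the label is inherited from the hypernym, a false match such as ('e-mail', 'terrorist') would be mislabelled along with the correct ones, so the quality of the extraction step matters for any downstream grouping.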


## Test with a Full Speech

In [12]:
import os
import json

dirpath = os.getcwd()
file = "first_docs.json"

with open(os.path.join(dirpath, file), "r") as f:
    first_docs = json.load(f)

# each entry holds the pattern name at index 0 and the test sentence at index 1
for doc in first_docs:
    hyponyms = h.find_hyponyms(doc[1])
    #if len(hyponyms[1:]) != 3:
    print(doc[1])
    print(doc[0], '=>', hyponyms)
    print('----------')

we are looking for terrorist groups, such as the Taliban, al Qeada and Southampton University
such_as => [('the Taliban', 'terrorist group', 'such_as'), ('al Qeada', 'terrorist group', 'such_as'), ('Southampton University', 'terrorist group', 'such_as')]
----------
we are looking for terrorist groups, known as the Taliban, al Qeada and Southampton University
known_as => [('the Taliban', 'terrorist group', 'know_as'), ('al Qeada', 'terrorist group', 'know_as'), ('Southampton University', 'terrorist group', 'know_as')]
----------
we are looking for terrorist groups, including the Taliban, al Qeada and Southampton University
including => [('the Taliban', 'terrorist group', 'include'), ('al Qeada', 'terrorist group', 'include'), ('Southampton University', 'terrorist group', 'include')]
----------
we are looking for terrorist groups, especially the Taliban, al Qeada and Southampton University
especially => [('the Taliban', 'terrorist group', 'especially'), ('al Qeada', 'terrorist group', 'e

In [13]:
%%time
dirpath = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\Leo Tolstoy"
file = "warandpeace_testdata.json"

with open(os.path.join(dirpath, file), "r") as f:
    docs = json.load(f)
    
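# each entry holds the pattern name at index 0, a True/False label at index 1
# and the sentence at index 2
for doc in docs: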
    hyponyms = h.find_hyponyms(doc[2])
    #if len(hyponyms[1:]) != 3:
    print(doc[2])
    print(doc[1])
    print(doc[0], '=>', hyponyms)
    print('----------')

The younger ones occupied themselves as before, some playing cards (there was plenty of money, though there was no food), some with more innocent games, such as quoits and skittles
True
such_as => [('quoits', 'more innocent game', 'such_as'), ('skittle', 'more innocent game', 'such_as')]
----------
The trench itself was the room, in which the lucky ones, such as the squadron commander, had a board, lying on piles at the end opposite the entrance, to serve as a table.
True
such_as => [('the squadron commander', 'the lucky one', 'such_as')]
----------
Through the hard century-old bark, even where there were no twigs, leaves had sprouted such as one could hardly believe the old veteran could have produced.
False
such_as => []
----------
Religion alone can explain to us what without its help man cannot comprehend: why, for what cause, kind and noble beings able to find happiness in life—not merely harming no one but necessary to the happiness of others—are called away to God, while cruel, 