# spaCy Hearst Patterns
---

In this experiment we test the utility of Hearst Patterns for detecting the ingroup and outgroup of a text.

This experiment uses the spaCy `Matcher`, with code adapted from: https://github.com/mmichelsonIF/hearst_patterns_python/blob/master/hearstPatterns/hearstPatterns.py

Hypernym relations are semantic relationships between two concepts: C1 is a hypernym of C2 if C1 categorises C2 (e.g. "instrument" is a hypernym of "piano"). For this research, the phrase "America has enemies, such as Al Qaeda and the Taliban" would return `[('Al Qaeda', 'enemy'), ('the Taliban', 'enemy')]`. In this example, the categorising term 'enemy' is a hypernym of both 'Al Qaeda' and 'the Taliban'; conversely, 'Al Qaeda' and 'the Taliban' are hyponyms of 'enemy'. Using this technique, hypernym terms could be classified as ingroup or outgroup, and the named entities identified as their hyponyms could then be assigned to the same group.
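
As a rough illustration of the idea, the sketch below pulls (hyponym, hypernym) pairs out of a single "X, such as A and B" construction with a plain regular expression. It is a toy under stated assumptions, not the adapted hearstPatterns code or the spaCy matcher used in this notebook: the name `such_as_pairs` is hypothetical, the hypernym is assumed to be the single word before "such as", and nothing is lemmatised (so it returns 'enemies' rather than 'enemy').

```python
import re

# Toy pattern (an assumption for illustration): "<hypernym>, such as <hyponym list>"
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")

def such_as_pairs(sentence):
    """Return (hyponym, hypernym) pairs for one 'such as' construction."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    hypernym, tail = match.groups()
    # split the enumeration on commas and on the coordinators 'and' / 'or'
    hyponyms = re.split(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+", tail)
    return [(h.strip(), hypernym) for h in hyponyms if h.strip()]

pairs = such_as_pairs("America has enemies, such as Al Qaeda and the Taliban.")
print(pairs)
# [('Al Qaeda', 'enemies'), ('the Taliban', 'enemies')]
```

A single-word hypernym is the main limitation here; the spaCy approach in the cells below works on merged noun chunks instead, which is why it can return multi-word hypernyms such as "terrorist groups".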

# Notes for next time

re-add removal of "DET" stop words

add "DET" to patterns

add is_hypernym and has_hyponyms to token

add is_hyponym and has_hypernym to token

review test data for ingroups and outgroups with displacy

begin to look at how to create dependency rules, probably only possible with Dependency Matcher

## Set Up the spaCy Pipeline

In [1]:
%%time
import importlib
import pipeline
importlib.reload(pipeline)

cnd = pipeline.CND(merge = True)

print(cnd.nlp.meta['name'])
print(cnd.nlp.pipe_names)

core_web_md
['tagger', 'parser', 'ner', 'Named Entity Matcher', 'merge_entities', 'Concept Matcher', 'merge_custom_chunks', 'hearst pattern matcher']
Wall time: 19.6 s


## Create Dataset of Political Speeches from George Bush, Osama bin Laden and Martin Luther King

In [2]:
%%time
import os
import importlib
import cndobjects
importlib.reload(cndobjects)

dirpath = r'C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset'

orators = cndobjects.Dataset(cnd, dirpath)

parsing:  bush (2001-09-11) 911 Address to the Nation
parsing:  bush (2001-09-14) Remarks at the National Day of Prayer & Remembrance Service
parsing:  bush (2001-09-15) First Radio Address following 911
parsing:  bush (2001-09-17) Address at Islamic Center of Washington, D.C.
parsing:  bush (2001-09-20) Address to Joint Session of Congress Following 911 Attacks
parsing:  bush (2001-10-07) Operation Enduring Freedom in Afghanistan Address to the Nation
parsing:  bush (2001-10-11) 911 Pentagon Remembrance Address
parsing:  bush (2001-10-11) Prime Time News Conference on War on Terror
parsing:  bush (2001-10-11) Prime Time News Conference Q&A
parsing:  bush (2001-10-26) Address on Signing the USA Patriot Act of 2001
parsing:  bush (2001-11-10) First Address to the United Nations General Assembly
parsing:  bush (2001-12-11) Address to Citadel Cadets
parsing:  bush (2001-12-11) The World Will Always Remember 911
parsing:  bush (2002-01-29) First (Official) Presidential State of the Union A

In [18]:
doc = orators["bush"][4].doc

n = 30

for ent in doc._.custom_chunks:
    
    if ent._.ATTRIBUTE == "outgroup":# or ent._.ATTRIBUTE == "ingroup":
        print(str(ent.sent).strip())
        print(str(ent).ljust(n), '=>', ent.root.ent_type_, '=>', ent._.ATTRIBUTE, '=>', [t.tag_ for t in ent])
        print()
        
#         #if ent.root._.has_hypernym._.ATTRIBUTE == "outgroup" or 
#             print(str(ent.sent).strip())
#             print(str(ent).ljust(n), '=>', ent.root._.has_hypernym._.ATTRIBUTE.ljust(n), '=>', str(ent.root._.has_hypernym).ljust(n), '=>')
#             print('---')
#     #         print(str(ent).ljust(n), '=>', str(ent.root._.is_hyponym).ljust(n), '=>', ent._.ATTRIBUTE.ljust(n), '=>', ent._.span_type.ljust(n))

We have seen it in the courage of passengers, who rushed terrorists to save others on the ground -- passengers like an exceptional man named Todd Beamer.
terrorists                     =>  => outgroup => ['NNS']

Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done.
enemies                        =>  => outgroup => ['NNS']

Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done.
enemies                        =>  => outgroup => ['NNS']

On September the 11th, enemies of freedom committed an act of war against our country.
enemies of freedom             =>  => outgroup => ['NNS']

The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.
collection of loosely affiliated terrorist organizations =>  => outgroup => ['NN']

They are some of the murderers indicted for bombing American embassies in Tanzania and Kenya, and responsible for bombing the

In [3]:
%%time
from IPython.display import clear_output
import importlib
import pandas as pd
import hpspacy, hpregex
from tqdm import tqdm
importlib.reload(hpspacy)
importlib.reload(hpregex)


class hp_Analysis(object):
    
    """
    Compares Hearst Pattern detection methods across the Text() objects of each
    Orator() in the Dataset() object.
    
    iterates over each Text() object within each Orator() of the Dataset() and counts
    the hyponym matches returned by each method.
    
    tested methods are:
    - regex patterns (hpregex)
    - spaCy patterns (hpspacy)
    
    earlier spaCy variants, currently commented out below:
    - spacy patterns (bronze): uses stopword list for noun chunking
    - spacy patterns (silver): uses modifier dependency for noun_chunking
    - spacy patterns (gold): uses stopword list compiled from pattern rule names
    
    input: Dataset Object
    outputs: 
    - quant: dataframe quantifying results, one row per orator and one column
    per method, in the following format:
    
            | regex           | spaCy           |
    orator  | hyponyms count  | hyponyms count  |
    ...     | ...             | ...             |
    
    - hyponym_list: dict object of hyponym results for each orator and each method in the 
    following format:
    
    {"orator": {"method" : [hyponyms]}}
    """
    
    def __init__(self, dataset = None, orator_list = None, hyponym_list = None):
        
        self.quant = None # dataframe output quantifying hyponym results
        
        self.methods = {"regex" : [], "spaCy" : []}
        
        #self.methods = {"regex" : [], "bronze" : [], "silver" : [], "gold" : []}
        
        # initialise the object when Dataset() is passed
        
        self.dataset = dataset # the dataset

        if orator_list is None:
            # Orator() objects in the dataset that contain at least one Text()
            orator_list = [orator for orator in self.dataset if len(orator) > 0]
        self.orator_list = orator_list # keep a caller-supplied list instead of discarding it

        if hyponym_list is None:
            # dict object containing hyponyms for each Orator() and each method
            hyponym_list = {orator.ref : {method : [] for method in self.methods} for orator in self.orator_list}
        self.hyponym_list = hyponym_list # keep a caller-supplied dict instead of discarding it

        self.summarise(self.orator_list)

    def summarise(self, iterable):
        
        # initialise the regex (hr) and spaCy (hs) Hearst Pattern detectors
        hr = hpregex.HearstPatterns(cnd.nlp, extended=True, merge = False)
        hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)
        
        #iterate over each orator in the dataset
        for orator in iterable:
            
            if orator.ref == "hitler":
                continue

            # temp dict for storing the results
            results_count = {key: 0 for key in self.methods.keys()}

            for text in tqdm(orator, desc = orator.ref):

                # get results for regex method
                try:
                    hyps = hr.find_hyponyms(str(text))

                    self.hyponym_list[orator.ref]["regex"].extend(hyps)

                    results_count["regex"] += len(hyps)
                except Exception:
                    pass
#                     print("not analysed (regex):", text.reference)
                    
                try:
                    hyps = hs.find_hyponyms(str(text))
                    
                    self.hyponym_list[orator.ref]["spaCy"].extend(hyps)
                    
                    results_count["spaCy"] += len(hyps)
                    
                except Exception:
                    print("not analysed(spacy):", text.reference)

                # get results for each of the spaCy methods
                
                
#                 for grade in list(self.methods.keys())[1:]:

#                     try:
#                         h = hpspacy.hearst_patterns(nlp, extended = True, predicatematch = grade)
#                         hyps = h.find_hyponyms(str(text))
#                         self.hyponym_list[orator.ref][grade].extend(hyps)
#                         results_count[grade] += len(hyps)
#                     except:
#                         print(f"not analysed {grade}: {orator.ref}")           

            for key, value in results_count.items():
                self.methods[key].append(value)
                
#         display(pd.DataFrame(not_analysed, columns = ["not analysed"]))
        self.quant = pd.DataFrame(self.methods, index = [orator.name.title() for orator in iterable])
        
output = None
output = hp_Analysis(dataset = {k : v for k, v in orators.items() if k.ref != "hitler"})

bush: 100%|██████████| 14/14 [00:14<00:00,  1.06s/it]
king: 100%|██████████| 5/5 [00:11<00:00,  2.23s/it]
laden: 100%|██████████| 6/6 [00:10<00:00,  1.79s/it]

Wall time: 36.8 s





In [4]:
display(output.quant)

| orator | regex | spaCy |
|---|---|---|
| George Bush | 39 | 92 |
| Martin Luther King | 22 | 53 |
| Osama Bin Laden | 29 | 71 |


In [4]:
import importlib
import hpspacy
importlib.reload(hpspacy)

hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)
for pair in hs.find_hyponyms(orators["bush"][4].doc):
    print(str(pair[-1].sent).strip())
    print(pair)
    print(pair[0].root.is_hyponym, '=>', pair[0].root._.has_hypernyms)
    print("-"*5)

Tonight we are a country awakened to danger and called to defend freedom.
(a country, 'be_a', we)


AttributeError: 'spacy.tokens.token.Token' object has no attribute 'is_hyponym'
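
The error arises because `is_hyponym` is read here as a built-in `Token` attribute. spaCy exposes custom attributes only through the `._.` namespace, and only once they have been registered. The sketch below shows that registration in isolation. It assumes the attribute names used in the cell above (`is_hyponym`, `has_hypernyms`), uses a blank English pipeline rather than the `cnd` pipeline, and sets the values by hand; how `hpspacy` would populate them is not shown and is left open.

```python
import spacy
from spacy.tokens import Token

# Register the custom attributes once; force=True avoids an error when the cell is re-run.
Token.set_extension("is_hyponym", default=False, force=True)
Token.set_extension("has_hypernyms", default=None, force=True)

nlp = spacy.blank("en")  # tokenizer only, so no trained model is required
doc = nlp("We hunt terrorist groups such as al Qaeda")

# Token indices chosen by hand for illustration: "groups" and "Qaeda"
hypernym, hyponym = doc[3], doc[7]

# Custom attributes are read and written through the ._. namespace
hyponym._.is_hyponym = True
hyponym._.has_hypernyms = [hypernym.text]

print(hyponym._.is_hyponym, '=>', hyponym._.has_hypernyms)
print(hasattr(hyponym, "is_hyponym"))  # the plain attribute still does not exist
```

With the extensions registered, the failing line would read `pair[0].root._.is_hyponym` rather than `pair[0].root.is_hyponym`.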

## Create Hearst Pattern Detection Object

In [29]:
import importlib
import hpspacy
importlib.reload(hpspacy)

docs = ["There are works by such authors as Herrick, Goldsmith, and Shakespeare.",
        "There were bruises, lacerations, or other injuries were not prevalent.",
        "common law countries, including Canada, Australia, and England enjoy toast.",
        "Many countries, especially France, England and Spain also enjoy toast.",
        "There are such benefits as postharvest losses reduction, food increase and soil fertility improvement."
       ]
hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)

for doc in docs:
    print(hs.find_hyponyms(doc))
    print('----------')


[]
----------
[]
----------
[(common law countries, 'include', Canada), (common law countries, 'include', Australia), (common law countries, 'include', England)]
----------
[(Many countries, 'especially', France), (Many countries, 'especially', England), (Many countries, 'especially', Spain)]
----------
[]
----------


## Initial Test of Hearst Pattern Detection Object

The first sentence contains a 'first' relationship, where the hypernym precedes the hyponym.

The second sentence contains both a 'first' and a 'last' relationship.
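
To make the 'first' / 'last' distinction concrete, the sketch below tags each toy pattern with the position of its hypernym and uses that tag to return every match in one fixed order. The two regexes, the pattern names and the function `find_relations` are assumptions made for illustration; they are not the rules defined in `hpspacy`, and they only handle a single hyponym per phrase.

```python
import re

# (pattern name, hypernym position, regex with two capture groups)
PATTERNS = [
    ("especially", "first", re.compile(r"([\w ]+), especially (?:the )?([\w ]+)")),
    ("and_other", "last", re.compile(r"([\w ]+) and other ([\w ]+)")),
]

def find_relations(phrase):
    """Return (hyponym, hypernym, pattern name) tuples for a short phrase."""
    found = []
    for name, position, regex in PATTERNS:
        match = regex.search(phrase)
        if match is None:
            continue
        left, right = match.group(1).strip(), match.group(2).strip()
        # 'first': hypernym is the left group; 'last': hypernym is the right group
        hypernym, hyponym = (left, right) if position == "first" else (right, left)
        found.append((hyponym, hypernym, name))
    return found

print(find_relations("terrorist groups, especially the Taliban"))
# [('Taliban', 'terrorist groups', 'especially')]
print(find_relations("al Qaeda and other terrorist groups"))
# [('al Qaeda', 'terrorist groups', 'and_other')]
```

Storing the position alongside each pattern keeps the output order independent of where the hypernym appears in the text.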

In [7]:
import importlib
import hpspacy
importlib.reload(hpspacy)

docs = [
    "We are hunting for terrorist groups, particularly the Taliban and al Qaeda",
    "We are hunting for the IRA, ISIS, al Qaeda and some other terrorist groups, especially the Taliban, Web Scientists and particularly Southampton University"
]

# hr = hpregex.HearstPatterns(cnd.nlp, extended=True, merge = False)
hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)
def show_hyps(texts):
    # print the hyponym tuples found in each text, prefixed by its index
    for i, text in enumerate(texts):
        print(i, "#####")
        print(hs.find_hyponyms(text))
        print('-----')

show_hyps(docs)

0 #####
[(Taliban, terrorist groups, 'particularly'), (al Qaeda, terrorist groups, 'particularly')]
-----
1 #####
[(Taliban, other terrorist groups, 'especially'), (Web Scientists, other terrorist groups, 'especially'), (Southampton University, other terrorist groups, 'especially')]
-----


## Test With a Larger Number of Sentences

In [8]:
%%time
import importlib
import hpspacy
importlib.reload(hpspacy)

# create a list of docs
docs = [
    "Forty-four percent of patients with uveitis had one or more identifiable signs or symptoms, such as red eye, ocular pain, visual acuity, or photophobia, in order of decreasing frequency.",
    "Other close friends, including Canada, Australia, Germany and France, have pledged forces as the operation unfolds.",
    "The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.",
    "Terrorist groups like al Qaeda depend upon the aid or indifference of governments.",
    "This new law that I sign today will allow surveillance of all communications used by terrorists, including e-mails, the Internet, and cell phones.",
    "From this day forward, any nation that continues to harbor or support terrorism will be regarded by the United States as a hostile regime.",
    "We are looking out for the Taliban, al Qaeda and other terrorist groups",
    "We are looking out for al Qaeda and other terrorist groups, especially the Taliban and the muppets"
]

for doc in docs:
    print(hs.find_hyponyms(doc))

[(red eye, symptoms, 'such_as'), (ocular pain, symptoms, 'such_as'), (visual acuity, symptoms, 'such_as'), (photophobia, symptoms, 'such_as')]
[(Canada, Other close friends, 'include'), (Australia, Other close friends, 'include'), (Germany, Other close friends, 'include'), (France, Other close friends, 'include')]
[(al Qaeda, collection of loosely affiliated terrorist organizations, 'know_as')]
[(al Qaeda, Terrorist groups, 'like')]
[(e-mails, terrorists, 'include'), (Internet, terrorists, 'include'), (cell phones, terrorists, 'include')]
[]
[]
[(Taliban, other terrorist groups, 'especially'), (muppets, other terrorist groups, 'especially')]
Wall time: 156 ms


## Test with a Full Speech

In [9]:
import os
import json
import importlib
import hpspacy
importlib.reload(hpspacy)
from spacy import displacy


dirpath = os.getcwd()
file = "first_docs.json"
hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)

with open(os.path.join(dirpath, file), "r") as f:
    last_docs = json.load(f)

total = 0
for text in last_docs:
    doc = cnd(text[2])
    hyponyms = hs.find_hyponyms(doc)
    print(doc)
    print([t for t in doc])
    print(text[0], '=>', hyponyms)
    if not hyponyms: total += 1
    print('----------')
    
print(total, '/', len(last_docs))

we are looking for terrorist groups, such as the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, such, as, the, Taliban, ,, al Qeada, and, Southampton University]
such_as => [(Taliban, terrorist groups, 'such_as'), (al Qeada, terrorist groups, 'such_as'), (Southampton University, terrorist groups, 'such_as')]
----------
we are looking for terrorist groups, known as the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, known, as, the, Taliban, ,, al Qeada, and, Southampton University]
known_as => [(Taliban, terrorist groups, 'know_as'), (al Qeada, terrorist groups, 'know_as'), (Southampton University, terrorist groups, 'know_as')]
----------
we are looking for terrorist groups, including the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, including, the, Taliban, ,, al Qeada, and, Southampton University]
including => [(Taliban, terrorist groups, 'include'), (al Qea

In [10]:
%%time
dirpath = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\Leo Tolstoy"
file = "warandpeace_testdata.json"

with open(os.path.join(dirpath, file), "r") as f:
    docs = json.load(f)

hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)
for doc in docs:
    hyponyms = hs.find_hyponyms(doc[2])
    #if len(hyponyms[1:]) != 3:
    print(doc[2])
    print([t for t in cnd(doc[2])])
    print(doc[1])
    print(doc[0], '=>', hyponyms)
    print('----------')

The younger ones occupied themselves as before, some playing cards (there was plenty of money, though there was no food), some with more innocent games, such as quoits and skittles
[The, younger ones, occupied, themselves, as, before, ,, some, playing, cards, (, there, was, plenty of money, ,, though, there, was, no, food, ), ,, some, with, more, innocent games, ,, such, as, quoits, and, skittles]
True
such_as => [(quoits, innocent games, 'such_as'), (skittles, innocent games, 'such_as')]
----------
The trench itself was the room, in which the lucky ones, such as the squadron commander, had a board, lying on piles at the end opposite the entrance, to serve as a table.
[The, trench, itself, was, the, room, ,, in, which, the, lucky ones, ,, such, as, the, squadron commander, ,, had, a, board, ,, lying, on, piles, at, the, end opposite the entrance, ,, to, serve, as, a, table, .]
True
such_as => [(squadron commander, lucky ones, 'such_as')]
----------
Through the hard century-old bark, ev