# spaCy Hearst Patterns
---

In this experiment we test the utility of Hearst Patterns for detecting the ingroup and outgroup of a text.

This experiment uses the spaCy matcher, with code adapted from: https://github.com/mmichelsonIF/hearst_patterns_python/blob/master/hearstPatterns/hearstPatterns.py

Hypernym relations are semantic relationships between two concepts: C1 is a hypernym of C2 if C1 categorises C2 (e.g. “instrument” is a hypernym of “piano”). For this research, the phrase "America has enemies, such as Al Qaeda and the Taliban" would return `[('Al Qaeda', 'enemy'), ('the Taliban', 'enemy')]`. In this example, the categorising term 'enemy' is a hypernym of both 'Al Qaeda' and 'the Taliban'; conversely, 'Al Qaeda' and 'the Taliban' are hyponyms of 'enemy'. Using this technique, hypernym terms could be classified as ingroup or outgroup, and the named entities identified as their hyponyms could then be assigned to the corresponding group.
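As a rough illustration of the idea, the sketch below extracts pairs from the example phrase with a single regular expression and then labels each hyponym through a small lookup table. It is not the hpregex or hpspacy implementation: the `SUCH_AS` pattern, the `singularise` helper and the `GROUP_LEXICON` entries are all assumptions made for this example, and only the 'such as' pattern is covered.

```python
import re

# Hypothetical lexicon mapping hypernym terms to a group label (illustrative only).
GROUP_LEXICON = {"enemy": "outgroup", "friend": "ingroup"}

# Illustrative irregular plurals; any other noun just loses a trailing "s".
IRREGULAR = {"enemies": "enemy"}

# "<hypernym>, such as <hyponym>, <hyponym> and <hyponym>"
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")

# Separators inside a coordinated list: ",", ", and", " and ", " or ".
LIST_SPLIT = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def singularise(noun):
    """Very naive singular form, good enough for this example only."""
    noun = noun.lower()
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    return noun[:-1] if noun.endswith("s") else noun


def find_hyponyms(text):
    """Return (hyponym, hypernym) pairs for every 'such as' construction."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = singularise(match.group(1))
        for hyponym in LIST_SPLIT.split(match.group(2)):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


def classify(pairs):
    """Assign each hyponym the group of its hypernym, or 'unknown'."""
    return {hyponym: GROUP_LEXICON.get(hypernym, "unknown")
            for hyponym, hypernym in pairs}


pairs = find_hyponyms("America has enemies, such as Al Qaeda and the Taliban.")
print(pairs)
print(classify(pairs))
```

A regular expression like this has no notion of noun phrases, which is one reason the experiment moves to spaCy's matcher.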

## Instantiate the Pipeline

In [1]:
%%time
import importlib
from cndlib import pipeline
# importlib.reload(cndlib.pipeline)

# build the pipeline on spaCy's medium model
cnd = pipeline.CND("medium")
print(cnd.nlp.meta['name'])
print(cnd.nlp.pipe_names)  # pipeline components in processing order

core_web_md
['tagger', 'parser', 'ner', 'Named Entity Matcher', 'merge_entities', 'Concept Matcher', 'merge_custom_chunks', 'hearst pattern matcher']
Wall time: 9.42 s


## Instantiate the Dataset

In [2]:
%%time
import os
import importlib
from cndlib import cndobjects
# importlib.reload(cndobjects)

dirpath = r'C:\Users\spa1e17\OneDrive - University of Southampton\hostile-narrative-analysis\dataset'

orators = cndobjects.Dataset(cnd, dirpath)

parsing:  hitler (2020-06-30) Mein Kampf
parsing:  bush (2001-09-11) 911 Address to the Nation
parsing:  bush (2001-09-14) Remarks at the National Day of Prayer & Remembrance Service
parsing:  bush (2001-09-15) First Radio Address following 911
parsing:  bush (2001-09-17) Address at Islamic Center of Washington, D.C.
parsing:  bush (2001-09-20) Address to Joint Session of Congress Following 911 Attacks
parsing:  bush (2001-10-07) Operation Enduring Freedom in Afghanistan Address to the Nation
parsing:  bush (2001-10-11) 911 Pentagon Remembrance Address
parsing:  bush (2001-10-11) Prime Time News Conference on War on Terror
parsing:  bush (2001-10-11) Prime Time News Conference Q&A
parsing:  bush (2001-10-26) Address on Signing the USA Patriot Act of 2001
parsing:  bush (2001-11-10) First Address to the United Nations General Assembly
parsing:  bush (2001-12-11) Address to Citadel Cadets
parsing:  bush (2001-12-11) The World Will Always Remember 911
parsing:  bush (2002-01-29) First (Of

## Test 1 - Comparing regex with spaCy results

With the spaCy pipeline component developed, this first test assesses its improvement over the regex method.

For this test, the hp_Analysis function iterates over the dataset, applies each detection method to every text for each orator, and records the results.
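The internals of hp_Analysis are not shown here, so the following is only a minimal sketch of the kind of loop described above. It assumes the dataset can be viewed as a mapping from orator to a list of texts and that each method is a callable returning a list of pairs. The name `compare_methods` and the toy inputs are hypothetical.

```python
from collections import defaultdict


def compare_methods(methods, dataset):
    """Run each detection method over every text and tally results per orator.

    methods: dict of method name -> callable(text) returning a list of pairs
    dataset: dict of orator name -> list of texts (an assumed shape)
    """
    results = defaultdict(dict)
    for orator, texts in dataset.items():
        for name, method in methods.items():
            detected, failed = 0, 0
            for text in texts:
                try:
                    detected += len(method(text))
                except Exception:
                    # record the failure and carry on with the next text
                    failed += 1
            results[orator][f"detected {name} patterns"] = detected
            results[orator][f"failed analysis ({name})"] = failed
    return dict(results)


def broken(text):
    """Toy method that always fails, to exercise the failure count."""
    raise ValueError("simulated failure")


toy_methods = {
    "keyword": lambda text: ["such as"] * text.count("such as"),
    "broken": broken,
}
toy_dataset = {"speaker": ["Enemies such as A and B.", "Friends, including C."]}

summary = compare_methods(toy_methods, toy_dataset)
print(summary)
```

Catching exceptions per text means that one text a method cannot process is counted as a failed analysis without stopping the run.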

In [3]:
%%time
import importlib
from cndlib import hpspacy, hpregex, hpanalysis
# importlib.reload(hpspacy)
# importlib.reload(hpregex)
# importlib.reload(hpanalysis)
                 
# each method takes a text and returns the hypernym-hyponym pairs it detects
methods = {"regex": hpregex.HearstPatterns(cnd.nlp, extended=True, merge=False).find_hyponyms,
           "spaCy": lambda t: cnd.nlp(t)._.pairs}

analysis = hpanalysis.hp_Analysis(methods=methods, iterable=orators)

bush: 100%|██████████| 14/14 [00:22<00:00,  1.61s/it]
king: 100%|██████████| 5/5 [00:20<00:00,  4.09s/it]
laden: 100%|██████████| 6/6 [00:17<00:00,  2.89s/it]

Wall time: 1min





From this test we can see how many more patterns the spaCy method detects across the dataset than the regex method.

In [4]:
display(analysis.results)

| | Adolf Hitler | George Bush | Martin Luther King | Osama Bin Laden |
|---|---|---|---|---|
| detected regex patterns | 0 | 40 | 26 | 46 |
| detected spaCy patterns | 0 | 75 | 42 | 65 |
| failed analysis (regex) | 0 | 3 | 0 | 1 |
| failed analysis (spaCy) | 0 | 0 | 0 | 0 |
| improvement | 0 | 0 | 0 | 0 |
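
The improvement row above reads 0 for every orator. One way an improvement figure could be derived from the detected counts is as a percentage change relative to the regex baseline; the sketch below does this for the counts in the table. The `relative_improvement` helper and the choice of percentage change are assumptions for illustration, not the calculation used by hp_Analysis.

```python
# Detected pattern counts copied from the results table above.
detected = {
    "Adolf Hitler": {"regex": 0, "spaCy": 0},
    "George Bush": {"regex": 40, "spaCy": 75},
    "Martin Luther King": {"regex": 26, "spaCy": 42},
    "Osama Bin Laden": {"regex": 46, "spaCy": 65},
}


def relative_improvement(baseline, new):
    """Percentage change from baseline to new, or None if baseline is zero."""
    if baseline == 0:
        return None
    return round(100 * (new - baseline) / baseline, 1)


improvement = {
    orator: relative_improvement(counts["regex"], counts["spaCy"])
    for orator, counts in detected.items()
}
print(improvement)
```

On these counts the spaCy method detects 87.5% more patterns than regex for George Bush, 61.5% more for Martin Luther King and 41.3% more for Osama Bin Laden; no figure is defined for Adolf Hitler because the baseline is zero.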


## Examples

Here are some examples from George Bush's Address to the Joint Session of Congress following the 9/11 attacks (2001-09-20):

In [5]:
import importlib
import hpspacy
importlib.reload(hpspacy)

hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)
for pair in orators["bush"][4].doc._.pairs:
    print(str(pair[-1].sent).strip())
    print(pair)
    print("-"*5)

Tonight we are a country awakened to danger and called to defend freedom.
('be_a', a country, we)
-----
Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done.
('whether', resolution, we)
-----
The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.
('know_as', a collection of loosely affiliated terrorist organizations, al Qaeda)
-----
The terrorists' directive commands them to kill Christians and Jews, to kill all Americans, and make no distinctions among military and civilians, including women and children.
('include', civilians, women)
-----
The terrorists' directive commands them to kill Christians and Jews, to kill all Americans, and make no distinctions among military and civilians, including women and children.
('include', civilians, children)
-----
This group and its leader -- a person named Usama bin Laden -- are linked to many other organizations in different countries, i

## Create Hearst Pattern Detection Object

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [14]:
import pandas as pd
import itertools
import importlib
import hpspacy
importlib.reload(hpspacy)


texts = ["There are works by such authors as Herrick, Goldsmith, and Shakespeare.", # such_NOUN_as
        "There are such benefits as postharvest losses reduction, food increase and soil fertility improvement.", # such_NOUN_as
        "There were bruises, lacerations, or other injuries were not prevalent.",
        "common law countries, including Canada, Australia, and England enjoy toast.", #noun, including noun
        "Many countries, especially France, England and Spain also enjoy toast.", #noun, especially noun
       ]

hs = hpspacy.HearstPatterns(cnd.nlp, extended=True)
for text in texts:
    # row 0: tokens from the stock small model; row 1: tokens from the custom pipeline
    chunks = [list(nlp(text)), list(cnd(text))]

    display(pd.DataFrame(list(itertools.zip_longest(*chunks))).T)

    # pairs found on the custom chunks versus on the stock tokenisation
    custom_pairs = cnd(text)._.pairs
    print(custom_pairs)
    inbuilt_pairs = hs(nlp(text))._.pairs

    if custom_pairs:
        display(pd.DataFrame(custom_pairs, columns=["custom chunk predicate", "hypernym", "hyponym"]))

    if inbuilt_pairs:
        display(pd.DataFrame(inbuilt_pairs, columns=["in-built chunk predicate", "hypernym", "hyponym"]))
    print('----------')


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,There,are,works,by,such,authors,as,Herrick,",",Goldsmith,",",and,Shakespeare,.
1,There,are,works by such authors,as,Herrick,",",Goldsmith,",",and,Shakespeare,.,,,


[]


Unnamed: 0,in-built chunk predicate,hypernym,hyponym
0,such_NOUN_as,"(such, authors)",(Herrick)
1,such_NOUN_as,"(such, authors)",(Goldsmith)
2,such_NOUN_as,"(such, authors)",(Shakespeare)


----------


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,There,are,such,benefits,as,postharvest,losses,reduction,",",food,increase,and,soil,fertility,improvement,.
1,There,are,such,benefits,as,postharvest losses reduction,",",food increase,and,soil fertility improvement,.,,,,,


[('such_NOUN_as', such benefits, postharvest losses reduction), ('such_NOUN_as', such benefits, food increase), ('such_NOUN_as', such benefits, soil fertility improvement)]


Unnamed: 0,custom chunk predicate,hypernym,hyponym
0,such_NOUN_as,"(such, benefits)",(postharvest losses reduction)
1,such_NOUN_as,"(such, benefits)",(food increase)
2,such_NOUN_as,"(such, benefits)",(soil fertility improvement)


----------


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,There,were,bruises,",",lacerations,",",or,other,injuries,were,not,prevalent,.
1,There,were,bruises,",",lacerations,",",or,other,injuries,were,not,prevalent,.


[('other', other injuries, lacerations), ('other', other injuries, bruises)]


Unnamed: 0,custom chunk predicate,hypernym,hyponym
0,other,"(other, injuries)",(lacerations)
1,other,"(other, injuries)",(bruises)


Unnamed: 0,in-built chunk predicate,hypernym,hyponym
0,other,"(other, injuries)",(lacerations)
1,other,"(other, injuries)",(bruises)


----------


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,common,law,countries,",",including,Canada,",",Australia,",",and,England,enjoy,toast,.
1,common law countries,",",including,Canada,",",Australia,",",and,England,enjoy,toast,.,,


[('include', common law countries, Canada), ('include', common law countries, Australia), ('include', common law countries, England)]


Unnamed: 0,custom chunk predicate,hypernym,hyponym
0,include,(common law countries),(Canada)
1,include,(common law countries),(Australia)
2,include,(common law countries),(England)


Unnamed: 0,in-built chunk predicate,hypernym,hyponym
0,include,"(common, law, countries)",(Canada)
1,include,"(common, law, countries)",(Australia)
2,include,"(common, law, countries)",(England)


----------


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,Many,countries,",",especially,France,",",England,and,Spain,also,enjoy,toast,.
1,Many,countries,",",especially,France,",",England,and,Spain,also,enjoy,toast,.


[('especially', Many countries, especially France), ('especially', Many countries, England), ('especially', Many countries, Spain)]


Unnamed: 0,custom chunk predicate,hypernym,hyponym
0,especially,"(Many, countries)","(especially, France)"
1,especially,"(Many, countries)",(England)
2,especially,"(Many, countries)",(Spain)


Unnamed: 0,in-built chunk predicate,hypernym,hyponym
0,especially,"(Many, countries)","(especially, France)"
1,especially,"(Many, countries)",(England)
2,especially,"(Many, countries)",(Spain)


----------


## Initial Test of Hearst Pattern Detection Object

The first sentence contains a 'first' relationship, where the hypernym precedes the hyponym.

The second sentence contains both a 'first' and a 'last' relationship; in a 'last' relationship the hypernym follows its hyponyms.
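
To make the two directions concrete, the sketch below works on already-merged chunks (written out by hand here) and uses the predicate's direction to decide which side holds the hypernym. The `DIRECTION` table, the `collect` and `find_pairs` helpers and the chunk lists are assumptions for this example and do not reproduce the hpspacy matcher.

```python
SEPARATORS = {",", "and", "or"}

# Assumed mapping from predicate token to direction:
# "first" = hypernym precedes the hyponyms, "last" = hypernym follows them.
DIRECTION = {"especially": "first", "particularly": "first", "other": "last"}


def collect(chunks, start, step):
    """Gather list items from `start`, moving by `step` (+1 or -1).

    Items must be separated by one or more separator chunks; the walk
    stops at the first item that is not followed by a separator.
    """
    items = []
    i = start
    while 0 <= i < len(chunks) and chunks[i] not in SEPARATORS:
        items.append(chunks[i])
        i += step
        seps = 0
        while 0 <= i < len(chunks) and chunks[i] in SEPARATORS:
            i += step
            seps += 1
        if seps == 0:
            break
    return items


def find_pairs(chunks):
    """Return (predicate, hypernym, hyponym) triples for known predicates."""
    pairs = []
    for p, chunk in enumerate(chunks):
        direction = DIRECTION.get(chunk.lower())
        if direction is None:
            continue
        # nearest non-separator chunk before the predicate
        before = p - 1
        while before >= 0 and chunks[before] in SEPARATORS:
            before -= 1
        after = p + 1
        if before < 0 or after >= len(chunks):
            continue
        if direction == "first":
            hypernym = chunks[before]
            hyponyms = collect(chunks, after, +1)
        else:
            hypernym = chunks[after]
            hyponyms = collect(chunks, before, -1)[::-1]  # restore text order
        pairs.extend((chunk.lower(), hypernym, h) for h in hyponyms)
    return pairs


first_chunks = ["We", "are", "hunting", "for", "terrorist groups", ",",
                "particularly", "the Taliban", "and", "al Qaeda"]
last_chunks = ["We", "are", "looking", "out", "for", "the Taliban", ",",
               "al Qaeda", "and", "other", "terrorist groups"]

print(find_pairs(first_chunks))
print(find_pairs(last_chunks))
```

Both calls pair 'terrorist groups' with 'the Taliban' and 'al Qaeda'; only the side on which the hypernym is looked up differs.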

In [15]:
import importlib
import hpspacy
importlib.reload(hpspacy)

texts = [
    "We are hunting for terrorist groups, particularly the Taliban and al Qaeda",
    "We are hunting for the IRA, ISIS, al Qaeda and some other terrorist groups, especially the Taliban, Web Scientists and particularly Southampton University"
]

hs = hpspacy.HearstPatterns(cnd.nlp, extended=True)

def show_hyps(texts):
    """Print the custom chunks and the detected pairs for each text."""
    for i, text in enumerate(texts):
        doc = cnd(text)  # parse once and reuse for chunks and pairs
        print(i, "#####")
        print("custom chunks =>", list(doc))
        print(doc._.pairs)
        print('-----')

show_hyps(texts)

0 #####
custom chunks => [We, are, hunting, for, terrorist groups, ,, particularly, the, Taliban, and, al Qaeda]
[]
-----
1 #####
custom chunks => [We, are, hunting, for, the, IRA, ,, ISIS, ,, al Qaeda, and, some, other, terrorist groups, ,, especially, the, Taliban, ,, Web Scientists, and, particularly, Southampton University]
[('some_other', some other terrorist groups, al Qaeda), ('some_other', some other terrorist groups, the IRA), ('some_other', some other terrorist groups, ISIS), ('some_other', some other terrorist groups, especially the Taliban), ('some_other', some other terrorist groups, Web Scientists), ('some_other', some other terrorist groups, particularly Southampton University)]
-----


## Test With a Larger Number of Sentences

In [16]:
%%time
import importlib
import hpspacy
importlib.reload(hpspacy)

# create a list of docs
texts = [
    "Forty-four percent of patients with uveitis had one or more identifiable signs or symptoms, such as red eye, ocular pain, visual acuity, or photophobia, in order of decreasing frequency.",
    "Other close friends, including Canada, Australia, Germany and France, have pledged forces as the operation unfolds.",
    "The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.",
    "Terrorist groups like al Qaeda depend upon the aid or indifference of governments.",
    "This new law that I sign today will allow surveillance of all communications used by terrorists, including e-mails, the Internet, and cell phones.",
    "From this day forward, any nation that continues to harbor or support terrorism will be regarded by the United States as a hostile regime.",
    "We are looking out for the Taliban, al Qaeda and other terrorist groups",
    "We are looking out for al Qaeda and other terrorist groups, especially the Taliban and the muppets"
]

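def show_hyps(texts):
    """Print the custom chunks and the detected pairs for each text."""
    for i, text in enumerate(texts):
        doc = cnd(text)  # parse once and reuse for chunks and pairs
        print(i, "#####")
        print("custom chunks =>", list(doc))
        print(doc._.pairs)
        print('-----')

show_hyps(texts)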

0 #####
custom chunks => [Forty-four percent, of, patients with uveitis, had, one, or, more, identifiable signs, or, symptoms, ,, such, as, red eye, ,, ocular pain, ,, visual acuity, ,, or, photophobia, ,, in, order of decreasing frequency, .]
[('such_as', symptoms, red eye), ('such_as', symptoms, ocular pain), ('such_as', symptoms, visual acuity), ('such_as', symptoms, photophobia)]
-----
1 #####
custom chunks => [Other, close friends, ,, including, Canada, ,, Australia, ,, Germany, and, France, ,, have, pledged, forces, as, the, operation, unfolds, .]
[('include', Other close friends, Canada), ('include', Other close friends, Australia), ('include', Other close friends, Germany), ('include', Other close friends, France)]
-----
2 #####
custom chunks => [The, evidence, we, have, gathered, all, points, to, a, collection of loosely affiliated terrorist organizations, known, as, al Qaeda, .]
[('know_as', a collection of loosely affiliated terrorist organizations, al Qaeda)]
-----
3 #####


## Test with a Full Speech

In [17]:
import os
import json
import importlib
import hpspacy
importlib.reload(hpspacy)
from spacy import displacy


dirpath = os.getcwd()
file = "first_docs.json"
hs = hpspacy.HearstPatterns(cnd.nlp, extended = True)

with open(os.path.join(dirpath, file), "r") as f:
    last_docs = json.load(f)

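# count the examples for which no pairs are detected
total = 0
for text in last_docs:
    # text[0] is the pattern name and text[2] the example sentence
    doc = cnd(text[2])
    print(doc)
    print(list(doc))
    print(text[0], '=>', doc._.pairs)
    if not doc._.pairs:
        total += 1
    print('----------')

# examples with no detected pairs / total examples
print(total, '/', len(last_docs))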

we are looking for terrorist groups, such as the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, such, as, the, Taliban, ,, al Qeada, and, Southampton University]
such_as => []
----------
we are looking for terrorist groups, known as the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, known, as, the, Taliban, ,, al Qeada, and, Southampton University]
known_as => []
----------
we are looking for terrorist groups, including the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, including, the, Taliban, ,, al Qeada, and, Southampton University]
including => []
----------
we are looking for terrorist groups, especially the Taliban, al Qeada and Southampton University
[we, are, looking, for, terrorist groups, ,, especially, the, Taliban, ,, al Qeada, and, Southampton University]
especially => []
----------
we are looking for terrorist groups, like the Taliban, al Qeada 