# spaCy Hearst Patterns
---

In this experiment we test the utility of Hearst Patterns for detecting the ingroup and outgroup of a text.

This experiment uses the spaCy Matcher, with code adapted from: https://github.com/mmichelsonIF/hearst_patterns_python/blob/master/hearstPatterns/hearstPatterns.py

A hypernym relation is a semantic relationship between two concepts: C1 is a hypernym of C2 if C1 categorises C2 (e.g. “instrument” is a hypernym of “piano”). For this research, the phrase "America has enemies, such as Al Qaeda and the Taliban" would return `[('Al Qaeda', 'enemy'), ('the Taliban', 'enemy')]`. In this example, the categorising term 'enemy' is a hypernym of both 'Al Qaeda' and 'the Taliban'; conversely, 'Al Qaeda' and 'the Taliban' are hyponyms of 'enemy'. Using this technique, hypernym terms could be classified as ingroup or outgroup, and the named entities identified as their hyponyms could then be assigned to the same group.
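
As a minimal sketch of that final classification step, assume a small hand-made lexicon that assigns hypernym lemmas to a group. The `GROUP_LEXICON` dict and the `classify_pairs` function are hypothetical and are not part of the experiment's code; they only illustrate how (hyponym, hypernym) pairs could be turned into group labels.

```python
# Hypothetical lexicon mapping hypernym lemmas to a group label (an assumption for illustration).
GROUP_LEXICON = {"enemy": "outgroup", "friend": "ingroup"}


def classify_pairs(pairs, lexicon=GROUP_LEXICON):
    """Label each hyponym with the group of its hypernym, or 'unknown' if the hypernym is not listed."""
    return {hyponym: lexicon.get(hypernym, "unknown") for hyponym, hypernym in pairs}


# The pairs from the example sentence above.
pairs = [("Al Qaeda", "enemy"), ("the Taliban", "enemy")]
groups = classify_pairs(pairs)
print(groups)
```

A hypernym missing from the lexicon falls back to 'unknown' rather than raising an error, which seems a reasonable default while the lexicon is still being built.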

## Set Up the spaCy Pipeline

In [2]:
%%time

import spacy

nlp = spacy.load("en_core_web_md")

for component in nlp.pipe_names:
    if component not in ['tagger', "parser", "ner"]:
        nlp.remove_pipe(component)

merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)

Wall time: 47.4 s


## Create Dataset of Political Speeches from George Bush, Osama bin Laden and Martin Luther King

In [5]:
import cndobjects

dirpath  = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\speeches"

orators = cndobjects.Dataset(dirpath)

for orator in orators:
    print(f'object {orator.ref} called {orator.name} has {len(orator)} speeches')

object bush called George Bush has 14 speeches
object king called Martin Luther King has 5 speeches
object laden called Osama bin Laden has 7 speeches


## Create Hearst Pattern Detection Object

In [118]:
%%time

# Hearst patterns take the form of (NP <predicate> (NP (and | or)?)+)

class hearst_patterns(object):
    
    """ Hearst Patterns is a class object used to detect hypernym-hyponym relations in a text
    
    input: raw text or a spaCy Doc
    returns: list of tuples, one for each hyponym-hypernym pair found in the text
    entry format: (hyponym, hypernym, predicate)
    
    """
    
    import spacy
    
    def __init__(self, nlp, extended=False, predicatematch = "basic"):
        
       
#     Included in each entry is the original regex pattern now adapted as a spaCy matcher pattern.
#     Many of these patterns are in the same format, next iteration of code should include an
#     automatic pattern generator for patterns.
            
#     These patterns need checking and cleaning up for testing.
            
#     Format for the dict entry of each pattern
#     {
#      "label" : predicate, 
#      "pattern" : spaCy pattern, 
#      "posn" : first/last depending on whether the hypernym appears before its hyponym
#     }
      
        # make the patterns easier to read
        # as lexical understanding develops, consider adding attributes to distinguish between hypernyms and hyponyms
        self.nlp = nlp
        
        options = ["bronze", "silver", "gold"]
        if predicatematch not in options:
            entry = ""
            while entry not in ["1", "2", "3"]: 
                entry = input(f"1. {options[0]}, 2. {options[1]}, 3. {options[2]}")
            self.predicatematch = options[int(entry) -1]
        else:
            self.predicatematch = predicatematch
        
        hypernym = {"POS" : {"IN": ["NOUN", "PROPN"]}} 
        hyponym = {"POS" : {"IN": ["NOUN", "PROPN"]}}
        punct = {"IS_PUNCT": True, "OP": "?"}

        self.patterns = [

        {"label" : "such_as", "pattern" : [
#                 '(NP_\\w+ (, )?such as (NP_\\w+ ?(, )?(and |or )?)+)',
#                 'first'
             hypernym, punct, {"LEMMA": "such"}, {"LEMMA": "as"}, hyponym
        ], "posn" : "first"},

        {"label" : "know_as", "pattern" : [
#                 '(NP_\\w+ (, )?know as (NP_\\w+ ?(, )?(and |or )?)+)', # added for this experiment
#                 'first'
             hypernym, punct, {"LEMMA": "know"}, {"LEMMA": "as"}, hyponym
        ], "posn" : "first"},

        {"label" : "such", "pattern" : [
#                 '(such NP_\\w+ (, )?as (NP_\\w+ ?(, )?(and |or )?)+)',
#                 'first'
             {"LEMMA": "such"}, hypernym, punct, {"LEMMA": "as"}, hyponym
        ], "posn" : "first"},

        {"label" : "include", "pattern" : [
#                 '(NP_\\w+ (, )?include (NP_\\w+ ?(, )?(and |or )?)+)',
#                 'first'
             hypernym, punct, {"LEMMA" : "include"}, hyponym
        ], "posn" : "first"},

        {"label" : "especially", "pattern" : [ ## problem - especially is merged as a modifier in to a noun phrase
#                 '(NP_\\w+ (, )?especially (NP_\\w+ ?(, )?(and |or )?)+)',
#                 'first'
             hypernym, punct, {"LEMMA" : "especially"}, hyponym
        ], "posn" : "first"},

        {"label" : "other", "pattern" : [
#             problem: the noun_chunk, 'others' clashes with this rule to create a zero length chunk when predicate removed
#                 '((NP_\\w+ ?(, )?)+(and |or )?other NP_\\w+)',
#                 'last'
             hyponym, punct, {"LEMMA" : {"IN" : ["and", "or"]}}, {"LEMMA" : "other"}, hypernym
#             There were bruises, lacerations, or other injuries were not prevalent."
        ], "posn" : "last"},

        ]

        if extended:
            self.patterns.extend([

            {"label" : "which_may_include", "pattern" : [
#                     '(NP_\\w+ (, )?which may include (NP_\\w+ '
#                     '?(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "which"}, {"LEMMA" : "may"}, {"LEMMA" : "include"}, hyponym
            ], "posn" : "first"},

            {"label" : "which_be_similar_to", "pattern" : [
#                     '(NP_\\w+ (, )?which be similar to (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "which"}, {"LEMMA" : "be"}, {"LEMMA" : "similar"}, {"LEMMA" : "to"}, hyponym
            ], "posn" : "first"},

            {"label" : "example_of_this_be", "pattern" : [
#                     '(NP_\\w+ (, )?example of this be (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "example"}, {"LEMMA" : "of"}, {"LEMMA" : "this"}, {"LEMMA" : "be"}, hyponym
            ], "posn" : "first"},

            {"label" : ",type", "pattern" : [
#                     '(NP_\\w+ (, )?type (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "type"}, punct, hyponym
            ], "posn" : "first"},

            {"label" : "mainly", "pattern" : [
#                     '(NP_\\w+ (, )?mainly (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "mainly"}, hyponym
            ], "posn" : "first"},

            {"label" : "mostly", "pattern" : [
#                     '(NP_\\w+ (, )?mostly (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "mostly"}, hyponym
            ], "posn" : "first"},

            {"label" : "notably", "pattern" : [
#                     '(NP_\\w+ (, )?notably (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "notably"}, hyponym
            ], "posn" : "first"},

            {"label" : "particularly", "pattern" : [
#                     '(NP_\\w+ (, )?particularly (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "particularly"}, hyponym
            ], "posn" : "first"},

            {"label" : "principally", "pattern" : [
#                     '(NP_\\w+ (, )?principally (NP_\\w+ ? (, )?(and |or )?)+)', - fuses in a noun phrase
#                     'first'
                hypernym, punct, {"LEMMA" : "principally"}, hyponym
            ], "posn" : "first"},

            {"label" : "in_particular", "pattern" : [
#                     '(NP_\\w+ (, )?in particular (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "in"}, {"LEMMA" : "particular"}, hyponym
            ], "posn" : "first"},

            {"label" : "except", "pattern" : [
#                     '(NP_\\w+ (, )?except (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "except"}, hyponym
            ], "posn" : "first"},

            {"label" : "other_than", "pattern" : [
#                     '(NP_\\w+ (, )?other than (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "other"}, {"LEMMA" : "than"}, hyponym
            ], "posn" : "first"},

            {"label" : "eg", "pattern" : [
#                     '(NP_\\w+ (, )?e.g. (, )?(NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : {"IN" : ["e.g.", "eg"]}}, hyponym 
            ], "posn" : "first"},

#                 {"label" : "eg-ie", "pattern" : [ 
# #                     '(NP_\\w+ \\( (e.g.|i.e.) (, )?(NP_\\w+ ? (, )?(and |or )?)+' - need to understand this pattern better
# #                     '(\\. )?\\))',
# #                     'first'
#                     hypernym, punct, {"LEMMA" : {IN : ["e.g.", "i.e.", "eg", "ie"]}}, {"LEMMA" : "than"}, hyponym
#                 ]},

            {"label" : "ie", "pattern" : [
#                     '(NP_\\w+ (, )?i.e. (, )?(NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : {"IN" : ["i.e.", "ie"]}}, hyponym 
            ], "posn" : "first"},

            {"label" : "for_example", "pattern" : [
#                     '(NP_\\w+ (, )?for example (, )?'
#                     '(NP_\\w+ ?(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "for"}, {"LEMMA" : "example"}, punct, hyponym
            ], "posn" : "first"},

            {"label" : "example_of_be", "pattern" : [
#                     'example of (NP_\\w+ (, )?be (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                {"LEMMA" : "example"}, {"LEMMA" : "of"}, hypernym, punct, {"LEMMA" : "be"}, hyponym
            ], "posn" : "first"},

            {"label" : "like", "pattern" : [
#                     '(NP_\\w+ (, )?like (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "like"}, hyponym,
            ], "posn" : "first"},

            # repeat of such_as pattern in primary patterns???
#                     'such (NP_\\w+ (, )?as (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'

                {"label" : "whether", "pattern" : [
#                     '(NP_\\w+ (, )?whether (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "whether"}, hyponym
            ], "posn" : "first"},

            {"label" : "compare_to", "pattern" : [
#                     '(NP_\\w+ (, )?compare to (NP_\\w+ ? (, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "compare"}, {"LEMMA" : "to"}, hyponym 
            ], "posn" : "first"},

            {"label" : "among_-PRON-", "pattern" : [
#                     '(NP_\\w+ (, )?among -PRON- (NP_\\w+ ? '
#                     '(, )?(and |or )?)+)',
#                     'first'
                hypernym, punct, {"LEMMA" : "among"}, {"LEMMA" : "-PRON-"}, hyponym
            ], "posn" : "first"},

            {"label" : "for_instance", "pattern" : [
#                     '(NP_\\w+ (, )? (NP_\\w+ ? (, )?(and |or )?)+ '
#                     'for instance)',
#                     'first'
                hypernym, punct, hyponym, {"LEMMA" : "for"}, {"LEMMA" : "instance"}
            ], "posn" : "first"},

            {"label" : "and-or_any_other", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?any other NP_\\w+)',
#                     'last'
                hyponym, punct, {"DEP": "cc"}, {"LEMMA" : "any"}, {"LEMMA" : "other"}, hypernym,
            ], "posn" : "last"},

            {"label" : "some_other", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?some other NP_\\w+)',
#                     'last'
                hyponym, punct, {"DEP": "cc", "OP" : "?"}, {"LEMMA" : "some"}, {"LEMMA" : "other"}, hypernym,
            ], "posn" : "last"},

            {"label" : "be_a", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?be a NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "be"}, {"LEMMA" : "a"}, hypernym,
            ], "posn" : "last"},

            {"label" : "like_other", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?like other NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "like"}, {"LEMMA" : "other"}, hypernym,
            ], "posn" : "last"},

             {"label" : "one_of_the", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?one of the NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "one"}, {"LEMMA" : "of"}, {"LEMMA" : "the"}, hypernym,
            ], "posn" : "last"},

            {"label" : "one_of_these", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?one of these NP_\\w+)',
#                     'last'
            hyponym, punct, {"LEMMA" : "one"}, {"LEMMA" : "of"}, {"LEMMA" : "these"}, hypernym,
            ], "posn" : "last"},

            {"label" : "one_of_those", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?one of those NP_\\w+)',
#                     'last'
            hyponym, punct, {"DEP": "cc", "OP" : "?"}, {"LEMMA" : "one"}, {"LEMMA" : "of"}, {"LEMMA" : "those"}, hypernym,
            ], "posn" : "last"},

            {"label" : "be_example_of", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?be example of NP_\\w+)', added optional "an" to spaCy pattern for singular vs. plural
#                     'last'
                hyponym, punct, {"LEMMA" : "be"}, {"LEMMA" : "an", "OP" : "?"}, {"LEMMA" : "example"}, {"LEMMA" : "of"}, hypernym
            ], "posn" : "last"},

            {"label" : "which_be_call", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?which be call NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "which"}, {"LEMMA" : "be"}, {"LEMMA" : "call"}, hypernym
            ], "posn" : "last"},
#               
            {"label" : "which_be_name", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?which be name NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "which"}, {"LEMMA" : "be"}, {"LEMMA" : "name"}, hypernym
            ], "posn" : "last"},

            {"label" : "a_kind_of", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and|or)? a kind of NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "a"}, {"LEMMA" : "kind"}, {"LEMMA" : "of"}, hypernym
            ], "posn" : "last"},

#                     '((NP_\\w+ ?(, )?)+(and|or)? kind of NP_\\w+)', - combined with above
#                     'last'

            {"label" : "form_of", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and|or)? form of NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "a", "OP" : "?"}, {"LEMMA" : "form"}, {"LEMMA" : "of"}, hypernym
            ], "posn" : "last"},

            {"label" : "which_look_like", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?which look like NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "which"}, {"LEMMA" : "look"}, {"LEMMA" : "like"}, hypernym
            ], "posn" : "last"},

            {"label" : "which_sound_like", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )?which sound like NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "which"}, {"LEMMA" : "sound"}, {"LEMMA" : "like"}, hypernym
            ], "posn" : "last"},

            {"label" : "type", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and |or )? NP_\\w+ type)',
#                     'last'
                hyponym, punct, {"LEMMA" : "type"}, hypernym
            ], "posn" : "last"},

            {"label" : "compare_with", "pattern" : [
#                     '(compare (NP_\\w+ ?(, )?)+(and |or )?with NP_\\w+)',
#                     'last'
                {"LEMMA" : "compare"}, hyponym, punct, {"LEMMA" : "with"}, hypernym
            ], "posn" : "last"},

#             {"label" : "as", "pattern" : [
# #                     '((NP_\\w+ ?(, )?)+(and |or )?as NP_\\w+)',
# #                     'last'
#                 hyponym, punct, {"LEMMA" : "as"}, hypernym
#             ], "posn" : "last"},

            {"label" : "sort_of", "pattern" : [
#                     '((NP_\\w+ ?(, )?)+(and|or)? sort of NP_\\w+)',
#                     'last'
                hyponym, punct, {"LEMMA" : "sort"}, {"LEMMA" : "of"}, hypernym
            ], "posn" : "last"},

        ])

        ## initiate matcher
        from spacy.matcher import Matcher
        self.matcher = Matcher(self.nlp.vocab, validate = True)
        
        # added "some" to original list
        self.predicate_list = [
            'able', 'available', 'brief', 'certain',
            'different', 'due', 'enough', 'especially', 'few', 'fifth',
            'former', 'his', 'howbeit', 'immediate', 'important', 'inc',
            'its', 'last', 'latter', 'least', 'less', 'likely', 'little',
            'many', 'ml', 'more', 'most', 'much', 'my', 'necessary',
            'new', 'next', 'non', 'old', 'other', 'our', 'ours', 'own',
            'particular', 'past', 'possible', 'present', 'proud', 'recent',
            'same', 'several', 'significant', 'similar', 'some', 'such', 'sup', 'sure'
        ]

        self.predicates = []
        self.first = []
        self.last = []

        # add patterns to matcher
        for pattern in self.patterns:
            self.matcher.add(pattern["label"], None, pattern["pattern"])

            # gather list of predicate terms for the noun_chunk deconfliction
            self.predicates.append(pattern["label"].split('_'))

            # gather list of predicates where the hypernym appears first
            if pattern["posn"] == "first":
                self.first.append(pattern["label"])

            # gather list of predicates where the hypernym appears last
            if pattern["posn"] == "last":
                self.last.append(pattern["label"])
                
    def isPredicateMatch_bronze(self, chunk, predicates):
        
        """
        Bronze option to remove predicate phrases from noun_chunks using a predefined list of modifiers

        input: the chunk to be checked, list of predicate phrases
        returns: the chunk with predicate phrases removed.

        """
        counter = 0
        # stop at the end of the chunk so the index never runs past the last token
        while counter < len(chunk) and chunk[counter].lemma_ in predicates:
            counter += 1
                
        # avoid empty spans, e.g. the noun_chunk 'others' would otherwise become a zero-length span
        if counter == len(chunk):
            counter = 0
                
        return chunk[counter:]
    
    def isPredicateMatch_silver(self, chunk, predicates):
        
        """
        Silver option to remove predicate phrases from noun_chunks using part-of-speech tags

        input: the chunk to be checked, list of predicate phrases (unused by this option)
        returns: the chunk with predicate phrases removed.

        """
        counter = 0
        # stop at the end of the chunk so the index never runs past the last token
        while counter < len(chunk) and chunk[counter].pos_ in ["DET", "ADJ", "ADV"]:
            counter += 1
                
        # avoid empty spans, e.g. the noun_chunk 'others' would otherwise become a zero-length span
        if counter == len(chunk):
            counter = 0
                
        return chunk[counter:]

    def isPredicateMatch_gold(self, chunk, predicates):
        
        """
        Gold option to remove predicate phrases from noun_chunks using pattern labels.

        input: the chunk to be checked, list of predicate phrases
        returns: the chunk with predicate phrases removed.

        """

        def match(empty, count, chunk, predicates):
            # empty: check whether predicates list is empty
            # count < len(chunk): stops the index running past the last token of the chunk
            # count < len(predicates[0]): checks whether the count has reached the final token of the predicate
            # chunk[count].lemma_ == predicates[0][count]: check whether chunk token is equal to the predicate token

            
            while not empty and count < len(chunk) and count < len(predicates[0]) and chunk[count].lemma_ == predicates[0][count]:
                count += 1
                
            #remove empty spans, eg the noun_chunk 'others' becomes a zero length span
            if len(chunk[count:]) == 0:
                count = 0

            return empty, count
    
        def isMatch(chunk, predicates):

            empty, counter = match(predicates == [], 0, chunk, predicates)
            if empty or counter == len(predicates[0]):
                #print(chunk, "becomes: ", chunk[counter:])
                return chunk[counter:]
            else:
                return isMatch(chunk, predicates[1:])

        return isMatch(chunk, predicates)
    
    
    def find_hyponyms(self, text):
        
        """
        this is the main function of the class object
        
        follows logic of:
        1. checks whether text has been parsed
        2. pre-processing for noun_chunks
        3. generate matches
        4. create list of dict object containing match results
        """
        
        if type(text) is spacy.tokens.doc.Doc:
            doc = text
        else:
            doc = self.nlp(text) # initiate doc 
            
        
        ## Pre-processing
        # there are some predicate terms, such as "particularly", "especially" and "some other" which are
        # merged with the noun phrase. Such terms are part of the pattern and become part of the
        # merged noun-chunk; consequently, they are not detected by the matcher.
        # This pre-processing, therefore, walks through the noun_chunks of a doc object to remove those
        # predicate terms from each noun_chunk and merges the result.
        
        with doc.retokenize() as retokenizer:

            for chunk in doc.noun_chunks:

                attrs = {"tag": chunk.root.tag, "dep": chunk.root.dep}

                if self.predicatematch == "bronze":
                    retokenizer.merge(self.isPredicateMatch_bronze(chunk, self.predicate_list), attrs = attrs)
                elif self.predicatematch == "silver":
                    retokenizer.merge(self.isPredicateMatch_silver(chunk, self.predicate_list), attrs = attrs)
                elif self.predicatematch == "gold":
                    retokenizer.merge(self.isPredicateMatch_gold(chunk, self.predicates), attrs = attrs)
    
        ## Main Body
        #Find matches in doc
        matches = self.matcher(doc)
        
        pairs = [] # list of (hyponym, hypernym, predicate) tuples
        
        # if no matches are found, return the empty list
        if not matches:
            return pairs

        for match_id, start, end in matches:
            predicate = self.nlp.vocab.strings[match_id]
            
            # if the predicate is in the list where the hypernym is last, else hypernym is first
            if predicate in self.last: 
                hypernym = doc[end - 1]
                hyponym = doc[start]
            else:
                # an inelegant way to deal with the "such_NOUN_as" pattern, since the first token is not the hypernym
                if doc[start].lemma_ == "such":
                    start += 1
                hypernym = doc[start]
                hyponym = doc[end - 1]

            # create a list of dictionary objects with the format:
            # {
            # "predicate" : " predicate term based from pattern name,
            # "pairs" : [(hypernym, hyponym)] + [hyponym conjuncts (tokens linked by and | or)]
            # "sent" : sentence in which the pairs originate
            # }
            
#             pairs.append(dict({"predicate" : predicate, 
#                                "pairs" : [(hypernym, hyponym)] + [(hypernym, token) for token in hyponym.conjuncts if token != hypernym],
#                                "sent" : (hyponym.sent.text).strip()}))

            pairs.append((hyponym.lemma_, hypernym.lemma_, predicate))  
            for token in hyponym.conjuncts:   
                if token is not None and token != hypernym:
                    pairs.append((token.lemma_, hypernym.lemma_, predicate))

        return pairs
    
h = hearst_patterns(nlp, extended = True, predicatematch = "gold")
print(h.predicatematch)
result = h.find_hyponyms(orators["bush"][3].text)
print(len(result))
print(result)
#print(h.predicates)

gold
12
[('an exceptional man', 'passenger', 'like'), ('al Qaeda', 'loosely affiliated terrorist organization', 'know_as'), ('woman', 'civilian', 'include'), ('child', 'civilian', 'include'), ('the Egyptian Islamic Jihad', 'different countrie', 'include'), ('the Islamic Movement of Uzbekistan', 'different countrie', 'include'), ('Afghanistan', 'place', 'like'), ('american citizen', 'all foreign national', 'include'), ('Egypt', 'many muslim countrie', 'such_as'), ('Saudi Arabia', 'many muslim countrie', 'such_as'), ('Jordan', 'many muslim countrie', 'such_as'), ('the will', 'every value', 'except')]
Wall time: 1.16 s
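
The comments in the class note that many patterns share one shape and that a later iteration should generate them automatically. A minimal sketch of such a generator follows, assuming that the underscore-separated words of a label are exactly the predicate lemmas. The `make_pattern` helper is hypothetical, and it does not cover the patterns that contain optional tokens or that place the predicate outside the two nouns (for example `be_example_of`, `such` or `for_instance`).

```python
NOUN = {"POS": {"IN": ["NOUN", "PROPN"]}}  # hypernym and hyponym slots share this token spec
PUNCT = {"IS_PUNCT": True, "OP": "?"}      # optional punctuation after the first noun


def make_pattern(label, posn):
    """Build a pattern entry in the same dict format as the entries of self.patterns."""
    if posn not in ("first", "last"):
        raise ValueError("posn must be 'first' or 'last'")
    predicate = [{"LEMMA": lemma} for lemma in label.split("_")]
    # Both orders have the form: noun, optional punctuation, predicate lemmas, noun.
    # 'posn' only records which of the two nouns is the hypernym.
    return {"label": label, "pattern": [NOUN, PUNCT, *predicate, NOUN], "posn": posn}


entry = make_pattern("such_as", "first")
print(entry["pattern"])
```

Under these assumptions, a list such as `[make_pattern("such_as", "first"), make_pattern("sort_of", "last")]` could stand in for the hand-written entries that follow the common shape.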


In [119]:
docs = ["There are works by such authors as Herrick, Goldsmith, and Shakespeare.",
        "There were bruises, lacerations, or other injuries were not prevalent.",
        "common law countries, including Canada, Australia, and England enjoy toast.",
        "Many countries, especially France, England and Spain also enjoy toast.",
        "There are such benefits as postharvest losses reduction, food increase and soil fertility improvement."
       ]

for doc in docs:
    print(h.find_hyponyms(doc))
    print('----------')


[('Herrick', 'author', 'such'), ('Goldsmith', 'author', 'such'), ('Shakespeare', 'author', 'such')]
----------
[('laceration', 'injury', 'other'), ('bruise', 'injury', 'other')]
----------
[('Canada', 'common law countrie', 'include'), ('Australia', 'common law countrie', 'include'), ('England', 'common law countrie', 'include')]
----------
[('France', 'many countrie', 'especially'), ('England', 'many countrie', 'especially'), ('Spain', 'many countrie', 'especially')]
----------
[('postharvest losses reduction', 'benefit', 'such'), ('food increase', 'benefit', 'such'), ('soil fertility improvement', 'benefit', 'such')]
----------
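
For later analysis it may be convenient to collect the returned triples under their hypernym. A minimal sketch follows, using triples copied from the first two outputs above; the `group_by_hypernym` function is a hypothetical helper and not part of the class.

```python
from collections import defaultdict


def group_by_hypernym(triples):
    """Collect hyponyms under each hypernym, keeping first-seen order and skipping repeats."""
    grouped = defaultdict(list)
    for hyponym, hypernym, _predicate in triples:
        if hyponym not in grouped[hypernym]:
            grouped[hypernym].append(hyponym)
    return dict(grouped)


# Triples copied from the first and second outputs above.
triples = [
    ("Herrick", "author", "such"),
    ("Goldsmith", "author", "such"),
    ("Shakespeare", "author", "such"),
    ("laceration", "injury", "other"),
    ("bruise", "injury", "other"),
]
print(group_by_hypernym(triples))
```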


In [82]:
dox = nlp("common law countries")
for t in dox.noun_chunks:
    print(t.lemma_)

common law country


## Initial Test of Hearst Pattern Detection Object

The first sentence contains a 'first' relationship, where the hypernym precedes the hyponym.

Second sentence contains both a 'first' and 'last' relationship.

In [117]:
docs = [
    "We are hunting for terrorist groups, particularly the Taliban and al Qaeda",
    "We are hunting for the IRA, ISIS, al Qaeda and some other terrorist groups, especially the Taliban, Web Scientists and particularly Southampton University"
]

def show_hyps(o):
    
   
    for i, text in enumerate(o):
        print(i, "#####")
        print(h.find_hyponyms(text))
    
        print('-----')

show_hyps(docs)

0 #####
[]
-----
1 #####
[]
-----
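
Both test sentences use predicates ("particularly", "some other", "especially") of the kind that the pre-processing comments describe as being absorbed into noun chunks. The sketch below illustrates the idea behind the gold option without spaCy, operating on plain lists of lemmas rather than spans. The `strip_predicate` function is a hypothetical simplification, not the class's own method.

```python
def strip_predicate(chunk_lemmas, predicates):
    """Drop a leading predicate phrase from a noun chunk given as a list of lemmas.

    predicates is a list of lemma lists, e.g. [["some", "other"], ["especially"]].
    The chunk is returned unchanged if no predicate matches, or if stripping
    would leave nothing behind.
    """
    for predicate in predicates:
        n = len(predicate)
        if chunk_lemmas[:n] == predicate and len(chunk_lemmas) > n:
            return chunk_lemmas[n:]
    return chunk_lemmas


predicates = [["some", "other"], ["especially"], ["other"]]
print(strip_predicate(["some", "other", "terrorist", "group"], predicates))
print(strip_predicate(["especially", "the", "Taliban"], predicates))
print(strip_predicate(["other"], predicates))
```

The last call shows the guard against zero-length results: a chunk that consists only of a predicate is left intact.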


## Test with a Larger Number of Sentences

In [110]:
%%time

# create a list of docs
docs = [
    "Forty-four percent of patients with uveitis had one or more identifiable signs or symptoms, such as red eye, ocular pain, visual acuity, or photophobia, in order of decreasing frequency.",
    "Other close friends, including Canada, Australia, Germany and France, have pledged forces as the operation unfolds.",
    "The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda.",
    "Terrorist groups like al Qaeda depend upon the aid or indifference of governments.",
    "This new law that I sign today will allow surveillance of all communications used by terrorists, including e-mails, the Internet, and cell phones.",
    "From this day forward, any nation that continues to harbor or support terrorism will be regarded by the United States as a hostile regime.",
    "We are looking out for the Taliban, al Qaeda and other terrorist groups",
    "We are looking out for al Qaeda and other terrorist groups, especially the Taliban and the muppets"
]

for doc in docs:
    print(h.find_hyponyms(doc))

[('red eye', 'symptom', 'such_as'), ('ocular pain', 'symptom', 'such_as'), ('visual acuity', 'symptom', 'such_as'), ('photophobia', 'symptom', 'such_as')]
[('Canada', 'close friend', 'include'), ('Australia', 'close friend', 'include'), ('Germany', 'close friend', 'include'), ('France', 'close friend', 'include')]
[('al Qaeda', 'loosely affiliated terrorist organization', 'know_as')]
[('al Qaeda', 'terrorist group', 'like')]
[('e-mail', 'terrorist', 'include'), ('the internet', 'terrorist', 'include'), ('cell phone', 'terrorist', 'include')]
[]
[('al Qaeda', 'terrorist group', 'other'), ('the Taliban', 'terrorist group', 'other')]
[('al Qaeda', 'terrorist group', 'other'), ('the Taliban', 'terrorist group', 'especially'), ('the muppet', 'terrorist group', 'especially')]
Wall time: 165 ms
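
The fifth result attaches 'e-mail', 'the internet' and 'cell phone' to 'terrorist' rather than to 'communication', which appears to be the noun the sentence intends. One way to quantify errors of this kind is to score the extracted pairs against hand-labelled ones. A minimal sketch follows; the `expected` list is an illustrative assumption, not an annotated dataset, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of (hyponym, hypernym) pairs."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Pairs taken from the fourth and fifth outputs above (predicate label dropped).
extracted = [
    ("al Qaeda", "terrorist group"),
    ("e-mail", "terrorist"),
    ("the internet", "terrorist"),
    ("cell phone", "terrorist"),
]
# Hypothetical hand labels, assumed for illustration only.
expected = [
    ("al Qaeda", "terrorist group"),
    ("e-mail", "communication"),
    ("the internet", "communication"),
    ("cell phone", "communication"),
]
print(precision_recall(extracted, expected))
```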


## Test with a Full Speech

In [26]:
for i, text in enumerate(orators["king"]):
    print(i, ': ', text.title)
    
    # find_hyponyms returns a list of (hyponym, hypernym, predicate) tuples
    pairs = h.find_hyponyms(text.text)

    for hyponym, hypernym, predicate in pairs:
        print(predicate, '=>', (hyponym, hypernym))
        print()

0 :   - Ive Been to the Mountaintop.txt


ZeroDivisionError: division by zero