# Hypernym Relationship Extraction in Class

In this example, we use NLTK and Hearst patterns to extract hypernym relationships.
- First, set up a Python environment.
- Install NLTK: ``pip install nltk``
- Download the NLTK data distribution with the NLTK downloader: ``nltk.download()``. If ``nltk.download()`` fails, download the data manually from https://github.com/nltk/nltk_data/tree/gh-pages or https://pan.baidu.com/s/1wONWpaa86_wnsIksKda8eQ (code: tfon).
- Unzip the downloaded file into the folder reported by ``nltk.data.find(".")``.
- Unzip each zip file in the ten subfolders: *chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers*

## Hyponym Extraction using Hearst Patterns
Hyponym extraction takes four steps:
- Noun phrase (NP) chunking or named entity chunking. Any NP chunking or named entity recognition technique will do.
- Chunk preparation. Traverse the chunked result; if a chunk's label is ``NP``, merge all the words in the chunk and add the prefix ``NP_`` (for subsequent processing).
- Chunk refinement. Two or more adjacent NPs are merged into a single NP, e.g., *"NP_foo NP_bar blah blah"* becomes *"NP_foo_bar blah blah"*.
- Find the hypernym-hyponym pairs in the refined, prepared sentence.

In [4]:
import nltk
import re
from nltk import pos_tag, word_tokenize, Tree, ne_chunk
from nltk.stem import WordNetLemmatizer 

Regular expression practice: the cells below show one regex for each Hearst pattern, starting with ``NP such as {NP,}* {(or | and)} NP``. The regex syntax is documented at https://docs.python.org/3/library/re.html.

In [9]:
# case 1
# "I like to listen to music from musical genres,such as blues,rock and jazz."
regex = r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "NP_1 such as NP_2 , NP_3 and NP_4 "
matches = re.search(regex, test_str)
if matches:
    # Match.group([group1, ...]) Returns one or more subgroups of the match. 
    # If there is a single argument, the result is a single string;
    # if there are multiple arguments, the result is a tuple with one item per argument. 
    # Without arguments, group1 defaults to zero (the whole match is returned).
    print(matches.group(0))

NP_1 such as NP_2 , NP_3 and NP_4 


In [10]:
# case 2
# "... works by such authors as Herrick, Goldsmith, and Shakespeare."
regex = r"(such NP_\w+ as (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "such NP_1 as NP_2 and NP_3 "
matches = re.search(regex, test_str)
if matches:
    print(matches.group(0))

such NP_1 as NP_2 and NP_3 


In [11]:
# case 3
# "... bistros, coffee shops, and other cheap eating places."
regex = r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)"
test_str = "NP_1, NP_2 and other NP_3 "
matches = re.search(regex, test_str)
if matches:
    print(matches.group(0))

NP_1, NP_2 and other NP_3


In [5]:
# case 4
# "...all common law countries, including Canada and England."
regex = r"(NP_\w+ ?(, )?including (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "NP_1 including NP_2, NP_3, and NP_4 "
matches = re.search(regex, test_str)
if matches:
    print(matches.group(0))

NP_1 including NP_2, NP_3, and NP_4 


In [6]:
# case 5
# "...most European countries, especially France, England, and Spain."
regex = r"(NP_\w+ ?(, )?especially (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "NP_1 especially NP_2, NP_3, and NP_4 "
matches = re.search(regex, test_str)
if matches:
    print(matches.group(0))

NP_1 especially NP_2, NP_3, and NP_4 
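
The five cases above share the same building blocks, so they can be collected in one table. The following is a minimal sketch, not part of the original notebook: the names ``NP``, ``NP_LIST``, ``PATTERNS`` and ``match_pattern`` are assumptions made for illustration, and non-capturing groups ``(?:...)`` replace the capturing groups used in the cells above.

```python
import re

# Building blocks (hypothetical names, introduced for this sketch only)
NP = r"NP_\w+"                                   # one prepared noun phrase
NP_LIST = rf"(?:{NP} ?(?:, )?(?:and |or )?)+"    # "NP_a , NP_b and NP_c "

# pattern name -> (regex, position of the hypernym among the matched NPs)
PATTERNS = {
    "such_as":    (rf"{NP} (?:, )?such as {NP_LIST}", "first"),
    "such_np_as": (rf"such {NP} as {NP_LIST}", "first"),
    "and_other":  (rf"(?:{NP} ?(?:, )?)+(?:and |or )?other {NP}", "last"),
    "including":  (rf"{NP} ?(?:, )?including {NP_LIST}", "first"),
    "especially": (rf"{NP} ?(?:, )?especially {NP_LIST}", "first"),
}

def match_pattern(name, text):
    """Return (hypernym, hyponyms) for the first match of the named pattern, or None."""
    regex, position = PATTERNS[name]
    match = re.search(regex, text)
    if match is None:
        return None
    nps = re.findall(NP, match.group(0))
    if position == "first":
        return nps[0], nps[1:]
    return nps[-1], nps[:-1]

# The same test strings as in the five cells above
tests = {
    "such_as": "NP_1 such as NP_2 , NP_3 and NP_4 ",
    "such_np_as": "such NP_1 as NP_2 and NP_3 ",
    "and_other": "NP_1, NP_2 and other NP_3 ",
    "including": "NP_1 including NP_2, NP_3, and NP_4 ",
    "especially": "NP_1 especially NP_2, NP_3, and NP_4 ",
}
for name, text in tests.items():
    print(name, match_pattern(name, text))
```

Keeping the hypernym position next to each regex means the extraction code does not need a special case per pattern.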


### Step1: Chunking Sentence
- Note that the result is not a list of NP strings but the chunk tree structure.

In [7]:
def np_chunking(sentence):
    grammar = "NP: {<JJ>*<NN.*>+}\n {<NN.*>+}"  # an NP is zero or more adjectives (JJ) followed by one or more nouns (NN, NNS, NNP, ...)
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(pos_tag(word_tokenize(sentence))) 
    #result = ne_chunk(pos_tag(word_tokenize(sentence)))
    return result

result_chunks = np_chunking("""I like to listen to music from musical genres,such as blues,rock and jazz.""")
result_chunks.draw()
print(result_chunks)

(S
  I/PRP
  like/VBP
  to/TO
  listen/VB
  to/TO
  (NP music/NN)
  from/IN
  (NP musical/JJ genres/NNS)
  ,/,
  such/JJ
  as/IN
  (NP blues/NNS)
  ,/,
  (NP rock/NN)
  and/CC
  (NP jazz/NN)
  ./.)
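
To make the rule ``{<JJ>*<NN.*>+}`` concrete, here is a minimal pure-Python sketch that mimics it on a hand-tagged token list copied from the tree above. The function name ``chunk_noun_phrases`` and its list-based output are assumptions for illustration; the notebook itself relies on ``nltk.RegexpParser``.

```python
def chunk_noun_phrases(tagged):
    """Group each run of JJ* NN+ into a list; leave every other token as a (word, tag) tuple."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "JJ":            # optional adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):   # one or more nouns
            k += 1
        if k > j:                      # at least one noun was found: emit an NP chunk
            chunks.append(tagged[i:k])
            i = k
        else:                          # no noun follows: keep the single token as-is
            chunks.append(tagged[i])
            i += 1
    return chunks

# POS tags copied from the printed tree above
tagged = [("I", "PRP"), ("like", "VBP"), ("to", "TO"), ("listen", "VB"), ("to", "TO"),
          ("music", "NN"), ("from", "IN"), ("musical", "JJ"), ("genres", "NNS"),
          (",", ","), ("such", "JJ"), ("as", "IN"), ("blues", "NNS"), (",", ","),
          ("rock", "NN"), ("and", "CC"), ("jazz", "NN"), (".", ".")]

noun_phrases = [" ".join(word for word, tag in chunk)
                for chunk in chunk_noun_phrases(tagged) if isinstance(chunk, list)]
print(noun_phrases)  # ['music', 'musical genres', 'blues', 'rock', 'jazz']
```

Note that ``such/JJ`` stays outside any chunk here because no noun follows it directly.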


### Step2: Prepare the chunked result for subsequent Hearst pattern matching
- Traverse the chunked result; if a chunk's label is ``NP``, merge all the words in the chunk and add the prefix ``NP_``.
- Separate all tokens with a single space (``" "``).
- Lemmatize the words with ``WordNetLemmatizer`` (``from nltk.stem import WordNetLemmatizer``).

In [108]:
# Prepare the chunked sentence: merge the words of each NP and add the prefix NP_.
# The POS tagger labels "such" and "other" as JJ, so the chunker pulls them into the NP.
# They are moved back out in front of the NP so that cases 2 and 3 can match:
#   NP_such_X  -> such NP_X
#   NP_other_X -> other NP_X
# This is a simple workaround; other modifier positions are not handled.
def prepare_chunks(chunks):
    lemmatizer = WordNetLemmatizer()
    markers = ('such', 'other')
    terms = []
    for chunk in chunks:
        if isinstance(chunk, Tree):
            words = [word for word, pos in chunk.leaves()]
            # emit the pattern keywords as plain tokens before the NP
            terms.extend(word for word in words if word in markers)
            np = "NP_" + "_".join(lemmatizer.lemmatize(word) for word in words if word not in markers)
            terms.append(np)
        else:
            terms.append(chunk[0])  # plain (word, tag) token outside any NP
    return ' '.join(terms)   # join with spaces, so commas become separate tokens

In [117]:
raw_text = "Agar is a substance prepared from a mixture of red algae, such as Gelidium,for laboratory or industrial use."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

NP_Agar is a NP_substance prepared from a NP_mixture of NP_red_algae , such as NP_Gelidium , for NP_laboratory or NP_industrial_use .


In [109]:
# case 1
raw_text = "I like to listen to music from musical genres,such as blues,rock and jazz."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

I like to listen to NP_music from NP_musical_genre , such as NP_blue , NP_rock and NP_jazz .


In [110]:
# case 2
# problem: "such" is tagged JJ, so the chunker puts it inside the NP
raw_text = "... works by such authors as Herrick, Goldsmith, and Shakespeare."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

... NP_work by such NP_author as NP_Herrick , NP_Goldsmith , and NP_Shakespeare .


In [111]:
# case 3
# problem: "other" is tagged JJ, so the chunker puts it inside the NP
raw_text = "... bistros, coffee shops, and other cheap eating places."
chunk_res = np_chunking(raw_text)

print(prepare_chunks(chunk_res))

... NP_bistro , NP_coffee_shop , and other NP_cheap_eating_place .


In [112]:
# case 4
raw_text = "...all common law countries, including Canada and England."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

... all NP_common_law_country , including NP_Canada and NP_England .


In [113]:
# case 5
raw_text = "...most European countries, especially France, England, and Spain."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

... most NP_European_country , especially NP_France , NP_England , and NP_Spain .


### Step3: Refinement chunking
If two or more NPs are next to each other, they should be merged into a single NP, e.g., ``NP_foo NP_bar blah blah`` becomes ``NP_foo_bar blah blah``.

In [14]:
def merge_NP(prepared_chunks):
    # match a run of two or more space-separated NP_ tokens and join them with "_"
    sentence = re.sub(r"NP_\w+(?: NP_\w+)+", lambda m: m.group(0).replace(" NP_", "_"), prepared_chunks)
    return sentence

In [15]:
merge_NP("NP_foo NP_bar blah blah")

'NP_foo_bar blah blah'

In [16]:
merge_NP(prepare_chunks(chunk_res))

'... most NP_European_country , especially NP_France , NP_England , and NP_Spain .'

### Step4: Find the hypernym (broader term) and hyponyms (narrower terms) in the processed chunked results
- Define Hearst patterns. Besides the regex, we also need to specify whether the hypernym is in the first part or the second part in the pattern.
  - For example:
  - in the pattern ``NP1 such as NP2 AND NP3``, the hypernym is the first part of the pattern; 
  - in the pattern ``NP1 , NP2 and other NP3``, the hypernym is the last part of the pattern. 
- After regex matching, find all the NPs and extract the hypernym and hyponym pairs based on the ``first`` or ``last`` attribute.
- Clean the NPs by removing the prefix ``NP_`` and ``_``

In [115]:
# Given the prepared text, return the (hyponym, hypernym) pairs from the first match of each pattern
def hyponym_extract(prepared_text, hearst_patterns):
    pairs = []
    for (pattern,parser) in hearst_patterns:
        matches = re.search(pattern, prepared_text)
        if matches:
            match_str = matches.group(0)
            
            #find all NP_xx and save to a list
            nps = [a for a in match_str.split() if a.startswith("NP_")]
            
            if parser == "first":
                hypernym = nps[0]
                hyponyms = nps[1:]
            else:
                hypernym = nps[-1]
                hyponyms = nps[:-1]
            for hypo in hyponyms:
                pairs.append((hypo,hypernym))
    return pairs
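
``hyponym_extract`` calls ``re.search``, so it reports only the first match of each pattern in a text. The following is a minimal sketch, assuming every match should be reported; the function name ``extract_all_pairs`` and the sample text are hypothetical and not part of the original notebook.

```python
import re

def extract_all_pairs(prepared_text, hearst_patterns):
    """Collect (hyponym, hypernym) pairs from every match of every pattern."""
    pairs = []
    for pattern, position in hearst_patterns:
        for match in re.finditer(pattern, prepared_text):
            nps = re.findall(r"NP_\w+", match.group(0))
            if len(nps) < 2:
                continue  # a pair needs a hypernym and at least one hyponym
            if position == "first":
                hypernym, hyponyms = nps[0], nps[1:]
            else:
                hypernym, hyponyms = nps[-1], nps[:-1]
            pairs.extend((hypo, hypernym) for hypo in hyponyms)
    return pairs

# Hypothetical prepared text with two "such as" constructions
patterns = [(r"NP_\w+ (?:, )?such as (?:NP_\w+ ?(?:, )?(?:and |or )?)+", "first")]
text = "NP_fruit such as NP_apple and NP_pear are sweet . NP_tool such as NP_hammer are hard ."
print(extract_all_pairs(text, patterns))
```

Because the NP list in the pattern is greedy, two constructions joined only by a comma or "and" can still be read as one list, so the result is best treated as a set of candidates.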



In [116]:
hearst_patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                   (r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)", "last")]

print(hyponym_extract(prepare_chunks(np_chunking("I like to listen to music from musical genres,such as blues,rock and jazz.")),hearst_patterns))
print(hyponym_extract(prepare_chunks(np_chunking("He likes to play basketball,football and other sports.")),hearst_patterns))
# Expected result


[('NP_blue', 'NP_musical_genre'), ('NP_rock', 'NP_musical_genre'), ('NP_jazz', 'NP_musical_genre')]
[('NP_basketball', 'NP_sport'), ('NP_football', 'NP_sport')]
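
The pairs printed above still carry the ``NP_`` prefix, so the cleaning bullet of Step 4 has not been applied yet. A minimal sketch of that step follows, assuming the remaining underscores should become spaces; the helper name ``clean_np`` is hypothetical.

```python
def clean_np(term):
    """Strip the NP_ prefix and turn the remaining underscores into spaces."""
    prefix = "NP_"
    if term.startswith(prefix):
        term = term[len(prefix):]
    return term.replace("_", " ")

# Pairs copied from the output above
pairs = [("NP_blue", "NP_musical_genre"), ("NP_rock", "NP_musical_genre"), ("NP_jazz", "NP_musical_genre")]
cleaned = [(clean_np(hypo), clean_np(hyper)) for hypo, hyper in pairs]
print(cleaned)  # [('blue', 'musical genre'), ('rock', 'musical genre'), ('jazz', 'musical genre')]
```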
