# Custom Noun Chunking
-----

spaCy's built-in noun_chunks is too coarse-grained for the chunking needed to detect ingroups and outgroups.

The methodology requires a more fine-grained noun chunking algorithm.

In several of the test ingroup and outgroup sentences, named entities are chunked together with other nouns when they would preferably be kept separate.

There are also several examples where a noun chunk contains more than one noun carrying a custom attribute; such a chunk needs to be resolved to a single instance.

This notebook adapts spaCy's noun_chunks source code for the specific purpose of this pipeline.

Source code at these links:

    Noun Chunker Code
    
    https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

    Class extensions

    https://github.com/explosion/spaCy/blob/9ce059dd067ecc3f097d04023e3cfa0d70d35bb8/spacy/tokens/doc.pyx

    https://github.com/explosion/spaCy/blob/f49e2810e6ea5c8b848df5b0f393c27ee31bb7f4/spacy/tokens/span.pyx


In [1]:
%%time
import pandas as pd
import nltk
import spacy
nlp = spacy.load("en_core_web_md")
print(f'spaCy version: {spacy.__version__}')

spaCy version: 3.4.3
CPU times: user 6.43 s, sys: 1.31 s, total: 7.74 s
Wall time: 14.3 s


In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stop_words_list = list(stopwords.words('english'))

vectorizer = CountVectorizer(stop_words=stop_words_list)


bow_sentences = """
Al Qaeda is to terror what the mafia is to crime.
Deliver to United States authorities all the leaders of al Qaeda who hide in your land.
But today, for al Qaeda and the Taliban, there is no shelter.
On my orders, the United States military has begun strikes against Al Qaeda terrorist training camps and military installations of the Taliban regime in Afghanistan
"""


print('Original Sentence:')
for i, sentence in enumerate(sent_tokenize(bow_sentences)):
    print(f'Sentence {i + 1}: {sentence.strip()}')
print()

print('Tokenised Sentence:')
for i, sentence in enumerate(sent_tokenize(bow_sentences)):
    print(f'Sentence {i + 1}: {word_tokenize(sentence)}')
print()

print('Sentences Without Stopwords')
print(f'Stopword examples {stop_words_list[0:10]}')
for i, sentence in enumerate(sent_tokenize(bow_sentences)):
    print(f'Sentence {i + 1}: {[word for word in word_tokenize(sentence) if word not in stop_words_list]}')
print()

print('BOW representation:')
count_matrix = vectorizer.fit_transform([bow_sentences])
BOW_values = count_matrix.toarray()
BOW_keys = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

BOW = {BOW_keys[i]: BOW_values[0][i] for i in range(len(BOW_keys))}
print(BOW)
print()


Original Sentence:
Sentence 1: Al Qaeda is to terror what the mafia is to crime.
Sentence 2: Deliver to United States authorities all the leaders of al Qaeda who hide in your land.
Sentence 3: But today, for al Qaeda and the Taliban, there is no shelter.
Sentence 4: On my orders, the United States military has begun strikes against Al Qaeda terrorist training camps and military installations of the Taliban regime in Afghanistan

Tokenised Sentence:
Sentence 1: ['Al', 'Qaeda', 'is', 'to', 'terror', 'what', 'the', 'mafia', 'is', 'to', 'crime', '.']
Sentence 2: ['Deliver', 'to', 'United', 'States', 'authorities', 'all', 'the', 'leaders', 'of', 'al', 'Qaeda', 'who', 'hide', 'in', 'your', 'land', '.']
Sentence 3: ['But', 'today', ',', 'for', 'al', 'Qaeda', 'and', 'the', 'Taliban', ',', 'there', 'is', 'no', 'shelter', '.']
Sentence 4: ['On', 'my', 'orders', ',', 'the', 'United', 'States', 'military', 'has', 'begun', 'strikes', 'against', 'Al', 'Qaeda', 'terrorist', 'training', 'camps', 'and'



In [3]:
if "merge_noun_chunks" in nlp.pipe_names: nlp.remove_pipe("merge_noun_chunks")
nlp.add_pipe("merge_noun_chunks")

for i, sentence in enumerate(nlp(bow_sentences).sents):
    print(f'Sentence {i}: {[token.text.strip() for token in sentence]}')
    print()
    

nlp.remove_pipe("merge_noun_chunks")
print()

Sentence 0: ['Al Qaeda', 'is', 'to', 'terror', 'what', 'the mafia', 'is', 'to', 'crime', '.', '']

Sentence 1: ['Deliver', 'to', 'United States authorities', 'all the leaders', 'of', 'al Qaeda', 'who', 'hide', 'in', 'your land', '.', '']

Sentence 2: ['But', 'today', ',', 'for', 'al Qaeda', 'and', 'the Taliban', ',', 'there', 'is', 'no shelter', '.', '']

Sentence 3: ['On', 'my orders', ',', 'the United States military', 'has', 'begun', 'strikes', 'against', 'Al Qaeda terrorist training camps', 'and', 'military installations', 'of', 'the Taliban regime', 'in', 'Afghanistan', '']




In [4]:
sentences = [
    "Our war on terror begins with al Qaeda, but it does not end there.",
    "These same terrorists are searching for weapons of mass destruction, the tools to turn their hatred into holocaust.",
    "On September the 11th, enemies of freedom committed an act of war against our country",
    "The face of terror is not the true faith of Islam",
    "The United States of America is a friend to the Afghan people, and we are the friends of almost a billion worldwide who practice the Islamic faith.",
    "The image of that dreadful massacre in Qana, Lebanon, is still vivid in one's mind, and so are the massacres in Tajikistan, Burma, Kashmir, Assam, the Philippines, Fatani, Ogaden, Somalia, Eritrea, Chechnya, and Bosnia-Herzegovina where hair-raising and revolting massacres were committed before the eyes of the entire world clearly in accordance with a conspiracy by the United States and its allies who banned arms for the oppressed there under the cover of the unfair United Nations",
]

In [5]:
import pandas as pd
pd.set_option('display.max_rows', None)
from spacy import displacy
from spacy.symbols import NOUN, PROPN, PRON, ADP

doc = nlp(sentences[5])

def get_children(token):
    child = next(token.children, False)
    if child and child.head.pos == ADP and child.head.head.pos in (NOUN, PROPN, PRON):
        return f'{child} ({child.dep_})'
    return ""

def get_head(token):
#     if next(token.rights, False) and token.pos in (NOUN, PROPN, PRON) and next(token.rights).pos == ADP:
#         return f'{token.head} ({token.head.dep_})'
#     if token.pos == ADP:
#         return f'{token.head} ({token.head.dep_})'
    return f'{token.head} ({token.head.dep_}) ({spacy.explain(token.head.dep_)})'

data = {
    "word": [t for t in doc],
    "POS Tag": [f"{t.pos_} ({spacy.explain(t.pos_)})" for t in doc],
    "Dependency Label": [f"{t.dep_} ({spacy.explain(t.dep_)})" for t in doc],
    "head": [get_head(t) for t in doc],
    "Gen Child": [get_children(t) for t in doc],
    "Lefts" : [list(t.lefts) for t in doc],
    "Left Edge": [t.left_edge for t in doc],
    "Right Edge": [t.right_edge for t in doc],
    "Rights": [list(t.rights) for t in doc],
    "Ancestors" : [list(t.ancestors) for t in doc],
    "Children" : [list(t.children) for t in doc],
    "Subtree" : [list(t.subtree) for t in doc]
    }
display(pd.DataFrame(data))
spacy.displacy.render(doc, style="dep")



Unnamed: 0,word,POS Tag,Dependency Label,head,Gen Child,Lefts,Left Edge,Right Edge,Rights,Ancestors,Children,Subtree
0,The,DET (determiner),det (determiner),image (nsubj) (nominal subject),,[],The,The,[],"[image, is]",[],[The]
1,image,NOUN (noun),nsubj (nominal subject),is (ROOT) (root),,[The],The,Lebanon,[of],[is],"[The, of]","[The, image, of, that, dreadful, massacre, in,..."
2,of,ADP (adposition),prep (prepositional modifier),image (nsubj) (nominal subject),massacre (pobj),[],of,Lebanon,[massacre],"[image, is]",[massacre],"[of, that, dreadful, massacre, in, Qana, ,, Le..."
3,that,DET (determiner),det (determiner),massacre (pobj) (object of preposition),,[],that,that,[],"[massacre, of, image, is]",[],[that]
4,dreadful,ADJ (adjective),amod (adjectival modifier),massacre (pobj) (object of preposition),,[],dreadful,dreadful,[],"[massacre, of, image, is]",[],[dreadful]
5,massacre,NOUN (noun),pobj (object of preposition),of (prep) (prepositional modifier),,"[that, dreadful]",that,Lebanon,[in],"[of, image, is]","[that, dreadful, in]","[that, dreadful, massacre, in, Qana, ,, Lebanon]"
6,in,ADP (adposition),prep (prepositional modifier),massacre (pobj) (object of preposition),Qana (pobj),[],in,Lebanon,[Qana],"[massacre, of, image, is]",[Qana],"[in, Qana, ,, Lebanon]"
7,Qana,PROPN (proper noun),pobj (object of preposition),in (prep) (prepositional modifier),,[],Qana,Lebanon,"[,, Lebanon]","[in, massacre, of, image, is]","[,, Lebanon]","[Qana, ,, Lebanon]"
8,",",PUNCT (punctuation),punct (punctuation),Qana (pobj) (object of preposition),,[],",",",",[],"[Qana, in, massacre, of, image, is]",[],"[,]"
9,Lebanon,PROPN (proper noun),appos (appositional modifier),Qana (pobj) (object of preposition),,[],Lebanon,Lebanon,[],"[Qana, in, massacre, of, image, is]",[],[Lebanon]


In [6]:
# s = nlp(sentences[0])
s = nlp("""Al Qaeda is to terror what the mafia is to crime.""")

data = {
    "word": [t for t in s],
    "POS Tag": [f"{t.pos_} ({spacy.explain(t.pos_)})" for t in s],
    "Dependency Label": [f"{t.dep_} ({spacy.explain(t.dep_)})" for t in s]
    }
display(pd.DataFrame(data))

Unnamed: 0,word,POS Tag,Dependency Label
0,Al,PROPN (proper noun),compound (compound)
1,Qaeda,PROPN (proper noun),nsubj (nominal subject)
2,is,AUX (auxiliary),ROOT (root)
3,to,PART (particle),aux (auxiliary)
4,terror,VERB (verb),xcomp (open clausal complement)
5,what,PRON (pronoun),attr (attribute)
6,the,DET (determiner),det (determiner)
7,mafia,NOUN (noun),nsubj (nominal subject)
8,is,AUX (auxiliary),ccomp (clausal complement)
9,to,ADP (adposition),prep (prepositional modifier)


In [7]:
data = {}
for i, sentence in enumerate(sentences):
    data[f"Sentence {i+1} spaCy Noun Chunks"] = [chunk.text.strip() for chunk in nlp(sentence).noun_chunks]
         
display(pd.DataFrame.from_dict(data, orient = "index").fillna(value=""))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
Sentence 1 spaCy Noun Chunks,Our war,terror,al Qaeda,it,,,,,,,...,,,,,,,,,,
Sentence 2 spaCy Noun Chunks,These same terrorists,weapons,mass destruction,their hatred,holocaust,,,,,,...,,,,,,,,,,
Sentence 3 spaCy Noun Chunks,September,enemies,freedom,an act,war,our country,,,,,...,,,,,,,,,,
Sentence 4 spaCy Noun Chunks,The face,terror,the true faith,Islam,,,,,,,...,,,,,,,,,,
Sentence 5 spaCy Noun Chunks,The United States,America,a friend,the Afghan people,we,the friends,who,the Islamic faith,,,...,,,,,,,,,,
Sentence 6 spaCy Noun Chunks,The image,that dreadful massacre,Qana,Lebanon,one's mind,the massacres,Tajikistan,Burma,Kashmir,Assam,...,the eyes,the entire world,accordance,a conspiracy,the United States,its allies,who,arms,the cover,the unfair United Nations


In [8]:
import nltk
import re
import pprint
from nltk import Tree

patterns = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

patterns = r"""
    NP: {<.*>*}             # start by chunking everything
        }<[\.VI].*>+{       # strip any verbs, prepositions or periods
        <.*>}{<DT>          # separate on determiners
    PP: {<IN><NP>}          # PP = preposition + noun phrase
    VP: {<VB.*><NP|PP>*}    # VP = verb words + NPs and PPs
    """

NPChunker = nltk.RegexpParser(patterns)

def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    nps = []
    for sent in prepare_text(sentences):
        tree = NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps

data = {}
for i, sentence in enumerate(sentences):
    data[f"Sentence {i+1} NLTK Noun Chunks"] = [chunk.strip() for chunk in parsed_text_to_NP(sentence)]
         
display(pd.DataFrame.from_dict(data, orient = "index").fillna(value=""))
    
    

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
Sentence 1 NLTK Noun Chunks,Our war,terror,"al Qaeda , but it",not,there,,,,,,,,,,,,,
Sentence 2 NLTK Noun Chunks,These same terrorists,weapons,"mass destruction ,",the tools to,their,holocaust,,,,,,,,,,,,
Sentence 3 NLTK Noun Chunks,September,"the 11th , enemies",freedom,an act,war,our country,,,,,,,,,,,,
Sentence 4 NLTK Noun Chunks,The face,terror,not,the true faith,Islam,,,,,,,,,,,,,
Sentence 5 NLTK Noun Chunks,The United States,America,a friend to,"the Afghan people , and we",the friends,almost,a billion worldwide who practice,the Islamic faith,,,,,,,,,,
Sentence 6 NLTK Noun Chunks,The image,that dreadful massacre,"Qana , Lebanon ,",still,"one 's mind , and so",the massacres,"Tajikistan , Burma , Kashmir , Assam ,","the Philippines , Fatani , Ogaden , Somalia , ...",the eyes,the entire world clearly,accordance,a conspiracy,the United States and its allies who,arms,the,there,the cover,the unfair United Nations


In [10]:
from typing import Union, Iterator, Tuple

from spacy.symbols import NOUN, PROPN, PRON, ADJ, ADV, ADP, VERB
from spacy.errors import Errors
from spacy.tokens import Doc, Span, Token

                              
def get_noun_phrase(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
    
    def is_preposition_subject(word):
        
        if next(word.rights, False) and word.pos in (NOUN, PROPN, PRON) and next(word.rights).pos == ADP:
            return True
        return False
    
    def is_nested_preposition(prep):
        
        # works only for the first ADP in a nested preposition
        
        preposition_object = next(prep.children, False)
        if not preposition_object:
            return False
        return is_preposition_subject(preposition_object)
    
    def get_preposition_head(word):
        
        head = word
        
        # get to the head of a conjunction
        while head.dep_ == "conj" and head.head.i < head.i:
            head = head.head
                    
        # get to the head of a preposition
        while not is_preposition_subject(head) and head.head.i < head.i:
            head = head.head
            
        # get to the head of a nested preposition
        if is_preposition_subject(head.head.head):
            head = head.head.head

        return head
        
    def get_left_edge(word):
        
        if word.left_edge.pos == PRON:
            return word.i
        return word.left_edge.i
    
    def get_right_edge(word):
            
        token = next(word.rights, False)
        if token and token.pos in (ADJ, ADV):
            return token.i
        return word.i 
        
    def is_conjunction_object(word):
        return word.dep_ == "conj"
    
    labels = [
        "oprd",
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
        
        # added ("dobj" is already listed above)
        "dep",
    ]
    
    doc = doclike.doc  # Ensure works on both Doc and Span.
    
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    prep_dep = doc.vocab.strings.add("prep")
    poss_dep = doc.vocab.strings.add("poss")
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    
    if not doc.has_annotation("DEP"):
        raise ValueError(Errors.E029)
    
    prev_end = -1
    
    for i, word in enumerate(doclike):

        if word.pos not in (NOUN, PROPN, PRON, ADP):
#             print(f'"{word}" not in NOUN, PROPN, PRON, ADP ({word.i} right edge {prev_end})')
            continue
            
        if word.left_edge.i <= prev_end:
#             print(f'"{word}" {word.i} left edge {word.left_edge.i} < {prev_end}')
            continue
                    
        if is_preposition_subject(word):
#             print(f'"{word}" is preposition_subject ({word.i} right edge {prev_end})')
            continue
            
        if word.dep == prep_dep and is_nested_preposition(word):
#             print(f'"{word}" is a nested preposition ({word.i} right edge {prev_end})')
            continue
    
        if word.dep == poss_dep:
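            # NOTE: originally `prev_end == word.i`, a comparison with no effect, so
            # prev_end is left unchanged here. The possessor and its head noun are
            # therefore emitted as separate chunks (e.g. "our", "friends").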
            
#             print(f'"{word}" is a possessional modifier ({word.i} right edge {prev_end})')

            yield word.i, word.i + 1, np_label
    
        elif word.dep == prep_dep and is_preposition_subject(word.head):
            
            preposition_object = next(word.children)
            preposition_head = get_preposition_head(word.head)
                
            prev_end = get_right_edge(preposition_object)
            left_edge = get_left_edge(preposition_head)
            
#             print(f'"{word}: {Span(word.doc, left_edge, prev_end + 1)}" is a preposition ({word.i} right edge {prev_end})')
            
            yield left_edge, prev_end + 1, np_label
    
        elif word.dep in np_deps:
            
            prev_end = get_right_edge(word)
            left_edge = get_left_edge(word)
            
#             print(f'"{Span(word.doc, left_edge, prev_end + 1)}" is a noun phrase ({word.i} right edge {prev_end})')
            
            yield left_edge, prev_end + 1, np_label
            
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                
                prev_end = get_right_edge(word)
                left_edge = get_left_edge(word)
                
#                 print(f'"{Span(word.doc, left_edge, prev_end + 1)}" is a conjunction with head: {get_preposition_head(word)} ({word.i} right edge {prev_end})')
                
                yield left_edge, prev_end + 1, np_label
    
def chunker(doc):
    for start, end, label in get_noun_phrase(doc):
        chunk = Span(doc, start, end, label=label)
        yield chunk
        
data = {}
# for i, sentence in enumerate(sentences):
#     data[f"Sentence {i+1} Noun Chunks"] = [chunk.text.strip() for chunk in chunker(nlp(sentence))]
    
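# Label derived from the list length rather than a loop variable leaked from an earlier cell
data[f"Sentence {len(sentences)} Noun Chunks"] = [chunk.text.strip() for chunk in chunker(nlp(sentences[-1]))]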

# print('-----')
# for sentence in sentences:
#     print(sentence)
display(pd.DataFrame.from_dict(data, orient = "index").fillna(value=""))
        
        

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
Sentence 6 Noun Chunks,The image of that dreadful massacre in Qana,Lebanon,one,mind,the massacres in Tajikistan,Burma,Kashmir,Assam,the Philippines,Fatani,...,Chechnya,Bosnia-Herzegovina,hair-raising and revolting massacres,the eyes of the entire world,accordance with a conspiracy by the United States,its,allies,who,arms,the cover of the unfair United Nations


In [32]:
texts = {
    "2": "The evidence we have gathered all points to a collection of loosely affiliated terrorist organisations known as al Qaeda.",
    "4": "They are some of the murderers indicted for bombing American embassies in Tanzania and Kenya, and responsible for bombing the USS Cole",
    "5": "The enemy of America is not our many Muslim friends; it is not our many Arab friends",
}





index = [f'spaCy noun chunks', f'Custom Noun Chunks']

for key, text in texts.items():
    df = []
    doc = nlp(text)
    df.append([chunk for chunk in doc.noun_chunks])
    df.append([chunk for chunk in chunker(doc)])
    
    print(f'Sentence {key}: {text}')
    pd.options.display.max_colwidth = None  # the bare attribute access had no effect; assumed intent is untruncated cells
    df = pd.DataFrame(df, index = index).astype(str).style.hide(axis="columns")
    display(df)
    print()
    print()
    
               
               

Sentence 2: The evidence we have gathered all points to a collection of loosely affiliated terrorist organisations known as al Qaeda.


0,1,2,3,4,5,6
spaCy noun chunks,The evidence,we,all points,a collection,loosely affiliated terrorist organisations,al Qaeda
Custom Noun Chunks,The evidence,we,all points,a collection of loosely affiliated terrorist organisations,al Qaeda,




Sentence 4: They are some of the murderers indicted for bombing American embassies in Tanzania and Kenya, and responsible for bombing the USS Cole


0,1,2,3,4,5,6,7
spaCy noun chunks,They,some,the murderers,American embassies,Tanzania,Kenya,the USS Cole
Custom Noun Chunks,They,some of the murderers,American embassies in Tanzania,Kenya,the USS Cole,,




Sentence 5: The enemy of America is not our many Muslim friends; it is not our many Arab friends


0,1,2,3,4,5,6
spaCy noun chunks,The enemy,America,our many Muslim friends,it,our many Arab friends,
Custom Noun Chunks,The enemy of America,our,friends,it,our,friends






In [34]:
sentence = """
The image of that dreadful massacre in Qana, Lebanon, is still vivid in one's mind, and so are the massacres in Tajikistan, Burma, Kashmir, Assam, the Philippines, Fatani, Ogaden, Somalia, Eritrea, Chechnya, and Bosnia-Herzegovina where hair-raising and revolting massacres were committed before the eyes of the entire world clearly in accordance with a conspiracy by the United States and its allies who banned arms for the oppressed there under the cover of the unfair United Nations

"""

sentence = "They are some of the murderers indicted for bombing American embassies in Tanzania and Kenya, and responsible for bombing the USS Cole."

data = {}   
data[f"Sentence {i+1} Noun Chunks"] = [chunk.text.strip() for chunk in chunker(nlp(sentence))]

def is_preposition_subject(word):
    if next(word.rights, False) and word.pos in (NOUN, PROPN, PRON) and next(word.rights).pos == ADP:
        return True
    return False

def is_nested_preposition(prep):
    preposition_object = next(prep.children, False)
    if not preposition_object:
        return False
    return is_preposition_subject(preposition_object)

def is_preposition_object(word):
    if word.head.dep_ == "prep" and word.head.head.pos in (NOUN, PROPN, PRON):
        if not is_nested_preposition(word):
            return True
    return False

def normalise(token):
    return token.text.strip()

def get_conjuncts(sentence):
    for token in sentence:
        if is_preposition_object(token) and token.conjuncts:
            for conjunct in [token] + list(token.conjuncts):
                if next(conjunct.lefts, False):
                    yield ' '.join([normalise(token.head.head), normalise(token.head),
                                    ' '.join([normalise(left) for left in list(conjunct.lefts)]), 
                                    normalise(conjunct)])
                else:
                    yield ' '.join([normalise(token.head.head), normalise(token.head),
                                    normalise(conjunct)])


doc = nlp(sentence)
human_chunks = [chunk for chunk in list(get_conjuncts(doc))]

spacy_chunks = [str(chunk) for chunk in list(doc.noun_chunks)[5:17]]
spacy_chunks.extend([str(chunk) for chunk in list(doc.noun_chunks)[21:24]])

df1 = pd.DataFrame({"Human Parse": human_chunks})
df2 = pd.DataFrame({"spaCy Parse": spacy_chunks})
new = pd.concat([df1, df2], axis=1)


display(new.fillna(value=""))


Unnamed: 0,Human Parse,spaCy Parse
0,embassies in Tanzania,Kenya
1,embassies in Kenya,the USS Cole


# Results

The results of developing the custom chunker are as follows:

In [None]:
success_rate = lambda part, whole:round(100 * (float(part) / float(whole)), 0)
original = success_rate(orig_success, total_chunks)
custom = success_rate(custom_success, total_chunks)
print(f'in-built success rate: {original}%')
print(f'custom success rate: {custom}%')
print(f'a {custom - original}% improvement in using the new chunker')
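
The cell above relies on `orig_success`, `custom_success` and `total_chunks`, which are not defined in this part of the notebook. A minimal sketch of one way such counts could be obtained follows, assuming a hand-written list of expected chunks per sentence. `count_recovered` and the `expected` list are hypothetical, and the figures it prints are not results of this study.

```python
from typing import Iterable, List


def count_recovered(predicted: Iterable[str], expected: List[str]) -> int:
    """Count the expected chunks that appear verbatim (case-insensitive) in the prediction."""
    predicted_set = {chunk.strip().lower() for chunk in predicted}
    return sum(1 for chunk in expected if chunk.strip().lower() in predicted_set)


def success_rate(part: float, whole: float) -> float:
    """Percentage of `whole` covered by `part`, rounded to the nearest integer."""
    return round(100 * (float(part) / float(whole)), 0)


# Chunks copied from the "Sentence 2" comparison earlier in this notebook.
spacy_chunks = ["The evidence", "we", "all points", "a collection",
                "loosely affiliated terrorist organisations", "al Qaeda"]
custom_chunks = ["The evidence", "we", "all points",
                 "a collection of loosely affiliated terrorist organisations", "al Qaeda"]

# HYPOTHETICAL expected chunks, chosen only to exercise the functions.
expected = ["The evidence",
            "a collection of loosely affiliated terrorist organisations",
            "al Qaeda"]

total_chunks = len(expected)
orig_success = count_recovered(spacy_chunks, expected)
custom_success = count_recovered(custom_chunks, expected)

print(f"in-built: {orig_success}/{total_chunks}, custom: {custom_success}/{total_chunks}")
```

Exact string matching is a deliberately strict criterion; a partial-overlap measure would need a different counting function.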

## The existing spaCy code for noun chunks is below

The crux of spaCy's noun chunker is the section below. The purpose of this notebook is to determine how this code should be modified to create more fine-grained noun chunks with the correct named concepts:

    if word.pos not in (NOUN, PROPN, PRON):
        continue
    # Prevent nested chunks from being produced
    if word.left_edge.i <= prev_end:
        continue
    if word.dep in np_deps:
        prev_end = word.i
        yield word.left_edge.i, word.i + 1, np_label
    elif word.dep == conj:
        head = word.head
        while head.dep == conj and head.head.i < head.i:
            head = head.head
        # If the head is an NP, and we're coordinated to it, we're an NP
        if head.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

`np_deps` is a list of dependency labels that a noun token must carry to head a chunk.

`prev_end` is an index used to ensure that subsequent chunks do not overlap with chunks already produced.

`word.left_edge.i` creates a chunk from the root token and all other tokens in its leftwards facing dependency tree.
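
The interplay of `np_deps`, `prev_end` and the left edge can be traced on a toy example. The sketch below does not use spaCy: `Tok`, `left_edge` and `base_chunks` are hypothetical stand-ins, the parse is written by hand, and the `conj` branch of the original loop is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Tok:
    i: int       # position in the sentence
    text: str
    pos: str     # coarse part-of-speech tag
    dep: str     # dependency label
    head: int    # index of the head token; the root points at itself


NP_DEPS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT"}


def left_edge(tokens: List[Tok], i: int) -> int:
    """Index of the leftmost token in the subtree rooted at token i."""
    edge = i
    for tok in tokens:
        j = tok.i
        # Walk up the tree from tok until we reach token i or the root.
        while tokens[j].head != j and j != i:
            j = tokens[j].head
        if j == i:
            edge = min(edge, tok.i)
    return edge


def base_chunks(tokens: List[Tok]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans following the shape of the loop quoted above."""
    prev_end = -1
    for tok in tokens:
        if tok.pos not in {"NOUN", "PROPN", "PRON"}:
            continue
        edge = left_edge(tokens, tok.i)
        if edge <= prev_end:  # would overlap a chunk that was already emitted
            continue
        if tok.dep in NP_DEPS:
            prev_end = tok.i
            yield edge, tok.i + 1


# Hand-written parse of "The true faith of Islam" (an assumption, not parser output).
tokens = [
    Tok(0, "The", "DET", "det", 2),
    Tok(1, "true", "ADJ", "amod", 2),
    Tok(2, "faith", "NOUN", "ROOT", 2),
    Tok(3, "of", "ADP", "prep", 2),
    Tok(4, "Islam", "PROPN", "pobj", 3),
]

chunks = [" ".join(t.text for t in tokens[start:end]) for start, end in base_chunks(tokens)]
print(chunks)
```

Because each chunk ends at the noun itself, the prepositional phrase to its right is never absorbed, which is the coarse behaviour the custom chunker sets out to change.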

Where `word.left_edge.i` is too coarse-grained, the custom chunker adds further tests to produce more fine-grained chunks.

There are also rightward-facing noun chunks of interest, for example:
- "weapons of mass destruction": with weapon as the root, the chunk is rightward facing.
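
One possible way to capture a rightward-facing chunk like this is to extend the span across a prepositional phrase attached to the noun. The sketch below illustrates the idea on a hand-written parse without spaCy; `Tok`, `children`, `right_edge` and `rightward_chunk_end` are hypothetical names, and this is not the implementation used elsewhere in the notebook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    i: int       # position in the sentence
    text: str
    pos: str     # coarse part-of-speech tag
    dep: str     # dependency label
    head: int    # index of the head token; the root points at itself


def children(tokens: List[Tok], i: int) -> List[Tok]:
    """Direct dependents of token i, in sentence order."""
    return [t for t in tokens if t.head == i and t.i != i]


def right_edge(tokens: List[Tok], i: int) -> int:
    """Index of the rightmost token in the subtree rooted at token i."""
    return max([i] + [right_edge(tokens, child.i) for child in children(tokens, i)])


def rightward_chunk_end(tokens: List[Tok], i: int) -> int:
    """Exclusive end index: include a prepositional phrase attached directly to the right."""
    rights = [t for t in children(tokens, i) if t.i > i]
    if rights and rights[0].pos == "ADP" and rights[0].dep == "prep":
        return right_edge(tokens, rights[0].i) + 1
    return i + 1


# Hand-written parse of "weapons of mass destruction" (an assumption, not parser output).
tokens = [
    Tok(0, "weapons", "NOUN", "ROOT", 0),
    Tok(1, "of", "ADP", "prep", 0),
    Tok(2, "mass", "NOUN", "compound", 3),
    Tok(3, "destruction", "NOUN", "pobj", 1),
]

end = rightward_chunk_end(tokens, 0)
chunk_text = " ".join(t.text for t in tokens[0:end])
print(chunk_text)
```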

There are noun chunks containing multiple tokens of interest that need to be resolved to a single annotation, for example:
- "the occupying American enemy": needs to be resolved to a merged noun chunk annotated as an outgroup
- "the alliance of Jews, Christians, and their agents": with alliance as the root, this is a rightwards facing group noun chunk
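
Resolving several tokens of interest to one annotation could be sketched as follows. The rule used here (prefer the label of the chunk's root, otherwise fall back to the most frequent label in the chunk) is an assumption made for illustration, as are `resolve_chunk_label` and the lexicon entries; the notebook does not state which rule the pipeline applies.

```python
from collections import Counter
from typing import Dict, List, Optional


def resolve_chunk_label(tokens: List[str], root_index: int,
                        lexicon: Dict[str, str]) -> Optional[str]:
    """Return a single label for a chunk whose tokens may carry several labels."""
    labels = [lexicon.get(token.lower()) for token in tokens]

    # Assumed rule 1: the root (head noun) of the chunk decides when it is labelled.
    if labels[root_index] is not None:
        return labels[root_index]

    # Assumed rule 2: otherwise take the most frequent label among the other tokens.
    found = [label for label in labels if label is not None]
    if not found:
        return None
    return Counter(found).most_common(1)[0][0]


# Illustrative lexicon only; the real attribute lists are not shown in this notebook.
lexicon = {"american": "outgroup", "enemy": "outgroup"}

chunk = ["the", "occupying", "American", "enemy"]
label = resolve_chunk_label(chunk, root_index=3, lexicon=lexicon)
print(label)
```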

Additional functionality for custom attributes will have to be added, and predicate terms need to be removed for the Hearst pattern detection algorithm.

## Test data

The ingroup and outgroup files for each orator comprise sentences of interest, each of which contains feature phrases relevant to the methodology.

In [None]:
## create a dict object of all the ingroup/outgroup sentences
import os
import cndutils as ut
path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"

sent_dict = dict()
jsonl_files = [f for f in os.listdir(path) if os.path.splitext(f)[1] == ".jsonl" and "group" in f]
for file in jsonl_files:
    data_list = ut.load_jsonl(os.path.join(path, file))
    for entry in data_list:
        for value in entry.values():
            sent_dict[len(sent_dict)] = value
            
print(jsonl_files)
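
The helper `ut.load_jsonl` comes from the project's own `cndutils` module, which is not shown here. A minimal stand-in is sketched below, assuming each file holds one JSON object per line; `load_jsonl` and `collect_sentences` as written are hypothetical and may differ from the project's versions.

```python
import json
from typing import Any, Dict, Iterable, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def collect_sentences(records: Iterable[Dict[str, Any]]) -> Dict[int, Any]:
    """Flatten every value of every record into a dict keyed by a running index."""
    sent_dict: Dict[int, Any] = {}
    for entry in records:
        for value in entry.values():
            sent_dict[len(sent_dict)] = value
    return sent_dict
```

Keying by `len(sent_dict)` mirrors the cell above and gives each sentence a stable integer id in insertion order.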

## Iterate through Test Data for Sentences of Interest

These sentences will be used to tune the existing spaCy noun chunker for the purposes of this methodology.

This code block selects some of these sentences to create test data for the new chunker.

In [None]:
import importlib
import cndutils
importlib.reload(cndutils)


path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"
ss = cndutils.sent_select(path = path, file = "test_sents")
output = ss(cnd.nlp, sent_dict)

## Instantiate Pipeline

In [None]:
%%time
import importlib
import pipeline
importlib.reload(pipeline)
cnd = pipeline.CND(extended = False)

print(cnd.nlp.meta['name'])
print([pipe for pipe in cnd.nlp.pipe_names])

## Developing the new code for the custom noun chunker

The custom noun chunker in this code block is developed while iterating through the data.

In [None]:
from spacy.tokens import Doc, Span, Token
from spacy import displacy
from spacy import explain
import pandas as pd
import importlib
import visuals
importlib.reload(visuals)
from pipeline import ConceptMatcher

cust_stopwords = [
        'able', 'available', 'brief', 'certain',
        'different', 'due', 'enough', 'especially', 'few', 'fifth',
        'former', 'his', 'howbeit', 'immediate', 'important', 'inc',
        'its', 'last', 'latter', 'least', 'less', 'likely', 
        'little', 'mainly', 'many', 'ml', 'more', 'most', 'mostly', 'much', 
        'my', 'necessary', 'new', 'next', 'non', 'notably', 'old', 'other', 
        'our', 'ours', 'own', 'particular', 'particularly', 'principally',
        'past', 'possible', 'present', 'proud', 'recent', 'same', 'several', 
        'significant', 'similar', 'some', 'such', 'sup', 'sure', 'these', 'those'
    ]

def custom_chunk_iterator(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    
    This is a modification of spaCy's noun chunker.
    
    Instead of using the <.left_edge.i> property to capture the span, this chunker uses <<subtree>>.
    
    Signifying custom chunks, the Span objects are labeled with "CC"
    
    source code: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py
    """
    
    labels = [
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
        "conj"
    ]
    
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.has_annotation("DEP"):  # Doc.is_parsed was removed in spaCy v3
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    pobj = doc.vocab.strings.add("pobj")
    relcl = doc.vocab.strings.add("relcl")
    acl = doc.vocab.strings.add("acl")
    prep = doc.vocab.strings.add("prep")
    ADP = doc.vocab.strings.add("ADP")
    advmod = doc.vocab.strings.add("advmod")
    cc_label = doc.vocab.strings.add("CC")
    
    def ADP_head(word):
        
        """
        function to check whether a word is the head of an adpositional phrase
        if there is a nested adpositional phrase, returns false
        """
        
        if word.n_rights > 0:
            adp_i = list(word.rights)[0].i
            if doc[adp_i].pos == ADP and doc[adp_i].text not in ["to", "in"] and doc[adp_i].n_rights > 0:
                pobj_i = list(doc[adp_i].children)[0].i

                if doc[pobj_i].dep == pobj and doc[pobj_i].n_rights == 0:
                    return True

                if doc[pobj_i].dep == conj:  # was `.pos_ == conj`: a string compared with an integer ID, never true
                    return False

                if doc[pobj_i].dep == pobj and doc[pobj_i].n_rights > 0:
                    if list(doc[pobj_i].rights)[0].pos != ADP:
                        return True
                
#                 :
#                     return True
#                 if doc[pobj_i].conjuncts or doc[pobj_i].pos_ not in ["NOUN", "PROPN", "PRON"]:
#                     return True
            
        return False
        
    def get_right_edge(word):
        
        """
        function to get the immediate right edge of an adpositional phrase
        """
        
        adp_i = None
        adp_i = list(word.rights)[0].i
        if doc[adp_i].n_rights > 0:
            pobj_i = list(doc[adp_i].rights)[-1].i
            if doc[pobj_i].pos_ not in ["NOUN", "PROPN", "PRON"]:
                return doc[pobj_i].right_edge.i
            
            return list(doc[adp_i].children)[-1].i
        
    def get_gold(gold_start):

        """
        function to get the start and end indices of a noun_chunk
        """
        gold_end = None
        
        if doc[gold_start].pos_ == "DET":
            gold_start += 1

        if word.conjuncts and word.dep_ != "conj":
            gold_end = word.i + 1 # word is a conjunction head therefore return head index
        
        elif word.dep_ == "conj" and list(word.rights) and "conj" in [t.dep for t in word.rights]:
            gold_end = word.i + 1 # word is a sub-conjunction head therefore return sub-conjunction head index
        
        elif word.dep_ in ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT", "conj"]:
            gold_end = word.right_edge.i + 1 # capture the full noun chunk
            for index in range(gold_start, gold_end):
                if doc[index].text.lower() in [",", "--", ":", "with", "which"] or doc[index].pos_ in ["SCONJ", "CCONJ"] or doc[index].dep_ in ["cc"]: # split noun chunks comprising lists
                    gold_end = index
                    break
        
        return gold_start, gold_end
    
    prev_end = -1
    
    for word in doclike:
        
        if word.pos_ not in ["NOUN", "PROPN", "PRON"]:
            continue
       
        if word.left_edge.i <= prev_end:
            continue
            
        # if the token is an adpositional head
        elif ADP_head(word):
            
            right_edge = word.right_edge.i
            
            if word.n_rights > 0:
                right_edge = get_right_edge(word)
                
            prev_end = right_edge 
            yield word.left_edge.i, right_edge + 1, cc_label
            
        # for when the word is not an adpositional head
        elif word.dep in np_deps:
            prev_end = word.i                    
            yield word.left_edge.i, word.i + 1, cc_label
                
Doc.set_extension("custom_chunk_iterator", getter = custom_chunk_iterator, force = True)
                
get = ConceptMatcher()
    
def is_modifier(token):

    """
    function to determine whether a token modifies a span
    """

    tag_modifiers = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"]
    dep_modifiers = ["amod", "poss", "pobj", "npadvmod", "appos", "compound"]

    if token.tag_ in tag_modifiers and token.dep_ in dep_modifiers:
        return True
    return False

def get_span_modifier(self, span):

    """
    Getter function for any modifying tokens of the root.
    """       

    word = span.root

    # when the token is a conjunct head, need to isolate only terms to its left
    # if the word has conjuncts but does not have a `conj` dependency it is the head of the main conjunction.
    if word.conjuncts and word.dep_ != "conj":
        for token in word.lefts:
#             print(f"from ({word}) testing ({token}) from root.lefts ({list(word.lefts)})")
            if self.is_modifier(token) and token.i != word.i:
                return token
    else:
        # when the token is not a conjunct head, can iterate over terms to the left and right
        for token in span:
#             print(f"from ({word}) testing ({token}) from root.subtree ({list(word.subtree)})")
            if self.is_modifier(token) and token.i != word.i:
                return token

def get_span_type(self, span):
        
        """
        getter function to define the span entity type for any named entities modifying the root token
        
        iterates through left facing tokens to the root to identify any modifier terms
        returns: ent_type_ of any modifier named entities
        else returns the span root ent_type_
        """
        
        #iterate through the span and return any named concepts other than those related to the root.

        for token in span:
            if self.is_modifier(token) and token.ent_type_:
                return token.ent_type_
            
        return span.root.ent_type_

def custom_chunks(doc):
    
    """
    Yields base customised noun-phrase `Span` objects from the custom chunk 
    iterator, if the document has been syntactically parsed. 
    Different to spaCy's inbuilt noun_chunks which uses the <.left_edge.i> property to capture the span, 
    this chunker uses the <.subtree> property.
    
    YIELDS (Span): Base customised chunk `Span` objects
    """
    
    # spaCy v3 replaces `Doc.is_parsed` with `Doc.has_annotation("DEP")`
    if not doc.has_annotation("DEP"):
        raise ValueError(Errors.E029)
    # Accumulate the result before beginning to iterate over it. This
    # prevents the tokenisation from being changed out from under us
    # during the iteration. The tricky thing here is that Span accepts
    # its tokenisation changing, so it's okay once we have the Span
    # objects. See Issue #375
    spans = []
    
    get = pipeline.ConceptMatcher()
    
    Token.set_extension("span_type", default = "", force = True)
    
    Span.set_extension("CONCEPT", default = "", force = True)
    Span.set_extension("ATTRIBUTE", getter = get.get_attribute, force = True)
    Span.set_extension("IDEOLOGY", getter = get.get_ideology, force = True)

    Span.set_extension("span_type", default = "", force = True)
    Span.set_extension("get_span_type", getter = get.get_span_type, force = True)
    Span.set_extension("get_span_CONCEPT", getter = get.get_span_concept, force = True)
    
    
    for start, end, label in doc._.custom_chunk_iterator:
        
        # remove stopword tokens from left of the span
        for index in range(start, end):
            if doc[index].pos_ in ["PROPN", "NOUN", "PRON"]:
                break
            if doc[index].lower_ in cust_stopwords or doc[index].dep_ == "poss" or doc[index].is_stop:
                start += 1

        span = Span(doc, start, end, label=label)

        if span.root._.CONCEPT:
            span._.CONCEPT = span.root._.CONCEPT
        else:
            span._.CONCEPT = span._.get_span_CONCEPT

        if span.root._.span_type:
            span._.span_type = span.root._.span_type
        else:
            span._.span_type = span._.get_span_type
        
        spans.append(span)
                  
    for span in spans:
        yield span

Doc.set_extension("custom_chunks", getter = custom_chunks, force = True)


######################
# testing of the custom functions
#####################

text = "we are the USA and our enemy is the Taliban Regime who are a terrorist organisation"
# separate noun chunks
text = "In this trial, we have been reminded and the world has seen that our fellow Americans are generous and kind, resourceful and brave." #
# # # # right facing chunks - how to attach freedom to defender
# text = "They have attacked America because we are freedom's home and defender, and the commitment of our Fathers is now the calling of our time."
# # # # how to parse conjunctions
# text = "Both Americans and Muslim friends and citizens, tax-paying citizens, and Muslims in nations were just appalled and could not believe what -- what we saw on our TV screens."
# # # # removal of stopwords
# text = "The enemy of America is not our many Muslim friends; it is not our many Arab friends." 
# # # # what to do about chunks joined by punctuation
# text = "We are joined in this operation by our staunch friend, Great Britain." 
# # # # long right facing chunk - missing PROPN between <billion> and <worldwide>. Extend the noun chunk to be "a billion worldwide who practice the Islamic faith."
# text = "The United States of America is a friend to the Afghan people, and we are the friends of almost a billion worldwide who practice the Islamic faith."
# # # # resolve a nested prep>pobj patterns - problem sentence - resolving these nested prep>pobj patterns contradicts others
text = "I would like to report to the American people on the state of our war against terror, and then I'll be happy to take questions from the White House press corps."
# # # # clipping a noun chunk with who
# text = "At the same time, we are showing the compassion of America by delivering food and medicine to the Afghan people who are, themselves, the victims of a repressive regime."
# # # # how to parse out <diligent and determined work> to link it with the other propn in the sentence. FBI is not detected. FBI is not part of the correct list.
# text = "We may never know what horrors our country was spared by the diligent and determined work of our police forces, the FBI, ATF agents, federal marshals, Custom officers, Secret Service, intelligence professionals and local law enforcement officials, under the most trying conditions."
# # # # the ADP phrase currently ends at "organizations"; if al Qaeda is traceable to an outgroup, can revise to word.right_edge.i
# text = "The evidence we have gathered all points to a collection of loosely affiliated terrorist organizations known as al Qaeda."
# # # # "nations" is not connected to "every continent on the earth"
# text = "Our staunch friends, Great Britain, our neighbors Canada and Mexico, our NATO allies, our allies in Asia, Russia and nations from every continent on the Earth have offered help of one kind or another -- from military assistance to intelligence information, to crack down on terrorists' financial networks."
# # # # split a list of chunks
# text = "A terrorist underworld -- including groups like Hamas, Hezbollah, Islamic Jihad, Jaish-i-Mohammed -- operates in remote jungles and deserts, and hides in the centers of large cities."
# # # # "weapon" is not dependency linked to "mass destruction"
# text = "North Korea is a regime arming with missiles and weapons of mass destruction, while starving its citizens."
# # # # split a conjunct from an adpositional phrase. "people" is not dependency linked to "big and small". "People of all walks of life" conflicts with "state of our war on terror"
# text = "We want to study the ways which could be used to rectify matters and restore rights to their owners as people have been subjected to grave danger and harm to their religion and their lives, people of all walks of life, civilians, military, security men, employees, merchants, people big and small, school and university students, and unemployed university graduates, in fact hundreds of thousands who constitute a broad sector of the society."
# # # # capturing adpositional conjuncts
# text = "O protectors of monotheism and guardians of the faith; O successors of those who spread the light of guidance in the world; O grandsons of Sa'd Bin-Abi-Waqqas, al-Muthanna Bin-Harithah al-Shibani, al-Qa'qa' Bin-'Amr al-Tamimi, and the companions who fought alongside them: You rushed to join the Army and the Guard merely to join the jihad for the cause of God in order to spread the word of God and to defend Islam and the land of the two holy mosques against invaders and occupiers, which is the highest degree of belief in religion."
# # # # how to link this phrase to be "sons of Islam and daughters of Islam"
# text = "Sons and daughters of Islam!"
# # # # not detecting "group of vanguard Muslims"
# text = "God has blessed a group of vanguard Muslims, the forefront of Islam, to destroy America."
# # # connect "alliance of" to "Jews" and "Christians"
# text = "You are not unaware of the injustice, repression, and aggression that have befallen Muslims through the alliance of Jews, Christians, and their agents, so much so that Muslims' blood has become the cheapest blood and their money and wealth are plundered by the enemies."
# # # mark "Jewish-crusade alliance" as an outgroup 
# text = "And so, this Jewish-crusade alliance killed and detained the symbols of the truthful ulema and upholders of the call—and God is above everyone."
# # # link "regime" to "injustice" and "illegitimate actions"
# text = "They feel that God is tormenting them because they kept quiet about the regime's injustice and illegitimate actions, especially its failure to have recourse to the Shari'ah, its confiscation of people's legitimate rights, the opening of the land of the two holy mosques to the American occupiers, and the arbitrary jailing of the true ulema, heirs of the Prophets."
# # # link "nation's enemies" to "the American crusader forces". long noun phrases to link to "aspects of our plight"
# text = "Its failure to protect the country, opening it to the nation's enemies, the American crusader forces who have become the main cause of all aspects of our plight, especially the economic aspect as a result of the unjustified heavy expenditure on them and as a result of the policies they impose on the country, and particularly the oil policy determining the quantities of oil to be produced and setting the prices which suit their own economic interests ignoring the country's economic interests, and also as a result of the exorbitant arms deals imposed on the regime, to the point that people are wondering what good, then, is the regime?"
# # missing noun for "those who fomented internal sedition in their country". remove "those" from stopword list. add "such" to stop words list
# text = "That was the only door left open to the public for ending injustice and upholding right and justice, and in whose interests do Prince Sultan and Prince Nayif plunge the country and the people into an internal war that would destroy everything, enlisting the aid and advice of those who fomented internal sedition in their country and using the people's police force to put down the reform movement there and pit members of the public one against the other—leaving the main enemy in the region, namely the Jewish-American alliance, safe and secure, having found such traitors to implement its policies aimed at exhausting the nation's human and financial resources internally."
# # split civilians and military. dependency identifies sentence as a conjunction, but it is not.
# text = "But, thank God, the vast majority of the people, civilians and military, are aware of that sinister plan and will not allow themselves to be an instrument for strikes against one another in implementation of the policy of the main enemy, namely the Israeli-American alliance, through the Saudi regime, its agent in the country."
# # split a chunk span by with
# text = "Partition of the country of the two holy mosques with Israel taking the northern part of the land of the two holy mosques is considered to be an urgent demand of the Jewish-crusade alliance, because the existence of a state of such size and with such resources under sound Islamic rule, which, God willing, is coming, would be a threat to the Jewish entity in Palestine, for the land of the two holy mosques would be a symbol for the unity of the Islamic world because of the presence of the holy Ka'bah, the qiblah of all Muslims."
# # split a long noun chunk of sub clauses. not picking up Islamic world's ulema
# text = "He lied to the ulema who sanctioned the Americans' entry and he lied to the Islamic world's ulema and leaders at the [World Muslim] League's conference in holy Mecca in the wake of the Islamic world's condemnation of the crusader forces' entry into the country of the two holy mosques on the pretext of defending it."
# # not picking up US Defence Secretary
# text = "A few days ago news agencies carried a statement by the occupier-crusader, the US defense secretary, in which he said that he has learned one lesson from the Riyadh and al-Khubar blasts, namely not to retreat in front of the terrorist cowards."
# # link "intentional killing of innocent" to "women" and "children" 
# text = "And that day, it was confirmed to me that oppression and the intentional killing of innocent women and children is a deliberate American policy."

cnd.nlp.vocab["O"].is_stop = True
doc = cnd(text)
# display(visuals.chunk_custom_attrs(list(doc._.custom_chunks), json = True))

# word = doc[35]
# print(word)
# if word.n_rights > 0:
#     adp_i = list(word.rights)[0].i
#     if doc[adp_i].dep_ == "prep" and doc[adp_i].n_rights > 0:
#         pobj_i = list(doc[adp_i].children)[0].i
#         print("nearly true")
#         if doc[pobj_i].dep_ == "pobj" and doc[pobj_i].n_rights == 0:
#             print("True")
#         if doc[pobj_i].dep_ == "pobj" and doc[pobj_i].n_rights > 0:
#             if list(doc[pobj_i].rights)[0].pos_ != "ADP":
# #                 print("very true")
# else:
#     print("try again")
    
# options = {"compact": True}
# displacy.render(doc, style = "dep", options=options)
# print(doc.ents)
# print(list(doc.noun_chunks))
# visuals.chunk_custom_attrs(list(doc._.custom_chunks), json = True)

print("<<< token custom_chunks >>>")
# display(visuals.sent_custom_chunks(doc))
print(doc)

print("<<< custom chunk attributes >>>")
display(visuals.chunk_custom_attrs(list(doc._.custom_chunks)).T)

print("<<< original chunk attributes >>>")
display(visuals.chunk_custom_attrs(list(doc.noun_chunks)).T)

token_index = 16
if isinstance(token_index, int) and token_index < len(doc):
    print("<<< selected token dependency tree attributes >>>")
    display(visuals.token_deps(doc[token_index]))

## Iterating over the data

Iterate over each sentence and review each noun chunk to decide what the desired chunk should be, recording notes on which modifications to the noun chunk doc extension are required.

While iterating over the data, a gold standard dataset is created against which the new chunker is tested.
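
The layout of one gold-standard record, as the annotation loop below reads and writes it, can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: the chunk texts are made-up placeholders rather than real chunker output, the file name `example_gold.jsonl` is hypothetical, and the real chunk dictionaries carry whatever fields `visuals.chunk_custom_attrs` returns.

```python
import json
import os
import tempfile

# Hypothetical record: the sentence is keyed by its index (as a string),
# next to the in-built chunks and the reviewed gold chunks.
record = {
    "0": "The United States respects the people of Afghanistan.",
    "orig_chunks": [{"text": "The United States"}, {"text": "the people"}],
    "gold_chunks": [
        {"text": "United States", "notes": ""},
        {"text": "people of Afghanistan", "notes": ""},
    ],
}

def write_jsonl(filepath, records):
    """Write one JSON object per line."""
    with open(filepath, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(filepath):
    """Read every non-empty line back as a JSON object."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# round-trip the record through a temporary file
filepath = os.path.join(tempfile.mkdtemp(), "example_gold.jsonl")
write_jsonl(filepath, [record])
loaded = read_jsonl(filepath)

print(loaded[0]["0"])
print([chunk["text"] for chunk in loaded[0]["gold_chunks"]])
```

Keeping one sentence per line means the loop can rewrite the file after every accepted sentence and resume from the saved index.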

In [None]:
import os
import importlib
import json
import jsonlines

from IPython.display import clear_output

import pandas as pd

from spacy import displacy

import pipeline
importlib.reload(pipeline)
import cndutils as ut
import visuals
importlib.reload(visuals)

def get_gold(doc, span):
    
    gold_start = None
    gold_end = None
    
    word = span.root

    gold_start = word.left_edge.i
    if doc[gold_start].pos_ == "DET":
        gold_start += 1

    if word.conjuncts and word.dep_ != "conj":
        gold_end = word.i + 1 # word is a conjunction head
    elif word.dep_ == "conj" and list(word.rights) and "conj" in [t.dep_ for t in word.rights]:
        gold_end = word.i + 1 # word is a sub-conjunction head
    elif word.dep_ in ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT", "conj"]:
        gold_end = word.right_edge.i + 1 # capture the full noun chunk
        for index in range(gold_start, gold_end):
            if doc[index].text in [",", "--"] or doc[index].pos_ in ["SCONJ", "CCONJ"] or doc[index].dep_ in ["cc"]: # split noun chunks comprising lists
                gold_end = index
                break
        
    return gold_start, gold_end

#################################
# Initialise
#################################

path = os.getcwd()
test_jsonl = ""
cust_jsonl = ""
index_str = "index.json"
test_filepath = os.path.join(path, test_jsonl)
cust_filepath = os.path.join(path, cust_jsonl)
index_filepath = os.path.join(path, index_str)

with jsonlines.open(test_filepath) as f:
    test_chunks = list(f.iter())
    
try:  
    with jsonlines.open(cust_filepath) as f:
        cust_chunk_list = list(f.iter())
    if len(test_chunks) == 0:
        cust_chunk_list = list()    

except Exception: # no saved chunks yet, start from an empty list
    cust_chunk_list = list()

try:
    with open(index_filepath, "r") as index_json:
        index = json.load(index_json)
        
except Exception: # no saved index yet, start from the first sentence
    index = 0
    
lookup = pipeline.ConceptMatcher(cnd.nlp)
    
#################################
# main body
#################################
    
while index < len(test_chunks):
    
    line = test_chunks[index]
            
    with open(index_filepath, "wb") as f:
        f.write(json.dumps(index).encode("utf-8"))
    
    clear_output(wait=True)
  
    #parse document
    doc = cnd(line[str(index)])
    
    # add the original and indexed noun_chunks to the line
    line["orig_chunks"] = visuals.chunk_custom_attrs(list(doc.noun_chunks), json = True)
    line["gold_chunks"] = visuals.chunk_custom_attrs(list(doc._.custom_chunks), json = True)
    
    for chunk in line["gold_chunks"]:
        if not chunk["CONCEPT"] and chunk["span_type"] == "GPE":
            chunk["CONCEPT"] = "TERRITORY"
            chunk["ATTRIBUTE"] = get.get_attribute(chunk["CONCEPT"])
            chunk["IDEOLOGY"] = get.get_ideology(chunk["CONCEPT"])
        
        if not chunk["CONCEPT"] and chunk["span_type"] == "LOC":
            chunk["CONCEPT"] = "PLACE"
            chunk["ATTRIBUTE"] = get.get_attribute(chunk["CONCEPT"])
            chunk["IDEOLOGY"] = get.get_ideology(chunk["CONCEPT"])
        
        if not chunk["CONCEPT"] and chunk["text"].lower() in ["americans"]:
            chunk["CONCEPT"] = "SOCGROUP"
            chunk["ATTRIBUTE"] = get.get_attribute(chunk["CONCEPT"])
            chunk["IDEOLOGY"] = get.get_ideology(chunk["CONCEPT"])
            
        if not chunk["CONCEPT"] and chunk["text"].lower() in ["muslims"]:
            chunk["CONCEPT"] = "RELGROUP"
            chunk["ATTRIBUTE"] = get.get_attribute(chunk["CONCEPT"])
            chunk["IDEOLOGY"] = get.get_ideology(chunk["CONCEPT"])
            
        if not chunk["CONCEPT"] and chunk["span_type"] in ["DEITY"]:
            chunk["CONCEPT"] = "RELFIGURE"
            chunk["ATTRIBUTE"] = get.get_attribute(chunk["CONCEPT"])
            chunk["IDEOLOGY"] = get.get_ideology(chunk["CONCEPT"])
    
    while True:

        clear_output(wait = True)
        
        # display dependency parse
        options = {"compact": True}
        displacy.render(doc, style = "dep", options=options)

        # display sentence attributes
        display(visuals.sent_frame(doc, extend = True))

        # display both original and gold chunk attributes
        print("<<< original chunks >>>")
        display(pd.DataFrame(line["orig_chunks"]).T)
        
        # print sentence text
        print(f"{index} / {len(test_chunks)}")
        print(doc.text)
        
        print("<<< custom chunks >>>")
        display(pd.DataFrame(line["gold_chunks"]).T)

        check = ""
        check = input("satisfied? (y/q)").lower()
        if check == "y":
            
            cust_chunk_list.append(line)

            #write jsonl object to disk
            with jsonlines.open(os.path.join(path, cust_filepath), 'w') as writer:
                writer.write_all(cust_chunk_list)
        
            index += 1
            break
        
        elif check == "q":
            raise SystemExit("Stop right there!")
            
        line["gold_chunks"].clear()
        
        for chunk in doc.noun_chunks:
            
            gold_span = Span(doc, chunk.start, chunk.end, label = "CC")
            if gold_span.root._.CONCEPT:
                gold_span._.CONCEPT = gold_span.root._.CONCEPT
            else:
                gold_span._.CONCEPT = gold_span._.modifier._.CONCEPT
            
            notes = "" # initialise once per chunk so notes entered during revision are kept
            while True:
                display(visuals.chunk_custom_attrs([gold_span]))
                gold_start, gold_end = get_gold(doc, gold_span)
                print("gold_span:", doc[gold_start : gold_end].text)
                print()

                check = input("satisfied? (y/q)").lower()

                if check == "y":
                    line["gold_chunks"].append(*visuals.chunk_custom_attrs([gold_span], json = True))
                    line["gold_chunks"][-1]["notes"] = notes
                    break
                elif check == "q":
                    raise SystemExit("Stop right there!")

                # get new custom span
                entry = "s"

                while entry in ["el", "er", "dl", "dr", "s", "q"]:
                    text = doc[gold_start : gold_end].text
                    # el = expand left (subtract 1 from new_start)
                    # er = expand right (add 1 to new_end)
                    # dl = decrease left (add 1 to new_start)
                    # dr = decrease right (subtract 1 from new_end)
                    # sk = skip chunk
                    entry = input(f'new chunk text <{text}> (el) (er) (dl), (dr), (q), (sk)')
                    if len(entry) == 0:
                        break
                    elif entry == "el":
                        gold_start -= 1
                    elif entry == "er":
                        gold_end += 1
                    elif entry == "dl":
                        gold_start += 1
                    elif entry == "dr":
                        gold_end -= 1
                    elif entry == "q":
                        raise SystemExit("Stop right there")
                    elif entry == "sk":
                        break
                
                if entry == "sk":
                    break
                        
                gold_span = Span(doc, gold_start, gold_end, label = "CC")

                # get new span_type
                cust_span_type = get.get_span_type(gold_span)
                gold_span._.span_type = input(f'new span_type [{cust_span_type}]').lower()
                if len(gold_span._.span_type) == 0:
                    gold_span._.span_type = cust_span_type

                #get modifier
                cust_modifier = get.get_span_modifier(gold_span)
                gold_span._.modifier = input(f'new modifier [{cust_modifier}]').lower()
                if len(gold_span._.modifier) == 0:
                    gold_span._.modifier = cust_modifier

                # get new concept    
                if gold_span._.CONCEPT:
                    cust_concept = gold_span._.CONCEPT
                else:                    
                    cust_concept = gold_span._.modifier._.CONCEPT
                    
                if gold_span._.span_type == "GPE":
                    cust_concept = "TERRITORY"
                if gold_span._.span_type == "NORP":
                    cust_concept = "SOCIALGROUP"
                gold_span._.CONCEPT = input(f'concept [{gold_span.root}: {cust_concept}]:').upper()
                if len(gold_span._.CONCEPT) == 0:
                    gold_span._.CONCEPT = cust_concept

                # get new notes
                notes = input("notes:")
                if len(notes) > 0 and notes[-1] != ".":
                    notes += "."

#### Annotation Notes

Phrases do not conform to a Hearst Pattern
- "The United States respects the people of Afghanistan."
- "America has no truer friend than Great Britain."
- "The enemy of America is not our many Muslim friends; it is not our many Arab friends."

The current chunk is "friend"; the preferred chunk is "friend to the Afghan people".
It is not picked up because of the `doc[adp_i].text not in ["to", "in"]` clause in the ADP head detection.
- "The United States of America is a friend to the Afghan people, and we are the friends of almost a billion worldwide who practice the Islamic faith."

Notable sentences
- the noun phrase "nations from every continent on the Earth" is split between two dependency trees
-- "Our staunch friends, Great Britain, our neighbors Canada and Mexico, our NATO allies, our allies in Asia, Russia and nations from every continent on the Earth have offered help of one kind or another -- from military assistance to intelligence information, to crack down on terrorists' financial networks."
-- "America and Afghanistan are now allies against terror."
-- "A terrorist underworld -- including groups like Hamas, Hezbollah, Islamic Jihad, Jaish-i-Mohammed -- operates in remote jungles and deserts, and hides in the centers of large cities."

Neither method picks up
- "my fellow americans" : "And in this great conflict, my fellow Americans, we will see freedom's victory."
- "Usama bin Laden" : "This group and its leader -- a person named Usama bin Laden -- are linked to many other organizations in different countries, including the Egyptian Islamic Jihad and the Islamic Movement of Uzbekistan."

Linked through appos dependency
- "We are joined in this operation by our staunch friend, Great Britain."

"Close" is annotated as an ADJ and not VERB.
- "More than two weeks ago, I gave Taliban leaders a series of clear and specific demands: Close terrorist training camps; hand over leaders of the Al Qaeda network; and return all foreign nationals, including American citizens, unjustly detained in your country."

How to classify the term "nuclear" so that its severity is captured without overstating it

"weapons of mass destruction" is split across the dependency tree
- "North Korea is a regime arming with missiles and weapons of mass destruction, while starving its citizens."

"Ulema" is recorded as an "ADJ" when it should be a "NOUN"
- "In the light of the reality we are going through and the blessed, sweeping awakening in the world at large and in the Islamic world in particular, I meet with you today after a long absence dictated by the unjust crusade campaign led by the United States against the ulema and advocates of Islam to prevent them from instigating the Islamic nation against its enemies, as did their predecessors, may God have mercy on their souls, such as Ibn-Taymiyah and al-'Izz Ibn-'Abd-al-Salam."

"Military" is recorded as an "ADJ" when it should be a "NOUN"
- "We want to study the ways which could be used to rectify matters and restore rights to their owners as people have been subjected to grave danger and harm to their religion and their lives, people of all walks of life, civilians, military, security men, employees, merchants, people big and small, school and university students, and unemployed university graduates, in fact hundreds of thousands who constitute a broad sector of the society."

not picking up "infighting"
- "The Muslims are reminded that they should avoid infighting between sons of the Muslim nation because that will have dire consequences, the most important being:"

should be "presence of the crusader" and "presence of the American military forces"
- "Destruction of the oil industries, because the presence of the crusader and American military forces in the Islamic Gulf states, on land, in the air, and at sea, represents the greatest danger and harm and the greatest threat to the largest oil reserves in the world."

should be "killing of innocent women" and "killing of innocent children"
- "And that day, it was confirmed to me that oppression and the intentional killing of innocent women and children is a deliberate American policy."
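
The last two notes, and the "Sons and daughters of Islam" test sentence above, all ask for a phrase shared by several conjuncts to be repeated for each one. The target output can be sketched on plain token lists. This is only an illustration of the desired result, not part of the chunker: `distribute_over_conjuncts` is a hypothetical helper, and working out the shared tokens and the conjuncts from the dependency parse is the hard part that it leaves out.

```python
def distribute_over_conjuncts(conjuncts, prefix=(), suffix=()):
    """
    Attach shared prefix and suffix tokens to every conjunct.
    Each argument is a sequence of token strings; one phrase is
    returned per conjunct.
    """
    return [" ".join([*prefix, *conjunct, *suffix]) for conjunct in conjuncts]

# shared left-hand phrase: "killing of innocent women and children"
killing = distribute_over_conjuncts(
    conjuncts=[["women"], ["children"]],
    prefix=["killing", "of", "innocent"],
)

# shared right-hand phrase: "Sons and daughters of Islam"
sons = distribute_over_conjuncts(
    conjuncts=[["sons"], ["daughters"]],
    suffix=["of", "Islam"],
)

print(killing)
print(sons)
```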

### Test the results

Compute the success rate of the new noun chunker against the gold data and compare it with that of the in-built noun chunker.
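
The measure is exact-text matching: a predicted chunk counts as a success when its text equals a gold chunk that has not already been matched. The sketch below shows this on made-up chunk lists. The function name `success_rate`, its rounding to one decimal place, and the example texts are assumptions for illustration, and the cell that follows differs in its details.

```python
from collections import Counter

def success_rate(predicted, gold):
    """
    Percentage of gold chunk texts matched exactly by the predictions.
    Each gold chunk can be matched at most once.
    """
    if not gold:
        return 0.0
    remaining = Counter(gold)
    matched = 0
    for text in predicted:
        if remaining[text] > 0:
            remaining[text] -= 1
            matched += 1
    return round(100 * matched / len(gold), 1)

# placeholder chunk texts, not real chunker output
gold = ["America", "friend", "Great Britain"]
inbuilt = ["America", "no truer friend", "Great Britain"]
custom = ["America", "friend", "Great Britain"]

print(f"in-built success rate: {success_rate(inbuilt, gold)}%")
print(f"custom success rate: {success_rate(custom, gold)}%")
```

Using a `Counter` keeps repeated chunk texts in one sentence from being counted more often than they occur in the gold data.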

In [None]:
import os
import json
import jsonlines
import pandas as pd
import visuals
from tqdm import tqdm

from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

path = os.getcwd()
cust_jsonl = "cust_chunks.jsonl"
cust_filepath = os.path.join(path, cust_jsonl)

with jsonlines.open(cust_filepath) as f:
    gold_chunk_list = list(f.iter())
    
custom_success = 0
orig_success = 0
failure = 0
total_chunks = 0
results = []

for i, chunk_list in tqdm(enumerate(gold_chunk_list), total = len(gold_chunk_list)):

    gold_list = [gold_chunk["text"] for gold_chunk in chunk_list["gold_chunks"]]
    total_chunks += len(gold_list)

    doc = cnd(chunk_list[str(i)]) 
    cust_chunks = visuals.chunk_custom_attrs(list(doc._.custom_chunks), json = True)
    orig_chunks = visuals.chunk_custom_attrs(list(doc.noun_chunks), json = True)

    for orig_chunk in orig_chunks:
        if orig_chunk["text"] in gold_list:
            orig_success += 1
    
    error_list = []

    for cust_chunk in cust_chunks:
        if cust_chunk["text"] in gold_list:
            custom_success +=1
            gold_list.remove(cust_chunk["text"])
        else:
            error_list.append(cust_chunk["text"])

    if gold_list or error_list:
        results.append((chunk_list[str(i)], gold_list, error_list))
        
print()
success_rate = lambda part, whole:round(100 * (float(part) / float(whole)), 0)
original = success_rate(orig_success, total_chunks)
custom = success_rate(custom_success, total_chunks)
print(f'in-built success rate: {original}%')
print(f'custom success rate: {custom}%')
print(f'a {custom - original} percentage point improvement')
print()
        
for sentence, gold_list, error_list in results:
    print(sentence)
    display_side_by_side(pd.DataFrame(gold_list, columns = ["Gold Chunks"]), pd.DataFrame(error_list, columns = ["Custom Chunks"]))
    print()