# Custom Noun Chunking
-----

spaCy's built-in `noun_chunks` iterator is too coarse-grained for the chunking needed to detect ingroups and outgroups.

For the purposes of the methodology, a more fine-grained noun chunking algorithm is required.

In several of the test ingroup and outgroup sentences, named entities are chunked together with other nouns when they would preferably be kept separate.

There are also several examples where a noun chunk contains more than one noun carrying a custom attribute, so the chunk needs to be resolved to a single instance.

This notebook adapts spaCy's `noun_chunks` source code for the specific purpose of this pipeline.

Source code at these links:

- Noun chunker code: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py
- Class extensions (`Doc`): https://github.com/explosion/spaCy/blob/9ce059dd067ecc3f097d04023e3cfa0d70d35bb8/spacy/tokens/doc.pyx
- Class extensions (`Span`): https://github.com/explosion/spaCy/blob/f49e2810e6ea5c8b848df5b0f393c27ee31bb7f4/spacy/tokens/span.pyx


## Existing spaCy code for noun chunks

In [None]:
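# Names the copied spaCy source relies on
from spacy.errors import Errors
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):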
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    
    source code: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py
    """
    labels = [
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
    ]
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    
    prev_end = -1
    
    for i, word in enumerate(doclike):
        
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label

def merge_named_concepts(doc):

    """Merge named concepts into a single token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged named concepts.

    Adapts the spacy merge noun chunks function
    code: https://github.com/explosion/spaCy/blob/master/spacy/pipeline/functions.py
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for span in doc._.named_concepts:
            attrs = {
                    "tag": span.root.tag, 
                    "dep": span.root.dep
                    }

            retokenizer.merge(span, attrs=attrs)
    
    return doc

The crux of this code is the section below. The purpose of this notebook is to determine how it should be modified to create more fine-grained noun chunks with the correct named concepts:

```python
if word.pos not in (NOUN, PROPN, PRON):
    continue
# Prevent nested chunks from being produced
if word.left_edge.i <= prev_end:
    continue
if word.dep in np_deps:
    prev_end = word.i
    yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
    head = word.head
    while head.dep == conj and head.head.i < head.i:
        head = head.head
    # If the head is an NP, and we're coordinated to it, we're an NP
    if head.dep in np_deps:
        prev_end = word.i
        yield word.left_edge.i, word.i + 1, np_label
```

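`np_deps` is a list of the dependency labels under which a noun token is treated as the root of a noun phrase.

`prev_end` is the index of the root of the most recently yielded chunk; it ensures that subsequent chunks do not overlap with existing chunks.

`word.left_edge.i` is the index of the leftmost token in the root token's subtree, so each chunk runs from that left edge up to and including the root token.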
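How these three pieces interact can be seen in a small pure-Python sketch. It is not spaCy code: `Tok` is a hypothetical stand-in for a spaCy token, the `pos`, `dep` and `left_edge` values are hand-copied from the parse of "the commitment of our Fathers" shown later in this notebook, and the `conj` branch is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: index, text, coarse POS,
# dependency label, and the index of the leftmost token in its subtree.
Tok = namedtuple("Tok", "i text pos dep left_edge")

NP_DEPS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj",
           "dative", "appos", "attr", "ROOT"}


def toy_noun_chunks(tokens):
    """Yield (start, end) pairs using the left-edge rule (no conj branch)."""
    prev_end = -1
    for word in tokens:
        if word.pos not in ("NOUN", "PROPN", "PRON"):
            continue
        # Skip a noun whose subtree starts inside a chunk already emitted.
        if word.left_edge <= prev_end:
            continue
        if word.dep in NP_DEPS:
            prev_end = word.i
            yield word.left_edge, word.i + 1


# Hand-written parse values (an assumption, copied from the output below).
tokens = [
    Tok(0, "the", "DET", "det", 0),
    Tok(1, "commitment", "NOUN", "nsubj", 0),
    Tok(2, "of", "ADP", "prep", 2),
    Tok(3, "our", "DET", "poss", 3),
    Tok(4, "Fathers", "NOUN", "pobj", 3),
]

chunks = [" ".join(t.text for t in tokens[s:e]) for s, e in toy_noun_chunks(tokens)]
print(chunks)
```

Both chunks keep their determiner or possessive, because the span always starts at the left edge of the subtree.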

Where `word.left_edge.i` is too coarse-grained, the custom chunker will apply additional tests to produce more fine-grained chunks.

Rightward-facing noun chunks are also of interest, for example:
- "weapons of mass destruction": with weapon as the root, the chunk is rightward facing.
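One possible way to capture a rightward-facing chunk is to grow the span from the root through its prepositional attachments. The sketch below is only an illustration under assumptions: `Tok` is a hypothetical stand-in for a spaCy token, the parse of "weapons of mass destruction exist" is hand-written, and the set of labels to follow (`prep`, `pobj`, `compound`) is a guess that would need tuning against the test sentences.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: index, text, dependency label,
# and the index of its head token.
Tok = namedtuple("Tok", "i text dep head")


def right_extent(tokens, root_i, follow=("prep", "pobj", "compound")):
    """Return the exclusive end index of a chunk grown rightwards from root_i.

    A token to the right of the root joins the chunk when its head is
    already a member and its dependency label is in `follow`.
    """
    members = {root_i}
    grew = True
    while grew:  # repeat until no more tokens can be added
        grew = False
        for tok in tokens[root_i + 1:]:
            if tok.i not in members and tok.head in members and tok.dep in follow:
                members.add(tok.i)
                grew = True
    # Keep only the contiguous run that starts at the root.
    end = root_i + 1
    while end in members:
        end += 1
    return end


# Hand-written parse (an assumption, not spaCy output).
tokens = [
    Tok(0, "weapons", "nsubj", 4),
    Tok(1, "of", "prep", 0),
    Tok(2, "mass", "compound", 3),
    Tok(3, "destruction", "pobj", 1),
    Tok(4, "exist", "ROOT", 4),
]

end = right_extent(tokens, root_i=0)
phrase = " ".join(t.text for t in tokens[0:end])
print(phrase)
```

Following only a fixed set of labels keeps the chunk from swallowing relative clauses or other long right-hand dependents.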

There are noun chunks containing multiple tokens of interest that need to be resolved to a single annotation, for example:
- "the occupying American enemy": needs to be resolved to a merged noun chunk annotated as an outgroup
- "the alliance of Jews, Christians, and their agents": with alliance as the root, this is a rightwards facing group noun chunk
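Resolving a chunk to a single annotation needs a precedence rule, which this notebook has not yet fixed. The sketch below assumes one hypothetical rule: the root token's attribute wins, and if the root has none, the nearest annotated token to its left is used. The function name `resolve_attribute` and the attribute values attached to each word are illustrative assumptions, not pipeline output.

```python
def resolve_attribute(chunk, root_index):
    """Pick one attribute for a chunk given as a list of (text, attribute) pairs.

    Assumed rule: prefer the root's attribute, otherwise fall back to the
    closest annotated token to the left of the root.
    """
    root_attr = chunk[root_index][1]
    if root_attr:
        return root_attr
    for _text, attr in reversed(chunk[:root_index]):
        if attr:
            return attr
    return ""


# Illustrative annotations (assumed values).
enemy_chunk = [("the", ""), ("occupying", ""), ("American", "ingroup"), ("enemy", "outgroup")]
fellow_chunk = [("our", ""), ("fellow", "ingroup"), ("Americans", "")]

resolved = resolve_attribute(enemy_chunk, root_index=3)
fallback = resolve_attribute(fellow_chunk, root_index=2)
print(resolved, fallback)
```

Under this rule "the occupying American enemy" resolves to the root's annotation even though a modifier carries a conflicting one.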

Additional functionality for custom attributes will have to be added, and predicate terms need to be removed for the Hearst pattern detection algorithm.

## Test data

In [5]:
## create a dict object of all the ingroup/outgroup sentences
import os
import cndutils as ut
path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"

sent_dict = dict()
jsonl_files = [f for f in os.listdir(path) if os.path.splitext(f)[1] == ".jsonl" and "group" in f]
for file in jsonl_files:
    data_list = ut.load_jsonl(os.path.join(path, file))
    for entry in data_list:
        for value in entry.values():
            sent_dict[len(sent_dict)] = value
            
print(jsonl_files)

Loaded 66 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\bush_ingroup_sents.jsonl
Loaded 37 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\bush_outgroup_sents.jsonl
Loaded 41 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\laden_ingroup_sents.jsonl
Loaded 113 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\laden_outgroup_sents.jsonl
['bush_ingroup_sents.jsonl', 'bush_outgroup_sents.jsonl', 'laden_ingroup_sents.jsonl', 'laden_outgroup_sents.jsonl']
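`ut.load_jsonl` is not shown in this notebook. As a rough stand-in, assuming each file holds one JSON object per line, a loader could look like the sketch below. The message format is copied from the output above; the real helper may differ, and the two demo sentences are made up for the example.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    print(f"Loaded {len(records)} records from {path}")
    return records


# Demonstration on a throwaway file with made-up sentences.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ingroup_sents.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"0": "We are a generous people."}\n')
        f.write('{"1": "They attacked us."}\n')

    # Same flattening as the cell above: one running index across all files.
    demo_sents = {}
    for entry in load_jsonl(demo_path):
        for value in entry.values():
            demo_sents[len(demo_sents)] = value

print(demo_sents)
```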


## Instantiate Pipeline

In [1]:
%%time

import spacy
import pipeline
cnd = pipeline.CND()

# merge_nps = cnd.nlp.create_pipe("merge_noun_chunks")
# cnd.nlp.add_pipe(merge_nps)

print([pipe for pipe in cnd.nlp.pipe_names])

['tagger', 'parser', 'ner', 'Named Entity Matcher', 'merge_entities', 'Concept Matcher']
Wall time: 29.8 s


## Iterate through Test Data for Sentences of Interest

These sentences will be used to tune the existing spaCy noun chunker for the purposes of this methodology.

In [None]:
import importlib
import cndutils
importlib.reload(cndutils)


path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"
ss = cndutils.sent_select(path = path, file = "test_sents")
output = ss(cnd.nlp, sent_dict)

        
    
# remove dets
    

## Review the noun chunks of each sentence of interest to ascertain the required changes

Iterate over each sentence and review each noun chunk to determine the desired noun chunk, taking notes on what modifications to the noun chunk doc extension are required.

In [2]:
from spacy.errors import Errors  # needed for Errors.E029 below
from spacy.tokens import Doc

def custom_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    
    source code: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py
    """
    labels = [
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
    ]
    
    notes = ""
    
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    
    prev_end = -1
    
    for i, word in enumerate(doclike):
        
        if word.pos_ not in ["NOUN", "PROPN", "PRON"]:
            continue
       
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        
        if word.dep in np_deps:
            prev_end = word.i
            
            # test whether noun is left or right facing            
            # for named entities followed by a noun, merge attrs in merge custom chunks
            # append compounds            
            # remove det
            
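            include  = ["compound"]  # not used yet
            exclude = ["det", "poss"]
            left_edge_index = word.left_edge.i
            
            # Walk leftwards from the root and stop before the first
            # determiner or possessive, or at the left edge of the subtree.
            # word.i is used instead of the enumerate index so that this
            # also works when doclike is a Span.
            n = 0
            while left_edge_index < word.i + n:
                if doc[word.i + n - 1].dep_ in exclude:
                    break
                n -= 1

            yield doc[word.i + n : word.i + 1]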
            
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                prev_end = word.i
                yield doc[word.left_edge.i : word.i + 1]

Doc.set_extension("custom_chunks", getter = custom_chunks, force = True)

text = "In this trial, we have been reminded and the world has seen that our fellow Americans are generous and kind, resourceful and brave."
doc = cnd(text)

print(list(doc._.custom_chunks))
print(doc[16])
for t in doc[16].lefts:
    if t.dep_ in ["amod"]:
        print(t, '=>', t._.CONCEPT)
        
lookup = pipeline.ConceptMatcher(cnd.nlp)
lookup.get_concept("they")

[trial, we, world, fellow Americans]
Americans
fellow => AFFILIATE


''
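The left-trimming loop in `custom_chunks` can be checked in isolation on plain lists of dependency labels. The sketch below uses a hypothetical helper name, `chunk_start`, and hand-copied labels: "fellow" is `amod` according to the output above, the labels for "freedom's home" come from the sentence table later in this notebook, and `poss` for "our" is an assumption.

```python
def chunk_start(deps, root_i, left_edge_i, exclude=("det", "poss")):
    """Walk left from the root and stop before the first excluded dependency."""
    start = root_i
    while start > left_edge_i and deps[start - 1] not in exclude:
        start -= 1
    return start


def trimmed(words, deps, root_i, left_edge_i, exclude=("det", "poss")):
    """Return the chunk text that remains after trimming on the left."""
    start = chunk_start(deps, root_i, left_edge_i, exclude)
    return " ".join(words[start:root_i + 1])


a = trimmed(["our", "fellow", "Americans"], ["poss", "amod", "nsubj"], 2, 0)
b = trimmed(["freedom", "'s", "home"], ["poss", "case", "attr"], 2, 0)
# Variant: also treat the possessive marker as a stopping point.
c = trimmed(["freedom", "'s", "home"], ["poss", "case", "attr"], 2, 0,
            exclude=("det", "poss", "case"))
print(a, "|", b, "|", c)
```

The second case shows a side effect of stopping only on `det` and `poss`: the possessive marker `'s` is left dangling at the start of the chunk. Adding `case` to the excluded labels is one possible remedy.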

In [3]:
import os
import json
import jsonlines
from itertools import zip_longest

from IPython.display import clear_output

import pandas as pd

from spacy import displacy

import pipeline
import cndutils as ut
import visuals as viz


#################################
# chunk attrs
#################################

def chunk_attrs(token):
    
    attrs = ["text", "root", "i", "concept", "attribute", "ideology"]
    
    root = ""
    root = token.root.lemma_.lower() 
    if root == "-pron-":
        root = token.text.lower()

    return dict(zip(attrs, [str(token).lower(), root.lower(), str(token.root.i), \
                               str(token._.CONCEPT), str(token._.ATTRIBUTE), str(token._.IDEOLOGY)]))

################################
# get notes
################################
   
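def get_notes(index):
    """Note any determiner or possessive between a chunk root and its left edge.

    Relies on the module-level `doc` set in the main loop below.
    """
    notes = ''
    exclude = ["det", "poss"]
    left_edge_index = doc[index].left_edge.i

    # Walk leftwards from the root token to the left edge of its subtree;
    # the leftmost excluded dependency found is the one reported.
    for i in range(index, left_edge_index - 1, -1):
        if doc[i].dep_ in exclude:
            notes = f"exclude {doc[i].dep_}. "

    return notes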

path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"
test_jsonl = "test_chunks.jsonl"
cust_jsonl = "cust_chunks.jsonl"
index_str = "index.json"
test_filepath = os.path.join(path, test_jsonl)
cust_filepath = os.path.join(path, cust_jsonl)
index_filepath = os.path.join(path, index_str)

with jsonlines.open(test_filepath) as f:
    test_chunks = list(f.iter())
    
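# Resume from a previous session if annotated chunks already exist on disk.
try:  
    with jsonlines.open(cust_filepath) as f:
        cust_chunk_list = list(f.iter())
    if len(test_chunks) == 0:
        cust_chunk_list = list()    

except (OSError, ValueError):  # missing file or invalid line
    cust_chunk_list = list()

# Resume at the last saved sentence index, or start from the beginning.
try:
    with open(index_filepath, "r") as index_json:
        index = json.load(index_json)
        
except (OSError, ValueError):  # missing file or invalid JSON
    index = 0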
    
lookup = pipeline.ConceptMatcher(cnd.nlp)
    
#################################
# main body
#################################
    
while index < len(test_chunks):
    
    quit = False
    
    line = test_chunks[index]
            
    with open(index_filepath, "wb") as f:
        f.write(json.dumps(index).encode("utf-8"))
    
    clear_output(wait=True)
  
    #parse document
    doc = cnd(line[str(index)])
    
    # add the original and indexed noun_chunks to the line
        
    line["orig_chunks"] = {k : chunk_attrs(v) for k, v in enumerate(doc.noun_chunks)}
    line["new_chunks"] = {}
    
    new_chunk_dict = dict()
    stop = False

    for key, orig, new in zip_longest(range(len(list(doc.noun_chunks))), doc.noun_chunks, doc._.custom_chunks):

        satisfied = False
        while not satisfied:

            clear_output(wait = True)
            
            displacy.render(doc, style = "dep")

            display(viz.sent_frame(doc, compact = False))

            attrs = ["text", "root", "i", "concept", "attribute", "ideology"]
            
            print(doc.text)

            display(pd.DataFrame(line["orig_chunks"]))

            new_text = ""
            new_notes = ""
            new_root = ""
            new_i = ""
            new_concept = ""
            new_attribute = ""
            new_ideology = ""

            #get new string
            cust_text = new.text
            new_text = input(f"new chunk text [{cust_text}] (q): ").lower()
            if new_text == "q":
                raise SystemExit("Stop right there!")
            if len(new_text) == 0:
                new_text = cust_text                

            #get new root
            cust_root = new.root.lemma_.lower()
            if cust_root.lower() == "-pron-":
                cust_root = new.text
            new_root = input(f"new root text [{cust_root}]: ").lower()
            if len(new_root) == 0:
                new_root = cust_root

            #get new i
            cust_i = new.root.i
            new_i = input(f"idx [{cust_i}]: ").strip()
            # keep the index an int whether it was typed or defaulted
            new_i = int(new_i) if new_i else cust_i

           # get new concept
            concept_lookup = lookup.get_concept(new_root)
            
###########################
#do this first, get the attributes of an immediate modifier token
#             ## get the custom attributes of any amod token
#             print("lefts: ", list(doc[new_i].lefts))
#             for t in doc[new_i].lefts:
#                 if t.dep_ in ["amod"]:
#                     concept_lookup = t._.CONCEPT
#             print(new_root)
            
            new_concept = input(f"concept [{concept_lookup}]:").upper()
            if len(new_concept) == 0:
                new_concept = concept_lookup

            # get new attribute
            attribute_lookup = lookup.get_attribute(new_concept.lower())
            new_attribute = input(f"attribute [{attribute_lookup}]: ").lower()
            if len(new_attribute) == 0:
                new_attribute = attribute_lookup

            # get new ideology
            ideology_lookup = lookup.get_ideology(new_concept.lower())
            new_ideology = input(f"ideology [{ideology_lookup}]: ").lower()
            if len(new_ideology) == 0:
                new_ideology = ideology_lookup
                
            new_notes = get_notes(new.root.i)
            note = input(f"notes [{new_notes}]")
            new_notes += note
            if len(note) > 0 and note[-1] != ".":
                new_notes += "."

            new_chunk_dict[key] = {"text" : new_text, "root" : new_root, "idx" : new_i, \
                                   "concept" : new_concept, "attribute" : new_attribute, "ideology" : new_ideology, "notes" : new_notes}

            display(pd.DataFrame([new_chunk_dict[key]]))

            check = ""
            while check not in ["y", "n", "q"]:
                check = input("satisfied").lower()
                if check == "y":
                    satisfied = True
                if check == "q":
                    raise SystemExit("Stop right there!")
        
        # append the original and new chunk dicts to jsonl object in json readable format   

    line.update({"new_chunks" : ut.doubleQuoteDict(new_chunk_dict)})
    cust_chunk_list.append(line)
    
    #write jsonl object to disk
    with jsonlines.open(cust_filepath, 'w') as writer:  # cust_filepath is already a full path
        writer.write_all(cust_chunk_list)
        
    index += 1
                

{'1': "They have attacked America because we are freedom's home and defender, and the commitment of our Fathers is now the calling of our time.", 'orig_chunks': {0: {'text': 'they', 'root': 'they', 'i': '0', 'concept': '', 'attribute': '', 'ideology': ''}, 1: {'text': 'america', 'root': 'america', 'i': '3', 'concept': '', 'attribute': '', 'ideology': ''}, 2: {'text': 'we', 'root': 'we', 'i': '5', 'concept': '', 'attribute': '', 'ideology': ''}, 3: {'text': "freedom's home", 'root': 'home', 'i': '9', 'concept': '', 'attribute': '', 'ideology': ''}, 4: {'text': 'defender', 'root': 'defender', 'i': '11', 'concept': 'ARMEDGROUP', 'attribute': 'identity', 'ideology': 'military'}, 5: {'text': 'the commitment', 'root': 'commitment', 'i': '15', 'concept': '', 'attribute': '', 'ideology': ''}, 6: {'text': 'our fathers', 'root': 'father', 'i': '18', 'concept': '', 'attribute': '', 'ideology': ''}, 7: {'text': 'the calling', 'root': 'calling', 'i': '22', 'concept': '', 'attribute': '', 'ideology'

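|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text | They | have | attacked | America | because | we | are | freedom | 's | home | and | defender | , | and | the | commitment | of | our | Fathers | is | now | the | calling | of | our | time | . |
| lemma | -PRON- | have | attack | America | because | -PRON- | be | freedom | 's | home | and | defender | , | and | the | commitment | of | -PRON- | father | be | now | the | calling | of | -PRON- | time | . |
| ent_type |  |  |  | GPE |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| pos | PRON | AUX | VERB | PROPN | SCONJ | PRON | AUX | NOUN | PART | NOUN | CCONJ | NOUN | PUNCT | CCONJ | DET | NOUN | ADP | DET | NOUN | AUX | ADV | DET | NOUN | ADP | DET | NOUN | PUNCT |
| tag | PRP | VBP | VBN | NNP | IN | PRP | VBP | NN | POS | NN | CC | NN | , | CC | DT | NN | IN | PRP$ | NNS | VBZ | RB | DT | NN | IN | PRP$ | NN | . |
| dep | nsubj | aux | ROOT | dobj | mark | nsubj | advcl | poss | case | attr | cc | conj | punct | cc | det | nsubj | prep | poss | pobj | conj | advmod | det | attr | prep | poss | pobj | punct |
| concept |  |  | MILACTION |  |  |  |  | BENEVOLANCE |  | PLACE |  | ARMEDGROUP |  |  |  |  |  |  | FAMILY |  |  |  |  |  |  |  |  |
| attribute |  |  | trade |  |  |  |  | trade |  | entity |  | identity |  |  |  |  |  |  | ingroup |  |  |  |  |  |  |  |  |
| ideology |  |  | military |  |  |  |  | social |  | social |  | military |  |  |  |  |  |  | social |  |  |  |  |  |  |  |  |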


They have attacked America because we are freedom's home and defender, and the commitment of our Fathers is now the calling of our time.


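|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| text | they | america | we | freedom's home | defender | the commitment | our fathers | the calling | our time |
| root | they | america | we | home | defender | commitment | father | calling | time |
| i | 0 | 3 | 5 | 9 | 11 | 15 | 18 | 22 | 25 |
| concept |  |  |  |  | ARMEDGROUP |  |  |  |  |
| attribute |  |  |  |  | identity |  |  |  |  |
| ideology |  |  |  |  | military |  |  |  |  |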


new chunk text [we] (q):  
notes [] 
new root text [we]:  
idx [5]:  


lefts:  []
we


concept [SELF]: 
attribute [ingroup]:  
ideology [social]:  


|  | text | root | idx | concept | attribute | ideology | notes |
|---|---|---|---|---|---|---|---|
| 0 | we | we | 5 | SELF | ingroup | social |  |


satisfied q


SystemExit: Stop right there!

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
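The annotation loop above repeats the same step for every field: show a default in brackets, read a line, and fall back to the default on empty input. A possible refactoring is sketched below. The helper name `ask` and its parameters are hypothetical, and the input function is injectable only so that the example can run without a keyboard.

```python
def ask(prompt, default, transform=str, input_fn=input):
    """Prompt with the default in brackets; return the default on empty input.

    `transform` is applied only to a typed answer, e.g. int for an index
    or str.lower for free text.
    """
    raw = input_fn(f"{prompt} [{default}]: ").strip()
    if not raw:
        return default
    return transform(raw)


# Scripted answers stand in for keyboard input in this demonstration.
answers = iter(["", "12", "OUTGROUP"])


def scripted(_prompt):
    return next(answers)


new_text = ask("new chunk text", "we", str.lower, scripted)
new_i = ask("idx", 5, int, scripted)
new_attribute = ask("attribute", "ingroup", str.lower, scripted)
print(new_text, new_i, new_attribute)
```

With a helper of this kind, each field in the loop reduces to a single call, and the index stays an integer whether it is typed or defaulted.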
