# Custom Noun Chunking
-----

spaCy's built-in `noun_chunks` iterator is too coarse-grained for the chunking needed to detect ingroups and outgroups.

For the purposes of the methodology, a more fine-grained noun chunking algorithm is required.

In several of the test ingroup and outgroup sentences, named entities are chunked together with other nouns when they would preferably be kept separate.

There are also several examples where a noun chunk contains more than one noun carrying a custom attribute; in those cases the chunk needs to be resolved to a single instance.

This notebook adapts spaCy's `noun_chunks` source code for the specific purpose of this pipeline.
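To make the first problem concrete, here is a minimal sketch of one possible way to separate named entities from the other nouns in a coarse chunk. It is not the method developed later in this notebook: the token records, the `split_on_entities` helper, and the example phrase are all hypothetical, and the sketch uses plain tuples rather than spaCy tokens.

```python
from itertools import groupby

# Hypothetical token records: (text, ent_type). An empty ent_type means
# the token is not part of a named entity. The phrase is illustrative only.
coarse_chunk = [
    ("United", "ORG"),
    ("Nations", "ORG"),
    ("troops", ""),
]


def split_on_entities(tokens):
    """Split a chunk into consecutive runs that share the same ent_type."""
    pieces = []
    for ent_type, run in groupby(tokens, key=lambda tok: tok[1]):
        pieces.append((" ".join(text for text, _ in run), ent_type))
    return pieces


fine_chunks = split_on_entities(coarse_chunk)
print(fine_chunks)  # [('United Nations', 'ORG'), ('troops', '')]
```

Grouping on `ent_type` alone is a simplification: two adjacent entities of the same type would be merged into one piece, so a real implementation would more likely work from entity span boundaries.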

Source code at these links:

    Noun Chunker Code
    
    https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

    Class extensions

    https://github.com/explosion/spaCy/blob/9ce059dd067ecc3f097d04023e3cfa0d70d35bb8/spacy/tokens/doc.pyx

    https://github.com/explosion/spaCy/blob/f49e2810e6ea5c8b848df5b0f393c27ee31bb7f4/spacy/tokens/span.pyx


## Test data

The following sentences will be used for this notebook.

In [3]:
## create a dict object of all the ingroup/outgroup sentences
import os
import cndutils as ut
path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"

sent_dict = dict()
jsonl_files = [f for f in os.listdir(path) if os.path.splitext(f)[1] == ".jsonl" and "group" in os.path.splitext(f)[0]]
for file in jsonl_files:
    data_list = ut.load_jsonl(os.path.join(path, file))
    for entry in data_list:
        for value in entry.values():
            sent_dict[len(sent_dict)] = value

Loaded 49 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\binladen_ingroup_sents.jsonl
Loaded 101 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\binladen_outgroup_sents.jsonl
Loaded 66 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\bush_ingroup_sents.jsonl
Loaded 37 records from C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\bush_outgroup_sents.jsonl
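The `cndutils.load_jsonl` helper is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming each line of the file holds one JSON object and that the helper prints the record count as in the output above. The demo file name and sentences are made up for illustration.

```python
import json
import os
import tempfile


def load_jsonl(filepath):
    """Read one JSON object per line and return them as a list (assumed behaviour)."""
    records = []
    with open(filepath, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    print(f"Loaded {len(records)} records from {filepath}")
    return records


# Usage with a throwaway file; the keys and sentences are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ingroup_sents.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"0": "First sentence."}\n{"1": "Second sentence."}\n')

    # Same flattening loop as the cell above: one running integer key per sentence.
    sent_dict = {}
    for entry in load_jsonl(demo_path):
        for value in entry.values():
            sent_dict[len(sent_dict)] = value
```

Using `len(sent_dict)` as the key gives every sentence a unique running index across all of the files.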


## Instantiate Pipeline

In [4]:
%%time

import spacy
import pipeline
cnd = pipeline.CND()

merge_nps = cnd.nlp.create_pipe("merge_noun_chunks")
cnd.nlp.add_pipe(merge_nps)

Wall time: 22.1 s


## Iterate through Test Data for sentences of Interest

These sentences will be used to tune the existing spaCy noun chunker for the purposes of this methodology.

In [None]:
import importlib
import cndutils
importlib.reload(cndutils)


path = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset"
ss = cndutils.sent_select(path = path, file = "test_sents")
output = ss(cnd.nlp, sent_dict)
print(output)
        
    
# remove dets
    

29 / 253


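| token | text | ent_type | concept | attribute | ideology |
|---|---|---|---|---|---|
| 0 | When | | | | |
| 1 | those | | | | |
| 2 | have | | | | |
| 3 | stood | | | | |
| 4 | in | | | | |
| 5 | defense | | MILENTITY | entity | military |
| 6 | of | | | | |
| 7 | their weak children | | | | |
| 8 | , | | | | |
| 9 | their | | | | |
| 10 | brothers | | FAMILY | ingroup | social |
| 11 | and | | | | |
| 12 | sisters | | FAMILY | ingroup | social |
| 13 | in | | | | |
| 14 | Palestine | GPE | | | |
| 15 | and | | | | |
| 16 | other Muslim nations | | | | |
| 17 | , | | | | |
| 18 | the whole world | LOC | | | |
| 19 | went | | | | |
| 20 | into | | | | |
| 21 | an uproar | | | | |
| 22 | , | | | | |
| 23 | the infidels | NORP | | | |
| 24 | followed | | | | |
| 25 | by | | | | |
| 26 | the hypocrites | | | | |
| 27 | . | | | | |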


## This code is adapted from the existing noun chunker in spaCy

The crux of this code is in this section:

```python
if word.pos not in (NOUN, PROPN, PRON):
    continue
# Prevent nested chunks from being produced
if word.left_edge.i <= prev_end:
    continue
if word.dep in np_deps:
    prev_end = word.i
    yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
    head = word.head
    while head.dep == conj and head.head.i < head.i:
        head = head.head
    # If the head is an NP, and we're coordinated to it, we're an NP
    if head.dep in np_deps:
        prev_end = word.i
        yield word.left_edge.i, word.i + 1, np_label
```

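`np_deps` is the list of dependency labels (stored as string-store IDs) that mark a token as the root of a noun phrase.

`prev_end` is the index of the last token in the most recently yielded chunk; it ensures that subsequent chunks do not overlap with existing ones.

`word.left_edge.i` is the index of the leftmost token in the root token's syntactic subtree, so each chunk runs from that token up to and including the root token.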
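The interaction of these three names is easier to follow on a small example. The sketch below re-implements the loop in plain Python over a hand-built dependency parse, so it runs without spaCy. The `Tok` class, the `left_edge` helper, and the example sentence with its parse are assumptions made for illustration; they are not spaCy objects or real parser output.

```python
from dataclasses import dataclass

NP_DEPS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj",
           "dative", "appos", "attr", "ROOT"}


@dataclass
class Tok:
    i: int
    text: str
    pos: str
    dep: str
    head: int  # index of the syntactic head; the root points to itself


def left_edge(tokens, i):
    """Index of the leftmost token in the subtree rooted at token i."""
    edge = i
    for tok in tokens:
        j = tok.i
        # Walk up the tree until we reach token i or the root.
        while j != i and tokens[j].head != j:
            j = tokens[j].head
        if j == i:  # tok is a descendant of token i (or token i itself)
            edge = min(edge, tok.i)
    return edge


def toy_noun_chunks(tokens):
    prev_end = -1
    for word in tokens:
        if word.pos not in ("NOUN", "PROPN", "PRON"):
            continue
        start = left_edge(tokens, word.i)
        if start <= prev_end:  # would overlap an earlier chunk
            continue
        if word.dep in NP_DEPS:
            prev_end = word.i
            yield start, word.i + 1
        elif word.dep == "conj":
            head = tokens[word.head]
            while head.dep == "conj" and head.head < head.i:
                head = tokens[head.head]
            if head.dep in NP_DEPS:  # coordinated to a noun phrase root
                prev_end = word.i
                yield start, word.i + 1


# Hand-built parse of "The angry crowd followed the leaders and followers".
sentence = [
    Tok(0, "The", "DET", "det", 2),
    Tok(1, "angry", "ADJ", "amod", 2),
    Tok(2, "crowd", "NOUN", "nsubj", 3),
    Tok(3, "followed", "VERB", "ROOT", 3),
    Tok(4, "the", "DET", "det", 5),
    Tok(5, "leaders", "NOUN", "dobj", 3),
    Tok(6, "and", "CCONJ", "cc", 5),
    Tok(7, "followers", "NOUN", "conj", 5),
]

chunks = [" ".join(t.text for t in sentence[start:end])
          for start, end in toy_noun_chunks(sentence)]
print(chunks)  # ['The angry crowd', 'the leaders', 'followers']
```

Every token to the left of the root in its subtree ends up in the chunk, which is why determiners and modifiers are always swept in along with the noun.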

Where `word.left_edge.i` is too coarse-grained, the custom chunker will add further tests to produce finer-grained chunks.

Additionally, functionality for custom attributes will have to be added.

For the 

In [None]:
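# Names used below that the original spaCy module imports at the top of the file
from spacy.errors import Errors
from spacy.symbols import NOUN, PROPN, PRON


def noun_chunks(doclike):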
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    
    source code: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py
    """
    labels = [
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
    ]
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    
    prev_end = -1
    
    for i, word in enumerate(doclike):
        
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label

def merge_named_concepts(doc):

    """Merge named concepts into a single token.
    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged named concepts.

    Adapts the spacy merge noun chunks function
    code: https://github.com/explosion/spaCy/blob/master/spacy/pipeline/functions.py
    """
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for span in doc._.named_concepts:
            attrs = {
                    "tag": span.root.tag, 
                    "dep": span.root.dep
                    }

            retokenizer.merge(span, attrs=attrs)
    
    return doc