## Second Assignment - Working with Named Entities

- **Student**: Farina Matteo  
- **Mat. Number**: 221252  
- **email**: [matteo.farina-1@studenti.unitn.it](mailto:matteo.farina-1@studenti.unitn.it)

### Content  
This notebook contains the code for the second assignment of the course. Additionally, detailed descriptions are provided right before each code cell, so that this notebook can play the role of the report too.  
  
The structure of the notebook consists of several sections, briefly listed below.
- [**Requirements**](#Requirements): section for external libraries and packages.
- [**Task 1**](#Task-1): evaluate spaCy NER on CoNLL 2003 data.
    - [**Task 1.1**](#Task-1.1): report token-level performance (per class and total) and accuracy.
    - [**Task 1.2**](#Task-1.2): report chunk-level metrics (precision, recall and f1-score) of recognizing all the named entities in a chunk, per class and total.
- [**Task 2**](#Task-2): grouping of entities and analysis in terms of most frequent NER combinations.
- [**Task 3**](#Task-3): extending entity spans with compounds.

### Requirements  

This section lists the Python dependencies and the steps needed to get your system up and running for this notebook. Please note that this notebook has been tested with the **Python 3.9** interpreter inside an Anaconda virtual environment. External libraries have been installed with **pip**, version 21.0.1.  

The main library that is used throughout the notebook is `spaCy`. To install it, run:  
- `pip install spacy`  

In addition, spaCy is coupled in this notebook with its small English pipeline:  
- `python -m spacy download en_core_web_sm`  

For evaluation purposes, `scikit-learn` is used. To install it, run:  
- `pip install scikit-learn`  
  
Finally, `pandas` is required to render beautiful tables to visualize our data:  
- `pip install pandas`  

#### Shortcut  
If you prefer to avoid individual install commands, a requirements file has been included in the repo.
Note that spaCy's English pipeline must still be downloaded separately. You can install everything by running the code cell below.

In [56]:
!pip install -r requirements.txt
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)


Collecting en-core-web-sm==3.0.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In the next two code cells, we are importing every library we need as well as setting up any external dependency and loading the conll2003 dataset.

In [57]:
# importing the necessary python libraries
import os
import sys
import spacy
import pandas as pd

In [58]:
# adding the src/ folder (containing the conll.py script, useful later on) to the import path, and loading the conll2003 ds
sys.path.insert(0, os.path.abspath('src/'))
from conll import read_corpus_conll, evaluate
conll2003 = read_corpus_conll('data/conll2003/test.txt')

### Task 1  
In the first task students were required to evaluate the spaCy Named-Entity-Recognizer on the conll2003 dataset, with both **token-level** and **chunk-level** metrics.  
  
To this aim, two different strategies are shown in order to deal with the pre-defined tokens and labels of the conll2003 dataset:  
1. the first implementation shows how pre-defined tokens can be **forced** inside a spaCy Doc representation;  
2. the second implementation, on the other hand, **aligns** the original tokens with the ones automatically computed by spaCy. This lets spaCy use its own tokenization, which may lead to better performance. Note, though, that the labels have to be aligned accordingly.  
  
Afterwards, both the implementations are evaluated and the relative performances are displayed.

- ### Task 1.1  
  
In this section, we will focus on **token-level** metrics and how data have to be preprocessed according to the described strategies to suit our evaluation method. In the code cell below, some utility functions are defined.

In [59]:
# utility fns for the first task
def get_token(conll_tuple: tuple):
    """Returns the token text of a tuple from the conll2003 ds"""
    decoded = conll_tuple[0].strip().split()
    return decoded[0]

def get_iob(conll_tuple: tuple):
    """Returns the iob tag of a tuple from the conll2003 ds"""
    decoded = conll_tuple[0].strip().split()
    return decoded[-1]

def decode_ref(conll_tuple: tuple):
    """Decodes a tuple from the conll2003 dataset. Returns values as (token, iob)"""
    return (get_token(conll_tuple), get_iob(conll_tuple))

### First method: using pre-defined tokens  
In the code cell below, a simple whitespace tokenizer is defined. It is then passed to a spaCy pipeline, overwriting its default tokenizer. With this tweak, we can ensure that spaCy uses exactly the same tokens as defined in the conll2003 dataset.

In [60]:
# defining a custom whitespace tokenizer
class WhitespaceTokenizer(object):
    """
    Class for a basic whitespace tokenizer. When an instance is called, it splits a sentence string
    into whitespace-separated tokens and embeds them into a Doc object to be returned.
    """
    def __init__(self, vocab):
        super(WhitespaceTokenizer, self).__init__()
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words); spaces[-1] = False  # last word of a sentence doesn't need trailing space
        return spacy.tokens.doc.Doc(self.vocab, words=words, spaces=spaces)
    
# initializing the default pipeline and substituting the tokenizer.
# this will allow tokens provided by spacy and the original tokens from the conll2003 dataset to match.
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

After the pipeline has been initialized and its tokenizer has been overwritten, we can start collecting our references and our hypotheses, where:  
- `refs` are composed of tuples of `(token, iob)` and represent each token together with its real IOB tag;  
- `hyps` are composed of tuples `(token, iob)` where each token is paired with its IOB tag predicted by spaCy.  
  
To conclude the discussion about enforced tokenization, note that spaCy docs are initialized with strings made up of the tokens of a conll2003 sentence, **whitespace-separated**. It follows that our `WhitespaceTokenizer` simply splits these sentences back into the original tokens. The same result could have been achieved in other ways, such as extracting the NER component from a spaCy pipeline beforehand (with `nlp.get_pipe("ner")`) and applying it to a Doc object manually initialized with the tokens of the current conll2003 sentence.
  
After each doc has been processed, and the Named-Entity-Recognizer has thus been applied, the assigned labels need to be converted. The IOB tags in the conll2003 dataset also carry the entity category (e.g. `B-PER`), whereas spaCy stores the IOB part in `ent_iob_` and the category in the separate attribute `ent_type_` of each token. Naming conventions differ as well: spaCy's *PERSON* label maps to conll2003's *PER*. Additionally, any category other than `[ORG, LOC, PERSON]` is mapped to `MISC`.  
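
The conversion rule can also be sketched in isolation. The snippet below is only an illustration: `SPACY_TO_CONLL` and `to_conll_tag` are hypothetical names that do not appear in the notebook, and the function works on plain strings rather than on spaCy tokens.

```python
# Hypothetical lookup table: spaCy categories with a direct conll2003 counterpart.
SPACY_TO_CONLL = {"ORG": "ORG", "LOC": "LOC", "PERSON": "PER"}

def to_conll_tag(ent_iob: str, ent_type: str) -> str:
    """Return an IOB tag in conll2003 notation, e.g. ('B', 'PERSON') -> 'B-PER'."""
    if ent_iob == "O":
        return "O"
    # any spaCy category without a direct counterpart falls back to MISC
    return f"{ent_iob}-{SPACY_TO_CONLL.get(ent_type, 'MISC')}"

examples = [("B", "PERSON"), ("I", "ORG"), ("B", "GPE"), ("O", "")]
converted = [to_conll_tag(iob, typ) for iob, typ in examples]
print(converted)  # ['B-PER', 'I-ORG', 'B-MISC', 'O']
```

Note that, under this rule, spaCy categories such as `GPE` end up in `MISC`, exactly as in the loop of the next code cell.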

In [61]:
# build the refs and the hyps (tuples of token + iob tag wrt to NER task)
refs, hyps = [], []

# start looping through the sents of the dataset
for sent in conll2003:
    # get the tokens and the sentence
    tokens = [get_token(tpl) for tpl in sent]
    sent_str = " ".join(tokens)
    
    # check this is not a DOCSTART line
    if sent_str.startswith("-DOCSTART-"): continue
    
    # if everything's fine, let's run the pipeline with our custom whitespace tokenizer.
    # we are feeding the pipeline with whitespace-separated tokens, so this will turn out
    # in using the tokens pre-defined by the conll2003 dataset.
    doc = nlp(sent_str)
        
    # we must now convert spaCy NE labels to match the conll2003 notation for eval.
    iobs_pred = []
    for t in doc:
        if t.ent_iob_ != 'O':
            if t.ent_type_ in ['ORG', 'LOC']:
                iobs_pred.append((t.text, f"{t.ent_iob_}-{t.ent_type_}"))
            elif t.ent_type_ == 'PERSON':
                iobs_pred.append((t.text, f"{t.ent_iob_}-PER"))
            else:
                iobs_pred.append((t.text, f"{t.ent_iob_}-MISC"))
        else:
            iobs_pred.append((t.text, 'O'))

    # then append data for evaluation
    refs.append([decode_ref(tpl) for tpl in sent])
    hyps.append(iobs_pred)

When data has been preprocessed, **token-level** evaluation is performed with the help of scikit-learn's `classification_report`.  

The function is simply fed with two lists containing the reference values and the hypotheses, equally indexed (i.e. `y_true[i]` and `y_pred[i]` both refer to the i-th token). 
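
To make the "equally indexed" requirement concrete, here is a minimal sketch on toy data. The names `toy_refs`, `toy_hyps`, `y_true_toy` and `y_pred_toy` are made up for the example, and accuracy is computed by hand instead of through scikit-learn.

```python
# Toy data in the same shape as refs / hyps: one list of (token, tag) pairs per sentence.
toy_refs = [[("Japan", "B-LOC"), ("won", "O")], [("Nadim", "B-PER"), ("Ladki", "I-PER")]]
toy_hyps = [[("Japan", "B-MISC"), ("won", "O")], [("Nadim", "B-PER"), ("Ladki", "I-PER")]]

# Flatten sentence by sentence so that position i refers to the same token in both lists.
y_true_toy = [tag for sent in toy_refs for _, tag in sent]
y_pred_toy = [tag for sent in toy_hyps for _, tag in sent]
assert len(y_true_toy) == len(y_pred_toy)

# Token-level accuracy: fraction of positions where the two tags agree.
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(accuracy)  # 0.75
```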

In [62]:
# when refs and hyps are ready, let's use sklearn to compute token-level metrics
from sklearn.metrics import classification_report
y_true = []
y_pred = []
for i in range(len(refs)):
    for j in range(len(refs[i])):
        y_true.append(refs[i][j][-1])  # tuple is composed of (token, ref_tag) for further evaluation fn
        y_pred.append(hyps[i][j][-1])  # tuple is composed of (token, pred_tag) for further evaluation fn

# finally, compute and print the report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

       B-LOC       0.52      0.02      0.03      1668
      B-MISC       0.08      0.58      0.14       702
       B-ORG       0.50      0.30      0.38      1661
       B-PER       0.79      0.61      0.69      1617
       I-LOC       0.41      0.08      0.13       257
      I-MISC       0.05      0.41      0.08       216
       I-ORG       0.42      0.52      0.46       835
       I-PER       0.82      0.76      0.78      1156
           O       0.94      0.86      0.90     38323

    accuracy                           0.78     46435
   macro avg       0.50      0.46      0.40     46435
weighted avg       0.87      0.78      0.81     46435



### Second Method: token alignment  

As mentioned in the introduction to Task 1, the evaluation is carried out with both strategies. The pipeline in the code cells below is the same as the one described above; the significant difference lies in how tokens and labels are managed.

Each sentence from conll2003 is now tokenized by spaCy in its default way, without enforcing any tokenization. Afterwards, the obtained tokens and the source ones are **aligned** through spaCy's `Alignment` class, which tells us how the (possibly) different lists of tokens map to each other. This is a mandatory step before label alignment: labels are projected onto the new tokens based on the token alignment.  
  
Note that one step is crucial here: when several consecutive spaCy tokens map to a single source token labeled with a `B-` tag, only the first of them may keep that tag and the following ones must be converted to `I-`. Otherwise, we would end up with sequences of `B-` tags for what is, in reality, a single entity.
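
The label projection can be illustrated without spaCy. In the sketch below, `project_labels` is a hypothetical helper and `y2x` is a plain list standing in for the alignment indices; the three-way split of the source token is assumed for the sake of the example.

```python
def project_labels(iobs, y2x):
    """
    Project source IOB tags onto a finer tokenization.
    y2x[j] is the index of the source token that new token j came from.
    """
    labels, last = [], None
    for src in y2x:
        label = iobs[src]
        # a repeated source index means the source token was split:
        # only its first piece may keep the B- prefix
        if label.startswith("B-") and src == last:
            label = "I-" + label[2:]
        last = src
        labels.append(label)
    return labels

# a source token tagged B-LOC, assumed to be split into three pieces, followed by an O token
projected = project_labels(["B-LOC", "O"], [0, 0, 0, 1])
print(projected)  # ['B-LOC', 'I-LOC', 'I-LOC', 'O']
```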

In [63]:
from spacy.training import Alignment
nlp = spacy.load("en_core_web_sm")

# build the refs and the hyps (tuples of token + iob tag wrt to NER task)
refs_aligned, hyps_aligned = [], []

# start looping through dataset sents
for sent in conll2003:
    # get the src tokens, the src iob tags and the src sentence
    tokens = [get_token(tpl) for tpl in sent]
    iobs = [get_iob(tpl) for tpl in sent]
    sent_str = " ".join(tokens)
    
    # check this is not a DOCSTART line
    if sent_str.startswith("-DOCSTART-"): continue
    
    # compute default spaCy tokens and get their inner text
    doc = nlp(sent_str)
    spacy_tokens = [t.text for t in doc]
    
    # align conll2003 tokens and spacy tokens
    align = Alignment.from_strings(tokens, spacy_tokens)

    # then, align the original labels to match the current tokens.
    # If we encounter the same match twice (or more) in a row,
    # we must convert its label if it is a B- tag. Otherwise, we will
    # end up having as groundtruth labels multiple contiguous B tags which
    # do not have any meaning.
    labels = []
    last_match = None
    for match in align.y2x.dataXd:
        label = iobs[match]
        if label.startswith("B-") and last_match == match:
            base, category = label.split("-")
            label = f"I-{category}"
        last_match = match
        labels.append(label)

    # we must now convert spaCy NE labels to match the conll2003 notation for eval.
    iobs_pred = []
    for t in doc:
        if t.ent_iob_ != 'O':
            if t.ent_type_ in ['ORG', 'LOC']:
                iobs_pred.append((t.text, f"{t.ent_iob_}-{t.ent_type_}"))
            elif t.ent_type_ == 'PERSON':
                iobs_pred.append((t.text, f"{t.ent_iob_}-PER"))
            else:
                iobs_pred.append((t.text, f"{t.ent_iob_}-MISC"))
        else:
            iobs_pred.append((t.text, 'O'))

    # then append data for evaluation
    refs_aligned.append([(token, y_true) for (token, y_true) in zip(spacy_tokens, labels)])
    hyps_aligned.append(iobs_pred)

In [64]:
# then, let's see the performance differences (if significant)
# between the enforced tokenization and the aligned one.
from sklearn.metrics import classification_report
y_true = []
y_pred = []
for i in range(len(refs_aligned)):
    for j in range(len(refs_aligned[i])):
        y_true.append(refs_aligned[i][j][-1])
        y_pred.append(hyps_aligned[i][j][-1])

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

       B-LOC       0.48      0.02      0.03      1668
      B-MISC       0.08      0.60      0.14       702
       B-ORG       0.52      0.31      0.38      1661
       B-PER       0.80      0.63      0.70      1617
       I-LOC       0.33      0.07      0.12       276
      I-MISC       0.04      0.31      0.07       322
       I-ORG       0.41      0.52      0.46       876
       I-PER       0.82      0.78      0.80      1210
           O       0.95      0.85      0.90     40913

    accuracy                           0.78     49245
   macro avg       0.49      0.45      0.40     49245
weighted avg       0.88      0.78      0.81     49245



- ### Task 1.2  

The second part of the first task consisted in performing **chunk-level** evaluation, reporting precision, recall and f1-score. This has been implemented through the `evaluate` function of the provided `conll.py`, which takes references and hypotheses in the `(token, iob)` format built above. As data have already been prepared, the code cells below only display evaluation metrics for both the enforced tokenization output and the aligned one.
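
For readers unfamiliar with chunk-level scoring, the idea can be sketched as follows. This is not the code of `conll.py`: `extract_chunks` and `chunk_prf` are hypothetical helpers that assume a chunk counts as correct only when its boundaries and its type both match exactly.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks encoded by a list of IOB tags."""
    chunks, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        prefix, _, label = tag.partition("-")
        # the open chunk continues only on an I- tag of the same type
        if start is not None and not (prefix == "I" and label == kind):
            chunks.add((start, i, kind))
            start, kind = None, None
        # open a new chunk on B-, or on an I- tag when no chunk is open
        if prefix in ("B", "I") and start is None:
            start, kind = i, label
    return chunks

def chunk_prf(ref_tags, hyp_tags):
    """Precision, recall and f1 over exactly matching chunks."""
    ref, hyp = extract_chunks(ref_tags), extract_chunks(hyp_tags)
    correct = len(ref & hyp)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ref_tags = ["B-PER", "I-PER", "O", "B-LOC"]
hyp_tags = ["B-PER", "O", "O", "B-LOC"]
print(chunk_prf(ref_tags, hyp_tags))  # (0.5, 0.5, 0.5)
```

In the toy example the person entity is only half recognized, so it counts as an error at chunk level even though one of its two tokens is tagged correctly.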

In [65]:
# to get class-level scores, let's make use of conll.py
results = evaluate(refs, hyps)
pd.DataFrame.from_dict(results, orient='index').round(decimals=3)        # default WhiteSpace

| | p | r | f | s |
|---|---|---|---|---|
| LOC | 0.481 | 0.016 | 0.03 | 1668 |
| PER | 0.761 | 0.59 | 0.665 | 1617 |
| MISC | 0.078 | 0.567 | 0.137 | 702 |
| ORG | 0.448 | 0.272 | 0.339 | 1661 |
| total | 0.246 | 0.324 | 0.28 | 5648 |


In [66]:
results_aligned = evaluate(refs_aligned, hyps_aligned)
pd.DataFrame.from_dict(results_aligned, orient='index').round(decimals=3)  # aligned Tokenization

| | p | r | f | s |
|---|---|---|---|---|
| LOC | 0.45 | 0.016 | 0.031 | 1668 |
| PER | 0.774 | 0.609 | 0.681 | 1617 |
| MISC | 0.074 | 0.561 | 0.131 | 702 |
| ORG | 0.464 | 0.276 | 0.346 | 1661 |
| total | 0.244 | 0.33 | 0.28 | 5648 |


### Task 2

For the second task, students were asked to perform grouping of entities as well as to analyze the most frequent NER combinations. As previously, some task-specific utility functions are defined in the first place.

In [67]:
def encode_key(lst: list):
    """Transforms a list in a string format, useful to compute keys for frequency list generation."""
    return "[{}]".format(", ".join(sorted(lst)))

def nbest(d, n):
    """Returns first :param n: items of the frequency list :param d: (more frequent ones)."""
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

def nworst(d, n):
    """Returns last :param n: items of the frequency list :param d: (less frequent ones)."""
    return dict(sorted(d.items(), key=lambda item: item[1])[:n])

Below the core functions of the task are defined:  

`fn: group_sent` iterates through the `noun_chunks` of the entities of a sentence in order to understand which entity types are coupled. Then, also entities that do not appear together with other entities are taken into account, as they will single-handedly compose a group. 

When building groups, entity labels are **sorted**. This turns out to be significant when analyzing correlations and group frequencies. In this way, entities that appear with order `['A','B']` and `['B','A']` are both considered as the same entity group. For example, instances of groups `['PERSON', 'ORG']` and `['ORG', 'PERSON']` will account for `['ORG', 'PERSON']` at the end.  
**This is merely a design choice**.  
In my opinion, this is more robust than tracking `['ORG', 'PERSON']` and `['PERSON', 'ORG']` as different "events", and it enables us to capture linguistic expressions such as *Apple's Steve Jobs* and *Steve Jobs of Apple*, which refer to the same real-world entities.

`fn: fl` builds a frequency list of groups and is meant to be fed with a list containing outputs of `fn: group_sent`. By iterating through the input list, it builds a Python dictionary where the keys are the representations of entity groups and the values are the counts. 
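
As a side note, the same order-insensitive counting can be sketched with `collections.Counter` from the standard library. The toy input `toy_groups` is invented for the example and only mimics the shape of the data described above; this is an alternative formulation, not the implementation used in the notebook.

```python
from collections import Counter

# Toy input: one list of groups per sentence, each group being a list of entity labels.
toy_groups = [
    [["ORG", "PERSON"], ["GPE"]],
    [["PERSON", "ORG"]],
    [["GPE"], []],
]

# Sorting before building the key makes the count order-insensitive;
# empty groups are skipped.
counts = Counter(
    "[{}]".format(", ".join(sorted(group)))
    for sent_groups in toy_groups
    for group in sent_groups
    if group
)
print(counts["[ORG, PERSON]"], counts["[GPE]"])  # 2 2
```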

In [68]:
def group_sent(doc: spacy.tokens.doc.Doc):
    """
    Process a spacy Doc to get Named-Entity groups that appear together.
    At first, groups are built based on which entities belong to the same noun chunks.
    Afterwards, also entities that are not coupled with any other entity in a noun chunk are 
    taken into account as a mocked group with 1 NE.  
    :return: list of lists, where each inner list is made up of entity labels that appear together.
    """
    stored_ents = []
    ent_groups = []

    # add all groups that appear in noun chunks, taking care of avoiding repetitions
    for nc in doc.noun_chunks:
        to_store = []
        for ent in nc.ents:
            if ent not in stored_ents:
                to_store.append(ent.label_)
                stored_ents.append(ent)
        if len(to_store) > 0:
            ent_groups.append(sorted(to_store))          
    
    # take also into account entities that do not appear in noun chunks
    # and count them as groups composed by only 1 NE
    for ent in doc.ents:
        if ent not in stored_ents:
            ent_groups.append([ent.label_])
    
    return ent_groups

def fl(groups):
    """
    Build the frequency list of groups. Each key is the string representation
    of a group, and the respective value is the number of occurrences
    of that particular Named-Entities combination.
    
    :param groups: list-of-lists-of-lists, where: 
    - Outermost list indexes sentences;  
    (groups[i] will be a list-of-lists, groups of sent. i)
    
    - Intermediate lists index groups of a sentence; 
    (groups[i][j] will be a list, labels of group j of sent. i)
    
    - Innermost lists index entity labels of a group of a sentence. 
    (groups[i][j][k] will be label k of group j of sent. i)
    
    :return fdict: frequency list.
    """
    fdict = {}
    # scan each sentence
    for sent_groups in groups:
        # scan each group within a sentence
        for group in sent_groups:  # ATTENTION: 'group' is a list and can be empty
            if len(group) == 0: continue
            key = encode_key(group)
            # create the dict with default counts if needed
            if key not in fdict.keys():
                fdict[key] = 0
            # update the counts for the given group
            fdict[key] += 1
    return fdict

Groups are then built for the whole conll2003 dataset, and the frequency list is computed. The 10 most frequent and least frequent entity groups are displayed.

In [69]:
nlp = spacy.load("en_core_web_sm")

# initialize the list where groups per sentence will be put
groups = []

# process the whole ds
for sent in conll2003:
    # same tasks performed also in previous cells
    tokens = [get_token(tpl) for tpl in sent]
    sent_str = " ".join(tokens)
    if sent_str.startswith("-DOCSTART-"): continue
    
    # group entities in the current sentence and store groups
    doc = nlp(sent_str)
    sent_groups = group_sent(doc)
    groups.append(sent_groups)

# Analyze the groups in terms of most frequent combinations (i.e. NER types that go together)
frequency_list = fl(groups)

print("\n10 most frequent NER combinations")
for k, v in nbest(frequency_list, n=10).items():
    print(k, v)

print("\n10 least frequent NER combinations")
for k, v in nworst(frequency_list, n=10).items():
    print(k, v)


10 most frequent NER combinations
[CARDINAL] 1624
[GPE] 1255
[PERSON] 1074
[DATE] 997
[ORG] 873
[NORP] 293
[MONEY] 147
[ORDINAL] 111
[TIME] 83
[PERCENT] 81

10 least frequent NER combinations
[DATE, ORDINAL] 1
[GPE, ORDINAL, ORG] 1
[ORG, QUANTITY] 1
[CARDINAL, GPE, GPE] 1
[DATE, LOC] 1
[DATE, FAC] 1
[CARDINAL, CARDINAL, NORP] 1
[LOC, ORDINAL] 1
[MONEY, ORG] 1
[LOC, NORP] 1

### Task 3  

For task three, the goal was to perform a common post-processing step, namely trying to fix segmentation errors based on the `compound` dependency relation. This is achieved by means of `extend`, a **recursive function**. Its behaviour is rather simple:  
- if an entity span has no tokens with the compound dependency relation, no extension is performed;  
- otherwise, the span is extended with the heads of its compound tokens, and the process repeats as long as brand new compounds are introduced in the span (i.e. recursion keeps going until the set difference between the new compounds and the previous compounds is empty). This allows us to "climb" the dependency tree up to all compounds reachable from the original span.  
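
The stopping criterion can be sketched on a toy structure, independently of spaCy. Everything in the snippet is an assumption made for illustration: `extend_span` is a hypothetical function, spans are `(start, end)` pairs with `end` excluded, and the dependency labels and heads are hand-written rather than produced by a parser.

```python
def extend_span(span, deps, heads, seen=frozenset()):
    """Grow (start, end) by pulling in heads of compound tokens until nothing new appears."""
    start, end = span
    comps = frozenset(i for i in range(start, end) if deps[i] == "compound")
    if not comps - seen:  # base case: no new compound tokens since the previous call
        return span
    reach = [heads[i] for i in comps]
    # keep the span contiguous: stretch it so that it covers every reached head
    new_span = (min([start] + reach), max([end] + [h + 1 for h in reach]))
    return extend_span(new_span, deps, heads, comps)

# hand-written toy parse of "Group C championship match"
words = ["Group", "C", "championship", "match"]
deps = ["compound", "compound", "compound", "pobj"]
heads = [1, 3, 3, 3]

start, end = extend_span((0, 2), deps, heads)  # start from the span "Group C"
print(words[start:end])  # ['Group', 'C', 'championship', 'match']
```

The first call pulls in `match`, which brings `championship` into the span as a new compound; the second call finds nothing further to add, and the third call hits the base case.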

In the code cell below, what's needed to extend entities is defined and the implementation is then tested against the first 20 sentences of the conll2003 dataset.

In [72]:
def compounds(ent: spacy.tokens.span.Span):
    """Get all tokens with the compound dependency relation inside :param ent:."""
    comps = []
    for t in ent:
        if t.dep_=="compound":
            comps.append(t)
    return comps

def extend_entity_span(ent: spacy.tokens.span.Span):
    """
    Extend the span of an entity based on its compounds. Heads of compound (i.e. nouns that
    are qualified by some token in the entity) will be added to the entity, extending it.
    :param ent: input entity to be extended
    :return:    extended span of the entity (spaCy span)
    """
    ent_ext = []
    # checking noun compounds for each token
    for t in ent:
        # appending the current token to the extended entity
        ent_ext.append(t)
        # if the token is a compound wrt sth outside of the entity, let's add the head too
        if t.dep_ == "compound" and t.head not in ent and t.head not in ent_ext:
            ent_ext.append(t.head)
            
    # ensure tokens are in the original sentence order and return
    ent_ext = sorted(ent_ext, key=lambda x:x.i)
    ent_ext = ent.doc[ent_ext[0].i:ent_ext[-1].i+1]  # as a Span object
    return ent_ext

def extend(ent: spacy.tokens.span.Span, prev_comps=None):
    """
    Recursively extend the span of an entity based on the 'compound' dependency relation.
    Recursion stops when no new compounds are found.
    """
    # perform set difference between the current compounds and the compounds of the 
    # previous recursive call (none on the first call; None avoids a mutable default).
    current_comps = set(compounds(ent))
    diff = current_comps - (prev_comps or set())
    
    # base case, if no new compounds have been found, return.
    if len(diff) == 0:
        return ent
    
    # otherwise, perform a first-level extension of the span and proceed with recursion.
    ent = extend_entity_span(ent)
    return extend(ent, current_comps)

# test out the implementation on a slice of the conll2003 dataset
for sent in conll2003[:20]:
    tokens = [get_token(tpl) for tpl in sent]
    sent_str = " ".join(tokens)
    if sent_str.startswith("-DOCSTART-"): continue
    doc = nlp(sent_str)
    original = [ent for ent in doc.ents]
    extended = [extend(ent) for ent in doc.ents]
    print("'{}'".format(sent_str))
    print("Original entities: ", original)
    print("Extended entities: ", extended)
    print("\n\n")

'SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .'
Original entities:  [CHINA]
Extended entities:  [CHINA]



'Nadim Ladki'
Original entities:  [Nadim Ladki]
Extended entities:  [Nadim Ladki]



'AL-AIN , United Arab Emirates 1996-12-06'
Original entities:  [AL-AIN, United Arab Emirates, 1996-12-06]
Extended entities:  [AL-AIN, United Arab Emirates, 1996-12-06]



'Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .'
Original entities:  [Japan, their Asian Cup, Syria, Group C, Friday]
Extended entities:  [Japan, their Asian Cup title, Syria, Group C championship match, Friday]



'But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .'
Original entities:  [China, second, 2, Uzbekistan]
Extended entities:  [China, second, 2, Uzbekistan]



'China controlled most of the match and saw several chances missed until the 78th minute 

As we can see, extension does its job.

For instance, in sentence: "*China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net .*" initially `Uzbek` and `Igor Shkvyrin` were considered different entities, while performing extension allowed us to merge them into a new one.

However, there is still something we can do. Entities have been extended, but we haven't **pruned** the entities made redundant by the extension. Looking at the same sentence, if the named entity `Uzbek striker Igor Shkvyrin` appears, it doesn't make sense to also keep `Igor Shkvyrin` as a separate NE. So, we may think of getting rid of it.

For this reason, the following (and last) code cell implements a filtering function, named **filter_ents**. Given a list of entities, it checks whether some entity span is fully included into some other one and can, thus, be safely pruned. Span inclusion is checked leveraging the `i` attribute of spaCy tokens.
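
The inclusion test itself can be sketched on plain index pairs. In the snippet below, `prune_subspans` is a hypothetical stand-in that works on `(first, last)` token indices, both inclusive, instead of spaCy spans; a span is dropped only when another span covers it and extends beyond it on at least one side, so identical spans are kept.

```python
def prune_subspans(spans):
    """Drop every (first, last) span that is properly contained in another span of the list."""
    kept = []
    for first, last in spans:
        contained = any(
            (f <= first and last <= l) and (f, l) != (first, last)
            for f, l in spans
        )
        if not contained:
            kept.append((first, last))
    return kept

# inclusive token indices, in the spirit of ent[0].i and ent[-1].i
spans = [(0, 0), (4, 7), (6, 7), (10, 10)]
print(prune_subspans(spans))  # [(0, 0), (4, 7), (10, 10)]
```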

Then, the implementation will be tested on the same slice of conll2003. As you will see, redundant entities will disappear.

In [None]:
def filter_ents(ents: list):
    """
    Process all extended entities of a sentence (:param ents:). If an entity span is found
    to be a sub-span of another entity, this implies that two original named entities were in reality
    linked by a compound dependency, so the extension has merged them. Thus, the sub-span can be removed
    as it is redundant and less informative.
    
    :param ents: list of extended entities to be checked;
    :return:     list of extended entities, where fully included sub-spans of other entities
                 in :param ents: have been pruned. 
    """
    extended_filtered = []
    for ext_ent in ents:
        # perform type check for each item of the provided list
        assert isinstance(ext_ent, spacy.tokens.span.Span), \
        "(fn: filter_ents) each item of argument 'ents' must be a Span obj."
        
        keep = True
        start, stop = ext_ent[0].i, ext_ent[-1].i
        for ext_match in ents:
            match_start, match_stop = ext_match[0].i, ext_match[-1].i       
            # now, check whether ext_ent is properly contained in ext_match, i.e.
            # the match covers it and extends beyond it on at least one side
            # (identical spans are kept). Two cases cover every such inclusion:
            # - left boundary may coincide, right boundary is strictly inside;
            # - left boundary is strictly inside, right boundary may coincide.
            if (match_start <= start and stop < match_stop) or \
            (match_start < start and stop <= match_stop): 
                keep = False
                break
        
        if keep: extended_filtered.append(ext_ent)
    return extended_filtered

# test out the implementation on a slice of the conll2003 dataset
for sent in conll2003[:20]:
    tokens = [get_token(tpl) for tpl in sent]
    sent_str = " ".join(tokens)
    if sent_str.startswith("-DOCSTART-"): continue
    doc = nlp(sent_str)
    original = [ent for ent in doc.ents]
    extended = [extend(ent) for ent in doc.ents]
    filtered = filter_ents(extended)
    print("'{}'".format(sent_str))
    print("Original entities: ", original)
    print("Extended entities: ", extended)
    print("Filtered entities: ", filtered)
    print("\n\n")