
# NLU assignment n.2

Update spaCy to version 3, then download the English model, the `conll.py` helper, and the CoNLL 2003 dataset.

In [1]:
%%capture
!pip install --upgrade spacy
!python -m spacy download en_core_web_sm
!wget -nc https://raw.githubusercontent.com/esrel/NLU.Lab.2021/master/src/conll.py
!wget -nc https://github.com/esrel/NLU.Lab.2021/raw/master/src/conll2003.zip
!unzip -n conll2003.zip -d conll2003

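Load spaCy and the dataset. The three CoNLL 2003 splits are concatenated, the `-DOCSTART-` lines are dropped, and 8000 sentences are sampled at random to keep the analysis fast.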

In [2]:
# Imports
import random
import conll
import spacy
import pandas as pd
from spacy.tokens import Token
from spacy.training import Alignment
from sklearn.metrics import classification_report


nlp = spacy.load("en_core_web_sm")

gt = conll.read_corpus_conll("./conll2003/train.txt", fs=" ")
gt.extend(conll.read_corpus_conll("./conll2003/test.txt", fs=" "))
gt.extend(conll.read_corpus_conll("./conll2003/dev.txt", fs=" "))

# Remove the -DOCSTART- document-separator lines
gt = [tag_sent for tag_sent in gt if tag_sent[0][0] != '-DOCSTART-']

# Limit the dataset, for a faster analysis in the following code
gt = random.sample(gt, 8000)
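
Because `random.sample` draws from the global generator, every run evaluates a different subset, so the figures reported later in this notebook will not reproduce exactly. A minimal sketch of a repeatable draw follows, assuming a fixed seed is acceptable. The helper name `sample_sentences` and the seed value `42` are illustrative and are not part of the original notebook.

```python
import random

def sample_sentences(sentences, k, seed=42):
    """Draw k sentences without replacement from a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator: the global random state is untouched
    return rng.sample(sentences, min(k, len(sentences)))

# Toy stand-in for the list of tagged sentences
toy = [[("word%d" % i, "O")] for i in range(100)]
first = sample_sentences(toy, 10)
second = sample_sentences(toy, 10)
print(first == second)  # True
```

In the notebook this would amount to replacing the `random.sample(gt, 8000)` call with `sample_sentences(gt, 8000)`.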

Create a custom extension to store the reference (ground-truth) entity tag directly on the spaCy tokens.

In [3]:
Token.set_extension("ent_ref", default='')

Translation function: scanning the entire CoNLL dataset shows that the only entity types present are
`'LOC', 'ORG', 'PER', 'MISC'`.

spaCy's entity types are more fine-grained, so each one has to be mapped to the CoNLL type closest in meaning. Every spaCy type without a direct counterpart falls back to `MISC`.

In [4]:
def to_ref_entity(token):
  """Map a spaCy token's entity annotation to a CoNLL 2003 IOB tag."""
  ent_iob = token.ent_iob_
  ent_type = token.ent_type_
  if ent_iob == 'O':  # Token outside any entity
    return ent_iob
  if ent_type == 'PERSON':  # Persons
    ent_type = 'PER'
  elif ent_type in ('GPE', 'FAC', 'LOC'):  # Localities
    ent_type = 'LOC'
  elif ent_type != 'ORG':  # Organizations keep their label, everything else is MISC
    ent_type = 'MISC'
  return f"{ent_iob}-{ent_type}"
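
The same mapping can be written as a lookup table, which makes the `MISC` fallback explicit and is easy to check without loading a model. This is a sketch operating on plain strings rather than on spaCy tokens; the names `SPACY_TO_CONLL` and `map_tag` are illustrative.

```python
# spaCy labels that have a direct CoNLL 2003 counterpart
SPACY_TO_CONLL = {
    "ORG": "ORG",
    "PERSON": "PER",
    "GPE": "LOC",
    "FAC": "LOC",
    "LOC": "LOC",
}

def map_tag(ent_iob, ent_type):
    """Build a CoNLL IOB tag from a spaCy IOB prefix and entity type."""
    if ent_iob == "O":
        return "O"
    # Any type missing from the table (DATE, CARDINAL, ...) becomes MISC
    return f"{ent_iob}-{SPACY_TO_CONLL.get(ent_type, 'MISC')}"

print(map_tag("B", "PERSON"))  # B-PER
print(map_tag("I", "GPE"))     # I-LOC
print(map_tag("B", "DATE"))    # B-MISC
print(map_tag("O", ""))        # O
```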

From the ground truth dataset, extract the corpus with all plain text sentences

In [5]:
corpus = [" ".join([tup[0] for tup in gt_sentence]) for gt_sentence in gt]

## Task 1: SpaCy NER evaluation
Evaluate spaCy NER on CoNLL 2003 data (provided)


spaCy's tokenization differs from the one provided in the dataset.
I checked `alignment.x2y.lengths` and verified that spaCy tokens only ever need to be merged, never split, to match the dataset tokens.

In [6]:
docs = []
for gt_sentence in gt:
  # Ground-truth token texts; each tuple is (token, POS, chunk, entity)
  gt_tokens = [tup[0] for tup in gt_sentence]
  # Create Doc object and extract tokens
  doc = nlp(" ".join(gt_tokens))
  doc_tokens = [t.text for t in doc]
  
  # Get the alignment: .y2x.lengths tells how many spaCy tokens map to each ground-truth token
  # .x2y.lengths is all ones here, i.e. no spaCy token needs to be split
  alignment = Alignment.from_strings(doc_tokens, gt_tokens)
  # Merge together tokens to reflect ground truth tokenization
  with doc.retokenize() as retokenizer:
    doc_idx = 0
    for length in alignment.y2x.lengths:
      if length > 1:
        retokenizer.merge(doc[doc_idx:doc_idx+length])
      doc_idx += length

  # Store the reference entity tag on each (now aligned) token
  for token, ref in zip(doc, gt_sentence):
    token._.ent_ref = ref[3]
  docs.append(doc)
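
To make the merge step concrete, the sketch below recomputes by hand the kind of information read from `alignment.y2x.lengths` above: for each reference token, how many consecutive finer tokens spell it out. It is a simplified stand-in written in plain Python, not spaCy's alignment algorithm, and it assumes (as verified for this dataset) that tokens only need merging. The function name `merge_lengths` and the example tokens are hypothetical.

```python
def merge_lengths(fine_tokens, coarse_tokens):
    """For each coarse token, count how many consecutive fine tokens spell it out."""
    lengths = []
    i = 0
    for coarse in coarse_tokens:
        built, count = "", 0
        while built != coarse:
            # Fail loudly if the fine tokens cannot be merged into this coarse token
            if i >= len(fine_tokens) or not coarse.startswith(built + fine_tokens[i]):
                raise ValueError(f"cannot rebuild {coarse!r} by merging fine tokens")
            built += fine_tokens[i]
            i += 1
            count += 1
        lengths.append(count)
    return lengths

fine = ["S.", "AFRICA", "wins", "3", "-", "0"]  # hypothetical finer tokenization
coarse = ["S.AFRICA", "wins", "3-0"]             # hypothetical reference tokenization
print(merge_lengths(fine, coarse))  # [2, 1, 3]
```

Every length greater than 1 corresponds to one `retokenizer.merge` call in the loop above.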

### Part 1.1: Token-level performance
Report token-level performance (per class and total)
  - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
  - to get per-class and total token-level performance, use scikit-learn's classification report, as in the lab on evaluation (per-class accuracy does not need to be computed, since no such thing exists)


In [7]:
def token_entities(docs):
  """Extract token-level predicted and reference Named Entities, as requested by classification_report"""
  token_NE_ref = []
  token_NE_pred = []

  for doc in docs:
    for token in doc:
      token_NE_ref.append(token._.ent_ref)
      token_NE_pred.append(to_ref_entity(token))
  return token_NE_ref, token_NE_pred

In [8]:
# Print the results
NE_ref, NE_pred = token_entities(docs)
print(classification_report(NE_ref, NE_pred))
print('='*80)
# Optional Confusion Matrix
y_actu = pd.Series(NE_ref, name='Actual')
y_pred = pd.Series(NE_pred, name='Predicted')
pd_tbl = pd.crosstab(y_pred, y_actu)
pd_tbl.round(decimals=3)

              precision    recall  f1-score   support

       B-LOC       0.78      0.70      0.74      4047
      B-MISC       0.12      0.58      0.20      1899
       B-ORG       0.48      0.31      0.38      3561
       B-PER       0.80      0.66      0.73      3895
       I-LOC       0.56      0.58      0.57       643
      I-MISC       0.04      0.28      0.08       637
       I-ORG       0.48      0.55      0.51      2037
       I-PER       0.83      0.81      0.82      2680
           O       0.95      0.87      0.91     97193

    accuracy                           0.82    116592
   macro avg       0.56      0.59      0.55    116592
weighted avg       0.90      0.82      0.85    116592



| Predicted (rows) / Actual (columns) | B-LOC | B-MISC | B-ORG | B-PER | I-LOC | I-MISC | I-ORG | I-PER | O |
|---|---|---|---|---|---|---|---|---|---|
| B-LOC | 2839 | 59 | 442 | 117 | 7 | 7 | 27 | 9 | 119 |
| B-MISC | 86 | 1103 | 74 | 46 | 0 | 34 | 22 | 4 | 7900 |
| B-ORG | 194 | 176 | 1113 | 315 | 1 | 6 | 19 | 3 | 475 |
| B-PER | 95 | 36 | 328 | 2584 | 3 | 8 | 32 | 48 | 81 |
| I-LOC | 80 | 16 | 8 | 3 | 376 | 20 | 98 | 22 | 53 |
| I-MISC | 2 | 58 | 17 | 23 | 11 | 177 | 13 | 24 | 3636 |
| I-ORG | 46 | 61 | 252 | 42 | 87 | 145 | 1114 | 123 | 461 |
| I-PER | 9 | 4 | 10 | 34 | 30 | 22 | 211 | 2167 | 128 |
| O | 696 | 386 | 1317 | 731 | 128 | 218 | 501 | 280 | 84340 |
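
The per-class figures in the classification report can be recovered from this matrix: precision divides the diagonal cell by its row total (everything predicted as that tag), recall divides it by its column total (everything actually carrying that tag). The sketch below does this for `B-MISC`, copying the counts from the matrix above. It also shows that most `B-MISC` predictions are `O` in the reference, which plausibly stems from the catch-all `MISC` branch of `to_ref_entity` applied to spaCy types such as `DATE` or `CARDINAL` that CoNLL leaves untagged.

```python
labels = ["B-LOC", "B-MISC", "B-ORG", "B-PER", "I-LOC", "I-MISC", "I-ORG", "I-PER", "O"]
# Matrix row for tokens predicted as B-MISC (one count per actual label)
predicted_b_misc = [86, 1103, 74, 46, 0, 34, 22, 4, 7900]
# Matrix column for tokens whose actual label is B-MISC (one count per predicted label)
actual_b_misc = [59, 1103, 176, 36, 16, 58, 61, 4, 386]

true_positives = predicted_b_misc[labels.index("B-MISC")]
precision = true_positives / sum(predicted_b_misc)
recall = true_positives / sum(actual_b_misc)
# Fraction of B-MISC predictions that are O in the reference
share_of_o = predicted_b_misc[labels.index("O")] / sum(predicted_b_misc)

print(f"precision={precision:.2f} recall={recall:.2f} O-share={share_of_o:.2f}")
# precision=0.12 recall=0.58 O-share=0.85
```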


### Part 1.2: Chunk-level performance
Report CoNLL chunk-level performance (per class and total):
  - Precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total.
  - To get chunk-level NER performance, you simply need to use conll.py's evaluate, that computes segmentation and labeling performance.
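
Chunk-level scoring differs from token-level scoring in that an entity only counts as correct when its boundaries and its type both match. The sketch below illustrates the idea on a toy sentence. It is a simplified stand-in and makes no claim to match `conll.evaluate` on edge cases; the names `extract_chunks` and `chunk_prf` are illustrative.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) spans encoded by a list of IOB tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ctype
        if start is not None and not continues:
            chunks.add((start, i, ctype))  # close the open chunk
            start, ctype = None, None
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label  # open a new chunk
    return chunks

def chunk_prf(ref_tags, pred_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    ref, pred = extract_chunks(ref_tags), extract_chunks(pred_tags)
    correct = len(ref & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ref = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]  # the PER entity is cut short
print(chunk_prf(ref, pred))  # (0.5, 0.5, 0.5)
```

Three of the four tags are correct at token level, yet only one of the two entities is counted at chunk level, because the truncated `PER` span does not match the reference boundaries.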



In [9]:
def chunk_entities(docs):
  """Transform tokens into (text, iob), as requested by conll.evaluate()"""
  chunk_NE_ref = []
  chunk_NE_pred = []
  
  for doc in docs:
    chunk_NE_pred.append([(t.text, to_ref_entity(t)) for t in doc])
    chunk_NE_ref.append([(t.text, t._.ent_ref) for t in doc])
  
  return chunk_NE_ref, chunk_NE_pred

In [10]:
# Print the results
NE_ref, NE_pred = chunk_entities(docs)
results = conll.evaluate(NE_ref, NE_pred)
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

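| Class | Precision (p) | Recall (r) | F1 (f) | Support (s) |
|---|---|---|---|---|
| PER | 0.784 | 0.647 | 0.709 | 3895 |
| ORG | 0.428 | 0.277 | 0.336 | 3561 |
| LOC | 0.774 | 0.693 | 0.731 | 4047 |
| MISC | 0.112 | 0.548 | 0.186 | 1899 |
| total | 0.399 | 0.549 | 0.462 | 13402 |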


## Task 2: Grouping of Entities
Write a function to group recognized named entities using noun_chunks method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks).

In [11]:
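def grouped_entities(sentence):
  """Return the entity types of a sentence, grouped by noun chunk.

  Entity tokens inside the same noun chunk form one sorted group of types;
  each entity token outside any noun chunk forms a single-type group.
  """
  doc = nlp(sentence)
  n_chunks = list(doc.noun_chunks)
  entities = []
  
  curr_chunk = 0  # Index of the noun chunk being scanned
  curr_token = 0  # Position inside that noun chunk
  chunk_ents = set()  # Entity types seen so far in the current chunk
  for token in doc:
    if curr_chunk < len(n_chunks) and token == n_chunks[curr_chunk][curr_token]:  # Token is next in the noun chunk
      if token.ent_type_ != '':  # Token belongs to an entity
        chunk_ents.add(token.ent_type_)
      curr_token += 1
      if token == n_chunks[curr_chunk][-1]:  # Last token of the current chunk
        if len(chunk_ents) > 0:
          entities.append(sorted(chunk_ents))  # Sorted, so the group does not depend on token order
        curr_chunk += 1  # Move on to the next chunk
        curr_token = 0  # At the first position
        chunk_ents = set()  # Reset the set of the chunk's entity types
    elif token.ent_type_ != '':  # Entity token outside any noun chunk (one group per token)
      entities.append([token.ent_type_])
  return entities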

In [12]:
# Testing the function
grouped_entities("Apple's Steve Jobs died in 2011 in Palo Alto, California.")

[['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
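
The grouping logic can also be exercised without loading a model. The sketch below is a simplified stand-in for `grouped_entities` that takes per-token entity types and noun-chunk spans as plain lists. The annotations are written by hand and are an assumption chosen to be consistent with the output above; they are not actual spaCy output. The name `group_types` is illustrative.

```python
def group_types(ent_types, chunk_spans):
    """Group entity types by chunk.

    ent_types: one entity label per token ('' when the token is not in an entity).
    chunk_spans: non-overlapping (start, end) token spans, end exclusive.
    """
    chunk_of = {i: (start, end) for start, end in chunk_spans for i in range(start, end)}
    groups, current = [], set()
    for i, label in enumerate(ent_types):
        span = chunk_of.get(i)
        if span is None:
            if label:
                groups.append([label])  # entity token outside any chunk
            continue
        if label:
            current.add(label)
        if i == span[1] - 1:  # last token of the chunk
            if current:
                groups.append(sorted(current))
            current = set()
    return groups

tokens = ["Apple", "'s", "Steve", "Jobs", "died", "in", "2011", "in",
          "Palo", "Alto", ",", "California", "."]
ent_types = ["ORG", "", "PERSON", "PERSON", "", "", "DATE", "",
             "GPE", "GPE", "", "GPE", ""]
chunk_spans = [(0, 4), (8, 10), (11, 12)]  # assumed noun chunks
print(group_types(ent_types, chunk_spans))
# [['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
```

Note that a multi-token entity lying outside every chunk yields one group per token, in both this sketch and the function above, which slightly inflates the single-type counts.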

### Part 2.1: Frequency analysis
Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

In [13]:
from collections import Counter

frequencies = Counter()
for sentence in corpus: 
  entities = grouped_entities(sentence)
  for group in entities:
    combination = '-'.join(group)
    frequencies[combination] += 1

In [14]:
for combination, counter in frequencies.most_common():
  print(f"{combination}: {counter}")

DATE: 4714
CARDINAL: 3883
GPE: 3168
PERSON: 2954
ORG: 2419
NORP: 859
MONEY: 627
ORDINAL: 404
TIME: 397
QUANTITY: 246
PERCENT: 226
LOC: 130
EVENT: 113
NORP-PERSON: 107
CARDINAL-PERSON: 106
GPE-PERSON: 92
ORG-PERSON: 71
WORK_OF_ART: 70
PRODUCT: 67
FAC: 65
CARDINAL-ORG: 55
LAW: 51
CARDINAL-NORP: 43
GPE-ORG: 41
CARDINAL-GPE: 33
DATE-ORG: 31
GPE-ORDINAL: 25
NORP-ORDINAL: 21
DATE-GPE: 20
DATE-PERSON: 19
LANGUAGE: 18
NORP-ORG: 17
CARDINAL-DATE: 17
GPE-NORP: 17
DATE-EVENT: 17
DATE-TIME: 16
ORDINAL-PERSON: 13
DATE-NORP: 12
GPE-PRODUCT: 11
ORDINAL-ORG: 10
CARDINAL-ORDINAL: 8
ORG-PRODUCT: 8
LANGUAGE-ORDINAL: 7
ORDINAL-QUANTITY: 6
DATE-ORDINAL: 6
CARDINAL-PRODUCT: 5
EVENT-GPE: 4
GPE-LOC: 4
CARDINAL-GPE-PERSON: 3
DATE-PERCENT: 3
FAC-GPE: 3
GPE-ORDINAL-PERSON: 3
MONEY-ORG-PRODUCT: 3
GPE-NORP-ORG: 3
GPE-ORG-PERSON: 2
LOC-NORP: 2
CARDINAL-QUANTITY: 2
CARDINAL-PERCENT: 2
DATE-PRODUCT: 2
NORP-ORG-PERSON: 2
DATE-WORK_OF_ART: 2
CARDINAL-MONEY-ORG: 2
PERSON-WORK_OF_ART: 2
PERSON-TIME: 2
DATE-GPE-PERSON: 2


## Task 3
One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun compound. Make use of the `compound` dependency relation.

Care is needed when extending entities along compounds, because the extended span could overwrite other entities.

In [15]:
from spacy.tokens import Span

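def expand_entities(doc):
  """Extend each entity span of `doc` over the tokens of its noun compound,
  without crossing neighbouring entities. Modifies `doc.ents` in place."""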
  entities = []
  for ents_i, ent in enumerate(doc.ents):
    ent_start = ent.start
    ent_end = ent.end
    # Tokens in the subtree of the entity's root token, in document order
    subtree = list(ent.root.subtree) 
    search_start = subtree[0].i
    search_end = subtree[-1].i + 1
    # The search should be limited by previous and next entities
    if ents_i > 0 and doc.ents[ents_i - 1].end > search_start:
      search_start = doc.ents[ents_i - 1].end
    if ents_i < (len(doc.ents) - 1) and doc.ents[ents_i + 1].start < search_end:
      search_end = doc.ents[ents_i + 1].start
    # Extend the start of the span to the left
    token = doc[search_start]
    while token.i < ent_start:
      compound_root = token
      while compound_root.dep_ == 'compound':
        compound_root = compound_root.head
      if ent_start <= compound_root.i < ent_end:
        ent_start = token.i
      token = token.nbor()
    # Extend the end of the span to the right
    token = doc[search_end - 1]
    while token.i >= ent_end:
      compound_root = token
      while compound_root.dep_ == 'compound':
        compound_root = compound_root.head
      if ent_start <= compound_root.i < ent_end:
        ent_end = token.i + 1
      token = token.nbor(-1)
    # Add the expanded entity to the list
    entity = Span(doc, ent_start, ent_end, label=ent.label_)
    entities.append(entity)
  # Set the extended entities
  doc.set_ents(entities)

We can directly apply the post-processing to the `docs` objects.

In [16]:
# Application of the post-processing step
for doc in docs:
  expand_entities(doc)

The results are worse: total chunk-level F1 drops from 0.462 to 0.434, and PER is hit hardest (0.709 to 0.603). For instance, "Shimon Peres" is extended to "minister Shimon Peres" because "minister" has a compound relation with "Shimon", which is clearly not the correct extent of the PERSON named entity.
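
The "minister Shimon Peres" case can be reproduced on a hand-written parse. The sketch below is a simplified variant of the expansion that only grows the span over directly adjacent tokens and works on plain lists. The dependency arcs are an assumption made for illustration, not actual spaCy output, and the name `expand_span` is hypothetical.

```python
def expand_span(start, end, heads, deps):
    """Grow the token span [start, end) over adjacent tokens whose chain of
    'compound' arcs leads to a token inside the span."""
    def compound_root(i):
        # Follow compound arcs upwards until a non-compound token is reached
        while deps[i] == "compound":
            i = heads[i]
        return i

    while start > 0 and deps[start - 1] == "compound" and start <= compound_root(start - 1) < end:
        start -= 1
    while end < len(deps) and deps[end] == "compound" and start <= compound_root(end) < end:
        end += 1
    return start, end

tokens = ["minister", "Shimon", "Peres", "spoke"]
heads = [1, 2, 3, 3]  # assumed parse: minister -> Shimon -> Peres -> spoke
deps = ["compound", "compound", "nsubj", "ROOT"]

new_start, new_end = expand_span(1, 3, heads, deps)  # entity "Shimon Peres"
print(tokens[new_start:new_end])  # ['minister', 'Shimon', 'Peres']
```

The title is absorbed into the entity because it is syntactically part of the compound, even though CoNLL annotates only the name as `PER`.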

In [17]:
# Evaluation of the results
NE_ref, NE_pred = token_entities(docs)
print(classification_report(NE_ref, NE_pred))
print('='*80)

NE_ref, NE_pred = chunk_entities(docs)
results = conll.evaluate(NE_ref, NE_pred)
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

              precision    recall  f1-score   support

       B-LOC       0.77      0.69      0.73      4047
      B-MISC       0.12      0.58      0.20      1899
       B-ORG       0.47      0.31      0.37      3561
       B-PER       0.69      0.57      0.62      3895
       I-LOC       0.49      0.58      0.54       643
      I-MISC       0.04      0.28      0.08       637
       I-ORG       0.46      0.55      0.50      2037
       I-PER       0.67      0.82      0.74      2680
           O       0.95      0.86      0.90     97193

    accuracy                           0.81    116592
   macro avg       0.52      0.58      0.52    116592
weighted avg       0.89      0.81      0.84    116592



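| Class | Precision (p) | Recall (r) | F1 (f) | Support (s) |
|---|---|---|---|---|
| PER | 0.666 | 0.55 | 0.603 | 3895 |
| ORG | 0.417 | 0.27 | 0.328 | 3561 |
| LOC | 0.763 | 0.683 | 0.721 | 4047 |
| MISC | 0.112 | 0.545 | 0.185 | 1899 |
| total | 0.375 | 0.515 | 0.434 | 13402 |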
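
To quantify the effect of the post-processing, the chunk-level F1 scores before and after expansion can be compared class by class. The values below are copied from the two chunk-level result tables in this notebook; the variable names are illustrative.

```python
f1_before = {"PER": 0.709, "ORG": 0.336, "LOC": 0.731, "MISC": 0.186, "total": 0.462}
f1_after = {"PER": 0.603, "ORG": 0.328, "LOC": 0.721, "MISC": 0.185, "total": 0.434}

# Change in F1 caused by the compound expansion (negative means worse)
deltas = {label: round(f1_after[label] - f1_before[label], 3) for label in f1_before}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.3f}")
```

Every class loses some F1, and `PER` accounts for most of the drop, which matches the "minister Shimon Peres" example.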
