<a href="https://colab.research.google.com/github/Bentley97/NLU_Second_Assignment/blob/main/SecondAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Second Assignment**
Student:
- Name: Luca
- Surname: Bentivoglio
- Student number: 221246




## **Requirements**
Running the notebook requires some files that are stored in the 'src' directory of the repository.
To import them we can simply clone the repository into the notebook environment and unzip the conll2003 archive, as follows.

In [60]:
%%bash
git clone https://github.com/Bentley97/NLU_Second_Assignment.git

mkdir NLU_Second_Assignment/src/conll2003

unzip -q NLU_Second_Assignment/src/conll2003.zip -d NLU_Second_Assignment/src/conll2003

Cloning into 'NLU_Second_Assignment'...


## **1. Evaluate spaCy NER on CoNLL 2003 data**
This point is divided into two performance evaluations: the first at token level, the second at chunk level.


To compute performance we take the following steps:
- first, read the test.txt file, which contains the corpus and the corresponding ground-truth labels;
- load the English pipeline;
- loop over the sentences:
  - extract the words to produce a sentence
  - extract the references to compose the list of ground-truth labels used in the evaluation
  - pass the sentence to the NER
  - post-process the result to reassemble tokens, because the tokenizer sometimes splits compound words that need to stay together
  - build a list of tuples (text, label) for the sentence, convert its labels from spaCy format to CoNLL format and append it to the list of hypotheses to evaluate
- then convert the lists of lists of tuples into flat lists of labels to feed the scikit-learn evaluation function, and print token-level results per class (for each combination of IOB label and entity tag) and in total;
- finally, evaluate performance at chunk level as well, per class and in total.
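
As a rough illustration of the token-level step, the sketch below computes accuracy and per-label precision and recall by hand on two aligned lists of labels. The toy labels and the helper name `per_label_scores` are assumptions made for this example; the notebook itself relies on scikit-learn's `classification_report`. As in the notebook, only labels that occur in the references are scored.

```python
from collections import Counter

def per_label_scores(refs, hyps):
    """Return {label: (precision, recall)} for two aligned lists of labels."""
    tp, hyp_count, ref_count = Counter(), Counter(), Counter()
    for r, h in zip(refs, hyps):
        ref_count[r] += 1
        hyp_count[h] += 1
        if r == h:
            tp[r] += 1
    scores = {}
    for label in sorted(ref_count):
        # precision is undefined when the label was never predicted; report 0.0
        precision = tp[label] / hyp_count[label] if hyp_count[label] else 0.0
        recall = tp[label] / ref_count[label]
        scores[label] = (precision, recall)
    return scores

# toy labels (hypothetical, not taken from CoNLL 2003)
toy_refs = ["B-PER", "O", "O", "B-LOC"]
toy_hyps = ["B-PER", "O", "O", "B-ORG"]

accuracy = sum(r == h for r, h in zip(toy_refs, toy_hyps)) / len(toy_refs)
scores = per_label_scores(toy_refs, toy_hyps)
print(accuracy)
print(scores)
```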

---

To make the code more readable, I split the main process into functions:

`reassemble_tokens(doc)`

This post-processing function merges tokens that were split by the tokenizer, based on the `whitespace_` attribute.

Input:
- doc ==> spacy Doc element

Output:
- doc ==> spacy Doc element
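
The merging idea can be sketched without spaCy: each token carries a flag saying whether whitespace follows it, and consecutive tokens with no whitespace between them are glued together. The `(text, has_space)` pair representation and the name `merge_on_whitespace` are assumptions made for this illustration; the notebook's function works on a real `Doc` through `doc.retokenize()`.

```python
def merge_on_whitespace(tokens):
    """Merge tokens that are not separated by whitespace.

    tokens: list of (text, has_space) pairs, where has_space tells whether
    whitespace follows the token in the original string.
    """
    merged = []
    buffer = ""
    for text, has_space in tokens:
        buffer += text
        if has_space:
            merged.append(buffer)
            buffer = ""
    if buffer:  # flush the last token, which has no trailing whitespace
        merged.append(buffer)
    return merged

# hypothetical tokenization of "AL-AIN , United" in which the hyphenated word was split
tokens = [("AL", False), ("-", False), ("AIN", True), (",", True), ("United", False)]
merged = merge_on_whitespace(tokens)
print(merged)
```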

---
`convert_labels_into_conll(doc, convert_dict)`

This function simply converts labels according to the mappings of the dict given in input.

Input:
- doc ==> spacy Doc element 
- convert_dict ==> dict (of labels)

Output:
- list of tuples: (text, label), where the label is the IOB prefix joined with the tag (e.g. `B-PER`)
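
A minimal sketch of the conversion on plain tuples, assuming each token is given as `(text, iob, ent_type)`. This triple format, the reduced mapping and the name `to_conll_labels` are hypothetical and only stand in for the spaCy token attributes used in the notebook.

```python
def to_conll_labels(tokens, mapping):
    """Turn (text, iob, ent_type) triples into (text, label) pairs in CoNLL style."""
    result = []
    for text, iob, ent_type in tokens:
        if ent_type == "":
            # tokens outside any entity keep the bare IOB tag ("O")
            result.append((text, iob))
        else:
            result.append((text, iob + "-" + mapping[ent_type]))
    return result

mapping = {"PERSON": "PER", "GPE": "LOC", "DATE": "MISC"}  # subset of the full dict
tokens = [("Steve", "B", "PERSON"), ("Jobs", "I", "PERSON"), ("died", "O", ""),
          ("in", "O", ""), ("2011", "B", "DATE")]
converted = to_conll_labels(tokens, mapping)
print(converted)
```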
---

`convert_in_ordered_list_of_label(l)`

This function converts a list of lists of tuples into a flat list of strings (the labels), preserving the order.

Input:
- l: list of lists of tuples

Output:
- list of strings
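
For illustration, the flattening amounts to a nested comprehension over sentences and tuples. The toy data below is hypothetical.

```python
# two toy sentences, each a list of (text, label) tuples
nested = [
    [("Japan", "B-LOC"), ("won", "O")],
    [("Nadim", "B-PER"), ("Ladki", "I-PER")],
]

# keep only the label of every tuple, sentence after sentence
flat = [label for sentence in nested for _, label in sentence]
print(flat)
```

Keeping the sentence order and the token order unchanged matters here, because hypotheses and references are later compared position by position.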

---

`build_references(sentence)`

This function builds the ground truth: given a sentence in the CoNLL file format, it returns a list of tuples (text, label).

Input:
- sentence: list of strings

Output:
- list of tuples
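
A small sketch of this step, assuming (as the notebook's code suggests) that each row is a one-element tuple holding a string with four space-separated columns: word, POS tag, syntactic chunk tag and NER tag. The name `rows_to_references` and the sample rows are made up for the example.

```python
def rows_to_references(rows):
    """Build (text, label) pairs from rows of the form 'word POS chunk NER'."""
    references = []
    for row in rows:
        word, _pos, _chunk, ner = row[0].split(" ")
        references.append((word, ner))
    return references

# sample rows in the CoNLL 2003 column layout
rows = [("SOCCER NN B-NP O",), ("JAPAN NNP B-NP B-LOC",)]
references = rows_to_references(rows)
print(references)
```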



In [152]:
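### POST-PROCESS: reassemble tokens
def reassemble_tokens(doc):
  i = 0
  j = -1  # start index of the current run of tokens to merge (-1 = no open run)
  doc_length = len(doc)
  while i < doc_length:
    # a token with no trailing whitespace (and not the last one) is glued to the next token
    if doc[i].whitespace_ == "" and i != doc_length - 1:
      if j == -1:
        j = i
    elif j != -1:
      # close the run: merge tokens j..i into a single token
      with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[j:i+1])
      doc_length -= i-j
      i = j
      j = -1

    i += 1

  return doc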

### convert labels from the spacy format to the conll format
def convert_labels_into_conll(doc, convert_dict):
  temp_hyp = []
  for token in doc:
    if token.ent_type_ == "":
      temp_hyp.append((token.text, token.ent_iob_))
    else:
      temp_hyp.append((token.text, token.ent_iob_+"-"+convert_dict[token.ent_type_]))
    
  return temp_hyp

### convert a list of lists of tuples into a flat list of strings (labels), preserving the order
def convert_in_ordered_list_of_label(l):
  return [tup[1] for sent in l for tup in sent]

### builds a list of (text, label) tuples from a sentence in CoNLL format
### NOTE: the rows are read from the loop variable `sent` of the calling cell; the argument itself is not used
def build_references(sentence):
  return [(e0,e3) for elem in sent for e0,e1,e2,e3 in [elem[0].split(" ")]]



In [146]:
import os
import sys
sys.path.insert(0, os.path.abspath('NLU_Second_Assignment/src'))

from conll import read_corpus_conll
from conll import evaluate

import spacy
from spacy.tokens import Doc
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import pandas as pd

dev_url = "NLU_Second_Assignment/src/conll2003/dev.txt"
trn_url = "NLU_Second_Assignment/src/conll2003/train.txt"
tst_url = "NLU_Second_Assignment/src/conll2003/test.txt"


raw_corpus = read_corpus_conll(tst_url) # reading the file

# remove -DOCSTART- lines (build a new list: removing items while iterating over the same list can skip elements)
raw_corpus = [r for r in raw_corpus if r[0][0].split(" ")[0] != "-DOCSTART-"]

### loading the english pipeline
nlp = spacy.load("en_core_web_sm")

hyps = []
refs = []

### loop over all sentences in the corpus
for sent in raw_corpus:
  sentence = " ".join([elem[0].split(" ")[0] for elem in sent])

  ### build the list of references for the sentence (from its CoNLL rows) and append it to the references of the whole corpus
  refs.append(build_references(sent))

  ### call the spaCy NER
  doc = nlp(sentence)

  ### POST-PROCESS: reassemble tokens
  doc = reassemble_tokens(doc)

  ### build the list of (text, label) tuples for the sentence, converting labels to CoNLL format, and append it to the list of hypotheses
  convert_dict = {
      "PERSON": "PER",
      "ORG": "ORG",
      "LOC": "LOC",
      "GPE": "LOC",
      "FAC": "LOC",
      "CARDINAL": "MISC",
      "DATE": "MISC",
      "EVENT": "MISC",
      "LANGUAGE": "MISC",
      "LAW": "MISC",
      "MONEY": "MISC",
      "NORP": "MISC",
      "ORDINAL": "MISC",
      "PERCENT": "MISC",
      "PRODUCT": "MISC",
      "QUANTITY": "MISC",
      "TIME": "MISC",
      "WORK_OF_ART": "MISC"
  }
  hyps.append(convert_labels_into_conll(doc, convert_dict))
  
 
### adapt hypotheses and references to the scikit-learn input format
hyps_for_sklearn = convert_in_ordered_list_of_label(hyps)
refs_for_sklearn = convert_in_ordered_list_of_label(refs)

### extract the labels present in the references
labels = sorted(list(set(refs_for_sklearn)))


### the total accuracy appears in the report's "accuracy" row
print("PERFORMANCE token-level:")
print(classification_report(refs_for_sklearn, hyps_for_sklearn, labels=labels, digits=3))

print("Total accuracy: ",accuracy_score(refs_for_sklearn,hyps_for_sklearn))


results = evaluate(refs, hyps)

print("")
print("")
print("PERFORMANCE chunk-level:")
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

PERFORMANCE token-level:
              precision    recall  f1-score   support

       B-LOC      0.786     0.726     0.755      1668
      B-MISC      0.091     0.581     0.157       702
       B-ORG      0.523     0.337     0.410      1661
       B-PER      0.780     0.623     0.693      1617
       I-LOC      0.537     0.591     0.563       257
      I-MISC      0.055     0.426     0.097       216
       I-ORG      0.458     0.550     0.499       835
       I-PER      0.736     0.776     0.755      1156
           O      0.950     0.839     0.891     38323

    accuracy                          0.796     46435
   macro avg      0.546     0.606     0.536     46435
weighted avg      0.889     0.796     0.835     46435

Total accuracy:  0.7956929040594379


PERFORMANCE chunk-level:


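| class | p (precision) | r (recall) | f (F1) | s (support) |
|---|---|---|---|---|
| ORG | 0.464 | 0.299 | 0.363 | 1661 |
| MISC | 0.087 | 0.558 | 0.151 | 702 |
| PER | 0.74 | 0.592 | 0.658 | 1617 |
| LOC | 0.777 | 0.718 | 0.746 | 1668 |
| total | 0.363 | 0.539 | 0.433 | 5648 |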
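
Chunk-level evaluation scores whole entities rather than single tokens. The sketch below assumes the usual CoNLL convention, where a predicted chunk counts as correct only if both its boundaries and its type match the reference exactly; whether `evaluate` in `conll.py` handles every edge case the same way is not checked here. `extract_chunks` and `chunk_prf` are names invented for this example.

```python
def extract_chunks(labels):
    """Return a set of (start, end, type) spans from a list of IOB labels (end exclusive)."""
    chunks, start, ctype = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # the sentinel closes a trailing chunk
        prefix, _, tag = label.partition("-")
        # an open chunk ends at O, at a new B-, or at an I- with a different type
        if start is not None and (prefix != "I" or tag != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # a chunk starts at B-, or at an I- that does not continue an open chunk
        if prefix in ("B", "I") and start is None:
            start, ctype = i, tag
    return chunks

def chunk_prf(ref_labels, hyp_labels):
    """Precision, recall and F1 over exactly matching chunks."""
    ref_chunks = extract_chunks(ref_labels)
    hyp_chunks = extract_chunks(hyp_labels)
    correct = len(ref_chunks & hyp_chunks)
    p = correct / len(hyp_chunks) if hyp_chunks else 0.0
    r = correct / len(ref_chunks) if ref_chunks else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy example: the person chunk is cut short, so only the location chunk matches
ref = ["B-PER", "I-PER", "O", "B-LOC"]
hyp = ["B-PER", "O", "O", "B-LOC"]
print(chunk_prf(ref, hyp))
```

In the toy example three of four tokens are labelled correctly, yet only one of two chunks matches, which is why chunk-level scores tend to be lower than token-level ones.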


## **2. Grouping of entities**

In [63]:
def grouping_entities(doc):
  retlist = []
  ent_chunked = []
  
  for ent in doc.ents:
    in_chunk = False
    if ent[0].idx not in ent_chunked:
      for chunk in doc.noun_chunks:
        if len(chunk.ents) != 0:
          if chunk.ents[0].start_char == ent[0].idx:
            in_chunk = True
            temp_result = []
            for ce in chunk.ents:
              temp_result.append(ce.label_)
              ent_chunked.append(ce[0].idx)
            break
      if in_chunk == False:
        retlist.append([ent.label_])
      else:
        retlist.append(temp_result)
  
  return retlist
  

In [64]:
import spacy

test_sentence = "Apple's Steve Jobs died in 2011 in Palo Alto , California . Autonomous cars shift insurance liability toward manufacturers in 1996"

nlp = spacy.load("en_core_web_sm")
doc = nlp(test_sentence)
groups_of_entities = grouping_entities(doc)
print("Test grouping function")
print(groups_of_entities)


Test grouping function
[['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE'], ['DATE']]
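
The grouping logic can be pictured on plain spans, without spaCy. In this sketch entities are `(start, end, label)` triples and noun chunks are `(start, end)` pairs, with token offsets and exclusive ends; both representations, the offsets and the name `group_by_chunk` are assumptions made for illustration. Entities that fall inside the same noun chunk end up in one group, and every other entity forms a group of its own.

```python
def group_by_chunk(entities, chunks):
    """Group entity labels by the noun chunk that contains them."""
    groups = []
    grouped = set()
    for ent in entities:
        if ent in grouped:
            continue  # already emitted as part of an earlier group
        group = [ent]
        for c_start, c_end in chunks:
            # entities that lie completely inside this chunk
            inside = [e for e in entities if c_start <= e[0] and e[1] <= c_end]
            if ent in inside:
                group = inside
                break
        grouped.update(group)
        groups.append([label for _, _, label in group])
    return groups

# hypothetical spans for "Apple 's Steve Jobs died in 2011"
entities = [(0, 1, "ORG"), (2, 4, "PERSON"), (6, 7, "DATE")]
chunks = [(0, 4)]  # "Apple's Steve Jobs"
result = group_by_chunk(entities, chunks)
print(result)
```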


In [65]:
from collections import defaultdict

### the label order inside a group is kept as is (groups are not sorted), because a different order gives the chunk a different meaning (although I am not sure which statistic the professor expects)
def counting(groups):
  dict_group = defaultdict(int)

  for g in groups:
    key = ", ".join(g)
    dict_group[key] += 1

  return dict_group
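
### The same tally can also be sketched with `collections.Counter`, whose
### `most_common()` returns the entries sorted by decreasing frequency.
### The toy groups below are the ones printed by the test cell above;
### the variable names are chosen only for this example.

```python
from collections import Counter

toy_groups = [["ORG", "PERSON"], ["DATE"], ["GPE"], ["GPE"], ["DATE"]]

# join the labels of each group into one key, keeping their order
toy_counts = Counter(", ".join(g) for g in toy_groups)

for combination, frequency in toy_counts.most_common():
    print(combination, frequency)
```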


In [66]:
import spacy 
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

groups = []
for sent in raw_corpus:
  sentence = " ".join([elem[0].split(" ")[0] for elem in sent])
  
  doc = nlp(sentence)
  
  groups.extend(grouping_entities(doc))

counts = counting(groups)
sort_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

print("NE groups frequencies:")
for comb in sort_counts:
  print(comb)

NE groups frequencies:
('CARDINAL', 2116)
('GPE', 1346)
('DATE', 1140)
('PERSON', 1105)
('ORG', 955)
('NORP', 308)
('MONEY', 151)
('ORDINAL', 117)
('TIME', 92)
('PERCENT', 86)
('QUANTITY', 82)
('EVENT', 58)
('LOC', 57)
('NORP, PERSON', 47)
('CARDINAL, PERSON', 45)
('GPE, PERSON', 26)
('PRODUCT', 26)
('ORG, PERSON', 25)
('FAC', 21)
('CARDINAL, NORP', 16)
('CARDINAL, ORG', 13)
('WORK_OF_ART', 12)
('GPE, ORG', 11)
('GPE, GPE', 11)
('CARDINAL, GPE', 11)
('PERSON, PERSON', 10)
('DATE, EVENT', 9)
('ORG, ORG', 9)
('LANGUAGE', 8)
('LAW', 8)
('NORP, ORG', 7)
('PERSON, GPE', 6)
('DATE, ORG', 6)
('GPE, CARDINAL', 5)
('DATE, TIME', 5)
('DATE, NORP', 5)
('ORG, GPE', 4)
('CARDINAL, CARDINAL', 4)
('NORP, NORP', 4)
('GPE, NORP', 4)
('ORG, DATE', 4)
('CARDINAL, DATE', 3)
('ORDINAL, PERSON', 3)
('GPE, ORDINAL', 3)
('NORP, GPE', 3)
('ORDINAL, CARDINAL', 2)
('DATE, PRODUCT', 2)
('NORP, DATE', 2)
('ORG, NORP', 2)
('MONEY, ORG', 2)
('PERSON, ORG', 2)
('ORG, LOC', 2)
('DATE, PERSON', 2)
('ORDINAL, EVENT', 2)

In [190]:
convert_dict = {
      "PERSON": "PER",
      "ORG": "ORG",
      "LOC": "LOC",
      "GPE": "LOC",
      "FAC": "LOC",
      "CARDINAL": "MISC",
      "DATE": "MISC",
      "EVENT": "MISC",
      "LANGUAGE": "MISC",
      "LAW": "MISC",
      "MONEY": "MISC",
      "NORP": "MISC",
      "ORDINAL": "MISC",
      "PERCENT": "MISC",
      "PRODUCT": "MISC",
      "QUANTITY": "MISC",
      "TIME": "MISC",
      "WORK_OF_ART": "MISC"
  }

def extend_NE_to_compounds(doc):
  ### build the hypotheses to update, keeping the spaCy labels (they are converted to CoNLL format at the end)
  h_update = []
  for token in doc:
    if token.ent_type_ == "":
      tup = (token.text, token.ent_iob_)
    else:
      tup = (token.text, token.ent_iob_+"-"+token.ent_type_)

    h_update.append(tup)

  #print("BEFORE")
  #print(h_update)
  #print("")
  #print("")

  #print(doc[2].text, doc[2].dep_, doc[2].ent_type_, "!"+doc[2].whitespace_+"!")
  ### extract compounds
  compounds = []
  head_type = ""
  comp = []
  for token in doc:
    if token.dep_ == "compound":
      headT = token.head
      comp.append(token)
    else:
      if comp:
        if token == headT:  # add the head only when it immediately follows the compound tokens
          comp.append(headT)
        
        # check validity
        #j = 0
        #for n in range(len(comp)+j):
          #if comp[n+1].i != comp[n].i+1:
            #if doc[comp[n].i+1].dep_ == "compound":
              #comp.insert(n+1,doc[comp[n].i+1])
              #j += 1
            #else:
              #comp = comp[:n]
        

        compounds.append(comp)
      comp = []
      headT = ""

  #print("COMPOUNDS: ",compounds)

  ### update hypotheses
  for ent in doc.ents:
    for c in compounds:
      if ent[0].i in [c_elem.i for c_elem in c]:
        
        tok = c[0]
        head_type = ""
        if tok.ent_type_ != "":
            head_type = tok.ent_type_
        while tok.dep_ == "compound":
          if tok.head.ent_type_ != "":
            head_type = tok.head.ent_type_
          tok = tok.head
        #print(c[0], head_type)


        for i in range(c[0].i, (c[-1].i)+1):
          if i == c[0].i:
            #print("1 ",c," ",[e.i for e in c]," | ",i, c[0].i)
            if head_type != "":
              h_update[i] = (c[i-c[0].i].text, "B-"+head_type)
            else:
              h_update[i] = (c[i-c[0].i].text, "B-"+ent.label_)
          else:
            #print("2 ",c," ",[e.i for e in c]," | ",i, c[0].i)
            if head_type != "":
              h_update[i] = (c[i-c[0].i].text, "I-"+head_type)
            else:
              h_update[i] = (c[i-c[0].i].text, "I-"+ent.label_)
  
  #print("AFTER")
  #print(h_update)
  #print("")
  #print("")


  return convert_labels(h_update)


def convert_labels(le):
  converted = []

  for tu in le:
    if tu[1] != "O":
      b = tu[1].split("-")[0] + "-" + convert_dict[tu[1].split("-")[1]]
      converted.append((tu[0], b))
    else:
      converted.append(tu)
  
  return converted


In [120]:
##### NO LONGER NEEDED


import spacy
tests = "RUGBY UNION - CUTTITTA BACK FOR ITALY AFTER A YEAR ."
test_sent = "He said a proposal last month by EU Farm Commissioner Franz Fischler to ban sheep brains ."
test_sent2 = "Apple's Steve Jobs died in 2011 in Palo Alto, California ."

nlp = spacy.load("en_core_web_sm")
doc = nlp(tests)

print(extend_NE_to_compounds(doc))


[('RUGBY', 'B-ORG'), ('UNION', 'I-ORG'), ('-', 'I-ORG'), ('CUTTITTA', 'I-ORG'), ('BACK', 'I-ORG'), ('FOR', 'O'), ('ITALY', 'O'), ('AFTER', 'O'), ('A', 'B-MISC'), ('YEAR', 'I-MISC'), ('.', 'O')]


In [191]:
import spacy

nlp = spacy.load("en_core_web_sm")

hyps_extended = []
refs_extended = []

for sent in raw_corpus:
  sentence = " ".join([elem[0].split(" ")[0] for elem in sent])
  refs_extended.append(build_references(sent))

  doc = nlp(sentence)

  # reassemble tokens according to whitespace_
  doc = reassemble_tokens(doc)

  # extend entity spans to compound spans
  hyps_extended.append(extend_NE_to_compounds(doc))



### adapt hypotheses and references to the scikit-learn input format
hyps_for_sklearn_extended = convert_in_ordered_list_of_label(hyps_extended)
refs_for_sklearn_extended = convert_in_ordered_list_of_label(refs_extended)

### extract the labels present in the references
labels = sorted(list(set(refs_for_sklearn_extended)))


### the total accuracy appears in the report's "accuracy" row
print("PERFORMANCE token-level:")
print(classification_report(refs_for_sklearn_extended, hyps_for_sklearn_extended, labels=labels, digits=3))

print("Total accuracy: ",accuracy_score(refs_for_sklearn_extended,hyps_for_sklearn_extended))


results = evaluate(refs_extended, hyps_extended)

print("")
print("")
print("PERFORMANCE chunk-level:")
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)


PERFORMANCE token-level:
              precision    recall  f1-score   support

       B-LOC      0.769     0.700     0.733      1668
      B-MISC      0.090     0.566     0.155       702
       B-ORG      0.505     0.316     0.389      1661
       B-PER      0.640     0.505     0.565      1617
       I-LOC      0.311     0.599     0.410       257
      I-MISC      0.050     0.454     0.090       216
       I-ORG      0.367     0.547     0.440       835
       I-PER      0.560     0.780     0.652      1156
           O      0.952     0.817     0.879     38323

    accuracy                          0.771     46435
   macro avg      0.472     0.587     0.479     46435
weighted avg      0.878     0.771     0.815     46435

Total accuracy:  0.7712070636373425


PERFORMANCE chunk-level:


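| class | p (precision) | r (recall) | f (F1) | s (support) |
|---|---|---|---|---|
| ORG | 0.358 | 0.224 | 0.276 | 1661 |
| MISC | 0.075 | 0.476 | 0.13 | 702 |
| PER | 0.588 | 0.465 | 0.52 | 1617 |
| LOC | 0.711 | 0.647 | 0.677 | 1668 |
| total | 0.307 | 0.449 | 0.365 | 5648 |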


In [154]:
print(results)
print(refs[0])
print(hyps_extended[0])
print(labels)

{'DATE': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'PRODUCT': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'TIME': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'MONEY': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'NORP': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'ORDINAL': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'PERSON': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'PER': {'p': 0.44537815126050423, 'r': 0.09833024118738404, 'f': 0.16109422492401215, 's': 1617}, 'LANGUAGE': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'GPE': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'FAC': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'LAW': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'ORG': {'p': 0.35803657362848895, 'r': 0.22396146899458158, 'f': 0.27555555555555555, 's': 1661}, 'CARDINAL': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'QUANTITY': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'EVENT': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'PERCENT': {'p': 0.0, 'r': 0, 'f': 0, 's': 0}, 'MISC': {'p': 0.07671699951290795, 'r': 0.44871794871794873, 'f': 0.13103161397670549, 's': 702}, 'LOC': {'p': 0.81444991

In [155]:
for f in hyps_extended:
  print(f)


results = evaluate(refs, hyps_extended)

print("")
print("")
print("PERFORMANCE chunk-level:")
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

[('SOCCER', 'O'), ('-', 'O'), ('JAPAN', 'O'), ('GET', 'B-ORG'), ('LUCKY', 'I-ORG'), ('WIN', 'I-ORG'), (',', 'O'), ('CHINA', 'B-LOC'), ('IN', 'O'), ('SURPRISE', 'O'), ('DEFEAT', 'O'), ('.', 'O')]
[('Nadim', 'O'), ('Ladki', 'O')]
[('AL-AIN', 'B-ORG'), (',', 'O'), ('United', 'B-GPE'), ('Arab', 'I-GPE'), ('Emirates', 'I-GPE'), ('1996-12-06', 'B-MISC')]
[('Japan', 'B-LOC'), ('began', 'O'), ('the', 'O'), ('defence', 'O'), ('of', 'O'), ('their', 'O'), ('Asian', 'B-EVENT'), ('Cup', 'I-EVENT'), ('title', 'I-EVENT'), ('with', 'O'), ('a', 'O'), ('lucky', 'O'), ('2-1', 'B-MISC'), ('win', 'O'), ('against', 'O'), ('Syria', 'B-LOC'), ('in', 'O'), ('a', 'O'), ('Group', 'B-ORG'), ('C', 'I-ORG'), ('championship', 'I-ORG'), ('match', 'I-ORG'), ('on', 'O'), ('Friday', 'B-MISC'), ('.', 'O')]
[('But', 'O'), ('China', 'B-LOC'), ('saw', 'O'), ('their', 'O'), ('luck', 'O'), ('desert', 'O'), ('them', 'O'), ('in', 'O'), ('the', 'O'), ('second', 'B-MISC'), ('match', 'O'), ('of', 'O'), ('the', 'O'), ('group', 'O')

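| class | p (precision) | r (recall) | f (F1) | s (support) |
|---|---|---|---|---|
| DATE | 0.0 | 0.0 | 0.0 | 0 |
| PRODUCT | 0.0 | 0.0 | 0.0 | 0 |
| TIME | 0.0 | 0.0 | 0.0 | 0 |
| MONEY | 0.0 | 0.0 | 0.0 | 0 |
| NORP | 0.0 | 0.0 | 0.0 | 0 |
| ORDINAL | 0.0 | 0.0 | 0.0 | 0 |
| PERSON | 0.0 | 0.0 | 0.0 | 0 |
| PER | 0.445 | 0.098 | 0.161 | 1617 |
| LANGUAGE | 0.0 | 0.0 | 0.0 | 0 |
| GPE | 0.0 | 0.0 | 0.0 | 0 |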
