<a href="https://colab.research.google.com/github/FrancescoMinchio/NLU_Assignement_2/blob/main/Second_Assignement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SECOND_ASSIGNMENT

*   Student name: Francesco Minchio
*   Student contact: francesco.minchio@studenti.unitn.it
*   Student referral: 225269

### **Basic imports**

* en_core_web_sm, loaded with spacy.load (text processing). Two pipelines are loaded: nlp, whose tokenizer is replaced in Task 0, and nlp_standard, which keeps the default tokenizer.
* pandas (use still to be assessed)
* Conll.py module
* read_corpus_conll function to load the dataset: it returns a list of sentences, where each sentence is a list of tuples. Each tuple corresponds to one line of the file and holds the fields of that line (the token and its labels), split on the separator fs. A blank line in the file marks the end of a sentence.



In [1]:
import spacy
from spacy.tokens import Doc
from sklearn.metrics import classification_report
import pandas as pd

import re

nlp = spacy.load("en_core_web_sm")
nlp_standard = spacy.load("en_core_web_sm")

def read_corpus_conll(corpus_file, fs="\t"):
    featn = None  # number of features for consistency check
    sents = []  # list to hold words list sequences
    words = []  # list to hold feature tuples

    with open(corpus_file) as corpus:  # close the file once it has been read
        for line in corpus:
            line = line.strip()
            if len(line) > 0:
                feats = tuple(line.split(fs))
                if not featn:
                    featn = len(feats)
                elif featn != len(feats) and len(feats) != 0:
                    raise ValueError("Unexpected number of columns {} ({})".format(len(feats), featn))

                words.append(feats)
            else:
                if len(words) > 0:
                    sents.append(words)
                    words = []
    if len(words) > 0:  # keep the last sentence when the file does not end with a blank line
        sents.append(words)
    return sents
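
To make the returned structure concrete, the sketch below runs a condensed variant of the reader on a two-sentence sample written to a temporary file. The helper name `read_sentences` and the sample lines are made up for illustration, and the sample assumes the space-separated four-column layout (token, POS tag, chunk tag, NER tag). With the default `fs="\t"` such a line is not split, so each tuple holds a single string; this is why the later cells call `element[0].split()`.

```python
import os
import tempfile


def read_sentences(corpus_file, fs="\t"):
    # condensed variant of read_corpus_conll, without the column-count check
    sents, words = [], []
    with open(corpus_file) as handle:
        for line in handle:
            line = line.strip()
            if line:
                words.append(tuple(line.split(fs)))
            elif words:
                sents.append(words)
                words = []
    if words:
        sents.append(words)
    return sents


# hypothetical sample in a space-separated, four-column layout
sample = (
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
    "\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

sentences = read_sentences(path)
os.remove(path)

print(len(sentences))        # 2
print(sentences[0][0])       # ('EU NNP B-NP B-ORG',)
fields = sentences[0][0][0].split()
print(fields)                # ['EU', 'NNP', 'B-NP', 'B-ORG']
```

If the corpus were tab-separated instead, each tuple would already contain the four fields and the extra `split()` would no longer be needed.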

### **TASK 0:**

* Evaluate spaCy NER on CoNLL 2003 data (provided).

Features:

* WhitespaceTokenizer: a custom tokenizer that splits the input string on single spaces and returns a spaCy Doc built from the resulting words, so that the spaCy tokens line up one-to-one with the CoNLL tokens;
* spaCy vocab.

Next step: a mapping that converts spaCy entity labels into CoNLL tags:

In [2]:
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)


nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

spacy_to_conll_map = {
    "PERSON": "PER",
    "NORP": "MISC",
    "FACILITY": "ORG",
    "FAC": "MISC",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
    "DATE": "MISC",
    "TIME": "MISC",
    "PERCENT": "MISC",
    "MONEY": "MISC",
    "QUANTITY": "MISC",
    "ORDINAL": "MISC",
    "CARDINAL": "MISC",
    "PER": "PER",
    "MISC": "MISC",
    "EVT": "MISC",
    "PROD": "MISC",
    "DRV": "MISC",
    "GPE_LOC": "LOC",
    "GPE_ORG": "ORG",
    "": ""
}
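
As a small illustration of how the mapping is used, the sketch below combines an IOB flag with the mapped label to rebuild a CoNLL-style tag. The helper `to_conll_tag` is hypothetical, the dictionary is only a subset of `spacy_to_conll_map` repeated so the example runs on its own, and the input pairs are hand-written rather than real model output.

```python
# subset of spacy_to_conll_map, repeated here so the example is self-contained
label_map = {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "DATE": "MISC", "": ""}


def to_conll_tag(iob, ent_type):
    # hypothetical helper: join the IOB flag and the mapped label, e.g. "B" + "GPE" -> "B-LOC"
    mapped = label_map[ent_type]
    return iob + "-" + mapped if mapped else iob


# (ent_iob_, ent_type_) pairs as spaCy might assign them; hand-written for illustration
predicted = [("B", "PERSON"), ("I", "PERSON"), ("O", ""), ("B", "GPE"), ("B", "DATE")]
tags = [to_conll_tag(iob, ent_type) for iob, ent_type in predicted]
print(tags)  # ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-MISC']
```

One consequence worth keeping in mind: labels such as DATE are mapped to MISC, so if the gold data leaves those tokens untagged, they will be counted as mismatches.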

### **TASK 0.1:**

* Report token-level performance (per class and total): accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
* Report CoNLL chunk-level performance (per class and total): precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total

The function token_level_performance implements this evaluation.
The dataset is read with read_corpus_conll into conll_data, a list of lists: the outer list holds the sentences, and each inner list holds the lines of one sentence.
For every sentence the tokens are joined into a single string, separated by " ". The string is passed to spaCy and the predicted tag of each token is compared with its CoNLL tag.




In [3]:
def token_level_performance(conll_data):   #conll_data: list of sentences, each a list of line tuples
    total_tokens = 0
    correctly_classified = 0
    for sentence in conll_data:
        token_array = []
        named_entity_tag_array = []
        for element in sentence:
            # every line holds four space-separated fields: token, POS tag, chunk tag, NER tag
            fields = element[0].split()
            token_array.append(fields[0])
            named_entity_tag_array.append(fields[3])
        doc = nlp(" ".join(token_array))
        for token_index, token in enumerate(doc):
            total_tokens += 1
            # rebuild a CoNLL-style tag such as "B-PER"; tokens outside entities stay "O"
            ent_type_converted_to_conll = token.ent_iob_
            if spacy_to_conll_map[token.ent_type_] != "":
                ent_type_converted_to_conll += "-" + \
                    spacy_to_conll_map[token.ent_type_]
            if ent_type_converted_to_conll == named_entity_tag_array[token_index]:
                correctly_classified += 1
    print(total_tokens, correctly_classified)
    return correctly_classified / total_tokens
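
The function above returns a single overall accuracy, while the task also asks for per-class figures. A minimal sketch of one way to obtain them is shown below, assuming the gold and predicted tags have already been collected into two aligned lists. The function name and the sample tags are hypothetical and only illustrate the counting.

```python
from collections import Counter


def per_class_token_accuracy(gold_tags, predicted_tags):
    # for every gold tag, count how many tokens received exactly that tag
    totals, hits = Counter(), Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        totals[gold] += 1
        if gold == predicted:
            hits[gold] += 1
    per_class = {tag: hits[tag] / totals[tag] for tag in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_class, overall


# hand-written tags for illustration only
gold = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-ORG", "O", "B-PER", "O", "O", "B-ORG"]
per_class, overall = per_class_token_accuracy(gold, pred)
print(per_class)
print(overall)
```

Here the accuracy of a class is measured over the tokens that carry that gold tag, so it shows which tags are missed most often.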

### **TASK 0.2:**

* Report CoNLL chunk-level performance (per class and total);
* Precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total.

The chunks are rebuilt from the token and tag arrays read from the CoNLL data. Each entity recognized by spaCy is then matched against these chunks by its text and, when a match is found, its label is compared with the CoNLL label.

In [4]:
def chunk_level_performance(conll_data):
    effective_class_counts = {
        "MISC": 0,
        "ORG": 0,
        "PER": 0,
        "LOC": 0,
    }
    recognized_class_counts = {
        "MISC": 0,
        "ORG": 0,
        "PER": 0,
        "LOC": 0,
    }
    chunk_counter = 0
    recognized_chunk_counter = 0
    for sentence in conll_data:
        token_array = []
        named_entity_tag_array = []
        for element in sentence:
            # every line holds four space-separated fields: token, POS tag, chunk tag, NER tag
            fields = element[0].split()
            token_array.append(fields[0])
            named_entity_tag_array.append(fields[3])
        doc = nlp(" ".join(token_array))
        actual_chunks, actual_chunk_label_array, non_empty_chunks, current_effective_class_counts = get_chunks(
            token_array, named_entity_tag_array)
        effective_class_counts["MISC"] += current_effective_class_counts["MISC"]
        effective_class_counts["ORG"] += current_effective_class_counts["ORG"]
        effective_class_counts["PER"] += current_effective_class_counts["PER"]
        effective_class_counts["LOC"] += current_effective_class_counts["LOC"]
        chunk_counter += non_empty_chunks
        for ent in doc.ents:
            if ent.text in actual_chunks:
                recognized_chunk_counter += 1
                actual_chunk_index = actual_chunks.index(ent.text)
                if spacy_to_conll_map[ent.label_] == actual_chunk_label_array[actual_chunk_index]:
                    recognized_class_counts[actual_chunk_label_array[actual_chunk_index]] += 1
    return effective_class_counts, recognized_class_counts, chunk_counter, recognized_chunk_counter


def get_chunks(token_array, named_entity_tag_array):
    effective_class_counts = {
        "MISC": 0,
        "ORG": 0,
        "PER": 0,
        "LOC": 0,
    }
    chunk_array = []
    chunk_label_array = []
    current_chunk = ""
    total_chunks = 0
    for token_index, token in enumerate(token_array):
        if named_entity_tag_array[token_index][0] == "B":
            # a "B" tag opens a new chunk and closes the previous one, if any
            effective_class_counts[named_entity_tag_array[token_index][2:]] += 1
            total_chunks += 1
            chunk_label_array.append(named_entity_tag_array[token_index][2:])
            if current_chunk != "":
                chunk_array.append(current_chunk)
            current_chunk = token
        if named_entity_tag_array[token_index][0] == "I":
            current_chunk += " " + token
    if current_chunk != "":
        chunk_array.append(current_chunk)
    return chunk_array, chunk_label_array, total_chunks, effective_class_counts
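
The task statement asks for precision, recall and f-measure, whereas the cell above returns raw counts. The sketch below shows one possible way to compute the three scores per class and in total, by comparing entity spans (start, end, label) extracted from gold and predicted tag sequences. It assumes IOB2 tags, where every entity starts with a `B-` tag, and all function names and sample data are hypothetical.

```python
from collections import Counter


def extract_spans(tags):
    # turn an IOB2 tag sequence into a set of (start, end, label) spans
    spans, start, label = set(), None, None
    for index, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        if start is not None and not (tag.startswith("I-") and tag[2:] == label):
            spans.add((start, index, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = index, tag[2:]
    return spans


def prf(correct, predicted, gold):
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def chunk_scores(gold_sentences, predicted_sentences):
    gold_count, pred_count, correct = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold_sentences, predicted_sentences):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        gold_count.update(label for _, _, label in gold_spans)
        pred_count.update(label for _, _, label in pred_spans)
        # a span is correct only if boundaries and label both match
        correct.update(label for _, _, label in gold_spans & pred_spans)
    labels = set(gold_count) | set(pred_count)
    scores = {label: prf(correct[label], pred_count[label], gold_count[label]) for label in labels}
    scores["total"] = prf(sum(correct.values()), sum(pred_count.values()), sum(gold_count.values()))
    return scores


# hand-written tags for illustration only
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
scores = chunk_scores(gold, pred)
print(scores["PER"])    # (1.0, 1.0, 1.0)
print(scores["total"])  # (0.5, 0.5, 0.5)
```

Matching on token positions rather than on the chunk text avoids confusing two different mentions that happen to share the same surface string.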

### **TASK 1:**

* Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

The input is a list of sentences (strings). The output is a list of lists, where each inner list contains the labels of the entities that belong to the same noun chunk.

OPERATION PRINCIPLE: 2 STEPS

* [doc.noun_chunks]. For each noun chunk, a list is built with the labels of the tokens that start an entity; tokens without an entity label, or inside an entity that has already been counted, are skipped. Each token of the chunk is then added to the chunks dictionary as a key, whose value is a list with two elements: the list of entity labels found in the chunk, and the chunk itself.

* [doc.ents]. If the first token of an entity is one of the dictionary keys and its chunk has not yet been added to the output, the list of entity labels of that noun chunk is appended to group_ent and the chunk is appended to chunk_done, so that it is not repeated when other entities of the same chunk are found. An entity whose first token belongs to no noun chunk is appended on its own.



In [5]:
def grouping_entities(text):
    chunks=dict()                       #maps each noun-chunk token to [entity labels of the chunk, chunk]
    chunk_done=[]                       #noun chunks whose entities were already added to the final output
    group_ent=[]                        #list to store the output
    for sentence in text:
        doc=nlp_standard(sentence)      #parse with the default tokenizer (nlp_spacy was never defined)
        for chunk in doc.noun_chunks:
            l=[]
            for c in chunk:
                if c.ent_type_!="" and c.ent_iob_=='B':   #count each entity once, at its first token
                    l.append(c.ent_type_)
            for ch in chunk:
                chunks[ch]=[l, chunk]
        for ent in doc.ents:
            if ent[0] in chunks.keys() and chunks[ent[0]][1] not in chunk_done:
                group_ent.append(chunks[ent[0]][0])
                chunk_done.append(chunks[ent[0]][1])
            elif ent[0] not in chunks.keys():
                group_ent.append([ent[0].ent_type_])      #entity outside any noun chunk
    return group_ent                                   #list with the grouped entities
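
For the second half of the task, the analysis of the most frequent combinations, a short sketch with `collections.Counter` is given below. The function name is hypothetical and the input is hand-written in the shape returned by `grouping_entities`; it is not real model output.

```python
from collections import Counter


def most_frequent_groups(grouped_entities, top_n=3):
    # lists are not hashable, so each group is counted as a tuple
    counts = Counter(tuple(group) for group in grouped_entities)
    return counts.most_common(top_n)


# hand-written groups for illustration only
groups = [["ORG", "PERSON"], ["DATE"], ["GPE"], ["GPE"], ["ORG", "PERSON"], ["GPE"]]
print(most_frequent_groups(groups))
# [(('GPE',), 3), (('ORG', 'PERSON'), 2), (('DATE',), 1)]
```

Groups are compared in order here, so `("ORG", "PERSON")` and `("PERSON", "ORG")` count as different combinations; sorting each group before counting would merge them.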


### **TASK 2:**

* One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.

In [6]:
def expand_entity_with_compound(sentence):
    def make_tag(iob, ent_type):
        # build an IOB tag; tokens without an entity type are tagged "O"
        return iob + "-" + ent_type if ent_type != "" else "O"

    doc = nlp_standard(sentence)
    token_ent_pair_array = []
    for token in doc:
        own_tag = make_tag(token.ent_iob_, token.ent_type_)
        if token.dep_ != "compound":
            is_first = True
            is_first_child = True
            for child in token.children:
                # a compound modifier to the left of an entity head joins the head's entity
                if child.dep_ == "compound" and child.i < token.i and token.ent_type_ != "":
                    is_first = False
                    prefix = "B" if is_first_child else "I"
                    token_ent_pair_array[child.i] = (child.text, make_tag(prefix, token.ent_type_))
                    is_first_child = False
            if is_first:
                token_ent_pair_array.append((token.text, own_tag))
            else:
                token_ent_pair_array.append((token.text, "I-" + token.ent_type_))
        else:
            head_ent_type = ""
            if token.head.i < token.i:
                # the head was already processed: reuse the entity type stored for it
                head_ent_type = token_ent_pair_array[token.head.i][1][2:]
            if head_ent_type != "":
                token_ent_pair_array.append((token.text, "I-" + head_ent_type))
            else:
                # keep spaCy's own tag for now; a head further right may overwrite it
                token_ent_pair_array.append((token.text, own_tag))
    return token_ent_pair_array
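
The expansion rule can also be illustrated without running a parser. The sketch below applies it to hand-annotated tokens, assuming that a span is only extended when the head of the compound already carries an entity type. The tuple layout, the function name and the annotations are hypothetical; they are not the output of spaCy.

```python
def expand_with_compounds(tokens):
    # tokens: list of (text, dep, head_index, ent_iob, ent_type) tuples
    tags = [iob + "-" + ent_type if ent_type else "O" for _, _, _, iob, ent_type in tokens]
    for head_index, (_, _, _, _, head_type) in enumerate(tokens):
        if not head_type:
            continue  # only heads that are entities can extend their span
        children = [
            i for i, (_, dep, head, _, _) in enumerate(tokens)
            if dep == "compound" and head == head_index and i < head_index
        ]
        if not children:
            continue
        # the leftmost compound token opens the entity, everything after it continues it
        tags[children[0]] = "B-" + head_type
        for i in children[1:]:
            tags[i] = "I-" + head_type
        tags[head_index] = "I-" + head_type
    return [(text, tag) for (text, *_), tag in zip(tokens, tags)]


# hand-written annotations: the entity initially covers only "Valley"
tokens = [
    ("visited", "ROOT", 0, "O", ""),
    ("Silicon", "compound", 2, "O", ""),
    ("Valley", "dobj", 0, "B", "LOC"),
]
print(expand_with_compounds(tokens))
# [('visited', 'O'), ('Silicon', 'B-LOC'), ('Valley', 'I-LOC')]
```

The head moves from `B-` to `I-` because the entity now starts at the first compound token.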

### **Results:**

In [None]:
print()
print("Task 0.1 results:")
conll_data = read_corpus_conll("./conll2003/test.txt")
print("Token classification:", token_level_performance(conll_data))

print()
print("Task 0.2 results:")
effective_class_counts, recognized_class_counts, chunk_counter, recognized_chunk_counter = chunk_level_performance(conll_data)
print("total chunk ratio:", (recognized_chunk_counter/chunk_counter))
print("class ORG ratio:", recognized_class_counts["ORG"] / effective_class_counts["ORG"])
print("class LOC ratio:", recognized_class_counts["LOC"] / effective_class_counts["LOC"])


test_sentence = "Apple's Steve Jobs died in 2011 in Palo Alto, California."
print()
print("Task 1 results:")
print(grouping_entities([test_sentence]))  # the function expects a list of sentences

print()
print("Task 2 Results:")
print(expand_entity_with_compound(test_sentence))