# NLU: Mid-Term Assignment 2022
### Description
In this notebook, we ask you to complete four main tasks to show what you have learnt during the NLU labs. To complete the assignment, please refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* that earns an extra mark towards the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset provided by us in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the files inside it.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** send a Jupyter notebook without the printed outputs.
- **Other**: follow carefully all the further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks from the project presentation. Please refer to the *Piazza* channel for the exact date.

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set
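
As a minimal sketch of what a vocabulary and a frequency dictionary look like, the snippet below builds both with `collections.Counter`. The list `toy_tokens` is a hypothetical stand-in for one split of the dataset, not CoNLL 2003 data.

```python
from collections import Counter

# Hypothetical toy tokens standing in for one split of the dataset
toy_tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]

# Frequency dictionary: token -> number of occurrences
freq = Counter(toy_tokens)

# Vocabulary: the set of distinct tokens
vocab = set(freq)

# most_common(n) returns the n most frequent (token, count) pairs
print(freq.most_common(1))
print(len(vocab))
```

The same two objects can be built for each split by replacing `toy_tokens` with the tokens of that split.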

In [None]:
# whole dataset: all the splits, including validation (Q 1.1 and Q 1.4 also consider the validation set)
# tokens are lowercased here; do not lowercase when working with named entities

from nltk.corpus.reader import ConllCorpusReader
from collections import Counter

root = './data'
corpus = ConllCorpusReader(root, '.txt', ('words', 'pos','tree', 'chunk'))
whole_dataset = []

test_words = [w.lower() for w in corpus.words('test.txt')]
train_words = [w.lower() for w in corpus.words('train.txt')]
valid_words = [w.lower() for w in corpus.words('valid.txt')]

whole_dataset = test_words + train_words + valid_words
test_plus_train = test_words + train_words

test_sents = corpus.sents('test.txt')
train_sents = corpus.sents('train.txt')

whole_dataset_freq_list = Counter(whole_dataset)
test_plus_train_freq_list = Counter(test_plus_train)
train_freq_list = Counter(train_words)
test_freq_list = Counter(test_words)

# Counter.most_common(20) returns the 20 most frequent (token, count) pairs
most_twenty_whole = [w for w, _ in whole_dataset_freq_list.most_common(20)]
print(most_twenty_whole)

most_twenty_train = [w for w, _ in train_freq_list.most_common(20)]
print(most_twenty_train)

most_twenty_test = [w for w, _ in test_freq_list.most_common(20)]
print(most_twenty_test)


#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  
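
A minimal sketch of the two set operations involved, on hypothetical toy splits (`toy_train` and `toy_test` are made up for illustration). It assumes the usual definition of OOV: tokens that occur in the test set but were never seen in the training set.

```python
# Hypothetical toy splits, not taken from CoNLL 2003
toy_train = ["the", "cat", "sat", "on", "the", "mat"]
toy_test = ["the", "dog", "sat", "down"]

train_vocab = set(toy_train)
test_vocab = set(toy_test)

# OOV: test tokens that never appear in the training vocabulary
oov = test_vocab - train_vocab

# Overlap: tokens that appear in both vocabularies
overlap = train_vocab & test_vocab

print(sorted(oov))
print(sorted(overlap))
```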

In [None]:
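# OOV tokens: tokens of the test set that never occur in the train set
OOV = set(test_words) - set(train_words)
# tokens shared by the train and test vocabularies
overlap = set(train_words) & set(test_words)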

#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences
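
The three statistics can be sketched on a small example as follows. The list `toy_sents` is hypothetical; in the notebook, each sentence would be a list of tokens read from the corpus.

```python
from collections import Counter

# Hypothetical tokenized sentences
toy_sents = [
    ["I", "saw", "the", "man"],
    ["He", "left"],
    ["the", "man", "waved", "."],
]

# Number of sentences
num_sents = len(toy_sents)

# Average sentence length in tokens
num_tokens = sum(len(sent) for sent in toy_sents)
avg_len = num_tokens / num_sents

# Most common tokens over all sentences
top = Counter(tok for sent in toy_sents for tok in sent).most_common(2)

print(num_sents, round(avg_len, 2), top)
```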

In [None]:
! pip install tabulate
from tabulate import tabulate

number_sents = len(test_sents) + len(train_sents)
# 1. average sentence length in tokens
print(len(test_plus_train) / number_sents)

# 2. the 50 most common tokens
most_frequent_words = test_plus_train_freq_list.most_common()
print(tabulate(most_frequent_words[:50], headers=['Token', 'Frequency'], tablefmt='orgtbl'))

# 3. number of sentences
# Do we have to consider also the first empty line?
print(number_sents)

#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set
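
Counting named entities requires grouping consecutive tokens that belong to the same entity first. Below is a minimal sketch of that grouping, assuming IOB2-style tags (every entity starts with `B-`). The data in `toy_tagged` and the helper name `entity_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical (token, IOB tag) pairs in IOB2 style
toy_tagged = [
    ("New", "B-LOC"), ("York", "I-LOC"), ("is", "O"), ("home", "O"),
    ("to", "O"), ("IBM", "B-ORG"), ("and", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"), ("Paris", "B-LOC"),
]

def entity_counts(tagged):
    """Count multi-token entities found in a sequence of (token, tag) pairs."""
    entities, current = [], []
    for token, tag in tagged:
        # Any tag other than I- closes the entity collected so far
        if not tag.startswith("I-") and current:
            entities.append(" ".join(current))
            current = []
        if tag != "O":
            current.append(token)
    # Flush an entity that ends the sequence
    if current:
        entities.append(" ".join(current))
    return Counter(entities)

counts = entity_counts(toy_tagged)
print(counts.most_common())
```

Note the final flush: without it, an entity that ends the sequence would never be counted.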

In [None]:
# frequency of each named entity (e.g. "New York"), not of each single token
# entities keep their original case

def nbest(d):
    return sorted(d.items(), key=lambda item: item[1], reverse=True)

def create_dictionary_NE_frequency(*files):
    names = []
    current = []  # tokens of the entity being built

    for file in files:
        for token, pos, tag in corpus.iob_words(file):
            prefix = tag.split("-")[0]
            # any tag other than I- closes the entity collected so far
            if prefix != "I" and current:
                names.append(' '.join(current))
                current = []
            if prefix in ("B", "I"):
                current.append(token)
        # flush an entity that ends the file
        if current:
            names.append(' '.join(current))
            current = []

    return Counter(names)


print(tabulate(nbest(create_dictionary_NE_frequency("test.txt"))[:5], headers=['Test entity', 'Frequency'], tablefmt='orgtbl'))
print()
print(tabulate(nbest(create_dictionary_NE_frequency("train.txt"))[:5], headers=['Train entity', 'Frequency'], tablefmt='orgtbl'))
print()
# whole dataset: train + validation + test
print(tabulate(nbest(create_dictionary_NE_frequency("train.txt", "valid.txt", "test.txt"))[:5], headers=['Whole dataset entity', 'Frequency'], tablefmt='orgtbl'))




### Task 2: Working with Dependency Tree
*Suggestions: use the spaCy pipeline to retrieve the Dependency Tree*


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [None]:
# lab 6 - 1.1
# Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. 
# You can think of noun chunks as a noun plus the words describing the noun

sent = "I saw the man with a telescope"

def get_subjects(doc):
    for t in doc:
        dep = t.dep_
        if dep == 'nsubj': print("Nominal subject: '{}'".format(t))
        elif dep == 'dobj': print("Direct object: '{}'".format(t))
        # spaCy's English models label indirect objects 'dative'; 'iobj' is kept for other label schemes
        elif dep in ('iobj', 'dative'): print("Indirect object: '{}'".format(t))

def get_noun_chunks(doc):
    for chunk in doc.noun_chunks:
        print("Noun chunk: '{}' \n Head Noun: '{}'".format(chunk.text, chunk.root.text))
        print()

# parse the sentence once and reuse the Doc
doc = nlp(sent)
get_subjects(doc)
print()
get_noun_chunks(doc)
print()


#### Q 2.2
- Given a dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*
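
To make the notion of a dependency subtree concrete without relying on a parser, the sketch below encodes a dependency tree as a list of head indices and collects every token that descends from a given token. The `heads` array is a hand-written parse (an assumption: the prepositional phrase is attached to the verb), not actual spaCy output, and the helper names are hypothetical.

```python
tokens = ["I", "saw", "the", "man", "with", "a", "telescope"]
# heads[i] is the index of the head of token i; the root points to itself
heads = [1, 1, 3, 1, 1, 6, 4]

def subtree(index, heads):
    """Indices of `index` and all its descendants, in sentence order."""
    result = []
    for i in range(len(heads)):
        j = i
        # Climb from token i towards the root until reaching `index` or the root
        while j != index and heads[j] != j:
            j = heads[j]
        if j == index:
            result.append(i)
    return result

def segment_subtrees(segment, tokens, heads):
    """Map each token of the segment to the words of its dependency subtree."""
    wanted = segment.split()
    return {
        tokens[i]: [tokens[k] for k in subtree(i, heads)]
        for i in range(len(tokens))
        if tokens[i] in wanted
    }

print(segment_subtrees("with a telescope", tokens, heads))
```

With spaCy, the equivalent information is available directly through `token.subtree`, which yields a token together with all its descendants.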

In [None]:
# the tree is represented as a spaCy Doc
# token.subtree in spaCy is a <generator> object
# a sentence is made of chunks --> print the dependencies of the given chunk
# the segment is treated as a chunk

from nltk import Tree
import en_core_web_sm
spacy_nlp = en_core_web_sm.load()

#thanks to: https://stackoverflow.com/questions/36610179/how-to-get-the-dependency-tree-with-spacy
def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

def output_subtree(sentence, segment):
    words = segment.split()
    doc = spacy_nlp(sentence)
    not_leaf = []
    for span in doc.sents:
        tree = to_nltk_tree(span.root)
        # print the subtree rooted at every segment word that has children
        for sub in tree.subtrees():
            if sub.label() in words:
                not_leaf.append(sub.label())
                sub.pretty_print()

    # segment words without children are leaves: their subtree is the word itself
    leaves = set(words) - set(not_leaf)
    for leaf in leaves:
        print("leaf: {}".format(leaf))


sent = "I saw the man with a telescope"
output_subtree(sent, "a telescope")

#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*
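
The path from the root to a token can be read off by following head links upwards from the token and reversing the result. The sketch below shows this on a hand-written head-index encoding of the sentence (an assumed parse with the prepositional phrase attached to the verb, not actual spaCy output); `path_from_root` is a hypothetical helper name.

```python
tokens = ["I", "saw", "the", "man", "with", "a", "telescope"]
# heads[i] is the index of the head of token i; the root points to itself
heads = [1, 1, 3, 1, 1, 6, 4]

def path_from_root(index, tokens, heads):
    """Words on the path from the root down to the token at `index`."""
    path = [tokens[index]]
    # Follow head links upwards until the root is reached
    while heads[index] != index:
        index = heads[index]
        path.append(tokens[index])
    # The links were collected bottom-up, so reverse them
    return path[::-1]

print(path_from_root(6, tokens, heads))
```

In spaCy, `token.ancestors` yields the same chain bottom-up (head first, root last), so it also needs to be reversed to start from the root.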

In [None]:
# print the path from the root of the dependency tree down to each token of the segment

def path_from_root(doc, segment):
    words = segment.split()

    for token in doc:
        if token.text in words:
            # token.ancestors runs from the token's head up to the root, so reverse it
            path = [ancestor.text for ancestor in token.ancestors][::-1] + [token.text]
            print("word: {}, path from root: {}".format(token, path))

sent = "I saw the man with a telescope"

doc = nlp(sent)
path_from_root(doc, "I saw the man")

### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn metric functions. See classification_report*

#### Q 3.1
- Benchmark the spaCy Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (e.g. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (e.g. Person, Organization, etc.)
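
The difference between the two evaluation levels can be sketched in plain Python. Token-level accuracy compares tags position by position, whereas entity-level scores count an entity as correct only when both its span and its type match exactly. This is an illustration under the assumption of IOB2 tags, not the implementation used in `conll.py`; the tag sequences and the helper `spans` are hypothetical.

```python
def spans(tags):
    """Return the set of (start, end, type) entity spans in an IOB2 tag sequence."""
    found, start, kind = set(), None, None
    # A sentinel "O" closes an entity that ends the sequence
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == kind
        if not inside:
            if kind is not None:
                found.add((start, i, kind))
            start, kind = (i, tag[2:]) if tag != "O" else (None, None)
    return found

# Hypothetical reference and predicted tags for one sentence
ref = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
hyp = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG", "O"]

# Token level: fraction of positions with identical tags
accuracy = sum(r == h for r, h in zip(ref, hyp)) / len(ref)

# Entity level: exact match on both boundaries and type
ref_spans, hyp_spans = spans(ref), spans(hyp)
correct = len(ref_spans & hyp_spans)
precision = correct / len(hyp_spans) if hyp_spans else 0.0
recall = correct / len(ref_spans) if ref_spans else 0.0
f1 = 2 * precision * recall / (precision + recall) if correct else 0.0

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Here five of the seven tags agree, yet only one of the three entities is fully correct, which is why the two levels should be reported separately.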

In [None]:
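# to import conll
import os
import sys
# sys.path entries must be directories, not files.
# Assumption: conll.py is stored in ./data; adjust the path if it lives elsewhere.
sys.path.insert(0, os.path.abspath('./data'))

from conll import evaluate, read_corpus_conll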


In [None]:
# use accuracy as a metric
# the reference and predicted sequences must have the same length
# use the IOB tags
# look at conll.py on Piazza: we have to use it
# an entity can consist of one or more tokens
# at the entity level, an entity is evaluated as a whole

# token level: classification metrics (see classification_report)
# entity level: conll.py

# spaCy uses a different tag set, so a mapping is needed (e.g. PERSON to PER)
# run the model and compare its output with the ground truth
# make sure that the tokenizer does not re-tokenize the sentence

# LAB 7

# first point
def get_categories(refs):
    categories = []
    for sent in refs:
        for iob in sent:
            if "-" in iob:
                iob = iob[2:]
                categories.append(iob)
    return categories

refs_test = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("test.txt")]
refs_train = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("train.txt")]
refs_valid = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("valid.txt")]

print(set(get_categories(refs_test) + get_categories(refs_train) + get_categories(refs_valid)))

In [None]:

# keep only the (word, pos, iob) triples whose IOB tag marks a named entity
named_entity = [entity for entity in corpus.iob_words("test.txt") if entity[2] != 'O']


In [None]:

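import pandas as pd
from spacy.tokens import Doc

# spaCy label -> CoNLL 2003 label; spaCy labels without an entry are mapped to 'O'
match = {
    "PERSON": "PER",
    "GPE": "LOC",
    "ORG": "ORG",
}

def check_name(iob, name):
    if iob == 'O' or name not in match:
        return 'O'
    else:
        return "-".join([iob, match.get(name)])

# reference test set as (token, IOB tag) pairs, the format evaluate() is assumed to expect
new_corpus = [[(text, iob) for text, pos, iob in sent] for sent in corpus.iob_sents("test.txt")]

# parsing test set
res = []
for sent in new_corpus:
    # build the Doc from the gold tokens so that spaCy does not re-tokenize the sentence
    doc = Doc(nlp.vocab, words=[t[0] for t in sent])
    for _, component in nlp.pipeline:
        doc = component(doc)
    res.append([(t.text, check_name(t.ent_iob_, t.ent_type_)) for t in doc])

res1 = evaluate(new_corpus, res)
pd_tbl = pd.DataFrame.from_dict(res1, orient='index')
pd_tbl.round(decimals=3)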

### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify the NLTK Transition parser's Configuration class to use better features.

In [None]:
# google it

#### Q 4.2
- Evaluate the features comparing performance to the original.

#### Q 4.3
- Replace SVM classifier with an alternative of your choice.