# NLU: Mid-Term Assignment 2022
### Description
In this notebook, we ask you to complete four main tasks to show what you have learnt during the NLU labs. To complete the assignment, refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* that earns an extra mark towards the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset provided by us in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the files inside it.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** send a Jupyter notebook without the printed outputs.
- **Other**: follow carefully all the further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks after the project presentation. Please refer to the *piazza* channel for the exact date.

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set

In [137]:
from nltk.corpus.reader import ConllCorpusReader
from collections import Counter

root = './data'
corpus = ConllCorpusReader(root, '.txt', ('words', 'pos','tree', 'chunk'))
whole_dataset = []

test_words = [w.lower() for w in corpus.words('test.txt')]
train_words = [w.lower() for w in corpus.words('train.txt')]
valid_words = [w.lower() for w in corpus.words('valid.txt')]

whole_dataset = test_words + train_words + valid_words
test_plus_train = test_words + train_words

test_sents = corpus.sents('test.txt')
train_sents = corpus.sents('train.txt')

whole_dataset_freq_list = Counter(whole_dataset)
test_plus_train_freq_list = Counter(test_plus_train)
train_freq_list = Counter(train_words)
test_freq_list = Counter(test_words)

most_twenty_whole = [ w[0] for w in (sorted(whole_dataset_freq_list.items(), key=lambda item: item[1] ,reverse= True)[:20]) ]
print(most_twenty_whole)

most_twenty_train = [ w[0] for w in (sorted(train_freq_list.items(), key=lambda item: item[1] ,reverse= True)[:20]) ]
print(most_twenty_train)

most_twenty_test = [ w[0] for w in (sorted(test_freq_list.items(), key=lambda item: item[1] ,reverse= True)[:20]) ]
print(most_twenty_test)


['the', ',', '.', 'of', 'in', 'to', 'a', '(', ')', 'and', '"', 'on', 'said', "'s", 'for', '-', '1', 'at', 'was', '2']
['the', '.', ',', 'of', 'in', 'to', 'a', 'and', '(', ')', '"', 'on', 'said', "'s", 'for', '1', '-', 'at', 'was', '2']
['the', ',', '.', 'to', 'of', 'in', '(', ')', 'a', 'and', 'on', '"', 'said', "'s", '-', 'for', 'at', 'was', '4', 'with']
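
As a side note, `collections.Counter` already provides `most_common(n)`, which returns the top-n `(token, count)` pairs without a manual sort. The sketch below uses a toy token list (an assumption standing in for `corpus.words(...)`, not the CoNLL data) to show the vocabulary and the frequency dictionary side by side.

```python
from collections import Counter

# Toy token list standing in for corpus.words(...); not the CoNLL data.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Frequency dictionary: token -> number of occurrences.
freq = Counter(tokens)

# most_common(n) returns (token, count) pairs sorted by descending count.
top_two = freq.most_common(2)
print(top_two)  # [('the', 3), ('cat', 2)]

# The vocabulary is the set of distinct tokens.
vocabulary = set(freq)
print(len(vocabulary))  # 5
```

Tokens with equal counts may come out in a different order than with an explicit `sorted(...)` call, so the two approaches can differ on ties.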


#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  

In [138]:
# OOV tokens are test-set tokens that never occur in the training vocabulary
OOV = set(test_words) - set(train_words)
overlap = (set(train_words)).intersection(set(test_words))
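
Under the usual convention, OOV tokens are the test-set tokens that never occur in the training vocabulary, and the overlap is the intersection of the two vocabularies. A minimal sketch on toy lists (assumed data, not the CoNLL files) that also reports the share of test tokens that are OOV:

```python
# Toy token lists; in the notebook these would be train_words / test_words.
train_tokens = ["the", "cat", "sat", "on", "the", "mat"]
test_tokens = ["the", "dog", "sat", "down"]

train_vocab = set(train_tokens)
test_vocab = set(test_tokens)

# Types seen at test time but never during training.
oov = test_vocab - train_vocab
# Types shared by both sets.
overlap = train_vocab & test_vocab
# Fraction of test tokens (not types) that are out of vocabulary.
oov_rate = sum(1 for t in test_tokens if t not in train_vocab) / len(test_tokens)

print(sorted(oov))      # ['dog', 'down']
print(sorted(overlap))  # ['sat', 'the']
print(oov_rate)         # 0.5
```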

#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences

In [139]:
from tabulate import tabulate

number_sents = len(test_sents + train_sents)

# 1
print('Average sentence length: {}'.format(len(test_words + train_words)/number_sents))
#2
most_frequent_tokens = sorted(test_plus_train_freq_list.items(), key=lambda item: item[1] ,reverse= True)
print()
print(tabulate(most_frequent_tokens[:5], headers=['Most frequent 5 tokens', 'Frequence'], tablefmt='orgtbl'))

#3
print(number_sents)

Average sentence length: 13.392748112045417

| Most frequent 5 tokens   |   Frequence |
|--------------------------+-------------|
| the                      |       10155 |
| .                        |        9000 |
| ,                        |        8927 |
| of                       |        4604 |
| in                       |        4382 |
18671
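
One detail that can affect these numbers: the standard CoNLL-2003 distribution separates documents with `-DOCSTART-` lines, which a sentence reader may return as one-token sentences. Whether the files in *data* contain them is an assumption here. The sketch below, on toy sentences, shows how filtering such markers changes the sentence count and the average length.

```python
# Toy sentences; the first one mimics a document separator line.
sentences = [
    ["-DOCSTART-"],
    ["EU", "rejects", "German", "call", "."],
    ["Peter", "Blackburn"],
]

def sentence_stats(sents):
    """Return (number of sentences, average length in tokens)."""
    return len(sents), sum(len(s) for s in sents) / len(sents)

# Statistics with and without the separator pseudo-sentences.
raw_count, raw_avg = sentence_stats(sentences)
real = [s for s in sentences if s != ["-DOCSTART-"]]
real_count, real_avg = sentence_stats(real)

print(raw_count, round(raw_avg, 2))  # 3 2.67
print(real_count, real_avg)          # 2 3.5
```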


#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set

In [140]:
# frequency of each named-entity string, e.g. 'Germany', 'U.S.'
# built from the IOB-tagged words of one or two files

def nbest(d):
    return sorted(d.items(), key=lambda item: item[1], reverse=True)

def create_dictionary_NE_frequency(file, second_file = None):
    words = corpus.iob_words(file)

    if second_file is not None:
        words_second = corpus.iob_words(second_file)
        words = words + words_second
    
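    names = []
    current = []  # tokens of the entity currently being built

    for token, _pos, iob in words:
        # an I- tag continues the open entity
        if iob.startswith("I-") and current:
            current.append(token)
            continue
        # any other tag closes the open entity, if there is one
        if current:
            names.append(' '.join(current))
            current = []
        # a B- tag opens a new entity
        if iob.startswith("B-"):
            current = [token]

    # do not lose an entity that ends with the last token
    if current:
        names.append(' '.join(current))

    return Counter(names)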
    

print(tabulate(nbest(create_dictionary_NE_frequency("test.txt"))[:5], headers=['Test token', 'Frequence'], tablefmt='orgtbl'))
print()
print(tabulate(nbest(create_dictionary_NE_frequency("train.txt"))[:5], headers=['Train token', 'Frequence'], tablefmt='orgtbl'))
print()
print(tabulate(nbest(create_dictionary_NE_frequency("train.txt", "test.txt"))[:5], headers=['Test and Train token', 'Frequence'], tablefmt='orgtbl'))

| Test token   |   Frequence |
|--------------+-------------|
| Germany      |          49 |
| U.S.         |          45 |
| Australia    |          45 |
| Japan        |          41 |
| Italy        |          41 |

| Train token   |   Frequence |
|---------------+-------------|
| U.S.          |         303 |
| Germany       |         141 |
| Britain       |         133 |
| Australia     |         130 |
| England       |         123 |

| Test and Train token   |   Frequence |
|------------------------+-------------|
| U.S.                   |         348 |
| Germany                |         190 |
| Australia              |         175 |
| France                 |         162 |
| England                |         144 |
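
The question can also be read as asking for frequencies per entity category. A minimal sketch, assuming IOB2 tags (every entity starts with `B-`) and a toy tagged sequence rather than the CoNLL files, that yields `(text, label)` pairs and counts both views; `iter_entities` is a hypothetical helper name:

```python
from collections import Counter

def iter_entities(tagged_tokens):
    """Yield (entity text, label) for each entity in (token, IOB2 tag) pairs."""
    current, label = [], None
    for token, tag in tagged_tokens:
        # An I- tag with the same label continues the open entity.
        if tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
            continue
        # Anything else closes the open entity, if any.
        if current:
            yield " ".join(current), label
            current, label = [], None
        # A B- tag opens a new entity; stray I- tags are ignored.
        if tag.startswith("B-"):
            current, label = [token], tag[2:]
    if current:
        yield " ".join(current), label

tagged = [("U.S.", "B-LOC"), ("envoy", "O"), ("John", "B-PER"),
          ("Smith", "I-PER"), ("visited", "O"), ("U.S.", "B-LOC")]

entities = list(iter_entities(tagged))
by_text = Counter(text for text, _ in entities)
by_label = Counter(label for _, label in entities)

print(by_text.most_common())   # [('U.S.', 2), ('John Smith', 1)]
print(by_label.most_common())  # [('LOC', 2), ('PER', 1)]
```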


### Task 2: Working with Dependency Tree
*Suggestion: use the spaCy pipeline to retrieve the Dependency Tree*


In [141]:
import spacy
nlp = spacy.load('en_core_web_sm')
sent = "I saw the man with a telescope"
segment = "with a telescope"
doc = nlp(sent)

#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [142]:
def get_subjects(doc):
    for t in doc:
        dep = t.dep_
        if dep == 'nsubj': print("Nominal subject: '{}'".format(t))
        elif dep == 'dobj': print("Direct object: '{}'".format(t))
        # spaCy's English models usually label indirect objects 'dative' rather than 'iobj'
        elif dep in ('iobj', 'dative'): print("Indirect object: '{}'".format(t))

def get_noun_chunks(doc):
    for chunk in doc.noun_chunks:
        print("\n Noun chunk: '{}' \n Head Noun: '{}'".format(chunk.text, chunk.root.text))

get_subjects(doc)
get_noun_chunks(doc)

Nominal subject: 'I'
Direct object: 'man'

 Noun chunk: 'I' 
 Head Noun: 'I'

 Noun chunk: 'the man' 
 Head Noun: 'man'

 Noun chunk: 'a telescope' 
 Head Noun: 'telescope'


#### Q 2.2
- Given the dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*

In [143]:
from nltk import Tree
import en_core_web_sm
spacy_nlp = en_core_web_sm.load()


def check_leaf(node):
    # True when the node has no children in the dependency tree
    return not node.n_lefts and not node.n_rights

# returns the (start, end) word indexes of each occurrence of the segment in a sentence
def find_indexes(sentence_words, segment_words):
    
    results=[]
    segment_words_len=len(segment_words)

    # enumerate yields (index, value) pairs; keep the indexes where the segment may start
    for index_segment in (index for index, value in enumerate(sentence_words) if value==segment_words[0]):
        if sentence_words[index_segment:index_segment+segment_words_len]==segment_words:
            results.append((index_segment,index_segment+segment_words_len-1))

    return results

# thanks to: https://stackoverflow.com/questions/36610179/how-to-get-the-dependency-tree-with-spacy
def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

# given a sentence and a segment, print the subtree of the segment
def output_subtree(sentence, segment):
    sentence_words = sentence.split(' ')
    segment_words = segment.split(' ')
    result = find_indexes(sentence_words, segment_words)
    if len(result) < 1: raise ValueError('Segment not found') 

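    # first and last word index of the first occurrence of the segment
    start, end = result[0][0], result[0][1]

    doc = spacy_nlp(sentence)

    # widen the span to everything attached to the boundary tokens;
    # this assumes whitespace-split words align with spaCy tokens
    span = doc[doc[start].left_edge.i : doc[end].right_edge.i+1]
    if len(span) < 2 and check_leaf(span.root): return 'The segment is a leaf'
    return to_nltk_tree(span.root)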

try:
    res = output_subtree(sent, segment)
    if isinstance(res, str):
        print(res)
    else:
        res.pretty_print()
            
except ValueError as ve:
    print(ve)


   with  
    |     
telescope
    |     
    a    
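
To make the subtree idea concrete without depending on a parser, the sketch below stores a parse as a list of head indexes and collects a token together with all of its descendants. The `heads` list is hand-written to mirror the attachment shown in the notebook outputs ("with" attached to "man"); it is an assumption, not parser output.

```python
words = ["I", "saw", "the", "man", "with", "a", "telescope"]
# heads[i] is the index of word i's head; the root points to itself.
heads = [1, 1, 3, 1, 3, 6, 4]

def subtree(index, heads):
    """Return the sorted indexes of the token at `index` and all its descendants."""
    # Build a head -> children map once.
    children = {}
    for child, head in enumerate(heads):
        if child != head:
            children.setdefault(head, []).append(child)
    # Depth-first walk starting from the requested token.
    collected, stack = [], [index]
    while stack:
        node = stack.pop()
        collected.append(node)
        stack.extend(children.get(node, []))
    return sorted(collected)

with_subtree = [words[i] for i in subtree(4, heads)]
man_subtree = [words[i] for i in subtree(3, heads)]
print(with_subtree)  # ['with', 'a', 'telescope']
print(man_subtree)   # ['the', 'man', 'with', 'a', 'telescope']
```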



#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [144]:
#print the path

def path_to_root(doc, segment):
    words = segment.split()

    root = [token for token in doc if token.head == token][0]
    for descendant in root.subtree:
        if descendant.text in words:
            print("word: {}, path to root: {}".format(descendant, [ancestor.text for ancestor in descendant.ancestors]))

path_to_root(doc, segment)

word: with, path to root: ['man', 'saw']
word: a, path to root: ['telescope', 'with', 'man', 'saw']
word: telescope, path to root: ['with', 'man', 'saw']
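
The lists printed above run from the token up to the root. If the path is wanted in the other direction, from the root down to the token, it is enough to reverse the chain of heads and append the token itself. A minimal sketch on a hand-written head-index list (an assumption mirroring the output above, not parser output):

```python
words = ["I", "saw", "the", "man", "with", "a", "telescope"]
# heads[i] is the index of word i's head; the root points to itself.
heads = [1, 1, 3, 1, 3, 6, 4]

def path_from_root(index, heads):
    """Return token indexes from the root down to the token at `index`."""
    path = [index]
    # Climb head links until the self-pointing root is reached.
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path[::-1]

telescope_path = [words[i] for i in path_from_root(6, heads)]
root_path = [words[i] for i in path_from_root(1, heads)]
print(telescope_path)  # ['saw', 'man', 'with', 'telescope']
print(root_path)       # ['saw']
```

With a spaCy `Token`, the same ordering should be obtainable as `list(token.ancestors)[::-1] + [token]`.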


### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn metric functions. See classification_report*

#### Q 3.1
- Benchmark Spacy Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (e.g. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (e.g. Person, Organization, etc.)

In [145]:
# make the conll module under ./src importable before importing it
import os
import sys
sys.path.insert(0, os.path.abspath('./src/'))

import pandas as pd
from conll import evaluate, read_corpus_conll
from sklearn.metrics import classification_report
from spacy.tokenizer import Tokenizer

match = {
    "PERSON": "PER",
    "GPE": "LOC",
    "ORG": "ORG",
}

In [146]:
def get_categories(refs):
    categories = []
    for sent in refs:
        for iob in sent:
            if "-" in iob:
                iob = iob[2:]
                categories.append(iob)
    return categories

refs_test = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("test.txt")]
refs_train = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("train.txt")]
refs_valid = [[iob for text, pos, iob in sent] for sent in  corpus.iob_sents("valid.txt")]

print(set(get_categories(refs_test) + get_categories(refs_train) + get_categories(refs_valid)))

{'LOC', 'MISC', 'ORG', 'PER'}


In [147]:
def check_name(iob, name):
    if iob == 'O':
        return 'O'
    elif name not in match: return "-".join([iob, 'MISC'])
    else:
        return "-".join([iob, match.get(name)])

spacy_map_token = []
named_entity = list([(entity[0], entity[2]) for entity in corpus.iob_words("test.txt")])
sents = [sent for sent in corpus.sents("test.txt")]

# replace the default tokenizer with a whitespace-only one so that spaCy tokens align with the CoNLL tokens, thanks to: https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/
nlp.tokenizer = Tokenizer(nlp.vocab)
for sent in sents:
    doc = nlp(' '.join(sent))
    for s in doc:
        spacy_map_token.append((s.text, check_name(s.ent_iob_, s.ent_type_)))

print(classification_report([en[1] for en in named_entity], [en[1] for en in spacy_map_token]))

              precision    recall  f1-score   support

       B-LOC       0.78      0.70      0.74      1668
      B-MISC       0.10      0.58      0.17       702
       B-ORG       0.50      0.28      0.36      1661
       B-PER       0.77      0.57      0.66      1617
       I-LOC       0.66      0.53      0.58       257
      I-MISC       0.07      0.49      0.12       216
       I-ORG       0.46      0.48      0.47       835
       I-PER       0.78      0.71      0.74      1156
           O       0.94      0.86      0.90     38323

    accuracy                           0.81     46435
   macro avg       0.56      0.58      0.53     46435
weighted avg       0.88      0.81      0.84     46435
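
In the report above, `O` accounts for 38323 of the 46435 tokens, so overall accuracy is dominated by the majority tag and says little about the entity tags. The sketch below computes accuracy and per-label precision and recall by hand on toy tag lists (assumed data), to make explicit what the token-level report measures; `per_label_scores` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def per_label_scores(gold, pred):
    """Return {label: (precision, recall)} for two aligned tag lists."""
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_n, pred_n = Counter(gold), Counter(pred)
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        precision = true_pos[label] / pred_n[label] if pred_n[label] else 0.0
        recall = true_pos[label] / gold_n[label] if gold_n[label] else 0.0
        scores[label] = (precision, recall)
    return scores

gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "O", "B-LOC", "B-LOC"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
scores = per_label_scores(gold, pred)

print(round(accuracy, 2))  # 0.67
print(scores["B-LOC"])     # (0.5, 1.0)
print(scores["I-PER"])     # (0.0, 0.0)
```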



In [148]:
new_corpus  = read_corpus_conll('./data/test.txt', fs = ' ')

# parsing test set
res = []
for sent in new_corpus:
    doc = nlp(" ".join([t[0] for t in sent]))
    out = []
    for t in doc:
        out.append((t.text, check_name(t.ent_iob_, t.ent_type_)))
    res.append(out)


result = evaluate(new_corpus, res)
pd_tb = pd.DataFrame().from_dict(result, orient='index')
pd_tb.round(decimals=3)

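| Category   |     p |     r |     f |    s |
|------------+-------+-------+-------+------|
| LOC        | 0.776 | 0.692 | 0.732 | 1668 |
| ORG        | 0.354 | 0.25  | 0.293 | 1661 |
| MISC       | 0.099 | 0.56  | 0.168 |  702 |
| PER        | 0.737 | 0.543 | 0.625 | 1617 |
| total      | 0.363 | 0.503 | 0.421 | 5648 |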
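
Entity-level scoring counts a prediction as correct only when the label and both boundaries match a gold entity exactly, which is why it is stricter than token-level scoring. The sketch below illustrates the idea on toy IOB2 sequences (assumed data); it is not the implementation of the `evaluate` function used above, and `spans` is a hypothetical helper.

```python
def spans(tags):
    """Return the set of (label, start, end) entity spans in an IOB2 tag sequence."""
    found, start, label = set(), None, None
    # A trailing "O" sentinel closes an entity that ends the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the open entity
        if label is not None:
            found.add((label, start, i - 1))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O"]

gold_spans, pred_spans = spans(gold), spans(pred)
correct = len(gold_spans & pred_spans)
precision = correct / len(pred_spans) if pred_spans else 0.0
recall = correct / len(gold_spans) if gold_spans else 0.0

print(sorted(gold_spans))  # [('LOC', 3, 3), ('PER', 0, 1)]
print(sorted(pred_spans))  # [('LOC', 3, 3), ('PER', 0, 0)]
print(precision, recall)   # 0.5 0.5
```

Four of the five tokens are tagged correctly here, yet only one of the two entities counts as correct, because the truncated person span no longer matches the gold boundaries.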


### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify NLTK Transition parser's Configuration class to use better features.

In [149]:
# google it

#### Q 4.2
- Evaluate the features comparing performance to the original.

#### Q 4.3
- Replace SVM classifier with an alternative of your choice.