# NLU: Mid-Term Assignment 2022
### Description
In this notebook, you are asked to complete four main tasks that show what you have learnt during the NLU labs. To complete the assignment, please refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* that earns an extra mark towards the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset, which we provide in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the files inside it.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** submit a Jupyter notebook without the printed outputs.
- **Other**: carefully follow all further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks from the project presentation. Please refer to the *piazza* channel for the exact date.

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set

In [1]:
from nltk.corpus.reader import ConllCorpusReader
from collections import Counter

root = './data'
corpus = ConllCorpusReader(root, '.txt', ('words', 'pos', 'tree'))

# Lower-cased tokens of each split
test_words = [w.lower() for w in corpus.words('test.txt')]
train_words = [w.lower() for w in corpus.words('train.txt')]
valid_words = [w.lower() for w in corpus.words('valid.txt')]

# Here the whole dataset also includes the validation split;
# test_plus_train is the train + test combination used in Q 1.3
whole_dataset = test_words + train_words + valid_words
test_plus_train = test_words + train_words

test_sents = corpus.sents('test.txt')
train_sents = corpus.sents('train.txt')
valid_sents = corpus.sents('valid.txt')

# Frequency dictionaries: the keys of each Counter form the vocabulary
whole_dataset_freq_list = Counter(whole_dataset)
test_plus_train_freq_list = Counter(test_plus_train)
train_freq_list = Counter(train_words)
test_freq_list = Counter(test_words)

# The 20 most frequent words of each set
most_twenty_whole = [w for w, _ in whole_dataset_freq_list.most_common(20)]
print(most_twenty_whole)

most_twenty_train = [w for w, _ in train_freq_list.most_common(20)]
print(most_twenty_train)

most_twenty_test = [w for w, _ in test_freq_list.most_common(20)]
print(most_twenty_test)


['the', ',', '.', 'of', 'in', 'to', 'a', '(', ')', 'and', '"', 'on', 'said', "'s", 'for', '-', '1', 'at', 'was', '2']
['the', '.', ',', 'of', 'in', 'to', 'a', 'and', '(', ')', '"', 'on', 'said', "'s", 'for', '1', '-', 'at', 'was', '2']
['the', ',', '.', 'to', 'of', 'in', '(', ')', 'a', 'and', 'on', '"', 'said', "'s", '-', 'for', 'at', 'was', '4', 'with']


#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  

In [2]:
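# OOV tokens: word types that occur in the test set but never in the train set
print(set(test_words) - set(train_words))
# Overlapping tokens: word types shared by the train and test sets
print(set(train_words) & set(test_words))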
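
The two set operations requested in Q 1.2 can be illustrated on toy data. This is only a sketch: it assumes the common convention that OOV tokens are the test-set word types never seen in the train set, and the token lists and the `oov_rate` measure are invented for illustration, not taken from the dataset.

```python
# Toy token lists, invented for illustration (not taken from CoNLL 2003).
train_tokens = ["the", "match", "ended", "in", "london", "."]
test_tokens = ["the", "match", "started", "in", "paris", "."]

train_vocab = set(train_tokens)
test_vocab = set(test_tokens)

# Assumed convention: OOV = test word types never seen in training.
oov = test_vocab - train_vocab
# Overlap = word types that appear in both sets.
overlap = train_vocab & test_vocab

# Share of test tokens (occurrences, not types) that are OOV.
oov_rate = sum(tok not in train_vocab for tok in test_tokens) / len(test_tokens)

print(sorted(oov))
print(sorted(overlap))
print(round(oov_rate, 3))
```

Sets discard duplicates, so `oov` and `overlap` are counted over word types, while `oov_rate` is computed over token occurrences.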



#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences

In [3]:
! pip install tabulate
from tabulate import tabulate

# Sentence count over test + train
number_sents = len(test_sents) + len(train_sents)

# 1. Average sentence length in tokens
print(len(test_plus_train) / number_sents)

# 2. The 50 most common tokens
most_frequent_words = test_plus_train_freq_list.most_common()
print(tabulate(most_frequent_words[:50], headers=['Token', 'Frequence'], tablefmt='orgtbl'))

# 3. Number of sentences
# Should the first empty line also be counted as a sentence?
print(number_sents)

13.392748112045417
| Token   |   Frequence |
|---------+-------------|
| the     |       10155 |
| .       |        9000 |
| ,       |        8927 |
| of      |        4604 |
| in      |        4382 |
| to      |        4229 |
| a       |        3857 |
| (       |        3547 |
| )       |        3545 |
| and     |        3470 |
| "       |        2599 |
| on      |        2559 |
| said    |        2248 |
| 's      |        1913 |
| for     |        1751 |
| 1       |        1567 |
| -       |        1530 |
| at      |        1397 |
| was     |        1319 |
| 2       |        1134 |
| 3       |        1088 |
| 0       |        1072 |
| with    |        1052 |
| that    |         972 |
| he      |         967 |
| from    |         947 |
| it      |         911 |
| by      |         896 |
| :       |         875 |
| is      |         836 |
| 4       |         782 |
| as      |         753 |
| had     |         700 |
| his     |         682 |
| not     |         665 |
| but     |        

#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set

In [4]:
# Frequency of each named-entity tag (B-LOC, I-PER, ...), excluding 'O'.
# The fourth column of the files (the NE tag) is read as 'chunk', so that
# iob_words() returns it as the third element of each (word, pos, tag) tuple.
corpus = ConllCorpusReader('./data', '.txt', ('words', 'pos', 'tree', 'chunk'))

def nbest(d):
    """Return the (key, count) pairs of d sorted by decreasing count."""
    return sorted(d.items(), key=lambda item: item[1], reverse=True)

def print_most_frequent(first_file, second_file, corpus):
    """Print the NE tag frequencies of one file, or of two files combined."""
    files = [first_file] if second_file is None else [first_file, second_file]
    named_entity_counter = Counter(
        tag for f in files for _, _, tag in corpus.iob_words(f) if tag != 'O'
    )
    print(tabulate(nbest(named_entity_counter), headers=['Token', 'Frequence'], tablefmt='orgtbl'))

# Whole dataset (train + test), train set, test set
print_most_frequent('train.txt', 'test.txt', corpus)
print_most_frequent('train.txt', None, corpus)
print_most_frequent('test.txt', None, corpus)

| Token   |   Frequence |
|---------+-------------|
| B-LOC   |        8808 |
| B-PER   |        8217 |
| B-ORG   |        7982 |
| I-PER   |        5684 |
| I-ORG   |        4539 |
| B-MISC  |        4140 |
| I-LOC   |        1414 |
| I-MISC  |        1371 |
| Token   |   Frequence |
|---------+-------------|
| B-LOC   |        7140 |
| B-PER   |        6600 |
| B-ORG   |        6321 |
| I-PER   |        4528 |
| I-ORG   |        3704 |
| B-MISC  |        3438 |
| I-LOC   |        1157 |
| I-MISC  |        1155 |
| Token   |   Frequence |
|---------+-------------|
| B-LOC   |        1668 |
| B-ORG   |        1661 |
| B-PER   |        1617 |
| I-PER   |        1156 |
| I-ORG   |         835 |
| B-MISC  |         702 |
| I-LOC   |         257 |
| I-MISC  |         216 |
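
The tables above count tags per token, so a multi-word entity contributes several tags. As a hedged sketch, entity-level counts can be derived by counting only the `B-` tags. This assumes the tags follow the IOB2 scheme, where every entity opens with exactly one `B-` tag; the notebook does not state which scheme the files use. The tag sequence below is invented for illustration.

```python
from collections import Counter

# Toy tag sequence, invented for illustration (assumed to be IOB2).
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG", "B-LOC"]

# Token-level counts: every non-'O' tag is counted as-is.
token_level = Counter(t for t in tags if t != "O")

# Entity-level counts: one B- tag per entity, with the prefix stripped.
entity_level = Counter(t[2:] for t in tags if t.startswith("B-"))

print(token_level.most_common())
print(entity_level.most_common())
```

If the files used the IOB1 scheme instead, entities could start with an `I-` tag and this shortcut would undercount them.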


### Task 2: Working with Dependency Tree
*Suggestion: use the spaCy pipeline to retrieve the Dependency Tree*


In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = ' '.join(test_words)
doc = nlp(text)

#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [6]:
# lab 6 - 1.1
# Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. 
# You can think of noun chunks as a noun plus the words describing the noun

sent = "I saw the man with a telescope"

def get_subjects(doc):
    for t in doc:
        dep = t.dep_.lower()
        if dep == 'nsubj': print("Nominal subject: '{}'".format(t))
        elif dep == 'dobj': print("Direct object: '{}'".format(t))
        elif dep == 'iobj': print("Indirect object: '{}'".format(t))

def get_noun_chunks(doc):
    for chunk in doc.noun_chunks:
        print("Noun chunk: '{}' \n Head Noun: '{}'".format(chunk.text, chunk.root.text))
        print()

get_subjects(nlp(sent))
print()
get_noun_chunks(nlp(sent))
print()


Nominal subject: 'I'
Direct object: 'man'

Noun chunk: 'I' 
 Head Noun: 'I'

Noun chunk: 'the man' 
 Head Noun: 'man'

Noun chunk: 'a telescope' 
 Head Noun: 'telescope'




#### Q 2.2
- Given a dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*

In [95]:
def output_subtree(doc, segment):
    words = segment.split()
    root = [token for token in doc if token.head == token][0]

    # Walk the whole tree and print the subtree of every token in the segment
    for node in root.subtree:
        if node.text in words:
            print("node: {}, subtree: {}".format(node.text, [t.text for t in node.subtree]))


sent = "I saw the man with a telescope"

doc = nlp(sent)
output_subtree(doc, sent)
   

node: I, subtree: ['I']
node: saw, subtree: ['I', 'saw', 'the', 'man', 'with', 'a', 'telescope']
node: the, subtree: ['the']
node: man, subtree: ['the', 'man', 'with', 'a', 'telescope']
node: with, subtree: ['with', 'a', 'telescope']
node: a, subtree: ['a']
node: telescope, subtree: ['a', 'telescope']


#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [91]:
def path_to_root(doc, segment):
    words = segment.split()

    root = [token for token in doc if token.head == token][0]
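    # token.ancestors goes from the parent up to the root, so each path is
    # printed token-to-root; reverse the list to read it root-to-token
    for descendant in root.subtree:
        if descendant.text in words:
            print("word: {}, path to root: {}".format(descendant, [ancestor.text for ancestor in descendant.ancestors]))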

sent = "I saw the man with a telescope"

doc = nlp(sent)
path_to_root(doc, "I saw the man")

word: I, path to root: ['saw']
word: saw, path to root: []
word: the, path to root: ['man', 'saw']
word: man, path to root: ['saw']
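
The cell above lists the ancestors from each token up to the root, whereas Q 2.3 asks for the path from the root down to the token. The sketch below shows that direction without spaCy. The `heads` list is an assumption: it is transcribed by hand from the subtree output of Q 2.2, with the root pointing to itself as spaCy does, and `path_from_root` is a hypothetical helper name.

```python
# Head indices for "I saw the man with a telescope", transcribed by hand
# from the subtree output of Q 2.2 (the root points to itself).
words = ["I", "saw", "the", "man", "with", "a", "telescope"]
heads = [1, 1, 3, 1, 3, 6, 4]

def path_from_root(index, heads):
    """Return the token indices on the path from the root down to `index`."""
    path = [index]
    # Climb from the token to its head until the root is reached.
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    # Reverse so that the root comes first and the target token last.
    return path[::-1]

for i, word in enumerate(words):
    print("{}: {}".format(word, [words[j] for j in path_from_root(i, heads)]))
```

With a spaCy `Doc`, the same ordering could be obtained by reversing `token.ancestors` and appending the token itself.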


### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn metric functions. See classification_report*

#### Q 3.1
- Benchmark the spaCy Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (e.g. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (e.g. Person, Organization, etc.)
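
The difference between token-level and entity-level scoring can be sketched in plain Python on toy data. Everything below is an assumption made for illustration: the gold and predicted tag sequences are invented, the tags are assumed to be IOB2, and `extract_spans` and `prf` are hypothetical helpers. A real benchmark would also need to align spaCy's tokenization with the dataset's and to map spaCy's label names (for example `PERSON` or `GPE`) onto the dataset's categories. For the token level, scikit-learn's `classification_report` can be applied directly to the two tag lists.

```python
def extract_spans(tags):
    """Collect (start, end, type) spans from an IOB2 tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def prf(gold_spans, pred_spans):
    """Precision, recall and F1 over exactly matching spans."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy gold and predicted tags, invented for illustration.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG", "O"]

# Token level: fraction of positions where the two tags agree.
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print("token accuracy:", round(token_accuracy, 3))

# Entity level: a prediction counts only if boundaries and type both match.
gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
print("overall:", prf(gold_spans, pred_spans))
for category in sorted({s[2] for s in gold_spans | pred_spans}):
    g = {s for s in gold_spans if s[2] == category}
    p = {s for s in pred_spans if s[2] == category}
    print(category, prf(g, p))
```

In this toy case five of seven tokens are tagged correctly, yet only one of three entities is recovered, which is why the two levels are reported separately.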

### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify the NLTK Transition parser's Configuration class to use better features.

#### Q 4.2
- Evaluate the new features by comparing their performance to that of the original ones.

#### Q 4.3
- Replace the SVM classifier with an alternative of your choice.