# Natural Language Understanding
## Assignment 2: named entity recognition and dependency parsing
The following requests have been fulfilled:

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
    - report CoNLL chunk-level performance (per class and total);
1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together). 
2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

## Rationale
The Python code exploits the functionality offered by `spaCy`. Detailed comments on the logic behind each function can be found in the `assignment2.py` file itself; here we present a higher-level explanation of the code.
The main purpose of the code is to explore named entity recognition and dependency parsing with `spaCy`.
Several functions have been designed to handle the fact that the CoNLL corpus is tokenized differently from the way `spaCy` tokenizes the same text.

### Function `extract_tokens(corpus)`
The function extracts the tokens of each sentence of the input corpus as a list of strings, without filtering or manipulating the sentences themselves. The `-DOCSTART-` tokens are therefore kept as well.
The function is a simple list comprehension iterating over the sentences of the corpus and collecting the encountered tokens.

Output (partial):
`[['-DOCSTART-'], ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']]`

### Function `extract_tokens_clean(corpus)`
The function extracts the tokens of each sentence of the input corpus as a list of strings, but this time filtering out the sentences that are not needed (e.g. those consisting only of the `-DOCSTART-` token).
The function is a simple list comprehension iterating over the object previously created by `extract_tokens(corpus)` and filtering out the unnecessary sentences.

Output (partial):
`[['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'], ['Nadim', 'Ladki']]`

### Function `extract_tags(corpus)`
The function extracts the tokens and their IOB tags from the input CoNLL corpus. It performs neither filtering nor manipulation.
It is a simple list comprehension iterating over all sentences of the corpus and storing each token and its IOB tag in a tuple.

### Function `extract_tags_clean(corpus)`
The function extracts the tokens and their IOB tags from the input CoNLL corpus, but this time filtering out the `-DOCSTART-` tokens.
It is a simple list comprehension iterating over all sentences of the object previously generated by `extract_tags(corpus)` and filtering out the unnecessary sentences.
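
As a rough illustration, the four extraction helpers described above can be sketched as plain list comprehensions. The corpus layout used here (each sentence a list of `(token, pos, chunk, iob)` tuples) and the function names are assumptions made for this example; they are not necessarily what `assignment2.py` uses.

```python
# Toy corpus. ASSUMPTION: each sentence is a list of (token, pos, chunk, iob) tuples.
toy_corpus = [
    [("-DOCSTART-", "-X-", "-X-", "O")],
    [("Nadim", "NNP", "B-NP", "B-PER"), ("Ladki", "NNP", "I-NP", "I-PER")],
]


def extract_tokens_sketch(corpus):
    """Token text of every sentence, with nothing filtered out."""
    return [[fields[0] for fields in sent] for sent in corpus]


def extract_tags_sketch(corpus):
    """(token, IOB tag) pairs of every sentence, with nothing filtered out."""
    return [[(fields[0], fields[-1]) for fields in sent] for sent in corpus]


def drop_docstart(sents):
    """Remove sentences made up only of the -DOCSTART- marker.

    Works on both token lists and (token, tag) lists.
    """
    def is_docstart(item):
        text = item if isinstance(item, str) else item[0]
        return text == "-DOCSTART-"

    return [sent for sent in sents if not all(is_docstart(item) for item in sent)]


print(extract_tokens_sketch(toy_corpus))
print(drop_docstart(extract_tokens_sketch(toy_corpus)))
print(drop_docstart(extract_tags_sketch(toy_corpus)))
```

Keeping the filter separate from the extraction mirrors the split between the plain and the `_clean` variants: the unfiltered lists stay available for inspection.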

### Function `clean_sents(corpus)`
The function reconstructs each sentence of a corpus in CoNLL format as a single string, taking care of the correct placement of whitespace and punctuation.

Output (partial):
`[['SOCCER - JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT.'], ['Nadim Ladki']]`
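
One simple way to obtain strings of this shape is to join the tokens with a space, except before closing punctuation. The sketch below is only an illustration under that assumption: the punctuation set and the name `detokenize` are hypothetical, and the actual rules in `assignment2.py` may differ.

```python
# ASSUMPTION: these are the only tokens that attach to the preceding word.
NO_SPACE_BEFORE = {",", ".", ";", ":", "!", "?", ")"}


def detokenize(tokens):
    """Join tokens into one string, omitting the space before closing punctuation."""
    text = ""
    for tok in tokens:
        if text and tok not in NO_SPACE_BEFORE:
            text += " "
        text += tok
    return text


sentence = ["SOCCER", "-", "JAPAN", "GET", "LUCKY", "WIN", ",",
            "CHINA", "IN", "SURPRISE", "DEFEAT", "."]
print(detokenize(sentence))
```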

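### Function `spacy_on_cleansents_text(corpus)`
The function runs spaCy on the sentences produced by `clean_sents(corpus)` and returns, for each sentence, the text of its tokens.
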
Output (partial):
`[['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'], ['Nadim', 'Ladki']]`

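### Function `spacy_on_cleansents_token(corpus)`
The function runs spaCy on the same sentences but returns the spaCy `Token` objects themselves rather than their text.
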
Output (partial):
`[[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT, .], [Nadim, Ladki]]`

### Function `get_whitespaces(corpus)`
The function reads the `token.whitespace_` attribute of each token, recording whether the token is followed by a whitespace, and stores the result in a list for later use.
It is a simple list comprehension iterating over the sentences obtained through the `spacy_on_cleansents_token(corpus)` function.

Output (partial):
`[[True, True, True, True, True, False, True, True, True, True, False, False], [True, False]]`

### Function `get_spacy_tags(corpus)`
The function collects the entity tags that spaCy assigns to the tokens of the input corpus, without further manipulation; tokens outside any entity therefore carry the raw tag `O-`.
It is a list comprehension iterating over the sentences of the object produced by the function `clean_sents(corpus)`.

Output (partial):
`[[('SOCCER', 'O-'), ('-', 'O-'), ('JAPAN', 'O-'), ('GET', 'O-'), ('LUCKY', 'O-'), ('WIN', 'O-'), (',', 'O-'), ('CHINA', 'B-LOC'), ('IN', 'O-'), ('SURPRISE', 'O-'), ('DEFEAT', 'O-'), ('.', 'O-')], [('Nadim', 'B-ORG'), ('Ladki', 'I-ORG')]]`

### Function `get_spacy_tags_clean(corpus)`
The function collects the entity tags of the tokens of the input corpus and normalises them, e.g. by turning the raw `O-` tag into `O`.
It is a list comprehension iterating over the sentences obtained through the function `get_spacy_tags(corpus)`.

Output (partial):
`[[('SOCCER', 'O'), ('-', 'O'), ('JAPAN', 'O'), ('GET', 'O'), ('LUCKY', 'O'), ('WIN', 'O'), (',', 'O'), ('CHINA', 'B-LOC'), ('IN', 'O'), ('SURPRISE', 'O'), ('DEFEAT', 'O'), ('.', 'O')], [('Nadim', 'B-ORG'), ('Ladki', 'I-ORG')]]`
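
The `O-` tags in the raw output are what one would get by concatenating spaCy's `token.ent_iob_` and `token.ent_type_` with a hyphen, since `ent_type_` is an empty string outside entities. The sketch below assumes that this is how the raw tags arise; the helper names and the hand-written triples are hypothetical and stand in for values read from real spaCy tokens.

```python
def raw_tag(ent_iob, ent_type):
    """Concatenate IOB code and entity type; yields 'O-' when the type is empty."""
    return f"{ent_iob}-{ent_type}"


def clean_tag(tag):
    """Drop the dangling hyphen so that 'O-' becomes 'O'; other tags are unchanged."""
    return tag.rstrip("-")


# Hand-written (text, ent_iob_, ent_type_) triples standing in for spaCy tokens.
tokens = [("CHINA", "B", "LOC"), ("IN", "O", "")]

raw = [(text, raw_tag(iob, ent_type)) for text, iob, ent_type in tokens]
clean = [(text, clean_tag(tag)) for text, tag in raw]
print(raw)
print(clean)
```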

### Function `get_spacy_alignment(corpus)`
The function returns the tokenization alignment as given by spaCy. It is useful for checking which sentences are tokenized differently in the CoNLL corpus and by spaCy.

### Function `collapse_tokens(corpus)`
The function realigns spaCy's tokenization with that of the CoNLL corpus. It does so by exploiting the whitespace information of the tokens and by adjusting the entity tags assigned by spaCy accordingly.
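
To make the alignment problem concrete, here is a minimal sketch of one possible strategy. It is not the whitespace-based method used in `assignment2.py`: it matches spaCy tokens to reference tokens by concatenating their text, and it keeps the tag of the first sub-token. It assumes that spaCy only ever splits a reference token into smaller pieces; the function name and the example split are made up for illustration.

```python
def collapse_to_reference(ref_tokens, spacy_tokens, spacy_tags):
    """Return one (token, tag) pair per reference token.

    ASSUMPTION: every reference token equals the concatenation of one or more
    consecutive spaCy tokens. The tag of the first piece is kept.
    """
    collapsed = []
    j = 0
    for ref in ref_tokens:
        piece, tag = "", None
        # Consume spaCy tokens until they cover the reference token.
        while len(piece) < len(ref) and j < len(spacy_tokens):
            if tag is None:
                tag = spacy_tags[j]
            piece += spacy_tokens[j]
            j += 1
        if piece != ref:
            raise ValueError(f"cannot align {ref!r} with {piece!r}")
        collapsed.append((ref, tag))
    return collapsed


# Hypothetical example: the reference has 2 tokens, the finer tokenization has 4.
ref = ["Tel-Aviv", "wins"]
fine = ["Tel", "-", "Aviv", "wins"]
tags = ["B-LOC", "I-LOC", "I-LOC", "O"]
print(collapse_to_reference(ref, fine, tags))
```

Keeping the first sub-token's tag preserves a `B-` prefix at the start of an entity; other policies (for example, preferring any non-`O` tag) are equally possible.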

### Function `collapse_tokens_alternative(corpus)`
The function realigns spaCy's tokenization with that of the CoNLL corpus, approaching the problem from a different perspective. It has been observed empirically that using this function instead of the previous one decreases the accuracy scores substantially; it is therefore not used for the purposes of the assignment, but it has been explored as a possibility.

### Function `get_stats(corpus)`
The function computes the chunk-level performance of spaCy's named entity recognition on the CoNLL corpus by using the provided `conll.evaluate(ref, hyp)` function.
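
For readers unfamiliar with chunk-level scoring, the sketch below shows what such an evaluation measures: an entity counts as correct only if its label and both boundaries match the reference. This is an independent illustration, not the implementation of the provided `conll.evaluate`; the function names and the returned dictionary layout are assumptions.

```python
from collections import Counter


def iob_chunks(tags):
    """Return the set of (label, start, end) chunks in one IOB-tagged sentence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                chunks.add((label, start, i))
            if tag == "O":
                start, label = None, None
            else:
                start, label = i, tag[2:]
    return chunks


def chunk_scores(ref_sents, hyp_sents):
    """Precision, recall and F1 per label and under the key 'total'."""
    tp, n_ref, n_hyp = Counter(), Counter(), Counter()
    for ref, hyp in zip(ref_sents, hyp_sents):
        ref_chunks, hyp_chunks = iob_chunks(ref), iob_chunks(hyp)
        for counter, chunks in ((n_ref, ref_chunks), (n_hyp, hyp_chunks),
                                (tp, ref_chunks & hyp_chunks)):
            for chunk_label, _, _ in chunks:
                counter[chunk_label] += 1
                counter["total"] += 1
    scores = {}
    for key in n_ref.keys() | n_hyp.keys():
        p = tp[key] / n_hyp[key] if n_hyp[key] else 0.0
        r = tp[key] / n_ref[key] if n_ref[key] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[key] = {"p": p, "r": r, "f": f}
    return scores


ref = [["O", "B-LOC", "O"], ["B-PER", "I-PER"]]
hyp = [["O", "B-LOC", "O"], ["B-ORG", "I-ORG"]]
print(chunk_scores(ref, hyp))
```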

### Function `accuracy_token_level(corpus)`
The function computes the token-level performance of spaCy's named entity recognition on the CoNLL corpus by using the `classification_report(tot, pos)` function of sklearn.
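
Token-level evaluation compares tags position by position, so the nested per-sentence lists first have to be flattened into two parallel sequences, which is also the input shape `classification_report` works on. The sketch below illustrates that step with the standard library only; the function name is hypothetical, and it reports overall accuracy plus confusion counts rather than reproducing sklearn's report.

```python
from collections import Counter
from itertools import chain


def token_level_summary(ref_sents, hyp_sents):
    """Flatten both tag sequences and return (accuracy, confusion counts)."""
    ref = list(chain.from_iterable(ref_sents))
    hyp = list(chain.from_iterable(hyp_sents))
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must have the same number of tokens")
    confusion = Counter(zip(ref, hyp))  # (reference tag, predicted tag) -> count
    correct = sum(n for (r, h), n in confusion.items() if r == h)
    accuracy = correct / len(ref) if ref else 0.0
    return accuracy, confusion


ref_tags = [["O", "B-LOC", "O"], ["B-PER", "I-PER"]]
hyp_tags = [["O", "B-LOC", "O"], ["B-ORG", "I-ORG"]]
accuracy, confusion = token_level_summary(ref_tags, hyp_tags)
print(accuracy)
print(confusion.most_common())
```

The length check matters here: if the two tokenizations have not been aligned beforehand, position-wise comparison is meaningless.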

### Function `get_chunks_ent(corpus)`
The function groups the recognized entities by the noun chunk they belong to and then counts, over the whole corpus, how many chunks display each observed combination of entity types.
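
The grouping logic can be sketched on plain span tuples. In spaCy, the entity spans would come from `doc.ents` (`ent.start`, `ent.end`, `ent.label_`) and the chunk spans from `doc.noun_chunks` (`chunk.start`, `chunk.end`); here they are written by hand so the example runs on its own. The function names, the treatment of entities outside any chunk (each forms its own group), and the toy spans are assumptions.

```python
from collections import Counter


def group_entities(ents, chunks):
    """Group entity labels by the noun chunk that contains them.

    ents:   list of (start, end, label) token spans, end exclusive
    chunks: list of (start, end) token spans, end exclusive
    Entities not covered by any chunk are returned as singleton groups.
    """
    groups, used = [], set()
    for c_start, c_end in chunks:
        inside = [k for k, (s, e, _) in enumerate(ents) if c_start <= s and e <= c_end]
        if inside:
            groups.append([ents[k][2] for k in inside])
            used.update(inside)
    groups.extend([label] for k, (_, _, label) in enumerate(ents) if k not in used)
    return groups


def combination_counts(groups):
    """Count how often each combination of entity types occurs."""
    return Counter(tuple(group) for group in groups)


# Hypothetical spans: one chunk covers an ORG and a PER, one covers a LOC,
# and a MISC entity lies outside every chunk.
ents = [(0, 1, "ORG"), (2, 4, "PER"), (5, 6, "LOC"), (7, 8, "MISC")]
chunks = [(0, 4), (5, 6)]
groups = group_entities(ents, chunks)
print(groups)
print(combination_counts(groups).most_common())
```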

### Function `span_compound(doc)`
The function fixes segmentation errors (if present) by extending the entity span to cover the whole noun compound, which is done by manipulating the IOB tags of the tokens involved. We analyze the cases in which a `compound` token lies next to the beginning or the end of an entity, and extend the span following the `compound` dependency relation.
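
The idea can be illustrated on lightweight token records instead of a real `doc`. In the sketch below, `dep` stands for spaCy's `token.dep_` and `head` for `token.head.i`; the `Tok` class, the function names and the example parse are hypothetical, and the procedure is one plausible reading of the description above rather than the code in `assignment2.py`. An entity is propagated along a `compound` arc whenever exactly one end of the arc is already inside an entity and the tokens in between are free.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    dep: str   # dependency label, as in spaCy's token.dep_
    head: int  # index of the syntactic head, as in token.head.i
    iob: str   # "B-ORG", "I-ORG", ... or "O"


def _fill(ent_of, src, dst):
    """Copy the entity of `src` onto `dst` and the tokens between them, if free."""
    ent = ent_of[src]
    lo, hi = sorted((src, dst))
    if any(ent_of[k] not in (None, ent) for k in range(lo, hi + 1)):
        return False  # another entity is in the way; leave everything unchanged
    for k in range(lo, hi + 1):
        ent_of[k] = ent
    return True


def extend_compounds(tokens):
    """Return new IOB tags with entity spans extended over `compound` arcs."""
    # 1. Number the entities so that adjacent entities stay distinct.
    ent_of, ent_type, current = [], {}, None
    for tok in tokens:
        if tok.iob == "O":
            current = None
        elif tok.iob.startswith("B-") or current is None or ent_type[current] != tok.iob[2:]:
            current = len(ent_type)
            ent_type[current] = tok.iob[2:]
        ent_of.append(current)
    # 2. Propagate entities along compound arcs until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if tok.dep != "compound":
                continue
            h = tok.head
            if ent_of[i] is None and ent_of[h] is not None:
                changed |= _fill(ent_of, h, i)
            elif ent_of[h] is None and ent_of[i] is not None:
                changed |= _fill(ent_of, i, h)
    # 3. Rebuild the IOB tags from the entity ids.
    tags = []
    for i, ent in enumerate(ent_of):
        if ent is None:
            tags.append("O")
        else:
            prefix = "I" if i > 0 and ent_of[i - 1] == ent else "B"
            tags.append(f"{prefix}-{ent_type[ent]}")
    return tags


# Hypothetical parse: the recognizer missed "Kuwait", a compound modifier of "Corporation".
sent = [
    Tok("Kuwait", "compound", 2, "O"),
    Tok("Petroleum", "compound", 2, "B-ORG"),
    Tok("Corporation", "nsubj", 3, "I-ORG"),
    Tok("said", "ROOT", 3, "O"),
]
print(extend_compounds(sent))
```

Working with entity ids rather than raw tags keeps two neighbouring entities of the same type from being merged when the tags are rebuilt.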