
Report 2


Code description:

This page is dedicated to explaining the logic behind the code for the second assignment. It describes the main functions found in assignment_module_2.py.

Helper class:

WhitespaceTokenizer

This class is used as an alternative tokenizer for spacy. It is present because it proved necessary for aligning spacy's tokenization with that of the CoNLL 2003 dataset.
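A minimal sketch of how such a whitespace tokenizer is typically plugged into spacy, following spacy's documented custom-tokenizer pattern (the model name is an assumption; the actual class in assignment_module_2.py may differ in detail):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Split on single spaces only, so tokens line up with CoNLL's."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.load("en_core_web_sm")   # assumed model
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
```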

Main functions:

extract_data

This function extracts data from the CoNLL 2003 dataset, returning the corresponding spacy Doc objects and two lists of data with the same structure, for easy parallel access to both lists.

Input: string (file path of the text file in CoNLL format)

Output: list[Doc], list[list[tuple(string, string, string, string)]], list[list[tuple(string, string, string, string)]]

The function reads the CoNLL corpus using the read_corpus_conll function of the conll script and builds a more easily accessible variable. This variable is a list of lists of tuples, in which each tuple is defined as (text, pos_tag, chunk_tag, iob_tag) for one token. The tuples are grouped by sentence into a list, and each resulting list is inserted into the main list, which covers the whole dataset.

The dataset is cleaned and fed to a spacy language object, updated with the new tokenizer, to retrieve the Doc objects. A new dataset variable, with the same structure as the one used for the CoNLL data, is then created from the spacy data, after those data have been properly adapted to the CoNLL format.

Finally, the three variables (the list of Doc objects, the CoNLL dataset, and the modified spacy dataset) are returned.
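As a rough illustration of the parsing step only, here is a hypothetical re-implementation that builds the same (text, pos_tag, chunk_tag, iob_tag) structure directly from a CoNLL 2003 file (the real code delegates this to read_corpus_conll, whose exact interface is not shown here, and the function name below is invented):

```python
def extract_conll_tuples(file_path):
    """Build a list of sentences, each a list of
    (text, pos_tag, chunk_tag, iob_tag) tuples."""
    dataset = []
    sentence = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                     # blank line ends a sentence
                if sentence:
                    dataset.append(sentence)
                    sentence = []
                continue
            if line.startswith("-DOCSTART-"):
                continue                     # skip document separators
            text, pos, chunk, iob = line.split()
            sentence.append((text, pos, chunk, iob))
    if sentence:
        dataset.append(sentence)
    return dataset
```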

evaluate_lists

This function provides general statistics about two lists, the first of estimated data and the second of true data, measuring the accuracy of the spacy classifier.

Input: list[tuple(string, string)] (estimates), list[tuple(string, string)] (ground truths), dict (optional dictionary with conversions from spacy to CoNLL), dict (optional dictionary with conversions from CoNLL to spacy)

Output: float (total accuracy), int (number of accurate predictions), int (total number of predictions), dict (per-tag accuracies)

This function takes two lists, which must be of equal length and made of tuples with at least two items, and, using the second list as ground truth, counts how many times the first list, treated as the estimates, matches the ground truth. If provided, the dictionaries are used to make the necessary conversions without affecting the original lists.
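A minimal sketch of the comparison, assuming (text, tag) tuples whose second item is the tag to compare (parameter names are hypothetical):

```python
from collections import defaultdict

def evaluate_lists(estimates, truths, est_map=None, truth_map=None):
    """Sketch: token-level accuracy of estimates against ground truths."""
    assert len(estimates) == len(truths)
    correct = 0
    per_tag = defaultdict(lambda: [0, 0])    # tag -> [hits, total]
    for est, true in zip(estimates, truths):
        est_tag, true_tag = est[1], true[1]
        if est_map:                          # optional tag conversions,
            est_tag = est_map.get(est_tag, est_tag)
        if truth_map:                        # applied without mutating inputs
            true_tag = truth_map.get(true_tag, true_tag)
        hit = est_tag == true_tag
        correct += hit
        per_tag[true_tag][0] += hit
        per_tag[true_tag][1] += 1
    total = len(truths)
    accuracies = {tag: hits / seen for tag, (hits, seen) in per_tag.items()}
    return correct / total, correct, total, accuracies
```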

group_entities

This function returns the list of valid entity types of a specific doc. It is used to calculate the frequency of each permutation of entity tags.

Input: Doc

Output: list[list[str]]

It takes the noun chunks from the spacy Doc object and, for each chunk, extracts the list of entity types. It then returns a list of all the lists found.
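A minimal sketch under these assumptions (whether tokens without an entity type are dropped or kept as empty strings is a guess; this version drops them):

```python
def group_entities(doc):
    """Sketch: collect the entity types of each noun chunk in a Doc."""
    groups = []
    for chunk in doc.noun_chunks:
        # ent_type_ is '' for tokens outside any entity; keep only real types
        types = [tok.ent_type_ for tok in chunk if tok.ent_type_]
        groups.append(types)
    return groups
```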

extend_noun_compound

This function returns a list of tuples whose IOB tags have been corrected for noun compounds; it is used to complete the third task.

Input: list[Doc]

Output: list[list[tuple(string, string)]], a list of lists of tuples defined as (token text, corrected iob tag)

For each Doc object, it creates a dictionary that stores the tag of each token, initializing every entry with 'O'. It then takes each entity present in the doc and, starting from the first element of the entity, expands its tag over all of the entity's tokens, storing them in the dictionary. Finally, it returns a list built from the dictionary, in the correct order.
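The following sketch covers only the tag-expansion step described above; how the entity spans themselves are widened over noun compounds is not fully specified here, so that part is omitted:

```python
def extend_noun_compound(docs):
    """Sketch: rebuild IOB tags per token by expanding each entity's type."""
    results = []
    for doc in docs:
        tags = {tok.i: "O" for tok in doc}          # default tag for every token
        for ent in doc.ents:
            tags[ent.start] = f"B-{ent.label_}"     # first token of the entity
            for i in range(ent.start + 1, ent.end): # remaining entity tokens
                tags[i] = f"I-{ent.label_}"
        # rebuild the (text, tag) list in document order
        results.append([(tok.text, tags[tok.i]) for tok in doc])
    return results
```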

Support Functions:

build_simple_data_list

Builds a simple list[list[tuple]] from the CoNLL and spacy datasets made by extract_data, keeping just the text and a single tag.

build_tuple_data_list

Undoes the grouping by sentence of the dataset from extract_data, preserving the tuples.

build_sentence_string

Creates a string from a list of tuples of the CoNLL dataset. Used to feed spacy.
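A minimal sketch of all three helpers, assuming the dataset structure produced by extract_data (tuple layouts and parameter names are guesses):

```python
def build_simple_data_list(dataset, tag_index=3):
    """Sketch: keep only the text and one tag (default: the IOB tag)."""
    return [[(tok[0], tok[tag_index]) for tok in sent] for sent in dataset]

def build_tuple_data_list(dataset):
    """Sketch: flatten the sentence grouping, preserving the tuples."""
    return [tok for sent in dataset for tok in sent]

def build_sentence_string(sentence):
    """Sketch: join token texts with spaces to feed spacy."""
    return " ".join(tok[0] for tok in sentence)
```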
