Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Conll-U to fully support and cover EWT dataset #194

Merged
merged 15 commits into from
Jun 3, 2021

Conversation

ZachEichen
Copy link
Collaborator

Addresses issue #191 and adds support for importing CoNLL-U data-format files, especially those in EWT, Global Dependencies, and conll_2009 formats, as well as Ontonotes.

Created a separate method from the conll_to_df as a new entry-point for these dataformats, which supports similar options to other available packages supporting .conllu files (such as Spacy). Refactored common code within the io/conll module to separate methods.

@ZachEichen ZachEichen requested a review from frreiss June 1, 2021 14:06
Copy link
Member

@frreiss frreiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Some minor changes requested inline.

Would you mind also running the modified files through black before checking in?

self._sentence_id = None
self._paragraph_id = None
self._doc_id = None
self.conll_09_format = predicate_args
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class field name should start with underscore.

def has_conll_u_metadata(self):
return (self._sentence_id is not None) or (self._paragraph_id is not None) or (self._doc_id is not None)

def set_conll_u_metadata(self, doc_id: str = None, paragraph_id: str = None, sent_id: str = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind augmenting these fields with a configurable set of key-value pairs? The CoNLL-U spec seems to think the comments before each sentence are supposed to hold arbitrary named attributes; see https://universaldependencies.org/format.html#sentence-boundaries-and-comments


def add_line_ewt(self, line_num: int, line_elems: List[str]):
"""
:param line_num: Location in file, for error reporting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note here to tell how this is different from add_line?

if len(line_elems) < 2 + len(self._column_names):
if len(line_elems) > 2 + self._num_standard_cols:
line_elems.extend(['_' for i in range(2 + len(self._column_names) - len(line_elems))])
print(f"Unexpected number of elements {len(line_elems)} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this print statement is supposed to be here.

-> List[List[_SentenceData]]:
"""

Parses EWT file format to python objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function just for EWT, or is it for anything that meets the CoNLL-U standard? If the former, the function ought to be called _parse_ewt_file; if the latter, this docstring needs to be updated.

@@ -40,6 +40,7 @@
# Special token that CoNLL-2003 format uses to delineate the documents in
# the collection.
_CONLL_DOC_SEPARATOR = "-DOCSTART-"
_EWT_DOC_SEPERATOR = "# newdoc id"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this separator is actually officially part of the CoNLL-U standard and isn't just for EWT; see https://universaldependencies.org/format.html#paragraph-and-document-boundaries

elif line_elems[0] == "# sent_id":
sentence_id = line_elems[1]
current_sentence.set_conll_u_metadata(sent_id=sentence_id)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an additional branch to this if statement that gathers up any other metadata key-value pairs that the dataset sees fit to attach to the sentence (see https://universaldependencies.org/format.html#sentence-boundaries-and-comments).

It would be nice if there was also a way to have this function map a user-configurable set of additional metadata values to additional columns of the returned DataFrame.

:param merge_subtokens: dictates how to handle tokens that are smaller than one word. By default, we keep
the subtokens as two seperate entities, but if this is set to true, the subtokens will be merged into a
single entity, of the same length as the token, and their attributes will be concatenated
:param merge_subtoken_seperator: If merge subtokens is selected, concatenate the attributes with this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seperator ==> separator

@frreiss frreiss merged commit 3fb107c into CODAIT:master Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants