Update Conll-U to fully support and cover EWT dataset #194

Merged

frreiss merged 15 commits into CODAIT:master from ZachEichen:master

Jun 3, 2021

Collaborator

ZachEichen commented Jun 1, 2021

Addresses issue #191 and adds support for importing files in the CoNLL-U data format, in particular the EWT, Global Dependencies, and conll_2009 variants, as well as OntoNotes.

Added a new entry point for these data formats, separate from conll_to_df, that supports options similar to those offered by other packages that read .conllu files (such as spaCy). Refactored common code within the io/conll module into separate methods.

ZachEichen added 5 commits

May 26, 2021 10:57


          initial version of conll_u support. still adding metadata for documen…

90901f4

…ts, paragraphs, and sentences


          added support for certian metadata components

6c079b4


          added support for merging subtokens

5a73ba8


          furthered support for alternate versions of conllu files

1cf7223


          added support for predicate encodings

0070fd8

ZachEichen requested a review from frreiss

June 1, 2021 14:06

ZachEichen added 5 commits

June 1, 2021 10:48


          removed bug where first doc didn't have a doc_id metadata

83c1899


          added logic to re-point 'head' field at the new indexes of heads with…

dae57c5

…in sentence


          updated unit tests accordingly

d968892


          added more intense regression tests for conllu files


          added missing file from last push

197468b

frreiss approved these changes

Member

frreiss left a comment

Looking good. Some minor changes requested inline.

Would you mind also running the modified files through black before checking in?

text_extensions_for_pandas/io/conll.py Outdated

+                      self._sentence_id = None
+                      self._paragraph_id = None
+                      self._doc_id = None
+                      self.conll_09_format = predicate_args

Member

frreiss Jun 2, 2021

This class field name should start with underscore.

text_extensions_for_pandas/io/conll.py Outdated

+                  def has_conll_u_metadata(self):
+                      return (self._sentence_id is not None) or (self._paragraph_id is not None) or (self._doc_id is not None)
+                  def set_conll_u_metadata(self, doc_id: str = None, paragraph_id: str = None, sent_id: str = None):

Member

frreiss Jun 2, 2021

Would you mind augmenting these fields with a configurable set of key-value pairs? The CoNLL-U spec seems to think the comments before each sentence are supposed to hold arbitrary named attributes; see https://universaldependencies.org/format.html#sentence-boundaries-and-comments
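
For illustration, the key-value convention the comment refers to can be sketched as below. This is a minimal sketch, not the code in this PR: `parse_comment` and `collect_sentence_metadata` are hypothetical names, and the sketch assumes comment lines of the form `# key = value`, with a bare `# key` treated as a key with an empty value.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_comment(line: str) -> Optional[Tuple[str, str]]:
    """Split a '# key = value' comment into (key, value).

    Hypothetical helper. Returns None for lines that are not comments
    or that are empty after the '#'.
    """
    if not line.startswith("#"):
        return None
    body = line[1:].strip()
    if not body:
        return None
    # Split on the first '=' only, so values may themselves contain '='.
    key, sep, value = body.partition("=")
    if not sep:
        return key.strip(), ""
    return key.strip(), value.strip()


def collect_sentence_metadata(lines: Iterable[str]) -> Dict[str, str]:
    """Gather every comment attribute that precedes a sentence."""
    metadata: Dict[str, str] = {}
    for line in lines:
        parsed = parse_comment(line)
        if parsed is not None:
            key, value = parsed
            metadata[key] = value
    return metadata
```

Under these assumptions, `sent_id` is just one key among many, so no attribute needs its own dedicated field.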

text_extensions_for_pandas/io/conll.py

+                  def add_line_ewt(self, line_num: int, line_elems: List[str]):
+                      """
+                      :param line_num: Location in file, for error reporting

Member

frreiss Jun 2, 2021

Can you add a note here to tell how this is different from add_line?

text_extensions_for_pandas/io/conll.py Outdated

+                      if len(line_elems) < 2 + len(self._column_names):
+                          if len(line_elems) > 2 + self._num_standard_cols:
+                              line_elems.extend(['_' for i in range(2 + len(self._column_names) - len(line_elems))])
+                              print(f"Unexpected number of elements {len(line_elems)} "

Member

frreiss Jun 2, 2021

It doesn't look like this print statement is supposed to be here.
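
If the message is meant to survive as a diagnostic rather than be deleted, one option is to route it through Python's standard `warnings` module so callers can filter or silence it. The sketch below is an assumption about the intent, not the PR's code: `pad_line_elems` and its parameters are hypothetical, and the padding rule is simplified to "fill short rows with `_` up to the expected length".

```python
import warnings
from typing import List


def pad_line_elems(line_elems: List[str], expected_len: int, line_num: int) -> List[str]:
    """Pad a short row with '_' placeholders and warn instead of printing.

    Hypothetical helper; returns a new list and leaves the input untouched.
    """
    if len(line_elems) >= expected_len:
        return line_elems
    warnings.warn(
        f"Line {line_num}: expected {expected_len} elements, "
        f"got {len(line_elems)}; padding with '_'",
        stacklevel=2,
    )
    return line_elems + ["_"] * (expected_len - len(line_elems))
```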

text_extensions_for_pandas/io/conll.py Outdated

+                      -> List[List[_SentenceData]]:
+                  """
+                  Parses EWT file format to python objects

Member

frreiss Jun 2, 2021

Is this function just for EWT, or is it for anything that meets the CoNLL-U standard? If the former, the function ought to be called _parse_ewt_file; if the latter, this docstring needs to be updated.

text_extensions_for_pandas/io/conll.py

@@ -40,6 +40,7 @@
               # Special token that CoNLL-2003 format uses to delineate the documents in
               # the collection.
               _CONLL_DOC_SEPARATOR = "-DOCSTART-"
+              _EWT_DOC_SEPERATOR = "# newdoc id"

Member

frreiss Jun 2, 2021

I think this separator is actually officially part of the CoNLL-U standard and isn't just for EWT; see https://universaldependencies.org/format.html#paragraph-and-document-boundaries
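
As a rough illustration of treating the marker as a general CoNLL-U document boundary, the sketch below splits a sequence of lines into documents at each `# newdoc` comment. It is a simplified sketch under stated assumptions: `split_documents` is a hypothetical name, the marker is assumed to appear as either `# newdoc` or `# newdoc id = <value>`, and lines seen before any marker are placed in a document with no id.

```python
from typing import Iterable, List, Optional, Tuple


def split_documents(lines: Iterable[str]) -> List[Tuple[Optional[str], List[str]]]:
    """Group lines into (doc_id, lines) pairs, one pair per document.

    Hypothetical helper. The boundary comment itself is consumed and
    not included in the document's lines.
    """
    docs: List[Tuple[Optional[str], List[str]]] = []
    for line in lines:
        stripped = line.strip()
        if stripped == "# newdoc" or stripped.startswith("# newdoc "):
            # '# newdoc id = X' carries an id; a bare '# newdoc' does not.
            _, sep, value = stripped.partition("=")
            docs.append((value.strip() if sep else None, []))
            continue
        if not docs:
            # Content before the first marker still belongs to a document.
            docs.append((None, []))
        docs[-1][1].append(line)
    return docs
```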

text_extensions_for_pandas/io/conll.py

+                          elif line_elems[0] == "# sent_id":
+                              sentence_id = line_elems[1]
+                              current_sentence.set_conll_u_metadata(sent_id=sentence_id)

Member

frreiss Jun 2, 2021

There should be an additional branch to this if statement that gathers up any other metadata key-value pairs that the dataset sees fit to attach to the sentence (see https://universaldependencies.org/format.html#sentence-boundaries-and-comments).

It would be nice if there was also a way to have this function map a user-configurable set of additional metadata values to additional columns of the returned DataFrame.
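
One way to picture the requested mapping is sketched below: the caller names the metadata keys they care about, and each key becomes a column whose value is repeated for every token of the sentence. This is an illustrative sketch only. `metadata_to_columns`, the `Sentence` alias, and the `token` column name are assumptions, not the API this PR adds.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical representation: (sentence metadata, token strings).
Sentence = Tuple[Dict[str, str], List[str]]


def metadata_to_columns(
    sentences: Sequence[Sentence], keys: Sequence[str]
) -> Dict[str, List[Optional[str]]]:
    """Build one column per requested metadata key, aligned with tokens.

    Sentences that lack a requested key get None in that column.
    """
    columns: Dict[str, List[Optional[str]]] = {"token": []}
    for key in keys:
        columns[key] = []
    for metadata, tokens in sentences:
        for token in tokens:
            columns["token"].append(token)
            for key in keys:
                columns[key].append(metadata.get(key))
    return columns
```

Because every list has the same length, the returned dict can be passed directly to the `pandas.DataFrame` constructor to get one column per key.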

text_extensions_for_pandas/io/conll.py Outdated

+                  :param merge_subtokens: dictates how to handle tokens that are smaller than one word. By default, we keep
+                   the subtokens as two seperate entities, but if this is set to true, the subtokens will be merged into a
+                   single entity, of the same length as the token, and their attributes will be concatenated
+                  :param merge_subtoken_seperator: If merge subtokens is selected, concatenate the attributes with this

Member

frreiss Jun 2, 2021

seperator ==> separator
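
To make the merge behaviour described in that docstring concrete, here is a small sketch. It assumes rows of `(id, form, tag)` in which a multiword token appears as a range id such as `1-2`, followed by one row per subtoken. The function name, the row layout, and the default `|` separator are all hypothetical, and ids with a decimal point (empty nodes) are not handled.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (id, form, tag); hypothetical layout


def merge_subtokens(rows: List[Row], separator: str = "|") -> List[Row]:
    """Collapse each range row and its subtokens into a single row.

    The merged row keeps the surface form of the range row and joins
    the subtokens' tags with `separator`.
    """
    merged: List[Row] = []
    i = 0
    while i < len(rows):
        tok_id, form, _tag = rows[i]
        if "-" in tok_id:
            start, end = (int(part) for part in tok_id.split("-"))
            count = end - start + 1
            parts = rows[i + 1 : i + 1 + count]
            merged.append((tok_id, form, separator.join(p[2] for p in parts)))
            i += 1 + count  # skip the range row and its subtokens
        else:
            merged.append(rows[i])
            i += 1
    return merged
```

For example, the rows `("1-2", "vámonos", "_")`, `("1", "vamos", "VERB")`, `("2", "nos", "PRON")` collapse into a single row for `vámonos` whose tag is `VERB|PRON`.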

ZachEichen added 5 commits

June 2, 2021 10:25


          addressed comments in pull request. ran through formatter

92f8ea6


          added support for user configurable metadata tags

de13bb2


          fixed docstrings

3e2f6fb


          formatted

ffb5162


          cleaned up files in the case where no instances of metadata exist

0f90151

frreiss merged commit 3fb107c into CODAIT:master
