Standardize sentence-level comments that are used in multiple treebanks #273

dan-zeman · 2016-03-22T20:12:32Z

Originally reported by @jeanm in #272 (comment)

I noticed UD_French has a comment above each annotated sentence with the unannotated text. For example:

# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1   Aviator _   PROPN   _   _   0   root    _   _
2   ,   _   PUNCT   _   _   1   punct   _   _
3   un  _   DET _   _   4   det _   _
4   film    _   NOUN    _   _   1   appos   _   _
5   sur _   ADP _   _   7   case    _   _
6   la  _   DET _   _   7   det _   _
7   vie _   NOUN    _   _   4   nmod    _   _
8   de  _   ADP _   _   9   case    _   _
9   Hughes  _   PROPN   _   _   7   nmod    _   _
10  .   _   PUNCT   _   _   1   punct   _   _

Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional # sentence-text: <unannotated text> line before each annotated sentence.


manning · 2016-04-03T18:13:03Z

This seems to me a very good idea. I like the idea of making comments key-value pairs and attempting to standardize the keys across corpora. (Maybe "sentence-text" could be shorter though, like just "text"?)

This seems like a good issue to put someone else in charge of! I'm nominating Ryan. :)

jnivre · 2016-06-28T10:21:55Z

I support this, but I think we should standardize sentence-ids as well, and perhaps make them globally unique by having a treebank-specific prefix as in the French example. This can be useful if people want to add additional (standoff) annotation to UD treebanks, and also for alignment information in parallel treebanks.

martinpopel · 2016-06-28T13:22:33Z

Thanks for reviving this issue. What should be decided here:

format of attribute-value pairs in comments. My suggestion: # attribute value. Alternatives: # attribute = value, # attribute: value.
name of the sentence plain text attribute. My suggestion: sentence. Alternatives: sentence-text, text, ...
name of the id attribute. My suggestion: sent_id. Alternatives: tree_id, sentid, ID,...

I have several related proposals which I plan to discuss in separate issues soon, especially the format of the id values (#321).

amir-zeldes · 2016-06-30T15:09:07Z

Thanks for suggesting these, I think it makes a lot of sense. I would be against #attribute value since people put all sorts of things in comments, and a simple prose comment could then be confused with key-value pairs.

Between these options I'd vote for a=b, because it's least likely to occur by chance (colons are more prose like). In fact, I would almost prefer to have a text delimiter for the value, such as:

# attribute="some value with spaces in it"

This could be useful just to make it very clear if we have white space, or multiple attributes on one line, or who knows what.

I'd also like to mention another specific attribute that could be useful, which is the speaker, if known. It could look like this: (apologies for the non-UD example)

#speaker="Mario J. Lucero"
1       Heaven  _       NNP     NNP     _       2       nn      _       _
2       Sent    _       NNP     NNP     _       3       nn      _       _
3       Gaming  _       NNP     NNP     _       6       nsubj   _       _
4       is      _       VBZ     VBZ     _       6       cop     _       _
5       basically       _       RB      RB      _       6       advmod  _       _
6       me      _       PRP     PRP     _       0       root    _       _
7       and     _       CC      CC      _       6       cc      _       _
8       Isabel  _       NNP     NNP     _       6       conj    _       _
9       ,       _       ,       ,       _       0       punct   _       _
10      I       _       PRP     PRP     _       14      nsubj   _       _
11      'm      _       VBP     VBP     _       14      cop     _       _
12      Mario   _       NNP     NNP     _       14      nn      _       _
13      J.      _       NNP     NNP     _       14      nn      _       _
14      Lucero  _       NNP     NNP     _       6       parataxis       _       _
15      .       _       .       .       _       0       punct   _       _

This is helpful for anaphora resolution using dependency trees as input.

dan-zeman · 2016-06-30T15:48:12Z

Text delimiters tend to complicate parsing of the line, and so do multiple attributes on one line. I prefer to keep it simple. The key, then perhaps the equals sign, and the rest of the line is the value, leading and trailing whitespace removed.

We only need to standardize those key-value pairs that we foresee to be of general interest and used in multiple treebanks (or, in the cases like the sent_id, obligatorily in all treebanks). These may receive special treatment by UD tools, and also some attention by the format validator. The rest may or may not be / look like key-value pairs, but from the UD point of view they do not differ from ordinary prose comments.

martinpopel · 2016-06-30T16:27:49Z

@amir-zeldes thanks for your suggestions.

Double-quote delimiters would require defining escaping, e.g. # sentence "I say \"Hello!\"". I agree with @dan-zeman: let's keep it simple at the cost of forbidding values with leading/trailing whitespace and with newlines. The whole UD is about simplification anyway.
Personally, I would also prefer to add the equals sign (with optional space on both sides) between the key and value. It is just that most of the v1.3 treebanks use # sent_id 123 because it was mentioned as an example in the CoNLL-U specification.
Regarding simple prose comments being confused with key-value pairs: the solution I had in mind was a list of standardized attribute names (so far sent_id and sentence). UD tools can then extract those from the comments during loading and serialize them back during saving, while other key-value-like lines are kept within the comments. An alternative solution is to allow any attribute names ([a-z_]+) and let the UD tools extract from the comments all lines matching ^# ([a-z_]+) *= *(.*)$. However, this is rather a question of the design and API of the UD tools. But if we want to go this way in the future, it would be wise to standardize the equals sign.
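
The two extraction strategies described in this comment can be compared in a small Python sketch. It is only an illustration, not the behaviour of any existing UD tool: the function name `split_comments` and the `STANDARD_KEYS` allowlist are hypothetical, and the only thing taken from the thread is the pattern ^# ([a-z_]+) *= *(.*)$ and the two attribute names proposed so far.

```python
import re

# Pattern quoted in the discussion: "# key = value", optional spaces around "=".
KEY_VALUE = re.compile(r"^# ([a-z_]+) *= *(.*)$")

# Hypothetical allowlist holding the two names proposed so far in the thread.
STANDARD_KEYS = {"sent_id", "sentence"}


def split_comments(lines, allowed=None):
    """Separate key-value comments from ordinary comments.

    If `allowed` is given, only those keys are extracted; every other
    comment line, key-value-like or not, is kept verbatim.
    If `allowed` is None, every line matching the pattern is extracted.
    """
    attrs, other = {}, []
    for line in lines:
        match = KEY_VALUE.match(line)
        if match and (allowed is None or match.group(1) in allowed):
            # Leading spaces are consumed by " *" in the pattern;
            # strip() also removes trailing whitespace.
            attrs[match.group(1)] = match.group(2).strip()
        else:
            other.append(line)
    return attrs, other


comments = [
    "# sent_id = fr-ud-dev_00001",
    "# sentence = Aviator, un film sur la vie de Hughes.",
    "# speaker = Mario J. Lucero",
    "# checked by a second annotator",
]
attrs, other = split_comments(comments, allowed=STANDARD_KEYS)
all_attrs, prose = split_comments(comments)
```

With the allowlist, the speaker line stays among the ordinary comments; without it, it is extracted as an attribute, and only the prose line is left over.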

amir-zeldes · 2016-06-30T16:47:21Z

Agreed, that all makes sense. I think insisting on the = sign is a good idea, even if some of the existing resources with # sent_id 123 will have to be updated (also easy to fix automatically).
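
As a rough idea of how small the automatic fix could be, here is a Python sketch that rewrites the older `# sent_id 123` style into `# sent_id = 123`. The function name `add_equals_sign` is made up for this example, and the sketch assumes the old-style lines always have at least one space between the key and the value.

```python
import re

# Matches the v1.3-style "# sent_id 123" but not lines that already use "=".
OLD_SENT_ID = re.compile(r"^# sent_id +(?!=)(\S.*)$")


def add_equals_sign(line):
    """Rewrite '# sent_id VALUE' as '# sent_id = VALUE'; leave other lines alone."""
    match = OLD_SENT_ID.match(line)
    if match:
        return "# sent_id = " + match.group(1).strip()
    return line


old_style = add_equals_sign("# sent_id 123")
already_new = add_equals_sign("# sent_id = 123")
```

Lines that already contain the equals sign, token lines, and other comments pass through unchanged, so the function can be applied to every line of a file.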

So that means that space-containing values are generally allowed, and the value is always stripped? Is there any situation where we might want a leading/trailing space kept?

My original thinking was for sentence text, something like a space past the final period, if we want to preserve white space behavior from a resource that we have other, non-dependency annotations for. On the other hand, the tokenization itself is non-whitespace-preserving, so that's probably a moot point.

fginter · 2016-07-01T10:29:34Z

👍 from me on this one. I would myself prefer tree_id over sentence and would like to keep a notion of document_id as well. This is quite important for search tools which need to return the context of a hit (plus or minus a handful of sentences), i.e. they need to know the document boundaries. url is another which could be listed. I think ^# ([a-z_]+) *= *(.*)$ would be a good common ground.

martinpopel · 2016-07-06T18:26:16Z

OK, I change my suggestion to # attribute = value (i.e. ^# ([a-z_]+) *= *(.*)$) as @amir-zeldes and @fginter support this.

> I would myself prefer tree_id over sentence

Note that the sentence is meant to store the plain text string of the sentence, so it is not a substitute for sent_id/tree_id. It should be possible to reconstruct (up to redundant whitespace) the sentence string from the forms and SpaceAfter=No (if multi-word tokens are used properly), but there are still use cases for this attribute.
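
The reconstruction mentioned above can be sketched in Python as follows. This is a simplified illustration under a few assumptions: tokens are given as (ID, FORM, MISC) string triples, a multi-word token is a range line such as "1-2" placed before the words it covers, and the function name `reconstruct_text` is hypothetical.

```python
def reconstruct_text(tokens):
    """Rebuild the sentence string from (ID, FORM, MISC) triples.

    A token is followed by a space unless its MISC column contains
    SpaceAfter=No. For a multi-word token (ID like "1-2") the surface
    form of the range line is used and the words it covers are skipped.
    """
    parts = []
    skip_until = 0
    for tok_id, form, misc in tokens:
        if "-" in tok_id:
            # Range line: remember the last word ID covered by it.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue  # word already represented by the range line
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    # Drop the space added after the final token.
    return "".join(parts).rstrip()


french = [
    ("1", "Aviator", "SpaceAfter=No"), ("2", ",", "_"), ("3", "un", "_"),
    ("4", "film", "_"), ("5", "sur", "_"), ("6", "la", "_"),
    ("7", "vie", "_"), ("8", "de", "_"),
    ("9", "Hughes", "SpaceAfter=No"), ("10", ".", "_"),
]
text = reconstruct_text(french)
```

The result matches the original only up to redundant whitespace, which is the caveat stated above: multiple spaces or spaces at the sentence edges cannot be recovered this way.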

Or did you mean "I would myself prefer tree_id over sent_id"?
I also slightly prefer the tree_id name because sometimes I have more trees (alternative annotations) for the same sentence: see #321. However, sent_id was already mentioned in docs, so I wanted to stay backward compatible. Now, if by adding the equals sign we lose the compatibility with UD v1.3 anyway, I am voting for tree_id.

> and would like to keep a notion of document_id as well.

Good idea. My original idea was to keep doc_id (optionally) encoded within sent_id. I also plan to use sent_id#node_id for storing coreference and word alignment. This results in full node IDs such as f000001-s2/en#12, which are a bit too long (this is a real example taken from CzEng). If we standardize doc_id (and assuming that coreference across document boundaries is forbidden), I can use just s2/en#12 when referring to that node. So I vote for this as well. Perhaps we should discuss it with others within #321.

foxik · 2016-08-26T08:55:41Z

We are discussing how the original sentence can be encoded in sentence-level comments in #332. The latest suggestion from this issue, ^# ([a-z_]+) *= *(.*)$, cannot represent spaces at the beginning of the sentence. Therefore I suggested ^# ([a-z_0-9]+)=(.*)$ with C-like escapes of the value (only \n, \r and \\). I also suggested we use text instead of sentence for the attribute storing the text of the sentence (because sentence suggests also other meanings like sentence id, while text does not).
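
To make the stricter proposal concrete, here is a Python sketch of the pattern ^# ([a-z_0-9]+)=(.*)$ combined with the three C-like escapes (\n, \r and \\). The helper names `escape_value`, `unescape_value` and `parse_strict` are invented for this illustration, and leaving unknown escape sequences untouched is an assumption, since the thread does not say how they should be handled.

```python
import re

# Stricter pattern from this proposal: no spaces around "=", so leading
# spaces in the value are preserved.
STRICT_KEY_VALUE = re.compile(r"^# ([a-z_0-9]+)=(.*)$")

_UNESCAPES = {"n": "\n", "r": "\r", "\\": "\\"}


def escape_value(value):
    """Escape backslash, newline and carriage return (backslash first)."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r")


def unescape_value(value):
    """Inverse of escape_value; unknown escape sequences are left untouched."""
    return re.sub(r"\\([nr\\])", lambda m: _UNESCAPES[m.group(1)], value)


def parse_strict(line):
    """Return (key, unescaped value), or None if the line is not key=value."""
    match = STRICT_KEY_VALUE.match(line)
    if match is None:
        return None
    return match.group(1), unescape_value(match.group(2))


original = "  a\\b\nc\r"
line = "# text=" + escape_value(original)
parsed = parse_strict(line)
```

Because the backslash is escaped first and unescaping scans left to right, a literal backslash followed by the letter n survives the round trip without turning into a newline.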

The discussion will probably continue in #332, but if consensus is reached, I will post it here as well.

dan-zeman · 2016-08-26T09:09:29Z

I would prefer to keep the spaces around = for readability. Also, I don't like trailing whitespace to be semantically significant - occasionally CoNLL-U files are edited manually, and then you have to double-check whether your editor preserves trailing whitespace (e.g. I have set my editor to remove it). But since I also opposed using quotation marks around attribute values, I guess I am for escaping any whitespace before the first / after the last token. Unfortunately, that means that almost every sentence ends with \s.

I am fine with naming it text.

fginter · 2016-08-26T09:51:48Z

I personally do not see the need to represent leading and trailing whitespace on sentences (with or without escaping). We should make the format user-friendly and dealing with escaped and double-escaped characters is such a hassle.

This of course all boils down to the question of whether CoNLL-U is meant to faithfully represent a text corpus with its annotation. I personally do not see that as the task of this format. Anyway, that's just me - feel free to disagree. :)

martinpopel · 2016-08-26T10:06:30Z

OK, we can name the attribute text instead of sentence (we'll have to change also the API in Python, Perl and Java, but that's still possible).

My original idea was to ignore spaces before/after sentences and multiple spaces between words in CoNLL-U. After all, CoNLL-U and UD are all about simplification.
Even if we decide to encode such spaces (e.g. in order to train segmenters on noisy data where spaces between sentences may be missing), I agree with @dan-zeman that

CoNLL-U should not allow trailing whitespace
It is not user-friendly if almost all sentences (i.e. text) end with \s

foxik · 2016-08-26T19:20:41Z

Oh -- if you are already using sentence in Udapi, I think it is not worth the rename. It was just a minor thing. So back to sentence?

martinpopel · 2016-08-26T20:41:02Z

Udapi is not in a stable version yet, so it is no problem to change sentence to text if there is an agreement that it is a better name.

spyysalo · 2016-09-08T10:30:32Z

👍 for naming this text and avoiding delimiters (such as double-quotes), 👎 for having most sentences end with \s. I agree with @fginter and @martinpopel to ignore spaces between sentences.

spyysalo · 2016-12-01T00:58:15Z

Resolved in v2 as # sent_id = (mandatory) and # text = (optional) (http://universaldependencies.org/format.html)
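
For readers who want to see the resolution in practice, the following Python sketch reads the two comments from one sentence block. It is a minimal illustration, not the official validator: the function name `read_sentence_header` is hypothetical, and the sketch assumes that comment lines come before the token lines and that the value is everything after the first equals sign, with surrounding whitespace stripped.

```python
def read_sentence_header(block):
    """Return (sent_id, text) from the comment lines of one sentence block.

    `text` is None when the optional comment is absent; a missing
    sent_id raises ValueError because that comment is mandatory.
    """
    found = {}
    for line in block:
        if not line.startswith("#"):
            break  # stop at the first token line
        key, sep, value = line[1:].partition("=")
        if sep:
            found[key.strip()] = value.strip()
    if "sent_id" not in found:
        raise ValueError("missing mandatory '# sent_id = ...' comment")
    return found["sent_id"], found.get("text")


block = [
    "# sent_id = fr-ud-dev_00001",
    "# text = Aviator, un film sur la vie de Hughes.",
    "1\tAviator\t_\tPROPN\t_\t_\t0\troot\t_\t_",
]
sent_id, text = read_sentence_header(block)
```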

dan-zeman added standard needed CoNLL-U labels Mar 22, 2016

dan-zeman added this to the universal v2 milestone Mar 22, 2016

dan-zeman mentioned this issue Mar 22, 2016

unannotated text in UD languages? #272

Closed

manning assigned ryanmcd Apr 3, 2016

martinpopel mentioned this issue Jul 6, 2016

sent_id format and parallel treebanks #321

Closed

foxik mentioned this issue Aug 26, 2016

Allow representing all space characters of the original text in the CoNLL-U format. #332

Closed

spyysalo closed this as completed Dec 1, 2016

martinpopel mentioned this issue Feb 17, 2022

Recommended format for alignments? #846

Open
