Standardize sentence-level comments that are used in multiple treebanks #273

Closed
dan-zeman opened this issue Mar 22, 2016 · 17 comments

Comments

@dan-zeman
Member

Originally reported by @jeanm in #272 (comment)

I noticed UD_French has a comment above each annotated sentence with the unannotated text. For example:

# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1   Aviator _   PROPN   _   _   0   root    _   _
2   ,   _   PUNCT   _   _   1   punct   _   _
3   un  _   DET _   _   4   det _   _
4   film    _   NOUN    _   _   1   appos   _   _
5   sur _   ADP _   _   7   case    _   _
6   la  _   DET _   _   7   det _   _
7   vie _   NOUN    _   _   4   nmod    _   _
8   de  _   ADP _   _   9   case    _   _
9   Hughes  _   PROPN   _   _   7   nmod    _   _
10  .   _   PUNCT   _   _   1   punct   _   _

Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional # sentence-text: <unannotated text> line before each annotated sentence.

@manning
Contributor

manning commented Apr 3, 2016

This seems to me a very good idea. I like the idea of making comments key-value pairs and attempting to standardize the keys across corpora. (Maybe "sentence-text" could be shorter though, like just "text"?)

This seems like a good issue to put someone else in charge of! I'm nominating Ryan. :)

@jnivre
Contributor

jnivre commented Jun 28, 2016

I support this, but I think we should standardize sentence-ids as well, and perhaps make them globally unique by having a treebank-specific prefix as in the French example. This can be useful if people want to add additional (standoff) annotation to UD treebanks, and also for alignment information in parallel treebanks.

@martinpopel
Member

martinpopel commented Jun 28, 2016

Thanks for reviving this issue. What should be decided here:

  • format of attribute-value pairs in comments. My suggestion: # attribute value. Alternatives: # attribute = value, # attribute: value.
  • name of the sentence plain text attribute. My suggestion: sentence. Alternatives: sentence-text, text, ...
  • name of the id attribute. My suggestion: sent_id. Alternatives: tree_id, sentid, ID,...

I have several related proposals which I plan to discuss in separate issues soon, especially the format of the id values (#321).

@amir-zeldes
Contributor

Thanks for suggesting these, I think it makes a lot of sense. I would be against #attribute value since people put all sorts of things in comments, and a simple prose comment could then be confused with key-value pairs.

Between these options I'd vote for a=b, because it's least likely to occur by chance (colons are more prose like). In fact, I would almost prefer to have a text delimiter for the value, such as:

# attribute="some value with spaces in it"

This could be useful just to make it very clear if we have white space, or multiple attributes on one line, or who knows what.

I'd also like to mention another specific attribute that could be useful, which is the speaker, if known. It could look like this: (apologies for the non-UD example)

#speaker="Mario J. Lucero"
1       Heaven  _       NNP     NNP     _       2       nn      _       _
2       Sent    _       NNP     NNP     _       3       nn      _       _
3       Gaming  _       NNP     NNP     _       6       nsubj   _       _
4       is      _       VBZ     VBZ     _       6       cop     _       _
5       basically       _       RB      RB      _       6       advmod  _       _
6       me      _       PRP     PRP     _       0       root    _       _
7       and     _       CC      CC      _       6       cc      _       _
8       Isabel  _       NNP     NNP     _       6       conj    _       _
9       ,       _       ,       ,       _       0       punct   _       _
10      I       _       PRP     PRP     _       14      nsubj   _       _
11      'm      _       VBP     VBP     _       14      cop     _       _
12      Mario   _       NNP     NNP     _       14      nn      _       _
13      J.      _       NNP     NNP     _       14      nn      _       _
14      Lucero  _       NNP     NNP     _       6       parataxis       _       _
15      .       _       .       .       _       0       punct   _       _

This is helpful for anaphora resolution using dependency trees as input.

@dan-zeman
Member Author

Text delimiters tend to complicate parsing of the line, and so do multiple attributes on one line. I prefer to keep it simple. The key, then perhaps the equals sign, and the rest of the line is the value, leading and trailing whitespace removed.
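
A minimal sketch of the rule described above, assuming the variant with an equals sign: everything before the first = is the key, the rest of the line is the value, and leading and trailing whitespace is removed from both. The function name parse_comment is hypothetical and not part of any UD tool; rejecting keys that contain whitespace is an extra assumption, added so that prose comments containing an equals sign are less likely to be misread as attributes.

```python
def parse_comment(line):
    """Split a '# key = value' comment into (key, value).

    Returns None for lines that do not look like a key-value comment.
    """
    if not line.startswith("#"):
        return None
    # Only the first '=' separates key and value; later ones belong to the value.
    key, sep, value = line[1:].partition("=")
    key = key.strip()
    # Assumption: a key is a single token, so prose with spaces is rejected.
    if not sep or not key or any(ch.isspace() for ch in key):
        return None
    return key, value.strip()
```

With this reading, no quoting or escaping is needed, at the cost of not being able to store values with leading or trailing whitespace.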

We only need to standardize those key-value pairs that we foresee to be of general interest and used in multiple treebanks (or, in the cases like the sent_id, obligatorily in all treebanks). These may receive special treatment by UD tools, and also some attention by the format validator. The rest may or may not be / look like key-value pairs, but from the UD point of view they do not differ from ordinary prose comments.

@martinpopel
Member

@amir-zeldes thanks for your suggestions.

  • Double-quote delimiters would mean we need to define escaping, e.g. # sentence "I say \"Hello!\"". I agree with @dan-zeman: let's keep it simple at the cost of forbidding values with leading/trailing whitespace and with newlines. The whole UD is about simplification anyway.
  • Personally, I would also prefer to add the equals sign (with optional space on both sides) between the key and value. It is just that most of the v1.3 treebanks use # sent_id 123 because it was mentioned as an example in the CoNLL-U specification.
  • Regarding simple prose comments being confused with key-value pairs: the solution I had in mind was a list of standardized attribute names (so far sent_id and sentence). UD tools can then extract those from the comments during loading and serialize them back during saving, while other key-value-like lines are kept within the comments. An alternative solution is to allow any attribute names ([a-z_]+) and let the UD tools extract from the comments all lines matching ^# ([a-z_]+) *= *(.*)$. However, this is rather a question of the design and API of the UD tools. But if we want to go this way in the future, it would be wise to standardize the equals sign.
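
The first solution in the last bullet could be sketched roughly as follows. The regular expression is the one quoted above; the function name split_comments and the set STANDARD_KEYS are hypothetical, and the set simply mirrors the two attribute names mentioned so far in this thread.

```python
import re

# Pattern quoted in the discussion above.
ATTR_RE = re.compile(r"^# ([a-z_]+) *= *(.*)$")
# Assumption: only these names are standardized at this point in the thread.
STANDARD_KEYS = {"sent_id", "sentence"}

def split_comments(lines):
    """Separate standardized attributes from all other comment lines.

    Returns (attrs, other): attrs maps standardized keys to values,
    other keeps every remaining comment line verbatim for saving.
    """
    attrs, other = {}, []
    for line in lines:
        m = ATTR_RE.match(line)
        if m and m.group(1) in STANDARD_KEYS:
            attrs[m.group(1)] = m.group(2)
        else:
            other.append(line)
    return attrs, other
```

Note that a v1.3-style line such as # sent_id 123 does not match the pattern and would stay among the ordinary comments until it is converted.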

@amir-zeldes
Contributor

Agreed, that all makes sense. I think insisting on the = sign is a good idea, even if some of the existing resources with # sent_id 123 will have to be updated (also easy to fix automatically).

So that means that space-containing values are generally allowed, and the value is always stripped? Is there any situation where we might want a leading/trailing space kept?

My original thinking was for sentence text, something like a space past the final period, if we want to preserve white space behavior from a resource that we have other, non-dependency annotations for. On the other hand, the tokenization itself is non-whitespace-preserving, so that's probably a moot point.

@fginter
Member

fginter commented Jul 1, 2016

👍 from me on this one. I would myself prefer tree_id over sentence and would like to keep a notion of document_id as well. This is quite important for search tools which need to return the context of a hit (plus or minus a handful of sentences), i.e. they need to know the document boundaries. url is another attribute which could be listed. I think ^# ([a-z_]+) *= *(.*)$ would be a good common ground.

@martinpopel
Member

OK, I change my suggestion to # attribute = value (i.e. ^# ([a-z_]+) *= *(.*)$) as @amir-zeldes and @fginter support this.

I would myself prefer tree_id over sentence

Note that the sentence is meant to store the plain text string of the sentence, so it is not a substitute for sent_id/tree_id. It should be possible to reconstruct (up to redundant whitespace) the sentence string from the forms and SpaceAfter=No (if multi-word tokens are used properly), but there are still use cases for this attribute.
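
A rough sketch of that reconstruction, assuming tab-separated 10-column CoNLL-U lines whose IDs are either integers or multi-word token ranges such as 1-2, and assuming SpaceAfter=No is stored in the last (MISC) column. The function name reconstruct_text is hypothetical.

```python
def reconstruct_text(conllu_lines):
    """Rebuild the sentence string from CoNLL-U token lines."""
    pieces = []
    skip_until = 0  # last word ID covered by the current multi-word token
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in tok_id:
            # Multi-word token line: its form is the surface string.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue  # syntactic word already covered by the range line
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    # Whitespace after the last token is dropped.
    return "".join(pieces).rstrip()
```

Any multiple spaces in the original text are lost this way, which is one reason a separate plain-text attribute can still be useful.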

Or did you mean "I would myself prefer tree_id over sent_id"?
I also slightly prefer the tree_id name because sometimes I have several trees (alternative annotations) for the same sentence: see #321. However, sent_id was already mentioned in the docs, so I wanted to stay backward compatible. Now, if by adding the equals sign we lose compatibility with UD v1.3 anyway, I am voting for tree_id.

and would like to keep a notion of document_id as well.

Good idea. My original idea was to keep doc_id (optionally) encoded within sent_id. I also plan to use sent_id#node_id for storing coreference and word alignment. This results in full node IDs such as f000001-s2/en#12, which are a bit too long (this is a real example taken from CzEng). If we standardize doc_id (and assume that coreference across document boundaries is forbidden), I can use just s2/en#12 when referring to that node. So I vote for this as well. Perhaps we should discuss it with others within #321.

@foxik
Member

foxik commented Aug 26, 2016

We are discussing how the original sentence can be encoded in sentence-level comments in #332. The latest suggestion from this issue, ^# ([a-z_]+) *= *(.*)$, cannot represent spaces at the beginning of the sentence. Therefore I suggested ^# ([a-z_0-9]+)=(.*)$ with C-like escapes of the value (only \n, \r and \\). I also suggested we use text instead of sentence for the attribute storing the text of the sentence (because sentence suggests also other meanings like sentence id, while text does not).
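
A minimal sketch of this suggestion, assuming exactly the three escapes listed (\n, \r and \\) and no others. The helper names are hypothetical; only the regular expression comes from the proposal.

```python
import re

# Pattern from the proposal: no spaces around '=', so leading spaces in the value survive.
ATTR_RE = re.compile(r"^# ([a-z_0-9]+)=(.*)$")
_UNESCAPES = {"\\n": "\n", "\\r": "\r", "\\\\": "\\"}

def escape_value(value):
    # Escape backslashes first so the ones added for \n and \r are not doubled.
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r")

def unescape_value(value):
    # One left-to-right pass, so an escaped backslash followed by 'n' stays backslash + 'n'.
    return re.sub(r"\\[nr\\]", lambda m: _UNESCAPES[m.group(0)], value)

def format_attr(key, value):
    return "# {}={}".format(key, escape_value(value))

def parse_attr(line):
    m = ATTR_RE.match(line)
    return (m.group(1), unescape_value(m.group(2))) if m else None
```

Round-tripping a value through format_attr and parse_attr returns it unchanged, including leading spaces and embedded newlines, provided the file itself keeps trailing whitespace intact.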

The discussion will probably continue in #332, but if consensus is reached, I will post it here as well.

@dan-zeman
Member Author

I would prefer to keep the spaces around = for readability. Also, I don't like trailing whitespace to be semantically significant - occasionally CoNLL-U files are edited manually, and then you have to double-check whether your editor preserves trailing whitespace (e.g. I have set my editor to remove it). But since I also opposed using quotation marks around attribute values, I guess I am for escaping any whitespace before the first / after the last token. Unfortunately, that means that almost every sentence ends with \s.

I am fine with naming it text.

@fginter
Member

fginter commented Aug 26, 2016

I personally do not see the need to represent leading and trailing whitespace on sentences (with or without escaping). We should make the format user-friendly and dealing with escaped and double-escaped characters is such a hassle.

This of course all boils down to the question of whether CoNLL-U is meant to faithfully represent a text corpus with its annotation. I personally do not see that as the task of this format. Anyway, that's just me - feel free to disagree. :)

@martinpopel
Member

OK, we can name the attribute text instead of sentence (we'll have to change also the API in Python, Perl and Java, but that's still possible).

My original idea was to ignore spaces before/after sentences and multiple spaces between words in CoNLL-U. After all, CoNLL-U and UD are all about simplification.
Even if we decide to encode such spaces (e.g. in order to train segmenters on noisy data where spaces between sentences may be missing), I agree with @dan-zeman that

  • CoNLL-U should not allow trailing whitespace
  • It is not user-friendly if almost all sentences (i.e. text) end with \s

@foxik
Member

foxik commented Aug 26, 2016

Oh -- if you are already using sentence in UDApi, I think it is not worth the rename. It was just a minor thing. So back to sentence?

@martinpopel
Member

Udapi is not in a stable version yet, so it is no problem to change sentence to text if there is an agreement that it is a better name.

@spyysalo
Member

spyysalo commented Sep 8, 2016

👍 for naming this text and avoiding delimiters (such as double-quotes), 👎 for having most sentences end with \s. I agree with @fginter and @martinpopel to ignore spaces between sentences.

@spyysalo
Member

spyysalo commented Dec 1, 2016

Resolved in v2 as # sent_id = (mandatory) and # text = (optional) (http://universaldependencies.org/format.html)
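
For illustration only, a small check for the mandatory attribute might look like the sketch below. It assumes sentences are separated by blank lines and reuses the pattern discussed earlier in this thread; the function name missing_sent_ids is hypothetical and this is not the official validator.

```python
import re

# Assumption: same key-value pattern as discussed earlier in this thread.
ATTR_RE = re.compile(r"^# ([a-z_]+) *= *(.*)$")

def missing_sent_ids(conllu_text):
    """Return 0-based indices of sentence blocks lacking a '# sent_id = ...' comment."""
    missing = []
    # Sentences are assumed to be separated by a blank line.
    blocks = [b for b in conllu_text.split("\n\n") if b.strip()]
    for i, block in enumerate(blocks):
        keys = set()
        for line in block.splitlines():
            m = ATTR_RE.match(line)
            if m:
                keys.add(m.group(1))
        if "sent_id" not in keys:
            missing.append(i)
    return missing
```

A sentence with only a # text = line would be reported, since text is optional but sent_id is not.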
