Standardize sentence-level comments that are used in multiple treebanks #273
Comments
This seems to me a very good idea. I like the idea of making comments key-value pairs and attempting to standardize the keys across corpora. (Maybe "sentence-text" could be shorter though, like just "text"?) This seems like a good issue to put someone else in charge of! I'm nominating Ryan. :)
I support this, but I think we should standardize sentence-ids as well, and perhaps make them globally unique by having a treebank-specific prefix as in the French example. This can be useful if people want to add additional (standoff) annotation to UD treebanks, and also for alignment information in parallel treebanks.
Thanks for reviving this issue. What should be decided here:
I have several related proposals which I plan to discuss in separate issues soon, especially the format of the id values (#321).
Thanks for suggesting these, I think it makes a lot of sense. I would be against Between these options I'd vote for a=b, because it's least likely to occur by chance (colons are more prose like). In fact, I would almost prefer to have a text delimiter for the value, such as:
This could be useful just to make it very clear if we have white space, or multiple attributes on one line, or who knows what. I'd also like to mention another specific attribute that could be useful, which is the speaker, if known. It could look like this: (apologies for the non-UD example)
This is helpful for anaphora resolution using dependency trees as input.
Text delimiters tend to complicate parsing of the line, and so do multiple attributes on one line. I prefer to keep it simple: the key, then perhaps the equals sign, and the rest of the line is the value, with leading and trailing whitespace removed. We only need to standardize those key-value pairs that we foresee to be of general interest and used in multiple treebanks (or, in cases like sent_id, obligatorily in all treebanks). These may receive special treatment by UD tools, and also some attention from the format validator. The rest may or may not look like key-value pairs, but from the UD point of view they do not differ from ordinary prose comments.
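To make the "key, equals sign, rest of the line" rule concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not the validator's actual behavior: the function name parse_comment, the regular expression, and the sample values are hypothetical, and it assumes that keys contain neither whitespace nor "=".

```python
import re
from typing import Optional, Tuple

# Assumed shape of a key-value comment: '#', optional spaces, a key with
# no whitespace and no '=', optional spaces, '=', then the value.
_KV_COMMENT = re.compile(r"^#\s*([^\s=]+)\s*=(.*)$")


def parse_comment(line: str) -> Optional[Tuple[str, str]]:
    """Return (key, value) for a key-value comment, or None for prose.

    The value is everything after the first '=' with leading and
    trailing whitespace stripped, as proposed in the comment above.
    """
    match = _KV_COMMENT.match(line.rstrip("\n"))
    if match is None:
        return None  # ordinary prose comment, not interpreted further
    key, value = match.groups()
    return key, value.strip()


if __name__ == "__main__":
    # Sample lines are invented for illustration only.
    for sample in ("# sent_id = example-0001",
                   "# text =   Hello world .  ",
                   "# just a prose comment"):
        print(repr(sample), "->", parse_comment(sample))
```

With this rule a value may contain spaces and further "=" characters, since only the first "=" after the key acts as the separator; anything that does not match is simply left as a prose comment.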
@amir-zeldes thanks for your suggestions.
Agreed, that all makes sense. I think insisting on the = sign is a good idea, even if some of the existing resources with So that means that space-containing values are generally allowed, and the value is always stripped? Is there any situation where we might want a leading/trailing space kept? My original thinking was for sentence text, something like a space past the final period, if we want to preserve white space behavior from a resource that we have other, non-dependency annotations for. On the other hand, the tokenization itself is non-whitespace-preserving, so that's probably a moot point.
👍 from me on this one. I would myself prefer
OK, I change my suggestion to
Note that the Or did you mean "I would myself prefer
Good idea. My original idea was to keep doc_id (optionally) encoded within
We are discussing how the original sentence can be encoded in sentence-level comments in #332. The latest suggestion from this issue, The discussion will probably continue in #332, but if consensus is reached, I will post it here as well.
I would prefer to keep the spaces around I am fine with naming it
I personally do not see the need to represent leading and trailing whitespace on sentences (with or without escaping). We should make the format user-friendly, and dealing with escaped and double-escaped characters is such a hassle. This of course all boils down to the question of whether CoNLL-U is meant to faithfully represent a text corpus with its annotation. I personally do not see that as the task of this format. Anyway, that's just me - feel free to disagree. :)
OK, we can name the attribute My original idea was to ignore spaces before/after sentences and multiple spaces between words in CoNLL-U. After all, CoNLL-U and UD are all about simplification.
Oh -- if you are already using
Udapi is not in a stable version yet, so it is no problem to change
👍 for naming this
Resolved in v2 as
Originally reported by @jeanm in #272 (comment)
I noticed UD_French has a comment above each annotated sentence with the unannotated text. For example:
Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional
# sentence-text: <unannotated text>
line before each annotated sentence.