cg-proc -w mangles case #24

TinoDidriksen · 2018-12-21T10:53:57Z

See https://groups.google.com/forum/#!topic/constraint-grammar/pLpCsu-eUY4

I think this goes to @unhammer

unhammer · 2018-12-21T20:50:45Z

I think this one will be hard to deal with correctly without abandoning the "lemma-casing→form-casing" mechanism (where casing is {lower,title,upper}).

That is, in Apertium we currently completely drop source forms while encoding the casing of the source form on the source lemma, so Je/prpers<prn> becomes Prpers<prn>. But casing is ambiguous with single-letter forms (is U title or upper?), so U/prpers becomes PRPERS when it might as well have been Prpers. Translating U into you then might turn it into YOU instead of You.
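To make the ambiguity concrete, here is a minimal Python sketch of the "lemma-casing→form-casing" idea as described above. It is not the actual cg-proc or Apertium code; the function names `casing_of`, `apply_casing` and `encode` are hypothetical and only illustrate why a single-letter form ends up classified as upper.

```python
def casing_of(form: str) -> str:
    """Naively classify a surface form as 'lower', 'title' or 'upper'."""
    letters = [c for c in form if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        # A single uppercase letter lands here too: that is the ambiguity.
        return "upper"
    if letters and letters[0].isupper():
        return "title"
    return "lower"


def apply_casing(lemma: str, casing: str) -> str:
    """Transfer the casing class of the form onto the lemma."""
    if casing == "upper":
        return lemma.upper()
    if casing == "title":
        return lemma[:1].upper() + lemma[1:]
    return lemma


def encode(form: str, lemma: str, tags: str) -> str:
    """Drop the form and keep only its casing, encoded on the lemma."""
    return apply_casing(lemma, casing_of(form)) + tags


print(encode("Je", "prpers", "<prn>"))  # Prpers<prn>
print(encode("U", "prpers", "<prn>"))   # PRPERS<prn>
```

With this naive rule, "U" can only ever be read as upper, so the information that it might have been a title-cased sentence start is lost before transfer.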

Some ideas:

- Keep source form throughout the pipeline, letting e.g. transfer rules decide what to do. This would be nice for other reasons as well, but would require changes to many Apertium modules, and you'd still have to deal with casing-ambiguity at some point.
- ADD (@casinghint) in CG where you have access to both the form and lemma casing (by regexp matching) and part of speech, and transfer upper-/lowercase based on the CG tag you added. I think this would be the easiest solution for now, and it lets you do things like disambiguate U being upper vs title based on whether the following word is allcaps or not.
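The decision behind the second idea could look roughly like the following Python sketch. It only models the logic (single-letter forms take their casing from the following word); in practice this would be written as CG rules adding a tag such as @casinghint, and the names `is_allcaps` and `casing_hint` are assumptions made for illustration.

```python
from typing import Optional


def is_allcaps(word: str) -> bool:
    """True if the word has at least two letters and all of them are uppercase."""
    letters = [c for c in word if c.isalpha()]
    return len(letters) >= 2 and all(c.isupper() for c in letters)


def casing_hint(form: str, next_form: Optional[str]) -> str:
    """Return 'lower', 'title' or 'upper' for a form, using the next word as context."""
    letters = [c for c in form if c.isalpha()]
    if not letters or not letters[0].isupper():
        return "lower"
    if len(letters) == 1:
        # Ambiguous single letter: treat it as upper only inside an allcaps run.
        return "upper" if next_form and is_allcaps(next_form) else "title"
    return "upper" if all(c.isupper() for c in letters) else "title"


print(casing_hint("U", "HEBT"))  # upper
print(casing_hint("U", "hebt"))  # title
```

The hint would then be consumed in transfer, which is where the final upper-/lowercase decision is made.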

unhammer · 2018-12-21T20:51:27Z

@MarcRiera ^

MarcRiera · 2018-12-23T18:13:39Z

Thanks for the ideas! After experimenting a bit with CG, I have managed to overcome the limitation for both languages, with two solutions that are pair-independent (everything is corrected in the language's CG after-section):

- For Romanian "A" and "O", a new reading with the correct case is added if the word appears at BOS, and the original incorrect reading is removed. For non-BOS occurrences, the uppercase reading is kept.
- For English "I", the previous solution does not seem to work for non-BOS cases, because even if a lowercase reading can be inserted, the -w flag changes it to uppercase again. My first workaround was to change the surface form to "prpers" by inserting a new cohort and removing the old one, but this could negatively affect the tagger (which also has access to surface forms). So the solution for BOS occurrences has been, as @unhammer mentioned, to add an extra tag that is recognised during transfer in the English-Catalan pair.

The wordform «Håndball-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now titlecases first subreading instead of uppercasing:
/Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

The wordform «HÅNDBALL-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now uppercases second subreading too:
/HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$

Unfortunately doesn't do anything for GrammarSoft#24

TODO: Could avoid some copying of UnicodeString's

The wordform «Håndball-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now titlecases first subreading instead of uppercasing:
/Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

The wordform «HÅNDBALL-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now uppercases second subreading too:
/HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$

We require at least two alphabetic uppercase characters before calling it uppercase (and no lowercase). This doesn't solve #24 but at least realises ^I/prpers<prn>$ as ^I/Prpers<prn>$ instead of ^I/PRPERS<prn>$, which is less shouty in the cases where people haven't yet made the CG rules for BOS vs non-BOS.

Code now also uses pointer offsets to avoid unnecessary copying.

git-svn-id: svn+ssh://beta.visl.sdu.dk/usr/local/svn/repos/visl/tools/vislcg3/trunk@13496 cb2587b3-b7ff-0310-8c81-a2a651690ada
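The rule stated in the commit message (uppercase only with at least two alphabetic uppercase characters and no lowercase ones; uppercase applies to every subreading, titlecase only to the first) can be sketched in Python as follows. This is an illustration of the described behaviour, not the C++ code in cg-proc, and the names `is_upper_form`, `casing_of` and `recase_subreadings` are hypothetical.

```python
from typing import List


def is_upper_form(form: str) -> bool:
    """Upper means: two or more uppercase letters and no lowercase letters."""
    upper = sum(1 for c in form if c.isalpha() and c.isupper())
    lower = sum(1 for c in form if c.isalpha() and c.islower())
    return upper >= 2 and lower == 0


def casing_of(form: str) -> str:
    """Classify a form as 'upper', 'title' or 'lower' under the revised rule."""
    if is_upper_form(form):
        return "upper"
    first = next((c for c in form if c.isalpha()), "")
    return "title" if first.isupper() else "lower"


def recase_subreadings(form: str, lemmas: List[str]) -> List[str]:
    """Apply the form's casing to the lemmas of its subreadings."""
    casing = casing_of(form)
    if casing == "upper":
        return [lemma.upper() for lemma in lemmas]
    if casing == "title" and lemmas:
        return [lemmas[0][:1].upper() + lemmas[0][1:]] + list(lemmas[1:])
    return list(lemmas)


print(recase_subreadings("Håndball-VM", ["håndball", "vm"]))  # ['Håndball', 'vm']
print(recase_subreadings("HÅNDBALL-VM", ["håndball", "vm"]))  # ['HÅNDBALL', 'VM']
print(recase_subreadings("I", ["prpers"]))                    # ['Prpers']
```

Under this rule a single-letter form such as "I" is always treated as title, which matches the "less shouty" default the commit message describes; telling BOS from non-BOS still needs CG rules.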

TinoDidriksen added the bug label Dec 21, 2018

TinoDidriksen removed the bug label Dec 23, 2018

unhammer mentioned this issue Apr 27, 2019

cg-proc -w: Only uppercase if whole form upper, uppercase all baseforms #33

Closed
