cg-proc -w mangles case #24

Open
TinoDidriksen opened this issue Dec 21, 2018 · 3 comments

@TinoDidriksen
Member

See https://groups.google.com/forum/#!topic/constraint-grammar/pLpCsu-eUY4

I think this goes to @unhammer

@unhammer
Collaborator

I think this one will be hard to deal with correctly without abandoning the "lemma-casing→form-casing" mechanism (where casing is {lower,title,upper}).

That is, in Apertium we currently completely drop source forms while encoding the casing of the source form on the source lemma, so Je/prpers<prn> becomes Prpers<prn>. But casing is ambiguous with single-letter forms (is U title or upper?), so U/prpers becomes PRPERS when it might as well have been Prpers. Translating U into you then might turn it into YOU instead of You.
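To make the ambiguity concrete, here is a minimal Python sketch of a "lemma-casing→form-casing" round trip. It is not the actual Apertium or cg-proc code; `naive_casing` and `apply_casing` are hypothetical helpers that only mimic the three-way {lower,title,upper} classification described above.

```python
def naive_casing(form: str) -> str:
    """Classify a surface form as 'upper', 'title' or 'lower' (hypothetical helper)."""
    if form.isupper():
        return "upper"
    if form[:1].isupper():
        return "title"
    return "lower"


def apply_casing(lemma: str, casing: str) -> str:
    """Re-encode the detected casing on a lemma (hypothetical helper)."""
    if casing == "upper":
        return lemma.upper()
    if casing == "title":
        return lemma[:1].upper() + lemma[1:]
    return lemma


# A single-letter form such as "U" satisfies both the 'upper' and the 'title'
# test, so the naive classifier picks 'upper' and the lemma comes out shouting.
for form in ["Je", "U", "je"]:
    print(form, "->", apply_casing("prpers", naive_casing(form)))
```

With this classifier "Je" yields `Prpers` while "U" yields `PRPERS`, which is the behaviour the issue is about.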

Some ideas:

  1. Keep source form throughout the pipeline, letting e.g. transfer rules decide what to do. This would be nice for other reasons as well, but would require changes to many Apertium modules, and you'd still have to deal with casing-ambiguity at some point.
  2. ADD (@casinghint) in CG, where you have access to both the form and lemma casing (by regexp matching) and the part of speech, and transfer upper-/lowercase based on the CG tag you added. I think this would be the easiest solution for now, and it lets you do things like disambiguate U being upper vs title based on whether the following word is all caps.
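The second idea could look roughly like the following, sketched in Python rather than CG so the logic is easy to follow. `casing_hint` is a hypothetical function, and the rule "a single capital letter counts as upper only if the next word is all caps" is just one possible heuristic, not something any CG grammar is known to implement.

```python
from typing import Optional


def is_allcaps(word: str) -> bool:
    """True if the word has more than one character and no lowercase letters."""
    return len(word) > 1 and word.isupper()


def casing_hint(form: str, next_form: Optional[str]) -> str:
    """Return 'upper', 'title' or 'lower' for a form, using the next word as context."""
    letters = [c for c in form if c.isalpha()]
    if len(letters) == 1 and letters[0].isupper():
        # Ambiguous single capital: look at the following word to decide.
        if next_form is not None and is_allcaps(next_form):
            return "upper"
        return "title"
    if is_allcaps(form):
        return "upper"
    if form[:1].isupper():
        return "title"
    return "lower"


print(casing_hint("U", "BENT"))  # single capital inside an all-caps context
print(casing_hint("U", "bent"))  # single capital followed by a lowercase word
```

In a real pipeline the returned value would be attached as a tag on the reading and interpreted by the transfer rules, instead of being encoded on the lemma.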

@unhammer
Collaborator

@MarcRiera ^

@MarcRiera

Thanks for the ideas! After experimenting a bit with CG, I have managed to overcome the limitation for both languages, with two solutions that are pair-independent (everything is corrected in the language's CG after-section):

  • For Romanian "A" and "O", a new reading with the correct case is added if the word appears at BOS, and the original incorrect reading is removed. For non-BOS occurrences, the uppercase reading is kept.
  • For English "I", the previous solution does not seem to work for non-BOS cases, because even if a lowercase reading can be inserted, the -w flag changes it to uppercase again. My first workaround was to change the surface form to "prpers" by inserting a new cohort and removing the old one, but this could negatively affect the tagger (which also has access to surface forms). So the solution for BOS occurrences has been, as @unhammer mentioned, to add an extra tag that is recognised during transfer in the English-Catalan pair.

@TinoDidriksen TinoDidriksen removed the bug label Dec 23, 2018
unhammer added a commit to unhammer/cg3 that referenced this issue Apr 26, 2019
The wordform «Håndball-VM» used to give

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

now titlecases first subreading instead of uppercasing:

 /Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

The wordform «HÅNDBALL-VM» used to give

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

now uppercases second subreading too:

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$

Unfortunately doesn't do anything for
GrammarSoft#24

TODO: Could avoid some copying of UnicodeString's
TinoDidriksen pushed a commit that referenced this issue Apr 30, 2019
The wordform «Håndball-VM» used to give

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

now titlecases first subreading instead of uppercasing:

 /Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$


The wordform «HÅNDBALL-VM» used to give

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

now uppercases second subreading too:

 /HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$


We require at least two alphabetic uppercase characters before calling
it uppercase (and no lowercase).

This doesn't solve #24 but
at least realises ^I/prpers<prn>$ as ^I/Prpers<prn>$ instead of
^I/PRPERS<prn>$, which is less shouty in the cases where people
haven't yet made the CG rules for BOS vs non-BOS.

Code now also uses pointer offsets to avoid unnecessary copying.



git-svn-id: svn+ssh://beta.visl.sdu.dk/usr/local/svn/repos/visl/tools/vislcg3/trunk@13496 cb2587b3-b7ff-0310-8c81-a2a651690ada
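The rule from this commit message (at least two alphabetic uppercase characters and no lowercase before a form counts as uppercase) can be sketched in Python as follows. This is only an illustration of the described behaviour, not the C++ code in cg-proc; `classify` and `realise` are hypothetical names, and the lemmas are passed as a plain list of subreadings.

```python
def classify(form: str) -> str:
    """Return 'upper', 'title' or 'lower' using the two-uppercase-letters rule."""
    letters = [c for c in form if c.isalpha()]
    n_upper = sum(1 for c in letters if c.isupper())
    n_lower = sum(1 for c in letters if c.islower())
    if n_upper >= 2 and n_lower == 0:
        return "upper"
    if form[:1].isupper():
        return "title"
    return "lower"


def realise(lemmas: list, casing: str) -> list:
    """Apply the casing to a list of subreading lemmas (hypothetical helper)."""
    if casing == "upper":
        # Uppercase every subreading.
        return [lemma.upper() for lemma in lemmas]
    if casing == "title":
        # Titlecase only the first subreading, leave the rest untouched.
        first = lemmas[0][:1].upper() + lemmas[0][1:]
        return [first] + list(lemmas[1:])
    return list(lemmas)


for form in ["Håndball-VM", "HÅNDBALL-VM"]:
    print(form, "->", "+".join(realise(["håndball", "vm"], classify(form))))
print("I ->", realise(["prpers"], classify("I"))[0])
```

Under these assumptions a lone "I" is treated as titlecase, so the lemma becomes `Prpers` and not `PRPERS`.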
TinoDidriksen pushed a commit that referenced this issue Aug 11, 2021